OpenAI o1 Likely Uses RL over Chains of Thought to Build System 2 LLMs


Recently, OpenAI released two models – OpenAI o1-preview and OpenAI o1-mini – marking a significant leap in the AI world. These models can reason through problems using chains of thought and reasoning tokens. 

Jim Fan, in a recent post on X, mentioned that o1 models mark a significant shift towards inference-time scaling in AI, emphasising the importance of search and reasoning over mere knowledge accumulation. This approach suggests that effective reasoning can be achieved with smaller models. 

By implementing techniques like Monte Carlo tree search during inference, the model can explore multiple strategies and scenarios to converge on optimal solutions. The key advantage of using MCTS during inference is that it allows the model to consider many different approaches to a problem, rather than committing to a single strategy early on.
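To ground that idea, here is a minimal MCTS sketch in Python. The toy "state" is just a list of numbers summing toward a target, standing in for model-generated reasoning steps; nothing here reflects o1's actual internals, which OpenAI has not disclosed.

```python
# Minimal MCTS sketch. Each "action" stands in for a candidate reasoning
# step; a real system would have the model propose continuations instead.
import math
import random

ACTIONS = [1, 2, 3]       # stand-ins for candidate reasoning steps
TARGET, MAX_DEPTH = 7, 4  # toy goal: pick steps whose sum hits TARGET

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = {}, 0, 0.0

def reward(state):
    return 1.0 if sum(state) == TARGET else 0.0

def ucb(child, parent_visits, c=1.4):
    # Upper Confidence Bound: balances exploiting good branches
    # against exploring rarely visited ones.
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(
        math.log(parent_visits) / child.visits)

def mcts(root, iterations=2000):
    for _ in range(iterations):
        node = root
        # 1. Selection: descend via UCB while nodes are fully expanded.
        while len(node.children) == len(ACTIONS) and len(node.state) < MAX_DEPTH:
            node = max(node.children.values(), key=lambda ch: ucb(ch, node.visits))
        # 2. Expansion: try one untried action, if any remain.
        if len(node.state) < MAX_DEPTH:
            untried = [a for a in ACTIONS if a not in node.children]
            if untried:
                a = random.choice(untried)
                node.children[a] = Node(node.state + [a], parent=node)
                node = node.children[a]
        # 3. Rollout: finish the sequence with random steps.
        state = list(node.state)
        while len(state) < MAX_DEPTH:
            state.append(random.choice(ACTIONS))
        r = reward(state)
        # 4. Backpropagation: credit every node on the path.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    # Commit to the first step that was explored most, i.e. the one the
    # search converged on rather than one chosen early and greedily.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print("most-visited first step:", mcts(Node([])))
```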

Subbarao Kambhampati, professor at Arizona State University, said that OpenAI’s o1 model uses reinforcement learning over auto-generated chain of thought—similar to AlphaGo’s self-play approach—to optimise problem-solving by building a generalised System 2 component atop LLM substrates, albeit without guarantees. 

“One interesting issue with o1 is that it seems to be significantly less steerable compared to LLMs. For example, it often completely ignores any output formatting instructions, making it hard to automatically check its solutions,” he added, noting that once you are an approximate reasoner, you might develop the ‘don’t tell me how to solve the problem; I already have a way I use to solve it’ complex.  

In his 2011 book Thinking, Fast and Slow, Daniel Kahneman popularised the term ‘System 2 thinking’, which refers to complex problem-solving, logical reasoning, and careful decision-making, often involving step-by-step analysis and focused attention. That sounds very similar to what OpenAI has promised with its latest o1 model, which “thinks”.

OpenAI’s o1 could arguably be considered the first successful commercial launch of a System 2 LLM, and the most important reason for that is reasoning tokens. These tokens guide the model to perform step-by-step reasoning: they are generated in response to the user’s prompt and consumed as the model works through a problem. In write-ups, reasoning tokens are often notated with single or double angle brackets for illustrative purposes.

For convenience, OpenAI uses ordinary English words and phrases as reasoning tokens, such as “Interesting”, “First”, “Let’s test this theory”, “Wait”, “That seems unlikely”, “Alternatively”, “So this works” and “Perfect”.
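To make the notation above concrete, here is a minimal sketch in Python of what such an annotated trace might look like, and how the markers could be stripped to recover the final answer. The trace format is entirely made up for illustration; OpenAI does not expose o1's raw reasoning, so the real serialisation is unknown.

```python
# Illustrative only: a made-up trace using the angle-bracket notation
# described above, plus a regex that strips the reasoning markers.
import re

trace = (
    "<<First>> count the letters in 'strawberry'. "
    "<<Wait>> the 'r' appears once in 'straw' and twice in 'berry'. "
    "<<So this works>> Final answer: 3"
)

# Drop every <<...>> reasoning marker, keep the surrounding text.
answer_text = re.sub(r"<<[^>]*>>\s*", "", trace)
print(answer_text)
# count the letters in 'strawberry'. the 'r' appears once in 'straw'
# and twice in 'berry'. Final answer: 3
```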

With the use of reasoning tokens, o1 models demonstrated significantly better performance on complex tasks compared to previous models. For example, o1 solved 83% of problems in a qualifying exam for the International Mathematics Olympiad, compared to GPT-4o’s 13%.

It Gets the ‘R’s in Strawberry Right

When AIM first tried o1 in ChatGPT, our debut question was, “How many ‘R’s does ‘Strawberry’ have?” – and it nailed it. Later, we also asked which was bigger – 9.9 or 9.11 – and it got that right as well. This suggests OpenAI may finally be making headway on ‘jagged intelligence’, where models ace hard problems yet stumble on trivially easy ones.

o1 correctly counts the ‘R’s in ‘strawberry’

Anatoly Geyfman, the co-founder and CEO of Carevoyance, explained that reasoning tokens account for the time a model spends “thinking”: they pay for the additional passes the model makes over its own output to refine an answer, or whatever the actual mechanism of action is.

“This is important – there is now a way for model builders to monetise the more sophisticated actions of a model beyond ‘input’ and ‘output’ tokens. The reasoning tokens let OpenAI and, I bet, others in the near future release models that aren’t so much better trained, but instead, are better at thinking through responses,” he added. 
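To illustrate the billing point, here is a rough sketch in Python. The usage payload shape mirrors what the Chat Completions API reported around o1's launch, where reasoning tokens are hidden but counted (and billed) as output tokens; treat the field names and the per-token prices as assumptions rather than current facts.

```python
# Rough sketch of how hidden reasoning tokens show up on the bill.
# Field names follow the usage object reported at o1's launch
# (completion_tokens_details.reasoning_tokens); prices are illustrative.

def o1_request_cost(usage: dict, in_price: float, out_price: float) -> float:
    """Cost in dollars; reasoning tokens are billed at the output rate."""
    prompt = usage["prompt_tokens"]
    completion = usage["completion_tokens"]  # already includes reasoning tokens
    return prompt * in_price + completion * out_price

usage = {
    "prompt_tokens": 50,
    "completion_tokens": 1_200,  # e.g. 1,000 hidden reasoning + 200 visible
    "completion_tokens_details": {"reasoning_tokens": 1_000},
}
# Hypothetical per-token prices ($15 and $60 per million tokens).
print(f"${o1_request_cost(usage, 15e-6, 60e-6):.4f}")  # $0.0728
```

Most of the cost here comes from tokens the user never sees, which is exactly the monetisation shift Geyfman describes.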

A similar approach was mentioned in a paper titled ‘Guiding Language Model Reasoning with Planning Tokens’, published in July. It proposed adding specialised planning tokens at the beginning of each chain-of-thought step to guide and improve language models’ maths reasoning ability.

Source: Guiding Language Model Reasoning with Planning Tokens
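A heavily simplified sketch of that idea follows: prefix each chain-of-thought step with a planning token before fine-tuning, so the model learns to emit a high-level plan ahead of the step itself. The paper actually infers its planning tokens from the data (for instance, by clustering step representations); the fixed <plan_k> tags below are a stand-in for that machinery.

```python
# Simplified planning-token augmentation: tag each chain-of-thought step
# with a special token. Real planning tokens are learned, not fixed tags.

def add_planning_tokens(cot_steps: list[str]) -> str:
    tagged = [f"<plan_{i}> {step}" for i, step in enumerate(cot_steps)]
    return "\n".join(tagged)

steps = [
    "2 + 2 = 4",
    "4 * 3 = 12",
    "The answer is 12.",
]
print(add_planning_tokens(steps))
# <plan_0> 2 + 2 = 4
# <plan_1> 4 * 3 = 12
# <plan_2> The answer is 12.
```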

Saurabh Sarkar, the CEO of Phenx Machine Learning Technologies, mentioned that when you try to solve a question like “What is 2 + 2, then multiply the result by 3?” with a traditional approach, it will first calculate 2 + 2, get the result 4, and then multiply 4 by 3 to get 12.

Using reasoning tokens, the model anticipates the need to multiply the intermediate result (4) by 3 while still calculating 2 + 2. It “pre-computes” and stores this information, so when it reaches the multiplication step, it already has the necessary data, allowing for faster and more efficient processing. This, he said, is how reasoning tokens allow for more thorough and accurate responses to challenging queries. 
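The toy trace below, a sketch in Python, shows that stepwise structure: the intermediate result is computed once, stored, and reused. Whether o1 really ‘pre-computes’ later steps in this way is Sarkar’s interpretation rather than a documented mechanism.

```python
# Toy trace of the stepwise computation Sarkar describes: the intermediate
# result is carried forward explicitly instead of jumping to an answer.

def solve_stepwise() -> int:
    intermediate = 2 + 2       # step 1: resolve the inner expression
    print(f"step 1: 2 + 2 = {intermediate}")
    result = intermediate * 3  # step 2: reuse the stored intermediate
    print(f"step 2: {intermediate} * 3 = {result}")
    return result

assert solve_stepwise() == 12
```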

Still, There’s a Long Way to Go

Theo Browne, a popular YouTuber and the founder and CEO of Ping Labs, recently posted a video on the reasoning capabilities of o1 models. In response to the popular saying “We have a PhD in our pocket”, Browne called it a “PhD that can’t do basic maths”, as the o1 models were not able to find all the possible corners of a parallelogram. 

https://x.com/allgarbled/status/1834344480797057307

A Reddit user mentioned that OpenAI is advertising this as some kind of mega-assistant for research scientists and quantum physicists. 

“I gave it a fairly simple twin paradox time dilation problem, and it failed just as miserably as all the previous versions. It seems like it still has no understanding, just probabilistic word guessing,” he added, suggesting that even with reasoning tokens and the extra time spent generating an answer, the model does not give satisfactory results. 
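For context, the calculation behind such a problem is short. Under special relativity the travelling twin ages by t′ = t / γ, where γ = 1 / √(1 − v²/c²); the sketch below uses illustrative numbers, not the user’s actual problem.

```python
# Basic twin-paradox time dilation: how much the travelling twin ages
# while t Earth years pass, for a given speed as a fraction of c.
import math

def traveller_time(earth_years: float, v_fraction_of_c: float) -> float:
    gamma = 1.0 / math.sqrt(1.0 - v_fraction_of_c ** 2)  # Lorentz factor
    return earth_years / gamma

# At 0.8c, 10 Earth years pass while the traveller ages only 6.
print(round(traveller_time(10, 0.8), 2))  # 6.0
```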

Another user mentioned that o1 was, in fact, performing worse than GPT-4o. The responses from o1 were wordy, generic and ‘safe’, he said, and he had to coax it several times to get the same response that GPT-4o provided on the first try.

Beyond doubts about the quality of the reasoning itself, OpenAI’s decision not to show reasoning tokens to API users, even though those tokens are billed, has also raised concerns.
