Apple Proves OpenAI o1 is Actually Good at Reasoning


Apple has gotten better at gaslighting the AI companies that are spending everything they have on making LLMs better at reasoning. A six-person research team at Apple recently published a paper titled "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models", which basically says that current LLMs can't reason.

“…current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data,” reads the paper, whose tests cover LLMs like OpenAI’s GPT-4o and even the much-touted “thinking and reasoning” model, o1. The research was run on a series of other models as well, such as Llama, Phi, Gemma, and Mistral.

Mehrdad Farajtabar, the senior author of the paper, posted on X explaining how the team reached its conclusion. According to him, LLMs merely follow sophisticated patterns, and even models smaller than 3 billion parameters are now hitting scores on GSM8K, the grade-school math benchmark OpenAI released three years ago, that only much larger models could reach earlier.

1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source like Llama, Phi, Gemma, and Mistral and leading closed models, including the… pic.twitter.com/yli5q3fKIT

— Mehrdad Farajtabar (@MFarajtabar) October 10, 2024

The researchers introduced GSM-Symbolic, a new benchmark for testing mathematical reasoning in LLMs, built from symbolic templates of GSM8K questions in which names and numbers can be varied. They argue that GSM8K, a single fixed set of questions, is not a reliable measure of the reasoning abilities of LLMs.

Surprisingly, on this benchmark, OpenAI’s o1 demonstrated “strong performance on various reasoning and knowledge-based benchmarks”, according to the researchers, but its accuracy dropped by 30% when the researchers introduced the GSM-NoOp experiment, which adds seemingly relevant but ultimately irrelevant information to the questions.
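The mechanics behind the two variants are simple enough to sketch. Below is a minimal, purely illustrative Python example of the idea: templated questions whose names and numbers vary, plus an optional no-op clause. The template, names, and numbers here are hypothetical; this is not the Apple team's actual code or data.

```python
import random

# Hypothetical GSM8K-style template in the spirit of GSM-Symbolic:
# names and numbers become slots that can be re-sampled for each variant.
TEMPLATE = (
    "{name} picked {n} kiwis on Friday and {m} kiwis on Saturday. "
    "How many kiwis did {name} pick in total?"
)

# GSM-NoOp-style distractor: a clause that sounds relevant but changes nothing.
NOOP = " {k} of the kiwis were a bit smaller than average."

def make_variant(with_noop: bool = False) -> tuple[str, int]:
    """Generate one question variant and its ground-truth answer."""
    name = random.choice(["Oliver", "Sophie", "Liam", "Ava"])
    n, m = random.randint(20, 60), random.randint(20, 60)
    question = TEMPLATE.format(name=name, n=n, m=m)
    if with_noop:
        question += NOOP.format(k=random.randint(2, 8))
    return question, n + m  # the no-op clause never affects the answer

q, answer = make_variant(with_noop=True)
print(q, "->", answer)
```

The paper's core argument is that if a model's accuracy shifts when only the names and numbers change, or falls when a do-nothing clause is appended, its behaviour looks more like pattern matching than reasoning.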

This proves that the “reasoning” capabilities of OpenAI’s models are definitely getting better, and maybe GPT-5 will be a lot better. Or maybe it’s just Apple’s LLMs that don’t reason well, but the team didn’t test Apple’s own models.

Also, not everyone is happy with the research paper, as it never explains what “reasoning” actually means and simply introduces a new benchmark for evaluating LLMs.

The Tech World Goes Bonkers

“Overall, we found no evidence of formal reasoning in language models…their behaviour is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%!” Farajtabar added that scaling these models would just result in ‘better pattern matchers’, not ‘better reasoners’.

Some people have claimed all along that LLMs cannot reason and are a detour on the road to AGI. Apple may have finally accepted this after trying out LLMs in its own products, which could also be one of the reasons it backed out of its investment in OpenAI.

Many researchers have been praising Apple’s paper and believe it is important that others also accept that LLMs cannot reason. Gary Marcus, a long-standing critic of LLMs, also shared several examples of LLMs failing at reasoning tasks such as calculation and chess.

On the other hand, Paras Chopra, an AI researcher, argued that Apple’s paper confuses reasoning with computation. “Reasoning is knowing an algorithm to solve a problem, not solving all of it in your head,” he said, explaining that most LLMs know the approach to solving a problem even when they arrive at the wrong answer. According to him, knowing the approach is enough to tell whether an LLM is reasoning, even if the final answer is wrong.

Discussions on Hacker News highlight that some of the questions the Apple researchers asked LLMs were trying to do a “gotcha!” on them, as they included irrelevant information that the models would not be able to actively filter out.

As one commenter put it: “Reasoning is the progressive, iterative reduction of informational entropy in a knowledge domain. OpenAI’s o1-preview does that better by introducing iteration. It’s not perfect, but it does it.”

But Is This True? Do LLMs Not Reason?

Subbarao Kambhampati, a computer science and AI professor at ASU, agreed that some of the claims about LLMs being capable of reasoning are exaggerated. However, he said that LLMs need extra tools to handle System 2 (reasoning) tasks, for which techniques like fine-tuning or chain-of-thought prompting are not adequate; a sketch of the tool-use idea follows his post below.

https://twitter.com/rao2z/status/1845607153580838979
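One common way of giving an LLM that sort of external help, shown here as a generic illustration rather than Kambhampati's own proposal, is to let the model produce the plan as an arithmetic expression and hand the actual computation to a deterministic tool. A minimal Python sketch, where the model output string is hypothetical:

```python
import ast
import operator

# Safe evaluator for an arithmetic expression proposed by an LLM, so the
# model supplies the *plan* and a deterministic tool does the computing.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval(expr: str) -> float:
    """Evaluate a plain arithmetic expression like '44 + 58 - 5'."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("only basic arithmetic is allowed")
    return walk(ast.parse(expr, mode="eval"))

# Suppose the model answers a word problem with an expression, not a number:
model_expression = "44 + 58"          # hypothetical LLM output
print(safe_eval(model_expression))    # 102, computed by the tool, not the model
```

This split mirrors Paras Chopra's point above: the model can be judged on whether it sets up the right expression, while the arithmetic itself is carried out by something that cannot slip.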

When OpenAI released o1, claiming that the model thinks and reasons, Clem Delangue, the CEO of Hugging Face, was not impressed. “Once again, an AI system is not ‘thinking’, it’s ‘processing’, ‘running predictions’,… just like Google or computers do,” said Delangue, arguing that OpenAI is painting a false picture of what its newest model can achieve.

While some agreed, others argued that it is exactly how human brains work as well. “Once again, human minds aren’t ‘thinking’ they are just executing a complex series of bio-chemical / bio-electrical computing operations at massive scale,” replied Phillip Rhodes to Delangue.

To test reasoning, some people also ask LLMs how many Rs there are in the word ‘strawberry’, which is not a meaningful test: LLMs can’t count letters directly because they process text in chunks called “tokens” rather than as individual characters. Tests of reasoning have been problematic for LLMs ever since the models were created.
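A quick way to see the token issue for yourself, assuming the tiktoken library and its cl100k_base encoding as a stand-in tokenizer (the exact split of the word will vary by tokenizer):

```python
import tiktoken  # pip install tiktoken

# A BPE tokenizer splits the word into multi-character chunks; the model
# receives these token IDs, not individual letters.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in tokens]

print(tokens)   # a short list of token IDs
print(pieces)   # multi-character chunks, e.g. something like ['str', 'awberry']
```

Since the model only ever sees those chunk IDs, asking it to count the letter R probes its memorised associations rather than its ability to inspect the string.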

Everyone seems to have strong opinions on LLMs. Some opinions are grounded in research, with experts such as Yann LeCun or Francois Chollet arguing that LLM research should be taken more seriously, while others simply follow the hype or criticise it. Some say LLMs are our ticket to AGI, while others think they are just glorified text-producing algorithms with a fancy name.

Meanwhile, Andrej Karpathy recently said that next-token prediction, the technique these LLMs, or Transformers, are built on, might be able to solve a lot of problems outside the realm where it is being used right now.

So, while it seems true to some extent that LLMs can reason, when they are actually put to the test, they often end up failing it.
