Reflection is Going Through a Phase of Self-Reflection

7 months ago 108

And the creator of Reflection 70B, Matt Shumer, is trying really hard to make things alright.

A few days ago, Matt Shumer, the founder of OthersideAI, announced that the company had made a breakthrough, which allowed them to train a mid-size model, achieving SOTA-level performance with the launch of Reflection, which outperforms GPT-4o and Claude Sonet 3.5.

But, the hype was short-lived. Upon trying out the model, many users claimed that the Reflection API was just a wrapper of Claude Sonnet 3.5 or GPT-4o. They revealed that when given the same prompt, both Claude and Reflection gave exactly the same answers.

A story about fraud in the AI research community:

On September 5th, Matt Shumer, CEO of OthersideAI, announces to the world that they've made a breakthrough, allowing them to train a mid-size model to top-tier levels of performance. This is huge. If it's real.

It isn't. pic.twitter.com/S0jWT8rDVb

— 𝞍 Shin Megami Boson 𝞍 (@shinboson) September 9, 2024

What Went Wrong?

It all started when the users tried achieving the same results as shared by the creators of Reflection AI. But the model completely missed the mark.

Artificial Analysis, known for its independent analysis of AI models and API providers, compared Reflection AI 70B to other models. It failed miserably, and the results were poor compared to Llama 3 70B.

A Reddit user claimed that the Reflection model was trained to give false answers first in its thinking phase, and then it reflects the thinking phrase.

“If you ask what 2+2 is, the default example on the Hugging Face page will say something like 2+2=3. Oh wait, I’ve made a mistake; 2+2 is actually 4. If the thinking is actually hidden, it might work, but it’s quite strange,” he added, further explaining the present flaw in the model.

Shumer claimed that the Reflection models were the best open-source model to date. They use the reflection-tuning method, designed to teach AI models to recognise and correct their own mistakes. This approach seemed poised to address one of the most persistent challenges in language models: the tendency to “hallucinate” or generate inaccurate information.

“When LLMs make mistakes, they often treat their errors as facts. If we could teach these models to think more as humans do—to reflect on their behaviour and recognise their mistakes—the models would become smarter and more reliable,” said Shumer, suggesting why reflection-tuning can help models reason better.

When the model generates an answer, it outputs its reasoning and surrounds the thought process with special tags (such as). When the model detects an error during inference, it marks it with a label and corrects itself. This feature enhances the reliability of the model, especially when dealing with complex problems.

When Artificial Analysis came up with poor results, they were granted access to private APIs of Reflection models. And the performance then was way better than the previous results. But again, when they compared the performance of the given private APIs with the available models on Hugging Face, the results were completely different, as the model hosted on Hugging Face showed poor results.

Self-Reflection Needed

Meanwhile, users have called Reflection a mere wrapper of Claude AI. When the model was made available on OpenRouter, users reported that it used a dumbed-down version compared to the previous version, as the one made available on OpenRouter was heavily censored.

“The version on OpenRouter seems to be heavily censored/dumbed down; it just refuses to write about what I asked for, while the “original” version did fine. So it was probably ChatGPT or Llama3+ChatGPT for Reflection initially, and now he switched to Claude, which is known to be heavily censored,” a Reddit user shared his experience with OpenRouter.

Shumer first blamed the upload process and mentioned that something might go wrong while uploading weights on Hugging Face but that didn’t solve the problem. So, he went a step further and decided to start the training from scratch to eliminate all the issues.

The API drama aside, an important reason why Reflection is not able to perform better than Llama is the use of different formats. LLaMA 3.1 70B was trained and uploaded using BF16 (Brain Float 16), while Reflection 70B was converted to FP16 (Float 16).

Converting a model from BF16 to FP16 results in significant information loss, and degrades the model performance.

A Reddit user solved a classic trolly problem by adding “it’s not the usual one” to the prompt in a single shot, suggesting the reasoning capabilities of the reflection-tuning method.

Can Retraining Do the Trick?

Shumer mentioned that ideally, this shouldn’t have happened in the first place. He said that his team has tried everything they could, but the performance they get from Hugging Face is nowhere close to what it ideally should be while running the Reflection model locally.

Some users believe that the whole release of the Reflection model was actually an advertisement for GlaiveAI as Shumer owns part of Glaive and was seen promoting it when he released the Reflection model. In response to that, Shumer said that he is a tiny investor and has only invested around $1000 in Glaive.

Here, it’s also important to note that this was the first release of the Reflection model, which was praised for its reflection-tuning approach. It would be a good idea to wait for the next update/release before judging the model too harshly.

Sagar Sharma

A software engineer who loves to experiment with new-gen AI. He also happens to love testing hardware and sometimes they crash. While reviving his crashed system, you can find him reading literature, manga, or watering plants.