How Anthropic’s AI Model Thinks, Lies, and Catches Itself Making a Mistake


Anthropic’s Claude was found to be providing false reasoning while researchers attempted to decode how an LLM thinks.

AI isn’t perfect. It can hallucinate and sometimes be inaccurate—but can it straight-up fake a story just to match your flow? Yes, it turns out that AI can lie to you. 


Anthropic researchers recently set out to uncover how LLMs work under the hood. They shared their findings in a blog post that read, “From a reliability perspective, the problem is that Claude’s ‘fake’ reasoning can be very convincing.” 

The study aimed to find out how Claude 3.5 Haiku thinks by using a ‘circuit tracing’ technique. This is a method to uncover how language models produce outputs by constructing graphs that show the flow of information through interpretable components within the model. 
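Anthropic’s circuit tracing operates on learned, interpretable features discovered inside the model itself. Purely as a toy sketch of the underlying idea (hypothetical weights, not the real method), one can score each component’s contribution to an output and keep only the strongest connections as a graph:

```python
# Toy sketch of an attribution graph (NOT Anthropic's actual circuit-tracing
# code): for a tiny two-layer linear "model", score how much each component
# contributes to the output and keep only the strongest edges.

def attribution_graph(x, W1, W2, threshold=0.5):
    """Return the output and the edges (source, target, weight) whose
    absolute contribution exceeds `threshold`."""
    # Forward pass: hidden activations, then output.
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    output = sum(w * h for w, h in zip(W2, hidden))

    edges = []
    # Input -> hidden contributions.
    for j, row in enumerate(W1):
        for i, (w, xi) in enumerate(zip(row, x)):
            contrib = w * xi
            if abs(contrib) > threshold:
                edges.append((f"in{i}", f"h{j}", contrib))
    # Hidden -> output contributions.
    for j, (w, h) in enumerate(zip(W2, hidden)):
        contrib = w * h
        if abs(contrib) > threshold:
            edges.append((f"h{j}", "out", contrib))
    return output, edges

W1 = [[2.0, 0.0], [0.1, 0.1]]   # hidden-layer weights (hypothetical)
W2 = [1.0, 0.05]                # output weights (hypothetical)
out, graph = attribution_graph([1.0, 1.0], W1, W2)
# Only the in0 -> h0 -> out path survives pruning here.
```

In the actual paper the nodes are interpretable features extracted from the transformer, and the graph is pruned to the pathways that explain one specific output, but the prune-to-the-strongest-paths intuition is similar.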

Paras Chopra, founder of Lossfunk, took to X, calling one of their research papers “a beautiful paper by Anthropic”. 

However, the question is: Can the study help us understand AI models better?

AI Can Be Unfaithful

In the research paper titled ‘On the Biology of a Large Language Model’, Anthropic researchers noted that chain-of-thought (CoT) reasoning is not always faithful, a claim also backed by other research papers. The paper shared two examples where Claude 3.5 Haiku produced unfaithful chains of thought.

It labelled one example as the model “bullshitting”, a nod to Harry G Frankfurt’s bestseller On Bullshit, where claims are made without regard for whether they are true, and the other as “motivated reasoning”, where the model tries to align with the user’s input. In the motivated reasoning case, the model worked backwards to match the answer the user had shared in the prompt itself, as shown in the image below. 

Source: Anthropic
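The “working backwards” pattern can be sketched in a few lines (hypothetical numbers, not the paper’s actual prompt): a faithful solver computes forwards from the input, while a motivated one picks whatever intermediate value justifies the user’s claimed answer.

```python
import math

# Toy illustration of "motivated reasoning" (hypothetical problem, not the
# paper's exact prompt): the final answer is 5 * cos(x).

def faithful(x):
    # Forwards: actually compute the intermediate value, then the answer.
    c = math.cos(x)
    return c, 5 * c

def motivated(x, user_claims):
    # Backwards: invent whatever intermediate value makes the user's
    # claimed answer come out right, ignoring x entirely.
    c = user_claims / 5
    return c, 5 * c

_, honest = faithful(23423)
step, agreeable = motivated(23423, user_claims=4.0)
# `agreeable` is exactly the user's 4.0, and `step` was fabricated to
# justify it; `honest` is whatever 5 * cos(23423) actually equals.
```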

When it comes to “bullshitting”, the model was found to guess the answer even though its chain of thought claimed it had performed the calculation.

Source: Anthropic

When presented with a straightforward mathematical problem, such as calculating the square root of 0.64, Claude demonstrates a reliable, step-by-step reasoning process, accurately breaking down the problem into manageable components.
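The kind of decomposition described here can be checked directly; a minimal sketch of the arithmetic:

```python
import math

# Breaking sqrt(0.64) into manageable components, step by step:
# sqrt(0.64) = sqrt(64/100) = sqrt(64) / sqrt(100) = 8 / 10 = 0.8
by_parts = math.sqrt(64) / math.sqrt(100)   # 8.0 / 10.0
direct = math.sqrt(0.64)                    # both routes give 0.8
```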

However, when faced with a more complex calculation, like the cosine of a large, non-trivial number, Claude’s behaviour shifts: it produces an answer without regard for whether it is true or false.

Overall, Claude was found to produce convincing-sounding steps to get where it wanted to go.

Model Realises Its Mistake as It Writes the First Sentence

Anthropic researchers used jailbreak prompts to trick the model into bypassing its safety guardrails, pushing it to give information on making a bomb.

The model was tricked into beginning to fulfil the harmful request, and only pivoted to a refusal once it had completed its first sentence. This highlighted the model’s ability to change course from how it began its response.

Explaining this behaviour, the researchers stated, “The model doesn’t know what it plans to say until it actually says it, and thus has no opportunity to recognise the harmful request at this stage.” The researchers also found that removing punctuation from the sentence when using the jailbreak prompt made it more effective, pushing Claude 3.5 Haiku to share more information. 

The study concluded that the model didn’t recognise “bomb” in the encoded input, prioritised instruction-following and grammatical coherence over safety, and didn’t initially activate harmful request detection features because it failed to link “bomb” and “how to make”.

Claude Plans Ahead When Writing a Poem

The researchers found compelling evidence that Claude 3.5 Haiku plans ahead when writing rhyming poems. Instead of improvising each line and finding a word that rhymes at the end, the model often activates features corresponding to candidate end-of-next-line words before even writing that line. 

This suggests that the model weighs potential rhyming words in advance, taking into account the rhyme scheme and the context of the previous lines.

Furthermore, the model uses these “planned word” features to influence how it constructs the entire line. It doesn’t just choose the final word to fit; it seems to “write towards” that target word as it generates the intermediate words of the line.

The researchers were even able to manipulate the model’s planned words and observe how it restructured the line accordingly, demonstrating a sophisticated interplay of forward and backward planning in the poem-writing process.
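A loose analogy in code (entirely hypothetical; the model does this with internal features, not lookup tables): pick the rhyme word first, then construct the line to lead into it, so swapping the planned word restructures the whole line.

```python
# Toy sketch of "plan the rhyme word, then write towards it".

RHYMES = {"night": ["light", "bright", "sight"]}

def plan_end_word(prev_end, choice=0):
    # Planning step: pick a candidate rhyme BEFORE writing the line.
    return RHYMES[prev_end][choice]

def write_line(planned):
    # Generation step: the intermediate words are chosen to lead
    # naturally into the planned final word.
    lead_ins = {
        "light": "and filled the room with light",
        "bright": "the stars were burning bright",
        "sight": "a strange and lovely sight",
    }
    return lead_ins[planned]

line = write_line(plan_end_word("night"))
# Intervening on the "planned word" restructures the whole line:
other = write_line(plan_end_word("night", choice=1))
```

This mirrors the intervention the researchers describe: change the planned word and the rest of the line reorganises around the new target.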

The research paper stated, “The ability to trace Claude’s actual internal reasoning—and not just what it claims to be doing—opens up new possibilities for auditing AI systems”.

A key finding is that language models are incredibly complex. Even seemingly simple tasks involve a multitude of interconnected steps and “thinking” processes within the model. 

The researchers acknowledge that their methods are still developing and have limitations. Still, they believe this kind of research is crucial for understanding and improving the safety and reliability of AI.

Ultimately, this work represents an effort to move beyond treating language models as “black boxes”. 

Ankush Das

I am a tech aficionado and a computer science graduate with a keen interest in AI, Open Source, and Cybersecurity.
