The neuroscientist Jean-Rémi King leads the Brain & AI team in Meta’s AI division. In an interview with The Decoder, he discusses the connection between AI and neuroscience, the challenges of long-term prediction in models, predictive coding, the question of multimodal systems, and the search for cognitive principles in artificial architectures.
Jean-Rémi King: This is a highly debated question in the field, and I want to emphasize that what I’m expressing here is my personal opinion—it’s not a scientific consensus.
It’s a long-standing debate in cognitive science. Throughout its history, researchers have argued both for and against the necessity of grounding language in sensory experience, that is, whether language needs access to images, sounds, and the physical world in order to have meaning, and vice versa.
For instance, Francisco Varela was a prominent advocate of embodied cognition, emphasizing the idea that cognition—including language—must be rooted in sensory and motor systems. While he might not have used the term “multimodal learning,” his work aligns closely with that concept. On the other side of the spectrum, you have figures like Noam Chomsky and his school of thought in linguistics, who have strongly argued for the independence of language. According to that view, the human brain contains a language system capable of combining and manipulating words largely independently of other systems like vision or auditory perception.
Now, in terms of where we are today: multimodal models are not yet dominating the field. Despite significant effort to combine modalities—text with images, for instance—it’s still difficult to build a multimodal model that outperforms a unimodal one at its own task. Just having access to multiple input streams doesn’t automatically make a model better at processing each. In fact, it often makes training harder. These models still struggle to reach state-of-the-art performance across all included modalities.
Personally, I tend to lean toward the idea that language can function relatively independently from other modalities. If you look at findings from psychology and cognitive science, it's clear that people who are congenitally blind, for example, can reason perfectly well. On IQ tests and similar measures, their performance matches that of sighted individuals. The same holds for people who are deaf, although deafness can sometimes affect language development, depending on the context. Still, it appears that language—and the reasoning it often supports—can develop largely independently of vision and hearing.
That said, there’s something very compelling about the multimodal perspective. Language, after all, is sparse. We don’t encounter that much language in daily life—perhaps 13,000 to 20,000 words per day. And from an AI standpoint, we’re approaching the limit of how much text data is available for training models. There simply won’t be much more new text.
In contrast, other modalities—like images and video—are virtually limitless. We don’t process the entire corpus of online video today simply because we lack the computational infrastructure to handle it. But there's an enormous amount of untapped information and structure in those formats.
So I think there’s real potential in combining the strengths of both: the depth and structure of language with the scale and richness of visual or other sensory data. That intersection remains a very important and promising direction for future research.
The Decoder: One last question: What’s your view on reasoning models, that is, systems that explicitly try to draw inferences? Are there plans to study such models on your team?
Jean-Rémi King: I’m not an expert in reasoning models, but I find the recent developments in that area really exciting. Concepts like chain-of-thought reasoning have been present in cognitive science for quite some time, so it’s great to see them now being formalized in AI. These aren’t just vague theories anymore—we have concrete models that attempt to test these ideas.
What’s particularly interesting is that some of these models explore whether it’s more effective to carry out reasoning as a sequence of words—a verbal chain of thought—or whether reasoning should take place in the latent space of abstract concepts, not necessarily expressed in language. There’s a lot of potential in exploring how reasoning can be "unrolled," how you can revisit earlier steps in a process, and how different formats of representation might influence that process.
That said, it’s not my area of specialization, so I won’t comment in depth on the specific developments. But we are already seeing promising benefits from this line of research.
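To make the distinction King describes a bit more concrete, here is a minimal, purely illustrative sketch (not from the interview, and not code from his team): one toy function unrolls reasoning as a readable sequence of verbal steps, while the other iterates over a hidden vector, so its intermediate states never take the form of words. The function names and the fixed update rule are hypothetical; the sketch assumes only Python with NumPy.

```python
import numpy as np

def verbal_chain_of_thought(question: str) -> list:
    """Reasoning unrolled as an explicit sequence of natural-language steps."""
    steps = [
        f"Restate the problem: {question}",
        "Break it into sub-problems.",
        "Solve each sub-problem and combine the results.",
    ]
    return steps  # every intermediate step is readable text


def latent_space_reasoning(x: np.ndarray, W: np.ndarray, n_steps: int = 5) -> np.ndarray:
    """Reasoning as repeated updates to a hidden vector (toy update rule)."""
    h = x.copy()
    for _ in range(n_steps):
        h = np.tanh(W @ h)  # each step refines an abstract representation
    return h  # only the final state would be decoded into an answer


if __name__ == "__main__":
    for step in verbal_chain_of_thought("How many weekdays are in March?"):
        print(step)

    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, 8)) * 0.5
    x = rng.normal(size=8)
    print(latent_space_reasoning(x, W))
```

In a real system the latent update would be a learned network rather than a fixed matrix, but the contrast is the same one King points to: steps you can read and revisit versus steps that exist only as abstract states until they are decoded at the end.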
What I do find very encouraging is how this connects to reinforcement learning. The idea of fine-tuning large language models for agentic behavior ties directly back to core concepts in reinforcement learning. In many ways, the so-called “world models” used in LLM fine-tuning are just a new framing of ideas that have been present in reinforcement learning for a while.
All of this points to a broader, and very positive, trend in AI: different subfields—language modeling, reasoning, reinforcement learning—are no longer evolving in isolation. Instead, they’re increasingly converging, and that integration is proving to be incredibly powerful.
About Jean-Rémi King
Jean-Rémi King is a CNRS researcher at the École Normale Supérieure and currently works at Meta AI, where he leads the Brain & AI team. His team investigates the neural and computational foundations of human intelligence – with a focus on language – and develops deep learning algorithms to analyze brain activity (MEG, EEG, electrophysiology, fMRI).
The interview was conducted on March 14, 2025.
Interviewer: Maximilian Schreiner