In brief
- Stanford computer science professor Fei-Fei Li said AI’s progress is now limited by systems that cannot understand physical space.
- World models are designed to simulate environments and predict how scenes change over time.
- Early prototypes like Marble hint at how these models could reshape creative work, robotics, and science.
Robots and multimodal artificial intelligence still can’t grasp the physical world, a shortcoming one prominent researcher says is now the field’s biggest obstacle.
Fei-Fei Li, the Stanford computer scientist widely regarded as a pioneer of modern computer vision, said the gap between AI and physical reality has become the technology's most urgent problem, arguing that closing it will require systems built around spatial reasoning rather than language alone.
AI is fast approaching the limits of text-based learning, and progress will ultimately depend on “world models,” Li said in a report published Monday.
“At the core of unlocking spatial intelligence is the development of world models—a new type of generative AI that must meet a fundamentally different set of challenges than LLMs,” Li wrote on X. “These models must generate spatially consistent worlds that obey physical laws, process multimodal inputs from images to actions, and predict how those worlds evolve or be interacted with over time.”
What in the world are these models?
The concept of “world models” dates back to the early 1940s, when Scottish philosopher and psychologist Kenneth Craik proposed that the mind builds small-scale internal models of reality and uses them to anticipate events.
The idea resurfaced in modern AI after David Ha and Jürgen Schmidhuber’s 2018 paper showed that a neural network could learn a compact internal model of an environment and use it as a simulator for planning and control.
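The core loop of that 2018 setup can be caricatured in a few lines: learn a dynamics model from interaction data, then plan by rolling out candidate action sequences inside the learned model rather than the real environment. The sketch below is a deliberately simplified illustration, not the paper's architecture: it swaps the VAE-plus-RNN latent model for a linear model of a toy 1-D system, and all names and parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "environment": a 1-D point with position x and velocity v; the action is a force.
# True (hidden) dynamics: x' = x + v, v' = v + a.
def env_step(state, action):
    x, v = state
    return np.array([x + v, v + action])

# 1) Fit a linear internal model of the dynamics from random interaction data
#    (a stand-in for the learned latent dynamics model in the 2018 paper).
X, Y = [], []
for _ in range(200):
    s = rng.normal(size=2)
    a = rng.normal()
    X.append([s[0], s[1], a])
    Y.append(env_step(s, a))
W, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)

def model_step(state, action):
    # Predict the next state using the learned model, not the real environment.
    return np.concatenate([state, [action]]) @ W

# 2) Plan inside the learned model via random shooting: sample action
#    sequences, roll each one out in the model, keep the cheapest.
def plan(state, horizon=5, candidates=256, target=3.0):
    best_cost, best_first = np.inf, 0.0
    for seq in rng.normal(size=(candidates, horizon)):
        s, cost = state.copy(), 0.0
        for a in seq:
            s = model_step(s, a)
            cost += (s[0] - target) ** 2 + 0.1 * s[1] ** 2  # track target, damp speed
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first

# 3) Execute in the real environment, replanning after every step (MPC-style).
state = np.array([0.0, 0.0])
for _ in range(10):
    state = env_step(state, plan(state))
print(f"final position: {state[0]:.2f}")  # goal: steer x toward 3.0
```

Because the toy dynamics happen to be linear and the data is noiseless, the fitted model here recovers them almost exactly; the point of the exercise is that the planner never queries the real environment while searching, only its internal model, which is the essence of using a world model as a simulator.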
Li argued that world models matter because robots and multimodal systems still struggle with grounded spatial reasoning, leaving them unable to judge distances and scene changes, or to predict basic physical outcomes.
“Robots as human collaborators, whether aiding scientists at the lab bench or assisting seniors living alone, can expand part of the workforce in dire need of more labour and productivity,” Li wrote. Real environments follow rules that current machines can’t capture, Li argues, from gravity shaping motion to materials influencing light. Solving this requires systems capable of storing spatial memory and modeling scenes in more than two dimensions.
In September, Li’s company, World Labs, released the beta for Marble, an early world model that produced explorable three-dimensional environments from text or image prompts.
According to the company, users can walk through these worlds without time limits, and the environments remain consistent rather than drifting, morphing, or breaking apart.
“Marble is only our first step in creating a truly spatially intelligent world model,” Li wrote. “As the progress accelerates, researchers, engineers, users, and business leaders alike are beginning to recognize its extraordinary potential. The next generation of world models will enable machines to achieve spatial intelligence on an entirely new level—an achievement that will unlock essential capabilities still largely absent from today’s AI systems.”
Li said world models could support a range of applications because they give AI an internal understanding of how environments behave.
Creators could use them to explore scenes in real time, robots could rely on them to navigate and handle objects more safely, and researchers in science and healthcare could run spatial simulations or improve imaging and lab automation.
Li linked spatial intelligence research back to early biological studies, noting that humans learned to perceive and act long before they developed language.
“Long before written language, humans told stories—painted them on cave walls, passed them through generations, built entire cultures on shared narratives,” she wrote. “Stories are how we make sense of the world, connect across distance and time, explore what it means to be human, and most importantly, find meaning in life and love within ourselves.”
Li said AI needed the same grounding to function in the physical world and argued that its role should be to support people, not replace them. Progress, however, would depend on models that understood how the world worked rather than merely described it.
“AI’s next frontier is Spatial Intelligence, a technology that will turn seeing into reasoning, perception into action, and imagination into creation,” Li said.

