This AI Model Can Understand the Mechanisms of the Physical World

The original version of this story appeared in Quanta Magazine.
Here’s a challenge for infants: Show them a glass of water on a table. Hide it behind a wooden panel. Then slide the panel toward the glass. If the panel keeps moving past the glass, as if the glass weren’t there, do they look surprised? Many 6-month-olds do, and by the age of one, nearly all children possess an intuitive understanding of object permanence, acquired simply through observation. Remarkably, some artificial intelligence models are beginning to grasp this concept as well.
Researchers have designed an AI system that learns about the environment through videos and exhibits a sense of “surprise” when encountering information that contradicts its acquired knowledge.
The system, developed by Meta and named Video Joint Embedding Predictive Architecture (V-JEPA), makes no built-in assumptions about the physical laws governing what appears in the videos. Nevertheless, it begins to grasp how the world operates.
“Their assertions are, a priori, quite plausible, and the findings are extremely interesting,” remarks Micha Heilbron, a cognitive scientist at the University of Amsterdam who examines how both brains and artificial systems interpret the world.
Elevated Abstractions
As the developers of self-driving vehicles know well, getting an AI system to reliably interpret visual data can be quite challenging. Most systems designed to “understand” videos, whether to categorize their content (“a person playing tennis,” for instance) or to identify the outlines of objects such as a car ahead, function in what’s termed “pixel space.” Essentially, the model treats every pixel in a video as equally significant.
However, these pixel-space models come with drawbacks. Consider trying to interpret a suburban street scene. If the visual contains cars, traffic signals, and trees, the model may fixate excessively on irrelevant details like the movement of leaves. It might overlook the color of the traffic light or the positioning of adjacent cars. “When dealing with images or videos, it’s not advisable to operate in [pixel] space because there are too many details you don’t wish to model,” stated Randall Balestriero, a computer scientist at Brown University.
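The contrast between the two objectives can be made concrete with a toy sketch. The snippet below is not Meta’s actual V-JEPA code; the `encoder` and `predictor` here are hypothetical stand-ins (a fixed random projection and a linear map) that only illustrate the interface. The key idea it mirrors is that a JEPA-style model predicts the *embeddings* of hidden portions of a video from the visible portions, rather than reconstructing their raw pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(patches):
    # Stand-in for a learned encoder: a fixed random projection of each
    # flattened patch down to an 8-dimensional embedding. In a real
    # JEPA-style model this network is trained, and learns to discard
    # task-irrelevant detail (like fluttering leaves).
    proj = np.random.default_rng(1).standard_normal((patches.shape[-1], 8))
    return patches @ proj

def predictor(context_embeddings, W):
    # Predict the embedding of the masked region from the visible
    # (context) embeddings. A simple linear map, purely for illustration.
    return context_embeddings.mean(axis=0) @ W

# A toy "video": 10 patches, each flattened to 64 pixel values.
video_patches = rng.standard_normal((10, 64))
visible, masked = video_patches[:8], video_patches[8:]

# JEPA-style objective: compare the prediction to the *embedding* of the
# masked patches, never to their raw pixels.
W = rng.standard_normal((8, 8)) * 0.1
prediction = predictor(encoder(visible), W)
target = encoder(masked).mean(axis=0)
latent_loss = float(np.mean((prediction - target) ** 2))
print(f"embedding-space prediction loss: {latent_loss:.3f}")
```

Because the loss is computed on 8-dimensional embeddings rather than on thousands of pixels, the model is never penalized for failing to reproduce details the encoder has abstracted away, which is the advantage Balestriero describes.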

