This AI Model Can Understand the Mechanisms of the Physical World.

The original version of this story appeared in Quanta Magazine.

Here’s a test for infants: Show them a glass of water on a table, then conceal it behind a wooden panel. Now slide the panel toward the glass. If the panel keeps moving past the glass, as if the glass weren’t there, do they look surprised? Many 6-month-olds do, and by the age of one, nearly all children have an intuitive understanding of object permanence, acquired simply by watching the world. Remarkably, some artificial intelligence models are beginning to grasp this concept as well.

Researchers have designed an AI system that learns about the world by watching videos and exhibits a sense of “surprise” when it encounters something that contradicts what it has learned.

The system, developed by Meta and named Video Joint Embedding Predictive Architecture (V-JEPA), is built with no assumptions about the physical laws governing what appears in the videos. Nevertheless, it begins to work out how the world operates.
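
The article does not spell out how “surprise” is scored, but one common way to think about it is as prediction error in a learned embedding space: the model predicts a representation of what should come next and compares it with what it actually observes. The Python sketch below is a minimal illustration under that assumption; the `encoder`, the `predictor`, and every shape in it are hypothetical placeholders, not Meta’s architecture.

```python
# A toy sketch, not Meta's implementation: "surprise" measured as prediction
# error in a learned embedding space. The encoder, predictor, and all shapes
# below are hypothetical placeholders.
import torch
import torch.nn as nn

EMBED_DIM = 128
CLIP_PIXELS = 3 * 16 * 64 * 64  # channels x frames x height x width, flattened

# Hypothetical encoder: compresses a video clip into a compact embedding
# instead of keeping every pixel around.
encoder = nn.Sequential(
    nn.Flatten(),
    nn.Linear(CLIP_PIXELS, 512),
    nn.ReLU(),
    nn.Linear(512, EMBED_DIM),
)

# Hypothetical predictor: from the embedding of what the model has seen,
# guess the embedding of what comes next.
predictor = nn.Sequential(
    nn.Linear(EMBED_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, EMBED_DIM),
)

def surprise(observed_clip: torch.Tensor, next_clip: torch.Tensor) -> float:
    """Return the prediction error ("surprise") for what actually happened."""
    with torch.no_grad():
        z_seen = encoder(observed_clip)
        z_actual = encoder(next_clip)
        z_predicted = predictor(z_seen)
        # A large gap between prediction and reality marks a surprising event,
        # such as a panel sliding through the space where a glass should be.
        return torch.mean((z_predicted - z_actual) ** 2).item()

# Toy usage with random "clips" of shape (batch, channels, frames, height, width).
clip_seen = torch.randn(1, 3, 16, 64, 64)
clip_next = torch.randn(1, 3, 16, 64, 64)
print(f"surprise score: {surprise(clip_seen, clip_next):.4f}")
```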

“Their assertions are, a priori, quite plausible, and the findings are extremely interesting,” remarks Micha Heilbron, a cognitive scientist at the University of Amsterdam who examines how both brains and artificial systems interpret the world.

Elevated Abstractions

As the developers of self-driving vehicles know, getting an AI system to consistently interpret visual data is hard. Most systems designed to “understand” videos, whether to categorize their content (“a person playing tennis,” for instance) or to identify the outlines of objects such as a car ahead, function in what’s termed “pixel space.” Essentially, the model treats every pixel in a video as equally significant.

However, these pixel-space models come with drawbacks. Consider trying to interpret a suburban street scene. If the scene contains cars, traffic signals, and trees, the model may fixate on irrelevant details like the movement of leaves. It might overlook the color of the traffic light or the positions of nearby cars. “When dealing with images or videos, it’s not advisable to operate in [pixel] space because there are too many details you don’t wish to model,” said Randall Balestriero, a computer scientist at Brown University.
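
To make the distinction concrete, here is a minimal Python sketch contrasting a pixel-space loss, which weights every pixel equally, with a loss computed on learned embeddings. The tiny `encoder` and the frame shapes are hypothetical placeholders chosen only for illustration, not how V-JEPA is actually built.

```python
# A toy contrast, not production code: a pixel-space loss treats every pixel
# as equally significant, while an embedding-space loss compares compact
# summaries that a trained encoder can use to discard irrelevant detail.
import torch
import torch.nn as nn

# Hypothetical tiny encoder that summarizes a 64x64 RGB frame in 32 numbers.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 32))

def pixel_space_loss(predicted_frame: torch.Tensor, actual_frame: torch.Tensor) -> torch.Tensor:
    # Every pixel counts the same, so fluttering leaves contribute as much
    # error as a traffic light changing color.
    return torch.mean((predicted_frame - actual_frame) ** 2)

def embedding_space_loss(predicted_frame: torch.Tensor, actual_frame: torch.Tensor) -> torch.Tensor:
    # Compare learned summaries instead of raw pixels; details the encoder
    # learns to ignore no longer dominate the error.
    return torch.mean((encoder(predicted_frame) - encoder(actual_frame)) ** 2)

# Toy usage with random frames of shape (batch, channels, height, width).
frame_pred = torch.randn(1, 3, 64, 64)
frame_true = torch.randn(1, 3, 64, 64)
print("pixel-space loss:    ", pixel_space_loss(frame_pred, frame_true).item())
print("embedding-space loss:", embedding_space_loss(frame_pred, frame_true).item())
```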

In 2022, Yann LeCun, a computer scientist at New York University and head of AI research at Meta, created JEPA, a precursor to V-JEPA that operates on static images.

Photograph: École Polytechnique Université Paris-Saclay
