Another Piece on Sora and World Models (Text Only)


There has been so much discussion lately that Yann LeCun has started "venting" on social media again. In reality, on the possible path toward AGI, everyone is like an ignorant child, performing one experiment after another.

Regarding World Models

After Sora was released, many claimed it was a "world model," while many others criticized this notion, most notably Meta's LeCun. To support his view, LeCun offered his own definition of what a world model should be. Of course, as LeCun himself noted, many of the people criticizing him have never contributed anything to machine learning. On one hand, he certainly has a point; on the other, the sheer breadth of the discussion only proves AI's current and potential impact on society. It's normal for everyone to have an opinion.

Moreover, as mentioned before, there are no absolute authorities on this road to the future.

Returning to "world models," although LeCun's definition is expressed in simple formulas, it involves fundamental knowledge of deep reinforcement learning that is hard to explain in a few words. Personally, I reserve my opinion on whether one should predict the next state of the object or the next state of the environment; in a complex world, objects and environments are somewhat difficult to define clearly.

My own conception of a "world model" is closer to a "market" in microeconomics: many participants (subjects, objects, or agents, each embedded in the environment), where the prediction at each moment is something like a general equilibrium.

From this perspective, Sora and Meta's newly released V-JEPA model aren't actually as different as LeCun suggests:

  1. Why the OpenAI team calls Sora a simulator. Every generation consists of two parts: the environment (scene) and objects (people, dogs, weather, etc.). If you don't view yourself merely as an observer of the generated video, the video is essentially the model acting as a subject predicting the behavior of various objects and the resulting changes to the environment. Thus, Sora is inherently a predictive model that manifests its predictions through generation.
  2. Meta's new V-JEPA model. Stripping away the bells and whistles, this model masks out parts of a video and predicts the masked regions. Through training on massive data, the goal is for the model to form specific concepts, whether objects or continuous actions. So in reality, both models are based on predicting objects or environments, aren't they? The training structures differ, but the main difference is that one manifests its predictions through generation while the other only predicts.
  3. LeCun's Criticism. LeCun criticizes Sora's pixel-based prediction as a dead end because the results are unstable. This makes sense. However, first, Sora is not strictly pixel-based: it uses a transformer over spatial and temporal information compressed into "spacetime patches," where a patch plays the role of a token (the sketch after this list illustrates this and point 2). Second, V-JEPA's prediction involves masking, recognition, and complex operations requiring encoding and decoding; is that not also effectively "pixel-level"? There is no clear line. Third, predictions are made within an encoded space, and generation likewise means diffusion in that space followed by decoding. I partially agree with LeCun, but anyone can find grounds for an argument.
  4. All Roads Lead to Rome. These are all experiments. If one path fails, try another. It’s just that commercial interests and the competitive drive to be "number one" have everyone locked in a stalemate.
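
To make points 2 and 3 concrete, here is a toy PyTorch sketch. It is my own illustration, not OpenAI's or Meta's code: the names (`patchify`, `MaskedLatentPredictor`), the patch sizes, and the model dimensions are all assumptions for demonstration. A video is cut into spacetime patches that play the role of tokens, and a small transformer predicts masked patches in a latent space, roughly in the V-JEPA spirit; a generative model in the Sora mold would instead diffuse in a latent space and decode back to pixels.

```python
import torch
import torch.nn as nn

def patchify(video, pt=2, ph=16, pw=16):
    """Split a video of shape (B, T, C, H, W) into flattened spacetime patches.

    Each patch spans pt frames and a ph x pw pixel window, so a patch plays
    the role a token plays in a language model. Sizes here are illustrative.
    """
    b, t, c, h, w = video.shape
    video = video.reshape(b, t // pt, pt, c, h // ph, ph, w // pw, pw)
    # Reorder to (B, nT, nH, nW, pt, ph, pw, C), then flatten each patch.
    video = video.permute(0, 1, 4, 6, 2, 5, 7, 3)
    return video.reshape(b, -1, pt * ph * pw * c)

class MaskedLatentPredictor(nn.Module):
    """Predict embeddings of masked patches from the visible ones.

    A JEPA-style setup would compare these predictions against a separate
    target encoder's embeddings; that encoder, the loss, and positional
    embeddings are all omitted here for brevity.
    """
    def __init__(self, patch_dim, d_model=256):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patches, mask):
        # mask: (B, N) boolean, True where a patch is hidden from the model.
        x = self.embed(patches)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.encoder(x)  # latent predictions for every position

video = torch.randn(1, 8, 3, 64, 64)           # (B, T, C, H, W) dummy clip
patches = patchify(video)                      # -> (1, 64, 1536)
mask = torch.rand(1, patches.shape[1]) < 0.5   # hide half the patches
preds = MaskedLatentPredictor(patches.shape[-1])(patches, mask)
```

The point of the sketch is the shared skeleton: both families tokenize spacetime and predict what they cannot see; they differ in whether the prediction is scored in latent space or decoded back into pixels.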

Is Sora's Application Scenario Text-to-Video?

Following the previous section, the answer is clearly no. Sora is essentially an observer and predictor of a world; video generation is merely a byproduct.

Many saw the video results and shouted "AGI!" but Sora is miles away from AGI. Its purpose is not to be AGI, but to serve AGI—specifically, by feeding it data.

While the definition of AGI remains unstandardized, AGI must at least possess decision-making capabilities, "understand" tasks, and execute them stably. Sora is called a simulator because it simulates a "world": environments and objects. It "predicts" object movement and environmental changes based on learned "rules." This data is fed to an "AGI" (if one is currently being trained) to see if AGI can truly emerge through complex reinforcement learning.

Currently, in autonomous driving R&D (Tesla's, for example), generated "worlds" cover most scenarios and can even replicate real-world incidents. However, if someone used Sora to generate a "world" to train a driving model, no one would dare deploy it yet. This suggests that today's human-defined rule engines are still more effective and that Sora needs to improve. But that is no reason to dismiss Sora; it is already better than other models on this path. And since the path of hand-crafted rules is nearing its limits, we have to place our hopes on models like Sora.

Returning to the byproduct of video generation: expecting Sora to immediately reach Hollywood blockbuster quality is unrealistic. In practice, though, even partial integration is enough to streamline workflows, and the downstream effects are hard to overstate.

For now, the model can already replace low-quality short-form content from self-published creators.

So, Where is the Problem with Sora?

Interestingly, because training AGI requires massive data, we need simulators like Sora to generate it—yet Sora's biggest limitation is also a lack of data.

In Sora's technical report, the data section is glossed over in a few sentences, which itself suggests that data is the most critical part: the less a lab says about something, the more likely it is the secret sauce. OpenAI's lead comes down to two factors: 1. an extreme work ethic (of the "involution" kind); 2. superior data quality.

On the other hand, current training data—whether images or videos—is 2D and lacks spatial information. Sora, learning from 2D optical data, struggles to understand the "physics" of a 3D world. Data is not just scarce; it is extremely scarce. But as long as the scaling law is proven effective, the solution is to expand the scale and feed it as much data as possible.
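
For concreteness, one published functional form of a scaling law is the Chinchilla fit of Hoffmann et al. (2022) for language models; whether video models obey the same form is, as far as I know, an open assumption:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% N: parameter count; D: training tokens (for video, spacetime patches);
% E: irreducible loss; A, B, alpha, beta: fitted constants.
```

If anything like this holds for video, the data term B / D^beta is exactly where the 2D-data bottleneck bites.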

Finally, Can Others Catch Up Quickly?

I am quite optimistic about this. In terms of capability, Google, Meta, Runway, Pika, and Stability AI can catch up quickly. Diffusion Transformer technology is not inherently difficult. While there is a gap in data quality compared to OpenAI, it isn't as vast as it was last year.
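
To make the "not inherently difficult" claim concrete, here is a minimal single-block sketch of a Diffusion Transformer in PyTorch, loosely in the spirit of Peebles and Xie's DiT paper. The class name, hyperparameters, and the simplified adaLN-style conditioning are my assumptions for illustration, not anyone's production code:

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Transformer block whose normalizations are modulated by a
    conditioning vector (diffusion timestep and, in practice, the caption)."""
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Map the conditioning vector to per-block scale/shift/gate terms.
        self.ada = nn.Linear(d_model, 6 * d_model)

    def forward(self, x, cond):
        # x: (B, N, d_model) spacetime-patch tokens; cond: (B, d_model).
        s1, b1, g1, s2, b2, g2 = self.ada(cond).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)

tokens = torch.randn(1, 64, 256)   # embedded spacetime patches
cond = torch.randn(1, 256)         # timestep/caption conditioning embedding
out = DiTBlock()(tokens, cond)     # same shape as tokens
```

Stacking blocks like this is the easy part; the moat, if there is one, is the data pipeline, the video tokenizer, and the compute.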

Model R&D and application implementation are actually two different problems.

PS: I completed this 3,000-word piece entirely using Apple's Vision Pro. Despite its flaws, once you experience this immersive workflow, there is no going back.

PS': Lately, I feel more like a "charlatan." How can my few words possibly cover models that others spent so much time perfecting? I've been reckless.
