Gemini 1.5 and Sora: The Overlooked Details

2024 is shaping up to be an even more intense year, from Google and OpenAI releasing their latest models on the same day to the flood of coverage and commentary on Sora from Chinese self-media.

In fact, however striking the results look, everything presented today could have been linearly extrapolated at least six months ago, including the somewhat sensationalist claims of "non-existent reality" and "replacing XXX."

As early as the release of GPT-3, developers' attitude toward models shifted dramatically. They began to "review" the training process with a growing sense of awe, moving from looking at the model's results eye-to-eye to looking up at them, and they set out with childlike curiosity to uncover every surprising or disappointing detail inside the model.

These details have appeared in technical reports and in successive papers by other researchers. The neural network architecture is set by developers, the data is filtered through specific standards using a combination of programs and manual labor, and the pre-training results are refined through extensive human alignment. Yet, under the pressure of unimaginable parameter counts and data volumes, no single person can truly understand the entirety of the model.

In the eyes of many, every model update is either a threat or a business opportunity. In the eyes of others, each update has always been a research process where curiosity outweighs complex emotions. Details in technical reports and papers always manage to negate some hypotheses, confirm others, and then propose new ones.

The greatest value of Gemini 1.5 and Sora is confirming that increasing scale can continue to enhance model capabilities.

The OpenAI team stated that they discovered "emergent" abilities in the Sora model: occasionally, it can simulate results consistent with the physical world. This "emergence" is purely due to the increase in scale.

Around the time GPT-4 was officially opened for use, many people, myself included, believed that further scaling of large language models was becoming both less feasible and less necessary in the short term. Multimodal models have overturned that assumption: the Transformer architecture keeps proving its effectiveness across one modality after another. The conclusion OpenAI now offers for image and video data merely reinforces expectations that have been forming steadily since the advent of multimodality.

Larger scale corresponds to more parameters and data, and the unit of measurement for scaling is "orders of magnitude." Growing ten or a hundred times on an already high base is the foundation for the ambitious projections recently made by Mark Zuckerberg, Sam Altman, and even Jensen Huang.
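To make "orders of magnitude" concrete, here is a rough sketch using the common C ≈ 6·N·D rule of thumb for training compute (N = parameters, D = training tokens). The specific numbers are illustrative assumptions, not figures disclosed by either company:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Rough training-compute estimate: C ~ 6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

# Hypothetical baseline: a 100B-parameter model trained on 2T tokens.
base = train_flops(1e11, 2e12)

# Scaling both parameters and data by 10x multiplies compute by 100x.
scaled = train_flops(1e12, 2e13)

print(f"{scaled / base:.0f}x compute")  # -> 100x
```

Because parameter count and data volume multiply, growing each by one order of magnitude costs two orders of magnitude in compute, which is why the projections from Zuckerberg, Altman, and Huang sound so ambitious.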

MoE is not just a model architecture; it is also related to hardware architecture.

Gemini 1.5 is MoE-based, GPT-4 is essentially confirmed to be MoE-based, and future foundation models will likely all be MoE-based. Much has been discussed regarding the advantages in model development—higher efficiency and performance at the same parameter scale, and perhaps better scheduling of various modalities. In fact, MoE architecture relates not only to the model itself (the software) but also to the underlying hardware architecture. This is why Google DeepMind mentioned in the Gemini 1.5 technical paper: "Gemini 1.5 Pro is trained on multiple 4096-chip pods of Google’s TPUv4 accelerators, distributed across multiple datacenters, and on a variety of multimodal and multilingual data."

Several important points follow: because of MoE, training can be split across multiple clusters, which is why Google used several different TPU pods distributed across different data centers. We also have reason to believe that different expert sub-models can not only have different architectures, but that the TPU chips (ASICs) in different pods can be optimized differently for the experts they host.
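The routing idea behind MoE can be sketched in a few lines. This is a minimal toy illustration with NumPy, not the architecture of Gemini 1.5 or GPT-4 (whose expert counts, routing rules, and sharding are not public); all dimensions and the top-2 routing choice are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

D, H, E, K = 8, 16, 4, 2  # model dim, hidden dim, number of experts, top-K

# Each expert is a small two-layer ReLU MLP. In a real system, each
# expert's weights could live on a different accelerator pod, which is
# what makes MoE friendly to multi-cluster training.
experts = [
    (rng.standard_normal((D, H)) * 0.1, rng.standard_normal((H, D)) * 0.1)
    for _ in range(E)
]
gate_w = rng.standard_normal((D, E)) * 0.1  # router ("gating") weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-K experts and mix their outputs."""
    logits = x @ gate_w                         # (T, E) router scores
    topk = np.argsort(logits, axis=-1)[:, -K:]  # top-K expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = topk[t]
        # Softmax over only the selected experts' logits.
        w = np.exp(logits[t, sel] - logits[t, sel].max())
        w /= w.sum()
        for weight, e in zip(w, sel):
            w1, w2 = experts[e]
            out[t] += weight * (np.maximum(x[t] @ w1, 0) @ w2)
    return out

tokens = rng.standard_normal((5, D))
y = moe_forward(tokens)
print(y.shape)  # (5, 8): output matches input shape, yet only K of E experts ran per token
```

The key property is that per-token compute depends on K, not E, so total parameters can grow by adding experts without a proportional increase in inference cost, and experts can be placed on separate hardware.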

This is why AI is said to test overall software and hardware capability. It carries strong implications for China's domestic efforts, but I cannot expand on that here.

Objective evaluation of models is becoming increasingly difficult.

Although the Gemini 1.5 report still provides a large number of third-party scores, it is increasingly apparent that even the latest third-party evaluation standards are becoming less representative of current models. In fact, this has been the case since GPT-4.

One reason is that the amount of data used in any scoring system is becoming negligible compared to the data used for model training.

Another important reason is that many models have reached a usable level and are being adopted by people. Different usage scenarios, the user's level of understanding of the model, and the quality of prompts all significantly affect results. The evaluation of these results, however, is often subjective.

Without subjective scores from a sufficiently large user base as a reference, third-party ratings may no longer be persuasive.

Data

Whether it is Google DeepMind or OpenAI, the descriptions of training data in their documents are very simple, with much less space dedicated to it than to model architecture and performance evaluation. From the limited information, we can roughly gather that the Sora model does not crop or downscale image and video data, and Gemini 1.5 only mentions using multimodal data (naturally). This is a significant departure from previous models that detailed the data preprocessing steps.

The small mention carries great weight. As models develop today, while the demand for computing power seems bottomless, it can at least be bought with capital. Data, however, is becoming something money cannot buy, and data preprocessing is increasingly becoming the most critical know-how.

On the other hand, it is evident from all sides that the demand for data volume has risen sharply, while the requirements for data preprocessing, and even for data quality, have decreased, at least in relative terms. At this volume, relying on manual processing is no longer feasible.

However, we are not yet entirely certain whether holding exclusive core data will gradually manifest as an advantage in model R&D. We believe that when Meta releases LLaMA-3 in the near future, we should get a more definitive answer.

Other

Expectations for AI today are basically focused on AGI. Whether larger-scale computing power and more data can accelerate this process is something no one can answer with certainty. But currently, this is the path with the most certainty. At least for now, Transformers can be used in various modalities, the performance gains from scaling remain obvious, and no visible bottlenecks have appeared.

It's just that there is more work to do, and the level of investment keeps growing. 2024 will only see bigger bets, because failing to stay at the table will mean a massive survival crisis.

PS: Amazon released a billion-parameter speech synthesis model that also incorporates Transformers and has shown signs of "emergence."
