After getting early access to ByteDance's Dreamina beta, we compared it with Sora and other overseas models

On March 25, Sora released a new batch of videos generated in collaboration with artists. The quality of the videos, combined with the artists' creativity, is staggering. ByteDance's Dreamina also released its first batch of waitlist spots for internal testing on the 26th. We obtained access immediately and compared Dreamina against Sora, Runway, and Pika.

Conclusion: As a product from ByteDance, which holds the most data and top-tier computing power in China, Dreamina's video generation already far surpasses earlier applications such as Runway and Pika. A gap with Sora remains, but Dreamina leads the second tier by a clear margin. To reach Sora's level, there is still substantial room for improvement in training-data quality, computing power, and memory capacity.

In terms of generation quality, Dreamina handles object movement well and shows a degree of 3D continuity. However, there are still flaws in human figures, especially legs and fingers during movement, and facial consistency is relatively poor. Even so, the results are much better than those of the other applications. Runway's generated videos show only slight movement and contain unnatural distortions; Pika's generation quality is worse still, with highly unnatural handling of moving people and animals. Overall, in terms of generation quality: Sora >> Dreamina >> Runway > Pika.

Regarding duration, Dreamina shows no significant advantage over Runway or Pika, remaining in the range of several to a dozen seconds, which is a major gap compared to Sora's one-minute capability. Furthermore, looking at Dreamina's extended video generation, there is a noticeable difference in clarity and consistency in the extended segments. It appears to use the last frame of the previous video for re-generation rather than processing the entire preceding sequence. Achieving Sora-like lengths would significantly increase requirements for memory capacity and computing power.
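The difference between these two extension strategies can be illustrated with a toy sketch. This is purely hypothetical code, not Dreamina's or Sora's actual implementation, and every function name here is invented for illustration. It only shows why conditioning on the last frame alone discards motion information that full-sequence conditioning retains, which would explain the drop in consistency we observed.

```python
# Illustrative sketch (hypothetical, not any real model's API): two ways a
# video model might extend a clip. A "frame" is just a float for simplicity;
# a real model would operate on image tensors.

def toy_generator(context):
    """Stand-in for a video model: predicts the next frame from its context.
    Here it simply continues the average trend of the context frames."""
    if len(context) == 1:
        return context[-1]  # only one frame to condition on: no motion cue
    step = (context[-1] - context[0]) / (len(context) - 1)
    return context[-1] + step

def extend_last_frame(video, n):
    """What we speculate Dreamina does: re-generate from the final frame only,
    discarding earlier frames, which can cause drift and consistency drops."""
    out = list(video)
    for _ in range(n):
        out.append(toy_generator(out[-1:]))  # context = last frame only
    return out

def extend_full_context(video, n):
    """Sora-style extension: condition on the entire preceding sequence,
    at a much higher memory and compute cost."""
    out = list(video)
    for _ in range(n):
        out.append(toy_generator(out))  # context = all frames so far
    return out

video = [0.0, 1.0, 2.0, 3.0]  # a clip with steady motion
print(extend_last_frame(video, 3))   # motion stalls: [..., 3.0, 3.0, 3.0]
print(extend_full_context(video, 3)) # motion continues: [..., 4.0, 5.0, 6.0]
```

In the last-frame variant the generated segment has no memory of the clip's motion, mirroring the clarity and consistency break we saw in Dreamina's extended segments; the full-context variant stays coherent but its memory footprint grows with clip length, which is why a one-minute Sora-style video is so demanding.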

In terms of prompt understanding, Dreamina and Runway are at a similar level, both being able to understand prompt content well and generate characters and backgrounds that match descriptions. Pika's generated videos tend to deviate from the prompts.

Regarding safety, it was surprising that even though Dreamina is still in beta, risk controls are already in place for prompts. When we tried to replicate Sora's latest "Balloon Man" video, the system prompted that it contained inappropriate content and refused to generate it.

Below is a comparison of videos generated using the same prompts across Sora, Dreamina, Runway, and Pika:

First is Sora's first video; we used a unified prompt for generation. Only Dreamina could generate a video with similar movement, though the leg and facial transitions were still a bit strange. After extending the video by 3 seconds, inconsistency and a drop in clarity occurred. We speculate this is because the full video was not used during the extension process; reaching Sora's 1-minute duration would drastically increase demands on memory and compute.

“A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.”

Second is the mammoth video. Sora remains in the lead in terms of quality, though the gap with Dreamina is narrowing. Both Runway and Pika exhibited abnormal movement, and Pika's generation quality was quite poor.

“Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, depth of field.”

To test the 3D consistency of the video generation models, we also used a custom prompt. From the results, Sora showed the strongest 3D consistency under camera movement, while Dreamina also demonstrated some 3D consistency. The gap for Runway and Pika was more apparent. 3D consistency, as an emergent capability of these models, is a fundamental indicator of whether a generative model understands the physical world.

“a Coca-Cola zip-top can at the center of the frame. The camera angle should start from its slanted top view and continuously rotate around the bottle, offering a dynamic, circular perspective.”

We also tried to replicate some of Sora's recently released videos of unnatural biological creatures. Dreamina could understand the prompts, but the results were still inferior to Sora.

[Dreamina-generated video]

[Sora-generated video]

Although Dreamina is still in beta, its safety and risk controls are already surprisingly strict. When we attempted to replicate Sora's latest "Balloon Man" video, Dreamina rejected the generation. Safety is particularly vital for video generation models, as inappropriate videos have a far greater impact on audiences than text does.
