DeepSeek Research (2): Minority Report



Unless a new model is released, this should be my last article focusing solely on DeepSeek. In fact, many of the views I am about to express may differ significantly from the "mainstream," but these are "inferences" drawn from accumulation, time, and practice, and they are offered up for debate.

I won't write a long treatise, but rather a "Minority Report" on a few core issues I believe are central.

On a larger scale, this is a "free" feast: users who cannot frequently use advanced models like ChatGPT, Gemini, or Claude have finally "witnessed" the capabilities and potential of AI; long-term AI users have gained a "free and easy-to-use" alternative. Competition is always beneficial for users.

For a vast number of users, having an AI tool that yields "stunning" results without requiring professional knowledge is a breakthrough. Due to "access restrictions" in China, previous AI applications were mostly focused on efficiency improvements at the "production end." Now, the "minds" of a massive user base have been opened, which is undoubtedly a good thing for the industry.

For global practitioners, the engineering innovations in V3 have opened up new ways of thinking. Under the long-term supply constraints on "high-end computing power," more "real-value" optimization solutions have emerged. The two papers on V3 and R1 can inspire many "researchers" by providing a practical path to raising the ceiling of existing models. Open weights make optimization and adoption in more specialized fields genuinely possible.

However:


Although a comprehensive evaluation of model capabilities is almost impossible at present, the claim that "V3 and R1 are on the same capability level as frontier models like GPT" likely won't meet much resistance.

Therefore, if DeepSeek were not "free" and "open-weight," the attention it receives would likely drop exponentially.


Fortunately, controversy over this issue is diminishing. At today's costs, if Meta were to train an MoE model with 671B total parameters and 37B active parameters per token on 15T tokens, the cost would certainly be higher than DeepSeek's, but only marginally so.
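As a sanity check on that claim, the standard ~6·N·D approximation for training FLOPs gives a rough cost figure. Every operating number below (peak throughput, utilization, $/GPU-hour) is an assumption of mine for illustration, not a reported figure:

```python
# Back-of-envelope pre-training cost using the standard ~6*N*D FLOPs rule.
# All hardware and price numbers are ASSUMPTIONS, not disclosed figures.

def training_cost(active_params, tokens, peak_flops, mfu, price_per_gpu_hour):
    """Estimate GPU-hours and dollar cost for one training run."""
    total_flops = 6 * active_params * tokens          # ~6*N*D approximation
    effective_flops = peak_flops * mfu                # sustained FLOP/s per GPU
    gpu_hours = total_flops / effective_flops / 3600
    return gpu_hours, gpu_hours * price_per_gpu_hour

gpu_hours, cost = training_cost(
    active_params=37e9,      # 37B active parameters per token (MoE)
    tokens=15e12,            # ~15T training tokens
    peak_flops=989e12,       # assumed BF16 peak of an H100-class GPU
    mfu=0.40,                # assumed 40% model-FLOPs utilization
    price_per_gpu_hour=2.0,  # assumed all-in $/GPU-hour
)
print(f"{gpu_hours / 1e6:.2f}M GPU-hours, ~${cost / 1e6:.1f}M")
```

Under these assumptions the run lands in the low single-digit millions of dollars, which is why the gap between labs at today's costs is marginal rather than an order of magnitude.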

Rapid technological progress always leads to rapid cost reductions.


This conclusion might be heavily criticized, but if DeepSeek is serving users the "native full-strength" versions of V3 and R1, then charging $1.10 and $2.19 per million tokens likely barely covers electricity once hardware depreciation, data-center hosting, networking, operations, and other expenses are factored in. At least on Hopper-architecture chips, even if there is a profit, the margin is extremely thin; more likely, they are losing money.
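A back-of-envelope check of that economics, with every operating number an assumption of mine rather than anything DeepSeek has disclosed:

```python
# Rough break-even check for serving a large MoE at DeepSeek-style prices.
# Every operating number here is an ASSUMPTION chosen for illustration.

PRICE_PER_M_TOKENS = 2.19   # $ per million output tokens (R1 list price)
NODE_COST_PER_HOUR = 20.0   # assumed all-in cost of one 8-GPU node:
                            # depreciation + power + hosting + network + ops
NODES_PER_REPLICA = 2       # assumed: ~16 GPUs to hold one full model replica

hourly_cost = NODE_COST_PER_HOUR * NODES_PER_REPLICA
# Aggregate output throughput needed just to cover that hourly cost:
breakeven_tokens_per_hour = hourly_cost / PRICE_PER_M_TOKENS * 1e6
breakeven_tokens_per_sec = breakeven_tokens_per_hour / 3600
print(f"break-even: {breakeven_tokens_per_sec:,.0f} tokens/s aggregate")
```

Whether a 16-GPU replica can actually sustain several thousand output tokens per second, even with heavy batching, is precisely the open question; if it cannot, the price does not cover the hardware.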

Savings in inference service costs mainly come from three areas: 1. Hardware improvements (increasing computing power per watt); 2. Model pruning and distillation; 3. Inference optimization.

The latter two lead to significant declines in model capability, which is why users reported "intelligence drops" a while after GPT-4 was released; the same happened with GPT-4o.

Similarly, because of the ultra-low-price ceiling and the general user perception that "AI costs have dropped significantly," cloud service providers launching DeepSeek API services are likely operating without profit if the underlying models are truly "native full-strength" V3 and R1. Large cloud providers might have slightly higher-end computing power, but for others, well...

However, in the current mixed bag of offerings, many are likely selling "distilled small model" services under the banner of "native full-strength." In any case, the vast majority of users find it hard to distinguish the truth.


Even accounting for download speeds, we can install the Ollama application on any PC (Windows or Mac) and, with a single command line "ollama run ****," achieve "edge deployment" in a very short time.

If you are already satisfied with the performance of 1.5B or 7B distilled models, congratulations.

But if your scenario requires the "native full-strength" version, I'm sorry: the threshold is roughly 12-16 NVIDIA H100/H800 or A100/A800 GPUs (essentially two servers, though I have seen 16-card single servers). The optimization and tuning required are very difficult; one machine is manageable, but as soon as you exceed one, the complexity multiplies.

Of course, in theory I could run the full version on five or six Mac Studios with Apple M2 Ultra chips (192GB unified memory each). At present, the model weights converted for Apple silicon in the Hugging Face community are still 4-bit quantized versions, not full-strength. I was too lazy to quantize them myself with llama.cpp; it is too time-consuming. I have successfully run the 4-bit quantized version across three machines. An output speed of 10 tokens/s after optimization is acceptable, but I doubt there are many people like me.
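The memory arithmetic behind these thresholds is simple. A weights-only estimate for a 671B-parameter model (KV cache and activations come on top, which is why real deployments need headroom beyond these numbers):

```python
# Weights-only memory footprint at different precisions.
# KV cache, activations, and framework overhead add more on top.

def weights_gb(total_params, bits_per_param):
    """Gigabytes needed just to hold the raw weights."""
    return total_params * bits_per_param / 8 / 1e9

for bits in (8, 4, 1.58):  # FP8, 4-bit, and 1.58-bit dynamic quantization
    print(f"{bits:>5} bits: ~{weights_gb(671e9, bits):,.0f} GB")
```

At FP8 the weights alone are ~671GB, hence the 12-16 × 80GB GPU threshold once cache and redundancy are included; at 4-bit, ~336GB fits across three 192GB Mac Studios; even at 1.58-bit, well over 100GB remains.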

For most people, edge deployment will likely be abandoned within a week of starting.


Explaining this issue is quite complex, so I'll try to keep the conclusions simple.

  1. V3 and R1 are two different models: V3 is the base model, and R1 is the reasoning model. First, as mentioned, we simply cannot deploy a "native DeepSeek" on edge hardware (the smallest dynamic quantization gets down to 1.58-bit yet still requires over 100GB of memory; try finding edge hardware with that). So we can only use the distilled small models, which are models fine-tuned on R1's outputs, not R1 itself.

  2. As for V3, the MoE architecture almost guarantees it cannot be miniaturized, so don't even think about it. Our base models for edge AI are still Qwen or LLaMA.

  3. Is distilled R1 meaningful? Yes, it can significantly improve the capability of the base model. Is there a cost? Yes: the output format becomes highly "rigid," wrapped in tags like <think></think>, and it loses much of the optimization space of "prompt engineering."

  4. Current edge-AI paths are converging on Agents. However, because of the output-format issue above, R1 and Agents are actually in tension. Some noticed this when OpenAI first launched o1; many optimization methods were proposed, and OpenAI itself made changes in o3 (Deep Research is evidence of this). But "hardening" the thinking inside the model limits the operational space of an Agent. To let an Agent perform freely, one must believe in the massive potential of the "base model." Frankly, many have thought about, or even tried, the R1 approach in the past, but whether "Reinforcement Learning" should follow a path similar to SFT or remain in the "pure inference-time" part of the Agent is still highly controversial.
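The format conflict in points 3 and 4 is concrete: an agent loop has to strip the reasoning preamble before it can parse a tool call. A minimal sketch, assuming R1-style <think> tags and a JSON tool-call convention (both are details of the serving setup I am assuming, not fixed by the model):

```python
import re

# R1-style reasoning models emit their chain of thought inside
# <think>...</think> before the final answer. An agent loop usually
# needs only the part after the tags, e.g. a structured tool call.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_reasoning(raw_output: str) -> str:
    """Drop the reasoning preamble, keep the actionable answer."""
    return THINK_BLOCK.sub("", raw_output).strip()

raw = (
    "<think>The user wants the weather; call the tool.</think>\n"
    '{"tool": "get_weather", "city": "Beijing"}'  # hypothetical tool call
)
print(strip_reasoning(raw))  # -> {"tool": "get_weather", "city": "Beijing"}
```

The deeper problem is not the parsing, which is trivial, but that the tokens inside the tags are generated, billed, and waited for on every turn, whether or not the Agent needs them.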


I was among the first to be optimistic about "Embodied AI." But those I've spoken with in detail know my consistent view: "intelligence" likely needs to be acquired through continuous feedback from "Reinforcement Learning" in the physical world; multi-modal data provides more information; using a base model as a "knowledge base" combined with specific scenarios to achieve personalized "intelligence" might be more important.

But this process is difficult and requires more time.

If we put a distilled R1 into a humanoid robot, technically, it would just be a smart speaker that talks "more like a human" because we would tolerate the latency.

If we put a distilled R1 into a "self-driving car," it would certainly enhance interaction in the smart cockpit—the principle is the same as a smart speaker. However, it "weakens" the control capability of the cockpit. For control, we need the model to initiate function calls (the foundation of an Agent). As mentioned, R1 fundamentally has reduced compatibility with Agents.

If we let a distilled R1 go "<think>The brake lights in front are on, what should I do...</think>", I can't imagine the scene.

One more thing: is the distance between us and "Embodied AI" really just a distilled R1 that improves the performance of small Qwen and LLaMA models?

If so, I believe the model factories would already have put "Embodied AI" into practical use on top of the base versions of Qwen and LLaMA.


Clearly, GPT-4o is still a long, long way from the "AGI" we desire. In the field of AI research, the most urgent problem is clearly not creating a "GPT substitute" (this doesn't mean DeepSeek is useless—it's very significant, both technically and in terms of emotional value).

Our most urgent questions are: When will the next-generation model arrive? Where is the research direction? How do we implement capabilities such as "persistent memory" and "hypothesis testing"?

Exploration always requires massive costs.


The time left for models like GPT, Gemini, and Claude—even Grok and LLaMA—has suddenly decreased, and the sense of urgency has spiked.


Yes, everyone's strategies will surely adjust—in fact, they may have started adjusting as early as the second half of last year.

Those who understand AI best are definitely those at the frontier.

However, at this moment, there seem to be many forks in the road, and choosing which to experiment with first matters. If I were back in the position, years ago, of deciding how to allocate a team's computing resources, I would think about the data center like this:

  1. Naturally, the more computing power a single location can accommodate, the better;

  2. If a single cluster cannot be significantly expanded in a short time due to technical limitations, then break the computing tasks down first;

  3. Once new equipment arrives, spend a month or two testing various engineering parameters: power consumption, stability, hardware failure rates, compatibility between algorithms and hardware architecture, scaling curves, and so on;

  4. Then calculate an approximate operation plan, reserve sufficient hardware redundancy, and pin down the delivery times of core component suppliers...

Then, power on all the equipment and begin an exciting journey into the unknown amidst the deafening roar.

Every long-term training session is a gamble.


I have always been bullish on applications. I am bullish on overseas cloud services and overseas SaaS, but I am not bullish on C-end (consumer) applications. I have long held this belief: companies without proprietary models and scenarios will find it difficult to create C-end applications that generate sustainable commercial value.

We talk so much about "Equality for [X]"; why don't we talk about "Equality of Applications"? Why don't we talk about how this kind of "equality" implies a massive shrinkage of potential commercial value?

Disclaimer: This is a purely technical discussion. All materials were prepared during my personal time. What I pursue is "Information Equality."
