Why I'm More Optimistic About Heterogeneous Computing Power

A few days ago, Alex told me that their Exolabs cluster-inference project was ready for beta testing. Due to time constraints, I couldn't give immediate feedback, but seeing the project officially open-sourced a few days later still left me excited. Wonderful things really are arriving faster and faster; amid the mental fatigue of daily life, the dominant theme is still one of "happiness."

In fact, this is not just a cluster inference project, but a heterogeneous computing power inference project.

Project Address: https://github.com/exo-explore/exo

The underlying project supports MLX (Apple's training and inference framework for Apple silicon), llama.cpp, and the recently launched tinygrad (which in turn supports CUDA, ROCm for AMD, Metal for Apple's own chips, Intel hardware, and more). This is therefore, first and foremost, a project that supports almost all mainstream inference hardware.

Then, through cluster scheduling, it achieves joint inference across multiple devices. (A typical example: in a previous article I tested three M1 Mac minis, which together were equivalent to running a 22B model; recently I have also run other devices, including the strangest combination, an Intel-CPU + AMD-GPU Apple machine.) Simply put, as in Alex's example, you can run joint inference across a Mac laptop, an iPhone, and an iPad.
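The scheduling idea above can be sketched in a few lines. This is purely an illustrative sketch, not exo's actual API or algorithm: it splits a model's layers across devices in proportion to each device's available memory, which is the basic principle behind this kind of cluster partitioning (all device names and numbers below are hypothetical).

```python
# Illustrative sketch (NOT exo's real API): assign contiguous layer ranges
# to heterogeneous devices in proportion to each device's available memory.
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    memory_gb: float  # memory available for holding model weights

def partition_layers(devices: list[Device], n_layers: int) -> dict[str, range]:
    """Shard n_layers across devices proportionally to memory."""
    total_mem = sum(d.memory_gb for d in devices)
    shards, start = {}, 0
    for i, d in enumerate(devices):
        if i == len(devices) - 1:
            count = n_layers - start  # last device takes the remainder
        else:
            count = round(n_layers * d.memory_gb / total_mem)
        shards[d.name] = range(start, start + count)
        start += count
    return shards

# Hypothetical cluster: a laptop, a tablet, and a phone
cluster = [Device("MacBook", 16), Device("iPad", 8), Device("iPhone", 4)]
print(partition_layers(cluster, 28))
```

During generation, each device would then run only its own layer range and forward the (comparatively small) activations to the next device, which is why inter-device network traffic stays far below intra-device memory traffic.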

Of course, you can also add Android phones, NVIDIA GPUs, AMD GPUs, and so on. This is heterogeneous computing. (llama.cpp can actually run inference on these devices too, but it doesn't support Apple's MLX framework, so its inference performance on Apple devices is poor; exo supports MLX directly, which significantly improves performance on Apple hardware.)

So here comes the key point: Why am I optimistic about the application of heterogeneous computing power for inference, and why have several open-source projects recently tried to solve this problem?

I'll just cover the key points. The full story involves technical details about why memory and network bandwidth matter for inference (I'll think about how to publish something more concise and intuitive later), so what follows are qualitative conclusions.

  1. Undoubtedly, NVIDIA's GPUs, whether the Hopper series (H100, H200, etc.) or the soon-to-ship Blackwell series (B100, B200, GB200), remain the best solution in terms of performance, ecosystem, compatibility, and even total cost. Roughly speaking, above 10,000 daily active users (meaning peak concurrency might exceed 1,000), NVIDIA's solutions will be optimal. (AMD's MI300/MI350 and Intel's Gaudi 2/3 may offer slightly better price-performance, but once life-cycle costs are considered they don't have much of an advantage.)

  2. However, at this scale, while NVIDIA GPU servers can still achieve a low cost per token, they carry high hidden maintenance costs: the server environment (high hosting fees in a data center; cooling, noise, and power issues if deployed on premises), utilization rates, repair and replacement after failures, and staffing for maintenance are all critical factors to weigh.

  3. For many small enterprises and individual users, the hidden costs above are actually extremely high, and most scenarios do not require such high inference speeds. On the contrary, small companies and enthusiasts have plenty of redundant devices of all kinds, and putting idle equipment to work is a very "cool" approach. Moreover, extensive testing shows that idle equipment performs reasonably well: in Alex's test results in the video above, a Llama3-8B model running across two Macs, an iPad, and two iPhones clearly exceeds 10 tokens/s, which is already very practical.

  4. Inference is essentially a process of large-scale data transfer. Large memory and high memory bandwidth are needed because memory is far faster than disk (even SSDs). Models must be loaded into memory: the larger the memory, the larger the model it can hold, and the higher the memory bandwidth, the faster the inference. In fact, because GPU compute throughput far outpaces memory speed, most tests show that inference performance across different hardware is almost entirely determined by memory bandwidth. As for network interconnect speed, it improves data transfer between devices; but even in cluster inference, the data moved between devices is far smaller than the data moved within a single device's memory. Network speed therefore matters, but memory bandwidth remains paramount.
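A back-of-envelope sketch of why decoding is memory-bandwidth-bound: during single-stream decoding, roughly all model weights must be read from memory once per generated token, so bandwidth divided by model size gives an upper bound on tokens/s. The function below is my own illustration of that rule of thumb, not anything from the project:

```python
# Rule of thumb: every weight is read from memory once per decoded token,
# so decode speed is bounded above by bandwidth / model size in bytes.

def max_tokens_per_s(params_b: float, bytes_per_param: float, bandwidth_gbs: float) -> float:
    """Rough ceiling on single-stream decode tokens/s (bandwidth-bound)."""
    model_gb = params_b * bytes_per_param  # params in billions -> GB of weights
    return bandwidth_gbs / model_gb

# 8B model at fp16 (16 GB of weights):
print(max_tokens_per_s(8, 2, 800))   # ~800 GB/s (M2 Ultra class) -> 50.0 tok/s ceiling
print(max_tokens_per_s(8, 2, 100))   # ~100 GB/s (M1 class)       -> 6.25 tok/s ceiling
```

Real systems land below these ceilings (attention, KV-cache reads, and scheduling overhead all cost extra), but the ranking across hardware tracks memory bandwidth closely.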

  5. As is well known, NVIDIA's data center GPUs use the fastest HBM memory; the latest Blackwell generation reaches 8 TB/s of memory bandwidth. By comparison, the Apple M2 Ultra's memory bandwidth is 800 GB/s, a tenfold difference, so we can roughly assume a tenfold difference in inference performance. That sounds like a lot, but in local-inference scenarios model sizes are still limited and the performance is "good enough." Of course, if the Llama-3 400B model really is released next week with open weights, we'll see what challenges it poses to hardware. I will test it immediately, and I'm confident the results will likely be acceptable; things will only get better from here.
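Under the same bandwidth-bound assumption, here is rough arithmetic for a hypothetical 400B open-weight model quantized to 4 bits (all numbers are illustrative assumptions, not measurements):

```python
# Hypothetical back-of-envelope arithmetic for a 400B model at 4-bit.
params_b = 400
bytes_per_param = 0.5            # 4-bit quantization = half a byte per weight
weights_gb = params_b * bytes_per_param

print(weights_gb)                # 200.0 GB of weights: exceeds any single consumer device
print(800 / weights_gb)          # 4.0 tok/s decode ceiling at 800 GB/s (M2 Ultra class)
print(8000 / weights_gb)         # 40.0 tok/s ceiling at 8 TB/s (Blackwell-class HBM)
```

The same tenfold bandwidth gap reappears as a tenfold gap in the decode ceiling, and the 200 GB footprint is exactly why clustered heterogeneous devices become interesting for models this size.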

  6. In most scenarios, the models we use day to day are likely to be small-parameter models (under 10B is increasingly the mainstream). That means even the M1's 100 GB/s bandwidth (which I tested in a previous article) delivers acceptable performance, and edge devices like phones won't be bad either.

  7. Although ChatGPT and its peers have reached hundreds of millions of daily active users, and ultra-large enterprises can deploy privately, the biggest application scenario for AI is actually local inference. That implies greater demand for, and more choices among, hardware other than data-center GPUs (NVIDIA's Hopper/Blackwell, AMD's MI300/MI350, etc.). Support for heterogeneous computing power follows naturally: users may wish to switch seamlessly between different laptops, tablets, and even phones and other IoT devices without changing their application code.

  8. Expanding beyond small businesses and individuals, Apple's upcoming Apple Intelligence requires three model-serving paths: local inference on personal devices, Private Cloud Compute, and cloud inference via third-party models like ChatGPT. Among these, Private Cloud Compute is a crucial link. From what we know, Apple clearly does not intend to purchase third-party GPUs at scale to provide inference; instead, it will deploy its own M2 Ultra (or newer) chips at scale. If Apple can make this choice, it surely has confidence in it and has verified it thoroughly; in fact, a series of Apple open-source projects over the past year has been demonstrating the feasibility of this approach.

  9. Returning to the domestic environment in China, for well-known reasons, our overall computing power will remain in a state of extreme shortage. Even if certain domestic computing chips can be used on a large scale, the flourishing of domestic chips, and even their joint use with mature chips from various other sources, will be the most realistic path to accelerate technical iteration. China needs heterogeneous computing power more than anywhere else.

  10. Finally, there is one most important reason why I am optimistic about heterogeneity: I believe that just as the gap between models is narrowing, the gap between hardware will also narrow. More flexible choices can effectively and quickly reduce computing costs and will certainly accelerate technical iteration, driving better AI progress (perhaps better AI means not just smarter models, but also more energy-efficient models, models that are friendlier to humans, etc.).

Heterogeneous computing power, along with open-source models (open weights count, barely), are perhaps the two most important fundamental ecosystems for a better AI.
