Meta Releases LLaMa 3 Training Infrastructure: Two Massive H100 Clusters
Last night, Meta released details of the infrastructure used to train LLaMa 3: two massive clusters, each containing 24,576 H100 GPUs.
Original link: https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/
Brief evaluation:
- Largest Scale: This is the largest single compute cluster publicly disclosed to date. Not only does the GPU count exceed that of ByteDance's previously discussed 10,000-card cluster (mostly A100/A800), but the GPUs are all H100s; the theoretical compute is at least 20 times that of ByteDance's 10,000-card cluster;
- Open Commitment: The article repeatedly stresses Meta's commitment and contribution to open AI, so there is a >95% probability that LLaMa-3 will again be open-source (released checkpoints, i.e., model weights);
- Multimodal Potential: The cluster's ultra-high throughput can train thousands of models in parallel and fully exploit Meta's data advantages in text, images, video, etc.; there is a >90% probability that LLaMa-3 will be multimodal;
- Ready to Launch: LLaMa-3 is imminent.
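As a rough sanity check on the "at least 20 times" figure: the multiple depends on which precision you compare and whether you count one cluster or both. A back-of-envelope sketch; the per-card TFLOPS numbers below are my assumptions taken from NVIDIA's published datasheets, not from the article:

```python
# Back-of-envelope comparison of aggregate theoretical throughput.
# Assumed peak dense (no-sparsity) figures from NVIDIA datasheets:
#   H100 SXM: ~989 TFLOPS BF16, ~1979 TFLOPS FP8; A100: ~312 TFLOPS BF16.
H100_BF16, H100_FP8 = 989, 1979   # TFLOPS per card
A100_BF16 = 312                   # TFLOPS per card

meta_cluster = 24_576             # H100s per cluster (Meta built two)
bytedance_cluster = 10_000        # mostly A100/A800 cards

# Same precision (BF16), one Meta cluster vs. the 10k-card cluster:
bf16_ratio = meta_cluster * H100_BF16 / (bytedance_cluster * A100_BF16)

# H100 FP8 vs. A100 BF16 -- the comparison most favorable to H100:
fp8_ratio = meta_cluster * H100_FP8 / (bytedance_cluster * A100_BF16)

print(f"single cluster, BF16 vs BF16: {bf16_ratio:.1f}x")    # ~7.8x
print(f"single cluster, FP8 vs BF16:  {fp8_ratio:.1f}x")     # ~15.6x
print(f"both clusters,  FP8 vs BF16:  {2 * fp8_ratio:.1f}x") # ~31.2x
```

So "at least 20x" is plausible if you count both clusters and H100's FP8 path; a single cluster at matched precision is closer to 8x.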
The above is manually entered content; the following is AI-generated content.
For more information, please refer to the summary generated by Gemini 1.0 Ultra based on the original English text:
Meta Develops GenAI Infrastructure
Authors: Kevin Lee, Adi Gangidi, Mathew Oldham
Meta announced the launch of two clusters consisting of 24k GPUs each, reflecting Meta's massive investment in the future of AI. The article shares details on hardware, networking, storage, design, performance, and software to achieve high throughput and reliability for AI applications. The new clusters are currently being used for Llama 3 training.
Meta is committed to open computing and open innovation. Built on Grand Teton, OpenRack, and PyTorch, these clusters will continue to drive open innovation across the industry.
This move is a major milestone in Meta’s ambitious infrastructure roadmap. By the end of 2024, Meta aims to continue expanding its infrastructure, which will include 350,000 NVIDIA H100 GPUs, providing compute power equivalent to nearly 600,000 H100s.
Meta's Large AI Clusters
Meta's long-term vision is to build open and responsible Artificial General Intelligence (AGI). As it works toward that vision, Meta is scaling its clusters to support this goal.
Meta has a long history of building AI infrastructure. As early as 2022, Meta announced details of its first AI Research SuperCluster (RSC), which contained 16,000 NVIDIA A100 GPUs.
Technical Details
The new AI clusters build on the lessons from RSC, focusing on end-to-end AI systems as well as researcher and developer experience and efficiency. Featuring 24,576 NVIDIA Tensor Core H100 GPUs combined with high-efficiency networking and carefully designed storage solutions, the clusters can support larger and more complex models than RSC, paving the way for advancements in GenAI product development and AI research.
Innovations in Networking, Compute, and Storage
- Networking: Meta built one cluster on RDMA over Converged Ethernet (RoCE) and the other on NVIDIA Quantum-2 InfiniBand, allowing it to evaluate both fabrics for large-scale model training.
- Compute: The clusters use the Grand Teton hardware platform, which is designed by Meta and is OCP-compliant.
- Storage: Meta uses its home-grown Linux FUSE (Filesystem in Userspace) API backed by the flash-optimized 'Tectonic' distributed storage solution to meet the data and checkpointing needs of large-scale AI clusters. Additionally, Meta partnered with Hammerspace to deploy parallel NFS to satisfy developer experience requirements.
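Checkpointing is the storage workload the bullet above alludes to: thousands of GPUs periodically burst-write model state, and a checkpoint must never be left half-written if a job is preempted. A minimal sketch of the standard atomic-write pattern (write to a temp file, then rename) in plain Python; the file name and helper are illustrative, not Meta's actual API:

```python
import os
import pickle
import tempfile

def save_checkpoint_atomic(state: dict, path: str) -> None:
    """Write a checkpoint so readers never observe a partial file.

    The temp file is created in the same directory, so os.replace()
    is an atomic rename on POSIX filesystems.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())      # force data to stable storage
        os.replace(tmp_path, path)    # atomic rename over the target
    except BaseException:
        os.unlink(tmp_path)
        raise

# Illustrative "model state" standing in for real weights/optimizer state.
state = {"step": 1000, "weights": [0.1, 0.2, 0.3]}
save_checkpoint_atomic(state, "ckpt.pkl")

with open("ckpt.pkl", "rb") as f:
    restored = pickle.load(f)
print(restored["step"])  # 1000
```

At cluster scale the same idea applies per rank, with the rename happening on the distributed filesystem rather than a local disk.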
Performance Optimization
Meta has meticulously optimized the performance of its large AI clusters. By tuning its job scheduler and routing strategy, and by working with NVIDIA to improve the NCCL library, it has significantly increased network utilization at scale. Furthermore, Meta has implemented several optimizations around the H100's FP8 support, large-scale parallelization techniques, and checkpointing, while continuously improving the scalability of the PyTorch framework.
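The FP8 work mentioned above generally relies on per-tensor scaling: values are rescaled so their magnitudes fit within FP8's narrow dynamic range (E4M3's largest finite value is 448) before the cast, then unscaled afterwards. A simplified pure-Python illustration of the scaling step only; real recipes use delayed/dynamic scaling inside libraries such as Transformer Engine, and this sketch is mine, not Meta's code:

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_scale_factor(values):
    """Per-tensor scale mapping the largest magnitude onto E4M3_MAX."""
    amax = max(abs(v) for v in values)
    return amax / E4M3_MAX if amax > 0 else 1.0

def to_fp8_range(values):
    """Scale values into FP8's representable range (the cast is omitted)."""
    scale = fp8_scale_factor(values)
    return [v / scale for v in values], scale

def from_fp8_range(scaled, scale):
    """Undo the scaling after the low-precision compute."""
    return [v * scale for v in scaled]

activations = [1200.0, -0.5, 3.75, 900.0]
scaled, scale = to_fp8_range(activations)
# The scaled tensor now fits FP8's range (tiny float tolerance).
assert max(abs(v) for v in scaled) <= E4M3_MAX + 1e-9
roundtrip = from_fp8_range(scaled, scale)
```

The trade-off is that one outlier dominates the scale and crushes small values toward zero, which is why production recipes track amax statistics over time rather than per step.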
Commitment to Open Innovation
Meta remains committed to open innovation in both AI hardware and software. Meta is a founding member of OCP, actively contributing designs such as Grand Teton and Open Rack; it is also the largest contributor to the PyTorch framework.
Meta is also dedicated to open innovation in AI research, launching the Open Innovation AI Research Community program and founding the AI Alliance.
The Future of Infrastructure
By the end of 2024, Meta plans to deploy a total of 350,000 NVIDIA H100 GPUs, providing computing power equivalent to nearly 600,000 H100s. Meta will continue to improve all aspects of its infrastructure to flexibly and reliably support rapidly evolving new models and research needs.