ZAYA1 AI Model Achieves Training Milestone Using AMD GPUs

Introduction to ZAYA1 and AMD’s AI Training Breakthrough

After a year-long collaboration, Zyphra, AMD, and IBM have shown that AMD’s GPU platform can support large-scale AI model training by producing ZAYA1. The model is a significant milestone: it is the first major Mixture-of-Experts (MoE) foundation model built exclusively on AMD GPUs and networking technology.

This achievement challenges NVIDIA’s dominance of AI training infrastructure and offers enterprises an alternative that does not compromise on performance or scalability.

Technical Setup and Hardware Utilization

ZAYA1 was trained using AMD’s Instinct MI300X GPUs, Pensando networking components, and the ROCm software stack, all deployed on IBM Cloud infrastructure. Rather than relying on an experimental or heavily customized setup, Zyphra used a conventional enterprise cluster architecture with no NVIDIA components.

Each MI300X GPU provides 192 GB of high-bandwidth memory, which allowed more flexible early training runs that relied less on heavy parallelism. Each node consists of eight MI300X GPUs interconnected via InfinityFabric, and each GPU has its own Pollara network card to keep communication efficient. A dedicated network handles dataset access and checkpointing, which simplifies the overall system design and reduces operational costs.
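As a rough back-of-the-envelope check on why 192 GB per GPU eases the need for parallelism, the sketch below estimates the training-state footprint of an 8.3-billion-parameter model. The bytes-per-parameter figures (bf16 weights and gradients, fp32 master weights, two fp32 optimizer moments) are a common mixed-precision assumption rather than numbers reported by Zyphra, and activation memory is ignored.

```python
# Rough memory estimate. The bytes-per-parameter layout is an assumption,
# not a figure reported for ZAYA1; activations are not counted.
TOTAL_PARAMS = 8.3e9      # total parameters (from the article)
HBM_PER_GPU_GB = 192      # MI300X memory per GPU (from the article)
GPUS_PER_NODE = 8         # GPUs per node (from the article)

# Assumed mixed-precision layout with an Adam-style optimizer.
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 optimizer moments (two)": 8,
}


def training_state_gb(num_params: float) -> float:
    """Weights + gradients + optimizer state, in GB (1 GB = 1e9 bytes)."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


state_gb = training_state_gb(TOTAL_PARAMS)
node_hbm_gb = HBM_PER_GPU_GB * GPUS_PER_NODE

print(f"estimated training state: {state_gb:.1f} GB")
print(f"fits in one GPU's memory: {state_gb < HBM_PER_GPU_GB}")
print(f"memory per node: {node_hbm_gb} GB")
```

Under these assumptions the full training state fits within a single GPU's memory, which is consistent with the article's point that early runs did not need aggressive sharding.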

Training Performance and Model Architecture

ZAYA1-base activates 760 million of its 8.3 billion total parameters and was trained on 12 trillion tokens in three phases. The architecture combines compressed attention, a token router that directs each input to the appropriate experts, and refined residual scaling that keeps deeper layers stable.
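The article does not describe how ZAYA1's router works internally, so the following is only a generic illustration of token routing in an MoE layer: a plain top-k softmax gate, which is a widely used scheme and not necessarily the one Zyphra built. The logits and the `route_top_k` helper are hypothetical. The last lines compute the active-parameter fraction from the figures above, about 9.2%.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=1):
    """Pick the k highest-scoring experts for one token.

    Returns [(expert_index, weight)] with weights renormalised to sum to 1.
    Generic top-k gating, used here as a stand-in for ZAYA1's router.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# Hypothetical router logits for one token across four experts.
choices = route_top_k([0.1, 2.0, -1.0, 1.5], k=2)
print(choices)

# Share of the model that is active, using the article's figures.
active_fraction = 760e6 / 8.3e9
print(f"active parameters: {active_fraction:.1%}")
```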

Training used a combination of the Muon and AdamW optimizers, along with extensive kernel fusion and memory-traffic optimizations to maximize efficiency on AMD hardware. High-throughput storage pipelines kept data flowing smoothly as batch sizes increased.

ZAYA1 is competitive with leading open models such as Qwen3-4B, Gemma3-12B, Llama-3-8B, and OLMoE. Because the MoE design activates only a subset of the model at any given time, it also reduces the memory traffic and compute spent per token at inference, which lowers serving costs.

Adapting ROCm and Infrastructure for Optimal Training

Transitioning from an NVIDIA-based workflow to AMD’s ROCm platform required Zyphra to tailor model parameters, GEMM operations, and microbatch sizes to match MI300X’s hardware characteristics. Network performance was optimized by aligning collective operations with InfinityFabric capabilities and tuning message sizes for peak throughput on Pollara networking.

To handle long-context training sequences (ranging from 4k to 32k tokens), the team employed ring attention for sharded sequences and tree attention during decoding to prevent bottlenecks.
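To make the idea behind ring attention concrete, here is a minimal single-process sketch of its core accumulation step. In a real deployment each device holds one shard of the sequence and key/value blocks are passed around a ring of devices; here the "ring" is just a loop over shards, there is a single query vector, and no causal mask. The function names and toy data are illustrative assumptions, not Zyphra's implementation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference result: softmax(q . k) weighted sum of values for one query."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def ring_attention(q, kv_shards):
    """Consume key/value shards one at a time, as if each arrived from the
    ring neighbour, keeping a running max, normaliser and unnormalised output
    (the online-softmax trick) so no shard has to be revisited."""
    running_max = -math.inf
    normaliser = 0.0
    acc = [0.0] * len(kv_shards[0][1][0])
    for keys, values in kv_shards:
        scores = [dot(q, k) for k in keys]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # rescale earlier partial sums
        normaliser *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values):
            w = math.exp(s - new_max)
            normaliser += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / normaliser for a in acc]


# Toy data: four key/value pairs split into two shards.
q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
shards = [(keys[:2], values[:2]), (keys[2:], values[2:])]

reference = full_attention(q, keys, values)
sharded = ring_attention(q, shards)
print(reference, sharded)
```

The sharded result matches the unsharded reference up to floating-point rounding, which is what lets each device work on one block at a time without ever materializing the full attention matrix.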

The storage design balanced the high IOPS demands of smaller models against the sustained bandwidth needed for larger datasets. Dataset shards were bundled to minimize scattered reads, and per-node page caches were enlarged to speed up checkpoint recovery during extended training sessions.
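One simple way to picture shard bundling is a greedy pass that groups consecutive small shards until a target bundle size is reached, so that a reader issues a few large sequential reads instead of many small scattered ones. The function name, shard names, and sizes below are made up for illustration; the article does not say how Zyphra chose bundle boundaries.

```python
def bundle_shards(shard_sizes, target_bytes):
    """Greedily group consecutive shards into bundles of at most target_bytes.

    shard_sizes: list of (name, size_in_bytes) in dataset order.
    Returns a list of bundles, each a list of shard names. A single shard
    larger than target_bytes becomes its own bundle.
    """
    bundles, current, current_size = [], [], 0
    for name, size in shard_sizes:
        # Close the current bundle if adding this shard would overflow it.
        if current and current_size + size > target_bytes:
            bundles.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bundles.append(current)
    return bundles


# Hypothetical shard sizes (arbitrary units).
shards = [("s0", 40), ("s1", 30), ("s2", 50), ("s3", 20), ("s4", 90)]
bundles = bundle_shards(shards, target_bytes=100)
print(bundles)
```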

Cluster Reliability and Fault Tolerance

Zyphra implemented the Aegis monitoring service to oversee system logs and metrics, automatically detecting and correcting failures such as network interface card glitches or ECC errors. Increasing RCCL timeouts helped prevent transient network interruptions from causing full job failures.
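The article gives no details of how Aegis decides what to do, so the sketch below is only a hypothetical illustration of the general pattern: count failure events per node and map each kind of failure to a remediation once it crosses a threshold. The event kinds, action names, and thresholds are invented for this example and are not Aegis's real schema.

```python
from collections import Counter

# Hypothetical policy: failure kind -> (action, occurrences before acting).
POLICY = {
    "nic_glitch": ("reset_nic", 1),   # act on the first occurrence
    "ecc_error": ("drain_node", 3),   # act only after repeated errors
}


def plan_actions(events):
    """events: iterable of (node, kind) tuples taken from logs or metrics.

    Returns a sorted list of (node, action) pairs for every node whose
    event count for some kind has reached that kind's threshold.
    """
    counts = Counter(events)
    actions = set()
    for (node, kind), n in counts.items():
        if kind in POLICY:
            action, threshold = POLICY[kind]
            if n >= threshold:
                actions.add((node, action))
    return sorted(actions)


# Hypothetical event stream.
events = [
    ("node-a", "ecc_error"),
    ("node-b", "nic_glitch"),
    ("node-a", "ecc_error"),
    ("node-a", "ecc_error"),
    ("node-c", "ecc_error"),
]
actions = plan_actions(events)
print(actions)
```

Thresholds like these reflect the same trade-off as the longer RCCL timeouts: react quickly to persistent faults, but avoid tearing down a job over a one-off blip.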

Checkpointing is distributed across GPUs to avoid single points of failure, achieving over a tenfold improvement in save speeds compared to traditional methods. This approach enhances cluster uptime and reduces operator workload.
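A minimal sketch of the distributed-checkpoint idea, assuming each worker saves only its own slice of the state in parallel and a restore reads all slices back. Threads stand in for GPUs or ranks, JSON stands in for a real tensor format, and the file naming is an assumption; none of this is Zyphra's actual on-disk layout.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def save_shard(directory, rank, state):
    """Write one worker's state to its own file.

    The data is written to a temporary file first and then renamed, so a
    crash mid-write never leaves a half-written shard under the final name.
    """
    path = Path(directory) / f"shard_{rank:05d}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)
    return path


def save_checkpoint(directory, states):
    """Save all shards concurrently, one task per worker."""
    with ThreadPoolExecutor(max_workers=len(states)) as pool:
        futures = [pool.submit(save_shard, directory, rank, state)
                   for rank, state in enumerate(states)]
        return [f.result() for f in futures]


def load_checkpoint(directory):
    """Read shards back in rank order."""
    paths = sorted(Path(directory).glob("shard_*.json"))
    return [json.loads(p.read_text()) for p in paths]


with tempfile.TemporaryDirectory() as ckpt_dir:
    # Hypothetical per-worker state.
    states = [{"rank": r, "weights": [r * 0.1, r * 0.2]} for r in range(4)]
    written = save_checkpoint(ckpt_dir, states)
    restored = load_checkpoint(ckpt_dir)

print(len(written), restored == states)
```

Because every worker writes independently, save time scales with the size of one shard rather than the whole model, and no single writer becomes a bottleneck or single point of failure.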

Implications for AI Infrastructure and Procurement

The successful training of ZAYA1 on AMD technology demonstrates that AMD’s ecosystem, including InfinityFabric (an NVLink alternative), RCCL (an NCCL alternative), and hipBLASLt (a cuBLASLt alternative), has matured sufficiently to support large-scale AI development.

Rather than advocating for a complete overhaul of existing NVIDIA clusters, Zyphra suggests a hybrid approach where NVIDIA hardware continues to serve production workloads, while AMD platforms handle training phases benefiting from larger memory capacity and open software stacks. This diversification mitigates supply risks and increases overall training throughput without major disruptions.

Key recommendations emerging from this project include treating model architecture as flexible, designing networks around actual collective operations, building fault tolerance focused on preserving GPU compute hours, and modernizing checkpointing to maintain training momentum.

Conclusion

The ZAYA1 training milestone sets a promising precedent for organizations seeking viable alternatives to NVIDIA for AI infrastructure. By showcasing competitive performance and practical deployment strategies, Zyphra, AMD, and IBM provide a valuable blueprint for expanding AI capacity with diversified hardware ecosystems.

Source: see original article

Chrono

Chrono is the curious little reporter behind AI Chronicle — a compact, hyper-efficient robot designed to scan the digital world for the latest breakthroughs in artificial intelligence. Chrono’s mission is simple: find the truth, simplify the complex, and deliver daily AI news that anyone can understand.
