ZAYA1: A Breakthrough in AI Training Using AMD GPUs
After a year of testing, Zyphra, AMD, and IBM have trained ZAYA1, a large-scale Mixture-of-Experts (MoE) foundation model, entirely on AMD GPUs and networking. The project shows that AMD hardware can support advanced AI training workloads and offers a viable alternative to the dominant NVIDIA ecosystem.
Collaborative Effort and Technical Setup
The ZAYA1 model was developed using AMD’s Instinct MI300X GPUs, Pensando networking devices, and ROCm software, all deployed on IBM Cloud infrastructure. Rather than relying on an experimental or niche hardware configuration, Zyphra adopted a conventional enterprise cluster design, with AMD components in place of their NVIDIA equivalents. This approach underscores the maturity of AMD’s AI infrastructure and its readiness for production-scale applications.
Performance and Efficiency
ZAYA1 matches or exceeds leading open models on reasoning, mathematics, and coding benchmarks. The result matters most for organizations facing GPU supply shortages or escalating costs, because it points to a second hardware option that does not sacrifice training quality or efficiency.
Innovative Use of AMD Hardware to Optimize Costs
Key to the project’s success is the MI300X GPU, which carries 192 GB of high-bandwidth memory. That capacity allows more flexible training runs and reduces the need for complex parallelism early in development. Zyphra built each node with eight MI300X GPUs interconnected via Infinity Fabric and paired each GPU with a dedicated Pollara network card. Data handling runs on a separate network, which keeps iteration times stable and lowers infrastructure costs.
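The memory argument can be made concrete with a back-of-the-envelope budget. The sketch below is only an illustration: the parameter count, the 192 GB figure, and the eight GPUs per node come from the article, while the per-parameter byte counts assume a generic mixed-precision setup with an Adam-style optimizer, which may not match Zyphra’s actual configuration. Activations are ignored.

```python
# Back-of-the-envelope training memory budget.
# Byte counts per parameter are assumptions, not Zyphra's published setup.
GB = 1e9  # decimal gigabytes, matching the 192 GB figure

TOTAL_PARAMS = 8.3e9   # ZAYA1-base total parameters (from the article)
HBM_PER_GPU_GB = 192   # MI300X memory per GPU (from the article)
GPUS_PER_NODE = 8      # GPUs per node (from the article)

# Assumed layout: bf16 weights and gradients, fp32 master copy and moments.
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 first moment": 4,
    "fp32 second moment": 4,
}


def training_state_gb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Memory in GB for weights, gradients, and optimizer state."""
    return num_params * sum(bytes_per_param.values()) / GB


state_gb = training_state_gb(TOTAL_PARAMS)
node_hbm_gb = HBM_PER_GPU_GB * GPUS_PER_NODE

print(f"training state: {state_gb:.1f} GB")
print(f"fits in one GPU's HBM before activations: {state_gb < HBM_PER_GPU_GB}")
print(f"HBM per node: {node_hbm_gb} GB")
```

Under these assumptions the full training state is about 133 GB, below a single GPU’s 192 GB, which is one way to read the claim that large memory postpones the need for elaborate parallelism.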
ZAYA1 Model Architecture and Training Details
ZAYA1-base activates 760 million of its 8.3 billion total parameters per token and was trained on 12 trillion tokens across three stages. The architecture incorporates compressed attention, a refined router that steers tokens to specialized experts, and adjusted residual scaling to keep deeper layers stable. Training combined the Muon and AdamW optimizers, and Zyphra fused kernels and reduced memory traffic specifically for AMD hardware to shorten each iteration.
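Routing tokens to experts can be illustrated with a generic top-k softmax router. This is a minimal sketch of the general Mixture-of-Experts technique, not ZAYA1’s router; the function names, the toy experts, and the logits are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, experts, router_logits, k=1):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    out = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Toy experts: each one just scales the token by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
logits = [0.1, 2.0, 0.3, 1.0]  # hypothetical router scores for this token

selected = route_top_k(logits, k=2)
top1_output = moe_layer(token, experts, logits, k=1)
print("selected experts:", [i for i, _ in selected])
print("top-1 output:", top1_output)
```

Only the selected experts are evaluated for a given token, which is why the active parameter count can be far smaller than the total.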
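This architecture enables ZAYA1 to compete with established open models such as Qwen3-4B, Gemma3-12B, Llama-3-8B, and OLMoE, several of which use far more active parameters per token. It also brings the usual Mixture-of-Experts (MoE) advantages: because only a fraction of the parameters is active for each token, inference memory traffic is easier to manage and serving costs are lower.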
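The sparsity advantage comes down to simple arithmetic. The sketch below uses the article’s parameter counts; the dense model sizes are nominal values read off the model names, and the estimate of roughly 2 FLOPs per active parameter per token is a common rule of thumb that ignores attention cost, so the ratios are indicative only.

```python
TOTAL_PARAMS = 8.3e9    # ZAYA1-base total parameters (from the article)
ACTIVE_PARAMS = 760e6   # parameters active per token (from the article)

# Nominal sizes taken from the model names; an assumption, not measured values.
NOMINAL_DENSE_PARAMS = {"Qwen3-4B": 4e9, "Llama-3-8B": 8e9, "Gemma3-12B": 12e9}


def forward_flops_per_token(active_params):
    """Rule-of-thumb forward cost: about 2 FLOPs per active parameter."""
    return 2 * active_params


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active fraction: {active_fraction:.1%}")

zaya_flops = forward_flops_per_token(ACTIVE_PARAMS)
ratios = {
    name: forward_flops_per_token(params) / zaya_flops
    for name, params in NOMINAL_DENSE_PARAMS.items()
}
for name, ratio in ratios.items():
    print(f"{name}: ~{ratio:.1f}x the per-token forward compute")
```

Note that all 8.3 billion weights still have to be stored for serving; the saving is in the compute and weight reads needed per token.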
Overcoming Software and Infrastructure Challenges
Transitioning from NVIDIA’s CUDA ecosystem to AMD’s ROCm required extensive tuning. Zyphra measured how the hardware actually behaved and adjusted model dimensions, GEMM patterns, and batch sizes to suit the MI300X. Communication was tuned so that both Infinity Fabric within a node and the Pollara network between nodes run close to peak throughput. Storage was also refined to balance IOPS demands against sustained bandwidth, both of which are critical for long training runs and checkpoint recovery.
Reliability and Fault Tolerance
Zyphra implemented Aegis, a monitoring service that automatically detects failures such as network glitches or hardware errors and takes corrective action. Checkpoint saves were distributed across all GPUs, which made them faster, improved uptime, and reduced manual intervention during extended training runs.
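The idea of spreading checkpoint saves across workers can be sketched in a few lines. This is a hypothetical illustration, not Zyphra’s implementation: threads stand in for GPU ranks, JSON files stand in for tensor shards, and every function name here is invented for the example.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def shard_state(state, num_ranks):
    """Split a flat state dict into num_ranks shards, round-robin by sorted key."""
    shards = [{} for _ in range(num_ranks)]
    for i, key in enumerate(sorted(state)):
        shards[i % num_ranks][key] = state[key]
    return shards


def save_shard(directory, rank, shard):
    """Write one shard to a temp file, then rename, so no partial file is visible."""
    final = Path(directory) / f"shard_{rank:05d}.json"
    tmp = final.with_suffix(".tmp")
    tmp.write_text(json.dumps(shard))
    tmp.replace(final)
    return final


def save_checkpoint(directory, state, num_ranks):
    """Each simulated rank writes only its own shard, all in parallel."""
    shards = shard_state(state, num_ranks)
    with ThreadPoolExecutor(max_workers=num_ranks) as pool:
        futures = [
            pool.submit(save_shard, directory, rank, shard)
            for rank, shard in enumerate(shards)
        ]
        return [f.result() for f in futures]


def load_checkpoint(directory):
    """Merge all shards back into one state dict."""
    state = {}
    for path in sorted(Path(directory).glob("shard_*.json")):
        state.update(json.loads(path.read_text()))
    return state


# Toy model state: 16 named tensors, saved by 8 simulated ranks.
state = {f"layer{i}.weight": [float(i)] * 4 for i in range(16)}
with tempfile.TemporaryDirectory() as ckpt_dir:
    paths = save_checkpoint(ckpt_dir, state, num_ranks=8)
    restored = load_checkpoint(ckpt_dir)

print(f"wrote {len(paths)} shards; round trip ok: {restored == state}")
```

Because each writer handles only a slice of the state, save time shrinks roughly with the number of writers until storage bandwidth becomes the limit, and the write-then-rename step keeps an interrupted save from corrupting the previous checkpoint.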
Implications for AI Infrastructure Procurement
The success of ZAYA1 illustrates that AMD’s GPU and software stack has matured sufficiently for serious AI development, offering enterprises a strategic option beyond NVIDIA. While existing NVIDIA clusters remain valuable for production workloads, AMD-powered clusters can complement them during memory-intensive training phases, diversifying supply chains and increasing overall training capacity without disrupting existing operations.
Zyphra’s findings point to four practices: treat model configurations as adaptable, design networks around the communication patterns that training actually produces, build robust fault tolerance, and modernize checkpointing to maintain training momentum.
For organizations seeking to expand AI capabilities with reduced vendor dependency, the ZAYA1 project offers a practical blueprint demonstrating AMD’s viability in large-scale AI training.