GPU-accelerated servers have revolutionized high-performance computing (HPC) and artificial intelligence/machine learning (AI/ML) applications. While various manufacturers offer GPU server configurations, two primary setups dominate the landscape: Single Root and Dual Root.
GPU Server Architecture Fundamentals
Modern GPU servers typically employ a dual-CPU design, where each processor is coupled with its own pool of DRAM. The two CPUs communicate over a high-speed interconnect, such as Intel's Ultra Path Interconnect (UPI) or AMD's Infinity Fabric (xGMI), ensuring efficient data transfer between processors.
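On a Linux host you can observe this two-socket layout directly. The short C++ sketch below (assuming libnuma and its headers are installed) prints the NUMA distance matrix; off-diagonal entries correspond to memory accesses that cross the UPI/xGMI link between sockets:

```cpp
// numa_topo.cpp -- print the NUMA distance matrix of a multi-socket host.
// Build: g++ numa_topo.cpp -lnuma   (assumes libnuma is installed)
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    int max_node = numa_max_node();  // highest node id, e.g. 1 on a dual-socket box
    for (int i = 0; i <= max_node; ++i) {
        for (int j = 0; j <= max_node; ++j) {
            // numa_distance() returns 10 for local memory; larger values
            // indicate the access crosses the inter-socket interconnect.
            printf("%4d", numa_distance(i, j));
        }
        printf("\n");
    }
    return 0;
}
```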
There are two primary approaches to integrating GPUs into these systems:
- PCIe-based Systems:
  - Utilize PCIe slots to house GPUs.
  - Offer flexibility and scalability.
  - Commonly found in 4U or 5U air-cooled servers.
- SXM/OAM Designs:
  - GPUs are integrated onto dedicated boards.
  - Each GPU maintains a single PCIe connection to the CPUs.
  - Often seen in 8U air-cooled servers or 3U/4U liquid-cooled servers.
Within the realm of PCIe-based systems, two configurations have emerged as dominant: Single Root and Dual Root. These setups differ in how they manage the communication between CPUs and GPUs, each offering distinct advantages for various workloads.
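Before looking at each configuration, it helps to see how the topology surfaces in software. The following minimal CUDA sketch prints each GPU's PCI bus ID; GPUs that share a root complex typically sit in a contiguous bus range, so on a dual-root machine the IDs usually split into two groups (exact numbering is platform-specific):

```cpp
// gpu_topo.cu -- list each GPU with its PCI bus ID.
// Build: nvcc gpu_topo.cu -o gpu_topo
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        char busId[32];
        // Returns a string of the form "domain:bus:device.function".
        cudaDeviceGetPCIBusId(busId, sizeof(busId), dev);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("GPU %d: %s  (PCI %s)\n", dev, prop.name, busId);
    }
    return 0;
}
```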
Single Root Configuration
How it works: In a Single Root configuration, all GPUs are connected to a single CPU, even in systems with multiple CPUs. This setup uses a PCIe switch (commonly called a PLX switch, after the vendor) to manage communication between the CPU and the GPUs. The designated CPU connects to the PLX switch over two x16 PCIe links, and the switch fans out to the GPUs.
The Single Root design centralizes GPU management, which can be advantageous for applications that require direct access to all GPUs simultaneously. Because the GPUs sit behind a common switch, peer-to-peer transfers between them can travel through the switch without touching the CPU. The trade-off is host traffic: every CPU-to-GPU transfer shares the single CPU's uplinks to the PLX switch, which can become a bottleneck when host-device data movement is heavy.
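To check what the PCIe topology permits at runtime, the standard CUDA peer-access API can be queried directly. A minimal sketch (error handling omitted for brevity); on a single-root system with GPUs behind a common PLX switch, these checks typically succeed:

```cpp
// p2p_check.cu -- query and enable peer-to-peer access between GPU pairs.
// Build: nvcc p2p_check.cu -o p2p_check
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int a = 0; a < count; ++a) {
        for (int b = 0; b < count; ++b) {
            if (a == b) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, a, b);
            printf("GPU %d -> GPU %d : P2P %s\n", a, b,
                   canAccess ? "supported" : "not supported");
            if (canAccess) {
                cudaSetDevice(a);
                // After this, cudaMemcpyPeer() between a and b moves data
                // directly over PCIe without bouncing through host memory.
                cudaDeviceEnablePeerAccess(b, 0);
            }
        }
    }
    return 0;
}
```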
Key features:
- All GPUs share the same PCIe root complex.
- One CPU manages all GPU communications.
- Simpler BIOS and OS configuration.
- More power-efficient due to single CPU usage.
- Ideal for deep learning applications where most computation occurs on GPUs.
Best for:
- Workloads with minimal CPU-GPU data transfers.
- Applications requiring direct access to all GPUs from a single CPU.
- Scenarios where power efficiency is crucial.
- Deep learning inference.
- Rendering workloads.
- Highly parallelizable data processing tasks.
Dual Root Configuration
How it works: In a Dual Root setup, the GPUs are distributed between the two CPUs. Each CPU connects to its assigned GPUs through its own PLX switch. This configuration allows for a more balanced distribution of computational resources and enables simultaneous transactions on both CPU-GPU pathways.
The Dual Root architecture creates two separate PCIe root complexes, one for each CPU-GPU group. This design can offer better performance for workloads that require frequent CPU-GPU interaction, since each GPU's host traffic stays on the PCIe links and locally attached memory of its own CPU, avoiding an extra hop across the CPU interconnect. The distribution of GPUs between the CPUs can be adjusted to match specific workload requirements.
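One practical consequence: on a dual-root machine, host staging buffers should live on the NUMA node local to the GPU they feed. A hedged sketch for Linux (it assumes libnuma is available and relies on the standard sysfs numa_node attribute; the buffer size and device index are illustrative):

```cpp
// numa_pin.cu -- allocate pinned host memory on the NUMA node local to a GPU.
// Build: nvcc numa_pin.cu -lnuma   (assumes Linux with libnuma installed)
#include <cstdio>
#include <cctype>
#include <numa.h>
#include <cuda_runtime.h>

int main() {
    int dev = 0;                       // GPU of interest (illustrative)
    char busId[32];
    cudaDeviceGetPCIBusId(busId, sizeof(busId), dev);
    for (char *p = busId; *p; ++p) *p = tolower(*p);  // sysfs names are lowercase

    // Ask the kernel which NUMA node (i.e. which CPU socket) this GPU hangs off.
    char path[128];
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", busId);
    int node = 0;
    if (FILE *f = fopen(path, "r")) { fscanf(f, "%d", &node); fclose(f); }
    if (node < 0) node = 0;            // -1 means no NUMA affinity reported

    // Allocate the staging buffer on that node, then pin it for fast DMA.
    size_t bytes = 64 << 20;           // 64 MiB (illustrative)
    void *buf = numa_alloc_onnode(bytes, node);
    cudaHostRegister(buf, bytes, cudaHostRegisterDefault);
    printf("GPU %d is local to NUMA node %d; buffer pinned there\n", dev, node);

    cudaHostUnregister(buf);
    numa_free(buf, bytes);
    return 0;
}
```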
In some high-performance computing deployments, dual-root systems include two InfiniBand cards, one attached to each CPU. This keeps communication balanced across the system and mitigates potential performance impacts, especially in scenarios where data needs to be shared efficiently between the two CPUs.
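For collective communication across such a system, libraries like NCCL handle the routing. Below is a minimal single-process all-reduce sketch (assuming NCCL is installed); NCCL selects the topologically nearest HCA for each GPU, which is exactly what the one-InfiniBand-card-per-CPU layout is designed to exploit, and the NCCL_IB_HCA environment variable can override that choice:

```cpp
// allreduce.cu -- single-process NCCL all-reduce across all local GPUs.
// Build: nvcc allreduce.cu -lnccl   (assumes the NCCL library is installed)
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);

    // One communicator per GPU, all owned by this process.
    std::vector<ncclComm_t> comms(n);
    ncclCommInitAll(comms.data(), n, nullptr);  // nullptr = use devices 0..n-1

    const size_t count = 1 << 20;               // 1M floats per GPU (illustrative)
    std::vector<float*> buf(n);
    std::vector<cudaStream_t> stream(n);
    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&stream[i]);
    }

    // Group the calls so NCCL can launch them together without deadlocking.
    ncclGroupStart();
    for (int i = 0; i < n; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], stream[i]);
    ncclGroupEnd();

    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(stream[i]);
        cudaFree(buf[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("all-reduce across %d GPUs complete\n", n);
    return 0;
}
```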
Key features:
- GPUs are split between two PCIe root complexes.
- Each CPU manages its own set of GPUs.
- More complex BIOS and OS configuration.
- Better performance for balanced CPU-GPU workloads.
- Allows simultaneous transactions on both buses.
- Ideal for AI training and complex simulations.
Best for:
- Workloads benefiting from reduced PCIe contention.
- Applications requiring balanced CPU and GPU resources.
- Scenarios where data needs to be shared efficiently between two CPUs.
- AI/ML training environments.
- Complex scientific simulations.
- Multi-user environments requiring dedicated CPU-GPU pairings.
Direct Attached Configuration
As an alternative to Single and Dual Root setups, some systems use a Direct Attached configuration. In this arrangement, each CPU has direct PCIe connections to up to four full-size GPUs, eliminating the need for PLX switches and allowing up to eight GPUs in a dual-CPU system. Removing the switch hop reduces CPU-GPU latency, which is beneficial in HPC applications where minimizing latency is crucial. The trade-off is scale: this configuration supports fewer total GPUs than setups built around PLX switches.
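The latency and bandwidth of the CPU-GPU path are easy to probe with CUDA events. A minimal sketch that times a pinned host-to-device copy (the transfer size is arbitrary; treat the output as a probe, not a full benchmark):

```cpp
// h2d_time.cu -- time host-to-device copies over the CPU-GPU PCIe path.
// Build: nvcc h2d_time.cu -o h2d_time
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256 << 20;    // 256 MiB transfer (illustrative)
    void *host = nullptr, *dev = nullptr;
    cudaHostAlloc(&host, bytes, cudaHostAllocDefault);  // pinned, DMA-friendly
    cudaMalloc(&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // warm-up pass
    cudaEventRecord(start);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```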
Key features of Direct Attached:
- No PLX switch used.
- Up to 4 GPUs directly connected to each CPU.
- Lower latency between CPU and GPU.
- Typically limited to 8 GPUs in a dual-CPU system.
Conclusion
Choosing between Single Root and Dual Root GPU server configurations depends on specific workload requirements, power constraints, and performance needs. Single Root offers simplicity, power efficiency, and centralized GPU management for GPU-centric tasks. Dual Root provides enhanced performance and flexibility for applications requiring balanced CPU-GPU interaction, at the cost of higher power consumption.
When planning your GPU server infrastructure, carefully consider your current and future needs, as well as the specific requirements of your applications. It's also worth consulting with hardware vendors and benchmarking your specific workloads to determine the optimal configuration for your use case.
Remember that while Single Root and Dual Root are the most common configurations, Direct Attached setups can offer advantages in specific scenarios where low latency is crucial and fewer GPUs are needed.