Rack-Scale AI Supercomputers: From Hardware to Topology-Aware Scheduling

The landscape of artificial intelligence is rapidly evolving, demanding ever-more powerful and efficient computational infrastructure. At the forefront of this evolution are rack-scale supercomputers, designed to accelerate the most complex AI and high-performance computing (HPC) workloads. NVIDIA's GB200 NVL72 and GB300 NVL72 systems, built upon the innovative Blackwell architecture, represent a significant leap in this direction, packaging immense GPU fabrics and high-bandwidth networking into cohesive, powerful units.
However, deploying such sophisticated hardware presents a unique challenge: how do you translate this intricate physical topology into a manageable, performant, and accessible resource for AI developers and researchers? The fundamental mismatch between the hierarchical nature of rack-scale hardware and the often-flat abstractions of traditional workload schedulers creates a bottleneck. This is precisely where a validated software stack like NVIDIA Mission Control steps in, bridging the gap to transform raw computational power into a seamless, topology-aware AI factory.
Next-Gen Rack-Scale AI Supercomputing with NVIDIA Blackwell
The NVIDIA GB200 NVL72 and GB300 NVL72 systems, powered by the cutting-edge NVIDIA Blackwell architecture, are not merely collections of powerful GPUs; they are integrated, rack-scale supercomputers engineered for the future of AI. Each system features 18 tightly coupled compute trays, forming a massive GPU fabric connected by advanced NVLink switches. These systems support NVIDIA Multi-Node NVLink (MNNVL), facilitating ultra-high-speed communication within the rack, and include IMEX-capable compute trays that enable shared GPU memory across nodes. This architecture provides an unparalleled foundation for training and deploying large-scale AI models, pushing the boundaries of what's possible in fields ranging from scientific discovery to enterprise AI applications.
The design philosophy behind these Blackwell-based systems focuses on maximizing data throughput and minimizing latency between interconnected GPUs. This is achieved through a densely integrated hardware stack where every component is optimized for collective performance, ensuring that AI workloads can scale efficiently without hitting communication bottlenecks.
Bridging Hardware Topology with AI Scheduler Abstractions
For AI architects and HPC platform operators, the real challenge isn't just acquiring and assembling this advanced hardware, but rather operationalizing it into a safe, performant, and easy-to-use resource. Traditional schedulers often operate under the assumption of a homogeneous, flat pool of computational resources. This paradigm is ill-suited for rack-scale supercomputers, where the hierarchical and topology-sensitive design of NVLink fabrics and IMEX domains is critical for performance. Without proper integration, schedulers might inadvertently place tasks in suboptimal locations, leading to reduced efficiency and unpredictable performance.
This is the gap NVIDIA Mission Control is engineered to fill. As a robust rack-scale control plane for NVIDIA Grace Blackwell NVL72 systems, Mission Control possesses a native understanding of the underlying NVIDIA NVLink and NVIDIA IMEX domains. This deep awareness allows it to intelligently integrate with popular workload management platforms such as Slurm and NVIDIA Run:ai. By translating complex hardware topologies into actionable scheduling intelligence, Mission Control ensures that the advanced capabilities of the Blackwell architecture are fully leveraged, transforming a sophisticated hardware assembly into a truly operational AI factory. This capability will extend to the upcoming NVIDIA Vera Rubin platform, including NVIDIA Rubin NVL8, further cementing a consistent approach to high-performance AI infrastructure.
Decoding NVLink Domains and Partitions for AI Workloads
At the heart of topology-aware scheduling for Blackwell systems are the concepts of NVLink domains and partitions, which are exposed through system-level identifiers: cluster UUID and clique ID. These identifiers are crucial because they provide a logical map of the physical NVLink fabric, allowing system software and schedulers to reason about each GPU's position and connectivity.
The mapping is straightforward yet powerful:
- Cluster UUID corresponds to the NVLink domain. A shared cluster UUID signifies that systems—and their GPUs—belong to the same overarching NVLink domain and are connected by a common NVLink fabric. For Grace Blackwell NVL72, this UUID is consistent across the entire rack, indicating physical proximity and shared high-bandwidth connectivity.
- Clique ID corresponds to the NVLink partition. The clique ID offers a finer-grained distinction, identifying groups of GPUs that share an NVLink Partition within a larger domain. When a rack is logically segmented into multiple NVLink partitions, the cluster UUID remains the same, but the clique IDs differentiate these smaller, isolated high-bandwidth groups.
This distinction is vital from an operational standpoint:
- The Cluster UUID answers the question: Which GPUs physically share a rack and are capable of NVLink communication at the highest speeds?
- The Clique ID answers: Which GPUs share an NVLink Partition and are intended to communicate together for a given workload or service tier, ensuring optimal performance for highly parallel tasks?
These identifiers are the connective tissue, enabling platforms like Slurm, Kubernetes, and NVIDIA Run:ai to align job placement, isolation, and performance guarantees with the actual structure of the NVLink fabric, all without exposing the underlying hardware complexity directly to end-users. NVIDIA Mission Control provides a centralized view of these identifiers, streamlining management.
| Hardware Concept | Software Identifier | Description |
|---|---|---|
| NVLink Domain | Cluster UUID | Identifies GPUs physically sharing a rack, capable of rack-wide NVLink communication. |
| NVLink Partition | Clique ID | Distinguishes GPUs intended to communicate together within an NVLink domain for a specific workload or service tier. |
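The mapping in the table above can be made concrete with a small sketch that groups GPUs by their (cluster UUID, clique ID) pair. The inventory records and field names below are hypothetical, chosen only to illustrate the logic; real identifiers are reported by system tooling such as the fabric manager, not by this structure.

```python
from collections import defaultdict

# Hypothetical GPU inventory records (field names are illustrative,
# not an actual API): same cluster_uuid = same NVLink domain (rack),
# same clique_id within it = same NVLink partition.
gpus = [
    {"gpu": "gpu0", "cluster_uuid": "rack-A", "clique_id": 1},
    {"gpu": "gpu1", "cluster_uuid": "rack-A", "clique_id": 1},
    {"gpu": "gpu2", "cluster_uuid": "rack-A", "clique_id": 2},
    {"gpu": "gpu3", "cluster_uuid": "rack-B", "clique_id": 1},
]

def nvlink_partitions(records):
    """Group GPUs by (cluster UUID, clique ID); each group is one
    NVLink partition inside one NVLink domain."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["cluster_uuid"], r["clique_id"])].append(r["gpu"])
    return dict(groups)

def same_partition(a, b, records):
    """True only if both GPUs share an NVLink partition and can be
    co-scheduled for full-bandwidth NVLink communication."""
    by_gpu = {r["gpu"]: (r["cluster_uuid"], r["clique_id"]) for r in records}
    return by_gpu[a] == by_gpu[b]

print(nvlink_partitions(gpus))
print(same_partition("gpu0", "gpu1", gpus))  # same domain and clique -> True
print(same_partition("gpu0", "gpu2", gpus))  # same domain, different clique -> False
```

Note that sharing a cluster UUID alone (gpu0 and gpu2) is not sufficient for co-placement at a given service tier; the clique ID is what a scheduler checks before packing a job into one high-bandwidth group.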
Topology-Aware AI Scheduling with Slurm
For multi-node workloads running on Blackwell-based NVL72 systems, placement becomes as critical as the sheer count of GPUs allocated. An AI training job requiring 16 GPUs, for instance, will perform vastly differently if spread haphazardly across multiple less-connected nodes compared to being confined within a single, high-bandwidth NVLink fabric. This is where Slurm’s topology/block plugin proves indispensable, allowing Slurm to recognize the nuanced connectivity differences between nodes.
On Grace Blackwell NVL72 systems, blocks of nodes featuring lower-latency connections directly correspond to NVLink partitions—groups of GPUs that are united by a dedicated, high-bandwidth NVLink fabric. By enabling the topology/block plugin and exposing these NVLink partitions as distinct blocks, Slurm gains the contextual intelligence required to make superior scheduling decisions. By default, jobs are intelligently placed within a single NVLink partition (or block), thereby preserving the critical Multi-Node NVLink (MNNVL) performance. While larger jobs can still span multiple blocks if necessary, this approach makes the performance tradeoffs explicit, rather than accidental.
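As a sketch, enabling the plugin and describing one NVL72 rack split into two nine-node NVLink partitions might look like the following. Node names, block names, and sizes are illustrative assumptions; consult the Slurm topology documentation for the exact syntax supported by your Slurm version.

```
# slurm.conf (excerpt): enable block-based topology awareness
TopologyPlugin=topology/block

# topology.conf (excerpt): each block corresponds to one NVLink partition
BlockName=nvlpart1 Nodes=node[01-09]
BlockName=nvlpart2 Nodes=node[10-18]
BlockSizes=9
```

With this in place, a job requesting up to nine nodes is packed into a single block by default, preserving MNNVL performance, while larger requests span blocks explicitly.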
In practical terms, this allows for flexible deployment strategies:
- One block/node group per rack: This configuration enables Slurm Quality of Service (QoS) to manage access to the shared, rack-wide partition, ideal for consolidated resource management.
- Multiple blocks/node groups per rack: This approach is perfect for offering smaller, isolated, high-bandwidth GPU pools. Here, each block/node group maps to a dedicated Slurm partition, effectively providing a distinct service tier. Users can then leverage a specific Slurm partition, automatically landing their jobs within the intended NVLink partition without needing to understand the underlying fabric intricacies. This advanced resource management is crucial for organizations looking to scale their AI initiatives, aligning with the broader goal of scaling AI for everyone.
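The multiple-blocks-per-rack pattern can be sketched in slurm.conf by mapping each node group to its own Slurm partition, which then acts as a distinct service tier. Partition and node names here are illustrative assumptions, not a prescribed layout.

```
# slurm.conf (excerpt): one Slurm partition per NVLink partition,
# so each partition is a distinct high-bandwidth service tier
PartitionName=nvl-tier1 Nodes=node[01-09] Default=NO
PartitionName=nvl-tier2 Nodes=node[10-18] Default=NO
```

A user then submits with, for example, `sbatch --partition=nvl-tier1 job.sh`, and the job automatically lands inside the corresponding NVLink partition without the user needing any knowledge of the fabric layout.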
Optimizing MNNVL Workloads with IMEX and Mission Control
Multi-node CUDA workloads frequently rely on MNNVL to achieve maximum performance, enabling GPUs on different compute trays to participate in a cohesive, shared-memory programming model. From an application developer's perspective, leveraging MNNVL can appear deceptively simple, but the underlying orchestration is complex.
This is where NVIDIA Mission Control plays a pivotal role. It ensures that critical components align perfectly when running MNNVL jobs with Slurm. Specifically, Mission Control guarantees that the IMEX service—which facilitates the shared GPU memory—runs on the exact set of compute trays participating in the MNNVL job. It also ensures that the necessary NVSwitches are correctly configured to establish and maintain these high-bandwidth MNNVL connections. This coordination is vital for providing consistent, predictable performance across the rack. Without Mission Control's intelligent orchestration, the benefits of MNNVL and IMEX would be challenging to realize and manage at scale, highlighting NVIDIA's commitment to delivering complete solutions for advanced GPUs and their ecosystems.
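The coordination described above can be thought of as maintaining an invariant: every compute tray participating in an MNNVL job must also be running the IMEX service. The sketch below expresses that invariant as a plain set check; it is purely illustrative and not Mission Control's actual API.

```python
def imex_alignment_ok(job_trays, imex_trays):
    """Hypothetical invariant check (not a Mission Control API):
    every tray in the MNNVL job must also run the IMEX service,
    so shared GPU memory is available wherever the job lands."""
    return set(job_trays) <= set(imex_trays)

# Aligned: IMEX runs on every tray the job uses.
print(imex_alignment_ok(["tray01", "tray02"], ["tray01", "tray02", "tray03"]))  # True

# Misaligned: the job reaches tray04, where IMEX is not running.
print(imex_alignment_ok(["tray01", "tray04"], ["tray01", "tray02", "tray03"]))  # False
```

Mission Control maintains this alignment automatically per job, which is precisely what makes MNNVL performance predictable at rack scale.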
Towards Automated, Scalable AI Infrastructure
The integration of NVIDIA's Blackwell architecture with sophisticated software layers like Mission Control and Topograph marks a significant step towards creating truly automated and scalable AI infrastructure. NVIDIA Topograph automates the discovery of the complex NVLink and interconnect hierarchy, exposing this vital information to schedulers such as Slurm, Kubernetes (through NVIDIA DRA and ComputeDomains), and NVIDIA Run:ai. This eliminates the manual overhead of managing topology, allowing organizations to deploy and scale AI workloads with unprecedented efficiency.
By providing schedulers with a deep, real-time understanding of the hardware topology, this integrated approach ensures that AI applications run on the optimal resources, minimizing communication latency and maximizing throughput. The result is a highly performant, resilient, and easy-to-manage AI factory capable of handling the most demanding AI training and inference tasks. As AI models continue to grow in complexity and size, the ability to effectively manage and schedule workloads on rack-scale supercomputers will be paramount for driving innovation and maintaining competitive advantage. This holistic strategy underpins the future of enterprise AI, transforming raw computational power into intelligent, responsive, and highly efficient AI supercomputing.
Original source
https://developer.nvidia.com/blog/running-ai-workloads-on-rack-scale-supercomputers-from-hardware-to-topology-aware-scheduling/
