Rack-Scale AI Supercomputers: From Hardware to Topology-Aware Scheduling

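AI is evolving quickly and demands ever more powerful and efficient compute infrastructure. At the forefront are rack-scale supercomputers designed to accelerate the most complex AI and high-performance computing (HPC) workloads. NVIDIA's GB200 NVL72 and GB300 NVL72 systems, built on the Blackwell architecture, are a major step in this direction: each combines a large GPU fabric and high-bandwidth networking into a single, tightly integrated unit.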

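Deploying hardware this sophisticated brings its own challenge: how do you turn a complex physical topology into a resource that is manageable, performant, and accessible for AI developers and researchers? Rack-scale hardware is hierarchical, while conventional workload schedulers often present a flat abstraction, and that mismatch creates a bottleneck. A validated software stack such as NVIDIA Mission Control fills this gap, turning raw compute into a topology-aware AI factory.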

Next-generation rack-scale AI supercomputing with NVIDIA Blackwell

The NVIDIA GB200 NVL72 and GB300 NVL72 systems, built on the NVIDIA Blackwell architecture, are not just collections of powerful GPUs. They are integrated rack-scale supercomputers. Each system has 18 tightly coupled compute trays that NVLink switches connect into one large GPU fabric. The systems support NVIDIA Multi-Node NVLink (MNNVL) for high-speed communication within the rack, and their compute trays are IMEX-capable, which allows GPU memory to be shared across nodes. This architecture provides a foundation for training and deploying large AI models, from scientific discovery to enterprise AI applications.

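The design goal of these Blackwell-based systems is to maximize data throughput and minimize latency between interconnected GPUs. A tightly integrated hardware stack, with every component optimized for collective performance, is what lets AI workloads scale efficiently without running into communication bottlenecks.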

Bridging hardware topology and AI scheduler abstractions

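For AI architects and HPC platform operators, the real challenge is not just acquiring and assembling this hardware but operating it as a resource that is "secure, performant, and easy to use." Conventional schedulers often assume a homogeneous, flat pool of compute resources. That paradigm suits rack-scale supercomputers poorly, because the hierarchical, topology-sensitive design of the NVLink fabric and IMEX domains is critical to performance. Without proper integration, a scheduler may place tasks in suboptimal locations, reducing efficiency and making performance unpredictable.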

NVIDIA Mission Control is designed to bridge exactly this gap. As a rack-scale control plane for NVIDIA Grace Blackwell NVL72 systems, Mission Control natively understands the underlying NVIDIA NVLink and NVIDIA IMEX domains. This lets it integrate with common workload management platforms such as Slurm and NVIDIA Run:ai. By translating the hardware topology into actionable scheduling information, Mission Control helps ensure that the capabilities of the Blackwell architecture are actually used, turning an assembly of hardware into an operational AI factory. This capability is also expected to extend to the upcoming NVIDIA Vera Rubin platform, including NVIDIA Rubin NVL8.

Decoding NVLink domains and partitions for AI workloads

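Topology-aware scheduling on Blackwell systems centers on NVLink domains and partitions, which are exposed through two system-level identifiers: the cluster UUID and the clique ID. These identifiers provide a logical map of the physical NVLink fabric, allowing system software and schedulers to reason about where GPUs sit and how they are connected.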

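The mapping is simple but powerful.

The cluster UUID corresponds to the NVLink domain. A shared cluster UUID indicates that systems and their GPUs belong to the same overall NVLink domain and are connected by a common NVLink fabric. On Grace Blackwell NVL72, this UUID is the same across the entire rack, reflecting physical proximity and shared high-bandwidth connectivity.
The clique ID corresponds to the NVLink partition. It gives a finer-grained distinction, identifying a group of GPUs that share an NVLink partition within the larger domain. When a rack is logically segmented into multiple NVLink partitions, the cluster UUID stays the same while the clique ID distinguishes the smaller, isolated high-bandwidth groups.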

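This distinction matters operationally.

The cluster UUID answers the question: which GPUs physically share a rack and can communicate over NVLink at the highest speed?
The clique ID answers the question: which GPUs share an NVLink partition and are intended to communicate with one another for a specific workload or service tier, so that highly parallel tasks get the best performance?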

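These identifiers are the connective tissue that lets platforms such as Slurm, Kubernetes, and NVIDIA Run:ai align job placement, isolation, and performance guarantees with the actual structure of the NVLink fabric, without exposing the underlying hardware complexity directly to end users. NVIDIA Mission Control provides a centralized view of these identifiers, which streamlines management.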

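Hardware concept	Software identifier	Description
NVLink domain	Cluster UUID	Identifies GPUs that physically share a rack and can communicate over NVLink across the whole rack.
NVLink partition	Clique ID	Distinguishes GPUs within an NVLink domain that are intended to communicate with one another for a specific workload or service tier.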
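
To make the two-level mapping concrete, here is a minimal Python sketch that groups nodes first by cluster UUID (NVLink domain) and then by clique ID (NVLink partition). The `NodeFabricInfo` type, the hostnames, and the identifier values are all hypothetical; how the identifiers are actually collected from each node is outside the scope of this sketch.

```python
from collections import defaultdict
from typing import NamedTuple


class NodeFabricInfo(NamedTuple):
    """Hypothetical per-node record; field values below are made up."""
    hostname: str
    cluster_uuid: str  # identifies the NVLink domain (rack-wide on NVL72)
    clique_id: int     # identifies the NVLink partition within that domain


def group_by_partition(nodes):
    """Return {cluster_uuid: {clique_id: [hostnames]}}."""
    domains = defaultdict(lambda: defaultdict(list))
    for node in nodes:
        domains[node.cluster_uuid][node.clique_id].append(node.hostname)
    # Convert to plain dicts with sorted hostnames for stable output.
    return {
        uuid: {clique: sorted(hosts) for clique, hosts in cliques.items()}
        for uuid, cliques in domains.items()
    }


# Illustrative inventory: rack 1 is split into two partitions, rack 2 is not.
inventory = [
    NodeFabricInfo("rack1-n01", "uuid-rack1", 1),
    NodeFabricInfo("rack1-n02", "uuid-rack1", 1),
    NodeFabricInfo("rack1-n03", "uuid-rack1", 2),
    NodeFabricInfo("rack2-n01", "uuid-rack2", 1),
]
layout = group_by_partition(inventory)
```

In this made-up inventory, `rack1-n01` and `rack1-n03` share a cluster UUID but not a clique ID: they sit in the same rack but in different NVLink partitions.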

Topology-aware AI scheduling with Slurm

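For multi-node workloads on Blackwell-based NVL72 systems, placement matters as much as the number of GPUs allocated. For example, an AI training job that needs 16 GPUs performs very differently when it is confined to a single high-bandwidth NVLink fabric than when it is scattered across several poorly connected nodes. This is where Slurm's topology/block plugin becomes essential: it makes Slurm aware of the connectivity differences between nodes.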

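On Grace Blackwell NVL72 systems, a block of nodes with low-latency connections corresponds directly to an NVLink partition, a group of GPUs joined by a dedicated high-bandwidth NVLink fabric. With the plugin enabled and the NVLink partitions exposed as individual blocks, Slurm has the context it needs to make good scheduling decisions. By default, a job is placed within a single NVLink partition (block), which preserves Multi-Node NVLink (MNNVL) performance. Larger jobs can still span multiple blocks when necessary, but this approach makes the performance trade-off explicit rather than accidental.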
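
The placement policy described above (keep a job inside one block by default, and cross blocks only on request) can be illustrated with a small Python sketch. This is not Slurm's algorithm; `place_job`, the best-fit rule, and the block and node names are all assumptions made for illustration. With 72 GPUs across 18 compute trays, a 16-GPU job would need 4 nodes.

```python
def place_job(nodes_needed, free_nodes_by_block, allow_cross_block=False):
    """Pick nodes for a job, preferring a single block.

    free_nodes_by_block maps a block name to its free hostnames.
    Returns (chosen_nodes, blocks_used); raises ValueError if the job
    cannot be placed under the given policy.
    """
    # Best fit: the smallest block that can hold the whole job.
    candidates = [
        (len(free), name)
        for name, free in free_nodes_by_block.items()
        if len(free) >= nodes_needed
    ]
    if candidates:
        _, name = min(candidates)
        return sorted(free_nodes_by_block[name])[:nodes_needed], [name]

    if not allow_cross_block:
        raise ValueError("no single block has enough free nodes")

    # Explicit opt-in: spill across blocks, emptiest-last (largest first).
    chosen, used = [], []
    ordered = sorted(free_nodes_by_block.items(),
                     key=lambda item: (-len(item[1]), item[0]))
    for name, free in ordered:
        take = sorted(free)[: nodes_needed - len(chosen)]
        if take:
            chosen.extend(take)
            used.append(name)
        if len(chosen) == nodes_needed:
            return chosen, used
    raise ValueError("not enough free nodes in the cluster")


# Hypothetical free-node state for three blocks.
free_nodes = {
    "block1": ["n01", "n02"],
    "block2": ["n10", "n11", "n12", "n13", "n14"],
    "block3": ["n20", "n21", "n22", "n23"],
}
nodes, blocks = place_job(4, free_nodes)  # fits entirely in one block
```

Requesting 6 nodes from the same state raises an error unless `allow_cross_block=True` is passed, which mirrors the idea that spanning partitions should be a deliberate choice.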

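In practice, this allows flexible deployment strategies.

One block/node group per rack: Slurm Quality of Service (QoS) manages access to a shared, rack-wide partition, which suits unified resource management.
Multiple blocks/node groups per rack: this approach suits offering smaller, isolated high-bandwidth GPU pools. Each block/node group maps to a dedicated Slurm partition, effectively providing distinct service tiers. By submitting to a particular Slurm partition, users have their jobs placed inside the intended NVLink partition automatically, without needing to understand the underlying fabric. This kind of resource management matters for organizations scaling up their AI initiatives and fits the broader goal of extending AI to everyone.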
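
As a rough illustration of the second strategy, the Python sketch below renders block definitions in the style of Slurm's `topology.conf` for the topology/block plugin, plus one `PartitionName` line per block for `slurm.conf`. The block names, node names, and tier names are hypothetical, the `BlockSizes` value is left to the caller, and the output should be checked against the Slurm documentation for the installed version before use. The generated partition lines carry only the node list; real partitions usually need further options.

```python
def render_topology_conf(blocks, base_block_size=None):
    """Render BlockName lines, one per NVLink partition.

    blocks maps a block name to its list of hostnames. Node lists are
    written comma-separated rather than as compressed host ranges.
    """
    lines = [
        f"BlockName={name} Nodes={','.join(nodes)}"
        for name, nodes in blocks.items()
    ]
    if base_block_size is not None:
        lines.append(f"BlockSizes={base_block_size}")
    return "\n".join(lines) + "\n"


def render_partitions(blocks, tier_names):
    """Render one Slurm partition line per block (one service tier each)."""
    lines = [
        f"PartitionName={tier_names[name]} Nodes={','.join(nodes)}"
        for name, nodes in blocks.items()
    ]
    return "\n".join(lines) + "\n"


# Hypothetical rack split into two NVLink partitions of two nodes each.
blocks = {
    "block1": ["n01", "n02"],
    "block2": ["n03", "n04"],
}
tiers = {"block1": "tier_a", "block2": "tier_b"}

topology_conf = render_topology_conf(blocks, base_block_size=2)
partition_lines = render_partitions(blocks, tiers)
```

Enabling the plugin itself is a separate step (`TopologyPlugin=topology/block` in `slurm.conf`). For the one-block-per-rack strategy, the same functions would be called with a single block covering all nodes in the rack.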

Optimizing MNNVL workloads with IMEX and Mission Control

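Multi-node NVIDIA CUDA workloads often rely on MNNVL for maximum performance; it lets GPUs on different compute trays participate in a coherent shared-memory programming model. From an application developer's point of view, using MNNVL can look simple, but the orchestration underneath is complex.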

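This is where NVIDIA Mission Control plays a central role. When an MNNVL job runs under Slurm, Mission Control makes sure the key components line up. Specifically, it ensures that the IMEX service, which enables shared GPU memory, runs on exactly the set of compute trays participating in the MNNVL job. It also ensures that the NVSwitches needed to establish and maintain the high-bandwidth MNNVL connections are configured correctly. This coordination is essential for consistent, predictable performance across the rack. Without Mission Control's orchestration, the benefits of MNNVL and IMEX would be difficult to realize and manage at scale. This reflects NVIDIA's aim of delivering a complete solution around its GPUs and their ecosystem.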
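
The requirement that IMEX run on exactly the job's compute trays can be expressed as a simple set comparison. The Python sketch below is not how Mission Control is implemented. The helper names, the inputs, and the idea of representing the IMEX node set as a list of IP addresses are assumptions for illustration.

```python
def imex_nodes_for_job(job_nodes, partition_of, ip_of):
    """Return the IP list an IMEX domain for this job would need.

    partition_of maps hostname -> (cluster_uuid, clique_id).
    ip_of maps hostname -> IP address.
    Raises ValueError if the job spans more than one NVLink partition.
    """
    partitions = {partition_of[node] for node in job_nodes}
    if len(partitions) != 1:
        raise ValueError(
            f"job spans {len(partitions)} NVLink partitions, expected 1")
    return [ip_of[node] for node in sorted(job_nodes)]


def imex_matches_job(imex_ips, job_nodes, ip_of):
    """True if IMEX covers exactly the job's nodes: no more, no fewer."""
    return set(imex_ips) == {ip_of[node] for node in job_nodes}


# Hypothetical cluster state.
partition_of = {
    "n01": ("uuid-rack1", 1),
    "n02": ("uuid-rack1", 1),
    "n03": ("uuid-rack1", 2),
}
ip_of = {"n01": "10.0.0.1", "n02": "10.0.0.2", "n03": "10.0.0.3"}

job = ["n02", "n01"]
imex_ips = imex_nodes_for_job(job, partition_of, ip_of)
```

A check like `imex_matches_job` catches both failure modes: an IMEX domain that is missing a job node, and one that still includes nodes left over from a previous job.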

Toward automated, scalable AI infrastructure

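Integrating NVIDIA's Blackwell architecture with software layers such as Mission Control and Topograph is an important step toward AI infrastructure that is genuinely automated and scalable. NVIDIA Topograph automates discovery of the NVLink and interconnect hierarchy and exposes that information to schedulers such as Slurm, Kubernetes (via NVIDIA DRA and ComputeDomains), and NVIDIA Run:ai. This removes the manual overhead of topology management and lets organizations deploy and scale AI workloads more efficiently.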

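By giving schedulers a deep, real-time understanding of the hardware topology, this integrated approach helps AI applications run on the most suitable resources, with minimal communication latency and maximum throughput. The result is an AI factory that is performant, resilient, and manageable enough for the most demanding AI training and inference tasks. As AI models keep growing in complexity and scale, the ability to manage and schedule workloads effectively on rack-scale supercomputers will be central to driving innovation and staying competitive. This overall strategy underpins the future of enterprise AI, turning raw compute into responsive, efficient AI supercomputing.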

Original source

https://developer.nvidia.com/blog/running-ai-workloads-on-rack-scale-supercomputers-from-hardware-to-topology-aware-scheduling/

Frequently asked questions

What are NVIDIA GB200 and GB300 NVL72 systems, and what role does the Blackwell architecture play?

NVIDIA GB200 and GB300 NVL72 systems represent a new generation of rack-scale supercomputers specifically engineered for demanding AI and HPC workloads. These systems leverage the groundbreaking NVIDIA Blackwell architecture, which integrates massive GPU fabrics with high-bandwidth networking into a single, tightly coupled unit. The Blackwell architecture is designed to deliver unprecedented performance and efficiency for training and inference, featuring advanced NVLink switches, Multi-Node NVLink (MNNVL) for inter-GPU communication, and IMEX-capable compute trays that facilitate shared GPU memory across multiple nodes within the rack. This integrated design aims to overcome the limitations of traditional server-bound GPU deployments, providing a seamless, scalable platform for complex AI models.

What is the primary challenge in scheduling AI workloads on these advanced rack-scale supercomputers?

The core challenge lies in the significant mismatch between the intricate, hierarchical physical topology of rack-scale supercomputers and the often simplistic abstractions presented by conventional workload schedulers. While systems like the NVIDIA GB200/GB300 NVL72 boast sophisticated NVLink fabrics and IMEX domains, schedulers typically perceive a flat pool of GPUs and nodes. This can lead to inefficient resource allocation, sub-optimal performance due to poor data locality or communication bottlenecks, and increased operational complexity for platform operators. Without topology-aware scheduling, the inherent advantages of rack-scale integration, such as high-bandwidth interconnections, cannot be fully leveraged for AI workloads.

How does NVIDIA Mission Control address the operational complexities of rack-scale AI scheduling?

NVIDIA Mission Control acts as a crucial control plane that bridges the gap between the complex hardware topology of NVIDIA Grace Blackwell NVL72 systems and the needs of workload management platforms like Slurm and NVIDIA Run:ai. It provides a native, deep understanding of NVLink and IMEX domains, translating physical hardware relationships into logical identifiers that schedulers can interpret. By centralizing the view of cluster UUIDs and clique IDs, Mission Control enables precise, topology-aware job placement, ensures proper workload isolation, and guarantees consistent performance by aligning computations with the optimal underlying hardware fabric. This effectively transforms raw infrastructure into an efficient, manageable AI factory.

Explain the concepts of Cluster UUID and Clique ID in the context of NVLink topology and their operational significance.

Cluster UUID and Clique ID are system-level identifiers that encode a GPU's position within the NVLink fabric, making the complex topology understandable to system software and schedulers. The Cluster UUID corresponds to the NVLink domain, indicating that systems and their GPUs belong to the same physical rack and share a common NVLink fabric. For Grace Blackwell NVL72, this UUID is consistent across the entire rack. The Clique ID provides a finer distinction, corresponding to an NVLink Partition. GPUs sharing a Clique ID belong to the same logical partition within that domain. Operationally, the Cluster UUID answers which GPUs physically share a rack and can communicate via NVLink, while the Clique ID answers which GPUs share an NVLink Partition and are intended to communicate together for a specific workload, enabling finer-grained resource allocation and performance optimization.

How does Slurm's topology/block plugin enhance AI workload placement on NVL72 systems?

Slurm's topology/block plugin is essential for efficient AI workload placement on NVIDIA NVL72 systems by making Slurm aware that not all nodes (or GPUs) are equal in terms of connectivity and performance. On Grace Blackwell NVL72 systems, blocks of nodes with lower-latency connections directly map to NVLink partitions, which are groups of GPUs sharing a high-bandwidth NVLink fabric. By enabling this plugin and exposing NVLink partitions as 'blocks,' Slurm gains the necessary context to make intelligent placement decisions. This ensures that multi-GPU jobs are preferentially allocated within a single NVLink partition to preserve MNNVL performance, preventing performance degradation that could occur if jobs were spread indiscriminately across different, less-connected segments of the supercomputer. It allows for optimized resource utilization and predictable performance for demanding AI tasks.

What is Multi-Node NVLink (MNNVL), and how does IMEX facilitate it for shared GPU memory?

Multi-Node NVLink (MNNVL) allows GPUs on different compute nodes within a rack-scale system to communicate directly over NVLink with high bandwidth and low latency, which is essential for scaling large AI models. MNNVL enables a shared-memory programming model across these GPUs, so applications can treat them as part of one large NVLink fabric. IMEX (the NVIDIA Internode Memory Exchange Service) is the service that makes this sharing possible: it runs on the IMEX-capable compute trays and coordinates GPU memory export and import across nodes. While MNNVL simplifies the programming model for developers, Mission Control works behind the scenes to ensure that IMEX services are provisioned on the right nodes and kept in step with MNNVL jobs, so that the benefits of shared GPU memory are realized without exposing the underlying complexity to the end user.

What are the key benefits of implementing topology-aware scheduling for AI workloads on rack-scale supercomputers?

Implementing topology-aware scheduling offers several significant benefits for AI workloads on rack-scale supercomputers. Firstly, it ensures optimal performance by intelligently placing jobs on GPUs that have the highest bandwidth and lowest latency connections, minimizing communication overheads inherent in distributed AI training. Secondly, it enhances resource utilization by preventing inefficient spreading of jobs across disparate hardware segments, leading to more predictable performance and better throughput. Thirdly, it simplifies management for platform operators by abstracting hardware complexities while providing clear isolation boundaries between workloads, improving system stability and security. Ultimately, topology-aware scheduling transforms complex hardware into a highly efficient, scalable, and manageable 'AI factory,' accelerating research and development while reducing operational burden.

How does NVIDIA Topograph contribute to the automated discovery and scheduling of supercomputer topologies?

NVIDIA Topograph is a critical component that automates the discovery of the intricate NVLink and interconnect hierarchy within rack-scale supercomputers. This automated discovery is essential because manually configuring and maintaining detailed topology information for large-scale systems would be prone to errors and highly time-consuming. Topograph exposes this detailed fabric information to workload schedulers, including Slurm and Kubernetes (through NVIDIA DRA and ComputeDomains), as well as NVIDIA Run:ai. By providing schedulers with an accurate and real-time view of the hardware topology, Topograph enables them to make intelligent, automated placement decisions. This ensures that AI workloads are scheduled in a topology-aware manner from the outset, optimizing performance, resource allocation, and overall system efficiency, which is crucial for building and operating scalable AI factories.
