---
title: "Rack-Scale AI Supercomputers: From Hardware to Topology-Aware Scheduling"
slug: "running-ai-workloads-on-rack-scale-supercomputers-from-hardware-to-topology-aware-scheduling"
date: "2026-04-08"
source: "https://developer.nvidia.com/blog/running-ai-workloads-on-rack-scale-supercomputers-from-hardware-to-topology-aware-scheduling/"
category: "Enterprise AI"
keywords:
  - AI workloads
  - Rack-scale supercomputers
  - NVIDIA Blackwell
  - NVLink
  - Topology-aware scheduling
  - Slurm
  - NVIDIA Mission Control
  - Multi-Node NVLink (MNNVL)
  - IMEX
  - GPU fabric
  - Resource management
  - Enterprise AI
meta_description: "Learn how NVIDIA Blackwell supercomputers work with Mission Control to enable topology-aware scheduling of AI workloads and optimize performance across NVLink and IMEX domains."
image: "/images/articles/running-ai-workloads-on-rack-scale-supercomputers-from-hardware-to-topology-aware-scheduling.png"
image_alt: "NVIDIA Grace Blackwell NVL72 rack, showing the NVLink and IMEX domains used in rack-scale AI supercomputers"
companies:
  - NVIDIA
schema_type: "NewsArticle"
reading_time: 7
---


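# Rack-Scale AI Supercomputers: From Hardware to Topology-Aware Scheduling

![Decorative image.](https://developer-blogs.nvidia.com/wp-content/uploads/2026/04/gtc25-tech-blog-dgx-gb300-1920x1080-1-1024x576.png)

AI is advancing quickly and demands ever more powerful and efficient compute infrastructure. Rack-scale supercomputers, built to accelerate the most complex AI and high-performance computing (HPC) workloads, are at the forefront of that shift. NVIDIA's GB200 NVL72 and GB300 NVL72 systems, built on the Blackwell architecture, are a major step in this direction: each packages a large GPU fabric and high-bandwidth networking into a single, tightly coupled unit.

Deploying hardware this complex raises a distinct challenge: how do you turn an intricate physical topology into a resource that AI developers and researchers can manage, get high performance from, and access easily? Rack-scale hardware is hierarchical, while conventional workload schedulers typically assume a flat abstraction, and that mismatch becomes a bottleneck. A validated software stack such as NVIDIA Mission Control bridges the gap, turning raw compute into a topology-aware AI factory.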

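## Next-generation rack-scale AI supercomputing with NVIDIA Blackwell

NVIDIA GB200 NVL72 and GB300 NVL72 systems, powered by the NVIDIA Blackwell architecture, are more than collections of powerful GPUs; they are integrated rack-scale supercomputers. Each system has 18 tightly coupled compute trays joined by NVLink switches into one large GPU fabric. The systems support NVIDIA Multi-Node NVLink (MNNVL) for very high-speed communication within the rack, and their compute trays are IMEX-capable, which allows GPU memory to be shared across nodes. This architecture provides a foundation for training and deploying large-scale AI models, in fields ranging from scientific discovery to enterprise AI applications.

The design of these Blackwell-based systems focuses on maximizing data throughput and minimizing latency between interconnected [GPUs](/zh/gpus). It does this through a tightly integrated hardware stack in which every component is tuned for collective performance, so AI workloads can scale without running into communication bottlenecks.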

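## Bridging hardware topology and AI scheduler abstractions

For AI architects and HPC platform operators, the real challenge is not just acquiring and assembling this hardware but operating it as a resource that is "secure, high-performing, and easy to use." Conventional schedulers usually assume a homogeneous, flat pool of compute resources. That model does not fit rack-scale supercomputers, where the hierarchical, topology-sensitive layout of the NVLink fabric and IMEX domains is critical to performance. Without proper integration, a scheduler can unknowingly place tasks in suboptimal locations, lowering efficiency and making performance unpredictable.

NVIDIA Mission Control is designed to fill this gap. As the rack-scale control plane for NVIDIA Grace Blackwell NVL72 systems, Mission Control natively understands the underlying NVIDIA NVLink and NVIDIA IMEX domains. That awareness lets it integrate with widely used workload management platforms such as Slurm and NVIDIA Run:ai. By translating hardware topology into information a scheduler can act on, Mission Control helps workloads make full use of the Blackwell architecture and turns a complex assembly of hardware into an operable AI factory. The same capability will extend to the upcoming NVIDIA Vera Rubin platform, including NVIDIA Rubin NVL8, giving high-performance AI infrastructure a consistent approach.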

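## Decoding NVLink domains and partitions for AI workloads

Topology-aware scheduling on Blackwell systems centers on NVLink domains and partitions, which are exposed through two system-level identifiers: the **cluster UUID** and the **clique ID**. These identifiers provide a logical map of the physical NVLink fabric, letting system software and schedulers reason about where GPUs sit and how they are connected.

The mapping is straightforward:
-   The **cluster UUID** corresponds to the **NVLink domain**. A shared cluster UUID means that the systems and their GPUs belong to the same NVLink domain and are connected by a common NVLink fabric. On Grace Blackwell NVL72, this UUID is the same across the entire rack, indicating physical proximity and shared high-bandwidth connectivity.
-   The **clique ID** corresponds to the **NVLink partition**. It makes a finer distinction, identifying a group of GPUs that share an NVLink partition within the larger domain. When a rack is logically split into multiple NVLink partitions, the cluster UUID stays the same, but the clique ID distinguishes the smaller, isolated, high-bandwidth groups.

Operationally, the distinction matters:
-   The **cluster UUID** answers the question: *Which GPUs physically share a rack and can communicate over NVLink at full speed?*
-   The **clique ID** answers: *Which GPUs share an NVLink partition and are intended to communicate together for a given workload or service tier?*

These identifiers are the connective tissue that lets platforms such as Slurm, Kubernetes, and NVIDIA Run:ai align job placement, isolation, and performance guarantees with the actual structure of the NVLink fabric, without exposing the underlying hardware complexity to end users. NVIDIA Mission Control provides a centralized view of these identifiers, which simplifies management.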

| Hardware concept | Software identifier | Description |
| :--------------- | :------------------ | :---------- |
| NVLink domain    | Cluster UUID        | Identifies GPUs that physically share a rack and can communicate over rack-level NVLink. |
| NVLink partition | Clique ID           | Distinguishes GPUs within an NVLink domain that are intended to communicate together for a specific workload or service tier. |
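
The mapping in the table can be sketched in a few lines of Python. This is only an illustration: the inventory records, tray names, and identifier values below are invented, and the sketch assumes that a clique ID is meaningful only within its cluster UUID, so both must match before two GPUs are treated as sharing an NVLink partition.

```python
from collections import defaultdict

# Hypothetical inventory records: (node, gpu_index, cluster_uuid, clique_id).
# All names and values are made up for illustration.
INVENTORY = [
    ("tray01", 0, "uuid-rack-a", 1),
    ("tray01", 1, "uuid-rack-a", 1),
    ("tray02", 0, "uuid-rack-a", 1),
    ("tray03", 0, "uuid-rack-a", 2),
    ("tray04", 0, "uuid-rack-b", 1),
]


def group_by_fabric(inventory):
    """Map cluster UUID (NVLink domain) -> clique ID (NVLink partition) -> node names."""
    domains = defaultdict(lambda: defaultdict(set))
    for node, _gpu, cluster_uuid, clique_id in inventory:
        domains[cluster_uuid][clique_id].add(node)
    return {
        uuid: {clique: sorted(nodes) for clique, nodes in cliques.items()}
        for uuid, cliques in domains.items()
    }


def same_partition(a, b):
    """Assumption: two GPUs share a partition only if both identifiers match."""
    return a[2] == b[2] and a[3] == b[3]


fabric = group_by_fabric(INVENTORY)
for uuid, cliques in fabric.items():
    print(uuid, cliques)
```

Grouping by the pair rather than by clique ID alone keeps GPUs in different racks apart even when their clique IDs happen to coincide.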

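## Topology-aware AI scheduling with Slurm

For multi-node workloads on Blackwell-based NVL72 systems, **placement matters as much as the number of GPUs allocated**. For example, an AI training job that needs 16 GPUs will perform very differently if it is scattered across several poorly connected nodes than if it is confined to a single high-bandwidth NVLink fabric. This is where Slurm's **topology/block plugin** comes in: it lets Slurm recognize that nodes differ in how well they are connected to one another.

On Grace Blackwell NVL72 systems, blocks of nodes with lower-latency connections correspond directly to **NVLink partitions**, groups of GPUs connected by a dedicated high-bandwidth NVLink fabric. With the topology/block plugin enabled and these NVLink partitions exposed as separate blocks, Slurm has the context it needs to make better scheduling decisions. By default, a job is placed within a single NVLink partition (block), which preserves Multi-Node NVLink (MNNVL) performance. Larger jobs can still span multiple blocks when necessary, but the performance tradeoff becomes explicit rather than accidental.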
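
The placement preference described above can be illustrated with a toy model. This is not Slurm's actual algorithm; `place_job`, the block names, and the idle-node counts are hypothetical. The sketch prefers the smallest single block that can hold the whole job and spans blocks only when no single block fits.

```python
def place_job(nodes_needed, free_by_block):
    """Return {block: node_count} for a job, preferring a single block.

    free_by_block maps a block name to its number of idle nodes.
    Toy model only; it does not reproduce Slurm's scheduling logic.
    """
    # Best fit: the smallest block that can hold the whole job.
    fitting = [
        (free, name) for name, free in free_by_block.items() if free >= nodes_needed
    ]
    if fitting:
        _, name = min(fitting)
        return {name: nodes_needed}

    # Otherwise span blocks, largest first, to touch as few blocks as possible.
    placement, remaining = {}, nodes_needed
    for name, free in sorted(free_by_block.items(), key=lambda kv: (-kv[1], kv[0])):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take:
            placement[name] = take
            remaining -= take
    if remaining:
        raise ValueError("not enough idle nodes for this job")
    return placement


idle = {"block1": 6, "block2": 4, "block3": 9}
print(place_job(4, idle))   # fits inside one block
print(place_job(12, idle))  # has to span two blocks
```

Assuming four GPUs per compute tray, the 16-GPU job from the example above needs 4 nodes and fits inside one block, while the 12-node job makes the cross-block tradeoff visible in the returned placement.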

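In practice, this allows flexible deployment strategies:
-   **One block/node group per rack**: Slurm quality of service (QoS) governs access to a shared, rack-wide partition, which suits consolidated resource management.
-   **Multiple blocks/node groups per rack**: This approach suits offering smaller, isolated, high-bandwidth GPU pools. Each block/node group maps to a dedicated Slurm partition, effectively providing a distinct service tier. Users then submit to a specific Slurm partition and their jobs land in the intended NVLink partition automatically, without having to understand the underlying fabric. This kind of resource management matters for organizations scaling their AI efforts, in line with the broader goal of [scaling AI for everyone](/zh/scaling-ai-for-everyone).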
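
As a rough sketch of the two strategies, the Python helper below renders `topology.conf`-style text for the topology/block plugin (selected in `slurm.conf` with `TopologyPlugin=topology/block`). The `BlockName`, `Nodes`, and `BlockSizes` keywords follow Slurm's block topology format as I understand it. The node names, block names, and the 9/9 split are assumptions for illustration, so check the values against the Slurm documentation for your version before using them.

```python
def render_block_topology(blocks, block_sizes):
    """Render topology.conf-style text for Slurm's topology/block plugin.

    blocks maps a block name to its list of node names (all hypothetical here).
    """
    lines = [
        f"BlockName={name} Nodes={','.join(nodes)}"
        for name, nodes in blocks.items()
    ]
    lines.append("BlockSizes=" + ",".join(str(size) for size in block_sizes))
    return "\n".join(lines)


# Hypothetical node names for the 18 compute trays of one rack.
rack_nodes = [f"r1n{i:02d}" for i in range(1, 19)]

# Strategy 1: one block per rack.
one_per_rack = render_block_topology({"rack1": rack_nodes}, [18])

# Strategy 2: two blocks per rack (an assumed 9/9 split into two NVLink partitions).
two_per_rack = render_block_topology(
    {"rack1a": rack_nodes[:9], "rack1b": rack_nodes[9:]}, [9]
)

print(two_per_rack)
```

In the second strategy, each block would then be paired with its own Slurm partition so that users choose a service tier rather than a set of nodes.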

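## Optimizing MNNVL workloads with IMEX and Mission Control

Multi-node NVIDIA CUDA workloads often rely on MNNVL for maximum performance, so that GPUs on different compute trays can take part in one cohesive shared-memory programming model. From an application developer's perspective, using MNNVL can look simple, but the orchestration underneath is complex.

This is where NVIDIA Mission Control plays a key role: it keeps the critical components aligned when MNNVL jobs run on Slurm. Specifically, Mission Control ensures that the IMEX service, which facilitates shared GPU memory, runs on *exactly* the set of compute trays participating in the MNNVL job. It also ensures that the necessary NVSwitches are configured correctly to establish and maintain these high-bandwidth MNNVL connections. This coordination is essential for consistent, predictable performance across the rack. Without Mission Control's orchestration, the benefits of MNNVL and IMEX would be hard to achieve and manage at scale, which reflects NVIDIA's aim of providing a complete solution for its [GPUs](/zh/gpus) and their ecosystem.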
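
The "exact set of compute trays" requirement amounts to a set comparison. The sketch below is hypothetical and does not call any Mission Control, Slurm, or IMEX interface: `imex_alignment` and the tray names are invented, and the two node lists are assumed to come from the job allocation and from whatever reports where the IMEX service is running.

```python
def imex_alignment(job_nodes, imex_nodes):
    """Compare the trays allocated to a job with the trays running the IMEX service."""
    job, imex = set(job_nodes), set(imex_nodes)
    return {
        "missing": sorted(job - imex),  # in the job, but IMEX is not running there
        "extra": sorted(imex - job),    # IMEX is running, but the tray is not in the job
        "aligned": job == imex,
    }


# Hypothetical tray names for one MNNVL job.
report = imex_alignment(
    job_nodes=["tray01", "tray02", "tray03"],
    imex_nodes=["tray02", "tray03", "tray04"],
)
print(report)
```

A job is aligned only when both difference lists are empty, which is the condition the orchestration layer is described as maintaining on the user's behalf.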

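## Toward automated, scalable AI infrastructure

Combining the NVIDIA Blackwell architecture with software layers such as Mission Control and Topograph is a significant step toward automated, scalable AI infrastructure. NVIDIA Topograph automates discovery of the NVLink and interconnect hierarchy and exposes that information to schedulers such as Slurm, Kubernetes (through NVIDIA DRA and ComputeDomains), and NVIDIA Run:ai. This removes the manual overhead of managing topology, so organizations can deploy and scale AI workloads more efficiently.

By giving schedulers a detailed, real-time understanding of the hardware topology, this integrated approach places AI applications on the best-suited resources, reducing communication latency and increasing throughput. The result is a high-performing, resilient, and manageable AI factory that can handle demanding AI training and inference. As AI models keep growing in complexity and scale, the ability to manage and schedule workloads effectively on rack-scale supercomputers will be essential to driving innovation and staying competitive. This holistic strategy underpins the future of enterprise AI, turning raw compute into intelligent, responsive, and efficient AI supercomputing.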

## Original source

https://developer.nvidia.com/blog/running-ai-workloads-on-rack-scale-supercomputers-from-hardware-to-topology-aware-scheduling/

## Frequently asked questions

What are NVIDIA GB200 and GB300 NVL72 systems, and what role does the Blackwell architecture play?

NVIDIA GB200 and GB300 NVL72 systems represent a new generation of rack-scale supercomputers specifically engineered for demanding AI and HPC workloads. These systems leverage the groundbreaking NVIDIA Blackwell architecture, which integrates massive GPU fabrics with high-bandwidth networking into a single, tightly coupled unit. The Blackwell architecture is designed to deliver unprecedented performance and efficiency for training and inference, featuring advanced NVLink switches, Multi-Node NVLink (MNNVL) for inter-GPU communication, and IMEX-capable compute trays that facilitate shared GPU memory across multiple nodes within the rack. This integrated design aims to overcome the limitations of traditional server-bound GPU deployments, providing a seamless, scalable platform for complex AI models.

What is the primary challenge in scheduling AI workloads on these advanced rack-scale supercomputers?

The core challenge lies in the significant mismatch between the intricate, hierarchical physical topology of rack-scale supercomputers and the often simplistic abstractions presented by conventional workload schedulers. While systems like the NVIDIA GB200/GB300 NVL72 boast sophisticated NVLink fabrics and IMEX domains, schedulers typically perceive a flat pool of GPUs and nodes. This can lead to inefficient resource allocation, sub-optimal performance due to poor data locality or communication bottlenecks, and increased operational complexity for platform operators. Without topology-aware scheduling, the inherent advantages of rack-scale integration, such as high-bandwidth interconnections, cannot be fully leveraged for AI workloads.

How does NVIDIA Mission Control address the operational complexities of rack-scale AI scheduling?

NVIDIA Mission Control acts as a crucial control plane that bridges the gap between the complex hardware topology of NVIDIA Grace Blackwell NVL72 systems and the needs of workload management platforms like Slurm and NVIDIA Run:ai. It provides a native, deep understanding of NVLink and IMEX domains, translating physical hardware relationships into logical identifiers that schedulers can interpret. By centralizing the view of cluster UUIDs and clique IDs, Mission Control enables precise, topology-aware job placement, ensures proper workload isolation, and guarantees consistent performance by aligning computations with the optimal underlying hardware fabric. This effectively transforms raw infrastructure into an efficient, manageable AI factory.

Explain the concepts of Cluster UUID and Clique ID in the context of NVLink topology and their operational significance.

Cluster UUID and Clique ID are system-level identifiers that encode a GPU's position within the NVLink fabric, making the complex topology understandable to system software and schedulers. The Cluster UUID corresponds to the NVLink domain, indicating that systems and their GPUs belong to the same physical rack and share a common NVLink fabric. For Grace Blackwell NVL72, this UUID is consistent across the entire rack. The Clique ID provides a finer distinction, corresponding to an NVLink Partition. GPUs sharing a Clique ID belong to the same logical partition within that domain. Operationally, the Cluster UUID answers which GPUs physically share a rack and can communicate via NVLink, while the Clique ID answers which GPUs share an NVLink Partition and are intended to communicate together for a specific workload, enabling finer-grained resource allocation and performance optimization.

How does Slurm's topology/block plugin enhance AI workload placement on NVL72 systems?

Slurm's topology/block plugin is essential for efficient AI workload placement on NVIDIA NVL72 systems by making Slurm aware that not all nodes (or GPUs) are equal in terms of connectivity and performance. On Grace Blackwell NVL72 systems, blocks of nodes with lower-latency connections directly map to NVLink partitions, which are groups of GPUs sharing a high-bandwidth NVLink fabric. By enabling this plugin and exposing NVLink partitions as 'blocks,' Slurm gains the necessary context to make intelligent placement decisions. This ensures that multi-GPU jobs are preferentially allocated within a single NVLink partition to preserve MNNVL performance, preventing performance degradation that could occur if jobs were spread indiscriminately across different, less-connected segments of the supercomputer. It allows for optimized resource utilization and predictable performance for demanding AI tasks.

What is Multi-Node NVLink (MNNVL), and how does IMEX facilitate it for shared GPU memory?

Multi-Node NVLink (MNNVL) lets GPUs on different compute nodes within a rack-scale system communicate directly at high bandwidth and low latency, which is essential for scaling large AI models. It enables a shared-memory programming model across these distributed GPUs, so that applications see a single, large GPU fabric. IMEX (the NVIDIA Internode Memory Exchange Service) is the service that makes this sharing possible: it coordinates the mapping of GPU memory between nodes over NVLink, and IMEX-capable compute trays are what allow GPU memory to be shared across nodes. While MNNVL simplifies the programming model for developers, Mission Control works behind the scenes to ensure that IMEX services are correctly provisioned and synchronized with MNNVL jobs, so the benefits of shared GPU memory are realized without exposing the underlying complexity to end users.

What are the key benefits of implementing topology-aware scheduling for AI workloads on rack-scale supercomputers?

Implementing topology-aware scheduling offers several significant benefits for AI workloads on rack-scale supercomputers. Firstly, it ensures optimal performance by intelligently placing jobs on GPUs that have the highest bandwidth and lowest latency connections, minimizing communication overheads inherent in distributed AI training. Secondly, it enhances resource utilization by preventing inefficient spreading of jobs across disparate hardware segments, leading to more predictable performance and better throughput. Thirdly, it simplifies management for platform operators by abstracting hardware complexities while providing clear isolation boundaries between workloads, improving system stability and security. Ultimately, topology-aware scheduling transforms complex hardware into a highly efficient, scalable, and manageable 'AI factory,' accelerating research and development while reducing operational burden.

How does NVIDIA Topograph contribute to the automated discovery and scheduling of supercomputer topologies?

NVIDIA Topograph is a critical component that automates the discovery of the intricate NVLink and interconnect hierarchy within rack-scale supercomputers. This automated discovery is essential because manually configuring and maintaining detailed topology information for large-scale systems would be prone to errors and highly time-consuming. Topograph exposes this detailed fabric information to workload schedulers, including Slurm and Kubernetes (through NVIDIA DRA and ComputeDomains), as well as NVIDIA Run:ai. By providing schedulers with an accurate and real-time view of the hardware topology, Topograph enables them to make intelligent, automated placement decisions. This ensures that AI workloads are scheduled in a topology-aware manner from the outset, optimizing performance, resource allocation, and overall system efficiency, which is crucial for building and operating scalable AI factories.
