Rack-Scale AI Supercomputers: From Hardware to Topology-Aware Scheduling

The AI landscape is evolving rapidly and demands ever more powerful and efficient compute infrastructure. At the forefront are rack-scale supercomputers, designed to accelerate the most complex AI and high-performance computing (HPC) workloads. NVIDIA's GB200 NVL72 and GB300 NVL72 systems, built on the Blackwell architecture, are a significant step in this direction: they pack massive GPU fabrics and high-bandwidth networking into cohesive, powerful units.

Deploying hardware this sophisticated raises a specific challenge, however: how do you turn such a complex physical topology into a resource that is manageable, performant, and accessible to AI developers and researchers? The fundamental mismatch between the hierarchical nature of rack-scale hardware and the often flat abstractions of traditional workload schedulers creates a bottleneck. This is where a validated software stack such as NVIDIA Mission Control comes in, bridging the gap to turn raw compute power into a seamless, topology-aware AI factory.

Next-Generation Rack-Scale AI Supercomputing with NVIDIA Blackwell

NVIDIA GB200 NVL72 and GB300 NVL72 systems, powered by the NVIDIA Blackwell architecture, are not just collections of powerful GPUs; they are integrated rack-scale supercomputers. Each system contains 18 tightly coupled compute trays that form a massive GPU fabric connected through NVLink switches. These systems support NVIDIA Multi-Node NVLink (MNNVL), which enables very fast communication inside the rack, and they include IMEX-capable compute trays that enable shared GPU memory across nodes. This architecture provides a foundation for training and deploying large-scale AI models in domains ranging from scientific discovery to enterprise AI applications.

The design philosophy behind these Blackwell-based systems centers on maximizing data throughput and minimizing latency between connected GPUs. This is achieved with a densely integrated hardware stack in which every component is tuned for collective performance, so that AI workloads can scale efficiently without running into communication bottlenecks.

Bridging Hardware Topology and AI Scheduler Abstractions

For AI architects and HPC platform operators, the real challenge is not just acquiring and assembling this hardware but operating it as a 'safe, performant, and easy-to-use' resource. Traditional schedulers often assume a homogeneous, flat pool of compute resources. That paradigm is a poor fit for rack-scale supercomputers, where the hierarchical, topology-sensitive design of NVLink fabrics and IMEX domains is critical to performance. Without proper integration, schedulers may place jobs in sub-optimal locations, which reduces efficiency and makes performance unpredictable.

This is the gap NVIDIA Mission Control is designed to fill. As a rack-scale control plane for NVIDIA Grace Blackwell NVL72 systems, Mission Control natively understands the underlying NVIDIA NVLink and NVIDIA IMEX domains. That awareness lets it integrate with popular workload management platforms such as Slurm and NVIDIA Run:ai. By translating complex hardware topologies into actionable scheduling intelligence, Mission Control ensures that the capabilities of the Blackwell architecture are fully used, turning a sophisticated hardware assembly into an operational AI factory. This capability will extend to the upcoming NVIDIA Vera Rubin platform, including NVIDIA Rubin NVL8, further establishing a consistent approach to high-performance AI infrastructure.

Decoding NVLink Domains and Partitions for AI Workloads

At the core of topology-aware scheduling on Blackwell systems are the concepts of NVLink domains and NVLink partitions, which are exposed through two system-level identifiers: Cluster UUID and Clique ID. These identifiers matter because they provide a logical map of the physical NVLink fabric, allowing system software and schedulers to understand where each GPU sits and how it is connected.

The mapping is simple but powerful:

Cluster UUID corresponds to the NVLink domain. A shared Cluster UUID indicates that systems, and their GPUs, belong to the same overall NVLink domain and are connected through a common NVLink fabric. For Grace Blackwell NVL72, this UUID is consistent across the entire rack, indicating physical proximity and shared high-bandwidth connectivity.
Clique ID corresponds to the NVLink partition. The Clique ID offers a finer distinction, identifying groups of GPUs that share an NVLink partition within a larger domain. When a rack is logically divided into several NVLink partitions, the Cluster UUID stays the same, but the Clique IDs distinguish these smaller, isolated, high-bandwidth groups.

This distinction is essential from an operational perspective:

The Cluster UUID answers the question: which GPUs physically share a rack and are capable of NVLink communication at the highest speeds?
The Clique ID answers the question: which GPUs share an NVLink partition and are intended to communicate together for a given workload or service tier, so that highly parallel jobs get the best performance?

These identifiers are the connective tissue that lets platforms such as Slurm, Kubernetes, and NVIDIA Run:ai align job placement, isolation, and performance guarantees with the actual structure of the NVLink fabric, without exposing the underlying hardware complexity directly to end users. NVIDIA Mission Control provides a centralized view of these identifiers, which simplifies management.

Hardware concept	Software identifier	Description
NVLink Domain	Cluster UUID	Identifies GPUs that physically share a rack and can communicate over NVLink across the whole rack.
NVLink Partition	Clique ID	Distinguishes GPUs intended to communicate together within an NVLink domain for a specific workload or service tier.
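
To make the two-level mapping concrete, the following minimal Python sketch groups a hand-written GPU inventory first by Cluster UUID (NVLink domain) and then by Clique ID (NVLink partition). The `GpuRecord` type, its field names, and the sample values are hypothetical and chosen only for illustration; how the identifiers are actually collected on a real system is outside the scope of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class GpuRecord(NamedTuple):
    """One GPU's fabric identifiers (field names are illustrative)."""
    node: str
    gpu_index: int
    cluster_uuid: str  # identifies the NVLink domain
    clique_id: int     # identifies the NVLink partition within that domain


def group_nodes(records: Iterable[GpuRecord]) -> Dict[str, Dict[int, List[str]]]:
    """Return {cluster_uuid: {clique_id: sorted node names}}."""
    grouped = defaultdict(lambda: defaultdict(set))
    for rec in records:
        grouped[rec.cluster_uuid][rec.clique_id].add(rec.node)
    return {
        uuid: {clique: sorted(nodes) for clique, nodes in cliques.items()}
        for uuid, cliques in grouped.items()
    }


def same_partition(a: GpuRecord, b: GpuRecord) -> bool:
    """True only if both the domain and the partition match."""
    return (a.cluster_uuid, a.clique_id) == (b.cluster_uuid, b.clique_id)


# Hypothetical inventory: one rack (one Cluster UUID) split into two partitions.
inventory = [
    GpuRecord("tray01", 0, "rack-a", 1),
    GpuRecord("tray02", 0, "rack-a", 1),
    GpuRecord("tray03", 0, "rack-a", 2),
]
print(group_nodes(inventory))
```

Comparing the pair (Cluster UUID, Clique ID) rather than the Clique ID alone reflects the text above: a Clique ID only distinguishes partitions inside one domain, so it should not be compared across racks.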

Topology-Aware AI Scheduling with Slurm

For multi-node workloads running on Blackwell-based NVL72 systems, placement becomes just as critical as the number of GPUs allocated. An AI training job that needs 16 GPUs, for example, will perform noticeably differently if it is scattered at random across several less-connected nodes than if it is confined to a single high-bandwidth NVLink fabric. This is where Slurm's topology/block plugin becomes essential: it lets Slurm recognize the connectivity differences between nodes.

On Grace Blackwell NVL72 systems, blocks of nodes with lower-latency connections map directly to NVLink partitions, that is, groups of GPUs joined by a dedicated high-bandwidth NVLink fabric. By enabling the topology/block plugin and exposing these NVLink partitions as separate blocks, Slurm gets the context it needs to make good scheduling decisions. By default, jobs are placed within a single NVLink partition (or block), which preserves the critical Multi-Node NVLink (MNNVL) performance. Larger jobs can still span several blocks when necessary, but this approach makes the performance trade-offs explicit rather than accidental.
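
As a rough illustration of exposing NVLink partitions as blocks, the Python sketch below renders `topology.conf` text from a mapping of block names to node names. It assumes the `BlockName=... Nodes=...` and `BlockSizes=...` line format used by Slurm's topology/block plugin (enabled with `TopologyPlugin=topology/block` in `slurm.conf`); the block and node names are hypothetical, and the choice of the smallest block as the base `BlockSizes` value is an assumption to check against the Slurm documentation for your version.

```python
from typing import Dict, List


def render_topology_conf(blocks: Dict[str, List[str]]) -> str:
    """Render topology.conf lines, one Slurm block per NVLink partition.

    `blocks` maps a block name to the nodes in that NVLink partition.
    """
    if not blocks or any(not nodes for nodes in blocks.values()):
        raise ValueError("every block needs at least one node")

    lines = [
        f"BlockName={name} Nodes={','.join(nodes)}"
        for name, nodes in sorted(blocks.items())
    ]
    # Assumption: use the smallest block as the planning base block size,
    # since each block must contain at least that many nodes.
    base_size = min(len(nodes) for nodes in blocks.values())
    lines.append(f"BlockSizes={base_size}")
    return "\n".join(lines) + "\n"


# Hypothetical rack split into two NVLink partitions.
example = {
    "block1": ["n01", "n02", "n05"],
    "block2": ["n03", "n04"],
}
print(render_topology_conf(example), end="")
```

Generating the file from partition membership, rather than editing it by hand, keeps the Slurm blocks in step with the NVLink partitions whenever a rack is repartitioned.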

In practical terms, this enables flexible deployment strategies:

One block/node group per rack: This configuration lets Slurm Quality of Service (QoS) manage access to the shared, rack-scale partition, which suits unified resource management.
Multiple blocks/node groups per rack: This approach suits offering smaller, isolated, high-bandwidth GPU pools. Here, each block/node group is mapped to a dedicated Slurm partition, which effectively provides a separate service tier. Users can then target a specific Slurm partition, and their jobs land automatically within the intended NVLink partition without requiring them to understand the underlying fabric. This kind of resource management matters for organizations that want to scale their AI initiatives, and it aligns with the broader goal of making AI accessible to everyone.

Optimizing MNNVL Workloads with IMEX and Mission Control

Multi-node NVIDIA CUDA workloads often rely on MNNVL to reach maximum performance, allowing GPUs on different compute trays to take part in a single, coherent shared-memory programming model. From an application developer's point of view, using MNNVL can look deceptively simple, but the underlying orchestration is complex.

This is where NVIDIA Mission Control plays a central role. It makes sure the critical components line up when MNNVL jobs run under Slurm. In particular, Mission Control ensures that the IMEX service, which enables shared GPU memory, runs on exactly the same set of compute trays that take part in the MNNVL job. It also ensures that the required NVSwitches are configured correctly to create and maintain the high-bandwidth MNNVL connections. This coordination is essential for consistent, predictable performance across the rack. Without Mission Control's orchestration, the benefits of MNNVL and IMEX would be hard to realize and manage at scale, which underlines NVIDIA's commitment to delivering complete solutions for advanced GPUs and their ecosystems.
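
The alignment requirement stated above, that the IMEX service must run on exactly the compute trays participating in the MNNVL job, can be expressed as a simple set comparison. The Python sketch below is only an illustration of that invariant; it does not reflect how Mission Control implements the check, and the function name and node names are hypothetical.

```python
from typing import Dict, Iterable, List, Union


def check_imex_alignment(
    job_nodes: Iterable[str], imex_nodes: Iterable[str]
) -> Dict[str, Union[bool, List[str]]]:
    """Compare the nodes of an MNNVL job with the nodes running IMEX for it."""
    job, imex = set(job_nodes), set(imex_nodes)
    return {
        # Job nodes with no IMEX service: shared GPU memory would be unavailable there.
        "missing_from_imex": sorted(job - imex),
        # IMEX nodes outside the job: the IMEX domain is wider than the job.
        "extra_in_imex": sorted(imex - job),
        "aligned": job == imex,
    }


# Hypothetical allocation: tray03 is in the job but IMEX was started on tray04.
report = check_imex_alignment(
    job_nodes=["tray01", "tray02", "tray03"],
    imex_nodes=["tray01", "tray02", "tray04"],
)
print(report)
```

Reporting both directions of the mismatch separates two different failure modes: a job node that lacks the service, and a service instance that extends beyond the job's isolation boundary.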

Toward Automated, Scalable AI Infrastructure

Combining NVIDIA's Blackwell architecture with software layers such as Mission Control and Topograph is a significant step toward automated, scalable AI infrastructure. NVIDIA Topograph automates discovery of the complex NVLink and interconnect hierarchy and exposes that information to schedulers such as Slurm, Kubernetes (through NVIDIA DRA and ComputeDomains), and NVIDIA Run:ai. This removes the manual overhead of topology management and lets organizations deploy and scale AI workloads more efficiently.

By giving schedulers a deep, real-time understanding of the hardware topology, this integrated approach ensures that AI applications run on the most suitable resources, minimizing communication latency and maximizing throughput. The result is a performant, resilient, and manageable AI factory that can handle the most demanding AI training and inference jobs. As AI models continue to grow in complexity and size, the ability to manage and schedule workloads efficiently on rack-scale supercomputers will be central to driving innovation and staying competitive. This holistic strategy underpins the future of enterprise AI, turning raw compute power into intelligent, responsive, and efficient AI supercomputing.

Original source

https://developer.nvidia.com/blog/running-ai-workloads-on-rack-scale-supercomputers-from-hardware-to-topology-aware-scheduling/

Frequently Asked Questions

What are NVIDIA GB200 and GB300 NVL72 systems, and what role does the Blackwell architecture play?

NVIDIA GB200 and GB300 NVL72 systems represent a new generation of rack-scale supercomputers specifically engineered for demanding AI and HPC workloads. These systems leverage the groundbreaking NVIDIA Blackwell architecture, which integrates massive GPU fabrics with high-bandwidth networking into a single, tightly coupled unit. The Blackwell architecture is designed to deliver unprecedented performance and efficiency for training and inference, featuring advanced NVLink switches, Multi-Node NVLink (MNNVL) for inter-GPU communication, and IMEX-capable compute trays that facilitate shared GPU memory across multiple nodes within the rack. This integrated design aims to overcome the limitations of traditional server-bound GPU deployments, providing a seamless, scalable platform for complex AI models.

What is the primary challenge in scheduling AI workloads on these advanced rack-scale supercomputers?

The core challenge lies in the significant mismatch between the intricate, hierarchical physical topology of rack-scale supercomputers and the often simplistic abstractions presented by conventional workload schedulers. While systems like the NVIDIA GB200/GB300 NVL72 boast sophisticated NVLink fabrics and IMEX domains, schedulers typically perceive a flat pool of GPUs and nodes. This can lead to inefficient resource allocation, sub-optimal performance due to poor data locality or communication bottlenecks, and increased operational complexity for platform operators. Without topology-aware scheduling, the inherent advantages of rack-scale integration, such as high-bandwidth interconnections, cannot be fully leveraged for AI workloads.

How does NVIDIA Mission Control address the operational complexities of rack-scale AI scheduling?

NVIDIA Mission Control acts as a crucial control plane that bridges the gap between the complex hardware topology of NVIDIA Grace Blackwell NVL72 systems and the needs of workload management platforms like Slurm and NVIDIA Run:ai. It provides a native, deep understanding of NVLink and IMEX domains, translating physical hardware relationships into logical identifiers that schedulers can interpret. By centralizing the view of cluster UUIDs and clique IDs, Mission Control enables precise, topology-aware job placement, ensures proper workload isolation, and guarantees consistent performance by aligning computations with the optimal underlying hardware fabric. This effectively transforms raw infrastructure into an efficient, manageable AI factory.

Explain the concepts of Cluster UUID and Clique ID in the context of NVLink topology and their operational significance.

Cluster UUID and Clique ID are system-level identifiers that encode a GPU's position within the NVLink fabric, making the complex topology understandable to system software and schedulers. The Cluster UUID corresponds to the NVLink domain, indicating that systems and their GPUs belong to the same physical rack and share a common NVLink fabric. For Grace Blackwell NVL72, this UUID is consistent across the entire rack. The Clique ID provides a finer distinction, corresponding to an NVLink Partition. GPUs sharing a Clique ID belong to the same logical partition within that domain. Operationally, the Cluster UUID answers which GPUs physically share a rack and can communicate via NVLink, while the Clique ID answers which GPUs share an NVLink Partition and are intended to communicate together for a specific workload, enabling finer-grained resource allocation and performance optimization.

How does Slurm's topology/block plugin enhance AI workload placement on NVL72 systems?

Slurm's topology/block plugin is essential for efficient AI workload placement on NVIDIA NVL72 systems by making Slurm aware that not all nodes (or GPUs) are equal in terms of connectivity and performance. On Grace Blackwell NVL72 systems, blocks of nodes with lower-latency connections directly map to NVLink partitions, which are groups of GPUs sharing a high-bandwidth NVLink fabric. By enabling this plugin and exposing NVLink partitions as 'blocks,' Slurm gains the necessary context to make intelligent placement decisions. This ensures that multi-GPU jobs are preferentially allocated within a single NVLink partition to preserve MNNVL performance, preventing performance degradation that could occur if jobs were spread indiscriminately across different, less-connected segments of the supercomputer. It allows for optimized resource utilization and predictable performance for demanding AI tasks.

What is Multi-Node NVLink (MNNVL), and how does IMEX facilitate it for shared GPU memory?

Multi-Node NVLink (MNNVL) is a key technology that allows GPUs on different compute nodes within a rack-scale system to communicate directly with high bandwidth and low latency, which is essential for scaling large AI models. MNNVL enables a shared-memory programming model across these distributed GPUs, so that GPUs on different compute trays can participate in a single coherent model. IMEX (Internode Memory Exchange) is the service that enables this shared GPU memory across nodes, and IMEX-capable compute trays are designed to run it. While MNNVL simplifies the programming model for developers, Mission Control works behind the scenes to ensure that the IMEX service runs on exactly the compute trays participating in each MNNVL job and that the NVSwitches are configured correctly, so the benefits of shared GPU memory are realized without exposing the underlying complexity to the end user.

What are the key benefits of implementing topology-aware scheduling for AI workloads on rack-scale supercomputers?

Implementing topology-aware scheduling offers several significant benefits for AI workloads on rack-scale supercomputers. Firstly, it ensures optimal performance by intelligently placing jobs on GPUs that have the highest bandwidth and lowest latency connections, minimizing communication overheads inherent in distributed AI training. Secondly, it enhances resource utilization by preventing inefficient spreading of jobs across disparate hardware segments, leading to more predictable performance and better throughput. Thirdly, it simplifies management for platform operators by abstracting hardware complexities while providing clear isolation boundaries between workloads, improving system stability and security. Ultimately, topology-aware scheduling transforms complex hardware into a highly efficient, scalable, and manageable 'AI factory,' accelerating research and development while reducing operational burden.

How does NVIDIA Topograph contribute to the automated discovery and scheduling of supercomputer topologies?

NVIDIA Topograph is a critical component that automates the discovery of the intricate NVLink and interconnect hierarchy within rack-scale supercomputers. This automated discovery is essential because manually configuring and maintaining detailed topology information for large-scale systems would be prone to errors and highly time-consuming. Topograph exposes this detailed fabric information to workload schedulers, including Slurm and Kubernetes (through NVIDIA DRA and ComputeDomains), as well as NVIDIA Run:ai. By providing schedulers with an accurate and real-time view of the hardware topology, Topograph enables them to make intelligent, automated placement decisions. This ensures that AI workloads are scheduled in a topology-aware manner from the outset, optimizing performance, resource allocation, and overall system efficiency, which is crucial for building and operating scalable AI factories.
