Accelerating Generative AI Inference on SageMaker with G7e Instances

G7e Instances: A New Era for AI Inference on SageMaker

The generative AI landscape is evolving at an unprecedented pace, driving sustained demand for more powerful, flexible, and cost-effective infrastructure. Today, Code Velocity is excited to report a significant advance from AWS: the general availability of G7e instances on Amazon SageMaker AI. These new instances, powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, are designed to raise the bar for generative AI inference, offering developers and organizations higher performance and more GPU memory than earlier G-series instances.

Amazon SageMaker AI is a fully managed service that gives developers and data scientists the tools to build, train, and deploy machine learning models at scale. The introduction of G7e instances is a significant step for generative AI workloads on the platform. The instances use NVIDIA RTX PRO 6000 Blackwell GPUs, each with 96 GB of GDDR7 memory. This increase in memory makes it possible to deploy much larger foundation models (FMs) directly on SageMaker AI, addressing a critical need in advanced AI applications.

Organizations can now deploy models such as GPT-OSS-120B, Nemotron-3-Super-120B-A12B (NVFP4 variant), and Qwen3.5-35B-A3B efficiently. A G7e.2xlarge instance, with a single GPU, can host models of up to 35 billion parameters, while the G7e.48xlarge, with eight GPUs, scales to models of up to 300 billion parameters. This flexibility translates into tangible benefits: reduced operational complexity, lower latency, and significant cost savings for inference workloads.
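
As a rough sanity check on those model sizes, the sketch below estimates weight memory as parameter count times bytes per parameter and compares it with the available GPU memory. It assumes 2 bytes per parameter (FP16/BF16) and ignores KV cache, activations, and runtime overhead, so real headroom is smaller. The function names are illustrative and not part of any AWS tooling.

```python
GPU_MEMORY_GB = 96  # per-GPU memory of the RTX PRO 6000 Blackwell, as stated in the article


def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes).

    params_billions * 1e9 params * bytes_per_param / 1e9 bytes-per-GB
    simplifies to params_billions * bytes_per_param.
    """
    return params_billions * bytes_per_param


def fits(params_billions: float, num_gpus: int, bytes_per_param: float = 2.0) -> bool:
    """True if the weights alone fit in the aggregate GPU memory (assumption: no overhead)."""
    return weight_memory_gb(params_billions, bytes_per_param) <= num_gpus * GPU_MEMORY_GB


for name, gpus, params in [("g7e.2xlarge", 1, 35), ("g7e.48xlarge", 8, 300)]:
    need = weight_memory_gb(params)
    total = gpus * GPU_MEMORY_GB
    print(f"{name}: ~{need:.0f} GB of weights vs {total} GB of GPU memory")
```

Under these assumptions both quoted limits leave some memory free for the KV cache, which suggests the stated parameter counts are practical ceilings rather than raw capacity divided by two.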

Breaking Down G7e's Generational Leap in Performance

G7e instances are a major step up from their predecessors, G6e and G5, delivering up to 2.3x faster inference than G6e. The technical specifications show the generational gain. Each G7e GPU provides 96 GB of GDDR7 memory with 1,597 GB/s of bandwidth, doubling the per-GPU memory of G6e and quadrupling that of G5. Networking also improves sharply, scaling up to 1,600 Gbps with EFA on the largest G7e size. That is a 4x increase over G6e and 16x over G5, and it opens up low-latency multi-node inference and fine-tuning scenarios that were previously considered impractical.

The following comparison highlights the progress across generations at the 8-GPU tier:

Specification	G5 (g5.48xlarge)	G6e (g6e.48xlarge)	G7e (g7e.48xlarge)
GPU	8x NVIDIA A10G	8x NVIDIA L40S	8x NVIDIA RTX PRO 6000 Blackwell
GPU memory per GPU	24 GB GDDR6	48 GB GDDR6	96 GB GDDR7
Total GPU memory	192 GB	384 GB	768 GB
GPU memory bandwidth	600 GB/s per GPU	864 GB/s per GPU	1,597 GB/s per GPU
vCPUs	192	192	192
System memory	768 GiB	1,536 GiB	2,048 GiB
Network bandwidth	100 Gbps	400 Gbps	1,600 Gbps (EFA)
Local NVMe storage	7.6 TB	7.6 TB	15.2 TB
Inference vs. G6e	Baseline	~1x	Up to 2.3x
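
The generational ratios quoted in this section can be rechecked directly from the table. The sketch below simply restates the table's per-GPU memory, memory bandwidth, and network figures in a dictionary; the `ratio` helper is illustrative.

```python
# Figures copied from the comparison table (8-GPU tier).
SPECS = {
    "G5": {"gpu_mem_gb": 24, "mem_bw_gbs": 600, "network_gbps": 100},
    "G6e": {"gpu_mem_gb": 48, "mem_bw_gbs": 864, "network_gbps": 400},
    "G7e": {"gpu_mem_gb": 96, "mem_bw_gbs": 1597, "network_gbps": 1600},
}


def ratio(metric: str, newer: str = "G7e", older: str = "G6e") -> float:
    """How many times larger `metric` is on `newer` than on `older`."""
    return SPECS[newer][metric] / SPECS[older][metric]


for metric in ("gpu_mem_gb", "mem_bw_gbs", "network_gbps"):
    print(
        f"{metric}: {ratio(metric):.2f}x vs G6e, "
        f"{ratio(metric, older='G5'):.2f}x vs G5"
    )
```

Note that the 2x and 4x figures apply to memory capacity and the 4x and 16x figures to network bandwidth; per-GPU memory bandwidth grows by a smaller factor.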

With 768 GB of aggregate GPU memory on a single G7e instance, models that previously required complex multi-node configurations on older instances can now be deployed on one node. This significantly reduces inter-node latency and operational overhead. Combined with FP4 precision support through fifth-generation Tensor Cores and NVIDIA GPUDirect RDMA over EFAv4, G7e instances are purpose-built for demanding LLMs, multimodal AI, and sophisticated agentic inference workflows on AWS.

Generative AI Use Cases That Thrive on G7e

The combination of memory density, bandwidth, and advanced networking makes G7e instances well suited to a wide range of current generative AI workloads. From conversational AI to complex physical simulations, G7e offers tangible benefits:

Chatbots and conversational AI: The low time to first token (TTFT) and high throughput of G7e instances keep interactive experiences responsive and smooth, even under heavy concurrent user load. This is critical for maintaining user engagement and satisfaction in real-time AI interactions.
Agentic workflows and tool calling: For Retrieval Augmented Generation (RAG) pipelines and agentic systems, fast context injection from retrieval stores is paramount. The 4x improvement in CPU-to-GPU bandwidth in G7e instances makes them particularly efficient for these operations, enabling smarter and more dynamic AI agents.
Text generation, summarization, and long-context inference: With 96 GB of memory per GPU, G7e instances handle large key-value (KV) caches well. This allows extended document contexts, significantly reduces the need for text truncation, and supports richer, more accurate reasoning over very large inputs.
Image generation and vision models: Where previous-generation instances often hit out-of-memory errors with larger multimodal models, G7e's doubled memory capacity removes these limits, paving the way for more sophisticated, higher-resolution image and vision applications.
Physical AI and scientific computing: Beyond traditional generative AI, G7e's Blackwell-generation compute, FP4 support, and spatial computing capabilities (including DLSS 4.0 and fourth-generation RT Cores) extend its usefulness to digital twins, 3D simulation, and inference for advanced physical AI models, opening new frontiers in scientific research and industrial applications.
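
To make the long-context item above more concrete, here is a minimal sketch of how KV cache size scales with context length. It uses the standard per-token estimate of two tensors (key and value) per layer. The model shape (64 layers, 8 KV heads, head dimension 128, 2-byte elements) and the 20 GB cache budget are hypothetical values chosen for illustration; they do not come from the article.

```python
def kv_bytes_per_token(
    layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int = 2
) -> int:
    """Bytes of KV cache consumed by one token.

    The leading factor of 2 accounts for storing both a key and a value
    vector for every layer and KV head.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(budget_gb: float, per_token_bytes: int) -> int:
    """Tokens that fit in `budget_gb` of GPU memory (1 GB = 1e9 bytes)."""
    return int(budget_gb * 1e9 // per_token_bytes)


# Hypothetical grouped-query-attention model shape (assumption, not from the article).
per_token = kv_bytes_per_token(layers=64, kv_heads=8, head_dim=128)

# Hypothetical budget: memory left over for the cache after loading weights.
tokens = max_cached_tokens(budget_gb=20, per_token_bytes=per_token)

print(f"{per_token / 1024:.0f} KiB per token, about {tokens:,} cached tokens in 20 GB")
```

Because the cache grows linearly with the number of tokens held across all concurrent requests, extra per-GPU memory translates directly into longer contexts or more simultaneous sessions.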

Streamlined Deployment and Performance Benchmarking

Deploying generative AI models on G7e instances with Amazon SageMaker AI is designed to be straightforward. A sample notebook is available that streamlines the process. Prerequisites typically include an AWS account, an IAM role with access to SageMaker, and either Amazon SageMaker Studio or a SageMaker notebook instance as the development environment. Importantly, users must request an appropriate quota for ml.g7e.2xlarge or larger instances for SageMaker AI endpoint usage through the Service Quotas console.
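
The following is a minimal sketch of what the endpoint step might look like with boto3, not the contents of the sample notebook. It assumes a SageMaker model with the given name has already been registered (for example with `create_model`), that the G7e quota has been granted, and that AWS credentials are configured. The naming scheme and the timeout values are illustrative assumptions.

```python
def build_endpoint_config(
    model_name: str,
    instance_type: str = "ml.g7e.2xlarge",
    instance_count: int = 1,
) -> dict:
    """Build keyword arguments for SageMaker's CreateEndpointConfig API."""
    return {
        "EndpointConfigName": f"{model_name}-config",  # hypothetical naming scheme
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
                # Large model weights can take a while to download and load;
                # these timeout values are assumptions, tune them per model.
                "ModelDataDownloadTimeoutInSeconds": 1800,
                "ContainerStartupHealthCheckTimeoutInSeconds": 1800,
            }
        ],
    }


def create_endpoint(model_name: str) -> None:
    """Create the endpoint config and endpoint. Requires AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    sm = boto3.client("sagemaker")
    config = build_endpoint_config(model_name)
    sm.create_endpoint_config(**config)
    sm.create_endpoint(
        EndpointName=f"{model_name}-endpoint",  # hypothetical naming scheme
        EndpointConfigName=config["EndpointConfigName"],
    )
```

Keeping the request construction separate from the API call makes it easy to review or unit test the configuration before anything is provisioned.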

To demonstrate the performance gains, AWS benchmarked the Qwen3-32B (BF16) model on both G6e and G7e instances. The workload consisted of approximately 1,000 input tokens and 560 output tokens per request, mimicking common document summarization tasks. Both configurations used the native vLLM container with prefix caching enabled, ensuring a like-for-like comparison.

The results are notable. While the G6e baseline (ml.g6e.12xlarge with 4 L40S GPUs at $13.12 per hour) showed strong per-request throughput, the G7e (ml.g7e.2xlarge with one RTX PRO 6000 Blackwell GPU at $4.20 per hour) tells a dramatically different cost story. At production concurrency (C=32), G7e achieved $0.79 per million output tokens. This is a 2.6x cost reduction compared with G6e's $2.06, driven by G7e's lower hourly rate and its ability to sustain consistent throughput under load, and it shows that high performance does not have to come at a high cost.
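
The relationship between hourly price, throughput, and cost per million tokens can be sketched as follows. The hourly rates and per-million costs are the ones quoted above; the throughput values the code prints are back-calculated from those figures and are not measurements reported in the benchmark.

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """USD per one million output tokens at a sustained aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


def implied_tokens_per_second(hourly_rate_usd: float, cost_per_million_usd: float) -> float:
    """Invert the formula above to get the throughput implied by a quoted cost."""
    return hourly_rate_usd / cost_per_million_usd * 1_000_000 / 3600


# (hourly rate, cost per million output tokens) as quoted in the article.
QUOTED = {
    "ml.g6e.12xlarge": (13.12, 2.06),
    "ml.g7e.2xlarge": (4.20, 0.79),
}

for instance, (rate, cost) in QUOTED.items():
    tps = implied_tokens_per_second(rate, cost)
    print(f"{instance}: ${cost:.2f}/M tokens implies roughly {tps:.0f} output tokens/s")

reduction = QUOTED["ml.g6e.12xlarge"][1] / QUOTED["ml.g7e.2xlarge"][1]
print(f"Cost reduction: {reduction:.1f}x")
```

Read this way, the saving comes mostly from the hourly rate: the single-GPU G7e instance costs about a third as much per hour while sustaining aggregate throughput in the same range as the four-GPU G6e instance.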

The Future of Cost-Effective Generative AI Inference

The introduction of G7e instances on Amazon SageMaker AI is more than an incremental upgrade; it is a strategic move by AWS to democratize access to high-performance generative AI. By combining the raw power of NVIDIA RTX PRO 6000 Blackwell GPUs with SageMaker's scalability and management capabilities, AWS enables organizations of all sizes to deploy larger and more complex AI models more efficiently and at lower cost. This development helps ensure that advances in generative AI can be turned into practical, production-ready applications across a wide range of industries, reinforcing SageMaker AI's position as a leading platform for AI innovation.

Original source

https://aws.amazon.com/blogs/machine-learning/accelerate-generative-ai-inference-on-amazon-sagemaker-ai-with-g7e-instances/

Frequently Asked Questions

What are G7e instances and how do they benefit generative AI inference?

G7e instances are the latest generation of GPU-accelerated computing instances available on Amazon SageMaker AI, specifically designed to accelerate generative AI inference workloads. They are powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, offering significant advancements in memory capacity, bandwidth, and overall inference performance. For generative AI, G7e instances mean faster Time To First Token (TTFT), higher throughput, and the ability to host much larger foundation models (FMs) within a single instance, or even on a single GPU. This translates into more responsive AI applications, reduced operational complexity, and substantial cost savings for deploying and running large language models (LLMs), multimodal AI, and agentic workflows. Their enhanced capabilities make them ideal for interactive applications requiring high-performance, cost-effective inference.

Which NVIDIA GPU powers the new G7e instances, and what are its key features?

The new G7e instances on Amazon SageMaker AI are powered by the NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs. Each of these cutting-edge GPUs provides an impressive 96 GB of GDDR7 memory, which is double the memory capacity per GPU compared to the previous G6e instances. Key features also include 1,597 GB/s of GPU memory bandwidth per GPU, support for FP4 precision through fifth-generation Tensor Cores, and NVIDIA GPUDirect RDMA over EFAv4. These features collectively contribute to the G7e instances' superior inference performance, memory density, and low-latency networking, making them exceptionally capable for demanding generative AI tasks.

How do G7e instances compare to previous generations (G6e, G5) in terms of performance and memory?

G7e instances demonstrate a significant generational leap over G6e and G5. They deliver up to 2.3x faster inference than G6e instances. In terms of memory, each G7e GPU offers 96 GB of GDDR7 memory, doubling the per-GPU memory of G6e and quadrupling that of G5. A top-tier G7e.48xlarge instance provides 768 GB of total GPU memory. Furthermore, networking bandwidth scales up to 1,600 Gbps with EFA on the largest G7e size, a 4x jump over G6e and 16x over G5. This improvement in memory, bandwidth, and networking allows G7e instances to host models that previously required multi-node setups on older instances, simplifying deployment and reducing latency.

What types of generative AI workloads are best suited for deployment on G7e instances?

G7e instances are exceptionally well-suited for a broad range of modern generative AI workloads due to their high memory density, bandwidth, and advanced networking. These include: Chatbots and Conversational AI, ensuring low Time To First Token (TTFT) and high throughput for responsive interactive experiences; Agentic and Tool-Calling Workflows, benefiting from 4x improved CPU-to-GPU bandwidth for fast context injection in RAG pipelines; Text Generation, Summarization, and Long-Context Inference, accommodating large KV caches for extended document contexts with 96 GB per-GPU memory; Image Generation and Vision Models, overcoming out-of-memory errors for larger multimodal models that struggled on previous instances; and Physical AI and Scientific Computing, leveraging Blackwell-generation compute, FP4 support, and spatial computing capabilities for digital twins and 3D simulation.

What is the cost efficiency of G7e instances compared to G6e for generative AI inference?

G7e instances offer significantly improved cost efficiency for generative AI inference compared to G6e instances. Benchmarks deploying Qwen3-32B showed that G7e achieved $0.79 per million output tokens at production concurrency (C=32). This represents a remarkable 2.6x cost reduction compared to G6e’s $2.06 per million output tokens for a similar workload. This cost saving is primarily driven by G7e’s substantially lower hourly rate (e.g., $4.20/hr for ml.g7e.2xlarge vs. $13.12/hr for ml.g6e.12xlarge) combined with its ability to maintain consistent and high throughput under load, making it a more economical choice for large-scale deployments.

What are the memory capacities for deploying LLMs on single and multi-GPU G7e instances?

G7e instances offer substantial memory capacities for deploying large language models (LLMs). A single-GPU node, specifically a G7e.2xlarge instance, can host foundation models with up to 35 billion parameters in FP16 precision. For larger models, scaling across multiple GPUs within a single instance increases capacity: a 4-GPU node (G7e.24xlarge) can deploy models of up to 150 billion parameters, while an 8-GPU node (G7e.48xlarge) can handle models as large as 300 billion parameters. This scalability gives organizations the flexibility to deploy a wide range of LLMs without the complexities of multi-instance distributed setups.

What are the prerequisites for deploying solutions using G7e instances on Amazon SageMaker AI?

To deploy generative AI solutions using G7e instances on Amazon SageMaker AI, several prerequisites must be met. You need an active AWS account to host your resources and an AWS Identity and Access Management (IAM) role configured with appropriate permissions to access Amazon SageMaker AI services. For development and deployment, access to Amazon SageMaker Studio or a SageMaker notebook instance is recommended, though other interactive development environments like PyCharm or Visual Studio Code are also viable. Crucially, you must request a quota for at least one `ml.g7e.2xlarge` instance (or a larger G7e instance type) for Amazon SageMaker AI endpoint usage through the AWS Service Quotas console, as these are new and specialized instance types.
