Generative AI Inference: Accelerating SageMaker with G7e Instances

G7e Instances: A New Era of AI Inference on SageMaker

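The generative AI landscape is evolving at an unprecedented pace, driving sustained demand for more powerful, flexible, and cost-effective infrastructure. Today, Code Velocity is pleased to report a significant development from AWS: the general availability of G7e instances on Amazon SageMaker AI. Powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, these new instances raise the bar for generative AI inference and give developers and enterprises substantially more performance and memory capacity.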

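Amazon SageMaker AI is a fully managed service that gives developers and data scientists the tools to build, train, and deploy machine learning models at scale. The arrival of G7e instances is a pivotal moment for generative AI workloads on the platform. The instances use NVIDIA RTX PRO 6000 Blackwell GPUs, each with 96 GB of GDDR7 memory. This much larger memory capacity makes it possible to deploy larger foundation models (FMs) directly on SageMaker AI, which addresses a critical need for advanced AI applications.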

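Organizations can now deploy models such as GPT-OSS-120B, Nemotron-3-Super-120B-A12B (NVFP4 variant), and Qwen3.5-35B-A3B efficiently. A single-GPU G7e.2xlarge instance can host models with up to 35 billion parameters, and the 8-GPU G7e.48xlarge scales to models with up to 300 billion parameters. This flexibility translates into concrete benefits: less operational complexity, lower latency, and significant cost savings for inference workloads.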
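The parameter limits above can be sanity-checked with simple arithmetic. The following is a minimal sketch, assuming 2 bytes per parameter (FP16/BF16) and counting only the model weights; the helper names are hypothetical, and KV cache, activations, and runtime overhead are deliberately left out, so real headroom is smaller than the numbers suggest.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# ASSUMPTION: 2 bytes per parameter (FP16/BF16). KV cache, activations and
# runtime overhead are NOT included, so real headroom is smaller.

GPU_MEMORY_GB = 96  # per-GPU memory on G7e, as stated in the article


def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param


def fits(params_billions: float, num_gpus: int, bytes_per_param: float = 2.0) -> bool:
    """True if the weights alone fit in the combined GPU memory."""
    needed = weight_memory_gb(params_billions, bytes_per_param)
    return needed <= num_gpus * GPU_MEMORY_GB


for params, gpus in [(35, 1), (300, 8)]:
    need = weight_memory_gb(params)
    have = gpus * GPU_MEMORY_GB
    print(f"{params}B params on {gpus} GPU(s): ~{need:.0f} GB of weights, {have} GB available")
```

Under these assumptions a 35-billion-parameter model needs about 70 GB of the 96 GB on one GPU, and a 300-billion-parameter model needs about 600 GB of the 768 GB on eight GPUs, which is consistent with the limits quoted above.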

Unpacking G7e's Generational Performance Gains

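G7e instances are a large step up from their predecessors, G6e and G5, delivering up to 2.3x faster inference than G6e. The technical specifications back up this generational progress. Each G7e GPU provides 1,597 GB/s of memory bandwidth and 96 GB of memory, which is double the per-GPU memory of G6e and four times that of G5. Networking is also much stronger: the largest G7e size scales to 1,600 Gbps with EFA. That is 4x the bandwidth of G6e and 16x that of G5, and it opens up low-latency multi-node inference and fine-tuning scenarios that were previously considered impractical.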

The following table compares the three generations at the 8-GPU tier.

Specification	G5 (g5.48xlarge)	G6e (g6e.48xlarge)	G7e (g7e.48xlarge)
GPU	8x NVIDIA A10G	8x NVIDIA L40S	8x NVIDIA RTX PRO 6000 Blackwell
GPU memory per GPU	24 GB GDDR6	48 GB GDDR6	96 GB GDDR7
Total GPU memory	192 GB	384 GB	768 GB
GPU memory bandwidth	600 GB/s per GPU	864 GB/s per GPU	1,597 GB/s per GPU
vCPUs	192	192	192
System memory	768 GiB	1,536 GiB	2,048 GiB
Network bandwidth	100 Gbps	400 Gbps	1,600 Gbps (EFA)
Local NVMe storage	7.6 TB	7.6 TB	15.2 TB
Inference performance vs. G6e	Baseline	~1x	Up to 2.3x

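With 768 GB of total GPU memory in a single G7e instance, models that once required complex multi-node configurations on older instances can now be deployed on one node. This removes inter-node latency and cuts operational overhead considerably. Combined with FP4 precision support on fifth-generation Tensor Cores and NVIDIA GPUDirect RDMA over EFAv4, G7e instances are designed specifically for demanding LLMs, multimodal AI, and advanced agentic inference workflows.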
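The generational ratios quoted in this section can be re-derived from the table. The sketch below simply copies the per-GPU memory, memory bandwidth, and network figures from the table and divides them; the dictionary layout and the `ratio` helper are illustrative choices, not part of any AWS tooling.

```python
# Figures copied from the 8-GPU comparison table in this article.
specs = {
    "G5":  {"gpu_mem_gb": 24, "mem_bw_gbs": 600,  "net_gbps": 100},
    "G6e": {"gpu_mem_gb": 48, "mem_bw_gbs": 864,  "net_gbps": 400},
    "G7e": {"gpu_mem_gb": 96, "mem_bw_gbs": 1597, "net_gbps": 1600},
}


def ratio(metric: str, newer: str = "G7e", older: str = "G6e") -> float:
    """How many times larger `metric` is on `newer` than on `older`."""
    return specs[newer][metric] / specs[older][metric]


for older in ("G6e", "G5"):
    for metric in ("gpu_mem_gb", "mem_bw_gbs", "net_gbps"):
        print(f"G7e vs {older} {metric}: {ratio(metric, older=older):.2f}x")
```

The per-GPU memory and network ratios come out at exactly 2x and 4x against G6e, and 4x and 16x against G5, matching the text. The memory bandwidth ratio against G6e is smaller, a little under 2x.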

Diverse Generative AI Use Cases That Thrive on G7e

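The combination of memory density, bandwidth, and advanced networking makes G7e instances a good fit for a wide range of modern generative AI workloads. From conversational AI to complex physical simulation, G7e offers clear advantages.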

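Chatbots and conversational AI: The low Time To First Token (TTFT) and high throughput of G7e instances keep interactive experiences responsive even under heavy concurrent user load. This is essential for maintaining user engagement and satisfaction in real-time AI interactions.
Agentic and tool-calling workflows: For Retrieval Augmented Generation (RAG) pipelines and agentic systems, fast context injection from retrieval stores is critical. The 4x improvement in CPU-to-GPU bandwidth within G7e instances makes them well suited to these operations and enables more capable, dynamic AI agents.
Text generation, summarization, and long-context inference: With 96 GB of memory per GPU, G7e instances can hold large key-value (KV) caches. This allows longer document contexts, greatly reduces the need to truncate text, and supports richer reasoning over very large inputs.
Image generation and vision models: Earlier instance generations frequently hit out-of-memory errors with larger multimodal models. G7e doubles the per-GPU memory of G6e, which removes these limits and opens the way to more sophisticated, higher-resolution image and vision AI applications.
Physical AI and scientific computing: Beyond conventional generative AI, G7e's Blackwell-generation compute, FP4 support, and spatial computing features (including DLSS 4.0 and fourth-generation RT Cores) extend its use to digital twins, 3D simulation, and advanced physical AI model inference, with applications in scientific research and industry.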
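To make the KV cache point in the long-context item more concrete, here is a back-of-the-envelope sizing sketch. The model shape (64 layers, 8 KV heads, head dimension 128) and the 32-billion-parameter BF16 weight size are hypothetical values chosen for illustration, not figures from the article, and the calculation ignores activations and runtime overhead.

```python
# Back-of-the-envelope KV cache sizing for a single GPU.
# ASSUMPTIONS (not from the article): the model shape and weight size below
# are hypothetical; activations and serving overhead are ignored.

GPU_MEMORY_BYTES = 96 * 10**9  # 96 GB per G7e GPU, as stated in the article


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(weight_bytes: int, per_token_bytes: int,
                      gpu_bytes: int = GPU_MEMORY_BYTES) -> int:
    """Tokens that fit in the memory left over after loading the weights."""
    free_bytes = max(gpu_bytes - weight_bytes, 0)
    return free_bytes // per_token_bytes


# Hypothetical 32B-parameter model stored at 2 bytes per parameter.
per_token = kv_bytes_per_token(num_layers=64, num_kv_heads=8, head_dim=128)
weights = 32 * 10**9 * 2
tokens = max_cached_tokens(weights, per_token)
print(f"{per_token / 1024:.0f} KiB per token, room for about {tokens:,} cached tokens")
```

The cached-token budget is shared across all concurrent requests, so the same arithmetic also gives a rough feel for how context length trades off against concurrency on one GPU.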

Streamlined Deployment and Performance Benchmarking

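Deploying generative AI models to G7e instances through Amazon SageMaker AI is designed to be simple. Users can access a sample notebook that streamlines the process. Prerequisites typically include an AWS account, an IAM role for SageMaker access, and Amazon SageMaker Studio or a SageMaker notebook instance as the development environment. Importantly, users must request an appropriate quota for ml.g7e.2xlarge or a larger instance for SageMaker AI endpoint usage through the Service Quotas console.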
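As a rough illustration of what the endpoint side of a deployment might look like, the sketch below builds the arguments for the boto3 SageMaker `create_endpoint_config` call and then creates the endpoint. It is not the sample notebook's code. It assumes a SageMaker model has already been registered under the given name, that the quota described above has been granted, and that AWS credentials are configured; the model name, endpoint name, and the `-config` suffix are hypothetical.

```python
# Minimal sketch, not the official sample notebook.
# ASSUMPTIONS: a SageMaker model named `model_name` already exists, the G7e
# endpoint quota has been granted, and AWS credentials are configured.


def build_endpoint_config(model_name: str,
                          instance_type: str = "ml.g7e.2xlarge",
                          instance_count: int = 1) -> dict:
    """Keyword arguments for the SageMaker create_endpoint_config call."""
    return {
        "EndpointConfigName": f"{model_name}-config",  # hypothetical naming
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
            }
        ],
    }


def create_endpoint(model_name: str, endpoint_name: str) -> None:
    """Create the endpoint config and the endpoint (requires AWS access)."""
    import boto3  # imported lazily so the builder works without AWS access

    sagemaker = boto3.client("sagemaker")
    config = build_endpoint_config(model_name)
    sagemaker.create_endpoint_config(**config)
    sagemaker.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config["EndpointConfigName"],
    )


if __name__ == "__main__":
    # Prints the request only; call create_endpoint(...) to actually deploy.
    print(build_endpoint_config("demo-llm"))
```

Keeping the request builder separate from the AWS call makes it easy to review the instance type and count before anything billable is created.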

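To demonstrate the performance gains, AWS benchmarked Qwen3-32B (BF16) on both G6e and G7e instances. The workload used roughly 1,000 input tokens and 560 output tokens per request, mimicking a typical document summarization task. Both configurations used the native vLLM container with prefix caching enabled to keep the comparison fair.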

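The results are compelling. The G6e baseline (ml.g6e.12xlarge with 4x L40S GPUs, $13.12 per hour) showed strong per-request throughput, but G7e (ml.g7e.2xlarge with 1x RTX PRO 6000 Blackwell, $4.20 per hour) tells a very different story on cost. At production concurrency (C=32), G7e achieved $0.79 per million output tokens. Compared with $2.06 for G6e, that is a 2.6x cost reduction, driven by G7e's lower hourly rate and its ability to sustain consistent throughput under load. High performance does not have to come at a high price.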
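The cost figures above follow from a simple relationship between hourly price and sustained output throughput. The sketch below shows that arithmetic using only the prices and per-million-token costs quoted in the article. The throughput values it prints are back-calculated from those numbers and are not reported benchmark results; the function names are hypothetical.

```python
# Cost per million output tokens = hourly price / output tokens per hour * 1e6.
# Inputs are the prices and costs quoted in the article; the throughputs
# printed below are IMPLIED by those numbers, not measured values.

SECONDS_PER_HOUR = 3600


def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """USD per one million output tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR
    return hourly_rate_usd / tokens_per_hour * 1_000_000


def implied_tokens_per_second(hourly_rate_usd: float, cost_per_million_usd: float) -> float:
    """Sustained output throughput implied by a price and a cost per million."""
    return hourly_rate_usd / cost_per_million_usd * 1_000_000 / SECONDS_PER_HOUR


g7e_tps = implied_tokens_per_second(4.20, 0.79)   # ml.g7e.2xlarge
g6e_tps = implied_tokens_per_second(13.12, 2.06)  # ml.g6e.12xlarge

print(f"G7e implied throughput: {g7e_tps:.0f} output tokens/s")
print(f"G6e implied throughput: {g6e_tps:.0f} output tokens/s")
print(f"Cost reduction: {2.06 / 0.79:.1f}x")
```

Read this way, the four-GPU G6e configuration still produces somewhat more tokens per second in aggregate, but at more than three times the hourly price, which is why the single-GPU G7e comes out ahead per token.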

The Future of Cost-Effective Generative AI Inference

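The introduction of G7e instances on Amazon SageMaker AI is more than an incremental upgrade. It is a strategic move by AWS to broaden access to high-performance generative AI. By combining the raw power of NVIDIA RTX PRO 6000 Blackwell GPUs with SageMaker's scalability and management capabilities, AWS is helping organizations of every size deploy larger, more complex AI models with greater efficiency and lower cost. This development helps turn advances in generative AI into practical, production-ready applications across a wide range of industries, and it strengthens SageMaker AI's position as a leading platform for AI innovation.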

Original Source

https://aws.amazon.com/blogs/machine-learning/accelerate-generative-ai-inference-on-amazon-sagemaker-ai-with-g7e-instances/

Frequently Asked Questions

What are G7e instances and how do they benefit generative AI inference?

G7e instances are the latest generation of GPU-accelerated computing instances available on Amazon SageMaker AI, specifically designed to accelerate generative AI inference workloads. They are powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, offering significant advancements in memory capacity, bandwidth, and overall inference performance. For generative AI, G7e instances mean faster Time To First Token (TTFT), higher throughput, and the ability to host much larger foundation models (FMs) within a single instance, or even on a single GPU. This translates into more responsive AI applications, reduced operational complexity, and substantial cost savings for deploying and running large language models (LLMs), multimodal AI, and agentic workflows. Their enhanced capabilities make them ideal for interactive applications requiring high-performance, cost-effective inference.

Which NVIDIA GPU powers the new G7e instances, and what are its key features?

The new G7e instances on Amazon SageMaker AI are powered by the NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs. Each of these cutting-edge GPUs provides an impressive 96 GB of GDDR7 memory, which is double the memory capacity per GPU compared to the previous G6e instances. Key features also include 1,597 GB/s of GPU memory bandwidth per GPU, support for FP4 precision through fifth-generation Tensor Cores, and NVIDIA GPUDirect RDMA over EFAv4. These features collectively contribute to the G7e instances' superior inference performance, memory density, and low-latency networking, making them exceptionally capable for demanding generative AI tasks.

How do G7e instances compare to previous generations (G6e, G5) in terms of performance and memory?

G7e instances demonstrate a significant generational leap over G6e and G5. They deliver up to 2.3x inference performance compared to G6e instances. In terms of memory, each G7e GPU offers 96 GB of GDDR7 memory, effectively doubling the per-GPU memory of G6e and quadrupling that of G5. A top-tier G7e.48xlarge instance provides an aggregate of 768 GB total GPU memory. Furthermore, networking bandwidth scales up to 1,600 Gbps with EFA on the largest G7e size, a 4x jump over G6e and 16x over G5. This vast improvement in memory, bandwidth, and networking allows G7e instances to host models that previously required multi-node setups on older instances, simplifying deployment and reducing latency.

What types of generative AI workloads are best suited for deployment on G7e instances?

G7e instances are exceptionally well-suited for a broad range of modern generative AI workloads due to their high memory density, bandwidth, and advanced networking. These include: Chatbots and Conversational AI, ensuring low Time To First Token (TTFT) and high throughput for responsive interactive experiences; Agentic and Tool-Calling Workflows, benefiting from 4x improved CPU-to-GPU bandwidth for fast context injection in RAG pipelines; Text Generation, Summarization, and Long-Context Inference, accommodating large KV caches for extended document contexts with 96 GB per-GPU memory; Image Generation and Vision Models, overcoming out-of-memory errors for larger multimodal models that struggled on previous instances; and Physical AI and Scientific Computing, leveraging Blackwell-generation compute, FP4 support, and spatial computing capabilities for digital twins and 3D simulation.

What is the cost efficiency of G7e instances compared to G6e for generative AI inference?

G7e instances offer significantly improved cost efficiency for generative AI inference compared to G6e instances. Benchmarks deploying Qwen3-32B showed that G7e achieved $0.79 per million output tokens at production concurrency (C=32). This represents a remarkable 2.6x cost reduction compared to G6e’s $2.06 per million output tokens for a similar workload. This cost saving is primarily driven by G7e’s substantially lower hourly rate (e.g., $4.20/hr for ml.g7e.2xlarge vs. $13.12/hr for ml.g6e.12xlarge) combined with its ability to maintain consistent and high throughput under load, making it a more economical choice for large-scale deployments.

What are the memory capacities for deploying LLMs on single and multi-GPU G7e instances?

G7e instances offer substantial memory capacities for deploying large language models (LLMs). A single-GPU node, specifically a G7e.2xlarge instance, can host foundation models with up to 35 billion parameters in FP16 precision. For larger models, scaling across multiple GPUs within a single instance increases capacity: a 4-GPU node (G7e.24xlarge) can deploy models of up to 150 billion parameters, while an 8-GPU node (G7e.48xlarge) can handle models as large as 300 billion parameters. This scalability gives organizations the flexibility to deploy a wide range of LLMs without the complexity of multi-instance distributed setups.

What are the prerequisites for deploying solutions using G7e instances on Amazon SageMaker AI?

To deploy generative AI solutions using G7e instances on Amazon SageMaker AI, several prerequisites must be met. You need an active AWS account to host your resources and an AWS Identity and Access Management (IAM) role configured with appropriate permissions to access Amazon SageMaker AI services. For development and deployment, access to Amazon SageMaker Studio or a SageMaker notebook instance is recommended, though other interactive development environments like PyCharm or Visual Studio Code are also viable. Crucially, you must request a quota for at least one `ml.g7e.2xlarge` instance (or a larger G7e instance type) for Amazon SageMaker AI endpoint usage through the AWS Service Quotas console, as these are new and specialized instance types.