
Generative AI Inference: Accelerating on SageMaker with G7e Instances

Amazon SageMaker AI G7e instances accelerating generative AI inference with NVIDIA RTX PRO 6000 Blackwell GPUs.

G7e Instances: A New Era for AI Inference on SageMaker

The landscape of generative AI is evolving at an unprecedented pace, driving a continuous demand for more powerful, flexible, and cost-effective infrastructure. Today, Code Velocity is excited to report on a significant advancement from AWS: the general availability of G7e instances on Amazon SageMaker AI. Powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, these new instances are set to redefine the benchmarks for generative AI inference, offering developers and enterprises unparalleled performance and memory capacity.

Amazon SageMaker AI is a fully managed service that provides developers and data scientists with the tools to build, train, and deploy machine learning models at scale. The introduction of G7e instances marks a pivotal moment for generative AI workloads on this platform. These instances leverage the cutting-edge NVIDIA RTX PRO 6000 Blackwell GPUs, each boasting an impressive 96 GB of GDDR7 memory. This substantial memory increase allows for the deployment of significantly larger foundation models (FMs) directly on SageMaker AI, addressing a critical need for advanced AI applications.

Organizations can now deploy models like GPT-OSS-120B, Nemotron-3-Super-120B-A12B (NVFP4 variant), and Qwen3.5-35B-A3B with remarkable efficiency. The G7e.2xlarge instance, featuring a single GPU, can host 35B parameter models, while the G7e.48xlarge, with eight GPUs, scales up to 300B parameter models. This flexibility translates into tangible benefits: reduced operational complexity, lower latency, and substantial cost savings for inference workloads.
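
To make that sizing concrete, here is a minimal back-of-the-envelope sketch in Python. The per-instance memory figures come from the specifications in this article; the 2-bytes-per-parameter assumption corresponds to FP16/BF16 weights, and the 1.2x overhead factor for activations and serving buffers is an illustrative assumption, not an AWS-published number.

```python
# Rough sizing sketch: does a model's weight footprint fit in GPU memory?
# Instance memory figures are from the article; the overhead factor is
# an illustrative assumption, not an AWS figure.

GPU_MEMORY_GB = {
    "ml.g7e.2xlarge": 96,    # 1x RTX PRO 6000 Blackwell
    "ml.g7e.24xlarge": 384,  # 4 GPUs
    "ml.g7e.48xlarge": 768,  # 8 GPUs
}

def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight footprint: FP16/BF16 ~= 2 bytes, NVFP4 ~= 0.5 bytes."""
    return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 = GB

def fits(params_billion: float, instance: str, overhead: float = 1.2) -> bool:
    """Apply a hypothetical 1.2x margin for activations and serving buffers."""
    return weights_gb(params_billion) * overhead <= GPU_MEMORY_GB[instance]

# A 35B model in BF16 (~70 GB of weights) fits on a single 96 GB GPU:
print(fits(35, "ml.g7e.2xlarge"))    # True
# A 300B model in BF16 (~600 GB) needs the 8-GPU ml.g7e.48xlarge:
print(fits(300, "ml.g7e.48xlarge"))  # True
```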

Unpacking the Generational Performance Leap of G7e

G7e instances represent a monumental leap over their predecessors, G6e and G5, delivering up to 2.3x faster inference than G6e. The technical specifications underscore this generational advance. Each G7e GPU provides 1,597 GB/s of memory bandwidth, and its 96 GB of GDDR7 effectively doubles the per-GPU memory of G6e and quadruples that of G5. Networking is dramatically enhanced as well, scaling up to 1,600 Gbps with EFA on the largest G7e size: a 4x increase over G6e and 16x over G5 that unlocks low-latency multi-node inference and fine-tuning scenarios previously deemed impractical.

Here's a comparison highlighting the progression across generations at the 8-GPU tier:

| Spec | G5 (g5.48xlarge) | G6e (g6e.48xlarge) | G7e (g7e.48xlarge) |
| --- | --- | --- | --- |
| GPU | 8x NVIDIA A10G | 8x NVIDIA L40S | 8x NVIDIA RTX PRO 6000 Blackwell |
| GPU Memory per GPU | 24 GB GDDR6 | 48 GB GDDR6 | 96 GB GDDR7 |
| Total GPU Memory | 192 GB | 384 GB | 768 GB |
| GPU Memory Bandwidth | 600 GB/s per GPU | 864 GB/s per GPU | 1,597 GB/s per GPU |
| vCPUs | 192 | 192 | 192 |
| System Memory | 768 GiB | 1,536 GiB | 2,048 GiB |
| Network Bandwidth | 100 Gbps | 400 Gbps | 1,600 Gbps (EFA) |
| Local NVMe Storage | 7.6 TB | 7.6 TB | 15.2 TB |
| Relative Inference Performance vs. G6e | — | 1x (baseline) | Up to 2.3x |

With a colossal 768 GB of aggregate GPU memory on a single G7e instance, models that once necessitated complex multi-node configurations on older instances can now be deployed with remarkable simplicity. This significantly reduces inter-node latency and operational overhead. Coupled with support for FP4 precision via fifth-generation Tensor Cores and NVIDIA GPUDirect RDMA over EFAv4, G7e instances are unequivocally designed for demanding LLM, multimodal AI, and sophisticated agentic inference workflows on AWS.

Diverse Generative AI Use Cases Thrive on G7e

The robust combination of memory density, bandwidth, and advanced networking capabilities makes G7e instances ideal for a wide spectrum of contemporary generative AI workloads. From enhancing conversational AI to powering complex physical simulations, G7e offers tangible advantages:

  • Chatbots and Conversational AI: The low Time To First Token (TTFT) and high throughput of G7e instances ensure responsive and seamless interactive experiences, even when faced with heavy concurrent user loads. This is crucial for maintaining user engagement and satisfaction in real-time AI interactions.
  • Agentic and Tool-Calling Workflows: For Retrieval Augmented Generation (RAG) pipelines and agentic systems, fast context injection from retrieval stores is paramount. The 4x improvement in CPU-to-GPU bandwidth within G7e instances makes them exceptionally effective for these critical operations, enabling more intelligent and dynamic AI agents.
  • Text Generation, Summarization, and Long-Context Inference: With 96 GB of per-GPU memory, G7e instances adeptly handle large Key-Value (KV) caches. This allows for extended document contexts, significantly reducing the need for text truncation and facilitating richer, more nuanced reasoning over vast inputs (a back-of-the-envelope KV-cache sizing sketch follows this list).
  • Image Generation and Vision Models: Where previous-generation instances frequently encountered out-of-memory errors with larger multimodal models, G7e's doubled memory capacity gracefully resolves these limitations, paving the way for more sophisticated and higher-resolution image and vision AI applications.
  • Physical AI and Scientific Computing: Beyond traditional generative AI, G7e's Blackwell-generation compute, FP4 support, and spatial computing capabilities (including DLSS 4.0 and 4th-gen RT cores) extend its utility to digital twins, 3D simulation, and advanced physical AI model inference, opening new frontiers in scientific research and industrial applications.
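
To make the long-context point concrete, here is the KV-cache estimate referenced above. The formula is the standard one for transformer attention caches; the hyperparameters in the example are illustrative, not taken from any specific model in this article.

```python
# Back-of-the-envelope KV-cache sizing for long-context inference.
# The hyperparameters below are illustrative, not from the article.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Example: a hypothetical 48-layer model with 8 KV heads of dim 128 (GQA),
# BF16 cache, one 64K-token sequence -> ~12.6 GB of KV cache.
# That leaves headroom next to ~70 GB of BF16 weights for a 35B model on a
# 96 GB G7e GPU, whereas a 48 GB L40S could not even hold the weights.
print(kv_cache_gb(layers=48, kv_heads=8, head_dim=128,
                  seq_len=64_000, batch=1))  # ~12.58
```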

Streamlined Deployment and Performance Benchmarking

Deploying generative AI models on G7e instances via Amazon SageMaker AI is designed to be straightforward. AWS provides a sample notebook that streamlines the process. Prerequisites typically include an AWS account, an IAM role for SageMaker access, and either Amazon SageMaker Studio or a SageMaker notebook instance for the development environment. Importantly, users should request an appropriate quota for ml.g7e.2xlarge or larger instances for SageMaker AI endpoint usage via the Service Quotas console.
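
A minimal deployment sketch using the SageMaker Python SDK is shown below. The container image URI, model ID, and environment variables are placeholders; the sample notebook provides the exact values for your region and model server.

```python
# Minimal deployment sketch using the SageMaker Python SDK.
# IMAGE_URI and the model ID are placeholders; take the real values
# from the sample notebook for your region and serving container.
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

IMAGE_URI = "<vllm-or-lmi-container-uri-for-your-region>"  # placeholder

model = Model(
    image_uri=IMAGE_URI,
    role=role,
    env={
        "HF_MODEL_ID": "Qwen/Qwen3-32B",  # example model to serve
        "OPTION_MAX_MODEL_LEN": "8192",   # server-specific option (illustrative)
    },
    sagemaker_session=session,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g7e.2xlarge",  # single RTX PRO 6000 Blackwell GPU
    endpoint_name="qwen3-32b-g7e",
)
```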

To demonstrate the significant performance gains, AWS benchmarked Qwen3-32B (BF16) on both G6e and G7e instances. The workload involved approximately 1,000 input tokens and 560 output tokens per request, mimicking common document summarization tasks. Both configurations utilized the native vLLM container with prefix caching enabled, ensuring an apples-to-apples comparison.
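
For readers who want to reproduce a similar measurement, the sketch below drives a fixed-concurrency sweep against a SageMaker endpoint. The request body assumes a vLLM-style completions schema, which may differ from your container's actual API; treat this as a template rather than the exact harness AWS used.

```python
# Sketch of a fixed-concurrency benchmark against a SageMaker endpoint.
# The request/response schema is an assumption (vLLM-style completions);
# adjust it to match your serving container.
import json
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

runtime = boto3.client("sagemaker-runtime")
ENDPOINT = "qwen3-32b-g7e"  # endpoint name from the deployment sketch

def one_request(prompt: str) -> str:
    resp = runtime.invoke_endpoint(
        EndpointName=ENDPOINT,
        ContentType="application/json",
        Body=json.dumps({"prompt": prompt, "max_tokens": 560}),
    )
    body = json.loads(resp["Body"].read())
    return body.get("generated_text", "")  # key depends on the container

prompts = ["Summarize this document: ..."] * 32  # C=32, as in the benchmark
start = time.time()
with ThreadPoolExecutor(max_workers=32) as pool:
    outputs = list(pool.map(one_request, prompts))
print(f"{len(prompts)} requests completed in {time.time() - start:.1f}s")
```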

The results are compelling. While the G6e baseline (ml.g6e.12xlarge with 4x L40S GPUs at $13.12/hr) showed strong per-request throughput, the G7e (ml.g7e.2xlarge with 1x RTX PRO 6000 Blackwell at $4.20/hr) tells a dramatically different cost story. At production concurrency (C=32), G7e achieved an astonishing $0.79 per million output tokens. This represents a 2.6x cost reduction compared to G6e’s $2.06, driven by G7e’s lower hourly rate and its ability to maintain consistent throughput under load, proving that high performance doesn't have to come at a premium cost.
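
The cost arithmetic is easy to verify yourself. The throughput values below are back-solved from the article's published dollar figures, so they are illustrative rather than independently measured.

```python
# Cost per million output tokens = hourly price / tokens generated per hour.
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# ~1,477 aggregate output tokens/s is back-solved from the article's
# $0.79/M figure at $4.20/hr; it is illustrative, not a published benchmark.
print(round(cost_per_million_tokens(4.20, 1_477), 2))   # ~0.79 (ml.g7e.2xlarge)
# Likewise, G6e's $2.06/M at $13.12/hr implies ~1,769 output tokens/s:
print(round(cost_per_million_tokens(13.12, 1_769), 2))  # ~2.06 (ml.g6e.12xlarge)
```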

The Future of Cost-Efficient Generative AI Inference

The introduction of G7e instances on Amazon SageMaker AI is more than just an incremental upgrade; it's a strategic move by AWS to democratize access to high-performance generative AI. By combining the raw power of NVIDIA RTX PRO 6000 Blackwell GPUs with the scalability and management capabilities of SageMaker, AWS is empowering organizations of all sizes to deploy larger, more complex AI models with unprecedented efficiency and cost-effectiveness. This development ensures that the advancements in generative AI can be translated into practical, production-ready applications across a vast array of industries, solidifying SageMaker AI's position as a leading platform for AI innovation.

Frequently Asked Questions

What are G7e instances and how do they benefit generative AI inference?
G7e instances are the latest generation of GPU-accelerated computing instances available on Amazon SageMaker AI, specifically designed to accelerate generative AI inference workloads. They are powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, offering significant advancements in memory capacity, bandwidth, and overall inference performance. For generative AI, G7e instances mean faster Time To First Token (TTFT), higher throughput, and the ability to host much larger foundation models (FMs) within a single instance, or even on a single GPU. This translates into more responsive AI applications, reduced operational complexity, and substantial cost savings for deploying and running large language models (LLMs), multimodal AI, and agentic workflows. Their enhanced capabilities make them ideal for interactive applications requiring high-performance, cost-effective inference.
Which NVIDIA GPU powers the new G7e instances, and what are its key features?
The new G7e instances on Amazon SageMaker AI are powered by the NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs. Each of these cutting-edge GPUs provides an impressive 96 GB of GDDR7 memory, which is double the memory capacity per GPU compared to the previous G6e instances. Key features also include 1,597 GB/s of GPU memory bandwidth per GPU, support for FP4 precision through fifth-generation Tensor Cores, and NVIDIA GPUDirect RDMA over EFAv4. These features collectively contribute to the G7e instances' superior inference performance, memory density, and low-latency networking, making them exceptionally capable for demanding generative AI tasks.
How do G7e instances compare to previous generations (G6e, G5) in terms of performance and memory?
G7e instances demonstrate a significant generational leap over G6e and G5. They deliver up to 2.3x the inference performance of G6e instances. In terms of memory, each G7e GPU offers 96 GB of GDDR7 memory, effectively doubling the per-GPU memory of G6e and quadrupling that of G5. A top-tier G7e.48xlarge instance provides an aggregate of 768 GB total GPU memory. Furthermore, networking bandwidth scales up to 1,600 Gbps with EFA on the largest G7e size, a 4x jump over G6e and 16x over G5. This vast improvement in memory, bandwidth, and networking allows G7e instances to host models that previously required multi-node setups on older instances, simplifying deployment and reducing latency.
What types of generative AI workloads are best suited for deployment on G7e instances?
G7e instances are exceptionally well-suited for a broad range of modern generative AI workloads due to their high memory density, bandwidth, and advanced networking. These include: Chatbots and Conversational AI, ensuring low Time To First Token (TTFT) and high throughput for responsive interactive experiences; Agentic and Tool-Calling Workflows, benefiting from 4x improved CPU-to-GPU bandwidth for fast context injection in RAG pipelines; Text Generation, Summarization, and Long-Context Inference, accommodating large KV caches for extended document contexts with 96 GB per-GPU memory; Image Generation and Vision Models, overcoming out-of-memory errors for larger multimodal models that struggled on previous instances; and Physical AI and Scientific Computing, leveraging Blackwell-generation compute, FP4 support, and spatial computing capabilities for digital twins and 3D simulation.
What is the cost efficiency of G7e instances compared to G6e for generative AI inference?
G7e instances offer significantly improved cost efficiency for generative AI inference compared to G6e instances. Benchmarks deploying Qwen3-32B showed that G7e achieved $0.79 per million output tokens at production concurrency (C=32). This represents a remarkable 2.6x cost reduction compared to G6e’s $2.06 per million output tokens for a similar workload. This cost saving is primarily driven by G7e’s substantially lower hourly rate (e.g., $4.20/hr for ml.g7e.2xlarge vs. $13.12/hr for ml.g6e.12xlarge) combined with its ability to maintain consistent and high throughput under load, making it a more economical choice for large-scale deployments.
What are the memory capacities for deploying LLMs on single and multi-GPU G7e instances?
G7e instances offer substantial memory capacities for deploying large language models (LLMs). A single-GPU instance, the G7e.2xlarge, can host foundation models with up to 35 billion parameters in FP16 precision. For larger models, scaling across multiple GPUs within a single instance dramatically increases capacity: a 4-GPU node (G7e.24xlarge) can deploy models up to 150 billion parameters, while an 8-GPU node (G7e.48xlarge) can handle models as large as 300 billion parameters. This scalability gives organizations the flexibility to deploy a wide range of LLMs without the complexity of multi-instance distributed setups.
What are the prerequisites for deploying solutions using G7e instances on Amazon SageMaker AI?
To deploy generative AI solutions using G7e instances on Amazon SageMaker AI, several prerequisites must be met. You need an active AWS account to host your resources and an AWS Identity and Access Management (IAM) role configured with appropriate permissions to access Amazon SageMaker AI services. For development and deployment, access to Amazon SageMaker Studio or a SageMaker notebook instance is recommended, though other interactive development environments like PyCharm or Visual Studio Code are also viable. Crucially, you must request a quota for at least one `ml.g7e.2xlarge` instance (or a larger G7e instance type) for Amazon SageMaker AI endpoint usage through the AWS Service Quotas console, as these are new and specialized instance types.
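
If you prefer to script the quota request instead of using the console, the Service Quotas API supports this via boto3. The name-matching heuristic and the quota code below are placeholders; look up the real quota code for your account first.

```python
# Request a SageMaker endpoint quota increase for ml.g7e.2xlarge via the
# Service Quotas API. The quota code is a placeholder: print the real code
# first, since codes differ per quota.
import boto3

sq = boto3.client("service-quotas")

# Find the quota code for the ml.g7e.2xlarge endpoint-usage quota.
# The substring match on the quota name is a heuristic, not guaranteed.
paginator = sq.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if "g7e.2xlarge" in quota["QuotaName"] and "endpoint" in quota["QuotaName"].lower():
            print(quota["QuotaName"], quota["QuotaCode"], quota["Value"])

sq.request_service_quota_increase(
    ServiceCode="sagemaker",
    QuotaCode="L-XXXXXXXX",  # placeholder: use the code printed above
    DesiredValue=1.0,
)
```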
