G7e Instances: A New Era for AI Inference on SageMaker
The landscape of generative AI is evolving at an unprecedented pace, driving a continuous demand for more powerful, flexible, and cost-effective infrastructure. Today, Code Velocity is excited to report on a significant advancement from AWS: the general availability of G7e instances on Amazon SageMaker AI. Powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, these new instances are set to redefine the benchmarks for generative AI inference, offering developers and enterprises unparalleled performance and memory capacity.
Amazon SageMaker AI is a fully managed service that provides developers and data scientists with the tools to build, train, and deploy machine learning models at scale. The introduction of G7e instances marks a pivotal moment for generative AI workloads on this platform. These instances leverage the cutting-edge NVIDIA RTX PRO 6000 Blackwell GPUs, each boasting an impressive 96 GB of GDDR7 memory. This substantial memory increase allows for the deployment of significantly larger foundation models (FMs) directly on SageMaker AI, addressing a critical need for advanced AI applications.
Organizations can now deploy models like GPT-OSS-120B, Nemotron-3-Super-120B-A12B (NVFP4 variant), and Qwen3.5-35B-A3B with remarkable efficiency. The G7e.2xlarge instance, featuring a single GPU, can host 35B parameter models, while the G7e.48xlarge, with eight GPUs, scales up to 300B parameter models. This flexibility translates into tangible benefits: reduced operational complexity, lower latency, and substantial cost savings for inference workloads.
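As a rough illustration of why the 96 GB per-GPU figure matters, the back-of-envelope sketch below (our own sizing heuristic, not from the AWS post) estimates the weight footprint of a dense model at different precisions:

```python
# Back-of-envelope check: does a model's weight footprint fit in GPU memory
# at a given precision, leaving headroom for KV cache and activations?

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}

def weight_memory_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight footprint in GB for a dense model."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

# A 35B-parameter model in BF16 needs ~70 GB of weights, fitting in the
# 96 GB of a single G7e GPU with ~26 GB left for KV cache and activations.
print(weight_memory_gb(35, "bf16"))   # ~70.0
# A 120B model quantized to FP4 needs ~60 GB, also a single-GPU candidate.
print(weight_memory_gb(120, "fp4"))   # ~60.0
```

Whatever headroom remains after the weights is what the KV cache and activations draw from, which is why higher-precision deployments of the same model may still call for the multi-GPU G7e sizes.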
Unpacking the Generational Performance Leap of G7e
G7e instances represent a substantial leap over their predecessors, G6e and G5, delivering up to 2.3 times faster inference performance than G6e. The technical specifications underscore this generational advancement. Each G7e GPU provides 1,597 GB/s of memory bandwidth, and its 96 GB of GDDR7 doubles the per-GPU memory of G6e and quadruples that of G5. Networking is also dramatically enhanced, scaling up to 1,600 Gbps with EFA on the largest G7e size, a 4x increase over G6e and 16x over G5 that unlocks low-latency multi-node inference and fine-tuning scenarios previously deemed impractical.
Here's a comparison highlighting the progression across generations at the 8-GPU tier:
| Spec | G5 (g5.48xlarge) | G6e (g6e.48xlarge) | G7e (g7e.48xlarge) |
|---|---|---|---|
| GPU | 8x NVIDIA A10G | 8x NVIDIA L40S | 8x NVIDIA RTX PRO 6000 Blackwell |
| GPU Memory per GPU | 24 GB GDDR6 | 48 GB GDDR6 | 96 GB GDDR7 |
| Total GPU Memory | 192 GB | 384 GB | 768 GB |
| GPU Memory Bandwidth | 600 GB/s per GPU | 864 GB/s per GPU | 1,597 GB/s per GPU |
| vCPUs | 192 | 192 | 192 |
| System Memory | 768 GiB | 1,536 GiB | 2,048 GiB |
| Network Bandwidth | 100 Gbps | 400 Gbps | 1,600 Gbps (EFA) |
| Local NVMe Storage | 7.6 TB | 7.6 TB | 15.2 TB |
| Inference performance (vs. G6e) | — | Baseline (1x) | Up to 2.3x |
With a colossal 768 GB of aggregate GPU memory on a single G7e instance, models that once necessitated complex multi-node configurations on older instances can now be deployed with remarkable simplicity. This significantly reduces inter-node latency and operational overhead. Coupled with support for FP4 precision via fifth-generation Tensor Cores and NVIDIA GPUDirect RDMA over EFAv4, G7e instances are unequivocally designed for demanding LLM, multimodal AI, and sophisticated agentic inference workflows on AWS.
Diverse Generative AI Use Cases Thrive on G7e
The robust combination of memory density, bandwidth, and advanced networking capabilities makes G7e instances ideal for a wide spectrum of contemporary generative AI workloads. From enhancing conversational AI to powering complex physical simulations, G7e offers tangible advantages:
- Chatbots and Conversational AI: The low Time To First Token (TTFT) and high throughput of G7e instances ensure responsive and seamless interactive experiences, even when faced with heavy concurrent user loads. This is crucial for maintaining user engagement and satisfaction in real-time AI interactions.
- Agentic and Tool-Calling Workflows: For Retrieval Augmented Generation (RAG) pipelines and agentic systems, fast context injection from retrieval stores is paramount. The 4x improvement in CPU-to-GPU bandwidth within G7e instances makes them exceptionally effective for these critical operations, enabling more intelligent and dynamic AI agents.
- Text Generation, Summarization, and Long-Context Inference: With 96 GB of per-GPU memory, G7e instances adeptly handle large Key-Value (KV) caches. This allows for extended document contexts, significantly reducing the need for text truncation and facilitating richer, more nuanced reasoning over vast inputs (see the sizing sketch after this list).
- Image Generation and Vision Models: Where previous-generation instances frequently encountered out-of-memory errors with larger multimodal models, G7e's doubled memory capacity gracefully resolves these limitations, paving the way for more sophisticated and higher-resolution image and vision AI applications.
- Physical AI and Scientific Computing: Beyond traditional generative AI, G7e's Blackwell-generation compute, FP4 support, and spatial computing capabilities (including DLSS 4.0 and 4th-gen RT cores) extend its utility to digital twins, 3D simulation, and advanced physical AI model inference, opening new frontiers in scientific research and industrial applications.
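To make the long-context point concrete, here is a minimal KV-cache sizing sketch; the layer count, head count, and head dimension below are illustrative placeholders, not the specs of any model named above:

```python
# Rough KV-cache sizing. Memory grows linearly with context length and
# concurrency, which is what large per-GPU memory buys headroom for.

def kv_cache_gb(seq_len: int, batch: int, layers: int,
                kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    """Bytes for keys + values across all layers, in GB."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
    return seq_len * batch * per_token / 1e9

# A hypothetical 64-layer model with 8 KV heads of dim 128, BF16 cache,
# serving 32 concurrent requests at 8K context each:
print(kv_cache_gb(seq_len=8192, batch=32,
                  layers=64, kv_heads=8, head_dim=128))  # ~68.7 GB
```

Under these (assumed) parameters the cache alone approaches 70 GB, a load that simply cannot coexist with model weights on a 24 GB or 48 GB GPU.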
Streamlined Deployment and Performance Benchmarking
Deploying generative AI models on G7e instances via Amazon SageMaker AI is designed to be straightforward. AWS provides a sample notebook (linked from the original post below) that streamlines the process. Prerequisites typically include an AWS account, an IAM role with SageMaker access, and either Amazon SageMaker Studio or a SageMaker notebook instance as the development environment. Users should also request an appropriate quota for ml.g7e.2xlarge or larger instances for SageMaker AI endpoint usage via the Service Quotas console.
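For orientation, a minimal deployment sketch with the SageMaker Python SDK is shown below. The container image URI, IAM role, model artifact location, and the HF_MODEL_ID environment variable are placeholders and assumptions on our part; the sample notebook remains the authoritative walkthrough.

```python
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"  # placeholder

model = Model(
    image_uri="<inference-container-image-uri>",   # placeholder: e.g. an LMI/vLLM image
    model_data="s3://<bucket>/<model-artifacts>/",  # placeholder, optional if pulled by ID
    env={"HF_MODEL_ID": "Qwen/Qwen3-32B"},          # model used in the benchmark below
    role=role,
    sagemaker_session=session,
)

# Deploy to a real-time endpoint on a single-GPU G7e instance.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g7e.2xlarge",  # 1x RTX PRO 6000 Blackwell, 96 GB
)
```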
To demonstrate the significant performance gains, AWS benchmarked Qwen3-32B (BF16) on both G6e and G7e instances. The workload involved approximately 1,000 input tokens and 560 output tokens per request, mimicking common document summarization tasks. Both configurations utilized the native vLLM container with prefix caching enabled, ensuring an apples-to-apples comparison.
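AWS hasn't published the exact launch configuration, but a minimal vLLM setup matching the described benchmark shape might look like the following; the offline LLM API is shown for brevity, whereas the benchmark used vLLM's serving container with the same engine options:

```python
# Sketch of the benchmark setup described above: Qwen3-32B in BF16 with
# prefix caching enabled on a single GPU.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",
    dtype="bfloat16",
    enable_prefix_caching=True,   # reuse KV cache for shared prompt prefixes
    tensor_parallel_size=1,       # one RTX PRO 6000 GPU on ml.g7e.2xlarge
)

# ~1,000-token summarization prompts with ~560 output tokens per request
params = SamplingParams(max_tokens=560, temperature=0.7)
outputs = llm.generate(["<~1,000-token document to summarize>"], params)
print(outputs[0].outputs[0].text)
```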
The results are compelling. While the G6e baseline (ml.g6e.12xlarge with 4x L40S GPUs at $13.12/hr) showed strong per-request throughput, the G7e (ml.g7e.2xlarge with 1x RTX PRO 6000 Blackwell at $4.20/hr) tells a dramatically different cost story. At production concurrency (C=32), G7e achieved an astonishing $0.79 per million output tokens. This represents a 2.6x cost reduction compared to G6e’s $2.06, driven by G7e’s lower hourly rate and its ability to maintain consistent throughput under load, proving that high performance doesn't have to come at a premium cost.
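These figures are easy to sanity-check. Cost per million output tokens is just the hourly price divided by sustained output throughput, so the quoted numbers imply the throughput each instance held at C=32:

```python
# Back-calculating from the stated prices and $/1M-token figures:
#   $/1M tokens = hourly_price / (tokens_per_sec * 3600) * 1e6

def implied_throughput(hourly_price: float, cost_per_million: float) -> float:
    """Output tokens/sec implied by a quoted $/1M-token figure."""
    return hourly_price / cost_per_million * 1e6 / 3600

print(implied_throughput(4.20, 0.79))    # G7e: ~1,477 output tokens/sec
print(implied_throughput(13.12, 2.06))   # G6e: ~1,769 output tokens/sec
print(2.06 / 0.79)                       # ~2.6x cost reduction
```

Notably, the four-GPU G6e instance actually pushed slightly more tokens per second in this back-calculation; the cost win comes almost entirely from G7e's much lower hourly rate for comparable sustained throughput.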
The Future of Cost-Efficient Generative AI Inference
The introduction of G7e instances on Amazon SageMaker AI is more than just an incremental upgrade; it's a strategic move by AWS to democratize access to high-performance generative AI. By combining the raw power of NVIDIA RTX PRO 6000 Blackwell GPUs with the scalability and management capabilities of SageMaker, AWS is empowering organizations of all sizes to deploy larger, more complex AI models with unprecedented efficiency and cost-effectiveness. This development ensures that the advancements in generative AI can be translated into practical, production-ready applications across a vast array of industries, solidifying SageMaker AI's position as a leading platform for AI innovation.
Original source
https://aws.amazon.com/blogs/machine-learning/accelerate-generative-ai-inference-on-amazon-sagemaker-ai-with-g7e-instances/
