
Generative AI Inference: Accelerating Performance on SageMaker with G7e Instances

Amazon SageMaker AI G7e instances harness NVIDIA RTX PRO 6000 Blackwell GPUs to accelerate generative AI inference.

G7e Instances: A New Era for AI Inference on SageMaker

The generative AI landscape is evolving at an unprecedented pace, continually driving demand for more powerful, flexible, and cost-effective infrastructure. Today, Code Velocity is pleased to report on a major advance from AWS: G7e instances are now generally available on Amazon SageMaker AI. Powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, these new instances redefine the benchmark for generative AI inference, offering developers and enterprises unparalleled performance and memory capacity.

Amazon SageMaker AI is a fully managed service that gives developers and data scientists the tools to build, train, and deploy machine learning models at scale. The introduction of G7e instances marks a pivotal moment for generative AI workloads on the platform. These instances use cutting-edge NVIDIA RTX PRO 6000 Blackwell GPUs, each carrying an impressive 96 GB of GDDR7 memory. This substantial increase in memory makes it possible to deploy larger foundation models (FMs) directly on SageMaker AI, meeting a key requirement of advanced AI applications.

Organizations can now deploy models such as GPT-OSS-120B, Nemotron-3-Super-120B-A12B (NVFP4 variant), and Qwen3.5-35B-A3B with exceptional efficiency. A G7e.2xlarge instance (with a single GPU) can host models of 35 billion parameters, while a G7e.48xlarge instance (with eight GPUs) scales up to models of 300 billion parameters. This flexibility brings tangible benefits: lower operational complexity, reduced latency, and substantial cost savings for inference workloads.
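
As a quick back-of-the-envelope check on these sizing claims, the sketch below tests whether a model's weights fit in an instance's aggregate GPU memory. The per-GPU capacity and parameter counts come from this article; the 20% runtime overhead factor is an illustrative assumption, not a published figure.

```python
# Rough fit check: do a model's weights fit in a G7e instance's GPU memory?
# Per-GPU capacity (96 GB) is from the article; the overhead factor is an
# illustrative assumption covering activations, KV cache, and runtime buffers.

GPU_MEM_GB = 96  # GDDR7 per RTX PRO 6000 Blackwell GPU

def fits(params_billion: float, num_gpus: int,
         bytes_per_param: float = 2.0,   # 2.0 = FP16/BF16, 0.5 = FP4
         overhead: float = 1.2) -> bool:
    """Return True if weights (plus a fudge factor) fit in aggregate GPU memory."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9
    return weights_gb * overhead <= GPU_MEM_GB * num_gpus

print(fits(35, 1))    # 35B in BF16 on g7e.2xlarge (1 GPU): 84 GB <= 96 GB -> True
print(fits(300, 8))   # 300B in BF16 on g7e.48xlarge (8 GPUs): 720 GB <= 768 GB -> True
print(fits(120, 1, bytes_per_param=0.5))  # 120B in FP4 on one GPU: 72 GB <= 96 GB -> True
```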

Dissecting the G7e's Generational Performance Leap

G7e instances represent a dramatic leap over their predecessors, the G6e and G5, delivering inference performance up to 2.3x faster than G6e. The technical specifications underscore this generational advance. Each G7e GPU provides a striking 1,597 GB/s of memory bandwidth, while its 96 GB capacity effectively doubles the per-GPU memory of G6e and quadruples that of G5. Networking is also significantly enhanced: on the largest G7e instance, EFA scales network bandwidth to 1,600 Gbps, a 4x jump over G6e and 16x over G5, unlocking low-latency multi-node inference and fine-tuning scenarios previously considered impractical.

Here is how the generations compare at the 8-GPU tier:

| Spec | G5 (g5.48xlarge) | G6e (g6e.48xlarge) | G7e (g7e.48xlarge) |
| --- | --- | --- | --- |
| GPUs | 8x NVIDIA A10G | 8x NVIDIA L40S | 8x NVIDIA RTX PRO 6000 Blackwell |
| Memory per GPU | 24 GB GDDR6 | 48 GB GDDR6 | 96 GB GDDR7 |
| Total GPU memory | 192 GB | 384 GB | 768 GB |
| GPU memory bandwidth | 600 GB/s per GPU | 864 GB/s per GPU | 1,597 GB/s per GPU |
| vCPUs | 192 | 192 | 192 |
| System memory | 768 GiB | 1,536 GiB | 2,048 GiB |
| Network bandwidth | 100 Gbps | 400 Gbps | 1,600 Gbps (EFA) |
| Local NVMe storage | 7.6 TB | 7.6 TB | 15.2 TB |
| Inference performance vs. G6e | Baseline | ~1x | Up to 2.3x |

With up to 768 GB of total GPU memory in a single instance, models that once demanded complex multi-node configurations on older instances can now be deployed with remarkable simplicity, sharply reducing inter-node latency and operational overhead. Combined with FP4 precision support via fifth-generation Tensor Cores and NVIDIA GPUDirect RDMA over EFAv4, G7e instances are unmistakably built for the most demanding LLM, multimodal AI, and complex agentic inference workflows on AWS.

Diverse Generative AI Use Cases Thrive on G7e

The potent combination of memory density, bandwidth, and advanced networking makes G7e instances an ideal choice for a wide range of contemporary generative AI workloads. From enhanced conversational AI to complex physical simulation, G7e delivers tangible advantages:

  • Chatbots and conversational AI: The low Time To First Token (TTFT) and high throughput of G7e instances ensure responsive, seamless interactions, even under heavy concurrent user load. This is critical for maintaining user engagement and satisfaction in real-time AI interactions.
  • Agentic and tool-calling workflows: For retrieval-augmented generation (RAG) pipelines and agentic systems, fast context injection from retrieval stores is essential. The 4x improvement in CPU-to-GPU bandwidth within G7e instances makes them exceptionally efficient at these critical operations, enabling smarter, more dynamic AI agents.
  • Text generation, summarization, and long-context inference: With 96 GB of per-GPU memory, G7e instances comfortably accommodate large key-value (KV) caches. This allows extended document contexts, significantly reducing the need for text truncation and supporting richer, more nuanced reasoning over large inputs (see the KV-cache sizing sketch after this list).
  • Image generation and vision models: Where previous-generation instances frequently hit out-of-memory errors with large multimodal models, G7e's doubled memory capacity gracefully resolves these limits, paving the way for more sophisticated, higher-resolution image and vision AI applications.
  • Physical AI and scientific computing: Beyond traditional generative AI, G7e's Blackwell-generation compute, FP4 support, and spatial computing capabilities (including DLSS 4.0 and fourth-generation RT Cores) extend its reach to digital twins, 3D simulation, and advanced physical AI model inference, opening new frontiers for scientific research and industrial applications.
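
To make the KV-cache point above concrete, here is a minimal sizing sketch. The formula (2 tensors x layers x KV heads x head dim x bytes per element, per token) is the standard estimate for transformer KV caches; the layer and head counts below are illustrative assumptions, not a published model config.

```python
# Minimal KV-cache sizing sketch. The formula is the standard transformer
# estimate; the model dimensions below are illustrative assumptions.

def kv_cache_gb(context_tokens: int,
                num_layers: int = 64,
                num_kv_heads: int = 8,    # grouped-query attention
                head_dim: int = 128,
                bytes_per_elem: int = 2   # FP16/BF16
                ) -> float:
    """Estimate the KV-cache size in GB for a single sequence."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V
    return context_tokens * per_token / 1e9

# One 32K-token sequence needs roughly 8.6 GB of cache under these assumptions,
# so a 96 GB GPU leaves headroom for model weights plus several long contexts.
print(f"{kv_cache_gb(32_768):.1f} GB")
```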

Simplified Deployment and Performance Benchmarking

Deploying generative AI models on G7e instances through Amazon SageMaker AI is designed to be straightforward, and a sample notebook is available that streamlines the process. Prerequisites typically include an AWS account, an IAM role for SageMaker access, and an Amazon SageMaker Studio or SageMaker notebook instance as the development environment. Importantly, users should request an appropriate quota for ml.g7e.2xlarge or larger instances for SageMaker AI endpoint usage through the Service Quotas console.
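
For a sense of what such a deployment can look like with the SageMaker Python SDK, here is a minimal sketch that serves a Hugging Face model on a G7e endpoint. The model ID, container choice, and request format are illustrative assumptions, not the article's configuration; the official sample notebook remains the authoritative reference.

```python
# Hedged sketch: deploying an LLM to a SageMaker real-time endpoint on G7e.
# Model ID and container are illustrative assumptions; see the sample notebook
# for the officially supported configuration.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

model = HuggingFaceModel(
    role=role,
    image_uri=get_huggingface_llm_image_uri("huggingface"),  # LLM serving container
    env={
        "HF_MODEL_ID": "Qwen/Qwen3-32B",  # illustrative model choice
        "SM_NUM_GPUS": "1",               # g7e.2xlarge has a single GPU
    },
)

# Requires an approved Service Quotas limit for ml.g7e.2xlarge endpoint usage.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g7e.2xlarge",
)

print(predictor.predict({"inputs": "Summarize the following document: ..."}))
```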

To showcase the performance gains, AWS benchmarked Qwen3-32B (BF16) on both G6e and G7e instances. The workload involved roughly 1,000 input tokens and 560 output tokens per request, simulating a common document summarization task. Both configurations used the native vLLM container with prefix caching enabled, ensuring a fair comparison.

The results are striking. While the G6e baseline (ml.g6e.12xlarge with 4 L40S GPUs at $13.12/hour) showed strong per-request throughput, the G7e (ml.g7e.2xlarge with 1 RTX PRO 6000 Blackwell at $4.20/hour) told a very different cost story. At production concurrency (C=32), G7e achieved a remarkable $0.79 per million output tokens. Against G6e's $2.06, that is a 2.6x cost reduction, driven by G7e's lower hourly rate and its ability to hold throughput steady under load, proof that high performance need not come at a premium price.
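
The cost arithmetic behind these figures is easy to reproduce: dollars per million output tokens is simply the hourly rate divided by tokens generated per hour. The sketch below back-calculates the implied sustained throughput from the article's numbers; the throughput values are derived from those figures, not independently measured.

```python
# Reproduce the article's cost-per-million-output-token figures and derive
# the implied sustained output throughput. Dollar figures come from the
# benchmark above; throughput values are back-calculated, not measured.

def cost_per_million(price_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars per one million output tokens at a sustained generation rate."""
    return price_per_hour / (tokens_per_sec * 3600) * 1e6

def implied_tps(price_per_hour: float, cost_per_m: float) -> float:
    """Output tokens/sec implied by an hourly price and $/1M-token cost."""
    return price_per_hour / cost_per_m * 1e6 / 3600

g7e_tps = implied_tps(4.20, 0.79)   # ~1,477 output tokens/sec
g6e_tps = implied_tps(13.12, 2.06)  # ~1,769 output tokens/sec
print(f"G7e: {g7e_tps:.0f} tok/s -> ${cost_per_million(4.20, g7e_tps):.2f}/1M tokens")
print(f"Cost reduction: {2.06 / 0.79:.1f}x")  # ~2.6x
```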

The Future of Cost-Effective Generative AI Inference

The launch of G7e instances on Amazon SageMaker AI is more than an incremental upgrade; it is a strategic move by AWS to democratize access to high-performance generative AI. By pairing the raw power of NVIDIA RTX PRO 6000 Blackwell GPUs with SageMaker's scalability and management capabilities, AWS is empowering organizations of every size to deploy larger, more sophisticated AI models with unprecedented efficiency and cost-effectiveness. This development ensures that advances in generative AI translate into practical, production-ready applications across a broad range of industries, cementing SageMaker AI's position as a leading platform for AI innovation.

Frequently Asked Questions

What are G7e instances and how do they benefit generative AI inference?
G7e instances are the latest generation of GPU-accelerated computing instances available on Amazon SageMaker AI, specifically designed to accelerate generative AI inference workloads. They are powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, offering significant advancements in memory capacity, bandwidth, and overall inference performance. For generative AI, G7e instances mean faster Time To First Token (TTFT), higher throughput, and the ability to host much larger foundation models (FMs) within a single instance, or even on a single GPU. This translates into more responsive AI applications, reduced operational complexity, and substantial cost savings for deploying and running large language models (LLMs), multimodal AI, and agentic workflows. Their enhanced capabilities make them ideal for interactive applications requiring high-performance, cost-effective inference.
Which NVIDIA GPU powers the new G7e instances, and what are its key features?
The new G7e instances on Amazon SageMaker AI are powered by the NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs. Each of these cutting-edge GPUs provides an impressive 96 GB of GDDR7 memory, which is double the memory capacity per GPU compared to the previous G6e instances. Key features also include 1,597 GB/s of GPU memory bandwidth per GPU, support for FP4 precision through fifth-generation Tensor Cores, and NVIDIA GPUDirect RDMA over EFAv4. These features collectively contribute to the G7e instances' superior inference performance, memory density, and low-latency networking, making them exceptionally capable for demanding generative AI tasks.
How do G7e instances compare to previous generations (G6e, G5) in terms of performance and memory?
G7e instances demonstrate a significant generational leap over G6e and G5. They deliver up to 2.3x inference performance compared to G6e instances. In terms of memory, each G7e GPU offers 96 GB of GDDR7 memory, effectively doubling the per-GPU memory of G6e and quadrupling that of G5. A top-tier G7e.48xlarge instance provides an aggregate of 768 GB total GPU memory. Furthermore, networking bandwidth scales up to 1,600 Gbps with EFA on the largest G7e size, a 4x jump over G6e and 16x over G5. This vast improvement in memory, bandwidth, and networking allows G7e instances to host models that previously required multi-node setups on older instances, simplifying deployment and reducing latency.
What types of generative AI workloads are best suited for deployment on G7e instances?
G7e instances are exceptionally well-suited for a broad range of modern generative AI workloads due to their high memory density, bandwidth, and advanced networking. These include: Chatbots and Conversational AI, ensuring low Time To First Token (TTFT) and high throughput for responsive interactive experiences; Agentic and Tool-Calling Workflows, benefiting from 4x improved CPU-to-GPU bandwidth for fast context injection in RAG pipelines; Text Generation, Summarization, and Long-Context Inference, accommodating large KV caches for extended document contexts with 96 GB per-GPU memory; Image Generation and Vision Models, overcoming out-of-memory errors for larger multimodal models that struggled on previous instances; and Physical AI and Scientific Computing, leveraging Blackwell-generation compute, FP4 support, and spatial computing capabilities for digital twins and 3D simulation.
What is the cost efficiency of G7e instances compared to G6e for generative AI inference?
G7e instances offer significantly improved cost efficiency for generative AI inference compared to G6e instances. Benchmarks deploying Qwen3-32B showed that G7e achieved $0.79 per million output tokens at production concurrency (C=32). This represents a remarkable 2.6x cost reduction compared to G6e’s $2.06 per million output tokens for a similar workload. This cost saving is primarily driven by G7e’s substantially lower hourly rate (e.g., $4.20/hr for ml.g7e.2xlarge vs. $13.12/hr for ml.g6e.12xlarge) combined with its ability to maintain consistent and high throughput under load, making it a more economical choice for large-scale deployments.
What are the memory capacities for deploying LLMs on single and multi-GPU G7e instances?
G7e instances offer substantial memory capacities for deploying large language models (LLMs). A single-GPU node, specifically a G7e.2xlarge instance, can effectively host foundation models with up to 35 billion parameters in FP16 precision. For larger models, scaling across multiple GPUs within a single instance dramatically increases capacity: a 4-GPU node (G7e.24xlarge) can deploy models up to 150 billion parameters, while an 8-GPU node (G7e.48xlarge) can handle models as large as 300 billion parameters. This scalability gives organizations the flexibility to deploy a wide range of LLMs without the complexity of multi-instance distributed setups.
What are the prerequisites for deploying solutions using G7e instances on Amazon SageMaker AI?
To deploy generative AI solutions using G7e instances on Amazon SageMaker AI, several prerequisites must be met. You need an active AWS account to host your resources and an AWS Identity and Access Management (IAM) role configured with appropriate permissions to access Amazon SageMaker AI services. For development and deployment, access to Amazon SageMaker Studio or a SageMaker notebook instance is recommended, though other interactive development environments like PyCharm or Visual Studio Code are also viable. Crucially, you must request a quota for at least one `ml.g7e.2xlarge` instance (or a larger G7e instance type) for Amazon SageMaker AI endpoint usage through the AWS Service Quotas console, as these are new and specialized instance types.
