Generative AI Inference: Accelerated on SageMaker with G7e Instances

G7e Instances: A New Era for AI Inference on SageMaker

The generative AI landscape is changing at an unprecedented pace, driving constant demand for more powerful, flexible, and cost-effective infrastructure. Today, Code Velocity is pleased to report a significant advance from AWS: the general availability of G7e instances on Amazon SageMaker AI. Powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, these new instances are set to redefine the benchmarks for generative AI inference, offering developers and businesses substantially more performance and memory capacity.

Amazon SageMaker AI is a fully managed service that gives developers and data scientists the tools to build, train, and deploy machine learning models at scale. The introduction of G7e instances marks an important moment for generative AI workloads on the platform. The instances use NVIDIA RTX PRO 6000 Blackwell GPUs, each with 96 GB of GDDR7 memory. That increase in memory makes it possible to deploy larger foundation models (FMs) directly on SageMaker AI, addressing a critical need for advanced AI applications.

Organizations can now deploy models such as GPT-OSS-120B, Nemotron-3-Super-120B-A12B (NVFP4 variant), and Qwen3.5-35B-A3B with notable efficiency. The single-GPU G7e.2xlarge instance can host models with up to 35B parameters in FP16 precision, while the eight-GPU G7e.48xlarge can scale to models with up to 300B parameters. This flexibility translates into practical benefits: reduced operational complexity, lower latency, and significant cost savings for inference workloads.
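The sizing figures above can be sanity-checked with simple arithmetic. The sketch below is illustrative only: it assumes 2 bytes per parameter (FP16/BF16), counts 1 GB as 10^9 bytes, and looks at model weights alone, ignoring KV cache, activations, and framework overhead. The function names are hypothetical, and the 4-GPU figure comes from the FAQ later in this article.

```python
GPU_MEMORY_GB = 96  # per-GPU memory stated in the article


def weight_footprint_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory needed for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billions * 1e9 params * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


def fits(params_billions: float, bytes_per_param: float, num_gpus: int) -> bool:
    """True if the weights fit in the aggregate GPU memory of the instance."""
    return weight_footprint_gb(params_billions, bytes_per_param) <= num_gpus * GPU_MEMORY_GB


# (instance size, model size in billions of parameters, GPU count) from the article
for instance, params_b, gpus in [
    ("g7e.2xlarge", 35, 1),
    ("g7e.24xlarge", 150, 4),
    ("g7e.48xlarge", 300, 8),
]:
    needed = weight_footprint_gb(params_b, 2)  # assumption: FP16, 2 bytes per parameter
    available = gpus * GPU_MEMORY_GB
    print(f"{instance}: {needed:.0f} GB of weights, {available} GB of GPU memory, "
          f"{available - needed:.0f} GB left for KV cache and overhead")
```

Under these assumptions each quoted model size leaves some headroom, which is what the KV cache and runtime need in practice.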

Understanding the Generational Performance Leap of G7e

G7e instances represent a large leap over their predecessors, G6e and G5, delivering up to 2.3 times faster inference performance than G6e. The technical specifications underline this generational advance. Each G7e GPU provides 96 GB of GDDR7 memory with 1,597 GB/s of bandwidth, doubling the per-GPU memory of G6e and quadrupling that of G5. Networking has also been greatly improved, reaching 1,600 Gbps with EFA on the largest G7e size. This 4x increase over G6e and 16x over G5 opens up low-latency multi-node inference and fine-tuning scenarios that were previously considered impractical.

The following comparison shows the progression across generations at the 8-GPU tier:

Specification	G5 (g5.48xlarge)	G6e (g6e.48xlarge)	G7e (g7e.48xlarge)
GPU	8x NVIDIA A10G	8x NVIDIA L40S	8x NVIDIA RTX PRO 6000 Blackwell
GPU Memory per GPU	24 GB GDDR6	48 GB GDDR6	96 GB GDDR7
Total GPU Memory	192 GB	384 GB	768 GB
GPU Memory Bandwidth	600 GB/s per GPU	864 GB/s per GPU	1,597 GB/s per GPU
vCPUs	192	192	192
System Memory	768 GiB	1,536 GiB	2,048 GiB
Network Bandwidth	100 Gbps	400 Gbps	1,600 Gbps (EFA)
Local NVMe Storage	7.6 TB	7.6 TB	15.2 TB
Inference vs. G6e	Baseline	~1x	Up to 2.3x
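The generational multipliers quoted in the text follow directly from the table. A short sketch that recomputes them, using only values from the table (the dictionary layout and function name are illustrative):

```python
# Per-GPU memory (GB), per-GPU memory bandwidth (GB/s), and instance network
# bandwidth (Gbps) for the 8-GPU tier, copied from the comparison table.
SPECS = {
    "G5": {"gpu_mem_gb": 24, "mem_bw_gbs": 600, "network_gbps": 100},
    "G6e": {"gpu_mem_gb": 48, "mem_bw_gbs": 864, "network_gbps": 400},
    "G7e": {"gpu_mem_gb": 96, "mem_bw_gbs": 1597, "network_gbps": 1600},
}


def ratio(metric: str, newer: str = "G7e", older: str = "G6e") -> float:
    """How many times larger `metric` is on `newer` than on `older`."""
    return SPECS[newer][metric] / SPECS[older][metric]


for metric in ("gpu_mem_gb", "mem_bw_gbs", "network_gbps"):
    print(f"{metric}: {ratio(metric):.2f}x vs G6e, "
          f"{ratio(metric, older='G5'):.2f}x vs G5")
```

Memory capacity and network bandwidth scale by the round factors cited in the article (2x and 4x over G6e, 4x and 16x over G5), while per-GPU memory bandwidth grows by roughly 1.85x over G6e.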

With 768 GB of aggregate GPU memory on a single G7e instance, models that previously required complex multi-node configurations on older instances can now be deployed on one node. This significantly reduces inter-node latency and operational overhead. Combined with support for FP4 precision through fifth-generation Tensor Cores and NVIDIA GPUDirect RDMA over EFAv4, G7e instances are purpose-built for demanding LLM, multimodal AI, and sophisticated agentic inference workflows on AWS.

Diverse Generative AI Use Cases Thrive on G7e

The combination of memory density, bandwidth, and advanced networking makes G7e instances well suited to a wide range of contemporary generative AI workloads. From conversational AI to complex physical simulations, G7e offers clear benefits:

Chatbots and Conversational AI: The low Time To First Token (TTFT) and high throughput of G7e instances keep interactive experiences responsive and fluid, even under heavy concurrent user loads. This is essential for sustaining user engagement and satisfaction in real-time AI interactions.
Agentic and Tool-Calling Workflows: For Retrieval Augmented Generation (RAG) pipelines and agentic systems, fast context injection from retrieval stores is paramount. The 4x improvement in CPU-to-GPU bandwidth in G7e instances makes them highly effective for these critical operations, enabling smarter and more dynamic AI agents.
Text Generation, Summarization, and Long-Context Inference: With 96 GB of per-GPU memory, G7e instances handle large Key-Value (KV) caches efficiently. This allows extended document contexts, greatly reducing the need to truncate text and supporting richer, more nuanced reasoning over long inputs.
Image Generation and Vision Models: Where previous instance generations often hit out-of-memory errors on larger multimodal models, G7e's doubled memory capacity resolves these limits, enabling more sophisticated, higher-resolution image and vision AI applications.
Physical AI and Scientific Computing: Beyond traditional generative AI, G7e's Blackwell-generation compute, FP4 support, and spatial computing capabilities (including DLSS 4.0 and 4th-gen RT cores) extend its utility to digital twins, 3D simulation, and advanced physical AI model inference, opening new frontiers in scientific research and industrial applications.
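To make the long-context point concrete, the KV cache of a standard transformer decoder grows linearly with sequence length and batch size. The sketch below uses the usual estimate (keys plus values, per layer, per KV head). The model shape of 64 layers, 8 KV heads, and head dimension 128 is an illustrative assumption, not a figure from the article, and real serving stacks add paging and fragmentation overhead on top.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Estimated KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes an FP16/BF16 cache.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1)
long_context = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=32_768)
batched = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=4_096, batch_size=32)

print(f"per token:            {per_token / 2**10:.0f} KiB")
print(f"one 32K-token prompt: {long_context / 2**30:.0f} GiB")
print(f"32 x 4K-token chats:  {batched / 2**30:.0f} GiB")
```

With this assumed shape, a batch of 32 moderately long conversations already needs tens of gigabytes of cache in addition to the weights, which is where a larger per-GPU memory budget pays off.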

Streamlined Deployment and Performance Benchmarking

Deploying generative AI models on G7e instances through Amazon SageMaker AI is designed to be simple. A sample notebook is available that simplifies the process. Prerequisites typically include an AWS account, an IAM role for SageMaker access, and either Amazon SageMaker Studio or a SageMaker notebook instance as the development environment. Importantly, users must request an appropriate quota for ml.g7e.2xlarge or larger instances for SageMaker AI endpoint usage through the AWS Service Quotas console.
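As a rough sketch of what the endpoint setup can look like outside the sample notebook, the snippet below builds the request for the boto3 SageMaker client's create_endpoint_config call and then creates the endpoint. It assumes a SageMaker model has already been registered (for example with create_model) and that the G7e quota has been granted. All resource names are hypothetical placeholders, and this is not the notebook's code.

```python
def build_endpoint_config(config_name: str, model_name: str,
                          instance_type: str = "ml.g7e.2xlarge",
                          instance_count: int = 1) -> dict:
    """Build keyword arguments for the SageMaker create_endpoint_config API."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,  # must already exist in SageMaker
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
            }
        ],
    }


def create_endpoint(endpoint_name: str, config: dict) -> None:
    """Create the endpoint config and the endpoint (requires AWS credentials)."""
    import boto3  # imported lazily so the config builder works without boto3

    sagemaker_client = boto3.client("sagemaker")
    sagemaker_client.create_endpoint_config(**config)
    sagemaker_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config["EndpointConfigName"],
    )


# Hypothetical names, for illustration only.
config = build_endpoint_config("g7e-demo-config", "my-registered-model")
print(config["ProductionVariants"][0]["InstanceType"])
```

Calling create_endpoint("g7e-demo-endpoint", config) would then start provisioning. Keeping the request builder separate from the API call makes the configuration easy to review or unit test before any billable resource is created.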

To demonstrate the performance gain, AWS benchmarked Qwen3-32B (BF16) on both G6e and G7e instances. The workload consisted of roughly 1,000 input tokens and 560 output tokens per request, mimicking typical document summarization tasks. Both configurations used the native vLLM container with prefix caching enabled, ensuring an apples-to-apples comparison.
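A load test of this shape can be sketched with a bounded-concurrency client. The example below is not the AWS benchmark harness: fake_invoke is a stand-in that sleeps instead of calling an endpoint, so the printed throughput is meaningless. It only shows how a fixed number of in-flight requests (here 32) and the per-request token counts from the workload description fit together.

```python
import asyncio
import time

INPUT_TOKENS = 1000   # approximate prompt length from the workload description
OUTPUT_TOKENS = 560   # approximate completion length from the workload description


async def fake_invoke(prompt_tokens: int, max_new_tokens: int) -> int:
    """Stand-in for a real endpoint call; returns the number of output tokens."""
    await asyncio.sleep(0.01)  # simulated latency, not a measurement
    return max_new_tokens


async def run_load(num_requests: int, concurrency: int, invoke=fake_invoke) -> dict:
    """Send num_requests requests with at most `concurrency` in flight at once."""
    limiter = asyncio.Semaphore(concurrency)

    async def one_request() -> int:
        async with limiter:
            return await invoke(INPUT_TOKENS, OUTPUT_TOKENS)

    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(num_requests)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    return {
        "output_tokens": total,
        "seconds": elapsed,
        "tokens_per_second": total / elapsed,
    }


stats = asyncio.run(run_load(num_requests=64, concurrency=32))
print(stats)
```

Swapping fake_invoke for a coroutine that calls the deployed endpoint would turn this into a basic throughput probe.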

The results are compelling. Although the G6e baseline (ml.g6e.12xlarge with 4x L40S GPUs at $13.12/hr) showed strong per-request throughput, the G7e (ml.g7e.2xlarge with 1x RTX PRO 6000 Blackwell at $4.20/hr) tells a very different story on cost. At production concurrency (C=32), G7e achieved $0.79 per million output tokens. That is a 2.6x cost reduction compared with G6e's $2.06, driven by G7e's lower hourly rate and its ability to sustain consistent throughput under load, showing that high performance does not have to come at a high price.
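The cost metric relates hourly price to sustained output throughput. The sketch below shows that relationship and back-calculates the aggregate throughput each price point implies. Those throughput values are derived from the article's prices and costs, not published measurements, and the function names are illustrative.

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """USD per one million output tokens at a sustained aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


def implied_tokens_per_second(hourly_rate_usd: float, cost_per_million_usd: float) -> float:
    """Aggregate output throughput implied by an hourly rate and a cost per 1M tokens."""
    return hourly_rate_usd / cost_per_million_usd * 1_000_000 / 3600


# Hourly rates and costs per million output tokens as quoted in the article.
g7e_tps = implied_tokens_per_second(4.20, 0.79)
g6e_tps = implied_tokens_per_second(13.12, 2.06)

print(f"G7e implied throughput: {g7e_tps:.0f} output tokens/s")
print(f"G6e implied throughput: {g6e_tps:.0f} output tokens/s")
print(f"Cost reduction: {2.06 / 0.79:.1f}x")
```

Read this way, the 4-GPU G6e configuration still produces somewhat more tokens per second in aggregate, but at more than three times the hourly price, which is why the single-GPU G7e comes out ahead per token.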

The Future of Cost-Efficient Generative AI Inference

The introduction of G7e instances on Amazon SageMaker AI is more than an incremental upgrade; it is a strategic move by AWS to democratize access to high-performance generative AI. By combining the raw power of NVIDIA RTX PRO 6000 Blackwell GPUs with the scalability and management capabilities of SageMaker, AWS enables organizations of all sizes to deploy larger, more complex AI models with greater efficiency and cost-effectiveness. This development helps turn advances in generative AI into practical, production-ready applications across a wide range of industries, reinforcing SageMaker AI's position as a leading platform for AI innovation.

Original source

https://aws.amazon.com/blogs/machine-learning/accelerate-generative-ai-inference-on-amazon-sagemaker-ai-with-g7e-instances/

Frequently Asked Questions

What are G7e instances and how do they benefit generative AI inference?

G7e instances are the latest generation of GPU-accelerated computing instances available on Amazon SageMaker AI, specifically designed to accelerate generative AI inference workloads. They are powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, offering significant advancements in memory capacity, bandwidth, and overall inference performance. For generative AI, G7e instances mean faster Time To First Token (TTFT), higher throughput, and the ability to host much larger foundation models (FMs) within a single instance, or even on a single GPU. This translates into more responsive AI applications, reduced operational complexity, and substantial cost savings for deploying and running large language models (LLMs), multimodal AI, and agentic workflows. Their enhanced capabilities make them ideal for interactive applications requiring high-performance, cost-effective inference.

Which NVIDIA GPU powers the new G7e instances, and what are its key features?

The new G7e instances on Amazon SageMaker AI are powered by the NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs. Each of these cutting-edge GPUs provides an impressive 96 GB of GDDR7 memory, which is double the memory capacity per GPU compared to the previous G6e instances. Key features also include 1,597 GB/s of GPU memory bandwidth per GPU, support for FP4 precision through fifth-generation Tensor Cores, and NVIDIA GPUDirect RDMA over EFAv4. These features collectively contribute to the G7e instances' superior inference performance, memory density, and low-latency networking, making them exceptionally capable for demanding generative AI tasks.

How do G7e instances compare to previous generations (G6e, G5) in terms of performance and memory?

G7e instances demonstrate a significant generational leap over G6e and G5. They deliver up to 2.3x inference performance compared to G6e instances. In terms of memory, each G7e GPU offers 96 GB of GDDR7 memory, effectively doubling the per-GPU memory of G6e and quadrupling that of G5. A top-tier G7e.48xlarge instance provides an aggregate of 768 GB total GPU memory. Furthermore, networking bandwidth scales up to 1,600 Gbps with EFA on the largest G7e size, a 4x jump over G6e and 16x over G5. This vast improvement in memory, bandwidth, and networking allows G7e instances to host models that previously required multi-node setups on older instances, simplifying deployment and reducing latency.

What types of generative AI workloads are best suited for deployment on G7e instances?

G7e instances are exceptionally well-suited for a broad range of modern generative AI workloads due to their high memory density, bandwidth, and advanced networking. These include: Chatbots and Conversational AI, ensuring low Time To First Token (TTFT) and high throughput for responsive interactive experiences; Agentic and Tool-Calling Workflows, benefiting from 4x improved CPU-to-GPU bandwidth for fast context injection in RAG pipelines; Text Generation, Summarization, and Long-Context Inference, accommodating large KV caches for extended document contexts with 96 GB per-GPU memory; Image Generation and Vision Models, overcoming out-of-memory errors for larger multimodal models that struggled on previous instances; and Physical AI and Scientific Computing, leveraging Blackwell-generation compute, FP4 support, and spatial computing capabilities for digital twins and 3D simulation.

What is the cost efficiency of G7e instances compared to G6e for generative AI inference?

G7e instances offer significantly improved cost efficiency for generative AI inference compared to G6e instances. Benchmarks deploying Qwen3-32B showed that G7e achieved $0.79 per million output tokens at production concurrency (C=32). This represents a remarkable 2.6x cost reduction compared to G6e’s $2.06 per million output tokens for a similar workload. This cost saving is primarily driven by G7e’s substantially lower hourly rate (e.g., $4.20/hr for ml.g7e.2xlarge vs. $13.12/hr for ml.g6e.12xlarge) combined with its ability to maintain consistent and high throughput under load, making it a more economical choice for large-scale deployments.

What are the memory capacities for deploying LLMs on single and multi-GPU G7e instances?

G7e instances offer substantial memory capacities for deploying large language models (LLMs). A single-GPU node, specifically a G7e.2xlarge instance, can host foundation models with up to 35 billion parameters in FP16 precision. For larger models, scaling across multiple GPUs within a single instance increases capacity: a 4-GPU node (G7e.24xlarge) can deploy models of up to 150 billion parameters, while an 8-GPU node (G7e.48xlarge) can handle models as large as 300 billion parameters. This scalability gives organizations the flexibility to deploy a wide range of LLMs without the complexity of multi-instance distributed setups.

What are the prerequisites for deploying solutions using G7e instances on Amazon SageMaker AI?

To deploy generative AI solutions using G7e instances on Amazon SageMaker AI, several prerequisites must be met. You need an active AWS account to host your resources and an AWS Identity and Access Management (IAM) role configured with appropriate permissions to access Amazon SageMaker AI services. For development and deployment, access to Amazon SageMaker Studio or a SageMaker notebook instance is recommended, though other interactive development environments like PyCharm or Visual Studio Code are also viable. Crucially, you must request a quota for at least one `ml.g7e.2xlarge` instance (or a larger G7e instance type) for Amazon SageMaker AI endpoint usage through the AWS Service Quotas console, as these are new and specialized instance types.
