Generative AI Inference: Accelerated on SageMaker with G7e Instances

G7e Instances: A New Era for SageMaker AI Inference

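The generative AI landscape is evolving at an unprecedented pace, driving sustained demand for more powerful, flexible, and cost-effective infrastructure. Today, Code Velocity is pleased to report a significant development from AWS: the general availability of G7e instances on Amazon SageMaker AI. Powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, these new instances are positioned to raise the bar for generative AI inference, giving developers and enterprises more performance and memory capacity than earlier G-series instances.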

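Amazon SageMaker AI is a fully managed service that gives developers and data scientists the tools to build, train, and deploy machine learning models at scale. The introduction of G7e instances marks a major turning point for generative AI workloads on the platform. These instances use NVIDIA RTX PRO 6000 Blackwell GPUs, each with 96 GB of GDDR7 memory. That additional memory lets much larger foundation models (FMs) be deployed directly on SageMaker AI, meeting a key requirement of advanced AI applications.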

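Organizations can now deploy models such as GPT-OSS-120B, Nemotron-3-Super-120B-A12B (NVFP4 variant), and Qwen3.5-35B-A3B efficiently. The single-GPU G7e.2xlarge instance can host a 35-billion-parameter model, while the 8-GPU G7e.48xlarge scales to models of up to 300 billion parameters. This flexibility translates into practical benefits: reduced operational complexity, lower latency, and substantial cost savings for inference workloads.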
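
The sizing claims above can be sanity-checked with simple arithmetic. The following is a minimal sketch, not an AWS sizing tool: it assumes 16-bit weights (2 bytes per parameter), treats 4-bit formats such as NVFP4 as roughly 0.5 bytes per parameter while ignoring scale-factor overhead, and counts only model weights, not KV cache, activations, or runtime overhead. The function names are illustrative.

```python
GPU_MEMORY_GB = 96  # per-GPU memory quoted in the article


def weight_footprint_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bytes_per_param


def fits(params_billion: float, bytes_per_param: float, num_gpus: int) -> bool:
    """True if the weights alone fit in the combined memory of num_gpus GPUs."""
    return weight_footprint_gb(params_billion, bytes_per_param) <= num_gpus * GPU_MEMORY_GB


# (label, parameters in billions, bytes per parameter, GPU count)
cases = [
    ("35B, 16-bit, 1 GPU", 35, 2.0, 1),
    ("300B, 16-bit, 8 GPUs", 300, 2.0, 8),
    ("120B, 16-bit, 1 GPU", 120, 2.0, 1),
    ("120B, ~4-bit, 1 GPU", 120, 0.5, 1),
]

for label, params, width, gpus in cases:
    need = weight_footprint_gb(params, width)
    have = gpus * GPU_MEMORY_GB
    verdict = "fits" if fits(params, width, gpus) else "does not fit"
    print(f"{label}: needs {need:.0f} GB of {have} GB -> {verdict}")
```

Under these assumptions, a 35-billion-parameter model leaves headroom on one GPU for the KV cache, and a 120-billion-parameter model fits on a single GPU only in a 4-bit format, which is consistent with the article naming an NVFP4 variant.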

Exploring G7e's Generational Performance Leap

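G7e instances represent a major leap over their predecessors, G6e and G5, delivering up to 2.3x faster inference than G6e. The technical specifications underline the generational gain. Each G7e GPU provides 96 GB of GDDR7 memory with 1,597 GB/s of bandwidth, doubling the per-GPU memory of G6e and quadrupling that of G5. Networking also improves dramatically, scaling up to 1,600 Gbps with EFA on the largest G7e size. That is a 4x increase over G6e and 16x over G5, and it opens up low-latency multi-node inference and fine-tuning scenarios that were previously considered impractical.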

The following comparison highlights the generational progress at the 8-GPU tier.

Specification	G5 (g5.48xlarge)	G6e (g6e.48xlarge)	G7e (g7e.48xlarge)
GPUs	8x NVIDIA A10G	8x NVIDIA L40S	8x NVIDIA RTX PRO 6000 Blackwell
GPU memory per GPU	24 GB GDDR6	48 GB GDDR6	96 GB GDDR7
Total GPU memory	192 GB	384 GB	768 GB
GPU memory bandwidth	600 GB/s per GPU	864 GB/s per GPU	1,597 GB/s per GPU
vCPUs	192	192	192
System memory	768 GiB	1,536 GiB	2,048 GiB
Network bandwidth	100 Gbps	400 Gbps	1,600 Gbps (EFA)
Local NVMe storage	7.6 TB	7.6 TB	15.2 TB
Inference performance vs. G6e	Baseline	~1x	Up to 2.3x
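
The multipliers quoted in the text follow directly from the table. The short sketch below recomputes them from the table values; the dictionary layout and the `ratio` helper are illustrative names, not part of any AWS tooling.

```python
# Per-GPU memory (GB), per-GPU memory bandwidth (GB/s), and instance
# network bandwidth (Gbps) for the 8-GPU sizes, copied from the table above.
specs = {
    "G5": {"gpu_mem_gb": 24, "mem_bw_gbs": 600, "net_gbps": 100},
    "G6e": {"gpu_mem_gb": 48, "mem_bw_gbs": 864, "net_gbps": 400},
    "G7e": {"gpu_mem_gb": 96, "mem_bw_gbs": 1597, "net_gbps": 1600},
}


def ratio(metric: str, older: str, newer: str = "G7e") -> float:
    """How many times larger the newer generation's value is for a metric."""
    return specs[newer][metric] / specs[older][metric]


for metric in ("gpu_mem_gb", "mem_bw_gbs", "net_gbps"):
    print(
        f"{metric}: {ratio(metric, 'G6e'):.2f}x over G6e, "
        f"{ratio(metric, 'G5'):.2f}x over G5"
    )
```

Memory and networking scale by clean factors of two and four per generation, while per-GPU memory bandwidth grows by a little under 2x from G6e to G7e.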

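With 768 GB of total GPU memory on a single G7e instance, models that needed complex multi-node configurations on earlier instances can now be deployed far more simply. This greatly reduces inter-node latency and operational overhead. Combined with FP4 precision support through fifth-generation Tensor Cores and NVIDIA GPUDirect RDMA over EFAv4, G7e instances are explicitly designed for demanding LLM, multimodal AI, and sophisticated agentic inference workflows on AWS.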

Generative AI Use Cases That Thrive on G7e

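The combination of memory density, bandwidth, and advanced networking makes G7e instances well suited to a broad range of modern generative AI workloads. From improving conversational AI to supporting complex physics simulations, G7e offers the following practical benefits.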

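Chatbots and conversational AI: The low time to first token (TTFT) and high throughput of G7e instances keep conversational experiences responsive and smooth even under high concurrent user load. This matters for sustaining user engagement and satisfaction in real-time AI interactions.
Agentic and tool-calling workflows: For retrieval-augmented generation (RAG) pipelines and agentic systems, fast context injection from the retrieval store is paramount. A 4x improvement in CPU-to-GPU bandwidth makes G7e instances highly effective for these operations and enables more intelligent, dynamic AI agents.
Text generation, summarization, and long-context inference: With 96 GB of memory per GPU, G7e instances handle large key-value (KV) caches comfortably. This allows extended document contexts, which greatly reduces text truncation and supports richer, more nuanced reasoning over large inputs.
Image generation and vision models: G7e's doubled memory capacity resolves the out-of-memory errors that large multimodal models frequently hit on previous-generation instances, paving the way for more sophisticated, higher-resolution image and vision AI applications.
Physical AI and scientific computing: Beyond traditional generative AI, G7e's Blackwell-generation compute, FP4 support, and spatial computing capabilities (including DLSS 4.0 and fourth-generation RT Cores) extend its use to digital twins, 3D simulation, and advanced physical AI model inference, opening new ground for scientific research and industrial applications.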
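
To make the KV cache point concrete, the sketch below estimates how many cached tokens fit in the memory left over after loading model weights. It is a rough illustration under stated assumptions, not a measurement: the layer count, KV head count, and head dimension describe a hypothetical 32-billion-parameter model with grouped-query attention, the cache is stored in 16-bit precision, and activation memory and runtime overhead are ignored.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(gpu_memory_gb: float, weights_gb: float,
                      per_token_bytes: int) -> int:
    """Tokens that fit in the memory remaining after the weights are loaded."""
    free_bytes = (gpu_memory_gb - weights_gb) * 1e9
    return int(free_bytes // per_token_bytes)


# Hypothetical model shape (an assumption, not taken from the article).
per_token = kv_bytes_per_token(num_layers=64, num_kv_heads=8, head_dim=128)

# 32B parameters at 2 bytes each is about 64 GB of weights.
on_96_gb = max_cached_tokens(96, 64, per_token)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Tokens cacheable on a 96 GB GPU: {on_96_gb:,}")
```

With the same hypothetical model, a 48 GB GPU could not hold the 16-bit weights at all, which is why the per-GPU memory increase matters for long contexts as well as for model size. The token budget is shared across all concurrent requests, so the usable context per request shrinks as concurrency grows.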

Streamlined Deployment and Performance Benchmarking

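Deploying generative AI models on G7e instances through Amazon SageMaker AI is designed to be straightforward. AWS provides a sample notebook that streamlines the process. Prerequisites typically include an AWS account, an IAM role with SageMaker access, and Amazon SageMaker Studio or a SageMaker notebook instance as the development environment. Importantly, users must request an appropriate quota for ml.g7e.2xlarge or a larger instance for SageMaker AI endpoint usage through the Service Quotas console.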
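
As an illustration of where the instance type enters a deployment, the sketch below builds the request body for the SageMaker `create_endpoint_config` API. It is a minimal sketch rather than the sample notebook's code: the helper name, the model name `qwen3-32b-vllm`, and the config naming scheme are hypothetical, and it assumes a SageMaker model with that name has already been registered.

```python
def build_endpoint_config_request(model_name: str,
                                  instance_type: str = "ml.g7e.2xlarge",
                                  instance_count: int = 1) -> dict:
    """Build keyword arguments for SageMaker's create_endpoint_config call.

    Only G7e instance types are accepted here, to mirror the article's setup.
    """
    if not instance_type.startswith("ml.g7e."):
        raise ValueError(f"expected a G7e instance type, got {instance_type!r}")
    return {
        "EndpointConfigName": f"{model_name}-config",  # hypothetical naming scheme
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,  # must match an existing SageMaker model
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
            }
        ],
    }


request = build_endpoint_config_request("qwen3-32b-vllm")
print(request)

# With AWS credentials and the G7e quota in place, the request could be sent as:
#   import boto3
#   boto3.client("sagemaker").create_endpoint_config(**request)
```

Keeping the request construction separate from the API call makes the configuration easy to review or test before any billable resources are created.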

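To demonstrate the performance gains, AWS benchmarked Qwen3-32B (BF16) on both G6e and G7e instances. The workload used roughly 1,000 input tokens and 560 output tokens per request, mimicking a typical document summarization task. Both configurations used the default vLLM container with prefix caching enabled to ensure a fair comparison.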

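The results are compelling. The G6e baseline (ml.g6e.12xlarge with four L40S GPUs at $13.12 per hour) showed strong per-request throughput, but G7e (ml.g7e.2xlarge with one RTX PRO 6000 Blackwell at $4.20 per hour) tells a dramatically different cost story. At production concurrency (C=32), G7e achieved $0.79 per million output tokens. That is a 2.6x cost reduction compared with G6e's $2.06, driven by G7e's lower hourly rate and its ability to sustain consistent throughput under load, and it shows that high performance need not come at a high cost.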
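
The relationship between hourly price, throughput, and cost per million tokens can be written out explicitly. The sketch below uses only the prices and per-token costs quoted above. The throughput figures it prints are back-calculated from those numbers, assuming a single instance billed at the on-demand hourly rate, and are not values reported by AWS.

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """Cost in USD to generate one million output tokens on one instance."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


def implied_throughput(hourly_usd: float, cost_per_million_usd: float) -> float:
    """Output tokens per second implied by an hourly rate and a per-million cost."""
    return hourly_usd / cost_per_million_usd * 1_000_000 / 3600


g7e_tps = implied_throughput(4.20, 0.79)   # ml.g7e.2xlarge
g6e_tps = implied_throughput(13.12, 2.06)  # ml.g6e.12xlarge

print(f"G7e implied throughput: {g7e_tps:.0f} output tokens/s")
print(f"G6e implied throughput: {g6e_tps:.0f} output tokens/s")
print(f"Cost reduction: {2.06 / 0.79:.1f}x")
```

Read this way, the four-GPU G6e instance still produces more tokens per second in aggregate, but the single-GPU G7e instance delivers most of that throughput at under a third of the hourly price, which is where the cost advantage comes from.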

The Future of Cost-Effective Generative AI Inference

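The arrival of G7e instances on Amazon SageMaker AI is more than an incremental upgrade. It is a strategic move by AWS to democratize access to high-performance generative AI. By combining the power of NVIDIA RTX PRO 6000 Blackwell GPUs with SageMaker's scalability and management capabilities, AWS is helping organizations of all sizes deploy larger, more complex AI models with greater efficiency and at lower cost. This helps advances in generative AI translate into practical, production-ready applications across a wide range of industries, and it strengthens SageMaker AI's position as a leading platform for AI innovation.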

Original Source

https://aws.amazon.com/blogs/machine-learning/accelerate-generative-ai-inference-on-amazon-sagemaker-ai-with-g7e-instances/

Frequently Asked Questions

What are G7e instances and how do they benefit generative AI inference?

G7e instances are the latest generation of GPU-accelerated computing instances available on Amazon SageMaker AI, specifically designed to accelerate generative AI inference workloads. They are powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, offering significant advancements in memory capacity, bandwidth, and overall inference performance. For generative AI, G7e instances mean faster Time To First Token (TTFT), higher throughput, and the ability to host much larger foundation models (FMs) within a single instance, or even on a single GPU. This translates into more responsive AI applications, reduced operational complexity, and substantial cost savings for deploying and running large language models (LLMs), multimodal AI, and agentic workflows. Their enhanced capabilities make them ideal for interactive applications requiring high-performance, cost-effective inference.

Which NVIDIA GPU powers the new G7e instances, and what are its key features?

The new G7e instances on Amazon SageMaker AI are powered by the NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs. Each of these cutting-edge GPUs provides an impressive 96 GB of GDDR7 memory, which is double the memory capacity per GPU compared to the previous G6e instances. Key features also include 1,597 GB/s of GPU memory bandwidth per GPU, support for FP4 precision through fifth-generation Tensor Cores, and NVIDIA GPUDirect RDMA over EFAv4. These features collectively contribute to the G7e instances' superior inference performance, memory density, and low-latency networking, making them exceptionally capable for demanding generative AI tasks.

How do G7e instances compare to previous generations (G6e, G5) in terms of performance and memory?

G7e instances demonstrate a significant generational leap over G6e and G5. They deliver up to 2.3x inference performance compared to G6e instances. In terms of memory, each G7e GPU offers 96 GB of GDDR7 memory, effectively doubling the per-GPU memory of G6e and quadrupling that of G5. A top-tier G7e.48xlarge instance provides an aggregate of 768 GB total GPU memory. Furthermore, networking bandwidth scales up to 1,600 Gbps with EFA on the largest G7e size, a 4x jump over G6e and 16x over G5. This vast improvement in memory, bandwidth, and networking allows G7e instances to host models that previously required multi-node setups on older instances, simplifying deployment and reducing latency.

What types of generative AI workloads are best suited for deployment on G7e instances?

G7e instances are exceptionally well-suited for a broad range of modern generative AI workloads due to their high memory density, bandwidth, and advanced networking. These include: Chatbots and Conversational AI, ensuring low Time To First Token (TTFT) and high throughput for responsive interactive experiences; Agentic and Tool-Calling Workflows, benefiting from 4x improved CPU-to-GPU bandwidth for fast context injection in RAG pipelines; Text Generation, Summarization, and Long-Context Inference, accommodating large KV caches for extended document contexts with 96 GB per-GPU memory; Image Generation and Vision Models, overcoming out-of-memory errors for larger multimodal models that struggled on previous instances; and Physical AI and Scientific Computing, leveraging Blackwell-generation compute, FP4 support, and spatial computing capabilities for digital twins and 3D simulation.

What is the cost efficiency of G7e instances compared to G6e for generative AI inference?

G7e instances offer significantly improved cost efficiency for generative AI inference compared to G6e instances. Benchmarks deploying Qwen3-32B showed that G7e achieved $0.79 per million output tokens at production concurrency (C=32). This represents a remarkable 2.6x cost reduction compared to G6e’s $2.06 per million output tokens for a similar workload. This cost saving is primarily driven by G7e’s substantially lower hourly rate (e.g., $4.20/hr for ml.g7e.2xlarge vs. $13.12/hr for ml.g6e.12xlarge) combined with its ability to maintain consistent and high throughput under load, making it a more economical choice for large-scale deployments.

What are the memory capacities for deploying LLMs on single and multi-GPU G7e instances?

G7e instances offer substantial memory capacities for deploying large language models (LLMs). A single-GPU node, specifically a G7e.2xlarge instance, can host foundation models with up to 35 billion parameters in FP16 precision. For larger models, scaling across multiple GPUs within a single instance increases capacity: a 4-GPU node (G7e.24xlarge) can deploy models of up to 150 billion parameters, while an 8-GPU node (G7e.48xlarge) can handle models as large as 300 billion parameters. This scalability gives organizations the flexibility to deploy a wide range of LLMs without the complexity of multi-instance distributed setups.

What are the prerequisites for deploying solutions using G7e instances on Amazon SageMaker AI?

To deploy generative AI solutions using G7e instances on Amazon SageMaker AI, several prerequisites must be met. You need an active AWS account to host your resources and an AWS Identity and Access Management (IAM) role configured with appropriate permissions to access Amazon SageMaker AI services. For development and deployment, access to Amazon SageMaker Studio or a SageMaker notebook instance is recommended, though other interactive development environments like PyCharm or Visual Studio Code are also viable. Crucially, you must request a quota for at least one `ml.g7e.2xlarge` instance (or a larger G7e instance type) for Amazon SageMaker AI endpoint usage through the AWS Service Quotas console, as these are new and specialized instance types.