Scaling multimodal embeddings: an AI data lake for media and entertainment

Transforming video search with multimodal embeddings

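The media and entertainment industry is awash in video. From archive footage to content uploaded daily, the sheer volume makes traditional discovery methods such as manual tagging and keyword search increasingly inefficient and often inaccurate. These methods struggle to capture the richness and subtle context of video, which means missed opportunities for content reuse, faster production, and a better viewer experience.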

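Multimodal embeddings address these limits. AWS describes a solution that enables semantic search across very large video datasets. By combining Amazon Nova models with Amazon OpenSearch Service, content creators and distributors can search a media library by what it actually contains rather than by surface keywords. Natural-language queries can reach both the visual and the audio content of a video, which makes content discovery considerably more precise.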

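To demonstrate the approach at scale, AWS processed 792,270 videos from the AWS Open Data Registry, amounting to 8,480 hours of content. Processing these more than 30.5 million seconds of video took 41 hours, which underlines the scalability and efficiency of the approach. The estimated first-year cost, covering one-time ingestion plus a year of OpenSearch Service, ranges from $23,632 (with OpenSearch Service Reserved Instances) to $27,328 (on demand). A solution like this changes how media companies work with their digital assets and opens new options for content monetization and production workflows. The shift to semantic understanding is a significant step for enterprise AI in media.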

Understanding the scalable multimodal AI data lake architecture

The multimodal video search system is built on two connected workflows: video ingestion and search. Together they form an AI data lake that captures the fine detail of video content and makes it searchable.

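Video ingestion pipeline

The ingestion pipeline is designed for parallelism and efficiency. Four Amazon EC2 c7i.48xlarge instances orchestrate up to 600 parallel workers and process about 19,400 videos per hour. Videos are first uploaded to Amazon S3 and then processed by the Amazon Nova Multimodal Embeddings asynchronous API, which splits each video into 15-second chunks. That chunk length balances capturing meaningful scene changes against the volume of embeddings generated. Each segment is converted into a 1024-dimensional embedding that represents its combined audio-visual features. 3072-dimensional embeddings offer higher fidelity, but the 1024-dimensional option cuts storage cost by a factor of three with minimal impact on accuracy for this application, which makes it the practical choice at scale.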
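
The storage trade-off between the two embedding sizes can be checked with simple arithmetic. The sketch below uses the 8,480 hours and 15-second chunks quoted in this article; the 4-byte float32 storage size and the function names are assumptions for illustration, and the figures cover raw vector data only, not index overhead.

```python
import math

TOTAL_SECONDS = 8_480 * 3600   # 8,480 hours of video -> 30,528,000 seconds
SEGMENT_SECONDS = 15           # chunk length used by the pipeline
BYTES_PER_FLOAT32 = 4          # assumption: each dimension stored as float32


def segment_count(total_seconds, segment_seconds=SEGMENT_SECONDS):
    """Lower bound on segments: treats the corpus as one continuous stream."""
    return math.ceil(total_seconds / segment_seconds)


def raw_vector_bytes(num_vectors, dimension):
    """Raw size of the vectors alone, ignoring any index overhead."""
    return num_vectors * dimension * BYTES_PER_FLOAT32


segments = segment_count(TOTAL_SECONDS)
size_1024 = raw_vector_bytes(segments, 1024)
size_3072 = raw_vector_bytes(segments, 3072)

print(f"segments (lower bound): {segments:,}")
print(f"1024-dim raw vectors:   {size_1024 / 1e9:.2f} GB")
print(f"3072-dim raw vectors:   {size_3072 / 1e9:.2f} GB")
print(f"ratio:                  {size_3072 / size_1024:.1f}x")
```

The real segment count is somewhat higher, because the last chunk of each video is usually shorter than 15 seconds, but the threefold ratio between the two dimensions holds regardless.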

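To improve search further, Amazon Nova Pro (or the newer, more cost-effective Nova 2 Lite) generates 10 to 15 descriptive tags per video from a predefined taxonomy. This dual approach makes content findable through both semantic similarity and conventional keyword matching. The embeddings are stored in an OpenSearch k-NN index optimized for vector similarity search, and the descriptive tags are indexed in a separate text index. Keeping the two apart allows flexible, efficient queries. The pipeline uses a job queue and a polling mechanism to stay within the Amazon Bedrock concurrency limit (30 concurrent jobs per account) while keeping processing continuous.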
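
The job queue and polling idea can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: `start_job` and `get_status` are hypothetical callables that a real implementation would back with the Bedrock asynchronous invocation calls, and `FakeBackend` is a stand-in so the sketch runs without AWS access.

```python
import time
from collections import deque

MAX_CONCURRENT_JOBS = 30  # per-account concurrency limit cited in the article


def run_with_concurrency_limit(items, start_job, get_status,
                               max_concurrent=MAX_CONCURRENT_JOBS,
                               poll_interval=0.0):
    """Submit every item while keeping at most max_concurrent jobs in flight.

    start_job(item) -> job_id and get_status(job_id) -> str are hypothetical
    callables supplied by the caller. Any status other than "InProgress" is
    treated as final. Returns a dict mapping each item to its final status.
    """
    pending = deque(items)
    in_flight = {}  # job_id -> item
    results = {}
    while pending or in_flight:
        # Fill any free slots from the queue.
        while pending and len(in_flight) < max_concurrent:
            item = pending.popleft()
            in_flight[start_job(item)] = item
        # Poll each running job once and retire the finished ones.
        for job_id in list(in_flight):
            status = get_status(job_id)
            if status != "InProgress":
                results[in_flight.pop(job_id)] = status
        if in_flight:
            time.sleep(poll_interval)
    return results


class FakeBackend:
    """Stand-in for the real service: each job completes on its second poll."""

    def __init__(self):
        self.polls = {}
        self.peak = 0  # highest number of unfinished jobs observed

    def start_job(self, item):
        job_id = f"job-{item}"
        self.polls[job_id] = 0
        active = sum(1 for n in self.polls.values() if n < 2)
        self.peak = max(self.peak, active)
        return job_id

    def get_status(self, job_id):
        self.polls[job_id] += 1
        return "Completed" if self.polls[job_id] >= 2 else "InProgress"


backend = FakeBackend()
results = run_with_concurrency_limit(range(100), backend.start_job,
                                     backend.get_status)
print(len(results), "jobs finished; peak concurrency:", backend.peak)
```

In a real deployment the poll interval would be several seconds, and failed jobs would typically be requeued with a retry limit rather than recorded once.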

The following figure shows the ingestion process:

Figure 1: Video ingestion pipeline showing the flow from S3 video storage through Nova Multimodal Embeddings and Nova Pro to the dual OpenSearch indices

Enabling multiple video search modes

The search architecture is designed to be versatile and supports several content discovery modes:

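Text-to-video search: Users enter natural-language queries such as "drone footage of a busy city at night" or "close-up of a chef preparing a gourmet dish". The system converts the query into an embedding and uses the OpenSearch k-NN index to find video segments, or whole videos, that semantically match the description. This works even when the exact words do not appear in the metadata, which makes it well suited to intuitive content discovery and storyboarding.
Video-to-video search: This mode suits the case where a user has a clip and wants to find similar content. The system compares the embedding of the input video directly against the embeddings in the OpenSearch k-NN index to identify visually and aurally similar content. It is useful for finding B-roll footage, checking content consistency, and discovering derivative works.
Hybrid search: Hybrid search combines vector similarity with conventional keyword matching to get the benefits of both. The proposed solution uses a weighted approach (for example, 70% vector similarity and 30% keyword matching), so specific metadata guides the search while semantic understanding supplies broader contextual matches. It is particularly effective for complex queries that benefit from both exact tags and conceptual understanding.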
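
The 70/30 weighting in hybrid search can be illustrated with a small score-fusion sketch. It assumes each result list is min-max normalized before the weighted sum, which is one common way to combine scores on different scales; the article does not say which normalization the solution uses, and the function names and sample scores are made up for illustration.

```python
def min_max_normalize(scores):
    """Rescale {doc_id: score} to [0, 1]; identical scores all map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(vector_scores, keyword_scores,
                vector_weight=0.7, keyword_weight=0.3):
    """Weighted sum of normalized scores; a missing score counts as 0."""
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(keyword_scores)
    combined = {
        doc: vector_weight * v.get(doc, 0.0) + keyword_weight * k.get(doc, 0.0)
        for doc in set(v) | set(k)
    }
    return sorted(combined.items(), key=lambda pair: pair[1], reverse=True)


# Illustrative scores: cosine similarities from k-NN, BM25 scores from tags.
vector_scores = {"clip_a": 0.92, "clip_b": 0.80, "clip_c": 0.60}
keyword_scores = {"clip_b": 7.5, "clip_c": 2.5, "clip_d": 5.0}

ranked = hybrid_rank(vector_scores, keyword_scores)
for doc_id, score in ranked:
    print(f"{doc_id}: {score:.4f}")
```

With these sample scores, clip_b ranks first: it is not the closest vector match, but it scores well on both signals, which is the behavior the weighting is meant to produce.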

Figure 2: Video search architecture showing the three search modes: text-to-video, video-to-video, and hybrid search combining k-NN and BM25

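Cost-effective deployment and prerequisites

Deploying an AI data lake like this requires careful attention to infrastructure and cost, and AWS has optimized the design for efficiency. For the dataset described above, about 8,480 hours of video, the estimated first-year total is $27,328 with OpenSearch Service on demand or $23,632 with OpenSearch Service Reserved Instances.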

The cost breakdown shows the main cost drivers:

Amazon EC2 compute: $421 (4x c7i.48xlarge Spot Instances for 41 hours)
Amazon Bedrock Nova Multimodal Embeddings: $17,096 (30.5M seconds at the batch price of $0.00056 per second)
Nova Pro tagging: $571 (792K videos, about 600 tokens per video on average)
Amazon OpenSearch Service: $9,240 per year (on demand) or $5,544 per year (reserved)
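
The first-year totals follow directly from the line items above. The short sketch below adds them up and re-derives the embedding charge from the per-second price; all dollar figures come from this article, and the variable names are only for illustration.

```python
# One-time ingestion costs in USD, as listed in the breakdown.
INGESTION_USD = {
    "ec2_spot_compute": 421,
    "nova_multimodal_embeddings": 17_096,
    "nova_pro_tagging": 571,
}

# Annual OpenSearch Service cost in USD for each pricing model.
OPENSEARCH_ANNUAL_USD = {"on_demand": 9_240, "reserved": 5_544}


def first_year_total(pricing_model):
    """One-time ingestion plus one year of OpenSearch Service."""
    return sum(INGESTION_USD.values()) + OPENSEARCH_ANNUAL_USD[pricing_model]


# Cross-check the embedding line item: seconds of video x price per second.
total_seconds = 8_480 * 3600  # 30,528,000 seconds
embedding_estimate = total_seconds * 0.00056

print("ingestion subtotal:", sum(INGESTION_USD.values()))
print("first year, on demand:", first_year_total("on_demand"))
print("first year, reserved:", first_year_total("reserved"))
print("embedding cost check:", round(embedding_estimate))
```

The embedding charge accounts for most of the one-time cost, so the per-second price and the total duration of the library are the two numbers that matter most when estimating a different dataset.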

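Prerequisites for implementation: To replicate or adapt this solution, you need the following:

An AWS account with access to Amazon Bedrock in the us-east-1 Region.
Python 3.9 or later.
The AWS Command Line Interface (AWS CLI) configured with appropriate credentials.
An Amazon OpenSearch Service domain, version 2.11 or later, with the k-NN plug-in enabled (r6g.large or larger recommended).
An Amazon S3 bucket for video storage and embedding output.
AWS Identity and Access Management (IAM) permissions for Amazon Bedrock, OpenSearch Service, and Amazon S3.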

The solution uses the following AWS services and models:

Amazon Bedrock with amazon.nova-2-multimodal-embeddings-v1:0 for embeddings.
Amazon Bedrock with us.amazon.nova-pro-v1:0 or us.amazon.nova-2-lite-v1:0 for tagging.
Amazon OpenSearch Service 2.11+ with the k-NN plug-in.
Amazon S3 for storage.
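
As a rough illustration of the two-index layout, the request bodies might look like the sketch below. The field names (`video_id`, `segment_start_seconds`, `embedding`, `tags`) are hypothetical, and the mapping leaves the k-NN method and engine at their defaults; only the `knn_vector` type, the `index.knn` setting, and the `knn` query clause are standard OpenSearch k-NN syntax. The functions only build dictionaries, so nothing here contacts a cluster.

```python
import json

EMBEDDING_DIMENSION = 1024  # matches the embedding size chosen in the article


def knn_index_body(dimension=EMBEDDING_DIMENSION):
    """Index body for segment embeddings (field names are hypothetical)."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "video_id": {"type": "keyword"},
                "segment_start_seconds": {"type": "float"},
                "embedding": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


def tag_index_body():
    """Index body for the separate text index that holds descriptive tags."""
    return {
        "mappings": {
            "properties": {
                "video_id": {"type": "keyword"},
                "tags": {"type": "text"},
            }
        }
    }


def knn_query_body(query_vector, k=10, dimension=EMBEDDING_DIMENSION):
    """k-NN search body for a query embedding of the expected dimension."""
    if len(query_vector) != dimension:
        raise ValueError(f"expected {dimension} values, got {len(query_vector)}")
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": list(query_vector), "k": k}}},
    }


print(json.dumps(knn_index_body(), indent=2))
print(json.dumps(tag_index_body(), indent=2))
query = knn_query_body([0.0] * EMBEDDING_DIMENSION, k=5)
print(query["size"], "results requested")
```

These bodies would be sent with whichever OpenSearch client the pipeline uses. The same query shape serves both text-to-video and video-to-video search, since both reduce to a nearest-neighbor lookup on a query embedding.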

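Implementing the multimodal video search solution

Getting started with this architecture calls for a structured approach to setting up the AWS environment. The first step is to set up the required permissions.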

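Step 1: Create IAM roles and policies

Create an IAM role that lets your application or service interact with the AWS components involved. The role needs permission to invoke Amazon Bedrock models (for both embedding generation and tagging), to write data to the OpenSearch indices, and to read from and write to the Amazon S3 bucket that holds the video content and processed output.

The following is an example of a basic IAM policy structure: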

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:StartAsyncInvoke",
        "bedrock:GetAsyncInvoke",
        "bedrock:ListAsyncInvokes"
      ],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-video-bucket/*",
        "arn:aws:s3:::your-video-bucket"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "es:ESHttpPost",
        "es:ESHttpPut",
        "es:ESHttpDelete",
        "es:ESHttpGet"
      ],
      "Resource": "arn:aws:es:us-east-1:*:domain/your-opensearch-domain/*"
    }
  ]
}

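This policy grants the permissions the pipeline needs to operate. Remember to replace placeholders such as your-video-bucket and your-opensearch-domain with your actual resource names. After the IAM setup, continue by configuring the S3 bucket, setting up the OpenSearch Service domain with k-NN enabled, and developing the orchestration logic that calls the Bedrock API for ingestion.

With this framework, media and entertainment companies can manage, discover, and monetize growing content libraries efficiently. The solution illustrates how current AI capabilities, multimodal understanding in particular, are changing expectations for content management and accessibility. It also shows what combining advanced AI models with scalable cloud infrastructure can do for real enterprise AI problems, and it supports the kind of progress seen in agentic AI workflows.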

Original source

https://aws.amazon.com/blogs/machine-learning/multimodal-embeddings-at-scale-ai-data-lake-for-media-and-entertainment-workloads/

Frequently asked questions

What is a multimodal AI data lake for media and entertainment workloads?

A multimodal AI data lake for media and entertainment is an advanced system designed to store, process, and enable intelligent search across vast collections of video content. Unlike traditional keyword-based systems, it leverages AI models, specifically multimodal embeddings, to understand the nuanced meaning and context within audio and visual data. This allows for semantic search capabilities, where users can query content using natural language descriptions or by providing another video, moving beyond simple tags to find relevant moments or entire videos based on their actual content. AWS's solution utilizes services like Amazon Nova for embedding generation and Amazon OpenSearch Service for efficient storage and retrieval of these high-dimensional vectors, making it ideal for large-scale content libraries.

How does the video ingestion pipeline handle large-scale datasets?

The video ingestion pipeline detailed in the article is engineered for massive scale, demonstrating processing of nearly 800,000 videos totaling about 8,480 hours of content. It employs a distributed architecture using multiple Amazon EC2 instances (e.g., c7i.48xlarge) to parallelize video processing. Key to its efficiency is the asynchronous API of Amazon Nova Multimodal Embeddings, which segments videos into chunks (e.g., 15-second segments) and generates 1024-dimensional embeddings. To manage Bedrock's concurrency limits, the pipeline implements a job queue with polling, ensuring continuous processing. Additionally, Amazon Nova Pro (or Nova 2 Lite) is used to generate descriptive tags, further enriching the metadata. These embeddings and tags are then indexed into Amazon OpenSearch Service's k-NN and text indices respectively, preparing the data for rapid search.

What types of video search capabilities does this solution enable?

This multimodal AI data lake solution provides three powerful video search capabilities, significantly enhancing content discovery. First, **Text-to-video Search** allows users to input natural language queries (e.g., 'a person surfing at sunset') which are then converted into embeddings and matched semantically against video content, going beyond exact keyword matches. Second, **Video-to-video Search** enables users to find similar video segments or entire videos by comparing their embeddings directly, useful for content recommendations or identifying duplicates. Third, **Hybrid Search** combines the strengths of both semantic vector similarity and traditional keyword matching (e.g., 70% vector, 30% keyword) for maximum accuracy and relevance, especially when dealing with complex queries that benefit from both contextual understanding and specific metadata.

Which AWS services are critical for building this multimodal embedding solution?

Several core AWS services are critical for constructing this scalable multimodal embedding solution. At its heart are **Amazon Bedrock** and its **Nova Multimodal Embeddings** for generating high-dimensional vector representations from video and audio, and **Nova Pro** (or **Nova 2 Lite**) for intelligent tagging. **Amazon OpenSearch Service** (specifically with its k-NN plugin) serves as the scalable vector database to store and query these embeddings, alongside a traditional text index for metadata. **Amazon S3** (Simple Storage Service) is essential for storing the raw video files and the outputs of the embedding process. **Amazon EC2** provides the compute power for orchestrating the ingestion pipeline and managing the large-scale processing of video data. Additionally, **AWS IAM** is vital for securing access and permissions across these integrated services.

What are the cost considerations for deploying such a large-scale multimodal video search system?

Deploying a large-scale multimodal video search system, as demonstrated by the processing of over 8,000 hours of video, involves significant but manageable costs. The article provides a detailed breakdown, estimating a first-year total cost of approximately $23,632 to $27,328. This cost is primarily divided into two components: one-time ingestion costs and ongoing annual Amazon OpenSearch Service costs. Ingestion is dominated by Amazon Bedrock Nova Multimodal Embeddings usage, charged per second of processed video, and Nova Pro tagging. Amazon EC2 compute for orchestration also contributes but is comparatively smaller. OpenSearch Service costs can be optimized by using Reserved Instances over on-demand pricing. Careful planning and monitoring of resource usage, especially Bedrock API calls and OpenSearch cluster sizing, are key to managing and optimizing these expenditures.

Why is semantic search using multimodal embeddings superior to traditional keyword search for video content?

Semantic search, powered by multimodal embeddings, offers a profound advantage over traditional keyword search for video content by enabling a deeper, contextual understanding. Keyword search is limited to exact matches of words and phrases, often failing to capture synonyms, related concepts, or the visual and auditory nuances of video. For instance, searching for 'people talking' might miss a scene where individuals are silently communicating through gestures. Multimodal embeddings, however, convert the rich information from both audio and video into dense numerical vectors. These vectors capture the meaning, style, and context, allowing for queries based on conceptual similarity rather than just lexical matches. This means users can find relevant content even if the exact keywords aren't present, or describe a visual scene using natural language, significantly improving content discovery and relevance in large video archives.

How does the Amazon Nova family of models contribute to this solution?

The Amazon Nova family of models plays a central role in enabling this advanced multimodal video search solution. Specifically, **Amazon Nova Multimodal Embeddings** is the backbone for transforming raw video and audio into high-dimensional vectors (embeddings). It segments videos and extracts combined audio-visual features, allowing for sophisticated semantic comparisons. This model is crucial for both text-to-video and video-to-video search functionalities. Additionally, **Amazon Nova Pro** (or the more cost-effective **Nova 2 Lite**) is utilized for generating descriptive tags. These tags enrich the video metadata, enabling hybrid search scenarios where both conceptual similarity and specific keywords can be used to refine search results. Together, these Nova models empower the system to understand, categorize, and make searchable the complex information contained within video content.

What are the benefits of using OpenSearch Service's k-NN index in this architecture?

Amazon OpenSearch Service's k-NN (k-Nearest Neighbor) index is a cornerstone of this multimodal video search architecture, providing the capability to efficiently store and query high-dimensional vector embeddings. The primary benefit is enabling rapid and accurate semantic search. When a query (text or video) is converted into an embedding, the k-NN index can quickly find the 'k' most similar video embeddings within the vast dataset. This is far more efficient than traditional database lookups for vector similarity. It allows for real-time semantic search across millions of video segments. By integrating seamlessly with other OpenSearch capabilities, it also facilitates hybrid search, combining vector similarity with traditional text-based filtering and scoring, ensuring a powerful and flexible search experience that scales with the size of the media library.