Multimodal Embeddings at Scale: An AI Data Lake for Media and Entertainment

Revolutionizing Video Search with Multimodal Embeddings

The media and entertainment industry holds enormous volumes of video content. From archival footage to daily uploads, the sheer volume makes traditional content discovery methods, namely manual tagging and keyword-based search, increasingly inefficient and often inaccurate. These approaches struggle to capture the full richness and nuanced context of video, which leads to missed opportunities for content reuse, faster production, and better viewing experiences.

Multimodal embeddings address these limits. AWS presents a solution that enables semantic search across very large video datasets. By combining Amazon Nova models with Amazon OpenSearch Service, content creators and distributors can go beyond surface-level keywords to understand and access their media libraries. Natural-language queries can then search both the visual and the audio information in a video, which makes content discovery considerably more precise.

AWS demonstrates this capability at scale by processing 792,270 videos from the AWS Open Data Registry, comprising 8,480 hours of video content. The run took only 41 hours to process more than 30.5 million seconds of video, which illustrates the scalability and efficiency of the approach. The first-year cost, including one-time ingestion and a year of OpenSearch Service, was estimated at $23,632 (with OpenSearch Service Reserved Instances) to $27,328 (with On-Demand pricing). A solution like this changes how media companies work with their digital assets and opens new avenues for content monetization and production workflows. The shift toward semantic understanding is an important development for enterprise AI in media.
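As a quick sanity check on the scale figures above, the following minimal sketch recomputes the derived numbers from the three quoted inputs. Only the three constants come from the article; the variable names are illustrative.

```python
# Figures quoted in the article.
VIDEOS = 792_270
HOURS_OF_CONTENT = 8_480
WALL_CLOCK_HOURS = 41

# Derived quantities.
content_seconds = HOURS_OF_CONTENT * 3600           # total seconds of video
videos_per_hour = VIDEOS / WALL_CLOCK_HOURS         # average ingestion rate
content_hours_per_hour = HOURS_OF_CONTENT / WALL_CLOCK_HOURS
avg_video_seconds = content_seconds / VIDEOS        # mean clip length

print(f"{content_seconds:,} seconds of video")
print(f"{videos_per_hour:,.0f} videos ingested per wall-clock hour (average)")
print(f"{content_hours_per_hour:.0f} hours of content per wall-clock hour")
print(f"{avg_video_seconds:.1f} seconds per video on average")
```

The average rate comes out slightly below the 19,400 videos per hour quoted later for the pipeline, which would be consistent with that figure being a rounded or peak rate; the article does not say which.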

Understanding the Scalable Multimodal Data Lake Architecture

At its core, this multimodal video search system is built on two interconnected workflows: video ingestion and search. Together they form an AI-powered data lake that understands video content and makes its details searchable.

Video Ingestion Pipeline

The ingestion pipeline is designed for parallel processing and efficiency. It uses four Amazon EC2 c7i.48xlarge instances running up to 600 parallel workers to reach a throughput of 19,400 videos per hour. Videos are first uploaded to Amazon S3 and then processed by the asynchronous API of Amazon Nova Multimodal Embeddings. The API splits each video into 15-second segments, a length that balances capturing meaningful scene changes against the number of embeddings generated. Each segment is then converted into a 1024-dimensional embedding that represents its combined audio-visual features. Although 3072-dimensional embeddings offer higher fidelity, the 1024-dimensional option cuts storage cost by a factor of 3 with minimal impact on accuracy for this application, which makes it the pragmatic choice at scale.
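The storage trade-off between the two embedding sizes can be estimated with a short sketch. It assumes each dimension is stored as a 4-byte float32 and counts raw vector bytes only, ignoring index overhead; both are assumptions, not figures from the article. The helper names are hypothetical.

```python
import math

SEGMENT_SECONDS = 15        # segment length quoted in the article
BYTES_PER_FLOAT32 = 4       # assumption: vectors stored as float32


def segment_count(duration_seconds: float, segment_seconds: int = SEGMENT_SECONDS) -> int:
    """Number of fixed-length segments for one video (a final partial segment counts)."""
    return max(1, math.ceil(duration_seconds / segment_seconds))


def raw_vector_bytes(num_vectors: int, dimension: int) -> int:
    """Raw storage for the vectors alone, without any index structures."""
    return num_vectors * dimension * BYTES_PER_FLOAT32


# Lower bound on the segment count: total content seconds divided by 15.
# The real count is higher because every video's last partial segment rounds up.
LOWER_BOUND_SEGMENTS = (8_480 * 3600) // SEGMENT_SECONDS

for dim in (1024, 3072):
    gib = raw_vector_bytes(LOWER_BOUND_SEGMENTS, dim) / 2**30
    print(f"{dim} dimensions: at least {gib:.1f} GiB of raw vectors")
```

Because storage grows linearly with the dimension, the 3072-dimensional option needs exactly three times the raw vector storage of the 1024-dimensional one, which matches the factor quoted above.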

To improve searchability further, Amazon Nova Pro (or the newer, more cost-effective Nova 2 Lite) generates 10-15 descriptive tags per video from a predefined taxonomy. This dual approach makes content discoverable through both semantic similarity and traditional keyword matching. The embeddings are stored in an OpenSearch k-NN index, optimized for vector similarity search, while the descriptive tags are indexed in a separate text index. This separation allows flexible and efficient queries. The pipeline handles the Amazon Bedrock concurrency limit (30 concurrent jobs per account) with a job queue and a polling mechanism, so processing stays continuous and within quota.
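The job queue and poller described above might look roughly like the following sketch. It is not the article's implementation: `start_job` and `get_status` are hypothetical callables that a real pipeline would back with the Bedrock StartAsyncInvoke and GetAsyncInvoke operations, and the status strings are assumed to be "InProgress", "Completed", and "Failed".

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, Iterable

MAX_CONCURRENT_JOBS = 30  # per-account limit quoted in the article


def run_with_concurrency_cap(
    video_uris: Iterable[str],
    start_job: Callable[[str], str],      # hypothetical: submits a job, returns its ID
    get_status: Callable[[str], str],     # hypothetical: returns the job's status
    max_concurrent: int = MAX_CONCURRENT_JOBS,
    poll_interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Submit jobs without exceeding the cap; return the final status per video URI."""
    pending: Deque[str] = deque(video_uris)
    in_flight: Dict[str, str] = {}        # job ID -> video URI
    results: Dict[str, str] = {}          # video URI -> final status

    while pending or in_flight:
        # Top up the in-flight set until the concurrency cap is reached.
        while pending and len(in_flight) < max_concurrent:
            uri = pending.popleft()
            in_flight[start_job(uri)] = uri

        # Poll every in-flight job and retire the finished ones.
        for job_id in list(in_flight):
            status = get_status(job_id)
            if status in ("Completed", "Failed"):
                results[in_flight.pop(job_id)] = status

        # Wait before the next polling round only if work is still running.
        if in_flight:
            sleep(poll_interval_seconds)

    return results
```

Passing the two callables in keeps the scheduling logic testable without AWS access; a production version would also want retries for failed jobs and throttling-aware backoff.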

The following figure illustrates the ingestion process:

Figure 1: Video ingestion pipeline showing the flow from video storage in Amazon S3 through Nova Multimodal Embeddings and Nova Pro to dual OpenSearch indices

Enabling Diverse Video Search Capabilities

The search architecture is designed for flexibility and offers several content discovery modes:

Text-to-video search: Users enter natural-language queries such as "drone shot of a bustling city at night" or "close-up of a chef preparing a gourmet meal". The system converts the query into an embedding and then uses the OpenSearch k-NN index to find video segments or whole videos that semantically match the description, even when the exact words do not appear in any metadata. This mode suits intuitive content discovery and story planning.
Video-to-video search: When a user has a video clip and wants to find similar content, the system compares the embeddings of the input video directly with those in the OpenSearch k-NN index to identify visually and aurally similar content. This is valuable for locating B-roll footage, checking content consistency, or discovering derivative works.
Hybrid search: Hybrid search combines vector similarity with traditional keyword matching. The proposed solution uses a weighted approach (for example, 70% vector similarity and 30% keyword matching). Specific metadata steers the search while semantic understanding supplies broader contextual matches, which improves precision and relevance. This mode is especially effective for complex queries that benefit from both exact tags and conceptual understanding.
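For the text-to-video and video-to-video modes, the lookup against the k-NN index reduces to one request body once the query embedding exists. The sketch below builds such a body using the OpenSearch `knn` query clause. The field name `embedding` and the helper name are assumptions for illustration; the article does not give its index schema.

```python
from typing import Any, Dict, List

EMBEDDING_FIELD = "embedding"   # hypothetical field name in the k-NN index
EMBEDDING_DIMENSION = 1024      # dimension used in the article


def build_knn_query(
    query_vector: List[float],
    k: int = 10,
    field: str = EMBEDDING_FIELD,
) -> Dict[str, Any]:
    """Build an OpenSearch k-NN search body for a query embedding."""
    if len(query_vector) != EMBEDDING_DIMENSION:
        raise ValueError(
            f"expected {EMBEDDING_DIMENSION} dimensions, got {len(query_vector)}"
        )
    return {
        "size": k,                        # number of hits to return
        "query": {
            "knn": {
                field: {
                    "vector": query_vector,
                    "k": k,               # nearest neighbors to retrieve
                }
            }
        },
    }
```

The same body serves both modes: only the source of `query_vector` differs, a text query embedding in one case and a video segment embedding in the other.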

Figure 2: Video search architecture showing three search modes: text-to-video, video-to-video, and hybrid search combining k-NN and BM25
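The 70/30 weighting used in hybrid search can be illustrated with a small client-side sketch. It assumes min-max normalization of each score list before the weighted sum, because k-NN and BM25 scores live on different scales. The article does not say how or where its solution combines the scores, so the function names and the normalization choice here are illustrative only.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to the range [0, 1]; a constant list maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(
    vector_scores: Dict[str, float],
    keyword_scores: Dict[str, float],
    vector_weight: float = 0.7,
    keyword_weight: float = 0.3,
) -> List[Tuple[str, float]]:
    """Combine normalized k-NN and keyword scores into one ranked list."""
    v = min_max(vector_scores)
    kw = min_max(keyword_scores)
    combined = {
        # A document missing from one result list contributes 0 for that part.
        doc: vector_weight * v.get(doc, 0.0) + keyword_weight * kw.get(doc, 0.0)
        for doc in set(v) | set(kw)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


ranked = hybrid_rank(
    vector_scores={"clip-a": 0.9, "clip-b": 0.5, "clip-c": 0.1},
    keyword_scores={"clip-b": 12.0, "clip-c": 3.0},
)
for doc, score in ranked:
    print(f"{doc}: {score:.2f}")
```

In this example a clip with no keyword match can still rank first on vector similarity alone, while a keyword match lifts a moderately similar clip close behind it.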

Cost-Effective Deployment and Prerequisites

Deploying a data lake like this requires careful attention to infrastructure and cost, which AWS optimized for efficiency. For the full dataset of about 8,480 hours of video, the estimated first-year total, covering one-time ingestion plus a year of OpenSearch Service, is $27,328 (with OpenSearch Service On-Demand) or $23,632 (with OpenSearch Service Reserved Instances).

The cost breakdown shows the main cost drivers:

Amazon EC2 compute: $421 (4 c7i.48xlarge Spot Instances for 41 hours)
Amazon Bedrock Nova Multimodal Embeddings: $17,096 (30.5 million seconds at $0.00056 per second with batch pricing)
Nova Pro tagging: $571 (792K videos, averaging about 600 tokens per video)
Amazon OpenSearch Service: $9,240 (annual, On-Demand) or $5,544 (annual, Reserved Instances)
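The line items above add up to the first-year totals quoted earlier, as this short sketch confirms. All dollar figures are taken from the article; the dictionary keys and the helper name are illustrative.

```python
# One-time ingestion costs in USD, from the breakdown above.
INGESTION_COSTS = {
    "ec2_spot": 421,
    "nova_embeddings": 17_096,
    "nova_pro_tagging": 571,
}

# Annual OpenSearch Service cost in USD, by pricing option.
OPENSEARCH_ANNUAL = {
    "on_demand": 9_240,
    "reserved": 5_544,
}

EMBEDDING_PRICE_PER_SECOND = 0.00056   # batch pricing quoted in the article
CONTENT_SECONDS = 8_480 * 3600         # 8,480 hours of video


def first_year_total(pricing_option: str) -> int:
    """One-time ingestion plus one year of OpenSearch Service."""
    return sum(INGESTION_COSTS.values()) + OPENSEARCH_ANNUAL[pricing_option]


ingestion_total = sum(INGESTION_COSTS.values())
embedding_share = INGESTION_COSTS["nova_embeddings"] / ingestion_total

print(f"One-time ingestion: ${ingestion_total:,}")
print(f"Embeddings share of ingestion: {embedding_share:.0%}")
for option in OPENSEARCH_ANNUAL:
    print(f"First year ({option}): ${first_year_total(option):,}")
print(f"Embedding cost check: ${CONTENT_SECONDS * EMBEDDING_PRICE_PER_SECOND:,.2f}")
```

Embedding generation makes up about 95% of the one-time ingestion cost, so the per-second embedding price and the total video duration are the inputs that matter most when adapting the estimate to another library.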

Prerequisites: To replicate or adapt this solution, you need:

An AWS account with access to Amazon Bedrock in us-east-1.
Python 3.9 or later.
The AWS Command Line Interface (AWS CLI) configured with appropriate credentials.
An Amazon OpenSearch Service domain (r6g.large or larger recommended), version 2.11 or later, with the k-NN plugin enabled.
An Amazon S3 bucket for storing videos and embedding outputs.
AWS Identity and Access Management (IAM) permissions for Amazon Bedrock, OpenSearch Service, and Amazon S3.

The solution uses the following AWS services and models:

Amazon Bedrock with amazon.nova-2-multimodal-embeddings-v1:0 for embeddings.
Amazon Bedrock with us.amazon.nova-pro-v1:0 or us.amazon.nova-2-lite-v1:0 for tagging.
Amazon OpenSearch Service 2.11+ with the k-NN plugin.
Amazon S3 for storage.
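The two OpenSearch indices, one k-NN index for embeddings and one text index for tags, could be defined along the following lines. This is a sketch under assumptions: the field names (`video_id`, `segment_start_seconds`, `embedding`, `tags`) are hypothetical, and the k-NN method and engine are left at the plugin defaults because the article does not specify them.

```python
import json
from typing import Any, Dict


def knn_index_body(dimension: int = 1024) -> Dict[str, Any]:
    """Index definition for segment embeddings (field names are hypothetical)."""
    return {
        "settings": {"index": {"knn": True}},   # enable k-NN search on this index
        "mappings": {
            "properties": {
                "video_id": {"type": "keyword"},
                "segment_start_seconds": {"type": "float"},
                "embedding": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


def tag_index_body() -> Dict[str, Any]:
    """Index definition for per-video descriptive tags (field names are hypothetical)."""
    return {
        "mappings": {
            "properties": {
                "video_id": {"type": "keyword"},
                "tags": {"type": "text"},       # analyzed text, scored with BM25
            }
        }
    }


# The bodies are plain JSON and can be sent with any OpenSearch client.
print(json.dumps(knn_index_body(), indent=2))
print(json.dumps(tag_index_body(), indent=2))
```

Sharing `video_id` across both indices is one way to join a tag match back to the segments of the same video when combining results.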

Implementing the Multimodal Video Search Solution

Getting started with this architecture follows a structured approach to setting up your AWS environment. The first critical step is creating the required permissions.

Step 1: Create IAM roles and policies

You need to create an IAM role that authorizes your application or service to interact with the various AWS components. The role must include permissions to invoke Amazon Bedrock models (for both embedding generation and tagging), to write data to OpenSearch indices, and to read from and write to the Amazon S3 buckets that hold your video content and processing outputs.

Here is an example of a basic IAM policy structure:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:StartAsyncInvoke",
        "bedrock:GetAsyncInvoke",
        "bedrock:ListAsyncInvokes"
      ],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-video-bucket/*",
        "arn:aws:s3:::your-video-bucket"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "es:ESHttpPost",
        "es:ESHttpPut",
        "es:ESHttpDelete",
        "es:ESHttpGet"
      ],
      "Resource": "arn:aws:es:us-east-1:*:domain/your-opensearch-domain/*"
    }
  ]
}

This policy grants the specific permissions the pipeline needs to operate. Remember to replace placeholders such as your-video-bucket and your-opensearch-domain with your actual resource names. After configuring IAM, continue by setting up your S3 bucket, configuring your OpenSearch Service domain with k-NN enabled, and developing the orchestration logic that calls the Amazon Bedrock APIs for ingestion.

This framework lets media and entertainment companies manage, discover, and monetize their growing content libraries efficiently. The solution shows how modern AI capabilities, multimodal understanding in particular, are redefining content management and accessibility. It also shows how combining advanced AI models with scalable cloud infrastructure can solve real-world enterprise AI challenges, with advances similar to those seen in agentic AI workflows.

Original source

https://aws.amazon.com/blogs/machine-learning/multimodal-embeddings-at-scale-ai-data-lake-for-media-and-entertainment-workloads/

Frequently Asked Questions

What is a multimodal AI data lake for media and entertainment workloads?

A multimodal AI data lake for media and entertainment is an advanced system designed to store, process, and enable intelligent search across vast collections of video content. Unlike traditional keyword-based systems, it leverages AI models, specifically multimodal embeddings, to understand the nuanced meaning and context within audio and visual data. This allows for semantic search capabilities, where users can query content using natural language descriptions or by providing another video, moving beyond simple tags to find relevant moments or entire videos based on their actual content. AWS's solution utilizes services like Amazon Nova for embedding generation and Amazon OpenSearch Service for efficient storage and retrieval of these high-dimensional vectors, making it ideal for large-scale content libraries.

How does the video ingestion pipeline handle large-scale datasets?

The video ingestion pipeline detailed in the article is engineered for massive scale, demonstrating processing of nearly 800,000 videos totaling 8,480 hours of content. It employs a distributed architecture using multiple Amazon EC2 instances (four c7i.48xlarge) to parallelize video processing. Key to its efficiency is the asynchronous API of Amazon Nova Multimodal Embeddings, which splits videos into 15-second segments and generates 1024-dimensional embeddings. To manage the Amazon Bedrock concurrency limit of 30 concurrent jobs per account, the pipeline implements a job queue with polling, ensuring continuous processing. Additionally, Amazon Nova Pro (or Nova 2 Lite) is used to generate descriptive tags, further enriching the metadata. These embeddings and tags are then indexed into Amazon OpenSearch Service's k-NN and text indices respectively, preparing the data for rapid search.

What types of video search capabilities does this solution enable?

This multimodal AI data lake solution provides three powerful video search capabilities, significantly enhancing content discovery. First, **Text-to-video Search** allows users to input natural language queries (e.g., 'a person surfing at sunset') which are then converted into embeddings and matched semantically against video content, going beyond exact keyword matches. Second, **Video-to-video Search** enables users to find similar video segments or entire videos by comparing their embeddings directly, useful for content recommendations or identifying duplicates. Third, **Hybrid Search** combines the strengths of both semantic vector similarity and traditional keyword matching (e.g., 70% vector, 30% keyword) for maximum accuracy and relevance, especially when dealing with complex queries that benefit from both contextual understanding and specific metadata.

Which AWS services are critical for building this multimodal embedding solution?

Several core AWS services are critical for constructing this scalable multimodal embedding solution. At its heart are **Amazon Bedrock** and its **Nova Multimodal Embeddings** for generating high-dimensional vector representations from video and audio, and **Nova Pro** (or **Nova 2 Lite**) for intelligent tagging. **Amazon OpenSearch Service** (specifically with its k-NN plugin) serves as the scalable vector database to store and query these embeddings, alongside a traditional text index for metadata. **Amazon S3** (Simple Storage Service) is essential for storing the raw video files and the outputs of the embedding process. **Amazon EC2** provides the compute power for orchestrating the ingestion pipeline and managing the large-scale processing of video data. Additionally, **AWS IAM** is vital for securing access and permissions across these integrated services.

What are the cost considerations for deploying such a large-scale multimodal video search system?

Deploying a large-scale multimodal video search system, as demonstrated by the processing of over 8,000 hours of video, involves significant but manageable costs. The article provides a detailed breakdown, estimating a first-year total cost of approximately $23,632 to $27,328. This cost is primarily divided into two components: one-time ingestion costs and ongoing annual Amazon OpenSearch Service costs. Ingestion is dominated by Amazon Bedrock Nova Multimodal Embeddings usage, charged per second of processed video, and Nova Pro tagging. Amazon EC2 compute for orchestration also contributes but is comparatively smaller. OpenSearch Service costs can be optimized by using Reserved Instances over on-demand pricing. Careful planning and monitoring of resource usage, especially Bedrock API calls and OpenSearch cluster sizing, are key to managing and optimizing these expenditures.

Why is semantic search using multimodal embeddings superior to traditional keyword search for video content?

Semantic search, powered by multimodal embeddings, has a clear advantage over traditional keyword search for video content because it works from a contextual understanding of the content itself. Keyword search is limited to matches on the words and phrases in the metadata, so it often fails to capture synonyms, related concepts, or the visual and auditory nuances of video. For instance, a keyword search for 'people talking' misses a conversation scene whose tags mention only 'interview'. Multimodal embeddings, by contrast, convert the information from both audio and video into dense numerical vectors. These vectors capture meaning, style, and context, allowing queries based on conceptual similarity rather than lexical matches alone. Users can therefore find relevant content even if the exact keywords aren't present, or describe a visual scene in natural language, which improves content discovery and relevance in large video archives.

How does the Amazon Nova family of models contribute to this solution?

The Amazon Nova family of models plays a central role in enabling this multimodal video search solution. Specifically, **Amazon Nova Multimodal Embeddings** is the backbone for transforming raw video and audio into high-dimensional vectors (embeddings). It segments videos and extracts combined audio-visual features, allowing for semantic comparisons. This model is crucial for both text-to-video and video-to-video search. Additionally, **Amazon Nova Pro** (or the more cost-effective **Nova 2 Lite**) is used to generate descriptive tags. These tags enrich the video metadata, enabling hybrid search scenarios where both conceptual similarity and specific keywords can be used to refine search results. Together, these Nova models let the system understand, categorize, and make searchable the complex information contained within video content.

What are the benefits of using OpenSearch Service's k-NN index in this architecture?

Amazon OpenSearch Service's k-NN (k-Nearest Neighbor) index is a cornerstone of this multimodal video search architecture, providing the capability to efficiently store and query high-dimensional vector embeddings. The primary benefit is enabling rapid and accurate semantic search. When a query (text or video) is converted into an embedding, the k-NN index can quickly find the 'k' most similar video embeddings within the vast dataset. This is far more efficient than traditional database lookups for vector similarity. It allows for real-time semantic search across millions of video segments. By integrating seamlessly with other OpenSearch capabilities, it also facilitates hybrid search, combining vector similarity with traditional text-based filtering and scoring, ensuring a powerful and flexible search experience that scales with the size of the media library.
