Evaluating AI Agents: Strands Evals for Production Readiness

An Experiment acts as the test suite and orchestrates the whole evaluation process. It bundles multiple Cases with one or more configured Evaluators. During an evaluation run, the Experiment takes each Case, feeds its input to your AI agent, collects the agent's response and execution trace, and passes those results to the assigned Evaluators for scoring. This abstraction keeps evaluation systematic and repeatable across a defined set of scenarios.
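The orchestration described above can be sketched in plain Python. This is a minimal illustration, not the Strands Evals API: `ToyCase`, `run_experiment`, `echo_task`, and `exact_match` are hypothetical names introduced here, and the stub task stands in for a real agent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCase:
    """Hypothetical stand-in for a single test scenario (not the library's class)."""
    name: str
    input: str
    expected_output: Optional[str] = None


def run_experiment(cases, evaluators, task):
    """Run every case through the task, then score the result with every evaluator."""
    report = []
    for case in cases:
        # The task is expected to return a dict with "output" and "trajectory".
        result = task(case)
        scores = {name: evaluate(case, result) for name, evaluate in evaluators.items()}
        report.append({"case": case.name, "scores": scores})
    return report


def echo_task(case):
    # Stub agent: upper-cases the input and records no tool calls.
    return {"output": case.input.upper(), "trajectory": []}


def exact_match(case, result):
    # Simplest possible evaluator: 1.0 on an exact string match, else 0.0.
    return 1.0 if result["output"] == case.expected_output else 0.0


cases = [
    ToyCase("greeting", "hello", "HELLO"),
    ToyCase("farewell", "bye", "goodbye"),
]
report = run_experiment(cases, {"exact_match": exact_match}, echo_task)
for row in report:
    print(row)
```

The point of the sketch is the separation of roles: cases carry the scenarios, the task produces results, and evaluators score them, so any one of the three can be swapped without touching the others.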

Finally, Evaluators are the judges in this system. They examine what your agent produced, both the actual output and the trajectory it followed, and compare it against what was expected or required. Unlike simple assertion checks, most Strands Evals Evaluators are LLM-based. This is the key difference: by relying on a large language model, an Evaluator can make nuanced judgments about qualities such as relevance, helpfulness, coherence, and faithfulness, which string comparison alone cannot assess accurately. This flexible yet rigorous judgment is central to evaluating AI agents for production effectively.
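To make the contrast with assertion checks concrete, here is a small sketch of a judge-style evaluator. It assumes the judge is any callable that takes a prompt and returns text; `stub_judge` is a hypothetical stand-in that only looks for one key fact, where a real setup would call an LLM. None of these names come from Strands Evals.

```python
import re


def exact_match(actual: str, expected: str) -> bool:
    """Assertion-style check: passes only when the strings are identical."""
    return actual.strip() == expected.strip()


def judge_with_rubric(actual: str, expected: str, judge) -> bool:
    """Ask a judge callable (prompt -> reply text) for a PASS/FAIL verdict."""
    prompt = (
        "Decide whether the candidate answer conveys the same facts as the reference.\n"
        f"Reference answer: {expected}\n"
        f"Candidate answer: {actual}\n"
        "Reply with PASS or FAIL."
    )
    reply = judge(prompt)
    # Parse the verdict defensively; anything other than PASS counts as a failure.
    verdict = re.search(r"\b(PASS|FAIL)\b", reply)
    return verdict is not None and verdict.group(1) == "PASS"


def stub_judge(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call: checks only for the key fact "Paris".
    candidate = prompt.split("Candidate answer:", 1)[1]
    return "PASS" if "paris" in candidate.lower() else "FAIL"


expected = "Paris"
actual = "The capital of France is Paris."

strict = exact_match(actual, expected)
judged = judge_with_rubric(actual, expected, stub_judge)
print(strict, judged)
```

The valid but differently worded answer fails the exact comparison and passes the judge, which is the gap that judgment-based evaluation is meant to close.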

Task Function: Connecting Agent Execution and Evaluation

To integrate your AI agent with the framework, Strands Evals uses a key component called the Task Function. This callable acts as the bridge: it receives a Case object and returns the result of running that case through your agent system. The interface is flexible and supports two fundamentally different evaluation patterns, online and offline. For more on preparing AI agents for production, see Operationalizing Agentic AI Part 1: A Stakeholder's Guide.

Online evaluation invokes your AI agent live during the evaluation run. The Task Function creates an agent instance dynamically, sends it the case input, and captures the agent's live response and execution trace. This pattern is most valuable during development, where it gives immediate feedback on changes, and it is essential for Continuous Integration and Delivery (CI/CD) pipelines, which need to verify agent behavior before deployment. It ensures that the agent is evaluated in its actual running state.

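from strands import Agent

# search_tool and calculator_tool are assumed to be defined elsewhere.
def online_task(case):
    # Create a new agent instance for this case.
    agent = Agent(tools=[search_tool, calculator_tool])
    # Invoke the agent live with the case input.
    result = agent(case.input)

    # Return the final response and the execution record for the evaluators.
    return {
        "output": str(result),
        "trajectory": agent.session
    }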

In contrast, offline evaluation works with historical data. Instead of starting a live agent, the Task Function retrieves previously recorded interaction traces from sources such as logs, databases, or observability systems, then parses them into the format the evaluators expect so they can be judged. This approach works well for evaluating production traffic, analyzing historical performance, or comparing agent versions against the same set of real user interactions, all without the compute cost of re-running the agent live. It is particularly useful for retrospective analysis and large-scale dataset evaluation.

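# load_trace_from_database, session_mapper, and extract_final_response
# are assumed to be defined elsewhere.
def offline_task(case):
    # Fetch the recorded trace for this case instead of running the agent.
    trace = load_trace_from_database(case.session_id)
    # Convert the raw trace into the session format the evaluators expect.
    session = session_mapper.map_to_session(trace)

    return {
        "output": extract_final_response(trace),
        "trajectory": session
    }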
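The helpers in the offline task are not shown in the article. The sketch below is one way they might look, under loud assumptions: the trace format (a list of event dicts), the in-memory `TRACE_STORE`, and every function name are hypothetical, and the trajectory is simplified to a list of tool names rather than a mapped session object.

```python
# Hypothetical in-memory trace store keyed by session id; a real system
# would read from logs, a database, or an observability backend.
TRACE_STORE = {
    "session-001": [
        {"type": "user_message", "content": "What is 12 * 7?"},
        {"type": "tool_call", "name": "calculator", "arguments": {"expression": "12 * 7"}},
        {"type": "tool_result", "name": "calculator", "content": "84"},
        {"type": "agent_message", "content": "12 * 7 is 84."},
    ]
}


def load_trace(session_id):
    """Return the recorded events for one session."""
    return TRACE_STORE[session_id]


def extract_final_response(trace):
    """Return the last agent message in the trace, or an empty string if none exists."""
    for event in reversed(trace):
        if event["type"] == "agent_message":
            return event["content"]
    return ""


def extract_tool_names(trace):
    """Simplified trajectory: the ordered names of the tools the agent called."""
    return [event["name"] for event in trace if event["type"] == "tool_call"]


def offline_task_sketch(session_id):
    # No agent is invoked here; everything comes from the recorded trace.
    trace = load_trace(session_id)
    return {
        "output": extract_final_response(trace),
        "trajectory": extract_tool_names(trace),
    }


result = offline_task_sketch("session-001")
print(result)
```

Because the return value has the same shape as the online task's, the same evaluators can score both live runs and recorded sessions.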

Whether you are testing a newly deployed agent or reviewing months of production data, the same evaluators and reporting infrastructure in Strands Evals apply. The Task Function abstracts the data source and adapts it to the evaluation system, which yields consistent, comprehensive insight into agent performance. Integrating evaluation this way is key to advanced agentic coding workflows, similar to those discussed in Xcode Agentic Coding.

Evaluating Agent Quality with Built-in Evaluators

Once the Task Function can deliver agent results to the evaluation system, the next step is deciding which aspects of agent quality to measure. Strands Evals is designed for comprehensive evaluation and therefore ships with a set of built-in evaluators, each targeting a different dimension of an AI agent's performance and output quality.

The framework treats agent quality as multi-dimensional. It is not enough for an agent to produce text; that text must be helpful, relevant, coherent, and faithful to its context or source material. Traditional metrics often fail to capture these subjective but important qualities. This is where the LLM-based evaluators mentioned earlier become essential. By using large language models as judges, Strands Evals can carry out nuanced qualitative assessments. These LLMs can analyze an agent's response for its overall usefulness to the user, its logical flow, its adherence to stated facts or instructions, and its ability to stay consistent across a conversation. This kind of judgment lets developers move beyond simple keyword matching and understand how effective and reliable an AI agent is in real-world scenarios.
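Once judges have scored each response on several dimensions, the scores still need to be read per dimension rather than as one blended number. The sketch below shows one way to do that. It assumes each judge returns a score between 0 and 1, and the 0.7 pass threshold, the `summarize` function, and the sample scores are illustrative assumptions, not values from Strands Evals.

```python
from statistics import mean

# Quality dimensions named in the text above.
DIMENSIONS = ("helpfulness", "relevance", "coherence", "faithfulness")


def summarize(scores_per_case, threshold=0.7):
    """Report the mean score and pass count for each quality dimension."""
    summary = {}
    for dimension in DIMENSIONS:
        values = [scores[dimension] for scores in scores_per_case]
        summary[dimension] = {
            "mean": mean(values),
            "passed": sum(value >= threshold for value in values),
            "total": len(values),
        }
    return summary


# Illustrative scores for two cases, as a judge might return them.
scores = [
    {"helpfulness": 0.9, "relevance": 1.0, "coherence": 0.8, "faithfulness": 0.5},
    {"helpfulness": 0.7, "relevance": 0.8, "coherence": 0.9, "faithfulness": 1.0},
]
summary = summarize(scores)
for dimension, stats in summary.items():
    print(dimension, stats)
```

Keeping the dimensions separate makes a weak spot visible: in this sample the agent is consistently helpful, but one of its two answers falls short on faithfulness.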

Conclusion: Ensuring AI Agents Are Production-Ready with Strands Evals

Moving an AI agent from concept to reliable production use requires an evaluation strategy that goes beyond the limits of traditional software testing. Strands Evals offers exactly that: a structured, practical framework that accepts the inherently non-deterministic and adaptive nature of AI agents. By defining evaluations explicitly through Cases, orchestrating them through Experiments, and applying nuanced Evaluators, especially LLM-driven ones for qualitative judgment, Strands Evals lets developers assess performance systematically.

The flexibility of the Task Function, which supports both live online evaluation for rapid development and offline analysis of historical data, extends its usefulness across the agent lifecycle. This approach helps ensure that AI agents are not merely functional but also helpful, coherent, and robust, which provides the confidence needed to integrate them into critical production environments. For teams that are serious about building, deploying, and maintaining high-quality, production-ready AI agents, adopting a framework such as Strands Evals is an important step.

Source

https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-for-production-a-practical-guide-to-strands-evals/

Frequently Asked Questions

What fundamental challenge do AI agents pose for traditional software testing methodologies?

AI agents, by their inherent nature, are flexible, adaptive, and highly context-aware, making their outputs non-deterministic. Unlike traditional software where the same input reliably yields the same expected output, AI agents generate natural language responses and make decisions that can vary even with identical inputs. This variability means that conventional assertion-based testing, which relies on precise, predictable outcomes, is inadequate. Agents' ability to use tools, retrieve information, and engage in multi-turn conversations further complicates evaluation, requiring a shift from simple keyword comparisons to nuanced, judgment-based assessments that can handle the fluidity and creativity of AI-driven interactions. This necessitates specialized frameworks like Strands Evals to systematically gauge quality dimensions beyond strict determinism.

How does Strands Evals address the non-deterministic nature of AI agent outputs?

Strands Evals tackles the non-deterministic challenge by introducing a framework centered on judgment-based evaluation, primarily leveraging large language models (LLMs) as evaluators. Instead of relying on strict assertion checks, LLM-based evaluators can make nuanced assessments of qualitative aspects such as helpfulness, coherence, relevance, and faithfulness of agent responses. The framework organizes evaluation into Cases (individual scenarios), Experiments (collections of cases and evaluators), and Evaluators (the judging mechanism), allowing for systematic yet flexible assessment. This approach moves beyond simple string comparisons to understand the subjective quality of agent interactions, ensuring that even varied but valid outputs are correctly recognized as successful.

Explain the core concepts of Strands Evals: Cases, Experiments, and Evaluators.

Strands Evals builds upon three foundational concepts to enable systematic AI agent evaluation. A **Case** serves as the atomic unit of testing, defining a single test scenario. It includes the user input (e.g., a query), optional expected outputs, anticipated tool usage sequences (trajectories), and relevant metadata. An **Experiment** functions as a test suite, bundling multiple Cases together with one or more Evaluators. It orchestrates the entire evaluation process, running the agent against each Case and applying the configured Evaluators. Finally, **Evaluators** act as the 'judges,' assessing the agent's actual output and trajectory against the expectations. Crucially, Strands Evals primarily uses LLM-based Evaluators to make qualitative judgments on attributes like helpfulness and coherence, which are difficult to quantify with traditional assertion methods, providing a flexible yet rigorous assessment.

What is the purpose of the Task Function in Strands Evals, and how do online and offline evaluation differ?

The Task Function in Strands Evals is a critical callable component that bridges your AI agent's execution environment with the evaluation system. Its purpose is to receive a Case (a test scenario) and return the agent's results (output and execution trace) in a format suitable for evaluation. This function enables two distinct patterns: **Online Evaluation** involves invoking your agent live during the evaluation run. Here, the Task Function creates an agent, feeds it the case input, and captures its real-time response and execution trace. This is ideal for development, testing immediate changes, or integrating into CI/CD pipelines. In contrast, **Offline Evaluation** works with historical data. The Task Function retrieves previously recorded agent traces from logs or databases, parsing them into the expected format. This is highly effective for analyzing production traffic, performing historical performance analysis, or comparing different agent versions against consistent real-world interactions, offering flexibility without requiring live agent invocation.

Why are LLM-based evaluators crucial for assessing AI agents effectively?

LLM-based evaluators are crucial because they overcome the limitations of traditional, assertion-based testing when assessing AI agents. Agents often produce natural language outputs and make context-dependent decisions, meaning there isn't always one single 'correct' answer that can be checked with a simple string comparison. LLM-based evaluators, leveraging their understanding of language and context, can make nuanced judgments about subjective qualities such as a response's helpfulness, coherence, relevance, or faithfulness to source material. They can discern whether an agent's varied but valid output still meets user goals or maintains context across multi-turn conversations. This capability is essential for systematically measuring the qualitative dimensions of agent performance that are vital for real-world utility and user satisfaction, ensuring agents are not only factually accurate but also user-friendly and effective.
