AI Agents: Resisting Prompt Injection with Lessons from Social Engineering

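AI agents are rapidly gaining capabilities, from browsing the web to retrieving complex information to taking actions on a user's behalf. These advances bring real gains in usefulness and efficiency, but they also open new and complicated attack surfaces. Chief among them is prompt injection: embedding malicious instructions in external content in order to manipulate an AI model into taking unintended actions. OpenAI highlights a key evolution in these attacks: they increasingly resemble social engineering, which means defenses must shift from simple input filtering to robust system design.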

An evolving threat: prompt injection and social engineering

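Early prompt injection attacks were often blunt, such as planting direct adversarial commands in a Wikipedia article that an AI agent might process. Early models had no training experience with adversarial environments and tended to follow such explicit instructions without question. As models have matured, they have become less vulnerable to suggestions this obvious. That has pushed attackers toward subtler methods that incorporate elements of social engineering.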

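This evolution matters because the problem is no longer just one of identifying malicious strings. It tests an AI system's ability to resist misleading or manipulative content in a broader context, much as a human faces social engineering. For example, OpenAI describes a 2025 prompt injection attack built around a seemingly harmless email that contained embedded instructions designed to trick an AI assistant into extracting sensitive employee data and submitting it to a "compliance validation system". The attack succeeded 50% of the time in testing, which shows how effective it is to blend a legitimate-looking request with malicious instructions. Attacks this sophisticated often bypass traditional "AI firewall" systems, which typically try to classify inputs using simple heuristics. Without full context, detecting such subtle manipulation is as hard as telling a lie or misinformation from the truth.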

AI agents as human counterparts: lessons from social engineering defense

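To counter these advanced prompt injection techniques, OpenAI has reframed the problem, viewing it through the lens of human social engineering. Under this approach the goal is not to identify every malicious input perfectly, but to design AI agents and systems so that the impact of manipulation is tightly bounded even when an attack partly succeeds. This mirrors how organizations manage social engineering risk for their human employees.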

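Consider a human customer service agent who is authorized to issue refunds or gift cards. The agent's job is to serve customers, yet they are constantly exposed to external input, some of which may be manipulative or even coercive. Organizations mitigate this risk with rules, limits, and deterministic systems. For example, the agent may have a cap on the refunds they can issue, or a specific procedure for flagging suspicious requests. In the same way, an AI agent acting on a user's behalf needs built-in limits and safeguards. By modeling the AI agent within this "three-actor system" (user, agent, external world), in which the agent must cope with potentially hostile external input, designers can build in resilience. The approach accepts that some attacks will inevitably occur while ensuring that the harm they can do is minimal. This principle underpins the set of countermeasures OpenAI deploys.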

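Defense principle	Description	Human-system analogue	Benefit
Constraints	Limit agent capabilities and actions to a predefined, safe scope, preventing unauthorized or overly broad actions.	Spending limits, authorization levels, enforcement of employee policies.	Reduces potential damage even if the agent is partly compromised.
Transparency	Require explicit user confirmation before carrying out potentially dangerous or sensitive actions.	Manager approval for exceptions, double checks on critical data entry.	Lets users override or confirm sensitive actions, keeping them in control.
Sandboxing	Isolate agent actions in a secure, monitored environment, especially when interacting with external tools or applications.	Controlled access to sensitive systems, segmented network environments.	Prevents malicious behavior from affecting core systems or leaking data.
Contextual source-sink analysis	Analyze input sources and output sinks for suspicious data flows or unauthorized transmissions, identifying patterns that indicate malicious intent.	Data loss prevention (DLP) systems, insider threat detection protocols.	Identifies and blocks attempts at unauthorized data exfiltration.
Adversarial training	Continuously train AI models to recognize and resist manipulative language, deceptive tactics, and social engineering attempts.	Security awareness training on recognizing phishing and scams.	Improves the agent's inherent ability to detect and flag malicious content.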
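
The "Constraints" row can be illustrated with the refund example from the customer service analogy. The following is a minimal sketch, not OpenAI's implementation: `RefundPolicy`, `check_refund`, and all of the limit values are hypothetical names and numbers chosen for illustration. The point it shows is that the check is deterministic and runs outside the model, so persuasive text in the conversation cannot change its outcome.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


@dataclass(frozen=True)
class RefundPolicy:
    # Hypothetical limits; real values would come from business rules.
    auto_approve_limit: float = 50.0  # the agent may refund up to this on its own
    hard_limit: float = 500.0         # the agent may never refund more than this
    max_refunds_per_day: int = 5      # daily cap on refunds issued by the agent


def check_refund(policy: RefundPolicy, amount: float, refunds_today: int) -> Decision:
    """Decide whether a refund proposed by the agent may proceed.

    The function never looks at conversation text, only at the
    structured action the agent wants to take.
    """
    if amount <= 0 or amount > policy.hard_limit:
        return Decision.DENY
    if refunds_today >= policy.max_refunds_per_day:
        return Decision.DENY
    if amount > policy.auto_approve_limit:
        # Above the soft limit a human (or the user) has to approve.
        return Decision.NEEDS_APPROVAL
    return Decision.ALLOW
```

Under these assumed limits, a 20.00 refund goes through, a 120.00 refund is routed for approval, and a 5000.00 refund is refused no matter how the request was worded.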

OpenAI's layered defenses in ChatGPT

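OpenAI combines this social engineering model with traditional security engineering techniques, in particular "source-sink analysis", in ChatGPT. In this framework an attacker needs two things: a "source" through which to inject influence (for example, untrusted external content) and a "sink" through which to exploit a dangerous capability (for example, transmitting information, following a malicious link, or interacting with a compromised tool). OpenAI's primary goal is to uphold a basic security expectation: dangerous actions and transmissions of sensitive information should never happen silently or without appropriate safeguards.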
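
One way to picture source-sink analysis is as coarse taint tracking over the agent's context. The sketch below is an illustration under stated assumptions, not a description of ChatGPT's internals: the origin labels, the tool names, and the `AgentContext` and `gate_tool_call` helpers are all hypothetical. It marks the context as tainted once untrusted content has been read, and requires confirmation before any sink is used from a tainted context.

```python
from dataclasses import dataclass, field

# Hypothetical labels for where content came from and which tools count as sinks.
UNTRUSTED_SOURCES = {"web_page", "email", "tool_output"}
DANGEROUS_SINKS = {"send_email", "http_post", "open_url"}


@dataclass
class AgentContext:
    """Tracks whether untrusted content has entered the agent's context."""

    tainted_by: set[str] = field(default_factory=set)

    def ingest(self, origin: str, text: str) -> str:
        # Record the origin of untrusted content; the text itself is not inspected.
        if origin in UNTRUSTED_SOURCES:
            self.tainted_by.add(origin)
        return text


def gate_tool_call(ctx: AgentContext, tool: str) -> str:
    """Return 'confirm' when a sink is reachable from an untrusted source, else 'allow'."""
    if tool in DANGEROUS_SINKS and ctx.tainted_by:
        return "confirm"
    return "allow"
```

The design choice here is to gate on the path from source to sink rather than on the wording of the content, which is why the sketch never tries to classify the ingested text as malicious or benign.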

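Many attacks on ChatGPT try to trick the assistant into extracting secret information from a conversation and forwarding it to a malicious third party. OpenAI's safety training usually leads the agent to refuse such requests, but for cases where the agent is persuaded, a key mitigation is Safe Url. This mechanism is designed to detect when information learned during a conversation might be transmitted to an external third-party URL. In these rare cases the system either shows the information to the user for explicit confirmation or blocks the transmission entirely and prompts the agent to find another, safe way to fulfill the user's request. This prevents data exfiltration even if the agent is momentarily compromised. For more on guarding against agent-driven link interactions, readers can refer to the dedicated blog post on keeping your data safe when an AI agent clicks a link.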
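
The general idea of checking an outbound URL for conversation-derived data can be sketched as follows. This is a simplified illustration and makes no claim about how Safe Url actually works: `find_leaked_values`, `review_outbound_url`, the trusted-host set, and the example hostnames are all assumptions introduced here. It only catches values that appear verbatim in the URL path or query string.

```python
from urllib.parse import parse_qsl, unquote, urlsplit


def find_leaked_values(url: str, sensitive_values: list[str]) -> list[str]:
    """Return the sensitive values that appear verbatim in the URL's path or query."""
    parts = urlsplit(url)
    # parse_qsl percent-decodes query values; the path is decoded separately.
    pieces = [unquote(parts.path)]
    pieces += [value for _, value in parse_qsl(parts.query, keep_blank_values=True)]
    haystack = "\n".join(pieces).lower()
    return [s for s in sensitive_values if s and s.lower() in haystack]


def review_outbound_url(
    url: str, sensitive_values: list[str], trusted_hosts: set[str]
) -> tuple[str, list[str]]:
    """Return ('allow', []) or ('confirm', leaked_values) for a URL the agent wants to open."""
    host = (urlsplit(url).hostname or "").lower()
    if host in trusted_hosts:
        return ("allow", [])
    leaked = find_leaked_values(url, sensitive_values)
    if leaked:
        # Surface exactly what would leave the conversation so the user can decide.
        return ("confirm", leaked)
    return ("allow", [])
```

For instance, with `alice@corp.example` recorded as sensitive, a request to `https://evil.example/collect?d=alice%40corp.example` would be held for confirmation, while the same query against a host in `trusted_hosts` would pass. A verbatim match like this misses encoded or paraphrased data, so a real system would need stronger detection than this sketch provides.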

The role of Safe Url and sandboxing in agentic AI

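The Safe Url mechanism is designed to detect and control transmissions of sensitive data, and its protection extends beyond link clicks alone. Similar safeguards apply to navigation and bookmarks in Atlas, and to search and navigation in Deep Research. In these applications an AI agent inherently interacts with a large number of external data sources, so strong controls on outbound data are essential.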

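Agentic features such as ChatGPT Canvas and ChatGPT Apps follow a similar security philosophy. When an agent creates and uses functional applications, those operations are confined to a secure sandbox. The sandbox makes it possible to detect unexpected communications or actions. Crucially, any potentially sensitive or unauthorized interaction triggers an explicit request for user consent, so that users keep final control over their data and over the agent's behavior. This layered approach, which combines source-sink analysis, context awareness, user consent, and sandboxed execution, forms a strong defense against evolving prompt injection and social engineering attacks. For more detail on how these agentic features operate safely, see the discussion of how agentic AI operates.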
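
The sandbox-plus-consent pattern can be sketched as an egress policy that sits between a sandboxed app and the network. The class below is hypothetical and is not how ChatGPT Canvas or ChatGPT Apps are implemented; `SandboxEgressPolicy`, `ask_user`, and the hostnames are assumptions for illustration. Expected hosts pass, anything unexpected is referred to the user, and every decision is logged so unexpected communication can be detected afterwards.

```python
from typing import Callable


class SandboxEgressPolicy:
    """Decides whether a sandboxed app may contact a given host."""

    def __init__(self, allowed_hosts: set[str], ask_user: Callable[[str], bool]):
        self.allowed_hosts = allowed_hosts
        self.ask_user = ask_user  # callback returning True if the user consents
        self.audit_log: list[tuple[str, str]] = []

    def permit(self, host: str) -> bool:
        if host in self.allowed_hosts:
            self.audit_log.append((host, "allowed"))
            return True
        # Unexpected destination: never proceed silently, ask the user first.
        approved = self.ask_user(host)
        self.audit_log.append((host, "approved" if approved else "blocked"))
        return approved
```

In a real deployment `ask_user` would show a consent prompt; in a test it can be a simple function such as `lambda host: False` that declines everything.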

Future-proofing autonomous agents against adversarial attacks

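Safe interaction with a hostile external world is not merely a desirable feature. It is a necessary foundation for building fully autonomous AI agents. OpenAI recommends that developers integrating AI models into their applications consider what controls a human agent would have in a comparable high-stakes situation, and implement analogous limits within the AI system.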

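Although the long-term vision is for the most capable AI models to eventually resist social engineering better than human agents do, that is not a feasible or cost-effective near-term goal for every application. Designing systems with built-in constraints and oversight therefore remains essential. OpenAI says it is committed to ongoing research into how social engineering affects AI models and to developing advanced defenses. These findings feed into its application security architecture and into the continued training of its AI models, supporting a proactive, adaptive approach to AI security as threats evolve. This forward-looking strategy aims to make AI agents both capable and trustworthy, in line with security work across the wider AI ecosystem, including initiatives to combat malicious uses of AI.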

Original source

https://openai.com/index/designing-agents-to-resist-prompt-injection/

Frequently asked questions

What is prompt injection in the context of AI agents?

Prompt injection refers to a type of attack where malicious instructions are subtly embedded within external content that an AI agent processes. The goal is to manipulate the agent into performing actions or revealing information that the user did not intend or authorize. These attacks exploit the AI's ability to interpret and follow instructions, even if those instructions originate from an untrusted source, effectively hijacking the agent's behavior for adversarial purposes. Early forms might be direct commands, but advanced forms leverage social engineering to be less detectable and more persuasive, requiring sophisticated countermeasures to maintain system integrity and user trust.

How has prompt injection evolved, and why is this significant?

Prompt injection has evolved from simple, explicit adversarial commands (e.g., direct instructions in a web page) to sophisticated social engineering tactics. Early models, which lacked training experience in adversarial environments, often followed such commands without question. As AI models matured and became less vulnerable to obvious instructions, attackers started crafting prompts that blend malicious intent with seemingly legitimate context, mimicking human social engineering. This shift is significant because it means defenses can no longer rely solely on identifying malicious strings. Instead, they must address the broader challenge of resisting misleading or manipulative content in context, which requires a systemic approach to security rather than simple input filtering.

How does OpenAI defend against social engineering prompt injection attacks?

OpenAI employs a multi-layered defense strategy, drawing parallels from human social engineering risk management. This includes a 'three-actor system' perspective (user, agent, external world) where agents are given limitations to constrain potential impact. Key techniques include 'source-sink analysis' to detect dangerous data flows, Safe Url mechanisms that prompt user confirmation or block sensitive transmissions to third parties, and sandboxing for agentic tools like ChatGPT Canvas and Apps. The overarching goal is to ensure that critical actions or data transmissions do not happen silently, always prioritizing user safety and consent to maintain robust AI security.

What is Safe Url, and how does it protect AI agents and users?

Safe Url is a critical mitigation strategy developed by OpenAI designed to protect AI agents and users from unauthorized data exfiltration. It detects when information that an AI agent has learned during a conversation or interaction might be transmitted to an external, potentially malicious, third-party URL. When such a transmission is detected, Safe Url intervenes by either displaying the sensitive information to the user for explicit confirmation before sending it, or by blocking the transmission entirely and instructing the agent to find an alternative, secure method to fulfill the user's request. This mechanism ensures that sensitive data remains under user control, even if an agent is momentarily swayed by a social engineering prompt injection.

Why is user consent crucial for AI agents, especially with new capabilities?

User consent is paramount for AI agents, particularly as their capabilities expand to include browsing, interacting with external tools, and transmitting information. With advanced prompt injection and social engineering tactics, an agent might be tricked into performing actions that compromise privacy or security. Requiring explicit user consent for potentially dangerous actions (such as transmitting sensitive data, navigating to external sites, or using external applications) ensures that users maintain ultimate control. This prevents silent compromises and empowers users to confirm or deny actions, acting as a crucial final layer of defense against manipulation and unauthorized behavior, aligning with principles of data privacy and user autonomy.

What is 'source-sink' analysis in the context of AI security?

Source-sink analysis is a security engineering approach used by OpenAI to identify and mitigate risks associated with data flow within AI systems. In this framework, a 'source' refers to any input mechanism through which an attacker can influence the system, such as untrusted external content, web pages, or emails processed by an AI agent. A 'sink' refers to a capability or action that, if exploited, could become dangerous in the wrong context, such as transmitting information to a third party, following a malicious link, or executing a tool. By analyzing potential paths from sources to sinks, security teams can implement controls to prevent unauthorized data movement or dangerous actions, even if an AI agent is partially compromised by a prompt injection attack. This method is fundamental to ensuring data integrity and system security.
