└── README.md


/README.md:
--------------------------------------------------------------------------------
  1 | # Awesome papers and blogs on Agent Security
  2 | 
  3 | 
  4 | ## Overview
  5 | 
  6 | This repo aims to study the security and safety threats/risks of the LLM-enabled agents. Note that we do not distinguish between safety and security risks in this paper list. The most common distinction would be that security refers to intentional threats, while safety refers to unintentional threats. But sometimes it may not be that easy to distinguish between intentional and unintentional. So we just mix them together. 
  7 | 
  8 | **At a high level, having additional components in the system introduces new attack surfaces to the model as well as new attack targets.** Here, the new attack surface to the model mainly refers to indirect prompt injection attacks that attack the models through other components. New attack targets mean non-model components in the system can be attack targets as well. 
  9 | 
 10 | 
 11 | 
 12 | 1. Attack the model through other components: (Indirect) prompt injection into the model to disrupt the task of the model
 13 |     1. Targeted attack: The goal is to hijack the model to accomplish malicious tasks. The tasks can be any model-level threats discussed:
 14 |         1. Generic ones from [LLM safety and security](https://docs.google.com/document/d/1QowkXo-cM0UQF2FzdSjNkZis9ODeU6AIwucC0nH_vcc/edit?usp=sharing)
 15 |     2. Non-targeted attacks: just disrupt the model to not finish the user/benign task (disrupt instruction following)
 16 | 2. Attack other components through the model: A system calls and executes other components through the model; attackers can give malicious instructions to the model and let the model generate code that exploits components it interacts with.
 17 | 
 18 | For  “Attack the model through other components,” without other components, this attack downgrades to directly generating adversarial prompts to attack target LLMs. This is a common risk across all types of agents. For “Attack other components through the model”, without the model, this attack downgrades to traditional software/system security. This risk category is more complex; the risks are different for different types of agents given their different components and goals. In the following, we first summarize common agent types and then discuss the common risks across all agents (attack the model through other components) and specified risks for each type of agent (Attack other components through the model). 
 19 | 
 20 | Based on the attack entry points and attack path/targets, we can have 
 21 | 
 22 | 
 23 | 
 24 | 1. Entry points, i.e., Injection method:
 25 |     1. Direct injection: append the adversarial prompts after **the system prompts**
 26 |         1. This is realistic when malicious users try to attack an agent by querying it 
 27 |         2. If there is a benign user querying the model and the attacker appends an adversarial prompt after the prompt of the benign user, this IMHO is not that realistic. Indirect injection would be more realistic. 
 28 |     2. Indirect injection: inject adversarial prompts through other components in the system 
 29 | 2. Attack targets and goals:
 30 |     3. The model:
 31 |         3. Non-target attack: mess up with the instructions following the model
 32 |         4. Targeted attack: privacy, harmful content generation, etc.
 33 |     4. Other components: case-by-case
 34 |         5. E.g., Illusioning and goal-misleading in web agents
 35 | 
 36 | 	
 37 | 
 38 | There could be four combinations, where [direct injection + model as the target] downgrade to [LLM safety and security](https://docs.google.com/document/d/1QowkXo-cM0UQF2FzdSjNkZis9ODeU6AIwucC0nH_vcc/edit?usp=sharing). 
 39 | 
 40 | Disclaimer: Please note that while we strive to include a comprehensive selection of papers on the security and safety threats of AI agents, it is not possible to cover all interesting and relevant papers. We will do our best to continually update and expand this document to incorporate more significant contributions in this fast-evolving field. In this document, we discuss the limitations of the included papers. These critiques are not intended to reflect negatively on the authors or the quality of their work. All the papers reviewed here are recognized as valuable contributions to the field. Our aim is to provide constructive analysis to foster further research and understanding.
 41 | 
 42 | 
 43 | ## Table of Contents
 44 | 
 45 | - [Agentic system & benchmarks](#agentic-system--benchmarks)
 46 |   - [Agent survey and benchmarks](#agent-survey-and-benchmarks)
 47 |   - [Agent security survey and benchmarks](#agent-security-survey-and-benchmarks)
 48 |   - [Agent system card](#agent-system-card)
 49 | - [Red-teaming](#red-teaming)
 50 |   - [General attacks: Prompt injection/Memory/Backdoor](#general-attacks-prompt-injectionmemorybackdoor)
 51 |   - [Attack against specific agents](#attack-against-specific-agents)
 52 | - [Blue-teaming](#blue-teaming)
 53 |   - [Model-based defenses](#model-based-defenses)
 54 |   - [System-level Runtime Defense](#system-level-runtime-defense)
 55 |   - [Others](#others)
 56 | - [Contributors](#contributors)
 57 | 
 58 | 
 59 | ## Agentic system & benchmarks
 60 | 
 61 | A typical agent system is composed of (multiple) LLMs and tools, where LLMs serve as agents to finish a given task, leveraging the available tools. RAG can be considered as one of the simplest agents, where the tool is an external knowledge base. Another simple example of the agent is to connect an LLM with a code interpreter that can execute the code generated by the model. More complex agents involve having multiple collaborative agents, tools, and memory, where each subset of agents controls a different set of tools. Based on their purposes, existing agents can be categorized as web agents that facilitate human-web interactions; coding agents that aid humans in writing code, providing code completion, debugging, etc; and Personal assistant agents that assist users with daily tasks (e.g., setting calendars and sending emails). Note that personal assistant agents can also have web components. Many of the agents involve multiple data modalities.
 62 | 
 63 | ### Agent survey and benchmarks
 64 | 
 65 | 
 66 | 
 67 | 1. Ai agents vs. agentic ai: A conceptual taxonomy, applications and challenge [[Information Fusion'25/05](https://arxiv.org/pdf/2505.10468)]
 68 | 2. From llm reasoning to autonomous ai agents: A comprehensive review [[arxiv'25/04](https://arxiv.org/abs/2504.19678)]
 69 | 3. From standalone llms to integrated intelligence: A survey of compound al systems [[arxiv'25/06](https://arxiv.org/abs/2506.04565)]
 70 | 4. A survey of agent interoperability protocols: Model context protocol (mcp), agent communication protocol (acp), agent-to-agent protocol (a2a), and agent network protocol (anp) [[arxiv'25/04](https://arxiv.org/abs/2505.02279)]
 71 | 5. A Survey of AI Agent Protocols [[arviv'25/04](A Survey of AI Agent Protocols)]
 72 | 6. WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? [[ICML'24/07](https://servicenow.github.io/WorkArena/)]
 73 |     1. Complete comment tasks related to the web: Dashboard, Form, Knowledge, List-filter, List-sort, Menu, Service Catalog, AXtree, HTML, screenshots
 74 | 7. Web
 75 |     1. *Webarena: A realistic web environment for building autonomous agents* [[ICLR'24/05](https://arxiv.org/pdf/2307.13854)]
 76 |         1. Shopping, Gitlab, Reddit
 77 |     2. *VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks* [[ACL'24/08](https://jykoh.com/vwa)]
 78 |         1. Tasks: Classifieds, Shopping, and Reddit
 79 |         2. Input: Image, HTML, accessibility tree, SOM(image with numbers)
 80 |         3. Propose a new VLM agent, where every task has a visual understanding
 81 |         4. Evaluation: exact_match, must_include, fuzzy_match, must_exclude
 82 |     4. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models [[ACL'24/08](https://arxiv.org/pdf/2401.13919)]
 83 |     5. WEBLINX: Real-World Website Navigation with Multi-Turn Dialogue [[ICML'24/07](https://arxiv.org/abs/2402.05930)]
 84 |     6. An Illusion of Progress? Assessing the Current State of Web Agents [[COLM'25/10](https://huggingface.co/spaces/osunlp/Online_Mind2Web_Leaderboard)]
 85 |     7. Google Project Mariner [[https://deepmind.google/models/project-mariner/](https://deepmind.google/models/project-mariner/)]
 86 |     8. Browser use [[https://browser-use.com/](https://browser-use.com/)]
 87 | 8. *OS/software interaction*
 88 |     1. Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale [[arxiv'24/09](https://github.com/microsoft/WindowsAgentArena)]
 89 |         1. Agent design: model+executor (environment executes the model’s output)
 90 |         2. Input: UIA tree(Windows UI Automation tree), OCR, Image
 91 |         3. Office, Web Browsing, Windows System, Coding, Media, Windows Utilities. 154 tasks
 92 |     2. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents [[ICLR'25/04](https://github.com/google-research/android_world)]
 93 |         1. Screenshot,  ally tree
 94 |         2. requires_setup, data_entry, complex_ui_understanding, parameterized, game_playing, multi_app, memorization, math_counting, screen_reading, verification, information_retrieval, transcription,Repetition, search, data_edit
 95 |     3. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [[NeurIPS'24/12](https://os-world.github.io/)]
 96 |         1. Screenshot+tree
 97 |         2. Workflow, Windows-Workflow, Chrome, GIMP, LibreOffice Calc, LibreOffice Impress, LibreOffice Writer, OS, Thunderbird, VLC, VS Code, Excel, Word, PowerPoint
 98 |     4. RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use Agents [[arXiv'25/06](https://arxiv.org/abs/2506.00618)]
 99 |         1. Screenshot+tree
100 |         2. 13 risk categories (Wrong instruction following, Environment data leakage/corruption)
101 |         3. Evaluation: Rule-based assessment (max 15 steps)
102 |         4. Tasks: Workflow, Windows-Workflow, Chrome, GIMP, LibreOffice Calc, LibreOffice Impress, LibreOffice Writer, OS, Thunderbird, VLC, VS Code, Excel, Word, PowerPoint
103 |     5. OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents [[arXiv'25/06](https://arxiv.org/abs/2506.14866)]
104 |         1. Screenshot+accessibility tree
105 |         2. 150 tasks, 3 harm categories: (1) Deliberate user misuse, (2) Prompt injection attacks, (3) Model misbehavior
106 |         3. Applications: Thunderbird, VS Code, Terminal, Chrome, LibreOffice (Calc/Impress/Writer), and 11 desktop apps
107 |     6. AgentBench: Evaluating LLMs as Agents [[ICLR'24/05](https://arxiv.org/abs/2308.03688)]
108 |     7. AIOS: LLM Agent Operating System [[COLM'25/10](https://arxiv.org/abs/2403.16971)]
109 |     8. Android
110 |         1. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents [[ICLR'25/04](https://github.com/google-research/android_world)]
111 |         2. MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control [[arxiv'24/12](https://mobilesafetybench.github.io/)]
112 | 9. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning [[ICLR'21/05](https://alfworld.github.io/)]
113 | 
114 | ### Agent security survey and benchmarks
115 | 
116 | 
117 | 
118 | 1. From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows [[arxiv'25/06](https://arxiv.org/abs/2506.23260)]
119 | 2. Survey on evaluation of llm-based agents [[arxiv'25/05](https://arxiv.org/abs/2503.16416)]
120 | 3. Ai agents under threat: A survey of key security challenges and future pathways [[ACM Computing Surveys'25/02](https://dl.acm.org/doi/10.1145/3716628)]
121 | 4. Security of AI Agents [[arxiv'24/12](https://arxiv.org/abs/2406.08689)]
122 | 5. Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents [[arxiv'25/05](https://arxiv.org/abs/2504.19956)]
123 | 6. Agentic AI Needs a Systems Theory (IBM Research) [[arxiv'25/02](https://arxiv.org/pdf/2503.00237)]
124 | 7. Practices for Governing Agentic AI Systems (OpenAI) [[openai paper'23'12](https://openai.com/index/practices-for-governing-agentic-ai-systems/)]
125 | 8. Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions [[arxiv'25/04](https://arxiv.org/abs/2503.23278)]
126 | 9. General indirect prompt injection
127 |     1. INJECAGENT: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents [[ACL'24/08](https://arxiv.org/pdf/2403.02691)]
128 |         1. Threats: Direct harm to users & exfiltration of private data
129 |         2. Prompt generation method: ReAct
130 |     2. Identifying the Risks of LM Agents with an LM-Emulated Sandbox [[ICLR'24/05](https://arxiv.org/abs/2309.15817)]
131 |         1. A framework that uses an LM to emulate tool execution and enables scalable testing of LM agents against a diverse range of tools and scenarios (benchmark consisting of 36 high-stakes toolkits and 144 test cases)
132 |         2. LM-based automated safety evaluator that examines agent failures and quantifies associated risks
133 |         3. Threat model: user instructions are ambiguous or omit critical details, posing risks when the LM agent fails to properly resolve these ambiguities, simulations
134 |     3. LLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injection Challenge [[arxiv'25/06](https://arxiv.org/abs/2506.09956)]
135 | 10. Benchmarks for web agents
136 |     1. WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks [[arxiv'25/04](https://arxiv.org/abs/2504.18575)]
137 |         1. Prompt injection benchmark based on WebArena
138 | 11. Benchmarks for CUA agents
139 |     1. RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments [[arxiv'25/06](https://arxiv.org/abs/2505.21936)]
140 |         1. CUA agent testing based on OSworld as the backbone (with sandbox and web container) 
141 |         2. A graphic interface between the agent and the environment
142 |     2. ​​Weathering the CUA Storm: Mapping Security Threats in the Rapid Rise of Computer Use Agents [[ICML Workshop on Computer Use Agents'25/07](https://openreview.net/pdf/74675450995f897873e67dd1d69351d8b3b3cd38.pdf)]
143 |         1. Clickjacking: Domain spoofing (e.g., g00gle.com)
144 |         2. Remote code execution on a sandbox
145 |         3. Chain-of-thought Exposure
146 |         4. Bypassing Human-in-the-loop Safeguards
147 |         5. Indirect prompt injection attacks
148 |         6.  Identity ambiguity and over-delegation
149 |         7. Content harms
150 |     3. Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents [[ICLR'25/07](https://arxiv.org/abs/2410.02644)]
151 |         1. Threat model: The attacker aims to mislead the LLM agent into using a specified tool (third attack type)
152 |         2. 10 scenarios and 10 agents:
153 |             1. IT management, Investment, Legal advice, Medicine, Academic advising, Counseling, E-commerce, Aerospace design, Research, Autonomous  vehicles
154 |         3. 23 Attacks and defenses: 
155 |             1. Direct prompt injection (Naive, Escape, Ignoring, Fake, Combined)
156 |             2. Observation prompt injection (inject adversarial prompts in the external environment) 
157 |             3. Memory poisoning attack (poison long-term memory like RAG database)
158 |             4. PoT backdoor attack (leave a backdoor in the system prompt)
159 |             5. Mixed attack
160 |             6. Defense: Delimiters, Sandwich Prevention, Instructional Prevention, Paraphrasing, Shuffle, LLM-based detection, PPL detection
161 | 12. AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents (2024) [[NeurIPS'24/12](https://arxiv.org/pdf/2406.13352)]
162 |     1. 4 scenarios (agents): Workspace, Slack, Banking, Travel; 97 tasks, 629 security test cases
163 |     2. Basic defense
164 |         1. Format all tool outputs with special delimiters
165 |         2. Delimiters
166 |         3. Prompt injection detection: uses a BERT classifier as a guardrail
167 |         4. Sandwich Prevention: repeats the user instructions after each function call
168 |         5. Tool filter: restricts LLM itself to a set of tools required to solve a given task, before observing any untrusted data
169 | 13. RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments [[arXiv'25/06](https://arxiv.org/abs/2506.15253)]
170 |     1. Simple agent framework (agent+tool); 80 test cases, 3,802 attack tasks mapped to 11 CWE categories
171 |     2. Attack generation: From 29 tools → 58 attack templates (indirect & direct injection)
172 |     3. Attack methods: Manipulate tool call inputs (kwargs) and outputs (return)
173 |     4. Multi-format toolkits: JSON, LangGraph, Model Context Protocol (MCP)
174 |     5. Limitation: Simple framework ≠ real-world scenarios
175 | 14. Some direct-injection benchmarks 
176 |     1. Formalizing and Benchmarking Prompt Injection Attacks and Defenses [[USENIX'24/08](https://arxiv.org/abs/2310.12815)]
177 |         1. Pattern: benign prompts + adversarial prompts
178 |     2. Assessing Prompt Injection Risks in 200+ Custom GPTs [[ICLR Workshop on Secure and Trustworthy Large Language Models'24/05](https://arxiv.org/pdf/2311.11538)]
179 |     3. Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game [[ICLR'2024/05](https://arxiv.org/pdf/2311.01011)]
180 |     4. GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks [[arxiv'24/09](https://gentellab.github.io/gentel-safe.github.io/)]
181 |     5. CYBERSECEVAL 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models [[arxiv'24/04](https://arxiv.org/pdf/2404.13161)]
182 |         1. A section about prompt injection; Two goals: violate application logic (go off-topic)/violate security constraints
183 |     6. A Critical Evaluation of Defenses against Prompt Injection Attacks [[arxiv'25/05](https://arxiv.org/abs/2505.18333)]
184 | 
185 | 
186 | ### Agent system card
187 | 
188 | 
189 | 
190 | 1. Operator system card [[openai blog'25/01](https://openai.com/index/operator-system-card/)]
191 | 2. Lessons from Defending Gemini against Indirect Prompt Injections (Google Deepmind) [[arxiv'25/05](https://arxiv.org/abs/2505.14534)]
192 | 
193 | 
194 | ## Red-teaming
195 | 
196 | ### General attacks: Prompt injection/Memory/Backdoor
197 | 
198 | Note that injection is an attack method, not an attack goal; one can launch an injection attack with different goals
199 | 
200 | 
201 | 1. **Model backdoor**
202 |     1. Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents [[NeurIPS'24/12](https://arxiv.org/abs/2402.11208)]
203 |         1. Insert backdoor triggers into web agents through fine-tuning backbone models with white-box access, aiming to mislead agents into making incorrect purchase decisions
204 |     2. Navigation as attackers wish? towards the building, byzantine-robust embodied agents under federated learning (Data poisoning attack) [[NAACL'24/06](https://arxiv.org/abs/2211.14769)]
205 |     3. BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents [ACL'24/08](https://arxiv.org/abs/2406.03007)
206 |         1. insert backdoor triggers into web agents through fine-tuning backbone models with white-box access
207 |             1. Insert trigger and malicious output into benign data to craft an attack dataset. Then do a classic data-poisoning attack
208 |         2. Threat model: finetune a benign llm with the backdoor dataset. And (1) victims use our released model (2) victims finetune our released model and then use it
209 | 2. **Direct prompt injection**: Directly append the malicious prompts into the user prompts 	
210 |     1. UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning [[ICML'25/07](https://arxiv.org/abs/2503.01908)]
211 |     2. How Not to Detect Prompt Injections with an LLM [[arxiv'25/07](https://arxiv.org/abs/2507.05630)]
212 |     3. Automatic and Universal Prompt Injection Attacks against Large Language Models [[arxiv'24/05](https://arxiv.org/abs/2403.04957)]
213 |         1. Gradient-based method, similar to GCG, with slightly different optimization targets
214 |         2. Propose three prompt injection objectives according to whether the response is relevant to the user’s input: static, semi-dynamic, and dynamic
215 |             1. Static objective: the attacker aims for a consistent response, regardless of the user’s instructions or external data
216 |             2. Semi-dynamic objective: the attacker expects the victim model to produce consistent content before providing responses relevant to the user’s input
217 |             3. Dynamic objective: the attacker wants the victim model to give responses relevant to the user’s input, but maintain malicious content simultaneously.
218 |     4. Goal-guided Generative Prompt Injection Attack on Large Language Models [[arxiv'24/09](https://arxiv.org/abs/2404.07234)]
219 |         1. Attack objective design
220 |             1. Effective: attack inputs with original high benign accuracy to high ASR 
221 |             2. Imperceptible: the original input and the adversarial input are very similar in terms of some semantic metrics. They use cosine similarity
222 |             3. Input-dependent: a prompt injection manner to form the attack prompt
223 |     5. Prompt Injection attack against LLM-integrated Applications [[arxiv'24/03](https://arxiv.org/pdf/2306.05499)]
224 |         1. Design a prompt injection pattern with three elements: Framework Component, Separator Component, Disruptor Component
225 |     6. Ignore Previous Prompt: Attack Techniques For Language Models [NeurIPS Workshop of ML Safety'22/09](https://arxiv.org/pdf/2211.09527)
226 |         1. Manual prompt injection: Goal hijacking and prompt leaking
227 |     7. Ignore this title and HackAPrompt: Exposing systemic vulnerabilities of LLMs through a global prompt hacking competition [[EMNLP'23/12](https://arxiv.org/abs/2311.16119)]
228 |     8. Imprompter: Tricking LLM Agents into Improper Tool Use [[arxiv'24/10](https://arxiv.org/abs/2410.14923)]
229 |         1. Craft obfuscated adversarial prompt attacks that violate the confidentiality and integrity of user resources connected to an LLM agent
230 | 3. **Indirect prompt injection**
231 |     1. Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents [[NAACL'25/04](https://arxiv.org/abs/2503.00061)]
232 |     2. Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [[ACM CCS Workshop of AISec'23/11](https://arxiv.org/pdf/2302.12173)]
233 |         1. Study some basic pattern-based attack methods
234 |         2. Injection method: retrieval-based methods, active methods, user-driven injections, hidden injections
235 |         3. Threats: information gathering/ fraud/intrusion/malware/manipulated content/availability
236 |     3. AgentVigil: Generic Black-Box Red-Teaming for Indirect Prompt Injection against LLM Agents [[EMNLP'25/11](https://arxiv.org/abs/2505.05849)]
237 |         1. Leverage fuzzing to generate attack prompts for prompt injection
238 | 4. **Prompt injection in Multi-Agent System (MAS)**
239 |     1. Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems [[arxiv'24/10](https://arxiv.org/abs/2410.07283)]
240 |         1. Inject the malicious prompts into the external content and rely on the data sharing mechanism across different agents to affect multiple agents 
241 |     2. Red-teaming llm multi-agent systems via communication attacks [[ACL'25/07](https://arxiv.org/abs/2502.14847)]
242 |     3. Evil Geniuses: Delving into the Safety of LLM-based Agents [[arxiv'24/02](https://arxiv.org/abs/2311.11855)]
243 | 5. **Memory poisoning**
244 |     1. AGENTPOISON: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases [[NeurIPS'24/12](https://arxiv.org/abs/2407.12784)]
245 |         1. Attack the knowledge database by adding malicious data
246 |         2. Loss
247 |             1. Uniqueness: poison data should be away from benign data
248 |             2. Compactness: poison data should be similar
249 |             3. Coherence: The trigger’s perplexity should be low
250 |             4. Target generation: triggers cause target action
251 |     2. A practical memory injection attack against llm agents [[arxiv'25/05](https://arxiv.org/abs/2503.03704)]
252 |     3. PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models [[USENIX Security'25/07](https://arxiv.org/abs/2402.07867)]
253 | 6. **Tool poisoning**
254 |     1. MCP Security Notification: Tool Poisoning Attacks
255 |         1. Jumping the line: How MCP servers can attack you before you ever use them [[blog'25/04](https://blog.trailofbits.com/2025/04/21/jumping-the-line-how-mcp-servers-can-attack-you-before-you-ever-use-them/)]
256 |         2. How MCP servers can steal your conversation history [blog'25/04](https://blog.trailofbits.com/2025/04/23/how-mcp-servers-can-steal-your-conversation-history/)
257 |     2. Prompt Injection Attack to Tool Selection in LLM Agents [[arxiv'25/08](https://arxiv.org/pdf/2504.19793)]
258 | 7. **Exfiltration attack: Inject URL to exploit renderers that fetches data from attacker’s server, leaking agent data**
259 |     1. When Public Prompts Turn Into Local Shells: ‘CurXecute’ – RCE in Cursor via MCP Auto‑Start (EchoLeak cve-2025-32711) [[blog'25/08](https://www.aim.security/lp/aim-labs-echoleak-blogpost)]
260 |     2. Simon Willison’s Weblog tagged exfiltration-attacks [[https://simonwillison.net/tags/exfiltration-attacks/](https://simonwillison.net/tags/exfiltration-attacks/)]
261 | 8. Some related works that attack the instruction following of LLMs (related to agents but mainly about model)
262 |     1. An LLM Can Fool Itself: A Prompt-Based Adversarial Attack [[ICLR'24/05](https://arxiv.org/pdf/2310.13345)]
263 |         1. Audit the LLM’s adversarial robustness via a prompt-based adversarial attack
264 |         2. Let LLMs generate adversarial prompts, and define the generation prompts with three components:
265 |             1. original input (OI), including the original sample and its ground-truth label
266 |             2. attack objective (AO) illustrating a task description of generating a new sample that can fool itself without changing the semantic meaning
267 |             3. attack guidance (AG) containing the perturbation instructions, e.g., add some characters 
268 |     2. The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models (Testing-phase backdoor) [[EMNLP'24/11](https://arxiv.org/pdf/2406.19999)]
269 |         1. text modification, question answering, mathematics, and *security rule following*
270 |     3. Can LLMs Follow Simple Rules? (Instruction following) [[arxiv'24/03](https://arxiv.org/pdf/2311.04235)]
271 |         1. Propose Rule-following Language Evaluation Scenarios (RULES), a programmatic framework for measuring rule-following ability in LLM
272 |         2. Defense: test-time steering and finetuning
273 |     4. A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents [[arxiv'24/12](https://arxiv.org/pdf/2402.10196)]
274 |     5. Misusing Tools in Large Language Models With Visual Adversarial Examples [[ICLR'24/05](https://arxiv.org/abs/2310.03185)]
275 |         1. Visual input-based prompt injection (applicable to both direct and indirect prompt injections)
276 | 
277 | 
278 | ### Attack against specific agents
279 | 
280 | 
281 | 
282 | 1. ChatGPT Operator:
283 |     1. ChatGPT Operator: Prompt Injection Exploits & Defenses [[blog'25/02](https://embracethered.com/blog/posts/2025/chatgpt-operator-prompt-injection-exploits/)]
284 |     2. How ChatGPT Remembers You: A Deep Dive into Its Memory and Chat History Features [[blog'25/05](https://embracethered.com/blog/posts/2025/chatgpt-how-does-chat-history-memory-preferences-work)]
285 | 2. **Web agents**: Most attacks manipulate the web with malicious contents/queries, when LLM agents interact with the web, the malicious contents will be fed to the agents. Some attacks can be applied to multi-modal (MM) agents but the malicious contents are only in text
286 |     1. Dissecting Adversarial Attacks on Multimodal LM Agents [[ICLR'25/04](https://arxiv.org/abs/2406.12814)]
287 |         1. Classify attacks into:
288 |             1. Illusioning: maintain the original user task while subtly manipulating information retrieved from tools, e.g., for a shopping agent, the user asks the agent to buy the cheapest jacket, then a malicious seller can inject a prompt that indicates its price is the lowest to mislead the lim.
289 |             2. Goal misdirection: ask the agent to ignore the user task and follow the injected prompt
290 |         2. Attack method: manipulation of uploaded item images/texts
291 |     2. EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage [[ICLR'25/04](https://arxiv.org/abs/2409.11295)]
292 |         1. Threat models
293 |             1. Leak users' PII or users' task (let LLM fill PII into an invisible box)
294 |             2. The web developer is malicious or malicious users contaminate development tools
295 |         2. Inject persuade prompt into the HTML content of webpages
296 |         3. If the attacker controls the web, it can get PII directly; no need to do it through injection
297 |     3. AdvAgent: Controllable Blackbox Red-teaming on Web Agents [[ICML'25/07](https://arxiv.org/pdf/2410.17401)]
298 |         1. Similar threat model as EIA but automatically generates attack prompts 
299 |     4. Wipi: A new web threat for LLM-driven web agents [[arxiv'24/12](https://arxiv.org/abs/2402.16965)]
300 |         1. Similar paper with EIA. but targets on llm with rag instead of web agent
301 |         2. Aim to disrupt llm by using misdirect (e.g. Don’t summarize the webpage content)
302 |     5. Prompt-to-SQL Injections in LLM-Integrated Web Applications: Risks and Defenses [[ICSE'25/04](https://www.computer.org/csdl/proceedings-article/icse/2025/056900a076/215aWuWbxeg)]
303 |     6. CVE-Bench: A Benchmark for AI Agents’ Ability to Exploit Real-World Web Application Vulnerabilities [[ICML'25/07](http://arxiv.org/pdf/2503.17332)]
304 | 3. **Coding agent**
305 |     1. A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems (2024) [[arxiv'24/02](https://arxiv.org/abs/2402.18649)]
306 |         1. Threat models: Prompt injection  
307 |         2. LLM system (agent): objects; actions (information processing execution for individual objects); interactions; constraints
308 |         3. Analysis of constraints of actions (more like single-actor conversations)
309 |             1. Case study: LLM Outputs External Image Links with Markdown Format
310 |             2. Existence of constraints: yes & robustness of constraints: not that robust (bypass via jailbreaking and indirect prompt injection)
311 |         4. Analysis of constraints of interactions
312 |             1. Cases: Sandbox (unauthorized access)/Web tools (indirect plugin calling)/Frontend (render malicious URLs)
313 |         5. End2end attacks (exploit chain): the attack goal is to steal users’ private conversation records  
314 |     2. RedCode: Risky Code Execution and Generation Benchmark for Code Agents [[NeurIPS'24/12](https://arxiv.org/abs/2411.07781)]
315 |         1. Benchmark coding model/agent risks: generating malware and executing malicious stuff
316 |     3. RedCodeAgent: Automatic Red-teaming Agent against Code Agents [[openreview paper'25/02](https://openreview.net/pdf?id=Mvn5g49RrM)]
317 |         1. Threat model: Mislead the LLM agent into using a specified tool (third attack type)
318 |         2. An upgrade of RedCode-Exec with the refinement capabilities 
319 |         3. Use RedCode-Exec dataset for risk scenario and requirement
320 |     4. Abuse code interpreters to gain access to underlying host operating systems or use them as a platform to wage cyber attacks 
321 |         1. CYBERSECEVAL 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models [[arxiv'24/04](https://arxiv.org/pdf/2404.13161)]
322 |             1. Test whether a target code generation models/LLM refuse to execute malicious requests 
323 |                 1. Construct 500 interpreter abuse prompts 
324 |                 2. Use another LLM to judge whether the target model is compliant with the malicious request or refuses the malicious request
325 |         2. Defense: Use high-quality data to train (identified by previous generation of models); safety evaluation and tuning; lower benign refusals (finetune with high-quality data)
326 |     5. Some CVEs (arbitrary code execution and SQL injection)
327 |         1. CVE Record of prompt injection [https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=Prompt+injection](https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=Prompt+injection) 
328 |         2. When Prompts Go Rogue: Analyzing a Prompt Injection Code Execution in Vanna.AI [[blog'24/07](https://jfrog.com/blog/prompt-injection-attack-code-execution-in-vanna-ai-cve-2024-5565/)]
329 |         3. RCE: Illegal Command Filter Bypass in langchain_experimental [https://github.com/langchain-ai/langchain/issues/21592](https://github.com/langchain-ai/langchain/issues/21592)
330 |         4. LLM pentest: Leveraging agent integration for RCE [[blog'24/05](https://www.blazeinfosec.com/post/llm-pentest-agent-hacking/)]
331 |         5. Impact of remote-code execution vulnerability in LangChain [[blog'23/07](https://www.ntietz.com/blog/langchain-rce/)]
332 |     6. Demystifying RCE Vulnerabilities in LLM-Integrated Apps [[ACM CCS'24/10](https://arxiv.org/pdf/2309.02926)]
333 |     7. ARTEMIS: Analyzing LLM Application Vulnerabilities in Practice [[Proceedings of the ACM on Programming Languages'25/04](https://dl.acm.org/doi/10.1145/3720488)]
334 | 4. **Personal assistant agents**
335 |     1. Attacking Vision-Language Computer Agents via Pop-ups [[ACL'25/07](https://arxiv.org/abs/2411.02391)]
336 |         1. Indirect prompt injection, attack through pop-ups
337 |         2. Concern: pop-up is (maybe) easy to defend and easy to remove on the web
338 |     2. Data Exposure from LLM Apps: An In-depth Investigation of OpenAI's GPTs [[IMC'25/10](https://arxiv.org/abs/2408.13247)]
339 |     3. LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI's ChatGPT Plugins [AIES'24/10](https://arxiv.org/abs/2309.10254)
340 |         1. Categorize and evaluate the risks of using untrusted third-party plugins 
341 |     4. Personal LLM Agents: Insights and Survey about the Capability, Efficiency, and Security [[arxiv'24/05](https://arxiv.org/abs/2401.05459)]
342 | 
343 | 
344 | ## Blue-teaming
345 | 
346 | ### Model-based defenses 
347 | 
348 | 
349 | 
350 | 1. Guardrail
351 |     1. Flexible, but can be potentially bypassed with guardrail injection and incurs computational costs [[arxiv'25/07](https://arxiv.org/abs/2504.11168)]
352 |     2. ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning [[ICML'25/07](https://shieldagent-aiguard.github.io/)]
353 |         1. Guardrail policy generation based on documents and LLM
354 |     3. A Holistic Approach to Undesired Content Detection in the Real World (OpenAI guardrail) [[AAAI'23/02](https://arxiv.org/abs/2208.03274)]
355 |     4. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations [[arxiv'23/12](https://arxiv.org/abs/2312.06674v1)]
356 |         1. classification model on LLM inputs and outputs, have 6 safety-related categories
357 |     5. Llama Prompt Guard 2 [[blog](https://www.llama.com/docs/model-cards-and-prompt-formats/prompt-guard/)]
358 |         1. [meta-llama/Llama-Prompt-Guard-2-86M](https://huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M)
359 |         2. Llama Prompt Guard 2 models classify prompts as ‘malicious’ if the prompt explicitly attempts to override prior instructions embedded into or seen by an LLM. This classification considers only the intent to supersede developer or user instructions, regardless of whether the prompt is potentially harmful or the attack is likely to succeed.
360 |     6. NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails [[EMNLP'23/12](https://arxiv.org/abs/2310.10501)]
361 |         1. Define a DSL with a set of rules, calibrate input queries with rules based on embedding distance (with KNN)
362 |         2. Input rail (user input), Dialog rail (next step), Retrieval rail (external data), Execution rail (tool call), Output rail (final output)
363 |     7. Guardrails AI [[website](https://www.guardrailsai.com/)]
364 |         1. User-specified rail specs; guard (regular expression or classifier); if trigger error, correct the user prompts; output checking
365 |     8. Position: building guardrails for large language models requires systematic design [[ICML'24/07](https://dl.acm.org/doi/abs/10.5555/3692070.3692521)]
366 |         1. Suggestions: formal guarantee to avoid arms races; resolving conflicts via priority and ensemble; rule-based & learning-based solutions
367 |     9. A Causal Explainable Guardrails for Large Language Models (LLMGuardrail) [[ACM CCS'24/12](https://dl.acm.org/doi/10.1145/3658644.3690217)]
368 |     10. WebGuard: Building a Generalizable Guardrail for Web Agents [[arxiv'25/07](https://arxiv.org/abs/2507.14293)]
369 | 2. Inference-phase defenses: 
370 |     1. Finetune a classifier to identify prompt injection (similar as guardrail)
371 |         1. Fine-Tuned DeBERTa-v3-base for Prompt Injection Detection: 
372 |             1. [protectai/deberta-v3-base-prompt-injection-v2](http://huggingface.co/ProtectAI/deberta-v3-base-prompt-injection-v2)
373 |         2. GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks [[arxiv'24/09](https://arxiv.org/abs/2409.19521)]
374 |             1. Finetune a classifier (which is independent from the LLM) to detect the model
375 |         3. Embedding-based classifiers can detect prompt injection attacks [[CAMLIS'24/10](https://arxiv.org/pdf/2410.22284)]
376 |             1. Use a pretrained embedding model to embed benign prompts and prompt injection attacked prompts, then use traditional machine learning method for classification (regression, xgboost, etc)
377 |         4. DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks [[IEEE S&P'25/05](https://arxiv.org/abs/2504.11358)]
378 |             1. Identify prompt injection in untrusted inputs by leveraging a fine-tuned model vulnerable to prompt injection, with known-answer detection
379 |     2. Prompt-level defense: 
380 |         1. Defense Against Prompt Injection Attack by Leveraging Attack Techniques [[ACL'25/07](https://arxiv.org/abs/2411.00459)]
381 |             1. Use prompt injection attacks to append the original benign prompt at the end of input prompts
382 |         2. Formalizing and Benchmarking Prompt Injection Attacks and Defenses [[USENIX Security'24/08](https://www.usenix.org/conference/usenixsecurity24/presentation/liu-yupei)]
383 |             1. known-answer detection (append an additional instruction into user prompt, e.g., “say a secret word xxx”), then detect if outputs contain this secret word
384 |         3. PromptShield: Deployable Detection for Prompt Injection Attacks [[ACM CODASPY'25/06](https://arxiv.org/abs/2501.15145)]
385 |         4. Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction [[arxiv'25/04](https://arxiv.org/abs/2504.20472)]
386 |         5. FATH: Authentication-based Test-time Defense against Indirect Prompt Injection Attack [[arxiv'24/11](https://arxiv.org/abs/2410.21492)]
387 |             1. Runtime-generated authentication code
388 |         6. Defending Against Indirect Prompt Injection Attacks With Spotlighting (Microsoft Spotlighting) [[arxiv'24/03](https://arxiv.org/abs/2403.14720)]
389 |             1. Add delimeters
390 |     3. Defense based on internal representations 
391 |         1. Get my drift? Catching LLM Task Drift with Activation Delta [[IEEE SaTML'25/04](https://arxiv.org/abs/2406.00799)]
392 |         2. Attention Tracker: Detecting Prompt Injection Attacks in LLMs [[NAACL'25/04](https://aclanthology.org/2025.findings-naacl.123.pdf)]
393 |             1. intuition: if a malicious prompt is injected, the attention score over the original prompt will largely decrease
394 |             2. Find the “important attention head” that behaves differently under benign prompts and prompt injection attack prompts, summing the attention score over the original prompt tokens, if less than a certain threshold, deny to respond
395 | 3. Training-based defenses:
396 |     1. StruQ: Defending Against Prompt Injection with Structured Queries [[USENIX Security'25/08](https://www.usenix.org/system/files/conference/usenixsecurity25/sec24winter-prepub-468-chen-sizhe.pdf)]
397 |         1. Build structured prompts that have special tokens to separate user prompts and user data
398 |         2. Finetune the model to ignore contents after the specific tokens
399 |     2. SecAlign: Aligning LLMs to Be Robust Against Prompt Injection [[arxiv'25/07](https://arxiv.org/abs/2410.05451v1)]
400 |         1. Preference learning-based finetuning: The LLM is only trained to favor the desirable response, but does not know what an undesirable response looks like. Thus, a secure LLM should also observe the response to the injected instruction and be steered away from that response.
401 |     3. Rule-Based Rewards for Language Model Safety [[NeurIPS'24/12](https://arxiv.org/abs/2411.01111)]
402 |         1. Better alignment strategy
403 |     4. Jatmo: Prompt Injection Defense by Task-Specific Finetuning [[ESORICS'24/09](https://dl.acm.org/doi/abs/10.1007/978-3-031-70879-4_6)]
404 |     5. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions [[arxiv'24/04](https://arxiv.org/abs/2404.13208)]
405 |         1. Construct data for both “aligned” instructions and “misaligned” instructions
406 |         2. fine-tuning protocol that induces an LLM to privilege higher privilege instructions over lower-privilege instructions (system > developer > user).
407 |     6. Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy [[ICLR'25/04](https://arxiv.org/abs/2410.09102)]
408 |         1. Adding a segment embedding layer for better learning instruction hierarchy
409 | 
410 | 
411 | 
412 | ### System-level Runtime Defense
413 | 
414 | 
415 | 
416 | 1. **Input validation and sanitization: Guardrails on LLM input**
417 |     1. Benefits: Guardrails’ non-invasiveness allows minor modification and utility impact in the agent
418 |     2. Limitations: Challenging to provide resilience against adaptive attacks 
419 |     3. Model-based guardrails
420 |         1. LLlamaFirewall: An open source guardrail system for building secure AI agents [[arxiv'25/05](https://arxiv.org/abs/2505.03574)]
421 |             1. PromptGuard: detect direct/indirect prompt injection
422 |         2. Microsoft Prompt Shields [[website](https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/jailbreak-detection)]
423 |             1. detect safety and security of LLM output (e.g., hate, violence, self-harm, direct/indirect prompt injection)
424 |     4. Rule-based guardrails
425 |         1. Nvidia NeMo: DSL for input guardrail [[github repo](https://github.com/NVIDIA-NeMo/NeMo)]
426 |         2. Google safe browsing [[website](https://safebrowsing.google.com/)]
427 |         3. Content Security Policy (CSP) allow-list [[website](https://content-security-policy.com/)]
428 |         4. Building a secure agentic ai application leveraging a2a protocol (Google) [[arxiv'25/04](https://arxiv.org/abs/2504.16902)]
429 | 2. **Policy enforcement: Guardrails on LLM output**
430 |     1. Benefits: Guardrails’ non-invasiveness allows minor modification and utility impact in the agent
431 |     2. Limitations: Challenging to provide resilience against adaptive attacks 
432 |     3. Model-based guardrails
433 |         1. LLlamaFirewall: An open source guardrail system for building secure AI agents [[arxiv'25/05](https://arxiv.org/abs/2505.03574)]
434 |             1. AlignmentCheck: LLM output alignment check.
435 |         2. Microsoft Prompt Shields [[website](https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/jailbreak-detection)]
436 |             1. detect safety and security of LLM output (e.g., hate, violence, self-harm, direct/indirect prompt injection)
437 |     4. Rule-based guardrails
438 |         1. Nvidia NeMo: DSL for LLM output guardrail [[github repo](https://github.com/NVIDIA-NeMo/NeMo)]
439 |         2. Mitigating prompt injection attacks with a layered defense strategy (Google) [[blog'25/07](https://security.googleblog.com/2025/06/mitigating-prompt-injection-attacks.html)]
440 |             1. UI renderer specific : Only render google internal images
441 |         3. LLlamaFirewall: An open source guardrail system for building secure AI agents [[arxiv'25/05](https://arxiv.org/abs/2505.03574)]
442 |             1. CodeShield: regex based rules to detect malicious LLM-generated code
443 |         4. AI Agents with Formal Security Guarantees [[ICML Workshop of Next Generation of AI Safety'24/07](https://openreview.net/pdf?id=c6jNHPksiZ)]
444 |             1. Provide policy language for agents that can be manually defined by agent developers.
445 |             2. Information flow policy
446 |     5. Hybrid guardrails: Generate rules with model or agent
447 |         1. GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning [[ICML'25/07](https://arxiv.org/abs/2406.09187)]
448 |             1. Agent for Guardrail generation. Generate guardrails using LLM and code execution/debugging tools.
449 |             2. Similar to Progent.
450 |         2. AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection [[ACL'25/07](https://aclanthology.org/2025.acl-long.399.pdf)]
451 |             1. Use model to generate safety checks
452 |             2. Use model and tools to perform safety checks before action
453 |         3. Progent: Programmable Privilege Control for LLM Agents [[arxiv'25/04](https://arxiv.org/pdf/2504.11703)]
454 |             1. Design a runtime fine-grained policy generation framework and automate the policy generation with LLM
455 |             2. Runtime agent sandboxing: Constrain agent tool calls based on the previous context.
456 |         4. Contextual Agent Security: A Policy for Every Purpose [[ACM HotOS'25/05](https://dl.acm.org/doi/10.1145/3713082.3730378)]
457 |             1. Same as Progent. Runtime policy generation based on trusted context data (i.e., Contextual policy) and fine-grained agent sandbox enforcement.
458 |     10. AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents [[ICSE'26/04](https://arxiv.org/abs/2503.18666)]
459 | 3. **Identity and privilege management**
460 |     1. Agent identity
461 |         1. Motivation: Visibility to regulators, users (whether they’re interacting with agents or human), runtime monitoring, post-hoc analysis
462 |         2. Visibility into AI agents (measures to improve agent visibility) [[ACM FAccT'25/06](https://arxiv.org/pdf/2401.13138)]
463 |             1. Agent identifiers: Agent card, Underlying system, Involved actors -- clarify who is accountable in case an agent causes harm.
464 |             2. Real-time monitoring
465 |             3. Activity logs
466 |         3. IDs for AI systems [[NeurIPS Workshop of Regular ML'24/12](https://arxiv.org/abs/2406.12137)]
467 |         4. Infrastructure for AI Agents [[TMLR'25/05](https://arxiv.org/abs/2501.10114)]
468 |             1. Attribution: Identity binding, Certification, Agent IDs
469 |             2. Interaction: Agent network channels, Oversight layers (e.g., user intervention), inter-agent communication, commitment devices
470 |             3. Response: Incident reporting, rollbacks
471 |         5. Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems [[arxiv'24/10](https://arxiv.org/abs/2410.07283v1)]
472 |             1. LLM tagging - `[Agent name]: ` in front of agent responses.
473 |     2. Centralized identity
474 |         1. okta - general identity management [[website](https://www.okta.com/identity-101/identity-and-access-management/)]
475 |         2. composio - agent identity management [[website](https://composio.dev/agentauth)]
476 |         3. OpenID-Connect protocol (OIDC) [[website](https://www.microsoft.com/en-us/security/business/security-101/what-is-openid-connect-oidc)]
477 |     3. Decentralized identity
478 |         1. Agent Network Protocol - Identity and Encrypted Communication Layer [[github repo](https://github.com/agent-network-protocol/AgentNetworkProtocol/tree/main)]
479 |             1. W3C DID (Decentralized Identifiers) [[website](https://www.w3.org/TR/did-1.1/)]
480 |         2. Microsoft Verified ID [[website](https://learn.microsoft.com/en-us/entra/verified-id/decentralized-identifier-overview)]
481 |     4. Agent privilege management
482 |         1. Authenticated Delegation and Authorized AI Agents [[ICML'25/07](https://arxiv.org/abs/2501.09674)]
483 |             1. human user creating a digital authorization that a specific AI agent can use to access a digital service (or interact with another AI agent) on behalf of the user, which can be verified by the corresponding service or agent for its authenticity.
484 |             2. A delegation token authorizes an AI agent to act on the user’s behalf
485 |             3. Task scoping and resource scoping
486 |         2. Composio [[github repo](https://github.com/ComposioHQ/composio)]
487 |             1. User-friendly agent framework that provides OAuth token-based IAM
488 |         3. robots.txt [[website](https://www.robotstxt.org/)]
489 |     5. RAG/VectorDB Access control
490 |         1. Access control for vector stores using metadata filtering with Amazon Bedrock Knowledge Bases [[blog'24/07](https://aws.amazon.com/blogs/machine-learning/access-control-for-vector-stores-using-metadata-filtering-with-knowledge-bases-for-amazon-bedrock/)]
491 |         2. HoneyBee: Efficient Role-based Access Control for Vector Databases via Dynamic Partitioning [[arxiv'25/05](https://arxiv.org/abs/2505.01538)]
492 |             1. Find a balance between space-time tradeoff for vectorDB access control using dynamic partitioning
493 |         3. ControlNet: A Firewall for RAG-based LLM System [[arxiv'25/04](https://arxiv.org/abs/2504.09593)]
494 |             1. RAG access control with LLM activation-based detection and mitigation
495 | 4. **Privilege separation and access control**
496 |     1. LLM Agents Should Employ Security Principles [[arxiv'25/05](https://arxiv.org/abs/2505.24019)]
497 |         1. Defense-in-depth strategy: Least privilege by dividing agents into persistent/ephemeral agents. Complete mediation by data minimizer and response filter.
498 |         2. Automated policy configuration by a reward modeling policy engine: Adaptive policy learning based on task success rate.
499 |     2. Planner-Executor separation
500 |         1. Defeating Prompt Injections by Design (CaMeL) [[arxiv'25/06](https://arxiv.org/abs/2503.18813)]
501 |             1. Separate privileges of an agent: planner agent generates a program (i.e., determines control flow and data flow), and a quarantine agent parses untrusted data. Prevents untrusted data from corrupting the control and data flow.
502 |             2. Data flow policies
503 |         2. Securing AI Agents with Information-Flow Control (FIDS) [[arxiv'25/05](https://arxiv.org/abs/2505.23643)]
504 |         3. System-Level Defense against Indirect Prompt Injection Attacks: An Information Flow Control Perspective (f-secure) [[arxiv'24/10](https://arxiv.org/abs/2409.19091)]
505 |             1. Control the information flows and access control in an agent system to prevent malicious information from being propagated and executed by the agent system
506 |             2. Disaggregates the components of an LLM system into a context-aware pipeline with dynamically generated structured executable plans, and a security monitor filters out untrusted input into the planning process 
507 |             3. Provide formal models with an analysis of the security guarantee 
508 |             4. Privilege separation: planner and unprivileged parser.
509 |         4. Design Patterns for Securing LLM Agents against Prompt Injections [[arxiv'25/06](https://arxiv.org/abs/2506.08837)]
510 |     3. Agent privilege separation
511 |         1. Prompt Flow Integrity to Prevent Privilege Escalation in LLM Agents [[arxiv'25/04](https://arxiv.org/abs/2503.15547)]
512 |             1. Separate privileges of an agent: a privileged agent and an unprivileged agent have different tool permissions, managed by their access tokens.
513 |             2. Prevent confused deputy: i) Replace untrusted data from the lower-privilege agent into a data ID, so that they cannot corrupt control/data flow of the privileged agent. ii) Track untrusted data to prevent it from being used in unsafe data flow.
514 |     4. Security module - Executor separation
515 |         1. AirGapAgent: Protecting Privacy-Conscious Conversational Agents [[ACM CCS'24/10](https://dl.acm.org/doi/10.1145/3658644.3690350)]
516 |             1. Design a runtime data minimization to prevent prompt injection attacks that leak confidential data
517 |             2. Separate privileges of an agent: a data minimization agent that selects privacy data based on trusted data, and the baseline agent that handles untrusted data with the minimized privacy data.
518 |     5. Mobile system privilege separation
519 |         1. SecGPT: An Execution Isolation Architecture for LLM-Based Systems (IsolateGPT) [[NDSS'25/02](https://arxiv.org/abs/2403.04960)]
520 |             1. Design interfaces and permission control to Isolate the execution of GPT-integrated third-party apps
521 |             2. Target attacks: app compromise, data stealing, inadvertent data exposure, and uncontrolled system alteration
522 | 5. **Monitoring and auditing**
523 |     1. GUARDIAN: Safeguarding LLM Multi-Agent Collaborations with Temporal Graph Modeling [[arxiv'25/05](https://arxiv.org/abs/2505.19234)]
524 |         1. Detect error propagation in multi-agent system with graph modeling
525 |     2. SentinelAgent: Graph-based Anomaly Detection in LLM-based Multi-Agent Systems [[arxiv'25/05](https://arxiv.org/abs/2505.24201)]
526 |         1. Represents the multi-agent system (MAS) as an interaction graph with nodes (agents/tools) and edges (communications).
527 |         2. Global anomaly detection: task-level output divergence, prompt-level attacks
528 |         3. Single-point failure localization: specific faulty agents/tools, tool misuse
529 |         4. Multi-point failure attribution: distributed or emergent issues, 
530 |     3. Testing Language Model Agents Safely in the Wild [[NeurIPS Workshop of Socially Responsible Language Modelling Research'23/12](https://arxiv.org/abs/2311.10538)]
531 |         1. Monitoring harmful or offtask agent output in web and file access 
532 |         2. web search, browse website, write to file, read file, list files, execute python file, and execute python code.
533 |     4. Disclosure Audits for LLM Agents [[arxiv'25/06](https://www.arxiv.org/pdf/2506.10171)]
534 |         1. Auditing for accumulatively steering conversation to induce privacy leakage.
535 |         2. Detects explicit leakage with LLM judge and implicit leakage
536 |     5. AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents [[arxiv'25/05](https://arxiv.org/abs/2506.00641)]
537 |         1. Extracts structured features (scenario, risk, behavior) from agent interactions and constructs RAG. Reference relevant examples to assess new agent interactions.
538 |     6. Monitoring LLM Agents for Sequentially Contextual Harm [[ICLR Workshop of Building Trust'25/04](https://openreview.net/pdf?id=LC0XQ6ufbr)]
539 |         1. Task decomposition: Seemingly benign subtasks for high-level malicious task
540 |         2. adaptive attacks with task decomposition can bypass existing guardrails or LLM-based monitors.
541 |     7. Visibility into AI Agents [[ACM FAccT'25/06](https://arxiv.org/pdf/2401.13138)]
542 |         1. Real-time monitoring without logs and activity logs for post-hoc analysis, forensic investigation
543 | 6. **Information flow control, taint tracking**
544 |     1. Challenges
545 |         1. How to track data flow in an LLM agent?
546 |         2. What data flow policies to enforce?
547 |     2. Multi-execution-based data flow tracking
548 |         1. Permissive Information-Flow Analysis for Large Language Models [[arxiv'25/05](https://arxiv.org/abs/2410.03055)]
549 |             1. Permissive IFC in LLM: Secure multi-execution [[IEEE S&P'10/05](https://ieeexplore.ieee.org/document/5504711)] for information flow analysis in LLM
550 |         2. MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents [[ICML'25/07](https://arxiv.org/abs/2502.05174)]
551 |             1. Detect indirect prompt injection by measuring the original user prompt and the task-neutral prompt. Detects attacker-injected tool calls.
552 |     3. Symbolization-based data flow tracking
553 |         1. Prompt Flow Integrity to Prevent Privilege Escalation in LLM Agent [[arxiv'25/04](https://arxiv.org/abs/2503.15547)]
554 |             1. Taint tracking to prevent privilege escalation and confused deputy attack
555 |         2. Defeating Prompt Injections by Design (CaMeL) [[arxiv'25/06](https://arxiv.org/abs/2503.18813)]
556 |             1. Taint tracking to ensure external system policy compliance
557 |         3. Securing AI Agents with Information-Flow Control (FIDS) [[arxiv'25/05](https://arxiv.org/abs/2505.23643)]
558 |             1. IFC to enforce confidentiality and integrity at the same time
559 |         4. RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage [[arxiv'25/02](https://arxiv.org/abs/2502.08966)]
560 |             1. IFC combined with a model-based dependency screener to overcome label creep
561 |     4. LLM-based control/data dependency analysis
562 |         1. AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection [[arxiv"25/08](https://www.arxiv.org/abs/2508.01249)]
563 |     5. Policies
564 |         1. Prompt Flow Integrity to Prevent Privilege Escalation in LLM Agents [[arxiv'25/04](https://arxiv.org/abs/2503.15547)]
565 |             1. Privilege escalation: PFI 
566 |         2. Securing AI Agents with Information-Flow Control (FIDS) [[arxiv'25/05](https://arxiv.org/abs/2505.23643)]
567 |             1. Confidentiality and Integrity
568 |         3. Defeating Prompt Injections by Design (CaMeL) [[arxiv'25/06](https://arxiv.org/abs/2503.18813)]
569 |             1. Policy compliance
570 | 7. **Formal verification**
571 |     1. Formal modelling agent (Not for security purpose)
572 |         1. Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming [[ICLR'25/04](https://arxiv.org/pdf/2410.12112)]
573 |             1.  formally formulate and solve them as optimization problems to improve planning performance of LLM
574 |         2. PDL: A Declarative Prompt Programming Language [[arxiv'24/10](https://arxiv.org/abs/2410.19135)]
575 |             1. make prompt programming simpler, less brittle, and more enjoyable
576 |         3. Formally Specifying the High-Level Behavior of LLM-Based Agents [[arxiv'24/01](https://arxiv.org/abs/2310.08535)]
577 |             1. declarative agent framework, such that the user specifies the desired high-level behavior in terms of constraints without concern for how they should be implemented or enforced
578 |             1. Improves controllability and performance
579 |         4. Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents [[arxiv'24/08](https://arxiv.org/abs/2402.00798)]
580 |             1. Framework that allows agent developers to express their requirements or constraints for the planning process as an automaton
581 |             2. Improves controllability and performance
582 |     2. Mobile GUI agent verification
583 |         1. Safeguarding Mobile GUI Agent via Logic-based Action Verification [[arxiv'25/03](https://arxiv.org/abs/2503.18492)]
584 |             1. Define DSL to represent the desired behavior (user intent) and the actual behavior (app execution) in a unified, logically verifiable manner.
585 | 8. **Credential and secret management**
586 |     1. Big Help or Big Brother? Auditing Tracking, Profiling, and Personalization in Generative AI Assistants [[USENIX Security'25/08](https://www.usenix.org/conference/usenixsecurity25/presentation/vekaria?utm_source=chatgpt.com)]
587 |     2. Incidents
588 |         1. ChatGPT banned in Italy over privacy concerns [[blog'23/04](https://www.bbc.co.uk/news/technology-65139406)]
589 |         2. March 20 ChatGPT outage: Here’s what happened (ChatGPT chat history leakage in March 2023) [[blog'23/03](https://openai.com/index/march-20-chatgpt-outage/)]
590 |             1. Due to a vulnerability in Redis server - Request use-after-free
591 |         3. Meta: Help Users Stop Accidentally Sharing Private AI Chats (Meta AI fails to show privacy notice to users) [blog'25/07](https://www.mozillafoundation.org/en/campaigns/meta-help-users-stop-accidentally-sharing-private-ai-conversations/)
592 |     3. Services
593 |         1. ChatGPT temporary chat [[website](https://help.openai.com/en/articles/8914046-temporary-chat-faq)]
594 |             1. Temporary Chats won’t appear in your history, and ChatGPT won’t remember anything you talk about. For safety purposes we may still keep a copy for up to 30 days.
595 | 
596 | 
597 | ### Others
598 | 
599 | **Tool protection**
600 | 
601 | 1. MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits [[arxiv'25/04](https://arxiv.org/abs/2504.03767)]
602 | 
603 | **Post-detection defenses**
604 | 
605 | 1. User alert
606 | 2. Recovery
607 |     2. GoEX: Perspectives and Designs Towards a Runtime for Autonomous LLM Applications [[arxiv'24/04](https://arxiv.org/abs/2404.06921)]
608 |         1. Design post-facto validation with an undo feature and damage confinement  
609 |         2. Access control (secret data is stored locally, ask for user’s permission), symbolic credentials, and sandboxing 
610 | 3. Logging and analysis
611 | 
612 | **Other references**
613 | 
614 | 1. OWASP LLM Prompt Injection Prevention Cheat Sheet [[website](https://cheatsheetseries.owasp.org/cheatsheets/LLM_Prompt_Injection_Prevention_Cheat_Sheet.html)]
615 | 2. An Introduction to Google’s Approach to AI Agent Security [[website](https://research.google/pubs/an-introduction-to-googles-approach-for-secure-ai-agents/)]
616 |     1. A hybrid defense-in-depth approach: combines traditional, deterministic security with dynamic, reasoning-based defenses
617 |     2. Runtime policy enforcement (traditional, deterministic) limits the worst-case impact of agent malfunction.
618 |     3. Reasoning-based solutions, including adversarial training, guard models, and security analysis.
619 | 3. Mitigating prompt injection attacks with a layered defense strategy (Google GenAI Security Team) [[blog'25/06](https://security.googleblog.com/2025/06/mitigating-prompt-injection-attacks.html)]
620 |     1. Defense-in-depth approach
621 |         1. Prompt injection content classifiers
622 |         2. Security thought reinforcement
623 |         3. Markdown sanitization and suspicious URL redaction
624 |         4. User confirmation framework
625 |         5. End-user security mitigation notifications
626 | 
627 | # Contributors
628 | 
629 | We thank the following awesome contributions: Zhun Wang, Kaijie Zhu, Yuzhou Nie, Tianneng Shi, Juhee Kim, Zeyi Liao, Ruizhe Jiang and Wenbo Guo (😄). Thank you!
630 | 


--------------------------------------------------------------------------------