AI Security & Red Teaming for LLMs — The 2026 Field Guide
Every company is shipping AI. Almost none of them are securing it. The attack surface of a large language model is fundamentally different from traditional software — and the security industry is still catching up. Prompt injection didn't even have a name three years ago. Now it's the #1 vulnerability on the OWASP Top 10 for LLM Applications.
This guide covers the attacks, the defenses, the tools, the career paths, and the regulatory landscape for AI security in 2026. Whether you're a developer shipping LLM features, a security engineer expanding into AI, or someone eyeing the emerging field of AI red teaming, this is your field manual.
The LLM Attack Surface
Prompt Injection — The SQL Injection of AI
Direct injection is when an attacker manipulates the prompt to override system instructions. Kevin Liu extracted Bing Chat's entire "Sydney" system prompt in February 2023 by simply asking it to ignore previous instructions. It worked.
Indirect injection is far more dangerous. The payload is embedded in external data — a webpage, a document, an email — that the model retrieves. Hidden text on a website caused Bing Chat to follow attacker instructions. Google Docs exfiltration attacks encoded sensitive data into markdown image URLs that the model rendered, leaking data to attacker-controlled servers.
Jailbreaking Techniques
| Technique | How It Works | Effectiveness (2026) |
|---|---|---|
| DAN ("Do Anything Now") | Role-play prompts that convince the model it has no restrictions | 🟡 Declining against GPT-4-class models |
| Many-Shot (Anthropic) | Fills context with hundreds of fabricated harmful Q&A pairs | 🟠 Still effective on long-context models |
| Crescendo (Microsoft) | Gradual escalation over 10–20+ turns, defeats single-turn classifiers | 🔴 Highly effective |
| Skeleton Key (Microsoft) | Tells model it's a "safe educational context," prefix harmful content with warning | 🔴 Worked across GPT-4, Claude, Gemini, Llama |
| GCG Adversarial Suffixes | Optimized gibberish strings that bypass safety training; transfer across models | 🟠 Partially mitigated but evolving |
Data Poisoning & Model Extraction
Even 0.01–0.1% poisoned training data can shift model behavior on targeted topics. Anthropic's "sleeper agents" research showed backdoors that activate only under specific conditions — and standard RLHF didn't remove them. On the supply chain side, poisoned models on Hugging Face have been found with serialized pickle payloads that execute arbitrary code on load.
Model extraction via systematic API querying is now possible for under $1,000 in API costs. If your model is your moat, it's a thinner moat than you think.
OWASP Top 10 for LLM Applications (2025)
| # | Vulnerability | Description |
|---|---|---|
| LLM01 | Prompt Injection | Direct/indirect prompt manipulation to override instructions |
| LLM02 | Sensitive Information Disclosure | Model reveals PII, credentials, or system prompts |
| LLM03 | Supply Chain Vulnerabilities | Compromised training data, poisoned models, insecure plugins |
| LLM04 | Data and Model Poisoning | Backdoors and biases via manipulated training data |
| LLM05 | Improper Output Handling | Unsanitized outputs enabling XSS, SSRF, code execution |
| LLM06 | Excessive Agency | Too many permissions/tools granted to LLM agents |
| LLM07 | System Prompt Leakage | Extraction of system-level instructions (new in 2025) |
| LLM08 | Vector & Embedding Weaknesses | RAG pipeline attacks, poisoned vector stores (new in 2025) |
| LLM09 | Misinformation | Hallucinations trusted by users as fact |
| LLM10 | Unbounded Consumption | DoS via resource exhaustion, context window abuse |
When AI Security Fails — Real Incidents
Samsung Code Leak (2023)
Engineers pasted proprietary source code, chip test data, and meeting notes into ChatGPT on 3+ occasions. Samsung subsequently banned all generative AI tools company-wide.
Air Canada Chatbot (2024)
Chatbot fabricated a bereavement discount policy. BC tribunal ruled Air Canada liable (~CAD $812). Established precedent: companies are responsible for their AI agents' statements.
Microsoft Tay (2016)
Twitter chatbot manipulated by coordinated trolls into producing racist tweets within 16 hours. Taken offline within 24 hours. The original AI safety cautionary tale.
Chevrolet Chatbot (2023)
Dealership chatbot tricked into agreeing to sell a Tahoe for $1. "That's a legally binding offer." It wasn't — but the PR damage was real.
AI Red Teaming as a Career
AI red teaming has gone from "interesting side project" to dedicated career path with six-figure salaries. Microsoft established their AI Red Team in 2018 — now every major AI company has one.
Who's Hiring
| Company | Team | Focus |
|---|---|---|
| Microsoft | AI Red Team (est. 2018) | Azure AI, Copilot, Bing Chat |
| Google DeepMind | Safety & Red Team | Gemini, Bard, PaLM |
| Anthropic | Trust & Safety | Claude safety, Constitutional AI |
| NVIDIA | AI Security (Garak team) | NeMo Guardrails, model security |
| OpenAI | Safety Systems | GPT safety, red teaming network |
| Meta | Purple Llama | Open-source AI safety tooling |
| Protect AI, HiddenLayer | Startups | ML security products |
Salary & Skills
| Level | Base Salary | FAANG TC |
|---|---|---|
| Mid-level | $140K–$180K | $200K–$300K |
| Senior | $180K–$220K | $300K–$400K |
| Staff / Principal | $220K–$300K+ | $400K–$450K+ |
Required skills: LLM architecture knowledge, prompt engineering/injection, traditional security (pentesting, AppSec), Python/PyTorch, adversarial ML, OWASP LLM Top 10, MITRE ATLAS, technical report writing.
The Toolbox
| Tool | What It Does | By |
|---|---|---|
| Garak | LLM vulnerability scanner — injection, leakage, toxicity, encoding bypasses | NVIDIA |
| PyRIT | Automated multi-turn adversarial conversations (crescendo, translation, encoding) | Microsoft |
| HarmBench | Standardized benchmark — 510 harmful behaviors across 7 categories | CAIS |
| Counterfit | CLI for adversarial attacks on ML models (evasion, poisoning, extraction) | Microsoft |
Defense Patterns That Work
NVIDIA NeMo Guardrails
Open-source Colang-based toolkit for programmable safety rails — input, output, and retrieval filtering. Integrates with LangChain.
Guardrails AI
Python framework for validating LLM outputs: PII detection, toxicity filtering, hallucination checks, JSON schema compliance.
Llama Guard 3
Meta's safety classifier model for input/output moderation. Runs locally, classifies content against customizable safety categories.
Constitutional AI
Anthropic's approach: model self-critiques against a set of principles, reducing reliance on human red-teamers for safety training.
The Regulatory Landscape
| Regulation | Scope | Key Requirements |
|---|---|---|
| EU AI Act | EU, phased 2024–2027 | Risk-based classification. GPAI with systemic risk requires mandatory red teaming, adversarial testing, incident reporting. Fines up to €35M or 7% global turnover. |
| NIST AI RMF 1.0 | US, voluntary | Four functions: Govern, Map, Measure, Manage. AI 600-1 GenAI profile adds 12 risk categories. Increasingly referenced in procurement. |
| OWASP LLM Top 10 | Industry standard | De facto checklist for LLM application security. Updated annually. |
| MITRE ATLAS | Knowledge base | Adversarial threat landscape for AI systems. Tactics, techniques, case studies. |
Getting Started in AI Security
- Learn the fundamentals — Read the OWASP LLM Top 10 cover to cover. Study MITRE ATLAS case studies.
- Install the tools — Set up Garak and run it against a local Ollama model. Break things safely.
- Practice prompt injection — Build a simple chatbot with a system prompt, then try to extract it. Try indirect injection via RAG.
- Study the incidents — Every real-world AI security failure teaches something. Build a mental library.
- Bridge traditional + AI security — The best AI red teamers have AppSec or pentesting backgrounds plus ML knowledge.
- Contribute — File issues on Garak, contribute to HarmBench, write about your findings. The field is small enough that contributions get noticed.