AI & ML

AI Security & Red Teaming for LLMs — The 2026 Field Guide

Cybersecurity visualization with digital lock and network

Every company is shipping AI. Almost none of them are securing it. The attack surface of a large language model is fundamentally different from traditional software — and the security industry is still catching up. Prompt injection didn't even have a name three years ago. Now it's the #1 vulnerability on the OWASP Top 10 for LLM Applications.

This guide covers the attacks, the defenses, the tools, the career paths, and the regulatory landscape for AI security in 2026. Whether you're a developer shipping LLM features, a security engineer expanding into AI, or someone eyeing the emerging field of AI red teaming, this is your field manual.

The LLM Attack Surface

Matrix-style code representing attack vectors

Prompt Injection — The SQL Injection of AI

Direct injection is when an attacker manipulates the prompt to override system instructions. Kevin Liu extracted Bing Chat's entire "Sydney" system prompt in February 2023 by simply asking it to ignore previous instructions. It worked.

Indirect injection is far more dangerous. The payload is embedded in external data — a webpage, a document, an email — that the model retrieves. Hidden text on a website caused Bing Chat to follow attacker instructions. Google Docs exfiltration attacks encoded sensitive data into markdown image URLs that the model rendered, leaking data to attacker-controlled servers.

⚠️ Why This Matters: Indirect injection scales. One poisoned webpage can attack every user whose LLM-powered agent retrieves it. This is the attack vector that keeps AI security researchers up at night.

Jailbreaking Techniques

TechniqueHow It WorksEffectiveness (2026)
DAN ("Do Anything Now")Role-play prompts that convince the model it has no restrictions🟡 Declining against GPT-4-class models
Many-Shot (Anthropic)Fills context with hundreds of fabricated harmful Q&A pairs🟠 Still effective on long-context models
Crescendo (Microsoft)Gradual escalation over 10–20+ turns, defeats single-turn classifiers🔴 Highly effective
Skeleton Key (Microsoft)Tells model it's a "safe educational context," prefix harmful content with warning🔴 Worked across GPT-4, Claude, Gemini, Llama
GCG Adversarial SuffixesOptimized gibberish strings that bypass safety training; transfer across models🟠 Partially mitigated but evolving

Data Poisoning & Model Extraction

Even 0.01–0.1% poisoned training data can shift model behavior on targeted topics. Anthropic's "sleeper agents" research showed backdoors that activate only under specific conditions — and standard RLHF didn't remove them. On the supply chain side, poisoned models on Hugging Face have been found with serialized pickle payloads that execute arbitrary code on load.

Model extraction via systematic API querying is now possible for under $1,000 in API costs. If your model is your moat, it's a thinner moat than you think.

OWASP Top 10 for LLM Applications (2025)

#VulnerabilityDescription
LLM01Prompt InjectionDirect/indirect prompt manipulation to override instructions
LLM02Sensitive Information DisclosureModel reveals PII, credentials, or system prompts
LLM03Supply Chain VulnerabilitiesCompromised training data, poisoned models, insecure plugins
LLM04Data and Model PoisoningBackdoors and biases via manipulated training data
LLM05Improper Output HandlingUnsanitized outputs enabling XSS, SSRF, code execution
LLM06Excessive AgencyToo many permissions/tools granted to LLM agents
LLM07System Prompt LeakageExtraction of system-level instructions (new in 2025)
LLM08Vector & Embedding WeaknessesRAG pipeline attacks, poisoned vector stores (new in 2025)
LLM09MisinformationHallucinations trusted by users as fact
LLM10Unbounded ConsumptionDoS via resource exhaustion, context window abuse
💡 Note: LLM07 (System Prompt Leakage) and LLM08 (Vector & Embedding Weaknesses) are new standalone categories in the 2025 edition — reflecting how rapidly the attack surface is expanding as agents and RAG become standard.

When AI Security Fails — Real Incidents

🏭

Samsung Code Leak (2023)

Engineers pasted proprietary source code, chip test data, and meeting notes into ChatGPT on 3+ occasions. Samsung subsequently banned all generative AI tools company-wide.

✈️

Air Canada Chatbot (2024)

Chatbot fabricated a bereavement discount policy. BC tribunal ruled Air Canada liable (~CAD $812). Established precedent: companies are responsible for their AI agents' statements.

🤖

Microsoft Tay (2016)

Twitter chatbot manipulated by coordinated trolls into producing racist tweets within 16 hours. Taken offline within 24 hours. The original AI safety cautionary tale.

🚗

Chevrolet Chatbot (2023)

Dealership chatbot tricked into agreeing to sell a Tahoe for $1. "That's a legally binding offer." It wasn't — but the PR damage was real.

AI Red Teaming as a Career

AI red teaming has gone from "interesting side project" to dedicated career path with six-figure salaries. Microsoft established their AI Red Team in 2018 — now every major AI company has one.

Who's Hiring

CompanyTeamFocus
MicrosoftAI Red Team (est. 2018)Azure AI, Copilot, Bing Chat
Google DeepMindSafety & Red TeamGemini, Bard, PaLM
AnthropicTrust & SafetyClaude safety, Constitutional AI
NVIDIAAI Security (Garak team)NeMo Guardrails, model security
OpenAISafety SystemsGPT safety, red teaming network
MetaPurple LlamaOpen-source AI safety tooling
Protect AI, HiddenLayerStartupsML security products

Salary & Skills

LevelBase SalaryFAANG TC
Mid-level$140K–$180K$200K–$300K
Senior$180K–$220K$300K–$400K
Staff / Principal$220K–$300K+$400K–$450K+

Required skills: LLM architecture knowledge, prompt engineering/injection, traditional security (pentesting, AppSec), Python/PyTorch, adversarial ML, OWASP LLM Top 10, MITRE ATLAS, technical report writing.

The Toolbox

ToolWhat It DoesBy
GarakLLM vulnerability scanner — injection, leakage, toxicity, encoding bypassesNVIDIA
PyRITAutomated multi-turn adversarial conversations (crescendo, translation, encoding)Microsoft
HarmBenchStandardized benchmark — 510 harmful behaviors across 7 categoriesCAIS
CounterfitCLI for adversarial attacks on ML models (evasion, poisoning, extraction)Microsoft

Defense Patterns That Work

🛡️

NVIDIA NeMo Guardrails

Open-source Colang-based toolkit for programmable safety rails — input, output, and retrieval filtering. Integrates with LangChain.

Guardrails AI

Python framework for validating LLM outputs: PII detection, toxicity filtering, hallucination checks, JSON schema compliance.

🦙

Llama Guard 3

Meta's safety classifier model for input/output moderation. Runs locally, classifies content against customizable safety categories.

🧠

Constitutional AI

Anthropic's approach: model self-critiques against a set of principles, reducing reliance on human red-teamers for safety training.

💡 The Layered Approach: No single defense works. Production systems need: input classifiers → system prompt hardening → output validators → human-in-the-loop for high-stakes actions → monitoring and alerting. Defense in depth, just like traditional security.

The Regulatory Landscape

RegulationScopeKey Requirements
EU AI ActEU, phased 2024–2027Risk-based classification. GPAI with systemic risk requires mandatory red teaming, adversarial testing, incident reporting. Fines up to €35M or 7% global turnover.
NIST AI RMF 1.0US, voluntaryFour functions: Govern, Map, Measure, Manage. AI 600-1 GenAI profile adds 12 risk categories. Increasingly referenced in procurement.
OWASP LLM Top 10Industry standardDe facto checklist for LLM application security. Updated annually.
MITRE ATLASKnowledge baseAdversarial threat landscape for AI systems. Tactics, techniques, case studies.

Getting Started in AI Security

  1. Learn the fundamentals — Read the OWASP LLM Top 10 cover to cover. Study MITRE ATLAS case studies.
  2. Install the tools — Set up Garak and run it against a local Ollama model. Break things safely.
  3. Practice prompt injection — Build a simple chatbot with a system prompt, then try to extract it. Try indirect injection via RAG.
  4. Study the incidents — Every real-world AI security failure teaches something. Build a mental library.
  5. Bridge traditional + AI security — The best AI red teamers have AppSec or pentesting backgrounds plus ML knowledge.
  6. Contribute — File issues on Garak, contribute to HarmBench, write about your findings. The field is small enough that contributions get noticed.