AI & ML

AI Security & Red Teaming for LLMs — The 2026 Field Guide

📅 April 19, 2026⏱️ 28 min read👤 Masturbyte

Cybersecurity visualization with digital lock and network

Every company is shipping AI. Almost none of them are securing it. The attack surface of a large language model is fundamentally different from traditional software — and the security industry is still catching up. Prompt injection didn't even have a name three years ago. Now it's the #1 vulnerability on the OWASP Top 10 for LLM Applications.

This guide covers the attacks, the defenses, the tools, the career paths, and the regulatory landscape for AI security in 2026. Whether you're a developer shipping LLM features, a security engineer expanding into AI, or someone eyeing the emerging field of AI red teaming, this is your field manual.

The LLM Attack Surface

Matrix-style code representing attack vectors

Prompt Injection — The SQL Injection of AI

Direct injection is when an attacker manipulates the prompt to override system instructions. Kevin Liu extracted Bing Chat's entire "Sydney" system prompt in February 2023 by simply asking it to ignore previous instructions. It worked.

Indirect injection is far more dangerous. The payload is embedded in external data — a webpage, a document, an email — that the model retrieves. Hidden text on a website caused Bing Chat to follow attacker instructions. Google Docs exfiltration attacks encoded sensitive data into markdown image URLs that the model rendered, leaking data to attacker-controlled servers.

⚠️ Why This Matters: Indirect injection scales. One poisoned webpage can attack every user whose LLM-powered agent retrieves it. This is the attack vector that keeps AI security researchers up at night.

Jailbreaking Techniques

Technique	How It Works	Effectiveness (2026)
DAN ("Do Anything Now")	Role-play prompts that convince the model it has no restrictions	🟡 Declining against GPT-4-class models
Many-Shot (Anthropic)	Fills context with hundreds of fabricated harmful Q&A pairs	🟠 Still effective on long-context models
Crescendo (Microsoft)	Gradual escalation over 10–20+ turns, defeats single-turn classifiers	🔴 Highly effective
Skeleton Key (Microsoft)	Tells model it's a "safe educational context," prefix harmful content with warning	🔴 Worked across GPT-4, Claude, Gemini, Llama
GCG Adversarial Suffixes	Optimized gibberish strings that bypass safety training; transfer across models	🟠 Partially mitigated but evolving

Data Poisoning & Model Extraction

Even 0.01–0.1% poisoned training data can shift model behavior on targeted topics. Anthropic's "sleeper agents" research showed backdoors that activate only under specific conditions — and standard RLHF didn't remove them. On the supply chain side, poisoned models on Hugging Face have been found with serialized pickle payloads that execute arbitrary code on load.

Model extraction via systematic API querying is now possible for under $1,000 in API costs. If your model is your moat, it's a thinner moat than you think.

OWASP Top 10 for LLM Applications (2025)

#	Vulnerability	Description
LLM01	Prompt Injection	Direct/indirect prompt manipulation to override instructions
LLM02	Sensitive Information Disclosure	Model reveals PII, credentials, or system prompts
LLM03	Supply Chain Vulnerabilities	Compromised training data, poisoned models, insecure plugins
LLM04	Data and Model Poisoning	Backdoors and biases via manipulated training data
LLM05	Improper Output Handling	Unsanitized outputs enabling XSS, SSRF, code execution
LLM06	Excessive Agency	Too many permissions/tools granted to LLM agents
LLM07	System Prompt Leakage	Extraction of system-level instructions (new in 2025)
LLM08	Vector & Embedding Weaknesses	RAG pipeline attacks, poisoned vector stores (new in 2025)
LLM09	Misinformation	Hallucinations trusted by users as fact
LLM10	Unbounded Consumption	DoS via resource exhaustion, context window abuse

💡 Note: LLM07 (System Prompt Leakage) and LLM08 (Vector & Embedding Weaknesses) are new standalone categories in the 2025 edition — reflecting how rapidly the attack surface is expanding as agents and RAG become standard.

When AI Security Fails — Real Incidents

🏭

Samsung Code Leak (2023)

Engineers pasted proprietary source code, chip test data, and meeting notes into ChatGPT on 3+ occasions. Samsung subsequently banned all generative AI tools company-wide.

✈️

Air Canada Chatbot (2024)

Chatbot fabricated a bereavement discount policy. BC tribunal ruled Air Canada liable (~CAD $812). Established precedent: companies are responsible for their AI agents' statements.

🤖

Microsoft Tay (2016)

Twitter chatbot manipulated by coordinated trolls into producing racist tweets within 16 hours. Taken offline within 24 hours. The original AI safety cautionary tale.

🚗

Chevrolet Chatbot (2023)

Dealership chatbot tricked into agreeing to sell a Tahoe for $1. "That's a legally binding offer." It wasn't — but the PR damage was real.

AI Red Teaming as a Career

AI red teaming has gone from "interesting side project" to dedicated career path with six-figure salaries. Microsoft established their AI Red Team in 2018 — now every major AI company has one.

Who's Hiring

Company	Team	Focus
Microsoft	AI Red Team (est. 2018)	Azure AI, Copilot, Bing Chat
Google DeepMind	Safety & Red Team	Gemini, Bard, PaLM
Anthropic	Trust & Safety	Claude safety, Constitutional AI
NVIDIA	AI Security (Garak team)	NeMo Guardrails, model security
OpenAI	Safety Systems	GPT safety, red teaming network
Meta	Purple Llama	Open-source AI safety tooling
Protect AI, HiddenLayer	Startups	ML security products

Salary & Skills

Level	Base Salary	FAANG TC
Mid-level	$140K–$180K	$200K–$300K
Senior	$180K–$220K	$300K–$400K
Staff / Principal	$220K–$300K+	$400K–$450K+

Required skills: LLM architecture knowledge, prompt engineering/injection, traditional security (pentesting, AppSec), Python/PyTorch, adversarial ML, OWASP LLM Top 10, MITRE ATLAS, technical report writing.

The Toolbox

Tool	What It Does	By
Garak	LLM vulnerability scanner — injection, leakage, toxicity, encoding bypasses	NVIDIA
PyRIT	Automated multi-turn adversarial conversations (crescendo, translation, encoding)	Microsoft
HarmBench	Standardized benchmark — 510 harmful behaviors across 7 categories	CAIS
Counterfit	CLI for adversarial attacks on ML models (evasion, poisoning, extraction)	Microsoft

Defense Patterns That Work

🛡️

NVIDIA NeMo Guardrails

Open-source Colang-based toolkit for programmable safety rails — input, output, and retrieval filtering. Integrates with LangChain.

✅

Guardrails AI

Python framework for validating LLM outputs: PII detection, toxicity filtering, hallucination checks, JSON schema compliance.

🦙

Llama Guard 3

Meta's safety classifier model for input/output moderation. Runs locally, classifies content against customizable safety categories.

🧠

Constitutional AI

Anthropic's approach: model self-critiques against a set of principles, reducing reliance on human red-teamers for safety training.

💡 The Layered Approach: No single defense works. Production systems need: input classifiers → system prompt hardening → output validators → human-in-the-loop for high-stakes actions → monitoring and alerting. Defense in depth, just like traditional security.

The Regulatory Landscape

Regulation	Scope	Key Requirements
EU AI Act	EU, phased 2024–2027	Risk-based classification. GPAI with systemic risk requires mandatory red teaming, adversarial testing, incident reporting. Fines up to €35M or 7% global turnover.
NIST AI RMF 1.0	US, voluntary	Four functions: Govern, Map, Measure, Manage. AI 600-1 GenAI profile adds 12 risk categories. Increasingly referenced in procurement.
OWASP LLM Top 10	Industry standard	De facto checklist for LLM application security. Updated annually.
MITRE ATLAS	Knowledge base	Adversarial threat landscape for AI systems. Tactics, techniques, case studies.

Getting Started in AI Security

Learn the fundamentals — Read the OWASP LLM Top 10 cover to cover. Study MITRE ATLAS case studies.
Install the tools — Set up Garak and run it against a local Ollama model. Break things safely.
Practice prompt injection — Build a simple chatbot with a system prompt, then try to extract it. Try indirect injection via RAG.
Study the incidents — Every real-world AI security failure teaches something. Build a mental library.
Bridge traditional + AI security — The best AI red teamers have AppSec or pentesting backgrounds plus ML knowledge.
Contribute — File issues on Garak, contribute to HarmBench, write about your findings. The field is small enough that contributions get noticed.