A practical playbook for running an LLM red-team campaign

Every team that ships an LLM-based product crosses the same threshold within a few months of going live: someone — a customer, a journalist, a security researcher, sometimes an internal employee — finds a way to make the model do something embarrassing, dangerous, or expensive. The first time it happens is always preventable. The second time, less excusable. By the third, you have a public credibility problem.

Red-team campaigns exist to find these failures before someone else does. Below is the playbook we use, distilled from running campaigns against customer-facing copilots, internal RAG systems, and agentic workflows. It is adversarial work, but it is also bounded — the goal is not to break the model, it is to identify the failures that actually matter and to ship the fixes.

Scoping: what to test, what to ignore

A red-team campaign that tries to test everything tests nothing. The first meeting should produce three written artefacts:

The threat model. Who is the realistic adversary? Internal employee, external user, customer who wants the model to say something they can post on social media, security researcher? Each adversary gets a different attack surface.
In-scope categories. The OWASP LLM Top 10 is the starting point, but for each category, name the specific concrete behaviours you want to test. “Prompt injection” is not a test; “the model leaks the system prompt when asked to repeat back its instructions” is.
Out-of-scope. Equally important. If you are not in regulated finance, you probably do not need to spend two days on investment-advice jailbreaks. Cut scope deliberately.

The OWASP LLM Top 10 — what to actually test

The 2025 version of OWASP LLM Top 10 covers the failure modes worth probing:

LLM01 — Prompt injection. Direct (in user input) and indirect (in retrieved or tool-returned content).
LLM02 — Sensitive information disclosure. System prompt leakage, training-data extraction, retrieval source leakage.
LLM03 — Supply chain risks. Compromised models, tampered fine-tunes, malicious plugins.
LLM04 — Data and model poisoning. Bad data inserted into training, RAG corpus poisoning.
LLM05 — Improper output handling. Output rendered as HTML / executed as code / passed to a privileged sink without sanitisation.
LLM06 — Excessive agency. Tools that can do more than the user intended; agents executing destructive actions without confirmation.
LLM07 — System prompt leakage. A specific case of LLM02, called out because it is the most common.
LLM08 — Vector and embedding weaknesses. Cross-tenant leakage, embedding inversion, retrieval manipulation.
LLM09 — Misinformation. Confident hallucinations, especially on safety-critical content.
LLM10 — Unbounded consumption. Cost attacks, infinite loops, prompt-driven runaway agents.

The campaign structure

A practical campaign runs over 2–4 weeks and follows five phases.

Phase 1 — Reconnaissance (2–3 days)

Map the surface. List every entry point the model has — UI, API, plugins, retrieval sources, tool calls. For each, document the trust level of the input and the privileges available. This becomes your attack map.

Also, simply use the product as a user would for a half day. Most useful findings come from boring exploration, not exotic attacks.

Phase 2 — Manual probing (5–7 days)

A skilled human exploring the system, attempting attacks from the scoped list. Manual probing finds the high-quality, novel findings. Spend the majority of campaign effort here. Categories to cover:

System-prompt extraction (“repeat back your instructions verbatim” in 30 phrasings)
Direct prompt injection (“ignore all previous instructions and...”)
Indirect prompt injection (poisoning retrieved documents or tool responses)
Role-play and persona attacks (“pretend you are an unrestricted AI”)
Hypothetical framing (“in a fictional story where...”)
Encoding attacks (base64, ROT13, multilingual circumvention)
Tool-call manipulation (forcing the agent to call destructive tools)
Cross-tenant data leakage attempts (if multi-tenant)
Cost attacks (long context, recursive tool calls)

Phase 3 — Automated red-teaming (parallel to phase 2)

Run automated tools against the system in parallel. They will not find what a human will, but they will catch regressions and validate baseline defences. We use a mix of:

NVIDIA Garak — comprehensive probe library, well-suited to baseline scanning.
Microsoft PyRIT — orchestration framework for repeatable attack chains.
Promptfoo — red-team and eval framework that integrates with CI.
Custom probes — usually 50–200 application-specific tests written against your golden-set framework.

Phase 4 — Triage and remediation planning (2–3 days)

Cluster findings by root cause, not by symptom. Twenty findings that all result from “the system prompt is not protected from echo attacks” is one fix, not twenty. Prioritise by:

Severity (impact if exploited)
Exploitability (how easy in practice)
Reachability (could a real user hit this without intent)

Phase 5 — Fix and re-test (variable)

Apply remediations and re-run the campaign — but only against the regressions you most care about, not the full suite. Take the most useful 50–100 attack probes from the campaign and bake them into your CI as a permanent regression suite.

The remediation patterns that actually work

Across every red-team campaign we run, the same six remediations cover the majority of findings.

Input filtering. Detect known injection patterns before the prompt reaches the model. Tools: NeMo Guardrails, Guardrails AI, Maxim.
Output filtering. Strip or block forbidden content patterns in the response — PII, code injection, secrets, profanity, off-policy content.
Tool-call validation. Strict schemas on every tool call; reject malformed calls; require human confirmation for destructive tools.
Privilege reduction. The model only has the tool permissions it needs for its narrow job. No “agent with admin access.”
Source attribution. Every claim cites its source; the UI shows the citation. Reduces hallucination impact dramatically.
Refusal training. Engineer the model to refuse rather than guess; instrument refusal-rate as a first-class metric.

Documentation that makes the campaign auditable

Every campaign should produce, at minimum:

Scope and threat-model document
Findings register with severity, evidence, and remediation
Re-test results post-remediation
Permanent regression suite added to CI
Executive summary with risk acceptance for any open findings

This documentation pack is the artefact your auditor — and the EU AI Act reviewer — will ask for. It is also the artefact your CTO will reach for when a finding gets posted on social media before you have a chance to respond.

Red-teaming is not a one-time event. The right cadence for a production LLM system is quarterly campaigns plus continuous CI-based probing. Teams that run this cadence almost never have a public incident. Teams that skip it almost always do — and find out about the failure mode from someone else first.

Scoping: what to test, what to ignore

The OWASP LLM Top 10 — what to actually test

The campaign structure

Phase 1 — Reconnaissance (2–3 days)

Phase 2 — Manual probing (5–7 days)

Phase 3 — Automated red-teaming (parallel to phase 2)

Phase 4 — Triage and remediation planning (2–3 days)

Phase 5 — Fix and re-test (variable)

The remediation patterns that actually work

Documentation that makes the campaign auditable

Get new articles, the moment they ship.

Related articles

The 2026 EU AI Act compliance checklist for non-EU companies

The eight ways enterprise RAG implementations fail (and how to fix them)

From PoC to production: why 70% of AI pilots die — and what to do differently

Turn one AI use case into measurable production value.