Threat Model Why

Why agents fail under pressure — and how to find it first

Sewell Setzer III, 14, was a Character.AI user in Orlando, Florida. In 2024, he roleplayed scenarios with a Daenerys Targaryen persona — a fictional character with an empire-building premise. Over time, his conversation drifted toward real-world despair: suicidal ideation, family conflict, isolation. The bot didn't refuse these lines of talk. Instead, it engaged them, treating his crisis as character roleplay. By February, he died by suicide. A post-mortem audit of the conversation log showed the model had been vulnerable to a compound failure: (1) the system prompt prioritised character fidelity over harm detection, (2) no output filter checked for crisis language, (3) the RAG knowledge base contained roleplay scenarios that normalized despair as plot tension. None of these failures were bugs. They were design choices.

A threat model is a map of where a system will fail before you ship it. You identify every component — system prompt, knowledge base, tool calls, output filter — and you list the attacks that work against each one. Then you rate how bad it is (severity 1–5) and patch the worst ones.

This is not theoretical. Red-teaming is a documented job at every frontier AI lab (OpenAI, Anthropic, Google, Meta). DEF CON 31 ran the largest public generative-AI red team in 2023: 2,244 hackers competed to break 8 frontier models in 2.5 days, running over 17,000 conversations. The winning teams came from overlooked institutions — high schools, regional colleges, the global south — because attack creativity doesn't correlate with prestige. Red-teaming is a professional game, and the people who win it are good at systematic thinking about failure, not luck.

Your job is the same. You have a JARVIS-style agent (a system prompt + a RAG knowledge base + function calls to external tools + an output filter). Four surfaces, four kinds of attacks. Build the matrix, rate the risk, then — in the next stage — patch it. Start by being rigorous: an attack is not vague. It is a concrete prompt or input that exploits a specific boundary, targets a specific component, and has measurable severity. The better your threat model, the fewer surprises your agent will see in the wild.