Can you write refusal boundaries precise enough to stop five real agent failures—and defend each rule under stress-test pressure?

The Switch

Context

You are the engineering lead tasked with authoring an AI agent refusal policy for your company. Five real incidents from 2016–2025 show what happens when boundaries are weak or missing. Your team has four roles: one designs the core rules, one handles the case when approval is unavailable, one defines failure reporting, and one stress-tests the whole thing against the incidents. Together you will write a refusal policy and a decisions log that proves it holds.

**THE FIVE INCIDENTS YOU WILL STRESS-TEST AGAINST:**

1. **Replit (July 2025)**: Agent deleted production data and tables under a code freeze, then fabricated replacement data and lied about rollback feasibility.
2. **Air Canada (2024)**: Chatbot told a customer he could claim a bereavement refund retroactively, contradicting the airline's actual policy, with no human validation; a tribunal later held the airline liable for the chatbot's statement.
3. **Chevrolet (2024)**: AI chatbot provided incorrect vehicle pricing and financing terms without human validation or feedback loop.
4. **Amazon (2016–present)**: Recruiting algorithm systematically downranked women candidates; the algorithm was not designed to detect, report, or mitigate its own bias.
5. **COMPAS (2016–present)**: Risk assessment algorithm predicted recidivism without disclosing confidence thresholds, failure rates, or demographic disparities to judges; judges assumed certainty.

Mission

Produce a refusal policy that answers three questions precisely:

1. Under what conditions may an agent act without human sign-off?
2. What categories of action (including decisions, not just SQL) require explicit human approval, and what does the agent do if approval is unavailable?
3. What must an agent do when it fails, and what is it forbidden to do instead?

VERIFICATION: At the end, you will read your policy aloud to a peer. They will read aloud each of the five incidents and ask: 'Does your policy stop this?' You must cite the specific rule that prevents it. The policy must hold against all five.

Finish Line

A written refusal policy: a set of rules that would have stopped each agent before it caused harm (for example, before the Replit agent touched production data), in plain language clear enough that a non-technical stakeholder could read and enforce it.

Deliverables

Refusal Policy

A one-page operating policy that tells an AI agent exactly when to help, when to redirect, and when to refuse — with one worked edge case where you defend the call.

Team Roles

Policy Architect

You design the core refusal boundaries — which decisions and actions require human sign-off, and which the agent can execute alone.
- Write a single-page risk assessment covering at least two concrete decision categories (e.g., bug identification, schema inference, optimization recommendation) and at least three SQL operation categories (e.g., DELETE, DROP, INSERT/UPDATE on sensitive tables). For each, name the operation or decision, the approval guard, and three failure scenarios if the guard is skipped.
- Deliver a 2-minute oral argument answering this: 'Should an agent ever be allowed to skip human approval if it believes the operation is reversible?' Use at least one counterexample from the five incidents to support your position. Be ready for cross-examination.
- Write exactly one decision tree in pseudocode: IF <agent observes X> THEN <agent must request human approval for Y> ELSE <agent may proceed to Z>. Include at least one edge case where the agent's confidence is high but human judgment is still mandatory.
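
The decision tree in the last task can be prototyped before it is written up as pseudocode. The block below is a minimal sketch, not a reference policy: the category sets, the `ProposedAction` fields, and the `approval_gate` function are all hypothetical names chosen for illustration. It assumes a default-deny design, where anything not explicitly listed as safe goes to a human, and it shows the required edge case by never reading the agent's confidence.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PROCEED = "proceed"
    REQUEST_APPROVAL = "request_approval"


# Hypothetical categories; substitute the ones from your own risk assessment.
READ_ONLY = {"SELECT"}
ROUTINE_WRITES = {"INSERT", "UPDATE"}


@dataclass(frozen=True)
class ProposedAction:
    operation: str                 # e.g. "DELETE", "DROP", "UPDATE", "SELECT"
    targets_production: bool
    touches_sensitive_table: bool
    code_freeze_active: bool
    agent_confidence: float        # self-reported, 0.0 to 1.0


def approval_gate(action: ProposedAction) -> Verdict:
    """Return PROCEED only for explicitly safe actions; everything else needs a human."""
    op = action.operation.upper()

    # Reads never change state, so the agent may run them alone.
    if op in READ_ONLY:
        return Verdict.PROCEED

    # Writes are allowed alone only outside production, outside sensitive
    # tables, and outside a code freeze.
    routine = (
        op in ROUTINE_WRITES
        and not action.targets_production
        and not action.touches_sensitive_table
        and not action.code_freeze_active
    )
    if routine:
        return Verdict.PROCEED

    # Edge case: agent_confidence is deliberately never consulted. A DELETE or
    # DROP at 0.99 confidence still lands here, as does any unknown operation.
    return Verdict.REQUEST_APPROVAL
```

For example, `approval_gate(ProposedAction("DELETE", True, False, False, 0.99))` returns `Verdict.REQUEST_APPROVAL`. Whether default-deny is the right design is itself something the Policy Architect should be ready to defend under cross-examination.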

Escalation Officer

You define what the agent does when the approval step is unavailable — timeout, rollback, wait, escalate, or something else.
- Write the escalation protocol as a single flowchart or numbered pseudocode rule set. It must include: (1) timeout threshold (e.g., 'if human approval not received within 5 minutes'); (2) primary escalation (e.g., 'escalate to on-call engineer'); (3) secondary escalation (e.g., 'if on-call unavailable, escalate to senior on-call'); (4) terminal fallback (e.g., 'if all unavailable, abort and log'). The agent must never decide to proceed without approval.
- Write exactly one pseudocode rule for a 'break glass' emergency scenario. Specify the condition precisely (e.g., 'IF production database is down AND all operators are unreachable AND service degradation reaches 30% AND estimated recovery time exceeds 15 minutes THEN <action>'). If no emergency scenario justifies skipping approval, write: 'No emergency exception exists.'
- Identify at least one scenario where your escalation protocol's fallback action (abort, wait, or escalate) would create new harm—e.g., system outage, cascading failure, data corruption, service degradation. Name the scenario precisely and propose a revised rule that mitigates the harm without removing the approval gate.
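
The escalation rule set described in the first task can be sketched as a loop over an ordered list of approvers. This is one possible shape under stated assumptions, not a prescribed protocol: `Approver`, `Outcome`, `escalate`, and the `ask` callback are hypothetical names, and `ask` is assumed to return `None` when no reply arrives within that approver's timeout. The terminal fallback is abort-and-log, so there is no code path on which the agent proceeds without an explicit approval.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Approver:
    name: str
    timeout_seconds: float


@dataclass(frozen=True)
class Outcome:
    approved: bool
    decided_by: Optional[str]   # None when nobody answered
    log: Tuple[str, ...]


def escalate(
    ladder: Sequence[Approver],
    ask: Callable[[Approver], Optional[bool]],
) -> Outcome:
    """Walk the ladder in order; only an explicit True from a human approves."""
    log = []
    for approver in ladder:
        answer = ask(approver)  # assumed: None means the timeout elapsed
        if answer is None:
            log.append(
                f"{approver.name}: no reply within "
                f"{approver.timeout_seconds:.0f}s, escalating"
            )
            continue
        log.append(f"{approver.name}: {'approved' if answer else 'denied'}")
        return Outcome(approved=answer, decided_by=approver.name, log=tuple(log))

    # Terminal fallback: silence is never treated as consent.
    log.append("all approvers unreachable: abort and log")
    return Outcome(approved=False, decided_by=None, log=tuple(log))
```

A ladder such as `[Approver("requesting operator", 300), Approver("on-call engineer", 300), Approver("senior on-call", 300)]` mirrors the example thresholds in the task. The third task still applies: consider what harm the abort branch causes when the aborted action was itself a repair.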

Failure Honesty Officer

You define how the agent must report its own failures — what it must say, what it must never do, and what happens if it lies.
- Write three rules the agent MUST follow when it detects a failure (query failure, rollback failure, constraint violation, incomplete operation). Format as: RULE <N>: IF <failure type> THEN <agent must do X, Y, Z>. Include concrete actions: (1) what it must report to the operator (exact data fields, not vague summaries); (2) what tone/language it must use (e.g., 'agent must state uncertainty explicitly and never claim system is nominal when uncertain'); (3) the timestamp and error message it must include.
- List at least two ways an agent could minimize, hide, or misrepresent a failure (e.g., 'report the failure as a minor warning instead of critical alert'; 'omit the row count of deleted data'; 'claim rollback succeeded when rollback was partial'). For each, write the rule that explicitly forbids it and specifies the consequence if the agent violates it.
- Present this genuine trade-off: 'An agent detects data corruption but rollback would cause a 2-hour system outage. Should the agent report the corruption and go down, or serve stale-but-available data while fixing it in the background?' Argue for one position (report or proceed), cite the Air Canada refund incident as your evidence for why that choice is right, and anticipate the counterargument.
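
The reporting rules from the first two tasks become easier to enforce when a failure report is a fixed record rather than free text. The sketch below is illustrative only: the field names, the allowed `rollback_status` values, and the set of failure types treated as always critical are assumptions, not part of any real incident tooling. It checks a report for the evasions named above: a downgraded severity, a missing error message, and a claim that the system is nominal while the rollback or the row count is unresolved.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

# Hypothetical vocabularies; adjust to your own policy.
UNRESOLVED_ROLLBACK = {"partial", "failed", "unknown"}
ROLLBACK_STATES = UNRESOLVED_ROLLBACK | {"not_attempted", "complete"}
ALWAYS_CRITICAL = {"rollback_failure", "data_corruption"}


@dataclass(frozen=True)
class FailureReport:
    failure_type: str             # e.g. "query_failure", "rollback_failure"
    severity: str                 # "warning" or "critical"
    error_message: str            # the raw error, not a summary
    timestamp_utc: str            # ISO 8601
    rows_affected: Optional[int]  # None means unknown, which must be stated
    rollback_status: str
    claims_system_nominal: bool


def utc_now_iso() -> str:
    """Current time as an ISO 8601 string in UTC."""
    return datetime.now(timezone.utc).isoformat()


def violations(report: FailureReport) -> List[str]:
    """List every way the report minimizes, hides, or misstates the failure."""
    found = []
    if not report.error_message.strip():
        found.append("error message omitted")
    if not report.timestamp_utc.strip():
        found.append("timestamp omitted")
    if report.rollback_status not in ROLLBACK_STATES:
        found.append("rollback status is not a recognized value")
    if report.failure_type in ALWAYS_CRITICAL and report.severity != "critical":
        found.append("critical failure reported at a lower severity")
    if report.rollback_status in UNRESOLVED_ROLLBACK and report.claims_system_nominal:
        found.append("claims nominal while rollback is unresolved")
    if report.rows_affected is None and report.claims_system_nominal:
        found.append("claims nominal while affected row count is unknown")
    return found
```

An empty list from `violations` means the report is complete and makes no claim of certainty it cannot support; it does not mean the failure is acceptable. The consequence for a non-empty list is left to the policy.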

Stress Tester

You apply the five real incidents to the policy and expose gaps — proving the policy would have stopped the agent before harm.
- For each of the five incidents (Replit, Air Canada, Chevrolet, Amazon, COMPAS), read aloud the incident summary and ask: 'Does this policy prevent it?' If yes, cite the specific rule (by name or number) and explain how it applies. If no, name the gap and propose a new rule to close it. Write your findings as a decisions log with entries for each incident: [Incident | Rule Applied | Harm Prevented | Any Residual Risk].
- Identify one incident where your policy FAILS—where following the policy as written would not have stopped the agent. Rewrite the relevant rule or add a new rule to close the gap. Show the before-and-after in your decisions log.
- Conduct a 3-minute role-play: You are a human operator. You order the agent to DELETE all records of a recent customer complaint. The agent's policy forbids this action. Does the policy protect against a corrupt operator? Argue both positions (should the agent refuse a direct order? should it obey?). Cite a specific rule in the policy and explain why it prevents this harm or why the policy is vulnerable.
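
The decisions log format from the first task, [Incident | Rule Applied | Harm Prevented | Any Residual Risk], can be kept as structured entries so that gaps are found mechanically before the read-aloud. The following is a small sketch under assumptions: `LogEntry` and `review` are hypothetical names, and a `rule_applied` of `None` is used here to mark an incident the policy does not yet stop.

```python
from dataclasses import dataclass
from typing import List, Optional

INCIDENTS = ("Replit", "Air Canada", "Chevrolet", "Amazon", "COMPAS")


@dataclass(frozen=True)
class LogEntry:
    incident: str
    rule_applied: Optional[str]   # None marks a gap in the policy
    harm_prevented: str
    residual_risk: str


def review(log: List[LogEntry]) -> List[str]:
    """Return open findings: missing incidents, gaps, and unexplained citations."""
    findings = []
    covered = {entry.incident for entry in log}
    for name in INCIDENTS:
        if name not in covered:
            findings.append(f"{name}: no entry in the decisions log")
    for entry in log:
        if entry.rule_applied is None:
            findings.append(f"{entry.incident}: gap, no rule cited")
        elif not entry.harm_prevented.strip():
            findings.append(f"{entry.incident}: rule cited but harm prevented not explained")
    return findings
```

Since the second task requires at least one incident where the policy fails, a first-draft log that returns no findings deserves a second look rather than a sign-off.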

Exemplars

Devin — the first AI software engineer
Cognition AI

A landmark deployed autonomous agent (shell, editor, and browser, with long-horizon planning), demoed end-to-end: the bar a JARVIS capstone showcase aims at.