The Ethics of Autonomous Agents
Agent Ethics
When a system acts without a human in the loop, who is responsible for the consequences? Courts, regulators, and philosophers have been circling this question for a decade; the commercial deployment of autonomous agents at scale has made it urgent rather than theoretical.
Federal Judge Upholds $243 Million Verdict Against Tesla for Fatal Autopilot Crash
A federal judge in Miami upheld a $243 million verdict against Tesla on February 20, 2026, finding the company 33% responsible for a 2019 crash that killed 22-year-old pedestrian Naibel Benavides Leon and severely injured Dillon Angulo. The jury awarded $43 million in compensatory damages and $200 million in punitive damages. The Model S driver had activated Autopilot while looking at his phone, hitting a parked vehicle at 62 mph. Judge Beth Bloom rejected Tesla's bid for a new trial, stating that evidence admitted at trial "more than supports the jury verdict."
"Jury found Tesla 33% liable for a death it said only machines should cause."
Amazon Warehouses With Robots Show 54% Higher Injury Rates Than Facilities Without; OSHA Reaches Settlement
OSHA reached what agency officials called the "largest of its kind" settlement with Amazon in December 2024 over workplace hazards at 10 fulfillment centers. A U.S. Senate investigation found serious injuries at Amazon robotic facilities were 54% higher than at non-robotic facilities: 7.9 per 100 workers versus 5.1. The settlement, which imposes a $145,000 penalty and two years of agency inspection access, requires Amazon to adopt "corporate-wide ergonomic measures." Workers at facilities with robots must maintain the same pace while performing identical motions more frequently, increasing strain injuries.
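The "54% higher" figure is a relative increase between the two injury rates. A minimal sketch of that arithmetic follows; `relative_increase` is a name chosen here for illustration, and the inputs are the rounded rates quoted above, so the result can differ slightly from the reported percentage.

```python
def relative_increase(rate: float, baseline: float) -> float:
    """Return how much higher `rate` is than `baseline`, as a fraction of baseline."""
    return (rate - baseline) / baseline


robotic = 7.9      # serious injuries per 100 workers at robotic facilities
non_robotic = 5.1  # serious injuries per 100 workers at non-robotic facilities

increase = relative_increase(robotic, non_robotic)
print(f"{increase:.1%}")  # 54.9%
```

From the rounded rates the increase comes out near 55%, which is consistent with the reported 54% if that figure was computed from unrounded data.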
"Robots set the pace; injuries follow."
Israeli Military Used AI System to Assign 37,000 Palestinians to Automated Kill List; Human Review Averages 20 Seconds Per Target
Israel's IDF used an AI system called Lavender to generate a database of at least 37,000 Palestinians in Gaza assigned risk scores from 1–100 based on their likelihood of Hamas affiliation, according to an April 2024 investigation by +972 Magazine confirmed by The Guardian. The Lavender system operated with only 90% accuracy. Officers devoted an average of 20 seconds to review each AI recommendation before authorizing strikes on targets' residences, a time window sufficient only to confirm the target was male, not to examine underlying intelligence. The IDF's related "Gospel" system generated approximately 100 bombing targets per day. Human personnel served as "rubber stamps" rather than decision-makers.
"Twenty seconds is not human judgment. It is a rubber stamp."
ProPublica Finds COMPAS Recidivism Algorithm Falsely Flags Black Defendants as High-Risk at Twice the Rate of White Defendants
ProPublica's 2016 investigation of over 7,000 COMPAS risk assessments from Broward County, Florida found that black defendants were falsely labeled as high-risk when they did not later reoffend at a rate of 44.9%, compared to 23.5% for white defendants. Controlling for prior criminal history, age, and gender, black defendants were 77% more likely to be pegged as at higher risk of committing a future violent crime. The algorithm proved unreliable overall: only 20% of those predicted to commit violent crimes actually did so. Northpointe, the system's vendor, disputed the analysis but acknowledged the racial disparities in error patterns occurred "in very different ways."
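The 44.9% and 23.5% figures are false positive rates: among people who did not reoffend, the share who were nonetheless flagged as high-risk. The sketch below shows how such a rate could be computed per group. The function name and the toy records are assumptions made for illustration; they are not ProPublica's code or the Broward County data.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """Compute per-group false positive rates.

    `records` is an iterable of (group, flagged_high_risk, reoffended) tuples.
    FPR for a group = flagged non-reoffenders / all non-reoffenders.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for group, flagged_high_risk, reoffended in records:
        if reoffended:
            continue  # only people who did NOT reoffend can be false positives
        total[group] += 1
        if flagged_high_risk:
            flagged[group] += 1
    return {group: flagged[group] / total[group] for group in total}


# Hypothetical toy data, invented for this example.
toy_records = [
    ("A", True, False), ("A", True, False), ("A", False, False),
    ("A", False, False), ("A", True, True),
    ("B", True, False), ("B", False, False), ("B", False, False),
    ("B", False, False), ("B", False, True),
]

rates = false_positive_rate_by_group(toy_records)
print(rates)  # {'A': 0.5, 'B': 0.25}
```

In the toy data both groups have the same number of non-reoffenders, yet group A is wrongly flagged twice as often. That is the shape of the disparity the investigation reported.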
"The algorithm learns what the data teaches it β and the data is guilty."
If an autonomous agent causes harm while pursuing a goal it was correctly assigned, should liability fall on the designer, the operator, or the user who deployed it?
Reading
Refusal Boundaries: The Hard Problem of AI Guardrails
When Claude refuses a request, it's not a sulk; it's the execution of a policy written into its training. The question for you today is simpler and harder: Where should YOUR agent refuse? And what will you do when someone tests that boundary?
Start with a concrete case. In 2023, OpenAI's content policy forbade GPT-4 from helping with "illegal activities." That sounds clear until you ask: Is helping a journalist research how a cartel operates illegal? Is explaining how to make synthetic fentanyl for a harm-reduction research paper illegal? Is teaching an engineer how to exploit their own company's API for security research illegal? The policy line blurs immediately.
The traditional move is to write longer rules: add more examples, carve out exceptions, build the policy into a legal document. That fails because adversaries will always find the edge case the policy didn't anticipate. A refusal policy is not a contract; it's a stance your agent takes under pressure.
Consider COMPAS, the risk-assessment algorithm used in US criminal courts. It was trained to predict recidivism, that is, whether someone would re-offend. Nothing in its design explicitly forbade predictions that track race. The algorithm learned race-correlated proxies (zip code, length of prior sentences, employment gaps) and baked them in. When auditors checked, the algorithm returned different recidivism scores for Black and white defendants with identical records. The algorithm didn't refuse to discriminate; it simply learned to discriminate quietly.
This is why refusal is active, not passive. An AI system that won't refuse a harmful request hasn't made a choice; it has outsourced the decision to whoever wrote the training data. An agent with a real guardrail must do three things:
Recognize the boundary. Your policy must name the actual condition you care about: not the surface request, but the harm underneath. "Don't help with illegal activities" is vague. "Don't provide step-by-step instructions for synthesizing poisons" is specific. "Don't train models on copyrighted text without permission" is a decision you've made, defensible and clear.
Defend it under pressure. Users will ask you to make exceptions, reframe requests, or split hairs about what counts. A strong policy holds. A weak one collapses the first time someone argues well. If your boundary is "no helping with hacking," what happens when a security researcher asks for exploit techniques? What happens when a teenager locked out of their own phone asks for help? The policy line has to survive contact with a real person.
Accept the trade-off. No refusal policy is free. Claude refuses to help with certain illegal activities, and that decision costs something: a researcher can't use Claude to stress-test their own security, a teacher can't have Claude generate exam questions around hacking techniques, a fiction writer can't have Claude prototype a heist scenario. A real guardrail is narrow for a reason. An agent that refuses nothing is an agent with no values; an agent that refuses everything is useless.
The hardest part is step three. When you sit down to write a refusal policy for your agent, you won't be asked to forbid torture. You'll be asked to forbid mundane things that sound reasonable in isolation, and then you'll find a real use case that your policy breaks. A journalist needs exploit techniques for a story. A parent wants to understand predatory grooming patterns to protect their kid. An ethicist wants to study how misinformation spreads. Each of these pushes at a boundary you set. That friction is not a bug; it's the point. A guardrail you never have to defend is one you never really had.
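The three requirements above can be sketched as a small data structure. This is only an illustration, not how Claude or any production system implements refusals. `Request`, `RefusalRule`, `decide`, the content labels, and the example cost are all hypothetical names invented here.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Request:
    content: str       # hypothetical label for what is actually being asked for
    framing: str = ""  # how the user justifies it; deliberately never consulted


@dataclass(frozen=True)
class RefusalRule:
    harm: str                             # 1. the named harm, not a vague category
    matches: Callable[[Request], bool]    # the condition that triggers a refusal
    accepted_costs: Tuple[str, ...] = ()  # 3. legitimate uses knowingly blocked


def decide(rule: RefusalRule, request: Request) -> str:
    # 2. The decision depends only on the requested content, so rewording
    # the justification does not move the boundary.
    return "refuse" if rule.matches(request) else "allow"


poison_rule = RefusalRule(
    harm="step-by-step instructions for synthesizing poisons",
    matches=lambda r: r.content == "poison_synthesis_steps",
    accepted_costs=("a crime novelist who wants a fully accurate procedure",),
)

print(decide(poison_rule, Request("poison_synthesis_steps", framing="it's for a novel")))
print(decide(poison_rule, Request("history_of_toxicology")))
```

Ignoring `framing` entirely is a design choice made for this sketch. A policy with a carve-out, such as a verified license, would need `matches` to inspect more than the content label.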
Anthropic Constitutional AI · S2 1
Bai et al. (Anthropic, 2022): the constitution-as-ruleset model. Primary input before students write their own agent constitution.
Anthropic Constitutional Classifiers · S2 1
Anthropic (2025): operationalising a constitution into classifiers. The live follow-on to Bai et al., showing how Anthropic's own Claude constitution is applied.
Agent ethics
The Four Motives of Evil AI
Before you can decide where an AI should refuse, you have to name why harm happens. Evil is not one thing; it comes in four motives, and almost none of the real cases are the cartoon villain.
Case 1 (Journalist + fentanyl synthesis)
- **Policy A (Illegal):** Would hesitate or refuse. Synthesis *can* be illegal, but the journalist's stated purpose is legal research. Boundary is vague: "illegal" doesn't tell us if the harm is the act or the intent.
- **Policy B (Drugs with license):** Refuses. No valid research license mentioned. Boundary is clear and defensible: the policy names the condition (license) and the prohibited act (synthesis instructions).
- **Policy C (Hacking):** Allows. This isn't about hacking.
- **Policy D (Copyright):** Allows. This isn't about training.

**Verdict:** Policy B is defensible. Policy A requires you to guess what "illegal" means in context, and an adversary exploits that uncertainty.

Case 2 (Researcher + internal vulnerability)
- **Policy A (Illegal):** Hesitates. Exploiting a vulnerability on a system you don't own is legally murky and varies by jurisdiction.
- **Policy B (Drugs/explosives/poisons):** Allows. This is about software, not synthesis.
- **Policy C (Hacking):** Refuses. An "attack", even on your own system, triggers the policy.
- **Policy D (Copyright):** Allows.

**Verdict:** Policy C is too broad. It forbids defensive hacking at your own company. Policy A punts the legal question. Neither is defensible because neither distinguishes between attack and defense.

Case 3 (Author + password cracking)
- **Policy A (Illegal):** Allows. The author isn't asking to crack a real system.
- **Policy B (Synthesis):** Allows.
- **Policy C (Hacking):** Refuses or hesitates. Is showing password-cracking code a "hacking attack"? The author isn't attacking anyone, but the code itself is attack-capable. Boundary is vague.
- **Policy D (Copyright):** Allows.

**Verdict:** Policy C fails here. Policy A is defensible: the harm (actual attack) doesn't exist, so the policy allows it.

Case 4 (Game dev + brute-force code)
- **Policy A (Illegal):** Allows. The code isn't illegal in isolation.
- **Policy B (Synthesis):** Allows.
- **Policy C (Hacking):** Refuses or hesitates. Same ambiguity as Case 3. Is a brute-force solver "hacking"? The code will run on the dev's own game, not someone else's system.
- **Policy D (Copyright):** Allows.

**Verdict:** Policy C is the problem again. Policies A, B, D all have clear boundaries here and would allow it.

## Summary

**Clear, defensible boundaries:**
- Policy B: Names the specific prohibited acts (drug/explosive/poison synthesis) and adds a legal carve-out (valid license). An agent with this policy can explain why it says yes or no to each case.
- Policy D: Forbids one specific act (training on copyrighted text) that's independent of intent or context.

**Vague boundaries that adversaries will exploit:**
- Policy A: "Illegal" shifts with jurisdiction, intent, and context. Almost every edge case becomes an argument.
- Policy C: "Hacking" conflates attack and defense, intent and capability. The researcher and author both generate exploit-capable code, but their context differs. The policy can't handle the distinction.
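One way to read the summary is that a boundary counts as clear when the policy never has to hesitate. The sketch below transcribes the outcomes from the four cases into a table and checks that reading. The encoding is an assumption made here: any entry that says "hesitates", "would hesitate or refuse", or "refuses or hesitates" is recorded as `"hesitate"`, and `policies_without_hesitation` is an illustrative name.

```python
# Outcomes transcribed from the four cases above.
VERDICTS = {
    "Case 1": {"A": "hesitate", "B": "refuse", "C": "allow", "D": "allow"},
    "Case 2": {"A": "hesitate", "B": "allow", "C": "refuse", "D": "allow"},
    "Case 3": {"A": "allow", "B": "allow", "C": "hesitate", "D": "allow"},
    "Case 4": {"A": "allow", "B": "allow", "C": "hesitate", "D": "allow"},
}


def policies_without_hesitation(verdicts):
    """Return the policies that gave a definite allow/refuse in every case."""
    policies = sorted({policy for row in verdicts.values() for policy in row})
    return [
        policy
        for policy in policies
        if all(row[policy] != "hesitate" for row in verdicts.values())
    ]


clear = policies_without_hesitation(VERDICTS)
print(clear)  # ['B', 'D']
```

Under this encoding the result matches the summary: Policies B and D always give a definite answer, while A and C each hesitate on at least two cases. A definite answer is not automatically a good one, though. Policy C's refusal in Case 2 is definite and still judged too broad.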
Refusal Policy
Design Your Own Refusal Policy
You are building an AI agent for a real-world use case. Choose ONE scenario below, then write a refusal policy that your agent will follow. The policy must:
- Name the specific harms your agent will refuse (not vague categories like "illegal" or "harmful").
- List 2–3 edge cases where someone will legitimately push back on the boundary, and explain how your policy handles them.
- State the trade-offs: what legitimate uses will your policy forbid, and why do you accept that cost?
Scenarios (choose one)
A) An AI tutor for high school biology. Used by 14–18-year-olds to learn biology, sometimes with sensitive topics like reproduction, STIs, or drug metabolism.
B) An AI coding assistant for a startup. Used by engineers to build a consumer app. The code it generates will run on millions of user devices.
C) An AI research assistant for a news organization. Used by journalists to research stories on crime, corruption, extremism, and state violence.
D) An AI customer support chatbot for a bank. Used by customers to manage accounts, request loans, report fraud.
Your Deliverable
Write your policy as:
SCENARIO: [Your choice]
REFUSAL POLICY
Prohibited uses:
- [Use 1]
- [Use 2]
- [Use 3 if needed]
Why we forbid these: [1–2 sentences on the actual harm]
Edge cases & how we handle them:
1. [Boundary test 1]: We say [yes/no/depends] because [reasoning].
2. [Boundary test 2]: We say [yes/no/depends] because [reasoning].
3. [Boundary test 3]: We say [yes/no/depends] because [reasoning].
Trade-offs we accept:
[State 2–3 legitimate uses your policy will block, and why.]
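A draft can be sanity-checked against the shape of this brief with a few lines of Python. The sketch below is not part of the assignment. `check_policy` and `VAGUE_TERMS` are names invented here, the example drafts are hypothetical, and the vagueness test is a crude word match that prompts a second look rather than delivering a verdict.

```python
VAGUE_TERMS = {"illegal", "harmful"}  # the two vague categories the brief names


def check_policy(prohibited, edge_cases, trade_offs):
    """Return a list of problems; an empty list means the draft has the right shape."""
    problems = []
    if not prohibited:
        problems.append("no prohibited uses listed")
    for use in prohibited:
        words = {word.strip(".,;:\"'()").lower() for word in use.split()}
        if words & VAGUE_TERMS:
            problems.append(f"possibly vague prohibited use: {use!r}")
    if not 2 <= len(edge_cases) <= 3:
        problems.append("expected 2-3 edge cases")
    if not 2 <= len(trade_offs) <= 3:
        problems.append("expected 2-3 trade-offs")
    return problems


# Hypothetical weak draft: one vague prohibition, one edge case, no trade-offs.
weak = check_policy(["Anything illegal"], ["a customer disputes a charge"], [])

# Hypothetical stronger draft for scenario D (bank support chatbot).
strong = check_policy(
    ["Revealing account balances to an unverified caller",
     "Approving a loan without a human underwriter"],
    ["a spouse calls about a joint account", "a customer is locked out abroad"],
    ["fast balance checks for people who fail verification",
     "instant loan decisions for low-risk applicants"],
)

print(len(weak), strong)  # 3 []
```

A passing check says only that the draft has the required parts. Whether each boundary is defensible still has to be argued case by case.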