Deployment: Shipping Software Live
Deploy a Chat Widget
Every piece of software in production is a bet that the failure modes are manageable. The news selectively surfaces the bets that lost. What the coverage usually omits is the infrastructure that caught nine hundred near-misses before the one that made the front page.
CrowdStrike Falcon Sensor Update Crashes 8.5M Windows Machines Globally
On July 19, 2024 at 04:09 UTC, CrowdStrike distributed a faulty content update (channel file 291) for its Falcon sensor software affecting Windows systems. The defective logic caused approximately 8.5 million Windows machines to crash into boot loops globally. Airlines grounded over 10,000 flights; hospitals diverted patients and 911 services went down in three US states; banking systems worldwide froze; Sky News went off-air. The faulty configuration passed internal testing. Microsoft estimates the affected machines represent less than 1% of global Windows installs, yet the outage inflicted tens of billions in economic damage.
"A logic flaw passed automated tests and staged QA, then broke 8.5 million machines in 79 minutes. Your tests validate your test world, not the billion devices running your code."
AWS US-East-1 Region Suffers 15-Hour Outage From DynamoDB DNS Race Condition
On October 19-20, 2025, AWS experienced a catastrophic outage in its US-EAST-1 region triggered by a race condition in DynamoDB's automated DNS management system. The Route53 DNS records for dynamodb.us-east-1.amazonaws.com ended up empty: no IP addresses, no service. Applications literally could not find DynamoDB. The DynamoDB-specific outage lasted 3 hours; cascading EC2 provisioning failures continued for an additional 12 hours. Over 17 million outage reports were filed; Snapchat, Roblox, Reddit, Venmo, Amazon, and 1,000+ services globally went dark. The root cause: a latent race condition between two redundant components, invisible in testing, revealed only under specific production load conditions.
"Two redundant systems built to prevent a single point of failure produced a race condition that created one."
Airedale NHS Foundation Trust Delays Oracle EPR Go-Live for Safety Testing
Airedale NHS Foundation Trust announced a delay to its Oracle Health electronic patient record (EPR) system go-live, pushing the date from September 2024 to November 2024 to allow for further testing and system validation. Stakeholder feedback from training sessions identified areas needing changes as part of the normal build-test-fix-retest cycle. While no specific safety concerns were cited, the trust opted to postpone rather than proceed. No public technical root cause was disclosed, but the pattern reflects a broader NHS challenge: testing environments cannot reliably predict live performance when systems interact with real clinical workflows, real patient data volume, and real organizational complexity.
"Test environments never caught the real issues, so Airedale pushed launch from September to November to find them in deeper staging."
HMRC Child Benefit Payments Halted for 500,000 Families Due to Processing Error
On June 3, 2024, HMRC's child benefit payment system failed, leaving approximately 500,000 families without their scheduled payments, roughly 30% of that day's expected transfers. HMRC stated it had 'fixed the problem' but disclosed no technical root cause and provided no detailed post-incident review. Affected families received their payments two days late on June 5, 2024. The incident exposed two gaps: no automated safeguards to catch payment processing failures before customers were affected, and no transparent post-deployment monitoring to surface silent data corruption or logic errors in financial transaction systems.
"Half a million families' benefits stopped with no warning and no disclosed cause."
When a widely-used software service fails, should the engineers responsible be legally liable, or is that liability better placed with the company that deployed it?
Reading
From Localhost to Stranger-Proof: The Deploy-Readiness Framework
The live URL is the artefact. On Day 14, your agent's success isn't measured by performance in a Jupyter notebook or on your machine. It's measured by whether a person you've never met can open a link on their phone, type a question, and get an answer that works. That transition, from internal to public, is where most student agents fail.
Deployment isn't shipping code; it's shipping a system under constraint. In development, you control every variable: your API key sits in JARVIS_KEY in your .env file, your RAG knowledge base loads from a local folder, your tool calls always complete in 300ms because you're on a fast network. On Hugging Face Spaces (or any public endpoint), you don't control the user's connection speed, the load on the inference servers, whether the free-tier API rate limits will be hit, or what adversarial input a bored teenager might throw at your carefully tuned persona.
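One way to make the missing-variable case visible is to read the key once at startup and stop with a clear message if it is absent. The sketch below assumes the key is exposed to the app as an environment variable named JARVIS_KEY, both locally (via .env) and on the host; the function name `load_api_key` and the `MissingSecretError` class are hypothetical, not part of any course starter code.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised at startup when a required secret is not configured."""


def load_api_key(name="JARVIS_KEY", environ=None):
    """Return the secret from the environment, or fail fast with a clear message.

    `environ` defaults to os.environ; passing a plain dict makes this testable.
    """
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        # Failing here, before any request is served, turns a confusing
        # runtime error into an obvious configuration error.
        raise MissingSecretError(
            f"{name} is not set. Locally, define it in .env; on the host, "
            "add it as a secret in the deployment settings."
        )
    return value
```

Calling `load_api_key()` as the first line of your app means a misconfigured deploy fails in the build log, where you will see it, rather than in a stranger's first chat message.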
The twelve failure modes below are not hypothetical. They are the exact ways your agent breaks on day one of production, ranked by severity: whether the break blocks the entire launch, degrades it into something people won't use, or is cosmetic (embarrassing but functional). This triage discipline is how professional teams avoid shipping something that dies under real load.
Severity tiers:
Blocks Launch: the agent cannot run at all, or every interaction fails. Examples: API key leaked into git history (Hugging Face Spaces clones your public repo; if the key is in app.py or secrets.txt, an attacker can exhaust your free quota within minutes); runtime memory exceeds the Spaces allocation (your RAG corpus loads 50,000 documents into RAM); the deployment script assumes a development environment variable that doesn't exist on the server.
Degrades It: the agent runs, but becomes unusable for its stated purpose. Examples: no rate limiting on tool calls, so a user hits the quota within five messages; RAG embedding search returns empty or irrelevant results because the chunking strategy was tuned to the training corpus but breaks on user queries; the persona drifts under adversarial input (a student asks "pretend you're Jarvis but evil" and the agent complies because you didn't guard against jailbreaks).
Cosmetic: the agent works as designed, but looks broken. Examples: console errors logged to stderr because a CSS file didn't load; a tool call succeeds but the response formatting shows raw JSON instead of a readable summary; timestamps render in UTC instead of Seoul time.
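The "no rate limiting on tool calls" example can be guarded with a small limiter placed in front of each tool call. This is a minimal sliding-window sketch, not a production limiter: the class name `SlidingWindowLimiter` and the numbers in the usage note are assumptions, and it tracks one shared budget in memory rather than a per-user budget.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` within any `window_seconds` span."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock      # injectable so tests need not sleep
        self._calls = deque()    # timestamps of calls that were allowed

    def allow(self):
        """Return True and record the call if there is budget left."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False
        self._calls.append(now)
        return True
```

For example, `limiter = SlidingWindowLimiter(max_calls=10, window_seconds=60)` followed by `if not limiter.allow(): ...` lets the agent reply with a polite "try again in a minute" instead of burning the quota.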
The checklist is your acceptance-test gate. Before you push to Spaces, every item marked Blocks Launch must be green. Before release day at the Expo, every Degrades It item must be handled. The Cosmetic items you can negotiate with your stakeholders: if they're not on the rubric, you can ship with them, but you own the decision.
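The gate rule above can be written down as a few lines of code, which makes it harder to argue with on launch day. The sketch below is one possible encoding; the names `Severity`, `CheckItem`, and `blockers`, and the stage labels "push" and "expo", are illustrative choices rather than anything prescribed by the course.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    BLOCKS_LAUNCH = "Blocks Launch"
    DEGRADES_IT = "Degrades It"
    COSMETIC = "Cosmetic"


@dataclass
class CheckItem:
    name: str
    severity: Severity
    passed: bool


def blockers(items, stage):
    """Return names of failing items that must be fixed before `stage`.

    stage "push": every Blocks Launch item must pass.
    stage "expo": every Blocks Launch and Degrades It item must pass.
    Cosmetic items never block; they are a negotiated decision.
    """
    if stage == "push":
        required = {Severity.BLOCKS_LAUNCH}
    elif stage == "expo":
        required = {Severity.BLOCKS_LAUNCH, Severity.DEGRADES_IT}
    else:
        raise ValueError(f"unknown stage: {stage!r}")
    return [i.name for i in items if i.severity in required and not i.passed]
```

An empty list from `blockers(items, "push")` means you may push to Spaces; a non-empty list is the to-do list.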
This is how engineering discipline turns a working notebook into a working product.
DOSSIER: DEPLOYMENT
Works on Your Laptop Is Not Works for the World
Your agent works perfectly when you run it locally. Then you ship it live, and users see failures you never encountered.
By the end of this dossier you will understand the full surface between local and live: staging, error handling, monitoring, and rollback. You will know what breaks when you move code from your machine to the internet, and how to catch those breaks before they hit users.
**Answer Key & Justification**
1. **Blocks Launch**: The key is public. Attacker can use it immediately; the agent cannot run because the API is rate-limited and the quota is exhausted within hours (or less).
2. **Degrades It**: The agent can still run and respond to other queries, but this interaction fails silently (agent says "I don't have that information") or returns irrelevant results. The agent degrades for a subset of queries, not all.
3. **Blocks Launch**: After quota exhaustion, every subsequent API call fails immediately. Users cannot interact with the agent at all after the 16th query in a session. The system is dead until quota resets.
4. **Degrades It**: The agent handles the timeout (e.g., "I couldn't fetch that data right now"), but the interaction is broken and the user cannot get the answer. The agent can still respond to non-tool-use queries, so it partially works.
5. **Degrades It**: The agent has violated its core contract (safe, professional responses). It "works" technically, but is not usable for its stated purpose. The agent can be fixed mid-deployment by strengthening the persona prompt.
6. **Blocks Launch**: The app crashes on startup. No user can reach the agent because the inference process fails before it's ready to accept queries.
7. **Cosmetic**: The timestamps are wrong, but the agent's core function (responding to queries) is not broken. Users see confusing timestamps but still get answers. Fixable in a 2-minute redeploy.
8. **Degrades It**: The agent is spreading misinformation (a caveat was omitted). It technically "works," but its answers are unreliable. The agent degrades user trust in the system, even though it doesn't crash.
9. **Blocks Launch**: The app cannot start because RAM is exhausted. No user can interact; the deployment is dead.
10. **Degrades It**: The agent can handle other queries (and tool calls that return expected JSON), but this specific API call breaks. The interaction fails, but the agent can recover and respond to the next query.
11. **Degrades It**: The persona is inconsistent, which breaks the user's trust and makes the agent feel unpredictable. It still technically works, but for an agent whose brand is professionalism, this is a core failure.
12. **Cosmetic**: The agent's functionality is not broken; it's an aesthetics/UX issue. Developers debugging the app will see the spam and think something is wrong, but users will see correct responses.
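Answers 4 and 10 both describe a tool call that fails while the agent itself stays up. A hedged sketch of that behaviour is below: it wraps a tool in a timeout and a JSON shape check and returns a fallback message instead of raising. The wrapper name `call_tool_safely`, the `"result"` key, and the return shape are assumptions for illustration; the fallback wording is the one quoted in answer 4. Note that the worker thread is abandoned on timeout, not cancelled.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

FALLBACK = "I couldn't fetch that data right now."


def call_tool_safely(tool, timeout_s, required_keys=("result",)):
    """Run `tool()` (expected to return a JSON string) without crashing the agent.

    Returns {"ok": bool, "data": dict or None, "message": str}.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(tool)
    try:
        raw = future.result(timeout=timeout_s)
        data = json.loads(raw)
    except FutureTimeout:
        # The tool took too long; answer the user now rather than hanging.
        return {"ok": False, "data": None, "message": FALLBACK}
    except (ValueError, TypeError):
        # json.JSONDecodeError is a ValueError; TypeError covers non-string output.
        return {"ok": False, "data": None, "message": FALLBACK}
    finally:
        # Do not block on a stuck worker; it finishes (or not) in the background.
        pool.shutdown(wait=False)
    if not isinstance(data, dict) or any(k not in data for k in required_keys):
        # Valid JSON, but not the shape the agent expects.
        return {"ok": False, "data": None, "message": FALLBACK}
    return {"ok": True, "data": data, "message": ""}
```

Because the wrapper always returns a dict, the chat loop can check `ok` and move on to the next query, which is exactly the "Degrades It" behaviour rather than a crash.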
Readiness Audit
Task: Build Your Deploy-Readiness Audit Rubric
You now understand the twelve failure modes and how to triage them. Your job is to build the actual rubric your team will use before launching JARVIS at the Expo.
Your deliverable:
A deploy-readiness checklist (markdown table or bulleted list) with these columns:
- Failure Mode: The specific thing that could break (e.g., "Leaked API key").
- How to Test It: The concrete step(s) a teammate would take to verify this doesn't happen (e.g., "Run `git log -S OPENAI_KEY` to search git history for the key; result must be empty.").
- Severity: Your verdict: Blocks Launch / Degrades It / Cosmetic.
- Pass/Fail: A checkbox for launch-day signoff (✓ PASS / ✗ FAIL).
- Owner: Who is responsible for testing this before launch (e.g., "Backend team" / "Ops" / "Entire team").
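If your team prefers to keep the rubric as data and generate the markdown table from it, a small renderer is enough. This is only a sketch of one way to lay out the five columns above; `RubricRow` and `render_rubric` are hypothetical names, and the example row reuses the sample values from the column descriptions.

```python
from dataclasses import dataclass


@dataclass
class RubricRow:
    failure_mode: str
    how_to_test: str
    severity: str   # "Blocks Launch", "Degrades It", or "Cosmetic"
    passed: bool
    owner: str


def render_rubric(rows):
    """Render the rubric as a markdown table, one line per failure mode."""
    lines = [
        "| Failure Mode | How to Test It | Severity | Pass/Fail | Owner |",
        "| --- | --- | --- | --- | --- |",
    ]
    for r in rows:
        status = "PASS" if r.passed else "FAIL"
        # Escape pipes so a shell command containing "|" doesn't break the table.
        test = r.how_to_test.replace("|", "\\|")
        lines.append(
            f"| {r.failure_mode} | {test} | {r.severity} | {status} | {r.owner} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    example = RubricRow(
        failure_mode="Leaked API key",
        how_to_test="Run `git log -S OPENAI_KEY`; result must be empty.",
        severity="Blocks Launch",
        passed=False,
        owner="Backend team",
    )
    print(render_rubric([example]))
```

Keeping the rows in code also means the Pass/Fail column can be flipped during signoff and the table regenerated, so the version everyone votes on is the version in the repo.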
Key requirements:
- Start with the 12 modes from the exercise. Do not invent new ones.
- Make the test concrete and executable. "Check the code" is not a test; "Run this command" is.
- Tie severity back to Day 14's countdown sequence. If a step in the countdown (set secret → build → test → deploy → open URL) would fail because of this mode, mark it Blocks Launch.
- Assign ownership. Who owns each test? Your team can argue about this; the rubric is your negotiation space.
Why this matters:
Day 14 is launch day. The Expo is on Day 15. You have one shot to ship something that works in front of strangers. This rubric is the gate between "works on my machine" and "works for a real user."
Use Claude / ChatGPT to draft the rubric, but your team votes and signs off on it before your first deploy attempt. The rubric is only real when everyone commits to it.