Reading
From Localhost to Stranger-Proof: The Deploy-Readiness Framework
The live URL is the artefact. On Day 14, your agent's success isn't measured by performance in a Jupyter notebook or on your machine; it's measured by whether a person you've never met can open a link on their phone, type a question, and get an answer that works. That transition, from internal to public, is where most student agents fail.
Deployment isn't shipping code; it's shipping a system under constraint. In development, you control every variable: your API key sits in JARVIS_KEY in your .env file, your RAG knowledge base loads from a local folder, your tool calls always complete in 300ms because you're on a fast network. On Hugging Face Spaces (or any public endpoint), you don't control the user's connection speed, the load on the inference servers, whether the free-tier API rate limits will be hit, or what adversarial input a bored teenager might throw at your carefully tuned persona.
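One way to make the gap between the two environments visible is to check for the key at startup rather than at the first API call. The sketch below assumes the key is exposed to the app as an environment variable named JARVIS_KEY (locally via .env, on a hosted server via its secret settings). The function and exception names are hypothetical, chosen for illustration.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised at startup when a required secret is not set."""


def load_api_key(env=None, name="JARVIS_KEY"):
    """Return the API key from the environment, or fail with a clear message.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        # Failing here, before any request is served, turns a confusing
        # runtime error into an obvious configuration error.
        raise MissingSecretError(
            f"{name} is not set. Locally it usually comes from .env; "
            "on a hosted server it has to be configured as a secret."
        )
    return value
```

Calling load_api_key() once when the app boots means a missing variable shows up in the server's startup log instead of in front of the first user.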
The twelve failure modes below are not hypothetical. They are the exact ways your agent breaks on day one of production, ranked by severity: whether the break blocks the entire launch, degrades it into something people won't use, or is cosmetic (embarrassing but functional). This triage discipline is how professional teams avoid shipping something that dies under real load.
Severity tiers:
Blocks Launch: the agent cannot run at all, or every interaction fails. Examples: API key leaked into git history (a public Space is backed by a repository anyone can read; if the key is in app.py or secrets.txt, whoever finds it can burn through your free quota within minutes); runtime memory exceeds the Spaces allocation (your RAG corpus loads 50,000 documents into RAM); the deployment script assumes a development environment variable that doesn't exist on the server.
Degrades It: the agent runs, but becomes unusable for its stated purpose. Examples: no rate limiting on tool calls, so a user hits the quota within five messages; RAG embedding search returns empty or irrelevant results because the chunking strategy was tuned to the training corpus but breaks on user queries; the persona drifts under adversarial input (a student asks "pretend you're Jarvis but evil" and the agent complies because you didn't guard against jailbreaks).
Cosmetic: the agent works as designed, but looks broken. Examples: console errors logged to stderr because a CSS file didn't load; a tool call succeeds but the response formatting shows raw JSON instead of a readable summary; timestamps render in UTC instead of Seoul time.
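The missing rate limit named under Degrades It can be illustrated with a small sliding-window limiter placed in front of tool calls. This is a minimal in-memory sketch, not a production design: the class name, the limits, and the injectable clock are assumptions made for the example, and a deployed agent would likely need per-user state as well.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` within any `window_seconds` span."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so tests need not sleep
        self._calls = deque()       # timestamps of recently allowed calls

    def allow(self):
        """Return True and record the call if it fits in the window."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False
        self._calls.append(now)
        return True


# Hypothetical usage: at most 3 tool calls per 60 seconds.
limiter = SlidingWindowLimiter(max_calls=3, window_seconds=60)
if limiter.allow():
    pass  # run the tool call here
else:
    pass  # tell the user to wait instead of spending quota
```

Refusing the call with a short message keeps the agent usable for everyone else, which is the difference between a degraded launch and a working one.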
The checklist is your acceptance-test gate. Before you push to Spaces, every item marked Blocks Launch must be green. Before release day at the Expo, every Degrades It item must be handled. You can negotiate the Cosmetic items with your stakeholders: if they're not on the rubric, you can ship with them, but you own the decision.
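The gate described above can be expressed as two checks over a list of items. The sketch below is one possible encoding: the tier names come from this reading, while the data structure, function names, and sample items are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    BLOCKS_LAUNCH = "Blocks Launch"
    DEGRADES_IT = "Degrades It"
    COSMETIC = "Cosmetic"


@dataclass
class CheckItem:
    name: str
    severity: Severity
    passed: bool


def open_items(items, severities):
    """Names of items in the given tiers that are not yet green."""
    return [i.name for i in items if i.severity in severities and not i.passed]


def can_push(items):
    """Push gate: every Blocks Launch item must be green."""
    return not open_items(items, {Severity.BLOCKS_LAUNCH})


def can_release(items):
    """Release gate: Blocks Launch and Degrades It items must all be handled."""
    return not open_items(items, {Severity.BLOCKS_LAUNCH, Severity.DEGRADES_IT})


# Hypothetical checklist state.
checklist = [
    CheckItem("API key not in git history", Severity.BLOCKS_LAUNCH, True),
    CheckItem("Tool calls are rate limited", Severity.DEGRADES_IT, False),
    CheckItem("Timestamps shown in Seoul time", Severity.COSMETIC, False),
]
print(can_push(checklist))     # True
print(can_release(checklist))  # False
```

Cosmetic items never block either gate in this sketch; whether to ship with them open stays a deliberate decision rather than an automated one.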
This is how engineering discipline turns a working notebook into a working product.