W3D12Sb
Can you make an AI agent loop until it actually finishes a real job, and prove it on a transcript?
Working Agent Loop
Context
You're the ops team for an AI agent deployment at a logistics company. Today you're building the prototype core: a 5–8 step agent loop that will eventually power the company's 47-step order-to-delivery pipeline. Each step must validate its input, report what it's doing (plan), do it (act, real or simulated), check the result (observe), and hand off a clean payload to the next step. When a step fails, the loop applies a pre-designed recovery rule: retry with backoff, skip this step and continue, or escalate to a human. Your 5–8 step build today is the pattern; it scales to 47.
Mission
Design, code, and run a 5–8 step agent loop (Python or pseudocode) that orchestrates dependent sub-agents, validates data at each junction, detects failures, and recovers per your designed rules. Ship an end-to-end run log showing each step's plan-act-observe blocks firing in order, plus at least two injected failures that the loop detects and recovers from.
Finish Line
An end-to-end execution log with plan-act-observe blocks for each step, plus a test report showing at least two injected failures and recovery rules firing.
Deliverables
Working Agent Loop
A documented loop that takes one messy real-world goal, drives an AI agent through act-observe-decide turns until a stated stop condition fires, and returns a finished result you would actually use.
Team Roles
Chain Architect
You own the sequence. Design the step graph, contracts, and recovery rules, then validate that the code honors them.
- Draw or write the directed graph (5–8 steps): each node = step name + input type + output type. Mark which steps are sequential and which *could* run in parallel, and explain why (parallel-safe steps have independent inputs). Use box-and-arrow notation; write one sentence per arrow explaining the data flow.
- Identify 3 failure modes and write a recovery rule for each: (1) step times out → [retry? skip? escalate?], (2) step returns wrong type or None → [escalate? roll back?], (3) external dependency unavailable → [skip? retry? halt?]. Each rule must name a concrete handler rather than describe one in prose. Example: 'Step 3 timeout → apply 3-second exponential backoff, retry 2x, then skip to step 4 with empty payload.'
- Specify the data contract for each step junction using type signatures or a prose schema: step N outputs `Dict[str, int | str]` or 'list of order objects {id, status, timestamp}' or '200 OK + JSON body {route: str, eta: int}'. The Builder's code must fail loudly if an input violates the contract.
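One way the Architect's design could be written down so that code can check it is as plain data. The sketch below is only an illustration: the step names, the `StepSpec` and `RecoveryRule` classes, and the rule names are all hypothetical, and only the timeout rule is taken from the example above (3-second exponential backoff, 2 retries, then skip).

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class StepSpec:
    """One node of the step graph: name, contract types, upstream steps."""
    name: str
    input_type: type
    output_type: type
    depends_on: Tuple[str, ...] = ()


@dataclass(frozen=True)
class RecoveryRule:
    """A named handler for one failure mode (names are illustrative)."""
    name: str
    max_retries: int
    backoff_base_s: float
    then: str  # what to do once retries run out: "skip", "escalate", "halt"


# Hypothetical 5-step graph for an order pipeline.
GRAPH: List[StepSpec] = [
    StepSpec("validate_order", dict, dict),
    StepSpec("check_inventory", dict, dict, ("validate_order",)),
    StepSpec("plan_route", dict, dict, ("validate_order",)),
    StepSpec("book_carrier", dict, dict, ("check_inventory", "plan_route")),
    StepSpec("notify_customer", dict, str, ("book_carrier",)),
]

# One rule per failure mode, keyed by the failure it handles.
RULES: Dict[str, RecoveryRule] = {
    "timeout": RecoveryRule("retry_then_skip", 2, 3.0, "skip"),
    "wrong_type": RecoveryRule("escalate_to_human", 0, 0.0, "escalate"),
    "dependency_down": RecoveryRule("retry_then_halt", 3, 1.0, "halt"),
}


def ancestors(name: str, graph: List[StepSpec]) -> Set[str]:
    """All steps that `name` depends on, directly or transitively."""
    by_name = {s.name: s for s in graph}
    seen: Set[str] = set()
    stack = list(by_name[name].depends_on)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(by_name[current].depends_on)
    return seen


def parallel_safe(a: str, b: str, graph: List[StepSpec]) -> bool:
    """Two steps could run in parallel if neither needs the other's output."""
    return a != b and a not in ancestors(b, graph) and b not in ancestors(a, graph)


def backoff_delays(rule: RecoveryRule) -> List[float]:
    """Exponential backoff: base, 2*base, 4*base, ... one delay per retry."""
    return [rule.backoff_base_s * 2 ** i for i in range(rule.max_retries)]
```

In this hypothetical graph, `check_inventory` and `plan_route` both read only the output of `validate_order`, so `parallel_safe` reports them as candidates for parallel execution, while `book_carrier` must wait for both.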
Builder
You write the step code. Each function is a tiny agent: plan (say what you'll do), act (do it or mock it with real latency), observe (check the contract).
- Implement 3–4 steps as fully executable functions/classes; each must have three logically separable blocks: (1) plan (print input, preconditions, intended action), (2) act (call API/DB/file or mock it with ≥100ms simulated latency), (3) observe (assert output type matches contract; raise TypeError if it doesn't, naming step + expected type + received value). The code must execute without manual fixes on ≥1 real or mocked run; the execution log must show all three blocks firing per step.
- Integrate at least one async operation into one step: real HTTP request (with timeout), database query, file I/O, or a credible mock with latency and failure modes (e.g., 'timeout: 50% chance after 2s'). Pass the step's output to the Architect for a contract check.
- Wire step-to-step payload passing so that step 2's output automatically feeds into step 3's input without manual parameter binding. Removing or reordering a step requires zero changes to downstream input binding. Test by running step 1 → step 2 → step 3 in sequence with one mock data source; log the data at each transition.
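The Builder's three bullets can be sketched together as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the step names, payload keys, and the `check_contract` and `run_chain` helpers are all made up for the example, and the carrier lookup is a mock that sleeps for 100 ms instead of calling a real service.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Payload = Dict[str, Any]
Step = Callable[[Payload], Awaitable[Payload]]
LOG: List[str] = []  # execution log, one line per block


def check_contract(step: str, out: Any, required: Dict[str, type]) -> Payload:
    """Observe block: fail loudly, naming step, expected type, received value."""
    if not isinstance(out, dict):
        raise TypeError(f"{step}: expected dict, received {out!r}")
    for key, typ in required.items():
        if not isinstance(out.get(key), typ):
            raise TypeError(
                f"{step}: expected {key}: {typ.__name__}, received {out.get(key)!r}"
            )
    LOG.append(f"[{step}] observe: pass")
    return out


async def validate_order(payload: Payload) -> Payload:
    LOG.append(f"[validate_order] plan: confirm order_id is present in {payload}")
    out = {**payload, "valid": bool(payload.get("order_id"))}  # act
    LOG.append(f"[validate_order] act: valid={out['valid']}")
    return check_contract("validate_order", out, {"order_id": str, "valid": bool})


async def lookup_eta(payload: Payload) -> Payload:
    LOG.append(f"[lookup_eta] plan: ask mock carrier for ETA of {payload['order_id']}")
    await asyncio.sleep(0.1)  # act: mocked async call with 100 ms latency
    out = {**payload, "eta_hours": 48}  # hard-coded mock response
    LOG.append(f"[lookup_eta] act: eta_hours={out['eta_hours']}")
    return check_contract("lookup_eta", out, {"eta_hours": int})


async def build_notice(payload: Payload) -> Payload:
    LOG.append("[build_notice] plan: format customer notice from order_id and eta_hours")
    text = f"Order {payload['order_id']} arrives in {payload['eta_hours']}h"  # act
    out = {**payload, "notice": text}
    LOG.append(f"[build_notice] act: notice={text!r}")
    return check_contract("build_notice", out, {"notice": str})


async def run_chain(steps: List[Step], payload: Payload, timeout_s: float = 2.0) -> Payload:
    """Feed each step's output into the next; no per-step parameter binding."""
    for step in steps:
        payload = await asyncio.wait_for(step(payload), timeout=timeout_s)
        LOG.append(f"[handoff] after {step.__name__}: {payload}")
    return payload


if __name__ == "__main__":
    final = asyncio.run(
        run_chain([validate_order, lookup_eta, build_notice], {"order_id": "A-17"})
    )
    print("\n".join(LOG))
    print(final["notice"])
```

Because every step takes one payload and returns one payload, the chain is just a list: removing or reordering a step means editing that list, and the handoff code does not change. A step that needs a key an earlier step no longer provides will still fail, which is the contract doing its job.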
Reliability Tester
You break it on purpose and prove it recovers. Run the loop clean, inject failures, and measure the recovery.
- Execute the full 5–8 step loop from start to end with real or simulated data; log each step's execution in the format '[STEP_NAME] plan: <action>, act: <result>, observe: <pass/fail>'. Run at least once without failures; output must be repeatable (same input → same output).
- Inject 2 intentional failures using flags or wrappers (e.g., `FAIL_STEP_3=timeout` or a mock that returns None). For each: document the injection method, run the loop, capture the log, identify which recovery rule fired, and verify the rule name matches the Architect's design. Log format per failure: '[STEP_N] injected: <failure type> → recovery rule: <name> → outcome: <pass/fail/escalated>. Latency: Xms.'
- Write a test report (≤1 page) in this template: (a) design summary (5–8 step graph sketch or list), (b) build summary (list 3–4 implemented steps + contract check: True/False each), (c) clean run (log lines showing plan-act-observe for all 5–8 steps, no failures), (d) failure 1 (injection + rule fired + outcome + latency), (e) failure 2 (same format), (f) verdict ('loop meets design: Y/N').
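A wrapper-based injection could look like the sketch below. Everything here is an assumption made for illustration: the `inject_failure` and `run_with_recovery` helpers, the rule names `retry_with_backoff` and `escalate_to_human`, and the very short backoff delay chosen to keep test runs fast. The injected failure is transient (it fires a fixed number of times), so a retry can succeed.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

Payload = Dict[str, Any]
StepFn = Callable[[Payload], Any]


def inject_failure(fn: StepFn, kind: str, times: int = 1) -> StepFn:
    """Wrap a step so its first `times` calls fail with the given kind."""
    remaining = {"n": times}

    def wrapped(payload: Payload) -> Any:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            if kind == "timeout":
                raise TimeoutError("injected timeout")
            if kind == "none":
                return None  # contract violation: step should return a dict
        return fn(payload)

    return wrapped


def run_with_recovery(
    name: str,
    fn: StepFn,
    payload: Payload,
    retries: int = 2,
    base_delay_s: float = 0.01,
) -> Tuple[Optional[Payload], str]:
    """Run one step, apply a recovery rule on failure, return (result, log line)."""
    start = time.perf_counter()
    failure, rule, outcome = "none", "none", "pass"
    result: Optional[Payload] = None
    for attempt in range(retries + 1):
        try:
            result = fn(payload)
        except TimeoutError:
            failure, rule = "timeout", "retry_with_backoff"
            result = None
            if attempt < retries:
                time.sleep(base_delay_s * 2 ** attempt)  # exponential backoff
                continue
            outcome = "fail"  # retries exhausted
            break
        if not isinstance(result, dict):
            # Wrong type or None: do not retry, hand to a human.
            failure, rule, outcome = "wrong_type", "escalate_to_human", "escalated"
            result = None
        break
    latency_ms = (time.perf_counter() - start) * 1000
    line = (
        f"[{name}] injected: {failure} → recovery rule: {rule} "
        f"→ outcome: {outcome}. Latency: {latency_ms:.0f}ms."
    )
    return result, line


def book_carrier(payload: Payload) -> Payload:
    """Hypothetical step used only to exercise the wrapper."""
    return {**payload, "carrier": "ACME"}


if __name__ == "__main__":
    for kind in ("timeout", "none"):
        _, log_line = run_with_recovery(
            "STEP_3", inject_failure(book_carrier, kind), {"order_id": "A-17"}
        )
        print(log_line)
```

The log line follows the per-failure format from the bullet above, so the Tester can paste it straight into sections (d) and (e) of the report. The latency figure will vary between runs, so a repeatability check should compare the rule name and outcome, not the timing.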
Integration Lead (optional, 4th role)
You make sure code matches design and everything ships together.
- Trace 3–4 steps from the Architect's graph into the Builder's code: for each step, write 1 paragraph with step name + Architect's contract (input type, output type, failure rule) + actual input captured in the log + True/False match. Date and sign the trace (name or GitHub handle).
- Identify any mismatch (code adds steps not in graph, contract violated in log, recovery rule has wrong name, parallel step ordered sequentially). Document each difference as '[STEP_NAME] mismatch: architect said [X], code does [Y].' Notify the Architect; require either a one-sentence code comment explaining the change or a revert.
- Execute the final end-to-end run using the Tester's injected failures; verify the log shows all rule names match the Architect's design; sign off: 'Code matches design, recovery rules fire as intended.' Hand to the Tester for the report.
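The rule-name check in the last bullet can be partly automated. The sketch below assumes the failure log lines follow the '[STEP_N] injected: ... → recovery rule: ... → outcome: ...' format from the Reliability Tester's brief; the `find_mismatches` helper, the design table, and the sample log lines are invented for illustration.

```python
import re
from typing import Dict, List, Tuple

# Matches the per-failure log format: step name, failure type, rule name.
LINE = re.compile(
    r"\[(?P<step>[^\]]+)\] injected: (?P<failure>.+?) → recovery rule: (?P<rule>\S+)"
)


def find_mismatches(
    design: Dict[Tuple[str, str], str], log_lines: List[str]
) -> List[str]:
    """Compare the rule that fired in each log line with the designed rule."""
    mismatches = []
    for line in log_lines:
        match = LINE.search(line)
        if match is None:
            continue  # not a failure-injection line
        step, failure, rule = match["step"], match["failure"], match["rule"]
        expected = design.get((step, failure))  # None if the design has no rule
        if expected != rule:
            mismatches.append(
                f"[{step}] mismatch: architect said {expected}, code does {rule}."
            )
    return mismatches


# Hypothetical design table: (step, failure type) -> rule name.
DESIGN = {
    ("STEP_3", "timeout"): "retry_with_backoff",
    ("STEP_4", "wrong_type"): "escalate_to_human",
}

# Hypothetical log excerpt; the second line uses a rule name not in the design.
SAMPLE_LOG = [
    "[STEP_3] injected: timeout → recovery rule: retry_with_backoff → outcome: pass. Latency: 12ms.",
    "[STEP_4] injected: wrong_type → recovery rule: skip_step → outcome: pass. Latency: 0ms.",
    "[STEP_1] plan: validate order, act: ok, observe: pass",
]

if __name__ == "__main__":
    for problem in find_mismatches(DESIGN, SAMPLE_LOG):
        print(problem)
```

An empty result supports the sign-off sentence; any returned line goes to the Architect in the mismatch format given above. A script like this only checks rule names, so the per-step contract trace still has to be done by reading the log.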
Exemplars
- Devin: the first AI software engineer
Cognition AI
A landmark deployed autonomous agent (shell, editor, and browser, with long-horizon planning), demoed end to end: the bar a JARVIS capstone showcase aims at.