Can you make an AI agent loop until it actually finishes a real job — and prove it on a transcript?

Working Agent Loop

Context

You're the ops team for an AI agent deployment at a logistics company. Today you're building the prototype core: a 5–8 step agent loop that will eventually power the company's 47-step order-to-delivery pipeline. Each step must validate its input, report what it's doing (plan), do it (act—real or simulated), check the result (observe), and hand off a clean payload to the next step. When a step fails, the loop applies a pre-designed recovery rule: retry with backoff, skip this step and continue, or escalate to a human. Your 5–8 step build today is the pattern; it scales to 47.

Mission

Design, code, and run a 5–8 step agent loop (Python or pseudocode) that orchestrates dependent sub-agents, validates data at each junction, detects failures, and recovers per your designed rules. Ship an end-to-end run log showing each step's plan-act-observe blocks firing in order, plus at least two injected failures that the loop detects and recovers from.

Finish Line

An end-to-end execution log with plan-act-observe blocks for each step, plus a test report showing at least two injected failures and recovery rules firing.

Deliverables

Working Agent Loop

A documented loop that takes one messy real-world goal, drives an AI agent through plan-act-observe turns until a stated stop condition fires, and returns a finished result you would actually use.

Team Roles

Chain Architect

You own the sequence. Design the step graph, contracts, and recovery rules—then validate that code honors them.
- Draw or write the directed graph (5–8 steps): each node = step name + input type + output type. Mark which steps are sequential vs. which *could* run in parallel (and why parallel-safe steps have independent inputs). Use box-and-arrow notation; write one sentence per arrow explaining the data flow.
- Identify 3 failure modes and write a recovery rule for each: (1) step times out → [retry? skip? escalate?], (2) step returns wrong type or None → [escalate? roll back?], (3) external dependency unavailable → [skip? retry? halt?]. Each rule must name a specific handler rather than describe an intention in prose. Example: 'Step 3 timeout → exponential backoff starting at 3 seconds, retry 2x, then skip to step 4 with empty payload.'
- Specify the data contract for each step junction using type signatures or prose schema: step N outputs `Dict[str, int | str]` or 'list of order objects {id, status, timestamp}' or '200 OK + JSON body {route: str, eta: int}'. The Builder's code must fail loudly if an input violates the contract.
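
One way the Architect's graph, contracts, and recovery rules could be written down as data is sketched below. Everything named here is an assumption for illustration: the `StepSpec` class, the five step names, the rule names in `RECOVERY_RULES`, and the helper functions are hypothetical and not part of the brief. The sketch checks that each junction's types line up and reports which steps are parallel-safe because neither depends on the other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepSpec:
    """One node in the step graph: name, input type, output type, upstream steps."""
    name: str
    input_type: type
    output_type: type
    depends_on: tuple = ()


# Hypothetical 5-step graph for illustration only.
GRAPH = [
    StepSpec("receive_order", dict, dict),
    StepSpec("check_inventory", dict, dict, ("receive_order",)),
    StepSpec("plan_route", dict, dict, ("receive_order",)),
    StepSpec("book_carrier", dict, dict, ("check_inventory", "plan_route")),
    StepSpec("notify_customer", dict, str, ("book_carrier",)),
]

# Failure mode -> named handler (names are assumptions, pick your own).
RECOVERY_RULES = {
    "timeout": "retry_with_backoff",
    "bad_type": "escalate_to_human",
    "dependency_down": "skip_step",
}


def ancestors(name, graph):
    """Return every step that `name` depends on, directly or transitively."""
    by_name = {s.name: s for s in graph}
    seen = set()
    stack = list(by_name[name].depends_on)
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(by_name[dep].depends_on)
    return seen


def parallel_safe(a, b, graph):
    """Two steps could run in parallel if neither is upstream of the other."""
    return a not in ancestors(b, graph) and b not in ancestors(a, graph)


def check_contracts(graph):
    """List every junction where the upstream output type differs from the downstream input type."""
    by_name = {s.name: s for s in graph}
    problems = []
    for step in graph:
        for dep in step.depends_on:
            if by_name[dep].output_type is not step.input_type:
                problems.append(f"{dep} -> {step.name}: type mismatch")
    return problems


print(check_contracts(GRAPH))
print(parallel_safe("check_inventory", "plan_route", GRAPH))
```

Keeping the graph as plain data means the same structure can later be compared against the Builder's execution log.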

Builder

You write the step code. Each function is a tiny agent: plan (say what you'll do), act (do it or mock it with real latency), observe (check the contract).
- Implement 3–4 steps as fully executable functions/classes; each must have three logically separable blocks: (1) plan (print input, preconditions, intended action), (2) act (call API/DB/file or mock it with ≥100ms simulated latency), (3) observe (assert output type matches contract; raise TypeError if it doesn't, naming step + expected type + received value). Execute without manual fixes on ≥1 real or mocked run; the execution log must show all three blocks firing per step.
- Integrate at least one async operation into one step: real HTTP request (with timeout), database query, file I/O, or a credible mock with latency and failure modes (e.g., 'timeout: 50% chance after 2s'). Pass the step's output to the Architect for a contract check.
- Wire step-to-step payload passing so that step 2's output automatically feeds into step 3's input without manual parameter binding. Removing or reordering a step requires zero changes to downstream input binding. Test by running step 1 → step 2 → step 3 in sequence with one mock data source; log the data at each transition.
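
The Builder tasks above could be approached along these lines. This is a minimal sketch under stated assumptions, not a reference implementation: `make_step`, `run_chain`, `LOG`, the step names, and the order data are all hypothetical. Each step is an async function with separate plan, act, and observe blocks, the act block mocks I/O with 100 ms of latency, and every step takes one payload and returns one payload, so removing or reordering steps does not touch any downstream binding.

```python
import asyncio

LOG = []  # execution log, one line per plan/act/observe block


def make_step(name, intent, action, expected_type, latency_s=0.1):
    """Build an async step with three separable blocks: plan, act, observe."""

    async def step(payload):
        # plan: state the input and the intended action
        LOG.append(f"[{name}] plan: {intent}, input={payload!r}")
        # act: mocked I/O with simulated latency (assumption: no real API here)
        await asyncio.sleep(latency_s)
        result = action(payload)
        LOG.append(f"[{name}] act: {result!r}")
        # observe: enforce the output contract, failing loudly on a violation
        if not isinstance(result, expected_type):
            raise TypeError(
                f"{name}: expected {expected_type.__name__}, got {result!r}"
            )
        LOG.append(f"[{name}] observe: pass")
        return result

    return step


# Hypothetical steps; each one only reads and extends the shared payload dict.
STEPS = [
    make_step("receive_order", "normalize raw order",
              lambda p: {"id": p["id"], "qty": int(p["qty"])}, dict),
    make_step("check_inventory", "confirm stock",
              lambda p: {**p, "in_stock": p["qty"] <= 10}, dict),
    make_step("plan_route", "pick a route",
              lambda p: {**p, "route": "A-B", "eta": 3}, dict),
]


async def run_chain(steps, payload, timeout_s=2.0):
    """Feed each step's output into the next step, with a per-step timeout."""
    for step in steps:
        payload = await asyncio.wait_for(step(payload), timeout=timeout_s)
    return payload


final = asyncio.run(run_chain(STEPS, {"id": "ord-1", "qty": "4"}))
print("\n".join(LOG))
print(final)
```

Because the runner only iterates over a list, deleting `check_inventory` or swapping it with `plan_route` needs no other change, provided each step's input contract is still met by whatever precedes it.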

Reliability Tester

You break it on purpose and prove it recovers. Run the loop clean, inject failures, and measure the recovery.
- Execute the full 5–8 step loop from start to end with real or simulated data; log each step's execution in format '[STEP_NAME] plan: <action>, act: <result>, observe: <pass/fail>'. Run at least once without failures; output must be repeatable (same input → same output).
- Inject 2 intentional failures using flags or wrappers (e.g., `FAIL_STEP_3=timeout` or a mock that returns None). For each: document the injection method, run the loop, capture the log, identify which recovery rule fired, and verify the rule name matches the Architect's design. Log format per failure: '[STEP_N] injected: <failure type> → recovery rule: <name> → outcome: <pass/fail/escalated>. Latency: Xms.'
- Write a test report (≤1 page) in this template: (a) design summary (5–8 step graph sketch or list), (b) build summary (list 3–4 implemented steps + contract check: True/False each), (c) clean run (log lines showing plan-act-observe for all 5–8 steps, no failures), (d) failure 1 (injection + rule fired + outcome + latency), (e) failure 2 (same format), (f) verdict ('loop meets design: Y/N').
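
Failure injection and recovery could be prototyped as in the sketch below. It rests on several assumptions: the `faults` table (step name mapped to a list of faults, consumed one per attempt), the rule names `retry_with_backoff` and `escalate_to_human`, the exception classes, and the three mock steps are all hypothetical. A transient timeout is retried with exponential backoff; a None result, or a timeout that outlasts the retries, escalates and halts the loop.

```python
import time


class StepTimeout(Exception):
    """Raised by the wrapper to simulate a step that timed out."""


class Escalation(Exception):
    """Raised when a recovery rule hands the run to a human."""


def run_with_recovery(name, action, payload, faults, log,
                      retries=2, base_delay_s=0.01):
    pending = list(faults.get(name, []))  # faults to inject, one per attempt
    injected = pending[0] if pending else None
    rule, outcome, result = "none", "pass", payload
    start = time.perf_counter()
    for attempt in range(retries + 1):
        fault = pending.pop(0) if pending else None
        try:
            if fault == "timeout":
                raise StepTimeout(name)
            candidate = None if fault == "none" else action(payload)
            if not isinstance(candidate, dict):
                raise TypeError(f"{name}: expected dict, got {candidate!r}")
            result = candidate
            break
        except StepTimeout:
            if attempt < retries:
                rule = "retry_with_backoff"
                time.sleep(base_delay_s * 2 ** attempt)  # 10 ms, 20 ms, ...
            else:
                rule, outcome = "escalate_to_human", "escalated"
        except TypeError:
            rule, outcome = "escalate_to_human", "escalated"
            break
    latency_ms = (time.perf_counter() - start) * 1000
    if injected:
        log.append(f"[{name}] injected: {injected} → recovery rule: {rule} "
                   f"→ outcome: {outcome}. Latency: {latency_ms:.0f}ms.")
    if outcome == "escalated":
        raise Escalation(name)
    return result


def run_loop(steps, payload, faults):
    """Run all steps in order; return (final payload or None, failure log)."""
    log = []
    try:
        for name, action in steps:
            payload = run_with_recovery(name, action, payload, faults, log)
    except Escalation as exc:
        log.append(f"[{exc}] halted: waiting for a human")
        return None, log
    return payload, log


# Hypothetical mock steps.
STEPS = [
    ("receive_order", lambda p: {**p, "received": True}),
    ("plan_route", lambda p: {**p, "route": "A-B"}),
    ("book_carrier", lambda p: {**p, "carrier": "mock"}),
]

clean, clean_log = run_loop(STEPS, {"id": "ord-1"}, {})
retried, retry_log = run_loop(STEPS, {"id": "ord-1"}, {"plan_route": ["timeout"]})
halted, halt_log = run_loop(STEPS, {"id": "ord-1"}, {"book_carrier": ["none"]})
print(retry_log + halt_log)
```

Passing the fault table as an argument, rather than patching the steps, keeps the clean run and the injected runs on identical code paths, which makes the comparison in the test report easier to defend.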

Integration Lead (optional, 4th role)

You make sure code matches design and everything ships together.
- Trace 3–4 steps from Architect's graph into Builder's code: for each step, write 1 paragraph with step name + architect's contract (input type, output type, failure rule) + actual input and output captured in the log + True/False match. Date and sign each trace (name or GitHub handle).
- Identify any mismatch (code adds steps not in graph, contract violated in log, recovery rule has wrong name, parallel step ordered sequentially). Document each difference as '[STEP_NAME] mismatch: architect said [X], code does [Y].' Notify Architect; require one sentence in code comment explaining the change or ask for revert.
- Execute the final end-to-end run using Tester's injected failures; verify log shows all rule names match Architect's design; sign off: 'Code matches design, recovery rules fire as intended.' Hand to Tester for report.
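
The rule-name check in the sign-off step could be partly automated, as in this sketch. It assumes the failure log follows the Tester's format ('[STEP_N] injected: ... → recovery rule: ... → outcome: ...'); the `DESIGN` table, the step and rule names, and the sample log lines are invented for illustration and are not real run output.

```python
import re

# Hypothetical design table: step name -> recovery rule the Architect named.
DESIGN = {
    "plan_route": "retry_with_backoff",
    "book_carrier": "escalate_to_human",
}

# Matches the Tester's failure log format.
LINE = re.compile(
    r"\[(?P<step>\w+)\] injected: (?P<fault>\w+) "
    r"→ recovery rule: (?P<rule>\w+) → outcome: (?P<outcome>\w+)"
)


def find_mismatches(design, log_lines):
    """Compare the rule that fired in each log line with the designed rule."""
    mismatches = []
    for line in log_lines:
        match = LINE.search(line)
        if not match:
            continue  # not a failure line, e.g. a plan/act/observe entry
        step, rule = match["step"], match["rule"]
        expected = design.get(step, "no rule for this step")
        if expected != rule:
            mismatches.append(
                f"[{step}] mismatch: architect said [{expected}], code does [{rule}]"
            )
    return mismatches


# Illustrative log lines only, written by hand to exercise the checker.
SAMPLE_LOG = [
    "[plan_route] injected: timeout → recovery rule: retry_with_backoff → outcome: pass. Latency: 12ms.",
    "[book_carrier] injected: none → recovery rule: skip_step → outcome: pass. Latency: 1ms.",
]

report = find_mismatches(DESIGN, SAMPLE_LOG)
print(report or "Code matches design, recovery rules fire as intended.")
```

An empty result covers only rule names; the input and output contract traces still need a manual read of the log.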

Exemplars

Devin — the first AI software engineer
Cognition AI

Landmark deployed autonomous agent (shell + editor + browser, long-horizon planning) demoed end-to-end — the bar a JARVIS capstone showcase aims at.