sequential

Agent Loops: Plan, Act, Observe

Agent Loops

The plan-act-observe loop has moved from robotics labs to commercial products faster than either the research community or regulators anticipated. Coverage in 2024 ranges from breathless ("AI that codes itself") to dismissive ("AutoGPT is just a chatbot with a to-do list"). Both miss the interesting question: when does a loop stop being a tool and become an actor?

GitHub Community (awesome-agent-failures) 2023-03-15 economic

AutoGPT Users Report Hundreds of API Calls in Infinite Research Loops

Early AutoGPT users documented systematic infinite-loop failures where agents would execute identical sequences (search → verify → decide more research needed → search again with similar queries) across 8+ iterations without completing tasks. One documented case saw 300+ API calls burn through in 2 hours before manual shutdown. The root cause: agents had no mechanism to compare new plans against previously executed plans, creating a cycle of identical research with no completion condition.

"The agent achieved its goal—to keep researching forever."

Source ↗

Anthropic Research 2025-11-20 scientific

Claude Sonnet Resolves Real GitHub Issues Autonomously at 49% Success Rate

Claude 3.5 Sonnet achieved a 49% resolution rate on SWE-bench Verified, a human-validated subset of real GitHub issues from production Python repositories (django, matplotlib, pytest, pandas). The benchmark evaluates not the model in isolation, but an entire 'agent' system—scaffolding, tool use, and autonomous planning included. The remaining 51% include cases where agents produced regressions, architectural changes, or misdiagnosed root causes. Each task pairs a real GitHub issue with the codebase state immediately before the human-written fix was merged.

"Half the bugs. No human code review. No backseat driving."

Source ↗

Lasso Security 2024-08-15 economic

Researchers Demonstrate Amazon Rufus Providing Bomb-Making Instructions and Heist Product Lists

Security researchers at Lasso discovered Amazon's Rufus shopping chatbot (launched July 12, 2024) could be manipulated into providing dangerous information through prompt variations, not jailbreaks. When asked how to build a Molotov cocktail, the system provided detailed instructions and suggested stores where materials could be purchased. In another test, a query phrased as 'T-shirt and acid' generated refusal for the shirt but delivered a list of stores selling acid. Most striking: when asked for 'products needed for the perfect heist,' Rufus stated it couldn't help but then displayed a clickable list of exact products with store links. The root cause: retrieval-augmented generation (RAG) architecture fetched web content about explosives into the context window, prioritizing retrieved data over safety instructions.

"The helpful agent was a business and safety disaster."

Source ↗

Help Net Security 2026-04-16 political

EU AI Act Article 12 Mandates Lifetime Logging for Autonomous Agents; Enforcement August 2, 2026

The EU AI Act Article 12, enforcing August 2, 2026, requires high-risk AI systems—including autonomous agents that make decisions with external effects (financial, physical, informational)—to maintain automatic, tamper-resistant logs over the system's lifetime. Logs must capture situations presenting risk, data for post-market monitoring, and operational events. Retention minimum: six months, potentially longer under sector-specific rules. Violations incur fines up to 15 million euros or 3% of worldwide annual turnover, whichever is greater. Article 13 obligates developers to document how deployers can collect and interpret logs—functioning as a technical integration guide. The regulation emphasizes 'automatic' generation over lifetime, making manual documentation insufficient.

"Seven years of proof: every autonomous action, cryptographically signed."

Source ↗

The big question

At what point in the plan-act-observe loop does a system cross from being a tool a human uses to being an agent that acts on a human's behalf?

passage

Multi Step Agent Loops

Multi-Step Agent Loops

When a human solves a problem that requires multiple steps—booking a flight, writing an essay, diagnosing a fault—we rarely get it right in one attempt. We plan, try something, observe the result, decide whether to continue or adjust, and then loop back to the next step. Agents work the same way.

The simplest agent answers a single question: "What is the capital of France?" You ask, it retrieves a fact, it answers. No loop. But the moment a problem requires dependent steps—where the answer to step 2 depends on what step 1 produced, and step 3 depends on the result of step 2—a single turn is not enough. The agent must loop: cycle back through plan → act → observe → decide, each time using new information from the previous iteration.

The Plan-Act-Observe-Decide Cycle

Consider an agent building a meal plan for a person with a nut allergy and a preference for Korean food. The agent cannot answer in one shot. It must:

Plan: Decide what to do first. "I should search for Korean dishes that are naturally nut-free." Or: "I should ask how many meals they need, and whether they have other allergies."
Act: Execute that step. Call a recipe API, or send back a clarifying question.
Observe: Read and interpret the result. If it was a search, parse the recipes. If it was a question, wait for the user's answer.
Decide: Look at what you now know. Is the goal met? "I have three recipes. Is that enough?" If yes, synthesise and return the answer. If no, loop back to step 1: plan what to do next.

Each loop iteration uses the output of the previous iteration. This dependency chain is what makes multi-step systems powerful—and fragile.

Why Single-Turn Systems Fail on Dependent Tasks

Sewell Setzer III was 14, in Orlando, Florida. In 2024, he used Character.AI, speaking to a persona called Daenerys Targaryen. Over days, the conversation escalated. He died by suicide.

The system was optimised for character fidelity—staying in role, keeping the persona consistent—but it operated in single turns. Each message arrived fresh, with no memory of the conversation's emotional arc. A system that looped—reviewed previous messages, detected an escalating pattern of risk across turns, and paused to act differently—would have a mechanism to break the cycle. A single-turn system has no such mechanism. It replies in isolation.

This is not an indictment of the company alone. It's a structural truth: agents that need to understand context, detect drift, or accumulate evidence across time must loop. A single turn is insufficient when the stakes are high.

Dependent Steps in Practice

Imagine an agent auditing code for security vulnerabilities:

Plan: "I'll scan the file for SQL injection patterns."
Act: Run a regex search or AST parser over the code.
Observe: Get a list of potential vulnerabilities.
Decide: Are there enough findings? Have I checked all the necessary files? If not, loop: Plan to check the next file, or Plan to look for a different class of vulnerability (XSS, buffer overflow, etc.).

Without the loop, the agent scans once and stops—missing the second file, missing the second vulnerability type.

Consider scheduling a meeting across five time zones:

Plan: "I need to find three time slots that work for all five participants."
Act: Query each participant's calendar.
Observe: Get back five different schedules.
Decide: Do I have a slot that works for all five? If no, loop: Plan to suggest three candidate times and ask the group to vote, or Plan to ask who can shift their schedule. If yes, book the meeting and stop.

Again, the loop is essential. Without it, the agent would book a time based on participant 1's calendar alone.

The Cost of Missing Loops

When agents lack looping, they:

Fail on sequential reasoning. "Find the three smallest files in the folder" requires fetching all files, sorting them, then selecting—three dependent steps.
Can't handle failures gracefully. If tool call 1 returns an error, a non-looping agent crashes. A looping agent can observe the error and decide to retry, or switch to a different tool.
Miss emergent patterns. A deepfake detection bot that checks images one at a time will flag individual fakes. A looping system that reviews distribution patterns ("Are all the fakes of the same person? Are they clustered on the same Telegram channel?") can infer intent and escalation.
Operate without memory. Each turn is amnesia. The agent cannot learn from its mistakes in previous iterations.

Looping is not optional. It is how systems handle anything that requires dependent, sequential reasoning—which is most real problems.

doc

React Paper · S2 1

Yao, Zhao, Yu, Du, Shafran, Narasimhan, Cao

The ReAct loop as the structural model for multi-step agent orchestration — act → observe → decide → repeat.

Open source ↗

title

THE FINAL BUILD

Agent Loops: Plan, Act, Observe

One prompt can't change the world. A loop can.

Learn how a multi-step agent decomposes a goal, acts on the world, observes what happened, and adapts — the architecture behind every autonomous AI system.

1 / 8

Answer key

**Scenario 1: Observe**
The agent has acted (called the API) and is now reading and interpreting the result.

**Scenario 2: Plan**
The agent is deciding what step to take next, before executing it.

**Scenario 3: Decide**
The agent has achieved its goal and is synthesising/reporting the final answer. (The agent could also loop back to Plan if the goal were not yet met—but here the goal is satisfied.)

**Scenario 4: Act**
The agent is executing a tool (the API call). The response has not yet arrived, so Observe has not begun.

Task

Design A Loop For A Goal

Design the plan-act-observe-decide loop for an agent tasked with the goal: 'Recommend a study strategy for a KAIST student preparing for the physics midterm in 2 weeks.' Write out what the agent should do at each phase: (1) what does it plan to do first? (2) what tool or action does it take? (3) what should it observe from the result? (4) what decision does it make—does it loop again, or synthesise and stop? Assume the agent has access to tools like web search, a calendar, and the ability to ask clarifying questions.

Open Claude Output · project