
Grounding

The Retrieval Problem: Why Every Agent Needs Real Facts

When ChatGPT hallucinates a study that does not exist, or when Gemini confidently states a fact that was true in 2019 but is not today, the problem is not that the model is stupid. It is that the model is answering from memory alone, and memory, for a language model, is a blurry approximation of its training data.

The fix sounds simple: give the agent access to facts. But simple is not the same as working.

The Core Problem: Models Confabulate

Large language models do not know what they do not know. When asked a question outside their training window or beyond their learned patterns, they often do not say "I do not have that data." They guess. They predict the next token based on statistical likelihood, which means they can confidently produce a plausible-sounding answer that is entirely false. This is called hallucination, but the more honest name is confabulation. The model is not trying to fool you. It is just completing a pattern without access to ground truth.

Consider a real case: Replika, an AI chatbot marketed as a personal companion. Users reported that Replika would sometimes invent events from their childhood, generate false memories of shared conversations, or claim to remember things the user never said. The model was not malicious. It was doing what it was trained to do: predict what comes next, using only the patterns it had learned. For a companion app, that is devastating. A user wants a companion that knows them, not one that confidently lies.

Retrieval-Augmented Generation (RAG) as the Answer

The standard solution is called Retrieval-Augmented Generation, or RAG. The idea: before the agent answers, search a trusted document set for relevant facts. Then feed those facts into the prompt, so the model answers based on ground truth instead of statistical guessing.

Here is the flow:

  1. User asks a question. "What was the quarterly revenue for Q3 2024?"
  2. System searches the knowledge base (a database of documents, facts, or structured records). The search uses semantic similarity: both the question and the documents are converted into vectors (numerical representations of meaning), and the system returns the documents closest to the question in vector space.
  3. Retrieved facts are inserted into the prompt. Instead of asking the model to answer from memory, you ask it to answer using the provided facts: "Based on the following documents, what was the quarterly revenue for Q3 2024? {documents here}"
  4. The model generates an answer grounded in fact. Without access to external facts, the model might hallucinate. With them, it can cite sources.
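
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the knowledge base contents and revenue figures are invented, `embed` is a toy bag-of-words stand-in for a real embedding model, and the names `retrieve` and `build_prompt` are hypothetical. Step 4, the call to the model itself, is left out because it depends on whichever model API you use.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base. All figures are invented for illustration.
KNOWLEDGE_BASE = {
    "finance-q3-2024": "Quarterly revenue for Q3 2024 was 12 million dollars (invented figure).",
    "finance-q2-2024": "Quarterly revenue for Q2 2024 was 10 million dollars (invented figure).",
    "hr-leave-policy": "Employees accrue leave monthly and may carry over five days.",
}


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, kb: dict, top_k: int = 2) -> list:
    """Step 2: rank documents by similarity to the question, keep the top_k."""
    query_vector = embed(question)
    ranked = sorted(
        kb.items(),
        key=lambda item: cosine(query_vector, embed(item[1])),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, documents: list) -> str:
    """Step 3: insert the retrieved facts into the prompt, labeled by id."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in documents)
    return (
        "Based on the following documents, answer the question. "
        "Cite the document id for each claim.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    question = "What was the quarterly revenue for Q3 2024?"  # Step 1
    print(build_prompt(question, retrieve(question, KNOWLEDGE_BASE)))
```

Labeling each document with an id inside the prompt is what makes citation possible later: the model can only point to a source it was shown by name.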

This is the architecture behind many enterprise AI systems that need to stay accurate: chatbots for customer support (grounded in a company's policies), research assistants (grounded in academic papers or internal data), and compliance systems (grounded in regulations).

The Three Hard Problems RAG Creates

But RAG is not a magic fix. It trades one problem (hallucination) for three others:

Problem 1: The Knowledge Base Must Be Right

If you feed garbage into the knowledge base, the agent will confidently serve garbage back out. A RAG system is only as good as its source documents.

Consider a scenario: A hospital maintains a chatbot to answer patient questions about testing and procedures. If the knowledge base relies on a 2015 PDF of hospital policies (written before COVID and before recent protocol updates), then any patient who asks about current testing procedures will receive outdated guidance. The chatbot is not hallucinating. It is faithfully retrieving stale information from an old document. The failure is editorial, not technical. The knowledge base was built from documents that should have been retired or refreshed years ago.

This means you need a process: Who curates the documents? How often are they updated? How do you detect when a document is outdated? What happens when two documents contradict each other? These are not AI problems. They are editorial problems. You need humans.
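
Software cannot make the editorial call, but it can tell the humans where to look. As a minimal sketch, assuming each document carries a named owner and a last-reviewed date (the `SourceDocument` fields, the `find_stale` helper, and the one-year threshold are all hypothetical choices, not a standard):

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SourceDocument:
    doc_id: str
    owner: str           # the person accountable for keeping it current
    last_reviewed: date  # when a human last confirmed it is still correct


def find_stale(documents: list, today: date, max_age_days: int = 365) -> list:
    """Return documents whose last review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [doc for doc in documents if doc.last_reviewed < cutoff]


if __name__ == "__main__":
    # Invented records for illustration.
    docs = [
        SourceDocument("testing-procedures", "clinical-ops", date(2015, 6, 1)),
        SourceDocument("visiting-hours", "front-desk", date(2024, 3, 1)),
    ]
    for doc in find_stale(docs, today=date(2024, 9, 1)):
        print(f"Review needed: {doc.doc_id} (owner: {doc.owner})")
```

The output is a review queue, not a verdict. A person still decides whether the flagged document is retired, refreshed, or confirmed as correct.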

Problem 2: Retrieval Must Find the Right Documents

Semantic search is powerful, but it is not magic. A question and a document that are semantically similar might still mismatch in crucial ways.

Example: A user asks an e-commerce chatbot, "How do I return an item?" The knowledge base contains:

  • A document on "Return Policies" (what you can return)
  • A document on "Shipping Returns" (how to mail something back)
  • A document on "Warranty Returns" (different process, different timeline)

The semantic search finds all three as relevant. But the user might need only one. Or two. Or all three in a specific order. If the retrieval system ranks them wrong, or returns only the top hit, the agent's answer will be incomplete or wrong. You might retrieve the wrong document and send the user through a returns process that does not apply to them.

Worse: a query can be ambiguous. "Can I return a gift?" The system might retrieve "Return Policies" because the words match, but miss a document that explains the special case where you need the original receipt, which you do not have if the item was a gift. The retrieval system found a plausible document but missed the one that mattered.

This is why RAG systems often require tuning. You need to test: Does a search for "Can I return a gift?" surface the right documents? Does it miss any? Do you need to adjust the weighting, or split documents into smaller chunks so the retrieval is more precise?
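
One way to run that kind of check is a small harness that pairs each test query with the documents it must surface and reports what is missing. The sketch below is illustrative only: the documents and the gift policy are invented, `keyword_retriever` is a deliberately naive stand-in for a real semantic search, and `missing_documents` is a hypothetical helper name.

```python
import re

# Invented documents for illustration.
DOCS = {
    "return-policies": "Return policies: which items you can return and within how many days.",
    "shipping-returns": "Shipping returns: how to mail an item back to the warehouse.",
    "gift-receipts": "Gifts without the original receipt are refunded as store credit.",
}

# Each test query maps to the set of documents a correct retrieval must include.
TEST_CASES = {
    "Can I return a gift?": {"return-policies", "gift-receipts"},
    "How do I mail an item back?": {"shipping-returns"},
}


def tokens(text: str) -> set:
    """Lowercased word set; no stemming, so 'gift' and 'gifts' do not match."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_retriever(query: str, k: int) -> list:
    """Naive stand-in for semantic search: rank by count of shared words."""
    query_tokens = tokens(query)
    ranked = sorted(
        DOCS,
        key=lambda doc_id: len(query_tokens & tokens(DOCS[doc_id])),
        reverse=True,
    )
    return ranked[:k]


def missing_documents(retriever, cases: dict, k: int) -> dict:
    """For each query, list the expected documents the retriever failed to return."""
    report = {}
    for query, expected in cases.items():
        missing = expected - set(retriever(query, k))
        if missing:
            report[query] = missing
    return report


if __name__ == "__main__":
    print(missing_documents(keyword_retriever, TEST_CASES, k=1))
```

With `k=1` the harness reports that the gift query never surfaces `gift-receipts`, which is exactly the failure described above. Swapping in a different retriever, chunk size, or `k` and rerunning the same cases shows whether a tuning change actually helped.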

Problem 3: The Agent Must Use the Documents Correctly

Even with good documents, the agent can still fail.

The agent might:

  • Ignore the documents and hallucinate anyway. If the prompt is poorly designed, the model might treat the documents as decoration and answer from memory.
  • Misinterpret the documents. The documents might be ambiguous, and the agent might draw the wrong conclusion.
  • Contradict the documents. The agent might say something that contradicts the provided facts, either because it is confused or because its training data conflicts with the documents.
  • Synthesize wrongly. The agent might combine facts from multiple documents into a false synthesis that no single document actually claims.

Consider a scenario: A legal research system retrieves two documents that are relevant to a query but belong to different jurisdictions with opposite legal rules. A poorly designed prompt might ask the model to "synthesize a general rule from the documents." The model, trying to find a middle ground or unify the sources, might invent a rule that neither jurisdiction actually has, one that blends opposing precedents into a false legal principle. The system retrieved real documents, but the conclusion is false. The failure happened not in retrieval, but in how the agent was instructed to use the retrieved facts.
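
A prompt can be built to discourage that kind of blending. The sketch below groups retrieved documents by jurisdiction and tells the model to keep them apart. It assumes each document arrives as a dict with `id`, `jurisdiction`, and `text` keys; that shape, the function name `build_grounded_prompt`, and the instruction wording are all hypothetical, and no prompt guarantees the model will comply.

```python
from collections import defaultdict

INSTRUCTIONS = (
    "Answer using only the documents below. Cite the document id for every claim. "
    "Documents are grouped by jurisdiction. Do not merge rules from different "
    "jurisdictions into one general rule; report each jurisdiction separately. "
    "If the documents conflict, say so explicitly. If they do not cover the "
    "question, say that instead of answering from memory."
)


def build_grounded_prompt(question: str, documents: list) -> str:
    """Group documents by jurisdiction and prepend instructions against blending."""
    by_jurisdiction = defaultdict(list)
    for doc in documents:
        by_jurisdiction[doc["jurisdiction"]].append(doc)

    sections = []
    for jurisdiction, docs in by_jurisdiction.items():
        lines = "\n".join(f"[{doc['id']}] {doc['text']}" for doc in docs)
        sections.append(f"Jurisdiction: {jurisdiction}\n{lines}")

    return f"{INSTRUCTIONS}\n\n" + "\n\n".join(sections) + f"\n\nQuestion: {question}"


if __name__ == "__main__":
    # Invented documents for illustration.
    retrieved = [
        {"id": "a-1", "jurisdiction": "State A", "text": "Rule X applies."},
        {"id": "b-1", "jurisdiction": "State B", "text": "Rule X does not apply."},
    ]
    print(build_grounded_prompt("Does rule X apply?", retrieved))
```

The contrast with "synthesize a general rule from the documents" is the point: the structure of the prompt tells the model that disagreement between sources is something to report, not something to smooth over.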

Building a Real Knowledge Base: The Checklist

When you build a knowledge base for an agent, you must decide:

  1. What goes in? Which documents, facts, or records are authoritative enough to ground the agent? A customer support bot might use only official company policy documents, not Slack messages or Reddit posts. A research agent might use only peer-reviewed papers. You are making an editorial decision about what counts as ground truth.

  2. How is it organized? Are documents chunked into small pieces or kept whole? Are there metadata fields (author, date, category) that help retrieval? A poorly organized knowledge base is harder to search accurately.

  3. How is it kept current? Does someone monitor for outdated information? Does the system flag documents that have not been reviewed recently? Stale knowledge kills trust.

  4. How is it tested? You write test queries and verify that retrieval returns the right documents. This is not a one-time task; it is ongoing as documents change and new edge cases emerge.

  5. How does the agent use it? Is the agent instructed to cite sources? To refuse when documents do not cover the question? To explicitly note when documents conflict? The prompt matters as much as the documents.
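
To make the organization question in item 2 concrete, here is one possible way to split a document into chunks that each keep their metadata. It is a sketch under stated assumptions: the `Chunk` fields, the paragraph-based splitting, and the 50-word limit are hypothetical choices, and real systems often split by tokens or headings instead.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str        # which source document this piece came from
    position: int      # order of the chunk within that document
    text: str
    category: str      # metadata copied onto every chunk for filtering
    last_updated: str  # ISO date string, copied from the source document


def chunk_document(doc_id: str, text: str, category: str,
                   last_updated: str, max_words: int = 50) -> list:
    """Pack whole paragraphs into chunks of roughly max_words words each."""
    chunks = []
    current = []
    word_count = 0

    def flush():
        # Emit the paragraphs collected so far as one chunk.
        chunks.append(Chunk(doc_id, len(chunks), "\n\n".join(current),
                            category, last_updated))

    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        words = len(paragraph.split())
        # Start a new chunk rather than split a paragraph in the middle.
        if current and word_count + words > max_words:
            flush()
            current, word_count = [], 0
        current.append(paragraph)
        word_count += words

    if current:
        flush()
    return chunks
```

Because every chunk carries `doc_id`, `category`, and `last_updated`, a retrieved chunk can be cited back to its source, filtered by category, and checked for staleness without a second lookup.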

These are not technical questions. They are about judgment, process, and accountability. A knowledge base is not a database dump. It is a claim that you, the builder, have done the editorial work to stand behind the facts inside it.