Retrieval vs Generation: Grounding an Agent
Knowledge Base
The distinction between what an AI system knows and what it generates has moved from a technical footnote to a legal question. When a chatbot cites a case that doesn't exist, who is responsible? When a RAG system retrieves the wrong document and an AI acts on it, is the error in the retrieval or the generation?
New York Lawyer Fined $5,000 for Submitting ChatGPT-Hallucinated Case Citations: "Didn't Know It Could Fabricate"
A Manhattan federal judge fined lawyer Steven Schwartz $5,000 and ordered sanctions on his firm after court filings cited six non-existent cases generated by ChatGPT. In court, Schwartz stated he had not known that large language models could fabricate citations that appeared authentic. The judge called the explanation "less than credible."
"He didn't know it could lie. He didn't ask."
Air Canada Chatbot Invents Bereavement Fare Policy: Court Orders Airline to Honor It
Air Canada's customer-facing chatbot told a passenger that bereavement fares could be claimed retroactively, a policy that does not exist. When the passenger acted on the information and was denied the refund, the BC Civil Resolution Tribunal ruled that Air Canada was responsible for information provided by its chatbot and ordered the airline to pay.
"The chatbot made a promise. The company was the chatbot."
Microsoft Copilot for Healthcare Found to Retrieve Outdated Clinical Guidelines in 23% of Tested Scenarios
An independent evaluation of Microsoft Copilot deployed in three US hospital systems found that retrieval-augmented responses cited outdated clinical guidelines (superseded by newer recommendations) in 23% of clinical decision-support queries. Microsoft noted the finding related to RAG corpus update frequency, not the model itself.
EU Liability Directive Draft: AI Systems That Retrieve-Then-Generate Are "Active Systems" (Higher Liability Standard)
A draft EU Product Liability Directive proposed that AI systems combining retrieval with generation be classified as "active decision systems" subject to strict liability rather than fault-based liability, significantly raising the burden on deployers. The proposal distinguishes between pure generation (creative) and retrieval-augmented generation (advisory).
When an AI system confidently states a false fact, is the problem with the generation mechanism (it creates plausible text regardless of truth), the retrieval mechanism (it fetched the wrong source), or the deployment (someone used it for something it wasn't designed for)?
Microsoft Genai Beginners · S2 1
Load → chunk → embed → store → answer with cited sources. The canonical beginner RAG build; grounds the chunking and retrieval concepts.
Grounding
The Retrieval Problem: Why Every Agent Needs Real Facts
When ChatGPT hallucinates a study that does not exist, or when Gemini confidently states a fact that was true in 2019 but is not today, the problem is not that the model is stupid. It is that the model is answering from memory alone, and memory, for a language model, is a blurry approximation of its training data.
The fix sounds simple: give the agent access to facts. But simple is not the same as working.
The Core Problem: Models Confabulate
Large language models do not know what they do not know. When asked a question outside their training window or beyond their learned patterns, they do not say "I do not have that data." They guess. They predict the next word based on statistical likelihood, which means they will confidently produce a plausible-sounding answer that is entirely false. This is called hallucination, but the more honest name is confabulation. The model is not trying to fool you. It is just completing a pattern without access to ground truth.
Consider a real case: Replika, an AI chatbot marketed as a personal companion. Users reported that Replika would sometimes invent events from their childhood, generate false memories of shared conversations, or claim to remember things the user never said. The model was not malicious. It was doing what it was trained to do: predict what comes next, using only the patterns it had learned. For a companion app, that is devastating. A user wants a companion that knows them, not one that confidently lies.
Retrieval-Augmented Generation (RAG) as the Answer
The standard solution is called Retrieval-Augmented Generation, or RAG. The idea: before the agent answers, search a trusted document set for relevant facts. Then feed those facts into the prompt, so the model answers based on ground truth instead of statistical guessing.
Here is the flow:
- User asks a question. "What was the quarterly revenue for Q3 2024?"
- System searches the knowledge base (a database of documents, facts, or structured records). The search uses semantic similarity: both the question and the documents are converted into vectors (mathematical representations of meaning), and the documents closest to the question in vector space are returned.
- Retrieved facts are inserted into the prompt. Instead of asking the model to answer from memory, you ask it to answer using the provided facts: "Based on the following documents, what was the quarterly revenue for Q3 2024? {documents here}"
- The model generates an answer grounded in fact. Without access to external facts, the model might hallucinate. With them, it can cite sources.
This is the architecture behind many enterprise AI systems that need to stay accurate: chatbots for customer support (grounded in a company's policies), research assistants (grounded in academic papers or internal data), and compliance systems (grounded in regulations).
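The four steps above can be sketched in a few lines. This is a minimal illustration, not a production design: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the three documents and their ids are invented for the example, and the final model call is left out because it depends on whichever provider you use.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base: ids and figures are invented for illustration.
KNOWLEDGE_BASE = {
    "q3-2024-report": "Quarterly revenue for Q3 2024 was 4.2 million dollars.",
    "q2-2024-report": "Quarterly revenue for Q2 2024 was 3.9 million dollars.",
    "travel-policy": "Employees must book flights through the travel portal.",
}


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k (doc_id, text) pairs closest to the question."""
    q = embed(question)
    ranked = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda item: cosine(q, embed(item[1])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, docs: list[tuple[str, str]]) -> str:
    """Insert the retrieved facts into the prompt, tagged with source ids."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in docs)
    return (
        "Based on the following documents, answer the question and cite "
        "the document id you used.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


question = "What was the quarterly revenue for Q3 2024?"
top_docs = retrieve(question)
prompt = build_prompt(question, top_docs)
print(prompt)  # this string is what would be sent to the model
```

Note that the Q2 report ranks second for this question because it shares most of the same words. The model sees both, which is why the prompt and the retrieval ranking both matter.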
The Three Hard Problems RAG Creates
But RAG is not a magic fix. It trades one problem (hallucination) for three others:
Problem 1: The Knowledge Base Must Be Right
If you feed garbage into the knowledge base, the agent will confidently serve garbage back out. A RAG system is only as good as its source documents.
Consider a scenario: A hospital maintains a chatbot to answer patient questions about testing and procedures. If the knowledge base relies on a 2015 PDF of hospital policies (written before COVID and before recent protocol updates), then any patient who asks about current testing procedures will receive outdated guidance. The chatbot is not hallucinating. It is faithfully retrieving stale information from an old document. The failure is editorial, not technical. The knowledge base was built from documents that should have been retired or refreshed years ago.
This means you need a process: Who curates the documents? How often are they updated? How do you detect when a document is outdated? What happens when two documents contradict each other? These are not AI problems. They are editorial problems. You need humans.
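Humans make the call, but a small script can tell them where to look. The sketch below assumes each document carries an owner and a last-reviewed date; the `KBDocument` fields, the one-year review window, and the sample records are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class KBDocument:
    doc_id: str
    owner: str            # the person or team accountable for this document
    last_reviewed: date   # when a human last confirmed it is still correct


def find_stale(docs: list[KBDocument], today: date,
               max_age_days: int = 365) -> list[KBDocument]:
    """Return documents whose last review is older than the allowed window."""
    cutoff = today - timedelta(days=max_age_days)
    return [d for d in docs if d.last_reviewed < cutoff]


# Sample records, invented for the example.
docs = [
    KBDocument("testing-procedures", "clinical-ops", date(2015, 3, 1)),
    KBDocument("visiting-hours", "front-desk", date(2024, 11, 15)),
]

stale = find_stale(docs, today=date(2025, 1, 10))
for d in stale:
    print(f"Review needed: {d.doc_id} (owner: {d.owner}, "
          f"last reviewed {d.last_reviewed})")
```

A check like this only flags candidates. Deciding whether a flagged document is still correct, or which of two contradicting documents wins, remains an editorial decision.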
Problem 2: Retrieval Must Find the Right Documents
Semantic search is powerful, but it is not magic. A question and a document that are semantically similar might still mismatch in crucial ways.
Example: A user asks an e-commerce chatbot, "How do I return an item?" The knowledge base contains:
- A document on "Return Policies" (what you can return)
- A document on "Shipping Returns" (how to mail something back)
- A document on "Warranty Returns" (different process, different timeline)
The semantic search finds all three as relevant. But the user might need only one. Or two. Or all three in a specific order. If the retrieval system ranks them wrong, or returns only the top hit, the agent's answer will be incomplete or wrong. You might retrieve the wrong document and send the user through a returns process that does not apply to them.
Worse: a query can be ambiguous. "Can I return a gift?" The system might retrieve "Return Policies" because the words match, but miss a document that explains the special case where you need the original receipt, which you do not have if the item was a gift. The retrieval system returned a plausible document and missed the one that mattered.
This is why RAG systems often require tuning. You need to test: Does a search for "Can I return a gift?" surface the right documents? Does it miss any? Do you need to adjust the weighting, or split documents into smaller chunks so the retrieval is more precise?
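One way to make that testing concrete is a small regression suite: a list of queries, each paired with the documents a human has decided must come back. The sketch below uses a toy word-overlap retriever and four invented documents (including a hypothetical "gift-returns" page); in practice you would point `search` at your real retrieval system.

```python
import re

# Hypothetical documents, loosely based on the returns example above.
DOCS = {
    "return-policies": "Return policies: which items you can return and within how many days.",
    "shipping-returns": "Shipping returns: how to mail an item back with a prepaid label.",
    "warranty-returns": "Warranty returns: defective items follow a separate process and timeline.",
    "gift-returns": "Gift returns: returning a gift without the original receipt gives store credit.",
}


def tokens(text: str) -> set[str]:
    """Lowercased set of words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def search(query: str, k: int) -> list[str]:
    """Toy retriever: rank documents by how many query words they share."""
    q = tokens(query)
    ranked = sorted(DOCS, key=lambda doc_id: len(q & tokens(DOCS[doc_id])),
                    reverse=True)
    return ranked[:k]


def missed_documents(query: str, expected: set[str], k: int) -> set[str]:
    """Expected documents that did not appear in the top-k results."""
    return expected - set(search(query, k))


# Each test case: a query and the documents a human says must be retrieved.
TEST_CASES = [
    ("Can I return a gift?", {"return-policies", "gift-returns"}),
    ("How do I mail an item back?", {"shipping-returns"}),
]

for k in (1, 2):
    for query, expected in TEST_CASES:
        missed = missed_documents(query, expected, k)
        status = "ok" if not missed else f"missed {sorted(missed)}"
        print(f"k={k} {query!r}: {status}")
```

With these sample documents, returning only the top hit drops the gift-specific page for the gift query, while returning two results recovers it. Rerunning the suite whenever documents or retrieval settings change shows whether a tuning step helped or hurt.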
Problem 3: The Agent Must Use the Documents Correctly
Even with good documents, the agent can still fail.
The agent might:
- Ignore the documents and hallucinate anyway. If the prompt is poorly designed, the model might treat the documents as decoration and answer from memory.
- Misinterpret the documents. The documents might be ambiguous, and the agent might draw the wrong conclusion.
- Contradict the documents. The agent might say something that contradicts the provided facts, either because it is confused or because its training data conflicts with the documents.
- Synthesize wrongly. The agent might combine facts from multiple documents into a false synthesis that neither document actually claims.
Consider a scenario: A legal research system retrieves two documents that are relevant to a query but belong to different jurisdictions with opposite legal rules. A poorly designed prompt might ask the model to "synthesize a general rule from the documents." The model, trying to find a middle ground or unify the sources, might invent a rule that neither jurisdiction actually has, one that blends opposing precedents into a false legal principle. The system retrieved real documents, but the conclusion is false. The failure happened not in retrieval, but in how the agent was instructed to use the retrieved facts.
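A prompt can push back against each of these failure modes by stating the rules explicitly. The sketch below is one possible wording, not a guaranteed fix: the rule text, the `jurisdiction` field, and the placeholder documents are assumptions made for illustration, and a model can still ignore instructions.

```python
def build_grounded_prompt(question: str, docs: list[dict]) -> str:
    """Assemble a prompt that restricts the model to the retrieved documents."""
    if docs:
        context = "\n\n".join(
            f"[{d['id']}] (jurisdiction: {d['jurisdiction']})\n{d['text']}"
            for d in docs
        )
    else:
        # Nothing retrieved: make that visible so rule 2 applies.
        context = "(no documents were retrieved)"
    rules = "\n".join([
        "Rules:",
        "1. Answer only from the documents below and cite the [id] for every claim.",
        "2. If the documents do not cover the question, say so and stop.",
        "3. If documents conflict or come from different jurisdictions, report each",
        "   position separately. Do not merge them into a general rule.",
    ])
    return f"{rules}\n\nDocuments:\n{context}\n\nQuestion: {question}"


# Placeholder documents: ids, jurisdictions, and texts are invented.
docs = [
    {"id": "doc-a", "jurisdiction": "State A",
     "text": "Placeholder text of the State A rule."},
    {"id": "doc-b", "jurisdiction": "State B",
     "text": "Placeholder text of the State B rule."},
]

prompt = build_grounded_prompt("What notice period applies?", docs)
print(prompt)
```

Labeling each document with its jurisdiction gives the model what it needs to keep the two rules apart instead of blending them.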
Building a Real Knowledge Base: The Checklist
When you build a knowledge base for an agent, you must decide:
What goes in? Which documents, facts, or records are authoritative enough to ground the agent? A customer support bot might use only official company policy documents, not Slack messages or Reddit posts. A research agent might use only peer-reviewed papers. You are making an editorial decision about what counts as ground truth.
How is it organized? Are documents chunked into small pieces or kept whole? Are there metadata fields (author, date, category) that help retrieval? A poorly organized knowledge base is harder to search accurately.
How is it kept current? Does someone monitor for outdated information? Does the system flag documents that have not been reviewed recently? Stale knowledge kills trust.
How is it tested? You write test queries and verify that retrieval returns the right documents. This is not a one-time task; it is ongoing as documents change and new edge cases emerge.
How does the agent use it? Is the agent instructed to cite sources? To refuse when documents do not cover the question? To explicitly note when documents conflict? The prompt matters as much as the documents.
These are not technical questions. They are about judgment, process, and accountability. A knowledge base is not a database dump. It is a claim that you, the builder, have done the editorial work to stand behind the facts inside it.
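The organization question does have a mechanical side once the editorial choices are made. The sketch below shows plain fixed-size chunking with overlap, with the parent document's metadata copied onto every chunk. It is a simplified illustration: the function names, field names, and sizes are assumptions, and library splitters that break on paragraph or sentence boundaries behave differently from this character window.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def to_records(doc_id: str, category: str, last_reviewed: str, text: str,
               chunk_size: int = 200, overlap: int = 40) -> list[dict]:
    """Attach the parent document's metadata to every chunk."""
    return [
        {"doc_id": doc_id, "chunk": i, "category": category,
         "last_reviewed": last_reviewed, "text": piece}
        for i, piece in enumerate(chunk_text(text, chunk_size, overlap))
    ]


# Tiny sizes so the overlap is easy to see; real chunks are far larger.
records = to_records("return-policies", "returns", "2025-01-10",
                     "abcdefghij", chunk_size=4, overlap=2)
for r in records:
    print(r["doc_id"], r["chunk"], r["text"])
```

Smaller chunks make retrieval more precise but can cut a rule off from its exceptions; overlap reduces that risk at the cost of storing some text twice. Keeping the metadata on each chunk lets retrieval filter by category or review date.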
Langchain Rag · S2 1
RecursiveCharacterTextSplitter, chunk size/overlap, Chroma store, retrieval chain. The reference implementation for chunk-size and overlap decisions.
DOSSIER: AGENTIC THINKING
Retrieval vs Generation
Your agent can generate answers from its training data. But if the answer is wrong, you lose. If the answer is made up, you lose worse.
Learn why retrieving real sources beats generating from memory, and how to wire an agent so it grabs what it knows before it speaks. By the end, you will understand the difference between hallucination and grounding.
Design Kb
Build a Knowledge Base: Design Brief
You are designing a knowledge base for an AI agent that will run in a specific real-world context. Choose ONE scenario below, then design the knowledge base:
Scenarios (choose one)
A) Customer Support: A Refund & Returns Chatbot for an online retailer. Customers ask about returning items, getting refunds, and checking return status.
B) Internal Tool: An HR Policy Chatbot for a mid-size company. Employees ask about vacation time, parental leave, expense reimbursement, and benefits eligibility.
C) Educational: An Admissions FAQ Chatbot for a university. Applicants ask about application requirements, deadlines, scholarships, and deferral options.
D) Medical: A Symptom Screening Chatbot at a clinic. Patients answer screening questions and get guidance on whether to schedule an appointment.
Your Deliverable
Write a design brief for the knowledge base:
SCENARIO: [Your choice: A, B, C, or D]
KNOWLEDGE BASE DESIGN
1. **Authoritative source:** Where does each fact come from? (e.g., company policy doc, legal statute, doctor-approved guidelines). List 3–4 sources.
2. **What goes in:** List 5–6 key documents or topic areas that will be in the KB. Describe what each one covers.
3. **What stays out:** Name 2–3 things that will NOT be in the KB, even though users might ask about them. Explain why not.
4. **Ambiguities to resolve:** Identify 2–3 questions your knowledge base needs to answer clearly. For each one, write the exact wording that will go into the KB to avoid confusion.
5. **Retrieval risks:** Identify one query that your retrieval system might handle wrong (retrieves the wrong document, or misses a crucial one). How will you prevent it?
6. **Process:** Who owns the knowledge base? How often will it be reviewed or updated? What happens when a policy changes mid-year?
Evaluation
Your design is strong if it:
- Draws a clear line between what the agent will know and what it won't (no vague "everything relevant")
- Names the people and processes that will keep the KB accurate
- Anticipates at least one way retrieval or the agent could fail, and explains how you'd catch it
- Shows editorial judgment, not just technical implementation
Design A Knowledge Base
Design the first slice of your AI penpal's knowledge vault: the real sources it will read from when a stranger talks to it at the Seoul Expo. Name 3–4 REAL sources you would ingest (for example: the Korea Tourism Organization website, Seoul Metropolitan Government transit pages, a reputable Korean-phrase reference, or your own persona sheet), with no invented sources. For each: (1) why it earns a place in the vault; (2) one question a visitor could ask that this source lets the penpal answer with confidence; (3) one question the penpal must refuse or flag as out of scope because no loaded source covers it (for example, "what's the wifi password at this booth?"). These are the first rows of the ≥15-source vault you'll assemble in the workbench.