W1D4Sa
Can you pitch yourself to the group as a professional teacher — armed with one AI study tool you discovered at the conference?
The Teacher Pitch
Context
It's 2026. EdTech platforms compete on a single metric: 30-day retention of concept mastery, measured by post-test accuracy on new problems (target: 15% lift over baseline, or the tool doesn't ship). Your job: design a prompt so tight that the LLM output is the study tool itself—no teacher remix, no cleanup. Hackers ship the code (the prompt). Mentors use it on material the hacker never saw and report what broke. Judges measure whether students using your tool actually retain and transfer what they learned.
Mission
Ship a working AI study tool (a carefully designed prompt + one tested output + one revision cycle showing you diagnosed why it failed) and defend its pedagogical claim: what cognitive mechanism does it activate, and how would we prove it works on real students?
Finish Line
A 3-minute live pitch: your teacher persona, your chosen AI tool, and why it would work in a real classroom.
Deliverables
AI Study Kit
A one-page doc of three battle-tested prompts that turn any AI chatbot into a flashcard deck, a Socratic quizzer, and a devil's-advocate for one topic you have to revise.
Team Roles
Hacker
Design the prompt that produces a measurable study artefact.
- Choose a specific topic and a Bloom action verb (analyze / evaluate / create). Write a system prompt (200–300 words) that specifies: (1) what artefact type the LLM produces (e.g., a 20-question drill with worked examples; a table comparing three competing claims; a checklist of misconceptions); (2) at least three hard constraints that force production, not recognition (e.g., 'each question must require the student to defend a claim on new material' or 'flag one historical misconception and show how to spot it'); (3) one concrete example input and output so the LLM knows exactly what you want.
- Test your prompt on a real source (a textbook chapter, research paper, or primary document). Feed it the source, run it, and score the output against each of your three hard constraints, asking for example: Does the output require production? Are the examples correct? Does it build from simple to complex? If any constraint fails (e.g., the LLM generates multiple-choice instead of open-ended questions), name the failure and revise the prompt.
- Re-run the prompt on a different source (same topic, different material). Document side-by-side: prompt v1 → test output v1 → failure diagnosis → prompt v2 → test output v2 → did the revision move the needle? If output quality improved, state by how much and why. If it didn't, state what you'd try next.
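The "production, not recognition" check in the steps above can be partly automated before a human reads the output. The following is a minimal sketch, assuming the LLM output is plain text; the pattern names and the helper functions `find_recognition_items` and `requires_production` are hypothetical, and the regexes are rough heuristics rather than a validated classifier. It only flags obvious multiple-choice, true/false, and fill-in-the-blank formats, so a clean result still needs a manual read.

```python
import re

# Heuristic patterns for recognition-style items (illustrative assumptions,
# not part of the brief): lettered options, true/false stems, blanks.
RECOGNITION_PATTERNS = {
    "multiple_choice": re.compile(r"^\s*\(?[A-Da-d][).]\s+\S", re.MULTILINE),
    "true_false": re.compile(r"\btrue\s*(?:/|or)\s*false\b", re.IGNORECASE),
    "fill_in_blank": re.compile(r"_{3,}"),
}


def find_recognition_items(output: str) -> dict[str, int]:
    """Count matches of each recognition-style pattern in an LLM output."""
    counts = {}
    for name, pattern in RECOGNITION_PATTERNS.items():
        hits = len(pattern.findall(output))
        if hits:
            counts[name] = hits
    return counts


def requires_production(output: str) -> bool:
    """True when no recognition-style format was detected."""
    return not find_recognition_items(output)


# Hypothetical v1 and v2 outputs for a side-by-side revision log.
output_v1 = (
    "1. Which law applies?\n"
    "A) Ohm's law\n"
    "B) Hooke's law\n"
    "C) Boyle's law\n"
    "2. True or false: current is conserved."
)
output_v2 = (
    "1. Defend or refute the claim that resistance always rises with "
    "temperature, using the new data set.\n"
    "2. Compare the two sources and explain which one misapplies the law."
)

for label, text in [("v1", output_v1), ("v2", output_v2)]:
    print(label, find_recognition_items(text), requires_production(text))
```

Recording the returned counts next to each prompt version gives the failure diagnosis a concrete number to cite, though the count says nothing about whether the open-ended questions are any good.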
Mentor
Test whether the prompt works on material the hacker never saw.
- Read the hacker's system prompt and their chosen artefact type (e.g., 'a drill with worked examples'). Pick a fresh source on the same topic—one the hacker did NOT use. Run the prompt on your source and inspect the output. Score it on two criteria: (1) Bloom level—does the output require students to apply/analyze/evaluate/create? If the output is multiple-choice, true/false, or fact lists, stop and note 'This output is recognition, not production. The prompt failed.' (2) Misconception detection—does the output help students spot or correct a common error? Quote one specific line from the output that shows it either does or doesn't meet this bar.
- Write a one-paragraph critique naming the STRUCTURAL failure you found (not surface-level feedback). Example: 'Your prompt asks for synthesis, but the LLM generated fill-in-the-blank questions because you didn't constrain the output format tightly enough. Reframe to: "Generate only open-ended questions that require comparing two conflicting sources."' If the prompt succeeded on fresh material, name the ONE decision that made it work (e.g., 'You included a misconception scaffold, which forced students to debug instead of recall').
Judge
Award the tool that moves retention metrics, and name why.
- Synthesize mentor feedback and score each tool on three dimensions using the rubric. (1) Prompt Precision: Did the hacker name a Bloom goal, include hard constraints, and provide an example? (2) Artefact Quality: Does the output require production on fresh material? Did the mentor spot misconceptions baked in, or did the output force students to think? (3) Iteration Evidence: Did the hacker diagnose a structural failure and revise the constraint itself, or just tweak the surface? Assign points and a band (bronze/merit/distinction) for each dimension.
- In public, name the winning tool and explain in three sentences: (1) what the hacker's key design decision was, (2) why that decision moved the tool from 'competent' to 'enviable' (cite the Bloom level, misconception detection, or constraint type), (3) what the mentor feedback revealed that proved it works on fresh material. Example: 'This hacker locked the constraint to "students must defend a claim on new data." The mentor tested on an unfamiliar case study and got output that forced comparison, not recall. That's the win.'
- Propose ONE concrete downstream test: (1) Measure post-test accuracy on fresh problems (not the ones in the tool) before and after students use this tool for one week. (2) Specify the problem type (same domain as the tool output, but different content). (3) Name the cognitive mechanism the test detects (spacing / retrieval practice / elaboration / desirable difficulty). Example: 'Measure error patterns on new trade-off scenarios: Do students using this tool spot misconceptions faster than a control group? That detects elaboration and misconception-correction.'
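The downstream test above comes down to comparing post-test accuracy on fresh problems against the 15% lift target from the Context section. Here is a minimal sketch of that arithmetic, assuming the target means relative lift over the baseline mean (the brief does not say whether it is relative or absolute) and that scores are per-student accuracies between 0 and 1. The function names `relative_lift` and `meets_target` are hypothetical.

```python
from statistics import mean

# Assumption: "15% lift over baseline" is read as a relative lift.
TARGET_LIFT = 0.15


def relative_lift(baseline_scores, tool_scores):
    """Relative change in mean post-test accuracy versus the baseline group."""
    base = mean(baseline_scores)
    if base == 0:
        raise ValueError("baseline mean is zero; relative lift is undefined")
    return (mean(tool_scores) - base) / base


def meets_target(baseline_scores, tool_scores, target=TARGET_LIFT):
    """True when the measured lift reaches the target."""
    return relative_lift(baseline_scores, tool_scores) >= target


# Hypothetical per-student accuracies on fresh problems.
baseline = [0.5, 0.6, 0.7]   # mean 0.6
with_tool = [0.7, 0.7, 0.7]  # mean 0.7

print(round(relative_lift(baseline, with_tool), 3))
print(meets_target(baseline, with_tool))
```

With groups this small the lift is only descriptive; a real study would also need enough students and a significance test before claiming the tool works.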
Exemplars
- Anki - powerful, intelligent flashcards
AnkiWeb
Gold-standard personal mastery system: active recall + spaced repetition. Validity proven by repeated solo testing, not by looking things up, which is the capstone's whole logic.