Model Bake-Off academic 1h 10m

W2D6Sa

Which model do you actually pay for, and can you prove the cheap one isn't good enough?

Model Budget Plan

Context

You are a production team at a studio building a 240-frame explainer trailer for an AI course. The storyboard is locked: character illustrations, data visualizations, and text overlays for key concepts. You need the images generated at scale within 48 hours for a Monday release. Your budget cap is $5,000. Four models are live: Claude 3.5 Sonnet (Anthropic), GPT-4 Vision (OpenAI), Gemini 2.0 (Google), and Llama 3.1 Vision (via Groq). Your benchmark engineer will run the same test prompt on all four models, either via live API (if free credits are available) or using official public demo outputs (clearly labeled as such). No one on your team has generated 240 images at this cost scale before.

Mission

Design one test prompt and run it unchanged on all four models. Score each model on legibility (is text readable at phone size?) and concept match (does the output match the brief?). Build a cost-vs-accuracy table showing per-image cost and total 240-image cost for each model, plus quality scores. Write and sign a model-selection memo naming the constraint (cost, quality, or timeline) that resolves the tradeoff, then recommend the one model that best satisfies that constraint, and explain why the three runners-up do not.
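
One way to make "the constraint that resolves the tradeoff" concrete is to filter the candidates by the hard limits first and only then compare quality. The sketch below is a minimal illustration, not a prescribed method. The budget cap and frame count come from the brief, and the "legibility above 3/5" gate borrows the threshold used in the Buyer / CTO example. The model names ("Model A" to "Model D"), per-image costs, and scores are hypothetical placeholders to be replaced with your own benchmark results.

```python
from dataclasses import dataclass

BUDGET_CAP = 5_000.00  # USD, from the brief
FRAMES = 240           # frames in the trailer, from the brief


@dataclass
class Candidate:
    name: str
    per_image_cost: float  # USD per image (placeholder, not real pricing)
    legibility: float      # 1-5 rubric score
    concept_match: float   # 1-5 rubric score

    @property
    def total_cost(self) -> float:
        return self.per_image_cost * FRAMES

    @property
    def quality(self) -> float:
        # Assumption: both rubric dimensions are weighted equally.
        return (self.legibility + self.concept_match) / 2


def recommend(candidates, min_legibility=3.0):
    """Drop models that break the budget or the legibility gate,
    then pick the highest quality (cheaper model wins a tie)."""
    eligible = [
        c for c in candidates
        if c.total_cost <= BUDGET_CAP and c.legibility > min_legibility
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda c: (c.quality, -c.total_cost))


# Hypothetical placeholder data; substitute measured values.
candidates = [
    Candidate("Model A", 22.50, 4.2, 4.0),  # best quality, over budget
    Candidate("Model B", 17.50, 3.5, 3.5),
    Candidate("Model C", 10.00, 2.0, 3.0),  # cheapest, fails legibility
    Candidate("Model D", 15.00, 3.2, 3.0),
]

pick = recommend(candidates)
print(pick.name, f"${pick.total_cost:,.2f}")
```

With these placeholder numbers the highest-quality model is excluded by the budget cap and the cheapest by the legibility gate, which is exactly the kind of reasoning the memo has to spell out for each runner-up.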

Finish Line

A one-page model-selection memo with a cost-vs-accuracy table that recommends one model and states the cost and quality tradeoff explicitly.

  • Model Comparison Brief

    lesson

    Structured comparison of 3 AI models on one task: quality, cost, and recommendation with arithmetic.

  • Buyer / CTO

    You own the selection decision and the timeline. Ship the trailer on budget and on time.

    • Write a model-selection memo (150–200 words) that recommends one model with explicit cost, quality score, and justified tradeoff (e.g., 'Chose Gemini 2.0 at $4,200 and 3.5 quality over Claude at $5,400 and 4.1 quality because the $1,200 savings meets the binding constraint: we can tolerate a 0.6-point quality loss for a 48-hour timeline if legibility passes >3/5').
    • In the memo, write one paragraph (3–4 sentences) explaining why you rejected the runner-up model, citing cost/quality/timeline data from the benchmark table. Name the specific weakness that disqualified it.
    • Sign the memo with your name and date. The memo must be readable on a phone and ready to hand to finance as-is.
  • Benchmark Engineer

    You run the tests. Your data drives the pick. Own the test design and the numbers.

    • Design a test prompt (identical for all four models) that evaluates image generation for the explainer-trailer use case (text overlay legibility is critical). Run the prompt on all four models via live API or publicly posted demo outputs, and label clearly in the table which source each row uses.
    • Build a quality-scoring rubric that scores each model's output on a 1–5 scale along two dimensions: (1) text legibility (captions readable at phone size) and (2) concept match (does the output match the prompt?). For each score, write one sentence explaining why (e.g., 'Llama: legibility=2 because text is blurry and hard to parse at <300px width').
    • Produce a comparison table with one row per model and these columns: Model | Per-Image Cost | Total 240-Image Cost | Legibility (1–5) | Concept Match (1–5) | Test Source (live API / demo).
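
The comparison table described above could be assembled with a small helper that checks the rubric range and the source label before formatting each row. This is only a sketch. The helper names `make_row` and `render_table` are hypothetical, and the example row uses placeholder values rather than real pricing or scores.

```python
FRAMES = 240  # from the brief
COLUMNS = [
    "Model", "Per-Image Cost", "Total 240-Image Cost",
    "Legibility (1-5)", "Concept Match (1-5)", "Test Source",
]
VALID_SOURCES = {"live API", "demo"}


def make_row(model, per_image_cost, legibility, concept_match, source):
    """Validate one model's results and format them as table cells."""
    for score in (legibility, concept_match):
        if not 1 <= score <= 5:
            raise ValueError(f"score {score} is outside the 1-5 rubric range")
    if source not in VALID_SOURCES:
        raise ValueError(f"test source must be one of {sorted(VALID_SOURCES)}")
    return [
        model,
        f"${per_image_cost:,.2f}",
        f"${per_image_cost * FRAMES:,.2f}",
        str(legibility),
        str(concept_match),
        source,
    ]


def render_table(rows):
    """Join the header and rows with the pipe separator used in the brief."""
    lines = [" | ".join(COLUMNS)]
    lines += [" | ".join(row) for row in rows]
    return "\n".join(lines)


# Placeholder row; replace with measured values for each of the four models.
example = make_row("Model A", 17.50, 4, 3, "demo")
print(render_table([example]))
```

Rejecting an unlabeled source at row-creation time keeps demo outputs from being mistaken for live API results later in the memo.
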
  • FinOps Analyst

    You own cost predictability. Spot the hidden charges and the long-tail risk.

    • Extract the official per-token or per-request cost for each model from its API docs (Claude, GPT-4, Gemini, Llama via Groq). Calculate the total 240-image batch cost using the test prompt's average token count. Show your arithmetic; every number must be verifiable.
    • Identify ONE non-obvious cost or efficiency gain that the benchmark engineer might have missed (e.g., 'Input tokens for multi-image prompts add 18% if you prompt naively; use a batch API to reduce per-image input cost by 40%' or 'Gemini's free-tier limits cap batch size at 60/day; project needs three batches across three days'). Quantify the impact: 'This changes the total project cost from $X to $Y.'
    • Co-sign the cost table in the memo: initial the per-model costs and the total batch cost to confirm they match official pricing and your calculations.
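
The FinOps arithmetic can be kept verifiable by writing it as a single function and running it twice: once with list prices, once with the adjustment you found. The sketch below assumes per-million-token pricing. The token counts and prices are hypothetical placeholders, not any vendor's actual rates, and the 40% input-cost reduction is taken from the illustrative example in the brief rather than from a verified price sheet.

```python
FRAMES = 240  # batch size from the brief


def batch_cost(input_tokens, output_tokens,
               input_price_per_mtok, output_price_per_mtok,
               n_requests=FRAMES):
    """Return (per_request_cost, total_cost) in USD.

    Prices are per million tokens; token counts are per request.
    """
    per_request = (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return per_request, per_request * n_requests


# PLACEHOLDERS: replace with the test prompt's measured token counts
# and the prices published in each provider's official docs.
IN_TOKENS, OUT_TOKENS = 1_000, 500
IN_PRICE, OUT_PRICE = 3.00, 15.00  # USD per million tokens (hypothetical)
INPUT_DISCOUNT = 0.40              # example figure from the brief

_, naive_total = batch_cost(IN_TOKENS, OUT_TOKENS, IN_PRICE, OUT_PRICE)
_, batched_total = batch_cost(
    IN_TOKENS, OUT_TOKENS, IN_PRICE * (1 - INPUT_DISCOUNT), OUT_PRICE
)

print(
    f"This changes the total project cost "
    f"from ${naive_total:.2f} to ${batched_total:.2f}."
)
```

Running the same function for both scenarios means the only thing that differs between $X and $Y is the input you changed, which makes the figures easier to co-sign.
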
  • Golden Trailer Awards

    Golden Trailer Awards

    The industry awards body for movie trailers: the exemplar bar for what a finished, professional-grade trailer looks and sounds like.

  • Runway AI Film Festival

    Runway

    The premier festival showcase of finished AI films: the "this is what pro AI filmmaking looks like" gallery. Complements the Golden Trailer Awards craft bar.