
Choosing Models: Capability vs Cost

Models & Cost

In 2023, the question was which frontier model could do the most. In 2024, it became which model is appropriate for which task at which cost. The proliferation of models — proprietary, open-source, fine-tuned, multimodal — has made model selection a genuine engineering and business decision, not just a brand preference.

Hugging Face Blog / AI News 2024-04-18 economic

Meta Llama 3 70B Outperforms GPT-3.5 on Most Benchmarks at Zero Marginal Cost

Meta's open-source Llama 3 70B model scored above OpenAI's GPT-3.5-Turbo on MMLU, HumanEval, and GSM8K benchmarks while being freely downloadable and runnable locally. The release confirmed that open-source models had crossed the threshold where they match proprietary models from two years prior, undermining the pricing power of API-based competitors.

"The frontier moves forward. Everything behind it becomes free."

BMJ Digital Health 2024-10-08 economic

UK NHS Pilot: GPT-4 Used for Initial Triage Notes — Replaced by Fine-Tuned Clinical Llama After Cost Analysis

An NHS pilot that began using GPT-4 to draft emergency department (ED) triage notes switched after six months to a fine-tuned open-source clinical model, reducing per-note costs by 94% with a measured 3% reduction in note quality as rated by clinicians. The pilot was described as the first documented case of a public health system moving from frontier to fine-tuned models for cost reasons.

ArXiv / MIT Technology Review 2025-01-20 economic

China's DeepSeek R1 Matches OpenAI o1 Reasoning Performance at 3% of Training Cost — Benchmarks Published

Chinese AI lab DeepSeek published technical details of a reasoning model that matched or exceeded OpenAI's o1 on mathematics and coding benchmarks at a fraction of the reported training cost. The disclosure triggered a significant drop in Nvidia's share price as analysts revised assumptions about the infrastructure spending required to maintain frontier performance.

"The trillion-dollar assumption was that you needed a trillion dollars."

European Commission Guidance 2024-11-14 legal

EU AI Act Risk Tiers Force Enterprises to Document Model Selection Rationale for High-Risk Applications

Implementing guidance for the EU AI Act's high-risk application categories now requires enterprises to document why a specific model was selected over alternatives, including reasoning about accuracy, cost, and bias characteristics. Legal counsel note that the requirement effectively makes model selection a compliance function, not just a technical one.

The big question

If a smaller, cheaper model performs adequately for a task 90% of the time, should organisations be required to use it rather than a more expensive frontier model?


Why Inference Is Priced Per Token

Almost every AI model that generates text runs on a GPU or TPU: specialized hardware that performs huge numbers of small arithmetic operations in parallel. When you send a prompt, the model does not produce its answer all at once. It works token by token: text is split into chunks (on average about 4 characters of English, or roughly three-quarters of a word), the model calculates probability scores for which token comes next, picks one, and repeats.

This matters for cost because each token is a unit of work on the GPU. Whether your prompt is 100 tokens or 1,000 tokens, the hardware cost scales roughly linearly: more tokens means more compute cycles, more electricity, more GPU memory, and more time the machine is occupied. A 1,000-token request therefore costs about ten times as much as a 100-token one.

Compare two scenarios. Suppose you run a creative writing task on Claude 3.5 Sonnet and on Llama 70B (open-source). Sonnet might finish your 500-token output in a few seconds. Llama 70B, if it runs on less powerful hardware, may take much longer. But the arithmetic has the same shape in both cases: each model bills every output token at a fixed per-token rate, so the bill is tokens multiplied by rate. The rates differ between models, but the speed difference does not change either rate. It reflects hardware choice and inference optimization, not a hidden discount.

Now scale to a classroom. Imagine 30 students each running a daily AI session that outputs 200 tokens. That's 6,000 tokens per day. Over 20 school days, that's 120,000 tokens. If your budget constraint is tight, you have three levers:

Model choice: Pick a smaller, cheaper model (like GPT-4o mini) if its accuracy is good enough for your task. Save money per token.
Prompt efficiency: Reduce output tokens by asking for concise answers. Fewer tokens = lower bill.
Batch offline work: Use cheaper batch APIs that process requests overnight. Trade speed for cost reduction (discount rate varies by provider).
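
The classroom arithmetic and the three levers can be sketched in a few lines of Python. This is a minimal illustration, not a pricing tool: the per-million-token prices, the 100-token concise answer, and the 50% batch discount are all placeholder assumptions, so substitute figures from your provider's documentation.

```python
# Classroom token budget from the text: 30 students x 200 tokens x 20 days.
STUDENTS = 30
TOKENS_PER_SESSION = 200
SCHOOL_DAYS = 20

# ASSUMED placeholder values, not real provider rates.
LARGE_MODEL_PRICE = 10.00   # USD per million output tokens (hypothetical)
SMALL_MODEL_PRICE = 1.00    # USD per million output tokens (hypothetical)
BATCH_DISCOUNT = 0.5        # hypothetical; the real discount varies by provider
CONCISE_TOKENS = 100        # hypothetical shorter answer per session


def total_output_tokens(students, tokens_per_session, days):
    """Total output tokens for the cohort over the period."""
    return students * tokens_per_session * days


def cost_usd(tokens, price_per_million_usd):
    """Linear per-token billing: tokens times rate."""
    return tokens / 1_000_000 * price_per_million_usd


baseline_tokens = total_output_tokens(STUDENTS, TOKENS_PER_SESSION, SCHOOL_DAYS)
baseline = cost_usd(baseline_tokens, LARGE_MODEL_PRICE)

# Lever 1: model choice (same tokens, cheaper rate).
lever_model = cost_usd(baseline_tokens, SMALL_MODEL_PRICE)

# Lever 2: prompt efficiency (fewer tokens, same rate).
concise_tokens = total_output_tokens(STUDENTS, CONCISE_TOKENS, SCHOOL_DAYS)
lever_prompt = cost_usd(concise_tokens, LARGE_MODEL_PRICE)

# Lever 3: batch processing (same tokens, discounted rate).
lever_batch = baseline * (1 - BATCH_DISCOUNT)

print(f"tokens per period: {baseline_tokens:,}")
print(f"baseline ${baseline:.2f} | smaller model ${lever_model:.2f} | "
      f"concise ${lever_prompt:.2f} | batch ${lever_batch:.2f}")
```

Because billing is linear, each lever acts on exactly one factor of the product: the rate, the token count, or a discount on the rate.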

The hard truth: open-source models on your own hardware have no per-token fee. You pay for the machine (which you may already own) and the electricity it uses. But hosting them yourself requires DevOps skill and uptime management. Cloud-hosted models charge per token because the provider manages the hardware, scaling, and fault tolerance so you don't have to.

When deciding between Claude, GPT-4o, and Llama for a cohort, benchmark a single real task on all three. Record the token count and time-to-output. Then calculate total cost for your class size and session frequency. The cheapest option is worthless if it fails on the task.
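
One way to organise that benchmark is sketched below, assuming you have already run the task by hand and written down the results. The model names, token counts, timings, pass/fail judgements, and prices are all invented placeholders; `BenchmarkRun`, `cohort_cost`, and `cheapest_passing` are hypothetical helper names, not part of any provider's SDK.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    model: str                    # placeholder label, not a real measurement
    output_tokens: int            # tokens produced for the single test task
    seconds: float                # time-to-output you recorded
    passed_task: bool             # your own quality judgement
    price_per_million_usd: float  # HYPOTHETICAL rate; check provider docs


def cohort_cost(run, students, sessions_per_student):
    """Scale one benchmark run up to the whole class."""
    total_tokens = run.output_tokens * students * sessions_per_student
    return total_tokens / 1_000_000 * run.price_per_million_usd


def cheapest_passing(runs, students, sessions_per_student):
    """Return the cheapest model that actually passed, or None if none did."""
    passing = [r for r in runs if r.passed_task]
    if not passing:
        return None
    return min(passing, key=lambda r: cohort_cost(r, students, sessions_per_student))


# Invented example data: three candidates, one of which fails the task.
runs = [
    BenchmarkRun("model-a", 220, 3.1, True, 15.0),
    BenchmarkRun("model-b", 200, 2.4, True, 10.0),
    BenchmarkRun("model-c", 180, 9.8, False, 0.5),
]

best = cheapest_passing(runs, students=30, sessions_per_student=20)
print(best.model if best else "no model passed the task")
```

Filtering on `passed_task` before comparing prices encodes the last sentence of the paragraph: the cheapest candidate here is excluded because it failed the task.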


DOSSIER: AI ENGINEERING

Choosing Models: Capability vs Cost

The best model isn't the biggest — it's the right one for the job.

Learn the decision tree that separates capability from overkill: what does your task actually demand, and what price are you paying for features you do not use? By the end, you will know how to justify a model choice for any real project.


Answer key

**Model selection must show**: (1) an identical or nearly identical prompt sent to all three models; (2) token counts and response times documented; (3) quality assessment in plain language (not just "good" or "bad"); (4) total token math for 40 students × 15 days × 100 tokens per review (60,000 tokens); (5) separate answers for quality choice vs. speed choice vs. cost choice; (6) a sentence acknowledging that a model that times out or produces gibberish has an effectively infinite cost per useful result, even if its per-token price is lower.

**Mark correct if**: The student has a documented benchmark (real or from a verified source), the math is sound, and they recognize the trade-off between cost and failure risk.
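
For markers, the token math in item (4) and the failure argument in item (6) can be checked with a short sketch. The function names and the example cost and success-rate values are hypothetical illustrations, not figures from any provider.

```python
import math


def total_tokens(students, days, tokens_per_review):
    """Item (4): cohort token total."""
    return students * days * tokens_per_review


def cost_per_useful_result(cost_per_run_usd, success_rate):
    """Item (6): what each usable output really costs.

    A model that never succeeds (success_rate == 0) has an
    unbounded cost per useful result, however cheap each run is.
    """
    if success_rate <= 0:
        return math.inf
    return cost_per_run_usd / success_rate


print(total_tokens(40, 15, 100))
# Hypothetical: a cheap model that always fails vs. a pricier one that works half the time.
print(cost_per_useful_result(0.01, 0.0))
print(cost_per_useful_result(0.02, 0.5))
```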

Task

Build A Model Budget Plan

Build a model budget plan for your AI course. Pick one real AI task your students will do repeatedly (e.g., generate study notes, critique an essay, brainstorm ideas). Choose a model based on the cost-vs-capability trade-off. Write a one-page plan stating: (1) the task and expected token count per run, (2) your chosen model and why, (3) the monthly cost for your cohort size (state your student count and daily frequency, or estimate them if they are not yet known), (4) one risk you identified (accuracy failure, speed timeout, overspend) and how you would catch it. Do not invent prices or token rates: link to real provider documentation or state the price as a hypothetical ("If GPT-4o costs X per million tokens, then...").
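
For part (4), one simple way to catch overspend is to project the month's bill from spending so far and compare it with the budget. The sketch below assumes spending is roughly even from day to day; the function names and all dollar figures are hypothetical.

```python
def projected_period_spend(spend_so_far_usd, days_elapsed, days_in_period):
    """Extrapolate spend linearly, assuming usage stays roughly constant."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return spend_so_far_usd / days_elapsed * days_in_period


def overspend_alert(spend_so_far_usd, days_elapsed, days_in_period, budget_usd):
    """True when the projected spend would exceed the budget."""
    projected = projected_period_spend(spend_so_far_usd, days_elapsed, days_in_period)
    return projected > budget_usd


# Hypothetical check: $3.00 spent after 5 of 20 school days, $10.00 budget.
if overspend_alert(3.00, 5, 20, budget_usd=10.00):
    print("Projected spend exceeds budget: shorten outputs or switch model.")
```

A linear projection is deliberately crude; it is enough to trigger a review early rather than discovering the overspend on the invoice.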
