CASE FILE Model Selection §3/6

Why Inference Costs Per Token

Most AI models that generate text run on a GPU or TPU, specialized hardware that performs huge numbers of small operations in parallel. When you send a prompt, the model doesn't produce its answer all at once. It works token by token: it breaks text into chunks (usually 4–5 characters), calculates probability scores for what comes next, picks a token, and repeats.

This matters for cost because each token is one unit of work on the GPU. Whether your prompt is 100 tokens or 1,000 tokens, the cost you are billed scales linearly: more tokens = more cycles = more electricity, more GPU memory, more time the machine is occupied.
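The linear relationship can be sketched in a few lines of Python. This is a minimal illustration, not a provider's billing formula: the function name token_cost and the rate of $1.00 per million tokens are assumptions chosen only to make the arithmetic visible.

```python
def token_cost(tokens: int, rate_per_million: float) -> float:
    """Dollar cost of `tokens` tokens at a flat rate per million tokens."""
    return tokens * rate_per_million / 1_000_000


# Hypothetical rate for illustration only; not a real price.
RATE_PER_MILLION = 1.00

small = token_cost(100, RATE_PER_MILLION)
large = token_cost(1_000, RATE_PER_MILLION)

# Ten times the tokens gives ten times the cost.
print(f"100 tokens:   ${small:.6f}")
print(f"1,000 tokens: ${large:.6f}")
print(f"ratio: {large / small:.0f}x")
```

Whatever the actual rate is, the ratio between the two bills stays the same as the ratio between the two token counts.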

Compare two scenarios. Suppose you run a creative writing task in Claude 3.5 Sonnet versus Llama 70B (open-source). Sonnet will finish your 500-token output in a few seconds. Llama 70B, running on less powerful hardware, takes much longer. But the arithmetic has the same shape in both cases: each model charges a fixed rate per output token, no matter how quickly those tokens arrive. The speed difference doesn't change per-token pricing; it reflects hardware choice and inference optimization, not a hidden discount.

Now scale to a classroom. Imagine 30 students each running a daily AI session that outputs 200 tokens. That's 6,000 tokens per day. Over 20 school days, that's 120,000 tokens. If your budget constraint is tight, you have three levers:

  1. Model choice: Pick a smaller, cheaper model (like GPT-4o mini) if its accuracy is good enough for your task. Save money per token.

  2. Prompt efficiency: Reduce output tokens by asking for concise answers. Fewer tokens = lower bill.

  3. Batch offline work: Use cheaper batch APIs that process requests overnight. Trade speed for cost reduction (discount rate varies by provider).
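The classroom arithmetic and the three levers can be sketched as follows. Every rate, the shorter 120-token answer length, and the 50% batch discount are hypothetical placeholders, not real prices; substitute your provider's published numbers. Like the example above, the sketch counts output tokens only.

```python
def monthly_tokens(students: int, tokens_per_session: int, days: int) -> int:
    """Total output tokens for one daily session per student."""
    return students * tokens_per_session * days


def monthly_cost(tokens: int, rate_per_million: float, discount: float = 0.0) -> float:
    """Dollar cost at a flat per-million rate, with an optional discount (0 to 1)."""
    return tokens * rate_per_million * (1 - discount) / 1_000_000


# Figures from the text: 30 students, 200 output tokens, 20 school days.
baseline_tokens = monthly_tokens(30, 200, 20)

# Hypothetical values for illustration only.
BIG_MODEL_RATE = 10.00    # dollars per million output tokens
SMALL_MODEL_RATE = 1.00   # dollars per million output tokens
CONCISE_TOKENS = 120      # shorter answers after tightening the prompt
BATCH_DISCOUNT = 0.50     # assumed batch discount; varies by provider

baseline = monthly_cost(baseline_tokens, BIG_MODEL_RATE)
lever_1 = monthly_cost(baseline_tokens, SMALL_MODEL_RATE)
lever_2 = monthly_cost(monthly_tokens(30, CONCISE_TOKENS, 20), BIG_MODEL_RATE)
lever_3 = monthly_cost(baseline_tokens, BIG_MODEL_RATE, BATCH_DISCOUNT)

print(f"tokens per month:  {baseline_tokens:,}")
print(f"baseline:          ${baseline:.2f}")
print(f"1. smaller model:  ${lever_1:.2f}")
print(f"2. concise output: ${lever_2:.2f}")
print(f"3. batch API:      ${lever_3:.2f}")
```

The levers are independent, so they can be combined: a smaller model with concise prompts run through a batch API multiplies all three savings together.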

The hard truth: open-source models on your own hardware have no per-token fee; you pay only for the machine you already own and the electricity it uses. But hosting them yourself requires DevOps skill and uptime management. Cloud-hosted models charge per token because the provider manages the hardware, scale, and fault tolerance you don't have to.

When deciding between Claude, GPT-4o, and Llama for a cohort, benchmark a single real task on all three. Record the token count and time-to-output. Then calculate total cost for your class size and session frequency. The cheapest option is worthless if it fails on the task.
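One way to organize that comparison is sketched below. The model names, token counts, timings, pass/fail flags, and rates are all placeholder values, not measurements or real prices; in practice you would fill them in from your own benchmark run. The BenchmarkResult class and both helper functions are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BenchmarkResult:
    model: str
    output_tokens: int       # tokens produced on the single real task
    seconds: float           # time-to-output for that task
    passed: bool             # did the answer meet your quality bar?
    rate_per_million: float  # dollars per million output tokens


def projected_cost(r: BenchmarkResult, students: int, sessions: int) -> float:
    """Scale one task's output tokens to the whole class and period."""
    total_tokens = r.output_tokens * students * sessions
    return total_tokens * r.rate_per_million / 1_000_000


def cheapest_passing(results: List[BenchmarkResult], students: int,
                     sessions: int) -> Optional[BenchmarkResult]:
    """Return the lowest-cost model among those that passed, or None."""
    passing = [r for r in results if r.passed]
    if not passing:
        return None
    return min(passing, key=lambda r: projected_cost(r, students, sessions))


# Placeholder results for illustration only.
results = [
    BenchmarkResult("model_a", 220, 3.1, True, 10.00),
    BenchmarkResult("model_b", 180, 2.0, False, 0.50),
    BenchmarkResult("model_c", 250, 9.4, True, 2.00),
]

best = cheapest_passing(results, students=30, sessions=20)
for r in results:
    status = "pass" if r.passed else "FAIL"
    print(f"{r.model}: ${projected_cost(r, 30, 20):.2f}, {r.seconds}s, {status}")
print("choose:", best.model if best else "none passed")
```

In this placeholder data model_b has the lowest projected cost but fails the task, so it is filtered out before costs are compared.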