Training creates the model. Inference uses the model. Once GPT-4 is trained, every time someone uses ChatGPT, that's inference. The weights are frozen. The model isn't learning. It's just applying learned patterns to new inputs to generate outputs.
For LLMs, inference means: you provide a prompt, the model generates tokens one by one until it hits a stop condition, you get output. On the surface it's simple. Under the hood it's complex. The model needs to compute attention (how much each token should "look at" every other token in context), apply neural network layers, generate probabilities for the next token, sample from those probabilities, and repeat. This happens for every single token generated.
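The score-sample-append loop described above can be sketched with a stand-in model. Everything here (`toy_model`, `generate`, the 5-token vocabulary) is invented for illustration; a real forward pass would run attention and feed-forward layers over the whole context.

```python
import math
import random

def toy_model(tokens):
    """Stand-in for a real LLM forward pass: returns logits (unnormalized
    scores) for the next token over a tiny 5-token vocabulary."""
    random.seed(len(tokens))  # deterministic scores for the demo
    return [random.uniform(0, 1) for _ in range(5)]

def softmax(logits):
    """Turn logits into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_tokens, stop_token=4, max_new_tokens=20):
    """The core autoregressive loop: score, sample, append, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = softmax(toy_model(tokens))      # one full pass per token
        next_token = random.choices(range(5), weights=probs)[0]
        tokens.append(next_token)
        if next_token == stop_token:            # stop condition
            break
    return tokens

out = generate([0, 1, 2])  # prompt tokens, then the sampled continuation
```

Note that `generate` calls the model once per output token; that repeated full pass is where inference cost comes from.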
Inference cost and latency depend on several factors. Model size: bigger models need more computation. Context size: more context means more attention computation. Output length: more tokens generated means more iterations. Batch size: processing multiple requests at once is more efficient than one at a time. The same prompt costs less to process as part of a batch than alone, because the fixed cost of each forward pass is amortized across requests (though an individual request may wait slightly longer for its turn).
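As a rough illustration of how these factors show up in an API bill, here is a back-of-envelope cost estimator. The per-token prices are hypothetical placeholders, not any provider's actual rates; the only real pattern is that output (decode) tokens typically cost more than input (prefill) tokens.

```python
def estimate_request_cost(prompt_tokens, output_tokens,
                          price_per_1k_input=0.003,
                          price_per_1k_output=0.015):
    """Back-of-envelope API cost. Input and output tokens are priced
    separately; output is pricier because decode is sequential.
    Prices here are illustrative, not a real provider's rates."""
    return (prompt_tokens / 1000) * price_per_1k_input \
         + (output_tokens / 1000) * price_per_1k_output

# A 2,000-token prompt with a 500-token answer:
cost = estimate_request_cost(2000, 500)  # 0.006 + 0.0075 = 0.0135
```

Doubling the output length roughly doubles the dominant term, which is why capping output length is one of the cheapest optimizations available.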
There's a distinction between prefill and decode. Prefill is the initial pass where the model processes your entire prompt. Decode is the iterative process of generating output tokens one by one. Prefill can be efficiently batched and parallelized. Decode is sequential and harder to optimize. This is why the first token from an LLM API takes noticeably longer than subsequent tokens.
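A toy latency model makes the prefill/decode split concrete. The two throughput constants below are illustrative assumptions, not measurements: prefill chews through prompt tokens quickly because it runs in parallel, while decode emits one token per step.

```python
def inference_time(prompt_len, output_len,
                   prefill_tok_per_s=5000, decode_tok_per_s=100):
    """Toy latency model. Prefill processes the whole prompt in one
    parallel pass; decode generates output tokens one at a time.
    Both rates are made-up illustrative numbers."""
    time_to_first_token = prompt_len / prefill_tok_per_s
    total = time_to_first_token + output_len / decode_tok_per_s
    return time_to_first_token, total

ttft, total = inference_time(prompt_len=2000, output_len=200)
# ttft = 0.4 s before the first token appears; total = 2.4 s overall
```

Under these assumptions the prompt contributes only 0.4 s even though it is ten times longer than the output, which is exactly the asymmetry the prefill/decode distinction captures.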
Latency is the wall-clock time for a single inference request. A typical LLM decodes on the order of 100-300 tokens per second, so generating 100 tokens takes 0.3-1 second. Add network latency, and you're looking at 1-2 seconds per request. For interactive applications, this is acceptable. For ultra-low-latency applications (trading, real-time control), it's not.
Throughput is how many requests the server can process per unit time. This is a function of batch size, model size, and hardware. A server handling 100 concurrent requests needs to batch them and process them in parallel. Not all inference platforms handle batching the same way: some collect requests into fixed batches before running them, while others use continuous batching, slotting new requests into the running batch as earlier ones finish.
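A sketch of why batching raises throughput: each batched forward pass has a fixed cost that gets amortized across the requests in the batch. The millisecond constants here are made up for illustration.

```python
def throughput(batch_size, base_step_ms=50.0, per_request_ms=2.0):
    """Toy throughput model: one batched forward pass costs a fixed
    base time plus a small per-request increment, so larger batches
    amortize the fixed cost. Constants are illustrative."""
    step_ms = base_step_ms + per_request_ms * batch_size
    return batch_size / (step_ms / 1000.0)  # requests per second

# throughput(1)  -> ~19 requests/s
# throughput(32) -> ~281 requests/s
```

Going from batch size 1 to 32 multiplies requests by 32 while the step time only roughly doubles, which is the whole economic case for batching.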
Inference hardware matters. LLMs typically run on GPUs or specialized AI accelerators (TPUs, NPUs). GPUs are the current standard. A single GPU can run inference for a mid-size model. Larger models require multiple GPUs or distributed inference across multiple machines. The cost of inference hardware is a major factor in LLM application economics.
Quantization reduces model size by reducing precision (32-bit to 8-bit or 4-bit weights). This dramatically reduces memory requirements and speeds up inference. An unquantized model might need 48GB of memory. The same quantized model might need 6GB. The tradeoff is slight accuracy degradation. For most applications, the speed and cost savings outweigh the tiny quality hit.
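The memory arithmetic behind those numbers is simple: parameter count times bytes per parameter. A 12B-parameter model (a hypothetical size chosen to match the figures above) reproduces the 48GB-to-6GB reduction.

```python
def model_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for the weights alone: parameters times
    bytes per parameter. Ignores activations and the KV cache,
    which add more on top."""
    return params_billion * 1e9 * (bits_per_weight / 8) / 1e9

fp32_gb = model_memory_gb(12, 32)  # 48.0 GB at full 32-bit precision
int4_gb = model_memory_gb(12, 4)   # 6.0 GB after 4-bit quantization
```

Dropping from 32-bit to 4-bit weights is an 8x reduction, which is why a model that needs a multi-GPU server unquantized can fit on a single consumer GPU quantized.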
Caching is a powerful optimization. If you send the same prompt to the same model multiple times, you can cache the intermediate representations after prefill and skip recomputing them. This is why Anthropic added prompt caching to Claude. It dramatically speeds up repeated inferences and reduces cost.
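The idea can be sketched as a lookup table keyed on the prompt. The `prefill` stand-in below is invented for illustration; a real cache stores the attention keys/values computed during prefill rather than a tag string.

```python
prefill_calls = 0  # counts how many real prefill passes we ran
prefill_cache = {}

def prefill(prompt):
    """Stand-in for the expensive prefill pass. A real implementation
    would return the KV cache for every prompt token; here we just
    return a tag and count the call."""
    global prefill_calls
    prefill_calls += 1
    return f"kv-state-{hash(prompt)}"

def cached_prefill(prompt):
    """Skip recomputing prefill for a prompt we've already processed."""
    if prompt not in prefill_cache:
        prefill_cache[prompt] = prefill(prompt)
    return prefill_cache[prompt]

# The second request with the same prompt reuses the cached state:
state1 = cached_prefill("You are a support agent. Answer politely.")
state2 = cached_prefill("You are a support agent. Answer politely.")
```

Production systems match on prompt prefixes rather than whole prompts, so a long shared system prompt can be cached even when each user message differs.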
Inference platforms abstract this complexity. You call an API, provide a prompt, get output. You don't care about batching, quantization, or hardware allocation. The provider handles it. But if you're running inference locally or in your own infrastructure, these details matter for cost and latency optimization.
The inference frontier keeps moving. Flash Attention restructures the attention computation to avoid materializing the full attention matrix, making it substantially faster and far more memory-efficient. Speculative decoding uses a small draft model to propose several tokens that the large model verifies in a single parallel pass. KV caching stores the attention keys and values of previous tokens so decode doesn't recompute them at every step. Mixture of Experts models activate only a fraction of their parameters per token, dramatically speeding up inference. The field innovates constantly on inference efficiency.
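A minimal sketch of the speculative decoding idea, with greedy `draft_next`/`target_next` functions standing in for real models (both names are assumptions for illustration):

```python
def speculative_step(draft_next, target_next, context, k=4):
    """One step of greedy speculative decoding. A cheap draft model
    proposes k tokens; the target model keeps the longest agreeing
    prefix and substitutes its own token at the first mismatch.
    In a real system the k checks run as one parallel target pass;
    here they are sequential for clarity."""
    # Draft phase: propose k tokens autoregressively with the cheap model.
    ctx = list(context)
    proposal = []
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # Verify phase: accept tokens while the target model agrees.
    ctx = list(context)
    accepted = []
    for t in proposal:
        want = target_next(ctx)
        if want == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(want)  # correct first disagreement, then stop
            break
    return accepted
```

When the draft agrees with the target, several tokens land for the cost of one target pass; in the worst case exactly one token lands, the same as ordinary decode.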
Why It Matters
Inference cost and latency determine whether an AI application is economically viable and acceptable to users. High inference latency makes interactive applications frustrating. High inference cost can make the business model untenable. Understanding the factors behind inference helps teams make informed choices about hardware, model selection, and architecture. For high-volume applications, optimizing inference efficiency is a primary lever for profitability. For real-time applications, inference latency is a hard constraint that determines feasibility.
Example
A customer service company evaluates deploying an AI chatbot for 10,000 concurrent users. A naive approach: run a large model on expensive hardware, handle each user sequentially. Cost: prohibitive, latency: unacceptable. A better approach: use a smaller, quantized model on multiple GPUs, batch user requests, cache common questions, use speculative decoding. The optimized approach costs 10x less and provides 5x faster responses. Understanding inference optimization turned an infeasible idea into a viable product.