Inference
Inference is the act of running a trained model on new input to get a prediction or output. When you send a prompt to GPT-4 and get a response, that is inference. Training builds the model; inference uses it.

The distinction matters because the economics are completely different. Training a frontier model costs tens of millions of dollars and happens once, or at most a few times. Inference happens millions of times per day and is where the ongoing cost lives.

Two metrics govern inference in practice. Speed, measured in tokens per second, determines whether your AI feature feels instant or sluggish. Cost, measured in dollars per million tokens, determines whether your AI feature is economically viable at scale. Much of the current hardware race (custom chips from Google, Amazon, and startups like Groq and Cerebras) is about making inference cheaper and faster, not about training bigger models.
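A back-of-envelope sketch makes the two metrics concrete. The prices, decode speed, and time-to-first-token below are illustrative placeholders, not any provider's real numbers:

```python
# Back-of-envelope inference economics (illustrative numbers only).

def cost_per_request(prompt_tokens, output_tokens,
                     input_price_per_m, output_price_per_m):
    """Dollar cost of one request, given per-million-token prices."""
    return (prompt_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

def perceived_latency(output_tokens, tokens_per_second, ttft=0.5):
    """Seconds until the full response arrives: time-to-first-token
    plus generation time at a given decode speed."""
    return ttft + output_tokens / tokens_per_second

# A 1,000-token prompt with a 500-token answer, at hypothetical
# prices of $1 (input) / $3 (output) per million tokens:
cost = cost_per_request(1_000, 500, 1.0, 3.0)   # $0.0025 per request
latency = perceived_latency(500, 100)           # 5.5 s at 100 tok/s
print(f"${cost:.4f} per request, {latency:.1f}s to finish")
```

At a million requests per day, that hypothetical $0.0025 per request becomes $2,500 per day, which is why per-token cost, not training cost, dominates the economics of a deployed feature.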
Referenced in these posts:
Thinking Ahead, Building Ahead
Why the best AI products ship before they're ready—and why that's exactly right. In the AI era, speed and iteration beat waiting for perfection.
The Alephic AI Thesis: 2025
The AI revolution will be dictated by three physical constraints—compute packaging capacity, energy availability, and organizational agility—that concentrate...
The $1 Sweet Spot
Gemini Flash delivers around 80% of frontier LLM intelligence at just 10–25% of the cost, making it the defining model in the $1 intelligence tier.
Related terms:
Temperature
Temperature is a parameter controlling a language model’s randomness: at 0 it always picks the most probable next token for deterministic, reliable output,...
Token
In large language models, a token is the basic unit of text—usually chunks of three to four characters—that the model reads and generates.
Context Window
A context window is the maximum amount of text a language model can process in a single call—input and output combined—measured in tokens.