Inference
Inference is the act of running a trained model on new input to get a prediction or output. When you send a prompt to GPT-4 and get a response, that is inference. Training builds the model; inference uses it.

The distinction matters because the economics are completely different. Training a frontier model costs tens of millions of dollars and happens once, or at most a few times. Inference happens millions of times per day and is where the ongoing cost lives.

Two metrics govern inference in practice. Speed, measured in tokens per second, determines whether your AI feature feels instant or sluggish. Cost, measured in dollars per million tokens, determines whether your AI feature is economically viable at scale. Much of the current hardware race (custom chips from Google, Amazon, and startups like Groq and Cerebras) is about making inference cheaper and faster, not about training bigger models.
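A back-of-envelope sketch makes the two metrics concrete. The prices, decode speed, and time-to-first-token below are illustrative placeholders, not any provider's real numbers:

```python
# Back-of-envelope inference economics (illustrative numbers only).

def cost_per_request(prompt_tokens, output_tokens,
                     input_price_per_m, output_price_per_m):
    """Dollar cost of one request, given per-million-token prices."""
    return (prompt_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

def perceived_latency(output_tokens, tokens_per_second, ttft=0.5):
    """Seconds until the full response arrives: time-to-first-token
    plus generation time at a given decode speed."""
    return ttft + output_tokens / tokens_per_second

# A 1,000-token prompt with a 500-token answer, at hypothetical
# prices of $1 (input) / $3 (output) per million tokens:
cost = cost_per_request(1_000, 500, 1.0, 3.0)   # $0.0025 per request
latency = perceived_latency(500, 100)           # 5.5 s at 100 tok/s
print(f"${cost:.4f} per request, {latency:.1f}s to finish")
```

At a million requests per day, that hypothetical $0.0025 per request becomes $2,500 per day, which is why per-token cost, not training cost, dominates the economics of a deployed feature.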
Referenced in these posts:
Thinking Ahead, Building Ahead
Why the best AI products ship before they're ready—and why that's exactly right. In the AI era, speed and iteration beat waiting for perfection.
The Alephic AI Thesis: 2025
The AI revolution will be dictated by three physical constraints—compute packaging capacity, energy availability, and organizational agility—that concentrate...
The $1 Sweet Spot
Gemini Flash delivers around 80% of frontier LLM intelligence at just 10–25% of the cost, making it the defining model in the $1 intelligence tier.
Related terms:
Temperature
Temperature is a parameter controlling a language model’s randomness: at 0 it always picks the most probable next token for deterministic, reliable output,...
Token
In large language models, a token is the basic unit of text—usually chunks of three to four characters—that the model reads and generates.
Context Window
A context window is the maximum amount of text a language model can process in a single call—input and output combined—measured in tokens.