Glossary

Multimodal AI

Multimodal AI refers to models that process and generate more than one type of data (text, images, audio, video) within a single system. GPT-4o can read a photo, transcribe speech, and respond in text in one call. Gemini can reason across video frames. The significance is integration: instead of chaining a speech-to-text model into a language model into an image generator, you get one model that handles the translation between modalities internally. In practice, multimodal capability means an AI can analyze a screenshot of a dashboard and explain the trend, or watch a product demo and write the marketing copy. The gap between multimodal demos and production reliability is still wide: image understanding is good, video understanding is inconsistent, and audio reasoning is early. Still, the direction of improvement is clear.

Related terms:

Prompt Engineering

Prompt engineering is the practice of designing and refining the inputs given to a language model to elicit the desired outputs. Those inputs range from simple instructions to detailed system prompts with examples, constraints, personas, and chain-of-thought scaffolding. It’s the most accessible way to boost AI performance, requiring no training data or ML expertise, but prompts can be fragile, hard to version-control, and easy to overfit to the handful of inputs they were tuned on.
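
As a minimal sketch of the pieces named above, the following Python assembles a prompt from a persona, a list of constraints, a few examples, and a chain-of-thought cue. The function name `build_prompt`, its parameters, and the "Q:/A:" layout are hypothetical, chosen only for illustration; no particular model or API is assumed.

```python
from typing import List, Tuple


def build_prompt(
    persona: str,
    constraints: List[str],
    examples: List[Tuple[str, str]],
    question: str,
    think_step_by_step: bool = True,
) -> str:
    """Assemble a plain-text prompt from its parts (hypothetical layout)."""
    parts = [f"You are {persona}."]

    # Constraints become a bulleted list the model can refer back to.
    if constraints:
        parts.append("Constraints:")
        parts.extend(f"- {c}" for c in constraints)

    # Each example is a worked question/answer pair.
    for q, a in examples:
        parts.append(f"Q: {q}\nA: {a}")

    # Optional chain-of-thought cue.
    if think_step_by_step:
        parts.append("Think through the problem step by step before answering.")

    # The real question goes last, with the answer left open.
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)


prompt = build_prompt(
    persona="a concise support assistant",
    constraints=["Answer in one sentence.", "Do not guess order numbers."],
    examples=[("Where is my order?", "Please share your order number so I can check.")],
    question="Can I change my shipping address?",
)
print(prompt)
```

Building the prompt in a function like this is one way to keep it in source control alongside the code that uses it, so changes can be reviewed and tested like any other change.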

Inference

Inference is the process of running a trained model on new input to generate a prediction or output, such as sending a prompt to GPT-4 and receiving a response. Unlike training, which is costly and infrequent, inference can occur millions of times per day. Its speed (tokens per second) and cost (dollars per million tokens) determine an AI feature’s responsiveness and economic viability.
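
The arithmetic behind those two measures can be sketched in a few lines of Python. The function names are hypothetical, the prices and throughput are made-up numbers rather than any provider's real rates, and pricing input and output tokens separately is an assumption of this sketch.

```python
def inference_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one request, given prices in dollars per million tokens."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream the output at a steady generation speed."""
    return output_tokens / tokens_per_second


# Hypothetical request: 1,200 prompt tokens in, 300 tokens out.
cost = inference_cost_usd(1_200, 300, input_price_per_million=2.50,
                          output_price_per_million=10.00)
latency = generation_seconds(300, tokens_per_second=50.0)

print(f"cost per request: ${cost:.4f}")
print(f"generation time:  {latency:.1f} s")
print(f"cost per million requests: ${cost * 1_000_000:,.0f}")
```

A per-request cost that looks negligible becomes significant once it is multiplied by request volume, which is why both numbers matter when deciding whether a feature is viable.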

Token

In large language models, a token is the basic unit of text that the model reads and generates, usually a chunk of about three to four characters of English text. Since API costs, context windows, and rate limits are all measured in tokens, understanding tokenization is essential for controlling prompt length, cost, and model behavior.
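
The rule of thumb above can be turned into a rough token estimate, sketched here in Python. This is a character-count heuristic, not a real tokenizer: actual counts depend on the model's tokenizer and on the language of the text. The names `estimate_tokens` and `fits_context` and the window sizes are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count from character length (heuristic, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(prompt: str, context_window: int, reserved_for_output: int) -> bool:
    """Check whether the prompt leaves room for the reply in the context window."""
    return estimate_tokens(prompt) + reserved_for_output <= context_window


prompt = "Summarize the following report in three bullet points. " * 20
print("estimated tokens:", estimate_tokens(prompt))
print("fits in a 1,000-token window with 500 reserved:",
      fits_context(prompt, context_window=1_000, reserved_for_output=500))
```

Passing `chars_per_token=3.0` gives a higher, more conservative estimate. For billing or hard limits, the count should come from the tokenizer that matches the model being called.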