RLHF
Reinforcement Learning from Human Feedback (RLHF) is the training technique that turned large language models from impressive autocomplete engines into useful assistants by systematically aligning their outputs with human preferences. Brought to mainstream prominence by OpenAI's InstructGPT paper in 2022, the process first trains a reward model on thousands of human comparisons (annotators judge which of two responses is better), then uses reinforcement learning, typically PPO with a KL penalty that keeps the tuned model close to the original, to steer the base model toward responses humans actually prefer. This alignment layer is why modern AI can follow complex instructions, refuse harmful requests, and match organizational tone, making it the invisible substrate beneath every enterprise AI deployment.
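A minimal sketch of the two objectives described above, assuming a PyTorch-style setup; the function names, tensors, and the kl_coeff value are illustrative placeholders rather than any particular library's API. The first function is the pairwise (Bradley-Terry) loss used to train the reward model on human comparisons; the second shapes the reward with a KL penalty so the tuned policy stays close to the reference model during the RL step.

```python
# Sketch of the two RLHF training objectives: reward-model training and the
# KL-penalized reward used in the RL step. Names and values are illustrative.
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss: push the reward model to score the
    human-preferred response above the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def rl_objective(reward: torch.Tensor,
                 logprobs_policy: torch.Tensor,
                 logprobs_reference: torch.Tensor,
                 kl_coeff: float = 0.1) -> torch.Tensor:
    """KL-penalized objective for the RL step: maximize the reward model's
    score while staying close to the original (reference) model."""
    kl_penalty = logprobs_policy - logprobs_reference  # per-token KL estimate
    shaped_reward = reward - kl_coeff * kl_penalty.sum(dim=-1)
    return shaped_reward.mean()

if __name__ == "__main__":
    # Toy tensors standing in for reward-model scores and per-token
    # log-probabilities from the policy and reference models.
    chosen, rejected = torch.tensor([1.2, 0.7]), torch.tensor([0.3, 0.9])
    print("reward-model loss:", reward_model_loss(chosen, rejected).item())

    reward = torch.tensor([0.8, 1.1])
    lp_policy = torch.randn(2, 16)
    lp_ref = torch.randn(2, 16)
    print("shaped RL objective:", rl_objective(reward, lp_policy, lp_ref).item())
```

In practice these pieces are handled by dedicated RLHF tooling; the sketch only shows the shape of the math, not a production training loop.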
Referenced in these posts:
Satisficing for LLMs
By applying Herbert Simon’s concept of satisficing to AI, this post argues that language models might prefer logical‐sounding content over emotional appeals,...
Related terms:
Prompt Injection
Prompt injection is an attack where a user or data source inserts instructions that override a language model’s intended behavior.
Model Context Protocol (MCP)
Model Context Protocol (MCP) is an open standard from Anthropic that standardizes how AI models connect to external tools and data sources via a...
System Prompt
A system prompt is an invisible set of instructions given to a language model—defining its persona, constraints, output format, and behavioral rules—and...