RLHF
Reinforcement Learning from Human Feedback (RLHF) is the training technique that turned large language models from impressive autocomplete engines into useful assistants by systematically aligning their outputs with human preferences. Popularized for language models by OpenAI's InstructGPT paper in 2022, the process trains a reward model on thousands of human comparisons (given two responses to the same prompt, which is better?), then uses reinforcement learning to tune the base model toward responses the reward model scores highly, and therefore toward responses humans tend to prefer. This alignment layer is a large part of why modern AI assistants can follow complex instructions, refuse harmful requests, and match organizational tone, making it an invisible substrate beneath most enterprise deployments of AI assistants.
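To make the reward-model step concrete, here is a minimal sketch of the pairwise preference loss commonly used for it: the negative log-sigmoid of the score gap between the preferred and the rejected response. Everything else in the block is an assumption made for illustration. The linear reward model, the two hand-made features, the `COMPARISONS` data, and the function names are hypothetical. A real system scores text with a neural network rather than a two-number feature vector.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(margin)): small when the chosen response scores higher."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def reward(weights, features) -> float:
    """Toy linear reward model (an assumption; real reward models are neural networks)."""
    return sum(w * x for w, x in zip(weights, features))

def train(comparisons, steps: int = 50, lr: float = 0.5):
    """Fit the toy reward model by gradient descent on the pairwise loss."""
    weights = [0.0] * len(comparisons[0][0])
    for _ in range(steps):
        for chosen, rejected in comparisons:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # d(loss)/d(weights) = -(1 - sigmoid(margin)) * (chosen - rejected)
            scale = 1.0 - sigmoid(margin)
            weights = [
                w + lr * scale * (c - r)
                for w, c, r in zip(weights, chosen, rejected)
            ]
    return weights

# Hypothetical data: each response is described by [helpfulness, rudeness],
# and each pair is (response the human preferred, response the human rejected).
COMPARISONS = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.3, 0.9]),
]
```

Calling `train(COMPARISONS)` returns weights under which each preferred response scores higher than its rejected counterpart. The reinforcement learning stage, which InstructGPT ran with PPO, then tunes the language model to produce responses that this learned reward scores highly. That stage is not shown here.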
Referenced in these posts:
Satisficing for LLMs
By applying Herbert Simon’s concept of satisficing to AI, this post argues that language models might prefer logical‐sounding content over emotional appeals,...
Related terms:
System Prompt
A system prompt is an invisible set of instructions given to a language model—defining its persona, constraints, output format, and behavioral rules—and...
Chain-of-Thought
Chain-of-thought prompting, introduced by Google Research in 2022, transforms AI from an answer machine into a reasoning partner by explicitly modeling the...
AI Agent
An AI agent is a system that autonomously breaks a goal into steps—calling tools, reading results, and adjusting course—without waiting for a human prompt.