Transformer
The transformer is the neural network architecture behind every major language model since 2017. Introduced in the paper "Attention Is All You Need" by Vaswani et al. at Google, it replaced recurrent networks with a mechanism called self-attention, which lets the model weigh the relevance of every token against every other token in parallel rather than processing the sequence one step at a time. That parallelism is what made training on internet-scale data feasible, and what made GPUs the bottleneck. Transformers are not limited to text: the same architecture powers image recognition (Vision Transformers), protein structure prediction (AlphaFold), and audio synthesis.

The key insight was architectural simplicity. Transformers do one thing, attention, and scale it. That turned out to be enough for an extraordinary range of tasks, which is why transformer-based models now dominate AI research and production systems alike.
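To make the mechanism concrete, here is a minimal sketch of scaled dot-product self-attention, the core operation from the paper: softmax(QK^T / sqrt(d_k)) V. It uses plain NumPy, and the dimensions and random projection weights are illustrative placeholders, not anything from a trained model:

```python
# A minimal sketch of scaled dot-product self-attention.
# All weights here are random placeholders for illustration.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) array of token embeddings."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # project into queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # every token scored against every other, in parallel
    weights = softmax(scores, axis=-1)     # each row is a relevance distribution summing to 1
    return weights @ V                     # each output is a relevance-weighted mix of values

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8        # toy sizes, chosen for readability
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one context-aware vector per input token
```

A real transformer stacks many of these attention layers (with multiple heads, feed-forward layers, and normalization), but the score-weight-mix loop above is the whole trick, and because it is all matrix multiplication, every token is processed at once on a GPU.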
Referenced in these posts:
Things I Think I Think About AI
Noah distills his 2,400+ hours of AI use into a candid, unordered list of 29 controversial takeaways—from championing ChatGPT’s advanced models and token...
Related terms:
Token
In large language models, a token is the basic unit of text—usually chunks of three to four characters—that the model reads and generates.
Fine-Tuning
Fine-tuning continues training a pretrained language model on a smaller, task-specific dataset so it internalizes particular behaviors, styles, or domain...
Prompt Engineering
Prompt engineering involves designing and refining inputs—ranging from simple instructions to detailed system prompts with examples, constraints, personas,...