Transformers — AI Glossary

Transformers are a neural network architecture introduced in the 2017 paper "Attention Is All You Need" by researchers at Google. Built around a mechanism called self-attention, transformers process all tokens of an input sequence in parallel rather than one at a time. Because self-attention lets every token draw directly on every other token, they can capture long-range dependencies in text, images, and other data. They are the foundational architecture behind virtually every major large language model today, including Claude, GPT, Gemini, and Llama.

Why Transformers Matter

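Before transformers, sequence models like RNNs and LSTMs processed tokens one at a time, which limited parallelism during training and made dependencies across long sequences hard to learn. Transformers removed this sequential bottleneck, making it practical to train much larger models on much larger datasets. That is the regime where scaling laws apply: with more data and more parameters, performance improves in a fairly predictable way.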

The architecture powers not just language models but also vision transformers (ViT), speech models, protein structure prediction (AlphaFold), and multimodal systems. Every major AI lab, including Anthropic, OpenAI, Google DeepMind, and Meta, builds on transformer variants. Understanding transformers is prerequisite knowledge for following most developments in modern AI.

How Transformers Work

The core mechanism is self-attention: each token in a sequence computes attention scores against every other token, producing a weighted representation that captures contextual relationships. This happens across multiple attention heads in parallel, each learning different relationship patterns.
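As a rough illustration of the scoring-and-weighting step described above, here is a minimal single-head sketch in plain Python. It is not taken from any particular library: the function names (`softmax`, `matmul`, `self_attention`) and the toy identity projection matrices are assumptions made for this example, and real implementations use a tensor library and learned weights.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x: token vectors, shape (seq_len, d_model)
    w_q, w_k, w_v: projection matrices, shape (d_model, d_head)
    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(q[0])
    weights = []
    for q_i in q:
        # Score this token's query against every token's key, scaled by sqrt(d_head).
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_head) for k_j in k]
        weights.append(softmax(scores))
    # Each output is a weighted mix of all value vectors.
    outputs = matmul(weights, v)
    return outputs, weights

# Toy input: three 2-dimensional tokens and identity projections (assumed values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the value vectors. A multi-head layer would run several such computations with different projection matrices and concatenate the results.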

A standard transformer has two main components:

Encoder: Processes input sequences into contextual representations (used in BERT-style models)
Decoder: Generates output tokens autoregressively, attending to both previous outputs and encoder states

Most modern LLMs like Claude use decoder-only architectures, where the model generates text one token at a time, attending to all prior tokens in the sequence. Scaling these architectures to billions of parameters — combined with massive training datasets — produces the emergent capabilities seen in today's frontier models.
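The decoder-only pattern can be sketched in two parts: a causal mask that lets each position attend only to itself and earlier positions, and a loop that appends one token at a time. The code below is a simplified illustration, not how any specific model is implemented. `toy_model` is a hypothetical stand-in for a trained network, and greedy selection is an assumption made for brevity; production systems typically sample from the predicted distribution.

```python
import math

def causal_attention_weights(scores):
    """Softmax each row of a square score matrix over positions 0..i only.

    Position i gets zero weight on every later position, which is what
    "attending to all prior tokens" means in a decoder.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

def generate(next_token_logits, prompt, max_new_tokens, eos_token=None):
    """Greedy autoregressive decoding: repeatedly append the highest-scoring token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)  # one score per vocabulary entry
        next_token = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_token)
        if next_token == eos_token:
            break
    return tokens

def toy_model(tokens):
    """Hypothetical stand-in for a trained decoder: favors (last token + 1) mod 5."""
    vocab_size = 5
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if t == target else 0.0 for t in range(vocab_size)]

# Uniform scores make the effect of the mask easy to see.
weights = causal_attention_weights([[0.0, 0.0, 0.0]] * 3)
generated = generate(toy_model, prompt=[0], max_new_tokens=4)
```

With uniform scores, the first position attends only to itself, the second splits its weight evenly over two positions, and the third over three. The generation loop feeds the growing sequence back into the model at every step, which is why output is produced one token at a time.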

Related Terms

Google: Origin of the transformer architecture, developed by researchers at Google Brain and Google Research
Anthropic: Builder of Claude, a frontier AI system built on transformer architecture
Claude: Anthropic's AI assistant, powered by a decoder-only transformer model
