Transformer Models
The Transformer is a neural network architecture published by a Google research team in 2017. By using only “Attention” to process sequences, it overcame the limitations of the RNNs and LSTMs that were mainstream before it, and has become the foundation for virtually all modern large language models (LLMs).
Target audience: Those who understand the basics of deep learning (neural networks, loss functions).
Estimated learning time: 30 minutes to read
Prerequisites: Must have read What Is Deep Learning?
Why the Transformer Emerged — The Limitations of RNNs/LSTMs
Before the Transformer, sequence data (text, audio, etc.) was processed with RNNs (Recurrent Neural Networks) or LSTMs (Long Short-Term Memory networks).
Problems with RNNs/LSTMs
Cannot parallelize due to sequential processing: RNNs compute each step using the output of the previous step, so all steps must be processed in order. This means they can’t leverage the parallel computing capabilities of GPUs/TPUs, resulting in long training times.
Vanishing long-range dependencies: The longer the sentence, the harder it becomes for information from the beginning to reach the end. For example, in a sentence like “Tanaka-san, who worked at a large office in Tokyo yesterday, …”, the model struggles to link the verb that finally appears at the end back to its subject, “Tanaka-san.”
| Issue | RNN/LSTM | Transformer |
|---|---|---|
| Parallel processing | Not possible (sequential) | Possible (processes all tokens simultaneously) |
| Long-range dependencies | Information lost over distance | Can be referenced directly via Attention |
| Training speed | Slow | Fast (due to parallel training) |
| Scalability | Limited | Easy to scale |
About the Original Paper
The Transformer was published in the following paper:
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS), 30.
The title “Attention Is All You Need” asserts that a high-performance model can be built using only Attention, without RNNs or CNNs.
Self-Attention
Self-Attention is a mechanism for understanding context by having each token in a sequence calculate its relevance to every other token.
Intuitive Explanation
Consider the sentence “The animal didn’t cross the street because it was too tired.”
Understanding what “it” refers to requires looking at the relationship to other words in the sentence. Self-Attention numerically calculates that the relevance between “it” and “animal” is high, enabling correct contextual interpretation.
Three Vectors: Query, Key, and Value
In Self-Attention, each token is converted into three vectors.
| Vector | Role | Analogy |
|---|---|---|
| Query | Represents “what am I looking for?” | The search keywords a person uses to find a book in a library |
| Key | Represents “what kind of information am I?” | The spine title of a book in a library |
| Value | Represents “what information do I hold?” | The contents inside a library book |
Attention Score Calculation Flow
```mermaid
graph LR
    Input["Input tokens\n(embedding vectors)"]
    Q["Query vector\n(transformed by WQ)"]
    K["Key vector\n(transformed by WK)"]
    V["Value vector\n(transformed by WV)"]
    Score["Attention Score\n(dot product of Q·K ÷ √dk)"]
    Softmax["Softmax\n(convert to probability distribution)"]
    Output["Output\n(weighted sum of Softmax × V)"]
    Input --> Q
    Input --> K
    Input --> V
    Q --> Score
    K --> Score
    Score --> Softmax
    Softmax --> Output
    V --> Output
```

Expressed as a formula:
Attention(Q, K, V) = softmax(Q·Kᵀ / √dk) × V

- Q·Kᵀ: calculates a score via the dot product of Query and Key (quantifies relevance)
- / √dk: scaling to prevent scores from becoming too large
- softmax: converts scores to a probability distribution summing to 1
- × V: weighted sum of Values according to the probabilities (draws more information from highly relevant tokens)
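The formula above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product attention, not any library's actual implementation; the function name and toy shapes are chosen for this example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q·Kᵀ / √dk) × V."""
    dk = K.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)  # relevance of each Query to each Key, scaled by √dk
    # softmax over the key axis: each row becomes a probability distribution summing to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted sum of Values

# Toy example: 3 tokens with dk = 4; self-attention uses the same matrix as Q, K, and V
X = np.random.default_rng(0).normal(size=(3, 4))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (3, 4)
```

In a real Transformer layer, Q, K, and V are not the raw embeddings but the embeddings multiplied by the learned matrices WQ, WK, and WV from the diagram.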
Multi-Head Attention
Multi-Head Attention is a mechanism that runs Self-Attention multiple times in parallel to simultaneously capture context from different perspectives.
```mermaid
graph TD
    Input["Input"]
    H1["Attention Head 1\n(captures syntactic relations)"]
    H2["Attention Head 2\n(captures semantic relations)"]
    H3["Attention Head 3\n(captures long-range dependencies)"]
    HN["Attention Head N\n(...)"]
    Concat["Concatenate"]
    Linear["Linear transformation"]
    Output["Output"]
    Input --> H1
    Input --> H2
    Input --> H3
    Input --> HN
    H1 --> Concat
    H2 --> Concat
    H3 --> Concat
    HN --> Concat
    Concat --> Linear
    Linear --> Output
```

Each head learns different kinds of relationships (syntactic, semantic, coreference, etc.), enabling richer contextual understanding.
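The split-attend-concatenate pattern can be sketched as follows. This simplified version omits the learned per-head WQ/WK/WV projections and the final linear layer from the diagram: each head simply attends over its own slice of the embedding, which is enough to show the shape of the computation.

```python
import numpy as np

def multi_head_self_attention(X, num_heads):
    """Simplified multi-head self-attention: split the embedding into
    num_heads slices, run attention per slice, then concatenate.
    (A real layer would apply learned projections per head and a final
    linear transformation after concatenation.)"""
    n, d = X.shape
    assert d % num_heads == 0, "embedding size must divide evenly across heads"
    dk = d // num_heads
    heads = []
    for h in range(num_heads):
        # each head sees only its own dk-dimensional slice
        Qh = Kh = Vh = X[:, h * dk:(h + 1) * dk]
        scores = Qh @ Kh.T / np.sqrt(dk)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        heads.append(w @ Vh)
    return np.concatenate(heads, axis=-1)  # same shape as the input

X = np.random.default_rng(1).normal(size=(5, 8))  # 5 tokens, d_model = 8
out = multi_head_self_attention(X, num_heads=2)
print(out.shape)  # (5, 8)
```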
Positional Encoding
Positional Encoding is a mechanism for informing the model of the order of tokens.
Since Self-Attention processes all tokens simultaneously, the information about “which token appears at which position” — word order — is lost without additional help. Positional Encoding preserves word order by adding positional information to each token’s embedding vector.
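One concrete scheme is the sinusoidal encoding used in the original paper, where each position gets a unique pattern of sine and cosine values at different frequencies. A minimal sketch (function name chosen for this example):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)) — the scheme from the original paper."""
    pos = np.arange(seq_len)[:, None]            # positions 0..seq_len-1, as a column
    two_i = np.arange(0, d_model, 2)[None, :]    # the even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

# Positional information is simply added to the token embeddings
embeddings = np.zeros((10, 16))  # stand-in for 10 token embeddings, d_model = 16
model_input = embeddings + sinusoidal_positional_encoding(10, 16)
```

Because each position produces a distinct vector, two sentences with the same words in different orders now produce different model inputs.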
Input embedding + Positional encoding = Input to model

Encoder and Decoder Structure
The Transformer was originally designed for machine translation, so it consists of two structures: an Encoder and a Decoder.
```mermaid
graph LR
    subgraph Encoder["Encoder (understanding context)"]
        E1["Self-Attention"]
        E2["Feed-Forward Network"]
        E1 --> E2
    end
    subgraph Decoder["Decoder (generating text)"]
        D1["Masked Self-Attention"]
        D2["Cross-Attention\n(references Encoder output)"]
        D3["Feed-Forward Network"]
        D1 --> D2 --> D3
    end
    Encoder --> Decoder
```

Modern models adopt parts of this structure and fall into three categories:
| Architecture | Part Used | Representative Models | Best Tasks |
|---|---|---|---|
| Encoder-Only | Encoder only | BERT, RoBERTa | Text understanding, classification, NER |
| Decoder-Only | Decoder only | GPT series, Claude, Llama | Text generation, dialogue, code generation |
| Encoder-Decoder | Both | T5, BART, translation models | Translation, summarization, QA |
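The “Masked Self-Attention” used by Decoder-Only models can be illustrated by adding one step to plain self-attention: a causal mask that blocks each token from attending to later positions, so the model can only use past context when generating text. A minimal sketch (function name chosen for this example):

```python
import numpy as np

def causal_self_attention(X):
    """Self-attention with a causal mask: token i may only attend to
    positions <= i, as in Decoder-Only (GPT-style) models."""
    n, dk = X.shape
    scores = X @ X.T / np.sqrt(dk)
    # True above the diagonal = "future" positions, which must be hidden
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # softmax turns -inf into weight 0
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

X = np.random.default_rng(2).normal(size=(4, 8))  # 4 tokens, dk = 8
out = causal_self_attention(X)
```

With the mask in place, the first token can attend only to itself, so its output is unchanged; later tokens mix in progressively more of the preceding context.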
Why the Transformer Changed AI
The impact of the Transformer on modern AI development can be summarized in two points:
Scalability through parallel training: Since all tokens are processed simultaneously, it can fully leverage the computing power of GPUs/TPUs. This made it possible to train large-scale models with billions or hundreds of billions of parameters in realistic timeframes.
Discovery of scaling laws: Research on Transformer-based language models showed that performance improves predictably as model size, data volume, and compute are increased — the “scaling laws.” This provided the justification for developing large-scale models like GPT-3 (175 billion parameters) and beyond.
Summary
- The Transformer is an architecture published in the 2017 paper “Attention Is All You Need”
- Self-Attention calculates the relationships between all tokens at once
- Relevance is calculated using three vectors — Q, K, V — to capture context
- Multi-Head Attention simultaneously understands context from multiple perspectives
- Three types: Encoder-Only (BERT series), Decoder-Only (GPT series), Encoder-Decoder (translation)
- Its parallel training and scalability made it the foundation of modern LLMs
Frequently Asked Questions
Q: Are “Transformer” and “LLM” the same thing?
A: No. The Transformer is a neural network architecture (blueprint). An LLM (Large Language Model) is a large-scale model trained on massive amounts of text data, based on the Transformer. The Transformer is the “structure” of an LLM; the LLM is the “product” built using the Transformer.
Q: What’s the difference between Self-Attention and Cross-Attention?
A: Self-Attention calculates relationships between tokens within the same sequence. In Cross-Attention, the Decoder’s tokens act as Queries that attend to the Encoder’s output, which supplies the Keys and Values. Cross-Attention is used in Encoder-Decoder models (such as translation models) to learn correspondences between input and output sentences.
Q: What happens without Positional Encoding?
A: Without word order information, “The cat chased the dog” and “The dog chased the cat” become indistinguishable. Positional Encoding is an essential element for grammatically and semantically correct text processing.
Q: How many Attention Heads is typical?
A: It varies by model scale. The original Transformer used 8 heads; GPT-3 used 96 heads. More heads allow parallel learning of different kinds of relationships, but compute cost also increases.
Next step: BERT vs. GPT