
Transformer Models

The Transformer is a neural network architecture published by a Google research team in 2017. By using only “Attention” to process sequences, it overcame the limitations of the RNNs and LSTMs that were mainstream before it, and has become the foundation for virtually all modern large language models (LLMs).

Target audience: Those who understand the basics of deep learning (neural networks, loss functions).

Estimated learning time: 30 minutes to read

Prerequisites: Must have read What Is Deep Learning?

Why the Transformer Emerged — The Limitations of RNNs/LSTMs

Before the Transformer, sequence data (text, audio, etc.) was processed with RNNs (Recurrent Neural Networks) or LSTMs (Long Short-Term Memory).

Cannot parallelize due to sequential processing: RNNs compute each step using the output of the previous step, so all steps must be processed in order. This means they can’t leverage the parallel computing capabilities of GPUs/TPUs, resulting in long training times.

Vanishing long-range dependencies: The longer the sentence, the harder it is for information from the beginning to influence processing at the end. For example, in “Tanaka-san, who worked at a large office in Tokyo yesterday, …”, the model struggles to recognize that the verb at the end of the sentence takes “Tanaka-san” as its subject.

| Issue | RNN/LSTM | Transformer |
|---|---|---|
| Parallel processing | Not possible (sequential) | Possible (processes all tokens simultaneously) |
| Long-range dependencies | Information lost over distance | Can be referenced directly via Attention |
| Training speed | Slow | Fast (due to parallel training) |
| Scalability | Limited | Easy to scale |

The Transformer was published in the following paper:

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS), 30.

The title “Attention Is All You Need” asserts that a high-performance model can be built using only Attention, without RNNs or CNNs.

Self-Attention

Self-Attention is a mechanism for understanding context by having each token in a sequence calculate its relevance to every other token.

Consider the sentence “The animal didn’t cross the street because it was too tired.”

Understanding what “it” refers to requires looking at the relationship to other words in the sentence. Self-Attention numerically calculates that the relevance between “it” and “animal” is high, enabling correct contextual interpretation.

In Self-Attention, each token is converted into three vectors.

| Vector | Role | Analogy |
|---|---|---|
| Query | Represents “what am I looking for?” | The search keywords a person uses to find a book in a library |
| Key | Represents “what kind of information am I?” | The spine title of a book in a library |
| Value | Represents “what information do I hold?” | The contents inside a library book |
```mermaid
graph LR
    Input["Input tokens\n(embedding vectors)"]
    Q["Query vector\n(transformed by WQ)"]
    K["Key vector\n(transformed by WK)"]
    V["Value vector\n(transformed by WV)"]
    Score["Attention Score\n(dot product of Q·K ÷ √dk)"]
    Softmax["Softmax\n(convert to probability distribution)"]
    Output["Output\n(weighted sum of Softmax × V)"]

    Input --> Q
    Input --> K
    Input --> V
    Q --> Score
    K --> Score
    Score --> Softmax
    Softmax --> Output
    V --> Output
```

Expressed as a formula:

Attention(Q, K, V) = softmax(Q·Kᵀ / √dk) × V
  • Q·Kᵀ: Calculates a score via the dot product of Query and Key (quantifies relevance)
  • / √dk: Scaling to prevent scores from becoming too large
  • softmax: Converts scores to a probability distribution summing to 1
  • × V: Weighted sum of Values according to probabilities (takes in more information from highly relevant items)
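The steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration: the weight matrices WQ, WK, WV, the random input, and the dimensions are illustrative, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, WQ, WK, WV):
    """Single-head scaled dot-product Self-Attention over token embeddings X."""
    Q = X @ WQ                            # queries: "what am I looking for?"
    K = X @ WK                            # keys:    "what kind of information am I?"
    V = X @ WV                            # values:  "what information do I hold?"
    dk = K.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)        # relevance of every token to every other
    weights = softmax(scores, axis=-1)    # each row is a probability distribution
    return weights @ V                    # weighted sum of Values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))              # 5 tokens, embedding dimension 16
WQ, WK, WV = (rng.normal(size=(16, 16)) for _ in range(3))
out = self_attention(X, WQ, WK, WV)
print(out.shape)  # (5, 16): one context-aware vector per token
```

Note that the output has the same shape as the input: each token's vector has simply been replaced by a context-weighted mixture of all tokens' Value vectors.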

Multi-Head Attention

Multi-Head Attention is a mechanism that runs Self-Attention multiple times in parallel to simultaneously capture context from different perspectives.

```mermaid
graph TD
    Input["Input"]
    H1["Attention Head 1\n(captures syntactic relations)"]
    H2["Attention Head 2\n(captures semantic relations)"]
    H3["Attention Head 3\n(captures long-range dependencies)"]
    HN["Attention Head N\n(...)"]
    Concat["Concatenate"]
    Linear["Linear transformation"]
    Output["Output"]

    Input --> H1
    Input --> H2
    Input --> H3
    Input --> HN
    H1 --> Concat
    H2 --> Concat
    H3 --> Concat
    HN --> Concat
    Concat --> Linear
    Linear --> Output
```

Each head learns different kinds of relationships (syntactic, semantic, coreference, etc.), enabling richer contextual understanding.
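The split–attend–concatenate–project pipeline can be sketched as follows. As in the paper, the model dimension is divided evenly among the heads; the head count, dimensions, and random weight matrices here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, WQ, WK, WV, WO):
    """Split d_model into n_heads subspaces, attend in each, concat, project."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project, then reshape to (n_heads, seq_len, d_head) so heads run in parallel
    Q = (X @ WQ).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ WK).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ WV).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores, axis=-1) @ V                 # each head attends independently
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads
    return concat @ WO                                   # final linear transformation

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 32))                             # 5 tokens, d_model = 32
WQ, WK, WV, WO = (rng.normal(size=(32, 32)) for _ in range(4))
out = multi_head_attention(X, 4, WQ, WK, WV, WO)         # 4 heads of dimension 8
print(out.shape)  # (5, 32)
```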

Positional Encoding

Positional Encoding is a mechanism for informing the model of the order of tokens.

Since Self-Attention processes all tokens simultaneously, the information about “which token appears at which position” — word order — is lost without additional help. Positional Encoding preserves word order by adding positional information to each token’s embedding vector.

Input embedding + Positional encoding = Input to model
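The original paper encodes each position with fixed sine and cosine waves of different frequencies. A minimal sketch of that scheme and the addition step (sequence length and embedding dimension here are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal encoding from the original paper: even dims get sin, odd dims cos."""
    pos = np.arange(seq_len)[:, None]            # positions 0 .. seq_len-1
    i = np.arange(d_model // 2)[None, :]         # index of each (sin, cos) dimension pair
    angles = pos / (10000 ** (2 * i / d_model))  # one frequency per dimension pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

embeddings = np.random.default_rng(0).normal(size=(5, 16))        # 5 tokens, dim 16
model_input = embeddings + sinusoidal_positional_encoding(5, 16)  # order info added
print(model_input.shape)  # (5, 16)
```

Because the frequencies are fixed rather than learned, the same function generalizes to sequence lengths not seen during training.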

Encoder-Decoder Architecture

The Transformer was originally designed for machine translation, so it consists of two structures: an Encoder and a Decoder.

```mermaid
graph LR
    subgraph Encoder["Encoder (understanding context)"]
        E1["Self-Attention"]
        E2["Feed-Forward Network"]
        E1 --> E2
    end
    subgraph Decoder["Decoder (generating text)"]
        D1["Masked Self-Attention"]
        D2["Cross-Attention\n(references Encoder output)"]
        D3["Feed-Forward Network"]
        D1 --> D2 --> D3
    end
    Encoder --> Decoder
```
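The Decoder's Masked Self-Attention can be illustrated with a toy causal mask: future positions are set to −∞ before the softmax, so each token attends only to itself and earlier tokens. The uniform dummy scores below are illustrative.

```python
import numpy as np

def causal_mask(seq_len):
    """Upper-triangular mask: token i may attend only to positions <= i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.zeros((4, 4))            # toy attention scores for 4 tokens
masked = scores + causal_mask(4)     # future positions become -inf
# Softmax turns -inf into probability 0, so future tokens are ignored
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights.round(2))              # row i spreads weight evenly over positions 0..i
```

This masking is what lets a Decoder-only model be trained to predict the next token without “seeing the future.”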

Modern models adopt parts of this structure and fall into three categories:

| Architecture | Part Used | Representative Models | Best Tasks |
|---|---|---|---|
| Encoder-Only | Encoder only | BERT, RoBERTa | Text understanding, classification, NER |
| Decoder-Only | Decoder only | GPT series, Claude, Llama | Text generation, dialogue, code generation |
| Encoder-Decoder | Both | T5, BART, translation models | Translation, summarization, QA |

The impact of the Transformer on modern AI development can be summarized in two points:

Scalability through parallel training: Since all tokens are processed simultaneously, it can fully leverage the computing power of GPUs/TPUs. This made it possible to train large-scale models with billions or hundreds of billions of parameters in realistic timeframes.

Discovery of scaling laws: The Transformer demonstrated that performance improves as model size, data volume, and compute are increased — the “scaling laws.” This provided the justification for developing large-scale models like GPT-3 (175 billion parameters) and beyond.

  • The Transformer is an architecture published in the 2017 paper “Attention Is All You Need”
  • Self-Attention calculates the relationships between all tokens at once
  • Relevance is calculated using three vectors — Q, K, V — to capture context
  • Multi-Head Attention simultaneously understands context from multiple perspectives
  • Three types: Encoder-Only (BERT series), Decoder-Only (GPT series), Encoder-Decoder (translation)
  • Its parallel training and scalability made it the foundation of modern LLMs

Q: Are “Transformer” and “LLM” the same thing?

A: No. The Transformer is a neural network architecture (blueprint). An LLM (Large Language Model) is a large-scale model trained on massive amounts of text data, based on the Transformer. The Transformer is the “structure” of an LLM; the LLM is the “product” built using the Transformer.

Q: What’s the difference between Self-Attention and Cross-Attention?

A: Self-Attention calculates relationships between tokens within the same sequence. Cross-Attention calculates relationships between the Encoder’s output and the Decoder’s input. Cross-Attention is used in Encoder-Decoder models (such as translation models) to learn correspondences between input and output sentences.

Q: What happens without Positional Encoding?

A: Without word order information, “The cat chased the dog” and “The dog chased the cat” become indistinguishable. Positional Encoding is an essential element for grammatically and semantically correct text processing.

Q: How many Attention Heads is typical?

A: It varies by model scale. The original Transformer used 8 heads; GPT-3 used 96 heads. More heads allow parallel learning of different kinds of relationships, but compute cost also increases.


Next step: BERT vs. GPT