Transformer Models


The Transformer is a neural network architecture published by a Google research team in 2017. By using “Attention” to process sequences, it avoided the sequential-processing constraints of the RNNs and LSTMs that were mainstream before it, and has become the foundation for many large language models (LLMs).[1]

Why the Transformer Emerged — The Limitations of RNNs/LSTMs

Before the Transformer, sequence data (text, audio, etc.) was commonly processed with RNNs (Recurrent Neural Networks) or LSTMs (Long Short-Term Memory).[1]

Problems with RNNs/LSTMs

Hard to parallelize due to sequential processing: RNNs compute each step using the output of the previous step, so all steps must be processed in order. The Transformer paper describes this sequential nature as a constraint on efficient training over long sequences.[1]

Vanishing long-range dependencies: The longer the sentence, the harder it becomes for information from the beginning to reach the end. For example, it’s difficult for the model to correctly recognize that the verb at the end of “Tanaka-san, who worked at a large office in Tokyo yesterday, …” corresponds to “Tanaka-san.”

Issue	RNN/LSTM	Transformer
Parallel processing	Not possible across time steps (sequential)	Possible during training (all tokens processed together)
Long-range dependencies	Information lost over distance	Can be referenced directly via Attention
Training speed	Slow	Fast (due to parallel training)
Scalability	Limited	Easy to scale
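The sequential constraint in the table can be seen in a minimal sketch of a recurrent update. The function name `rnn_hidden_states` and the scalar weights `w_h` and `w_x` are illustrative assumptions, not taken from the paper; real RNNs use weight matrices and vector-valued states.

```python
import math

def rnn_hidden_states(inputs, w_h=0.5, w_x=1.0):
    """Run a toy scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # Each step needs the previous h, so the positions cannot be
        # computed all at once; the loop has to run in order.
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

# Only the first input is non-zero; later states carry a shrinking trace of it.
states = rnn_hidden_states([1.0, 0.0, 0.0, 0.0])
print(states)
```

In this toy setting the trace of the first input gets smaller at every later step, which is a simplified picture of why distant information is hard to keep.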

About the Original Paper

The Transformer was published in the following paper:

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS), 30.

The title “Attention Is All You Need” describes a sequence-transduction design based on attention mechanisms without using recurrent or convolutional layers.[1]

Self-Attention

Self-Attention is a mechanism for representing context by having each token in a sequence calculate its relevance to every other token.[1]

Intuitive Explanation

Consider the sentence “The animal didn’t cross the street because it was too tired.”

Understanding what “it” refers to requires looking at its relationship to other words in the sentence. In a trained model, Self-Attention can assign a high relevance score between “it” and “animal,” so the representation of “it” draws on information about “animal.”

Three Vectors: Query, Key, and Value

In Self-Attention, each token’s embedding is multiplied by three learned weight matrices (WQ, WK, WV) to produce three vectors.

Vector	Role	Analogy
Query	Represents “what am I looking for?”	The search keywords a person uses to find a book in a library
Key	Represents “what kind of information am I?”	The spine title of a book in a library
Value	Represents “what information do I hold?”	The contents inside a library book
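As a rough sketch of how one token becomes three vectors, the snippet below multiplies a single embedding by three projection matrices. The helper `matvec_row`, the embedding, and the matrix values are all made up for illustration; in a trained model the matrices are learned and far larger.

```python
def matvec_row(x, W):
    """Multiply a row vector x (length n) by a matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Hypothetical 3-dimensional embedding for one token.
embedding = [1.0, 0.0, 2.0]

# Hypothetical 3x2 projection matrices (learned in a real model).
W_Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_K = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
W_V = [[1.0, 1.0], [0.0, 0.0], [0.0, 1.0]]

query = matvec_row(embedding, W_Q)  # what this token is looking for
key = matvec_row(embedding, W_K)    # how this token presents itself to others
value = matvec_row(embedding, W_V)  # the content this token passes along

print(query, key, value)
```

The same embedding yields three different vectors because each matrix projects it in a different way.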

Attention Score Calculation Flow

graph LR
    Input["Input tokens\n(embedding vectors)"]
    Q["Query vector\n(transformed by WQ)"]
    K["Key vector\n(transformed by WK)"]
    V["Value vector\n(transformed by WV)"]
    Score["Attention Score\n(dot product of Q·K ÷ √dk)"]
    Softmax["Softmax\n(convert to probability distribution)"]
    Output["Output\n(weighted sum of Softmax × V)"]

    Input --> Q
    Input --> K
    Input --> V
    Q --> Score
    K --> Score
    Score --> Softmax
    Softmax --> Output
    V --> Output

Expressed as a formula:

Attention(Q, K, V) = softmax(Q·Kᵀ / √dk) × V

Q·Kᵀ: Calculates a score via the dot product of Query and Key (quantifies relevance)
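/ √dk: Divides by the square root of dk, the dimension of the Key vectors, so that large dot products do not push softmax into regions where gradients become very small
softmax: Converts each token’s scores into weights that sum to 1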
× V: Weighted sum of Values according to probabilities (takes in more information from highly relevant items)
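The four steps above can be sketched in plain Python. This is a minimal illustration, assuming Q, K, and V are given as lists of row vectors; the function names and the toy values are assumptions, and real implementations use batched matrix operations.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        all_weights.append(weights)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs, all_weights

# Toy example: two tokens whose queries match their own keys.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(Q, K, V)
print(weights)
print(out)
```

Each output row leans toward the Value of the token whose Key best matches the Query, while still mixing in a share of the other token.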

Multi-Head Attention

Multi-Head Attention is a mechanism that runs attention multiple times in parallel so the model can attend to information from different representation subspaces.[1]

graph TD
    Input["Input"]
    H1["Attention Head 1\n(captures syntactic relations)"]
    H2["Attention Head 2\n(captures semantic relations)"]
    H3["Attention Head 3\n(captures long-range dependencies)"]
    HN["Attention Head N\n(...)"]
    Concat["Concatenate"]
    Linear["Linear transformation"]
    Output["Output"]

    Input --> H1
    Input --> H2
    Input --> H3
    Input --> HN
    H1 --> Concat
    H2 --> Concat
    H3 --> Concat
    HN --> Concat
    Concat --> Linear
    Linear --> Output

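Each head can learn different kinds of relationships (syntactic, semantic, coreference, etc.), enabling richer contextual understanding. The roles shown in the diagram are illustrative: what a head ends up attending to is learned during training, not assigned in advance.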
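The diagram's flow (per-head attention, concatenate, linear transformation) can be sketched as follows. The projection matrices here are hand-picked so that each head simply reads a different half of the input; that choice, along with all function names, is an assumption for illustration, since real heads use learned projections.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

def matmul(A, B):
    """Multiply matrix A (n x k) by matrix B (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) tuples, one tuple per head."""
    head_outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
                    for W_Q, W_K, W_V in heads]
    # Concatenate the per-head outputs token by token.
    concat = [sum((h[t] for h in head_outputs), []) for t in range(len(X))]
    # Final linear transformation.
    return matmul(concat, W_O)

# Three tokens with 4-dimensional embeddings (made-up values).
X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.5, 0.5]]

# Hypothetical projections: head 1 reads dims 0-1, head 2 reads dims 2-3.
select_first = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
select_last = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(select_first,) * 3, (select_last,) * 3]

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
Y = multi_head_attention(X, heads, identity)
print(Y)
```

Because each head works in its own 2-dimensional subspace, the two heads can produce different attention patterns over the same three tokens before their outputs are joined.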

Positional Encoding

Positional Encoding is a mechanism for informing the model of the order of tokens.[1]

Since Self-Attention processes all tokens simultaneously, the information about “which token appears at which position” — word order — is lost without additional help. Positional Encoding preserves word order by adding positional information to each token’s embedding vector.

Input embedding + Positional encoding = Input to model
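One concrete way to produce the positional term is the fixed sinusoidal scheme from the original paper [1]; learned position embeddings are another common choice. The sketch below assumes that sinusoidal scheme, and the function names and embedding values are illustrative.

```python
import math

def sinusoidal_positional_encoding(num_positions, d_model):
    """PE(pos, 2k) = sin(pos / 10000^(2k/d_model)); PE(pos, 2k+1) = cos(same angle)."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):  # i runs over the even indices 2k
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))
            if i + 1 < d_model:
                row.append(math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Input embedding + positional encoding = input to the model."""
    pe = sinusoidal_positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, row)] for emb, row in zip(embeddings, pe)]

# The same (made-up) token embedding repeated at three positions.
same_token = [0.1, 0.2, 0.3, 0.4]
model_input = add_positions([same_token, same_token, same_token])
for row in model_input:
    print(row)
```

After the addition, the three copies of the same token no longer have identical vectors, so the model can tell their positions apart.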

Encoder and Decoder Structure

The Transformer was originally designed for machine translation, so it consists of two structures: an Encoder and a Decoder.[1]

graph LR
    subgraph Encoder["Encoder (understanding context)"]
        E1["Self-Attention"]
        E2["Feed-Forward Network"]
        E1 --> E2
    end
    subgraph Decoder["Decoder (generating text)"]
        D1["Masked Self-Attention"]
        D2["Cross-Attention\n(references Encoder output)"]
        D3["Feed-Forward Network"]
        D1 --> D2 --> D3
    end
    Encoder --> Decoder
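The diagram does not spell out what the mask in Masked Self-Attention does. In the original design [1] it stops each position from attending to later positions, so the Decoder cannot look at tokens it has not generated yet. A minimal sketch, with an assumed function name and all-zero toy scores:

```python
import math

def causal_attention_weights(scores):
    """Mask a square score matrix causally, then apply softmax to each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only look at positions 0..i; later ones get -inf,
        # which softmax turns into a weight of exactly 0.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Equal raw scores, so any difference between rows comes from the mask alone.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_attention_weights(scores)
for row in weights:
    print(row)
```

The first row puts all its weight on position 0, the second splits it over positions 0 and 1, and only the last row sees all three positions.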

Modern models adopt parts of this structure and fall into three categories:

Architecture	Part Used	Representative Models	Best Tasks
Encoder-Only	Encoder only	BERT, RoBERTa	Text understanding, classification, NER
Decoder-Only	Decoder only	GPT series, Claude, Llama	Text generation, dialogue, code generation
Encoder-Decoder	Both	T5, BART, translation models	Translation, summarization, QA

Why the Transformer Changed AI

The impact of the Transformer on modern AI development can be summarized in two points:

Scalability through parallel training: Because relationships across tokens can be computed in parallel, Transformer models are easier to parallelize than sequential RNN-style models.[1]

Scaling-law research: Kaplan et al. reported that language-model loss improves predictably, following power-law trends, as model size, dataset size, and training compute increase. This observation is an important basis for planning large-scale model development.[2]
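One way to read "predictable" here is the power-law form that the scaling-law paper [2] uses for loss as a function of model size. The sketch below shows only the shape of such a curve; the constants `N_C` and `ALPHA` are hypothetical values chosen for illustration, not the fitted values from the paper.

```python
def power_law_loss(n, n_c, alpha):
    """Loss as a power law in model size: L(n) = (n_c / n) ** alpha."""
    return (n_c / n) ** alpha

# Hypothetical constants for illustration only (not fitted values).
N_C = 1e13
ALPHA = 0.1

sizes = [1e8, 2e8, 4e8, 8e8]  # model sizes, doubling each time
losses = [power_law_loss(n, N_C, ALPHA) for n in sizes]

# Under a power law, every doubling multiplies the loss by the same factor.
ratios = [later / earlier for earlier, later in zip(losses, losses[1:])]
print(losses)
print(ratios)
```

Under this form the improvement from each doubling is a constant factor of `2 ** -ALPHA`, which is what makes the gain from a larger model possible to estimate in advance.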

Summary

The Transformer is an architecture published in the 2017 paper “Attention Is All You Need”
Self-Attention calculates the relationships between all tokens at once
Relevance is calculated using three vectors — Q, K, V — to capture context
Multi-Head Attention runs several attention heads in parallel to capture context from multiple perspectives
Modern models fall into three types: Encoder-Only (BERT series), Decoder-Only (GPT series), Encoder-Decoder (T5, BART)
Its parallel training and scalability made it the foundation of modern LLMs

Frequently Asked Questions

Q: Are “Transformer” and “LLM” the same thing?

A: No. The Transformer is a neural network architecture (blueprint). An LLM (Large Language Model) is a large-scale model trained on massive amounts of text data, and most current LLMs are built on the Transformer. The Transformer is the “structure”; the LLM is the “product” built with it.

Q: What’s the difference between Self-Attention and Cross-Attention?

A: Self-Attention calculates relationships between tokens within the same sequence. In Cross-Attention, the queries come from the Decoder while the keys and values come from the Encoder’s output, so each position being generated can look at the whole input sequence. Cross-Attention is used in Encoder-Decoder models (such as translation models) to learn correspondences between input and output sentences.
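The difference comes down to where the inputs originate, which a short sketch can show. It assumes the Decoder states act directly as queries and the Encoder outputs act directly as keys and values; real models apply learned projections first, and the function names and values here are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_outputs):
    """Queries come from the decoder; keys and values come from the encoder."""
    d_k = len(encoder_outputs[0])
    result = []
    for q in decoder_states:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                           for k in encoder_outputs])
        result.append([sum(w * v[j] for w, v in zip(weights, encoder_outputs))
                       for j in range(d_k)])
    return result

# Three source tokens (encoder side) and two target positions (decoder side).
encoder_outputs = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
decoder_states = [[1.0, 0.0], [0.0, 1.0]]

out = cross_attention(decoder_states, encoder_outputs)
print(out)
```

The output has one row per Decoder position, and each row is a mixture of Encoder outputs. Self-Attention would be the special case where both arguments are the same sequence.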

Q: What happens without Positional Encoding?

A: Without word order information, “The cat chased the dog” and “The dog chased the cat” become indistinguishable. Positional Encoding is an essential element for grammatically and semantically correct text processing.
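This can be checked with a small experiment: run attention with no positional information on the same words in two different orders. The sketch shortens the sentences to three words and uses made-up 2-dimensional embeddings, with Q = K = V set to the embeddings themselves; all of these are simplifying assumptions.

```python
import math

def self_attention(X):
    """Self-attention with Q = K = V = X (no learned projections, no positions)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, X)) for j in range(d)])
    return out

# Hypothetical embeddings for three words.
emb = {"cat": [1.0, 0.0], "chased": [0.0, 1.0], "dog": [1.0, 1.0]}

sentence_a = ["cat", "chased", "dog"]
sentence_b = ["dog", "chased", "cat"]

out_a = dict(zip(sentence_a, self_attention([emb[w] for w in sentence_a])))
out_b = dict(zip(sentence_b, self_attention([emb[w] for w in sentence_b])))

for word in emb:
    print(word, out_a[word], out_b[word])
```

Each word receives the same output vector (up to floating-point rounding) in both orders, so nothing in the result records who chased whom.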

Q: How many Attention Heads is typical?

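A: It varies by model scale and architecture. The base model in the original paper used 8 heads with a model dimension of 512, so each head worked on 64-dimensional vectors.[1] More heads let the model gather information from more representation subspaces, but because the model dimension is split across heads, each head works with fewer dimensions; the total compute stays similar to that of a single full-width head.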

References

[1] Ashish Vaswani et al., Attention Is All You Need, June 12, 2017
[2] Jared Kaplan et al., Scaling Laws for Neural Language Models, January 23, 2020
