
Transformer Models

The Transformer is a neural network architecture published by a Google research team in 2017. By using only “Attention” to process sequences, it overcame the limitations of the RNNs and LSTMs that were mainstream before it, and has become the foundation for virtually all modern large language models (LLMs).

Target audience: Those who understand the basics of deep learning (neural networks, loss functions).

Estimated learning time: 30 minutes to read

Prerequisites: Must have read What Is Deep Learning?

Why the Transformer Emerged — The Limitations of RNNs/LSTMs

Before the Transformer, sequence data (text, audio, etc.) was processed with RNNs (Recurrent Neural Networks) or LSTMs (Long Short-Term Memory).

Cannot parallelize due to sequential processing: RNNs compute each step using the output of the previous step, so all steps must be processed in order. This means they can’t leverage the parallel computing capabilities of GPUs/TPUs, resulting in long training times.

Vanishing long-range dependencies: The longer the sentence, the harder it is for information from the beginning to influence processing at the end. For example, in “Tanaka-san, who worked at a large office in Tokyo yesterday, …”, the model struggles to recognize that the verb at the end of the sentence takes “Tanaka-san” as its subject.

| Issue | RNN/LSTM | Transformer |
|---|---|---|
| Parallel processing | Not possible (sequential) | Possible (processes all tokens simultaneously) |
| Long-range dependencies | Information lost over distance | Can be referenced directly via Attention |
| Training speed | Slow | Fast (due to parallel training) |
| Scalability | Limited | Easy to scale |

The Transformer was published in the following paper:

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS), 30.

The title “Attention Is All You Need” asserts that a high-performance model can be built using only Attention, without RNNs or CNNs.

Self-Attention

Self-Attention is a mechanism for understanding context by having each token in a sequence calculate its relevance to every other token.

Consider the sentence “The animal didn’t cross the street because it was too tired.”

Understanding what “it” refers to requires looking at the relationship to other words in the sentence. Self-Attention numerically calculates that the relevance between “it” and “animal” is high, enabling correct contextual interpretation.

In Self-Attention, each token is converted into three vectors.

| Vector | Role | Analogy |
|---|---|---|
| Query | Represents “what am I looking for?” | The search keywords a person uses to find a book in a library |
| Key | Represents “what kind of information am I?” | The spine title of a book in a library |
| Value | Represents “what information do I hold?” | The contents inside a library book |
```mermaid
graph LR
    Input["Input tokens\n(embedding vectors)"]
    Q["Query vector\n(transformed by WQ)"]
    K["Key vector\n(transformed by WK)"]
    V["Value vector\n(transformed by WV)"]
    Score["Attention Score\n(dot product of Q·K ÷ √dk)"]
    Softmax["Softmax\n(convert to probability distribution)"]
    Output["Output\n(weighted sum of Softmax × V)"]

    Input --> Q
    Input --> K
    Input --> V
    Q --> Score
    K --> Score
    Score --> Softmax
    Softmax --> Output
    V --> Output
```

Expressed as a formula:

Attention(Q, K, V) = softmax(Q·Kᵀ / √dk) × V
  • Q·Kᵀ: Calculates a score via the dot product of Query and Key (quantifies relevance)
  • / √dk: Scaling to prevent scores from becoming too large
  • softmax: Converts scores to a probability distribution summing to 1
  • × V: Weighted sum of Values according to probabilities (takes in more information from highly relevant items)
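The steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration: the weight matrices WQ, WK, WV, the random input, and the dimensions are illustrative, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, WQ, WK, WV):
    """Single-head scaled dot-product Self-Attention over token embeddings X."""
    Q = X @ WQ                            # queries: "what am I looking for?"
    K = X @ WK                            # keys:    "what kind of information am I?"
    V = X @ WV                            # values:  "what information do I hold?"
    dk = K.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)        # relevance of every token to every other
    weights = softmax(scores, axis=-1)    # each row is a probability distribution
    return weights @ V                    # weighted sum of Values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))              # 5 tokens, embedding dimension 16
WQ, WK, WV = (rng.normal(size=(16, 16)) for _ in range(3))
out = self_attention(X, WQ, WK, WV)
print(out.shape)  # (5, 16): one context-aware vector per token
```

Note that the output has the same shape as the input: each token's vector has simply been replaced by a context-weighted mixture of all tokens' Value vectors.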

Multi-Head Attention

Multi-Head Attention is a mechanism that runs Self-Attention multiple times in parallel to simultaneously capture context from different perspectives.

```mermaid
graph TD
    Input["Input"]
    H1["Attention Head 1\n(captures syntactic relations)"]
    H2["Attention Head 2\n(captures semantic relations)"]
    H3["Attention Head 3\n(captures long-range dependencies)"]
    HN["Attention Head N\n(...)"]
    Concat["Concatenate"]
    Linear["Linear transformation"]
    Output["Output"]

    Input --> H1
    Input --> H2
    Input --> H3
    Input --> HN
    H1 --> Concat
    H2 --> Concat
    H3 --> Concat
    HN --> Concat
    Concat --> Linear
    Linear --> Output
```

Each head learns different kinds of relationships (syntactic, semantic, coreference, etc.), enabling richer contextual understanding.
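The split–attend–concatenate–project pipeline can be sketched as follows. As in the paper, the model dimension is divided evenly among the heads; the head count, dimensions, and random weight matrices here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, WQ, WK, WV, WO):
    """Split d_model into n_heads subspaces, attend in each, concat, project."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project, then reshape to (n_heads, seq_len, d_head) so heads run in parallel
    Q = (X @ WQ).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ WK).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ WV).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores, axis=-1) @ V                 # each head attends independently
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads
    return concat @ WO                                   # final linear transformation

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 32))                             # 5 tokens, d_model = 32
WQ, WK, WV, WO = (rng.normal(size=(32, 32)) for _ in range(4))
out = multi_head_attention(X, 4, WQ, WK, WV, WO)         # 4 heads of dimension 8
print(out.shape)  # (5, 32)
```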

Positional Encoding

Positional Encoding is a mechanism for informing the model of the order of tokens.

Since Self-Attention processes all tokens simultaneously, the information about “which token appears at which position” — word order — is lost without additional help. Positional Encoding preserves word order by adding positional information to each token’s embedding vector.

Input embedding + Positional encoding = Input to model
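The original paper encodes each position with fixed sine and cosine waves of different frequencies. A minimal sketch of that scheme and the addition step (sequence length and embedding dimension here are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal encoding from the original paper: even dims get sin, odd dims cos."""
    pos = np.arange(seq_len)[:, None]            # positions 0 .. seq_len-1
    i = np.arange(d_model // 2)[None, :]         # index of each (sin, cos) dimension pair
    angles = pos / (10000 ** (2 * i / d_model))  # one frequency per dimension pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

embeddings = np.random.default_rng(0).normal(size=(5, 16))        # 5 tokens, dim 16
model_input = embeddings + sinusoidal_positional_encoding(5, 16)  # order info added
print(model_input.shape)  # (5, 16)
```

Because the frequencies are fixed rather than learned, the same function generalizes to sequence lengths not seen during training.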

Encoder-Decoder Architecture

The Transformer was originally designed for machine translation, so it consists of two structures: an Encoder and a Decoder.

```mermaid
graph LR
    subgraph Encoder["Encoder (understanding context)"]
        E1["Self-Attention"]
        E2["Feed-Forward Network"]
        E1 --> E2
    end
    subgraph Decoder["Decoder (generating text)"]
        D1["Masked Self-Attention"]
        D2["Cross-Attention\n(references Encoder output)"]
        D3["Feed-Forward Network"]
        D1 --> D2 --> D3
    end
    Encoder --> Decoder
```
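The Decoder's Masked Self-Attention can be illustrated with a toy causal mask: future positions are set to −∞ before the softmax, so each token attends only to itself and earlier tokens. The uniform dummy scores below are illustrative.

```python
import numpy as np

def causal_mask(seq_len):
    """Upper-triangular mask: token i may attend only to positions <= i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.zeros((4, 4))            # toy attention scores for 4 tokens
masked = scores + causal_mask(4)     # future positions become -inf
# Softmax turns -inf into probability 0, so future tokens are ignored
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights.round(2))              # row i spreads weight evenly over positions 0..i
```

This masking is what lets a Decoder-only model be trained to predict the next token without “seeing the future.”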

Modern models adopt parts of this structure and fall into three categories:

| Architecture | Part Used | Representative Models | Best Tasks |
|---|---|---|---|
| Encoder-Only | Encoder only | BERT, RoBERTa | Text understanding, classification, NER |
| Decoder-Only | Decoder only | GPT series, Claude, Llama | Text generation, dialogue, code generation |
| Encoder-Decoder | Both | T5, BART, translation models | Translation, summarization, QA |

The impact of the Transformer on modern AI development can be summarized in two points:

Scalability through parallel training: Since all tokens are processed simultaneously, it can fully leverage the computing power of GPUs/TPUs. This made it possible to train large-scale models with billions or hundreds of billions of parameters in realistic timeframes.

Discovery of scaling laws: The Transformer demonstrated that performance improves as model size, data volume, and compute are increased — the “scaling laws.” This provided the justification for developing large-scale models like GPT-3 (175 billion parameters) and beyond.

  • The Transformer is an architecture published in the 2017 paper “Attention Is All You Need”
  • Self-Attention calculates the relationships between all tokens at once
  • Relevance is calculated using three vectors — Q, K, V — to capture context
  • Multi-Head Attention simultaneously understands context from multiple perspectives
  • Three types: Encoder-Only (BERT series), Decoder-Only (GPT series), Encoder-Decoder (translation)
  • Its parallel training and scalability made it the foundation of modern LLMs

Q: Are “Transformer” and “LLM” the same thing?

A: No. The Transformer is a neural network architecture (blueprint). An LLM (Large Language Model) is a large-scale model trained on massive amounts of text data, based on the Transformer. The Transformer is the “structure” of an LLM; the LLM is the “product” built using the Transformer.

Q: What’s the difference between Self-Attention and Cross-Attention?

A: Self-Attention calculates relationships between tokens within the same sequence. Cross-Attention calculates relationships between the Encoder’s output and the Decoder’s input. Cross-Attention is used in Encoder-Decoder models (such as translation models) to learn correspondences between input and output sentences.

Q: What happens without Positional Encoding?

A: Without word order information, “The cat chased the dog” and “The dog chased the cat” become indistinguishable. Positional Encoding is an essential element for grammatically and semantically correct text processing.

Q: How many Attention Heads is typical?

A: It varies by model scale. The original Transformer used 8 heads; GPT-3 used 96 heads. More heads allow parallel learning of different kinds of relationships, but compute cost also increases.


Next step: BERT vs. GPT