Transformer Models

About 10 minutes

Prerequisites: Must have read What Is Deep Learning?

The Transformer is a neural network architecture published by a Google research team in 2017. By using “Attention” to process sequences, it avoided the sequential-processing constraints of the RNNs and LSTMs that were mainstream before it, and has become the foundation for many large language models (LLMs).[1]

Why the Transformer Emerged — The Limitations of RNNs/LSTMs

Before the Transformer, sequence data (text, audio, etc.) was commonly processed with RNNs (Recurrent Neural Networks) or LSTMs (Long Short-Term Memory).[1]

Hard to parallelize due to sequential processing: RNNs compute each step from the hidden state produced by the previous step, so the steps within a sequence must be processed in order. The Transformer paper describes this sequential nature as a barrier to parallelization within a training example, which matters more as sequences get longer.[1]

Weakening long-range dependencies: The longer the sentence, the harder it becomes for information from the beginning to reach the end, because it has to pass through every intermediate step. For example, in a sentence that begins “Tanaka-san, who worked at a large office in Tokyo yesterday, …”, the model can struggle to link the verb at the end back to “Tanaka-san.”

| Issue | RNN/LSTM | Transformer |
| --- | --- | --- |
| Parallel processing | Not possible (sequential) | Possible (processes all tokens simultaneously) |
| Long-range dependencies | Information lost over distance | Can be referenced directly via Attention |
| Training speed | Slow | Fast (due to parallel training) |
| Scalability | Limited | Easy to scale |
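The sequential constraint in the first row can be sketched with a toy recurrent update. This is only an illustration, assuming a plain tanh RNN cell with random weights; `rnn_forward`, `W_x`, and `W_h` are hypothetical names chosen for this example.

```python
import numpy as np

def rnn_forward(x, W_x, W_h):
    """Run a toy tanh RNN over x of shape (seq_len, d_in)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x:  # step t cannot start until step t-1 has produced h
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))    # 5 tokens, 4 features each
W_x = rng.normal(size=(3, 4))  # input-to-hidden weights (illustrative)
W_h = rng.normal(size=(3, 3))  # hidden-to-hidden weights (illustrative)
states = rnn_forward(x, W_x, W_h)
print(states.shape)  # (5, 3)
```

The `for` loop is the bottleneck: each iteration needs the `h` from the previous one, so the time steps of a single sequence cannot be computed side by side.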

The Transformer was published in the following paper:

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS), 30.

The title “Attention is All You Need” describes a sequence-transduction design based on attention mechanisms without using recurrent or convolutional layers.[1]

Self-Attention is a mechanism for representing context by having each token in a sequence calculate its relevance to every other token.[1]

Consider the sentence “The animal didn’t cross the street because it was too tired.”

Understanding what “it” refers to requires looking at the relationship to other words in the sentence. Self-Attention numerically calculates that the relevance between “it” and “animal” is high, enabling correct contextual interpretation.

In Self-Attention, each token is converted into three vectors.

| Vector | Role | Analogy |
| --- | --- | --- |
| Query | Represents “what am I looking for?” | The search keywords a person uses to find a book in a library |
| Key | Represents “what kind of information am I?” | The spine title of a book in a library |
| Value | Represents “what information do I hold?” | The contents inside a library book |

graph LR
    Input["Input tokens\n(embedding vectors)"]
    Q["Query vector\n(transformed by WQ)"]
    K["Key vector\n(transformed by WK)"]
    V["Value vector\n(transformed by WV)"]
    Score["Attention Score\n(dot product of Q·K ÷ √dk)"]
    Softmax["Softmax\n(convert to probability distribution)"]
    Output["Output\n(weighted sum of Softmax × V)"]

    Input --> Q
    Input --> K
    Input --> V
    Q --> Score
    K --> Score
    Score --> Softmax
    Softmax --> Output
    V --> Output

Expressed as a formula:

Attention(Q, K, V) = softmax(Q·Kᵀ / √dk) × V
  • Q·Kᵀ: Calculates a score via the dot product of Query and Key (quantifies relevance)
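  • / √dk: Divides the scores by the square root of the Key dimension dk, so that large dot products do not push softmax into regions where gradients become very small[1]
  • softmax: Converts each token's scores into weights that sum to 1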
  • × V: Weighted sum of Values according to probabilities (takes in more information from highly relevant items)
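The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, assuming random stand-in embeddings and projection matrices; the function names and the shapes (4 tokens, dimension 8) are choices made for this example, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys) relevance scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted sum of Values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, embedding size 8 (illustrative)
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
output, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(output.shape, weights.sum(axis=-1))
```

Row `i` of `weights` shows how much token `i` draws from every token in the sequence, which is the relevance the “it” and “animal” example refers to.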

Multi-Head Attention is a mechanism that runs attention multiple times in parallel so the model can attend to information from different representation subspaces.[1]

graph TD
    Input["Input"]
    H1["Attention Head 1\n(captures syntactic relations)"]
    H2["Attention Head 2\n(captures semantic relations)"]
    H3["Attention Head 3\n(captures long-range dependencies)"]
    HN["Attention Head N\n(...)"]
    Concat["Concatenate"]
    Linear["Linear transformation"]
    Output["Output"]

    Input --> H1
    Input --> H2
    Input --> H3
    Input --> HN
    H1 --> Concat
    H2 --> Concat
    H3 --> Concat
    HN --> Concat
    Concat --> Linear
    Linear --> Output

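Each head can learn a different kind of relationship (syntactic, semantic, coreference, etc.), enabling richer contextual understanding. The labels in the diagram are illustrative: what a head ends up capturing emerges from training rather than being assigned in advance.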
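The split, attend, concatenate, and project steps can be sketched as follows. This is a simplified illustration, assuming the model dimension divides evenly by the number of heads and that each head is a slice of one shared projection matrix; `multi_head_attention` and the weight names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    n, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model % num_heads == 0

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (num_heads, n, n)
    heads = softmax(scores) @ V                            # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate the heads
    return concat @ W_O                                    # final linear transformation

rng = np.random.default_rng(0)
d_model, num_heads = 16, 4
X = rng.normal(size=(5, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads)
print(out.shape)  # (5, 16)
```

Each head sees only a `d_head`-sized slice of the projected vectors, so the heads compute their attention weights independently before the results are merged.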

Positional Encoding is a mechanism for informing the model of the order of tokens.[1]

Since Self-Attention processes all tokens simultaneously, the information about “which token appears at which position” — word order — is lost without additional help. Positional Encoding preserves word order by adding positional information to each token’s embedding vector.

Input embedding + Positional encoding = Input to model
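One concrete scheme is the fixed sinusoidal encoding used in the original paper [1], where even dimensions use a sine and odd dimensions use a cosine of the position at different frequencies. The sketch below assumes an even `d_model` and random stand-in embeddings; many later models use learned or other positional schemes instead.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 8))  # 6 tokens, embedding size 8 (illustrative)
model_input = embeddings + sinusoidal_positional_encoding(6, 8)
print(model_input.shape)  # (6, 8)
```

Every position gets a distinct vector, so the same token at two different positions reaches the model with two different inputs.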

The Transformer was originally designed for machine translation, so it consists of two structures: an Encoder and a Decoder.[1]

graph LR
    subgraph Encoder["Encoder (understanding context)"]
        E1["Self-Attention"]
        E2["Feed-Forward Network"]
        E1 --> E2
    end
    subgraph Decoder["Decoder (generating text)"]
        D1["Masked Self-Attention"]
        D2["Cross-Attention\n(references Encoder output)"]
        D3["Feed-Forward Network"]
        D1 --> D2 --> D3
    end
    Encoder --> Decoder

Modern models adopt parts of this structure and fall into three categories:

| Architecture | Part Used | Representative Models | Best Tasks |
| --- | --- | --- | --- |
| Encoder-Only | Encoder only | BERT, RoBERTa | Text understanding, classification, NER |
| Decoder-Only | Decoder only | GPT series, Claude, Llama | Text generation, dialogue, code generation |
| Encoder-Decoder | Both | T5, BART, translation models | Translation, summarization, QA |
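The Masked Self-Attention step in the Decoder is what allows text to be generated left to right: each position may only attend to itself and earlier positions. A minimal sketch, assuming the common approach of setting the scores for later positions to negative infinity before softmax; the function name and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # block attention to later positions
    weights = softmax(scores)                           # masked entries become exactly 0
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
output, weights = masked_self_attention(Q, K, V)
print(np.round(weights, 2))  # lower-triangular: row i only uses positions 0..i
```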

The impact of the Transformer on modern AI development can be summarized in two points:

Scalability through parallel training: Because relationships across tokens can be computed in parallel, Transformer models are easier to parallelize than sequential RNN-style models.[1]

Scaling-law research: Language-model research has reported predictable improvements as model size, dataset size, and compute increase. This observation is an important basis for thinking about large-scale model development.[2]
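One relationship reported in [2] is a power law in parameter count N when data and compute are not the limiting factor, of the form L(N) = (N_c / N)^α. The sketch below only illustrates the shape of such a curve: `N_C` and `ALPHA` are hypothetical placeholder values, not the fitted constants from the paper.

```python
def power_law_loss(n_params, n_c, alpha):
    """Loss under an assumed power law L(N) = (N_c / N) ** alpha."""
    return (n_c / n_params) ** alpha

N_C = 1e14    # hypothetical scale constant, for illustration only
ALPHA = 0.08  # hypothetical exponent, for illustration only

for n in (1e8, 1e9, 1e10):
    print(f"{n:.0e} parameters -> loss {power_law_loss(n, N_C, ALPHA):.3f}")
```

Under this form, every tenfold increase in N multiplies the loss by the same factor, 10 to the power of minus alpha, which is what makes the improvement predictable.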

  • The Transformer is an architecture published in the 2017 paper “Attention Is All You Need”
  • Self-Attention calculates the relationships between all tokens at once
  • Relevance is calculated using three vectors — Q, K, V — to capture context
  • Multi-Head Attention simultaneously understands context from multiple perspectives
  • Three types: Encoder-Only (BERT series), Decoder-Only (GPT series), Encoder-Decoder (translation)
  • Its parallel training and scalability made it the foundation of modern LLMs

Q: Are “Transformer” and “LLM” the same thing?

A: No. The Transformer is a neural network architecture (blueprint). An LLM (Large Language Model) is a large-scale model trained on massive amounts of text data, based on the Transformer. The Transformer is the “structure” of an LLM; the LLM is the “product” built using the Transformer.

Q: What’s the difference between Self-Attention and Cross-Attention?

A: Self-Attention calculates relationships between tokens within the same sequence, so Query, Key, and Value all come from that one sequence. In Cross-Attention, the Query comes from the Decoder while the Key and Value come from the Encoder’s output, so each Decoder position can look up relevant parts of the input.[1] Cross-Attention is used in Encoder-Decoder models (such as translation models) to learn correspondences between input and output sentences.
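The difference shows up in where the three projections take their input from. A minimal sketch, assuming random stand-in states with 7 source tokens and 3 target tokens; `cross_attention` and the weight names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, W_Q, W_K, W_V):
    Q = decoder_states @ W_Q  # queries come from the Decoder side
    K = encoder_output @ W_K  # keys come from the Encoder output
    V = encoder_output @ W_V  # values come from the Encoder output
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
encoder_output = rng.normal(size=(7, 8))  # 7 source tokens (illustrative)
decoder_states = rng.normal(size=(3, 8))  # 3 target tokens (illustrative)
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
context, weights = cross_attention(decoder_states, encoder_output, W_Q, W_K, W_V)
print(context.shape, weights.shape)  # (3, 8) (3, 7)
```

The weight matrix is rectangular: one row per target token and one column per source token, unlike the square matrix that Self-Attention produces.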

Q: What happens without Positional Encoding?

A: Without word order information, “The cat chased the dog” and “The dog chased the cat” become indistinguishable. Positional Encoding is an essential element for grammatically and semantically correct text processing.
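This can be checked numerically. The sketch below, assuming unmasked single-head Self-Attention with random stand-in embeddings, shuffles the token order and shows that each token receives exactly the same output vector as before, just in a different row.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 token embeddings with no positional encoding
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
perm = np.array([4, 2, 0, 3, 1])  # the same tokens in a different order

out = self_attention(X, W_Q, W_K, W_V)
out_shuffled = self_attention(X[perm], W_Q, W_K, W_V)
print(np.allclose(out[perm], out_shuffled))  # True
```

Because reordering the input only reorders the output, nothing in the layer itself reflects which token came first.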

Q: How many Attention Heads is typical?

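A: It varies by model scale and architecture. As one reference point, the base model in the original paper uses 8 heads with a model dimension of 512, so each head works in a 64-dimensional subspace.[1] Because the per-head dimension shrinks as heads are added, the total cost stays close to that of single-head attention at full dimension; the tradeoff is mainly how narrow each head's subspace becomes.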

  1. Ashish Vaswani et al., Attention Is All You Need, June 12, 2017
  2. Jared Kaplan et al., Scaling Laws for Neural Language Models, January 23, 2020