What Is Deep Learning?

Deep learning is a machine learning technique that uses multi-layer neural networks to learn useful features automatically from data. AI, machine learning, and deep learning form a nested hierarchy: deep learning is one technique within the broader field of machine learning.

The Nested Relationship: AI, ML, and Deep Learning

graph TD
    AI["AI (Artificial Intelligence)\nAll techniques that replicate human intelligence"]
    ML["Machine Learning\nAutomatic pattern learning from data"]
    DL["Deep Learning\nMulti-layer neural networks for automatic feature extraction"]
    AI --> ML
    ML --> DL

Deep learning is one of the most powerful techniques within machine learning, but it is not the best fit for every problem. When data is scarce or interpretability is important, traditional machine learning methods often perform better.

A neural network is a mathematical model inspired by the connected structure of neurons (nerve cells) in the human brain. A large number of “nodes (units)” are arranged in layers, and adjacent layers are connected to process information.

graph LR
    subgraph IL["Input Layer"]
        I1["x₁"]
        I2["x₂"]
        I3["x₃"]
    end
    subgraph HL["Hidden Layers (multiple)"]
        H1["h₁"]
        H2["h₂"]
        H3["h₃"]
    end
    subgraph OL["Output Layer"]
        O1["y"]
    end
    I1 --> H1
    I1 --> H2
    I1 --> H3
    I2 --> H1
    I2 --> H2
    I2 --> H3
    I3 --> H1
    I3 --> H2
    I3 --> H3
    H1 --> O1
    H2 --> O1
    H3 --> O1

| Layer Type | Role |
| --- | --- |
| Input Layer | Receives the data: pixel values, numbers, and other features are fed in here |
| Hidden Layers | Progressively extract features from the input; this is the "deep" part |
| Output Layer | Produces the final prediction or classification result |

“Deep” refers to having many hidden layers. More layers allow the network to learn increasingly abstract features.

In image recognition, for example, shallow layers learn edges and color patterns, while deeper layers learn higher-level features such as “eyes,” “nose,” and “face.”
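The flow through the diagram above can be sketched in plain Python: each hidden unit computes a weighted sum of all inputs, applies an activation function (ReLU here), and the output unit combines the hidden activations. The weights below are made-up toy values; in practice they are learned during training.

```python
def relu(x):
    # Common activation function: pass positives through, clamp negatives to 0
    return max(0.0, x)

def forward(x, w_hidden, b_hidden, w_out, b_out):
    # Hidden layer: every unit sees every input (fully connected)
    h = [relu(sum(wi * xi for wi, xi in zip(w, x)) + b)
         for w, b in zip(w_hidden, b_hidden)]
    # Output layer: weighted sum of the hidden activations
    return sum(wo * hi for wo, hi in zip(w_out, h)) + b_out

# Toy weights for a 3-input, 3-hidden-unit, 1-output network (not trained)
w_hidden = [[0.2, -0.5, 0.1],
            [0.4, 0.3, -0.2],
            [-0.1, 0.6, 0.7]]
b_hidden = [0.0, 0.1, -0.2]
w_out = [0.5, -0.3, 0.8]
b_out = 0.05

y = forward([1.0, 2.0, 3.0], w_hidden, b_hidden, w_out, b_out)
print(y)
```

Training a real network means adjusting these weights (via backpropagation) so the output matches known answers; the forward pass itself is just this chain of weighted sums.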

Differences Between Traditional ML and Deep Learning


The biggest difference is whether feature engineering is performed by humans or learned automatically.

| | Traditional Machine Learning | Deep Learning |
| --- | --- | --- |
| Feature design | Designed manually by humans | Learned automatically from data |
| Data requirements | Can work with relatively small amounts | Requires large amounts of data |
| Compute resources | Runs on modest hardware | Requires high compute (e.g., GPUs) |
| Interpretability | High (e.g., decision trees) | Low (black-box problem) |
| Strengths | Structured (tabular) data | Unstructured data: images, audio, text |

Think of a cooking recipe.

  • Traditional machine learning: A food researcher (a human) pre-selects the important features of a recipe (ingredients used, cooking time, calories, and so on) before training.
  • Deep learning: Show the model tens of thousands of food photos and let it discover on its own what features determine whether a dish is delicious.
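The feature-design difference can be made concrete with a small sketch. Here a hypothetical raw sensor signal stands in for the recipe: in traditional ML, a human writes the feature-extraction step; in deep learning, the raw values go in directly and the network learns its own internal features.

```python
# Hypothetical raw measurement; the specific numbers are just an illustration
raw_signal = [0.1, 0.4, 0.35, 0.8, 0.2]

def hand_crafted_features(signal):
    """Human-designed features (traditional ML): a person decided that
    the mean, maximum, and range are what matters."""
    return {
        "mean": sum(signal) / len(signal),
        "max": max(signal),
        "range": max(signal) - min(signal),
    }

# Traditional ML: these summaries are fed to e.g. a decision tree
features = hand_crafted_features(raw_signal)

# Deep learning: no manual step; the raw values feed the network,
# which discovers useful internal features during training
deep_learning_input = raw_signal
```

The trade-off from the table applies here too: the hand-crafted version is easy to inspect and works with little data, while the learned version needs many examples to discover good features on its own.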

Image Recognition (CNN: Convolutional Neural Network)


A CNN (Convolutional Neural Network) is a network architecture designed to process image data. It efficiently learns spatial patterns within images.

  • Object detection and classification (accuracy on the ImageNet dataset has surpassed human-level performance)
  • Medical image diagnosis (detecting diseases from X-ray and MRI images)
  • Autonomous driving (recognizing roads, pedestrians, and signs)
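The core operation a CNN uses to learn spatial patterns is the convolution: a small kernel slides across the image and responds strongly where its pattern appears. A minimal sketch in plain Python, using a hand-written vertical-edge kernel (in a real CNN, the kernel values are learned, not chosen by hand):

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (valid padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# Tiny grayscale "image" with a vertical edge (dark left, bright right)
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

# Vertical-edge detector: responds where brightness jumps left-to-right
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d(image, kernel)
print(feature_map)  # strong response (2) exactly along the edge column
```

This matches the earlier point about layers: shallow CNN layers hold edge-like kernels such as this one, while deeper layers combine their outputs into higher-level patterns like "eyes" or "faces".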

Natural Language Processing (Transformer / LLM)


A Transformer is a neural network architecture introduced in the 2017 paper “Attention Is All You Need.” Its attention mechanism efficiently processes long-range dependencies in text. Virtually all modern large language models are based on the Transformer.
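The heart of the attention mechanism is scaled dot-product attention. A minimal sketch for a single query, in plain Python with toy vectors (real Transformers do this for every token against every other token, in parallel, with learned projections):

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Score each key against the query, turn the scores into weights
    with softmax, and return the weighted mix of the value vectors.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Toy example: one query attends over three key/value pairs.
# The query matches keys 1 and 3 equally, so their values get equal weight.
out = attention([1.0, 0.0],
                keys=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                values=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

The "long-range" benefit comes from the fact that every position can attend to every other position directly, regardless of distance, rather than passing information step by step through a sequence.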

An LLM (Large Language Model) is a Transformer-based language model pre-trained on massive amounts of text data.

  • GPT series (OpenAI): Text generation and code completion
  • Claude (Anthropic): Dialogue, document analysis, and coding assistance
  • Gemini (Google): Multimodal processing (text, images, and video combined)

Speech recognition converts audio waveforms into text. It is used in voice assistants on smartphones and automatic caption generation.

LLMs learn and run inference through the following process.

graph LR
    PT["Massive text data\n(Pre-training)"] --> LLM["LLM\n(Transformer-based)"]
    LLM --> FT["Fine-tuning\nfor specific tasks"]
    FT --> APP["Applications\n(dialogue, translation, summarization, etc.)"]
  1. Pre-training: Train on large amounts of internet text by learning to “predict the next token (word fragment)”
  2. Fine-tuning: Additional training for a specific use case, such as conversation
  3. Inference: Generate text in response to user input
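The "predict the next token" objective from step 1 can be illustrated with a deliberately tiny stand-in: a bigram model that counts which token follows which. Real LLMs use Transformers over word fragments rather than counts over whole words, but the training signal (observed next tokens) and the inference step (pick a likely continuation) are the same idea.

```python
from collections import Counter, defaultdict

# "Pre-training" in miniature: a toy corpus of observed text
corpus = "the cat sat on the mat the cat ran".split()

# Count how often each token follows each other token
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    """Inference: return the continuation seen most often in training."""
    return following[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice, "mat" once -> "cat"
```

A bigram model only looks one token back; the Transformer's attention mechanism is what lets an LLM condition its prediction on the entire preceding context instead.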

Q: Is deep learning always better than traditional machine learning?

A: No. Deep learning requires large amounts of data and high compute resources. When data is scarce, or when working with structured (tabular) data, traditional methods like Random Forest or XGBoost often achieve higher accuracy.

Q: Can I do deep learning without a GPU?

A: Training will be slow, but it can run on a CPU. Running inference (generating predictions from a trained model) is fast enough on a CPU for many use cases. Cloud services like Google Colab also provide free GPU access.

Q: Are LLMs and chatbots the same thing?

A: No. An LLM is a model (engine) for understanding and generating language. A chatbot is an application that uses an LLM to provide a conversational interface. The same LLM can power many applications beyond chatbots: translation, summarization, code generation, and more.


Next: Learning Paradigms
