Embeddings & Vector Representations

Reading time: About 10 minutes

Who this is for: People who want a deeper understanding of how RAG works, and those who want to know how to choose an embedding model

Prerequisites: Familiarity with the basic flow described in What Is RAG?

An embedding is a technique for representing the meaning of text or words as a list of numbers (a vector). In RAG, both documents and questions are converted into the same vector space, making it possible to find “semantically similar documents” through fast numerical computation.

What Are Embeddings?

A useful analogy is to think of embeddings as “coordinates in meaning-space.”

On a map, you can look up the coordinates of two cities and calculate the distance between them numerically. Embeddings do the same thing for text, but in a high-dimensional space. For example, “cat” and “dog” are semantically close, so their vectors in embedding space will be close together. “Cat” and “spaceship” are semantically distant, so their vectors will be far apart.
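The analogy can be made concrete with a toy example. The sketch below uses hand-picked 2-D coordinates (an assumption for illustration, not output from any real model) and measures plain Euclidean distance, just as one would between two points on a map.

```python
import math

# Made-up 2-D coordinates, chosen by hand for illustration only.
# Real embeddings come from a model and have far more dimensions.
toy_space = {
    "cat": (1.0, 1.2),
    "dog": (1.3, 1.0),
    "spaceship": (8.0, 7.5),
}

def distance(word_a: str, word_b: str) -> float:
    """Euclidean distance between two words in the toy space."""
    return math.dist(toy_space[word_a], toy_space[word_b])

cat_dog = distance("cat", "dog")
cat_spaceship = distance("cat", "spaceship")
print(f"cat-dog: {cat_dog:.2f}, cat-spaceship: {cat_spaceship:.2f}")
```

With these coordinates, "cat" lands much closer to "dog" than to "spaceship", which is the property a real embedding model aims to provide across hundreds or thousands of dimensions.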

Real embedding vectors are made up of hundreds to thousands of numbers. For example, OpenAI’s text-embedding-3-small represents one piece of text using 1536 numbers — a 1536-dimensional vector.[1]

# Python 3.11+
# pip install openai
from openai import OpenAI

client = OpenAI()  # Uses the OPENAI_API_KEY environment variable

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="RAG stands for Retrieval-Augmented Generation"
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")          # Output: Dimensions: 1536
print(f"First 5 elements: {vector[:5]}")     # Output: [-0.021, 0.034, ...]

Text-to-Vector Conversion Flow

In RAG, embeddings are used both when documents are indexed and when a query is made.

graph TD
    A["Text (document or question)"] --> B["Embedding model"]
    B --> C["Vector (list of numbers)"]
    C --> D{"Purpose"}
    D -->|"Indexing documents"| E["Store in vector DB"]
    D -->|"Searching with a query"| F["Compute similarity"]
    F --> G["Retrieve nearest vectors"]
    G --> H["Pass relevant documents to LLM"]

When indexing documents, each chunk (a piece of a split document) is converted to a vector and stored in the vector DB. When a user asks a question, that question is also converted to a vector, and the system computes the distance between that vector and all stored vectors to retrieve the most similar (semantically closest) chunks.
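The index-then-query flow described above can be sketched with a brute-force, in-memory stand-in for a vector DB. The class name TinyVectorIndex and the 3-D vectors are hypothetical and exist only for illustration; a real system would obtain vectors from an embedding model and would typically use an approximate nearest-neighbor index rather than comparing against every stored vector.

```python
import numpy as np

class TinyVectorIndex:
    """Brute-force in-memory stand-in for a vector DB (illustration only)."""

    def __init__(self):
        self.chunks = []
        self.vectors = []

    def add(self, chunk, vector):
        # Store unit-length vectors so a dot product equals cosine similarity.
        v = np.asarray(vector, dtype=float)
        self.chunks.append(chunk)
        self.vectors.append(v / np.linalg.norm(v))

    def search(self, query_vector, top_k=2):
        # Normalize the query, score it against every stored vector,
        # and return the top_k chunks sorted by descending score.
        q = np.asarray(query_vector, dtype=float)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.vectors) @ q
        best = np.argsort(scores)[::-1][:top_k]
        return [(self.chunks[i], float(scores[i])) for i in best]

# Hand-made 3-D vectors stand in for real embedding model output.
index = TinyVectorIndex()
index.add("RAG combines retrieval with generation", [0.9, 0.1, 0.0])
index.add("Cats sleep most of the day", [0.0, 0.2, 0.9])
index.add("Vector DBs store embeddings", [0.7, 0.6, 0.1])

results = index.search([1.0, 0.2, 0.0], top_k=2)
for chunk, score in results:
    print(f"{score:.3f}  {chunk}")
```

The query vector points in roughly the same direction as the first and third chunks, so those two are returned and the unrelated chunk is left out.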

Important: The same embedding model must be used for both indexing and querying. Vectors produced by different models cannot be compared.

How Cosine Similarity Works

The most commonly used method for measuring “closeness” between vectors is cosine similarity.

Cosine similarity measures how similar the “direction” of two vectors is, on a scale from -1 to 1.

Close to 1.0: Very similar (e.g., “how to raise a dog” vs. “methods for keeping a dog”)
Close to 0: Little relationship (e.g., “how to raise a dog” vs. “history of space exploration”)
Close to -1: Vectors point in opposite directions (almost never occurs in practice with text embeddings)
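For readers who want the definition behind these scores, cosine similarity is the dot product of two vectors divided by the product of their lengths. The symbols a, b, and n (the number of dimensions) are notation introduced here, not taken from the original text.

```latex
\[
\operatorname{cos\_sim}(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]

% Worked example with a = (1, 0) and b = (1, 1):
\[
\operatorname{cos\_sim}(\mathbf{a}, \mathbf{b})
  = \frac{1 \cdot 1 + 0 \cdot 1}{\sqrt{1} \, \sqrt{2}}
  = \frac{1}{\sqrt{2}} \approx 0.707
\]
```

Because both lengths are divided out, only the angle between the vectors affects the score, not their magnitudes.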

import numpy as np

def cosine_similarity(vec_a, vec_b):
    """Compute cosine similarity between two vectors"""
    dot_product = np.dot(vec_a, vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    return dot_product / (norm_a * norm_b)

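# Toy vectors for illustration; in practice these come from the embedding model
vector_doc = np.array([0.2, 0.8, 0.1])
vector_query = np.array([0.25, 0.7, 0.2])

similarity = cosine_similarity(vector_doc, vector_query)
print(f"Similarity: {similarity:.4f}")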

Most vector DBs compute similarity internally and return results sorted by score. Memorizing the cosine similarity formula is not necessary — the key intuition is “higher score = semantically closer.”

Popular Embedding Models

OpenAI Embeddings

OpenAI’s text-embedding-3-small and text-embedding-3-large are embedding models that support multiple languages including Japanese.[1]

text-embedding-3-small: 1536 dimensions, low cost, multilingual, ideal for prototypes
text-embedding-3-large: 3072 dimensions, higher accuracy for production environments

Cohere Embed

Cohere’s embed-multilingual-v3.0 is an embedding model for multilingual search, and its input_type parameter distinguishes inputs such as search documents and search queries.[2]

1024 dimensions
Supports 100+ languages
Uses separate search_document and search_query input types for indexing and querying

# pip install cohere
import os

import cohere

# Reads the API key from the COHERE_API_KEY environment variable
co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

# For indexing: input_type="search_document"
doc_response = co.embed(
    texts=["RAG stands for Retrieval-Augmented Generation"],
    model="embed-multilingual-v3.0",
    input_type="search_document"
)

# For querying: input_type="search_query"
query_response = co.embed(
    texts=["How does RAG work?"],
    model="embed-multilingual-v3.0",
    input_type="search_query"
)

multilingual-e5 (Open-Source Multilingual)

multilingual-e5 (intfloat/multilingual-e5-large) is an open-source multilingual embedding model initialized from xlm-roberta-large and continually trained on multilingual datasets. It supports 100 languages, making it a self-hosting option for multilingual search that includes Japanese.[4]

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# Documents require the "passage: " prefix
docs = ["passage: RAG stands for Retrieval-Augmented Generation"]
# Queries require the "query: " prefix
queries = ["query: How does RAG work?"]

doc_vectors = model.encode(docs)
query_vectors = model.encode(queries)

Embedding Model Comparison

Model	Provider	Dimensions	Multilingual	Japanese Quality	Cost	Best For
text-embedding-3-small	OpenAI	1536	Yes	Good	Low	Prototypes, small-scale production
text-embedding-3-large	OpenAI	3072	Yes	Good	Medium	Accuracy-critical production
embed-multilingual-v3.0	Cohere	1024	Excellent	Excellent	Medium	Multilingual, Japanese-heavy use cases
multilingual-e5-large	Microsoft/OSS	1024	Excellent	Excellent	Free (self-host)	Cost reduction, private environments
embed-multilingual-light-v3.0	Cohere	384	Yes	Good	Low	Low-latency requirements

How to Choose a Model

By Language and Domain

Japanese-heavy documents: embed-multilingual-v3.0 or multilingual-e5-large are strong candidates
English-first or global: text-embedding-3-small is sufficient in many cases
Domain-specific terminology: If a general-purpose model doesn’t produce adequate accuracy, consider a domain-specific model or fine-tuning

By Cost and Operations

Minimize API costs: text-embedding-3-small (cheapest OpenAI option) or self-hosted multilingual-e5
No external API dependency (on-premise / private cloud): Open-source models like multilingual-e5
Start quickly: Begin with text-embedding-3-small and switch if accuracy or cost requirements demand it

By Accuracy

MTEB (Massive Text Embedding Benchmark) is a publicly available benchmark that evaluates embedding model accuracy across languages and tasks. When selecting a model, referencing the Japanese task rankings provides an objective basis for comparison.[3]

Common Mistakes

Different Models for Indexing and Querying

Using text-embedding-3-small when indexing documents and text-embedding-3-large when querying produces vectors in different spaces (in this case even with different dimensions, 1536 vs. 3072), making accurate similarity computation impossible. Always use the same model throughout.

If a model is changed, all documents must be re-vectorized (re-indexed).
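One way to catch this mistake early is to record which model built the index and verify it on every query. The sketch below is a hypothetical guard: IndexMetadata, GuardedIndex, and check_query are names invented for illustration and do not correspond to any particular vector DB API.

```python
from dataclasses import dataclass, field

@dataclass
class IndexMetadata:
    """Hypothetical metadata stored alongside the vectors."""
    model_name: str
    dimensions: int

@dataclass
class GuardedIndex:
    metadata: IndexMetadata
    vectors: list = field(default_factory=list)

    def check_query(self, model_name: str, query_vector: list) -> None:
        """Raise if the query was embedded with a different model or size."""
        if model_name != self.metadata.model_name:
            raise ValueError(
                f"Index built with {self.metadata.model_name!r}, "
                f"query embedded with {model_name!r}: re-index or switch models"
            )
        if len(query_vector) != self.metadata.dimensions:
            raise ValueError("Query vector has the wrong number of dimensions")

index = GuardedIndex(IndexMetadata("text-embedding-3-small", 1536))
index.check_query("text-embedding-3-small", [0.0] * 1536)  # passes silently

try:
    index.check_query("text-embedding-3-large", [0.0] * 3072)
except ValueError as err:
    print(f"Rejected: {err}")
```

Failing loudly at query time is usually preferable to silently returning irrelevant results from mismatched vector spaces.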

Forgetting to Normalize Vectors

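Cosine similarity divides by the vector lengths, so a full cosine computation gives the same score whether or not the vectors are normalized. Normalization (scaling to unit length) matters when the dot product (inner product) is used as a faster substitute for cosine similarity: the two agree only for unit-length vectors. Many vector DBs handle this automatically or let you choose the metric, but manual computation requires attention.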

import numpy as np

def normalize(vector):
    """Normalize a vector to unit length"""
    norm = np.linalg.norm(vector)
    if norm == 0:
        return vector
    return vector / norm

normalized_vector = normalize(np.array(vector))
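A small numeric check shows where normalization actually makes a difference. The two vectors below are toy values chosen so the arithmetic is easy to follow: they point in the same direction but have different lengths.

```python
import numpy as np

a = np.array([3.0, 4.0])   # length 5
b = np.array([6.0, 8.0])   # same direction, length 10

# Full cosine similarity: lengths are divided out.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Raw dot product: depends on the lengths of the vectors.
raw_dot = np.dot(a, b)

# Dot product of unit-length vectors: matches cosine similarity.
unit_dot = np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b))

print(f"cosine:   {cosine:.4f}")    # 1.0000
print(f"raw dot:  {raw_dot:.4f}")   # 50.0000
print(f"unit dot: {unit_dot:.4f}")  # 1.0000
```

The raw dot product reports 50 for two vectors that are perfectly aligned, while the normalized dot product agrees with cosine similarity. This is why an index configured for inner-product search should be fed unit-length vectors if cosine-like scores are expected.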

Missing Prefix for E5-Style Models

E5-family models (like multilingual-e5) require the prefix "passage: " for documents and "query: " for questions. The model card states that omitting these prefixes causes performance degradation.[4]
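A small helper can make the prefixes hard to forget. This is a minimal sketch; with_prefix, prepare_passages, and prepare_queries are hypothetical helper names, not part of sentence-transformers. The check for an existing prefix assumes that real text does not coincidentally begin with "passage: " or "query: ".

```python
PASSAGE_PREFIX = "passage: "
QUERY_PREFIX = "query: "

def with_prefix(text: str, prefix: str) -> str:
    """Add an E5-style prefix unless the text already starts with it."""
    return text if text.startswith(prefix) else prefix + text

def prepare_passages(texts: list[str]) -> list[str]:
    """Prefix every document chunk before it is encoded for indexing."""
    return [with_prefix(t, PASSAGE_PREFIX) for t in texts]

def prepare_queries(texts: list[str]) -> list[str]:
    """Prefix every user question before it is encoded for search."""
    return [with_prefix(t, QUERY_PREFIX) for t in texts]

docs = prepare_passages(["RAG stands for Retrieval-Augmented Generation"])
queries = prepare_queries(["query: How does RAG work?", "What is a chunk?"])
print(docs)
print(queries)
```

Routing all text through helpers like these before calling model.encode keeps the indexing and query paths consistent and avoids double-prefixing input that was already prepared.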

Summary

Embeddings convert text into vectors — “coordinates in meaning-space”
In RAG, both documents and questions are converted with the same model; cosine similarity finds semantically close documents
For Japanese-heavy use cases, embed-multilingual-v3.0 and multilingual-e5-large have accuracy advantages
Starting with text-embedding-3-small and switching as needed is a practical approach
Using the same model for indexing and querying is the single most important rule

FAQ

Q: Do I need to understand the underlying math to use embeddings?

A: No. In practice, there is no need to understand the formulas. Understanding the concept (text is converted to a list of numbers, and the distance between the resulting vectors measures semantic closeness) is sufficient to use embeddings effectively through libraries and APIs. At the tuning stage, understanding vector dimensions and normalization is helpful, but deriving the math is not necessary.

Q: Why does embedding quality matter so much for RAG?

A: Embedding quality directly determines retrieval accuracy. If the embedding model cannot correctly judge that “this question and this document are semantically close,” no amount of reranker tuning or prompt engineering will help — because the relevant document was never retrieved in the first place. For Japanese documents with specialized terminology and proper nouns, selecting a multilingual model is particularly important.

Q: Can embedding models be updated?

A: Yes, you can switch to a newer or different model; OpenAI, for example, provides both text-embedding-3-small and text-embedding-3-large.[1] However, vectors from the previous model are incompatible with vectors from the new one, so changing models requires a full re-index of all documents. In production, pinning the model version and planning for migration impact is important.

Q: What is the relationship between chunk size and embedding models?

A: Embedding models have a maximum input length. text-embedding-3-small accepts up to 8192 tokens as input.[1] Depending on the API or library, text that exceeds the limit is either rejected with an error or truncated (losing the end of the text), so chunk sizes must stay within the model’s limit.
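As a rough illustration of keeping chunks within a limit, the sketch below splits text into fixed-size pieces. It counts whitespace-separated words, which is only an approximation: real token counts come from the model's own tokenizer and are usually higher, so a real pipeline should count with that tokenizer and leave a safety margin. The function name split_by_token_limit is hypothetical.

```python
def split_by_token_limit(text: str, max_tokens: int) -> list[str]:
    """Split text into pieces of at most max_tokens whitespace-separated words.

    Whitespace splitting is only a rough stand-in for the model's tokenizer;
    real token counts are usually higher, so leave a safety margin.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]

chunks = split_by_token_limit("one two three four five six seven", max_tokens=3)
print(chunks)  # ['one two three', 'four five six', 'seven']
```

In practice, chunk boundaries are usually chosen along sentences or sections rather than at a fixed count, but the size check itself works the same way.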

References
