Embeddings & Vector Representations
About 10 minutes
An embedding is a technique for representing the meaning of text or words as a list of numbers (a vector). In RAG, both documents and questions are converted into the same vector space, making it possible to find “semantically similar documents” through fast numerical computation.
What Are Embeddings?
A useful analogy is to think of embeddings as “coordinates in meaning-space.”
On a map, you can look up the coordinates of two cities and calculate the distance between them numerically. Embeddings do the same thing for text, but in a high-dimensional space. For example, “cat” and “dog” are semantically close, so their vectors in embedding space will be close together. “Cat” and “spaceship” are semantically distant, so their vectors will be far apart.
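The map analogy can be made concrete with a toy example. The sketch below uses invented 2-D coordinates (the `toy_vectors` values are hypothetical, not output from any real model) and measures straight-line distance between them.

```python
import math

# Hypothetical 2-D "meaning-space" coordinates, invented for illustration only.
# Real embeddings have hundreds to thousands of dimensions.
toy_vectors = {
    "cat": (1.0, 1.0),
    "dog": (1.2, 0.9),
    "spaceship": (-3.0, 4.0),
}

def distance(word_a: str, word_b: str) -> float:
    """Euclidean distance between two toy word vectors."""
    return math.dist(toy_vectors[word_a], toy_vectors[word_b])

cat_dog = distance("cat", "dog")
cat_spaceship = distance("cat", "spaceship")
print(f"cat-dog: {cat_dog:.2f}, cat-spaceship: {cat_spaceship:.2f}")
# prints: cat-dog: 0.22, cat-spaceship: 5.00
```

With these made-up coordinates, “cat” sits much closer to “dog” than to “spaceship”, which is the property a trained embedding model is expected to produce on its own.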
Real embedding vectors are made up of hundreds to thousands of numbers. For example, OpenAI’s text-embedding-3-small represents one piece of text using 1536 numbers — a 1536-dimensional vector.[1]
# Python 3.11+
# pip install openai
from openai import OpenAI
client = OpenAI() # Uses the OPENAI_API_KEY environment variable
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="RAG stands for Retrieval-Augmented Generation",
)
vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}") # Output: Dimensions: 1536
print(f"First 5 elements: {vector[:5]}")  # e.g. First 5 elements: [-0.021, 0.034, ...]
Text-to-Vector Conversion Flow
In RAG, embeddings are used both when documents are indexed and when a query is made.
graph TD
A["Text (document or question)"] --> B["Embedding model"]
B --> C["Vector (list of numbers)"]
C --> D{"Purpose"}
D -->|"Indexing documents"| E["Store in vector DB"]
D -->|"Searching with a query"| F["Compute similarity"]
F --> G["Retrieve nearest vectors"]
G --> H["Pass relevant documents to LLM"]
When indexing documents, each chunk (a piece of a split document) is converted to a vector and stored in the vector DB. When a user asks a question, that question is also converted to a vector, and the system computes the distance between that vector and all stored vectors to retrieve the most similar (semantically closest) chunks.
Important: The same embedding model must be used for both indexing and querying. Vectors produced by different models cannot be compared.
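The indexing and query flow described above can be sketched end to end without calling any API. In this minimal sketch, `toy_embed` is a hypothetical stand-in for a real embedding model (it only counts a few keywords), and a NumPy array plays the role of the vector DB. Scores are computed with cosine similarity, which the next section explains.

```python
import numpy as np

KEYWORDS = ["dog", "cat", "rocket", "space"]

def toy_embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for an embedding model: keyword counts."""
    words = text.lower().split()
    return np.array([float(words.count(k)) for k in KEYWORDS])

# Indexing: embed every chunk once and keep the vectors (the "vector DB").
chunks = [
    "how to feed a dog",
    "a cat and a dog at home",
    "rocket launches and space travel",
]
index = np.stack([toy_embed(c) for c in chunks])

def search(query: str, top_k: int = 2) -> list[str]:
    """Embed the query with the same function and return the closest chunks."""
    q = toy_embed(query)
    # Cosine similarity between the query and every stored vector.
    norms = np.linalg.norm(index, axis=1) * np.linalg.norm(q)
    scores = index @ q / np.where(norms == 0, 1.0, norms)
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]

results = search("space rocket history")
print(results[0])  # prints: rocket launches and space travel
```

Note that `search` calls the same `toy_embed` function that built the index. Swapping in a different function at query time would produce vectors that the stored ones cannot be meaningfully compared against.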
How Cosine Similarity Works
The most commonly used method for measuring “closeness” between vectors is cosine similarity.
Cosine similarity measures how similar the “direction” of two vectors is, on a scale from -1 to 1.
- Close to 1.0: Very similar (e.g., “how to raise a dog” vs. “methods for keeping a dog”)
- Close to 0: Little relationship (e.g., “how to raise a dog” vs. “history of space exploration”)
- Close to -1: Opposite meaning (almost never occurs in practice with text)
import numpy as np

def cosine_similarity(vec_a, vec_b):
    """Compute cosine similarity between two vectors."""
    dot_product = np.dot(vec_a, vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    return dot_product / (norm_a * norm_b)

# vector_doc and vector_query are embeddings of a document and a question,
# each obtained from the same embedding model (as in the example above)
similarity = cosine_similarity(vector_doc, vector_query)
print(f"Similarity: {similarity:.4f}")  # e.g. Similarity: 0.8732
Most vector DBs compute similarity internally and return results sorted by score. Memorizing the cosine similarity formula is not necessary; the key intuition is “higher score = semantically closer.”
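To build intuition for the -1 to 1 scale, the following sketch evaluates cosine similarity on hand-picked 2-D vectors. The vectors and the helper name `cosine` are illustrative assumptions, not embeddings from a real model.

```python
import numpy as np

def cosine(a, b) -> float:
    """Cosine similarity of two vectors (assumes neither is all zeros)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine([1, 2], [2, 4])    # parallel, different lengths
perpendicular = cosine([1, 0], [0, 3])     # at a right angle
opposite = cosine([1, 2], [-1, -2])        # pointing the opposite way

print(round(same_direction, 6), round(perpendicular, 6), round(opposite, 6))
```

The three results come out at roughly 1, 0, and -1. The first pair also shows that vector length does not affect the score: only direction does.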
Popular Embedding Models
OpenAI Embeddings
OpenAI’s text-embedding-3-small and text-embedding-3-large are embedding models that support multiple languages including Japanese.[1]
- text-embedding-3-small: 1536 dimensions, low cost, multilingual, ideal for prototypes
- text-embedding-3-large: 3072 dimensions, higher accuracy for production environments
Cohere Embed
Cohere’s embed-multilingual-v3.0 is an embedding model for multilingual search, and its input_type parameter distinguishes inputs such as search documents and search queries.[2]
- 1024 dimensions
- Supports 100+ languages
- Distinguishes between search_document and search_query input types, which is a notable feature
# pip install cohere
import os
import cohere

# Read the API key from the COHERE_API_KEY environment variable and pass it explicitly
co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

# For indexing: input_type="search_document"
doc_response = co.embed(
    texts=["RAG stands for Retrieval-Augmented Generation"],
    model="embed-multilingual-v3.0",
    input_type="search_document",
)

# For querying: input_type="search_query"
query_response = co.embed(
    texts=["How does RAG work?"],
    model="embed-multilingual-v3.0",
    input_type="search_query",
)
multilingual-e5 (Open-Source Multilingual)
multilingual-e5 (intfloat/multilingual-e5-large) is an open-source multilingual embedding model initialized from xlm-roberta-large and continually trained on multilingual datasets. It supports 100 languages, making it a self-hosting option for multilingual search that includes Japanese.[4]
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("intfloat/multilingual-e5-large")
# Documents require the "passage: " prefix
docs = ["passage: RAG stands for Retrieval-Augmented Generation"]
# Queries require the "query: " prefix
queries = ["query: How does RAG work?"]
doc_vectors = model.encode(docs)
query_vectors = model.encode(queries)
Embedding Model Comparison
| Model | Provider | Dimensions | Multilingual | Japanese Quality | Cost | Best For |
|---|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | Yes | Good | Low | Prototypes, small-scale production |
| text-embedding-3-large | OpenAI | 3072 | Yes | Good | Medium | Accuracy-critical production |
| embed-multilingual-v3.0 | Cohere | 1024 | Excellent | Excellent | Medium | Multilingual, Japanese-heavy use cases |
| multilingual-e5-large | Microsoft/OSS | 1024 | Excellent | Excellent | Free (self-host) | Cost reduction, private environments |
| embed-multilingual-light-v3.0 | Cohere | 384 | Yes | Good | Low | Low-latency requirements |
How to Choose a Model
By Language and Domain
- Japanese-heavy documents: embed-multilingual-v3.0 or multilingual-e5-large are strong candidates
- English-first or global: text-embedding-3-small is sufficient in many cases
- Domain-specific terminology: If a general-purpose model doesn’t produce adequate accuracy, consider a domain-specific model or fine-tuning
By Cost and Operations
- Minimize API costs: text-embedding-3-small (cheapest OpenAI option) or self-hosted multilingual-e5
- No external API dependency (on-premise / private cloud): Open-source models like multilingual-e5
- Start quickly: Begin with text-embedding-3-small and switch if accuracy or cost requirements demand it
By Accuracy
MTEB (Massive Text Embedding Benchmark) is a publicly available benchmark that evaluates embedding model accuracy across languages and tasks. When selecting a model, referencing the Japanese task rankings provides an objective basis for comparison.[3]
Common Mistakes
Different Models for Indexing and Querying
Using text-embedding-3-small when indexing documents and text-embedding-3-large when querying produces vectors in different spaces, making accurate similarity computation impossible. Always use the same model throughout.
If a model is changed, all documents must be re-vectorized (re-indexed).
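One way to guard against this mistake is to record the model name alongside the index and check it on every write and query. The sketch below is a hypothetical in-memory illustration (the `VectorIndex` class and its methods are invented for this example and do not correspond to any particular vector DB).

```python
from dataclasses import dataclass, field

@dataclass
class VectorIndex:
    """Hypothetical in-memory index that remembers which model built it."""
    model_name: str
    vectors: list[list[float]] = field(default_factory=list)

    def add(self, vector: list[float], model_name: str) -> None:
        self._check_model(model_name)
        self.vectors.append(vector)

    def check_query(self, model_name: str) -> None:
        """Call before searching to confirm the query used the same model."""
        self._check_model(model_name)

    def _check_model(self, model_name: str) -> None:
        if model_name != self.model_name:
            raise ValueError(
                f"index was built with {self.model_name!r}, got {model_name!r}; "
                "re-index all documents before switching models"
            )

index = VectorIndex(model_name="text-embedding-3-small")
index.add([0.1, 0.2], model_name="text-embedding-3-small")

try:
    index.check_query(model_name="text-embedding-3-large")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(mismatch_detected)  # prints: True
```

Many vector DBs allow arbitrary metadata on a collection, so the same idea can usually be applied by storing the model name there and validating it in application code.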
Forgetting to Normalize Vectors
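Cosine similarity already divides by the vector lengths, so the formula itself is unaffected by scale. Normalization (scaling to unit length) matters when similarity is computed as a plain dot product, which equals cosine similarity only for unit-length vectors. Many vector DBs handle this when configured for cosine similarity, but manual computation requires attention.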
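As a quick check of how normalization interacts with similarity scores, the sketch below compares a raw dot product, a dot product on unit-length vectors, and cosine similarity. The vectors and the helper name `unit` are illustrative assumptions.

```python
import numpy as np

def unit(v: np.ndarray) -> np.ndarray:
    """Scale a non-zero vector to length 1."""
    return v / np.linalg.norm(v)

a = np.array([3.0, 4.0])   # length 5
b = np.array([6.0, 8.0])   # same direction, length 10

raw_dot = float(a @ b)                  # grows with vector length
unit_dot = float(unit(a) @ unit(b))     # dot product of unit vectors
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(raw_dot, round(unit_dot, 6), round(cos, 6))  # prints: 50.0 1.0 1.0
```

The raw dot product (50.0) reflects vector length as well as direction, while the dot product of the unit-length versions matches cosine similarity. This is why systems that score with a dot product expect normalized vectors.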
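import numpy as np

def normalize(vector):
    """Normalize a vector to unit length."""
    norm = np.linalg.norm(vector)
    if norm == 0:
        return vector  # a zero vector has no direction; return it unchanged
    return vector / norm

# `vector` is the embedding list from the earlier OpenAI example
normalized_vector = normalize(np.array(vector))
Missing Prefix for E5-Style Models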
E5-family models (like multilingual-e5) require the prefix "passage: " for documents and "query: " for questions. The model card states that omitting these prefixes causes performance degradation.[4]
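A small helper can make the prefix hard to forget. The function below is a hypothetical convenience wrapper (the name `add_e5_prefix` is not part of any library) that prepends the prefix only when it is missing.

```python
def add_e5_prefix(texts: list[str], kind: str) -> list[str]:
    """Prepend the E5 prefix for documents ("passage") or questions ("query")."""
    if kind not in ("passage", "query"):
        raise ValueError("kind must be 'passage' or 'query'")
    prefix = f"{kind}: "
    # Leave texts that already carry the prefix untouched.
    return [t if t.startswith(prefix) else prefix + t for t in texts]

docs = add_e5_prefix(["RAG stands for Retrieval-Augmented Generation"], "passage")
queries = add_e5_prefix(["query: How does RAG work?"], "query")
print(docs[0])     # prints: passage: RAG stands for Retrieval-Augmented Generation
print(queries[0])  # prints: query: How does RAG work?
```

The returned lists can be passed directly to an encoder call such as the `model.encode(...)` call shown earlier.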
Summary
- Embeddings convert text into vectors, which serve as “coordinates in meaning-space”
- In RAG, both documents and questions are converted with the same model; cosine similarity finds semantically close documents
- For Japanese-heavy use cases, embed-multilingual-v3.0 and multilingual-e5-large have accuracy advantages
- Starting with text-embedding-3-small and switching as needed is a practical approach
- Using the same model for indexing and querying is the single most important rule
Q: Do I need to understand the underlying math to use embeddings?
A: No. In practice, there is no need to understand the formulas. Understanding the concept — “text is converted to a list of numbers, and the distance between numbers measures semantic closeness” — is sufficient to use embeddings effectively through libraries and APIs. At the tuning stage, understanding vector dimensions and normalization is helpful, but deriving the math is not necessary.
Q: Why does embedding quality matter so much for RAG?
A: Embedding quality directly determines retrieval accuracy. If the embedding model cannot correctly judge that “this question and this document are semantically close,” no amount of reranker tuning or prompt engineering will help — because the relevant document was never retrieved in the first place. For Japanese documents with specialized terminology and proper nouns, selecting a multilingual model is particularly important.
Q: Can embedding models be updated?
A: Yes. Providers release new models over time; OpenAI, for example, currently offers text-embedding-3-small and text-embedding-3-large.[1] Vectors from a new model are incompatible with vectors from the old one, so changing models requires a full re-index of all documents. In production, it is important to pin the model version and plan for the migration impact.
Q: What is the relationship between chunk size and embedding models?
A: Embedding models have a maximum input length. text-embedding-3-small accepts up to 8192 tokens as input.[1] Depending on the provider and library, text that exceeds the limit is either rejected with an error or cut off at the end, so chunk sizes must stay within the model’s limit.
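As a rough illustration of keeping chunks under a limit, the sketch below splits text by a token budget. It is a simplification under a stated assumption: whitespace-separated words stand in for tokens, which undercounts real tokens and does not work for languages written without spaces, such as Japanese. A real pipeline should count with the tokenizer that matches its embedding model and leave some headroom. The function name `split_by_token_budget` is hypothetical.

```python
def split_by_token_budget(text: str, max_tokens: int) -> list[str]:
    """Split text into chunks of at most max_tokens whitespace-separated words.

    Assumption: words are a rough stand-in for tokens. A real pipeline should
    count with the tokenizer that matches its embedding model.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    return [
        " ".join(words[i : i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]

chunks = split_by_token_budget("one two three four five six seven", max_tokens=3)
print(chunks)  # prints: ['one two three', 'four five six', 'seven']
```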