Embeddings & Vector Representations

About 10 minutes

Target audience: People who want a deeper understanding of how RAG works, and those who want to know how to choose an embedding model
Prerequisites: Familiarity with the basic flow described in What Is RAG?

An embedding is a technique for representing the meaning of text or words as a list of numbers (a vector). In RAG, both documents and questions are converted into the same vector space, making it possible to find “semantically similar documents” through fast numerical computation.

A useful analogy is to think of embeddings as “coordinates in meaning-space.”

On a map, you can look up the coordinates of two cities and calculate the distance between them numerically. Embeddings do the same thing for text, but in a high-dimensional space. For example, “cat” and “dog” are semantically close, so their vectors in embedding space will be close together. “Cat” and “spaceship” are semantically distant, so their vectors will be far apart.
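
The map analogy can be sketched with hand-picked 2-D coordinates. The numbers below are made up for illustration and do not come from any real embedding model; `toy_vectors` and `euclidean_distance` are hypothetical names.

```python
import math

# Hypothetical 2-D "meaning coordinates", invented for illustration only.
toy_vectors = {
    "cat": (0.9, 0.8),
    "dog": (0.8, 0.9),
    "spaceship": (-0.7, 0.2),
}

def euclidean_distance(a, b):
    """Straight-line distance between two points, as on a map."""
    return math.dist(a, b)

cat_dog = euclidean_distance(toy_vectors["cat"], toy_vectors["dog"])
cat_ship = euclidean_distance(toy_vectors["cat"], toy_vectors["spaceship"])

print(f"cat <-> dog:       {cat_dog:.3f}")
print(f"cat <-> spaceship: {cat_ship:.3f}")
```

With these toy coordinates, the cat-to-dog distance comes out much smaller than the cat-to-spaceship distance, which is the property a real embedding model aims for in far more dimensions.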

Real embedding vectors are made up of hundreds to thousands of numbers. For example, OpenAI’s text-embedding-3-small represents one piece of text using 1536 numbers — a 1536-dimensional vector.[1]

# Python 3.11+
# pip install openai
from openai import OpenAI

client = OpenAI()  # Uses the OPENAI_API_KEY environment variable

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="RAG stands for Retrieval-Augmented Generation"
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")          # Output: Dimensions: 1536
print(f"First 5 elements: {vector[:5]}")     # Output: [-0.021, 0.034, ...]

In RAG, embeddings are used both when documents are indexed and when a query is made.

graph TD
    A["Text (document or question)"] --> B["Embedding model"]
    B --> C["Vector (list of numbers)"]
    C --> D{"Purpose"}
    D -->|"Indexing documents"| E["Store in vector DB"]
    D -->|"Searching with a query"| F["Compute similarity"]
    F --> G["Retrieve nearest vectors"]
    G --> H["Pass relevant documents to LLM"]

When indexing documents, each chunk (a piece of a split document) is converted to a vector and stored in the vector DB. When a user asks a question, that question is also converted to a vector, and the system computes the distance between that vector and all stored vectors to retrieve the most similar (semantically closest) chunks.
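
The index-then-query flow can be sketched in memory without any external service. This is a minimal sketch under loud assumptions: `toy_embed` is a hypothetical stand-in that only counts words from a tiny fixed vocabulary (`VOCAB`), so it captures word overlap rather than meaning, and a NumPy array plays the role of the vector DB.

```python
import numpy as np

# Tiny fixed vocabulary for the stand-in embedding (assumption, illustration only).
VOCAB = ["rag", "retrieval", "generation", "dog", "cat", "pet", "rocket", "space"]

def toy_embed(text):
    """Stand-in for a real embedding model: counts vocabulary words in the text."""
    words = text.lower().split()
    return np.array([words.count(term) for term in VOCAB], dtype=float)

# Indexing: convert each chunk to a vector and keep the vectors together.
chunks = [
    "rag combines retrieval and generation",
    "a dog or a cat makes a good pet",
    "a rocket travels to space",
]
index = np.stack([toy_embed(chunk) for chunk in chunks])

def search(question, top_k=1):
    """Embed the question the same way, then rank chunks by cosine similarity."""
    query = toy_embed(question)
    norms = np.linalg.norm(index, axis=1) * np.linalg.norm(query)
    # Guard against division by zero when a vector has no vocabulary words.
    scores = index @ query / np.where(norms == 0, 1.0, norms)
    best = np.argsort(scores)[::-1][:top_k]
    return [(chunks[i], float(scores[i])) for i in best]

results = search("which pet is better, a dog or a cat")
print(results)
```

In a real system, `toy_embed` would be replaced by a call to an embedding model and the array by a vector DB, but the shape of the flow stays the same.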

Important: The same embedding model must be used for both indexing and querying. Vectors produced by different models cannot be compared.

The most commonly used method for measuring “closeness” between vectors is cosine similarity.

Cosine similarity measures how similar the “direction” of two vectors is, on a scale from -1 to 1.

  • Close to 1.0: Very similar (e.g., “how to raise a dog” vs. “methods for keeping a dog”)
  • Close to 0: Little relationship (e.g., “how to raise a dog” vs. “history of space exploration”)
  • Close to -1: Opposite meaning (almost never occurs in practice with text)
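
import numpy as np

def cosine_similarity(vec_a, vec_b):
    """Compute cosine similarity between two vectors"""
    dot_product = np.dot(vec_a, vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    return dot_product / (norm_a * norm_b)

# vector_doc and vector_query are embeddings of a document and a question,
# obtained from the same model as in the earlier example
similarity = cosine_similarity(vector_doc, vector_query)
print(f"Similarity: {similarity:.4f}")  # e.g. 0.8732 (the actual value depends on the texts and the model)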

Most vector DBs compute similarity internally and return results sorted by score. Memorizing the cosine similarity formula is not necessary — the key intuition is “higher score = semantically closer.”
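
The three ranges of the score, and the "sorted by score" behavior, can be illustrated with hand-picked 2-D vectors. This is only a sketch: the vectors and labels are invented, and `cosine` is a local helper rather than any vector DB's API.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = [1.0, 0.0]

# Toy 2-D vectors chosen by hand; labels are illustrative only.
candidates = {
    "same direction": [2.0, 0.0],
    "at right angles": [0.0, 3.0],
    "opposite direction": [-1.0, 0.0],
}

scores = {label: cosine(query, vec) for label, vec in candidates.items()}

# Sort the way a vector DB would return results: highest score first.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:+.1f}")
```

The "same direction" vector is twice as long as the query yet still scores 1.0, because cosine similarity compares direction and ignores length.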

OpenAI’s text-embedding-3-small and text-embedding-3-large are embedding models that support multiple languages including Japanese.[1]

  • text-embedding-3-small: 1536 dimensions, low cost, multilingual, ideal for prototypes
  • text-embedding-3-large: 3072 dimensions, higher accuracy for production environments

Cohere’s embed-multilingual-v3.0 is an embedding model for multilingual search, and its input_type parameter distinguishes inputs such as search documents and search queries.[2]

  • 1024 dimensions
  • Supports 100+ languages
  • Distinguishes between search_document and search_query input types, which is a notable feature

# pip install cohere
import cohere

co = cohere.Client()  # Uses the CO_API_KEY environment variable

# For indexing: input_type="search_document"
doc_response = co.embed(
    texts=["RAG stands for Retrieval-Augmented Generation"],
    model="embed-multilingual-v3.0",
    input_type="search_document"
)

# For querying: input_type="search_query"
query_response = co.embed(
    texts=["How does RAG work?"],
    model="embed-multilingual-v3.0",
    input_type="search_query"
)

multilingual-e5 (Open-Source Multilingual)

multilingual-e5 (intfloat/multilingual-e5-large) is an open-source multilingual embedding model initialized from xlm-roberta-large and continually trained on multilingual datasets. It supports 100 languages, making it a self-hosting option for multilingual search that includes Japanese.[4]

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# Documents require the "passage: " prefix
docs = ["passage: RAG stands for Retrieval-Augmented Generation"]
# Queries require the "query: " prefix
queries = ["query: How does RAG work?"]

doc_vectors = model.encode(docs)
query_vectors = model.encode(queries)

| Model | Provider | Dimensions | Multilingual | Japanese Quality | Cost | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| text-embedding-3-small | OpenAI | 1536 | Yes | Good | Low | Prototypes, small-scale production |
| text-embedding-3-large | OpenAI | 3072 | Yes | Good | Medium | Accuracy-critical production |
| embed-multilingual-v3.0 | Cohere | 1024 | Excellent | Excellent | Medium | Multilingual, Japanese-heavy use cases |
| multilingual-e5-large | Microsoft/OSS | 1024 | Excellent | Excellent | Free (self-host) | Cost reduction, private environments |
| embed-multilingual-light-v3.0 | Cohere | 384 | Yes | Good | Low | Low-latency requirements |

  • Japanese-heavy documents: embed-multilingual-v3.0 or multilingual-e5-large are strong candidates
  • English-first or global: text-embedding-3-small is sufficient in many cases
  • Domain-specific terminology: If a general-purpose model doesn’t produce adequate accuracy, consider a domain-specific model or fine-tuning
  • Minimize API costs: text-embedding-3-small (cheapest OpenAI option) or self-hosted multilingual-e5
  • No external API dependency (on-premise / private cloud): Open-source models like multilingual-e5
  • Start quickly: Begin with text-embedding-3-small and switch if accuracy or cost requirements demand it

MTEB (Massive Text Embedding Benchmark) is a publicly available benchmark that evaluates embedding model accuracy across languages and tasks. When selecting a model, referencing the Japanese task rankings provides an objective basis for comparison.[3]

Different Models for Indexing and Querying

Using text-embedding-3-small when indexing documents and text-embedding-3-large when querying produces vectors in different spaces, making accurate similarity computation impossible. Always use the same model throughout.

If a model is changed, all documents must be re-vectorized (re-indexed).
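
One way to catch this mistake early is to record the model name alongside the index and check it on every write and query. The sketch below is a hypothetical in-memory example: `ToyIndex` and its methods are invented for illustration and are not part of any vector DB's API.

```python
from dataclasses import dataclass, field

@dataclass
class ToyIndex:
    """Hypothetical in-memory index that remembers which model built it."""
    model_name: str
    vectors: list = field(default_factory=list)

    def add(self, vector, model_name):
        """Store a vector only if it came from the index's own model."""
        self._check_model(model_name)
        self.vectors.append(vector)

    def check_query(self, model_name):
        """Call before searching; raises if the query model differs."""
        self._check_model(model_name)

    def _check_model(self, model_name):
        if model_name != self.model_name:
            raise ValueError(
                f"Index was built with {self.model_name!r} but got vectors "
                f"from {model_name!r}; re-index all documents first."
            )

index = ToyIndex(model_name="text-embedding-3-small")
index.add([0.1, 0.2, 0.3], model_name="text-embedding-3-small")

try:
    index.check_query("text-embedding-3-large")
    mismatch_detected = False
except ValueError as error:
    mismatch_detected = True
    print(error)
```

Many vector DBs let you attach metadata to a collection, which is a natural place to store the model name in a real deployment.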

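Cosine similarity divides by each vector’s length, so it gives the same result whether or not the vectors are normalized (set to unit length). Normalization matters when similarity is computed as a plain dot product (inner product), which equals cosine similarity only for unit-length vectors. Most vector DBs handle this automatically, but manual computation requires attention.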

import numpy as np

def normalize(vector):
    """Normalize a vector to unit length"""
    norm = np.linalg.norm(vector)
    if norm == 0:
        return vector
    return vector / norm

normalized_vector = normalize(np.array(vector))
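
To see where unit length matters, the sketch below compares three numbers for two hand-picked toy vectors: cosine similarity on the raw vectors, the dot product of the normalized vectors, and the dot product of the raw vectors. The vectors are invented for illustration.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity on the raw vectors (divides by both lengths).
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Plain dot product after scaling each vector to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = np.dot(a_unit, b_unit)

# Plain dot product on the raw vectors, for comparison.
raw_dot = np.dot(a, b)

print(f"cosine:       {cosine:.4f}")
print(f"dot of units: {dot_of_units:.4f}")
print(f"raw dot:      {raw_dot:.1f}")
```

The first two values agree, while the raw dot product is scaled by the vector lengths. That is why a dot-product metric only behaves like cosine similarity when the stored vectors are unit length.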

E5-family models (like multilingual-e5) require the prefix "passage: " for documents and "query: " for questions. The model card states that omitting these prefixes causes performance degradation.[4]
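
A small pair of helpers can make it harder to forget the prefixes. `as_passage` and `as_query` are hypothetical names introduced here for illustration; they are not part of sentence-transformers or the model itself.

```python
def as_passage(text: str) -> str:
    """Add the "passage: " prefix to a document chunk unless already present."""
    return text if text.startswith("passage: ") else f"passage: {text}"

def as_query(text: str) -> str:
    """Add the "query: " prefix to a question unless already present."""
    return text if text.startswith("query: ") else f"query: {text}"

docs = [as_passage(t) for t in ["RAG stands for Retrieval-Augmented Generation"]]
queries = [as_query("How does RAG work?")]

print(docs)
print(queries)
```

The resulting lists can be passed to `model.encode` in the same way as the hand-prefixed strings in the earlier example.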

  • Embeddings convert text into vectors — “coordinates in meaning-space”
  • In RAG, both documents and questions are converted with the same model; cosine similarity finds semantically close documents
  • For Japanese-heavy use cases, embed-multilingual-v3.0 and multilingual-e5-large have accuracy advantages
  • Starting with text-embedding-3-small and switching as needed is a practical approach
  • Using the same model for indexing and querying is the single most important rule

Q: Do I need to understand the underlying math to use embeddings?

A: No. In practice, there is no need to understand the formulas. Understanding the concept — “text is converted to a list of numbers, and the distance between numbers measures semantic closeness” — is sufficient to use embeddings effectively through libraries and APIs. At the tuning stage, understanding vector dimensions and normalization is helpful, but deriving the math is not necessary.

Q: Why does embedding quality matter so much for RAG?

A: Embedding quality directly determines retrieval accuracy. If the embedding model cannot correctly judge that “this question and this document are semantically close,” no amount of reranker tuning or prompt engineering will help — because the relevant document was never retrieved in the first place. For Japanese documents with specialized terminology and proper nouns, selecting a multilingual model is particularly important.

Q: Can embedding models be updated?

A: Yes. You can switch models, for example from OpenAI’s text-embedding-3-small to text-embedding-3-large.[1] Vectors from the old model are incompatible with vectors from the new one, so switching requires a full re-index of all documents. In production, pinning the model version and planning for migration impact is important.

Q: What is the relationship between chunk size and embedding models?

A: Embedding models have a maximum input length. text-embedding-3-small accepts up to 8192 tokens as input.[1] Depending on the API or library, text that exceeds the limit is either rejected with an error or truncated (typically at the end), so chunk sizes must stay within the model’s limit.
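
Keeping chunks under a token limit can be sketched as follows. This is a rough illustration, not a production splitter: `split_by_token_limit` is a hypothetical helper, and its default whitespace tokenizer is only a stand-in. Real token counts should come from the tokenizer that matches the embedding model.

```python
def split_by_token_limit(text, max_tokens, tokenize=str.split):
    """Split text into chunks of at most max_tokens tokens.

    `tokenize` defaults to whitespace splitting, a rough stand-in;
    a real pipeline should count tokens with the model's own tokenizer.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    tokens = tokenize(text)
    return [
        " ".join(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]

text = "one two three four five six seven"
chunks = split_by_token_limit(text, max_tokens=3)
print(chunks)
```

A fixed-size split like this ignores sentence and paragraph boundaries, so it is best read as the limit check only, layered on top of whatever chunking strategy the pipeline already uses.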

  1. OpenAI Embeddings — Official Documentation
  2. Cohere Embed — Official Documentation
  3. MTEB Leaderboard (Embedding Model Benchmark)
  4. multilingual-e5 — Hugging Face