From Words to Vectors: What Are Embeddings?
Traditional relational and document databases have relied on exact keyword matching for decades. The problem arises when a user phrases a query differently from how the document was written: the system simply won't find it.
The solution is vector embeddings: dense numerical representations of data (text, images, audio) placed in a high-dimensional vector space. Machine learning models, such as Sentence Transformers, OpenAI embedding models, or Google's BERT, transform raw data into sequences of floating-point numbers. A typical vector representing a paragraph of text has between 384 and 3,072 dimensions.
The key property: objects with similar meaning land close together in this abstract space. For example, "bone" and "kibble", two stereotypical dog treats, sit right next to each other because they combine a high dog association with a high food association. "Ham" is close but slightly further away: it is food, yet less strongly associated with dogs. "Leash" and "bowl" are strongly dog-related but not food, so they fall in a different region of the space. A vector database stores these representations and can quickly find those "closest" to a given query.
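The dog-and-food example can be sketched with hand-made two-dimensional vectors. This is only an illustration: the coordinates below are invented for the example (axis 0 is "dog association", axis 1 is "food association"), whereas a real embedding model learns hundreds of dimensions that carry no such human-readable labels.

```python
import math

# Hypothetical 2-D "embeddings": (dog association, food association).
# The numbers are made up for illustration, not produced by any model.
TOY_VECTORS = {
    "bone":   (0.90, 0.80),
    "kibble": (0.95, 0.85),
    "ham":    (0.40, 0.90),
    "leash":  (0.95, 0.05),
    "bowl":   (0.85, 0.15),
}


def nearest(word, k=2):
    """Return the k words closest to `word` (Euclidean distance), nearest first."""
    query = TOY_VECTORS[word]
    others = [
        (math.dist(query, vec), other)
        for other, vec in TOY_VECTORS.items()
        if other != word
    ]
    return [other for _, other in sorted(others)[:k]]


print(nearest("bone"))      # kibble first, then ham
print(nearest("leash", 1))  # bowl
```

With these made-up coordinates, "kibble" is the nearest neighbor of "bone", "ham" comes second, and "leash" pairs with "bowl", mirroring the grouping described above.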
Similarity Metrics: The Mathematical Core of Search
At the heart of a vector database is calculating similarity between a query vector and stored vectors. Several main metrics are used:
- Cosine similarity: the metric most commonly used in NLP and semantic search. It measures the cosine of the angle between two vectors, ignoring their length. A value of 1 means the vectors point in the same direction (near-identical meaning), 0 means they are orthogonal (unrelated), and -1 means they point in opposite directions. Well suited to RAG, where a short query must find a long document.
- Euclidean distance (L2): an intuitive measure of straight-line distance in n-dimensional space. Used where the absolute magnitude of the vector matters, such as in computer vision or anomaly detection.
- Dot product: reflects both the angle between vectors and their lengths. Popular in recommendation systems, where vector length can encode the intensity of user preferences. Computationally, it is the cheapest of the three metrics to evaluate.
| Metric | Uses direction | Uses magnitude | Typical use case |
|---|---|---|---|
| Cosine | Yes | No | NLP, semantic search, RAG |
| Dot product | Yes | Yes | Recommendations, ranking |
| L2 (Euclidean) | Yes | Yes | Computer vision, anomaly detection |
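The three metrics are short enough to write out directly. The sketch below uses plain Python lists and invented sample vectors. Production systems compute the same quantities with vectorized numeric libraries, so treat this as a reference for the formulas rather than an implementation to deploy.

```python
import math


def dot(a, b):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: direction only."""
    return dot(a, b) / (norm(a) * norm(b))


def l2_distance(a, b):
    """Straight-line (Euclidean) distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0]
b = [2.0, 4.0]    # same direction as a, twice the length
c = [-1.0, -2.0]  # opposite direction to a

print(cosine_similarity(a, b))  # approximately 1: length is ignored
print(cosine_similarity(a, c))  # approximately -1: opposite direction
print(dot(a, b))                # grows with vector length
print(l2_distance(a, b))        # non-zero although the direction is identical
```

One practical consequence: when all vectors are normalized to unit length, cosine similarity and dot product coincide, and L2 distance produces the same ranking, so the choice of metric then mostly affects speed rather than results.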
Indexing Algorithms: How a Database Searches Billions of Vectors
Exact nearest neighbor search (k-NN) requires comparing a query against every vector in the database, so its cost grows linearly with the size of the collection. For datasets on the order of billions of records, this is too slow for interactive use, and the curse of dimensionality makes exact tree-based indexes ineffective in hundreds of dimensions.
The solution is Approximate Nearest Neighbor (ANN) algorithms, which give up a small amount of recall in exchange for sublinear, often roughly logarithmic, search time.
- HNSW (Hierarchical Navigable Small World): the most widely used graph-based algorithm, implemented in engines such as Qdrant and pgvector. It builds a multi-layer graph in which the upper layers hold few nodes joined by long-range links (fast coarse navigation) and the lower layers hold progressively denser local connections. A search enters at the top layer and descends layer by layer to the nearest neighbors in the bottom layer, typically in milliseconds.
- IVF (Inverted File Index) + PQ (Product Quantization): IVF partitions the vector space into Voronoi cells defined by centroids, and a search scans only the cells nearest to the query. PQ then compresses each high-dimensional vector into a short code, so the index takes up far less RAM at the cost of a slight drop in precision. Best for large, infrequently updated datasets.
- DiskANN: an algorithm developed by Microsoft Research (used by Azure Cosmos DB and Milvus) that serves vector search directly from SSDs, avoiding the cost of holding the whole index in RAM as classic HNSW does.
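To make the trade-off concrete, here is a minimal sketch of the IVF idea next to an exact scan. It makes several simplifying assumptions: the vectors are random 2-D points, the centroids are sampled from the data instead of being trained with k-means, and there is no PQ compression. The names `build_ivf`, `ivf_search`, and `nprobe` are chosen for this example and do not come from any particular library.

```python
import math
import random


def exact_search(query, vectors, k=3):
    """Brute-force k-NN: compares the query against every stored vector."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return order[:k]


def build_ivf(vectors, centroids):
    """Assign each vector id to the cell of its nearest centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        cell = min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))
        cells[cell].append(vid)
    return cells


def ivf_search(query, vectors, centroids, cells, k=3, nprobe=2):
    """Approximate k-NN: scan only the nprobe cells closest to the query."""
    by_distance = sorted(
        range(len(centroids)), key=lambda c: math.dist(query, centroids[c])
    )
    candidates = [vid for c in by_distance[:nprobe] for vid in cells[c]]
    return sorted(candidates, key=lambda i: math.dist(query, vectors[i]))[:k]


random.seed(0)
vectors = [(random.random(), random.random()) for _ in range(1000)]
centroids = random.sample(vectors, 16)  # assumption: sampled, not k-means trained
cells = build_ivf(vectors, centroids)

query = (0.5, 0.5)
exact = exact_search(query, vectors, k=5)
approx = ivf_search(query, vectors, centroids, cells, k=5, nprobe=4)
recall = len(set(exact) & set(approx)) / len(exact)
print(f"recall@5 with nprobe=4: {recall:.2f}")
```

Raising `nprobe` scans more cells, which improves recall and costs more time. Setting it to the number of centroids degenerates into the exact scan.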
RAG Architecture: How Vector Databases Power LLMs
Large language models like GPT-4, LLaMA, and Claude have knowledge frozen at training time. They suffer from three critical limitations: hallucinations (generating false but plausible-sounding information), a knowledge cutoff (no access to current data), and lack of domain-specific knowledge (private corporate documents never formed part of open training sets).
Retrieval-Augmented Generation (RAG) addresses these problems by combining the model's static "parametric memory" with an external "non-parametric memory" in a vector database. The mechanism was proposed in 2020 by researchers at Facebook AI Research (now Meta AI) and their collaborators as an alternative to costly fine-tuning.
The RAG process unfolds in three phases:
- Indexing phase: preparing knowledge "to be remembered":
  - Raw documents (PDFs, wikis, emails, ticket systems) are split into short fragments (chunks), e.g. a few paragraphs each. Adjacent chunks deliberately overlap at the seams (usually by the last 1-2 sentences), so that a sentence falling on a cut boundary is not torn apart.
  - Each chunk is then passed through an embedding model, a specialized neural network (e.g. OpenAI's text-embedding-ada-002) whose sole job is to turn text into a list of numbers (a vector) describing its meaning. Two fragments with similar content receive similar vectors, even if they use completely different words.
  - The vector is stored in the database together with source metadata (where it comes from, when it was created, who authored it), so that it can later be retrieved and the original cited.
- Retrieval phase: finding matching fragments in real time:
  - When a user asks a question, the system converts it into a vector using the same embedding model that was used to index the documents (the question vector and chunk vectors must "speak the same numerical language").
  - The database compares the question vector against the stored chunk vectors and picks those lying closest in the meaning space. A short list of best matches is returned (typically 3-10 fragments, known in jargon as Top-K), so the LLM receives a condensed context rather than an entire library.
  - Exhaustively scanning millions of vectors (k-NN, k-nearest neighbors) would be far too slow, so vector databases use an approximation (ANN, approximate nearest neighbors), trading a tiny fraction of accuracy for answers in milliseconds rather than minutes.
- Generation phase: the retrieved chunks are assembled into a prompt passed to the LLM (user question + precise context from the database). The model generates an answer grounded in verifiable facts rather than guessing from its network weights.
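The three phases can be sketched end to end in a few dozen lines. The example rests on loud assumptions: `embed` is a stand-in that hashes words into a fixed-size vector, so it captures word overlap only and not meaning; the "database" is a Python list scanned exhaustively; and the final LLM call is left out, with the sketch stopping at the assembled prompt. The file names and document texts are invented.

```python
import math
import re
import zlib

DIM = 1024  # toy dimensionality for the stand-in embedding


def embed(text):
    """Stand-in for a real embedding model: hashed bag-of-words, unit length."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode()) % DIM] += 1.0
    length = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / length for x in vec]


def chunk_sentences(text, size=3, overlap=1):
    """Split text into chunks of `size` sentences sharing `overlap` sentences."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    step = size - overlap  # requires size > overlap
    stop = max(len(sentences) - overlap, 1)
    return [" ".join(sentences[i:i + size]) for i in range(0, stop, step)]


def build_index(documents):
    """Indexing phase: chunk, embed, and store each chunk with its source."""
    return [
        {"vector": embed(chunk), "text": chunk, "source": source}
        for source, text in documents.items()
        for chunk in chunk_sentences(text)
    ]


def retrieve(index, question, top_k=3):
    """Retrieval phase: embed the question the same way, return the Top-K chunks."""
    q = embed(question)
    # Vectors are unit length, so the dot product equals cosine similarity.
    scored = [(sum(a * b for a, b in zip(q, e["vector"])), e) for e in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]


def build_prompt(question, chunks):
    """Generation phase (prompt only): combine the question with cited context."""
    context = "\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


documents = {
    "it-policy.txt": (
        "Passwords expire every 90 days. To reset a forgotten password, "
        "open the self-service portal. The help desk can unlock accounts."
    ),
    "travel.txt": (
        "Flights must be booked two weeks ahead. "
        "Hotel costs are reimbursed after the trip."
    ),
}
index = build_index(documents)
question = "How do I reset a forgotten password?"
hits = retrieve(index, question, top_k=1)
prompt = build_prompt(question, hits)
print(prompt)
```

Swapping `embed` for a real model and the list scan for a vector database query changes the quality and speed of retrieval, but the shape of the pipeline stays the same.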
Hybrid Search: When Semantic Search Alone Is Not Enough
Pure vector search excels at conceptual matching but often struggles with specific identifiers: invoice numbers, code names, acronyms. This is where hybrid search comes in.
Modern vector databases combine classic lexical search (the BM25 algorithm) with vector search. They also enable metadata filtering: before computing similarity across millions of vectors, the database can first narrow the search space logically, for example to documents dated after 2024 or belonging to a specific company department. The hybrid approach can substantially increase the relevance of RAG results.
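A minimal sketch of the idea follows, assuming a weighted sum as the fusion strategy (one common option among several). Here `keyword_score` is a deliberately crude stand-in for BM25, the document vectors are invented two-dimensional values, and `alpha` and `min_year` are parameter names chosen for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def keyword_score(query, text):
    """Stand-in for BM25: fraction of query terms found verbatim in the text."""
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(term in words for term in terms) / len(terms)


def hybrid_search(query, query_vec, docs, alpha=0.5, min_year=None, top_k=2):
    """Filter on metadata first, then rank by a blend of both scores.

    alpha=1.0 is pure vector search, alpha=0.0 is pure keyword search.
    """
    candidates = [d for d in docs if min_year is None or d["year"] >= min_year]
    scored = [
        (
            alpha * cosine(query_vec, d["vector"])
            + (1 - alpha) * keyword_score(query, d["text"]),
            d["id"],
        )
        for d in candidates
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:top_k]]


# Hypothetical documents with made-up vectors.
docs = [
    {"id": "a", "year": 2025, "text": "invoice INV-2041 for cloud hosting", "vector": [0.2, 0.9]},
    {"id": "b", "year": 2025, "text": "billing statement for server rental", "vector": [0.9, 0.3]},
    {"id": "c", "year": 2023, "text": "invoice INV-2041 draft", "vector": [0.9, 0.2]},
]
query, query_vec = "invoice INV-2041", [1.0, 0.2]

print(hybrid_search(query, query_vec, docs, alpha=1.0, min_year=2024))  # vector only
print(hybrid_search(query, query_vec, docs, alpha=0.5, min_year=2024))  # blended
```

With pure vector search, document "b" ranks first although it never mentions the invoice number. Blending in the keyword score moves "a", which contains the exact identifier, to the top. The year filter removes "c" before any similarity is computed.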
Overview of Major Vector Database Systems
The vector database market is now highly diversified. Two categories exist: native vector databases (Qdrant, Pinecone, Milvus, Weaviate, Chroma) and systems extended with vector capabilities (pgvector on PostgreSQL, Azure Cosmos DB).
- Pinecone is the leader in commercial SaaS solutions: a fully managed, serverless vector database. Zero barrier to entry, automatic scaling, SOC2 compliance. Drawback: a proprietary model that creates vendor lock-in; costs rise steeply for large datasets.
- Qdrant is written in Rust and has become popular among startups that need very low latency. Its payload filtering mechanism applies metadata filters during the HNSW graph traversal itself. Reported P99 latencies are around 12 ms at multi-million-vector scale. Available as open source and as a managed cloud service.
- Weaviate, from the Netherlands and written in Go, stands out for built-in on-the-fly vectorization modules and powerful hybrid search that blends BM25 and vector scores with an alpha parameter. Its GraphQL interface has a steep learning curve.
- Milvus (backed by Zilliz): an enterprise-grade open-source system designed for billions and even trillions of vectors. Kubernetes-native architecture, a wide variety of index types (HNSW, IVF, DiskANN, plus FAISS-based implementations), and GPU acceleration support. Requires an experienced DevOps team.
- pgvector: a PostgreSQL extension enabling vector operations (e.g., ORDER BY embedding <=> query_vector, where <=> is the cosine distance operator) with full ACID compliance. Eliminates the data synchronization problem between existing infrastructure and a separate vector database. Optimal for companies with existing Postgres infrastructure and datasets of up to ~5 million vectors.
- Chroma: aimed at rapid local AI prototyping and agentic LLM systems built with LangChain. Extremely simple to install, with support for Python and JavaScript. Less well suited to large production environments.
Practical Applications Beyond RAG
The flexibility of embeddings has found use far beyond corporate chatbots.
- E-commerce recommendation systems: vector databases map products and customers to dense numerical profiles. An online store can then capture abstract semantic connections: customers browsing luxury leather jackets may gravitate toward a specific color palette of leather footwear, despite no overlapping keywords in their purchase history.
- Knowledge management in organizations: RAG built on vector databases unifies hundreds of scattered wikis, Confluence documents, and PDF policies from SharePoint. Instead of searching for a specific phrase, an employee asks naturally, "what should I do if I forgot my password this morning?", and receives a precise answer citing the IT security policy.
- Modern SEO: search features such as Google AI Overviews rely increasingly on semantic matching rather than exact keywords. SEO specialists analyze the cosine similarity between their articles and an "ideal answer vector" for target queries. Keyword stuffing loses relevance; semantic coherence of content is what matters.
Summary: Choosing the Right Technology
Vector databases are no longer the exclusive domain of global corporations. The democratization of tools means small businesses can now set up RAG architectures at a fraction of former costs.
The choice comes down to a few key questions: do you need extremely low latency with your own infrastructure (Qdrant)? Do you prefer zero infrastructure management at the cost of vendor lock-in (Pinecone)? Do you have existing Postgres infrastructure and a dataset up to 5 million vectors (pgvector)? Are you building a system on billions of vectors with GPU acceleration (Milvus)?
The future points toward multimodal analysis: richer connections between text, image, and sound in real time. Vector databases will remain a critical interface for grounding the answers of generative AI assistants.
