Vector Databases Explained: Pinecone, Chroma, and Beyond
Every other AI demo I've seen this year follows the same pattern. Take some data, embed it into vectors, store it somewhere, then retrieve the relevant pieces when a user asks a question. This is the core of Retrieval-Augmented Generation, and the "store it somewhere" part is a vector database. It has quietly become one of the most important pieces of infrastructure in the modern AI stack.
I've been building RAG systems for the past few months, both for my own projects and while helping students in CS3000 think about how to structure their final projects. I have opinions now. Let me share them.
What Are Vector Embeddings?
Before we talk about databases, we need to talk about embeddings. An embedding is a way to represent data, usually text, as a dense vector of floating-point numbers. A sentence like "the cat sat on the mat" might become a list of 1536 numbers that encode its semantic meaning.
The key property is that semantically similar inputs produce vectors that are close together in vector space. "The cat sat on the mat" and "A kitten rested on the rug" will have similar embeddings, even though they share almost no words. This is what makes semantic search possible.
You generate embeddings using a model. OpenAI's text-embedding-ada-002 is the most popular right now, but open-source alternatives from Hugging Face (like all-MiniLM-L6-v2 from Sentence Transformers) work well too, especially if you want to avoid API costs.
How Similarity Search Works
Once you have vectors, you need to find the ones most similar to a query vector. The two most common distance metrics are:
Cosine similarity measures the angle between two vectors. It ranges from -1 to 1, where 1 means identical direction. It's the default for most text applications because it handles varying document lengths well.
Dot product is faster to compute and works well when your vectors are normalized. In practice, for normalized vectors, cosine similarity and dot product give the same ranking.
The challenge is doing this search efficiently. With a million documents, comparing your query against every single vector on every request gets slow and expensive. Vector databases use approximate nearest neighbor (ANN) algorithms like HNSW or IVF to make this fast, trading a tiny amount of accuracy for massive speed improvements.
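To make the two metrics concrete, here's a toy sketch in plain Python. The three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and the search is the brute-force scan that ANN indexes approximate:

```python
import math

def cosine_similarity(a, b):
    # Angle-based similarity: dot(a, b) / (|a| * |b|), ranges from -1 to 1.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Toy "embeddings" -- hand-picked so the two cat sentences land close together.
docs = {
    "cat on mat": [0.9, 0.1, 0.2],
    "kitten on rug": [0.8, 0.2, 0.3],
    "stock market report": [0.1, 0.9, 0.7],
}
query = [0.85, 0.15, 0.25]

# Exhaustive (exact) nearest-neighbor search: score every document.
# This is the work that ANN indexes like HNSW avoid doing at scale.
by_cosine = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)

# After normalizing, dot product produces the same ranking as cosine.
q_norm = normalize(query)
by_dot = sorted(docs, key=lambda d: dot_product(q_norm, normalize(docs[d])), reverse=True)

print(by_cosine[0])      # the closest document
print(by_cosine == by_dot)
```

Run it and both orderings put the cat sentences ahead of the finance one, which is the ranking-equivalence point from above in action.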
The Contenders
Pinecone
The managed option. You don't run any infrastructure. You send vectors to their API, they handle indexing, scaling, and retrieval. It's fast, reliable, and the developer experience is polished.
Best for: Production applications where you don't want to manage infrastructure. Startups that need to move fast. Teams without dedicated ML ops.
Drawback: It's a paid service, and costs can add up with large datasets. You're also locked into their ecosystem.
Chroma
The open-source, developer-friendly option. It runs locally, stores data in SQLite by default, and has the simplest API of any vector database I've used.
import chromadb
from chromadb.utils import embedding_functions

# In-memory client; use chromadb.PersistentClient to keep data on disk
client = chromadb.Client()

# Embed locally with a Sentence Transformers model, no API key needed
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
collection = client.create_collection("my_docs", embedding_function=ef)

# Chroma embeds the documents for you at insert time
collection.add(
    documents=["Vector databases store embeddings", "RAG uses retrieval for generation"],
    ids=["doc1", "doc2"]
)

# The query text is embedded too, then matched against stored vectors
results = collection.query(query_texts=["How do vector databases work?"], n_results=1)
print(results["documents"])

Best for: Prototyping, local development, small to medium datasets. Students and researchers who want to understand the concepts without cloud costs.
Drawback: Not designed for large-scale production workloads out of the box.
Weaviate
A more feature-rich open-source option. It supports hybrid search (combining vector and keyword search), built-in vectorization modules, and GraphQL-based querying. It's heavier than Chroma but more capable.
Best for: Applications that need hybrid search or more complex query patterns. Teams that want open-source but need production-grade features.
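To show what "hybrid search" means conceptually, here's a simplified sketch in plain Python. It blends a crude keyword-overlap score with cosine similarity using a weighting parameter; the names, toy vectors, and blending formula are my own illustration, not Weaviate's API (Weaviate uses BM25 for the keyword side and its own fusion methods):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def keyword_score(query_terms, doc_text):
    # Crude keyword overlap: fraction of query terms present in the document.
    # Real engines use BM25 or similar here.
    words = set(doc_text.lower().split())
    return sum(1 for t in query_terms if t in words) / len(query_terms)

def hybrid_score(query_terms, query_vec, doc_text, doc_vec, alpha=0.5):
    # alpha blends the two signals: 1.0 = pure vector, 0.0 = pure keyword.
    return alpha * cosine(query_vec, doc_vec) + (1 - alpha) * keyword_score(query_terms, doc_text)

docs = [
    ("Pinecone is a managed vector database", [0.9, 0.1]),
    ("Our cat knocked over the database server", [0.2, 0.8]),
]
query_terms = ["vector", "database"]
query_vec = [0.85, 0.2]

ranked = sorted(docs, key=lambda d: hybrid_score(query_terms, query_vec, d[0], d[1]), reverse=True)
print(ranked[0][0])
```

The value of the hybrid approach is exactly this blend: exact keyword matches (product names, error codes) still count even when the embedding model doesn't consider them semantically close to the query.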
Qdrant
Written in Rust, focused on performance. It has excellent filtering capabilities, meaning you can combine vector similarity with metadata filters efficiently. The API is clean and well-documented.
Best for: Performance-sensitive applications. Use cases where you need rich filtering alongside similarity search.
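To illustrate what filtered vector search does, here's a naive sketch in plain Python: restrict candidates by a metadata predicate, then rank the survivors by similarity. The record layout and field names are invented for the example, and Qdrant itself evaluates filters during index traversal rather than pre-filtering like this, which is why its approach stays fast:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Each record pairs a vector with metadata (Qdrant calls this the "payload").
records = [
    {"id": 1, "vector": [0.9, 0.1], "lang": "en", "year": 2023},
    {"id": 2, "vector": [0.88, 0.15], "lang": "de", "year": 2023},
    {"id": 3, "vector": [0.2, 0.9], "lang": "en", "year": 2022},
]

def filtered_search(query_vec, predicate, k=1):
    # Only candidates passing the metadata predicate compete on similarity.
    candidates = [r for r in records if predicate(r)]
    return sorted(candidates, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)[:k]

# "Most similar English document from 2023 or later"
hits = filtered_search([0.85, 0.2], lambda r: r["lang"] == "en" and r["year"] >= 2023)
print(hits[0]["id"])
```

Notice that record 2 is more similar to the query than record 3, but the filter excludes it; combining both conditions in one query is the capability Qdrant is known for.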
Which One Should You Pick?
For learning and prototyping, start with Chroma. It takes five minutes to set up and teaches you the core concepts without any infrastructure overhead. For production, the choice depends on your constraints. If you want managed simplicity, Pinecone. If you want open-source with production features, Weaviate or Qdrant. I go much deeper on the production trade-offs in my Pinecone vs Qdrant vs Weaviate decision framework, and if you go the self-hosted route, my guide to self-hosting Qdrant covers the full path from Docker Compose to production.
The Bigger Picture
Here's what I keep telling my students: vector databases are becoming as fundamental as relational databases. Not a replacement for them, but a necessary complement. Every application that uses AI for search, recommendations, or generation will need some form of vector storage and retrieval.
A few years ago, knowing SQL was non-negotiable for data work. I think understanding embeddings and vector search is heading in the same direction. The specific tools will evolve: Pinecone might dominate or get displaced, and Chroma might scale up or stay niche. But the underlying concept of storing and retrieving semantic representations is here to stay. Build your intuition for it now.