
Self-Hosting Qdrant: From Docker Compose to Production


If you've decided Qdrant is your vector database -- and if you need speed plus self-hosting, it probably should be -- here's the practical setup guide I wish existed. Not the "what is a vector database" preamble. Not the managed cloud pitch. Just the steps to go from nothing to a production-ready Qdrant instance.

End-to-end Qdrant deployment: from Docker to production-ready vector search.

Local Development

Everything starts with Docker Compose. One service, two ports, one volume. That's it.

version: "3.8"
 
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"  # REST API
      - "6334:6334"  # gRPC
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      - QDRANT__SERVICE__GRPC_PORT=6334
 
volumes:
  qdrant_data:

Port 6333 is the REST API -- you'll use this for the dashboard and quick debugging. Port 6334 is gRPC -- use this from your application code because it's significantly faster for batch operations. The volume mount is critical. Without it, you lose all your data every time the container restarts. I've made that mistake exactly once.

Run docker-compose up -d and hit http://localhost:6333/dashboard to verify everything is alive. Qdrant ships with a built-in web UI that lets you browse collections, run test queries, and inspect points. It's surprisingly useful for development.
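If you'd rather script that check, the REST root endpoint answers with title and version info. A minimal sketch using only the standard library -- the exact JSON fields can vary by version, so this just probes reachability:

```python
import json
import urllib.request

def qdrant_alive(base_url: str = "http://localhost:6333") -> bool:
    """Probe the REST root endpoint; Qdrant replies with title/version JSON."""
    try:
        with urllib.request.urlopen(f"{base_url}/", timeout=2) as resp:
            return "version" in json.load(resp)
    except OSError:
        # connection refused, DNS failure, timeout -- the instance isn't reachable
        return False
```

Handy as a readiness gate in CI or a startup script before your app begins upserting.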

Creating Collections

A collection in Qdrant is like a table in Postgres -- it's where your vectors live. You define it with a name, vector dimensions, and a distance metric.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
 
client = QdrantClient(host="localhost", port=6333)
 
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=384,           # must match your embedding model's output dimension
        distance=Distance.COSINE,
    ),
)

The size parameter must match the dimensionality of your embedding model. all-MiniLM-L6-v2 outputs 384 dimensions. OpenAI's text-embedding-3-small outputs 1536. Get this wrong and Qdrant will reject your uploads with a cryptic dimension mismatch error.
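A cheap client-side guard avoids that round trip entirely. A sketch -- `EXPECTED_DIM` and `check_dim` are hypothetical helpers, not Qdrant API:

```python
EXPECTED_DIM = 384  # must match the VectorParams(size=...) of the collection

def check_dim(vector: list[float]) -> None:
    """Fail fast locally instead of waiting for Qdrant's server-side rejection."""
    if len(vector) != EXPECTED_DIM:
        raise ValueError(
            f"got a {len(vector)}-dim vector, collection expects {EXPECTED_DIM}"
        )
```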

Distance metric choice matters. Use Cosine for normalized embeddings, which covers most sentence-transformer models. Use Dot product when your embeddings are unnormalized and magnitude carries meaning. Euclidean is there if you need it, but I've rarely found a reason to pick it over the other two.
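If you're not sure whether your model's embeddings are normalized, a quick numpy check settles it -- a minimal sketch where `is_normalized` is a helper defined here, not a Qdrant API:

```python
import numpy as np

def is_normalized(vectors: np.ndarray, tol: float = 1e-3) -> bool:
    """True if every row has (approximately) unit L2 norm."""
    norms = np.linalg.norm(vectors, axis=1)
    return bool(np.all(np.abs(norms - 1.0) < tol))

unit = np.array([[0.6, 0.8], [1.0, 0.0]])  # rows with exact unit norm
print(is_normalized(unit), is_normalized(unit * 3.0))  # → True False
```

On unit vectors, cosine similarity and dot product are equivalent, so if this returns True either metric gives the same ranking.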

Under the hood, Qdrant builds an HNSW index on your vectors automatically. The defaults are solid for most workloads, but if you're loading millions of vectors, you can tune m (graph connectivity) and ef_construct (index build quality) to trade build time for search accuracy.

Loading Embeddings

Generate your embeddings, pair them with metadata payloads, and batch upsert. Here's the full pattern.

from qdrant_client.models import PointStruct
from sentence_transformers import SentenceTransformer
 
model = SentenceTransformer("all-MiniLM-L6-v2")
 
documents = [
    {"text": "Qdrant is a vector search engine written in Rust.", "category": "tech", "date": "2026-01-15"},
    {"text": "HNSW provides approximate nearest neighbor search.", "category": "algorithms", "date": "2026-01-20"},
    # ... hundreds more
]
 
# generate embeddings
texts = [doc["text"] for doc in documents]
embeddings = model.encode(texts)
 
# batch upsert
points = [
    PointStruct(
        id=i,
        vector=embedding.tolist(),
        payload=doc,
    )
    for i, (embedding, doc) in enumerate(zip(embeddings, documents))
]
 
# upload in batches of 100-500 for best throughput
BATCH_SIZE = 200
for start in range(0, len(points), BATCH_SIZE):
    batch = points[start : start + BATCH_SIZE]
    client.upsert(collection_name="documents", points=batch)

Batch sizes of 100-500 hit the sweet spot for throughput. Go smaller and you're wasting round trips. Go larger and you risk timeouts or memory pressure on the server. I usually start at 200 and adjust based on payload size.

The payload is any JSON metadata you want to attach to each vector. This is where Qdrant starts to differentiate itself from simpler vector stores -- those payloads become filterable at query time.

Filtered Search

This is Qdrant's killer feature. You can combine vector similarity with structured payload filters in a single query. "Find me the 10 most similar documents WHERE category = 'legal' AND date is after 2025-01-01." One API call, sub-millisecond overhead on the filtering.

from qdrant_client.models import Filter, FieldCondition, MatchValue, DatetimeRange
 
query_embedding = model.encode("How does approximate nearest neighbor work?")
 
results = client.search(
    collection_name="documents",
    query_vector=query_embedding.tolist(),
    query_filter=Filter(
        must=[
            FieldCondition(
                key="category",
                match=MatchValue(value="algorithms"),
            ),
            FieldCondition(
                key="date",
                # dates are stored as RFC 3339 strings; Range only handles
                # numbers, so use DatetimeRange for date comparisons
                range=DatetimeRange(gte="2025-01-01T00:00:00Z"),
            ),
        ]
    ),
    limit=10,
)
 
for result in results:
    print(f"Score: {result.score:.4f} | {result.payload['text']}")

The filter syntax supports must, should, and must_not clauses -- essentially boolean AND, OR, and NOT. You can nest them, combine range filters with exact matches, and even filter on arrays. It's expressive enough for real applications without being overengineered.

Why this matters for RAG: most retrieval pipelines need more than just "find the 10 nearest vectors." You need to scope results by user, tenant, document type, date range, or permission level. Without payload filtering, you end up doing a broad vector search and then post-filtering in Python, which is both slower and returns fewer relevant results. Qdrant does the filtering at the index level, so you get the right results without the waste. You'll also want to add guardrails on the generation side to ensure the LLM doesn't leak sensitive filtered data back to unauthorized users.

Production Checklist

Going from local development to production? Here's what you need.

Enable authentication. By default, Qdrant accepts unauthenticated requests. In production, set an API key via the QDRANT__SERVICE__API_KEY environment variable. Every request must then include the key in the api-key header. Non-negotiable.

Set up snapshots for backup. Qdrant supports on-demand snapshots of individual collections. Hit the /collections/COLLECTION_NAME/snapshots endpoint to create one, then store it in S3 or wherever your backup pipeline lives. Schedule this. Losing your vector index means re-embedding everything, which is both slow and expensive.

Expose the metrics endpoint. Qdrant serves Prometheus-compatible metrics at /metrics. Scrape this with your existing monitoring stack -- if you're already instrumenting your LLM application with OpenTelemetry, plugging in Qdrant metrics fits naturally into the same pipeline. Watch for search latency, indexing queue depth, and memory usage. If search latency starts creeping up, you're either under-provisioned or your HNSW index needs tuning.

Plan for replication. Single-node Qdrant is fine for moderate workloads, but if uptime matters, run a cluster with replication factor 2 or higher. Qdrant's distributed mode uses Raft consensus and supports automatic failover. The configuration is straightforward -- add QDRANT__CLUSTER__ENABLED=true and point nodes at each other.
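A two-node Compose sketch of those cluster settings -- the node names are assumptions, while the p2p port 6335 and the --uri/--bootstrap flags mirror the pattern in Qdrant's distributed deployment docs; adjust for your topology:

```yaml
services:
  qdrant-node-1:
    image: qdrant/qdrant:latest
    environment:
      - QDRANT__CLUSTER__ENABLED=true
    command: ./qdrant --uri http://qdrant-node-1:6335

  qdrant-node-2:
    image: qdrant/qdrant:latest
    environment:
      - QDRANT__CLUSTER__ENABLED=true
    command: ./qdrant --bootstrap http://qdrant-node-1:6335 --uri http://qdrant-node-2:6335
```

The first node starts the cluster; the second joins it via the bootstrap URI. Set the replication factor per collection when you create it.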

Set resource limits. Qdrant is memory-hungry by design -- it memory-maps the HNSW index for speed. Make sure your container or VM has enough RAM for your dataset plus overhead. A rough rule of thumb from Qdrant's capacity planning guidance: raw vector size (vectors × dimensions × 4 bytes for float32) times about 1.5 for index overhead -- roughly 2.3 GB per million 384-dimensional vectors, more for larger dimensions.
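That estimate is simple enough to script. A sketch -- the 1.5x multiplier follows Qdrant's capacity-planning guidance, so treat the result as a starting point, not a guarantee:

```python
def estimated_ram_gb(num_vectors: int, dims: int, overhead: float = 1.5) -> float:
    """Raw float32 vector bytes times an overhead factor for the HNSW graph."""
    return num_vectors * dims * 4 * overhead / 1e9

# one million 384-dim vectors at the 1.5x overhead factor
print(round(estimated_ram_gb(1_000_000, 384), 2))  # → 2.3
```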

Qdrant operational, RAG pipeline connected. That's a production vector search in an afternoon. The Rust-powered engine handles the hard parts -- HNSW indexing, filtered search, snapshotting -- and you get to focus on the retrieval logic that actually matters for your application.