
Scaling Vector Databases Without Burning Cash (and Your Weekend)
Your marketing squad just added five new languages and grew page‑level embeddings from 10 million to 50 million, all before your first coffee. Latency SLOs? Unchanged. Budget? Of course not. If you size the cluster wrong today, you will be explaining the overage on every ops call for the next 12 months.
“Every vector you store is a tiny monthly subscription you’ve sold back to your cloud provider.”
This long‑form guide distills six months of Slack wars, real invoices, and post‑mortem tears into a practical playbook. We cover the four‑step sizing workflow, per‑engine memory overheads, node shapes and real €/M vector costs, compression and quantisation, multi‑tenancy, and the pitfalls that quietly blow up budgets.
By the end, you’ll know exactly why your data science team’s “just 768‑dim embeddings” translates to either three CCX53 nodes or two r7g.8xlarge, and how to debate that in the next exec meeting.
Vector database scaling is the art of expanding storage, RAM, and compute so approximate‑nearest‑neighbour (ANN) queries stay within p95 latency targets as vector count, dimension, or QPS climb. It usually means sharding indices, tiering storage (RAM→NVMe→S3), and balancing replicas for HA.
Cloud Modernisation, Not Moon‑Shots
Investors love AI‑driven UX, but hate infra burn.
RAG, personalised search, and semantic analytics balloon vector count faster than user count.
Early cost slip‑ups are now visible to venture capital; billing lines expose overspend down to the hour.
“Vector search is the first infra line‑item the Board reads after ‘GPU spend’.”
| Pain | Symptom | Hidden Cost |
|---|---|---|
| Feature team adds language or modality | Embedding volume × 3 | 1‑click deploy triggers €4 k/mo RAM growth |
| Multi‑tenant SaaS, one marquee customer doubles traffic | Spike in cache misses | Other tenants suffer recall drop ➜ churn risk |
| Pivot to serverless demo | Pay‑per‑request RU/WU spikes | CFO asks why POC costs more than prod |
Make no mistake: Vector DBs are moving out of “nice‑to‑have” into the critical path of real‑time UX. You’ll scale them whether prepared or not.
We extend the short formula you saw earlier into an end‑to‑end workflow with code snippets and sanity checkpoints.
Step 1 – Estimate Raw Footprint
Raw footprint = vector count × dimensions × 4 bytes (float32). Result: 50 M × 768‑d float32 ≈ 143 GiB.
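As a sanity check, the same arithmetic in a few lines of Python (a minimal sketch; swap in your own vector count and dimension):

```python
# Raw footprint = vectors × dimensions × bytes per float32
vectors = 50_000_000
dims = 768
raw_bytes = vectors * dims * 4            # float32 = 4 bytes

print(f"{raw_bytes / 2**30:.0f} GiB")     # ≈ 143 GiB
```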
Step 2 – Apply Engine Factor (F)
Each engine keeps ancillary graphs, caches, and metadata. We measured memory at 50 M vectors for default HNSW configs.
| Engine | Overhead F | Memory @ 50 M | Why |
|---|---|---|---|
| Milvus (HNSW) | × 7‑8 | 1.0‑1.1 TB | Graph & neighbour lists in RAM |
| Weaviate | × 2 | 286 GB | Vector cache + inverted index |
| Qdrant | × 1.5 | 215 GB | Payload encoded leanly |
| Vespa | × 1.2‑3 | 170‑430 GB | Compression selectable (bf16, int8, PQ) |
Rule‑of‑thumb: Don’t trust vendor docs—dump and load 1 million vectors first, then multiply.
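Turning the raw footprint into per‑engine estimates is a single multiplication by F. A rough sketch using midpoints of the factors above; treat them as starting points until you have run your own 1 M‑vector load test:

```python
raw_gib = 143  # Step 1 result

# Overhead factors F (midpoints of the measurements above)
engine_factor = {
    "Milvus (HNSW)": 7.5,
    "Weaviate": 2.0,
    "Qdrant": 1.5,
    "Vespa (bf16)": 1.2,
}

for engine, f in engine_factor.items():
    print(f"{engine:>14}: ~{raw_gib * f:,.0f} GiB")
```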
Step 3 – Map to Node Shapes
Using Weaviate’s ×2 factor: 286 GB ÷ 128 GB ≈ 2.2 → 3× CCX53 on Hetzner gives a 384 GB cluster with roughly 30 % head‑room. Want AWS? Two r7g.8xlarge offer 512 GB, but at 3‑5× the cost.
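The node count is just a ceiling division. A sketch with the CCX53 numbers plugged in:

```python
import math

needed_gb = 286       # Weaviate estimate from Step 2
node_ram_gb = 128     # Hetzner CCX53

nodes = math.ceil(needed_gb / node_ram_gb)     # → 3
cluster_gb = nodes * node_ram_gb               # → 384 GB
headroom = 1 - needed_gb / cluster_gb          # → ~26 %

print(f"{nodes} × CCX53 = {cluster_gb} GB ({headroom:.0%} head-room)")
```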
Step 4 – Add Replicas & Tiering
Reads dominate at scale → 2× replicas give HA and double QPS.
Hot shards stay in NVMe, cold shards flush to S3.
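How replication and tiering split the footprint is easiest to see in numbers. In the sketch below the 30 % hot‑shard fraction is an assumption, not a measurement:

```python
needed_gb = 286
replicas = 2           # 2× replicas: HA plus double read QPS
hot_fraction = 0.30    # assumption: share of shards that must stay in RAM/NVMe

hot_gb = needed_gb * hot_fraction * replicas   # replicated hot tier
cold_gb = needed_gb * (1 - hot_fraction)       # single copy flushed to S3 (S3 adds its own redundancy)

print(f"hot (RAM/NVMe): {hot_gb:.0f} GB, cold (S3): {cold_gb:.0f} GB")
```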
Topology Declaration & Queue Types
Choose classic mirrors or x-queue-type: quorum. Quorum queues use Raft, drop priorities, and behave differently with TTL.Reliable Publishing
Enable publisher confirms (channel.confirmSelect()), else a broker fail-over can eat in-flight messages.Connection Resilience
Use clients with automatic connection & channel recovery, then re-declare exchanges/queues after reconnect. Expect at least one reconnect per monthly AWS patch window.Idempotent Consumers
Fail-over may redeliver. Make handlers safe for duplicates.Prefetch & Back-Pressure Tuning
Large backlogs replicate across three AZs, killing latency. Keep queues short, prefetch modest (20-50), and monitor QueueDequeue CloudWatch metric.Sizing & Sharding
Heavy streams? Split by key into multiple queues or brokers. Cluster won’t lift the single-queue ceiling.Alert Hygiene
Three times the nodes means three times the metrics. De-noise your dashboards (e.g., ignore benign raft elections).
Node Shapes by Corpus Size

| Range | Milvus | Weaviate | Qdrant | Vespa |
|---|---|---|---|---|
| Dev < 5 M | 1 × 8 vCPU / 32 GB | 1 × 8 vCPU / 32 GB | 1 × 4 vCPU / 16 GB | 1 content + 2 API (≈ 64 GB) |
| Small ≈ 50 M | 3 × 32 vCPU / 128 GB | 3 × 128 GB | 3 × 64 GB | 6 × 64 GB |
| Mid ≈ 0.5 B | 25 × 64 vCPU / 256 GB | 12 × 256 GB | 12 × 128 GB | 24 × 72 vCPU |
Hetzner vs AWS Cost Table
| Provider | Nodes | €/month | €/M vector | Notes |
|---|---|---|---|---|
| Hetzner CCX53 | 3 × 128 GB | €675 | €13.5 | Flat‑rate, EU DC |
| AWS r7g.8xlarge | 2 × 256 GB | €2 275 | €45 | Spot saves ~70 % but adds interruption risk |
| AWS r7a.8xlarge | 2 × 256 GB | €3 900 | €78 | eu-central-1 on‑demand |
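The €/M vector column is simply monthly cluster cost divided by vector count; worth a one‑liner whenever a vendor quote lands (figures copied from the table above):

```python
vectors_m = 50   # million vectors in the cluster

clusters = {
    "Hetzner CCX53 × 3": 675,       # €/month
    "AWS r7g.8xlarge × 2": 2_275,
    "AWS r7a.8xlarge × 2": 3_900,
}

for name, eur_month in clusters.items():
    print(f"{name}: €{eur_month / vectors_m:.1f} per million vectors")
```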
Compression & Quantisation
Switch to HNSW + PQ:
Memory shrink: 24× (float32→int8 sub‑vectors).
Recall impact: ≤1 % on MS MARCO 50 M.
Cost drop: Weaviate €/M vector ≈ €0.6.
“Quantise cold shards—turn an r7a budget into a t4g bill.”
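The 24× figure is plain arithmetic once you fix a PQ layout. A sketch assuming 768‑d vectors compressed into 128 one‑byte sub‑vector codes (your segment count and code size may differ):

```python
dims = 768
raw_bytes = dims * 4            # 3 072 bytes per float32 vector

pq_segments = 128               # assumption: 6 dims per sub-vector
code_bytes = 1                  # one int8 code per sub-vector
pq_bytes = pq_segments * code_bytes

print(f"compression: {raw_bytes / pq_bytes:.0f}×")                  # 24×
print(f"50 M vectors: {50_000_000 * pq_bytes / 2**30:.1f} GiB of codes "
      "(codebooks and HNSW graph not included)")
```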
| Engine class | Writes (ingest) | Reads (query) | R:W split |
|---|---|---|---|
| Vector DB (ANN) | CPU‑heavy index build, 1‑2× RAM | Graph walks, RAM‑latency bound, GPU optional | 1 : 3‑5 |
| SQL / NoSQL | Small random I/O | Short key lookups, cache‑friendly | 1 : 1 |
| Loki / TSDB | Append to object store | Massive decompression | 1 : 8 |
| ClickHouse | Chunk‑merge CPU | Vectorised scans | 1 : 4 |
“We budget Loki for traffic spikes; we budget vector search for new features—very different fiscal rhythms.”
| Engine | Isolation Primitive | Strength | Caveat |
|---|---|---|---|
| Milvus | Database → Collection | Strong RBAC | 64‑DB cap |
| Qdrant | is_tenant payload | Lightest on RAM | Cluster‑global limits |
| Weaviate | Tenant shards | Data invisible cross‑tenant | Off by default |
| Vespa | Tenant → App → Instance | Billing & quota | Pin zones for hard isolation |
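As an illustration of the payload route, here is a minimal Qdrant sketch that scopes every search to one tenant via a payload filter. The collection name, field name, and query vector are placeholders; recent Qdrant versions can additionally mark the field as a tenant key when the payload index is created:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")

# Scope the ANN search to a single tenant with a payload filter.
hits = client.search(
    collection_name="pages",            # placeholder collection
    query_vector=[0.0] * 768,           # placeholder query embedding
    query_filter=Filter(
        must=[FieldCondition(key="tenant_id", match=MatchValue(value="acme"))]
    ),
    limit=10,
)
print([h.id for h in hits])
```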
Over‑sharding = lost recall. Keep ≤64 shards/query or adopt routing‑aware hashing.
Implicit index rebuilds. Switching distance metrics triggers a rebuild that doubles memory until the swap completes.
Serverless shock. Pinecone RU/WU pricing is great for POCs; sustained 100 QPS can out‑price self‑hosting within weeks.
Ignoring write spikes. Online model fine‑tuning can add 30 % write throughput overnight; plan ingest capacity for it.
Think tiers, not instances. RAM for hot shards, NVMe for warm, S3 for cold.
Quantise early. PQ accuracy loss is negligible at billion‑vector scale.
Treat embeddings like logs. Retention policy + auto‑archive to cheap storage.
Automate with IaC. Use Terraform modules for shard counts so data scientists can request capacity without kubectl.
Observe recall, not just latency. Dropping from 99 % to 94 % recall can slip past alerts yet ruin conversions (see the sketch after this list).
GPU nodes shine above 50 k QPS per shard; below that, AVX‑512 CPUs are cheaper.
SIMD index builds in FAISS 1.8 cut ingest time by 40 %.
Serverless warm pools: keep 10 % of vectors in Pinecone for demos and the bulk in Qdrant BYOC.
Regulatory headwinds: EU AI Act will require audit trails → pick engines with WAL + S3 snapshots.
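Recall is cheap to measure if you keep a small ground‑truth sample: run the same queries through brute force and through the ANN index, then compare. A minimal numpy sketch of recall@k, where `ann_search` stands in for whichever client you actually use:

```python
import numpy as np

def recall_at_k(queries, corpus, ann_search, k=10):
    """Share of exact top-k neighbours that the ANN index also returns."""
    hits = 0
    for q in queries:
        exact = np.argsort(np.linalg.norm(corpus - q, axis=1))[:k]  # brute-force ground truth
        approx = ann_search(q, k)                                   # IDs from the ANN engine
        hits += len(set(exact.tolist()) & set(approx))
    return hits / (len(queries) * k)
```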
Vector search is moving from prototype to production faster than most infra. Armed with the 4‑step framework and real €/M vector costs, you can defend budgets, architect smart tiers, and sleep the night before launch day.