Personal RAG KB: A 4-Version Journey from Cloud ADB to Local Mac M2 Max

Why I moved from Oracle ADB cloud → pgvector replica → local Mac — and what I learned measuring latency, quality, and cost at each version.

TL;DR — I built a personal RAG (~6,000 documents, ~57K chunks) across 4 versions in 9 days. The final version runs locally on a MacBook Pro M2 Max: warm latency p50 ~0.92s, top-1 score 0.95+ on typical queries, $0/month, hybrid retrieval (dense + sparse + ColBERT) + cross-encoder rerank. This post logs each version, what I measured, what I traded off, and what I learned.

Context

I have many scattered knowledge sources: Confluence (~3.6K pages), Jira (~534 tickets), Slack threads, Krisp/Granola meeting transcripts, Gmail, Claude Code conversation logs (~270 sessions/month), markdown notes in ~/Documents/. The problem: 90% of the knowledge I’d absorbed couldn’t be retrieved when needed. Spotlight only matches exact strings. Per-source MCP is too slow. Each Claude conversation burned 40-80K tokens on grep + Read.

The initial goal was simple: ask “who said what about X?” via Claude → get the top 5 relevant snippets + source links in <3 seconds.

I let that goal drive the 4 pivots below.

V1 — Cloud ADB SIN (Day 1-8)

Stack: GCP VM e2-standard-4 in Singapore (SIN) + Oracle Autonomous DB 23ai (built-in vector type, Always Free) + FastMCP Streamable HTTP server + Cloudflare named tunnel mcp.vuihoc.ai + OAuth 2.0 (PKCE + DCR + password) + legacy bearer fallback.

Embedder: initially BGE-small-en-v1.5 (384d). By Day 3b I had to pivot to multilingual-e5-small because Vietnamese queries only scored 0.69-0.77 with the English-only model. After the swap: 0.85+.

M1 measurements:

| Metric | Target | Achieved |
| --- | --- | --- |
| Hit@5 | ≥60% | ~85% (subjective on 5 real queries) |
| Latency p95 | <5s | 1.16s |
| Sources | 100 | 5,329 / 47K chunks |
| Touchpoint | Telegram | MCP (Claude Desktop + web + iOS) |
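
For reference, metrics like Hit@5 and latency percentiles can be computed in a few lines. This is a minimal sketch and not the script behind the numbers above: the function names are hypothetical, and the percentile uses the simple nearest-rank definition, which is an assumption about how p50/p95 were taken.

```python
import math
from typing import Sequence, Set, Tuple


def hit_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> bool:
    """True if at least one of the top-k results was judged relevant."""
    return any(doc_id in relevant_ids for doc_id in ranked_ids[:k])


def hit_rate(runs: Sequence[Tuple[Sequence[str], Set[str]]], k: int = 5) -> float:
    """Fraction of queries that have a relevant result in the top k."""
    if not runs:
        raise ValueError("need at least one query")
    hits = sum(hit_at_k(ranked, relevant, k) for ranked, relevant in runs)
    return hits / len(runs)


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]
```

With only a handful of queries, p95 by nearest rank is effectively the slowest call, so it is worth reporting the sample size next to the number.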

The biggest pivot in V1: drop the Telegram bot, switch to MCP. Reason: Claude clients (Desktop/web/iOS) call tools via MCP themselves, no need to build a custom bot UI. Same UX, less code, free mobile access.

Strengths: shipped fast (~22h), free tier (ADB 20GB + Cloudflare + GCP credit), public access from any Claude client.

Weaknesses: my Mac is in Vietnam and ADB is in Singapore → a 200ms cross-region tax on every query. BGE-en is weak for Vietnamese out of the box. Retrieval was pure dense with no rerank: semantic matching was OK but missed edge cases.

V2 — pgvector replica (Day 10)

V1’s problem: 200ms cross-region was hard to accept. Instead of tuning the embedder, I attacked latency.

Migrated ADB SIN → self-hosted pgvector on GCP e2-micro US (Always Free) → then replicated to asia-southeast1 (paid via credit). Same tunnel-id, no DNS change.

I also swapped the embedder to Voyage 512d API because at the time I thought “API reduces VM RAM, less model maintenance.”

Latency result: warm p50 dropped from 700-850ms to 148-229ms (roughly 72% lower).

Quality problem: Voyage 512d was weaker than e5-small for mixed VN/EN queries. Top-1 score regressed: ~0.85 (e5) → ~0.67 (Voyage). Several VN queries returned irrelevant results.

V2 lesson: optimizing latency by sacrificing quality is foolish. The user (me) accepts 700ms for a 0.85 score, but not 150ms for a 0.67 score. Speed only matters after the answer is right.

V2 worked for a few days, then I decided to make a bigger pivot: drop the cloud entirely.

V3 — Local Mac M2 Max (Day 11-13)

Premise: A personal RAG doesn’t need the cloud. The Mac is always on, the data is personal, why round-trip the Internet?

Hardware in use: MacBook Pro M2 Max, 12 cores (8P+4E), 64GB RAM. Plenty of headroom for a heavy stack.

New stack:

  • Embedder: bge-m3 1024d. It is special because a single forward pass produces 3 representations: a dense vector, a sparse vector (token weights), and a ColBERT-style multi-vector. Top quality on multilingual retrieval (MIRACL ~76%).
  • Reranker: bge-reranker-v2-m3 cross-encoder — rescore the top-N candidates after retrieval.
  • Fusion: RRF (Reciprocal Rank Fusion) merges dense + sparse rankings.
  • DB: Postgres 16 + pgvector (built from source because the brew formula only ships for v17/v18).
  • Server: HTTP daemon on port 8080 instead of stdio MCP. The key reason: 3 clients (Claude Desktop + KB ingest hook + future iOS) share the same daemon → 1× model RAM instead of 3× (2.5GB vs 7.5GB).
  • Public access: Cloudflare tunnel reusing the same tunnel-id → mcp.vuihoc.ai points to the local Mac. Mobile/web still get in.
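
The fusion step in the list above can be sketched as follows. This is a minimal, generic Reciprocal Rank Fusion, not the daemon's actual code: the function name is hypothetical, and the constant k=60 is the value commonly used with RRF, assumed here because the post does not state one.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Sequence, Tuple


def rrf_fuse(
    rankings: Iterable[Sequence[str]],
    k: int = 60,  # assumed smoothing constant, not taken from the post
    top_n: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of chunk ids into one list.

    Each list contributes 1 / (k + rank) for every id it contains,
    with rank starting at 1. Ids missing from a list get nothing from it.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by id so the order is stable.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused if top_n is None else fused[:top_n]


if __name__ == "__main__":
    dense = ["a", "b", "c"]   # ids ranked by dense similarity
    sparse = ["b", "d", "a"]  # ids ranked by sparse (token weight) score
    for chunk_id, score in rrf_fuse([dense, sparse]):
        print(chunk_id, round(score, 5))
```

RRF only looks at ranks, so the dense and sparse scores never need to be normalized onto a common scale. The fused list is then what the cross-encoder rescoring would take as its candidate pool.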

Re-ingesting the entire KB took 17h49m → 5,950 sources / 56,474 chunks / 900MB DB.

Memorable V3 bug fixes:

  1. The Starlette wrapper swallowed the MCP lifespan → had to use mcp.streamable_http_app() directly.
  2. transformers 5.7.0 removed XLMRobertaTokenizer.prepare_for_model → downgrade to 4.57.6.
  3. pgvector HNSW + WHERE filter on a JOIN returned 0 rows. Fixed with SET hnsw.iterative_scan = relaxed_order plus SET hnsw.ef_search = 200. The connection needs autocommit=True for the SET to apply at session level.
  4. top_k silent cap: RRF_TOP was hard-coded to 5, so top_k=10 still returned 5 results. Fix: size the rerank pool dynamically with rerank_pool_n = min(25, max(5, top_k * 2)).
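
Fixes 3 and 4 can be sketched together. The two SET statements and the pool formula come from the list above; everything else (the helper names, and the assumption that `conn` is any DB-API style connection already opened with autocommit enabled) is hypothetical.

```python
from typing import Any

# Session-level pgvector settings from fix 3 (iterative scans need pgvector 0.8.0+).
HNSW_SESSION_SQL = (
    "SET hnsw.iterative_scan = relaxed_order",
    "SET hnsw.ef_search = 200",
)


def apply_hnsw_settings(conn: Any) -> None:
    """Run the SETs once per connection.

    Assumes `conn` was opened with autocommit enabled, so the settings are not
    tied to a transaction that might later be rolled back.
    """
    cur = conn.cursor()
    try:
        for statement in HNSW_SESSION_SQL:
            cur.execute(statement)
    finally:
        cur.close()


def rerank_pool_n(top_k: int) -> int:
    """Fix 4: candidate pool is 2x the requested top_k, clamped to [5, 25]."""
    return min(25, max(5, top_k * 2))


if __name__ == "__main__":
    for top_k in (1, 5, 10, 20):
        print(top_k, rerank_pool_n(top_k))
```

Note that the upper clamp still caps results: a request for more than 25 results would get at most 25 back, so a caller wanting a larger top_k would have to raise that bound.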

Measured latency (regression run of 10 queries, 2026-05-06):

| Metric | Value |
| --- | --- |
| Cold call | 1.97s |
| Warm avg | 1.06s (10 queries) |
| Warm p50 (tests 2-10) | ~0.92s |
| Top-1 score median | 0.96 |
| Pass rate | 10/10 |

The quality jump was visible: the query “Conversation about Hermes self-improve agent” (mixed VN/EN) — V2 Voyage couldn’t return a relevant top hit; V3 bge-m3 + rerank returned top-1 score 0.96.

V4 — Multimodal + structured polish (Day 14-15)

With V3 stable, I started filling coverage gaps:

Phase 2 — Office docs: 4 dispatchers (pypdf, python-docx, pandas for xlsx, python-pptx) in extract_office.py. Several PDFs (UOB GIRO bank specs) ingested → 32 chunks. Top-1 on “UOB Bulk FAST GIRO format” → 0.9958.
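
A dispatcher of this kind might look like the sketch below. It is not the contents of extract_office.py: the function names, the suffix table, and the way text is joined are all assumptions. The library calls are the standard entry points of pypdf, python-docx, pandas, and python-pptx, imported lazily so that only the needed package has to be installed.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_pdf(path: Path) -> str:
    from pypdf import PdfReader
    reader = PdfReader(str(path))
    # extract_text() can come back empty for image-only pages.
    return "\n\n".join((page.extract_text() or "") for page in reader.pages)


def extract_docx(path: Path) -> str:
    from docx import Document  # provided by the python-docx package
    return "\n".join(p.text for p in Document(str(path)).paragraphs)


def extract_xlsx(path: Path) -> str:
    import pandas as pd  # reading .xlsx also needs openpyxl installed
    sheets = pd.read_excel(path, sheet_name=None)  # dict: sheet name -> DataFrame
    return "\n\n".join(
        f"## {name}\n{frame.to_csv(index=False)}" for name, frame in sheets.items()
    )


def extract_pptx(path: Path) -> str:
    from pptx import Presentation  # provided by the python-pptx package
    lines = []
    for slide in Presentation(str(path)).slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                lines.append(shape.text_frame.text)
    return "\n".join(lines)


EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
    ".xlsx": extract_xlsx,
    ".pptx": extract_pptx,
}


def pick_extractor(path: Path) -> Callable[[Path], str]:
    """Choose an extractor from the file suffix, case-insensitively."""
    suffix = path.suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"no extractor for {suffix!r}")
    return EXTRACTORS[suffix]
```

Usage would be `text = pick_extractor(path)(path)`, with the resulting text handed to the same chunking path as the markdown files.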

Phase 3 — image-only PDF: 2 sample PDFs had empty text (image-only). Pipeline: pdftoppm -r 300 render → macOS Vision OCR (free, on-device) → markdown sidecar → ingest. 1 + 25 chunks added.
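
The first half of that pipeline (detect an image-only PDF, then render it) can be sketched as below. The helper names are hypothetical, `-png` is an assumed output format on top of the `-r 300` from the text, and the Vision OCR step is left out because it depends on macOS-specific bindings.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def is_image_only(page_texts: Sequence[str]) -> bool:
    """True when no page yielded any extractable text."""
    return all(not (text or "").strip() for text in page_texts)


def pdftoppm_command(pdf: Path, out_prefix: Path, dpi: int = 300) -> List[str]:
    """Build the render command: one PNG per page at the given resolution."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def render_pages(pdf: Path, out_dir: Path, dpi: int = 300) -> List[Path]:
    """Render every page and return the image paths in page order.

    Requires the pdftoppm binary (poppler) on PATH. Output files are named
    <prefix>-<page>.png, so sorting by name gives page order.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / pdf.stem
    subprocess.run(pdftoppm_command(pdf, prefix, dpi), check=True)
    return sorted(out_dir.glob(f"{pdf.stem}-*.png"))
```

Each rendered page would then go through OCR, and the recognized text would be written to a markdown sidecar next to the PDF before ingest.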

Phase 4 — image sidecar 100% coverage: Started at 6,446/6,589 PNGs with sidecars. The remaining 143 PNGs: 41 real PNGs (OCR), 11 misnamed PDFs (render + OCR), 91 misnamed non-image (text/CSV/JSON/zip — fallback handler). VLM cost (Claude Haiku fallback): $0.07 for 37 calls (budget cap $8 — tiny because most files weren’t real images).
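
Sorting misnamed files into those three buckets comes down to checking the first bytes of each file instead of trusting the extension. A minimal sketch, assuming the routing is done on magic numbers (the function names and bucket labels are hypothetical):

```python
import codecs
from pathlib import Path

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
PDF_MAGIC = b"%PDF-"
ZIP_MAGIC = b"PK\x03\x04"


def sniff_kind(head: bytes) -> str:
    """Classify a file from its first bytes: png, pdf, zip, text, or unknown."""
    if head.startswith(PNG_MAGIC):
        return "png"
    if head.startswith(PDF_MAGIC):
        return "pdf"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if not head or b"\x00" in head:
        return "unknown"
    try:
        # Incremental decoding tolerates a multi-byte character cut off at the end.
        codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
    except UnicodeDecodeError:
        return "unknown"
    return "text"  # covers plain text, CSV, and JSON


def sniff_file(path: Path, n: int = 512) -> str:
    """Read the first n bytes of a file and classify them."""
    with path.open("rb") as handle:
        return sniff_kind(handle.read(n))
```

A real PNG would go to OCR, a PDF to render + OCR, and everything else to the fallback handler, which keeps non-images away from the paid VLM call.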

Structured retrieval — kb_my_action_items: Regex parser for - [ ] **[CATEGORY]** Name — Task ... 📅 Added: .... Multi-source: Action_Items.md master + Meetings/meetings/*.md. NAME_ALIASES (resolve aliases to the same person). Filters: name, status, since_days, category. Latency <100ms because no embedding/rerank — just regex + filter.
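
A minimal sketch of such a parser follows. The line format is the one quoted above; the regex itself, the `ActionItem` fields, and the entries in `NAME_ALIASES` are hypothetical. The value after "Added:" is kept as a raw string because its exact format is not shown, which is also why the since_days filter is left out.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional

ITEM_RE = re.compile(
    r"^- \[(?P<mark>[ xX])\] \*\*\[(?P<category>[^\]]+)\]\*\* "
    r"(?P<name>.+?) — (?P<task>.+?)"
    r"(?:\s*📅 Added:\s*(?P<added>.+?))?\s*$"
)

# Hypothetical alias table: lowercase alias -> canonical name.
NAME_ALIASES = {"lan": "Lan Nguyen", "lan nguyen": "Lan Nguyen"}


@dataclass(frozen=True)
class ActionItem:
    category: str
    name: str
    task: str
    done: bool
    added: Optional[str]  # raw text after "Added:", not parsed


def parse_line(line: str) -> Optional[ActionItem]:
    """Parse one checklist line; return None for anything else."""
    match = ITEM_RE.match(line.strip())
    if match is None:
        return None
    raw_name = match["name"].strip()
    return ActionItem(
        category=match["category"].strip(),
        name=NAME_ALIASES.get(raw_name.lower(), raw_name),
        task=match["task"].strip(),
        done=match["mark"].lower() == "x",
        added=match["added"],
    )


def filter_items(
    items: Iterable[ActionItem],
    name: Optional[str] = None,
    status: Optional[str] = None,  # "open" or "done"
    category: Optional[str] = None,
) -> List[ActionItem]:
    """Apply the name, status, and category filters; None means no filter."""
    wanted = NAME_ALIASES.get(name.lower(), name) if name else None
    result = []
    for item in items:
        if wanted and item.name != wanted:
            continue
        if status and (status == "done") != item.done:
            continue
        if category and item.category.lower() != category.lower():
            continue
        result.append(item)
    return result
```

Because this is plain string matching over a few markdown files, it stays far below the latency of a query that needs an embedding and a rerank pass.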

Confluence delta sync: Audit cloud spaces (34) vs locally crawled (33). CQL lastmodified >= last_sync → 83 updated pages. Subtracting self-authored (already ingested via skill) + personal spaces → 55 others-authored pages fetched + ingested through the filesystem-first pipeline. ~30 min end-to-end.
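
The delta query and the subtraction step could be sketched like this. The CQL field `lastmodified` is the one named above; the helper names, the flat page dict with `space_key` and `author_id`, and the rule that personal space keys start with "~" are assumptions made for illustration.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Optional


def delta_cql(last_sync: datetime, space_keys: Optional[Iterable[str]] = None) -> str:
    """Build a CQL query for pages modified at or after the last sync."""
    stamp = last_sync.strftime("%Y-%m-%d %H:%M")
    clauses = ["type = page", f'lastmodified >= "{stamp}"']
    if space_keys:
        quoted = ", ".join(f'"{key}"' for key in sorted(space_keys))
        clauses.append(f"space in ({quoted})")
    return " AND ".join(clauses)


def keep_page(page: Dict[str, str], my_account_id: str) -> bool:
    """Drop self-authored pages and pages in personal spaces.

    `page` is a hypothetical flattened record, not the raw API response.
    """
    if page["author_id"] == my_account_id:
        return False  # already ingested through the authoring path
    if page["space_key"].startswith("~"):
        return False  # assumed convention for personal spaces
    return True


def pages_to_fetch(pages: Iterable[Dict[str, str]], my_account_id: str) -> List[Dict[str, str]]:
    return [page for page in pages if keep_page(page, my_account_id)]
```

The surviving pages are written to the filesystem first and ingested from there, so the DB never holds content that the canonical folder lacks.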

Security — OAuth /login rate limit: Sliding 10-min window, 5 failures/IP → 15-min hard lockout (429 + Retry-After). IP from X-Forwarded-For first hop, fallback request.client.host. In-memory deque + threading lock.
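
A minimal sketch of that limiter, using the numbers above (10-minute window, 5 failures, 15-minute lockout) with an in-memory deque per IP and a lock. The class and function names are hypothetical, and the injectable clock exists only to make the behavior testable.

```python
import math
import threading
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Optional

WINDOW_S = 10 * 60    # sliding window for counting failures
MAX_FAILURES = 5      # failures per IP inside the window
LOCKOUT_S = 15 * 60   # hard lockout once the limit is hit


class LoginRateLimiter:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._lock = threading.Lock()
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)
        self._locked_until: Dict[str, float] = {}

    def retry_after(self, ip: str) -> int:
        """Seconds to send in Retry-After with a 429; 0 means the IP may try."""
        with self._lock:
            until = self._locked_until.get(ip)
            if until is None:
                return 0
            now = self._clock()
            if now >= until:
                del self._locked_until[ip]
                return 0
            return math.ceil(until - now)

    def record_failure(self, ip: str) -> None:
        """Register a failed login and start a lockout at the 5th in the window."""
        with self._lock:
            now = self._clock()
            recent = self._failures[ip]
            recent.append(now)
            while recent and now - recent[0] > WINDOW_S:
                recent.popleft()
            if len(recent) >= MAX_FAILURES:
                self._locked_until[ip] = now + LOCKOUT_S
                recent.clear()


def client_ip(x_forwarded_for: Optional[str], peer_host: str) -> str:
    """First hop of X-Forwarded-For, else the socket peer.

    The header is only meaningful when a trusted proxy in front sets it.
    """
    if x_forwarded_for:
        first = x_forwarded_for.split(",")[0].strip()
        if first:
            return first
    return peer_host
```

The /login handler would call `retry_after` before checking the password, answer 429 with a Retry-After header when it is non-zero, and call `record_failure` on each bad attempt. State lives in process memory, so a daemon restart clears all lockouts.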

Final state:

| Metric | V1 | V4 (current) |
| --- | --- | --- |
| Sources | 5,329 | 6,014+ |
| Chunks | 47K | 57K+ |
| Coverage | .md only | .md + PDF + Office + 100% PNG (OCR + VLM) + structured action items |
| Embedder | e5-small 384d | bge-m3 1024d hybrid + reranker v2-m3 |
| Latency p50 warm | ~700ms (cross-region) | ~0.92s (local + rerank) |
| Quality | 0.85 | 0.95+ on relevant queries |
| Cost/month | ~$50 GCP credit | $0 |

Lessons learned

1. Cross-region latency is the root of perf issues. V1 → V2 showed: replicate close to the user > tuning the embedder/index. But V2 → V3 showed: local beats cloud for personal data, because you don’t need to share infrastructure.

2. Cloud embedder APIs aren’t a free lunch. Voyage was convenient maintenance-wise but the quality trade-off was heavy for mixed-language. Local bge-m3 (2.5GB RAM) is free and noticeably higher quality.

3. Hybrid retrieval (dense + sparse) + cross-encoder rerank is a step-change. Pure dense retrieval caps around 0.85 score. Hybrid + rerank reaches 0.95+. Cost: an extra ~300-500ms latency, still under 1s warm.

4. Single-daemon design matters when you have multiple clients. If each client loads its own model (Claude Desktop + ingest hook + iOS) → 7.5GB RAM. Shared HTTP daemon → 2.5GB. A Mac M2 Max with 64GB doesn’t sweat, but this is a pattern worth learning for smaller hardware.

5. Filesystem-first canonical = sleep-well rule. The DB is a derived index. Sync 1-way: filesystem → DB. Every time I swapped the embedder/schema (3 times across the 4 versions), I rebuilt the index from filesystem without fearing data loss.

6. Coverage > sophistication in the early phase. V1 indexing only .md solved 80% of queries. Office/OCR/VLM/structured action items came in V4 — after the text baseline was stable. Don’t do phase 4 before phase 1.

7. I’m building this for myself. My “M3 daily-driver criterion” is: use it naturally every day for 2 weeks straight without pivoting the stack. Currently at 1-2 weeks. When I hit that mark, M3 is done.

Stack summary (V4)

MacBook Pro M2 Max (12 cores, 64GB RAM)
├── HTTP daemon :8080
│   ├── bge-m3 1024d hybrid (dense + sparse + ColBERT)
│   ├── bge-reranker-v2-m3 cross-encoder
│   ├── RRF fusion + dynamic rerank pool
│   └── Postgres 16 + pgvector (HNSW iterative_scan)
│
├── Cloudflare tunnel  ──→  mcp.vuihoc.ai
│   └── (mobile/web Claude clients)
│
└── Filesystem canonical  ~/Documents/KB/
    ├── kb-ingest-file.py hook (auto-tag from folder)
    ├── extract_office.py (PDF/docx/xlsx/pptx)
    ├── extract_image_ocr.py (macOS Vision)
    └── extract_image_vlm.py (Haiku fallback ~$0.0005/image)

Next

  • TOTP 2FA for OAuth (a single password isn’t enough for a serious threat model).
  • Reranker eval: bench bge-reranker-v2-m3 vs Cohere rerank-3 on the real KB.
  • Foundation for 6 other portfolio projects (recipe-extractor, research-agent, support-bot, meeting-ai, finance-advisor, health-coach) — sharing the same RAG daemon, each project just adds domain tools.

If you’re also building a personal RAG, my strong advice: start local. Cloud is only needed when you share data with someone else.