TL;DR — I built a personal RAG (~6,000 documents, ~57K chunks) across 4 versions in 9 days. The final version runs locally on a MacBook Pro M2 Max: warm latency p50 ~0.92s, top-1 score 0.95+ on typical queries, $0/month, hybrid retrieval (dense + sparse + ColBERT) + cross-encoder rerank. This post logs each version, what I measured, what I traded off, and what I learned.
Context
I have many scattered knowledge sources: Confluence (~3.6K pages), Jira (~534 tickets), Slack threads, Krisp/Granola meeting transcripts, Gmail, Claude Code conversation logs (~270 sessions/month), markdown notes in ~/Documents/. The problem: 90% of the knowledge I’d absorbed couldn’t be retrieved when needed. Spotlight only matches exact strings. Per-source MCP is too slow. Each Claude conversation burned 40-80K tokens on grep + Read.
The initial goal was simple: ask “who said what about X?” via Claude → get the top 5 relevant snippets + source links in <3 seconds.
I let that goal drive the 4 pivots below.
V1 — Cloud ADB SIN (Day 1-8)
Stack: GCP VM e2-standard-4 SIN + Oracle Autonomous DB 23ai (vector type built-in, Always Free) + FastMCP Streamable HTTP server + Cloudflare named tunnel mcp.vuihoc.ai + OAuth 2.0 (PKCE + DCR + password) + legacy bearer fallback.
Embedder: initially BGE-small-en-v1.5 384d — by Day 3b I had to pivot to multilingual-e5-small because Vietnamese queries only scored 0.69-0.77. After the swap: 0.85+.
M1 measurements:
| Metric | Target | Achieved |
|---|---|---|
| Hit@5 | ≥60% | ~85% (subjective on 5 real queries) |
| Latency p95 | <5s | 1.16s |
| Sources | 100 | 5,329 / 47K chunks |
| Touchpoint | Telegram | MCP — Claude Desktop + web + iOS |
The biggest pivot in V1: drop the Telegram bot, switch to MCP. Reason: Claude clients (Desktop/web/iOS) call tools via MCP themselves, no need to build a custom bot UI. Same UX, less code, free mobile access.
Strengths: shipped fast (~22h), free tier (ADB 20GB + Cloudflare + GCP credit), public access from any Claude client.
Weaknesses: my Mac is in Vietnam, ADB is in Singapore → 200ms cross-region tax per query. BGE-en is weak by default for Vietnamese. Retrieval was pure dense with no rerank: semantic matching was OK, but it missed edge cases.
V2 — pgvector replica (Day 10)
V1’s problem: 200ms cross-region was hard to accept. Instead of tuning the embedder, I attacked latency.
Migrated ADB SIN → self-hosted pgvector on GCP e2-micro US (Always Free) → then replicated to asia-southeast1 (paid via credit). Same tunnel-id, no DNS change.
I also swapped the embedder to Voyage 512d API because at the time I thought “API reduces VM RAM, less model maintenance.”
Latency result: warm p50 dropped from 700-850ms to 148-229ms (roughly 72% lower).
Quality problem: Voyage 512d was weaker than e5-small for mixed VN/EN queries. Top-1 score regressed: ~0.85 (e5) → ~0.67 (Voyage). Several VN queries returned irrelevant results.
V2 lesson: optimizing latency by sacrificing quality is foolish. The user (me) accepts 700ms for a 0.85 score, but not 150ms for a 0.67 score. Speed only matters after the answer is right.
V2 worked for a few days, then I decided to make a bigger pivot: drop the cloud entirely.
V3 — Local Mac M2 Max (Day 11-13)
Premise: A personal RAG doesn’t need the cloud. The Mac is always on, the data is personal, why round-trip the Internet?
Hardware in use: MacBook Pro M2 Max, 12 cores (8P+4E), 64GB RAM. Plenty of headroom for a heavy stack.
New stack:
- Embedder: bge-m3 1024d — special because it produces 3 representations in one forward pass: dense vector, sparse vector (token weights), and a ColBERT-style multi-vector. Top quality on multilingual MIRACL ~76%.
- Reranker: bge-reranker-v2-m3 cross-encoder — rescore the top-N candidates after retrieval.
- Fusion: RRF (Reciprocal Rank Fusion) merges dense + sparse rankings.
- DB: Postgres 16 + pgvector (built from source because the brew formula only ships for v17/v18).
- Server: HTTP daemon on port 8080 instead of stdio MCP. The key reason: 3 clients (Claude Desktop + KB ingest hook + future iOS) share the same daemon → 1× model RAM instead of 3× (2.5GB vs 7.5GB).
- Public access: Cloudflare tunnel reusing the same tunnel-id → mcp.vuihoc.ai points to the local Mac. Mobile/web still get in.
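The fusion step in the stack above can be illustrated with a minimal sketch of Reciprocal Rank Fusion. The constant `k=60` is the value commonly used with RRF, not something this post specifies, and the chunk ids and the `rrf_fuse` name are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=5):
    """Fuse several best-first rankings of chunk ids with RRF.

    An id scores sum(1 / (k + rank)) over the rankings it appears in,
    with rank starting at 1. k=60 is an assumed, conventional constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]


dense = ["c12", "c7", "c3", "c9"]    # hypothetical dense ranking
sparse = ["c7", "c3", "c12", "c40"]  # hypothetical sparse ranking
fused = rrf_fuse([dense, sparse], top_n=3)
```

RRF only looks at ranks, so the dense and sparse scores never need to be put on a common scale. The fused list is what the cross-encoder would then rescore.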
Re-ingesting the entire KB took 17h49m → 5,950 sources / 56,474 chunks / 900MB DB.
Memorable V3 bug fixes:
- The Starlette wrapper swallowed the MCP lifespan → had to use `mcp.streamable_http_app()` directly.
- `transformers 5.7.0` removed `XLMRobertaTokenizer.prepare_for_model` → downgrade to 4.57.6.
- pgvector HNSW + WHERE filter on a JOIN returned 0 rows; fixed with `SET hnsw.iterative_scan = relaxed_order` + `ef_search = 200`. Need `autocommit=True` to SET at session level.
- top_k silent cap: `RRF_TOP=5` was fixed → `top_k=10` still returned 5. Fix: dynamic `rerank_pool_n = min(25, max(5, top_k * 2))`.
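The dynamic rerank pool from the top_k fix can be sketched as below. The formula is the one quoted above. The `rerank` wrapper and its `score_fn` parameter are hypothetical stand-ins for the real cross-encoder call.

```python
def rerank_pool_size(top_k: int, floor: int = 5, ceiling: int = 25) -> int:
    """Number of fused candidates to hand to the reranker.

    Mirrors rerank_pool_n = min(25, max(5, top_k * 2)).
    """
    return min(ceiling, max(floor, top_k * 2))


def rerank(candidates, score_fn, top_k):
    """Rescore a pool sized from top_k, then keep the best top_k.

    candidates is best-first output of the fusion step; score_fn is a
    placeholder for the cross-encoder (higher means more relevant).
    """
    pool = candidates[:rerank_pool_size(top_k)]
    ranked = sorted(pool, key=score_fn, reverse=True)
    return ranked[:top_k]
```

With this sizing, `top_k=10` reranks 20 candidates instead of 5. The ceiling of 25 still bounds the result, so a request for more than 25 results would be capped at 25.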
Measured latency (regression run of 10 queries, 2026-05-06):
| Metric | Value |
|---|---|
| Cold call | 1.97s |
| Warm avg | 1.06s (10 queries) |
| Warm p50 (queries 2-10) | ~0.92s |
| Top-1 score median | 0.96 |
| Pass rate | 10/10 |
The quality jump was visible: the query “Conversation about Hermes self-improve agent” (mixed VN/EN) — V2 Voyage couldn’t return a relevant top hit; V3 bge-m3 + rerank returned top-1 score 0.96.
V4 — Multimodal + structured polish (Day 14-15)
With V3 stable, I started filling coverage gaps:
Phase 2 — Office docs: 4 dispatchers (pypdf, python-docx, pandas for xlsx, python-pptx) in extract_office.py. Several PDFs (UOB GIRO bank specs) ingested → 32 chunks. Top-1 on “UOB Bulk FAST GIRO format” → 0.9958.
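A dispatcher like the one described for extract_office.py might look roughly like this. It is a sketch, not the actual file: the function names and the `DISPATCH` table are assumptions, and reading .xlsx through pandas also assumes openpyxl is installed.

```python
from pathlib import Path


def extract_pdf(path):
    from pypdf import PdfReader  # lazy import: only needed for PDFs
    reader = PdfReader(str(path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_docx(path):
    import docx  # provided by the python-docx package
    return "\n".join(p.text for p in docx.Document(str(path)).paragraphs)


def extract_xlsx(path):
    import pandas as pd
    # sheet_name=None loads every sheet as {sheet name: DataFrame}.
    sheets = pd.read_excel(path, sheet_name=None)
    return "\n\n".join(
        f"## {name}\n{df.to_csv(index=False)}" for name, df in sheets.items()
    )


def extract_pptx(path):
    from pptx import Presentation  # provided by the python-pptx package
    lines = []
    for slide in Presentation(str(path)).slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                lines.append(shape.text_frame.text)
    return "\n".join(lines)


# Hypothetical extension table: one extractor per office format.
DISPATCH = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
    ".xlsx": extract_xlsx,
    ".pptx": extract_pptx,
}


def pick_extractor(path):
    """Return the extractor for a path, matching the suffix case-insensitively."""
    suffix = Path(path).suffix.lower()
    try:
        return DISPATCH[suffix]
    except KeyError:
        raise ValueError(f"no extractor for {suffix!r}") from None
```

Usage would be `text = pick_extractor(path)(path)`. A PDF that comes back with empty text is the signal to fall through to the OCR route.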
Phase 3 — image-only PDF: 2 sample PDFs had empty text (image-only). Pipeline: pdftoppm -r 300 render → macOS Vision OCR (free, on-device) → markdown sidecar → ingest. 1 + 25 chunks added.
Phase 4 — image sidecar 100% coverage: Started at 6,446/6,589 PNGs with sidecars. The remaining 143 PNGs: 41 real PNGs (OCR), 11 misnamed PDFs (render + OCR), 91 misnamed non-image (text/CSV/JSON/zip — fallback handler). VLM cost (Claude Haiku fallback): $0.07 for 37 calls (budget cap $8 — tiny because most files weren’t real images).
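Sorting misnamed .png files into those three buckets comes down to checking the first bytes of each file rather than trusting the extension. The sketch below assumes magic-byte sniffing was the mechanism (the post does not say), and the route labels are hypothetical.

```python
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # the 8-byte PNG file signature


def sniff_kind(head: bytes) -> str:
    """Classify a file from its leading bytes, ignoring its extension."""
    if head.startswith(PNG_MAGIC):
        return "png"
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):
        return "zip"
    return "other"


# Hypothetical routing: which handler each detected kind goes to.
ROUTES = {
    "png": "ocr",
    "pdf": "render+ocr",
    "zip": "fallback",
    "other": "fallback",
}


def route_file(path) -> str:
    """Read just enough of the file to pick a handler."""
    with open(path, "rb") as fh:
        return ROUTES[sniff_kind(fh.read(8))]
```

Text, CSV, and JSON have no signature, so they land in "other" and go to the fallback handler along with zip archives.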
Structured retrieval — kb_my_action_items: Regex parser for - [ ] **[CATEGORY]** Name — Task ... 📅 Added: .... Multi-source: Action_Items.md master + Meetings/meetings/*.md. NAME_ALIASES (resolve aliases to the same person). Filters: name, status, since_days, category. Latency <100ms because no embedding/rerank — just regex + filter.
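A parser for that line format could be sketched as follows. Several details are assumptions rather than facts from the post: `[x]` marks a done item, the date after "Added:" is ISO `YYYY-MM-DD`, and the alias entries are invented. The separator between name and task is written as the escape `\u2014`.

```python
import re
from datetime import date, timedelta

ITEM_RE = re.compile(
    r"^- \[(?P<mark>[ xX])\] \*\*\[(?P<category>[^\]]+)\]\*\* "
    r"(?P<name>.+?) \u2014 (?P<task>.+?)\s*📅 Added: (?P<added>\d{4}-\d{2}-\d{2})"
)

# Hypothetical alias table: every spelling resolves to one canonical name.
NAME_ALIASES = {"ann": "Ann Lee", "ann lee": "Ann Lee"}


def parse_item(line):
    """Parse one checklist line into a dict, or return None if it doesn't match."""
    m = ITEM_RE.match(line)
    if m is None:
        return None
    raw_name = m["name"].strip()
    return {
        "status": "open" if m["mark"] == " " else "done",
        "category": m["category"],
        "name": NAME_ALIASES.get(raw_name.lower(), raw_name),
        "task": m["task"].strip(),
        "added": date.fromisoformat(m["added"]),
    }


def filter_items(items, name=None, status=None, category=None,
                 since_days=None, today=None):
    """Apply the name / status / category / since_days filters in memory."""
    today = today or date.today()
    cutoff = None if since_days is None else today - timedelta(days=since_days)
    return [
        it for it in items
        if (name is None or it["name"] == name)
        and (status is None or it["status"] == status)
        and (category is None or it["category"] == category)
        and (cutoff is None or it["added"] >= cutoff)
    ]
```

Everything here is string matching over a few markdown files, which is consistent with the sub-100ms latency: no model is loaded and no index is queried.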
Confluence delta sync: Audit cloud spaces (34) vs locally crawled (33). CQL lastmodified >= last_sync → 83 updated pages. Subtracting self-authored (already ingested via skill) + personal spaces → 55 others-authored pages fetched + ingested through the filesystem-first pipeline. ~30 min end-to-end.
Security — OAuth /login rate limit: Sliding 10-min window, 5 failures/IP → 15-min hard lockout (429 + Retry-After). IP from X-Forwarded-For first hop, fallback request.client.host. In-memory deque + threading lock.
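The lockout logic described above can be sketched with a deque per IP behind a lock. This is an illustration under assumptions: the class and function names are hypothetical, the fifth failure inside the window is taken as the trigger, and the clock is injectable only so the behavior can be tested without waiting.

```python
import math
import threading
import time
from collections import defaultdict, deque

WINDOW_S = 10 * 60    # sliding window for counting failures
MAX_FAILURES = 5      # failures per IP inside the window
LOCKOUT_S = 15 * 60   # hard lockout once the limit is hit


class LoginRateLimiter:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lock = threading.Lock()
        self._failures = defaultdict(deque)  # ip -> recent failure timestamps
        self._locked_until = {}              # ip -> time the lockout ends

    def retry_after(self, ip):
        """Seconds to wait (for a 429 Retry-After header), or 0 if allowed."""
        with self._lock:
            now = self._clock()
            until = self._locked_until.get(ip, 0.0)
            if until > now:
                return math.ceil(until - now)
            self._locked_until.pop(ip, None)
            return 0

    def record_failure(self, ip):
        with self._lock:
            now = self._clock()
            q = self._failures[ip]
            q.append(now)
            # Drop failures that have slid out of the window.
            while q and q[0] <= now - WINDOW_S:
                q.popleft()
            if len(q) >= MAX_FAILURES:
                self._locked_until[ip] = now + LOCKOUT_S
                q.clear()

    def record_success(self, ip):
        with self._lock:
            self._failures.pop(ip, None)


def client_ip(forwarded_for, peer_host):
    """First hop of X-Forwarded-For, falling back to the socket peer."""
    if forwarded_for:
        first = forwarded_for.split(",")[0].strip()
        if first:
            return first
    return peer_host
```

In the /login handler, a nonzero `retry_after(ip)` would short-circuit into a 429 response before the password is even checked.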
Final state:
| Metric | V1 | V4 (current) |
|---|---|---|
| Sources | 5,329 | 6,014+ |
| Chunks | 47K | 57K+ |
| Coverage | .md only | .md + PDF + Office + 100% PNG (OCR + VLM) + structured action items |
| Embedder | e5-small 384d | bge-m3 1024d hybrid + reranker v2-m3 |
| Latency p50 warm | ~700ms (cross-region) | ~0.92s (local + rerank) |
| Quality | 0.85 | 0.95+ on relevant queries |
| Cost/month | ~$50 GCP credit | $0 |
Lessons learned
1. Cross-region latency is the root of perf issues. V1 → V2 showed: replicate close to the user > tuning the embedder/index. But V2 → V3 showed: local beats cloud for personal data, because you don’t need to share infrastructure.
2. Cloud embedder APIs aren’t a free lunch. Voyage was convenient maintenance-wise but the quality trade-off was heavy for mixed-language. Local bge-m3 (2.5GB RAM) is free and noticeably higher quality.
3. Hybrid retrieval (dense + sparse) + cross-encoder rerank is a step-change. Pure dense retrieval caps around 0.85 score. Hybrid + rerank reaches 0.95+. Cost: an extra ~300-500ms latency, still under 1s warm.
4. Single-daemon design matters when you have multiple clients. If each client loads its own model (Claude Desktop + ingest hook + iOS) → 7.5GB RAM. Shared HTTP daemon → 2.5GB. A Mac M2 Max with 64GB doesn’t sweat, but this is a pattern worth learning for smaller hardware.
5. Filesystem-first canonical = sleep-well rule. The DB is a derived index. Sync 1-way: filesystem → DB. Every time I swapped the embedder/schema (3 times across the 4 versions), I rebuilt the index from filesystem without fearing data loss.
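One way to keep the sync strictly one-way is to diff the filesystem against a snapshot of what the index holds and derive the work from that. The sketch below assumes content hashes are used for change detection, which the post does not state. `plan_sync` and the `{relative_path: digest}` snapshot shape are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 of the file contents, used to detect changed files."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def plan_sync(root, indexed):
    """Compare files under root with an index snapshot.

    indexed maps relative POSIX paths to the digest stored at ingest time.
    Returns (to_upsert, to_delete) as sorted lists of relative paths.
    The filesystem is never modified: it is the canonical side.
    """
    on_disk = {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in Path(root).rglob("*")
        if p.is_file()
    }
    to_upsert = sorted(
        path for path, digest in on_disk.items() if indexed.get(path) != digest
    )
    to_delete = sorted(path for path in indexed if path not in on_disk)
    return to_upsert, to_delete
```

A full rebuild after an embedder or schema swap is the degenerate case: pass an empty snapshot and every file comes back in `to_upsert`.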
6. Coverage > sophistication in the early phase. V1 indexing only .md solved 80% of queries. Office/OCR/VLM/structured action items came in V4 — after the text baseline was stable. Don’t do phase 4 before phase 1.
7. I’m building this for myself. My “M3 daily-driver criterion” is: use it naturally every day for 2 weeks straight without pivoting the stack. Currently at 1-2 weeks. When I hit that mark, M3 is done.
Stack summary (V4)
MacBook Pro M2 Max (12 cores, 64GB RAM)
├── HTTP daemon :8080 ──┐
│ ├── bge-m3 1024d hybrid (dense + sparse + ColBERT)
│ ├── bge-reranker-v2-m3 cross-encoder
│ ├── RRF fusion + dynamic rerank pool
│ └── Postgres 16 + pgvector (HNSW iterative_scan)
│
├── Cloudflare tunnel ──→ mcp.vuihoc.ai
│ └── (mobile/web Claude clients)
│
└── Filesystem canonical ~/Documents/KB/
├── kb-ingest-file.py hook (auto-tag from folder)
├── extract_office.py (PDF/docx/xlsx/pptx)
├── extract_image_ocr.py (macOS Vision)
└── extract_image_vlm.py (Haiku fallback ~$0.0005/image)
Next
- TOTP 2FA for OAuth (a single password isn’t enough for a serious threat model).
- Reranker eval: bench bge-reranker-v2-m3 vs Cohere rerank-3 on the real KB.
- Foundation for 6 other portfolio projects (recipe-extractor, research-agent, support-bot, meeting-ai, finance-advisor, health-coach) — sharing the same RAG daemon, each project just adds domain tools.
If you’re also building a personal RAG, my strong advice: start local. Cloud is only needed when you share data with someone else.