How I built local-first memory for Claude Code, Cursor, and Codex - 94.5% LoCoMo recall@10, 70ms p50

TL;DR. PMB (Personal Memory Brain) is an open-source MCP server that gives AI coding agents persistent, long-term memory. Everything runs on your machine - SQLite + LanceDB, zero cloud, zero API keys. On the LoCoMo benchmark it hits 94.5% recall@10 at 70ms p50, matching or beating cloud-based memory services. This post covers the techniques that got it there - predicate-aware reranking, multilingual verb expansion, no-LLM atomic fact extraction, RRF query splitting, and a durable async embed queue - with the actual code from the repo.

👉 GitHub: https://github.com/oleksiijko/pmb · pip install pmb-ai

The problem

Next morning, the agent knows nothing. The "memory" features in mainstream agents are context-window summarization - fragile, lossy, gone after compaction.

The third-party fix is a cloud memory layer: mem0, Letta, Zep. They work, but they require you to mail every personal decision, every chunk of your code, and every "ugh, why did we choose Postgres" question into someone else's data center, behind their API key, on their monthly bill.

For private work this is a non-starter. I want my agent to remember my project for months. I don't want my git history training someone else's model.

So I built PMB.

Numbers first

Three benchmarks. All reproducible from the repo.

1. LoCoMo recall@10

LoCoMo evaluates long-conversation memory across multi-session dialogues - the canonical benchmark for this kind of system.

| System | LoCoMo recall@10 | Local? | API keys? |
|----|----|----|----|
| PMB (this post) | 94.5% | ✅ | ❌ none |
| mem0 (published) | ~67-72% | ❌ cloud | ✅ required |
| Zep (published) | ~85-90% | ❌ cloud | ✅ required |
| Naive vector search | ~60% | ✅ | ❌ |

Reproduce: python scripts/benchmarks/benchmark_locomo.py --n-conversations 10

2. Latency

| Percentile | Cold (first call) | Warm |
|----|----|----|
| p50 | 1500ms | 70ms |
| p95 | 2200ms | 240ms |
| p99 | 3100ms | 470ms |

The cold cost is a one-time embedding-model load. Everything after is hot.

3. Mega stress (900 multilingual queries)

LoCoMo wasn't enough. I wrote a 30-base × 30-paraphrase generator covering English, Spanish, German, Russian, and Ukrainian, including cross-lingual queries (English question hits Russian-stored fact), compound questions, name disambiguation, and self-reference rescue.

| Metric | Score |
|----|----|
| top-1 accuracy | 73.3% |
| top-3 accuracy | 87.3% |
| top-10 accuracy | 99.2% |
| p50 latency | 70ms |

99.2% top-10 is what matters in practice: the answer is in the agent's candidate set on essentially every query.

How the recall pipeline works

Most "AI memory" systems are a thin wrapper over cosine_similarity(query_embedding, all_chunks). That approach hits a ceiling at ~60% recall.

PMB stacks five layers. Each one buys 3-8 points of recall on adversarial cases.

Technique 1 - PAMVR: predicate-aware reranking (+5pp recall)

The killer insight: embedding similarity doesn't understand verbs, tense, or negation. For the query "where do I live now", the candidates "I lived in Berlin (2019)" and "I live in Lisbon (currently)" get equally high cosine similarity - but only one is right.

So PAMVR runs a final pass over the top ~50 candidates and applies 14 hand-tuned rules. Each rule looks at query features (intent, verbs, named entities, time markers) and candidate features (predicates, recency, entity overlap) and multiplies the score up or down.

The rule list (from `src/pmb/reasoning/pamvr.py`):

     1. Entity strict       - if query names X, content must mention X
     2. Verb match          - query main verb must appear (or via synonym)
     3. Verb+topic combo    - both signals agree -> big boost
     4. Keyword AND         - high token overlap = direct match
     5. Vocab bridge        - domain synonyms (typing↔mypy, database↔Postgres)
     6. Prefix kind         - "what's the fix" + content starts with "Fix:"
     7. Policy intent       - "what's the X policy" + decision-shaped fact
     8. Topic constraint    - X-policy requires X token in content
     9. Time duration       - "lifetime/duration" + content has digits+unit
    10. Now/current         - query "now" + content has temporal qualifier
    11. Quantitative        - "how many/long" + content has digits
    12. Entity count        - "who is on the team" + content has N persons
    13. Use-verb expansion  - "did we use" matches "deploy/host/run"
    14. Topic intersection  - penalty when zero shared tokens

The single most useful structure: vocabulary bridges. These map query terms to content synonyms a vector model would otherwise miss.

    VOCAB_BRIDGES: dict[str, list[str]] = {
        "typing":   ["mypy", "type hints", "types", "static type"],
        "database": ["postgres", "mysql", "mongodb", "cloud sql", "rdbms"],
        "policy":   ["enforce", "must have", "going forward", "ratified",
                     "rule", "convention", "guideline"],
        "lifetime": ["valid", "minutes", "hours", "days", "ttl"],
        "deploy":   ["host", "hosted", "running", "production", "fargate",
                     "ecs", "cloud run"],
        "plan":     ["roadmap", "okr", "will", "going to", "scheduled"],
    }

And the contract: PAMVR is a pure (query, event, score) → float function. Compose it anywhere in the scoring pipeline.

    from pmb.reasoning.pamvr import apply_pamvr

    new_score = apply_pamvr(query, event, current_score)

I evaluated PAMVR by ablating rules one at a time on a 30-query qualitative benchmark. Top-1 accuracy without PAMVR: 60%. With all 14 rules: 93.3%. No LoCoMo regression. No LLM. Total cost: ~0.5ms per candidate.

Technique 2 - Verb synonym expansion (+8pp recall on intent-mismatched queries)

A specific failure mode that kept showing up: the query "where do I live" returned zero hits against a stored fact like "I'm based in Lisbon" or "I moved to Lisbon last year."
The embedder understood the meaning fine. The reranker's verb-match rule was killing the candidate score because the surface verbs didn't match.

The fix is one dictionary. Map each canonical verb to its near-synonyms - including paraphrases, past forms, and common substitutes:

    VERB_SYNS: dict[str, set[str]] = {
        "live":   {"live", "lives", "lived", "reside", "based", "moved", "relocated", "settled"},
        "work":   {"work", "works", "worked", "working", "job", "employed", "employed at", "role at"},
        "own":    {"own", "owns", "owned", "have", "control", "manage"},
        "lead":   {"lead", "leading", "leads", "led", "head", "heads", "manage", "managing"},
        "decide": {"decide", "decided", "agreed", "accepted", "concluded", "ratified", "chose", "picked"},
        "deploy": {"deploy", "deployed", "host", "hosted", "running", "production"},
        "fix":    {"fix", "fixed", "patch", "patched", "hotfix", "resolved", "solved"},
        "name":   {"name", "called", "known as"},
    }

After this single dictionary update, top-1 recall on intent-mismatched English queries jumped from 64% to 72%.

The proper-noun extractor handles arbitrary names that aren't in any dictionary - Stripe, Lisbon, Alice, Postgres. No NER library, no model:

    import re

    # The named group "n" captures one capitalised token of 3+ characters.
    _PROPER_NOUN_RE = re.compile(r"\b(?P<n>[A-Z][a-z']{2,})\b")

    # Capitalised words that AREN'T proper nouns - sentence-initial
    # function words and common modal/wh-words.
    _NOT_PROPER = {
        "when", "where", "what", "who", "how", "why", "which",
        "today", "yesterday", "tomorrow", "now", "this", "that",
        "the", "and", "but", "for", "with", "from", "into",
    }

    def _extract_proper_nouns(query: str) -> set[str]:
        """Pull capitalised, 3+ char tokens that look like proper nouns."""
        out = set()
        for m in _PROPER_NOUN_RE.finditer(query):
            tok = m.group("n").lower()
            # _STOP is a stop-word set defined elsewhere; it is not shown in this excerpt.
            if tok in _NOT_PROPER or tok in _STOP:
                continue
            out.add(tok)
        return out

A query like "what does Alice think about Postgres" extracts {alice, postgres} and uses them as required entities in the reranker. Any candidate that doesn't mention both gets a heavy penalty. Cost: ~0.05ms.
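
To make the interaction between the verb-synonym table and the required-entity check concrete, here is a minimal sketch of how the two signals might combine into one score multiplier. It is not PMB's code: the function name `rerank_multiplier`, the boost and penalty values (2.0, 0.5, 0.25), the ASCII-only tokenizer, and the cut-down tables are all assumptions made for illustration. The real rules live in `src/pmb/reasoning/pamvr.py`.

```python
import re

# Cut-down synonym table for illustration; the table in the post is larger.
VERB_SYNS: dict[str, set[str]] = {
    "live": {"live", "lives", "lived", "reside", "based", "moved", "relocated", "settled"},
    "deploy": {"deploy", "deployed", "host", "hosted", "running", "production"},
}

_WORD_RE = re.compile(r"[a-z]+")  # ASCII-only tokenizer, kept simple on purpose
_PROPER_NOUN_RE = re.compile(r"\b(?P<n>[A-Z][a-z']{2,})\b")
_NOT_PROPER = {"when", "where", "what", "who", "how", "why", "which"}

# Hypothetical weights, chosen only so the example is easy to follow.
VERB_BOOST = 2.0
VERB_PENALTY = 0.5
ENTITY_PENALTY = 0.25


def _tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(_WORD_RE.findall(text.lower()))


def _required_entities(query: str) -> set[str]:
    """Capitalised 3+ char tokens in the query, minus wh-words."""
    found = {m.group("n").lower() for m in _PROPER_NOUN_RE.finditer(query)}
    return found - _NOT_PROPER


def rerank_multiplier(query: str, content: str) -> float:
    """Return a factor above 1 to boost a candidate, below 1 to penalise it."""
    q_tokens, c_tokens = _tokens(query), _tokens(content)
    factor = 1.0

    # Verb rule: when the query uses a verb from a synonym group, any
    # member of that group in the content counts as a match.
    for synonyms in VERB_SYNS.values():
        if q_tokens & synonyms:
            factor *= VERB_BOOST if c_tokens & synonyms else VERB_PENALTY

    # Entity rule: every proper noun named in the query must appear in
    # the content, otherwise the candidate is penalised.
    if _required_entities(query) - c_tokens:
        factor *= ENTITY_PENALTY

    return factor
```

With these made-up weights, `rerank_multiplier("where do I live", "I'm based in Lisbon")` returns 2.0 because "based" sits in the same synonym group as "live", while `rerank_multiplier("where does Alice live", "Bob moved to Berlin")` returns 0.5: the verb matches, but the required entity "alice" is missing. Keeping each rule as a multiplier is what lets a pure (query, event, score) → float function compose with whatever score came before it.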
Bonus - multilingual. The same VERB_SYNS mapping has Russian / Ukrainian / Spanish / German stems in the actual repo, and the regex extends to Cyrillic and Greek characters. Cross-lingual recall (English question → Russian-stored fact) went from 0% to 67% with no per-language code paths in the hot loop. Source: `src/pmb/reasoning/pamvr.py`.

Lesson: when embeddings underperform on paraphrased queries, don't blame the embedder. Check whether your downstream reranker is aware of verb paraphrases. Usually it isn't.

Technique 3 - Atomic fact extraction without LLM (+15pp recall on multi-fact messages)

Suppose a user writes:

"Today I met Alice at the coffee shop. She's the new tech lead at Stripe and lives in Berlin. We discussed onboarding for the Q3 hire."

If that is stored as one chunk, asking "where does Alice live?" gets a mediocre score. The chunk's vector is effectively a blend of all three facts, so the signal for "Alice lives in Berlin" is diluted.

mem0's original paper solves this by sending each message to an LLM that returns atomic facts. That's a great idea, but it requires an API key and adds 200-500ms per write.

PMB's version is regex-driven. From `src/pmb/reasoning/fact_extract.py`:

    import re
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class AtomicFact:
        content: str       # the atomic fact text
        kind: str          # which pattern fired ('location', 'role', ...)
        confidence: float  # 0.0-1.0

    # Sentence boundary: split on . ! ? followed by a capital letter.
    _SENT_SPLIT = re.compile(r"(?