LongMemEval Benchmark

92.6% on 500 questions.
Competitive with frontier-LLM systems — using only local models.

Graphonomous v0.3.3 evaluated against LongMemEval (ICLR 2025) — the standard benchmark for long-term conversational memory. 500 questions across 940 sessions, 10,866 turns. No cloud LLM calls: a 500M-param local embedding model + cross-encoder reranker + κ-topology retrieval. We publish the full results, gaps, and roadmap.

v0.3.3 — Oracle Split, 500 Questions

Evaluated 2026-04-05. Embedder: nomic-embed-text-v2-moe (768D, 500M params). Retrieval-only evaluation with QA proxy scoring (session hit + keyword recall + session recall + evidence recall + NDCG).

92.6% QA Proxy (overall accuracy)
98.1% Session Hit Rate (found the answer session)
1.5s Mean Latency (per query)
96.7% Abstention (learned threshold)

Where Graphonomous stands

LongMemEval has become the de facto benchmark for agent memory. Here's how published systems compare. Note that systems use different LLM backends and evaluation methodologies—direct comparison requires care.

agentmemory (Opus 4.6): 96.2%
OMEGA (GPT-4.1): 95.4%
Mastra OM (GPT-5-mini): 94.9%
Hindsight v0.4.19: 94.6%
Graphonomous v0.3.3 (local 500M): 92.6%
Emergence AI (RAG): 86.0%
Supermemory (Gemini-3): 85.2%
Mastra OM (GPT-4o): 84.2%
Zep / Graphiti: 71.2%
Letta / MemGPT: 65.0%
GPT-4 128K (full ctx): 63.5%
Methodology note: Graphonomous is a retrieval-only system — it does not generate answers. Results use a self-contained QA proxy score (session hit + keyword recall + session recall + evidence recall + NDCG). Most competitors use GPT-4o or GPT-5 as a judge on generated answers. Top systems also use frontier LLMs (GPT-5-mini, Gemini-3-Pro) for both memory operations and answer generation, while Graphonomous runs entirely on local models (nomic embeddings + ms-marco cross-encoder). An LLM judge evaluation is planned for a future release.

Five abilities, five different stories

LongMemEval tests five core long-term memory abilities. Graphonomous now delivers strong performance across all five, with knowledge updates, multi-session reasoning, and abstention all above 95%. Temporal reasoning is the remaining gap and the current optimization focus.

Ability Questions QA Proxy Session Hit Session Recall Status
Knowledge Update 72 97.8% 100.0% 98.6% Strong
Abstention 30 96.7% 86.7% 70.3% Strong
Information Extraction 150 95.6% 98.7% 98.7% Strong
Multi-Session Reasoning 121 89.7% 100.0% 95.3% Strong
Temporal Reasoning 127 87.8% 94.5% 89.0% Gap

Question types, granular view

Question Type Count QA Proxy Session Hit
Single-Session Assistant 56 99.5% 100.0%
Knowledge Update 78 97.9% 98.7%
Single-Session User 70 97.5% 97.1%
Multi-Session 133 90.6% 100.0%
Temporal Reasoning 133 87.6% 94.0%
Single-Session Preference 30 84.8% 93.3%
Temporal reasoning is now the weakest category at 87.6%. Relative-date parsing ("two weeks ago", "last Saturday") landed in v0.3.3 and closed 1.7pp of the gap, but queries asking about session chronology (first, last, before/after) still depend on session_rank heuristics that miss some fine-grained ordering.

Preferences jumped from 67.7% → 84.8% across v0.3.2 and v0.3.3 via adaptive candidate expansion; the remaining gap is vocabulary mismatch on narrative answers (e.g., "recommend a show" → session about "stand-up comedy, John Mulaney, Netflix").

Abstention solved: 70.0% → 96.7%

30 questions in LongMemEval are "false premise" — they ask about information that was never mentioned. A correct system should recognize it doesn't know and abstain. In v0.3.2 we missed 9 of 30; v0.3.3 misses just one, using learned ANN score statistics rather than hand-tuned heuristics.

Fix

Learned ANN-statistics threshold

We replaced the gap < 0.05 OR mean < 0.25 heuristic with an abstention signal derived from pre-rerank ANN score distributions: the top-k cosine similarity mean, standard deviation, and decay rate. Sessions with flat, low ANN distributions abstain before the cross-encoder gets a chance to inflate their scores.
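A minimal sketch of the distribution-shape check, in Python for illustration (the runtime is Elixir); the three threshold values are placeholders, not the learned ones:

```python
import statistics

def should_abstain(ann_scores, mean_floor=0.30, std_floor=0.02, decay_floor=0.05):
    """Abstain when the pre-rerank ANN distribution is flat and low.

    ann_scores: top-k cosine similarities, sorted descending.
    The three floors are illustrative placeholders, not the learned values.
    """
    mean = sum(ann_scores) / len(ann_scores)
    spread = statistics.pstdev(ann_scores)
    decay = ann_scores[0] - ann_scores[-1]  # how quickly scores fall off
    # Flat + low = no candidate stands out: abstain before reranking
    return mean < mean_floor and spread < std_floor and decay < decay_floor
```

A query with a real hit shows a high top score and a steep decay, so at least one of the three checks fails and retrieval proceeds to the cross-encoder.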

Result

29 of 30 correct

v0.3.3 abstains correctly on 29 of 30 false-premise questions — closing most of the gap to agentmemory's 30/30. The remaining miss is a close call where a near-synonym session scored highly on both ANN and cross-encoder.

Why It Worked

Catch noise before reranking

Cross-encoder reranking is excellent at boosting real matches — but it also boosts irrelevant matches when no real match exists. By checking the raw ANN distribution before rerank, we decide "this query has no real hit" from the shape of the retrieval response itself.

Remaining Work

Negative-evidence modeling

The learned threshold is still distribution-based, not semantic. True negative-evidence modeling ("the graph contains no node about X") would require a small classifier trained on topical coverage and could close the last 3.3pp.

What the numbers reveal

Remaining Gap

Temporal: 87.8% (127 questions)

The v0.3.3 relative-date parser (“two weeks ago”, “last Saturday”) anchored to the question date lifted temporal from 85.1% → 87.8%, but this is still the weakest category. Mean first-correct-rank is bimodal: median 1, but missed questions sit deep in the ranking. Needs richer date normalization (month names, explicit ranges) and temporal intent detection (“first”, “latest”, “before X”).
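For flavor, a simplified relative-date resolver anchored to the question date might look like the following (a Python sketch; the shipped parser lives in the Elixir pipeline and handles more forms):

```python
import re
from datetime import date, timedelta

WORD_NUMS = {"a": 1, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday",
            "saturday", "sunday"]

def resolve_relative_date(phrase, anchor):
    """Resolve 'two weeks ago' / 'last saturday' against the question date."""
    phrase = phrase.lower().strip()
    m = re.fullmatch(r"(\w+)\s+(day|week)s?\s+ago", phrase)
    if m:
        n = WORD_NUMS.get(m.group(1)) or int(m.group(1))
        return anchor - timedelta(days=n * (7 if m.group(2) == "week" else 1))
    m = re.fullmatch(r"last\s+(\w+)", phrase)
    if m and m.group(1) in WEEKDAYS:
        # Most recent such weekday strictly before the anchor date
        back = (anchor.weekday() - WEEKDAYS.index(m.group(1)) - 1) % 7 + 1
        return anchor - timedelta(days=back)
    return None  # unhandled form: fall through to other date logic
```

Returning None on unhandled forms matters: a wrong resolved date is worse for ranking than no temporal constraint at all.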

Remaining Gap

Multi-Session: 89.7% (121 questions)

Session-aggregate ranking (v0.3.3) boosted this category, but questions that require stitching 3+ sessions still trail single-session categories. A cross-session expansion pass triggered when the top-k spans multiple sessions is the next lever.

Measurement

No LLM judge (measurement gap)

QA proxy score (session hit + keyword recall + session recall + evidence recall + NDCG) approximates but doesn't equal real QA accuracy. Competitors use GPT-4o judge on generated answers. True accuracy likely falls within a few points of this proxy.

Strength

Abstention: 96.7% (30 questions)

v0.3.3 replaced the heuristic gap check with a learned ANN-statistics threshold that inspects pre-rerank distribution shape. 29/30 false-premise questions now correctly abstain — up from 21/30 in v0.3.2.

Strength

Preferences: 84.8% (30 questions)

Up from 67.7% via fact-augmented indexing and session-aggregate boost. Preference answers no longer depend on keyword overlap with the question.

Strength

Knowledge Update: 97.8%

Belief revision and superseded edges work. When facts change across sessions, Graphonomous correctly retrieves the latest version. 100% session hit rate on update questions.

Strength

Latency: 1.5s mean

From 20s+ down to 1.5s mean across 500 queries. Cross-encoder reranking dominates at ~1.3s; all other stages under 150ms. Competitive with systems that call frontier LLM APIs per query.

Stage-level timing breakdown

Every query passes through 15 pipeline stages; the most expensive are listed below. Cross-encoder reranking dominates at 83% of total time. All topology and graph operations complete in under 50ms.

Stage Mean p50 p95 Max
Cross-Encoder Rerank 1,317ms 1,307ms 1,409ms 2,242ms
ANN Retrieve 147ms 145ms 182ms 332ms
Edge Impact 42ms 41ms 61ms 93ms
Topology 12ms 12ms 23ms 34ms
Expand Neighbors 4ms 4ms 7ms 10ms
Chain Retrieval 3ms 0ms 20ms 27ms
BM25 Await 2ms 0ms 8ms 44ms
Diversify + Sort 2ms 1ms 2ms 14ms

Path to 95%

Five phases targeting the gaps above. Estimated impact based on published ablation studies and SOTA system architectures.

Phase 1 — Measurement

Add LLM judge + answer generation to benchmark. Current QA proxy is a useful directional metric but not comparable to competitor evaluations. Feed (question, retrieved_context) to Claude, judge answer quality 0/0.5/1. Establish honest baseline.

Impact: measurement accuracy, no direct score change

Phase 2 — Abstention + Preferences (shipped v0.3.3)

Replaced heuristic abstention with a learned ANN-statistics threshold. Added fact-augmented key expansion for preference indexing ("prefers Thai food" stored as structured facts). Session-aggregate boost rewards sessions with multiple strong hits.

Delivered: abstention 70.0% → 96.7%, preferences 67.7% → 84.8%
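The key-expansion idea can be sketched as follows; the actual fact schema is internal to Graphonomous, so the key shapes below are assumptions for illustration:

```python
def fact_index_keys(subject, predicate, obj):
    """Expand a structured preference fact, e.g. ("user", "prefers",
    "Thai food"), into extra index strings so intent-style queries
    ("recommend a restaurant") can match without lexical overlap with
    the original utterance. Key shapes here are hypothetical.
    """
    return [
        f"{subject} {predicate} {obj}",  # full fact
        f"{predicate} {obj}",            # predicate-object form
        obj,                             # bare topic for topical matching
    ]
```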

Phase 3 — Temporal Intelligence (partial, +1–3pp remaining)

Dual timestamps (documentDate vs eventDate) and the relative-date parser ("two weeks ago", "last Saturday") landed in v0.3.3 and lifted temporal 85.1% → 87.8%. Still pending: explicit date ranges, month-name parsing, and ordinal intent ("first", "latest", "before X").

Delivered: temporal 85.1% → 87.8%; target 92%+

Phase 4 — Multi-Session Enhancement (partial, +2pp remaining)

Session-aggregate ranking (v0.3.3) lifted multi-session to 89.7%. Next up: cross-session expansion when top-k spans multiple sessions, and chain-of-retrieval rounds for 3+ session stitching.

Delivered: multi-session 87% → 89.7%; target 92%+

Phase 5 — Query Intelligence (+2–3pp)

Query expansion and reformulation (standard RAG technique, rephrasing queries 3 ways). Adaptive retrieval depth based on query complexity. These are the techniques that push from good to SOTA.

Expected: overall 92.6% → 95%+
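Template-based reformulation gives a flavor of the technique; the plan is LLM-generated paraphrases, so this Python snippet is only an illustrative stand-in:

```python
import re

QUESTION_WORDS = r"(what|when|where|which|who|whose|how|did|do|does|was|were|is|are|i|you|my)"

def reformulate(question):
    """Produce cheap query variants: the original plus a keyword-style
    form with interrogative scaffolding stripped. A hypothetical
    stand-in for LLM paraphrasing, not the planned implementation."""
    q = question.strip().rstrip("?")
    keyword_form = re.sub(rf"^({QUESTION_WORDS}\b\s*)+", "", q, flags=re.IGNORECASE)
    # Deduplicate while preserving order
    return list(dict.fromkeys([q, keyword_form]))
```

Each variant is retrieved independently and the candidate pools are fused, which recovers hits the literal question phrasing misses.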

How the benchmark was run

Embedder

nomic-embed-text-v2-moe (768D) via ONNX. Dense-forward export running all experts with router-weighted blending. Task-prefixed queries ("search_query:") and documents ("search_document:").

Retrieval Pipeline

Hybrid ANN (HNSW) + BM25 with reciprocal rank fusion. Cross-encoder reranking (ms-marco-MiniLM-L-6-v2). Graph neighborhood expansion (1–2 hops). Adaptive limits scaling with sqrt(node_count).
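The fusion step is standard reciprocal rank fusion, sketched below in Python for illustration (k=60 is the conventional constant, not necessarily the pipeline's value):

```python
def rrf_fuse(ann_ranked, bm25_ranked, k=60):
    """Merge two ranked id lists by reciprocal rank fusion: each list
    contributes 1/(k + rank) per document, so ids ranked well by both
    ANN and BM25 rise to the top."""
    scores = {}
    for ranked in (ann_ranked, bm25_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```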

Graph Scale

940 sessions ingested (10,866 turns). ~12K knowledge nodes, ~208K edges. Session-level + turn-level + entity linking edges. Full topology analysis per query.

Evaluation

Self-contained QA proxy: 35% session hit + 25% keyword recall + 20% session recall + 10% evidence recall + 10% NDCG. No external LLM judge. Abstention scored binary on confidence gap heuristic.
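With the weights above, the composite reduces to a single weighted sum (sketch):

```python
def qa_proxy(session_hit, keyword_recall, session_recall, evidence_recall, ndcg):
    """QA proxy score with the published weights; each input is in [0, 1]."""
    return (0.35 * session_hit + 0.25 * keyword_recall + 0.20 * session_recall
            + 0.10 * evidence_recall + 0.10 * ndcg)
```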

Hardware

Single-machine Elixir/OTP runtime. EXLA for neural embeddings. SQLite + ETS for storage. No GPU required for inference (CPU-only ONNX). Total evaluation: 13 minutes for 500 queries.

Dataset

LongMemEval oracle split (ICLR 2025). 500 questions across 7 types testing 5 memory abilities. ~115K tokens per question context. Each question has 1–6 evidence snippets with session-level ground truth.

What top systems do differently

Key architectural differences between Graphonomous and leading published systems. These inform the roadmap above.

System Score Key Technique Graphonomous Gap
agentmemory 96.2% Six-signal retrieval weighting with learned calibration; 100% abstention via confidence scoring Closed in v0.3.3: ANN-statistics threshold — 96.7% abstention
OMEGA 95.4% Local SQLite + ONNX; six-stage pipeline; five forgetting mechanisms; conflict detection; 50ms retrieval No forgetting policy; no conflict detection at write time
Mastra OM 94.9% Observer + Reflector dual agents; 3–6x compression; dated observations; stable context window No reflector/consolidation at query time; no compression
Hindsight v0.4.19 94.6% World/Experience/Mental-Model memory banks; 4-strategy parallel retrieval (semantic, keyword, graph, temporal) + RRF; auto-observations Single unified graph; no opinion/entity separation
Emergence AI 86.0% Session-level RAG with NDCG scoring; cross-encoder reranking Closest architecture; Graphonomous already exceeds on overall score
Open Source

Help us close the gap

Graphonomous is open source. The benchmark harness, all results data, and the roadmap are in the repo. Contributions welcome—especially on abstention detection and temporal reasoning.
