Graphonomous v0.3.3 evaluated against LongMemEval (ICLR 2025) — the standard benchmark for long-term conversational memory. 500 questions across 940 sessions, 10,866 turns. No cloud LLM calls: a 500M-param local embedding model + cross-encoder reranker + κ-topology retrieval. We publish the full results, gaps, and roadmap.
Evaluated 2026-04-05. Embedder: nomic-embed-text-v2-moe (768D, 500M params). Retrieval-only evaluation with QA proxy scoring (session hit + keyword recall + session recall + evidence recall + NDCG).
LongMemEval has become the de facto benchmark for agent memory. Here's how published systems compare. Note that systems use different LLM backends and evaluation methodologies—direct comparison requires care.
LongMemEval tests five core long-term memory abilities. Graphonomous now delivers strong performance across all five, with knowledge updates, information extraction, and abstention all above 95%. Temporal reasoning is the remaining gap and the current optimization focus.
| Ability | Questions | QA Proxy | Session Hit | Session Recall | Status |
|---|---|---|---|---|---|
| Knowledge Update | 72 | 97.8% | 100.0% | 98.6% | Strong |
| Abstention | 30 | 96.7% | 86.7% | 70.3% | Strong |
| Information Extraction | 150 | 95.6% | 98.7% | 98.7% | Strong |
| Multi-Session Reasoning | 121 | 89.7% | 100.0% | 95.3% | Strong |
| Temporal Reasoning | 127 | 87.8% | 94.5% | 89.0% | Gap |
Broken down by LongMemEval question type:

| Question Type | Count | QA Proxy | Session Hit |
|---|---|---|---|
| Single-Session Assistant | 56 | 99.5% | 100.0% |
| Knowledge Update | 78 | 97.9% | 98.7% |
| Single-Session User | 70 | 97.5% | 97.1% |
| Multi-Session | 133 | 90.6% | 100.0% |
| Temporal Reasoning | 133 | 87.6% | 94.0% |
| Single-Session Preference | 30 | 84.8% | 93.3% |
30 questions in LongMemEval are "false premise" — they ask about information that was never mentioned. A correct system should recognize it doesn't know and abstain. In v0.3.2 we missed 9 of 30. v0.3.3 misses just one — via learned ANN score statistics rather than hand-tuned heuristics.
We replaced the `gap < 0.05 OR mean < 0.25` heuristic with an abstention signal derived from pre-rerank ANN score distributions: the top-k cosine similarity mean, standard deviation, and decay rate. Sessions with flat, low ANN distributions abstain before the cross-encoder gets a chance to inflate their scores.
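A minimal sketch of that signal, in Python for brevity (the runtime itself is Elixir/OTP). The threshold values and the three-way AND are illustrative assumptions, not the learned parameters:

```python
import statistics

def should_abstain(ann_scores, mean_floor=0.25, std_floor=0.02, decay_floor=0.01):
    """Abstain when the pre-rerank ANN distribution is flat and low.

    `ann_scores` are top-k cosine similarities, sorted descending.
    Threshold names and values are illustrative, not the learned ones.
    """
    mean = statistics.fmean(ann_scores)
    std = statistics.pstdev(ann_scores)
    # Decay rate: average drop between consecutive ranks. A real hit
    # produces a steep head; a flat tail of near-misses does not.
    decay = (ann_scores[0] - ann_scores[-1]) / (len(ann_scores) - 1)
    return mean < mean_floor and std < std_floor and decay < decay_floor
```

A flat, low distribution like `[0.21, 0.20, 0.20, 0.19, 0.19]` abstains; a steep head like `[0.82, 0.55, 0.31, 0.28, 0.27]` does not.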
v0.3.3 abstains correctly on 29 of 30 false-premise questions — closing most of the gap to agentmemory's 30/30. The remaining miss is a close call where a near-synonym session scored highly on both ANN and cross-encoder.
Cross-encoder reranking is excellent at boosting real matches — but it also boosts irrelevant matches when no real match exists. By checking the raw ANN distribution before rerank, we decide "this query has no real hit" from the shape of the retrieval response itself.
The learned threshold is still distribution-based, not semantic. True negative-evidence modeling ("the graph contains no node about X") would require a small classifier trained on topical coverage and could close the last 3.3pp.
The v0.3.3 relative-date parser ("two weeks ago", "last Saturday"), anchored to the question date, lifted temporal from 85.1% → 87.8%, but this is still the weakest category. The first-correct-rank distribution is bimodal: the median is 1, but missed questions sit deep in the ranking. Needs richer date normalization (month names, explicit ranges) and temporal intent detection ("first", "latest", "before X").
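A toy subset of such a parser, anchored to the question date. This is an illustrative sketch only; the v0.3.3 parser covers more forms:

```python
from datetime import date, timedelta
import re

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
UNITS = {"day": 1, "week": 7}
WORDS = {"a": 1, "one": 1, "two": 2, "three": 3}

def resolve_relative(expr, anchor):
    """Resolve a relative-date phrase against the question date `anchor`."""
    expr = expr.lower().strip()
    m = re.fullmatch(r"(\w+) (day|week)s? ago", expr)
    if m:
        n = WORDS.get(m.group(1)) or int(m.group(1))
        return anchor - timedelta(days=n * UNITS[m.group(2)])
    m = re.fullmatch(r"last (\w+)", expr)
    if m and m.group(1) in WEEKDAYS:
        # Most recent strictly-past occurrence of that weekday.
        target = WEEKDAYS.index(m.group(1))
        delta = (anchor.weekday() - target) % 7 or 7
        return anchor - timedelta(days=delta)
    return None  # unparsed forms fall through to absolute-date handling
```

With anchor `date(2026, 4, 5)` (a Sunday), "two weeks ago" resolves to 2026-03-22 and "last Saturday" to 2026-04-04.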
Session-aggregate ranking (v0.3.3) boosted this category, but questions that require stitching 3+ sessions still trail single-session categories. A cross-session expansion pass triggered when the top-k spans multiple sessions is the next lever.
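One speculative shape such a cross-session expansion pass could take (this is a roadmap sketch, not shipped behavior; `fetch_session` is a hypothetical accessor returning ranked turns for one session id):

```python
from collections import Counter

def expand_cross_session(top_k, fetch_session, per_session=3):
    """When the top-k already spans several sessions, pull a few extra
    turns from each implicated session so multi-hop answers can be
    stitched together. Hits are dicts with session_id and turn_id."""
    sessions = Counter(hit["session_id"] for hit in top_k)
    if len(sessions) < 2:  # single-session query: nothing to stitch
        return top_k
    expanded = list(top_k)
    seen = {hit["turn_id"] for hit in top_k}
    for session_id in sessions:
        for turn in fetch_session(session_id)[:per_session]:
            if turn["turn_id"] not in seen:
                seen.add(turn["turn_id"])
                expanded.append(turn)
    return expanded
```

The gate on `len(sessions) < 2` keeps the pass free for the single-session categories, which already score near ceiling.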
QA proxy score (session hit + keyword recall + session recall + evidence recall + NDCG) approximates but doesn't equal real QA accuracy. Competitors use GPT-4o judge on generated answers. True accuracy likely falls within a few points of this proxy.
v0.3.3 replaced the heuristic gap check with a learned ANN-statistics threshold that inspects pre-rerank distribution shape. 29/30 false-premise questions now correctly abstain — up from 21/30 in v0.3.2.
Single-session preference: up from 67.7% to 84.8% via fact-augmented indexing and the session-aggregate boost. Preference answers no longer depend on keyword overlap with the question.
Belief revision and superseded edges work. When facts change across sessions, Graphonomous correctly retrieves the latest version. 100% session hit rate on update questions.
From 20s+ down to 1.5s mean across 500 queries. Cross-encoder reranking dominates at ~1.3s; all other stages under 150ms. Competitive with systems that call frontier LLM APIs per query.
Every query passes through 15 pipeline stages; the slowest are shown below. Cross-encoder reranking dominates at 83% of total time. All topology and graph operations complete in under 50ms.
| Stage | Mean | p50 | p95 | Max |
|---|---|---|---|---|
| Cross-Encoder Rerank | 1,317ms | 1,307ms | 1,409ms | 2,242ms |
| ANN Retrieve | 147ms | 145ms | 182ms | 332ms |
| Edge Impact | 42ms | 41ms | 61ms | 93ms |
| Topology | 12ms | 12ms | 23ms | 34ms |
| Expand Neighbors | 4ms | 4ms | 7ms | 10ms |
| Chain Retrieval | 3ms | 0ms | 20ms | 27ms |
| BM25 Await | 2ms | 0ms | 8ms | 44ms |
| Diversify + Sort | 2ms | 1ms | 2ms | 14ms |
Five phases targeting the gaps above. Estimated impact based on published ablation studies and SOTA system architectures.
Add LLM judge + answer generation to benchmark. Current QA proxy is a useful directional metric but not comparable to competitor evaluations. Feed (question, retrieved_context) to Claude, judge answer quality 0/0.5/1. Establish honest baseline.
Replaced heuristic abstention with a learned ANN-statistics threshold. Added fact-augmented key expansion for preference indexing ("prefers Thai food" stored as structured facts). Session-aggregate boost rewards sessions with multiple strong hits.
Dual timestamps (documentDate vs eventDate) and the relative-date parser ("two weeks ago", "last Saturday") landed in v0.3.3 and lifted temporal 85.1% → 87.8%. Still pending: explicit date ranges, month-name parsing, and ordinal intent ("first", "latest", "before X").
Session-aggregate ranking (v0.3.3) lifted multi-session to 89.7%. Next up: cross-session expansion when top-k spans multiple sessions, and chain-of-retrieval rounds for 3+ session stitching.
Query expansion and reformulation (standard RAG technique, rephrasing queries 3 ways). Adaptive retrieval depth based on query complexity. These are the techniques that push from good to SOTA.
nomic-embed-text-v2-moe (768D) via ONNX. Dense-forward export running all experts with router-weighted blending. Task-prefixed queries ("search_query:") and documents ("search_document:").
Hybrid ANN (HNSW) + BM25 with reciprocal rank fusion. Cross-encoder reranking (ms-marco-MiniLM-L-6-v2). Graph neighborhood expansion (1–2 hops). Adaptive limits scaling with sqrt(node_count).
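The reciprocal rank fusion step can be sketched as follows; `k=60` is the conventional RRF constant, not necessarily the value Graphonomous uses:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked id lists (best first), e.g. ANN and BM25.

    Each list contributes 1/(k + rank) per document; documents ranked
    highly by both retrievers accumulate the largest fused score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Fusing `["a", "b", "c"]` (ANN) with `["b", "c", "d"]` (BM25) yields `["b", "c", "a", "d"]`: documents present in both lists outrank either list's singleton leader.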
940 sessions ingested (10,866 turns). ~12K knowledge nodes, ~208K edges. Session-level + turn-level + entity linking edges. Full topology analysis per query.
Self-contained QA proxy: 35% session hit + 25% keyword recall + 20% session recall + 10% evidence recall + 10% NDCG. No external LLM judge. Abstention scored binary on confidence gap heuristic.
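The proxy is a plain weighted sum of the five retrieval signals, each in [0, 1]:

```python
def qa_proxy(session_hit, keyword_recall, session_recall, evidence_recall, ndcg):
    """QA proxy as described: 35% session hit + 25% keyword recall
    + 20% session recall + 10% evidence recall + 10% NDCG."""
    return (0.35 * session_hit + 0.25 * keyword_recall
            + 0.20 * session_recall + 0.10 * evidence_recall + 0.10 * ndcg)
```

Perfect retrieval scores 1.0; because session hit carries 35% of the weight, a query can score well on the proxy while still missing fine-grained evidence.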
Single-machine Elixir/OTP runtime. EXLA for neural embeddings. SQLite + ETS for storage. No GPU required for inference (CPU-only ONNX). Total evaluation: 13 minutes for 500 queries.
LongMemEval oracle split (ICLR 2025). 500 questions across 7 types testing 5 memory abilities. ~115K tokens per question context. Each question has 1–6 evidence snippets with session-level ground truth.
Key architectural differences between Graphonomous and systems scoring 91%+. These inform the roadmap above.
| System | Score | Key Technique | Graphonomous Gap |
|---|---|---|---|
| agentmemory | 96.2% | Six-signal retrieval weighting with learned calibration; 100% abstention via confidence scoring | Closed in v0.3.3: ANN-statistics threshold — 96.7% abstention |
| OMEGA | 95.4% | Local SQLite + ONNX; six-stage pipeline; five forgetting mechanisms; conflict detection; 50ms retrieval | No forgetting policy; no conflict detection at write time |
| Mastra OM | 94.9% | Observer + Reflector dual agents; 3–6x compression; dated observations; stable context window | No reflector/consolidation at query time; no compression |
| Hindsight v0.4.19 | 94.6% | World/Experience/Mental-Model memory banks; 4-strategy parallel retrieval (semantic, keyword, graph, temporal) + RRF; auto-observations | Single unified graph; no opinion/entity separation |
| Emergence AI | 86.0% | Session-level RAG with NDCG scoring; cross-encoder reranking | Closest architecture; Graphonomous already exceeds on overall score |
Graphonomous is open source. The benchmark harness, all results data, and the roadmap are in the repo. Contributions welcome—especially on abstention detection and temporal reasoning.