LongMemEval Benchmark

92.6% on 500 questions.
Competitive with frontier-LLM systems — using only local models.

Graphonomous v0.3.3 evaluated against LongMemEval (ICLR 2025) — the standard benchmark for long-term conversational memory. 500 questions across 940 sessions, 10,866 turns. No cloud LLM calls: a 500M-param local embedding model + cross-encoder reranker + κ-topology retrieval. We publish the full results, gaps, and roadmap.

v0.3.3 — Oracle Split, 500 Questions

Evaluated 2026-04-05. Embedder: nomic-embed-text-v2-moe (768D, 500M params). Retrieval-only evaluation with QA proxy scoring (session hit + keyword recall + session recall + evidence recall + NDCG).

92.6% QA Proxy (overall accuracy)
98.1% Session Hit Rate (found the answer session)
1.5s Mean Latency (per query)
96.7% Abstention (learned threshold)

Where Graphonomous stands

LongMemEval has become the de facto benchmark for agent memory. Here's how published systems compare. Note that systems use different LLM backends and evaluation methodologies—direct comparison requires care.

agentmemory (Opus 4.6): 96.2%
OMEGA (GPT-4.1): 95.4%
Mastra OM (GPT-5-mini): 94.9%
Hindsight v0.4.19: 94.6%
Graphonomous v0.3.3 (local 500M): 92.6%
Emergence AI (RAG): 86.0%
Supermemory (Gemini-3): 85.2%
Mastra OM (GPT-4o): 84.2%
Zep / Graphiti: 71.2%
Letta / MemGPT: 65.0%
GPT-4 128K (full ctx): 63.5%
Methodology note: Graphonomous is a retrieval-only system — it does not generate answers. Results use a self-contained QA proxy score (session hit + keyword recall + session recall + evidence recall + NDCG). Most competitors use GPT-4o or GPT-5 as a judge on generated answers. Top systems also use frontier LLMs (GPT-5-mini, Gemini-3-Pro) for both memory operations and answer generation, while Graphonomous runs entirely on local models (nomic embeddings + ms-marco cross-encoder). An LLM judge evaluation is planned for a future release.

Five abilities, five different stories

LongMemEval tests five core long-term memory abilities. Graphonomous now delivers strong performance across all five, with knowledge updates, multi-session reasoning, and abstention all above 95%. Temporal reasoning is the remaining gap and the current optimization focus.

Ability Questions QA Proxy Session Hit Session Recall Status
Knowledge Update 72 97.8% 100.0% 98.6% Strong
Abstention 30 96.7% 86.7% 70.3% Strong
Information Extraction 150 95.6% 98.7% 98.7% Strong
Multi-Session Reasoning 121 89.7% 100.0% 95.3% Strong
Temporal Reasoning 127 87.8% 94.5% 89.0% Gap

Question types, granular view

Question Type Count QA Proxy Session Hit
Single-Session Assistant 56 99.5% 100.0%
Knowledge Update 78 97.9% 98.7%
Single-Session User 70 97.5% 97.1%
Multi-Session 133 90.6% 100.0%
Temporal Reasoning 133 87.6% 94.0%
Single-Session Preference 30 84.8% 93.3%
Temporal reasoning is now the weakest category at 87.6%. Relative-date parsing ("two weeks ago", "last Saturday") landed in v0.3.3 and closed 1.7pp of the gap, but queries asking about session chronology (first, last, before/after) still depend on session_rank heuristics that miss some fine-grained ordering.

Preferences jumped from 67.7% → 84.8% across v0.3.2 and v0.3.3 via adaptive candidate expansion; the remaining gap is vocabulary mismatch on narrative answers (e.g., "recommend a show" → session about "stand-up comedy, John Mulaney, Netflix").

Abstention solved: 70.0% → 96.7%

30 questions in LongMemEval are "false premise" — they ask about information that was never mentioned. A correct system should recognize it doesn't know and abstain. In v0.3.2 we missed 9 of 30; v0.3.3 misses just one, using learned ANN score statistics rather than hand-tuned heuristics.

Fix

Learned ANN-statistics threshold

We replaced the gap < 0.05 OR mean < 0.25 heuristic with an abstention signal derived from pre-rerank ANN score distributions: the top-k cosine similarity mean, standard deviation, and decay rate. Sessions with flat, low ANN distributions abstain before the cross-encoder gets a chance to inflate their scores.
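A minimal sketch of the distribution-shape check, in Python for illustration (the runtime is Elixir); the three threshold values are placeholders, not the learned ones:

```python
import statistics

def should_abstain(ann_scores, mean_floor=0.30, std_floor=0.02, decay_floor=0.05):
    """Abstain when the pre-rerank ANN distribution is flat and low.

    ann_scores: top-k cosine similarities, sorted descending.
    The three floors are illustrative placeholders, not the learned values.
    """
    mean = sum(ann_scores) / len(ann_scores)
    spread = statistics.pstdev(ann_scores)
    decay = ann_scores[0] - ann_scores[-1]  # how quickly scores fall off
    # Flat + low = no candidate stands out: abstain before reranking
    return mean < mean_floor and spread < std_floor and decay < decay_floor
```

A query with a real hit shows a high top score and a steep decay, so at least one of the three checks fails and retrieval proceeds to the cross-encoder.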

Result

29 of 30 correct

v0.3.3 abstains correctly on 29 of 30 false-premise questions — closing most of the gap to agentmemory's 30/30. The remaining miss is a close call where a near-synonym session scored highly on both ANN and cross-encoder.

Why It Worked

Catch noise before reranking

Cross-encoder reranking is excellent at boosting real matches — but it also boosts irrelevant matches when no real match exists. By checking the raw ANN distribution before rerank, we decide "this query has no real hit" from the shape of the retrieval response itself.

Remaining Work

Negative-evidence modeling

The learned threshold is still distribution-based, not semantic. True negative-evidence modeling ("the graph contains no node about X") would require a small classifier trained on topical coverage and could close the last 3.3pp.

What the numbers reveal

Remaining Gap

Temporal: 87.8% (127 questions)

The v0.3.3 relative-date parser (“two weeks ago”, “last Saturday”) anchored to the question date lifted temporal from 85.1% → 87.8%, but this is still the weakest category. Mean first-correct-rank is bimodal: median 1, but missed questions sit deep in the ranking. Needs richer date normalization (month names, explicit ranges) and temporal intent detection (“first”, “latest”, “before X”).
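For flavor, a simplified relative-date resolver anchored to the question date might look like the following (a Python sketch; the shipped parser lives in the Elixir pipeline and handles more forms):

```python
import re
from datetime import date, timedelta

WORD_NUMS = {"a": 1, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday",
            "saturday", "sunday"]

def resolve_relative_date(phrase, anchor):
    """Resolve 'two weeks ago' / 'last saturday' against the question date."""
    phrase = phrase.lower().strip()
    m = re.fullmatch(r"(\w+)\s+(day|week)s?\s+ago", phrase)
    if m:
        n = WORD_NUMS.get(m.group(1)) or int(m.group(1))
        return anchor - timedelta(days=n * (7 if m.group(2) == "week" else 1))
    m = re.fullmatch(r"last\s+(\w+)", phrase)
    if m and m.group(1) in WEEKDAYS:
        # Most recent such weekday strictly before the anchor date
        back = (anchor.weekday() - WEEKDAYS.index(m.group(1)) - 1) % 7 + 1
        return anchor - timedelta(days=back)
    return None  # unhandled form: fall through to other date logic
```

Returning None on unhandled forms matters: a wrong resolved date is worse for ranking than no temporal constraint at all.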

Remaining Gap

Multi-Session: 89.7% (121 questions)

Session-aggregate ranking (v0.3.3) boosted this category, but questions that require stitching 3+ sessions still trail single-session categories. A cross-session expansion pass triggered when the top-k spans multiple sessions is the next lever.

Measurement

No LLM judge (measurement gap)

QA proxy score (session hit + keyword recall + session recall + evidence recall + NDCG) approximates but doesn't equal real QA accuracy. Competitors use GPT-4o judge on generated answers. True accuracy likely falls within a few points of this proxy.

Strength

Abstention: 96.7% (30 questions)

v0.3.3 replaced the heuristic gap check with a learned ANN-statistics threshold that inspects pre-rerank distribution shape. 29/30 false-premise questions now correctly abstain — up from 21/30 in v0.3.2.

Strength

Preferences: 84.8% (30 questions)

Up from 67.7% via fact-augmented indexing and session-aggregate boost. Preference answers no longer depend on keyword overlap with the question.

Strength

Knowledge Update: 97.8%

Belief revision and superseded edges work. When facts change across sessions, Graphonomous correctly retrieves the latest version. 100% session hit rate on update questions.

Strength

Latency: 1.5s mean

From 20s+ down to 1.5s mean across 500 queries. Cross-encoder reranking dominates at ~1.3s; all other stages under 150ms. Competitive with systems that call frontier LLM APIs per query.

Stage-level timing breakdown

Every query passes through 15 pipeline stages; the most expensive are listed below. Cross-encoder reranking dominates at 83% of total time. All topology and graph operations complete in under 50ms.

Stage Mean p50 p95 Max
Cross-Encoder Rerank 1,317ms 1,307ms 1,409ms 2,242ms
ANN Retrieve 147ms 145ms 182ms 332ms
Edge Impact 42ms 41ms 61ms 93ms
Topology 12ms 12ms 23ms 34ms
Expand Neighbors 4ms 4ms 7ms 10ms
Chain Retrieval 3ms 0ms 20ms 27ms
BM25 Await 2ms 0ms 8ms 44ms
Diversify + Sort 2ms 1ms 2ms 14ms

Path to 95%

Five phases targeting the gaps above. Estimated impact based on published ablation studies and SOTA system architectures.

Phase 1 — Measurement

Add LLM judge + answer generation to benchmark. Current QA proxy is a useful directional metric but not comparable to competitor evaluations. Feed (question, retrieved_context) to Claude, judge answer quality 0/0.5/1. Establish honest baseline.

Impact: measurement accuracy, no direct score change

Phase 2 — Abstention + Preferences (shipped v0.3.3)

Replaced heuristic abstention with a learned ANN-statistics threshold. Added fact-augmented key expansion for preference indexing ("prefers Thai food" stored as structured facts). Session-aggregate boost rewards sessions with multiple strong hits.

Delivered: abstention 70.0% → 96.7%, preferences 67.7% → 84.8%
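The key-expansion idea can be sketched as follows; the actual fact schema is internal to Graphonomous, so the key shapes below are assumptions for illustration:

```python
def fact_index_keys(subject, predicate, obj):
    """Expand a structured preference fact, e.g. ("user", "prefers",
    "Thai food"), into extra index strings so intent-style queries
    ("recommend a restaurant") can match without lexical overlap with
    the original utterance. Key shapes here are hypothetical.
    """
    return [
        f"{subject} {predicate} {obj}",  # full fact
        f"{predicate} {obj}",            # predicate-object form
        obj,                             # bare topic for topical matching
    ]
```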

Phase 3 — Temporal Intelligence (partial, +1–3pp remaining)

Dual timestamps (documentDate vs eventDate) and the relative-date parser ("two weeks ago", "last Saturday") landed in v0.3.3 and lifted temporal 85.1% → 87.8%. Still pending: explicit date ranges, month-name parsing, and ordinal intent ("first", "latest", "before X").

Delivered: temporal 85.1% → 87.8%; target 92%+

Phase 4 — Multi-Session Enhancement (partial, +2pp remaining)

Session-aggregate ranking (v0.3.3) lifted multi-session to 89.7%. Next up: cross-session expansion when top-k spans multiple sessions, and chain-of-retrieval rounds for 3+ session stitching.

Delivered: multi-session 87% → 89.7%; target 92%+

Phase 5 — Query Intelligence (+2–3pp)

Query expansion and reformulation (standard RAG technique, rephrasing queries 3 ways). Adaptive retrieval depth based on query complexity. These are the techniques that push from good to SOTA.

Expected: overall 92.6% → 95%+
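Template-based reformulation gives a flavor of the technique; the plan is LLM-generated paraphrases, so this Python snippet is only an illustrative stand-in:

```python
import re

QUESTION_WORDS = r"(what|when|where|which|who|whose|how|did|do|does|was|were|is|are|i|you|my)"

def reformulate(question):
    """Produce cheap query variants: the original plus a keyword-style
    form with interrogative scaffolding stripped. A hypothetical
    stand-in for LLM paraphrasing, not the planned implementation."""
    q = question.strip().rstrip("?")
    keyword_form = re.sub(rf"^({QUESTION_WORDS}\b\s*)+", "", q, flags=re.IGNORECASE)
    # Deduplicate while preserving order
    return list(dict.fromkeys([q, keyword_form]))
```

Each variant is retrieved independently and the candidate pools are fused, which recovers hits the literal question phrasing misses.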

How the benchmark was run

Embedder

nomic-embed-text-v2-moe (768D) via ONNX. Dense-forward export running all experts with router-weighted blending. Task-prefixed queries ("search_query:") and documents ("search_document:").

Retrieval Pipeline

Hybrid ANN (HNSW) + BM25 with reciprocal rank fusion. Cross-encoder reranking (ms-marco-MiniLM-L-6-v2). Graph neighborhood expansion (1–2 hops). Adaptive limits scaling with sqrt(node_count).
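The fusion step is standard reciprocal rank fusion, sketched below in Python for illustration (k=60 is the conventional constant, not necessarily the pipeline's value):

```python
def rrf_fuse(ann_ranked, bm25_ranked, k=60):
    """Merge two ranked id lists by reciprocal rank fusion: each list
    contributes 1/(k + rank) per document, so ids ranked well by both
    ANN and BM25 rise to the top."""
    scores = {}
    for ranked in (ann_ranked, bm25_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```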

Graph Scale

940 sessions ingested (10,866 turns). ~12K knowledge nodes, ~208K edges. Session-level + turn-level + entity linking edges. Full topology analysis per query.

Evaluation

Self-contained QA proxy: 35% session hit + 25% keyword recall + 20% session recall + 10% evidence recall + 10% NDCG. No external LLM judge. Abstention scored binary on confidence gap heuristic.
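With the weights above, the composite reduces to a single weighted sum (sketch):

```python
def qa_proxy(session_hit, keyword_recall, session_recall, evidence_recall, ndcg):
    """QA proxy score with the published weights; each input is in [0, 1]."""
    return (0.35 * session_hit + 0.25 * keyword_recall + 0.20 * session_recall
            + 0.10 * evidence_recall + 0.10 * ndcg)
```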

Hardware

Single-machine Elixir/OTP runtime. EXLA for neural embeddings. SQLite + ETS for storage. No GPU required for inference (CPU-only ONNX). Total evaluation: 13 minutes for 500 queries.

Dataset

LongMemEval oracle split (ICLR 2025). 500 questions across 7 types testing 5 memory abilities. ~115K tokens per question context. Each question has 1–6 evidence snippets with session-level ground truth.

What top systems do differently

Key architectural differences between Graphonomous and leading published systems. These inform the roadmap above.

System Score Key Technique Graphonomous Gap
agentmemory 96.2% Six-signal retrieval weighting with learned calibration; 100% abstention via confidence scoring Closed in v0.3.3: ANN-statistics threshold — 96.7% abstention
OMEGA 95.4% Local SQLite + ONNX; six-stage pipeline; five forgetting mechanisms; conflict detection; 50ms retrieval No forgetting policy; no conflict detection at write time
Mastra OM 94.9% Observer + Reflector dual agents; 3–6x compression; dated observations; stable context window No reflector/consolidation at query time; no compression
Hindsight v0.4.19 94.6% World/Experience/Mental-Model memory banks; 4-strategy parallel retrieval (semantic, keyword, graph, temporal) + RRF; auto-observations Single unified graph; no opinion/entity separation
Emergence AI 86.0% Session-level RAG with NDCG scoring; cross-encoder reranking Closest architecture; Graphonomous already exceeds on overall score
Open Source

Help us close the gap

Graphonomous is open source. The benchmark harness, all results data, and the roadmap are in the repo. Contributions welcome—especially on abstention detection and temporal reasoning.
