Benchmark Suite

Two benchmarks. Two sides of the same question.

LongMemEval measures how Graphonomous performs on a realistic, diverse agent-memory workload. GraphMemBench v2 measures what κ-aware topology actually contributes, under controlled conditions where cycles are the difference between right and wrong.

Public benchmark

LongMemEval

ICLR 2025 long-term memory benchmark: 500 questions across knowledge-update, temporal-reasoning, abstention, single-session, and multi-session tracks. Graphonomous runs it fully locally with a 500M-param embedder.

92.6% QA Accuracy · 500 Questions · 5 Tracks
Synthetic · κ-sensitive

GraphMemBench v2

Purpose-built synthetic benchmark for measuring κ-topology's contribution to retrieval. Eight tiers span κ=0 controls, simple cycles, dense multi-SCCs, contradiction resolution, mixed-κ discrimination, evidence path tracing, and causal DAG ordering.

8 Tiers · 5 Difficulty Knobs · +100pp κ-delta gate
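To make the tier construction concrete, here is a minimal sketch of a κ-sensitive instance: one simple directed cycle plus acyclic distractor edges, where a node participates in a cycle exactly when it can reach itself. The generator and helper names are hypothetical; only the cycle-plus-distractors framing comes from the tier descriptions above.

```python
from collections import defaultdict

def build_instance(cycle_nodes, distractor_edges):
    """Hypothetical tier instance: a single simple cycle over
    `cycle_nodes` plus acyclic distractor edges."""
    graph = defaultdict(set)
    # Close the cycle: each node points to the next, last back to first.
    for a, b in zip(cycle_nodes, cycle_nodes[1:] + cycle_nodes[:1]):
        graph[a].add(b)
    for a, b in distractor_edges:
        graph[a].add(b)
    return graph

def in_cycle(graph, start):
    """A node lies on a cycle iff it can reach itself via >= 1 edge."""
    stack, seen = list(graph[start]), set()
    while stack:
        node = stack.pop()
        if node == start:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return False

g = build_instance(["a", "b", "c"], [("c", "x"), ("x", "y")])
```

A topology-blind retriever sees `x` and `c` as equally connected neighbors; the cycle-membership check is what separates them, which is the property the κ-sensitive tiers probe.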

External realism + internal causality

LongMemEval tells us whether Graphonomous is competitive with GPT-4o-powered memory systems on a task users actually care about. GraphMemBench T1–T6 tell us whether the κ-topology machinery is doing real work. T7–T8 establish baselines for graph algorithm quality (Dijkstra paths, toposort ordering).
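The topology on/off A/B can be sketched as a tiny harness. Everything here is a hypothetical stand-in (`answer_fn`, the toy question set); only the ≥3pp gate value is taken from the validation-gate criteria.

```python
def kappa_delta_pp(questions, answer_fn, gate_pp=3.0):
    """Hypothetical A/B harness: score the same questions with
    topology-aware retrieval on vs. off, return the delta in
    percentage points and whether it clears the gate.
    `answer_fn(q, topology)` is an assumed hook returning True
    when the answer is correct."""
    def accuracy(topology):
        correct = sum(1 for q in questions if answer_fn(q, topology))
        return 100.0 * correct / len(questions)
    delta = accuracy(True) - accuracy(False)
    return delta, delta >= gate_pp

# Toy demo: a fake answerer that only succeeds on cycle-dependent
# questions when topology is enabled.
qs = [{"needs_cycle": i % 2 == 0} for i in range(100)]
fake = lambda q, topo: topo or not q["needs_cycle"]
delta, passed = kappa_delta_pp(qs, fake)  # delta = 50.0 pp, passed = True
```

Running the identical question set through both configurations is what lets the delta be attributed to topology alone rather than to corpus or embedder differences.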

| Property | LongMemEval | GraphMemBench v2 |
| --- | --- | --- |
| Source | ICLR 2025 public dataset | Deterministic synthetic |
| Reproducible from seed | No (fixed corpus) | Yes (`--seed N`) |
| Question count | 500 (fixed) | 50–500 per tier (configurable) |
| Isolates κ-topology contribution | No | Yes (topology on/off A/B) |
| Domain content | Realistic dialogue | Synthetic cycles + DAGs with distractors |
| Validation gate | 92.6% overall QA | T1–T6: ≥3pp κ-delta; T7–T8: algorithm baselines |
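Reproducible-from-seed generation can be illustrated with a minimal sketch. The generator and its fields are hypothetical, but the contract matches the `--seed N` behavior above: identical arguments must yield an identical question set.

```python
import random

def generate_tier_questions(seed, tier, n=50):
    """Hypothetical generator: seeding a private RNG from (seed, tier)
    makes every run with the same arguments produce the same set."""
    # A string seed hashes deterministically across processes,
    # unlike Python's built-in hash() of arbitrary objects.
    rng = random.Random(f"{seed}:{tier}")
    return [
        {"id": f"t{tier}-q{i}", "cycle_len": rng.randint(3, 8)}
        for i in range(n)
    ]

a = generate_tier_questions(seed=42, tier=3)
b = generate_tier_questions(seed=42, tier=3)  # identical to `a`
```

Using a private `random.Random` instance rather than the module-level RNG keeps generation independent of any other randomness in the process, which is what makes seed-level reproducibility reliable.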