LongMemEval measures how Graphonomous performs against a realistic, diverse agent-memory workload. GraphMemBench v2 measures what κ-aware topology actually contributes — under controlled conditions where cycles are the difference between right and wrong.
LongMemEval (ICLR 2025) is a public long-term memory benchmark: 500 questions across knowledge-update, temporal-reasoning, abstention, single-session, and multi-session tracks. Graphonomous runs it fully locally with a 500M-parameter embedder.
GraphMemBench v2 is a purpose-built synthetic benchmark that measures κ-topology's contribution to retrieval. Its eight tiers (T1–T8) range from κ=0 controls through simple cycles, dense multi-SCC structures, contradiction resolution, mixed-κ discrimination, and evidence path tracing to causal DAG ordering.
LongMemEval tells us whether Graphonomous is competitive with GPT-4o-powered memory systems on a task users actually care about. GraphMemBench T1–T6 tell us whether the κ-topology machinery is doing real work. T7–T8 establish baselines for graph algorithm quality (Dijkstra paths, toposort ordering).
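As a reference point for what the T7–T8 baselines check, here is a minimal stdlib-Python sketch of the two algorithms named above: Dijkstra shortest paths and topological ordering (via Kahn's algorithm). The adjacency-list representation and function names are illustrative assumptions, not Graphonomous's actual implementation.

```python
import heapq
from collections import deque

def dijkstra(adj, src):
    """Shortest distances from src. adj: {node: [(neighbor, weight), ...]}
    with non-negative weights. Returns {node: distance} for reachable nodes."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def toposort(adj):
    """Kahn's algorithm. adj: {node: [successor, ...]}.
    Returns one valid topological order; raises ValueError on a cycle."""
    indeg = {}
    for u, succs in adj.items():
        indeg.setdefault(u, 0)
        for v in succs:
            indeg[v] = indeg.get(v, 0) + 1
    queue = deque(sorted(u for u, d in indeg.items() if d == 0))
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj.get(u, []):
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(indeg):
        raise ValueError("graph contains a cycle; no topological order exists")
    return order
```

A T7 item would score a retrieved evidence path against the `dijkstra` optimum; a T8 item would check a returned ordering against any valid `toposort` result.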
| Property | LongMemEval | GraphMemBench v2 |
|---|---|---|
| Source | ICLR 2025 public dataset | Deterministic synthetic |
| Reproducible from seed | No (fixed corpus) | Yes (`--seed N`) |
| Question count | 500 fixed | 50–500 per tier (configurable) |
| Measures κ-topology isolation | No | Yes (topology on/off A/B) |
| Domain content | Realistic dialogue | Synthetic cycles + DAGs with distractors |
| Validation gate | 92.6% overall QA | T1–T6: ≥3pp κ-delta; T7–T8: algorithm baselines |
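To illustrate the "reproducible from seed" row, here is a minimal sketch of how a seeded tier generator might work; the function name, graph shape, and parameters are illustrative assumptions, not GraphMemBench's actual generator.

```python
import random

def make_cycle_tier_item(seed, n_cycle=5, n_distractors=10):
    """Build one synthetic item: a directed cycle of n_cycle fact nodes
    plus distractor nodes wired so they cannot create additional cycles.
    Same seed always yields an identical graph."""
    rng = random.Random(seed)  # local RNG: no global state, fully reproducible
    cycle = [f"c{i}" for i in range(n_cycle)]
    # the cycle itself: c0 -> c1 -> ... -> c4 -> c0
    edges = [(cycle[i], cycle[(i + 1) % n_cycle]) for i in range(n_cycle)]
    for i in range(n_distractors):
        # each distractor points INTO the cycle but receives no edges,
        # so the cycle count of the graph is unchanged
        edges.append((f"d{i}", rng.choice(cycle)))
    return edges, cycle
```

Because all randomness flows through `random.Random(seed)`, rerunning with the same `--seed N` reproduces the corpus exactly, which is what makes the topology on/off A/B comparison controlled.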