GraphMemBench v2

A κ-sensitive benchmark for cyclic memory

Eight deterministic synthetic tiers designed to isolate what κ-topology actually contributes to retrieval. T1–T6 test κ-sensitive cycle detection; T7–T8 benchmark graph algorithm quality (evidence paths, causal ordering). Each tier runs topology ON vs OFF side by side.

From κ=0 controls to graph algorithm benchmarks

T1–T2 are acyclic controls. T3–T6 increase in κ complexity and retrieval difficulty. T7–T8 benchmark graph algorithms: weighted shortest paths (Dijkstra) and causal DAG ordering (toposort).

T1

κ = 0
Linear chains

Acyclic control. Direct causal chains A→B→C→D with no loop-back. Topology should stay inert — no SCCs, no κ, no deliberate routing.

Δ topology on/off: 0pp
max κ observed: 0 / 0  ✓ correct

T2

κ = 0
Branching DAG

Acyclic control. Tree-shaped causal decomposition: root→{A,B}→{A1,A2,B1}. Tests that fan-out alone doesn't trigger false-positive κ routing.

Δ topology on/off: 0pp
max κ observed: 0 / 0  ✓ correct

T3

κ = 1
Simple cycles

Disjoint 3–5 node SCCs forming directed cycles A→B→C→A. Primary test for cycle-root paradox and membership-in-cycle retrieval.

Δ kappa_recall: +100pp
Δ routing_precision: +100pp
Δ cycle_root_accuracy: +100pp
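
The SCC detection this tier exercises is the textbook kind. A minimal Python sketch (illustrative only, not the project's Elixir implementation) using Kosaraju's algorithm, showing how a directed cycle A→B→C→A collapses into one SCC while a T1-style linear chain yields only singletons:

```python
from collections import defaultdict

def sccs(edges):
    """Kosaraju's algorithm: strongly connected components of a digraph."""
    graph, rgraph = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        graph[u].append(v)
        rgraph[v].append(u)
        nodes.update((u, v))

    order, seen = [], set()
    def dfs1(u):
        seen.add(u)
        for v in graph[u]:
            if v not in seen:
                dfs1(v)
        order.append(u)  # record post-order finish time
    for u in nodes:
        if u not in seen:
            dfs1(u)

    comp = {}
    def dfs2(u, root):
        comp[u] = root
        for v in rgraph[u]:
            if v not in comp:
                dfs2(v, root)
    for u in reversed(order):  # sweep reversed graph in decreasing finish time
        if u not in comp:
            dfs2(u, u)

    groups = defaultdict(set)
    for u, r in comp.items():
        groups[r].add(u)
    return list(groups.values())

# A 3-node cycle collapses into a single SCC; a linear chain (T1) does not.
cycle = sccs([("A", "B"), ("B", "C"), ("C", "A")])
chain = sccs([("A", "B"), ("B", "C"), ("C", "D")])
```

An acyclic control like T1 is exactly the case where every SCC is a singleton, which is why topology should stay inert there.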

T4

κ ≥ 2
Multi-SCC fault-line

Dense bidirectional ring + chord edges per SCC. Tests fault-line detection and high-κ routing. Configurable chord density scales κ upward.

Δ kappa_recall: +100pp
Δ faultline_mrr: positive signal
max κ observed: 2 / 0

T5

κ = 0–1
Adversarial contradiction

Paired old/new facts per subject, half joined by bidirectional contradicts (κ=1), half by one-way supersedes (κ=0). Tests belief revision under κ pressure.

Δ kappa_recall (κ=1 subset): +100pp
fresh_belief_top1_rate tracked
belief_revision_rank_mean tracked

T6

mixed κ
κ-discrimination

SCCs planted across a density ladder [1..5] — the retriever must distinguish κ values, not just detect that a cycle exists. New metrics: kappa_discrim_accuracy and kappa_discrim_mae.

Δ kappa_discrim_accuracy: +100pp
Δ kappa_discrim_mae: −2.0
max κ observed: 2 / 0
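
Assuming the two discrimination metrics follow their standard definitions (exact-match accuracy and mean absolute error over predicted κ values), they reduce to a few lines. A hedged sketch, with toy numbers that are not from the benchmark:

```python
def discrim_metrics(predicted, gold):
    """Exact-match accuracy and mean absolute error over per-SCC kappa estimates."""
    pairs = list(zip(predicted, gold))
    accuracy = sum(p == g for p, g in pairs) / len(pairs)
    mae = sum(abs(p - g) for p, g in pairs) / len(pairs)
    return accuracy, mae

# Toy example: the retriever nails 3 of 5 kappa values on a density ladder.
acc, mae = discrim_metrics([1, 2, 2, 4, 5], [1, 2, 3, 4, 4])
```

Note the two metrics fail differently: accuracy punishes any miss equally, while MAE rewards being off by one over being off by four, which is why both are tracked.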

T7

algorithm
Evidence path tracing

Weighted DAGs with known shortest paths (Dijkstra). Tests whether retrieval preserves evidence chain ordering and hop counts. Alternate paths probe multi-path awareness.

path_node_recall: 1.00
path_order_accuracy: 0.60
hop_count_mae: 4.00 (baseline)

T8

algorithm
Causal DAG ordering

Layered DAGs with known topological order and critical path depth. Tests causal sequencing (toposort), longest-path detection, and source/sink identification.

ordering_accuracy: 0.46 (baseline)
critical_depth_mae: 2.67
source_sink_recall: 1.00

T1–T6: topology on vs off

Every tier passes the ≥3pp κ-metric gate (or — for the κ=0 controls — correctly produces zero delta). Sanity mode, deterministic from --seed 42.

Tier | κ class | kappa_recall on | kappa_recall off | Δ | max κ on/off | p50 on/off | Gate
T1 | κ=0 linear | 0.00 | 0.00 | 0pp | 0 / 0 | 30ms / 32ms | ✓ correct
T2 | κ=0 branching | 0.00 | 0.00 | 0pp | 0 / 0 | 30ms / 29ms | ✓ correct
T3 | κ=1 cycle | 1.00 | 0.00 | +100pp | 1 / 0 | 10.2s / 32ms | ✓ pass
T4 | κ≥2 multi-SCC | 1.00 | 0.00 | +100pp | 2 / 0 | 43ms / 32ms | ✓ pass
T5 | κ=0–1 contradict | 1.00 | 0.00 | +100pp | 1 / 0 | 41ms / 39ms | ✓ pass
T6 | mixed κ | 1.00 | 0.00 | +100pp | 2 / 0 | 46ms / 38ms | ✓ pass

T7–T8: graph algorithm quality

These tiers are κ-independent — they benchmark whether the retriever preserves structural properties of the ingested graph. Topology on/off should not affect these metrics (and doesn't).

T7 · Evidence Path Tracing (Dijkstra)

Metric | Topology ON | Topology OFF | Δ
path_node_recall | 1.00 | 1.00 | 0
path_order_accuracy | 0.60 | 0.60 | 0
hop_count_mae | 4.00 | 4.00 | 0
alt_path_detected | 1.00 | 1.00 | 0
latency p50 / p95 | 32ms / 200ms | 35ms / 275ms | ~0

Interpretation. All gold shortest-path nodes are retrieved (recall = 1.0) and alternate paths are detected. Ordering accuracy is 0.60 — the retriever returns the right nodes but doesn't perfectly preserve Dijkstra ordering, as expected (similarity ranking ≠ shortest-path ranking). Hop count MAE of 4.0 reflects the same ordering gap: retrieved node count is correct, but inferred hops differ from the gold chain length.

Baseline established. These numbers set the floor for the trace_evidence_path MCP tool, which uses Dijkstra directly and should achieve MAE ≈ 0.
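
The gold paths behind these metrics come from standard Dijkstra. A minimal Python sketch (an illustration of the algorithm, not the project's trace_evidence_path code) showing how a gold shortest path and its hop count can be derived for comparison against retrieval order:

```python
import heapq

def dijkstra(graph, src, dst):
    """Shortest weighted path src -> dst.
    graph: {node: [(neighbor, weight), ...]}; assumes dst is reachable."""
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already relaxed via a shorter route
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    path, node = [], dst
    while node != src:  # walk predecessor links back to the source
        path.append(node)
        node = prev[node]
    path.append(src)
    return list(reversed(path)), dist[dst]

# Weighted DAG with a heavy direct edge and a cheaper multi-hop detour.
g = {"A": [("B", 1), ("C", 5)], "B": [("C", 1), ("D", 4)], "C": [("D", 1)]}
path, cost = dijkstra(g, "A", "D")
```

Here the gold chain is A→B→C→D at cost 3 (three hops), beating both the A→C→D and A→B→D alternatives; hop_count_mae compares hops inferred from retrieval against this gold chain length.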

T8 · Causal DAG Ordering (Toposort)

Metric | Topology ON | Topology OFF | Δ
ordering_accuracy | 0.46 | 0.46 | 0
critical_depth_mae | 2.67 | 2.67 | 0
source_sink_recall | 1.00 | 1.00 | 0
latency p50 / p95 | 34ms / 205ms | 33ms / 201ms | ~0

Interpretation. Source and sink nodes are perfectly retrieved (recall = 1.0). Ordering accuracy is 0.46 — roughly coin-flip — because similarity-based retrieval doesn't respect causal layer ordering. Critical depth MAE of 2.67 confirms the retriever sees nodes from some but not all layers.

Baseline established. These results quantify the gap that the DAG algorithms (Graphonomous.Algorithms.DAG) close: Kahn's toposort + longest-path DP should push ordering accuracy toward 1.0 and depth MAE toward 0.
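
The two algorithms named here compose naturally, since Kahn's algorithm visits nodes in an order that lets the longest-path DP run in a single pass. A hedged Python sketch of the combination (illustrative only, not Graphonomous.Algorithms.DAG itself):

```python
from collections import defaultdict, deque

def topo_order_and_depth(edges):
    """Kahn's toposort plus longest-path DP over a DAG.
    Returns (topological order, critical path depth in edges)."""
    succ, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))

    queue = deque(sorted(n for n in nodes if indeg[n] == 0))  # sources first
    order, depth = [], {n: 0 for n in nodes}
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            depth[v] = max(depth[v], depth[u] + 1)  # longest-path DP step
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order, max(depth.values())

# Two-layer diamond: root fans out, then reconverges at the sink.
order, critical_depth = topo_order_and_depth(
    [("root", "A"), ("root", "B"), ("A", "sink"), ("B", "sink")]
)
```

Every valid topological order places root before A/B and A/B before sink, which is the layer ordering that similarity-based retrieval alone gets roughly coin-flip right.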

All results: sanity mode, --seed 42, fallback embedder. T7/T8 are κ-independent — topology on/off produces identical algorithm metrics by design.

Scale the benchmark until it breaks

The point of a synthetic benchmark is to control the difficulty axis. The CLI flags below let you push Graphonomous along five orthogonal pressure points and watch where performance degrades.

1 · Density

--density N

T4 chord-edge density. 1 = bidirectional ring only, 5 = dense layered chords. Higher density increases κ and forces fault-line detection to work harder.

2 · Distractor flood

--distractors N

Injects N acyclic chains of noise alongside the gold SCCs. Beyond some threshold the similarity stage drops gold clusters from its top-K, collapsing kappa_recall.
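
The collapse mechanism is a plain top-K cutoff. A toy model (hypothetical scores, not benchmark data) showing how gold recall goes from perfect to zero once distractors outscore the gold cluster under a fixed retrieval limit:

```python
def topk_gold_recall(gold_scores, distractor_scores, k):
    """Fraction of gold items that survive a top-k cut over the merged ranking."""
    ranked = sorted(
        [(s, True) for s in gold_scores] + [(s, False) for s in distractor_scores],
        reverse=True,
    )
    kept = sum(is_gold for _, is_gold in ranked[:k])
    return kept / len(gold_scores)

# 5 gold SCC members against shallow vs. deep pools of noise chains.
recall_small = topk_gold_recall([0.9] * 5, [0.8] * 10, k=10)   # gold outranks noise
recall_flood = topk_gold_recall([0.7] * 5, [0.8] * 100, k=10)  # noise fills top-k
```

Topology can only amplify what the similarity stage surfaces, so once gold clusters fall out of the top-K, kappa_recall collapses regardless of the topology setting.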

3 · Retrieval squeeze

--similarity-limit N
--final-limit N
--expansion-hops N
--neighbors-per-node N

Shrink the retrieval budget that topology gets to work with. At similarity-limit=3, hops=0, T4 kappa_recall drops to 0.90 and faultline_mrr falls to 0.60.

4 · Mixed-κ tier (T6)

--tier 6

Discrimination, not detection. SCCs planted across a density ladder force the retriever to resolve the exact κ value. Tracks kappa_discrim_accuracy and MAE.

5 · Content homogenization

--homogenize true

Strips distinctive domain words and replaces them with system-N tokens. Similarity retrieval can no longer shortcut to the right cluster — topology has to carry the weight.

Run it locally

All tiers are deterministic and reproduce bit-for-bit from the same seed. Fixtures live in priv/graphmembench/fixtures/T{1..8}/.

# Single tier, sanity mode (5 SCCs, 10 questions)
$ mix benchmark.graphmembench --tier 3 --sanity --topology on

# Full-size run, 100 questions per tier
$ mix benchmark.graphmembench --tier 4 --density 3

# Squeeze the retrieval budget until topology breaks
$ mix benchmark.graphmembench --tier 4 --density 5 \
    --similarity-limit 3 --final-limit 5 --expansion-hops 0

# Homogenized content — similarity can't shortcut
$ mix benchmark.graphmembench --tier 6 --homogenize true \
    --distractors 100

# T7: Evidence path tracing (Dijkstra shortest paths)
$ mix benchmark.graphmembench --tier 7 --sanity --topology on

# T8: Causal DAG ordering (toposort + longest path)
$ mix benchmark.graphmembench --tier 8 --sanity --topology on

Additive-only design. GraphMemBench v2 is a harness, not a patch — it never modifies retriever.ex, topology.ex, or the LongMemEval path. Running it cannot affect production behavior. Results land in benchmark_results/graphmembench_T{N}_topology_{on,off}.json for offline comparison.