Agent memory that
improves itself

Most memory systems are key-value stores with embeddings. Graphonomous is a closed loop — outcomes change beliefs, beliefs change retrieval, and the graph gets sharper with every action. Open-source MCP server, works with any model.

v0.4.0 · Elixir/OTP · Apache 2.0

Get Graphonomous running in 60 seconds

Copy this into Claude Code, Codex, Cursor, or any MCP-capable agent. It installs the server, wires it up, and starts your first memory session.

## Install Graphonomous MCP Server

# 1. Install the npm package (includes platform-specific binary)
npm i -g graphonomous

# 2. Add to your MCP config (~/.mcp.json or project .mcp.json)
{
  "mcpServers": {
    "graphonomous": {
      "command": "npx",
      "args": ["-y", "graphonomous", "--db", "~/.graphonomous/knowledge.db"]
    }
  }
}

# 3. Restart your agent. Graphonomous is now your memory layer.
# Every session starts with retrieve → route → act → learn → consolidate.

Requirements: Node.js ≥ 18 · macOS or Linux · x64 or arm64
Works with: Claude Code · Codex · Cursor · Zed · any MCP client
Data stays local: SQLite DB at ~/.graphonomous/knowledge.db

The κ routing + deliberation + attention stack

The system analyzed a 4-node business cycle, routed reasoning depth automatically, and now supports both topology-aware deliberation and proactive attention cycles with model-tier adaptation.

DAG Region (κ = 0)

routing:    fast
max_kappa:  0
action:     Single-pass retrieval.
            No deliberation needed.

SCC Region (κ > 0)

routing:    deliberate
max_kappa:  1
scc_count:  1
fault_line: Product Quality → Market Share
budget:     max_iterations: 2, agents: 1,
            confidence: 0.75
MCP Tool: topology_analyze — Input: 4 business cycle nodes
{
  "routing": "deliberate",
  "max_kappa": 1,
  "scc_count": 1,
  "sccs": [{
    "id": "scc-0",
    "nodes": ["market-share", "revenue", "r-and-d", "product-quality"],
    "kappa": 1,
    "approximate": false,
    "fault_line_edges": [{
      "source": "product-quality",
      "target": "market-share"
    }],
    "routing": "deliberate",
    "deliberation_budget": {
      "max_iterations": 2,
      "agent_count": 1,
      "timeout_multiplier": 1.5,
      "confidence_threshold": 0.75
    }
  }],
  "dag_nodes": []
}

Live result from Graphonomous MCP server. The system detected a circular dependency between market share, revenue, R&D, and product quality — and identified the exact edge (Product Quality → Market Share) where the feedback loop is weakest. No other agent memory system does this.
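The topology check behind this routing decision can be sketched in a few lines. This is an illustrative Python version, not the server's Elixir implementation: it finds strongly connected components with Tarjan's algorithm and applies the simplest routing rule consistent with the payload above (any multi-node SCC routes to deliberation). It does not compute κ or fault-line edges.

```python
def tarjan_sccs(graph):
    """Tarjan's algorithm: strongly connected components of a digraph."""
    index, lowlink, on_stack = {}, {}, set()
    stack, sccs, counter = [], [], [0]

    def strongconnect(v):
        index[v] = lowlink[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:          # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop(); on_stack.discard(w); scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in list(graph):
        if v not in index:
            strongconnect(v)
    return sccs

# The 4-node business cycle from the payload above
graph = {
    "market-share":    ["revenue"],
    "revenue":         ["r-and-d"],
    "r-and-d":         ["product-quality"],
    "product-quality": ["market-share"],
}

sccs = [s for s in tarjan_sccs(graph) if len(s) > 1]
routing = "deliberate" if sccs else "fast"
print(routing, sorted(sccs[0]))
```

Break any one edge of the cycle and `sccs` becomes empty, which is exactly the DAG-region fast path shown above.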


Capabilities no other agent memory has

Every agent memory system retrieves context. These six capabilities go beyond retrieval — each one is demonstrated with real MCP payloads.

All 18 interactive demos with real MCP payloads →


A closed loop, not a pipeline

Most agent memory systems are pipelines: store → retrieve → done. Graphonomous is a closed loop. Outcomes feed back into beliefs, beliefs change confidence, confidence changes retrieval rankings, and the graph improves with every action — without retraining any model.

1. RETRIEVE: hybrid search + κ topology + causal_context IDs
2. ROUTE: κ = 0 → fast path; κ > 0 → deliberate
3. ACT: action + causal_parent_ids trace every decision
4. LEARN: learn_from_outcome, confidence ↑ or ↓
5. CONSOLIDATE: 7-stage sleep cycle (prune · merge · promote)

Every cycle improves the next.

Retrieve & Route

Hybrid search (embeddings + BM25 + cross-encoder reranking) returns ranked results. Every retrieval computes κ topology on the subgraph. κ = 0 takes the fast path; κ > 0 routes to deliberation. The attention engine decides: act now, learn more, or escalate.
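The fusion step can be sketched as score normalization plus a weighted sum. The 0.6/0.4 weights and the min-max normalization here are illustrative assumptions, not the server's actual formula, and the cross-encoder reranking stage is omitted:

```python
def minmax(scores):
    """Normalize a score dict to [0, 1] so different scales can be fused."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in scores.items()}

def fuse(embedding_scores, bm25_scores, w_embed=0.6, w_bm25=0.4):
    """Weighted fusion of two retrieval signals; a cross-encoder would
    then rerank the top-k of this list."""
    e, b = minmax(embedding_scores), minmax(bm25_scores)
    fused = {doc: w_embed * e[doc] + w_bm25 * b[doc] for doc in e}
    return sorted(fused.items(), key=lambda kv: -kv[1])

# Cosine similarities and raw BM25 scores for three candidate nodes
embed = {"node-a": 0.82, "node-b": 0.79, "node-c": 0.40}
bm25  = {"node-a": 2.1,  "node-b": 7.3,  "node-c": 1.0}
print(fuse(embed, bm25))
```

Note how node-b wins despite a slightly lower embedding score: its strong keyword match dominates after normalization, which is the behavior the BM25 leg is there to provide.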

Act & Learn

Every decision carries causal_parent_ids — the graph nodes that informed it. learn_from_outcome closes the loop: success boosts confidence on those nodes, failure reduces it. Next retrieval ranks results differently. No gradient descent, no weight updates.
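A minimal sketch of that update, assuming a simple move-toward-bound rule (the server's actual update formula is not specified here):

```python
def learn_from_outcome(nodes, causal_parent_ids, status, rate=0.1):
    """Nudge confidence on the causal parents of a decision: toward 1.0
    on success, toward 0.0 on failure. The rule keeps values in [0, 1].
    Sketch only; rate and rule shape are illustrative assumptions."""
    for node_id in causal_parent_ids:
        c = nodes[node_id]["confidence"]
        if status == "success":
            c += rate * (1.0 - c)   # claim part of the remaining headroom
        else:
            c -= rate * c           # decay proportionally on failure
        nodes[node_id]["confidence"] = c

nodes = {"pricing-belief": {"confidence": 0.8},
         "market-belief":  {"confidence": 0.5}}
learn_from_outcome(nodes, ["pricing-belief", "market-belief"], "failure")
print(nodes)
```

The point is that no model weights change: the next retrieval simply ranks these two nodes lower because their stored confidence dropped.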

Consolidate

During idle time, a 7-stage sleep cycle runs: decay confidence, prune weak nodes, strengthen co-activated edges, merge near-duplicates, and promote proven knowledge from fast to glacial memory. Then back to Retrieve — with updated confidence, cleaner topology, and re-prioritized goals.
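Three of those stages can be sketched as plain passes over the node set. Stage order, thresholds, and the tier names below are illustrative assumptions, not the server's actual parameters:

```python
def sleep_cycle(nodes, decay=0.02, prune_below=0.1, promote_above=0.9):
    """Sketch of three consolidation stages: decay confidence, prune weak
    nodes, promote proven nodes from 'fast' to 'glacial' memory."""
    for n in nodes.values():                        # 1. decay confidence
        n["confidence"] = max(0.0, n["confidence"] - decay)
    nodes = {k: n for k, n in nodes.items()         # 2. prune weak nodes
             if n["confidence"] >= prune_below}
    for n in nodes.values():                        # 3. promote proven nodes
        if n["confidence"] >= promote_above and n["tier"] == "fast":
            n["tier"] = "glacial"
    return nodes

nodes = {
    "stale":  {"confidence": 0.11, "tier": "fast"},
    "proven": {"confidence": 0.97, "tier": "fast"},
    "ok":     {"confidence": 0.60, "tier": "fast"},
}
nodes = sleep_cycle(nodes)
print(sorted(nodes))
```

After one cycle the stale node is gone, the proven node has moved to slow memory, and the middling node survives unchanged: exactly the "cleaner topology" the next Retrieve phase starts from.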


We used Graphonomous to benchmark Graphonomous

PRISM (Protocol for Rating Iterative System Memory) is a self-improving continual learning benchmark. It evaluates Graphonomous across 9 CL dimensions, then feeds the results back into the graph — so the system literally learns from its own evaluation. Here’s what happened over 6 cycles.

| Cycle | Score | Scenarios | Dimensions | What changed |
|-------|-------|-----------|------------|--------------|
| 1 | 0.10 | 5 | 2 / 9 | Cold start — graph full of episodic infra nodes, zero procedural knowledge |
| 2 | 0.76 | 5 | 6 / 9 | Seeded 9 procedural/semantic nodes (bootstrap, κ, corrections, benchmarks) |
| 3 | 0.95 | 8 | 9 / 9 | Full dimension coverage — added consolidation, forgetting, feedback scenarios |
| 4 | 0.99 | 8 | 9 / 9 | Proper methodology — independent L2 Sonnet judges, L3 Haiku meta-judge |
| 5 | dropped | 11 | 9 / 9 | Adversarial + cross-domain — BM25 keyword attack exposed ranking vulnerability |
| 5c | partial | retests | 7 | 3 code fixes (confidence-weighted BM25, batch normalization) + 4 bridge nodes |
| 6 | 0.45 | 15 | 6 | Generalization test — no new nodes, 2 new domains. Vocabulary gap exposed. |

The dual loop in action: PRISM composes scenarios, runs them against Graphonomous, judges the results across 9 CL dimensions, then reflects on what to test next. Meanwhile, Graphonomous stores every evaluation result as knowledge nodes — so each cycle starts with richer context than the last.

Cycle 5 was the breakthrough: adversarial nodes with low confidence (0.29) outranked correct nodes (0.80+) because BM25 keyword scoring wasn’t confidence-weighted. PRISM caught it. We fixed it. Cycle 5c confirmed the fix. That’s the loop working.

The 9 Continual Learning Dimensions

  • Stability — anti-forgetting: does old knowledge survive new learning? Weight: 0.20
  • Plasticity — new acquisition: can the system learn novel concepts quickly? Weight: 0.18
  • Knowledge Update — contradiction handling: does it revise beliefs when corrected? Weight: 0.15
  • Temporal Reasoning — time-aware retrieval: do recency and sequence matter? Weight: 0.12
  • Consolidation — abstraction: does it merge, prune, and promote knowledge? Weight: 0.10
  • Epistemic Awareness — uncertainty: does it know what it doesn’t know? Weight: 0.08
  • Cross-Domain Transfer — generalization: does code knowledge help with business queries? Weight: 0.07
  • Intentional Forgetting — controlled removal: soft-hide, hard-delete, GDPR erase. Weight: 0.05
  • Outcome Feedback — self-correction: does it learn from action results? Weight: 0.05
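Assuming the aggregate PRISM score is a plain weighted sum over these nine dimensions (an assumption; the actual aggregation rule is not stated here), it can be computed as:

```python
# Weights from the nine dimensions above; they sum to 1.0.
WEIGHTS = {
    "stability": 0.20, "plasticity": 0.18, "knowledge_update": 0.15,
    "temporal_reasoning": 0.12, "consolidation": 0.10,
    "epistemic_awareness": 0.08, "cross_domain_transfer": 0.07,
    "intentional_forgetting": 0.05, "outcome_feedback": 0.05,
}

def prism_score(dimension_scores):
    """Weighted sum over the 9 CL dimensions; uncovered dimensions score 0,
    which is why early cycles with partial coverage score so low."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(w * dimension_scores.get(d, 0.0) for d, w in WEIGHTS.items())

# e.g. perfect marks on only the three heaviest dimensions:
print(round(prism_score({"stability": 1.0, "plasticity": 1.0,
                         "knowledge_update": 1.0}), 2))
```

Under this rule, covering 2 of 9 dimensions caps the score well below 0.4 no matter how well those two go, consistent with the cycle-1 cold start in the table above.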

Two loops, one inside the other:

PRISM (outer):  compose → interact → observe → reflect → diagnose
                              ↓
            ┌──────────── Graphonomous (inner) ────────────┐
            │ retrieve → route → act → learn → consolidate │
            └──────────────────────────────────────────────┘

You can do this with your own repo. PRISM’s BYOR (Bring Your Own Repo) registers your git history as ground truth, auto-discovers CL events from commits, and generates evaluation scenarios. Run it in Claude Code, Codex, or any MCP-capable agent.


See what Graphonomous actually stores

This is a real knowledge graph — 6 node types, 17 edge types, and the workflows that connect them. Filter by scenario to follow the learning loop, watch κ-routing detect circular reasoning, or see belief revision propagate corrections through the graph. Hover any node for details. Drag to rearrange.



5 machines, 29 actions

Tool selection accuracy degrades past ~30 tools. Instead of 29 individual tools, Graphonomous v0.4 exposes 5 loop-phase machines — one per phase of the closed memory loop. Each machine dispatches via an action parameter.

retrieve → route → act → learn → consolidate
“What do I know?” → “What should I do?” → “Do it” → “Did it work?” → “Clean up”
| Machine | Actions | Description |
|---------|---------|-------------|
| retrieve | context, episodic, procedural, coverage, trace_evidence, frontier | κ-aware ranked retrieval, time-filtered episodes, procedural search, epistemic coverage, Dijkstra evidence paths, Wilson interval uncertainty |
| route | topology, deliberate, attention_survey, attention_cycle, review_goal | SCC/κ analysis, κ-driven deliberation, priority survey, triage → dispatch, coverage-driven gate |
| act | store_node, store_edge, delete_node, manage_edge, manage_goal, belief_revise, forget_node, forget_policy, gdpr_erase | All graph mutations: node/edge CRUD, goal lifecycle, AGM belief revision, soft/hard/cascade forgetting, GDPR erasure |
| learn | from_outcome, from_feedback, detect_novelty, from_interaction, contradictions | Causal confidence updates, feedback processing, novelty scoring, full ingestion pipeline, contradiction detection |
| consolidate | run, stats, query, traverse | 7-stage consolidation, aggregate statistics, operation-based inspection, BFS traversal |
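The dispatch pattern itself is simple. A sketch of one machine, using hypothetical handler names for illustration (the server's internal API is not shown here):

```python
# Stand-in handlers; real handlers would query the graph.
def handle_context(**params):
    return {"ranked": [], "kappa": 0}

def handle_frontier(**params):
    return {"frontier": []}

RETRIEVE_ACTIONS = {"context": handle_context, "frontier": handle_frontier}

def retrieve(action, **params):
    """One MCP tool, many behaviors: dispatch on the `action` parameter
    rather than exposing every behavior as its own tool."""
    handler = RETRIEVE_ACTIONS.get(action)
    if handler is None:
        raise ValueError(f"unknown retrieve action: {action!r}")
    return handler(**params)

print(retrieve("context", query="session context"))
```

This is what keeps the visible tool count at 5 while the action count stays at 29: the agent picks a phase first, then a verb within it, which sidesteps the ~30-tool selection-accuracy cliff.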

Dual-loop interlocking with PRISM: When PRISM (OS-009) benchmarks Graphonomous, both closed loops nest. PRISM’s 6 machines + Graphonomous’s 5 machines = 11 tools in a shared session, down from 76. The outer loop improves the benchmark. The inner loop improves the memory. Each makes the other sharper.


Get started in three prompts

Copy these into Claude Code, Codex, or any MCP-capable coding agent. Graphonomous runs as a local MCP server — your knowledge stays on your machine.

1. Bootstrap a Graphonomous session

Start a Graphonomous memory session for this repo.
1. retrieve(action: "context", query: "session context")
2. Check active goals: act(action: "manage_goal", goal_operation: "list_goals")
3. Survey attention: route(action: "attention_survey")
Then proceed with my task, storing durable knowledge as we go.

2. Run a PRISM benchmark cycle

Run a PRISM evaluation cycle against Graphonomous:
1. config(action: "register_system", name: "graphonomous", transport: "stdio")
2. compose(action: "byor_register", repo_url: ".", commit_range: "HEAD~20..HEAD")
3. compose(action: "byor_discover") to find CL events in commit history
4. compose(action: "scenarios") to generate evaluation scenarios
5. interact(action: "run") for each scenario
6. observe(action: "judge_transcript") with L2 dimension scoring
7. reflect(action: "analyze_gaps") to find weak dimensions
8. Store results: act(action: "store_node") with cycle outcomes

3. Close the dual loop

Use PRISM results to improve Graphonomous, then re-evaluate:
1. diagnose(action: "failure_patterns") to cluster weaknesses
2. diagnose(action: "suggest_fixes") for targeted improvements
3. Fix the code or seed bridge nodes to fill vocabulary gaps
4. interact(action: "run") to retest failing scenarios
5. observe(action: "judge_transcript") to re-score
6. learn(action: "from_outcome", status: "success|failure") to close the loop
7. consolidate(action: "run") to clean up the graph
Repeat. Each cycle makes both the benchmark and the memory sharper.

MCP config: Add this to your .mcp.json or IDE settings to connect both servers:

{
  "mcpServers": {
    "graphonomous": {
      "command": "path/to/graphonomous/scripts/graphonomous_mcp_wrapper.sh",
      "args": ["--db", "~/.graphonomous/knowledge.db"]
    },
    "prism": {
      "command": "path/to/PRISM/scripts/prism_mcp_wrapper.sh",
      "args": []
    }
  }
}

Features are easy. Integration is the moat.

Any system can add confidence scores. Any system can add a consolidation step. The difference is what happens when these features talk to each other.

In Graphonomous, a failed outcome reduces confidence on the causal parent nodes that informed the decision. Lower confidence changes the κ topology of that subgraph. Changed topology changes the routing decision on the next retrieval. The attention engine re-prioritizes goals based on the new coverage landscape. During idle time, consolidation prunes the weakened nodes and strengthens what worked. One outcome ripples through the entire system — retrieval, routing, attention, and memory lifecycle — without any component knowing about the others.
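The confidence-to-topology link in that chain can be made concrete with a toy cycle check. The threshold rule (ignore edges below a confidence floor) is an illustrative assumption, not the server's algorithm:

```python
def in_cycle(graph, node, min_conf):
    """Is `node` on a directed cycle when only edges at or above
    min_conf confidence are kept?"""
    seen, todo = set(), [w for w, c in graph[node] if c >= min_conf]
    while todo:
        v = todo.pop()
        if v == node:
            return True
        if v in seen:
            continue
        seen.add(v)
        todo.extend(w for w, c in graph.get(v, []) if c >= min_conf)
    return False

# The business cycle again, now with per-edge confidence.
graph = {"market-share":    [("revenue", 0.9)],
         "revenue":         [("r-and-d", 0.9)],
         "r-and-d":         [("product-quality", 0.9)],
         "product-quality": [("market-share", 0.35)]}

print(in_cycle(graph, "market-share", 0.3))    # True: SCC intact, deliberate

# A failed outcome weakens the fault-line edge below the floor:
graph["product-quality"] = [("market-share", 0.2)]
print(in_cycle(graph, "market-share", 0.3))    # False: cycle broken, fast path
```

One confidence update flips the routing decision for every node in the subgraph, with neither the learner nor the router knowing about the other.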

That causal chain breaks if any piece lives in a separate system. Bolted-on confidence scores don’t feed into topology analysis. A standalone consolidator can’t see which nodes were used in failed decisions. Separate goal tracking can’t query the graph’s epistemic coverage. The integrated architecture is the product.

Where others stop

Each of these systems does something well. None of them close the loop.

  • Letta (MemGPT) pioneered “continual learning in token space” with sleep-time consolidation and skill caching. But memory items have no confidence scores, no outcome-based feedback, and no topology awareness. Learning means rewriting text, not updating a causal graph.
  • Zep / Graphiti builds temporal knowledge graphs with sophisticated entity deduplication and a four-timestamp bi-timeline. Strong on “what changed when.” But no confidence tracking, no outcome learning, no topology analysis, and contradictions are resolved by temporal invalidation alone — new overwrites old.
  • Hindsight has an opinion network with updateable confidence — the closest competitor architecturally. But updates are linear alpha adjustments, not formal belief revision. No causal attribution chains, no topology routing, no consolidation cycles, and confidence exists only on opinions, not on facts or experiences.
  • Cognee extracts knowledge graphs with a behavioral critic that scores memory by downstream utility — a genuine innovation. But no per-node confidence, no causal feedback loop, and no topology-aware retrieval.
  • MemPalace achieves 96.6–100% retrieval accuracy with zero API calls — best-in-class for pure recall. But it’s storage and retrieval only. No learning loop, no confidence, no consolidation.

Graphonomous is the only system where confidence tracking, causal attribution, κ-routing, belief revision, sleep-cycle consolidation, multi-timescale memory, and goal-aware attention all operate on the same graph. MCP-native, works with any model, runs on SQLite at the edge.

GDPR-compliant forgetting is built in. Soft forget, cascade delete, policy-based pruning, and permanent audit-logged erasure (Article 17) — all in one tool surface.


Under the hood


Proved theory

The κ invariant is proved on 1,926,351 finite systems with zero counterexamples. The proof is browser-runnable at opensentience.org.

The theoretical foundations, deliberation protocol, attention engine, and governance model are published as open research protocols OS-001 through OS-008.

The first empirical evaluation (OS-E001) benchmarks the full engine on 18,165 files across 14 projects: 12,880 edges, 22 SCCs, κ=27, graph beats flat retrieval (+0.103 recall), 100% test pass rate across all 29 MCP tools. Raw data and reproduction scripts included.

GRAPHMEMBENCH v0.3.3 — 160/160 SCENARIOS

GraphMemBench is a 160-scenario capability validation suite across 20 categories, testing every continual learning capability from κ activation to GDPR forgetting, plus graph algorithm quality (Dijkstra evidence paths, toposort causal ordering). All scenarios pass with 455 tests and 0 failures.

Phase 1 (Foundation): Kappa Activation · Belief Revision · Conflict-Aware Consolidation · Two-Phase Retrieval · Intentional Forgetting
Phase 2 (Advanced): Uncertainty Propagation · Procedural Retrieval · Multi-Agent Prep · Integration Scenarios · Stress
Phase 3 (Causal): Causal Metadata · E2E Workflows · Regression Guards · Competitor Adapters · Reporting

Competitor adapter interface validates against 5 implementations: Graphonomous (live), Baseline, Mem0, Zep, and Hindsight stubs.

v0.3 CAPABILITIES

Belief Revision — AGM-style expand/revise/contract with automatic contradiction detection and confidence propagation through dependency graphs.
Intentional Forgetting — Soft, hard, and cascade modes plus hybrid LRU+priority-decay policy pruning and GDPR Article 17 erasure with audit trails.
Epistemic Frontier — Wilson score intervals identify where one more piece of evidence would most reduce uncertainty. Information gain ranking for research prioritization.
Causal Edge Metadata — causal_strength, confounders, and intervention_history on edges, updated automatically during outcome learning.
Hybrid Retrieval — nomic-embed-text-v1.5 (768d) + BM25 via SQLite FTS5 + cross-encoder reranking. Estimated +6–14pp SHR lift over v0.2.
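The Wilson score intervals mentioned under Epistemic Frontier have a standard closed form; a sketch of why interval width works as an uncertainty signal:

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%).
    Wide when evidence is thin, so width ranks where one more piece of
    evidence would most reduce uncertainty."""
    if trials == 0:
        return (0.0, 1.0)                      # no evidence: maximal uncertainty
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

lo_thin,  hi_thin  = wilson_interval(2, 3)     # same ratio, little evidence
lo_solid, hi_solid = wilson_interval(20, 30)   # same ratio, more evidence
print(round(hi_thin - lo_thin, 2), round(hi_solid - lo_solid, 2))  # 0.73 0.32
```

Both nodes have a 2/3 success rate, but the thinly evidenced one gets an interval more than twice as wide, so the frontier would send new evidence there first.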