OS-009 · a diagnostic loop in the [&] three-protocol stack

The benchmark refracts memory into nine colors.

Most memory benchmarks give you one number and a ranking. PRISM decomposes continual learning into nine CL dimensions, scores them against git-grounded truth, audits its own judges, and evolves scenarios as systems improve — so the leaderboard never stops measuring the frontier.


§00 · the name

PRISM

Protocol (for)
Rating
Iterative
System
Memory

Protocol for Rating Iterative System Memory. A diagnostic loop that rates how well a memory system iterates — not just what it remembers once, but how it stabilizes, acquires, revises, and consolidates over time.

§01 · premise

One number can't see what's actually learning.

Legacy benchmarks

Synthetic Q&A: authors write questions and expected answers.
Single pass: store once, retrieve once, score once.
One composite: one headline number for everything.
One judge: a single LLM call decides correctness.
Static bank: questions never evolve with the field.

PRISM

Observational: judges watch agents interact naturally.
Git-grounded: the code at a commit IS the answer.
Closed-loop: scenarios sequence S1→S2→S3 without reset.
Nine dimensions: stability, plasticity, κ-reasoning, and more.
Three judge layers: L1 transcript, L2 dimension judges, L3 meta-judges.
Self-evolving: gap analysis and IRT recalibration each cycle.

§02 · the nine colors

CL refracted into measurable dimensions.

A continual learning system is not one thing — it has to remember without forgetting, acquire without thrashing, revise without contradiction, abstract without over-generalizing, and know what it doesn't know. PRISM scores all nine.

01 · Stability · w 0.20 · Anti-forgetting under new input.
02 · Plasticity · w 0.18 · Acquisition of genuinely new facts.
03 · Knowledge Update · w 0.15 · Belief revision on contradiction.
04 · Temporal Reasoning · w 0.12 · Ordering, decay, recency.
05 · Consolidation · w 0.10 · Abstraction of episodes into rules.
06 · Epistemic Awareness · w 0.08 · Knowing what it doesn't know.
07 · Cross-Domain Transfer · w 0.07 · Reuse across unfamiliar contexts.
08 · Intentional Forgetting · w 0.05 · GDPR-grade removal that actually sticks.
09 · Outcome Feedback · w 0.05 · Confidence updates from real results.
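
The nine weights sum to 1.00, so one natural reading is a weighted mean over per-dimension scores. The sketch below assumes each dimension is scored in [0, 1] and that the weights combine linearly; the page lists the weights but does not state the aggregation rule, and the names `WEIGHTS` and `weighted_score` are hypothetical.

```python
# Weights as listed for the nine dimensions; the keys are hypothetical identifiers.
WEIGHTS = {
    "stability": 0.20,
    "plasticity": 0.18,
    "knowledge_update": 0.15,
    "temporal_reasoning": 0.12,
    "consolidation": 0.10,
    "epistemic_awareness": 0.08,
    "cross_domain_transfer": 0.07,
    "intentional_forgetting": 0.05,
    "outcome_feedback": 0.05,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores (assumed to lie in [0, 1])."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        # Refuse to aggregate a partial profile rather than silently score it as zero.
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
```

Under this reading, a system that is strong on Stability but weak on Intentional Forgetting still shows both facts in its per-dimension profile; the aggregate is a convenience, not a replacement for the nine scores.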

§03 · four phases, six machines

The diagnostic loop that evolves with the field.

Six loop-phase machines — down from 76 individual tools — interlock with Graphonomous and PULSE to form the [&] triple loop.

Compose
Anchor scenarios on real git commits. Each commit boundary is a probe: "what did the system know before, and what should it infer after?" The code at HEAD is the oracle.
Interact
A user simulator drives the scored system over MCP, following a scripted scenario tree. No leaking of expected answers — only the diffs and probes the simulator would reasonably issue.
Observe
Three layers of judges. L1 is the raw transcript. L2 scores per dimension with structured rubrics. L3 is a meta-judge from a different model family that audits L2 for bias and drift.
Reflect
Gap analysis finds dimensions where judges disagree or where difficulty curves have flattened. IRT re-estimates scenario difficulty. New scenarios are generated to cover the gaps.
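
To make the Reflect step concrete, here is a minimal sketch of re-estimating one scenario's difficulty. It assumes a one-parameter (Rasch) IRT model with system abilities held fixed; the page says only that IRT is used, not which model, so the model choice and the names `p_pass` and `reestimate_difficulty` are assumptions.

```python
import math


def p_pass(theta: float, b: float) -> float:
    """Rasch (1PL) probability that a system of ability theta passes a scenario of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def reestimate_difficulty(thetas, passed, b=0.0, iters=25):
    """Maximum-likelihood difficulty for one scenario via Newton's method, abilities held fixed."""
    if not any(passed) or all(passed):
        # With all passes or all failures the likelihood has no finite maximum.
        raise ValueError("need at least one pass and one fail for a finite estimate")
    for _ in range(iters):
        probs = [p_pass(t, b) for t in thetas]
        # Gradient of the log-likelihood with respect to b is sum(p_i - x_i).
        grad = sum(p - x for p, x in zip(probs, passed))
        # Fisher information: sum(p_i * (1 - p_i)), the negated second derivative.
        info = sum(p * (1.0 - p) for p in probs)
        if info < 1e-12:
            break
        b += grad / info  # Newton step
    return b
```

A scenario that every system passes (or every system fails) carries no difficulty signal under this model, which is one way a flattened difficulty curve could surface in gap analysis.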

§04 · triangulation

Three judges. Three layers. One audit trail.

L1 · Transcript

The raw interaction: every MCP call, every returned tool result, every probe. This is the observable evidence that everything else has to justify itself against.

L2 · Dimension judges

Nine structured rubrics, one per CL dimension. Each rubric cites spans of the transcript. No invisible reasoning — every score is traceable.

L3 · Meta-judges

A second model family audits L2. If L2 and L3 disagree, the judgment is flagged for human review and the leaderboard reflects the uncertainty.
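
The L2/L3 audit can be sketched as a simple disagreement check. This is an illustration only: it assumes both layers emit a numeric score per dimension and that "disagree" means the gap exceeds a tolerance. The `Judgment` type, the `flag_for_review` function, and the 0.1 default are hypothetical and do not come from the protocol.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    dimension: str
    l2_score: float  # score from the L2 dimension judge
    l3_score: float  # score the L3 meta-judge assigns on audit


def flag_for_review(judgments, tolerance=0.1):
    """Return the dimensions whose L2 and L3 scores differ by more than tolerance."""
    return [
        j.dimension
        for j in judgments
        if abs(j.l2_score - j.l3_score) > tolerance
    ]


audit = [
    Judgment("stability", l2_score=0.90, l3_score=0.85),
    Judgment("plasticity", l2_score=0.70, l3_score=0.40),
]
flagged = flag_for_review(audit)  # only "plasticity" exceeds the tolerance
```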

§05 · run it

Benchmark your memory system in one command.

PRISM ships as an Elixir/OTP release and an npm-installed MCP server. Register your system, import a scenario suite (or import one from BEAM), and run a cycle.

Single-binary Fly.io deploy — this website runs on the same release.
Bring-your-own-runner: any MCP-capable agent can be scored.
Nine dimensions × domain tags × loop-closure rate.
Open leaderboard API at /api/leaderboard.

# 1. install as an MCP server
$ npx -y os-prism --db ~/.prism/benchmarks.db

# 2. from a Claude / Codex / Zed session
→ /prism:bootstrap
→ /prism:configure register graphonomous
→ /prism:benchmark run --cycle 0001

# 3. inspect the leaderboard
$ curl https://prism-eval.fly.dev/api/leaderboard

§06 · by domain

One benchmark. Every team that ships memory.

PRISM scores any MCP-capable system the same way — compose a scenario suite, interact over the wire, observe with three-layer judging, reflect and recalibrate. Below, the same loop wears five jobs. Each is a bring-your-own-runner adapter: the runner is yours, the nine CL dimensions and the leaderboard are shared.

Eval & QA teams. Catch the forgetting regression before your users do. A nightly cycle re-runs retention and update-consistency scenarios, and any score drop pages the channel — grounded in the actual git diff, not vibes.

regression_runner.ex · Elixir / PRISM.Runner

cycle: nightly · judges: L1·L2·L3 · grounded: git diff
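
The nightly regression check reduces to comparing two cycles dimension by dimension. The sketch below assumes each cycle yields a mapping from dimension name to score; the function name `regressions` and the `min_drop` parameter are hypothetical, and how an alert is actually delivered is left out.

```python
def regressions(previous, current, min_drop=0.0):
    """Map each dimension whose score fell by more than min_drop to the size of the drop."""
    return {
        dim: previous[dim] - current[dim]
        for dim in previous
        if dim in current and previous[dim] - current[dim] > min_drop
    }


last_night = {"stability": 0.90, "plasticity": 0.80}
tonight = {"stability": 0.75, "plasticity": 0.85}
dropped = regressions(last_night, tonight)  # stability fell; plasticity improved
```

Setting `min_drop` above zero is one way to keep judge noise from paging anyone; the page itself describes alerting on any drop.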

Agent marketplaces. Rank every build on a level playing field. Import a public suite, run each agent under a fixed budget, publish to an open leaderboard — and the suite evolves harder as the field improves.

agent_runner.ex · Elixir / PRISM.Runner

cycle: per-release · dimensions: all nine · leaderboard: public

RAG & search vendors. Prove your retrieval layer recalls the right span — and abstains instead of hallucinating when it shouldn't guess. Adversarial distractors keep precision honest.

recall_runner.ex · Elixir / PRISM.Runner

cycle: per-index-build · adversarial: distractors · output: fix suggestions

Robotics & embodied AI. Does the robot's memory survive a re-map? Score spatial recall across episodes, backed by SCOPE spatial claims persisted in Graphonomous, and alert on catastrophic forgetting before the next deploy.

spatial_runner.ex · Elixir / PRISM.Runner

cycle: per-deploy · substrate: Graphonomous · region: graph_subgraph

Research labs. Reproducible ablations: freeze the suite, pin the seed, vary the architecture, compare by dimension. Meta-judges audit the judges, and tables export straight to your paper.

ablation_runner.ex · Elixir / PRISM.Runner

cycle: paper-2026 · scenarios: frozen + seeded · export: leaderboard CSV

Illustrative reference hosts. The PRISM.Runner DSL above is a teaching surface, not a published library — the real conformance contract is BYOR (bring-your-own-runner) over MCP, so a Python, TypeScript, or Rust system is scored identically. The nine CL dimensions, three-layer judging, git-grounded ground truth, and IRT calibration are the protocol.

§07 · open research

PRISM is one instrument in an open research program on machine cognition.

OpenSentience publishes the specifications — OS-001 through OS-010 — that turn agent cognition into something you can read, measure, and falsify. Structured memory. Deliberation topology. Continual learning. Temporal algebra. Every protocol ships as a public document, a reference implementation, and a test.

OS-006 ampersand protocol — structural composition
OS-008 agent harness — pipeline enforcement
OS-009 PRISM — diagnostic loop (you are here)
OS-010 PULSE — temporal loop algebra


opensentience.org — open research into machine cognition, published by Ampersand Box Design.

              
OS-001 · OS-002 · OS-003 · OS-004 · OS-005 · OS-006 · OS-007 · OS-008 · OS-009

ten protocols · one open sentience