PRISM
OS-009 · a diagnostic loop in the [&] three-protocol stack

The benchmark refracts memory into nine colors.

Most memory benchmarks give you one number and a ranking. PRISM decomposes continual learning into nine CL dimensions, scores them against git-grounded truth, audits its own judges, and evolves scenarios as systems improve — so the leaderboard never stops measuring the frontier.

§00 · the name

PRISM

  1. Protocol
  2. (for) Rating
  3. Iterative
  4. System
  5. Memory

Protocol for Rating Iterative System Memory. A diagnostic loop that rates how well a memory system iterates — not just what it remembers once, but how it stabilizes, acquires, revises, and consolidates over time.

§01 · premise

One number can't see what's actually learning.

Legacy benchmarks

  • Synthetic Q&A — authors write questions and expected answers.
  • Single pass — store once, retrieve once, score once.
  • One composite — one headline number for everything.
  • One judge — a single LLM call decides correctness.
  • Static bank — questions never evolve with the field.

PRISM

  • Observational — judges watch agents interact naturally.
  • Git-grounded — the code at a commit IS the answer.
  • Closed-loop — scenarios sequence S1→S2→S3 without reset.
  • Nine dimensions — stability, plasticity, temporal reasoning, and more.
  • Three judge layers — transcript, L2 dim judges, L3 meta-judges.
  • Self-evolving — gap analysis and IRT recalibration each cycle.

§02 · the nine colors

CL refracted into measurable dimensions.

A continual learning system is not one thing — it has to remember without forgetting, acquire without thrashing, revise without contradiction, abstract without over-generalizing, and know what it doesn't know. PRISM scores all nine.

  1. Stability (w = 0.20) · Anti-forgetting under new input.
  2. Plasticity (w = 0.18) · Acquisition of genuinely new facts.
  3. Knowledge Update (w = 0.15) · Belief revision on contradiction.
  4. Temporal Reasoning (w = 0.12) · Ordering, decay, recency.
  5. Consolidation (w = 0.10) · Abstraction of episodes into rules.
  6. Epistemic Awareness (w = 0.08) · Knowing what it doesn't know.
  7. Cross-Domain Transfer (w = 0.07) · Reuse across unfamiliar contexts.
  8. Intentional Forgetting (w = 0.05) · GDPR-grade removal that actually sticks.
  9. Outcome Feedback (w = 0.05) · Confidence updates from real results.

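The weights above sum to 1.00, which suggests a weighted aggregate published alongside the nine per-dimension scores. A minimal sketch in Python — the dictionary keys, function name, and the assumption that per-dimension scores are normalized to [0, 1] are illustrative, not PRISM's actual schema:

```python
# Hypothetical dimension keys; weights taken from the table above (sum to 1.00).
WEIGHTS = {
    "stability": 0.20,
    "plasticity": 0.18,
    "knowledge_update": 0.15,
    "temporal_reasoning": 0.12,
    "consolidation": 0.10,
    "epistemic_awareness": 0.08,
    "cross_domain_transfer": 0.07,
    "intentional_forgetting": 0.05,
    "outcome_feedback": 0.05,
}

def composite(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
```

A system that aces stability but fails everything else would score only 0.20 — the weighting keeps any single dimension from dominating the headline number.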
§03 · four phases, six machines

The diagnostic loop that evolves with the field.

Six loop-phase machines — down from 76 individual tools — interlock with Graphonomous and PULSE to form the [&] triple loop.

  1. Compose

    Anchor scenarios on real git commits. Each commit boundary is a probe: "what did the system know before, and what should it infer after?" The code at HEAD is the oracle.

  2. Interact

    A user simulator drives the scored system over MCP, following a scripted scenario tree. No leaking of expected answers — only the diffs and probes the simulator would reasonably issue.

  3. Observe

    Three layers of judges. L1 is the raw transcript. L2 scores per dimension with structured rubrics. L3 is a meta-judge from a different model family that audits L2 for bias and drift.

  4. Reflect

    Gap analysis finds dimensions where judges disagree or where difficulty curves have flattened. IRT re-estimates scenario difficulty. New scenarios are generated to cover the gaps.
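The IRT step in Reflect can be sketched with a one-parameter (Rasch) model: re-estimate each scenario's difficulty from the pass/fail outcomes of systems with known ability. This is a sketch under assumptions — PRISM's actual estimator, function names, and the gradient-ascent loop below are all illustrative:

```python
import math

def rasch_p(theta: float, b: float) -> float:
    """1PL model: probability a system of ability theta passes a scenario of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def reestimate_difficulty(b: float, outcomes: list[tuple[float, bool]],
                          lr: float = 0.1, steps: int = 200) -> float:
    """Maximum-likelihood update of difficulty b by gradient ascent.

    outcomes: (theta, passed) pairs from the last benchmark cycle.
    dL/db for the 1PL log-likelihood is sum(p - y), so ascent pushes b
    upward when systems fail more often than the model predicts.
    """
    for _ in range(steps):
        grad = sum(rasch_p(theta, b) - (1.0 if passed else 0.0)
                   for theta, passed in outcomes)
        b += lr * grad
    return b
```

A scenario every frontier system now passes drifts to a low b; the gap analysis can then retire it and generate harder probes, which is how the difficulty curve keeps tracking the field.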

§04 · triangulation

Three judges. Three layers. One audit trail.

L1 · Transcript

The raw interaction: every MCP call, every returned tool result, every probe. This is the observable evidence anything else has to justify itself against.

L2 · Dimension judges

Nine structured rubrics, one per CL dimension. Each rubric cites spans of the transcript. No invisible reasoning — every score is traceable.

L3 · Meta-judges

A second model family audits L2. If L2 and L3 disagree, the judgment is flagged for human review and the leaderboard reflects the uncertainty.
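A minimal sketch of that disagreement check, assuming scores in [0, 1] and a made-up tolerance — the document does not specify the real flagging rule:

```python
def triangulate(l2_score: float, l3_score: float, tol: float = 0.15) -> dict:
    """Compare a dimension judge (L2) against its meta-judge audit (L3).

    If the two diverge by more than tol, withhold the score and flag the
    judgment for human review so the leaderboard can surface the uncertainty.
    tol = 0.15 is a hypothetical threshold, not PRISM's actual value.
    """
    delta = abs(l2_score - l3_score)
    if delta > tol:
        return {"score": None, "flagged": True, "delta": delta}
    return {"score": (l2_score + l3_score) / 2, "flagged": False, "delta": delta}
```

The design choice worth noting: a flagged judgment yields no score at all rather than an averaged one, so disagreement is visible instead of being smoothed away.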

§05 · run it

Benchmark your memory system in one command.

PRISM ships as an Elixir/OTP release and an npm-installed MCP server. Register your system, import a scenario suite (or pull one in with import-from-BEAM), and run a cycle.

  • Single-binary Fly.io deploy — this website runs on the same release.
  • Bring-your-own-runner: any MCP-capable agent can be scored.
  • Nine dimensions × domain tags × loop-closure rate.
  • Open leaderboard API at /api/leaderboard.
# 1. install as an MCP server
$ npx -y os-prism --db ~/.prism/benchmarks.db

# 2. from a Claude / Codex / Zed session
 /prism:bootstrap
 /prism:configure register graphonomous
 /prism:benchmark run --cycle 0001

# 3. inspect the leaderboard
$ curl https://prism-eval.fly.dev/api/leaderboard
§06 · open research

PRISM is one instrument in an open research program on machine cognition.

OpenSentience publishes the specifications — OS-001 through OS-010 — that turn agent cognition into something you can read, measure, and falsify. Structured memory. Deliberation topology. Continual learning. Temporal algebra. Every protocol ships as a public document, a reference implementation, and a test.

  • OS-006 ampersand protocol — structural composition
  • OS-008 agent harness — pipeline enforcement
  • OS-009 PRISM — diagnostic loop (you are here)
  • OS-010 PULSE — temporal loop algebra

opensentience.org — open research into machine cognition, published by Ampersand Box Design.