OS-009 · a diagnostic loop in the [&] three-protocol stack
The benchmark refracts memory into nine colors.
Most memory benchmarks give you one number and a ranking.
PRISM decomposes continual learning into nine CL dimensions,
scores them against git-grounded truth, audits its own judges,
and evolves scenarios as systems improve — so the leaderboard
never stops measuring the frontier.
Protocol for Rating Iterative System Memory.
A diagnostic loop that rates how well a memory system
iterates — not just what it remembers once, but how it
stabilizes, acquires, revises, and consolidates over time.
§01 · premise
One number can't see what's actually learning.
Legacy benchmarks
Synthetic Q&A — authors write questions and expected answers.
Single pass — store once, retrieve once, score once.
One composite — one headline number for everything.
One judge — a single LLM call decides correctness.
Static bank — questions never evolve with the field.
PRISM
Git-grounded — the code at a commit IS the answer.
Closed-loop — scenarios sequence S1→S2→S3 without reset.
Nine dimensions — stability, plasticity, κ-reasoning, and more.
Three judge layers — transcript, L2 dim judges, L3 meta-judges.
Self-evolving — gap analysis and IRT recalibration each cycle.
§02 · the nine colors
CL refracted into measurable dimensions.
A continual learning system is not one thing — it has to remember without forgetting,
acquire without thrashing, revise without contradiction, abstract without over-generalizing,
and know what it doesn't know. PRISM scores all nine.
01 · Stability · w 0.20 · Anti-forgetting under new input.
02 · Plasticity · w 0.18 · Acquisition of genuinely new facts.
03 · Knowledge Update · w 0.15 · Belief revision on contradiction.
05 · Consolidation · w 0.10 · Abstraction of episodes into rules.
06 · Epistemic Awareness · w 0.08 · Knowing what it doesn't know.
07 · Cross-Domain Transfer · w 0.07 · Reuse across unfamiliar contexts.
08 · Intentional Forgetting · w 0.05 · GDPR-grade removal that actually sticks.
09 · Outcome Feedback · w 0.05 · Confidence updates from real results.
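The weights in the list above suggest that per-dimension scores can be rolled up alongside the dimension-level view. The following is a minimal sketch, assuming the roll-up is a weighted mean of scores in [0, 1]; the text does not state the actual aggregation rule. The names `WEIGHTS` and `weighted_score` are hypothetical, and dimension 04 is omitted because its name and weight do not appear in the list.

```python
# Weights copied from the dimension list above. Dimension 04 is not listed
# in this text, so it is left out rather than guessed.
WEIGHTS = {
    "stability": 0.20,
    "plasticity": 0.18,
    "knowledge_update": 0.15,
    "consolidation": 0.10,
    "epistemic_awareness": 0.08,
    "cross_domain_transfer": 0.07,
    "intentional_forgetting": 0.05,
    "outcome_feedback": 0.05,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores, each expected in [0, 1].

    Assumption: dimensions combine as a weighted mean, renormalized over
    the dimensions actually present in `scores`.
    """
    if not scores:
        raise ValueError("no dimension scores given")
    unknown = set(scores) - set(WEIGHTS)
    if unknown:
        raise KeyError(f"unknown dimensions: {sorted(unknown)}")
    total_weight = sum(WEIGHTS[dim] for dim in scores)
    return sum(WEIGHTS[dim] * value for dim, value in scores.items()) / total_weight
```

Renormalizing over the supplied dimensions keeps the result in [0, 1] even when a suite covers only some dimensions. The per-dimension scores remain the primary output; the roll-up is only a convenience.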
§03 · four phases, six machines
The diagnostic loop that evolves with the field.
Six loop-phase machines — down from 76 individual tools — interlock with Graphonomous and PULSE to form the [&] triple loop.
Compose
Anchor scenarios on real git commits. Each commit boundary is a
probe: "what did the system know before, and what should it
infer after?" The code at HEAD is the oracle.
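As a rough illustration of commit boundaries acting as probes, the sketch below pairs each commit with its successor. `CommitProbe` and `probes_from_history` are hypothetical names for illustration, not part of PRISM, and the sketch assumes the commit list is already ordered oldest first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitProbe:
    """One probe per commit boundary (hypothetical structure)."""
    before: str  # commit the scored system has already seen
    after: str   # commit whose code serves as the expected answer


def probes_from_history(commits: list[str]) -> list[CommitProbe]:
    """Pair each commit with its successor, oldest first."""
    return [CommitProbe(a, b) for a, b in zip(commits, commits[1:])]
```

An ordered list of this kind could come from `git rev-list --reverse HEAD`; a history of n commits yields n - 1 probes.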
Interact
A user simulator drives the scored system over MCP, following a
scripted scenario tree. No leaking of expected answers — only
the diffs and probes the simulator would reasonably issue.
Observe
Three layers of judges. L1 is the raw transcript. L2 scores per
dimension with structured rubrics. L3 is a meta-judge from a
different model family that audits L2 for bias and drift.
Reflect
Gap analysis finds dimensions where judges disagree or where
difficulty curves have flattened. IRT re-estimates scenario
difficulty. New scenarios are generated to cover the gaps.
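To make the IRT step in Reflect concrete, here is a minimal sketch of re-estimating one scenario's difficulty. It assumes a one-parameter (Rasch) model with binary pass/fail outcomes and fixed system abilities; the text does not say which IRT model PRISM uses, and the function name `rasch_difficulty` is hypothetical.

```python
import math


def rasch_difficulty(abilities, outcomes, iterations=25):
    """Maximum-likelihood difficulty b for one scenario under a 1PL model.

    Model (an assumption): P(pass) = 1 / (1 + exp(-(theta - b))),
    with each system's ability theta held fixed and outcomes in {0, 1}.
    """
    if len(abilities) != len(outcomes):
        raise ValueError("abilities and outcomes must have equal length")
    if len(set(outcomes)) < 2:
        # With all passes or all fails the estimate diverges.
        raise ValueError("need at least one pass and one fail")
    b = 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(theta - b))) for theta in abilities]
        gradient = sum(pi - x for pi, x in zip(p, outcomes))
        curvature = sum(pi * (1.0 - pi) for pi in p)
        b += gradient / curvature  # Newton step on the log-likelihood
    return b
```

When systems pass a scenario more often than the model predicts, the estimate moves down, which is one way a flattened difficulty curve could show up. Plain Newton steps are adequate for this illustration but are not guarded against overshoot on extreme data.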
§04 · triangulation
Three judges. Three layers. One audit trail.
L1 · Transcript
The raw interaction: every MCP call, every returned tool result, every probe. This is the observable evidence anything else has to justify itself against.
L2 · Dimension judges
Nine structured rubrics, one per CL dimension. Each rubric cites spans of the transcript. No invisible reasoning — every score is traceable.
L3 · Meta-judges
A second model family audits L2. If L2 and L3 disagree, the judgment is flagged for human review and the leaderboard reflects the uncertainty.
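The flag-on-disagreement rule can be sketched as follows. This assumes the L3 audit produces its own score per dimension on the same scale as L2, which the text does not specify; `flag_disagreements` and the `tolerance` threshold are hypothetical.

```python
def flag_disagreements(l2, l3, tolerance=0.1):
    """Return {dimension: gap} where L2 and L3 scores differ by more than tolerance.

    Only dimensions scored by both layers are compared.
    """
    flagged = {}
    for dim in sorted(l2.keys() & l3.keys()):
        gap = abs(l2[dim] - l3[dim])
        if gap > tolerance:
            flagged[dim] = gap
    return flagged
```

For example, with L2 scores of 0.9 for stability and 0.5 for plasticity and L3 scores of 0.6 and 0.55, only stability would be flagged for human review.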
§05 · run it
Benchmark your memory system in one command.
PRISM ships as an Elixir/OTP release and an npm-installed MCP
server. Register your system, import a scenario suite (directly or
from BEAM), and run a cycle.
Single-binary Fly.io deploy — this website runs on the same release.
Bring-your-own-runner: any MCP-capable agent can be scored.
Nine dimensions × domain tags × loop-closure rate.
Open leaderboard API at /api/leaderboard.
# 1. install as an MCP server
$ npx -y os-prism --db ~/.prism/benchmarks.db
# 2. from a Claude / Codex / Zed session
→ /prism:bootstrap
→ /prism:configure register graphonomous
→ /prism:benchmark run --cycle 0001
# 3. inspect the leaderboard
$ curl https://prism-eval.fly.dev/api/leaderboard
§06 · by domain
One benchmark. Every team that ships memory.
PRISM scores any MCP-capable system the same way — compose a
scenario suite, interact over the wire, observe with three-layer
judging, reflect and recalibrate. Below, the same loop wears five
jobs. Each is a bring-your-own-runner adapter: the runner is
yours, the nine CL dimensions and the leaderboard are shared.
Eval & QA teams. Catch the forgetting regression before
your users do. A nightly cycle re-runs retention and
update-consistency scenarios, and any score drop pages the
channel — grounded in the actual git diff, not vibes.
regression_runner.ex · Elixir / PRISM.Runner
cycle nightly · judges L1·L2·L3 · grounded git diff
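The nightly regression check described for Eval & QA teams amounts to comparing per-dimension scores between two cycles. A minimal sketch, assuming each cycle yields a mapping from dimension name to score; `score_drops` and `min_drop` are hypothetical, and the paging itself is left out.

```python
def score_drops(previous, current, min_drop=0.0):
    """Return {dimension: drop} for dimensions whose score fell by more than min_drop.

    Dimensions missing from either cycle are skipped rather than treated as drops.
    """
    return {
        dim: previous[dim] - current[dim]
        for dim in previous
        if dim in current and previous[dim] - current[dim] > min_drop
    }
```

With the default `min_drop=0.0`, any decrease is reported, matching "any score drop"; a team that sees noisy scores might raise the threshold.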
Agent marketplaces. Rank every build on a level playing
field. Import a public suite, run each agent under a fixed
budget, publish to an open leaderboard — and the suite
evolves harder as the field improves.
agent_runner.ex · Elixir / PRISM.Runner
cycle per-release · dimensions all nine · leaderboard public
RAG & search vendors. Prove your retrieval layer
recalls the right span — and abstains instead of
hallucinating when it shouldn't guess. Adversarial distractors
keep precision honest.
Robotics & embodied AI. Does the robot's memory
survive a re-map? Score spatial recall across episodes, backed
by SCOPE spatial claims persisted in Graphonomous, and alert on
catastrophic forgetting before the next deploy.
Research labs. Reproducible ablations: freeze the suite,
pin the seed, vary the architecture, compare by dimension.
Meta-judges audit the judges, and tables export straight to
your paper.
Illustrative reference hosts. The PRISM.Runner
DSL above is a teaching surface, not a published library — the real
conformance contract is BYOR (bring-your-own-runner) over MCP, so a
Python, TypeScript, or Rust system is scored identically. The nine
CL dimensions, three-layer judging, git-grounded ground truth, and
IRT calibration are the protocol.
§07 · open research
PRISM is one instrument in an
open research program on machine cognition.
OpenSentience publishes the specifications — OS-001 through OS-010 —
that turn agent cognition into something you can read,
measure, and falsify. Structured memory.
Deliberation topology. Continual learning. Temporal algebra. Every
protocol ships as a public document, a reference implementation, and
a test.