OS-009 · a diagnostic loop in the [&] three-protocol stack
The benchmark refracts memory into nine colors.
Most memory benchmarks give you one number and a ranking.
PRISM decomposes continual learning into nine CL dimensions,
scores them against git-grounded truth, audits its own judges,
and evolves scenarios as systems improve — so the leaderboard
never stops measuring the frontier.
Protocol for Rating Iterative System Memory.
A diagnostic loop that rates how well a memory system
iterates — not just what it remembers once, but how it
stabilizes, acquires, revises, and consolidates over time.
§01 · premise
One number can't see what's actually learning.
Legacy benchmarks
Synthetic Q&A — authors write questions and expected answers.
Single pass — store once, retrieve once, score once.
One composite — one headline number for everything.
One judge — a single LLM call decides correctness.
Static bank — questions never evolve with the field.
PRISM
Git-grounded — the code at a commit IS the answer.
Closed-loop — scenarios sequence S1→S2→S3 without reset.
Nine dimensions — stability, plasticity, κ-reasoning, and more.
Three judge layers — transcript, L2 dim judges, L3 meta-judges.
Self-evolving — gap analysis and IRT recalibration each cycle.
§02 · the nine colors
CL refracted into measurable dimensions.
A continual learning system is not one thing — it has to remember without forgetting,
acquire without thrashing, revise without contradiction, abstract without over-generalizing,
and know what it doesn't know. PRISM scores all nine.
01 · Stability · w 0.20 · Anti-forgetting under new input.
02 · Plasticity · w 0.18 · Acquisition of genuinely new facts.
03 · Knowledge Update · w 0.15 · Belief revision on contradiction.
05 · Consolidation · w 0.10 · Abstraction of episodes into rules.
06 · Epistemic Awareness · w 0.08 · Knowing what it doesn't know.
07 · Cross-Domain Transfer · w 0.07 · Reuse across unfamiliar contexts.
08 · Intentional Forgetting · w 0.05 · GDPR-grade removal that actually sticks.
09 · Outcome Feedback · w 0.05 · Confidence updates from real results.
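The per-dimension weights above imply a weighted aggregate alongside the individual scores. A minimal sketch, assuming scores in [0, 1] and normalizing over whichever dimensions are present (the dictionary keys are illustrative slugs, not PRISM's actual identifiers):

```python
# Illustrative aggregation over the published weights. Only the
# dimensions listed above are included; normalization guards against
# the weights not summing to exactly 1.
WEIGHTS = {
    "stability": 0.20,
    "plasticity": 0.18,
    "knowledge_update": 0.15,
    "consolidation": 0.10,
    "epistemic_awareness": 0.08,
    "cross_domain_transfer": 0.07,
    "intentional_forgetting": 0.05,
    "outcome_feedback": 0.05,
}

def composite(scores: dict[str, float]) -> float:
    """Weighted mean over the dimensions present in `scores` (each 0..1)."""
    total_w = sum(WEIGHTS[d] for d in scores)
    return sum(WEIGHTS[d] * s for d, s in scores.items()) / total_w
```

The per-dimension scores remain the primary output; the composite is a convenience, not the headline number legacy benchmarks reduce to.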
§03 · four phases, six machines
The diagnostic loop that evolves with the field.
Six loop-phase machines — down from 76 individual tools — interlock with Graphonomous and PULSE to form the [&] triple loop.
Compose
Anchor scenarios on real git commits. Each commit boundary is a
probe: "what did the system know before, and what should it
infer after?" The code at HEAD is the oracle.
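The commit-boundary probe can be sketched with plain `git show` — the `probe` helper and its field names below are illustrative, not PRISM's actual API:

```python
import subprocess

def code_at(repo: str, commit: str, path: str) -> str:
    """The file's content at a given commit -- the oracle for a probe."""
    return subprocess.run(
        ["git", "-C", repo, "show", f"{commit}:{path}"],
        capture_output=True, text=True, check=True,
    ).stdout

def probe(repo: str, before: str, after: str, path: str) -> dict:
    """Anchor a scenario on one commit boundary: what the system knew
    before the commit, and what the code at the later commit says it
    should be able to infer after."""
    return {
        "knew_before": code_at(repo, before, path),
        "oracle_after": code_at(repo, after, path),
    }
```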
Interact
A user simulator drives the scored system over MCP, following a
scripted scenario tree. No leaking of expected answers — only
the diffs and probes the simulator would reasonably issue.
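The simulator loop reduces to walking the scripted tree and recording what comes back. In this sketch `send` stands in for an MCP client call to the scored system (an assumption, not PRISM's interface); note that scenario nodes carry only the probe, never the expected answer:

```python
from typing import Callable, Optional

def run_scenario(tree: Optional[dict], send: Callable[[str], str]) -> list[dict]:
    """Drive the scored system through a scripted scenario tree.
    Only the probe crosses the wire; no expected answers leak."""
    transcript = []
    node = tree
    while node:
        reply = send(node["probe"])
        transcript.append({"probe": node["probe"], "reply": reply})
        node = node.get("next")  # S1 -> S2 -> S3, no reset between steps
    return transcript
```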
Observe
Three layers of judges. L1 is the raw transcript. L2 scores per
dimension with structured rubrics. L3 is a meta-judge from a
different model family that audits L2 for bias and drift.
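The traceability requirement on L2 can be made concrete: every dimension score must cite spans of the L1 transcript, so a score with no evidence is invalid by construction. A hypothetical record shape (field names assumed, not PRISM's schema):

```python
from dataclasses import dataclass

@dataclass
class RubricScore:
    dimension: str
    score: float                         # 0..1 per the dimension rubric
    cited_spans: list[tuple[int, int]]   # (start, end) offsets into the L1 transcript

    def traceable(self) -> bool:
        """No invisible reasoning: a score without cited spans is rejected."""
        return bool(self.cited_spans)
```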
Reflect
Gap analysis finds dimensions where judges disagree or where
difficulty curves have flattened. IRT re-estimates scenario
difficulty. New scenarios are generated to cover the gaps.
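The IRT step can be sketched with the simplest item-response model, a 1PL (Rasch) fit by gradient ascent on the log-likelihood — a minimal illustration of difficulty re-estimation, not PRISM's actual estimator:

```python
import math

def rasch_p(ability: float, difficulty: float) -> float:
    """1PL (Rasch) probability that a system at `ability` solves a scenario."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def reestimate_difficulty(difficulty, outcomes, lr=0.1, steps=200):
    """Gradient ascent on the Rasch log-likelihood over observed
    (ability, solved) pairs. dL/d(difficulty) = sum(p - y): if every
    system solves the scenario, its difficulty estimate falls."""
    for _ in range(steps):
        grad = sum(rasch_p(a, difficulty) - y for a, y in outcomes)
        difficulty += lr * grad
    return difficulty
```

A scenario whose re-estimated difficulty drifts far below the field's ability is exactly the flattened curve that gap analysis retires.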
§04 · triangulation
Three judges. Three layers. One audit trail.
L1Transcript
The raw interaction: every MCP call, every returned tool result, every probe. This is the observable evidence anything else has to justify itself against.
L2Dimension judges
Nine structured rubrics, one per CL dimension. Each rubric cites spans of the transcript. No invisible reasoning — every score is traceable.
L3Meta-judges
A second model family audits L2. If L2 and L3 disagree, the judgment is flagged for human review and the leaderboard reflects the uncertainty.
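The disagreement rule reduces to a threshold check. A minimal sketch, with the tolerance value and the averaging of agreeing scores both assumptions of this illustration rather than documented PRISM behavior:

```python
def triangulate(l2: float, l3: float, tol: float = 0.15) -> dict:
    """Compare an L2 dimension score against its L3 audit score.
    Divergence past `tol` is surfaced as uncertainty, not averaged away."""
    disagree = abs(l2 - l3) > tol
    return {
        "score": (l2 + l3) / 2,
        "flagged_for_review": disagree,  # human review; leaderboard shows uncertainty
    }
```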
§05 · run it
Benchmark your memory system in one command.
PRISM ships as an Elixir/OTP release and an npm-installed MCP
server. Register your system, import or import-from-BEAM a
scenario suite, and run a cycle.
Single-binary Fly.io deploy — this website runs on the same release.
Bring-your-own-runner: any MCP-capable agent can be scored.
Nine dimensions × domain tags × loop-closure rate.
Open leaderboard API at /api/leaderboard.
# 1. install as an MCP server
$ npx -y os-prism --db ~/.prism/benchmarks.db

# 2. from a Claude / Codex / Zed session
→ /prism:bootstrap
→ /prism:configure register graphonomous
→ /prism:benchmark run --cycle 0001

# 3. inspect the leaderboard
$ curl https://prism-eval.fly.dev/api/leaderboard
§06 · open research
PRISM is one instrument in an
open research program on machine cognition.
OpenSentience publishes the specifications — OS-001 through OS-010 —
that turn agent cognition into something you can read,
measure, and falsify. Structured memory.
Deliberation topology. Continual learning. Temporal algebra. Every
protocol ships as a public document, a reference implementation, and
a test.