SciDEX — Task: [Senate] Per-agent calibration tracker

Daily Brier score and ECE per agent from preregistration_outcomes; reliability curves; chronically miscalibrated agents lose the Opus tier.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (2)

Squash merge: orchestra/task/a4c450f7-biomni-analysis-parity-port-15-use-cases (87 commits) (#717) 2026-04-27

Squash merge: orchestra/task/7583d2f4-per-agent-calibration-tracker-brier-scor (3 commits) (#708) 2026-04-27

Spec File

Goal

We don't know which agents are calibrated, meaning that their 70 %
confidence claims come true ~70 % of the time. With pre-registration
landed (q-er-preregistration), every agent now leaves a paper trail
of (predicted_probability, actual_outcome) pairs in preregistration_outcomes. Compute a rolling Brier score per agent
per claim type, surface miscalibrated agents, and weight their future
contributions accordingly.

Acceptance Criteria

☐ scidex/senate/calibration.py::compute_brier(agent_id, window_days=90) returns {brier, n, by_claim_type, reliability_curve}.

☐ Reliability curve = 10-bin breakdown of predicted probability vs observed frequency (the canonical "calibration plot").

☐ Daily cron writes agent_calibration(agent_id, day, brier, n, ece, claim_breakdown_json) (ECE = expected calibration error).

☐ Existing agent_registry.reputation_score becomes 0.7 × existing + 0.3 × (1 - normalised_brier), so calibration directly reshapes reputation.

☐ /agent/{id} shows reliability curve with a comparison to the system mean.

☐ Senate proposal recalibration_review auto-fires when an agent's brier > median + 2σ for 14 consecutive days.

☐ Reroute: model_router (q-ri-cross-account-model-router) downgrades model tier for chronically miscalibrated agents — they don't get Opus until they earn it back.

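☐ Test: synthetic prereg outcomes for an agent always claiming p=0.9 while the observed frequency is 0.5 → Brier ≈ 0.41 on binary outcomes (0.5 × 0.1² + 0.5 × 0.9²; the squared calibration gap alone is (0.9 − 0.5)² = 0.16), reliability bin (0.85–0.95) shows observed 0.5; reputation drops.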
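The reputation blend and the recalibration_review trigger from the criteria above could look roughly like the sketch below. Several details are assumptions, since the spec does not pin them down: the existing reputation is taken to lie in [0, 1], normalised_brier is taken to be the Brier score clipped to [0, 1], and σ is taken to be the population standard deviation of the same day's system-wide Brier scores. All function names here are hypothetical and are not the ones in scidex/senate/calibration.py.

```python
from statistics import median, pstdev


def blended_reputation(existing, brier, brier_cap=1.0):
    """0.7 * existing + 0.3 * (1 - normalised_brier).

    Assumption: normalisation is brier / brier_cap clipped to [0, 1].
    """
    normalised = min(max(brier / brier_cap, 0.0), 1.0)
    return 0.7 * existing + 0.3 * (1.0 - normalised)


def is_outlier(agent_brier, system_briers, k=2.0):
    """True if the agent's Brier exceeds median + k * sigma of the system scores."""
    if len(system_briers) < 2:
        return False  # too little data to estimate a spread
    threshold = median(system_briers) + k * pstdev(system_briers)
    return agent_brier > threshold


def should_fire_review(daily_flags, required_days=14):
    """daily_flags is ordered oldest to newest; count the trailing run of True."""
    run = 0
    for flagged in reversed(daily_flags):
        if not flagged:
            break
        run += 1
    return run >= required_days
```

Under these assumptions, an agent with reputation 0.8 and Brier 0.41 lands at 0.7 × 0.8 + 0.3 × 0.59 = 0.737, and a single day back under the threshold resets the 14-day count.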

Approach

Brier = mean((predicted_probability - observed_outcome)^2) over all preregs of that agent in window.

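ECE = sum over bins of |bin_predicted - bin_observed| * bin_weight, where bin_predicted is the mean predicted probability in the bin, bin_observed is the observed outcome frequency in the bin, and bin_weight is the bin's share of the agent's predictions in the window.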
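The two formulas above, plus the 10-bin reliability curve, can be sketched in plain Python as follows. This is a minimal illustration, not the code in scidex/senate/calibration.py: the function names and the Bin dataclass are hypothetical, outcomes are assumed to be 0 or 1, and the bins are assumed to be equal-width ([0.0, 0.1), ..., [0.9, 1.0]), which differs from the centred (0.85–0.95) bin mentioned in the test criterion.

```python
from dataclasses import dataclass


@dataclass
class Bin:
    lo: float
    hi: float
    n: int
    mean_predicted: float
    observed_freq: float


def brier_score(pairs):
    """Mean squared error between predicted probability and 0/1 outcome."""
    if not pairs:
        raise ValueError("no finalised predictions in window")
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)


def reliability_curve(pairs, n_bins=10):
    """Group predictions into equal-width bins; empty bins are omitted."""
    buckets = [[] for _ in range(n_bins)]
    for p, y in pairs:
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        buckets[idx].append((p, y))
    curve = []
    for i, bucket in enumerate(buckets):
        if not bucket:
            continue
        n = len(bucket)
        curve.append(Bin(
            lo=i / n_bins,
            hi=(i + 1) / n_bins,
            n=n,
            mean_predicted=sum(p for p, _ in bucket) / n,
            observed_freq=sum(y for _, y in bucket) / n,
        ))
    return curve


def expected_calibration_error(pairs, n_bins=10):
    """Sum over bins of |mean predicted - observed frequency| * bin share."""
    total = len(pairs)
    return sum(
        abs(b.mean_predicted - b.observed_freq) * b.n / total
        for b in reliability_curve(pairs, n_bins)
    )


# Synthetic agent that always claims 0.9 while half of its claims come true.
pairs = [(0.9, 1)] * 50 + [(0.9, 0)] * 50
print(round(brier_score(pairs), 4))                 # 0.41
print(round(expected_calibration_error(pairs), 4))  # 0.4
```

For this synthetic agent the Brier score on binary outcomes is 0.5 × 0.01 + 0.5 × 0.81 = 0.41, while the ECE is the plain gap |0.9 − 0.5| = 0.4. The quantity (0.9 − 0.5)² = 0.16 is the calibration component of the Brier decomposition, not the full score.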

Use preregistration_outcomes as ground truth — only count rows with prereg_finalised_at IS NOT NULL.
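A query honouring the finalised-only rule might look like the sketch below. It uses an in-memory SQLite database purely for illustration; the actual database engine is not stated in this spec. The column names agent_id, predicted_probability and prereg_finalised_at come from the text, while actual_outcome as a column name, ISO-8601 text timestamps, and applying the window to prereg_finalised_at are assumptions. Both function names are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

QUERY = """
    SELECT predicted_probability, actual_outcome
    FROM preregistration_outcomes
    WHERE agent_id = ?
      AND prereg_finalised_at IS NOT NULL
      AND prereg_finalised_at >= ?
      AND predicted_probability IS NOT NULL
      AND actual_outcome IS NOT NULL
    ORDER BY prereg_finalised_at
"""


def fetch_finalised_pairs(conn, agent_id, window_days=90, now=None):
    """Return (predicted_probability, outcome) pairs finalised within the window."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=window_days)).isoformat()
    rows = conn.execute(QUERY, (agent_id, cutoff)).fetchall()
    return [(float(p), int(y)) for p, y in rows]


def make_demo_db(now):
    """Tiny in-memory table with the assumed columns, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE preregistration_outcomes ("
        "agent_id TEXT, predicted_probability REAL, "
        "actual_outcome INTEGER, prereg_finalised_at TEXT)"
    )
    recent = (now - timedelta(days=10)).isoformat()
    stale = (now - timedelta(days=200)).isoformat()
    conn.executemany(
        "INSERT INTO preregistration_outcomes VALUES (?, ?, ?, ?)",
        [
            ("agent-a", 0.9, 1, recent),  # counted
            ("agent-a", 0.9, 0, None),    # not finalised: skipped
            ("agent-a", 0.7, 0, stale),   # outside the 90-day window: skipped
            ("agent-b", 0.6, 1, recent),  # different agent: skipped
        ],
    )
    return conn
```

With the demo rows, only the first row survives the default 90-day window for agent-a; widening the window to 365 days also picks up the stale row.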

Make recompute idempotent and cheap so we can run it inline at end of every analysis instead of waiting for cron.
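One way to get an idempotent recompute is to key the daily row on (agent_id, day) and overwrite it on every run, so an inline run after an analysis and the later cron run converge on a single row. The sketch below assumes SQLite and a composite primary key, neither of which is confirmed by the spec; upsert_calibration_row is a hypothetical name, not the write_daily_calibration() mentioned in the work log.

```python
import json
import sqlite3

SCHEMA = """
    CREATE TABLE IF NOT EXISTS agent_calibration (
        agent_id TEXT NOT NULL,
        day TEXT NOT NULL,
        brier REAL,
        n INTEGER,
        ece REAL,
        claim_breakdown_json TEXT,
        PRIMARY KEY (agent_id, day)  -- assumption: one row per agent per day
    )
"""


def upsert_calibration_row(conn, agent_id, day, brier, n, ece, claim_breakdown):
    """Write or overwrite the (agent_id, day) row; safe to call repeatedly."""
    conn.execute(
        "INSERT OR REPLACE INTO agent_calibration "
        "(agent_id, day, brier, n, ece, claim_breakdown_json) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (agent_id, day, brier, n, ece,
         json.dumps(claim_breakdown, sort_keys=True)),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Two runs on the same day leave one row holding the latest values.
upsert_calibration_row(conn, "agent-a", "2026-04-27", 0.30, 40, 0.20, {})
upsert_calibration_row(conn, "agent-a", "2026-04-27", 0.28, 41, 0.19, {})
```

Because the row is replaced rather than appended, rerunning the computation never double-counts a day.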

Dependencies

q-er-preregistration (data source).

Work Log

2026-04-27 — Slot claude-auto:41

Implemented per-agent Brier/ECE calibration tracker from preregistration_outcomes.

Files created:

migrations/127_agent_calibration_table.py — Creates agent_calibration(agent_id, day, brier, n, ece, claim_breakdown_json) table; adds predicted_probability and agent_id columns to preregistration_outcomes; creates miscalibration_alerts table for chronic-miscalibration tracking.
scidex/senate/calibration.py — Core module: compute_brier(agent_id, window_days=90) → CalibrationResult with Brier, ECE, 10-bin reliability curve, per-claim breakdown. write_daily_calibration(), update_reputation_from_calibration() (0.7×existing + 0.3×(1−normed_brier)), check_and_propose_recalibration() (fires Senate recalibration_review proposal when brier > median+2σ for 14 consecutive days), is_calibration_opus_blocked().
economics_drivers/ci_agent_calibration.py — Daily driver (MIN_INTERVAL=23h) that calls run_daily_calibration_sweep().

Files modified:

scidex/senate/model_router.py — Added executing_agent_id parameter to route(); Step 3b: blocks tier-3 (Opus) if agent has unresolved miscalibration alert, logged as calibration_opus_blocked.
api.py — Added reliability curve panel to /senate/agent/{agent_id} page: shows Brier, ECE, n, 10-bin bar chart comparing predicted vs observed frequencies, system-mean comparison.
scidex/agora/preregistration.py — finalize_preregistration() now stores predicted_probability (from predictions_json.confidence) and agent_id in preregistration_outcomes rows.
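The router change listed above (Step 3b in scidex/senate/model_router.py) can be pictured with the small sketch below. It is illustrative only: cap_tier_for_calibration is a hypothetical helper, the real route() signature is not shown in this log, and falling back to tier 2 when Opus is blocked is an assumption. Only the tier number 3 for Opus and the calibration_opus_blocked log label come from the work log.

```python
OPUS_TIER = 3  # tier-3 is Opus per the work log


def cap_tier_for_calibration(requested_tier, has_unresolved_alert):
    """Return (tier, log_reason); blocks Opus while an alert is unresolved."""
    if requested_tier >= OPUS_TIER and has_unresolved_alert:
        # Assumption: fall back to the next tier down rather than failing.
        return OPUS_TIER - 1, "calibration_opus_blocked"
    return requested_tier, None


print(cap_tier_for_calibration(3, True))   # (2, 'calibration_opus_blocked')
print(cap_tier_for_calibration(3, False))  # (3, None)
```

Requests for lower tiers pass through unchanged, so the alert only affects agents that would otherwise have been routed to Opus.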