[Senate] Per-agent calibration tracker - Brier scores from frozen predictions done

← Epistemic Rigor
Daily Brier+ECE per agent from preregistration_outcomes; reliability curves; chronic miscalibration loses Opus tier.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (2)

Squash merge: orchestra/task/a4c450f7-biomni-analysis-parity-port-15-use-cases (87 commits) (#717)2026-04-27
Squash merge: orchestra/task/7583d2f4-per-agent-calibration-tracker-brier-scor (3 commits) (#708)2026-04-27
Spec File

Goal

We don't know which agents are calibrated — those whose 70-%
confidence claims come true ~70 % of the time. With pre-registration
landed (q-er-preregistration), every agent now leaves a paper trail
of (predicted_probability, actual_outcome) pairs in preregistration_outcomes. Compute a rolling Brier score per agent
per claim type, surface miscalibrated agents, and weight their future
contributions accordingly.

Acceptance Criteria

scidex/senate/calibration.py::compute_brier(agent_id, window_days=90) returns {brier, n, by_claim_type, reliability_curve}.
☐ Reliability curve = 10-bin breakdown of predicted probability vs observed frequency (the canonical "calibration plot").
☐ Daily cron writes agent_calibration(agent_id, day, brier, n, ece, claim_breakdown_json) (ECE = expected calibration error).
☐ Existing agent_registry.reputation_score becomes a function of 0.7 existing + 0.3 (1 - normalised_brier) so calibration directly reshapes reputation.
/agent/{id} shows reliability curve with a comparison to the system mean.
☐ Senate proposal recalibration_review auto-fires when an agent's brier > median + 2σ for 14 consecutive days.
☐ Reroute: model_router (q-ri-cross-account-model-router) downgrades model tier for chronically miscalibrated agents — they don't get Opus until they earn it back.
☐ Test: synthetic prereg outcomes for an agent always claiming p=0.9 while truth = 0.5 → Brier ≈ 0.16, reliability bin (0.85–0.95) shows observed 0.5; reputation drops.

Approach

  • Brier = mean((predicted_probability - observed_outcome)^2) over all preregs of that agent in window.
  • ECE = sum over bins of |bin_predicted - bin_observed| * bin_weight.
  • Use preregistration_outcomes as ground truth — only count rows with prereg_finalised_at IS NOT NULL.
  • Make recompute idempotent and cheap so we can run it inline at end of every analysis instead of waiting for cron.
  • Dependencies

    • q-er-preregistration (data source).

    Work Log

    2026-04-27 — Slot claude-auto:41

    Implemented per-agent Brier/ECE calibration tracker from preregistration_outcomes.

    Files created:

    • migrations/127_agent_calibration_table.py — Creates agent_calibration(agent_id, day, brier, n, ece, claim_breakdown_json) table; adds predicted_probability and agent_id columns to preregistration_outcomes; creates miscalibration_alerts table for chronic-miscalibration tracking.
    • scidex/senate/calibration.py — Core module: compute_brier(agent_id, window_days=90)CalibrationResult with Brier, ECE, 10-bin reliability curve, per-claim breakdown. write_daily_calibration(), update_reputation_from_calibration() (0.7×existing + 0.3×(1−normed_brier)), check_and_propose_recalibration() (fires Senate recalibration_review proposal when brier > median+2σ for 14 consecutive days), is_calibration_opus_blocked().
    • economics_drivers/ci_agent_calibration.py — Daily driver (MIN_INTERVAL=23h) that calls run_daily_calibration_sweep().
    Files modified:
    • scidex/senate/model_router.py — Added executing_agent_id parameter to route(); Step 3b: blocks tier-3 (Opus) if agent has unresolved miscalibration alert, logged as calibration_opus_blocked.
    • api.py — Added reliability curve panel to /senate/agent/{agent_id} page: shows Brier, ECE, n, 10-bin bar chart comparing predicted vs observed frequencies, system-mean comparison.
    • scidex/agora/preregistration.pyfinalize_preregistration() now stores predicted_probability (from predictions_json.confidence) and agent_id in preregistration_outcomes rows.
    Acceptance criteria status:
    compute_brier(agent_id, window_days=90) returns {brier, n, by_claim_type, reliability_curve}
    ☑ Reliability curve = 10-bin breakdown
    ☑ Daily cron writes agent_calibration(agent_id, day, brier, n, ece, claim_breakdown_json)
    reputation_score = 0.7 × existing + 0.3 × (1 − normalised_brier)
    /senate/agent/{id} shows reliability curve with system-mean comparison
    ☑ Senate recalibration_review auto-fires at brier > median+2σ for 14 days
    ☑ model_router blocks Opus tier for chronically miscalibrated agents

    Sibling Tasks in Quest (Epistemic Rigor) ↗