We don't know which agents are calibrated, meaning those whose 70%
confidence claims come true ~70% of the time. With pre-registration
landed (q-er-preregistration), every agent now leaves a paper trail
of (predicted_probability, actual_outcome) pairs in
preregistration_outcomes. Compute a rolling Brier score per agent
per claim type, surface miscalibrated agents, and weight their future
contributions accordingly.
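
As a rough illustration of the rolling Brier score described above, the sketch below computes a windowed Brier score, a 10-bin reliability curve, and the expected calibration error (ECE, the bin-weighted gap between predicted and observed frequency) from in-memory rows. It is not the code in scidex/senate/calibration.py. The function name `compute_brier_sketch`, the dict-shaped rows, the `claim_type` key, and the equal-width bins [k/10, (k+1)/10) are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta


def in_window(rows, today, window_days=90):
    """Keep finalised rows whose finalisation date falls inside the window."""
    cutoff = today - timedelta(days=window_days)
    return [
        r for r in rows
        if r["prereg_finalised_at"] is not None and r["prereg_finalised_at"] >= cutoff
    ]


def brier(pairs):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)


def reliability_curve(pairs, n_bins=10):
    """Group predictions into equal-width bins (an assumed binning scheme)."""
    bins = defaultdict(list)
    for p, y in pairs:
        # p == 1.0 is clamped into the top bin.
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    curve = []
    for k in sorted(bins):
        members = bins[k]
        curve.append({
            "bin": (k / n_bins, (k + 1) / n_bins),
            "weight": len(members) / len(pairs),
            "predicted": sum(p for p, _ in members) / len(members),
            "observed": sum(y for _, y in members) / len(members),
        })
    return curve


def ece(pairs, n_bins=10):
    """Sum over bins of |bin_predicted - bin_observed| * bin_weight."""
    return sum(
        b["weight"] * abs(b["predicted"] - b["observed"])
        for b in reliability_curve(pairs, n_bins)
    )


def compute_brier_sketch(rows, today, window_days=90):
    """Hypothetical stand-in for compute_brier(), operating on a list of dicts."""
    rows = in_window(rows, today, window_days)
    if not rows:
        return {"brier": None, "n": 0, "ece": None,
                "by_claim_type": {}, "reliability_curve": []}
    pairs = [(r["predicted_probability"], r["actual_outcome"]) for r in rows]
    by_type = defaultdict(list)
    for r in rows:
        by_type[r["claim_type"]].append(
            (r["predicted_probability"], r["actual_outcome"]))
    return {
        "brier": brier(pairs),
        "n": len(pairs),
        "ece": ece(pairs),
        "by_claim_type": {t: brier(ps) for t, ps in by_type.items()},
        "reliability_curve": reliability_curve(pairs),
    }


# An agent that always says 0.9 but is right only half the time.
today = date(2025, 1, 1)
rows = [
    {"prereg_finalised_at": today - timedelta(days=i % 60), "claim_type": "causal",
     "predicted_probability": 0.9, "actual_outcome": i % 2}
    for i in range(100)
]
result = compute_brier_sketch(rows, today)
```

For this synthetic agent the Brier score comes out at about 0.41 and the ECE at about 0.4: half the rows contribute (0.9 − 1)² = 0.01 and half contribute (0.9 − 0)² = 0.81, and the single occupied bin has a 0.4 gap between predicted and observed frequency.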

Acceptance criteria:
- scidex/senate/calibration.py::compute_brier(agent_id, window_days=90) returns {brier, n, by_claim_type, reliability_curve}.
- Table agent_calibration(agent_id, day, brier, n, ece, claim_breakdown_json) (ECE = expected calibration error).
- agent_registry.reputation_score becomes 0.7 × existing + 0.3 × (1 − normalised_brier), so calibration directly reshapes reputation.
- /agent/{id} shows the agent's reliability curve with a comparison to the system mean.
- A Senate recalibration_review proposal auto-fires when an agent's brier > median + 2σ for 14 consecutive days.
- The model router (q-ri-cross-account-model-router) downgrades the model tier for chronically miscalibrated agents; they don't get Opus until they earn it back.
- p=0.9 while truth = 0.5 → squared calibration gap (0.9 − 0.5)² = 0.16 (if outcomes are binary and resolve true half the time, the mean Brier is 0.41 = 0.16 + 0.25 outcome variance), reliability bin (0.85–0.95) shows observed 0.5; reputation drops.

Approach:
- Brier = mean((predicted_probability - observed_outcome)^2) over all preregs of that agent in the window.
- ECE = sum over bins of |bin_predicted - bin_observed| * bin_weight.
- Use preregistration_outcomes as ground truth: only count rows with prereg_finalised_at IS NOT NULL.

Dependencies:
- q-er-preregistration (data source).

Implemented per-agent Brier/ECE calibration tracker from preregistration_outcomes.
Files created:
- migrations/127_agent_calibration_table.py: creates the agent_calibration(agent_id, day, brier, n, ece, claim_breakdown_json) table; adds predicted_probability and agent_id columns to preregistration_outcomes; creates the miscalibration_alerts table for chronic-miscalibration tracking.
- scidex/senate/calibration.py: core module. compute_brier(agent_id, window_days=90) returns a CalibrationResult with Brier, ECE, a 10-bin reliability curve, and a per-claim breakdown. Also provides write_daily_calibration(), update_reputation_from_calibration() (0.7×existing + 0.3×(1−normed_brier)), check_and_propose_recalibration() (fires a Senate recalibration_review proposal when brier > median+2σ for 14 consecutive days), and is_calibration_opus_blocked().
- economics_drivers/ci_agent_calibration.py: daily driver (MIN_INTERVAL=23h) that calls run_daily_calibration_sweep().
- scidex/senate/model_router.py: added an executing_agent_id parameter to route(); Step 3b blocks tier-3 (Opus) if the agent has an unresolved miscalibration alert, logged as calibration_opus_blocked.
- api.py: added a reliability curve panel to the /senate/agent/{agent_id} page showing Brier, ECE, n, a 10-bin bar chart comparing predicted vs observed frequencies, and a system-mean comparison.
- scidex/agora/preregistration.py: finalize_preregistration() now stores predicted_probability (from predictions_json.confidence) and agent_id in preregistration_outcomes rows.

Acceptance criteria met:
- compute_brier(agent_id, window_days=90) returns {brier, n, by_claim_type, reliability_curve}
- agent_calibration(agent_id, day, brier, n, ece, claim_breakdown_json)
- reputation_score = 0.7 × existing + 0.3 × (1 − normalised_brier)
- /senate/agent/{id} shows reliability curve with system-mean comparison
- recalibration_review auto-fires at brier > median+2σ for 14 days
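
The reputation blend and the chronic-miscalibration trigger can be sketched as below. This is a guess at the arithmetic only, not the code in scidex/senate/calibration.py. The function names are hypothetical, and several details are assumptions the notes above do not settle: reputation is taken to live on a 0 to 1 scale, normalised_brier is taken to be the Brier score clipped to [0, 1], σ is taken to be the population standard deviation across agents, and "14 consecutive days" is read as the 14 most recent daily checks.

```python
from statistics import median, pstdev


def blended_reputation(existing, brier):
    """0.7 x existing + 0.3 x (1 - normalised_brier).

    Assumption: normalisation is a simple clip to [0, 1]; the Brier score
    for binary outcomes already lies in that range.
    """
    normalised = min(max(brier, 0.0), 1.0)
    return 0.7 * existing + 0.3 * (1.0 - normalised)


def is_outlier(agent_brier, all_briers):
    """True when the agent's Brier exceeds median + 2 sigma across agents.

    Assumption: sigma is the population standard deviation.
    """
    return agent_brier > median(all_briers) + 2 * pstdev(all_briers)


def chronic_miscalibration(daily_flags, required_days=14):
    """True when the most recent `required_days` daily outlier flags are all set.

    `daily_flags` is ordered oldest to newest, one boolean per day.
    """
    if len(daily_flags) < required_days:
        return False
    return all(daily_flags[-required_days:])


# An agent with reputation 0.8 and Brier 0.41 among mostly well-calibrated peers.
population = [0.1] * 9 + [0.9]
new_reputation = blended_reputation(0.8, 0.41)
flagged_today = is_outlier(0.9, population)
should_propose_review = chronic_miscalibration([flagged_today] * 14)
```

Keeping the daily outlier check separate from the streak check means a single noisy day never triggers a review on its own; a review would only be proposed once the flag has held for the full streak.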