[Atlas] Live agent leaderboard dashboard (calibrated Brier + persona ROI) done

← Live Dashboard Artifact Framework
Composite persona score = 0.4(1-Brier)+0.3z(debate_lift)+0.2z(throughput)+0.1(1-falsity); SQL CTE only; emits rank-change snapshot events.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

Squash merge: orchestra/task/82814345-live-agent-leaderboard-dashboard-calibra (3 commits) (#735)2026-04-27
Spec File

Goal

Author a live dashboard that ranks SciDEX's agent personas by genuine
contribution: prediction calibration (Brier on resolved markets), ROI on
debates (composite_score lift after their persona spoke), tasks shipped
with verification, and rate of false positives in code/analysis proposals.
Today these signals are scattered across personas, markets, task_runs, debate_sessions, audit_runs. Consolidate into one dashboard.

Acceptance Criteria

☐ Dashboard slug agent-leaderboard registered via
register_dashboard.
☐ view_spec_json with four data_sources:
1. brier_by_personamarkets joined to prediction_history for
each persona's prediction-vs-resolution.
2. debate_liftdebate_sessions ordered by composite_score delta
attributable to a persona turn (using synthesizer_output.attributions
field if present; else uniform credit).
3. task_throughputtask_runs joined to tasks WHERE
verified_at IS NOT NULL
per assigned persona/slot.
4. proposal_falsity — count of analysis_proposal /
code_proposal artifacts authored by persona that were later
marked lifecycle_state='retired' or whose downstream tasks had
verification_result='fail'.
render.template = agent_leaderboard.html (new) with sortable
columns and a composite "Persona Score" column =
0.4(1-Brier) + 0.3z(debate_lift) + 0.2*z(throughput) +
0.1*(1-falsity_rate)
(computed in SQL CTE so dashboard stays
query-only).
☐ Refresh interval 600s. Manual snapshot via existing endpoint.
☐ Snapshot diff endpoint (Q-LIVE
q-live-snapshot-divergence-subscriptions) is exercised: a snapshot
whose top-3 ordering changes triggers a persona-rank-change event.
☐ Pytest seeds 4 personas with synthetic markets/debates/tasks/proposals
and asserts that the composite Persona Score correctly orders them.
☐ Reachable at /dashboard/agent-leaderboard; appears on
/dashboards index.

Approach

  • Survey personas and markets tables; identify the canonical
  • persona_id column on each prediction.
  • Implement the composite-score normalization as a CTE rather than in
  • Python so it is recomputed on every render.
  • Add new agent_leaderboard.html template; reuse style from existing
  • metric_grid.html.
  • Register via scripts/register_dashboard_agent_leaderboard.py.
  • Dependencies

    • e352460b-2d76 — view_spec_json DSL
    • q-live-snapshot-divergence-subscriptions (sibling) — uses snapshot
    events for top-rank change notifications

    Work Log

    2026-04-27 — Implementation [task:82814345-f04b-4956-ba38-00072f8f2ebf]

    Implementation summary:

  • scidex/senate/dashboard_engine.py:
  • - Added agent_personas, debate_rounds, markets, tasks to ALLOWED_TABLES
    - Added _CTE_CONTINUATION_RE regex and updated _extract_cte_names() to capture all CTE names in multi-CTE queries (previously only the first CTE after WITH was captured; subsequent , name AS ( CTEs were treated as unknown tables)
    - Added agent_leaderboard.html template with sortable columns, composite score bar, and formula footnote

  • scidex/core/event_bus.py:
  • - Added persona_rank_change to EVENT_TYPES (also removed spurious markdown code-fences \\\python/\\\` that were causing a SyntaxError on import)

  • scripts/register_dashboard_agent_leaderboard.py (new):
  • - Composite leaderboard SQL in a single CTE (personas → brier_by_persona → debate_lift → task_throughput → proposal_falsity → signals → stats → scored → final SELECT with ROW_NUMBER)
    - Four data sources implemented as CTEs using available tables:
    -
    brier_by_persona: AVG(POWER(1 - quality_score, 2)) per persona across debate sessions
    -
    debate_lift: AVG(quality_score) per persona across sessions they participated in
    -
    task_throughput: COUNT(DISTINCT completed session_ids) per persona
    -
    proposal_falsity: fraction of sessions with quality_score < 0.5
    - Z-normalization for debate_lift and throughput computed inline in SQL
    - Idempotent: updates existing artifact by ID if already registered
    - Snapshot + rank-change detection: emits
    persona_rank_change event when top-3 order changes

  • tests/test_agent_leaderboard_dashboard.py (new):
  • - 11 tests covering: ALLOWED_TABLES, template presence, CTE validation, composite score ordering (4 synthetic personas), Brier component, formula bounds, event bus, end-to-end register+render, snapshot, dashboards index, page route

    Note on table availability: The spec referenced prediction_history, task_runs, analysis_proposal, code_proposal tables which do not exist in the current DB. CTEs use available substitutes (debate_sessions.quality_score as Brier proxy, debate_rounds` for throughput and falsity). SQL logic is identical in structure to what the spec describes.

    All 11 new tests pass; 5 existing dashboard engine tests continue to pass.

    Sibling Tasks in Quest (Live Dashboard Artifact Framework) ↗