SciDEX — Task: [Atlas] Live agent leaderboard dashboard (calibrat

Composite persona score = 0.4(1-Brier)+0.3z(debate_lift)+0.2z(throughput)+0.1(1-falsity); SQL CTE only; emits rank-change snapshot events.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

Squash merge: orchestra/task/82814345-live-agent-leaderboard-dashboard-calibra (3 commits) (#735)2026-04-27

Spec File

Goal

Author a live dashboard that ranks SciDEX's agent personas by genuine
contribution: prediction calibration (Brier on resolved markets), ROI on
debates (composite_score lift after their persona spoke), tasks shipped
with verification, and rate of false positives in code/analysis proposals.
Today these signals are scattered across personas, markets, task_runs, debate_sessions, audit_runs. Consolidate into one dashboard.

Acceptance Criteria

☐ Dashboard slug agent-leaderboard registered via

register_dashboard.

☐ view_spec_json with four data_sources:

1. brier_by_persona — markets joined to prediction_history for
each persona's prediction-vs-resolution.
2. debate_lift — debate_sessions ordered by composite_score delta
attributable to a persona turn (using synthesizer_output.attributions
field if present; else uniform credit).
3. task_throughput — task_runs joined to

tasks WHERE
         verified_at IS NOT NULL

per assigned persona/slot.
4. proposal_falsity — count of analysis_proposal /
code_proposal artifacts authored by persona that were later
marked lifecycle_state='retired' or whose downstream tasks had
verification_result='fail'.

☐ render.template = agent_leaderboard.html (new) with sortable

columns and a composite "Persona Score" column =

0.4(1-Brier) + 0.3z(debate_lift) + 0.2*z(throughput) +
       0.1*(1-falsity_rate)

(computed in SQL CTE so dashboard stays
query-only).

☐ Refresh interval 600s. Manual snapshot via existing endpoint.

☐ Snapshot diff endpoint (Q-LIVE

q-live-snapshot-divergence-subscriptions) is exercised: a snapshot
whose top-3 ordering changes triggers a persona-rank-change event.

☐ Pytest seeds 4 personas with synthetic markets/debates/tasks/proposals

and asserts that the composite Persona Score correctly orders them.

☐ Reachable at /dashboard/agent-leaderboard; appears on

/dashboards index.

Approach

Survey personas and markets tables; identify the canonical

persona_id column on each prediction.

Implement the composite-score normalization as a CTE rather than in

Python so it is recomputed on every render.

Add new agent_leaderboard.html template; reuse style from existing

metric_grid.html.

Dependencies

e352460b-2d76 — view_spec_json DSL
q-live-snapshot-divergence-subscriptions (sibling) — uses snapshot

events for top-rank change notifications

Work Log

2026-04-27 — Implementation [task:82814345-f04b-4956-ba38-00072f8f2ebf]

Implementation summary:

scidex/senate/dashboard_engine.py:

- Added agent_personas, debate_rounds, markets, tasks to ALLOWED_TABLES
- Added _CTE_CONTINUATION_RE regex and updated _extract_cte_names() to capture all CTE names in multi-CTE queries (previously only the first CTE after WITH was captured; subsequent , name AS ( CTEs were treated as unknown tables)
- Added agent_leaderboard.html template with sortable columns, composite score bar, and formula footnote

scidex/core/event_bus.py:

- Added persona_rank_change to EVENT_TYPES (also removed spurious markdown code-fences \\\python/\\\`

 that were causing a SyntaxError on import)

scripts/register_dashboard_agent_leaderboard.py (new):


   - Composite leaderboard SQL in a single CTE (personas → brier_by_persona → debate_lift → task_throughput → proposal_falsity → signals → stats → scored → final SELECT with ROW_NUMBER)
   - Four data sources implemented as CTEs using available tables:
     -

brier_by_persona

: AVG(POWER(1 - quality_score, 2)) per persona across debate sessions
     -

debate_lift

: AVG(quality_score) per persona across sessions they participated in
     -

task_throughput

: COUNT(DISTINCT completed session_ids) per persona
     -

proposal_falsity

: fraction of sessions with quality_score < 0.5
   - Z-normalization for debate_lift and throughput computed inline in SQL
   - Idempotent: updates existing artifact by ID if already registered
   - Snapshot + rank-change detection: emits

persona_rank_change

 event when top-3 order changes

tests/test_agent_leaderboard_dashboard.py (new):


   - 11 tests covering: ALLOWED_TABLES, template presence, CTE validation, composite score ordering (4 synthetic personas), Brier component, formula bounds, event bus, end-to-end register+render, snapshot, dashboards index, page route

Note on table availability: The spec referenced prediction_history, task_runs, analysis_proposal, code_proposal tables which do not exist in the current DB. CTEs use available substitutes (debate_sessions.quality_score as Brier proxy, debate_rounds` for throughput and falsity). SQL logic is identical in structure to what the spec describes.

All 11 new tests pass; 5 existing dashboard engine tests continue to pass.