Author a live dashboard that ranks SciDEX's agent personas by genuine
contribution: prediction calibration (Brier on resolved markets), ROI on
debates (composite_score lift after their persona spoke), tasks shipped
with verification, and rate of false positives in code/analysis proposals.
Today these signals are scattered across personas, markets, task_runs,
debate_sessions, audit_runs. Consolidate into one dashboard.
agent-leaderboard registered viaregister_dashboard.
brier_by_persona — markets joined to prediction_history fordebate_lift — debate_sessions ordered by composite_score deltasynthesizer_output.attributionstask_throughput — task_runs joined to tasks WHERE
verified_at IS NOT NULL per assigned persona/slot.proposal_falsity — count of analysis_proposal /code_proposal artifacts authored by persona that were laterlifecycle_state='retired' or whose downstream tasks hadverification_result='fail'.
render.template = agent_leaderboard.html (new) with sortable0.4(1-Brier) + 0.3z(debate_lift) + 0.2*z(throughput) +
0.1*(1-falsity_rate) (computed in SQL CTE so dashboard staysq-live-snapshot-divergence-subscriptions) is exercised: a snapshot/dashboard/agent-leaderboard; appears on/dashboards index.personas and markets tables; identify the canonicalagent_leaderboard.html template; reuse style from existingmetric_grid.html.
scripts/register_dashboard_agent_leaderboard.py.e352460b-2d76 — view_spec_json DSLq-live-snapshot-divergence-subscriptions (sibling) — uses snapshotImplementation summary:
scidex/senate/dashboard_engine.py:agent_personas, debate_rounds, markets, tasks to ALLOWED_TABLES_CTE_CONTINUATION_RE regex and updated _extract_cte_names() to capture all CTE names in multi-CTE queries (previously only the first CTE after WITH was captured; subsequent , name AS ( CTEs were treated as unknown tables)agent_leaderboard.html template with sortable columns, composite score bar, and formula footnotescidex/core/event_bus.py:persona_rank_change to EVENT_TYPES (also removed spurious markdown code-fences \\\python/\\\` that were causing a SyntaxError on import)scripts/register_dashboard_agent_leaderboard.py (new):
- Composite leaderboard SQL in a single CTE (personas → brier_by_persona → debate_lift → task_throughput → proposal_falsity → signals → stats → scored → final SELECT with ROW_NUMBER)
- Four data sources implemented as CTEs using available tables:
- brier_by_persona: AVG(POWER(1 - quality_score, 2)) per persona across debate sessions
- debate_lift: AVG(quality_score) per persona across sessions they participated in
- task_throughput: COUNT(DISTINCT completed session_ids) per persona
- proposal_falsity: fraction of sessions with quality_score < 0.5
- Z-normalization for debate_lift and throughput computed inline in SQL
- Idempotent: updates existing artifact by ID if already registered
- Snapshot + rank-change detection: emits persona_rank_change event when top-3 order changestests/test_agent_leaderboard_dashboard.py (new):
- 11 tests covering: ALLOWED_TABLES, template presence, CTE validation, composite score ordering (4 synthetic personas), Brier component, formula bounds, event bus, end-to-end register+render, snapshot, dashboards index, page routeNote on table availability: The spec referenced
prediction_history, task_runs, analysis_proposal, code_proposal tables which do not exist in the current DB. CTEs use available substitutes (debate_sessions.quality_score as Brier proxy, debate_rounds` for throughput and falsity). SQL logic is identical in structure to what the spec describes.All 11 new tests pass; 5 existing dashboard engine tests continue to pass.