[Forge] Skill quality leaderboard — verified-correct outputs as the ranking signal

Effort: deep

Goal

Today score_skills_by_coverage_and_errors.py (repo-root, ~150 LoC) ranks
skills only by call volume and error rate. That misses the most important
signal: **did the skill produce something the downstream artifact actually
used as evidence, and was that evidence not later refuted?** Build a
skill_quality_score per skill that combines three signals: citation rate
(agent_skill_invocations.cited_in_artifact), debate-survival rate (the
debate session that consumed the citation reached a non-refuted verdict),
and follow-on agreement (the citation appears in evidence_for rows that
were not later removed by q-er-citation-validity-sweep). Publish a
leaderboard that shifts agent skill-selection toward verified producers.

Acceptance Criteria

☐ New table skill_quality_scores(skill_name TEXT PK, window_days INT,
total_calls INT, citation_rate REAL, debate_survival_rate REAL,
retraction_rate REAL, composite REAL, ranked_at TIMESTAMPTZ) with
migration migrations/20260428_skill_quality_scores.sql.

☐ Module scidex/forge/skill_quality.py exposing
compute(window_days: int = 30) -> int (rows written) and
leaderboard(limit: int = 50) -> list[dict].

☐ Composite formula (document inline + in module docstring):
`composite = 0.4*citation_rate + 0.4*debate_survival_rate + 0.2*(1 - retraction_rate)`,
with all components in [0, 1].

☐ citation_rate = cited / success from agent_skill_invocations
(success-only denominator).

☐ debate_survival_rate: join agent_skill_invocations on
artifact_class IN ('debate_round','pre_fetch') to debate_sessions via
artifact_id = analysis_id; survival = 1 when the session's
quality_score >= 0.5 AND the session does NOT carry a downstream
falsified flag in hypothesis_falsified / analysis_falsified (whichever
exists at write time; guard with a defensive IF EXISTS).

☐ retraction_rate: of all rows where cited_in_artifact = TRUE and
citation_ref is a PMID, the share whose PMID appears in the
paper_retractions table (consumed via existing
scidex.atlas.retraction_check).

☐ API: GET /api/forge/skills/leaderboard?window=30 returns the ranked
rows; GET /api/forge/skills/{skill_name}/quality returns one skill's full
breakdown + 30-day sparkline.

☐ Daily timer scidex-skill-quality-recompute.timer invokes
python -m scidex.forge.skill_quality compute --window 30.

☐ Tests tests/test_skill_quality.py: synthetic invocations covering
(a) all-cited / no-debate, (b) debate-falsified, (c) retracted-PMID,
(d) clean composite of 1.0; assert composite math matches by hand.

☐ HTML page /forge/skills/leaderboard renders the top 50 with composite
score color-banded (green ≥0.7, amber 0.4–0.7, red <0.4) and each skill
name links into the existing /forge/skills/{slug} detail view.
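
As a minimal sketch of the composite formula and the color bands listed above (not the actual scidex/forge/skill_quality.py code), the Python block below shows how the weighted score and its band could be computed. The names safe_rate, composite, and color_band are hypothetical. Returning 0.0 for an empty denominator is an assumption, chosen to match the COALESCE(..., 0.0) behavior recorded in the work log.

```python
# Weights taken from the composite formula in the acceptance criteria.
W_CITATION, W_DEBATE, W_RETRACTION = 0.4, 0.4, 0.2


def safe_rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 for an empty denominator (assumed default)."""
    return numerator / denominator if denominator else 0.0


def composite(citation_rate: float, debate_survival_rate: float,
              retraction_rate: float) -> float:
    """Weighted composite; every component must already lie in [0, 1]."""
    components = {
        "citation_rate": citation_rate,
        "debate_survival_rate": debate_survival_rate,
        "retraction_rate": retraction_rate,
    }
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (W_CITATION * citation_rate
            + W_DEBATE * debate_survival_rate
            + W_RETRACTION * (1.0 - retraction_rate))


def color_band(score: float) -> str:
    """Map a composite score to the leaderboard bands (green >= 0.7, red < 0.4)."""
    if score >= 0.7:
        return "green"
    if score >= 0.4:
        return "amber"
    return "red"


# Case (d): clean composite, every component at its best value.
clean = composite(citation_rate=1.0, debate_survival_rate=1.0, retraction_rate=0.0)
# Case (a): all cited, no debate rows; the missing debate signal is assumed to count as 0.0.
no_debate = composite(safe_rate(5, 5), safe_rate(0, 0), safe_rate(0, 5))

print(round(clean, 6), color_band(clean))          # 1.0 green
print(round(no_debate, 6), color_band(no_debate))  # 0.6 amber
```

Under that zero-default assumption, a skill whose outputs are always cited but never reach a debate is capped at 0.6 (amber). A hand-built test fixture would make that consequence explicit.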

Approach

Walk the existing repo-root score_skills_by_coverage_and_errors.py as a
starting query; rewrite it into a streamlined PG implementation in
scidex/forge/skill_quality.py.

The most subtle piece is debate_survival_rate: build it as a single CTE
that resolves artifact_id → debate_sessions.id via the analysis_id foreign
key already used in scidex/agora/synthesis_engine.py:184.

Composite formula gets unit-tested at the value level (3 hand-built
fixtures) so future re-tunings are explicit.

Wire the API + HTML page; mirror the q-skills-usage-telemetry template for
consistency.
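
The single-CTE shape of debate_survival_rate can be sketched as follows. This is an illustration only: it uses Python's built-in sqlite3 with an in-memory database as a stand-in for Postgres, and the column layout of analysis_falsified (a single analysis_id column), the helper names, and the sample skill names are all assumptions. The table-existence check plays the role of the spec's defensive IF EXISTS.

```python
import sqlite3


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    """True when a table with this name exists (SQLite stand-in for a PG catalog check)."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None


def debate_survival_rates(conn: sqlite3.Connection) -> dict:
    """Per-skill share of consumed debate sessions that survived."""
    # Only reference the optional falsification table when it is present.
    if table_exists(conn, "analysis_falsified"):
        not_falsified = ("NOT EXISTS (SELECT 1 FROM analysis_falsified f "
                         "WHERE f.analysis_id = d.analysis_id)")
    else:
        not_falsified = "1 = 1"
    sql = f"""
        WITH consumed AS (
            SELECT i.skill_name,
                   CASE WHEN d.quality_score >= 0.5 AND {not_falsified}
                        THEN 1 ELSE 0 END AS survived
            FROM agent_skill_invocations AS i
            JOIN debate_sessions AS d ON d.analysis_id = i.artifact_id
            WHERE i.artifact_class IN ('debate_round', 'pre_fetch')
        )
        SELECT skill_name, AVG(survived) FROM consumed GROUP BY skill_name
    """
    return {name: rate for name, rate in conn.execute(sql)}


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE agent_skill_invocations (skill_name TEXT, artifact_class TEXT, artifact_id TEXT);
    CREATE TABLE debate_sessions (id TEXT, analysis_id TEXT, quality_score REAL);
    CREATE TABLE analysis_falsified (analysis_id TEXT);
    INSERT INTO agent_skill_invocations VALUES
        ('pubmed_search', 'debate_round', 'a1'),
        ('pubmed_search', 'debate_round', 'a2'),
        ('pubmed_search', 'pre_fetch', 'a3'),
        ('pubmed_search', 'pre_fetch', 'a4'),
        ('gene_lookup', 'notebook', 'a1');
    INSERT INTO debate_sessions VALUES
        ('s1', 'a1', 0.9), ('s2', 'a2', 0.3), ('s3', 'a3', 0.8), ('s4', 'a4', 0.7);
    INSERT INTO analysis_falsified VALUES ('a3');
""")

rates = debate_survival_rates(conn)
print(rates)  # {'pubmed_search': 0.5}
```

In this sample, a2 fails the quality threshold and a3 is falsified, so two of the four consumed sessions survive. Skills with no qualifying rows (gene_lookup here) are simply absent from the result, leaving the caller to apply whatever default the module chooses. If several debate sessions can share one analysis_id, the join would count each session separately, which may or may not be the intended weighting.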

Dependencies

q-skills-usage-telemetry — provides the read patterns the

leaderboard query inherits.

q-er-citation-validity-sweep — supplies the retraction signal

(defensive degrade if the table is missing).

Dependents

q-skills-cost-rationality — feeds the leaderboard composite into the

cost model so the optimizer knows quality, not just price.

Work Log

2026-04-27 — Implementation

Staleness review: confirmed task is still necessary — no prior implementation found; skill_quality_scores table did not exist, scidex/forge/skill_quality.py absent.
DB staleness check: paper_retractions table absent (expected, per spec: degrades to 0), hypothesis_falsified absent, analysis_falsified absent, debate_sessions has quality_score column.
Migration: migrations/20260428_skill_quality_scores.sql — creates skill_quality_scores table with PK on (skill_name, window_days), indexes on (window_days, ranked_at DESC) and (window_days, composite DESC).
Module: scidex/forge/skill_quality.py — compute(window_days=30) → int, leaderboard(limit=50, window_days=30) → list[dict], skill_breakdown(skill_name, window_days=30) → dict. All components use defensive COALESCE(..., 0.0) for missing optional tables.
API: api_routes/forge.py — three new endpoints: GET /api/forge/skills/leaderboard, GET /api/forge/skills/{skill_name}/quality, GET /forge/skills/leaderboard HTML page with color-banded composite scores.
Tests: tests/test_skill_quality.py — 9 tests covering all four spec-required cases (a/b/c/d), component range assertions, formula weight sum.
Timer: deploy/bootstrap/systemd/scidex-skill-quality-recompute.{service,timer} — daily at 03:00.

2026-04-27 — Verification

Tests: 9/9 pass (pytest, tests/test_skill_quality.py)
Import smoke: skill_quality module imports without error
Rebase: rebased on origin/main (commit 62e760cd8), resolved conflict in api_routes/forge.py by keeping skill quality leaderboard section (upstream added drift detector which is unrelated)
No regressions: existing test_skill_telemetry.py (11 tests) still passes

Verification — 2026-04-27T20:45:00Z

Result: PASS

Verified by: minimax:74 via task 3e33a36d-5bf0-44c4-99b0-2ffe6985cde3

Tests run

| Target | Command | Expected | Actual | Pass? |
|---|---|---|---|---|
| `test_skill_quality.py` | `pytest tests/test_skill_quality.py -v` | 9 passed | 9 passed | ✓ |
| `test_skill_telemetry.py` (regression) | `pytest tests/test_skill_telemetry.py -v` | 11 passed | 11 passed | ✓ |
| Module import | `python3 -c "from scidex.forge.skill_quality import..."` | no error | no error | ✓ |
| Rebase against main | `git rebase origin/main` | clean | conflict resolved | ✓ |

Attribution

The current state is produced by:

62e760cd8 — main HEAD (Squash merge: orchestra/task/402dd97b...)

Notes

Optional tables (paper_retractions, hypothesis_falsified, analysis_falsified) are absent in the current DB. The defensive COALESCE(..., 0.0) design ensures retraction_rate and debate_survival_rate degrade to 0 rather than raising errors while these downstream tables are missing.
The rebase conflict in api_routes/forge.py was resolved by retaining both the upstream drift detector section and the new leaderboard section (no overlap).
HTML page links each skill name to /forge/skills/{skill_name} (existing detail view pattern).


File: q-skills-quality-leaderboard_spec.md

Modified: 2026-05-01 20:13

Size: 7.2 KB