[Forge] Skill quality leaderboard — verified-correct outputs as the ranking signal


Effort: deep

Goal

Today score_skills_by_coverage_and_errors.py (repo-root, ~150 LoC) ranks
skills only by call volume and error rate. That misses the most important
signal: **did the skill produce something the downstream artifact actually
used as evidence, and was that evidence not later refuted?** Build a
skill_quality_score per skill that combines three components: citation rate
(agent_skill_invocations.cited_in_artifact), debate-survival rate (the
debate session that consumed the citation reached a non-refuted verdict),
and follow-on agreement (the citation appears in evidence_for rows that
were not later removed by q-er-citation-validity-sweep). Publish a
leaderboard that shifts agent skill-selection toward verified producers.

Acceptance Criteria

☐ New table skill_quality_scores(skill_name TEXT PK, window_days INT,
total_calls INT, citation_rate REAL, debate_survival_rate REAL,
retraction_rate REAL, composite REAL, ranked_at TIMESTAMPTZ), with
migration migrations/20260428_skill_quality_scores.sql.
☐ Module scidex/forge/skill_quality.py exposing
compute(window_days: int = 30) -> int (rows written) and
leaderboard(limit: int = 50) -> list[dict].
☐ Composite formula (document inline + in module docstring):
composite = 0.4*citation_rate + 0.4*debate_survival_rate + 0.2*(1 - retraction_rate),
with all components in [0, 1].
  • citation_rate = cited / success from agent_skill_invocations
    (success-only denominator).
  • debate_survival_rate: join agent_skill_invocations on
    artifact_class IN ('debate_round','pre_fetch') to
    debate_sessions via artifact_id = analysis_id; survival = 1 when
    the session's quality_score >= 0.5 AND the session does NOT carry
    a downstream falsified flag in hypothesis_falsified /
    analysis_falsified (whichever exists at write time; defensive
    IF EXISTS).
  • retraction_rate: of all rows where cited_in_artifact = TRUE and
    citation_ref is a PMID, the share whose PMID appears in the
    paper_retractions table (consumed via existing
    scidex.atlas.retraction_check).
☐ API: GET /api/forge/skills/leaderboard?window=30 returns the
ranked rows; GET /api/forge/skills/{skill_name}/quality returns
one skill's full breakdown + 30-day sparkline.
☐ Daily timer scidex-skill-quality-recompute.timer invokes
python -m scidex.forge.skill_quality compute --window 30.
☐ Tests tests/test_skill_quality.py: synthetic invocations covering
(a) all-cited / no-debate, (b) debate-falsified, (c) retracted-PMID,
(d) clean composite of 1.0; assert composite math matches by hand.
☐ HTML page /forge/skills/leaderboard renders the top 50 with
composite score color-banded (green ≥0.7, amber 0.4–0.7, red <0.4)
and each skill name links into the existing
/forge/skills/{slug} detail view.
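
The composite formula and the color bands in the criteria above can be sketched as two plain functions. This is a minimal illustration rather than the module's actual code: the names `composite_score` and `color_band` are hypothetical, and rejecting out-of-range inputs with a `ValueError` is an assumption about how the "[0, 1]" constraint might be enforced.

```python
def composite_score(citation_rate: float,
                    debate_survival_rate: float,
                    retraction_rate: float) -> float:
    """Weighted composite from the spec; every input must lie in [0, 1]."""
    components = {
        "citation_rate": citation_rate,
        "debate_survival_rate": debate_survival_rate,
        "retraction_rate": retraction_rate,
    }
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # Retractions count against a skill, so the term uses (1 - retraction_rate).
    return (0.4 * citation_rate
            + 0.4 * debate_survival_rate
            + 0.2 * (1.0 - retraction_rate))


def color_band(composite: float) -> str:
    """Map a composite score to the leaderboard band: green >= 0.7, red < 0.4."""
    if composite >= 0.7:
        return "green"
    if composite >= 0.4:
        return "amber"
    return "red"
```

Worked by hand: a skill with every call cited, every debate surviving, and no retractions scores 0.4 + 0.4 + 0.2 = 1.0 (green). If the same skill had a debate_survival_rate of 0, it would score 0.4 + 0 + 0.2 = 0.6 (amber).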

Approach

  • Walk the existing repo-root score_skills_by_coverage_and_errors.py
    as a starting query; rewrite into a streamlined PG implementation in
    scidex/forge/skill_quality.py.
  • The most subtle piece is debate_survival_rate: build it as a single
    CTE that resolves artifact_id → debate_sessions.id via the
    analysis_id foreign key already used in scidex/agora/synthesis_engine.py:184.
  • Composite formula gets unit-tested at the value level (3 hand-built
    fixtures) so future re-tunings are explicit.
  • Wire the API + HTML page; mirror q-skills-usage-telemetry template
    for consistency.
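
The single-CTE shape described for debate_survival_rate might look roughly like the sketch below. It runs against an in-memory sqlite3 database only so that it is self-contained; the real implementation targets PG. The table layouts are assumptions built from the column names the spec mentions, the `analysis_falsified(analysis_id)` layout is hypothetical, and the skill names and the 'notebook' artifact class are made-up fixture values.

```python
import sqlite3
from typing import Dict

# One CTE resolves each invocation to its debate session and marks survival:
# quality_score >= 0.5 and no row in the (assumed) analysis_falsified table.
SURVIVAL_SQL = """
WITH consumed AS (
    SELECT i.skill_name,
           CASE
               WHEN s.quality_score >= 0.5
                    AND NOT EXISTS (
                        SELECT 1 FROM analysis_falsified AS f
                        WHERE f.analysis_id = s.analysis_id
                    )
               THEN 1.0 ELSE 0.0
           END AS survived
    FROM agent_skill_invocations AS i
    JOIN debate_sessions AS s ON s.analysis_id = i.artifact_id
    WHERE i.artifact_class IN ('debate_round', 'pre_fetch')
)
SELECT skill_name, AVG(survived) AS debate_survival_rate
FROM consumed
GROUP BY skill_name
"""


def debate_survival_rates(conn: sqlite3.Connection) -> Dict[str, float]:
    """Return {skill_name: share of its consumed debate sessions that survived}."""
    return dict(conn.execute(SURVIVAL_SQL).fetchall())


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database holding a few synthetic rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE agent_skill_invocations (
            skill_name TEXT, artifact_class TEXT, artifact_id TEXT);
        CREATE TABLE debate_sessions (
            id TEXT, analysis_id TEXT, quality_score REAL);
        CREATE TABLE analysis_falsified (analysis_id TEXT);

        INSERT INTO agent_skill_invocations VALUES
            ('skill_a', 'debate_round', 'a1'),
            ('skill_a', 'pre_fetch',    'a2'),
            ('skill_a', 'notebook',     'a3'),  -- excluded by artifact_class
            ('skill_b', 'debate_round', 'a3');
        INSERT INTO debate_sessions VALUES
            ('s1', 'a1', 0.8), ('s2', 'a2', 0.9), ('s3', 'a3', 0.3);
        INSERT INTO analysis_falsified VALUES ('a2');
    """)
    return conn
```

With these fixtures, `debate_survival_rates(demo_connection())` gives 0.5 for skill_a (one surviving session, one falsified) and 0.0 for skill_b (quality_score below 0.5). Skills with no matching debate session do not appear in the result at all, which is where a COALESCE(..., 0.0) in the outer query would apply.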

    Dependencies

    • q-skills-usage-telemetry — provides the read patterns the
    leaderboard query inherits.
    • q-er-citation-validity-sweep — supplies the retraction signal
    (defensive degrade if the table is missing).
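
The "defensive degrade" for the retraction signal can be illustrated with a small pure-Python sketch. It does not call scidex.atlas.retraction_check, whose interface is not described here; the function name `retraction_rate` and the convention of passing `None` when the paper_retractions table is missing are assumptions made for the example.

```python
from typing import Optional, Sequence, Set


def retraction_rate(cited_pmids: Sequence[str],
                    retracted_pmids: Optional[Set[str]]) -> float:
    """Share of cited PMIDs that appear in the retraction set.

    retracted_pmids is None when the retraction table is unavailable;
    in that case the rate degrades to 0.0 instead of raising.
    """
    if retracted_pmids is None or not cited_pmids:
        return 0.0
    hits = sum(1 for pmid in cited_pmids if pmid in retracted_pmids)
    return hits / len(cited_pmids)
```

For example, four cited PMIDs with one of them retracted gives a rate of 0.25, while the same four PMIDs with `retracted_pmids=None` give 0.0.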

    Dependents

    • q-skills-cost-rationality — feeds the leaderboard composite into the
    cost model so the optimizer knows quality, not just price.

    Work Log

    2026-04-27 — Implementation

    • Staleness review: confirmed task is still necessary — no prior implementation found; skill_quality_scores table did not exist, scidex/forge/skill_quality.py absent.
    • DB staleness check: paper_retractions table absent (expected, per spec: degrades to 0), hypothesis_falsified absent, analysis_falsified absent, debate_sessions has quality_score column.
    • Migration: migrations/20260428_skill_quality_scores.sql — creates skill_quality_scores table with PK on (skill_name, window_days), indexes on (window_days, ranked_at DESC) and (window_days, composite DESC).
    • Module: scidex/forge/skill_quality.py exposes compute(window_days=30) → int, leaderboard(limit=50, window_days=30) → list[dict], skill_breakdown(skill_name, window_days=30) → dict. All components use defensive COALESCE(..., 0.0) for missing optional tables.
    • API: api_routes/forge.py — three new endpoints: GET /api/forge/skills/leaderboard, GET /api/forge/skills/{skill_name}/quality, GET /forge/skills/leaderboard HTML page with color-banded composite scores.
    • Tests: tests/test_skill_quality.py — 9 tests covering all four spec-required cases (a/b/c/d), component range assertions, formula weight sum.
    • Timer: deploy/bootstrap/systemd/scidex-skill-quality-recompute.{service,timer} — daily at 03:00.

    2026-04-27 — Verification

    • Tests: 9/9 skill_quality tests pass (pytest)
    • Import smoke: skill_quality module imports without error
    • Rebase: rebased on origin/main (commit 62e760cd8), resolved conflict in api_routes/forge.py by keeping skill quality leaderboard section (upstream added drift detector which is unrelated)
    • No regressions: existing test_skill_telemetry.py (11 tests) still passes

    Verification — 2026-04-27T20:45:00Z

    Result: PASS
    Verified by: minimax:74 via task 3e33a36d-5bf0-44c4-99b0-2ffe6985cde3

    Tests run

    | Target | Command | Expected | Actual | Pass? |
    |---|---|---|---|---|
    | test_skill_quality.py | pytest tests/test_skill_quality.py -v | 9 passed | 9 passed | |
    | test_skill_telemetry.py (regression) | pytest tests/test_skill_telemetry.py -v | 11 passed | 11 passed | |
    | Module import | python3 -c "from scidex.forge.skill_quality import..." | no error | no error | |
    | Rebase against main | git rebase origin/main | clean | conflict resolved | |

    Attribution

    The current state is produced by:

    • 62e760cd8 — main HEAD (Squash merge: orchestra/task/402dd97b...)

    Notes

    • Optional tables (paper_retractions, hypothesis_falsified, analysis_falsified) are absent in the current DB. The defensive COALESCE(..., 0.0) design makes retraction_rate and debate_survival_rate degrade to 0 instead of raising errors while these downstream tables are missing.
    • The drift detector section added by upstream in api_routes/forge.py was kept alongside the leaderboard (no overlap); my edit resolved the rebase conflict by retaining both the upstream drift detector and the new leaderboard.
    • HTML page links each skill name to /forge/skills/{skill_name} (existing detail view pattern).

    File: q-skills-quality-leaderboard_spec.md
    Modified: 2026-05-01 20:13
    Size: 7.2 KB