Effort: deep
Today score_skills_by_coverage_and_errors.py (repo-root, ~150 LoC) ranks
skills only by call volume and error rate. That misses the most important
signal: **did the skill produce something the downstream artifact actually
used as evidence, and was that evidence not later refuted**? Build a
skill_quality_score per skill that combines: citation rate
(agent_skill_invocations.cited_in_artifact), debate-survival rate (the
debate session that consumed the citation reached a non-refuted verdict),
and follow-on agreement (the citation appears in evidence_for rows that
were not later removed by q-er-citation-validity-sweep). Publish a
leaderboard that shifts agent skill-selection toward verified producers.
- skill_quality_scores(skill_name TEXT PK, window_days INT, ...), created by migrations/20260428_skill_quality_scores.sql.
- scidex/forge/skill_quality.py exposing compute(window_days: int = 30) -> int (rows written) and leaderboard(limit: int = 50) -> list[dict].
- composite = 0.4 * citation_rate + 0.4 * debate_survival_rate +
  0.2 * (1 - retraction_rate), all components in [0, 1].
- citation_rate = cited / success from agent_skill_invocations.
- debate_survival_rate: join agent_skill_invocations on artifact_class IN ('debate_round','pre_fetch') to debate_sessions via artifact_id = analysis_id; survival = 1 when quality_score >= 0.5 AND the session does NOT carry a falsified flag in hypothesis_falsified / analysis_falsified (whichever exists at write time; defensive IF EXISTS).
- retraction_rate: of all rows where cited_in_artifact = TRUE and citation_ref is a PMID, the share whose PMID appears in the paper_retractions table (consumed via existing scidex.atlas.retraction_check).
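
The composite formula above can be sketched as a small pure function. This is a minimal illustration, not the shipped module: the `SkillCounts` container, its field names, and the choice to return 0.0 for an empty denominator are assumptions made here so that every component stays in [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillCounts:
    """Hypothetical per-skill tallies for one scoring window."""
    success: int         # successful invocations
    cited: int           # invocations with cited_in_artifact = TRUE
    debates: int         # citations consumed by a debate session
    survived: int        # of those, sessions that were not refuted
    pmid_citations: int  # cited rows whose citation_ref is a PMID
    retracted: int       # of those, PMIDs found in paper_retractions


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator clamped to [0, 1]; 0.0 when empty."""
    if denominator <= 0:
        return 0.0
    return max(0.0, min(1.0, numerator / denominator))


def composite_score(c: SkillCounts) -> dict:
    """Combine the three components with the 0.4 / 0.4 / 0.2 weights."""
    citation_rate = _ratio(c.cited, c.success)
    debate_survival_rate = _ratio(c.survived, c.debates)
    retraction_rate = _ratio(c.retracted, c.pmid_citations)
    composite = (
        0.4 * citation_rate
        + 0.4 * debate_survival_rate
        + 0.2 * (1.0 - retraction_rate)
    )
    return {
        "citation_rate": citation_rate,
        "debate_survival_rate": debate_survival_rate,
        "retraction_rate": retraction_rate,
        "composite": composite,
    }
```

Under these assumptions a skill with no data at all still scores 0.2, because a retraction_rate of 0 contributes the full retraction term. Whether that floor is desirable is a design choice the formula itself leaves open.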
- GET /api/forge/skills/leaderboard?window=30 returns the ...
- GET /api/forge/skills/{skill_name}/quality returns ...
- scidex-skill-quality-recompute.timer invokes python -m scidex.forge.skill_quality compute --window 30.
- tests/test_skill_quality.py: synthetic invocations covering ...
- /forge/skills/leaderboard renders the top 50 with ... /forge/skills/{slug} detail view.
- score_skills_by_coverage_and_errors.py ... scidex/forge/skill_quality.py.
- debate_survival_rate: build it as a single ... artifact_id → debate_sessions.id via the analysis_id foreign key already used in scidex/agora/synthesis_engine.py:184.
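
The join described for debate_survival_rate can be sketched as one query with a common table expression. The sketch runs against an in-memory SQLite database purely so it is self-contained; the column lists, the sample rows, and the `cited_in_artifact = 1` filter are assumptions, and the falsified-flag check is left out because those tables are reported absent in the current DB.

```python
import sqlite3

# One CTE: invocations of the two debate-related artifact classes joined to
# debate_sessions via artifact_id = analysis_id, flagged 1 when the session
# meets the quality_score >= 0.5 threshold.
SURVIVAL_SQL = """
WITH consumed AS (
    SELECT i.skill_name,
           CASE WHEN d.quality_score >= 0.5 THEN 1 ELSE 0 END AS survived
    FROM agent_skill_invocations AS i
    JOIN debate_sessions AS d ON d.analysis_id = i.artifact_id
    WHERE i.artifact_class IN ('debate_round', 'pre_fetch')
      AND i.cited_in_artifact = 1  -- assumption: only cited rows count
)
SELECT skill_name, AVG(survived) AS debate_survival_rate
FROM consumed
GROUP BY skill_name
ORDER BY skill_name
"""


def debate_survival_rates(conn: sqlite3.Connection) -> dict:
    """Return {skill_name: survival rate in [0, 1]} for joined rows."""
    return {name: rate for name, rate in conn.execute(SURVIVAL_SQL)}


def demo_connection() -> sqlite3.Connection:
    """Build a tiny hypothetical dataset for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE agent_skill_invocations (
            skill_name TEXT, artifact_class TEXT,
            artifact_id TEXT, cited_in_artifact INTEGER);
        CREATE TABLE debate_sessions (
            id TEXT PRIMARY KEY, analysis_id TEXT, quality_score REAL);
        INSERT INTO debate_sessions VALUES
            ('s1', 'a1', 0.8), ('s2', 'a2', 0.3);
        INSERT INTO agent_skill_invocations VALUES
            ('pubmed_search', 'debate_round', 'a1', 1),
            ('pubmed_search', 'pre_fetch',    'a2', 1),
            ('gene_lookup',   'debate_round', 'a1', 1),
            ('gene_lookup',   'analysis',     'a2', 1);
    """)
    return conn


if __name__ == "__main__":
    print(debate_survival_rates(demo_connection()))
```

Skills with no matching debate session do not appear in the result at all, so a caller would need to default them to 0 when writing skill_quality_scores.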
- ... q-skills-usage-telemetry template

- q-skills-usage-telemetry: provides the read patterns the ...
- q-er-citation-validity-sweep: supplies the retraction signal
- q-skills-cost-rationality: feeds the leaderboard composite into the ...

- skill_quality_scores table did not exist, scidex/forge/skill_quality.py absent.
- paper_retractions table absent (expected, per spec: degrades to 0), hypothesis_falsified absent, analysis_falsified absent, debate_sessions has quality_score column.
- migrations/20260428_skill_quality_scores.sql: creates skill_quality_scores table with PK on (skill_name, window_days), indexes on (window_days, ranked_at DESC) and (window_days, composite DESC).
- scidex/forge/skill_quality.py: compute(window_days=30) → int, leaderboard(limit=50, window_days=30) → list[dict], skill_breakdown(skill_name, window_days=30) → dict. All components use defensive COALESCE(..., 0.0) for missing optional tables.
- api_routes/forge.py: three new endpoints: GET /api/forge/skills/leaderboard, GET /api/forge/skills/{skill_name}/quality, GET /forge/skills/leaderboard HTML page with color-banded composite scores.
- tests/test_skill_quality.py: 9 tests covering all four spec-required cases (a/b/c/d), component range assertions, formula weight sum.
- deploy/bootstrap/systemd/scidex-skill-quality-recompute.{service,timer}: daily at 03:00.
- skill_quality module imports without error
- ... origin/main (commit 62e760cd8); resolved conflict in api_routes/forge.py by keeping the skill quality leaderboard section (upstream added a drift detector, which is unrelated)
- test_skill_telemetry.py (11 tests) still passes

Result: PASS
Verified by: minimax:74 via task 3e33a36d-5bf0-44c4-99b0-2ffe6985cde3
The current state is produced by:
- 62e760cd8: main HEAD (Squash merge: orchestra/task/402dd97b...)
- ... (paper_retractions, hypothesis_falsified, analysis_falsified) are absent in the current DB. The defensive COALESCE(..., 0.0) design ensures retraction_rate and debate_survival_rate degrade to 0 rather than raising errors until these downstream tables are populated.
- ... api_routes/forge.py was kept alongside the leaderboard (no overlap); my edit resolved the rebase conflict by retaining both the upstream drift detector and the new leaderboard.
- ... /forge/skills/{skill_name} (existing detail view pattern).
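
The degrade-to-0 behaviour for absent optional tables can be illustrated with an explicit existence check. Note that this is a different technique from the COALESCE(..., 0.0) approach the module is described as using, and it does not call scidex.atlas.retraction_check. The `pmid` column name and the use of SQLite's `sqlite_master` catalog are assumptions for the sake of a self-contained sketch.

```python
import sqlite3
from typing import Sequence


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    """Check SQLite's catalog for a table with the given name."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


def retraction_rate(conn: sqlite3.Connection, pmids: Sequence[str]) -> float:
    """Share of cited PMID rows whose PMID is listed as retracted.

    Returns 0.0 when there are no PMID citations or when the optional
    paper_retractions table has not been created yet.
    """
    if not pmids or not table_exists(conn, "paper_retractions"):
        return 0.0
    # Assumption: paper_retractions has a text column named pmid.
    retracted = {
        row[0] for row in conn.execute("SELECT pmid FROM paper_retractions")
    }
    hits = sum(1 for pmid in pmids if pmid in retracted)
    return hits / len(pmids)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Table absent: the rate degrades to 0.0 instead of raising.
    print(retraction_rate(conn, ["111", "222"]))
    conn.execute("CREATE TABLE paper_retractions (pmid TEXT PRIMARY KEY)")
    conn.execute("INSERT INTO paper_retractions VALUES ('111')")
    print(retraction_rate(conn, ["111", "222", "111", "333"]))
```

The rate here is row-based rather than distinct-PMID-based, which matches the "share of rows" wording of the retraction_rate definition; a production version would likely push the counting into SQL.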