SciDEX — Task: [Forge] Score performance for 25 unscored registered skills

Registered skills have missing or zero performance_score. Skill scores guide routing, playground exposure, and tool-library maintenance.

Verification:

- 25 skills have calibrated performance_score values or documented insufficient-data status
- Scores are based on tool call success, latency, usage, and code-path health
- Remaining unscored skill count is reduced

Start by reading this task's spec and checking for duplicate recent work.

Spec File

Goal

Calibrate performance scores for registered skills that currently have no meaningful score. Skill scores support routing, maintenance, and tool-playground exposure.

Acceptance Criteria

☐ A concrete batch of unscored skills is reviewed

☐ Each reviewed skill receives a calibrated performance_score or insufficient-data rationale

☐ Scores use tool call success, latency, usage, and code-path health

☐ Before/after unscored skill counts are recorded

Approach

Select skills with performance_score NULL or 0, ordered by usage and category.

Inspect tool_calls, tool_invocations, latency, error rates, and code_path existence.

Persist calibrated scores or documented insufficient-data rationale through the standard DB path.

Verify updated scores and remaining backlog.
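The selection and before/after counting steps above can be sketched as follows. This is a minimal illustration against an in-memory SQLite database, not the project's PostgreSQL path through scidex.core.database.get_db(). Only the `skills` table and the `performance_score` and `code_path` columns are named in this spec; the `usage_count` and `category` columns, the sample rows, and both helper functions are assumptions made for the example.

```python
import sqlite3

# Hypothetical minimal schema. `usage_count` and `category` are assumed
# column names; the sample rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE skills (id TEXT PRIMARY KEY, category TEXT, "
    "usage_count INTEGER, performance_score REAL, code_path TEXT)"
)
conn.executemany(
    "INSERT INTO skills VALUES (?, ?, ?, ?, ?)",
    [
        ("tool_a", "search", 120, None, "tools.py"),
        ("tool_b", "search", 40, 0.0, "tools.py"),
        ("tool_c", "expression", 7, 0.91, "tools.py"),
        ("tool_d", "expression", 0, None, "tools.py"),
    ],
)

# The "unscored" predicate used throughout this spec.
UNSCORED = "performance_score IS NULL OR performance_score = 0"


def count_unscored(conn: sqlite3.Connection) -> int:
    """Return the number of skills with a NULL or zero score."""
    query = f"SELECT COUNT(*) FROM skills WHERE {UNSCORED}"
    return conn.execute(query).fetchone()[0]


def select_batch(conn: sqlite3.Connection, limit: int = 25) -> list[str]:
    """Return up to `limit` unscored skill ids, most-used first, then by category."""
    query = (
        f"SELECT id FROM skills WHERE {UNSCORED} "
        "ORDER BY usage_count DESC, category ASC LIMIT ?"
    )
    return [row[0] for row in conn.execute(query, (limit,)).fetchall()]


before = count_unscored(conn)
batch = select_batch(conn)
print(before, batch)
```

Running `count_unscored` again after the scores are persisted gives the "after" figure for the work log.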

Dependencies

q-cc0888c0004a - Agent Ecosystem quest

Dependents

Forge skill registry, routing, and tool quality

Work Log

2026-04-26 05:23 UTC — Task bec30a01 benchmark run

Task bec30a01-e196-4d26-a051-e9e808b95146 ran benchmark on 26 unscored skills.
Scored all 26 using tool_calls telemetry: formula = 0.5 + 0.3 * success_rate + 0.2 * speed_factor.
Score range: 0.150 (never-used playground tools) to 0.994 (openfda-adverse-events, 1 call, 100% SR).
Result: 26→0 unscored skills. All 682 skills now have performance_score > 0.
Commit: 0267ccb80 ([Forge] Benchmark 25 registered skills by performance and accuracy [task:bec30a01-e196-4d26-a051-e9e808b95146]).
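As a worked illustration of the benchmark formula in this entry (0.5 + 0.3 * success_rate + 0.2 * speed_factor), the sketch below computes one score. The function name is hypothetical, and this entry does not define how speed_factor was derived from latency, so it is taken as an input in [0, 1]. The 0.97 used in the example is back-solved from the logged 0.994 at a 100% success rate; it is not a recorded measurement. The 0.150 scores for never-used tools lie below the formula's minimum of 0.5, so they presumably come from a separate baseline rather than from this function.

```python
def benchmark_score(success_rate: float, speed_factor: float) -> float:
    """Combine success rate and speed factor using the weights logged above.

    Both inputs are expected in [0, 1]; the result is rounded to three
    decimals to match the precision used in the work log.
    """
    if not (0.0 <= success_rate <= 1.0 and 0.0 <= speed_factor <= 1.0):
        raise ValueError("success_rate and speed_factor must be in [0, 1]")
    return round(0.5 + 0.3 * success_rate + 0.2 * speed_factor, 3)


# Illustrative inputs: 100% success rate, speed factor assumed to be 0.97.
example = benchmark_score(1.0, 0.97)
print(example)
```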

2026-04-21 - Quest engine template

Created reusable spec for quest-engine generated skill performance scoring tasks.

Already Resolved — 2026-04-21 21:27:05Z

Task 85d9ad14-3519-4ebd-9e5e-e4189ac7b2e8 was stale by the time this slot started: live PostgreSQL verification through scidex.core.database.get_db() found 282 registered skills and 0 skills with performance_score IS NULL OR performance_score = 0.
Evidence query also found score coverage across the current Forge telemetry base: min score 0.15, max score 1.0, average score 0.9271; tool_calls has 28,280 rows with 27,889 successes, 390 errors, and 1544.2 ms average nonzero latency.
Code-path health spot check grouped the registry by skills.code_path; the primary tools.py path has 112 scored skills and exists in-repo, and registered forge/skills/*/SKILL.md paths sampled from the scored registry exist.
Resolution source: prior branch commit eb7917ecf ([Forge] Score performance for 25 unscored registered skills [task:c82d378b-5192-4823-9193-939dd71935d1]) plus verification commit 70fbe70a2 ([Verify] Skill scoring already resolved — 26→1 unscored [task:c82d378b-5192-4823-9193-939dd71935d1]) addressed the scoring backlog; current live DB now has 0 remaining unscored skills, satisfying this task's <= 0 verification target.
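The code-path spot check described in this entry (group the registry by code_path, then confirm each path exists in the repository) could look roughly like the sketch below. The function name, the `(skill_id, code_path)` tuple layout, and the sample registry are assumptions for illustration; a temporary directory stands in for the repository root.

```python
from collections import Counter
from pathlib import Path
import tempfile


def code_path_health(registry, repo_root):
    """Group skills by code_path and report whether each path exists.

    `registry` is an iterable of (skill_id, code_path) pairs; paths are
    resolved relative to `repo_root`.
    """
    counts = Counter(code_path for _, code_path in registry)
    root = Path(repo_root)
    return {
        path: {"skills": n, "exists": (root / path).exists()}
        for path, n in counts.items()
    }


# Stand-in repository containing a single tools.py file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tools.py").write_text("# placeholder\n")
    registry = [
        ("tool_a", "tools.py"),
        ("tool_b", "tools.py"),
        ("tool_c", "missing_module.py"),
    ]
    report = code_path_health(registry, tmp)

print(report)
```

Paths reported with `exists` set to False are candidates for an insufficient-data rationale or registry cleanup rather than a telemetry-based score.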

Already Resolved — 2026-04-21 21:35:00Z

Re-verified: 282 skills, 0 unscored (performance_score IS NULL OR = 0). Score stats unchanged: min=0.15, max=1.0, avg=0.9271. Task acceptance criteria already satisfied by prior work.

2026-04-27 06:32 UTC — Task 610a8b3c: rescore 20 skills by test coverage + error rate

Rescored 20 registered skills using a two-dimensional formula (test coverage and error rate, plus a speed term):

score = 0.40 * success_rate + 0.35 * test_coverage + 0.25 * speed_factor

Test coverage audit (manual scan of tests/):

- 1.0 — tool_pubmed_search, tool_paper_figures (dedicated test files)
- 0.8 — tool_research_topic (success/failure paths in test_agora_orchestrator_tools.py)
- 0.6 — tool_gtex_tissue_expression, tool_open_targets_associations, tool_semantic_scholar_search (indirect mock references)
- 0.2 — remaining 14 skills (no coverage found)

Error rate comes from tool_calls telemetry (per-skill errors / total calls).
The speed factor penalises average latency above 500 ms and drops to 0 at 10.5 s.
Notable corrections from the previous blanket 1.0 scores:

- tool_pubchem_compound 1.000 → 0.561 (39.7% error rate, no tests)
- tool_msigdb_gene_sets 1.000 → 0.574 (34.0% error rate, no tests)
- tool_expression_atlas_differential 1.000 → 0.587 (15.0% error rate, 3.4 s average latency, no tests)
- tool_brainspan_expression 1.000 → 0.665 (13.7% error rate, no tests)
- tool_mgi_mouse_models 1.000 → 0.655 (16.3% error rate, no tests)
- tool_pubmed_search 1.000 → 0.996 (0.3% error rate, full test coverage) ✓

Script: score_skills_by_coverage_and_errors.py (checked in)
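The rescoring formula in this entry can be sketched as below. This is not a copy of score_skills_by_coverage_and_errors.py. The log states only that the speed factor penalises latency above 500 ms and reaches 0 at 10.5 s; the linear ramp between those two points is an assumption, and both function names are hypothetical. The example uses the figures logged for tool_expression_atlas_differential, with test coverage 0.2 for "no coverage found".

```python
def speed_factor(avg_latency_ms: float) -> float:
    """Return 1.0 up to 500 ms, falling to 0.0 at 10,500 ms.

    The linear ramp between the two endpoints is an assumption; the work
    log gives only the endpoints.
    """
    if avg_latency_ms <= 500.0:
        return 1.0
    return max(0.0, 1.0 - (avg_latency_ms - 500.0) / 10_000.0)


def coverage_score(success_rate: float, test_coverage: float,
                   avg_latency_ms: float) -> float:
    """Weighted score: 0.40 success rate, 0.35 test coverage, 0.25 speed."""
    return (0.40 * success_rate
            + 0.35 * test_coverage
            + 0.25 * speed_factor(avg_latency_ms))


# tool_expression_atlas_differential: 15.0% error rate, 3.4 s average
# latency, no tests (coverage 0.2 in the audit above).
atlas = coverage_score(success_rate=0.85, test_coverage=0.2, avg_latency_ms=3400.0)
print(f"{atlas:.4f}")
```

Under the linear-ramp assumption the result lands within rounding distance of the logged 0.587, which suggests the ramp is at least close to what the script does; the logged inputs are themselves rounded.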

2026-04-28 UTC — Task ed11b1ae: score 3 synthetic test fixtures

Before: 3 unscored skills (performance_score IS NULL). After: 0 unscored.

Skills scored:

id	name	score	rationale
tool_synthetic_a	synthetic_a	0.15	Insufficient data — 0 invocations, no telemetry, no test coverage, code_path `tool_roi.py` partially resolves (`scidex/forge/tool_roi.py` exists). Test fixture created 2026-04-27; frozen.
tool_synthetic_b	synthetic_b	0.15	Same as above.
tool_synthetic_c	synthetic_c	0.15	Same as above.

Formula applied: 0.40 × success_rate + 0.35 × test_coverage + 0.25 × speed_factor = 0.40×0 + 0.35×0 + 0.25×0 = 0.00 → floored to 0.15 (established insufficient-data baseline for never-used skills).
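The flooring step can be sketched as a small helper that returns both the stored score and a rationale string. The function name, the constant name, and the rationale wording are hypothetical. Whether the floor is also applied to skills that do have telemetry is not stated here, so this sketch simply takes the maximum of the raw score and 0.15.

```python
INSUFFICIENT_DATA_FLOOR = 0.15  # baseline stated in this entry


def finalize_score(raw_score: float, invocations: int) -> tuple[float, str]:
    """Apply the insufficient-data floor and attach a short rationale."""
    score = max(raw_score, INSUFFICIENT_DATA_FLOOR)
    if invocations == 0:
        rationale = "Insufficient data: 0 invocations, no telemetry"
    else:
        rationale = f"Scored from {invocations} invocations"
    return score, rationale


# The three synthetic fixtures: raw formula result 0.00, never invoked.
fixture_score, fixture_rationale = finalize_score(0.0, 0)
print(fixture_score, fixture_rationale)
```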

Structural note: These 3 entries appear to be test DB fixtures (description: "Synthetic test tool", frozen=True, names synthetic_a/b/c). They have no real telemetry and no associated skill bundle. If they are only testing infrastructure they should be cleaned up or properly retired; if they represent planned tools, they need input schemas and real code paths.

Payload JSON

{
  "requirements": {
    "analysis": 7,
    "coding": 8
  }
}

Sibling Tasks in Quest (Forge)

○ [Forge] Integrate tools with debate engine (P95)

○ [Forge] Reproducible analysis capsules and artifact supply chain (P93)

○ [Forge] Benchmark answer-key migration to dataset registry (driver #31) (P93)

○ [Forge] CI: Experiment claim driver — pick high-IIG experiments for execution (P93)

○ [Forge] Benchmark evaluation harness — run top 50 hypotheses through 6 registered benchmarks, store predictive scores (P92)

○ [Forge] CI: Paper replication target selector (P91)

○ [Forge] Artifact enrichment quest — evaluation context, cross-links, provenance (P82)

○ [Forge] Reduce PubMed metadata backlog for papers missing abstracts (P82)

○ [Forge] CI: Test all scientific tools for availability (P78)

○ [Forge] Execute: testes-gonadal RNA-seq experiment 5b0bb7af (P70)

[Forge] Score performance for 25 unscored registered skills (done; analysis:7, coding:8)