[Forge] Score performance for 25 unscored registered skills done analysis:7 coding:8

Registered skills have missing or zero performance_score. Skill scores guide routing, playground exposure, and tool-library maintenance.

Verification:

  • 25 skills have calibrated performance_score values or documented insufficient-data status
  • Scores are based on tool call success, latency, usage, and code-path health
  • Remaining unscored skill count is reduced

Start by reading this task's spec and checking for duplicate recent work.
Spec File

Goal

Calibrate performance scores for registered skills that currently have no meaningful score. Skill scores support routing, maintenance, and tool-playground exposure.

Acceptance Criteria

☐ A concrete batch of unscored skills is reviewed
☐ Each reviewed skill receives a calibrated performance_score or insufficient-data rationale
☐ Scores use tool call success, latency, usage, and code-path health
☐ Before/after unscored skill counts are recorded

Approach

  • Select skills with performance_score NULL or 0, ordered by usage and category.
  • Inspect tool_calls, tool_invocations, latency, error rates, and code_path existence.
  • Persist calibrated scores or documented insufficient-data rationale through the standard DB path.
  • Verify updated scores and remaining backlog.
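
The selection and before/after counting steps above can be sketched as follows. This is a minimal illustration, not the project's code: it uses an in-memory SQLite database as a stand-in for the live PostgreSQL instance, and the columns `skill_id`, `status`, and `duration_ms` on `tool_calls`, as well as the helper names, are assumptions made for the example.

```python
import sqlite3

UNSCORED = "performance_score IS NULL OR performance_score = 0"

def count_unscored(conn):
    """Number of skills that still lack a meaningful score."""
    return conn.execute(f"SELECT COUNT(*) FROM skills WHERE {UNSCORED}").fetchone()[0]

def unscored_batch(conn, limit=25):
    """Unscored skills, most-used first, then by category (hypothetical schema)."""
    return conn.execute(
        f"""
        SELECT s.id, s.category, COUNT(t.skill_id) AS usage
        FROM skills AS s
        LEFT JOIN tool_calls AS t ON t.skill_id = s.id
        WHERE s.performance_score IS NULL OR s.performance_score = 0
        GROUP BY s.id, s.category
        ORDER BY usage DESC, s.category, s.id
        LIMIT ?
        """,
        (limit,),
    ).fetchall()

# Tiny illustrative dataset; table contents are made up for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE skills (id TEXT PRIMARY KEY, category TEXT, performance_score REAL);
CREATE TABLE tool_calls (skill_id TEXT, status TEXT, duration_ms REAL);
INSERT INTO skills VALUES
  ('tool_a', 'search', NULL), ('tool_b', 'search', 0), ('tool_c', 'search', 0.9);
INSERT INTO tool_calls VALUES
  ('tool_b', 'success', 120), ('tool_b', 'error', 300), ('tool_a', 'success', 80);
""")

before = count_unscored(conn)
batch = unscored_batch(conn)
```

Recording `before` prior to scoring and calling `count_unscored` again afterwards gives the before/after counts that the acceptance criteria ask for.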

Dependencies

    • q-cc0888c0004a - Agent Ecosystem quest

    Dependents

    • Forge skill registry, routing, and tool quality

    Work Log

    2026-04-26 05:23 UTC — Task bec30a01 benchmark run

    • Task bec30a01-e196-4d26-a051-e9e808b95146 ran benchmark on 26 unscored skills.
    • Scored all 26 using tool_calls telemetry: formula = 0.5 + 0.3 * success_rate + 0.2 * speed_factor.
    • Score range: 0.150 (never-used playground tools) to 0.994 (openfda-adverse-events, 1 call, 100% SR).
    • Result: 26→0 unscored skills. All 682 skills now have performance_score > 0.
    • Commit: 0267ccb80 ([Forge] Benchmark 25 registered skills by performance and accuracy [task:bec30a01-e196-4d26-a051-e9e808b95146]).
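
As a rough sketch, reading the benchmark formula in this entry as 0.5 + 0.3 × success_rate + 0.2 × speed_factor, the score could be computed as below. The function name and the clamping of inputs to [0, 1] are assumptions for illustration; the entry does not say how speed_factor was derived.

```python
def benchmark_score(success_rate: float, speed_factor: float) -> float:
    """0.5 + 0.3 * success_rate + 0.2 * speed_factor, with inputs clamped to [0, 1]."""
    sr = min(max(success_rate, 0.0), 1.0)  # clamping is an assumption, not from the log
    sf = min(max(speed_factor, 0.0), 1.0)
    return 0.5 + 0.3 * sr + 0.2 * sf

# A skill with a 100% success rate and a speed factor of 0.97.
example = round(benchmark_score(1.0, 0.97), 3)
```

Under this reading, a 100% success rate with a speed factor of 0.97 reproduces the 0.994 top score. The formula alone cannot go below 0.5, so the 0.150 reported for never-used playground tools was presumably assigned by a separate insufficient-data rule.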

    2026-04-21 - Quest engine template

    • Created reusable spec for quest-engine generated skill performance scoring tasks.

    Already Resolved — 2026-04-21 21:27:05Z

    • Task 85d9ad14-3519-4ebd-9e5e-e4189ac7b2e8 was stale by the time this slot started: live PostgreSQL verification through scidex.core.database.get_db() found 282 registered skills and 0 skills with performance_score IS NULL OR performance_score = 0.
    • Evidence query also found score coverage across the current Forge telemetry base: min score 0.15, max score 1.0, average score 0.9271; tool_calls has 28,280 rows with 27,889 successes, 390 errors, and 1544.2 ms average nonzero latency.
    • Code-path health spot check grouped the registry by skills.code_path; the primary tools.py path has 112 scored skills and exists in-repo, and registered forge/skills/*/SKILL.md paths sampled from the scored registry exist.
    • Resolution source: prior branch commit eb7917ecf ([Forge] Score performance for 25 unscored registered skills [task:c82d378b-5192-4823-9193-939dd71935d1]) plus verification commit 70fbe70a2 ([Verify] Skill scoring already resolved — 26→1 unscored [task:c82d378b-5192-4823-9193-939dd71935d1]) addressed the scoring backlog; current live DB now has 0 remaining unscored skills, satisfying this task's <= 0 verification target.
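
The aggregate telemetry counts quoted in this entry can be turned into rates with a few lines of arithmetic. The snippet below is only a worked check of the reported numbers; the variable names are illustrative.

```python
# Counts as reported in the work-log entry.
total_calls = 28_280
successes = 27_889
errors = 390

success_rate = successes / total_calls
error_rate = errors / total_calls
# Rows that are neither a success nor an error (for example, another status value).
unclassified = total_calls - successes - errors

print(f"success rate: {success_rate:.4f}")
print(f"error rate:   {error_rate:.4f}")
print(f"unclassified rows: {unclassified}")
```

The success and error counts sum to 28,279, so one row in tool_calls falls outside both categories.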

    Already Resolved — 2026-04-21 21:35:00Z

    • Re-verified: 282 skills, 0 unscored (performance_score IS NULL OR = 0). Score stats unchanged: min=0.15, max=1.0, avg=0.9271. Task acceptance criteria already satisfied by prior work.

    2026-04-27 06:32 UTC — Task 610a8b3c: rescore 20 skills by test coverage + error rate

    • Rescored 20 registered skills using two-dimensional formula:
    score = 0.40 * success_rate + 0.35 * test_coverage + 0.25 * speed_factor
    • Test coverage audit (manual scan of tests/):
    - 1.0 — tool_pubmed_search, tool_paper_figures (dedicated test files)
    - 0.8 — tool_research_topic (success/failure paths in test_agora_orchestrator_tools.py)
    - 0.6 — tool_gtex_tissue_expression, tool_open_targets_associations, tool_semantic_scholar_search (indirect mock references)
    - 0.2 — remaining 14 skills (no coverage found)
    • Error rate from tool_calls telemetry (per-skill errors / total calls)
    • Speed factor penalises average latency above 500 ms (drops to 0 at 10.5 s)
    • Notable corrections from previous blanket 1.0 scores:
    - tool_pubchem_compound 1.000 → 0.561 (39.7% error rate, no tests)
    - tool_msigdb_gene_sets 1.000 → 0.574 (34.0% error rate, no tests)
    - tool_expression_atlas_differential 1.000 → 0.587 (15.0% error rate, 3.4 s average latency, no tests)
    - tool_brainspan_expression 1.000 → 0.665 (13.7% error rate, no tests)
    - tool_mgi_mouse_models 1.000 → 0.655 (16.3% error rate, no tests)
    - tool_pubmed_search 1.000 → 0.996 (0.3% error rate, full test coverage) ✓
    • Script: score_skills_by_coverage_and_errors.py (checked in)
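
A minimal sketch of the rescoring formula from this entry follows. It is not the checked-in script: the function names are hypothetical, and the linear decay of the speed factor between 500 ms and 10.5 s is an assumption (the log states only the two endpoints).

```python
def speed_factor(avg_latency_ms: float) -> float:
    """1.0 up to 500 ms, then an assumed linear fall to 0.0 at 10,500 ms."""
    if avg_latency_ms <= 500.0:
        return 1.0
    return max(0.0, 1.0 - (avg_latency_ms - 500.0) / 10_000.0)

def coverage_score(success_rate: float, test_coverage: float, avg_latency_ms: float) -> float:
    """0.40 * success_rate + 0.35 * test_coverage + 0.25 * speed_factor."""
    return (
        0.40 * success_rate
        + 0.35 * test_coverage
        + 0.25 * speed_factor(avg_latency_ms)
    )

# Inputs taken from the tool_expression_atlas_differential line:
# 15.0% error rate, 3.4 s average latency, no tests (coverage 0.2).
score = coverage_score(success_rate=0.85, test_coverage=0.2, avg_latency_ms=3400)
```

With these rounded inputs the sketch gives about 0.5875, in line with the 0.587 logged for that skill, which suggests the linear assumption is at least close to what the script does.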

    2026-04-28 UTC — Task ed11b1ae: score 3 synthetic test fixtures

    Before: 3 unscored skills (performance_score IS NULL). After: 0 unscored.

    Skills scored:

    | id | name | score | rationale |
    |---|---|---|---|
    | tool_synthetic_a | synthetic_a | 0.15 | Insufficient data: 0 invocations, no telemetry, no test coverage, code_path tool_roi.py partially resolves (scidex/forge/tool_roi.py exists). Test fixture created 2026-04-27; frozen. |
    | tool_synthetic_b | synthetic_b | 0.15 | Same as above. |
    | tool_synthetic_c | synthetic_c | 0.15 | Same as above. |

    Formula applied: 0.40 × success_rate + 0.35 × test_coverage + 0.25 × speed_factor = 0.40×0 + 0.35×0 + 0.25×0 = 0.00 → floored to 0.15 (established insufficient-data baseline for never-used skills).
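
The flooring step described above amounts to taking the larger of the raw score and the baseline. A small sketch, with the constant and function names chosen for illustration only:

```python
INSUFFICIENT_DATA_FLOOR = 0.15  # baseline for never-used skills, as stated in the log

def apply_floor(raw_score: float, floor: float = INSUFFICIENT_DATA_FLOOR) -> float:
    """Raise scores below the insufficient-data baseline up to that baseline."""
    return max(floor, raw_score)

# All three inputs are zero for the synthetic fixtures.
raw = 0.40 * 0.0 + 0.35 * 0.0 + 0.25 * 0.0
final = apply_floor(raw)
```

Scores above the baseline pass through unchanged, so the floor affects only skills with little or no evidence.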

    Structural note: These 3 entries appear to be test DB fixtures (description: "Synthetic test tool", frozen=True, names synthetic_a/b/c). They have no real telemetry and no associated skill bundle. If they are only testing infrastructure they should be cleaned up or properly retired; if they represent planned tools, they need input schemas and real code paths.

    Payload JSON
    {
      "requirements": {
        "analysis": 7,
        "coding": 8
      }
    }
