SciDEX - Task: [Forge] Score performance for 3 unscored registered skills

3 registered skills have missing or zero performance_score. Skill scores guide routing, playground exposure, and tool-library maintenance.

## Acceptance criteria (recommended; see 'Broader latitude' below)

- 3 skills have calibrated performance_score values or documented insufficient-data status
- Scores are based on tool call success, latency, usage, and code-path health
- Remaining unscored skill count is <= 0

## Before starting

1. Read this task's spec file and check for duplicate recent work.
2. Evaluate whether the gap and acceptance criteria target the right problem. If you see a better framing, propose it in your work log and, if appropriate, reframe before executing.
3. Check adjacent SciDEX layers (Agora, Atlas, Forge, Exchange, Senate): does your work need cross-linking? Do you see a pattern spanning multiple gaps that could become a platform improvement?

## Broader latitude (explicitly welcome)

You are a scientific discoverer, not just a task executor. Beyond the acceptance criteria above, you're invited to:

- **Question the framing.** If the gap's premise is weak, the acceptance criteria miss the point, or the methodology is the wrong frame entirely, say so. Propose a reframe with justification.
- **Propose structural improvements.** If you notice a recurring pattern across tasks that would benefit from a new tool, scoring dimension, debate mode, or governance rule, flag it in your work log with a concrete proposal (file a Senate task or add to the Forge tool backlog as appropriate).
- **Propose algorithmic improvements.** If the scoring algorithm, ranking method, matching heuristic, or quality rubric seems misaligned with the data you're seeing, document a specific improvement with before/after examples.
- **Strengthen artifacts beyond the minimum.** Iterate toward a SOTA-quality notebook/analysis/benchmark rather than the lowest bar that passes the checks. Fewer high-quality artifacts beat many shallow ones.

Document each such contribution in your commit messages (``[Senate] proposal:`` / ``[Forge] tool-sketch:`` / ``[Meta] algorithm-critique:``) so operators can triage.

Git Commits (1)

[Forge] Score 3 unscored synthetic skills at 0.15 insufficient-data baseline [task:ed11b1ae-04c1-4815-a9a7-14c5e71ea91f] (#1126) 2026-04-28

Spec File

Goal

Calibrate performance scores for registered skills that currently have no meaningful score. Skill scores support routing, maintenance, and tool-playground exposure.

Acceptance Criteria

☐ A concrete batch of unscored skills is reviewed

☐ Each reviewed skill receives a calibrated performance_score or insufficient-data rationale

☐ Scores use tool call success, latency, usage, and code-path health

☐ Before/after unscored skill counts are recorded

Approach

Select skills with performance_score NULL or 0, ordered by usage and category.

Inspect tool_calls, tool_invocations, latency, error rates, and code_path existence.

Persist calibrated scores or documented insufficient-data rationale through the standard DB path.

Verify updated scores and remaining backlog.
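The selection step in the approach above can be sketched as a single query. This is a minimal illustration only: it uses an in-memory SQLite database as a stand-in for the live PostgreSQL instance, and the column names `skill_id` and `status`, the alias `n_calls`, and the helper `unscored_skills` are assumptions made for the example, not the actual SciDEX schema. Ordering by category is omitted for brevity.

```python
import sqlite3


def unscored_skills(conn):
    """Return (skill id, call count) for skills scored NULL or 0, busiest first."""
    return conn.execute(
        """
        SELECT s.id, COUNT(c.skill_id) AS n_calls
        FROM skills AS s
        LEFT JOIN tool_calls AS c ON c.skill_id = s.id
        WHERE s.performance_score IS NULL OR s.performance_score = 0
        GROUP BY s.id
        ORDER BY n_calls DESC, s.id
        """
    ).fetchall()


# Hypothetical fixture data; the real tables live in PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE skills (id TEXT PRIMARY KEY, performance_score REAL);
    CREATE TABLE tool_calls (skill_id TEXT, status TEXT);
    INSERT INTO skills VALUES ('tool_a', NULL), ('tool_b', 0), ('tool_c', 0.9);
    INSERT INTO tool_calls VALUES
        ('tool_b', 'success'), ('tool_b', 'error'), ('tool_c', 'success');
    """
)
backlog = unscored_skills(conn)
print(backlog)
```

Running the same query before and after scoring gives the before/after backlog counts that the acceptance criteria ask for.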

Dependencies

q-cc0888c0004a - Agent Ecosystem quest

Dependents

Forge skill registry, routing, and tool quality

Work Log

2026-04-26 05:23 UTC — Task bec30a01 benchmark run

Task bec30a01-e196-4d26-a051-e9e808b95146 ran a benchmark on 26 unscored skills.
Scored all 26 using tool_calls telemetry: formula = 0.5 + 0.3 * success_rate + 0.2 * speed_factor.
Score range: 0.150 (never-used playground tools) to 0.994 (openfda-adverse-events, 1 call, 100% SR).
Result: 26→0 unscored skills. All 682 skills now have performance_score > 0.
Commit: 0267ccb80 ([Forge] Benchmark 25 registered skills by performance and accuracy [task:bec30a01-e196-4d26-a051-e9e808b95146]).
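
The telemetry formula from this run can be sketched as below. Two points are assumptions rather than facts from the log: that never-used skills bypass the formula and receive the 0.15 baseline directly (the formula alone cannot go below 0.5), and that `speed_factor` is a value in [0, 1] computed elsewhere. The function name `telemetry_score` is hypothetical.

```python
def telemetry_score(calls, successes, speed_factor, baseline=0.15):
    """Score = 0.5 + 0.3 * success_rate + 0.2 * speed_factor.

    Assumption: skills with no recorded calls get the baseline instead,
    since no success rate can be computed for them.
    """
    if calls == 0:
        return baseline
    success_rate = successes / calls
    return 0.5 + 0.3 * success_rate + 0.2 * speed_factor


# A never-used playground tool lands on the baseline.
print(telemetry_score(calls=0, successes=0, speed_factor=0.0))

# One successful call; the speed_factor of 0.97 is an assumed input chosen
# for illustration, not a value taken from the telemetry.
print(round(telemetry_score(calls=1, successes=1, speed_factor=0.97), 3))
```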

2026-04-21 - Quest engine template

Created reusable spec for quest-engine generated skill performance scoring tasks.

Already Resolved — 2026-04-21 21:27:05Z

Task 85d9ad14-3519-4ebd-9e5e-e4189ac7b2e8 was stale by the time this slot started: live PostgreSQL verification through scidex.core.database.get_db() found 282 registered skills and 0 skills with performance_score IS NULL OR performance_score = 0.
Evidence query also found score coverage across the current Forge telemetry base: min score 0.15, max score 1.0, average score 0.9271; tool_calls has 28,280 rows with 27,889 successes, 390 errors, and 1544.2 ms average nonzero latency.
Code-path health spot check grouped the registry by skills.code_path; the primary tools.py path has 112 scored skills and exists in-repo, and registered forge/skills/*/SKILL.md paths sampled from the scored registry exist.
Resolution source: prior branch commit eb7917ecf ([Forge] Score performance for 25 unscored registered skills [task:c82d378b-5192-4823-9193-939dd71935d1]) plus verification commit 70fbe70a2 ([Verify] Skill scoring already resolved — 26→1 unscored [task:c82d378b-5192-4823-9193-939dd71935d1]) addressed the scoring backlog; current live DB now has 0 remaining unscored skills, satisfying this task's <= 0 verification target.
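
A code-path health spot check like the one described above could look roughly like the following sketch. It assumes each registry entry stores a path relative to the repository root; the helper name `code_path_health` and the temporary directory standing in for the repository are hypothetical.

```python
import tempfile
from pathlib import Path


def code_path_health(repo_root, code_paths):
    """Map each registered code path to True if the file exists under repo_root."""
    root = Path(repo_root)
    return {path: (root / path).is_file() for path in code_paths}


# A temporary directory stands in for the repository checkout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tools.py").write_text("# placeholder module\n")
    health = code_path_health(tmp, ["tools.py", "missing_tool.py"])

print(health)
```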

Already Resolved — 2026-04-21 21:35:00Z

Re-verified: 282 skills, 0 unscored (performance_score IS NULL OR = 0). Score stats unchanged: min=0.15, max=1.0, avg=0.9271. Task acceptance criteria already satisfied by prior work.

2026-04-27 06:32 UTC — Task 610a8b3c: rescore 20 skills by test coverage + error rate

Rescored 20 registered skills using a three-term weighted formula:

score = 0.40 * success_rate + 0.35 * test_coverage + 0.25 * speed_factor

Test coverage audit (manual scan of tests/):

- 1.0 — tool_pubmed_search, tool_paper_figures (dedicated test files)
- 0.8 — tool_research_topic (success/failure paths in test_agora_orchestrator_tools.py)
- 0.6 — tool_gtex_tissue_expression, tool_open_targets_associations, tool_semantic_scholar_search (indirect mock references)
- 0.2 — remaining 14 skills (no coverage found)

Error rate comes from tool_calls telemetry (per-skill errors / total calls).
Speed factor penalises average latency above 500 ms (drops to 0 at 10.5 s).
Notable corrections from previous blanket 1.0 scores:

- tool_pubchem_compound 1.000 → 0.561 (39.7% error rate, no tests)
- tool_msigdb_gene_sets 1.000 → 0.574 (34.0% error rate, no tests)
- tool_expression_atlas_differential 1.000 → 0.587 (15.0% err, 3.4 s avg, no tests)
- tool_brainspan_expression 1.000 → 0.665 (13.7% error rate, no tests)
- tool_mgi_mouse_models 1.000 → 0.655 (16.3% error rate, no tests)
- tool_pubmed_search 1.000 → 0.996 (0.3% error rate, full test coverage) ✓

Script: score_skills_by_coverage_and_errors.py (checked in)
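
For illustration, the rescoring arithmetic in this entry can be sketched as follows. This is not the checked-in script. It assumes the speed factor falls linearly from 1 at 500 ms to 0 at 10.5 s (the log gives only the two endpoints) and that success_rate is 1 minus the error rate. The function names are hypothetical.

```python
def speed_factor(avg_latency_ms, free_ms=500.0, zero_ms=10_500.0):
    """1.0 up to free_ms, then a linear drop (assumed shape) to 0.0 at zero_ms."""
    if avg_latency_ms <= free_ms:
        return 1.0
    if avg_latency_ms >= zero_ms:
        return 0.0
    return 1.0 - (avg_latency_ms - free_ms) / (zero_ms - free_ms)


def coverage_score(error_rate, test_coverage, avg_latency_ms):
    """0.40 * success_rate + 0.35 * test_coverage + 0.25 * speed_factor."""
    success_rate = 1.0 - error_rate
    return (
        0.40 * success_rate
        + 0.35 * test_coverage
        + 0.25 * speed_factor(avg_latency_ms)
    )


# Inputs reported above for tool_expression_atlas_differential:
# 15.0% error rate, 3.4 s average latency, no tests (coverage 0.2).
print(coverage_score(error_rate=0.15, test_coverage=0.2, avg_latency_ms=3400))
```

Under the linear assumption this lands within rounding of the 0.587 reported above, which suggests the assumed shape is at least close to what the script uses.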

2026-04-28 UTC — Task ed11b1ae: score 3 synthetic test fixtures

Before: 3 unscored skills (performance_score IS NULL). After: 0 unscored.

Skills scored:

| id | name | score | rationale |
| --- | --- | --- | --- |
| tool_synthetic_a | synthetic_a | 0.15 | Insufficient data: 0 invocations, no telemetry, no test coverage, code_path `tool_roi.py` partially resolves (`scidex/forge/tool_roi.py` exists). Test fixture created 2026-04-27; frozen. |
| tool_synthetic_b | synthetic_b | 0.15 | Same as above. |
| tool_synthetic_c | synthetic_c | 0.15 | Same as above. |

Formula applied: 0.40 × success_rate + 0.35 × test_coverage + 0.25 × speed_factor = 0.40×0 + 0.35×0 + 0.25×0 = 0.00 → floored to 0.15 (established insufficient-data baseline for never-used skills).
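
The flooring step can be sketched as below. The constant and function names and the status labels are hypothetical. Applying the floor only to skills with zero invocations is an assumption, based on the baseline being described as one for never-used skills.

```python
INSUFFICIENT_DATA_BASELINE = 0.15


def apply_baseline(raw_score, invocations):
    """Return (score, status), flooring never-used skills at the baseline."""
    if invocations == 0:
        return max(raw_score, INSUFFICIENT_DATA_BASELINE), "insufficient-data"
    return raw_score, "calibrated"


# The three synthetic fixtures: every formula term is 0, so the raw score is 0.00.
raw = 0.40 * 0 + 0.35 * 0 + 0.25 * 0
print(apply_baseline(raw, invocations=0))
```

Returning a status alongside the score keeps a floored 0.15 distinguishable from a measured one, which matters if these fixtures are later retired or replaced by real tools.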

Structural note: These 3 entries appear to be test DB fixtures (description: "Synthetic test tool", frozen=True, names synthetic_a/b/c). They have no real telemetry and no associated skill bundle. If they are only testing infrastructure they should be cleaned up or properly retired; if they represent planned tools, they need input schemas and real code paths.

Payload JSON

{
  "requirements": {
    "analysis": 6,
    "reasoning": 6
  },
  "max_iterations": 15
}

Sibling Tasks in Quest (Forge)

○ [Forge] Integrate tools with debate engine (P95)

○ [Forge] Reproducible analysis capsules and artifact supply chain (P93)

○ [Forge] Benchmark answer-key migration to dataset registry (driver #31) (P93)

○ [Forge] CI: Experiment claim driver - pick high-IIG experiments for execution (P93)

○ [Forge] Benchmark evaluation harness - run top 50 hypotheses through 6 registered benchmarks, store predictive scores (P92)

○ [Forge] CI: Paper replication target selector (P91)

○ [Forge] Artifact enrichment quest - evaluation context, cross-links, provenance (P82)

○ [Forge] Reduce PubMed metadata backlog for papers missing abstracts (P82)

○ [Forge] CI: Test all scientific tools for availability (P78)

○ [Forge] Execute: testes-gonadal RNA-seq experiment 5b0bb7af (P70)

[Forge] Score performance for 3 unscored registered skills done analysis:6 reasoning:6