[Forge] Deprecated-tool detector - find tools no agents use


Effort: standard

Goal

scidex/forge/forge_tools.py:43 register_tool adds entries to the skills
table, but today there is no reverse audit asking "which registered tools
have not been invoked in the last 90 days?" As a result, tools.py accretes
dead weight: tools written for a single debate and never used again still
show up in agent prompt manifests and confuse model selection. Build a
periodic detector that flags zero-usage and chronically failing tools,
classifies the cause (broken? superseded? niche?), and routes each one to
the Senate for a decision: deprecate, retire, or rewrite.

Acceptance Criteria

scidex/forge/deprecated_tool_detector.py::scan(window_days=90) returns rows where times_used = 0 in the window, success_rate_in_window < 0.3, or times_used < 5 (so that low-usage tools can be classified niche).
☑ For each candidate, classify the cause: never_invoked; chronic_failure (success_rate < 0.3); superseded (LLM check: does a tool with an overlapping signature have success_rate > 0.7?); niche (used fewer than 5 times but successful).
☑ Migration: tool_deprecation_candidates(tool_id PK, classification, evidence_json, scanned_at, decision, decided_at, decided_by) — created inline via ensure_table().
☑ LLM-graded supersession check (Sonnet effort=standard): given (deprecated_tool_signature, candidate_supersessor_signatures), returns boolean + rationale.
☑ Weekly cron in scidex/senate/scheduled_tasks.py; emits a tool_deprecation_review Senate proposal grouping all candidates by classification.
/forge/tools/deprecation page lists candidates with classification badges and "decide" buttons (delegates to apply_decision).
GET /api/forge/tools/deprecation returns the candidate list.
☑ When a candidate receives the decision retire, skills.retired_at is set and the candidate row is marked retired_at (the PR auto-comment is deferred to the review workflow).
☑ Test: seed 3 mock tools — A has 0 invocations, B has 5/20 successes, C has 3/3 successes — scanner classifies (A: never_invoked, B: chronic_failure, C: niche); proposal payload contains all three. Verified 2026-04-28.
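
The classification rules above can be sketched as a pure function. This is a minimal sketch, not the contents of deprecated_tool_detector.py: ToolUsage, is_candidate, classify, and group_by_classification are hypothetical names, the LLM-graded supersession check is replaced by an injectable is_superseded callback, and checking superseded before the other labels is an assumed precedence. It also assumes the scan filter admits tools with fewer than 5 uses, since tool C (3/3 successes) would otherwise never reach the niche rule.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Thresholds from the acceptance criteria.
CHRONIC_FAILURE_RATE = 0.3  # success_rate below this => chronic_failure
NICHE_MAX_USES = 5          # fewer uses than this (and not failing) => niche


@dataclass
class ToolUsage:
    """Usage counts for one registered tool inside the scan window (hypothetical type)."""
    tool_id: str
    times_used: int
    successes: int

    @property
    def success_rate(self) -> float:
        # 0.0 for unused tools; callers test times_used == 0 first anyway.
        return self.successes / self.times_used if self.times_used else 0.0


def is_candidate(u: ToolUsage) -> bool:
    """Scan filter; the times_used < NICHE_MAX_USES clause is an assumption."""
    return (
        u.times_used == 0
        or u.success_rate < CHRONIC_FAILURE_RATE
        or u.times_used < NICHE_MAX_USES
    )


def classify(
    u: ToolUsage,
    is_superseded: Callable[[str], bool] = lambda tool_id: False,
) -> Optional[str]:
    """Return the deprecation classification, or None for a healthy tool."""
    if not is_candidate(u):
        return None
    if is_superseded(u.tool_id):  # stand-in for the LLM-graded check
        return "superseded"
    if u.times_used == 0:
        return "never_invoked"
    if u.success_rate < CHRONIC_FAILURE_RATE:
        return "chronic_failure"
    # Reaching here means 0 < times_used < NICHE_MAX_USES and success_rate >= 0.3.
    return "niche"


def group_by_classification(usages: list[ToolUsage]) -> dict[str, list[str]]:
    """Group candidate tool_ids by classification, as the proposal payload does."""
    groups: dict[str, list[str]] = {}
    for u in usages:
        label = classify(u)
        if label is not None:
            groups.setdefault(label, []).append(u.tool_id)
    return groups
```

Seeding ToolUsage("A", 0, 0), ToolUsage("B", 20, 5), and ToolUsage("C", 3, 3) reproduces the test scenario: A is never_invoked, B (25% success) is chronic_failure, and C is niche. Keeping the supersession check behind a callback lets the rule-based labels be tested without any LLM calls.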

Approach

  • Read scidex/forge/forge_tools.py:67 log_tool_call to learn the success/duration schema.
  • Tool signature = function name + argument names, extracted from the AST of scidex/forge/tools.py rather than by string matching.
  • The supersession check is two-stage: cluster tool signatures by cosine similarity of their docstring embeddings, then ask the LLM whether one tool in each intra-cluster pair subsumes the other.
  • The PR-opening path uses an existing helper in scripts/orchestra_pr.py if one exists; otherwise it runs gh pr create via subprocess (see the "Creating pull requests" pattern in CLAUDE.md).
  • Idempotency: re-running the scan within 24 h short-circuits via the existing scanned_at row.

Dependencies

  • q-tools-cost-benchmark: supplies success-rate and ROI signals.
  • scidex/forge/forge_tools.py: source of the skills table.
  • scidex/senate/governance.py:159 create_proposal: the proposal channel.

Work Log

2026-04-28: Implementation complete [task:29e3ec83-cd8d-4908-a34d-72719c05b6bc]

Files created/modified:

  • scidex/forge/deprecated_tool_detector.py: new module with ensure_table(), scan(window_days=90, force=False), emit_senate_proposal(), apply_decision(), get_candidates(), _check_supersession() (LLM-graded), _get_tool_signatures_from_ast() (AST-based), and _upsert_candidate() (commits each row in isolation to avoid TX poisoning)
  • scidex/senate/scheduled_tasks.py: added DeprecatedToolDetectorTask on a weekly cron (10080-minute interval)
  • api.py: added GET /api/forge/tools/deprecation (JSON), POST /api/forge/tools/deprecation/{tool_id}/decide, and GET /forge/tools/deprecation (HTML page with classification badges and decide buttons)
  • tests/test_deprecated_tool_detector.py: test harness for the 3-tool seed scenario
  • docs/planning/specs/q-tools-deprecated-detector_spec.md: this file
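
The per-row commit pattern behind _upsert_candidate, together with the 24 h idempotency check from the Approach, can be sketched against the tool_deprecation_candidates schema from the acceptance criteria. This sketch uses the standard-library sqlite3 module as a stand-in for PostgreSQL (the upsert needs SQLite 3.24 or newer); upsert_candidate, recently_scanned, and the NOT NULL constraints are assumptions. The ON CONFLICT DO UPDATE syntax is shared by both databases, but the placeholder style is not (? here, %s in psycopg). In PostgreSQL a failed statement aborts the whole transaction until a rollback, which is what committing or rolling back each row separately guards against; SQLite rolls back only the failing statement, so the sketch shows the pattern rather than reproducing the PostgreSQL failure.

```python
import json
import sqlite3
from datetime import datetime, timedelta, timezone

RESCAN_INTERVAL = timedelta(hours=24)  # idempotency window from the Approach


def ensure_table(conn: sqlite3.Connection) -> None:
    """Create the candidates table inline if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tool_deprecation_candidates (
               tool_id        TEXT PRIMARY KEY,
               classification TEXT NOT NULL,  -- NOT NULL constraints are assumptions
               evidence_json  TEXT NOT NULL,
               scanned_at     TEXT NOT NULL,
               decision       TEXT,
               decided_at     TEXT,
               decided_by     TEXT
           )"""
    )
    conn.commit()


def recently_scanned(conn: sqlite3.Connection, now: datetime) -> bool:
    """True if any candidate row was written within RESCAN_INTERVAL of `now`."""
    (latest,) = conn.execute(
        "SELECT MAX(scanned_at) FROM tool_deprecation_candidates"
    ).fetchone()
    return latest is not None and now - datetime.fromisoformat(latest) < RESCAN_INTERVAL


def upsert_candidate(conn: sqlite3.Connection, tool_id: str, classification: str,
                     evidence: dict, now: datetime) -> bool:
    """Write one row in its own transaction; a failure rolls back only this row."""
    try:
        conn.execute(
            """INSERT INTO tool_deprecation_candidates
                   (tool_id, classification, evidence_json, scanned_at)
               VALUES (?, ?, ?, ?)
               ON CONFLICT(tool_id) DO UPDATE SET
                   classification = excluded.classification,
                   evidence_json  = excluded.evidence_json,
                   scanned_at     = excluded.scanned_at""",
            (tool_id, classification, json.dumps(evidence), now.isoformat()),
        )
        conn.commit()
        return True
    except sqlite3.Error:
        conn.rollback()  # discard only this row's write; earlier rows stay committed
        return False
```

Because the upsert only rewrites classification, evidence_json, and scanned_at, a re-scan leaves any decision, decided_at, and decided_by already recorded for a tool untouched.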

Test results:

  • Classification logic verified: A (0 invocations) → never_invoked, B (5/20 = 25%) → chronic_failure, C (3/3, 3 uses) → niche
  • emit_senate_proposal creates a tool_deprecation_review proposal with all 3 test tools in metadata.tool_ids
  • The LLM supersession check is capped at 20 calls per scan (MAX_LLM_SUPERSESSION_CHECKS) to control cost
  • Per-row DB commits in _upsert_candidate keep a PostgreSQL transaction abort from propagating across the scan loop
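
The two-stage supersession check, with the 20-call cap noted above, might look like the following. A minimal sketch: clustering is single-linkage over a hypothetical cosine threshold of 0.8, embeddings are plain float lists from whatever embedding model is in use, and cosine, cluster_by_cosine, and supersession_checks are hypothetical names. Each returned (candidate, peers) entry corresponds to one LLM call with the (deprecated_tool_signature, candidate_supersessor_signatures) shape from the acceptance criteria, and peers are limited to tools with success_rate > 0.7.

```python
import math
from itertools import combinations

MAX_LLM_SUPERSESSION_CHECKS = 20  # per-scan cap from the work log


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_by_cosine(embeddings: dict[str, list[float]],
                      threshold: float = 0.8) -> list[list[str]]:
    """Single-linkage clusters: any pair with cosine >= threshold is merged (union-find)."""
    names = sorted(embeddings)
    parent = {n: n for n in names}

    def find(n: str) -> str:
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters: dict[str, list[str]] = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(clusters.values())


def supersession_checks(candidates: set[str],
                        clusters: list[list[str]],
                        success_rates: dict[str, float],
                        max_checks: int = MAX_LLM_SUPERSESSION_CHECKS,
                        min_peer_rate: float = 0.7) -> list[tuple[str, list[str]]]:
    """One (candidate, peers) entry per LLM call, stopping once max_checks is reached."""
    checks: list[tuple[str, list[str]]] = []
    for cluster in clusters:
        for cand in cluster:
            if cand not in candidates:
                continue
            peers = [p for p in cluster
                     if p != cand and success_rates.get(p, 0.0) > min_peer_rate]
            if not peers:
                continue  # no healthy tool in this cluster could supersede it
            if len(checks) >= max_checks:
                return checks
            checks.append((cand, peers))
    return checks
```

Capping the number of (candidate, peers) entries rather than individual pairs bounds cost in LLM calls, which is the unit the 20-call limit is stated in.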

Tasks using this spec (1)
[Forge] Deprecated-tool detector - find tools no agents use
File: q-tools-deprecated-detector_spec.md
Modified: 2026-05-01 20:13
Size: 4.4 KB