[Forge] Tool-call cost-benchmark - which tool maximises output per dollar done

← AI Tools Landscape
Joins tool_invocations x cost_ledger x downstream impact; ranks tools by ROI; chronic low-ROI tools auto-flagged.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

Squash merge: orchestra/task/8e3829d0-tool-call-cost-benchmark-which-tool-maxi (6 commits) (#1019) 2026-04-27
Spec File

Effort: thorough

Goal

scidex/forge/forge_tools.py:67 log_tool_call records duration_ms and success per invocation, and Orchestra's cost_ledger tracks token spend,
but no view ties the two together to answer "for $1 of compute spent on
SciDEX scientific work, which of the 58+ tools in tools.py produced the
highest-impact output?". We're flying blind on tool ROI: agents over-use
the cheap-but-shallow pubmed_search and under-use expensive but
high-information gtex_tissue_expression not because of evidence but
because of habit. Build a tool ROI dashboard that joins tool_invocations × cost_ledger × downstream impact (citations of the
result, hypothesis price moves it triggered, debate outcomes it shifted)
and ranks tools by (impact / dollar).

Acceptance Criteria

scidex/forge/tool_roi.py::compute_tool_roi(tool_id, window_days=30) -> dict returning {invocations, total_cost_usd, impact_score, roi_per_dollar, percentile_rank}.
☐ impact_score = 0.3 * citations_in_artifacts (count of artifact_links rows with link_type='derived_from') + 0.4 * downstream_hypothesis_price_delta (sum of |delta| from price_history within 24h of invocation) + 0.3 * debate_outcome_shift (1.0 if a debate cited the invocation and its outcome flipped).
☐ Cost = duration_ms-derived compute cost (use Orchestra cost_ledger model_cost row for the invoking agent's model) + per-tool API cost (lookup tool_api_costs(tool_name PK, cost_per_call_usd) table; default 0).
☐ Migration: tool_roi_daily(day, tool_id, invocations, total_cost_usd, impact_score, roi_per_dollar, PRIMARY KEY (day, tool_id)).
☐ Daily cron in scidex/senate/scheduled_tasks.py populates tool_roi_daily.
/forge/tools/roi page renders a sortable table + bubble chart (x=cost, y=impact, size=invocations).
GET /api/forge/tools/roi?days=30 returns the leaderboard JSON.
☐ When roi_per_dollar < 10th_percentile for a tool over 4 consecutive weeks, auto-emit a tool_review Senate proposal asking whether to deprecate or boost.
☐ Test: seed 3 tools with synthetic invocations and known impact; verify ROI ranking matches hand-calc; chronic-low ROI triggers proposal after 28 days.
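
The scoring criteria above can be sketched as follows. This is a minimal illustration, not the shipped compute_tool_roi: the ToolStats container, the leaderboard helper, the zero-cost handling, the "share of tools at or below" percentile definition, and every number in EXAMPLE are assumptions made for the example (slow_tool is a hypothetical third tool). Only the three weights and the output field names come from the spec.

```python
import bisect
from dataclasses import dataclass

# Weights from the acceptance criteria.
W_CITATIONS, W_PRICE, W_DEBATE = 0.3, 0.4, 0.3


@dataclass
class ToolStats:
    """Pre-aggregated inputs for one tool over the window (hypothetical container)."""
    tool_id: str
    invocations: int
    total_cost_usd: float
    citations: int          # artifact_links rows with link_type='derived_from'
    price_delta_abs: float  # sum of |delta| within 24h of each invocation
    debate_flips: float     # 1.0 per cited debate whose outcome flipped (assumed summed)


def impact_score(s: ToolStats) -> float:
    return (W_CITATIONS * s.citations
            + W_PRICE * s.price_delta_abs
            + W_DEBATE * s.debate_flips)


def roi_per_dollar(s: ToolStats) -> float:
    # Assumption: a tool with no recorded cost gets ROI 0.0 instead of infinity.
    if s.total_cost_usd <= 0:
        return 0.0
    return impact_score(s) / s.total_cost_usd


def leaderboard(stats: list[ToolStats]) -> list[dict]:
    """Rank all tools in one pass: sort the ROIs once, then look each rank up."""
    rois = {s.tool_id: roi_per_dollar(s) for s in stats}
    ordered = sorted(rois.values())
    rows = []
    for s in stats:
        roi = rois[s.tool_id]
        # Percentile rank = share of tools whose ROI is <= this tool's ROI.
        at_or_below = bisect.bisect_right(ordered, roi)
        rows.append({
            "tool_id": s.tool_id,
            "invocations": s.invocations,
            "total_cost_usd": s.total_cost_usd,
            "impact_score": impact_score(s),
            "roi_per_dollar": roi,
            "percentile_rank": 100.0 * at_or_below / len(ordered),
        })
    return sorted(rows, key=lambda r: r["roi_per_dollar"], reverse=True)


# Synthetic figures, chosen only so the ranking can be checked by hand.
EXAMPLE = [
    ToolStats("pubmed_search", 100, 2.0, citations=10, price_delta_abs=0.5, debate_flips=0),
    ToolStats("gtex_tissue_expression", 10, 5.0, citations=20, price_delta_abs=10.0, debate_flips=2),
    ToolStats("slow_tool", 5, 10.0, citations=1, price_delta_abs=0.0, debate_flips=0),
]
```

With these synthetic inputs, gtex_tissue_expression scores 10.6 impact for $5.00 and pubmed_search 3.2 impact for $2.00, so the expensive tool ranks first on impact per dollar despite costing more per call.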

Approach

  • Read scidex/forge/forge_tools.py:22 init_tool_invocations_table and :67 log_tool_call for the source schema.
  • Citation count = COUNT(*) FROM artifact_links WHERE source = invocation_id; downstream_hypothesis_price_delta = extract hypothesis_id from the tool_invocations.outputs JSON, then look up its price moves via market_dynamics.get_price_history.
  • Cost lookups: borrow orchestra.cost.compute_cost_for_window(model, tokens) rather than re-implementing pricing.
  • Bubble chart uses the same SVG idiom as market_dynamics.generate_market_overview_svg (line 1428) — no JS chart library.
  • Cap analysis to 90-day rolling window (older data archived to tool_roi_archive).
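
The library-free bubble chart mentioned in the approach can be sketched like this. It does not reproduce market_dynamics.generate_market_overview_svg, whose internals are not shown here; the function name bubble_chart_svg, the canvas size, the colours, and the square-root radius scaling are all assumptions. Only the axis mapping (x=cost, y=impact, size=invocations) comes from the spec.

```python
import math
from xml.sax.saxutils import escape


def bubble_chart_svg(rows, width=480, height=320, pad=40):
    """Render rows as an SVG string; each row is a dict with the keys
    tool_id, total_cost_usd, impact_score, invocations."""
    rows = list(rows)
    # Fall back to 1 so an empty or all-zero data set does not divide by zero.
    max_cost = max((r["total_cost_usd"] for r in rows), default=0) or 1.0
    max_impact = max((r["impact_score"] for r in rows), default=0) or 1.0
    max_inv = max((r["invocations"] for r in rows), default=0) or 1

    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" '
        f'height="{height}" viewBox="0 0 {width} {height}">',
        # x axis (cost) and y axis (impact)
        f'<line x1="{pad}" y1="{height - pad}" x2="{width - pad}" '
        f'y2="{height - pad}" stroke="#888"/>',
        f'<line x1="{pad}" y1="{pad}" x2="{pad}" y2="{height - pad}" stroke="#888"/>',
    ]
    for r in rows:
        cx = pad + (width - 2 * pad) * r["total_cost_usd"] / max_cost
        # SVG y grows downward, so higher impact means a smaller y coordinate.
        cy = height - pad - (height - 2 * pad) * r["impact_score"] / max_impact
        # Square-root scaling keeps bubble area proportional to invocations.
        radius = 4 + 16 * math.sqrt(r["invocations"] / max_inv)
        label = escape(r["tool_id"])  # tool ids may contain &, <, >
        parts.append(
            f'<circle cx="{cx:.1f}" cy="{cy:.1f}" r="{radius:.1f}" '
            f'fill="#4a90d9" fill-opacity="0.6"><title>{label}</title></circle>'
        )
    parts.append("</svg>")
    return "\n".join(parts)
```

The returned string can be embedded directly in the HTML page, and the title child gives each bubble a hover tooltip without any script.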

Dependencies

    • scidex/forge/forge_tools.py:22,67 — invocations source.
    • scidex/exchange/market_dynamics.py:747 get_price_history — price impact.
    • Orchestra cost_ledger (orchestra/cost.py:41).

Dependents

    • q-tools-deprecated-detector (consumes ROI signals).
    • q-tools-skill-marketplace (uses ROI as a price signal for scarce tools).

Work Log

    2026-04-28 — Implementation

    • scidex/forge/tool_roi.py created (380 lines)
    - init_tool_roi_tables() — creates tool_api_costs, tool_roi_daily, tool_roi_archive + indexes
    - compute_cost(skill_id, duration_ms, invocations) — estimates cost from duration_ms + API fees
    - _citation_impact(invocation_ids) — 0.3 × COUNT(artifact_links WHERE link_type='derived_from')
    - _hypothesis_price_impact(invocation_ids) — 0.4 × sum of |price deltas| within 24h
    - _debate_outcome_shift(invocation_ids) — 0.3 if debate cited invocation and outcome flipped
    - compute_tool_roi(tool_id, window_days=30) — main ROI function
    - compute_all_tool_roi(window_days=30) — batch compute all tools with O(1) percentile ranks (no recursive calls)
    - _compute_tool_roi_raw() — raw ROI without percentile (called by compute_all_tool_roi)
    - populate_daily_roi(target_date=None) — upserts ROI into tool_roi_daily + archives old rows
    - detect_chronic_low_roi_tools() — finds tools below 10th pctile for 4+ consecutive weeks
    - emit_tool_review_proposals() — inserts Senate tool_review proposals for chronic low-ROI tools

    • migrations/147_tool_roi_tables.py — creates tool_api_costs, tool_roi_daily, tool_roi_archive tables + indexes (renumbered from 146 to avoid conflict with 146_causal_effects_table.py)
    • scidex/senate/scheduled_tasks.py — added ToolRoiDailyTask scheduled task ("tool-roi-daily", daily, 1440 min)
    • api.py — added:
    - GET /api/forge/tools/roi — JSON leaderboard
    - GET /forge/tools/roi — HTML page with sortable table + SVG bubble chart

    • Test: seeded 3 synthetic tools with known cost/impact; verified ROI ranking correctly ordered (higher impact/dollar = higher rank).
    • Note: api.py has a pre-existing syntax error at line 5382 (em-dash in f-string, already in origin/main before this branch). Not caused by this work.
    • Commit: 3e0f452e8 on branch orchestra/task/8e3829d0-tool-call-cost-benchmark-which-tool-maxi
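
The chronic-low rule behind detect_chronic_low_roi_tools can be illustrated with a small pure-Python sketch. How the real function buckets weeks and computes the 10th percentile is not described above, so the following are assumptions: input is one dict per week (oldest first) mapping tool_id to roi_per_dollar, and the percentile uses linear interpolation between neighbouring values. The helper names percentile and chronic_low_roi are hypothetical.

```python
import math


def percentile(values, pct):
    """pct-th percentile with linear interpolation between sorted neighbours."""
    xs = sorted(values)
    if not xs:
        raise ValueError("percentile() needs at least one value")
    pos = (len(xs) - 1) * pct / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def chronic_low_roi(weekly_roi, weeks_required=4, pct=10.0):
    """Return tool_ids whose ROI was below the pct-th percentile in each of
    the most recent `weeks_required` weeks."""
    if len(weekly_roi) < weeks_required:
        return set()  # not enough history to call anything chronic
    flagged = None
    for week in weekly_roi[-weeks_required:]:
        if not week:
            return set()
        threshold = percentile(week.values(), pct)
        low = {tool for tool, roi in week.items() if roi < threshold}
        # A tool must be low in every week, so intersect week by week.
        flagged = low if flagged is None else flagged & low
    return flagged or set()
```

Interpolating the threshold matters for small tool sets: with only three tools, a nearest-rank 10th percentile equals the minimum ROI, so a strict "less than" comparison would never flag anything.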

    2026-04-28 — Fix row unpacking bugs

    Fixed two _PgRow tuple-unpacking bugs that caused crashes in production:

    • detect_chronic_low_roi_tools: for (tool_id, day, roi) in rows: → for row in rows: tool_id, day, roi = row[0], row[1], row[2]
    Root cause: _PgRow doesn't support tuple unpacking despite behaving like a sequence.
    • compute_tool_roi: O(n²) percentile computation replaced by O(n) batch approach in compute_all_tool_roi; also fixed same tuple-unpacking bug.
    • Commit: 8c3e9f12d [Forge] Fix _PgRow tuple-unpacking bugs in tool_roi [task:8e3829d0-...]
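
The unpacking failure and its index-based fix can be reproduced with a stand-in row type. FakeRow below is hypothetical and mimics only the symptom described above (indexable, but not iterable); it makes no claim about how _PgRow is actually implemented. group_roi_by_tool is likewise an illustrative helper, not a function from tool_roi.py.

```python
class FakeRow:
    """Hypothetical stand-in for a DB row: supports row[0] and row["col"],
    but deliberately blocks iteration, so tuple unpacking fails."""

    __iter__ = None  # iter(row) now raises TypeError, and so does unpacking

    def __init__(self, columns, values):
        self._columns = list(columns)
        self._values = list(values)

    def __getitem__(self, key):
        if isinstance(key, int):
            return self._values[key]
        return self._values[self._columns.index(key)]

    def __len__(self):
        return len(self._values)


def group_roi_by_tool(rows):
    """Collect (day, roi) pairs per tool using positional indexing only."""
    grouped = {}
    for row in rows:
        # Indexing works on any row type that supports row[i];
        # `for (tool_id, day, roi) in rows` would need the row to be iterable.
        tool_id, day, roi = row[0], row[1], row[2]
        grouped.setdefault(tool_id, []).append((day, roi))
    return grouped
```

Indexing by position is the narrowest contract to rely on here: it keeps working whether or not the driver's row type implements the iterator protocol.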