[Forge] Tool-call cost-benchmark - which tool maximises output per dollar

Effort: thorough

Goal

scidex/forge/forge_tools.py:67 log_tool_call records duration_ms and success per
invocation, and Orchestra's cost_ledger tracks token spend, but no view joins the
two to answer "for $1 of compute spent on SciDEX scientific work, which of the
58+ tools in tools.py produced the highest-impact output?". We have no data on
tool ROI: agents over-use the cheap-but-shallow pubmed_search and under-use the
expensive but high-information gtex_tissue_expression out of habit rather than
evidence. Build a tool ROI dashboard that joins tool_invocations × cost_ledger ×
downstream impact (citations of the result, hypothesis price moves it triggered,
debate outcomes it shifted) and ranks tools by (impact / dollar).

Acceptance Criteria

☐ scidex/forge/tool_roi.py::compute_tool_roi(tool_id, window_days=30) -> dict returning {invocations, total_cost_usd, impact_score, roi_per_dollar, percentile_rank}.
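The return shape and the impact-per-dollar ratio can be sketched as follows. This is a minimal illustration, not the code in tool_roi.py: the helper names `roi_per_dollar` and `roi_summary` are hypothetical, and reporting a zero-cost tool as `None` (rather than infinity or 0) is an assumed convention.

```python
from typing import Optional


def roi_per_dollar(impact_score: float, total_cost_usd: float) -> Optional[float]:
    """Impact per dollar; None when no cost was recorded (assumed convention)."""
    if total_cost_usd <= 0:
        return None
    return impact_score / total_cost_usd


def roi_summary(invocations: int, total_cost_usd: float, impact_score: float,
                percentile_rank: Optional[float] = None) -> dict:
    """Assemble a dict with the keys listed in the acceptance criterion."""
    return {
        "invocations": invocations,
        "total_cost_usd": total_cost_usd,
        "impact_score": impact_score,
        "roi_per_dollar": roi_per_dollar(impact_score, total_cost_usd),
        "percentile_rank": percentile_rank,
    }
```

Returning `None` for zero-cost tools keeps them out of numeric sorts until a cost is recorded; a real implementation may prefer a different sentinel.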

☐ impact_score = 0.3 * citations_in_artifacts (artifact_links link_type='derived_from' count) + 0.4 * downstream_hypothesis_price_delta (sum of |delta| from price_history within 24h of invocation) + 0.3 * debate_outcome_shift (1.0 if a debate cited the invocation and outcome flipped).
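A minimal sketch of the weighted sum, assuming the three components have already been aggregated for one tool. The weights come from the criterion above; the function and parameter names are illustrative only.

```python
# Weights from the acceptance criterion; all names below are illustrative.
W_CITATIONS = 0.3
W_PRICE_DELTA = 0.4
W_DEBATE_SHIFT = 0.3


def impact_score(citations: int, price_delta_abs_sum: float, debate_shifts: float) -> float:
    """Weighted sum of the three impact components for one tool.

    citations           -- count of artifact_links rows with link_type='derived_from'
    price_delta_abs_sum -- sum of |delta| from price_history within 24h of invocation
    debate_shifts       -- 1.0 for each cited debate whose outcome flipped
    """
    return (
        W_CITATIONS * citations
        + W_PRICE_DELTA * price_delta_abs_sum
        + W_DEBATE_SHIFT * debate_shifts
    )
```

Note that the components are on different scales (a raw count, a price movement, a 0/1 flag), so a tool with many citations can dominate the score unless the inputs are normalised first.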

☐ Cost = duration_ms-derived compute cost (use Orchestra cost_ledger model_cost row for the invoking agent's model) + per-tool API cost (lookup tool_api_costs(tool_name PK, cost_per_call_usd) table; default 0).
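The two cost terms can be sketched like this. How duration_ms converts to dollars is not specified here, so the per-second rate `usd_per_second` is an assumption standing in for whatever the cost_ledger model_cost row provides, and `api_costs` is a plain dict standing in for the tool_api_costs table.

```python
from typing import Mapping


def invocation_cost_usd(duration_ms: float, usd_per_second: float,
                        tool_name: str, api_costs: Mapping[str, float]) -> float:
    """Compute cost plus per-call API cost for a single invocation.

    usd_per_second is a hypothetical rate derived from the model's pricing;
    api_costs maps tool_name -> cost_per_call_usd and defaults to 0 when missing.
    """
    compute_cost = (duration_ms / 1000.0) * usd_per_second
    api_cost = api_costs.get(tool_name, 0.0)
    return compute_cost + api_cost
```

For example, a 2-second call at a rate of 0.01 USD per second to a tool with a 0.05 USD API fee costs 0.07 USD under these assumptions.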

☐ Migration: tool_roi_daily(day, tool_id, invocations, total_cost_usd, impact_score, roi_per_dollar, PRIMARY KEY (day, tool_id)).
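A sketch of the table and an idempotent daily upsert, using an in-memory SQLite database as a stand-in for the production database. The column types, the helper names, and the choice of NULL for zero-cost ROI are assumptions; only the column names and the primary key come from the criterion above.

```python
import sqlite3

# Column names and primary key follow the criterion; types are assumed.
DDL = """
CREATE TABLE IF NOT EXISTS tool_roi_daily (
    day            TEXT    NOT NULL,
    tool_id        TEXT    NOT NULL,
    invocations    INTEGER NOT NULL DEFAULT 0,
    total_cost_usd REAL    NOT NULL DEFAULT 0,
    impact_score   REAL    NOT NULL DEFAULT 0,
    roi_per_dollar REAL,
    PRIMARY KEY (day, tool_id)
)
"""

UPSERT = """
INSERT INTO tool_roi_daily
    (day, tool_id, invocations, total_cost_usd, impact_score, roi_per_dollar)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT (day, tool_id) DO UPDATE SET
    invocations    = excluded.invocations,
    total_cost_usd = excluded.total_cost_usd,
    impact_score   = excluded.impact_score,
    roi_per_dollar = excluded.roi_per_dollar
"""


def init_tool_roi_daily(conn: sqlite3.Connection) -> None:
    conn.execute(DDL)


def upsert_daily_row(conn, day, tool_id, invocations, total_cost_usd, impact_score):
    """Insert or overwrite the row for (day, tool_id), so re-running the cron is safe."""
    roi = impact_score / total_cost_usd if total_cost_usd > 0 else None
    conn.execute(UPSERT, (day, tool_id, invocations, total_cost_usd, impact_score, roi))
```

The composite primary key is what makes the upsert work: a second run for the same day replaces the row instead of duplicating it.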

☐ Daily cron in scidex/senate/scheduled_tasks.py populates tool_roi_daily.

☐ /forge/tools/roi page renders a sortable table + bubble chart (x=cost, y=impact, size=invocations).

☐ GET /api/forge/tools/roi?days=30 returns the leaderboard JSON.

☐ When a tool's roi_per_dollar stays below the 10th percentile for 4 consecutive weeks, auto-emit a tool_review Senate proposal asking whether to deprecate or boost it.
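The consecutive-weeks rule can be sketched as a set intersection over the most recent weeks. This assumes each week has already been reduced to a mapping of tool to percentile rank (0 to 100); the function name and input shape are hypothetical.

```python
from typing import Dict, List, Sequence


def chronic_low_roi_tools(weekly_ranks: Sequence[Dict[str, float]],
                          weeks: int = 4, cutoff: float = 10.0) -> List[str]:
    """Tools whose percentile rank was below `cutoff` in each of the last `weeks` weeks.

    weekly_ranks is ordered oldest to newest; a tool missing from any of the
    recent weeks is not flagged.
    """
    if len(weekly_ranks) < weeks:
        return []  # not enough history to call anything chronic
    recent = weekly_ranks[-weeks:]
    flagged = set(recent[0])
    for ranks in recent:
        flagged &= {tool for tool, rank in ranks.items() if rank < cutoff}
    return sorted(flagged)
```

Requiring a tool to appear in every recent week means newly added tools are not proposed for review until they have four weeks of data.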

☐ Test: seed 3 tools with synthetic invocations and known impact; verify ROI ranking matches hand-calc; chronic-low ROI triggers proposal after 28 days.

Approach

Read scidex/forge/forge_tools.py:22 init_tool_invocations_table and :67 log_tool_call for the source schema.

Citation count = SELECT COUNT(*) FROM artifact_links WHERE source = invocation_id AND link_type = 'derived_from'. For downstream_hypothesis_price_delta, extract hypothesis_id from the tool_invocations.outputs JSON, then look up that hypothesis's prices with market_dynamics.get_price_history.
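The 24-hour price-impact term can be sketched as a pure function over a price series. The input shape is an assumption: `history` is taken to be (timestamp, price) pairs sorted by time, which may not match what get_price_history returns, and a delta is taken to be the change between consecutive points.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def abs_price_delta_after(invoked_at: datetime,
                          history: Iterable[Tuple[datetime, float]],
                          window: timedelta = timedelta(hours=24)) -> float:
    """Sum of |price change| between consecutive points within `window` after the invocation.

    The last price at or before the invocation is used as the baseline; if
    there is none, the first in-window point only sets the baseline.
    """
    total = 0.0
    prev_price = None
    for ts, price in history:
        if ts <= invoked_at:
            prev_price = price  # keep updating the baseline
            continue
        if ts > invoked_at + window:
            break  # history is sorted, so nothing later can qualify
        if prev_price is not None:
            total += abs(price - prev_price)
        prev_price = price
    return total
```

This attributes every price move in the window to the invocation, so two tools invoked close together on the same hypothesis would both be credited with the same movement.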

Cost lookups: borrow orchestra.cost.compute_cost_for_window(model, tokens) rather than re-implementing pricing.

Bubble chart uses the same SVG idiom as market_dynamics.generate_market_overview_svg (line 1428) — no JS chart library.
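A minimal sketch of a server-rendered bubble chart with no JS dependency. It does not reproduce generate_market_overview_svg; the function name, dimensions, colours, and row shape (dicts keyed by the tool_roi_daily column names) are all assumptions.

```python
import math
from html import escape
from typing import Dict, List


def bubble_chart_svg(rows: List[Dict], width: int = 480, height: int = 320, pad: int = 40) -> str:
    """Render x=cost, y=impact, bubble size=invocations as a standalone SVG string."""
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    if rows:
        # Fall back to 1 so an all-zero column does not divide by zero.
        max_cost = max(r["total_cost_usd"] for r in rows) or 1.0
        max_impact = max(r["impact_score"] for r in rows) or 1.0
        max_inv = max(r["invocations"] for r in rows) or 1
        for r in rows:
            cx = pad + (r["total_cost_usd"] / max_cost) * (width - 2 * pad)
            # SVG y grows downward, so invert to put high impact at the top.
            cy = height - pad - (r["impact_score"] / max_impact) * (height - 2 * pad)
            # sqrt scaling makes bubble area, not radius, proportional to invocations.
            radius = 4 + 16 * math.sqrt(r["invocations"] / max_inv)
            parts.append(
                f'<circle cx="{cx:.1f}" cy="{cy:.1f}" r="{radius:.1f}" '
                f'fill="steelblue" fill-opacity="0.6">'
                f'<title>{escape(r["tool_id"])}</title></circle>'
            )
    parts.append("</svg>")
    return "\n".join(parts)
```

Escaping tool_id before embedding it in the `<title>` element keeps the markup valid if a tool name ever contains `<` or `&`.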

Cap analysis to a 90-day rolling window; older rows are archived to tool_roi_archive.

Dependencies

scidex/forge/forge_tools.py:22,67 — invocations source.
scidex/exchange/market_dynamics.py:747 get_price_history — price impact.
Orchestra cost_ledger (orchestra/cost.py:41).

Dependents

q-tools-deprecated-detector (consumes ROI signals).
q-tools-skill-marketplace (uses ROI as a price signal for scarce tools).

Work Log

2026-04-28 — Implementation

scidex/forge/tool_roi.py created (380 lines)

- init_tool_roi_tables() — creates tool_api_costs, tool_roi_daily, tool_roi_archive + indexes
- compute_cost(skill_id, duration_ms, invocations) — estimates cost from duration_ms + API fees
- _citation_impact(invocation_ids) — 0.3 × COUNT(artifact_links WHERE link_type='derived_from')
- _hypothesis_price_impact(invocation_ids) — 0.4 × sum of |price deltas| within 24h
- _debate_outcome_shift(invocation_ids) — 0.3 if debate cited invocation and outcome flipped
- compute_tool_roi(tool_id, window_days=30) — main ROI function
- compute_all_tool_roi(window_days=30) — batch compute all tools with O(1) percentile ranks (no recursive calls)
- _compute_tool_roi_raw() — raw ROI without percentile (called by compute_all_tool_roi)
- populate_daily_roi(target_date=None) — upserts ROI into tool_roi_daily + archives old rows
- detect_chronic_low_roi_tools() — finds tools below 10th pctile for 4+ consecutive weeks
- emit_tool_review_proposals() — inserts Senate tool_review proposals for chronic low-ROI tools
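The batch percentile step listed above can be sketched as one sort followed by a binary search per tool. This is an illustration, not the code in tool_roi.py: the function name and the definition of percentile rank (share of tools with strictly lower ROI) are assumptions, and the sketch costs O(n log n) overall.

```python
from bisect import bisect_left
from typing import Dict


def percentile_ranks(roi_by_tool: Dict[str, float]) -> Dict[str, float]:
    """Percentile rank per tool: percent of tools with strictly lower ROI.

    Sorting once and looking each value up avoids recomputing every other
    tool's ROI for each tool, which is where a quadratic cost comes from.
    """
    ordered = sorted(roi_by_tool.values())
    n = len(ordered)
    return {
        tool: 100.0 * bisect_left(ordered, roi) / n
        for tool, roi in roi_by_tool.items()
    }
```

Under this definition tied tools share the same rank and the lowest tool always ranks 0, so with fewer than ten tools the bottom one is always below the 10th percentile.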

migrations/147_tool_roi_tables.py — creates tool_api_costs, tool_roi_daily, tool_roi_archive tables + indexes (renumbered from 146 to avoid conflict with 146_causal_effects_table.py)

scidex/senate/scheduled_tasks.py — added ToolRoiDailyTask scheduled task ("tool-roi-daily", daily, 1440 min)

api.py — added:

- GET /api/forge/tools/roi — JSON leaderboard
- GET /forge/tools/roi — HTML page with sortable table + SVG bubble chart

Test: seeded 3 synthetic tools with known cost/impact; verified that the ROI ranking is ordered correctly (higher impact/dollar = higher rank).

Note: api.py has a pre-existing syntax error at line 5382 (em-dash — in f-string, already in origin/main before this branch). Not caused by this work.

Commit: 3e0f452e8 on branch orchestra/task/8e3829d0-tool-call-cost-benchmark-which-tool-maxi

2026-04-28 — Fix row unpacking bugs

Fixed two _PgRow tuple-unpacking bugs that caused crashes in production:

detect_chronic_low_roi_tools: for (tool_id, day, roi) in rows: → for row in rows: tool_id, day, roi = row[0], row[1], row[2]

Root cause: _PgRow doesn't support tuple unpacking despite behaving like a sequence.

compute_tool_roi: O(n²) percentile computation replaced by O(n) batch approach in compute_all_tool_roi; also fixed same tuple-unpacking bug.
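The failure mode and the index-based fix can be reproduced with a stand-in row class. `FakeRow` is hypothetical and only mimics the reported behaviour (indexable but not iterable); the internals of _PgRow are not shown in this spec.

```python
class FakeRow:
    """Hypothetical stand-in for a row type that supports indexing but not iteration."""

    __iter__ = None  # explicitly disables iteration, so tuple unpacking raises TypeError

    def __init__(self, *values):
        self._values = values

    def __getitem__(self, index):
        return self._values[index]

    def __len__(self):
        return len(self._values)


def unpack_roi_row(row):
    """Index-based access works for plain tuples and for index-only row objects."""
    return row[0], row[1], row[2]
```

With this stand-in, `tool_id, day, roi = FakeRow("t", "2026-04-28", 1.5)` raises TypeError, while `unpack_roi_row` returns the three values for either a FakeRow or a tuple.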

Commit: 8c3e9f12d — [Forge] Fix _PgRow tuple-unpacking bugs in tool_roi [task:8e3829d0-...]


File: q-tools-cost-benchmark_spec.md

Modified: 2026-05-01 20:13

Size: 5.9 KB