[Senate] Throttle gap factory: 2472 open gaps, 0.77% conversion to analyses


Goal

The knowledge_gaps table grew from 48 to 2472 gaps in one prioritization-quest cycle (roughly 50x). Only 19 of those 2472 gaps have ever produced an analysis (0.77%). The gap factory is creating gaps faster than the system can process them, and most of the new gaps appear to be low-quality, the result of high-volume PubMed extraction without filtering.

This task implements four fixes:

  • Add a quality filter so only gaps with importance_score >= 0.5 are inserted
  • Add a per-cycle cap on new gap inserts (max 50 per cycle)
  • Close gaps older than 30 days with status='open' and zero downstream activity
  • Reduce the gap factory frequency from daily to weekly

Acceptance Criteria

    ☐ gap_scanner.py only inserts gaps when importance_score >= 0.5 (using gap_scoring.py)
    ☐ gap_scanner.py caps new gap inserts at 50 per cycle
    ☐ debate_gap_extractor.py only inserts gaps when importance_score >= 0.5
    ☐ gap_pipeline.py closes stale gaps (30+ days, open status, no downstream activity)
    ☐ systemd timer changed from daily to weekly
    ☐ All changes tested and committed

Approach

  • gap_scanner.py modifications:
    - Import and call gap_scoring.score_knowledge_gap() before creating each gap
    - Use the returned importance_score to filter (>= 0.5 threshold)
    - Track gaps_created count and stop when reaching cap of 50
    - Change --days default from 7 to 1 to reduce paper scan window
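
The filter and cap described for gap_scanner.py can be sketched roughly as follows. This is a minimal illustration, not the code in gap_scanner.py: `select_gaps` and the injected `score_fn` are hypothetical names, and the sketch assumes the scorer returns a dict containing an `importance_score` key, which the spec does not confirm.

```python
from typing import Callable, Dict, Iterable, List

IMPORTANCE_THRESHOLD = 0.5  # threshold from the spec
MAX_GAPS_PER_CYCLE = 50     # per-cycle cap from the spec


def select_gaps(
    candidates: Iterable[dict],
    score_fn: Callable[[dict], Dict[str, float]],
    threshold: float = IMPORTANCE_THRESHOLD,
    cap: int = MAX_GAPS_PER_CYCLE,
) -> List[dict]:
    """Return the candidates that pass the quality filter, up to the cap.

    score_fn stands in for gap_scoring.score_knowledge_gap(); its return
    shape (a dict with "importance_score") is an assumption.
    """
    accepted: List[dict] = []
    for candidate in candidates:
        # Stop scanning once the per-cycle cap is reached.
        if len(accepted) >= cap:
            break
        scores = score_fn(candidate)
        # Keep only gaps at or above the importance threshold.
        if scores.get("importance_score", 0.0) >= threshold:
            accepted.append(candidate)
    return accepted
```

Checking the cap before scoring means no scoring calls are spent on candidates that could never be inserted in this cycle.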

  • debate_gap_extractor.py modifications:
    - After extracting question, call gap_scoring.score_knowledge_gap()
    - Only create gap if importance_score >= 0.5
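
One way to express "score first, create only if important" is to have the creation helper return None for filtered questions, so that callers can skip them. The sketch below assumes that pattern; `create_gap_if_important`, `collect_gap_ids`, `score_fn` and `create_fn` are hypothetical names, and the dict return shape of the scorer is an assumption.

```python
from typing import Callable, Dict, Iterable, List, Optional


def create_gap_if_important(
    question: str,
    score_fn: Callable[[str], Dict[str, float]],
    create_fn: Callable[[str, Dict[str, float]], str],
    threshold: float = 0.5,
) -> Optional[str]:
    """Create a gap only when its importance_score meets the threshold.

    Returns the new gap id, or None when the question is filtered out.
    score_fn stands in for gap_scoring.score_knowledge_gap() and
    create_fn for the real insert helper; both are assumptions.
    """
    scores = score_fn(question)
    if scores.get("importance_score", 0.0) < threshold:
        return None
    return create_fn(question, scores)


def collect_gap_ids(
    questions: Iterable[str],
    score_fn: Callable[[str], Dict[str, float]],
    create_fn: Callable[[str, Dict[str, float]], str],
) -> List[str]:
    """Create gaps for a batch of questions, skipping filtered (None) results."""
    gap_ids: List[str] = []
    for question in questions:
        gap_id = create_gap_if_important(question, score_fn, create_fn)
        if gap_id is not None:
            gap_ids.append(gap_id)
    return gap_ids
```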

  • gap_pipeline.py additions:
    - Add close_stale_gaps() function to find and close gaps with:
      - status = 'open'
      - created_at more than 30 days ago
      - No entries in analyses or hypotheses tables
    - Run this as part of the regular pipeline
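
The stale-gap criteria map naturally onto a single UPDATE with two NOT EXISTS checks. The sketch below uses sqlite3 and makes several assumptions that the spec does not state: the `gap_id` foreign-key column on analyses and hypotheses, ISO-8601 UTC strings in `created_at`, and 'closed' as the target status. It is an illustration of the query shape, not the function in gap_pipeline.py.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import Optional


def close_stale_gaps(
    conn: sqlite3.Connection,
    max_age_days: int = 30,
    now: Optional[datetime] = None,
) -> int:
    """Close open gaps older than max_age_days with no downstream activity.

    Assumed schema (hypothetical): knowledge_gaps(id, status, created_at),
    analyses(gap_id), hypotheses(gap_id). created_at is assumed to hold
    ISO-8601 UTC strings, so string comparison matches time order.
    Returns the number of gaps closed.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    cur = conn.execute(
        """
        UPDATE knowledge_gaps
           SET status = 'closed'
         WHERE status = 'open'
           AND created_at < ?
           AND NOT EXISTS (SELECT 1 FROM analyses a
                            WHERE a.gap_id = knowledge_gaps.id)
           AND NOT EXISTS (SELECT 1 FROM hypotheses h
                            WHERE h.gap_id = knowledge_gaps.id)
        """,
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount
```

Passing `now` explicitly keeps the function deterministic in tests; in normal use it can be left unset.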

  • scidex-gap-scanner.timer modification:
    - Change OnCalendar=daily to OnCalendar=weekly
    - Alternatively: OnCalendar=Mon *-*-* 03:00:00 to run weekly at 03:00
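
For reference, the timer change amounts to a one-line edit in the [Timer] section. The fragment below is a sketch showing only that section; any other settings in the real unit file are not shown here and are not implied.

```ini
# scidex-gap-scanner.timer (fragment; only the [Timer] section is shown)
[Timer]
# "weekly" is systemd shorthand for: Mon *-*-* 00:00:00
OnCalendar=weekly
# To run weekly at 03:00 instead, an explicit expression would be:
# OnCalendar=Mon *-*-* 03:00:00
```

A calendar expression can be checked before deploying with `systemd-analyze calendar "Mon *-*-* 03:00:00"`, which prints the normalized form and the next elapse time.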

Dependencies

    • gap_scoring.py must remain functional (already exists)
    • db_writes.create_knowledge_gap must remain functional (already exists)

Work Log

    2026-04-11 — Slot 0

    • Investigated gap factory issue: gap_scanner.py and debate_gap_extractor.py create gaps without setting importance_score
    • Found gap_scoring.py exists but is not called during gap creation
    • Analyzed database: 2470 open gaps, only ~2023 have importance_score >= 0.5
    • Created spec file

    2026-04-11 — Implementation

    • gap_scanner.py: Added import for gap_scoring; modified create_gap() to call gap_scoring.score_knowledge_gap() and filter on importance_score >= 0.5; added MAX_GAPS_PER_CYCLE=50 constant; added per-cycle cap logic in scan_papers(); changed default --days from 7 to 1
    • debate_gap_extractor.py: Added import for gap_scoring; modified create_gap_from_question() to call gap_scoring.score_knowledge_gap() and filter on importance_score >= 0.5; updated process_debate_session() to handle filtered (None) gap_ids
    • gap_pipeline.py: Added close_stale_gaps() function with --close-stale and --close-stale-days CLI args; closes gaps with status='open', >30 days old, no downstream analyses/hypotheses
    • scidex-gap-scanner.service: Changed --days from 7 to 1
    • scidex-gap-scanner.timer: Changed OnCalendar from daily to weekly
    • All Python files passed syntax check and import tests
    • Committed as 37c7ec19 and pushed to origin

    2026-04-11 — Merge Gate Fix (attempt 3)

    • Merge gate again blocked: still had infra/scripts deletions, and close_stale_gaps not wired into pipeline
    • Restored infra/scripts/{README.md,backup-all.sh,snapshot-home-hourly.sh,sync-full-s3.sh} (were deleted in prior commits but are unrelated to gap throttle)
    • Wired close_stale_gaps() into systemd service: scidex-gap-scanner.service now runs both gap_scanner.py AND gap_pipeline.py --close-stale in a single ExecStart bash -c chain
    • Committed as 8f136147, pushed to origin
    • Final diff: 6 files, +225/-10 lines (gap_scanner.py, debate_gap_extractor.py, gap_pipeline.py, scidex-gap-scanner.{service,timer}, spec)

    2026-04-14 — Slot 0 (Verification)

    • Verified all 5 acceptance criteria on branch (already merged to main via 37c7ec199, 8f136147):
    - [x] gap_scanner.py: importance_score >= 0.5 filter present (lines 408-413, uses gap_scoring.score_knowledge_gap)
    - [x] gap_scanner.py: MAX_GAPS_PER_CYCLE = 50 cap (line 462, enforced at lines 514-515, 530-531)
    - [x] debate_gap_extractor.py: importance_score >= 0.5 filter (lines 286-291, 588-595)
    - [x] gap_pipeline.py: close_stale_gaps() with --close-stale / --close-stale-days args (lines 400-480)
    - [x] scidex-gap-scanner.timer: OnCalendar=weekly (line 6)
    - [x] scidex-gap-scanner.service: wires both gap_scanner.py AND gap_pipeline.py --close-stale (line 9)
    • Updated spec work log
    • Result: Done — throttle gap factory task complete, all acceptance criteria verified

    File: ee42ebc1_7e2_spec.md
    Modified: 2026-05-01 20:13
    Size: 5.0 KB