[UI] Harden link checker against transient API restart cascades done coding:5

Reduce false broken links and runtime blowups when scidex-api restarts mid-crawl by adding outage-aware retry behavior and clear reporting.

## REOPENED TASK: CRITICAL CONTEXT

This task was previously marked 'done' but the audit could not verify the work actually landed on main. The original work may have been:

- Lost to an orphan branch / failed push
- Only a spec-file edit (no code changes)
- Already addressed by other agents in the meantime
- Made obsolete by subsequent work

**Before doing anything else:**

1. **Re-evaluate the task in light of CURRENT main state.** Read the spec and the relevant files on origin/main NOW. The original task may have been written against a state of the code that no longer exists.
2. **Verify the task still advances SciDEX's aims.** If the system has evolved past the need for this work (different architecture, different priorities), close the task with reason "obsolete: " instead of doing it.
3. **Check if it's already done.** Run `git log --grep=''` and read the related commits. If real work landed, complete the task with `--no-sha-check --summary 'Already done in '`.
4. **Make sure your changes don't regress recent functionality.** Many agents have been working on this codebase. Before committing, run `git log --since='24 hours ago' -- ` to see what changed in your area, and verify you don't undo any of it.
5. **Stay scoped.** Only do what this specific task asks for. Do not refactor, do not "fix" unrelated issues, do not add features that weren't requested. Scope creep at this point is regression risk.

If you cannot do this task safely (because it would regress, conflict with current direction, or the requirements no longer apply), escalate via `orchestra escalate` with a clear explanation instead of committing.

Completion Notes

Hardening is already present on main. Evidence: is_transient_outage() at line 528, reconcile_transient_failures() at line 573, suppress_transient_http0_noise() at line 766, generate_report() includes transient_outage_count at line 1218, and the main loop wires all three at lines 1293/1325. Code inspection and a syntax check confirm that all acceptance criteria are met. Commit bc490b64d landed the fix.

Git Commits (3)

Merge remote-tracking branch 'origin/main' into orchestra/integration/task338-slot2-20260404 (2026-04-04)
[UI] Harden link checker transient outage handling [task:f38b1b34-9df1-4ae1-814d-fdeecdc1aec6] (2026-04-04)
[UI] Harden link checker transient outage handling [task:f38b1b34-9df1-4ae1-814d-fdeecdc1aec6] (2026-04-04)

Spec File

Goal

Improve link-checking resilience so temporary local API outages during restarts do not produce large volumes of false broken-link reports. The checker should retry outage-like failures, classify transient infrastructure failures separately from real broken links, and exit with clear summary reporting. This reduces noisy failures in self-improvement validation runs.

Acceptance Criteria

☑ Link checker retries transient outage/network failures with bounded backoff before classifying a URL as failed.
☑ Reports distinguish true broken links from transient outage failures in both per-link logs and final summary.
☑ Existing success/failure behavior for genuine HTTP 4xx/5xx broken links remains intact.
☑ Validation command(s) complete and demonstrate expected behavior.
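
The first criterion (retry with bounded backoff before classifying a URL as failed) can be sketched roughly as below. This is a minimal illustration, not the code in link_checker.py: the function name `check_with_backoff`, the `fetch` callable, and the delay parameters are hypothetical, and only the retryable status set (0, 502, 503, 504) is taken from the work log.

```python
from __future__ import annotations

import time
from typing import Callable

# Outage-like statuses per the work log; 0 stands for "no HTTP response".
RETRYABLE_STATUSES = {0, 502, 503, 504}


def check_with_backoff(
    fetch: Callable[[str], int],
    url: str,
    max_attempts: int = 4,          # hypothetical default
    base_delay: float = 0.5,        # hypothetical default, seconds
    max_delay: float = 8.0,         # hypothetical cap, seconds
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, int]:
    """Return (final_status, attempts_used).

    Only outage-like statuses are retried. Genuine 4xx/5xx results such as
    404 or 500 are returned immediately, so existing broken-link behavior
    is unchanged.
    """
    status = 0
    for attempt in range(1, max_attempts + 1):
        status = fetch(url)
        if status not in RETRYABLE_STATUSES:
            return status, attempt
        if attempt < max_attempts:
            # Exponential backoff, capped so the total wait stays bounded.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    return status, max_attempts
```

Injecting `sleep` keeps the retry loop testable without real delays; a caller that still sees a retryable status after `max_attempts` would hand the URL to the transient-outage classification rather than report it as broken.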

Approach

  • Inspect link_checker.py flow and current failure classification.
  • Implement outage-aware retry logic and structured result categories.
  • Update summary output and process exit behavior to reflect categories.
  • Run targeted checks with timeout to verify behavior and no regressions.
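
The "structured result categories" and "exit behavior" steps above might look something like the following sketch. The report keys `transient_outage_count`, `transient_outage_links`, `broken_links`, and `failure_type` come from the work log; the `category` field, the sample `failure_type` values, the helper names, and the exit-code convention are assumptions made for illustration only.

```python
from __future__ import annotations


def summarize(results: list[dict]) -> dict:
    """Build a summary that keeps concrete broken links and transient outages apart.

    Each result is assumed (hypothetically) to carry a "category" of
    "ok", "broken", or "transient_outage".
    """
    broken = [r for r in results if r["category"] == "broken"]
    transient = [r for r in results if r["category"] == "transient_outage"]
    return {
        "checked": len(results),
        "broken_count": len(broken),
        "transient_outage_count": len(transient),
        "broken_links": broken,
        "transient_outage_links": transient,
    }


def should_create_task(report: dict) -> bool:
    """Skip follow-up task creation when only transient outages remain."""
    return report["broken_count"] > 0


def exit_code(report: dict) -> int:
    """Assumed convention: fail the health gate only on concrete broken links."""
    return 1 if report["broken_count"] > 0 else 0
```

Under this assumed convention a run that sees only outage-like failures still reports them in the summary but does not fail the gate or open a broken-link task.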

Dependencies

  • None.

Dependents

  • Any CI or agent flows that run timeout 300 python3 link_checker.py as a health gate.

Work Log

    2026-04-04 05:48 PDT — Slot 2

    • Started task: [UI] Harden link checker against transient API restart cascades.
    • Read required standards: /home/ubuntu/Orchestra/AGENTS.md, local AGENTS.md, and QUESTS.md.
    • Pulled latest worktree branch state (git pull --rebase).
    • Retrieved task via Orchestra and confirmed spec_path was missing on disk.
    • Created this spec file before implementation, per policy.

    2026-04-04 05:56 PDT — Slot 2

    • Read link_checker.py end-to-end and mapped current failure flow (seed crawl failures, per-link checks, retry logic, report/task paths).
    • Implemented outage-aware hardening in link_checker.py:
      - Added retryable outage status classification (0, 502, 503, 504) and failure typing from check_link().
      - Added unpack_failure() for backwards-compatible failure tuple handling.
      - Replaced HTTP-0-only suppression with reconcile_transient_failures():
        - Detects restart cascades across many retryable failures.
        - Rechecks retryable URLs with a bounded budget (LINKCHECK_TRANSIENT_RECHECK_MAX_URLS, default 60).
        - Separates transient outage failures from concrete broken links.
      - Updated generate_report() to include:
        - transient_outage_count
        - transient_outage_links (with failure types)
        - enriched broken_links entries including failure_type.
      - Updated main flow to skip false-positive task creation when only transient outages remain.
    • Validation executed:
      - python3 -c "import py_compile; py_compile.compile('link_checker.py', doraise=True)"
      - timeout 300 python3 link_checker.py
        - Observed startup API connection instability and a large retryable burst.
        - Checker classified 558 entries as transient outage failures and reported 0 concrete broken links.
      - timeout 60 curl -s http://localhost:8000/api/status | python3 -m json.tool
      - Core page probes via http://localhost:8000 (/, /exchange, /gaps, /graph, /analyses/, /atlas.html, /how.html) returned 200/301/302.
      - scidex services list verified API and nginx active ✅
    • Result: Done — link checker now distinguishes transient API restart cascades from real broken links and reports both categories clearly.
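
The reconciliation pass described in this entry (cascade detection, budgeted recheck, then separation) could be sketched as follows. It is an approximation under stated assumptions rather than the actual reconcile_transient_failures(): the function name `reconcile`, the `recheck` callable, the (url, status) tuple shape, and the cascade threshold are hypothetical; the retryable statuses and the LINKCHECK_TRANSIENT_RECHECK_MAX_URLS budget (default 60) are from the log.

```python
from __future__ import annotations

import os
from typing import Callable

RETRYABLE_STATUSES = {0, 502, 503, 504}
# Budget name and default are from the work log.
RECHECK_MAX_URLS = int(os.environ.get("LINKCHECK_TRANSIENT_RECHECK_MAX_URLS", "60"))
CASCADE_MIN_FAILURES = 20  # hypothetical threshold for "many retryable failures"


def reconcile(
    failures: list[tuple[str, int]],
    recheck: Callable[[str], int],
    max_rechecks: int = RECHECK_MAX_URLS,
    cascade_min: int = CASCADE_MIN_FAILURES,
) -> tuple[list[tuple[str, int]], list[tuple[str, int]]]:
    """Split (url, status) failures into (broken, transient)."""
    retryable = [(u, s) for u, s in failures if s in RETRYABLE_STATUSES]
    broken = [(u, s) for u, s in failures if s not in RETRYABLE_STATUSES]
    # A burst of retryable failures suggests an API restart, not bad links.
    cascade = len(retryable) >= cascade_min
    transient: list[tuple[str, int]] = []
    for i, (url, status) in enumerate(retryable):
        if i < max_rechecks:  # recheck only within the bounded budget
            new_status = recheck(url)
            if new_status not in RETRYABLE_STATUSES:
                if new_status >= 400:
                    broken.append((url, new_status))  # concrete failure after recovery
                continue  # otherwise the link recovered and is dropped
            status = new_status
        # Still outage-like, or never rechecked because the budget ran out.
        (transient if cascade else broken).append((url, status))
    return broken, transient
```

In this sketch a persistent 503 outside a cascade is still reported as broken, which is one way to keep genuine server errors visible while suppressing restart noise.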

    2026-04-04 06:00 PDT — Slot 2

    • Reconciled merge conflict against latest origin/main and preserved upstream portability updates:
      - Report/DB paths now derive from REPO_ROOT with *_PATH environment override support.
      - Open linkcheck task detection includes running status.
      - Retained this task's transient-outage reconciliation flow and added a bounded post-filter revalidation pass.
    • Re-verified after merge resolution:
      - python3 -c "import py_compile; py_compile.compile('link_checker.py', doraise=True)"
      - timeout 300 env LINKCHECK_MAX_RUNTIME_SECONDS=60 python3 link_checker.py ✅ (partial crawl by bounded deadline; 0 concrete broken links).
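
Two ideas from this entry, path overrides via *_PATH environment variables and a crawl bounded by LINKCHECK_MAX_RUNTIME_SECONDS, can be illustrated with the sketch below. The helper names, the specific variable LINKCHECK_REPORT_PATH, and the default file name are hypothetical; only REPO_ROOT-relative defaults, the *_PATH pattern, and LINKCHECK_MAX_RUNTIME_SECONDS appear in the log.

```python
from __future__ import annotations

import os
import time
from pathlib import Path
from typing import Callable, Iterable, Mapping


def resolve_path(
    env_var: str,
    default_name: str,
    repo_root: Path,
    environ: Mapping[str, str] = os.environ,
) -> Path:
    """Use the env override when set, else a path under the repo root."""
    override = environ.get(env_var)
    return Path(override) if override else repo_root / default_name


def crawl_with_deadline(
    urls: Iterable[str],
    check: Callable[[str], int],
    max_runtime_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> tuple[dict[str, int], bool]:
    """Check URLs until done or the deadline passes.

    Returns (results, partial); partial is True when the crawl stopped early.
    """
    deadline = clock() + max_runtime_seconds
    results: dict[str, int] = {}
    for url in urls:
        if clock() >= deadline:
            return results, True  # stop cleanly and report a partial crawl
        results[url] = check(url)
    return results, False
```

A monotonic clock is used so that wall-clock adjustments cannot extend or shorten the budget; the budget itself would be read with something like `float(os.environ.get("LINKCHECK_MAX_RUNTIME_SECONDS", ...))`, where the fallback value is left open here because the log does not state it.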

    Payload JSON
    {
      "requirements": {
        "coding": 5
      },
      "_reset_note": "This task was reset after a database incident on 2026-04-17.\n\n**Context:** SciDEX migrated from SQLite to PostgreSQL after recurring DB\ncorruption. Some work done during Apr 16-17 may have been lost.\n\n**Before starting work:**\n1. Check if the task's goal is ALREADY satisfied (run the relevant checks)\n2. Check `git log --all --grep=task:YOUR_TASK_ID` for prior commits\n3. If complete, verify and mark done. If partial, continue. If not done, proceed.\n\n**DB change:** SciDEX now uses PostgreSQL. `get_db()` auto-detects via\nSCIDEX_DB_BACKEND=postgres env var.",
      "_reset_at": "2026-04-18T06:29:22.046013+00:00",
      "_reset_from_status": "done"
    }
