[Operations] Complete service outage - all core pages returning status 0

Goal

Investigate and resolve a complete service outage in which all 8 core SciDEX pages return status 0, meaning the client receives no HTTP response at all (connection failure) and the web server is unreachable.

Acceptance Criteria

☑ Root cause identified and documented
☑ All 8 core pages return 200 or acceptable redirect (301/302)
☑ API /api/status responds with valid JSON
☑ Virtual environment restored for future systemd restarts
☑ Health check script added to detect/prevent future outages

Root Cause Analysis

The main checkout at /home/ubuntu/scidex/ suffered catastrophic working tree deletion:

  • 14,513 files deleted from the working tree (including api.py, tools.py, database.py)
  • The virtual environment (venv/) was completely destroyed
  • The running uvicorn process had modules in memory but became unresponsive
  • pull_main.sh was not running to restore files

The uvicorn process had loaded its modules at startup, so it kept running after the
working tree was wiped, but it could no longer serve requests. When it was killed,
systemd tried to restart it, but the venv interpreter (venv/bin/python3.12) no longer
existed on disk, so the restarts failed.
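
The two conditions described above (tracked files missing from disk, and a missing venv interpreter) can be detected with a small check. The following is a minimal sketch, not the tooling used during the incident: it assumes the input is the text printed by `git status --porcelain` (v1 format), and the function names and the sample file names `README.md` and `notes.txt` are hypothetical. Quoted or renamed paths are not handled.

```python
from pathlib import Path


def deleted_paths(porcelain: str) -> list[str]:
    """Return paths that git reports as deleted from the working tree.

    Each porcelain v1 line is two status characters, a space, then the
    path. A 'D' in the second column means the file is tracked but
    missing on disk.
    """
    paths = []
    for line in porcelain.splitlines():
        if len(line) > 3 and line[1] == "D":
            paths.append(line[3:])
    return paths


def venv_interpreter_exists(checkout: Path) -> bool:
    """True if the interpreter that systemd starts is still on disk."""
    return (checkout / "venv" / "bin" / "python3.12").is_file()


# Hypothetical sample of `git status --porcelain` output.
sample = " D api.py\n D tools.py\n D database.py\n M README.md\n?? notes.txt\n"
print(deleted_paths(sample))
```

A large count from `deleted_paths` combined with `venv_interpreter_exists` returning `False` matches the failure mode in this spec: the running process may survive on modules already in memory, but no restart can succeed.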

Approach

  • Diagnosed: checked process state, port binding, file existence
  • Restored: git checkout HEAD -- . to restore all working tree files
  • Recreated: /usr/bin/python3.12 -m venv + pip install from requirements.txt
  • Verified: all 8 core pages return correct HTTP status
  • Prevention: added scripts/health_check_api.sh for automated recovery
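
The restore and recreate steps above can be sketched as an ordered command list. This is an illustration under stated assumptions rather than the exact commands that were run: the spec names `git checkout HEAD -- .`, `/usr/bin/python3.12 -m venv`, and a pip install from requirements.txt, while the exact pip invocation, the function names, and the `dry_run` flag are hypothetical.

```python
import subprocess
from pathlib import Path

SYSTEM_PYTHON = "/usr/bin/python3.12"  # system interpreter named in the spec


def restore_commands(checkout: Path) -> list[list[str]]:
    """Commands in the order they must run: files first, then the venv."""
    venv_python = str(checkout / "venv" / "bin" / "python3.12")
    return [
        # 1. Restore every tracked file from the current commit.
        ["git", "checkout", "HEAD", "--", "."],
        # 2. Recreate the virtual environment from the system interpreter.
        [SYSTEM_PYTHON, "-m", "venv", "venv"],
        # 3. Reinstall dependencies (assumed invocation form).
        [venv_python, "-m", "pip", "install", "-r", "requirements.txt"],
    ]


def restore(checkout: Path, dry_run: bool = True) -> list[str]:
    """Run each command inside the checkout, or only list them if dry_run."""
    done = []
    for argv in restore_commands(checkout):
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(argv, cwd=checkout, check=True)
        done.append(" ".join(argv))
    return done


for step in restore(Path("/home/ubuntu/scidex"), dry_run=True):
    print(step)
```

The order matters: requirements.txt is itself a tracked file, so the working tree has to be restored before dependencies can be reinstalled.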

Work Log

    2026-04-17 06:20 PT — Slot 53 (glm-5)

    • INVESTIGATION: Found uvicorn process running (PID 915175) but not responding to HTTP requests
    • DIAGNOSIS: Main checkout working tree had 14,513 files deleted; api.py, tools.py, database.py all missing
    • DIAGNOSIS: Virtual environment at venv/ completely destroyed
    • DIAGNOSIS: pull_main.sh not running to restore files
    • FIX: Restored all files via git checkout HEAD -- .
    • FIX: Recreated venv with /usr/bin/python3.12 -m venv and pip install requirements
    • FIX: Killed stuck process; systemd restarted with healthy venv
    • VERIFICATION: All 8 core pages return correct status codes:
    - / → 302, /exchange → 200, /gaps → 200, /graph → 200
    - /analyses/ → 200, /atlas.html → 200, /how.html → 301, /pitch.html → 200
    • VERIFICATION: /api/status returns: 390 analyses, 685 hypotheses, 707K edges
    • PREVENTION: Added scripts/health_check_api.sh for automated detection and recovery
    • RESULT: Service fully restored and hardened against future working tree deletion
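
The verification steps in the log above can be expressed as a repeatable check. The sketch below is not the contents of scripts/health_check_api.sh; it only illustrates the same idea under assumptions: `BASE_URL` is hypothetical (the spec does not state a host or port), the function names are invented for illustration, and "valid JSON" is interpreted here as a JSON object.

```python
import json
import urllib.error
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000"  # assumption: host and port are not in the spec

CORE_PAGES = ["/", "/exchange", "/gaps", "/graph",
              "/analyses/", "/atlas.html", "/how.html", "/pitch.html"]
ACCEPTABLE = {200, 301, 302}  # from the acceptance criteria


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Do not follow redirects, so 301/302 are reported as-is."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3xx code


def fetch_status(path: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for a path, or 0 if no response arrives."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(BASE_URL + path, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # includes unfollowed redirects and 4xx/5xx
    except OSError:
        return 0  # connection refused, timeout, DNS failure


def failing_pages(fetch: Callable[[str], int] = fetch_status) -> dict[str, int]:
    """Map each core page with an unacceptable status to that status."""
    results = {path: fetch(path) for path in CORE_PAGES}
    return {path: code for path, code in results.items() if code not in ACCEPTABLE}


def is_json_object(body: str) -> bool:
    """True if the body parses as a JSON object."""
    try:
        return isinstance(json.loads(body), dict)
    except json.JSONDecodeError:
        return False
```

Passing the fetch function as a parameter lets the check be exercised without a running server, for example `failing_pages(lambda path: 0)` reproduces the outage condition in which every page reports status 0.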

    File: e9756cb9_service_outage_spec.md
    Modified: 2026-05-01 20:13
    Size: 2.8 KB