Fix: PG pool exhaustion via dead-socket validator hang + autoscaler conn leak + schema drift + pgbouncer mode mismatch

← All Specs

Fix: PG pool exhaustion via dead-socket validator hang + autoscaler conn leak + schema drift + pgbouncer mode mismatch

Layer: Senate Type: one_shot incident hotfix Date: 2026-05-06

Background

scidex.ai began returning 503 to most page loads at 04:49 PDT today.
Symptoms persisted for ~13h before user noticed. Service was up
(scidex-api.service active); failures came from the db_error_middleware at api.py:1698-1723 returning couldn't get a connection after 10.00 sec.

Root cause chain

  • Trigger (04:41): check_and_award_milestones SQL at
  • api.py:26917-26923 references column alias paper_count in a
    HAVING clause — invalid in PostgreSQL (allowed by SQLite). Each
    call aborts a transaction. This is the original SQLite→PG
    migration drift the codebase has been chasing.
  • Pool slot loss (ongoing): PgPoolAutoscaler._load_frozen at
  • scidex/senate/pg_pool_autoscaler.py:185-199 runs SELECT … LIMIT 1
    via the thread-local get_db() and never commits/rolls back.
    autocommit=False means the SELECT opens an implicit transaction
    that holds the conn in idle in transaction for the full 30s tick
    interval. PG idle_in_transaction_session_timeout=1min saves us
    from hard-leaks but the slot is unavailable to other callers
    ~50% of the time.
  • Validator hang (the actual pool-killer):
  • _txn_free_check at api_shared/db.py:284-313 runs SELECT 1 on
    every checkout. When pgbouncer's server_idle_timeout=120s closes
    a server-side socket, the kernel-level FIN doesn't propagate
    instantly to psycopg's view of the socket. The validator's
    SELECT 1 then sits waiting for a response that never comes,
    blocking until the pool-wide 10s _POOL_TIMEOUT fires. This is
    what makes every 503 take exactly 10.00s.
  • pgbouncer pool size mismatch: app SCIDEX_PG_POOL_MAX=120
  • but pgbouncer default_pool_size=40 per database in
    pool_mode=session. App can never get more than 40 PG-side
    conns through the bouncer regardless of psycopg's max.
  • Idle-timeout coupling: psycopg max_idle=120s ==
  • pgbouncer server_idle_timeout=120s. Race where pgbouncer
    reaps the server side at the exact moment psycopg hands the
    client conn out → "the connection is closed" cascade.
  • CLOSE_WAIT amplification: outbound HTTPS sockets to LLM
  • endpoints (47.89.128.168, 143.204.1.3) sit in CLOSE_WAIT,
    meaning request handlers are blocked in network reads while
    holding DB conns. Each blocked handler removes one slot from
    the already-tiny pool.

    Fixes (this PR)

    #FileFix
    1scidex/senate/pg_pool_autoscaler.py_load_frozen/_set_frozen use pool.connection() ctx mgr — no more thread-local INTRANS slot held by the autoscaler.
    2api.py (~26919)Replace HAVING paper_count >= %s with HAVING COUNT(*) >= %s. PG-compliant.
    3api_shared/db.py _txn_free_checkBound the validator with a statement_timeout=1500ms SET on the conn before SELECT 1, then RESET. Dead sockets fail fast.
    4api_shared/db.py pool defaults_POOL_MAX_IDLE default 120 → 60 so psycopg evicts before pgbouncer's 120s server_idle_timeout. Env override still wins.
    5/etc/scidex-pg.env(Operator step, called out in this spec — not in repo.) Lower SCIDEX_PG_POOL_MAX=12035. Below pgbouncer's 40 with margin.
    6api_shared/db.pyNew pool_dead_socket_audit() helper + Prometheus gauge scidex_pg_pool_phantom_count so this is visible in /metrics next time.
    7api.py db_error_middlewareWhen DB-error 503 fires, also kick a one-shot async task that drains stale wrappers from the per-request holder (defence in depth).
    8tests/test_pool_robustness.pyNew tests: _load_frozen doesn't leave a thread-local conn; validator times out in <2s on a dead socket; paper_count query parses.

    Out of scope (follow-ups)

    • Switch pgbouncer to pool_mode=transaction (needs audit of LISTEN/
    NOTIFY, advisory locks, prepared-statement usage; bigger change).
    • Refactor handlers that issue outbound HTTPS calls to release DB conn
    before the round-trip (the CLOSE_WAIT amplifier).
    • Audit other get_db() callers in long-lived background threads
    (analysis_watchdog, resource_governance, sandbox_audit, market_consumer)
    for similar SELECT-without-commit patterns.

    Operator post-merge checklist

  • git pull on production checkout.
  • Edit /etc/scidex-pg.env: set SCIDEX_PG_POOL_MAX=35.
  • sudo systemctl restart scidex-api — clears current dead sockets,
  • loads new defaults.
  • Verify: curl -s http://127.0.0.1:8000/metrics | grep scidex_pg_pool;
  • then curl -s http://127.0.0.1:8000/health returns 200.
  • Re-run ss -tnp | grep $(pgrep -f 'uvicorn.*api:app') 2 minutes after restart;
  • number of ESTAB sockets to :6432 should equal scidex_pg_pool_size
    (no phantoms).

    Validation

    • pytest tests/test_pool_robustness.py -x
    • python -c "from api_shared.db import _txn_free_check" (import smoke)
    • After deploy: journalctl -u scidex-api -f | grep "couldn't get" should
    go silent within 60s.

    File: fix-pg-pool-exhaustion-2026-05-06_spec.md
    Modified: 2026-05-18 04:17
    Size: 5.6 KB