Fix: PG pool exhaustion via dead-socket validator hang + autoscaler conn leak + schema drift + pgbouncer mode mismatch
Layer: Senate
Type: one_shot incident hotfix
Date: 2026-05-06
Background
scidex.ai began returning 503 to most page loads at 04:49 PDT today.
Symptoms persisted for ~13h before user noticed. Service was up
(scidex-api.service active); failures came from the
db_error_middleware at api.py:1698-1723 returning
couldn't get a connection after 10.00 sec.
Root cause chain
Trigger (04:41): check_and_award_milestones SQL at
api.py:26917-26923 references column alias
paper_count in a
HAVING clause — invalid in PostgreSQL (allowed by SQLite). Each
call aborts a transaction. This is the original SQLite→PG
migration drift the codebase has been chasing.
Pool slot loss (ongoing): PgPoolAutoscaler._load_frozen at
scidex/senate/pg_pool_autoscaler.py:185-199 runs
SELECT … LIMIT 1 via the thread-local
get_db() and never commits/rolls back.
autocommit=False means the SELECT opens an implicit transaction
that holds the conn in
idle in transaction for the full 30s tick
interval. PG
idle_in_transaction_session_timeout=1min saves us
from hard-leaks but the slot is unavailable to other callers
~50% of the time.
Validator hang (the actual pool-killer):
_txn_free_check at
api_shared/db.py:284-313 runs
SELECT 1 on
every checkout. When pgbouncer's
server_idle_timeout=120s closes
a server-side socket, the kernel-level FIN doesn't propagate
instantly to psycopg's view of the socket. The validator's
SELECT 1 then sits waiting for a response that never comes,
blocking until the pool-wide 10s
_POOL_TIMEOUT fires. This is
what makes every 503 take exactly 10.00s.
pgbouncer pool size mismatch: app SCIDEX_PG_POOL_MAX=120
but pgbouncer
default_pool_size=40 per database in
pool_mode=session. App can never get more than 40 PG-side
conns through the bouncer regardless of psycopg's max.
Idle-timeout coupling: psycopg max_idle=120s ==
pgbouncer
server_idle_timeout=120s. Race where pgbouncer
reaps the server side at the exact moment psycopg hands the
client conn out → "the connection is closed" cascade.
CLOSE_WAIT amplification: outbound HTTPS sockets to LLM
endpoints (47.89.128.168, 143.204.1.3) sit in CLOSE_WAIT,
meaning request handlers are blocked in network reads while
holding DB conns. Each blocked handler removes one slot from
the already-tiny pool.
Fixes (this PR)
| # | File | Fix |
|---|
| 1 | scidex/senate/pg_pool_autoscaler.py | _load_frozen/_set_frozen use pool.connection() ctx mgr — no more thread-local INTRANS slot held by the autoscaler. |
| 2 | api.py (~26919) | Replace HAVING paper_count >= %s with HAVING COUNT(*) >= %s. PG-compliant. |
| 3 | api_shared/db.py _txn_free_check | Bound the validator with a statement_timeout=1500ms SET on the conn before SELECT 1, then RESET. Dead sockets fail fast. |
| 4 | api_shared/db.py pool defaults | _POOL_MAX_IDLE default 120 → 60 so psycopg evicts before pgbouncer's 120s server_idle_timeout. Env override still wins. |
| 5 | /etc/scidex-pg.env | (Operator step, called out in this spec — not in repo.) Lower SCIDEX_PG_POOL_MAX=120 → 35. Below pgbouncer's 40 with margin. |
| 6 | api_shared/db.py | New pool_dead_socket_audit() helper + Prometheus gauge scidex_pg_pool_phantom_count so this is visible in /metrics next time. |
| 7 | api.py db_error_middleware | When DB-error 503 fires, also kick a one-shot async task that drains stale wrappers from the per-request holder (defence in depth). |
| 8 | tests/test_pool_robustness.py | New tests: _load_frozen doesn't leave a thread-local conn; validator times out in <2s on a dead socket; paper_count query parses. |
Out of scope (follow-ups)
- Switch pgbouncer to
pool_mode=transaction (needs audit of LISTEN/
NOTIFY, advisory locks, prepared-statement usage; bigger change).
- Refactor handlers that issue outbound HTTPS calls to release DB conn
before the round-trip (the CLOSE_WAIT amplifier).
- Audit other
get_db() callers in long-lived background threads
(analysis_watchdog, resource_governance, sandbox_audit, market_consumer)
for similar SELECT-without-commit patterns.
Operator post-merge checklist
git pull on production checkout.
Edit /etc/scidex-pg.env: set SCIDEX_PG_POOL_MAX=35.
sudo systemctl restart scidex-api — clears current dead sockets,
loads new defaults.
Verify: curl -s http://127.0.0.1:8000/metrics | grep scidex_pg_pool;
then
curl -s http://127.0.0.1:8000/health returns 200.
Re-run ss -tnp | grep $(pgrep -f 'uvicorn.*api:app') 2 minutes after restart;
number of ESTAB sockets to :6432 should equal
scidex_pg_pool_size (no phantoms).
Validation
pytest tests/test_pool_robustness.py -x
python -c "from api_shared.db import _txn_free_check" (import smoke)
- After deploy:
journalctl -u scidex-api -f | grep "couldn't get" should
go silent within 60s.