Fix: PG pool exhaustion via dead-socket validator hang + autoscaler conn leak + schema drift + pgbouncer mode mismatch

Layer: Senate Type: one_shot incident hotfix Date: 2026-05-06

Background

scidex.ai began returning 503 to most page loads at 04:49 PDT today.
Symptoms persisted for ~13h before user noticed. Service was up
(scidex-api.service active); failures came from the db_error_middleware at api.py:1698-1723 returning couldn't get a connection after 10.00 sec.

Root cause chain

Trigger (04:41): check_and_award_milestones SQL at

api.py:26917-26923 references column alias paper_count in a
HAVING clause — invalid in PostgreSQL (allowed by SQLite). Each
call aborts a transaction. This is the original SQLite→PG
migration drift the codebase has been chasing.

Pool slot loss (ongoing): PgPoolAutoscaler._load_frozen at

scidex/senate/pg_pool_autoscaler.py:185-199 runs SELECT … LIMIT 1
via the thread-local get_db() and never commits/rolls back.
autocommit=False means the SELECT opens an implicit transaction
that holds the conn in idle in transaction for the full 30s tick
interval. PG idle_in_transaction_session_timeout=1min saves us
from hard-leaks but the slot is unavailable to other callers
~50% of the time.

Validator hang (the actual pool-killer):

_txn_free_check at api_shared/db.py:284-313 runs SELECT 1 on
every checkout. When pgbouncer's server_idle_timeout=120s closes
a server-side socket, the kernel-level FIN doesn't propagate
instantly to psycopg's view of the socket. The validator's
SELECT 1 then sits waiting for a response that never comes,
blocking until the pool-wide 10s _POOL_TIMEOUT fires. This is
what makes every 503 take exactly 10.00s.

pgbouncer pool size mismatch: app SCIDEX_PG_POOL_MAX=120

but pgbouncer default_pool_size=40 per database in
pool_mode=session. App can never get more than 40 PG-side
conns through the bouncer regardless of psycopg's max.

Idle-timeout coupling: psycopg max_idle=120s ==

pgbouncer server_idle_timeout=120s. Race where pgbouncer
reaps the server side at the exact moment psycopg hands the
client conn out → "the connection is closed" cascade.

CLOSE_WAIT amplification: outbound HTTPS sockets to LLM

endpoints (47.89.128.168, 143.204.1.3) sit in CLOSE_WAIT,
meaning request handlers are blocked in network reads while
holding DB conns. Each blocked handler removes one slot from
the already-tiny pool.

Fixes (this PR)

#	File	Fix
1	`scidex/senate/pg_pool_autoscaler.py`	`_load_frozen`/`_set_frozen` use `pool.connection()` ctx mgr — no more thread-local INTRANS slot held by the autoscaler.
2	`api.py` (~26919)	Replace `HAVING paper_count >= %s` with `HAVING COUNT(*) >= %s`. PG-compliant.
3	`api_shared/db.py` `_txn_free_check`	Bound the validator with a `statement_timeout=1500ms` SET on the conn before `SELECT 1`, then RESET. Dead sockets fail fast.
4	`api_shared/db.py` pool defaults	`_POOL_MAX_IDLE` default 120 → 60 so psycopg evicts before pgbouncer's 120s `server_idle_timeout`. Env override still wins.
5	`/etc/scidex-pg.env`	(Operator step, called out in this spec — not in repo.) Lower `SCIDEX_PG_POOL_MAX=120` → `35`. Below pgbouncer's 40 with margin.
6	`api_shared/db.py`	New `pool_dead_socket_audit()` helper + Prometheus gauge `scidex_pg_pool_phantom_count` so this is visible in /metrics next time.
7	`api.py` `db_error_middleware`	When DB-error 503 fires, also kick a one-shot async task that drains stale wrappers from the per-request holder (defence in depth).
8	`tests/test_pool_robustness.py`	New tests: `_load_frozen` doesn't leave a thread-local conn; validator times out in <2s on a dead socket; `paper_count` query parses.

Out of scope (follow-ups)

Switch pgbouncer to pool_mode=transaction (needs audit of LISTEN/

NOTIFY, advisory locks, prepared-statement usage; bigger change).

Refactor handlers that issue outbound HTTPS calls to release DB conn

before the round-trip (the CLOSE_WAIT amplifier).

Audit other get_db() callers in long-lived background threads

(analysis_watchdog, resource_governance, sandbox_audit, market_consumer)
for similar SELECT-without-commit patterns.

Operator post-merge checklist

git pull on production checkout.

Edit /etc/scidex-pg.env: set SCIDEX_PG_POOL_MAX=35.

sudo systemctl restart scidex-api — clears current dead sockets,

loads new defaults.

Verify: curl -s http://127.0.0.1:8000/metrics | grep scidex_pg_pool;

then curl -s http://127.0.0.1:8000/health returns 200.

Re-run ss -tnp | grep $(pgrep -f 'uvicorn.*api:app') 2 minutes after restart;

number of ESTAB sockets to :6432 should equal scidex_pg_pool_size
(no phantoms).

Validation

pytest tests/test_pool_robustness.py -x
python -c "from api_shared.db import _txn_free_check" (import smoke)
After deploy: journalctl -u scidex-api -f | grep "couldn't get" should

go silent within 60s.

File: fix-pg-pool-exhaustion-2026-05-06_spec.md

Modified: 2026-05-18 04:17

Size: 5.6 KB