Fix: PG pool observability + RO enforcement + 2 silent RO-write bugs

Layer: Senate / Atlas Type: one_shot follow-up to PR #1377 Date: 2026-05-06

Background

Follow-up to today's outage (see fix-pg-pool-exhaustion-2026-05-06_spec.md
and PR #1377). The user asked four questions; this PR addresses all
four directly:

Do we have enough pool stats logging? — no.

Timeline of degradation? — built; included below.

Debug the RO replica. — done; replica is gone, not coming back.

Does get_db_ro() actually return a RO-enforced conn? — **no, it

silently allowed writes through the primary fall-through pool.**

Timeline of degradation (from journalctl, full retention)

Day	pool timeouts	503 responses	Notes
Apr 19	0	0	pre-replica baseline
Apr 20	2459	565	replica systemd unit added 19:19 PDT (commit 03f91a907); broke immediately
Apr 21	1482	94	winding down; replica started succeeding
Apr 22–24	1–5	1	quiet, replica online
Apr 25–May 4	0	0–7	stable, no incidents
May 5	11	9	replica started failing
May 6	1120	1119	full outage (~13h, 04:49–17:14 PDT)

So the 2026-05-06 user-visible degradation was ~13h, not the 17 days
implied by the journal's earliest "couldn't get a connection" entry.

RO replica forensics

Commit 03f91a907 (2026-04-20 19:19 PDT, "[Atlas] api_shared/db.py:
add get_db_ro() for streaming replica pool") added the plumbing and
claimed an out-of-repo systemd unit scidex-pg-replica.service with
data dir /data/postgres-replica. Today on the box:

$ systemctl list-units 'postgres*' 'pg*' --all
postgresql@16-main.service  loaded active running
pgbouncer.service           loaded active running
postgres_exporter.service   loaded active running
postgresql.service          loaded active exited

$ ss -tlnp | grep ':5433'
(nothing)

$ ls /data/postgres-replica/
(no such directory)

$ journalctl -u scidex-pg-replica
(no entries — unit never existed in this journal's retention window)

Net: there is no replica. There is no record of a replica. The
systemd unit and data dir referenced in the original commit either
never existed on this host or were deleted before the journal's
retention window began. We are not bringing it back; the cost (extra
PG instance, replication lag, ops surface) is not worth it for a
single-host deploy at our current QPS.

Read-only enforcement audit

get_db_ro() does NOT enforce read-only. It only routes to a
different pool. Before today, the route went to a dead replica so
writes failed at TCP level (no harm, but masked the bug). After my
PR #1377 fix, _get_ro_pool() falls through to the primary when SCIDEX_PG_RO_DSN is unset — so writes now succeed silently.

AST audit (scope-aware: tracks var = get_db_ro() per function,
flags var.execute("INSERT|UPDATE|DELETE …") only when var was not
reassigned to get_db() in between):

api.py:34670  fn=artifact_detail()   var=db  verb=UPDATE
scidex/atlas/federated_search/engine.py:59  fn=_cache_get()  var=db  verb=UPDATE

Two real violations across the entire codebase. Both were
permanently dead under the broken replica (they were silently failing
all along — artifacts.intrinsic_priority is NULL on all 103 142
rows, federated_search_cache is empty). My fall-through fix would
have reanimated them as silent primary writes; this PR routes them
through get_db() explicitly and adds session-level RO enforcement
so any new violations fail loudly.

Changes

Read-only enforcement (the load-bearing fix)

api_shared/db.py: new _ro_pool_configure(conn) runs

SET SESSION default_transaction_read_only = on on every checkout
from both the RO replica pool AND a new RO fall-through pool. PG
raises cannot execute … in a read-only transaction on any write
through get_db_ro() regardless of where the pool actually points.

api_shared/db.py: new _get_ro_fallthrough_pool() — a separate

ConnectionPool against the primary DSN with the RO configure hook,
used when SCIDEX_PG_RO_DSN is unset. Old code returned the
primary pool directly, which is why writes silently succeeded.

Pool observability

api_shared/db.py pool_stats(): now reports primary, RO replica

(when configured), AND RO fall-through pool, plus
ro_dsn_configured and ro_routes_to
(replica | primary_fallthrough | primary_disabled).

api_shared/db.py: new start_pool_logger() — daemon thread that

emits a one-line INFO snapshot every 60s (SCIDEX_PG_POOL_LOG_INTERVAL).
Without this the journal had no historical pool state — exactly
what we needed to trace the May 5 → May 6 degradation slope.

api.py: 6 new Prometheus gauges:

* scidex_pg_pool_ro_size, _ro_available, _ro_requests_waiting
* scidex_pg_pool_ro_fallthrough_size, _ro_fallthrough_available
* scidex_pg_pool_ro_routes_to (1=replica, 2=primary_fallthrough, 3=disabled)

api.py: fallback /metrics endpoint also surfaces these so the

metrics pathway works even when prometheus_fastapi_instrumentator
isn't installed.

api.py startup: wires start_pool_logger() next to the

autoscaler boot.

RO-write violation fixes

api.py artifact_detail lazy-priority block (~line 34670):

switched the UPDATE to a fresh get_db() checkout (_rw_db).

scidex/atlas/federated_search/engine.py _cache_get: switched

to get_db() since the function does an UPDATE.

Tests

tests/test_pool_observability.py — 5 tests:

* pool_stats() reports ro_routes_to correctly in both replica
and fall-through configurations.
* _ro_pool_configure calls SET SESSION default_transaction_read_only = on.
* The two flagged RO-write violations are gone (textual check).
* start_pool_logger() is idempotent.

Out of scope

Bringing back a real RO replica (decision: not now — single-host

deploy doesn't need it).

Mass-converting other RO callers to get_db() (no other writes

through RO conns according to the AST audit).

Pgbouncer mode change (still session pooling, unchanged).

Validation

pytest tests/test_pool_observability.py -x — 5 tests pass.
pytest tests/test_pool_robustness.py -x — pre-existing 5 still pass.
pytest tests/test_pg_pool_autoscaler.py -x — pre-existing 10 still pass.
python -c "import api_shared.db, api" — clean import.
Live verification post-deploy: curl -s /metrics | grep ro_routes_to

should show 2.0 (primary_fallthrough). Hourly INFO snapshot should
appear in journalctl -u scidex-api.

File: fix-pool-observability-ro-enforcement-2026-05-06_spec.md

Modified: 2026-05-18 04:17

Size: 6.7 KB