Layer: Senate / Atlas
Type: one_shot follow-up to PR #1377
Date: 2026-05-06

Follow-up to today's outage (see fix-pg-pool-exhaustion-2026-05-06_spec.md
and PR #1377). The user asked four questions; this PR addresses all
four directly:

Does get_db_ro() actually return a RO-enforced conn? **No.**

Commit 03f91a907 (2026-04-20 19:19 PDT, "[Atlas] api_shared/db.py:
add get_db_ro() for streaming replica pool") added the plumbing and
claimed an out-of-repo systemd unit scidex-pg-replica.service with
data dir /data/postgres-replica. Today on the box:

    $ systemctl list-units 'postgres*' 'pg*' --all
    postgresql@16-main.service loaded active running
    pgbouncer.service loaded active running
    postgres_exporter.service loaded active running
    postgresql.service loaded active exited
    $ ss -tlnp | grep ':5433'
    (nothing)
    $ ls /data/postgres-replica/
    (no such directory)
    $ journalctl -u scidex-pg-replica
    (no entries; unit never existed in this journal's retention window)

Net: there is no replica. There is no record of a replica. The
systemd unit and data dir referenced in the original commit either
never existed on this host or were deleted before the journal's
retention window began. We are not bringing it back; the cost (extra
PG instance, replication lag, ops surface) is not worth it for a
single-host deploy at our current QPS.

get_db_ro() does NOT enforce read-only. It only routes to a
different pool. Before today, the route went to a dead replica so
writes failed at TCP level (no harm, but masked the bug). After my
PR #1377 fix, _get_ro_pool() falls through to the primary when
SCIDEX_PG_RO_DSN is unset — so writes now succeed silently.
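
The difference between routing and enforcing can be shown with a toy model. This is a minimal sketch, not the code in api_shared/db.py: the class RouteOnlyDb and the helper priority() are hypothetical, SQLite stands in for the PostgreSQL primary, and the "dead replica" is modeled as a ConnectionError. Only the names get_db(), get_db_ro() and artifacts.intrinsic_priority come from this document.

```python
import sqlite3
from typing import Optional


class RouteOnlyDb:
    """Toy get_db()/get_db_ro() pair that routes but never enforces read-only."""

    def __init__(self, ro_dsn: Optional[str]) -> None:
        self.ro_dsn = ro_dsn
        # Assumption for the demo: an in-memory SQLite DB plays the primary.
        self.primary = sqlite3.connect(":memory:")
        self.primary.execute(
            "CREATE TABLE artifacts (id INTEGER PRIMARY KEY, intrinsic_priority REAL)"
        )
        self.primary.execute("INSERT INTO artifacts (id) VALUES (1)")

    def get_db(self) -> sqlite3.Connection:
        return self.primary

    def get_db_ro(self) -> sqlite3.Connection:
        if self.ro_dsn:
            # Models the old state: the route points at a replica that is not there.
            raise ConnectionError(f"cannot reach replica at {self.ro_dsn}")
        # Models the fall-through: the same primary, with no read-only guard.
        return self.primary


def priority(db: RouteOnlyDb) -> Optional[float]:
    """Read intrinsic_priority for the single demo row via the primary."""
    row = db.get_db().execute(
        "SELECT intrinsic_priority FROM artifacts WHERE id = 1"
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    db = RouteOnlyDb(ro_dsn=None)
    # A write through the "read-only" handle lands on the primary.
    db.get_db_ro().execute(
        "UPDATE artifacts SET intrinsic_priority = 0.5 WHERE id = 1"
    )
    print(priority(db))
```

With a DSN set, the same UPDATE never reaches a database, which is why the column stayed NULL while the replica route was dead.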

AST audit (scope-aware: tracks var = get_db_ro() per function,
flags var.execute("INSERT|UPDATE|DELETE …") only when var was not
reassigned to get_db() in between):

    api.py:34670 fn=artifact_detail() var=db verb=UPDATE
    scidex/atlas/federated_search/engine.py:59 fn=_cache_get() var=db verb=UPDATE

Two real violations across the entire codebase. Both were
permanently dead under the broken replica (they were silently failing
all along — artifacts.intrinsic_priority is NULL on all 103 142
rows, federated_search_cache is empty). My fall-through fix would
have reanimated them as silent primary writes; this PR routes them
through get_db() explicitly and adds session-level RO enforcement
so any new violations fail loudly.

- api_shared/db.py: new _ro_pool_configure(conn) runs
  `SET SESSION default_transaction_read_only = on` on every checkout, so
  PostgreSQL rejects any write through get_db_ro() with
  `cannot execute … in a read-only transaction`, regardless of where the
  pool actually points.
- api_shared/db.py: new _get_ro_fallthrough_pool(), a separate pool used
  when SCIDEX_PG_RO_DSN is unset. Old code returned the […]
- api_shared/db.py pool_stats(): now reports primary, RO replica […]
  ro_dsn_configured and ro_routes_to
  (replica | primary_fallthrough | primary_disabled).
- api_shared/db.py: new start_pool_logger(), a daemon thread that […]
  SCIDEX_PG_POOL_LOG_INTERVAL).
- api.py: 6 new Prometheus gauges:
  - scidex_pg_pool_ro_size, _ro_available, _ro_requests_waiting
  - scidex_pg_pool_ro_fallthrough_size, _ro_fallthrough_available
  - scidex_pg_pool_ro_routes_to (1=replica, 2=primary_fallthrough, 3=disabled)
- api.py: fallback /metrics endpoint also surfaces these so the […]
  prometheus_fastapi_instrumentator […]
- api.py startup: wires start_pool_logger() next to the […]
- api.py artifact_detail lazy-priority block (~line 34670): […]
  get_db() checkout (_rw_db).
- scidex/atlas/federated_search/engine.py _cache_get: switched to
  get_db() since the function does an UPDATE.
- tests/test_pool_observability.py, 5 tests:
  - pool_stats() reports ro_routes_to correctly in both replica […]
  - _ro_pool_configure calls `SET SESSION default_transaction_read_only = on`.
  - start_pool_logger() is idempotent.
  - […] get_db() (no other writes […]

- `pytest tests/test_pool_observability.py -x`: 5 tests pass.
- `pytest tests/test_pool_robustness.py -x`: pre-existing 5 still pass.
- `pytest tests/test_pg_pool_autoscaler.py -x`: pre-existing 10 still pass.
- `python -c "import api_shared.db, api"`: clean import.
- `curl -s /metrics | grep ro_routes_to` […] 2.0 (primary_fallthrough).
  Hourly INFO snapshot should […] `journalctl -u scidex-api`.
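
The session-level guard applied by _ro_pool_configure can be sketched as below. This is an illustration under assumptions, not the function in api_shared/db.py: it assumes a psycopg 3 style connection (conn.execute() and conn.commit()) and a psycopg_pool.ConnectionPool that accepts a configure callback, neither of which this document confirms. ro_pool_configure and RecordingConn are hypothetical names; RecordingConn exists only so the sketch runs without a database.

```python
from typing import Any, List

RO_STATEMENT = "SET SESSION default_transaction_read_only = on"


def ro_pool_configure(conn: Any) -> None:
    """Make every later transaction on this session read-only by default."""
    conn.execute(RO_STATEMENT)
    # Without autocommit, execute() opens a transaction. Committing it makes
    # the SET persist for the session and hands the connection back idle.
    conn.commit()


class RecordingConn:
    """Stand-in connection that records calls instead of talking to PostgreSQL."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def execute(self, sql: str) -> None:
        self.calls.append(sql)

    def commit(self) -> None:
        self.calls.append("COMMIT")


if __name__ == "__main__":
    conn = RecordingConn()
    ro_pool_configure(conn)
    print(conn.calls)

    # Assumed wiring with psycopg_pool (not executed here):
    #   from psycopg_pool import ConnectionPool
    #   pool = ConnectionPool(dsn, configure=ro_pool_configure)
```

default_transaction_read_only is only a default: a session can still issue SET default_transaction_read_only = off or start a transaction with BEGIN READ WRITE. The guard therefore catches accidental writes through get_db_ro(), but it is not a permission boundary.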