Fix: PG pool observability + RO enforcement + 2 silent RO-write bugs

← All Specs

Fix: PG pool observability + RO enforcement + 2 silent RO-write bugs

Layer: Senate / Atlas Type: one_shot follow-up to PR #1377 Date: 2026-05-06

Background

Follow-up to today's outage (see fix-pg-pool-exhaustion-2026-05-06_spec.md
and PR #1377). The user asked four questions; this PR addresses all
four directly:

  • Do we have enough pool stats logging? — no.
  • Timeline of degradation? — built; included below.
  • Debug the RO replica. — done; replica is gone, not coming back.
  • Does get_db_ro() actually return a RO-enforced conn? — **no, it
  • silently allowed writes through the primary fall-through pool.**

    Timeline of degradation (from journalctl, full retention)

    Daypool timeouts503 responsesNotes
    Apr 1900pre-replica baseline
    Apr 202459565replica systemd unit added 19:19 PDT (commit 03f91a907); broke immediately
    Apr 21148294winding down; replica started succeeding
    Apr 22–241–51quiet, replica online
    Apr 25–May 400–7stable, no incidents
    May 5119replica started failing
    May 611201119full outage (~13h, 04:49–17:14 PDT)
    So the 2026-05-06 user-visible degradation was ~13h, not the 17 days
    implied by the journal's earliest "couldn't get a connection" entry.

    RO replica forensics

    Commit 03f91a907 (2026-04-20 19:19 PDT, "[Atlas] api_shared/db.py:
    add get_db_ro() for streaming replica pool") added the plumbing and
    claimed an out-of-repo systemd unit scidex-pg-replica.service with
    data dir /data/postgres-replica. Today on the box:

    $ systemctl list-units 'postgres*' 'pg*' --all
    postgresql@16-main.service  loaded active running
    pgbouncer.service           loaded active running
    postgres_exporter.service   loaded active running
    postgresql.service          loaded active exited
    
    $ ss -tlnp | grep ':5433'
    (nothing)
    
    $ ls /data/postgres-replica/
    (no such directory)
    
    $ journalctl -u scidex-pg-replica
    (no entries — unit never existed in this journal's retention window)

    Net: there is no replica. There is no record of a replica. The
    systemd unit and data dir referenced in the original commit either
    never existed on this host or were deleted before the journal's
    retention window began. We are not bringing it back; the cost (extra
    PG instance, replication lag, ops surface) is not worth it for a
    single-host deploy at our current QPS.

    Read-only enforcement audit

    get_db_ro() does NOT enforce read-only. It only routes to a
    different pool. Before today, the route went to a dead replica so
    writes failed at TCP level (no harm, but masked the bug). After my
    PR #1377 fix, _get_ro_pool() falls through to the primary when SCIDEX_PG_RO_DSN is unset — so writes now succeed silently.

    AST audit (scope-aware: tracks var = get_db_ro() per function,
    flags var.execute("INSERT|UPDATE|DELETE …") only when var was not
    reassigned to get_db() in between):

    api.py:34670  fn=artifact_detail()   var=db  verb=UPDATE
    scidex/atlas/federated_search/engine.py:59  fn=_cache_get()  var=db  verb=UPDATE

    Two real violations across the entire codebase. Both were
    permanently dead under the broken replica (they were silently failing
    all along — artifacts.intrinsic_priority is NULL on all 103 142
    rows, federated_search_cache is empty). My fall-through fix would
    have reanimated them as silent primary writes; this PR routes them
    through get_db() explicitly and adds session-level RO enforcement
    so any new violations fail loudly.

    Changes

    Read-only enforcement (the load-bearing fix)

    • api_shared/db.py: new _ro_pool_configure(conn) runs
    SET SESSION default_transaction_read_only = on on every checkout
    from both the RO replica pool AND a new RO fall-through pool. PG
    raises cannot execute … in a read-only transaction on any write
    through get_db_ro() regardless of where the pool actually points.
    • api_shared/db.py: new _get_ro_fallthrough_pool() — a separate
    ConnectionPool against the primary DSN with the RO configure hook,
    used when SCIDEX_PG_RO_DSN is unset. Old code returned the
    primary pool directly, which is why writes silently succeeded.

    Pool observability

    • api_shared/db.py pool_stats(): now reports primary, RO replica
    (when configured), AND RO fall-through pool, plus
    ro_dsn_configured and ro_routes_to
    (replica | primary_fallthrough | primary_disabled).
    • api_shared/db.py: new start_pool_logger() — daemon thread that
    emits a one-line INFO snapshot every 60s (SCIDEX_PG_POOL_LOG_INTERVAL).
    Without this the journal had no historical pool state — exactly
    what we needed to trace the May 5 → May 6 degradation slope.
    • api.py: 6 new Prometheus gauges:
    * scidex_pg_pool_ro_size, _ro_available, _ro_requests_waiting
    * scidex_pg_pool_ro_fallthrough_size, _ro_fallthrough_available
    * scidex_pg_pool_ro_routes_to (1=replica, 2=primary_fallthrough, 3=disabled)
    • api.py: fallback /metrics endpoint also surfaces these so the
    metrics pathway works even when prometheus_fastapi_instrumentator
    isn't installed.
    • api.py startup: wires start_pool_logger() next to the
    autoscaler boot.

    RO-write violation fixes

    • api.py artifact_detail lazy-priority block (~line 34670):
    switched the UPDATE to a fresh get_db() checkout (_rw_db).
    • scidex/atlas/federated_search/engine.py _cache_get: switched
    to get_db() since the function does an UPDATE.

    Tests

    • tests/test_pool_observability.py — 5 tests:
    * pool_stats() reports ro_routes_to correctly in both replica
    and fall-through configurations.
    * _ro_pool_configure calls SET SESSION default_transaction_read_only = on.
    * The two flagged RO-write violations are gone (textual check).
    * start_pool_logger() is idempotent.

    Out of scope

    • Bringing back a real RO replica (decision: not now — single-host
    deploy doesn't need it).
    • Mass-converting other RO callers to get_db() (no other writes
    through RO conns according to the AST audit).
    • Pgbouncer mode change (still session pooling, unchanged).

    Validation

    • pytest tests/test_pool_observability.py -x — 5 tests pass.
    • pytest tests/test_pool_robustness.py -x — pre-existing 5 still pass.
    • pytest tests/test_pg_pool_autoscaler.py -x — pre-existing 10 still pass.
    • python -c "import api_shared.db, api" — clean import.
    • Live verification post-deploy: curl -s /metrics | grep ro_routes_to
    should show 2.0 (primary_fallthrough). Hourly INFO snapshot should
    appear in journalctl -u scidex-api.

    File: fix-pool-observability-ro-enforcement-2026-05-06_spec.md
    Modified: 2026-05-18 04:17
    Size: 6.7 KB