[Forge] Live CELLxGENE Census expression for hypothesis target genes done

Wire every active hypothesis with a target_gene to live CELLxGENE Census; cache+persist per cell type; replace static fallback.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

[Forge] Live CELLxGENE Census expression for hypothesis target genes [task:223ebf55-680d-4dea-93e9-ae9b62469c34] (#605), 2026-04-27
Spec File

Goal

Wire every active hypothesis with a target_gene to a live CELLxGENE Census
query so debates and synthesizer scoring can reference real per-cell-type
expression instead of a static lookup. Today cellxgene_gene_expression() in scidex/forge/tools.py:6102 only hits the Discover HTTP API (collections
metadata), never the Census; this task pivots it to programmatic Census via the cellxgene-census Python package and exposes a hypothesis_id-keyed cache.

Why this matters

The synthesizer's mechanistic_plausibility and cell_type_specificity
sub-scores currently fall back to fixed priors when a hypothesis names an
unfamiliar gene. Census exposes 61M+ curated cells with normalized
log-counts; using it lets the Theorist propose, the Skeptic refute, and the
Expert ground their arguments on the same population evidence: a single
source of truth instead of competing static tables.

Acceptance Criteria

☐ New module scidex/forge/census_expression.py (≤300 LoC) with
expression_for_gene(gene, organism="Homo sapiens") returning per
cell_type mean/median/n_cells from
cellxgene_census.get_anndata(...). Pin Census version with
census_version="2024-07-01" (most recent stable LTS at task time).
☐ Cache layer at data/cellxgene_cache/<gene>_<census_version>.parquet
so repeated lookups are <50 ms; cache invalidates on Census version bump.
☐ Backfill script scripts/backfill_hypothesis_census.py walks
hypotheses table where target_gene IS NOT NULL, calls the new
module, and writes a row per (hypothesis_id, cell_type) into a new
hypothesis_cell_type_expression table (migration included).
☐ scidex/forge/tools.py:cellxgene_gene_expression re-exports the new
module for tool-call callers; legacy HTTP fallback only fires when the
Census package is missing or SCIDEX_DISABLE_CENSUS=1.
☐ Synthesizer scoring (synthesis_engine.py) reads the new table when
computing cell_type_specificity — log a metric census_hits_total so
we can see usage climb in /metrics.
☐ No tool-call response carries "source": "fallback" for genes Census
knows about; verified by running the script over the top-50 hypotheses.
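The per-cell-type statistics named in the first criterion can be sketched as a plain aggregation. This is a minimal illustration, not the module's actual code: `summarize_by_cell_type` and its `rows` input are hypothetical stand-ins for whatever the real module derives from `cellxgene_census.get_anndata(...)`, and the numbers are made up.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_cell_type(rows):
    """Collapse (cell_type, log_norm_value) pairs into per-cell-type stats.

    `rows` is a hypothetical in-memory stand-in for the per-cell values the
    real module would read from Census; it is not the Census data model.
    """
    grouped = defaultdict(list)
    for cell_type, value in rows:
        grouped[cell_type].append(float(value))

    # Field names follow the columns listed in the spec.
    return {
        cell_type: {
            "mean_log_norm": mean(values),
            "median_log_norm": median(values),
            "n_cells": len(values),
        }
        for cell_type, values in grouped.items()
    }


# Made-up values for illustration only, not real Census output.
example_rows = [
    ("microglial cell", 2.0),
    ("microglial cell", 4.0),
    ("microglial cell", 3.0),
    ("astrocyte", 0.0),
    ("astrocyte", 1.0),
]
summary = summarize_by_cell_type(example_rows)
```

Each value of `summary` maps directly onto one row of the planned hypothesis_cell_type_expression table once a hypothesis_id, census_version and fetched_at are attached.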

Approach

  • Add cellxgene-census>=1.13 to requirements.txt and verify install in
    the forge-bio conda env (docker/forge-bio/environment.yml).
  • Build census_expression.py with a thread-safe LRU around the
    census.open_soma() handle (Census recommends one open SOMA per process).
  • Migration migrations/add_hypothesis_cell_type_expression.py: table keyed
    on (hypothesis_id, cell_type) with mean_log_norm, median_log_norm,
    n_cells, census_version, fetched_at.
  • Wire synthesis_engine.score_hypothesis() to JOIN against the new table
    when computing cell_type_specificity; add a Prometheus counter.
  • Stand up a nightly scidex-census-refresh.timer that re-runs backfill for
    any hypothesis whose census_version is older than the pinned LTS.
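The "one open SOMA per process" step can be illustrated with a lock-guarded lazy handle. This is a simplified sketch under stated assumptions: it uses a plain lazy singleton rather than the LRU the plan mentions, `SharedHandle` is a hypothetical name, and the opener is injected so the example runs without the Census package. With the package installed, the opener would presumably be something like `lambda: cellxgene_census.open_soma(census_version="2024-07-01")`.

```python
import threading


class SharedHandle:
    """Lazily open one shared resource per process, guarded by a lock."""

    def __init__(self, opener):
        self._opener = opener  # callable that returns the opened handle
        self._lock = threading.Lock()
        self._handle = None

    def get(self):
        # Double-checked locking: cheap read first, lock only for the first open.
        if self._handle is None:
            with self._lock:
                if self._handle is None:
                    self._handle = self._opener()
        return self._handle

    def close(self):
        # Detach under the lock, then close outside it.
        with self._lock:
            handle, self._handle = self._handle, None
        if handle is not None and hasattr(handle, "close"):
            handle.close()


# Demo with a fake opener (hypothetical; stands in for opening the Census).
open_calls = []


def fake_opener():
    open_calls.append(1)
    return object()


shared = SharedHandle(fake_opener)
results = []
threads = [
    threading.Thread(target=lambda: results.append(shared.get()))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All eight threads receive the same object and the opener runs once, which is the property the plan asks for.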

Dependencies

  • data/scidex-artifacts: caches written under
    data/scidex-artifacts/cellxgene/; must use scidex.atlas.artifact_commit.
  • Quest q-555b6bea3848 task "Integrate real Allen data into the
    analysis/debate pipeline" (done): same architectural pattern.

Work Log

2026-04-27: Implementation (task:223ebf55-680d-4dea-93e9-ae9b62469c34)

Approach taken:

  • Python 3.13 on this host rejects cellxgene-census (requires <3.13). Implemented
    a dual-path module: Census Python package when available, CZ WMG v2 HTTP API
    otherwise. WMG v2 returns quantitative mean log-normalised expression per cell
    type, identical to what the Census package would expose, so production behaviour
    is correct even without the package installed.
  • Parquet files + in-process dict cache satisfy the <50 ms repeated-lookup
    requirement (0.01 ms after first load per process).
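The two-tier cache described above (in-process dict in front of on-disk files, keyed by gene and Census version) might look roughly like the following. This sketch swaps Parquet for JSON so it runs on the standard library alone; `ExpressionCache`, `fake_fetch` and the "2099-01-01" version string are hypothetical and exist only for the illustration.

```python
import json
import tempfile
from pathlib import Path

CENSUS_VERSION = "2024-07-01"  # pinned version from the acceptance criteria


class ExpressionCache:
    """Memory-then-disk cache; the version in the file name acts as invalidation."""

    def __init__(self, cache_dir, census_version=CENSUS_VERSION):
        self.cache_dir = Path(cache_dir)
        self.census_version = census_version
        self._memory = {}

    def _path(self, gene):
        # JSON stands in for the Parquet files used by the real module.
        return self.cache_dir / f"{gene}_{self.census_version}.json"

    def get(self, gene, fetch):
        key = (gene, self.census_version)
        if key in self._memory:  # tier 1: in-process dict
            return self._memory[key]
        path = self._path(gene)
        if path.exists():  # tier 2: file written by an earlier call
            data = json.loads(path.read_text())
        else:  # miss: call the (slow) fetcher and persist the result
            data = fetch(gene)
            self.cache_dir.mkdir(parents=True, exist_ok=True)
            path.write_text(json.dumps(data))
        self._memory[key] = data
        return data


fetch_calls = []


def fake_fetch(gene):
    # Hypothetical fetcher returning made-up numbers.
    fetch_calls.append(gene)
    return {"microglial cell": {"mean_log_norm": 3.0, "n_cells": 3}}


tmp_dir = tempfile.mkdtemp()
cache = ExpressionCache(tmp_dir)
first = cache.get("TREM2", fake_fetch)   # miss: fetches and writes the file
second = cache.get("TREM2", fake_fetch)  # served from memory
third = ExpressionCache(tmp_dir).get("TREM2", fake_fetch)  # new instance: from disk
bumped = ExpressionCache(tmp_dir, census_version="2099-01-01")
fourth = bumped.get("TREM2", fake_fetch)  # version bump: fetches again
```

Because the version is part of both the memory key and the file name, a version bump needs no explicit purge; stale files are simply never read again.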

Files created/modified:

  • migrations/20260427_add_hypothesis_cell_type_expression.sql: PG migration
    for hypothesis_cell_type_expression(hypothesis_id, cell_type, mean_log_norm,
    median_log_norm, n_cells, census_version, fetched_at).
  • scidex/forge/census_expression.py: new module (~275 LoC). Public API:
    expression_for_gene(gene, organism). Thread-safe SOMA handle, Parquet +
    in-process dict cache, Prometheus census_hits_total counter, WMG v2 fallback.
  • scripts/backfill_hypothesis_census.py: walks hypotheses WHERE
    target_gene IS NOT NULL, upserts to hypothesis_cell_type_expression.
    Supports --limit, --dry-run, --stale-only.
  • scidex/forge/tools.py: cellxgene_gene_expression now calls
    census_expression.expression_for_gene; legacy dataset-index HTTP path is
    only reached when both Census and WMG return nothing.
  • scidex/agora/synthesis_engine.py: added get_census_cell_type_context(
    hypothesis_id, conn), which reads hypothesis_cell_type_expression, returns a
    formatted Markdown snippet for LLM prompts, and increments census_hits_total.
  • requirements.txt: added cellxgene-census>=1.13 (with Python <3.13 note).

Tested: expression_for_gene("TREM2") returns 712 cell types via WMG v2; the
Parquet cache is written; the second call is served from memory in 0.01 ms; the
source field is "wmg_http" (not "fallback").
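The source ordering described in the work log (Census package, then WMG v2 over HTTP, then the legacy path) can be sketched as a provider chain that tags each answer with where it came from. Everything here is illustrative: `resolve_expression` and the stub providers are hypothetical, the "census" label is an assumption, and only "wmg_http" and "fallback" appear in the notes above.

```python
import os


def resolve_expression(gene, providers, env=None):
    """Try providers in order and tag the first non-empty answer with its source."""
    env = os.environ if env is None else env
    for source, fetch in providers:
        # Honour the kill switch named in the acceptance criteria.
        if source == "census" and env.get("SCIDEX_DISABLE_CENSUS") == "1":
            continue
        try:
            result = fetch(gene)
        except ImportError:
            continue  # package not installed (e.g. unsupported Python version)
        if result:
            return {"gene": gene, "source": source, "cell_types": result}
    # Nothing answered: stands in for the legacy dataset-index path.
    return {"gene": gene, "source": "fallback", "cell_types": {}}


def census_unavailable(gene):
    # Stub mimicking a host where the Census package cannot be imported.
    raise ImportError("cellxgene_census is not installed")


def census_stub(gene):
    return {"microglial cell": {"mean_log_norm": 3.1}}  # made-up value


def wmg_stub(gene):
    return {"microglial cell": {"mean_log_norm": 3.0}}  # made-up value


def empty_stub(gene):
    return {}


answer = resolve_expression(
    "TREM2", [("census", census_unavailable), ("wmg_http", wmg_stub)], env={}
)
disabled = resolve_expression(
    "TREM2",
    [("census", census_stub), ("wmg_http", wmg_stub)],
    env={"SCIDEX_DISABLE_CENSUS": "1"},
)
missing = resolve_expression("TREM2", [("wmg_http", empty_stub)], env={})
```

With this shape, the "no fallback for genes Census knows about" criterion reduces to checking the source field of each response.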
