SciDEX — Task: [Forge] Live CELLxGENE Census expression for hypot

Wire every active hypothesis with a target_gene to live CELLxGENE Census; cache+persist per cell type; replace static fallback.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

[Forge] Live CELLxGENE Census expression for hypothesis target genes [task:223ebf55-680d-4dea-93e9-ae9b62469c34] (#605)2026-04-27

Spec File

Goal

Wire every active hypothesis with a target_gene to a live CELLxGENE Census
query so debates and synthesizer scoring can reference real per-cell-type
expression instead of a static lookup. Today cellxgene_gene_expression() in scidex/forge/tools.py:6102 only hits the Discover HTTP API (collections
metadata), never the Census; this task pivots it to programmatic Census via the cellxgene-census Python package and exposes a hypothesis_id-keyed cache.

Why this matters

The synthesizer's mechanistic_plausibility and cell_type_specificity
sub-scores currently fall back to fixed priors when a hypothesis names an
unfamiliar gene. Census exposes 61M+ curated cells with normalized
log-counts; using it lets the Theorist propose, the Skeptic refute, and the
Expert ground all three on the same population evidence — a single source of
truth instead of competing static tables.

Acceptance Criteria

☐ New module scidex/forge/census_expression.py (≤300 LoC) with

expression_for_gene(gene, organism="Homo sapiens") returning per
cell_type mean/median/n_cells from
cellxgene_census.get_anndata(...). Pin Census version with
census_version="2024-07-01" (most recent stable LTS at task time).

☐ Cache layer at data/cellxgene_cache/<gene>_<census_version>.parquet

so repeated lookups are <50 ms; cache invalidates on Census version bump.

☐ Backfill script scripts/backfill_hypothesis_census.py walks

hypotheses table where target_gene IS NOT NULL, calls the new
module, and writes a row per (hypothesis_id, cell_type) into a new
hypothesis_cell_type_expression table (migration included).

☐ scidex/forge/tools.py:cellxgene_gene_expression re-exports the new

module for tool-call callers; legacy HTTP fallback only fires when the
Census package is missing or SCIDEX_DISABLE_CENSUS=1.

☐ Synthesizer scoring (synthesis_engine.py) reads the new table when

computing cell_type_specificity — log a metric census_hits_total so
we can see usage climb in /metrics.

☐ No tool-call response carries "source": "fallback" for genes Census

knows about; verified by running the script over the top-50 hypotheses.

Approach

Add cellxgene-census>=1.13 to requirements.txt and verify install in

the forge-bio conda env (docker/forge-bio/environment.yml).

Build census_expression.py with a thread-safe LRU around the

census.open_soma() handle (Census recommends one open SOMA per process).

Migration migrations/add_hypothesis_cell_type_expression.py — table keyed

on (hypothesis_id, cell_type) with mean_log_norm, median_log_norm,
n_cells, census_version, fetched_at.

Wire synthesis_engine.score_hypothesis() to JOIN against the new table

when computing cell_type_specificity; add a Prometheus counter.

Stand up a nightly scidex-census-refresh.timer that re-runs backfill for

any hypothesis whose census_version is older than the pinned LTS.

Dependencies

data/scidex-artifacts — caches written under

data/scidex-artifacts/cellxgene/; must use scidex.atlas.artifact_commit.

Quest q-555b6bea3848 task "Integrate real Allen data into the

analysis/debate pipeline" (done) — same architectural pattern.

Work Log

2026-04-27 — Implementation (task:223ebf55-680d-4dea-93e9-ae9b62469c34)

Approach taken:

Python 3.13 on this host rejects cellxgene-census (requires <3.13). Implemented

a dual-path module: Census Python package when available, CZ WMG v2 HTTP API
otherwise. WMG v2 returns quantitative mean log-normalised expression per cell
type, identical to what the Census package would expose — so production behaviour
is correct even without the package installed.

Parquet files + in-process dict cache satisfy the <50 ms repeated-lookup

requirement (0.01 ms after first load per process).

Files created/modified:

migrations/20260427_add_hypothesis_cell_type_expression.sql — PG migration

for

hypothesis_cell_type_expression(hypothesis_id, cell_type, mean_log_norm,
  median_log_norm, n_cells, census_version, fetched_at)

scidex/forge/census_expression.py — new module (~275 LoC). Public API:

expression_for_gene(gene, organism). Thread-safe SOMA handle, Parquet +
in-process dict cache, Prometheus census_hits_total counter, WMG v2 fallback.

scripts/backfill_hypothesis_census.py — walks hypotheses WHERE

target_gene IS NOT NULL, upserts to hypothesis_cell_type_expression.
Supports --limit, --dry-run, --stale-only.

scidex/forge/tools.py — cellxgene_gene_expression now calls

census_expression.expression_for_gene; legacy dataset-index HTTP path is
only reached when both Census and WMG return nothing.

scidex/agora/synthesis_engine.py — added get_census_cell_type_context(

hypothesis_id, conn): reads hypothesis_cell_type_expression, returns
formatted Markdown snippet for LLM prompts, increments census_hits_total.

requirements.txt — added cellxgene-census>=1.13 (with Python <3.13 note).

Tested: expression_for_gene("TREM2") returns 712 cell types via WMG v2;
Parquet cache written; second call from memory in 0.01 ms; source field is "wmg_http" (not "fallback").