[Atlas] Preprint-with-SciDEX-attribution detector + registry done

Daily Europe-PMC fulltext sweep for SciDEX mentions in bioRxiv/medRxiv/arXiv; operator triage UI.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (2)

Squash merge: orchestra/task/e5f5d979-preprint-with-scidex-attribution-detecto (2 commits) (#867) 2026-04-27
Squash merge: orchestra/task/e5f5d979-preprint-with-scidex-attribution-detecto (4 commits) (#863) 2026-04-27

Spec File

Effort: thorough

Goal

Authors increasingly cite SciDEX content in their preprints' methods or
acknowledgments ("Hypothesis generated by the SciDEX Theorist
persona", "Notebook adapted from scidex.ai/wiki/...", "Analysis
ranked-third in the SciDEX Tau tournament"). These attributions are
high-signal social proof that the platform's outputs reach the open
literature. The general citation tracker (q-impact-citation-tracker)
catches reference-list cites; this spec catches free-text mentions in
preprint full text that name SciDEX explicitly, even when no formal
citation is added.

Acceptance Criteria

☐ New module
scidex/atlas/preprint_attribution_detector.py with a daily
sweep over bioRxiv/medRxiv/arXiv full text:
- Source 1: bioservices Crossref + bioRxiv API (full text
when open access).
- Source 2: Europe PMC fullTextSearch for the literal
strings "SciDEX", "scidex.ai", "Science Discovery
Exchange".
- Detector regex captures the surrounding sentence
(±1 sentence context) and any explicit
scidex.ai/<path> URL mentioned nearby.
☐ New table scidex_attributions (Postgres):

CREATE TABLE scidex_attributions (
    id UUID PRIMARY KEY,
    preprint_doi TEXT NOT NULL,
    preprint_title TEXT,
    preprint_authors JSONB,
    preprint_venue TEXT,
    preprint_published_at TIMESTAMP,
    attribution_text TEXT NOT NULL,
    attribution_kind TEXT NOT NULL CHECK (attribution_kind IN
        ('explicit_named', 'url_mention', 'dataset_attribution',
         'methods_acknowledgment', 'adapted_notebook')),
    scidex_artifact_id TEXT REFERENCES artifacts(id),
    confidence FLOAT CHECK (confidence BETWEEN 0 AND 1),
    review_status TEXT NOT NULL DEFAULT 'pending' CHECK
        (review_status IN ('pending', 'confirmed', 'rejected')),
    first_detected_at TIMESTAMP DEFAULT NOW(),
    UNIQUE (preprint_doi, attribution_text)
);

☐ Routes:
- GET /api/scidex-attributions?status=confirmed: list.
- POST /api/scidex-attributions/{id}/confirm: operator confirms
an attribution (sets review_status='confirmed'; requires
Senate role).
- POST /api/scidex-attributions/{id}/reject: operator rejects
an attribution (sets review_status='rejected').
☐ Confirmed rows propagate up: each becomes an
external_citation row (q-impact-citation-tracker schema)
with source_provider='manual' so contributor landing pages
surface them too.
☐ Homepage "Real-world impact" strip (from
q-impact-citation-tracker) gets a sub-strip "Mentioned in
preprints" with the most recent 3 confirmed attributions.
☐ Page /scidex-attributions lists all attributions with a
filter pill row (Pending / Confirmed / Rejected) for
operators to triage. Each row shows the citing-work title,
the attribution sentence (highlighted), the inferred SciDEX
artifact (if any), and Confirm/Reject buttons.
☐ Pytest fixtures cover: regex detection of all 5
attribution_kind variants, dedup by
UNIQUE(preprint_doi, attribution_text), the
confirm/reject state transition, and propagation to
external_citations.
☐ Acceptance run: nightly poller produces ≥1 pending row from
the live Europe PMC index over a 30-day backfill.
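
The detector criterion above (matched sentence, ±1 sentence of context, and any nearby scidex.ai/<path> URL) could be sketched roughly as follows. This is only an illustration: MENTION_RE, URL_RE, SENTENCE_SPLIT_RE, find_mentions() and SAMPLE are hypothetical names, and the regexes are assumptions, not the patterns shipped in preprint_attribution_detector.py.

```python
import re

# Hypothetical patterns; the module's real regexes are not shown in this spec.
MENTION_RE = re.compile(r"\bSciDEX\b|scidex\.ai|Science Discovery Exchange", re.IGNORECASE)
URL_RE = re.compile(r"(?:https?://)?scidex\.ai/[\w\-./]+", re.IGNORECASE)
# Naive splitter: break after ., ! or ? when followed by whitespace and a capital.
SENTENCE_SPLIT_RE = re.compile(r'(?<=[.!?])\s+(?=[A-Z"(])')

SAMPLE = (
    "We studied tau aggregation. "
    "The hypothesis was generated by the SciDEX Theorist persona. "
    "The notebook was adapted from scidex.ai/wiki/tau-notebook. "
    "Unrelated closing sentence."
)


def find_mentions(text: str) -> list:
    """Return one dict per sentence that names SciDEX, with +/-1 sentence of context."""
    sentences = SENTENCE_SPLIT_RE.split(text.strip())
    hits = []
    for i, sentence in enumerate(sentences):
        if not MENTION_RE.search(sentence):
            continue
        # Previous sentence, the match itself, and the next sentence.
        context = " ".join(sentences[max(0, i - 1): i + 2])
        # Strip trailing punctuation that the URL pattern may have swallowed.
        urls = [u.rstrip(".,;)") for u in URL_RE.findall(context)]
        hits.append({"sentence": sentence, "context": context, "urls": urls})
    return hits


if __name__ == "__main__":
    for hit in find_mentions(SAMPLE):
        print(hit["sentence"], hit["urls"])
```

The splitter is deliberately simple and will mis-split on abbreviations such as "et al."; a production detector would likely need something sturdier.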

Approach

  • Use the existing paper-corpus-search skill / Europe PMC API
    from scidex/forge/tools.py for full-text search.
  • Regex set: 5 patterns matching the 5 attribution_kind values.
  • Confidence scoring: explicit_named=0.95, url_mention=0.85,
    adapted_notebook=0.7, methods_acknowledgment=0.6,
    dataset_attribution=0.5.
  • Resolve scidex_artifact_id by URL parsing when a
    scidex.ai/<path> is present; otherwise leave NULL and let
    operators fill in.
  • Operator triage UI: simple Bootstrap-ish HTML, no SPA needed.
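
A minimal sketch of the confidence table and URL-based artifact resolution listed in the approach. The confidence values are the ones stated in this spec; resolve_artifact_id() and build_attribution() are hypothetical helpers, and treating the last path segment as the artifact ID is an assumption, since the spec does not say how a scidex.ai path maps to artifacts(id).

```python
from typing import Optional
from urllib.parse import urlparse

# Confidence per attribution_kind, as listed in the approach.
CONFIDENCE_BY_KIND = {
    "explicit_named": 0.95,
    "url_mention": 0.85,
    "adapted_notebook": 0.7,
    "methods_acknowledgment": 0.6,
    "dataset_attribution": 0.5,
}


def resolve_artifact_id(url: str) -> Optional[str]:
    """Return the last path segment of a scidex.ai URL, or None.

    Assumption: the artifact ID is the final path segment.
    """
    if "://" not in url:
        url = "https://" + url  # bare "scidex.ai/..." mentions have no scheme
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host not in ("scidex.ai", "www.scidex.ai"):
        return None
    segments = [s for s in parsed.path.split("/") if s]
    return segments[-1] if segments else None


def build_attribution(kind: str, url: Optional[str] = None) -> dict:
    """Combine kind, confidence, and (possibly NULL) artifact ID into one row dict."""
    return {
        "attribution_kind": kind,
        "confidence": CONFIDENCE_BY_KIND[kind],
        # Left as None when no URL is present, so operators can fill it in.
        "scidex_artifact_id": resolve_artifact_id(url) if url else None,
    }
```

An unknown kind raises KeyError here, which mirrors the table's CHECK constraint rejecting values outside the five allowed kinds.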
Dependencies

  • q-impact-citation-tracker (sibling): supplies the
    external_citations schema and homepage strip.
  • Europe PMC full-text search (live; not a Wave-1 artifact).
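
For orientation, a query against the public Europe PMC REST search endpoint for the 30-day backfill might be built along these lines. This only constructs the URL and makes no network call. build_search_url() and backfill_window() are hypothetical names; the SRC:PPR (preprints) and FIRST_PDATE filters are standard Europe PMC query syntax, but whether a plain phrase query covers full text in the way the spec's "fullTextSearch" implies is an assumption to verify, and the actual helper in scidex/forge/tools.py may work differently.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
SEARCH_TERMS = ("SciDEX", "scidex.ai", "Science Discovery Exchange")


def backfill_window(until: date, days: int = 30) -> tuple:
    """Return (since, until) covering the last `days` days."""
    return until - timedelta(days=days), until


def build_search_url(since: date, until: date, page_size: int = 100) -> str:
    """Build a search URL for preprints that mention any of the literal strings."""
    terms = " OR ".join(f'"{t}"' for t in SEARCH_TERMS)
    query = (
        f"({terms}) AND SRC:PPR "
        f"AND FIRST_PDATE:[{since.isoformat()} TO {until.isoformat()}]"
    )
    params = {
        "query": query,
        "format": "json",
        "resultType": "lite",
        "pageSize": page_size,
        "cursorMark": "*",  # first page; later pages use the returned cursor
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    since, until = backfill_window(date(2026, 4, 27))
    print(build_search_url(since, until))
```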

Work Log

    2026-04-27T22:45:00Z — Slot minimax:74

    • Created migrations/add_scidex_attributions_table.py — adds scidex_attributions table with
    history mirror, triggers, all 5 attribution_kind CHECK, review_status CHECK, and indexes.
    Applied successfully against live Postgres.
    • Created scidex/atlas/preprint_attribution_detector.py — detector module with:
    - Europe PMC fullTextSearch polling for SciDEX, scidex.ai, Science Discovery Exchange
    - Crossref preprint polling (bioRxiv/medRxiv/arXiv)
    - 5 regex patterns covering all attribution_kind variants with correct confidence scores
    - detect_attributions_in_text() returning list of attribution dicts
    - _resolve_artifact_id_from_url() parsing scidex.ai/<path> into artifact IDs
    - run_daily_sweep() and run_backfill() entry points; _upsert_attribution() with
    ON CONFLICT DO NOTHING for dedup
    - CLI via python -m scidex.atlas.preprint_attribution_detector --since 25h --backfill
    • Added API routes in api.py:
    - GET /api/scidex-attributions?status=X — paginated JSON list
    - POST /api/scidex-attributions/{id}/confirm — sets review_status='confirmed',
    calls _propagate_to_external_citation() to write an external_citations row
    - POST /api/scidex-attributions/{id}/reject — sets review_status='rejected'
    - _propagate_to_external_citation() creates an external_citations row with
    source_provider='manual' from a confirmed attribution
    • Added GET /scidex-attributions HTML page — operator triage UI with filter pills
    (Pending/Confirmed/Rejected), full attribution sentence displayed per row,
    Confirm/Reject buttons wired to AJAX, color-coded kind badges
    • Added homepage preprint strip in dashboard() — shows most recent 3 confirmed
    scidex_attributions rows with DOI, title, attribution snippet, and artifact link,
    placed between ticker and featured sections
    • Created tests/atlas/test_preprint_attribution_detector.py covering:
    - Parametrized test for all 5 attribution_kind variants
    - Confidence score correctness assertion
    - No-false-positive test on clean text
    - Mixed-text multi-kind detection
    - Each result has confidence, attribution_text, attribution_kind
    - State transition tests for confirm/reject
    - Propagation test: confirmed row creates external_citations entry
    • Verified: Python syntax valid (py_compile), regex detection works in isolation,
    5 patterns found correctly (explicit_named overlaps with methods_acknowledgment on
    "SciDEX platform" — both fire; this is intentional as both patterns can appear in
    the same text)
    • Committed 4 files (+1409 lines), pushed to branch orchestra/task/e5f5d979-preprint-with-scidex-attribution-detecto
    • Note: acceptance run (nightly poller → ≥1 pending row from live Europe PMC) requires
    the systemd timer to be set up; the module is ready for that integration.
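
The dedup behaviour described in this entry (ON CONFLICT DO NOTHING against the UNIQUE(preprint_doi, attribution_text) constraint) can be illustrated with a small stand-in. This sketch uses an in-memory SQLite database instead of Postgres, a trimmed-down column list, and a hypothetical upsert_attribution() helper; it is not the module's _upsert_attribution(), and the DOI is a placeholder.

```python
import sqlite3
import uuid

# Reduced schema: only the columns needed to show the dedup rule.
SCHEMA = """
CREATE TABLE scidex_attributions (
    id TEXT PRIMARY KEY,
    preprint_doi TEXT NOT NULL,
    attribution_text TEXT NOT NULL,
    attribution_kind TEXT NOT NULL,
    review_status TEXT NOT NULL DEFAULT 'pending',
    UNIQUE (preprint_doi, attribution_text)
)
"""


def upsert_attribution(conn: sqlite3.Connection, doi: str, text: str, kind: str) -> bool:
    """Insert a row unless (doi, text) already exists; return True if inserted."""
    before = conn.total_changes
    conn.execute(
        "INSERT INTO scidex_attributions "
        "(id, preprint_doi, attribution_text, attribution_kind) "
        "VALUES (?, ?, ?, ?) "
        "ON CONFLICT (preprint_doi, attribution_text) DO NOTHING",
        (str(uuid.uuid4()), doi, text, kind),
    )
    # total_changes only grows when a row was actually written.
    return conn.total_changes > before


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    doi = "10.0000/placeholder"  # placeholder DOI, not a real preprint
    print(upsert_attribution(conn, doi, "Generated by SciDEX.", "explicit_named"))
    print(upsert_attribution(conn, doi, "Generated by SciDEX.", "explicit_named"))
```

Because the conflict target is the (preprint_doi, attribution_text) pair, re-running a sweep over the same preprint is idempotent, while a second, different sentence from the same DOI still produces a new pending row.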

    2026-04-27T23:10:00Z — Slot minimax:74 (merge retry 1)

    • Review history: the merge gate rejected the change because the POST confirm/reject
    endpoints had no real Senate-role enforcement. The docstring said "requires Senate role",
    but the code only checked for a valid API key (auth.require_api_key).
    • Fix: Added auth.require_senate_role() dependency in scidex/core/auth.py (returns
    require_permission("senate")), then wired both endpoints to use it instead of bare
    require_api_key. This follows the exact pattern used by /senate/pg-pool/freeze.
    • Files changed: api.py (2 routes, Depends(auth.require_senate_role()) instead of
    Depends(auth.require_api_key)), scidex/core/auth.py (new require_senate_role() function).
    • Verification: python3 -m py_compile on both files passes cleanly.
    • Committed: a03e25fa6 [Atlas] preprint-attribution: add real Senate role guard to
    confirm/reject endpoints [task:e5f5d979-177a-4681-a3b9-3c37feb83e67]
    • Pushed with git push --force-with-lease after rebase onto origin/main.
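
The shape of the fix (require_senate_role() returning require_permission("senate"), applied before the status change) can be sketched in plain Python. The real code wires this through Depends(...) in api.py, which is not reproduced here; Principal, the use of PermissionError, and confirm_attribution() are hypothetical stand-ins for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Principal:
    """Hypothetical caller identity: an API key plus granted permissions."""
    api_key: str
    permissions: FrozenSet[str] = frozenset()


def require_permission(name: str) -> Callable[[Principal], Principal]:
    """Return a checker that rejects callers lacking the named permission."""
    def check(principal: Principal) -> Principal:
        if name not in principal.permissions:
            raise PermissionError(f"missing permission: {name}")
        return principal
    return check


def require_senate_role() -> Callable[[Principal], Principal]:
    # Mirrors the work-log description: a thin wrapper over require_permission.
    return require_permission("senate")


def confirm_attribution(row: dict, principal: Principal) -> dict:
    """Set review_status='confirmed' only after the Senate check passes."""
    require_senate_role()(principal)
    return {**row, "review_status": "confirmed"}


if __name__ == "__main__":
    senator = Principal("key-1", frozenset({"senate"}))
    print(confirm_attribution({"review_status": "pending"}, senator))
```

A caller holding a valid key but no "senate" permission fails at the guard and the row is never modified, which is the gap the merge gate flagged.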
