SciDEX — Task: [Forge] Unified literature_search tool

Create a new unified literature_search tool for CLAUDE_TOOLS that replaces the current pubmed_search as the primary paper discovery interface. Design:

(1) Accepts query + optional filters (date range, journal, author, provider preference).
(2) Searches across ALL configured providers in parallel (PubMed, Semantic Scholar, Paperclip, OpenAlex, CrossRef).
(3) Deduplicates results by DOI/PMID/title fuzzy match.
(4) Returns normalized results with: title, authors, year, journal, abstract/TLDR, citation_count, DOI, PMID, all external IDs, relevance score per provider.
(5) Automatically cites/links papers in the SciDEX knowledge graph when referenced in a debate.
(6) Strong citation format: each paper reference includes a stable URI (doi.org link preferred, fallback to PubMed URL, then Semantic Scholar URL).

Keep pubmed_search as a legacy alias that routes through the new tool with provider=['pubmed'].

## REOPENED TASK: CRITICAL CONTEXT

This task was previously marked 'done' but the audit could not verify the work actually landed on main. The original work may have been:

- Lost to an orphan branch / failed push
- Only a spec-file edit (no code changes)
- Already addressed by other agents in the meantime
- Made obsolete by subsequent work

**Before doing anything else:**

1. **Re-evaluate the task in light of CURRENT main state.** Read the spec and the relevant files on origin/main NOW. The original task may have been written against a state of the code that no longer exists.
2. **Verify the task still advances SciDEX's aims.** If the system has evolved past the need for this work (different architecture, different priorities), close the task with reason "obsolete: " instead of doing it.
3. **Check if it's already done.** Run `git log --grep=''` and read the related commits. If real work landed, complete the task with `--no-sha-check --summary 'Already done in '`.
4. **Make sure your changes don't regress recent functionality.** Many agents have been working on this codebase. Before committing, run `git log --since='24 hours ago' -- ` to see what changed in your area, and verify you don't undo any of it.
5. **Stay scoped.** Only do what this specific task asks for. Do not refactor, do not "fix" unrelated issues, do not add features that weren't requested. Scope creep at this point is regression risk.

If you cannot do this task safely (because it would regress, conflict with current direction, or the requirements no longer apply), escalate via `orchestra escalate` with a clear explanation instead of committing.

Last Error

Audit reopened: ORPHAN_BRANCH — 1 commit(s) found but none on main; branch=paperclip-mcp-adapter-v2

Git Commits (1)

[Forge] Unified literature_search tool — multi-provider with citation handling [task:ed0d0fe6-423b-45c3-8a5e-5415267fb5bb] 2026-04-11

Spec File

Spec: Unified literature_search Tool

Task ID

ed0d0fe6-423b-45c3-8a5e-5415267fb5bb

Overview

Create a new unified literature_search tool for CLAUDE_TOOLS that replaces pubmed_search as the primary paper discovery interface. The tool searches multiple providers in parallel, deduplicates results, and returns normalized results with rich metadata and stable citation URIs.

Key Background

PaperCorpus class already exists in tools.py (line 898) with:

- Multi-provider adapters: pubmed, semantic_scholar, openalex, crossref, paperclip
- Basic deduplication by external_ids
- Local SQLite caching via _upsert_cached_paper

The task is to build a HIGHER-LEVEL unified tool on top of this infrastructure that adds:

1. Parallel multi-provider search with relevance scoring
2. Strong citation format with stable URIs
3. Knowledge graph auto-citation when papers are referenced in debates
4. Legacy pubmed_search alias

Design

Signature

@log_tool_call
def literature_search(
    query: str,
    max_results: int = 10,
    providers: str = "",  # comma-separated, empty = all
    date_from: str = "",  # YYYY-MM-DD
    date_to: str = "",    # YYYY-MM-DD
    journal: str = "",
    author: str = "",
    include_abstract: bool = True,
    cite_in_kg: bool = True,  # auto-cite in KG when referenced
) -> dict:
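
The `providers` argument is a comma-separated string where an empty value means "all". A minimal sketch of how that string might be parsed follows. The helper name `parse_providers`, the `ALL_PROVIDERS` constant, and the choice to raise on unknown names are assumptions for illustration, not part of the spec.

```python
# Hypothetical constant: the provider names listed under Key Background.
ALL_PROVIDERS = ("pubmed", "semantic_scholar", "openalex", "crossref", "paperclip")


def parse_providers(providers: str) -> list[str]:
    """Turn 'pubmed, crossref' into ['pubmed', 'crossref']; '' selects all."""
    names = [p.strip().lower() for p in providers.split(",") if p.strip()]
    if not names:
        return list(ALL_PROVIDERS)
    unknown = [n for n in names if n not in ALL_PROVIDERS]
    if unknown:
        # Assumption: fail loudly rather than silently ignoring a typo.
        raise ValueError(f"unknown provider(s): {', '.join(unknown)}")
    # Preserve caller order while dropping duplicates.
    return list(dict.fromkeys(names))
```

With this sketch, `parse_providers("pubmed")` yields `["pubmed"]`, which is the value the legacy alias would rely on.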

Returns

{
    "query": str,
    "total_count": int,
    "results": [
        {
            # Core IDs
            "doi": str,
            "pmid": str,
            "paper_id": str,  # Semantic Scholar ID

            # Metadata
            "title": str,
            "authors": [str],  # top 5
            "year": int,
            "journal": str,
            "abstract": str,
            "tldr": str,

            # Metrics
            "citation_count": int,
            "influential_citation_count": int,

            # Stable citation URI (strong citation format)
            "citation_uri": "https://doi.org/10.xxxx/xxxxx",  # doi.org preferred

            # Per-provider relevance scores
            "provider_scores": {
                "pubmed": 0.95,
                "semantic_scholar": 0.88,
                ...
            },

            # External IDs from all providers
            "external_ids": {
                "doi": str,
                "pmid": str,
                "paper_id": str,
                "openalex": str,
                "crossref": str,
            },

            # Source provider that matched best
            "best_provider": str,
            "best_score": float,
        }
    ],
    "providers_searched": [str],
    "search_time_ms": int,
}

Strong Citation Format

Each paper reference includes a stable URI in this priority order:

1. https://doi.org/{doi} (preferred: persistent, resolver-based)
2. https://pubmed.ncbi.nlm.nih.gov/{pmid} (fallback if no DOI)
3. https://www.semanticscholar.org/paper/{paper_id} (final fallback)
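
The priority order above can be sketched as a small pure function. The name `build_citation_uri` is hypothetical, and the prefix stripping assumes that some providers may return DOIs already wrapped in a resolver URL or a `doi:` scheme; the spec does not say so.

```python
from typing import Optional


def build_citation_uri(doi: str = "", pmid: str = "", paper_id: str = "") -> Optional[str]:
    """Return the most stable URI available: DOI, then PubMed, then Semantic Scholar."""
    doi = doi.strip()
    # Assumption: normalize DOIs that arrive with a resolver or scheme prefix.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if doi:
        return f"https://doi.org/{doi}"
    if pmid.strip():
        return f"https://pubmed.ncbi.nlm.nih.gov/{pmid.strip()}"
    if paper_id.strip():
        return f"https://www.semanticscholar.org/paper/{paper_id.strip()}"
    return None  # no stable identifier available


print(build_citation_uri(doi="10.1000/xyz", pmid="12345"))  # https://doi.org/10.1000/xyz
print(build_citation_uri(pmid="12345"))  # https://pubmed.ncbi.nlm.nih.gov/12345
```

Returning `None` when no identifier exists is one possible choice; the spec does not define that case.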

Provider Parallel Search

- All configured providers run in parallel threads (ThreadPoolExecutor).
- Each provider returns a relevance score based on its own ranking.
- Results are merged, deduplicated, and re-ranked by aggregate score.
- Timeout per provider: 15 seconds.
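
A minimal sketch of the fan-out described above, under stated assumptions. The `adapters` mapping of provider name to callable is a hypothetical stand-in for the PaperCorpus adapters, whose real interface is not shown here. Because every adapter starts at the same moment, one shared deadline acts as the per-provider timeout. Python threads cannot be forcibly stopped, so a slow provider is abandoned rather than interrupted.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable

PROVIDER_TIMEOUT_S = 15  # per-provider timeout from the spec


def search_all(
    query: str,
    adapters: dict[str, Callable[[str], list[dict]]],
    timeout: float = PROVIDER_TIMEOUT_S,
) -> dict[str, list[dict]]:
    """Run every adapter in its own thread; drop those that fail or time out."""
    if not adapters:
        return {}
    # One worker per provider, so all searches start immediately.
    pool = ThreadPoolExecutor(max_workers=len(adapters))
    futures = {pool.submit(fn, query): name for name, fn in adapters.items()}
    done, _not_done = wait(futures, timeout=timeout)
    results: dict[str, list[dict]] = {}
    for future in done:
        try:
            results[futures[future]] = future.result()
        except Exception:
            # A failing provider should not sink the whole search.
            continue
    # Do not block on stragglers; they finish (or fail) in the background.
    pool.shutdown(wait=False, cancel_futures=True)
    return results
```

The keys of the returned dict could feed the `providers_searched` field, although whether timed-out providers count as "searched" is left open by the spec.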

Deduplication Strategy

Primary: exact DOI match

Secondary: exact PMID match

Tertiary: title fuzzy match (normalize, then Jaccard similarity > 0.85)

Knowledge Graph Auto-Citation

When cite_in_kg=True and results are returned, the tool automatically:

1. Upserts each paper that has a DOI or PMID into the papers table.
2. Creates citation edges in the KG for any papers already in the KG.

This happens via kg_add_paper_citations() or an equivalent helper.
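
Since the signature of kg_add_paper_citations() is not shown here, the sketch below covers only the selection logic and leaves storage out entirely. It splits results into papers eligible for upsert and keys that should receive citation edges. The names `paper_key` and `plan_kg_citations`, and the `doi:` / `pmid:` key format, are assumptions for illustration.

```python
from typing import Optional


def paper_key(paper: dict) -> Optional[str]:
    """Hypothetical stable key: DOI preferred, PMID as fallback."""
    if paper.get("doi"):
        return f"doi:{paper['doi'].lower()}"
    if paper.get("pmid"):
        return f"pmid:{paper['pmid']}"
    return None


def plan_kg_citations(
    results: list[dict], existing_keys: set[str]
) -> tuple[list[dict], list[str]]:
    """Return (papers to upsert, keys of papers already in the KG that need edges)."""
    to_upsert: list[dict] = []
    edge_keys: list[str] = []
    for paper in results:
        key = paper_key(paper)
        if key is None:
            continue  # no DOI or PMID: skipped, per the rule above
        to_upsert.append(paper)
        if key in existing_keys:
            edge_keys.append(key)
    return to_upsert, edge_keys
```

Keeping the decision step pure makes it easy to test without touching the database, which matters given the restrictions on schema and migration changes listed below.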

Legacy pubmed_search Alias

Keep pubmed_search as a thin alias:

def pubmed_search(query, max_results=10):
    """Legacy alias — searches PubMed only."""
    return literature_search(query, max_results=max_results, providers="pubmed")

Implementation Plan

Step 1: Create spec (this file)

Step 2: Build `_normalize_paper()` helper

Normalize paper dicts from all providers to unified schema.
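
A minimal sketch of such a helper, assuming the provider adapters already return dicts whose keys roughly match the Returns schema (the real adapter output may differ and would need per-provider mapping). It fills defaults, trims the author list to five, and coerces the year to an integer. The `source_provider` key is a hypothetical addition for later scoring.

```python
def normalize_paper(raw: dict, provider: str) -> dict:
    """Coerce a provider result into the unified schema, filling safe defaults."""
    ids = dict(raw.get("external_ids") or {})
    doi = str(raw.get("doi") or ids.get("doi") or "").strip()
    pmid = str(raw.get("pmid") or ids.get("pmid") or "").strip()
    paper_id = str(raw.get("paper_id") or ids.get("paper_id") or "").strip()
    try:
        year = int(raw.get("year"))
    except (TypeError, ValueError):
        year = None  # missing or unparseable year
    return {
        "doi": doi,
        "pmid": pmid,
        "paper_id": paper_id,
        # Collapse runs of whitespace in titles.
        "title": " ".join(str(raw.get("title") or "").split()),
        "authors": [str(a) for a in (raw.get("authors") or [])][:5],  # top 5
        "year": year,
        "journal": str(raw.get("journal") or ""),
        "abstract": str(raw.get("abstract") or ""),
        "tldr": str(raw.get("tldr") or ""),
        "citation_count": int(raw.get("citation_count") or 0),
        "influential_citation_count": int(raw.get("influential_citation_count") or 0),
        "external_ids": {**ids, "doi": doi, "pmid": pmid, "paper_id": paper_id},
        "source_provider": provider,  # hypothetical bookkeeping field
    }
```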

Step 3: Build `_compute_relevance_score()` helper

Compute relevance score per provider based on title match quality and ranking.
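
One way this could look, as a sketch only: blend the fraction of query terms found in the title with a score derived from the result's position in the provider's ranking. The equal weighting, the 0-based rank convention, and both function names are assumptions; the spec names the two signals but not how to combine them.

```python
def compute_relevance_score(
    query: str, title: str, rank: int, total: int, title_weight: float = 0.5
) -> float:
    """Blend title overlap with rank position; rank is 0-based within one provider."""
    query_terms = set(query.lower().split())
    title_terms = set(title.lower().split())
    # Fraction of query terms that appear in the title.
    title_score = len(query_terms & title_terms) / len(query_terms) if query_terms else 0.0
    # Top-ranked result scores 1.0, decreasing linearly with position.
    rank_score = 1.0 - rank / total if total > 0 else 0.0
    return round(title_weight * title_score + (1 - title_weight) * rank_score, 4)


def best_match(provider_scores: dict[str, float]) -> tuple[str, float]:
    """Pick the provider with the highest score, for best_provider / best_score."""
    best = max(provider_scores, key=provider_scores.get)
    return best, provider_scores[best]


print(best_match({"pubmed": 0.95, "semantic_scholar": 0.88}))  # ('pubmed', 0.95)
```

Provider rankings are not directly comparable, so a rank-based score like this is only a rough proxy when aggregating across providers.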

Step 4: Implement `literature_search()`

Main function using PaperCorpus + ThreadPoolExecutor for parallel search.

Step 5: Build `_upsert_paper_to_kg()` helper

Auto-cite papers in KG when cite_in_kg=True.

Step 6: Add `pubmed_search` legacy alias

Thin wrapper around literature_search(providers="pubmed").

Step 7: Register tool in forge_tools.py tool list

Add literature_search to the API tool registry (following existing pattern).

Acceptance Criteria

literature_search("TREM2 microglia", max_results=10) returns ≥1 result with all fields populated

Results include citation_uri with doi.org link (or PubMed/Semantic Scholar fallback)

Multiple providers searched in parallel (verify via logs)

pubmed_search still works and returns same format as before (backwards compatible)

Deduplication removes duplicate DOIs/PMIDs across providers

provider_scores shows per-provider relevance

KG auto-citation fires when cite_in_kg=True

Tool registered in forge_tools.py with category literature_search

Files to Modify

tools.py — add literature_search(), _normalize_paper(), _compute_relevance_score(), _upsert_paper_to_kg(), update pubmed_search alias
forge_tools.py — add literature_search to tool registry (if not already present as paper_corpus_search)

DO NOT Modify

api.py (critical file per task instructions)
migrations/, .sql
PostgreSQL

[Forge] Unified literature_search tool — multi-provider with citation handling closed