[Forge] Unified literature_search tool — multi-provider with citation handling

Status: closed

Create a new unified literature_search tool for CLAUDE_TOOLS that replaces the current pubmed_search as the primary paper discovery interface.

Design:
(1) Accepts query + optional filters (date range, journal, author, provider preference).
(2) Searches across ALL configured providers in parallel (PubMed, Semantic Scholar, Paperclip, OpenAlex, CrossRef).
(3) Deduplicates results by DOI/PMID/title fuzzy match.
(4) Returns normalized results with: title, authors, year, journal, abstract/TLDR, citation_count, DOI, PMID, all external IDs, relevance score per provider.
(5) Automatically cites/links papers in the SciDEX knowledge graph when referenced in a debate.
(6) Strong citation format: each paper reference includes a stable URI (doi.org link preferred, fallback to PubMed URL, then Semantic Scholar URL).

Keep pubmed_search as a legacy alias that routes through the new tool with provider=['pubmed'].

## REOPENED TASK — CRITICAL CONTEXT

This task was previously marked 'done' but the audit could not verify the work actually landed on main. The original work may have been:
- Lost to an orphan branch / failed push
- Only a spec-file edit (no code changes)
- Already addressed by other agents in the meantime
- Made obsolete by subsequent work

**Before doing anything else:**

1. **Re-evaluate the task in light of CURRENT main state.** Read the spec and the relevant files on origin/main NOW. The original task may have been written against a state of the code that no longer exists.
2. **Verify the task still advances SciDEX's aims.** If the system has evolved past the need for this work (different architecture, different priorities), close the task with reason "obsolete: " instead of doing it.
3. **Check if it's already done.** Run `git log --grep=''` and read the related commits. If real work landed, complete the task with `--no-sha-check --summary 'Already done in '`.
4. **Make sure your changes don't regress recent functionality.** Many agents have been working on this codebase. Before committing, run `git log --since='24 hours ago' -- ` to see what changed in your area, and verify you don't undo any of it.
5. **Stay scoped.** Only do what this specific task asks for. Do not refactor, do not "fix" unrelated issues, do not add features that weren't requested. Scope creep at this point is regression risk.

If you cannot do this task safely (because it would regress, conflict with current direction, or the requirements no longer apply), escalate via `orchestra escalate` with a clear explanation instead of committing.

Last Error

Audit reopened: ORPHAN_BRANCH — 1 commit(s) found but none on main; branch=paperclip-mcp-adapter-v2

Git Commits (1)

[Forge] Unified literature_search tool — multi-provider with citation handling [task:ed0d0fe6-423b-45c3-8a5e-5415267fb5bb] 2026-04-11

Spec File

Spec: Unified literature_search Tool

Task ID

ed0d0fe6-423b-45c3-8a5e-5415267fb5bb

Overview

Create a new unified literature_search tool for CLAUDE_TOOLS that replaces pubmed_search as the primary paper discovery interface. The tool searches multiple providers in parallel, deduplicates results, and returns normalized results with rich metadata and stable citation URIs.

Key Background

PaperCorpus class already exists in tools.py (line 898) with:
  • Multi-provider adapters: pubmed, semantic_scholar, openalex, crossref, paperclip
  • Basic deduplication by external_ids
  • Local SQLite caching via _upsert_cached_paper

The task is to build a HIGHER-LEVEL unified tool on top of this infrastructure that adds:
  • Parallel multi-provider search with relevance scoring
  • Strong citation format with stable URIs
  • Knowledge graph auto-citation when papers are referenced in debates
  • Legacy pubmed_search alias

Design

    Signature

    @log_tool_call
    def literature_search(
        query: str,
        max_results: int = 10,
        providers: str = "",  # comma-separated, empty = all
        date_from: str = "",  # YYYY-MM-DD
        date_to: str = "",    # YYYY-MM-DD
        journal: str = "",
        author: str = "",
        include_abstract: bool = True,
        cite_in_kg: bool = True,  # auto-cite in KG when referenced
    ) -> dict:

    Returns

    {
        "query": str,
        "total_count": int,
        "results": [
            {
                # Core IDs
                "doi": str,
                "pmid": str,
                "paper_id": str,  # Semantic Scholar ID
    
                # Metadata
                "title": str,
                "authors": [str],  # top 5
                "year": int,
                "journal": str,
                "abstract": str,
                "tldr": str,
    
                # Metrics
                "citation_count": int,
                "influential_citation_count": int,
    
                # Stable citation URI (strong citation format)
                "citation_uri": "https://doi.org/10.xxxx/xxxxx",  # doi.org preferred
    
                # Per-provider relevance scores
                "provider_scores": {
                    "pubmed": 0.95,
                    "semantic_scholar": 0.88,
                    ...
                },
    
                # External IDs from all providers
                "external_ids": {
                    "doi": str,
                    "pmid": str,
                    "paper_id": str,
                    "openalex": str,
                    "crossref": str,
                },
    
                # Source provider that matched best
                "best_provider": str,
                "best_score": float,
            }
        ],
        "providers_searched": [str],
        "search_time_ms": int,
    }

    Strong Citation Format

    Each paper reference includes a stable URI in this priority order:
  • https://doi.org/{doi} — preferred, persistent, resolver-based
  • https://pubmed.ncbi.nlm.nih.gov/{pmid} — fallback if no DOI
  • https://www.semanticscholar.org/paper/{paper_id} — final fallback
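
The priority order above can be sketched as a small helper. This is a minimal illustration, not the implementation in tools.py: the function name `build_citation_uri` and the DOI prefix cleanup are assumptions added for the example.

```python
from typing import Optional


def build_citation_uri(doi: str = "", pmid: str = "", paper_id: str = "") -> Optional[str]:
    """Return the most stable URI available (hypothetical helper name)."""
    doi = (doi or "").strip()
    # Assumption: providers may hand back DOIs with a resolver prefix; strip it.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if doi:
        return f"https://doi.org/{doi}"  # preferred

    pmid = str(pmid or "").strip()
    if pmid:
        return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}"  # fallback if no DOI

    paper_id = (paper_id or "").strip()
    if paper_id:
        return f"https://www.semanticscholar.org/paper/{paper_id}"  # final fallback

    return None  # no usable identifier
```

Returning `None` when no identifier is present is one possible choice; the spec does not say what `citation_uri` should hold in that case.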

    Provider Parallel Search

    • All configured providers run in parallel threads (ThreadPoolExecutor)
    • Each returns relevance score based on provider-specific ranking
    • Results merged, deduplicated, re-ranked by aggregate score
    • Timeout per provider: 15 seconds
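
A rough sketch of the fan-out step described above, using only the standard library. The provider callables and the name `search_providers_parallel` are hypothetical stand-ins; the real adapters live in PaperCorpus and their call signature may differ. Because all providers are submitted at once, a single shared deadline behaves like a per-provider timeout here.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable, Dict, List

# Assumed adapter shape: (query, max_results) -> list of paper dicts.
ProviderFn = Callable[[str, int], List[dict]]


def search_providers_parallel(
    query: str,
    max_results: int,
    providers: Dict[str, ProviderFn],
    timeout_s: float = 15.0,
) -> Dict[str, List[dict]]:
    """Run every provider in its own thread; slow or failing ones yield []."""
    if not providers:
        return {}
    executor = ThreadPoolExecutor(max_workers=len(providers))
    try:
        futures = {
            executor.submit(fn, query, max_results): name
            for name, fn in providers.items()
        }
        done, not_done = wait(futures, timeout=timeout_s)
        results: Dict[str, List[dict]] = {}
        for future in done:
            try:
                results[futures[future]] = future.result()
            except Exception:
                # A failing provider should not sink the whole search.
                results[futures[future]] = []
        for future in not_done:
            results[futures[future]] = []  # timed out
        return results
    finally:
        # Do not block on stragglers; their threads finish in the background.
        executor.shutdown(wait=False)
```

Note that a timed-out thread is not killed, only ignored, so each adapter should still set its own network timeout. Merging and re-ranking by aggregate score would happen on the returned dict.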

    Deduplication Strategy

  • Primary: exact DOI match
  • Secondary: exact PMID match
  • Tertiary: title fuzzy match (normalize, then Jaccard similarity > 0.85)
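
The three tiers above could look roughly like the following. This is a sketch under assumptions: the helper names are hypothetical, titles are normalized by lowercasing and dropping punctuation, and Jaccard similarity is computed over word sets. The spec does not say how titles are tokenized.

```python
import re
from typing import List, Set


def _title_tokens(title: str) -> Set[str]:
    # Normalize: lowercase, keep only alphanumeric word tokens.
    return set(re.findall(r"[a-z0-9]+", (title or "").lower()))


def title_jaccard(a: str, b: str) -> float:
    ta, tb = _title_tokens(a), _title_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_duplicate(a: dict, b: dict, threshold: float = 0.85) -> bool:
    doi_a = (a.get("doi") or "").strip().lower()
    doi_b = (b.get("doi") or "").strip().lower()
    if doi_a and doi_a == doi_b:
        return True  # primary: exact DOI match
    pmid_a = str(a.get("pmid") or "").strip()
    pmid_b = str(b.get("pmid") or "").strip()
    if pmid_a and pmid_a == pmid_b:
        return True  # secondary: exact PMID match
    # Tertiary: fuzzy title match.
    return title_jaccard(a.get("title", ""), b.get("title", "")) > threshold


def deduplicate(papers: List[dict], threshold: float = 0.85) -> List[dict]:
    """Keep the first occurrence of each paper (quadratic, fine for small result sets)."""
    unique: List[dict] = []
    for paper in papers:
        if not any(is_duplicate(paper, kept, threshold) for kept in unique):
            unique.append(paper)
    return unique
```

A fuller version would probably merge the metadata of duplicates (for example, combining `provider_scores`) rather than discarding the later record.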

    Knowledge Graph Auto-Citation

    When cite_in_kg=True and results are returned, automatically:
  • For each paper with a DOI/PMID, upsert into papers table
  • Create citation edges in KG for any papers already in the KG
  • This happens via kg_add_paper_citations() or equivalent

    Legacy pubmed_search Alias

    Keep pubmed_search as a thin alias:

    def pubmed_search(query, max_results=10):
        """Legacy alias — searches PubMed only."""
        return literature_search(query, max_results=max_results, providers="pubmed")

    Implementation Plan

    Step 1: Create spec (this file)

    Step 2: Build _normalize_paper() helper

    Normalize paper dicts from all providers to unified schema.
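
One possible shape for such a helper is sketched below. It assumes the provider adapters already return dicts with keys close to the unified schema (`doi`, `pmid`, `paper_id`, `external_ids`, and so on); the actual PaperCorpus output may differ, and both the function name and the `source_provider` field are hypothetical.

```python
from typing import Any, Dict


def normalize_paper_sketch(raw: Dict[str, Any], provider: str) -> Dict[str, Any]:
    """Coerce one provider record into the unified result shape (illustrative only)."""
    # Collect identifiers, preferring top-level keys over external_ids entries.
    ids = {k: str(v) for k, v in (raw.get("external_ids") or {}).items() if v}
    for key in ("doi", "pmid", "paper_id"):
        if raw.get(key):
            ids[key] = str(raw[key])

    try:
        year = int(raw["year"])
    except (KeyError, TypeError, ValueError):
        year = None  # missing or unparseable year

    return {
        "doi": ids.get("doi", ""),
        "pmid": ids.get("pmid", ""),
        "paper_id": ids.get("paper_id", ""),
        "title": str(raw.get("title") or "").strip(),
        "authors": [str(a) for a in (raw.get("authors") or [])][:5],  # top 5
        "year": year,
        "journal": str(raw.get("journal") or ""),
        "abstract": str(raw.get("abstract") or ""),
        "tldr": str(raw.get("tldr") or ""),
        "citation_count": int(raw.get("citation_count") or 0),
        "influential_citation_count": int(raw.get("influential_citation_count") or 0),
        "external_ids": ids,
        "source_provider": provider,  # hypothetical field, not in the spec schema
    }
```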

    Step 3: Build _compute_relevance_score() helper

    Compute relevance score per provider based on title match quality and ranking.
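
The spec leaves the exact formula open. One simple way to combine the two signals is an equal-weight blend of rank position and query-term coverage in the title, sketched below. The function name, the 0.5 weight, and the token-overlap measure are all assumptions made for illustration.

```python
import re
from typing import Set


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", (text or "").lower()))


def relevance_score_sketch(
    query: str,
    title: str,
    rank: int,            # 0-based position in the provider's own ranking
    total: int,           # number of results the provider returned
    rank_weight: float = 0.5,
) -> float:
    """Blend provider rank and title match quality into a score in [0, 1]."""
    # Rank signal: first result scores 1.0, later results decay linearly.
    rank_score = 1.0 - (rank / total) if total > 0 else 0.0

    # Title signal: fraction of query terms that appear in the title.
    query_terms = _tokens(query)
    title_score = (
        len(query_terms & _tokens(title)) / len(query_terms) if query_terms else 0.0
    )

    score = rank_weight * rank_score + (1.0 - rank_weight) * title_score
    return max(0.0, min(1.0, score))
```

Scores produced this way are only comparable across providers to the extent that each provider's ranking is meaningful, which is worth keeping in mind when re-ranking by aggregate score.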

    Step 4: Implement literature_search()

    Main function using PaperCorpus + ThreadPoolExecutor for parallel search.

    Step 5: Build _upsert_paper_to_kg() helper

    Auto-cite papers in KG when cite_in_kg=True.

    Step 6: Add pubmed_search legacy alias

    Thin wrapper around literature_search(providers="pubmed").

    Step 7: Register tool in forge_tools.py tool list

    Add literature_search to the API tool registry (following existing pattern).

    Acceptance Criteria

  • literature_search("TREM2 microglia", max_results=10) returns ≥1 result with all fields populated
  • Results include citation_uri with doi.org link (or PubMed/Semantic Scholar fallback)
  • Multiple providers searched in parallel (verify via logs)
  • pubmed_search still works and returns same format as before (backwards compatible)
  • Deduplication removes duplicate DOIs/PMIDs across providers
  • provider_scores shows per-provider relevance
  • KG auto-citation fires when cite_in_kg=True
  • Tool registered in forge_tools.py with category literature_search

    Files to Modify

    • tools.py — add literature_search(), _normalize_paper(), _compute_relevance_score(), _upsert_paper_to_kg(), update pubmed_search alias
    • forge_tools.py — add literature_search to tool registry (if not already present as paper_corpus_search)

    DO NOT Modify

    • api.py (critical file per task instructions)
    • migrations/, .sql
    • PostgreSQL