Spec: Unified literature_search Tool

Task ID

ed0d0fe6-423b-45c3-8a5e-5415267fb5bb

Overview

Create a new unified literature_search tool for CLAUDE_TOOLS that replaces pubmed_search as the primary paper discovery interface. The tool searches multiple providers in parallel, deduplicates results, and returns normalized results with rich metadata and stable citation URIs.

Key Background

The PaperCorpus class already exists in tools.py (line 898) and provides:
  • Multi-provider adapters: pubmed, semantic_scholar, openalex, crossref, paperclip
  • Basic deduplication by external_ids
  • Local SQLite caching via _upsert_cached_paper

The task is to build a HIGHER-LEVEL unified tool on top of this infrastructure that adds:
  • Parallel multi-provider search with relevance scoring
  • Strong citation format with stable URIs
  • Knowledge graph auto-citation when papers are referenced in debates
  • Legacy pubmed_search alias

Design

    Signature

    @log_tool_call
    def literature_search(
        query: str,
        max_results: int = 10,
        providers: str = "",  # comma-separated, empty = all
        date_from: str = "",  # YYYY-MM-DD
        date_to: str = "",    # YYYY-MM-DD
        journal: str = "",
        author: str = "",
        include_abstract: bool = True,
        cite_in_kg: bool = True,  # auto-cite in KG when referenced
    ) -> dict:
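
The `providers` argument is a comma-separated string where an empty value means "all". A minimal sketch of how that string might be parsed follows. The helper name `parse_providers`, the `KNOWN_PROVIDERS` constant, and the choice to raise on unknown names are assumptions for illustration, not part of the existing code.

```python
# Provider names taken from the PaperCorpus adapter list in this spec.
KNOWN_PROVIDERS = ("pubmed", "semantic_scholar", "openalex", "crossref", "paperclip")


def parse_providers(providers: str) -> list[str]:
    """Turn the comma-separated `providers` argument into a clean list.

    An empty or whitespace-only string selects every known provider.
    Rejecting unknown names is an assumed policy, not stated in the spec.
    """
    names = [p.strip().lower() for p in providers.split(",") if p.strip()]
    if not names:
        return list(KNOWN_PROVIDERS)
    unknown = [n for n in names if n not in KNOWN_PROVIDERS]
    if unknown:
        raise ValueError(f"unknown providers: {unknown}")
    # dict.fromkeys drops repeated names while keeping the caller's order.
    return list(dict.fromkeys(names))
```

With this sketch, `parse_providers("")` yields all five providers, and `parse_providers("pubmed")` yields the single-provider list the legacy alias would need.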

    Returns

    {
        "query": str,
        "total_count": int,
        "results": [
            {
                # Core IDs
                "doi": str,
                "pmid": str,
                "paper_id": str,  # Semantic Scholar ID
    
                # Metadata
                "title": str,
                "authors": [str],  # top 5
                "year": int,
                "journal": str,
                "abstract": str,
                "tldr": str,
    
                # Metrics
                "citation_count": int,
                "influential_citation_count": int,
    
                # Stable citation URI (strong citation format)
                "citation_uri": "https://doi.org/10.xxxx/xxxxx",  # doi.org preferred
    
                # Per-provider relevance scores
                "provider_scores": {
                    "pubmed": 0.95,
                    "semantic_scholar": 0.88,
                    ...
                },
    
                # External IDs from all providers
                "external_ids": {
                    "doi": str,
                    "pmid": str,
                    "paper_id": str,
                    "openalex": str,
                    "crossref": str,
                },
    
                # Source provider that matched best
                "best_provider": str,
                "best_score": float,
            }
        ],
        "providers_searched": [str],
        "search_time_ms": int,
    }

    Strong Citation Format

    Each paper reference carries one stable URI, chosen in this priority order:
  • https://doi.org/{doi} — preferred, persistent, resolver-based
  • https://pubmed.ncbi.nlm.nih.gov/{pmid} — fallback if no DOI
  • https://www.semanticscholar.org/paper/{paper_id} — final fallback
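
The priority order above can be sketched as a small pure function. The name `build_citation_uri`, the stripping of an existing `doi:` or resolver prefix, and the empty-string return when no identifier is present are assumptions made for this example.

```python
def build_citation_uri(doi: str = "", pmid: str = "", paper_id: str = "") -> str:
    """Return the most stable URI available: DOI, then PMID, then Semantic Scholar ID."""
    doi = doi.strip()
    # Assumption: providers may return the DOI already wrapped in a prefix.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if doi:
        return f"https://doi.org/{doi}"
    if pmid.strip():
        return f"https://pubmed.ncbi.nlm.nih.gov/{pmid.strip()}"
    if paper_id.strip():
        return f"https://www.semanticscholar.org/paper/{paper_id.strip()}"
    # No usable identifier: the caller decides how to handle an uncitable paper.
    return ""
```

Keeping this logic in one helper means every result gets its `citation_uri` from the same rule, regardless of which provider supplied it.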

    Provider Parallel Search

    • All configured providers run in parallel threads (ThreadPoolExecutor)
    • Each provider returns a relevance score based on its own ranking
    • Results are merged, deduplicated, and re-ranked by aggregate score
    • Timeout per provider: 15 seconds
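
A minimal sketch of the fan-out, assuming each provider is exposed as a plain callable that takes the query and returns a list of paper dicts. The function name `search_all` and the `adapters` mapping are hypothetical; how PaperCorpus actually exposes its adapters is not shown in this spec. Because every call is submitted at once, a shared deadline gives each provider roughly the full timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

PROVIDER_TIMEOUT_S = 15  # per-provider timeout from the spec


def search_all(
    query: str,
    adapters: dict[str, Callable[[str], list[dict]]],
    timeout: float = PROVIDER_TIMEOUT_S,
) -> tuple[dict[str, list[dict]], dict[str, str]]:
    """Run every adapter concurrently; return (results, failures) keyed by provider."""
    results: dict[str, list[dict]] = {}
    failed: dict[str, str] = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(adapters)))
    try:
        futures = {name: pool.submit(fn, query) for name, fn in adapters.items()}
        deadline = time.monotonic() + timeout
        for name, future in futures.items():
            remaining = max(0.0, deadline - time.monotonic())
            try:
                results[name] = future.result(timeout=remaining)
            except FutureTimeout:
                failed[name] = "timeout"
            except Exception as exc:  # one failing provider must not sink the search
                failed[name] = repr(exc)
    finally:
        # Do not block on stragglers; a running thread cannot be force-stopped.
        pool.shutdown(wait=False, cancel_futures=True)
    return results, failed
```

The keys of `results` map naturally onto `providers_searched`. Note that a timed-out provider thread keeps running in the background until its call returns, so adapters should also set their own network timeouts.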

    Deduplication Strategy

  • Primary: exact DOI match
  • Secondary: exact PMID match
  • Tertiary: title fuzzy match (normalize, then Jaccard similarity > 0.85)
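
The three tiers can be sketched as follows. The spec does not say what the Jaccard similarity is computed over, so this example assumes sets of lowercase alphanumeric word tokens. It also compares DOIs case-insensitively and treats a match on any tier as a duplicate. The function names are illustrative.

```python
import re

TITLE_SIMILARITY_THRESHOLD = 0.85  # from the spec: Jaccard similarity > 0.85


def title_tokens(title: str) -> frozenset[str]:
    """Normalize a title to a set of lowercase alphanumeric tokens (assumed scheme)."""
    return frozenset(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Size of the intersection over size of the union; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_duplicate(p: dict, q: dict) -> bool:
    """Apply the tiers in order: DOI, then PMID, then fuzzy title."""
    if p.get("doi") and q.get("doi") and p["doi"].lower() == q["doi"].lower():
        return True
    if p.get("pmid") and q.get("pmid") and str(p["pmid"]) == str(q["pmid"]):
        return True
    similarity = jaccard(title_tokens(p.get("title", "")), title_tokens(q.get("title", "")))
    return similarity > TITLE_SIMILARITY_THRESHOLD


def deduplicate(papers: list[dict]) -> list[dict]:
    """Keep the first occurrence of each paper; quadratic, fine for small result sets."""
    unique: list[dict] = []
    for paper in papers:
        if not any(is_duplicate(paper, kept) for kept in unique):
            unique.append(paper)
    return unique
```

A fuller implementation would merge the duplicate's `provider_scores` and `external_ids` into the kept record rather than discarding it; this sketch only shows the matching rule.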

    Knowledge Graph Auto-Citation

    When cite_in_kg=True and results are returned, the tool automatically:
  • Upserts each paper that has a DOI or PMID into the papers table
  • Creates citation edges in the KG for any papers already in the KG
  • Does this via kg_add_paper_citations() or an equivalent helper

    Legacy pubmed_search Alias

    Keep pubmed_search as a thin alias:

    def pubmed_search(query, max_results=10):
        """Legacy alias — searches PubMed only."""
        return literature_search(query, max_results=max_results, providers="pubmed")

    Implementation Plan

    Step 1: Create spec (this file)

    Step 2: Build _normalize_paper() helper

    Normalize paper dicts from all providers to unified schema.
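
As an illustration of one provider branch, the sketch below maps a Semantic Scholar Graph API paper record onto the unified schema. It assumes the adapter hands over the raw API field names (`paperId`, `externalIds`, `citationCount`, and so on); the PaperCorpus adapter may already reshape these, in which case the key names would differ. The function name is hypothetical.

```python
def normalize_semantic_scholar(raw: dict) -> dict:
    """Map a raw Semantic Scholar paper record to the unified result schema."""
    ext = raw.get("externalIds") or {}
    doi = ext.get("DOI") or ""
    pmid = str(ext.get("PubMed") or "")
    paper_id = raw.get("paperId") or ""
    # Prefer the structured journal name, fall back to the free-text venue.
    journal = (raw.get("journal") or {}).get("name") or raw.get("venue") or ""
    return {
        "doi": doi,
        "pmid": pmid,
        "paper_id": paper_id,
        "title": raw.get("title") or "",
        # The spec keeps only the top 5 authors.
        "authors": [a.get("name", "") for a in (raw.get("authors") or [])][:5],
        "year": raw.get("year"),
        "journal": journal,
        "abstract": raw.get("abstract") or "",
        "tldr": (raw.get("tldr") or {}).get("text") or "",
        "citation_count": raw.get("citationCount") or 0,
        "influential_citation_count": raw.get("influentialCitationCount") or 0,
        "external_ids": {"doi": doi, "pmid": pmid, "paper_id": paper_id},
    }
```

The `or` fallbacks matter because the API returns explicit nulls for missing fields, so `dict.get` defaults alone would let `None` through.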

    Step 3: Build _compute_relevance_score() helper

    Compute relevance score per provider based on title match quality and ranking.
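
One possible scoring rule, offered as a sketch since the spec names the two signals but not how they combine. It blends the fraction of query terms found in the title with a linear rank score. The equal weighting, the 0-based rank convention, and the function name are assumptions.

```python
import re


def compute_relevance_score(
    query: str, title: str, rank: int, total: int, title_weight: float = 0.5
) -> float:
    """Blend title match quality with provider rank into a score in [0, 1].

    rank is the 0-based position in the provider's own result list,
    total is the number of results that provider returned.
    """
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    title_terms = set(re.findall(r"[a-z0-9]+", title.lower()))
    # Fraction of query terms that appear in the title.
    title_match = len(query_terms & title_terms) / len(query_terms) if query_terms else 0.0
    # Top result scores 1.0; the last result scores 1/total.
    rank_score = 1.0 - rank / total if total > 0 and 0 <= rank < total else 0.0
    return round(title_weight * title_match + (1.0 - title_weight) * rank_score, 4)
```

Scores computed this way are comparable across providers, which is what `provider_scores`, `best_provider`, and the aggregate re-ranking need.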

    Step 4: Implement literature_search()

    Main function using PaperCorpus + ThreadPoolExecutor for parallel search.

    Step 5: Build _upsert_paper_to_kg() helper

    Auto-cite papers in KG when cite_in_kg=True.

    Step 6: Add pubmed_search legacy alias

    Thin wrapper around literature_search(providers="pubmed").

    Step 7: Register tool in forge_tools.py tool list

    Add literature_search to the API tool registry (following existing pattern).

    Acceptance Criteria

  • literature_search("TREM2 microglia", max_results=10) returns ≥1 result with all fields populated
  • Results include citation_uri with doi.org link (or PubMed/Semantic Scholar fallback)
  • Multiple providers searched in parallel (verify via logs)
  • pubmed_search still works and returns the same format as before (backwards compatible)
  • Deduplication removes duplicate DOIs/PMIDs across providers
  • provider_scores shows per-provider relevance
  • KG auto-citation fires when cite_in_kg=True
  • Tool registered in forge_tools.py with category literature_search

    Files to Modify

    • tools.py — add literature_search(), _normalize_paper(), _compute_relevance_score(), _upsert_paper_to_kg(), update pubmed_search alias
    • forge_tools.py — add literature_search to tool registry (if not already present as paper_corpus_search)

    DO NOT Modify

    • api.py (critical file per task instructions)
    • migrations/*, *.sql
    • PostgreSQL

    Tasks using this spec (1)
    [Forge] Unified literature_search tool — multi-provider with
    closed P88
    File: ed0d0fe6_423_spec.md
    Modified: 2026-05-01 20:13
    Size: 5.5 KB