Spec: Unified literature_search Tool

Task ID

ed0d0fe6-423b-45c3-8a5e-5415267fb5bb

Overview

Create a new unified literature_search tool for CLAUDE_TOOLS that replaces pubmed_search as the primary paper discovery interface. The tool searches multiple providers in parallel, deduplicates the combined results, and returns them in a normalized schema with rich metadata and stable citation URIs.

Key Background

The PaperCorpus class already exists in tools.py (line 898) and provides:

Multi-provider adapters: pubmed, semantic_scholar, openalex, crossref, paperclip
Basic deduplication by external_ids
Local SQLite caching via _upsert_cached_paper

The task is to build a HIGHER-LEVEL unified tool on top of this infrastructure that adds:

Parallel multi-provider search with relevance scoring

Strong citation format with stable URIs

Knowledge graph auto-citation when papers are referenced in debates

Legacy pubmed_search alias

Design

Signature

@log_tool_call
def literature_search(
    query: str,
    max_results: int = 10,
    providers: str = "",  # comma-separated, empty = all
    date_from: str = "",  # YYYY-MM-DD
    date_to: str = "",    # YYYY-MM-DD
    journal: str = "",
    author: str = "",
    include_abstract: bool = True,
    cite_in_kg: bool = True,  # auto-cite in KG when referenced
) -> dict:

Returns

{
    "query": str,
    "total_count": int,
    "results": [
        {
            # Core IDs
            "doi": str,
            "pmid": str,
            "paper_id": str,  # Semantic Scholar ID

            # Metadata
            "title": str,
            "authors": [str],  # top 5
            "year": int,
            "journal": str,
            "abstract": str,
            "tldr": str,

            # Metrics
            "citation_count": int,
            "influential_citation_count": int,

            # Stable citation URI (strong citation format)
            "citation_uri": "https://doi.org/10.xxxx/xxxxx",  # doi.org preferred

            # Per-provider relevance scores
            "provider_scores": {
                "pubmed": 0.95,
                "semantic_scholar": 0.88,
                ...
            },

            # External IDs from all providers
            "external_ids": {
                "doi": str,
                "pmid": str,
                "paper_id": str,
                "openalex": str,
                "crossref": str,
            },

            # Source provider that matched best
            "best_provider": str,
            "best_score": float,
        }
    ],
    "providers_searched": [str],
    "search_time_ms": int,
}

Strong Citation Format

Each paper reference includes a stable URI in this priority order:

https://doi.org/{doi} — preferred, persistent, resolver-based

https://pubmed.ncbi.nlm.nih.gov/{pmid} — fallback if no DOI

https://www.semanticscholar.org/paper/{paper_id} — final fallback
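
The priority order above can be sketched as a small helper. The function name `build_citation_uri` is hypothetical, and stripping a leading resolver or "doi:" prefix is an assumption about what providers may return, not something this spec requires.

```python
from typing import Optional


def build_citation_uri(doi: str = "", pmid: str = "", paper_id: str = "") -> Optional[str]:
    """Return the most stable URI available: DOI, then PubMed, then Semantic Scholar."""
    doi = doi.strip()
    # Assumption: some providers return DOIs with a resolver or "doi:" prefix.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    if doi:
        return f"https://doi.org/{doi}"
    if pmid.strip():
        return f"https://pubmed.ncbi.nlm.nih.gov/{pmid.strip()}"
    if paper_id.strip():
        return f"https://www.semanticscholar.org/paper/{paper_id.strip()}"
    return None  # no usable identifier
```

Returning None when no identifier is present leaves it to the caller to decide whether such a paper should appear in the results at all.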

Provider Parallel Search

All configured providers run in parallel threads (ThreadPoolExecutor)
Each provider returns a relevance score based on its own ranking
Results are merged, deduplicated, and re-ranked by aggregate score
Timeout per provider: 15 seconds
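
A minimal sketch of the fan-out described above, using only the standard library. The `adapters` mapping (provider name to a callable that takes the query) is a hypothetical stand-in for the PaperCorpus adapters, whose real interface is not shown in this spec. Returning an empty list for a failed or timed-out provider is likewise an assumption.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable, Dict, List

PROVIDER_TIMEOUT_S = 15  # per-provider timeout from this spec


def search_providers_parallel(
    query: str,
    adapters: Dict[str, Callable[[str], List[dict]]],
    timeout_s: float = PROVIDER_TIMEOUT_S,
) -> Dict[str, List[dict]]:
    """Run every adapter in its own thread; providers that fail or time out yield []."""
    results: Dict[str, List[dict]] = {}
    if not adapters:
        return results
    # One worker per provider so that all searches start at the same time.
    pool = ThreadPoolExecutor(max_workers=len(adapters))
    try:
        futures = {pool.submit(fn, query): name for name, fn in adapters.items()}
        # Because all searches start together, one shared deadline gives
        # each provider the same time budget.
        done, not_done = wait(futures, timeout=timeout_s)
        for future in done:
            try:
                results[futures[future]] = future.result()
            except Exception:
                results[futures[future]] = []  # one failing provider must not sink the search
        for future in not_done:
            results[futures[future]] = []  # timed out
    finally:
        # Return without blocking on stragglers (cancel_futures needs Python 3.9+).
        pool.shutdown(wait=False, cancel_futures=True)
    return results
```

A thread that is already running cannot be interrupted, so a timed-out provider call keeps running in the background until it returns; only its result is discarded.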

Deduplication Strategy

Primary: exact DOI match

Secondary: exact PMID match

Tertiary: title fuzzy match (normalize, then Jaccard similarity > 0.85)
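
The three tiers can be sketched as follows. This is one reading of the strategy, under the assumption that an identifier comparison is decisive whenever both papers carry that identifier, so the fuzzy title match is consulted only when the IDs cannot decide. The function names and the exact normalization (lowercase, punctuation removed, whitespace tokens) are illustrative choices.

```python
import re
from typing import List, Set

TITLE_SIMILARITY_THRESHOLD = 0.85  # from this spec


def _title_tokens(title: str) -> Set[str]:
    """Normalize a title: lowercase, drop punctuation, split into a token set."""
    return set(re.sub(r"[^a-z0-9\s]", " ", title.lower()).split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|, defined as 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_duplicate(p: dict, q: dict) -> bool:
    """Apply the tiers in order: DOI, then PMID, then fuzzy title."""
    if p.get("doi") and q.get("doi"):
        return p["doi"].lower() == q["doi"].lower()  # DOIs are case-insensitive
    if p.get("pmid") and q.get("pmid"):
        return str(p["pmid"]) == str(q["pmid"])
    similarity = jaccard(_title_tokens(p.get("title", "")), _title_tokens(q.get("title", "")))
    return similarity > TITLE_SIMILARITY_THRESHOLD


def deduplicate(papers: List[dict]) -> List[dict]:
    """Keep the first occurrence of each paper (quadratic, fine for small result sets)."""
    unique: List[dict] = []
    for paper in papers:
        if not any(is_duplicate(paper, kept) for kept in unique):
            unique.append(paper)
    return unique
```

This sketch simply drops later duplicates. A fuller version would presumably merge external_ids and provider_scores from the dropped record into the kept one.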

Knowledge Graph Auto-Citation

When cite_in_kg=True and results are returned, automatically:

For each paper with a DOI/PMID, upsert it into the papers table

Create citation edges in the KG for any papers already in the KG

This happens via kg_add_paper_citations() or equivalent
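
The upsert step can be sketched against an in-memory SQLite database. Everything about the storage layer here is an assumption for illustration: the `papers` schema, keying rows on citation_uri, and the function name `upsert_papers` are all hypothetical, and the real table and kg_add_paper_citations() signature are not shown in this spec.

```python
import sqlite3
from typing import Iterable, List

# Hypothetical schema for illustration only; the real papers table may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    citation_uri TEXT PRIMARY KEY,
    doi TEXT,
    pmid TEXT,
    title TEXT
)
"""


def upsert_papers(conn: sqlite3.Connection, papers: Iterable[dict]) -> List[str]:
    """Upsert papers that carry a DOI or PMID; return the URIs that already existed."""
    conn.execute(SCHEMA)
    already_known: List[str] = []
    for paper in papers:
        if not (paper.get("doi") or paper.get("pmid")):
            continue  # the spec only auto-cites papers with a DOI/PMID
        uri = paper["citation_uri"]
        exists = conn.execute(
            "SELECT 1 FROM papers WHERE citation_uri = ?", (uri,)
        ).fetchone()
        if exists:
            already_known.append(uri)
        # SQLite upsert syntax (requires SQLite 3.24 or newer).
        conn.execute(
            "INSERT INTO papers (citation_uri, doi, pmid, title) VALUES (?, ?, ?, ?) "
            "ON CONFLICT(citation_uri) DO UPDATE SET "
            "doi = excluded.doi, pmid = excluded.pmid, title = excluded.title",
            (uri, paper.get("doi"), paper.get("pmid"), paper.get("title")),
        )
    conn.commit()
    return already_known
```

The returned URIs identify papers that were already present, which are the candidates for new citation edges in the second step.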

Legacy pubmed_search Alias

Keep pubmed_search as a thin alias:

def pubmed_search(query, max_results=10):
    """Legacy alias — searches PubMed only."""
    return literature_search(query, max_results=max_results, providers="pubmed")

Implementation Plan

Step 1: Create spec (this file)

Step 2: Build `_normalize_paper()` helper

Normalize paper dicts from all providers to unified schema.
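
As an illustration of what normalization involves for one provider, the sketch below maps a record shaped like a Semantic Scholar Graph API paper object (paperId, externalIds, authors, venue, tldr, citationCount, influentialCitationCount) onto the unified schema. What the PaperCorpus adapters actually return may differ, so the input field names are assumptions and the function name `normalize_semantic_scholar` is hypothetical.

```python
from typing import Any, Dict


def normalize_semantic_scholar(raw: Dict[str, Any], include_abstract: bool = True) -> Dict[str, Any]:
    """Map one Semantic Scholar-style record onto the unified result schema."""
    ext = raw.get("externalIds") or {}
    doi = str(ext.get("DOI") or "")
    pmid = str(ext.get("PubMed") or "")  # may arrive as int or str
    paper_id = str(raw.get("paperId") or "")
    # The unified schema keeps only the top 5 authors.
    authors = [a.get("name", "") for a in (raw.get("authors") or [])][:5]
    return {
        "doi": doi,
        "pmid": pmid,
        "paper_id": paper_id,
        "title": raw.get("title") or "",
        "authors": authors,
        "year": raw.get("year"),
        "journal": raw.get("venue") or "",
        "abstract": (raw.get("abstract") or "") if include_abstract else "",
        "tldr": (raw.get("tldr") or {}).get("text") or "",
        "citation_count": raw.get("citationCount") or 0,
        "influential_citation_count": raw.get("influentialCitationCount") or 0,
        "external_ids": {"doi": doi, "pmid": pmid, "paper_id": paper_id},
    }
```

Missing or null fields collapse to empty strings or zero, so downstream code such as deduplication and URI building can rely on every key being present.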

Step 3: Build `_compute_relevance_score()` helper

Compute relevance score per provider based on title match quality and ranking.
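
One possible shape for this helper is a weighted blend of the provider's rank position and the overlap between query and title tokens. The spec does not define the formula, so the 50/50 weighting, the 0-based rank convention, and the function name `compute_relevance_score` are all assumptions.

```python
import re
from typing import Set


def _tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def compute_relevance_score(query: str, title: str, rank: int, total: int,
                            rank_weight: float = 0.5) -> float:
    """Blend provider rank position with query/title token overlap into [0, 1]."""
    query_tokens = _tokens(query)
    # Title match quality: fraction of query tokens that appear in the title.
    overlap = len(query_tokens & _tokens(title)) / len(query_tokens) if query_tokens else 0.0
    # rank is 0-based: the provider's first hit scores 1.0, its last hit 1/total.
    rank_score = 1.0 - rank / total if total > 0 else 0.0
    return round(rank_weight * rank_score + (1.0 - rank_weight) * overlap, 4)
```

Because every provider's score lands in the same [0, 1] range, the per-provider values can be stored side by side in provider_scores and compared when choosing best_provider.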

Step 4: Implement `literature_search()`

Main function using PaperCorpus + ThreadPoolExecutor for parallel search.

Step 5: Build `_upsert_paper_to_kg()` helper

Auto-cite papers in KG when cite_in_kg=True.

Step 6: Add `pubmed_search` legacy alias

Thin wrapper around literature_search(providers="pubmed").

Step 7: Register tool in forge_tools.py tool list

Add literature_search to the API tool registry (following existing pattern).

Acceptance Criteria

literature_search("TREM2 microglia", max_results=10) returns ≥1 result with all fields populated

Results include citation_uri with doi.org link (or PubMed/Semantic Scholar fallback)

Multiple providers searched in parallel (verify via logs)

pubmed_search still works and returns the same format as before (backwards compatible)

Deduplication removes duplicate DOIs/PMIDs across providers

provider_scores shows per-provider relevance

KG auto-citation fires when cite_in_kg=True

Tool registered in forge_tools.py with category literature_search

Files to Modify

tools.py — add literature_search(), _normalize_paper(), _compute_relevance_score(), _upsert_paper_to_kg(), update pubmed_search alias
forge_tools.py — add literature_search to tool registry (if not already present as paper_corpus_search)

DO NOT Modify

api.py (critical file per task instructions)
migrations/, *.sql
PostgreSQL

Tasks using this spec (1)

[Forge] Unified literature_search tool — multi-provider with

closed P88

File: ed0d0fe6_423_spec.md

Modified: 2026-05-01 20:13

Size: 5.5 KB
