Spec: Unified literature_search Tool
Task ID
ed0d0fe6-423b-45c3-8a5e-5415267fb5bb
Overview
Create a new unified literature_search tool for CLAUDE_TOOLS that replaces pubmed_search as the primary paper discovery interface. The tool searches multiple providers in parallel, deduplicates the hits, and returns normalized results with rich metadata and stable citation URIs.
Key Background
PaperCorpus class already exists in tools.py (line 898) with:
- Multi-provider adapters: pubmed, semantic_scholar, openalex, crossref, paperclip
- Basic deduplication by external_ids
- Local SQLite caching via _upsert_cached_paper
The task is to build a HIGHER-LEVEL unified tool on top of this infrastructure that adds:
- Parallel multi-provider search with relevance scoring
- Strong citation format with stable URIs
- Knowledge graph auto-citation when papers are referenced in debates
- Legacy pubmed_search alias
Design
Signature
@log_tool_call
def literature_search(
    query: str,
    max_results: int = 10,
    providers: str = "",  # comma-separated, empty = all
    date_from: str = "",  # YYYY-MM-DD
    date_to: str = "",  # YYYY-MM-DD
    journal: str = "",
    author: str = "",
    include_abstract: bool = True,
    cite_in_kg: bool = True,  # auto-cite in KG when referenced
) -> dict:
Returns
{
    "query": str,
    "total_count": int,
    "results": [
        {
            # Core IDs
            "doi": str,
            "pmid": str,
            "paper_id": str,  # Semantic Scholar ID
            # Metadata
            "title": str,
            "authors": [str],  # top 5
            "year": int,
            "journal": str,
            "abstract": str,
            "tldr": str,
            # Metrics
            "citation_count": int,
            "influential_citation_count": int,
            # Stable citation URI (strong citation format)
            "citation_uri": "https://doi.org/10.xxxx/xxxxx",  # doi.org preferred
            # Per-provider relevance scores
            "provider_scores": {
                "pubmed": 0.95,
                "semantic_scholar": 0.88,
                ...
            },
            # External IDs from all providers
            "external_ids": {
                "doi": str,
                "pmid": str,
                "paper_id": str,
                "openalex": str,
                "crossref": str,
            },
            # Source provider that matched best
            "best_provider": str,
            "best_score": float,
        }
    ],
    "providers_searched": [str],
    "search_time_ms": int,
}
Strong Citation Format
Each paper reference includes a stable URI in this priority order:
1. https://doi.org/{doi} (preferred; persistent, resolver-based)
2. https://pubmed.ncbi.nlm.nih.gov/{pmid} (fallback if no DOI)
3. https://www.semanticscholar.org/paper/{paper_id} (final fallback)
Provider Parallel Search
- All configured providers run in parallel threads (ThreadPoolExecutor)
- Each provider returns a relevance score based on its own provider-specific ranking
- Results are merged, deduplicated, and re-ranked by aggregate score
- Timeout per provider: 15 seconds
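The fan-out described above could look roughly like the following. This is a minimal sketch, not the PaperCorpus implementation: the function name search_providers_parallel and the adapters mapping (provider name to a callable taking the query) are assumptions made for illustration. Because every provider starts at the same time, a single shared deadline gives each one the same 15-second budget. Note that cancel_futures requires Python 3.9+, and that a thread already running cannot be killed; the sketch only stops waiting for it.

```python
from concurrent.futures import ThreadPoolExecutor, wait

PROVIDER_TIMEOUT_S = 15  # per-provider budget taken from the spec


def search_providers_parallel(query, adapters, timeout_s=PROVIDER_TIMEOUT_S):
    """Run each adapter in its own thread; record failures and timeouts.

    `adapters` is a hypothetical mapping of provider name -> callable(query)
    that returns a list of paper dicts.
    """
    executor = ThreadPoolExecutor(max_workers=max(1, len(adapters)))
    futures = {executor.submit(fn, query): name for name, fn in adapters.items()}

    # One worker per provider means all searches start immediately, so a
    # shared deadline is equivalent to a per-provider timeout.
    done, not_done = wait(futures, timeout=timeout_s)

    # Return without blocking on stragglers (Python 3.9+ for cancel_futures).
    executor.shutdown(wait=False, cancel_futures=True)

    results, errors = {}, {}
    for future in done:
        name = futures[future]
        try:
            results[name] = future.result()
        except Exception as exc:  # one failing provider must not sink the search
            errors[name] = repr(exc)
    for future in not_done:
        errors[futures[future]] = "timeout"
    return results, errors
```

Returning the errors alongside the results makes it straightforward to fill providers_searched with only the providers that actually answered.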
Deduplication Strategy
- Primary: exact DOI match
- Secondary: exact PMID match
- Tertiary: fuzzy title match (normalize titles, then Jaccard similarity > 0.85)
Knowledge Graph Auto-Citation
When cite_in_kg=True and results are returned, the tool automatically:
- Upserts each paper that has a DOI or PMID into the papers table
- Creates citation edges in the KG for any papers already in the KG
This happens via kg_add_paper_citations() or an equivalent helper.
Legacy pubmed_search Alias
Keep pubmed_search as a thin alias:
def pubmed_search(query, max_results=10):
    """Legacy alias: searches PubMed only."""
    return literature_search(query, max_results=max_results, providers="pubmed")
Implementation Plan
Step 1: Create spec (this file)
Step 2: Build _normalize_paper() helper
Normalize paper dicts from all providers to unified schema.
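One part of this normalization is choosing citation_uri. The priority order from the Strong Citation Format section might be sketched as below. The helper name build_citation_uri is hypothetical, and the sketch assumes IDs arrive as plain strings (or None), possibly with a DOI resolver prefix already attached.

```python
def build_citation_uri(doi="", pmid="", paper_id=""):
    """Return the most stable URI available: DOI, then PMID, then S2 paper ID."""
    doi = (doi or "").strip()
    # Assumption: some providers return DOIs with a resolver prefix; strip it
    # so the result is not double-prefixed.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if doi:
        return f"https://doi.org/{doi}"

    pmid = str(pmid or "").strip()
    if pmid:
        return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}"

    paper_id = (paper_id or "").strip()
    if paper_id:
        return f"https://www.semanticscholar.org/paper/{paper_id}"

    return ""  # no stable identifier available
```

Returning an empty string when no identifier exists keeps the field type consistent with the str fields in the return schema; whether such papers should be dropped instead is left open by the spec.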
Step 3: Build _compute_relevance_score() helper
Compute relevance score per provider based on title match quality and ranking.
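The spec does not fix a formula, so the following is only one possible reading of "title match quality and ranking": blend the fraction of query tokens found in the title with the paper's position in the provider's own result list. The function name, the 0-based rank convention, and the equal 0.5 weighting are all assumptions chosen for illustration.

```python
import re


def _tokens(text):
    """Lowercase alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", (text or "").lower()))


def compute_relevance_score(query, title, rank, total, title_weight=0.5):
    """Blend query/title token overlap with the provider's rank position.

    rank is the 0-based position in the provider's result list; total is the
    number of results that provider returned. Both weights are assumptions.
    """
    query_tokens = _tokens(query)
    if query_tokens:
        overlap = len(query_tokens & _tokens(title)) / len(query_tokens)
    else:
        overlap = 0.0
    # First result scores 1.0, decreasing linearly toward the end of the list.
    rank_score = 1.0 - rank / total if total > 0 else 0.0
    return round(title_weight * overlap + (1.0 - title_weight) * rank_score, 4)
```

Keeping the score in the range 0 to 1 makes values from different providers comparable enough to store side by side in provider_scores.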
Step 4: Implement literature_search()
Main function using PaperCorpus + ThreadPoolExecutor for parallel search.
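Within this step, the three-tier rule from the Deduplication Strategy section could be sketched as follows. The helper names are hypothetical, title normalization is assumed to mean lowercasing and keeping alphanumeric tokens, and the sketch keeps the first occurrence rather than merging IDs from later duplicates, which a fuller implementation would probably want.

```python
import re

TITLE_SIMILARITY_THRESHOLD = 0.85  # Jaccard cutoff from the spec


def _title_tokens(title):
    """Normalize a title to a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", (title or "").lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def deduplicate(papers):
    """Drop papers matching an earlier one by DOI, PMID, or similar title."""
    seen_dois, seen_pmids = set(), set()
    kept, kept_tokens = [], []
    for paper in papers:
        doi = (paper.get("doi") or "").strip().lower()  # DOIs are case-insensitive
        pmid = str(paper.get("pmid") or "").strip()
        tokens = _title_tokens(paper.get("title"))
        is_duplicate = (
            (doi and doi in seen_dois)
            or (pmid and pmid in seen_pmids)
            or any(jaccard(tokens, t) > TITLE_SIMILARITY_THRESHOLD for t in kept_tokens)
        )
        if is_duplicate:
            continue
        if doi:
            seen_dois.add(doi)
        if pmid:
            seen_pmids.add(pmid)
        kept.append(paper)
        kept_tokens.append(tokens)
    return kept
```

The title comparison is quadratic in the number of kept papers, which should be acceptable for result sets on the order of max_results per provider.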
Step 5: Build _upsert_paper_to_kg() helper
Auto-cite papers in KG when cite_in_kg=True.
Step 6: Add pubmed_search legacy alias
Thin wrapper around literature_search(providers="pubmed").
Step 7: Register tool in forge_tools.py tool list
Add literature_search to the API tool registry (following the existing pattern).
Acceptance Criteria
- literature_search("TREM2 microglia", max_results=10) returns ≥1 result with all fields populated
- Results include citation_uri with a doi.org link (or the PubMed/Semantic Scholar fallback)
- Multiple providers are searched in parallel (verify via logs)
- pubmed_search still works and returns the same format as before (backwards compatible)
- Deduplication removes duplicate DOIs/PMIDs across providers
- provider_scores shows per-provider relevance
- KG auto-citation fires when cite_in_kg=True
- Tool is registered in forge_tools.py with category literature_search
Files to Modify
- tools.py: add literature_search(), _normalize_paper(), _compute_relevance_score(), _upsert_paper_to_kg(), and update the pubmed_search alias
- forge_tools.py: add literature_search to the tool registry (if not already present as paper_corpus_search)
DO NOT Modify
api.py (critical file per task instructions)
migrations/, .sql
PostgreSQL