Even hypotheses that survived an adversarial debate
(q-rt-adversarial-debate-runner) at time T may have been falsified
by post-T literature. PubMed grows by ~3500 articles/day; a hypothesis
proposed in March can have its mechanism contradicted in April without
SciDEX noticing. This task ships a recurring Falsifier-of-Truth
runner that re-probes the top-Elo hypotheses against new literature
since their last falsification check, and downgrades any whose
support has decayed. It is the temporal complement of the adversarial
debate runner: that one tests reasoning quality, this one tests
literature freshness.
Effort: deep
scidex/agora/falsifier_of_truth.py:
- select_targets(top_n=50, min_elo=1600, min_days_since_last_check=14) -> list[hypothesis_id].
- run(hypothesis_id) -> FalsifierReport orchestrates:
  - last_falsifier_check_at (default: hypothesis created_at) using the existing paper-corpus-search scidex/agora/pubmed_utils.py.
  - scidex/senate/falsifier-related code; grep Falsifier in scidex/agents/).
  - contradicts, weakens, supports, unrelated from the trailing window.
  - falsification_score = (contradicts + 0.5*weakens) / total_relevant.
  - falsification_score > 0.30 AND ≥3 contradicts: flag lifecycle='under_review', dock
migrations/20260428_falsifier_of_truth.sql:
ALTER TABLE hypotheses ADD COLUMN IF NOT EXISTS
    last_falsifier_check_at TIMESTAMPTZ;
ALTER TABLE hypotheses ADD COLUMN IF NOT EXISTS
    falsification_score DOUBLE PRECISION;
CREATE TABLE falsifier_of_truth_run (
    id BIGSERIAL PRIMARY KEY,
    hypothesis_id TEXT NOT NULL,
    ran_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    new_pmids TEXT[] NOT NULL,
    contradicts INT NOT NULL DEFAULT 0,
    weakens INT NOT NULL DEFAULT 0,
    supports INT NOT NULL DEFAULT 0,
    unrelated INT NOT NULL DEFAULT 0,
    falsification_score DOUBLE PRECISION,
    verdict TEXT NOT NULL CHECK (verdict IN
        ('survives','weakened','falsified')),
    comment_id TEXT
);
CREATE INDEX idx_fot_h ON falsifier_of_truth_run(hypothesis_id);
- Elo desc, last_falsifier_check_at asc nulls first. Lands as
- scidex/agora/crosslink_emitter.py style: a structured comment_type_labels=['refutation'] with the new
- q-perc-refutation-debate-emitter (closing the loop).
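The selection parameters and the ordering mentioned above (Elo descending, then last_falsifier_check_at ascending with nulls first) can be sketched in plain Python. This is a minimal illustration, not the shipped select_targets: the Candidate record, the function name select_targets_sketch, the explicit candidates and now arguments, the inclusive min_elo comparison, and the choice to treat never-checked hypotheses as always eligible are all assumptions made so the example runs without a database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Placeholder used only to keep the sort key well-typed for never-checked rows.
_EPOCH_MIN = datetime.min.replace(tzinfo=timezone.utc)


@dataclass(frozen=True)
class Candidate:
    """Hypothetical in-memory stand-in for a leaderboard row."""
    hypothesis_id: str
    elo: float
    last_falsifier_check_at: Optional[datetime]  # None means never checked


def select_targets_sketch(candidates, now, top_n=50, min_elo=1600,
                          min_days_since_last_check=14):
    """Return up to top_n hypothesis ids that are due for a re-probe."""
    cutoff = now - timedelta(days=min_days_since_last_check)
    eligible = [
        c for c in candidates
        if c.elo >= min_elo  # assumption: threshold is inclusive
        and (c.last_falsifier_check_at is None
             or c.last_falsifier_check_at <= cutoff)
    ]
    # Elo descending, then never-checked first, then oldest check first.
    eligible.sort(key=lambda c: (
        -c.elo,
        c.last_falsifier_check_at is not None,
        c.last_falsifier_check_at or _EPOCH_MIN,
    ))
    return [c.hypothesis_id for c in eligible[:top_n]]
```

The boolean in the sort key is what emulates "nulls first": False sorts before True, so rows with no previous check come ahead of checked rows at the same Elo.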
- (hypothesis_id, pmid) is never reported
- falsifier_seen_pmid(hypothesis_id, pmid) row written each
- q-rt-citation-honeypot
- survives/weakened/falsified and a
- tests/test_falsifier_of_truth.py: target selection,
- comment_type_labels=['refutation'] path which already wires
- spawned_debate_id).
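The seen-PMID and honeypot rules above amount to a filter applied before any paper is scored. The sketch below models that filter with in-memory sets; it is an illustration only. The function name filter_new_pmids is hypothetical, the seen set stands in for the falsifier_seen_pmid table, and the honeypots set stands in for whatever q-rt-citation-honeypot quarantines.

```python
from typing import Iterable, List, Set, Tuple


def filter_new_pmids(hypothesis_id: str,
                     candidate_pmids: Iterable[str],
                     seen: Set[Tuple[str, str]],
                     honeypots: Set[str]) -> List[str]:
    """Return PMIDs that are neither quarantined nor already seen for this
    hypothesis, recording each returned pair in `seen` as a side effect."""
    fresh: List[str] = []
    for pmid in candidate_pmids:
        key = (hypothesis_id, pmid)
        if pmid in honeypots:
            continue  # quarantined citation: skipped and not recorded
        if key in seen:
            continue  # this pair was already handled in an earlier run
        seen.add(key)  # also drops duplicates within the same batch
        fresh.append(pmid)
    return fresh
```

Keying on the (hypothesis_id, pmid) pair rather than on the PMID alone means one paper can still be evaluated against several hypotheses, while each hypothesis sees it at most once.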
- scidex/agora/pubmed_utils.py: PubMed query helper.
- q-rt-citation-honeypot: honeypot quarantine respected.
- q-perc-refutation-debate-emitter: receives the falsifier
- q-trust-provenance-integrity-scanner: uses falsification deltas

Delivered:
- migrations/20260428_falsifier_of_truth.sql: adds lifecycle, last_falsifier_check_at, falsification_score to hypotheses; creates falsifier_of_truth_run and falsifier_seen_pmid tables with indexes. Migration applied to the live DB.
- scidex/agora/falsifier_of_truth.py (full implementation):
  - select_targets(top_n=50, min_elo=1600, min_days_since_last_check=14): queries elo_ratings.leaderboard, filters deprecated/no-score/recently-checked hypotheses.
  - run(hypothesis_id) -> FalsifierReport: the end-to-end pipeline. PubMed search with date window (search_pubmed with mindate), honeypot guard, seen-PMID dedup, LLM scoring per paper (contradicts/weakens/supports/unrelated), falsification_score = (contradicts + 0.5*weakens) / total_relevant, verdict decision (survives/weakened/falsified). Falsified → lifecycle='under_review', composite_score docked 15%, Elo penalty via elo_ratings.record_match, refutation comment posted to artifact_comments with comment_type_labels=['refutation'].
  - run_nightly(max_hypotheses=10): nightly batch driver with error isolation per hypothesis.
  - get_dashboard_stats(days=30): Senate tile aggregation (survives/weakened/falsified counts + most-falsified list).
  - _is_honeypot_pmid queries citation_honeypot table (no-op until that migration lands).
- tests/agora/test_falsifier_of_truth.py: 28 passing tests covering score arithmetic edge cases, verdict threshold logic, search query building, select_targets filtering, run() happy/falsified/weakened/error paths, nightly batch cap, dedup exclusion, honeypot exclusion.
- lifecycle column added to hypotheses (also needed by adversarial_debate.py which already references it).
- hypothesis-{hypothesis_id} (matches existing comments in artifact_comments).
- run_nightly() is callable directly; Orchestra cron setup is a separate operational step.
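
For reference, the score arithmetic and the falsified rule stated above (falsification_score > 0.30 AND at least 3 contradicting papers) can be written out as a small standalone sketch. Several details are assumptions rather than facts from the spec: total_relevant is taken to be contradicts + weakens + supports (unrelated papers excluded), an empty window scores 0.0, and the weakened_threshold of 0.15 is a hypothetical placeholder because no cut-off for "weakened" is given here. The Tally class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tally:
    """Per-run counts of how new papers were classified."""
    contradicts: int = 0
    weakens: int = 0
    supports: int = 0
    unrelated: int = 0


def falsification_score(t: Tally) -> float:
    # Assumption: unrelated papers do not count toward total_relevant.
    total_relevant = t.contradicts + t.weakens + t.supports
    if total_relevant == 0:
        return 0.0  # assumption: no relevant evidence means no decay
    return (t.contradicts + 0.5 * t.weakens) / total_relevant


def verdict(t: Tally,
            falsified_threshold: float = 0.30,
            min_contradicts: int = 3,
            weakened_threshold: float = 0.15) -> str:
    """Map a tally to 'survives', 'weakened' or 'falsified'.

    weakened_threshold is a hypothetical value chosen for illustration.
    """
    score = falsification_score(t)
    if score > falsified_threshold and t.contradicts >= min_contradicts:
        return "falsified"
    if score > weakened_threshold:
        return "weakened"
    return "survives"
```

Requiring both conditions for "falsified" keeps a single contradicting paper in a tiny window from flipping a hypothesis: two contradictions out of four relevant papers score 0.5 yet only reach "weakened" under these assumptions.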