[Atlas] Consolidate duplicate wiki_page artifact rows

← All Specs

[Atlas] Consolidate duplicate wiki_page artifact rows

Problem

artifacts has two rows for nearly every wiki_page slug:

  • wiki-<slug> (canonical — receives 2.65M+ artifact_links references)
  • <slug> (orphan — created by a v1_polymorphic_backfill script)

State:
  • 17,470 slugs with BOTH rows (paired duplicates)
  • 1,179 slugs with only wiki-<slug> (some are dangling — no wiki_pages row)
  • 322 slugs with only <slug> (needs canonicalisation)
  • 18,971 distinct slugs total

The duplication causes the artifacts gallery to render mixed-form links, splits comment threads, and inflates artifact counts. Phase 1 (PR #1405) routed URLs correctly; this phase removes the orphan rows.

Audit-backed loss surface

Across 23 metadata columns on 17,470 paired rows, only title differs (466 pairs). All other fields on the unprefixed side are default (lifecycle='active', quality_score=0.5, all counters=0, all FKs NULL).

466 title conflicts decompose into 5 deterministic buckets + 1 special case:

  • 322 prefix-extends: take longer
  • 119 pref-is-slug-take-unp: take unprefixed
  • 19 unp-much-longer: take unprefixed
  • 5 case-only-diff: take uppercase
  • 1 semantic-disagree (wiki-proteins-apoe: ApoE (Apolipoprotein E) vs APOE - Apolipoprotein E): hardcoded — take APOE form

FK reference fan-out (rows pointing to an unprefixed wiki_page id):
  • artifact_links: 4 src + 46 tgt
  • chamber_artifacts: 35
  • artifact_comments: 2
  • artifact_syntheses: 1
  • ap_publishes: 1
  • usage_cache: 1
  • All other 18 FK columns: 0
  • Total: ~90 rows to retarget
wiki_pages.content_md is the source of truth for wiki content and is untouched.

Execution

scripts/oneoff/2026-05-13-consolidate-wiki-page-artifacts.py runs in phases:

  • audit — emit JSONL classification of every pair and every orphan
  • snapshot — create _archive_*_2026_05_13 tables for full rollback
  • merge — 17,470 paired rows: merge title + metadata, retarget FKs, delete unprefixed
  • rename-singletons — 322 unprefixed-only rows: UPDATE id='wiki-'||id
  • deprecate-dangling — 1,179 prefixed-only rows with no wiki_pages.slug: set lifecycle_state='deprecated', deprecated_reason='wiki_page_missing'
  • constraint — add unique index preventing future duplication
  • Idempotent. Each phase can be run independently. --dry-run prints SQL without executing.

    Acceptance

    • Zero rows with artifact_type='wiki_page' and id NOT LIKE 'wiki-%' (after phases 3-4)
    • All FK references retargeted (re-running FK audit finds zero unprefixed refs)
    • Unique index artifacts_wiki_page_one_per_slug exists and rejects future duplicates
    • /artifacts?artifact_type=wiki_page count drops by ~17,792
    • Every previously paired /wiki/<slug> and /artifact/wiki-<slug> still HTTP 200

    Rollback

    INSERT INTO artifacts SELECT * FROM _archive_artifacts_unprefixed_wiki_2026_05_13;
    -- repeat for 6 archived FK tables
    DROP INDEX artifacts_wiki_page_one_per_slug;

    File: consolidate-wiki-page-duplicates-2026-05-13_spec.md
    Modified: 2026-05-18 04:17
    Size: 3.0 KB