Skills + artifacts reorganization

Why

Two pieces of external content in SciDEX are currently managed by ad-hoc copies instead of tracked git submodules. This spec cleans both up and lays groundwork for session-scoped workspaces.

1. forge/skills/ is a manually-copied mirror

  • 134 K-Dense SKILL.md bundles were copied on 2026-04-16 from K-Dense-AI/scientific-agent-skills (commit 304dc005cf on the import branch)
  • Content is byte-for-byte identical to upstream — zero local drift, zero SciDEX-unique skills, zero upstream-only skills we're missing
  • The registration script that ingested them (register_kdense_skills.py) lives in an abandoned worktree — never merged to main
  • forge/skills/ is not referenced by any live code path. The canonical skill loader (scidex/forge/skills_canonical.py) reads from skills/ (23 SciDEX-native API wrappers), not forge/skills/
  • .claude/skills/ already symlinks the 23 native skills + 9 founding personas for Claude Code discovery — same symlink pattern should extend to K-Dense

2. /data/scidex-artifacts is already a separate repo

Despite ADR-001 (2026-04-12) deciding to keep artifacts in the main repo, SciDEX-AI/SciDEX-Artifacts was created on 2026-04-18 and is now 1.2 GB / 9,995 files:

  • 8,579 figures (Git LFS)
  • 934 notebooks
  • 457 analyses
  • 23 datasets

But the main repo still holds duplicates at site/figures/ (8,569), site/notebooks/, and datasets/. The API in api.py serves from the main-repo paths; dual-path fallback code exists but is never hit in prod.
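
To illustrate why a fallback of this kind never fires while the duplicates remain, here is a minimal sketch of an ordered-roots lookup. It is not the code in api.py: the function name resolve_artifact and the directory layout in the usage note are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional, Sequence


def resolve_artifact(rel_path: str, roots: Sequence[Path]) -> Optional[Path]:
    """Return the first existing file for rel_path, trying roots in order.

    Hypothetical helper: the first root that contains the file wins, so a
    later root is only consulted when every earlier root lacks the file.
    """
    for root in roots:
        candidate = root / rel_path
        if candidate.is_file():
            return candidate
    return None
```

With roots ordered as [main-repo path, artifacts checkout], every file that still has a main-repo duplicate resolves from the first root, which would be consistent with the fallback never being exercised in prod.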

ADR-001's "revisit when volume exceeds 1 GB" trigger has fired. Reality has already forked; docs haven't caught up.

3. Session workspaces as the forward direction

K-Dense's share model (e.g. session_20260403_101458_7f548077d599) bundles an analysis's inputs, outputs, and code into a single session directory with a manifest. SciDEX has no equivalent; artifacts float loose. Adding a tiny session layer now keeps future work cheap.

What we do

Phase A — Skills submodule (low risk)

  • git submodule add https://github.com/K-Dense-AI/scientific-agent-skills.git vendor/kdense-skills (pinned to current head)
  • Delete forge/skills/* from main (content is now in submodule)
  • Replace with symlinks: forge/skills/<slug> → ../../vendor/kdense-skills/scientific-skills/<slug> (keeps the forge/skills/ namespace that existing specs reference)
  • Add .claude/skills/<slug> → ../../vendor/kdense-skills/scientific-skills/<slug> (one symlink per skill) so Claude Code inside the SciDEX checkout auto-discovers all 134 K-Dense skills in addition to the existing 23 + 9
  • Port the orphan register_kdense_skills.py into scripts/register_kdense_skills.py on main; point it at vendor/kdense-skills/scientific-skills/
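
The per-skill symlink step could be sketched roughly as follows. This is an illustration, not the planned script: link_skills is a hypothetical name, and the check for a SKILL.md file inside each bundle directory is an assumption about the upstream layout.

```python
import os
from pathlib import Path


def link_skills(source_dir: Path, link_dir: Path) -> list[str]:
    """Create one relative symlink in link_dir per skill bundle in source_dir.

    Returns the slugs that were newly linked. Entries that already exist are
    left untouched, so a second run creates nothing.
    """
    link_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for skill in sorted(source_dir.iterdir()):
        # Assumption: a skill bundle is a directory containing SKILL.md.
        if not (skill / "SKILL.md").is_file():
            continue
        link = link_dir / skill.name
        if link.is_symlink() or link.exists():
            continue
        # Relative target, so the link survives moving the checkout.
        target = os.path.relpath(skill, start=link_dir)
        link.symlink_to(target, target_is_directory=True)
        created.append(skill.name)
    return created
```

Called with source_dir set to vendor/kdense-skills/scientific-skills and link_dir set to forge/skills (or .claude/skills), the computed target has the ../../vendor/kdense-skills/scientific-skills/<slug> form described above.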

Phase B — Artifacts submodule

  • git submodule add https://github.com/SciDEX-AI/SciDEX-Artifacts.git data/scidex-artifacts (not vendor/ since it's first-class content, not vendored code; path matches the upstream repo name)
  • ln -s data/scidex-artifacts /home/ubuntu/scidex/.../mount (skipped — host /data/scidex-artifacts already exists; document that it's the same checkout once submodule is initialized)
  • Add new ADR 002-artifacts-separate-repo.md that supersedes ADR-001 with the factual history (why it split despite the prior decision)
  • Do NOT delete site/figures/, site/notebooks/, datasets/ from main yet. Surface the duplication for human review in a follow-up PR — touching ~10K LFS+tracked files is out of scope for this spec

Phase C — Install script

scripts/install_skills.sh:

  • git submodule update --init --recursive vendor/kdense-skills vendor/mimeo vendor/mimeographs
  • Create .claude/skills/ symlinks for all 3 sources (personas, native skills, K-Dense skills) if missing
  • Optional --register-db flag: runs python -m scidex.forge.skills_canonical (native) and python scripts/register_kdense_skills.py (K-Dense) to populate the DB
  • Idempotent

Phase D — Session skeleton

Add to scidex-artifacts repo:

  • sessions/README.md — describes the session directory convention
  • sessions/.gitkeep
  • sessions/_schema/session.schema.json — JSON Schema for the session manifest

In main SciDEX repo:

  • scidex/atlas/sessions.py: create_session(title, owner) -> session_id, load_session(id) -> dict, list_sessions(limit), link_artifact(session_id, artifact_path, role). Thin wrapper over the filesystem under data/scidex-artifacts/sessions/<id>/

Manifest shape:

    {
      "id": "session_<utc-iso-compact>_<8-hex>",
      "title": "...",
      "owner": "...",
      "created_at": "...",
      "artifacts": [
        {"path": "figures/foo.png", "role": "output", "sha": "..."},
        {"path": "notebooks/bar.html", "role": "code"}
      ],
      "parent_session": null,
      "notes_md": "..."
    }
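
As a rough illustration of the constraints this shape implies, the following stdlib-only check could stand in until session.schema.json exists. Several details are assumptions rather than part of the spec: the timestamp layout YYYYMMDDTHHMMSSZ for <utc-iso-compact>, the set of required keys, and the allowed role values ("input" is inferred from the inputs/outputs/code wording earlier).

```python
import re

# Assumption: <utc-iso-compact> is rendered as YYYYMMDDTHHMMSSZ.
SESSION_ID_RE = re.compile(r"^session_\d{8}T\d{6}Z_[0-9a-f]{8}$")
# Assumption: these keys are mandatory; parent_session, notes_md, sha are optional.
REQUIRED_KEYS = {"id", "title", "owner", "created_at", "artifacts"}
# Assumption: only these roles are valid.
ALLOWED_ROLES = {"input", "output", "code"}


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - manifest.keys())]
    if "id" in manifest and not SESSION_ID_RE.match(str(manifest["id"])):
        errors.append("id does not match session_<utc-iso-compact>_<8-hex>")
    artifacts = manifest.get("artifacts", [])
    if not isinstance(artifacts, list):
        errors.append("artifacts must be a list")
        artifacts = []
    for index, artifact in enumerate(artifacts):
        if not isinstance(artifact, dict) or "path" not in artifact:
            errors.append(f"artifacts[{index}]: missing path")
            continue
        role = artifact.get("role")
        if role not in ALLOWED_ROLES:
            errors.append(f"artifacts[{index}]: unknown role {role!r}")
    return errors
```

Returning a list of messages instead of raising on the first problem lets a caller report every issue in a manifest at once.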

Phase D is intentionally minimal — filesystem-backed, no DB, no UI. Subsequent work can wire it into analyses, wiki pages, and the share link surface.
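
A filesystem-only version of the four functions might look like the sketch below. It is a starting point under stated assumptions, not the final module: the manifest file name manifest.json, the extra root parameter (added so the functions can be pointed at a temporary directory), and the timestamp layout in the id are all assumptions, and the optional sha field is left out.

```python
import json
import secrets
from datetime import datetime, timezone
from pathlib import Path

# Assumption: the artifacts submodule is checked out at this repo-relative path.
SESSIONS_ROOT = Path("data/scidex-artifacts/sessions")
MANIFEST_NAME = "manifest.json"  # assumption: the spec does not name the file


def _manifest_path(session_id: str, root: Path) -> Path:
    return root / session_id / MANIFEST_NAME


def _write(session_id: str, manifest: dict, root: Path) -> None:
    _manifest_path(session_id, root).write_text(json.dumps(manifest, indent=2) + "\n")


def create_session(title: str, owner: str, root: Path = SESSIONS_ROOT) -> str:
    now = datetime.now(timezone.utc)
    # 4 random bytes render as the 8 hex characters the id format calls for.
    session_id = f"session_{now:%Y%m%dT%H%M%SZ}_{secrets.token_hex(4)}"
    (root / session_id).mkdir(parents=True)  # raises if the id already exists
    _write(session_id, {
        "id": session_id,
        "title": title,
        "owner": owner,
        "created_at": now.isoformat(),
        "artifacts": [],
        "parent_session": None,
        "notes_md": "",
    }, root)
    return session_id


def load_session(session_id: str, root: Path = SESSIONS_ROOT) -> dict:
    return json.loads(_manifest_path(session_id, root).read_text())


def list_sessions(limit: int = 20, root: Path = SESSIONS_ROOT) -> list[str]:
    if not root.is_dir():
        return []
    # Ids start with a UTC timestamp, so reverse lexical order is newest first.
    ids = sorted((p.name for p in root.iterdir()
                  if p.is_dir() and p.name.startswith("session_")), reverse=True)
    return ids[:limit]


def link_artifact(session_id: str, artifact_path: str, role: str,
                  root: Path = SESSIONS_ROOT) -> None:
    manifest = load_session(session_id, root)
    manifest["artifacts"].append({"path": artifact_path, "role": role})
    _write(session_id, manifest, root)
```

Because the directory name carries the timestamp, list_sessions needs no index file; that keeps the layer DB-free at the cost of a directory scan per call.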

What we do NOT do here

  • Delete main-repo duplicates of artifacts (needs its own PR with LFS-aware history rewrite review)
  • Wire K-Dense skills into the skills DB table (they use a different metadata schema; keeping them FS-only for Claude Code discovery is enough for v1)
  • Build a session UI or share-link endpoint
  • Lift the tool-growth freeze

Rollback

Each phase rolls back via git submodule deinit -f <path> && git rm -f <path> && rm -rf .git/modules/<path> (git rm also removes the submodule's entry from .gitmodules), followed by restoring the deleted directories with git checkout HEAD~N -- <path>.

Verification

  • git submodule status shows both new submodules pinned
  • ls .claude/skills/ | wc -l returns 166 (9 personas + 23 native + 134 K-Dense)
  • python -m scidex.forge.skills_canonical --dry-run still reports 23 native skills discovered (unchanged)
  • python scripts/register_kdense_skills.py --dry-run reports 134 K-Dense skills
  • ls -la data/scidex-artifacts/figures/ returns ~8.5K files (via submodule)
  • CI / pre-push hook passes
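
One caveat on the symlink count: ls | wc -l also counts links whose targets are missing, for example when a submodule has not been initialized. A small check along these lines could complement it; audit_skill_links is a hypothetical helper, not an existing script.

```python
from pathlib import Path

# Expected totals taken from the verification list: 9 + 23 + 134 = 166.
EXPECTED = {"personas": 9, "native": 23, "kdense": 134}


def audit_skill_links(skills_dir: Path) -> tuple[int, list[str]]:
    """Return (visible entry count, names of symlinks whose targets are missing).

    Dotfiles are skipped so the count matches what plain `ls` would show.
    """
    entries = sorted(p for p in skills_dir.iterdir() if not p.name.startswith("."))
    # For a symlink, exists() follows the link, so it is False when dangling.
    broken = [p.name for p in entries if p.is_symlink() and not p.exists()]
    return len(entries), broken
```

Running audit_skill_links(Path(".claude/skills")) from the repo root and comparing the count against sum(EXPECTED.values()) gives the same 166 check, with the added signal that the broken list should be empty.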

Follow-ups (not in this PR)

  • Plan + execute main-repo artifact duplicate removal (site/figures/, site/notebooks/, datasets/ → either rm or git-filter-repo)
  • Decide whether K-Dense skills should earn DB rows (currently only filesystem-discoverable)
  • Session manifests → analyses link, wiki crossref, /api/sessions/{id} route
  • Monthly submodule refresh cron: git submodule update --remote vendor/kdense-skills + diff review

File: skills_artifacts_reorg_spec.md
Modified: 2026-05-01 20:13
Size: 6.5 KB