Root cause of #13 (PUT 500s on first ship, retries return already-present):
my earlier prune-tool session ran as root and rewrote the live index via
os.replace(), so the index took on the temp file's ownership/mode instead
of the original's. The new file was root:root and the cis490 service user
couldn't append to it. Every fresh
PUT 500'd on _append_index after the tarball had already landed via
os.replace, so retries always saw "already-present" and never recovered
the missing index row.
Two fixes:
- tools/prune_episodes.py: snapshot the index's stat before the rename
and restore uid/gid/mode after. The chown is best-effort, so non-root
prune runs (where chown would fail with EPERM) still succeed; non-root
callers already match the original owner anyway.
- tools/index_backfill.py: new tool. Walks episodes/<host>/*.tar.zst,
computes sha256+size, and appends rows for episodes missing from
the index. Marks each new row "backfilled: true" so trainers can distinguish
reconstructed rows. Always opens the index in append mode (never
replaces), so it cannot reproduce the ownership bug it's recovering
from.
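
For reference, the snapshot-and-restore pattern from the first fix can be
sketched roughly as below. This is a minimal illustration, not the code in
tools/prune_episodes.py; the function name replace_preserving_owner and its
arguments are hypothetical.

```python
import os
import stat


def replace_preserving_owner(src_tmp: str, dst: str) -> None:
    """Rename src_tmp over dst, then restore dst's original uid/gid/mode.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    st = os.stat(dst)  # snapshot ownership and mode before the rename
    os.replace(src_tmp, dst)  # dst now carries src_tmp's owner and mode
    try:
        # Best-effort: a non-root caller may not be allowed to change owner.
        os.chown(dst, st.st_uid, st.st_gid)
    except PermissionError:
        pass
    # chmod after chown, since chown can clear setuid/setgid bits.
    os.chmod(dst, stat.S_IMODE(st.st_mode))
```

The restore happens after the rename rather than on the temp file so that a
failed chown never blocks the replace itself.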
Regression test: tests/test_prune.py::test_archive_preserves_index_mode.
Operator note for the live receiver: ran the chown fix manually
(chown cis490:cis490 /var/lib/cis490/index.jsonl) and ran the
backfill once to recover 140 elliott-thinkpad rows that 500'd before
the chown landed.
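
A rough sketch of the append-only backfill pass described above, for
readers who want the shape of it. The row fields ("host", "file",
"sha256", "size") and the function name backfill are assumptions made
for this example; only the episodes/<host>/*.tar.zst layout, the
sha256+size computation, the "backfilled: true" marker, and the
append-only open come from the description above.

```python
import hashlib
import json
from pathlib import Path


def backfill(episodes_dir: Path, index_path: Path) -> int:
    """Append index rows for tarballs that have none; return the number added.

    Hypothetical sketch: the row schema here is assumed, not the real one.
    """
    known = set()
    if index_path.exists():
        with index_path.open() as f:
            for line in f:
                if line.strip():
                    row = json.loads(line)
                    known.add((row["host"], row["file"]))

    added = 0
    # Append mode only: the index file is never replaced, so its
    # ownership and mode are left untouched.
    with index_path.open("a") as idx:
        for tarball in sorted(episodes_dir.glob("*/*.tar.zst")):
            key = (tarball.parent.name, tarball.name)
            if key in known:
                continue
            digest = hashlib.sha256()
            with tarball.open("rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            row = {
                "host": key[0],
                "file": key[1],
                "sha256": digest.hexdigest(),
                "size": tarball.stat().st_size,
                "backfilled": True,  # lets trainers spot reconstructed rows
            }
            idx.write(json.dumps(row) + "\n")
            added += 1
    return added
```

Because rows already in the index are skipped, running the sketch a second
time adds nothing.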