receiver: returns 500 on first PUT but episode lands correctly — index.jsonl write likely failing after atomic rename #13

Closed
opened 2026-04-30 16:26:42 -05:00 by elliott · 1 comment
Owner

Context

First live ship from elliott-thinkpad lab host (Dev_REL1_043026 bring-up). All new episodes receive HTTP 500 on first PUT attempt. On the shipper's next scan the same episodes return 200 already-present, confirming the tarball was stored correctly.

What happened

INFO httpx HTTP Request: PUT https://collector.wg/v1/episodes/elliott-thinkpad/01KQG4D75K.tar.zst "HTTP/1.1 500 Internal Server Error"
INFO cis490.shipper.queue ship 01KQG4D75K -> transient (500) server error 500
...(next scan)...
INFO cis490.shipper.queue ship 01KQG4D75K -> already-present (200)

This pattern repeated for every fresh episode from the first wave.

Root cause hypothesis

receiver/store.py EpisodeStore.ingest_stream does:

  1. Stream body → partial file ✓
  2. Verify sha256 ✓
  3. os.replace(partial, final) ✓ (tarball is there on retry)
  4. self._append_index({...}) ← likely exception here
  5. return StoreResult(status="stored", ...) ← never reached

If _append_index raises (permissions on index.jsonl, disk issue, or concurrent write race), the app propagates the exception and Starlette returns 500. The tarball is already saved at step 3 so subsequent requests correctly return already-present.

Impact

Data is NOT lost — episodes land on the Pi. But index.jsonl may be missing entries for episodes that hit this path, which breaks index_reader and downstream trainers that rely on the index.

What was tried

Cannot inspect Pi receiver logs (SSH refused on port 22). shipper retries recover the 200, so no data loss on the shipper side.

Suggested next step

  1. Check /var/lib/cis490/index.jsonl line count vs. episode count under /var/lib/cis490/episodes/elliott-thinkpad/ — gap confirms the hypothesis.
  2. Check receiver systemd logs: journalctl -u cis490-receiver -n 50
  3. Likely fix: wrap _append_index in a try/except that logs the error but returns StoreResult(status="stored") so the client gets 201 and the index can be rebuilt separately.

Refs spectral/CIS490 Dev_REL1_043026 bring-up.

## Context First live ship from elliott-thinkpad lab host (Dev_REL1_043026 bring-up). All new episodes receive HTTP 500 on first PUT attempt. On the shipper's next scan the same episodes return 200 already-present, confirming the tarball was stored correctly. ## What happened ``` INFO httpx HTTP Request: PUT https://collector.wg/v1/episodes/elliott-thinkpad/01KQG4D75K.tar.zst "HTTP/1.1 500 Internal Server Error" INFO cis490.shipper.queue ship 01KQG4D75K -> transient (500) server error 500 ...(next scan)... INFO cis490.shipper.queue ship 01KQG4D75K -> already-present (200) ``` This pattern repeated for every fresh episode from the first wave. ## Root cause hypothesis `receiver/store.py EpisodeStore.ingest_stream` does: 1. Stream body → partial file ✓ 2. Verify sha256 ✓ 3. `os.replace(partial, final)` ✓ (tarball is there on retry) 4. `self._append_index({...})` ← likely exception here 5. `return StoreResult(status="stored", ...)` ← never reached If `_append_index` raises (permissions on index.jsonl, disk issue, or concurrent write race), the app propagates the exception and Starlette returns 500. The tarball is already saved at step 3 so subsequent requests correctly return already-present. ## Impact Data is NOT lost — episodes land on the Pi. But index.jsonl may be missing entries for episodes that hit this path, which breaks `index_reader` and downstream trainers that rely on the index. ## What was tried Cannot inspect Pi receiver logs (SSH refused on port 22). shipper retries recover the 200, so no data loss on the shipper side. ## Suggested next step 1. Check `/var/lib/cis490/index.jsonl` line count vs. episode count under `/var/lib/cis490/episodes/elliott-thinkpad/` — gap confirms the hypothesis. 2. Check receiver systemd logs: `journalctl -u cis490-receiver -n 50` 3. Likely fix: wrap `_append_index` in a try/except that logs the error but returns StoreResult(status="stored") so the client gets 201 and the index can be rebuilt separately. Refs spectral/CIS490 Dev_REL1_043026 bring-up.
max closed this issue 2026-04-30 16:36:07 -05:00
Owner

Root cause confirmed and fixed in 8d2d0d2.

The diagnosis in this issue was correct in shape but missed the origin: it was not a logic bug in receiver/store.py. The live receiver could not append to /var/lib/cis490/index.jsonl because an earlier sudo'd prune run (during the data-quality cleanup session) rewrote the file via os.replace(), leaving it root:root mode 0644. The receiver runs as the cis490 user and got EPERM on every _append_index. The os.replace path in store.py never executes — the tarball was already on disk before _append_index ran, which is why retries saw already-present.

Resolution:

  • Operator: sudo chown cis490:cis490 /var/lib/cis490/index.jsonl (immediate)
  • Operator: sudo -u cis490 /opt/cis490/.venv/bin/python /opt/cis490/tools/index_backfill.py to recover the 140 elliott-thinkpad rows that 500'd before the chown
  • Code: tools/prune_episodes.py now snapshots the index's stat before rewriting and restores ownership+mode after os.replace, so this cannot recur from a sudo'd prune
  • Code: new tools/index_backfill.py for any future drift between episodes/ and index.jsonl
  • Test: tests/test_prune.py::test_archive_preserves_index_mode

Verified live: index now at 207 rows, growing in step with new PUTs from elliott-thinkpad and k-gamingcom. No 500s in the receiver journal since the chown.

Root cause confirmed and fixed in 8d2d0d2. The diagnosis in this issue was correct in shape but missed the *origin*: it was not a logic bug in receiver/store.py. The live receiver could not append to /var/lib/cis490/index.jsonl because an earlier sudo'd prune run (during the data-quality cleanup session) rewrote the file via os.replace(), leaving it root:root mode 0644. The receiver runs as the cis490 user and got EPERM on every _append_index. The os.replace path in store.py never executes — the tarball was already on disk before _append_index ran, which is why retries saw already-present. Resolution: - Operator: sudo chown cis490:cis490 /var/lib/cis490/index.jsonl (immediate) - Operator: sudo -u cis490 /opt/cis490/.venv/bin/python /opt/cis490/tools/index_backfill.py to recover the 140 elliott-thinkpad rows that 500'd before the chown - Code: tools/prune_episodes.py now snapshots the index's stat before rewriting and restores ownership+mode after os.replace, so this cannot recur from a sudo'd prune - Code: new tools/index_backfill.py for any future drift between episodes/ and index.jsonl - Test: tests/test_prune.py::test_archive_preserves_index_mode Verified live: index now at 207 rows, growing in step with new PUTs from elliott-thinkpad and k-gamingcom. No 500s in the receiver journal since the chown.
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: bolyai/CIS490#13
No description provided.