receiver: returns 500 on first PUT but episode lands correctly — index.jsonl write likely failing after atomic rename #13
Context
First live ship from elliott-thinkpad lab host (Dev_REL1_043026 bring-up). All new episodes receive HTTP 500 on first PUT attempt. On the shipper's next scan the same episodes return 200 already-present, confirming the tarball was stored correctly.
What happened
This pattern repeated for every fresh episode from the first wave.
Root cause hypothesis
receiver/store.py EpisodeStore.ingest_stream does:
- os.replace(partial, final) ✓ (tarball is there on retry)
- self._append_index({...}) ← likely exception here
- return StoreResult(status="stored", ...) ← never reached
If _append_index raises (permissions on index.jsonl, disk issue, or concurrent write race), the app propagates the exception and Starlette returns 500. The tarball is already saved at step 3 (the os.replace), so subsequent requests correctly return already-present.
Impact
Data is NOT lost: episodes land on the Pi. But index.jsonl may be missing entries for episodes that hit this path, which breaks index_reader and downstream trainers that rely on the index.
What was tried
Cannot inspect Pi receiver logs (SSH refused on port 22). Shipper retries recover the 200, so no data loss on the shipper side.
Suggested next step
- Check /var/lib/cis490/index.jsonl line count vs. episode count under /var/lib/cis490/episodes/elliott-thinkpad/; a gap confirms the hypothesis.
- Check the receiver journal: journalctl -u cis490-receiver -n 50
- Wrap _append_index in a try/except that logs the error but returns StoreResult(status="stored") so the client gets 201 and the index can be rebuilt separately.
Refs spectral/CIS490 Dev_REL1_043026 bring-up.
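The try/except suggestion could look roughly like the sketch below. It is not the code in receiver/store.py: finish_ingest, append_index, and the index_ok field are hypothetical names used only for illustration, and the real StoreResult may carry different fields. The idea is that once os.replace has succeeded the tarball is durable, so an index failure is logged and flagged instead of turning into a 500.

```python
import json
import logging
import os
from dataclasses import dataclass

log = logging.getLogger("receiver.store")


@dataclass
class StoreResult:
    status: str
    index_ok: bool = True  # hypothetical flag, not part of the original StoreResult


def append_index(index_path: str, row: dict) -> None:
    """Append one JSON row to the index file (stand-in for _append_index)."""
    with open(index_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")


def finish_ingest(partial: str, final: str, index_path: str, row: dict) -> StoreResult:
    """Publish the tarball, then try to index it without failing the request."""
    os.replace(partial, final)  # the tarball is in place from here on
    try:
        append_index(index_path, row)
    except OSError as exc:  # PermissionError, disk full, and similar
        # Log loudly so the missing row can be rebuilt later.
        log.error("index append failed for %s: %s", final, exc)
        return StoreResult(status="stored", index_ok=False)
    return StoreResult(status="stored")
```

One trade-off of this approach: the client sees success even when the index is behind, so the error log (or a flag like index_ok) becomes the only signal that a rebuild is needed.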
Root cause confirmed and fixed in 8d2d0d2.
The diagnosis in this issue was correct in shape but missed the origin: it was not a logic bug in receiver/store.py. The live receiver could not append to /var/lib/cis490/index.jsonl because an earlier sudo'd prune run (during the data-quality cleanup session) rewrote the file via os.replace(), leaving it root:root mode 0644. The receiver runs as the cis490 user and got EPERM on every _append_index. The os.replace path in store.py never executes; the tarball was already on disk before _append_index ran, which is why retries saw already-present.
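The ownership change described above is a general hazard of rewriting a file through a temp file plus os.replace(): the new file belongs to whoever ran the rewrite. A minimal sketch of a rewrite that carries the original owner and mode over is shown below. It is not the prune tool's actual code; rewrite_preserving_owner is a hypothetical helper, and it assumes a POSIX system where os.chown is available.

```python
import os
import stat
import tempfile
from typing import Iterable


def rewrite_preserving_owner(path: str, lines: Iterable[str]) -> None:
    """Atomically replace `path`, keeping the owner, group, and mode it had."""
    original = os.stat(path)
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".rewrite-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.writelines(lines)
            fh.flush()
            os.fsync(fh.fileno())
        os.chmod(tmp, stat.S_IMODE(original.st_mode))
        created = os.stat(tmp)
        # When run under sudo the temp file is root-owned: hand it back.
        if (created.st_uid, created.st_gid) != (original.st_uid, original.st_gid):
            os.chown(tmp, original.st_uid, original.st_gid)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

If the caller lacks the privilege to restore the original owner, os.chown raises and the original file is left untouched, which seems preferable to silently changing ownership.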
Resolution:
Verified live: index now at 207 rows, growing in step with new PUTs from elliott-thinkpad and k-gamingcom. No 500s in the receiver journal since the chown.
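For repeating this check later, the row count versus on-disk episode count comparison can be scripted. The sketch below rests on assumptions: one tarball per episode, a "*.tar.gz" filename pattern (the actual extension is not stated in this issue), and one JSON row per non-blank index line. The function names are hypothetical.

```python
from pathlib import Path


def count_index_rows(index_path: str) -> int:
    """Count non-blank lines in the JSONL index."""
    with open(index_path, encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def count_episode_files(episodes_root: str, pattern: str = "*.tar.gz") -> int:
    """Count episode tarballs under all host directories (pattern is assumed)."""
    return sum(1 for p in Path(episodes_root).rglob(pattern) if p.is_file())


def index_gap(index_path: str, episodes_root: str, pattern: str = "*.tar.gz") -> int:
    """Positive result: episodes on disk that have no index row."""
    return count_episode_files(episodes_root, pattern) - count_index_rows(index_path)


if __name__ == "__main__":
    gap = index_gap("/var/lib/cis490/index.jsonl", "/var/lib/cis490/episodes")
    print(f"episodes missing from index: {gap}")
```

A count comparison only shows that rows are missing, not which ones; identifying them would require knowing the index row schema, which this issue does not describe.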