AGENTS.md: how to recover from 400/412 commit-rejected loops

Smaller models running as on-device agents need a direct, prescriptive
remediation block for the gate-failure modes — the receiver's response
body is good but only visible if the agent reads journalctl carefully.
Document the exact sequence (git pull → install-lab-host.sh) and what
the install script now does on its own (drain pre-stamp queue, restart
services). Also call out the two anti-patterns we don't want agents
trying: silencing the shipper to stop log noise, or fabricating a
code_version field to bypass the gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
max 2026-05-01 11:46:04 -05:00
parent eda6164897
commit f294e97875


@@ -78,6 +78,44 @@ common silent failures it catches:
`cis490-doctor --json` is machine-readable for use by other agents.

## Shipper says "400 missing" or "412 commit-rejected": pull and reinstall

If `journalctl -u cis490-shipper` shows a steady stream of
`-> fatal (400)` or `-> 412 commit-rejected` lines, the receiver is
rejecting episodes because their `meta.json::code_version.commit`
isn't in the receiver's allow-list (or isn't being sent at all). This
happens when this lab host is running code older than the receiver
will accept.
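
To gauge how many rejections are piling up before fixing anything, the two log patterns quoted above can be counted. This is a minimal sketch, not part of the repo: the function name `count_gate_rejections` is hypothetical, and the pattern assumes the shipper's log lines contain those strings exactly as written in this section.

```sh
# Hypothetical helper: count gate rejections in shipper log text on stdin.
# Assumes log lines contain "-> fatal (400)" or "-> 412 commit-rejected"
# exactly as quoted in this section.
count_gate_rejections() {
    # "--" ends option parsing because the pattern begins with "-".
    # grep -c exits 1 when the count is 0, so "|| true" keeps the exit
    # status clean for callers running under "set -e".
    grep -cE -- '-> (fatal \(400\)|412 commit-rejected)' || true
}

# Usage on a lab host (shown as a comment, not executed here):
#   journalctl -u cis490-shipper --since '10 minutes ago' --no-pager \
#       | count_gate_rejections
```

A count that stays above zero across repeated runs would suggest the host is still sending episodes the gate rejects.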

The fix is always the same: pull main and re-run the installer.

```sh
cd /opt/cis490
sudo -u cis490 git pull origin main
sudo /opt/cis490/scripts/install-lab-host.sh
```

`install-lab-host.sh` does the rest:

1. Re-stamps `/opt/cis490/VERSION` to the new HEAD.
2. Drains pre-stamp episodes via
   `tools/quarantine_unstamped.py` so the queue stops looping on
   them. Drained episodes go to `/var/lib/cis490/data/quarantine/`
   with a `quarantine_reason.json` per-episode for triage.
3. Restarts `cis490-shipper` and `cis490-orchestrator` so the new code
   takes effect.
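
After the installer finishes, its effects can be spot-checked. The following is a sketch under stated assumptions: `count_quarantined` is a hypothetical helper, and it assumes each drained episode leaves its `quarantine_reason.json` somewhere beneath the quarantine directory (the exact layout is not specified here).

```sh
# Hypothetical helper: count quarantined episodes by their reason files.
# Assumes one quarantine_reason.json per drained episode, at any depth
# under the quarantine directory.
count_quarantined() {
    dir=${1:-/var/lib/cis490/data/quarantine}
    # A missing directory simply means nothing has been quarantined yet.
    [ -d "$dir" ] || { echo 0; return 0; }
    # tr strips the padding some wc implementations add.
    find "$dir" -type f -name quarantine_reason.json | wc -l | tr -d ' '
}

# Usage on a lab host (shown as comments, not executed here):
#   count_quarantined
#   systemctl is-active cis490-shipper cis490-orchestrator
```

`systemctl is-active` prints one state per unit, so two `active` lines would indicate both services came back after the restart.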

Do **not** disable the shipper to silence the log noise; once a host
has the new code, traffic resumes immediately. Do **not** mint a fake
`code_version` field in old episodes to bypass the gate; that field
exists specifically to keep buggy pre-fix data out of the training
index.

If the receiver is rejecting *new* episodes too (you've pulled and
restarted, but still see 412), the receiver's allow-list window may
not yet include your commit. Wait 5s for its Forgejo refresh, or
push your commit to `origin/main` first if you're testing
unmerged work.
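
To tell the two cases apart, it may help to check whether the local HEAD is already on `origin/main`. This is a minimal sketch: `head_on_origin_main` is a hypothetical helper, and it assumes that a commit reachable from `origin/main` is one the receiver can eventually allow-list, which this section implies but does not state.

```sh
# Hypothetical helper: succeed if the checkout's HEAD is reachable from
# origin/main, i.e. the commit has been pushed or merged there.
head_on_origin_main() {
    repo=${1:-/opt/cis490}
    # Refresh the remote-tracking ref, then test ancestry.
    # merge-base --is-ancestor exits 0 when HEAD is an ancestor of
    # (or equal to) origin/main, and 1 when it is not.
    git -C "$repo" fetch --quiet origin main &&
        git -C "$repo" merge-base --is-ancestor HEAD origin/main
}

# Usage on a lab host (shown as comments, not executed here); run it as
# the user that owns the checkout:
#   head_on_origin_main && echo "on origin/main: wait for the refresh"
#   head_on_origin_main || echo "unmerged or fetch failed: push first"
```

A nonzero exit can also mean the fetch itself failed, so a network problem is worth ruling out before concluding the commit is unmerged.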

## Tier 3 + Tier 4 deploy (zero-touch via install-lab-host.sh)

`install-lab-host.sh` runs Tier-3 deploy automatically on its second