URGENT: elliott-thinkpad + k-gamingcom — pull main and reinstall to resume data flow #18

Open
opened 2026-05-01 16:56:33 -05:00 by max · 1 comment
Owner

Status

  • elliott-thinkpad is shipping ~1 PUT/sec to the receiver, and every PUT is rejected with HTTP 400 (~2,300 in the last 30 min). The episodes were generated by pre-stamping code (no code_version in meta.json), so the receiver gate rejects them. The shipper on this host is the old binary, which does not quarantine fatally rejected episodes, so it loops forever on the same backlog.
  • k-gamingcom has been silent since 06:37 CDT this morning.
  • The receiver index is frozen at 4,465 rows. Zero new data has landed since the gate cutover.
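To confirm the diagnosis on a lab host before reinstalling, the backlog of unstamped episodes can be counted directly. The sketch below is a minimal example, not part of the repo: count_unstamped is a hypothetical helper, and it assumes each episode is a subdirectory of data/episodes/ holding its own meta.json, which this issue does not state.

```sh
# Hypothetical helper: count meta.json files that lack a code_version key.
# Assumed layout (not confirmed by the issue): <episodes_dir>/<episode>/meta.json
count_unstamped() {
    episodes_dir="$1"
    # grep -L prints the names of files with NO match; wc -l counts them.
    find "$episodes_dir" -type f -name meta.json \
        -exec grep -L '"code_version"' {} + | wc -l
}

# Example (on the lab host):
#   count_unstamped /opt/cis490/data/episodes
```

A count near the number of rejected PUTs would support the explanation that the shipper is retrying the same pre-stamp backlog.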

What needs to run, on each lab host

Single command, takes ~2 min, idempotent:

cd /opt/cis490 && sudo -u cis490 git pull origin main && \
    sudo /opt/cis490/scripts/install-lab-host.sh

This is the canonical command from AGENTS.md (top-of-file decision-tree table, row 1). It does four things:

  1. Pulls origin/main (gets the queue-quarantine, watchdog, install-loop fixes).
  2. Re-stamps /opt/cis490/VERSION so newly-generated episodes carry a valid code_version.commit.
  3. Runs tools/quarantine_unstamped.py to drain the pre-stamp backlog from data/episodes/ to data/quarantine/ (~2,300 entries on elliott-thinkpad right now). The shipper stops looping on rejected episodes immediately.
  4. Runs systemctl daemon-reload, then systemctl restart cis490-shipper cis490-orchestrator, so the new code takes effect.

After that, fresh episodes generated by the new orchestrator carry the commit hash, the receiver gate accepts them, and the index resumes growing.
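A quick host-side check after the install can confirm that the re-stamp and the restart took effect. This is a sketch under assumptions: version_stamped is a hypothetical helper, and it only checks that the VERSION file exists and is non-empty, because this issue does not describe the file's format.

```sh
# Hypothetical helper: succeed if the VERSION stamp exists and is non-empty.
# It does not validate the contents; the VERSION format is not documented here.
version_stamped() {
    root="${1:-/opt/cis490}"
    [ -s "$root/VERSION" ]
}

# Example (on the lab host, after install-lab-host.sh finishes):
#   version_stamped && echo "VERSION stamped"
#   systemctl is-active cis490-shipper cis490-orchestrator
```

systemctl is-active prints one state line per unit, so both lines should read "active" once the restart has completed.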

What NOT to do

From AGENTS.md "HARD RULES":

  • Do not disable / mask / systemctl stop cis490-shipper to silence the log noise.
  • Do not invent a code_version field in old meta.json files to bypass the gate.
  • Do not git pull and skip running install-lab-host.sh (the bare pull updates files on disk but doesn't restart the daemons, so the running daemons keep executing the old code they already loaded into memory).

Receiver-side context (for the on-device agent)

  • Receiver allow-list HEAD: 20ff76c (auto-refreshes every 5 s from Forgejo, so the new commit is already valid the moment a host pulls).
  • Receiver gate logs are visible to the maintainer via journalctl -u cis490-receiver; the current symptom is a steady stream of 400 responses to requests from 10.100.0.12 (elliott-thinkpad), logged as missing X-Cis490-Code-Commit.
  • Once the new shipper is running, the existing rejected episodes auto-quarantine on first scan and disappear from the live queue.
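On the maintainer side, the rejection rate can be estimated by counting gate rejections in the receiver journal. This is a minimal sketch, assuming each rejection is logged on a single line containing both the sender IP and the text missing X-Cis490-Code-Commit (the exact log format is not shown in this issue); count_rejections is a hypothetical name.

```sh
# Hypothetical helper: count log lines on stdin that mention both the sender IP
# and the missing-header reason. One log line per rejection is an assumption.
count_rejections() {
    sender_ip="$1"
    # -w avoids matching a longer address such as 10.100.0.120.
    grep -wF -- "$sender_ip" | grep -cF 'missing X-Cis490-Code-Commit'
}

# Example (on the receiver host):
#   journalctl -u cis490-receiver --since "30min ago" --no-pager \
#       | count_rejections 10.100.0.12
```

The count should drop to zero for new log windows once the host runs the new shipper.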

Verification

After the fix, the maintainer-side check is:

ssh <pi> 'sudo tail -3 /var/lib/cis490/index.jsonl'
# Should show recent timestamps with code_commit=20ff76c... (or newer)
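Because the symptom is an index frozen at a fixed row count, recovery can also be verified by checking that the row count grows over a short interval. This is a sketch that assumes one row per line in index.jsonl; index_grew is a hypothetical helper and its default interval is arbitrary.

```sh
# Hypothetical helper: succeed if the file gains lines within the interval.
index_grew() {
    index_file="$1"
    interval="${2:-10}"   # seconds between the two samples (arbitrary default)
    before=$(wc -l < "$index_file")
    sleep "$interval"
    after=$(wc -l < "$index_file")
    echo "rows: $before -> $after"
    [ "$after" -gt "$before" ]
}

# Example (on the receiver host; reading the index may require root):
#   index_grew /var/lib/cis490/index.jsonl 10
```

A first sample still showing 4,465 rows with no growth would mean the gate is still rejecting everything.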
Author
Owner

Update — 2026-05-02 11:48 CDT

elliott-thinkpad has pulled far enough to start stamping episodes, but the commit it stamps them with does NOT exist on origin/main:

  • Last 19 hours: ~31,738 episodes rejected as 412 not-in-window with commit=5568d77df801
  • Forgejo API for that sha: HTTP 404 — not on main, not on any other branch
  • Receiver head: 98dcd4f (current)
  • elliott-thinkpad is generating + stamping + shipping ~1 ep/sec, throwing every one away

The local commit was likely made by the on-device agent or an operator hand-edit, then never pushed. Auto-update would correctly refuse to fast-forward over it, because the local HEAD is not an ancestor of origin/main.
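Whether a host is in this state can be checked with git itself. The sketch below asks whether HEAD is contained in origin/main; head_on_origin_main is a hypothetical helper name, while git fetch and git merge-base --is-ancestor are standard git commands.

```sh
# Hypothetical helper: succeed if HEAD is an ancestor of (or equal to)
# origin/main, i.e. the commit being stamped is visible on the remote branch.
head_on_origin_main() {
    repo="${1:-/opt/cis490}"
    # Refresh origin/main first; return 2 if the fetch itself fails.
    git -C "$repo" fetch --quiet origin main || return 2
    git -C "$repo" merge-base --is-ancestor HEAD origin/main
}

# Example (run as the cis490 user on the lab host):
#   head_on_origin_main /opt/cis490 && echo "HEAD is on origin/main" \
#       || echo "HEAD is NOT on origin/main (or fetch failed)"
```

A non-zero result for a stamping host matches the situation described here: episodes carry a commit that the receiver's allow-list can never contain.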

Three ways to unstick

# OPTION A — push the local commit (preserves the work, makes it visible)
cd /opt/cis490
sudo -u cis490 git push origin HEAD:main
# Note: git rejects this push as a non-fast-forward if origin/main contains commits that 5568d77 does not.
# Receiver picks up the new sha in ~5 s; new episodes start landing.

# OPTION B — reset to origin (loses the local change)
cd /opt/cis490
sudo -u cis490 git fetch origin main
sudo -u cis490 git reset --hard origin/main
sudo /opt/cis490/scripts/install-lab-host.sh

# OPTION C — show the maintainer what's in 5568d77 first (recommended before either above)
cd /opt/cis490
sudo -u cis490 git log -1 --stat 5568d77df801
sudo -u cis490 git show 5568d77df801 | head -200
# Paste the output as a comment here. Then choose A or B.

k-gamingcom remains silent — separate issue (services not running, host offline, or never bootstrapped). Same canonical command from row 1 of AGENTS.md still applies once a human is at that keyboard.

Reference: bolyai/CIS490#18