CIS490/FIXYOURSELF.md
max d1e1b1132d FIXYOURSELF.md: explicit decision tree for stuck lab hosts
The auto-update timer (98dcd4f) covers the routine case of a host
falling behind origin/main. It deliberately refuses to fast-forward
when local HEAD isn't an ancestor of origin/main — the right call
for safety, but it leaves on-device agents with no automatic path
out when they (or an operator) made a local commit.

That's exactly the elliott-thinkpad incident: ~31,738 episodes
shipped over 19 hours, all stamped with local commit 5568d77 that
isn't on origin/main, all 412'd. Auto-update can't fix it; the
on-device agent had no doc telling it what to do.

FIXYOURSELF.md is that doc. Pure decision tree, six branches
(behind / diverged / no-network / no-git / dirty-tree / clean) each
with verbatim commands and the order to try them. The diverged-HEAD
branch (§B) is the elliott-thinkpad case and offers three resolutions
(push, reset, file-issue-and-wait) so an agent that doesn't have
push permission isn't backed into discarding work.

Linked from the AGENTS.md top-of-file symptom table so a smaller
model finds it without having to know the filename.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:53:16 -05:00


FIXYOURSELF.md — when the lab host can't unstick itself automatically

You are reading this because you (an on-device AI agent) are running on a CIS490 lab host and one of these is true:

  • cis490-autoupdate.timer keeps failing.
  • The shipper logs show steady 412 commit-rejected or 400 missing.
  • cis490-doctor --role lab-host flagged you red.

The normal "pull and reinstall" path (AGENTS.md row 1) sometimes can't recover on its own — most often because someone made a local commit on this host, or /opt/cis490 got into a state git pull --ff-only refuses to fix. This doc is the explicit decision tree for those cases. Run the commands literally. Do not paraphrase.


Step 1 — Identify your state

Run all of these. Save the output. Match the result against the table.

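cd /opt/cis490
sudo -u cis490 git fetch origin main 2>&1       # → fresh fetch result (run first so origin/main is current)
sudo -u cis490 git rev-parse HEAD               # → LOCAL
sudo -u cis490 git rev-parse origin/main 2>&1   # → REMOTE (or error)
sudo -u cis490 git merge-base --is-ancestor HEAD origin/main; echo "merge-base check: $?"  # → 0 if HEAD is an ancestor of origin/main
sudo -u cis490 git log -1 --format='%H %ci %s'  # → what HEAD is
sudo -u cis490 git status --porcelain           # → uncommitted changes?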

Then:

| If you see… | Your state | Go to |
|---|---|---|
| LOCAL == REMOTE and git status empty | Not stuck; run §Z to be safe | §Z |
| LOCAL != REMOTE and git merge-base --is-ancestor HEAD origin/main returns 0 | Behind main, no local commits | §A |
| LOCAL != REMOTE and the merge-base check returns 1 (non-zero) | You have a local commit not on origin/main | §B |
| git fetch prints a network error | Connectivity broken | §C |
| /opt/cis490/.git is missing | No git checkout (originally populated via cp -aT) | §D |
| git status shows tracked files modified | Uncommitted edits on this host | §E |

If multiple match: §C blocks everything else (fix network first), then §D, then §E, then §B, then §A.
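
The table and the precedence rule above can be condensed into one function. This is a minimal sketch, not something that ships in the repo: pick_section and its six arguments are hypothetical names, and how you gather each argument (for example net_ok from the ping in §C) is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the Step 1 observations to the section to read.
# Arguments (all supplied by the caller; nothing is read from the host here):
#   $1 net_ok       1 if the fetch reached origin, 0 on a network error
#   $2 has_git      1 if /opt/cis490/.git exists, 0 otherwise
#   $3 dirty        1 if git status showed modified tracked files, else 0
#   $4 local_sha    output of: git rev-parse HEAD
#   $5 remote_sha   output of: git rev-parse origin/main
#   $6 ancestor_rc  exit status of: git merge-base --is-ancestor HEAD origin/main
pick_section() {
  local net_ok=$1 has_git=$2 dirty=$3 local_sha=$4 remote_sha=$5 ancestor_rc=$6

  # Precedence from the text: C first, then D, then E, then B, then A.
  if [ "$net_ok" != 1 ]; then echo C; return; fi
  if [ "$has_git" != 1 ]; then echo D; return; fi
  if [ "$dirty" = 1 ]; then echo E; return; fi
  if [ "$local_sha" = "$remote_sha" ]; then echo Z; return; fi
  # merge-base --is-ancestor exits 0 when HEAD is an ancestor of origin/main.
  if [ "$ancestor_rc" = 0 ]; then echo A; else echo B; fi
}
```

Called with a working network, a present .git, a clean tree, two different SHAs, and an ancestor_rc of 1, it prints B. Treat the result as a pointer to the right section, not as a replacement for reading it.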


§A — Behind main, clean tree

cd /opt/cis490
sudo -u cis490 git pull --ff-only origin main
sudo /opt/cis490/scripts/install-lab-host.sh

install-lab-host.sh re-stamps VERSION, drains the pre-stamp queue, and restarts the daemons. Verify with §Z.


§B — You have a local commit not on origin/main

This is the elliott-thinkpad case (2026-05-01..02). You committed something locally, the maintainer's origin/main doesn't have it, and the receiver's allow-list rejects every episode you ship. Pick ONE of B.1, B.2, B.3 — read all three first.

B.1 — Push your local commit to origin/main

cd /opt/cis490
sudo -u cis490 git log -3 --stat HEAD      # what you're about to push — read it
sudo -u cis490 git push origin HEAD:main

If git push succeeds: the receiver allow-list picks it up within 5 seconds, new episodes start landing immediately. Verify with §Z.

If git push fails with "permission denied" or "auth required": you don't have push credentials. Skip to B.2 or B.3.

If git push fails with "non-fast-forward": origin/main has moved on since the base of your local commit (you committed against an older base), so the two histories have diverged. Try:

sudo -u cis490 git pull --rebase origin main
# If the rebase has conflicts you can't resolve in 60 seconds, abort:
#   sudo -u cis490 git rebase --abort
# Then go to B.2 or B.3.
sudo -u cis490 git push origin HEAD:main
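
The three push outcomes above amount to a small triage on the error text. A minimal sketch, assuming you captured the stderr of git push into a string: triage_push_error is a hypothetical name, and the matched substrings are assumptions based on the phrases quoted above (real git and server messages vary by version and transport).

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose the next B.1 step from captured git push stderr.
# Prints one of: rebase, no-credentials, unknown.
triage_push_error() {
  local msg
  # Lower-case the message so the match is case-insensitive.
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *non-fast-forward*)
      echo rebase ;;            # try the pull --rebase path above
    *"permission denied"*|*"auth required"*|*authentication*)
      echo no-credentials ;;    # skip to B.2 or B.3
    *)
      echo unknown ;;           # not covered here; read the message yourself
  esac
}
```

One way to feed it: err=$(sudo -u cis490 git push origin HEAD:main 2>&1 >/dev/null) || triage_push_error "$err". An unknown result means the message is outside the cases this section covers, so stop and read it rather than guessing.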

B.2 — Discard your local commit, reset to origin

This deletes your local commit. Only do this if you know it wasn't a real fix or the maintainer told you to discard it.

cd /opt/cis490
sudo -u cis490 git fetch origin main
sudo -u cis490 git log --oneline HEAD ^origin/main  # what you're about to lose
sudo -u cis490 git reset --hard origin/main
sudo /opt/cis490/scripts/install-lab-host.sh

B.3 — Stop, file an issue, wait

If you can't decide between B.1 and B.2 (e.g. you don't know if your commit is a real fix), do this:

cd /opt/cis490
HOST_ID=$(grep '^host_id' /etc/cis490/lab-host.toml | cut -d'"' -f2)
LOCAL_SHA=$(sudo -u cis490 git rev-parse HEAD)
DETAIL=$(sudo -u cis490 git log -3 --stat HEAD | head -100)

# Build the JSON payload. Python reads the values from the environment
# instead of having them pasted into its source, so the multi-line git
# output (and any quotes or braces in it) cannot break the script.
PAYLOAD=$(HOST_ID="$HOST_ID" LOCAL_SHA="$LOCAL_SHA" DETAIL="$DETAIL" python3 - <<'PY'
import json, os
env = os.environ
fence = "`" * 3
print(json.dumps({
    "title": f"{env['HOST_ID']}: stuck on local commit {env['LOCAL_SHA'][:12]}",
    "body": f"### What's at HEAD\n\n{fence}\n{env['DETAIL']}\n{fence}\n\n"
            "Need maintainer to choose: push HEAD to main, or reset --hard origin/main here?",
}))
PY
)

# File the issue (replace <TOKEN> with the operator's Forgejo token;
# do NOT embed yours in commits)
curl -sS -X POST \
  -H "Authorization: token <TOKEN>" \
  -H "Content-Type: application/json" \
  http://10.100.0.1:3000/api/v1/repos/spectral/CIS490/issues \
  -d "$PAYLOAD"

Then leave the daemons running. The shipper will keep auto-quarantining the 412s — backlog grows but doesn't crash anything. Wait for a maintainer comment.


§C — Network broken

ping -c 1 10.100.0.1                # the Pi
sudo wg show                         # is wg0 up?
sudo systemctl restart wg-quick@wg0  # bring it back up
sudo systemctl restart cis490-shipper cis490-orchestrator

If ping 10.100.0.1 still fails after a wg-quick restart, this is a WireGuard / wg-enroll / iptmonads problem outside this repo. File an issue at spectral/wg-enroll or spectral/iptmonads and stop.
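
The tunnel can take a moment to come back after the restart, so a single ping may fail too early. A minimal sketch of a bounded retry, assuming nothing about the repo: retry is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary choices, not values from the maintainer.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to N times, pausing between tries.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Pause only if another attempt is still to come.
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  return 1
}
```

For example: retry 5 2 ping -c 1 10.100.0.1 || echo "Pi still unreachable: file an issue and stop". A bounded loop keeps you from hammering the restart indefinitely; once it gives up, the rule above still applies.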


§D — /opt/cis490/.git missing

The host was originally set up with cp -aT (no .git/). That makes auto-update impossible. Re-clone:

# Stop services so we don't race with the orchestrator mid-episode
sudo systemctl stop cis490-shipper cis490-orchestrator

# Preserve config/data — only /opt/cis490 (the code) gets replaced.
# /etc/cis490/ and /var/lib/cis490/ are NOT touched.
sudo mv /opt/cis490 /opt/cis490.pre-fix
sudo git clone http://maxgit.wg:3000/spectral/CIS490.git /opt/cis490
sudo chown -R cis490:cis490 /opt/cis490

sudo /opt/cis490/scripts/install-lab-host.sh
# Once verified, you can drop the backup:
#   sudo rm -rf /opt/cis490.pre-fix

§E — Uncommitted edits on tracked files

cd /opt/cis490
sudo -u cis490 git status --short    # see what's modified
sudo -u cis490 git diff              # see exactly what changed

If the changes are intentional (e.g. you fixed a bug), commit them first and then go to §B:

sudo -u cis490 git add <files>
sudo -u cis490 git commit -m "<short description>"
# Now go to §B.

If the changes are accidental / left over from debugging, discard them:

sudo -u cis490 git checkout -- .
# Now go to §A.

§Z — Verify you're unstuck

# 1. Daemons up?
systemctl is-active cis490-shipper cis490-orchestrator
# Both should say "active".

# 2. VERSION present and matches HEAD?
cat /opt/cis490/VERSION
sudo -u cis490 git -C /opt/cis490 rev-parse HEAD
# The "commit" in VERSION should equal the rev-parse output.

# 3. Recent ship results: should NOT consist only of fatal 412 / 400 rejections:
sudo journalctl -u cis490-shipper --since '2 minutes ago' | tail -20

# 4. Maintainer-side confirmation (if you can reach the Pi):
#   sudo tail -5 /var/lib/cis490/index.jsonl
# Should show recent received_at_wall timestamps with your host_id.

# 5. Run the doctor — every red row prints its own fix:
/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py --role lab-host
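
Check 2 above is easy to get wrong by eye when comparing 40-character SHAs. A minimal sketch that does the comparison for you: version_has_commit is a hypothetical helper, and because the layout of VERSION is not described here, it only assumes the full commit SHA appears somewhere in the file as plain text.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given commit SHA appears in a VERSION file.
# Usage: version_has_commit <version_file> <sha>
version_has_commit() {
  local version_file=$1 head_sha=$2
  # An empty SHA or an unreadable file counts as a mismatch.
  if [ -z "$head_sha" ] || [ ! -r "$version_file" ]; then return 1; fi
  # -F matches the SHA as a fixed string; -q suppresses output.
  grep -qF -- "$head_sha" "$version_file"
}
```

For example: version_has_commit /opt/cis490/VERSION "$(sudo -u cis490 git -C /opt/cis490 rev-parse HEAD)" && echo "VERSION matches HEAD". A non-zero exit means the stamp is stale or missing, which is what install-lab-host.sh re-stamps.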

Hard rules (don't break these)

These are repeated from AGENTS.md because the situations covered in this doc are exactly when an agent is most tempted to break them.

  • Do NOT edit meta.json files to add a fake code_version. The gate exists to keep buggy data out of the index.
  • Do NOT disable / mask / systemctl stop cis490-shipper to silence log noise. Fix the cause.
  • Do NOT set verify_tls = false in lab-host.toml. That's a bypass, not a fix.
  • Do NOT git commit on /opt/cis490 to "make the gate happy" — committing without pushing is exactly how the elliott-thinkpad incident started. If you commit, push (§B.1) or don't commit at all.
  • Do NOT delete /var/lib/cis490/data/quarantine/. Those are forensic records of what the gate rejected; they're capped at 30 days by the cleanup pass.
  • Do NOT clobber /etc/cis490/certs/. Cert delivery is automated; rerun install-lab-host.sh if certs are missing.

If you find yourself wanting to do any of the above, stop and file an issue (§B.3 has the curl command). The maintainer would much rather resolve a stale lab host by reading an issue than by reverse-engineering what an agent did to escape a stuck state.