Smaller (non-4.7) Claude models act as on-device agents on CIS490 lab hosts and have hit the install gotchas that became issues #10–#12. Their reports describe symptoms well but miss inferred context, so this expands the runbook with explicit "do this, not that" notes:
- run tools from /opt/cis490, not a clone (CWD-on-sys.path trap)
- shipper "waiting on mTLS material" is expected and self-heals; do not try to fix it manually
- table of the three install bugs already closed in main, so a fresh agent can recognize the symptom and pull instead of re-filing
- "fix one red row at a time" rather than batching attempts
Closes nothing new; this is the followup to #10/#11/#12 promised during their resolution.
AGENTS.md — guidance for AI agents working on this repo
This project is part of the spectral lab (http://maxgit.wg/spectral/).
The conventions below also apply to sibling repos (wg-enroll,
wg-pki, caddy, iptmonads, matrix, forgejo, vault,
openclaw-deploy).
How a lab host gets to "shipping data" — the canonical bring-up
If you (an AI agent OR a human) are looking at a fresh lab host and asking "is this thing actually generating data for the central collector?", run this in order. Cloning the repo by itself does nothing. Running launchers from a manual clone bypasses the systemd services that do the actual work.
# 0. (One-time, on the Pi only.) Initialize the CIS490 client CA + a
# leaf cert for THIS lab host. Get its WG IP from `wg-enroll-admin
# show <usb>` first.
sudo /home/max/.env/wg-pki/scripts/init-cis490-client-ca.sh # idempotent
sudo /home/max/.env/wg-pki/scripts/deploy-cis490-cert.sh \
<host_id> <wg_ip> # mints + scp's + extracts + chmods
# 1. (On the lab host.) Install the lab-host role. This copies the
# repo into /opt/cis490, builds the venv, drops systemd units,
# fetches the Alpine baseline qcow2, and builds the cidata ISO
# with the in-guest agent embedded.
sudo /opt/cis490/scripts/install-lab-host.sh
# (or, if running from the manual clone:)
# sudo ./scripts/install-lab-host.sh
# 2. Edit /etc/cis490/lab-host.toml — set host_id and any overrides.
# 3. Verify everything before enabling the timer-driven services:
/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py \
--role lab-host
# → green/yellow rows mean READY; red rows print the exact fix
# command. Re-run until clean.
# 4. Turn on the services. From this moment on, the orchestrator runs
# one fleet wave on each Restart= cycle, and the shipper picks up
# completed episodes and PUTs them to https://collector.wg over mTLS.
sudo systemctl enable --now cis490-shipper cis490-orchestrator
# 5. (On the Pi.) Watch the index grow:
sudo tail -f /var/lib/cis490/index.jsonl
# 6. (Optional, Tier 3.) Enable real exploit fire — needs metasploit.
sudo /opt/cis490/scripts/install-msfrpcd.sh
# Operator-supplied URL + sha256 (Rapid7 download is registration-walled):
IMAGE_URL='…' IMAGE_SHA256='…' sudo OUT_DIR=/var/lib/cis490/vm/images \
/opt/cis490/scripts/fetch-metasploitable2.sh
If index.jsonl doesn't grow within a wave-interval (~60 s after
systemctl enable --now), run cis490-doctor again. The most
common silent failures it catches:
- *.wg DNS missing (wg-enroll provisions it; manual workaround is one line in /etc/hosts)
- mTLS cert chain not installed under /etc/cis490/certs/
- cis490-shipper service inactive (forgot step 4)
- qemu-system-x86_64 not on PATH
cis490-doctor --json is machine-readable for use by other agents.
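For an agent consuming that JSON, a minimal Python sketch of the READY decision might look like the following. The report schema is an assumption: this document does not specify it, so the `rows`, `name`, `status`, and `fix` keys and the sample row names are hypothetical and should be checked against real `--json` output. The only rule taken from the runbook is that green and yellow rows count as READY while any red row blocks.

```python
import json
from collections import Counter

# ASSUMPTION: hypothetical report shape and row names; confirm against the
# real `cis490-doctor --json` output before relying on any of these keys.
SAMPLE_REPORT = """
{"rows": [
  {"name": "wg-dns", "status": "green", "fix": null},
  {"name": "mtls-certs", "status": "red", "fix": "<command printed by the doctor>"},
  {"name": "qemu", "status": "yellow", "fix": null}
]}
"""


def summarize(report_text: str) -> dict:
    """Count rows per status and decide READY (READY means no red rows)."""
    rows = json.loads(report_text)["rows"]
    counts = Counter(row["status"] for row in rows)
    return {
        "counts": dict(counts),
        "ready": counts.get("red", 0) == 0,
        "red_rows": [row["name"] for row in rows if row["status"] == "red"],
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_REPORT))
```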
Common bring-up gotchas (read this before debugging an install)
Smaller models acting as on-device agents have hit these traps. Each
one is now fixed in main, but if you're on an older clone you may
still see the symptom — pull origin/main first, then re-read.
Run tools from /opt/cis490, not from a manual clone
When you run cis490-doctor from a clone like ~/.env/CIS490/,
Python prepends the clone path to sys.path. Subprocesses spawned
by the doctor (e.g., python -m shipper --ping) inherit the calling
CWD and pick up the clone's shipper/ package instead of the
service venv at /opt/cis490/. Symptom: tracebacks reference the
clone path, or No module named exploits despite package = false.
Fix already in main: the doctor passes cwd=/opt/cis490 to the
shipper subprocess and inserts repo_root into sys.path itself.
Operator action: always invoke either as
/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py
or via cd /opt/cis490 && ./tools/cis490_doctor.py. Don't run from a
clone unless you know what you're doing.
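The trap can be reproduced in isolation. The sketch below is not the doctor's code; it builds two throwaway directories that each contain a package with the same (hypothetical) name `demo_shipper`, then runs `python -m demo_shipper` with different `cwd` values. Because `python -m` puts the current working directory at the front of `sys.path`, the working directory alone decides which copy is imported, which is why pinning `cwd` on the subprocess helps.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def make_pkg(root: Path, label: str) -> None:
    """Create root/demo_shipper/ whose __main__ prints which copy ran."""
    pkg = root / "demo_shipper"  # hypothetical package name for the demo
    pkg.mkdir(parents=True)
    (pkg / "__init__.py").write_text("")
    (pkg / "__main__.py").write_text(f"print({label!r})\n")


def run_module(cwd: Path) -> str:
    """Run `python -m demo_shipper` from the given working directory."""
    # PYTHONSAFEPATH (3.11+) would disable the behaviour being demonstrated.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONSAFEPATH"}
    done = subprocess.run(
        [sys.executable, "-m", "demo_shipper"],
        cwd=cwd, env=env, capture_output=True, text=True, check=True,
    )
    return done.stdout.strip()


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        clone = Path(tmp) / "clone"          # stands in for ~/.env/CIS490/
        installed = Path(tmp) / "installed"  # stands in for /opt/cis490/
        make_pkg(clone, "clone copy")
        make_pkg(installed, "installed copy")
        # Same command, different cwd: a different package gets imported.
        return run_module(clone), run_module(installed)


if __name__ == "__main__":
    print(demo())
```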
Shipper logs "waiting on mTLS material" — this is expected, not a bug
The cis490-shipper unit is enabled by install-lab-host.sh before
the Pi has issued the host's mTLS leaf. The transport pre-flights the
configured ca_bundle / client_cert / client_key paths and, if
any are missing, defers building the SSL context. You'll see one
warning per process lifetime:
shipper waiting on mTLS material (client_cert path missing: …); will retry each request
The unit stays up. Each ping/ship attempt re-tries the build. Once
the Pi runs deploy-cis490-cert.sh <host_id> <wg_ip> and the leaf
lands at /etc/cis490/certs/, the next request succeeds and the
transport logs mTLS material now on disk; shipper transport ready.
Do not try to "fix" the warning by restarting the unit, deleting the config, or hand-rolling certs — just confirm the Pi-side step ran and wait one scan interval.
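To make the self-healing behavior concrete, here is a minimal Python sketch of a transport that defers building its SSL context until all three files exist. It is an illustration under assumptions, not the shipper's actual code: the class name `LazyMtlsTransport`, the injectable `builder` argument, and the file names used in `demo()` are hypothetical. The two log messages are the ones quoted above.

```python
import logging
import ssl
import tempfile
from pathlib import Path
from typing import Callable, Optional

log = logging.getLogger("shipper_sketch")


def build_ssl_context(ca_bundle: Path, client_cert: Path, client_key: Path) -> ssl.SSLContext:
    """Build a client-side mTLS context from files on disk."""
    ctx = ssl.create_default_context(cafile=str(ca_bundle))
    ctx.load_cert_chain(certfile=str(client_cert), keyfile=str(client_key))
    return ctx


class LazyMtlsTransport:
    """Hypothetical transport: pre-flight the paths, retry on every request."""

    def __init__(self, ca_bundle, client_cert, client_key,
                 builder: Callable[..., object] = build_ssl_context):
        self._paths = {
            "ca_bundle": Path(ca_bundle),
            "client_cert": Path(client_cert),
            "client_key": Path(client_key),
        }
        self._builder = builder
        self._ctx: Optional[object] = None
        self._warned = False

    def context(self) -> Optional[object]:
        """Return the context, or None while any file is still missing."""
        if self._ctx is not None:
            return self._ctx
        missing = [name for name, path in self._paths.items() if not path.is_file()]
        if missing:
            if not self._warned:  # warn once per process lifetime
                log.warning(
                    "shipper waiting on mTLS material (%s path missing: %s); "
                    "will retry each request", missing[0], self._paths[missing[0]])
                self._warned = True
            return None
        self._ctx = self._builder(**self._paths)
        if self._warned:
            log.info("mTLS material now on disk; shipper transport ready")
        return self._ctx


def demo() -> list:
    """Exercise the retry path with a stand-in builder (no real certs needed)."""
    with tempfile.TemporaryDirectory() as tmp:
        d = Path(tmp)
        transport = LazyMtlsTransport(
            d / "ca.pem", d / "leaf.pem", d / "leaf.key",
            builder=lambda **paths: "fake-context",
        )
        before = transport.context()  # None: nothing on disk yet
        for name in ("ca.pem", "leaf.pem", "leaf.key"):
            (d / name).write_text("placeholder")
        after = transport.context()   # built on the next attempt
        return [before, after]
```

Nothing in this pattern needs a restart: the process that logged the warning is the same one that later picks up the certs.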
install-lab-host.sh failures
Three install bugs were fixed in commit 95ac56a. If you're on an
older clone:
| Symptom | Cause | Fix |
|---|---|---|
| ModuleNotFoundError: pycdlib during cidata build | pycdlib was in dev deps, service venv only installs main deps | Pull main; pycdlib is in dependencies now |
| Episodes exit rc=1 in 15 s; launch_demo.sh can't find image | vm/images/ dir wasn't created before symlinking | Pull main; install script now install -d's the directory |
| cis490-doctor reports "tier3: No module named exploits" | sys.path didn't include repo root | Pull main; doctor inserts repo_root into sys.path |
If you hit any of these on a fresh install, pull main first before filing an issue — the issue is probably already closed.
One traceback at a time
When the doctor lights up multiple red rows, fix the topmost one and re-run rather than batching attempts. Each red row prints the exact operator command it expects you to run. Don't paraphrase or invent adjacent commands; the doctor is the source of truth for what's missing.
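As a sketch of that discipline for an agent driving the doctor programmatically: pick only the first red row, surface its fix command verbatim, and re-run before looking at anything else. The row shape (`name`, `status`, `fix`) and the sample rows are assumptions for illustration, since this document does not define the `--json` schema.

```python
from typing import Optional

# ASSUMPTION: hypothetical rows in the order the doctor printed them.
SAMPLE_ROWS = [
    {"name": "row-a", "status": "green", "fix": None},
    {"name": "row-b", "status": "red", "fix": "<exact command printed for row-b>"},
    {"name": "row-c", "status": "red", "fix": "<exact command printed for row-c>"},
]


def topmost_red(rows: list) -> Optional[dict]:
    """Return the first red row only; later red rows wait for the next run."""
    for row in rows:
        if row.get("status") == "red":
            return row
    return None


def next_action(rows: list) -> str:
    """Describe the single next step without paraphrasing the fix command."""
    row = topmost_red(rows)
    if row is None:
        return "no red rows; nothing to fix"
    return f"run exactly: {row['fix']} (then re-run the doctor)"


if __name__ == "__main__":
    print(next_action(SAMPLE_ROWS))
```

The second red row is deliberately ignored here, because fixing the first one may clear it or change its fix command.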
How an agent generates data on demand (without waiting for the timer)
# One labeled episode (90 s) with a chosen sample profile:
sudo -u cis490 /opt/cis490/.venv/bin/python \
/opt/cis490/tools/run_real_vm_demo.py \
--data-root /var/lib/cis490/data \
--sample mirai-class-bot
# Force the shipper to run one pass:
sudo systemctl start cis490-shipper.service # (if disabled)
# or:
sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
--config /etc/cis490/lab-host.toml --once
# Confirm on the Pi:
ssh <pi> 'sudo tail -3 /var/lib/cis490/index.jsonl'
Any tool that wants to count what's been shipped:
ssh <pi> 'sudo /opt/cis490/.venv/bin/python /opt/cis490/tools/index_reader.py \
--index /var/lib/cis490/index.jsonl --count-by host_id'
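For readers who want to see what a count-by over a JSONL index amounts to, here is a minimal Python sketch. It is not the real index_reader.py; it assumes only that each line of index.jsonl is one JSON object and that records carry a `host_id` field, which the `--count-by host_id` flag suggests. The sample host names are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_by(lines: Iterable[str], key: str) -> Counter:
    """Count JSONL records grouped by the value of `key`."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        counts[record.get(key, "<missing>")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical records; real index entries will have more fields.
    sample = [
        '{"host_id": "lab-01"}',
        '{"host_id": "lab-02"}',
        '{"host_id": "lab-01"}',
        "",
    ]
    print(dict(count_by(sample, "host_id")))
```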
Quick test for "is the pipe alive at all"
The --ping mode of the shipper does an authenticated POST to
/v1/ping and writes nothing to the database. Use it as the
smallest E2E test:
sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
--config /etc/cis490/lab-host.toml --ping
# {"ok": true, "status_code": 200, ...} on success.
# Any other status is the operator's signal to read the error string.
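An agent that scripts this check could interpret the result roughly as follows. This Python sketch assumes the ping output is a single JSON object on stdout with the `ok` and `status_code` fields shown above; the function name `pipe_alive` is hypothetical.

```python
import json


def pipe_alive(ping_output: str) -> bool:
    """True only for a well-formed result with ok == true and HTTP 200."""
    try:
        result = json.loads(ping_output)
    except json.JSONDecodeError:
        return False  # e.g. a traceback instead of JSON
    return (
        isinstance(result, dict)
        and result.get("ok") is True
        and result.get("status_code") == 200
    )


if __name__ == "__main__":
    print(pipe_alive('{"ok": true, "status_code": 200}'))
```

On a False result, read the error string in the output rather than guessing at a fix.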
File an issue when you hit a problem
When you run into an issue you cannot fully resolve in the current turn, file it as a Forgejo issue on the relevant repo before moving on. Do not silently log a TODO comment, leave a partial workaround, or assume someone else will remember. The issue tracker is the durable record.
This applies to:
- a build / test / typecheck failure you can't fix in scope
- a bug you discover but aren't tasked with fixing
- a missing dep, missing config, or env-only failure that blocks E2E
- a design gap you've worked around but want a follow-up to fix properly
- a scope-out you made (e.g. "deferred Tier 4 sample fetch") that needs an owner so it doesn't get lost
Don't file an issue when:
- the user is in the conversation and you can just tell them
- it's already filed (search first: GET /api/v1/repos/<owner>/<repo>/issues?state=open&q=<keyword>)
- it's truly a non-issue (a one-line edit you're about to make this same turn)
How to file (Forgejo API)
The local Forgejo at http://10.100.0.1:3000 accepts API calls with a
token-bearer header:
curl -s -X POST \
-H "Authorization: token <TOKEN>" \
-H "Content-Type: application/json" \
http://10.100.0.1:3000/api/v1/repos/spectral/<repo>/issues \
-d '{
"title": "<short, action-oriented title>",
"body": "<context, repro, attempted fixes, suggested next step>"
}'
The token comes from the user's session — never embed one in code or commits.
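The same call can be sketched in Python with only the standard library. This mirrors the curl example above (URL, header format, JSON fields); the function name `build_issue_request` is hypothetical, and the token is passed in by the caller rather than stored anywhere. The sketch only builds the request, so it can be inspected before anything is sent.

```python
import json
import urllib.request

FORGEJO = "http://10.100.0.1:3000"


def build_issue_request(repo: str, title: str, body: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST that files an issue on spectral/<repo>."""
    payload = json.dumps({"title": title, "body": body}).encode("utf-8")
    return urllib.request.Request(
        f"{FORGEJO}/api/v1/repos/spectral/{repo}/issues",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"token {token}",  # token supplied at call time
            "Content-Type": "application/json",
        },
    )
```

To actually file the issue, pass the returned request to `urllib.request.urlopen(req, timeout=10)` and check the response status.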
What a good issue body contains
- Context — one sentence on what was being attempted.
- What happened — the actual error, log line, or unexpected behavior. Paste exact output.
- What was tried — every workaround you attempted and why it didn't stick.
- Suggested next step — the smallest change that would resolve it, if you have a guess. "Unknown" is a fine answer.
- Related — link the commit / PR / file:line where the issue surfaced.
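One way to keep those five parts from being skipped is to assemble the body from named fields. The Python sketch below is only an illustration: the `IssueBody` class and the plain "Heading:" layout are assumptions, not a format the tracker requires.

```python
from dataclasses import dataclass


@dataclass
class IssueBody:
    """Hypothetical container for the five parts of a good issue body."""
    context: str
    what_happened: str
    what_was_tried: str
    suggested_next_step: str = "Unknown"  # "Unknown" is a fine answer
    related: str = ""

    def render(self) -> str:
        """Join the non-empty sections, each under its own heading."""
        sections = [
            ("Context", self.context),
            ("What happened", self.what_happened),
            ("What was tried", self.what_was_tried),
            ("Suggested next step", self.suggested_next_step),
            ("Related", self.related),
        ]
        return "\n\n".join(f"{title}:\n{text}" for title, text in sections if text)


if __name__ == "__main__":
    print(IssueBody(
        context="<one sentence on what was being attempted>",
        what_happened="<exact error output>",
        what_was_tried="<workarounds and why they did not stick>",
    ).render())
```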
What a good title looks like
| Bad | Good |
|---|---|
| tests broken | tests/test_episode.py: race when t_mono_origin_ns is set in run() not __init__ |
| caddy thing | Caddy: client_auth requires absolute path; relative trusted_ca_cert_file silently fails |
| fix later | shipper: 5xx backoff cap is 5min, doc says 1min — pick one |
After filing
- Reference the issue number in the next commit message: Refs spectral/<repo>#<n>, or Closes spectral/<repo>#<n> if your current change actually fixes it.
- If the issue is on a different repo than the one you're committing to, fully qualify: spectral/wg-pki#3.
Other conventions
- Don't put off the hard parts. Frame "deferred-with-reason" only for genuine blockers (binary not present on this machine, external service unreachable). For anything you could do but find awkward — bridge setup, cross-arch quirks, fleet concurrency — do it. The user has flagged this twice when work was scoped down prematurely. When something genuinely is blocked by an operator artifact, file the Forgejo issue and automate the bring-up (e.g., installer script + sha256-verifying fetcher) so the moment the artifact lands it Just Works.
- Naming: never coin USB / device / service names on the user's behalf. Ask first. Reusing an old name is especially bad.
- /etc configs: Read first, copy second. Never overwrite a /etc/... file from a template without checking what's actually there.
- wg-enroll scope: creation-only. Don't add admin / service-activation features to it.
- Don't expand a project's binary name beyond its own boundary: openclaw is the queue/permissions binary in openclaw-deploy. This repo is wg-enroll (or its caller). Don't conflate.