CIS490/AGENTS.md
max c80a36d3ae AGENTS.md: prescriptive guidance for smaller models on lab hosts
Smaller (non-4.7) Claude models act as on-device agents on CIS490 lab
hosts and have hit the install gotchas that became issues #10–#12.
Their reports describe symptoms well but miss inferred context — so
this expands the runbook with explicit "do this, not that" notes:

- run tools from /opt/cis490 not a clone (CWD-on-sys.path trap)
- shipper "waiting on mTLS material" is expected and self-heals; do
  not try to fix it manually
- table of the three install bugs already closed in main, so a fresh
  agent can recognize the symptom and pull instead of re-filing
- "fix one red row at a time" rather than batching attempts

Closes nothing new; this is the followup to #10/#11/#12 promised
during their resolution.
2026-04-30 16:19:09 -05:00


AGENTS.md — guidance for AI agents working on this repo

This project is part of the spectral lab (http://maxgit.wg/spectral/). The conventions below also apply to sibling repos (wg-enroll, wg-pki, caddy, iptmonads, matrix, forgejo, vault, openclaw-deploy).


How a lab host gets to "shipping data" — the canonical bring-up

If you (an AI agent or a human) are looking at a fresh lab host and asking "is this thing actually generating data for the central collector?", run these steps in order. Cloning the repo by itself does nothing, and running launchers from a manual clone bypasses the systemd services that do the actual work.

# 0. (One-time, on the Pi only.) Initialize the CIS490 client CA + a
#    leaf cert for THIS lab host. Get its WG IP from `wg-enroll-admin
#    show <usb>` first.
sudo /home/max/.env/wg-pki/scripts/init-cis490-client-ca.sh   # idempotent
sudo /home/max/.env/wg-pki/scripts/deploy-cis490-cert.sh \
     <host_id> <wg_ip>           # mints + scp's + extracts + chmods

# 1. (On the lab host.) Install the lab-host role. This copies the
#    repo into /opt/cis490, builds the venv, drops systemd units,
#    fetches the Alpine baseline qcow2, and builds the cidata ISO
#    with the in-guest agent embedded.
sudo /opt/cis490/scripts/install-lab-host.sh
# (or, on a first install where /opt/cis490 is not populated yet, from
#  the manual clone:)
#   sudo ./scripts/install-lab-host.sh

# 2. Edit /etc/cis490/lab-host.toml — set host_id and any overrides.

# 3. Verify everything before enabling the timer-driven services:
/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py \
    --role lab-host
# → green/yellow rows mean READY; red rows print the exact fix
#   command. Re-run until clean.

# 4. Turn on the services. From this moment on, the orchestrator runs
#    one fleet wave on each Restart= cycle, and the shipper picks up
#    completed episodes and PUTs them to https://collector.wg over mTLS.
sudo systemctl enable --now cis490-shipper cis490-orchestrator

# 5. (On the Pi.) Watch the index grow:
sudo tail -f /var/lib/cis490/index.jsonl

# 6. (Optional, Tier 3.) Enable real exploit fire — needs metasploit.
sudo /opt/cis490/scripts/install-msfrpcd.sh
# Operator-supplied URL + sha256 (Rapid7 download is registration-walled):
IMAGE_URL='…' IMAGE_SHA256='…' sudo OUT_DIR=/var/lib/cis490/vm/images \
    /opt/cis490/scripts/fetch-metasploitable2.sh
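
Step 6 hands the fetcher an operator-supplied URL and sha256. As a hedged illustration of the verification half only (this is not the contents of fetch-metasploitable2.sh, and the function names are hypothetical), a streaming digest check in Python can look like this:

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images never sit in memory whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_image(path: str, expected_sha256: str) -> bool:
    """Compare the file's digest with the operator-supplied hex string."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

If verify_image returns False, the download should be discarded rather than used.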

If index.jsonl doesn't grow within a wave-interval (~60 s after systemctl enable --now), run cis490-doctor again. The most common silent failures it catches:

  • *.wg DNS missing (wg-enroll provisions it; manual workaround is one line in /etc/hosts)
  • mTLS cert chain not installed under /etc/cis490/certs/
  • cis490-shipper service inactive (forgot step 4)
  • qemu-system-x86_64 not on PATH

cis490-doctor --json is machine-readable for use by other agents.
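
Several of the silent failures listed above are cheap to probe by hand. The following is a minimal Python sketch of equivalent checks, not the doctor's actual implementation. The function names are hypothetical; the binary name, hostname, and cert directory in the usage note come from the steps above.

```python
import shutil
import socket
from pathlib import Path


def binary_on_path(name: str) -> bool:
    """True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


def host_resolves(hostname: str) -> bool:
    """True if the local resolver (DNS or /etc/hosts) knows `hostname`."""
    try:
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return False
    return True


def cert_dir_populated(cert_dir: str) -> bool:
    """True if the directory exists and holds at least one regular file."""
    path = Path(cert_dir)
    try:
        return path.is_dir() and any(p.is_file() for p in path.iterdir())
    except OSError:
        # An unreadable directory is reported as "not usable by this user".
        return False
```

For example, binary_on_path("qemu-system-x86_64"), host_resolves("collector.wg"), and cert_dir_populated("/etc/cis490/certs") mirror three of the four bullets. The doctor remains the source of truth; this only shows what each row is looking for.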

Common bring-up gotchas (read this before debugging an install)

Smaller models acting as on-device agents have hit these traps. The underlying bugs are now fixed in main, but if you're on an older clone you may still see the symptom; pull origin/main first, then re-read.

Run tools from /opt/cis490, not from a manual clone

When you run cis490-doctor with your working directory inside a clone like ~/.env/CIS490/, subprocesses spawned by the doctor (e.g., python -m shipper --ping) inherit that CWD. Because python -m puts the current directory at the front of sys.path, they import the clone's shipper/ package instead of the installed copy under /opt/cis490/. Symptom: tracebacks reference the clone path, or No module named exploits despite package = false.

Fix already in main: the doctor passes cwd=/opt/cis490 to the shipper subprocess and inserts repo_root into sys.path itself. Operator action: always invoke the doctor either as /opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py or via cd /opt/cis490 && ./tools/cis490_doctor.py. Don't run from a clone unless you know what you're doing.
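
The trap is easy to reproduce in isolation. The sketch below is an illustration, not project code: it builds two throwaway copies of a package with the hypothetical name demo_shipper and shows that python -m imports whichever copy sits in the subprocess's working directory. Passing cwd= explicitly, as the doctor now does, is what pins the result.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def make_pkg(root: Path, label: str) -> None:
    """Create a tiny `demo_shipper` package that prints `label` under -m."""
    pkg = root / "demo_shipper"
    pkg.mkdir(parents=True)
    (pkg / "__init__.py").write_text("")
    (pkg / "__main__.py").write_text(f"print({label!r})\n")


def run_module(cwd: Path) -> str:
    """Run `python -m demo_shipper` with an explicit working directory."""
    result = subprocess.run(
        [sys.executable, "-m", "demo_shipper"],
        cwd=cwd, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def demo() -> tuple:
    """Return what gets imported from the 'clone' and 'installed' trees."""
    with tempfile.TemporaryDirectory() as tmp:
        clone = Path(tmp) / "clone"          # stands in for ~/.env/CIS490
        installed = Path(tmp) / "installed"  # stands in for /opt/cis490
        make_pkg(clone, "clone")
        make_pkg(installed, "installed")
        return run_module(clone), run_module(installed)
```

The same command line yields different code depending only on cwd, which is why the operator guidance is to start from /opt/cis490.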

Shipper logs "waiting on mTLS material" — this is expected, not a bug

The cis490-shipper unit is enabled by install-lab-host.sh before the Pi has issued the host's mTLS leaf. The transport pre-flights the configured ca_bundle / client_cert / client_key paths and, if any are missing, defers building the SSL context. You'll see one warning per process lifetime:

shipper waiting on mTLS material (client_cert path missing: …); will retry each request

The unit stays up. Each ping/ship attempt retries the build. Once the Pi runs deploy-cis490-cert.sh <host_id> <wg_ip> and the leaf lands at /etc/cis490/certs/, the next request succeeds and the transport logs "mTLS material now on disk; shipper transport ready".

Do not try to "fix" the warning by restarting the unit, deleting the config, or hand-rolling certs — just confirm the Pi-side step ran and wait one scan interval.
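
The deferred-build behavior described above can be sketched as follows. This is a minimal illustration under assumptions, not the shipper's transport code: the class name DeferredMTLS and its methods are hypothetical, and the log strings are only modelled on the ones quoted above.

```python
import logging
import ssl
from pathlib import Path
from typing import Optional

log = logging.getLogger("shipper_sketch")


class DeferredMTLS:
    """Hypothetical helper: build the SSL context only once all files exist."""

    def __init__(self, ca_bundle: str, client_cert: str, client_key: str) -> None:
        self.paths = {
            "ca_bundle": Path(ca_bundle),
            "client_cert": Path(client_cert),
            "client_key": Path(client_key),
        }
        self._ctx: Optional[ssl.SSLContext] = None
        self._warned = False

    def missing(self) -> list:
        """Names of configured paths that are not regular files yet."""
        return [name for name, path in self.paths.items() if not path.is_file()]

    def context(self) -> Optional[ssl.SSLContext]:
        """Return a ready context, or None while material is still missing."""
        if self._ctx is not None:
            return self._ctx
        missing = self.missing()
        if missing:
            if not self._warned:  # warn once per process lifetime
                name = missing[0]
                log.warning(
                    "shipper waiting on mTLS material (%s path missing: %s); "
                    "will retry each request", name, self.paths[name])
                self._warned = True
            return None
        ctx = ssl.create_default_context(cafile=str(self.paths["ca_bundle"]))
        ctx.load_cert_chain(certfile=str(self.paths["client_cert"]),
                            keyfile=str(self.paths["client_key"]))
        self._ctx = ctx
        log.info("mTLS material now on disk; shipper transport ready")
        return ctx
```

Each request calls context() again, so nothing needs restarting: the first call after the files land builds the context and every later call reuses it.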

install-lab-host.sh failures

Three install bugs were fixed in commit 95ac56a. If you're on an older clone:

| Symptom | Cause | Fix |
| --- | --- | --- |
| ModuleNotFoundError: pycdlib during cidata build | pycdlib was in dev deps, service venv only installs main deps | Pull main; pycdlib is in dependencies now |
| Episodes exit rc=1 in 15 s; launch_demo.sh can't find image | vm/images/ dir wasn't created before symlinking | Pull main; install script now install -d's the directory |
| cis490-doctor reports "tier3: No module named exploits" | sys.path didn't include repo root | Pull main; doctor inserts repo_root into sys.path |

If you hit any of these on a fresh install, pull main first before filing an issue — the issue is probably already closed.

One traceback at a time

When the doctor lights up multiple red rows, fix the topmost one and re-run rather than batching attempts. Each red row prints the exact operator command it expects you to run. Don't paraphrase or invent adjacent commands; the doctor is the source of truth for what's missing.
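
An agent consuming cis490-doctor --json can enforce the same discipline mechanically. The sketch below is hedged: the JSON schema is an assumption (a list of row objects with a "status" key, in the same top-to-bottom order as the human-readable output), and the function name is hypothetical. Check the real output before relying on it.

```python
import json
from typing import Optional


def first_red_row(doctor_json: str) -> Optional[dict]:
    """Return the topmost row whose status is "red", or None if all pass.

    ASSUMPTION: the report is a JSON list of objects with a "status" key.
    """
    rows = json.loads(doctor_json)
    for row in rows:
        if row.get("status") == "red":
            return row
    return None


# Made-up sample input, shaped like the assumed schema above.
SAMPLE = json.dumps([
    {"check": "example-a", "status": "green"},
    {"check": "example-b", "status": "red", "fix": "<command printed by the doctor>"},
    {"check": "example-c", "status": "red", "fix": "<another printed command>"},
])
```

first_red_row(SAMPLE) returns only the example-b row; the example-c row is deliberately ignored until the doctor has been re-run.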

How an agent generates data on demand (without waiting for the timer)

# One labeled episode (90 s) with a chosen sample profile:
sudo -u cis490 /opt/cis490/.venv/bin/python \
    /opt/cis490/tools/run_real_vm_demo.py \
    --data-root /var/lib/cis490/data \
    --sample mirai-class-bot

# Force the shipper to run one pass:
sudo systemctl start cis490-shipper.service   # (if disabled)
# or:
sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
    --config /etc/cis490/lab-host.toml --once

# Confirm on the Pi:
ssh <pi> 'sudo tail -3 /var/lib/cis490/index.jsonl'

Any tool that wants to count what's been shipped:

ssh <pi> 'sudo /opt/cis490/.venv/bin/python /opt/cis490/tools/index_reader.py \
    --index /var/lib/cis490/index.jsonl --count-by host_id'
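
For a tool that cannot call index_reader.py, the count is simple to reproduce. This is a minimal sketch, assuming each line of index.jsonl is one JSON object that carries the grouped key; it is not index_reader.py's implementation, and the function name is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_by(lines: Iterable[str], key: str) -> Counter:
    """Count JSONL records per value of `key`, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Records without the key are grouped under a visible placeholder.
        counts[record.get(key, "<missing>")] += 1
    return counts
```

Because count_by accepts any iterable of lines, an open file handle on /var/lib/cis490/index.jsonl can be passed directly.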

Quick test for "is the pipe alive at all"

The --ping mode of the shipper does an authenticated POST to /v1/ping and writes nothing to the database. Use it as the smallest E2E test:

sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
    --config /etc/cis490/lab-host.toml --ping
# {"ok": true, "status_code": 200, ...} on success.
# Any other status is the operator's signal to read the error string.
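
An agent scripting this check can parse the result rather than eyeball it. A minimal sketch, assuming the ping mode prints a single JSON object on stdout with the "ok" and "status_code" fields shown above (the function name is hypothetical):

```python
import json


def ping_succeeded(stdout: str) -> bool:
    """True only for a JSON object with ok == true and status_code == 200."""
    try:
        result = json.loads(stdout)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(result, dict)
        and result.get("ok") is True
        and result.get("status_code") == 200
    )
```

Anything that makes ping_succeeded return False, including unparseable output, should be surfaced to the operator together with the raw text.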

File an issue when you hit a problem

When you run into an issue you cannot fully resolve in the current turn, file it as a Forgejo issue on the relevant repo before moving on. Do not silently log a TODO comment, leave a partial workaround, or assume someone else will remember. The issue tracker is the durable record.

This applies to:

  • a build / test / typecheck failure you can't fix in scope
  • a bug you discover but aren't tasked with fixing
  • a missing dep, missing config, or env-only failure that blocks E2E
  • a design gap you've worked around but want a follow-up to fix properly
  • a scope-out you made (e.g. "deferred Tier 4 sample fetch") that needs an owner so it doesn't get lost

Don't file an issue when:

  • the user is in the conversation and you can just tell them
  • it's already filed (search first: GET /api/v1/repos/<owner>/<repo>/issues?state=open&q=<keyword>)
  • it's truly a non-issue (a one-line edit you're about to make this same turn)

How to file (Forgejo API)

The local Forgejo at http://10.100.0.1:3000 accepts API calls with a token-bearer header:

curl -s -X POST \
  -H "Authorization: token <TOKEN>" \
  -H "Content-Type: application/json" \
  http://10.100.0.1:3000/api/v1/repos/spectral/<repo>/issues \
  -d '{
    "title": "<short, action-oriented title>",
    "body":  "<context, repro, attempted fixes, suggested next step>"
  }'

The token comes from the user's session — never embed one in code or commits.
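
The search-first rule and the POST above can be combined in one helper. The following is a sketch under assumptions rather than a project tool: the function names are hypothetical, the token is passed in by the caller (for example read from an environment variable of the operator's choosing), and only the two endpoints already shown in this document are used.

```python
import json
import urllib.parse
import urllib.request

BASE = "http://10.100.0.1:3000/api/v1"


def search_request(owner: str, repo: str, keyword: str, token: str):
    """Build the GET that lists open issues matching `keyword`."""
    query = urllib.parse.urlencode({"state": "open", "q": keyword})
    url = f"{BASE}/repos/{owner}/{repo}/issues?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"token {token}"})


def create_request(owner: str, repo: str, title: str, body: str, token: str):
    """Build the POST that files a new issue."""
    payload = json.dumps({"title": title, "body": body}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE}/repos/{owner}/{repo}/issues",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )


def file_unless_duplicate(owner, repo, keyword, title, body, token):
    """Search first; create the issue only if nothing open matches."""
    with urllib.request.urlopen(search_request(owner, repo, keyword, token),
                                timeout=10) as resp:
        existing = json.load(resp)
    if existing:
        return existing[0]  # already filed: reuse it instead of re-filing
    with urllib.request.urlopen(create_request(owner, repo, title, body, token),
                                timeout=10) as resp:
        return json.load(resp)
```

Keeping the request builders separate from the network calls makes them testable offline, and keeps the token out of any string that might end up in a commit.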

What a good issue body contains

  1. Context — one sentence on what was being attempted.
  2. What happened — the actual error, log line, or unexpected behavior. Paste exact output.
  3. What was tried — every workaround you attempted and why it didn't stick.
  4. Suggested next step — the smallest change that would resolve it, if you have a guess. "Unknown" is a fine answer.
  5. Related — link the commit / PR / file:line where the issue surfaced.
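
One way to make sure no section is skipped is to render the body from named fields. This is a hypothetical helper, shown only as a sketch of the five-part structure above; the section labels and the function name are assumptions.

```python
def format_issue_body(context, what_happened, tried,
                      next_step="Unknown", related=""):
    """Render the five sections in order; `tried` is a list of strings."""
    tried_lines = "\n".join(f"- {item}" for item in tried) or "- (nothing yet)"
    sections = [
        ("Context", context),
        ("What happened", what_happened),
        ("What was tried", tried_lines),
        ("Suggested next step", next_step),
        ("Related", related or "(none)"),
    ]
    return "\n\n".join(f"{title}:\n{text}" for title, text in sections)
```

The next_step default mirrors the guidance that "Unknown" is an acceptable answer.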

What a good title looks like

| Bad | Good |
| --- | --- |
| tests broken | tests/test_episode.py: race when t_mono_origin_ns is set in run() not __init__ |
| caddy thing | Caddy: client_auth requires absolute path; relative trusted_ca_cert_file silently fails |
| fix later | shipper: 5xx backoff cap is 5min, doc says 1min; pick one |

After filing

  • Reference the issue number in the next commit message: Refs spectral/<repo>#<n> or Closes spectral/<repo>#<n> if your current change actually fixes it.
  • If the issue is on a different repo than the one you're committing to, fully qualify: spectral/wg-pki#3.

Other conventions

  • Don't put off the hard parts. Frame "deferred-with-reason" only for genuine blockers (binary not present on this machine, external service unreachable). For anything you could do but find awkward — bridge setup, cross-arch quirks, fleet concurrency — do it. The user has flagged this twice when work was scoped down prematurely. When something genuinely is blocked by an operator artifact, file the Forgejo issue and automate the bring-up (e.g., installer script + sha256-verifying fetcher) so the moment the artifact lands it Just Works.
  • Naming: never coin USB / device / service names on the user's behalf. Ask first. Reusing an old name is especially bad.
  • /etc configs: Read first, copy second. Never overwrite a /etc/... file from a template without checking what's actually there.
  • wg-enroll scope: creation-only. Don't add admin / service-activation features to it.
  • Don't expand a project's binary name beyond its own boundary: openclaw is the queue/permissions binary in openclaw-deploy. This repo is wg-enroll (or its caller). Don't conflate.
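
The "read first, copy second" rule for /etc configs can be sketched as a guard around the copy. This is an illustration under assumptions, not an existing project helper: the function name is hypothetical, and returning a diff instead of overwriting is one possible policy.

```python
import difflib
import shutil
from pathlib import Path


def install_config(template: str, dest: str) -> list:
    """Copy `template` to `dest` only if `dest` does not exist yet.

    Returns [] when the file was installed or already matches the template.
    Otherwise returns a unified diff and leaves `dest` untouched.
    """
    src, dst = Path(template), Path(dest)
    if not dst.exists():
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        return []
    current = dst.read_text().splitlines(keepends=True)
    wanted = src.read_text().splitlines(keepends=True)
    return list(difflib.unified_diff(current, wanted,
                                     fromfile=str(dst), tofile=str(src)))
```

A non-empty return value is the cue to show the diff to the user and ask, rather than to overwrite.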