max 6f8b744c33 cis490-doctor + AGENTS.md operator runbook + louder install script

Adds the missing diagnostic + onboarding tools so an agent (AI or
human) handed a fresh lab host can get to "shipping data" without
re-deriving every step from logs.

tools/cis490_doctor.py — one-shot health check that walks the full
stack from the bottom up. Each row is green/yellow/red with an
exact fix command for the red rows. Checks:
  - repo: branch, tree-clean, distance from origin/main
  - install: /opt/cis490, .venv python, /etc/cis490/{lab-host,receiver}.toml,
    /etc/cis490/lab-host.env
  - mTLS: /etc/cis490/certs/{wg-ca,lab-host}.{pem,key}, openssl chain verify
  - systemd: cis490-{shipper,orchestrator,receiver} active state
  - net: receiver.url DNS, TCP reach, mTLS handshake to collector.wg
  - vm prereqs: /dev/kvm, qemu-system-x86_64, zstd, alpine-baseline.qcow2,
    cidata.iso
  - tier3 prereqs: msfrpcd, metasploitable2.qcow2 (warn-level)
  - end-to-end: cis490-shipper --ping
Modes: --role {lab-host,receiver}, --json (machine-readable),
--no-tier3 (skip optional checks). Exits non-zero on any red row.
ANSI color (auto-disabled on non-tty / NO_COLOR).

AGENTS.md gains a "How a lab host gets to shipping data" canonical
flow at the top: cert delivery via wg-pki/deploy-cis490-cert.sh →
install-lab-host.sh → cis490-doctor → systemctl enable. Plus an
"on-demand episode" recipe + a "smallest E2E test" snippet for
agents that need to verify the pipe without waiting on the timer.
The strict "cloning the repo by itself does nothing" callout makes
the failure mode mu and elliott-lab hit explicit.

scripts/install-lab-host.sh prints a 5-step banner on first install
that points at cis490_doctor.py + the deploy-cis490-cert.sh flow,
plus an always-printed footer warning that "cloning + running
launchers manually is NOT enough." Same message the AGENTS.md
section reinforces.

Refs spectral/CIS490#8 (the "Tier-2 is shipping in the meantime"
claim that turned out to be untrue because no cis490-shipper
service was running on elliott-lab — exactly the case this
diagnostic tool targets).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-30 01:11:57 -05:00

7.7 KiB


AGENTS.md — guidance for AI agents working on this repo

This project is part of the spectral lab (http://maxgit.wg/spectral/). The conventions below also apply to sibling repos (wg-enroll, wg-pki, caddy, iptmonads, matrix, forgejo, vault, openclaw-deploy).

How a lab host gets to "shipping data" — the canonical bring-up

If you (an AI agent OR a human) are looking at a fresh lab host and asking "is this thing actually generating data for the central collector?", run this in order. Cloning the repo by itself does nothing. Running launchers from a manual clone bypasses the systemd services that do the actual work.

# 0. (One-time, on the Pi only.) Initialize the CIS490 client CA + a
#    leaf cert for THIS lab host. Get its WG IP from `wg-enroll-admin
#    show <usb>` first.
sudo /home/max/.env/wg-pki/scripts/init-cis490-client-ca.sh   # idempotent
sudo /home/max/.env/wg-pki/scripts/deploy-cis490-cert.sh \
     <host_id> <wg_ip>           # mints + scp's + extracts + chmods

# 1. (On the lab host.) Install the lab-host role. This copies the
#    repo into /opt/cis490, builds the venv, drops systemd units,
#    fetches the Alpine baseline qcow2, and builds the cidata ISO
#    with the in-guest agent embedded.
sudo /opt/cis490/scripts/install-lab-host.sh
# (or, if running from the manual clone:)
#   sudo ./scripts/install-lab-host.sh

# 2. Edit /etc/cis490/lab-host.toml — set host_id and any overrides.

# 3. Verify everything before enabling the timer-driven services:
/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py \
    --role lab-host
# → all rows green or yellow means READY; each red row prints the
#   exact fix command. Re-run until no red rows remain.

# 4. Turn on the services. From this moment on, the orchestrator runs
#    one fleet wave on each Restart= cycle, and the shipper picks up
#    completed episodes and PUTs them to https://collector.wg over mTLS.
sudo systemctl enable --now cis490-shipper cis490-orchestrator

# 5. (On the Pi.) Watch the index grow:
sudo tail -f /var/lib/cis490/index.jsonl

# 6. (Optional, Tier 3.) Enable real exploit fire — needs metasploit.
sudo /opt/cis490/scripts/install-msfrpcd.sh
# Operator-supplied URL + sha256 (Rapid7 download is registration-walled):
IMAGE_URL='…' IMAGE_SHA256='…' sudo OUT_DIR=/var/lib/cis490/vm/images \
    /opt/cis490/scripts/fetch-metasploitable2.sh
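
Step 6 hands the fetch script an expected digest through IMAGE_SHA256. The script's internals are not reproduced here, so the following is only a rough sketch of what a check of that kind usually involves: hash the downloaded file in chunks and compare against the expected value. The function name `sha256_matches` and the chunk size are illustrative assumptions, not part of the repo.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_matches(path: Path, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's SHA-256 digest equals expected_hex.

    Hypothetical helper for illustration; fetch-metasploitable2.sh may
    implement its verification differently.
    """
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so a multi-GB qcow2 image is never loaded whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    # Normalize case and surrounding whitespace before comparing hex strings.
    return hmac.compare_digest(digest.hexdigest(), expected_hex.strip().lower())
```

A caller would typically delete the downloaded file and stop when this returns False, rather than leave an unverified image in place.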

If index.jsonl doesn't grow within a wave-interval (~60 s after systemctl enable --now), run cis490-doctor again. The most common silent failures it catches:

- *.wg DNS missing (wg-enroll provisions it; manual workaround is one line in /etc/hosts)
- mTLS cert chain not installed under /etc/cis490/certs/
- cis490-shipper service inactive (forgot step 4)
- qemu-system-x86_64 not on PATH
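
Three of the failures listed above can be probed from Python with the standard library alone. The helpers below are a minimal sketch of that idea, not a replacement for cis490-doctor; the function names are made up for illustration, and the cert file names a caller passes in are an assumption to be adjusted to the real layout.

```python
import os
import shutil
import socket
from typing import List, Sequence


def binary_on_path(name: str) -> bool:
    """True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


def missing_files(directory: str, names: Sequence[str]) -> List[str]:
    """Return the entries of `names` that are not regular files under `directory`."""
    return [n for n in names if not os.path.isfile(os.path.join(directory, n))]


def resolves(hostname: str) -> bool:
    """True if the local resolver (including /etc/hosts) can resolve `hostname`."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False
```

For example, `binary_on_path("qemu-system-x86_64")`, `resolves("collector.wg")`, and `missing_files("/etc/cis490/certs", [...])` with the expected cert names cover the PATH, DNS, and cert-install cases respectively. Whether the shipper service is active is better left to systemctl or the doctor itself.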

cis490-doctor --json is machine-readable for use by other agents.
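
As a sketch of how another agent might consume that output, the snippet below runs the doctor and relies only on two things stated in this repo: the `--json` flag and the non-zero exit status on any red row. The JSON schema is not documented here, so the report is returned as-is without assuming any field names. `run_doctor` is a hypothetical helper name.

```python
import json
import subprocess
from typing import Any, Sequence, Tuple

# Paths taken from step 3 of the bring-up; adjust if your install differs.
DOCTOR_CMD = (
    "/opt/cis490/.venv/bin/python",
    "/opt/cis490/tools/cis490_doctor.py",
    "--role", "lab-host",
    "--json",
)


def run_doctor(cmd: Sequence[str] = DOCTOR_CMD) -> Tuple[bool, Any]:
    """Run the doctor and return (healthy, parsed_report).

    `healthy` is derived from the exit status only (zero means no red rows).
    `parsed_report` is whatever json.loads yields, or None if stdout is not
    valid JSON; no particular report structure is assumed.
    """
    proc = subprocess.run(list(cmd), capture_output=True, text=True, check=False)
    try:
        report = json.loads(proc.stdout)
    except json.JSONDecodeError:
        report = None
    return proc.returncode == 0, report
```

Keying the decision on the exit status keeps the caller working even if the report format changes.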

How an agent generates data on demand (without waiting for the timer)

# One labeled episode (90 s) with a chosen sample profile:
sudo -u cis490 /opt/cis490/.venv/bin/python \
    /opt/cis490/tools/run_real_vm_demo.py \
    --data-root /var/lib/cis490/data \
    --sample mirai-class-bot

# Force the shipper to run one pass:
sudo systemctl start cis490-shipper.service   # (if disabled)
# or:
sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
    --config /etc/cis490/lab-host.toml --once

# Confirm on the Pi:
ssh <pi> 'sudo tail -3 /var/lib/cis490/index.jsonl'

Any tool that wants to count what's been shipped:

ssh <pi> 'sudo /opt/cis490/.venv/bin/python /opt/cis490/tools/index_reader.py \
    --index /var/lib/cis490/index.jsonl --count-by host_id'
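
For a tool that has the index file locally, the same kind of count can be sketched in a few lines of Python. This is not the implementation of index_reader.py; it assumes only that index.jsonl holds one JSON object per line and that records carry the `host_id` field named in the command above.

```python
import json
from collections import Counter
from pathlib import Path


def count_by(index_path: Path, field: str = "host_id") -> Counter:
    """Count JSON Lines records in index_path, grouped by `field`."""
    counts: Counter = Counter()
    with index_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a partially written trailing line
            if isinstance(record, dict):
                # Records lacking the field are grouped under a placeholder key.
                counts[str(record.get(field, "<missing>"))] += 1
    return counts
```

Skipping undecodable lines is a deliberate choice for a file that may be appended to while it is being read.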

Quick test for "is the pipe alive at all"

The --ping mode of the shipper does an authenticated POST to /v1/ping and writes nothing to the database. Use it as the smallest E2E test:

sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
    --config /etc/cis490/lab-host.toml --ping
# {"ok": true, "status_code": 200, ...} on success.
# Any other status is the operator's signal to read the error string.
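
An agent that captures the ping output can turn it into a yes/no answer. The sketch below assumes the shipper prints a single JSON object on stdout containing the `ok` and `status_code` keys shown in the comment above; `ping_ok` is a hypothetical helper name.

```python
import json


def ping_ok(output: str) -> bool:
    """True only if output is a JSON object with ok == true and status_code == 200."""
    try:
        result = json.loads(output)
    except json.JSONDecodeError:
        return False  # not JSON at all, e.g. a traceback or connection error text
    return (
        isinstance(result, dict)
        and result.get("ok") is True
        and result.get("status_code") == 200
    )
```

When this returns False, the raw output should still be surfaced to the operator, since the error string is where the diagnosis lives.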

File an issue when you hit a problem

When you run into an issue you cannot fully resolve in the current turn, file it as a Forgejo issue on the relevant repo before moving on. Do not silently log a TODO comment, leave a partial workaround, or assume someone else will remember. The issue tracker is the durable record.

This applies to:

- a build / test / typecheck failure you can't fix in scope
- a bug you discover but aren't tasked with fixing
- a missing dep, missing config, or env-only failure that blocks E2E
- a design gap you've worked around but want a follow-up to fix properly
- a scope-out you made (e.g. "deferred Tier 4 sample fetch") that needs an owner so it doesn't get lost

Don't file an issue when:

- the user is in the conversation and you can just tell them
- it's already filed (search first: GET /api/v1/repos/<owner>/<repo>/issues?state=open&q=<keyword>)
- it's truly a non-issue (a one-line edit you're about to make this same turn)
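
The "search first" step needs the keyword URL-encoded before it goes into the query string. A minimal sketch of building that URL follows; the endpoint path and query parameters are the ones quoted in the list above, while `issue_search_url` and its parameter names are illustrative.

```python
from urllib.parse import quote, urlencode


def issue_search_url(base_url: str, owner: str, repo: str, keyword: str) -> str:
    """Build the open-issue search URL for the given repo and keyword."""
    path = f"/api/v1/repos/{quote(owner, safe='')}/{quote(repo, safe='')}/issues"
    # urlencode escapes spaces and other reserved characters in the keyword.
    query = urlencode({"state": "open", "q": keyword})
    return f"{base_url.rstrip('/')}{path}?{query}"
```

The resulting URL can be fetched with any HTTP client, using the same authorization header as other API calls.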

How to file (Forgejo API)

The local Forgejo at http://10.100.0.1:3000 accepts API calls authenticated with an Authorization: token <TOKEN> header:

curl -s -X POST \
  -H "Authorization: token <TOKEN>" \
  -H "Content-Type: application/json" \
  http://10.100.0.1:3000/api/v1/repos/spectral/<repo>/issues \
  -d '{
    "title": "<short, action-oriented title>",
    "body":  "<context, repro, attempted fixes, suggested next step>"
  }'

The token comes from the user's session — never embed one in code or commits.
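
For agents that would rather not shell out to curl, the same request can be sketched with Python's standard library. The URL, headers, and JSON fields mirror the curl call above; the function names are hypothetical, and FORGEJO_TOKEN is an assumed environment variable name chosen only to show the token being supplied at run time.

```python
import json
import urllib.request


def build_issue_request(
    base_url: str, repo: str, title: str, body: str, token: str
) -> urllib.request.Request:
    """Build (but do not send) the POST that files an issue under spectral/<repo>."""
    url = f"{base_url.rstrip('/')}/api/v1/repos/spectral/{repo}/issues"
    payload = json.dumps({"title": title, "body": body}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )


def file_issue(req: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Usage (the token is read from the environment, never hard-coded):
#   import os
#   req = build_issue_request("http://10.100.0.1:3000", "<repo>",
#                             "<title>", "<body>", os.environ["FORGEJO_TOKEN"])
#   created = file_issue(req)
```

Splitting construction from sending makes the request easy to inspect or log (minus the token) before anything hits the network.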

What a good issue body contains

- Context: one sentence on what was being attempted.
- What happened: the actual error, log line, or unexpected behavior. Paste exact output.
- What was tried: every workaround you attempted and why it didn't stick.
- Suggested next step: the smallest change that would resolve it, if you have a guess. "Unknown" is a fine answer.
- Related: link the commit / PR / file:line where the issue surfaced.
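
One way to make sure no section is forgotten is to assemble the body from named arguments. The sketch below does that for the five sections above; the plain-text layout it produces and the name `format_issue_body` are assumptions, not a format the tracker requires.

```python
from typing import Optional, Sequence


def format_issue_body(
    context: str,
    what_happened: str,
    what_was_tried: Sequence[str],
    next_step: str = "Unknown",
    related: Optional[Sequence[str]] = None,
) -> str:
    """Assemble an issue body containing the five recommended sections."""

    def bullets(items: Optional[Sequence[str]]) -> str:
        # Render a list as "- item" lines, or a placeholder when it is empty.
        return "\n".join(f"- {item}" for item in items) if items else "- (none)"

    sections = [
        ("Context", context),
        ("What happened", what_happened),
        ("What was tried", bullets(what_was_tried)),
        ("Suggested next step", next_step),
        ("Related", bullets(related)),
    ]
    return "\n\n".join(f"{title}:\n{text}" for title, text in sections)
```

Defaulting `next_step` to "Unknown" matches the guidance that an honest "Unknown" beats a guess presented as fact.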

What a good title looks like

| Bad | Good |
| --- | --- |
| `tests broken` | `tests/test_episode.py: race when t_mono_origin_ns is set in run() not __init__` |
| `caddy thing` | `Caddy: client_auth requires absolute path; relative trusted_ca_cert_file silently fails` |
| `fix later` | `shipper: 5xx backoff cap is 5min, doc says 1min; pick one` |

After filing

- Reference the issue number in the next commit message: Refs spectral/<repo>#<n> or Closes spectral/<repo>#<n> if your current change actually fixes it.
- If the issue is on a different repo than the one you're committing to, fully qualify: spectral/wg-pki#3.
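
The reference format is simple enough to generate rather than type by hand. A small sketch, with `issue_ref` as a hypothetical helper name, that always emits the fully qualified form:

```python
def issue_ref(repo: str, number: int, closes: bool = False, owner: str = "spectral") -> str:
    """Return a fully qualified trailer such as 'Refs spectral/wg-pki#3'."""
    if number < 1:
        raise ValueError("issue numbers start at 1")
    # Use "Closes" only when the current change actually fixes the issue.
    verb = "Closes" if closes else "Refs"
    return f"{verb} {owner}/{repo}#{number}"
```

Always qualifying with owner and repo sidesteps the cross-repo case entirely, at the cost of a slightly longer line.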

Other conventions

- Don't put off the hard parts. Frame "deferred-with-reason" only for genuine blockers (binary not present on this machine, external service unreachable). For anything you could do but find awkward — bridge setup, cross-arch quirks, fleet concurrency — do it. The user has flagged this twice when work was scoped down prematurely. When something genuinely is blocked by an operator artifact, file the Forgejo issue and automate the bring-up (e.g., installer script + sha256-verifying fetcher) so the moment the artifact lands it Just Works.
- Naming: never coin USB / device / service names on the user's behalf. Ask first. Reusing an old name is especially bad.
- /etc configs: Read first, copy second. Never overwrite a /etc/... file from a template without checking what's actually there.
- wg-enroll scope: creation-only. Don't add admin / service-activation features to it.
- Don't expand a project's binary name beyond its own boundary: openclaw is the queue/permissions binary in openclaw-deploy. This repo is wg-enroll (or its caller). Don't conflate.
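
The "read first, copy second" rule for /etc configs can be expressed as a guard around the copy. The following is a minimal sketch under the assumption that a differing target should simply be left alone for someone to read and merge; `install_config` and its return values are illustrative, not an existing helper in this repo.

```python
import filecmp
import shutil
from pathlib import Path


def install_config(template: Path, target: Path) -> str:
    """Copy template to target only when that cannot clobber existing content.

    Returns 'installed', 'unchanged', or 'kept-existing'.
    """
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(template, target)
        return "installed"
    if filecmp.cmp(template, target, shallow=False):
        return "unchanged"  # byte-identical, nothing to do
    # Target differs from the template: do not overwrite it.
    return "kept-existing"
```

A caller that gets "kept-existing" back should show the difference to the operator instead of forcing the template into place.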
