CIS490/AGENTS.md
max 5d0e8e33a9 Tier 4 is mandatory: hard-fail on no real samples; auto-distribute MB key
User: 'we don't want it to be optional, this real malware IS the data
we want.' Acknowledged. Three changes make Tier 4 actually mandatory
without forcing per-host operator action:

1. bootstrap.wg /v1/secret/<name> endpoint
   - Pi serves /etc/cis490/secrets/malwarebazaar.token to lab hosts
     over the same trust boundary as the cert endpoint (WG mesh,
     iptmonads-gated). Strict allow-list — only `malwarebazaar`
     resolves; everything else 404s. Secret returned as bare text
     with Cache-Control: no-store. Live-verified on the Pi.
   - tests/test_bootstrap_secrets.py covers four cases: 404 unprovisioned,
     200 with token, 404 unknown name, 500 on empty file.

2. install-tier-3-4.sh: Tier 4 is no longer optional
   - Resolves MB key in priority: env var → /opt/cis490/samples/.bazaar.token
     → https://bootstrap.wg/v1/secret/malwarebazaar.
   - Caches the bootstrap-fetched key locally so re-runs are offline.
   - If all three resolution paths fail, dies with the exact
     remediation command for the operator (one-time set-malwarebazaar-key.sh
     on the Pi).
   - auto_fetch_samples.py is run unconditionally (SKIP_TIER4 still
     works for emergency overrides but logs a warning that the host
     will produce only mimics). Deploy fails if zero binaries land
     in samples/store/ — no silent mimic-only fallback.
   - SKIP_TIER4 documentation now says 'DEPRECATED; defeats the project'.

3. scripts/set-malwarebazaar-key.sh
   - Pi-side helper: one operator command per fleet, ever. Accepts
     key via env or stdin, validates length, drops at the right
     path with the right perms. Lab hosts pull the rest automatically.

AGENTS.md: rewrote the Tier-4 section to reflect mandatory status +
the one-time-on-Pi distribution model.

152/152 tests pass. Bootstrap service updated live on the Pi.
2026-05-01 00:44:41 -05:00

AGENTS.md — guidance for AI agents working on this repo

This project is part of the spectral lab (http://maxgit.wg/spectral/). The conventions below also apply to sibling repos (wg-enroll, wg-pki, caddy, iptmonads, matrix, forgejo, vault, openclaw-deploy).

READ THIS FIRST IF YOU'RE TROUBLESHOOTING A LAB-HOST BRING-UP: If the shipper or doctor mentions mTLS, certs, CA bundles, or "secure the connection," do not write any code, mint any keys, generate any CSRs, or call openssl. Cert delivery is automated — jump to "Securing the connection (mTLS) — DO NOT mint your own certs" below for the one command you should run.

PULL origin/main BEFORE DEBUGGING. Three install bugs and the shipper crash-loop on missing CA were closed in main as of 2026-04-30. If your clone is older than that, the symptoms below may already be fixed.


How a lab host gets to "shipping data" — the canonical bring-up

If you (an AI agent OR a human) are looking at a fresh lab host and asking "is this thing actually generating data for the central collector?", run the steps below in order. Cloning the repo by itself does nothing. Running launchers from a manual clone bypasses the systemd services that do the actual work.

# 0. (One-time, on the Pi only.) Initialize the CIS490 client CA + a
#    leaf cert for THIS lab host. Get its WG IP from `wg-enroll-admin
#    show <usb>` first.
sudo /home/max/.env/wg-pki/scripts/init-cis490-client-ca.sh   # idempotent
sudo /home/max/.env/wg-pki/scripts/deploy-cis490-cert.sh \
     <host_id> <wg_ip>           # mints + scp's + extracts + chmods

# 1. (On the lab host.) Install the lab-host role. This copies the
#    repo into /opt/cis490, builds the venv, drops systemd units,
#    fetches the Alpine baseline qcow2, and builds the cidata ISO
#    with the in-guest agent embedded.
sudo /opt/cis490/scripts/install-lab-host.sh
# (or, if running from the manual clone:)
#   sudo ./scripts/install-lab-host.sh

# 2. Edit /etc/cis490/lab-host.toml — set host_id and any overrides.

# 3. Verify everything before enabling the timer-driven services:
/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py \
    --role lab-host
# → green/yellow rows mean READY; red rows print the exact fix
#   command. Re-run until clean.

# 4. Turn on the services. From this moment on, the orchestrator runs
#    one fleet wave on each Restart= cycle, and the shipper picks up
#    completed episodes and PUTs them to https://collector.wg over mTLS.
sudo systemctl enable --now cis490-shipper cis490-orchestrator

# 5. (On the Pi.) Watch the index grow:
sudo tail -f /var/lib/cis490/index.jsonl

# 6. (Optional, Tier 3.) Enable real exploit fire — needs metasploit.
sudo /opt/cis490/scripts/install-msfrpcd.sh
# Operator-supplied URL + sha256 (Rapid7 download is registration-walled):
IMAGE_URL='…' IMAGE_SHA256='…' sudo OUT_DIR=/var/lib/cis490/vm/images \
    /opt/cis490/scripts/fetch-metasploitable2.sh

If index.jsonl doesn't grow within a wave-interval (~60 s after systemctl enable --now), run cis490-doctor again. The most common silent failures it catches:

  • *.wg DNS missing (wg-enroll provisions it; manual workaround is one line in /etc/hosts)
  • mTLS cert chain not installed under /etc/cis490/certs/
  • cis490-shipper service inactive (forgot step 4)
  • qemu-system-x86_64 not on PATH

cis490-doctor --json is machine-readable for use by other agents.

Tier 3 + Tier 4 deploy (zero-touch via install-lab-host.sh)

install-lab-host.sh runs the Tier-3 deploy automatically on its second pass (after the mTLS cert lands). No operator interaction is needed: metasploit-framework auto-installs via the Rapid7 omnibus, the Metasploitable2 image auto-fetches from a public mirror with TOFU sha256 pinning, the host-only bridge comes up automatically, and a live exploit fire is verified before the script returns.

To re-run the deploy by hand or on a host where Tier 3 was skipped:

sudo /opt/cis490/scripts/install-tier-3-4.sh

It's idempotent — re-running on an already-deployed host is a no-op except for the verify step. Inputs are all optional env vars:

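| var | effect |
|---|---|
| SKIP_VERIFY | skip the live vsftpd_234_backdoor smoke run |
| SKIP_BRIDGE | skip br-malware setup (limits to 2 of 5 modules) |
| SKIP_TIER4 | DEPRECATED; defeats the project. Emergency override that skips the Tier-4 auto-fetch and logs a warning that the host will produce only mimics |
| MALWAREBAZAAR_API_KEY | MalwareBazaar key; first of the three key-resolution sources. If unset, the cached token and then bootstrap.wg are tried |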

The fleet runner auto-detects Tier-3 readiness via orchestrator/fleet.py::_msfrpcd_available(). Once cis490-msfrpcd.service is up and metasploitable2.qcow2 is on disk, the next wave produces Tier-3 episodes (meta.exploit.module_name populated). No orchestrator restart is required, but a restart speeds up the switch.

Tier-4 (real malware execution) is mandatory, push-button after one-time Pi setup

Real-binary episodes are the project's training target — Tier-4 is NOT optional. A lab-host deploy that lands without real samples fails loudly; mimic-only data does not answer the research question.

One-time, on the Pi (operator runs once, ever):

sudo MALWAREBAZAAR_API_KEY=<key> /opt/cis490/scripts/set-malwarebazaar-key.sh

Free signup at https://bazaar.abuse.ch/. The key lands at /etc/cis490/secrets/malwarebazaar.token (mode 0640, root:cis490). The bootstrap service's /v1/secret/malwarebazaar endpoint then serves it to every lab host — same trust boundary as the cert endpoint (WG mesh, iptmonads-gated).
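
The strict allow-list behaviour of the secret endpoint can be sketched as below. This is a minimal illustration, not the bootstrap service's actual code: `resolve_secret`, `ALLOWED`, and the `(status, body)` return shape are hypothetical names. The status codes follow the four cases the test suite is described as covering (404 unprovisioned, 200 with token, 404 unknown name, 500 on empty file).

```python
from pathlib import Path

# Hypothetical allow-list: only this name is ever mapped to a file on disk.
ALLOWED = {"malwarebazaar": "malwarebazaar.token"}


def resolve_secret(name: str, secrets_dir: Path) -> tuple[int, str]:
    """Return (http_status, body) for a /v1/secret/<name> style lookup."""
    filename = ALLOWED.get(name)
    if filename is None:
        return 404, ""  # unknown name: the request never touches the filesystem
    path = secrets_dir / filename
    if not path.is_file():
        return 404, ""  # allowed name, but the operator has not provisioned it yet
    token = path.read_text(encoding="utf-8").strip()
    if not token:
        return 500, ""  # file exists but is empty: server-side misconfiguration
    return 200, token  # bare text body; the caller adds Cache-Control: no-store
```

Because the requested name is only used as a dictionary key, a path-traversal attempt such as `../etc/passwd` falls into the 404 branch without ever being joined to a path.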

Per lab host (auto): install-tier-3-4.sh resolves the MB key in priority order:

  1. MALWAREBAZAAR_API_KEY env var
  2. /opt/cis490/samples/.bazaar.token (cached from a previous run)
  3. https://bootstrap.wg/v1/secret/malwarebazaar (auto-distributed from the Pi)

If all three fail, the deploy aborts with the exact remediation command. Once the key resolves, tools/auto_fetch_samples.py walks each manifest family, queries MB by signature, fetches the first match, sha256-verifies on the way in, lands the binary at /opt/cis490/samples/store/<sha256>, and rewrites manifest.toml in place. The orchestrator's next selection that picks a sample with kind == "real" runs the real binary via the chunked-upload path.

If auto_fetch_samples.py lands zero binaries (zero successful MB queries), install-tier-3-4.sh exits non-zero. No silent mimic-only fallback — the project's data depends on real samples.
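
The three-step key resolution can be sketched as follows. This is an illustrative Python rendering of the priority order, not the logic of install-tier-3-4.sh itself. `resolve_mb_key` and the injected `fetch_bootstrap` callable are hypothetical, and the 0600 mode on the cached file is an assumption.

```python
from pathlib import Path
from typing import Callable, Mapping, Optional


def resolve_mb_key(
    env: Mapping[str, str],
    cache_path: Path,
    fetch_bootstrap: Callable[[], Optional[str]],
) -> str:
    """Resolve the key: env var first, then cached file, then bootstrap fetch."""
    key = env.get("MALWAREBAZAAR_API_KEY", "").strip()
    if key:
        return key
    if cache_path.is_file():
        key = cache_path.read_text(encoding="utf-8").strip()
        if key:
            return key
    key = (fetch_bootstrap() or "").strip()
    if key:
        # Cache the bootstrap-fetched key so later runs work offline.
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(key + "\n", encoding="utf-8")
        cache_path.chmod(0o600)  # assumed mode; the real script may differ
        return key
    # All three paths failed: abort loudly with the remediation hint.
    raise SystemExit("no MalwareBazaar key: run set-malwarebazaar-key.sh on the Pi")
```

Injecting the fetcher keeps the sketch testable without network access; a real caller would pass a function that GETs https://bootstrap.wg/v1/secret/malwarebazaar.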

Once the key resolves, step 5 of install-tier-3-4.sh runs tools/auto_fetch_samples.py automatically. In detail:

  1. For each [[sample]] in samples/manifest.toml without a sha256, query MalwareBazaar by family (signature match)
  2. Download the first matching binary (sha256-verified on the way in)
  3. Edit the manifest in place — add source, sha256, url
  4. Episodes that select that sample now run the real binary via the chunked-upload path (exploits.driver._resolve_workload)
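
The "sha256-verified on the way in" step, with the binary landing at a content-addressed path, might look like the sketch below. It is not the code of auto_fetch_samples.py or fetch_sample.py: `store_verified` is a hypothetical helper, and the temp-file-then-rename detail is an assumption added to avoid leaving a partial file in the store.

```python
import hashlib
from pathlib import Path


def store_verified(data: bytes, expected_sha256: str, store_dir: Path) -> Path:
    """Write data to store_dir/<sha256> only if its digest matches."""
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256.lower():
        raise ValueError(f"sha256 mismatch: expected {expected_sha256}, got {digest}")
    store_dir.mkdir(parents=True, exist_ok=True)
    dest = store_dir / digest
    tmp = dest.with_suffix(".part")  # hypothetical temp name next to the target
    tmp.write_bytes(data)
    tmp.replace(dest)  # rename is atomic within one filesystem
    return dest
```

Naming the file after its own digest means a manifest entry with a bogus sha256 can never point at a stored binary.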

The mimic profile remains the fallback for episodes that select a sample whose binary isn't on disk. Trainers filter on meta.sample.kind ∈ {"real", "mimic"}.

Confirm Tier 3+4 are flowing

# On the Pi maintainer side:
sudo python3 -c "
import json, glob, subprocess, tarfile, io
from collections import Counter
mods = Counter(); kinds = Counter()
for tar in glob.glob('/var/lib/cis490/episodes/*/*.tar.zst'):
    z = subprocess.check_output(['zstd','-q','-d','--stdout',tar],stderr=subprocess.DEVNULL)
    with tarfile.open(fileobj=io.BytesIO(z)) as t:
        for m in t.getmembers():
            if m.name.endswith('meta.json') and m.isfile():
                meta = json.load(t.extractfile(m))
                mods[(meta.get('exploit') or {}).get('module_name','<none>')] += 1
                kinds[(meta.get('sample') or {}).get('kind','<none>')] += 1
                break
print('exploit modules used:', dict(mods))
print('sample kinds:', dict(kinds))
"

If mods is {'<none>': N} and kinds is {'mimic': N}, neither Tier 3 nor Tier 4 has kicked in yet on any lab host; re-run install-tier-3-4.sh there. If mods shows {'vsftpd_234_backdoor': N, ...} and kinds shows a non-zero 'real' count, both tiers are live.
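
For agents that want to act on those two counters programmatically, the reading rule can be sketched like this. `tiers_live` is a hypothetical helper, not part of the repo; it only encodes the interpretation given above.

```python
from collections import Counter


def tiers_live(mods: Counter, kinds: Counter) -> dict[str, bool]:
    """Tier 3 is live if any real module name was seen; Tier 4 if any 'real' sample ran."""
    tier3 = any(n > 0 for name, n in mods.items() if name != "<none>")
    tier4 = kinds.get("real", 0) > 0
    return {"tier3": tier3, "tier4": tier4}
```

Usage, with made-up counts:

```python
tiers_live(Counter({"<none>": 5}), Counter({"mimic": 5}))
# {'tier3': False, 'tier4': False}
```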

Don't shortcut

  • DO NOT install metasploit-framework system-wide outside install-msfrpcd.sh. The script wires the systemd unit + creds; a manual install bypasses the orchestrator's _msfrpcd_available() probe.
  • DO NOT push your MalwareBazaar API key. samples/.bazaar.token is gitignored.
  • DO NOT add bogus sha256 entries to manifest.toml. fetch_sample.py hash-verifies every download.
  • DO NOT add real-binary entries by hand when auto_fetch_samples.py exists. Hand-edits are racy with the auto-fetcher's tempfile rewrite.

Securing the connection (mTLS) — DO NOT mint your own certs

The lab-host ↔ Pi connection is mTLS over WireGuard. The cert delivery is fully automated. You should never run openssl, write a CSR, edit a Caddyfile, or generate a private key on the lab host. If you find yourself doing any of that, you're off the runbook.

The actual cert flow:

  1. The lab host comes up on WireGuard via wg-enroll (USB-driven, one-time, separate project). After this, the lab host can reach bootstrap.wg and collector.wg on the 10.100.0.0/24 overlay.
  2. scripts/install-lab-host.sh, on its way through, pulls the leaf cert + CA bundle from https://bootstrap.wg/v1/cert/<host_id> over plain TLS (CA bundled in etc/caddy-root.crt). Trust boundary is "this peer is on the WG mesh" — iptmonads already gates the bootstrap port to enrolled peers.
  3. The fetch step is a no-op if host_id is still the default REPLACE_ME in /etc/cis490/lab-host.toml. This is the most common reason agents think cert delivery is broken.
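
The REPLACE_ME guard in step 3 can be checked mechanically before blaming cert delivery. The sketch below mirrors the `grep '^host_id'` check rather than the installer's real logic: `host_id_is_set` is a hypothetical name, and it assumes host_id is written as a double-quoted string at the start of a line.

```python
import re

# Assumes a line of the form: host_id = "something"
_HOST_ID = re.compile(r'^host_id\s*=\s*"([^"]*)"', re.MULTILINE)


def host_id_is_set(toml_text: str) -> bool:
    """True only if host_id is present and is not the default placeholder."""
    m = _HOST_ID.search(toml_text)
    return bool(m) and m.group(1) not in ("", "REPLACE_ME")
```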

The one fix that resolves 95 % of "cert/TLS/connection" reports:

# 1. Make sure host_id is set:
sudo grep '^host_id' /etc/cis490/lab-host.toml
# If it says "REPLACE_ME", edit it to the real host_id you registered.

# 2. Re-run the installer. It will fetch the cert from bootstrap.wg.
sudo /opt/cis490/scripts/install-lab-host.sh

# 3. Confirm certs landed:
ls -l /etc/cis490/certs/   # expect lab-host.pem, lab-host.key, wg-ca.pem

# 4. Smoke-test the pipe:
sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
    --config /etc/cis490/lab-host.toml --ping
# {"ok": true, ...}  → done.

If step 2 prints WARN: bootstrap.wg fetch failed, the cause is almost always one of:

  • bootstrap.wg DNS not resolving → add to /etc/hosts: echo '10.100.0.1 bootstrap.wg collector.wg' | sudo tee -a /etc/hosts
  • wg0 interface not up → sudo wg show should list a peer; if not, re-run wg-enroll.
  • The Pi's cis490-bootstrap.service is down → file an issue against the receiver-side host, not against this repo.

What you should NOT do, even if it feels like it would help:

  • Generate certs with openssl or step-cli on the lab host.
  • Copy certs from another lab host.
  • Set verify_tls = false in lab-host.toml to "skip TLS for now."
  • Restart the shipper repeatedly hoping it self-heals — it already retries on every request without restart.
  • File a Forgejo issue titled "shipper can't connect" without first running the four-line block above and pasting its output.

The shipper's waiting on mTLS material log line is expected during first-boot until the cert lands. It is not an error to fix. The transport rebuilds the SSL context on each request, so the moment certs land in /etc/cis490/certs/, the next ping/ship attempt succeeds — no restart needed.

Common bring-up gotchas (read this before debugging an install)

Smaller models acting as on-device agents have hit these traps. Each one is now fixed in main, but if you're on an older clone you may still see the symptom — pull origin/main first, then re-read.

Run tools from /opt/cis490, not from a manual clone

When you run cis490-doctor from a clone like ~/.env/CIS490/, Python prepends the clone path to sys.path. Subprocesses spawned by the doctor (e.g., python -m shipper --ping) inherit the calling CWD and pick up the clone's shipper/ package instead of the service venv at /opt/cis490/. Symptom: tracebacks reference the clone path, or No module named exploits despite package = false.

Fix already in main: the doctor passes cwd=/opt/cis490 to the shipper subprocess and inserts repo_root into sys.path itself. Operator action: always invoke either as /opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py or via cd /opt/cis490 && ./tools/cis490_doctor.py. Don't run from a clone unless you know what you're doing.

Shipper logs "waiting on mTLS material" — this is expected, not a bug

The cis490-shipper unit is enabled by install-lab-host.sh before the Pi has issued the host's mTLS leaf. The transport pre-flights the configured ca_bundle / client_cert / client_key paths and, if any are missing, defers building the SSL context. You'll see one warning per process lifetime:

shipper waiting on mTLS material (client_cert path missing: …); will retry each request

The unit stays up. Each ping/ship attempt retries the build. Once the Pi runs deploy-cis490-cert.sh <host_id> <wg_ip> and the leaf lands at /etc/cis490/certs/, the next request succeeds and the transport logs mTLS material now on disk; shipper transport ready.

Do not try to "fix" the warning by restarting the unit, deleting the config, or hand-rolling certs — just confirm the Pi-side step ran and wait one scan interval.
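
The "pre-flight the paths, defer the SSL context, retry on each request" behaviour can be sketched as below. This is an assumption-laden illustration, not the shipper's transport code: `try_build_context` is a hypothetical name, and the real transport may cache the context or log differently.

```python
import ssl
from pathlib import Path
from typing import Optional


def try_build_context(
    ca_bundle: Path, client_cert: Path, client_key: Path
) -> Optional[ssl.SSLContext]:
    """Return an mTLS client context, or None while any file is still missing."""
    for path in (ca_bundle, client_cert, client_key):
        if not path.is_file():
            return None  # caller logs "waiting on mTLS material" and retries later
    ctx = ssl.create_default_context(cafile=str(ca_bundle))
    ctx.load_cert_chain(certfile=str(client_cert), keyfile=str(client_key))
    return ctx
```

Calling this at the start of every request is what makes a restart unnecessary: the first call after the cert lands returns a usable context.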

install-lab-host.sh failures

Three install bugs were fixed in commit 95ac56a. If you're on an older clone:

| Symptom | Cause | Fix |
|---|---|---|
| ModuleNotFoundError: pycdlib during cidata build | pycdlib was in dev deps, service venv only installs main deps | Pull main; pycdlib is in dependencies now |
| Episodes exit rc=1 in 15 s; launch_demo.sh can't find image | vm/images/ dir wasn't created before symlinking | Pull main; install script now install -d's the directory |
| cis490-doctor reports "tier3: No module named exploits" | sys.path didn't include repo root | Pull main; doctor inserts repo_root into sys.path |

If you hit any of these on a fresh install, pull main first before filing an issue — the issue is probably already closed.

The classifier is multi-source — don't gut episodes on /proc alone

tools/prune_episodes.py cross-checks four telemetry sources before flagging an episode as flat:

  • telemetry-proc.jsonl — host qemu-system /proc CPU%
  • netflow.jsonl — bridge_pcap byte counters (network profiles)
  • telemetry-qmp.jsonl — virtio blockstats per-phase delta (io-walk, ransomware-shape)
  • telemetry-guest.jsonl — in-guest agent load_1m (low-and-slow, any host with a working agent)

An episode flags as flat-cpu only when EVERY available source shows no inter-phase variation. If /proc is flat but qmp blockstats show 90 MB written during infected_running, the episode is kept — the host /proc collector loses signal under contention but qmp sees through. This is essential on laptop-class lab hosts (e.g. elliott-thinkpad) where the guest is co-scheduled with 13 other VMs and the per-VM /proc CPU% gets buried.
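
The "flat only when EVERY available source is flat" rule can be sketched as follows. It is not the code of tools/prune_episodes.py: `is_flat`, the per-phase summary values, and the 5 % relative tolerance are all assumptions chosen for illustration.

```python
from typing import Mapping, Optional, Sequence


def is_flat(
    per_phase: Mapping[str, Optional[Sequence[float]]], rel_tol: float = 0.05
) -> bool:
    """per_phase maps a source name to one summary value per phase (None if absent)."""
    available = {name: vals for name, vals in per_phase.items() if vals}
    if not available:
        return False  # assumption: with no evidence at all, do not flag the episode

    def varies(values: Sequence[float]) -> bool:
        lo, hi = min(values), max(values)
        return (hi - lo) > rel_tol * max(abs(hi), abs(lo), 1e-12)

    # A single source with inter-phase variation is enough to keep the episode.
    return not any(varies(vals) for vals in available.values())
```

With a flat /proc series but 90 MB of qmp writes in one phase, `is_flat({"proc": [1.0, 1.0, 1.0], "qmp": [0.0, 90e6, 0.0]})` returns False and the episode is kept.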

All four sources stamp t_wall_ns; phase mapping uses that, not t_mono_ns, because /proc and labels are orchestrator-relative while netflow/guest are wall-clock-anchored. If you add a new collector, emit t_wall_ns from CLOCK_REALTIME on every row or your data will silently bucket into "(pre)".
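
A minimal sketch of wall-clock phase bucketing shows why a missing or non-wall-clock timestamp ends up in "(pre)". `phase_for` and `new_row` are hypothetical helpers, the "baseline" phase name is invented for the example, and `time.time_ns()` is used here as the wall-clock source.

```python
import bisect
import time


def phase_for(t_wall_ns: int, phases: list[tuple[int, str]]) -> str:
    """phases: (start_t_wall_ns, name) pairs sorted ascending by start time."""
    starts = [start for start, _ in phases]
    i = bisect.bisect_right(starts, t_wall_ns) - 1
    return phases[i][1] if i >= 0 else "(pre)"


def new_row(**fields) -> dict:
    """Stamp every collector row with wall-clock nanoseconds."""
    return {"t_wall_ns": time.time_ns(), **fields}
```

A row stamped from a monotonic clock carries a value far smaller than any wall-clock phase start, so `phase_for` returns "(pre)" for it without raising any error.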

Don't trust the in-guest probe alone — cross-check host CPU

The pre_kill_probe.yes / pre_kill_probe.sh fields in workload_killed events are produced by pgrep running inside an Alpine guest. busybox's pgrep does NOT support the -c flag. Older versions of VMLoadController._probe() used pgrep -c yes, which exits 1 with a usage banner on busybox; the || echo 0 fallback then always reported yes=0 regardless of whether the workload was running. This caused 244 episodes from elliott-thinkpad and k-gamingcom to be incorrectly labelled workload-silent.

The fix landed in main (probe now uses pgrep yes | wc -l); episodes shipped after that commit have correct probe values. For older episodes still on disk, the prune classifier now flags workload-silent only when flat-cpu holds (the host-side CPU envelope confirms no signal) and the probe also reports zero; a probe-only zero is no longer trusted. So you can safely run cis490-prune --archive against the existing data without losing valid episodes.

If you write any new in-guest diagnostic that runs commands via SerialClient, assume busybox/ash semantics: no disown builtin, no GNU pgrep -c, no bash /dev/tcp, no [[ ]]. Always pair an in-guest signal with the host-side /proc measurement before you declare an episode bad.
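
The pairing of an in-guest probe with the host-side measurement can be sketched as below. These helpers are hypothetical, not the code of VMLoadController or the prune classifier; they only encode the rule that a probe zero counts as workload-silent when flat-cpu agrees.

```python
from typing import Optional

# busybox-safe probe: count matching PIDs without relying on GNU `pgrep -c`.
PROBE_CMD = "pgrep yes | wc -l"


def parse_probe_count(stdout: str) -> Optional[int]:
    """Return the process count, or None if the output is not a bare integer."""
    text = stdout.strip()
    return int(text) if text.isdigit() else None


def is_workload_silent(flat_cpu: bool, probe_count: Optional[int]) -> bool:
    """A probe-only zero is not trusted; the host-side signal must agree."""
    return flat_cpu and probe_count == 0
```

Returning None for unparseable output (for example a usage banner) keeps a broken probe from being mistaken for "zero processes running".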

One traceback at a time

When the doctor lights up multiple red rows, fix the topmost one and re-run rather than batching attempts. Each red row prints the exact operator command it expects you to run. Don't paraphrase or invent adjacent commands; the doctor is the source of truth for what's missing.

How an agent generates data on demand (without waiting for the timer)

# One labeled episode (90 s) with a chosen sample profile:
sudo -u cis490 /opt/cis490/.venv/bin/python \
    /opt/cis490/tools/run_real_vm_demo.py \
    --data-root /var/lib/cis490/data \
    --sample mirai-class-bot

# Force the shipper to run one pass:
sudo systemctl start cis490-shipper.service   # (if disabled)
# or:
sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
    --config /etc/cis490/lab-host.toml --once

# Confirm on the Pi:
ssh <pi> 'sudo tail -3 /var/lib/cis490/index.jsonl'

Any tool that wants to count what's been shipped:

ssh <pi> 'sudo /opt/cis490/.venv/bin/python /opt/cis490/tools/index_reader.py \
    --index /var/lib/cis490/index.jsonl --count-by host_id'
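
If index_reader.py is not available where you are working, the same count can be approximated directly from the JSONL file. This is a sketch under assumptions, not the tool's implementation: `count_by` is a hypothetical name, and it assumes each line is a JSON object that carries the field being counted.

```python
import json
from collections import Counter
from typing import Iterable


def count_by(lines: Iterable[str], field: str = "host_id") -> Counter:
    """Count JSONL rows by one field, skipping blank or unparseable lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a partially written last line while the index grows
        if isinstance(row, dict):
            counts[str(row.get(field, "<missing>"))] += 1
    return counts
```

Typical use is `count_by(open("/var/lib/cis490/index.jsonl"))` on the Pi, run with enough privilege to read the file.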

Quick test for "is the pipe alive at all"

The --ping mode of the shipper does an authenticated POST to /v1/ping and writes nothing to the database. Use it as the smallest E2E test:

sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
    --config /etc/cis490/lab-host.toml --ping
# {"ok": true, "status_code": 200, ...} on success.
# Any other status is the operator's signal to read the error string.
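
An agent that shells out to the ping can gate on its result with a small parser. This sketch assumes the shipper prints a single JSON object on stdout, as the comment above suggests; `ping_succeeded` is a hypothetical helper.

```python
import json


def ping_succeeded(stdout: str) -> bool:
    """True only for a JSON object with ok == true and status_code == 200."""
    try:
        result = json.loads(stdout)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(result, dict)
        and result.get("ok") is True
        and result.get("status_code") == 200
    )
```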

File an issue when you hit a problem

When you run into an issue you cannot fully resolve in the current turn, file it as a Forgejo issue on the relevant repo before moving on. Do not silently log a TODO comment, leave a partial workaround, or assume someone else will remember. The issue tracker is the durable record.

This applies to:

  • a build / test / typecheck failure you can't fix in scope
  • a bug you discover but aren't tasked with fixing
  • a missing dep, missing config, or env-only failure that blocks E2E
  • a design gap you've worked around but want a follow-up to fix properly
  • a scope-out you made (e.g. "deferred Tier 4 sample fetch") that needs an owner so it doesn't get lost

Don't file an issue when:

  • the user is in the conversation and you can just tell them
  • it's already filed (search first: GET /api/v1/repos/<owner>/<repo>/issues?state=open&q=<keyword>)
  • it's truly a non-issue (a one-line edit you're about to make this same turn)

How to file (Forgejo API)

The local Forgejo at http://10.100.0.1:3000 accepts API calls with a token-bearer header:

curl -s -X POST \
  -H "Authorization: token <TOKEN>" \
  -H "Content-Type: application/json" \
  http://10.100.0.1:3000/api/v1/repos/spectral/<repo>/issues \
  -d '{
    "title": "<short, action-oriented title>",
    "body":  "<context, repro, attempted fixes, suggested next step>"
  }'

The token comes from the user's session — never embed one in code or commits.
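
For agents that file issues from Python rather than curl, the same request can be built with the standard library. This is a sketch equivalent to the curl call above; `build_issue_request` is a hypothetical helper, and it only constructs the request so that the token is supplied by the caller at run time.

```python
import json
import urllib.request


def build_issue_request(
    base_url: str, repo: str, token: str, title: str, body: str,
    owner: str = "spectral",
) -> urllib.request.Request:
    """Build (but do not send) the POST that creates an issue."""
    url = f"{base_url.rstrip('/')}/api/v1/repos/{owner}/{repo}/issues"
    payload = json.dumps({"title": title, "body": body}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )
```

Sending it is then `urllib.request.urlopen(req)`; keep the token in the session environment and pass it in, never in the source file.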

What a good issue body contains

  1. Context — one sentence on what was being attempted.
  2. What happened — the actual error, log line, or unexpected behavior. Paste exact output.
  3. What was tried — every workaround you attempted and why it didn't stick.
  4. Suggested next step — the smallest change that would resolve it, if you have a guess. "Unknown" is a fine answer.
  5. Related — link the commit / PR / file:line where the issue surfaced.

What a good title looks like

| Bad | Good |
|---|---|
| tests broken | tests/test_episode.py: race when t_mono_origin_ns is set in run() not __init__ |
| caddy thing | Caddy: client_auth requires absolute path; relative trusted_ca_cert_file silently fails |
| fix later | shipper: 5xx backoff cap is 5min, doc says 1min — pick one |

After filing

  • Reference the issue number in the next commit message: Refs spectral/<repo>#<n> or Closes spectral/<repo>#<n> if your current change actually fixes it.
  • If the issue is on a different repo than the one you're committing to, fully qualify: spectral/wg-pki#3.

Other conventions

  • Don't put off the hard parts. Frame "deferred-with-reason" only for genuine blockers (binary not present on this machine, external service unreachable). For anything you could do but find awkward — bridge setup, cross-arch quirks, fleet concurrency — do it. The user has flagged this twice when work was scoped down prematurely. When something genuinely is blocked by an operator artifact, file the Forgejo issue and automate the bring-up (e.g., installer script + sha256-verifying fetcher) so the moment the artifact lands it Just Works.
  • Naming: never coin USB / device / service names on the user's behalf. Ask first. Reusing an old name is especially bad.
  • /etc configs: Read first, copy second. Never overwrite a /etc/... file from a template without checking what's actually there.
  • wg-enroll scope: creation-only. Don't add admin / service-activation features to it.
  • Don't expand a project's binary name beyond its own boundary: openclaw is the queue/permissions binary in openclaw-deploy. This repo is wg-enroll (or its caller). Don't conflate.