diff --git a/AGENTS.md b/AGENTS.md
index 36eff3e..c4dc238 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -7,536 +7,118 @@ The conventions below also apply to sibling repos (`wg-enroll`,

 ---

-## ⚡ FIRST: are you an on-device agent on a lab host?
+## ⚡ FIRST: read PIPELINE.md

-**Read this table before doing anything else.** Find your symptom in
-the left column and run the EXACT command in the right column.
-Do not invent variations. Do not paraphrase.
+**[PIPELINE.md](./PIPELINE.md) is the canonical plan for this repo.**
+Read it before changing anything in the data-collection / emulation /
+labelling pipeline. If anything in this file or any other doc
+contradicts PIPELINE.md, PIPELINE.md wins and the other doc is wrong.

-| Symptom (what you see) | What to run (verbatim) |
-|---|---|
-| `journalctl -u cis490-shipper` shows `412 commit-rejected` or `code commit rejected` | `cd /opt/cis490 && sudo -u cis490 git pull origin main && sudo /opt/cis490/scripts/install-lab-host.sh` |
-| `journalctl -u cis490-shipper` shows `400` and `missing X-Cis490-Code-Commit` | Same command as above. |
-| `systemctl status cis490-shipper` shows `inactive (dead)` after install | `sudo systemctl enable --now cis490-shipper cis490-orchestrator` |
-| Shipper logs `waiting on mTLS material` | **Do nothing** — `cis490-cert-fetch.timer` polls bootstrap.wg every 5 min; once the cert lands, the script restarts the shipper automatically. To force a fetch right now: `sudo systemctl start cis490-cert-fetch.service` then `sudo journalctl -u cis490-cert-fetch -n 20`. |
-| Receiver index isn't growing (Pi side: `tail -f /var/lib/cis490/index.jsonl`) | `/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py --role lab-host` — read the **red** rows; each one prints the exact fix command. |
-| `bootstrap.wg fetch failed` after install | `echo '10.100.0.1 bootstrap.wg collector.wg' \| sudo tee -a /etc/hosts && sudo /opt/cis490/scripts/install-lab-host.sh` |
-| `cis490-doctor` says VERSION stamp missing | `sudo /opt/cis490/scripts/install-lab-host.sh` |
-| You're not sure what's wrong | `/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py --role lab-host` |
-| The canonical pull-and-reinstall command above doesn't fix it (auto-update timer keeps failing, `git pull` errors, you have a local commit, etc.) | **Read [FIXYOURSELF.md](./FIXYOURSELF.md).** It's the explicit decision tree for stuck states the auto-update timer can't recover from. |
-
-**HARD RULES — do not break these even if they "feel like they'd help":**
-
-- **Do not run `openssl`, `step-cli`, mint keys, or write CSRs.** Cert
-  delivery is fully automated by `install-lab-host.sh`. If you're
-  about to touch a private key on a lab host, stop.
-- **Do not set `verify_tls = false`.** Skipping TLS verification is not
-  "skipping it for now"; it's leaving a hole.
-- **Do not invent a `code_version` field in old episode `meta.json`
-  files.** That field exists to keep buggy data out of the index.
-  The drain helper handles them properly.
-- **Do not disable, mask, or `systemctl stop` `cis490-shipper` to silence
-  log noise.** Fix the cause via the table above; the noise stops
-  on its own once the host has the new code.
-- **Do not `git pull` and skip running `install-lab-host.sh`.** A bare
-  `git pull` updates files on disk but does NOT restart the systemd
-  units, so the running daemons keep executing the old code.
-- **Do not file a Forgejo issue without first running `cis490-doctor`
-  and pasting its output.**
-
-If your symptom isn't in the table or the canonical command above
-doesn't fix it, file a Forgejo issue (see the "File an issue" section
-near the bottom).
+This file is for general engineering conventions. The pipeline
+correctness story lives in PIPELINE.md.

 ---

-## How a lab host gets to "shipping data" — the canonical bring-up
+## What this project is

-If you (an AI agent OR a human) are looking at a fresh lab host and
-asking "is this thing actually generating data for the central
-collector?", run this in order. **Cloning the repo by itself does
-nothing. Running launchers from a manual clone bypasses the
-systemd services that do the actual work.**
+CIS490 trains a behavioral malware-detection model from labelled
+episodes captured on lab-host VMs running real or mimic workloads,
+optionally driven into infected states by Metasploit modules. The
+producer is the orchestrator on each lab host; the consumer is the
+receiver on the Pi (`office-print`, `10.100.0.1`).

-```sh
-# 0. (One-time, on the Pi only.) Initialize the CIS490 client CA + a
-#    leaf cert for THIS lab host. Get its WG IP from `wg-enroll-admin
-#    show ` first.
-sudo /home/max/.env/wg-pki/scripts/init-cis490-client-ca.sh  # idempotent
-sudo /home/max/.env/wg-pki/scripts/deploy-cis490-cert.sh \
-    # mints + scp's + extracts + chmods
+The producer must ship only ground-truth episodes. The receiver must
+reject anything that doesn't meet the bar. See PIPELINE.md.

-# 1. (On the lab host.) Install the lab-host role. ONE COMMAND DOES
-#    EVERYTHING — repo to /opt/cis490, venv build, systemd units,
-#    Alpine baseline qcow2, cidata ISO, VERSION stamp, mTLS cert
-#    auto-fetch from bootstrap.wg, Tier-3+4 deploy (msfrpcd +
-#    Metasploitable2 + theZoo malware samples + bridge), pre-stamp
-#    queue drain, and a `daemon-reload + systemctl restart` of the
-#    shipper + orchestrator on re-runs. Idempotent — safe to re-run.
-sudo /opt/cis490/scripts/install-lab-host.sh
-# (or, if running from a clone elsewhere:)
-#   sudo ./scripts/install-lab-host.sh
+## Hard rules — do not break these

-# 2. Edit /etc/cis490/lab-host.toml — set host_id (the only required
-#    edit). Then re-run step 1 so the cert auto-fetch can resolve
-#    bootstrap.wg/v1/cert/.
+- **Do not silently downgrade a host.** If a collector is silent, an
+  exploit doesn't land, or a dependency is missing, the host produces
+  zero episodes and says so loudly. There is no "ship what we can"
+  fallback.
+- **Do not write a label that an event didn't justify.** Phase
+  labels come from observed events, not from the schedule clock. See
+  PIPELINE.md §4.5.
+- **Do not add a module to the catalog without verifying it lands a
+  session against its declared target.** See PIPELINE.md §4.3.
+- **Do not add per-host config overrides.** One canonical manifest;
+  hosts that can't run it produce nothing. See PIPELINE.md §4.1.
+- **Do not bypass the dirty-tree gate** except via the
+  `CIS490_ALLOW_DIRTY=1` env var (logged, stamped, audited). No
+  "skip preflight," no `verify_tls=false`, no other override knobs.
+- **Do not run `openssl`, `step-cli`, mint keys, or write CSRs.**
+  Cert delivery is automated. If you find yourself touching a
+  private key on a lab host, stop.
+- **Do not file a Forgejo issue without first running
+  `cis490-doctor` and pasting its output.**

-# 3. Verify everything before enabling the timer-driven services:
-/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py \
-    --role lab-host
-# → green/yellow rows means READY; red rows print the exact fix
-#   command. Re-run until clean.
+## How a lab host gets to "shipping data"

-# 4. Turn on the services. From this moment on, the orchestrator runs
-#    one fleet wave on each Restart= cycle, and the shipper picks up
-#    completed episodes and PUTs them to https://collector.wg over mTLS.
-sudo systemctl enable --now cis490-shipper cis490-orchestrator
+This will be rewritten as PIPELINE.md §4 lands. The current
+`scripts/install-lab-host.sh` does most of the right things but does
+not yet enforce the canonical manifest, target-VM build, catalog
+verification, or preflight. Until those land, treat the install
+script as in-flight and assume a fresh lab host will produce nothing
+until the bar is met.

-# 5. (On the Pi.) Watch the index grow:
-sudo tail -f /var/lib/cis490/index.jsonl
-```
+The bar (when in place) will be:

-**There is no manual Tier-3 step.** Steps 1 + 2 deploy msfrpcd,
-Metasploitable2 (auto-fetched from a public mirror with TOFU sha256
-pinning — no Rapid7 registration), and Tier-4 real-malware samples
-from theZoo (no API key, no signup). The orchestrator switches to
-Tier-3 episodes automatically once the prereqs are on disk.
+1. Repo cloned to `/opt/cis490`, working tree clean, HEAD on
+   `origin/main`.
+2. Every binary in the active collector + module catalog set on
+   `PATH`.
+3. Every target VM image built from the in-repo spec, sha256-pinned.
+4. Every module in the catalog passes `scripts/verify-catalog.sh`
+   against its target.
+5. Every collector in the active set passes its emit-test.
+6. `orchestrator/preflight.py` exits 0.

-**Hosts self-update.** `install-lab-host.sh` enables
-`cis490-autoupdate.timer`, which runs every 30 min (with up to 10 min
-of randomized delay) and does `git fetch + git pull --ff-only +
-install-lab-host.sh` whenever origin/main has moved. So once a host
-has done the canonical bring-up ONCE, it self-heals on every
-subsequent maintainer push — you don't need to remember to pull. The
-timer logs to `journalctl -u cis490-autoupdate.service`. If the
-host's checkout has diverged from origin (operator hand-edits,
-half-applied changes), auto-update bails rather than guessing — that
-shows up as a unit failure with a clear log message.
-
-If `index.jsonl` doesn't grow within a wave-interval (~60 s after
-`systemctl enable --now`), run `cis490-doctor` again. The most
-common silent failures it catches:
-
-- `*.wg` DNS missing (wg-enroll provisions it; manual workaround is
-  one line in `/etc/hosts`)
-- mTLS cert chain not installed under `/etc/cis490/certs/`
-- `cis490-shipper` service inactive (forgot step 4)
-- `qemu-system-x86_64` not on PATH
-
-`cis490-doctor --json` is machine-readable for use by other agents.
-
-## Shipper says "400 missing" or "412 commit-rejected": pull and reinstall
-
-If `journalctl -u cis490-shipper` shows a steady stream of
-`-> fatal (400)` or `-> 412 commit-rejected` lines, the receiver is
-rejecting episodes because their `meta.json::code_version.commit`
-isn't in the receiver's allow-list (or isn't being sent at all). This
-happens when this lab host is running code older than the receiver
-will accept.
-
-The fix is always the same — pull main and re-run the installer:
-
-```sh
-cd /opt/cis490
-sudo -u cis490 git pull origin main
-sudo /opt/cis490/scripts/install-lab-host.sh
-```
-
-`install-lab-host.sh` does the rest:
-1. Re-stamps `/opt/cis490/VERSION` to the new HEAD.
-2. Drains pre-stamp episodes via
-   `tools/quarantine_unstamped.py` so the queue stops looping on
-   them. Drained episodes go to `/var/lib/cis490/data/quarantine/`
-   with a `quarantine_reason.json` per-episode for triage.
-3. Restarts `cis490-shipper` and `cis490-orchestrator` so the new code
-   takes effect.
-
-Do **not** disable the shipper to silence the log noise — once a host
-has the new code, traffic resumes immediately. Do **not** mint a fake
-`code_version` field in old episodes to bypass the gate; that field
-exists specifically to keep buggy pre-fix data out of the training
-index.
-
-If the receiver is rejecting *new* episodes too (you've pulled and
-restarted, but still see 412), the receiver's allow-list window may
-not yet include your commit — wait 5s for its Forgejo refresh, or
-push your commit to `origin/main` first if you're testing
-unmerged work.
-
-## Tier 3 + Tier 4 deploy (zero-touch via install-lab-host.sh)
-
-`install-lab-host.sh` runs Tier-3 deploy automatically on its second
-pass (after the mTLS cert lands). No operator interaction is needed:
-metasploit-framework auto-installs via the Rapid7 omnibus, the
-Metasploitable2 image auto-fetches from a public mirror with TOFU
-sha256 pinning, the host-only bridge auto-comes-up, and a live
-exploit fire is verified before the script returns.
-
-To re-run the deploy by hand or on a host where Tier 3 was skipped:
-
-```sh
-sudo /opt/cis490/scripts/install-tier-3-4.sh
-```
-
-It's idempotent — re-running on an already-deployed host is a no-op
-except for the verify step. Inputs are all optional env vars:
-
-| var | effect |
-|---|---|
-| `SKIP_VERIFY` | skip the live `vsftpd_234_backdoor` smoke run |
-| `SKIP_BRIDGE` | skip `br-malware` setup (limits to 2 of 5 modules) |
-| `SKIP_TIER4` | skip the Tier-4 auto-fetch (DEPRECATED — leaves you with mimic-only data, defeats the project) |
-
-The fleet runner auto-detects Tier-3 readiness via
-`orchestrator/fleet.py::_msfrpcd_available()`. Once
-`cis490-msfrpcd.service` is up and `metasploitable2.qcow2` is on
-disk, the next wave produces Tier-3 episodes (`meta.exploit.module_name`
-populated). No orchestrator restart is required, but a restart speeds
-up the switch.
-
-### Tier-4 (real malware execution) is mandatory, fully automated
-
-**Real-binary episodes are the project's training target — Tier-4 is
-NOT optional.** A lab-host deploy that lands without real samples
-fails loudly; mimic-only data does not answer the research question.
-
-There is **no operator step**. No API key, no signup, no manual
-provisioning. `install-tier-3-4.sh` runs `tools/auto_fetch_samples.py`
-which:
-
-1. Clones (or pulls) `theZoo` from
-   `https://github.com/ytisf/theZoo` to `/var/lib/cis490/theZoo`
-   (~500 MB shallow clone, public, GPL-3.0, security-research repo)
-2. For each `[[sample]]` in `manifest.toml` without a sha256, locates
-   a directory in `theZoo/malware/Binaries/` whose name matches
-   the entry's `family` (case-insensitive substring + prefix priority)
-3. Extracts the password-protected `.zip` (well-known password
-   `infected`)
-4. Picks the largest non-text payload as the binary, computes its
-   sha256, copies to `/opt/cis490/samples/store/`
-5. Rewrites `manifest.toml` in place, atomically (tempfile +
-   `os.replace` preserving stat), adding `source = "theZoo"`,
-   `sha256 = ""`, and the upstream URL
-
-If `auto_fetch_samples.py` lands zero binaries (theZoo layout drift,
-git clone failure, or a family has no matching directory),
-`install-tier-3-4.sh` exits non-zero. **No silent mimic-only fallback.**
-
-The orchestrator's next selection that picks a sample with
-`kind == "real"` runs the real binary via the chunked-upload path
-(`exploits.driver._resolve_workload`). The mimic profile remains the
-fallback for episodes that select a sample whose binary isn't on
-disk. Trainers filter on `meta.sample.kind ∈ {"real", "mimic"}`.
-
-### Confirm Tier 3+4 are flowing
-
-```sh
-# On the Pi maintainer side:
-sudo python3 -c "
-import json, glob, subprocess, tarfile, io
-from collections import Counter
-mods = Counter(); kinds = Counter()
-for tar in glob.glob('/var/lib/cis490/episodes/*/*.tar.zst'):
-    z = subprocess.check_output(['zstd','-q','-d','--stdout',tar],stderr=subprocess.DEVNULL)
-    with tarfile.open(fileobj=io.BytesIO(z)) as t:
-        for m in t.getmembers():
-            if m.name.endswith('meta.json') and m.isfile():
-                meta = json.load(t.extractfile(m))
-                mods[(meta.get('exploit') or {}).get('module_name','')] += 1
-                kinds[(meta.get('sample') or {}).get('kind','')] += 1
-                break
-print('exploit modules used:', dict(mods))
-print('sample kinds:', dict(kinds))
-"
-```
-
-If `mods` is `{'': N}` and `kinds` is `{'mimic': N}`, Tier 3
-hasn't kicked in yet on any lab host — re-run
-`install-tier-3-4.sh` there. If `mods` shows
-`{'vsftpd_234_backdoor': N, ...}` and `kinds` shows a non-zero
-`'real'` count, both tiers are live.
-
-### Don't shortcut
-
-- DO NOT install `metasploit-framework` system-wide outside
-  `install-msfrpcd.sh`. The script wires the systemd unit + creds;
-  a manual install bypasses the orchestrator's
-  `_msfrpcd_available()` probe.
-- DO NOT add bogus sha256 entries to `manifest.toml` —
-  `auto_fetch_samples.py` hash-verifies every binary it stages.
-- DO NOT add real-binary entries by hand when `auto_fetch_samples.py`
-  exists. Hand-edits are racy with the auto-fetcher's tempfile
-  rewrite.
+Once that's true, `systemctl enable --now cis490-shipper
+cis490-orchestrator` brings the host online. The orchestrator runs
+the canonical experiment; the shipper PUTs sealed episodes to the
+receiver. Episodes that don't pass the acceptance gate go to
+`data/rejected//` locally and are never shipped.

 ## Securing the connection (mTLS) — DO NOT mint your own certs

-The lab-host ↔ Pi connection is mTLS over WireGuard. **The cert
-delivery is fully automated.** You should never run `openssl`, write
-a CSR, edit a Caddyfile, or generate a private key on the lab host.
-If you find yourself doing any of that, you're off the runbook.
+The lab-host ↔ Pi connection is mTLS over WireGuard. Cert delivery
+is automated via `bootstrap.wg/v1/cert/`. You should never
+run `openssl`, write a CSR, edit a Caddyfile, or generate a private
+key on the lab host. If you find yourself doing any of that, you're
+off the runbook.

-**The actual cert flow:**
-
-1. The lab host comes up on WireGuard via `wg-enroll` (USB-driven,
-   one-time, separate project). After this, the lab host can reach
-   `bootstrap.wg` and `collector.wg` on the `10.100.0.0/24` overlay.
-2. `scripts/install-lab-host.sh`, on its way through, pulls the leaf
-   cert + CA bundle from `https://bootstrap.wg/v1/cert/`
-   over plain TLS (CA bundled in `etc/caddy-root.crt`). Trust
-   boundary is "this peer is on the WG mesh" — `iptmonads` already
-   gates the bootstrap port to enrolled peers.
-3. The fetch step is a no-op if `host_id` is still the default
-   `REPLACE_ME` in `/etc/cis490/lab-host.toml`. **This is the most
-   common reason agents think cert delivery is broken.**
-
-**The one fix that resolves 95 % of "cert/TLS/connection" reports:**
-
-```sh
-# 1. Make sure host_id is set:
-sudo grep '^host_id' /etc/cis490/lab-host.toml
-# If it says "REPLACE_ME", edit it to the real host_id you registered.
-
-# 2. Re-run the installer. It will fetch the cert from bootstrap.wg.
-sudo /opt/cis490/scripts/install-lab-host.sh
-
-# 3. Confirm certs landed:
-ls -l /etc/cis490/certs/  # expect lab-host.pem, lab-host.key, wg-ca.pem
-
-# 4. Smoke-test the pipe:
-sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
-    --config /etc/cis490/lab-host.toml --ping
-# {"ok": true, ...} → done.
-```
-
-If step 2 prints `WARN: bootstrap.wg fetch failed`, the cause is
-almost always one of:
-
-- `bootstrap.wg` DNS not resolving → add to `/etc/hosts`:
-  `echo '10.100.0.1 bootstrap.wg collector.wg' | sudo tee -a /etc/hosts`
-- `wg0` interface not up → `sudo wg show` should list a peer; if not,
-  re-run wg-enroll.
-- The Pi's `cis490-bootstrap.service` is down → file an issue against
-  the receiver-side host, not against this repo.
-
-**What you should NOT do, even if it feels like it would help:**
-
-- Generate certs with `openssl` or `step-cli` on the lab host.
-- Copy certs from another lab host.
-- Set `verify_tls = false` in `lab-host.toml` to "skip TLS for now."
-- Restart the shipper repeatedly hoping it self-heals — it already
-  retries on every request without restart.
-- File a Forgejo issue titled "shipper can't connect" without first
-  running the four-line block above and pasting its output.
+The most common reason cert fetch appears to fail is `host_id` still
+being `REPLACE_ME` in `/etc/cis490/lab-host.toml`. Check that first.

 The shipper's `waiting on mTLS material` log line is **expected**
-during first-boot until the cert lands. It is not an error to fix.
-The transport rebuilds the SSL context on each request, so the
-moment certs land in `/etc/cis490/certs/`, the next ping/ship
-attempt succeeds — no restart needed.
+during first-boot until the cert lands. It is not an error. The
+transport rebuilds the SSL context on each request, so the moment
+certs land in `/etc/cis490/certs/`, the next attempt succeeds — no
+restart needed.

-## Common bring-up gotchas (read this before debugging an install)
+## Filing issues

-Smaller models acting as on-device agents have hit these traps. Each
-one is now fixed in main, but if you're on an older clone you may
-still see the symptom — pull `origin/main` first, then re-read.
+When you run into an issue you cannot fully resolve in the current
+turn, file it as a Forgejo issue on the relevant repo. Do not
+silently log a TODO comment, leave a partial workaround, or assume
+someone else will remember.

-### Run tools from `/opt/cis490`, not from a manual clone
+File issues for:
+- A build / test / typecheck failure you can't fix in scope.
+- A bug you discover but aren't tasked with fixing.
+- A missing dep, missing config, or env-only failure that blocks
+  E2E.
+- A design gap you've worked around but want a follow-up to fix
+  properly.

-When you run `cis490-doctor` from a clone like `~/.env/CIS490/`,
-Python prepends the clone path to `sys.path`. Subprocesses spawned
-by the doctor (e.g., `python -m shipper --ping`) inherit the calling
-CWD and pick up the clone's `shipper/` package instead of the
-service venv at `/opt/cis490/`. Symptom: tracebacks reference the
-clone path, or `No module named exploits` despite `package = false`.
+Don't file when:
+- The user is in the conversation and you can just tell them.
+- It's already filed (search first:
+  `GET /api/v1/repos///issues?state=open&q=`).
+- It's truly a non-issue (a one-line edit you're about to make this
+  same turn).

-**Fix already in main:** the doctor passes `cwd=/opt/cis490` to the
-shipper subprocess and inserts `repo_root` into `sys.path` itself.
-**Operator action:** always invoke either as
-`/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py`
-or via `cd /opt/cis490 && ./tools/cis490_doctor.py`. Don't run from a
-clone unless you know what you're doing.
-
-### Shipper logs "waiting on mTLS material" — this is expected, not a bug
-
-The `cis490-shipper` unit is enabled by `install-lab-host.sh` *before*
-the Pi has issued the host's mTLS leaf. The transport pre-flights the
-configured `ca_bundle` / `client_cert` / `client_key` paths and, if
-any are missing, defers building the SSL context. You'll see one
-warning per process lifetime:
-
-```
-shipper waiting on mTLS material (client_cert path missing: …); will retry each request
-```
-
-The unit stays up. Each ping/ship attempt re-tries the build. Once
-the Pi runs `deploy-cis490-cert.sh ` and the leaf
-lands at `/etc/cis490/certs/`, the next request succeeds and the
-transport logs `mTLS material now on disk; shipper transport ready`.
-
-**Do not** try to "fix" the warning by restarting the unit, deleting
-the config, or hand-rolling certs — just confirm the Pi-side step
-ran and wait one scan interval.
-
-### Outdated clone? Pull main first.
-
-A long list of install-time bugs (cp self-copy, missing service
-restart, fatal-loop quarantine, ca_bundle pointing at the wrong
-chain, busybox pgrep flags, pycdlib in the wrong dep group, missing
-vm/images/ symlink target, doctor sys.path) have been fixed and are
-all resolved in main. **If you hit any "this used to work" symptom
-on a host that hasn't pulled in a while, the canonical command is
-always the same:**
-
-```sh
-cd /opt/cis490 && sudo -u cis490 git pull origin main && \
-    sudo /opt/cis490/scripts/install-lab-host.sh
-```
-
-That one command:
-
-- Re-stamps `/opt/cis490/VERSION` so episodes get a valid
-  `code_version.commit` — required by the receiver's gate.
-- Drains pre-stamp episodes from `data/episodes/` to
-  `data/quarantine/` via `tools/quarantine_unstamped.py` so the queue
-  stops looping on them.
-- Runs `daemon-reload` and `systemctl restart cis490-shipper
-  cis490-orchestrator` so the live daemons pick up the new code
-  (a bare `git pull` does NOT do this — Python module objects in the
-  running process are frozen at last service start).
-- Re-runs the Tier-3+4 deploy idempotently if the cert is on disk.
-
-After it returns, the shipper will be running as `Type=notify` with
-`WatchdogSec=180` — systemd kills + restarts it if a scan pass hangs.
-
-### The classifier is multi-source — don't gut episodes on /proc alone
-
-`tools/prune_episodes.py` cross-checks four telemetry sources before
-flagging an episode as flat:
-
-- `telemetry-proc.jsonl` — host qemu-system /proc CPU%
-- `netflow.jsonl` — bridge_pcap byte counters (network profiles)
-- `telemetry-qmp.jsonl` — virtio blockstats per-phase delta (io-walk,
-  ransomware-shape)
-- `telemetry-guest.jsonl` — in-guest agent load_1m (low-and-slow,
-  any host with a working agent)
-
-An episode flags as `flat-cpu` only when EVERY available source
-shows no inter-phase variation. If `/proc` is flat but qmp blockstats
-show 90 MB written during `infected_running`, the episode is kept —
-the host /proc collector loses signal under contention but qmp sees
-through. This is essential on laptop-class lab hosts (e.g.
-elliott-thinkpad) where the guest is co-scheduled with 13 other VMs
-and the per-VM /proc CPU% gets buried.
-
-All four sources stamp `t_wall_ns`; phase mapping uses that, not
-`t_mono_ns`, because /proc and labels are orchestrator-relative
-while netflow/guest are wall-clock-anchored. If you add a new
-collector, emit `t_wall_ns` from CLOCK_REALTIME on every row or your
-data will silently bucket into "(pre)".
-
-### Don't trust the in-guest probe alone — cross-check host CPU
-
-The `pre_kill_probe.yes` / `pre_kill_probe.sh` fields in
-`workload_killed` events are produced by `pgrep` running inside an
-Alpine guest. busybox's pgrep does NOT support the `-c` flag. Older
-versions of `VMLoadController._probe()` used `pgrep -c yes`, which
-exits 1 with a usage banner on busybox; the `|| echo 0` fallback then
-always reported `yes=0` regardless of whether the workload was
-running. This caused 244 episodes from `elliott-thinkpad` and
-`k-gamingcom` to be incorrectly labelled `workload-silent`.
-
-The fix landed in main (probe now uses `pgrep yes | wc -l`); episodes
-shipped after that commit have correct probe values. For older
-episodes still on disk, the prune classifier now requires `flat-cpu`
-(host-side CPU envelope confirms no signal) AND the probe to flag
-workload-silent — a probe-only zero is no longer trusted. So you can
-safely run `cis490-prune --archive` against the existing data without
-losing valid episodes.
-
-If you write any new in-guest diagnostic that runs commands via
-SerialClient, assume busybox/ash semantics: no `disown` builtin, no
-GNU `pgrep -c`, no bash `/dev/tcp`, no `[[ ]]`. Always pair an
-in-guest signal with the host-side `/proc` measurement before you
-declare an episode bad.
-
-### One traceback at a time
-
-When the doctor lights up multiple red rows, fix the topmost one and
-re-run rather than batching attempts. Each red row prints the exact
-operator command it expects you to run. Don't paraphrase or invent
-adjacent commands; the doctor is the source of truth for what's
-missing.
-
-## How an agent generates data on demand (without waiting for the timer)
-
-```sh
-# One labeled episode (90 s) with a chosen sample profile:
-sudo -u cis490 /opt/cis490/.venv/bin/python \
-    /opt/cis490/tools/run_real_vm_demo.py \
-    --data-root /var/lib/cis490/data \
-    --sample mirai-class-bot
-
-# Force the shipper to run one pass:
-sudo systemctl start cis490-shipper.service  # (if disabled)
-# or:
-sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
-    --config /etc/cis490/lab-host.toml --once
-
-# Confirm on the Pi:
-ssh 'sudo tail -3 /var/lib/cis490/index.jsonl'
-```
-
-Any tool that wants to *count* what's been shipped:
-
-```sh
-ssh 'sudo /opt/cis490/.venv/bin/python /opt/cis490/tools/index_reader.py \
-    --index /var/lib/cis490/index.jsonl --count-by host_id'
-```
-
-## Quick test for "is the pipe alive at all"
-
-The `--ping` mode of the shipper does an authenticated POST to
-`/v1/ping` and writes nothing to the database. Use it as the
-smallest E2E test:
-
-```sh
-sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
-    --config /etc/cis490/lab-host.toml --ping
-# {"ok": true, "status_code": 200, ...} on success.
-# Any other status is the operator's signal to read the error string.
-```
-
----
-
-## File an issue when you hit a problem
-
-**When you run into an issue you cannot fully resolve in the current
-turn, file it as a Forgejo issue on the relevant repo before moving
-on.** Do not silently log a TODO comment, leave a partial workaround,
-or assume someone else will remember. The issue tracker is the
-durable record.
-
-This applies to:
-
-- a build / test / typecheck failure you can't fix in scope
-- a bug you discover but aren't tasked with fixing
-- a missing dep, missing config, or env-only failure that blocks E2E
-- a design gap you've worked around but want a follow-up to fix
-  properly
-- a scope-out you made (e.g. "deferred Tier 4 sample fetch") that
-  needs an owner so it doesn't get lost
-
-Don't file an issue when:
-
-- the user is in the conversation and you can just *tell* them
-- it's already filed (search first: `GET /api/v1/repos///issues?state=open&q=`)
-- it's truly a non-issue (a one-line edit you're about to make this
-  same turn)
-
-## How to file (Forgejo API)
-
-The local Forgejo at `http://10.100.0.1:3000` accepts API calls with a
-token-bearer header:
+### How to file (Forgejo API)

 ```sh
 curl -s -X POST \
@@ -552,19 +134,19 @@ curl -s -X POST \
 The token comes from the user's session — never embed one in code or
 commits.

-### What a good issue body contains
+### Good issue body

 1. **Context** — one sentence on what was being attempted.
-2. **What happened** — the actual error, log line, or unexpected
-   behavior. Paste exact output.
+2. **What happened** — the actual error or unexpected behavior. Paste
+   exact output.
 3. **What was tried** — every workaround you attempted and why it
    didn't stick.
 4. **Suggested next step** — the smallest change that would resolve
-   it, if you have a guess. "Unknown" is a fine answer.
+   it, if you have a guess. "Unknown" is fine.
 5. **Related** — link the commit / PR / file:line where the issue
    surfaced.

-### What a good title looks like
+### Good titles

 | Bad | Good |
 |---|---|
@@ -572,25 +154,22 @@ commits.
 | `caddy thing` | `Caddy: client_auth requires absolute path; relative trusted_ca_cert_file silently fails` |
 | `fix later` | `shipper: 5xx backoff cap is 5min, doc says 1min — pick one` |

-## After filing
-
-- Reference the issue number in the next commit message:
-  `Refs spectral/#` or `Closes spectral/#` if your
-  current change actually fixes it.
-- If the issue is on a different repo than the one you're committing
-  to, fully qualify: `spectral/wg-pki#3`.
+After filing, reference the issue in the next commit message:
+`Refs spectral/#` or `Closes spectral/#`. Fully
+qualify cross-repo: `spectral/wg-pki#3`.

 ## Other conventions

-- **Don't put off the hard parts.** Frame "deferred-with-reason" only
+- **Don't put off the hard parts.** "Deferred-with-reason" is only
   for genuine blockers (binary not present on this machine, external
   service unreachable). For anything you *could* do but find awkward
-  — bridge setup, cross-arch quirks, fleet concurrency — do it. The
-  user has flagged this twice when work was scoped down prematurely.
-  When something genuinely is blocked by an operator artifact, file
-  the Forgejo issue and *automate the bring-up* (e.g., installer
-  script + sha256-verifying fetcher) so the moment the artifact lands
-  it Just Works.
+  — bridge setup, cross-arch quirks, fleet concurrency — do it.
+- **No architectural bandaids in the pipeline.** Compensating layers
+  (auto-update timers, fix-yourself decision trees, prescriptive
+  symptom→command tables, trainer-side prune scripts that paper over
+  silent collectors) are not allowed in the data-collection /
+  emulation / labelling path. Fix the producer instead. See
+  PIPELINE.md.
 - **Naming:** never coin USB / device / service names on the user's
   behalf. Ask first. Reusing an old name is especially bad.
 - **`/etc` configs:** `Read` first, copy second. Never overwrite a
diff --git a/FIXYOURSELF.md b/FIXYOURSELF.md
deleted file mode 100644
index ae41593..0000000
--- a/FIXYOURSELF.md
+++ /dev/null
@@ -1,249 +0,0 @@
-# FIXYOURSELF.md — when the lab host can't unstick itself automatically
-
-**You are reading this because you (an on-device AI agent) are running
-on a CIS490 lab host and one of these is true:**
-
-- `cis490-autoupdate.timer` keeps failing.
-- The shipper logs show steady `412 commit-rejected` or `400 missing`.
-- `cis490-doctor --role lab-host` flagged you red.
-
-The normal "pull and reinstall" path (AGENTS.md row 1) sometimes can't
-recover on its own — most often because someone made a local commit
-on this host, or `/opt/cis490` got into a state `git pull --ff-only`
-refuses to fix. This doc is the explicit decision tree for those
-cases. Run the commands literally. Do not paraphrase.
-
----
-
-## Step 1 — Identify your state
-
-Run all of these. Save the output. Match the result against the
-table.
-
-```sh
-cd /opt/cis490
-sudo -u cis490 git rev-parse HEAD  # → LOCAL
-sudo -u cis490 git rev-parse origin/main 2>&1  # → REMOTE (or error)
-sudo -u cis490 git fetch origin main 2>&1  # → fresh fetch result
-sudo -u cis490 git log -1 --format='%H %ci %s'  # → what HEAD is
-sudo -u cis490 git status --porcelain  # → uncommitted changes?
-```
-
-Then:
-
-| If you see… | Your state | Go to |
-|---|---|---|
-| LOCAL == REMOTE and `git status` empty | Not stuck — run §Z to be safe | §Z |
-| LOCAL != REMOTE and `git merge-base --is-ancestor HEAD origin/main` returns 0 | Behind main, no local commits | §A |
-| LOCAL != REMOTE and the merge-base check returns 1 (NON-zero) | **You have a local commit not on origin/main** | §B |
-| `git fetch` prints a network error | Connectivity broken | §C |
-| `/opt/cis490/.git` is missing | No git checkout — populated via `cp -aT` originally | §D |
-| `git status` shows tracked files modified | Uncommitted edits on this host | §E |
-
-If multiple match: §C blocks everything else (fix network first), then
-§D, then §E, then §B, then §A.
-
----
-
-## §A — Behind main, clean tree
-
-```sh
-cd /opt/cis490
-sudo -u cis490 git pull --ff-only origin main
-sudo /opt/cis490/scripts/install-lab-host.sh
-```
-
-`install-lab-host.sh` re-stamps VERSION, drains the pre-stamp queue,
-and restarts the daemons. Verify with §Z.
-
----
-
-## §B — You have a local commit not on origin/main
-
-This is the elliott-thinkpad case (2026-05-01..02). You committed
-something locally, the maintainer's `origin/main` doesn't have it,
-and the receiver's allow-list rejects every episode you ship. Pick
-ONE of B.1, B.2, B.3 — read all three first.
- -### B.1 — Push your commit to origin/main (RECOMMENDED if your change is real) - -```sh -cd /opt/cis490 -sudo -u cis490 git log -3 --stat HEAD # what you're about to push — read it -sudo -u cis490 git push origin HEAD:main -``` - -If `git push` succeeds: the receiver allow-list picks it up within 5 -seconds, new episodes start landing immediately. Verify with §Z. - -If `git push` fails with "permission denied" or "auth required": you -don't have push credentials. Skip to B.2 or B.3. - -If `git push` fails with "non-fast-forward": your local commit -diverged AROUND origin/main (you committed against an older base). -Try: - -```sh -sudo -u cis490 git pull --rebase origin main -# If the rebase has conflicts you can't resolve in 60 seconds, abort: -# sudo -u cis490 git rebase --abort -# Then go to B.2 or B.3. -sudo -u cis490 git push origin HEAD:main -``` - -### B.2 — Discard your local commit, reset to origin - -⚠ **This deletes your local commit.** Only do this if you know it -wasn't a real fix or the maintainer told you to discard it. - -```sh -cd /opt/cis490 -sudo -u cis490 git fetch origin main -sudo -u cis490 git log --oneline HEAD ^origin/main # what you're about to lose -sudo -u cis490 git reset --hard origin/main -sudo /opt/cis490/scripts/install-lab-host.sh -``` - -### B.3 — Stop, file an issue, wait - -If you can't decide between B.1 and B.2 (e.g. 
you don't know if your -commit is a real fix), do this: - -```sh -cd /opt/cis490 -HOST_ID=$(grep '^host_id' /etc/cis490/lab-host.toml | cut -d'"' -f2) -LOCAL_SHA=$(sudo -u cis490 git rev-parse HEAD) -DETAIL=$(sudo -u cis490 git log -3 --stat HEAD | head -100) - -# File the issue (replace with the operator's Forgejo token — -# do NOT embed yours in commits) -curl -sS -X POST \ - -H "Authorization: token " \ - -H "Content-Type: application/json" \ - http://10.100.0.1:3000/api/v1/repos/spectral/CIS490/issues \ - -d "$(python3 -c "import json,os; print(json.dumps({ - 'title': f\"$HOST_ID: stuck on local commit ${LOCAL_SHA:0:12}\", - 'body': f\"### What's at HEAD\n\n\`\`\`\n${DETAIL}\n\`\`\`\n\nNeed maintainer to choose: push HEAD to main, or reset --hard origin/main here?\" - }))")" -``` - -Then leave the daemons running. The shipper will keep auto-quarantining -the 412s — backlog grows but doesn't crash anything. Wait for a -maintainer comment. - ---- - -## §C — Network broken - -```sh -ping -c 1 10.100.0.1 # the Pi -sudo wg show # is wg0 up? -sudo systemctl restart wg-quick@wg0 # bring it back up -sudo systemctl restart cis490-shipper cis490-orchestrator -``` - -If `ping 10.100.0.1` still fails after a `wg-quick` restart, this is -a WireGuard / wg-enroll / iptmonads problem outside this repo. File -an issue at `spectral/wg-enroll` or `spectral/iptmonads` and stop. - ---- - -## §D — `/opt/cis490/.git` missing - -The host was originally set up with `cp -aT` (no `.git/`). That makes -auto-update impossible. Re-clone: - -```sh -# Stop services so we don't race with the orchestrator mid-episode -sudo systemctl stop cis490-shipper cis490-orchestrator - -# Preserve config/data — only /opt/cis490 (the code) gets replaced. -# /etc/cis490/ and /var/lib/cis490/ are NOT touched. 
-sudo mv /opt/cis490 /opt/cis490.pre-fix -sudo git clone http://maxgit.wg:3000/spectral/CIS490.git /opt/cis490 -sudo chown -R cis490:cis490 /opt/cis490 - -sudo /opt/cis490/scripts/install-lab-host.sh -# Once verified, you can drop the backup: -# sudo rm -rf /opt/cis490.pre-fix -``` - ---- - -## §E — Uncommitted edits on tracked files - -```sh -cd /opt/cis490 -sudo -u cis490 git status --short # see what's modified -sudo -u cis490 git diff # see exactly what changed -``` - -If the changes are intentional (e.g. you fixed a bug), commit them -first and then go to §B: - -```sh -sudo -u cis490 git add -sudo -u cis490 git commit -m "" -# Now go to §B. -``` - -If the changes are accidental / left over from debugging, discard -them: - -```sh -sudo -u cis490 git checkout -- . -# Now go to §A. -``` - ---- - -## §Z — Verify you're unstuck - -```sh -# 1. Daemons up? -systemctl is-active cis490-shipper cis490-orchestrator -# Both should say "active". - -# 2. VERSION present and matches HEAD? -cat /opt/cis490/VERSION -sudo -u cis490 git -C /opt/cis490 rev-parse HEAD -# The "commit" in VERSION should equal the rev-parse output. - -# 3. Recent ship results — should NOT be 412/400-fatal-only: -sudo journalctl -u cis490-shipper --since '2 minutes ago' | tail -20 - -# 4. Maintainer-side confirmation (if you can reach the Pi): -# sudo tail -5 /var/lib/cis490/index.jsonl -# Should show recent received_at_wall timestamps with your host_id. - -# 5. Run the doctor — every red row prints its own fix: -/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py --role lab-host -``` - ---- - -## Hard rules (don't break these) - -These are repeated from AGENTS.md because the situations covered in -this doc are exactly when an agent is most tempted to break them. - -- **Do NOT** edit `meta.json` files to add a fake `code_version`. The - gate exists to keep buggy data out of the index. -- **Do NOT** disable / mask / `systemctl stop` `cis490-shipper` to - silence log noise. Fix the cause. 
-- **Do NOT** set `verify_tls = false` in `lab-host.toml`. That's a - bypass, not a fix. -- **Do NOT** `git commit` on `/opt/cis490` to "make the gate - happy" — committing without pushing is exactly how the - elliott-thinkpad incident started. If you commit, push (§B.1) or - don't commit at all. -- **Do NOT** delete `/var/lib/cis490/data/quarantine/`. Those are - forensic records of what the gate rejected; they're capped at 30 - days by the cleanup pass. -- **Do NOT** clobber `/etc/cis490/certs/`. Cert delivery is - automated; rerun `install-lab-host.sh` if certs are missing. - -If you find yourself wanting to do any of the above, stop and file -an issue (§B.3 has the curl command). The maintainer would much -rather resolve a stale lab host by reading an issue than by -reverse-engineering what an agent did to escape a stuck state. diff --git a/PIPELINE.md b/PIPELINE.md new file mode 100644 index 0000000..04d7949 --- /dev/null +++ b/PIPELINE.md @@ -0,0 +1,900 @@ +# PIPELINE.md — the CIS490 generative pipeline honesty plan + +**This document is canonical.** It supersedes any guidance in +`AGENTS.md`, `FIXYOURSELF.md`, `README.md`, or other repo docs that +contradicts it. If another doc says something different, this doc wins +and the other doc is wrong (file an issue or fix it). + +This is not an architecture overview. This is a fix list. Read it, +implement it, do not split it into phases. + +**Before proposing any change to the pipeline, re-read §1, §7, and §8 +and run your proposal against §8's checklist.** Then proceed. + +--- + +## 1. Principle + +Every episode that reaches the dataset must be ground-truth. Every +host runs the same experiment with the same configured catalog. Every +exploit module and every collector in the catalog has been proven to +work end-to-end before it is eligible to run. 
There are no +compensating layers — no auto-update timers that drag stale peers +forward, no "fix-yourself" decision trees, no per-host divergence +absorbed by trainer-side filters, no labels written by clock when the +event they describe didn't happen. + +If a host can't meet the bar, it produces zero episodes and says so +loudly. A small honest dataset beats a large dishonest one. + +**Default to removal, not addition.** If a problem can be fixed by +deleting code or removing a layer, prefer that. Adding a layer is +the suspect default and should be justified against §7 and §8 before +proceeding. + +--- + +## 2. What the experiments are for + +CIS490 trains a behavioral malware-detection model. The dataset is +the ground-truth labelled record of what the host looked like during +known-clean, known-armed, known-infecting, and known-infected phases +of a real exploit chain against a real target service. The model +learns to distinguish those phases from in-deployment +behavior. **Every dishonest label is a poisoned training example.** + +This is why the producer's job is not "ship lots of episodes." It is +"ship episodes whose labels are true." + +--- + +## 3. What is currently broken (evidence) + +Numbers from the 200-episode quality probe on 2026-05-03: + +1. **Labels lie.** 0 of 67 Tier-3 exploit fires resulted in a + `session_open` event. All 67 logged `session_open_timeout`. Yet + every one of those 67 episodes is labelled + `phase=infected_running` because the schedule-driven labeller + transitions on a clock, not on observed events. The + `infected_running` label in the dataset means "the schedule said + so," not "an attacker session was actually open on this host." +2. **Collectors are silent.** + - `perf` produces 0 rows on 100% of episodes on both hosts. + - `guest-agent` produces 0 rows on 100% of episodes on both hosts. + - `qmp`, `netflow`, and `pcap` produce 0 rows on 100% of + k-gamingcom episodes (different config from elliott). 
+ - The host `tcpdump` is missing on k-gamingcom; `pcap_unavailable` + is logged then ignored. +3. **The catalog is unverified.** Modules are added to the rotation + without a per-module verification that the module actually lands a + session against its declared target. `samba_usermap_script` has a + 100% failure rate against the configured Metasploitable2 target + and was still in the rotation. +4. **Hosts run divergent experiments.** elliott and k-gamingcom have + different per-host manifests, different collector coverage, + different qemu invocations. The dataset is a union of two + different experiments, not 200 samples from one. +5. **Working trees are dirty.** 200/200 episodes report `dirty=true`, + so `code_version.commit` is unverifiable provenance. + +Each of these is a failure of the producer. Receiver-side filtering +and trainer-side prune scripts are bandaids that hide them. + +--- + +## 4. The fix — line items + +Every item below must land. They are not phases. They are parts of +one cohesive correctness story; any of them missing leaves the +pipeline half-honest. Each item names its acceptance test. + +### 4.1 Canonical manifest + +There is exactly one manifest, version-pinned in the repo at +`manifest.toml`. Every lab host loads the same manifest. There is no +per-host manifest override, no per-host collector enable/disable +flag, no per-host qemu argument list. Hosts that cannot run the +canonical manifest exit 78 at orchestrator startup. + +**Acceptance:** `find . -name manifest.toml -not -path './.git/*'` +returns exactly one path. There is no `--manifest` CLI flag on the +orchestrator that takes a different path; the path is hard-coded. +Removing this line item would re-create the host divergence we just +exited. + +### 4.2 Target VMs we build, not VMs we fetch + +Every target VM image is built from a declarative spec checked into +the repo (Packer, mkosi, debootstrap, whatever — declarative). The +image build produces a sha256-pinned artifact. 
The build script +verifies, before producing the artifact, that: + +- The vulnerable service is up after first boot. +- The service is on the port the module catalog declares. +- The service version matches the version the module catalog + declares. + +`Metasploitable2` from a SourceForge mirror is removed. We don't +ship episodes targeting black-box images. + +**Acceptance:** `scripts/build-target-.sh` exists for every +target referenced by an exploit module. Running it produces an image +whose post-boot state passes the spec's verification step. The +verification step's exit code gates the build's exit code. + +### 4.3 Module catalog admission criteria + +A module is in the catalog *only if* it passes a recorded end-to-end +verification run against its declared target. The verification is: + +1. Boot the target snapshot. +2. Fire the module via msfrpcd. +3. Observe a `session_open` event (not `session_open_timeout`). +4. Observe at least one shell command round-trip on the session. +5. Confirm guest-side artifact (file written, process spawned — + per-module). + +If any step fails, the module does not enter the catalog. There is +no "tentatively included" tier. Modules already in the catalog are +re-verified by `scripts/verify-catalog.sh` (new) on every release; +failures remove the module from the catalog. + +**Acceptance:** every entry in `exploits/modules/*.toml` has a +companion `verified_against = ""` and +`last_verified = ""` field. `scripts/verify-catalog.sh` +re-runs every entry and exits 0 only if every one passes. + +### 4.4 Collector admission criteria + +A collector is in the active set *only if* it passes a recorded +end-to-end verification run that confirms it emits non-zero rows +against a known-busy probe workload. + +For each of the six collectors (`proc`, `qmp`, `netflow`, `perf`, +`guest`, `pcap`): + +1. Diagnose the current zero-row failure (read the code, run + standalone, find the actual cause). Fix the cause. +2. 
Add a unit-or-integration test that runs the collector for N + seconds against a synthesized workload (a busy-loop process for + `proc`/`perf`, a packet generator for `netflow`/`pcap`, a QMP + blockstats query for `qmp`, a guest heartbeat for `guest`) and + asserts ≥1 row. +3. The test must run in CI and on every install via the install + script. + +A collector that cannot pass admission is removed from the active +set with a recorded reason — not silently included with zero rows. + +**Acceptance:** `pytest tests/test_collectors_emit.py -k ` +passes for each name. The CI run gates merges. + +### 4.5 Event-driven labelling + +Phase labels are written from observed events, never from the +schedule clock. The schedule becomes a *time budget* — maximum time +the orchestrator will wait in each phase — not a label source. + +Specifically: + +- `clean` is written at episode start. +- `armed` is written when the orchestrator instructs the driver to + fire (this is observable in code). +- `infecting` is written when the `exploit_fire` event is observed. +- `infected_running` is written **only** when the `session_open` + event is observed. +- If `session_open_timeout` is observed instead, the episode + terminates with a `failed` label and is rejected (see §4.6). +- `dormant` and subsequent `infected_running` transitions are + written from observed in-session idle / activity, not from clock. + +Per-module timeouts replace the global 30s timeout. Default 120s, +configurable per module in `exploits/modules/*.toml`. + +**Acceptance:** for every shipped episode, every entry in +`labels.jsonl` has a corresponding event in `events.jsonl` with a +matching `t_mono_ns` within ±100ms. An invariant test asserts this. + +### 4.6 Episode acceptance gate at finalization + +Before sealing meta and writing `done.marker`, the orchestrator +verifies: + +- Every collector in the active set produced ≥1 row. +- Every label has a matching event (§4.5 invariant). 
+- For Tier-3 episodes: a `session_open` event exists. +- `dirty=true` is absent OR `dirty_override=true` is present (see + §4.9). + +If any check fails, the episode goes to `data/rejected//` with a +`rejected_reason.json` describing which check failed. `done.marker` +is not written. The shipper never sees it. + +**Acceptance:** `tests/test_acceptance_gate.py` covers each rejection +condition. A passing test asserts a clean episode is accepted; for +each failure mode, the test asserts the episode is moved to +`rejected/` with the expected reason. + +### 4.7 Producer preflight + +`orchestrator/preflight.py` runs at orchestrator startup. One bar +(no light/deep split). Checks: + +- Every binary required by the active collector set + active module + catalog is on `PATH`. +- `/dev/kvm` accessible by the service user. +- `kernel.perf_event_paranoid <= 2`. +- `cfg.bridge_iface` exists; `tcpdump` can capture on it. +- `msfrpcd` reachable; `auth.login` returns a token. +- For every module in catalog: `module.info` is fetchable. +- For every sample in catalog: file present on disk; sha256 matches. +- Probe-boot baseline-v1 snapshot; observe guest-agent heartbeat + within N seconds. +- `git status --porcelain` empty (or `CIS490_ALLOW_DIRTY=1`). +- HEAD is on a commit currently in `origin/main`. + +Failures are collected (every failed check logged with diagnosis + +remediation), then `sys.exit(78)`. + +**Acceptance:** `tests/test_preflight.py` covers each check +individually with mocked subprocess/filesystem. `python -m +orchestrator.preflight` runs the checks and prints a structured +report. Exit codes: 0 ok, 78 sysadmin error. + +### 4.8 Receiver-side rejection (defense in depth) + +**The receiver is defense-in-depth, NOT the primary correctness +mechanism.** The producer is. Receiver rejection exists to catch +peers running stale or broken code; it is never a substitute for +fixing the producer. 
A change that strengthens receiver rejection +without strengthening the producer is the defensive-instead-of- +corrective pattern (§7.9). + +The receiver enforces the same correctness invariants the +orchestrator does. A peer running stale code that produces dishonest +episodes still gets rejected at ingest: + +- Reject any meta with `dirty=true` and no `dirty_override=true`. +- Reject any meta where `phases_observed` contains `infected_running` + but `events.jsonl` (extracted from the tarball) lacks + `session_open`. +- Reject any meta where any configured-collector row count is zero. +- Existing commit-allow-list gate continues. + +Rejections return 422 with a JSON body naming the failed check. +Rejected tarballs are not written to the index. + +**Acceptance:** `tests/test_receiver_rejects.py` covers each new +rejection condition. + +### 4.9 Override discipline + +The only escape hatch from the dirty-tree gate is the +`CIS490_ALLOW_DIRTY=1` environment variable. When set: + +- Orchestrator logs `WARN: dirty tree override active`. +- meta.json gains `dirty_override: true`. +- Receiver accepts the episode only if `dirty_override` is also + `true`. +- Every override use is auditable from the dataset. + +There are no other override knobs. No `verify_tls=false`, no "skip +preflight," no "include this collector even if it emits zero rows." + +### 4.10 Regression-test discipline + +Every fix in this plan lands with a test that would have caught the +regression at PR time. Tests are not a follow-up. A PR that fixes +the perf collector without a perf-emit test is incomplete and gets +sent back. + +CI runs: +- All unit tests. +- `scripts/verify-catalog.sh` against a smoke target subset (catalog + verification full run is gated to release commits — too expensive + for every PR). +- The collector-emit integration tests (§4.4) on real binaries. + +### 4.11 systemd integration + +- `cis490-orchestrator.service` adds + `RestartPreventExitStatus=78`. 
A preflight failure stays loud and + stuck instead of cycling restarts. +- On preflight failure, orchestrator writes + `/var/lib/cis490/preflight.failed.json` with the failed checks + + timestamps. Doctor surfaces this in its next report. The + fleet-health alert distinguishes "preflight failed" from "host + silent." + +### 4.12 Cleanup of compensating layers + +The following are deleted as part of this change. Their existence +was load-bearing for the dishonest pipeline; the honest one doesn't +need them. + +- `FIXYOURSELF.md` — entire file deleted. Stuck states no longer + exist as a class because the gates make them impossible. +- `cis490-autoupdate.timer` + `scripts/auto-update.sh` — deleted. + Hosts run pinned commits. New code is rolled out by the operator, + not auto-pulled. +- `cis490-cert-fetch.timer` — replaced by a one-shot first-boot + fetch in `install-lab-host.sh`. No periodic re-fetch. +- `tools/quarantine_unstamped.py` — deleted. Pre-stamp episodes + cannot exist because no episode is written without a valid stamp. +- `tools/check_fleet_health.py` — keep, but delete the "fatal-only" + alert branch (that branch existed because we were shipping fatals; + with the gate, we don't). +- `tools/prune_episodes.py`'s "kept episode despite flat /proc + because qmp showed write" cross-check logic — deleted. Episodes + that don't pass the producer-side gate don't reach the trainer. +- AGENTS.md "symptom→fix table" — deleted (the + symptoms it covers are now impossible). +- AGENTS.md "Hosts self-update" section — deleted. + +### 4.13 Containment bar + +Real malware execution requires explicit containment. Target VMs +exist in an isolation context that is part of the canonical +experiment, not a deployment detail. A future change that weakens +any of the items below is a containment regression and is rejected +regardless of what experimental realism it claims to add. 
+ +For every target VM in the catalog (§4.2): + +- **Network:** target attaches to a bridge with NO upstream egress. + No NAT to the host network, no internet route, no DNS resolution + beyond what the experiment provides. Outbound C2 callbacks + resolve to a sinkhole inside the experiment, never to the + internet. +- **Filesystem:** no shared mount with the host. No 9p, no + virtio-fs with host paths. The target's disk is the snapshot it + was booted from, period. +- **Privilege:** QEMU runs as the unprivileged service user. KVM + access is via group membership only; no setuid wrappers, no + privileged TUN ownership transfer, no passthrough of host + devices not explicitly required by the catalog. +- **Lifetime:** every target boots from a fresh snapshot. State + from one episode never crosses into the next. The snapshot is + reverted at episode end, not "cleaned." +- **Escape monitoring:** any QEMU exit that is not a clean shutdown + is logged with full QMP state and the episode is marked `failed`. + Two unclean exits on the same target image within a release + window trigger admission-criteria re-verification (§4.3) for + every module targeting that image. + +**Acceptance:** `tests/test_containment.py` asserts each target +build (a) has no upstream egress route from inside the guest, +(b) has no host-shared filesystem mount, (c) runs QEMU as the +unprivileged service user, (d) reverts to snapshot at episode end. +The test runs in CI and on every install. + +--- + +## 5. Build order + +There is no half-honest intermediate state. The order below +sequences the work; it does not phase the deployment. Everything +lands to `main` in one merge. + +1. Fix the four root-cause defects: + - Diagnose + fix the perf collector (read code, run standalone, + find why it's silent, fix). + - Diagnose + fix the guest-agent collector (mount baseline image, + verify agent installed, fix build). 
+ - Diagnose + fix k-gamingcom's missing qmp/netflow/pcap (compare + configs, eliminate divergence — §4.1). + - Diagnose + fix `samba_usermap_script` against its target + (manual msfconsole drive, find why the bind shell never + connects, fix or remove from catalog — §4.3). +2. Land the canonical manifest (§4.1). +3. Land the target-VM build pipeline (§4.2) and containment + tests (§4.13) together — target VMs are not in the catalog + without containment. +4. Land the catalog admission criteria + verifier (§4.3). +5. Land the collector admission criteria + tests (§4.4). +6. Land the event-driven labeller (§4.5). +7. Land the acceptance gate (§4.6). +8. Land the preflight (§4.7). +9. Land the receiver-side rejection (§4.8). +10. Land the override discipline + cleanup (§4.9, §4.12). +11. Land systemd integration + alert distinguishing (§4.11). + +After merge: lab hosts pull the new manifest, run preflight, fail +loudly if they don't meet the bar, produce zero episodes until they +do. The operator brings each host to bar — fixing one root cause at +a time, loudly. The dataset goes quiet, then comes back honest. + +--- + +## 6. Out of scope (and why) + +- **Schedule jitter for label-leakage resistance.** Real concern, + but doesn't affect honesty — only generalization. Address after + honest data is flowing. +- **New collectors (audit, ftrace, etc.).** Adding collectors before + the existing six are honest is putting more weight on a broken + floor. +- **Trainer changes.** This plan stops at the dataset boundary. The + trainer no longer needs to filter dishonest episodes because they + don't exist. +- **Multi-architecture targets.** All target VMs are x86_64 for now. + +Each of these is fine to defer because they don't paper over a +correctness defect. They add value on top of an honest pipeline; the +pipeline isn't honest yet. + +--- + +## 7. 
Anti-patterns (named — match every proposal against this list) + +Each pattern below is a shape a proposal can take that has been +rejected as architectural sleight-of-hand. **Match every proposal +against this list before submitting it.** A proposal that matches +a named pattern is rejected; abandon it and propose a corrective +fix instead. + +The patterns are named so future sessions can recognize them in +their own work. A bandaid with a nice name (preflight, acceptance +gate, retry layer, fleet-health) is still a bandaid. + +**§7 is non-exhaustive.** New sleight-of-hand patterns will exist +that aren't named here. The §8 decision tests are the actual +filter; a proposal that fails §8 is rejected even if it matches +no named pattern. Do not read §7 as a closed taxonomy and conclude +"my proposal isn't on the list, so it's fine." If §8 says no, the +answer is no, regardless of whether a named match exists. + +### 7.1 Compensating-layer pattern + +**Definition.** Adding a layer (timer, watcher, retry, alert, +recovery doc) that absorbs a failure mode upstream of itself +instead of fixing the upstream cause. + +**Example from session 2026-05-02..03.** `cis490-autoupdate.timer` +to drag stale peers forward. The actual fix was the operator's +deploy process; the timer existed because deployment was unreliable +and we patched around the unreliability instead of fixing it. + +**Test.** If I removed this layer right now, would the original +problem reappear immediately? If yes, the layer is a compensating +bandaid for an unfixed root cause. + +**What to do instead.** Fix the upstream cause. If you cannot in +this change, fail loudly (§9) and stop. + +### 7.2 Phasing-as-deferral pattern + +**Definition.** Splitting a correctness fix into "phase 1, phase 2," +"light vs deep," or "land this now, the harder part later." Any +sequencing that ships a half-honest intermediate state. + +**Example from session 2026-05-02..03.** "Land preflight first, +labeller refactor later." 
The intermediate state ships dishonest +data because the labeller is still clock-driven. + +**Test.** Does each intermediate merge ship dishonest data, or +rely on a layer that won't exist yet? If yes, no phasing. + +**What to do instead.** Reduce scope (drop a feature, narrow the +active set) until the change is small enough to land in one merge. +Do not defer the hard part. + +### 7.3 Single-instance-fix pattern + +**Definition.** Fixing one item from a class while leaving the +other items as future work. + +**Example from session 2026-05-02..03.** "I'll diagnose perf and +samba in parallel" while guest-agent, qmp, netflow, and the rest +of the module catalog stay broken. + +**Test.** Is this a class of N items, of which I'm fixing < N? If +yes, fix all or remove the unfixed from the active set. + +**What to do instead.** Either fix every member of the class, or +shrink the active catalog to just the verified members. Unverified +members do not ship. + +### 7.4 Per-host-divergence pattern + +**Definition.** Accepting that two hosts behave differently as a +working assumption. + +**Example from session 2026-05-02..03.** "Which host should I +investigate samba on, elliott or k-gamingcom?" — implying the +answer matters because hosts are different. + +**Test.** Given identical workloads on identical canonical-manifest +hosts, would the produced episodes be identical? If no, the +divergence is the bug. + +**What to do instead.** Eliminate the divergence (one canonical +manifest, one canonical target VM build, one canonical collector +set — §4.1). If a host can't run the canonical experiment, it +produces zero episodes. + +### 7.5 Black-box-trust pattern + +**Definition.** Treating an externally-built artifact as if it +behaves correctly under our experiments without a verifiable spec +for what it should do. 
+ +**Example from session 2026-05-02..03.** Metasploitable2 from a +SourceForge mirror — we don't know what version of Samba is +running, whether the service is up, or whether the image has been +altered. We were shipping modules targeting it anyway. + +**Test.** Do we have a verifiable spec for this artifact's +behavior? If no, we don't trust it. + +**What to do instead.** Build the artifact from a declarative spec +we control (§4.2). If we can't, remove modules targeting it from +the catalog. + +### 7.6 Investigation-as-deferral pattern + +**Definition.** Proposing investigation when a verifiable gate +would suffice. The investigation itself becomes the deferred work. + +**Example from session 2026-05-02..03.** "I need to diagnose why +perf is silent before I can write the gate." A gate of the form +"perf must produce ≥1 row" works without knowing the cause; it +forces the diagnosis to happen as part of the fix. + +**Test.** Can the gate be expressed as an assertion ("X must +produce > 0 rows" / "X must observe Y event") without knowing the +root cause? If yes, write the gate first. + +**What to do instead.** Write the strictest possible gate first. +The investigation is the work of making the gate pass. + +### 7.7 Speculation-as-evidence pattern + +**Definition.** Asserting a claim as fact without measurement. + +**Example from session 2026-05-02..03.** "30s vs 120s won't change +this — if the exploit were almost working, we'd see occasional +opens." No data was gathered; the claim was projected. + +**Test.** Do I have a measurement that supports this claim? If no, +I am speculating. + +**What to do instead.** Say "I don't know yet." Either gather data +or design the fix to be correct under both possibilities. + +### 7.8 Out-of-scope-for-correctness pattern + +**Definition.** Naming a correctness-affecting item as "out of +scope" to avoid the harder problem. 
+ +**Example from session 2026-05-02..03.** "Manifest canonicalization +is out of scope, flagged as known issue." Per-host config divergence +is the source of half the data quality problems; excluding it from +scope was a deferral. + +**Test.** Does excluding this item leave the system half-honest? +If yes, it is in scope. + +**What to do instead.** Reduce other scope (drop a feature, narrow +the active set) to fit. Correctness items cannot be deferred. + +### 7.9 Defensive-instead-of-corrective pattern + +**Definition.** Building rejection logic at the consumer instead of +fixing the producer that produces the rejected output. + +**Example from session 2026-05-02..03.** Receiver-side rejection of +dishonest episodes without fixing why the producer produces them. +Defense-in-depth (both ends gated) is good; defense-without- +corrective (only consumer gated) is a bandaid. + +**Test.** Does this fix make the dishonest behavior IMPOSSIBLE +upstream, or only unobservable downstream? If only unobservable, +the producer is still broken. + +**What to do instead.** Fix the producer first. The consumer-side +gate is defense-in-depth on top of a corrected producer, never a +substitute. + +### 7.10 Recovery-layer pattern + +**Definition.** Building documentation, scripts, timers, or +runbooks for "what to do when X is stuck." Applies anywhere in +the pipeline — producer, receiver, trainer, dashboard, install +scripts, on-device agents, anywhere a "recovery from a state +that shouldn't exist" layer is contemplated. Producer-side is +just the most common location. + +**Example from session 2026-05-02..03.** `FIXYOURSELF.md` — a +250-line decision tree for recovering hosts whose auto-update +timer couldn't fix them. The states it covered shouldn't have been +possible if the producer were correct. + +**Test.** Can the stuck state happen at all if the relevant +component is correct? If no, delete the recovery layer and fix +the component. 
+ +**What to do instead.** Make the stuck state impossible. If you +can't, fail loudly (§9) and stop. + +--- + +## 8. Decision tests before proposing a change + +Before adding any code, doc, layer, or feature, answer all of the +following. **Any uncomfortable answer means stop and re-evaluate.** + +1. Does this change make the dishonest behavior IMPOSSIBLE, or + only less likely / less observable? +2. Does this change scale to every instance of the problem class, + or only one? +3. If I removed this change, would the underlying problem return + immediately? +4. Am I adding a layer? If yes, can I instead remove the layer + that allowed the failure? +5. Does this proposal match any pattern in §7? If yes, abandon it + and propose a corrective fix. +6. Is the change complete in one merge? If not, why is the + intermediate state honest? +7. Am I doing this because it's correct, or because it's the + easiest thing that looks like progress? + +If you cannot answer all seven cleanly, stop. Ask the operator. +Do not proceed. + +--- + +## 9. What to do when blocked + +When you cannot fix something cleanly in scope: + +- **Fail loudly.** Exit with a distinguishable code (e.g., 78). + Write a structured failure record. Do not retry silently. +- **Stop.** Do not continue producing output as if the failure + didn't happen. +- **Ask the operator.** Tell the user what's blocked, what you + tried, and what you need to proceed. +- **Do not build a recovery layer.** That is the recovery-layer + pattern (§7.10). +- **Do not propose phased fixes.** That is the phasing-as-deferral + pattern (§7.2). +- **Do not narrow scope silently.** If the active set must shrink + to make the change tractable, name it explicitly and get sign-off. + +The operator prefers a small honest system that fails loudly over a +large half-broken one that limps. A loud failure is more useful +than a silent bandaid. + +--- + +## 10. 
Definitions of ground truth + +For each collector, "real row" means the row was actually emitted +by the underlying mechanism for *this episode*, not synthesized, +defaulted, or carried over from a previous run. + +| Collector | Ground truth means | +|---|---| +| `proc` | Row read from `/proc/<pid>/{stat,io,status}` for the live qemu PID of this episode's target VM, while that PID is alive. | +| `qmp` | Row obtained from a successful QMP `query-status` / `query-blockstats` round-trip on `cfg.qmp_socket` for this episode's qemu PID. | +| `netflow` | Row computed from packet capture on `cfg.bridge_iface` for traffic involving this episode's target VM during the episode wall-clock window. | +| `perf` | Row produced by `perf` (or equivalent) sampling this episode's qemu PID. Not from a previous run, not from a different PID. | +| `guest` | Row received from the in-guest agent over the virtio-serial channel during the episode wall-clock window. The agent must be running in *this episode's* guest, not a stale one. | +| `pcap` | Bytes captured from `cfg.bridge_iface` during the episode wall-clock window, written to `network.pcap`. | + +For each phase, "label justified" means the corresponding event was +observed: + +| Phase | Justified by | +|---|---| +| `clean` | Episode start (orchestrator-emitted). | +| `armed` | Orchestrator instructs the driver to fire (orchestrator-emitted). | +| `infecting` | `exploit_fire` event observed in `events.jsonl`. | +| `infected_running` | `session_open` event observed in `events.jsonl`. **Not** `session_open_timeout`, **not** schedule-clock. | +| `dormant` | Observed in-session idle (no traffic / no command activity for N seconds). | +| `failed` | `session_open_timeout` or other terminal driver failure. Episode is rejected (§4.6). | + +A row that doesn't meet the ground-truth bar is not a row. A label +that isn't justified is not a label. The acceptance gate (§4.6) +enforces both. + +--- + +## 11. 
Honest reporting + +When you (a future session) report status to the operator: + +- **Distinguish merged from verified.** "Code merged" is not + "behavior verified in production." A passing test on a CI host + is not the same as a working system on a lab host. +- **Distinguish proposed from implemented.** "I proposed X" is not + "X is in the repo." +- **Audit your cumulative pattern.** At the end of a session, + re-read your own changes against §7. It is possible to add three + reasonable-looking layers in sequence that cumulatively form a + compensating-layer pattern, even if no individual one looks like + a bandaid. +- **Name compensating layers you've built.** If the §7 audit finds + matches, name them and propose their removal. +- **Don't summarize cumulative changes as "fixes" without + auditing.** "I shipped 12 commits this session" is not the same + as "the pipeline is honest now." +- **Verify before agreeing or refuting.** When the operator says + something is done that you can verify, verify it before agreeing. + When they say something is broken that you can verify, verify it + before refuting. + +--- + +## 12. Glossary + +Terms used throughout this document, pinned to one definition. + +| Term | Definition | +|---|---| +| **Canonical manifest** | The single, version-pinned `manifest.toml` at the repo root. Every host loads this exact file. There is no per-host override (§4.1). | +| **Active set** | The collectors enabled in the canonical manifest for a given run. A collector is in the active set only if it has passed admission criteria (§4.4). | +| **Catalog** | The set of exploit modules in `exploits/modules/*.toml` that have passed admission (§4.3). Modules not in the catalog do not run. | +| **Ground truth** | A row or label is ground truth when it was emitted by the underlying mechanism for *this* episode, with the justifying event observed. See §10. 
| +| **Episode boundary** | An episode begins when the orchestrator emits the first `clean` label and ends when `done.marker` is written or the episode is moved to `rejected/`. All collector rows must fall inside this wall-clock window. | +| **Configured collector** | A collector listed as enabled in the canonical manifest. Distinct from "running collector" (the process actually started) and "active set" (the collectors that are both manifest-listed and admission-passing). For acceptance purposes, only the configured set matters. | +| **Admission criteria** | The bar a module / collector / target / override knob must pass to be in the active pipeline. See §4.3, §4.4, §13. | +| **Honest** | Of an episode: every label justified by an observed event, every configured collector emitted ≥1 ground-truth row, working tree was clean (or override-stamped), HEAD on `origin/main`. Of the pipeline: every accepted episode is honest. | +| **Bandaid / compensating layer** | A layer that absorbs a failure mode upstream of itself instead of fixing the upstream cause. See §7.1. | +| **Override** | A knob that loosens an admission criterion or gate. There is exactly one: `CIS490_ALLOW_DIRTY` (§14). | +| **Operator** | The human maintainer with sign-off authority. Distinct from agents that propose changes. See §15. | +| **Containment regression** | A change that weakens any of the §4.13 isolation requirements. Rejected regardless of claimed experimental value. | + +--- + +## 13. Admission scope (what triggers the bar) + +Any change to the following is in admission scope and must pass §4 +admission criteria and §15 operator sign-off: + +- Any module in `exploits/modules/*.toml`. +- Any collector in the active set. +- Any field of `manifest.toml`. +- Any phase rule or label-emission code in the labeller. +- Any gate in the producer or receiver. +- Any schedule entry (phase budget, per-module timeout). +- Any target VM build spec or its containment posture (§4.13). 
+- Any override knob (the closed list in §14). + +The following are NOT admission scope and can be changed without +admission ceremony, but must still pass §8 decision tests: + +- Internal refactors that do not change observable behavior of + any of the above. +- Test code, fixtures, CI configuration. +- Documentation that does not contradict §1. +- Build/install scripts, insofar as they don't change what gets + shipped or how it's labelled. + +A future session that argues "this is just infrastructure" or +"this is just tooling" to dodge admission scope: re-read this +section. Anything that touches what gets shipped, how it's +labelled, what runs on the host, the containment posture, or +how the gate decides — is in scope. The "infrastructure / +tooling" framing is a recurring sleight-of-hand vector and +triggers automatic rejection. + +--- + +## 14. Override knobs (closed list) + +The complete list of override knobs in CIS490, version-pinned to +this document: + +| Knob | Effect | Where audited | +|---|---|---| +| `CIS490_ALLOW_DIRTY=1` (env var, orchestrator) | Allows the orchestrator to start with a dirty git tree. Stamps `dirty_override: true` in every `meta.json` produced. Receiver accepts only with matching stamp. | per-episode in `meta.json` | + +That is the entire list. Adding a knob to this list is itself an +admission event (§13) requiring operator sign-off (§15) and an §8 +review. + +**Knobs that have been considered and rejected** (do not propose +again without re-reading the rationale): + +- `verify_tls=false` — TLS verification is a correctness boundary; + bypassing it is the defensive-instead-of-corrective pattern + (§7.9). +- `skip_preflight=1` — preflight is the gate; bypassing it makes + the gate non-functional. +- `experimental_collector=true` — bypassing collector admission + is the single-instance-fix pattern (§7.3) wearing a flag. +- `diagnostic_mode=true` — generic bypass; in practice would be + applied to hide failures, not investigate them. 
+- `dry_run` for the producer — episodes that aren't shipped go to + `rejected/`; no dry-run flag needed. + +If a future session proposes a new override knob, the burden is on +the proposal: pass §8, get operator sign-off, amend §14 in the +same merge. "Add the knob now and amend §14 later" is the +phasing-as-deferral pattern (§7.2) applied to documentation. + +--- + +## 15. Sign-off discipline + +Admission decisions are made by the operator, not by agents acting +alone. Specifically: + +- **Adding a module to the catalog** requires operator sign-off. + An agent runs `scripts/verify-catalog.sh`, presents the + verification result, and the operator decides whether the module + enters the catalog. +- **Adding a collector to the active set** requires operator + sign-off. Agent runs the emit-test, operator decides. +- **Promoting a target VM build** requires operator sign-off after + §4.2 verification and §4.13 containment tests pass. +- **Adding an override knob** (§14) requires operator sign-off. +- **Amending PIPELINE.md** requires operator sign-off (§16). + +**Removing** anything from the catalog or active set does NOT +require operator sign-off — the bar is asymmetric. Tightening +is always permitted; loosening requires sign-off. + +The operator is the human with maintainer credentials on the +repository. Agents propose, run verification, and present results; +the operator decides admission. + +If an agent is acting in a non-interactive context (CI run, +scheduled job) where no operator is available to sign off, the +agent does not admit anything. It produces verification output +and stops. + +--- + +## 16. Amending PIPELINE.md + +This document is not immutable, but it is the canonical statement +of the bar. Amendments are governed by the same discipline as +admission decisions: + +1. 
Any change to §1 (principle), §4 (fix items), §7 (anti-patterns), + §8 (decision tests), §10 (ground truth), §13 (admission scope), + §14 (override list), or §15 (sign-off) is a substantive + amendment. +2. Substantive amendments require operator sign-off (§15) and must + pass §8 decision tests applied to the amendment itself. +3. The amendment lands in the same merge as the code change it + justifies. "Amend the doc later" is the phasing pattern (§7.2). +4. Editorial changes (typos, formatting, link fixes, glossary + wording) do not require sign-off but should be flagged in the + commit message. + +A future session that wants to add a feature or layer the document +forbids: the path is to amend the document, not to work around it. +"This isn't covered by PIPELINE.md, so I'll just do it" is the +out-of-scope-for-correctness pattern (§7.8) applied to the +meta-document. Anything that touches admission scope (§13) is +covered even if not named explicitly. + +If you find the document is wrong — internally inconsistent, +contradicts observed reality, prescribes something impossible — +file a Forgejo issue against the repo with the contradiction +documented. Do not silently work around the doc. + +--- + +## 17. What this plan supersedes + +The following docs are deleted or rewritten as part of landing this +plan: + +| Doc | Action | +|---|---| +| `FIXYOURSELF.md` | Deleted. Compensating-layer doc; the states it covers don't exist after §4.6. | +| `AGENTS.md` "symptom→fix table" | Deleted. Bandaid-driven. | +| `AGENTS.md` "Hosts self-update" section | Deleted. Hosts run pinned commits. | +| `AGENTS.md` "Tier 3+4 deploy zero-touch" claim | Rewritten. Targets are built locally now, not auto-fetched. | +| `AGENTS.md` "trust the in-guest probe alone, cross-check host CPU" | Deleted. The producer-side gate makes this fictional cross-check unnecessary. | +| `TIER3-BRINGUP.md` | Kept as historical record — labelled bug report, not current guidance. 
| +| `README.md` Tier-3+4 narrative | Reviewed and aligned. | + +If you are a future session reading this and find another doc that +contradicts §1–§6 of this file: this file is right and the other +doc is wrong. Fix the other doc.