PIPELINE.md is canonical; rewrite AGENTS.md; delete FIXYOURSELF.md
PIPELINE.md is the canonical plan for the data-collection / emulation / labelling pipeline. It supersedes any guidance in AGENTS.md, README.md, or other repo docs that contradicts it (§17). Future sessions read it before changing anything in the pipeline.

AGENTS.md is rewritten to point at PIPELINE.md as canonical and to strip the prescriptive symptom→fix table that absorbed producer-side defects instead of fixing them (§7.1 compensating-layer pattern).

FIXYOURSELF.md is deleted (§4.12, §7.10 recovery-layer pattern). The states it covered are made impossible by the §4.6 acceptance gate landing later in §5; recovering from a state that shouldn't exist is itself the bandaid we're removing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 05bf785f0a
commit bfb1c491f8
3 changed files with 1003 additions and 773 deletions
627
AGENTS.md
|
|
@@ -7,536 +7,118 @@ The conventions below also apply to sibling repos (`wg-enroll`,
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## ⚡ FIRST: are you an on-device agent on a lab host?
|
## ⚡ FIRST: read PIPELINE.md
|
||||||
|
|
||||||
**Read this table before doing anything else.** Find your symptom in
|
**[PIPELINE.md](./PIPELINE.md) is the canonical plan for this repo.**
|
||||||
the left column and run the EXACT command in the right column.
|
Read it before changing anything in the data-collection / emulation /
|
||||||
Do not invent variations. Do not paraphrase.
|
labelling pipeline. If anything in this file or any other doc
|
||||||
|
contradicts PIPELINE.md, PIPELINE.md wins and the other doc is wrong.
|
||||||
|
|
||||||
| Symptom (what you see) | What to run (verbatim) |
|
This file is for general engineering conventions. The pipeline
|
||||||
|---|---|
|
correctness story lives in PIPELINE.md.
|
||||||
| `journalctl -u cis490-shipper` shows `412 commit-rejected` or `code commit rejected` | `cd /opt/cis490 && sudo -u cis490 git pull origin main && sudo /opt/cis490/scripts/install-lab-host.sh` |
|
|
||||||
| `journalctl -u cis490-shipper` shows `400` and `missing X-Cis490-Code-Commit` | Same command as above. |
|
|
||||||
| `systemctl status cis490-shipper` shows `inactive (dead)` after install | `sudo systemctl enable --now cis490-shipper cis490-orchestrator` |
|
|
||||||
| Shipper logs `waiting on mTLS material` | **Do nothing** — `cis490-cert-fetch.timer` polls bootstrap.wg every 5 min; once the cert lands, the script restarts the shipper automatically. To force a fetch right now: `sudo systemctl start cis490-cert-fetch.service` then `sudo journalctl -u cis490-cert-fetch -n 20`. |
|
|
||||||
| Receiver index isn't growing (Pi side: `tail -f /var/lib/cis490/index.jsonl`) | `/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py --role lab-host` — read the **red** rows; each one prints the exact fix command. |
|
|
||||||
| `bootstrap.wg fetch failed` after install | `echo '10.100.0.1 bootstrap.wg collector.wg' \| sudo tee -a /etc/hosts && sudo /opt/cis490/scripts/install-lab-host.sh` |
|
|
||||||
| `cis490-doctor` says VERSION stamp missing | `sudo /opt/cis490/scripts/install-lab-host.sh` |
|
|
||||||
| You're not sure what's wrong | `/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py --role lab-host` |
|
|
||||||
| The canonical pull-and-reinstall command above doesn't fix it (auto-update timer keeps failing, `git pull` errors, you have a local commit, etc.) | **Read [FIXYOURSELF.md](./FIXYOURSELF.md).** It's the explicit decision tree for stuck states the auto-update timer can't recover from. |
|
|
||||||
|
|
||||||
**HARD RULES — do not break these even if they "feel like they'd help":**
|
|
||||||
|
|
||||||
- **Do not run `openssl`, `step-cli`, mint keys, or write CSRs.** Cert
|
|
||||||
delivery is fully automated by `install-lab-host.sh`. If you're
|
|
||||||
about to touch a private key on a lab host, stop.
|
|
||||||
- **Do not set `verify_tls = false`.** Skipping TLS verification is not
|
|
||||||
"skipping it for now"; it's leaving a hole.
|
|
||||||
- **Do not invent a `code_version` field in old episode `meta.json`
|
|
||||||
files.** That field exists to keep buggy data out of the index.
|
|
||||||
The drain helper handles them properly.
|
|
||||||
- **Do not disable, mask, or `systemctl stop` `cis490-shipper` to silence
|
|
||||||
log noise.** Fix the cause via the table above; the noise stops
|
|
||||||
on its own once the host has the new code.
|
|
||||||
- **Do not `git pull` and skip running `install-lab-host.sh`.** A bare
|
|
||||||
`git pull` updates files on disk but does NOT restart the systemd
|
|
||||||
units, so the running daemons keep executing the old code.
|
|
||||||
- **Do not file a Forgejo issue without first running `cis490-doctor`
|
|
||||||
and pasting its output.**
|
|
||||||
|
|
||||||
If your symptom isn't in the table or the canonical command above
|
|
||||||
doesn't fix it, file a Forgejo issue (see the "File an issue" section
|
|
||||||
near the bottom).
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## How a lab host gets to "shipping data" — the canonical bring-up
|
## What this project is
|
||||||
|
|
||||||
If you (an AI agent OR a human) are looking at a fresh lab host and
|
CIS490 trains a behavioral malware-detection model from labelled
|
||||||
asking "is this thing actually generating data for the central
|
episodes captured on lab-host VMs running real or mimic workloads,
|
||||||
collector?", run this in order. **Cloning the repo by itself does
|
optionally driven into infected states by Metasploit modules. The
|
||||||
nothing. Running launchers from a manual clone bypasses the
|
producer is the orchestrator on each lab host; the consumer is the
|
||||||
systemd services that do the actual work.**
|
receiver on the Pi (`office-print`, `10.100.0.1`).
|
||||||
|
|
||||||
```sh
|
The producer must ship only ground-truth episodes. The receiver must
|
||||||
# 0. (One-time, on the Pi only.) Initialize the CIS490 client CA + a
|
reject anything that doesn't meet the bar. See PIPELINE.md.
|
||||||
# leaf cert for THIS lab host. Get its WG IP from `wg-enroll-admin
|
|
||||||
# show <usb>` first.
|
|
||||||
sudo /home/max/.env/wg-pki/scripts/init-cis490-client-ca.sh # idempotent
|
|
||||||
sudo /home/max/.env/wg-pki/scripts/deploy-cis490-cert.sh \
|
|
||||||
<host_id> <wg_ip> # mints + scp's + extracts + chmods
|
|
||||||
|
|
||||||
# 1. (On the lab host.) Install the lab-host role. ONE COMMAND DOES
|
## Hard rules — do not break these
|
||||||
# EVERYTHING — repo to /opt/cis490, venv build, systemd units,
|
|
||||||
# Alpine baseline qcow2, cidata ISO, VERSION stamp, mTLS cert
|
|
||||||
# auto-fetch from bootstrap.wg, Tier-3+4 deploy (msfrpcd +
|
|
||||||
# Metasploitable2 + theZoo malware samples + bridge), pre-stamp
|
|
||||||
# queue drain, and a `daemon-reload + systemctl restart` of the
|
|
||||||
# shipper + orchestrator on re-runs. Idempotent — safe to re-run.
|
|
||||||
sudo /opt/cis490/scripts/install-lab-host.sh
|
|
||||||
# (or, if running from a clone elsewhere:)
|
|
||||||
# sudo ./scripts/install-lab-host.sh
|
|
||||||
|
|
||||||
# 2. Edit /etc/cis490/lab-host.toml — set host_id (the only required
|
- **Do not silently downgrade a host.** If a collector is silent, an
|
||||||
# edit). Then re-run step 1 so the cert auto-fetch can resolve
|
exploit doesn't land, or a dependency is missing, the host produces
|
||||||
# bootstrap.wg/v1/cert/<host_id>.
|
zero episodes and says so loudly. There is no "ship what we can"
|
||||||
|
fallback.
|
||||||
|
- **Do not write a label that an event didn't justify.** Phase
|
||||||
|
labels come from observed events, not from the schedule clock. See
|
||||||
|
PIPELINE.md §4.5.
|
||||||
|
- **Do not add a module to the catalog without verifying it lands a
|
||||||
|
session against its declared target.** See PIPELINE.md §4.3.
|
||||||
|
- **Do not add per-host config overrides.** One canonical manifest;
|
||||||
|
hosts that can't run it produce nothing. See PIPELINE.md §4.1.
|
||||||
|
- **Do not bypass the dirty-tree gate** except via the
|
||||||
|
`CIS490_ALLOW_DIRTY=1` env var (logged, stamped, audited). No
|
||||||
|
"skip preflight," no `verify_tls=false`, no other override knobs.
|
||||||
|
- **Do not run `openssl`, `step-cli`, mint keys, or write CSRs.**
|
||||||
|
Cert delivery is automated. If you find yourself touching a
|
||||||
|
private key on a lab host, stop.
|
||||||
|
- **Do not file a Forgejo issue without first running
|
||||||
|
`cis490-doctor` and pasting its output.**
|
||||||
|
|
||||||
# 3. Verify everything before enabling the timer-driven services:
|
## How a lab host gets to "shipping data"
|
||||||
/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py \
|
|
||||||
--role lab-host
|
|
||||||
# → green/yellow rows means READY; red rows print the exact fix
|
|
||||||
# command. Re-run until clean.
|
|
||||||
|
|
||||||
# 4. Turn on the services. From this moment on, the orchestrator runs
|
This will be rewritten as PIPELINE.md §4 lands. The current
|
||||||
# one fleet wave on each Restart= cycle, and the shipper picks up
|
`scripts/install-lab-host.sh` does most of the right things but does
|
||||||
# completed episodes and PUTs them to https://collector.wg over mTLS.
|
not yet enforce the canonical manifest, target-VM build, catalog
|
||||||
sudo systemctl enable --now cis490-shipper cis490-orchestrator
|
verification, or preflight. Until those land, treat the install
|
||||||
|
script as in-flight and assume a fresh lab host will produce nothing
|
||||||
|
until the bar is met.
|
||||||
|
|
||||||
# 5. (On the Pi.) Watch the index grow:
|
The bar (when in place) will be:
|
||||||
sudo tail -f /var/lib/cis490/index.jsonl
|
|
||||||
```
|
|
||||||
|
|
||||||
**There is no manual Tier-3 step.** Steps 1 + 2 deploy msfrpcd,
|
1. Repo cloned to `/opt/cis490`, working tree clean, HEAD on
|
||||||
Metasploitable2 (auto-fetched from a public mirror with TOFU sha256
|
`origin/main`.
|
||||||
pinning — no Rapid7 registration), and Tier-4 real-malware samples
|
2. Every binary in the active collector + module catalog set on
|
||||||
from theZoo (no API key, no signup). The orchestrator switches to
|
`PATH`.
|
||||||
Tier-3 episodes automatically once the prereqs are on disk.
|
3. Every target VM image built from the in-repo spec, sha256-pinned.
|
||||||
|
4. Every module in the catalog passes `scripts/verify-catalog.sh`
|
||||||
|
against its target.
|
||||||
|
5. Every collector in the active set passes its emit-test.
|
||||||
|
6. `orchestrator/preflight.py` exits 0.
|
||||||
|
|
||||||
**Hosts self-update.** `install-lab-host.sh` enables
|
Once that's true, `systemctl enable --now cis490-shipper
|
||||||
`cis490-autoupdate.timer`, which runs every 30 min (with up to 10 min
|
cis490-orchestrator` brings the host online. The orchestrator runs
|
||||||
of randomized delay) and does `git fetch + git pull --ff-only +
|
the canonical experiment; the shipper PUTs sealed episodes to the
|
||||||
install-lab-host.sh` whenever origin/main has moved. So once a host
|
receiver. Episodes that don't pass the acceptance gate go to
|
||||||
has done the canonical bring-up ONCE, it self-heals on every
|
`data/rejected/<id>/` locally and are never shipped.
|
||||||
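Item 2 of the bar ("every binary ... on `PATH`") is the kind of check that is easy to express directly. The following is a minimal sketch only, not the contents of `orchestrator/preflight.py`; the function name `missing_binaries` and the idea of passing the required names as a plain list are assumptions made for illustration.

```python
import shutil
from typing import Iterable, List


def missing_binaries(required: Iterable[str]) -> List[str]:
    """Return the names from `required` that cannot be found on PATH.

    Hypothetical helper: the real required set would come from the
    active collector and module catalog configuration.
    """
    return [name for name in required if shutil.which(name) is None]
```

A preflight step built on this would exit non-zero whenever the returned list is non-empty, for example after calling `missing_binaries(["qemu-system-x86_64", "zstd"])`, which matches the "produce nothing and say so loudly" rule rather than degrading silently.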
subsequent maintainer push — you don't need to remember to pull. The
|
|
||||||
timer logs to `journalctl -u cis490-autoupdate.service`. If the
|
|
||||||
host's checkout has diverged from origin (operator hand-edits,
|
|
||||||
half-applied changes), auto-update bails rather than guessing — that
|
|
||||||
shows up as a unit failure with a clear log message.
|
|
||||||
|
|
||||||
If `index.jsonl` doesn't grow within a wave-interval (~60 s after
|
|
||||||
`systemctl enable --now`), run `cis490-doctor` again. The most
|
|
||||||
common silent failures it catches:
|
|
||||||
|
|
||||||
- `*.wg` DNS missing (wg-enroll provisions it; manual workaround is
|
|
||||||
one line in `/etc/hosts`)
|
|
||||||
- mTLS cert chain not installed under `/etc/cis490/certs/`
|
|
||||||
- `cis490-shipper` service inactive (forgot step 4)
|
|
||||||
- `qemu-system-x86_64` not on PATH
|
|
||||||
|
|
||||||
`cis490-doctor --json` is machine-readable for use by other agents.
|
|
||||||
|
|
||||||
## Shipper says "400 missing" or "412 commit-rejected": pull and reinstall
|
|
||||||
|
|
||||||
If `journalctl -u cis490-shipper` shows a steady stream of
|
|
||||||
`-> fatal (400)` or `-> 412 commit-rejected` lines, the receiver is
|
|
||||||
rejecting episodes because their `meta.json::code_version.commit`
|
|
||||||
isn't in the receiver's allow-list (or isn't being sent at all). This
|
|
||||||
happens when this lab host is running code older than the receiver
|
|
||||||
will accept.
|
|
||||||
|
|
||||||
The fix is always the same — pull main and re-run the installer:
|
|
||||||
|
|
||||||
```sh
|
|
||||||
cd /opt/cis490
|
|
||||||
sudo -u cis490 git pull origin main
|
|
||||||
sudo /opt/cis490/scripts/install-lab-host.sh
|
|
||||||
```
|
|
||||||
|
|
||||||
`install-lab-host.sh` does the rest:
|
|
||||||
1. Re-stamps `/opt/cis490/VERSION` to the new HEAD.
|
|
||||||
2. Drains pre-stamp episodes via
|
|
||||||
`tools/quarantine_unstamped.py` so the queue stops looping on
|
|
||||||
them. Drained episodes go to `/var/lib/cis490/data/quarantine/`
|
|
||||||
with a `quarantine_reason.json` per-episode for triage.
|
|
||||||
3. Restarts `cis490-shipper` and `cis490-orchestrator` so the new code
|
|
||||||
takes effect.
|
|
||||||
|
|
||||||
Do **not** disable the shipper to silence the log noise — once a host
|
|
||||||
has the new code, traffic resumes immediately. Do **not** mint a fake
|
|
||||||
`code_version` field in old episodes to bypass the gate; that field
|
|
||||||
exists specifically to keep buggy pre-fix data out of the training
|
|
||||||
index.
|
|
||||||
|
|
||||||
If the receiver is rejecting *new* episodes too (you've pulled and
|
|
||||||
restarted, but still see 412), the receiver's allow-list window may
|
|
||||||
not yet include your commit — wait 5s for its Forgejo refresh, or
|
|
||||||
push your commit to `origin/main` first if you're testing
|
|
||||||
unmerged work.
|
|
||||||
|
|
||||||
## Tier 3 + Tier 4 deploy (zero-touch via install-lab-host.sh)
|
|
||||||
|
|
||||||
`install-lab-host.sh` runs Tier-3 deploy automatically on its second
|
|
||||||
pass (after the mTLS cert lands). No operator interaction is needed:
|
|
||||||
metasploit-framework auto-installs via the Rapid7 omnibus, the
|
|
||||||
Metasploitable2 image auto-fetches from a public mirror with TOFU
|
|
||||||
sha256 pinning, the host-only bridge auto-comes-up, and a live
|
|
||||||
exploit fire is verified before the script returns.
|
|
||||||
|
|
||||||
To re-run the deploy by hand or on a host where Tier 3 was skipped:
|
|
||||||
|
|
||||||
```sh
|
|
||||||
sudo /opt/cis490/scripts/install-tier-3-4.sh
|
|
||||||
```
|
|
||||||
|
|
||||||
It's idempotent — re-running on an already-deployed host is a no-op
|
|
||||||
except for the verify step. Inputs are all optional env vars:
|
|
||||||
|
|
||||||
| var | effect |
|
|
||||||
|---|---|
|
|
||||||
| `SKIP_VERIFY` | skip the live `vsftpd_234_backdoor` smoke run |
|
|
||||||
| `SKIP_BRIDGE` | skip `br-malware` setup (limits to 2 of 5 modules) |
|
|
||||||
| `SKIP_TIER4` | skip the Tier-4 auto-fetch (DEPRECATED — leaves you with mimic-only data, defeats the project) |
|
|
||||||
|
|
||||||
The fleet runner auto-detects Tier-3 readiness via
|
|
||||||
`orchestrator/fleet.py::_msfrpcd_available()`. Once
|
|
||||||
`cis490-msfrpcd.service` is up and `metasploitable2.qcow2` is on
|
|
||||||
disk, the next wave produces Tier-3 episodes (`meta.exploit.module_name`
|
|
||||||
populated). No orchestrator restart is required, but a restart speeds
|
|
||||||
up the switch.
|
|
||||||
|
|
||||||
### Tier-4 (real malware execution) is mandatory, fully automated
|
|
||||||
|
|
||||||
**Real-binary episodes are the project's training target — Tier-4 is
|
|
||||||
NOT optional.** A lab-host deploy that lands without real samples
|
|
||||||
fails loudly; mimic-only data does not answer the research question.
|
|
||||||
|
|
||||||
There is **no operator step**. No API key, no signup, no manual
|
|
||||||
provisioning. `install-tier-3-4.sh` runs `tools/auto_fetch_samples.py`
|
|
||||||
which:
|
|
||||||
|
|
||||||
1. Clones (or pulls) `theZoo` from
|
|
||||||
`https://github.com/ytisf/theZoo` to `/var/lib/cis490/theZoo`
|
|
||||||
(~500 MB shallow clone, public, GPL-3.0, security-research repo)
|
|
||||||
2. For each `[[sample]]` in `manifest.toml` without a sha256, locates
|
|
||||||
a directory in `theZoo/malware/Binaries/` whose name matches
|
|
||||||
the entry's `family` (case-insensitive substring + prefix priority)
|
|
||||||
3. Extracts the password-protected `.zip` (well-known password
|
|
||||||
`infected`)
|
|
||||||
4. Picks the largest non-text payload as the binary, computes its
|
|
||||||
sha256, copies to `/opt/cis490/samples/store/<sha256>`
|
|
||||||
5. Rewrites `manifest.toml` in place, atomically (tempfile +
|
|
||||||
`os.replace` preserving stat), adding `source = "theZoo"`,
|
|
||||||
`sha256 = "<hex>"`, and the upstream URL
|
|
||||||
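The in-place rewrite in step 5 (tempfile + `os.replace`) can be sketched as below. This is an illustration under assumptions, not the code in `tools/auto_fetch_samples.py`: the function name `atomic_rewrite` is hypothetical, and "preserving stat" is read here as keeping the original permission bits.

```python
import os
import shutil
import tempfile


def atomic_rewrite(path: str, new_text: str) -> None:
    """Replace `path` with `new_text` so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must sit in the same directory (same filesystem),
    # otherwise os.replace cannot perform an atomic rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(new_text)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        if os.path.exists(path):
            # mkstemp creates the file with mode 0600; copy the original
            # permission bits so the replacement looks like the old file.
            shutil.copymode(path, tmp_path)
        os.replace(tmp_path, path)  # atomic swap on POSIX
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because the swap is a single rename, a concurrent reader sees either the old manifest or the new one, never a half-written file. It does not protect against two writers racing each other, which is why hand-edits alongside the auto-fetcher remain unsafe.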
|
|
||||||
If `auto_fetch_samples.py` lands zero binaries (theZoo layout drift,
|
|
||||||
git clone failure, or a family has no matching directory),
|
|
||||||
`install-tier-3-4.sh` exits non-zero. **No silent mimic-only fallback.**
|
|
||||||
|
|
||||||
The orchestrator's next selection that picks a sample with
|
|
||||||
`kind == "real"` runs the real binary via the chunked-upload path
|
|
||||||
(`exploits.driver._resolve_workload`). The mimic profile remains the
|
|
||||||
fallback for episodes that select a sample whose binary isn't on
|
|
||||||
disk. Trainers filter on `meta.sample.kind ∈ {"real", "mimic"}`.
|
|
||||||
|
|
||||||
### Confirm Tier 3+4 are flowing
|
|
||||||
|
|
||||||
```sh
|
|
||||||
# On the Pi maintainer side:
|
|
||||||
sudo python3 -c "
import json, glob, subprocess, tarfile, io
from collections import Counter
mods = Counter(); kinds = Counter()
for tar in glob.glob('/var/lib/cis490/episodes/*/*.tar.zst'):
    z = subprocess.check_output(['zstd','-q','-d','--stdout',tar],stderr=subprocess.DEVNULL)
    with tarfile.open(fileobj=io.BytesIO(z)) as t:
        for m in t.getmembers():
            if m.name.endswith('meta.json') and m.isfile():
                meta = json.load(t.extractfile(m))
                mods[(meta.get('exploit') or {}).get('module_name','<none>')] += 1
                kinds[(meta.get('sample') or {}).get('kind','<none>')] += 1
                break
print('exploit modules used:', dict(mods))
print('sample kinds:', dict(kinds))
"
|
|
||||||
```
|
|
||||||
|
|
||||||
If `mods` is `{'<none>': N}` and `kinds` is `{'mimic': N}`, Tier 3
|
|
||||||
hasn't kicked in yet on any lab host — re-run
|
|
||||||
`install-tier-3-4.sh` there. If `mods` shows
|
|
||||||
`{'vsftpd_234_backdoor': N, ...}` and `kinds` shows a non-zero
|
|
||||||
`'real'` count, both tiers are live.
|
|
||||||
|
|
||||||
### Don't shortcut
|
|
||||||
|
|
||||||
- DO NOT install `metasploit-framework` system-wide outside
|
|
||||||
`install-msfrpcd.sh`. The script wires the systemd unit + creds;
|
|
||||||
a manual install bypasses the orchestrator's
|
|
||||||
`_msfrpcd_available()` probe.
|
|
||||||
- DO NOT add bogus sha256 entries to `manifest.toml` —
|
|
||||||
`auto_fetch_samples.py` hash-verifies every binary it stages.
|
|
||||||
- DO NOT add real-binary entries by hand when `auto_fetch_samples.py`
|
|
||||||
exists. Hand-edits are racy with the auto-fetcher's tempfile
|
|
||||||
rewrite.
|
|
||||||
|
|
||||||
## Securing the connection (mTLS) — DO NOT mint your own certs
|
## Securing the connection (mTLS) — DO NOT mint your own certs
|
||||||
|
|
||||||
The lab-host ↔ Pi connection is mTLS over WireGuard. **The cert
|
The lab-host ↔ Pi connection is mTLS over WireGuard. Cert delivery
|
||||||
delivery is fully automated.** You should never run `openssl`, write
|
is automated via `bootstrap.wg/v1/cert/<host_id>`. You should never
|
||||||
a CSR, edit a Caddyfile, or generate a private key on the lab host.
|
run `openssl`, write a CSR, edit a Caddyfile, or generate a private
|
||||||
If you find yourself doing any of that, you're off the runbook.
|
key on the lab host. If you find yourself doing any of that, you're
|
||||||
|
off the runbook.
|
||||||
|
|
||||||
**The actual cert flow:**
|
The most common reason cert fetch appears to fail is `host_id` still
|
||||||
|
being `REPLACE_ME` in `/etc/cis490/lab-host.toml`. Check that first.
|
||||||
1. The lab host comes up on WireGuard via `wg-enroll` (USB-driven,
|
|
||||||
one-time, separate project). After this, the lab host can reach
|
|
||||||
`bootstrap.wg` and `collector.wg` on the `10.100.0.0/24` overlay.
|
|
||||||
2. `scripts/install-lab-host.sh`, on its way through, pulls the leaf
|
|
||||||
cert + CA bundle from `https://bootstrap.wg/v1/cert/<host_id>`
|
|
||||||
over plain TLS (CA bundled in `etc/caddy-root.crt`). Trust
|
|
||||||
boundary is "this peer is on the WG mesh" — `iptmonads` already
|
|
||||||
gates the bootstrap port to enrolled peers.
|
|
||||||
3. The fetch step is a no-op if `host_id` is still the default
|
|
||||||
`REPLACE_ME` in `/etc/cis490/lab-host.toml`. **This is the most
|
|
||||||
common reason agents think cert delivery is broken.**
|
|
||||||
|
|
||||||
**The one fix that resolves 95 % of "cert/TLS/connection" reports:**
|
|
||||||
|
|
||||||
```sh
|
|
||||||
# 1. Make sure host_id is set:
|
|
||||||
sudo grep '^host_id' /etc/cis490/lab-host.toml
|
|
||||||
# If it says "REPLACE_ME", edit it to the real host_id you registered.
|
|
||||||
|
|
||||||
# 2. Re-run the installer. It will fetch the cert from bootstrap.wg.
|
|
||||||
sudo /opt/cis490/scripts/install-lab-host.sh
|
|
||||||
|
|
||||||
# 3. Confirm certs landed:
|
|
||||||
ls -l /etc/cis490/certs/ # expect lab-host.pem, lab-host.key, wg-ca.pem
|
|
||||||
|
|
||||||
# 4. Smoke-test the pipe:
|
|
||||||
sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
|
|
||||||
--config /etc/cis490/lab-host.toml --ping
|
|
||||||
# {"ok": true, ...} → done.
|
|
||||||
```
|
|
||||||
|
|
||||||
If step 2 prints `WARN: bootstrap.wg fetch failed`, the cause is
|
|
||||||
almost always one of:
|
|
||||||
|
|
||||||
- `bootstrap.wg` DNS not resolving → add to `/etc/hosts`:
|
|
||||||
`echo '10.100.0.1 bootstrap.wg collector.wg' | sudo tee -a /etc/hosts`
|
|
||||||
- `wg0` interface not up → `sudo wg show` should list a peer; if not,
|
|
||||||
re-run wg-enroll.
|
|
||||||
- The Pi's `cis490-bootstrap.service` is down → file an issue against
|
|
||||||
the receiver-side host, not against this repo.
|
|
||||||
|
|
||||||
**What you should NOT do, even if it feels like it would help:**
|
|
||||||
|
|
||||||
- Generate certs with `openssl` or `step-cli` on the lab host.
|
|
||||||
- Copy certs from another lab host.
|
|
||||||
- Set `verify_tls = false` in `lab-host.toml` to "skip TLS for now."
|
|
||||||
- Restart the shipper repeatedly hoping it self-heals — it already
|
|
||||||
retries on every request without restart.
|
|
||||||
- File a Forgejo issue titled "shipper can't connect" without first
|
|
||||||
running the four-line block above and pasting its output.
|
|
||||||
|
|
||||||
The shipper's `waiting on mTLS material` log line is **expected**
|
The shipper's `waiting on mTLS material` log line is **expected**
|
||||||
during first-boot until the cert lands. It is not an error to fix.
|
during first-boot until the cert lands. It is not an error. The
|
||||||
The transport rebuilds the SSL context on each request, so the
|
transport rebuilds the SSL context on each request, so the moment
|
||||||
moment certs land in `/etc/cis490/certs/`, the next ping/ship
|
certs land in `/etc/cis490/certs/`, the next attempt succeeds — no
|
||||||
attempt succeeds — no restart needed.
|
restart needed.
|
||||||
|
|
||||||
## Common bring-up gotchas (read this before debugging an install)
|
## Filing issues
|
||||||
|
|
||||||
Smaller models acting as on-device agents have hit these traps. Each
|
When you run into an issue you cannot fully resolve in the current
|
||||||
one is now fixed in main, but if you're on an older clone you may
|
turn, file it as a Forgejo issue on the relevant repo. Do not
|
||||||
still see the symptom — pull `origin/main` first, then re-read.
|
silently log a TODO comment, leave a partial workaround, or assume
|
||||||
|
someone else will remember.
|
||||||
|
|
||||||
### Run tools from `/opt/cis490`, not from a manual clone
|
File issues for:
|
||||||
|
- A build / test / typecheck failure you can't fix in scope.
|
||||||
|
- A bug you discover but aren't tasked with fixing.
|
||||||
|
- A missing dep, missing config, or env-only failure that blocks
|
||||||
|
E2E.
|
||||||
|
- A design gap you've worked around but want a follow-up to fix
|
||||||
|
properly.
|
||||||
|
|
||||||
When you run `cis490-doctor` from a clone like `~/.env/CIS490/`,
|
Don't file when:
|
||||||
Python prepends the clone path to `sys.path`. Subprocesses spawned
|
- The user is in the conversation and you can just tell them.
|
||||||
by the doctor (e.g., `python -m shipper --ping`) inherit the calling
|
- It's already filed (search first:
|
||||||
CWD and pick up the clone's `shipper/` package instead of the
|
`GET /api/v1/repos/<owner>/<repo>/issues?state=open&q=<keyword>`).
|
||||||
service venv at `/opt/cis490/`. Symptom: tracebacks reference the
|
- It's truly a non-issue (a one-line edit you're about to make this
|
||||||
clone path, or `No module named exploits` despite `package = false`.
|
same turn).
|
||||||
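The "search first" step can be scripted. The sketch below builds the search URL from the endpoint quoted above and, optionally, fetches the open issues. The base URL, the token handling, and both function names are assumptions for illustration; only the path and the `state` / `q` parameters come from this document.

```python
import json
import urllib.parse
import urllib.request


def search_url(base: str, owner: str, repo: str, keyword: str) -> str:
    """Build the open-issue search URL for one repo."""
    path = "/api/v1/repos/{}/{}/issues".format(
        urllib.parse.quote(owner, safe=""), urllib.parse.quote(repo, safe="")
    )
    query = urllib.parse.urlencode({"state": "open", "q": keyword})
    return base.rstrip("/") + path + "?" + query


def open_issues(base: str, owner: str, repo: str, keyword: str, token: str = ""):
    """Fetch matching open issues and return the parsed JSON response."""
    request = urllib.request.Request(search_url(base, owner, repo, keyword))
    if token:
        # Assumed auth scheme: a personal access token in the
        # Authorization header.
        request.add_header("Authorization", "token " + token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

If the returned list already contains an issue covering the problem, add to it instead of filing a duplicate.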
|
|
||||||
**Fix already in main:** the doctor passes `cwd=/opt/cis490` to the
|
### How to file (Forgejo API)
|
||||||
shipper subprocess and inserts `repo_root` into `sys.path` itself.
|
|
||||||
**Operator action:** always invoke either as
|
|
||||||
`/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py`
|
|
||||||
or via `cd /opt/cis490 && ./tools/cis490_doctor.py`. Don't run from a
|
|
||||||
clone unless you know what you're doing.
|
|
||||||
|
|
||||||
### Shipper logs "waiting on mTLS material" — this is expected, not a bug
|
|
||||||
|
|
||||||
The `cis490-shipper` unit is enabled by `install-lab-host.sh` *before*
|
|
||||||
the Pi has issued the host's mTLS leaf. The transport pre-flights the
|
|
||||||
configured `ca_bundle` / `client_cert` / `client_key` paths and, if
|
|
||||||
any are missing, defers building the SSL context. You'll see one
|
|
||||||
warning per process lifetime:
|
|
||||||
|
|
||||||
```
|
|
||||||
shipper waiting on mTLS material (client_cert path missing: …); will retry each request
|
|
||||||
```
|
|
||||||
|
|
||||||
The unit stays up. Each ping/ship attempt re-tries the build. Once
|
|
||||||
the Pi runs `deploy-cis490-cert.sh <host_id> <wg_ip>` and the leaf
|
|
||||||
lands at `/etc/cis490/certs/`, the next request succeeds and the
|
|
||||||
transport logs `mTLS material now on disk; shipper transport ready`.
|
|
||||||
|
|
||||||
**Do not** try to "fix" the warning by restarting the unit, deleting
|
|
||||||
the config, or hand-rolling certs — just confirm the Pi-side step
|
|
||||||
ran and wait one scan interval.
|
|
||||||
|
|
||||||
### Outdated clone? Pull main first.
|
|
||||||
|
|
||||||
A long list of install-time bugs (cp self-copy, missing service
|
|
||||||
restart, fatal-loop quarantine, ca_bundle pointing at the wrong
|
|
||||||
chain, busybox pgrep flags, pycdlib in the wrong dep group, missing
|
|
||||||
vm/images/ symlink target, doctor sys.path) have been fixed and are
|
|
||||||
all resolved in main. **If you hit any "this used to work" symptom
|
|
||||||
on a host that hasn't pulled in a while, the canonical command is
|
|
||||||
always the same:**
|
|
||||||
|
|
||||||
```sh
|
|
||||||
cd /opt/cis490 && sudo -u cis490 git pull origin main && \
|
|
||||||
sudo /opt/cis490/scripts/install-lab-host.sh
|
|
||||||
```
|
|
||||||
|
|
||||||
That one command:
|
|
||||||
|
|
||||||
- Re-stamps `/opt/cis490/VERSION` so episodes get a valid
|
|
||||||
`code_version.commit` — required by the receiver's gate.
|
|
||||||
- Drains pre-stamp episodes from `data/episodes/` to
|
|
||||||
`data/quarantine/` via `tools/quarantine_unstamped.py` so the queue
|
|
||||||
stops looping on them.
|
|
||||||
- Runs `daemon-reload` and `systemctl restart cis490-shipper
|
|
||||||
cis490-orchestrator` so the live daemons pick up the new code
|
|
||||||
(a bare `git pull` does NOT do this — Python module objects in the
|
|
||||||
running process are frozen at last service start).
|
|
||||||
- Re-runs the Tier-3+4 deploy idempotently if the cert is on disk.
|
|
||||||
|
|
||||||
After it returns, the shipper will be running as `Type=notify` with
|
|
||||||
`WatchdogSec=180` — systemd kills + restarts it if a scan pass hangs.
|
|
||||||
|
|
||||||
### The classifier is multi-source — don't gut episodes on /proc alone
|
|
||||||
|
|
||||||
`tools/prune_episodes.py` cross-checks four telemetry sources before
|
|
||||||
flagging an episode as flat:
|
|
||||||
|
|
||||||
- `telemetry-proc.jsonl` — host qemu-system /proc CPU%
|
|
||||||
- `netflow.jsonl` — bridge_pcap byte counters (network profiles)
|
|
||||||
- `telemetry-qmp.jsonl` — virtio blockstats per-phase delta (io-walk,
|
|
||||||
ransomware-shape)
|
|
||||||
- `telemetry-guest.jsonl` — in-guest agent load_1m (low-and-slow,
|
|
||||||
any host with a working agent)
|
|
||||||
|
|
||||||
An episode flags as `flat-cpu` only when EVERY available source
|
|
||||||
shows no inter-phase variation. If `/proc` is flat but qmp blockstats
|
|
||||||
show 90 MB written during `infected_running`, the episode is kept —
|
|
||||||
the host /proc collector loses signal under contention but qmp sees
|
|
||||||
through. This is essential on laptop-class lab hosts (e.g.
|
|
||||||
elliott-thinkpad) where the guest is co-scheduled with 13 other VMs
|
|
||||||
and the per-VM /proc CPU% gets buried.
|
|
||||||
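The "flat only when EVERY available source is flat" rule can be sketched as follows. This is not the code in `tools/prune_episodes.py`: the function names, the per-phase summary format (one number per phase for each source), and the 5 % tolerance are all assumptions chosen to make the rule concrete.

```python
from typing import Mapping, Optional, Sequence


def has_variation(per_phase: Sequence[float], rel_tol: float = 0.05) -> bool:
    """True if per-phase values differ by more than rel_tol of their peak."""
    scale = max(abs(v) for v in per_phase)
    if scale == 0:
        return False  # all zeros: no signal at all
    return (max(per_phase) - min(per_phase)) / scale > rel_tol


def is_flat(sources: Mapping[str, Optional[Sequence[float]]]) -> bool:
    """Flag an episode as flat only if every available source is flat.

    A source that is None or empty counts as unavailable and is ignored.
    Assumption: with no available source at all, the episode is not flagged.
    """
    available = [values for values in sources.values() if values]
    if not available:
        return False
    return not any(has_variation(values) for values in available)
```

With this shape, a flat `/proc` series alongside a qmp series such as `[0.0, 90e6, 0.0]` keeps the episode, because one source with signal is enough to veto the flag.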
|
|
||||||
All four sources stamp `t_wall_ns`; phase mapping uses that, not
|
|
||||||
`t_mono_ns`, because /proc and labels are orchestrator-relative
|
|
||||||
while netflow/guest are wall-clock-anchored. If you add a new
|
|
||||||
collector, emit `t_wall_ns` from CLOCK_REALTIME on every row or your
|
|
||||||
data will silently bucket into "(pre)".
|
|
||||||
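A minimal sketch of what "emit `t_wall_ns` on every row" and wall-clock phase mapping might look like is shown below. The row layout, the function names, and the `baseline` phase name are hypothetical; `time.time_ns()` is used as the wall-clock source, which on Linux reads the realtime clock.

```python
import bisect
import time
from typing import List, Tuple


def make_row(source: str, **fields) -> dict:
    """Build one telemetry row stamped with both clocks."""
    return {
        "source": source,
        "t_wall_ns": time.time_ns(),       # wall clock, comparable across collectors
        "t_mono_ns": time.monotonic_ns(),  # process-relative, not comparable
        **fields,
    }


def phase_at(t_wall_ns: int, boundaries: List[Tuple[int, str]]) -> str:
    """Map a wall-clock timestamp to a phase name.

    `boundaries` is a list of (phase_start_t_wall_ns, phase_name) sorted
    by start time. Timestamps before the first start fall into "(pre)".
    """
    starts = [start for start, _ in boundaries]
    index = bisect.bisect_right(starts, t_wall_ns) - 1
    return boundaries[index][1] if index >= 0 else "(pre)"
```

A row stamped from a monotonic clock carries a small number that sorts before every phase start, which is exactly how data ends up in "(pre)" without any error being raised.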
|
|
||||||
-### Don't trust the in-guest probe alone — cross-check host CPU
-
-The `pre_kill_probe.yes` / `pre_kill_probe.sh` fields in
-`workload_killed` events are produced by `pgrep` running inside an
-Alpine guest. busybox's pgrep does NOT support the `-c` flag. Older
-versions of `VMLoadController._probe()` used `pgrep -c yes`, which
-exits 1 with a usage banner on busybox; the `|| echo 0` fallback then
-always reported `yes=0` regardless of whether the workload was
-running. This caused 244 episodes from `elliott-thinkpad` and
-`k-gamingcom` to be incorrectly labelled `workload-silent`.
-
-The fix landed in main (probe now uses `pgrep yes | wc -l`); episodes
-shipped after that commit have correct probe values. For older
-episodes still on disk, the prune classifier now requires `flat-cpu`
-(host-side CPU envelope confirms no signal) AND the probe to flag
-workload-silent — a probe-only zero is no longer trusted. So you can
-safely run `cis490-prune --archive` against the existing data without
-losing valid episodes.
-
-If you write any new in-guest diagnostic that runs commands via
-SerialClient, assume busybox/ash semantics: no `disown` builtin, no
-GNU `pgrep -c`, no bash `/dev/tcp`, no `[[ ]]`. Always pair an
-in-guest signal with the host-side `/proc` measurement before you
-declare an episode bad.
-
-### One traceback at a time
-
-When the doctor lights up multiple red rows, fix the topmost one and
-re-run rather than batching attempts. Each red row prints the exact
-operator command it expects you to run. Don't paraphrase or invent
-adjacent commands; the doctor is the source of truth for what's
-missing.
-
-## How an agent generates data on demand (without waiting for the timer)
-
-```sh
-# One labeled episode (90 s) with a chosen sample profile:
-sudo -u cis490 /opt/cis490/.venv/bin/python \
-/opt/cis490/tools/run_real_vm_demo.py \
---data-root /var/lib/cis490/data \
---sample mirai-class-bot
-
-# Force the shipper to run one pass:
-sudo systemctl start cis490-shipper.service # (if disabled)
-# or:
-sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
---config /etc/cis490/lab-host.toml --once
-
-# Confirm on the Pi:
-ssh <pi> 'sudo tail -3 /var/lib/cis490/index.jsonl'
-```
-
-Any tool that wants to *count* what's been shipped:
-
-```sh
-ssh <pi> 'sudo /opt/cis490/.venv/bin/python /opt/cis490/tools/index_reader.py \
---index /var/lib/cis490/index.jsonl --count-by host_id'
-```
-
-## Quick test for "is the pipe alive at all"
-
-The `--ping` mode of the shipper does an authenticated POST to
-`/v1/ping` and writes nothing to the database. Use it as the
-smallest E2E test:
-
-```sh
-sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
---config /etc/cis490/lab-host.toml --ping
-# {"ok": true, "status_code": 200, ...} on success.
-# Any other status is the operator's signal to read the error string.
-```
-
----
-
-## File an issue when you hit a problem
-
-**When you run into an issue you cannot fully resolve in the current
-turn, file it as a Forgejo issue on the relevant repo before moving
-on.** Do not silently log a TODO comment, leave a partial workaround,
-or assume someone else will remember. The issue tracker is the
-durable record.
-
-This applies to:
-
-- a build / test / typecheck failure you can't fix in scope
-- a bug you discover but aren't tasked with fixing
-- a missing dep, missing config, or env-only failure that blocks E2E
-- a design gap you've worked around but want a follow-up to fix
-properly
-- a scope-out you made (e.g. "deferred Tier 4 sample fetch") that
-needs an owner so it doesn't get lost
-
-Don't file an issue when:
-
-- the user is in the conversation and you can just *tell* them
-- it's already filed (search first: `GET /api/v1/repos/<owner>/<repo>/issues?state=open&q=<keyword>`)
-- it's truly a non-issue (a one-line edit you're about to make this
-same turn)
-
-## How to file (Forgejo API)
-
-The local Forgejo at `http://10.100.0.1:3000` accepts API calls with a
-token-bearer header:
 
 ```sh
 curl -s -X POST \
@@ -552,19 +134,19 @@ curl -s -X POST \
 The token comes from the user's session — never embed one in code or
 commits.
 
-### What a good issue body contains
+### Good issue body
 
 1. **Context** — one sentence on what was being attempted.
-2. **What happened** — the actual error, log line, or unexpected
-behavior. Paste exact output.
+2. **What happened** — the actual error or unexpected behavior. Paste
+exact output.
 3. **What was tried** — every workaround you attempted and why it
 didn't stick.
 4. **Suggested next step** — the smallest change that would resolve
-it, if you have a guess. "Unknown" is a fine answer.
+it, if you have a guess. "Unknown" is fine.
 5. **Related** — link the commit / PR / file:line where the issue
 surfaced.
 
-### What a good title looks like
+### Good titles
 
 | Bad | Good |
 |---|---|
@@ -572,25 +154,22 @@ commits.
 | `caddy thing` | `Caddy: client_auth requires absolute path; relative trusted_ca_cert_file silently fails` |
 | `fix later` | `shipper: 5xx backoff cap is 5min, doc says 1min — pick one` |
 
-## After filing
-
-- Reference the issue number in the next commit message:
-`Refs spectral/<repo>#<n>` or `Closes spectral/<repo>#<n>` if your
-current change actually fixes it.
-- If the issue is on a different repo than the one you're committing
-to, fully qualify: `spectral/wg-pki#3`.
+After filing, reference the issue in the next commit message:
+`Refs spectral/<repo>#<n>` or `Closes spectral/<repo>#<n>`. Fully
+qualify cross-repo: `spectral/wg-pki#3`.
 
 ## Other conventions
 
-- **Don't put off the hard parts.** Frame "deferred-with-reason" only
+- **Don't put off the hard parts.** "Deferred-with-reason" is only
 for genuine blockers (binary not present on this machine, external
 service unreachable). For anything you *could* do but find awkward
-— bridge setup, cross-arch quirks, fleet concurrency — do it. The
-user has flagged this twice when work was scoped down prematurely.
-When something genuinely is blocked by an operator artifact, file
-the Forgejo issue and *automate the bring-up* (e.g., installer
-script + sha256-verifying fetcher) so the moment the artifact lands
-it Just Works.
+— bridge setup, cross-arch quirks, fleet concurrency — do it.
+- **No architectural bandaids in the pipeline.** Compensating layers
+(auto-update timers, fix-yourself decision trees, prescriptive
+symptom→command tables, trainer-side prune scripts that paper over
+silent collectors) are not allowed in the data-collection /
+emulation / labelling path. Fix the producer instead. See
+PIPELINE.md.
 - **Naming:** never coin USB / device / service names on the user's
 behalf. Ask first. Reusing an old name is especially bad.
 - **`/etc` configs:** `Read` first, copy second. Never overwrite a
249
FIXYOURSELF.md
@@ -1,249 +0,0 @@
-# FIXYOURSELF.md — when the lab host can't unstick itself automatically
-
-**You are reading this because you (an on-device AI agent) are running
-on a CIS490 lab host and one of these is true:**
-
-- `cis490-autoupdate.timer` keeps failing.
-- The shipper logs show steady `412 commit-rejected` or `400 missing`.
-- `cis490-doctor --role lab-host` flagged you red.
-
-The normal "pull and reinstall" path (AGENTS.md row 1) sometimes can't
-recover on its own — most often because someone made a local commit
-on this host, or `/opt/cis490` got into a state `git pull --ff-only`
-refuses to fix. This doc is the explicit decision tree for those
-cases. Run the commands literally. Do not paraphrase.
-
----
-
-## Step 1 — Identify your state
-
-Run all of these. Save the output. Match the result against the
-table.
-
-```sh
-cd /opt/cis490
-sudo -u cis490 git rev-parse HEAD # → LOCAL
-sudo -u cis490 git rev-parse origin/main 2>&1 # → REMOTE (or error)
-sudo -u cis490 git fetch origin main 2>&1 # → fresh fetch result
-sudo -u cis490 git log -1 --format='%H %ci %s' # → what HEAD is
-sudo -u cis490 git status --porcelain # → uncommitted changes?
-```
-
-Then:
-
-| If you see… | Your state | Go to |
-|---|---|---|
-| LOCAL == REMOTE and `git status` empty | Not stuck — run §Z to be safe | §Z |
-| LOCAL != REMOTE and `git merge-base --is-ancestor HEAD origin/main` returns 0 | Behind main, no local commits | §A |
-| LOCAL != REMOTE and the merge-base check returns 1 (NON-zero) | **You have a local commit not on origin/main** | §B |
-| `git fetch` prints a network error | Connectivity broken | §C |
-| `/opt/cis490/.git` is missing | No git checkout — populated via `cp -aT` originally | §D |
-| `git status` shows tracked files modified | Uncommitted edits on this host | §E |
-
-If multiple match: §C blocks everything else (fix network first), then
-§D, then §E, then §B, then §A.
-
----
-
-## §A — Behind main, clean tree
-
-```sh
-cd /opt/cis490
-sudo -u cis490 git pull --ff-only origin main
-sudo /opt/cis490/scripts/install-lab-host.sh
-```
-
-`install-lab-host.sh` re-stamps VERSION, drains the pre-stamp queue,
-and restarts the daemons. Verify with §Z.
-
----
-
-## §B — You have a local commit not on origin/main
-
-This is the elliott-thinkpad case (2026-05-01..02). You committed
-something locally, the maintainer's `origin/main` doesn't have it,
-and the receiver's allow-list rejects every episode you ship. Pick
-ONE of B.1, B.2, B.3 — read all three first.
-
-### B.1 — Push your commit to origin/main (RECOMMENDED if your change is real)
-
-```sh
-cd /opt/cis490
-sudo -u cis490 git log -3 --stat HEAD # what you're about to push — read it
-sudo -u cis490 git push origin HEAD:main
-```
-
-If `git push` succeeds: the receiver allow-list picks it up within 5
-seconds, new episodes start landing immediately. Verify with §Z.
-
-If `git push` fails with "permission denied" or "auth required": you
-don't have push credentials. Skip to B.2 or B.3.
-
-If `git push` fails with "non-fast-forward": your local commit
-diverged AROUND origin/main (you committed against an older base).
-Try:
-
-```sh
-sudo -u cis490 git pull --rebase origin main
-# If the rebase has conflicts you can't resolve in 60 seconds, abort:
-# sudo -u cis490 git rebase --abort
-# Then go to B.2 or B.3.
-sudo -u cis490 git push origin HEAD:main
-```
-
-### B.2 — Discard your local commit, reset to origin
-
-⚠ **This deletes your local commit.** Only do this if you know it
-wasn't a real fix or the maintainer told you to discard it.
-
-```sh
-cd /opt/cis490
-sudo -u cis490 git fetch origin main
-sudo -u cis490 git log --oneline HEAD ^origin/main # what you're about to lose
-sudo -u cis490 git reset --hard origin/main
-sudo /opt/cis490/scripts/install-lab-host.sh
-```
-
-### B.3 — Stop, file an issue, wait
-
-If you can't decide between B.1 and B.2 (e.g. you don't know if your
-commit is a real fix), do this:
-
-```sh
-cd /opt/cis490
-HOST_ID=$(grep '^host_id' /etc/cis490/lab-host.toml | cut -d'"' -f2)
-LOCAL_SHA=$(sudo -u cis490 git rev-parse HEAD)
-DETAIL=$(sudo -u cis490 git log -3 --stat HEAD | head -100)
-
-# File the issue (replace <TOKEN> with the operator's Forgejo token —
-# do NOT embed yours in commits)
-curl -sS -X POST \
--H "Authorization: token <TOKEN>" \
--H "Content-Type: application/json" \
-http://10.100.0.1:3000/api/v1/repos/spectral/CIS490/issues \
--d "$(python3 -c "import json,os; print(json.dumps({
-'title': f\"$HOST_ID: stuck on local commit ${LOCAL_SHA:0:12}\",
-'body': f\"### What's at HEAD\n\n\`\`\`\n${DETAIL}\n\`\`\`\n\nNeed maintainer to choose: push HEAD to main, or reset --hard origin/main here?\"
-}))")"
-```
-
-Then leave the daemons running. The shipper will keep auto-quarantining
-the 412s — backlog grows but doesn't crash anything. Wait for a
-maintainer comment.
-
----
-
-## §C — Network broken
-
-```sh
-ping -c 1 10.100.0.1 # the Pi
-sudo wg show # is wg0 up?
-sudo systemctl restart wg-quick@wg0 # bring it back up
-sudo systemctl restart cis490-shipper cis490-orchestrator
-```
-
-If `ping 10.100.0.1` still fails after a `wg-quick` restart, this is
-a WireGuard / wg-enroll / iptmonads problem outside this repo. File
-an issue at `spectral/wg-enroll` or `spectral/iptmonads` and stop.
-
----
-
-## §D — `/opt/cis490/.git` missing
-
-The host was originally set up with `cp -aT` (no `.git/`). That makes
-auto-update impossible. Re-clone:
-
-```sh
-# Stop services so we don't race with the orchestrator mid-episode
-sudo systemctl stop cis490-shipper cis490-orchestrator
-
-# Preserve config/data — only /opt/cis490 (the code) gets replaced.
-# /etc/cis490/ and /var/lib/cis490/ are NOT touched.
-sudo mv /opt/cis490 /opt/cis490.pre-fix
-sudo git clone http://maxgit.wg:3000/spectral/CIS490.git /opt/cis490
-sudo chown -R cis490:cis490 /opt/cis490
-
-sudo /opt/cis490/scripts/install-lab-host.sh
-# Once verified, you can drop the backup:
-# sudo rm -rf /opt/cis490.pre-fix
-```
-
----
-
-## §E — Uncommitted edits on tracked files
-
-```sh
-cd /opt/cis490
-sudo -u cis490 git status --short # see what's modified
-sudo -u cis490 git diff # see exactly what changed
-```
-
-If the changes are intentional (e.g. you fixed a bug), commit them
-first and then go to §B:
-
-```sh
-sudo -u cis490 git add <files>
-sudo -u cis490 git commit -m "<short description>"
-# Now go to §B.
-```
-
-If the changes are accidental / left over from debugging, discard
-them:
-
-```sh
-sudo -u cis490 git checkout -- .
-# Now go to §A.
-```
-
----
-
-## §Z — Verify you're unstuck
-
-```sh
-# 1. Daemons up?
-systemctl is-active cis490-shipper cis490-orchestrator
-# Both should say "active".
-
-# 2. VERSION present and matches HEAD?
-cat /opt/cis490/VERSION
-sudo -u cis490 git -C /opt/cis490 rev-parse HEAD
-# The "commit" in VERSION should equal the rev-parse output.
-
-# 3. Recent ship results — should NOT be 412/400-fatal-only:
-sudo journalctl -u cis490-shipper --since '2 minutes ago' | tail -20
-
-# 4. Maintainer-side confirmation (if you can reach the Pi):
-# sudo tail -5 /var/lib/cis490/index.jsonl
-# Should show recent received_at_wall timestamps with your host_id.
-
-# 5. Run the doctor — every red row prints its own fix:
-/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py --role lab-host
-```
-
----
-
-## Hard rules (don't break these)
-
-These are repeated from AGENTS.md because the situations covered in
-this doc are exactly when an agent is most tempted to break them.
-
-- **Do NOT** edit `meta.json` files to add a fake `code_version`. The
-gate exists to keep buggy data out of the index.
-- **Do NOT** disable / mask / `systemctl stop` `cis490-shipper` to
-silence log noise. Fix the cause.
-- **Do NOT** set `verify_tls = false` in `lab-host.toml`. That's a
-bypass, not a fix.
-- **Do NOT** `git commit` on `/opt/cis490` to "make the gate
-happy" — committing without pushing is exactly how the
-elliott-thinkpad incident started. If you commit, push (§B.1) or
-don't commit at all.
-- **Do NOT** delete `/var/lib/cis490/data/quarantine/`. Those are
-forensic records of what the gate rejected; they're capped at 30
-days by the cleanup pass.
-- **Do NOT** clobber `/etc/cis490/certs/`. Cert delivery is
-automated; rerun `install-lab-host.sh` if certs are missing.
-
-If you find yourself wanting to do any of the above, stop and file
-an issue (§B.3 has the curl command). The maintainer would much
-rather resolve a stale lab host by reading an issue than by
-reverse-engineering what an agent did to escape a stuck state.
900
PIPELINE.md
Normal file
@@ -0,0 +1,900 @@
+# PIPELINE.md — the CIS490 generative pipeline honesty plan
+
+**This document is canonical.** It supersedes any guidance in
+`AGENTS.md`, `FIXYOURSELF.md`, `README.md`, or other repo docs that
+contradicts it. If another doc says something different, this doc wins
+and the other doc is wrong (file an issue or fix it).
+
+This is not an architecture overview. This is a fix list. Read it,
+implement it, do not split it into phases.
+
+**Before proposing any change to the pipeline, re-read §1, §7, and §8
+and run your proposal against §8's checklist.** Then proceed.
+
+---
+
+## 1. Principle
+
+Every episode that reaches the dataset must be ground-truth. Every
+host runs the same experiment with the same configured catalog. Every
+exploit module and every collector in the catalog has been proven to
+work end-to-end before it is eligible to run. There are no
+compensating layers — no auto-update timers that drag stale peers
+forward, no "fix-yourself" decision trees, no per-host divergence
+absorbed by trainer-side filters, no labels written by clock when the
+event they describe didn't happen.
+
+If a host can't meet the bar, it produces zero episodes and says so
+loudly. A small honest dataset beats a large dishonest one.
+
+**Default to removal, not addition.** If a problem can be fixed by
+deleting code or removing a layer, prefer that. Adding a layer is
+the suspect default and should be justified against §7 and §8 before
+proceeding.
+
+---
+
+## 2. What the experiments are for
+
+CIS490 trains a behavioral malware-detection model. The dataset is
+the ground-truth labelled record of what the host looked like during
+known-clean, known-armed, known-infecting, and known-infected phases
+of a real exploit chain against a real target service. The model
+learns to distinguish those phases from in-deployment
+behavior. **Every dishonest label is a poisoned training example.**
+
+This is why the producer's job is not "ship lots of episodes." It is
+"ship episodes whose labels are true."
+
+---
+
+## 3. What is currently broken (evidence)
+
+Numbers from the 200-episode quality probe on 2026-05-03:
+
+1. **Labels lie.** 0 of 67 Tier-3 exploit fires resulted in a
+`session_open` event. All 67 logged `session_open_timeout`. Yet
+every one of those 67 episodes is labelled
+`phase=infected_running` because the schedule-driven labeller
+transitions on a clock, not on observed events. The
+`infected_running` label in the dataset means "the schedule said
+so," not "an attacker session was actually open on this host."
+2. **Collectors are silent.**
+- `perf` produces 0 rows on 100% of episodes on both hosts.
+- `guest-agent` produces 0 rows on 100% of episodes on both hosts.
+- `qmp`, `netflow`, and `pcap` produce 0 rows on 100% of
+k-gamingcom episodes (different config from elliott).
+- The host `tcpdump` is missing on k-gamingcom; `pcap_unavailable`
+is logged then ignored.
+3. **The catalog is unverified.** Modules are added to the rotation
+without a per-module verification that the module actually lands a
+session against its declared target. `samba_usermap_script` has a
+100% failure rate against the configured Metasploitable2 target
+and was still in the rotation.
+4. **Hosts run divergent experiments.** elliott and k-gamingcom have
+different per-host manifests, different collector coverage,
+different qemu invocations. The dataset is a union of two
+different experiments, not 200 samples from one.
+5. **Working trees are dirty.** 200/200 episodes report `dirty=true`,
+so `code_version.commit` is unverifiable provenance.
+
+Each of these is a failure of the producer. Receiver-side filtering
+and trainer-side prune scripts are bandaids that hide them.
+
+---
+
+## 4. The fix — line items
+
+Every item below must land. They are not phases. They are parts of
+one cohesive correctness story; any of them missing leaves the
+pipeline half-honest. Each item names its acceptance test.
+
+### 4.1 Canonical manifest
+
+There is exactly one manifest, version-pinned in the repo at
+`manifest.toml`. Every lab host loads the same manifest. There is no
+per-host manifest override, no per-host collector enable/disable
+flag, no per-host qemu argument list. Hosts that cannot run the
+canonical manifest exit 78 at orchestrator startup.
+
+**Acceptance:** `find . -name manifest.toml -not -path './.git/*'`
+returns exactly one path. There is no `--manifest` CLI flag on the
+orchestrator that takes a different path; the path is hard-coded.
+Removing this line item would re-create the host divergence we just
+exited.
+
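
The §4.1 acceptance check can be expressed as a small Python test helper. This is a minimal sketch, assuming it runs against the repo root; the function names `find_manifests` and `check_single_manifest` are hypothetical and do not exist in the repo. It mirrors the `find` command by skipping anything under `.git`.

```python
from pathlib import Path


def find_manifests(repo_root: str) -> list[Path]:
    """Return every manifest.toml under repo_root, ignoring the .git directory."""
    root = Path(repo_root)
    return sorted(
        p for p in root.rglob("manifest.toml")
        if ".git" not in p.relative_to(root).parts
    )


def check_single_manifest(repo_root: str) -> bool:
    """True only when exactly one manifest.toml exists in the tree."""
    return len(find_manifests(repo_root)) == 1
```

A second `manifest.toml` anywhere in the working tree (for example a per-host copy) makes the check fail, which is the divergence the line item is meant to prevent.
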
+### 4.2 Target VMs we build, not VMs we fetch
+
+Every target VM image is built from a declarative spec checked into
+the repo (Packer, mkosi, debootstrap, whatever — declarative). The
+image build produces a sha256-pinned artifact. The build script
+verifies, before producing the artifact, that:
+
+- The vulnerable service is up after first boot.
+- The service is on the port the module catalog declares.
+- The service version matches the version the module catalog
+declares.
+
+`Metasploitable2` from a SourceForge mirror is removed. We don't
+ship episodes targeting black-box images.
+
+**Acceptance:** `scripts/build-target-<name>.sh` exists for every
+target referenced by an exploit module. Running it produces an image
+whose post-boot state passes the spec's verification step. The
+verification step's exit code gates the build's exit code.
+
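
As an illustration of what "sha256-pinned" means in practice, the following sketch compares an image file against a pinned digest using the standard-library `hashlib`. The helper names and the idea that the pin arrives as a hex string are assumptions for illustration; the plan does not say where the pin is stored.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large VM images never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def artifact_matches_pin(path: str, pinned_sha256: str) -> bool:
    """Compare the computed digest against the pinned hex digest (hypothetical pin source)."""
    return sha256_of(path) == pinned_sha256.strip().lower()
```
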
+### 4.3 Module catalog admission criteria
+
+A module is in the catalog *only if* it passes a recorded end-to-end
+verification run against its declared target. The verification is:
+
+1. Boot the target snapshot.
+2. Fire the module via msfrpcd.
+3. Observe a `session_open` event (not `session_open_timeout`).
+4. Observe at least one shell command round-trip on the session.
+5. Confirm guest-side artifact (file written, process spawned —
+per-module).
+
+If any step fails, the module does not enter the catalog. There is
+no "tentatively included" tier. Modules already in the catalog are
+re-verified by `scripts/verify-catalog.sh` (new) on every release;
+failures remove the module from the catalog.
+
+**Acceptance:** every entry in `exploits/modules/*.toml` has a
+companion `verified_against = "<target_name>"` and
+`last_verified = "<commit_sha>"` field. `scripts/verify-catalog.sh`
+re-runs every entry and exits 0 only if every one passes.
+
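
The field requirement in the §4.3 acceptance criterion lends itself to a static check that runs before the (much slower) end-to-end verification. The sketch below assumes each module file has already been parsed into a dict (for example with the standard-library `tomllib` on Python 3.11+); the function names are hypothetical. It only checks that the two fields are present and non-empty, not that the verification actually passed.

```python
REQUIRED_FIELDS = ("verified_against", "last_verified")


def missing_verification_fields(entry: dict) -> list[str]:
    """Return the required provenance fields that are absent or empty in one module entry."""
    return [
        name for name in REQUIRED_FIELDS
        if not isinstance(entry.get(name), str) or not entry[name].strip()
    ]


def unverified_modules(catalog: dict[str, dict]) -> dict[str, list[str]]:
    """Map module name -> missing fields, listing only modules that fail the check."""
    report = {}
    for name, entry in catalog.items():
        missing = missing_verification_fields(entry)
        if missing:
            report[name] = missing
    return report
```

An empty report is a necessary condition for admission, not a sufficient one; the recorded verification run remains the real gate.
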
+### 4.4 Collector admission criteria
+
+A collector is in the active set *only if* it passes a recorded
+end-to-end verification run that confirms it emits non-zero rows
+against a known-busy probe workload.
+
+For each of the six collectors (`proc`, `qmp`, `netflow`, `perf`,
+`guest`, `pcap`):
+
+1. Diagnose the current zero-row failure (read the code, run
+standalone, find the actual cause). Fix the cause.
+2. Add a unit-or-integration test that runs the collector for N
+seconds against a synthesized workload (a busy-loop process for
+`proc`/`perf`, a packet generator for `netflow`/`pcap`, a QMP
+blockstats query for `qmp`, a guest heartbeat for `guest`) and
+asserts ≥1 row.
+3. The test must run in CI and on every install via the install
+script.
+
+A collector that cannot pass admission is removed from the active
+set with a recorded reason — not silently included with zero rows.
+
+**Acceptance:** `pytest tests/test_collectors_emit.py -k <name>`
+passes for each name. The CI run gates merges.
+
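
One possible shape for the "runs for N seconds against a synthesized workload and asserts ≥1 row" test is sketched below. It is an assumption-laden illustration: `sample_once` stands in for whatever single-sample entry point a real collector exposes (the repo's actual collector API is not shown in this plan), and the busy loop is a thread rather than a separate process to keep the sketch self-contained.

```python
import threading
import time


def busy_loop(stop: threading.Event) -> None:
    """Synthesized workload: spin until told to stop."""
    while not stop.is_set():
        pass


def run_against_busy_workload(sample_once, seconds: float = 0.2, interval: float = 0.05) -> list:
    """Poll a collector-like callable while the workload runs; return the rows it emitted."""
    stop = threading.Event()
    worker = threading.Thread(target=busy_loop, args=(stop,), daemon=True)
    worker.start()
    rows = []
    deadline = time.monotonic() + seconds
    try:
        while time.monotonic() < deadline:
            row = sample_once()  # hypothetical collector hook; None means "no row"
            if row is not None:
                rows.append(row)
            time.sleep(interval)
    finally:
        stop.set()
        worker.join()
    return rows
```

An emit test would then assert `len(rows) >= 1` for each collector name, so a collector that silently returns nothing fails admission instead of shipping zero rows.
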
+### 4.5 Event-driven labelling
+
+Phase labels are written from observed events, never from the
+schedule clock. The schedule becomes a *time budget* — maximum time
+the orchestrator will wait in each phase — not a label source.
+
+Specifically:
+
+- `clean` is written at episode start.
+- `armed` is written when the orchestrator instructs the driver to
+fire (this is observable in code).
+- `infecting` is written when the `exploit_fire` event is observed.
+- `infected_running` is written **only** when the `session_open`
+event is observed.
+- If `session_open_timeout` is observed instead, the episode
+terminates with a `failed` label and is rejected (see §4.6).
+- `dormant` and subsequent `infected_running` transitions are
+written from observed in-session idle / activity, not from clock.
+
+Per-module timeouts replace the global 30s timeout. Default 120s,
+configurable per module in `exploits/modules/*.toml`.
+
+**Acceptance:** for every shipped episode, every entry in
+`labels.jsonl` has a corresponding event in `events.jsonl` with a
+matching `t_mono_ns` within ±100ms. An invariant test asserts this.
+
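
The ±100ms invariant in the §4.5 acceptance criterion can be sketched as a pure function over the parsed rows of `labels.jsonl` and `events.jsonl`. Only the `t_mono_ns` field name comes from the plan; the function name is hypothetical, and this sketch checks timestamps only. Which event name corresponds to which phase is not enforced here.

```python
import bisect

TOLERANCE_NS = 100_000_000  # ±100 ms expressed in nanoseconds


def unmatched_labels(labels: list[dict], events: list[dict],
                     tolerance_ns: int = TOLERANCE_NS) -> list[dict]:
    """Return every label that has no event within tolerance of its t_mono_ns."""
    event_times = sorted(e["t_mono_ns"] for e in events)
    missing = []
    for label in labels:
        t = label["t_mono_ns"]
        # First event at or after the lower edge of the tolerance window.
        i = bisect.bisect_left(event_times, t - tolerance_ns)
        if i == len(event_times) or event_times[i] > t + tolerance_ns:
            missing.append(label)
    return missing
```

An invariant test would assert that `unmatched_labels(...)` is empty for every shipped episode; a stricter variant could also require the matched event's name to agree with the phase (for example `session_open` for `infected_running`).
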
+### 4.6 Episode acceptance gate at finalization
+
+Before sealing meta and writing `done.marker`, the orchestrator
+verifies:
+
+- Every collector in the active set produced ≥1 row.
+- Every label has a matching event (§4.5 invariant).
+- For Tier-3 episodes: a `session_open` event exists.
+- `dirty=true` is absent OR `dirty_override=true` is present (see
+§4.9).
+
+If any check fails, the episode goes to `data/rejected/<id>/` with a
+`rejected_reason.json` describing which check failed. `done.marker`
+is not written. The shipper never sees it.
+
+**Acceptance:** `tests/test_acceptance_gate.py` covers each rejection
+condition. A passing test asserts a clean episode is accepted; for
+each failure mode, the test asserts the episode is moved to
+`rejected/` with the expected reason.
+
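
The four gate checks can be sketched as one function that returns the list of failed checks, so an empty list means "write `done.marker`" and a non-empty list is what would be serialized into `rejected_reason.json`. Everything about the data shapes here is an assumption: the `event` key, the per-collector `row_counts` mapping, the `tier` argument, and the reason strings are illustrative, not the repo's schema.

```python
def acceptance_failures(meta: dict, row_counts: dict[str, int], events: list[dict],
                        unmatched_label_count: int, tier: int) -> list[str]:
    """Return the names of failed gate checks; an empty list means the episode is accepted."""
    failures = []

    # Every collector in the active set must have produced at least one row.
    silent = sorted(name for name, count in row_counts.items() if count < 1)
    if silent:
        failures.append("silent_collectors:" + ",".join(silent))

    # Every label must have a matching event (the section 4.5 invariant).
    if unmatched_label_count > 0:
        failures.append("label_without_event")

    # Tier-3 episodes require an observed session_open event.
    if tier == 3 and not any(e.get("event") == "session_open" for e in events):
        failures.append("no_session_open")

    # A dirty tree is only acceptable with the explicit override flag.
    if meta.get("dirty") and not meta.get("dirty_override"):
        failures.append("dirty_tree")

    return failures
```

Collecting every failure, rather than stopping at the first, keeps the rejection record complete for whoever diagnoses the producer afterwards.
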
+### 4.7 Producer preflight
+
+`orchestrator/preflight.py` runs at orchestrator startup. One bar
+(no light/deep split). Checks:
+
+- Every binary required by the active collector set + active module
+catalog is on `PATH`.
+- `/dev/kvm` accessible by the service user.
+- `kernel.perf_event_paranoid <= 2`.
+- `cfg.bridge_iface` exists; `tcpdump` can capture on it.
+- `msfrpcd` reachable; `auth.login` returns a token.
+- For every module in catalog: `module.info` is fetchable.
+- For every sample in catalog: file present on disk; sha256 matches.
+- Probe-boot baseline-v1 snapshot; observe guest-agent heartbeat
+within N seconds.
+- `git status --porcelain` empty (or `CIS490_ALLOW_DIRTY=1`).
+- HEAD is on a commit currently in `origin/main`.
+
+Failures are collected (every failed check logged with diagnosis +
+remediation), then `sys.exit(78)`.
+
+**Acceptance:** `tests/test_preflight.py` covers each check
+individually with mocked subprocess/filesystem. `python -m
+orchestrator.preflight` runs the checks and prints a structured
+report. Exit codes: 0 ok, 78 sysadmin error.
+
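
The "collect every failure, then exit 78" behaviour can be sketched as follows. This is not the contents of `orchestrator/preflight.py`; the function names are hypothetical, only three of the listed checks are shown, and the message wording is invented for illustration. Each check returns a list of failure strings so that one failing check never hides another.

```python
import os
import shutil
import sys

EXIT_PREFLIGHT_FAILED = 78  # the exit code named in the plan


def check_binaries(names: list[str]) -> list[str]:
    """One failure per required binary that is not on PATH."""
    return [f"missing binary on PATH: {n}" for n in names if shutil.which(n) is None]


def check_kvm(path: str = "/dev/kvm") -> list[str]:
    """The current user needs read/write access to the KVM device."""
    return [] if os.access(path, os.R_OK | os.W_OK) else [f"{path} not accessible"]


def check_perf_paranoid(path: str = "/proc/sys/kernel/perf_event_paranoid",
                        limit: int = 2) -> list[str]:
    """kernel.perf_event_paranoid must be <= limit."""
    try:
        with open(path) as fh:
            value = int(fh.read().strip())
    except (OSError, ValueError):
        return [f"cannot read {path}"]
    return [] if value <= limit else [f"perf_event_paranoid={value}, need <= {limit}"]


def run_preflight(checks) -> int:
    """Run every check, log every failure, and return the process exit code."""
    failures = []
    for check in checks:
        failures.extend(check())
    for failure in failures:
        print("PREFLIGHT FAIL:", failure, file=sys.stderr)
    return 0 if not failures else EXIT_PREFLIGHT_FAILED
```

A caller would end with something like `sys.exit(run_preflight([...]))`, passing zero-argument callables so each check can be mocked individually in tests.
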
|
### 4.8 Receiver-side rejection (defense in depth)
|
||||||
|
|
||||||
|
**The receiver is defense-in-depth, NOT the primary correctness
|
||||||
|
mechanism.** The producer is. Receiver rejection exists to catch
|
||||||
|
peers running stale or broken code; it is never a substitute for
|
||||||
|
fixing the producer. A change that strengthens receiver rejection
|
||||||
|
without strengthening the producer is the defensive-instead-of-
|
||||||
|
corrective pattern (§7.9).
|
||||||
|
|
||||||
|
The receiver enforces the same correctness invariants the
|
||||||
|
orchestrator does. A peer running stale code that produces dishonest
|
||||||
|
episodes still gets rejected at ingest:
|
||||||
|
|
||||||
|
- Reject any meta with `dirty=true` and no `dirty_override=true`.
|
||||||
|
- Reject any meta where `phases_observed` contains `infected_running`
|
||||||
|
but `events.jsonl` (extracted from the tarball) lacks
|
||||||
|
`session_open`.
|
||||||
|
- Reject any meta where any configured-collector row count is zero.
|
||||||
|
- Existing commit-allow-list gate continues.
|
||||||
|
|
||||||
|
Rejections return 422 with a JSON body naming the failed check.
|
||||||
|
Rejected tarballs are not written to the index.
|
||||||
|
|
||||||
|
**Acceptance:** `tests/test_receiver_rejects.py` covers each new
|
||||||
|
rejection condition.
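
A minimal sketch of the three new ingest checks, assuming the meta has already been parsed into a dict and the event names have already been extracted from `events.jsonl`. The function name `receiver_check`, the `collector_rows` field, and the `failed_check` body key are assumptions for illustration; the commit-allow-list gate is not modelled here.

```python
from typing import Any, Dict, List, Optional, Tuple


def receiver_check(meta: Dict[str, Any],
                   event_names: List[str]) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return (422, body) for the first failed invariant, or None to accept.

    `collector_rows` is an assumed meta field mapping collector -> row count.
    """
    # Dirty tree without the explicit override stamp.
    if meta.get("dirty") and not meta.get("dirty_override"):
        return 422, {"failed_check": "dirty_without_override"}
    # Label claims an infected phase that no observed event justifies.
    if ("infected_running" in meta.get("phases_observed", [])
            and "session_open" not in event_names):
        return 422, {"failed_check": "infected_running_without_session_open"}
    # Any configured collector that emitted nothing.
    for collector, rows in meta.get("collector_rows", {}).items():
        if rows == 0:
            return 422, {"failed_check": f"zero_rows:{collector}"}
    return None
```

The checks deliberately mirror the producer-side gate; if this function ever rejects an episode from a current-commit peer, the producer is the thing to fix.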

### 4.9 Override discipline

The only escape hatch from the dirty-tree gate is the
`CIS490_ALLOW_DIRTY=1` environment variable. When set:

- Orchestrator logs `WARN: dirty tree override active`.
- meta.json gains `dirty_override: true`.
- Receiver accepts a `dirty=true` episode only if `dirty_override`
  is also `true`.
- Every override use is auditable from the dataset.

There are no other override knobs. No `verify_tls=false`, no "skip
preflight," no "include this collector even if it emits zero rows."
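
One way the producer side of the override could look, as a sketch only. The function name `dirty_stamp`, the logger name, and the choice to pass in the `git status --porcelain` output as a string are assumptions; the environment variable, the warning text, and the `dirty` / `dirty_override` fields come from this document.

```python
import logging
import os
from typing import Dict, Mapping

log = logging.getLogger("orchestrator")  # hypothetical logger name


def dirty_stamp(porcelain_output: str,
                env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Return the meta.json fields for the dirty-tree gate, or refuse to start.

    `porcelain_output` is the text printed by `git status --porcelain`;
    an empty string means the working tree is clean.
    """
    dirty = bool(porcelain_output.strip())
    override = env.get("CIS490_ALLOW_DIRTY") == "1"
    if dirty and not override:
        raise SystemExit(78)  # dirty tree, no override: fail loudly
    stamp = {"dirty": dirty}
    if override:
        log.warning("dirty tree override active")
        stamp["dirty_override"] = True  # makes every override use auditable
    return stamp
```

Stamping the override into every `meta.json` is what lets the receiver, and anyone auditing the dataset later, tell an overridden episode from a clean one.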

### 4.10 Regression-test discipline

Every fix in this plan lands with a test that would have caught the
regression at PR time. Tests are not a follow-up. A PR that fixes
the perf collector without a perf-emit test is incomplete and gets
sent back.

CI runs:

- All unit tests.
- `scripts/verify-catalog.sh` against a smoke target subset (the
  full catalog verification run is gated to release commits; it is
  too expensive for every PR).
- The collector-emit integration tests (§4.4) on real binaries.

### 4.11 systemd integration

- `cis490-orchestrator.service` adds
  `RestartPreventExitStatus=78`. A preflight failure stays loud and
  stuck instead of cycling restarts.
- On preflight failure, orchestrator writes
  `/var/lib/cis490/preflight.failed.json` with the failed checks +
  timestamps. Doctor surfaces this in its next report. The
  fleet-health alert distinguishes "preflight failed" from "host
  silent."
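
A sketch of writing the failure record, assuming a simple JSON layout. The function name `write_failure_record` and the field names `failed`, `check`, and `at` are hypothetical; only the path and the "failed checks + timestamps" content come from the text. The write goes through a temporary file and `os.replace` so a reader such as Doctor never sees a half-written record.

```python
import json
import os
import time
from pathlib import Path
from typing import List

DEFAULT_RECORD = Path("/var/lib/cis490/preflight.failed.json")


def write_failure_record(failed_checks: List[str],
                         path: Path = DEFAULT_RECORD) -> None:
    """Atomically write which preflight checks failed and when."""
    now = time.time()
    record = {
        # Assumed layout: one entry per failed check, each with a timestamp.
        "failed": [{"check": name, "at": now} for name in failed_checks],
    }
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(record, indent=2))
    os.replace(tmp, path)  # atomic rename on the same filesystem
```

With `RestartPreventExitStatus=78` in the unit, the record stays on disk next to a stopped service, which is what lets the alert distinguish "preflight failed" from "host silent."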

### 4.12 Cleanup of compensating layers

The following are deleted or cut back as part of this change. Their
existence was load-bearing for the dishonest pipeline; the honest
one doesn't need them.

- `FIXYOURSELF.md`: entire file deleted. Stuck states no longer
  exist as a class because the gates make them impossible.
- `cis490-autoupdate.timer` + `scripts/auto-update.sh`: deleted.
  Hosts run pinned commits. New code is rolled out by the operator,
  not auto-pulled.
- `cis490-cert-fetch.timer`: replaced by a one-shot first-boot
  fetch in `install-lab-host.sh`. No periodic re-fetch.
- `tools/quarantine_unstamped.py`: deleted. Pre-stamp episodes
  cannot exist because no episode is written without a valid stamp.
- `tools/check_fleet_health.py`: keep, but delete the "fatal-only"
  alert branch (that branch existed because we were shipping fatals;
  with the gate, we don't).
- `tools/prune_episodes.py`'s "kept episode despite flat /proc
  because qmp showed write" cross-check logic: deleted. Episodes
  that don't pass the producer-side gate don't reach the trainer.
- AGENTS.md "symptom→fix table": deleted (the symptoms it covers
  are now impossible).
- AGENTS.md "Hosts self-update" section: deleted.

### 4.13 Containment bar

Real malware execution requires explicit containment. Target VMs
exist in an isolation context that is part of the canonical
experiment, not a deployment detail. A future change that weakens
any of the items below is a containment regression and is rejected
regardless of what experimental realism it claims to add.

For every target VM in the catalog (§4.2):

- **Network:** target attaches to a bridge with NO upstream egress.
  No NAT to the host network, no internet route, no DNS resolution
  beyond what the experiment provides. Outbound C2 callbacks
  resolve to a sinkhole inside the experiment, never to the
  internet.
- **Filesystem:** no shared mount with the host. No 9p, no
  virtio-fs with host paths. The target's disk is the snapshot it
  was booted from, period.
- **Privilege:** QEMU runs as the unprivileged service user. KVM
  access is via group membership only; no setuid wrappers, no
  privileged TUN ownership transfer, no passthrough of host
  devices not explicitly required by the catalog.
- **Lifetime:** every target boots from a fresh snapshot. State
  from one episode never crosses into the next. The snapshot is
  reverted at episode end, not "cleaned."
- **Escape monitoring:** any QEMU exit that is not a clean shutdown
  is logged with full QMP state and the episode is marked `failed`.
  Two unclean exits on the same target image within a release
  window trigger admission-criteria re-verification (§4.3) for
  every module targeting that image.

**Acceptance:** `tests/test_containment.py` asserts each target
build (a) has no upstream egress route from inside the guest,
(b) has no host-shared filesystem mount, (c) runs QEMU as the
unprivileged service user, (d) reverts to snapshot at episode end.
The test runs in CI and on every install.
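
As a partial illustration of the network and filesystem items, the sketch below statically scans a QEMU argument list for options that would share host files with the guest or enable user-mode (NAT) networking. It assumes the target is launched from an argv the test can inspect; the function name `containment_violations` is hypothetical, and a static scan like this does not replace the in-guest egress test that the acceptance criteria call for.

```python
from typing import List

# QEMU options that export host directories into the guest (9p).
FORBIDDEN_FLAGS = {"-virtfs", "-fsdev"}
# Device name fragment used by virtio-fs (e.g. vhost-user-fs-pci).
FORBIDDEN_SUBSTRINGS = ("vhost-user-fs",)
# Options whose next argument names a network backend.
NETWORK_FLAGS = ("-netdev", "-nic", "-net")


def containment_violations(argv: List[str]) -> List[str]:
    """Return a human-readable list of containment problems in a QEMU argv."""
    violations = []
    for i, arg in enumerate(argv):
        if arg in FORBIDDEN_FLAGS:
            violations.append(f"host-shared filesystem option {arg}")
        if any(s in arg for s in FORBIDDEN_SUBSTRINGS):
            violations.append(f"virtio-fs device in {arg!r}")
        if arg in NETWORK_FLAGS and i + 1 < len(argv):
            backend = argv[i + 1].split(",")[0]
            if backend == "user":
                # The "user" backend NATs guest traffic through the host.
                violations.append("user-mode (NAT) networking")
    return violations
```

An empty result is necessary but not sufficient: a tap device on a bridge that itself has an upstream route would pass this scan and still violate the network item.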

---

## 5. Build order

There is no half-honest intermediate state. The order below
sequences the work; it does not phase the deployment. Everything
lands to `main` in one merge.

1. Fix the four root-cause defects:
   - Diagnose + fix the perf collector (read code, run standalone,
     find why it's silent, fix).
   - Diagnose + fix the guest-agent collector (mount baseline image,
     verify agent installed, fix build).
   - Diagnose + fix k-gamingcom's missing qmp/netflow/pcap (compare
     configs, eliminate divergence; §4.1).
   - Diagnose + fix `samba_usermap_script` against its target
     (manual msfconsole drive, find why the bind shell never
     connects, fix or remove from catalog; §4.3).
2. Land the canonical manifest (§4.1).
3. Land the target-VM build pipeline (§4.2) and containment
   tests (§4.13) together: target VMs are not in the catalog
   without containment.
4. Land the catalog admission criteria + verifier (§4.3).
5. Land the collector admission criteria + tests (§4.4).
6. Land the event-driven labeller (§4.5).
7. Land the acceptance gate (§4.6).
8. Land the preflight (§4.7).
9. Land the receiver-side rejection (§4.8).
10. Land the override discipline + cleanup (§4.9, §4.12).
11. Land systemd integration + alert distinguishing (§4.11).

After merge: lab hosts pull the new manifest, run preflight, fail
loudly if they don't meet the bar, produce zero episodes until they
do. The operator brings each host to bar, fixing one root cause at
a time, loudly. The dataset goes quiet, then comes back honest.

---

## 6. Out of scope (and why)

- **Schedule jitter for label-leakage resistance.** Real concern,
  but it doesn't affect honesty, only generalization. Address after
  honest data is flowing.
- **New collectors (audit, ftrace, etc.).** Adding collectors before
  the existing six are honest is putting more weight on a broken
  floor.
- **Trainer changes.** This plan stops at the dataset boundary. The
  trainer no longer needs to filter dishonest episodes because they
  don't exist.
- **Multi-architecture targets.** All target VMs are x86_64 for now.

Each of these is fine to defer because it doesn't paper over a
correctness defect. They add value on top of an honest pipeline; the
pipeline isn't honest yet.

---

## 7. Anti-patterns (named; match every proposal against this list)

Each pattern below is a shape a proposal can take that has been
rejected as architectural sleight-of-hand. **Match every proposal
against this list before submitting it.** A proposal that matches
a named pattern is rejected; abandon it and propose a corrective
fix instead.

The patterns are named so future sessions can recognize them in
their own work. A bandaid with a nice name (preflight, acceptance
gate, retry layer, fleet-health) is still a bandaid.

**§7 is non-exhaustive.** New sleight-of-hand patterns will exist
that aren't named here. The §8 decision tests are the actual
filter; a proposal that fails §8 is rejected even if it matches
no named pattern. Do not read §7 as a closed taxonomy and conclude
"my proposal isn't on the list, so it's fine." If §8 says no, the
answer is no, regardless of whether a named match exists.

### 7.1 Compensating-layer pattern

**Definition.** Adding a layer (timer, watcher, retry, alert,
recovery doc) that absorbs a failure mode upstream of itself
instead of fixing the upstream cause.

**Example from session 2026-05-02..03.** `cis490-autoupdate.timer`
to drag stale peers forward. The actual fix was the operator's
deploy process; the timer existed because deployment was unreliable
and we patched around the unreliability instead of fixing it.

**Test.** If I removed this layer right now, would the original
problem reappear immediately? If yes, the layer is a compensating
bandaid for an unfixed root cause.

**What to do instead.** Fix the upstream cause. If you cannot in
this change, fail loudly (§9) and stop.

### 7.2 Phasing-as-deferral pattern

**Definition.** Splitting a correctness fix into "phase 1, phase 2,"
"light vs deep," or "land this now, the harder part later." Any
sequencing that ships a half-honest intermediate state.

**Example from session 2026-05-02..03.** "Land preflight first,
labeller refactor later." The intermediate state ships dishonest
data because the labeller is still clock-driven.

**Test.** Does each intermediate merge ship dishonest data, or
rely on a layer that won't exist yet? If yes, no phasing.

**What to do instead.** Reduce scope (drop a feature, narrow the
active set) until the change is small enough to land in one merge.
Do not defer the hard part.

### 7.3 Single-instance-fix pattern

**Definition.** Fixing one item from a class while leaving the
other items as future work.

**Example from session 2026-05-02..03.** "I'll diagnose perf and
samba in parallel" while guest-agent, qmp, netflow, and the rest
of the module catalog stay broken.

**Test.** Is this a class of N items, of which I'm fixing < N? If
yes, fix all or remove the unfixed from the active set.

**What to do instead.** Either fix every member of the class, or
shrink the active catalog to just the verified members. Unverified
members do not ship.

### 7.4 Per-host-divergence pattern

**Definition.** Accepting that two hosts behave differently as a
working assumption.

**Example from session 2026-05-02..03.** "Which host should I
investigate samba on, elliott or k-gamingcom?", implying the
answer matters because hosts are different.

**Test.** Given identical workloads on identical canonical-manifest
hosts, would the produced episodes be identical? If no, the
divergence is the bug.

**What to do instead.** Eliminate the divergence (one canonical
manifest, one canonical target VM build, one canonical collector
set; §4.1). If a host can't run the canonical experiment, it
produces zero episodes.

### 7.5 Black-box-trust pattern

**Definition.** Treating an externally-built artifact as if it
behaves correctly under our experiments without a verifiable spec
for what it should do.

**Example from session 2026-05-02..03.** Metasploitable2 from a
SourceForge mirror: we don't know what version of Samba is
running, whether the service is up, or whether the image has been
altered. We were shipping modules targeting it anyway.

**Test.** Do we have a verifiable spec for this artifact's
behavior? If no, we don't trust it.

**What to do instead.** Build the artifact from a declarative spec
we control (§4.2). If we can't, remove modules targeting it from
the catalog.

### 7.6 Investigation-as-deferral pattern

**Definition.** Proposing investigation when a verifiable gate
would suffice. The investigation itself becomes the deferred work.

**Example from session 2026-05-02..03.** "I need to diagnose why
perf is silent before I can write the gate." A gate of the form
"perf must produce ≥1 row" works without knowing the cause; it
forces the diagnosis to happen as part of the fix.

**Test.** Can the gate be expressed as an assertion ("X must
produce > 0 rows" / "X must observe Y event") without knowing the
root cause? If yes, write the gate first.

**What to do instead.** Write the strictest possible gate first.
The investigation is the work of making the gate pass.

### 7.7 Speculation-as-evidence pattern

**Definition.** Asserting a claim as fact without measurement.

**Example from session 2026-05-02..03.** "30s vs 120s won't change
this: if the exploit were almost working, we'd see occasional
opens." No data was gathered; the claim was projected.

**Test.** Do I have a measurement that supports this claim? If no,
I am speculating.

**What to do instead.** Say "I don't know yet." Either gather data
or design the fix to be correct under both possibilities.

### 7.8 Out-of-scope-for-correctness pattern

**Definition.** Naming a correctness-affecting item as "out of
scope" to avoid the harder problem.

**Example from session 2026-05-02..03.** "Manifest canonicalization
is out of scope, flagged as known issue." Per-host config divergence
is the source of half the data quality problems; excluding it from
scope was a deferral.

**Test.** Does excluding this item leave the system half-honest?
If yes, it is in scope.

**What to do instead.** Reduce other scope (drop a feature, narrow
the active set) to fit. Correctness items cannot be deferred.

### 7.9 Defensive-instead-of-corrective pattern

**Definition.** Building rejection logic at the consumer instead of
fixing the producer that produces the rejected output.

**Example from session 2026-05-02..03.** Receiver-side rejection of
dishonest episodes without fixing why the producer produces them.
Defense-in-depth (both ends gated) is good;
defense-without-corrective (only consumer gated) is a bandaid.

**Test.** Does this fix make the dishonest behavior IMPOSSIBLE
upstream, or only unobservable downstream? If only unobservable,
the producer is still broken.

**What to do instead.** Fix the producer first. The consumer-side
gate is defense-in-depth on top of a corrected producer, never a
substitute.

### 7.10 Recovery-layer pattern

**Definition.** Building documentation, scripts, timers, or
runbooks for "what to do when X is stuck." Applies anywhere in
the pipeline (producer, receiver, trainer, dashboard, install
scripts, on-device agents): anywhere a "recovery from a state
that shouldn't exist" layer is contemplated. Producer-side is
just the most common location.

**Example from session 2026-05-02..03.** `FIXYOURSELF.md`, a
250-line decision tree for recovering hosts whose auto-update
timer couldn't fix them. The states it covered shouldn't have been
possible if the producer were correct.

**Test.** Can the stuck state happen at all if the relevant
component is correct? If no, delete the recovery layer and fix
the component.

**What to do instead.** Make the stuck state impossible. If you
can't, fail loudly (§9) and stop.

---

## 8. Decision tests before proposing a change

Before adding any code, doc, layer, or feature, answer all of the
following. **Any uncomfortable answer means stop and re-evaluate.**

1. Does this change make the dishonest behavior IMPOSSIBLE, or
   only less likely / less observable?
2. Does this change scale to every instance of the problem class,
   or only one?
3. If I removed this change, would the underlying problem return
   immediately?
4. Am I adding a layer? If yes, can I instead remove the layer
   that allowed the failure?
5. Does this proposal match any pattern in §7? If yes, abandon it
   and propose a corrective fix.
6. Is the change complete in one merge? If not, why is the
   intermediate state honest?
7. Am I doing this because it's correct, or because it's the
   easiest thing that looks like progress?

If you cannot answer all seven cleanly, stop. Ask the operator.
Do not proceed.

---

## 9. What to do when blocked

When you cannot fix something cleanly in scope:

- **Fail loudly.** Exit with a distinguishable code (e.g., 78).
  Write a structured failure record. Do not retry silently.
- **Stop.** Do not continue producing output as if the failure
  didn't happen.
- **Ask the operator.** Tell the user what's blocked, what you
  tried, and what you need to proceed.
- **Do not build a recovery layer.** That is the recovery-layer
  pattern (§7.10).
- **Do not propose phased fixes.** That is the phasing-as-deferral
  pattern (§7.2).
- **Do not narrow scope silently.** If the active set must shrink
  to make the change tractable, name it explicitly and get sign-off.

The operator prefers a small honest system that fails loudly over a
large half-broken one that limps. A loud failure is more useful
than a silent bandaid.

---

## 10. Definitions of ground truth

For each collector, "real row" means the row was actually emitted
by the underlying mechanism for *this episode*, not synthesized,
defaulted, or carried over from a previous run.

| Collector | Ground truth means |
|---|---|
| `proc` | Row read from `/proc/<qemu_pid>/{stat,io,status}` for the live qemu PID of this episode's target VM, while that PID is alive. |
| `qmp` | Row obtained from a successful QMP `query-status` / `query-blockstats` round-trip on `cfg.qmp_socket` for this episode's qemu PID. |
| `netflow` | Row computed from packet capture on `cfg.bridge_iface` for traffic involving this episode's target VM during the episode wall-clock window. |
| `perf` | Row produced by `perf` (or equivalent) sampling this episode's qemu PID. Not from a previous run, not from a different PID. |
| `guest` | Row received from the in-guest agent over the virtio-serial channel during the episode wall-clock window. The agent must be running in *this episode's* guest, not a stale one. |
| `pcap` | Bytes captured from `cfg.bridge_iface` during the episode wall-clock window, written to `network.pcap`. |

For each phase, "label justified" means the corresponding event was
observed:

| Phase | Justified by |
|---|---|
| `clean` | Episode start (orchestrator-emitted). |
| `armed` | Orchestrator instructs the driver to fire (orchestrator-emitted). |
| `infecting` | `exploit_fire` event observed in `events.jsonl`. |
| `infected_running` | `session_open` event observed in `events.jsonl`. **Not** `session_open_timeout`, **not** schedule-clock. |
| `dormant` | Observed in-session idle (no traffic / no command activity for N seconds). |
| `failed` | `session_open_timeout` or other terminal driver failure. Episode is rejected (§4.6). |

A row that doesn't meet the ground-truth bar is not a row. A label
that isn't justified is not a label. The acceptance gate (§4.6)
enforces both.
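
The event-driven rows of the phase table can be checked mechanically. The sketch below covers only `infecting` and `infected_running`, the two phases whose justification is a named event in `events.jsonl`. It assumes each line of that file is a JSON object with an `event` key; that key name and both function names are hypothetical.

```python
import json
from typing import Dict, Iterable, List, Set

# Phase -> event that must appear in events.jsonl (from the table above).
REQUIRED_EVENT: Dict[str, str] = {
    "infecting": "exploit_fire",
    "infected_running": "session_open",
}


def observed_events(lines: Iterable[str], key: str = "event") -> Set[str]:
    """Collect event names from JSONL lines; the `event` key is an assumption."""
    return {json.loads(line)[key] for line in lines if line.strip()}


def unjustified_phases(phases: List[str], events: Set[str]) -> List[str]:
    """Return the phases whose required event was never observed."""
    return [p for p in phases
            if p in REQUIRED_EVENT and REQUIRED_EVENT[p] not in events]
```

A `session_open_timeout` event does not satisfy `infected_running`: the lookup is an exact match on the event name, so a timeout leaves the phase unjustified and the episode rejected.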

---

## 11. Honest reporting

When you (a future session) report status to the operator:

- **Distinguish merged from verified.** "Code merged" is not
  "behavior verified in production." A passing test on a CI host
  is not the same as a working system on a lab host.
- **Distinguish proposed from implemented.** "I proposed X" is not
  "X is in the repo."
- **Audit your cumulative pattern.** At the end of a session,
  re-read your own changes against §7. It is possible to add three
  reasonable-looking layers in sequence that cumulatively form a
  compensating-layer pattern, even if no individual one looks like
  a bandaid.
- **Name compensating layers you've built.** If §7 audit finds
  matches, name them and propose their removal.
- **Don't summarize cumulative changes as "fixes" without
  auditing.** "I shipped 12 commits this session" is not the same
  as "the pipeline is honest now."
- **Verify before agreeing or refuting.** When the operator says
  something is done that you can verify, verify it before agreeing.
  When they say something is broken that you can verify, verify it
  before refuting.

---

## 12. Glossary

Terms used throughout this document, pinned to one definition.

| Term | Definition |
|---|---|
| **Canonical manifest** | The single, version-pinned `manifest.toml` at the repo root. Every host loads this exact file. There is no per-host override (§4.1). |
| **Active set** | The collectors enabled in the canonical manifest for a given run. A collector is in the active set only if it has passed admission criteria (§4.4). |
| **Catalog** | The set of exploit modules in `exploits/modules/*.toml` that have passed admission (§4.3). Modules not in the catalog do not run. |
| **Ground truth** | A row or label is ground truth when it was emitted by the underlying mechanism for *this* episode, with the justifying event observed. See §10. |
| **Episode boundary** | An episode begins when the orchestrator emits the first `clean` label and ends when `done.marker` is written or the episode is moved to `rejected/`. All collector rows must fall inside this wall-clock window. |
| **Configured collector** | A collector listed as enabled in the canonical manifest. Distinct from "running collector" (the process actually started) and "active set" (the intersection of manifest-listed and admission-passing collectors). For acceptance purposes, only the configured set matters. |
| **Admission criteria** | The bar a module / collector / target / override knob must pass to be in the active pipeline. See §4.3, §4.4, §13. |
| **Honest** | Of an episode: every label justified by an observed event, every configured collector emitted ≥1 ground-truth row, working tree was clean (or override-stamped), HEAD on `origin/main`. Of the pipeline: every accepted episode is honest. |
| **Bandaid / compensating layer** | A layer that absorbs a failure mode upstream of itself instead of fixing the upstream cause. See §7.1. |
| **Override** | A knob that loosens an admission criterion or gate. There is exactly one: `CIS490_ALLOW_DIRTY` (§14). |
| **Operator** | The human maintainer with sign-off authority. Distinct from agents that propose changes. See §15. |
| **Containment regression** | A change that weakens any of the §4.13 isolation requirements. Rejected regardless of claimed experimental value. |

---

## 13. Admission scope (what triggers the bar)

Any change to the following is in admission scope and must pass §4
admission criteria + §15 operator sign-off:

- Any module in `exploits/modules/*.toml`.
- Any collector in the active set.
- Any field of `manifest.toml`.
- Any phase rule or label-emission code in the labeller.
- Any gate in the producer or receiver.
- Any schedule entry (phase budget, per-module timeout).
- Any target VM build spec or its containment posture (§4.13).
- Any override knob (the closed list in §14).

The following are NOT admission scope and can be changed without
admission ceremony, but must still pass §8 decision tests:

- Internal refactors that do not change observable behavior of
  any of the above.
- Test code, fixtures, CI configuration.
- Documentation that does not contradict §1.
- Build/install scripts, insofar as they don't change what gets
  shipped or how it's labelled.

A future session that argues "this is just infrastructure" or
"this is just tooling" to dodge admission scope: re-read this
section. Anything that touches what gets shipped, how it's
labelled, what runs on the host, the containment posture, or
how the gate decides is in scope. The "infrastructure /
tooling" framing is a recurring sleight-of-hand vector and
triggers automatic rejection.

---

## 14. Override knobs (closed list)

The complete list of override knobs in CIS490, version-pinned to
this document:

| Knob | Effect | Where audited |
|---|---|---|
| `CIS490_ALLOW_DIRTY=1` (env var, orchestrator) | Allows the orchestrator to start with a dirty git tree. Stamps `dirty_override: true` in every `meta.json` produced. Receiver accepts only with matching stamp. | per-episode in `meta.json` |

That is the entire list. Adding a knob to this list is itself an
admission event (§13) requiring operator sign-off (§15) and an §8
review.

**Knobs that have been considered and rejected** (do not propose
again without re-reading the rationale):

- `verify_tls=false`: TLS verification is a correctness boundary;
  bypassing it is the defensive-instead-of-corrective pattern
  (§7.9).
- `skip_preflight=1`: preflight is the gate; bypassing it makes
  the gate non-functional.
- `experimental_collector=true`: bypassing collector admission
  is the single-instance-fix pattern (§7.3) wearing a flag.
- `diagnostic_mode=true`: generic bypass; in practice would be
  applied to hide failures, not investigate them.
- `dry_run` for the producer: episodes that aren't shipped go to
  `rejected/`; no dry-run flag needed.

If a future session proposes a new override knob, the burden is on
the proposal: pass §8, get operator sign-off, amend §14 in the
same merge. "Add the knob now and amend §14 later" is the
phasing-as-deferral pattern (§7.2) applied to documentation.

---
||||||
|
|
||||||
|
## 15. Sign-off discipline
|
||||||
|
|
||||||
|
Admission decisions are made by the operator, not by agents acting
|
||||||
|
alone. Specifically:
|
||||||
|
|
||||||
|
- **Adding a module to the catalog** requires operator sign-off.
|
||||||
|
An agent runs `scripts/verify-catalog.sh`, presents the
|
||||||
|
verification result, and the operator decides whether the module
|
||||||
|
enters the catalog.
|
||||||
|
- **Adding a collector to the active set** requires operator
|
||||||
|
sign-off. Agent runs the emit-test, operator decides.
|
||||||
|
- **Promoting a target VM build** requires operator sign-off after
|
||||||
|
§4.2 verification and §4.13 containment tests pass.
|
||||||
|
- **Adding an override knob** (§14) requires operator sign-off.
|
||||||
|
- **Amending PIPELINE.md** requires operator sign-off (§16).
|
||||||
|
|
||||||
|
**Removing** anything from the catalog or active set does NOT
|
||||||
|
require operator sign-off — the bar is asymmetric. Tightening
|
||||||
|
is always permitted; loosening requires sign-off.
|
||||||
|
|
||||||
|
The operator is the human with maintainer credentials on the
|
||||||
|
repository. Agents propose, run verification, and present results;
|
||||||
|
the operator decides admission.
|
||||||
|
|
||||||
|
If an agent is acting in a non-interactive context (CI run,
|
||||||
|
scheduled job) where no operator is available to sign off, the
|
||||||
|
agent does not admit anything. It produces verification output
|
||||||
|
and stops.
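The asymmetric bar and the non-interactive rule above can be sketched as a small decision function. This is an illustration under stated assumptions, not code from the repo: the `Change` enum, `requires_sign_off`, and `may_proceed` are hypothetical names, and "amending PIPELINE.md" is treated here as always substantive (§16 exempts editorial changes, which this sketch does not model).

```python
from enum import Enum


class Change(Enum):
    """Hypothetical labels for the change kinds named in this section."""
    ADD_MODULE = "add-module"
    ADD_COLLECTOR = "add-collector"
    PROMOTE_VM_BUILD = "promote-vm-build"
    ADD_OVERRIDE_KNOB = "add-override-knob"
    AMEND_PIPELINE_MD = "amend-pipeline-md"
    REMOVE_FROM_CATALOG = "remove-from-catalog"
    REMOVE_FROM_ACTIVE_SET = "remove-from-active-set"


# Tightening changes: removal never needs operator sign-off.
TIGHTENING = frozenset({Change.REMOVE_FROM_CATALOG, Change.REMOVE_FROM_ACTIVE_SET})


def requires_sign_off(change: Change) -> bool:
    """Loosening requires sign-off; tightening does not."""
    return change not in TIGHTENING


def may_proceed(change: Change, interactive: bool, operator_signed_off: bool) -> bool:
    """Decide whether an agent may carry out the change.

    In a non-interactive context no operator can sign off, so anything
    that requires sign-off is refused regardless of the flag passed in.
    """
    if not requires_sign_off(change):
        return True
    return interactive and operator_signed_off
```

A CI job would call `may_proceed(Change.ADD_MODULE, interactive=False, operator_signed_off=False)`, get `False`, publish its verification output, and exit without admitting anything.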

---

## 16. Amending PIPELINE.md

This document is not immutable, but it is the canonical statement
of the bar. Amendments are governed by the same discipline as
admission decisions:

1. Any change to §1 (principle), §4 (fix items), §7 (anti-patterns),
   §8 (decision tests), §10 (ground truth), §13 (admission scope),
   §14 (override list), or §15 (sign-off) is a substantive
   amendment.
2. Substantive amendments require operator sign-off (§15) and must
   pass §8 decision tests applied to the amendment itself.
3. The amendment lands in the same merge as the code change it
   justifies. "Amend the doc later" is the phasing pattern (§7.2).
4. Editorial changes (typos, formatting, link fixes, glossary
   wording) do not require sign-off but should be flagged in the
   commit message.
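Rule 1 above can be approximated mechanically by looking at which sections a change touches. The sketch below is a rough aid, assuming section references are written like `§4.6`; `SUBSTANTIVE_SECTIONS`, `top_level_section`, and `is_substantive` are hypothetical names. It does not attempt to recognise editorial changes (rule 4), which still need human judgement.

```python
import re
from typing import Iterable

# Top-level sections whose changes count as substantive (rule 1 above).
SUBSTANTIVE_SECTIONS = frozenset({1, 4, 7, 8, 10, 13, 14, 15})

_SECTION_REF = re.compile(r"§(\d+)(?:\.\d+)*")


def top_level_section(ref: str) -> int:
    """Map a reference such as '§4.6' or '§7' to its top-level number."""
    match = _SECTION_REF.fullmatch(ref.strip())
    if match is None:
        raise ValueError("not a section reference: " + repr(ref))
    return int(match.group(1))


def is_substantive(touched_refs: Iterable[str]) -> bool:
    """True if any touched section falls under a substantive top-level section."""
    return any(top_level_section(ref) in SUBSTANTIVE_SECTIONS for ref in touched_refs)
```

A result of `True` means the change needs operator sign-off and the §8 decision tests; a result of `False` only means the section test did not fire, not that the change is automatically editorial.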

If a future session wants to add a feature or layer that the document
forbids, the path is to amend the document, not to work around it.
"This isn't covered by PIPELINE.md, so I'll just do it" is the
out-of-scope-for-correctness pattern (§7.8) applied to the
meta-document. Anything that touches admission scope (§13) is
covered even if not named explicitly.

If you find the document is wrong (internally inconsistent,
contradicting observed reality, or prescribing something impossible),
file a Forgejo issue against the repo with the contradiction
documented. Do not silently work around the doc.

---

## 17. What this plan supersedes

The following docs are deleted or rewritten as part of landing this
plan:

| Doc | Action |
|---|---|
| `FIXYOURSELF.md` | Deleted. Compensating-layer doc; the states it covers don't exist after §4.6. |
| `AGENTS.md` "symptom→fix table" | Deleted. Bandaid-driven. |
| `AGENTS.md` "Hosts self-update" section | Deleted. Hosts run pinned commits. |
| `AGENTS.md` "Tier 3+4 deploy zero-touch" claim | Rewritten. Targets are built locally now, not auto-fetched. |
| `AGENTS.md` "trust the in-guest probe alone, cross-check host CPU" | Deleted. The producer-side gate makes this fictional cross-check unnecessary. |
| `TIER3-BRINGUP.md` | Kept as historical record (labelled bug report, not current guidance). |
| `README.md` Tier-3+4 narrative | Reviewed and aligned. |

If you are a future session reading this and find another doc that
contradicts §1–§6 of this file: this file is right and the other
doc is wrong. Fix the other doc.