Bug 14 (vm/launch_target.sh): Metasploitable2 requires -machine pc
(i440fx), -cpu kvm32, -drive if=ide, and -device e1000. The previous
config (-machine q35, -cpu host, -drive if=virtio, virtio-net-pci)
caused a kernel panic at boot: virtio exposes the disk as /dev/vda, but
GRUB passes root=/dev/sda1. Services never started; the b'' probe fix
(Bug 10) then correctly waited out the full timeout with no result.
Bug 15 (scripts/install-tier-3-4.sh): verify step used vsftpd_234_backdoor,
which is marked requires_bridge=true and relies on a hardcoded port-6200
backdoor. Changed to distccd_command_exec with
TARGET_PORTS="5632:3632,4444:4444".
manifest.toml: admit distccd_command_exec and unreal_ircd_3281_backdoor
to the module catalog. Both use cmd/unix/bind_perl (bind shell, no guest
egress, SLIRP-safe). distccd returns a valid protocol response so MSF's
handler runs and session_open fires. Verified against Metasploitable2
sourceforge image sha256 a8c019c3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug 10: _wait_for_tcp returned on recv()→b'' (connection closed by peer),
falsely signalling service-ready. Only socket.timeout or non-empty data
are genuine ready signals; b'' now retries.
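A minimal sketch of the probe semantics described for Bug 10, not the
actual _wait_for_tcp body. The function name, parameter names and the
retry delay are assumptions made for illustration.

```python
import socket
import time


def wait_for_tcp(host, port, timeout_s, recv_timeout_s=3.0, retry_delay_s=0.5):
    """Return True once the service looks ready, False when the budget runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=recv_timeout_s) as sock:
                sock.settimeout(recv_timeout_s)
                try:
                    data = sock.recv(1)
                except socket.timeout:
                    # Connected and the peer is waiting for client data: ready.
                    return True
                if data:
                    # The service sent a banner byte: ready.
                    return True
                # b'' means the peer closed the connection: not ready, retry.
        except OSError:
            # Refused, reset, or connect timeout: port not usable yet, retry.
            pass
        time.sleep(retry_delay_s)
    return False
```

The inner handler has to come first because socket.timeout is a subclass
of OSError; otherwise the recv timeout would be treated as a retry.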
Bug 11: distccd_command_exec and unreal_ircd_3281_backdoor incorrectly
had requires_bridge=true. bind_perl payloads connect inward (host→guest
via hostfwd), not outward — no bridge egress needed. Both modules now
run on SLIRP-only fleet slots.
Bug 12: msgpack.unpackb crashed on integer session IDs from msfrpcd 6.x
(strict_map_key=True default). Added strict_map_key=False.
Bug 13 (documented): samba_usermap_script removed from catalog (NoReply
on every fire — already handled in dca6144 on origin/main).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two correctness fixes that the §4.5 event-driven labeller surfaced:
1. tools/run_real_vm_demo.py was hardcoding a Tier-3-shaped schedule
(clean → armed → infecting → infected_running → ...) for episodes
with no exploit firing. Pre-§4.5 those episodes wrote dishonest
`infected_running` labels from the schedule clock — exactly the §3
evidence pattern. Post-§4.5 they write `failed` at the infecting
transition (the justifying exploit_fire never arrives), which is
honest about what happened but useless for training.
The honest fix: Tier-2 episodes have a clean-only schedule. All
telemetry tagged `clean` because nothing infected anything. The
total duration matches the canonical Tier-3 schedule so episode
lengths are comparable across tiers — no length-bias in the
dataset (§10).
Helper `tier2_schedule_from(schedule)` in orchestrator/manifest.py
derives `[("clean", total_seconds)]` from the canonical schedule.
`tier3_schedule_from(schedule)` renders the legacy
`[(name, seconds)]` shape EpisodeConfig still expects.
Tier-2 demo (run_real_vm_demo.py) now calls tier2_schedule_from.
Tier-3 demo (run_tier3_demo.py) now calls tier3_schedule_from.
Drops the hardcoded DEFAULT_SCHEDULE constants from both — the
canonical manifest is the single source of truth (§4.1).
2. .gitignore now excludes /VERSION. The install-lab-host.sh stamp
writes /opt/cis490/VERSION so episodes can record code provenance
without /opt/cis490 carrying a .git directory. But /opt/cis490 IS
typically a git checkout on lab hosts (auto-update.sh pulls into
it), so writing VERSION leaves the working tree dirty and every
episode records meta.code_version.dirty=true. Rule 4 of the PIPELINE.md
§4.6 acceptance gate would then reject every episode unless
CIS490_ALLOW_DIRTY=1 is set, which would break the data flow.
Now VERSION is .gitignored: install-lab-host.sh stamps it, git
status doesn't see it, dirty=false, gate rule 4 passes naturally.
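A rough sketch of the two schedule helpers from item 1. The real
orchestrator/manifest.py may differ; the `Phase` dataclass is a
hypothetical stand-in for whatever shape the canonical schedule uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    # Hypothetical shape of one canonical-schedule entry.
    name: str
    seconds: int


def tier3_schedule_from(schedule):
    """Render the legacy [(name, seconds)] shape."""
    return [(phase.name, phase.seconds) for phase in schedule]


def tier2_schedule_from(schedule):
    """Collapse to one clean phase with the same total duration."""
    return [("clean", sum(phase.seconds for phase in schedule))]
```

Summing the canonical durations is what keeps Tier-2 and Tier-3 episode
lengths comparable.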
These two changes together keep the data flowing AND honest. Tier-2
episodes pass with `phases=[clean]` + every collector emitting real
rows. Tier-3 episodes (none today, empty catalog) walk the full
event-driven schedule when a verified module gets re-admitted.
286 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase labels are written ONLY when justifying events arrive. The
schedule clock is now a budget — an upper bound — never a label
source. This is the core honesty fix the §3 evidence demanded:
Before: every Tier-3 episode wrote `infected_running` from the
schedule clock regardless of whether session_open ever
fired. Per §10 every dishonest label is a poisoned
training example. 67/67 of the §3 probe episodes were
poisoned this way.
After: `infecting` writes ONLY when exploit_fire is observed in
events.jsonl. `infected_running` writes ONLY when
session_open is observed. Either timing out or seeing
session_open_timeout terminates the walker with a `failed`
label that the §4.6 acceptance gate will reject.
PHASE_JUSTIFYING_EVENTS in orchestrator/episode.py declares which
events justify which phases:
"clean": None # orchestrator-emitted
"armed": None # orchestrator-emitted
"infecting": ("exploit_fire",)
"infected_running": ("session_open",)
TERMINAL_FAILURE_EVENTS = {"session_open_timeout"} short-circuit any
event-driven wait into a `failed` label.
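The labelling rule above can be sketched as a pure function. This is an
illustration only: `label_for` is a hypothetical helper, and it assumes
`observed_events` holds the event names seen within the phase budget,
in order.

```python
PHASE_JUSTIFYING_EVENTS = {
    "clean": None,              # orchestrator-emitted
    "armed": None,              # orchestrator-emitted
    "infecting": ("exploit_fire",),
    "infected_running": ("session_open",),
}
TERMINAL_FAILURE_EVENTS = {"session_open_timeout"}


def label_for(phase, observed_events):
    """Return the label to write for `phase`, or 'failed' if it is unjustified."""
    justifiers = PHASE_JUSTIFYING_EVENTS[phase]
    if justifiers is None:
        return phase
    for name in observed_events:
        if name in justifiers:
            return phase
        if name in TERMINAL_FAILURE_EVENTS:
            return "failed"
    # Budget expired without a justifying event.
    return "failed"
```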
`dormant` is intentionally OFF the canonical schedule. §4.5 calls
for dormant to be event-driven (session_idle / session_active) too,
but the driver doesn't emit those yet. Per §1 default-to-removal we
ship without dormant rather than label it from the clock; when the
driver gains those emits, dormant re-enters the schedule with
proper justification.
EpisodeRunner now owns:
* `_event_log`: every emit_event appends here
* `_event_cv`: condition variable for waiters
* `_wait_for_event(names, since_t_mono_ns, timeout_s)`: returns the
  first matching event in the log with t_mono >= threshold; the
  threshold catches events that fired during the previous on_phase
  callback.
When an event-driven phase's justifier already arrived (e.g.
exploit_fire emitted by driver._fire() inside on_phase("armed")),
the walker uses the EVENT's t_mono on the label — not the time the
walker noticed. The label means "this is when this thing actually
happened."
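A sketch of the wait semantics, assuming a standalone `EventLog` class
(hypothetical; the source keeps this state on EpisodeRunner) and event
dicts with `event` and `t_mono` keys.

```python
import threading
import time


class EventLog:
    def __init__(self):
        self._events = []
        self._cv = threading.Condition()

    def emit(self, name):
        """Append an event and wake every waiter."""
        event = {"event": name, "t_mono": time.monotonic_ns()}
        with self._cv:
            self._events.append(event)
            self._cv.notify_all()
        return event

    def wait_for_event(self, names, since_t_mono_ns, timeout_s):
        """Return the first logged event in `names` at or after the threshold."""
        deadline = time.monotonic() + timeout_s
        with self._cv:
            while True:
                # Scan the whole log so already-arrived events are found.
                for event in self._events:
                    if event["event"] in names and event["t_mono"] >= since_t_mono_ns:
                        return event
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return None
                self._cv.wait(remaining)
```

Returning the event itself lets the caller stamp the label with the
event's own t_mono rather than the time the waiter woke up.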
manifest.toml: dropped the dormant cycle from the canonical schedule.
Episode is shorter (~30s) but every label is event-justified.
14 new tests in tests/test_event_driven_labeller.py covering: justifier
mapping invariants, _wait_for_event semantics (already-arrived,
future, timeout, since-threshold, first-of-multiple-names), walker
behavior (orchestrator-emitted phases, event-driven phases, missing
event → failed, terminal-failure-event short-circuit, stop event,
event-t_mono on label, phase_transition events with justified_by).
286 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the missing emit-tests so every collector in KNOWN_COLLECTORS
has end-to-end coverage:
* test_proc_emits_rows_against_self_pid
Samples /proc/<own pid> for ~0.6s. Asserts ≥3 rows + populated
core fields (cpu_user_jiffies, rss_bytes, vsize_bytes). Works
anywhere with /proc.
* test_pcap_bucketize_emits_rows_from_synthetic_capture
Builds a 2-packet Ethernet+IPv4+TCP pcap in-memory, feeds it
to pcap.bucketize, asserts ≥1 row written + total packet count
across buckets matches input. Covers BOTH the pcap and netflow
collectors (netflow IS the bucketized pcap output).
* test_every_known_collector_has_emit_coverage
Cross-cutting tripwire: for every name in KNOWN_COLLECTORS,
either there's a test_collectors_emit.py test or there's an
explicit COLLECTOR_TEST_CARVE_OUTS entry. Adding a collector
to KNOWN_COLLECTORS without an emit test fails this. Carve-outs
today: qmp (covered by tests/test_qmp.py — needs running QEMU
for real-binary emit) and guest_agent (covered by
tests/test_guest_agent.py — needs a real VM with the agent
baked in).
The carve-outs are explicit, not implicit. A drift where someone
adds a new collector without a real-binary emit test fails CI before
the manifest can include it.
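The tripwire reduces to a set difference. A minimal sketch, assuming
the three inputs are plain collections of collector names; the function
name is hypothetical and the real test discovers emit tests its own way.

```python
def collectors_missing_emit_coverage(known_collectors, emit_tested, carve_outs):
    """Return collectors with neither an emit test nor an explicit carve-out."""
    covered = set(emit_tested) | set(carve_outs)
    return sorted(set(known_collectors) - covered)
```

An empty result means the tripwire passes; any name in the result is a
collector that was added without coverage.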
272 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tools/verify_catalog.py runs the §4.3 end-to-end verification flow
against every entry in manifest.toml's [catalog].modules (or a single
named module). The flow follows §4.3 exactly:
1. Load the module config + the verified-against target spec.
2. Resolve the published image path; fail loudly if absent.
3. Boot the target VM under §4.13 containment (restrict=on, snapshot=on,
no shared FS, unprivileged QEMU — same posture as verify.sh).
4. Wait for the service on the spec'd port.
5. Login to msfrpcd, snapshot the existing session set, fire the
module against `127.0.0.1:<host_port>` (the SLIRP hostfwd to the
guest's promised service port).
6. Wait for `session_open` — NOT session_open_timeout, which is the
§4.5 failed-label outcome.
7. Round-trip a shell command (`id`); confirm uid= shape.
8. Confirm a guest-side artifact (touch marker; ls + echo VERIFY_OK).
Per-module exit code is 0 only when EVERY step passes. CLI exit is 0
only when EVERY requested module passes — partial credit isn't an
option (§1 default-to-removal: a module that can't pass shouldn't be
in the catalog).
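The all-or-nothing exit rule could be sketched like this. The step
runner, result keys and function names are assumptions for
illustration, not the actual tools/verify_catalog.py code.

```python
import time


def run_steps(steps):
    """Run (name, callable) steps in order and stop at the first failure."""
    results = []
    for name, step in steps:
        started = time.monotonic()
        try:
            detail, ok = str(step()), True
        except Exception as exc:  # a failed step short-circuits the flow
            detail, ok = f"{type(exc).__name__}: {exc}", False
        results.append({"step": name, "ok": ok, "detail": detail,
                        "seconds": time.monotonic() - started})
        if not ok:
            break
    return results


def module_exit_code(results, expected_steps):
    """0 only when every expected step ran and passed."""
    passed = len(results) == expected_steps and all(r["ok"] for r in results)
    return 0 if passed else 1


def cli_exit_code(per_module_codes):
    """0 only when every requested module passed (empty input counts as failure)."""
    return 0 if per_module_codes and all(c == 0 for c in per_module_codes) else 1
```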
Structured JSON output with per-step timings + detail strings, written
to stdout or --out <path>. Operator pulls this into a successful CI
run + signs off on the manifest.toml [[catalog.modules]] amendment
with a fresh `last_verified = <commit_sha>` per §15.
Tests (tests/test_verify_catalog.py, 8 cases): exercise the flow with
a mocked MSFRpcClient + mocked qemu boot. Cover happy path, every
short-circuit failure mode (image missing, service never up, session
timeout, shell round-trip wrong, guest artifact missing), and
spec-load errors. Real verification needs lab hardware; the mocked
flow proves the orchestration contract.
269 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§4.2 calls for target VMs we BUILD, not VMs we fetch. §4.13 demands
every target ship the same isolation posture (no upstream egress, no
host-shared FS, unprivileged QEMU, fresh snapshot per episode). This
commit lands the infrastructure for both.
New surface:
* orchestrator/target_spec.py
Loads + validates `vm/targets/<name>/spec.toml`. Containment
fields are not knobs — each has exactly ONE safe value, and a
spec asserting the unsafe value is rejected at load time. There's
no `--containment-override`; weakening §4.13 requires amending
PIPELINE.md and operator sign-off.
* tools/build_target.py
Orchestrates build → verify → publish for a single target. Spec
invalid → exit 78 (sysadmin error). build.sh failure → image not
published. verify.sh failure → image discarded; that's the §4.2
acceptance gate. Publishes sha256 + the manifest.toml stanza the
operator copies in to admit the image (§16 substantive amendment
with sign-off per §15).
* vm/targets/<name>/{spec.toml,build.sh,verify.sh}
Template structure. spec.toml is the contract; build.sh produces
$OUT_PATH; verify.sh boots the produced image under the §4.13
containment posture and asserts every promise.
* vm/targets/shellshock/
First real working target. CVE-2014-6271 (Apache mod_cgi + bash
4.2 mis-parsing function-export environment values). Replaces
the SourceForge Metasploitable2 path that §3 evidence proved
unverifiable. Bash 4.2 is built from sha256-pinned GNU source
inside an Alpine 3.21 cloudinit guest; the build script asserts
the produced bash actually triggers shellshock; the verifier
re-asserts it under restrict=on with a real CVE-2014-6271 probe.
* vm/targets/README.md
How operators add a target. Walks the spec → build → verify →
manifest amendment loop.
Containment regression tests (tests/test_containment.py) — 20 new
assertions, parameterized over every target with a build/verify trio:
* verify.sh MUST contain `restrict=on` on its netdev (§4.13)
* verify.sh MUST contain `snapshot=on` on the boot drive (§4.13)
* verify.sh + build.sh MUST NOT contain -virtfs / -fsdev / 9pfs
* verify.sh + build.sh MUST NOT wrap qemu-system in `sudo`
* Every target must ship the complete spec.toml + build.sh + verify.sh
trio — no half-built targets (§1 default-to-removal)
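The script-level assertions above amount to static text checks. A
sketch under that assumption; the helper name and message strings are
hypothetical, and the real tests in tests/test_containment.py may match
more precisely than plain substring search.

```python
import re

FORBIDDEN_SHARED_FS = ("-virtfs", "-fsdev", "9pfs")
SUDO_QEMU = re.compile(r"\bsudo\b[^\n]*\bqemu-system")


def script_violations(text, is_verify):
    """List containment problems found in a build.sh or verify.sh body."""
    problems = []
    if is_verify:
        # Only verify.sh must carry the netdev and drive flags.
        if "restrict=on" not in text:
            problems.append("missing restrict=on")
        if "snapshot=on" not in text:
            problems.append("missing snapshot=on")
    problems += [f"forbidden {token}" for token in FORBIDDEN_SHARED_FS if token in text]
    if SUDO_QEMU.search(text):
        problems.append("qemu-system wrapped in sudo")
    return problems
```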
Spec validation tests (tests/test_target_spec.py): 13 new tests over
spec parse, name/dir mismatch, missing fields, out-of-range port, and
the §4.13 containment field validators (each unsafe value rejected
with a clear error).
The shellshock target's image is NOT yet published to manifest.toml's
[[targets.images]] — that's the §15 sign-off amendment that lands
after a successful operator-driven build_target.py run on a lab host
with KVM. Building takes ~10 min on x86_64; cannot run on the Pi
under TCG. Operator drives the first build, verifies the sha256, then
amends manifest.toml in a follow-up commit.
261 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The experiment is now defined by a single version-pinned file —
manifest.toml at the repo root. PIPELINE.md §4.1 / §13 / §16. Every
lab host loads THIS exact file; per-host overrides of experiment
shape are forbidden.
Drops the following per-host CLI overrides that previously violated
the canonical-manifest principle:
* --manifest, --modules-dir (paths now derived)
* --ram-per-vm-mib (in manifest.experiment)
* --max-concurrent (manifest.experiment.fleet.max_concurrent_ceiling)
* --max-tier3-slots (manifest.experiment.fleet.max_tier3_slots)
* --force-tier2 (not a §14 sanctioned override knob —
ship empty catalog to disable Tier-3)
* --require-real-samples (sample-side concern; out of fleet scope)
* tools/run_*_demo.py --manifest (samples path now from canonical)
New surface:
* manifest.toml: the single source of truth
* orchestrator/manifest.py: load_canonical() + Manifest dataclass with
  strict validation; raises ManifestError on any failure
* EpisodeConfig.experiment_meta: populated by run_*_demo.py from the
  canonical manifest; stamped into every episode's meta.json under the
  "experiment" key for provenance
* cis490-orchestrator.service: RestartPreventExitStatus=78 so
  manifest-load failures stay stuck-and-loud (§9, §4.7)
* install-lab-host.sh: validates manifest.toml at install time; missing
  or invalid = die with clear message
Catalog admission semantics: only modules whose name appears in
manifest.catalog get loaded into the runtime catalog (§4.3 in
miniature, will tighten further in step 4 when verified_against /
last_verified actually gate admission). Missing toml for an admitted
name is a sysadmin error → exit 78.
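A sketch of the admission rule and the exit-78 path, assuming the
admitted names and the on-disk module configs have already been parsed.
`admit_modules` and `main` are hypothetical names; only ManifestError
and the exit code come from the text above.

```python
import sys

EX_CONFIG = 78  # sysexits.h value for a configuration error


class ManifestError(Exception):
    pass


def admit_modules(admitted_names, modules_on_disk):
    """Keep only admitted modules; a missing config for an admitted name is fatal."""
    missing = [name for name in admitted_names if name not in modules_on_disk]
    if missing:
        raise ManifestError(f"admitted modules without a toml: {missing}")
    return {name: modules_on_disk[name] for name in admitted_names}


def main(admitted_names, modules_on_disk):
    try:
        admit_modules(admitted_names, modules_on_disk)
    except ManifestError as exc:
        print(f"manifest error: {exc}", file=sys.stderr)
        return EX_CONFIG
    return 0
```

Paired with RestartPreventExitStatus=78, a non-zero return of this kind
keeps systemd from restart-looping on a broken manifest.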
Renames cfg.manifest → cfg.samples + adds cfg.experiment to
disambiguate sample-manifest from experiment-manifest. Rewrites
test_fleet.py fixture to construct synthetic Manifest objects so
test outcomes don't depend on the on-disk manifest.toml content.
12 new tests in tests/test_manifest.py: schema-version mismatch,
unknown collector, duplicate collector, unknown phase, negative
phase seconds, negative ram, missing catalog fields, json round-trip.
Local run: `python tools/run_fleet.py --capacity` correctly logs the
loaded manifest and prints capacity. 241 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PIPELINE.md §1 (default-to-removal), §4.3 (catalog admission), §10
(every dishonest label is a poisoned training example).
Empirical evidence on commits 4ab5477 → c41763b: samba_usermap_script
fired its bind_perl payload but the framework's bind handler never
managed to connect to the guest's listening port within
session_open_timeout_s=30 (or even with WfsDelay=30 bumped on the
framework side). All 67 attempts in the §3 probe ended in
session_open_timeout. Yet the schedule clock was still writing
`infected_running` labels for the failed exploit — exactly the §10
poisoned-example pattern.
Until §5 step 3 builds an in-house target VM and step 4 re-admits
modules with `verified_against` recorded (§4.3), the production
catalog should consist of zero verified Tier-3 modules. That's the
state after this removal: the four remaining modules
(vsftpd_234_backdoor, distccd_command_exec, php_cgi_arg_injection,
unreal_ircd_3281_backdoor) are all `requires_bridge=true`, which the
fleet picker filters out unconditionally (the post-revert behavior
from commit 0390eb2). Net effect: production runs Tier-2 only,
producing honest Tier-2 episodes and zero dishonest Tier-3
infected_running labels.
Test fixture updated to inject synthetic in-memory ModuleConfigs
instead of loading from disk, so Tier-3 dispatch logic stays tested
even though no production module qualifies. test_exploits asserts
the new "every shipped module is requires_bridge until §4.3 admits
something verified" invariant — flips into a tripwire if anyone
reintroduces an unverified non-bridge module.
229 passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PIPELINE.md §4.4 requires every collector in the active set to actually
work end-to-end. On k-gamingcom (commit dac03d2 episode at 02:21Z) the
new perf_unavailable lifecycle event surfaced a concrete cause:
`reason: binary_not_on_path` — perf is enabled but the binary isn't
installed. Same story with tcpdump on k-gamingcom (pcap_unavailable
events with `error: tcpdump not found`).
The canonical install script is the right place to ensure the deps
are present. detect_os reads /etc/os-release; ensure_collector_packages
installs `perf` (Arch / RHEL) or `linux-perf` + `linux-tools-generic`
(Debian/Ubuntu) plus `tcpdump`. After the install attempt the script
re-checks `command -v` and dies loudly if either is still missing.
Silent failure is forbidden per §1, so install failure has to be
observable.
Idempotent (`--needed` / equivalent skips already-installed packages).
Operator owns full system upgrades; this only does targeted package
install. On unknown distros the script logs a warning and dies at the
follow-up check, with a clear pointer to install perf/tcpdump by hand.
The next autoupdate tick on k-gamingcom should pull this and
self-install perf + tcpdump, after which rows_perf > 0 and pcap should
start producing bytes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validation on k-gamingcom (commit ac7b85f) showed perf enabled in
production but rows_perf=0 on every episode. Without lifecycle events
the failure mode is indistinguishable from "perf wasn't enabled" — §1
silent-downgrade. The events now surface the actual cause:
- perf_unavailable — binary missing OR launch failed (with reason)
- perf_started — perf is running (pid, events, interval)
- perf_first_row — first row written; counters_populated tells
whether any event was actually counted
- perf_finished — final tally (intervals_seen,
intervals_with_values)
- perf_no_counters — perf was alive but every interval came back
<not counted> (likely paranoid > 2 or PID
ownership mismatch)
`_flush()` now writes a row whenever an interval is observed, even
when every event was <not counted>. The all-None row is honest data
("perf observed this interval and counted nothing"), and the rows
become a count of observed intervals rather than a count of
successful measurements — distinct from rows_proc / rows_qmp which
do count successful measurements. Trainers filter on
`cycles is not None` etc. when they need only populated rows.
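For trainers, the filter might look like the following. It is a sketch
that assumes rows are dicts keyed by plain event name; `populated_rows`
is a hypothetical helper.

```python
def populated_rows(rows, field="cycles"):
    """Keep only rows where perf actually counted `field`."""
    return [row for row in rows if row.get(field) is not None]


rows = [
    {"cycles": 1200, "instructions": 900},
    {"cycles": None, "instructions": None},  # interval observed, nothing counted
]
intervals_seen = len(rows)
intervals_with_values = len(populated_rows(rows))
```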
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Empirical evidence from k-gamingcom (commit 4ab5477, 2026-05-03 22:20Z
vsftpd_234_backdoor episode): the picker selected vsftpd because BRIDGE
was set on that host. The exploit fires against target_ip=127.0.0.1
(SLIRP loopback) but vsftpd's hardcoded port-6200 backdoor is reachable
only at the guest's bridge IP. Result: session_open_timeout, AND a
schedule-clock-driven `infected_running` label was still written for
the failed exploit — exactly the §10 poisoned-training-example pattern.
Until guest-IP discovery for bridge mode is wired (a separate piece of
infrastructure), bridge-only modules can't actually reach their target
even when the operator sets BRIDGE for Tier-2's pcap source. Revert
the picker to its prior conservative form: drop requires_bridge modules
unconditionally regardless of BRIDGE state. Same for the BRIDGE env
strip in the Tier-3 launch path — it was correct as unconditional.
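The reverted behavior can be sketched as two small helpers. This
assumes modules are dicts carrying a `requires_bridge` flag;
`tier3_env` is a hypothetical name for the env strip.

```python
def usable_modules(modules):
    """Drop bridge-only modules regardless of whether BRIDGE is set."""
    return [m for m in modules if not m.get("requires_bridge", False)]


def tier3_env(env):
    """Return a copy of the environment with BRIDGE always removed."""
    child = dict(env)
    child.pop("BRIDGE", None)
    return child
```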
Replaces the two aspirational tests
(test_fleet_uses_all_modules_when_bridge_set,
test_fleet_propagates_bridge_env_to_runner) with their honest negatives
(test_tier3_drops_requires_bridge_modules_unconditionally,
test_tier3_strips_bridge_env_even_when_set). The previous tests asserted
behavior the rest of the pipeline can't deliver; they were false signals.
229 passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The §5 step 1 fixes correct the perf collector's stdout/stderr +
event-name parser bugs, but the launchers
(run_real_vm_demo / run_tier3_demo) never set enable_perf=True, so
production episodes still ship with rows_perf=0 — silently disabled
collector, which is exactly the §1 / §4.4 pattern.
Turn it on in both launchers. Failure modes (perf binary missing,
paranoid level too high) are logged as warnings + return 0 rows
visibly, not silently.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Diagnoses + fixes for the silent-collector / never-lands-session
failures that the 200-episode quality probe surfaced (§3 evidence).
All four address the producer; no compensating layers added.
perf collector (rows_perf=0 on 100% of episodes):
- perf stat -j writes to stderr by default with -p; we read stdout.
Add --log-fd 1 so JSON reaches stdout where the parser sees it.
- Event names come back annotated with the privilege scope perf
actually measured ("cycles:u" under perf_event_paranoid=2). Strip
the suffix so _build_row's plain-name lookups hit. Without this
every metric was None even when perf reported real numbers.
- tests/test_collectors_emit.py covers the regression with a real
busy-loop fixture; emit-test discipline per §4.4.
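The suffix strip might be sketched as below. The modifier set in the
regex is an assumed subset, and `normalize_event_name` is a
hypothetical name, not the collector's actual function.

```python
import re

# Trailing modifier such as ":u" or ":k"; assumed subset of perf modifiers.
_MODIFIER_SUFFIX = re.compile(r":[ukhGH]+$")


def normalize_event_name(name):
    """Map an annotated name like 'cycles:u' back to the plain 'cycles'."""
    return _MODIFIER_SUFFIX.sub("", name)
```

Anchoring on a trailing modifier rather than splitting on the first
colon leaves tracepoint-style names untouched.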
guest-agent collector (rows_guest=0 on 100% of episodes):
- Alpine cloud image doesn't ship python3, so the in-guest agent's
`#!/usr/bin/env python3` shebang silently fails. Add packages:
[python3] to cidata user-data so cloud-init installs it before
the OpenRC service starts.
- Guest agent now exits nonzero (was: silent stdout fallback) when
/dev/virtio-ports/cis490.guest.agent is missing, so OpenRC
reports the failure to /var/log/cis490-agent.log instead of the
bytes vanishing into the void. Refs §1.
- Host-side collector emits guest_agent_connected /
guest_agent_first_byte / guest_agent_silent_window into the
orchestrator's events.jsonl. Future episodes show the in-guest
failure mode per-episode instead of inferring from rows_guest=0.
k-gamingcom missing qmp/netflow/pcap (also affected elliott on
Tier-3 episodes — was misclassified as host divergence):
- tools/run_tier3_demo.py was building EpisodeConfig WITHOUT
qmp_socket / guest_agent_socket / bridge_iface — even though
launch_target.sh creates the underlying chardevs and BRIDGE
supplies the iface. tools/run_real_vm_demo.py wires them
correctly; Tier-3 had a copy-paste gap.
- tests/test_collectors_emit.py adds a source-grep regression so
the wiring stays honest.
samba_usermap_script never lands session (0/67 in §3 probe):
- Bind handler default WfsDelay (~5s) gives up before bind_perl on
Metasploitable2 has finished forking + binding LPORT under
SLIRP+hostfwd. Bump to 30s; matches session_open_timeout_s in
exploits/driver.py so framework + driver agree on the wait
budget. Add ConnectTimeout=15 so the handler's bind connect has
retry budget instead of one-shot.
orchestrator/fleet.py: usable_modules + BRIDGE handling were both
unconditional, so:
- With BRIDGE set, requires_bridge modules were still being
dropped — picker only ever returned samba_usermap_script across
every slot/episode (the test_fleet_uses_all_modules_when_bridge_set
failure on HEAD).
- env.pop("BRIDGE") fired even when BRIDGE was the operator's
explicit setup, breaking modules that need bridge mode (vsftpd
backdoor on hardcoded port 6200, distccd, etc.).
Both made conditional on bridge_set so the picker walks the full
catalog under bridge mode and SLIRP-only modules still get a
clean SLIRP env when BRIDGE is unset.
receiver/app.py: HEAD was in a half-pregnant v2 schema state. app.py
called store.ingest_stream(episode_type=..., benign_profile=...), but
the store.py change that accepts those kwargs was still in the WIP
stash. Removed v2 awareness from app.py so v1 episodes (what the
producer ships today) get accepted again. SCHEMA_VERSION default reset
to 1 to match.
229 passed, 0 failed. (HEAD had 15 failures, all linked to the
half-pregnant v2 state above.)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PIPELINE.md is the canonical plan for the data-collection / emulation
/ labelling pipeline. It supersedes any guidance in AGENTS.md,
README.md, or other repo docs that contradicts it (§17). Future
sessions read it before changing anything in the pipeline.
AGENTS.md is rewritten to point at PIPELINE.md as canonical and to
strip the prescriptive symptom→fix table that absorbed producer-side
defects instead of fixing them (§7.1 compensating-layer pattern).
FIXYOURSELF.md is deleted (§4.12, §7.10 recovery-layer pattern). The
states it covered are made impossible by the §4.6 acceptance gate
landing later in §5; recovering from a state that shouldn't exist is
itself the bandaid we're removing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The detector previously returned 1 on alerts, which made systemd
mark cis490-fleet-health.service as 'failed' every tick that found
a sick host. That's the wrong UX — a detector finding a fault is
working correctly, not crashing. The alert is the signal (via
WARNING log + alerts.jsonl); the unit's success state should mean
"the detector itself ran cleanly." Test added.
Caught while live-deploying on the Pi: the first run found
elliott-thinkpad fatal-only at 943×4xx + 1425×5xx and correctly
emitted the alert — but systemd showed the unit red, which would
have caused operators to chase the wrong tail.
Side note: the same first run also caught a real bug — pycache for
receiver.store on /opt/cis490 was stale after I deployed the new
app.py + store.py from main, causing 1464 × 500 responses. Cleared
the pycache and the index immediately resumed growing (4465 →
4515 in 30 seconds). The detector earned its keep on the very
first cycle.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two pieces of self-monitoring so the maintainer isn't the alarm:
(2) Receiver-side fleet health monitor
cis490-fleet-health.timer runs check_fleet_health.py every 5 min.
Detects three symptoms and writes them to
/var/lib/cis490/alerts.jsonl + a syslog WARNING (greppable / easy
to forward to a notifier):
silent — host shipped in last 24h but has been quiet >30 min
fatal-only — actively shipping but every PUT 4xx
unstamped — shipping without X-Cis490-Code-Commit header
Dedup is keyed on (host, symptom, hour-bucket) so a sustained fault
fires once per hour, not every 5 min. 15 unit tests cover the index
parser, three detectors, and dedup.
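The dedup key could be computed as below. A sketch that assumes alert
times are epoch seconds and that seen keys live in an in-memory set;
how check_fleet_health.py persists them is not stated here.

```python
def dedup_key(host, symptom, t_epoch_s):
    """Bucket by the hour so a sustained fault maps to one key per hour."""
    return (host, symptom, int(t_epoch_s // 3600))


def new_alerts(candidates, seen):
    """Return candidates whose key is unseen, recording keys as it goes."""
    fresh = []
    for host, symptom, t_epoch_s in candidates:
        key = dedup_key(host, symptom, t_epoch_s)
        if key not in seen:
            seen.add(key)
            fresh.append((host, symptom, t_epoch_s))
    return fresh
```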
(3) Per-host doctor snapshots
Lab hosts run cis490-doctor-check.timer once a day (10 min after
boot, then daily with 30-min jitter). The timer runs
cis490_doctor.py --json and PUTs the result to a new endpoint:
PUT /v1/host-health/<host> → /var/lib/cis490/host-health/<host>.json
GET /v1/host-health → aggregate across all hosts
Endpoint is NOT gated by version_gate — sick hosts running stale
code MUST still be able to report sickness. 11 unit tests cover
PUT/GET, atomic-write semantics, bearer auth, and the
not-gated-by-version-gate property.
ship_health_check.py reuses the existing shipper transport (mTLS +
bearer + receiver URL from lab-host.toml) so we don't reimplement
auth.
Both timers wired into install-lab-host.sh — the loop also enables
the previously-added autoupdate + cert-fetch timers, so a single
install run gives a host all four self-healing mechanisms.
Tests: 293 pass (26 new — 15 fleet-health, 11 host-health). 2
pre-existing test_fleet.py failures from the elliott-ThinkPad
merge (667f042) are unrelated to this change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The serial console approach failed: Metasploitable2's kernel is not
configured with console=ttyS0, so only GRUB output reaches the QEMU
serial socket; the OS boot and login prompt never appear there.
New approach:
1. Sleep _METASPLOITABLE2_MIN_BOOT_S (65 s) after QEMU writes its
pidfile. By this point the guest kernel and init are always up.
2. Call _wait_for_tcp with a 3 s recv timeout. Post-floor, SLIRP has
forwarded the connection to the guest TCP stack, so:
- socket.timeout → service listening, waiting for client data ✓
- OSError/RST → port still closed (service not ready); retry ✓
Eliminates the early-boot false-positive that caused exploits to
fire ~60 s before Samba was actually listening.
Also update TIER3-BRINGUP.md bug 6 to reflect the correct final fix.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
k-gamingcom symptom (2026-05-02): the on-device agent successfully
finished Tier-3 bring-up, but the shipper sits in "waiting on mTLS
material" because the cert auto-fetch step in install-lab-host.sh
either ran with host_id still REPLACE_ME, or hit a transient
bootstrap.wg failure, and there's no automatic retry. The Pi-side
cert IS minted and the bootstrap endpoint serves it — the failure
mode is purely "lab-host hasn't pulled it down."
Fix: extract the cert-fetch logic into scripts/fetch-lab-host-cert.sh
(idempotent, no-op when certs are already on disk, no-op when host_id
is unset, exit-0 on transient network failure so the unit doesn't
get pinned as failed), and run it from a 5-minute systemd timer.
The timer handles all three "stuck waiting on mTLS" cases without
operator action:
- operator edited host_id post-install but didn't re-run install
- bootstrap.wg was briefly unreachable during install
- lab host was offline when install ran but came up later
The script `try-restart`s cis490-shipper after a successful fetch
so the daemon picks up the new cert immediately instead of waiting
for its lazy retry. install-lab-host.sh still calls the script
on install for fast first-time bring-up — the timer is the safety
net.
Tarball extract is staged through a temp dir + atomic rename so a
mid-extract crash never leaves us with a mismatched cert/key pair.
AGENTS.md row 4 updated: "waiting on mTLS material" remediation now
points at the timer, with the exact `systemctl start
cis490-cert-fetch.service` command to force an immediate retry.
Tests: 267/267 unchanged. The fetch script is idempotent + has all
its happy/error paths handled inline; a unit test would mostly be
testing systemd's behaviour. The integration test path is the timer
running on a real lab host, which is the actual production case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root causes and fixes documented in TIER3-BRINGUP.md. Summary:
1. BRIDGE env var leaked into Tier-3 subprocess → target VM used tap
instead of SLIRP; fix: env.pop("BRIDGE") in fleet _run_slot.
2. usable_modules filter conditioned on BRIDGE presence → bridge-requiring
modules selected on SLIRP runs; fix: always filter requires_bridge.
3. cmd/unix/interact creates no session.list entry → session_open_timeout
every episode; fix: switch samba_usermap_script to cmd/unix/bind_perl.
4. Per-slot LPORT hostfwd used wrong guest port (host:5444→guest:4444);
fix: extra_host_port:extra_host_port mapping so guest binds the
per-slot LPORT directly.
5. vsftpd backdoor port 6200 hardcoded → collision across concurrent slots;
fix: requires_bridge=true filters it from SLIRP fleet runs.
6. SLIRP false-positive in _wait_for_tcp → exploit fires before Samba
boots (~60 s too early); fix: replace TCP probe with serial console
_wait_for_serial_login that waits for actual "login:" prompt.
7. Stale QEMU survives orchestrator restart (start_new_session=True) →
holds hostfwd ports, new QEMU silently fails; fix: kill by pgid from
old pidfile before rmtree.
8. PORT_BASE default used privileged port 21; fix: default to 2021+slot*100.
9. msfrpcd 6.x returns bytes for all string values even with raw=False;
fix: MSFRpcClient._str() recursive decoder applied to all responses.
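For item 9, a recursive decoder along these lines would do the job. This is a sketch only: `decode_strings` is a hypothetical standalone function, not the actual MSFRpcClient._str() body, and the UTF-8 with replacement policy is an assumption.

```python
def decode_strings(value):
    """Recursively turn bytes into str inside dicts, lists and tuples."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    if isinstance(value, dict):
        # Keys can be bytes too, so decode both sides of each pair.
        return {decode_strings(k): decode_strings(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [decode_strings(item) for item in value]
    return value  # ints, floats, bools, None and str pass through untouched
```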
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The shipper on a stuck lab host logs the receiver's response body
verbatim as ERROR (queue.py:_log_412). That's the ONLY inbound
channel from this Pi to a lab host without ssh — every PUT the
shipper makes pulls down a fresh remediation message.
Update the 400 (missing-commit) and 412 (not-in-window) bodies to
explicitly call out FIXYOURSELF.md and the diverged-HEAD case (§B),
not just "pull and reinstall" — because if the host is on a local
commit that's not on origin/main, plain `git pull --ff-only` fails
and the agent needs to know about §B's three resolutions.
elliott-thinkpad has been hitting the receiver ~1/sec for 19 hours;
it'll receive this updated body on its very next PUT. The on-device
agent (or whoever is reading the journal) sees the path forward
without the maintainer having to push through any other channel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The auto-update timer (98dcd4f) covers the routine case of a host
falling behind origin/main. It deliberately refuses to fast-forward
when local HEAD isn't an ancestor of origin/main — the right call
for safety, but it leaves on-device agents with no automatic path
out when they (or an operator) made a local commit.
That's exactly the elliott-thinkpad incident: ~31,738 episodes
shipped over 19 hours, all stamped with local commit 5568d77 that
isn't on origin/main, all 412'd. Auto-update can't fix it; the
on-device agent had no doc telling it what to do.
FIXYOURSELF.md is that doc. Pure decision tree, six branches
(behind / diverged / no-network / no-git / dirty-tree / clean) each
with verbatim commands and the order to try them. The diverged-HEAD
branch (§B) is the elliott-thinkpad case and offers three resolutions
(push, reset, file-issue-and-wait) so an agent that doesn't have
push permission isn't backed into discarding work.
Linked from the AGENTS.md top-of-file symptom table so a smaller
model finds it without having to know the filename.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today's incident: post-cutover, k-gamingcom went silent and
elliott-thinkpad kept shipping pre-stamp episodes that the receiver
gate 400'd in a 2300+ PUT loop. Both required `git pull &&
install-lab-host.sh` *on the host*; neither the on-device AI agent
nor the operator pulled in time, and from the receiver Pi I cannot
reach in (sshd off on the lab hosts).
Fix the recurrence directly: a 30-min systemd timer that does
git fetch + (if behind) ff-only pull + re-run install-lab-host.sh.
Hosts catch up on the next tick on their own — no human or agent
action required.
Mechanics:
- scripts/auto-update.sh runs as root, drops to cis490 for git ops
to satisfy /opt/cis490 ownership ("dubious ownership" guard).
- Refuses ff if local HEAD isn't an ancestor of origin/main —
protects operator hand-edits from silent overwrite.
- Network failures exit 0 (offline is normal, don't pin a unit
failure); divergence + install failures exit non-zero so the
journal records what broke.
- RandomizedDelaySec=10min on the timer prevents thundering-herd
when several hosts boot together.
- Hands off to install-lab-host.sh via exec — exactly one path
through bring-up; no special "auto" flow.
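The decision and exit-code policy in the mechanics above can be sketched as pure Python. The names `Plan`, `plan_update` and `exit_status` are hypothetical; in practice the `local_is_ancestor` input would typically come from `git merge-base --is-ancestor HEAD origin/main`, which exits 0 when HEAD is an ancestor.

```python
from enum import Enum


class Plan(Enum):
    OFFLINE = "offline"
    UP_TO_DATE = "up-to-date"
    FAST_FORWARD = "fast-forward"
    DIVERGED = "diverged"


def plan_update(fetch_ok, local_head, remote_head, local_is_ancestor):
    """Decide what one timer tick should do."""
    if not fetch_ok:
        return Plan.OFFLINE  # offline is normal, not a failure
    if local_head == remote_head:
        return Plan.UP_TO_DATE
    if local_is_ancestor:
        return Plan.FAST_FORWARD  # safe: origin/main strictly extends HEAD
    return Plan.DIVERGED  # local commits exist; refuse to overwrite them


def exit_status(plan, install_ok=True):
    """Non-zero only for divergence or a failed install after fast-forward."""
    if plan is Plan.DIVERGED:
        return 1
    if plan is Plan.FAST_FORWARD and not install_ok:
        return 1
    return 0
```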
The version-gate provides the quality boundary, so even if
origin/main moves forward unsafely, the receiver's allow-list still
controls what lands in the index.
install-lab-host.sh enables cis490-autoupdate.timer on every run,
idempotent — existing hosts pick it up the next time they pull
manually.
Filed Forgejo #18 with the canonical command for elliott-thinkpad
+ k-gamingcom to bootstrap themselves out of the current incident
(auto-update doesn't help them retroactively — it has to be running
*before* the cutover to catch the next one).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Smaller models running on lab hosts read AGENTS.md top-to-bottom and
need explicit if-this-then-that. Restructure to put a decision-tree
table at the very top mapping every realistic symptom to the exact
command to run (verbatim, with an explicit no-paraphrasing
instruction). Adds an unambiguous HARD RULES list.
Also fixes accumulated drift:
- Tier-4 section had two contradictory descriptions (theZoo flow +
legacy MalwareBazaar flow). Removed the MalwareBazaar paragraphs;
the table's MALWAREBAZAAR_API_KEY env var is gone (theZoo needs no
auth). The "DO NOT push API key" bullet was about a flow that no
longer exists.
- Canonical bring-up step 6 said the Metasploitable2 download was
"registration-walled" requiring an operator-supplied URL+sha256.
Not true since the SourceForge mirror + TOFU pinning fix —
install-lab-host.sh handles it. Removed the manual step entirely
and noted Tier-3+4 are part of step 1.
- The "Three install bugs in 95ac56a" historical table was churn that
doesn't help current agents. Replaced with a generic
"outdated-clone? pull main and re-run install-lab-host.sh" block
that explicitly enumerates what the install script does (VERSION
stamp, queue drain, daemon-reload+restart, watchdog).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three robustness items off the future-work list:
1. Shipper sd_notify watchdog. Type=notify + WatchdogSec=180. The
daemon sends READY=1 after queue construction and WATCHDOG=1 once
per scan pass via a heartbeat callback wired into run_forever.
Restart=on-failure only catches process death — silent stalls
(deadlock, hung tar subprocess, blocked I/O past timeout) used to
leave a zombie running with the data backlog growing. Now systemd
kills + restarts the daemon if no WATCHDOG=1 arrives within 180s.
Verified end-to-end against systemd via `systemd-run --transient
--property=Type=notify --property=WatchdogSec=10`: unit transitions
to active on READY=1; SIGSTOP'ing the process triggers
`Watchdog timeout (limit 10s)! Killing process N with SIGABRT` at
exactly t+10s, then unit goes failed → restart cycle.
2. Quarantine cleanup. Without an upper bound, data/quarantine/ grew
forever as fatal episodes piled up. New ShipperConfig fields:
quarantine_keep_days = 30 # opt-out: 0 disables
quarantine_cleanup_interval_s = 3600 # gate so 5s tick doesn't
# statx() the whole tree
Cleanup runs at the start of run_once() but is gated to once per
hour. Removed entries are logged.
3. Doctor surfaces shipping errors. Tails 10 minutes of cis490-shipper
journal and surfaces 412/400/transient patterns as red/yellow rows
with the canonical fix command. An on-device agent running
cis490_doctor.py now sees one line ("12 ship(s) rejected as
out-of-window") instead of needing to grep the journal.
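For item 1, the notify protocol itself is small: a datagram sent to the socket named in NOTIFY_SOCKET. The sketch below is stdlib-only and illustrative; `run_passes` is a hypothetical bounded stand-in for run_forever, and the real daemon may use a library instead.

```python
import os
import socket


def sd_notify(state):
    """Send one notification datagram to systemd; False if not under systemd."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):  # abstract-namespace socket
        path = b"\0" + path[1:].encode()
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("ascii"), path)
    return True


def run_passes(scan_once, passes, heartbeat=lambda: sd_notify("WATCHDOG=1")):
    """Hypothetical stand-in for run_forever, bounded so it can be tested."""
    sd_notify("READY=1")  # unit becomes active once this arrives
    for _ in range(passes):
        scan_once()
        try:
            heartbeat()  # one WATCHDOG=1 per completed scan pass
        except Exception:
            pass  # a failing heartbeat must not take the scan loop down
```

If a scan pass hangs, no heartbeat is sent and systemd's WatchdogSec timer does the rest.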
Tests: 200/200 (was 188). New coverage: heartbeat callback fires +
survives exceptions; quarantine cleanup respects keep_days, gate, and
opt-out; doctor parser correctly classifies 412/400/transient/clean/
empty/journalctl-denied; when 412 and 400 errors are present together,
the doctor prioritises 412 (more actionable).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
receiver.toml.example: the local_repo_path comment was wrong about
when it kicks in. With the new fallback path, it's used both when
forgejo_url is unset (sole backend) AND when forgejo is unreachable
(failover). Document that, plus the auto-detect of /opt/cis490/.git.
cis490_doctor: add a VERSION-stamp check for lab-host role. If
/opt/cis490/VERSION is missing or malformed, the orchestrator stamps
"unknown" → receiver gate rejects every PUT → quarantine. Surface
this as a red row with the canonical fix (re-run install-lab-host.sh)
so an on-device agent doesn't have to grep journal logs to figure it
out.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups from the post-cutover diagnosis:
1. version_gate: forgejo → local git fallback. If forgejo refresh
returns empty AND a local repo path is configured, retry against
`git log` from the local checkout. The receiver service runs on
the same Pi as forgejo, so a simultaneous restart used to leave
the gate's cache empty and reject every PUT with not-in-window.
Auto-detects /opt/cis490/.git when the operator hasn't set
local_repo_path explicitly — that path is always present on a
production receiver and ProtectSystem=strict still allows reads.
Logs `source=git-fallback` so this isn't silent.
2. shipper/queue: sweep orphaned outbox tarballs. The lifecycle
invariant is `outbox/<id>.tar.zst exists ⇒ episodes/<id>/ exists`
— broken historically by the now-fixed fatal-loop, by operator
`rm` of an episode dir, or by an OS crash between rename(2) and
the post-ship cleanup. Without sweeping, dead bytes pile up
forever. New _sweep_outbox runs at the start of every scan,
bounded by the file count in outbox/.
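The orphan sweep in item 2 can be sketched as below. This is not the actual _sweep_outbox: the function name, the `.partial` suffix for in-progress tarballs, and the choice to ignore unknown files are all assumptions.

```python
from pathlib import Path

# ".partial" is an assumed name for in-progress tarballs.
SUFFIXES = (".tar.zst", ".tar.zst.partial")


def sweep_outbox(outbox, episodes):
    """Delete outbox files whose episode directory no longer exists."""
    removed = []
    for entry in sorted(Path(outbox).iterdir()):
        suffix = next((s for s in SUFFIXES if entry.name.endswith(s)), None)
        if suffix is None:
            continue  # not ours; leave unknown files alone
        episode_id = entry.name[: -len(suffix)]
        if not (Path(episodes) / episode_id).is_dir():
            entry.unlink()  # invariant broken: tarball without its episode
            removed.append(entry.name)
    return removed
```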
Tests cover: fallback fires when forgejo unreachable + repo_path set;
no fallback when repo_path None (opt-in); orphan tarball + partial
get swept on the next pass; live tarballs untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Smaller models running as on-device agents need a direct, prescriptive
remediation block for the gate-failure modes — the receiver's response
body is good but only visible if the agent reads journalctl carefully.
Document the exact sequence (git pull → install-lab-host.sh) and what
the install script now does on its own (drain pre-stamp queue, restart
services). Also calls out the two anti-patterns we don't want agents
trying: silencing the shipper to stop log noise, or fabricating a
code_version field to bypass the gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why services weren't starting after the gate went live:
1. install-lab-host.sh self-copy. The receiver's 400 remediation tells
the agent to `cd /opt/cis490 && git pull && sudo
./scripts/install-lab-host.sh`. That makes REPO_ROOT==INSTALL_ROOT
and `cp -aT $REPO_ROOT $INSTALL_ROOT` errors with "are the same
file"; `set -e` aborts before the systemd units install or anything
restarts. Detect the same-dir case and skip the cp; chown still
runs.
2. Services never restart. install-lab-host.sh and install-tier-3-4.sh
both ended by *telling the operator* to restart, then exiting. The
running shipper/orchestrator kept executing pre-gate code from the
old module objects, so new `code_version` stamping never reached an
episode. Both scripts now `systemctl restart` the units they own
when those units are enabled.
3. Shipper queue fatal-loop. queue.py incremented `fatal++` but didn't
move the episode out of `data/episodes/`. Next scan re-tarred and
re-PUT the same dir, getting 400 again. With 4465+ pre-stamp
episodes on k-gamingcom this produced ~1 PUT/sec for 5+ hours in the
receiver log. Fatal episodes now move to data/quarantine/<id>/ with
a quarantine_reason.json beside them; the outbox tarball is
deleted.
4. Pre-stamp backlog drain. tools/quarantine_unstamped.py is a
one-shot that scans data/episodes/ and quarantines anything without
a 40-char-hex code_version.commit. Wired into install-lab-host.sh
step 9 so a re-install drains the queue automatically. Idempotent;
safe to run while the shipper is active.
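The drain in item 4 reduces to a stamp check plus a move. A sketch under stated assumptions: `is_stamped` and `drain_unstamped` are hypothetical names, lowercase hex is assumed, and skipping on a name collision is an assumed policy rather than what tools/quarantine_unstamped.py necessarily does.

```python
import json
import re
import shutil
from pathlib import Path

HEX40 = re.compile(r"[0-9a-f]{40}")


def is_stamped(episode_dir):
    """True if meta.json carries a 40-char-hex code_version.commit."""
    try:
        meta = json.loads((Path(episode_dir) / "meta.json").read_text())
        commit = meta["code_version"]["commit"]
    except (OSError, ValueError, KeyError, TypeError):
        return False  # missing or malformed meta counts as unstamped
    return isinstance(commit, str) and HEX40.fullmatch(commit) is not None


def drain_unstamped(episodes, quarantine, dry_run=False):
    """Move unstamped episode dirs to quarantine; return the names affected."""
    moved = []
    for episode in sorted(Path(episodes).iterdir()):
        if not episode.is_dir() or is_stamped(episode):
            continue
        target = Path(quarantine) / episode.name
        if target.exists():
            continue  # collision: leave it in place (assumed policy)
        if not dry_run:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(episode), str(target))
        moved.append(episode.name)
    return moved
```

Running it twice is harmless: the second pass finds nothing left to move.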
Tests cover the queue's new fatal-quarantine path and every drain
behaviour (kept/quarantined/dry-run/idempotent/missing-meta/collision).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the "have you tested it" gap as much as we can without x86 KVM.
The Pi is ARM64 — can't boot Metasploitable2 or run KVM-accelerated
guests. But most of the Tier-3 chain doesn't need x86:
* chunked_real_binary_upload is just shell commands over a pipe
* exploit module TOMLs and the deterministic selector are pure Python
* manifest loading + sample selection are pure Python
* msfrpcd itself runs on ARM (Ruby + Java)
* the receiver's commit gate is the same on any arch
verify_tier3_local.py exercises each of those end-to-end, in process,
on this Pi:
PASS exploits/modules/*.toml parse + selector deterministic
PASS manifest loads + selector covers every sample
PASS chunked binary upload survives a real /bin/sh round-trip
(150 KB binary, 26 chunks, sha256-verified end to end)
PASS staged samples are Linux i386 ELF (when staged)
PASS msfrpcd round-trips core.version (when listening)
PASS receiver /v1/health + gate enforces commit allow-list
Live result on this Pi today: 5 PASS, 1 SKIP (msfrpcd not installed
on the Pi, which is correct: the Pi is the receiver, not a lab
host). When run on a lab host after install-tier-3-4.sh, 6/6 PASS
indicates full Tier-3 readiness.
What this script does NOT verify (still needs x86 KVM on a lab
host, covered by install-tier-3-4.sh's verify step):
* Metasploitable2 boots under QEMU/KVM
* vsftpd_234_backdoor lands a session against it
* the chunked-upload binary actually executes inside that session
But the chunked-upload step proves every byte of the upload path
(printf '%s', heredoc-free path, base64 decode, sha256 verify,
chmod, exec scaffold) works against a real POSIX shell. An msfrpc
session presents the same shell interface, so a passing local-sh
test is strong evidence the production path will work.
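For illustration, the upload path can be sketched as a generator of shell commands fed to /bin/sh over a pipe. This is not chunked_real_binary_upload itself: the function names, the 4096-character chunk size and the `.b64` staging suffix are assumptions, and the remote side is assumed to have `base64` and `sha256sum`.

```python
import base64
import hashlib
import shlex
import subprocess


def upload_commands(payload, remote_path, chunk_chars=4096):
    """Yield shell commands that rebuild `payload` at `remote_path`."""
    encoded = base64.b64encode(payload).decode("ascii")
    target = shlex.quote(remote_path)
    staged = shlex.quote(remote_path + ".b64")
    yield f": > {staged}"  # truncate any stale staging file
    for start in range(0, len(encoded), chunk_chars):
        # The base64 alphabet has no quote characters, so single quotes are safe.
        yield f"printf '%s' '{encoded[start:start + chunk_chars]}' >> {staged}"
    yield f"base64 -d {staged} > {target} && rm -f {staged}"
    yield f"chmod +x {target}"
    yield f"sha256sum {target}"  # caller compares this with the local digest


def run_in_local_sh(commands):
    """Feed the commands to /bin/sh over a pipe and return its stdout."""
    script = "\n".join(commands) + "\n"
    done = subprocess.run(["/bin/sh"], input=script, text=True,
                          capture_output=True, check=True)
    return done.stdout
```

The first field of the returned output can be compared against `hashlib.sha256(payload).hexdigest()` to confirm the round-trip.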
tests/test_tier3_local_verify.py wraps the deterministic steps
(module parse, manifest, chunked upload) so pytest catches
regressions automatically. 174/174 total.
Operator workflow: ssh into Pi (or lab host), run:
/opt/cis490/.venv/bin/python tools/verify_tier3_local.py
Each step prints PASS/FAIL/SKIP with detail. Exit 1 if any FAIL.
User caught it: I shipped the theZoo path without running it
end-to-end. A real fetch on the Pi exposed two bugs:
1. Family-name matcher was substring-strict. "Cryptolocker-class"
wouldn't match the dir "CryptoLocker_22Jan2014" because "-class"
isn't in the dir name. Now expands to a sequence of tokens
(full, head-of-dash, head-of-dot, head-of-underscore) and tries
each. First match wins.
2. Extraction picker was "largest non-text" — a bad heuristic for
theZoo, where each Linux.* zip often contains MULTIPLE binaries
for different platforms (Linux i386, x86-64, ARM, FreeBSD, sometimes
even Windows PE). The largest is rarely the i386 Linux ELF that
would actually run on Metasploitable2. Now sniffs ELF magic bytes
in stdlib and tiers:
1. Linux i386 ELF (largest first)
2. any other ELF (best-effort, may not execute)
3. largest non-text (Wine fallback)
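The ELF sniffing and tiering for bug 2 can be sketched with `struct` alone. The names `elf_tier` and `pick_binary` are hypothetical, and this simplified version treats everything that is not an ELF as the last tier without the text filtering the real picker applies.

```python
import struct

EM_386 = 3                      # e_machine value for Intel 80386
OSABI_SYSV, OSABI_LINUX = 0, 3  # EI_OSABI values accepted as "Linux"


def elf_tier(data):
    """0 = Linux i386 ELF, 1 = any other ELF, 2 = not an ELF at all."""
    if len(data) < 20 or data[:4] != b"\x7fELF":
        return 2
    ei_class, ei_data, osabi = data[4], data[5], data[7]
    byte_order = "<" if ei_data == 1 else ">"
    # e_machine is the 16-bit field at offset 18 of the ELF header.
    (machine,) = struct.unpack_from(byte_order + "H", data, 18)
    is_i386 = ei_class == 1 and ei_data == 1 and machine == EM_386
    return 0 if is_i386 and osabi in (OSABI_SYSV, OSABI_LINUX) else 1


def pick_binary(candidates):
    """Choose a name from {name: bytes}: best tier first, then largest."""
    return min(candidates,
               key=lambda name: (elf_tier(candidates[name]), -len(candidates[name])))
```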
Verified end-to-end on the Pi against a real theZoo clone (~500 MB,
263 family dirs, 2026-05-01 fresh pull):
linux-encoder-ransomware → ELF 32-bit Intel i386 SYSV (278 KB)
linux-wirenet-rat → ELF 32-bit Intel i386 SYSV (64 KB)
linux-rex-ransomware → ELF 32-bit Intel i386 SYSV Go (7.6 MB)
linux-neurevt-bot → ELF 32-bit Intel i386 SYSV (3.0 MB)
linux-earthkrahang-apt → ELF 32-bit Intel i386 GNU/Linux (5.8 MB)
5/5 picks are runnable Linux i386 ELFs. In-place manifest rewrites
add source/sha256/url; meta.sample.kind goes to "real" automatically.
Manifest rewritten:
- Old families (XMRig, Mirai, Cryptolocker-class, Dridex, Kovter,
Reverse-Shell) → mostly absent from theZoo's Linux catalog or
matched the wrong arch.
- New families chosen against a verified theZoo presence list:
Linux.Encoder, Linux.Wirenet, Ransomware.Rex, Neurevt,
EarthKrahang.
- XMRig + Kovter remain as mimic-only fallbacks (theZoo lacks a
runnable Linux i386 binary for these; orchestrator falls back
to the mimic profile).
Tests added (tests/test_auto_fetch_samples.py): 13 cases covering
ELF magic detection (i386 accepted, FreeBSD/x86-64/ARM/PE32/text
all rejected), family-token expansion (the "-class" suffix bug),
extraction picker (prefers Linux i386 over larger non-Linux ELFs),
manifest in-place rewrite preserves mode + skips entries that
already have sha256.
What's still NOT verified end-to-end (requires a lab host with
KVM x86):
- Metasploitable2 boot under QEMU
- vsftpd_234_backdoor exploit fire via msfrpcd
- chunked binary upload through a real shell session
- real binary executing inside a Metasploitable2 guest
The Pi is ARM64 — can't run Metasploitable2. install-tier-3-4.sh's
verify step (run_tier3_demo.py) covers all four on a real lab host;
deploy verifies on first run there.
171/171 tests pass.
Initial git-log-based gate ran into a permission wall: the cis490
service user can't read /home/max/cis490/.git (ProtectHome=true +
home-dir mode). Switched the production source to the local Forgejo
HTTP API, which is already accessible to all WG peers and is the
single source of truth that both lab hosts and the receiver pull
from. When the maintainer
pushes new code to spectral/CIS490, the next 5-second cache refresh
sees the new commit and lab hosts can immediately ship under it.
VersionGate now takes either:
- forgejo_url + repo_owner + repo_name + branch (+ optional
auth_token for private repos): hits
/api/v1/repos/<owner>/<name>/commits?sha=<branch>&limit=<n>
- repo_path: dev-only fallback, runs `git log` locally
Local-git path retained for tests + the dev-only case.
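A sketch of the Forgejo backend's request and parse steps, using only the stdlib. Function names are hypothetical; it assumes each commit object in the JSON array carries a `sha` field and that token auth uses an `Authorization: token ...` header, as the Gitea-style API does.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def commits_url(forgejo_url, repo_owner, repo_name, branch, limit):
    """Build the commits endpoint URL for one branch."""
    query = urlencode({"sha": branch, "limit": limit})
    return (f"{forgejo_url.rstrip('/')}/api/v1/repos/"
            f"{quote(repo_owner, safe='')}/{quote(repo_name, safe='')}"
            f"/commits?{query}")


def parse_commit_hashes(body):
    """Pull the `sha` of each commit object out of the JSON array."""
    rows = json.loads(body)
    return frozenset(row["sha"] for row in rows
                     if isinstance(row, dict) and isinstance(row.get("sha"), str))


def fetch_commit_hashes(url, auth_token=None, timeout_s=5.0):
    """Fetch and parse; the token is only needed for private repos."""
    headers = {"Authorization": f"token {auth_token}"} if auth_token else {}
    with urlopen(Request(url, headers=headers), timeout=timeout_s) as resp:
        return parse_commit_hashes(resp.read())
```

Keeping the parser separate from the fetch is what lets a canned-response test exercise it without a live Forgejo instance.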
receiver.toml.example gains forgejo_url/repo_owner/repo_name/branch
with auth_token commented; live-deployed receiver.toml on the Pi has
the spectral org + token.
Live state on the Pi: 41 valid hashes loaded, head=f8ad02b. Verified
end-to-end:
bogus commit → 412 + remediation
HEAD commit → clears gate (fails downstream at sha-mismatch as
expected for the empty-body verify probe)
Test added: test_forgejo_backend_accepts_returned_commits stands up
a tiny canned-response HTTPServer in-process, exercises the parser
without depending on a live Forgejo instance. Brings test_version_gate
to 10 cases; total 158/158.
Stops out-of-date lab hosts from polluting the dataset with episodes
generated by buggy code. The valid-commits set mirrors the maintainer's
working clone on the Pi automatically — when the maintainer pulls or
pushes a new commit, the receiver picks it up within the 5-second
cache TTL with no service restart.
Receiver changes:
- receiver/version_gate.py (new): VersionGate(repo_path, window).
Each check() consults a frozenset of the last `window` commit
hashes from `git -C <repo> log --format=%H -n <window>`, refreshed
every 5s under a lock. Resilient to transient git failure (keeps
prior cache so a flaky `git` doesn't lock out every shipper).
- receiver/app.py: PUT extracts X-Cis490-Code-Commit; gate.check()
before ingest. Rejects with:
400 + remediation if header missing or malformed
412 + remediation + your_commit + head_commit if not in window
Remediation block is verbatim copy-pasteable into the lab-host
shell:
cd /opt/cis490 && sudo -u cis490 git pull origin main
sudo /opt/cis490/scripts/install-lab-host.sh
sudo systemctl restart cis490-orchestrator
- receiver/store.py: ingest_stream takes commit kwarg, stamps it on
the index.jsonl row (new optional field). Backfilled rows from
index_backfill.py also pull commit out of meta.json.
- receiver/config.py + etc/receiver.toml.example: new [version_gate]
section. enabled=true, repo_path=/home/max/cis490, window=100 by
default. Enabled toggle exists for emergency disable-and-collect.
Shipper changes:
- shipper/transport.py: ship_tarball() takes commit kwarg, sends
X-Cis490-Code-Commit header. 412 maps to status='fatal' so the
queue doesn't infinite-retry — operator must pull and reinstall
before the next ship will succeed.
- shipper/queue.py: reads meta.json::code_version.commit per
episode, passes through. On 412, logs the receiver's full
remediation block at ERROR level so journalctl on the lab host
shows exactly what to run.
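The gate's cache behaviour (5-second refresh under a lock, prior cache kept on failure) can be sketched as below. `CachedCommits` is a hypothetical class, not the real VersionGate; the injectable `loader` and `clock` are there only to make the sketch testable, and treating an empty refresh like a failure is an assumption.

```python
import threading
import time


class CachedCommits:
    """TTL cache over a commit-hash loader that survives loader failures."""

    def __init__(self, loader, ttl_s=5.0, clock=time.monotonic):
        self._loader = loader
        self._ttl_s = ttl_s
        self._clock = clock
        self._lock = threading.Lock()
        self._commits = frozenset()
        self._refreshed_at = None

    def check(self, commit):
        with self._lock:
            now = self._clock()
            stale = (self._refreshed_at is None
                     or now - self._refreshed_at >= self._ttl_s)
            if stale:
                try:
                    fresh = frozenset(self._loader())
                except Exception:
                    fresh = frozenset()
                if fresh:  # failed or empty refresh: keep the prior cache
                    self._commits = fresh
                self._refreshed_at = now
            return commit in self._commits
```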
Tests: 9 in test_version_gate (including 2 end-to-end via
starlette.testclient), 2 cover the boundary where new commits land
mid-cache and where missing-repo gracefully keeps prior cache.
157/157 total.
Index schema: existing rows stay valid (commit field is optional
on read). New rows from receiver-direct AND from index_backfill.py
include commit.
Closes a real reproducibility gap. Three weeks of bug fixes have
shipped (probe fix in 2707709, multi-signal classifier in 321ea63,
mandatory tier-4 in 265f3ad, etc.); without a per-episode
code_version, trainers can't tell which episodes came from buggy
pre-fix code and have to scan every tarball to guess.
Resolution priority (cached across episodes):
1. $INSTALL_ROOT/VERSION (production — install-lab-host.sh writes
it at install time since /opt/cis490 is a flat copy with no .git)
2. git rev-parse HEAD from the repo root (dev clones)
3. {"commit": "unknown", "source": "unknown"} so the field is always
present (filterable)
Output shape, always present in meta.json:
"code_version": {
"commit": "<40-hex>" | "unknown",
"branch": "<name>" | null,
"dirty": bool | null,
"source": "VERSION-file" | "git" | "unknown"
}
install-lab-host.sh writes VERSION at install time with the source
repo's git rev-parse HEAD + branch + clean-tree flag + install
timestamp. Lab-host agents that pull main + re-run install-lab-host.sh
get a fresh stamp automatically.
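The resolution priority can be sketched as follows. The on-disk VERSION layout is not specified here, so the key=value format, the `dirty` key, and the injectable `git_head` callback are all assumptions made for illustration.

```python
import re
from pathlib import Path

HEX40 = re.compile(r"[0-9a-f]{40}")
UNKNOWN = {"commit": "unknown", "branch": None, "dirty": None, "source": "unknown"}


def resolve_code_version(install_root, git_head=lambda root: None):
    """Resolve in priority order: VERSION file, then git, then "unknown".

    The key=value VERSION layout and the `git_head` callback are assumptions
    made for this sketch.
    """
    version_file = Path(install_root) / "VERSION"
    if version_file.is_file():
        fields = dict(line.split("=", 1)
                      for line in version_file.read_text().splitlines()
                      if "=" in line)
        commit = fields.get("commit", "")
        if HEX40.fullmatch(commit):
            return {"commit": commit, "branch": fields.get("branch") or None,
                    "dirty": fields.get("dirty") == "true",
                    "source": "VERSION-file"}
    commit = git_head(install_root)  # e.g. wraps `git rev-parse HEAD`
    if commit and HEX40.fullmatch(commit):
        return {"commit": commit, "branch": None, "dirty": None, "source": "git"}
    return dict(UNKNOWN)  # field stays present, so it remains filterable
```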
148/148 tests pass; test_episode_against_self_pid_produces_full_directory
asserts the field's presence + valid `source` value.
Replaces MalwareBazaar with theZoo (https://github.com/ytisf/theZoo).
theZoo is a public security-research repo with hundreds of malware
samples organized by family, password-protected with the well-known
'infected'. No API key, no signup, nothing for an operator to do —
which is what zero-touch tier-4 actually means.
Changes:
- tools/auto_fetch_samples.py: rewrite. Clones theZoo (shallow, ~500 MB)
to /var/lib/cis490/theZoo on first run, then for each manifest
family without a sha256 it locates a matching Binaries/<Name>
dir, extracts the .zip with password 'infected', picks the largest
non-text payload as the binary, sha256s it, stages at
samples/store/<sha256>, and rewrites manifest.toml in place
(atomic tempfile + os.replace, stat preserved). Mandatory exit
semantics: non-zero if no real samples landed.
- scripts/install-tier-3-4.sh: dropped the MB-key resolution chain
(env var → local file → bootstrap.wg fetch). Now just runs
auto_fetch_samples.py and dies if zero samples land. SKIP_TIER4
remains as the explicit override but is documented as defeating
the project.
- bootstrap/app.py + __main__.py + etc/cis490-bootstrap.service:
removed the /v1/secret/<name> endpoint and the --secrets-root flag.
Dead code now that no API key needs distributing. Live-rolled
back on the Pi (404 verified post-restart, stale /etc/cis490/secrets
dir removed).
- scripts/set-malwarebazaar-key.sh: deleted. No MB key means no
one-time operator step.
- tests/test_bootstrap_secrets.py: deleted (route removed).
- AGENTS.md: rewrote tier-4 section to reflect zero-operator model.
148/148 tests pass. Bootstrap service rolled back live.
User: 'we don't want it to be optional, this real malware IS the data
we want.' Acknowledged. Three changes make Tier 4 actually mandatory
without forcing per-host operator action:
1. bootstrap.wg /v1/secret/<name> endpoint
- Pi serves /etc/cis490/secrets/malwarebazaar.token to lab hosts
over the same trust boundary as the cert endpoint (WG mesh,
iptmonads-gated). Strict allow-list — only `malwarebazaar`
resolves; everything else 404s. Secret returned as bare text
with Cache-Control: no-store. Live-verified on the Pi.
- tests/test_bootstrap_secrets.py covers four cases: 404 unprovisioned,
200 with token, 404 unknown name, 500 on empty file.
2. install-tier-3-4.sh: Tier 4 is no longer optional
- Resolves MB key in priority: env var → /opt/cis490/samples/.bazaar.token
→ https://bootstrap.wg/v1/secret/malwarebazaar.
- Caches the bootstrap-fetched key locally so re-runs are offline.
- If all three resolution paths fail, dies with the exact
remediation command for the operator (one-time set-malwarebazaar-key.sh
on the Pi).
- auto_fetch_samples.py is run unconditionally (SKIP_TIER4 still
works for emergency overrides but logs a warning that the host
will produce only mimics). Deploy fails if zero binaries land
in samples/store/ — no silent mimic-only fallback.
- SKIP_TIER4 documentation now says 'DEPRECATED; defeats the project'.
3. scripts/set-malwarebazaar-key.sh
- Pi-side helper: one operator command per fleet, ever. Accepts
key via env or stdin, validates length, drops at the right
path with the right perms. Lab hosts pull the rest automatically.
AGENTS.md: rewrote the Tier-4 section to reflect mandatory status +
the one-time-on-Pi distribution model.
152/152 tests pass. Bootstrap service updated live on the Pi.
Replaces the manual runbook with scripts that just work. install-lab-host.sh
now runs the full Tier-3 deploy automatically as its 8th step (after the
mTLS cert lands), and Tier-4 auto-fetches when MALWAREBAZAAR_API_KEY is set.
Changes:
- install-msfrpcd.sh: actually runs the Rapid7 omnibus installer when
metasploit-framework isn't present (was: bail with "install manually").
apt-get and dnf paths both go through the same omnibus script with
DEBIAN_FRONTEND=noninteractive. Idempotent.
- fetch-metasploitable2.sh: bakes in the SourceForge public-mirror URL
(https://downloads.sourceforge.net/project/metasploitable/...) so no
operator URL is required. sha256 is now optional and TOFU-pinned —
first run records the hash to OUT_DIR/metasploitable2.qcow2.sha256;
subsequent runs verify against that. Skips if qcow2 already present.
- scripts/install-tier-3-4.sh (new): orchestrates the four steps
(msfrpcd → metasploitable2 → bridge → tier-3 verify) plus optional
Tier-4 auto-fetch. Idempotent. SKIP_VERIFY / SKIP_BRIDGE / SKIP_TIER4
env knobs for partial deploys.
- tools/auto_fetch_samples.py (new): when MALWAREBAZAAR_API_KEY is set,
queries MB by each manifest entry's `family` (signature match), pulls
the first match via fetch_sample.py, and rewrites manifest.toml in
place (atomic tempfile + os.replace, preserving stat). Skips entries
that already have sha256.
- install-lab-host.sh: gains a step 8 that calls install-tier-3-4.sh
automatically when mTLS certs are on disk. --skip-tier3 flag for
operators who want Tier 2 only. Skipped silently before certs land
so first-pass install (host_id=REPLACE_ME) still works.
- AGENTS.md: rewrote the Tier-3 section to point at the one-shot
script. Removed the old multi-command runbook so on-device agents
can't accidentally follow stale steps.
Net effect: a fresh lab host now gets Tier 3 (and Tier 4 if API key
present) from a single sudo invocation. No operator picks for image
URLs, no manual metasploit installs, no manual manifest edits.
Repo has all the code paths for Tier 3 (real exploit fire via msfrpcd)
and Tier 4 (real malware execution via chunked upload), but neither
lab host has run a single Tier-3 episode because msfrpcd and the
Metasploitable2 image aren't deployed there. 3009 episodes in flight
to date are all Tier 2 (mimic workloads in clean Alpine), which is
useful pipeline-validation data but cannot answer the actual research
question.
This commit makes the deploy push-button:
- AGENTS.md: new "Tier 3 + Tier 4 deploy" section listing the three
prereqs (install-msfrpcd.sh, fetch-metasploitable2.sh, setup_bridge.sh),
the foreground verify command (run_tier3_demo.py), and the Tier-4
promotion path (MB API key → fetch_sample.py → manifest edit →
orchestrator restart).
- samples/manifest.toml: clearer per-entry comment showing the
4-step sha256 → real-binary promotion path. Replaces the earlier
"TBD" placeholder which suggested a single edit unlocks Tier 4
when in fact you need to fetch the binary too.
The fleet runner already auto-detects msfrpcd (orchestrator/fleet.py
_msfrpcd_available()); once the lab-host operator-AI lands the
prereqs, episodes flip to Tier 3 with no orchestrator config change.
Tier 4 follows automatically the next time the deterministic
selector picks a sample whose sha256 file exists in samples/store/.
A laptop-class lab host (elliott-thinkpad) running 14 parallel fleet
slots can't deliver host /proc CPU% signal for the bursty profiles —
the per-VM share gets buried under contention. But the workloads ARE
running: qmp blockstats record 90+ MB written during infected_running
for io-walk episodes, netflow shows real packet bursts for
scan-and-dial, and the in-guest agent (when alive) shows load_1m
deltas the host can't see.
The classifier now cross-checks four sources before flagging an
episode:
- /proc CPU% medians (host-side qemu)
- netflow byte totals (bridge_pcap)
- qmp blockstats per-phase DELTA (cumulative counters; deltas
matter, not raw values)
- guest-agent load_1m
An episode is flagged only if every available source agrees there is
no inter-phase signal. Missing sources are "unknown", not "flat".
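The cross-check rule reduces to a three-valued vote. A sketch, with `should_flag` as a hypothetical name; not flagging when no source is available at all is an assumption, since that case is not spelled out here.

```python
def should_flag(signals):
    """signals: source name -> True (signal seen), False (flat), None (unknown)."""
    known = [seen for seen in signals.values() if seen is not None]
    # Assumption: with no usable source at all, do not flag.
    return bool(known) and not any(known)
```

For example, a flat /proc reading with real netflow traffic is kept, while flat readings from every available source still flag.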
Time-base bug also fixed: phase mapping now uses t_wall_ns (which
all sources stamp from CLOCK_REALTIME) rather than t_mono_ns.
Netflow uses qemu boot-monotonic and /proc uses orchestrator-relative
time, so the two don't share a number line.
Result on the live receiver:
- 1067 active episodes, 100% kept under the new logic
- 143 episodes rescued from a previous false-positive archive
- Only the 9 genuinely-broken pre-Sample-propagation elliott-lab
episodes remain archived (no-sample + no-workload-events)
Two new tests (test_flat_proc_rescued_by_netflow,
test_flat_everywhere_still_flags) pin the boundary so a future
regression surfaces immediately.
AGENTS.md gains a "classifier is multi-source" section explaining
the cross-check and the t_wall_ns invariant.
On-device agent (k-gamingcom) ran the diagnostic probe sequence and
proved the workload IS running on Alpine — yes saturating the vCPU,
loadavg=1.05, three yes PIDs visible — but two busybox incompatibilities
made every episode look silent:
1. _probe() used `pgrep -c yes`. The -c flag is a procps-ng feature
that busybox pgrep lacks. busybox pgrep exits 1 with a usage banner; the
`|| echo 0` fallback then reported yes=0 every time. Switched to
`pgrep yes | wc -l` which both pgrep variants support.
2. _wrap_loop appended `disown` after the nohup-backgrounded script.
busybox sh / ash have no disown builtin, so each infected_running
phase printed `sh: disown: not found` into run()'s captured output.
The script kept running (nohup gives SIGHUP immunity, which is
what disown was for), but the spurious error is now gone.
Cross-validation in the classifier:
- prune_episodes.py: workload-silent now requires the probe AND
host-side /proc CPU envelope (flat-cpu) to AGREE. A probe-only zero
is treated as the busybox false-positive and dropped. This means
the 244 already-on-disk episodes from elliott-thinkpad and
k-gamingcom are correctly classified without re-collecting.
Test coverage:
- test_workload_silent_flag updated to require both signals
- test_workload_silent_suppressed_when_host_cpu_real new regression
for the busybox false-positive
AGENTS.md gains a "Don't trust the in-guest probe alone" section with
the busybox-vs-procps gotcha + a list of busybox-incompatible patterns
to avoid in any new in-guest diagnostic.
ca_bundle is what the shipper uses to verify collector.wg's TLS cert.
That cert is signed by the Caddy Local Authority, bundled in the repo
as etc/caddy-root.crt. Pointing it at wg-ca.pem (the wg-pki CIS490
Lab-Host Client CA, which is the *receiver's* trust anchor for our
client cert) caused CERTIFICATE_VERIFY_FAILED on every ship.
Original fix authored by the on-device agent on k-gamingcom in
Dev_REL2_043026@786b8da; cherry-picked here onto main.
Co-Authored-By: k-gamingcom on-device agent
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of #13 (PUT 500s on first ship, retries return already-present):
my earlier prune-tool session ran as root and rewrote the live index via
os.replace(), which drops the original ownership/mode. The new file was
root:root and the cis490 service user couldn't append to it. Every fresh
PUT 500'd on _append_index after the tarball had already landed via
os.replace, so retries always saw "already-present" and never recovered
the missing index row.
Two fixes:
- tools/prune_episodes.py: snapshot the index's stat before the rename
and restore uid/gid/mode after. Best-effort chown so non-root prune
runs (where chown would EPERM) still succeed; a non-root caller
already matches the original owner anyway.
- tools/index_backfill.py: new tool. Walks episodes/<host>/*.tar.zst,
computes sha256+size, and appends rows for episodes missing from
the index. Marks each row "backfilled: true" so trainers can distinguish
reconstructed rows. Always opens the index in append mode (never
replaces), so it cannot reproduce the ownership bug it's recovering
from.
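The stat-snapshot fix can be sketched roughly as follows, assuming a POSIX host. The helper name replace_preserving_owner is hypothetical and this is not the actual code in tools/prune_episodes.py:

```python
import os
import stat


def replace_preserving_owner(src_tmp: str, dst: str) -> None:
    """Atomically replace dst with src_tmp, keeping dst's uid/gid/mode."""
    try:
        before = os.stat(dst)  # snapshot ownership and mode before the rename
    except FileNotFoundError:
        before = None          # nothing to preserve on first write

    os.replace(src_tmp, dst)   # dst now carries src_tmp's owner and mode

    if before is None:
        return
    try:
        # Best-effort: a non-root caller cannot chown to another user (EPERM).
        os.chown(dst, before.st_uid, before.st_gid)
    except PermissionError:
        pass
    os.chmod(dst, stat.S_IMODE(before.st_mode))
```

The snapshot has to happen before os.replace(), because after the rename the original inode's metadata is no longer reachable through the destination path.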
Regression test: tests/test_prune.py::test_archive_preserves_index_mode.
Operator note for the live receiver: ran the chown fix manually
(chown cis490:cis490 /var/lib/cis490/index.jsonl) and ran the
backfill once to recover 140 elliott-thinkpad rows that 500'd before
the chown landed.
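For reference, an append-only backfill of the kind run here might look like the sketch below. The row schema (host, file, sha256, size) and the function name are assumptions for illustration; the real tools/index_backfill.py may key rows differently:

```python
import hashlib
import json
from pathlib import Path


def sha256_and_size(path: Path) -> tuple[str, int]:
    """Hash a file in chunks so large tarballs are not read into memory at once."""
    digest, size = hashlib.sha256(), 0
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def backfill_index(episodes_root: Path, index_path: Path) -> int:
    """Append rows for tarballs missing from the index; return how many were added."""
    known = set()
    if index_path.exists():
        with index_path.open() as fh:
            for line in fh:
                if line.strip():
                    known.add(json.loads(line).get("sha256"))

    added = 0
    # Append mode only: the index file is never replaced, so its owner and mode are untouched.
    with index_path.open("a") as fh:
        for tarball in sorted(episodes_root.glob("*/*.tar.zst")):
            sha, size = sha256_and_size(tarball)
            if sha in known:
                continue
            row = {"host": tarball.parent.name, "file": tarball.name,
                   "sha256": sha, "size": size, "backfilled": True}
            fh.write(json.dumps(row) + "\n")
            known.add(sha)
            added += 1
    return added
```

Running it a second time should add nothing, since every tarball's digest is then already present in the index.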
Sysadmin observed lab-host agents still trying to "secure the
connection" — minting certs, generating CSRs, or otherwise reinventing
a cert-delivery flow that's already automated through bootstrap.wg.
Three reinforcements so an agent reading any of the three surfaces
(AGENTS.md, install script output, journalctl) gets the same message:
- AGENTS.md gains a top-of-file "do not mint your own certs" callout
+ a dedicated "Securing the connection (mTLS)" section with the
one fix (re-run install-lab-host.sh after setting host_id) and an
explicit "what NOT to do" list (no openssl, no copy from another
host, no verify_tls=false).
- install-lab-host.sh's FIRST-INSTALL NEXT STEPS now spells out that
the cert auto-fetch is silently skipped while host_id is REPLACE_ME,
and that the operator MUST re-run the script after editing host_id.
Step 2 is now "RE-RUN THIS SCRIPT" with a DO NOT openssl warning.
- The shipper's "waiting on mTLS material" warning now embeds the
exact remediation command + a pointer to AGENTS.md, so an agent
reading journalctl without ever opening the repo still gets it.
Tests: 12/12 in test_shipper still pass; warning string change is
not asserted on (only the dataclass error field).
Smaller (non-4.7) Claude models act as on-device agents on CIS490 lab
hosts and have hit the install gotchas that became issues #10–#12.
Their reports describe symptoms well but miss inferred context — so
this expands the runbook with explicit "do this, not that" notes:
- run tools from /opt/cis490 not a clone (CWD-on-sys.path trap)
- shipper "waiting on mTLS material" is expected and self-heals; do
not try to fix it manually
- table of the three install bugs already closed in main, so a fresh
agent can recognize the symptom and pull instead of re-filing
- "fix one red row at a time" rather than batching attempts
Closes nothing new; this is the follow-up to #10/#11/#12 promised
during their resolution.
First-boot bring-up enables cis490-shipper before the Pi has issued the
mTLS leaf, so ssl.create_default_context(cafile=...) raised
FileNotFoundError out of __init__ and systemd crash-looped the unit,
restarting it every 5 s (RestartSec=5). Now the transport pre-flights
the configured
ca_bundle / client_cert / client_key paths, raises a recoverable
_CertNotReadyError, and ping/ship_tarball retry the build on each
request — daemon self-heals once the cert lands without a restart.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>