Replaces the discrete harmony dropdown with two continuous sliders
that together cover every named scheme:
- count (1..6) — number of palette colors
- spread (0..300°) — total angular range across the palette
Offsets fan symmetrically from the primary, so n=3 spread=60 →
[0, +30, -30] and n=4 spread=270 → [0, +90, -90, +180]. Common
named harmonies fall out as specific (count, spread) values:
mono count=1, any spread
complementary count=2, spread=180
analogous count=3-5, spread<=60
split-complementary count=3, spread~180-210
triadic count=3, spread=240
tetradic / square count=4, spread=270
A small hint line under the sliders shows the matched harmony name
when (count, spread) lands in a recognized neighborhood, or "custom"
otherwise. Per-marker drags continue to work on top of the
continuous setting; touching the count or spread slider regenerates
to the symmetric distribution (i.e. drags are reset on slider use,
which is the natural "clean me up" action).
Migration: any prior localStorage state with the old discrete
`harmony` field or multiplicative `spread` is silently dropped at
boot and replaced with the new defaults (count=3, spread=60).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes
- Wheel markers are now actually draggable. The previous version
rebuilt the marker DOM during drag, which destroyed the captured
element. apply() now only rebuilds when harmony changes; otherwise
it updates marker positions in place.
- Background swap is verified end-to-end. body[data-theme] is set
by JS and the CSS selectors gate per-theme bg layers correctly.
New theme controls (advanced details panes)
- spread (0.2-2x): scales harmony angular offsets uniformly
- L variance (0-40): per-color alternating lightness ladder
- C variance (0-0.15): per-color alternating chroma ladder
- animation speed (0.1-4x): drives every animation-duration via
calc(N / var(--anim-speed))
- background blur (0-40px): filter: blur on the entire bg-canvas
- tint strength (0-0.6): palette-tinted vignette opacity, applied
on EVERY theme so even "black" picks up palette character
Marker drag semantics
- Primary marker drag → rotates the entire palette (changes H)
- Non-primary marker drag → moves just that marker's offset
(allows individual proportion adjustment on top of harmony preset)
Backgrounds
- Old "lava" is renamed "drift" (soft blurred-blob version)
- New "lava" is a proper lava lamp via SVG goo filter
(Gaussian blur + alpha threshold) so bubbles merge as they
approach. Eight palette-colored bubbles rising at staggered
speeds.
State persists in localStorage; all sliders + offsets survive
reload. Panel resets all knobs back to defaults.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Press `t` (or click the topbar button) to open a floating theme
panel. Choices persist in localStorage.
Color machinery
- All palette colors derived from one base OKLCH (L, C, H) plus a
harmony rule that fixes angular offsets between siblings. Six
rules: mono, analogous, complementary, triadic, tetradic,
split-complement. Browsers parse CSS oklch() natively so no JS
conversion math is needed.
- Sliders for L (20-95%), C (0-0.4), H (0-360°), plus an interactive
color wheel where each marker is draggable. Dragging any marker
rotates the entire palette around the wheel — proportions are
locked by the harmony rule, so siblings move together.
Background themes (--c1..c5 from the palette drive every color)
- black: solid bg, no animation. Default.
- lava: 6 blurred radial blobs floating up the screen at different
speeds, screen-blended.
- vaporwave: perspective grid crawling toward the viewer + a soft
radial sun in palette colors.
- laser: 5 long beams sweeping from screen center, palette-colored.
Caveats
- oklch() needs Chrome 111+ / Firefox 113+ / Safari 15.4+. Older
browsers will fall back to CSS init values (which match the
previous palette), so the page still renders.
- Phase colors (clean/armed/etc) stay hardcoded — they're semantic.
- The keydown listener for `t` is registered alongside the other
hotkeys and skips form fields like the rest.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New scene 10 between Sequence models and KNN: shows
training/models/lstm.py shape (PyTorch LSTM trainer, ~25 lines).
Placeholder content; the model session can swap in the real
trainer string in dashboard.js when it lands.
- Sharpens the prose gradient (95% solid until 70% across instead
of 92% solid until 55%) and bumps stage-view right-padding by
2.5em, so the prose column has a clearer edge against the canvas
content instead of fading smoothly into it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audience takeaway is supposed to be that everything they're seeing is
genuine live data from the devices. The library list moves to a
single supporting paragraph instead of dominating the slide.
Slots in between intro and collect: a side-by-side display of
pyproject.toml's annotated dependency list and the import header of
receiver/app.py. The point is to surface the project's
stdlib-first / annotated-deps stance early in the deck without
making the audience open a terminal.
Lightweight syntax highlighting via a small char-by-char Python
tokenizer (no third-party highlight.js). The TOML case uses regex.
GitHub-dark color palette.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the event contract for producers that want to drive
widgets on the dashboard:
- /publish endpoint (loopback only; Caddy 404s externally)
- All six widget-driving event types and their shapes
- The reconnect gotcha (live events not replayed; only `snapshot` is)
- Two integration patterns (separate process vs in-process)
- Three options for browser-triggered demos
- systemd hardening that constrains producers running on the Pi
Adds a stdlib-only Publisher helper so producers don't need to
hand-roll urllib.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Starlette + WebSocket dashboard run on the Pi as cis490-dashboard.service
(127.0.0.1:8447, Caddy-fronted at dashboard.wg). Tails
/var/lib/cis490/index.jsonl for episode events, snapshots host counts
every 30s, broadcasts to every connected browser. New connections get a
warm snapshot (recent_episodes, total_bytes, host_counts) so reloads
don't see a cold dashboard.
Frontend is a 10-scene scrollytelling deck following the project
outline: intro, collect, hosts, db explorer, baseline, attacks,
chunking, models, knn, perf. Sticky full-bleed canvas with a
right-aligned prose column (matrix-explorable layout). Hotkeys (arrows,
space, j/k, c, Home/End), prev/next chevrons, FAB, and an opt-in
click-to-advance toggle. Demo toggle drives synthetic data for the
five scenes that have no real producer yet (attack envelopes,
chunking, model bars, knn scatter, perf scatter); when off, those
scenes show "awaiting <event_type> events" rather than fake data.
Producers wire in by POSTing typed JSON to 127.0.0.1:8447/publish
(loopback only; Caddy 404s it externally). Event types the widgets
subscribe to: model_metric {model, accuracy}, embedding {x, y, phase},
model_perf {model, latency_us, accuracy}, prediction {episode_id,
window_idx, predicted, actual}, attack_profile {name, shape, curve},
phase {phase}.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug 14 (vm/launch_target.sh): Metasploitable2 requires -machine pc
(i440fx), -cpu kvm32, -drive if=ide, and -device e1000. The previous
config (-machine q35, -cpu host, -drive if=virtio, virtio-net-pci)
caused a kernel panic at boot because /dev/vda != the grub root=/dev/sda1.
Services never started; the b'' probe fix (Bug 10) then correctly waited
out the full timeout with no result.
Bug 15 (scripts/install-tier-3-4.sh): verify step used vsftpd_234_backdoor
which is requires_bridge=true and has a hardcoded port-6200 backdoor.
Changed to distccd_command_exec with TARGET_PORTS="5632:3632,4444:4444".
manifest.toml: admit distccd_command_exec and unreal_ircd_3281_backdoor
to the module catalog. Both use cmd/unix/bind_perl (bind shell, no guest
egress, SLIRP-safe). distccd returns a valid protocol response so MSF's
handler runs and session_open fires. Verified against Metasploitable2
sourceforge image sha256 a8c019c3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug 10: _wait_for_tcp returned on recv()→b'' (connection closed by peer),
falsely signalling service-ready. Only socket.timeout or non-empty data
are genuine ready signals; b'' now retries.
Bug 11: distccd_command_exec and unreal_ircd_3281_backdoor incorrectly
had requires_bridge=true. bind_perl payloads connect inward (host→guest
via hostfwd), not outward — no bridge egress needed. Both modules now
run on SLIRP-only fleet slots.
Bug 12: msgpack.unpackb crashed on integer session IDs from msfrpcd 6.x
(strict_map_key=True default). Added strict_map_key=False.
Bug 13 (documented): samba_usermap_script removed from catalog (NoReply
on every fire — already handled in dca6144 on origin/main).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two correctness fixes that the §4.5 event-driven labeller surfaced:
1. tools/run_real_vm_demo.py was hardcoding a Tier-3-shaped schedule
(clean → armed → infecting → infected_running → ...) for episodes
with no exploit firing. Pre-§4.5 those episodes wrote dishonest
`infected_running` labels from the schedule clock — exactly the §3
evidence pattern. Post-§4.5 they write `failed` at the infecting
transition (the justifying exploit_fire never arrives), which is
honest about what happened but useless for training.
The honest fix: Tier-2 episodes have a clean-only schedule. All
telemetry tagged `clean` because nothing infected anything. The
total duration matches the canonical Tier-3 schedule so episode
lengths are comparable across tiers — no length-bias in the
dataset (§10).
Helper `tier2_schedule_from(schedule)` in orchestrator/manifest.py
derives `[("clean", total_seconds)]` from the canonical schedule.
`tier3_schedule_from(schedule)` renders the legacy
`[(name, seconds)]` shape EpisodeConfig still expects.
Tier-2 demo (run_real_vm_demo.py) now calls tier2_schedule_from.
Tier-3 demo (run_tier3_demo.py) now calls tier3_schedule_from.
Drops the hardcoded DEFAULT_SCHEDULE constants from both — the
canonical manifest is the single source of truth (§4.1).
2. .gitignore now excludes /VERSION. The install-lab-host.sh stamp
writes /opt/cis490/VERSION so episodes can record code provenance
without /opt/cis490 carrying a .git directory. But /opt/cis490 IS
typically a git checkout on lab hosts (auto-update.sh pulls into
it), so writing VERSION leaves the working tree dirty. Every
episode's meta.code_version.dirty=true. PIPELINE.md §4.6 acceptance
gate's rule 4 would then reject every episode without
CIS490_ALLOW_DIRTY=1 set — which would break the data flow.
Now VERSION is .gitignored: install-lab-host.sh stamps it, git
status doesn't see it, dirty=false, gate rule 4 passes naturally.
These two changes together keep the data flowing AND honest. Tier-2
episodes pass with `phases=[clean]` + every collector emitting real
rows. Tier-3 episodes (none today, empty catalog) walk the full
event-driven schedule when a verified module gets re-admitted.
286 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase labels are written ONLY when justifying events arrive. The
schedule clock is now a budget — an upper bound — never a label
source. This is the core honesty fix the §3 evidence demanded:
Before: every Tier-3 episode wrote `infected_running` from the
schedule clock regardless of whether session_open ever
fired. Per §10 every dishonest label is a poisoned
training example. 67/67 of the §3 probe episodes were
poisoned this way.
After: `infecting` writes ONLY when exploit_fire is observed in
events.jsonl. `infected_running` writes ONLY when
session_open is observed. Either timing out or seeing
session_open_timeout terminates the walker with a `failed`
label that the §4.6 acceptance gate will reject.
PHASE_JUSTIFYING_EVENTS in orchestrator/episode.py declares which
events justify which phases:
"clean": None # orchestrator-emitted
"armed": None # orchestrator-emitted
"infecting": ("exploit_fire",)
"infected_running": ("session_open",)
TERMINAL_FAILURE_EVENTS = {"session_open_timeout"} short-circuit any
event-driven wait into a `failed` label.
`dormant` is intentionally OFF the canonical schedule. §4.5 calls
for dormant to be event-driven (session_idle / session_active) too,
but the driver doesn't emit those yet. Per §1 default-to-removal we
ship without dormant rather than label it from the clock; when the
driver gains those emits, dormant re-enters the schedule with
proper justification.
EpisodeRunner now owns:
* `_event_log` — every emit_event appends here
* `_event_cv` — condition variable for waiters
* `_wait_for_event(names, since_t_mono_ns, timeout_s)` — returns
the first matching event in the log
with t_mono >= threshold; threshold
catches events that fired during
the previous on_phase callback.
When an event-driven phase's justifier already arrived (e.g.
exploit_fire emitted by driver._fire() inside on_phase("armed")),
the walker uses the EVENT's t_mono on the label — not the time the
walker noticed. The label means "this is when this thing actually
happened."
manifest.toml: dropped the dormant cycle from the canonical schedule.
Episode is shorter (~30s) but every label is event-justified.
14 new tests in tests/test_event_driven_labeller.py covering: justifier
mapping invariants, _wait_for_event semantics (already-arrived,
future, timeout, since-threshold, first-of-multiple-names), walker
behavior (orchestrator-emitted phases, event-driven phases, missing
event → failed, terminal-failure-event short-circuit, stop event,
event-t_mono on label, phase_transition events with justified_by).
286 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the missing emit-tests so every collector in KNOWN_COLLECTORS
has end-to-end coverage:
* test_proc_emits_rows_against_self_pid
Samples /proc/<own pid> for ~0.6s. Asserts ≥3 rows + populated
core fields (cpu_user_jiffies, rss_bytes, vsize_bytes). Works
anywhere with /proc.
* test_pcap_bucketize_emits_rows_from_synthetic_capture
Builds a 2-packet Ethernet+IPv4+TCP pcap in-memory, feeds it
to pcap.bucketize, asserts ≥1 row written + total packet count
across buckets matches input. Covers BOTH the pcap and netflow
collectors (netflow IS the bucketized pcap output).
* test_every_known_collector_has_emit_coverage
Cross-cutting tripwire: for every name in KNOWN_COLLECTORS,
either there's a test_collectors_emit.py test or there's an
explicit COLLECTOR_TEST_CARVE_OUTS entry. Adding a collector
to KNOWN_COLLECTORS without an emit test fails this. Carve-outs
today: qmp (covered by tests/test_qmp.py — needs running QEMU
for real-binary emit) and guest_agent (covered by
tests/test_guest_agent.py — needs a real VM with the agent
baked in).
The carve-outs are explicit, not implicit. A drift where someone
adds a new collector without a real-binary emit test fails CI before
the manifest can include it.
272 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tools/verify_catalog.py runs the §4.3 end-to-end verification flow
against every entry in manifest.toml's [catalog].modules (or a single
named module). The flow follows §4.3 exactly:
1. Load the module config + the verified-against target spec.
2. Resolve the published image path; fail loudly if absent.
3. Boot the target VM under §4.13 containment (restrict=on, snapshot=on,
no shared FS, unprivileged QEMU — same posture as verify.sh).
4. Wait for the service on the spec'd port.
5. Login to msfrpcd, snapshot the existing session set, fire the
module against `127.0.0.1:<host_port>` (the SLIRP hostfwd to the
guest's promised service port).
6. Wait for `session_open` — NOT session_open_timeout, which is the
§4.5 failed-label outcome.
7. Round-trip a shell command (`id`); confirm uid= shape.
8. Confirm a guest-side artifact (touch marker; ls + echo VERIFY_OK).
Per-module exit code is 0 only when EVERY step passes. CLI exit is 0
only when EVERY requested module passes — partial credit isn't an
option (§1 default-to-removal: a module that can't pass shouldn't be
in the catalog).
Structured JSON output with per-step timings + detail strings, written
to stdout or --out <path>. Operator pulls this into a successful CI
run + signs off on the manifest.toml [[catalog.modules]] amendment
with a fresh `last_verified = <commit_sha>` per §15.
Tests (tests/test_verify_catalog.py, 8 cases): exercise the flow with
a mocked MSFRpcClient + mocked qemu boot. Cover happy path, every
short-circuit failure mode (image missing, service never up, session
timeout, shell round-trip wrong, guest artifact missing), and
spec-load errors. Real verification needs lab hardware; the mocked
flow proves the orchestration contract.
269 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§4.2 calls for target VMs we BUILD, not VMs we fetch. §4.13 demands
every target ship the same isolation posture (no upstream egress, no
host-shared FS, unprivileged QEMU, fresh snapshot per episode). This
commit lands the infrastructure for both.
New surface:
* orchestrator/target_spec.py
Loads + validates `vm/targets/<name>/spec.toml`. Containment
fields are not knobs — each has exactly ONE safe value, and a
spec asserting the unsafe value is rejected at load time. There's
no `--containment-override`; weakening §4.13 requires amending
PIPELINE.md and operator sign-off.
* tools/build_target.py
Orchestrates build → verify → publish for a single target. Spec
invalid → exit 78 (sysadmin error). build.sh failure → image not
published. verify.sh failure → image discarded; that's the §4.2
acceptance gate. Publishes sha256 + the manifest.toml stanza the
operator copies in to admit the image (§16 substantive amendment
with sign-off per §15).
* vm/targets/<name>/{spec.toml,build.sh,verify.sh}
Template structure. spec.toml is the contract; build.sh produces
$OUT_PATH; verify.sh boots the produced image under the §4.13
containment posture and asserts every promise.
* vm/targets/shellshock/
First real working target. CVE-2014-6271 (Apache mod_cgi + bash
4.2 mis-parsing function-export environment values). Replaces
the SourceForge Metasploitable2 path that §3 evidence proved
unverifiable. Bash 4.2 is built from sha256-pinned GNU source
inside an Alpine 3.21 cloudinit guest; the build script asserts
the produced bash actually triggers shellshock; the verifier
re-asserts it under restrict=on with a real CVE-2014-6271 probe.
* vm/targets/README.md
How operators add a target. Walks the spec → build → verify →
manifest amendment loop.
Containment regression tests (tests/test_containment.py) — 20 new
assertions, parameterized over every target with a build/verify trio:
* verify.sh MUST contain `restrict=on` on its netdev (§4.13)
* verify.sh MUST contain `snapshot=on` on the boot drive (§4.13)
* verify.sh + build.sh MUST NOT contain -virtfs / -fsdev / 9pfs
* verify.sh + build.sh MUST NOT wrap qemu-system in `sudo`
* Every target must ship the complete spec.toml + build.sh + verify.sh
trio — no half-built targets (§1 default-to-removal)
Spec validation tests (tests/test_target_spec.py): 13 new tests over
spec parse, name/dir mismatch, missing fields, out-of-range port, and
the §4.13 containment field validators (each unsafe value rejected
with a clear error).
The shellshock target's image is NOT yet published to manifest.toml's
[[targets.images]] — that's the §15 sign-off amendment that lands
after a successful operator-driven build_target.py run on a lab host
with KVM. Building takes ~10 min on x86_64; cannot run on the Pi
under TCG. Operator drives the first build, verifies the sha256, then
amends manifest.toml in a follow-up commit.
261 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The experiment is now defined by a single version-pinned file —
manifest.toml at the repo root. PIPELINE.md §4.1 / §13 / §16. Every
lab host loads THIS exact file; per-host overrides of experiment
shape are forbidden.
Drops the following per-host CLI overrides that previously violated
the canonical-manifest principle:
* --manifest, --modules-dir (paths now derived)
* --ram-per-vm-mib (in manifest.experiment)
* --max-concurrent (manifest.experiment.fleet.max_concurrent_ceiling)
* --max-tier3-slots (manifest.experiment.fleet.max_tier3_slots)
* --force-tier2 (not a §14 sanctioned override knob —
ship empty catalog to disable Tier-3)
* --require-real-samples (sample-side concern; out of fleet scope)
* tools/run_*_demo.py --manifest (samples path now from canonical)
New surface:
* manifest.toml — the single source of truth
* orchestrator/manifest.py — load_canonical() + Manifest dataclass
with strict validation, raises
ManifestError on any failure
* EpisodeConfig.experiment_meta — populated by run_*_demo.py from
the canonical manifest; stamped
into every episode's meta.json
under "experiment" key for
provenance
* cis490-orchestrator.service — RestartPreventExitStatus=78 so
manifest-load failures stay
stuck-and-loud (§9, §4.7)
* install-lab-host.sh — validates manifest.toml at
install time; missing or invalid
= die with clear message
Catalog admission semantics: only modules whose name appears in
manifest.catalog get loaded into the runtime catalog (§4.3 in
miniature, will tighten further in step 4 when verified_against /
last_verified actually gate admission). Missing toml for an admitted
name is a sysadmin error → exit 78.
Renames cfg.manifest → cfg.samples + adds cfg.experiment to
disambiguate sample-manifest from experiment-manifest. Rewrites
test_fleet.py fixture to construct synthetic Manifest objects so
test outcomes don't depend on the on-disk manifest.toml content.
12 new tests in tests/test_manifest.py: schema-version mismatch,
unknown collector, duplicate collector, unknown phase, negative
phase seconds, negative ram, missing catalog fields, json round-trip.
Local run: `python tools/run_fleet.py --capacity` correctly logs the
loaded manifest and prints capacity. 241 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PIPELINE.md §1 (default-to-removal), §4.3 (catalog admission), §10
(every dishonest label is a poisoned training example).
Empirical evidence on commits 4ab5477 → c41763b: samba_usermap_script
fired its bind_perl payload but the framework's bind handler never
managed to connect to the guest's listening port within
session_open_timeout_s=30 (or even with WfsDelay=30 bumped on the
framework side). All 67 attempts in the §3 probe ended in
session_open_timeout. Yet the schedule clock was still writing
`infected_running` labels for the failed exploit — exactly the §10
poisoned-example pattern.
Until §5 step 3 builds an in-house target VM and step 4 re-admits
modules with `verified_against` recorded (§4.3), the production
catalog should consist of zero verified Tier-3 modules. That's the
state after this removal: the four remaining modules
(vsftpd_234_backdoor, distccd_command_exec, php_cgi_arg_injection,
unreal_ircd_3281_backdoor) are all `requires_bridge=true`, which the
fleet picker filters out unconditionally (the post-revert behavior
from commit 0390eb2). Net effect: production runs Tier-2 only,
producing honest Tier-2 episodes and zero dishonest Tier-3
infected_running labels.
Test fixture updated to inject synthetic in-memory ModuleConfigs
instead of loading from disk, so Tier-3 dispatch logic stays tested
even though no production module qualifies. test_exploits asserts
the new "every shipped module is requires_bridge until §4.3 admits
something verified" invariant — flips into a tripwire if anyone
reintroduces an unverified non-bridge module.
229 passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PIPELINE.md §4.4 requires every collector in the active set to actually
work end-to-end. On k-gamingcom (commit dac03d2 episode at 02:21Z) the
new perf_unavailable lifecycle event surfaced a concrete cause:
`reason: binary_not_on_path` — perf is enabled but the binary isn't
installed. Same story with tcpdump on k-gamingcom (pcap_unavailable
events with `error: tcpdump not found`).
The canonical install script is the right place to ensure the deps
are present. detect_os reads /etc/os-release; ensure_collector_packages
installs `perf` (Arch / RHEL) or `linux-perf` + `linux-tools-generic`
(Debian/Ubuntu) plus `tcpdump`. After the install attempt the script
re-checks `command -v` and dies loudly if either is still missing —
silent silent silent forbidden per §1, so install failure has to be
observable.
Idempotent (`--needed` / equivalent skips already-installed packages).
Operator owns full system upgrades; this only does targeted package
install. On unknown distros logs a warning and dies on the followup
check, with a clear pointer to install perf/tcpdump by hand.
The next autoupdate tick on k-gamingcom should pull this and
self-install perf + tcpdump, after which rows_perf > 0 and pcap should
start producing bytes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validation on k-gamingcom (commit ac7b85f) showed perf enabled in
production but rows_perf=0 on every episode. Without lifecycle events
the failure mode is indistinguishable from "perf wasn't enabled" — §1
silent-downgrade. The events now surface the actual cause:
- perf_unavailable — binary missing OR launch failed (with reason)
- perf_started — perf is running (pid, events, interval)
- perf_first_row — first row written; counters_populated tells
whether any event was actually counted
- perf_finished — final tally (intervals_seen,
intervals_with_values)
- perf_no_counters — perf was alive but every interval came back
<not counted> (likely paranoid > 2 or PID
ownership mismatch)
`_flush()` now writes a row whenever an interval is observed, even
when every event was <not counted>. The all-None row is honest data
("perf observed this interval and counted nothing"), and the rows
become a count of observed intervals rather than a count of
successful measurements — distinct from rows_proc / rows_qmp which
do count successful measurements. Trainers filter on
`cycles is not None` etc. when they need only populated rows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Empirical evidence from k-gamingcom (commit 4ab5477, 2026-05-03 22:20Z
vsftpd_234_backdoor episode): the picker selected vsftpd because BRIDGE
was set on that host. The exploit fires against target_ip=127.0.0.1
(SLIRP loopback) but vsftpd's hardcoded port-6200 backdoor is reachable
only at the guest's bridge IP. Result: session_open_timeout, AND a
schedule-clock-driven `infected_running` label was still written for
the failed exploit — exactly the §10 poisoned-training-example pattern.
Until guest-IP discovery for bridge mode is wired (a separate piece of
infrastructure), bridge-only modules can't actually reach their target
even when the operator sets BRIDGE for Tier-2's pcap source. Revert
the picker to its prior conservative form: drop requires_bridge modules
unconditionally regardless of BRIDGE state. Same for the BRIDGE env
strip in the Tier-3 launch path — it was correct as unconditional.
Replaces the two aspirational tests
(test_fleet_uses_all_modules_when_bridge_set,
test_fleet_propagates_bridge_env_to_runner) with their honest negatives
(test_tier3_drops_requires_bridge_modules_unconditionally,
test_tier3_strips_bridge_env_even_when_set). The previous tests asserted
behavior the rest of the pipeline can't deliver; they were false signals.
229 passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The §5 step 1 fixes correct the perf collector's stdout/stderr +
event-name parser bugs, but the launchers
(run_real_vm_demo / run_tier3_demo) never set enable_perf=True, so
production episodes still ship with rows_perf=0 — silently disabled
collector, which is exactly the §1 / §4.4 pattern.
Turn it on in both launchers. Failure modes (perf binary missing,
paranoid level too high) are logged as warnings + return 0 rows
visibly, not silently.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Diagnoses + fixes for the silent-collector / never-lands-session
failures that the 200-episode quality probe surfaced (§3 evidence).
All four address the producer; no compensating layers added.
perf collector (rows_perf=0 on 100% of episodes):
- perf stat -j writes to stderr by default with -p; we read stdout.
Add --log-fd 1 so JSON reaches stdout where the parser sees it.
- Event names come back annotated with the privilege scope perf
actually measured ("cycles:u" under perf_event_paranoid=2). Strip
the suffix so _build_row's plain-name lookups hit. Without this
every metric was None even when perf reported real numbers.
- tests/test_collectors_emit.py covers the regression with a real
busy-loop fixture; emit-test discipline per §4.4.
guest-agent collector (rows_guest=0 on 100% of episodes):
- Alpine cloud image doesn't ship python3, so the in-guest agent's
`#!/usr/bin/env python3` shebang silently fails. Add packages:
[python3] to cidata user-data so cloud-init installs it before
the OpenRC service starts.
- Guest agent now exits nonzero (was: silent stdout fallback) when
/dev/virtio-ports/cis490.guest.agent is missing, so OpenRC
reports the failure to /var/log/cis490-agent.log instead of the
bytes vanishing into the void. Refs §1.
- Host-side collector emits guest_agent_connected /
guest_agent_first_byte / guest_agent_silent_window into the
orchestrator's events.jsonl. Future episodes show the in-guest
failure mode per-episode instead of inferring from rows_guest=0.
k-gamingcom missing qmp/netflow/pcap (also affected elliott on
Tier-3 episodes — was misclassified as host divergence):
- tools/run_tier3_demo.py was building EpisodeConfig WITHOUT
qmp_socket / guest_agent_socket / bridge_iface — even though
launch_target.sh creates the underlying chardevs and BRIDGE
supplies the iface. tools/run_real_vm_demo.py wires them
correctly; Tier-3 had a copy-paste gap.
- tests/test_collectors_emit.py adds a source-grep regression so
the wiring stays honest.
samba_usermap_script never lands session (0/67 in §3 probe):
- Bind handler default WfsDelay (~5s) gives up before bind_perl on
Metasploitable2 has finished forking + binding LPORT under
SLIRP+hostfwd. Bump to 30s; matches session_open_timeout_s in
exploits/driver.py so framework + driver agree on the wait
budget. Add ConnectTimeout=15 so the handler's bind connect has
retry budget instead of one-shot.
orchestrator/fleet.py: usable_modules + BRIDGE handling were both
unconditional, so:
- With BRIDGE set, requires_bridge modules were still being
dropped — picker only ever returned samba_usermap_script across
every slot/episode (the test_fleet_uses_all_modules_when_bridge_set
failure on HEAD).
- env.pop("BRIDGE") fired even when BRIDGE was the operator's
explicit setup, breaking modules that need bridge mode (vsftpd
backdoor on hardcoded port 6200, distccd, etc.).
Both made conditional on bridge_set so the picker walks the full
catalog under bridge mode and SLIRP-only modules still get a
clean SLIRP env when BRIDGE is unset.
receiver/app.py: half-pregnant v2 schema state in HEAD — calling
store.ingest_stream(episode_type=..., benign_profile=...) with
kwargs the matching store.py change was in the WIP stash. Removed
v2 awareness from app.py so v1 episodes (what the producer ships
today) get accepted again. SCHEMA_VERSION default reset to 1 to
match.
229 passed, 0 failed. (HEAD had 15 failures, all linked to the
half-pregnant v2 state above.)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PIPELINE.md is the canonical plan for the data-collection / emulation
/ labelling pipeline. It supersedes any guidance in AGENTS.md,
README.md, or other repo docs that contradicts it (§17). Future
sessions read it before changing anything in the pipeline.
AGENTS.md is rewritten to point at PIPELINE.md as canonical and to
strip the prescriptive symptom→fix table that absorbed producer-side
defects instead of fixing them (§7.1 compensating-layer pattern).
FIXYOURSELF.md is deleted (§4.12, §7.10 recovery-layer pattern). The
states it covered are made impossible by the §4.6 acceptance gate
landing later in §5; recovering from a state that shouldn't exist is
itself the bandaid we're removing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The detector previously returned 1 on alerts, which made systemd
mark cis490-fleet-health.service as 'failed' every tick that found
a sick host. That's the wrong UX — a detector finding a fault is
working correctly, not crashing. The alert is the signal (via
WARNING log + alerts.jsonl); the unit's success state should mean
"the detector itself ran cleanly." Test added.
Caught while live-deploying on the Pi: the first run found
elliott-thinkpad fatal-only at 943×4xx + 1425×5xx and correctly
emitted the alert — but systemd showed the unit red, which would
have caused operators to chase the wrong tail.
Side note: the same first run also caught a real bug — pycache for
receiver.store on /opt/cis490 was stale after I deployed the new
app.py + store.py from main, causing 1464 × 500 responses. Cleared
the pycache and the index immediately resumed growing (4465 →
4515 in 30 seconds). The detector earned its keep on the very
first cycle.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two pieces of self-monitoring so the maintainer isn't the alarm:
(2) Receiver-side fleet health monitor
cis490-fleet-health.timer runs check_fleet_health.py every 5 min.
Detects three symptoms and writes them to
/var/lib/cis490/alerts.jsonl + a syslog WARNING (greppable / easy
to forward to a notifier):
silent — host shipped in last 24h but has been quiet >30 min
fatal-only — actively shipping but every PUT 4xx
unstamped — shipping without X-Cis490-Code-Commit header
Dedup is keyed on (host, symptom, hour-bucket) so a sustained fault
fires once per hour, not every 5 min. 15 unit tests cover the index
parser, three detectors, and dedup.
(3) Per-host doctor snapshots
Lab hosts run cis490-doctor-check.timer once a day (10 min after
boot, then daily with 30-min jitter). The timer runs
cis490_doctor.py --json and PUTs the result to a new endpoint:
PUT /v1/host-health/<host> → /var/lib/cis490/host-health/<host>.json
GET /v1/host-health → aggregate across all hosts
Endpoint is NOT gated by version_gate — sick hosts running stale
code MUST still be able to report sickness. 11 unit tests cover
PUT/GET, atomic-write semantics, bearer auth, and the
not-gated-by-version-gate property.
ship_health_check.py reuses the existing shipper transport (mTLS +
bearer + receiver URL from lab-host.toml) so we don't reimplement
auth.
Both timers wired into install-lab-host.sh — the loop also enables
the previously-added autoupdate + cert-fetch timers, so a single
install run gives a host all four self-healing mechanisms.
Tests: 293 pass (26 new — 15 fleet-health, 11 host-health). 2
pre-existing test_fleet.py failures from the elliott-ThinkPad
merge (667f042) are unrelated to this change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The serial console approach failed: Metasploitable2's kernel is not
configured with console=ttyS0, so only GRUB output reaches the QEMU
serial socket; the OS boot and login prompt never appear there.
New approach:
1. Sleep _METASPLOITABLE2_MIN_BOOT_S (65 s) after QEMU writes its
pidfile. By this point the guest kernel and init are always up.
2. Call _wait_for_tcp with a 3 s recv timeout. Post-floor, SLIRP has
forwarded the connection to the guest TCP stack, so:
- socket.timeout → service listening, waiting for client data ✓
- OSError/RST → port still closed (service not ready); retry ✓
Eliminates the early-boot false-positive that caused exploits to
fire ~60 s before Samba was actually listening.
Also update TIER3-BRINGUP.md bug 6 to reflect the correct final fix.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
k-gamingcom symptom (2026-05-02): the on-device agent successfully
finished Tier-3 bring-up, but the shipper sits in "waiting on mTLS
material" because the cert auto-fetch step in install-lab-host.sh
either ran with host_id still REPLACE_ME, or hit a transient
bootstrap.wg failure, and there's no automatic retry. The Pi-side
cert IS minted and the bootstrap endpoint serves it — the failure
mode is purely "lab-host hasn't pulled it down."
Fix: extract the cert-fetch logic into scripts/fetch-lab-host-cert.sh
(idempotent, no-op when certs are already on disk, no-op when host_id
is unset, exit-0 on transient network failure so the unit doesn't
get pinned as failed), and run it from a 5-minute systemd timer.
The timer handles all three "stuck waiting on mTLS" cases without
operator action:
- operator edited host_id post-install but didn't re-run install
- bootstrap.wg was briefly unreachable during install
- lab host was offline when install ran but came up later
The script `try-restart`s cis490-shipper after a successful fetch
so the daemon picks up the new cert immediately instead of waiting
for its lazy retry. install-lab-host.sh still calls the script
on install for fast first-time bring-up — the timer is the safety
net.
Tarball extract is staged through a temp dir + atomic rename so a
mid-extract crash never leaves us with a mismatched cert/key pair.
AGENTS.md row 4 updated: "waiting on mTLS material" remediation now
points at the timer, with the exact `systemctl start
cis490-cert-fetch.service` command to force an immediate retry.
Tests: 267/267 unchanged. The fetch script is idempotent + has all
its happy/error paths handled inline; a unit test would mostly be
testing systemd's behaviour. The integration test path is the timer
running on a real lab host, which is the actual production case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root causes and fixes documented in TIER3-BRINGUP.md. Summary:
1. BRIDGE env var leaked into Tier-3 subprocess → target VM used tap
instead of SLIRP; fix: env.pop("BRIDGE") in fleet _run_slot.
2. usable_modules filter conditioned on BRIDGE presence → bridge-requiring
modules selected on SLIRP runs; fix: always filter requires_bridge.
3. cmd/unix/interact creates no session.list entry → session_open_timeout
every episode; fix: switch samba_usermap_script to cmd/unix/bind_perl.
4. Per-slot LPORT hostfwd used wrong guest port (host:5444→guest:4444);
fix: extra_host_port:extra_host_port mapping so guest binds the
per-slot LPORT directly.
5. vsftpd backdoor port 6200 hardcoded → collision across concurrent slots;
fix: requires_bridge=true filters it from SLIRP fleet runs.
6. SLIRP false-positive in _wait_for_tcp → exploit fires before Samba
boots (~60 s too early); fix: replace TCP probe with serial console
_wait_for_serial_login that waits for actual "login:" prompt.
7. Stale QEMU survives orchestrator restart (start_new_session=True) →
holds hostfwd ports, new QEMU silently fails; fix: kill by pgid from
old pidfile before rmtree.
8. PORT_BASE default used privileged port 21; fix: default to 2021+slot*100.
9. msfrpcd 6.x returns bytes for all string values even with raw=False;
fix: MSFRpcClient._str() recursive decoder applied to all responses.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The shipper on a stuck lab host logs the receiver's response body
verbatim as ERROR (queue.py:_log_412). That's the ONLY inbound
channel from this Pi to a lab host without ssh — every PUT the
shipper makes pulls down a fresh remediation message.
Update the 400 (missing-commit) and 412 (not-in-window) bodies to
explicitly call out FIXYOURSELF.md and the diverged-HEAD case (§B),
not just "pull and reinstall" — because if the host is on a local
commit that's not on origin/main, plain `git pull --ff-only` fails
and the agent needs to know about §B's three resolutions.
elliott-thinkpad has been hitting the receiver ~1/sec for 19 hours;
it'll receive this updated body on its very next PUT. The on-device
agent (or whoever is reading the journal) sees the path forward
without the maintainer having to push through any other channel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The auto-update timer (98dcd4f) covers the routine case of a host
falling behind origin/main. It deliberately refuses to fast-forward
when local HEAD isn't an ancestor of origin/main — the right call
for safety, but it leaves on-device agents with no automatic path
out when they (or an operator) made a local commit.
That's exactly the elliott-thinkpad incident: ~31,738 episodes
shipped over 19 hours, all stamped with local commit 5568d77 that
isn't on origin/main, all 412'd. Auto-update can't fix it; the
on-device agent had no doc telling it what to do.
FIXYOURSELF.md is that doc. Pure decision tree, six branches
(behind / diverged / no-network / no-git / dirty-tree / clean) each
with verbatim commands and the order to try them. The diverged-HEAD
branch (§B) is the elliott-thinkpad case and offers three resolutions
(push, reset, file-issue-and-wait) so an agent that doesn't have
push permission isn't backed into discarding work.
Linked from the AGENTS.md top-of-file symptom table so a smaller
model finds it without having to know the filename.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today's incident: post-cutover, k-gamingcom went silent and
elliott-thinkpad kept shipping pre-stamp episodes that the receiver
gate 400'd in a 2300+ PUT loop. Both required `git pull && install-
lab-host.sh` *on the host* — neither the on-device AI agent nor the
operator pulled in time, and from the receiver Pi I cannot reach in
(sshd off on the lab hosts).
Fix the recurrence directly: a 30-min systemd timer that does
git fetch + (if behind) ff-only pull + re-run install-lab-host.sh.
Hosts catch up on the next tick on their own — no human or agent
action required.
Mechanics:
- scripts/auto-update.sh runs as root, drops to cis490 for git ops
to satisfy /opt/cis490 ownership ("dubious ownership" guard).
- Refuses ff if local HEAD isn't an ancestor of origin/main —
protects operator hand-edits from silent overwrite.
- Network failures exit 0 (offline is normal, don't pin a unit
failure); divergence + install failures exit non-zero so the
journal records what broke.
- RandomizedDelaySec=10min on the timer prevents thundering-herd
when several hosts boot together.
- Hands off to install-lab-host.sh via exec — exactly one path
through bring-up; no special "auto" flow.
The version-gate provides the quality boundary, so even if origin/
main moves forward unsafely, the receiver's allow-list still
controls what lands in the index.
install-lab-host.sh enables cis490-autoupdate.timer on every run,
idempotent — existing hosts pick it up the next time they pull
manually.
Filed Forgejo #18 with the canonical command for elliott-thinkpad
+ k-gamingcom to bootstrap themselves out of the current incident
(auto-update doesn't help them retroactively — it has to be running
*before* the cutover to catch the next one).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Smaller models running on lab hosts read AGENTS.md top-to-bottom and
need explicit if-this-then-that. Restructure to put a decision-tree
table at the very top mapping every realistic symptom to the exact
command to run (verbatim — no paraphrasing instruction). Adds an
unambiguous HARD RULES list.
Also fixes accumulated drift:
- Tier-4 section had two contradictory descriptions (theZoo flow +
legacy MalwareBazaar flow). Removed the MalwareBazaar paragraphs;
the table's MALWAREBAZAAR_API_KEY env var is gone (theZoo needs no
auth). The "DO NOT push API key" bullet was about a flow that no
longer exists.
- Canonical bring-up step 6 said the Metasploitable2 download was
"registration-walled" requiring an operator-supplied URL+sha256.
Not true since the SourceForge mirror + TOFU pinning fix —
install-lab-host.sh handles it. Removed the manual step entirely
and noted Tier-3+4 are part of step 1.
- The "Three install bugs in 95ac56a" historical table was churn that
doesn't help current agents. Replaced with a generic
"outdated-clone? pull main and re-run install-lab-host.sh" block
that explicitly enumerates what the install script does (VERSION
stamp, queue drain, daemon-reload+restart, watchdog).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three robustness items off the future-work list:
1. Shipper sd_notify watchdog. Type=notify + WatchdogSec=180. The
daemon sends READY=1 after queue construction and WATCHDOG=1 once
per scan pass via a heartbeat callback wired into run_forever.
Restart=on-failure only catches process death — silent stalls
(deadlock, hung tar subprocess, blocked I/O past timeout) used to
leave a zombie running with the data backlog growing. Now systemd
kills + restarts the daemon if no WATCHDOG=1 arrives within 180s.
Verified end-to-end against systemd via `systemd-run --transient
--property=Type=notify --property=WatchdogSec=10`: unit transitions
to active on READY=1; SIGSTOP'ing the process triggers
`Watchdog timeout (limit 10s)! Killing process N with SIGABRT` at
exactly t+10s, then unit goes failed → restart cycle.
2. Quarantine cleanup. Without an upper bound, data/quarantine/ grew
forever as fatal episodes piled up. New ShipperConfig fields:
quarantine_keep_days = 30 # opt-out: 0 disables
quarantine_cleanup_interval_s = 3600 # gate so 5s tick doesn't
# statx() the whole tree
Cleanup runs at the start of run_once() but is gated to once per
hour. Removed entries logged.
3. Doctor surfaces shipping errors. Tails 10 minutes of cis490-shipper
journal and surfaces 412/400/transient patterns as red/yellow rows
with the canonical fix command. An on-device agent running
cis490_doctor.py now sees one line ("12 ship(s) rejected as
out-of-window") instead of needing to grep the journal.
Tests: 200/200 (was 188). New coverage: heartbeat callback fires +
survives exceptions; quarantine cleanup respects keep_days, gate, and
opt-out; doctor parser correctly classifies 412/400/transient/clean/
empty/journalctl-denied; both error classes prioritise 412 (more
actionable) when present together.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
receiver.toml.example: the local_repo_path comment was wrong about
when it kicks in. With the new fallback path, it's used both when
forgejo_url is unset (sole backend) AND when forgejo is unreachable
(failover). Document that, plus the auto-detect of /opt/cis490/.git.
cis490_doctor: add a VERSION-stamp check for lab-host role. If
/opt/cis490/VERSION is missing or malformed, the orchestrator stamps
"unknown" → receiver gate rejects every PUT → quarantine. Surface
this as a red row with the canonical fix (re-run install-lab-host.sh)
so an on-device agent doesn't have to grep journal logs to figure it
out.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups from the post-cutover diagnosis:
1. version_gate: forgejo → local git fallback. If forgejo refresh
returns empty AND a local repo path is configured, retry against
`git log` from the local checkout. The receiver service runs on
the same Pi as forgejo, so a simultaneous restart used to leave
the gate's cache empty and reject every PUT with not-in-window.
Auto-detects /opt/cis490/.git when the operator hasn't set
local_repo_path explicitly — that path is always present on a
production receiver and ProtectSystem=strict still allows reads.
Logs `source=git-fallback` so this isn't silent.
2. shipper/queue: sweep orphaned outbox tarballs. The lifecycle
invariant is `outbox/<id>.tar.zst exists ⇒ episodes/<id>/ exists`
— broken historically by the now-fixed fatal-loop, by operator
`rm` of an episode dir, or by an OS crash between rename(2) and
the post-ship cleanup. Without sweeping, dead bytes pile up
forever. New _sweep_outbox runs at the start of every scan,
bounded by the file count in outbox/.
Tests cover: fallback fires when forgejo unreachable + repo_path set;
no fallback when repo_path None (opt-in); orphan tarball + partial
get swept on the next pass; live tarballs untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Smaller models running as on-device agents need a direct, prescriptive
remediation block for the gate-failure modes — the receiver's response
body is good but only visible if the agent reads journalctl carefully.
Document the exact sequence (git pull → install-lab-host.sh) and what
the install script now does on its own (drain pre-stamp queue, restart
services). Also calls out the two anti-patterns we don't want agents
trying: silencing the shipper to stop log noise, or fabricating a
code_version field to bypass the gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why services weren't starting after the gate went live:
1. install-lab-host.sh self-copy. The receiver's 400 remediation tells
the agent to `cd /opt/cis490 && git pull && sudo
./scripts/install-lab-host.sh`. That makes REPO_ROOT==INSTALL_ROOT
and `cp -aT $REPO_ROOT $INSTALL_ROOT` errors with "are the same
file"; `set -e` aborts before the systemd units install or anything
restarts. Detect the same-dir case and skip the cp; chown still
runs.
2. Services never restart. install-lab-host.sh and install-tier-3-4.sh
both ended by *telling the operator* to restart, then exiting. The
running shipper/orchestrator kept executing pre-gate code from the
old module objects, so new `code_version` stamping never reached an
episode. Both scripts now `systemctl restart` the units they own
when those units are enabled.
3. Shipper queue fatal-loop. queue.py incremented `fatal++` but didn't
move the episode out of `data/episodes/`. Next scan re-tarred and
re-PUT the same dir, getting 400 again. With 4465+ pre-stamp
episodes on k-gamingcom this burned ~1 PUT/sec for 5+ hours of
receiver log. Fatal episodes now move to data/quarantine/<id>/ with
a quarantine_reason.json beside them; the outbox tarball is
deleted.
4. Pre-stamp backlog drain. tools/quarantine_unstamped.py is a
one-shot that scans data/episodes/ and quarantines anything without
a 40-char-hex code_version.commit. Wired into install-lab-host.sh
step 9 so a re-install drains the queue automatically. Idempotent;
safe to run while the shipper is active.
Tests cover the queue's new fatal-quarantine path and every drain
behaviour (kept/quarantined/dry-run/idempotent/missing-meta/collision).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the "have you tested it" gap as much as we can without x86 KVM.
The Pi is ARM64 — can't boot Metasploitable2 or run KVM-accelerated
guests. But most of the Tier-3 chain doesn't need x86:
* chunked_real_binary_upload is just shell commands over a pipe
* exploit module TOMLs and the deterministic selector are pure Python
* manifest loading + sample selection are pure Python
* msfrpcd itself runs on ARM (Ruby + Java)
* the receiver's commit gate is the same on any arch
verify_tier3_local.py exercises each of those end-to-end, in process,
on this Pi:
PASS exploits/modules/*.toml parse + selector deterministic
PASS manifest loads + selector covers every sample
PASS chunked binary upload survives a real /bin/sh round-trip
(150 KB binary, 26 chunks, sha256-verified end to end)
PASS staged samples are Linux i386 ELF (when staged)
PASS msfrpcd round-trips core.version (when listening)
PASS receiver /v1/health + gate enforces commit allow-list
Live result on this Pi today: 5 PASS, 1 SKIP (msfrpcd not installed
on the Pi, which is correct — the Pi is the receiver, not a lab
host). When run on a lab host after install-tier-3-4.sh, all 6
PASS gives full Tier-3 readiness.
What this script does NOT verify (still needs x86 KVM on a lab
host, covered by install-tier-3-4.sh's verify step):
* Metasploitable2 boots under QEMU/KVM
* vsftpd_234_backdoor lands a session against it
* the chunked-upload binary actually executes inside that session
But the chunked-upload step proves every byte of the upload path
(printf '%s', heredoc-free path, base64 decode, sha256 verify,
chmod, exec scaffold) works against a real POSIX shell. An msfrpc
session presents the same shell interface, so a passing local-sh
test is strong evidence the production path will work.
tests/test_tier3_local_verify.py wraps the deterministic steps
(module parse, manifest, chunked upload) so pytest catches
regressions automatically. 174/174 total.
Operator workflow: ssh into Pi (or lab host), run:
/opt/cis490/.venv/bin/python tools/verify_tier3_local.py
Each step prints PASS/FAIL/SKIP with detail. Exit 1 if any FAIL.
User caught it: I shipped the theZoo path without running it
end-to-end. A real fetch on the Pi exposed two bugs:
1. Family-name matcher was substring-strict. "Cryptolocker-class"
wouldn't match the dir "CryptoLocker_22Jan2014" because "-class"
isn't in the dir name. Now expands to a sequence of tokens
(full, head-of-dash, head-of-dot, head-of-underscore) and tries
each. First match wins.
2. Extraction picker was "largest non-text" — a bad heuristic for
theZoo, where each Linux.* zip often contains MULTIPLE binaries
for different platforms (Linux i386, x86-64, ARM, FreeBSD, sometimes
even Windows PE). The largest is rarely the i386 Linux ELF that
would actually run on Metasploitable2. Now sniffs ELF magic bytes
in stdlib and tiers:
1. Linux i386 ELF (largest first)
2. any other ELF (best-effort, may not execute)
3. largest non-text (Wine fallback)
Verified end-to-end on the Pi against a real theZoo clone (~500 MB,
263 family dirs, 2026-05-01 fresh pull):
linux-encoder-ransomware → ELF 32-bit Intel i386 SYSV (278 KB)
linux-wirenet-rat → ELF 32-bit Intel i386 SYSV (64 KB)
linux-rex-ransomware → ELF 32-bit Intel i386 SYSV Go (7.6 MB)
linux-neurevt-bot → ELF 32-bit Intel i386 SYSV (3.0 MB)
linux-earthkrahang-apt → ELF 32-bit Intel i386 GNU/Linux (5.8 MB)
5/5 picks are runnable Linux i386 ELFs. Manifest rewrites in place
add source/sha256/url; meta.sample.kind goes to "real" automatically.
Manifest rewritten:
- Old families (XMRig, Mirai, Cryptolocker-class, Dridex, Kovter,
Reverse-Shell) → mostly absent from theZoo's Linux catalog or
matched the wrong arch.
- New families chosen against a verified theZoo presence list:
Linux.Encoder, Linux.Wirenet, Ransomware.Rex, Neurevt,
EarthKrahang.
- XMRig + Kovter remain as mimic-only fallbacks (theZoo lacks a
runnable Linux i386 binary for these; orchestrator falls back
to the mimic profile).
Tests added (tests/test_auto_fetch_samples.py): 13 cases covering
ELF magic detection (i386 accepted, FreeBSD/x86-64/ARM/PE32/text
all rejected), family-token expansion (the "-class" suffix bug),
extraction picker (prefers Linux i386 over larger non-Linux ELFs),
manifest in-place rewrite preserves mode + skips entries that
already have sha256.
What's still NOT verified end-to-end (requires a lab host with
KVM x86):
- Metasploitable2 boot under QEMU
- vsftpd_234_backdoor exploit fire via msfrpcd
- chunked binary upload through a real shell session
- real binary executing inside a Metasploitable2 guest
The Pi is ARM64 — can't run Metasploitable2. install-tier-3-4.sh's
verify step (run_tier3_demo.py) covers all four on a real lab host;
deploy verifies on first run there.
171/171 tests pass.
Initial git-log-based gate ran into a permission wall: the cis490
service user can't read /home/max/cis490/.git (ProtectHome=true +
home-dir mode). Switching the production source to the local Forgejo
HTTP API (already accessible to all WG peers, single source of truth
both lab hosts and the receiver pull from). When the maintainer
pushes new code to spectral/CIS490, the next 5-second cache refresh
sees the new commit and lab hosts can immediately ship under it.
VersionGate now takes either:
- forgejo_url + repo_owner + repo_name + branch (+ optional
auth_token for private repos): hits
/api/v1/repos/<owner>/<name>/commits?sha=<branch>&limit=<n>
- repo_path: dev-only fallback, runs `git log` locally
Local-git path retained for tests + the dev-only case.
receiver.toml.example gains forgejo_url/repo_owner/repo_name/branch
with auth_token commented; live-deployed receiver.toml on the Pi has
the spectral org + token.
Live state on the Pi: 41 valid hashes loaded, head=f8ad02b. Verified
end-to-end:
bogus commit → 412 + remediation
HEAD commit → clears gate (fails downstream at sha-mismatch as
expected for the empty-body verify probe)
Test added: test_forgejo_backend_accepts_returned_commits stands up
a tiny canned-response HTTPServer in-process, exercises the parser
without depending on a live Forgejo instance. Brings test_version_gate
to 10 cases; total 158/158.
Stops out-of-date lab hosts from polluting the dataset with episodes
generated by buggy code. The valid-commits set mirrors the maintainer's
working clone on the Pi automatically — when the maintainer pulls or
pushes a new commit, the receiver picks it up within the 5-second
cache TTL with no service restart.
Receiver changes:
- receiver/version_gate.py (new): VersionGate(repo_path, window).
Each check() consults a frozenset of the last `window` commit
hashes from `git -C <repo> log --format=%H -n <window>`, refreshed
every 5s under a lock. Resilient to transient git failure (keeps
prior cache so a flaky `git` doesn't lock out every shipper).
- receiver/app.py: PUT extracts X-Cis490-Code-Commit; gate.check()
before ingest. Rejects with:
400 + remediation if header missing or malformed
412 + remediation + your_commit + head_commit if not in window
Remediation block is verbatim copy-pasteable into the lab-host
shell:
cd /opt/cis490 && sudo -u cis490 git pull origin main
sudo /opt/cis490/scripts/install-lab-host.sh
sudo systemctl restart cis490-orchestrator
- receiver/store.py: ingest_stream takes commit kwarg, stamps it on
the index.jsonl row (new optional field). Backfilled rows from
index_backfill.py also pull commit out of meta.json.
- receiver/config.py + etc/receiver.toml.example: new [version_gate]
section. enabled=true, repo_path=/home/max/cis490, window=100 by
default. Enabled toggle exists for emergency disable-and-collect.
Shipper changes:
- shipper/transport.py: ship_tarball() takes commit kwarg, sends
X-Cis490-Code-Commit header. 412 maps to status='fatal' so the
queue doesn't infinite-retry — operator must pull and reinstall
before the next ship will succeed.
- shipper/queue.py: reads meta.json::code_version.commit per
episode, passes through. On 412, logs the receiver's full
remediation block at ERROR level so journalctl on the lab host
shows exactly what to run.
Tests: 9 in test_version_gate (including 2 end-to-end via
starlette.testclient), 2 cover the boundary where new commits land
mid-cache and where missing-repo gracefully keeps prior cache.
157/157 total.
Index schema: existing rows stay valid (commit field is optional
on read). New rows from receiver-direct AND from index_backfill.py
include commit.
Closes a real reproducibility gap. Three weeks of bug fixes have
shipped (probe fix in 2707709, multi-signal classifier in 321ea63,
mandatory tier-4 in 265f3ad, etc.); without a per-episode
code_version, trainers can't tell which episodes came from buggy
pre-fix code and have to scan every tarball to guess.
Resolution priority (cached across episodes):
1. $INSTALL_ROOT/VERSION (production — install-lab-host.sh writes
it at install time since /opt/cis490 is a flat copy with no .git)
2. git rev-parse HEAD from the repo root (dev clones)
3. {"commit": "unknown", source: "unknown"} so the field is always
present (filterable)
Output shape, always present in meta.json:
"code_version": {
"commit": "<40-hex>" | "unknown",
"branch": "<name>" | null,
"dirty": bool | null,
"source": "VERSION-file" | "git" | "unknown"
}
install-lab-host.sh writes VERSION at install time with the source
repo's git rev-parse HEAD + branch + clean-tree flag + install
timestamp. Lab-host agents that pull main + re-run install-lab-host.sh
get a fresh stamp automatically.
148/148 tests pass; test_episode_against_self_pid_produces_full_directory
asserts the field's presence + valid `source` value.
Replaces MalwareBazaar with theZoo (https://github.com/ytisf/theZoo).
theZoo is a public security-research repo with hundreds of malware
samples organized by family, password-protected with the well-known
'infected'. No API key, no signup, nothing for an operator to do —
which is what zero-touch tier-4 actually means.
Changes:
- tools/auto_fetch_samples.py: rewrite. Clones theZoo (shallow, ~500 MB)
to /var/lib/cis490/theZoo on first run, then for each manifest
family without a sha256 it locates a matching Binaries/<Name>
dir, extracts the .zip with password 'infected', picks the largest
non-text payload as the binary, sha256s it, stages at
samples/store/<sha256>, and rewrites manifest.toml in place
(atomic tempfile + os.replace, stat preserved). Mandatory exit
semantic: non-zero if no real samples landed.
- scripts/install-tier-3-4.sh: dropped the MB-key resolution chain
(env var → local file → bootstrap.wg fetch). Now just runs
auto_fetch_samples.py and dies if zero samples land. SKIP_TIER4
remains as the explicit override but is documented as defeating
the project.
- bootstrap/app.py + __main__.py + etc/cis490-bootstrap.service:
removed the /v1/secret/<name> endpoint and the --secrets-root flag.
Dead code now that no API key needs distributing. Live-rolled
back on the Pi (404 verified post-restart, stale /etc/cis490/secrets
dir removed).
- scripts/set-malwarebazaar-key.sh: deleted. No MB key means no
one-time operator step.
- tests/test_bootstrap_secrets.py: deleted (route removed).
- AGENTS.md: rewrote tier-4 section to reflect zero-operator model.
148/148 tests pass. Bootstrap service rolled back live.
User: 'we don't want it to be optional, this real malware IS the data
we want.' Acknowledged. Three changes make Tier 4 actually mandatory
without forcing per-host operator action:
1. bootstrap.wg /v1/secret/<name> endpoint
- Pi serves /etc/cis490/secrets/malwarebazaar.token to lab hosts
over the same trust boundary as the cert endpoint (WG mesh,
iptmonads-gated). Strict allow-list — only `malwarebazaar`
resolves; everything else 404s. Secret returned as bare text
with Cache-Control: no-store. Live-verified on the Pi.
- tests/test_bootstrap_secrets.py covers four cases: 404 unprovisioned,
200 with token, 404 unknown name, 500 on empty file.
2. install-tier-3-4.sh: Tier 4 is no longer optional
- Resolves MB key in priority: env var → /opt/cis490/samples/.bazaar.token
→ https://bootstrap.wg/v1/secret/malwarebazaar.
- Caches the bootstrap-fetched key locally so re-runs are offline.
- If all three resolution paths fail, dies with the exact
remediation command for the operator (one-time set-malwarebazaar-key.sh
on the Pi).
- auto_fetch_samples.py is run unconditionally (SKIP_TIER4 still
works for emergency overrides but logs a warning that the host
will produce only mimics). Deploy fails if zero binaries land
in samples/store/ — no silent mimic-only fallback.
- SKIP_TIER4 documentation now says 'DEPRECATED; defeats the project'.
3. scripts/set-malwarebazaar-key.sh
- Pi-side helper: one operator command per fleet, ever. Accepts
key via env or stdin, validates length, drops at the right
path with the right perms. Lab hosts pull the rest automatically.
AGENTS.md: rewrote the Tier-4 section to reflect mandatory status +
the one-time-on-Pi distribution model.
152/152 tests pass. Bootstrap service updated live on the Pi.
Replaces the manual runbook with scripts that just work. install-lab-host.sh
now runs the full Tier-3 deploy automatically as its 8th step (after the
mTLS cert lands), and Tier-4 auto-fetches when MALWAREBAZAAR_API_KEY is set.
Changes:
- install-msfrpcd.sh: actually runs the Rapid7 omnibus installer when
metasploit-framework isn't present (was: bail with "install manually").
apt-get and dnf paths both go through the same omnibus script with
DEBIAN_FRONTEND=noninteractive. Idempotent.
- fetch-metasploitable2.sh: bakes in the SourceForge public-mirror URL
(https://downloads.sourceforge.net/project/metasploitable/...) so no
operator URL is required. sha256 is now optional and TOFU-pinned —
first run records the hash to OUT_DIR/metasploitable2.qcow2.sha256;
subsequent runs verify against that. Skips if qcow2 already present.
- scripts/install-tier-3-4.sh (new): orchestrates the four steps
(msfrpcd → metasploitable2 → bridge → tier-3 verify) plus optional
Tier-4 auto-fetch. Idempotent. SKIP_VERIFY / SKIP_BRIDGE / SKIP_TIER4
env knobs for partial deploys.
- tools/auto_fetch_samples.py (new): when MALWAREBAZAAR_API_KEY is set,
queries MB by each manifest entry's `family` (signature match), pulls
the first match via fetch_sample.py, and rewrites manifest.toml in
place (atomic tempfile + os.replace, preserving stat). Skips entries
that already have sha256.
- install-lab-host.sh: gains a step 8 that calls install-tier-3-4.sh
automatically when mTLS certs are on disk. --skip-tier3 flag for
operators who want Tier 2 only. Skipped silently before certs land
so first-pass install (host_id=REPLACE_ME) still works.
- AGENTS.md: rewrote the Tier-3 section to point at the one-shot
script. Removed the old multi-command runbook so on-device agents
can't accidentally follow stale steps.
Net effect: a fresh lab host now gets Tier 3 (and Tier 4 if API key
present) from a single sudo invocation. No operator picks for image
URLs, no manual metasploit installs, no manual manifest edits.