On-device agent (k-gamingcom) ran the diagnostic probe sequence and
proved the workload IS running on Alpine (`yes` saturating the vCPU,
loadavg=1.05, three `yes` PIDs visible), but two busybox incompatibilities
made every episode look silent:
1. _probe() used `pgrep -c yes`. The -c (count) flag is a procps-ng
   extension that busybox pgrep does not implement. busybox pgrep exits 1
   with a usage banner; the `|| echo 0` fallback then reported yes=0
   every time. Switched to `pgrep yes | wc -l`, which both pgrep
   variants support.
2. _wrap_loop appended `disown` after the nohup-backgrounded script.
   busybox sh / ash have no disown builtin, so each infected_running
   phase printed `sh: disown: not found` into run()'s captured output.
   The script kept running regardless (nohup already gives the SIGHUP
   immunity that disown was meant to add); removing disown eliminates
   the spurious error.
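A minimal sketch of a busybox-safe probe along the lines of fix 1. The helper names (`probe_count_cmd`, `parse_count`, `probe`) and the `run` callable are hypothetical, not the repo's actual `_probe()`; the one idea taken from above is that counting is delegated to `wc -l` rather than `pgrep -c`. Unparseable output is returned as None so a broken probe cannot masquerade as "zero processes" again.

```python
from typing import Callable, Optional


def probe_count_cmd(name: str) -> str:
    """Shell command that counts matching PIDs on busybox and procps-ng alike."""
    # wc does the counting, so no pgrep-specific flag is needed.
    return f"pgrep {name} | wc -l"


def parse_count(output: str) -> Optional[int]:
    """Parse `wc -l` output; None means the probe itself failed."""
    text = output.strip()
    return int(text) if text.isdigit() else None


def probe(run: Callable[[str], str], name: str = "yes") -> Optional[int]:
    """Run the count command through `run` (hypothetical serial runner)."""
    return parse_count(run(probe_count_cmd(name)))
```

Returning None instead of 0 on garbage output is a design choice of this sketch: it keeps a usage banner from being read as a real measurement.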
Cross-validation in the classifier:
- prune_episodes.py: workload-silent now requires the in-guest probe AND
  the host-side /proc CPU envelope (flat-cpu) to AGREE. A probe-only zero
  is treated as the busybox false positive and does not raise the flag.
  As a result, the 244 already-on-disk episodes from elliott-thinkpad and
  k-gamingcom are classified correctly without re-collecting.
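The two-signal rule can be sketched roughly as follows. This is not the code in prune_episodes.py: the function names, the max-minus-min definition of "flat", and the 5-point threshold are all assumptions made for illustration.

```python
from typing import Sequence


def is_flat_cpu(host_cpu_pct: Sequence[float], spread_threshold: float = 5.0) -> bool:
    """Host-side envelope counts as flat when CPU barely varies.

    Assumption: flatness is max - min below a threshold (percentage points).
    """
    if not host_cpu_pct:
        return False  # no host data, so silence cannot be confirmed
    return max(host_cpu_pct) - min(host_cpu_pct) < spread_threshold


def workload_silent(probe_yes_count: int, host_cpu_pct: Sequence[float]) -> bool:
    """Flag only when the in-guest probe AND the host envelope agree."""
    return probe_yes_count == 0 and is_flat_cpu(host_cpu_pct)
```

With this shape, a probe reporting yes=0 next to a host trace that climbs to a CPU plateau is not flagged, which is the busybox false-positive case.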
Test coverage:
- test_workload_silent_flag: updated to require both signals
- test_workload_silent_suppressed_when_host_cpu_real: new regression
  test for the busybox false positive
AGENTS.md gains a "Don't trust the in-guest probe alone" section with
the busybox-vs-procps gotcha + a list of busybox-incompatible patterns
to avoid in any new in-guest diagnostic.
The elliott-lab episode showed a median of 20% CPU in every phase because
the in-guest workload silently never fired. There was no signal
in events.jsonl to detect that from outside, so a trainer would
treat the labels as ground truth and learn "all phases look identical".
This commit closes the audit gap so the failure is visible in meta:
orchestrator/episode.py
EpisodeConfig.sample: Sample | None — the manifest entry that
drove this episode's workload selection. Stamped into meta.sample
as {name, family, category, profile, kind, sha256} so trainers
can join cleanly without re-deriving from events. None means the
v1 yes-loop fallback path ran (and the trainer should treat the
episode with appropriate skepticism).
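A sketch of how the meta.sample stamp might look. The six field names come from the description above; the `Sample` dataclass layout, the `sample_meta` helper, and the all-string field types are assumptions, not the repo's actual definitions.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    """Hypothetical stand-in for the manifest entry type."""
    name: str
    family: str
    category: str
    profile: str
    kind: str
    sha256: str


def sample_meta(sample: Optional[Sample]) -> Optional[dict]:
    """Value to stamp into meta.sample; None marks the v1 yes-loop fallback."""
    return asdict(sample) if sample is not None else None
```

Keeping None distinct from an empty dict lets a trainer filter fallback episodes with a simple null check.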
tools/vm_load_controller.py
VMLoadController gains an emit_event callable. Every phase now
emits a workload_* event into the runner's events.jsonl:
  workload_setup      login + initial cleanup OK
  workload_killed     clean / dormant. Dormant carries a
                      `pre_kill_probe` dict from inside the
                      guest (`pgrep -c yes`, `pgrep -c sh`,
                      /proc/loadavg) so the trainer can detect
                      the elliott-lab failure mode where the
                      workload never actually ran.
  workload_armed      armed handshake fired
  workload_infecting  dd urandom / payload write fired
  workload_started    infected_running command sent
  workload_failed     any of the above raised inside SerialClient
                      (timeout, EOF, partial login). The runner
                      would have silently swallowed the
                      exception via its on_phase try/except;
                      the audit row makes the failure detectable.
Exceptions in shell calls surface as workload_failed events but
do NOT propagate, matching the runner's existing on_phase
contract.
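The emit-but-do-not-propagate contract could look something like this. `ControllerSketch`, `run_phase`, and the payload keys are hypothetical; only the event names and the "surface as workload_failed, never raise" behaviour are taken from the text above.

```python
from typing import Any, Callable, Dict, Optional

EmitFn = Callable[[str, Dict[str, Any]], None]


class ControllerSketch:
    """Illustrative only; not the real VMLoadController."""

    def __init__(self, run: Callable[[str], str], emit_event: Optional[EmitFn] = None):
        self._run = run
        # A missing callback degrades to a no-op (back-compat).
        self._emit: EmitFn = emit_event or (lambda kind, payload: None)

    def run_phase(self, phase: str, cmd: str, ok_event: str) -> None:
        try:
            self._run(cmd)
        except Exception as exc:
            # Record the failure as an audit row instead of raising.
            self._emit("workload_failed", {"phase": phase, "error": repr(exc)})
            return
        self._emit(ok_event, {"phase": phase})
```

Catching broadly here mirrors the runner's own on_phase try/except: the difference is that the failure now leaves a row in events.jsonl.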
tools/run_real_vm_demo.py
Wires the controller's emit_event to the runner's emit_event via
a small forward-reference closure (controller is built before
runner; runner.emit_event needs to be the sink). Sample also
flows into EpisodeConfig.sample so meta.sample matches what the
controller actually ran.
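One way to picture the forward-reference closure: the controller receives a function that looks up the runner lazily, and the runner is filled in once it exists. `Runner`, `Controller`, and `wire` below are simplified stand-ins, not the classes in the repo.

```python
from typing import Any, Callable, Dict, List, Tuple


class Runner:
    """Stand-in runner: owns the event sink and calls on_phase."""

    def __init__(self, on_phase: Callable[[str], None]):
        self.on_phase = on_phase
        self.events: List[Tuple[str, Dict[str, Any]]] = []

    def emit_event(self, kind: str, payload: Dict[str, Any]) -> None:
        self.events.append((kind, payload))


class Controller:
    """Stand-in controller: emits one event per phase."""

    def __init__(self, emit_event: Callable[[str, Dict[str, Any]], None]):
        self._emit = emit_event

    def __call__(self, phase: str) -> None:
        self._emit(f"workload_{phase}", {"phase": phase})


def wire() -> Tuple[Controller, Runner]:
    holder: Dict[str, Runner] = {}  # filled in after the runner is built

    def forward(kind: str, payload: Dict[str, Any]) -> None:
        runner = holder.get("runner")
        if runner is not None:  # events before wiring are dropped
            runner.emit_event(kind, payload)

    controller = Controller(emit_event=forward)
    runner = Runner(on_phase=controller)
    holder["runner"] = runner
    return controller, runner
```

The holder dict is just one option; a `nonlocal` variable would serve the same purpose.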
Tests: 119 (was 106). New cases:
  tests/test_vm_load_controller.py (11 tests against a FakeSerial)
    - setup emits workload_setup
    - infected_running runs the v1 yes-loop AND emits workload_started
    - dormant probes BEFORE killing and stamps pre_kill_probe
    - dormant probe records "yes=0" (the elliott-lab fingerprint)
    - clean / armed / infecting all emit their respective events
    - serial.run() exception → workload_failed event, no propagation
    - sample-with-profile dispatches to exploits.workloads command
      (NOT the v1 yes-loop)
    - missing emit_event callback is a no-op (back-compat)
  tests/test_episode.py (2 new)
    - meta.sample carries name/family/category/profile/kind/sha256
      when EpisodeConfig.sample is set
    - meta.sample stays null in the v1 fallback path
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The v1 driver ran ``yes > /dev/null`` for every sample, which
produced the same envelope shape regardless of which malware family
the orchestrator claimed to be running. That's a poor training
signal: the model sees identical /proc + QMP traces tagged
"cryptominer" / "ransomware" / "RAT" with no distinguishing
features. v2 fixes this.
What landed:
exploits/workloads.py — six ``Workload`` profiles, each producing
a distinct in-session shell command pair (start_cmd / stop_cmd)
that backgrounds a profile-shaped loop:
  cpu-saturate    sustained 1-vCPU saturation (XMRig shape)
  scan-and-dial   periodic SYN-style probes across 10.200.0.0/24
                  + dial-home to gateway (Mirai shape)
  io-walk         fs traversal + 4 KiB urandom writes, periodic
                  re-read (ransomware shape)
  bursty-c2       long idle, periodic 3-packet TCP egress burst
                  (Dridex C2 beacon shape)
  low-and-slow    minimal CPU + periodic awk-driven memory churn
                  (Kovter / fileless shape)
  shell-resident  single long-lived TCP socket pinned to gateway
                  with periodic 6-byte command ticks (RAT shape)
Each profile uses a /tmp/.cis490-workload-<profile>.{pid,sh} pair so
the stop_cmd can cleanly kill the loop and its descendants.
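A rough sketch of the pid/sh pair idea. Only the /tmp/.cis490-workload-<profile>.{pid,sh} naming comes from the description; the `WorkloadSketch` class, the way the script is written out, and the exact shell strings are assumptions, and `pkill -P` here reaches direct children only.

```python
import shlex
from dataclasses import dataclass

PREFIX = "/tmp/.cis490-workload-"


@dataclass(frozen=True)
class WorkloadSketch:
    """Illustrative profile: a name plus the shell loop it backgrounds."""
    profile: str
    loop_body: str

    @property
    def script_path(self) -> str:
        return f"{PREFIX}{self.profile}.sh"

    @property
    def pid_path(self) -> str:
        return f"{PREFIX}{self.profile}.pid"

    @property
    def start_cmd(self) -> str:
        # Write the loop to the .sh file, background it, record its PID.
        return (
            f"printf '%s\\n' {shlex.quote(self.loop_body)} > {self.script_path}; "
            f"nohup sh {self.script_path} >/dev/null 2>&1 & "
            f"echo $! > {self.pid_path}"
        )

    @property
    def stop_cmd(self) -> str:
        # Kill the loop's children, then the loop, then remove both files.
        return (
            f"pid=$(cat {self.pid_path} 2>/dev/null); "
            f'[ -n "$pid" ] && {{ pkill -P "$pid"; kill "$pid"; }}; '
            f"rm -f {self.pid_path} {self.script_path}"
        )
```

For example, `WorkloadSketch("cpu-saturate", "while :; do :; done").stop_cmd` reads the PID back from the .pid file, so the stop side needs no state beyond the profile name.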
exploits/driver.py — MSFExploitDriver now accepts an optional
``Sample``. With one supplied, ``infected_running`` dispatches to
the matching workload via exploits.workloads.workload_for(); the
``sample_executed`` event records profile + sample name + sample
kind so the trainer can join cleanly. Without a sample, the v1
yes-loop path remains unchanged (backwards compat).
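The dispatch rule (profile when a Sample is present, v1 yes-loop otherwise, unknown profile falling back to cpu-saturate with a warning) might be sketched like this. `select_workload` is a hypothetical helper; the real entry point is exploits.workloads.workload_for(), whose signature is not shown in this message.

```python
import logging
from typing import Optional

log = logging.getLogger("workloads_sketch")

PROFILES = (
    "cpu-saturate", "scan-and-dial", "io-walk",
    "bursty-c2", "low-and-slow", "shell-resident",
)
V1_FALLBACK = "v1-yes-loop"  # label used by this sketch only


def select_workload(sample_profile: Optional[str]) -> str:
    """Return which workload path infected_running should take."""
    if sample_profile is None:
        return V1_FALLBACK  # no Sample supplied: unchanged v1 behaviour
    if sample_profile not in PROFILES:
        log.warning("unknown profile %r, falling back to cpu-saturate", sample_profile)
        return "cpu-saturate"
    return sample_profile
```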
tools/vm_load_controller.py — the same dispatch on the Tier-2 path
(no exploit, real Alpine guest driven over the serial console).
A fleet wave now produces six visually distinct envelopes
whether the underlying mode is Tier 2 or Tier 3.
tools/run_real_vm_demo.py — accepts ``--sample <name>`` (or
SAMPLE_NAME env from the fleet runner) + auto-wires QMP + agent
sockets into the EpisodeConfig so all three new collectors
(sources 2, 4, 5) run alongside source 1 by default.
tools/run_tier3_demo.py — same ``--sample`` plumbing for the
exploit-driven path.
Tests: 86 pass (was 82). New v2 cases:
- profile dispatch routes infected_running to the workload's
start_cmd (NOT the v1 yes-loop) when a Sample is set
- all six profiles produce distinct start_cmds (the property the
ML model needs)
- unknown profile string falls back to cpu-saturate with a warning
- v1 path (no Sample) still uses yes-loop (backwards compat)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end now drives a real KVM guest through the full XMRig-shaped
phase schedule with the workload running INSIDE the guest. Telemetry is
host-side /proc/<qemu_pid>; the load is busybox `yes` (sustained CPU
saturation) and `dd if=/dev/urandom` (disk burst on infecting), driven
over the serial console at every phase transition. The plotted envelope
shows clean idle → armed → infecting (disk spike) → infected_running
(100% CPU plateau) → dormant → re-entry → final clean.
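The phase schedule and per-phase load can be pictured as a small lookup table. The phase names follow the envelope described above; the concrete shell strings, the payload path and size, the no-op for armed, and reading "re-entry" as a second infected_running phase are assumptions of this sketch.

```python
from typing import Dict, List, Tuple

# Hypothetical per-phase commands sent over the serial console.
PHASE_COMMANDS: Dict[str, str] = {
    "clean": "pkill yes",
    "armed": ":",  # placeholder; the real armed step is not described here
    "infecting": "dd if=/dev/urandom of=/tmp/payload.bin bs=4k count=2048",
    "infected_running": "nohup yes > /dev/null 2>&1 &",
    "dormant": "pkill yes",
}

# Assumed ordering matching the plotted envelope.
SCHEDULE: List[str] = [
    "clean", "armed", "infecting", "infected_running",
    "dormant", "infected_running", "clean",
]


def plan(schedule: List[str]) -> List[Tuple[str, str]]:
    """Pair each scheduled phase with the command it would send."""
    return [(phase, PHASE_COMMANDS[phase]) for phase in schedule]
```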
Components:
  vm/launch_demo.sh            now boots Alpine 3.21 nocloud-cloudinit
                               (Cirros 0.6.x's cirros-init blocks on the
                               EC2 metadata service for ~17 min before
                               falling through to NoCloud, so it was
                               abandoned). Mounts a cidata ISO as a
                               second drive.
  tools/build_cidata.py        pure-Python NoCloud ISO builder (pycdlib).
                               Sets root password and ssh_pwauth via
                               runcmd so we don't depend on a specific
                               cloud-init version's plain_text_passwd
                               handling.
  tools/vm_serial.py           serial-console client (stdlib socket).
                               Idempotent login (detects already-in-shell
                               state), sentinel-bracketed run() that
                               distinguishes shell output from the TTY
                               echo of input by requiring a leading
                               \r\n boundary on the marker.
  tools/vm_load_controller.py  in-guest load controller. set_phase()
                               dispatches the per-phase shell command
                               over the serial connection.
  tools/run_real_vm_demo.py    ties it all together: boot VM, wait for
                               cloud-init runcmd, log in, run the
                               EpisodeRunner with on_phase=controller,
                               shut down VM.
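The sentinel trick in tools/vm_serial.py can be illustrated with a small parser. This is a sketch under assumptions: the marker strings, the `extract_output` name, and the regex are invented here; the one property taken from the description is that a marker only counts when it follows a \r\n, which the TTY echo of the typed command does not provide.

```python
import re
from typing import Optional


def extract_output(raw: str, start: str, end: str) -> Optional[str]:
    """Return the text between the sentinels, or None if not complete yet.

    A marker is accepted only right after \\r\\n, so the echoed command
    line (where markers follow `echo `) never matches.
    """
    pattern = re.compile(
        r"\r\n" + re.escape(start) + r"(.*?)\r\n" + re.escape(end),
        re.DOTALL,
    )
    match = pattern.search(raw)
    if match is None:
        return None
    body = match.group(1)
    # Drop the line break that follows the start marker, if any.
    return body[2:] if body.startswith("\r\n") else body
```

For instance, with a capture of `echo __S__; uname; echo __E__` followed by the real `__S__`, `Linux`, `__E__` lines, only the second pair of markers sits on a line boundary, so the echoed copy is skipped.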
Deps: paramiko, pycdlib added.
docs/sources.md updated with Alpine cloud image (sha512 pinned), and
the new Python deps.
README leads with the tier-2 plot now (real VM, real workload). The
previous synthetic plot is moved below with explicit "host-side mimic,
not a VM" labelling. Tier-2 status flipped to ✅ in the tier table.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>