- multi_model_metrics: publish gbt / mlp / cnn / knn_semi /
gru / lstm / bert (knn handled by knn streamer); read both
*_train.json and *_eval.json with macro_f1.point fallback
- dashboard.css: add palette gradients for the four
non-canonical names so the bars render with a fill colour
- dashboard.js: open the bar's visible scale to the full 0–1
range so honest-low cross-host F1s show as a bar instead of
clamping to 0%
- ship lambda-live-detection-loop.py + dashboard request docs
(scenes 7/8/12, sticky cache, lambda-inference-demo)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The scene-9 embedding handler appends to a `points` array without
ever capping. The producer republishes its (stable, deterministic)
point set on a cycle so reconnecting browsers eventually see the
scatter; each cycle pushes the same N points again and the in-memory
count grows without bound. Browser slows after ~10 min.
Two complementary fixes proposed:
A. FIFO cap (1-line change in the handler — fixes the leak today)
B. embedding_batch event with replace=true (cleaner, pairs with
the snapshot/sticky-cache request for refresh-time hydration)
Producer side has already reduced cadence as a band-aid (200 pts
every 30 s, was 600 every 5 s) — 18x slower accumulation but still
unbounded.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Producer-side knn fit is saved at data/processed/knn_v1.parquet
(150k rows, 3.4 MB). Live streamer publishes 2000-point cycles every
~2 s, but per PRODUCERS.md §reconnect-gotcha live events aren't
replayed; refresh-to-data is currently bounded by cycle time.
Three options laid out for the dashboard chat to pick:
A. Sticky cache (per-event-type ring buffer in the broadcaster)
B. Feeder reading the parquet → broadcaster.state["embedding_cache"]
C. Caddy fileserver + JS fetch on load
Whichever option lands, the producer side will adapt (e.g., dump a
JSON sidecar if Option C is picked). Path ownership preserved —
dashboard owns dashboard/, producer owns producers/.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LogBERT-style self-supervised Transformer pretrain on `clean`-only
windows, plus Integrated Gradients attribution for any tensor model.
Both directly answer the assignment's §8 'next steps in unsupervised
learning' requirement and Natsos & Symeonidis 2025's RQ3 on
explainability.
Pretrain (training/models/transformer_ssl.py +
trainer/run_ssl.py):
- Masked Timestep Reconstruction (MTR) — random 15% of timesteps
zeroed, encoder + per-channel head reconstructs from the rest.
Loss: MSE over masked positions.
- Volume of Hypersphere Minimization (VHM, Deep SVDD-style) — pull
learnable [DIST] token embedding toward a frozen center vector
initialized as the mean over clean train. Loss: ||h_dist - c||^2.
- Calibrated anomaly threshold at user-configurable target FPR
(default 5%) on clean-val distance distribution.
- Trained ONLY on `clean`-phase windows; the model never sees a
labeled malware sample yet flags any window that doesn't look
clean — including novel malware the supervised classifier never
saw. Uses the same schema-hashed checkpoint format as the
supervised models so loaders refuse mismatched feature schemas.
XAI (training/xai/integrated_gradients.py):
- Per-(channel, timestep) attribution via path-integrated gradients
over Riemann-mid-point steps. Works for cnn/gru/lstm/transformer/
transformer_ssl.
- Per-phase mean |IG| heatmaps under reports/xai/<model>/<phase>.png,
top-k channel importance per phase as JSON. Smoke-verified on the
trained CNN: top channel for `clean` is guest.cpu_iowait (sensible
— clean = idle = high iowait).
Project brief and slide planner:
- docs/project_brief.md — full draft of the assignment's required
sections 1–9 (problem, research question, ML task type with
justification, six supervised algorithms with assumptions, dataset
description with full validation breakdown, evaluation metrics with
rationale, current progress, lit review with 11 APA citations,
next steps for unsupervised, references).
- docs/slide_planner.md — all 16 slides filled with content tied to
specific files and metrics from this codebase, not generic
placeholders.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the Tier-3 exploit driver — an MSFExploitDriver that plugs into
EpisodeRunner.on_phase, fires a Metasploit module against a target VM
via msfrpcd, watches for the resulting session, and stamps each
transition (exploit_fire, session_open, session_landing_probe,
sample_executed, session_dormant, session_killed) into the episode's
events.jsonl on the orchestrator's monotonic clock.
What landed:
- exploits/msfrpc.py — minimal msgpack-over-HTTPS client (auth,
module.execute, job/session lifecycle) so we don't depend on a
third-party MSF wrapper.
- exploits/driver.py — phase-to-msfrpc adapter; idempotent fire,
session-open polling with timeout, workload start/stop, teardown.
- exploits/modules.py + exploits/modules/vsftpd_234_backdoor.toml —
TOML module configs with {{ target_ip }} placeholders, replacing the
imperative .rc-script approach the README previously hinted at.
- vm/launch_target.sh — SLIRP+restrict=on launcher for the
intentionally-vulnerable target VM (host can reach guest via
hostfwd, guest cannot reach host or internet).
- tools/run_tier3_demo.py — end-to-end runner mirroring run_real_vm_demo.
- tests/test_exploits.py — 12 new tests against a fake MSFRpcClient,
including an integration test that drives a real EpisodeRunner.
Plumbing changes:
- EpisodeRunner._emit_event → public emit_event, so external drivers
share the runner's monotonic clock and events.jsonl.
- mkdir for episode_dir moved to __init__ so emit_event is callable
before run() (driver_setup fires pre-schedule).
Status: driver + tests pass (40/40); end-to-end against a live msfrpcd
+ Metasploitable2 image is the next bring-up step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end now drives a real KVM guest through the full XMRig-shaped
phase schedule with the workload running INSIDE the guest. Telemetry is
host-side /proc/<qemu_pid>; the load is busybox `yes` (sustained CPU
saturation) and `dd if=/dev/urandom` (disk burst on infecting), driven
over the serial console at every phase transition. The plotted envelope
shows clean idle → armed → infecting (disk spike) → infected_running
(100% CPU plateau) → dormant → re-entry → final clean.
Components:
vm/launch_demo.sh now boots Alpine 3.21 nocloud-cloudinit
(Cirros 0.6.x's cirros-init blocks on the
EC2 metadata service for ~17 min before
falling through to NoCloud — abandoned).
Mounts a cidata ISO as a second drive.
tools/build_cidata.py pure-Python NoCloud ISO builder (pycdlib).
Sets root password and ssh_pwauth via
runcmd so we don't depend on a specific
cloud-init version's plain_text_passwd
handling.
tools/vm_serial.py serial-console client (stdlib socket).
Idempotent login (detects already-in-shell
state), sentinel-bracketed run() that
distinguishes shell output from the TTY
echo of input by requiring a leading
\r\n boundary on the marker.
tools/vm_load_controller.py in-guest load controller. set_phase()
dispatches the per-phase shell command
over the serial connection.
tools/run_real_vm_demo.py ties it all together: boot VM, wait for
cloud-init runcmd, log in, run the
EpisodeRunner with on_phase=controller,
shut down VM.
Deps: paramiko, pycdlib added.
docs/sources.md updated with Alpine cloud image (sha512 pinned), and
the new Python deps.
README leads with the tier-2 plot now (real VM, real workload). The
previous synthetic plot is moved below with explicit "host-side mimic,
not a VM" labelling. Tier-2 status flipped to ✅ in the tier table.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The README now leads with a 'What an episode looks like' section that
shows both:
* docs/images/synthetic-envelope.png — pipeline-validation plot. Real
telemetry of a real process whose load is shaped by tools/load_mimic.py
(Python). Explicitly labelled NOT REAL MALWARE in the caption — the
earlier wording was unclear.
* docs/images/real-vm-idle.png — real Cirros 0.6.3 booted under KVM,
same orchestrator + /proc collector pointed at the qemu-system pid.
Idle baseline; no exploit, no payload yet.
A 'What's still missing for the real-malware envelope' table makes the
tier path explicit (real VM idle → real workload in-guest → real exploit
fire → real sample).
Repository nav, deploy steps, design rationale, and threat model are
moved into <details>...</details> blocks so first-time visitors see the
demo plots and the status list without scrolling past wall-of-text.
Stale Pi-as-deployment-target wording in the design-rationale section
is fixed alongside.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vm/launch_demo.sh boots a Cirros qcow2 under KVM with QMP and a monitor
socket exposed; snapshot=on routes guest writes to a temporary overlay
so the on-disk image is never mutated (clean factory reset every boot).
End-to-end verified: vm/launch_demo.sh → orchestrator with --target-pid
<qemu pid> → 201 telemetry rows over 20s against the real qemu-system
process. The plotted envelope shows the expected idle-VM shape:
periodic ~10% CPU spikes from KVM/timer interrupts, flat 230 MiB RSS,
and a single late-boot disk write. Distinct from the synthetic
load_mimic envelope, confirming the collector reads real KVM behavior.
docs/sources.md is the works-cited doc — every tool, library, sample
source, paper, and standard the project leans on, grouped by category.
README's nav table now points at it. README's status section also lists
what's done vs. in progress so reviewers can see scope at a glance.
Note: vm/images/ stays gitignored. The Cirros 0.6.3 image is documented
with its sha256 (7d6355852aeb...) in docs/sources.md so any team member
can reproduce the bytes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end pipeline now produces a labeled envelope from a single command.
Drives the orchestrator through an 8-phase XMRig-shaped schedule and
renders a 3-panel envelope (CPU%, RSS, IO write rate) with phase bands
sourced from labels.jsonl. Real telemetry, simulated load — validates the
collection + labeling shape before a real VM is involved.
Components:
- tools/load_mimic.py phase-driven load generator. Reads phase
commands on stdin; CPU/IO behavior matches
the named phase (clean=idle, armed=light burst,
infecting=disk burst+CPU, infected_running=
CPU saturation+stratum-shaped writes,
dormant=quieter than clean).
- tools/run_envelope_demo.py spawns load_mimic, drives EpisodeRunner with
a default 85s schedule that includes the
classic infected_running → dormant → re-entry
pattern.
- tools/plot_envelope.py reads telemetry + labels from an episode dir,
writes envelope.png with colored phase bands.
orchestrator: EpisodeRunner now takes an optional phase_schedule and an
on_phase callback. Walks the schedule emitting one label per transition.
Backwards-compatible — existing single-phase tests still green.
Doc fix (user pushback): README + architecture + threat-model no longer
imply the Pi5 is the deployment target. Pi5's actual role here is the
WireGuard-side collector for episode tarballs. Deployment target is
generic ("constrained Linux device"). The "gateway observer" concept
remains a deployment pattern, decoupled from the Pi5's collector role.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>