- multi_model_metrics: publish gbt / mlp / cnn / knn_semi /
gru / lstm / bert (knn handled by knn streamer); read both
*_train.json and *_eval.json with macro_f1.point fallback
- dashboard.css: add palette gradients for the four
non-canonical names so the bars render with a fill colour
- dashboard.js: open the bar's visible scale to the full 0–1
range so honest-low cross-host F1s show as a bar instead of
clamping to 0%
- ship lambda-live-detection-loop.py + dashboard request docs
(scenes 7/8/12, sticky cache, lambda-inference-demo)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pi-safe replacement for the original metrics.py + perf.py producers
which load every checkpoint into memory and score the test set on each
cycle. That pattern crashed the Pi during this project (300 MB knn
pickles × 6 variants + 226 MB test set in memory at peak ≈ OOM).
The new producer:
- reads reports/eval/<model>_<mode>_train.json files (already
contain the test_macro_f1 each trainer wrote)
- publishes one model_metric event per file
- publishes one model_perf event per file with a hardcoded
per-architecture latency estimate in µs (gbt 250, knn 3500, mlp 50,
cnn 500, gru 1500, lstm 2000, transformer 800, transformer_ssl
1000). These are family-level order-of-magnitude figures; proper
benchmarks need to run on the deployment hardware (which is the
A100, not the Pi).
- re-publishes on a tick (default 30 s) for refresh-resilience.
- NO model loading. Pi-safe.
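
A minimal sketch of the report-only producer described above. The event field names (`type`, `model`, `mode`, `macro_f1`, `latency_us`) and the function name `events_from_reports` are hypothetical; only the `*_train.json` naming, the `test_macro_f1` key, and the latency figures come from this message.

```python
import json
from pathlib import Path

# Family-level latency estimates in microseconds, as listed in the commit text.
LATENCY_US = {
    "gbt": 250, "knn": 3500, "mlp": 50, "cnn": 500,
    "gru": 1500, "lstm": 2000, "transformer": 800, "transformer_ssl": 1000,
}


def events_from_reports(report_dir):
    """Build metric/perf events from <model>_<mode>_train.json files.

    No checkpoint is opened; only the small JSON reports are read.
    """
    events = []
    for path in sorted(Path(report_dir).glob("*_train.json")):
        stem = path.name[: -len("_train.json")]        # "<model>_<mode>"
        model, _, mode = stem.rpartition("_")          # mode is the last token
        report = json.loads(path.read_text())
        f1 = report.get("test_macro_f1")
        if f1 is None:
            continue                                   # no score written yet
        events.append({"type": "model_metric", "model": model,
                       "mode": mode, "macro_f1": f1})
        if model in LATENCY_US:                        # assumed: skip unknown families
            events.append({"type": "model_perf", "model": model,
                           "mode": mode, "latency_us": LATENCY_US[model]})
    return events
```

A caller would invoke `events_from_reports("reports/eval")` on each tick and hand the result to whatever publisher the dashboard uses.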
scripts/rsync-from-lambda.sh — pulls Lambda's artifacts/ + reports/eval/
to the Pi every 30 s. As Lambda finishes each model and writes its
train.json, the Pi sees the new file within a cycle and the publisher
broadcasts the metric on its next tick. Live multi-model dashboard
during training, with no Pi-side inference.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The eval suite at the end of the bootstrap was using ../artifacts and
../data/* paths because it was originally invoked from inside repo/.
Now that we no longer cd into repo, drop the ../ prefix. Same root
cause as the previous commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous version did `(cd repo && "${cmd[@]}")` to "cd into repo
for module imports." But PYTHONPATH was already set to $PWD/repo at
the top of the script — so the cd was redundant for imports AND
broke relative paths: the trainer expects to find
data/processed/validation_v1.parquet from $HOME/cis490, not from
$HOME/cis490/repo/.
Symptom: every training job failed immediately with
FileNotFoundError: data/processed/validation_v1.parquet
Drop the cd; PYTHONPATH already does the import work.
Found while running on the A100 today; trainer relaunched manually
in-place via a stand-in bootstrap2.sh; this commit makes the next
bundle clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pi has 4 cores; only KNN and tree-based models are realistic to train
here without GPU. While Lambda runs the full 16-job manifest in
parallel (~1.7h), this chain trains the CPU-friendly subset on the
Pi (~30 min) so scenes 8 & 12 populate with multi-model numbers
within minutes instead of waiting on Lambda's full cycle.
Order: gbt-realistic, knn-realistic, knn-oracle, knn_semi-realistic,
knn_semi-oracle. Skips models whose .ckpt.json already exists
(idempotent restart). Each is a subprocess of training/trainer/run.py
so XGBoost/numpy/sklearn don't fight each other for cores.
Caller is expected to start gbt-oracle separately (it's the longest
single training and we kicked it off before invoking this script).
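
The skip-if-checkpoint-exists chain above can be sketched as follows. The checkpoint filename pattern `<model>_<mode>.ckpt.json` and the injected `run_job` callable are assumptions for illustration; the real script spawns training/trainer/run.py as a subprocess, whose flags are not shown here.

```python
from pathlib import Path

# Training order taken from the commit text.
ORDER = [
    ("gbt", "realistic"),
    ("knn", "realistic"),
    ("knn", "oracle"),
    ("knn_semi", "realistic"),
    ("knn_semi", "oracle"),
]


def pending_jobs(artifacts_dir, order=ORDER):
    """Return the jobs whose checkpoint file is not on disk yet."""
    root = Path(artifacts_dir)
    return [
        (model, mode)
        for model, mode in order
        if not (root / f"{model}_{mode}.ckpt.json").exists()  # assumed naming
    ]


def run_chain(artifacts_dir, run_job):
    """Run pending jobs one at a time, in order; safe to restart."""
    for model, mode in pending_jobs(artifacts_dir):
        run_job(model, mode)  # e.g. a wrapper around subprocess.run(...)
```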
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
At our model sizes (max ~250 K params, max batch 512), each training
process uses ~1 GiB VRAM. A 40 GiB A100 is far from contention with
two concurrent jobs. Bounded-concurrency rolling launcher cuts
sequential ~3.5 h → parallel ~1.7 h for the full 14-job manifest.
PARALLEL=2 (default) — override via env var if running on a smaller GPU
or testing the queue logic.
Per-job logs still land at logs/<model>_<mode>.log; failure reporting
is the same. Idempotency is unchanged: jobs whose checkpoints already
exist are skipped.
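
A rough sketch of a bounded-concurrency launcher in the spirit of the one described above. It uses a thread pool rather than the shell job control the bootstrap script presumably uses, and `run_rolling` / `run_job` are hypothetical names; only the `PARALLEL` variable and its default of 2 come from this message.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_rolling(jobs, run_job, parallel=None):
    """Run jobs with at most `parallel` in flight; return the failed ones.

    `run_job(job)` should return True on success. A new job starts as soon
    as any running job finishes, so the pool stays full until the queue
    drains.
    """
    if parallel is None:
        parallel = int(os.environ.get("PARALLEL", "2"))
    failed = []
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = {pool.submit(run_job, job): job for job in jobs}
        for future in as_completed(futures):
            job = futures[future]
            try:
                ok = future.result()
            except Exception:
                ok = False          # a crash counts as a failed job
            if not ok:
                failed.append(job)
    return sorted(failed)
```

Each `run_job` would typically block on a training subprocess, so threads are enough here; the GPU work happens in the child processes.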
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
External-GPU path for the time-pressured first round, before the
Windows desktop joins the WG fleet. Lambda is treated as an "external
worker" whose output lands in the same /var/lib/cis490/models/ tree
the receiver-coordinated fleet uses, so cis490-jobs status reflects
Lambda runs identically to fleet runs.
Three scripts + one ingest tool:
scripts/build-lambda-bundle.sh
Tarball at /tmp/cis490-lambda/lambda-bundle-<short>.tar.zst with:
- the repo (sans .git, sans data/, sans artifacts*)
- data/processed/{validation_v1,features_window_v1}.parquet
- data/processed/feature_schema_v1.json
- data/processed/tensor_window_v1/ (npz shards)
- bootstrap.sh (entrypoint)
- training_manifest.toml (the canonical job list)
- BUNDLE_MANIFEST.json (commit hash + counts + build stamp)
Verifies all four data inputs exist BEFORE compressing 5+ GB.
scripts/run-on-lambda.sh ubuntu@<ip>
rsync bundle up → ssh + run bootstrap → rsync artifacts +
reports/eval back to artifacts-lambda/ + reports/lambda/.
Resumable rsync; sha256-verified.
scripts/lambda-bootstrap.sh (runs ON the Lambda instance)
Creates .venv with cu121 torch + xgboost + the [training] deps,
iterates the manifest's job list in priority order (highest first),
runs trainer/run.py (or run_ssl.py for transformer_ssl) per job,
skips jobs whose .ckpt.json already exists (idempotent on re-run),
writes per-job logs/<model>_<mode>.log, runs eval suite at the end,
stamps artifacts/RUN_SUMMARY.json with counts + failed-job list.
tools/ingest_lambda_artifacts.py
Bundles each (ckpt.json + sidecar + train.json) trio into a
.tar.zst, sha256, PUTs to the local trainer-receiver's
/v1/model/{job_id}, marks the job complete. Maps (model, mode) →
job_id by re-reading the canonical manifest. Handles the queue
state churn (requeue if completed, claim if pending, fail-back
on race losses).
End-to-end smoke verified on the A100 instance just provisioned:
- SSH from Pi via ed25519 keypair (cis490-trainer-pi)
- GPU: A100-SXM4-40GB, driver 580.105.08
- venv warmed: torch 2.5.1+cu121, xgboost 3.2.0
- 464 GB ephemeral disk available
Pi-side feature build (build_features.py + build_tensors.py against
all 72,952 accepted+degraded episodes) is in progress; bundle build
gates on its completion. Estimated wall-clock for the full Lambda
training run on A100: ~2.5 hours for 12 supervised + 2 SSL models +
eval suite.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symmetric companion to the collection fleet (orchestrator/fleet.py)
but for *training*. Collection is embarrassingly parallel; training
is not (a model is trained at most once across the fleet), so the
receiver coordinates which worker gets which job.
Operator-control surface is etc/training_manifest.toml.example —
single canonical file declaring (a) per-host capability + per-model
allow/deny policy, (b) one [[jobs]] entry per (model, mode, hyper)
with capability constraints (require_cuda, prefer_cuda, min_vram_gib,
min_ram_gib, allowed_hosts).
Components:
capability.py — self-detection: hostname, cores, RAM, CUDA presence,
VRAM, torch version, git commit. Used by workers to filter
eligible jobs before claiming.
manifest.py — TOML loader + JobSpec/HostSpec. Job IDs are stable
sha256 of (model, mode, hyper, split_recipe, train_hosts, seed)
so manifest reload is idempotent: existing rows keep their status,
new jobs become claimable, removed jobs stay until cancelled.
queue.py — SQLite job queue (training_jobs.db) with statuses
pending|claimed|running|completed|failed|cancelled. Atomic
claim_next via single UPDATE WHERE status='pending'. Heartbeat,
complete, fail. Stale-claim sweep (stale_after_s=600s) with
max_attempts cutoff to failed.
store.py — model artifact store mirroring receiver/store.py.
Artifact ID is the sha256 of the uploaded tarball; bit-identical
re-runs deduplicate.
receiver.py — Starlette app exposing 11 endpoints:
POST /v1/job/claim (worker)
POST /v1/job/{id}/heartbeat (worker)
POST /v1/job/{id}/complete (worker)
POST /v1/job/{id}/fail (worker)
PUT /v1/model/{id} (worker — uploads tarball)
GET /v1/jobs (anyone)
GET /v1/workers (anyone)
POST /v1/job/{id}/cancel (operator: X-Operator-Token)
POST /v1/job/{id}/requeue (operator)
POST /v1/manifest/reload (operator)
GET /v1/health (anyone)
Runs as cis490-trainer-receiver.service on the Pi alongside the
existing receiver, on a separate port.
client.py — stdlib HTTP client (urllib only, no new deps).
worker.py — long-running daemon. Loop: detect capability → claim →
spawn training/trainer/run.py subprocess → heartbeat every 30s →
tar artifact, sha256, PUT /v1/model → complete. SIGTERM-safe.
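
One way the single-UPDATE claim in queue.py could look, as a sketch only. The table layout, the `claim_token` column, and the function names are assumptions; the real schema in training_jobs.db is not shown in this message.

```python
import sqlite3
import uuid


def open_queue(path=":memory:"):
    # isolation_level=None puts the connection in autocommit mode, so each
    # statement below is its own atomic transaction.
    db = sqlite3.connect(path, isolation_level=None)
    db.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id TEXT PRIMARY KEY,
               priority INTEGER NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',
               worker TEXT,
               claim_token TEXT)"""
    )
    return db


def claim_next(db, worker):
    """Claim the highest-priority pending job, or return None.

    The UPDATE both selects and flips the row, so two workers cannot claim
    the same job. The random token lets us find the row we just claimed
    without relying on RETURNING (which needs a newer SQLite).
    """
    token = uuid.uuid4().hex
    cur = db.execute(
        """UPDATE jobs
              SET status = 'claimed', worker = ?, claim_token = ?
            WHERE status = 'pending'
              AND id = (SELECT id FROM jobs
                         WHERE status = 'pending'
                         ORDER BY priority DESC, id
                         LIMIT 1)""",
        (worker, token),
    )
    if cur.rowcount == 0:
        return None
    row = db.execute(
        "SELECT id FROM jobs WHERE claim_token = ?", (token,)
    ).fetchone()
    return row[0]
```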
Operator CLI (tools/cis490_jobs.py): status / list / show / cancel /
requeue / reload / workers. Cancel and requeue require
$CIS490_OPERATOR_TOKEN matching the receiver's configured value.
Bootstrap: scripts/install-training-worker.sh (Linux systemd) and
scripts/install-training-worker-windows.ps1 (Windows Scheduled Task)
let the operator enroll a new host with one command after cloning
the repo and setting up the venv. Worker self-tests capability
before registering.
End-to-end smoke verified on the Pi: receiver up, manifest synced,
14 jobs queued, worker registered, claimed 4 CPU-eligible jobs
(allow_jobs=["gbt","mlp"]), completed 3 (gbt-realistic, gbt-oracle,
mlp-oracle), 1 failed with the actual error visible via
cis490-jobs status, 3 artifacts uploaded to
/var/lib/cis490/models/<model>_<mode>/<sha256>/bundle.tar.zst with
proper index.jsonl row.
21 unit tests (manifest validation: 8; queue lifecycle + eligibility:
13). All pass alongside the prior 17 training tests = 38 green.
Open limitations surfaced inline:
- Hyper-key drift between manifest and run.py fails at training
time, not at manifest reload (worth tightening to argparse
introspection later).
- mTLS not yet wired through Caddy for the trainer-receiver port —
listens loopback-only until that lands.
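
For reference, the stable job ID that manifest.py derives from (model, mode, hyper, split_recipe, train_hosts, seed) could be computed along these lines. The canonical-JSON encoding and the sorting of `train_hosts` are assumptions; the actual serialization may differ.

```python
import hashlib
import json


def job_id(model, mode, hyper, split_recipe, train_hosts, seed):
    """Return a sha256 hex digest that is stable across manifest reloads."""
    payload = {
        "model": model,
        "mode": mode,
        "hyper": hyper,
        "split_recipe": split_recipe,
        "train_hosts": sorted(train_hosts),  # assumed: host order is irrelevant
        "seed": seed,
    }
    # sort_keys + fixed separators give one byte string per logical job.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Because the digest depends only on the job's declared inputs, reloading an unchanged manifest maps every entry onto its existing row.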
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The model layer of the project, built honestly:
- tools/dataset_validate.py — full-sweep validator over the receiver
store (sha256, schema, monotonic labels, telemetry-row gate). On the
current corpus: 64,798 accepted + 8,154 degraded + 3,701 rejected +
7 errored across 76,660 shipped episodes. data/processed/validation_v1.parquet
is committed as the per-episode acceptance index.
- training/_features.py — channel registry (46 channels across
proc/guest/qmp/netflow), summary-stat windowing AND channel×time
tensor extraction at 10s/5s windowing. Time alignment uses t_wall_ns
(Unix ns) — tested fix for a real netflow-vs-host clock-base
inconsistency that was silently dropping every netflow channel.
- training/_split.py — three held-out recipes (host / sample / time)
with profile-stratification assertions. held_out_host carries
untested_profiles for cases like scan-and-dial absent from the test
host (5 of 6 profiles tested cross-device, never silently averaged).
- training/models/ — 6 architectures behind a common BaseModel
interface: gbt (XGBoost), mlp, cnn, gru, lstm, transformer. Each
trained twice (realistic / oracle) per the deployment threat model.
Schema-hashed checkpoints refuse to load if _features.py changed
since training (silent-input-drift protection, tested).
- training/trainer/ — unified training loop: class-weighted CE, LR
warmup + cosine, gradient clipping, mixed precision when CUDA,
early stopping on val macro F1, best-on-val checkpoint. Same loop
runs MLP/CNN/GRU/LSTM/Transformer; GBT uses XGBoost
early_stopping_rounds on val mlogloss.
- training/eval_/ — bootstrap 95% CIs on macro F1, per-class F1,
per-profile and per-host breakdown, paired-bootstrap significance
for model-vs-model gap. Confusion matrix uses union of seen labels.
- training/dashboard/producers/ — replay/metrics/perf/profiles
emitting the six event types the dashboard's awaiting scenes
consume; on-demand tensor extraction so the Pi can run live
inference without 65 GB of shards.
- 17 unit tests (split coverage, features round-trip, schema mismatch,
determinism, time-base alignment regression).
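
The schema-hashed checkpoint guard mentioned above might look roughly like this. What exactly gets hashed (here: channel names plus window and stride) and all names in the snippet are assumptions, not the code in training/models/.

```python
import hashlib
import json


class SchemaMismatchError(RuntimeError):
    """Raised when a checkpoint was trained against a different feature schema."""


def schema_hash(channels, window_s, stride_s):
    """Hash the inputs that define the feature layout."""
    blob = json.dumps(
        {"channels": list(channels), "window_s": window_s, "stride_s": stride_s},
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def save_checkpoint(weights, current_hash):
    """Stamp the schema hash next to the weights."""
    return {"schema_hash": current_hash, "weights": weights}


def load_checkpoint(ckpt, current_hash):
    """Refuse to load if the feature schema changed since training."""
    if ckpt.get("schema_hash") != current_hash:
        raise SchemaMismatchError(
            "checkpoint schema %r does not match current %r"
            % (ckpt.get("schema_hash"), current_hash)
        )
    return ckpt["weights"]
```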
End-to-end smoke-trained all six on a 567-episode subset; held-out
test macro F1 reported with paired-bootstrap significance. The
methodology now reports honest cross-device generalization, not
in-distribution validation.
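
As an illustration of the bootstrap 95% CI on macro F1 reported by training/eval_/, here is a stdlib-only percentile bootstrap. It is a sketch, not the project's implementation: the resample count, the percentile method, and the function names are assumptions, and the paired model-vs-model test is not shown.

```python
import random


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the union of seen labels."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p == lab)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lab and p == lab)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p != lab)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Return (point, lo, hi) using a percentile bootstrap over samples."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        stats.append(macro_f1([y_true[i] for i in idx],
                              [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return macro_f1(y_true, y_pred), lo, hi
```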
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug 14 (vm/launch_target.sh): Metasploitable2 requires -machine pc
(i440fx), -cpu kvm32, -drive if=ide, and -device e1000. The previous
config (-machine q35, -cpu host, -drive if=virtio, virtio-net-pci)
caused a kernel panic at boot because /dev/vda != the grub root=/dev/sda1.
Services never started; the b'' probe fix (Bug 10) then correctly waited
out the full timeout with no result.
Bug 15 (scripts/install-tier-3-4.sh): verify step used vsftpd_234_backdoor,
which is marked requires_bridge=true and has a hardcoded port-6200 backdoor.
Changed to distccd_command_exec with TARGET_PORTS="5632:3632,4444:4444".
manifest.toml: admit distccd_command_exec and unreal_ircd_3281_backdoor
to the module catalog. Both use cmd/unix/bind_perl (bind shell, no guest
egress, SLIRP-safe). distccd returns a valid protocol response so MSF's
handler runs and session_open fires. Verified against Metasploitable2
sourceforge image sha256 a8c019c3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The experiment is now defined by a single version-pinned file —
manifest.toml at the repo root. PIPELINE.md §4.1 / §13 / §16. Every
lab host loads THIS exact file; per-host overrides of experiment
shape are forbidden.
Drops the following per-host CLI overrides that previously violated
the canonical-manifest principle:
* --manifest, --modules-dir (paths now derived)
* --ram-per-vm-mib (in manifest.experiment)
* --max-concurrent (manifest.experiment.fleet.max_concurrent_ceiling)
* --max-tier3-slots (manifest.experiment.fleet.max_tier3_slots)
* --force-tier2 (not a §14 sanctioned override knob —
ship empty catalog to disable Tier-3)
* --require-real-samples (sample-side concern; out of fleet scope)
* tools/run_*_demo.py --manifest (samples path now from canonical)
New surface:
* manifest.toml — the single source of truth
* orchestrator/manifest.py — load_canonical() + Manifest dataclass
with strict validation, raises
ManifestError on any failure
* EpisodeConfig.experiment_meta — populated by run_*_demo.py from
the canonical manifest; stamped
into every episode's meta.json
under "experiment" key for
provenance
* cis490-orchestrator.service — RestartPreventExitStatus=78 so
manifest-load failures stay
stuck-and-loud (§9, §4.7)
* install-lab-host.sh — validates manifest.toml at
install time; missing or invalid
= die with clear message
Catalog admission semantics: only modules whose name appears in
manifest.catalog get loaded into the runtime catalog (§4.3 in
miniature, will tighten further in step 4 when verified_against /
last_verified actually gate admission). Missing toml for an admitted
name is a sysadmin error → exit 78.
Renames cfg.manifest → cfg.samples + adds cfg.experiment to
disambiguate sample-manifest from experiment-manifest. Rewrites
test_fleet.py fixture to construct synthetic Manifest objects so
test outcomes don't depend on the on-disk manifest.toml content.
12 new tests in tests/test_manifest.py: schema-version mismatch,
unknown collector, duplicate collector, unknown phase, negative
phase seconds, negative ram, missing catalog fields, json round-trip.
Local run: `python tools/run_fleet.py --capacity` correctly logs the
loaded manifest and prints capacity. 241 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PIPELINE.md §4.4 requires every collector in the active set to actually
work end-to-end. On k-gamingcom (commit dac03d2 episode at 02:21Z) the
new perf_unavailable lifecycle event surfaced a concrete cause:
`reason: binary_not_on_path` — perf is enabled but the binary isn't
installed. Same story with tcpdump on k-gamingcom (pcap_unavailable
events with `error: tcpdump not found`).
The canonical install script is the right place to ensure the deps
are present. detect_os reads /etc/os-release; ensure_collector_packages
installs `perf` (Arch / RHEL) or `linux-perf` + `linux-tools-generic`
(Debian/Ubuntu) plus `tcpdump`. After the install attempt the script
re-checks `command -v` and dies loudly if either is still missing —
silent failure is forbidden per §1, so install failure has to be
observable.
Idempotent (`--needed` / equivalent skips already-installed packages).
Operator owns full system upgrades; this only does targeted package
install. On unknown distros the script logs a warning and dies on the
follow-up check, with a clear pointer to install perf/tcpdump by hand.
The next autoupdate tick on k-gamingcom should pull this and
self-install perf + tcpdump, after which rows_perf > 0 and pcap should
start producing bytes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two pieces of self-monitoring so the maintainer isn't the alarm:
(2) Receiver-side fleet health monitor
cis490-fleet-health.timer runs check_fleet_health.py every 5 min.
Detects three symptoms and writes them to
/var/lib/cis490/alerts.jsonl + a syslog WARNING (greppable / easy
to forward to a notifier):
silent — host shipped in last 24h but has been quiet >30 min
fatal-only — actively shipping but every PUT 4xx
unstamped — shipping without X-Cis490-Code-Commit header
Dedup is keyed on (host, symptom, hour-bucket) so a sustained fault
fires once per hour, not every 5 min. 15 unit tests cover the index
parser, three detectors, and dedup.
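
The (host, symptom, hour-bucket) dedup can be sketched as below. The in-memory `seen` set and the function names are illustrative assumptions; how check_fleet_health.py persists dedup state between timer runs is not described here.

```python
def alert_key(host, symptom, ts):
    """Bucket a Unix timestamp (seconds) into whole hours."""
    return (host, symptom, int(ts) // 3600)


def new_alerts(candidates, seen):
    """Filter (host, symptom, ts) tuples down to ones not yet alerted on.

    `seen` is mutated so that repeated calls within the same hour stay quiet.
    """
    fresh = []
    for host, symptom, ts in candidates:
        key = alert_key(host, symptom, ts)
        if key not in seen:
            seen.add(key)
            fresh.append((host, symptom, ts))
    return fresh
```

With a 5-minute timer, a fault that persists for an hour produces twelve candidates but only one emitted alert per bucket.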
(3) Per-host doctor snapshots
Lab hosts run cis490-doctor-check.timer once a day (10 min after
boot, then daily with 30-min jitter). The timer runs
cis490_doctor.py --json and PUTs the result to a new endpoint:
PUT /v1/host-health/<host> → /var/lib/cis490/host-health/<host>.json
GET /v1/host-health → aggregate across all hosts
Endpoint is NOT gated by version_gate — sick hosts running stale
code MUST still be able to report sickness. 11 unit tests cover
PUT/GET, atomic-write semantics, bearer auth, and the
not-gated-by-version-gate property.
ship_health_check.py reuses the existing shipper transport (mTLS +
bearer + receiver URL from lab-host.toml) so we don't reimplement
auth.
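
The atomic-write semantics covered by the host-health tests can be approximated with a temp file plus `os.replace`. This is a generic sketch with a hypothetical function name, not the receiver's actual handler.

```python
import json
import os
import tempfile


def atomic_write_json(path, payload):
    """Write JSON so readers see either the old file or the new one, never a partial."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for os.replace to be atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())      # push bytes to disk before the rename
        os.replace(tmp, path)          # atomic swap on POSIX
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)             # do not leave stray temp files behind
        raise
```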
Both timers wired into install-lab-host.sh — the loop also enables
the previously-added autoupdate + cert-fetch timers, so a single
install run gives a host all four self-healing mechanisms.
Tests: 293 pass (26 new — 15 fleet-health, 11 host-health). 2
pre-existing test_fleet.py failures from the elliott-ThinkPad
merge (667f042) are unrelated to this change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
k-gamingcom symptom (2026-05-02): the on-device agent successfully
finished Tier-3 bring-up, but the shipper sits in "waiting on mTLS
material" because the cert auto-fetch step in install-lab-host.sh
either ran with host_id still REPLACE_ME, or hit a transient
bootstrap.wg failure, and there's no automatic retry. The Pi-side
cert IS minted and the bootstrap endpoint serves it — the failure
mode is purely "lab-host hasn't pulled it down."
Fix: extract the cert-fetch logic into scripts/fetch-lab-host-cert.sh
(idempotent, no-op when certs are already on disk, no-op when host_id
is unset, exit-0 on transient network failure so the unit doesn't
get pinned as failed), and run it from a 5-minute systemd timer.
The timer handles all three "stuck waiting on mTLS" cases without
operator action:
- operator edited host_id post-install but didn't re-run install
- bootstrap.wg was briefly unreachable during install
- lab host was offline when install ran but came up later
The script `try-restart`s cis490-shipper after a successful fetch
so the daemon picks up the new cert immediately instead of waiting
for its lazy retry. install-lab-host.sh still calls the script
on install for fast first-time bring-up — the timer is the safety
net.
Tarball extract is staged through a temp dir + atomic rename so a
mid-extract crash never leaves us with a mismatched cert/key pair.
AGENTS.md row 4 updated: "waiting on mTLS material" remediation now
points at the timer, with the exact `systemctl start
cis490-cert-fetch.service` command to force an immediate retry.
Tests: 267/267 unchanged. The fetch script is idempotent + has all
its happy/error paths handled inline; a unit test would mostly be
testing systemd's behaviour. The integration test path is the timer
running on a real lab host, which is the actual production case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root causes and fixes documented in TIER3-BRINGUP.md. Summary:
1. BRIDGE env var leaked into Tier-3 subprocess → target VM used tap
instead of SLIRP; fix: env.pop("BRIDGE") in fleet _run_slot.
2. usable_modules filter conditioned on BRIDGE presence → bridge-requiring
modules selected on SLIRP runs; fix: always filter requires_bridge.
3. cmd/unix/interact creates no session.list entry → session_open_timeout
every episode; fix: switch samba_usermap_script to cmd/unix/bind_perl.
4. Per-slot LPORT hostfwd used wrong guest port (host:5444→guest:4444);
fix: extra_host_port:extra_host_port mapping so guest binds the
per-slot LPORT directly.
5. vsftpd backdoor port 6200 hardcoded → collision across concurrent slots;
fix: requires_bridge=true filters it from SLIRP fleet runs.
6. SLIRP false-positive in _wait_for_tcp → exploit fires before Samba
boots (~60 s too early); fix: replace TCP probe with serial console
_wait_for_serial_login that waits for actual "login:" prompt.
7. Stale QEMU survives orchestrator restart (start_new_session=True) →
holds hostfwd ports, new QEMU silently fails; fix: kill by pgid from
old pidfile before rmtree.
8. PORT_BASE default used privileged port 21; fix: default to 2021+slot*100.
9. msfrpcd 6.x returns bytes for all string values even with raw=False;
fix: MSFRpcClient._str() recursive decoder applied to all responses.
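
Item 9's recursive decoder could look something like this. It is a standalone sketch; the real `MSFRpcClient._str()` may differ in name handling and error policy, and `errors="replace"` is an assumption.

```python
def decode_bytes(obj, encoding="utf-8"):
    """Recursively turn bytes into str inside dicts, lists, and tuples."""
    if isinstance(obj, bytes):
        return obj.decode(encoding, errors="replace")
    if isinstance(obj, dict):
        # Keys can arrive as bytes too, so decode both sides.
        return {decode_bytes(k, encoding): decode_bytes(v, encoding)
                for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(decode_bytes(x, encoding) for x in obj)
    return obj  # ints, bools, None, and str pass through unchanged
```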
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Today's incident: post-cutover, k-gamingcom went silent and
elliott-thinkpad kept shipping pre-stamp episodes that the receiver
gate 400'd in a 2300+ PUT loop. Both required
`git pull && install-lab-host.sh` *on the host*; neither the on-device
AI agent nor the operator pulled in time, and from the receiver Pi I
cannot reach in (sshd off on the lab hosts).
Fix the recurrence directly: a 30-min systemd timer that does
git fetch + (if behind) ff-only pull + re-run install-lab-host.sh.
Hosts catch up on the next tick on their own — no human or agent
action required.
Mechanics:
- scripts/auto-update.sh runs as root, drops to cis490 for git ops
to satisfy /opt/cis490 ownership ("dubious ownership" guard).
- Refuses ff if local HEAD isn't an ancestor of origin/main —
protects operator hand-edits from silent overwrite.
- Network failures exit 0 (offline is normal, don't pin a unit
failure); divergence + install failures exit non-zero so the
journal records what broke.
- RandomizedDelaySec=10min on the timer prevents thundering-herd
when several hosts boot together.
- Hands off to install-lab-host.sh via exec — exactly one path
through bring-up; no special "auto" flow.
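
The decision logic in the mechanics above, written out as a sketch. The real scripts/auto-update.sh is a shell script; the function names, action labels, and the specific non-zero exit code (1) are assumptions. The git subcommands used in `probe` are standard.

```python
import subprocess


def decide(fetch_ok, behind, head_is_ancestor):
    """Map the observed repo state to (action, exit_code)."""
    if not fetch_ok:
        return ("skip-offline", 0)       # offline is normal; do not fail the unit
    if not behind:
        return ("up-to-date", 0)
    if not head_is_ancestor:
        return ("refuse-diverged", 1)    # protect local hand-edits; be loud
    return ("pull-and-install", 0)


def probe(repo):
    """Collect the three inputs for decide() from a checkout at `repo`."""
    def git(*args):
        return subprocess.run(["git", "-C", repo, *args],
                              capture_output=True, text=True)

    if git("fetch", "origin").returncode != 0:
        return decide(False, False, False)
    head = git("rev-parse", "HEAD").stdout.strip()
    remote = git("rev-parse", "origin/main").stdout.strip()
    # Exit status 0 means HEAD is an ancestor of origin/main (fast-forwardable).
    is_ancestor = git("merge-base", "--is-ancestor",
                      "HEAD", "origin/main").returncode == 0
    return decide(True, head != remote, is_ancestor)
```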
The version-gate provides the quality boundary, so even if
origin/main moves forward unsafely, the receiver's allow-list still
controls what lands in the index.
install-lab-host.sh enables cis490-autoupdate.timer on every run,
idempotent — existing hosts pick it up the next time they pull
manually.
Filed Forgejo #18 with the canonical command for elliott-thinkpad
+ k-gamingcom to bootstrap themselves out of the current incident
(auto-update doesn't help them retroactively — it has to be running
*before* the cutover to catch the next one).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why services weren't starting after the gate went live:
1. install-lab-host.sh self-copy. The receiver's 400 remediation tells
the agent to `cd /opt/cis490 && git pull && sudo
./scripts/install-lab-host.sh`. That makes REPO_ROOT==INSTALL_ROOT
and `cp -aT $REPO_ROOT $INSTALL_ROOT` errors with "are the same
file"; `set -e` aborts before the systemd units install or anything
restarts. Detect the same-dir case and skip the cp; chown still
runs.
2. Services never restart. install-lab-host.sh and install-tier-3-4.sh
both ended by *telling the operator* to restart, then exiting. The
running shipper/orchestrator kept executing pre-gate code from the
old module objects, so new `code_version` stamping never reached an
episode. Both scripts now `systemctl restart` the units they own
when those units are enabled.
3. Shipper queue fatal-loop. queue.py incremented `fatal++` but didn't
move the episode out of `data/episodes/`. Next scan re-tarred and
re-PUT the same dir, getting 400 again. With 4465+ pre-stamp
episodes on k-gamingcom this burned ~1 PUT/sec for 5+ hours of
receiver log. Fatal episodes now move to data/quarantine/<id>/ with
a quarantine_reason.json beside them; the outbox tarball is
deleted.
4. Pre-stamp backlog drain. tools/quarantine_unstamped.py is a
one-shot that scans data/episodes/ and quarantines anything without
a 40-char-hex code_version.commit. Wired into install-lab-host.sh
step 9 so a re-install drains the queue automatically. Idempotent;
safe to run while the shipper is active.
Tests cover the queue's new fatal-quarantine path and every drain
behaviour (kept/quarantined/dry-run/idempotent/missing-meta/collision).
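
A sketch of the quarantine move (item 3) and the unstamped drain (item 4). Function names, the reason payload, and writing quarantine_reason.json inside the moved directory are assumptions; only the data/quarantine/<id>/ destination and the 40-hex commit check come from this message.

```python
import json
import re
import shutil
from pathlib import Path

HEX40 = re.compile(r"[0-9a-f]{40}")


def is_stamped(meta):
    """True if meta carries a 40-char hex code_version.commit."""
    commit = (meta.get("code_version") or {}).get("commit")
    return isinstance(commit, str) and HEX40.fullmatch(commit) is not None


def quarantine(episode_dir, quarantine_root, reason):
    """Move an episode out of the scan path and record why."""
    src = Path(episode_dir)
    dest = Path(quarantine_root) / src.name
    if dest.exists():
        raise FileExistsError(dest)      # collision: never overwrite evidence
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    (dest / "quarantine_reason.json").write_text(json.dumps({"reason": reason}))
    return dest


def drain_unstamped(episodes_root, quarantine_root):
    """Quarantine every episode without a valid stamp; return moved names."""
    moved = []
    for ep in sorted(Path(episodes_root).iterdir()):  # list is built before moving
        if not ep.is_dir():
            continue
        try:
            meta = json.loads((ep / "meta.json").read_text())
        except (OSError, ValueError):
            meta = {}                    # missing or unreadable meta = unstamped
        if not isinstance(meta, dict):
            meta = {}
        if not is_stamped(meta):
            moved.append(quarantine(ep, quarantine_root, "unstamped").name)
    return moved
```

Running the drain a second time finds nothing left to move, which is what makes it safe to wire into a re-install.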
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes a real reproducibility gap. Three weeks of bug fixes have
shipped (probe fix in 2707709, multi-signal classifier in 321ea63,
mandatory tier-4 in 265f3ad, etc.); without a per-episode
code_version, trainers can't tell which episodes came from buggy
pre-fix code and have to scan every tarball to guess.
Resolution priority (cached across episodes):
1. $INSTALL_ROOT/VERSION (production — install-lab-host.sh writes
it at install time since /opt/cis490 is a flat copy with no .git)
2. git rev-parse HEAD from the repo root (dev clones)
3. {"commit": "unknown", "source": "unknown"} so the field is always
present (filterable)
Output shape, always present in meta.json:
"code_version": {
"commit": "<40-hex>" | "unknown",
"branch": "<name>" | null,
"dirty": bool | null,
"source": "VERSION-file" | "git" | "unknown"
}
install-lab-host.sh writes VERSION at install time with the source
repo's git rev-parse HEAD + branch + clean-tree flag + install
timestamp. Lab-host agents that pull main + re-run install-lab-host.sh
get a fresh stamp automatically.
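
The three-step resolution order can be sketched as follows. The VERSION file is assumed to be JSON with `commit`, `branch`, and `dirty` keys; its real on-disk format is not specified in this message. The per-process caching is omitted, and the git branch leaves `branch`/`dirty` as null for brevity.

```python
import json
import subprocess
from pathlib import Path

UNKNOWN = {"commit": "unknown", "branch": None, "dirty": None, "source": "unknown"}


def resolve_code_version(install_root, repo_root):
    """Return a code_version dict; the field is always present."""
    # 1. Production: flat install with a VERSION file (format assumed JSON).
    version_file = Path(install_root) / "VERSION"
    if version_file.is_file():
        try:
            data = json.loads(version_file.read_text())
            return {"commit": data["commit"], "branch": data.get("branch"),
                    "dirty": data.get("dirty"), "source": "VERSION-file"}
        except (ValueError, KeyError, TypeError, AttributeError):
            pass                                  # malformed file: fall through
    # 2. Dev clone: ask git directly.
    try:
        out = subprocess.run(["git", "-C", str(repo_root), "rev-parse", "HEAD"],
                             capture_output=True, text=True, check=True)
        return {"commit": out.stdout.strip(), "branch": None,
                "dirty": None, "source": "git"}
    except (OSError, subprocess.CalledProcessError):
        # 3. Nothing worked: still emit a filterable placeholder.
        return dict(UNKNOWN)
```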
148/148 tests pass; test_episode_against_self_pid_produces_full_directory
asserts the field's presence + valid `source` value.
Replaces MalwareBazaar with theZoo (https://github.com/ytisf/theZoo).
theZoo is a public security-research repo with hundreds of malware
samples organized by family, password-protected with the well-known
'infected'. No API key, no signup, nothing for an operator to do —
which is what zero-touch tier-4 actually means.
Changes:
- tools/auto_fetch_samples.py: rewrite. Clones theZoo (shallow, ~500 MB)
to /var/lib/cis490/theZoo on first run, then for each manifest
family without a sha256 it locates a matching Binaries/<Name>
dir, extracts the .zip with password 'infected', picks the largest
non-text payload as the binary, sha256s it, stages at
samples/store/<sha256>, and rewrites manifest.toml in place
(atomic tempfile + os.replace, stat preserved). Mandatory exit
semantic: non-zero if no real samples landed.
- scripts/install-tier-3-4.sh: dropped the MB-key resolution chain
(env var → local file → bootstrap.wg fetch). Now just runs
auto_fetch_samples.py and dies if zero samples land. SKIP_TIER4
remains as the explicit override but is documented as defeating
the project.
- bootstrap/app.py + __main__.py + etc/cis490-bootstrap.service:
removed the /v1/secret/<name> endpoint and the --secrets-root flag.
Dead code now that no API key needs distributing. Live-rolled
back on the Pi (404 verified post-restart, stale /etc/cis490/secrets
dir removed).
- scripts/set-malwarebazaar-key.sh: deleted. No MB key means no
one-time operator step.
- tests/test_bootstrap_secrets.py: deleted (route removed).
- AGENTS.md: rewrote tier-4 section to reflect zero-operator model.
148/148 tests pass. Bootstrap service rolled back live.
User: 'we don't want it to be optional, this real malware IS the data
we want.' Acknowledged. Three changes make Tier 4 actually mandatory
without forcing per-host operator action:
1. bootstrap.wg /v1/secret/<name> endpoint
- Pi serves /etc/cis490/secrets/malwarebazaar.token to lab hosts
over the same trust boundary as the cert endpoint (WG mesh,
iptmonads-gated). Strict allow-list — only `malwarebazaar`
resolves; everything else 404s. Secret returned as bare text
with Cache-Control: no-store. Live-verified on the Pi.
- tests/test_bootstrap_secrets.py covers four cases: 404 unprovisioned,
200 with token, 404 unknown name, 500 on empty file.
2. install-tier-3-4.sh: Tier 4 is no longer optional
- Resolves MB key in priority: env var → /opt/cis490/samples/.bazaar.token
→ https://bootstrap.wg/v1/secret/malwarebazaar.
- Caches the bootstrap-fetched key locally so re-runs are offline.
- If all three resolution paths fail, dies with the exact
remediation command for the operator (one-time set-malwarebazaar-key.sh
on the Pi).
- auto_fetch_samples.py is run unconditionally (SKIP_TIER4 still
works for emergency overrides but logs a warning that the host
will produce only mimics). Deploy fails if zero binaries land
in samples/store/ — no silent mimic-only fallback.
- SKIP_TIER4 documentation now says 'DEPRECATED; defeats the project'.
3. scripts/set-malwarebazaar-key.sh
- Pi-side helper: one operator command per fleet, ever. Accepts
key via env or stdin, validates length, drops at the right
path with the right perms. Lab hosts pull the rest automatically.
AGENTS.md: rewrote the Tier-4 section to reflect mandatory status +
the one-time-on-Pi distribution model.
152/152 tests pass. Bootstrap service updated live on the Pi.
Replaces the manual runbook with scripts that just work. install-lab-host.sh
now runs the full Tier-3 deploy automatically as its 8th step (after the
mTLS cert lands), and Tier-4 auto-fetches when MALWAREBAZAAR_API_KEY is set.
Changes:
- install-msfrpcd.sh: actually runs the Rapid7 omnibus installer when
metasploit-framework isn't present (was: bail with "install manually").
apt-get and dnf paths both go through the same omnibus script with
DEBIAN_FRONTEND=noninteractive. Idempotent.
- fetch-metasploitable2.sh: bakes in the SourceForge public-mirror URL
(https://downloads.sourceforge.net/project/metasploitable/...) so no
operator URL is required. sha256 is now optional and TOFU-pinned —
first run records the hash to OUT_DIR/metasploitable2.qcow2.sha256;
subsequent runs verify against that. Skips if qcow2 already present.
- scripts/install-tier-3-4.sh (new): orchestrates the four steps
(msfrpcd → metasploitable2 → bridge → tier-3 verify) plus optional
Tier-4 auto-fetch. Idempotent. SKIP_VERIFY / SKIP_BRIDGE / SKIP_TIER4
env knobs for partial deploys.
- tools/auto_fetch_samples.py (new): when MALWAREBAZAAR_API_KEY is set,
queries MB by each manifest entry's `family` (signature match), pulls
the first match via fetch_sample.py, and rewrites manifest.toml in
place (atomic tempfile + os.replace, preserving stat). Skips entries
that already have sha256.
- install-lab-host.sh: gains a step 8 that calls install-tier-3-4.sh
automatically when mTLS certs are on disk. --skip-tier3 flag for
operators who want Tier 2 only. Skipped silently before certs land
so first-pass install (host_id=REPLACE_ME) still works.
- AGENTS.md: rewrote the Tier-3 section to point at the one-shot
script. Removed the old multi-command runbook so on-device agents
can't accidentally follow stale steps.
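The TOFU pinning described for fetch-metasploitable2.sh can be sketched
in Python as follows. This is an illustration of the trust-on-first-use
idea only, not the script's code: the function names and the pin-file
handling are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file so a large image never has to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def tofu_verify(image: Path, pin: Path) -> bool:
    """Trust on first use: record the hash once, compare on later runs.

    Returns True when the image is trusted (first sighting, or it
    matches the recorded pin). Hypothetical helper, not the real script.
    """
    actual = sha256_of(image)
    if not pin.exists():
        pin.write_text(actual + "\n")  # first run: pin whatever we fetched
        return True
    return pin.read_text().strip() == actual
```

The first run can only detect later tampering, not a bad initial
download, which is why an operator-supplied sha256 remains an option.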
Net effect: a fresh lab host now gets Tier 3 (and Tier 4 if API key
present) from a single sudo invocation. No operator picks for image
URLs, no manual metasploit installs, no manual manifest edits.
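For reference, the "atomic tempfile + os.replace, preserving stat"
rewrite used on manifest.toml follows a common pattern. A minimal
sketch, assuming a hypothetical helper name and that the new contents
are already rendered to a string:

```python
import os
import shutil
import tempfile
from pathlib import Path


def atomic_rewrite(path: Path, new_text: str) -> None:
    """Replace path's contents atomically, keeping mode and timestamps."""
    # The temp file lives in the same directory so os.replace stays on
    # one filesystem, which is what makes the swap atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(new_text)
            fh.flush()
            os.fsync(fh.fileno())
        shutil.copystat(path, tmp_name)  # carry over permissions + times
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A reader of the manifest sees either the old file or the new one,
never a half-written file.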
Sysadmin observed lab-host agents still trying to "secure the
connection" — minting certs, generating CSRs, or otherwise reinventing
a cert-delivery flow that's already automated through bootstrap.wg.
Three reinforcements so an agent reading any of the three surfaces
(AGENTS.md, install script output, journalctl) gets the same message:
- AGENTS.md gains a top-of-file "do not mint your own certs" callout
+ a dedicated "Securing the connection (mTLS)" section with the
one fix (re-run install-lab-host.sh after setting host_id) and an
explicit "what NOT to do" list (no openssl, no copy from another
host, no verify_tls=false).
- install-lab-host.sh's FIRST-INSTALL NEXT STEPS now spells out that
the cert auto-fetch is silently skipped while host_id is REPLACE_ME,
and that the operator MUST re-run the script after editing host_id.
Step 2 is now "RE-RUN THIS SCRIPT" with a DO NOT openssl warning.
- The shipper's "waiting on mTLS material" warning now embeds the
exact remediation command + a pointer to AGENTS.md, so an agent
reading journalctl without ever opening the repo still gets it.
Tests: 12/12 in test_shipper still pass; warning string change is
not asserted on (only the dataclass error field).
1. pyproject.toml — move pycdlib to main deps (was dev-only; cidata build
fails on first install because the venv doesn't include dev extras).
2. scripts/install-lab-host.sh — create vm/images/ dir before symlinking
alpine-baseline.qcow2 and cidata.iso into INSTALL_ROOT. Without the
mkdir the ln -sf silently fails (|| true), leaving the launchers unable
to find the images and causing every episode to fail within 15 s.
3. tools/cis490_doctor.py — two fixes:
a. Insert repo_root into sys.path at doctor startup so the inline
`from exploits.modules import ...` succeeds when running from /opt/cis490
(package = false means nothing is installed into site-packages).
b. Pass cwd=/opt/cis490 to the shipper --ping subprocess so python -m
shipper resolves the module correctly regardless of the caller's CWD.
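The sys.path part of fix 3a amounts to something like the sketch
below. The helper name is hypothetical; the doctor's real code may
inline this differently.

```python
import sys
from pathlib import Path


def ensure_on_sys_path(repo_root: Path) -> None:
    """Make top-level packages under repo_root importable.

    Needed when nothing is installed into site-packages, so imports
    only resolve if the repo root itself is on sys.path.
    """
    root = str(repo_root.resolve())
    if root not in sys.path:
        sys.path.insert(0, root)  # front of the list: repo wins over site-packages
```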
Tested on k-gamingcom: install script now builds cidata.iso on first run,
7-slot fleet wave completes with rc=0, doctor shows 13 ok / 4 warn / 2 fail
(remaining failures are mTLS certs + collector.wg DNS — both need Pi-side
action, not code changes).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes the next batch of issues from the post-mortem. The previous
"each run uses a different vulnerability" commit shipped 5 modules
but 3 of them couldn't actually fire under SLIRP+restrict=on:
their reverse-shell payloads needed a callback channel the launcher
didn't provide, AND their LHOST options were set to {{ target_ip }}
(the target's IP, not the attacker's — copy-paste from RHOSTS).
At the same time, the workloads.py shell commands used bash-only /dev/tcp
redirects that silently no-op'd in the busybox shell sessions
Metasploitable2 returns. Net effect: episodes that selected those
modules would have produced session_open_timeout + dead workloads.
Module configs (the three callback ones):
exploits/modules/distccd_command_exec.toml
exploits/modules/php_cgi_arg_injection.toml
exploits/modules/unreal_ircd_3281_backdoor.toml
- Switch payload from cmd/unix/reverse* to cmd/unix/bind_perl
so the target listens on a known port; msfrpcd connects to it
via the host's hostfwd (no callback path required).
- Drop the bogus LHOST = "{{ target_ip }}" — bind shells don't
use LHOST.
- Add [runtime] table:
requires_bridge = true
extra_target_ports = [<bind_lport>]
Both fields are honored by the loader (ModuleConfig.requires_bridge)
and the launcher (TARGET_PORTS gets the extra port hostfwd'd
when BRIDGE mode is active).
orchestrator/fleet.py
When BRIDGE is unset in env, _run_slot filters the module catalog
down to modules where requires_bridge=False before calling
select_module. Two same-socket-shell modules (vsftpd_234_backdoor +
samba_usermap_script) survive — fleet still has variety; just
doesn't pick modules whose payloads can't land. With BRIDGE set,
the full catalog rotates as before, AND BRIDGE is propagated to
the per-slot subprocess env so launch_target.sh enters tap+bridge
mode.
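The catalog filter described above can be sketched as follows. Only the
requires_bridge field and the BRIDGE variable come from this commit;
the dataclass shape and the eligible_modules name are assumptions made
for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleConfig:
    # Simplified stand-in for the loader's ModuleConfig.
    name: str
    requires_bridge: bool = False


def eligible_modules(catalog, env=None):
    """Drop bridge-only modules unless BRIDGE is set to a non-empty value."""
    env = os.environ if env is None else env
    if env.get("BRIDGE"):
        return list(catalog)  # bridge available: full catalog rotates
    return [m for m in catalog if not m.requires_bridge]
```

Filtering before selection keeps the selector itself unaware of the
network mode.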
exploits/workloads.py
Replaced bash-only constructs in three profiles:
scan-and-dial /dev/tcp/HOST/PORT redirects → nc -z -w 1
bursty-c2 same fix
shell-resident exec 3<>/dev/tcp/... → piping into nc -w
All three now run cleanly in busybox / dash / Metasploitable2's
default shell. The remaining three profiles (cpu-saturate, io-walk,
low-and-slow) were already busybox-portable.
scripts/install-lab-host.sh
- lab-host.env now defaults BRIDGE=br-malware (was commented out).
Operator opt-out is to comment the line out again.
- New step 6b: provisions br-malware via vm/setup_bridge.sh AND
pre-creates a per-slot tap pool (cis490tap0..7 for Tier-2 demo,
cis490target0..7 for Tier-3 target) all attached to br-malware
and brought up. Launchers reference these by SLOT — no sudo
needed at episode time.
- On bridge-setup failure, the script auto-comments BRIDGE in the
env file with a "auto-disabled: bridge setup failed" note so
the fleet falls back to same-socket modules + Tier-2 cleanly.
tools/cis490_doctor.py
Three new checks for the lab-host role:
bridge: br-malware exists / up
tier3: msfrpcd listening on 127.0.0.1:55553
tier3: module catalog parses (counts same-socket vs requires_bridge)
All three are warn-level — they don't fail an otherwise-healthy
Tier-2-only setup; they tell the operator what's missing for full
Tier-3 + source 4 coverage.
Tests: 132 (was 129). New cases:
test_fleet.py +3
- fleet skips requires_bridge modules when BRIDGE unset (asserted
across 20 episodes; never picks a callback module)
- fleet uses the full catalog when BRIDGE is set
- BRIDGE env propagates to per-slot subprocess
What's still untested live: the bind_perl payloads against a real
Metasploitable2 in the bridge-enabled launcher path. That's a
deployment validation, not a code change. The unit tests confirm
the dispatch / filter logic; the live test is the next operator
action.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a pull-based cert distribution path so install-lab-host.sh can
fetch its own leaf cert without operator intervention. Removes the
ssh-from-Pi requirement that blocked elliott-lab.
How the chicken-and-egg gets solved: a freshly wg-enrolled lab host
already has WG access (gatekept by iptmonads at L4) and trusts the
Caddy local CA (bundled in this repo at etc/caddy-root.crt). It
makes a single TLS call to https://bootstrap.wg/v1/cert/<host_id>
— no mTLS — gets back a tar of {ca.crt, leaf.pem, leaf.key},
extracts to /etc/cis490/certs/, and the shipper unblocks. Trust
boundary is "reached :443 over WG"; no operator action needed.
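The extract step on the lab host could look roughly like the sketch
below. It is not the installer's code (which uses curl and tar from
the shell); the function name and the flat-files-only policy are
assumptions, chosen to show why a downloaded tar should not be
unpacked blindly.

```python
import io
import os
import tarfile
from pathlib import Path


def extract_cert_bundle(tar_bytes: bytes, dest: Path) -> list[str]:
    """Unpack a cert tarball into dest, refusing anything but flat files."""
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar.getmembers():
            name = member.name
            # Reject directories, links, devices and any path component.
            if not member.isfile() or os.path.basename(name) != name:
                raise ValueError(f"unexpected tar member: {name!r}")
            data = tar.extractfile(member).read()
            target = dest / name
            # Private keys get 0600; certificates may be world-readable.
            mode = 0o600 if name.endswith(".key") else 0o644
            fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, mode)
            with os.fdopen(fd, "wb") as fh:
                fh.write(data)
            os.chmod(target, mode)  # enforce mode even if the file existed
            written.append(name)
    return sorted(written)
```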
bootstrap/
app.py Starlette: GET /v1/cert/{host_id}, GET /v1/health.
Validates host_id charset, rate-limits per source IP,
logs every mint with the X-Real-IP Caddy injects.
__main__.py uvicorn launcher; runs as root because the wg-pki CA
private key is root-only.
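The two request guards in app.py (host_id charset, per-source-IP rate
limit) can be illustrated with the sketch below. The allowed charset,
the limit, and the window are assumptions; the commit does not state
the real values.

```python
import re
import time
from collections import defaultdict, deque

# Assumed charset: lowercase letters, digits, hyphens, at most 63 chars.
HOST_ID_RE = re.compile(r"[a-z0-9][a-z0-9-]{0,62}")


def valid_host_id(host_id: str) -> bool:
    """Reject anything that could escape a path or confuse the issuer."""
    return HOST_ID_RE.fullmatch(host_id) is not None


class RateLimiter:
    """Sliding window: at most `limit` hits per `window` seconds per key."""

    def __init__(self, limit=5, window=60.0, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.hits = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        q = self.hits[key]
        while q and now - q[0] >= self.window:
            q.popleft()  # forget hits that fell out of the window
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

With this charset a placeholder such as REPLACE_ME is rejected outright,
so an unconfigured host cannot mint a cert under that name.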
etc/cis490-bootstrap.service
systemd unit on 127.0.0.1:8446 with ProtectSystem=strict +
narrow ReadWritePaths=/var/lib/wg-pki. ProtectHome=no because
systemd's read-only mode hides /home contents (the issuer script
the wrapper execs lives there).
scripts/issue-cis490-client-cert-wrapper.sh
Adapter the bootstrap service shells out to. Resolves the actual
wg-pki issuer script across the three plausible install layouts
(/opt/wg-pki, /home/max/wg-pki, /home/max/.env/wg-pki) so a single
copy of the unit file works on any operator's box. Forces
--out-dir to /var/lib/wg-pki/issued so writes stay inside the
service's narrow ReadWritePaths.
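The layout resolution is a first-match search over the three roots.
The wrapper itself is a shell script; this Python sketch only shows the
logic, and the script name passed in is hypothetical.

```python
from pathlib import Path

# The three install layouts named in this commit, in search order.
CANDIDATE_ROOTS = (
    Path("/opt/wg-pki"),
    Path("/home/max/wg-pki"),
    Path("/home/max/.env/wg-pki"),
)


def resolve_issuer(script_name: str, roots=CANDIDATE_ROOTS):
    """Return the first existing issuer script, or None if none is found."""
    for root in roots:
        candidate = root / script_name
        if candidate.is_file():
            return candidate
    return None
```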
scripts/install-lab-host.sh
After scaffolding lab-host.toml, if /etc/cis490/certs/lab-host.pem
is absent, curls bootstrap.wg with --cacert etc/caddy-root.crt
(no chicken-and-egg), extracts, chowns/chmods. Skips silently if
bootstrap.wg is unreachable so manual hand-carry remains possible.
scripts/install-receiver.sh
Drops cis490-bootstrap.service alongside cis490-receiver and
prints both as "enable --now" candidates. cis490-bootstrap is the
thing that makes lab hosts self-provisioning.
etc/caddy-root.crt
Bundled copy of wg-pki's published Caddy local CA root, so the
bootstrap fetch can verify TLS without depending on a wg-pki
clone that may or may not be on the lab host yet.
Verified live on the Pi:
$ curl --cacert etc/caddy-root.crt https://bootstrap.wg/v1/cert/elliott-lab -o /tmp/x.tar
HTTP 200 size=10240
$ tar tf /tmp/x.tar
ca.crt
elliott-lab.key
elliott-lab.pem
$ openssl verify -CAfile … elliott-lab.pem
/tmp/.../elliott-lab.pem: OK
$ openssl x509 -subject … -noout
subject=CN=elliott-lab
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the missing diagnostic + onboarding tools so an agent (AI or
human) handed a fresh lab host can get to "shipping data" without
re-deriving every step from logs.
tools/cis490_doctor.py — one-shot health check that walks the full
stack from the bottom up. Each row is green/yellow/red with an
exact fix command for the red rows. Checks:
- repo: branch, tree-clean, distance from origin/main
- install: /opt/cis490, .venv python, /etc/cis490/{lab-host,receiver}.toml,
/etc/cis490/lab-host.env
- mTLS: /etc/cis490/certs/{wg-ca,lab-host}.{pem,key}, openssl chain verify
- systemd: cis490-{shipper,orchestrator,receiver} active state
- net: receiver.url DNS, TCP reach, mTLS handshake to collector.wg
- vm prereqs: /dev/kvm, qemu-system-x86_64, zstd, alpine-baseline.qcow2,
cidata.iso
- tier3 prereqs: msfrpcd, metasploitable2.qcow2 (warn-level)
- end-to-end: cis490-shipper --ping
Modes: --role {lab-host,receiver}, --json (machine-readable),
--no-tier3 (skip optional checks). Exits non-zero on any red row.
ANSI color (auto-disabled on non-tty / NO_COLOR).
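The color and exit-code rules above reduce to two small decisions. A
sketch under assumed names (use_color, exit_code, and the "ok" / "warn"
/ "fail" status strings are illustrative, not the doctor's actual
identifiers):

```python
import os
import sys


def use_color(stream=None, env=None) -> bool:
    """Color only on a real terminal, and never when NO_COLOR is set."""
    stream = sys.stdout if stream is None else stream
    env = os.environ if env is None else env
    if env.get("NO_COLOR"):
        return False
    return hasattr(stream, "isatty") and stream.isatty()


def exit_code(statuses) -> int:
    """Non-zero when any row is red; warnings alone do not fail the run."""
    return 1 if "fail" in statuses else 0
```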
AGENTS.md gains a "How a lab host gets to shipping data" canonical
flow at the top: cert delivery via wg-pki/deploy-cis490-cert.sh →
install-lab-host.sh → cis490-doctor → systemctl enable. Plus an
"on-demand episode" recipe + a "smallest E2E test" snippet for
agents that need to verify the pipe without waiting on the timer.
The strict "cloning the repo by itself does nothing" callout makes
the failure mode mu and elliott-lab hit explicit.
scripts/install-lab-host.sh prints a 5-step banner on first install
that points at cis490_doctor.py + the deploy-cis490-cert.sh flow,
plus an always-printed footer warning that "cloning + running
launchers manually is NOT enough." Same message the AGENTS.md
section reinforces.
Refs spectral/CIS490#8 (the "Tier-2 is shipping in the meantime"
claim that turned out to be untrue because no cis490-shipper
service was running on elliott-lab — exactly the case this
diagnostic tool targets).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps the gaps surfaced in the "what is not implemented" audit so the
fleet really is shippable end-to-end. Verified live on the Pi:
- cis490-shipper --ping → HTTP 200 through Caddy + mTLS via the
new wg-pki client CA leaf
- real episode dir → tar+zstd → PUT → HTTP 201 stored
- re-ship same bytes → 200 (idempotent)
- re-ship different bytes under same id → 409 (conflict)
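The 201 / 200 / 409 behaviour verified above is content-addressed
idempotency. A minimal in-memory sketch, assuming the receiver compares
a sha256 of the body per episode id (the class and its storage are
hypothetical; the real receiver persists to disk):

```python
import hashlib


class EpisodeStore:
    """Toy model of the receiver's PUT semantics."""

    def __init__(self):
        self._digests = {}  # episode_id -> sha256 hex of the stored body

    def put(self, episode_id: str, body: bytes) -> int:
        digest = hashlib.sha256(body).hexdigest()
        known = self._digests.get(episode_id)
        if known is None:
            self._digests[episode_id] = digest
            return 201  # first sighting: stored
        # Same bytes are a harmless retry; different bytes are a conflict.
        return 200 if known == digest else 409
```

This is what lets the shipper retry after a crash without risking
duplicate or silently overwritten episodes.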
Changes:
orchestrator/episode.py
- EpisodeConfig.revert_at_start / revert_at_end (Tier 0+ snapshot/
revert per docs/architecture.md). When set + qmp_socket present,
EpisodeRunner issues loadvm <snapshot_name> and emits
snapshot_revert / snapshot_revert_failed events on the same
monotonic clock as everything else.
collectors/qmp.py
- savevm() / loadvm() helpers using human-monitor-command, plus a
test against the fake QMP server.
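QMP has no native savevm / loadvm here, so the helpers tunnel the HMP
command through human-monitor-command. The wire message looks like the
sketch below; the helper names and the "empty output means success"
check are assumptions, not the collector's actual code.

```python
import json


def _hmp(command_line: str) -> bytes:
    """Wrap an HMP command line in a QMP human-monitor-command message."""
    msg = {
        "execute": "human-monitor-command",
        "arguments": {"command-line": command_line},
    }
    return json.dumps(msg).encode() + b"\n"


def _checked(name: str) -> str:
    # Keep snapshot names to a single token so they cannot smuggle
    # extra HMP arguments.
    if not name or any(c.isspace() for c in name):
        raise ValueError(f"bad snapshot name: {name!r}")
    return name


def savevm(name: str) -> bytes:
    return _hmp(f"savevm {_checked(name)}")


def loadvm(name: str) -> bytes:
    return _hmp(f"loadvm {_checked(name)}")


def hmp_ok(reply: dict) -> bool:
    """Assumption: HMP reports failures as text in 'return', so success
    is a reply without 'error' and with empty output."""
    return "error" not in reply and reply.get("return", "") == ""
```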
exploits/workloads.py
- chunked_real_binary_upload() returns a ChunkedUpload plan: 8 KiB
base64 chunks (~6 KiB binary each) so msfrpc never sees a buffer-
busting payload. Includes a finalize step that sha256-verifies on
the guest before exec.
- real_binary_workload() now wraps the chunked plan for backwards
compat with single-shot callers.
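The chunk arithmetic works out because base64 turns every 3 bytes into
4 characters, so 6 KiB of binary becomes exactly 8 KiB of text. A
sketch of the planning step, with a hypothetical function name and
return shape (the real ChunkedUpload plan carries more than this):

```python
import base64
import hashlib

RAW_CHUNK = 6 * 1024  # 6144 bytes -> 8192 base64 characters


def plan_chunks(blob: bytes, raw_chunk: int = RAW_CHUNK):
    """Split blob into base64 chunks plus the sha256 the guest must match."""
    chunks = [
        base64.b64encode(blob[i:i + raw_chunk]).decode("ascii")
        for i in range(0, len(blob), raw_chunk)
    ]
    return chunks, hashlib.sha256(blob).hexdigest()
```

Because 6144 is a multiple of 3, only the final chunk can carry base64
padding, so the chunks can be concatenated and decoded as one stream
on the guest before the sha256 check.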
exploits/driver.py
- Tier-4 dispatch walks the chunked plan in MSFExploitDriver:
each chunk is a separate session_shell_write; finalize verifies;
exec only runs on sha-ok. New events: real_binary_upload_begin,
real_binary_verify, real_binary_aborted.
etc/cis490-orchestrator.service
- Reads /etc/cis490/lab-host.env (FLEET_HOST_ID + optional BRIDGE).
- Grants AmbientCapabilities CAP_NET_RAW (tcpdump for source 4) +
CAP_SYS_ADMIN + CAP_PERFMON (perf for source 3) so collectors
work under hardening.
scripts/install-lab-host.sh
- Writes /etc/cis490/lab-host.env on first install with FLEET_HOST_ID
defaulting to `hostname -s`.
- Best-effort: fetches the Alpine baseline qcow2 (sha512-pinned) and
builds cidata.iso with the in-guest agent embedded; symlinks both
into /opt/cis490/vm/images/ so launchers find them.
scripts/fetch-alpine-baseline.sh
- Idempotent fetcher for the Alpine 3.21 cloud-init nocloud qcow2
matching the sha512 in docs/sources.md.
tools/plot_envelope.py
- Rebuilt to render whatever telemetry the episode dir contains:
proc → QMP block ops → perf IPC/miss-rate → bridge pkts/SYNs →
guest agent load/mem. Missing sources are silently skipped.
tools/index_reader.py
- cis490-index CLI: filter receiver's index.jsonl by host / sample
/ time range, sort, count-by group. Closest thing to a query
interface until we stand up Postgres/Timescale.
samples/README.md
- Rewritten to match the new manifest schema, the kind=real vs mimic
split, the per-(host, slot, ep) selection mechanic, and the
chunked-upload safety story.
Tests: 106 pass (was 102). New cases:
- test_qmp.py — savevm + loadvm (HMP wrapper + error path)
- test_tier4.py — chunked plan splitting, sha-pinned finalize,
end-to-end driver walks all chunks + verify + exec via the fake
msfrpc client
Closes the "what is not implemented" punch list.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This is the chunk that makes "real data" actually flow on multiple
hosts in parallel. End-to-end pipe was up at 613c6fa / 2579683; now
the lab-host side has the diversity + concurrency it needs.
Collectors landed:
collectors/qmp.py — source 2 (oracle). Tiny synchronous QMP
client + row builder + run loop. Tolerates
older qemu without query-stats.
collectors/guest_agent.py — source 5 (deployable). Reads the
virtio-serial host-side socket, parses
agent JSON-lines, re-stamps to the host
monotonic clock, persists.
collectors/pcap.py — source 4 (deployable). tcpdump capture
+ pure-Python pcap reader + 100 ms
netflow.jsonl bucketizer. Decodes
Ethernet/IPv4/TCP/UDP enough for the
schema in docs/data-model.md.
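The 100 ms bucketizer can be pictured as the sketch below. The input
tuple and the output field names are assumptions for illustration; the
authoritative schema is the one in docs/data-model.md.

```python
from collections import defaultdict

BUCKET_NS = 100_000_000  # 100 ms in nanoseconds


def bucketize(packets, bucket_ns: int = BUCKET_NS):
    """Aggregate packets into fixed windows.

    packets: iterable of (timestamp_ns, length_bytes, is_syn) tuples.
    Returns one row per non-empty window, ordered by window start.
    """
    buckets = defaultdict(lambda: {"pkts": 0, "bytes": 0, "syns": 0})
    for ts_ns, length, is_syn in packets:
        # Floor the timestamp to the start of its window.
        row = buckets[ts_ns // bucket_ns * bucket_ns]
        row["pkts"] += 1
        row["bytes"] += length
        row["syns"] += int(is_syn)
    return [{"t_ns": start, **row} for start, row in sorted(buckets.items())]
```

Empty windows produce no row in this sketch; whether the real
collector emits zero rows for idle windows is not stated here.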
In-guest agent:
vm/guest-agent/cis490_agent.py — stdlib-only Python agent. Reads
/proc/{stat,meminfo,loadavg,net/dev,net/tcp*}, top-N RSS procs,
thermal. Writes JSON-lines to /dev/virtio-ports/cis490.guest.agent.
tools/build_cidata.py — embeds the agent + an OpenRC service into
user-data so first boot of the Alpine cidata image auto-starts it.
Launchers:
vm/launch_demo.sh / launch_target.sh — second virtio-serial port for
the agent socket; SLOT env support so multiple VMs run without
socket / port collisions; PORT_BASE on launch_target so multiple
target VMs hostfwd different host ports.
vm/setup_bridge.sh — creates host-only br-malware (10.200.0.1/24,
no NAT). Idempotent.
Fleet:
orchestrator/fleet.py — capacity detector (cores / RAM / load
headroom) + concurrent-slot runner. Per-slot ENV selects the
sample. FleetCapacity dataclass round-trips into meta.json so
"this episode ran with 6 concurrent VMs" is auditable post-hoc.
tools/run_fleet.py — CLI: --capacity report; --waves N runs N
waves of (max_concurrent) episodes each, every slot with a
different sample.
etc/cis490-orchestrator.service — now drives the fleet runner with
Restart=always so each invocation runs one wave and respawns,
giving a continuous stream.
Samples:
samples/manifest.toml — six profiles spanning the five major
behaviour shapes. Each entry is real OR mimic (sha256 distinguishes).
samples/manifest.py — strict TOML loader (rejects dups, unknown
categories) + deterministic select(host_id, slot, episode_index)
so different hosts on the network walk the catalog in different
orders without any coordinator.
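One way to get coordinator-free, deterministic selection is a per-host
offset derived from a stable hash. This is a guess at the shape of
select(), not its actual implementation; the rotation scheme is an
assumption.

```python
import hashlib


def select(catalog, host_id: str, slot: int, episode_index: int):
    """Pick a catalog entry from (host_id, slot, episode_index) alone.

    hashlib is used instead of hash() because Python salts str hashes
    per process, which would break determinism across runs.
    """
    if not catalog:
        raise ValueError("empty catalog")
    digest = hashlib.sha256(host_id.encode()).digest()
    offset = int.from_bytes(digest[:8], "big")  # per-host starting point
    return catalog[(offset + slot + episode_index) % len(catalog)]
```

With this rotation, concurrent slots in one wave get distinct samples
as long as the slot count does not exceed the catalog size.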
EpisodeRunner:
orchestrator/episode.py — optional qmp_socket + guest_agent_socket
fields on EpisodeConfig; when set, additional collector threads
run alongside proc_qemu. EpisodeResult now carries rows_qmp +
rows_guest counters.
Tier-3 setup automation:
scripts/install-msfrpcd.sh — installs metasploit-framework where
the package manager has it, generates a strong password into
/etc/cis490/msfrpc.env, drops a hardened systemd unit bound to
127.0.0.1:55553. After this, run_tier3_demo.py works zero-touch
once MSFRPC_PASSWORD is sourced.
scripts/fetch-metasploitable2.sh — accepts IMAGE_URL + IMAGE_SHA256
from the operator (Rapid7 download is registration-walled), pulls,
verifies, converts vmdk → qcow2, lands at vm/images/.
Tests: 82 pass (was 51). New suites:
tests/test_qmp.py — fake QMP server, capability handshake,
blockstats, async-event interleaving,
5-failure backoff
tests/test_guest_agent.py — fake virtio socket, JSON-lines read +
re-stamp, malformed-line tolerance
tests/test_pcap.py — synthetic pcap with TCP/UDP/ARP frames,
bucketize correctness across windows
tests/test_fleet.py — capacity math (8-core idle / low-RAM /
high-load / Pi5 / 1-core box), manifest
selection determinism + diversity
What's queued for the next commit (already discussed in convo):
- MSFExploitDriver v2: map sample.profile → distinct in-session
workload so Tier-3 episodes don't all produce the same yes-loop
envelope. Critical for ML to learn varied malware shapes.
- Real-sample fetch from MalwareBazaar by sha256.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the deployment loop end-to-end on the CIS490 side:
shipper/
config.py ShipperConfig (host_id, paths, receiver endpoint, mTLS)
transport.py httpx-based PUT + ping with mTLS + bearer support
queue.py scan data/episodes/, tar+zstd via system zstd, ship,
retire to data/shipped/. Idempotent across crashes per
the state machine in docs/transport.md.
__main__.py CLI: --ping (smoke test), --once (one pass), or daemon
receiver/app.py: new POST /v1/ping that requires the same auth as PUT
/v1/episodes but writes nothing. Used by `cis490-shipper --ping`
during lab-host bring-up to verify the WG/Caddy/mTLS path before
shipping any real bytes.
etc/
cis490-shipper.service systemd unit for the lab-host shipper
cis490-orchestrator.service systemd unit for the lab-host queue
(kept disabled by default until queue
mode lands)
lab-host.toml.example config template
scripts/
install-lab-host.sh idempotent installer; verifies prereqs,
creates cis490 service user, syncs repo to
/opt/cis490, builds venv, drops systemd units
and config template
install-receiver.sh same, for the receiver role on the central WG
node (Pi5 in our setup)
tests/test_shipper.py 11 end-to-end tests against a real Uvicorn
server hosting the receiver app. Exercises
ping, tar+ship, idempotent re-ship, 409
conflict, transient (receiver down), tarball
round-trip via system zstd.
AGENTS.md guidance for AI agents working on this and sibling repos.
Headline: when you hit an issue you can't fully fix in
scope, file a Forgejo issue rather than leaving a TODO.
51/51 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>