training/
Deferred until the dataset has substance. The plan, recorded so we don't lose it:
- Two models will be trained from the same episodes:
  - Realistic: features only (available_in_deployment: true).
  - Oracle: all rows, regardless of the deployment flag.
- Baseline architecture: a rolling-window feature builder feeding a gradient-boosted tree classifier (XGBoost or LightGBM). Cheap to train, strong on tabular features, and interpretable through feature importances.
- Window: 1–5 second sliding windows with per-channel summary stats (mean, std, p95, slope, count of zero buckets).
- Target: the phase enum from labels.jsonl, projected onto each window's center timestamp.
- Evaluation:
  - Held-out samples (not just held-out time slices): generalization to unseen malware matters more than within-sample accuracy.
  - Confusion matrix + per-phase precision/recall.
  - Realistic vs. oracle gap, reported.
- Stretch: trust-over-time scoring per the IEEE 9881803 paper, with a reset threshold tuned for low false-positive cost.
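
A minimal sketch of the two training views and the held-out-sample split described above, using only the Python standard library. Only the available_in_deployment flag comes from the plan; the row fields sample_id, channel and value, the channel names, and the helper names are hypothetical placeholders, since no code or final schema exists yet.

```python
import random

# Hypothetical feature rows. Only `available_in_deployment` is taken from the
# plan; the other field names and the channel names are illustrative.
rows = [
    {"sample_id": sid, "channel": ch, "value": 1.0, "available_in_deployment": flag}
    for sid in ("s1", "s2", "s3", "s4")
    for ch, flag in (("net_bytes", True), ("guest_syscalls", False))
]


def realistic_view(rows):
    """Rows the realistic model may see: deployment flag explicitly true."""
    return [r for r in rows if r.get("available_in_deployment") is True]


def oracle_view(rows):
    """Rows the oracle model may see: everything, flag ignored."""
    return list(rows)


def split_by_sample(rows, test_fraction=0.25, seed=0):
    """Hold out whole samples so no sample appears in both train and test."""
    sample_ids = sorted({r["sample_id"] for r in rows})
    rng = random.Random(seed)
    rng.shuffle(sample_ids)
    n_test = max(1, round(len(sample_ids) * test_fraction))
    test_ids = set(sample_ids[:n_test])
    train = [r for r in rows if r["sample_id"] not in test_ids]
    test = [r for r in rows if r["sample_id"] in test_ids]
    return train, test


train_rows, test_rows = split_by_sample(rows)
print(len(realistic_view(train_rows)), len(oracle_view(train_rows)))
```

Splitting on the sample identifier rather than on time means every window from a held-out malware sample is unseen during training, which is what the evaluation bullet asks for. Both models would be fit and scored on the same split, so the realistic-vs-oracle gap reflects only the feature difference.
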
See docs/threat-model.md for why this split exists.
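
The window and target bullets above could be prototyped roughly as follows. This is a sketch under stated assumptions, not the planned implementation: it assumes one channel arrives as time-sorted (timestamp, value) buckets and that labels reduce to time-sorted (start timestamp, phase) pairs. The phase names "idle" and "active" are placeholders, not the real phase enum, and the function names are hypothetical.

```python
import bisect
import math
import statistics


def window_stats(values):
    """Summary stats for one channel over the buckets of one window."""
    n = len(values)
    ordered = sorted(values)
    p95 = ordered[max(0, math.ceil(0.95 * n) - 1)]  # nearest-rank percentile
    mean_x = (n - 1) / 2
    mean_y = statistics.fmean(values)
    denom = sum((x - mean_x) ** 2 for x in range(n))
    # Least-squares slope of value against bucket index; 0.0 for one bucket.
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values)) / denom
        if denom
        else 0.0
    )
    return {
        "mean": mean_y,
        "std": statistics.pstdev(values),
        "p95": p95,
        "slope": slope,
        "zero_buckets": sum(1 for v in values if v == 0),
    }


def sliding_windows(series, width_s, step_s):
    """Yield (center_ts, stats) for windows [t0, t0 + width_s).

    Only windows ending at or before the last bucket timestamp are emitted.
    """
    if not series:
        return
    times = [t for t, _ in series]
    k = 0
    while times[0] + k * step_s + width_s <= times[-1]:
        t0 = times[0] + k * step_s
        lo = bisect.bisect_left(times, t0)
        hi = bisect.bisect_left(times, t0 + width_s)
        values = [v for _, v in series[lo:hi]]
        if values:
            yield t0 + width_s / 2, window_stats(values)
        k += 1


def phase_at(labels, ts):
    """Phase in effect at ts: the last label whose start is <= ts, else None."""
    starts = [s for s, _ in labels]
    i = bisect.bisect_right(starts, ts) - 1
    return labels[i][1] if i >= 0 else None


# Hypothetical 1 Hz channel and label timeline (placeholder phase names).
series = [(0.0, 0.0), (1.0, 0.0), (2.0, 3.0), (3.0, 5.0), (4.0, 7.0), (5.0, 9.0)]
labels = [(0.0, "idle"), (2.5, "active")]

examples = [
    (center, phase_at(labels, center), stats)
    for center, stats in sliding_windows(series, width_s=2.0, step_s=1.0)
]
for center, phase, stats in examples:
    print(center, phase, stats)
```

Labelling by the window's center means a window that straddles a phase transition takes the phase in effect at its midpoint. Whether such boundary windows should instead be dropped or down-weighted is left open here.
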