# CIS 490 — Slide Deck Planning Template (Filled)

Maps each of the assignment's 16 slides to concrete content drawn from `training/`, `tools/`, the validator output, and the trained models. Optional slides are marked **[opt]** and can be cut for time.

---

## Slide 1 — Title Slide

- Title: **Behavioral Malware Detection from Hypervisor-Layer VM Telemetry: A Cross-Architecture Comparison Under Cross-Device Generalization**
- Authors / affiliation / date / advisor (Dr. Mejias).
- One subtitle line: *Six architectures × two deployment modes, evaluated on held-out host.*

---

## Slide 2 — Motivation

> *Most malware doesn't look like malware in a database — it looks like a process behaving badly.*

- 92 % of threats now use TLS encryption (SonicWall 2022, cited in Melvin 2025) — payload inspection is dead, behavioral detection is what's left.
- Static analysis defeated by obfuscation and packing; signature databases miss zero-days; in-guest detectors can be disabled by the malware they're trying to catch.
- The deployable answer: watch *behavior* from outside the VM at the hypervisor layer.

Visual: lift the dashboard's `intro` scene tagline.

---

## Slide 3 — Problem Statement

One sentence:

> Train a model that classifies a 10-second window of out-of-VM telemetry as one of `{clean, armed, infecting, infected_running, dormant}`, and **measure whether it generalizes from the device it was trained on to a different device it has never seen.**

Second sentence:

> The honesty bar is *cross-device test-set macro F1 with 95 % CIs*, not in-distribution validation.

---

## Slide 4 — Research Gaps + Questions

**Gaps surfaced by the literature review:**

1. Most prior work (Melvin 2025, Natsos 2025) reports in-distribution metrics; cross-device generalization is rarely measured.
2. Architecture choice for resource-utilization time-series is contested: Melvin (CNN > RNN), Natsos (Transformer > LSTM > CNN). No head-to-head with controlled training methodology and statistical significance.
3. Realistic-vs-oracle ablation (host-side `/proc` removed at deployment) is not reported in either paper.

**Research question (single):** *Across six architectures, which best generalizes to a held-out host, and what does each lose when restricted to in-deployment features?*

---

## Slide 5 — Proposed Solution: Overview

A one-pane diagram (the dashboard's pipeline panel works):

```
Episodes (.tar.zst) → Validator → Feature & Tensor Builder → Held-out-by-Host Split
  → 6 Architectures × 2 Modes → Bootstrap-CI Eval → Comparison Report
```

Three sentences:

1. We collected ~73 k labeled VM episodes across two physical hosts.
2. We trained six architectures (GBT, MLP, CNN, GRU, LSTM, Transformer) twice — once with all telemetry, once with only what a deployed model would see — using a unified training loop with class-weighted loss, early stopping on val macro F1, and best-on-val checkpointing.
3. We evaluated all twelve on the *unseen* host with bootstrap CIs and paired-bootstrap significance.

---

## Slide 6 — Model Design

Side-by-side architecture cards (all six). Visual: parameter counts + inputs:

| Model | Input | Params (smoke) | Family |
|---|---|---:|---|
| GBT | `(230,)` summary | ~30 KB serialized | Tree |
| MLP | `(230,)` summary | 104 K | Dense |
| CNN | `(46, 100)` tensor | 101 K | Conv |
| GRU | `(46, 100)` tensor | 161 K | RNN |
| LSTM | `(46, 100)` tensor | 214 K | RNN |
| Transformer | `(46, 100)` tensor | 76 K | Attention |

Note the param counts deliberately stay within ~3× of each other — the comparison is "what does the inductive bias buy you," not "more parameters."

---

## Slide 7 — Methodology

**Data.** 76,660 episodes shipped from 2 hosts. 95.2 % training-usable after the §4.6 acceptance gate. 10-second windows / 5-second stride → ≈9 windows per episode → ~660 k windows.

**Split.** Held-out-by-host (primary): train on `elliott-thinkpad`, val carved from train host, test on `k-gamingcom`.
Profile-stratified; `scan-and-dial` flagged as `untested_profiles` because k-gamingcom never ran it. Held-out-by-sample (secondary) on the one profile that has ≥ 3 samples.

**Standardize on train only.** Per-channel for tensors, per-feature for summaries. Median imputation for NaN.

**Class-weighted CE.** `armed` weight ≈ 10.8, `infecting` ≈ 2.3, `clean` ≈ 0.4 — inverse frequency, clipped.

**Training loop.** AdamW, LR warmup (5 % of steps) + cosine decay, gradient clipping at 1.0, early stopping on val macro F1 (patience 8), mixed precision when CUDA is available, best-on-val checkpoint. Same loop for all five NN architectures. XGBoost uses `early_stopping_rounds=30` on val mlogloss.

---

## Slide 8 — Evaluation Setup

- **Held-out test set**: every episode from `k-gamingcom` (≈ 23 k windows). Never touched at train time.
- **Metrics**: macro F1, per-phase F1, per-profile F1, per-host F1 — every value with bootstrap 95 % CIs (1000 resamples).
- **Statistical significance**: paired-bootstrap of macro-F1 differences vs the top model. CI excludes 0 → significant.
- **Latency**: median µs per window at batch sizes `{1, 8, 64, 512}` — single-window timing alone is misleading because of Python overhead.
- **Realistic-vs-oracle gap**: every architecture trained twice, both numbers reported.

---

## Slide 9 — Evaluation Results

The **comparison_v2.md** table from `training/eval_/run.py`. Smoke-set numbers (200 episodes/host, 5 epochs — full-scale numbers will replace these for the final deck):

| Model | Test macro F1 (95 % CI) | Significant vs top? |
|---|---|---|
| **gbt (oracle)** | 0.557 [0.543, 0.571] | — (anchor) |
| mlp | 0.176 | yes (CI excludes 0) |
| transformer | 0.113 | yes |
| lstm | 0.112 | yes |
| gru | 0.092 | yes |
| cnn | 0.089 | yes |

**Visualization**: confusion matrix grid (one per model) from `reports/eval/__confusion.png`.
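
A minimal sketch of how the bootstrap 95 % CIs and the paired-bootstrap significance check from the Evaluation Setup slide could be computed. This is not the code in `training/`: the function names, the window-level resampling unit, and the percentile interval are assumptions made for illustration. If windows from the same episode are strongly correlated, resampling whole episodes instead would give more conservative intervals.

```python
import numpy as np


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1; a class with no support and no predictions scores 0."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


def bootstrap_ci(y_true, y_pred, n_classes, n_resamples=1000, seed=0):
    """Point estimate plus a percentile 95 % CI, resampling windows with replacement."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)
        stats[i] = macro_f1(y_true[idx], y_pred[idx], n_classes)
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return macro_f1(y_true, y_pred, n_classes), float(lo), float(hi)


def paired_bootstrap_diff(y_true, pred_a, pred_b, n_classes, n_resamples=1000, seed=0):
    """95 % CI of macro-F1(A) - macro-F1(B); both models are scored on the same resample."""
    y_true = np.asarray(y_true)
    pred_a = np.asarray(pred_a)
    pred_b = np.asarray(pred_b)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # shared indices make the comparison paired
        diffs[i] = (macro_f1(y_true[idx], pred_a[idx], n_classes)
                    - macro_f1(y_true[idx], pred_b[idx], n_classes))
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    significant = not (lo <= 0.0 <= hi)  # CI excludes 0 -> significant
    return float(lo), float(hi), significant
```

Calling `paired_bootstrap_diff(y_true, gbt_pred, mlp_pred, n_classes=5)` with hypothetical prediction arrays would yield the kind of "CI excludes 0" verdict shown in the table's last column.
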

**Headline claim** (smoke-version, will be re-tuned at full scale):

> At the data scale we have, GBT generalizes to the held-out host significantly better than every NN architecture — including the Transformer. The result is consistent with Natsos & Symeonidis 2025's finding that Transformer dominates only as the dataset grows past ~1k samples per family; below that, simpler inductive biases win.

---

## Slide 10 — Case Study / Demonstration

The **live dashboard** (`https://dashboard.wg`):

- Scene 2 (collect): live ingest counter from `index.jsonl` tailing producer.
- Scene 6 (attacks): per-profile attack-envelope thumbnails from `producers/profiles.py`.
- Scene 7 (chunking): predictions emitted by `producers/replay.py` running an episode at wall-clock speed.
- Scene 8 (models): macro F1 bars from `producers/metrics.py` (re-published every 20 s for reconnects).
- Scene 9 (knn): PCA-2 scatter colored by phase from each model's saved projection.
- Scene 10 (perf): accuracy-vs-latency scatter from `producers/perf.py`, batch-size 64.

Visual: one screenshot of dashboard.wg with live data, clicked through scenes 2 → 10.

---

## Slide 11 — Theoretical Contributions [opt]

- **Cross-source clock alignment.** Producers were inconsistent about `t_mono_ns` semantics (episode-relative vs system-uptime); we canonicalize on `t_wall_ns`. Generalizable to any multi-source telemetry pipeline.
- **Held-out-host as the primary cross-device generalization claim.** Most prior work reports in-distribution metrics; we report the harder number and let the reader see the gap.
- **Realistic-vs-oracle ablation** as the operational measure of "what the deployed model is missing."

---

## Slide 12 — Practical Contributions [opt]

- **Open-source training stack** (`training/`): six architectures, schema-hashed checkpoints, validator, dashboard producers — directly reusable for any project where labeled per-window resource-utilization data is available.
- **Live dashboard** at `dashboard.wg` with both pipeline-state and trained-model events — the working example of "model running live" the assignment §10 (case study) wants.
- **Validator + producer machinery** that catches data-quality issues (torn writes in `index.jsonl`, host silently shipping without bridge pcap, scan-and-dial absent from one host) instead of training on them.

---

## Slide 13 — Design Principles [opt]

- *No silent downgrade* — every host either ships data that meets the gate or produces nothing.
- *No silent schema drift* — every model checkpoint refuses to load if its training-time channel/feature schema doesn't match what `_features.py` produces today.
- *Honest CIs over point estimates* — every test number we report has bootstrap bounds.
- *Held out by host, not by time slice* — within-sample time splits are easy and dishonest about generalization.

---

## Slide 14 — Limitations [opt]

- **Two hosts.** Cross-device generalization with N=2 is a single fold of leave-one-host-out CV; with more hosts the methodology becomes more rigorous.
- **Twelve unique sample names total**, with two profiles having 1 sample each. Held-out-by-sample is feasible only on `io-walk`.
- **`scan-and-dial` not present on `k-gamingcom`** — that profile is trained but cannot be evaluated cross-device. Reported as `untested_profiles`, never silently averaged.
- **Producer-side bugs** found during validation: `receiver/store.py:130` torn write (1 occurrence in 76 k); ~24 k k-gamingcom episodes shipped without netflow (silent collector failure); cross-source clock-base inconsistency. All surfaced in `training/README.md` for follow-up.
- **Single training corpus, single attack manifest.** Generalization to *new attack manifests* is a stronger claim than this corpus supports.
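
As an illustration of the torn-write check mentioned in the producer-side bugs above, here is a minimal sketch of scanning a JSONL index for lines that do not parse as a single JSON object. It is an assumption about how such a check might look, not the project's validator; `scan_jsonl` and the `episode_id` field in the usage note are hypothetical.

```python
import json


def scan_jsonl(lines):
    """Split JSONL text lines into parsed records and the 1-based numbers of bad lines.

    A line counts as bad when it is not valid JSON (for example a record cut off
    mid-write, or two records fused on one line) or when it parses to something
    other than a JSON object. Blank lines are ignored.
    """
    records, bad_lines = [], []
    for lineno, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            bad_lines.append(lineno)
            continue
        if isinstance(obj, dict):
            records.append(obj)
        else:
            bad_lines.append(lineno)
    return records, bad_lines
```

A caller could run `with open("index.jsonl", encoding="utf-8") as fh: records, bad = scan_jsonl(fh)` and report `bad` instead of training on the affected episodes. Counting bad lines rather than raising keeps a single torn record from blocking the whole corpus while still making it visible.
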

---

## Slide 15 — Conclusion and Future Work

**Conclusion (1 line):** The realistic-mode model can be trained, evaluated honestly cross-device, and deployed against live telemetry — but the right architecture depends on data scale, and the cross-device gap is the metric that matters.

**Future work:**

- **Self-supervised pretrain** on `clean`-only windows (Masked-Timestep Reconstruction + Volume-of-Hypersphere Minimization, per LogBERT). Detects novel malware without labels. Implementation in `training/models/transformer_ssl.py`.
- **Trust-over-time scoring** per IEEE 9881803 — per-window confidence accumulated over a sliding decision window, reset trigger at threshold.
- **Integrated Gradients attribution** per (channel, timestep) — *which signals drove the call* — so the writeup can show evidence, not just confidence numbers. Implementation in `training/xai/integrated_gradients.py`.
- **More hosts in the fleet** — the data generation and shipping pipeline is in place; adding hosts is a config change. With N ≥ 4 hosts, leave-one-out CV becomes meaningful.
- **More distinct samples per profile** — 12 samples is too few to claim novel-malware generalization; the current dataset only supports cross-device generalization.

---

## Slide 16 — References / Acknowledgments

The same APA reference list as `docs/project_brief.md` §9. Lab acknowledgments: Dr. Mejias, Raul, the spectral lab infrastructure. Tools acknowledgments: `libVMI`, `Drakvuf`, `XGBoost`, `PyTorch`, `scikit-learn`, `pyarrow`.

---

## Notes for the deck builder

- Slides 11–14 are marked optional in the assignment — keep them if you have time, drop them if you're tight. Slides 1, 3, 5, 7, 9, 10, 15 are the *required* spine.
- Every metric on Slide 9 should come from `reports/eval/comparison_v2.md` directly — copy-paste the markdown and let the deck render it.
- Slide 10 (live demo) is more memorable than any chart — bring up `dashboard.wg`, scroll through scenes 2–10, let it talk.
- The repo has the `[opt]`-marked `Design Principles` slide built into the README. If you cut Slide 13, the principles still live in the artifact.
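
If Slide 13 is cut and someone asks how the "no silent schema drift" principle is enforced, a short sketch like the following may help explain the idea. It is an assumed illustration, not the repo's implementation: `schema_hash`, `check_checkpoint_schema`, `SchemaMismatchError`, and the `schema_hash` metadata key are all hypothetical names.

```python
import hashlib
import json


class SchemaMismatchError(RuntimeError):
    """Raised when a checkpoint was trained on a different channel/feature schema."""


def schema_hash(channel_names, feature_names):
    """Hash the ordered channel and feature names; reordering changes the hash on purpose."""
    payload = json.dumps(
        {"channels": list(channel_names), "features": list(feature_names)},
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def check_checkpoint_schema(checkpoint_meta, channel_names, feature_names):
    """Refuse to proceed if the stored hash differs from the current schema's hash."""
    stored = checkpoint_meta.get("schema_hash")
    current = schema_hash(channel_names, feature_names)
    if stored != current:
        raise SchemaMismatchError(
            f"checkpoint schema {stored!r} does not match current schema {current!r}"
        )
```

The hash is order-sensitive because a tensor model trained on channels in one order would silently misread inputs delivered in another.
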