# CIS 490 — Slide Deck Planning Template (Filled)

Maps each of the assignment's 16 slides to concrete content drawn from
`training/`, `tools/`, the validator output, and the trained models.
Optional slides are marked **[opt]** and can be cut for time.

---

## Slide 1 — Title Slide

- Title: **Behavioral Malware Detection from Hypervisor-Layer VM Telemetry: A Cross-Architecture Comparison Under Cross-Device Generalization**
- Authors / affiliation / date / advisor (Dr. Mejias).
- One subtitle line: *Six architectures × two deployment modes, evaluated on a held-out host.*

---

## Slide 2 — Motivation

> *Most malware doesn't look like malware in a database — it looks like a process behaving badly.*

- 92 % of threats now use TLS encryption (SonicWall 2022, cited in Melvin 2025) — payload inspection is dead, behavioral detection is what's left.
- Static analysis defeated by obfuscation and packing; signature databases miss zero-days; in-guest detectors can be disabled by the malware they're trying to catch.
- The deployable answer: watch *behavior* from outside the VM at the hypervisor layer.

Visual: lift the dashboard's `intro` scene tagline.

---

## Slide 3 — Problem Statement

One sentence:

> Train a model that classifies a 10-second window of out-of-VM telemetry as one of `{clean, armed, infecting, infected_running, dormant}`, and **measure whether it generalizes from the device it was trained on to a different device it has never seen.**

Second sentence:

> The honesty bar is *cross-device test-set macro F1 with 95 % CIs*, not in-distribution validation.

---

## Slide 4 — Research Gaps + Questions

**Gaps surfaced by the literature review:**
1. Most prior work (Melvin 2025, Natsos 2025) reports in-distribution metrics; cross-device generalization is rarely measured.
2. Architecture choice for resource-utilization time-series is contested: Melvin (CNN > RNN), Natsos (Transformer > LSTM > CNN). No head-to-head with controlled training methodology and statistical significance.
3. Realistic-vs-oracle ablation (host-side `/proc` removed at deployment) is not reported in either paper.

**Research question (single):** *Across six architectures, which best generalizes to a held-out host, and what does each lose when restricted to in-deployment features?*

---

## Slide 5 — Proposed Solution: Overview

A one-pane diagram (the dashboard's pipeline panel works):

```
Episodes (.tar.zst) → Validator → Feature & Tensor Builder → Held-out-by-Host Split → 6 Architectures × 2 Modes → Bootstrap-CI Eval → Comparison Report
```

Three sentences:
1. We collected 76,660 labeled VM episodes across two physical hosts; ~73 k of them pass the acceptance gate and are usable for training.
2. We trained six architectures (GBT, MLP, CNN, GRU, LSTM, Transformer) twice, once with all telemetry (oracle) and once with only what a deployed model would see (realistic); the five neural networks share one training loop with class-weighted loss, early stopping on val macro F1, and best-on-val checkpointing, while GBT uses XGBoost's built-in early stopping.
3. We evaluated all twelve on the *unseen* host with bootstrap CIs and paired-bootstrap significance.

---

## Slide 6 — Model Design

Side-by-side architecture cards (all six). Visual: parameter counts + inputs:

| Model | Input | Params / size (smoke) | Family |
|---|---|---:|---|
| GBT | `(230,)` summary | ~30 KB serialized | Tree |
| MLP | `(230,)` summary | 104 K | Dense |
| CNN | `(46, 100)` tensor | 101 K | Conv |
| GRU | `(46, 100)` tensor | 161 K | RNN |
| LSTM | `(46, 100)` tensor | 214 K | RNN |
| Transformer | `(46, 100)` tensor | 76 K | Attention |

Note that the five NN parameter counts deliberately stay within ~3× of each other (76 K to 214 K; GBT is reported by serialized size instead): the comparison is "what does the inductive bias buy you," not "more parameters."

---

## Slide 7 — Methodology

**Data.** 76,660 episodes shipped from 2 hosts. 95.2 % training-usable after the §4.6 acceptance gate. 10-second windows / 5-second stride → ≈9 windows per episode → ~660 k windows.
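
The window arithmetic can be sanity-checked with a minimal sketch. The 10 Hz sampling rate is inferred from the `(46, 100)` tensor shape, and the 50-second episode length is an assumption chosen to match ≈9 windows per episode; neither is stated in this plan. `window_starts` is a hypothetical helper, not a function from `training/`.

```python
def window_starts(n_steps: int, window: int, stride: int) -> list[int]:
    """Start indices of every full window; a trailing partial window is dropped."""
    if n_steps < window:
        return []
    return list(range(0, n_steps - window + 1, stride))


# Assumptions for illustration only: 10 Hz sampling, 50-second episodes.
SAMPLE_HZ = 10
EPISODE_S = 50

starts = window_starts(
    n_steps=EPISODE_S * SAMPLE_HZ,
    window=10 * SAMPLE_HZ,   # 10-second window
    stride=5 * SAMPLE_HZ,    # 5-second stride
)
n_windows = len(starts)

# ~73 k usable episodes times the per-episode window count.
total_windows = 73_000 * n_windows
```

Under these assumptions the count comes out at 9 windows per episode and 657,000 windows overall, which is in line with the ~660 k figure.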

**Split.** Held-out-by-host (primary): train on `elliott-thinkpad`, val carved from train host, test on `k-gamingcom`. Profile-stratified; `scan-and-dial` flagged as `untested_profiles` because `k-gamingcom` never ran it. Held-out-by-sample (secondary) on `io-walk`, the one profile that has ≥ 3 samples.
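
A minimal sketch of the held-out-by-host split, for the speaker notes. The episode schema (dicts with `host` and `profile` keys), the 10 % validation fraction, and the function name `split_by_host` are all assumptions made for illustration; the actual splitter in `training/` may differ.

```python
import random
from collections import defaultdict


def split_by_host(episodes, train_host, test_host, val_frac=0.1, seed=0):
    """Held-out-by-host split with a profile-stratified validation carve-out."""
    rng = random.Random(seed)

    # Group the train host's episodes by profile so val mirrors the profile mix.
    by_profile = defaultdict(list)
    for ep in episodes:
        if ep["host"] == train_host:
            by_profile[ep["profile"]].append(ep)

    train, val = [], []
    for profile in sorted(by_profile):
        group = by_profile[profile][:]
        rng.shuffle(group)
        n_val = max(1, round(len(group) * val_frac)) if len(group) > 1 else 0
        val.extend(group[:n_val])
        train.extend(group[n_val:])

    # The test set is every episode from the other host, untouched by training.
    test = [ep for ep in episodes if ep["host"] == test_host]

    # Profiles the model trains on but the test host never ran.
    untested = sorted(set(by_profile) - {ep["profile"] for ep in test})
    return train, val, test, untested


# Toy data: the test host never ran scan-and-dial.
episodes = [
    {"host": "elliott-thinkpad", "profile": p, "id": i}
    for p in ("io-walk", "scan-and-dial")
    for i in range(10)
] + [{"host": "k-gamingcom", "profile": "io-walk", "id": i} for i in range(5)]

train, val, test, untested_profiles = split_by_host(
    episodes, "elliott-thinkpad", "k-gamingcom"
)
```

Returning the untested profiles explicitly keeps them out of any cross-device average, which matches how `scan-and-dial` is handled in this plan.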

**Standardize on train only.** Per-channel for tensors, per-feature for summaries. Median imputation for NaN.
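
The train-only rule is easiest to explain with a small example. The sketch below handles a single channel or feature column using only the standard library; `fit_scaler` and `apply_scaler` are hypothetical names, and imputing before computing the mean and standard deviation is an assumption about ordering rather than a description of the project's code.

```python
import math
import statistics


def fit_scaler(train_column):
    """Fit median, mean and std on the training split only, ignoring NaN for the median."""
    observed = [x for x in train_column if not math.isnan(x)]
    median = statistics.median(observed)
    filled = [median if math.isnan(x) else x for x in train_column]
    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled) or 1.0  # guard against constant columns
    return {"median": median, "mean": mean, "std": std}


def apply_scaler(column, scaler):
    """Impute NaN with the train median, then z-score with the train mean and std."""
    return [
        ((scaler["median"] if math.isnan(x) else x) - scaler["mean"]) / scaler["std"]
        for x in column
    ]


train_col = [1.0, 2.0, float("nan"), 3.0]
test_col = [float("nan"), 4.0]

scaler = fit_scaler(train_col)            # statistics come from train only
train_z = apply_scaler(train_col, scaler)
test_z = apply_scaler(test_col, scaler)   # test reuses them, never refits
```

The test column is transformed with statistics it had no part in computing, so nothing about the held-out host leaks into preprocessing.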

**Class-weighted CE.** `armed` weight ≈ 10.8, `infecting` ≈ 2.3, `clean` ≈ 0.4 — inverse frequency, clipped.
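
One common way to get weights of this shape is the "balanced" inverse-frequency formula w_c = N / (K · n_c), followed by clipping. The sketch below assumes that formula; the clip bounds and the window counts are made up for illustration and are not the project's values.

```python
def class_weights(counts, clip=(0.1, 20.0)):
    """Inverse-frequency weights w_c = N / (K * n_c), clipped to the `clip` range."""
    total = sum(counts.values())
    k = len(counts)
    lo, hi = clip
    return {c: min(hi, max(lo, total / (k * n))) for c, n in counts.items()}


# Made-up window counts, for illustration only.
counts = {
    "clean": 500,
    "armed": 20,
    "infecting": 80,
    "infected_running": 300,
    "dormant": 100,
}
weights = class_weights(counts)
```

With these toy counts the rare `armed` class gets weight 10.0 and the common `clean` class gets 0.4, the same qualitative pattern as the weights quoted above. The clip keeps a near-empty class from dominating the loss.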

**Training loop.** AdamW, LR warmup (5 % of steps) + cosine decay, gradient clipping at 1.0, early stopping on val macro F1 (patience 8), mixed precision when CUDA is available, best-on-val checkpoint. Same loop for all five NN architectures. XGBoost uses `early_stopping_rounds=30` on val mlogloss.
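
The schedule and the stopping rule can be shown framework-free. This is a minimal sketch under assumptions: linear warmup, cosine decay to zero, and the names `lr_at` and `EarlyStopper` are hypothetical. In the real loop the returned rate would be fed to the optimizer and a checkpoint saved whenever a new best is reported.

```python
import math


def lr_at(step, total_steps, base_lr, warmup_frac=0.05):
    """Linear warmup over the first `warmup_frac` of steps, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


class EarlyStopper:
    """Stop once val macro F1 has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=8):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def update(self, val_macro_f1):
        """Return (is_new_best, should_stop); checkpoint when is_new_best is True."""
        if val_macro_f1 > self.best:
            self.best = val_macro_f1
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience


# With 1000 steps, warmup covers the first 50 and the rate peaks at base_lr.
peak = lr_at(49, total_steps=1000, base_lr=1e-3)
halfway = lr_at(525, total_steps=1000, base_lr=1e-3)
```

Tying both the checkpoint and the stop decision to the same validation metric means the reported test number always comes from the best-on-val weights, not the last epoch.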

---

## Slide 8 — Evaluation Setup

- **Held-out test set**: every episode from `k-gamingcom` (≈ 23 k windows). Never touched at train time.
- **Metrics**: macro F1, per-phase F1, per-profile F1, per-host F1 — every value with bootstrap 95 % CIs (1000 resamples).
- **Statistical significance**: paired-bootstrap of macro-F1 differences vs the top model. CI excludes 0 → significant.
- **Latency**: median µs per window at batch sizes `{1, 8, 64, 512}` — single-window timing alone is misleading because of Python overhead.
- **Realistic-vs-oracle gap**: every architecture trained twice, both numbers reported.
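
For the speaker notes, the bootstrap CI and the paired-bootstrap test can be sketched in pure Python. Assumptions: a percentile bootstrap, resampling at the window level (which ignores correlation between windows of the same episode), and hypothetical function names; the evaluation code under `training/` may resample differently.

```python
import random


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no support or predictions scores 0."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def _percentile_interval(values, alpha=0.05):
    """Lower and upper percentile bounds of a list of bootstrap statistics."""
    values = sorted(values)
    lo = values[int((alpha / 2) * (len(values) - 1))]
    hi = values[int((1 - alpha / 2) * (len(values) - 1))]
    return lo, hi


def bootstrap_ci(y_true, y_pred, labels, n_resamples=1000, seed=0):
    """Percentile 95 % CI of macro F1, resampling windows with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(
            macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx], labels)
        )
    return _percentile_interval(stats)


def paired_bootstrap_diff(y_true, pred_top, pred_other, labels, n_resamples=1000, seed=0):
    """CI of macro-F1(top) minus macro-F1(other), using the same resample for both models."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        diffs.append(
            macro_f1(yt, [pred_top[i] for i in idx], labels)
            - macro_f1(yt, [pred_other[i] for i in idx], labels)
        )
    lo, hi = _percentile_interval(diffs)
    significant = not (lo <= 0.0 <= hi)  # CI excludes 0
    return lo, hi, significant
```

Scoring both models on the same resampled indices is what makes the test paired: variation caused by which windows were drawn cancels out of the difference, so the interval reflects the gap between models and not the sampling noise they share.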

---

## Slide 9 — Evaluation Results

The **comparison_v2.md** table from `training/eval_/run.py`. Smoke-set numbers (200 episodes/host, 5 epochs — full-scale numbers will replace these for the final deck):

| Model | Test macro F1 (95 % CI) | Significant vs top? |
|---|---|---|
| **gbt (oracle)** | 0.557 [0.543, 0.571] | — (anchor) |
| mlp | 0.176 | yes (CI excludes 0) |
| transformer | 0.113 | yes |
| lstm | 0.112 | yes |
| gru | 0.092 | yes |
| cnn | 0.089 | yes |

**Visualization**: confusion matrix grid (one per model) from `reports/eval/<model>_<mode>_confusion.png`.

**Headline claim** (smoke-version, will be re-tuned at full scale):

> At the data scale we have, GBT generalizes to the held-out host significantly better than every NN architecture — including the Transformer. The result is consistent with Natsos & Symeonidis 2025's finding that Transformer dominates only as the dataset grows past ~1k samples per family; below that, simpler inductive biases win.

---

## Slide 10 — Case Study / Demonstration

The **live dashboard** (`https://dashboard.wg`):

- Scene 2 (collect): live ingest counter from the producer that tails `index.jsonl`.
- Scene 6 (attacks): per-profile attack-envelope thumbnails from `producers/profiles.py`.
- Scene 7 (chunking): predictions emitted by `producers/replay.py` running an episode at wall-clock speed.
- Scene 8 (models): macro F1 bars from `producers/metrics.py` (re-published every 20 s for reconnects).
- Scene 9 (knn): PCA-2 scatter colored by phase from each model's saved projection.
- Scene 10 (perf): accuracy-vs-latency scatter from `producers/perf.py`, batch-size 64.

Visual: one screenshot of `dashboard.wg` with live data for the slide itself, then click through scenes 2 → 10 live.

---

## Slide 11 — Theoretical Contributions [opt]

- **Cross-source clock alignment.** Producers were inconsistent about `t_mono_ns` semantics (episode-relative vs system-uptime); we canonicalize on `t_wall_ns`. Generalizable to any multi-source telemetry pipeline.
- **Held-out-host as the primary cross-device generalization claim.** Most prior work reports in-distribution metrics; we report the harder number and let the reader see the gap.
- **Realistic-vs-oracle ablation** as the operational measure of "what the deployed model is missing."
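
If the clock-alignment bullet needs a concrete illustration, a sketch along these lines may help. The field names `t_wall_ns` and `t_mono_ns` come from the bullet above; the record layout, the `src` field, and both function names are assumptions, not the pipeline's actual code.

```python
import heapq


def merge_by_wall_clock(*sources):
    """Merge per-source record lists into one stream ordered by `t_wall_ns`.

    `t_mono_ns` is deliberately ignored because its base differs between producers.
    """
    ordered = (sorted(src, key=lambda r: r["t_wall_ns"]) for src in sources)
    return list(heapq.merge(*ordered, key=lambda r: r["t_wall_ns"]))


def episode_relative(records):
    """Add `t_rel_ns`, the offset from the earliest wall-clock timestamp."""
    if not records:
        return []
    t0 = min(r["t_wall_ns"] for r in records)
    return [{**r, "t_rel_ns": r["t_wall_ns"] - t0} for r in records]


# Toy records: one source uses episode-relative t_mono_ns, the other system uptime.
host = [
    {"src": "host", "t_wall_ns": 30, "t_mono_ns": 5},
    {"src": "host", "t_wall_ns": 10, "t_mono_ns": 1},
]
guest = [{"src": "guest", "t_wall_ns": 20, "t_mono_ns": 900}]

merged = episode_relative(merge_by_wall_clock(host, guest))
```

Ordering by `t_mono_ns` would have placed the guest record last even though it falls between the two host records on the wall clock.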

---

## Slide 12 — Practical Contributions [opt]

- **Open-source training stack** (`training/`): six architectures, schema-hashed checkpoints, validator, dashboard producers — directly reusable for any project where labeled per-window resource-utilization data is available.
- **Live dashboard** at `dashboard.wg` with both pipeline-state and trained-model events: the working example of "model running live" that assignment §10 (case study) asks for.
- **Validator + producer machinery** that catches data-quality issues (torn writes in `index.jsonl`, host silently shipping without bridge pcap, scan-and-dial absent from one host) instead of training on them.
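
The torn-write check is a good one-slide example of "catch it, don't train on it." A minimal sketch, assuming `index.jsonl` holds one JSON object per line; `read_index` is a hypothetical name and the real validator likely checks much more.

```python
import json


def read_index(lines):
    """Parse JSONL records, returning (records, torn_line_numbers).

    A line that fails to parse, or parses to something other than an object,
    is reported as torn instead of raising or being silently dropped.
    """
    records, torn = [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            torn.append(lineno)
            continue
        if isinstance(obj, dict):
            records.append(obj)
        else:
            torn.append(lineno)
    return records, torn


# Toy input: the second line was cut off mid-write.
lines = ['{"episode": "a"}', '{"episode": "b", "ho', '{"episode": "c"}']
records, torn = read_index(lines)
```

Reporting the torn line numbers, not just skipping them, is what lets the validator turn a data-quality problem into a finding for the limitations slide.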

---

## Slide 13 — Design Principles [opt]

- *No silent downgrade* — every host either ships data that meets the gate or produces nothing.
- *No silent schema drift* — every model checkpoint refuses to load if its training-time channel/feature schema doesn't match what `_features.py` produces today.
- *Honest CIs over point estimates* — every test number we report has bootstrap bounds.
- *Held out by host, not by time slice* — within-sample time splits are easy and dishonest about generalization.
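
The "no silent schema drift" principle can be illustrated with a short sketch. It assumes the checkpoint stores a hash of the ordered channel or feature names; `schema_hash`, `check_checkpoint`, `SchemaMismatchError`, and the channel names are hypothetical, and the project's checkpoint format may differ.

```python
import hashlib
import json


class SchemaMismatchError(RuntimeError):
    """Raised when a checkpoint was trained against a different feature schema."""


def schema_hash(channel_names):
    """Order-sensitive SHA-256 over the channel/feature names."""
    payload = json.dumps(list(channel_names), separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def check_checkpoint(checkpoint, current_channels):
    """Refuse to use `checkpoint` unless its stored hash matches today's schema."""
    expected = checkpoint["schema_hash"]
    actual = schema_hash(current_channels)
    if expected != actual:
        raise SchemaMismatchError(
            f"checkpoint schema {expected[:12]} != current schema {actual[:12]}"
        )
    return checkpoint


# Hypothetical channel names, for illustration only.
trained_on = ["cpu_user", "net_rx_bytes"]
ckpt = {"schema_hash": schema_hash(trained_on)}
check_checkpoint(ckpt, trained_on)  # same schema: loads fine
```

Because the hash is order-sensitive, even a reordering of channels is rejected; a reordered input would otherwise load without error and feed the wrong channel to every learned weight.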

---

## Slide 14 — Limitations [opt]

- **Two hosts.** Cross-device generalization with N=2 is a single fold of leave-one-host-out CV; with more hosts the methodology becomes more rigorous.
- **Twelve unique sample names total**, with two profiles having 1 sample each. Held-out-by-sample is feasible only on `io-walk`.
- **`scan-and-dial` not present on `k-gamingcom`** — that profile is trained but cannot be evaluated cross-device. Reported as `untested_profiles`, never silently averaged.
- **Producer-side bugs** found during validation: `receiver/store.py:130` torn write (1 occurrence in 76 k); ~24 k k-gamingcom episodes shipped without netflow (silent collector failure); cross-source clock-base inconsistency. All surfaced in `training/README.md` for follow-up.
- **Single training corpus, single attack manifest.** Generalization to *new attack manifests* is a stronger claim than this corpus supports.

---

## Slide 15 — Conclusion and Future Work

**Conclusion (1 line):** The realistic-mode model can be trained, evaluated honestly cross-device, and deployed against live telemetry — but the right architecture depends on data scale, and the cross-device gap is the metric that matters.

**Future work:**
- **Self-supervised pretraining** on `clean`-only windows (Masked-Timestep Reconstruction + Volume-of-Hypersphere Minimization, per LogBERT). The goal is to flag novel malware without labels. Implementation in `training/models/transformer_ssl.py`.
- **Trust-over-time scoring** per IEEE 9881803 — per-window confidence accumulated over a sliding decision window, reset trigger at threshold.
- **Integrated Gradients attribution** per (channel, timestep) — *which signals drove the call* — so the writeup can show evidence, not just confidence numbers. Implementation in `training/xai/integrated_gradients.py`.
- **More hosts in the fleet.** The data generation and shipping pipeline is in place, so adding hosts is a config change. With N ≥ 4 hosts, leave-one-host-out CV becomes meaningful.
- **More distinct samples per profile** — 12 samples is too few to claim novel-malware generalization; the current dataset only supports cross-device generalization.

---

## Slide 16 — References / Acknowledgments

The same APA reference list as `docs/project_brief.md` §9. Lab acknowledgments: Dr. Mejias, Raul, the spectral lab infrastructure. Tools acknowledgments: `libVMI`, `Drakvuf`, `XGBoost`, `PyTorch`, `scikit-learn`, `pyarrow`.

---

## Notes for the deck builder

- Slides 11–14 are marked optional in the assignment — keep them if you have time, drop them if you're tight. Slides 1, 3, 5, 7, 9, 10, 15 are the *required* spine.
- Every metric on Slide 9 should come from `reports/eval/comparison_v2.md` directly — copy-paste the markdown and let the deck render it.
- Slide 10 (live demo) is more memorable than any chart — bring up `dashboard.wg`, scroll through scenes 2–10, let it talk.
- The content of the **[opt]** Design Principles slide (Slide 13) is already in the repo README, so if you cut the slide, the principles still live in the artifact.