LogBERT-style self-supervised Transformer pretrain on `clean`-only
windows, plus Integrated Gradients attribution for any tensor model.
Both directly answer the assignment's §8 'next steps in unsupervised
learning' requirement and Natsos & Symeonidis 2025's RQ3 on
explainability.
Pretrain (training/models/transformer_ssl.py +
trainer/run_ssl.py):
- Masked Timestep Reconstruction (MTR) — random 15% of timesteps
zeroed, encoder + per-channel head reconstructs from the rest.
Loss: MSE over masked positions.
- Volume of Hypersphere Minimization (VHM, Deep SVDD-style) — pull
learnable [DIST] token embedding toward a frozen center vector
initialized as the mean over clean train. Loss: ||h_dist - c||^2.
- Calibrated anomaly threshold at user-configurable target FPR
(default 5%) on clean-val distance distribution.
- Trained ONLY on `clean`-phase windows; the model never sees a
labeled malware sample yet flags any window that doesn't look
clean — including novel malware the supervised classifier never
saw. Uses the same schema-hashed checkpoint format as the
supervised models so loaders refuse mismatched feature schemas.
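
A minimal sketch of the two pretraining losses and the threshold calibration listed above, written in plain Python on nested lists rather than tensors. The function names and the quantile rule are assumptions for illustration; this is not the code in training/models/transformer_ssl.py.

```python
def masked_mse(x, x_hat, mask):
    """MTR loss: mean squared error over masked timesteps only.

    x, x_hat: [channels][timesteps] nested lists.
    mask: [timesteps] booleans, True where the timestep was zeroed.
    """
    sq_err, n = 0.0, 0
    for row, row_hat in zip(x, x_hat):
        for v, v_hat, m in zip(row, row_hat, mask):
            if m:
                sq_err += (v - v_hat) ** 2
                n += 1
    return sq_err / n if n else 0.0


def sq_distance(h_dist, center):
    """VHM loss and anomaly score: ||h_dist - c||^2."""
    return sum((h - c) ** 2 for h, c in zip(h_dist, center))


def calibrate_threshold(clean_val_scores, target_fpr=0.05):
    """Choose a threshold that at most target_fpr of clean-val scores exceed.

    Assumption: a simple empirical quantile of the clean-val distances.
    """
    ranked = sorted(clean_val_scores)
    k = int((1.0 - target_fpr) * len(ranked))
    k = min(max(k, 0), len(ranked) - 1)
    return ranked[k]
```

Under these assumptions, a window is flagged when `sq_distance(h_dist, center) > threshold`.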
XAI (training/xai/integrated_gradients.py):
- Per-(channel, timestep) attribution via path-integrated gradients,
approximated with a midpoint Riemann sum. Works for
cnn/gru/lstm/transformer/transformer_ssl.
- Per-phase mean |IG| heatmaps under reports/xai/<model>/<phase>.png,
top-k channel importance per phase as JSON. Smoke-verified on the
trained CNN: top channel for `clean` is guest.cpu_iowait (sensible
— clean = idle = high iowait).
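
The midpoint approximation can be sketched on a toy model as follows. `toy_model`, `toy_grad`, and the weights `W` are hypothetical stand-ins with an analytic gradient; a real module would flatten a (channel, timestep) window and obtain gradients from the network through autograd.

```python
import math


def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint-Riemann approximation of Integrated Gradients.

    grad_fn(point) must return the gradient of the model output
    with respect to each input coordinate at `point`.
    """
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            total[i] += g
    # Scale the averaged gradient by the input delta.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical toy model: logistic function of a weighted sum.
W = [2.0, -1.0, 0.5]


def toy_model(x):
    z = sum(w * xi for w, xi in zip(W, x))
    return 1.0 / (1.0 + math.exp(-z))


def toy_grad(x):
    p = toy_model(x)
    return [p * (1.0 - p) * w for w in W]
```

A quick sanity check is the completeness property: the attributions should sum to approximately `toy_model(x) - toy_model(baseline)`.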
Project brief and slide planner:
- docs/project_brief.md — full draft of the assignment's required
sections 1–9 (problem, research question, ML task type with
justification, six supervised algorithms with assumptions, dataset
description with full validation breakdown, evaluation metrics with
rationale, current progress, lit review with 11 APA citations,
next steps for unsupervised, references).
- docs/slide_planner.md — all 16 slides filled with content tied to
specific files and metrics from this codebase, not generic
placeholders.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CIS 490 — Slide Deck Planning Template (Filled)
Maps each of the assignment's 16 slides to concrete content drawn from
training/, tools/, the validator output, and the trained models.
Optional slides are marked [opt] and can be cut for time.
Slide 1 — Title Slide
- Title: Behavioral Malware Detection from Hypervisor-Layer VM Telemetry: A Cross-Architecture Comparison Under Cross-Device Generalization
- Authors / affiliation / date / advisor (Dr. Mejias).
- One subtitle line: Six architectures × two deployment modes, evaluated on held-out host.
Slide 2 — Motivation
Most malware doesn't look like malware in a database — it looks like a process behaving badly.
- 92 % of threats now use TLS encryption (SonicWall 2022, cited in Melvin 2025) — payload inspection is dead, behavioral detection is what's left.
- Static analysis defeated by obfuscation and packing; signature databases miss zero-days; in-guest detectors can be disabled by the malware they're trying to catch.
- The deployable answer: watch behavior from outside the VM at the hypervisor layer.
Visual: lift the dashboard's intro scene tagline.
Slide 3 — Problem Statement
One sentence:
Train a model that classifies a 10-second window of out-of-VM telemetry as one of `{clean, armed, infecting, infected_running, dormant}`, and measure whether it generalizes from the device it was trained on to a different device it has never seen.
Second sentence:
The honesty bar is cross-device test-set macro F1 with 95 % CIs, not in-distribution validation.
Slide 4 — Research Gaps + Questions
Gaps surfaced by the literature review:
- Most prior work (Melvin 2025, Natsos 2025) reports in-distribution metrics; cross-device generalization is rarely measured.
- Architecture choice for resource-utilization time-series is contested: Melvin (CNN > RNN), Natsos (Transformer > LSTM > CNN). No head-to-head with controlled training methodology and statistical significance.
- Realistic-vs-oracle ablation (host-side `/proc` removed at deployment) is not reported in either paper.
Research question (single): Across six architectures, which best generalizes to a held-out host, and what does each lose when restricted to in-deployment features?
Slide 5 — Proposed Solution: Overview
A one-pane diagram (the dashboard's pipeline panel works):
Episodes (.tar.zst) → Validator → Feature & Tensor Builder → Held-out-by-Host Split → 6 Architectures × 2 Modes → Bootstrap-CI Eval → Comparison Report
Three sentences:
- We collected ~73 k labeled VM episodes across two physical hosts.
- We trained six architectures (GBT, MLP, CNN, GRU, LSTM, Transformer) twice — once with all telemetry, once with only what a deployed model would see — using a unified training loop with class-weighted loss, early stopping on val macro F1, and best-on-val checkpointing.
- We evaluated all twelve on the unseen host with bootstrap CIs and paired-bootstrap significance.
Slide 6 — Model Design
Side-by-side architecture cards (all six). Visual: parameter counts + inputs:
| Model | Input | Params (smoke) | Family |
|---|---|---|---|
| GBT | (230,) summary | ~30 KB serialized | Tree |
| MLP | (230,) summary | 104 K | Dense |
| CNN | (46, 100) tensor | 101 K | Conv |
| GRU | (46, 100) tensor | 161 K | RNN |
| LSTM | (46, 100) tensor | 214 K | RNN |
| Transformer | (46, 100) tensor | 76 K | Attention |
Note the param counts deliberately stay within ~3× of each other — the comparison is "what does the inductive bias buy you," not "more parameters."
Slide 7 — Methodology
Data. 76,660 episodes shipped from 2 hosts. 95.2 % training-usable after the §4.6 acceptance gate. 10-second windows / 5-second stride → ≈9 windows per episode → ~660 k windows.
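The window arithmetic above can be checked with a short sketch. The 50-second episode length is an assumption chosen to reproduce the stated ≈9 windows per episode; the source does not give the episode duration.

```python
def windows_per_episode(duration_s, window_s=10, stride_s=5):
    """Number of full sliding windows that fit in one episode."""
    if duration_s < window_s:
        return 0
    return (duration_s - window_s) // stride_s + 1


EPISODES_SHIPPED = 76_660   # from the slide text
USABLE_FRACTION = 0.952     # from the slide text
ASSUMED_EPISODE_S = 50      # assumption, not stated in the source

usable_episodes = round(EPISODES_SHIPPED * USABLE_FRACTION)
total_windows = usable_episodes * windows_per_episode(ASSUMED_EPISODE_S)
print(usable_episodes, total_windows)
```

Under that assumption the totals land close to the ~73 k episodes and ~660 k windows quoted in this planner.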
Split. Held-out-by-host (primary): train on elliott-thinkpad, val carved from train host, test on k-gamingcom. Profile-stratified; scan-and-dial flagged as untested_profiles because k-gamingcom never ran it. Held-out-by-sample (secondary) on the one profile that has ≥ 3 samples.
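A minimal sketch of the held-out-by-host split, assuming each episode is a dict with hypothetical `host` and `profile` keys. The every-fifth-episode validation carve is an illustrative choice, not the project's actual rule.

```python
from collections import defaultdict


def split_by_host(episodes, train_host, test_host, val_every=5):
    """Train/val from one host, test from the other, stratified by profile."""
    by_profile = defaultdict(list)
    test = []
    for ep in episodes:
        if ep["host"] == test_host:
            test.append(ep)
        elif ep["host"] == train_host:
            by_profile[ep["profile"]].append(ep)

    train, val = [], []
    for eps in by_profile.values():
        for i, ep in enumerate(eps):
            # Every val_every-th episode of each profile goes to validation.
            (val if i % val_every == val_every - 1 else train).append(ep)

    # Profiles seen in training that the test host never ran.
    untested_profiles = sorted(set(by_profile) - {ep["profile"] for ep in test})
    return train, val, test, untested_profiles
```

Returning `untested_profiles` explicitly keeps a profile such as scan-and-dial out of any cross-device average instead of silently dropping it.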
Standardize on train only. Per-channel for tensors, per-feature for summaries. Median imputation for NaN.
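One way to express train-only standardization with median imputation, sketched with the standard library. Imputing before computing the mean and standard deviation is an assumption; the source does not specify the order.

```python
import math
import statistics


def fit_scaler(train_rows):
    """Per-feature (median, mean, std) computed from training rows only."""
    stats = []
    for col in zip(*train_rows):
        observed = [v for v in col if not math.isnan(v)]
        med = statistics.median(observed) if observed else 0.0
        filled = [med if math.isnan(v) else v for v in col]
        mean = statistics.fmean(filled)
        std = statistics.pstdev(filled) or 1.0  # guard constant features
        stats.append((med, mean, std))
    return stats


def transform(rows, stats):
    """Apply the training statistics to any split (train, val, or test)."""
    return [
        [((med if math.isnan(v) else v) - mean) / std
         for v, (med, mean, std) in zip(row, stats)]
        for row in rows
    ]
```

The val and test splits reuse `stats` from the training host, so no statistic from the held-out host leaks into preprocessing.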
Class-weighted CE. armed weight ≈ 10.8, infecting ≈ 2.3, clean ≈ 0.4 — inverse frequency, clipped.
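A sketch of inverse-frequency class weights with clipping. The `total / (n_classes * count)` formula is the common "balanced" heuristic and is assumed here; the clip bounds and the example counts are hypothetical, not the project's values.

```python
def class_weights(counts, min_w=0.1, max_w=20.0):
    """Inverse-frequency weights, clipped to [min_w, max_w]."""
    total = sum(counts.values())
    k = len(counts)
    return {
        label: min(max(total / (k * n), min_w), max_w)
        for label, n in counts.items()
    }


# Hypothetical window counts per phase, for illustration only.
example_counts = {
    "clean": 500,
    "armed": 20,
    "infecting": 80,
    "infected_running": 300,
    "dormant": 100,
}
weights = class_weights(example_counts)
```

Rare phases such as armed receive the largest weights, and clipping keeps a nearly empty class from dominating the loss.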
Training loop. AdamW, LR warmup (5 % of steps) + cosine decay, gradient clipping at 1.0, early stop on val macro F1 patience 8, mixed precision when CUDA, best-on-val checkpoint. Same loop for all five NN architectures. XGBoost uses early_stopping_rounds=30 on val mlogloss.
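The warmup-plus-cosine schedule can be sketched as a pure function of the step index. The base learning rate and the linear warmup shape are assumptions; only the 5 % warmup fraction and the cosine decay come from the text above.

```python
import math


def lr_at(step, total_steps, base_lr=1e-3, warmup_frac=0.05):
    """Linear warmup for the first warmup_frac of steps, then cosine decay."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

A function of this shape can be wrapped in whatever scheduler hook the training loop uses, since it depends only on the step count.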
Slide 8 — Evaluation Setup
- Held-out test set: every episode from `k-gamingcom` (≈ 23 k windows). Never touched at train time.
- Metrics: macro F1, per-phase F1, per-profile F1, per-host F1, each reported with bootstrap 95 % CIs (1000 resamples).
- Statistical significance: paired-bootstrap of macro-F1 differences vs the top model. CI excludes 0 → significant.
- Latency: median µs per window at batch sizes `{1, 8, 64, 512}`; single-window timing alone is misleading because of Python overhead.
- Realistic-vs-oracle gap: every architecture trained twice, both numbers reported.
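The bootstrap CI and paired-bootstrap test above can be sketched in plain Python. The percentile method and the choice to score a class that is absent from a resample as F1 = 0 are assumptions; the evaluation code may handle both differently.

```python
import random


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


def percentile_ci(samples, level=0.95):
    """Percentile interval over bootstrap statistics."""
    ranked = sorted(samples)
    lo = ranked[int((1 - level) / 2 * (len(ranked) - 1))]
    hi = ranked[int((1 + level) / 2 * (len(ranked) - 1))]
    return lo, hi


def bootstrap_ci(y_true, y_pred, labels, n_resamples=1000, seed=0):
    """95 % CI for one model's macro F1."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([y_true[i] for i in idx],
                              [y_pred[i] for i in idx], labels))
    return percentile_ci(stats)


def paired_bootstrap(y_true, pred_a, pred_b, labels, n_resamples=1000, seed=0):
    """CI of macro-F1(A) - macro-F1(B) using the same resampled windows."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(macro_f1(t, a, labels) - macro_f1(t, b, labels))
    lo, hi = percentile_ci(diffs)
    significant = not (lo <= 0.0 <= hi)  # CI excludes 0
    return lo, hi, significant
```

Resampling the same indices for both models is what makes the test paired: per-window difficulty cancels out of the difference.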
Slide 9 — Evaluation Results
The comparison_v2.md table from training/eval_/run.py. Smoke-set numbers (200 episodes/host, 5 epochs — full-scale numbers will replace these for the final deck):
| Model | Test macro F1 (95 % CI) | Significant vs top? |
|---|---|---|
| gbt (oracle) | 0.557 [0.543, 0.571] | — (anchor) |
| mlp | 0.176 | yes (CI excludes 0) |
| transformer | 0.113 | yes |
| lstm | 0.112 | yes |
| gru | 0.092 | yes |
| cnn | 0.089 | yes |
Visualization: confusion matrix grid (one per model) from reports/eval/<model>_<mode>_confusion.png.
Headline claim (smoke-version, will be re-tuned at full scale):
At the data scale we have, GBT generalizes to the held-out host significantly better than every NN architecture — including the Transformer. The result is consistent with Natsos & Symeonidis 2025's finding that Transformer dominates only as the dataset grows past ~1k samples per family; below that, simpler inductive biases win.
Slide 10 — Case Study / Demonstration
The live dashboard (https://dashboard.wg):
- Scene 2 (collect): live ingest counter from the `index.jsonl` tailing producer.
- Scene 6 (attacks): per-profile attack-envelope thumbnails from `producers/profiles.py`.
- Scene 7 (chunking): predictions emitted by `producers/replay.py` running an episode at wall-clock speed.
- Scene 8 (models): macro F1 bars from `producers/metrics.py` (re-published every 20 s for reconnects).
- Scene 9 (knn): PCA-2 scatter colored by phase from each model's saved projection.
- Scene 10 (perf): accuracy-vs-latency scatter from `producers/perf.py`, batch-size 64.
Visual: one screenshot of dashboard.wg with live data, clicked through scenes 2 → 10.
Slide 11 — Theoretical Contributions [opt]
- Cross-source clock alignment. Producers were inconsistent about `t_mono_ns` semantics (episode-relative vs system-uptime); we canonicalize on `t_wall_ns`. Generalizable to any multi-source telemetry pipeline.
- Held-out-host as the primary cross-device generalization claim. Most prior work reports in-distribution metrics; we report the harder number and let the reader see the gap.
- Realistic-vs-oracle ablation as the operational measure of "what the deployed model is missing."
Slide 12 — Practical Contributions [opt]
- Open-source training stack (`training/`): six architectures, schema-hashed checkpoints, validator, dashboard producers. Directly reusable for any project where labeled per-window resource-utilization data is available.
- Live dashboard at `dashboard.wg` with both pipeline-state and trained-model events: the working example of "model running live" the assignment §10 (case study) wants.
- Validator + producer machinery that catches data-quality issues (torn writes in `index.jsonl`, host silently shipping without bridge pcap, scan-and-dial absent from one host) instead of training on them.
Slide 13 — Design Principles [opt]
- No silent downgrade: every host either ships data that meets the gate or produces nothing.
- No silent schema drift: every model checkpoint refuses to load if its training-time channel/feature schema doesn't match what `_features.py` produces today.
- Honest CIs over point estimates: every test number we report has bootstrap bounds.
- Held out by host, not by time slice: within-sample time splits are easy and dishonest about generalization.
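The "no silent schema drift" principle can be sketched as a hash check at load time. The checkpoint layout and function names below are hypothetical, and the second channel name is made up for illustration; only the idea of refusing a mismatched schema comes from this project.

```python
import hashlib
import json


def schema_hash(channel_names):
    """Order-sensitive SHA-256 over the channel/feature names."""
    payload = json.dumps(list(channel_names), separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_checkpoint(weights, channel_names):
    """Bundle weights with the hash of the schema they were trained on."""
    return {"schema_hash": schema_hash(channel_names), "weights": weights}


def load_checkpoint(ckpt, current_channel_names):
    """Refuse to load when today's schema differs from training time."""
    expected = schema_hash(current_channel_names)
    if ckpt["schema_hash"] != expected:
        raise ValueError(
            f"schema mismatch: checkpoint {ckpt['schema_hash'][:12]} "
            f"vs current {expected[:12]}"
        )
    return ckpt["weights"]
```

Hashing the ordered list means a reordered channel is treated as a mismatch, which matters because tensor models index channels by position.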
Slide 14 — Limitations [opt]
- Two hosts. Cross-device generalization with N=2 is a single fold of leave-one-host-out CV; with more hosts the methodology becomes more rigorous.
- Twelve unique sample names total, with two profiles having 1 sample each. Held-out-by-sample is feasible only on `io-walk`.
- `scan-and-dial` not present on `k-gamingcom`: that profile is trained but cannot be evaluated cross-device. Reported as `untested_profiles`, never silently averaged.
- Producer-side bugs found during validation: `receiver/store.py:130` torn write (1 occurrence in 76 k); ~24 k k-gamingcom episodes shipped without netflow (silent collector failure); cross-source clock-base inconsistency. All surfaced in `training/README.md` for follow-up.
- Single training corpus, single attack manifest. Generalization to new attack manifests is a stronger claim than this corpus supports.
Slide 15 — Conclusion and Future Work
Conclusion (1 line): The realistic-mode model can be trained, evaluated honestly cross-device, and deployed against live telemetry — but the right architecture depends on data scale, and the cross-device gap is the metric that matters.
Future work:
- Self-supervised pretrain on `clean`-only windows (Masked-Timestep Reconstruction + Volume-of-Hypersphere Minimization, per LogBERT). Detects novel malware without labels. Implementation in `training/models/transformer_ssl.py`.
- Trust-over-time scoring per IEEE 9881803: per-window confidence accumulated over a sliding decision window, reset trigger at threshold.
- Integrated Gradients attribution per (channel, timestep), showing which signals drove the call, so the writeup can show evidence, not just confidence numbers. Implementation in `training/xai/integrated_gradients.py`.
- More hosts in the fleet: the data generation and shipping pipeline is in place; adding hosts is a config change. With N ≥ 4 hosts, leave-one-out CV becomes meaningful.
- More distinct samples per profile: 12 samples is too few to claim novel-malware generalization; the current dataset only supports cross-device generalization.
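One possible reading of the trust-over-time item, sketched under loud assumptions: the class name, window size, mean aggregation, and reset behavior are all hypothetical, and this is not taken from IEEE 9881803.

```python
from collections import deque


class TrustWindow:
    """Accumulate per-window malicious confidence over a sliding window."""

    def __init__(self, size=6, threshold=0.8):
        self.scores = deque(maxlen=size)  # oldest score drops out automatically
        self.threshold = threshold

    def update(self, p_malicious):
        """Add one per-window score; return True when the alarm triggers."""
        self.scores.append(p_malicious)
        full = len(self.scores) == self.scores.maxlen
        if full and sum(self.scores) / len(self.scores) >= self.threshold:
            self.scores.clear()  # reset after the trigger fires
            return True
        return False
```

Averaging over several windows means one high-confidence spike does not raise an alarm on its own, at the cost of a detection delay of up to `size` windows.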
Slide 16 — References / Acknowledgments
The same APA reference list as docs/project_brief.md §9. Lab acknowledgments: Dr. Mejias, Raul, the spectral lab infrastructure. Tools acknowledgments: libVMI, Drakvuf, XGBoost, PyTorch, scikit-learn, pyarrow.
Notes for the deck builder
- Slides 11–14 are marked optional in the assignment — keep them if you have time, drop them if you're tight. Slides 1, 3, 5, 7, 9, 10, 15 are the required spine.
- Every metric on Slide 9 should come from `reports/eval/comparison_v2.md` directly: copy-paste the markdown and let the deck render it.
- Slide 10 (live demo) is more memorable than any chart: bring up `dashboard.wg`, scroll through scenes 2–10, let it talk.
- The repo has the `[opt]`-marked `Design Principles` slide built into the README. If you cut Slide 13, the principles still live in the artifact.