Max 3ea6bca6f0 training: self-supervised pretrain + IG XAI + project brief / slide planner

LogBERT-style self-supervised Transformer pretrain on `clean`-only
windows, plus Integrated Gradients attribution for any tensor model.
Both directly answer the assignment's §8 'next steps in unsupervised
learning' requirement and Natsos & Symeonidis 2025's RQ3 on
explainability.

Pretrain (training/models/transformer_ssl.py +
trainer/run_ssl.py):
  - Masked Timestep Reconstruction (MTR) — random 15% of timesteps
    zeroed, encoder + per-channel head reconstructs from the rest.
    Loss: MSE over masked positions.
  - Volume of Hypersphere Minimization (VHM, Deep SVDD-style) — pull
    learnable [DIST] token embedding toward a frozen center vector
    initialized as the mean over clean train. Loss: ||h_dist - c||^2.
  - Calibrated anomaly threshold at user-configurable target FPR
    (default 5%) on clean-val distance distribution.
  - Trained ONLY on `clean`-phase windows; the model never sees a
    labeled malware sample yet flags any window that doesn't look
    clean — including novel malware the supervised classifier never
    saw. Uses the same schema-hashed checkpoint format as the
    supervised models so loaders refuse mismatched feature schemas.
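
The threshold calibration described above can be sketched as follows. This is a minimal illustration, not the code in trainer/run_ssl.py: the function names, the toy 2-d embeddings, and the quantile convention are all assumptions.

```python
import math

def squared_distance(h_dist, center):
    """VHM-style score: squared Euclidean distance ||h_dist - c||^2."""
    return sum((a - b) ** 2 for a, b in zip(h_dist, center))

def calibrate_threshold(clean_val_scores, target_fpr=0.05):
    """Pick the score at the (1 - target_fpr) quantile of clean-val distances.

    ASSUMPTION: a window is flagged when score > threshold, so at most
    target_fpr of the clean validation windows are flagged.
    """
    ordered = sorted(clean_val_scores)
    k = math.ceil((1.0 - target_fpr) * len(ordered)) - 1
    k = min(max(k, 0), len(ordered) - 1)
    return ordered[k]

def is_anomalous(score, threshold):
    return score > threshold

# Hypothetical clean-val embeddings scattered along one axis around a center.
center = [0.0, 0.0]
clean_val = [[0.01 * i, 0.0] for i in range(1, 101)]
scores = [squared_distance(h, center) for h in clean_val]
threshold = calibrate_threshold(scores, target_fpr=0.05)
measured_fpr = sum(is_anomalous(s, threshold) for s in scores) / len(scores)
far_window_flagged = is_anomalous(squared_distance([5.0, 5.0], center), threshold)
```

Calibrating on clean-val rather than clean-train keeps the threshold from being fit to windows the encoder has already memorized.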

XAI (training/xai/integrated_gradients.py):
  - Per-(channel, timestep) attribution via path-integrated gradients
    over Riemann-mid-point steps. Works for cnn/gru/lstm/transformer/
    transformer_ssl.
  - Per-phase mean |IG| heatmaps under reports/xai/<model>/<phase>.png,
    top-k channel importance per phase as JSON. Smoke-verified on the
    trained CNN: top channel for `clean` is guest.cpu_iowait (sensible
    — clean = idle = high iowait).

Project brief and slide planner:
  - docs/project_brief.md — full draft of the assignment's required
    sections 1–9 (problem, research question, ML task type with
    justification, six supervised algorithms with assumptions, dataset
    description with full validation breakdown, evaluation metrics with
    rationale, current progress, lit review with 11 APA citations,
    next steps for unsupervised, references).
  - docs/slide_planner.md — all 16 slides filled with content tied to
    specific files and metrics from this codebase, not generic
    placeholders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-08 01:19:41 -05:00


CIS 490 — Slide Deck Planning Template (Filled)

Maps each of the assignment's 16 slides to concrete content drawn from training/, tools/, the validator output, and the trained models. Optional slides are marked [opt] and can be cut for time.

Slide 1 — Title Slide

Title: Behavioral Malware Detection from Hypervisor-Layer VM Telemetry: A Cross-Architecture Comparison Under Cross-Device Generalization
Authors / affiliation / date / advisor (Dr. Mejias).
One subtitle line: Six architectures × two deployment modes, evaluated on held-out host.

Slide 2 — Motivation

Most malware doesn't look like malware in a database — it looks like a process behaving badly.

92 % of threats now use TLS encryption (SonicWall 2022, cited in Melvin 2025) — payload inspection is dead, behavioral detection is what's left.
Static analysis defeated by obfuscation and packing; signature databases miss zero-days; in-guest detectors can be disabled by the malware they're trying to catch.
The deployable answer: watch behavior from outside the VM at the hypervisor layer.

Visual: lift the dashboard's intro scene tagline.

Slide 3 — Problem Statement

One sentence:

Train a model that classifies a 10-second window of out-of-VM telemetry as one of {clean, armed, infecting, infected_running, dormant}, and measure whether it generalizes from the device it was trained on to a different device it has never seen.

Second sentence:

The honesty bar is cross-device test-set macro F1 with 95 % CIs, not in-distribution validation.

Slide 4 — Research Gaps + Questions

Gaps surfaced by the literature review:

Most prior work (Melvin 2025, Natsos 2025) reports in-distribution metrics; cross-device generalization is rarely measured.
Architecture choice for resource-utilization time-series is contested: Melvin (CNN > RNN), Natsos (Transformer > LSTM > CNN). No head-to-head with controlled training methodology and statistical significance.
Realistic-vs-oracle ablation (host-side /proc removed at deployment) is not reported in either paper.

Research question (single): Across six architectures, which best generalizes to a held-out host, and what does each lose when restricted to in-deployment features?

Slide 5 — Proposed Solution: Overview

A one-pane diagram (the dashboard's pipeline panel works):

Episodes (.tar.zst) → Validator → Feature & Tensor Builder → Held-out-by-Host Split → 6 Architectures × 2 Modes → Bootstrap-CI Eval → Comparison Report

Three sentences:

We collected 76,660 labeled VM episodes across two physical hosts, of which ~73 k are training-usable.
We trained six architectures (GBT, MLP, CNN, GRU, LSTM, Transformer) twice — once with all telemetry, once with only what a deployed model would see — using a unified training loop with class-weighted loss, early stopping on val macro F1, and best-on-val checkpointing.
We evaluated all twelve on the unseen host with bootstrap CIs and paired-bootstrap significance.

Slide 6 — Model Design

Side-by-side architecture cards (all six). Visual: parameter counts + inputs:

| Model | Input | Params (smoke) | Family |
|---|---|---|---|
| GBT | `(230,)` summary | ~30 KB serialized | Tree |
| MLP | `(230,)` summary | 104 K | Dense |
| CNN | `(46, 100)` tensor | 101 K | Conv |
| GRU | `(46, 100)` tensor | 161 K | RNN |
| LSTM | `(46, 100)` tensor | 214 K | RNN |
| Transformer | `(46, 100)` tensor | 76 K | Attention |

Note that the five NN parameter counts deliberately stay within ~3× of each other (76 K to 214 K); GBT is reported by serialized size instead. The comparison is "what does the inductive bias buy you," not "more parameters."

Slide 7 — Methodology

Data. 76,660 episodes shipped from 2 hosts. 95.2 % training-usable after the §4.6 acceptance gate. 10-second windows / 5-second stride → ≈9 windows per episode → ~660 k windows.
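
The window arithmetic above can be checked with a short sketch. The 50-second episode duration is an assumption, chosen only because it reproduces the stated ≈9 windows per episode; the function name is hypothetical.

```python
def windows_per_episode(duration_s, window_s=10.0, stride_s=5.0):
    """Count full windows in one episode (a partial tail window is dropped)."""
    if duration_s < window_s:
        return 0
    return int((duration_s - window_s) // stride_s) + 1

# ASSUMPTION: 50 s episodes. The episode and usable-fraction figures are from the slide.
per_episode = windows_per_episode(50.0)
usable_episodes = round(76_660 * 0.952)
total_windows = usable_episodes * per_episode
print(per_episode, usable_episodes, total_windows)
```

Under that assumption the total lands at roughly 657 k windows, in line with the ~660 k quoted above.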

Split. Held-out-by-host (primary): train on elliott-thinkpad, val carved from train host, test on k-gamingcom. Profile-stratified; scan-and-dial flagged as untested_profiles because k-gamingcom never ran it. Held-out-by-sample (secondary) on io-walk, the one profile that has ≥ 3 samples.
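
A minimal sketch of the held-out-by-host split, assuming each episode is a dict with `host` and `profile` keys. That record layout, the function name, and the 10 % validation fraction are assumptions for illustration, not the project's actual splitter.

```python
import random
from collections import defaultdict

def split_by_host(episodes, train_host, test_host, val_fraction=0.1, seed=0):
    """Train/val come only from train_host; test is every episode from test_host."""
    rng = random.Random(seed)
    by_profile = defaultdict(list)
    test = []
    for ep in episodes:
        if ep["host"] == test_host:
            test.append(ep)
        elif ep["host"] == train_host:
            by_profile[ep["profile"]].append(ep)

    train, val = [], []
    for profile in sorted(by_profile):  # stratify the val carve-out by profile
        group = by_profile[profile]
        rng.shuffle(group)
        n_val = max(1, int(len(group) * val_fraction)) if len(group) > 1 else 0
        val.extend(group[:n_val])
        train.extend(group[n_val:])

    # Profiles seen in training but absent from the test host are reported, not averaged.
    untested = sorted(set(by_profile) - {ep["profile"] for ep in test})
    return train, val, test, untested

episodes = (
    [{"host": "elliott-thinkpad", "profile": "io-walk", "id": i} for i in range(10)]
    + [{"host": "elliott-thinkpad", "profile": "scan-and-dial", "id": i} for i in range(10, 20)]
    + [{"host": "k-gamingcom", "profile": "io-walk", "id": i} for i in range(20, 26)]
)
train, val, test, untested = split_by_host(episodes, "elliott-thinkpad", "k-gamingcom")
print(len(train), len(val), len(test), untested)
```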

Standardize on train only. Per-channel for tensors, per-feature for summaries. Median imputation for NaN.
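
The train-only fitting rule can be sketched as below for the per-feature (summary) case. The function names are hypothetical, and imputing before computing mean and standard deviation is an assumed ordering.

```python
import math
import statistics

def fit_scaler(train_rows):
    """Fit per-feature (median, mean, std) on TRAIN rows only."""
    params = []
    for j in range(len(train_rows[0])):
        observed = [row[j] for row in train_rows if not math.isnan(row[j])]
        median = statistics.median(observed)
        # ASSUMPTION: impute first, then compute mean/std on the filled column.
        filled = [median if math.isnan(row[j]) else row[j] for row in train_rows]
        mean = statistics.fmean(filled)
        std = statistics.pstdev(filled) or 1.0  # guard against constant features
        params.append((median, mean, std))
    return params

def transform(rows, params):
    """Apply train-fitted statistics to any split (train, val, or test)."""
    return [
        [((median if math.isnan(x) else x) - mean) / std
         for x, (median, mean, std) in zip(row, params)]
        for row in rows
    ]

train_rows = [[1.0, 10.0], [2.0, float("nan")], [3.0, 30.0]]
test_rows = [[float("nan"), 20.0]]
params = fit_scaler(train_rows)
scaled_train = transform(train_rows, params)
scaled_test = transform(test_rows, params)
```

The point of the design is that `transform` never looks at val or test statistics, so nothing about the held-out host leaks into preprocessing.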

Class-weighted CE. armed weight ≈ 10.8, infecting ≈ 2.3, clean ≈ 0.4 — inverse frequency, clipped.
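
One common way to get clipped inverse-frequency weights is sketched below. The `total / (n_classes * count)` formula, the clip bounds, and the window counts are all assumptions made for illustration; the counts are invented and are not the project's class distribution.

```python
def class_weights(counts, min_w=0.1, max_w=20.0):
    """Inverse-frequency weights, w_c = N / (K * n_c), clipped to [min_w, max_w]."""
    total = sum(counts.values())
    k = len(counts)
    weights = {}
    for name, n in counts.items():
        raw = max_w if n == 0 else total / (k * n)
        weights[name] = min(max(raw, min_w), max_w)
    return weights

# Hypothetical window counts per phase.
counts = {
    "clean": 500,
    "armed": 20,
    "infecting": 90,
    "infected_running": 300,
    "dormant": 90,
}
weights = class_weights(counts)
clipped = class_weights(counts, max_w=5.0)
print(weights)
```

Clipping matters for very rare phases such as armed, where an unclipped weight can let a handful of windows dominate the loss.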

Training loop. AdamW, LR warmup (5 % of steps) + cosine decay, gradient clipping at 1.0, early stopping on val macro F1 (patience 8), mixed precision when CUDA is available, best-on-val checkpoint. Same loop for all five NN architectures. XGBoost uses early_stopping_rounds=30 on val mlogloss.
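
The warmup-plus-cosine schedule can be written as a pure function of the step index. This is a sketch under assumptions: linear warmup, decay to a floor of 0, and a hypothetical `lr_at` helper rather than the scheduler the training loop actually uses.

```python
import math

def lr_at(step, total_steps, base_lr, warmup_frac=0.05, min_lr=0.0):
    """Linear warmup over the first warmup_frac of steps, then cosine decay."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total_steps = 1000
schedule = [lr_at(s, total_steps, base_lr=1e-3) for s in range(total_steps)]
peak_step = max(range(total_steps), key=lambda s: schedule[s])
print(schedule[0], schedule[peak_step], schedule[-1])
```

With 1000 steps the warmup covers the first 50, the peak equals the base learning rate, and the rate then falls monotonically toward the floor.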

Slide 8 — Evaluation Setup

Held-out test set: every episode from k-gamingcom (≈ 23 k windows). Never touched at train time.
Metrics: macro F1, per-phase F1, per-profile F1, per-host F1 — every value with bootstrap 95 % CIs (1000 resamples).
Statistical significance: paired-bootstrap of macro-F1 differences vs the top model. CI excludes 0 → significant.
Latency: median µs per window at batch sizes {1, 8, 64, 512} — single-window timing alone is misleading because of Python overhead.
Realistic-vs-oracle gap: every architecture trained twice, both numbers reported.
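
The bootstrap CI and paired-bootstrap test listed above can be sketched in plain Python. The function names, the percentile-interval convention, and the synthetic predictions are assumptions; the real evaluation lives in the project's eval code and may differ.

```python
import random

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no support or predictions scores 0."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

def percentile_ci(samples, alpha=0.05):
    ordered = sorted(samples)
    lo = ordered[int((alpha / 2) * (len(ordered) - 1))]
    hi = ordered[int((1 - alpha / 2) * (len(ordered) - 1))]
    return lo, hi

def paired_bootstrap(y_true, pred_a, pred_b, labels, n_resamples=1000, seed=0):
    """Resample windows with replacement; score both models on the SAME indices."""
    rng = random.Random(seed)
    n = len(y_true)
    f1_a, diffs = [], []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        a = macro_f1(yt, [pred_a[i] for i in idx], labels)
        b = macro_f1(yt, [pred_b[i] for i in idx], labels)
        f1_a.append(a)
        diffs.append(a - b)
    return percentile_ci(f1_a), percentile_ci(diffs)

# Synthetic example: model A is right about 90 % of the time, model B about 40 %.
labels = [0, 1, 2]
y_true = [i % 3 for i in range(150)]
pred_a = [(t + 1) % 3 if i % 10 == 0 else t for i, t in enumerate(y_true)]
pred_b = [t if i % 5 < 2 else (t + 1) % 3 for i, t in enumerate(y_true)]
ci_a, ci_diff = paired_bootstrap(y_true, pred_a, pred_b, labels)
significant = ci_diff[0] > 0 or ci_diff[1] < 0  # CI excludes 0
```

Scoring both models on the same resampled indices is what makes the test paired: shared difficulty in a resample cancels out of the difference.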

Slide 9 — Evaluation Results

The comparison_v2.md table from training/eval_/run.py. Smoke-set numbers (200 episodes/host, 5 epochs — full-scale numbers will replace these for the final deck):

| Model | Test macro F1 (95 % CI) | Significant vs top? |
|---|---|---|
| gbt (oracle) | 0.557 [0.543, 0.571] | n/a (anchor) |
| mlp | 0.176 | yes (CI excludes 0) |
| transformer | 0.113 | yes |
| lstm | 0.112 | yes |
| gru | 0.092 | yes |
| cnn | 0.089 | yes |

Visualization: confusion matrix grid (one per model) from reports/eval/<model>_<mode>_confusion.png.

Headline claim (smoke-version, will be re-tuned at full scale):

At the data scale we have, GBT generalizes to the held-out host significantly better than every NN architecture — including the Transformer. The result is consistent with Natsos & Symeonidis 2025's finding that Transformer dominates only as the dataset grows past ~1k samples per family; below that, simpler inductive biases win.

Slide 10 — Case Study / Demonstration

The live dashboard (https://dashboard.wg):

Scene 2 (collect): live ingest counter from index.jsonl tailing producer.
Scene 6 (attacks): per-profile attack-envelope thumbnails from producers/profiles.py.
Scene 7 (chunking): predictions emitted by producers/replay.py running an episode at wall-clock speed.
Scene 8 (models): macro F1 bars from producers/metrics.py (re-published every 20 s for reconnects).
Scene 9 (knn): PCA-2 scatter colored by phase from each model's saved projection.
Scene 10 (perf): accuracy-vs-latency scatter from producers/perf.py, batch-size 64.

Visual: one screenshot of dashboard.wg with live data, clicked through scenes 2 → 10.

Slide 11 — Theoretical Contributions [opt]

Cross-source clock alignment. Producers were inconsistent about t_mono_ns semantics (episode-relative vs system-uptime); we canonicalize on t_wall_ns. Generalizable to any multi-source telemetry pipeline.
Held-out-host as the primary cross-device generalization claim. Most prior work reports in-distribution metrics; we report the harder number and let the reader see the gap.
Realistic-vs-oracle ablation as the operational measure of "what the deployed model is missing."

Slide 12 — Practical Contributions [opt]

Open-source training stack (training/): six architectures, schema-hashed checkpoints, validator, dashboard producers — directly reusable for any project where labeled per-window resource-utilization data is available.
Live dashboard at dashboard.wg with both pipeline-state and trained-model events — the working example of "model running live" the assignment §10 (case study) wants.
Validator + producer machinery that catches data-quality issues (torn writes in index.jsonl, host silently shipping without bridge pcap, scan-and-dial absent from one host) instead of training on them.

Slide 13 — Design Principles [opt]

No silent downgrade — every host either ships data that meets the gate or produces nothing.
No silent schema drift — every model checkpoint refuses to load if its training-time channel/feature schema doesn't match what _features.py produces today.
Honest CIs over point estimates — every test number we report has bootstrap bounds.
Held out by host, not by time slice — within-sample time splits are easy and dishonest about generalization.
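
The "no silent schema drift" principle can be illustrated with a small sketch. The hashing scheme, the checkpoint dict layout, and every channel name except guest.cpu_iowait are assumptions; the actual checkpoint format in training/ may differ.

```python
import hashlib
import json

class SchemaMismatchError(RuntimeError):
    """Raised when a checkpoint was trained on a different channel schema."""

def schema_hash(channels):
    """Order-sensitive SHA-256 over the channel names."""
    payload = json.dumps(list(channels), separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def load_state(checkpoint, current_channels):
    """Return the saved state only if the schema matches what is produced today."""
    actual = schema_hash(current_channels)
    if checkpoint["schema_hash"] != actual:
        raise SchemaMismatchError(
            f"checkpoint schema {checkpoint['schema_hash'][:12]} != current {actual[:12]}"
        )
    return checkpoint["state"]

# Hypothetical channel list; only guest.cpu_iowait is named in this document.
train_channels = ["guest.cpu_iowait", "guest.cpu_user", "host.net_rx_bytes"]
checkpoint = {"schema_hash": schema_hash(train_channels), "state": {"weights": [0.1, 0.2]}}

state = load_state(checkpoint, train_channels)
try:
    load_state(checkpoint, list(reversed(train_channels)))
    refused = False
except SchemaMismatchError:
    refused = True
```

Hashing the ordered list means a reordered channel axis is rejected as well, which matters because tensor models are sensitive to channel position.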

Slide 14 — Limitations [opt]

Two hosts. Cross-device generalization with N=2 is a single fold of leave-one-host-out CV; with more hosts the methodology becomes more rigorous.
Twelve unique sample names total, with two profiles having 1 sample each. Held-out-by-sample is feasible only on io-walk.
scan-and-dial not present on k-gamingcom — that profile is trained but cannot be evaluated cross-device. Reported as untested_profiles, never silently averaged.
Producer-side bugs found during validation: receiver/store.py:130 torn write (1 occurrence in 76 k); ~24 k k-gamingcom episodes shipped without netflow (silent collector failure); cross-source clock-base inconsistency. All surfaced in training/README.md for follow-up.
Single training corpus, single attack manifest. Generalization to new attack manifests is a stronger claim than this corpus supports.

Slide 15 — Conclusion and Future Work

Conclusion (1 line): The realistic-mode model can be trained, evaluated honestly cross-device, and deployed against live telemetry — but the right architecture depends on data scale, and the cross-device gap is the metric that matters.

Future work:

Self-supervised pretrain on clean-only windows (Masked-Timestep Reconstruction + Volume-of-Hypersphere Minimization, per LogBERT). Aims to flag windows that do not look clean, including novel malware, without using labels. Implementation in training/models/transformer_ssl.py.
Trust-over-time scoring per IEEE 9881803 — per-window confidence accumulated over a sliding decision window, reset trigger at threshold.
Integrated Gradients attribution per (channel, timestep) — which signals drove the call — so the writeup can show evidence, not just confidence numbers. Implementation in training/xai/integrated_gradients.py.
More hosts in the fleet — the data generation and shipping pipeline is in place; adding hosts is a config change. With N ≥ 4 hosts, leave-one-out CV becomes meaningful.
More distinct samples per profile — 12 samples is too few to claim novel-malware generalization; the current dataset only supports cross-device generalization.
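
The Integrated Gradients item above can be illustrated with a midpoint-Riemann sketch on a toy model. This is not the code in training/xai/integrated_gradients.py: the quadratic model, its hand-written gradient, the zero baseline, and all names are assumptions. In a real setting the gradient would come from autograd on the trained network.

```python
def integrated_gradients(grad_fn, x, baseline, steps=32):
    """IG per (channel, timestep): (x - b) * mean of gradients along the straight path."""
    n_ch, n_t = len(x), len(x[0])
    acc = [[0.0] * n_t for _ in range(n_ch)]
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [
            [baseline[c][t] + alpha * (x[c][t] - baseline[c][t]) for t in range(n_t)]
            for c in range(n_ch)
        ]
        grad = grad_fn(point)
        for c in range(n_ch):
            for t in range(n_t):
                acc[c][t] += grad[c][t]
    return [
        [(x[c][t] - baseline[c][t]) * acc[c][t] / steps for t in range(n_t)]
        for c in range(n_ch)
    ]

# Toy model (ASSUMPTION): f(x) = sum_ct W[c][t] * x[c][t]^2, so df/dx = 2 * W * x.
W = [[1.0, 0.5], [2.0, 0.0]]

def model(x):
    return sum(W[c][t] * x[c][t] ** 2 for c in range(2) for t in range(2))

def model_grad(x):
    return [[2.0 * W[c][t] * x[c][t] for t in range(2)] for c in range(2)]

x = [[1.0, 2.0], [3.0, 4.0]]
baseline = [[0.0, 0.0], [0.0, 0.0]]
attributions = integrated_gradients(model_grad, x, baseline)

# Completeness check: attributions should sum to f(x) - f(baseline).
total = sum(sum(row) for row in attributions)
gap = model(x) - model(baseline)
# Per-channel mean |IG|, the quantity a per-phase heatmap would average further.
channel_importance = [sum(abs(v) for v in row) / len(row) for row in attributions]
```

The completeness check is a useful smoke test for any IG implementation: if the attributions do not sum to the output difference, the step count is too low or the path is wrong.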

Slide 16 — References / Acknowledgments

The same APA reference list as docs/project_brief.md §9. Lab acknowledgments: Dr. Mejias, Raul, the spectral lab infrastructure. Tools acknowledgments: libVMI, Drakvuf, XGBoost, PyTorch, scikit-learn, pyarrow.

Notes for the deck builder

Slides 11–14 are marked optional in the assignment — keep them if you have time, drop them if you're tight. Slides 1, 3, 5, 7, 9, 10, 15 are the required spine.
Every metric on Slide 9 should come from reports/eval/comparison_v2.md directly — copy-paste the markdown and let the deck render it.
Slide 10 (live demo) is more memorable than any chart — bring up dashboard.wg, scroll through scenes 2–10, let it talk.
The Design Principles from the [opt]-marked Slide 13 are also written into the repo README, so if you cut that slide the principles still live in the artifact.
