Max 3ea6bca6f0 training: self-supervised pretrain + IG XAI + project brief / slide planner

LogBERT-style self-supervised Transformer pretrain on `clean`-only
windows, plus Integrated Gradients attribution for any tensor model.
Both directly answer the assignment's §8 'next steps in unsupervised
learning' requirement and Natsos & Symeonidis 2025's RQ3 on
explainability.

Pretrain (training/models/transformer_ssl.py +
trainer/run_ssl.py):
  - Masked Timestep Reconstruction (MTR) — random 15% of timesteps
    zeroed, encoder + per-channel head reconstructs from the rest.
    Loss: MSE over masked positions.
  - Volume of Hypersphere Minimization (VHM, Deep SVDD-style) — pull
    learnable [DIST] token embedding toward a frozen center vector
    initialized as the mean over clean train. Loss: ||h_dist - c||^2.
  - Calibrated anomaly threshold at user-configurable target FPR
    (default 5%) on clean-val distance distribution.
  - Trained ONLY on `clean`-phase windows; the model never sees a
    labeled malware sample yet flags any window that doesn't look
    clean — including novel malware the supervised classifier never
    saw. Uses the same schema-hashed checkpoint format as the
    supervised models so loaders refuse mismatched feature schemas.

XAI (training/xai/integrated_gradients.py):
  - Per-(channel, timestep) attribution via path-integrated gradients
    over Riemann-mid-point steps. Works for cnn/gru/lstm/transformer/
    transformer_ssl.
  - Per-phase mean |IG| heatmaps under reports/xai/<model>/<phase>.png,
    top-k channel importance per phase as JSON. Smoke-verified on the
    trained CNN: top channel for `clean` is guest.cpu_iowait (sensible
    — clean = idle = high iowait).

Project brief and slide planner:
  - docs/project_brief.md — full draft of the assignment's required
    sections 1–9 (problem, research question, ML task type with
    justification, six supervised algorithms with assumptions, dataset
    description with full validation breakdown, evaluation metrics with
    rationale, current progress, lit review with 11 APA citations,
    next steps for unsupervised, references).
  - docs/slide_planner.md — all 16 slides filled with content tied to
    specific files and metrics from this codebase, not generic
    placeholders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-08 01:19:41 -05:00


CIS 490 — Project Brief

1. Project Title and Author(s)

Behavioral Malware Detection from Hypervisor-Layer VM Telemetry: A Cross-Architecture Comparison Under Cross-Device Generalization

Authors: . Course: CIS 490 — AI / Machine Learning / Cybersecurity. Advisor: Dr. Mejias / Raul.

2. Problem Statement and Research Question

Problem. A deployed malware detector running on a host's hypervisor or out-of-VM monitor sees per-process resource-utilization telemetry (CPU, memory, I/O, network) at sub-second resolution, and must decide whether the workload inside the VM is benign or compromised, without trusting any in-guest agent (which malware can disable). Static analysis is defeated by obfuscation and packing; signature-based detectors fail on zero-day samples. Behavioral detectors that classify resource-utilization time-series are the alternative, but their cross-device generalization — a model trained on dev hosts and deployed on production hosts with different hardware envelopes — is rarely measured honestly.

Research question.

Across six neural and tree-based architectures trained on labeled per-window resource-utilization tensors from real Alpine-VM episodes, which architecture best generalizes to a held-out host the model never saw at train time, and what is the gap between in-distribution validation performance and cross-device test performance?

The question is concrete (six named architectures, one held-out host), measurable (macro F1 with bootstrap 95 % CIs), and narrow enough to test on the corpus we have (≈73 k accepted-or-degraded episodes, two active hosts).

3. Machine Learning Task Type

Multi-class classification. Each window is one of five phases (clean, armed, infecting, infected_running, dormant). Phase labels come from the orchestrator's labels.jsonl aligned to the window center (PIPELINE.md §4.5).

Justification:

The label space is small (5), discrete, and mutually exclusive at any one timestamp.
The operationally interesting questions ("is this window malicious?", "what's the kind of malicious?") map cleanly to a closed multi-class label set without forcing artificial regression to a continuous "maliciousness score."
Class imbalance is real but learnable (armed ≈ 4 %, infecting ≈ 7 %, clean ≈ 33 %, infected_running ≈ 56 %), and class-weighted cross-entropy plus macro-F1 selection handles it directly.

Ranking would impose an artificial ordering on the phases. Regression would need a continuous severity target we don't have. Classification with an honest multi-class metric is the right framing.

4. Supervised Algorithms Used

We compare six architectures × two threat-model modes = twelve trained models:

Family	Model	Input	Inductive bias	Why include
Trees	XGBoost (`gbt`)	Per-window summary stats `(46 channels × {mean, std, p50, p95, slope})` = 230 features	Greedy axis-aligned splits over hand-crafted features	Strong tabular baseline; cheap; interpretable via feature importance
Dense NN	MLP (`mlp`)	Same summary features	Universal-approximator over fixed-size feature vector	Apples-to-apples NN parity check against GBT
Convolutional	1D-CNN (`cnn`)	Channel × time tensor `(46, 100)`	Local-receptive-field translation invariance over time	Cheap-edge candidate; captures local envelope shape
Recurrent	GRU (`gru`)	Same tensor	Sequential state accumulation	Standard RNN baseline
Recurrent	LSTM (`lstm`)	Same tensor	Sequential state with explicit cell memory	Cell-choice ablation against GRU
Attention	Transformer encoder (`transformer`)	Same tensor + sinusoidal positional embeddings	Global all-pairs attention	Reviewer-standard modern baseline; per Natsos & Symeonidis 2025, can outperform LSTM at all data scales
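
As a rough illustration of the 230-feature summary input shared by `gbt` and `mlp`, the sketch below reduces one `(46, 100)` window to five statistics per channel. The function name `summarize_window`, the least-squares definition of slope, and the 10 Hz default are assumptions made for this example; they are not taken from training/build_features.py. NaN handling is left out.

```python
import numpy as np

STATS = ("mean", "std", "p50", "p95", "slope")

def summarize_window(window: np.ndarray, hz: float = 10.0) -> np.ndarray:
    """Reduce a (channels, timesteps) window to a flat (channels * 5,) vector."""
    n_channels, n_steps = window.shape
    t = np.arange(n_steps) / hz                 # seconds since window start
    t_centered = t - t.mean()
    centered = window - window.mean(axis=1, keepdims=True)
    # Least-squares slope per channel: cov(t, x) / var(t), in units per second.
    slope = centered @ t_centered / (t_centered @ t_centered)
    feats = np.stack(
        [
            window.mean(axis=1),
            window.std(axis=1),
            np.percentile(window, 50, axis=1),
            np.percentile(window, 95, axis=1),
            slope,
        ],
        axis=1,
    )                                           # shape (channels, 5)
    return feats.reshape(-1)

rng = np.random.default_rng(0)
window = rng.normal(size=(46, 100))
window[0] = 2.0 * np.arange(100) / 10.0         # channel 0 rises at 2 units per second
features = summarize_window(window)             # 46 channels x 5 stats = 230 values
```

The flat layout groups the five statistics of each channel together, so `features[4]` is the slope of channel 0.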

Threat-model modes:

Realistic — features whose available_in_deployment=True: guest_agent channels (in-guest /proc surrogate) and bridge_pcap channels (network monitor). 29 of 46 channels.
Oracle — all channels, including host-side /proc/<qemu_pid> and QEMU QMP introspection. 46 of 46. Upper bound for what the architecture can learn given full visibility.

The realistic-vs-oracle gap is the project's headline metric for what the deployed model is missing.

Hyperparameters and assumptions. All NN models share the trainer in training/trainer/_loop.py: AdamW, weight decay 1e-4, LR warmup over the first 5 % of steps followed by cosine decay to 0, gradient clipping at norm 1, mixed precision when CUDA is present, early stopping on val macro F1 with patience 8, and a best-on-val checkpoint. NN-specific hyperparameters (hidden size, layer count, dropout, head count) are listed in each model file under training/models/ and tuned on the val slice of the held-out-by-host split (carved from the train host), never on test. GBT uses XGBoost with tree_method=hist, max_depth=6, eta=0.1, and early stopping after 30 rounds without improvement in val mlogloss. Class weights are computed from the train set as N / (n_classes × count_k), clipped to [0.1, 20], and applied as per-class weights in the cross-entropy loss (NN) and as per-sample weights (GBT). Schema-hashed checkpoints (training/models/_checkpoint.py) refuse to load if the feature/channel registry has changed since training, so silent input-slot drift is rejected.
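
The class-weight formula and the learning-rate schedule can be written out in a few lines. This is a minimal sketch, not the code in training/trainer/_loop.py: the function names are hypothetical, the warmup is assumed to be linear (the text only says "warmup"), and an absent class is assumed to be counted as 1 to avoid division by zero. The example label counts mirror the approximate class proportions quoted in §3.

```python
import math
import numpy as np

def class_weights(labels: np.ndarray, n_classes: int, lo: float = 0.1, hi: float = 20.0) -> np.ndarray:
    """w_k = N / (n_classes * count_k), clipped to [lo, hi]."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    counts = np.maximum(counts, 1.0)   # assumption: treat an absent class as count 1
    return np.clip(labels.size / (n_classes * counts), lo, hi)

def lr_at(step: int, total_steps: int, base_lr: float, warmup_frac: float = 0.05) -> float:
    """Linear warmup over the first warmup_frac of steps, then cosine decay to 0."""
    warmup_steps = max(1, int(round(warmup_frac * total_steps)))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Class ids (assumed order): 0 clean, 1 armed, 2 infecting, 3 infected_running, 4 dormant.
labels = np.repeat(np.arange(4), [33, 4, 7, 56])
weights = class_weights(labels, n_classes=5)
schedule = [lr_at(s, total_steps=1000, base_lr=1e-3) for s in range(1001)]
```

With these counts the rare `armed` class gets weight 5.0 while `infected_running` gets about 0.36, which is the rebalancing effect the clipped formula is meant to provide.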

5. Dataset Description

Source. Lab-generated. Each lab host on the WireGuard mesh boots an Alpine 3.21 cloud-init VM, runs a profile-driven workload from the manifest, samples telemetry from four sources at 1–10 Hz, and ships the labeled tarball to the receiver Pi over mTLS. See manifest.toml (canonical experiment) and PIPELINE.md (correctness story).

Approximate size. As of 2026-05-07: 76,660 shipped episodes indexed in /var/lib/cis490/index.jsonl, totaling ~2.7 GB compressed (one .tar.zst per episode, ~36 KB median). The full validator sweep (tools/dataset_validate.py) classifies every episode against the §4.6 acceptance gate:

status	count	%
accepted	64,798	84.5 %
degraded (no `netflow.jsonl`)	8,154	10.6 %
rejected (missing telemetry-guest or telemetry-qmp)	3,701	4.8 %
error (sha or size mismatch — corrupt)	7	0.01 %

72,952 episodes (95.2 %) are training-usable.

Key features. Per accepted episode, four telemetry sources at the cadences below, plus labels.jsonl (phase transitions), events.jsonl, meta.json (sample, profile, schedule, host fingerprint), done.marker. Windowed at 10-second windows / 5-second stride into ≈9 windows per ~50-second episode, summarized either as a 230-dim summary-stat vector (per-channel mean, std, p50, p95, slope) for tree/MLP models or a (46 channels, 100 timesteps) tensor for sequence models.
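
The window arithmetic above (10-second windows, 5-second stride, 10 Hz grid) can be checked with a short sketch. `slice_windows` is a hypothetical helper written for this example, assuming the episode has already been resampled to a uniform grid; it is not the function in training/build_tensors.py.

```python
import numpy as np

def slice_windows(series: np.ndarray, hz: int = 10, window_s: int = 10, stride_s: int = 5) -> np.ndarray:
    """Cut a (channels, timesteps) episode into (n_windows, channels, window_len)."""
    win = window_s * hz
    stride = stride_s * hz
    n_channels, n_steps = series.shape
    if n_steps < win:
        # Episode shorter than one window: return an empty stack.
        return np.empty((0, n_channels, win), dtype=series.dtype)
    starts = range(0, n_steps - win + 1, stride)
    return np.stack([series[:, s:s + win] for s in starts])

episode = np.zeros((46, 500))        # about 50 s of telemetry at 10 Hz
windows = slice_windows(episode)     # (50 - 10) / 5 + 1 = 9 windows of 100 timesteps
```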

source	role	available in deployment	Hz
host_proc (`/proc/<qemu_pid>`)	host-side per-process metrics — oracle only	no	~10
guest_agent (in-VM /proc surrogate)	what the deployed model would see	yes	~10
host_qmp (QEMU introspection)	block I/O, KVM stats — oracle only	no	~1
bridge_pcap (network monitor)	per-100ms packet/flow counts	yes	~10

Label availability. Every window has a phase label projected from labels.jsonl onto the window center. Five classes: clean, armed, infecting, infected_running, dormant. The phase enum is closed; we do not predict failed (only emitted when no transition fires within the schedule's per-phase budget — episodes that hit failed are filtered upstream by the acceptance gate).
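
One way to picture the projection of phase transitions onto a window center is a sorted lookup. The sketch below assumes each labels.jsonl row has been parsed into a `(t_wall_ns, phase)` pair marking when a phase begins; that record layout and the name `label_at_center` are assumptions for illustration, and the timestamps are small offsets rather than real Unix nanoseconds.

```python
from bisect import bisect_right

SEC = 1_000_000_000  # nanoseconds per second

def label_at_center(transitions, start_ns, end_ns):
    """Return the phase active at the midpoint of [start_ns, end_ns].

    transitions: list of (t_wall_ns, phase) sorted by time.
    """
    center = (start_ns + end_ns) // 2
    times = [t for t, _ in transitions]
    idx = bisect_right(times, center) - 1   # last transition at or before the center
    if idx < 0:
        return None                         # center precedes the first transition
    return transitions[idx][1]

transitions = [
    (0 * SEC, "clean"),
    (15 * SEC, "armed"),
    (20 * SEC, "infecting"),
    (28 * SEC, "infected_running"),
]
first = label_at_center(transitions, 0 * SEC, 10 * SEC)
second = label_at_center(transitions, 10 * SEC, 20 * SEC)
third = label_at_center(transitions, 25 * SEC, 35 * SEC)
```

A window that straddles a transition takes whichever phase is active at its midpoint, so the second window here (centered exactly on the `armed` transition) is labeled `armed`.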

Preprocessing pipeline (training/):

Validation (tools/dataset_validate.py) — full-sweep validator over the receiver store. SHA256, schema, monotonic labels, row-count gate.
Feature extraction (training/build_features.py, training/build_tensors.py) — counter channels differenced to per-second rates; resample to a uniform 10 Hz grid via linear interpolation; emit summary-stat parquet AND channel × time tensor shards.
Time-base alignment fix. Producers were inconsistent: labels/proc/guest/qmp use episode-relative t_mono_ns, netflow uses system-uptime t_mono_ns. We canonicalize on t_wall_ns (Unix nanoseconds) which is consistent across all sources. Caught and fixed by tests/test_training_features.py::test_t_wall_ns_alignment_not_t_mono_ns.
Held-out split (training/_split.py) — primary: held-out-by-host (train on elliott-thinkpad, val carved from train host, test on k-gamingcom). Secondary: held-out-by-sample where ≥ 3 unique sample_names per profile. Profile-stratification assertions; untested_profiles (e.g., scan-and-dial not present on k-gamingcom) and excluded_profiles are reported, never silently averaged into test metrics.
Standardization (training/models/_base.py::StandardizeStats) — fit on the train slice only; per-feature for summary models, per-channel for tensor models. Median imputation for NaN, then z-score.
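
The standardization step can be sketched as follows, assuming a 2-D `(rows, features)` array in which NaN marks a missing value. `TrainOnlyStandardizer` is a hypothetical stand-in, not the `StandardizeStats` class from the codebase; it does not handle a feature that is NaN in every train row, and for tensor models the same idea would apply per channel.

```python
import numpy as np

class TrainOnlyStandardizer:
    """Median-impute NaNs, then z-score, using statistics from the train slice only."""

    def fit(self, train: np.ndarray) -> "TrainOnlyStandardizer":
        self.median_ = np.nanmedian(train, axis=0)
        filled = np.where(np.isnan(train), self.median_, train)
        self.mean_ = filled.mean(axis=0)
        std = filled.std(axis=0)
        self.std_ = np.where(std > 0, std, 1.0)   # constant features map to 0
        return self

    def transform(self, x: np.ndarray) -> np.ndarray:
        # Val and test rows reuse the train medians, means and stds.
        filled = np.where(np.isnan(x), self.median_, x)
        return (filled - self.mean_) / self.std_

train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
test = np.array([[np.nan, 20.0]])
scaler = TrainOnlyStandardizer().fit(train)
train_z = scaler.transform(train)
test_z = scaler.transform(test)
```

Fitting on the train slice only keeps val and test statistics out of the model's inputs; the test row above is imputed with the train median and therefore lands at zero in both features.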

Sample diversity caveat. The corpus has only 12 unique malware/mimic sample_name values across 6 profiles. Two profiles have a single sample each, so held-out-by-sample is mathematically infeasible for them. Held-out-by-host is the right primary split given this constraint.
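
A held-out-by-host split that surfaces untested profiles, rather than averaging over them, might look like the sketch below. The record layout (dicts with `host` and `profile` keys), the function name, and the profile name `profile-a` are hypothetical; the host names and `scan-and-dial` come from the split description above.

```python
def split_by_host(episodes, train_host, test_host):
    """Partition episode records by host and list profiles absent from the test host."""
    episodes = list(episodes)
    train = [e for e in episodes if e["host"] == train_host]
    test = [e for e in episodes if e["host"] == test_host]
    train_profiles = {e["profile"] for e in train}
    test_profiles = {e["profile"] for e in test}
    # Profiles seen in training with no test coverage must be reported, not scored.
    untested = sorted(train_profiles - test_profiles)
    return train, test, untested

episodes = [
    {"host": "elliott-thinkpad", "profile": "profile-a"},
    {"host": "elliott-thinkpad", "profile": "scan-and-dial"},
    {"host": "k-gamingcom", "profile": "profile-a"},
]
train, test, untested = split_by_host(episodes, "elliott-thinkpad", "k-gamingcom")
```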

6. Evaluation Metrics

Metric	Why this is the right measure
Macro F1	Class-balanced multi-class metric. Plain accuracy is biased toward `infected_running` (~56 % of windows); macro F1 weights each phase equally and is the right early-stopping criterion. Selecting `best-on-val` by macro F1 (not accuracy) is the difference between training a detector and training a class-prior estimator.
Per-phase precision, recall, F1	The five phases are not equally interesting operationally. `armed`/`infecting` are rare but indicate the transition into compromise — high recall there matters more than on `clean`. We report precision and recall separately so a writeup can talk about false positives versus missed detections.
Bootstrap 95 % CIs on every metric	A single point estimate from a finite test set is dishonest. We resample test rows with replacement (1000 bootstraps) and report `macro F1 = 0.557 [0.543, 0.571]`. CIs are produced by `training/eval_/_metrics.py::bootstrap_macro_f1`.
Paired-bootstrap significance	Model-vs-model gap. Same row indices applied to both models' predictions on each resample, so "which test windows happened to be hard" cancels. CI excludes 0 → significant.
Per-profile and per-host breakdown	A model with macro F1 = 0.55 might be 0.85 on five profiles and 0.10 on the sixth. The single number hides exactly the failure modes this project cares about. `training/eval_/breakdown.py` produces both tables.
Realistic-vs-oracle gap	The honest measure of what the deployed model is missing. Oracle is the architectural ceiling; realistic is what would actually run. Their gap is the cost of restricting to in-deployment features.
Latency (µs) at production batch sizes	Single-window timing is misleading because Python overhead dominates. We report median µs at batch sizes `{1, 8, 64, 512}` so the dashboard scatter and the writeup can talk about deployment cost, not Python overhead.

We do not use plain accuracy as the headline metric; it appears in tables only for completeness. AUC and Precision@k are not computed because the task is multi-class with a small phase set, not binary or ranked retrieval.
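
The bootstrap CI and the paired-bootstrap comparison can be sketched with NumPy alone. This is an illustrative version, not the code in training/eval_/_metrics.py: the function names are hypothetical, macro F1 is averaged over all `n_classes` with an empty class scoring 0 (an assumption), and the data are synthetic.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1; a class with no support and no predictions scores 0."""
    f1s = []
    for k in range(n_classes):
        tp = np.sum((y_true == k) & (y_pred == k))
        fp = np.sum((y_true != k) & (y_pred == k))
        fn = np.sum((y_true == k) & (y_pred != k))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))

def bootstrap_ci(y_true, y_pred, n_classes, n_boot=1000, seed=0):
    """Point estimate plus percentile 95 % CI from resampling test rows with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        scores[b] = macro_f1(y_true[idx], y_pred[idx], n_classes)
    lo, hi = np.percentile(scores, [2.5, 97.5])
    return macro_f1(y_true, y_pred, n_classes), float(lo), float(hi)

def paired_bootstrap_gap(y_true, pred_a, pred_b, n_classes, n_boot=1000, seed=0):
    """95 % CI on macro F1(a) - macro F1(b), using the same resampled rows for both models."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    gaps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        gaps[b] = (macro_f1(y_true[idx], pred_a[idx], n_classes)
                   - macro_f1(y_true[idx], pred_b[idx], n_classes))
    lo, hi = np.percentile(gaps, [2.5, 97.5])
    return float(lo), float(hi)

# Synthetic example: model A is wrong on 10 % of rows, model B on 40 %.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 3, size=600)
pred_a = y_true.copy(); pred_a[:60] = (y_true[:60] + 1) % 3
pred_b = y_true.copy(); pred_b[:240] = (y_true[:240] + 1) % 3
point, ci_lo, ci_hi = bootstrap_ci(y_true, pred_a, 3)
gap_lo, gap_hi = paired_bootstrap_gap(y_true, pred_a, pred_b, 3)
```

Because both models are scored on identical resampled rows, row-level difficulty cancels in the gap, and a gap interval that excludes 0 is read as a significant difference.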

7. Current Progress and Literature Review

Code progress. Validator, feature extractor (summary + tensor), held-out-by-host / -by-sample / -by-time recipes with profile-stratification assertions, six model architectures behind a common BaseModel interface, schema-hashed checkpoint format, unified trainer with class-weighted CE + LR warmup/cosine + early stopping on val macro F1, eval suite with bootstrap CIs and paired-bootstrap significance, dashboard producers (live metric + replay + perf), 17/17 unit tests passing. End-to-end smoke-trained all six architectures on a 567-episode subset; full-scale training pending the 2070 Super box.

Literature review. Continued in references/CIS490_Project_Workbook.xlsx (Literature Matrix tab). Key sources and how each informs the project:

Source	Informs
Natsos & Symeonidis 2025, Transformer-based malware detection using process resource utilization metrics (Results in Engineering)	Closest prior work: same input modality (resource-utilization metrics), same VM context. Confirms Transformer ≥ LSTM at all data sizes; validates the "other tenant processes carry indirect malware signal" finding that supports our oracle ablation. Also a model for statistical-test methodology (paired t-test + Wilcoxon).
Melvin et al. 2025, A Deep Learning Model Leveraging Time-Series System Call Data to Detect Malware Attacks in Virtual Machines (Int J Comput Intell Syst)	Hypervisor-layer IDS via VMI/Drakvuf, time-series CNN. Direct support for the threat-model assumption (don't trust in-guest agents). Counterpoint to Natsos & Symeonidis on architecture choice — they argue CNN > RNN/LSTM on system-call traces; the literature disagreement is itself a research finding.
Guo, Yuan, Wu 2021, LogBERT: Log Anomaly Detection via BERT (arXiv 2103.04475)	Self-supervised pretrain on normal sequences (Masked Log Key Prediction + Volume-of-Hypersphere Minimization) for novel-anomaly detection without labeled attacks. Methodological template for our §8 next-step (one-class anomaly detector trained on `clean` windows only).
Ma & Rastogi 2021, DANTE: Predicting Insider Threat using LSTM on system logs (arXiv 2102.05600)	Supporting evidence for LSTM-on-time-sequence-of-discrete-events in cybersecurity. Honest acknowledgment of limitations (high false-positives, unknown-threat blind spots) — useful as a cited limitation in our writeup.
Forrest et al. 1996, A Sense of Self for Unix Processes	Seminal anomaly-IDS-from-system-calls paper; the historical anchor for §2 Existing Work.
Du, Li, Zheng, Srikumar 2017, DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning (ACM CCS)	LSTM-on-log-keys baseline that DANTE and LogBERT both cite. Anchor for the unsupervised next-step.
Hochreiter & Schmidhuber 1997, Long Short-Term Memory (Neural Computation 9(8))	Foundational LSTM architecture reference.
Vaswani et al. 2017, Attention Is All You Need (NeurIPS)	Foundational Transformer architecture reference.
Chen & Guestrin 2016, XGBoost: A Scalable Tree Boosting System (KDD)	Foundational reference for the GBT baseline.
MITRE Caldera (https://github.com/mitre/caldera)	Adversary emulation platform. Cited under Dataset Description as related tooling for reproducible attack-trace generation.
(Future) IEEE 9881803 trust-over-time scoring (cited in repo `README.md`)	Will inform §8 unsupervised next-step (sliding-window confidence accumulation + reset trigger).

The Literature Matrix in the workbook will be filled with all 22 columns per source (Relevant?, Priority, Authors, Year, Paper Type, …, How this informs my project, APA citation) for at least these 11 entries.

8. Next Steps for Unsupervised Learning

The supervised classifier above tells us "which of the five phases is this window?" — but the deployed model has to handle novel malware that wasn't in any training set. The unsupervised next step:

Self-supervised pretraining on clean-only windows. Following LogBERT and DeepLog, train the Transformer encoder on clean windows with two objectives: (a) Masked Timestep Reconstruction, which randomly masks 15 % of timesteps in the (channel × time) tensor and predicts the masked values from the rest; and (b) Volume-of-Hypersphere Minimization, which pulls the CLS-style [DIST] embedding of each clean window toward a single center vector. At inference time, the anomaly score is either the reconstruction MSE on masked positions or the distance from the center. The model never sees a labeled malware sample, yet it flags any window that does not look clean. This is the unsupervised complement to the supervised classifier and directly targets novel-malware generalization. Implementation lives in training/models/transformer_ssl.py and training/trainer/run_ssl.py.
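
The two objectives and the FPR-calibrated threshold mentioned in the commit message can be sketched without an encoder. In the NumPy code below, `x_hat` and `h_dist` stand in for model outputs, and every function name is hypothetical; this shows the loss arithmetic only and is not the code in training/models/transformer_ssl.py.

```python
import numpy as np

def mask_timesteps(x, mask_frac=0.15, rng=None):
    """Zero a random subset of timesteps across all channels of a (channels, time) window."""
    if rng is None:
        rng = np.random.default_rng()
    n_steps = x.shape[1]
    n_mask = max(1, int(round(mask_frac * n_steps)))
    masked_idx = rng.choice(n_steps, size=n_mask, replace=False)
    x_masked = x.copy()
    x_masked[:, masked_idx] = 0.0
    return x_masked, masked_idx

def mtr_loss(x, x_hat, masked_idx):
    """MSE between original and reconstruction, restricted to masked timesteps."""
    diff = x[:, masked_idx] - x_hat[:, masked_idx]
    return float(np.mean(diff ** 2))

def vhm_loss(h_dist, center):
    """Squared distance of the [DIST] embedding from the fixed center, ||h_dist - c||^2."""
    return float(np.sum((h_dist - center) ** 2))

def calibrate_threshold(clean_val_scores, target_fpr=0.05):
    """Score above which roughly target_fpr of clean validation windows fall."""
    return float(np.quantile(clean_val_scores, 1.0 - target_fpr))

rng = np.random.default_rng(0)
x = rng.normal(size=(46, 100))
x_masked, masked_idx = mask_timesteps(x, rng=rng)
loss_untrained = mtr_loss(x, x_masked, masked_idx)   # "reconstruction" that leaves zeros in place
loss_perfect = mtr_loss(x, x, masked_idx)            # perfect reconstruction
dist = vhm_loss(np.array([1.0, 2.0]), np.array([0.0, 0.0]))
clean_val_scores = np.arange(100.0)
threshold = calibrate_threshold(clean_val_scores)
```

Setting the threshold at the (1 − target FPR) quantile of clean-val scores means about 5 % of clean validation windows are flagged by construction; how many clean windows are flagged on a different host is a separate empirical question.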

Trust-over-time scoring (per IEEE 9881803, the original project framing). Per-window confidence accumulated across a sliding decision window with exponential decay; reset trigger when the running confidence crosses a tuned threshold. Different from per-window classification — it's a behavioral commitment that the model is willing to act on, not just a momentary opinion.
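
One possible reading of "accumulate with exponential decay, reset on threshold" is sketched below. It is a guess at the shape of the mechanism, not the scheme from IEEE 9881803: the decay factor, the threshold, the reset-to-zero behaviour, and the name `trust_over_time` are all assumptions.

```python
def trust_over_time(p_malicious, decay=0.8, threshold=2.0):
    """Accumulate per-window malicious confidence with exponential decay.

    Returns the window indices at which the running score crossed the
    threshold; the score resets to 0 after each trigger.
    """
    score = 0.0
    triggers = []
    for i, p in enumerate(p_malicious):
        score = decay * score + p      # older evidence fades geometrically
        if score >= threshold:
            triggers.append(i)
            score = 0.0                # reset after acting on the evidence
    return triggers

single_spike = trust_over_time([0.1, 0.9, 0.1, 0.1])
sustained = trust_over_time([0.9, 0.9, 0.9, 0.9])
```

With these illustrative settings a single high-confidence window does not trigger, while three consecutive ones do, which is the difference between a momentary opinion and a commitment to act.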

PCA / t-SNE / UMAP on the standardized window features, colored by phase and by host, for the dashboard's KNN-scatter widget and the writeup's "do the phases separate at all in low-dim space?" sanity check. PCA-2 projection is already saved with each trained model checkpoint.

Clustering by host-profile fingerprint (k-means in PC space, per profile). Already implemented in the validator's outlier-flagging path. Useful for catching host-drift contamination before training.

Feature attribution via Integrated Gradients, Gradient×Input, SmoothGrad (per Natsos & Symeonidis 2025 RQ3). Per-(channel, timestep) attribution averaged per phase tells us which signals at which phases drove the model's decision. Feeds the writeup's interpretability section. Implementation in training/xai/integrated_gradients.py.
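
Integrated Gradients with midpoint Riemann steps, as described in the commit message, can be sketched against any function that returns gradients. To stay self-contained, the example uses a toy model `F(x) = tanh(sum(w * x))` with an analytic gradient and an all-zeros baseline; the toy model, the baseline choice, and the function names are assumptions, not the contents of training/xai/integrated_gradients.py.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, n_steps=64):
    """Approximate IG_i = (x_i - b_i) * integral over a in [0, 1] of dF/dx_i(b + a (x - b)),
    using a midpoint Riemann sum with n_steps intervals."""
    alphas = (np.arange(n_steps) + 0.5) / n_steps   # interval midpoints in (0, 1)
    delta = x - baseline
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_fn(baseline + a * delta)
    return delta * total / n_steps

# Toy differentiable model on a (channels, time) window.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.01, size=(46, 100))

def f(x):
    return np.tanh(np.sum(w * x))

def grad_f(x):
    return (1.0 - np.tanh(np.sum(w * x)) ** 2) * w

x = rng.normal(size=(46, 100))
baseline = np.zeros_like(x)
attr = integrated_gradients(grad_f, x, baseline)    # one attribution per (channel, timestep)
completeness_gap = abs(attr.sum() - (f(x) - f(baseline)))
```

The completeness check (attributions summing to roughly `F(x) - F(baseline)`) is a useful sanity test for any IG implementation. A per-phase heatmap would then average `np.abs(attr)` over all windows of that phase.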

9. References

(APA 7th edition; the final reference list will be mirrored in references/links.md and in the workbook's Literature Matrix.)

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. https://doi.org/10.1145/2939672.2939785

Du, M., Li, F., Zheng, G., & Srikumar, V. (2017). DeepLog: Anomaly detection and diagnosis from system logs through deep learning. Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 1285–1298. https://doi.org/10.1145/3133956.3134015

Forrest, S., Hofmeyr, S. A., Somayaji, A., & Longstaff, T. A. (1996). A sense of self for Unix processes. Proceedings of the 1996 IEEE Symposium on Security and Privacy, 120–128. https://doi.org/10.1109/SECPRI.1996.502675

Guo, H., Yuan, S., & Wu, X. (2021). LogBERT: Log anomaly detection via BERT (arXiv:2103.04475). arXiv. https://arxiv.org/abs/2103.04475

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735

Ma, Q., & Rastogi, N. (2021). DANTE: Predicting insider threat using LSTM on system logs (arXiv:2102.05600). arXiv. https://arxiv.org/abs/2102.05600

Melvin, A. A. R., Kathrine, J. W., Jeyabose, A., & Cenitta, D. (2025). A deep learning model leveraging time-series system call data to detect malware attacks in virtual machines. International Journal of Computational Intelligence Systems, 18, Article 58. https://doi.org/10.1007/s44196-025-00781-z

MITRE Corporation. (n.d.). Caldera: A scalable, automated adversary emulation platform [Computer software]. GitHub. https://github.com/mitre/caldera

Natsos, D., & Symeonidis, A. L. (2025). Transformer-based malware detection using process resource utilization metrics. Results in Engineering, 25, 104250. https://doi.org/10.1016/j.rineng.2025.104250

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. https://papers.nips.cc/paper/7181-attention-is-all-you-need
