LogBERT-style self-supervised Transformer pretrain on `clean`-only
windows, plus Integrated Gradients attribution for any tensor model.

Both directly answer the assignment's §8 'next steps in unsupervised
learning' requirement and Natsos & Symeonidis 2025's RQ3 on
explainability.

Pretrain (training/models/transformer_ssl.py +
trainer/run_ssl.py):
- Masked Timestep Reconstruction (MTR) — random 15% of timesteps
zeroed, encoder + per-channel head reconstructs from the rest.
Loss: MSE over masked positions.
- Volume of Hypersphere Minimization (VHM, Deep SVDD-style) — pull
learnable [DIST] token embedding toward a frozen center vector
initialized as the mean over clean train. Loss: ||h_dist - c||^2.
- Calibrated anomaly threshold at user-configurable target FPR
(default 5%) on clean-val distance distribution.
- Trained ONLY on `clean`-phase windows: the model never sees a
labeled malware sample, yet it flags any window that doesn't look
clean, including novel malware the supervised classifier never
saw.
- Uses the same schema-hashed checkpoint format as the supervised
models, so loaders refuse mismatched feature schemas.
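
The objectives above can be sketched without any ML framework. This is a
minimal illustration, not the code in training/models/transformer_ssl.py:
the function names (mask_timesteps, masked_mse, hypersphere_loss,
calibrate_threshold) are hypothetical, windows are assumed to be nested
lists shaped [timestep][channel], and the threshold is assumed to be taken
as an empirical quantile of the clean-val distances.

```python
import random


def mask_timesteps(window, ratio=0.15, rng=None):
    """Zero a random `ratio` of timesteps; return (masked_window, mask)."""
    rng = rng or random.Random()
    n_steps = len(window)
    n_mask = max(1, round(ratio * n_steps))
    chosen = set(rng.sample(range(n_steps), n_mask))
    mask = [t in chosen for t in range(n_steps)]
    masked = [[0.0] * len(row) if mask[t] else list(row)
              for t, row in enumerate(window)]
    return masked, mask


def masked_mse(target, reconstruction, mask):
    """MTR loss: mean squared error over masked timesteps only."""
    sq_err, count = 0.0, 0
    for t, is_masked in enumerate(mask):
        if not is_masked:
            continue
        for c in range(len(target[t])):
            sq_err += (target[t][c] - reconstruction[t][c]) ** 2
            count += 1
    return sq_err / count if count else 0.0


def hypersphere_loss(h_dist, center):
    """VHM loss: squared distance ||h_dist - c||^2 to the frozen center."""
    return sum((h - c) ** 2 for h, c in zip(h_dist, center))


def calibrate_threshold(clean_val_distances, target_fpr=0.05):
    """Pick a threshold so that at most `target_fpr` of the clean-val
    distances lie strictly above it (assumed quantile-based calibration)."""
    ordered = sorted(clean_val_distances)
    n = len(ordered)
    allowed_false_positives = int(target_fpr * n)
    k = max(0, n - 1 - allowed_false_positives)
    return ordered[k]
```

At inference time, under these assumptions, a window would be flagged when
its [DIST]-to-center distance exceeds the calibrated threshold.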

XAI (training/xai/integrated_gradients.py):
- Per-(channel, timestep) attribution via path-integrated gradients,
approximated with a midpoint Riemann sum. Works for
cnn/gru/lstm/transformer/transformer_ssl.
- Per-phase mean |IG| heatmaps under reports/xai/<model>/<phase>.png,
top-k channel importance per phase as JSON. Smoke-verified on the
trained CNN: top channel for `clean` is guest.cpu_iowait (sensible
— clean = idle = high iowait).
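
The attribution step can be sketched as follows. This is a framework-free
illustration under stated assumptions, not the code in
training/xai/integrated_gradients.py: grad_fn stands in for whatever
returns the model's gradient with respect to the input window, the window
is a nested list shaped [timestep][channel], and the helper names
(integrated_gradients, top_k_channels) are hypothetical.

```python
def integrated_gradients(grad_fn, window, baseline, steps=32):
    """Midpoint-Riemann approximation of Integrated Gradients.

    Averages grad_fn along the straight path from `baseline` to `window`,
    sampling at alpha = (k + 0.5) / steps, then scales by (input - baseline).
    """
    n_steps, n_ch = len(window), len(window[0])
    grad_sum = [[0.0] * n_ch for _ in range(n_steps)]
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [[baseline[t][c] + alpha * (window[t][c] - baseline[t][c])
                  for c in range(n_ch)] for t in range(n_steps)]
        grads = grad_fn(point)
        for t in range(n_steps):
            for c in range(n_ch):
                grad_sum[t][c] += grads[t][c]
    return [[(window[t][c] - baseline[t][c]) * grad_sum[t][c] / steps
             for c in range(n_ch)] for t in range(n_steps)]


def top_k_channels(attributions, channel_names, k=3):
    """Rank channels by mean |IG| over timesteps; return the top k pairs."""
    n_steps = len(attributions)
    scores = [sum(abs(attributions[t][c]) for t in range(n_steps)) / n_steps
              for c in range(len(channel_names))]
    ranked = sorted(zip(channel_names, scores), key=lambda p: p[1], reverse=True)
    return ranked[:k]
```

A useful sanity check for any such implementation is the completeness
property: the attributions should sum to approximately
f(window) - f(baseline).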

Project brief and slide planner:
- docs/project_brief.md — full draft of the assignment's required
sections 1–9 (problem, research question, ML task type with
justification, six supervised algorithms with assumptions, dataset
description with full validation breakdown, evaluation metrics with
rationale, current progress, lit review with 11 APA citations,
next steps for unsupervised, references).
- docs/slide_planner.md — all 16 slides filled with content tied to
specific files and metrics from this codebase, not generic
placeholders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
78 lines
1.3 KiB
Text
# Disk images and snapshots
*.iso
*.img
*.qcow2
*.qcow2.*
*.vmdk
*.vdi
*.raw
vm/images/
vm/snapshots/

# VERSION file is install-script-stamped (provenance for episodes
# generated from /opt/cis490 install copies). Tracking it would
# trigger spurious dirty-tree state on lab hosts and reject every
# episode at the §4.6 acceptance gate.
/VERSION

# Telemetry output
data/episodes/
data/campaign.json
data/campaign_done.marker
data/outbox/
data/shipped/
*.pcap
*.pcapng

# Training artifacts that are regenerated from raw episodes:
# features are large and deterministic from code+episodes, so we don't
# track them. validation_v1.parquet IS tracked: it's small and pins
# the accepted/degraded set.
data/processed/features_*.parquet
data/processed/feature_schema_*.json
data/processed/.validation_checkpoint.parquet
data/processed/validation_smoke.parquet
data/processed/tensor_window_*/
data/logs/
artifacts/
artifacts-*/
reports/eval/
reports/pca/
reports/xai/
reports/fleet-*/

# Per-developer training venv
.venv-training/

# Malware samples: NEVER commit binaries
samples/store/
*.bin
*.elf
*.exe
*.dll
*.so.malware

# Python
__pycache__/
*.py[cod]
.venv/
venv/
.pytest_cache/
.mypy_cache/
.ruff_cache/
*.egg-info/
dist/
build/

# Editor
.vscode/
.idea/
*.swp
.DS_Store

# Local secrets (never commit)
.env
.env.local
secrets.toml
*.pat
*.token