CIS490/.gitignore
Max 3ea6bca6f0 training: self-supervised pretrain + IG XAI + project brief / slide planner
LogBERT-style self-supervised Transformer pretrain on `clean`-only
windows, plus Integrated Gradients attribution for any tensor model.
Together they address the assignment's §8 'next steps in unsupervised
learning' requirement and Natsos & Symeonidis 2025's RQ3 on
explainability.

Pretrain (training/models/transformer_ssl.py +
trainer/run_ssl.py):
  - Masked Timestep Reconstruction (MTR): a random 15% of timesteps
    are zeroed, and the encoder plus a per-channel head reconstructs
    them from the rest. Loss: MSE over masked positions.
  - Volume of Hypersphere Minimization (VHM, Deep SVDD-style): pull
    the learnable [DIST] token embedding toward a frozen center vector
    initialized as the mean embedding over the clean training set.
    Loss: ||h_dist - c||^2.
  - Anomaly threshold calibrated on the clean-validation distance
    distribution at a user-configurable target FPR (default 5%).
  - Trained ONLY on `clean`-phase windows: the model never sees a
    labeled malware sample, yet it flags windows whose distance from
    the center exceeds the calibrated threshold, which can include
    novel malware the supervised classifier never saw. Uses the same
    schema-hashed checkpoint format as the supervised models, so
    loaders refuse mismatched feature schemas.
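
A minimal sketch of the three pieces above (MTR masking and loss, the
VHM distance, and FPR-based threshold calibration), in plain Python on
nested lists rather than tensors. It is not the code in
transformer_ssl.py or run_ssl.py; every function name and signature
below is hypothetical, and the quantile convention used for the
threshold is an assumption.

```python
import math
import random


def mask_timesteps(window, mask_ratio=0.15, rng=None):
    """Zero a random subset of timesteps.

    `window` is a list of timesteps, each a list of channel values.
    Returns (masked_window, sorted_masked_indices).
    """
    rng = rng or random.Random()
    n_steps = len(window)
    n_masked = max(1, round(mask_ratio * n_steps))
    masked_idx = sorted(rng.sample(range(n_steps), n_masked))
    hidden = set(masked_idx)
    masked = [[0.0] * len(row) if t in hidden else list(row)
              for t, row in enumerate(window)]
    return masked, masked_idx


def masked_mse(pred, target, masked_idx):
    """MSE over the masked timesteps only (all channels)."""
    sq_errors = [(p - y) ** 2
                 for t in masked_idx
                 for p, y in zip(pred[t], target[t])]
    return sum(sq_errors) / len(sq_errors)


def hypersphere_loss(h_dist, center):
    """Squared Euclidean distance ||h_dist - c||^2."""
    return sum((h - c) ** 2 for h, c in zip(h_dist, center))


def calibrate_threshold(clean_val_distances, target_fpr=0.05):
    """Largest clean-val distance such that at most `target_fpr` of the
    clean-val windows lie strictly above it (assumed convention)."""
    if not clean_val_distances:
        raise ValueError("need at least one clean-val distance")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ordered = sorted(clean_val_distances)
    n_allowed = math.floor(target_fpr * len(ordered))
    return ordered[len(ordered) - n_allowed - 1]
```

Under these assumptions a window would be flagged as anomalous when
hypersphere_loss(h_dist, center) exceeds the calibrated threshold.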

XAI (training/xai/integrated_gradients.py):
  - Per-(channel, timestep) attribution via path-integrated gradients,
    approximated with a midpoint Riemann sum. Works for cnn/gru/lstm/
    transformer/transformer_ssl.
  - Per-phase mean |IG| heatmaps under reports/xai/<model>/<phase>.png,
    top-k channel importance per phase as JSON. Smoke-verified on the
    trained CNN: top channel for `clean` is guest.cpu_iowait (sensible
    — clean = idle = high iowait).
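
The midpoint approximation of Integrated Gradients can be sketched as
follows. This is an illustration on a toy function with an analytic
gradient, not the code in integrated_gradients.py; `grad_fn`,
`toy_model`, `toy_grad` and `WEIGHTS` are hypothetical names, and a
real window would be flattened to one value per (channel, timestep)
with gradients coming from autograd.

```python
def integrated_gradients(grad_fn, x, baseline, n_steps=32):
    """Midpoint Riemann-sum approximation of Integrated Gradients.

    grad_fn(point) must return the gradient of the scalar model output
    with respect to each input feature at `point`.
    """
    delta = [xi - bi for xi, bi in zip(x, baseline)]
    grad_sum = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps  # midpoint of the k-th sub-interval
        point = [bi + alpha * di for bi, di in zip(baseline, delta)]
        for i, g in enumerate(grad_fn(point)):
            grad_sum[i] += g
    # Attribution = (x - baseline) * average gradient along the path.
    return [di * gs / n_steps for di, gs in zip(delta, grad_sum)]


# Toy stand-in for a model: f(x) = sum_i w_i * x_i^2.
WEIGHTS = [0.5, -1.0, 2.0]


def toy_model(x):
    return sum(w * xi ** 2 for w, xi in zip(WEIGHTS, x))


def toy_grad(x):
    return [2.0 * w * xi for w, xi in zip(WEIGHTS, x)]
```

A useful sanity check is the completeness property: the attributions
should sum to toy_model(x) - toy_model(baseline), up to the error of
the numerical integration.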

Project brief and slide planner:
  - docs/project_brief.md — full draft of the assignment's required
    sections 1–9 (problem, research question, ML task type with
    justification, six supervised algorithms with assumptions, dataset
    description with full validation breakdown, evaluation metrics with
    rationale, current progress, lit review with 11 APA citations,
    next steps for unsupervised, references).
  - docs/slide_planner.md — all 16 slides filled with content tied to
    specific files and metrics from this codebase, not generic
    placeholders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 01:19:41 -05:00

# Disk images and snapshots
*.iso
*.img
*.qcow2
*.qcow2.*
*.vmdk
*.vdi
*.raw
vm/images/
vm/snapshots/

# VERSION file is install-script-stamped (provenance for episodes
# generated from /opt/cis490 install copies). Tracking it would
# trigger spurious dirty-tree state on lab hosts and reject every
# episode at the §4.6 acceptance gate.
/VERSION

# Telemetry output
data/episodes/
data/campaign.json
data/campaign_done.marker
data/outbox/
data/shipped/
*.pcap
*.pcapng

# Training artifacts that are regenerated from raw episodes:
# features are large and deterministic from code+episodes, so we don't
# track them. validation_v1.parquet IS tracked — it's small and pins
# the accepted/degraded set.
data/processed/features_*.parquet
data/processed/feature_schema_*.json
data/processed/.validation_checkpoint.parquet
data/processed/validation_smoke.parquet
data/processed/tensor_window_*/
data/logs/
artifacts/
artifacts-*/
reports/eval/
reports/pca/
reports/xai/
reports/fleet-*/

# Per-developer training venv
.venv-training/

# Malware samples: NEVER commit binaries
samples/store/
*.bin
*.elf
*.exe
*.dll
*.so.malware

# Python
__pycache__/
*.py[cod]
.venv/
venv/
.pytest_cache/
.mypy_cache/
.ruff_cache/
*.egg-info/
dist/
build/

# Editor
.vscode/
.idea/
*.swp
.DS_Store

# Local secrets (never commit)
.env
.env.local
secrets.toml
*.pat
*.token