CIS490

Author	SHA1	Message	Date
Max	aa6187042b	.gitignore: exclude data/processed/knn_*.parquet KNN fit output (PCA-3 + KMeans + KNN-classifier predictions per window) is a derived artifact regenerable from features_window_v1. Like features_window itself it stays out of git; the streamer reads it from disk on the producing host.	2026-05-08 13:20:17 -05:00
Max	3ea6bca6f0	training: self-supervised pretrain + IG XAI + project brief / slide planner LogBERT-style self-supervised Transformer pretrain on `clean`-only windows, plus Integrated Gradients attribution for any tensor model. Both directly answer the assignment's §8 'next steps in unsupervised learning' requirement and Natsos & Symeonidis 2025's RQ3 on explainability. Pretrain (training/models/transformer_ssl.py + trainer/run_ssl.py): - Masked Timestep Reconstruction (MTR): random 15% of timesteps zeroed, encoder + per-channel head reconstructs from the rest. Loss: MSE over masked positions. - Volume of Hypersphere Minimization (VHM, Deep SVDD-style): pull learnable [DIST] token embedding toward a frozen center vector initialized as the mean over clean train. Loss: ||h_dist - c||^2. - Calibrated anomaly threshold at user-configurable target FPR (default 5%) on clean-val distance distribution. - Trained ONLY on `clean`-phase windows; the model never sees a labeled malware sample yet flags any window that doesn't look clean, including novel malware the supervised classifier never saw. Uses the same schema-hashed checkpoint format as the supervised models so loaders refuse mismatched feature schemas. XAI (training/xai/integrated_gradients.py): - Per-(channel, timestep) attribution via path-integrated gradients over Riemann-mid-point steps. Works for cnn/gru/lstm/transformer/transformer_ssl. - Per-phase mean |IG| heatmaps under reports/xai/<model>/<phase>.png, top-k channel importance per phase as JSON. Smoke-verified on the trained CNN: top channel for `clean` is guest.cpu_iowait (sensible: clean = idle = high iowait). 
Project brief and slide planner: - docs/project_brief.md — full draft of the assignment's required sections 1–9 (problem, research question, ML task type with justification, six supervised algorithms with assumptions, dataset description with full validation breakdown, evaluation metrics with rationale, current progress, lit review with 11 APA citations, next steps for unsupervised, references). - docs/slide_planner.md — all 16 slides filled with content tied to specific files and metrics from this codebase, not generic placeholders. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 01:19:41 -05:00
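The "path-integrated gradients over Riemann-mid-point steps" mentioned in the XAI part of this commit can be illustrated with a minimal pure-Python sketch. This is not the code in training/xai/integrated_gradients.py: it works on a flat vector rather than a (channel, timestep) tensor, and `integrated_gradients`, `toy_model`, `toy_grad`, and `WEIGHTS` are hypothetical names with a hand-written gradient standing in for autograd.

```python
from typing import Callable, List

Vector = List[float]


def integrated_gradients(
    grad_fn: Callable[[Vector], Vector],
    x: Vector,
    baseline: Vector,
    steps: int = 32,
) -> Vector:
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    grad_fn returns the gradient of the model output w.r.t. its input.
    (Hypothetical helper; a real model would get this from autograd.)
    """
    if len(x) != len(baseline):
        raise ValueError("x and baseline must have the same length")
    grad_sum = [0.0] * len(x)
    for k in range(steps):
        # Midpoint of the k-th sub-interval of [0, 1].
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            grad_sum[i] += g
    # Attribution = (x - baseline) * average gradient along the path.
    return [(xi - b) * s / steps for xi, b, s in zip(x, baseline, grad_sum)]


# Toy stand-in model (an assumption): f(x) = sum_i w_i * x_i^2.
WEIGHTS = [0.5, 2.0, 1.0]


def toy_model(x: Vector) -> float:
    return sum(w * v * v for w, v in zip(WEIGHTS, x))


def toy_grad(x: Vector) -> Vector:
    return [2.0 * w * v for w, v in zip(WEIGHTS, x)]


x = [1.0, -2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_grad, x, baseline)
```

A useful sanity check for any IG implementation is completeness: the attributions should sum to approximately `toy_model(x) - toy_model(baseline)`. For this quadratic toy model the midpoint rule integrates the (linear) gradient path exactly, so the two sides agree up to floating-point error.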
Max	1fabd4a246	training: validator, feature/tensor extractors, 6 supervised models, schema-hashed checkpoints, eval suite, dashboard producers The model layer of the project, built honestly: - tools/dataset_validate.py — full-sweep validator over the receiver store (sha256, schema, monotonic labels, telemetry-row gate). On the current corpus: 64,798 accepted + 8,154 degraded + 3,701 rejected + 7 errored across 76,660 shipped episodes. data/processed/validation_v1.parquet is committed as the per-episode acceptance index. - training/_features.py — channel registry (46 channels across proc/guest/qmp/netflow), summary-stat windowing AND channel×time tensor extraction at 10s/5s windowing. Time alignment uses t_wall_ns (Unix ns) — tested fix for a real netflow-vs-host clock-base inconsistency that was silently dropping every netflow channel. - training/_split.py — three held-out recipes (host / sample / time) with profile-stratification assertions. held_out_host carries untested_profiles for cases like scan-and-dial absent from the test host (5 of 6 profiles tested cross-device, never silently averaged). - training/models/ — 6 architectures behind a common BaseModel interface: gbt (XGBoost), mlp, cnn, gru, lstm, transformer. Each trained twice (realistic / oracle) per the deployment threat model. Schema-hashed checkpoints refuse to load if _features.py changed since training (silent-input-drift protection, tested). - training/trainer/ — unified training loop: class-weighted CE, LR warmup + cosine, gradient clipping, mixed precision when CUDA, early stopping on val macro F1, best-on-val checkpoint. Same loop runs MLP/CNN/GRU/LSTM/Transformer; GBT uses XGBoost early_stopping_rounds on val mlogloss. - training/eval_/ — bootstrap 95% CIs on macro F1, per-class F1, per-profile and per-host breakdown, paired-bootstrap significance for model-vs-model gap. Confusion matrix uses union of seen labels. 
- training/dashboard/producers/ — replay/metrics/perf/profiles emitting the six event types the dashboard's awaiting scenes consume; on-demand tensor extraction so the Pi can run live inference without 65 GB of shards. - 17 unit tests (split coverage, features round-trip, schema mismatch, determinism, time-base alignment regression). End-to-end smoke-trained all six on a 567-episode subset; held-out test macro F1 reported with paired-bootstrap significance. The methodology now reports honest cross-device generalization, not in-distribution validation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 01:19:00 -05:00
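The schema-hashed checkpoint idea described in this commit (refuse to load when the feature definition has changed since training) can be sketched as follows. This is only an illustration under stated assumptions, not the repository's checkpoint format: `schema_hash`, `save_checkpoint`, `load_checkpoint`, and `SchemaMismatchError` are hypothetical names, the checkpoint is a plain dict, the 10 s window with 5 s stride is one reading of "10s/5s windowing", and every channel name except guest.cpu_iowait is invented.

```python
import hashlib
import json
from typing import Any, Dict, Sequence


class SchemaMismatchError(RuntimeError):
    """Raised when a checkpoint was trained on a different feature schema."""


def schema_hash(channels: Sequence[str], window_s: int, stride_s: int) -> str:
    """Hash the ordered channel list plus windowing parameters."""
    payload = json.dumps(
        {"channels": list(channels), "window_s": window_s, "stride_s": stride_s},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def save_checkpoint(
    weights: Any, channels: Sequence[str], window_s: int, stride_s: int
) -> Dict[str, Any]:
    """Bundle weights with the hash of the schema they were trained on."""
    return {
        "schema_hash": schema_hash(channels, window_s, stride_s),
        "weights": weights,
    }


def load_checkpoint(
    ckpt: Dict[str, Any], channels: Sequence[str], window_s: int, stride_s: int
) -> Any:
    """Return the weights only if the current schema matches the stored hash."""
    expected = schema_hash(channels, window_s, stride_s)
    if ckpt["schema_hash"] != expected:
        raise SchemaMismatchError(
            f"checkpoint schema {ckpt['schema_hash'][:12]} != current {expected[:12]}"
        )
    return ckpt["weights"]


# Illustrative channel names; only guest.cpu_iowait appears in the log above.
train_channels = ["guest.cpu_iowait", "proc.example_a", "netflow.example_b"]
ckpt = save_checkpoint({"w": [0.1, 0.2]}, train_channels, window_s=10, stride_s=5)

# Same schema loads; a reordered channel list is treated as a different schema.
weights = load_checkpoint(ckpt, train_channels, window_s=10, stride_s=5)
reordered = list(reversed(train_channels))
```

Hashing the ordered list is a deliberate choice in this sketch: a reordering of channels changes which tensor row a model sees in each position, so it should invalidate the checkpoint just as adding or removing a channel does.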
Max Gorog	3d4f282e9c	Tier-2 episodes use clean-only schedule; .gitignore VERSION Two correctness fixes that the §4.5 event-driven labeller surfaced: 1. tools/run_real_vm_demo.py was hardcoding a Tier-3-shaped schedule (clean → armed → infecting → infected_running → ...) for episodes with no exploit firing. Pre-§4.5 those episodes wrote dishonest `infected_running` labels from the schedule clock — exactly the §3 evidence pattern. Post-§4.5 they write `failed` at the infecting transition (the justifying exploit_fire never arrives), which is honest about what happened but useless for training. The honest fix: Tier-2 episodes have a clean-only schedule. All telemetry tagged `clean` because nothing infected anything. The total duration matches the canonical Tier-3 schedule so episode lengths are comparable across tiers — no length-bias in the dataset (§10). Helper `tier2_schedule_from(schedule)` in orchestrator/manifest.py derives `[("clean", total_seconds)]` from the canonical schedule. `tier3_schedule_from(schedule)` renders the legacy `[(name, seconds)]` shape EpisodeConfig still expects. Tier-2 demo (run_real_vm_demo.py) now calls tier2_schedule_from. Tier-3 demo (run_tier3_demo.py) now calls tier3_schedule_from. Drops the hardcoded DEFAULT_SCHEDULE constants from both — the canonical manifest is the single source of truth (§4.1). 2. .gitignore now excludes /VERSION. The install-lab-host.sh stamp writes /opt/cis490/VERSION so episodes can record code provenance without /opt/cis490 carrying a .git directory. But /opt/cis490 IS typically a git checkout on lab hosts (auto-update.sh pulls into it), so writing VERSION leaves the working tree dirty. Every episode's meta.code_version.dirty=true. PIPELINE.md §4.6 acceptance gate's rule 4 would then reject every episode without CIS490_ALLOW_DIRTY=1 set — which would break the data flow. Now VERSION is .gitignored: install-lab-host.sh stamps it, git status doesn't see it, dirty=false, gate rule 4 passes naturally. 
These two changes together keep the data flowing AND honest. Tier-2 episodes pass with `phases=[clean]` + every collector emitting real rows. Tier-3 episodes (none today, empty catalog) walk the full event-driven schedule when a verified module gets re-admitted. 286 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 01:55:37 -05:00
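The derivation of a clean-only Tier-2 schedule with the same total duration as the canonical schedule can be sketched like this. It is a hedged illustration, not the helper in orchestrator/manifest.py: the name `sketch_tier2_schedule`, the assumption that the canonical schedule is already a list of `(name, seconds)` pairs, and the phase durations are all hypothetical.

```python
from typing import List, Tuple

# Assumed shape: ordered (phase_name, duration_seconds) pairs.
Schedule = List[Tuple[str, int]]


def sketch_tier2_schedule(schedule: Schedule) -> Schedule:
    """Collapse a multi-phase schedule into one `clean` phase of equal length.

    Keeping the total duration identical means Tier-2 and Tier-3 episodes
    have comparable lengths, so episode length does not leak the tier.
    """
    total_seconds = sum(seconds for _, seconds in schedule)
    return [("clean", total_seconds)]


# Illustrative canonical schedule; the durations are made up for the example.
canonical: Schedule = [
    ("clean", 60),
    ("armed", 20),
    ("infecting", 10),
    ("infected_running", 90),
]

tier2 = sketch_tier2_schedule(canonical)
```

With these example durations the Tier-2 schedule is a single `clean` phase of 180 seconds, matching the sum of the canonical phases.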
Elliott Kolden	86fdd03de4	Dev_REL1_043026: lab-host bring-up, fixes, and issue report Full bring-up of this host from a clean clone: installed uv/perf/tcpdump, downloaded Alpine 3.21 cloud image, built cidata ISO, took baseline-v1 snapshot. Validated single-episode demo (853 rows, 8 phases) and 2-episode campaign loop (campaign_done.marker written). Cherry-picked campaign runner from Dev_REL1_042926. Fixed .gitignore to cover campaign output files. Issue report at reports/Dev_REL1_043026.md covers ISS-001 through ISS-007, with ISS-005 (missing install-lab-host.sh) remaining open. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-30 14:59:47 -06:00
Maximus Gorog	fa1574a0a6	Scaffold project: docs, repo skeleton, transport + deploy design Lays down the design surface for the CIS490 behavioral-malware-detection dataset and model. No code yet — schema and topology are decided first so collection can start without rework. Docs: - README: project goal, navigation - architecture: lab topology, KVM choice, episode state machine, deployment-mirror reasoning - threat-model: train/serve parity rule, oracle-vs-deployable feature split, two-model evaluation strategy - data-model: per-episode JSONL layout, row schemas, phase enum - transport: WG-native shipper/receiver design, idempotent uploads - deploy: one-command install for lab-host and receiver roles - lab-setup: KVM prereqs, VM build, snapshot, virtio-serial wiring Skeleton: orchestrator/, collectors/, vm/, exploits/, samples/, training/ (each with a short README explaining purpose). Extended .gitignore to exclude qcow2 images, pcaps, sample binaries, secrets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:21:00 -06:00
elliott	7a0fefc02e	Initial commit	2026-04-27 17:28:48 -06:00

7 commits