Scaffold project: docs, repo skeleton, transport + deploy design
Lays down the design surface for the CIS490 behavioral-malware-detection dataset and model. No code yet: schema and topology are decided first so collection can start without rework.

Docs:
- README: project goal, navigation
- architecture: lab topology, KVM choice, episode state machine, deployment-mirror reasoning
- threat-model: train/serve parity rule, oracle-vs-deployable feature split, two-model evaluation strategy
- data-model: per-episode JSONL layout, row schemas, phase enum
- transport: WG-native shipper/receiver design, idempotent uploads
- deploy: one-command install for lab-host and receiver roles
- lab-setup: KVM prereqs, VM build, snapshot, virtio-serial wiring

Skeleton: orchestrator/, collectors/, vm/, exploits/, samples/, training/ (each with a short README explaining purpose).

Extended .gitignore to exclude qcow2 images, pcaps, sample binaries, secrets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent
7a0fefc02e
commit
fa1574a0a6
14 changed files with 1080 additions and 0 deletions
47
.gitignore
vendored
|
|
@ -1 +1,48 @@
|
|||
# Disk images and snapshots
|
||||
*.iso
|
||||
*.img
|
||||
*.qcow2
|
||||
*.qcow2.*
|
||||
*.vmdk
|
||||
*.vdi
|
||||
*.raw
|
||||
vm/images/
|
||||
vm/snapshots/
|
||||
|
||||
# Telemetry output
|
||||
data/episodes/
|
||||
*.pcap
|
||||
*.pcapng
|
||||
|
||||
# Malware samples — NEVER commit binaries
|
||||
samples/store/
|
||||
*.bin
|
||||
*.elf
|
||||
*.exe
|
||||
*.dll
|
||||
*.so.malware
|
||||
|
||||
# Python
|
||||
__pycache__/
|
||||
*.py[cod]
|
||||
.venv/
|
||||
venv/
|
||||
.pytest_cache/
|
||||
.mypy_cache/
|
||||
.ruff_cache/
|
||||
*.egg-info/
|
||||
dist/
|
||||
build/
|
||||
|
||||
# Editor
|
||||
.vscode/
|
||||
.idea/
|
||||
*.swp
|
||||
.DS_Store
|
||||
|
||||
# Local secrets (never commit)
|
||||
.env
|
||||
.env.local
|
||||
secrets.toml
|
||||
*.pat
|
||||
*.token
|
||||
|
|
|
|||
51
README.md
Normal file
|
|
@ -0,0 +1,51 @@
|
|||
# CIS490 — Behavioral Malware Detection Dataset & Model
|
||||
|
||||
Course project for CIS490 (Cybersecurity). The end-goal is an ML model that watches
|
||||
performance metrics on a real device, decides whether the device has been breached,
|
||||
and triggers a hardware-level reset when confidence is high enough.
|
||||
|
||||
This repository covers the **dataset side** of that pipeline: we run real, public
|
||||
malware samples against intentionally vulnerable Linux VMs and capture labeled
|
||||
time-series telemetry that mirrors what the same model would see in deployment on
|
||||
a Raspberry Pi or similarly-constrained target.
|
||||
|
||||
The work is grounded in the trust-over-time scoring model from
|
||||
[IEEE 9881803](https://ieeexplore.ieee.org/document/9881803) and a related
|
||||
proprietary follow-on that pairs detection with blockchain-anchored hardware reset.
|
||||
|
||||
## What lives where
|
||||
|
||||
| Path | What it holds |
|---|---|
| [`docs/architecture.md`](docs/architecture.md) | Lab topology, KVM choice, snapshot loop, deployment-mirror reasoning |
| [`docs/threat-model.md`](docs/threat-model.md) | Train/serve parity rule and the oracle-vs-deployable feature split |
| [`docs/data-model.md`](docs/data-model.md) | On-disk JSONL schema, per-episode layout, phase enum |
| [`docs/transport.md`](docs/transport.md) | Sender/receiver design: how episodes get to the central collector over WG |
| [`docs/deploy.md`](docs/deploy.md) | One-command install for the lab-host and receiver roles |
| [`docs/lab-setup.md`](docs/lab-setup.md) | KVM prereqs, VM build, snapshot, virtio-serial wiring |
| `orchestrator/` | State machine that drives the boot → arm → detonate → observe → revert loop |
| `collectors/` | One module per telemetry source (host /proc, QMP, perf, pcap, guest agent) |
| `vm/` | qcow2 images and snapshot scripts (binaries gitignored) |
| `exploits/` | Metasploit resource scripts for repeatable exploitation |
| `samples/` | Sample manifest (sha256-pinned). **Binaries never committed.** |
| `training/` | Model training code (deferred; schema first) |
|
||||
|
||||
## Quick orientation
|
||||
|
||||
1. **Why VMs?** We need a clean snapshot/revert loop and we need to run real malware
|
||||
without burning hardware. KVM gives us both at near-native speed.
|
||||
2. **Why is the network isolated?** A host-only bridge keeps malware off the
|
||||
internet and off the WG overlay. Packet capture on that bridge is the **lab-side observer**,
playing the role that a Pi5 gateway would play in a deployed setting.
|
||||
3. **Why JSONL and not a database (yet)?** Schema-last: collect first, decide
|
||||
storage shape after we see what's actually useful. JSONL is crash-safe,
|
||||
append-only, and reshapes trivially into Postgres/Timescale/Parquet later.
|
||||
4. **Why two models?** One trained on features that exist on a real Pi
|
||||
(*deployable*), one trained on every channel, including the host-side QEMU-only ones (*oracle*). The
|
||||
accuracy gap measures how much detection power a privileged rootkit can take
|
||||
from the deployed model. See [docs/threat-model.md](docs/threat-model.md).
|
||||
|
||||
## Status
|
||||
|
||||
Project bootstrap. Skeleton, documentation, and design decisions in place;
|
||||
collection and orchestration code in progress.
|
||||
23
collectors/README.md
Normal file
|
|
@ -0,0 +1,23 @@
|
|||
# collectors/
|
||||
|
||||
One module per telemetry source. All collectors:
|
||||
|
||||
- Receive an `episode_id`, an output directory, and a shared `t_mono_origin_ns`.
|
||||
- Write JSONL into `data/episodes/<episode_id>/telemetry-<name>.jsonl`.
|
||||
- Stamp every row with the same `t_mono_ns` / `t_wall_ns` clock pair.
|
||||
- Stamp every row with `source` and `available_in_deployment` (true/false).
|
||||
- Exit cleanly on `SIGTERM` from the orchestrator.
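
The contract above can be sketched as a small row writer. This is a minimal illustration, not the project's implementation: the `RowWriter` class and its argument names are hypothetical, and `SIGTERM` handling is left out. It assumes `out_dir` is the episode directory and that `t_mono_origin_ns` was taken from the same host monotonic clock.

```python
import json
import time
from pathlib import Path


class RowWriter:
    """Appends rows for one collector, stamping the shared fields on each row."""

    def __init__(self, out_dir, name, source, available_in_deployment, t_mono_origin_ns):
        # Hypothetical constructor; the file name follows telemetry-<name>.jsonl.
        self.path = Path(out_dir) / f"telemetry-{name}.jsonl"
        self.source = source
        self.available_in_deployment = available_in_deployment
        self.t_mono_origin_ns = t_mono_origin_ns

    def write(self, fields):
        row = {
            # Episode-relative monotonic time plus the wall clock, read together.
            "t_mono_ns": time.monotonic_ns() - self.t_mono_origin_ns,
            "t_wall_ns": time.time_ns(),
            "source": self.source,
            "available_in_deployment": self.available_in_deployment,
            **fields,
        }
        # Append one JSON object per line so a crash loses at most the last row.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(row) + "\n")
        return row
```

A collector would create one writer at start-up and call `write({...})` once per sample with its source-specific fields.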
|
||||
|
||||
| Module | Source | Vantage | Role |
|---|---|---|---|
| `proc_qemu.py` | host `/proc/<qemu_pid>/{stat,io,status,schedstat}` | outside guest | oracle |
| `qmp.py` | QEMU QMP `query-stats`, `query-blockstats`, netdev | outside guest | oracle |
| `perf_qemu.py` | `perf stat -p <qemu_pid>` | outside guest | oracle |
| `pcap.py` | `tcpdump -i br-malware`, bucketed | gateway-side | feature |
| `guest_agent.py` | virtio-serial reader, parses agent JSONL | inside guest | feature |
|
||||
|
||||
The in-guest agent itself (a small Python+psutil program that runs on the
|
||||
guest and writes to `/dev/virtio-ports/cis490.guest.agent`) lives under
|
||||
`vm/guest-agent/` because it is shipped *into* the guest at image-build time.
|
||||
|
||||
See [`docs/data-model.md`](../docs/data-model.md) for row schemas.
|
||||
107
docs/architecture.md
Normal file
|
|
@ -0,0 +1,107 @@
|
|||
# Architecture
|
||||
|
||||
## One-paragraph summary
|
||||
|
||||
A QEMU/KVM host runs short, repeatable **episodes** against a vulnerable Linux
|
||||
guest. Each episode boots from a clean snapshot, captures a baseline, fires a
|
||||
known exploit, drops a public malware sample, observes the infection envelope,
|
||||
and reverts the snapshot. Telemetry is captured from five channels
|
||||
simultaneously, all stamped with the host monotonic clock so rows align. The
|
||||
output of an episode is a self-contained directory of JSONL files plus a pcap.
|
||||
|
||||
## Lab topology
|
||||
|
||||
```
|
||||
+---------------------------------------------------------------+
|
||||
| VM HOST (this machine, /home/maximus/.env/qemu) |
|
||||
| |
|
||||
| +-----------------------+ +------------------------+ |
|
||||
| | KVM guest | | orchestrator (host) | |
|
||||
| | (Metasploitable2, | | - snapshot loop | |
|
||||
| | 1 vCPU, capped) | | - exploit driver | |
|
||||
| | |<====>| - phase labeler | |
|
||||
| | in-guest agent -----|virtio| | |
|
||||
| | |serial| collectors: | |
|
||||
| | vNIC ----------------| | * host /proc/qemu_pid| |
|
||||
| | | | | * QMP query-stats | |
|
||||
| +--------|--------------+ | * perf -p qemu_pid | |
|
||||
| | | * tcpdump on br | |
|
||||
| v | * guest agent rx | |
|
||||
| br-malware (host-only, NO NAT) | | |
|
||||
| | +-----------|------------+ |
|
||||
| +--- isolated, no internet | |
|
||||
| v |
|
||||
| data/episodes/
|
||||
+----------------------------------------------------------|----+
|
||||
| (later)
|
||||
v
|
||||
WG overlay -> Pi5 (DB + ingest)
|
||||
```
|
||||
|
||||
The malware bridge `br-malware` is **host-only** — no NAT, no route to the WG
|
||||
overlay, no DNS. The orchestrator also blocks egress with nftables on the host
|
||||
as a belt-and-suspenders measure.
|
||||
|
||||
## Why KVM, not TCG and not Docker
|
||||
|
||||
| Option | Speed | Determinism | Real OS isolation | Verdict |
|---|---|---|---|---|
| TCG `-icount` | slow | bit-exact replay | yes | overkill: ML wants noise |
| **KVM** | near-native | host-scheduler noise (good) | yes | **chosen** |
| Docker | fastest | low | shares host kernel, unsafe for malware | ruled out |
|
||||
|
||||
KVM is roughly 15× faster than TCG for boot/snapshot-revert cycles, which directly
|
||||
multiplies dataset size for a fixed wall-clock budget. The "constrained
|
||||
single-threaded device" framing from the project goal is preserved by pinning to
|
||||
1 vCPU and applying a host cgroup CPU cap.
|
||||
|
||||
## The episode state machine
|
||||
|
||||
```
|
||||
snapshot_load(baseline)
|
||||
|
|
||||
v
|
||||
[clean] ---- record T_baseline seconds of idle telemetry ----+
|
||||
| |
|
||||
v |
|
||||
[armed] ---- exploit module fires; session opens ------------+
|
||||
| |
|
||||
v |
|
||||
[infecting] ---- sample uploaded + executed -----------------+
|
||||
| |
|
||||
v |
|
||||
[infected_running] ---- observe T_active seconds ------------+
|
||||
| |
|
||||
v |
|
||||
[dormant] ---- (optional) wait for sample's idle window ----+
|
||||
| |
|
||||
v |
|
||||
[reverting] ---- snapshot_load(baseline); episode ends -----+
|
||||
|
|
||||
v
|
||||
write meta.json + close jsonl
|
||||
```
|
||||
|
||||
Phase transitions are emitted by the orchestrator into `labels.jsonl` *at the
|
||||
moment the orchestrator takes the action*, not inferred from metrics afterward.
|
||||
This is what makes the dataset honestly labeled.
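
One way to keep those transitions honest is to validate each one against the diagram before the row is written. The sketch below is an assumption-laden illustration: `Phase`, `ALLOWED`, and `label_row` are hypothetical names, and failure paths (for example an exploit that never opens a session) are not modeled.

```python
import json
from enum import Enum


class Phase(str, Enum):
    CLEAN = "clean"
    ARMED = "armed"
    INFECTING = "infecting"
    INFECTED_RUNNING = "infected_running"
    DORMANT = "dormant"
    REVERTING = "reverting"


# Allowed successors, read off the diagram; dormant is optional.
ALLOWED = {
    None: {Phase.CLEAN},
    Phase.CLEAN: {Phase.ARMED},
    Phase.ARMED: {Phase.INFECTING},
    Phase.INFECTING: {Phase.INFECTED_RUNNING},
    Phase.INFECTED_RUNNING: {Phase.DORMANT, Phase.REVERTING},
    Phase.DORMANT: {Phase.REVERTING},
}


def label_row(prev, new, t_mono_ns, reason):
    """Return one labels.jsonl line, refusing transitions the diagram lacks."""
    if new not in ALLOWED.get(prev, set()):
        raise ValueError(f"illegal phase transition: {prev} -> {new}")
    return json.dumps({
        "t_mono_ns": t_mono_ns,
        "phase": new.value,
        "prev": prev.value if prev is not None else None,
        "reason": reason,
    })
```

The orchestrator would call `label_row` at the moment it takes the action, so the timestamp reflects the action rather than a later inference.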
|
||||
|
||||
## Why the lab topology mirrors deployment
|
||||
|
||||
In the field, the ML model runs on a real Pi or constrained device. Whatever
|
||||
sees the device's network traffic from outside (router, gateway, hypervisor) is
|
||||
the **gateway observer**. In our lab, the host-only bridge plays exactly that
|
||||
role — bridge-side pcap features at training time map 1:1 to gateway-side
|
||||
NetFlow/pcap features at inference time. This is what makes
|
||||
*train/serve parity* possible for the network channel even though we'll later
|
||||
run on bare metal.
|
||||
|
||||
See [`threat-model.md`](threat-model.md) for the rest of the parity story
|
||||
(host-side QEMU features must NOT be used as model inputs — they are labeling
|
||||
oracles only).
|
||||
|
||||
## Out of scope for this repo
|
||||
|
||||
- Authoring novel malware or zero-day exploits.
|
||||
- Detection-evasion research targeting other vendors' AV.
|
||||
- Production deployment of the trained model — that lives in a separate repo.
|
||||
205
docs/data-model.md
Normal file
|
|
@ -0,0 +1,205 @@
|
|||
# Data Model
|
||||
|
||||
JSONL only, no database, schema-last. Each episode is a self-contained directory.
|
||||
|
||||
## Per-episode layout
|
||||
|
||||
```
|
||||
data/episodes/<episode_id>/
|
||||
  meta.json # written at start, rewritten at end with the result summary
|
||||
events.jsonl # orchestrator actions, one row per event
|
||||
labels.jsonl # phase transitions, one row per transition
|
||||
telemetry-proc.jsonl # source 1 (oracle) host /proc/<qemu_pid>
|
||||
telemetry-qmp.jsonl # source 2 (oracle) QEMU QMP queries
|
||||
telemetry-perf.jsonl # source 3 (oracle) perf stat -p <qemu_pid>
|
||||
telemetry-guest.jsonl # source 5 (feature) in-guest agent over virtio-serial
|
||||
network.pcap # source 4 raw tcpdump -i br-malware
|
||||
netflow.jsonl # source 4 bucketed 100ms aggregations of pcap
|
||||
stderr.log # raw qemu + agent logs
|
||||
```
|
||||
|
||||
`<episode_id>` is a [ULID](https://github.com/ulid/spec) — sortable by time,
|
||||
unique without coordination, URL-safe.
|
||||
|
||||
## Common fields on every telemetry row
|
||||
|
||||
| Field | Type | Notes |
|---|---|---|
| `t_mono_ns` | int | host `CLOCK_MONOTONIC` at sample time, episode-relative origin |
| `t_wall_ns` | int | host wall clock, ns since epoch |
| `source` | string | one of `host_proc`, `host_qmp`, `host_perf`, `bridge_pcap`, `guest_agent` |
| `available_in_deployment` | bool | **true = feature, false = oracle** |
|
||||
|
||||
The `available_in_deployment` flag is denormalized onto every row so downstream
|
||||
loaders don't have to look up a separate manifest to filter for the realistic
|
||||
model.
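
A loader can then filter on the flag alone. The following is a minimal sketch; the function name `iter_rows` is hypothetical, and treating a row with no flag as oracle-only is an assumption chosen to fail safe.

```python
import json


def iter_rows(lines, deployable_only=True):
    """Yield parsed telemetry rows, optionally keeping only deployable features."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in a partially written file
        row = json.loads(line)
        # Assumption: a missing flag is treated as "not deployable".
        if deployable_only and not row.get("available_in_deployment", False):
            continue
        yield row
```

Passing an open `telemetry-*.jsonl` file object as `lines` works because file objects iterate line by line.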
|
||||
|
||||
## meta.json schema
|
||||
|
||||
```json
|
||||
{
|
||||
"episode_id": "01HW9GZJ7K8QF5W3X2Y6N1A4B0",
|
||||
"schema_version": 1,
|
||||
"started_at_wall": "2026-04-28T22:30:00Z",
|
||||
"ended_at_wall": "2026-04-28T22:31:42Z",
|
||||
"git_commit": "<sha>",
|
||||
"host_fingerprint": {
|
||||
"kernel": "6.18.8",
|
||||
"qemu_version": "9.0.0",
|
||||
"cpu_model": "...",
|
||||
"smt_off": true
|
||||
},
|
||||
"vm": {
|
||||
"image_name": "metasploitable2",
|
||||
"image_sha256": "...",
|
||||
"vcpus": 1,
|
||||
"ram_mib": 512,
|
||||
"cgroup_cpu_cap": "800ms/1s",
|
||||
"snapshot_name": "baseline-v1"
|
||||
},
|
||||
"exploit": {
|
||||
"framework": "metasploit",
|
||||
"module": "exploit/multi/samba/usermap_script",
|
||||
"rport": 445,
|
||||
"rhost": "10.200.0.10"
|
||||
},
|
||||
"sample": {
|
||||
"name": "linux.miner.xmrig.elf",
|
||||
"sha256": "...",
|
||||
"source": "MalwareBazaar",
|
||||
"first_seen": "2024-...",
|
||||
"category": "miner"
|
||||
},
|
||||
"schedule": {
|
||||
"baseline_seconds": 30,
|
||||
"infected_seconds": 90,
|
||||
"dormant_seconds": 60
|
||||
},
|
||||
"result": {
|
||||
"phases_observed": ["clean","armed","infecting","infected_running","dormant"],
|
||||
"exploit_succeeded": true,
|
||||
"sample_executed": true,
|
||||
"snapshot_revert_ok": true
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## events.jsonl
|
||||
|
||||
One row per orchestrator action. Tells you exactly what happened and when.
|
||||
|
||||
```json
|
||||
{"t_mono_ns": 0, "t_wall_ns": 1745875200000000000, "event": "snapshot_load", "snapshot": "baseline-v1"}
|
||||
{"t_mono_ns": 30100000000,"t_wall_ns": 1745875230100000000,"event": "exploit_fire", "module": "exploit/multi/samba/usermap_script"}
|
||||
{"t_mono_ns": 31250000000,"t_wall_ns": 1745875231250000000,"event": "session_open", "session_id": 1}
|
||||
{"t_mono_ns": 31300000000,"t_wall_ns": 1745875231300000000,"event": "sample_uploaded", "path": "/tmp/.x", "sha256": "..."}
|
||||
{"t_mono_ns": 31400000000,"t_wall_ns": 1745875231400000000,"event": "sample_executed", "pid_reported_by_guest": 1042}
|
||||
{"t_mono_ns": 121400000000,"t_wall_ns": 1745875321400000000,"event": "snapshot_revert", "snapshot": "baseline-v1"}
|
||||
```
|
||||
|
||||
## labels.jsonl
|
||||
|
||||
```json
|
||||
{"t_mono_ns": 0, "phase": "clean", "prev": null, "reason": "snapshot_loaded"}
|
||||
{"t_mono_ns": 30100000000, "phase": "armed", "prev": "clean", "reason": "exploit_module_running"}
|
||||
{"t_mono_ns": 31250000000, "phase": "infecting", "prev": "armed", "reason": "session_open"}
|
||||
{"t_mono_ns": 31400000000, "phase": "infected_running","prev": "infecting", "reason": "sample_executed"}
|
||||
{"t_mono_ns": 91400000000, "phase": "dormant", "prev": "infected_running", "reason": "scheduler_transition"}
|
||||
```
|
||||
|
||||
### Phase enum (closed)
|
||||
|
||||
```
|
||||
clean — known-good, post-snapshot-load, pre-exploit
|
||||
armed — exploit module is running but no session yet
|
||||
infecting — session opened, sample landing/starting
|
||||
infected_running — sample is actively producing observable behavior
|
||||
dormant — sample is present but idle (sleep timer, beacon interval)
|
||||
reverting — snapshot_load triggered, episode ending
|
||||
```
|
||||
|
||||
## telemetry-proc.jsonl (source 1, oracle)
|
||||
|
||||
```json
|
||||
{
|
||||
"t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
|
||||
"source": "host_proc", "available_in_deployment": false,
|
||||
"cpu_user_jiffies": 142, "cpu_sys_jiffies": 38,
|
||||
"rss_bytes": 542113792, "vsize_bytes": 1842933760,
|
||||
"io_read_bytes": 0, "io_write_bytes": 4096,
|
||||
"voluntary_ctxsw": 12, "involuntary_ctxsw": 3,
|
||||
"minor_faults": 412, "major_faults": 0
|
||||
}
|
||||
```
|
||||
|
||||
## telemetry-qmp.jsonl (source 2, oracle)
|
||||
|
||||
```json
|
||||
{
|
||||
"t_mono_ns": 1000000000, "t_wall_ns": 1745875201000000000,
|
||||
"source": "host_qmp", "available_in_deployment": false,
|
||||
"blockstats": {"vda": {"rd_ops": 12, "wr_ops": 4, "rd_bytes": 49152, "wr_bytes": 16384}},
|
||||
"kvm_exits": {"total": 18342, "io": 942, "mmio": 12, "halt": 17000, "irq_window": 110},
|
||||
"netdev": {"net0": {"rx_packets": 0, "tx_packets": 4, "rx_bytes": 0, "tx_bytes": 256}}
|
||||
}
|
||||
```
|
||||
|
||||
## telemetry-perf.jsonl (source 3, oracle)
|
||||
|
||||
```json
{
  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
  "source": "host_perf", "available_in_deployment": false,
  "cycles": 184213104, "instructions": 121987001,
  "cache_references": 1041213, "cache_misses": 38104,
  "branches": 24198421, "branch_misses": 412004,
  "page_faults": 12, "context_switches": 18,
  "ipc": 0.66, "cache_miss_rate": 0.0366
}
```
|
||||
|
||||
## netflow.jsonl (source 4, feature)
|
||||
|
||||
Bucketed from the pcap. The pcap stays raw on disk for re-derivation later.
|
||||
|
||||
```json
|
||||
{
|
||||
"t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
|
||||
"source": "bridge_pcap", "available_in_deployment": true,
|
||||
"bucket_ms": 100,
|
||||
"pkts_in": 0, "pkts_out": 0, "bytes_in": 0, "bytes_out": 0,
|
||||
"unique_dst_ips": 0, "unique_dst_ports": 0,
|
||||
"syn_count": 0, "fin_count": 0, "rst_count": 0,
|
||||
"dns_query_count": 0, "tcp_new_flows": 0
|
||||
}
|
||||
```
|
||||
|
||||
## telemetry-guest.jsonl (source 5, feature)
|
||||
|
||||
```json
{
  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
  "source": "guest_agent", "available_in_deployment": true,
  "cpu_pct_total": 12.4, "load_1m": 0.41,
  "mem_used_bytes": 184213504, "mem_available_bytes": 354127872,
  "thermal_milli_c": 47200,
  "net": {"eth0": {"rx_bytes": 0, "tx_bytes": 256, "rx_pkts": 0, "tx_pkts": 4}},
  "top_procs": [
    {"pid": 1042, "comm": "kworker/0:1", "cpu_pct": 0.4, "rss_bytes": 1048576},
    {"pid": 1, "comm": "systemd", "cpu_pct": 0.1, "rss_bytes": 4194304}
  ],
  "listen_ports": [22, 80, 445]
}
```
|
||||
|
||||
## Versioning
|
||||
|
||||
`schema_version` lives in `meta.json`. Bump when any row schema changes. Keep
|
||||
old episodes untouched; loaders dispatch on version.
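
Dispatching on version could look roughly like this. It is a sketch only: `LOADERS`, `load_row_v1`, and `loader_for` are hypothetical names, and the version 1 loader is a pass-through because no later schema exists yet.

```python
def load_row_v1(row):
    """Version 1 rows are used as written."""
    return row


# One entry per schema_version; older loaders stay in place when a new one is added.
LOADERS = {1: load_row_v1}


def loader_for(meta):
    """Pick the row loader for an episode from its parsed meta.json."""
    version = meta.get("schema_version")
    try:
        return LOADERS[version]
    except KeyError:
        raise ValueError(f"unsupported schema_version: {version!r}") from None
```

Raising on an unknown version keeps a newer episode from being silently parsed with an older loader.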
|
||||
|
||||
## Ingest later
|
||||
|
||||
When we move to a database (Timescale most likely), each `telemetry-*.jsonl`
|
||||
becomes one hypertable, partitioned by `t_wall_ns`, indexed on
|
||||
`(episode_id, source)`. The deployment-tag flag becomes a column we filter on
|
||||
when materializing the realistic-model training view.
|
||||
138
docs/deploy.md
Normal file
|
|
@ -0,0 +1,138 @@
|
|||
# Deploy
|
||||
|
||||
Two roles. One install command each.
|
||||
|
||||
## Roles
|
||||
|
||||
| Role | Where it runs | What it does |
|---|---|---|
| `lab-host` | any KVM-capable Linux box on WG | runs episodes, ships completed episodes to the receiver |
| `receiver` | Pi5 (or any always-on WG node) | accepts ship uploads, stores tarballs + `index.jsonl` |
|
||||
|
||||
## Lab host install
|
||||
|
||||
```sh
|
||||
git clone https://maxgit.wg/spectral/CIS490.git
|
||||
cd CIS490
|
||||
./scripts/install-lab-host.sh
|
||||
```
|
||||
|
||||
The installer:
|
||||
|
||||
1. Verifies KVM (`/dev/kvm` exists, user in `kvm` group).
|
||||
2. Installs system deps via the host package manager (qemu, tcpdump,
|
||||
linux-tools/perf, zstd, python ≥ 3.11).
|
||||
3. Bootstraps a [`uv`](https://github.com/astral-sh/uv)-managed venv at
|
||||
`.venv/` and installs the pinned Python deps from `uv.lock`.
|
||||
4. Drops two systemd units into `/etc/systemd/system/`:
|
||||
- `cis490-orchestrator.service` — runs the episode loop on a queue
|
||||
- `cis490-shipper.service` — watches `data/episodes/` and ships completed
|
||||
episodes
|
||||
5. Writes a config template to `/etc/cis490/lab-host.toml` (idempotent — only
|
||||
on first install).
|
||||
|
||||
You finish by editing `/etc/cis490/lab-host.toml` to point at your receiver
|
||||
and to enroll your lab host's WG-issued client cert, then:
|
||||
|
||||
```sh
|
||||
sudo systemctl enable --now cis490-orchestrator cis490-shipper
|
||||
```
|
||||
|
||||
### `lab-host.toml`
|
||||
|
||||
```toml
|
||||
host_id = "lab-host-1"
|
||||
|
||||
[paths]
|
||||
data_root = "/var/lib/cis490/data"
|
||||
samples_store = "/var/lib/cis490/samples/store"
|
||||
qcow_image = "/var/lib/cis490/vm/images/metasploitable2.qcow2"
|
||||
|
||||
[receiver]
|
||||
url = "https://collector.wg"
|
||||
client_cert = "/etc/cis490/certs/lab-host-1.pem"
|
||||
client_key = "/etc/cis490/certs/lab-host-1.key"
|
||||
ca_bundle = "/etc/cis490/certs/wg-ca.pem"
|
||||
|
||||
[episode]
|
||||
baseline_seconds = 30
|
||||
infected_seconds = 90
|
||||
dormant_seconds = 60
|
||||
|
||||
[retention]
|
||||
keep_local_for_days = 7
|
||||
prune_at_disk_pct = 80
|
||||
```
|
||||
|
||||
## Receiver install
|
||||
|
||||
On the Pi5 (or designated central node):
|
||||
|
||||
```sh
|
||||
git clone https://maxgit.wg/spectral/CIS490.git
|
||||
cd CIS490
|
||||
./scripts/install-receiver.sh
|
||||
```
|
||||
|
||||
The installer:
|
||||
|
||||
1. Installs Python ≥ 3.11 + zstd + a tiny ASGI runner (uvicorn).
|
||||
2. Bootstraps the same `uv`-managed venv.
|
||||
3. Drops `cis490-receiver.service` listening on `127.0.0.1:8443` (TLS
|
||||
terminated by the existing Caddy in `spectral/caddy`, which already binds
|
||||
`*.wg`).
|
||||
4. Writes a config template to `/etc/cis490/receiver.toml`.
|
||||
|
||||
Caddy block (added to your `spectral/caddy` config) for the receiver:
|
||||
|
||||
```caddy
|
||||
collector.wg {
|
||||
tls internal
|
||||
reverse_proxy 127.0.0.1:8443 {
|
||||
transport http {
|
||||
tls
|
||||
tls_client_auth /etc/cis490/certs/wg-ca.pem
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
(mTLS terminates at the receiver, not Caddy — so the receiver sees the
|
||||
client cert and can enforce per-host policies later.)
|
||||
|
||||
### `receiver.toml`
|
||||
|
||||
```toml
|
||||
listen_addr = "127.0.0.1:8443"
|
||||
store_root = "/var/lib/cis490/episodes"
|
||||
incoming_root = "/var/lib/cis490/incoming"
|
||||
index_path = "/var/lib/cis490/index.jsonl"
|
||||
ca_bundle = "/etc/cis490/certs/wg-ca.pem"
|
||||
|
||||
[limits]
|
||||
max_episode_bytes = 268_435_456 # 256 MiB
|
||||
```
|
||||
|
||||
## Day-2 operations
|
||||
|
||||
```sh
|
||||
# How many episodes have been shipped?
|
||||
ssh collector.wg 'wc -l /var/lib/cis490/index.jsonl'
|
||||
|
||||
# What's in the outbox on a lab host? (failed/pending shipments)
|
||||
ls /var/lib/cis490/data/outbox/
|
||||
|
||||
# Tail the orchestrator log
|
||||
journalctl -u cis490-orchestrator -f
|
||||
|
||||
# Tail the shipper log
|
||||
journalctl -u cis490-shipper -f
|
||||
```
|
||||
|
||||
## Updating
|
||||
|
||||
```sh
|
||||
git pull
|
||||
./scripts/install-lab-host.sh # idempotent; re-syncs deps and units
|
||||
sudo systemctl restart cis490-orchestrator cis490-shipper
|
||||
```
|
||||
145
docs/lab-setup.md
Normal file
|
|
@ -0,0 +1,145 @@
|
|||
# Lab Setup
|
||||
|
||||
How to bring up the host, build the guest, and verify the snapshot loop.
|
||||
|
||||
## Host prerequisites
|
||||
|
||||
```
|
||||
qemu-system-x86_64 >= 8.0
|
||||
qemu-img >= 8.0
|
||||
bridge-utils
|
||||
tcpdump / tshark
|
||||
linux-tools-common (for `perf`)
|
||||
zstd
|
||||
python >= 3.11
|
||||
uv (https://github.com/astral-sh/uv)
|
||||
```
|
||||
|
||||
`scripts/install-lab-host.sh` installs all of these and wires up systemd —
|
||||
see [`deploy.md`](deploy.md).
|
||||
|
||||
KVM must be enabled in the kernel and the user must be in the `kvm` group:
|
||||
|
||||
```
|
||||
ls /dev/kvm # must exist
|
||||
groups # must include kvm
|
||||
```
|
||||
|
||||
## Network: host-only malware bridge
|
||||
|
||||
`br-malware` (10.200.0.1/24) is the only network the guest sees, and it is
|
||||
host-only — no NAT, no upstream route. The host's WG interface is on a
|
||||
*separate* link (`wg0`) used only for shipping completed episodes to the
|
||||
collector; the bridge and WG never touch.
|
||||
|
||||
| Interface | Purpose |
|---|---|
| `br-malware` (10.200.0.1/24) | host-only bridge, only NIC attached to the guest |
| guest `eth0` | DHCP from a dnsmasq bound only to `br-malware` |
| host WG (`wg0`) | shipping channel to the collector, not connected to the bridge |
|
||||
|
||||
> Detailed firewall rules and the egress-drop safety net are out of scope for
|
||||
> this document and live in the deploy script. The relevant invariant for
|
||||
> readers is: **the guest cannot route off `br-malware`, period.**
|
||||
|
||||
## Guest: Metasploitable 2
|
||||
|
||||
1. Download from the [Rapid7 mirror](https://information.rapid7.com/download-metasploitable-2017.html)
|
||||
(verify sha256 against the published value before use).
|
||||
2. Convert VMware → qcow2:
|
||||
|
||||
```
|
||||
qemu-img convert -O qcow2 -p Metasploitable.vmdk metasploitable2.qcow2
|
||||
```
|
||||
|
||||
3. First boot (no snapshot yet) — let it come up, log in (msfadmin/msfadmin),
|
||||
confirm services are listening on the expected ports, shut down cleanly.
|
||||
4. Take the baseline snapshot:
|
||||
|
||||
```
|
||||
qemu-img snapshot -c baseline-v1 metasploitable2.qcow2
|
||||
```
|
||||
|
||||
Internal qcow2 snapshots load in well under a second — this is the
|
||||
"factory reset" mechanism for every episode.
|
||||
|
||||
## Single-vCPU constrained-device emulation
|
||||
|
||||
```
|
||||
-cpu host -smp 1,sockets=1,cores=1,threads=1
|
||||
-m 512
|
||||
-machine type=q35,accel=kvm
|
||||
```
|
||||
|
||||
Plus a host-side cgroup CPU cap on the QEMU process (e.g. 80% of one core) so
|
||||
the guest behaves like a small, constrained device under load.
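
For a cgroup v2 host, a cap like this is expressed as a `<quota> <period>` pair in microseconds written to the cgroup's `cpu.max` file. The helper below only computes that string; the function name is hypothetical, the use of cgroup v2 is an assumption, and the kernel applies its own bounds on both values.

```python
def cpu_max_line(fraction_of_core, period_us=100_000):
    """Return the 'quota period' string for cpu.max, both in microseconds."""
    if not 0 < fraction_of_core <= 1:
        raise ValueError("fraction_of_core must be in (0, 1]")
    if period_us <= 0:
        raise ValueError("period_us must be positive")
    quota_us = int(round(fraction_of_core * period_us))
    return f"{quota_us} {period_us}"
```

An 80% cap with the default period gives `80000 100000`; with a one-second period it gives `800000 1000000`, which corresponds to the `800ms/1s` form used in `meta.json`.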
|
||||
|
||||
## Telemetry channels
|
||||
|
||||
### virtio-serial for the in-guest agent
|
||||
|
||||
```
|
||||
-device virtio-serial-pci
|
||||
-chardev socket,path=/run/qemu/guest-agent.sock,server=on,wait=off,id=ga
|
||||
-device virtserialport,chardev=ga,name=cis490.guest.agent
|
||||
```
|
||||
|
||||
The in-guest agent opens `/dev/virtio-ports/cis490.guest.agent` and writes
|
||||
JSONL to it. Host side, the orchestrator reads from the unix socket. No network
|
||||
involvement = the malware cannot interfere with this channel.
|
||||
|
||||
### QMP for live oracle queries
|
||||
|
||||
```
|
||||
-qmp unix:/run/qemu/qmp.sock,server=on,wait=off
|
||||
```
|
||||
|
||||
The orchestrator polls `query-stats`, `query-blockstats`, and netdev stats over
|
||||
this socket.
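
QMP is line-delimited JSON: the server sends a greeting, the client negotiates with `qmp_capabilities`, and each command gets a `return` or `error` reply, with asynchronous `event` messages possibly interleaved. A minimal client sketch follows; the function names are hypothetical, and it works on any pair of binary file-like objects so it can be exercised without a running VM.

```python
import json


def qmp_execute(rfile, wfile, command, arguments=None):
    """Send one QMP command and return the value of its 'return' reply."""
    request = {"execute": command}
    if arguments is not None:
        request["arguments"] = arguments
    wfile.write(json.dumps(request).encode("utf-8") + b"\r\n")
    wfile.flush()
    while True:
        line = rfile.readline()
        if not line:
            raise ConnectionError("QMP socket closed")
        reply = json.loads(line)
        if "event" in reply:
            continue  # asynchronous events can arrive before the reply
        if "error" in reply:
            raise RuntimeError(reply["error"].get("desc", "QMP error"))
        return reply["return"]


def qmp_handshake(rfile, wfile):
    """Read the greeting and leave capabilities-negotiation mode."""
    greeting = json.loads(rfile.readline())
    if "QMP" not in greeting:
        raise ConnectionError("unexpected QMP greeting")
    qmp_execute(rfile, wfile, "qmp_capabilities")
```

With a real VM, the two file objects would come from a connected `AF_UNIX` socket via `sock.makefile("rb")` and `sock.makefile("wb")`, followed by `qmp_handshake(...)` and then `qmp_execute(..., "query-blockstats")`.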
|
||||
|
||||
### perf stat on the QEMU process
|
||||
|
||||
```
|
||||
perf stat -p <qemu_pid> -I 100 \
|
||||
-e cycles,instructions,cache-references,cache-misses,branches,branch-misses,page-faults,context-switches \
|
||||
-x , -o telemetry-perf.csv
|
||||
```
|
||||
|
||||
The collector tails the CSV, parses, and emits JSONL.
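
A rough sketch of that parsing step, assuming the `-I`/`-x ,` output puts the interval timestamp, counter value, unit, and event name in the first four fields, and that uncounted events show a non-numeric value. The function names and the `perf_t_s` field are hypothetical; aligning perf's own timestamp to `t_mono_ns` is left out.

```python
def parse_perf_line(line):
    """Return (interval_seconds, event_name, count), or None for unusable lines."""
    if line.lstrip().startswith("#"):
        return None  # comment/header line
    parts = line.strip().split(",")
    if len(parts) < 4:
        return None
    t_s, value, _unit, event = parts[:4]
    try:
        count = int(value)
    except ValueError:
        return None  # e.g. "<not counted>"
    return float(t_s), event.replace("-", "_"), count


def _finish(t_s, counters):
    row = {"perf_t_s": t_s, "source": "host_perf",
           "available_in_deployment": False, **counters}
    if counters.get("cycles") and "instructions" in counters:
        row["ipc"] = round(counters["instructions"] / counters["cycles"], 2)
    if counters.get("cache_references") and "cache_misses" in counters:
        row["cache_miss_rate"] = round(
            counters["cache_misses"] / counters["cache_references"], 4)
    return row


def rows_from_perf(lines):
    """Group per-event CSV lines into one row per reporting interval."""
    current_t, counters = None, {}
    for line in lines:
        parsed = parse_perf_line(line)
        if parsed is None:
            continue
        t_s, event, count = parsed
        if current_t is not None and t_s != current_t:
            yield _finish(current_t, counters)
            counters = {}
        current_t = t_s
        counters[event] = count
    if counters:
        yield _finish(current_t, counters)
```

The derived `ipc` and `cache_miss_rate` fields are computed here rather than taken from perf's metric columns, so they stay available even when perf omits them.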
|
||||
|
||||
### tcpdump on `br-malware`
|
||||
|
||||
```
|
||||
tcpdump -i br-malware -w network.pcap -B 4096 -s 200
|
||||
```
|
||||
|
||||
Post-process to `netflow.jsonl` with 100ms buckets.
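
The bucketing itself is simple once packets are decoded. The sketch below skips pcap parsing entirely and assumes each packet is already a dict with hypothetical keys `t_mono_ns`, `length`, `direction` (`"in"` or `"out"`, relative to the guest), `dst_ip`, `dst_port`, and an optional `syn` flag. It stamps each row with the bucket start time and emits empty buckets as zero rows; both choices are assumptions, and only a subset of the `netflow.jsonl` fields is computed.

```python
from collections import defaultdict

BUCKET_NS = 100_000_000  # 100 ms


def bucket_packets(packets, bucket_ns=BUCKET_NS):
    """Aggregate decoded packets into fixed-width time buckets."""
    by_bucket = defaultdict(list)
    for pkt in packets:
        by_bucket[pkt["t_mono_ns"] // bucket_ns].append(pkt)
    rows = []
    last = max(by_bucket, default=-1)
    for idx in range(last + 1):
        pkts = by_bucket.get(idx, [])
        out = [p for p in pkts if p["direction"] == "out"]
        inbound = [p for p in pkts if p["direction"] == "in"]
        rows.append({
            "t_mono_ns": idx * bucket_ns,
            "source": "bridge_pcap",
            "available_in_deployment": True,
            "bucket_ms": bucket_ns // 1_000_000,
            "pkts_in": len(inbound),
            "pkts_out": len(out),
            "bytes_in": sum(p["length"] for p in inbound),
            "bytes_out": sum(p["length"] for p in out),
            "unique_dst_ips": len({p["dst_ip"] for p in out}),
            "unique_dst_ports": len({p["dst_port"] for p in out}),
            "syn_count": sum(1 for p in pkts if p.get("syn")),
        })
    return rows
```

Emitting zero rows for quiet intervals keeps the series evenly spaced, which makes it easier to align with the other telemetry files.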
|
||||
|
||||
## Snapshot loop sanity check
|
||||
|
||||
A green light before any data collection:
|
||||
|
||||
1. `qemu-img snapshot -l metasploitable2.qcow2` shows `baseline-v1`.
|
||||
2. Boot the VM with the qcow2.
|
||||
3. Touch a file in the guest. Shut down.
|
||||
4. `qemu-img snapshot -a baseline-v1 metasploitable2.qcow2`.
|
||||
5. Boot again. The file is gone. ✅
|
||||
|
||||
## Safety checks before running real samples
|
||||
|
||||
- `ip route show table all | grep br-malware` shows no route off the bridge.
|
||||
- `dig @host example.com` from a guest fails (no DNS for malware).
|
||||
- The host's WG interface is **not** bridged to `br-malware`.
|
||||
|
||||
(See `scripts/install-lab-host.sh` for the firewall plumbing — it isn't the
|
||||
focus of this project.)
|
||||
|
||||
## Where to put VMs and snapshots
|
||||
|
||||
```
|
||||
vm/images/ # qcow2 disk images (gitignored)
|
||||
vm/snapshots/ # named snapshot exports if we ever externalize them
|
||||
```
|
||||
|
||||
Both directories are gitignored. The repo only carries the *recipes* for
|
||||
reproducing them.
|
||||
94
docs/threat-model.md
Normal file
|
|
@ -0,0 +1,94 @@
|
|||
# Threat Model & Train/Serve Parity
|
||||
|
||||
The single most important design rule in this project:
|
||||
|
||||
> **A feature used by the deployed model must exist on the deployed device.**
|
||||
|
||||
Violating this rule produces a model that scores 99% in the lab and is useless in
|
||||
the field. This document spells out which features fall on which side of that
|
||||
line, and why we still bother capturing both.
|
||||
|
||||
## The setting
|
||||
|
||||
The deployed model runs on a real, non-virtualized device — a Raspberry Pi, an
|
||||
IoT endpoint, or similar. It tries to detect the moment that device gets
|
||||
breached. Two adversarial facts shape the design:
|
||||
|
||||
1. **Malware can lie to in-device tools.** A sufficiently-privileged rootkit can
|
||||
hook `/proc`, intercept `perf_event_open`, and hide its own processes.
|
||||
2. **There is no host-side QEMU view.** The deployed device is the actual
|
||||
machine. Nothing is watching it from outside *the OS itself*.
|
||||
|
||||
So the model has two trustworthy floors:
|
||||
|
||||
- **In-device features that survive most malware** (perf counters via the syscall
  interface, thermals, gross resource counters): possible to falsify in principle,
  but in practice most commodity malware doesn't bother.
- **Off-device features at the gateway** (network telemetry observed by an
  upstream router/gateway): physics-bound, because malware cannot hide bytes from
  the gateway once they leave the NIC.
|
||||
|
||||
## Two roles: features vs. oracles
|
||||
|
||||
Every measurement we capture in the lab gets one of two roles:
|
||||
|
||||
| Role | What it's used for | Available in deployment? |
|---|---|---|
| **Feature** | Input to the trained model | **Must be yes** |
| **Oracle** | Ground-truth labeling during training only | No, but we have it in the lab |
|
||||
|
||||
The oracle channels (host `/proc/<qemu_pid>`, QMP `query-stats`,
|
||||
`perf -p qemu_pid`) are how we know with certainty what the guest is *actually*
|
||||
doing — not what it claims to be doing. We use that certainty to assign correct
|
||||
labels in `labels.jsonl`. The realistic (deployed) model never sees them as inputs; only the oracle model trains on them.
|
||||
|
||||
## Channel taxonomy
|
||||
|
||||
| # | Channel | Vantage | Role | Why |
|---|---|---|---|---|
| 1 | Host `/proc/<qemu_pid>` | outside guest | oracle | doesn't exist on real device |
| 2 | QEMU QMP `query-stats`, `query-blockstats` | outside guest | oracle | same |
| 3 | `perf stat -p <qemu_pid>` | outside guest | oracle | same |
| 4 | Bridge-side pcap (`tcpdump -i br-malware`) | gateway | **feature** | matches Pi5 gateway in field |
| 5 | In-guest `/proc/*`, `perf_event_open`, `/sys/class/thermal/*` | inside guest | **feature** | same exact source on real device |
|
||||
|
||||
Note: in-guest features (5) are the same syscall surfaces we'd read on a real
|
||||
Pi. The data we capture from them in the lab and the data we capture from them
|
||||
on a Pi at deployment are pulled from identical kernel APIs — that is what makes
|
||||
parity hold.
|
||||
|
||||
## The two-model evaluation strategy
|
||||
|
||||
We will train two classifiers from the same dataset:
|
||||
|
||||
1. **Realistic model** — trained only on features (channels 4 + 5).
|
||||
*This is what would be deployed.*
|
||||
2. **Oracle model** — trained on everything (channels 1–5).
|
||||
*This is the upper bound on what was learnable from this dataset.*
|
||||
|
||||
The interesting numbers are:
|
||||
|
||||
- The realistic model's accuracy → **how well the deployable system actually works**.
|
||||
- The gap between realistic and oracle → **how much detection power a privileged
|
||||
rootkit can take away from us by lying** (the cost of in-device measurement
|
||||
being adversarial).
|
||||
- Realistic-without-pcap vs. realistic-with-pcap → **how much the gateway
|
||||
observer is worth** (the value of off-device telemetry).
|
||||
|
||||
Reporting all three is the paper-worthy result — not just one accuracy number.
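
The three comparisons above come down to training on different subsets of the same rows. A minimal sketch, assuming each row carries a `channel` number matching the taxonomy table; the row layout and the function names are hypothetical, not the project's schema:

```python
# Channel numbers follow the taxonomy table; the row layout is an assumption.
ORACLE_CHANNELS = {1, 2, 3}
FEATURE_CHANNELS = {4, 5}


def select_rows(rows, channels):
    """Keep only rows whose channel number is in `channels`."""
    return [r for r in rows if r["channel"] in channels]


def training_views(rows):
    """Return the row sets needed for the three reported comparisons."""
    return {
        "realistic": select_rows(rows, FEATURE_CHANNELS),
        "realistic_no_pcap": select_rows(rows, {5}),
        "oracle": select_rows(rows, ORACLE_CHANNELS | FEATURE_CHANNELS),
    }
```

Keeping the split in one place makes it harder for an oracle channel to leak into the realistic model by accident.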
|
||||
|
||||
## Using oracle disagreement as a label
|
||||
|
||||
When the in-guest agent reports CPU at 8% and the host says the QEMU process is
|
||||
burning 70%, that *is* the rootkit-active signal — and it's more reliable than
|
||||
any single in-guest measurement. Episodes where this divergence appears get a
|
||||
high-confidence `infected_running` label even if the in-guest data alone looks
|
||||
ambiguous. This is the practical payoff of capturing both sides.
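
One way the divergence check could look, as a sketch only. It assumes both readings are already normalized to the same 0 to 100 scale, and the 30-point threshold is an illustrative value, not a tuned one:

```python
def cpu_divergence(guest_cpu_pct, host_qemu_cpu_pct):
    """Host-observed minus guest-reported CPU, in percentage points."""
    return host_qemu_cpu_pct - guest_cpu_pct


def is_high_confidence_infected(guest_cpu_pct, host_qemu_cpu_pct, threshold_pct=30.0):
    """Flag a sample when the host sees far more CPU than the guest admits.

    `threshold_pct` is a hypothetical default chosen for illustration.
    """
    return cpu_divergence(guest_cpu_pct, host_qemu_cpu_pct) >= threshold_pct
```

With the numbers from the paragraph above, `is_high_confidence_infected(8.0, 70.0)` is true because the divergence is 62 points. A real labeler would likely also need to account for QEMU's own emulation overhead before comparing.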
|
||||
|
||||
## What we are not claiming
|
||||
|
||||
- We are not claiming to detect kernel rootkits robustly from in-guest data alone.
|
||||
The oracle/feature gap will quantify the limit.
|
||||
- We are not claiming the trained model is safe to deploy without the gateway
|
||||
observer in production — for the strongest threat model, gateway-side fusion
|
||||
is required.
|
||||
164
docs/transport.md
Normal file
|
|
@ -0,0 +1,164 @@
|
|||
# Transport — Centralized Episode Collection over WG
|
||||
|
||||
The dataset lives wherever it is convenient to train from. In our setup that is
|
||||
the Pi5 (or whatever the team designates as the central collector), reachable
|
||||
over the WG overlay at `<receiver-host>.wg`. This document describes how
|
||||
episodes get from a lab host to the central collector.
|
||||
|
||||
## Design goals
|
||||
|
||||
1. **Easy to deploy.** One config file, one systemd unit per side. No DB
|
||||
required to start collecting.
|
||||
2. **WG-native.** Sender and receiver both live on the WG overlay; transport is
|
||||
just HTTPS over WG. We use the existing wg-pki CA for mTLS.
|
||||
3. **Idempotent.** Re-shipping the same episode is safe and cheap; the
|
||||
receiver responds 200 if the bytes already match.
|
||||
4. **Crash-safe.** Lab host crash mid-episode does not corrupt the central
|
||||
store. A receiver crash mid-upload leaves no partially written tarball visible.
|
||||
5. **Schema-free.** The receiver does not parse JSONL; it stores tarballs and
|
||||
an append-only index. The schema is interpreted only at training time.
|
||||
|
||||
## What gets shipped
|
||||
|
||||
A complete episode directory is tarred and zstd-compressed:
|
||||
|
||||
```
data/episodes/<episode_id>/ → <episode_id>.tar.zst
```
|
||||
|
||||
The orchestrator marks an episode complete by writing a `done.marker` file as
the *last* step, after `meta.json` is finalized. The shipper only
|
||||
considers directories that contain `done.marker`, so partially written episodes
|
||||
are invisible to it.
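
The marker check can be sketched in a few lines. The function name is hypothetical, and the only assumption about layout is the `data/episodes/<episode_id>/done.marker` convention described above:

```python
from pathlib import Path


def complete_episodes(episodes_root):
    """Return episode directories that contain `done.marker`, sorted by name."""
    root = Path(episodes_root)
    return sorted(
        d for d in root.iterdir()
        if d.is_dir() and (d / "done.marker").is_file()
    )
```

Directories without the marker are skipped, so an episode that is still being written never reaches the tar step.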
|
||||
|
||||
## Wire protocol
|
||||
|
||||
```
PUT https://<receiver-host>.wg/v1/episodes/<host_id>/<episode_id>.tar.zst
Content-Type: application/zstd
Content-Length: <bytes>
X-Content-SHA256: <sha256-of-body>
X-Schema-Version: 1
X-Lab-Host: <host_id>
X-Episode-Id: <episode_id>
body: <the tar.zst bytes>
```
|
||||
|
||||
Auth: mTLS using a leaf certificate issued by the wg-pki CA. The receiver
|
||||
trusts only certs issued by that CA.
|
||||
|
||||
Responses:
|
||||
|
||||
| Status | Meaning |
|---|---|
| 201 | Stored; new |
| 200 | Already present with matching sha256; nothing to do |
| 409 | Already present with **different** sha256; receiver refuses to overwrite |
| 4xx | Bad request (missing header, malformed id, etc.) |
| 5xx | Server error; sender retries with backoff |
|
||||
|
||||
There is no DELETE. Episodes are immutable once shipped.
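
A sketch of the sender's side of this exchange: building the headers and deciding what to do with a response. It assumes the digest is sent as lowercase hex, which this document does not specify. The function names and the `reject` action for other 4xx responses are hypothetical:

```python
import hashlib


def build_put_headers(body, host_id, episode_id, schema_version=1):
    """Build the request headers listed above for one tar.zst body (bytes)."""
    return {
        "Content-Type": "application/zstd",
        "Content-Length": str(len(body)),
        # Assumption: hex-encoded digest; the encoding is not specified here.
        "X-Content-SHA256": hashlib.sha256(body).hexdigest(),
        "X-Schema-Version": str(schema_version),
        "X-Lab-Host": host_id,
        "X-Episode-Id": episode_id,
    }


def sender_action(status):
    """Map a response status to what the sender does next."""
    if status in (200, 201):
        return "mark_shipped"
    if status == 409:
        return "alert"   # content mismatch: manual triage
    if 500 <= status <= 599:
        return "retry"   # back off and try again
    return "reject"      # other 4xx: the request itself is wrong
```

Treating 200 and 201 identically is what makes re-shipping safe: the sender does not need to remember whether an earlier attempt succeeded.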
|
||||
|
||||
## Sender (`shipper`) state machine
|
||||
|
||||
```
scan data/episodes/
    |
    v
for each <id>/done.marker:
    |
    v
tar+zstd → data/outbox/<id>.tar.zst.partial
    |
    v
rename → data/outbox/<id>.tar.zst   (atomic; visible to retry loop)
    |
    v
PUT to receiver
    |
    +-- 200/201 → mv data/episodes/<id> data/shipped/<id>;
    |             rm data/outbox/<id>.tar.zst
    |
    +-- 409 → log mismatch, leave files in place, alert (manual triage)
    |
    +-- 5xx/network → backoff (1s, 2s, 4s, 8s, ... cap 5min); retry
```
|
||||
|
||||
The shipper does the same scan on every wake-up, so a crash mid-tar or
|
||||
mid-PUT is harmless — the next pass picks up wherever it left off.
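
The retry schedule in the diagram (1s, 2s, 4s, 8s, capped at 5 minutes) can be written as a small generator. This is a sketch; the function name is hypothetical, and it adds no jitter because the design above does not mention any:

```python
import itertools


def backoff_delays(base=1.0, cap=300.0):
    """Yield 1s, 2s, 4s, ... doubling each time and capped at `cap` seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


# First ten delays: the tenth is where the 5-minute cap first applies.
first_ten = list(itertools.islice(backoff_delays(), 10))
```

The generator never terminates on its own, which matches the design: the shipper keeps retrying until the receiver is reachable again.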
|
||||
|
||||
## Receiver state machine
|
||||
|
||||
```
PUT body received
    |
    v
stream into /var/lib/cis490/incoming/<host>/<id>.tar.zst.partial
    |
    v
compute sha256 while streaming
    |
    +-- mismatch with header → 400, delete partial
    |
    +-- match:
          |
          v
        if final path exists:
          |
          +-- existing sha256 == new sha256 → 200, delete partial
          |
          +-- existing sha256 != new sha256 → 409, delete partial
        else:
          |
          v
        atomic rename → /var/lib/cis490/episodes/<host>/<id>.tar.zst
          |
          v
        append index.jsonl row
          |
          v
        201
```
|
||||
|
||||
`index.jsonl` row:
|
||||
|
||||
```json
{
  "received_at_wall": "2026-04-28T22:31:43Z",
  "host_id": "lab-host-1",
  "episode_id": "01HW9GZJ7K8QF5W3X2Y6N1A4B0",
  "sha256": "...",
  "size_bytes": 8412331,
  "schema_version": 1
}
```
|
||||
|
||||
That index is the closest thing to a database we have until we decide on one.
|
||||
A trainer can stream it to know what episodes exist, then untar on demand.
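
The receiver decision tree and the index append fit in one function. The sketch below is framework-free and takes the body as bytes for brevity, whereas the design streams and hashes as it goes. The function names are hypothetical, and the root directory is a parameter rather than the hard-coded `/var/lib/cis490`:

```python
import hashlib
import json
import os
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    """Hash a file in chunks so large tarballs are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_episode(root, host_id, episode_id, body, claimed_sha256, schema_version=1):
    """Apply the receiver decision tree to one upload; return the HTTP status."""
    root = Path(root)
    partial = root / "incoming" / host_id / f"{episode_id}.tar.zst.partial"
    final = root / "episodes" / host_id / f"{episode_id}.tar.zst"
    partial.parent.mkdir(parents=True, exist_ok=True)
    final.parent.mkdir(parents=True, exist_ok=True)

    partial.write_bytes(body)  # a real receiver would stream and hash as it goes
    actual = file_sha256(partial)
    if actual != claimed_sha256:
        partial.unlink()
        return 400
    if final.exists():
        status = 200 if file_sha256(final) == actual else 409
        partial.unlink()
        return status

    os.replace(partial, final)  # atomic when both paths share a filesystem
    row = {
        "received_at_wall": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "host_id": host_id,
        "episode_id": episode_id,
        "sha256": actual,
        "size_bytes": len(body),
        "schema_version": schema_version,
    }
    with open(root / "index.jsonl", "a", encoding="utf-8") as index:
        index.write(json.dumps(row) + "\n")
    return 201
```

This sketch does not guard against two concurrent uploads of the same episode racing between the existence check and the rename; a real receiver would need a per-episode lock or an exclusive-create step there.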
|
||||
|
||||
## Why not just rsync?
|
||||
|
||||
`rsync` works, but:
|
||||
|
||||
- No schema-version tagging at the protocol layer.
|
||||
- No clean way to enforce "immutable once written".
|
||||
- mTLS via WG-issued certs is more uniform with the rest of the overlay than
|
||||
ssh-key juggling.
|
||||
- A tiny FastAPI receiver is also a natural place to add ingest-time hooks
|
||||
later (e.g. emit a Matrix notification on successful receipt, kick off a
|
||||
training run when N new episodes arrive).
|
||||
|
||||
We may switch to rsync if the FastAPI receiver becomes a bottleneck. For a
|
||||
class project that is unlikely.
|
||||
|
||||
## Operational notes
|
||||
|
||||
- **Disk on lab host.** The shipper keeps episodes locally in
|
||||
`data/shipped/<id>/` until a retention pass prunes them. Default retention:
|
||||
7 days *or* 80% disk usage, whichever comes first.
|
||||
- **Disk on receiver.** No retention enforced by default — the central store
|
||||
is the dataset.
|
||||
- **Backpressure.** If the receiver is unreachable (WG down, Pi rebooting),
|
||||
the shipper accumulates tarballs in `data/outbox/`. No data is lost as long as the lab host has disk space.
|
||||
- **Multiple lab hosts.** Each writes under its own `<host_id>/` prefix. No
|
||||
coordination needed; episode ids are globally unique (ULID).
|
||||
12
exploits/README.md
Normal file
|
|
@ -0,0 +1,12 @@
|
|||
# exploits/
|
||||
|
||||
Metasploit resource scripts (`*.rc`) that drive specific exploit modules
|
||||
deterministically — same inputs, same module options, every time.
|
||||
|
||||
Each script:
|
||||
- Sets `RHOSTS` to the guest's bridge IP.
|
||||
- Sets a payload that opens a session usable for sample upload + execute.
|
||||
- Avoids any options that introduce randomness in the exploit fire timing
|
||||
(so that the `armed → infecting` transition lands at a predictable offset).
|
||||
|
||||
These scripts pair with public Metasploit modules. We do not author exploits.
|
||||
21
orchestrator/README.md
Normal file
|
|
@ -0,0 +1,21 @@
|
|||
# orchestrator/
|
||||
|
||||
The state machine that drives a single **episode**:
|
||||
|
||||
```
snapshot_load → clean → armed → infecting → infected_running → dormant → reverting
```
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- Bring up the host-only bridge and verify isolation before the guest starts.
|
||||
- Boot the guest from a named snapshot.
|
||||
- Spawn the five telemetry collectors (`collectors/`) with a shared episode id
|
||||
and shared monotonic clock origin.
|
||||
- Drive the Metasploit Framework over RPC to fire the configured exploit module.
|
||||
- Upload + execute the configured malware sample once a session is open.
|
||||
- Emit phase transitions to `labels.jsonl` *at the moment the action is taken*.
|
||||
- Revert the snapshot at episode end.
|
||||
- Write `meta.json` with the result summary.
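
A minimal sketch of the phase sequence and of emitting a transition row at the moment an action is taken. The phase names come from the diagram above; the row field names (`episode_id`, `phase`, `t_mono_s`) and the function names are hypothetical stand-ins for whatever the data-model doc defines:

```python
import json
import time
from enum import Enum


class Phase(str, Enum):
    SNAPSHOT_LOAD = "snapshot_load"
    CLEAN = "clean"
    ARMED = "armed"
    INFECTING = "infecting"
    INFECTED_RUNNING = "infected_running"
    DORMANT = "dormant"
    REVERTING = "reverting"


ORDER = list(Phase)  # Enum iteration preserves definition order


def next_phase(current):
    """Return the phase after `current`, or None once the episode is reverting."""
    i = ORDER.index(current)
    return ORDER[i + 1] if i + 1 < len(ORDER) else None


def emit_transition(labels_file, episode_id, phase, t0_monotonic):
    """Append one transition row, timestamped against the shared clock origin."""
    row = {
        "episode_id": episode_id,
        "phase": phase.value,
        "t_mono_s": time.monotonic() - t0_monotonic,
    }
    labels_file.write(json.dumps(row) + "\n")
    return row
```

Using a monotonic clock offset rather than wall time keeps label timestamps comparable with collector rows that share the same origin.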
|
||||
|
||||
Implementation lives in this directory and is imported as `orchestrator.*`.
|
||||
33
samples/README.md
Normal file
|
|
@ -0,0 +1,33 @@
|
|||
# samples/
|
||||
|
||||
**Sample binaries are NEVER committed to this repo.** This directory holds:
|
||||
|
||||
- `manifest.yaml` — sha256-pinned list of samples to fetch, with metadata
|
||||
(source, category, expected behavior, target CVE).
|
||||
- `fetch.py` — script that pulls samples from configured sources
|
||||
(MalwareBazaar, theZoo, vx-underground), verifies sha256, and stores them
|
||||
under `samples/store/` (gitignored).
|
||||
- Per-sample notes in markdown describing observed behavior in our lab.
|
||||
|
||||
`samples/store/` lives only on the lab host. It is gitignored *and* should
|
||||
sit on a disk that is not auto-mounted on developer workstations.
|
||||
|
||||
## Manifest entry shape (placeholder)
|
||||
|
||||
```yaml
samples:
  - name: linux.miner.xmrig.elf
    sha256: "..."                # pinned
    source: MalwareBazaar
    category: miner
    target_cve: null             # cryptominers are usually post-exploit payloads
    behavior: "high CPU, periodic stratum protocol traffic"
    pairs_with_exploit: exploit/multi/samba/usermap_script
```
|
||||
|
||||
## Safety rules
|
||||
|
||||
- Only download to the lab host, never to a developer workstation.
|
||||
- Verify sha256 immediately, before any other read.
|
||||
- Keep the directory on a path that is *not* exposed over the WG overlay.
|
||||
- Re-verify sha256 before each detonation; refuse to run on mismatch.
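
The verify-before-use rule could be enforced with a helper like the one below. This is a sketch, not the contents of `fetch.py`; the exception and function names are hypothetical:

```python
import hashlib


class HashMismatch(Exception):
    """Raised when a sample on disk does not match its pinned sha256."""


def verify_sample(path, pinned_sha256):
    """Return the file's sha256 if it matches the manifest pin, else raise."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    actual = digest.hexdigest()
    if actual != pinned_sha256.lower():
        raise HashMismatch(f"{path}: expected {pinned_sha256}, got {actual}")
    return actual
```

Calling the same helper at fetch time and again before each detonation covers both rules with one code path.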
|
||||
23
training/README.md
Normal file
|
|
@ -0,0 +1,23 @@
|
|||
# training/
|
||||
|
||||
Deferred until the dataset has substance. The plan, recorded so we don't lose
|
||||
it:
|
||||
|
||||
1. Two models will be trained from the same episodes:
|
||||
- **Realistic** — features only (`available_in_deployment: true`).
|
||||
- **Oracle** — all rows, regardless of the deployment flag.
|
||||
2. Baseline architecture: a rolling-window feature builder + a gradient-boosted
|
||||
trees classifier (XGBoost or LightGBM). Cheap, strong, interpretable.
|
||||
3. Window: 1–5 second sliding windows with per-channel summary stats
|
||||
(mean, std, p95, slope, count of zero buckets).
|
||||
4. Target: the phase enum from `labels.jsonl`, projected onto each window's
|
||||
center timestamp.
|
||||
5. Evaluation:
|
||||
- Held-out *samples* (not just held-out time slices) — generalization to
|
||||
unseen malware matters more than within-sample accuracy.
|
||||
- Confusion matrix + per-phase precision/recall.
|
||||
- Realistic vs. oracle gap, reported.
|
||||
6. Stretch: trust-over-time scoring per the IEEE 9881803 paper, with a reset
|
||||
threshold tuned for low false-positive cost.
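
The per-channel summary stats in step 3 could be computed as below. This is a sketch under stated assumptions: the window is a non-empty list of evenly spaced bucket values, `std` is the population standard deviation, `p95` uses the nearest-rank method, and `slope` is a least-squares fit against the bucket index. None of those choices are fixed by the plan above, and the function name is hypothetical:

```python
import math


def window_stats(values):
    """Summary stats for one channel over one window of evenly spaced buckets."""
    if not values:
        raise ValueError("window must contain at least one bucket")
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population std
    ordered = sorted(values)
    p95 = ordered[max(0, math.ceil(0.95 * n) - 1)]  # nearest-rank percentile
    # Least-squares slope of value against bucket index 0..n-1.
    x_mean = (n - 1) / 2
    denom = sum((i - x_mean) ** 2 for i in range(n))
    if denom:
        slope = sum((i - x_mean) * (v - mean) for i, v in enumerate(values)) / denom
    else:
        slope = 0.0  # a single bucket has no trend
    return {
        "mean": mean,
        "std": std,
        "p95": p95,
        "slope": slope,
        "zero_buckets": sum(1 for v in values if v == 0),
    }
```

The resulting dict is one row fragment per channel per window; concatenating fragments across channels would give the input vector for the gradient-boosted baseline.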
|
||||
|
||||
See [`docs/threat-model.md`](../docs/threat-model.md) for why this split exists.
|
||||
17
vm/README.md
Normal file
|
|
@ -0,0 +1,17 @@
|
|||
# vm/
|
||||
|
||||
Recipes and helpers for building and snapshotting guest VMs. Disk images and
|
||||
snapshots themselves are gitignored — this directory carries the *how*, not
|
||||
the bytes.
|
||||
|
||||
```
vm/
  images/             # qcow2 staging (gitignored)
  snapshots/          # exported snapshots if needed (gitignored)
  guest-agent/        # in-guest telemetry agent (shipped into the guest)
  metasploitable2.md  # download/convert/snapshot procedure (TODO)
  custom-debian/      # cloud-init for our own vulnerable Debian (TODO)
```
|
||||
|
||||
See [`docs/lab-setup.md`](../docs/lab-setup.md) for the full host + guest
|
||||
bring-up procedure.
|
||||