# CIS490 — Behavioral Malware Detection Dataset & Model

Course project for CIS490 (Cybersecurity). The end-goal is an ML model that
watches performance metrics on a real device, decides whether the device has
been breached, and triggers a hardware-level reset when confidence is high
enough. This repository covers the **dataset side** — we run public malware
samples (and behavior-matched mimics) against intentionally vulnerable Linux
VMs and capture labeled time-series telemetry that mirrors what the deployed
model would see in the field.

Concretely, every lab host on the WireGuard mesh detects how much capacity
it has, spins up that many concurrent VMs, gives each VM a *different*
malware profile from the manifest, and ships the resulting labeled episode
tarballs to the central receiver on the Pi over mTLS. Running the same
fleet on multiple hosts gives novel, non-overlapping data per host with no
coordinator — see [Multi-host fleet](#multi-host-fleet) below.

The work is grounded in the trust-over-time scoring model from
[IEEE 9881803](https://ieeexplore.ieee.org/document/9881803).

---

## What an episode looks like

Each episode runs a target through a labeled phase schedule
(`clean → armed → infecting → infected_running → dormant → ...`) while
sampling host-side `/proc` telemetry at 10 Hz. The dataset's "envelope" is
the set of timestamped phase transitions written to `labels.jsonl` —
sharing a monotonic clock with the metric rows so anything aligned in
time can be aligned in code.

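
Because the labels and the metric rows share one monotonic clock, labeling a metric row reduces to finding the latest phase transition at or before its timestamp. A minimal sketch of that join, assuming each row carries a `t_mono` field and each label row a `phase` field; both field names are illustrative guesses, not the schema documented in `docs/data-model.md`:

```python
import bisect

def phase_at(labels, t):
    """Return the phase active at monotonic time t, or None before the first transition."""
    # labels: list of dicts sorted by "t_mono" (hypothetical field name).
    times = [row["t_mono"] for row in labels]
    i = bisect.bisect_right(times, t) - 1
    return labels[i]["phase"] if i >= 0 else None

def label_rows(labels, rows):
    """Attach a "phase" key to every telemetry row (returns new dicts)."""
    return [{**row, "phase": phase_at(labels, row["t_mono"])} for row in rows]

# Made-up transitions and metric rows, for illustration only.
labels = [
    {"t_mono": 0.0, "phase": "clean"},
    {"t_mono": 5.0, "phase": "armed"},
    {"t_mono": 7.5, "phase": "infecting"},
]
rows = [
    {"t_mono": 0.1, "cpu": 2.0},
    {"t_mono": 5.0, "cpu": 3.0},
    {"t_mono": 9.9, "cpu": 98.0},
]
labeled = label_rows(labels, rows)
```

A row stamped exactly on a transition takes the new phase here; whether the real pipeline treats the boundary that way is an assumption.
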
### Tier 2 — *real Alpine VM, profile-driven workload inside the guest*

This is the closest we get to real-malware behaviour without yet running
real malware. Telemetry is real `/proc/<qemu_pid>` from outside the
guest plus three more sources running concurrently (QMP, bridge pcap,
in-guest agent — see *Telemetry sources* below). The *load* itself is
generated inside the guest by a profile-matched shell command from
[`exploits/workloads.py`](exploits/workloads.py), driven over the
serial console by [`tools/vm_load_controller.py`](tools/vm_load_controller.py).

Each sample's `profile` (from [`samples/manifest.toml`](samples/manifest.toml))
dispatches to a different in-session workload, so the envelope each
VM produces is observably different per family — exactly the variance
the ML model needs to learn:

| profile          | shape                                                     |
|------------------|-----------------------------------------------------------|
| `cpu-saturate`   | sustained 1-vCPU saturation (XMRig)                       |
| `scan-and-dial`  | SYN-style probes across the bridge subnet + dial-home     |
| `io-walk`        | fs traversal + 4 KiB urandom writes (ransomware)          |
| `bursty-c2`      | long idle + periodic 3-packet egress burst (Dridex)       |
| `low-and-slow`   | minimal CPU + periodic memory churn (Kovter / fileless)   |
| `shell-resident` | one long-lived TCP socket + periodic command ticks (RAT)  |

Every phase transition in `labels.jsonl` corresponds to an actual
command issued inside the real VM, and `meta.json` records which
sample / profile / kind drove it.



The 100% CPU plateaux are `yes > /dev/null` running on the guest's
single vCPU; the IO spikes during *infecting* are `dd if=/dev/urandom`
producing the sample-drop shape; the *dormant* drops are the
controller killing the load process inside the VM. The
infected_running → dormant → infected_running re-entry is the textbook
envelope that justifies the whole project framing.

Reproduce one episode (profile-driven via `--sample` or `SAMPLE_NAME`
env, defaults to the v1 yes-loop without one):

```sh
uv run python tools/run_real_vm_demo.py --data-root data \
    --sample xmrig-cryptominer
```

Or run the **fleet** — one wave of `max_concurrent` parallel episodes,
each slot pulling a different sample from the manifest:

```sh
uv run python tools/run_fleet.py --capacity # see what the host can do
uv run python tools/run_fleet.py --waves 1 --data-root data
```

### Tier 1 — *real Alpine VM, idle baseline*

Same pipeline, pointed at the real `qemu-system` process while the
guest is doing nothing. Periodic ~10% CPU spikes are KVM/timer
interrupts; the single disk-write spike near t=3 s is the guest
finishing late-boot activity.



### Pipeline-validation plot — *synthetic load, real telemetry*

This is **not real malware** and the load is **not** even running
inside a VM — it's a Python program on the host (`tools/load_mimic.py`)
that mimics an XMRig-style envelope. We used it to validate the
orchestrator + collector + labeling pipeline before plugging in a real
guest. Kept here because it shows the same shape the tier-2 plot
above produces from real KVM behaviour.



### Tier 3 — *real exploit fire, profile-matched workload (Driver v2)*

The Tier-3 driver lives in [`exploits/`](exploits/README.md) — a tiny
msgpack-over-HTTPS msfrpc client + `MSFExploitDriver`. With a
[`Sample`](samples/manifest.py) supplied, the driver dispatches the
post-exploit `infected_running` workload through
[`exploits/workloads.py`](exploits/workloads.py) — same six profiles
as Tier 2, so a fleet wave produces matched envelopes whether or not
an exploit fires. Without a sample, the v1 yes-loop path is preserved
for smoke runs.

First canned module: `exploits/modules/vsftpd_234_backdoor.toml`
(Metasploitable2's CVE-2011-2523). [`scripts/install-msfrpcd.sh`](scripts/install-msfrpcd.sh)
sets up `msfrpcd` (loopback only) as a hardened systemd unit;
[`scripts/fetch-metasploitable2.sh`](scripts/fetch-metasploitable2.sh)
downloads and sha256-verifies a target image from an operator-supplied URL.

### Tier 4 — *real malware sample, fetched + uploaded + executed*

A manifest entry with a `sha256` flips its `Sample.kind` to `"real"`.
The driver then bypasses the mimic profile and runs the real-binary
path:

1. [`tools/fetch_sample.py <sha256>`](tools/fetch_sample.py) pulls the
   binary from MalwareBazaar (Auth-Key from
   `samples/.bazaar.token` or `MALWAREBAZAAR_API_KEY`), unzips it with the
   standard `infected` password, sha-verifies it, and writes it to
   `samples/store/<sha256>` (gitignored).
2. At `infected_running`, the driver uploads the binary into
   the shell session in chunks, as 8 KiB base64 segments
   (`exploits.workloads.chunked_real_binary_upload`). Binaries of 256 KiB
   upload without overflowing msfrpc's buffers.
3. The session decodes, sha-verifies *again on the guest side*, chmods,
   and execs only if the hash matches. A mismatch fail-stops the run.
4. `meta.sample.sha256` + per-step events
   (`real_binary_upload_begin`, `real_binary_verify`,
   `sample_executed{kind=real}`) record exactly which binary was run
   and when, so trainers can join cleanly.

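
Steps 2 and 3 can be sketched end to end in plain Python. This is an illustration of the chunk, reassemble, verify pattern only, not the code in `exploits/workloads.py`: the real guest-side half runs as shell commands inside the session, the function names here are hypothetical, and the sketch assumes "8 KiB" means 8 KiB of raw bytes per segment before base64 encoding.

```python
import base64
import hashlib

CHUNK_BYTES = 8 * 1024  # assumption: 8 KiB of raw bytes per segment, before base64

def to_segments(blob, chunk_bytes=CHUNK_BYTES):
    """Split blob into independently base64-encoded text segments."""
    return [
        base64.b64encode(blob[i:i + chunk_bytes]).decode("ascii")
        for i in range(0, len(blob), chunk_bytes)
    ]

def reassemble_and_verify(segments, expected_sha256):
    """Decode segments in order; refuse to return bytes whose hash mismatches."""
    blob = b"".join(base64.b64decode(s) for s in segments)
    actual = hashlib.sha256(blob).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"sha256 mismatch: got {actual}")
    return blob

payload = bytes(range(256)) * 1024  # 256 KiB stand-in, not a real sample
digest = hashlib.sha256(payload).hexdigest()
segments = to_segments(payload)
restored = reassemble_and_verify(segments, digest)
```

Verifying after reassembly rather than per segment means a dropped or reordered segment is caught by the same check that catches corruption, which matches the fail-stop behaviour described in step 3.
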
### Tier maturity

| Tier | What it gives | Status |
|---|---|---|
| 1 — real VM, idle | confidence the collectors read real KVM behaviour | ✅ done |
| 2 — real VM, profile-driven workload | distinguishable in-guest envelopes per malware family | ✅ done |
| 3 — real VM, real exploit fire + profile workload | honest `armed → infecting` transitions, driver v2 dispatch | ✅ code; ⏳ awaiting Metasploitable2 image + msfrpcd on a lab host |
| 4 — real VM, real malware sample (MalwareBazaar fetch) | the full envelope we ultimately train on | ✅ code; ⏳ awaiting MalwareBazaar API key + sha256s in manifest |

### Telemetry sources (all five wire into one episode dir)

| # | Source                         | Vantage       | Role                 |
|---|--------------------------------|---------------|----------------------|
| 1 | host `/proc/<qemu_pid>`        | outside       | oracle (label only)  |
| 2 | QEMU QMP queries               | outside       | oracle (label only)  |
| 3 | `perf stat -p <qemu_pid>`      | outside       | oracle (label only)  |
| 4 | Bridge pcap → 100 ms netflow   | gateway-side  | feature (deployable) |
| 5 | In-guest agent (virtio-serial) | inside        | feature (deployable) |

All five are live. The deploy/oracle split follows
[`docs/threat-model.md`](docs/threat-model.md): only sources 4 + 5
are usable as model *features* in the field — sources 1, 2, 3 exist
as labeling oracles only.

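
Source 4 reduces the raw bridge capture to 100 ms netflow buckets. The idea can be sketched as follows, assuming packets have already been parsed into `(timestamp_us, length_bytes)` pairs; the input shape and the output keys are hypothetical and are not taken from the project's pcap collector.

```python
from collections import defaultdict

BUCKET_US = 100_000  # 100 ms expressed in microseconds

def bucketize(packets, bucket_us=BUCKET_US):
    """Aggregate (timestamp_us, length_bytes) pairs into fixed-width buckets."""
    totals = defaultdict(lambda: [0, 0])  # bucket index -> [packet count, byte count]
    for ts_us, length in packets:
        slot = totals[ts_us // bucket_us]
        slot[0] += 1
        slot[1] += length
    return [
        {"t_us": idx * bucket_us, "packets": n, "bytes": b}
        for idx, (n, b) in sorted(totals.items())
    ]

# Four made-up packets: two in the first bucket, one each in two later ones.
packets = [(10_000, 60), (90_000, 1500), (100_000, 60), (350_000, 40)]
flows = bucketize(packets)
```

Integer microseconds keep the bucket boundaries exact, which floating-point seconds would not. Empty buckets are omitted in this sketch; a trainer that needs a dense 10 Hz grid would have to fill them with zeros.
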
For an interactive view of any episode (zoom/pan/hover), run:

```sh
tools/show_envelope.sh data/episodes/<episode_id>
# then open http://127.0.0.1:8988/
```

---

## Status (106/106 tests passing as of `a88ac83`)

**Pipeline (lab-host → Pi → tarball stored)**

- ✅ Receiver app (HTTPS PUT, sha256-verified, idempotent) — running on the Pi behind Caddy with mTLS via the wg-pki client CA
- ✅ `POST /v1/ping` smoke endpoint (writes nothing, exercises the full auth path)
- ✅ Shipper (`shipper/`) — tar+zstd, retry/backoff, `--ping` mode
- ✅ Caddy `collector.wg` block (in `spectral/caddy`)
- ✅ Lab-host install script + systemd units (`scripts/install-lab-host.sh`, `etc/cis490-{shipper,orchestrator}.service`)
- ✅ Receiver install script (`scripts/install-receiver.sh`)
- ✅ wg-pki client-CA bootstrap + per-host leaf issuance (in `spectral/wg-pki`)

**Telemetry**

- ✅ Source 1 — host `/proc/<qemu_pid>` @ 10 Hz
- ✅ Source 2 — QEMU QMP @ 1 Hz
- ✅ Source 3 — `perf stat -p <qemu_pid>` (opt-in via `enable_perf`; needs `CAP_SYS_ADMIN` / `CAP_PERFMON`)
- ✅ Source 4 — bridge pcap + 100 ms netflow bucketizer (pure-Python parser, no scapy/dpkt dep), wired into `EpisodeRunner` via `bridge_iface`
- ✅ Source 5 — in-guest agent over virtio-serial; cidata-embedded for first-boot install on Alpine

**Orchestrator + drivers**

- ✅ Orchestrator v0 — phase-scheduled episode runner, ULID episode ids
- ✅ Snapshot/revert via QMP `loadvm` (`revert_at_start` / `revert_at_end`) for clean baselines between episodes
- ✅ Tier 2 driver — real Alpine VM, profile-driven in-guest workload over serial console
- ✅ Tier 3 driver v2 — `MSFExploitDriver` + msfrpc client + per-sample workload dispatch; first canned module `vsftpd_234_backdoor.toml`
- ✅ Tier 4 — `tools/fetch_sample.py` (MalwareBazaar by sha256) + chunked real-binary upload (`exploits.workloads.chunked_real_binary_upload`) + guest-side sha-verify-then-exec dispatch in `MSFExploitDriver`
- ⏳ Tier 3 integration — needs operator to drop a Metasploitable2 image + run `scripts/install-msfrpcd.sh` on a lab host
- ⏳ Tier 4 integration — needs operator's MalwareBazaar API key + at least one `sha256` entry in `samples/manifest.toml`

**Fleet (multi-VM, multi-host data generation)**

- ✅ Resource-aware capacity detector (cores / RAM / load) — `orchestrator/fleet.py`
- ✅ Concurrent slot runner — `tools/run_fleet.py`
- ✅ Sample manifest with six behavioural profiles + deterministic per-(host_id, slot, episode) selection so every host walks the catalog in a different order

> **Topology note:** the **Pi5 is the WireGuard-side *collector*** that
> receives episode tarballs from one or more lab hosts. It is *not* the
> deployment target for the model. The deployment target is generic
> ("any constrained Linux device"). See
> [`docs/architecture.md`](docs/architecture.md).

---

<details>
<summary><b>Quick start — fleet mode (the primary workflow)</b></summary>

```sh
git clone https://maxgit.wg/spectral/CIS490.git
cd CIS490
uv sync

# 1. Build the cidata ISO with the in-guest agent baked in.
uv run python tools/build_cidata.py vm/images/cidata.iso

# 2. See what this host is sized for.
uv run python tools/run_fleet.py --capacity
# cores: 4 (reserve 1)
# ram: 7951 MiB total, 5223 MiB available (headroom 1024 MiB, per-vm 320 MiB)
# load: 1m=0.51
# caps: by_cores=3, by_ram=13, by_load=3
# --> max_concurrent VMs: 3

# 3. Run one wave (= max_concurrent parallel episodes, each with a
#    different sample profile).
uv run python tools/run_fleet.py --waves 1 --data-root data

# 4. Plot any episode (matplotlib WebAgg).
tools/show_envelope.sh data/episodes/<episode_id>
```

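
The `--capacity` figures in the sample output above are consistent with taking the tightest of three per-resource caps. The sketch below reproduces that arithmetic; it is a reading of the printed numbers, not the logic in `orchestrator/fleet.py`. The function and parameter names are hypothetical, and `by_load` is passed in as given because the output does not show how it is derived.

```python
def max_concurrent(cores, reserve_cores, ram_available_mib,
                   headroom_mib, per_vm_mib, by_load):
    """Return the tightest of the per-resource caps (never below zero)."""
    by_cores = cores - reserve_cores
    by_ram = (ram_available_mib - headroom_mib) // per_vm_mib
    return max(0, min(by_cores, by_ram, by_load))

# Figures copied from the sample --capacity output.
n = max_concurrent(cores=4, reserve_cores=1, ram_available_mib=5223,
                   headroom_mib=1024, per_vm_mib=320, by_load=3)
print(n)  # 3
```

With these inputs the core cap is 4 - 1 = 3 and the RAM cap is (5223 - 1024) // 320 = 13, so cores and load bind long before memory does on this host.
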
Each episode dir contains:

```
meta.json              episode metadata (image, sample, profile, fleet capacity)
events.jsonl           orchestrator + driver events (exploit_fire, session_open, sample_executed, ...)
labels.jsonl           one row per phase transition — THIS is the envelope
telemetry-proc.jsonl   source 1: host /proc sampler @ 10 Hz
telemetry-qmp.jsonl    source 2: QMP query-status / blockstats / kvm stats @ 1 Hz
telemetry-guest.jsonl  source 5: in-guest agent (CPU jiffies, mem, listen ports, top procs)
network.pcap           source 4: tcpdump on br-malware
netflow.jsonl          source 4: 100 ms-bucketed pcap aggregation
done.marker            written last; the shipper only sees finished episodes
```

</details>

<details>
<summary><b>Quick start — single episode, no fleet</b></summary>

```sh
# Tier 2 (no exploit, profile-driven workload):
uv run python tools/run_real_vm_demo.py --data-root data \
    --sample mirai-class-bot

# Tier 3 (real exploit fire via msfrpcd):
MSFRPC_PASSWORD=$(. /etc/cis490/msfrpc.env; echo $MSFRPC_PASSWORD) \
uv run python tools/run_tier3_demo.py \
    --module vsftpd_234_backdoor \
    --sample ransomware-mimic \
    --data-root data
```

</details>

<details>
<summary><b>Multi-host fleet — how cross-host diversity works</b></summary>

Each lab host's `host_id` (set in `/etc/cis490/lab-host.toml`) seeds a
deterministic walk through the sample catalog:

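```python
# samples/manifest.py (simplified)
import hashlib

def select(self, *, host_id, slot, episode_index):
    # Hash the (host, slot, episode) triple so each host walks the
    # catalog in its own deterministic order.
    seed = f"{host_id}|{slot}|{episode_index}".encode()
    idx = int.from_bytes(hashlib.sha256(seed).digest()[:8], "big") % len(self.samples)
    return self.samples[idx]
```
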
So:

- `host=alice slot=0 ep=0` and `host=bob slot=0 ep=0` almost certainly
  pick *different* samples (test asserts < 25% collision over 20 trials).
- A single host walks the entire catalog within ~`len(manifest)` waves
  (test confirms full coverage in 200 episodes).
- No coordinator needed — every host independently produces non-overlapping
  data, and `meta.fleet.host_id` + `meta.sample.name` make the join trivial
  at training time.

The fleet runner shells out to the same `tools/run_real_vm_demo.py` per
slot, with `SLOT` / `RUN_DIR` / `SAMPLE_NAME` env passed through to the
launcher. Each VM gets its own QMP socket, agent socket, hostfwd port
range, and episode dir, so concurrency is collision-free up to the
capacity ceiling.

</details>

<details>
<summary><b>Repository layout</b></summary>

| Path | What it holds |
|---|---|
| [`docs/architecture.md`](docs/architecture.md) | Lab topology, KVM choice, snapshot loop, deployment-mirror reasoning |
| [`docs/threat-model.md`](docs/threat-model.md) | Train/serve parity rule and the oracle-vs-deployable feature split |
| [`docs/data-model.md`](docs/data-model.md) | On-disk JSONL schema, per-episode layout, phase enum |
| [`docs/transport.md`](docs/transport.md) | Sender/receiver design — how episodes get to the central collector over WG |
| [`docs/deploy.md`](docs/deploy.md) | One-command install for the lab-host and receiver roles |
| [`docs/lab-setup.md`](docs/lab-setup.md) | KVM prereqs, VM build, snapshot, virtio-serial wiring |
| [`docs/sources.md`](docs/sources.md) | Works cited — every tool, dep, sample source, paper, and standard |
| `orchestrator/` | Episode runner + `fleet.py` (capacity detection, concurrent slot driver) |
| `collectors/` | One module per telemetry source: `proc_qemu`, `qmp`, `pcap`, `guest_agent` |
| `receiver/` | Starlette app: PUT `/v1/episodes` + POST `/v1/ping`, sha256-verified, idempotent |
| `shipper/` | Lab-host-side: scan `data/episodes/`, tar+zstd, PUT over mTLS, retry/backoff |
| `vm/` | Launch scripts (`launch_demo.sh`, `launch_target.sh`), `setup_bridge.sh`, in-guest agent at `vm/guest-agent/cis490_agent.py`. qcow2 images and pcap captures gitignored. |
| `tools/` | `run_fleet.py`, `run_real_vm_demo.py`, `run_tier3_demo.py`, `build_cidata.py`, `plot_envelope.py`, `show_envelope.sh` |
| [`exploits/`](exploits/README.md) | MSF RPC client (`msfrpc.py`), `driver.py` (v2 with sample dispatch), `workloads.py` (six profile-matched in-session loops), per-module TOML configs |
| [`samples/`](samples/manifest.toml) | Sample manifest + loader. Binaries land at `samples/store/<sha256>` (gitignored). |
| `scripts/` | `install-{lab-host,receiver,msfrpcd}.sh`, `fetch-metasploitable2.sh` |
| `training/` | Model training code (deferred — schema first) |
| `etc/` | systemd units and config templates (`cis490-{receiver,shipper,orchestrator}.service`, `lab-host.toml.example`, `receiver.toml.example`) |
| [`AGENTS.md`](AGENTS.md) | Conventions for AI agents working on this and sibling spectral repos |

</details>

<details>
<summary><b>Design decisions — why these choices</b></summary>

- **Why VMs (not Docker)?** We need a clean snapshot/revert loop and we need
  to run real malware without compromising the host. KVM gives both at
  near-native speed; containers share the host kernel and many samples detect
  containerization and refuse to detonate. See
  [`docs/architecture.md`](docs/architecture.md).
- **Why KVM (not TCG/-icount)?** ML training data wants noise to generalize
  to. KVM is ~15× faster than TCG, which directly multiplies dataset size
  per wall-clock hour. We pin 1 vCPU + cap CPU% via cgroup to preserve the
  "constrained device" framing.
- **Why JSONL (not a DB yet)?** Schema-last. Collect first, decide storage
  shape after we see what's useful. JSONL is crash-safe, append-only,
  reshapes trivially into Postgres/Timescale/Parquet.
- **Why two models — realistic vs. oracle?** Features that exist on a
  deployed device train the *realistic* model. Host-side QEMU telemetry
  (which doesn't exist in deployment) is *oracle*-only — used to assign
  honest labels at training time, never as a model input. The accuracy gap
  between the two measures how much detection power a privileged rootkit
  can take from us by lying to in-device tools. See
  [`docs/threat-model.md`](docs/threat-model.md).
- **Why ULIDs for episode ids?** Time-sortable, no coordinator, URL-safe.

</details>

<details>
<summary><b>Deploying the receiver and lab-host roles</b></summary>

Two roles, one bootstrap command each. Detailed in
[`docs/deploy.md`](docs/deploy.md):

- `lab-host` — runs episodes, ships completed episodes to the receiver.
- `receiver` — accepts ship uploads, stores tarballs + appends to
  `index.jsonl`. Runs on the Pi5 in our setup.

```sh
# On the Pi5 (or any always-on WG node):
sudo ./scripts/install-receiver.sh
# Add the collector.wg block to spectral/caddy (already merged), then:
sudo systemctl enable --now cis490-receiver

# One-time, on the Pi: bootstrap the CIS490 client CA.
sudo /home/max/.env/wg-pki/scripts/init-cis490-client-ca.sh

# On each lab host: enroll via wg-enroll first, then:
sudo ./scripts/install-lab-host.sh
# Drop a TLS leaf from wg-pki at /etc/cis490/certs/, edit /etc/cis490/lab-host.toml.
sudo systemctl enable --now cis490-shipper cis490-orchestrator
```

The orchestrator service runs `tools/run_fleet.py --waves 1` per
invocation with `Restart=always`, giving a continuous stream of
fresh-sample episodes per host. The shipper picks them up as
`done.marker` files appear and PUTs them to `https://collector.wg`.

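
Since `done.marker` is written last, its presence is enough to tell a finished episode from one still being recorded. A minimal sketch of that check, assuming a flat `data/episodes/<episode_id>/` layout; `finished_episodes` is a hypothetical helper, not the scanner in `shipper/`:

```python
from pathlib import Path
import tempfile

def finished_episodes(episodes_root):
    """Return episode dirs that contain done.marker, sorted by name."""
    root = Path(episodes_root)
    return sorted(p.parent for p in root.glob("*/done.marker"))

# Demo against a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ep-a").mkdir()
    (root / "ep-a" / "done.marker").touch()
    (root / "ep-b").mkdir()  # still being written, no marker yet
    ready = [p.name for p in finished_episodes(root)]
```

Sorting by directory name would also give chronological order for ULID episode ids, since ULIDs are time-sortable.
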
For mTLS leaf-cert minting: `spectral/wg-pki/scripts/issue-cis490-client-cert.sh <host_id>`.

</details>

<details>
<summary><b>Threat model and feature-availability split</b></summary>

See [`docs/threat-model.md`](docs/threat-model.md) for the full argument.
The short version:

| Channel | Vantage | Role |
|---|---|---|
| Host `/proc/<qemu_pid>` | outside guest | oracle (label only) |
| QEMU QMP `query-stats` etc. | outside guest | oracle (label only) |
| `perf stat -p <qemu_pid>` | outside guest | oracle (label only) |
| Bridge-side pcap | gateway-style | feature (deployable) |
| In-guest `/proc`, perf, thermal | inside guest | feature (deployable) |

We collect everything in the lab. Only the *features* go into the deployed
model; the oracles are used to label episodes with high confidence
(disagreement between in-guest and host-side data is itself a rootkit
signal).

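
The disagreement check can be illustrated with a toy comparison of CPU readings from the two vantage points. Everything here is an assumption made for the example: the function name, the already time-aligned percentage series, and the 15-point tolerance, which is arbitrary and would need tuning because the host-side figure also includes QEMU's own overhead.

```python
def cpu_disagreement(host_cpu_pct, guest_cpu_pct, tolerance_pct=15.0):
    """Flag samples where guest-reported and host-observed CPU differ by more than the tolerance."""
    return [
        abs(h - g) > tolerance_pct
        for h, g in zip(host_cpu_pct, guest_cpu_pct)
    ]

host = [3.0, 98.0, 97.0]   # observed from outside via /proc/<qemu_pid>
guest = [2.0, 96.0, 4.0]   # reported by the in-guest agent
flags = cpu_disagreement(host, guest)
```

In the last sample the host sees a saturated vCPU while the guest claims to be idle, which is the pattern a rootkit hiding a process from in-guest tools would produce.
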
</details>

---

## Citing this work

A short course-project citation, until the dataset reaches a publishable
form:

> Gorog, M. *CIS490 Behavioral Malware Detection Dataset (in progress).*
> Spectral lab, 2026.

See [`docs/sources.md`](docs/sources.md) for everything else this project
leans on.