# Data Model

JSONL only, no database, schema-last. Each episode is a self-contained directory.

## Per-episode layout

```
data/episodes/<episode_id>/
  meta.json              # one-time, written at start; updated at end with summary
  events.jsonl           # orchestrator actions, one row per event
  labels.jsonl           # phase transitions, one row per transition
  telemetry-proc.jsonl   # source 1 (oracle)   host /proc/<qemu_pid>
  telemetry-qmp.jsonl    # source 2 (oracle)   QEMU QMP queries
  telemetry-perf.jsonl   # source 3 (oracle)   perf stat -p <qemu_pid>
  telemetry-guest.jsonl  # source 5 (feature)  in-guest agent over virtio-serial
  network.pcap           # source 4            raw tcpdump -i br-malware
  netflow.jsonl          # source 4 (feature)  bucketed 100 ms aggregations of pcap
  stderr.log             # raw qemu + agent logs
```

`<episode_id>` is a [ULID](https://github.com/ulid/spec): sortable by time,
unique without coordination, and URL-safe.

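As a rough illustration of why a ULID sorts by time, here is a minimal sketch that builds one from a 48-bit millisecond timestamp and 80 random bits, encoded in Crockford base32 as the ULID spec describes. The function name `new_ulid` and its parameters are hypothetical; nothing here is the project's actual ID generator.

```python
import os
import time

# Crockford base32 alphabet (no I, L, O, U); it is already in ASCII order,
# so string comparison of two ULIDs matches numeric comparison.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(now_ms=None, entropy=None):
    """Return a 26-character ULID (hypothetical helper, not project code)."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000  # wall clock in milliseconds
    if entropy is None:
        entropy = os.urandom(10)  # 80 random bits
    # 48-bit timestamp in the high bits, 80-bit randomness in the low bits.
    value = (now_ms << 80) | int.from_bytes(entropy, "big")
    chars = []
    for _ in range(26):  # 26 * 5 = 130 bits; the top two bits are always zero
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))
```

Because the timestamp occupies the leading characters, a plain `sorted()` over episode directory names yields chronological order.
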
## Common fields on every telemetry row

| Field | Type | Notes |
|---|---|---|
| `t_mono_ns` | int | host `CLOCK_MONOTONIC` at sample time, episode-relative origin |
| `t_wall_ns` | int | host wall clock, ns since epoch |
| `source` | string | one of `host_proc`, `host_qmp`, `host_perf`, `bridge_pcap`, `guest_agent` |
| `available_in_deployment` | bool | **`true` = feature, `false` = oracle** |

The `available_in_deployment` flag is denormalized onto every row so downstream
loaders can select the rows available to the realistic model without looking up
a separate manifest.

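A minimal sketch of how a loader might use the flag, assuming rows arrive as an iterable of JSONL lines (an open file works). The name `iter_rows` and the `deployable_only` parameter are illustrative, not part of the project.

```python
import json


def iter_rows(lines, deployable_only=False):
    """Yield parsed telemetry rows, optionally dropping oracle-only rows.

    `lines` is any iterable of JSONL strings. Hypothetical helper.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        # Rows missing the flag are treated as oracle (assumption: fail closed).
        if deployable_only and not row.get("available_in_deployment", False):
            continue
        yield row
```

Typical use would be `list(iter_rows(open(path), deployable_only=True))` when building the realistic-model training set.
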
## meta.json schema

```json
{
  "episode_id": "01HW9GZJ7K8QF5W3X2Y6N1A4B0",
  "schema_version": 1,
  "started_at_wall": "2026-04-28T22:30:00Z",
  "ended_at_wall": "2026-04-28T22:31:42Z",
  "git_commit": "<sha>",
  "host_fingerprint": {
    "kernel": "6.18.8",
    "qemu_version": "9.0.0",
    "cpu_model": "...",
    "smt_off": true
  },
  "vm": {
    "image_name": "metasploitable2",
    "image_sha256": "...",
    "vcpus": 1,
    "ram_mib": 512,
    "cgroup_cpu_cap": "800ms/1s",
    "snapshot_name": "baseline-v1"
  },
  "exploit": {
    "framework": "metasploit",
    "module": "exploit/multi/samba/usermap_script",
    "rport": 445,
    "rhost": "10.200.0.10"
  },
  "sample": {
    "name": "linux.miner.xmrig.elf",
    "sha256": "...",
    "source": "MalwareBazaar",
    "first_seen": "2024-...",
    "category": "miner"
  },
  "schedule": {
    "baseline_seconds": 30,
    "infected_seconds": 90,
    "dormant_seconds": 60
  },
  "result": {
    "phases_observed": ["clean", "armed", "infecting", "infected_running", "dormant"],
    "exploit_succeeded": true,
    "sample_executed": true,
    "snapshot_revert_ok": true
  }
}
```

## events.jsonl

One row per orchestrator action, recording exactly what happened and when.

```json
{"t_mono_ns": 0, "t_wall_ns": 1745875200000000000, "event": "snapshot_load", "snapshot": "baseline-v1"}
{"t_mono_ns": 30100000000, "t_wall_ns": 1745875230100000000, "event": "exploit_fire", "module": "exploit/multi/samba/usermap_script"}
{"t_mono_ns": 31250000000, "t_wall_ns": 1745875231250000000, "event": "session_open", "session_id": 1}
{"t_mono_ns": 31300000000, "t_wall_ns": 1745875231300000000, "event": "sample_uploaded", "path": "/tmp/.x", "sha256": "..."}
{"t_mono_ns": 31400000000, "t_wall_ns": 1745875231400000000, "event": "sample_executed", "pid_reported_by_guest": 1042}
{"t_mono_ns": 121400000000, "t_wall_ns": 1745875321400000000, "event": "snapshot_revert", "snapshot": "baseline-v1"}
```

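A minimal sketch of an event writer that stamps both clocks, assuming the episode-relative origin is captured when the log object is created. The class name `EventLog` and its `emit` method are hypothetical; the orchestrator may well fix the origin differently (for example, at the first `snapshot_load`).

```python
import json
import time


class EventLog:
    """Append orchestrator events as JSONL rows (illustrative only)."""

    def __init__(self, stream):
        self._stream = stream
        # Assumption: the episode's monotonic origin is "now".
        self._origin_ns = time.monotonic_ns()

    def emit(self, event, **fields):
        row = {
            "t_mono_ns": time.monotonic_ns() - self._origin_ns,  # episode-relative
            "t_wall_ns": time.time_ns(),  # ns since the Unix epoch
            "event": event,
            **fields,
        }
        self._stream.write(json.dumps(row) + "\n")
        self._stream.flush()  # keep the file usable if the run crashes
        return row
```

For example, `EventLog(open("events.jsonl", "a")).emit("snapshot_load", snapshot="baseline-v1")` appends one row in the shape shown above.
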
## labels.jsonl

One row per phase transition.

```json
{"t_mono_ns": 0, "phase": "clean", "prev": null, "reason": "snapshot_loaded"}
{"t_mono_ns": 30100000000, "phase": "armed", "prev": "clean", "reason": "exploit_module_running"}
{"t_mono_ns": 31250000000, "phase": "infecting", "prev": "armed", "reason": "session_open"}
{"t_mono_ns": 31400000000, "phase": "infected_running", "prev": "infecting", "reason": "sample_executed"}
{"t_mono_ns": 91400000000, "phase": "dormant", "prev": "infected_running", "reason": "scheduler_transition"}
```

### Phase enum (closed)

```
clean             - known-good, post-snapshot-load, pre-exploit
armed             - exploit module is running but no session yet
infecting         - session opened, sample landing/starting
infected_running  - sample is actively producing observable behavior
dormant           - sample is present but idle (sleep timer, beacon interval)
reverting         - snapshot revert triggered, episode ending
```

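Since labels are stored as transitions rather than per-row tags, a loader has to map each telemetry timestamp to the phase in force at that moment. A minimal sketch of that lookup with a binary search, assuming transitions are already sorted by `t_mono_ns` (the function name `phase_at` is hypothetical):

```python
import bisect


def phase_at(transitions, t_mono_ns):
    """Return the phase active at `t_mono_ns`, or None before the first row.

    `transitions` is a list of (t_mono_ns, phase) pairs sorted by time,
    e.g. built from labels.jsonl. Hypothetical helper.
    """
    times = [t for t, _ in transitions]
    # Index of the last transition at or before the query time.
    i = bisect.bisect_right(times, t_mono_ns) - 1
    return transitions[i][1] if i >= 0 else None


# The example transitions from labels.jsonl above.
LABELS = [
    (0, "clean"),
    (30_100_000_000, "armed"),
    (31_250_000_000, "infecting"),
    (31_400_000_000, "infected_running"),
    (91_400_000_000, "dormant"),
]
```

A transition takes effect at its own timestamp, so a sample taken at exactly `31250000000` is labeled `infecting`.
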
## telemetry-proc.jsonl (source 1, oracle)

```json
{
  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
  "source": "host_proc", "available_in_deployment": false,
  "cpu_user_jiffies": 142, "cpu_sys_jiffies": 38,
  "rss_bytes": 542113792, "vsize_bytes": 1842933760,
  "io_read_bytes": 0, "io_write_bytes": 4096,
  "voluntary_ctxsw": 12, "involuntary_ctxsw": 3,
  "minor_faults": 412, "major_faults": 0
}
```

## telemetry-qmp.jsonl (source 2, oracle)

```json
{
  "t_mono_ns": 1000000000, "t_wall_ns": 1745875201000000000,
  "source": "host_qmp", "available_in_deployment": false,
  "blockstats": {"vda": {"rd_ops": 12, "wr_ops": 4, "rd_bytes": 49152, "wr_bytes": 16384}},
  "kvm_exits": {"total": 18342, "io": 942, "mmio": 12, "halt": 17000, "irq_window": 110},
  "netdev": {"net0": {"rx_packets": 0, "tx_packets": 4, "rx_bytes": 0, "tx_bytes": 256}}
}
```

## telemetry-perf.jsonl (source 3, oracle)

```json
{
  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
  "source": "host_perf", "available_in_deployment": false,
  "cycles": 184213104, "instructions": 121987001,
  "cache_references": 1041213, "cache_misses": 38104,
  "branches": 24198421, "branch_misses": 412004,
  "page_faults": 12, "context_switches": 18,
  "ipc": 0.66, "cache_miss_rate": 0.0366
}
```

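The `ipc` and `cache_miss_rate` values in the example are consistent with the usual definitions (instructions per cycle, and cache misses over cache references). A minimal sketch of deriving them from the raw counters, assuming those definitions and the rounding shown in the example; `derived_perf` is a hypothetical name:

```python
def derived_perf(row):
    """Compute ratio fields from raw perf counters (illustrative only).

    Returns None for a ratio whose denominator is zero, which can happen
    in an idle sampling window.
    """
    cycles = row["cycles"]
    refs = row["cache_references"]
    return {
        "ipc": round(row["instructions"] / cycles, 2) if cycles else None,
        "cache_miss_rate": round(row["cache_misses"] / refs, 4) if refs else None,
    }
```

Storing the raw counters alongside the ratios keeps the ratios re-derivable if the rounding or definition changes later.
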
## netflow.jsonl (source 4, feature)

Bucketed from the pcap. The pcap stays raw on disk for re-derivation later.

```json
{
  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
  "source": "bridge_pcap", "available_in_deployment": true,
  "bucket_ms": 100,
  "pkts_in": 0, "pkts_out": 0, "bytes_in": 0, "bytes_out": 0,
  "unique_dst_ips": 0, "unique_dst_ports": 0,
  "syn_count": 0, "fin_count": 0, "rst_count": 0,
  "dns_query_count": 0, "tcp_new_flows": 0
}
```

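A minimal sketch of the bucketing step, covering only the packet and byte counters. It assumes packets have already been parsed out of the pcap into `(t_mono_ns, direction, size_bytes)` tuples with `direction` being `"in"` or `"out"`, and that `t_mono_ns` on a row marks the start of its bucket; both are assumptions, as is the name `bucket_packets`. Unlike the all-zero example row above, this sketch emits only non-empty buckets.

```python
def bucket_packets(packets, bucket_ms=100):
    """Aggregate parsed packets into fixed-width time buckets.

    `packets` is an iterable of (t_mono_ns, direction, size_bytes).
    Returns one dict per non-empty bucket, ordered by time.
    """
    bucket_ns = bucket_ms * 1_000_000
    buckets = {}
    for t_ns, direction, size in packets:
        start = (t_ns // bucket_ns) * bucket_ns  # floor to the bucket start
        row = buckets.setdefault(start, {
            "t_mono_ns": start,
            "source": "bridge_pcap",
            "available_in_deployment": True,
            "bucket_ms": bucket_ms,
            "pkts_in": 0, "pkts_out": 0,
            "bytes_in": 0, "bytes_out": 0,
        })
        row[f"pkts_{direction}"] += 1
        row[f"bytes_{direction}"] += size
    return [buckets[start] for start in sorted(buckets)]
```

Keeping the raw pcap means the bucket width or the set of counters can be changed later and the rows regenerated.
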
## telemetry-guest.jsonl (source 5, feature)

```json
{
  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
  "source": "guest_agent", "available_in_deployment": true,
  "cpu_pct_total": 12.4, "load_1m": 0.41,
  "mem_used_bytes": 184213504, "mem_available_bytes": 354127872,
  "thermal_milli_c": 47200,
  "net": {"eth0": {"rx_bytes": 0, "tx_bytes": 256, "rx_pkts": 0, "tx_pkts": 4}},
  "top_procs": [
    {"pid": 1042, "comm": "kworker/0:1", "cpu_pct": 0.4, "rss_bytes": 1048576},
    {"pid": 1, "comm": "systemd", "cpu_pct": 0.1, "rss_bytes": 4194304}
  ],
  "listen_ports": [22, 80, 445]
}
```

## Versioning

`schema_version` lives in `meta.json`. Bump when any row schema changes. Keep
old episodes untouched; loaders dispatch on version.

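One way to dispatch on version is a small registry keyed by `schema_version`. The sketch below is an assumption about loader structure, not existing project code; `LOADERS`, `_load_v1`, and `load_meta` are hypothetical names.

```python
def _load_v1(meta):
    """Version 1 is the current layout, so no migration is needed."""
    return meta


# Register one loader per schema_version; add entries as the schema evolves.
LOADERS = {1: _load_v1}


def load_meta(meta):
    """Normalize a parsed meta.json dict via the loader for its version."""
    version = meta.get("schema_version")
    try:
        loader = LOADERS[version]
    except KeyError:
        # Fail loudly rather than guess at an unknown layout.
        raise ValueError(f"unsupported schema_version: {version!r}") from None
    return loader(meta)
```

Old episodes stay byte-for-byte untouched on disk; any upgrade happens in memory inside the version-specific loader.
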
## Ingest later

When we move to a database (most likely Timescale), each `telemetry-*.jsonl`
file becomes one hypertable, partitioned by `t_wall_ns` and indexed on
`(episode_id, source)`. The `available_in_deployment` flag becomes a column we
filter on when materializing the realistic-model training view.