Data Model
JSONL only, no database, schema-last. Each episode is a self-contained directory.
Per-episode layout
data/episodes/<episode_id>/
meta.json # one-time, written at start; updated at end with summary
events.jsonl # orchestrator actions, one row per event
labels.jsonl # phase transitions, one row per transition
telemetry-proc.jsonl # source 1 (oracle) host /proc/<qemu_pid>
telemetry-qmp.jsonl # source 2 (oracle) QEMU QMP queries
telemetry-perf.jsonl # source 3 (oracle) perf stat -p <qemu_pid>
telemetry-guest.jsonl # source 5 (feature) in-guest agent over virtio-serial
network.pcap # source 4 (feature) raw tcpdump -i br-malware
netflow.jsonl # source 4 (feature) 100 ms bucketed aggregations of the pcap
stderr.log # raw qemu + agent logs
<episode_id> is a ULID — sortable by time,
unique without coordination, URL-safe.
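Time-sortability follows from the ULID layout: the first 10 of its 26 characters encode a 48-bit millisecond Unix timestamp in Crockford base32. A minimal sketch of recovering that timestamp from a directory name; the helper name `ulid_timestamp_ms` is hypothetical and not part of this project.

```python
# Crockford base32 alphabet used by ULIDs (no I, L, O, U).
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_timestamp_ms(ulid: str) -> int:
    """Return the millisecond Unix timestamp held in a ULID's first 10 chars."""
    if len(ulid) != 26:
        raise ValueError("a ULID is exactly 26 characters")
    value = 0
    for ch in ulid[:10].upper():
        # str.index raises ValueError for characters outside the alphabet.
        value = value * 32 + CROCKFORD.index(ch)
    return value


# Sorting IDs as plain strings gives the same order as sorting by timestamp,
# because the alphabet is in ascending ASCII order.
ids = ["01HW9GZJ7K8QF5W3X2Y6N1A4B0", "01HW9GZJ7J0000000000000000"]
by_name = sorted(ids)
by_time = sorted(ids, key=ulid_timestamp_ms)
```

This is why a plain `ls data/episodes/` already lists episodes in creation order.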
Common fields on every telemetry row
| Field | Type | Notes |
|---|---|---|
| t_mono_ns | int | host CLOCK_MONOTONIC at sample time, episode-relative origin |
| t_wall_ns | int | host wall clock, ns since epoch |
| source | string | one of host_proc, host_qmp, host_perf, bridge_pcap, guest_agent |
| available_in_deployment | bool | true = feature, false = oracle |
The available_in_deployment flag is denormalized onto every row so downstream
loaders don't have to look up a separate manifest to filter for the realistic
model.
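With the flag on every row, a loader can filter for the realistic model in a single pass. A minimal sketch; the function name `iter_rows` and its `deployable_only` parameter are illustrative assumptions, not an existing API in this repo.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_rows(path: Path, deployable_only: bool = False) -> Iterator[dict]:
    """Yield JSONL rows, optionally keeping only deployable (feature) rows."""
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            # Rows missing the flag are treated as oracle-only (assumption).
            if deployable_only and not row.get("available_in_deployment", False):
                continue
            yield row
```

The oracle model would call `iter_rows(path)` and the realistic model `iter_rows(path, deployable_only=True)` over the same files.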
meta.json schema
{
"episode_id": "01HW9GZJ7K8QF5W3X2Y6N1A4B0",
"schema_version": 1,
"started_at_wall": "2026-04-28T22:30:00Z",
"ended_at_wall": "2026-04-28T22:31:42Z",
"git_commit": "<sha>",
"host_fingerprint": {
"kernel": "6.18.8",
"qemu_version": "9.0.0",
"cpu_model": "...",
"smt_off": true
},
"vm": {
"image_name": "metasploitable2",
"image_sha256": "...",
"vcpus": 1,
"ram_mib": 512,
"cgroup_cpu_cap": "800ms/1s",
"snapshot_name": "baseline-v1"
},
"exploit": {
"framework": "metasploit",
"module": "exploit/multi/samba/usermap_script",
"rport": 445,
"rhost": "10.200.0.10"
},
"sample": {
"name": "linux.miner.xmrig.elf",
"sha256": "...",
"source": "MalwareBazaar",
"first_seen": "2024-...",
"category": "miner"
},
"schedule": {
"baseline_seconds": 30,
"infected_seconds": 90,
"dormant_seconds": 60
},
"result": {
"phases_observed": ["clean","armed","infecting","infected_running","dormant"],
"exploit_succeeded": true,
"sample_executed": true,
"snapshot_revert_ok": true
}
}
events.jsonl
One row per orchestrator action. Tells you exactly what happened and when.
{"t_mono_ns": 0, "t_wall_ns": 1745875200000000000, "event": "snapshot_load", "snapshot": "baseline-v1"}
{"t_mono_ns": 30100000000,"t_wall_ns": 1745875230100000000,"event": "exploit_fire", "module": "exploit/multi/samba/usermap_script"}
{"t_mono_ns": 31250000000,"t_wall_ns": 1745875231250000000,"event": "session_open", "session_id": 1}
{"t_mono_ns": 31300000000,"t_wall_ns": 1745875231300000000,"event": "sample_uploaded", "path": "/tmp/.x", "sha256": "..."}
{"t_mono_ns": 31400000000,"t_wall_ns": 1745875231400000000,"event": "sample_executed", "pid_reported_by_guest": 1042}
{"t_mono_ns": 121400000000,"t_wall_ns": 1745875321400000000,"event": "snapshot_revert", "snapshot": "baseline-v1"}
labels.jsonl
{"t_mono_ns": 0, "phase": "clean", "prev": null, "reason": "snapshot_loaded"}
{"t_mono_ns": 30100000000, "phase": "armed", "prev": "clean", "reason": "exploit_module_running"}
{"t_mono_ns": 31250000000, "phase": "infecting", "prev": "armed", "reason": "session_open"}
{"t_mono_ns": 31400000000, "phase": "infected_running","prev": "infecting", "reason": "sample_executed"}
{"t_mono_ns": 91400000000, "phase": "dormant", "prev": "infected_running", "reason": "scheduler_transition"}
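Because labels are stored as transitions rather than per-sample tags, a loader has to map each telemetry timestamp onto the most recent transition. A sketch using binary search, assuming a transition's phase holds from its own t_mono_ns (inclusive) until the next transition; `build_phase_lookup` is a hypothetical helper name.

```python
import bisect
import json

# The labels.jsonl rows shown above, inlined so the example is self-contained.
LABELS_JSONL = """\
{"t_mono_ns": 0, "phase": "clean", "prev": null, "reason": "snapshot_loaded"}
{"t_mono_ns": 30100000000, "phase": "armed", "prev": "clean", "reason": "exploit_module_running"}
{"t_mono_ns": 31250000000, "phase": "infecting", "prev": "armed", "reason": "session_open"}
{"t_mono_ns": 31400000000, "phase": "infected_running", "prev": "infecting", "reason": "sample_executed"}
{"t_mono_ns": 91400000000, "phase": "dormant", "prev": "infected_running", "reason": "scheduler_transition"}
"""


def build_phase_lookup(labels_text: str):
    """Return a function mapping t_mono_ns to the phase active at that time."""
    rows = [json.loads(line) for line in labels_text.splitlines() if line.strip()]
    rows.sort(key=lambda r: r["t_mono_ns"])
    starts = [r["t_mono_ns"] for r in rows]
    phases = [r["phase"] for r in rows]

    def phase_at(t_mono_ns: int):
        # Index of the last transition at or before t_mono_ns.
        i = bisect.bisect_right(starts, t_mono_ns) - 1
        return phases[i] if i >= 0 else None

    return phase_at


phase_at = build_phase_lookup(LABELS_JSONL)
label = phase_at(31_300_000_000)  # falls between session_open and sample_executed
```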
Phase enum (closed)
clean — known-good, post-snapshot-load, pre-exploit
armed — exploit module is running but no session yet
infecting — session opened, sample landing/starting
infected_running — sample is actively producing observable behavior
dormant — sample is present but idle (sleep timer, beacon interval)
reverting — snapshot_revert triggered, episode ending
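A closed enum is easy to enforce at load time. The sketch below rejects unknown phase strings and checks that each row's prev matches the phase before it; the class and function names are illustrative assumptions, and the prev-chain check is an inferred invariant rather than a documented rule.

```python
from enum import Enum


class Phase(str, Enum):
    CLEAN = "clean"
    ARMED = "armed"
    INFECTING = "infecting"
    INFECTED_RUNNING = "infected_running"
    DORMANT = "dormant"
    REVERTING = "reverting"


def validate_labels(rows: list[dict]) -> None:
    """Raise ValueError on an unknown phase, broken prev chain, or time going backwards."""
    previous = None
    last_t = -1
    for row in rows:
        phase = Phase(row["phase"])  # ValueError for values outside the enum
        prev = Phase(row["prev"]) if row["prev"] is not None else None
        if prev != previous:
            raise ValueError(f"prev={row['prev']!r} does not match the preceding phase")
        if row["t_mono_ns"] < last_t:
            raise ValueError("t_mono_ns must be non-decreasing")
        previous, last_t = phase, row["t_mono_ns"]
```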
telemetry-proc.jsonl (source 1, oracle)
{
"t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
"source": "host_proc", "available_in_deployment": false,
"cpu_user_jiffies": 142, "cpu_sys_jiffies": 38,
"rss_bytes": 542113792, "vsize_bytes": 1842933760,
"io_read_bytes": 0, "io_write_bytes": 4096,
"voluntary_ctxsw": 12, "involuntary_ctxsw": 3,
"minor_faults": 412, "major_faults": 0
}
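Several of these fields map onto /proc/<qemu_pid>/stat. A hedged sketch of parsing one stat line, assuming the collector reads that file directly; field positions follow proc(5). The io_* and ctxsw fields come from other /proc files and are not covered here. The function name and the default page size are assumptions.

```python
def parse_proc_stat(stat_line: str, page_size: int = 4096) -> dict:
    """Extract a few counters from one /proc/<pid>/stat line."""
    # comm (field 2) may contain spaces or parentheses, so split after the last ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field N of proc(5) sits at rest[N - 3].
    return {
        "minor_faults": int(rest[7]),             # field 10, minflt
        "major_faults": int(rest[9]),             # field 12, majflt
        "cpu_user_jiffies": int(rest[11]),        # field 14, utime (clock ticks)
        "cpu_sys_jiffies": int(rest[12]),         # field 15, stime (clock ticks)
        "vsize_bytes": int(rest[20]),             # field 23, vsize
        "rss_bytes": int(rest[21]) * page_size,   # field 24, rss is in pages
    }


# Synthetic line for illustration only; real lines carry more trailing fields.
sample = (
    "4242 (qemu-system (kvm)) S 1 4242 4242 0 -1 4194560 "
    "412 0 0 0 142 38 0 0 20 0 5 0 1000 1842933760 132352"
)
row = parse_proc_stat(sample)
```

On a real host the page size would come from `os.sysconf("SC_PAGE_SIZE")` rather than the hard-coded default.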
telemetry-qmp.jsonl (source 2, oracle)
{
"t_mono_ns": 1000000000, "t_wall_ns": 1745875201000000000,
"source": "host_qmp", "available_in_deployment": false,
"blockstats": {"vda": {"rd_ops": 12, "wr_ops": 4, "rd_bytes": 49152, "wr_bytes": 16384}},
"kvm_exits": {"total": 18342, "io": 942, "mmio": 12, "halt": 17000, "irq_window": 110},
"netdev": {"net0": {"rx_packets": 0, "tx_packets": 4, "rx_bytes": 0, "tx_bytes": 256}}
}
telemetry-perf.jsonl (source 3, oracle)
{
"t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
"source": "host_perf", "available_in_deployment": false,
"cycles": 184213104, "instructions": 121987001,
"cache_references": 1041213, "cache_misses": 38104,
"branches": 24198421, "branch_misses": 412004,
"page_faults": 12, "context_switches": 18,
"ipc": 0.66, "cache_miss_rate": 0.0366
}
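The last two fields are derived from the raw counters. A small sketch, assuming ipc is instructions / cycles and cache_miss_rate is cache_misses / cache_references, which reproduces the sample values above; the function name is hypothetical.

```python
def derive_perf_ratios(row: dict) -> dict:
    """Compute ipc and cache_miss_rate from raw perf counters."""
    cycles = row["cycles"]
    refs = row["cache_references"]
    return {
        # None when the denominator is zero (e.g. an idle sampling window).
        "ipc": round(row["instructions"] / cycles, 2) if cycles else None,
        "cache_miss_rate": round(row["cache_misses"] / refs, 4) if refs else None,
    }


ratios = derive_perf_ratios({
    "cycles": 184213104, "instructions": 121987001,
    "cache_references": 1041213, "cache_misses": 38104,
})
```

Storing the ratios alongside the counters is redundant, but it spares every loader from repeating the zero-denominator handling.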
netflow.jsonl (source 4, feature)
Bucketed from the pcap. The pcap stays raw on disk for re-derivation later.
{
"t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
"source": "bridge_pcap", "available_in_deployment": true,
"bucket_ms": 100,
"pkts_in": 0, "pkts_out": 0, "bytes_in": 0, "bytes_out": 0,
"unique_dst_ips": 0, "unique_dst_ports": 0,
"syn_count": 0, "fin_count": 0, "rst_count": 0,
"dns_query_count": 0, "tcp_new_flows": 0
}
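The bucketing itself is integer division on t_mono_ns. A sketch under stated assumptions: packets have already been decoded into (t_mono_ns, direction, length) tuples with direction "in" or "out", and each output row is stamped with the bucket start. Only the packet and byte counters are shown. The all-zero sample row above suggests idle buckets are written too; this sketch emits only buckets that saw traffic.

```python
from collections import defaultdict

BUCKET_NS = 100 * 1_000_000  # 100 ms in nanoseconds


def bucket_packets(packets, bucket_ns: int = BUCKET_NS) -> list[dict]:
    """Aggregate (t_mono_ns, direction, length) tuples into fixed-width buckets."""
    buckets = defaultdict(
        lambda: {"pkts_in": 0, "pkts_out": 0, "bytes_in": 0, "bytes_out": 0}
    )
    for t_mono_ns, direction, length in packets:
        start = (t_mono_ns // bucket_ns) * bucket_ns  # floor to bucket boundary
        counts = buckets[start]
        counts[f"pkts_{direction}"] += 1
        counts[f"bytes_{direction}"] += length
    return [
        {"t_mono_ns": start, "bucket_ms": bucket_ns // 1_000_000, **counts}
        for start, counts in sorted(buckets.items())
    ]


rows = bucket_packets([
    (10, "out", 64),
    (99_999_999, "in", 100),
    (100_000_000, "out", 60),
])
```

Keeping the raw pcap means the bucket width can be changed later by re-running this step with a different `bucket_ns`.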
telemetry-guest.jsonl (source 5, feature)
{
"t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
"source": "guest_agent", "available_in_deployment": true,
"cpu_pct_total": 12.4, "load_1m": 0.41,
"mem_used_bytes": 184213504, "mem_available_bytes": 354127872,
"thermal_milli_c": 47200,
"net": {"eth0": {"rx_bytes": 0, "tx_bytes": 256, "rx_pkts": 0, "tx_pkts": 4}},
"top_procs": [
{"pid": 1042, "comm": "kworker/0:1", "cpu_pct": 0.4, "rss_bytes": 1048576},
{"pid": 1, "comm": "systemd", "cpu_pct": 0.1, "rss_bytes": 4194304}
],
"listen_ports": [22, 80, 445]
}
Versioning
schema_version lives in meta.json. Bump when any row schema changes. Keep
old episodes untouched; loaders dispatch on version.
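Dispatching on version can be as simple as a table of per-version row loaders. A minimal sketch; `LOADERS`, `_load_v1`, and `load_row` are hypothetical names, and only version 1 exists so far.

```python
import json


def _load_v1(row: dict) -> dict:
    """schema_version 1 rows are used as-is."""
    return row


# One entry per schema_version; add a new loader when the version is bumped.
LOADERS = {1: _load_v1}


def load_row(meta: dict, line: str) -> dict:
    """Parse one JSONL line using the loader for the episode's schema_version."""
    version = meta["schema_version"]
    try:
        loader = LOADERS[version]
    except KeyError:
        raise ValueError(f"unsupported schema_version: {version}") from None
    return loader(json.loads(line))
```

A later version's loader would translate its rows into whatever shape the training code expects, so old episodes never need rewriting on disk.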
Ingest later
When we move to a database (most likely TimescaleDB), each telemetry-*.jsonl
becomes one hypertable, partitioned by t_wall_ns and indexed on
(episode_id, source). The available_in_deployment flag becomes a column we filter on
when materializing the realistic-model training view.