CIS490/docs/data-model.md
Maximus Gorog fa1574a0a6 Scaffold project: docs, repo skeleton, transport + deploy design
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:21:00 -06:00


# Data Model
JSONL only, no database, schema-last. Each episode is a self-contained directory.
## Per-episode layout
```
data/episodes/<episode_id>/
  meta.json              # one-time, written at start; updated at end with summary
  events.jsonl           # orchestrator actions, one row per event
  labels.jsonl           # phase transitions, one row per transition
  telemetry-proc.jsonl   # source 1 (oracle)  host /proc/<qemu_pid>
  telemetry-qmp.jsonl    # source 2 (oracle)  QEMU QMP queries
  telemetry-perf.jsonl   # source 3 (oracle)  perf stat -p <qemu_pid>
  telemetry-guest.jsonl  # source 5 (feature) in-guest agent over virtio-serial
  network.pcap           # source 4           raw tcpdump -i br-malware
  netflow.jsonl          # source 4           bucketed 100ms aggregations of pcap
  stderr.log             # raw qemu + agent logs
```
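As a rough illustration of how a loader might sanity-check an episode directory against the layout above, here is a minimal sketch. The helper name `missing_files` is hypothetical, and treating every listed file as required is an assumption of the sketch, not something this document specifies.

```python
from pathlib import Path

# File names copied from the per-episode layout. Treating all of them as
# required is an assumption made for this sketch.
EXPECTED_FILES = (
    "meta.json",
    "events.jsonl",
    "labels.jsonl",
    "telemetry-proc.jsonl",
    "telemetry-qmp.jsonl",
    "telemetry-perf.jsonl",
    "telemetry-guest.jsonl",
    "network.pcap",
    "netflow.jsonl",
    "stderr.log",
)


def missing_files(episode_dir: Path) -> list[str]:
    """Return the expected file names that are absent from episode_dir."""
    return [name for name in EXPECTED_FILES if not (episode_dir / name).is_file()]
```

An empty return value means the directory has every file in the layout; anything else names the gaps.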
`<episode_id>` is a [ULID](https://github.com/ulid/spec): sortable by time,
unique without coordination, and URL-safe.
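Because a ULID starts with a 48-bit millisecond timestamp encoded as 10 Crockford base32 characters, the creation time can be recovered from the ID alone. The sketch below shows that encoding and decoding; the function names are hypothetical and nothing here is part of the project code.

```python
# Crockford base32 alphabet used by the ULID spec (no I, L, O, U).
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_ulid_time(ms: int) -> str:
    """Encode a millisecond timestamp as the 10-character ULID time prefix."""
    if not 0 <= ms < 2**48:
        raise ValueError("ULID timestamps are 48-bit")
    chars = []
    for _ in range(10):
        chars.append(CROCKFORD[ms & 0x1F])  # lowest 5 bits first
        ms >>= 5
    return "".join(reversed(chars))


def ulid_time_ms(ulid: str) -> int:
    """Decode the millisecond timestamp from a 26-character ULID."""
    if len(ulid) != 26:
        raise ValueError("a ULID is 26 characters long")
    ms = 0
    for ch in ulid[:10].upper():
        ms = ms * 32 + CROCKFORD.index(ch)  # ValueError on an invalid character
    return ms
```

The alphabet is in ascending ASCII order, which is why plain string sorting of episode directories matches creation order.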
## Common fields on every telemetry row
| Field | Type | Notes |
|---|---|---|
| `t_mono_ns` | int | host `CLOCK_MONOTONIC` at sample time, in ns, rebased so the episode starts at 0 |
| `t_wall_ns` | int | host wall clock, ns since the Unix epoch |
| `source` | string | one of `host_proc`, `host_qmp`, `host_perf`, `bridge_pcap`, `guest_agent` |
| `available_in_deployment` | bool | **true = feature, false = oracle** |
The `available_in_deployment` flag is denormalized onto every row so downstream
loaders don't have to look up a separate manifest to filter for the realistic
model.
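A downstream loader could use the flag roughly as follows. This is a minimal sketch, assuming each file holds one JSON object per line; `read_rows` and its `deployable_only` parameter are hypothetical names.

```python
import json
from pathlib import Path
from typing import Iterator


def read_rows(path: Path, deployable_only: bool = False) -> Iterator[dict]:
    """Yield rows from a JSONL file, optionally keeping only deployable features."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            # Oracle rows carry available_in_deployment == false.
            if deployable_only and not row["available_in_deployment"]:
                continue
            yield row
```

Calling `read_rows(path, deployable_only=True)` gives the view for the realistic model; the default gives the oracle view.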
## meta.json schema
```json
{
  "episode_id": "01HW9GZJ7K8QF5W3X2Y6N1A4B0",
  "schema_version": 1,
  "started_at_wall": "2026-04-28T22:30:00Z",
  "ended_at_wall": "2026-04-28T22:31:42Z",
  "git_commit": "<sha>",
  "host_fingerprint": {
    "kernel": "6.18.8",
    "qemu_version": "9.0.0",
    "cpu_model": "...",
    "smt_off": true
  },
  "vm": {
    "image_name": "metasploitable2",
    "image_sha256": "...",
    "vcpus": 1,
    "ram_mib": 512,
    "cgroup_cpu_cap": "800ms/1s",
    "snapshot_name": "baseline-v1"
  },
  "exploit": {
    "framework": "metasploit",
    "module": "exploit/multi/samba/usermap_script",
    "rport": 445,
    "rhost": "10.200.0.10"
  },
  "sample": {
    "name": "linux.miner.xmrig.elf",
    "sha256": "...",
    "source": "MalwareBazaar",
    "first_seen": "2024-...",
    "category": "miner"
  },
  "schedule": {
    "baseline_seconds": 30,
    "infected_seconds": 90,
    "dormant_seconds": 60
  },
  "result": {
    "phases_observed": ["clean", "armed", "infecting", "infected_running", "dormant"],
    "exploit_succeeded": true,
    "sample_executed": true,
    "snapshot_revert_ok": true
  }
}
```
## events.jsonl
One row per orchestrator action. Tells you exactly what happened and when.
```json
{"t_mono_ns": 0, "t_wall_ns": 1745875200000000000, "event": "snapshot_load", "snapshot": "baseline-v1"}
{"t_mono_ns": 30100000000, "t_wall_ns": 1745875230100000000, "event": "exploit_fire", "module": "exploit/multi/samba/usermap_script"}
{"t_mono_ns": 31250000000, "t_wall_ns": 1745875231250000000, "event": "session_open", "session_id": 1}
{"t_mono_ns": 31300000000, "t_wall_ns": 1745875231300000000, "event": "sample_uploaded", "path": "/tmp/.x", "sha256": "..."}
{"t_mono_ns": 31400000000, "t_wall_ns": 1745875231400000000, "event": "sample_executed", "pid_reported_by_guest": 1042}
{"t_mono_ns": 121400000000, "t_wall_ns": 1745875321400000000, "event": "snapshot_revert", "snapshot": "baseline-v1"}
```
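## labels.jsonl
One row per phase transition. `prev` is the phase being left (`null` on the first row) and `reason` names the trigger.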
```json
{"t_mono_ns": 0, "phase": "clean", "prev": null, "reason": "snapshot_loaded"}
{"t_mono_ns": 30100000000, "phase": "armed", "prev": "clean", "reason": "exploit_module_running"}
{"t_mono_ns": 31250000000, "phase": "infecting", "prev": "armed", "reason": "session_open"}
{"t_mono_ns": 31400000000, "phase": "infected_running", "prev": "infecting", "reason": "sample_executed"}
{"t_mono_ns": 91400000000, "phase": "dormant", "prev": "infected_running", "reason": "scheduler_transition"}
```
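Since label rows mark transitions rather than intervals, attaching a phase to a telemetry row means finding the latest transition at or before that row's `t_mono_ns`. One possible way to do this is a binary search, sketched below; `phase_at` is a hypothetical helper and the sketch assumes the label rows are already sorted by `t_mono_ns`.

```python
import bisect
from typing import Optional


def phase_at(labels: list[dict], t_mono_ns: int) -> Optional[str]:
    """Return the phase in effect at t_mono_ns, or None before the first label.

    Assumes `labels` is sorted by t_mono_ns, as in the labels.jsonl example.
    """
    starts = [row["t_mono_ns"] for row in labels]
    # Index of the last transition whose timestamp is <= t_mono_ns.
    i = bisect.bisect_right(starts, t_mono_ns) - 1
    return labels[i]["phase"] if i >= 0 else None
```

A transition timestamp counts as the first instant of the new phase, so a telemetry row stamped exactly at `31250000000` would be labeled `infecting` under this sketch.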
### Phase enum (closed)
```
clean            - known-good, post-snapshot-load, pre-exploit
armed            - exploit module is running but no session yet
infecting        - session opened, sample landing/starting
infected_running - sample is actively producing observable behavior
dormant          - sample is present but idle (sleep timer, beacon interval)
reverting        - snapshot_revert triggered, episode ending
```
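A closed enum is straightforward to enforce at load time. The following sketch mirrors the six phases in a Python `Enum` and checks that each label row's `prev` matches the row before it; the class and function names are hypothetical, and the `prev`-chain check is an assumption about what a validator might want rather than a documented rule.

```python
from enum import Enum


class Phase(str, Enum):
    CLEAN = "clean"
    ARMED = "armed"
    INFECTING = "infecting"
    INFECTED_RUNNING = "infected_running"
    DORMANT = "dormant"
    REVERTING = "reverting"


def check_labels(rows: list[dict]) -> None:
    """Raise ValueError on an unknown phase or a broken prev chain."""
    prev = None
    for row in rows:
        phase = Phase(row["phase"])  # ValueError if not in the closed enum
        expected_prev = prev.value if prev is not None else None
        if row["prev"] != expected_prev:
            raise ValueError(
                f"prev={row['prev']!r} does not match previous phase {expected_prev!r}"
            )
        prev = phase
```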
## telemetry-proc.jsonl (source 1, oracle)
```json
{
  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
  "source": "host_proc", "available_in_deployment": false,
  "cpu_user_jiffies": 142, "cpu_sys_jiffies": 38,
  "rss_bytes": 542113792, "vsize_bytes": 1842933760,
  "io_read_bytes": 0, "io_write_bytes": 4096,
  "voluntary_ctxsw": 12, "involuntary_ctxsw": 3,
  "minor_faults": 412, "major_faults": 0
}
```
## telemetry-qmp.jsonl (source 2, oracle)
```json
{
  "t_mono_ns": 1000000000, "t_wall_ns": 1745875201000000000,
  "source": "host_qmp", "available_in_deployment": false,
  "blockstats": {"vda": {"rd_ops": 12, "wr_ops": 4, "rd_bytes": 49152, "wr_bytes": 16384}},
  "kvm_exits": {"total": 18342, "io": 942, "mmio": 12, "halt": 17000, "irq_window": 110},
  "netdev": {"net0": {"rx_packets": 0, "tx_packets": 4, "rx_bytes": 0, "tx_bytes": 256}}
}
```
## telemetry-perf.jsonl (source 3, oracle)
```json
{
  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
  "source": "host_perf", "available_in_deployment": false,
  "cycles": 184213104, "instructions": 121987001,
  "cache_references": 1041213, "cache_misses": 38104,
  "branches": 24198421, "branch_misses": 412004,
  "page_faults": 12, "context_switches": 18,
  "ipc": 0.66, "cache_miss_rate": 0.0366
}
```
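The `ipc` and `cache_miss_rate` values in the example are consistent with being derived from the raw counters in the same row. A sketch of that derivation, assuming `ipc = instructions / cycles` and `cache_miss_rate = cache_misses / cache_references` (the document does not state the formulas, and `derived_perf_fields` is a hypothetical name):

```python
def derived_perf_fields(row: dict) -> dict:
    """Compute ratio fields from raw perf counters, guarding against zero divisors."""
    cycles = row["cycles"]
    refs = row["cache_references"]
    return {
        # Instructions retired per cycle.
        "ipc": row["instructions"] / cycles if cycles else 0.0,
        # Fraction of cache references that missed.
        "cache_miss_rate": row["cache_misses"] / refs if refs else 0.0,
    }
```

Returning `0.0` for an idle interval with zero cycles is a choice made for this sketch; a collector could equally emit `null` there.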
## netflow.jsonl (source 4, feature)
Bucketed from the pcap. The pcap stays raw on disk for re-derivation later.
```json
{
  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
  "source": "bridge_pcap", "available_in_deployment": true,
  "bucket_ms": 100,
  "pkts_in": 0, "pkts_out": 0, "bytes_in": 0, "bytes_out": 0,
  "unique_dst_ips": 0, "unique_dst_ports": 0,
  "syn_count": 0, "fin_count": 0, "rst_count": 0,
  "dns_query_count": 0, "tcp_new_flows": 0
}
```
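To make the bucketing concrete, here is a simplified sketch that aggregates already-parsed packets into 100 ms buckets. It makes several assumptions that this document does not confirm: packets arrive as `(t_mono_ns, src_ip, dst_ip, length)` tuples, direction is relative to the guest IP, each row's `t_mono_ns` is the bucket start, and empty buckets are skipped. It covers only a subset of the fields above, and `bucket_packets` is a hypothetical name.

```python
from collections import defaultdict

BUCKET_NS = 100 * 1_000_000  # 100 ms in nanoseconds


def bucket_packets(packets, guest_ip: str) -> list[dict]:
    """Aggregate (t_mono_ns, src_ip, dst_ip, length) tuples into 100 ms rows."""
    counts = defaultdict(
        lambda: {"pkts_in": 0, "pkts_out": 0, "bytes_in": 0, "bytes_out": 0}
    )
    dst_ips = defaultdict(set)
    for t_mono_ns, src, dst, length in packets:
        start = (t_mono_ns // BUCKET_NS) * BUCKET_NS  # floor to bucket start
        bucket = counts[start]
        if dst == guest_ip:
            bucket["pkts_in"] += 1
            bucket["bytes_in"] += length
        elif src == guest_ip:
            bucket["pkts_out"] += 1
            bucket["bytes_out"] += length
            dst_ips[start].add(dst)
    return [
        {
            "t_mono_ns": start,
            "source": "bridge_pcap",
            "available_in_deployment": True,
            "bucket_ms": 100,
            **counts[start],
            "unique_dst_ips": len(dst_ips[start]),
        }
        for start in sorted(counts)
    ]
```

Keeping the raw pcap means a different bucket width or extra fields can be derived later by rerunning an aggregation like this one.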
## telemetry-guest.jsonl (source 5, feature)
```json
{
  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
  "source": "guest_agent", "available_in_deployment": true,
  "cpu_pct_total": 12.4, "load_1m": 0.41,
  "mem_used_bytes": 184213504, "mem_available_bytes": 354127872,
  "thermal_milli_c": 47200,
  "net": {"eth0": {"rx_bytes": 0, "tx_bytes": 256, "rx_pkts": 0, "tx_pkts": 4}},
  "top_procs": [
    {"pid": 1042, "comm": "kworker/0:1", "cpu_pct": 0.4, "rss_bytes": 1048576},
    {"pid": 1, "comm": "systemd", "cpu_pct": 0.1, "rss_bytes": 4194304}
  ],
  "listen_ports": [22, 80, 445]
}
```
## Versioning
`schema_version` lives in `meta.json`. Bump when any row schema changes. Keep
old episodes untouched; loaders dispatch on version.
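
Version dispatch could look something like the sketch below: read `schema_version` from `meta.json` and hand off to a per-version loader. The `LOADERS` table, `_load_v1`, and `load_episode` are hypothetical names, and what the v1 loader returns is an illustrative choice only.

```python
import json
from pathlib import Path


def _load_v1(episode_dir: Path, meta: dict) -> dict:
    """Hypothetical v1 loader: returns meta plus parsed label rows, if present."""
    labels_path = episode_dir / "labels.jsonl"
    labels = []
    if labels_path.is_file():
        with labels_path.open(encoding="utf-8") as fh:
            labels = [json.loads(line) for line in fh if line.strip()]
    return {"meta": meta, "labels": labels}


# One entry per schema_version; old loaders stay in place when a new one is added.
LOADERS = {1: _load_v1}


def load_episode(episode_dir: Path) -> dict:
    """Dispatch to the loader matching the episode's schema_version."""
    meta = json.loads((episode_dir / "meta.json").read_text(encoding="utf-8"))
    version = meta["schema_version"]
    try:
        loader = LOADERS[version]
    except KeyError:
        raise ValueError(f"unsupported schema_version: {version!r}") from None
    return loader(episode_dir, meta)
```

Failing loudly on an unknown version keeps a newer episode from being silently misread by an older loader.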
## Ingest later
When we move to a database (Timescale most likely), each `telemetry-*.jsonl`
becomes one hypertable, partitioned by `t_wall_ns` and indexed on
`(episode_id, source)`. Rows do not carry `episode_id` today, so the ingest step
has to add it from the episode directory name or `meta.json`. The
`available_in_deployment` flag becomes a column we filter on when materializing
the realistic-model training view.