# Data Model

JSONL only, no database, schema-last. Each episode is a self-contained directory.

## Per-episode layout

```
data/episodes/<episode_id>/
  meta.json              # written once at start; updated at end with summary
  events.jsonl           # orchestrator actions, one row per event
  labels.jsonl           # phase transitions, one row per transition
  telemetry-proc.jsonl   # source 1 (oracle)  host /proc/<pid>/
  telemetry-qmp.jsonl    # source 2 (oracle)  QEMU QMP queries
  telemetry-perf.jsonl   # source 3 (oracle)  perf stat -p <pid>
  telemetry-guest.jsonl  # source 5 (feature) in-guest agent over virtio-serial
  network.pcap           # source 4 (feature) raw tcpdump -i br-malware
  netflow.jsonl          # source 4 (feature) bucketed 100ms aggregations of pcap
  stderr.log             # raw qemu + agent logs
```

`<episode_id>` is a [ULID](https://github.com/ulid/spec): sortable by time, unique without coordination, URL-safe.

## Common fields on every telemetry row

| Field | Type | Notes |
|---|---|---|
| `t_mono_ns` | int | host `CLOCK_MONOTONIC` at sample time, ns since the episode origin |
| `t_wall_ns` | int | host wall clock, ns since epoch |
| `source` | string | one of `host_proc`, `host_qmp`, `host_perf`, `bridge_pcap`, `guest_agent` |
| `available_in_deployment` | bool | **true = feature, false = oracle** |

The `available_in_deployment` flag is denormalized onto every row so downstream loaders can select the inputs for the realistic model (features only, no oracle sources) without looking up a separate manifest.
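A minimal sketch of the filter this flag enables, assuming a loader that reads telemetry rows line by line. The helper names `parse_jsonl` and `deployable_rows` are hypothetical and not part of any existing loader; the sample rows are trimmed to the common fields.

```python
import json
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-blank JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def deployable_rows(lines: Iterable[str]) -> list:
    """Keep only feature rows, i.e. those with available_in_deployment == true."""
    return [
        row for row in parse_jsonl(lines)
        if row.get("available_in_deployment") is True
    ]


# Two illustrative rows: one oracle source, one feature source.
sample = [
    '{"t_mono_ns": 100000000, "source": "host_proc", "available_in_deployment": false}',
    '{"t_mono_ns": 100000000, "source": "guest_agent", "available_in_deployment": true}',
]
features = deployable_rows(sample)
feature_sources = [row["source"] for row in features]
```

Because the function takes any iterable of strings, an open file handle on a `telemetry-*.jsonl` file can be passed in directly.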
## meta.json schema

```json
{
  "episode_id": "01HW9GZJ7K8QF5W3X2Y6N1A4B0",
  "schema_version": 1,
  "started_at_wall": "2026-04-28T22:30:00Z",
  "ended_at_wall": "2026-04-28T22:31:42Z",
  "git_commit": "",
  "host_fingerprint": {
    "kernel": "6.18.8",
    "qemu_version": "9.0.0",
    "cpu_model": "...",
    "smt_off": true
  },
  "vm": {
    "image_name": "metasploitable2",
    "image_sha256": "...",
    "vcpus": 1,
    "ram_mib": 512,
    "cgroup_cpu_cap": "800ms/1s",
    "snapshot_name": "baseline-v1"
  },
  "exploit": {
    "framework": "metasploit",
    "module": "exploit/multi/samba/usermap_script",
    "rport": 445,
    "rhost": "10.200.0.10"
  },
  "sample": {
    "name": "linux.miner.xmrig.elf",
    "sha256": "...",
    "source": "MalwareBazaar",
    "first_seen": "2024-...",
    "category": "miner"
  },
  "schedule": {
    "baseline_seconds": 30,
    "infected_seconds": 90,
    "dormant_seconds": 60
  },
  "result": {
    "phases_observed": ["clean", "armed", "infecting", "infected_running", "dormant"],
    "exploit_succeeded": true,
    "sample_executed": true,
    "snapshot_revert_ok": true
  }
}
```

## events.jsonl

One row per orchestrator action, recording exactly what happened and when.
```json
{"t_mono_ns": 0, "t_wall_ns": 1745875200000000000, "event": "snapshot_load", "snapshot": "baseline-v1"}
{"t_mono_ns": 30100000000, "t_wall_ns": 1745875230100000000, "event": "exploit_fire", "module": "exploit/multi/samba/usermap_script"}
{"t_mono_ns": 31250000000, "t_wall_ns": 1745875231250000000, "event": "session_open", "session_id": 1}
{"t_mono_ns": 31300000000, "t_wall_ns": 1745875231300000000, "event": "sample_uploaded", "path": "/tmp/.x", "sha256": "..."}
{"t_mono_ns": 31400000000, "t_wall_ns": 1745875231400000000, "event": "sample_executed", "pid_reported_by_guest": 1042}
{"t_mono_ns": 121400000000, "t_wall_ns": 1745875321400000000, "event": "snapshot_revert", "snapshot": "baseline-v1"}
```

## labels.jsonl

```json
{"t_mono_ns": 0, "phase": "clean", "prev": null, "reason": "snapshot_loaded"}
{"t_mono_ns": 30100000000, "phase": "armed", "prev": "clean", "reason": "exploit_module_running"}
{"t_mono_ns": 31250000000, "phase": "infecting", "prev": "armed", "reason": "session_open"}
{"t_mono_ns": 31400000000, "phase": "infected_running", "prev": "infecting", "reason": "sample_executed"}
{"t_mono_ns": 91400000000, "phase": "dormant", "prev": "infected_running", "reason": "scheduler_transition"}
```

### Phase enum (closed)

```
clean             - known-good, post-snapshot-load, pre-exploit
armed             - exploit module is running but no session yet
infecting         - session opened, sample landing/starting
infected_running  - sample is actively producing observable behavior
dormant           - sample is present but idle (sleep timer, beacon interval)
reverting         - snapshot_revert triggered, episode ending
```

## telemetry-proc.jsonl (source 1, oracle)

```json
{
  "t_mono_ns": 100000000,
  "t_wall_ns": 1745875200100000000,
  "source": "host_proc",
  "available_in_deployment": false,
  "cpu_user_jiffies": 142,
  "cpu_sys_jiffies": 38,
  "rss_bytes": 542113792,
  "vsize_bytes": 1842933760,
  "io_read_bytes": 0,
  "io_write_bytes": 4096,
  "voluntary_ctxsw": 12,
  "involuntary_ctxsw": 3,
  "minor_faults": 412,
  "major_faults": 0
}
```

## telemetry-qmp.jsonl (source 2, oracle)

```json
{
  "t_mono_ns": 1000000000,
  "t_wall_ns": 1745875201000000000,
  "source": "host_qmp",
  "available_in_deployment": false,
  "blockstats": {"vda": {"rd_ops": 12, "wr_ops": 4, "rd_bytes": 49152, "wr_bytes": 16384}},
  "kvm_exits": {"total": 18342, "io": 942, "mmio": 12, "halt": 17000, "irq_window": 110},
  "netdev": {"net0": {"rx_packets": 0, "tx_packets": 4, "rx_bytes": 0, "tx_bytes": 256}}
}
```

## telemetry-perf.jsonl (source 3, oracle)

```json
{
  "t_mono_ns": 100000000,
  "t_wall_ns": 1745875200100000000,
  "source": "host_perf",
  "available_in_deployment": false,
  "cycles": 184213104,
  "instructions": 121987001,
  "cache_references": 1041213,
  "cache_misses": 38104,
  "branches": 24198421,
  "branch_misses": 412004,
  "page_faults": 12,
  "context_switches": 18,
  "ipc": 0.66,
  "cache_miss_rate": 0.0366
}
```

## netflow.jsonl (source 4, feature)

Bucketed from the pcap. The pcap stays raw on disk so the buckets can be re-derived later.

```json
{
  "t_mono_ns": 100000000,
  "t_wall_ns": 1745875200100000000,
  "source": "bridge_pcap",
  "available_in_deployment": true,
  "bucket_ms": 100,
  "pkts_in": 0,
  "pkts_out": 0,
  "bytes_in": 0,
  "bytes_out": 0,
  "unique_dst_ips": 0,
  "unique_dst_ports": 0,
  "syn_count": 0,
  "fin_count": 0,
  "rst_count": 0,
  "dns_query_count": 0,
  "tcp_new_flows": 0
}
```

## telemetry-guest.jsonl (source 5, feature)

```json
{
  "t_mono_ns": 100000000,
  "t_wall_ns": 1745875200100000000,
  "source": "guest_agent",
  "available_in_deployment": true,
  "cpu_pct_total": 12.4,
  "load_1m": 0.41,
  "mem_used_bytes": 184213504,
  "mem_available_bytes": 354127872,
  "thermal_milli_c": 47200,
  "net": {"eth0": {"rx_bytes": 0, "tx_bytes": 256, "rx_pkts": 0, "tx_pkts": 4}},
  "top_procs": [
    {"pid": 1042, "comm": "kworker/0:1", "cpu_pct": 0.4, "rss_bytes": 1048576},
    {"pid": 1, "comm": "systemd", "cpu_pct": 0.1, "rss_bytes": 4194304}
  ],
  "listen_ports": [22, 80, 445]
}
```

## Versioning

`schema_version` lives in `meta.json`. Bump it when any row schema changes.
Keep old episodes untouched; loaders dispatch on version.

## Ingest later

When we move to a database (Timescale most likely), each `telemetry-*.jsonl` becomes one hypertable, partitioned by `t_wall_ns` and indexed on `(episode_id, source)`. Telemetry rows do not carry `episode_id` on disk, so ingest would need to stamp it from `meta.json` or the directory name. The `available_in_deployment` flag becomes a column we filter on when materializing the realistic-model training view.
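Until a database exists, a training view of this kind can be approximated in memory. The sketch below keeps feature rows and attaches the phase from `labels.jsonl` to each one. It assumes that a phase holds from its label's `t_mono_ns` until the next transition and that label rows are sorted by time. The function names `phase_at` and `training_view` are hypothetical.

```python
from bisect import bisect_right
from typing import Optional


def phase_at(labels: list, t_mono_ns: int) -> Optional[str]:
    """Return the phase in effect at t_mono_ns, or None before the first label.

    Assumes labels are sorted ascending by t_mono_ns.
    """
    starts = [label["t_mono_ns"] for label in labels]
    idx = bisect_right(starts, t_mono_ns) - 1  # last label at or before t
    return labels[idx]["phase"] if idx >= 0 else None


def training_view(rows: list, labels: list) -> list:
    """Drop oracle rows and add a 'phase' key to each remaining feature row."""
    return [
        {**row, "phase": phase_at(labels, row["t_mono_ns"])}
        for row in rows
        if row["available_in_deployment"]
    ]


# Illustrative data, trimmed to the fields the join needs.
labels = [
    {"t_mono_ns": 0, "phase": "clean"},
    {"t_mono_ns": 30100000000, "phase": "armed"},
    {"t_mono_ns": 31250000000, "phase": "infecting"},
]
rows = [
    {"t_mono_ns": 100000000, "source": "guest_agent", "available_in_deployment": True},
    {"t_mono_ns": 30100000000, "source": "bridge_pcap", "available_in_deployment": True},
    {"t_mono_ns": 31000000000, "source": "host_proc", "available_in_deployment": False},
]
view = training_view(rows, labels)
```

A row whose timestamp equals a transition time takes the new phase, which is a design choice of this sketch rather than something the schema specifies.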