CIS490/docs/data-model.md
Maximus Gorog fa1574a0a6 Scaffold project: docs, repo skeleton, transport + deploy design
Lays down the design surface for the CIS490 behavioral-malware-detection
dataset and model. No code yet — schema and topology are decided first so
collection can start without rework.

Docs:
- README: project goal, navigation
- architecture: lab topology, KVM choice, episode state machine,
  deployment-mirror reasoning
- threat-model: train/serve parity rule, oracle-vs-deployable feature
  split, two-model evaluation strategy
- data-model: per-episode JSONL layout, row schemas, phase enum
- transport: WG-native shipper/receiver design, idempotent uploads
- deploy: one-command install for lab-host and receiver roles
- lab-setup: KVM prereqs, VM build, snapshot, virtio-serial wiring

Skeleton: orchestrator/, collectors/, vm/, exploits/, samples/,
training/ (each with a short README explaining purpose).
Extended .gitignore to exclude qcow2 images, pcaps, sample binaries,
secrets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:21:00 -06:00


Data Model

JSONL only, no database, schema-last. Each episode is a self-contained directory.

Per-episode layout

data/episodes/<episode_id>/
  meta.json               # single object, written at start; result summary added at end
  events.jsonl            # orchestrator actions, one row per event
  labels.jsonl            # phase transitions, one row per transition
  telemetry-proc.jsonl    # source 1 (oracle)        host /proc/<qemu_pid>
  telemetry-qmp.jsonl     # source 2 (oracle)        QEMU QMP queries
  telemetry-perf.jsonl    # source 3 (oracle)        perf stat -p <qemu_pid>
  telemetry-guest.jsonl   # source 5 (feature)       in-guest agent over virtio-serial
  network.pcap            # source 4 raw             tcpdump -i br-malware
  netflow.jsonl           # source 4 bucketed        100ms aggregations of pcap
  stderr.log              # raw qemu + agent logs

<episode_id> is a ULID — sortable by time, unique without coordination, URL-safe.
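The layout and the ULID naming rule can be checked mechanically. The following is a minimal sketch, not the project's actual tooling: the helper names (`is_valid_ulid`, `ulid_timestamp_ms`, `episode_dir_problems`) are hypothetical, and it assumes the canonical uppercase ULID encoding (26 Crockford base32 characters, the first 10 holding a millisecond timestamp).

```python
from pathlib import Path

# Crockford base32 alphabet used by ULIDs (no I, L, O, U).
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

# File names copied from the per-episode layout above.
EXPECTED_FILES = (
    "meta.json", "events.jsonl", "labels.jsonl",
    "telemetry-proc.jsonl", "telemetry-qmp.jsonl", "telemetry-perf.jsonl",
    "telemetry-guest.jsonl", "network.pcap", "netflow.jsonl", "stderr.log",
)


def is_valid_ulid(text):
    """Return True if text is a canonical (uppercase) 26-character ULID."""
    return (
        len(text) == 26
        and text[0] in "01234567"  # 128 bits only fit if the first char is <= 7
        and all(ch in CROCKFORD for ch in text)
    )


def ulid_timestamp_ms(ulid):
    """Decode the millisecond timestamp stored in the first 10 characters."""
    if not is_valid_ulid(ulid):
        raise ValueError(f"not a ULID: {ulid!r}")
    ms = 0
    for ch in ulid[:10]:
        ms = ms * 32 + CROCKFORD.index(ch)
    return ms


def episode_dir_problems(episode_dir):
    """List naming and missing-file problems for one episode directory."""
    root = Path(episode_dir)
    problems = []
    if not is_valid_ulid(root.name):
        problems.append(f"directory name is not a ULID: {root.name}")
    problems += [
        f"missing file: {name}"
        for name in EXPECTED_FILES
        if not (root / name).is_file()
    ]
    return problems
```

Because the timestamp occupies the leading characters, sorting episode directory names lexicographically also sorts them by creation time.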

Common fields on every telemetry row

| Field | Type | Notes |
|---|---|---|
| t_mono_ns | int | host CLOCK_MONOTONIC at sample time, episode-relative origin |
| t_wall_ns | int | host wall clock, ns since epoch |
| source | string | one of host_proc, host_qmp, host_perf, bridge_pcap, guest_agent |
| available_in_deployment | bool | true = feature, false = oracle |
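A small validator for these common fields might look like the sketch below. The function name `common_field_errors` is hypothetical, and the source-to-flag mapping is an assumption read off the example rows in this document (host_* sources are oracle, bridge_pcap and guest_agent are features).

```python
# Assumed mapping, inferred from the example rows in this document.
SOURCE_IS_FEATURE = {
    "host_proc": False,
    "host_qmp": False,
    "host_perf": False,
    "bridge_pcap": True,
    "guest_agent": True,
}


def common_field_errors(row):
    """Return a list of problems with the common fields of one telemetry row."""
    errors = []
    for name in ("t_mono_ns", "t_wall_ns"):
        value = row.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool) or value < 0:
            errors.append(f"{name} must be a non-negative int")
    source = row.get("source")
    if source not in SOURCE_IS_FEATURE:
        errors.append(f"unknown source: {source!r}")
    elif row.get("available_in_deployment") is not SOURCE_IS_FEATURE[source]:
        errors.append("available_in_deployment does not match source")
    return errors
```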

The available_in_deployment flag is denormalized onto every row so downstream loaders don't have to look up a separate manifest to filter for the realistic model.
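Because the flag sits on every row, filtering needs nothing beyond the row itself. A minimal sketch of such a loader (the name `deployable_rows` is hypothetical):

```python
import json


def deployable_rows(lines):
    """Yield only the rows that a deployed detector could actually observe."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in a JSONL file
        row = json.loads(line)
        if row.get("available_in_deployment") is True:
            yield row
```

Any iterable of JSONL lines works as input, for example an open file handle on telemetry-guest.jsonl.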

meta.json schema

{
  "episode_id": "01HW9GZJ7K8QF5W3X2Y6N1A4B0",
  "schema_version": 1,
  "started_at_wall": "2026-04-28T22:30:00Z",
  "ended_at_wall": "2026-04-28T22:31:42Z",
  "git_commit": "<sha>",
  "host_fingerprint": {
    "kernel": "6.18.8",
    "qemu_version": "9.0.0",
    "cpu_model": "...",
    "smt_off": true
  },
  "vm": {
    "image_name": "metasploitable2",
    "image_sha256": "...",
    "vcpus": 1,
    "ram_mib": 512,
    "cgroup_cpu_cap": "800ms/1s",
    "snapshot_name": "baseline-v1"
  },
  "exploit": {
    "framework": "metasploit",
    "module": "exploit/multi/samba/usermap_script",
    "rport": 445,
    "rhost": "10.200.0.10"
  },
  "sample": {
    "name": "linux.miner.xmrig.elf",
    "sha256": "...",
    "source": "MalwareBazaar",
    "first_seen": "2024-...",
    "category": "miner"
  },
  "schedule": {
    "baseline_seconds": 30,
    "infected_seconds": 90,
    "dormant_seconds": 60
  },
  "result": {
    "phases_observed": ["clean","armed","infecting","infected_running","dormant"],
    "exploit_succeeded": true,
    "sample_executed": true,
    "snapshot_revert_ok": true
  }
}
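As one example of consuming meta.json, the wall-clock duration of an episode can be derived from the two timestamps. This is a sketch with hypothetical helper names; it assumes the timestamps are always ISO 8601 in UTC with a trailing Z, as in the example above.

```python
from datetime import datetime


def parse_wall(ts):
    """Parse an ISO 8601 UTC timestamp such as 2026-04-28T22:30:00Z."""
    if ts.endswith("Z"):
        # Older Python versions do not accept a bare Z in fromisoformat.
        ts = ts[:-1] + "+00:00"
    return datetime.fromisoformat(ts)


def episode_duration_seconds(meta):
    """Return ended_at_wall minus started_at_wall, in seconds."""
    delta = parse_wall(meta["ended_at_wall"]) - parse_wall(meta["started_at_wall"])
    return delta.total_seconds()
```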

events.jsonl

One row per orchestrator action. Tells you exactly what happened and when.

{"t_mono_ns": 0,            "t_wall_ns": 1745875200000000000, "event": "snapshot_load", "snapshot": "baseline-v1"}
{"t_mono_ns": 30100000000,  "t_wall_ns": 1745875230100000000, "event": "exploit_fire", "module": "exploit/multi/samba/usermap_script"}
{"t_mono_ns": 31250000000,  "t_wall_ns": 1745875231250000000, "event": "session_open", "session_id": 1}
{"t_mono_ns": 31300000000,  "t_wall_ns": 1745875231300000000, "event": "sample_uploaded", "path": "/tmp/.x", "sha256": "..."}
{"t_mono_ns": 31400000000,  "t_wall_ns": 1745875231400000000, "event": "sample_executed", "pid_reported_by_guest": 1042}
{"t_mono_ns": 121400000000, "t_wall_ns": 1745875321400000000, "event": "snapshot_revert", "snapshot": "baseline-v1"}
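Since every event carries t_mono_ns, intervals between orchestrator actions fall out of a simple lookup. A minimal sketch with hypothetical helper names, assuming each event name of interest occurs at most once per episode:

```python
def first_event_time(events, name):
    """Return t_mono_ns of the first event with the given name, or None."""
    for event in events:
        if event["event"] == name:
            return event["t_mono_ns"]
    return None


def latency_ns(events, start_name, end_name):
    """Nanoseconds between two named events, or None if either is missing."""
    start = first_event_time(events, start_name)
    end = first_event_time(events, end_name)
    if start is None or end is None:
        return None
    return end - start
```

With the example rows above, `latency_ns(events, "exploit_fire", "session_open")` gives 1150000000 ns, i.e. 1.15 s from firing the exploit to an open session.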

labels.jsonl

{"t_mono_ns": 0,            "phase": "clean",           "prev": null,                "reason": "snapshot_loaded"}
{"t_mono_ns": 30100000000,  "phase": "armed",           "prev": "clean",             "reason": "exploit_module_running"}
{"t_mono_ns": 31250000000,  "phase": "infecting",       "prev": "armed",             "reason": "session_open"}
{"t_mono_ns": 31400000000,  "phase": "infected_running","prev": "infecting",         "reason": "sample_executed"}
{"t_mono_ns": 91400000000,  "phase": "dormant",         "prev": "infected_running",  "reason": "scheduler_transition"}
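Labels are stored as transitions, so a telemetry row gets its phase by finding the most recent transition at or before its timestamp. A sketch of that lookup, assuming the label rows are sorted by t_mono_ns (the function name `phase_at` is hypothetical):

```python
from bisect import bisect_right


def phase_at(labels, t_mono_ns):
    """Return the phase in effect at t_mono_ns, or None before the first label."""
    times = [row["t_mono_ns"] for row in labels]
    index = bisect_right(times, t_mono_ns) - 1
    return labels[index]["phase"] if index >= 0 else None
```

For a whole telemetry file it would be cheaper to build `times` once and reuse it; the version here favors brevity.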

Phase enum (closed)

clean              — known-good, post-snapshot-load, pre-exploit
armed              — exploit module is running but no session yet
infecting          — session opened, sample landing/starting
infected_running   — sample is actively producing observable behavior
dormant            — sample is present but idle (sleep timer, beacon interval)
reverting          — snapshot_load triggered, episode ending
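A closed enum is easy to enforce at load time. The sketch below is one possible shape, with hypothetical names (`Phase`, `parse_labels`); it rejects unknown phase names and also checks that each row's prev matches the phase before it.

```python
from enum import Enum


class Phase(str, Enum):
    CLEAN = "clean"
    ARMED = "armed"
    INFECTING = "infecting"
    INFECTED_RUNNING = "infected_running"
    DORMANT = "dormant"
    REVERTING = "reverting"


def parse_labels(rows):
    """Validate phase names and the prev chain; return the Phase sequence."""
    phases = []
    for row in rows:
        phase = Phase(row["phase"])  # raises ValueError outside the enum
        expected_prev = phases[-1].value if phases else None
        if row["prev"] != expected_prev:
            raise ValueError(
                f"prev is {row['prev']!r}, expected {expected_prev!r}"
            )
        phases.append(phase)
    return phases
```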

telemetry-proc.jsonl (source 1, oracle)

{
  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
  "source": "host_proc", "available_in_deployment": false,
  "cpu_user_jiffies": 142, "cpu_sys_jiffies": 38,
  "rss_bytes": 542113792, "vsize_bytes": 1842933760,
  "io_read_bytes": 0, "io_write_bytes": 4096,
  "voluntary_ctxsw": 12, "involuntary_ctxsw": 3,
  "minor_faults": 412, "major_faults": 0
}
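If the jiffy counters are cumulative, as they are in /proc/<pid>/stat (an assumption here, since the collector is not written yet), a usable CPU feature comes from the difference between consecutive rows. A minimal sketch; `cpu_fraction` is a hypothetical name and `clk_tck` defaults to 100, the usual value on Linux, which `os.sysconf("SC_CLK_TCK")` would confirm on the host.

```python
def cpu_fraction(prev_row, curr_row, clk_tck=100):
    """CPU time used between two rows, as a fraction of one core."""
    jiffies = (
        (curr_row["cpu_user_jiffies"] + curr_row["cpu_sys_jiffies"])
        - (prev_row["cpu_user_jiffies"] + prev_row["cpu_sys_jiffies"])
    )
    elapsed_s = (curr_row["t_mono_ns"] - prev_row["t_mono_ns"]) / 1e9
    if elapsed_s <= 0:
        raise ValueError("rows must be in increasing t_mono_ns order")
    return jiffies / clk_tck / elapsed_s
```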

telemetry-qmp.jsonl (source 2, oracle)

{
  "t_mono_ns": 1000000000, "t_wall_ns": 1745875201000000000,
  "source": "host_qmp", "available_in_deployment": false,
  "blockstats": {"vda": {"rd_ops": 12, "wr_ops": 4, "rd_bytes": 49152, "wr_bytes": 16384}},
  "kvm_exits": {"total": 18342, "io": 942, "mmio": 12, "halt": 17000, "irq_window": 110},
  "netdev": {"net0": {"rx_packets": 0, "tx_packets": 4, "rx_bytes": 0, "tx_bytes": 256}}
}
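Nested rows like this one usually need flattening before they become model features. One possible approach, shown as a sketch (the dotted-name convention and the function name `flatten` are assumptions, not part of the schema):

```python
def flatten(row, prefix=""):
    """Flatten nested dicts into a single dict with dotted key names."""
    flat = {}
    for key, value in row.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat
```

Applied to the row above, this yields keys such as `blockstats.vda.rd_ops` and `kvm_exits.halt`. Lists are left as values, so fields like top_procs in the guest rows would need separate handling.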

telemetry-perf.jsonl (source 3, oracle)

{
  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
  "source": "host_perf", "available_in_deployment": false,
  "cycles": 184213104, "instructions": 121987001,
  "cache_references": 1041213, "cache_misses": 38104,
  "branches": 24198421, "branch_misses": 412004,
  "page_faults": 12, "context_switches": 18,
  "ipc": 0.66, "cache_miss_rate": 0.0366
}
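The two derived fields follow from the raw counters: ipc is instructions divided by cycles, and cache_miss_rate is cache_misses divided by cache_references. A sketch of recomputing them, with a hypothetical function name and the rounding precision assumed from the example row:

```python
def derived_perf(row):
    """Recompute ipc and cache_miss_rate from the raw counters in one row."""
    cycles = row["cycles"]
    refs = row["cache_references"]
    return {
        # Guard against zero denominators in idle sampling windows.
        "ipc": round(row["instructions"] / cycles, 2) if cycles else None,
        "cache_miss_rate": round(row["cache_misses"] / refs, 4) if refs else None,
    }
```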

netflow.jsonl (source 4, feature)

Bucketed from the pcap. The pcap stays raw on disk for re-derivation later.

{
  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
  "source": "bridge_pcap", "available_in_deployment": true,
  "bucket_ms": 100,
  "pkts_in": 0, "pkts_out": 0, "bytes_in": 0, "bytes_out": 0,
  "unique_dst_ips": 0, "unique_dst_ports": 0,
  "syn_count": 0, "fin_count": 0, "rst_count": 0,
  "dns_query_count": 0, "tcp_new_flows": 0
}
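The bucketing step can be sketched without any pcap library by assuming packets have already been reduced to (t_mono_ns, direction, length) tuples. That input shape, the function name `bucket_packets`, and the choice to stamp each row with the bucket's start time are all assumptions; the real netflow rows also carry flow and flag counters that are omitted here.

```python
from collections import defaultdict


def bucket_packets(packets, bucket_ms=100):
    """Aggregate (t_mono_ns, direction, length) tuples into fixed-width buckets."""
    bucket_ns = bucket_ms * 1_000_000
    totals = defaultdict(
        lambda: {"pkts_in": 0, "pkts_out": 0, "bytes_in": 0, "bytes_out": 0}
    )
    for t_mono_ns, direction, length in packets:
        if direction not in ("in", "out"):
            raise ValueError(f"unknown direction: {direction!r}")
        start = (t_mono_ns // bucket_ns) * bucket_ns  # floor to bucket start
        totals[start][f"pkts_{direction}"] += 1
        totals[start][f"bytes_{direction}"] += length
    return [
        {"t_mono_ns": start, "bucket_ms": bucket_ms, **totals[start]}
        for start in sorted(totals)
    ]
```

This version only emits buckets that saw traffic. The example row above is all zeros, which suggests the real bucketer writes empty buckets too, so a production version would fill the gaps.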

telemetry-guest.jsonl (source 5, feature)

{
  "t_mono_ns": 100000000, "t_wall_ns": 1745875200100000000,
  "source": "guest_agent", "available_in_deployment": true,
  "cpu_pct_total": 12.4, "load_1m": 0.41,
  "mem_used_bytes": 184213504, "mem_available_bytes": 354127872,
  "thermal_milli_c": 47200,
  "net": {"eth0": {"rx_bytes": 0, "tx_bytes": 256, "rx_pkts": 0, "tx_pkts": 4}},
  "top_procs": [
    {"pid": 1042, "comm": "kworker/0:1", "cpu_pct": 0.4, "rss_bytes": 1048576},
    {"pid": 1, "comm": "systemd", "cpu_pct": 0.1, "rss_bytes": 4194304}
  ],
  "listen_ports": [22, 80, 445]
}

Versioning

schema_version lives in meta.json. Bump when any row schema changes. Keep old episodes untouched; loaders dispatch on version.
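Dispatching on version can be as simple as a dictionary of loader functions. The sketch below is illustrative only: `load_v1`, `LOADERS`, and `load_episode_meta` are hypothetical names, and the fields the v1 loader picks out are an arbitrary choice.

```python
def load_v1(meta):
    """Loader for schema_version 1; extracts a few fields for illustration."""
    return {
        "episode_id": meta["episode_id"],
        "phases": meta["result"]["phases_observed"],
    }


# One entry per supported schema_version; old loaders stay registered
# so that old episodes remain readable without rewriting them.
LOADERS = {1: load_v1}


def load_episode_meta(meta):
    """Pick the loader that matches meta['schema_version']."""
    version = meta.get("schema_version")
    try:
        loader = LOADERS[version]
    except KeyError:
        raise ValueError(f"unsupported schema_version: {version!r}") from None
    return loader(meta)
```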

Ingest later

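When we move to a database (most likely TimescaleDB), each telemetry-*.jsonl becomes one hypertable, partitioned by t_wall_ns and indexed on (episode_id, source). Telemetry rows do not carry episode_id themselves, so the ingester would have to add it from the episode directory name. The available_in_deployment flag becomes a column we filter on when materializing the realistic-model training view.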
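On the ingest side, each JSONL row would need the episode ID attached before it is written to a table. A rough sketch of that reshaping step, assuming a hypothetical table layout of fixed common columns plus one JSON payload column (nothing here is a decided schema, and `to_db_record` is a made-up name):

```python
import json

COMMON_FIELDS = ("t_mono_ns", "t_wall_ns", "source", "available_in_deployment")


def to_db_record(episode_id, row):
    """Split one telemetry row into common columns plus a JSON payload."""
    payload = {k: v for k, v in row.items() if k not in COMMON_FIELDS}
    return (
        episode_id,                      # taken from the directory name
        row["t_wall_ns"],                # candidate partitioning column
        row["t_mono_ns"],
        row["source"],
        row["available_in_deployment"],  # filter column for the training view
        json.dumps(payload, sort_keys=True),
    )
```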