CIS490/collectors
Maximus Gorog fa1574a0a6 Scaffold project: docs, repo skeleton, transport + deploy design
Lays down the design surface for the CIS490 behavioral-malware-detection
dataset and model. No code yet — schema and topology are decided first so
collection can start without rework.

Docs:
- README: project goal, navigation
- architecture: lab topology, KVM choice, episode state machine,
  deployment-mirror reasoning
- threat-model: train/serve parity rule, oracle-vs-deployable feature
  split, two-model evaluation strategy
- data-model: per-episode JSONL layout, row schemas, phase enum
- transport: WG-native shipper/receiver design, idempotent uploads
- deploy: one-command install for lab-host and receiver roles
- lab-setup: KVM prereqs, VM build, snapshot, virtio-serial wiring

Skeleton: orchestrator/, collectors/, vm/, exploits/, samples/,
training/ (each with a short README explaining purpose).
Extended .gitignore to exclude qcow2 images, pcaps, sample binaries,
secrets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:21:00 -06:00

collectors/

One module per telemetry source. All collectors:

  • Receive an episode_id, an output directory, and a shared t_mono_origin_ns.
  • Write JSONL into data/episodes/<episode_id>/telemetry-<name>.jsonl.
  • Stamp every row with the same t_mono_ns / t_wall_ns clock pair.
  • Stamp every row with source and available_in_deployment (true/false).
  • Exit cleanly on SIGTERM from the orchestrator.
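
The contract above can be sketched as a small shared helper. This is an illustrative sketch, not the project's implementation: `telemetry_path`, `RowStamper`, and `run_collector` are hypothetical names, and it assumes that `t_mono_ns` is measured relative to the shared `t_mono_origin_ns`. The authoritative row schema is in docs/data-model.md.

```python
import json
import signal
import time
from pathlib import Path


def telemetry_path(data_root, episode_id, name):
    """Build data/episodes/<episode_id>/telemetry-<name>.jsonl under data_root."""
    return Path(data_root) / "episodes" / episode_id / f"telemetry-{name}.jsonl"


class RowStamper:
    """Adds the shared clock pair and provenance fields to every row."""

    def __init__(self, source, available_in_deployment, t_mono_origin_ns):
        self.source = source
        self.available_in_deployment = available_in_deployment
        self.t_mono_origin_ns = t_mono_origin_ns

    def stamp(self, payload):
        row = dict(payload)
        # Stamped fields are applied last so a payload cannot overwrite them.
        row.update(
            {
                # Assumption: monotonic time is reported relative to the origin.
                "t_mono_ns": time.monotonic_ns() - self.t_mono_origin_ns,
                "t_wall_ns": time.time_ns(),
                "source": self.source,
                "available_in_deployment": self.available_in_deployment,
            }
        )
        return row


def run_collector(stamper, sample, out, interval_s=1.0, max_rows=None):
    """Write stamped JSONL rows until SIGTERM arrives or max_rows is reached."""
    stop = {"requested": False}

    def on_sigterm(signum, frame):
        # Only set a flag; the loop finishes the current row and then exits.
        stop["requested"] = True

    previous = signal.signal(signal.SIGTERM, on_sigterm)
    written = 0
    try:
        while not stop["requested"] and (max_rows is None or written < max_rows):
            out.write(json.dumps(stamper.stamp(sample())) + "\n")
            out.flush()
            written += 1
            time.sleep(interval_s)
    finally:
        # Restore whatever handler was installed before the collector ran.
        signal.signal(signal.SIGTERM, previous)
    return written
```

Handling SIGTERM by setting a flag, rather than exiting inside the handler, means the last row is always written and flushed whole, so the JSONL file never ends in a truncated line.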

| Module | Source | Vantage | Role |
| --- | --- | --- | --- |
| proc_qemu.py | host /proc/<qemu_pid>/{stat,io,status,schedstat} | outside guest | oracle |
| qmp.py | QEMU QMP query-stats, query-blockstats, netdev | outside guest | oracle |
| perf_qemu.py | perf stat -p <qemu_pid> | outside guest | oracle |
| pcap.py | tcpdump -i br-malware, bucketed | gateway-side | feature |
| guest_agent.py | virtio-serial reader, parses agent JSONL | inside guest | feature |
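
As an illustration of a host-side oracle source, the following sketch parses /proc/<qemu_pid>/io, which the kernel exposes as `key: value` lines of integer counters. The function names `parse_proc_io` and `read_proc_io` are hypothetical, and which counters end up in a row is decided by the schema in docs/data-model.md, not by this sketch.

```python
def parse_proc_io(text):
    """Parse the `key: value` lines of /proc/<pid>/io into integer counters."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            # Skip anything that is not a key/value line.
            continue
        counters[key.strip()] = int(value.strip())
    return counters


def read_proc_io(pid):
    """Read and parse /proc/<pid>/io; reading it needs sufficient privileges."""
    with open(f"/proc/{pid}/io", encoding="ascii") as fh:
        return parse_proc_io(fh.read())
```

Keeping the parser separate from the file read lets it be tested against captured text without a running QEMU process.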

The in-guest agent itself (a small Python+psutil program that runs on the guest and writes to /dev/virtio-ports/cis490.guest.agent) lives under vm/guest-agent/ because it is shipped into the guest at image-build time.
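
On the host side, the reader has to turn the byte stream from the virtio-serial channel back into rows. A minimal sketch of that step follows, assuming the agent emits one JSON object per line; `JsonlDecoder` is a hypothetical name, and the `cpu` field in the usage example is made up for illustration.

```python
import json


class JsonlDecoder:
    """Reassemble newline-delimited JSON objects from arbitrary byte chunks."""

    def __init__(self):
        self._buffer = b""
        self.malformed = 0  # count of complete lines that were not valid JSON

    def feed(self, chunk):
        """Add a chunk and return the rows completed by it."""
        self._buffer += chunk
        # Everything before the last newline is complete; the tail is kept
        # in the buffer until the rest of that line arrives.
        *complete, self._buffer = self._buffer.split(b"\n")
        rows = []
        for raw in complete:
            if not raw.strip():
                continue
            try:
                rows.append(json.loads(raw))
            except ValueError:
                self.malformed += 1
        return rows
```

For example, a row split across two reads is returned only once its closing newline arrives:

```python
decoder = JsonlDecoder()
decoder.feed(b'{"cpu": 1}\n{"cp')   # returns [{"cpu": 1}]
decoder.feed(b'u": 2}\n')           # returns [{"cpu": 2}]
```

Counting malformed lines instead of raising keeps one corrupted write from the guest from ending the whole episode's collection.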

See docs/data-model.md for row schemas.