CIS490/collectors
Maximus Gorog 064387b7a0 Add v0 orchestrator + first oracle collector (host /proc)
End-to-end: ``python -m orchestrator --target-pid <pid> --duration N`` now
writes a complete episode directory matching docs/data-model.md, with phase
labels, events, and a 10 Hz host /proc telemetry stream. No VM yet — pid is
arbitrary so we can validate the loop against e.g. ``sleep 5`` while the lab
side comes up.

collectors/proc_qemu.py — parses /proc/<pid>/{stat,io,status} (handles parens
in comm), single-shot collect_once(), and a stop-event-driven run_loop()
that ticks at a fixed cadence and exits when the pid disappears. Tagged
``available_in_deployment: false`` per the threat-model doc.
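
A minimal sketch of the parens-in-comm handling described above, not the actual proc_qemu.py code. The function name ``parse_stat`` and the choice of returned fields are illustrative assumptions; the field positions follow proc(5).

```python
def parse_stat(text: str) -> dict:
    """Parse the contents of /proc/<pid>/stat into a few named fields.

    Hypothetical helper for illustration; the real collector may expose
    different names and more fields.
    """
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the LAST ')' instead of splitting on spaces.
    lpar = text.index("(")
    rpar = text.rindex(")")
    pid = int(text[:lpar].strip())
    comm = text[lpar + 1:rpar]
    rest = text[rpar + 1:].split()

    # rest[0] is field 3 (state) in proc(5) numbering, so field N is rest[N - 3].
    return {
        "pid": pid,
        "comm": comm,
        "state": rest[3 - 3],
        "utime_ticks": int(rest[14 - 3]),
        "stime_ticks": int(rest[15 - 3]),
        "num_threads": int(rest[20 - 3]),
        "rss_pages": int(rest[24 - 3]),
    }
```

Splitting the whole line on whitespace would shift every later field as soon as comm contains a space, which is why the last closing parenthesis is used as the anchor.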

orchestrator/episode.py — EpisodeRunner: creates data/episodes/<ulid>/,
atomic meta.json, events.jsonl + labels.jsonl writers, drives the collector
in a thread for duration_s, writes done.marker last so the shipper never
sees a half-finished episode.
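
The ordering guarantee above (atomic meta.json, done.marker last) could look roughly like the sketch below. ``write_json_atomic`` and ``finalize_episode`` are hypothetical names, not the EpisodeRunner API, and the temp-file-plus-rename approach is an assumption about how the atomic write is done.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    # The temp file lives in the target directory so os.replace stays on one
    # filesystem, where the rename is atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def finalize_episode(episode_dir: Path, meta: dict) -> None:
    """Finalize meta.json, then create done.marker as the very last step."""
    write_json_atomic(episode_dir / "meta.json", meta)
    (episode_dir / "done.marker").touch()
```

A shipper that only picks up directories containing done.marker will then never observe a partially written meta.json.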

orchestrator/ulid.py — tiny 26-char Crockford-base32 ULID generator.
Time-sortable, no third-party dep.
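
For readers unfamiliar with the format, here is a sketch of a stdlib-only ULID generator: a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters. It is not the code in orchestrator/ulid.py; ``new_ulid`` and its ``now_ms`` parameter are assumed names, and this version does not enforce ordering between IDs created within the same millisecond.

```python
import os
import time
from typing import Optional

# Crockford base32 alphabet: digits plus uppercase letters without I, L, O, U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(now_ms: Optional[int] = None) -> str:
    """Return a 26-character ULID string (hypothetical helper)."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    # Timestamp in the high 48 bits so lexicographic order follows time.
    value = ((now_ms & ((1 << 48) - 1)) << 80) | rand

    chars = []
    for _ in range(26):  # 26 * 5 = 130 bits, enough for the 128-bit value
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))
```

Because the alphabet is in ascending ASCII order, plain string comparison sorts IDs by creation time at millisecond granularity.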

orchestrator/__main__.py — CLI entry point.

Tests (15 new, 28 total green):
- proc_qemu: real-ish stat with parens-in-comm, missing /proc/<pid>/io,
  missing pid, run_loop cadence, run_loop terminates when pid disappears.
- episode: full directory shape against os.getpid(), id override,
  done.marker written after meta.json finalize.
- ulid: length+alphabet, 2000-burst uniqueness, time-sortability.

Smoke-tested against ``sleep 10``: 16 rows over 1.5s at 100ms cadence,
monotonic clock, RSS stable at ~3.5 MiB as expected for an idle sleep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:40:25 -06:00
__init__.py Add v0 orchestrator + first oracle collector (host /proc) 2026-04-28 23:40:25 -06:00
proc_qemu.py Add v0 orchestrator + first oracle collector (host /proc) 2026-04-28 23:40:25 -06:00
README.md Scaffold project: docs, repo skeleton, transport + deploy design 2026-04-28 23:21:00 -06:00

collectors/

One module per telemetry source. All collectors:

  • Receive an episode_id, an output directory, and a shared t_mono_origin_ns.
  • Write JSONL into data/episodes/<episode_id>/telemetry-<name>.jsonl.
  • Stamp every row with the same t_mono_ns / t_wall_ns clock pair.
  • Stamp every row with source and available_in_deployment (true/false).
  • Exit cleanly on SIGTERM from the orchestrator.

| Module | Source | Vantage | Role |
| --- | --- | --- | --- |
| proc_qemu.py | host /proc/<qemu_pid>/{stat,io,status,schedstat} | outside guest | oracle |
| qmp.py | QEMU QMP query-stats, query-blockstats, netdev | outside guest | oracle |
| perf_qemu.py | perf stat -p <qemu_pid> | outside guest | oracle |
| pcap.py | tcpdump -i br-malware, bucketed | gateway-side | feature |
| guest_agent.py | virtio-serial reader, parses agent JSONL | inside guest | feature |
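
The shared collector contract listed above can be sketched as a pair of small helpers. This is an illustration only: ``make_row`` and ``append_row`` are hypothetical names, the assumption that t_mono_ns is an offset from t_mono_origin_ns is not confirmed here, and the authoritative row schemas are in docs/data-model.md.

```python
import json
import time
from pathlib import Path


def make_row(source: str, available_in_deployment: bool,
             t_mono_origin_ns: int, fields: dict) -> dict:
    """Build one telemetry row carrying the common stamps (hypothetical helper)."""
    return {
        # Assumption: monotonic time is reported relative to the shared origin.
        "t_mono_ns": time.monotonic_ns() - t_mono_origin_ns,
        "t_wall_ns": time.time_ns(),
        "source": source,
        "available_in_deployment": available_in_deployment,
        **fields,
    }


def append_row(episode_dir: Path, name: str, row: dict) -> Path:
    """Append one JSON object per line to telemetry-<name>.jsonl."""
    path = episode_dir / f"telemetry-{name}.jsonl"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")
    return path
```

A collector would call ``make_row`` once per tick and pass the result to ``append_row`` with its own name, for example ``"proc_qemu"``.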

The in-guest agent itself (a small Python+psutil program that runs on the guest and writes to /dev/virtio-ports/cis490.guest.agent) lives under vm/guest-agent/ because it is shipped into the guest at image-build time.

See docs/data-model.md for row schemas.