Lays down the design surface for the CIS490 behavioral-malware-detection dataset and model. No code yet — schema and topology are decided first so collection can start without rework. Docs: - README: project goal, navigation - architecture: lab topology, KVM choice, episode state machine, deployment-mirror reasoning - threat-model: train/serve parity rule, oracle-vs-deployable feature split, two-model evaluation strategy - data-model: per-episode JSONL layout, row schemas, phase enum - transport: WG-native shipper/receiver design, idempotent uploads - deploy: one-command install for lab-host and receiver roles - lab-setup: KVM prereqs, VM build, snapshot, virtio-serial wiring Skeleton: orchestrator/, collectors/, vm/, exploits/, samples/, training/ (each with a short README explaining purpose). Extended .gitignore to exclude qcow2 images, pcaps, sample binaries, secrets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5.1 KiB
Transport — Centralized Episode Collection over WG
The dataset lives wherever it is convenient to train from. In our setup that is
the Pi5 (or whatever the team designates as the central collector), reachable
over the WG overlay at <receiver-host>.wg. This document describes how
episodes get from a lab host to the central collector.
Design goals
- Easy to deploy. One config file, one systemd unit per side. No DB required to start collecting.
- WG-native. Sender and receiver both live on the WG overlay; transport is just HTTPS over WG. We use the existing wg-pki CA for mTLS.
- Idempotent. Re-shipping the same episode is safe and cheap; the receiver responds 200 if the bytes already match.
- Crash-safe. Lab host crash mid-episode does not corrupt the central store. Receiver crash mid-upload leaves no partial visible.
- Schema-free. The receiver does not parse JSONL; it stores tarballs and an append-only index. The schema lives only at training time.
What gets shipped
A complete episode directory is tarred and zstd-compressed:
data/episodes/<episode_id>/ → <episode_id>.tar.zst
The orchestrator marks an episode complete by writing a done.marker file at
the end of the directory after meta.json is finalized. The shipper only
considers directories that contain done.marker — partially-written episodes
are invisible to it.
Wire protocol
PUT https://<receiver-host>.wg/v1/episodes/<host_id>/<episode_id>.tar.zst
Content-Type: application/zstd
Content-Length: <bytes>
X-Content-SHA256: <sha256-of-body>
X-Schema-Version: 1
X-Lab-Host: <host_id>
X-Episode-Id: <episode_id>
body: <the tar.zst bytes>
Auth: mTLS using a leaf certificate issued by the wg-pki CA. The receiver trusts only certs issued by that CA.
Responses:
| Status | Meaning |
|---|---|
| 201 | Stored; new |
| 200 | Already present with matching sha256; nothing to do |
| 409 | Already present with different sha256; receiver refuses to overwrite |
| 4xx | Bad request (missing header, malformed id, etc.) |
| 5xx | Server error; sender retries with backoff |
There is no DELETE. Episodes are immutable once shipped.
Sender (shipper) state machine
scan data/episodes/
|
v
for each <id>/done.marker:
|
v
tar+zstd → data/outbox/<id>.tar.zst.partial
|
v
rename → data/outbox/<id>.tar.zst (atomic; visible to retry loop)
|
v
PUT to receiver
|
+-- 200/201 → mv data/episodes/<id> data/shipped/<id>;
| rm data/outbox/<id>.tar.zst
|
+-- 409 → log mismatch, leave files in place, alert (manual triage)
|
+-- 5xx/network → backoff (1s, 2s, 4s, 8s, ... cap 5min); retry
The shipper does the same scan on every wake-up, so a crash mid-tar or mid-PUT is harmless — the next pass picks up wherever it left off.
Receiver state machine
PUT body received
|
v
stream into /var/lib/cis490/incoming/<host>/<id>.tar.zst.partial
|
v
compute sha256 while streaming
|
+-- mismatch with header → 400, delete partial
|
+-- match:
|
v
if final path exists:
|
+-- existing sha256 == new sha256 → 200, delete partial
|
+-- existing sha256 != new sha256 → 409, delete partial
else:
|
v
atomic rename → /var/lib/cis490/episodes/<host>/<id>.tar.zst
|
v
append index.jsonl row
|
v
201
index.jsonl row:
{
"received_at_wall": "2026-04-28T22:31:43Z",
"host_id": "lab-host-1",
"episode_id": "01HW9GZJ7K8QF5W3X2Y6N1A4B0",
"sha256": "...",
"size_bytes": 8412331,
"schema_version": 1
}
That index is the closest thing to a database we have until we decide on one. A trainer can stream it to know what episodes exist, then untar on demand.
Why not just rsync?
rsync works, but:
- No schema-version tagging at the protocol layer.
- No clean way to enforce "immutable once written".
- mTLS via WG-issued certs is more uniform with the rest of the overlay than ssh-key juggling.
- A tiny FastAPI receiver is also a natural place to add ingest-time hooks later (e.g. emit a Matrix notification on successful receipt, kick off a training run when N new episodes arrive).
We may switch to rsync if the FastAPI receiver becomes a bottleneck. For a class project that is unlikely.
Operational notes
- Disk on lab host. The shipper keeps episodes locally in
data/shipped/<id>/until a retention pass prunes them. Default retention: 7 days or 80% disk usage, whichever comes first. - Disk on receiver. No retention enforced by default — the central store is the dataset.
- Backpressure. If the receiver is unreachable (WG down, Pi rebooting),
the shipper accumulates tarballs in
data/outbox/. No data is lost. - Multiple lab hosts. Each writes under its own
<host_id>/prefix. No coordination needed; episode ids are globally unique (ULID).