CIS490/docs/transport.md
Maximus Gorog fa1574a0a6 Scaffold project: docs, repo skeleton, transport + deploy design
Lays down the design surface for the CIS490 behavioral-malware-detection
dataset and model. No code yet — schema and topology are decided first so
collection can start without rework.

Docs:
- README: project goal, navigation
- architecture: lab topology, KVM choice, episode state machine,
  deployment-mirror reasoning
- threat-model: train/serve parity rule, oracle-vs-deployable feature
  split, two-model evaluation strategy
- data-model: per-episode JSONL layout, row schemas, phase enum
- transport: WG-native shipper/receiver design, idempotent uploads
- deploy: one-command install for lab-host and receiver roles
- lab-setup: KVM prereqs, VM build, snapshot, virtio-serial wiring

Skeleton: orchestrator/, collectors/, vm/, exploits/, samples/,
training/ (each with a short README explaining purpose).
Extended .gitignore to exclude qcow2 images, pcaps, sample binaries,
secrets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:21:00 -06:00

164 lines
5.1 KiB
Markdown

# Transport — Centralized Episode Collection over WG
The dataset lives wherever it is convenient to train from. In our setup that is
the Pi5 (or whatever the team designates as the central collector), reachable
over the WG overlay at `<receiver-host>.wg`. This document describes how
episodes get from a lab host to the central collector.
## Design goals
1. **Easy to deploy.** One config file, one systemd unit per side. No DB
required to start collecting.
2. **WG-native.** Sender and receiver both live on the WG overlay; transport is
just HTTPS over WG. We use the existing wg-pki CA for mTLS.
3. **Idempotent.** Re-shipping the same episode is safe and cheap; the
receiver responds 200 if the bytes already match.
4. **Crash-safe.** Lab host crash mid-episode does not corrupt the central
store. Receiver crash mid-upload leaves no partial visible.
5. **Schema-free.** The receiver does not parse JSONL; it stores tarballs and
an append-only index. The schema lives only at training time.
## What gets shipped
A complete episode directory is tarred and zstd-compressed:
```
data/episodes/<episode_id>/ → <episode_id>.tar.zst
```
The orchestrator marks an episode complete by writing a `done.marker` file at
the *end* of the directory after `meta.json` is finalized. The shipper only
considers directories that contain `done.marker` — partially-written episodes
are invisible to it.
## Wire protocol
```
PUT https://<receiver-host>.wg/v1/episodes/<host_id>/<episode_id>.tar.zst
Content-Type: application/zstd
Content-Length: <bytes>
X-Content-SHA256: <sha256-of-body>
X-Schema-Version: 1
X-Lab-Host: <host_id>
X-Episode-Id: <episode_id>
body: <the tar.zst bytes>
```
Auth: mTLS using a leaf certificate issued by the wg-pki CA. The receiver
trusts only certs issued by that CA.
Responses:
| Status | Meaning |
|---|---|
| 201 | Stored; new |
| 200 | Already present with matching sha256; nothing to do |
| 409 | Already present with **different** sha256; receiver refuses to overwrite |
| 4xx | Bad request (missing header, malformed id, etc.) |
| 5xx | Server error; sender retries with backoff |
There is no DELETE. Episodes are immutable once shipped.
## Sender (`shipper`) state machine
```
scan data/episodes/
|
v
for each <id>/done.marker:
|
v
tar+zstd → data/outbox/<id>.tar.zst.partial
|
v
rename → data/outbox/<id>.tar.zst (atomic; visible to retry loop)
|
v
PUT to receiver
|
+-- 200/201 → mv data/episodes/<id> data/shipped/<id>;
| rm data/outbox/<id>.tar.zst
|
+-- 409 → log mismatch, leave files in place, alert (manual triage)
|
+-- 5xx/network → backoff (1s, 2s, 4s, 8s, ... cap 5min); retry
```
The shipper does the same scan on every wake-up, so a crash mid-tar or
mid-PUT is harmless — the next pass picks up wherever it left off.
## Receiver state machine
```
PUT body received
|
v
stream into /var/lib/cis490/incoming/<host>/<id>.tar.zst.partial
|
v
compute sha256 while streaming
|
+-- mismatch with header → 400, delete partial
|
+-- match:
|
v
if final path exists:
|
+-- existing sha256 == new sha256 → 200, delete partial
|
+-- existing sha256 != new sha256 → 409, delete partial
else:
|
v
atomic rename → /var/lib/cis490/episodes/<host>/<id>.tar.zst
|
v
append index.jsonl row
|
v
201
```
`index.jsonl` row:
```json
{
"received_at_wall": "2026-04-28T22:31:43Z",
"host_id": "lab-host-1",
"episode_id": "01HW9GZJ7K8QF5W3X2Y6N1A4B0",
"sha256": "...",
"size_bytes": 8412331,
"schema_version": 1
}
```
That index is the closest thing to a database we have until we decide on one.
A trainer can stream it to know what episodes exist, then untar on demand.
## Why not just rsync?
`rsync` works, but:
- No schema-version tagging at the protocol layer.
- No clean way to enforce "immutable once written".
- mTLS via WG-issued certs is more uniform with the rest of the overlay than
ssh-key juggling.
- A tiny FastAPI receiver is also a natural place to add ingest-time hooks
later (e.g. emit a Matrix notification on successful receipt, kick off a
training run when N new episodes arrive).
We may switch to rsync if the FastAPI receiver becomes a bottleneck. For a
class project that is unlikely.
## Operational notes
- **Disk on lab host.** The shipper keeps episodes locally in
`data/shipped/<id>/` until a retention pass prunes them. Default retention:
7 days *or* 80% disk usage, whichever comes first.
- **Disk on receiver.** No retention enforced by default — the central store
is the dataset.
- **Backpressure.** If the receiver is unreachable (WG down, Pi rebooting),
the shipper accumulates tarballs in `data/outbox/`. No data is lost.
- **Multiple lab hosts.** Each writes under its own `<host_id>/` prefix. No
coordination needed; episode ids are globally unique (ULID).