Lab-host shipper + receiver /v1/ping + install scripts

Implements the deployment loop end-to-end on the CIS490 side:

  shipper/
    config.py        ShipperConfig (host_id, paths, receiver endpoint, mTLS)
    transport.py     httpx-based PUT + ping with mTLS + bearer support
    queue.py         scan data/episodes/, tar+zstd via system zstd, ship,
                     retire to data/shipped/. Idempotent across crashes per
                     the state machine in docs/transport.md.
    __main__.py      CLI: --ping (smoke test), --once (one pass), or daemon

  receiver/app.py: new POST /v1/ping that requires the same auth as PUT
    /v1/episodes but writes nothing. Used by `cis490-shipper --ping`
    during lab-host bring-up to verify the WG/Caddy/mTLS path before
    shipping any real bytes.

  etc/
    cis490-shipper.service        systemd unit for the lab-host shipper
    cis490-orchestrator.service   systemd unit for the lab-host queue
                                  (kept disabled by default until queue
                                  mode lands)
    lab-host.toml.example         config template

  scripts/
    install-lab-host.sh    idempotent installer; verifies prereqs,
                           creates cis490 service user, syncs repo to
                           /opt/cis490, builds venv, drops systemd units
                           and config template
    install-receiver.sh    same, for the receiver role on the central WG
                           node (Pi5 in our setup)

  tests/test_shipper.py    11 end-to-end tests against a real Uvicorn
                           server hosting the receiver app. Exercises
                           ping, tar+ship, idempotent re-ship, 409
                           conflict, transient (receiver down), tarball
                           round-trip via system zstd.

  AGENTS.md    guidance for AI agents working on this and sibling repos.
               Headline: when you hit an issue you can't fully fix in
               scope, file a Forgejo issue rather than leaving a TODO.

51/51 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

parent 613c6fa223
commit 7c9f9582ca
13 changed files with 1365 additions and 0 deletions

AGENTS.md (Normal file, +91)
@@ -0,0 +1,91 @@
# AGENTS.md — guidance for AI agents working on this repo

This project is part of the spectral lab (`http://maxgit.wg/spectral/`).
The conventions below also apply to sibling repos (`wg-enroll`,
`wg-pki`, `caddy`, `iptmonads`, `matrix`, `forgejo`, `vault`,
`openclaw-deploy`).

## File an issue when you hit a problem

**When you run into an issue you cannot fully resolve in the current
turn, file it as a Forgejo issue on the relevant repo before moving
on.** Do not silently log a TODO comment, leave a partial workaround,
or assume someone else will remember. The issue tracker is the
durable record.

This applies to:

- a build / test / typecheck failure you can't fix in scope
- a bug you discover but aren't tasked with fixing
- a missing dep, missing config, or env-only failure that blocks E2E
- a design gap you've worked around but want a follow-up to fix
  properly
- a scope-out you made (e.g. "deferred Tier 4 sample fetch") that
  needs an owner so it doesn't get lost

Don't file an issue when:

- the user is in the conversation and you can just *tell* them
- it's already filed (search first: `GET /api/v1/repos/<owner>/<repo>/issues?state=open&q=<keyword>`)
- it's truly a non-issue (a one-line edit you're about to make this
  same turn)

## How to file (Forgejo API)

The local Forgejo at `http://10.100.0.1:3000` accepts API calls with a
token-bearer header:

```sh
curl -s -X POST \
  -H "Authorization: token <TOKEN>" \
  -H "Content-Type: application/json" \
  http://10.100.0.1:3000/api/v1/repos/spectral/<repo>/issues \
  -d '{
    "title": "<short, action-oriented title>",
    "body": "<context, repro, attempted fixes, suggested next step>"
  }'
```

The token comes from the user's session — never embed one in code or
commits.

### What a good issue body contains

1. **Context** — one sentence on what was being attempted.
2. **What happened** — the actual error, log line, or unexpected
   behavior. Paste exact output.
3. **What was tried** — every workaround you attempted and why it
   didn't stick.
4. **Suggested next step** — the smallest change that would resolve
   it, if you have a guess. "Unknown" is a fine answer.
5. **Related** — link the commit / PR / file:line where the issue
   surfaced.

### What a good title looks like

| Bad | Good |
|---|---|
| `tests broken` | `tests/test_episode.py: race when t_mono_origin_ns is set in run() not __init__` |
| `caddy thing` | `Caddy: client_auth requires absolute path; relative trusted_ca_cert_file silently fails` |
| `fix later` | `shipper: 5xx backoff cap is 5min, doc says 1min — pick one` |

## After filing

- Reference the issue number in the next commit message:
  `Refs spectral/<repo>#<n>` or `Closes spectral/<repo>#<n>` if your
  current change actually fixes it.
- If the issue is on a different repo than the one you're committing
  to, fully qualify: `spectral/wg-pki#3`.

## Other conventions

- **Naming:** never coin USB / device / service names on the user's
  behalf. Ask first. Reusing an old name is especially bad.
- **`/etc` configs:** `Read` first, copy second. Never overwrite a
  `/etc/...` file from a template without checking what's actually
  there.
- **wg-enroll scope:** creation-only. Don't add admin /
  service-activation features to it.
- **Don't expand a project's binary name beyond its own boundary:**
  `openclaw` is the queue/permissions binary in `openclaw-deploy`.
  This repo is `wg-enroll` (or its caller). Don't conflate.
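
AGENTS.md tells agents to search for an existing issue first and then file through the Forgejo API with `curl`. Both requests can be built with the Python standard library alone. This is an illustrative sketch and is not part of the commit:

- `search_url` and `issue_request` are hypothetical helper names.
- The base URL, the `spectral` owner, and the endpoint paths come from AGENTS.md.
- The helpers only build requests. Nothing is sent until you pass one to `urllib.request.urlopen`.

```python
import json
import urllib.parse
import urllib.request

BASE = "http://10.100.0.1:3000"  # local Forgejo, per AGENTS.md
OWNER = "spectral"


def search_url(repo: str, keyword: str) -> str:
    """GET URL for the 'search first' step: open issues matching keyword."""
    query = urllib.parse.urlencode({"state": "open", "q": keyword})
    return f"{BASE}/api/v1/repos/{OWNER}/{repo}/issues?{query}"


def issue_request(repo: str, title: str, body: str, token: str) -> urllib.request.Request:
    """Build, but do not send, the POST that files a new issue."""
    payload = json.dumps({"title": title, "body": body}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE}/api/v1/repos/{OWNER}/{repo}/issues",
        data=payload,
        method="POST",
        headers={
            # The token comes from the user's session; never hard-code it.
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )
```

To use them:

1. Decode the JSON list returned by `urllib.request.urlopen(search_url(...))`.
2. File only if that list is empty.
3. Watch for non-2xx replies: `urlopen` raises `urllib.error.HTTPError` carrying the status Forgejo returned.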

etc/cis490-orchestrator.service (Normal file, +31)
@@ -0,0 +1,31 @@
[Unit]
Description=CIS490 lab-host episode orchestrator (queue mode)
Documentation=https://maxgit.wg/spectral/CIS490
# Episodes need KVM and (for Tier 3+) msfrpcd up. msfrpcd is brought
# up out-of-band; this unit only requires the kernel + WG.
After=network-online.target wg-quick@wg0.service
Wants=network-online.target

[Service]
Type=simple
User=cis490
Group=cis490
WorkingDirectory=/opt/cis490
# Queue mode is currently a TODO — the binary takes a job-spec file
# and runs episodes in a loop. Until that lands, this unit stays
# disabled by default; lab-host operators kick off episodes by hand
# via tools/run_*.py and let the shipper pick them up.
ExecStart=/opt/cis490/.venv/bin/python -m orchestrator --queue /var/lib/cis490/data/queue
Restart=on-failure
RestartSec=10

# Hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/cis490
SupplementaryGroups=kvm

[Install]
WantedBy=multi-user.target

etc/cis490-shipper.service (Normal file, +33)
@@ -0,0 +1,33 @@
[Unit]
Description=CIS490 lab-host episode shipper
Documentation=https://maxgit.wg/spectral/CIS490
# WG must be up before the shipper can reach the receiver.
After=network-online.target wg-quick@wg0.service
Wants=network-online.target
Requires=wg-quick@wg0.service

[Service]
Type=simple
User=cis490
Group=cis490
WorkingDirectory=/opt/cis490
ExecStart=/opt/cis490/.venv/bin/python -m shipper --config /etc/cis490/lab-host.toml
Restart=on-failure
RestartSec=5

# Hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/cis490
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
LockPersonality=true
RestrictNamespaces=true
RestrictRealtime=true
SystemCallArchitectures=native

[Install]
WantedBy=multi-user.target

etc/lab-host.toml.example (Normal file, +50)
@@ -0,0 +1,50 @@
# CIS490 lab-host — copy to /etc/cis490/lab-host.toml and edit.
#
# This config drives BOTH the orchestrator (which runs episodes) and
# the shipper (which uploads completed episodes to the central
# receiver over WG).

# Stable identity for this lab host. Used in the receiver path
# (/v1/episodes/<host_id>/...) and in the X-Lab-Host header. Pick
# something short, stable, and DNS-safe — letters, digits, _.- only.
host_id = "REPLACE_ME"

[paths]
data_root = "/var/lib/cis490/data"
samples_store = "/var/lib/cis490/samples/store"
qcow_image = "/var/lib/cis490/vm/images/metasploitable2.qcow2"

[receiver]
# The receiver lives behind Caddy on the WG-side collector host. The
# hostname must resolve over WG (collector.wg in the canonical
# spectral lab). The wg-pki CA must be on every lab-host so the
# Caddy-issued internal cert validates.
url = "https://collector.wg"
ca_bundle = "/etc/cis490/certs/wg-ca.pem"

# mTLS: leaf cert + private key issued by wg-pki for THIS host_id.
# Comment these out to fall back to bearer-token auth during early
# bring-up.
client_cert = "/etc/cis490/certs/lab-host.pem"
client_key = "/etc/cis490/certs/lab-host.key"

# Bearer is optional and only used if mTLS isn't yet configured. When
# both are set, mTLS does the actual authn and the bearer is a
# belt-and-suspenders check.
# bearer_token = "REPLACE_ME_WITH_SECRET"

# Set to false ONLY for local-loopback dev against an unsigned cert.
# verify_tls = true

[shipper]
scan_interval_s = 5.0
request_timeout_s = 60.0

[episode]
baseline_seconds = 30
infected_seconds = 90
dormant_seconds = 60

[retention]
keep_local_for_days = 7
prune_at_disk_pct = 80
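
The `[receiver]` comments describe an auth precedence:

- mTLS when `client_cert` and `client_key` are both set.
- Bearer as the fallback, or as an extra check alongside mTLS.
- An `X-Lab-Host` header carrying `host_id` in every case.

The real logic lives in `shipper/transport.py`, which this page does not show, so the following is a hypothetical sketch of that decision only. `Endpoint` mirrors the fields of `ReceiverEndpoint` in `shipper/config.py`. The `Authorization: Bearer <token>` format is an assumption, since the body of the receiver's `_bearer_check` is not shown.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    # Mirrors ReceiverEndpoint in shipper/config.py (paths kept as str here).
    url: str
    client_cert: str | None = None
    client_key: str | None = None
    bearer_token: str | None = None


def auth_plan(ep: Endpoint, host_id: str) -> tuple[str, dict[str, str]]:
    """Return (mode, headers); mode is mtls, mtls+bearer, bearer, or none."""
    if (ep.client_cert is None) != (ep.client_key is None):
        # A lone cert or key is almost certainly an operator mistake.
        raise ValueError("client_cert and client_key must be set together")
    headers = {"X-Lab-Host": host_id}
    mtls = ep.client_cert is not None
    if ep.bearer_token:
        # Assumed header format; the receiver's _bearer_check is not shown.
        headers["Authorization"] = f"Bearer {ep.bearer_token}"
    if mtls and ep.bearer_token:
        mode = "mtls+bearer"  # mTLS authenticates; bearer is belt-and-suspenders
    elif mtls:
        mode = "mtls"
    elif ep.bearer_token:
        mode = "bearer"  # early bring-up, before wg-pki leaf certs exist
    else:
        mode = "none"
    return mode, headers
```

Raising on a half-configured cert/key pair, instead of quietly falling back to bearer, is a choice made in this sketch. The example config always pairs the two keys, so a lone `client_cert` more likely signals a typo than intent.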

receiver/app.py
@@ -2,6 +2,7 @@ from __future__ import annotations
 
 import logging
 import secrets
+import time
 from pathlib import Path
 from typing import Awaitable, Callable
 
@@ -17,6 +18,7 @@ log = logging.getLogger("cis490.receiver")
 
 
 SUFFIX = ".tar.zst"
+SCHEMA_VERSION = 1
 
 
 def _bearer_check(request: Request, expected: str | None) -> Response | None:
@@ -40,6 +42,23 @@ def make_app(
     async def health(request: Request) -> JSONResponse:
         return JSONResponse({"status": "ok"})
 
+    async def ping(request: Request) -> JSONResponse:
+        """Smoke-test endpoint. Verifies that the auth layer and the
+        WG/Caddy/receiver pipe are alive end-to-end without persisting
+        anything — index.jsonl is untouched. Used by ``cis490-shipper
+        --ping`` during initial bring-up of a new lab host."""
+        guard = _bearer_check(request, bearer_token)
+        if guard is not None:
+            return guard
+        return JSONResponse(
+            {
+                "ok": True,
+                "host_id": request.headers.get("x-lab-host"),
+                "t_wall_ns": time.time_ns(),
+                "schema_version": SCHEMA_VERSION,
+            }
+        )
+
     async def put_episode(request: Request) -> JSONResponse:
         guard = _bearer_check(request, bearer_token)
         if guard is not None:
@@ -124,6 +143,7 @@ def make_app(
 
     routes = [
         Route("/v1/health", health, methods=["GET"]),
+        Route("/v1/ping", ping, methods=["POST"]),
         Route(
             "/v1/episodes/{host_id}/{filename}",
             put_episode,
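
`ping` and `put_episode` share `_bearer_check`. That helper returns `None` to let a request through, or a `Response` to short-circuit it. Its body falls outside these hunks, but the module already imports `secrets`, and constant-time comparison is the usual reason to.

The sketch below is a hypothetical, framework-free stand-in, not the real function. It assumes:

- the token arrives as `Authorization: Bearer <token>`;
- `expected=None` means bearer auth is disabled.

It returns an HTTP status code instead of a Starlette `Response`, so it runs with no dependencies.

```python
from __future__ import annotations

import secrets


def bearer_status(authorization: str | None, expected: str | None) -> int | None:
    """Return None if the request may proceed, else an HTTP status code."""
    if expected is None:
        return None  # bearer auth disabled (e.g. mTLS-only at the proxy)
    if not authorization or not authorization.startswith("Bearer "):
        return 401  # no usable credential presented
    presented = authorization[len("Bearer "):]
    # compare_digest avoids leaking, through timing, how much of the token matched.
    if not secrets.compare_digest(presented.encode(), expected.encode()):
        return 403
    return None
```

Because `/v1/ping` runs the same guard as `PUT /v1/episodes`, a 200 from ping means a real upload would clear the same auth layer. That is the point of using it during bring-up.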

scripts/install-lab-host.sh (Executable file, +112)
@@ -0,0 +1,112 @@
#!/usr/bin/env bash
# Install / refresh the CIS490 lab-host role.
#
# Idempotent — safe to re-run after `git pull`. Does NOT enroll the
# host into WireGuard (that's wg-enroll's job, run separately and
# *first*) and does NOT mint TLS certs (that's wg-pki's job).
#
# Steps:
#   1. Verify prereqs (KVM, zstd, qemu, python3.11+, systemd).
#   2. Create the cis490 service user + /var/lib/cis490 layout.
#   3. Sync the repo into /opt/cis490 and build a venv (uv-managed
#      when uv is available, plain pip otherwise).
#   4. Install systemd units from etc/.
#   5. Drop /etc/cis490/lab-host.toml (only on first install).
#
# Operator finishes by:
#   - editing /etc/cis490/lab-host.toml (host_id, receiver URL, certs)
#   - placing leaf certs at /etc/cis490/certs/{lab-host.pem,lab-host.key,wg-ca.pem}
#   - `systemctl enable --now cis490-shipper`

set -euo pipefail

REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
INSTALL_ROOT="${INSTALL_ROOT:-/opt/cis490}"
DATA_ROOT="${DATA_ROOT:-/var/lib/cis490}"
ETC_ROOT="${ETC_ROOT:-/etc/cis490}"
SERVICE_USER="${SERVICE_USER:-cis490}"

log() { printf '[install-lab-host] %s\n' "$*" >&2; }
die() { log "FATAL: $*"; exit 1; }

# --- 1. prereqs --------------------------------------------------------
log "checking prereqs"

if [[ $EUID -ne 0 ]]; then
    die "must run as root (writes to /opt, /etc, /var/lib, and systemd)"
fi
command -v systemctl >/dev/null || die "systemd not found"
command -v qemu-system-x86_64 >/dev/null || die "qemu-system-x86_64 not on PATH"
command -v zstd >/dev/null || die "zstd not on PATH (apt install zstd)"
command -v python3 >/dev/null || die "python3 not on PATH"
[[ -e /dev/kvm ]] || die "/dev/kvm missing — KVM not available"

# shipper/config.py uses tomllib, which needs Python 3.11+.
PY_VER="$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
if ! python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3,11) else 1)'; then
    die "python >=3.11 required, found $PY_VER"
fi

# uv is preferred (lockfile-driven). Fall back to system pip if absent.
# Resolve uv to an absolute path: `sudo -u` may reset PATH (secure_path).
# The binary must still be executable by the service user.
USE_UV=0
UV_BIN=""
if UV_BIN="$(command -v uv)"; then USE_UV=1; fi

# --- 2. user + layout --------------------------------------------------
log "ensuring service user $SERVICE_USER"
if ! id -u "$SERVICE_USER" >/dev/null 2>&1; then
    useradd --system --no-create-home --shell /usr/sbin/nologin \
        --home-dir "$INSTALL_ROOT" "$SERVICE_USER"
fi
# kvm group lets the service spawn VMs.
if getent group kvm >/dev/null 2>&1; then
    usermod -a -G kvm "$SERVICE_USER" || true
fi

install -d -o root -g root -m 0755 "$ETC_ROOT" "$ETC_ROOT/certs"
install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0755 \
    "$DATA_ROOT" "$DATA_ROOT/data" \
    "$DATA_ROOT/data/episodes" "$DATA_ROOT/data/outbox" \
    "$DATA_ROOT/data/shipped" "$DATA_ROOT/data/queue" \
    "$DATA_ROOT/samples" "$DATA_ROOT/samples/store" \
    "$DATA_ROOT/vm" "$DATA_ROOT/vm/images"

# --- 3. repo + venv ----------------------------------------------------
log "syncing repo into $INSTALL_ROOT"
install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0755 "$INSTALL_ROOT"
# Plain cp -aT rather than rsync, to avoid an extra dep. cp never deletes
# files that were removed upstream. Skip the copy when running from
# $INSTALL_ROOT itself, since cp refuses to copy a directory onto itself.
if [[ "$(realpath "$REPO_ROOT")" != "$(realpath "$INSTALL_ROOT")" ]]; then
    cp -aT "$REPO_ROOT" "$INSTALL_ROOT"
fi
chown -R "$SERVICE_USER":"$SERVICE_USER" "$INSTALL_ROOT"

log "building venv"
if [[ "$USE_UV" -eq 1 ]]; then
    sudo -u "$SERVICE_USER" -- env HOME="$INSTALL_ROOT" \
        "$UV_BIN" sync --project "$INSTALL_ROOT"
else
    sudo -u "$SERVICE_USER" -- python3 -m venv "$INSTALL_ROOT/.venv"
    sudo -u "$SERVICE_USER" -- "$INSTALL_ROOT/.venv/bin/pip" install \
        --quiet --upgrade pip
    sudo -u "$SERVICE_USER" -- "$INSTALL_ROOT/.venv/bin/pip" install \
        --quiet starlette 'uvicorn[standard]' httpx msgpack
fi

# --- 4. systemd --------------------------------------------------------
log "installing systemd units"
install -m 0644 "$REPO_ROOT/etc/cis490-shipper.service" \
    /etc/systemd/system/cis490-shipper.service
install -m 0644 "$REPO_ROOT/etc/cis490-orchestrator.service" \
    /etc/systemd/system/cis490-orchestrator.service
systemctl daemon-reload

# --- 5. config template (only on first install) -----------------------
if [[ ! -f "$ETC_ROOT/lab-host.toml" ]]; then
    log "writing $ETC_ROOT/lab-host.toml (template)"
    install -m 0640 -o root -g "$SERVICE_USER" \
        "$REPO_ROOT/etc/lab-host.toml.example" "$ETC_ROOT/lab-host.toml"
    log ""
    log "FIRST-INSTALL NEXT STEPS:"
    log "  1. Edit $ETC_ROOT/lab-host.toml — set host_id and receiver URL."
    log "  2. Place TLS material at:"
    log "       $ETC_ROOT/certs/wg-ca.pem"
    log "       $ETC_ROOT/certs/lab-host.pem"
    log "       $ETC_ROOT/certs/lab-host.key   (mode 0600, owner $SERVICE_USER)"
    log "  3. Smoke-test the receiver pipe (run from $INSTALL_ROOT so"
    log "     'python -m shipper' can import the package):"
    log "       cd $INSTALL_ROOT && sudo -u $SERVICE_USER .venv/bin/python -m shipper \\"
    log "         --config $ETC_ROOT/lab-host.toml --ping"
    log "  4. systemctl enable --now cis490-shipper"
else
    log "$ETC_ROOT/lab-host.toml exists; leaving in place"
fi

log "lab-host install complete."
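
Step 1 of `install-lab-host.sh` is a chain of fail-fast checks that stops at the first missing prerequisite. A Python preflight that reports every missing piece in one go could be sketched as follows. The sketch is hypothetical:

- `missing_prereqs` is not in the commit.
- The `which`, `exists` and `py_version` parameters exist only so it can be exercised without a real lab host.
- It checks the interpreter running it, not whichever `python3` is on `PATH`.
- The root check stays in the shell script.

```python
from __future__ import annotations

import os
import shutil
import sys
from typing import Callable


def missing_prereqs(
    which: Callable[[str], str | None] = shutil.which,
    exists: Callable[[str], bool] = os.path.exists,
    py_version: tuple[int, int] = sys.version_info[:2],
) -> list[str]:
    """Return human-readable problems; an empty list means all checks pass."""
    problems: list[str] = []
    # Same tools the installer probes with `command -v`.
    for tool in ("systemctl", "qemu-system-x86_64", "zstd"):
        if which(tool) is None:
            problems.append(f"{tool} not on PATH")
    if not exists("/dev/kvm"):
        problems.append("/dev/kvm missing (KVM not available)")
    if py_version < (3, 11):  # tomllib needs 3.11+
        problems.append(f"python >= 3.11 required, found {py_version[0]}.{py_version[1]}")
    return problems
```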

scripts/install-receiver.sh (Executable file, +106)
@@ -0,0 +1,106 @@
#!/usr/bin/env bash
# Install / refresh the CIS490 receiver role on the central WG node
# (the Pi5 in our setup). Idempotent — safe to re-run.
#
# Steps:
#   1. Verify prereqs (python3.11+, systemd).
#   2. Create the cis490 service user + /var/lib/cis490 layout.
#   3. Sync the repo into /opt/cis490 and build a venv.
#   4. Install cis490-receiver.service.
#   5. Drop /etc/cis490/receiver.toml on first install.
#
# This script does NOT:
#   - configure Caddy. Add a `collector.wg` block to your spectral/caddy
#     config to terminate TLS and reverse-proxy to 127.0.0.1:8443.
#   - issue server / client certs. wg-pki owns CA + leaf issuance.
#   - open firewall ports. iptmonads owns the WG-side ruleset.

set -euo pipefail

REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
INSTALL_ROOT="${INSTALL_ROOT:-/opt/cis490}"
DATA_ROOT="${DATA_ROOT:-/var/lib/cis490}"
ETC_ROOT="${ETC_ROOT:-/etc/cis490}"
SERVICE_USER="${SERVICE_USER:-cis490}"

log() { printf '[install-receiver] %s\n' "$*" >&2; }
die() { log "FATAL: $*"; exit 1; }

# --- 1. prereqs --------------------------------------------------------
log "checking prereqs"
if [[ $EUID -ne 0 ]]; then
    die "must run as root"
fi
command -v systemctl >/dev/null || die "systemd not found"
command -v python3 >/dev/null || die "python3 not on PATH"

PY_VER="$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
if ! python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3,11) else 1)'; then
    die "python >=3.11 required, found $PY_VER"
fi

# Resolve uv to an absolute path: `sudo -u` may reset PATH (secure_path).
USE_UV=0
UV_BIN=""
if UV_BIN="$(command -v uv)"; then USE_UV=1; fi

# --- 2. user + layout --------------------------------------------------
log "ensuring service user $SERVICE_USER"
if ! id -u "$SERVICE_USER" >/dev/null 2>&1; then
    useradd --system --no-create-home --shell /usr/sbin/nologin \
        --home-dir "$INSTALL_ROOT" "$SERVICE_USER"
fi

install -d -o root -g root -m 0755 "$ETC_ROOT" "$ETC_ROOT/certs"
# DATA_ROOT itself is 0750; its subdirectories are 0755.
install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0750 "$DATA_ROOT"
install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0755 \
    "$DATA_ROOT/episodes" "$DATA_ROOT/incoming"
# Pre-create the index file so the first PUT doesn't race on creation.
sudo -u "$SERVICE_USER" -- touch "$DATA_ROOT/index.jsonl"

# --- 3. repo + venv ----------------------------------------------------
log "syncing repo into $INSTALL_ROOT"
install -d -o "$SERVICE_USER" -g "$SERVICE_USER" -m 0755 "$INSTALL_ROOT"
# cp refuses to copy a directory onto itself; skip when run in place.
if [[ "$(realpath "$REPO_ROOT")" != "$(realpath "$INSTALL_ROOT")" ]]; then
    cp -aT "$REPO_ROOT" "$INSTALL_ROOT"
fi
chown -R "$SERVICE_USER":"$SERVICE_USER" "$INSTALL_ROOT"

log "building venv"
if [[ "$USE_UV" -eq 1 ]]; then
    sudo -u "$SERVICE_USER" -- env HOME="$INSTALL_ROOT" \
        "$UV_BIN" sync --project "$INSTALL_ROOT"
else
    sudo -u "$SERVICE_USER" -- python3 -m venv "$INSTALL_ROOT/.venv"
    sudo -u "$SERVICE_USER" -- "$INSTALL_ROOT/.venv/bin/pip" install \
        --quiet --upgrade pip
    sudo -u "$SERVICE_USER" -- "$INSTALL_ROOT/.venv/bin/pip" install \
        --quiet starlette 'uvicorn[standard]'
fi

# --- 4. systemd --------------------------------------------------------
log "installing systemd unit"
install -m 0644 "$REPO_ROOT/etc/cis490-receiver.service" \
    /etc/systemd/system/cis490-receiver.service
systemctl daemon-reload

# --- 5. config template (only on first install) -----------------------
if [[ ! -f "$ETC_ROOT/receiver.toml" ]]; then
    log "writing $ETC_ROOT/receiver.toml (template)"
    install -m 0640 -o root -g "$SERVICE_USER" \
        "$REPO_ROOT/etc/receiver.toml.example" "$ETC_ROOT/receiver.toml"
    log ""
    log "FIRST-INSTALL NEXT STEPS:"
    log "  1. Verify $ETC_ROOT/receiver.toml paths."
    log "  2. Add a collector.wg block to your spectral/caddy config."
    log "     Example:"
    log "       collector.wg {"
    log "           tls internal"
    log "           reverse_proxy 127.0.0.1:8443"
    log "       }"
    log "     (Caddy terminates TLS, so client mTLS is enforced there,"
    log "     via client_auth against the wg-pki CA bundle, once leaf"
    log "     certs are issued.)"
    log "  3. Open the WG-side port via iptmonads."
    log "  4. systemctl enable --now cis490-receiver"
    log "  5. From a lab host: cis490-shipper --ping"
else
    log "$ETC_ROOT/receiver.toml exists; leaving in place"
fi

log "receiver install complete."
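
The receiver installer pre-creates `index.jsonl` so the first PUT does not race on file creation. Concurrent appends need care as well. The receiver's actual write path is not on this page, so the following is only a sketch of one common Linux approach:

- open the file with `O_APPEND`;
- hold an exclusive `fcntl.flock` for the duration of the write;
- emit each record as exactly one line.

The function name and record fields are hypothetical.

```python
from __future__ import annotations

import fcntl
import json
import os


def append_index(path: str, record: dict) -> None:
    """Append one JSON record as a single line to an index.jsonl-style file."""
    line = (json.dumps(record, sort_keys=True, separators=(",", ":")) + "\n").encode()
    # O_APPEND: every write lands at the current end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o640)
    try:
        # Exclusive advisory lock so cooperating writers cannot interleave.
        fcntl.flock(fd, fcntl.LOCK_EX)
        try:
            view = memoryview(line)
            while view:  # os.write may write fewer bytes than requested
                written = os.write(fd, view)
                view = view[written:]
            os.fsync(fd)  # make the record durable before releasing the lock
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

`flock` is advisory: it only serializes writers that also take the lock. That holds here only if every writer of `index.jsonl` goes through the same helper.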

shipper/__init__.py (Normal file, +0)

shipper/__main__.py (Normal file, +106)
@@ -0,0 +1,106 @@
"""``cis490-shipper`` CLI entrypoint.

Modes:

    --ping      hit /v1/ping; exit 0 if 200/ok, non-zero otherwise.
                No tarball flow; index.jsonl on the receiver is untouched.
    --once      one scan pass over data/episodes/, ship anything done, exit.
    (default)   long-running daemon; rescans every scan_interval_s.
"""

from __future__ import annotations

import argparse
import json
import logging
import signal
import sys

from .config import ShipperConfig
from .queue import ShipperQueue
from .transport import ShipperTransport


def _setup_logging(level: str) -> None:
    logging.basicConfig(
        level=getattr(logging, level.upper(), logging.INFO),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )


def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(prog="cis490-shipper")
    parser.add_argument(
        "--config",
        default="/etc/cis490/lab-host.toml",
        help="Path to lab-host config (TOML)",
    )
    parser.add_argument(
        "--ping",
        action="store_true",
        help="Hit /v1/ping on the receiver and exit",
    )
    parser.add_argument(
        "--once",
        action="store_true",
        help="One scan pass, then exit (default is long-running daemon)",
    )
    parser.add_argument("--log-level", default="INFO")
    args = parser.parse_args(argv)

    _setup_logging(args.log_level)
    log = logging.getLogger("cis490.shipper")

    try:
        cfg = ShipperConfig.load(args.config)
    except (OSError, ValueError, TypeError) as e:
        # OSError: missing or unreadable file. ValueError: invalid TOML
        # (tomllib.TOMLDecodeError) or a missing required key. TypeError:
        # a non-string path value (raised by _optional_path).
        log.error("config error: %s", e)
        return 2

    transport = ShipperTransport(cfg)

    if args.ping:
        result = transport.ping()
        # Print structured one-liner for CI / test pipelines.
        print(json.dumps({
            "ok": result.ok,
            "status_code": result.status_code,
            "host_id": cfg.host_id,
            "receiver": cfg.receiver.url,
            "body": result.body,
            "error": result.error,
        }))
        return 0 if result.ok else 1

    queue = ShipperQueue(cfg, transport)
    if args.once:
        result = queue.run_once()
        log.info(
            "scan complete: scanned=%d shipped=%d transient=%d conflicts=%d fatal=%d",
            result.scanned, result.shipped, result.transient_failures,
            result.conflicts, result.fatal,
        )
        # Exit code reflects fatal failures only. Transient failures are
        # not an error: the outbox tarball stays put and the next run
        # (or daemon pass) retries it.
        return 1 if result.fatal else 0

    # Daemon mode
    stopping = False

    def _stop(signum, frame):  # noqa: ARG001
        nonlocal stopping
        log.info("received signal %s; finishing pass and exiting", signum)
        stopping = True

    signal.signal(signal.SIGTERM, _stop)
    signal.signal(signal.SIGINT, _stop)

    log.info(
        "shipper starting: host_id=%s data_root=%s receiver=%s",
        cfg.host_id, cfg.data_root, cfg.receiver.url,
    )
    queue.run_forever(stop_check=lambda: stopping)
    return 0


if __name__ == "__main__":
    sys.exit(main())
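
In daemon mode, `main` registers SIGTERM/SIGINT handlers that flip `stopping`. It then passes `stop_check=lambda: stopping` to `queue.run_forever` and promises to finish the current pass before exiting. `run_forever` lives in the part of `shipper/queue.py` this page does not show.

Here is a minimal sketch of a loop with that contract, under these assumptions:

- It is a free function that takes the pass as `run_once`, rather than a `ShipperQueue` method. That is an assumption about shape, not the real signature.
- It sleeps in short slices, so a stop request is noticed within about `slice_s` seconds instead of a full `scan_interval_s`.
- `sleep` and `clock` are injectable only so a fake clock can drive it.

```python
from __future__ import annotations

import time
from typing import Callable


def run_forever(
    run_once: Callable[[], object],
    stop_check: Callable[[], bool],
    scan_interval_s: float = 5.0,
    slice_s: float = 0.25,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run passes until stop_check() is true; return the number of passes."""
    passes = 0
    while not stop_check():
        run_once()  # a pass is never interrupted part-way
        passes += 1
        deadline = clock() + scan_interval_s
        # Wait for the next pass, re-checking the stop flag every slice.
        while not stop_check():
            remaining = deadline - clock()
            if remaining <= 0:
                break
            sleep(min(slice_s, remaining))
    return passes
```

With the names from `main`, a call would look like `run_forever(queue.run_once, lambda: stopping, cfg.scan_interval_s)`.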
91
shipper/config.py
Normal file
91
shipper/config.py
Normal file
|
|
@ -0,0 +1,91 @@
|
||||||
|
"""Lab-host shipper config — loaded from /etc/cis490/lab-host.toml."""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import tomllib
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass(frozen=True)
|
||||||
|
class ReceiverEndpoint:
|
||||||
|
url: str # e.g. "https://collector.wg"
|
||||||
|
ca_bundle: Path | None = None
|
||||||
|
client_cert: Path | None = None
|
||||||
|
client_key: Path | None = None
|
||||||
|
bearer_token: str | None = None
|
||||||
|
verify_tls: bool = True
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass(frozen=True)
|
||||||
|
class ShipperConfig:
|
||||||
|
host_id: str
|
||||||
|
data_root: Path # Lab-host data root; episodes/, outbox/, shipped/ live here.
|
||||||
|
receiver: ReceiverEndpoint
|
||||||
|
# Daemon mode: how often to scan for new done.marker files.
|
||||||
|
scan_interval_s: float = 5.0
|
||||||
|
# PUT timeout per episode. Tarballs are bounded by max_episode_bytes;
|
||||||
|
# at WG speeds this is well under 60s for a typical episode.
|
||||||
|
request_timeout_s: float = 60.0
|
||||||
|
# Backoff schedule on transient (5xx / network) failures, in seconds,
|
||||||
|
    # capped at the last entry. The shipper's scan loop will pick the
    # episode up again on the next pass regardless.
    backoff_seconds: tuple[float, ...] = (1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 120.0, 300.0)
    # Local retention before pruning data/shipped/.
    keep_local_for_days: int = 7

    @property
    def episodes_dir(self) -> Path:
        return self.data_root / "episodes"

    @property
    def outbox_dir(self) -> Path:
        return self.data_root / "outbox"

    @property
    def shipped_dir(self) -> Path:
        return self.data_root / "shipped"

    @classmethod
    def load(cls, path: str | Path) -> "ShipperConfig":
        with open(path, "rb") as f:
            data = tomllib.load(f)

        host_id = data.get("host_id")
        if not isinstance(host_id, str) or not host_id:
            raise ValueError("lab-host config: host_id (string) required at top level")

        paths = data.get("paths", {})
        data_root = Path(paths.get("data_root", "/var/lib/cis490/data")).resolve()

        rcv = data.get("receiver", {})
        url = rcv.get("url")
        if not isinstance(url, str) or not url:
            raise ValueError("lab-host config: receiver.url required")

        receiver = ReceiverEndpoint(
            url=url.rstrip("/"),
            ca_bundle=_optional_path(rcv.get("ca_bundle")),
            client_cert=_optional_path(rcv.get("client_cert")),
            client_key=_optional_path(rcv.get("client_key")),
            bearer_token=rcv.get("bearer_token"),
            verify_tls=bool(rcv.get("verify_tls", True)),
        )

        retention = data.get("retention", {})
        return cls(
            host_id=host_id,
            data_root=data_root,
            receiver=receiver,
            scan_interval_s=float(data.get("shipper", {}).get("scan_interval_s", 5.0)),
            request_timeout_s=float(data.get("shipper", {}).get("request_timeout_s", 60.0)),
            keep_local_for_days=int(retention.get("keep_local_for_days", 7)),
        )


def _optional_path(v: object) -> Path | None:
    if v in (None, ""):
        return None
    if isinstance(v, str):
        return Path(v).expanduser()
    raise TypeError(f"expected path string, got {type(v).__name__}")
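The `backoff_seconds` comment says retries are "capped at the last entry", but nothing in this diff reads the field yet: `ShipperConfig.load` does not populate it from TOML, and the queue simply retries on its next scan pass. A minimal sketch of what a capped lookup could look like, assuming a 0-based per-episode attempt counter; `backoff_delay` and `BACKOFF_SECONDS` are hypothetical names, not part of this commit.

```python
# Hypothetical helper, not part of this commit: a "capped at the last
# entry" lookup over a schedule shaped like ShipperConfig.backoff_seconds.
BACKOFF_SECONDS: tuple[float, ...] = (1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 120.0, 300.0)


def backoff_delay(attempt: int, schedule: tuple[float, ...] = BACKOFF_SECONDS) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Attempts past the end of the schedule reuse the last entry, so the
    delay never grows beyond schedule[-1].
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    if not schedule:
        raise ValueError("schedule must not be empty")
    return schedule[min(attempt, len(schedule) - 1)]


if __name__ == "__main__":
    for n in (0, 3, 8, 9, 50):
        print(n, backoff_delay(n))
```

Wiring this in would also need somewhere to keep the attempt count between passes (for example a dict keyed by episode id), since each pass otherwise starts from scratch.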
195
shipper/queue.py
Normal file
@@ -0,0 +1,195 @@
"""Shipper episode queue — scan, compress, ship, retire.

State machine, mirroring docs/transport.md:

    data/episodes/<id>/done.marker
            |
            v
    tar+zstd → data/outbox/<id>.tar.zst.partial
            |
            v
    rename → data/outbox/<id>.tar.zst
            |
            v
    PUT to receiver
            |
            +-- 200/201    → mv data/episodes/<id> → data/shipped/<id>
            |                rm data/outbox/<id>.tar.zst
            |
            +-- 409        → leave files in place (the local + remote tarball
            |                differ; manual triage)
            |
            +-- 5xx/net    → leave outbox tarball; retry on next pass
            |
            +-- other 4xx  → log + skip (caller-side bug, doesn't self-heal)

Idempotent on every pass. A crash mid-tar leaves only a ``.partial``,
which the next pass overwrites. A crash mid-PUT leaves the tarball in
``outbox/`` and the next pass re-ships it; the receiver responds 200
on a matching sha256, 409 on a divergent one.
"""

from __future__ import annotations

import logging
import shutil
import subprocess
import tarfile
import time
from dataclasses import dataclass
from pathlib import Path

from .config import ShipperConfig
from .transport import ShipperTransport, ShipResult, hash_file


log = logging.getLogger("cis490.shipper.queue")


@dataclass(frozen=True)
class PassResult:
    scanned: int
    shipped: int
    transient_failures: int
    conflicts: int
    fatal: int


class ShipperQueue:
    def __init__(self, cfg: ShipperConfig, transport: ShipperTransport) -> None:
        self.cfg = cfg
        self.transport = transport
        cfg.episodes_dir.mkdir(parents=True, exist_ok=True)
        cfg.outbox_dir.mkdir(parents=True, exist_ok=True)
        cfg.shipped_dir.mkdir(parents=True, exist_ok=True)

    # ---- main entry point ---------------------------------------------

    def run_once(self) -> PassResult:
        """One scan pass. Returns counts for logging / tests."""
        ready = self._ready_episodes()
        scanned = len(ready)
        shipped = 0
        transient = 0
        conflicts = 0
        fatal = 0

        for ep_dir in ready:
            episode_id = ep_dir.name
            try:
                tarball, sha = self._tar_episode(ep_dir)
            except Exception:
                log.exception("tar failed for %s", episode_id)
                transient += 1
                continue

            res = self.transport.ship_tarball(episode_id, tarball, sha)
            log.info(
                "ship %s -> %s (%d) %s",
                episode_id, res.status, res.status_code, res.error or "",
            )

            if res.status in ("stored", "already-present"):
                self._retire(ep_dir, tarball)
                shipped += 1
            elif res.status == "conflict":
                conflicts += 1
                # Keep the tarball + episode dir in place. Operator must
                # decide whether to drop our copy or fix the remote one.
            elif res.status == "transient":
                transient += 1
            else:  # fatal
                fatal += 1

        return PassResult(
            scanned=scanned,
            shipped=shipped,
            transient_failures=transient,
            conflicts=conflicts,
            fatal=fatal,
        )

    def run_forever(self, *, stop_check=lambda: False) -> None:
        while not stop_check():
            try:
                self.run_once()
            except Exception:
                log.exception("scan pass crashed; sleeping anyway")
            # Coarse sleep: we don't need precise scheduling and we
            # don't want a tight loop on errors.
            t0 = time.monotonic()
            while time.monotonic() - t0 < self.cfg.scan_interval_s:
                if stop_check():
                    return
                time.sleep(0.5)

    # ---- internals -----------------------------------------------------

    def _ready_episodes(self) -> list[Path]:
        out: list[Path] = []
        if not self.cfg.episodes_dir.exists():
            return out
        for ep in sorted(self.cfg.episodes_dir.iterdir()):
            if ep.is_dir() and (ep / "done.marker").exists():
                out.append(ep)
        return out

    def _tar_episode(self, ep_dir: Path) -> tuple[Path, str]:
        """Tar+zstd the episode dir into outbox. Idempotent — overwrites
        any prior partial. Returns ``(tarball_path, sha256_hex)``."""
        episode_id = ep_dir.name
        outbox = self.cfg.outbox_dir
        partial = outbox / f"{episode_id}.tar.zst.partial"
        final = outbox / f"{episode_id}.tar.zst"

        partial.unlink(missing_ok=True)

        # We use the system `zstd` for streaming compression: pipe a tar
        # stream into `zstd -T0 -19` so the whole tar is never buffered in
        # memory and we avoid a python-zstandard dependency. For unchanged
        # input this should yield the same bytes, which the idempotent
        # re-ship (200 on a matching sha256) relies on. There is
        # deliberately no in-process fallback when the binary is missing.
        if not _which_zstd():
            # The stdlib has no zstd codec, and gzip/zlib would produce a
            # non-zstd tarball, so surface the missing binary instead.
            raise RuntimeError(
                "the `zstd` binary is required on the lab host. "
                "Install it via your package manager."
            )

        with partial.open("wb") as zout:
            proc = subprocess.Popen(
                ["zstd", "-q", "-T0", "-19", "--stdout"],
                stdin=subprocess.PIPE, stdout=zout,
            )
            assert proc.stdin is not None
            try:
                with tarfile.open(fileobj=proc.stdin, mode="w|") as tf:
                    tf.add(ep_dir, arcname=episode_id, recursive=True)
            except BaseException:
                # Don't leave the zstd child running if tarring fails.
                proc.kill()
                raise
            finally:
                try:
                    proc.stdin.close()  # EOF tells zstd to finish the frame
                except BrokenPipeError:
                    pass  # zstd has already exited
                rc = proc.wait()
        if rc != 0:
            partial.unlink(missing_ok=True)
            raise RuntimeError(f"zstd exited {rc}")

        sha = hash_file(partial)
        partial.replace(final)
        return final, sha

    def _retire(self, ep_dir: Path, tarball: Path) -> None:
        """Move episode dir → shipped/, drop the tarball."""
        target = self.cfg.shipped_dir / ep_dir.name
        if target.exists():
            # Belt-and-suspenders: re-shipping an already-retired
            # episode shouldn't happen (the dir was moved), but if it
            # does, prefer the existing copy and just clean up.
            shutil.rmtree(ep_dir, ignore_errors=True)
        else:
            ep_dir.replace(target)
        tarball.unlink(missing_ok=True)


def _which_zstd() -> bool:
    return shutil.which("zstd") is not None
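The crash safety described in the queue docstring comes from two moves: bytes are only ever written to a `.partial` file, and the final name appears via a single rename after the stream is complete and hashed. A stdlib-only sketch of that shape, assuming a hypothetical `tar_to_outbox` helper; it deliberately omits compression (queue.py pipes the same `"w|"` tar stream through the `zstd` binary) and so names its output `.tar`.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def tar_to_outbox(ep_dir: Path, outbox: Path) -> tuple[Path, str]:
    """Stream ep_dir into outbox/<id>.tar via a .partial file, hash it,
    then rename it into place. Compression is omitted in this sketch."""
    outbox.mkdir(parents=True, exist_ok=True)
    partial = outbox / f"{ep_dir.name}.tar.partial"
    final = outbox / f"{ep_dir.name}.tar"
    partial.unlink(missing_ok=True)  # discard leftovers from a crashed pass
    with partial.open("wb") as out:
        # "w|" writes a forward-only stream, the same mode queue.py feeds to zstd.
        with tarfile.open(fileobj=out, mode="w|") as tf:
            tf.add(ep_dir, arcname=ep_dir.name, recursive=True)
    h = hashlib.sha256()
    with partial.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    partial.replace(final)  # os.replace: atomic within one POSIX filesystem
    return final, h.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        ep = root / "episodes" / "01DEMO"
        ep.mkdir(parents=True)
        (ep / "meta.json").write_text("{}")
        tarball, sha = tar_to_outbox(ep, root / "outbox")
        with tarfile.open(tarball) as tf:
            print(sorted(tf.getnames()), sha[:12])
```

A reader of `outbox/` therefore sees either no final file or a complete one, never a truncated tarball under the final name.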
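`run_forever` polls `stop_check()` every 0.5 s between passes. An alternative sketch, assuming the caller owns a `threading.Event` (for example one set from a SIGTERM handler when systemd stops the unit): `Event.wait(timeout)` sleeps for the scan interval but returns as soon as the event is set. `run_until_stopped` is a hypothetical name; it takes the pass function as a plain callable so it can be exercised without a real queue.

```python
import logging
import threading
from collections.abc import Callable

log = logging.getLogger("cis490.shipper.sketch")


def run_until_stopped(run_once: Callable[[], object], interval_s: float,
                      stop: threading.Event) -> int:
    """Call run_once every interval_s until `stop` is set; return pass count."""
    passes = 0
    while not stop.is_set():
        try:
            run_once()
        except Exception:
            log.exception("scan pass crashed; sleeping anyway")
        passes += 1
        # Event.wait returns True immediately once stop is set, so shutdown
        # never waits out the rest of the interval.
        if stop.wait(timeout=interval_s):
            break
    return passes


if __name__ == "__main__":
    stop = threading.Event()
    threading.Timer(0.3, stop.set).start()
    print(run_until_stopped(lambda: None, 0.1, stop))
```

With the existing class this would amount to `run_until_stopped(queue.run_once, cfg.scan_interval_s, stop)`; the trade-off is an extra `Event` to plumb through instead of a zero-argument predicate.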
203
shipper/transport.py
Normal file
@@ -0,0 +1,203 @@
"""HTTP transport for the lab-host shipper.

Two operations against the receiver:
    POST /v1/ping                                — smoke test
    PUT  /v1/episodes/<host>/<episode>.tar.zst   — episode upload

Auth is mTLS (client cert from wg-pki) when configured. A bearer token
is supported as a stand-in during early bring-up before the cert is
issued; production runs should set both.

The transport returns small dataclasses rather than throwing — the
caller (shipper queue) decides whether to retry, move to shipped/, or
alert. This keeps the retry policy in one place.
"""

from __future__ import annotations

import hashlib
import logging
import ssl
from dataclasses import dataclass
from pathlib import Path
from typing import Any

import httpx

from .config import ReceiverEndpoint, ShipperConfig


log = logging.getLogger("cis490.shipper.transport")


SCHEMA_VERSION = 1


@dataclass(frozen=True)
class PingResult:
    ok: bool
    status_code: int
    body: dict[str, Any] | None
    error: str | None


@dataclass(frozen=True)
class ShipResult:
    status: str  # "stored" | "already-present" | "conflict" | "transient" | "fatal"
    status_code: int
    sha256: str | None
    body: dict[str, Any] | None
    error: str | None


def _build_ssl_context(rcv: ReceiverEndpoint) -> ssl.SSLContext | bool:
    """Build an SSL context honoring the wg-pki CA bundle + client cert.

    Returns ``False`` for plain ``http://`` receivers (no TLS at all) and
    an ``ssl.SSLContext`` otherwise. httpx's ``verify=`` accepts either;
    we build a context rather than passing a bundle path so we can attach
    the client cert for mTLS."""
    if not rcv.url.lower().startswith("https://"):
        return False
    ctx = ssl.create_default_context(
        cafile=str(rcv.ca_bundle) if rcv.ca_bundle else None,
    )
    if not rcv.verify_tls:
        # Dev-only path; production lab-hosts should always pin the
        # wg-pki CA. Logged loudly so it doesn't slip through.
        log.warning("TLS verification disabled — dev-only configuration")
        # check_hostname must be cleared before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    if rcv.client_cert and rcv.client_key:
        ctx.load_cert_chain(str(rcv.client_cert), str(rcv.client_key))
    return ctx


class ShipperTransport:
    def __init__(self, cfg: ShipperConfig) -> None:
        self.cfg = cfg
        self._verify = _build_ssl_context(cfg.receiver)

    # ---- ping ----------------------------------------------------------

    def ping(self) -> PingResult:
        url = f"{self.cfg.receiver.url}/v1/ping"
        headers = self._common_headers()
        try:
            with httpx.Client(verify=self._verify, timeout=self.cfg.request_timeout_s) as c:
                r = c.post(url, headers=headers, content=b"")
        except httpx.HTTPError as e:
            return PingResult(ok=False, status_code=0, body=None, error=str(e))

        body: dict[str, Any] | None = None
        try:
            body = r.json()
        except Exception:
            pass

        if r.status_code == 200 and isinstance(body, dict) and body.get("ok"):
            return PingResult(ok=True, status_code=200, body=body, error=None)
        return PingResult(
            ok=False,
            status_code=r.status_code,
            body=body,
            error=f"unexpected status {r.status_code}",
        )

    # ---- ship ----------------------------------------------------------

    def ship_tarball(
        self,
        episode_id: str,
        tarball_path: Path,
        sha256_hex: str,
    ) -> ShipResult:
        url = (
            f"{self.cfg.receiver.url}/v1/episodes/"
            f"{self.cfg.host_id}/{episode_id}.tar.zst"
        )
        size = tarball_path.stat().st_size
        headers = self._common_headers() | {
            "Content-Type": "application/zstd",
            "Content-Length": str(size),
            "X-Content-SHA256": sha256_hex,
            "X-Episode-Id": episode_id,
        }

        try:
            with httpx.Client(verify=self._verify, timeout=self.cfg.request_timeout_s) as c, \
                    tarball_path.open("rb") as body:
                # httpx streams from a file-like object via the `content=` kwarg.
                r = c.put(url, headers=headers, content=body)
        except httpx.HTTPError as e:
            return ShipResult(
                status="transient",
                status_code=0,
                sha256=None,
                body=None,
                error=str(e),
            )

        body_json: dict[str, Any] | None = None
        try:
            body_json = r.json()
        except Exception:
            pass

        if r.status_code == 201:
            return ShipResult(
                status="stored",
                status_code=201,
                sha256=sha256_hex,
                body=body_json,
                error=None,
            )
        if r.status_code == 200:
            return ShipResult(
                status="already-present",
                status_code=200,
                sha256=sha256_hex,
                body=body_json,
                error=None,
            )
        if r.status_code == 409:
            return ShipResult(
                status="conflict",
                status_code=409,
                sha256=sha256_hex,
                body=body_json,
                error="receiver already has a different sha256 for this id",
            )
        if 500 <= r.status_code < 600:
            return ShipResult(
                status="transient",
                status_code=r.status_code,
                sha256=None,
                body=body_json,
                error=f"server error {r.status_code}",
            )
        # Anything else (in practice a 4xx other than 409) is treated as a
        # caller-side bug. Retrying won't fix it; note that the queue leaves
        # the episode in place, so it is re-sent on later passes until fixed.
        return ShipResult(
            status="fatal",
            status_code=r.status_code,
            sha256=None,
            body=body_json,
            error=f"client error {r.status_code}",
        )

    # ---- helpers -------------------------------------------------------

    def _common_headers(self) -> dict[str, str]:
        h: dict[str, str] = {
            "X-Lab-Host": self.cfg.host_id,
            "X-Schema-Version": str(SCHEMA_VERSION),
        }
        if self.cfg.receiver.bearer_token:
            h["Authorization"] = f"Bearer {self.cfg.receiver.bearer_token}"
        return h


def hash_file(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()
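The status handling in `ship_tarball`, together with what `ShipperQueue.run_once` does with each outcome, can be restated as a pure function plus a lookup table, which is convenient to unit-test without a live receiver. A sketch; `classify` and `QUEUE_ACTION` are hypothetical names, and status code 0 stands for "no HTTP response", as in the transport's network-error branch.

```python
# Hypothetical restatement of ShipperTransport.ship_tarball's branches.
def classify(status_code: int) -> str:
    if status_code == 0:  # no response at all: connect error, timeout, ...
        return "transient"
    if status_code == 201:
        return "stored"
    if status_code == 200:
        return "already-present"
    if status_code == 409:
        return "conflict"
    if 500 <= status_code < 600:
        return "transient"
    return "fatal"  # everything else, in practice 4xx other than 409


# What ShipperQueue.run_once does with each outcome in this commit.
QUEUE_ACTION = {
    "stored": "retire",
    "already-present": "retire",
    "conflict": "keep for operator triage",
    "transient": "keep; retry next pass",
    "fatal": "keep; counted as fatal",
}

if __name__ == "__main__":
    for code in (201, 200, 409, 503, 0, 400, 401, 413):
        print(code, classify(code), QUEUE_ACTION[classify(code)])
```

Since "fatal" episodes are not retired, they are re-tarred and re-sent on every pass; making the docstring's "log + skip" literal would need some persisted skip list, which this commit does not have.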
327
tests/test_shipper.py
Normal file
@@ -0,0 +1,327 @@
"""End-to-end shipper tests.

These run a real Uvicorn server bound to 127.0.0.1 on a free port,
hosting the actual receiver Starlette app over an EpisodeStore on a
temp dir. The shipper then talks to that server with its real
`httpx.Client` — same code path as production. This catches things
the receiver-side ASGI tests can't (HTTP framing, header handling,
sync httpx behaviour, content-length quirks).
"""

from __future__ import annotations

import json
import socket
import threading
import time
from pathlib import Path

import httpx
import pytest
import uvicorn

from receiver.app import make_app
from receiver.store import EpisodeStore
from shipper.config import ReceiverEndpoint, ShipperConfig
from shipper.queue import ShipperQueue
from shipper.transport import ShipperTransport


# ---------------------------------------------------------------------------
# Live-receiver fixture
# ---------------------------------------------------------------------------


def _free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


class _ServerThread(threading.Thread):
    def __init__(self, app, port: int) -> None:
        super().__init__(daemon=True)
        cfg = uvicorn.Config(
            app,
            host="127.0.0.1",
            port=port,
            log_level="error",
            lifespan="off",
            access_log=False,
        )
        self.server = uvicorn.Server(cfg)

    def run(self) -> None:
        self.server.run()

    def stop(self) -> None:
        self.server.should_exit = True


def _wait_for_port(port: int, timeout_s: float = 5.0) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with httpx.Client(timeout=0.5) as c:
                r = c.get(f"http://127.0.0.1:{port}/v1/health")
                if r.status_code == 200:
                    return
        except httpx.HTTPError:
            pass
        time.sleep(0.05)
    raise TimeoutError(f"receiver on 127.0.0.1:{port} did not come up")


@pytest.fixture
def store(tmp_path: Path) -> EpisodeStore:
    return EpisodeStore(
        store_root=tmp_path / "rcv-episodes",
        incoming_root=tmp_path / "rcv-incoming",
        index_path=tmp_path / "rcv-index.jsonl",
    )


@pytest.fixture
def receiver(store: EpisodeStore):
    app = make_app(store=store, max_episode_bytes=10_000_000, bearer_token=None)
    port = _free_port()
    server = _ServerThread(app, port)
    server.start()
    try:
        _wait_for_port(port)
        yield f"http://127.0.0.1:{port}", store
    finally:
        server.stop()
        server.join(timeout=2)


@pytest.fixture
def receiver_with_bearer(store: EpisodeStore):
    app = make_app(store=store, max_episode_bytes=10_000_000, bearer_token="s3cret")
    port = _free_port()
    server = _ServerThread(app, port)
    server.start()
    try:
        _wait_for_port(port)
        yield f"http://127.0.0.1:{port}", store
    finally:
        server.stop()
        server.join(timeout=2)


def _make_shipper(
    tmp_path: Path,
    receiver_url: str,
    *,
    host_id: str = "lab1",
    bearer: str | None = None,
) -> tuple[ShipperConfig, ShipperTransport, ShipperQueue]:
    data_root = tmp_path / "lab-data"
    cfg = ShipperConfig(
        host_id=host_id,
        data_root=data_root,
        receiver=ReceiverEndpoint(url=receiver_url, bearer_token=bearer),
        scan_interval_s=0.05,
    )
    transport = ShipperTransport(cfg)
    queue = ShipperQueue(cfg, transport)
    return cfg, transport, queue


def _make_episode(cfg: ShipperConfig, episode_id: str, *, content: bytes = b"data") -> Path:
    ep = cfg.episodes_dir / episode_id
    ep.mkdir(parents=True, exist_ok=True)
    (ep / "meta.json").write_bytes(content)
    (ep / "events.jsonl").write_text("{}\n")
    (ep / "labels.jsonl").write_text("{}\n")
    (ep / "telemetry-proc.jsonl").write_text("{}\n")
    (ep / "done.marker").touch()
    return ep


# ---------------------------------------------------------------------------
# Ping
# ---------------------------------------------------------------------------


def test_ping_returns_ok_against_running_receiver(tmp_path: Path, receiver) -> None:
    url, _ = receiver
    _, transport, _ = _make_shipper(tmp_path, url)
    res = transport.ping()
    assert res.ok is True
    assert res.status_code == 200
    assert res.body is not None
    assert res.body["ok"] is True
    assert res.body["host_id"] == "lab1"
    assert res.body["schema_version"] == 1


def test_ping_writes_nothing_to_index(tmp_path: Path, receiver) -> None:
    url, store = receiver
    _, transport, _ = _make_shipper(tmp_path, url)
    transport.ping()
    transport.ping()
    transport.ping()
    assert store.index_path.read_text() == ""


def test_ping_fails_with_wrong_bearer(tmp_path: Path, receiver_with_bearer) -> None:
    url, _ = receiver_with_bearer
    _, transport, _ = _make_shipper(tmp_path, url, bearer="WRONG")
    res = transport.ping()
    assert res.ok is False
    assert res.status_code == 401


def test_ping_succeeds_with_right_bearer(tmp_path: Path, receiver_with_bearer) -> None:
    url, _ = receiver_with_bearer
    _, transport, _ = _make_shipper(tmp_path, url, bearer="s3cret")
    res = transport.ping()
    assert res.ok is True
    assert res.status_code == 200


def test_ping_fails_when_receiver_unreachable(tmp_path: Path) -> None:
    # Pick a free port and don't bind it — connect must fail.
    port = _free_port()
    _, transport, _ = _make_shipper(tmp_path, f"http://127.0.0.1:{port}")
    res = transport.ping()
    assert res.ok is False
    assert res.status_code == 0
    assert res.error is not None


# ---------------------------------------------------------------------------
# Tar + ship
# ---------------------------------------------------------------------------


def test_run_once_ships_one_done_episode(tmp_path: Path, receiver) -> None:
    url, store = receiver
    cfg, _, queue = _make_shipper(tmp_path, url)
    _make_episode(cfg, "01EPISODE")

    result = queue.run_once()
    assert result.scanned == 1
    assert result.shipped == 1
    assert result.transient_failures == 0

    # Episode dir moved to shipped/.
    assert not (cfg.episodes_dir / "01EPISODE").exists()
    assert (cfg.shipped_dir / "01EPISODE").exists()

    # Outbox tarball cleaned up.
    assert list(cfg.outbox_dir.iterdir()) == []

    # Receiver stored it and indexed it.
    assert store.final_path("lab1", "01EPISODE").exists()
    rows = [json.loads(line) for line in store.index_path.read_text().splitlines()]
    assert len(rows) == 1
    assert rows[0]["host_id"] == "lab1"
    assert rows[0]["episode_id"] == "01EPISODE"


def test_run_once_skips_episodes_without_done_marker(tmp_path: Path, receiver) -> None:
    url, store = receiver
    cfg, _, queue = _make_shipper(tmp_path, url)
    ep = cfg.episodes_dir / "01PARTIAL"
    ep.mkdir(parents=True)
    (ep / "meta.json").write_text("{}")
    # Note: NO done.marker.

    result = queue.run_once()
    assert result.scanned == 0
    assert result.shipped == 0
    assert ep.exists()  # untouched
    assert store.index_path.read_text() == ""


def test_run_once_idempotent_re_ship_returns_already_present(tmp_path: Path, receiver) -> None:
    """If a prior run shipped an episode but crashed before retiring it,
    the next run must re-ship the same bytes successfully (200) and
    retire the dir, not flag it as a conflict."""
    url, store = receiver
    cfg, _, queue = _make_shipper(tmp_path, url)
    _make_episode(cfg, "01REPLAY", content=b"same-bytes")

    queue.run_once()
    assert (cfg.shipped_dir / "01REPLAY").exists()

    # Simulate a crash: move it back as if retire never happened.
    (cfg.shipped_dir / "01REPLAY").rename(cfg.episodes_dir / "01REPLAY")

    result = queue.run_once()
    assert result.scanned == 1
    assert result.shipped == 1
    assert (cfg.shipped_dir / "01REPLAY").exists()

    # Index didn't double up.
    rows = store.index_path.read_text().splitlines()
    assert len(rows) == 1


def test_run_once_handles_409_conflict(tmp_path: Path, receiver) -> None:
    """If the same episode_id was previously shipped with *different*
    bytes, the receiver returns 409 and the shipper must NOT retire
    the local dir — operator triage required."""
    url, _ = receiver
    cfg, _, queue = _make_shipper(tmp_path, url)
    _make_episode(cfg, "01CONFLICT", content=b"first")

    result = queue.run_once()
    assert result.shipped == 1

    # Simulate a re-do with different content but the same id (e.g., a
    # botched re-run on the lab host).
    (cfg.shipped_dir / "01CONFLICT").rename(cfg.episodes_dir / "01CONFLICT")
    (cfg.episodes_dir / "01CONFLICT" / "meta.json").write_bytes(b"tampered")

    result = queue.run_once()
    assert result.scanned == 1
    assert result.shipped == 0
    assert result.conflicts == 1
    # Local dir survives — operator can decide what to do.
    assert (cfg.episodes_dir / "01CONFLICT").exists()


def test_run_once_handles_transient_when_receiver_is_down(tmp_path: Path) -> None:
    port = _free_port()
    cfg, _, queue = _make_shipper(tmp_path, f"http://127.0.0.1:{port}")
    _make_episode(cfg, "01DOWN")

    result = queue.run_once()
    assert result.scanned == 1
    assert result.shipped == 0
    assert result.transient_failures == 1
    # Episode dir + tarball both stay in place for the next pass.
    assert (cfg.episodes_dir / "01DOWN").exists()
    assert (cfg.outbox_dir / "01DOWN.tar.zst").exists()


def test_tarball_round_trips_episode_dir(tmp_path: Path, receiver) -> None:
    """The receiver-side tarball must extract back to the original
    episode dir layout (modulo file order). Verifies the tar+zstd
    pipe is intact."""
    import subprocess
    import tarfile

    url, store = receiver
    cfg, _, queue = _make_shipper(tmp_path, url)
    ep = _make_episode(cfg, "01ROUND", content=b"meta-bytes")
    expected_files = sorted(p.name for p in ep.iterdir())

    queue.run_once()

    # The receiver stored it; pull the bytes back, decompress + untar.
    rcv_path = store.final_path("lab1", "01ROUND")
    decompressed = tmp_path / "01ROUND.tar"
    subprocess.check_call(
        ["zstd", "-q", "-d", "-o", str(decompressed), str(rcv_path)],
    )
    extract_dir = tmp_path / "extracted"
    extract_dir.mkdir()
    with tarfile.open(decompressed) as tf:
        tf.extractall(extract_dir)

    got_files = sorted(p.name for p in (extract_dir / "01ROUND").iterdir())
    assert got_files == expected_files
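`_free_port()` closes its probe socket before Uvicorn binds, so there is a small window in which another process could take the port. A stdlib-only sketch of the race-free pattern: bind the server to port 0 and read back the port the kernel chose while keeping the socket open. `HealthHandler` and `start_server` are hypothetical stand-ins for the receiver. Recent uvicorn versions accept pre-bound sockets via `Server.run(sockets=...)`, which would allow the same trick with the real app; confirm against the installed version before relying on it.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Stand-in for the receiver: answers GET /v1/health with 200."""

    def do_GET(self) -> None:
        if self.path == "/v1/health":
            status, body = 200, b'{"ok": true}'
        else:
            status, body = 404, b""
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep test output quiet


def start_server() -> tuple[ThreadingHTTPServer, int]:
    # Bind to port 0 and keep the socket: the kernel picks a free port and
    # nothing can grab it between "find a port" and "bind it".
    srv = ThreadingHTTPServer(("127.0.0.1", 0), HealthHandler)
    port = srv.server_address[1]
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv, port


if __name__ == "__main__":
    srv, port = start_server()
    try:
        with urllib.request.urlopen(f"http://127.0.0.1:{port}/v1/health", timeout=5) as r:
            print(port, r.status, r.read())
    finally:
        srv.shutdown()
        srv.server_close()
```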
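`test_tarball_round_trips_episode_dir` compares only file names, so a tar+zstd pipe that truncated or reordered file contents would still pass. A stricter check could also compare bytes. A minimal sketch with a hypothetical `same_tree` helper. Note that after `queue.run_once()` the original episode directory has been moved, so the comparison would be against `cfg.shipped_dir / "01ROUND"` rather than `ep`.

```python
import filecmp
import tempfile
from pathlib import Path


def same_tree(a: Path, b: Path) -> bool:
    """True if a and b hold the same relative file paths with identical bytes."""
    files_a = sorted(p.relative_to(a) for p in a.rglob("*") if p.is_file())
    files_b = sorted(p.relative_to(b) for p in b.rglob("*") if p.is_file())
    if files_a != files_b:
        return False
    # shallow=False compares contents instead of trusting os.stat() signatures.
    return all(filecmp.cmp(a / rel, b / rel, shallow=False) for rel in files_a)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        left, right = Path(tmp) / "left", Path(tmp) / "right"
        for d in (left, right):
            d.mkdir()
            (d / "meta.json").write_bytes(b"meta-bytes")
        print(same_tree(left, right))
```

In the round-trip test this would read `assert same_tree(cfg.shipped_dir / "01ROUND", extract_dir / "01ROUND")`.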