Merge remote-tracking branch 'origin/main' into Dev_REL1_043026

Elliott Kolden 2026-04-30 15:48:23 -06:00
commit b42d073669
6 changed files with 328 additions and 8 deletions

143
AGENTS.md

@ -5,6 +5,18 @@ The conventions below also apply to sibling repos (`wg-enroll`,
`wg-pki`, `caddy`, `iptmonads`, `matrix`, `forgejo`, `vault`,
`openclaw-deploy`).
> **READ THIS FIRST IF YOU'RE TROUBLESHOOTING A LAB-HOST BRING-UP:**
> If the shipper or doctor mentions mTLS, certs, CA bundles, or
> "secure the connection," **do not write any code, mint any keys,
> generate any CSRs, or call openssl.** Cert delivery is
> automated — jump to **"Securing the connection (mTLS) — DO NOT
> mint your own certs"** below for the one command you should run.
> **PULL `origin/main` BEFORE DEBUGGING.** Three install bugs and the
> shipper crash-loop on missing CA were closed in main as of
> 2026-04-30. If your clone is older than that, the symptoms below
> may already be fixed.
---
## How a lab host gets to "shipping data" — the canonical bring-up
@ -66,6 +78,137 @@ common silent failures it catches:
`cis490-doctor --json` is machine-readable for use by other agents.
## Securing the connection (mTLS) — DO NOT mint your own certs
The lab-host ↔ Pi connection is mTLS over WireGuard. **The cert
delivery is fully automated.** You should never run `openssl`, write
a CSR, edit a Caddyfile, or generate a private key on the lab host.
If you find yourself doing any of that, you're off the runbook.
**The actual cert flow:**
1. The lab host comes up on WireGuard via `wg-enroll` (USB-driven,
one-time, separate project). After this, the lab host can reach
`bootstrap.wg` and `collector.wg` on the `10.100.0.0/24` overlay.
2. `scripts/install-lab-host.sh`, on its way through, pulls the leaf
cert + CA bundle from `https://bootstrap.wg/v1/cert/<host_id>`
over plain TLS (CA bundled in `etc/caddy-root.crt`). Trust
boundary is "this peer is on the WG mesh" — `iptmonads` already
gates the bootstrap port to enrolled peers.
3. The fetch step is a no-op if `host_id` is still the default
`REPLACE_ME` in `/etc/cis490/lab-host.toml`. **This is the most
common reason agents think cert delivery is broken.**
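The skip condition in step 3 can be checked before re-running the installer. The following is a minimal sketch, not the installer's own logic: it assumes `host_id` is written as a double-quoted value on a line of its own (which is what the `grep '^host_id'` check in this document implies), and the function name `host_id_status` is hypothetical.

```python
from __future__ import annotations

import re

PLACEHOLDER = "REPLACE_ME"  # default value named in this document

# Mirrors `grep '^host_id'`: a line that starts with host_id = "<value>".
# Assumption: the value is double-quoted; the real file layout may differ.
_HOST_ID_LINE = re.compile(r'^host_id\s*=\s*"([^"]*)"', re.MULTILINE)


def host_id_status(toml_text: str) -> str:
    """Return 'missing', 'placeholder', or 'set' for the given config text."""
    match = _HOST_ID_LINE.search(toml_text)
    if match is None:
        return "missing"
    return "placeholder" if match.group(1) == PLACEHOLDER else "set"
```

Calling `host_id_status(Path("/etc/cis490/lab-host.toml").read_text())` and getting anything other than `"set"` suggests the cert fetch has nothing to work with yet.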
**The one fix that resolves 95 % of "cert/TLS/connection" reports:**
```sh
# 1. Make sure host_id is set:
sudo grep '^host_id' /etc/cis490/lab-host.toml
# If it says "REPLACE_ME", edit it to the real host_id you registered.
# 2. Re-run the installer. It will fetch the cert from bootstrap.wg.
sudo /opt/cis490/scripts/install-lab-host.sh
# 3. Confirm certs landed:
ls -l /etc/cis490/certs/ # expect lab-host.pem, lab-host.key, wg-ca.pem
# 4. Smoke-test the pipe:
sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
--config /etc/cis490/lab-host.toml --ping
# {"ok": true, ...} → done.
```
If step 2 prints `WARN: bootstrap.wg fetch failed`, the cause is
almost always one of:
- `bootstrap.wg` DNS not resolving → add to `/etc/hosts`:
`echo '10.100.0.1 bootstrap.wg collector.wg' | sudo tee -a /etc/hosts`
- `wg0` interface not up → `sudo wg show` should list a peer; if not,
re-run wg-enroll.
- The Pi's `cis490-bootstrap.service` is down → file an issue against
the receiver-side host, not against this repo.
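The three causes above can be walked in order with a small triage helper. This is a sketch under stated assumptions, not part of the repo: the function names are hypothetical, port 443 is assumed because the bootstrap URL is `https://`, and peer counting relies on `wg show` printing one line starting with `peer:` per configured peer.

```python
from __future__ import annotations

import socket
from typing import Callable


def name_resolves(hostname: str,
                  resolver: Callable[..., object] = socket.getaddrinfo) -> bool:
    """True if the hostname resolves; the resolver is injectable for testing."""
    try:
        resolver(hostname, 443)  # port 443 is an assumption (https)
    except socket.gaierror:
        return False
    return True


def count_wg_peers(wg_show_output: str) -> int:
    """Count peers in the text printed by `sudo wg show`."""
    return sum(1 for line in wg_show_output.splitlines()
               if line.startswith("peer:"))


def triage(hostname_ok: bool, peer_count: int) -> str:
    """Map the two local checks onto the causes listed above, in order."""
    if not hostname_ok:
        return "bootstrap.wg does not resolve: add the /etc/hosts entry"
    if peer_count == 0:
        return "wg0 lists no peer: re-run wg-enroll"
    return "local path looks fine: check cis490-bootstrap.service on the Pi"
```

A caller would feed it `name_resolves("bootstrap.wg")` and the captured output of `sudo wg show`; only when both local checks pass does the Pi-side service become the likely suspect.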
**What you should NOT do, even if it feels like it would help:**
- Generate certs with `openssl` or `step-cli` on the lab host.
- Copy certs from another lab host.
- Set `verify_tls = false` in `lab-host.toml` to "skip TLS for now."
- Restart the shipper repeatedly hoping it self-heals — it already
retries on every request without restart.
- File a Forgejo issue titled "shipper can't connect" without first
running the four-step block above and pasting its output.
The shipper's `waiting on mTLS material` log line is **expected**
during first-boot until the cert lands. It is not an error to fix.
The transport rebuilds the SSL context on each request, so the
moment certs land in `/etc/cis490/certs/`, the next ping/ship
attempt succeeds — no restart needed.
## Common bring-up gotchas (read this before debugging an install)
Smaller models acting as on-device agents have hit these traps. Each
one is now fixed in main, but if you're on an older clone you may
still see the symptom — pull `origin/main` first, then re-read.
### Run tools from `/opt/cis490`, not from a manual clone
When you run `cis490-doctor` from a clone like `~/.env/CIS490/`,
the clone path ends up at the front of `sys.path`. Subprocesses
spawned by the doctor (e.g., `python -m shipper --ping`) inherit the
caller's working directory, and `python -m` puts that directory first
on `sys.path`, so they import the clone's `shipper/` package instead
of the installed copy under `/opt/cis490/`. Symptom: tracebacks
reference the clone path, or `No module named exploits` despite
`package = false`.
**Fix already in main:** the doctor passes `cwd=/opt/cis490` to the
shipper subprocess and inserts `repo_root` into `sys.path` itself.
**Operator action:** always invoke either as
`/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py`
or via `cd /opt/cis490 && ./tools/cis490_doctor.py`. Don't run from a
clone unless you know what you're doing.
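The effect of pinning the working directory can be seen in isolation. The sketch below is not the doctor's code; `run_module` is a hypothetical helper showing that `python -m <module>` resolves the module from whatever `cwd` the child process is given, which is why passing `cwd=/opt/cis490` avoids picking up a clone.

```python
import subprocess
import sys
from pathlib import Path


def run_module(module: str, cwd: Path) -> str:
    """Run `python -m <module>` with the working directory pinned to cwd.

    With -m, the interpreter puts the current directory first on sys.path,
    so the module is imported from cwd rather than from wherever the
    parent process happened to be started.
    """
    result = subprocess.run(
        [sys.executable, "-m", module],
        cwd=cwd,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.strip()
```

For example, `run_module("shipper", Path("/opt/cis490"))` would import the `shipper` package under `/opt/cis490` even when the caller sits inside a manual clone (arguments such as `--ping` are left out of this sketch).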
### Shipper logs "waiting on mTLS material" — this is expected, not a bug
The `cis490-shipper` unit is enabled by `install-lab-host.sh` *before*
the Pi has issued the host's mTLS leaf. The transport pre-flights the
configured `ca_bundle` / `client_cert` / `client_key` paths and, if
any are missing, defers building the SSL context. You'll see one
warning per process lifetime:
```
shipper waiting on mTLS material (client_cert path missing: …); will retry each request
```
The unit stays up. Each ping/ship attempt retries the build. Once
the Pi runs `deploy-cis490-cert.sh <host_id> <wg_ip>` and the leaf
lands at `/etc/cis490/certs/`, the next request succeeds and the
transport logs `mTLS material now on disk; shipper transport ready`.
**Do not** try to "fix" the warning by restarting the unit, deleting
the config, or hand-rolling certs — just confirm the Pi-side step
ran and wait one scan interval.
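The pre-flight and warn-once behaviour described in this section can be sketched as follows. This is an illustration only, not the real `ShipperTransport`: the class name `LazyMTLS`, the `ensure_ready` method, and the injectable `build` callable are hypothetical, and unlike the real transport this sketch caches the context once it has been built.

```python
from __future__ import annotations

import logging
import ssl
from pathlib import Path
from typing import Callable

log = logging.getLogger("shipper.sketch")


def _build_context(ca_bundle: Path, client_cert: Path,
                   client_key: Path) -> ssl.SSLContext:
    """Build a client-side context that presents a leaf cert (mTLS)."""
    ctx = ssl.create_default_context(cafile=str(ca_bundle))
    ctx.load_cert_chain(certfile=str(client_cert), keyfile=str(client_key))
    return ctx


class LazyMTLS:
    """Defer building the SSL context until all three files exist."""

    def __init__(self, ca_bundle: Path, client_cert: Path, client_key: Path,
                 build: Callable[..., object] = _build_context) -> None:
        self._paths = {
            "ca_bundle": Path(ca_bundle),
            "client_cert": Path(client_cert),
            "client_key": Path(client_key),
        }
        self._build = build
        self._warned = False
        self.context: object | None = None

    def ensure_ready(self) -> bool:
        """Call before every request; returns False while certs are absent."""
        if self.context is not None:
            return True
        missing = [name for name, p in self._paths.items() if not p.is_file()]
        if missing:
            if not self._warned:  # warn once per process, then stay quiet
                log.warning("waiting on mTLS material (%s path missing); "
                            "will retry each request", missing[0])
                self._warned = True
            return False
        self.context = self._build(**self._paths)
        log.info("mTLS material now on disk; transport ready")
        return True
```

Because the check runs on every request, nothing needs restarting: the first call after the files appear builds the context and returns `True`.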
### `install-lab-host.sh` failures
Three install bugs were fixed in commit `95ac56a`. If you're on an
older clone:
| Symptom | Cause | Fix |
|---|---|---|
| `ModuleNotFoundError: pycdlib` during cidata build | `pycdlib` was in `dev` deps, but the service venv only installs main deps | Pull main; `pycdlib` is in `dependencies` now |
| Episodes exit `rc=1` in 15 s; `launch_demo.sh` can't find image | `vm/images/` dir wasn't created before symlinking | Pull main; install script now `install -d`'s the directory |
| `cis490-doctor` reports "tier3: No module named exploits" | `sys.path` didn't include repo root | Pull main; doctor inserts `repo_root` into `sys.path` |
**If you hit any of these on a fresh install, pull main first** before
filing an issue — the issue is probably already closed.
### One traceback at a time
When the doctor lights up multiple red rows, fix the topmost one and
re-run rather than batching attempts. Each red row prints the exact
operator command it expects you to run. Don't paraphrase or invent
adjacent commands; the doctor is the source of truth for what's
missing.
## How an agent generates data on demand (without waiting for the timer)
```sh


@ -209,20 +209,33 @@ if [[ "$NEW_INSTALL" == "1" ]]; then
log " FIRST-INSTALL NEXT STEPS "
log "================================================================="
log " 1. Edit $ETC_ROOT/lab-host.toml — set host_id and receiver URL."
log " (host_id starts as 'REPLACE_ME'; the cert auto-fetch in"
log " step 7 of this script SKIPS while that's still the value.)"
log ""
log " 2. (On the Pi.) Mint + ship a leaf cert for this host:"
log " sudo wg-pki/scripts/deploy-cis490-cert.sh <host_id> <wg_ip>"
log " 2. RE-RUN THIS SCRIPT — sudo $0"
log " The second pass sees the real host_id and pulls the leaf"
log " cert from https://bootstrap.wg/v1/cert/<host_id>. There is"
log " no manual cert-minting step on this host. DO NOT openssl."
log ""
log " 3. Run the diagnostic — every red row prints the exact fix:"
log " 3. Confirm certs landed:"
log " ls -l $ETC_ROOT/certs/"
log " Expected: lab-host.pem, lab-host.key, wg-ca.pem"
log ""
log " 4. Run the diagnostic — every red row prints the exact fix:"
log " $INSTALL_ROOT/.venv/bin/python \\"
log " $INSTALL_ROOT/tools/cis490_doctor.py --role lab-host"
log ""
log " 4. Smoke-test the pipe (returns ok=true on success):"
log " 5. Smoke-test the pipe (returns ok=true on success):"
log " sudo -u $SERVICE_USER $INSTALL_ROOT/.venv/bin/python -m shipper \\"
log " --config $ETC_ROOT/lab-host.toml --ping"
log ""
log " 5. Turn on the services — episodes start flowing immediately:"
log " 6. Turn on the services — episodes start flowing immediately:"
log " sudo systemctl enable --now cis490-shipper cis490-orchestrator"
log ""
log " If bootstrap.wg fetch fails (step 2 prints WARN), see AGENTS.md"
log " ('Securing the connection (mTLS)') — almost always a missing"
log " /etc/hosts entry or wg0 not up. Do not work around this by"
log " minting certs locally; fix the WG path instead."
log "================================================================="
fi


@ -110,8 +110,12 @@ class ShipperTransport:
except _CertNotReadyError as e:
if not self._cert_warned:
log.warning(
"shipper waiting on mTLS material (%s); will retry each request",
e,
"shipper waiting on mTLS material (%s); will retry each "
"request — this is expected during first-boot. To unblock: "
"set host_id in /etc/cis490/lab-host.toml then run "
"`sudo /opt/cis490/scripts/install-lab-host.sh` (do NOT "
"mint certs by hand). See AGENTS.md → Securing the "
"connection (mTLS).", e,
)
self._cert_warned = True
return False


@ -296,6 +296,29 @@ def test_host_filter_scopes_to_one_lab_host(tmp_path: Path) -> None:
assert (episodes / "lab1" / "01FAKE.tar.zst").exists()
def test_archive_preserves_index_mode(tmp_path: Path) -> None:
"""Regression: the prune tool's index rewrite must not change the
file's mode bits. Real-world failure: a sudo'd prune run replaced
the receiver's index with a root-owned file the service couldn't
append to, so every PUT 500'd on _append_index."""
import stat as _stat
episodes, index = _stage_receiver_tree(tmp_path)
# Set a non-default mode so we can detect drift.
index.chmod(0o664)
before_mode = _stat.S_IMODE(index.stat().st_mode)
pe.main([
"--episodes-root", str(episodes),
"--index", str(index),
"--archive-root", str(tmp_path / "archive"),
"--reason", "no-sample",
"--archive",
])
after_mode = _stat.S_IMODE(index.stat().st_mode)
assert after_mode == before_mode, (
f"prune mutated index mode: {oct(before_mode)} -> {oct(after_mode)}"
)
def test_multiple_reasons_combine(tmp_path: Path) -> None:
"""An episode failing >1 signal is flagged once, all reasons listed."""
tar = _make_episode(

125
tools/index_backfill.py Normal file

@ -0,0 +1,125 @@
"""``cis490-index-backfill`` — rebuild missing index.jsonl rows from
tarballs already on disk.
Use case: a stretch where the receiver wrote tarballs to
``episodes/<host>/`` but failed to append the matching index row
(permissions, disk full, crash mid-write). The shipper's retries see
``already-present`` and never re-PUT, so the gap is permanent until
something on the receiver side fills it.
This tool walks ``episodes/<host>/<id>.tar.zst`` and, for any episode
not already in the index, computes sha256 + size and appends a row
matching the receiver's schema. Existing rows are left alone.
Run on the receiver host as the same user the receiver runs under
(``cis490``) so the appended rows match in ownership and the live
receiver can keep writing afterward:
sudo -u cis490 /opt/cis490/.venv/bin/python \\
/opt/cis490/tools/index_backfill.py \\
--episodes-root /var/lib/cis490/episodes \\
--index /var/lib/cis490/index.jsonl
"""
from __future__ import annotations
import argparse
import hashlib
import json
import sys
from datetime import datetime, timezone
from pathlib import Path
SCHEMA_VERSION = 1
def _sha256_of(path: Path) -> str:
h = hashlib.sha256()
with path.open("rb") as f:
for chunk in iter(lambda: f.read(1024 * 1024), b""):
h.update(chunk)
return h.hexdigest()
def _existing_episode_ids(index_path: Path) -> set[str]:
if not index_path.exists():
return set()
seen: set[str] = set()
for line in index_path.read_text().splitlines():
if not line.strip():
continue
try:
row = json.loads(line)
except json.JSONDecodeError:
continue
ep = row.get("episode_id")
if isinstance(ep, str):
seen.add(ep)
return seen
def main(argv: list[str] | None = None) -> int:
p = argparse.ArgumentParser(prog="cis490-index-backfill")
p.add_argument("--episodes-root", type=Path,
default=Path("/var/lib/cis490/episodes"))
p.add_argument("--index", type=Path,
default=Path("/var/lib/cis490/index.jsonl"))
p.add_argument("--host", help="Only backfill this host_id")
p.add_argument("--dry-run", action="store_true",
help="Print what would be appended; don't write")
args = p.parse_args(argv)
if not args.episodes_root.exists():
print(f"no episodes dir at {args.episodes_root}", file=sys.stderr)
return 2
seen = _existing_episode_ids(args.index)
rows_to_write: list[dict] = []
scanned = 0
for host_dir in sorted(args.episodes_root.iterdir()):
if not host_dir.is_dir():
continue
if args.host and host_dir.name != args.host:
continue
for tar in sorted(host_dir.glob("*.tar.zst")):
scanned += 1
episode_id = tar.stem.removesuffix(".tar")
if episode_id in seen:
continue
sha = _sha256_of(tar)
size = tar.stat().st_size
rows_to_write.append({
"received_at_wall": datetime.now(timezone.utc).isoformat(),
"host_id": host_dir.name,
"episode_id": episode_id,
"sha256": sha,
"size_bytes": size,
"schema_version": SCHEMA_VERSION,
"backfilled": True,
})
print(f"scanned: {scanned} already-indexed: {scanned - len(rows_to_write)} "
f"to-backfill: {len(rows_to_write)}")
if not rows_to_write:
return 0
if args.dry_run:
for r in rows_to_write[:5]:
print(json.dumps(r, sort_keys=True))
if len(rows_to_write) > 5:
print(f"... ({len(rows_to_write) - 5} more)")
return 0
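# Append rows the way the receiver does (one JSON line per row,
# append-mode open). Append mode never replaces the file, so its
# ownership and mode are preserved automatically. Caveat: POSIX's
# PIPE_BUF atomicity guarantee covers pipes and FIFOs, not regular
# files, and Python may buffer several rows into one write(), so a
# concurrent receiver append could in principle interleave.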
with args.index.open("a") as f:
for row in rows_to_write:
f.write(json.dumps(row, sort_keys=True) + "\n")
print(f"backfilled {len(rows_to_write)} rows into {args.index}")
return 0
if __name__ == "__main__":
sys.exit(main())


@ -266,10 +266,22 @@ def apply_action(
continue
kept.append(line)
# Rewrite via tempfile + replace so a crash mid-write doesn't
# corrupt the live index.
# corrupt the live index. os.replace drops ownership/mode from
# the original — when prune runs as root that leaves the new
# file root:root and locks out the cis490 receiver service
# (every PUT then 500s on _append_index). Snapshot stat before
# the rename, restore after.
st = index_path.stat()
tmp = index_path.with_suffix(".jsonl.partial")
tmp.write_text("\n".join(kept) + ("\n" if kept else ""))
os.replace(tmp, index_path)
try:
os.chown(index_path, st.st_uid, st.st_gid)
except OSError:
# Best-effort: restoring ownership generally needs root. A
# non-root run that already owns the index has nothing to
# restore; otherwise the new file keeps the caller's ownership.
pass
os.chmod(index_path, st.st_mode & 0o7777)
# ---------------------------------------------------------------------------