Merge remote-tracking branch 'origin/main' into Dev_REL1_043026
commit b42d073669
6 changed files with 328 additions and 8 deletions
143
AGENTS.md
@@ -5,6 +5,18 @@ The conventions below also apply to sibling repos (`wg-enroll`,
`wg-pki`, `caddy`, `iptmonads`, `matrix`, `forgejo`, `vault`,
`openclaw-deploy`).

> **READ THIS FIRST IF YOU'RE TROUBLESHOOTING A LAB-HOST BRING-UP:**
> If the shipper or doctor mentions mTLS, certs, CA bundles, or
> "secure the connection," **do not write any code, mint any keys,
> generate any CSRs, or call openssl.** Cert delivery is
> automated — jump to **"Securing the connection (mTLS) — DO NOT
> mint your own certs"** below for the one command you should run.

> **PULL `origin/main` BEFORE DEBUGGING.** Three install bugs and the
> shipper crash-loop on missing CA were closed in main as of
> 2026-04-30. If your clone is older than that, the symptoms below
> may already be fixed.

---

## How a lab host gets to "shipping data" — the canonical bring-up
@@ -66,6 +78,137 @@ common silent failures it catches:

`cis490-doctor --json` is machine-readable for use by other agents.

## Securing the connection (mTLS) — DO NOT mint your own certs

The lab-host ↔ Pi connection is mTLS over WireGuard. **The cert
delivery is fully automated.** You should never run `openssl`, write
a CSR, edit a Caddyfile, or generate a private key on the lab host.
If you find yourself doing any of that, you're off the runbook.

**The actual cert flow:**

1. The lab host comes up on WireGuard via `wg-enroll` (USB-driven,
   one-time, separate project). After this, the lab host can reach
   `bootstrap.wg` and `collector.wg` on the `10.100.0.0/24` overlay.
2. `scripts/install-lab-host.sh`, on its way through, pulls the leaf
   cert + CA bundle from `https://bootstrap.wg/v1/cert/<host_id>`
   over plain TLS (CA bundled in `etc/caddy-root.crt`). Trust
   boundary is "this peer is on the WG mesh" — `iptmonads` already
   gates the bootstrap port to enrolled peers.
3. The fetch step is a no-op if `host_id` is still the default
   `REPLACE_ME` in `/etc/cis490/lab-host.toml`. **This is the most
   common reason agents think cert delivery is broken.**

**The one fix that resolves 95 % of "cert/TLS/connection" reports:**

```sh
# 1. Make sure host_id is set:
sudo grep '^host_id' /etc/cis490/lab-host.toml
# If it says "REPLACE_ME", edit it to the real host_id you registered.

# 2. Re-run the installer. It will fetch the cert from bootstrap.wg.
sudo /opt/cis490/scripts/install-lab-host.sh

# 3. Confirm certs landed:
ls -l /etc/cis490/certs/  # expect lab-host.pem, lab-host.key, wg-ca.pem

# 4. Smoke-test the pipe:
sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
    --config /etc/cis490/lab-host.toml --ping
# {"ok": true, ...} → done.
```

If step 2 prints `WARN: bootstrap.wg fetch failed`, the cause is
almost always one of:

- `bootstrap.wg` DNS not resolving → add to `/etc/hosts`:
  `echo '10.100.0.1 bootstrap.wg collector.wg' | sudo tee -a /etc/hosts`
- `wg0` interface not up → `sudo wg show` should list a peer; if not,
  re-run wg-enroll.
- The Pi's `cis490-bootstrap.service` is down → file an issue against
  the receiver-side host, not against this repo.
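
The first cause in the list above (name resolution) can be checked programmatically. The following is a minimal sketch using only Python's standard library; the helper name `unresolved_hosts` and its injectable `resolver` parameter are hypothetical and are not part of this repo. Only the hostnames `bootstrap.wg` and `collector.wg` come from the text above.

```python
import socket
from typing import Callable, Iterable, List


def unresolved_hosts(
    names: Iterable[str],
    resolver: Callable[..., object] = socket.getaddrinfo,
) -> List[str]:
    """Return the names that fail to resolve (hypothetical helper).

    `resolver` defaults to socket.getaddrinfo; it is injectable so the
    check can be exercised without touching real DNS.
    """
    missing = []
    for name in names:
        try:
            resolver(name, 443)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(name)
    return missing


# Example call on a lab host (performs real lookups):
#   unresolved_hosts(["bootstrap.wg", "collector.wg"])
# An empty list means both names resolve; any name returned is a
# candidate for the /etc/hosts entry shown above.
```

This only covers name resolution. It says nothing about whether `wg0` is up or whether the Pi-side service is running.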

**What you should NOT do, even if it feels like it would help:**

- Generate certs with `openssl` or `step-cli` on the lab host.
- Copy certs from another lab host.
- Set `verify_tls = false` in `lab-host.toml` to "skip TLS for now."
- Restart the shipper repeatedly hoping it self-heals — it already
  retries on every request without restart.
- File a Forgejo issue titled "shipper can't connect" without first
  running the four-line block above and pasting its output.

The shipper's `waiting on mTLS material` log line is **expected**
during first-boot until the cert lands. It is not an error to fix.
The transport rebuilds the SSL context on each request, so the
moment certs land in `/etc/cis490/certs/`, the next ping/ship
attempt succeeds — no restart needed.

## Common bring-up gotchas (read this before debugging an install)

Smaller models acting as on-device agents have hit these traps. Each
one is now fixed in main, but if you're on an older clone you may
still see the symptom — pull `origin/main` first, then re-read.

### Run tools from `/opt/cis490`, not from a manual clone

When you run `cis490-doctor` from a clone like `~/.env/CIS490/`,
Python prepends the clone path to `sys.path`. Subprocesses spawned
by the doctor (e.g., `python -m shipper --ping`) inherit the calling
CWD and pick up the clone's `shipper/` package instead of the
service venv at `/opt/cis490/`. Symptom: tracebacks reference the
clone path, or `No module named exploits` despite `package = false`.

**Fix already in main:** the doctor passes `cwd=/opt/cis490` to the
shipper subprocess and inserts `repo_root` into `sys.path` itself.
**Operator action:** always invoke either as
`/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py`
or via `cd /opt/cis490 && ./tools/cis490_doctor.py`. Don't run from a
clone unless you know what you're doing.
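
To illustrate why the working directory matters here: `python -m <module>` places the current directory at the front of `sys.path`, so the same module name can resolve to different code depending on `cwd`. The sketch below is an illustration under that assumption, not the doctor's actual code. `run_module_from` is a hypothetical helper name.

```python
import subprocess
import sys
from pathlib import Path


def run_module_from(
    repo_root: Path, module: str, *args: str
) -> subprocess.CompletedProcess:
    """Run `python -m module` with cwd pinned to repo_root (hypothetical).

    With -m, Python puts the working directory first on sys.path, so
    pinning cwd decides which copy of the package gets imported.
    """
    return subprocess.run(
        [sys.executable, "-m", module, *args],
        cwd=repo_root,
        capture_output=True,  # keep stdout/stderr for inspection
        text=True,
        check=False,  # caller decides how to treat a non-zero exit
    )
```

Called as `run_module_from(Path("/opt/cis490"), "shipper", "--ping")`, this would import `shipper/` from `/opt/cis490` regardless of where the caller was started, assuming the interpreter in use is the service venv's.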

### Shipper logs "waiting on mTLS material" — this is expected, not a bug

The `cis490-shipper` unit is enabled by `install-lab-host.sh` *before*
the Pi has issued the host's mTLS leaf. The transport pre-flights the
configured `ca_bundle` / `client_cert` / `client_key` paths and, if
any are missing, defers building the SSL context. You'll see one
warning per process lifetime:

```
shipper waiting on mTLS material (client_cert path missing: …); will retry each request
```

The unit stays up. Each ping/ship attempt re-tries the build. Once
the Pi runs `deploy-cis490-cert.sh <host_id> <wg_ip>` and the leaf
lands at `/etc/cis490/certs/`, the next request succeeds and the
transport logs `mTLS material now on disk; shipper transport ready`.

**Do not** try to "fix" the warning by restarting the unit, deleting
the config, or hand-rolling certs — just confirm the Pi-side step
ran and wait one scan interval.
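
The pre-flight-then-defer behaviour described in this section can be sketched roughly as follows. This is not the shipper's real transport: `LazyMtlsContext` and `CertNotReady` are hypothetical names, and only the three path roles (`ca_bundle`, `client_cert`, `client_key`) and the warn-once behaviour are taken from the text above.

```python
import logging
import ssl
from pathlib import Path
from typing import Optional

log = logging.getLogger("shipper-sketch")


class CertNotReady(Exception):
    """Raised when a configured cert path is not on disk yet."""


class LazyMtlsContext:
    """Hypothetical stand-in for a transport that defers SSL setup."""

    def __init__(self, ca_bundle: Path, client_cert: Path, client_key: Path) -> None:
        self.paths = {
            "ca_bundle": ca_bundle,
            "client_cert": client_cert,
            "client_key": client_key,
        }
        self.warned = False

    def _preflight(self) -> None:
        # Check every configured path before touching the ssl module.
        for name, path in self.paths.items():
            if not path.is_file():
                raise CertNotReady(f"{name} path missing: {path}")

    def try_build(self) -> Optional[ssl.SSLContext]:
        """Return a client context, or None while material is missing."""
        try:
            self._preflight()
        except CertNotReady as exc:
            if not self.warned:  # warn once per process lifetime
                log.warning("waiting on mTLS material (%s); will retry", exc)
                self.warned = True
            return None
        ctx = ssl.create_default_context(cafile=str(self.paths["ca_bundle"]))
        ctx.load_cert_chain(
            certfile=str(self.paths["client_cert"]),
            keyfile=str(self.paths["client_key"]),
        )
        return ctx
```

Because `try_build()` is called on every request, nothing needs restarting once the files appear: the next call passes the pre-flight and returns a usable context.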

### `install-lab-host.sh` failures

Three install bugs were fixed in commit `95ac56a`. If you're on an
older clone:

| Symptom | Cause | Fix |
|---|---|---|
| `ModuleNotFoundError: pycdlib` during cidata build | `pycdlib` was in `dev` deps, service venv only installs main deps | Pull main; `pycdlib` is in `dependencies` now |
| Episodes exit `rc=1` in 15 s; `launch_demo.sh` can't find image | `vm/images/` dir wasn't created before symlinking | Pull main; install script now `install -d`'s the directory |
| `cis490-doctor` reports "tier3: No module named exploits" | `sys.path` didn't include repo root | Pull main; doctor inserts `repo_root` into `sys.path` |

**If you hit any of these on a fresh install, pull main first** before
filing an issue — the issue is probably already closed.

### One traceback at a time

When the doctor lights up multiple red rows, fix the topmost one and
re-run rather than batching attempts. Each red row prints the exact
operator command it expects you to run. Don't paraphrase or invent
adjacent commands; the doctor is the source of truth for what's
missing.

## How an agent generates data on demand (without waiting for the timer)

```sh
@@ -209,20 +209,33 @@ if [[ "$NEW_INSTALL" == "1" ]]; then
     log " FIRST-INSTALL NEXT STEPS "
     log "================================================================="
     log " 1. Edit $ETC_ROOT/lab-host.toml — set host_id and receiver URL."
+    log " (host_id starts as 'REPLACE_ME'; the cert auto-fetch in"
+    log " step 7 of this script SKIPS while that's still the value.)"
     log ""
-    log " 2. (On the Pi.) Mint + ship a leaf cert for this host:"
-    log " sudo wg-pki/scripts/deploy-cis490-cert.sh <host_id> <wg_ip>"
+    log " 2. RE-RUN THIS SCRIPT — sudo $0"
+    log " The second pass sees the real host_id and pulls the leaf"
+    log " cert from https://bootstrap.wg/v1/cert/<host_id>. There is"
+    log " no manual cert-minting step on this host. DO NOT openssl."
     log ""
-    log " 3. Run the diagnostic — every red row prints the exact fix:"
+    log " 3. Confirm certs landed:"
+    log " ls -l $ETC_ROOT/certs/"
+    log " Expected: lab-host.pem, lab-host.key, wg-ca.pem"
+    log ""
+    log " 4. Run the diagnostic — every red row prints the exact fix:"
     log " $INSTALL_ROOT/.venv/bin/python \\"
     log " $INSTALL_ROOT/tools/cis490_doctor.py --role lab-host"
     log ""
-    log " 4. Smoke-test the pipe (returns ok=true on success):"
+    log " 5. Smoke-test the pipe (returns ok=true on success):"
     log " sudo -u $SERVICE_USER $INSTALL_ROOT/.venv/bin/python -m shipper \\"
     log " --config $ETC_ROOT/lab-host.toml --ping"
     log ""
-    log " 5. Turn on the services — episodes start flowing immediately:"
+    log " 6. Turn on the services — episodes start flowing immediately:"
     log " sudo systemctl enable --now cis490-shipper cis490-orchestrator"
+    log ""
+    log " If bootstrap.wg fetch fails (step 2 prints WARN), see AGENTS.md"
+    log " ('Securing the connection (mTLS)') — almost always a missing"
+    log " /etc/hosts entry or wg0 not up. Do not work around this by"
+    log " minting certs locally; fix the WG path instead."
     log "================================================================="
 fi
@@ -110,8 +110,12 @@ class ShipperTransport:
         except _CertNotReadyError as e:
             if not self._cert_warned:
                 log.warning(
-                    "shipper waiting on mTLS material (%s); will retry each request",
-                    e,
+                    "shipper waiting on mTLS material (%s); will retry each "
+                    "request — this is expected during first-boot. To unblock: "
+                    "set host_id in /etc/cis490/lab-host.toml then run "
+                    "`sudo /opt/cis490/scripts/install-lab-host.sh` (do NOT "
+                    "mint certs by hand). See AGENTS.md → Securing the "
+                    "connection (mTLS).", e,
                 )
                 self._cert_warned = True
             return False
@@ -296,6 +296,29 @@ def test_host_filter_scopes_to_one_lab_host(tmp_path: Path) -> None:
    assert (episodes / "lab1" / "01FAKE.tar.zst").exists()


def test_archive_preserves_index_mode(tmp_path: Path) -> None:
    """Regression: the prune tool's index rewrite must not change the
    file's mode bits. Real-world failure: a sudo'd prune run replaced
    the receiver's index with a root-owned file the service couldn't
    append to, every PUT 500'd on _append_index."""
    import stat as _stat
    episodes, index = _stage_receiver_tree(tmp_path)
    # Set a non-default mode so we can detect drift.
    index.chmod(0o664)
    before_mode = _stat.S_IMODE(index.stat().st_mode)
    pe.main([
        "--episodes-root", str(episodes),
        "--index", str(index),
        "--archive-root", str(tmp_path / "archive"),
        "--reason", "no-sample",
        "--archive",
    ])
    after_mode = _stat.S_IMODE(index.stat().st_mode)
    assert after_mode == before_mode, (
        f"prune mutated index mode: {oct(before_mode)} -> {oct(after_mode)}"
    )


def test_multiple_reasons_combine(tmp_path: Path) -> None:
    """An episode failing >1 signal is flagged once, all reasons listed."""
    tar = _make_episode(
125
tools/index_backfill.py
Normal file
@@ -0,0 +1,125 @@
"""``cis490-index-backfill`` — rebuild missing index.jsonl rows from
tarballs already on disk.

Use case: a stretch where the receiver wrote tarballs to
``episodes/<host>/`` but failed to append the matching index row
(permissions, disk full, crash mid-write). The shipper retries see
``already-present`` and never re-PUT, so the gap is permanent until
something on the receiver-side fills it.

This tool walks ``episodes/<host>/<id>.tar.zst`` and, for any episode
not already in the index, computes sha256 + size and appends a row
matching the receiver's schema. Existing rows are left alone.

Run on the receiver host as the same user the receiver runs under
(``cis490``) so the appended rows match in ownership and the live
receiver can keep writing afterward:

    sudo -u cis490 /opt/cis490/.venv/bin/python \\
        /opt/cis490/tools/index_backfill.py \\
        --episodes-root /var/lib/cis490/episodes \\
        --index /var/lib/cis490/index.jsonl
"""

from __future__ import annotations

import argparse
import hashlib
import json
import sys
from datetime import datetime, timezone
from pathlib import Path


SCHEMA_VERSION = 1


def _sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def _existing_episode_ids(index_path: Path) -> set[str]:
    if not index_path.exists():
        return set()
    seen: set[str] = set()
    for line in index_path.read_text().splitlines():
        if not line.strip():
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue
        ep = row.get("episode_id")
        if isinstance(ep, str):
            seen.add(ep)
    return seen


def main(argv: list[str] | None = None) -> int:
    p = argparse.ArgumentParser(prog="cis490-index-backfill")
    p.add_argument("--episodes-root", type=Path,
                   default=Path("/var/lib/cis490/episodes"))
    p.add_argument("--index", type=Path,
                   default=Path("/var/lib/cis490/index.jsonl"))
    p.add_argument("--host", help="Only backfill this host_id")
    p.add_argument("--dry-run", action="store_true",
                   help="Print what would be appended; don't write")
    args = p.parse_args(argv)

    if not args.episodes_root.exists():
        print(f"no episodes dir at {args.episodes_root}", file=sys.stderr)
        return 2

    seen = _existing_episode_ids(args.index)
    rows_to_write: list[dict] = []
    scanned = 0
    for host_dir in sorted(args.episodes_root.iterdir()):
        if not host_dir.is_dir():
            continue
        if args.host and host_dir.name != args.host:
            continue
        for tar in sorted(host_dir.glob("*.tar.zst")):
            scanned += 1
            episode_id = tar.stem.removesuffix(".tar")
            if episode_id in seen:
                continue
            sha = _sha256_of(tar)
            size = tar.stat().st_size
            rows_to_write.append({
                "received_at_wall": datetime.now(timezone.utc).isoformat(),
                "host_id": host_dir.name,
                "episode_id": episode_id,
                "sha256": sha,
                "size_bytes": size,
                "schema_version": SCHEMA_VERSION,
                "backfilled": True,
            })

    print(f"scanned: {scanned} already-indexed: {scanned - len(rows_to_write)} "
          f"to-backfill: {len(rows_to_write)}")
    if not rows_to_write:
        return 0
    if args.dry_run:
        for r in rows_to_write[:5]:
            print(json.dumps(r, sort_keys=True))
        if len(rows_to_write) > 5:
            print(f"... ({len(rows_to_write) - 5} more)")
        return 0

    # Append each row exactly the way the receiver does (single-line
    # write, append-mode open). This keeps writes atomic on POSIX for
    # rows < PIPE_BUF and never replaces the file, so ownership is
    # preserved automatically.
    with args.index.open("a") as f:
        for row in rows_to_write:
            f.write(json.dumps(row, sort_keys=True) + "\n")
    print(f"backfilled {len(rows_to_write)} rows into {args.index}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
@@ -266,10 +266,22 @@ def apply_action(
             continue
         kept.append(line)
     # Rewrite via tempfile + replace so a crash mid-write doesn't
-    # corrupt the live index.
+    # corrupt the live index. os.replace drops ownership/mode from
+    # the original — when prune runs as root that leaves the new
+    # file root:root and locks out the cis490 receiver service
+    # (every PUT then 500s on _append_index). Snapshot stat before
+    # the rename, restore after.
+    st = index_path.stat()
     tmp = index_path.with_suffix(".jsonl.partial")
     tmp.write_text("\n".join(kept) + ("\n" if kept else ""))
     os.replace(tmp, index_path)
+    try:
+        os.chown(index_path, st.st_uid, st.st_gid)
+    except (PermissionError, OSError):
+        # Best-effort: chown requires root, but if we got here as a
+        # non-root user the original ownership matched ours anyway.
+        pass
+    os.chmod(index_path, st.st_mode & 0o7777)
 
 
 # ---------------------------------------------------------------------------