Today's incident: post-cutover, k-gamingcom went silent and
elliott-thinkpad kept shipping pre-stamp episodes that the receiver
gate 400'd in a 2300+ PUT loop. Both required
`git pull && install-lab-host.sh` *on the host*: neither the on-device
AI agent nor the operator pulled in time, and I cannot reach the hosts
from the receiver Pi (sshd is off on the lab hosts).
Fix the recurrence directly: a 30-min systemd timer that does
git fetch + (if behind) ff-only pull + re-run install-lab-host.sh.
Hosts catch up on the next tick on their own — no human or agent
action required.
Mechanics:
- scripts/auto-update.sh runs as root, drops to cis490 for git ops
to satisfy /opt/cis490 ownership ("dubious ownership" guard).
- Refuses ff if local HEAD isn't an ancestor of origin/main —
protects operator hand-edits from silent overwrite.
- Network failures exit 0 (offline is normal, don't pin a unit
failure); divergence + install failures exit non-zero so the
journal records what broke.
- RandomizedDelaySec=10min on the timer prevents thundering-herd
when several hosts boot together.
- Hands off to install-lab-host.sh via exec — exactly one path
through bring-up; no special "auto" flow.
The version-gate provides the quality boundary, so even if
origin/main moves forward unsafely, the receiver's allow-list still
controls what lands in the index.
install-lab-host.sh enables cis490-autoupdate.timer on every run,
idempotent — existing hosts pick it up the next time they pull
manually.
Filed Forgejo #18 with the canonical command for elliott-thinkpad
+ k-gamingcom to bootstrap themselves out of the current incident
(auto-update doesn't help them retroactively — it has to be running
*before* the cutover to catch the next one).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AGENTS.md — guidance for AI agents working on this repo
This project is part of the spectral lab (http://maxgit.wg/spectral/).
The conventions below also apply to sibling repos (wg-enroll,
wg-pki, caddy, iptmonads, matrix, forgejo, vault,
openclaw-deploy).
⚡ FIRST: are you an on-device agent on a lab host?
Read this table before doing anything else. Find your symptom in the left column and run the EXACT command in the right column. Do not invent variations. Do not paraphrase.
| Symptom (what you see) | What to run (verbatim) |
|---|---|
| journalctl -u cis490-shipper shows 412 commit-rejected or code commit rejected | cd /opt/cis490 && sudo -u cis490 git pull origin main && sudo /opt/cis490/scripts/install-lab-host.sh |
| journalctl -u cis490-shipper shows 400 and missing X-Cis490-Code-Commit | Same command as above. |
| systemctl status cis490-shipper shows inactive (dead) after install | sudo systemctl enable --now cis490-shipper cis490-orchestrator |
| Shipper logs waiting on mTLS material | Do nothing; this is expected. The cert auto-arrives; the daemon retries each request. |
| Receiver index isn't growing (Pi side: tail -f /var/lib/cis490/index.jsonl) | /opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py --role lab-host, then read the red rows; each one prints the exact fix command. |
| bootstrap.wg fetch failed after install | echo '10.100.0.1 bootstrap.wg collector.wg' \| sudo tee -a /etc/hosts && sudo /opt/cis490/scripts/install-lab-host.sh |
| cis490-doctor says VERSION stamp missing | sudo /opt/cis490/scripts/install-lab-host.sh |
| You're not sure what's wrong | /opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py --role lab-host |
HARD RULES — do not break these even if they "feel like they'd help":
- Do not run openssl, step-cli, mint keys, or write CSRs. Cert delivery is fully automated by install-lab-host.sh. If you're about to touch a private key on a lab host, stop.
- Do not set verify_tls = false. Skipping TLS verification is not "skipping it for now"; it's leaving a hole.
- Do not invent a code_version field in old episode meta.json files. That field exists to keep buggy data out of the index. The drain helper handles them properly.
- Do not disable, mask, or systemctl stop cis490-shipper to silence log noise. Fix the cause via the table above; the noise stops on its own once the host has the new code.
- Do not git pull and skip running install-lab-host.sh. A bare git pull updates files on disk but does NOT restart the systemd units, so the running daemons keep executing the old code.
- Do not file a Forgejo issue without first running cis490-doctor and pasting its output.
If your symptom isn't in the table or the canonical command above doesn't fix it, file a Forgejo issue (see the "File an issue" section near the bottom).
How a lab host gets to "shipping data" — the canonical bring-up
If you (an AI agent OR a human) are looking at a fresh lab host and asking "is this thing actually generating data for the central collector?", run this in order. Cloning the repo by itself does nothing. Running launchers from a manual clone bypasses the systemd services that do the actual work.
# 0. (One-time, on the Pi only.) Initialize the CIS490 client CA + a
# leaf cert for THIS lab host. Get its WG IP from `wg-enroll-admin
# show <usb>` first.
sudo /home/max/.env/wg-pki/scripts/init-cis490-client-ca.sh # idempotent
sudo /home/max/.env/wg-pki/scripts/deploy-cis490-cert.sh \
<host_id> <wg_ip> # mints + scp's + extracts + chmods
# 1. (On the lab host.) Install the lab-host role. ONE COMMAND DOES
# EVERYTHING — repo to /opt/cis490, venv build, systemd units,
# Alpine baseline qcow2, cidata ISO, VERSION stamp, mTLS cert
# auto-fetch from bootstrap.wg, Tier-3+4 deploy (msfrpcd +
# Metasploitable2 + theZoo malware samples + bridge), pre-stamp
# queue drain, and a `daemon-reload + systemctl restart` of the
# shipper + orchestrator on re-runs. Idempotent — safe to re-run.
sudo /opt/cis490/scripts/install-lab-host.sh
# (or, if running from a clone elsewhere:)
# sudo ./scripts/install-lab-host.sh
# 2. Edit /etc/cis490/lab-host.toml — set host_id (the only required
# edit). Then re-run step 1 so the cert auto-fetch can resolve
# bootstrap.wg/v1/cert/<host_id>.
# 3. Verify everything before enabling the timer-driven services:
/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py \
--role lab-host
# → green/yellow rows means READY; red rows print the exact fix
# command. Re-run until clean.
# 4. Turn on the services. From this moment on, the orchestrator runs
# one fleet wave on each Restart= cycle, and the shipper picks up
# completed episodes and PUTs them to https://collector.wg over mTLS.
sudo systemctl enable --now cis490-shipper cis490-orchestrator
# 5. (On the Pi.) Watch the index grow:
sudo tail -f /var/lib/cis490/index.jsonl
There is no manual Tier-3 step. Steps 1 + 2 deploy msfrpcd, Metasploitable2 (auto-fetched from a public mirror with TOFU sha256 pinning — no Rapid7 registration), and Tier-4 real-malware samples from theZoo (no API key, no signup). The orchestrator switches to Tier-3 episodes automatically once the prereqs are on disk.
Hosts self-update. install-lab-host.sh enables
cis490-autoupdate.timer, which runs every 30 min (with up to 10 min
of randomized delay) and does git fetch + git pull --ff-only + install-lab-host.sh whenever origin/main has moved. So once a host
has done the canonical bring-up ONCE, it self-heals on every
subsequent maintainer push — you don't need to remember to pull. The
timer logs to journalctl -u cis490-autoupdate.service. If the
host's checkout has diverged from origin (operator hand-edits,
half-applied changes), auto-update bails rather than guessing — that
shows up as a unit failure with a clear log message.
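The per-tick decision described above can be sketched in a few lines. This is an illustration, not the contents of scripts/auto-update.sh: the names `Action`, `decide`, and `exit_code`, and the concrete exit value 1, are assumptions made for the example. Only the four outcomes themselves (offline is not a failure, up-to-date is a no-op, divergence fails the unit, otherwise fast-forward and re-run the installer) come from the description.

```python
from enum import Enum


class Action(Enum):
    SKIP_OFFLINE = "skip-offline"       # fetch failed: offline is normal
    UP_TO_DATE = "up-to-date"           # origin/main has not moved
    REFUSE_DIVERGED = "refuse"          # local HEAD is not an ancestor of origin/main
    FAST_FORWARD = "ff-and-install"     # ff-only pull, then install-lab-host.sh


def decide(fetch_ok: bool, local_head: str, remote_head: str,
           local_is_ancestor: bool) -> Action:
    """Pick the action for one timer tick (hypothetical helper)."""
    if not fetch_ok:
        return Action.SKIP_OFFLINE
    if local_head == remote_head:
        return Action.UP_TO_DATE
    if not local_is_ancestor:
        # Hand-edits or half-applied changes: bail rather than guess.
        return Action.REFUSE_DIVERGED
    return Action.FAST_FORWARD


def exit_code(action: Action) -> int:
    """Exit status of the decision step alone in this sketch.

    After FAST_FORWARD the installer's own exit status decides the result.
    """
    return 1 if action is Action.REFUSE_DIVERGED else 0
```

In a shell implementation, `local_is_ancestor` would typically come from `git merge-base --is-ancestor HEAD origin/main`, which exits 0 when a fast-forward is possible.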
If index.jsonl doesn't grow within a wave-interval (~60 s after
systemctl enable --now), run cis490-doctor again. The most
common silent failures it catches:
- *.wg DNS missing (wg-enroll provisions it; manual workaround is one line in /etc/hosts)
- mTLS cert chain not installed under /etc/cis490/certs/
- cis490-shipper service inactive (forgot step 4)
- qemu-system-x86_64 not on PATH
cis490-doctor --json is machine-readable for use by other agents.
Shipper says "400 missing" or "412 commit-rejected": pull and reinstall
If journalctl -u cis490-shipper shows a steady stream of
-> fatal (400) or -> 412 commit-rejected lines, the receiver is
rejecting episodes because their meta.json::code_version.commit
isn't in the receiver's allow-list (or isn't being sent at all). This
happens when this lab host is running code older than the receiver
will accept.
The fix is always the same — pull main and re-run the installer:
cd /opt/cis490
sudo -u cis490 git pull origin main
sudo /opt/cis490/scripts/install-lab-host.sh
install-lab-host.sh does the rest:
- Re-stamps /opt/cis490/VERSION to the new HEAD.
- Drains pre-stamp episodes via tools/quarantine_unstamped.py so the queue stops looping on them. Drained episodes go to /var/lib/cis490/data/quarantine/ with a quarantine_reason.json per-episode for triage.
- Restarts cis490-shipper and cis490-orchestrator so the new code takes effect.
Do not disable the shipper to silence the log noise — once a host
has the new code, traffic resumes immediately. Do not mint a fake
code_version field in old episodes to bypass the gate; that field
exists specifically to keep buggy pre-fix data out of the training
index.
If the receiver is rejecting new episodes too (you've pulled and
restarted, but still see 412), the receiver's allow-list window may
not yet include your commit — wait 5s for its Forgejo refresh, or
push your commit to origin/main first if you're testing
unmerged work.
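The two rejection codes in this section correspond to two separate checks. Here is a minimal sketch of that decision, assuming the commit travels in the X-Cis490-Code-Commit header named in the symptom table and that the allow-list is a set of commit hashes. The function name `gate_status` and the choice to return None on acceptance are illustrative, not the receiver's actual code.

```python
from typing import Mapping, Optional, Set

HEADER = "X-Cis490-Code-Commit"


def gate_status(headers: Mapping[str, str], allow_list: Set[str]) -> Optional[int]:
    """Return the rejection status for an upload, or None if it passes the gate."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    commit = lowered.get(HEADER.lower(), "").strip()
    if not commit:
        return 400   # pre-stamp episode: no commit was sent at all
    if commit not in allow_list:
        return 412   # commit was sent, but it is outside the allow-list
    return None
```

Read this way, a 400 means the host is shipping episodes produced before the VERSION stamp existed, while a 412 means the host is stamped but running a commit the receiver does not accept. Both are resolved by the same pull-and-reinstall command.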
Tier 3 + Tier 4 deploy (zero-touch via install-lab-host.sh)
install-lab-host.sh runs Tier-3 deploy automatically on its second
pass (after the mTLS cert lands). No operator interaction is needed:
metasploit-framework auto-installs via the Rapid7 omnibus, the
Metasploitable2 image auto-fetches from a public mirror with TOFU
sha256 pinning, the host-only bridge auto-comes-up, and a live
exploit fire is verified before the script returns.
To re-run the deploy by hand or on a host where Tier 3 was skipped:
sudo /opt/cis490/scripts/install-tier-3-4.sh
It's idempotent — re-running on an already-deployed host is a no-op except for the verify step. Inputs are all optional env vars:
| var | effect |
|---|---|
| SKIP_VERIFY | skip the live vsftpd_234_backdoor smoke run |
| SKIP_BRIDGE | skip br-malware setup (limits to 2 of 5 modules) |
| SKIP_TIER4 | skip the Tier-4 auto-fetch (DEPRECATED; leaves you with mimic-only data, defeats the project) |
The fleet runner auto-detects Tier-3 readiness via
orchestrator/fleet.py::_msfrpcd_available(). Once
cis490-msfrpcd.service is up and metasploitable2.qcow2 is on
disk, the next wave produces Tier-3 episodes (meta.exploit.module_name
populated). No orchestrator restart is required, but a restart speeds
up the switch.
Tier-4 (real malware execution) is mandatory, fully automated
Real-binary episodes are the project's training target — Tier-4 is NOT optional. A lab-host deploy that lands without real samples fails loudly; mimic-only data does not answer the research question.
There is no operator step. No API key, no signup, no manual
provisioning. install-tier-3-4.sh runs tools/auto_fetch_samples.py
which:
- Clones (or pulls) theZoo from https://github.com/ytisf/theZoo to /var/lib/cis490/theZoo (~500 MB shallow clone, public, GPL-3.0, security-research repo)
- For each [[sample]] in manifest.toml without a sha256, locates a directory in theZoo/malware/Binaries/ whose name matches the entry's family (case-insensitive substring + prefix priority)
- Extracts the password-protected .zip (well-known password infected)
- Picks the largest non-text payload as the binary, computes its sha256, copies to /opt/cis490/samples/store/<sha256>
- Rewrites manifest.toml in place, atomically (tempfile + os.replace preserving stat), adding source = "theZoo", sha256 = "<hex>", and the upstream URL
If auto_fetch_samples.py lands zero binaries (theZoo layout drift,
git clone failure, or a family has no matching directory),
install-tier-3-4.sh exits non-zero. No silent mimic-only fallback.
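Three of the steps listed above (family matching, hashing, and the atomic manifest rewrite) are easy to get subtly wrong, so here is a hedged sketch of each. It is not the code in tools/auto_fetch_samples.py. The helper names are hypothetical, the tie-break after "prefix first" is an assumption, and "preserving stat" is interpreted here as keeping the permission bits.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def match_family(family: str, dir_names: Iterable[str]) -> Optional[str]:
    """Case-insensitive substring match; names starting with the family win."""
    needle = family.lower()
    hits = [d for d in dir_names if needle in d.lower()]
    if not hits:
        return None
    # Prefix matches first, then shorter names, then alphabetical (assumed tie-break).
    return min(hits, key=lambda d: (not d.lower().startswith(needle), len(d), d))


def sha256_of(path: Path) -> str:
    """Stream the file so large binaries are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def atomic_rewrite(path: Path, new_text: str) -> None:
    """Write to a temp file in the same directory, then swap it into place."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(new_text)
        shutil.copymode(path, tmp)   # keep the original permission bits
        os.replace(tmp, path)        # atomic when both are on one filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

The temp file lives next to the target because os.replace is only atomic within a single filesystem. This is also why hand-edits to manifest.toml can be lost: an edit made between the fetcher's read and its os.replace is overwritten wholesale.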
The orchestrator's next selection that picks a sample with
kind == "real" runs the real binary via the chunked-upload path
(exploits.driver._resolve_workload). The mimic profile remains the
fallback for episodes that select a sample whose binary isn't on
disk. Trainers filter on meta.sample.kind ∈ {"real", "mimic"}.
Confirm Tier 3+4 are flowing
# On the Pi maintainer side:
sudo python3 -c "
import json, glob, subprocess, tarfile, io
from collections import Counter
mods = Counter(); kinds = Counter()
for tar in glob.glob('/var/lib/cis490/episodes/*/*.tar.zst'):
    z = subprocess.check_output(['zstd','-q','-d','--stdout',tar],stderr=subprocess.DEVNULL)
    with tarfile.open(fileobj=io.BytesIO(z)) as t:
        for m in t.getmembers():
            if m.name.endswith('meta.json') and m.isfile():
                meta = json.load(t.extractfile(m))
                mods[(meta.get('exploit') or {}).get('module_name','<none>')] += 1
                kinds[(meta.get('sample') or {}).get('kind','<none>')] += 1
                break
print('exploit modules used:', dict(mods))
print('sample kinds:', dict(kinds))
"
If mods is {'<none>': N} and kinds is {'mimic': N}, Tier 3
hasn't kicked in yet on any lab host — re-run
install-tier-3-4.sh there. If mods shows
{'vsftpd_234_backdoor': N, ...} and kinds shows a non-zero
'real' count, both tiers are live.
Don't shortcut
- DO NOT install metasploit-framework system-wide outside install-msfrpcd.sh. The script wires the systemd unit + creds; a manual install bypasses the orchestrator's _msfrpcd_available() probe.
- DO NOT add bogus sha256 entries to manifest.toml; auto_fetch_samples.py hash-verifies every binary it stages.
- DO NOT add real-binary entries by hand when auto_fetch_samples.py exists. Hand-edits are racy with the auto-fetcher's tempfile rewrite.
Securing the connection (mTLS) — DO NOT mint your own certs
The lab-host ↔ Pi connection is mTLS over WireGuard. The cert
delivery is fully automated. You should never run openssl, write
a CSR, edit a Caddyfile, or generate a private key on the lab host.
If you find yourself doing any of that, you're off the runbook.
The actual cert flow:
- The lab host comes up on WireGuard via wg-enroll (USB-driven, one-time, separate project). After this, the lab host can reach bootstrap.wg and collector.wg on the 10.100.0.0/24 overlay.
- scripts/install-lab-host.sh, on its way through, pulls the leaf cert + CA bundle from https://bootstrap.wg/v1/cert/<host_id> over plain TLS (CA bundled in etc/caddy-root.crt). Trust boundary is "this peer is on the WG mesh"; iptmonads already gates the bootstrap port to enrolled peers.
- The fetch step is a no-op if host_id is still the default REPLACE_ME in /etc/cis490/lab-host.toml. This is the most common reason agents think cert delivery is broken.
The one fix that resolves 95 % of "cert/TLS/connection" reports:
# 1. Make sure host_id is set:
sudo grep '^host_id' /etc/cis490/lab-host.toml
# If it says "REPLACE_ME", edit it to the real host_id you registered.
# 2. Re-run the installer. It will fetch the cert from bootstrap.wg.
sudo /opt/cis490/scripts/install-lab-host.sh
# 3. Confirm certs landed:
ls -l /etc/cis490/certs/ # expect lab-host.pem, lab-host.key, wg-ca.pem
# 4. Smoke-test the pipe:
sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
--config /etc/cis490/lab-host.toml --ping
# {"ok": true, ...} → done.
If step 2 prints WARN: bootstrap.wg fetch failed, the cause is
almost always one of:
- bootstrap.wg DNS not resolving → add to /etc/hosts: echo '10.100.0.1 bootstrap.wg collector.wg' | sudo tee -a /etc/hosts
- wg0 interface not up → sudo wg show should list a peer; if not, re-run wg-enroll.
- The Pi's cis490-bootstrap.service is down → file an issue against the receiver-side host, not against this repo.
What you should NOT do, even if it feels like it would help:
- Generate certs with openssl or step-cli on the lab host.
- Copy certs from another lab host.
- Set verify_tls = false in lab-host.toml to "skip TLS for now."
- Restart the shipper repeatedly hoping it self-heals; it already retries on every request without restart.
- File a Forgejo issue titled "shipper can't connect" without first running the four-line block above and pasting its output.
The shipper's waiting on mTLS material log line is expected
during first-boot until the cert lands. It is not an error to fix.
The transport rebuilds the SSL context on each request, so the
moment certs land in /etc/cis490/certs/, the next ping/ship
attempt succeeds — no restart needed.
Common bring-up gotchas (read this before debugging an install)
Smaller models acting as on-device agents have hit these traps. Each
one is now fixed in main, but if you're on an older clone you may
still see the symptom — pull origin/main first, then re-read.
Run tools from /opt/cis490, not from a manual clone
When you run cis490-doctor from a clone like ~/.env/CIS490/,
Python prepends the clone path to sys.path. Subprocesses spawned
by the doctor (e.g., python -m shipper --ping) inherit the calling
CWD and pick up the clone's shipper/ package instead of the
service venv at /opt/cis490/. Symptom: tracebacks reference the
clone path, or No module named exploits despite package = false.
Fix already in main: the doctor passes cwd=/opt/cis490 to the
shipper subprocess and inserts repo_root into sys.path itself.
Operator action: always invoke either as
/opt/cis490/.venv/bin/python /opt/cis490/tools/cis490_doctor.py
or via cd /opt/cis490 && ./tools/cis490_doctor.py. Don't run from a
clone unless you know what you're doing.
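The fix described above combines two standard moves: pin the child's working directory, and put the repo root first on sys.path. A minimal sketch, assuming nothing about the doctor's internals beyond what is stated here (`run_pinned` and `pin_sys_path` are hypothetical names):

```python
import subprocess
import sys
from pathlib import Path
from typing import Sequence


def run_pinned(args: Sequence[str], repo_root: Path) -> subprocess.CompletedProcess:
    """Run a child Python with its working directory pinned to repo_root."""
    return subprocess.run(
        [sys.executable, *args],
        cwd=repo_root,          # `-m pkg` resolves against the child's CWD, not the caller's
        capture_output=True,
        text=True,
        check=False,
    )


def pin_sys_path(repo_root: Path) -> None:
    """Make in-process imports resolve against repo_root first."""
    root = str(repo_root)
    if root in sys.path:
        sys.path.remove(root)
    sys.path.insert(0, root)
```

With `python -m shipper`, the interpreter puts the current working directory at the front of sys.path, so an unpinned child started from ~/.env/CIS490/ imports that clone's shipper/ package even when the interpreter itself is the /opt/cis490 venv.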
Shipper logs "waiting on mTLS material" — this is expected, not a bug
The cis490-shipper unit is enabled by install-lab-host.sh before
the Pi has issued the host's mTLS leaf. The transport pre-flights the
configured ca_bundle / client_cert / client_key paths and, if
any are missing, defers building the SSL context. You'll see one
warning per process lifetime:
shipper waiting on mTLS material (client_cert path missing: …); will retry each request
The unit stays up. Each ping/ship attempt re-tries the build. Once
the Pi runs deploy-cis490-cert.sh <host_id> <wg_ip> and the leaf
lands at /etc/cis490/certs/, the next request succeeds and the
transport logs mTLS material now on disk; shipper transport ready.
Do not try to "fix" the warning by restarting the unit, deleting the config, or hand-rolling certs — just confirm the Pi-side step ran and wait one scan interval.
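The deferred-build behaviour is why a restart never helps. A rough sketch of the pattern, using only the standard ssl module; the class name `LazyMTLS` and its layout are illustrative and are not the shipper's actual transport code:

```python
import logging
import ssl
from pathlib import Path
from typing import Optional

log = logging.getLogger("shipper.transport")


class LazyMTLS:
    """Defer building the SSL context until all three files exist on disk."""

    def __init__(self, ca_bundle: str, client_cert: str, client_key: str) -> None:
        self.paths = {
            "ca_bundle": Path(ca_bundle),
            "client_cert": Path(client_cert),
            "client_key": Path(client_key),
        }
        self._warned = False

    def context(self) -> Optional[ssl.SSLContext]:
        """Call before every request; returns None while material is missing."""
        missing = [name for name, p in self.paths.items() if not p.is_file()]
        if missing:
            if not self._warned:   # one warning per process lifetime
                log.warning("waiting on mTLS material (%s path missing); "
                            "will retry each request", missing[0])
                self._warned = True
            return None
        ctx = ssl.create_default_context(cafile=str(self.paths["ca_bundle"]))
        ctx.load_cert_chain(str(self.paths["client_cert"]),
                            str(self.paths["client_key"]))
        return ctx
```

Because the file check runs on every call, the first request after the certs land gets a real context. No process state needs resetting, which is why restarting the unit only adds noise.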
Outdated clone? Pull main first.
A long list of install-time bugs (cp self-copy, missing service restart, fatal-loop quarantine, ca_bundle pointing at the wrong chain, busybox pgrep flags, pycdlib in the wrong dep group, missing vm/images/ symlink target, doctor sys.path) has been fixed in main. If you hit any "this used to work" symptom on a host that hasn't pulled in a while, the canonical command is always the same:
cd /opt/cis490 && sudo -u cis490 git pull origin main && \
sudo /opt/cis490/scripts/install-lab-host.sh
That one command:
- Re-stamps /opt/cis490/VERSION so episodes get a valid code_version.commit, which the receiver's gate requires.
- Drains pre-stamp episodes from data/episodes/ to data/quarantine/ via tools/quarantine_unstamped.py so the queue stops looping on them.
- Runs daemon-reload and systemctl restart cis490-shipper cis490-orchestrator so the live daemons pick up the new code (a bare git pull does NOT do this; Python module objects in the running process are frozen at last service start).
- Re-runs the Tier-3+4 deploy idempotently if the cert is on disk.
After it returns, the shipper will be running as Type=notify with
WatchdogSec=180 — systemd kills + restarts it if a scan pass hangs.
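For context on what Type=notify plus WatchdogSec asks of a daemon: the process sends short datagrams to the socket systemd names in NOTIFY_SOCKET. The sketch below shows that protocol with the standard library only. Where the shipper actually sends its pings is an assumption here (once per scan pass), and `run_passes` is a hypothetical stand-in for its loop.

```python
import os
import socket
from typing import Callable


def sd_notify(message: str) -> bool:
    """Send one datagram to systemd's notify socket; False if not under systemd."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):           # abstract-namespace socket (Linux)
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), addr)
    return True


def run_passes(scan_once: Callable[[], None], passes: int) -> None:
    """Run a bounded number of scan passes, pinging the watchdog after each."""
    sd_notify("READY=1")               # Type=notify: startup has finished
    for _ in range(passes):
        scan_once()
        sd_notify("WATCHDOG=1")        # must arrive within WatchdogSec
```

If a scan pass hangs, no WATCHDOG=1 arrives within the window and systemd restarts the unit, which matches the behaviour described above.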
The classifier is multi-source — don't gut episodes on /proc alone
tools/prune_episodes.py cross-checks four telemetry sources before
flagging an episode as flat:
- telemetry-proc.jsonl: host qemu-system /proc CPU%
- netflow.jsonl: bridge_pcap byte counters (network profiles)
- telemetry-qmp.jsonl: virtio blockstats per-phase delta (io-walk, ransomware-shape)
- telemetry-guest.jsonl: in-guest agent load_1m (low-and-slow, any host with a working agent)
An episode flags as flat-cpu only when EVERY available source
shows no inter-phase variation. If /proc is flat but qmp blockstats
show 90 MB written during infected_running, the episode is kept —
the host /proc collector loses signal under contention but qmp sees
through. This is essential on laptop-class lab hosts (e.g.
elliott-thinkpad) where the guest is co-scheduled with 13 other VMs
and the per-VM /proc CPU% gets buried.
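The "every available source" rule can be written down compactly. This is a sketch of the rule as stated, not the logic in tools/prune_episodes.py: the 10 % relative threshold, the helper names, and the choice not to flag an episode with no sources at all are assumptions.

```python
from typing import Mapping, Optional, Sequence


def has_variation(per_phase: Sequence[float], rel_tol: float = 0.10) -> bool:
    """True if per-phase values differ by more than rel_tol of their peak."""
    if len(per_phase) < 2:
        return False
    lo, hi = min(per_phase), max(per_phase)
    scale = max(abs(lo), abs(hi))
    return scale > 0 and (hi - lo) > rel_tol * scale


def is_flat(sources: Mapping[str, Optional[Sequence[float]]]) -> bool:
    """Flat only when every *available* source shows no inter-phase variation."""
    available = {name: vals for name, vals in sources.items() if vals}
    if not available:
        return False   # no evidence either way: do not flag (assumption)
    return not any(has_variation(vals) for vals in available.values())
```

The qmp example from the text then works out as expected: a flat /proc series alongside a qmp series of `[0.0, 90e6, 0.0]` bytes written is not flat, so the episode is kept.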
All four sources stamp t_wall_ns; phase mapping uses that, not
t_mono_ns, because /proc and labels are orchestrator-relative
while netflow/guest are wall-clock-anchored. If you add a new
collector, emit t_wall_ns from CLOCK_REALTIME on every row or your
data will silently bucket into "(pre)".
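To see why a collector that emits the wrong clock lands in "(pre)", consider a simple phase lookup keyed on wall-clock nanoseconds. This is a sketch under assumptions: `phase_of` and `stamp` are hypothetical helpers, and the phase name "baseline" is invented for the example ("infected_running" is the one named above).

```python
import bisect
import time
from typing import Sequence, Tuple


def phase_of(t_wall_ns: int, phase_starts: Sequence[Tuple[int, str]]) -> str:
    """Map a timestamp to the phase whose start it most recently passed.

    phase_starts must be sorted ascending by start time.
    """
    starts = [start for start, _ in phase_starts]
    i = bisect.bisect_right(starts, t_wall_ns) - 1
    return phase_starts[i][1] if i >= 0 else "(pre)"


def stamp(row: dict) -> dict:
    """Attach both clocks; time.time_ns() reads the wall clock (epoch-based)."""
    return {**row, "t_wall_ns": time.time_ns(), "t_mono_ns": time.monotonic_ns()}
```

A monotonic reading is typically nanoseconds since boot, which is far smaller than any epoch-based phase start. Every such row sorts before the first phase and is bucketed as "(pre)" without raising an error.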
Don't trust the in-guest probe alone — cross-check host CPU
The pre_kill_probe.yes / pre_kill_probe.sh fields in
workload_killed events are produced by pgrep running inside an
Alpine guest. busybox's pgrep does NOT support the -c flag. Older
versions of VMLoadController._probe() used pgrep -c yes, which
exits 1 with a usage banner on busybox; the || echo 0 fallback then
always reported yes=0 regardless of whether the workload was
running. This caused 244 episodes from elliott-thinkpad and
k-gamingcom to be incorrectly labelled workload-silent.
The fix landed in main (probe now uses pgrep yes | wc -l); episodes
shipped after that commit have correct probe values. For older
episodes still on disk, the prune classifier now requires flat-cpu
(host-side CPU envelope confirms no signal) AND the probe to flag
workload-silent — a probe-only zero is no longer trusted. So you can
safely run cis490-prune --archive against the existing data without
losing valid episodes.
If you write any new in-guest diagnostic that runs commands via
SerialClient, assume busybox/ash semantics: no disown builtin, no
GNU pgrep -c, no bash /dev/tcp, no [[ ]]. Always pair an
in-guest signal with the host-side /proc measurement before you
declare an episode bad.
One traceback at a time
When the doctor lights up multiple red rows, fix the topmost one and re-run rather than batching attempts. Each red row prints the exact operator command it expects you to run. Don't paraphrase or invent adjacent commands; the doctor is the source of truth for what's missing.
How an agent generates data on demand (without waiting for the timer)
# One labeled episode (90 s) with a chosen sample profile:
sudo -u cis490 /opt/cis490/.venv/bin/python \
/opt/cis490/tools/run_real_vm_demo.py \
--data-root /var/lib/cis490/data \
--sample mirai-class-bot
# Force the shipper to run one pass:
sudo systemctl start cis490-shipper.service # (if disabled)
# or:
sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
--config /etc/cis490/lab-host.toml --once
# Confirm on the Pi:
ssh <pi> 'sudo tail -3 /var/lib/cis490/index.jsonl'
Any tool that wants to count what's been shipped:
ssh <pi> 'sudo /opt/cis490/.venv/bin/python /opt/cis490/tools/index_reader.py \
--index /var/lib/cis490/index.jsonl --count-by host_id'
Quick test for "is the pipe alive at all"
The --ping mode of the shipper does an authenticated POST to
/v1/ping and writes nothing to the database. Use it as the
smallest E2E test:
sudo -u cis490 /opt/cis490/.venv/bin/python -m shipper \
--config /etc/cis490/lab-host.toml --ping
# {"ok": true, "status_code": 200, ...} on success.
# Any other status is the operator's signal to read the error string.
File an issue when you hit a problem
When you run into an issue you cannot fully resolve in the current turn, file it as a Forgejo issue on the relevant repo before moving on. Do not silently log a TODO comment, leave a partial workaround, or assume someone else will remember. The issue tracker is the durable record.
This applies to:
- a build / test / typecheck failure you can't fix in scope
- a bug you discover but aren't tasked with fixing
- a missing dep, missing config, or env-only failure that blocks E2E
- a design gap you've worked around but want a follow-up to fix properly
- a scope-out you made (e.g. "deferred Tier 4 sample fetch") that needs an owner so it doesn't get lost
Don't file an issue when:
- the user is in the conversation and you can just tell them
- it's already filed (search first: GET /api/v1/repos/<owner>/<repo>/issues?state=open&q=<keyword>)
- it's truly a non-issue (a one-line edit you're about to make this same turn)
How to file (Forgejo API)
The local Forgejo at http://10.100.0.1:3000 accepts API calls with a
token-bearer header:
curl -s -X POST \
-H "Authorization: token <TOKEN>" \
-H "Content-Type: application/json" \
http://10.100.0.1:3000/api/v1/repos/spectral/<repo>/issues \
-d '{
"title": "<short, action-oriented title>",
"body": "<context, repro, attempted fixes, suggested next step>"
}'
The token comes from the user's session — never embed one in code or commits.
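For agents that work from Python rather than a shell, the same two calls (search first, then create) can be built with the standard library. This is a sketch: the function names are hypothetical, the endpoints and header are the ones shown above, and the token is passed in as an argument so it never lives in the code.

```python
import json
import urllib.parse
import urllib.request

BASE = "http://10.100.0.1:3000/api/v1"


def search_request(owner: str, repo: str, keyword: str,
                   token: str) -> urllib.request.Request:
    """Build the 'is it already filed?' query for open issues."""
    query = urllib.parse.urlencode({"state": "open", "q": keyword})
    return urllib.request.Request(
        f"{BASE}/repos/{owner}/{repo}/issues?{query}",
        headers={"Authorization": f"token {token}"},
        method="GET",
    )


def create_request(owner: str, repo: str, title: str, body: str,
                   token: str) -> urllib.request.Request:
    """Build the POST that files a new issue."""
    payload = json.dumps({"title": title, "body": body}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE}/repos/{owner}/{repo}/issues",
        data=payload,
        headers={"Authorization": f"token {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then `urllib.request.urlopen(req, timeout=10)`. Keeping construction separate from sending makes it easy to inspect or log the request without touching the network.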
What a good issue body contains
- Context — one sentence on what was being attempted.
- What happened — the actual error, log line, or unexpected behavior. Paste exact output.
- What was tried — every workaround you attempted and why it didn't stick.
- Suggested next step — the smallest change that would resolve it, if you have a guess. "Unknown" is a fine answer.
- Related — link the commit / PR / file:line where the issue surfaced.
What a good title looks like
| Bad | Good |
|---|---|
| tests broken | tests/test_episode.py: race when t_mono_origin_ns is set in run() not __init__ |
| caddy thing | Caddy: client_auth requires absolute path; relative trusted_ca_cert_file silently fails |
| fix later | shipper: 5xx backoff cap is 5min, doc says 1min; pick one |
After filing
- Reference the issue number in the next commit message: Refs spectral/<repo>#<n> or Closes spectral/<repo>#<n> if your current change actually fixes it.
- If the issue is on a different repo than the one you're committing to, fully qualify: spectral/wg-pki#3.
Other conventions
- Don't put off the hard parts. Frame "deferred-with-reason" only for genuine blockers (binary not present on this machine, external service unreachable). For anything you could do but find awkward — bridge setup, cross-arch quirks, fleet concurrency — do it. The user has flagged this twice when work was scoped down prematurely. When something genuinely is blocked by an operator artifact, file the Forgejo issue and automate the bring-up (e.g., installer script + sha256-verifying fetcher) so the moment the artifact lands it Just Works.
- Naming: never coin USB / device / service names on the user's behalf. Ask first. Reusing an old name is especially bad.
- /etc configs: Read first, copy second. Never overwrite a /etc/... file from a template without checking what's actually there.
- wg-enroll scope: creation-only. Don't add admin / service-activation features to it.
- Don't expand a project's binary name beyond its own boundary: openclaw is the queue/permissions binary in openclaw-deploy. This repo is wg-enroll (or its caller). Don't conflate.