Max Gorog bfb1c491f8 PIPELINE.md is canonical; rewrite AGENTS.md; delete FIXYOURSELF.md

PIPELINE.md is the canonical plan for the data-collection / emulation
/ labelling pipeline. It supersedes any guidance in AGENTS.md,
README.md, or other repo docs that contradicts it (§17). Future
sessions read it before changing anything in the pipeline.

AGENTS.md is rewritten to point at PIPELINE.md as canonical and to
strip the prescriptive symptom→fix table that absorbed producer-side
defects instead of fixing them (§7.1 compensating-layer pattern).

FIXYOURSELF.md is deleted (§4.12, §7.10 recovery-layer pattern). The
states it covered are made impossible by the §4.6 acceptance gate
landing later in §5; recovering from a state that shouldn't exist is
itself the bandaid we're removing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-03 17:04:43 -05:00

7.4 KiB


AGENTS.md — guidance for AI agents working on this repo

This project is part of the spectral lab (http://maxgit.wg/spectral/). The conventions below also apply to sibling repos (wg-enroll, wg-pki, caddy, iptmonads, matrix, forgejo, vault, openclaw-deploy).

⚡ FIRST: read PIPELINE.md

PIPELINE.md is the canonical plan for this repo. Read it before changing anything in the data-collection / emulation / labelling pipeline. If anything in this file or any other doc contradicts PIPELINE.md, PIPELINE.md wins and the other doc is wrong.

This file is for general engineering conventions. The pipeline correctness story lives in PIPELINE.md.

What this project is

CIS490 trains a behavioral malware-detection model from labelled episodes captured on lab-host VMs running real or mimic workloads, optionally driven into infected states by Metasploit modules. The producer is the orchestrator on each lab host; the consumer is the receiver on the Pi (office-print, 10.100.0.1).

The producer must ship only ground-truth episodes. The receiver must reject anything that doesn't meet the bar. See PIPELINE.md.

Hard rules — do not break these

Do not silently downgrade a host. If a collector is silent, an exploit doesn't land, or a dependency is missing, the host produces zero episodes and says so loudly. There is no "ship what we can" fallback.
Do not write a label that an event didn't justify. Phase labels come from observed events, not from the schedule clock. See PIPELINE.md §4.5.
Do not add a module to the catalog without verifying it lands a session against its declared target. See PIPELINE.md §4.3.
Do not add per-host config overrides. One canonical manifest; hosts that can't run it produce nothing. See PIPELINE.md §4.1.
Do not bypass the dirty-tree gate except via the CIS490_ALLOW_DIRTY=1 env var (logged, stamped, audited). No "skip preflight," no verify_tls=false, no other override knobs.
Do not run openssl, step-cli, mint keys, or write CSRs. Cert delivery is automated. If you find yourself touching a private key on a lab host, stop.
Do not file a Forgejo issue without first running cis490-doctor and pasting its output.
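As an illustration of the dirty-tree rule above, here is a minimal sketch of a gate that refuses to run on a dirty working tree unless CIS490_ALLOW_DIRTY=1 is set, and that logs and stamps the override when it is used. The function names, the logger name, and the shape of the returned stamp are assumptions for illustration; the actual gate is defined by PIPELINE.md and the orchestrator, not by this snippet.

```python
import logging
import subprocess

log = logging.getLogger("cis490.gate")  # hypothetical logger name


def tree_is_dirty(repo_dir: str) -> bool:
    """Return True if `git status --porcelain` reports any change."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "status", "--porcelain"],
        check=True, capture_output=True, text=True,
    ).stdout
    return bool(out.strip())


def dirty_tree_gate(dirty: bool, env: dict) -> dict:
    """Decide whether a run may proceed and return a stamp for the run record.

    Raises SystemExit when the tree is dirty and the override is not set.
    The stamp layout is an assumption, not the repo's real schema.
    """
    if not dirty:
        return {"dirty": False, "override": False}
    if env.get("CIS490_ALLOW_DIRTY") == "1":
        # The override is the only sanctioned bypass, so make it visible.
        log.warning("dirty tree allowed via CIS490_ALLOW_DIRTY=1")
        return {"dirty": True, "override": True}
    raise SystemExit("working tree is dirty; refusing to run")
```

A caller would typically pass `tree_is_dirty("/opt/cis490")` and `os.environ`, then write the returned stamp alongside whatever it produces so the override can be audited later.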

How a lab host gets to "shipping data"

This will be rewritten as PIPELINE.md §4 lands. The current scripts/install-lab-host.sh does most of the right things but does not yet enforce the canonical manifest, target-VM build, catalog verification, or preflight. Until those land, treat the install script as in-flight and assume a fresh lab host will produce nothing until the bar is met.

The bar (when in place) will be:

Repo cloned to /opt/cis490, working tree clean, HEAD on origin/main.
Every binary required by the active collector set and the module catalog is on PATH.
Every target VM image built from the in-repo spec, sha256-pinned.
Every module in the catalog passes scripts/verify-catalog.sh against its target.
Every collector in the active set passes its emit-test.
orchestrator/preflight.py exits 0.

Once that's true, systemctl enable --now cis490-shipper cis490-orchestrator brings the host online. The orchestrator runs the canonical experiment; the shipper PUTs sealed episodes to the receiver. Episodes that don't pass the acceptance gate go to data/rejected/<id>/ locally and are never shipped.
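Two of the checks in the bar above (binaries on PATH, sha256-pinned images) can be sketched as follows. This is not the contents of orchestrator/preflight.py; the function names, the `pins` mapping, and the failure messages are hypothetical, and the real preflight covers more than these two checks. The point it shows is the all-or-nothing shape: any failure is printed and the exit code is non-zero.

```python
import hashlib
import shutil
import sys
from pathlib import Path


def missing_binaries(names):
    """Return the names that shutil.which cannot find on PATH."""
    return [n for n in names if shutil.which(n) is None]


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large VM images are not read at once."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def image_mismatches(pins):
    """pins maps an image path to its expected sha256 hex digest."""
    bad = []
    for path, expected in pins.items():
        p = Path(path)
        if not p.is_file() or sha256_of(p) != expected:
            bad.append(path)
    return bad


def preflight(binaries, pins) -> int:
    """Return 0 only if every check passes; report each failure on stderr."""
    failures = [f"missing binary: {n}" for n in missing_binaries(binaries)]
    failures += [f"image pin mismatch: {p}" for p in image_mismatches(pins)]
    for line in failures:
        print(line, file=sys.stderr)
    return 1 if failures else 0
```

A script built this way would end with `sys.exit(preflight(...))`, so that a unit or wrapper can depend on the exit code alone.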

Securing the connection (mTLS) — DO NOT mint your own certs

The lab-host ↔ Pi connection is mTLS over WireGuard. Cert delivery is automated via bootstrap.wg/v1/cert/<host_id>. You should never run openssl, write a CSR, edit a Caddyfile, or generate a private key on the lab host. If you find yourself doing any of that, you're off the runbook.

The most common reason cert fetch appears to fail is host_id still being REPLACE_ME in /etc/cis490/lab-host.toml. Check that first.

The shipper's "waiting on mTLS material" log line is expected during first boot until the cert lands. It is not an error. The transport rebuilds the SSL context on each request, so the moment certs land in /etc/cis490/certs/, the next attempt succeeds with no restart needed.
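The rebuild-per-request behaviour can be sketched like this. It is an illustration of the pattern, not the shipper's actual transport code, and the file names client.crt, client.key, and ca.crt are hypothetical; only the directory /etc/cis490/certs/ comes from this document.

```python
import ssl
from pathlib import Path


def build_context(cert_dir):
    """Build a fresh mTLS client context, or return None if material is absent.

    File names inside cert_dir are assumptions for illustration.
    """
    d = Path(cert_dir)
    cert, key, ca = d / "client.crt", d / "client.key", d / "ca.crt"
    if not all(p.is_file() for p in (cert, key, ca)):
        return None  # caller logs "waiting on mTLS material" and retries later
    # Trust only the lab CA, and present the client cert to the receiver.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=str(ca))
    ctx.load_cert_chain(certfile=str(cert), keyfile=str(key))
    return ctx
```

Because the function is called before every attempt and holds no cached state, newly delivered certs are picked up on the next request without restarting the service.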

Filing issues

When you run into an issue you cannot fully resolve in the current turn, file it as a Forgejo issue on the relevant repo. Do not silently log a TODO comment, leave a partial workaround, or assume someone else will remember.

File issues for:

A build / test / typecheck failure you can't fix in scope.
A bug you discover but aren't tasked with fixing.
A missing dep, missing config, or env-only failure that blocks E2E.
A design gap you've worked around but want a follow-up to fix properly.

Don't file when:

The user is in the conversation and you can just tell them.
It's already filed (search first: GET /api/v1/repos/<owner>/<repo>/issues?state=open&q=<keyword>).
It's truly a non-issue (a one-line edit you're about to make this same turn).

How to file (Forgejo API)

curl -s -X POST \
  -H "Authorization: token <TOKEN>" \
  -H "Content-Type: application/json" \
  http://10.100.0.1:3000/api/v1/repos/spectral/<repo>/issues \
  -d '{
    "title": "<short, action-oriented title>",
    "body":  "<context, repro, attempted fixes, suggested next step>"
  }'

The token comes from the user's session — never embed one in code or commits.
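For agents working in Python rather than shell, the search-then-file flow might look like the sketch below. It uses only the two endpoints shown in this document (the issue search and the issue POST) with the standard library; the function names are hypothetical, and the token is passed in by the caller rather than stored anywhere.

```python
import json
import urllib.parse
import urllib.request

BASE = "http://10.100.0.1:3000/api/v1"


def search_request(owner, repo, keyword, token):
    """Build the GET request that looks for an already-open issue."""
    query = urllib.parse.urlencode({"state": "open", "q": keyword})
    url = f"{BASE}/repos/{owner}/{repo}/issues?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"token {token}"})


def create_request(owner, repo, title, body, token):
    """Build the POST request that files a new issue."""
    payload = json.dumps({"title": title, "body": body}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE}/repos/{owner}/{repo}/issues",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )


def file_if_new(owner, repo, keyword, title, body, token):
    """Return the first matching open issue, or file a new one."""
    with urllib.request.urlopen(search_request(owner, repo, keyword, token)) as resp:
        existing = json.load(resp)
    if existing:
        return existing[0]
    with urllib.request.urlopen(create_request(owner, repo, title, body, token)) as resp:
        return json.load(resp)
```

Splitting request construction from sending keeps the network call in one place and lets the request shape be checked without reaching the server.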

Good issue body

Context — one sentence on what was being attempted.
What happened — the actual error or unexpected behavior. Paste exact output.
What was tried — every workaround you attempted and why it didn't stick.
Suggested next step — the smallest change that would resolve it, if you have a guess. "Unknown" is fine.
Related — link the commit / PR / file:line where the issue surfaced.

Good titles

Bad	Good
`tests broken`	`tests/test_episode.py: race when t_mono_origin_ns is set in run() not __init__`
`caddy thing`	`Caddy: client_auth requires absolute path; relative trusted_ca_cert_file silently fails`
`fix later`	`shipper: 5xx backoff cap is 5min, doc says 1min — pick one`

After filing, reference the issue in the next commit message: Refs spectral/<repo>#<n> or Closes spectral/<repo>#<n>. Fully qualify cross-repo: spectral/wg-pki#3.
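The reference format can be checked mechanically. The pattern below is a sketch that accepts only the fully qualified form under the spectral owner; the allowed repo-name characters are an assumption, and the helper name is hypothetical.

```python
import re

# Matches "Refs spectral/<repo>#<n>" or "Closes spectral/<repo>#<n>".
ISSUE_REF = re.compile(r"\b(Refs|Closes) (spectral/[A-Za-z0-9._-]+)#(\d+)\b")


def issue_refs(message: str):
    """Return (keyword, owner/repo, number) for each reference in a commit message."""
    return [
        (m.group(1), m.group(2), int(m.group(3)))
        for m in ISSUE_REF.finditer(message)
    ]
```

A commit hook could call `issue_refs` on the message and warn when a commit that follows a filed issue carries no reference.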

Other conventions

Don't put off the hard parts. "Deferred-with-reason" is only for genuine blockers (binary not present on this machine, external service unreachable). For anything you could do but find awkward — bridge setup, cross-arch quirks, fleet concurrency — do it.
No architectural bandaids in the pipeline. Compensating layers (auto-update timers, fix-yourself decision trees, prescriptive symptom→command tables, trainer-side prune scripts that paper over silent collectors) are not allowed in the data-collection / emulation / labelling path. Fix the producer instead. See PIPELINE.md.
Naming: never coin USB / device / service names on the user's behalf. Ask first. Reusing an old name is especially bad.
/etc configs: Read first, copy second. Never overwrite a /etc/... file from a template without checking what's actually there.
wg-enroll scope: creation-only. Don't add admin / service-activation features to it.
Don't expand a project's binary name beyond its own boundary: openclaw is the queue/permissions binary in openclaw-deploy. This repo is wg-enroll (or its caller). Don't conflate.
