CIS490/PIPELINE.md
Max Gorog bfb1c491f8 PIPELINE.md is canonical; rewrite AGENTS.md; delete FIXYOURSELF.md
PIPELINE.md is the canonical plan for the data-collection / emulation
/ labelling pipeline. It supersedes any guidance in AGENTS.md,
README.md, or other repo docs that contradicts it (§17). Future
sessions read it before changing anything in the pipeline.

AGENTS.md is rewritten to point at PIPELINE.md as canonical and to
strip the prescriptive symptom→fix table that absorbed producer-side
defects instead of fixing them (§7.1 compensating-layer pattern).

FIXYOURSELF.md is deleted (§4.12, §7.10 recovery-layer pattern). The
states it covered are made impossible by the §4.6 acceptance gate
landing later in §5; recovering from a state that shouldn't exist is
itself the bandaid we're removing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:04:43 -05:00


PIPELINE.md — the CIS490 generative pipeline honesty plan

This document is canonical. It supersedes any guidance in AGENTS.md, FIXYOURSELF.md, README.md, or other repo docs that contradicts it. If another doc says something different, this doc wins and the other doc is wrong (file an issue or fix it).

This is not an architecture overview. This is a fix list. Read it, implement it, do not split it into phases.

Before proposing any change to the pipeline, re-read §1, §7, and §8 and run your proposal against §8's checklist. Then proceed.


1. Principle

Every episode that reaches the dataset must be ground-truth. Every host runs the same experiment with the same configured catalog. Every exploit module and every collector in the catalog has been proven to work end-to-end before it is eligible to run. There are no compensating layers — no auto-update timers that drag stale peers forward, no "fix-yourself" decision trees, no per-host divergence absorbed by trainer-side filters, no labels written by clock when the event they describe didn't happen.

If a host can't meet the bar, it produces zero episodes and says so loudly. A small honest dataset beats a large dishonest one.

Default to removal, not addition. If a problem can be fixed by deleting code or removing a layer, prefer that. Adding a layer is the suspect default and should be justified against §7 and §8 before proceeding.


2. What the experiments are for

CIS490 trains a behavioral malware-detection model. The dataset is the ground-truth labelled record of what the host looked like during known-clean, known-armed, known-infecting, and known-infected phases of a real exploit chain against a real target service. The model learns to distinguish those phases from in-deployment behavior. Every dishonest label is a poisoned training example.

This is why the producer's job is not "ship lots of episodes." It is "ship episodes whose labels are true."


3. What is currently broken (evidence)

Numbers from the 200-episode quality probe on 2026-05-03:

  1. Labels lie. 0 of 67 Tier-3 exploit fires resulted in a session_open event. All 67 logged session_open_timeout. Yet every one of those 67 episodes is labelled phase=infected_running because the schedule-driven labeller transitions on a clock, not on observed events. The infected_running label in the dataset means "the schedule said so," not "an attacker session was actually open on this host."
  2. Collectors are silent.
    • perf produces 0 rows on 100% of episodes on both hosts.
    • guest-agent produces 0 rows on 100% of episodes on both hosts.
    • qmp, netflow, and pcap produce 0 rows on 100% of k-gamingcom episodes (different config from elliott).
    • The host tcpdump is missing on k-gamingcom; pcap_unavailable is logged then ignored.
  3. The catalog is unverified. Modules are added to the rotation without a per-module verification that the module actually lands a session against its declared target. samba_usermap_script has a 100% failure rate against the configured Metasploitable2 target and was still in the rotation.
  4. Hosts run divergent experiments. elliott and k-gamingcom have different per-host manifests, different collector coverage, different qemu invocations. The dataset is a union of two different experiments, not 200 samples from one.
  5. Working trees are dirty. 200/200 episodes report dirty=true, so code_version.commit is unverifiable provenance.

Each of these is a failure of the producer. Receiver-side filtering and trainer-side prune scripts are bandaids that hide them.


4. The fix — line items

Every item below must land. They are not phases. They are parts of one cohesive correctness story; any of them missing leaves the pipeline half-honest. Each item names its acceptance test.

4.1 Canonical manifest

There is exactly one manifest, version-pinned in the repo at manifest.toml. Every lab host loads the same manifest. There is no per-host manifest override, no per-host collector enable/disable flag, no per-host qemu argument list. Hosts that cannot run the canonical manifest exit 78 at orchestrator startup.

Acceptance: find . -name manifest.toml -not -path './.git/*' returns exactly one path. There is no --manifest CLI flag on the orchestrator that takes a different path; the path is hard-coded. Removing this line item would re-create the host divergence we just exited.
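The acceptance check can also run as a test. A minimal sketch, assuming the repo root is passed in; `find_manifests` and `assert_single_manifest` are hypothetical names, not existing repo code. Unlike the `find` command above, it skips every `.git` directory, not only the top-level one.

```python
import os
from pathlib import Path


def find_manifests(repo_root: str) -> list[Path]:
    """Return every manifest.toml under repo_root, skipping .git directories."""
    found = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Prune .git in place so os.walk never descends into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        if "manifest.toml" in filenames:
            found.append(Path(dirpath) / "manifest.toml")
    return sorted(found)


def assert_single_manifest(repo_root: str) -> Path:
    """Raise if the repo holds zero or several manifests; otherwise return the one path."""
    found = find_manifests(repo_root)
    if len(found) != 1:
        raise AssertionError(
            f"expected exactly one manifest.toml, found {len(found)}: {found}"
        )
    return found[0]
```

A second manifest anywhere in the tree fails the check, which is the point: a per-host copy cannot exist quietly.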

4.2 Target VMs we build, not VMs we fetch

Every target VM image is built from a declarative spec checked into the repo (Packer, mkosi, debootstrap, whatever — declarative). The image build produces a sha256-pinned artifact. The build script verifies, before producing the artifact, that:

  • The vulnerable service is up after first boot.
  • The service is on the port the module catalog declares.
  • The service version matches the version the module catalog declares.

Metasploitable2 from a SourceForge mirror is removed. We don't ship episodes targeting black-box images.

Acceptance: scripts/build-target-<name>.sh exists for every target referenced by an exploit module. Running it produces an image whose post-boot state passes the spec's verification step. The verification step's exit code gates the build's exit code.
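One possible shape for the verification step, as a hedged sketch. The function names, the `declared` dict layout (`port`, `version`), and the idea of passing the observed version in as a string are all assumptions for illustration; how the version is actually read from the guest is per-target and not shown.

```python
import socket
import sys
from typing import Callable


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def verification_failures(
    declared: dict,
    host: str,
    observed_version: str,
    probe: Callable[[str, int], bool] = port_open,
) -> list[str]:
    """Compare the booted image against what the module catalog declares."""
    failures = []
    if not probe(host, declared["port"]):
        failures.append(f"service not reachable on declared port {declared['port']}")
    if observed_version != declared["version"]:
        failures.append(
            f"version {observed_version!r} != declared {declared['version']!r}"
        )
    return failures


def gate_build(failures: list[str]) -> None:
    """The verification result decides the build's exit code."""
    for failure in failures:
        print(f"VERIFY FAIL: {failure}", file=sys.stderr)
    sys.exit(1 if failures else 0)
```

The build script would call `gate_build(verification_failures(...))` as its last step, so no artifact is published when a check fails.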

4.3 Module catalog admission criteria

A module is in the catalog only if it passes a recorded end-to-end verification run against its declared target. The verification is:

  1. Boot the target snapshot.
  2. Fire the module via msfrpcd.
  3. Observe a session_open event (not session_open_timeout).
  4. Observe at least one shell command round-trip on the session.
  5. Confirm guest-side artifact (file written, process spawned — per-module).

If any step fails, the module does not enter the catalog. There is no "tentatively included" tier. Modules already in the catalog are re-verified by scripts/verify-catalog.sh (new) on every release; failures remove the module from the catalog.

Acceptance: every entry in exploits/modules/*.toml has a companion verified_against = "<target_name>" and last_verified = "<commit_sha>" field. scripts/verify-catalog.sh re-runs every entry and exits 0 only if every one passes.
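The field check in the acceptance test could look like the sketch below. It assumes each `exploits/modules/*.toml` file has already been parsed into a dict (for example with `tomllib`); `unverified_modules` is a hypothetical helper. It only proves the stamp is present. Proving the module still lands a session is the job of `scripts/verify-catalog.sh`.

```python
REQUIRED_FIELDS = ("verified_against", "last_verified")


def unverified_modules(catalog: dict[str, dict]) -> dict[str, list[str]]:
    """Map module name -> missing or empty admission fields.

    An empty result means every catalog entry carries both stamps.
    """
    problems = {}
    for name, entry in catalog.items():
        missing = [f for f in REQUIRED_FIELDS if not str(entry.get(f, "")).strip()]
        if missing:
            problems[name] = missing
    return problems
```

An empty string counts as missing, so a placeholder stamp does not pass.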

4.4 Collector admission criteria

A collector is in the active set only if it passes a recorded end-to-end verification run that confirms it emits non-zero rows against a known-busy probe workload.

For each of the six collectors (proc, qmp, netflow, perf, guest, pcap):

  1. Diagnose the current zero-row failure (read the code, run standalone, find the actual cause). Fix the cause.
  2. Add a unit-or-integration test that runs the collector for N seconds against a synthesized workload (a busy-loop process for proc/perf, a packet generator for netflow/pcap, a QMP blockstats query for qmp, a guest heartbeat for guest) and asserts ≥1 row.
  3. The test must run in CI and on every install via the install script.

A collector that cannot pass admission is removed from the active set with a recorded reason — not silently included with zero rows.

Acceptance: pytest tests/test_collectors_emit.py -k <name> passes for each name. The CI run gates merges.
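A minimal sketch of the emit assertion, assuming each collector can be wrapped as a callable that returns the rows produced since the last poll. That interface, and the names `count_rows` and `assert_emits`, are assumptions; the real collectors may expose something different.

```python
import time
from typing import Callable, Iterable


def count_rows(
    collect: Callable[[], Iterable[dict]], seconds: float, interval: float = 0.05
) -> int:
    """Poll `collect` until the deadline and return the total number of rows seen."""
    deadline = time.monotonic() + seconds
    rows = 0
    while time.monotonic() < deadline:
        rows += sum(1 for _ in collect())
        time.sleep(interval)
    return rows


def assert_emits(
    name: str, collect: Callable[[], Iterable[dict]], seconds: float = 1.0
) -> int:
    """Admission bar: a collector that emits zero rows fails, named in the message."""
    rows = count_rows(collect, seconds)
    assert rows >= 1, f"collector {name!r} emitted 0 rows in {seconds}s"
    return rows
```

In the real test the callable would sit on top of the synthesized workload (busy loop, packet generator, QMP query, guest heartbeat), not a stub.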

4.5 Event-driven labelling

Phase labels are written from observed events, never from the schedule clock. The schedule becomes a time budget — maximum time the orchestrator will wait in each phase — not a label source.

Specifically:

  • clean is written at episode start.
  • armed is written when the orchestrator instructs the driver to fire (this is observable in code).
  • infecting is written when the exploit_fire event is observed.
  • infected_running is written only when the session_open event is observed.
  • If session_open_timeout is observed instead, the episode terminates with a failed label and is rejected (see §4.6).
  • dormant and subsequent infected_running transitions are written from observed in-session idle / activity, not from clock.

Per-module timeouts replace the global 30s timeout. Default 120s, configurable per module in exploits/modules/*.toml.

Acceptance: for every shipped episode, every entry in labels.jsonl has a corresponding event in events.jsonl with a matching t_mono_ns within ±100ms. An invariant test asserts this.
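The invariant can be sketched as a pure function over the two parsed JSONL files. `t_mono_ns` comes from the acceptance criterion; the `phase` and `event` field names and the `justifying` mapping are assumptions about the file layout. A phase with no entry in the mapping is treated as unjustified.

```python
TOLERANCE_NS = 100_000_000  # ±100 ms, per the acceptance criterion


def unjustified_labels(labels, events, justifying, tolerance_ns=TOLERANCE_NS):
    """Return the labels that have no justifying event within tolerance_ns."""
    bad = []
    for label in labels:
        wanted = justifying.get(label["phase"])  # None for unknown phases
        ok = any(
            ev["event"] == wanted
            and abs(ev["t_mono_ns"] - label["t_mono_ns"]) <= tolerance_ns
            for ev in events
        )
        if not ok:
            bad.append(label)
    return bad


# The failure from §3: the clock wrote infected_running, but the only
# observed event was session_open_timeout.
justifying = {"infecting": "exploit_fire", "infected_running": "session_open"}
events = [
    {"event": "exploit_fire", "t_mono_ns": 1_000_000_000},
    {"event": "session_open_timeout", "t_mono_ns": 31_000_000_000},
]
labels = [
    {"phase": "infecting", "t_mono_ns": 1_050_000_000},
    {"phase": "infected_running", "t_mono_ns": 31_000_000_000},
]
offenders = unjustified_labels(labels, events, justifying)
```

Here `offenders` holds only the `infected_running` label; the `infecting` label sits 50 ms from its `exploit_fire` event and passes.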

4.6 Episode acceptance gate at finalization

Before sealing meta and writing done.marker, the orchestrator verifies:

  • Every collector in the active set produced ≥1 row.
  • Every label has a matching event (§4.5 invariant).
  • For Tier-3 episodes: a session_open event exists.
  • dirty=true is absent OR dirty_override=true is present (see §4.9).

If any check fails, the episode goes to data/rejected/<id>/ with a rejected_reason.json describing which check failed. done.marker is not written. The shipper never sees it.

Acceptance: tests/test_acceptance_gate.py covers each rejection condition. A passing test asserts a clean episode is accepted; for each failure mode, the test asserts the episode is moved to rejected/ with the expected reason.
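A hedged sketch of the gate decision as a pure function. The reason strings, the argument layout, and the `event` field name are hypothetical; only the four checks themselves come from the list above.

```python
def rejection_reasons(meta, row_counts, unjustified, events, tier):
    """Return every failed check; an empty list means the episode may be sealed.

    meta        -- dict destined for meta.json (reads dirty / dirty_override)
    row_counts  -- {collector_name: rows} for the active set
    unjustified -- labels lacking a matching event (the §4.5 invariant)
    events      -- parsed events.jsonl entries
    tier        -- integer tier of the episode
    """
    reasons = []
    silent = sorted(name for name, n in row_counts.items() if n < 1)
    if silent:
        reasons.append("collectors_silent:" + ",".join(silent))
    if unjustified:
        reasons.append("label_without_event")
    if tier == 3 and not any(ev["event"] == "session_open" for ev in events):
        reasons.append("tier3_no_session_open")
    if meta.get("dirty") and not meta.get("dirty_override"):
        reasons.append("dirty_tree_without_override")
    return reasons
```

All checks run even after the first failure, so `rejected_reason.json` can list everything that was wrong with the episode.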

4.7 Producer preflight

orchestrator/preflight.py runs at orchestrator startup. One bar (no light/deep split). Checks:

  • Every binary required by the active collector set + active module catalog is on PATH.
  • /dev/kvm accessible by the service user.
  • kernel.perf_event_paranoid <= 2.
  • cfg.bridge_iface exists; tcpdump can capture on it.
  • msfrpcd reachable; auth.login returns a token.
  • For every module in catalog: module.info is fetchable.
  • For every sample in catalog: file present on disk; sha256 matches.
  • Probe-boot baseline-v1 snapshot; observe guest-agent heartbeat within N seconds.
  • git status --porcelain empty (or CIS490_ALLOW_DIRTY=1).
  • HEAD is on a commit currently in origin/main.

Failures are collected (every failed check logged with diagnosis + remediation), then sys.exit(78).

Acceptance: tests/test_preflight.py covers each check individually with mocked subprocess/filesystem. python -m orchestrator.preflight runs the checks and prints a structured report. Exit codes: 0 ok, 78 sysadmin error.
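The collect-then-exit behaviour can be sketched as below. This is not the contents of `orchestrator/preflight.py`; the `Failure` tuple, the check signature (return `None` on success, a diagnosis string on failure), and the two example checks are assumptions for illustration.

```python
import os
import sys
from pathlib import Path
from typing import Callable, NamedTuple, Optional

EXIT_SYSADMIN_ERROR = 78  # the exit code this plan assigns to preflight failure


class Failure(NamedTuple):
    check: str
    diagnosis: str


def check_kvm() -> Optional[str]:
    if os.access("/dev/kvm", os.R_OK | os.W_OK):
        return None
    return "/dev/kvm is not readable and writable by this user"


def check_perf_paranoid(
    path: str = "/proc/sys/kernel/perf_event_paranoid",
) -> Optional[str]:
    value = int(Path(path).read_text().strip())
    return None if value <= 2 else f"kernel.perf_event_paranoid={value}, need <= 2"


def run_checks(checks: dict[str, Callable[[], Optional[str]]]) -> list[Failure]:
    """Run every check and collect all failures; never stop at the first one."""
    failures = []
    for name, check in checks.items():
        try:
            diagnosis = check()
        except Exception as exc:  # a crashing check is a failure, not a skip
            diagnosis = f"check raised {type(exc).__name__}: {exc}"
        if diagnosis is not None:
            failures.append(Failure(name, diagnosis))
    return failures


def main(checks: dict[str, Callable[[], Optional[str]]]) -> None:
    failures = run_checks(checks)
    for failure in failures:
        print(f"FAIL {failure.check}: {failure.diagnosis}", file=sys.stderr)
    sys.exit(EXIT_SYSADMIN_ERROR if failures else 0)
```

Because a check that raises is recorded as a failure, a broken probe cannot turn into a silent pass.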

4.8 Receiver-side rejection (defense in depth)

The receiver is defense-in-depth, NOT the primary correctness mechanism. The producer is. Receiver rejection exists to catch peers running stale or broken code; it is never a substitute for fixing the producer. A change that strengthens receiver rejection without strengthening the producer is the defensive-instead-of-corrective pattern (§7.9).

The receiver enforces the same correctness invariants the orchestrator does. A peer running stale code that produces dishonest episodes still gets rejected at ingest:

  • Reject any meta with dirty=true and no dirty_override=true.
  • Reject any meta where phases_observed contains infected_running but events.jsonl (extracted from the tarball) lacks session_open.
  • Reject any meta where any configured-collector row count is zero.
  • Existing commit-allow-list gate continues.

Rejections return 422 with a JSON body naming the failed check. Rejected tarballs are not written to the index.

Acceptance: tests/test_receiver_rejects.py covers each new rejection condition.
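A sketch of the ingest decision for the three new rejection conditions. It assumes the receiver has already extracted the event names from the tarball and has per-collector row counts at hand. The function name, the `failed_check` strings, the JSON body shape, and the 200 on success are assumptions; only the 422 and "a JSON body naming the failed check" come from the text above. The existing commit-allow-list gate is not shown.

```python
import json


def ingest_decision(meta: dict, event_names: set, row_counts: dict):
    """Return (http_status, json_body) for an uploaded episode."""
    failed = None
    if meta.get("dirty") and not meta.get("dirty_override"):
        failed = "dirty_without_override"
    elif (
        "infected_running" in meta.get("phases_observed", [])
        and "session_open" not in event_names
    ):
        failed = "infected_running_without_session_open"
    elif any(n == 0 for n in row_counts.values()):
        failed = "configured_collector_zero_rows"

    if failed is None:
        return 200, json.dumps({"accepted": True})
    # Rejected tarballs are reported and must not be written to the index.
    return 422, json.dumps({"accepted": False, "failed_check": failed})
```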

4.9 Override discipline

The only escape hatch from the dirty-tree gate is the CIS490_ALLOW_DIRTY=1 environment variable. When set:

  • Orchestrator logs WARN: dirty tree override active.
  • meta.json gains dirty_override: true.
  • Receiver accepts the episode only if dirty_override is also true.
  • Every override use is auditable from the dataset.

There are no other override knobs. No verify_tls=false, no "skip preflight," no "include this collector even if it emits zero rows."
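The override behaviour fits in a few lines. A minimal sketch, assuming the caller has already determined whether the tree is dirty (for example from non-empty `git status --porcelain` output); `dirty_gate` and the logger name are hypothetical.

```python
import logging
import os
from typing import Mapping

log = logging.getLogger("cis490.orchestrator")  # hypothetical logger name


def dirty_gate(tree_is_dirty: bool, environ: Mapping[str, str] = os.environ) -> dict:
    """Return the fields to stamp into meta.json, or exit 78 on an un-overridden dirty tree."""
    override = environ.get("CIS490_ALLOW_DIRTY") == "1"
    if tree_is_dirty and not override:
        raise SystemExit(78)
    if override:
        log.warning("dirty tree override active")
    # Stamped whenever the knob is set, so every use is visible in the dataset.
    return {"dirty": tree_is_dirty, "dirty_override": override}
```

Only the exact value `1` enables the override; any other value is treated as unset.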

4.10 Regression-test discipline

Every fix in this plan lands with a test that would have caught the regression at PR time. Tests are not a follow-up. A PR that fixes the perf collector without a perf-emit test is incomplete and gets sent back.

CI runs:

  • All unit tests.
  • scripts/verify-catalog.sh against a smoke target subset. The full catalog verification run is gated to release commits because it is too expensive to run on every PR.
  • The collector-emit integration tests (§4.4) on real binaries.

4.11 systemd integration

  • cis490-orchestrator.service adds RestartPreventExitStatus=78. A preflight failure stays loud and stuck instead of cycling restarts.
  • On preflight failure, orchestrator writes /var/lib/cis490/preflight.failed.json with the failed checks + timestamps. Doctor surfaces this in its next report. The fleet-health alert distinguishes "preflight failed" from "host silent."
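Writing the failure record could look like the sketch below. The default path comes from the bullet above; the record layout (`written_at_unix`, `failed_checks`, `check`, `diagnosis`) is an assumption. The file is written to a temporary name and then renamed, so a reader never sees a half-written record.

```python
import json
import os
import tempfile
import time


def write_failure_record(
    failures, path="/var/lib/cis490/preflight.failed.json", now=None
):
    """Write failed checks plus a timestamp to `path` atomically; return the record.

    failures -- iterable of (check_name, diagnosis) pairs
    """
    stamp = time.time() if now is None else now
    record = {
        "written_at_unix": stamp,
        "failed_checks": [
            {"check": name, "diagnosis": diagnosis} for name, diagnosis in failures
        ],
    }
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(record, fh, indent=2)
    os.replace(tmp_path, path)  # atomic rename within the same directory
    return record
```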

4.12 Cleanup of compensating layers

The following are deleted as part of this change. Their existence was load-bearing for the dishonest pipeline; the honest one doesn't need them.

  • FIXYOURSELF.md — entire file deleted. Stuck states no longer exist as a class because the gates make them impossible.
  • cis490-autoupdate.timer + scripts/auto-update.sh — deleted. Hosts run pinned commits. New code is rolled out by the operator, not auto-pulled.
  • cis490-cert-fetch.timer — replaced by a one-shot first-boot fetch in install-lab-host.sh. No periodic re-fetch.
  • tools/quarantine_unstamped.py — deleted. Pre-stamp episodes cannot exist because no episode is written without a valid stamp.
  • tools/check_fleet_health.py — keep, but delete the "fatal-only" alert branch (that branch existed because we were shipping fatals; with the gate, we don't).
  • tools/prune_episodes.py's "kept episode despite flat /proc because qmp showed write" cross-check logic — deleted. Episodes that don't pass the producer-side gate don't reach the trainer.
  • AGENTS.md "symptom→fix table" — deleted (the symptoms it covers are now impossible).
  • AGENTS.md "Hosts self-update" section — deleted.

4.13 Containment bar

Real malware execution requires explicit containment. Target VMs exist in an isolation context that is part of the canonical experiment, not a deployment detail. A future change that weakens any of the items below is a containment regression and is rejected regardless of what experimental realism it claims to add.

For every target VM in the catalog (§4.2):

  • Network: target attaches to a bridge with NO upstream egress. No NAT to the host network, no internet route, no DNS resolution beyond what the experiment provides. Outbound C2 callbacks resolve to a sinkhole inside the experiment, never to the internet.
  • Filesystem: no shared mount with the host. No 9p, no virtio-fs with host paths. The target's disk is the snapshot it was booted from, period.
  • Privilege: QEMU runs as the unprivileged service user. KVM access is via group membership only; no setuid wrappers, no privileged TUN ownership transfer, no passthrough of host devices not explicitly required by the catalog.
  • Lifetime: every target boots from a fresh snapshot. State from one episode never crosses into the next. The snapshot is reverted at episode end, not "cleaned."
  • Escape monitoring: any QEMU exit that is not a clean shutdown is logged with full QMP state and the episode is marked failed. Two unclean exits on the same target image within a release window trigger admission-criteria re-verification (§4.3) for every module targeting that image.

Acceptance: tests/test_containment.py asserts each target build (a) has no upstream egress route from inside the guest, (b) has no host-shared filesystem mount, (c) runs QEMU as the unprivileged service user, (d) reverts to snapshot at episode end. The test runs in CI and on every install.
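Part of (a) and (b) can be caught statically, before any guest boots, by inspecting the QEMU command line. The sketch below is an assumption about how such a check might look, not the contents of `tests/test_containment.py`. It flags user-mode (NAT) networking and host filesystem sharing options. It does not replace the in-guest route check, and it says nothing about bridges that have their own upstream route.

```python
def containment_violations(qemu_argv: list[str]) -> list[str]:
    """Return containment problems visible in a QEMU argument list."""
    violations = []
    for i, arg in enumerate(qemu_argv):
        value = qemu_argv[i + 1] if i + 1 < len(qemu_argv) else ""
        if arg in ("-virtfs", "-fsdev"):
            # 9p-style sharing of a host directory into the guest.
            violations.append(f"host-shared filesystem via {arg}")
        elif arg == "-device" and value.startswith("vhost-user-fs"):
            violations.append("host-shared filesystem via virtio-fs device")
        elif arg in ("-netdev", "-net", "-nic") and value.split(",")[0] == "user":
            # User-mode networking NATs guest traffic through the host.
            violations.append(f"user-mode (NAT) networking via {arg}")
    return violations
```

A tap-on-isolated-bridge invocation returns an empty list; anything else is a containment regression to explain before merge.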


5. Build order

There is no half-honest intermediate state. The order below sequences the work; it does not phase the deployment. Everything lands to main in one merge.

  1. Fix the four root-cause defects:
    • Diagnose + fix the perf collector (read code, run standalone, find why it's silent, fix).
    • Diagnose + fix the guest-agent collector (mount baseline image, verify agent installed, fix build).
    • Diagnose + fix k-gamingcom's missing qmp/netflow/pcap (compare configs, eliminate divergence — §4.1).
    • Diagnose + fix samba_usermap_script against its target (manual msfconsole drive, find why the bind shell never connects, fix or remove from catalog — §4.3).
  2. Land the canonical manifest (§4.1).
  3. Land the target-VM build pipeline (§4.2) and containment tests (§4.13) together — target VMs are not in the catalog without containment.
  4. Land the catalog admission criteria + verifier (§4.3).
  5. Land the collector admission criteria + tests (§4.4).
  6. Land the event-driven labeller (§4.5).
  7. Land the acceptance gate (§4.6).
  8. Land the preflight (§4.7).
  9. Land the receiver-side rejection (§4.8).
  10. Land the override discipline + cleanup (§4.9, §4.12).
  11. Land systemd integration + alert distinguishing (§4.11).

After merge: lab hosts pull the new manifest, run preflight, fail loudly if they don't meet the bar, produce zero episodes until they do. The operator brings each host to bar — fixing one root cause at a time, loudly. The dataset goes quiet, then comes back honest.


6. Out of scope (and why)

  • Schedule jitter for label-leakage resistance. Real concern, but doesn't affect honesty — only generalization. Address after honest data is flowing.
  • New collectors (audit, ftrace, etc.). Adding collectors before the existing six are honest is putting more weight on a broken floor.
  • Trainer changes. This plan stops at the dataset boundary. The trainer no longer needs to filter dishonest episodes because they don't exist.
  • Multi-architecture targets. All target VMs are x86_64 for now.

Each of these is fine to defer because none of them papers over a correctness defect. They add value on top of an honest pipeline; the pipeline isn't honest yet.


7. Anti-patterns (named — match every proposal against this list)

Each pattern below is a shape a proposal can take that has been rejected as architectural sleight-of-hand. Match every proposal against this list before submitting it. A proposal that matches a named pattern is rejected; abandon it and propose a corrective fix instead.

The patterns are named so future sessions can recognize them in their own work. A bandaid with a nice name (preflight, acceptance gate, retry layer, fleet-health) is still a bandaid.

§7 is non-exhaustive. New sleight-of-hand patterns will exist that aren't named here. The §8 decision tests are the actual filter; a proposal that fails §8 is rejected even if it matches no named pattern. Do not read §7 as a closed taxonomy and conclude "my proposal isn't on the list, so it's fine." If §8 says no, the answer is no, regardless of whether a named match exists.

7.1 Compensating-layer pattern

Definition. Adding a layer (timer, watcher, retry, alert, recovery doc) that absorbs a failure mode upstream of itself instead of fixing the upstream cause.

Example from session 2026-05-02..03. cis490-autoupdate.timer to drag stale peers forward. The actual fix was the operator's deploy process; the timer existed because deployment was unreliable and we patched around the unreliability instead of fixing it.

Test. If I removed this layer right now, would the original problem reappear immediately? If yes, the layer is a compensating bandaid for an unfixed root cause.

What to do instead. Fix the upstream cause. If you cannot in this change, fail loudly (§9) and stop.

7.2 Phasing-as-deferral pattern

Definition. Splitting a correctness fix into "phase 1, phase 2," "light vs deep," or "land this now, the harder part later." Any sequencing that ships a half-honest intermediate state.

Example from session 2026-05-02..03. "Land preflight first, labeller refactor later." The intermediate state ships dishonest data because the labeller is still clock-driven.

Test. Does each intermediate merge ship dishonest data, or rely on a layer that won't exist yet? If yes, no phasing.

What to do instead. Reduce scope (drop a feature, narrow the active set) until the change is small enough to land in one merge. Do not defer the hard part.

7.3 Single-instance-fix pattern

Definition. Fixing one item from a class while leaving the other items as future work.

Example from session 2026-05-02..03. "I'll diagnose perf and samba in parallel" while guest-agent, qmp, netflow, and the rest of the module catalog stay broken.

Test. Is this a class of N items, of which I'm fixing < N? If yes, fix all or remove the unfixed from the active set.

What to do instead. Either fix every member of the class, or shrink the active catalog to just the verified members. Unverified members do not ship.

7.4 Per-host-divergence pattern

Definition. Accepting that two hosts behave differently as a working assumption.

Example from session 2026-05-02..03. "Which host should I investigate samba on, elliott or k-gamingcom?" — implying the answer matters because hosts are different.

Test. Given identical workloads on identical canonical-manifest hosts, would the produced episodes be identical? If no, the divergence is the bug.

What to do instead. Eliminate the divergence (one canonical manifest, one canonical target VM build, one canonical collector set — §4.1). If a host can't run the canonical experiment, it produces zero episodes.

7.5 Black-box-trust pattern

Definition. Treating an externally-built artifact as if it behaves correctly under our experiments without a verifiable spec for what it should do.

Example from session 2026-05-02..03. Metasploitable2 from a SourceForge mirror — we don't know what version of Samba is running, whether the service is up, or whether the image has been altered. We were shipping modules targeting it anyway.

Test. Do we have a verifiable spec for this artifact's behavior? If no, we don't trust it.

What to do instead. Build the artifact from a declarative spec we control (§4.2). If we can't, remove modules targeting it from the catalog.

7.6 Investigation-as-deferral pattern

Definition. Proposing investigation when a verifiable gate would suffice. The investigation itself becomes the deferred work.

Example from session 2026-05-02..03. "I need to diagnose why perf is silent before I can write the gate." A gate of the form "perf must produce ≥1 row" works without knowing the cause; it forces the diagnosis to happen as part of the fix.

Test. Can the gate be expressed as an assertion ("X must produce > 0 rows" / "X must observe Y event") without knowing the root cause? If yes, write the gate first.

What to do instead. Write the strictest possible gate first. The investigation is the work of making the gate pass.

7.7 Speculation-as-evidence pattern

Definition. Asserting a claim as fact without measurement.

Example from session 2026-05-02..03. "30s vs 120s won't change this — if the exploit were almost working, we'd see occasional opens." No data was gathered; the claim was projected.

Test. Do I have a measurement that supports this claim? If no, I am speculating.

What to do instead. Say "I don't know yet." Either gather data or design the fix to be correct under both possibilities.

7.8 Out-of-scope-for-correctness pattern

Definition. Naming a correctness-affecting item as "out of scope" to avoid the harder problem.

Example from session 2026-05-02..03. "Manifest canonicalization is out of scope, flagged as known issue." Per-host config divergence is the source of half the data quality problems; excluding it from scope was a deferral.

Test. Does excluding this item leave the system half-honest? If yes, it is in scope.

What to do instead. Reduce other scope (drop a feature, narrow the active set) to fit. Correctness items cannot be deferred.

7.9 Defensive-instead-of-corrective pattern

Definition. Building rejection logic at the consumer instead of fixing the producer that produces the rejected output.

Example from session 2026-05-02..03. Receiver-side rejection of dishonest episodes without fixing why the producer produces them. Defense-in-depth (both ends gated) is good; defense-without-corrective (only consumer gated) is a bandaid.

Test. Does this fix make the dishonest behavior IMPOSSIBLE upstream, or only unobservable downstream? If only unobservable, the producer is still broken.

What to do instead. Fix the producer first. The consumer-side gate is defense-in-depth on top of a corrected producer, never a substitute.

7.10 Recovery-layer pattern

Definition. Building documentation, scripts, timers, or runbooks for "what to do when X is stuck." Applies anywhere in the pipeline — producer, receiver, trainer, dashboard, install scripts, on-device agents, anywhere a "recovery from a state that shouldn't exist" layer is contemplated. Producer-side is just the most common location.

Example from session 2026-05-02..03. FIXYOURSELF.md — a 250-line decision tree for recovering hosts whose auto-update timer couldn't fix them. The states it covered shouldn't have been possible if the producer were correct.

Test. Can the stuck state happen at all if the relevant component is correct? If no, delete the recovery layer and fix the component.

What to do instead. Make the stuck state impossible. If you can't, fail loudly (§9) and stop.


8. Decision tests before proposing a change

Before adding any code, doc, layer, or feature, answer all of the following. Any uncomfortable answer means stop and re-evaluate.

  1. Does this change make the dishonest behavior IMPOSSIBLE, or only less likely / less observable?
  2. Does this change scale to every instance of the problem class, or only one?
  3. If I removed this change, would the underlying problem return immediately?
  4. Am I adding a layer? If yes, can I instead remove the layer that allowed the failure?
  5. Does this proposal match any pattern in §7? If yes, abandon it and propose a corrective fix.
  6. Is the change complete in one merge? If not, why is the intermediate state honest?
  7. Am I doing this because it's correct, or because it's the easiest thing that looks like progress?

If you cannot answer all seven cleanly, stop. Ask the operator. Do not proceed.


9. What to do when blocked

When you cannot fix something cleanly in scope:

  • Fail loudly. Exit with a distinguishable code (e.g., 78). Write a structured failure record. Do not retry silently.
  • Stop. Do not continue producing output as if the failure didn't happen.
  • Ask the operator. Tell them what's blocked, what you tried, and what you need to proceed.
  • Do not build a recovery layer. That is the recovery-layer pattern (§7.10).
  • Do not propose phased fixes. That is the phasing-as-deferral pattern (§7.2).
  • Do not narrow scope silently. If the active set must shrink to make the change tractable, name it explicitly and get sign-off.

The operator prefers a small honest system that fails loudly over a large half-broken one that limps. A loud failure is more useful than a silent bandaid.


10. Definitions of ground truth

For each collector, "real row" means the row was actually emitted by the underlying mechanism for this episode, not synthesized, defaulted, or carried over from a previous run.

| Collector | Ground truth means |
| --- | --- |
| proc | Row read from /proc/<qemu_pid>/{stat,io,status} for the live qemu PID of this episode's target VM, while that PID is alive. |
| qmp | Row obtained from a successful QMP query-status / query-blockstats round-trip on cfg.qmp_socket for this episode's qemu PID. |
| netflow | Row computed from packet capture on cfg.bridge_iface for traffic involving this episode's target VM during the episode wall-clock window. |
| perf | Row produced by perf (or equivalent) sampling this episode's qemu PID. Not from a previous run, not from a different PID. |
| guest | Row received from the in-guest agent over the virtio-serial channel during the episode wall-clock window. The agent must be running in this episode's guest, not a stale one. |
| pcap | Bytes captured from cfg.bridge_iface during the episode wall-clock window, written to network.pcap. |

For each phase, "label justified" means the corresponding event was observed:

| Phase | Justified by |
| --- | --- |
| clean | Episode start (orchestrator-emitted). |
| armed | Orchestrator instructs the driver to fire (orchestrator-emitted). |
| infecting | exploit_fire event observed in events.jsonl. |
| infected_running | session_open event observed in events.jsonl. Not session_open_timeout, not schedule-clock. |
| dormant | Observed in-session idle (no traffic / no command activity for N seconds). |
| failed | session_open_timeout or other terminal driver failure. Episode is rejected (§4.6). |

A row that doesn't meet the ground-truth bar is not a row. A label that isn't justified is not a label. The acceptance gate (§4.6) enforces both.


11. Honest reporting

When you (a future session) report status to the operator:

  • Distinguish merged from verified. "Code merged" is not "behavior verified in production." A passing test on a CI host is not the same as a working system on a lab host.
  • Distinguish proposed from implemented. "I proposed X" is not "X is in the repo."
  • Audit your cumulative pattern. At the end of a session, re-read your own changes against §7. It is possible to add three reasonable-looking layers in sequence that cumulatively form a compensating-layer pattern, even if no individual one looks like a bandaid.
  • Name compensating layers you've built. If §7 audit finds matches, name them and propose their removal.
  • Don't summarize cumulative changes as "fixes" without auditing. "I shipped 12 commits this session" is not the same as "the pipeline is honest now."
  • Verify before agreeing or refuting. When the operator says something is done that you can verify, verify it before agreeing. When they say something is broken that you can verify, verify it before refuting.

12. Glossary

Terms used throughout this document, pinned to one definition.

| Term | Definition |
|---|---|
| Canonical manifest | The single, version-pinned manifest.toml at the repo root. Every host loads this exact file. There is no per-host override (§4.1). |
| Active set | The collectors enabled in the canonical manifest for a given run. A collector is in the active set only if it has passed admission criteria (§4.4). |
| Catalog | The set of exploit modules in exploits/modules/*.toml that have passed admission (§4.3). Modules not in the catalog do not run. |
| Ground truth | A row or label is ground truth when it was emitted by the underlying mechanism for this episode, with the justifying event observed. See §10. |
| Episode boundary | An episode begins when the orchestrator emits the first clean label and ends when done.marker is written or the episode is moved to rejected/. All collector rows must fall inside this wall-clock window. |
| Configured collector | A collector listed as enabled in the canonical manifest. Distinct from "running collector" (the process actually started) and "active set" (the intersection of manifest-enabled and admission-passing collectors). For acceptance purposes, only the configured set matters. |
| Admission criteria | The bar a module / collector / target / override knob must pass to be in the active pipeline. See §4.3, §4.4, §13. |
| Honest | Of an episode: every label justified by an observed event, every configured collector emitted ≥1 ground-truth row, working tree was clean (or override-stamped), HEAD on origin/main. Of the pipeline: every accepted episode is honest. |
| Bandaid / compensating layer | A layer that absorbs a failure mode upstream of itself instead of fixing the upstream cause. See §7.1. |
| Override | A knob that loosens an admission criterion or gate. There is exactly one: CIS490_ALLOW_DIRTY (§14). |
| Operator | The human maintainer with sign-off authority. Distinct from agents that propose changes. See §15. |
| Containment regression | A change that weakens any of the §4.13 isolation requirements. Rejected regardless of claimed experimental value. |
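
The "Honest" definition is a conjunction of four checks, so it can be read as a predicate over an episode. The sketch below is illustrative only: `EpisodeSummary` and every field name in it are assumptions made for this example, not the pipeline's actual schema, and it presumes that label justification and the HEAD comparison have already been computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class EpisodeSummary:
    """Hypothetical per-episode summary; all field names are assumptions."""
    labels_justified: bool              # every label is backed by an observed event
    configured_collectors: set[str]     # collectors enabled in the canonical manifest
    rows_per_collector: dict[str, int]  # ground-truth rows emitted, keyed by collector
    tree_dirty: bool                    # working tree had uncommitted changes
    dirty_override: bool                # the dirty_override stamp was written
    head_on_origin_main: bool           # HEAD was on origin/main


def honesty_failures(ep: EpisodeSummary) -> list[str]:
    """Return every reason the episode is not honest; an empty list means honest."""
    failures = []
    if not ep.labels_justified:
        failures.append("label without observed event")
    # Only the configured set matters for acceptance, not what happened to run.
    for name in sorted(ep.configured_collectors):
        if ep.rows_per_collector.get(name, 0) < 1:
            failures.append(f"collector {name} emitted no ground-truth rows")
    if ep.tree_dirty and not ep.dirty_override:
        failures.append("dirty tree without override stamp")
    if not ep.head_on_origin_main:
        failures.append("HEAD not on origin/main")
    return failures


def is_honest(ep: EpisodeSummary) -> bool:
    return not honesty_failures(ep)
```

Returning the list of failures rather than a bare boolean keeps each rejection attributable to a specific clause of the definition.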

13. Admission scope (what triggers the bar)

Any change to the following is in admission scope and must pass the §4 admission criteria and receive §15 operator sign-off:

  • Any module in exploits/modules/*.toml.
  • Any collector in the active set.
  • Any field of manifest.toml.
  • Any phase rule or label-emission code in the labeller.
  • Any gate in the producer or receiver.
  • Any schedule entry (phase budget, per-module timeout).
  • Any target VM build spec or its containment posture (§4.13).
  • Any override knob (the closed list in §14).

The following are NOT admission scope and can be changed without admission ceremony, but must still pass §8 decision tests:

  • Internal refactors that do not change observable behavior of any of the above.
  • Test code, fixtures, CI configuration.
  • Documentation that does not contradict §1.
  • Build/install scripts, insofar as they don't change what gets shipped or how it's labelled.

A future session that argues "this is just infrastructure" or "this is just tooling" to dodge admission scope: re-read this section. Anything that touches what gets shipped, how it's labelled, what runs on the host, the containment posture, or how the gate decides is in scope. The "infrastructure / tooling" framing is a recurring sleight-of-hand vector and triggers automatic rejection.


14. Override knobs (closed list)

The complete list of override knobs in CIS490, version-pinned to this document:

| Knob | Effect | Where audited |
|---|---|---|
| CIS490_ALLOW_DIRTY=1 (env var, orchestrator) | Allows the orchestrator to start with a dirty git tree. Stamps dirty_override: true in every meta.json produced. Receiver accepts only with matching stamp. | per-episode in meta.json |

That is the entire list. Adding a knob to this list is itself an admission event (§13) requiring operator sign-off (§15) and an §8 review.
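
As a rough illustration of how the single knob could behave end to end, the sketch below models the three effects listed in the table: the start check, the stamp, and the receiver's acceptance rule. Only `CIS490_ALLOW_DIRTY` and `dirty_override` come from this document. The function names and the `tree_dirty` field are assumptions introduced for the example, as is the choice to stamp whenever the knob is set.

```python
import os


def dirty_override_requested(env=None) -> bool:
    """True when CIS490_ALLOW_DIRTY=1 is present in the given environment."""
    env = os.environ if env is None else env
    return env.get("CIS490_ALLOW_DIRTY") == "1"


def orchestrator_may_start(tree_dirty: bool, env=None) -> bool:
    """A clean tree always starts; a dirty tree starts only under the override."""
    return (not tree_dirty) or dirty_override_requested(env)


def stamp_meta(meta: dict, env=None) -> dict:
    """Return a copy of meta, stamped with dirty_override when the knob is set."""
    stamped = dict(meta)
    if dirty_override_requested(env):
        stamped["dirty_override"] = True
    return stamped


def receiver_accepts(meta: dict) -> bool:
    """Accept a dirty-tree episode only if it carries the matching stamp.

    The tree_dirty field is hypothetical; the document does not say how the
    receiver learns that the producing tree was dirty.
    """
    if meta.get("tree_dirty", False):
        return meta.get("dirty_override") is True
    return True
```

Passing the environment as a parameter keeps the check testable without mutating the process environment.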

Knobs that have been considered and rejected (do not propose again without re-reading the rationale):

  • verify_tls=false — TLS verification is a correctness boundary; bypassing it is the defensive-instead-of-corrective pattern (§7.9).
  • skip_preflight=1 — preflight is the gate; bypassing it makes the gate non-functional.
  • experimental_collector=true — bypassing collector admission is the single-instance-fix pattern (§7.3) wearing a flag.
  • diagnostic_mode=true — generic bypass; in practice would be applied to hide failures, not investigate them.
  • dry_run for the producer — episodes that aren't shipped go to rejected/; no dry-run flag needed.

If a future session proposes a new override knob, the burden is on the proposal: pass §8, get operator sign-off, amend §14 in the same merge. "Add the knob now and amend §14 later" is the phasing-as-deferral pattern (§7.2) applied to documentation.


15. Sign-off discipline

Admission decisions are made by the operator, not by agents acting alone. Specifically:

  • Adding a module to the catalog requires operator sign-off. An agent runs scripts/verify-catalog.sh, presents the verification result, and the operator decides whether the module enters the catalog.
  • Adding a collector to the active set requires operator sign-off. Agent runs the emit-test, operator decides.
  • Promoting a target VM build requires operator sign-off after §4.2 verification and §4.13 containment tests pass.
  • Adding an override knob (§14) requires operator sign-off.
  • Amending PIPELINE.md requires operator sign-off (§16).

Removing anything from the catalog or active set does NOT require operator sign-off — the bar is asymmetric. Tightening is always permitted; loosening requires sign-off.

The operator is the human with maintainer credentials on the repository. Agents propose, run verification, and present results; the operator decides admission.

If an agent is acting in a non-interactive context (CI run, scheduled job) where no operator is available to sign off, the agent does not admit anything. It produces verification output and stops.


16. Amending PIPELINE.md

This document is not immutable, but it is the canonical statement of the bar. Amendments are governed by the same discipline as admission decisions:

  1. Any change to §1 (principle), §4 (fix items), §7 (anti-patterns), §8 (decision tests), §10 (ground truth), §13 (admission scope), §14 (override list), or §15 (sign-off) is a substantive amendment.
  2. Substantive amendments require operator sign-off (§15) and must pass §8 decision tests applied to the amendment itself.
  3. The amendment lands in the same merge as the code change it justifies. "Amend the doc later" is the phasing pattern (§7.2).
  4. Editorial changes (typos, formatting, link fixes, glossary wording) do not require sign-off but should be flagged in the commit message.
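
The substantive/editorial split in items 1 and 4 depends only on which top-level sections a change touches, so it can be sketched as a small classifier. The function names and the string format for section references (such as "§4.6" or "12") are assumptions for illustration. The sketch does not decide whether an edit is genuinely editorial in content, which remains a human judgment.

```python
# Top-level sections whose amendment is substantive, per item 1 above.
SUBSTANTIVE_SECTIONS = frozenset({1, 4, 7, 8, 10, 13, 14, 15})


def top_level(section_ref: str) -> int:
    """Map a reference such as '§4.6' or '12' to its top-level section number."""
    return int(section_ref.strip().lstrip("§").split(".")[0])


def is_substantive(touched_sections: list[str]) -> bool:
    """True if any touched section falls under a substantive top-level section."""
    return any(top_level(ref) in SUBSTANTIVE_SECTIONS for ref in touched_sections)


def requires_operator_signoff(touched_sections: list[str]) -> bool:
    """Substantive amendments need sign-off (item 2); the rest are only flagged."""
    return is_substantive(touched_sections)
```

For example, a change touching only "12" (glossary wording) would be classified as editorial, while one touching "§4.6" would require sign-off.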

A future session that wants to add a feature or layer the document forbids: the path is to amend the document, not to work around it. "This isn't covered by PIPELINE.md, so I'll just do it" is the out-of-scope-for-correctness pattern (§7.8) applied to the meta-document. Anything that touches admission scope (§13) is covered even if not named explicitly.

If you find the document is wrong — internally inconsistent, contradicts observed reality, prescribes something impossible — file a Forgejo issue against the repo with the contradiction documented. Do not silently work around the doc.


17. What this plan supersedes

The following docs are deleted or rewritten as part of landing this plan:

| Doc | Action |
|---|---|
| FIXYOURSELF.md | Deleted. Compensating-layer doc; the states it covers don't exist after §4.6. |
| AGENTS.md "symptom→fix table" | Deleted. Bandaid-driven. |
| AGENTS.md "Hosts self-update" section | Deleted. Hosts run pinned commits. |
| AGENTS.md "Tier 3+4 deploy zero-touch" claim | Rewritten. Targets are built locally now, not auto-fetched. |
| AGENTS.md "trust the in-guest probe alone, cross-check host CPU" | Deleted. The producer-side gate makes this fictional cross-check unnecessary. |
| TIER3-BRINGUP.md | Kept as historical record (labelled bug report, not current guidance). |
| README.md Tier-3+4 narrative | Reviewed and aligned. |

If you are a future session reading this and find another doc that contradicts §1–§6 of this file: this file is right and the other doc is wrong. Fix the other doc.