Tier-2 workload-silent false positive: pgrep -c unsupported on BusyBox, disown missing — workload IS running #15

Closed
opened 2026-04-30 16:57:31 -05:00 by elliott · 1 comment
Owner

Symptom

Pi-side classifier labels 244 episodes from elliott-thinkpad and k-gamingcom as workload-silent. In-guest loadavg peaks at ~0.77 for cpu-saturate, and top_procs in telemetry-guest.jsonl never shows a yes process.

Diagnostic environment

  • Alpine 3.21, BusyBox v1.37.0 (2024-11-19)
  • Tested via single VM booted from launch_demo.sh, probed with SerialClient from dev clone

Step 2 — baseline guest state (clean boot, no workload)

=== /tmp listing ===
total 5
drwxrwxrwt    4 root  root  1024 Apr 30 21:50 .
drwxr-xr-x   21 root  root  1024 Apr 30 21:50 ..
drwxrwxrwt    2 root  root  1024 Apr 30 21:50 .ICE-unix
drwxrwxrwt    2 root  root  1024 Apr 30 21:50 .X11-unix
-rw-r--r--    1 root  root    15 Apr 30 21:50 .cis490-boot

=== workload script files ===
done

=== pgrep yes/sh ===
2096 /bin/sh /usr/sbin/cloud-init-hotplugd
2179 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups
2265 -sh
2270 -sh
done

=== loadavg ===
0.09 0.03 0.01 1/85 2274

=== busybox applets (yes/nohup/sh/disown) ===
/usr/bin/yes
/usr/bin/nohup
/bin/sh
nohup
sh
yes

Note: disown does NOT appear — it is not a BusyBox applet or shell builtin on this Alpine guest.
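The applet check above can be turned into a small preflight that runs inside the guest before a workload starts. This is a minimal sketch, not code from the repository: `has_cmd` and `missing_cmds` are hypothetical helper names, and the only mechanism relied on is the POSIX `command -v` builtin.

```sh
#!/bin/sh
# has_cmd NAME: succeeds if the current shell can resolve NAME
# (builtin, function, or executable on PATH).
has_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# missing_cmds NAME...: print each NAME the current shell cannot resolve.
missing_cmds() {
    for name in "$@"; do
        has_cmd "$name" || printf '%s\n' "$name"
    done
}

# Same four names as the applet check above. The result depends on the
# shell: bash has a disown builtin, the guest's shell reported it missing.
missing_cmds yes nohup sh disown
```

Running the preflight through the same shell that later executes `start_cmd` matters, because builtins such as `disown` differ between shells.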


Step 3 — exact start_cmd and pgrep chain test

TEST_BEGIN

start_cmd repr (exact output of _cpu_saturate().start_cmd):

"cat > /tmp/.cis490-workload-cpu-saturate.sh <<'CIS490_EOF'\n#!/bin/sh\ntrap 'exit 0' TERM INT\nwhile :; do\n  yes > /dev/null 2>&1 &\n  wait $!\n\ndone\nCIS490_EOF\nchmod +x /tmp/.cis490-workload-cpu-saturate.sh; nohup sh /tmp/.cis490-workload-cpu-saturate.sh </dev/null >/dev/null 2>&1 &\necho $! > /tmp/.cis490-workload-cpu-saturate.pid\ndisown\n"

run(start_cmd) output:

'\r\n-sh: disown: not found'

Probe 3 seconds after start_cmd:

yes=0
loadavg=1.05
-rw-r--r--  1 root root  5 Apr 30 21:54 /tmp/.cis490-workload-cpu-saturate.pid
-rwxr-xr-x  1 root root 86 Apr 30 21:54 /tmp/.cis490-workload-cpu-saturate.sh
probe-done

ps at same moment:

 2300 root  0:00 sh /tmp/.cis490-workload-cpu-saturate.sh
 2301 root  0:01 yes
 2302 root  0:01 yes
ps-done

pid file contents: 2300

script file contents:

#!/bin/sh
trap 'exit 0' TERM INT
while :; do
  yes > /dev/null 2>&1 &
  wait $!

done

pgrep -c yes (as used in _probe()):

pgrep: unrecognized option: c
BusyBox v1.37.0 (2024-11-19 21:09:16 UTC) multi-call binary.

Usage: pgrep [-flanovx] [-s SID|-P PPID|PATTERN]

Display process(es) selected by regex PATTERN

        -l      Show command name too
        -a      Show command line too
        -f      Match against entire command line
        -n      Show the newest process only
        -o      Show the oldest process only
        -v      Negate the match
        -x      Match whole name (not substring)
        -s      Match session ID (0 for current)
        -P      Match parent process ID
        -u EUID Match against effective UID
        -U UID  Match against UID
exit=1

pgrep -l yes (working alternative): 2301 yes 2302 yes 2339 yes exit=0

pgrep yes (no flags): 2301 2302 2339 exit=0

Manual /proc count: /proc/2301/status /proc/2302/status /proc/2339/status

TEST_END
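The manual /proc count used in the test can be written as a helper that does not depend on any pgrep flag. This is a sketch under assumptions: `count_by_name` is a hypothetical name, the optional second argument exists only so the function can be pointed at a fake tree, and it relies on the first line of `/proc/PID/status` being `Name:` followed by the command name.

```sh
#!/bin/sh
# count_by_name NAME [PROC_ROOT]: print how many processes have command name NAME.
count_by_name() {
    name=$1
    root=${2:-/proc}
    n=0
    for f in "$root"/[0-9]*/status; do
        # Skip if the glob matched nothing or the process has already exited.
        [ -r "$f" ] || continue
        # Assumed layout: the first line of status is "Name:<TAB><command name>".
        read -r key val < "$f" || continue
        [ "$key" = "Name:" ] && [ "$val" = "$name" ] && n=$((n + 1))
    done
    echo "$n"
}

# Example: count running "yes" processes on the current machine.
echo "yes=$(count_by_name yes)"
```

Unlike `pgrep yes`, this compares the whole command name, so a process whose name merely contains `yes` is not counted.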


Observation

(d) Something else: the heredoc creates the file correctly (86 bytes, correct content); nohup sh keeps the script alive (confirmed by ps); and yes IS saturating the vCPU (loadavg=1.05; ps shows two yes PIDs at probe time, and the later pgrep runs show three). The workload is NOT silent.

The diagnostic turned up two bugs. Bug 1, in VMLoadController._probe(), causes the false-positive workload-silent label; Bug 2, in _wrap_loop, is currently harmless:

Bug 1 (root cause): pgrep -c is not a valid BusyBox flag

BusyBox v1.37.0 pgrep accepts only the options listed in its usage text above (-l, -a, -f, -n, -o, -v, -x, -s, -P, -u, -U). The -c (count) flag is a procps-ng feature that BusyBox does not implement. When pgrep -c yes is called:

  1. BusyBox exits with code 1 and prints usage to stderr
  2. 2>/dev/null suppresses the usage output
  3. || echo 0 fires → always produces yes=0

So echo yes=$(pgrep -c yes 2>/dev/null || echo 0) always prints yes=0, even when yes is saturating the vCPU. This false zero is what the Pi classifier reads from workload_killed events.

Fix: replace pgrep -c yes with pgrep yes | wc -l (both supported by BusyBox).
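The masking effect and the fix can be reproduced without a BusyBox guest. In this sketch `busybox_like_pgrep` is a hypothetical stand-in that mimics only the behavior captured above (usage error and exit 1 for `-c`, one PID per line otherwise); it is not the real applet.

```sh
#!/bin/sh
# Hypothetical stand-in for BusyBox pgrep: rejects -c, otherwise prints
# the three PIDs seen in the test run above, one per line.
busybox_like_pgrep() {
    case "$1" in
        -c) echo "pgrep: unrecognized option: c" >&2; return 1 ;;
        *)  printf '%s\n' 2301 2302 2339 ;;
    esac
}

# Old probe shape: the usage error is discarded and the fallback prints 0.
broken=$(busybox_like_pgrep -c yes 2>/dev/null || echo 0)

# Fixed probe shape: count output lines instead of relying on -c.
# tr strips the padding that some wc implementations add.
fixed=$(busybox_like_pgrep yes | wc -l | tr -d ' ')

echo "broken=$broken fixed=$fixed"   # prints: broken=0 fixed=3
```

With the pipe form no `|| echo 0` fallback is needed, since `wc -l` prints 0 when pgrep emits nothing. One caveat: `pgrep yes` matches by substring. The usage text above lists `-x` for whole-name matching, which may be worth considering if other process names could contain `yes`.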

Bug 2 (secondary): disown is not a BusyBox builtin

-sh: disown: not found is printed on every infected_running entry. The background process survives (protected by nohup), so this is currently harmless, but the error leaks into run()'s captured output and could confuse future callers.

Fix: remove disown from _wrap_loop; nohup already provides SIGHUP immunity.
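A disown-free launch, following the shape of the `start_cmd` shown earlier, might look like the sketch below. `start_detached` is a hypothetical helper name, and `sleep 30` stands in for the workload script; only `nohup`, `&`, and `$!` are used, all of which appear in the original command.

```sh
#!/bin/sh
# start_detached PIDFILE CMD [ARG...]: run CMD under nohup in the background
# and record its PID. No disown call is needed for SIGHUP immunity.
start_detached() {
    pidfile=$1
    shift
    nohup "$@" </dev/null >/dev/null 2>&1 &
    echo $! > "$pidfile"
}

workdir=$(mktemp -d)
start_detached "$workdir/demo.pid" sleep 30
pid=$(cat "$workdir/demo.pid")

# kill -0 sends no signal; it only checks that the PID is alive.
if kill -0 "$pid" 2>/dev/null; then alive=yes; else alive=no; fi

# Clean up the demo process and scratch directory.
kill "$pid" 2>/dev/null
rm -rf "$workdir"
echo "pid=$pid alive=$alive"
```

Redirecting all three standard streams keeps `nohup` from creating `nohup.out` and leaves nothing for `run()` to capture.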


Impact

All 244 episodes from elliott-thinkpad and k-gamingcom are correctly labeled by phase but incorrectly tagged workload-silent. The CPU envelope IS present in the host-side /proc telemetry (qemu-system CPU%). The in-guest top_procs gap may be a separate agent-side pgrep issue using the same -c flag. The episodes are not wasted — host-side telemetry is valid — but the workload_silent filter would incorrectly exclude them from the ML pipeline.
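To check whether the agent side uses the same constructs, the source tree can be scanned for them. This is a rough sketch: `scan_nonportable` is a hypothetical helper, and the two patterns cover only the constructs this issue found (`pgrep` with a flag group containing `c`, and `disown`).

```sh
#!/bin/sh
# scan_nonportable DIR: print file:line:text for every line under DIR that
# uses a construct the BusyBox guest rejected in this issue.
# Exit status follows grep: 0 if anything matched, 1 if nothing did.
scan_nonportable() {
    grep -rnE \
        -e 'pgrep[[:space:]]+-[[:alnum:]]*c' \
        -e 'disown' \
        "$1"
}

# Usage (from a checkout of the repository):
#   scan_nonportable . || echo "no non-portable constructs found"
```

The scan is purely textual, so it will also flag comments and documentation that mention these constructs; treat hits as candidates for review.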

max closed this issue 2026-04-30 17:28:49 -05:00
Owner

Excellent diagnostic. Both bugs fixed in 2707709 on main:

  • _probe() now uses 'pgrep yes | wc -l' (busybox- and procps-compatible)
  • _wrap_loop drops disown; nohup is enough for SIGHUP immunity

For the 244 episodes already on disk: rather than re-collect, the prune classifier now cross-checks the in-guest probe against host-side /proc CPU envelope. workload-silent flags only when both agree. Added test_workload_silent_suppressed_when_host_cpu_real as a regression. AGENTS.md grew a 'Don't trust the in-guest probe alone' section with the busybox-vs-procps gotcha and a list of patterns to avoid.

Net result: existing data is rescued, the probe is correct going forward.
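The actual classifier change lives in commit 2707709 and is not reproduced here. As an illustration only, the "flag only when both agree" rule could be expressed as follows. The function names, the integer CPU percentage input, and the 50% threshold are all assumptions made for this sketch.

```sh
#!/bin/sh
# is_workload_silent GUEST_YES_COUNT HOST_CPU_PCT [MIN_BUSY_PCT]
# Succeeds only when the in-guest probe saw no workload process AND the
# host-side CPU reading is below the busy threshold.
is_workload_silent() {
    guest_count=$1
    host_cpu_pct=$2
    min_busy_pct=${3:-50}   # hypothetical threshold, not taken from the repo
    [ "$guest_count" -eq 0 ] && [ "$host_cpu_pct" -lt "$min_busy_pct" ]
}

# label GUEST_YES_COUNT HOST_CPU_PCT: print the tag the two signals agree on.
label() {
    if is_workload_silent "$@"; then
        echo workload-silent
    else
        echo ok
    fi
}

label 0 97   # guest probe reports 0 but host CPU is busy: prints "ok"
label 0 3    # both signals agree: prints "workload-silent"
```

Requiring agreement means a broken in-guest probe alone can no longer exclude an episode, which is what rescues the episodes already on disk.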

Reference: bolyai/CIS490#15