Phase 3 — measurement protocol

Status: LOCKED 2026-05-02. Pre-registered hypothesis and decision rules locked here before data collection. Conditions, falsification criteria, and the “what would change my mind” set are defined upfront so the analysis cannot quietly drift toward confirming the prior. Per architect's framing: this is “the single highest-value remaining measurement, and it's cheap” (phase0_findings).

Decisions in from user 2026-05-02: in-scope client is chromium-fourier (Step 1 + Step 2 patched Chromium 149 at /tmp/chromium-ohm-gl-fix-step2/chrome), not stock Brave. Replicate strategy: 1 smoke replicate per condition first, then 2 follow-up replicates per condition for averaging (total N = 3 / condition in the limit). Cage A/B: nested (no bare-DRM cage). Operator role: open chromium-fourier, navigate to test page, press play; everything else runs over SSH from boltzmann.

1. Pre-registered hypothesis under test

H1 (primary): KWin v6.6.4's per-frame compositor cost on the parent + wp_subsurface NV12 path is dominated by glEGLImageTargetTexture2DOES (src/scene/surfaceitem.cpp:490 for the YUV multi-plane branch, :496 for the single-plane branch), called from OpenGLSurfaceTexture::updateDmabufTexture (:472-501).

Operational definition of “dominated by”: the symbol — by name glEGLImageTargetTexture2DOES, OR by its panfrost-side implementation in libEGL / libpanfrost / Mesa-DRI — accounts for ≥ 20 % of kwin_wayland's self-time CPU during the steady-state window of the locked Phase 1 measurement, and ≥ 10 percentage points more of self-time than the same symbol shows in the cage A/B's kwin_wayland self-time. The 20 % / 10 pp thresholds are deliberate: small enough to be reachable if the hypothesis is correct, large enough to rule out incidental noise.

H0 (null): the symbol's contribution is not distinguishable from background (< 5 %) OR is matched within ±5 pp by cage's kwin_wayland — in which case the cost is elsewhere and Phase 2 must re-open.

Auxiliary hypothesis H2: the warmup-burst drops at t ≈ 0–5, 10–12, 20–30 s correlate with cold-path EglDisplay::importDmaBufAsImage events. Validated separately by counting fresh imports during warmup vs steady state (instrumentation hook below).

2. What would change my mind — falsification table

Decision rule applied to the direct-KWin run's kwin_wayland self-time. Cage A/B comparator applied second.

Observed direct-KWin self-time profile	Verdict	Phase 4 implication
`glEGLImageTargetTexture2DOES` (or panfrost-side equivalent) ≥ 20 % AND ≥ 10 pp above cage	H1 confirmed	Proceed to Phase 4 patch design as drafted in phase2_source_findings.
symbol 5–20 %, but ≥ 10 pp above cage	H1 partially confirmed	Patch is worth landing but not the only cost; identify the second-hottest symbol and run a follow-up to size it before committing to a single-patch shape.
symbol < 5 % OR cage shows the same heat	H1 rejected	Re-open Phase 2. The hottest symbol in the direct-KWin run becomes the new candidate; specifically check the alternative-hot list below.
Profile unstable across replicates (IQR > 5 pp on the dominant symbol)	inconclusive	Increase replicates from 3 to 6, fix the CPU governor at `performance`, re-test. Likely thermal throttling on the PineTab2's passive-cooled SoC.

Alternative-hot candidates (named so we don't only look for what we expect):

dma_buf_* / sync_file_* kernel-side symbols → producer-side fence handling cost; would steer toward a kernel-V4L2 fix rather than a KWin patch.
pan_*, panvk_*, panfrost_resource_* Mesa internals → driver-side cost; KWin patch may not help, would steer toward a Mesa bug report.
Item::scheduleRepaint, RenderLoopPrivate::* → KWin scheduling/damage cost; different patch shape.
GLShaderManager::* / shader-conversion fragment binding → shader cost; would steer toward a YUV-shader optimization.
memcpy or __memcpy_* heavy → software fallback path engaged unexpectedly; would invalidate the “we're on the dmabuf zero-copy path” assumption.
Top symbol is [kernel.kallsyms] _raw_spin_* → contention, not compositor work; means a different bottleneck.

3. Conditions

Three conditions, each run as 1 smoke replicate first, then 2 follow-up replicates for averaging (N = 3 in the limit). The smoke pass gates the follow-up pass: if the smoke pass already produces a clearly resolved verdict (H1 confirmed or rejected at high confidence), we still run the follow-up to get the IQR; we do not short-circuit “yes the symbol is hot, we're done.” The replicate count is load-bearing for the IQR, not the point estimate.

Condition	Code	Compositor	Workload
C0	`noise_floor`	KWin (Plasma session)	Idle desktop, no playback, chromium-fourier not running. 70 s.
C1	`kwin_direct`	KWin (Plasma session)	chromium-fourier (Step 1 + Step 2 patched Chromium 149, binary at `/tmp/chromium-ohm-gl-fix-step2/chrome`), autoplay-loop of `bbb_1080p30_h264.mp4` on `brave_drops_test.html`. Windowed (NOT fullscreen — windowed is the in-scope case per phase1_lock). 70 s.
C2	`cage_nested`	cage nested inside KWin	Same chromium-fourier binary + same clip, launched as the only client of cage (`cage – /tmp/chromium-ohm-gl-fix-step2/chrome <url>`). 70 s.

C0 is the sanity check: if glEGLImageTargetTexture2DOES shows up at non-zero in idle, the symbol is being invoked by something other than the test workload (e.g. plasmashell wallpaper or panel updates) and our deltas have a confounder.

Run schedule:

Pass 1 (smoke): C0 rep1, C1 rep1, C2 rep1. Three runs in sequence, ~20 min wall-clock with cooldowns.
Gate: I summarise pass-1 numbers and explicitly flag the rough verdict before pass 2. Per user's “2 in background” intent, the gate is informational; pass 2 proceeds unless pass-1 reveals a setup problem (no symbols, thermal throttling, route_engaged failure, etc.) that needs fixing first.
Pass 2 (averaging): C0 rep2 + rep3, C1 rep2 + rep3, C2 rep2 + rep3. Six runs in sequence, ~40 min wall-clock with cooldowns.

Total 9 runs, ~60 min wall-clock plus thermal cooldowns. Inter-run gap: minimum 60 s desktop idle, longer if the SoC hasn't cooled to within 3 °C of the pre-session temp.

4. Pre-conditions on ohm

Capture the system state once at the start of the session as a control snapshot. All commands run as mfritsche unless noted.

# Hardware / kernel snapshot
uname -a > phase3_evidence/ohm_uname.txt
cat /proc/cpuinfo > phase3_evidence/ohm_cpuinfo.txt
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor > phase3_evidence/ohm_governors.txt
cat /sys/class/thermal/thermal_zone*/temp > phase3_evidence/ohm_temps_pre.txt
 
# Software versions of the things on the hot path
pacman -Q kwin-fourier qt6-base-fourier mesa cage 2>/dev/null > phase3_evidence/ohm_pkg_versions.txt
 
# What KWin is actually running (PID + ELF)
pgrep -a kwin_wayland > phase3_evidence/ohm_kwin_pids.txt
readlink "/proc/$(pgrep kwin_wayland | head -1)/exe" >> phase3_evidence/ohm_kwin_pids.txt
 
# Confirm the chromium-fourier Step 1 + Step 2 binary is in place
ls -la /tmp/chromium-ohm-gl-fix-step2/chrome > phase3_evidence/ohm_chromium_fourier_path.txt 2>&1
/tmp/chromium-ohm-gl-fix-step2/chrome --version >> phase3_evidence/ohm_chromium_fourier_path.txt 2>&1 || echo "(version check failed — verify binary integrity)" >> phase3_evidence/ohm_chromium_fourier_path.txt
 
# Confirm the test page and clip are still on disk (location from ohm_gl_fix Phase 0 setup)
find / -name 'brave_drops_test.html' 2>/dev/null > phase3_evidence/ohm_test_assets.txt
find / -name 'bbb_1080p30_h264.mp4' 2>/dev/null >> phase3_evidence/ohm_test_assets.txt

CPU governor: lock to performance for the duration of the session. A frequency-scaling event mid-run is the single most likely confounder.

echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Revert at end of session — also tracked in ohm_tooling_revert_log.md.

Background noise minimisation: stop the most likely sources of compositor traffic during playback. Document each thing stopped in the revert log.

# Notifications, file indexer, baloo, clipboard, kded modules — all common Plasma chatter sources
systemctl --user stop plasma-baloorunner.service kded6.service 2>&1 | tee -a phase3_evidence/ohm_services_stopped.txt || true
# Network: keep up (chromium-fourier needs it for video render path bringup), but quiet it
nmcli device status > phase3_evidence/ohm_nm_pre.txt

Thermal floor: between runs, allow the SoC to cool until cat /sys/class/thermal/thermal_zone0/temp is within 3 °C of the pre-session reading. If a run starts hot, throttling can knock 10–20 % off CPU throughput and the perf profile shifts. Fail-safe: if temp doesn't drop within 90 s of a finished run, skip the next run and try again later. Don't power-fight the chip.

5. Tools to install on ohm (track in revert log)

# perf for kernel 6.19 — Arch ships in linux-tools-meta or as a per-kernel package
sudo pacman -S --noconfirm perf
 
# Debug symbols for the symbols we actually need to see
# kwin: kwin-fourier ships stripped; need the matching -debug if available, else accept symbol-via-build-id resolution
sudo pacman -S --noconfirm gdb
# Mesa debug symbols — required for panfrost-side resolution if H1 cost is in libpanfrost
sudo pacman -S --noconfirm mesa-debug || echo "mesa-debug not available, will rely on /usr/lib/debug/.build-id/ if dbg-symbols ship"

Add each install to ohm_tooling_revert_log with the date, command, and rationale.

Symbol resolution sanity check (run once, before the real measurement):

# Quick smoke test: 5s perf record on kwin_wayland during normal use, confirm we get named symbols not [unknown]
sudo perf record -F 99 -g --call-graph dwarf -p $(pgrep kwin_wayland | head -1) -- sleep 5
sudo perf report --stdio --no-children | head -30

If > 50 % of the top output is [unknown] we don't have enough debug-info coverage yet. Decide: rebuild kwin-fourier with -DCMAKE_BUILD_TYPE=RelWithDebInfo on boltzmann's chromium-builder LXD, redeploy to ohm, retry. Do not proceed without legible symbols.

6. Per-run measurement — operator vs Claude split

The measurement is human-in-the-loop. The split is deliberate: the operator (mfritsche) handles only the parts that need a human at the keyboard — opening chromium-fourier, navigating to the test page, pressing play, sitting still. Claude orchestrates everything else over SSH from boltzmann: pre-state capture, perf record, top sampling, post-state capture, perf report processing, evidence-directory creation, metrics.csv updates.

Operator's complete role per run (≤ 90 s of attention per run):

C0 (noise_floor): when Claude says “go”, do nothing for 70 s. Don't touch the mouse, don't switch windows, don't type. Wait for “done.”
C1 (kwin_direct): when Claude says “go”, open chromium-fourier (Claude provides the exact command) and navigate to the test page (Claude provides URL). Verify the video starts auto-playing within ~5 s (visible motion, audio if speakers/headphones on). Then DON'T touch anything for 70 s. Wait for “done.” Close the browser when Claude says “you can close it.”
C2 (cage_nested): when Claude says “go”, run the cage launcher (Claude provides one-line command). Cage opens with chromium-fourier as its single client, video auto-plays. DON'T touch anything for 70 s. Wait for “done.” Close cage with the keystroke Claude provides.

Operator does NOT run perf, top, perf report, or any analysis script. Operator does NOT manage thermal cooldowns, evidence directories, or metric calculations. Those are all Claude's work over SSH.

Claude's per-run script (run via SSH on ohm; pseudo-shell — actual implementation is Claude tooling not a single shell file):

RUN_ID="kwin_direct_rep1"   # or noise_floor_rep1, cage_nested_rep1, etc.
EVD="$HOME/src/kwin_overlay_subsurface/phase3_evidence/$RUN_ID"
mkdir -p "$EVD"
 
# Pre-state
date -u +%FT%TZ > "$EVD/start.txt"
cat /sys/class/thermal/thermal_zone0/temp > "$EVD/temp_pre.txt"
 
# Wait for operator's "go". Once received:
 
# top runs for 70 s exactly. -p PID restricts it to kwin_wayland; we want full-system in addition.
top -b -n 70 -d 1 -p "$(pgrep kwin_wayland | head -1)" > "$EVD/top_kwin.txt" 2>&1 &
top -b -n 70 -d 1                                       > "$EVD/top_full.txt" 2>&1 &
 
# perf record runs for the same 70 s. -F 99 sampling, --call-graph dwarf for accurate aarch64 unwinds.
# Need sudo to attach to a process not owned by the invoker (kwin_wayland is owned by the Plasma user).
sudo perf record -F 99 -g --call-graph dwarf \
    -p "$(pgrep kwin_wayland | head -1)" \
    -o "$EVD/perf.data" -- sleep 70 &
 
wait
 
cat /sys/class/thermal/thermal_zone0/temp > "$EVD/temp_post.txt"
date -u +%FT%TZ > "$EVD/end.txt"
 
# Post-process the perf data to text artefacts (durable; goes into git):
sudo perf report -i "$EVD/perf.data" --stdio --no-children --sort=overhead,symbol,dso > "$EVD/perf_report_self.txt"
sudo perf report -i "$EVD/perf.data" --stdio --children    --sort=overhead,symbol,dso > "$EVD/perf_report_children.txt"
sudo perf report -i "$EVD/perf.data" --stdio --call-graph                              > "$EVD/perf_report_callgraph.txt"
 
# perf.data itself is binary, big, and goes into .gitignore — kept on ohm under
# /home/mfritsche/perf_archive/<run_id>/ for re-analysis if needed.

The drop-rate trajectory (Phase 1 cells drops_post_warmup, drops_total) comes from the test page's instrumentation (per Phase 0 setup, getVideoPlaybackQuality() polled per-second; output dump at /tmp/playback_quality_<rundate>.json). Claude pulls that file post-run.

route_engaged check: only on C1 rep1. Claude separately captures a chrome trace (DevTools Performance recording, ~5 s during steady state) and parses for WaylandBufferManagerHost::CommitOverlays / SkiaRenderer::SwapBuffers ratio per phase1_lock. Operator's role for this: when Claude says “open DevTools, hit record, wait 5s, hit stop, save the trace,” do those four button presses.

7. Cage A/B note

The kwin_wayland perf is captured for all three conditions, including C2. In C2, KWin is composing one fullscreen surface (cage's framebuffer) and we expect its self-time to be much lower; the texture-rebind cost moves into cage's process. Optionally also perf record on the cage process for a side-by-side, but that's a Phase 3-extension, not a gate for the H1/H0 verdict. The H1/H0 verdict is on kwin_wayland's self-time only — that's what we're patching.

8. Phase 1 binding cells captured per run

Recorded into metrics.csv alongside the perf evidence. One row per replicate:

phase	path_label	clip	compositor	drops	drops_post_warmup	kwin_wayland_cpu	route_engaged
`phase3_perf_kwin`	C1 rep N	bbb_1080p30_h264.mp4	kwin	from getVideoPlaybackQuality	from same, t > 10 s	from `top.txt` median t > 30 s	Y per chrome trace check
`phase3_perf_cage`	C2 rep N	bbb_1080p30_h264.mp4	cage_nested	…	…	…	Y
`phase3_perf_idle`	C0 rep N	(none)	kwin	0	0	from `top.txt` median	N/A

The phase1_reference_cage and phase1_baseline_kwin rows in metrics.csv (currently TBD per phase1_lock) get populated from C2 / C1 medians here. Do not import the ohm_gl_fix log directly — re-record under this protocol per the phase1_lock discipline note.

9. Decision rule application

After all 9 runs:

Compute per-condition median + IQR for kwin_wayland_cpu and for the top-5 perf symbols' self-time.
Apply the falsification table from §2 to the medians of C1 vs C2.
Cross-check: C0's glEGLImageTargetTexture2DOES self-time should be ~0. If it isn't, identify the confounder (likely plasmashell or kded6 not properly stopped) and re-run C1 / C2.
Document the verdict and the supporting numbers in phase3_findings.md (new file). Format: short answer + table + per-symbol commentary.
Update metrics.csv with the captured rows.
Commit. Then update Dokuwiki via the his agent.

Do not edit phase2_source_findings's hypothesis section. If H1 is rejected, the rejection lives in phase3_findings.md and phase2_source_findings gets a new “Phase 2 revision” section appended below the original — the discipline rule from phase1_lock (“characterise honestly, do not retroactively move the prior to match the result”) applies to Phase 2's hypothesis as well.

10. Artefact organisation

phase3_evidence/
├── ohm_uname.txt               (one-time pre-session)
├── ohm_cpuinfo.txt
├── ohm_governors.txt           (also captured post-session for revert audit)
├── ohm_temps_pre.txt
├── ohm_pkg_versions.txt
├── ohm_brave_version.txt
├── ohm_brave_path.txt
├── ohm_kwin_pids.txt
├── ohm_services_stopped.txt
├── noise_floor_rep1/
│   ├── perf.data
│   ├── perf_report_self.txt
│   ├── perf_report_children.txt
│   ├── perf_report_callgraph.txt
│   ├── top.txt
│   ├── start.txt
│   ├── end.txt
│   ├── temp_pre.txt
│   └── temp_post.txt
├── noise_floor_rep2/
├── noise_floor_rep3/
├── kwin_direct_rep1/  …  rep3/  (each with the same layout + playback_quality.json)
├── cage_nested_rep1/  …  rep3/
└── perf_diff/
    ├── kwin_vs_cage_median.txt   (perf diff -i kwin_direct_repN vs cage_nested_repN, median pair)
    └── kwin_vs_idle_median.txt

perf.data files are large (50–200 MB each at -F 99 with dwarf unwind for 70 s) and binary. They go into phase3_evidence/<run>/ but do NOT commit them to git — add phase3_evidence/*/perf.data to .gitignore. The perf_report_*.txt text outputs are the durable artefact and they are committed. If the original perf.data files are needed later for reanalysis, they live on ohm's /home/mfritsche/perf_archive/ or wherever we park them.

11. Caveats and known unknowns

Sampling rate: 99 Hz is a deliberate compromise. Lower (49 Hz) reduces the tool's own perturbation; higher (999 Hz) gives better resolution on short-running symbols. If glEGLImageTargetTexture2DOES is genuinely fast (low-microsecond) and we miss it at 99 Hz, the symbol will look colder than reality. Mitigation: if H0 verdict but we're suspicious, re-run at -F 999 for one C1 replicate as a tie-breaker.
Stack unwind on aarch64: –call-graph dwarf is the most accurate option but most expensive. Frame-pointer unwind (–call-graph fp) requires kwin-fourier to be built with -fno-omit-frame-pointer; if it isn't, dwarf is the only option. Confirm at the symbol-resolution sanity check.
Mesa-debug availability: if the panfrost-side implementation of glEGLImageTargetTexture2DOES resolves only to libEGL_mesa.so.0+0xNNNN, name-resolution to a specific Mesa function may need build-id matching against the upstream Mesa 26.0.5 source. If this becomes a blocker, file it as a secondary task — the existence of the heat is detectable without resolving the inner symbol.
Thermal throttling: PineTab2 passive cooling is a real risk over 9 runs back-to-back. The 60 s inter-run gap and the temp-monitoring step are the mitigation; replicate IQR is the canary.
Phase 1 cell route_engaged: requires inspecting a chrome trace JSON for WaylandBufferManagerHost::CommitOverlays / SkiaRenderer::SwapBuffers ratio per phase1_lock. This is a separate measurement done on ONE C1 run (not all three), captured as phase3_evidence/kwin_direct_rep1/chrome_trace.json and the ratio reported alongside.
chromium-fourier Step 2 patch state: confirmed running per phase0_findings (binary at /tmp/chromium-ohm-gl-fix-step2/chrome). Re-confirm at the start of the session — if a system update has clobbered the patched binary, the measurement runs against unpatched chrome and the route_engaged cell could fail for unrelated reasons.

12. Orchestration handshake

Per-run sequence between Claude (over SSH from boltzmann) and operator (at ohm). Each “→” is a state transition; “[op]” is operator action, “[claude]” is Claude action. Total operator attention per run: ≤ 90 s.

Setup (once per session, before any runs):

[claude] Pre-state capture per §4. Verify chromium-fourier binary in place, test asset paths resolved, perf installed, debug symbols resolve in the smoke test (§5).
[claude] Pin CPU governor to performance; stop background services per §4. Add to revert log.
[claude] If smoke test (§5) shows [unknown] in > 50 % of top symbols: HALT. Rebuild kwin-fourier with RelWithDebInfo on chromium-builder LXD, redeploy, retry. Don't proceed.
[op] Standby. Claude says when to act.

Per-run sequence (repeated 9 times — pass-1 then pass-2 per §3):

Run-id and condition selected by Claude per the schedule.

[claude] Print run-id + condition + cooldown status. Wait for thermal floor if needed.
[claude] Create evidence dir, capture pre-state (timestamp, temp).
[claude] Tell operator: the exact instruction for the condition. Examples:
- C0: “Run C0 noise_floor rep N. Don't touch anything for 70 s. Reply 'go' when ready and I'll start.”
- C1: “Run C1 kwin_direct rep N. (1) Open a terminal and run: /tmp/chromium-ohm-gl-fix-step2/chrome <test-page-url> (2) Wait until the video is autoplaying (visible motion, ~5 s). (3) Reply 'go'. (4) Don't touch anything for 70 s.”
- C2: “Run C2 cage_nested rep N. (1) Open a terminal and run: cage – /tmp/chromium-ohm-gl-fix-step2/chrome <test-page-url>. (2) Wait until cage's chromium-fourier autoplays. (3) Reply 'go'. (4) Don't touch anything for 70 s. (5) When I say 'done', close cage with Super+Q (or whatever the cage exit shortcut is on this build).”
[op] Follow instructions. Reply 'go' once playback is rolling and steady (or for C0, immediately on prompt).
[claude] On 'go': start perf record, top ×2, all timed for 70 s. Watch for premature exit (would indicate an error).
[op] Sit still 70 s. Don't touch the keyboard, don't switch windows.
[claude] At t = 70 s: stop. Capture post-state (timestamp, temp). Run perf report post-processing. Pull playback_quality JSON.
[claude] Tell operator: “Done with rep N. You can close the browser/cage.” Print one-line summary: drops_total / drops_post_warmup / kwin_wayland_cpu (median t > 30 s) / top-3 perf symbols.
[op] Close browser/cage. Wait for next run prompt.
[claude] Cooldown timer (60 s minimum, longer if temp post > pre + 3 °C). Then queue next run.

After pass 1 (3 runs done — C0, C1, C2 rep 1):

[claude] Pulls together C0/C1/C2 rep-1 numbers. Reports: rough verdict on H1 (confirmed / partially / rejected / inconclusive based on §2 thresholds applied to N=1, with caveat that thresholds are robust only at N=3). Reports any setup problems found (no symbols, thermal throttling, route_engaged failure, weird outliers). Asks operator to confirm proceeding to pass 2 IF a setup problem was found; otherwise auto-proceeds.

After pass 2 (6 runs done — reps 2-3 of each condition):

[claude] Computes median + IQR per cell, applies §2 falsification table to the medians of C1 vs C2 with N=3. Writes phase3_findings.md. Updates metrics.csv. Commits. Surfaces the verdict + the file. Offers Dokuwiki upload.

Operator's “panic word”: if at any point the test environment is wrong (wrong binary version, browser extension popup interrupted, browser focus lost mid-run, anything visibly off), reply “abort” instead of waiting 70 s. Claude discards the in-flight run, re-queues it, and we don't lose the larger session.

13. Summary in one paragraph

Nine runs (three conditions × three replicates, executed as 1-smoke-then-2-followup), perf record on kwin_wayland in each, captured per the locked Phase 1 protocol, with pre-registered decision rules from §2. Operator's role is ≤ 90 s of button-pressing per run (open chromium-fourier, hit play, sit still); Claude orchestrates the rest over SSH. If glEGLImageTargetTexture2DOES (or the panfrost-side equivalent) crosses 20 % self-time in C1 and clears C2 by ≥ 10 pp at N=3, hypothesis H1 is confirmed and we proceed to Phase 4 patch design. If not, we name the new dominant symbol and re-open Phase 2. The thermal floor and replicate IQR are the canaries that tell us when to discard a run rather than analyse it.

Zettelkasten

Table of Contents