====== Phase 3 — measurement findings ======

**Status: LOCKED 2026-05-02.** Single-pass measurement (1 replicate per condition; pass-2 averaging deferred per §11). Verdict applies the §2 falsification table from [[kwin_overlay_subsurface:phase3_protocol|phase3_protocol]] to N=1 with explicit IQR-deferred caveat.

===== TL;DR — H1 rejected, H1' (working) emerges, drops are not CPU-caused =====

Hypothesis H1 from [[kwin_overlay_subsurface:phase2_source_findings|phase2_source_findings]] (''glEGLImageTargetTexture2DOES'' ≥ 20 % of ''kwin_wayland'' self-time during chromium-fourier playback) is **decisively rejected**. The symbol is below 0.2 % of self-time in every condition, including the cage A/B and the exploratory stock-Brave run. The Phase 4 patch site at ''src/scene/surfaceitem.cpp:472-501'' is **not** load-bearing on this hardware.

What dominates ''kwin_wayland''’s self-time is per-frame Wayland-protocol dispatch — Qt event-filter chain, kernel ''_raw_spin_unlock_irqrestore'' on unix-stream sendmsg, kernel ''fdget'', kernel maple-tree mm operations (''mas_walk'', ''mas_empty_area_rev''), ''wl_display_flush_clients''. Same hot-symbol shape across both clients (chromium-fourier and stock Brave), reproducible across the cage A/B.

A separate, equally important finding: **stock Brave drops fewer frames than Step-2-patched chromium-fourier on the same hardware, while using ~75 % more ''kwin_wayland'' CPU**. Drops do not scale with compositor CPU load. The drop-causation mechanism is therefore not the compositor's per-frame work itself — it is something timing-related (commit pacing, frame-callback delivery latency, FIFO scheduling) that the Step 2 overlay-route engagement may make worse, not better.

===== Methodology executed vs phase3_protocol planned =====

^ What was planned (locked) ^ What ran ^ Delta ^
| 3 replicates × 3 conditions = 9 runs | 1 replicate × 3 conditions = 3 runs (C0, C1, C2) | Pass-2 averaging deferred. Single-rep verdict applies. |
| C0 noise_floor | C0 ran (after one re-attempt — first run had a stale chromium-fourier process; baseline contaminated and repeated) | OK |
| C1 chromium-fourier on KWin | C1 ran | OK on perf; **drops not captured** (stderr-redirect missing in launch) |
| C2 chromium-fourier in cage (nested) | C2 ran. Bonus: also ''perf record'' on cage process for caller-side comparison. | OK on perf; **drops not captured** |
| //(not in original plan)// | **C3 stock Brave on KWin** added on operator request after the C1 perf result inverted the prior. Treated as an exploratory addendum, not a control. | New row in [[kwin_overlay_subsurface:metrics|metrics.csv]] (''phase3_perf_brave_kwin''). |
| Pre-conditions per §4 | Pre-conditions executed; CPU governor pinned to ''performance'', baloo permanently disabled (was at 65.8 % CPU during initial C0 — major contaminator), kded6 not loaded on this Plasma build, plasmashell + konsole left running as session-natural baseline. | All deviations recorded in ''phase1_evidence/ohm_tooling_revert_log.md''. |
| Symbol-resolution smoke test (§5) | Passed: ''kwin-fourier'' is built stripped, but DWARF unwind recovers KWin C++ symbols cleanly (''Compositor::composite'', ''OpenGLSurfaceTexture::updateDmabufTexture'', ''WorkspaceScene::overlayCandidates'' all resolve). | OK |

===== §2 falsification table — verdict at N=1 =====

Decision rule ([[kwin_overlay_subsurface:phase3_protocol|phase3_protocol]]:§2) applied to the H1 target symbol — ''glEGLImageTargetTexture2DOES'' (or its panfrost-side implementation) — in ''kwin_wayland''’s self-time. Cage A/B comparator C2 is the differential.

^ Condition ^ H1-target symbol max self-time ^ Combined libEGL+panfrost self-time ^
| C0 noise_floor | absent | < 1 % (panfrost driver bookkeeping at idle) |
| C1 chromium-fourier on KWin | absent (named) — closest is ''eglMakeCurrent'' 0.15 % | ~1.5 % across all hex-offset entries |
| C2 chromium-fourier in cage (kwin perf) | ''OpenGLSurfaceTexture::updateDmabufTexture'' 0.15 % (named — on cage's framebuffer dmabuf import path) | ~2 % |
| C3 stock Brave on KWin | absent — closest is ''eglMakeCurrent'' 0.13 % | < 1 % |

§2 falsification rules require **≥ 20 % AND ≥ 10 pp above cage** for confirmation; **< 5 % OR cage same heat** for rejection. The maximum named-symbol reading anywhere is 0.15 %, about 33× below the 5 % rejection threshold and more than two orders of magnitude below the 20 % confirmation threshold; even the combined libEGL+panfrost figure peaks at ~2 %. **H1 rejected.** IQR cannot change the verdict when every point estimate sits this far below the threshold band.
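
The decision rule can be written down as a small classifier. This is a minimal sketch, not the protocol's own tooling: the function name is hypothetical, and because "cage same heat" is not given a numeric definition here, the ''same_heat_tol_pp'' tolerance is an assumption.

```python
def h1_verdict(target_pct, cage_pct, same_heat_tol_pp=1.0):
    """Classify an H1-target self-time reading against the §2 thresholds.

    target_pct: H1-target self-time in kwin_wayland on KWin (percent).
    cage_pct:   the same reading under the cage comparator (percent).
    same_heat_tol_pp: ASSUMED tolerance for "cage same heat"; the
                      protocol text does not give a number.
    """
    excess_pp = target_pct - cage_pct
    # Confirmation needs both the absolute level and the margin over cage.
    if target_pct >= 20.0 and excess_pp >= 10.0:
        return "confirmed"
    # Rejection needs only one of the two conditions.
    if target_pct < 5.0 or abs(excess_pp) <= same_heat_tol_pp:
        return "rejected"
    return "inconclusive"


# Combined libEGL+panfrost readings from the table: C1 ~1.5 %, C2 ~2 %.
print(h1_verdict(1.5, 2.0))
```

With the combined readings from the table the sketch returns "rejected" on the absolute threshold alone, without needing the cage differential.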

That ''OpenGLSurfaceTexture::updateDmabufTexture'' resolves and is named in C2 (0.15 %, on cage's framebuffer) is itself evidence: the path is reachable, the symbol cache works, the function simply isn't expensive on this hardware. The Phase 2 hypothesis was wrong about **which** code path dominates, not about whether perf can see it.

===== Data — kwin_wayland self-time across all four conditions =====

Conditions defined in [[kwin_overlay_subsurface:phase3_protocol|phase3_protocol]]:§3. Replicate count N=1.

^ Cell ^ C0 idle ^ C1 chromium-fourier on KWin ^ C2 chromium-fourier in cage (kwin perf) ^ C3 stock Brave on KWin ^
| Window (UTC) | 19:20:24 → 19:21:36 | 19:26:10 → 19:27:21 | 19:38:00 → 19:39:11 (approx) | 19:51:10 → 19:52:22 |
| SoC temp pre / post (°C) | 48.3 / 48.9 | 51.9 / 59.4 | (recorded) | 60.6 / 60.0 |
| ''kwin_wayland'' %CPU median | 30.9 | 17.0 | 36.9 | 35.9 |
| ''kwin_wayland'' %CPU mean | 30.7 | 17.7 | 37.2 | 35.7 |
| Sample count (70 s @ 99 Hz) | 2127 | 1310 | (recorded) | 2487 |
| Top-1 self-time symbol | ''_raw_spin_unlock_irqrestore'' 2.91 % | ''QCoreApplicationPrivate::sendThroughApplicationEventFilters'' 3.31 % | ''QCoreApplicationPrivate::sendThroughApplicationEventFilters'' 2.03 % | ''_raw_spin_unlock_irqrestore'' 3.27 % |
| Top-2 | ''QCoreApplicationPrivate::sendThroughApplicationEventFilters'' 2.80 % | ''_raw_spin_unlock_irqrestore'' 1.62 % | ''libgallium-26.0.5'' 0x144e8dc 1.21 % | ''QCoreApplicationPrivate::sendThroughApplicationEventFilters'' 2.92 % |
| Top-3 | ''vma_interval_tree_insert'' 0.89 % | ''fdget'' 1.09 % | ''libgallium-26.0.5'' 0x144e988 1.14 % | ''fdget'' 1.51 % |
| ''wl_display_flush_clients'' | not in top | not in top (~ tail) | not in top | 0.73 % |
| ''mas_walk'' + ''mas_empty_area_rev'' (kernel mm) | tail | 1.4 % combined | (kwin) tail; **cage process: 1.9 %** | 0.82 % combined |
| H1-target combined | < 1 % | ~1.5 % | ~2 % | < 1 % |
| ''OpenGLSurfaceTexture::updateDmabufTexture'' (named) | absent | absent | **0.15 %** | absent |
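
As a cross-check on the table, the perf sample counts can be converted back into an implied CPU share. The sketch below assumes that each sample represents one tick of on-CPU time at the 99 Hz sampling rate over the nominal 70 s window; the function name is hypothetical.

```python
SAMPLE_HZ = 99   # perf sampling frequency from the table header
WINDOW_S = 70    # nominal capture window in seconds


def implied_cpu_pct(samples, hz=SAMPLE_HZ, window_s=WINDOW_S):
    """CPU share implied by a sample count, assuming one sample per on-CPU tick."""
    return 100.0 * samples / (hz * window_s)


# (condition, sample count, reported %CPU median) taken from the table above.
for label, samples, median in [("C0", 2127, 30.9), ("C1", 1310, 17.0), ("C3", 2487, 35.9)]:
    print(f"{label}: implied {implied_cpu_pct(samples):.1f} % vs reported median {median} %")
```

Under those assumptions C0 and C3 land within a fraction of a point of the reported medians, while C1 comes out slightly higher (about 18.9 % against 17.0 %), which is worth remembering when reading the C1 row.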

==== Reading these numbers ====

The C0 → C1 transition is counter-prior: chromium-fourier playback **drops** ''kwin_wayland''’s CPU median from 30.9 % to 17.0 %, a 13.9 pp decrease. Likely cause: chromium-fourier's window occluded plasmashell + konsole (baseline-load sources) more than it added its own composite cost. Plasmashell alone was responsible for ~22 % ''kwin_wayland'' load at idle.

C2 (cage nested) brought ''kwin_wayland''’s CPU to 36.9 % — **higher** than C1 by 19.9 pp. cage as a wrapper does not reduce KWin's compositor cost; it relocates work between processes, and re-exposes plasmashell to kwin's composite path (cage's window does not occlude plasmashell the way chromium-fourier's window did). Cage's own process consumed ~7 % CPU in the same 70 s — a small fraction of where the work actually goes.

C3 (stock Brave) closely matches C2 in ''kwin_wayland'' CPU (35.9 % vs 36.9 %), about double C1. C3 shows the same hot-symbol set as C1 (Qt event filter, kernel locking, fdget, mas_walk, wl_display_flush_clients), with only the order of the top two swapped, which suggests the cost shape is **generic to chromium-class clients on KWin**, not specific to chromium-fourier's Step 2 overlay route. The doubling is consistent with stock Brave using more frequent or more protocol-heavy commit patterns than Step-2-patched chromium-fourier. That is a quantitative observation about how much Step 2 reduced, not a Phase 2 hypothesis about where the cost landed.

===== Stock Brave addition (C3) — the inversion finding =====

C3 was added on operator request after C1's perf data inverted the working prior. Stock Brave (''brave-bin 1:1.89.145-1'', Brave Browser 147.1.89.145) has none of the chromium-fourier patches: no Step 1 (libva-v4l2-request port), no Step 2 (''WaylandBufferManagerHost'' overlay-commit route). Its video pipeline is a different beast — it does **not** use libva, decodes through Chromium's ''V4L2VideoDecoder'' directly, and engages the Wayland presentation path with whatever default settings Brave 147 ships.

The drop trajectory salvaged from C3 (typed by operator from the on-screen ''<pre>'' log; recorded in ''phase3_evidence/brave_stock_kwin_rep1/drops_trajectory_typed.txt''):

^ Cell ^ Value ^
| ''drops_total'' over 70.27 s | **18** |
| ''drops_post_warmup'' (visible window t=51.21 → 70.27 s) | **0** new drops in 19.06 s |
| Upper bound on ''drops_post_warmup'' (reached only if none of the 18 fell inside the warm-up, t < warmup_s = 10 s) | ≤ 18 (likely much lower; all drops settled to 18 by t=51 s and held) |
| Frames delivered | 1624 over 70.27 s = 23.1 fps effective |
| Source clip | bbb_1080p30_h264.mp4 (30 fps source) |

Compared with the Phase 0 handover values for chromium-fourier on KWin (58 drops, 29 post-warmup) on the same hardware:

^ Metric ^ Phase 0 chromium-fourier on KWin ^ C3 stock Brave on KWin (this session) ^
| ''drops_total'' | 58 | 18 |
| ''drops_post_warmup'' | 29 | ≤ 18 (loose upper bound; visible window shows 0) |
| Effective fps | 21.4 | 23.1 |
| ''kwin_wayland_cpu'' | 20.5 % | 35.9 % |

Stock Brave **drops about 70 % fewer frames** than chromium-fourier on the same compositor + same hardware, while consuming **75 % more ''kwin_wayland'' CPU**. The "less optimised" client is winning on the user-visible metric.
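
The headline percentages follow directly from the two tables above. A short sketch of the arithmetic (the helper name is hypothetical):

```python
def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


drops_change = pct_change(58, 18)      # drops_total: Phase 0 chromium-fourier vs C3 Brave
cpu_change = pct_change(20.5, 35.9)    # kwin_wayland %CPU: Phase 0 vs C3
brave_fps = 1624 / 70.27               # frames delivered over the C3 window

print(f"drops: {drops_change:+.1f} %, kwin CPU: {cpu_change:+.1f} %, fps: {brave_fps:.1f}")
```

Note that the baseline here is the Phase 0 handover row, not this session's C1, because C1 drops were not captured.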

Three plausible interpretations, mutually compatible:

  - **Step 2 (chromium-fourier) engages the overlay route, which has lower per-frame Wayland-protocol overhead but worse frame-pacing latency on this hardware.** The route buys CPU efficiency at the cost of some other property (presentation timing? buffer release ordering?) that shows up as drops on this stack. Stock Brave's non-overlay route is more chatty (higher CPU) but presents frames on a smoother schedule.
  - **Brave 147 and Chromium 149 differ in browser-side frame-pacing logic** (e.g. RAF cadence, video element scheduling, present queue depth) in ways unrelated to Wayland protocol shape. Some Chromium-side change between 147 and 149 may have made frame pacing //worse// on this stack.
  - **chromium-fourier's Step 1 (libva-v4l2-request) isn't actually engaged** on this clip. phase0_findings:67 says Brave doesn't use libva and chromium-fourier may not either, in which case Step 2 is the only patch on the hot path. If Step 2 alone is responsible for the drop increase, the conclusion narrows to "the overlay-commit route hurts on this hardware".

These are hypotheses, not findings. None of them are testable from the data we have in this session.

===== H1' — working hypothesis after the data =====

Two parts:

==== H1'-a: per-frame Wayland-protocol dispatch dominates kwin_wayland's CPU during chromium-class video playback on this hardware ====

Evidence: C1 + C3 both show the same hot-symbol set:

  * ''QCoreApplicationPrivate::sendThroughApplicationEventFilters'' (Qt event-filter chain — every Wayland message dispatched into a queued event)
  * ''_raw_spin_unlock_irqrestore'' on ''unix_stream_sendmsg'' (kernel socket lock release per send)
  * ''fdget'' (kernel fd-table lookup — every Wayland message touches fds)
  * ''mas_walk'' / ''mas_empty_area_rev'' (kernel maple-tree mm — VMA insertion/lookup, called when dmabuf imports map memory)
  * ''wl_display_flush_clients'' (libwayland-server flush)
  * ''unix_poll'' (kernel poll on the wayland socket)

These are NOT GL/EGL/panfrost symbols. They are the message-loop. The cost shape is "many small messages per second × per-message kernel-and-Qt fixed overhead" rather than "few expensive ops".

This is not a fully formed hypothesis at file:line precision yet. It needs a Phase 3.5 / Phase 2 revision step to identify:

  * The KWin entry into the Qt-event-filter chain (probably ''KWin::Display::dispatchEvents'').
  * The libwayland-server message-arrival path that wraps each socket message into a Qt event.
  * Whether the per-frame message volume from a chromium client is actually 30 messages/sec or much higher (commit + attach + damage + viewport + frame-callback + sync-obj acquire/release per frame ≈ 7 × 30 = 210 messages/sec/surface, × N surfaces).
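
The back-of-envelope message rate in the last item can be parameterised. This is a sketch only: the request list mirrors the seven items named above, and the real per-frame set has to come from a protocol trace.

```python
# The seven per-frame messages named in the estimate above (an assumed set,
# not a measured one).
PER_FRAME_MESSAGES = (
    "attach", "damage", "viewport", "frame_callback",
    "syncobj_acquire", "syncobj_release", "commit",
)


def messages_per_second(fps, surfaces=1, per_frame=len(PER_FRAME_MESSAGES)):
    """Estimated Wayland messages per second for a client committing every frame."""
    return fps * per_frame * surfaces


print(messages_per_second(30))              # one surface at 30 fps
print(messages_per_second(30, surfaces=3))  # hypothetical three-surface client
```

The surface count is the multiplier most likely to move the estimate, so it is the first number to pin down in the follow-up trace.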

==== H1'-b: drops are NOT compositor-CPU-caused on this hardware ====

Evidence: the Brave-vs-chromium-fourier inversion. Lower compositor CPU (chromium-fourier) coincides with MORE drops (58 vs 18). If drops were caused by compositor CPU saturation, the relationship would run in the other direction.

Drops therefore have a different root cause, most likely scheduling / pacing / latency in the path from V4L2 capture buffer ready → wl_subsurface commit → KWin transaction apply → atomic page flip. Phase 4's patch shape will not be "make compositor faster"; it will be something like "fix the wl_subsurface commit pacing" or "address the syncobj acquire-point scheduling jitter".
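
A pacing problem of this kind shows up as irregular gaps between presented frames rather than as CPU load. The sketch below counts such gaps from a list of presentation timestamps; the function name, the 1.5× slack factor and the sample timestamps are all assumptions for illustration, not data from this session.

```python
def count_late_frames(present_ts_ms, nominal_interval_ms, slack=1.5):
    """Count gaps between consecutive presentations longer than slack × nominal."""
    late = 0
    for prev, cur in zip(present_ts_ms, present_ts_ms[1:]):
        if cur - prev > slack * nominal_interval_ms:
            late += 1
    return late


# Synthetic timestamps (ms) for a 30 fps clip: two gaps skip at least one vsync.
synthetic_ts = [0, 33, 67, 133, 167, 200, 300]
print(count_late_frames(synthetic_ts, nominal_interval_ms=1000 / 30))
```

A trace that feeds real page-flip timestamps into a check like this would separate "frames arrive late" from "compositor is busy", which 99 Hz perf sampling cannot do.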

===== Phase 4 design-space implications =====

The design space narrows again, this time orthogonally to Phase 1's narrowing:

  * **Architect's hypothesis (a) — cache the dmabuf-to-GL-texture import / eliminate per-frame ''glEGLImageTargetTexture2DOES''**: rejected as the cost site. The function is cold. Patching it would not move the drop metric.
  * **Architect's hypothesis (b) — promote subsurface to direct scanout via ''wp_drm_lease_v1''**: already rejected in Phase 1 (structurally unreachable on rockchip-drm because the only NV12-LINEAR plane on CRTC 52 is the primary plane, in use for the GL framebuffer).
  * **New candidate H1'-a (Wayland protocol-dispatch overhead)**: addresses CPU but **not** drops, per the Brave-vs-chromium-fourier inversion. Possibly worth a CPU-reduction patch on its own merits (Mali-class hardware), but not the campaign's locked goal (''drops_post_warmup == 0'').
  * **New candidate H1'-b (drops: scheduling/pacing root cause)**: requires a different measurement instrument: frame-timing trace, syncobj acquire-point-vs-frame-clock alignment, wl_surface commit timestamping. **The campaign's goal moves from "find and patch the per-frame CPU bottleneck" to "find and patch the drop-causation mechanism".** These are not the same problem.

===== What was NOT captured / deferred =====

Per the discipline rule (''feedback_phase_discipline.md''): name the work, name where it goes, don't quietly drop a question.

  * **''drops_total'' and ''drops_post_warmup'' for C0, C1, C2**: lost. C0 has zero drops by construction (no playback). C1 and C2 drops are deferred to a follow-up measurement session that uses single-line launch commands (no shell line-continuation issues) and redirects stderr to a file (''--enable-logging=stderr 2>file''). Acceptable interim: use Phase 0 handover values (58/29 for KWin, 7/0 for cage) as historical reference rather than fresh same-session numbers.
  * **Pass-2 replicates (C0/C1/C2/C3 reps 2 + 3) for IQR**: deferred. Justification: H1 is rejected at N=1 with the largest named-symbol reading (0.15 %) more than an order of magnitude below the 5 % rejection threshold; replication would not change the verdict. IQR is meaningful for borderline-confirmed hypotheses; this isn't one.
  * **''route_engaged'' cell for C3 (stock Brave)**: not captured. Would need a chrome trace of ''WaylandBufferManagerHost::CommitOverlays'' / ''SkiaRenderer::SwapBuffers'' ratio. Stock Brave probably does NOT engage the overlay route (no Step 2 patch). Deferred to the same follow-up session.
  * **chrome trace for C1 to confirm Step 2 IS engaged in this session**: deferred. Phase 0's ''phase3_remeasure_2026-05-02/task23_per_frame_route.md'' confirmed it on a different day; assuming continuity unless proven otherwise.
  * **Bare-DRM cage (cage_drm)**: not in this session's plan. Optional per ''worklist.md:124-126'', deferred unless a future ambiguity demands it.
  * **Phase 2 source-read of ''KWin::Display::dispatchEvents'' and the libwayland-server arrival path**: this is the H1' pivot's required predecessor. Belongs in a "Phase 2-prime" section appended to [[kwin_overlay_subsurface:phase2_source_findings|phase2_source_findings]] rather than rewriting the existing H1 section (per ''phase1_lock.md'' discipline — don't quietly move the prior to match the result).
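
For the deferred drops capture in the first item, one way to avoid shell line-continuation and redirect mistakes is to launch the client from a script with an argument list. This is a minimal sketch under stated assumptions: the helper name and log file name are hypothetical, and a stand-in child process writing placeholder text replaces the browser so the example runs anywhere.

```python
import os
import subprocess
import sys
import tempfile


def run_with_stderr_log(argv, log_path, timeout_s=None):
    """Run argv (a list, so no shell quoting or line continuation is involved)
    and write the child's stderr to log_path.

    Returns the exit code, or None if the run was stopped by the timeout.
    """
    with open(log_path, "wb") as log:
        try:
            return subprocess.run(argv, stderr=log, timeout=timeout_s).returncode
        except subprocess.TimeoutExpired:
            return None


# Stand-in child so the sketch runs without a browser. A real run would pass
# something like [browser_binary, "--enable-logging=stderr", clip_url].
# The text written here is a placeholder, not a real browser log line.
demo_argv = [sys.executable, "-c", "import sys; sys.stderr.write('drops_total=0\\n')"]
log_file = os.path.join(tempfile.mkdtemp(), "c1_stderr.log")
exit_code = run_with_stderr_log(demo_argv, log_file)
print(exit_code, open(log_file).read().strip())
```

Passing ''timeout_s'' gives a fixed-length capture window, at the cost of the child being killed rather than exiting cleanly.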

===== Caveats and open questions =====

  * **Single replicate per condition**. IQR uncaptured. Verdict robust because point estimates are far from the threshold band, but session-to-session variability isn't quantified.
  * **Browser version skew between C1 and C3**: chromium-fourier is Chromium 149.0.7812.0; stock Brave is 147.1.89.145. Browser-side frame-pacing or compositor-protocol-traffic differences between 147 and 149 are NOT controlled for. The C1-vs-C3 inversion could be partly explained by upstream Chromium changes between those versions, not solely by the fourier patches.
  * **C2's ''kwin_wayland'' measurement is dominated by plasmashell + cage's framebuffer composite, not by chromium-fourier's video subsurface composite** (cage absorbs that). C2 perf is therefore not a clean "chromium-without-kwin" measurement; it's a "cage-as-a-wrapper-on-kwin" measurement. The cage-process perf (also captured) is closer to "chromium's internal compositor cost" but at low sample count (516 samples in 70 s → ~7 % CPU on cage process).
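  * **Plasmashell's ~22 % idle compositor traffic** is part of the natural Plasma session baseline. We did not investigate what plasmashell is animating. The differential measurements (C1 vs C2, C1 vs C3) assume plasmashell contributes the same load in each condition; the occlusion effect described under "Reading these numbers" (chromium-fourier's window hiding plasmashell in C1 but not in C2) means that assumption is not verified.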
  * **Browser profile state**: chromium-fourier and stock Brave used their respective default profile dirs. Profile contents (extensions, settings) are not controlled for.
  * **The drop-causation reframing is not verified**, only suggested by the Brave-vs-chromium-fourier inversion. A clean test would require running both clients with proper drops capture in the same session, which we did not do.

===== Decision and pivot =====

The campaign's working H1 hypothesis is closed. Phase 2's [[kwin_overlay_subsurface:phase2_source_findings|phase2_source_findings]] H1 section remains as the "what we believed before measurement" record, with a "**Phase 2 revision — superseded by Phase 3 measurement on 2026-05-02**" note pointing here. Discipline rule honoured: prior is not silently rewritten.

Next campaign step is a **Phase 2-prime / Phase 3-prime measurement design** focused on:

  - The Wayland-protocol-message volume from chromium-class clients during 30 fps NV12 playback (per-second message count, broken down by message type).
  - The KWin entry to the Qt-event-filter chain — ''KWin::Display::dispatchEvents'' and the libwayland-server arrival path.
  - The drop-causation timing: wl_surface commit timestamp vs frame-callback delivery vs atomic-page-flip vsync alignment, captured with frame-level tracing rather than 99 Hz perf sampling.
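
For the first item, a per-type message count could be derived from a protocol log such as the one libwayland prints when ''WAYLAND_DEBUG'' is set. The sketch below is an illustration only: the exact line format varies between libwayland versions (the regex accepts both ''@'' and ''#'' as the object-id separator as a hedge), and the sample lines are synthetic.

```python
import re
from collections import Counter

# Matches "interface@id.message(" or "interface#id.message(" (assumed log shape).
MSG_RE = re.compile(r"(\w+)[@#]\d+\.(\w+)\(")

# Synthetic lines for illustration; not captured from this session.
SAMPLE_LINES = [
    "[ 100.001] wl_surface@12.attach(wl_buffer@30, 0, 0)",
    "[ 100.002] wl_surface@12.damage_buffer(0, 0, 1920, 1080)",
    "[ 100.003] wl_surface@12.commit()",
    "[ 133.340] wl_surface#12.commit()",
    "not a protocol line",
]


def count_messages(lines):
    """Count protocol messages per 'interface.message' name."""
    counts = Counter()
    for line in lines:
        match = MSG_RE.search(line)
        if match:
            counts[f"{match.group(1)}.{match.group(2)}"] += 1
    return counts


def per_second(counts, window_s):
    """Convert absolute counts into a per-second rate over the capture window."""
    return {name: n / window_s for name, n in counts.items()}


print(per_second(count_messages(SAMPLE_LINES), window_s=2.0))
```

Logging itself adds overhead, so rates measured this way are better treated as a message-mix breakdown than as a faithful timing profile.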

Phase 4 patch design is **deferred** until Phase 2-prime is documented. No patch precedes a documented source-read.