Differences

--- fourier:attribution_2026-05-03 [2026/05/04 03:58] – Phase 5 review artifacts — fourier_attribution campaign 2026-05-03 (raw uncurated) markus_fritsche
+++ fourier:attribution_2026-05-03 [2026/05/04 07:24] (current) – Append Sonnet architect Phase 5 review response markus_fritsche
@@ Line 977: / Line 977: @@
 Per [[feedback:dev_process|Phase 5]] of the dev process: Claude is forbidden from curating this artifact. The reviewer is asked to read the raw documents above and surface anything the active session may have rationalised away — methodology gaps, threshold-setting bias, confounds the campaign acknowledged but didn't quantify (e.g. Brave-vs-Chromium-149 version delta in cell B), and any path from this matrix to a useful next campaign.
+====== Phase 5 reviewer response (Sonnet architect, 2026-05-04) ======
+Engaged via [[https://dokuwiki.reauktion.de/doku.php?id=feedback:dev_process|Plan]] subagent, ''model: sonnet'' override, open-consultation mode. Active-session prompt forbade curating the artifact; reviewer was explicitly asked to argue with the verdicts.
+====== Phase 5 Review — fourier_attribution 2026-05-03 ======
+Reviewer: Sonnet architect (subagent, fresh-model invocation 2026-05-04, open-consultation mode).
+Engaged via ''%%Plan%%'' subagent with ''%%model: sonnet%%'' override per the ''%%kwin_overlay_subsurface%%'' precedent. The review prompt explicitly forbade the active session from curating the artifact sent to the reviewer; the reviewer was given paths to the local repo and the DokuWiki page and asked to argue with the verdicts.
+===== §1 Methodology concerns =====
+**The cell ordering and baseline recycling are the biggest structural problem.**
+Cell A (all-fourier-on baseline) was run in the phase0 evidence set (reps starting ~20:57Z), then those reps were discarded in favour of a *second* N=3 cell-A run in the phase4 evidence set (reps starting ~21:09Z, i.e. ~12 minutes later, same kwin PID 53655). The phase4 cell A has measurably different numbers from the phase0 cell A: drops_60s fell from {20,15,15} to {17,12,8}, and browser_cpu_median fell from {56.6,56.25,56.0} to {54.25,54.4,54.9}. Both moves are in the "looks better" direction. Thresholds were locked against the *phase0* cell A medians and ranges (drops_60s median=15, browser_cpu_median=56.0), but the phase4 cell A values used as the comparison baseline in ''%%phase4_findings.md%%'' are the *lower* set (median drops_60s=12, browser_cpu_median=54.4). The analysis then computes deltas against "A median = 12 drops / 54.4 browser CPU" while the pass/fail thresholds were set against a baseline of "15 drops / 56.0 browser CPU." This is not called out anywhere in phase4_findings.md, and the arithmetic is self-inconsistent as a result.
+This matters concretely for cell B: the "drops_60s = +4 vs A" is computed as B_median(16) − A_phase4_median(12) = +4. But the threshold was set as "phase0 median + 5 = 20." If the threshold anchor had been re-locked at the phase4 cell A, B's drops_60s delta would still be +4 (which is fine, still below 5). For the fps verdict the delta -0.83 is large either way. So this inconsistency doesn't change the cell-B verdict in this case — but it should have been called out, and it opens the door to the question of which cell A is "the" baseline.
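+The two-anchor arithmetic above can be re-checked in a few lines. This is a minimal sketch, not the campaign's extractor: the rep values, cell B's median of 16 and the delta of 5 are taken from this review, while the variable and function names are made up for illustration.

```python
from statistics import median

# drops_60s per rep for the two cell-A runs, as quoted in this review.
PHASE0_A_DROPS = [20, 15, 15]
PHASE4_A_DROPS = [17, 12, 8]
B_MEDIAN_DROPS = 16      # cell B median, as quoted in this review
DELTA_THRESHOLD = 5      # locked drops_60s delta; a delta above 5 triggers


def delta_vs_anchor(cell_median, anchor_reps):
    """Delta of a cell median against the median of the chosen cell-A reps."""
    return cell_median - median(anchor_reps)


delta_phase0 = delta_vs_anchor(B_MEDIAN_DROPS, PHASE0_A_DROPS)
delta_phase4 = delta_vs_anchor(B_MEDIAN_DROPS, PHASE4_A_DROPS)

for name, delta in (("phase0", delta_phase0), ("phase4", delta_phase4)):
    verdict = "within" if abs(delta) <= DELTA_THRESHOLD else "beyond"
    print(f"anchor={name}: delta={delta:+d} ({verdict} threshold)")
```

+Both anchors leave cell B inside the drops threshold, but the reported delta depends on which cell A is used, so the anchor should be named next to every delta.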
+**The metric set has a blind spot: no per-frame GPU submit latency or fence wait time.** The active session noted that ''%%wp_presentation_feedback%%'' would yield Δ_present; more directly useful would be a simple count of how many top -d 1 kwin samples are above, say, 20% CPU (i.e. a "high-load fraction" rather than just median). The median of 32.9% in cell D is so high it doesn't need better instrumentation — but for borderline cases like cell B's kwin_cpu_median of 13.0% vs threshold of 12.5%, per-frame feedback would distinguish "systematically higher" from "a few high spikes pulling up the median." The campaign correctly excludes ''%%wp_presentation_feedback%%'' as out-of-scope, but it should have flagged that the kwin_cpu_median in cell B is at the threshold boundary and would benefit from more resolution.
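+A "high-load fraction" of the kind suggested above could be computed as follows. This is a hedged sketch: the function name, the 20% cutoff default and both sample series are illustrative assumptions, not campaign data or part of the existing extractor.

```python
from statistics import median


def high_load_fraction(samples, cutoff=20.0):
    """Fraction of CPU samples strictly above `cutoff` percent."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(1 for s in samples if s > cutoff) / len(samples)


# Made-up per-second kwin CPU samples, NOT campaign data.
steady = [13] * 10
spiky = [12, 12, 12, 12, 13, 13, 30, 35, 40, 41]

for label, series in (("steady", steady), ("spiky", spiky)):
    print(label, median(series), high_load_fraction(series))
```

+The two series share a median of 13 yet differ in how often they exceed the cutoff, which is the distinction a median alone cannot show.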
+**No cell E (all-fourier-off).** The campaign acknowledges this. The combination matters because if kwin-fourier and chromium-fourier both being off produces a qualitatively different failure mode than kwin-fourier alone being off, the single-toggle matrix won't surface it. For practical maintenance decisions (can we drop all three packages?) this is the directly relevant cell and it's missing.
+**Execution order: A→B→D→C.** The full sequence was phase0-A (20:57Z), phase4-A (21:09Z), B (21:14Z), D (21:50Z), C (22:01Z). Cell D ran ~36 minutes after cell B, on a fresh post-revert session (kwin PID 77581); cells A and B shared the original long-uptime session (kwin PID 53655). Cell C ran on yet another new session (kwin PID 82420, different from 53655 and 77581). The campaign notes this in phase4_findings "open questions" as potential ambient drift in cell C, but it doesn't ask the symmetric question: did running A and B back to back on one continuous session introduce a drift pattern into those two cells that wouldn't apply to the fresh-session cells D and C?
+===== §2 Threshold concerns =====
+**The kwin_cpu_median threshold bump from 0 to 0.5 is defensible as a decision but not as stated.** The phase0 cell A had kwin_cpu_median = 12.0 in all three reps (range 0.0). The justification for inflating the threshold to ±0.5 is that ''%%top -d 1%%'' rounds to integer percent. That is correct as far as it goes, but the implication is wrong: if top rounds to integers, then any non-zero difference between cells will be at least 1.0, not 0.5. Setting the threshold at 12.5 is operationally equivalent to setting it at 12.99 — the only values top will return are 12.0 or 13.0, never 12.5. So the 0.5 threshold is conservative in the right direction (avoids false positives) but its stated rationale ("floor because top rounds") is imprecise. It effectively means "any cell with median ≥ 13.0 triggers the threshold." Cell B's median is 13.0, which is exactly at the trigger. If the threshold had been stated correctly as "rounds-to-1%" it would have been ±1.0, and cell B's kwin_cpu delta of +2.0 would still trigger but you'd be clearer about what you're measuring.
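+The equivalence claimed above can be checked mechanically. The sketch below assumes every per-rep median lands on an integer, which holds for an odd number of integer samples; with an even sample count a median of 12.5 is possible and the argument weakens. The function name is hypothetical.

```python
def triggers(median_pct, threshold):
    """True when a per-rep median exceeds the threshold."""
    return median_pct > threshold


# Integer-valued medians, the only values the review expects from top.
candidates = range(0, 101)

same_verdicts = all(
    triggers(v, 12.5) == triggers(v, 12.99) == (v >= 13)
    for v in candidates
)
print(same_verdicts)
```

+Under that assumption a threshold of 12.5 and one of 12.99 classify every possible median identically, so the effective rule is "median of 13 or more triggers."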
+**The cell-A median used in phase4_findings.md is from the phase4 A reps, but the IQR thresholds were locked from the phase0 A reps.** This means the anchor point (what "A median" means) is different in the threshold table vs the delta computation table. Specifically: drops_60s threshold was "≥21 to trigger" (phase0 median 15 + delta 5 = 20; but the wording says ≥21). Cell D's drops_60s median is 24, which clears it either way. Cell B's drops_60s delta of +4 is computed as B(16) − A_phase4(12), giving +4, which is below the ±5 threshold. But B(16) − A_phase0(15) = +1, even more clearly below threshold. So this inconsistency doesn't change any verdict, but the pass/fail table in phase4_findings.md cites "+4 = within ±5" while the drops floor it's compared against is the phase4 median (12), not the threshold anchor (15). The documentation is subtly self-contradictory and would mislead anyone trying to verify the math.
+**panfrost_mean_freq threshold of ±10 MHz is derived from the phase0 range of 8.9 MHz, rounded up to 10.** Reasonable. The phase4 cell A range is 607.2 − 596.4 = 10.8 MHz, slightly wider than the phase0 range. If the phase4 range had been used as the threshold, it would have rounded up to ±11 MHz, and cell C's −11.2 delta would exceed it by only 0.2 MHz instead of 1.2 MHz: still a marginal wrong-direction excursion, but an even smaller one. This has no verdict impact since cell C is chaff by all other metrics, but it illustrates the phase0-vs-phase4 baseline confusion again.
+===== §3 Confounds the campaign missed =====
+**The "daily Brave ambient" was present for the phase0 cell A reps but gone for the phase4 matrix.** The phase0 state snapshot notes "operator's daily Brave (PID 58105 etc.) running in background, ~13% CPU idle. Documented as stable ambient confound across all three [phase0 A] reps." In phase4_findings.md, the README says "daily Brave killed for clean ambient", so by the time the phase4 matrix was run, the ambient Brave was gone. This means the phase4 A reps are running in a cleaner ambient than the phase0 A reps, which is probably why browser_cpu_median dropped from ~56 to ~54 (the background Brave is no longer consuming CPU that could have been attributed to the workload browser's process group, depending on how aggregation works). This is explicitly tracked in phase0 as a "stable confound" and it was apparently resolved before phase4, but the two A-rep runs are never reconciled or explained. Why was cell A run twice?
+**Session age asymmetry within cell D.** This is the one the active session noticed. But it didn't note the directional implication: cell D reps started at 21:50Z, roughly 41 minutes after the first phase4 cell A rep (21:09Z). The kwin_wayland PID (77581) for cell D is a fresh post-revert session. If a fresh KWin session has a warm-up cost (dbus listeners registering, initial frame pipeline setup, shader compilation) that settles within a few minutes, and all three D reps showed near-identical kwin CPU of ~32.9%, then this is a steady state, not a fresh-session artifact. The signal is too large and too stable across reps to be a warm-up artifact. But the campaign missed the inverse question: are cells A and B *benefiting from an older, warmed-up compositor state*? kwin PID 53655 had been running since before the phase0 runs, which started at ~20:57Z, i.e. kwin was at least ~35 minutes old before any data was collected. An older KWin session will have shader caches warm, GL state cached, etc. Cell D's fresh KWin might genuinely be doing more work because it hasn't compiled/cached what kwin-fourier-patched KWin had already compiled by the time cell A ran. This is a concrete alternative explanation for part of the cell D vs A delta that is not the kwin-fourier patch itself.
+However, the shader-cache explanation for cell D's 32.9% vs 11% kwin CPU is implausible: shader compilation is a one-time cost per boot/profile, not a 60-second continuous burn. The sustained 95% GPU peak-freq residency in cell D is not consistent with shader compilation overhead; it's consistent with the watchDmaBuf spin diagnosis. So this confound is unlikely to explain cell D's result, but it should have been named and dismissed explicitly.
+**Cell D ran before cell C**, meaning when the kwin revert happened, qt6-base was still the fourier version. Cell C ran with stock qt6-base but fourier kwin. The sequence was: all-on → chromium-off (B) → kwin-off (D) → qt6-off (C). That raises the question of what happened to kwin for cell C. The c_rep1_start.txt shows kwin-fourier 1:6.6.4-3 is present, which means kwin-fourier was reinstalled between cell D (kwin-off) and cell C (qt6-off). The revert log only shows the kwin revert, with no kwin reinstall log, so there was a reinstall step between D and C that isn't in evidence. That's not a confound per se, since the start.txt confirms packages were verified per-rep, but the session order (D before C) means cell C ran on a KWin session that was created *after* the kwin-fourier reinstall + logout+login. The phase0 notes describe autologin via ''%%99-autologin-fourier-attribution.conf%%'', and the kwin PID changes confirm each logout+login created a fresh session. So cell C is the *second* fresh session (after D), and cell A/B are on the *original long-uptime session*. This makes A/B vs C/D a confounded comparison on session age independent of the package toggle.
+===== §4 Verdict robustness per package =====
+**kwin-fourier: WHEAT, robust.**
+The signal is 3× kwin CPU (11→33%), 170 MHz mean GPU freq jump (~28% of scale), 95% peak-freq residency (vs 35%), and 2× drops. All three cell D reps are near-identical — kwin_cpu_median is exactly 32.9% in all three, panfrost_mean_freq is 783/777/775. This is the tightest cell in the whole matrix and the largest delta. No plausible confound explains this away: the shader-cache alternative is dismissed above (it doesn't produce a 60-second continuous GPU burn at 95% peak residency). The session-age asymmetry (fresh D vs long-uptime A) would, if anything, help cell D by having a cleaner cache state, but it doesn't explain the sustained GPU saturation.
+Confidence: very high. If the verdict here were wrong, you'd need to argue that kwin PID 77581's fresh session caused the GPU to continuously max-freq for the entire 60-second window for a reason unrelated to the kwin packages, which is not credible.
+**chromium-fourier: WHEAT-but-fragile verdict.**
+The claimed evidence is: fps −0.83 (Brave 22.91–23.43 vs chromium-fourier 24.0–24.01), browser_cpu_median +82.75pp (137 vs 54.4), kwin_cpu_median +2.0 (13.0 vs 11.0). All three are confounded by the Brave-vs-Chromium-149 version gap.
+The fps delta is the clearest problem. Cell B fps values are {23.43, 23.18, 22.91} — a declining trend across the three reps (B1→B2→B3, each about 0.25 fps lower). Cell A's fps is locked at 24.00–24.01. This declining trend across Brave reps suggests something is drifting within the Brave session (progressive video decoder stall? buffer pressure?), not a stable property of the browser. The effective_fps metric in the extractor is computed as ''%%(frames_60s - frames_5s) / (s60[0] - s5[0])%%'', and cell B's frames_60s are {1414, 1392, 1367} — genuinely delivering fewer frames, not just a measurement artifact. Whether this is a Chromium-147 decoder limitation, a Brave-specific regression vs Chromium-149, or a chromium-fourier patch effect cannot be separated.
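+For readers checking the fps figures, the quoted extractor expression can be sketched as a standalone function. The sample layout is an assumption: each sample is taken to be an ''%%(elapsed_seconds, frame_count)%%'' pair, so that index 0 matches ''%%s60[0]%%'' and ''%%s5[0]%%'' in the expression; the numbers below are hypothetical, not campaign data.

```python
def effective_fps(s5, s60):
    """Frames delivered per second between two samples.

    Each sample is assumed to be (elapsed_seconds, frame_count).
    """
    elapsed = s60[0] - s5[0]
    if elapsed <= 0:
        raise ValueError("s60 must be later than s5")
    return (s60[1] - s5[1]) / elapsed


# Hypothetical samples: 120 frames at 5 s, 1440 frames at 60 s.
print(effective_fps((5.0, 120), (60.0, 1440)))
```

+Because the numerator is a frame-count difference, a lower frame count at 60 s lowers the result directly, which is why the declining frames_60s values point to fewer delivered frames rather than a timing artifact.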
+The browser_cpu_median of 137pp vs 54.4pp is a 2.5× gap. This is real. Brave with Chromium-147 base is consuming about 2.5× as much CPU for the same workload. But Chromium-147 vs 149 is a two-major-version delta, which can easily explain CPU differences on this scale in a decoder-heavy workload (codec path changes, VA-API usage patterns, zero-copy buffer handling). The chromium-fourier patches (Step 1 = libva-v4l2-request port, Step 2 = WaylandConnection overlay-route) are precisely the kind of changes that would reduce browser CPU by enabling hardware decode paths, but you cannot tell from this matrix whether those paths are also present in Brave-147, absent in Brave-147, or partially present with different efficiency.
+The kwin_cpu_median of +2.0 (11→13%) is, as noted above, right at the rounding threshold. It's suggestive that Brave presents frames less efficiently to KWin, but at N=3 with integer-rounded values, it's barely more than a 1-sample wide signal.
+My independent verdict: call it WHEAT-suspected-but-unconfirmed. The direction is clear, the magnitude is large, but the control comparison is the wrong browser at the wrong version. The campaign's own caveat is correct — you cannot call this "chromium-fourier delivers benefit" cleanly; you can only call it "chromium-fourier + chromium-149 base is substantially better than Brave-1.89/Chromium-147 on this workload." The confound is load-bearing. I would not ship the chromium-fourier conclusion to anyone making a package maintenance decision without the Chromium-149 vanilla control.
+**qt6-fourier: CHAFF on this workload, verdict sound.**
+Zero of five metrics moved beyond threshold when qt6-base was reverted to stock. The panfrost mean freq delta is −11.2 MHz (slightly *lower* GPU usage without qt6-fourier), which is the wrong direction for "the patch helps." Cell C reps have wider variance than cell A (drops_60s 5/10/14, browser_cpu 52.2/54.15/60.2), which the campaign correctly flags. However, c_rep1's browser_cpu_median is 60.2 — which is 5.8 above the cell A baseline of 54.4 — and the threshold is "+1." If all three C reps had been like c_rep1, cell C would have been a false-positive wheat verdict on browser_cpu. The fact that c_rep2 (54.15) and c_rep3 (52.2) are both at or below baseline suppresses this. The variance in cell C is real and should be noted as a reliability concern for the qt6-fourier verdict, not just flagged as "wider than expected." Had the campaign run N=5 for cell C, or had c_rep1's values been closer to the mean, the verdict might have been uncertain rather than confidently chaff.
+The workload-specificity caveat is well-stated. "CHAFF on bbb 1080p H.264 Chromium-149" is correct. "CHAFF generally" is not supported.
+===== §5 Cheapest next campaign =====
+**Run a Chromium-149 vanilla control cell (cell E) to de-confound the chromium-fourier verdict.**
+This is the single highest-value next step. The action is: obtain or build a stock Chromium-149 binary (without the Step 1 libva-v4l2-request port and without the Step 2 WaylandConnection overlay-route patches) and run it as cell E with all three fourier packages on. Compare against cell A (chromium-fourier on) and cell B (Brave-147).
+Cost: the main effort is building or obtaining Chromium-149 vanilla for aarch64. The predecessor ohm_gl_fix campaign built chromium-fourier from source, so a same-version vanilla build is feasible — it's the same build without applying the fourier patches. If the build artifacts from the ohm_gl_fix campaign are still around, a stripped binary might be constructible faster. Alternatively, checking whether the Arch aarch64 Chromium package (not fourier-patched) is at version 149 would give a zero-effort control — if it's already 149 in the repo, ''%%pacman -S chromium%%'' may be sufficient.
+If cell E shows vanilla-149 performs close to cell A (chromium-fourier), the verdict becomes "the benefit was version-level, not patch-level." If cell E is close to cell B (Brave-147), the verdict strengthens to "patches matter, not version." If cell E is somewhere in between, you have a partial attribution.
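+The three-way reading of cell E can be made explicit before the run, so the interpretation is not chosen after seeing the data. This is only a sketch: the function names, the 0.2 tolerance and the example cell-E values are assumptions for illustration; the cell A and cell B browser_cpu_median values are the ones quoted in this review.

```python
def gap_fraction(cell_a, cell_b, cell_e):
    """Position of cell E between cell A (0.0) and cell B (1.0)."""
    if cell_a == cell_b:
        raise ValueError("cell A and cell B must differ")
    return (cell_e - cell_a) / (cell_b - cell_a)


def attribute(cell_a, cell_b, cell_e, tol=0.2):
    """Map the gap fraction onto the three outcomes; `tol` is arbitrary."""
    frac = gap_fraction(cell_a, cell_b, cell_e)
    if frac <= tol:
        return "version-level"   # vanilla-149 performs like cell A
    if frac >= 1.0 - tol:
        return "patch-level"     # vanilla-149 performs like cell B
    return "partial"


A_CPU, B_CPU = 54.4, 137.0   # browser_cpu_median for cells A and B
for hypothetical_e in (56.0, 95.0, 130.0):   # made-up cell-E outcomes
    print(hypothetical_e, attribute(A_CPU, B_CPU, hypothetical_e))
```

+Locking the tolerance before cell E is run would avoid repeating the anchor ambiguity raised in §1 and §2.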
+This needs only one additional cell at N=3, targeting exclusively the browser_cpu_median and fps metrics (kwin_cpu and GPU freq were not the primary indicators for cell B). The campaign infrastructure (orchestrator script, test rig) is already in place; the only new work is producing the binary.
+Nothing else in the open-question list is cheaper or higher-signal for the stated campaign question.