====== fourier — Phase 5 review handover, 2026-04-30 ====== Second-model review artifact. Goal/situation/measurements/plan; uncurated. Reviewer reads top-to-bottom, redacts as needed, then hands to a fresh model instance. Index: [[fourier:start]]. Origin: prior Phase 1 metric ("max FPS / decode rate") was invalidated by ad-hoc Phase 3 measurements collected 2026-04-24 on ohm. Loop edge 3→1 fired retroactively. This page is the re-entry at Phase 2 with the existing data treated as Phase 3 output, plus a fresh reproduction on the danctnix/archarm baseline. ===== Phase 1 — Goal (revised) ===== Move ohm/RK3566 to "1080p30 YouTube playback at <30% total CPU, zero dropped frames over a 60 s window" by way of the chromium-fourier package. Same gate Markus already wrote into the README; this page is the metric-definition for the next iteration of the loop. ==== Why the metric changed ==== Original: "max H.264 decode FPS." Phase 3 showed that decode-FPS is not the binding constraint. ohm's hantro VPU caps at ~95 fps for 1080p H.264. Source rate (24 fps) is reached at 14 % CPU when the decoder output is discarded (drm_prime + null muxer). The cost shows up downstream of the decoder, in whatever consumer takes the frame. ==== Replacement — variant (C) ==== Two bands measured on every run: * **Coarse band** (ship-gate). Per-clip realtime run, fixed length, fixed clip, fixed framerate target. Capture: time-averaged CPU% over the run, decoder + sink drops out of total presented frames, and a path-discipline note (which surface negotiated; dmabuf-direct or GL-imported; what was actually established at both ends). "Hard facts > somewhat hard facts" — average CPU% is the hard fact, not a peak-watching estimate. * **Latency band** (diagnosis). Per-frame distribution over the chain compressed-bytes-arrival → presented-pixel-on-display, with sub-bands queue-fill / decode / fence-publish / dmabuf-export / GL-import (when applicable) / composite / scanout. Used when (B) fails and we don't know which sub-band ate the latency. ==== Surface-variant clarification ==== The metric does **not** require a GLES-textured terminus. The winning ohm path (gst → waylandsink) terminates at a VOP2 scanout plane without ever touching a GL texture. Phase 1 target = "presented pixel on display." Each measurement records its surface variant (''dmabuf-scanout'', ''GLES-imported'') so latency/CPU comparisons across paths stay honest. ===== Phase 2 — Situation ===== ==== Test target ==== ohm. PineTab2, RK3566, hantro-vpu on ''/dev/video1''. Mali-G52 panfrost. Kernel ''6.19.10-danctnix1-1-pinetab2''. Plasma Wayland session running on DSI-1, KWin compositor. Wayland socket ''/run/user/1001/wayland-0''. ==== Stack (after revert; this is the danctnix/archarm baseline) ==== * ''ffmpeg-v4l2-request 2:8.1-3'' from danctnix. Lists ''v4l2request'' in ''-hwaccels''. Built from same Kwiboo branch the marfrit stripped variant tracks. * ''mpv 0.41.0''. ''gst-launch-1.0 1.28.2''. * ''libva 2.23.0'', ''libva-utils 2.22.0'', no v4l2_request VA driver installed (the bootlin meson-stranded ''.so'' was removed). * ''[marfrit]'' commented out in ''/etc/pacman.conf'' for the duration of this iteration. Pre-revert installed from [marfrit]: * ''ffmpeg-v4l2-request-git 2:8.1.r123329.b57fbbe-1'' — replaced by danctnix's ''ffmpeg-v4l2-request 2:8.1-3''. * ''brave-bin 1:1.89.143-1'' (AUR, installed 2026-04-24 for the Brave-HW-decode probe) — removed. * Stranded ''/usr/lib/dri/v4l2_request_drv_video.so'' (266 KB, no package owner) from a meson-direct install of bootlin ''libva-v4l2-request-git'' — removed. Snapshots before/after at ''/tmp/ohm-pre-revert-pacman-Q.txt'', ''/tmp/ohm-post-revert-pacman-Q.txt'', ''/tmp/ohm-pacman.conf.pre-revert''. ==== Constraints already known ==== * Producer-side ''dma_resv'' is empty on V4L2 stateless decoders. KWin's ''watchDmaBuf'' fence wait would either spin or skip. Mitigated downstream by the kwin-fourier 0001-skip-wait patch (shipped); kernel-side fix in flight as the vb2-dma-resv RFC (v1 sent, two reviews received, v2 sketch backlogged — see [[fourier:vb2_dma_resv_rfc]]). * GL/EGL upload from a V4L2 dmabuf is the bottleneck on Mali-class silicon. Quantified twice on ohm prior to this iteration (mpv SW gpu-next: 127 % CPU, 973 drops; mpv HW v4l2request gpu-next: 122 % CPU, 930 drops). Same shape under stock danctnix today (R6 below). * Browsers cannot use ''waylandsink''. Firefox renders into its own GLES viewport; Chromium uses Skia/SkiaGanesh on GLES. The browser path is structurally GL-bound; the dmabuf-direct win does not transfer. * ''mpv --vo=dmabuf-wayland'' fails format negotiation on ''yuv420p → drm_prime''. Hwupload filter rejects the output format ("hardware format not supported"). Re-confirmed in R7. ==== What will not be touched in this iteration ==== * fresnel (RK3399), ampere/boltzmann (RK3588): out of scope. * Kernel side. The vb2-dma-resv v2 sketch is backlogged separately. * Firefox. The campaign-shipped firefox-fourier 150.0.1-16 result on fresnel (RDD 78 % → 5 %) stands; reproducing it on ohm is interesting but is path (3) in the Phase 4 ranking, not (1). ===== Phase 3 — Baseline measurements ===== Corpus clip: ''~/fourier-test/bbb_1080p30_h264.mp4'', SHA-16 ''dcf8a7170fbd49bb''. Full Big Buck Bunny (725 MB, 9:56, 24 fps source rate). All paced runs cap at 60 s of source via ''-t 60'' / ''--end=60'' / ''timeout 62 -INT''. CPU% column is GNU ''/usr/bin/time -v'' "Percent of CPU this job got" — a time-average over the whole run, not a peak. ==== 2026-04-30 reproduction (stock danctnix/archarm) ==== ^ Run ^ Path ^ CPU% ^ Wall ^ Drops/Total ^ Path discipline ^ | R1 | ''ffmpeg -re -hwaccel none -f null -'' (SW) | 89 % | 1:00.21 | n/a | no decode hwaccel | | R2 | ''ffmpeg -re -hwaccel v4l2request -f null -'' (no ''drm_prime'') | 64 % | 1:00.25 | n/a | hwaccel engaged + CPU readback | | R3 | ''ffmpeg -re -hwaccel v4l2request -hwaccel_output_format drm_prime -f null -'' | **15 %** | 1:00.23 | n/a | hantro-vpu engaged on S264; drm_prime export only; no consumer | | R4 | ''gst v4l2slh264dec ! fakesink sync=false'' (unpaced, 1800 frames) | 89 % | 0:45.11 | n/a | decoder emits ''video/x-raw(memory:DMABuf), format=DMA_DRM, drm-format=NV12''; fakesink discards | | R5 | ''gst v4l2slh264dec ! waylandsink sync=true'' (paced, 60 s wall) | **8 %** | 1:02.27 | 0 | waylandsink advertises ''DMA_DRM'' caps; linux-dmabuf-v1 negotiated; KWin → VOP2 plane; no GL upload | | R6 | ''mpv --hwdec=v4l2request --vo=gpu-next'' | **138 %** | 1:02.63 | **1039 / 1440 (72 %)** | mpv selects ''h264-v4l2request''; gpu-next loads ''drmprime'' hwdec driver; ''drm_prime → GLES import → libplacebo composite''; bottleneck | | R7 | ''mpv --hwdec=v4l2request --vo=dmabuf-wayland'' | 17 % | 1:01.35 | broken | hwupload filter: ''hardware format not supported''; audio plays, no video | Logs: ''~/fourier-test/baseline-runs/20260430-1145/'' on ohm (stdouts, stderrs, mpv terminal captures, GST_DEBUG traces). ==== Prior Phase 3 dataset (README dossier, marfrit-modified ohm) ==== For comparison; ad-hoc, kernel same, ffmpeg-v4l2-request-git (marfrit Kwiboo build) instead of danctnix ffmpeg-v4l2-request: ^ Run ^ Path ^ CPU% (prior) ^ Drops (prior) ^ | R1 | SW realtime | 90 % | n/a | | R2 | HW no drm_prime | 67 % | n/a | | R3 | HW + drm_prime | 14 % | n/a | | R4 | gst → fakesink | 89 % | n/a (1800 / 48.9 s = 36.8 fps unpaced) | | R5 | gst → waylandsink | 6–7 % | 0 (1488 / 62 s, ''progressreport'' 1:1) | | R6 | mpv gpu-next | 122 % | 930 / 1440 (65 %) | | R7 | mpv dmabuf-wayland | n/a | format-negotiation failure | ==== Reproduction commentary ==== * R1–R5 reproduce within 1–3 percentage points / single-frame counts. The danctnix → marfrit swap on ffmpeg is not visible in the data — same Kwiboo branch, different build flags. * R6 worsened from 122 % / 930 to 138 % / 1039. Same shape; delta plausibly compositor / clip-content variance. The qualitative finding holds: GL composition is the bottleneck, decode is essentially free, ~70 % of frames don't reach the panel. * R7's failure mode is identical. ==== Path-optimization notes (per Markus, 2026-04-30) ==== Definition: "zero copy if possible, no double verification of buffers if already established at both ends." Captured per run: * R3: drm_prime export-only, no consumer. Zero-copy on the producer side; no second-end exists. * R5: caps negotiation runs once at preroll (''video/x-raw(memory:DMABuf), format=DMA_DRM, drm-format=NV12'' on both ends); per-frame the dmabuf fd flows decoder → KWin → VOP2 with no caps re-validation. * R6: ''drm_prime → EGLImage → GL texture'' import per frame. Each frame allocates a fresh EGLImage. This is the per-frame cost the metric needed; it's not a buffer-double-verification problem, it's a fresh-import-per-frame problem. * R7: format-negotiation fails, never reaches per-frame phase. ===== Phase 4 — Plan ===== ==== Sub-target ==== ohm + chromium-fourier + H.264 + GL surface (the browser-side analogue of R6, but inside Chromium's V4L2VideoDecoder + Skia/SkiaGanesh GL composition). Resumes task #46. **No fleet broadening this round.** ==== What will be touched ==== * data:CT220 chromium-builder. Unblock the ''gn'' configure check that rejects empty ''clang_revision''. Standard Arch chromium PKGBUILD patches solve this; CachyOS chromium-cachy fork has the equivalent. Result: a built ''chromium-fourier'' aarch64 package landing on packages.reauktion.de. * ohm: install ''chromium-fourier'' from ''[marfrit]'' (re-enabled only after the Phase 3 baseline has been frozen on this page). Run a single 60 s YouTube-equivalent clip-playback measurement in the same shape as R5/R6. * No kernel changes. No userspace patches beyond the build unblock. ==== What will not be touched ==== * fresnel, ampere, boltzmann. * Firefox (firefox-fourier already shipped, RDD result valid). * Kernel (vb2-dma-resv v2 backlogged independently). * mpv ''dmabuf-wayland'' diagnosis (path 2 in the earlier ranking; deferred). * libva-v4l2-request multiplanar port. Browser path will use Chromium's V4L2VideoDecoder, not VA-API. ==== Expected outcome (Phase 7 will compare against this) ==== Two predictions, in the same units as Phase 3 R5 + R6: - ''chromium-fourier'' on ohm, 1080p30 H.264 corpus clip, 60 s wall: avg CPU% in the renderer + GPU process combined, 80–150 %. Drops 200–800 of 1440 (14–56 %). Path: V4L2VideoDecoder → drm_prime → Chromium ''GLImage'' GLES texture → Skia composite. Bottleneck location: the GL import, same shape as R6. - Stretch: with ''--enable-features=VaapiVideoDecodeLinuxGL'' or equivalent dmabuf-direct compositing flag (the LinuxOzone overlay path), CPU% drops to 20–40 % and drop-count to single digits. This is the chromium-fourier-specific carrier patch bet. If it works, ohm clears the ship-gate. If both numbers come back inside their ranges, Phase 7 closes the loop. If they're outside (especially if (1) is much worse), the plan re-enters Phase 4 with the diagnosis from the latency band. ==== Open Phase 4 questions surfaced for review ==== - **Build unblock approach.** Patch ''gn'' to skip ''clang_revision'' check (lighter), or do a full toolchain swap to ''clang_base_path="/usr"'' + ''compiler-rt-adjust-paths'' (heavier, what Arch chromium PKGBUILD does). Either is a one-line PKGBUILD change for the gn-side; the second has more long-term upgrade implications. - **Latency-band instrumentation depth.** ''perf trace -e syscalls'' + KMS vblank timestamps (''/sys/kernel/debug/dri/0/state'') + Chromium's UMA media-pipeline traces (if exposed in this build) is the proposal. Do we add ''chrome://tracing'' captures on top, or is that overkill for one sub-target? - **Test corpus extension.** Stay on the bbb 1080p H.264 clip, or add a real YouTube DASH H.264 segment (different keyframe cadence, different bitrate envelope) for the browser-relevant case? Real YouTube would test the fragmented-MP4 path; bbb tests the static-MP4 path. - **Acceptance for "stretch" outcome.** Is a Chromium build flag sufficient as a campaign artifact, or do we want a carrier patch in chromium-fourier that flips the default? ==== Phase 5 routing ==== This page is the second-model review handover. Reviewer assesses: * Whether the (C) metric is right, or whether the latency band needs a different definition. * Whether path (1) is the right narrowing or whether path (2) or (3) buys more information first. * Whether the Phase 4 prediction range is too wide or too narrow. * What's missing. After redaction, Markus hands to a fresh model instance. ===== Source artefacts ===== * Run logs (ohm): ''~/fourier-test/baseline-runs/20260430-1145/'' * Pre/post-revert snapshots (ohm): ''/tmp/ohm-pre-revert-pacman-Q.txt'', ''/tmp/ohm-post-revert-pacman-Q.txt'', ''/tmp/ohm-pacman.conf.pre-revert'' * Project-state memory: ''project_fourier_campaign_state.md'' * Process spec: ''feedback_dev_process.md'' (the 8-phase loop) * Existing campaign README: ''~/src/fourier/README.md'' * Chromium-fourier workspace: ''marfrit-packages/arch/chromium-fourier/''