====== fourier — Phase 5 review handover, 2026-04-30 ======

Second-model review artifact. Goal/situation/measurements/plan; uncurated.
Reviewer reads top-to-bottom, redacts as needed, then hands to a fresh
model instance. Index: [[fourier:start]].

Origin: prior Phase 1 metric ("max FPS / decode rate") was invalidated by
ad-hoc Phase 3 measurements collected 2026-04-24 on ohm. Loop edge 3→1
fired retroactively. This page is the re-entry at Phase 2 with the
existing data treated as Phase 3 output, plus a fresh reproduction on
the danctnix/archarm baseline.

===== Phase 1 — Goal (revised) =====

Move ohm/RK3566 to "1080p30 YouTube playback at <30% total CPU, zero
dropped frames over a 60 s window" by way of the chromium-fourier
package. Same gate Markus already wrote into the README; this page
is the metric-definition for the next iteration of the loop.

==== Why the metric changed ====

Original: "max H.264 decode FPS." Phase 3 showed that decode-FPS is
not the binding constraint. ohm's hantro VPU caps at ~95 fps for
1080p H.264. Source rate (24 fps) is reached at 14 % CPU when the
decoder output is discarded (drm_prime + null muxer). The cost shows
up downstream of the decoder, in whatever consumer takes the frame.

==== Replacement — variant (C) ====

Two bands measured on every run:

  * **Coarse band** (ship-gate). Per-clip realtime run, fixed length, fixed clip, fixed framerate target. Capture: time-averaged CPU% over the run, decoder + sink drops out of total presented frames, and a path-discipline note (which surface negotiated; dmabuf-direct or GL-imported; what was actually established at both ends). "Hard facts > somewhat hard facts" — average CPU% is the hard fact, not a peak-watching estimate.
  * **Latency band** (diagnosis). Per-frame distribution over the chain compressed-bytes-arrival → presented-pixel-on-display, with sub-bands queue-fill / decode / fence-publish / dmabuf-export / GL-import (when applicable) / composite / scanout. Used when (B) fails and we don't know which sub-band ate the latency.

==== Surface-variant clarification ====

The metric does **not** require a GLES-textured terminus. The winning
ohm path (gst → waylandsink) terminates at a VOP2 scanout plane
without ever touching a GL texture. Phase 1 target = "presented pixel
on display." Each measurement records its surface variant
(''dmabuf-scanout'', ''GLES-imported'') so latency/CPU comparisons
across paths stay honest.

===== Phase 2 — Situation =====

==== Test target ====

ohm. PineTab2, RK3566, hantro-vpu on ''/dev/video1''. Mali-G52
panfrost. Kernel ''6.19.10-danctnix1-1-pinetab2''. Plasma Wayland
session running on DSI-1, KWin compositor. Wayland socket
''/run/user/1001/wayland-0''.

==== Stack (after revert; this is the danctnix/archarm baseline) ====

  * ''ffmpeg-v4l2-request 2:8.1-3'' from danctnix. Lists ''v4l2request''
    in ''-hwaccels''. Built from same Kwiboo branch the marfrit
    stripped variant tracks.
  * ''mpv 0.41.0''. ''gst-launch-1.0 1.28.2''.
  * ''libva 2.23.0'', ''libva-utils 2.22.0'', no v4l2_request VA
    driver installed (the bootlin meson-stranded ''.so'' was removed).
  * ''[marfrit]'' commented out in ''/etc/pacman.conf'' for the duration
    of this iteration.

Pre-revert installed from [marfrit]:
  * ''ffmpeg-v4l2-request-git 2:8.1.r123329.b57fbbe-1'' — replaced by
    danctnix's ''ffmpeg-v4l2-request 2:8.1-3''.
  * ''brave-bin 1:1.89.143-1'' (AUR, installed 2026-04-24 for the
    Brave-HW-decode probe) — removed.
  * Stranded ''/usr/lib/dri/v4l2_request_drv_video.so'' (266 KB,
    no package owner) from a meson-direct install of bootlin
    ''libva-v4l2-request-git'' — removed.

Snapshots before/after at ''/tmp/ohm-pre-revert-pacman-Q.txt'',
''/tmp/ohm-post-revert-pacman-Q.txt'', ''/tmp/ohm-pacman.conf.pre-revert''.

==== Constraints already known ====

  * Producer-side ''dma_resv'' is empty on V4L2 stateless decoders.
    KWin's ''watchDmaBuf'' fence wait would either spin or skip.
    Mitigated downstream by the kwin-fourier 0001-skip-wait patch
    (shipped); kernel-side fix in flight as the vb2-dma-resv RFC
    (v1 sent, two reviews received, v2 sketch backlogged — see
    [[fourier:vb2_dma_resv_rfc]]).
  * GL/EGL upload from a V4L2 dmabuf is the bottleneck on Mali-class
    silicon. Quantified twice on ohm prior to this iteration (mpv
    SW gpu-next: 127 % CPU, 973 drops; mpv HW v4l2request gpu-next:
    122 % CPU, 930 drops). Same shape under stock danctnix today
    (R6 below).
  * Browsers cannot use ''waylandsink''. Firefox renders into its
    own GLES viewport; Chromium uses Skia/SkiaGanesh on GLES. The
    browser path is structurally GL-bound; the dmabuf-direct
    win does not transfer.
  * ''mpv --vo=dmabuf-wayland'' fails format negotiation on
    ''yuv420p → drm_prime''. Hwupload filter rejects the output
    format ("hardware format not supported"). Re-confirmed in R7.

==== What will not be touched in this iteration ====

  * fresnel (RK3399), ampere/boltzmann (RK3588): out of scope.
  * Kernel side. The vb2-dma-resv v2 sketch is backlogged separately.
  * Firefox. The campaign-shipped firefox-fourier 150.0.1-16 result
    on fresnel (RDD 78 % → 5 %) stands; reproducing it on ohm is
    interesting but is path (3) in the Phase 4 ranking, not (1).

===== Phase 3 — Baseline measurements =====

Corpus clip: ''~/fourier-test/bbb_1080p30_h264.mp4'', SHA-16
''dcf8a7170fbd49bb''. Full Big Buck Bunny (725 MB, 9:56, 24 fps source
rate). All paced runs cap at 60 s of source via ''-t 60'' /
''--end=60'' / ''timeout 62 -INT''.

CPU% column is GNU ''/usr/bin/time -v'' "Percent of CPU this job
got" — a time-average over the whole run, not a peak.

==== 2026-04-30 reproduction (stock danctnix/archarm) ====

^ Run ^ Path ^ CPU% ^ Wall ^ Drops/Total ^ Path discipline ^
| R1 | ''ffmpeg -re -hwaccel none -f null -'' (SW) | 89 % | 1:00.21 | n/a | no decode hwaccel |
| R2 | ''ffmpeg -re -hwaccel v4l2request -f null -'' (no ''drm_prime'') | 64 % | 1:00.25 | n/a | hwaccel engaged + CPU readback |
| R3 | ''ffmpeg -re -hwaccel v4l2request -hwaccel_output_format drm_prime -f null -'' | **15 %** | 1:00.23 | n/a | hantro-vpu engaged on S264; drm_prime export only; no consumer |
| R4 | ''gst v4l2slh264dec ! fakesink sync=false'' (unpaced, 1800 frames) | 89 % | 0:45.11 | n/a | decoder emits ''video/x-raw(memory:DMABuf), format=DMA_DRM, drm-format=NV12''; fakesink discards |
| R5 | ''gst v4l2slh264dec ! waylandsink sync=true'' (paced, 60 s wall) | **8 %** | 1:02.27 | 0 | waylandsink advertises ''DMA_DRM'' caps; linux-dmabuf-v1 negotiated; KWin → VOP2 plane; no GL upload |
| R6 | ''mpv --hwdec=v4l2request --vo=gpu-next'' | **138 %** | 1:02.63 | **1039 / 1440 (72 %)** | mpv selects ''h264-v4l2request''; gpu-next loads ''drmprime'' hwdec driver; ''drm_prime → GLES import → libplacebo composite''; bottleneck |
| R7 | ''mpv --hwdec=v4l2request --vo=dmabuf-wayland'' | 17 % | 1:01.35 | broken | hwupload filter: ''hardware format not supported''; audio plays, no video |

Logs: ''~/fourier-test/baseline-runs/20260430-1145/'' on ohm
(stdouts, stderrs, mpv terminal captures, GST_DEBUG traces).

==== Prior Phase 3 dataset (README dossier, marfrit-modified ohm) ====

For comparison; ad-hoc, kernel same, ffmpeg-v4l2-request-git
(marfrit Kwiboo build) instead of danctnix ffmpeg-v4l2-request:

^ Run ^ Path ^ CPU% (prior) ^ Drops (prior) ^
| R1 | SW realtime | 90 % | n/a |
| R2 | HW no drm_prime | 67 % | n/a |
| R3 | HW + drm_prime | 14 % | n/a |
| R4 | gst → fakesink | 89 % | n/a (1800 / 48.9 s = 36.8 fps unpaced) |
| R5 | gst → waylandsink | 6–7 % | 0 (1488 / 62 s, ''progressreport'' 1:1) |
| R6 | mpv gpu-next | 122 % | 930 / 1440 (65 %) |
| R7 | mpv dmabuf-wayland | n/a | format-negotiation failure |

==== Reproduction commentary ====

  * R1–R5 reproduce within 1–3 percentage points / single-frame
    counts. The danctnix → marfrit swap on ffmpeg is not visible
    in the data — same Kwiboo branch, different build flags.
  * R6 worsened from 122 % / 930 to 138 % / 1039. Same shape;
    delta plausibly compositor / clip-content variance. The
    qualitative finding holds: GL composition is the bottleneck,
    decode is essentially free, ~70 % of frames don't reach the
    panel.
  * R7's failure mode is identical.

==== Path-optimization notes (per Markus, 2026-04-30) ====

Definition: "zero copy if possible, no double verification of
buffers if already established at both ends." Captured per run:

  * R3: drm_prime export-only, no consumer. Zero-copy on the
    producer side; no second-end exists.
  * R5: caps negotiation runs once at preroll
    (''video/x-raw(memory:DMABuf), format=DMA_DRM, drm-format=NV12''
    on both ends); per-frame the dmabuf fd flows decoder → KWin
    → VOP2 with no caps re-validation.
  * R6: ''drm_prime → EGLImage → GL texture'' import per frame.
    Each frame allocates a fresh EGLImage. This is the per-frame
    cost the metric needed; it's not a buffer-double-verification
    problem, it's a fresh-import-per-frame problem.
  * R7: format-negotiation fails, never reaches per-frame phase.

===== Phase 4 — Plan =====

==== Sub-target ====

ohm + chromium-fourier + H.264 + GL surface (the browser-side analogue
of R6, but inside Chromium's V4L2VideoDecoder + Skia/SkiaGanesh GL
composition). Resumes task #46. **No fleet broadening this round.**

==== What will be touched ====

  * data:CT220 chromium-builder. Unblock the ''gn'' configure check
    that rejects empty ''clang_revision''. Standard Arch chromium
    PKGBUILD patches solve this; CachyOS chromium-cachy fork has
    the equivalent. Result: a built ''chromium-fourier'' aarch64
    package landing on packages.reauktion.de.
  * ohm: install ''chromium-fourier'' from ''[marfrit]'' (re-enabled
    only after the Phase 3 baseline has been frozen on this page).
    Run a single 60 s YouTube-equivalent clip-playback measurement
    in the same shape as R5/R6.
  * No kernel changes. No userspace patches beyond the build
    unblock.

==== What will not be touched ====

  * fresnel, ampere, boltzmann.
  * Firefox (firefox-fourier already shipped, RDD result valid).
  * Kernel (vb2-dma-resv v2 backlogged independently).
  * mpv ''dmabuf-wayland'' diagnosis (path 2 in the earlier ranking;
    deferred).
  * libva-v4l2-request multiplanar port. Browser path will use
    Chromium's V4L2VideoDecoder, not VA-API.

==== Expected outcome (Phase 7 will compare against this) ====

Two predictions, in the same units as Phase 3 R5 + R6:

  - ''chromium-fourier'' on ohm, 1080p30 H.264 corpus clip, 60 s wall:
    avg CPU% in the renderer + GPU process combined, 80–150 %.
    Drops 200–800 of 1440 (14–56 %). Path: V4L2VideoDecoder →
    drm_prime → Chromium ''GLImage'' GLES texture → Skia composite.
    Bottleneck location: the GL import, same shape as R6.
  - Stretch: with ''--enable-features=VaapiVideoDecodeLinuxGL'' or
    equivalent dmabuf-direct compositing flag (the LinuxOzone
    overlay path), CPU% drops to 20–40 % and drop-count to single
    digits. This is the chromium-fourier-specific carrier patch
    bet. If it works, ohm clears the ship-gate.

If both numbers come back inside their ranges, Phase 7 closes the
loop. If they're outside (especially if (1) is much worse), the
plan re-enters Phase 4 with the diagnosis from the latency band.

==== Open Phase 4 questions surfaced for review ====

  - **Build unblock approach.** Patch ''gn'' to skip
    ''clang_revision'' check (lighter), or do a full toolchain
    swap to ''clang_base_path="/usr"'' + ''compiler-rt-adjust-paths''
    (heavier, what Arch chromium PKGBUILD does). Either is a one-line
    PKGBUILD change for the gn-side; the second has more long-term
    upgrade implications.
  - **Latency-band instrumentation depth.** ''perf trace -e syscalls''
    + KMS vblank timestamps (''/sys/kernel/debug/dri/0/state'')
    + Chromium's UMA media-pipeline traces (if exposed in this
    build) is the proposal. Do we add ''chrome://tracing'' captures
    on top, or is that overkill for one sub-target?
  - **Test corpus extension.** Stay on the bbb 1080p H.264 clip,
    or add a real YouTube DASH H.264 segment (different keyframe
    cadence, different bitrate envelope) for the browser-relevant
    case? Real YouTube would test the fragmented-MP4 path; bbb
    tests the static-MP4 path.
  - **Acceptance for "stretch" outcome.** Is a Chromium build flag
    sufficient as a campaign artifact, or do we want a carrier
    patch in chromium-fourier that flips the default?

==== Phase 5 routing ====

This page is the second-model review handover. Reviewer assesses:
  * Whether the (C) metric is right, or whether the latency band
    needs a different definition.
  * Whether path (1) is the right narrowing or whether path (2)
    or (3) buys more information first.
  * Whether the Phase 4 prediction range is too wide or too narrow.
  * What's missing.

After redaction, Markus hands to a fresh model instance.

===== Source artefacts =====

  * Run logs (ohm): ''~/fourier-test/baseline-runs/20260430-1145/''
  * Pre/post-revert snapshots (ohm): ''/tmp/ohm-pre-revert-pacman-Q.txt'',
    ''/tmp/ohm-post-revert-pacman-Q.txt'',
    ''/tmp/ohm-pacman.conf.pre-revert''
  * Project-state memory: ''project_fourier_campaign_state.md''
  * Process spec: ''feedback_dev_process.md'' (the 8-phase loop)
  * Existing campaign README: ''~/src/fourier/README.md''
  * Chromium-fourier workspace: ''marfrit-packages/arch/chromium-fourier/''