User Tools

Site Tools


fourier:phase5_2026-04-30

fourier — Phase 5 review handover, 2026-04-30

Second-model review artifact. Goal/situation/measurements/plan; uncurated. Reviewer reads top-to-bottom, redacts as needed, then hands to a fresh model instance. Index: fourier.

Origin: prior Phase 1 metric (“max FPS / decode rate”) was invalidated by ad-hoc Phase 3 measurements collected 2026-04-24 on ohm. Loop edge 3→1 fired retroactively. This page is the re-entry at Phase 2 with the existing data treated as Phase 3 output, plus a fresh reproduction on the danctnix/archarm baseline.

Phase 1 — Goal (revised)

Move ohm/RK3566 to “1080p30 YouTube playback at <30% total CPU, zero dropped frames over a 60 s window” by way of the chromium-fourier package. Same gate Markus already wrote into the README; this page is the metric-definition for the next iteration of the loop.

Why the metric changed

Original: “max H.264 decode FPS.” Phase 3 showed that decode-FPS is not the binding constraint. ohm's hantro VPU caps at ~95 fps for 1080p H.264. Source rate (24 fps) is reached at 14 % CPU when the decoder output is discarded (drm_prime + null muxer). The cost shows up downstream of the decoder, in whatever consumer takes the frame.

Replacement — variant (C)

Two bands measured on every run:

  • Coarse band (ship-gate). Per-clip realtime run, fixed length, fixed clip, fixed framerate target. Capture: time-averaged CPU% over the run, decoder + sink drops out of total presented frames, and a path-discipline note (which surface negotiated; dmabuf-direct or GL-imported; what was actually established at both ends). “Hard facts > somewhat hard facts” — average CPU% is the hard fact, not a peak-watching estimate.
  • Latency band (diagnosis). Per-frame distribution over the chain compressed-bytes-arrival → presented-pixel-on-display, with sub-bands queue-fill / decode / fence-publish / dmabuf-export / GL-import (when applicable) / composite / scanout. Used when (B) fails and we don't know which sub-band ate the latency.

Surface-variant clarification

The metric does not require a GLES-textured terminus. The winning ohm path (gst → waylandsink) terminates at a VOP2 scanout plane without ever touching a GL texture. Phase 1 target = “presented pixel on display.” Each measurement records its surface variant (dmabuf-scanout, GLES-imported) so latency/CPU comparisons across paths stay honest.

Phase 2 — Situation

Test target

ohm. PineTab2, RK3566, hantro-vpu on /dev/video1. Mali-G52 panfrost. Kernel 6.19.10-danctnix1-1-pinetab2. Plasma Wayland session running on DSI-1, KWin compositor. Wayland socket /run/user/1001/wayland-0.

Stack (after revert; this is the danctnix/archarm baseline)

  • ffmpeg-v4l2-request 2:8.1-3 from danctnix. Lists v4l2request

in -hwaccels. Built from same Kwiboo branch the marfrit

  stripped variant tracks.
* ''mpv 0.41.0''. ''gst-launch-1.0 1.28.2''.
* ''libva 2.23.0'', ''libva-utils 2.22.0'', no v4l2_request VA
  driver installed (the bootlin meson-stranded ''.so'' was removed).
* ''[marfrit]'' commented out in ''/etc/pacman.conf'' for the duration
  of this iteration.

Pre-revert installed from [marfrit]:

  • ffmpeg-v4l2-request-git 2:8.1.r123329.b57fbbe-1 — replaced by

danctnix's ffmpeg-v4l2-request 2:8.1-3.

  • brave-bin 1:1.89.143-1 (AUR, installed 2026-04-24 for the

Brave-HW-decode probe) — removed.

  • Stranded /usr/lib/dri/v4l2_request_drv_video.so (266 KB,

no package owner) from a meson-direct install of bootlin

  ''libva-v4l2-request-git'' — removed.

Snapshots before/after at /tmp/ohm-pre-revert-pacman-Q.txt, /tmp/ohm-post-revert-pacman-Q.txt, /tmp/ohm-pacman.conf.pre-revert.

Constraints already known

  • Producer-side dma_resv is empty on V4L2 stateless decoders.

KWin's watchDmaBuf fence wait would either spin or skip.

  Mitigated downstream by the kwin-fourier 0001-skip-wait patch
  (shipped); kernel-side fix in flight as the vb2-dma-resv RFC
  (v1 sent, two reviews received, v2 sketch backlogged — see
  [[fourier:vb2_dma_resv_rfc]]).
* GL/EGL upload from a V4L2 dmabuf is the bottleneck on Mali-class
  silicon. Quantified twice on ohm prior to this iteration (mpv
  SW gpu-next: 127 % CPU, 973 drops; mpv HW v4l2request gpu-next:
  122 % CPU, 930 drops). Same shape under stock danctnix today
  (R6 below).
* Browsers cannot use ''waylandsink''. Firefox renders into its
  own GLES viewport; Chromium uses Skia/SkiaGanesh on GLES. The
  browser path is structurally GL-bound; the dmabuf-direct
  win does not transfer.
* ''mpv --vo=dmabuf-wayland'' fails format negotiation on
  ''yuv420p → drm_prime''. Hwupload filter rejects the output
  format ("hardware format not supported"). Re-confirmed in R7.

What will not be touched in this iteration

  • fresnel (RK3399), ampere/boltzmann (RK3588): out of scope.
  • Kernel side. The vb2-dma-resv v2 sketch is backlogged separately.
  • Firefox. The campaign-shipped firefox-fourier 150.0.1-16 result

on fresnel (RDD 78 % → 5 %) stands; reproducing it on ohm is

  interesting but is path (3) in the Phase 4 ranking, not (1).

Phase 3 — Baseline measurements

Corpus clip: ~/fourier-test/bbb_1080p30_h264.mp4, SHA-16 dcf8a7170fbd49bb. Full Big Buck Bunny (725 MB, 9:56, 24 fps source rate). All paced runs cap at 60 s of source via -t 60 / –end=60 / timeout 62 -INT.

CPU% column is GNU /usr/bin/time -v “Percent of CPU this job got” — a time-average over the whole run, not a peak.

2026-04-30 reproduction (stock danctnix/archarm)

Run Path CPU% Wall Drops/Total Path discipline
R1 ffmpeg -re -hwaccel none -f null - (SW) 89 % 1:00.21 n/a no decode hwaccel
R2 ffmpeg -re -hwaccel v4l2request -f null - (no drm_prime) 64 % 1:00.25 n/a hwaccel engaged + CPU readback
R3 ffmpeg -re -hwaccel v4l2request -hwaccel_output_format drm_prime -f null - 15 % 1:00.23 n/a hantro-vpu engaged on S264; drm_prime export only; no consumer
R4 gst v4l2slh264dec ! fakesink sync=false (unpaced, 1800 frames) 89 % 0:45.11 n/a decoder emits video/x-raw(memory:DMABuf), format=DMA_DRM, drm-format=NV12; fakesink discards
R5 gst v4l2slh264dec ! waylandsink sync=true (paced, 60 s wall) 8 % 1:02.27 0 waylandsink advertises DMA_DRM caps; linux-dmabuf-v1 negotiated; KWin → VOP2 plane; no GL upload
R6 mpv –hwdec=v4l2request –vo=gpu-next 138 % 1:02.63 1039 / 1440 (72 %) mpv selects h264-v4l2request; gpu-next loads drmprime hwdec driver; drm_prime → GLES import → libplacebo composite; bottleneck
R7 mpv –hwdec=v4l2request –vo=dmabuf-wayland 17 % 1:01.35 broken hwupload filter: hardware format not supported; audio plays, no video

Logs: ~/fourier-test/baseline-runs/20260430-1145/ on ohm (stdouts, stderrs, mpv terminal captures, GST_DEBUG traces).

Prior Phase 3 dataset (README dossier, marfrit-modified ohm)

For comparison; ad-hoc, kernel same, ffmpeg-v4l2-request-git (marfrit Kwiboo build) instead of danctnix ffmpeg-v4l2-request:

Run Path CPU% (prior) Drops (prior)
R1 SW realtime 90 % n/a
R2 HW no drm_prime 67 % n/a
R3 HW + drm_prime 14 % n/a
R4 gst → fakesink 89 % n/a (1800 / 48.9 s = 36.8 fps unpaced)
R5 gst → waylandsink 6–7 % 0 (1488 / 62 s, progressreport 1:1)
R6 mpv gpu-next 122 % 930 / 1440 (65 %)
R7 mpv dmabuf-wayland n/a format-negotiation failure

Reproduction commentary

  • R1–R5 reproduce within 1–3 percentage points / single-frame

counts. The danctnix → marfrit swap on ffmpeg is not visible

  in the data — same Kwiboo branch, different build flags.
* R6 worsened from 122 % / 930 to 138 % / 1039. Same shape;
  delta plausibly compositor / clip-content variance. The
  qualitative finding holds: GL composition is the bottleneck,
  decode is essentially free, ~70 % of frames don't reach the
  panel.
* R7's failure mode is identical.

Path-optimization notes (per Markus, 2026-04-30)

Definition: “zero copy if possible, no double verification of buffers if already established at both ends.” Captured per run:

  • R3: drm_prime export-only, no consumer. Zero-copy on the

producer side; no second-end exists.

  • R5: caps negotiation runs once at preroll

(video/x-raw(memory:DMABuf), format=DMA_DRM, drm-format=NV12

  on both ends); per-frame the dmabuf fd flows decoder → KWin
  → VOP2 with no caps re-validation.
* R6: ''drm_prime → EGLImage → GL texture'' import per frame.
  Each frame allocates a fresh EGLImage. This is the per-frame
  cost the metric needed; it's not a buffer-double-verification
  problem, it's a fresh-import-per-frame problem.
* R7: format-negotiation fails, never reaches per-frame phase.

Phase 4 — Plan

Sub-target

ohm + chromium-fourier + H.264 + GL surface (the browser-side analogue of R6, but inside Chromium's V4L2VideoDecoder + Skia/SkiaGanesh GL composition). Resumes task #46. No fleet broadening this round.

What will be touched

  • data:CT220 chromium-builder. Unblock the gn configure check

that rejects empty clang_revision. Standard Arch chromium

  PKGBUILD patches solve this; CachyOS chromium-cachy fork has
  the equivalent. Result: a built ''chromium-fourier'' aarch64
  package landing on packages.reauktion.de.
* ohm: install ''chromium-fourier'' from ''[marfrit]'' (re-enabled
  only after the Phase 3 baseline has been frozen on this page).
  Run a single 60 s YouTube-equivalent clip-playback measurement
  in the same shape as R5/R6.
* No kernel changes. No userspace patches beyond the build
  unblock.

What will not be touched

  • fresnel, ampere, boltzmann.
  • Firefox (firefox-fourier already shipped, RDD result valid).
  • Kernel (vb2-dma-resv v2 backlogged independently).
  • mpv dmabuf-wayland diagnosis (path 2 in the earlier ranking;

deferred).

  • libva-v4l2-request multiplanar port. Browser path will use

Chromium's V4L2VideoDecoder, not VA-API.

Expected outcome (Phase 7 will compare against this)

Two predictions, in the same units as Phase 3 R5 + R6:

  1. chromium-fourier on ohm, 1080p30 H.264 corpus clip, 60 s wall:

avg CPU% in the renderer + GPU process combined, 80–150 %.

  Drops 200–800 of 1440 (14–56 %). Path: V4L2VideoDecoder →
  drm_prime → Chromium ''GLImage'' GLES texture → Skia composite.
  Bottleneck location: the GL import, same shape as R6.
- Stretch: with ''--enable-features=VaapiVideoDecodeLinuxGL'' or
  equivalent dmabuf-direct compositing flag (the LinuxOzone
  overlay path), CPU% drops to 20–40 % and drop-count to single
  digits. This is the chromium-fourier-specific carrier patch
  bet. If it works, ohm clears the ship-gate.

If both numbers come back inside their ranges, Phase 7 closes the loop. If they're outside (especially if (1) is much worse), the plan re-enters Phase 4 with the diagnosis from the latency band.

Open Phase 4 questions surfaced for review

  1. Build unblock approach. Patch gn to skip

clang_revision check (lighter), or do a full toolchain

  swap to ''clang_base_path="/usr"'' + ''compiler-rt-adjust-paths''
  (heavier, what Arch chromium PKGBUILD does). Either is a one-line
  PKGBUILD change for the gn-side; the second has more long-term
  upgrade implications.
- **Latency-band instrumentation depth.** ''perf trace -e syscalls''
  + KMS vblank timestamps (''/sys/kernel/debug/dri/0/state'')
  + Chromium's UMA media-pipeline traces (if exposed in this
  build) is the proposal. Do we add ''chrome://tracing'' captures
  on top, or is that overkill for one sub-target?
- **Test corpus extension.** Stay on the bbb 1080p H.264 clip,
  or add a real YouTube DASH H.264 segment (different keyframe
  cadence, different bitrate envelope) for the browser-relevant
  case? Real YouTube would test the fragmented-MP4 path; bbb
  tests the static-MP4 path.
- **Acceptance for "stretch" outcome.** Is a Chromium build flag
  sufficient as a campaign artifact, or do we want a carrier
  patch in chromium-fourier that flips the default?

Phase 5 routing

This page is the second-model review handover. Reviewer assesses:

  • Whether the (C) metric is right, or whether the latency band

needs a different definition.

  • Whether path (1) is the right narrowing or whether path (2)

or (3) buys more information first.

  • Whether the Phase 4 prediction range is too wide or too narrow.
  • What's missing.

After redaction, Markus hands to a fresh model instance.

Source artefacts

  • Run logs (ohm): ~/fourier-test/baseline-runs/20260430-1145/
  • Pre/post-revert snapshots (ohm): /tmp/ohm-pre-revert-pacman-Q.txt,

/tmp/ohm-post-revert-pacman-Q.txt,

  ''/tmp/ohm-pacman.conf.pre-revert''
* Project-state memory: ''project_fourier_campaign_state.md''
* Process spec: ''feedback_dev_process.md'' (the 8-phase loop)
* Existing campaign README: ''~/src/fourier/README.md''
* Chromium-fourier workspace: ''marfrit-packages/arch/chromium-fourier/''
fourier/phase5_2026-04-30.txt · Last modified: by markus_fritsche