
This is an old revision of the document!


ohm_gl_fix — Phase 4, 2026-05-01

This page replaces both prior Phase 4 drafts: the original libplacebo fd-cache plan (retracted after perf record showed libplacebo at 0.41 % of CPU and the patched code path not on the hot path) and its in-place revision into a “documentation of the gap” page. Phase 4 is now a plan, not an enumeration. It picks one fix surface, names the implementation, states what gets measured at Phase 7, and identifies the loopback edges.

The driver of this rewrite: Phase 1 was refined on 2026-05-01 with machine-readable criteria (Phase 1 revised §4 — C1 drops, C2 LLC-load-misses, C3 DRM_IOCTL/sec, C4 boundary fd-passing) and Phase 3 was rebuilt on the same day with empirically-grounded boundary characterisation (Phase 3 revised §3, §4). With both anchors in place, Phase 4 can commit.

1. What this Phase 4 is targeting

Phase 1 revised §2 named the in-scope workloads:

  • YouTube / HTML5 <video> in Brave
  • Web browsing in Brave (compositor-side video + animation)
  • VS Code (Electron + Chromium under the hood)

All three traverse the Chromium video pipeline:

VaapiVideoDecoder → libva → libva-v4l2-request → V4L2 stateless

This is not the libavcodec hwaccel chain that mpv, ffplay, and VLC use. Browsers vendor their own ffmpeg fork and gate hardware video decode through libva. Therefore: the fix surfaces from the prior Phase 4 enumeration that touch libavcodec (B “libavcodec drm_prime → linux-dmabuf-v1”) or libplacebo (C2 “panvk-1.2-fakeshim”) do not lift the in-scope use cases, however structurally clean they look in isolation. The empirical entrypoint for Brave is libva, and libva on this hardware fails at vaInitialize (Phase 3 revised §1, §8; also fourier README L236-281).

Phase 4 commits to fix surface A: libva-v4l2-request multiplanar port as the primary direction, with an explicit pre-implementation research step (Step 0) that may discover the campaign needs a follow-up Chromium-side patch.

2. Decision rationale

Three reasons to commit to A specifically:

  1. It is the only fix surface that touches Brave's actual chain. B (libavcodec) and C2 (libplacebo Vulkan layer) target consumers Markus does not use. D (compositor DRM-shim) is a Wayland-protocol proposal that does not exist upstream and would not survive a Phase 5 review.
  2. Substantial groundwork exists. fourier's local libva-v4l2-request (https://github.com/bootlin/libva-v4l2-request) patches (on ohm at ~/fourier-test/libva-patches/fourier-local.patch) already get the bootlin source past format enumeration on the multiplanar hantro device (fourier README L240-256). The starting point is not "from zero"; it is "from probe-passing, multiplanar buffer setup still single-plane".
  3. It addresses the structural gap, not the symptom. Phase 1 revised's criteria all hold globally for libva consumers once A is delivered, not just for one application. fourier already flagged this as the right axis ("browser HW video decode on ohm is parked until a multiplanar libva-v4l2-request rework exists, either ours or someone else's", fourier README L276-281).

Note explicitly: A alone may not suffice. Once the libva chain produces a NV12 dmabuf for Brave's VaapiVideoDecoder, the display side — Chromium's GPU-process compositor — still has to present that dmabuf without per-frame Mesa GL+DRM round-trips (Phase 1 revised's C3, ≤100 DRM_IOCTL/sec). Whether Chromium does this on Wayland today, or needs an additional patch, is the open question Step 0 below answers before code is written.

3. Implementation plan

Step 0 — Research: characterise Chromium's Wayland video presentation path

Duration: 3–7 days. Output: decision document attached to this Phase 4 plan, naming whether Step 2 is required.

Question to answer: when VaapiVideoDecoder produces a NativePixmap (= dmabuf-backed VA-API surface) on chrome --ozone-platform=wayland, does Chromium's GPU process present it via zwp_linux_dmabuf_v1 subsurface (Wayland direct overlay) or via Skia GL composite onto the page's main surface?

Concrete sub-tasks:

  1. Source archaeology in Chromium (current Brave-bin's underlying Chromium version, likely M138-class):
      • ui/ozone/platform/wayland/host/wayland_buffer_manager_host.cc and surrounding files: Wayland buffer attachment.
      • components/viz/service/display_embedder/: overlay candidate surface processing.
      • media/gpu/vaapi/: VA-API surface to native-pixmap conversion.
      • gpu/ipc/service/gpu_video_decode_accelerator_helpers.cc: dmabuf flow from decoder to compositor.
  2. Empirical synthesis test: with current Brave (libva broken), can we coax Chromium into the dmabuf-overlay path using a different content source, e.g. WebGL canvas, or a video element with software decode where the decoded YUV is uploaded once to a GL texture and we observe whether composite uses the texture via Wayland subsurface or via Skia main-surface compositing? Look at DRM_IOCTL_* rate and SCM_RIGHTS fd-passing on the GPU process (already instrumented in Phase 3 revised §3).
  3. Feature flag inventory: check chrome://flags and --enable-features= for relevant entries: VaapiVideoDecoder, VaapiVideoDecodeLinuxGL, UseChromeOSDirectVideoDecoder, UseDelegatedCompositing, DelegatedCompositingLimitToUi, AcceleratedVideoDecodeLinuxGL, wayland-screen-coordinates, ozone-overlay-priority-hint.

Output gate: decision document records whether Chromium's GPU process under default flags will route a working VA-API dmabuf to zwp_linux_dmabuf_v1 (Step 2 not needed) or composite via Skia GL (Step 2 needed). The decision document attaches to this Phase 4 page after Step 0 completes.
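For the empirical synthesis test, the DRM_IOCTL rate and the SCM_RIGHTS count could be pulled out of an strace log along the following lines. This is a minimal sketch, not the Phase 3 instrumentation itself: the log layout (strace -f -ttt style lines with an optional pid prefix) and the function name summarise_strace are assumptions made for illustration.

```python
import re

# Assumed line shapes (hypothetical capture, e.g. strace -f -ttt ... on the GPU process):
#   4242 1714550000.123456 ioctl(12, DRM_IOCTL_PRIME_FD_TO_HANDLE, 0x7ffc0) = 0
#   [pid 4242] 1714550000.123456 sendmsg(9, {... cmsg_type=SCM_RIGHTS ...}, 0) = 8
LINE_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(?P<ts>\d+\.\d+)\s+(?P<call>\w+)\((?P<args>.*)$"
)


def summarise_strace(lines):
    """Return (drm_ioctls_per_sec, scm_rights_messages) for one strace log."""
    drm_count = 0
    scm_rights = 0
    first_ts = last_ts = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # unfinished/resumed fragments and signals are skipped
        ts = float(m["ts"])
        first_ts = ts if first_ts is None else first_ts
        last_ts = ts
        if m["call"] == "ioctl" and "DRM_IO" in m["args"]:
            # Matches decoded names (DRM_IOCTL_*) and raw DRM_IOWR(...) forms.
            drm_count += 1
        elif m["call"] in ("sendmsg", "recvmsg") and "SCM_RIGHTS" in m["args"]:
            scm_rights += 1
    window = (last_ts - first_ts) if first_ts is not None else 0.0
    rate = drm_count / window if window > 0 else 0.0
    return rate, scm_rights


sample = [
    "4242 100.000000 ioctl(12, DRM_IOCTL_PRIME_FD_TO_HANDLE, 0x7ffc0) = 0",
    "4242 100.500000 ioctl(12, DRM_IOWR(0x40, 0x8), 0x7ffc0) = 0",
    "[pid 4242] 101.000000 sendmsg(9, {cmsg_type=SCM_RIGHTS, cmsg_data=[31]}, 0) = 8",
    "4242 102.000000 ioctl(5, VIDIOC_EXPBUF, 0x7ffc0) = 0",
]
rate, fd_passes = summarise_strace(sample)
```

The rate is averaged over the whole log window (first to last parsed timestamp), so a capture should be trimmed to the steady-state interval before it is compared against the C3 threshold.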

Step 1 — libva-v4l2-request multiplanar port

Duration: 4–8 weeks of focused work; the lower end if fourier's local patches and Phase 2 §3 substrate (9-fd capture pool, NV12 single-plane 1920×1088 sizeimage = 3 655 712) generalise. The upper end if hantro's request-API control set turns out to need additional reverse-engineering against the kernel driver (drivers/staging/media/rkvdec/ / drivers/staging/media/hantro/).
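The sizeimage figure quoted above can be sanity-checked with a little arithmetic. The decomposition below is an assumption, not something the Phase 2 capture states: it takes the plain NV12 payload (1.5 bytes per pixel) and adds a motion-vector area of 64 bytes per 16×16 macroblock plus a 32-byte tail, which is how the mainline hantro driver is understood to pad H.264 capture buffers. The helper names are hypothetical.

```python
WIDTH, HEIGHT = 1920, 1088      # coded size from Phase 2 §3
CAPTURE_DEPTH = 9               # capture ring depth from Phase 2 §3


def macroblocks(pixels):
    """Number of 16-pixel macroblocks needed to cover `pixels`, rounded up."""
    return (pixels + 15) // 16


def nv12_bytes(width, height):
    """Y plane (1 byte/pixel) plus interleaved UV plane at half resolution."""
    return width * height * 3 // 2


def h264_mv_bytes(width, height):
    """ASSUMPTION: 64 bytes of motion-vector storage per macroblock + 32 bytes."""
    return 64 * macroblocks(width) * macroblocks(height) + 32


sizeimage = nv12_bytes(WIDTH, HEIGHT) + h264_mv_bytes(WIDTH, HEIGHT)
pool_bytes = CAPTURE_DEPTH * sizeimage
```

Under these assumptions the result reproduces the 3 655 712 figure exactly, and the 9-buffer capture ring comes to a little over 31 MiB. If the port allocates capture buffers from the bare NV12 size instead, the mismatch against the driver-reported sizeimage would be the first thing to check.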

Source basis:

  • Upstream Bootlin libva-v4l2-request (last meaningful commit ~years ago per fourier; confirm at Step 1 start).
  • fourier local patches: ~/fourier-test/libva-patches/fourier-local.patch. HEVC stripped (RK3566 has no HEVC HW), missing #include "utils.h" in src/h264.c restored, src/config.c format-enumeration extended to try both V4L2_BUF_TYPE_VIDEO_OUTPUT and V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE (fourier README L240-256).

Concrete work surface, in order:

  1. Fork + import groundwork. Set up marfrit-packages/libva-v4l2-request-ohm-gl-fix/. Apply fourier's patches as the patch-zero baseline. pkgname=libva-v4l2-request-ohm-gl-fix, provides+conflicts+replaces=libva-v4l2-request. Build via fermi (Gitea Actions runner archlinuxarm aarch64).
  2. Multiplanar buffer setup in src/v4l2.c. Replace single-plane v4l2_buffer / v4l2_format usage with MPLANE variants (VIDIOC_S_FMT on V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE for bitstream input, V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE for NV12 output; VIDIOC_QBUF / VIDIOC_DQBUF with planes[] arrays). The Phase 2 §3 strace evidence (ffmpeg -hwaccel v4l2request -hwaccel_output_format drm_prime producing 9 VIDIOC_EXPBUFs with NV12 single-plane sizeimage = 3 655 712) is the per-buffer template.
  3. Multiplanar context lifecycle in src/context.c. Replace the vaCreateContext single-plane buffer-pool setup with a multiplanar pool that mirrors the VIDIOC_REQBUFS+CREATE_BUFS, count=1-loop pattern Phase 2 captured. Capture ring depth = 9 (per Phase 2 §3). Output ring (bitstream input) depth = 4.
  4. Multiplanar slice submission in src/picture.c and src/h264.c. Adapt request-API frame submission: build the V4L2_CID_STATELESS_H264_* control payloads (SPS, PPS, decode params, slice params, scaling matrix) attached to the request fd, VIDIOC_QBUF the bitstream input MPLANE buffer with the request fd, VIDIOC_DQBUF the capture MPLANE NV12 buffer after decode. The kernel UAPI is in include/uapi/linux/v4l2-controls.h (note: the older V4L2_CID_MPEG_VIDEO_H264_* stateless control IDs were renamed to V4L2_CID_STATELESS_H264_*; the V4L2_CID_MPEG_VIDEO_HEVC_* IDs got the equivalent rename).
  5. NativePixmap export. Ensure each capture-side dmabuf fd flows out of libva to the caller (Chromium's VaapiPicture) as a NativePixmap with the right DRM format (DRM_FORMAT_NV12) and modifier (DRM_FORMAT_MOD_LINEAR per Phase 3 Finding 1). Verify the modifier matches what Chromium will accept.
  6. Test corpus. Run against:
      • bbb_1080p30_h264.mp4 (the campaign's reference clip).
      • vainfo (libva self-test) on /dev/dri/renderD128 equivalent.
      • Any failure cases noted by fourier (README L319-340, "test corpus"; pull list at Step 1 start).
  7. Package + publish. PKGBUILD finalised, builds on fermi, pushes to marfrit-packages pacman repo.
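The per-frame submission described for src/picture.c and src/h264.c has to respect the ordering of the media request API, and that ordering can be checked against an ioctl trace of the ported driver. The sketch below is illustrative only: the ioctl names are the kernel's, but the checker, its name and the idea of feeding it names extracted from strace are assumptions, and it does not distinguish OUTPUT from CAPTURE queue operations.

```python
# One decode request, in the order the request API expects:
# get a request fd, attach controls, queue the bitstream buffer against it,
# submit the request, then dequeue the decoded buffer.
EXPECTED_ORDER = [
    {"MEDIA_IOC_REQUEST_ALLOC", "MEDIA_REQUEST_IOC_REINIT"},  # new or recycled request
    {"VIDIOC_S_EXT_CTRLS"},
    {"VIDIOC_QBUF"},
    {"MEDIA_REQUEST_IOC_QUEUE"},
    {"VIDIOC_DQBUF"},
]


def follows_request_order(trace):
    """True if the expected steps occur in order somewhere within `trace`.

    `trace` is a list of ioctl names for a single frame. Other ioctls may be
    interleaved; only the relative order of the expected steps is checked.
    """
    remaining = iter(trace)
    # `any(...)` consumes the iterator up to the first match, so each step
    # must be found after the previous one.
    return all(any(name in step for name in remaining) for step in EXPECTED_ORDER)


good_frame = [
    "MEDIA_IOC_REQUEST_ALLOC",
    "VIDIOC_S_EXT_CTRLS",
    "VIDIOC_QBUF",              # bitstream (OUTPUT_MPLANE), carries the request fd
    "VIDIOC_QBUF",              # capture (CAPTURE_MPLANE)
    "MEDIA_REQUEST_IOC_QUEUE",
    "VIDIOC_DQBUF",
    "VIDIOC_DQBUF",
]
bad_frame = [
    "MEDIA_IOC_REQUEST_ALLOC",
    "VIDIOC_QBUF",
    "MEDIA_REQUEST_IOC_QUEUE",
    "VIDIOC_S_EXT_CTRLS",       # controls set after the request was queued
    "VIDIOC_DQBUF",
]
```

A check of this kind is deliberately coarse. It catches a request submitted before its controls or buffers are attached, which is the class of mistake a single-plane to multiplanar rewrite is most likely to introduce.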

Step 2 (conditional) — Chromium display-side patch

Trigger: Step 0 finds Chromium does not auto-route VA-API NativePixmaps through zwp_linux_dmabuf_v1 on Wayland under the default feature flags — i.e. it composites via Skia GL and Phase 1 revised's C3 (≤ 100 DRM_IOCTL/sec) cannot be reached from Step 1 alone.

Shape (deferred — exact scope set by Step 0): patch Chromium to route VAAPI NativePixmaps as Wayland subsurfaces for video elements; or enable a feature flag set that does this. Build as chromium-ohm-gl-fix (or brave-ohm-gl-fix) on marfrit-packages.

If Step 0 finds Step 2 is not needed, Phase 4 implementation ends at Step 1 + Step 3.

Step 3 — Verification (Phase 7 prep)

After Step 1 (and conditionally Step 2) lands on ohm:

  1. Reinstall: sudo pacman -U libva-v4l2-request-ohm-gl-fix-*.pkg.tar.zst (and conditionally chromium-ohm-gl-fix-*).
  2. Re-run the Phase 3 revised §3 v2 strace (ioctl,mmap,munmap,sendmsg,recvmsg) and §4 perf-stat (cache-misses,LLC-load-misses,cycles,instructions) on Brave + bbb_1080p30_h264.mp4 over a 60 s steady-state window. Capture renderer + GPU-process targets.
  3. Check Phase 1 revised C1-C4:
      • C1 drops ≤ 10 over 60 s, drops_post_warmup = 0
      • C2 LLC-load-misses ≤ 9 M / 10 s
      • C3 DRM_IOCTL/sec ≤ 100
      • C4 at least one of (a) VIDIOC_EXPBUF + SCM_RIGHTS OR (b) PRIME_FD_TO_HANDLE from V4L2 dmabuf observed
  4. Append result row(s) to metrics.csv as phase7_verify_*.
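The C1-C4 check can be written down as a small predicate so that Phase 7 applies the thresholds mechanically. The thresholds below are the ones listed in this step; the Measurement record and its field names are hypothetical, and the sample values are placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """One 60 s steady-state capture (field names are illustrative)."""
    drops_total: int
    drops_post_warmup: int
    llc_load_misses_per_10s: float
    drm_ioctl_per_sec: float
    expbuf_seen: bool              # VIDIOC_EXPBUF observed
    scm_rights_seen: bool          # fd passed over a socket
    prime_fd_to_handle_seen: bool  # PRIME_FD_TO_HANDLE on a V4L2 dmabuf


def check_criteria(m):
    """Evaluate Phase 1 revised C1-C4 for a single measurement."""
    return {
        "C1": m.drops_total <= 10 and m.drops_post_warmup == 0,
        "C2": m.llc_load_misses_per_10s <= 9_000_000,
        "C3": m.drm_ioctl_per_sec <= 100,
        "C4": (m.expbuf_seen and m.scm_rights_seen) or m.prime_fd_to_handle_seen,
    }


# Placeholder input: decode boundary fixed, compositor boundary not.
placeholder = Measurement(
    drops_total=3,
    drops_post_warmup=0,
    llc_load_misses_per_10s=7.5e6,
    drm_ioctl_per_sec=950.0,
    expbuf_seen=True,
    scm_rights_seen=False,
    prime_fd_to_handle_seen=True,
)
result = check_criteria(placeholder)
```

Keeping C4 as a disjunction mirrors the wording of the criterion: either export path counts, so the check does not presuppose the outcome of Step 0.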

4. What's touched, what's not

Touched:

  • libva-v4l2-request: substantial multiplanar rewrite of src/v4l2.c, src/context.c, src/picture.c, src/h264.c. Public ABI preserved (libva-driver entrypoints unchanged); internal restructuring only.
  • marfrit-packages: new libva-v4l2-request-ohm-gl-fix/ tree. Conditionally: chromium-ohm-gl-fix/ (Step 2 only).
  • ohm system: pacman -U replaces stock libva-v4l2-request (and conditionally Chromium/Brave) with the campaign packages.

Not touched:

  • mpv, ffplay, VLC, gst-*: these remain on their current paths. Their users will not benefit from Phase 4. Out of campaign scope.
  • Mesa / panfrost / panvk / libplacebo: their state is unchanged. The panvk-1.2-fakeshim option from prior Phase 4 drafts is not pursued in this iteration.
  • libavcodec / ffmpeg: Chromium statically vendors its own; the system ffmpeg-v4l2-request-git package is unchanged.
  • Kernel drivers (hantro-vpu, panfrost). Step 1 builds against the existing UAPI surface; no kernel work.
  • KWin / Wayland protocol. Step 1 produces dmabuf fds; existing KWin zwp_linux_dmabuf_v1 implementation consumes them. No compositor work.
  • The S5 regression (Phase 3 revised §6 / §8: gst-launch waylandsink ~0.3 drops/sec on today's stack vs. fourier 2026-04-24's 0/62). Separate iteration if pursued.

5. Predicted outcome (against Phase 1 revised C1-C4)

If Step 0 + Step 1 deliver and Step 2 turns out unnecessary (optimistic case):

^ Criterion ^ Current (Brave SW path) ^ Predicted (Phase 4 delivered) ^ How verified ^
| C1 drops post-warmup ≤ 10 / 60 s | not measured (estimated 100s+ based on Brave's CPU footprint) | 0 drops post-warmup; total drops ≤ 5 (Vulkan-init blip equivalent) | Phase 7 strace/perf-stat re-run |
| C2 LLC-load-misses ≤ 9 M / 10 s | Brave GPU process has heavy memcpy traffic (Phase 3 revised §2: 12.92 % memcpy on GPU process) | ≤ 9 M / 10 s for GPU process (no per-frame dmabuf-to-shmem CPU copy) | perf-stat re-run |
| C3 DRM_IOCTL/sec ≤ 100 | not measured for Brave (S2/S3/S4 sit at 800–1 050; S5 at 1 046) | ≤ 100 if Chromium routes the dmabuf via zwp_linux_dmabuf_v1 overlay; otherwise Step 2 needed | strace v2 + boundary_counts.csv extension |
| C4 boundary fd-passing | NO (libva fails, no V4L2 path engaged) | YES: VIDIOC_EXPBUF from libva, then either SCM_RIGHTS to KWin or PRIME_FD_TO_HANDLE to GL (depending on Step 2 outcome) | strace v2 boundary inspection |

If Step 2 is required, the same outcome but reached via Step 1 + Step 2 in sequence, with Step 1's standalone result being C1+C2 met and C3+C4 partially met (Level 1 zero-copy at the decode boundary; Level 2 still not at the compositor boundary).

6. Risks and mitigations

  1. R1: Multiplanar port takes longer than 8 weeks. V4L2 stateless API + request-API + hantro-specific control set is intricate. Mitigation: scope to H.264 only initially. HEVC is moot (RK3566 hantro has no HEVC HW). VP8 / VP9 / AV1 follow only if H.264 lands cleanly. If a single sub-task slips by >3 weeks, surface to Markus for re-scoping.
  2. R2: Chromium routes VA-API NativePixmap through Skia GL on Wayland by default (Step 0 negative finding). Mitigation: Step 2 patches Chromium. Engineering cost goes up materially but campaign scope is still tractable. If Step 2 itself looks >2 months, reconsider whether to ship Step 1 alone with C1+C2 met and document C3 as still missing.
  3. R3: hantro's H.264 conformance is incomplete. Some streams (interlaced, certain profile/level combinations, Hi10P) may fail. Mitigation: cross-check against fourier's gst v4l2slh264dec working output on the same clip; that path uses the same kernel driver and is a known-good reference. Use the test corpus from fourier README L319-340 once enumerated.
  4. R4: KWin's zwp_linux_dmabuf_v1 modifier handling on the NV12 DRM_FORMAT_MOD_LINEAR that hantro produces. Phase 3 Finding 1 already showed all panvk modifiers carry external_only=1; that's a panvk-side property, but KWin's own modifier acceptance for NV12 is independent. Mitigation: cross-check by running gst-launch v4l2slh264dec → waylandsink on today's stack; that path produces the same modifier and is accepted by KWin (the S1 zero-copy reference). If S1 still works, KWin's acceptance is fine for the Step 1 output.
  5. R5: fourier's libva-v4l2-request local patches were against an older bootlin tree. They may not apply cleanly to current upstream. Mitigation: start by rebasing fourier's patches on current upstream as the first sub-task of Step 1. If upstream has moved more than expected, fall back to fourier's snapshot.
  6. R6: Chromium's VAAPI gating (VaapiVideoDecoder, VaapiIgnoreDriverChecks). The driver-check path inspects the libva driver's reported profile set. fourier already saw vainfo enumerate H.264 profiles successfully with the probe patch; the multiplanar Step 1 should preserve that. Mitigation: after Step 1, re-run LIBVA_DRIVER_NAME=v4l2_request LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video1 vainfo to confirm profile enumeration still passes. Then Brave's --enable-features=VaapiVideoDecoder,VaapiIgnoreDriverChecks invocation should engage.
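The R6 re-check is easy to get subtly wrong by hand (environment variables must precede the command, not follow it), so it may be worth scripting. The following is a sketch under assumptions: the environment variable names, device path and feature names come from this page, while the binary name brave, the helper names and the use of subprocess are illustrative.

```python
import os


def vainfo_invocation(video_path="/dev/video1"):
    """Return (argv, env) for the post-Step-1 profile enumeration check."""
    env = dict(
        os.environ,
        LIBVA_DRIVER_NAME="v4l2_request",
        LIBVA_V4L2_REQUEST_VIDEO_PATH=video_path,
    )
    return ["vainfo"], env


def brave_invocation(binary="brave"):
    """Return argv for launching Brave with the VA-API decoder features enabled.

    `binary` is an assumption; the packaged executable name may differ on ohm.
    """
    features = ["VaapiVideoDecoder", "VaapiIgnoreDriverChecks"]
    return [
        binary,
        "--ozone-platform=wayland",
        "--enable-features=" + ",".join(features),
    ]


vainfo_argv, vainfo_env = vainfo_invocation()
brave_argv = brave_invocation()
```

On the target machine the pair returned by vainfo_invocation() could be handed to subprocess.run(argv, env=env, check=True); nothing is executed by the sketch itself.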

7. Phase 5 hand-over

Per ~/.claude/projects/-home-mfritsche-src/memory/feedback_dev_process.md, Phase 5 is second-model review of all Phase 1-4 artefacts. Markus pastes the materials uncurated:

  • phase3/io_cache_2026-05-01/boundary_counts.csv, phase3/io_cache_2026-05-01/perfstat.csv

Specific questions for the second-model reviewer to challenge:

  1. Is fix surface A actually the right pick given Phase 1 revised's use-case priority? In particular: does the reviewer see a path Phase 4 missed where Brave's chain could be lifted without rewriting libva-v4l2-request multiplanar?
  2. Is Step 0's research scope sufficient to commit to or rule out Step 2 with confidence, or does Step 0 itself need a Phase 4-internal sub-plan?
  3. Risk R1 (slip) and R2 (Step 2 needed): is the mitigation realistic given a single-engineer-with-Claude-assistance capacity?
  4. Test corpus from fourier README L319-340: is it adequate for declaring Step 1 complete, or should we extend it?

8. Phase 6 (implementation) and Phase 7 (verification) order

Phase 6 = “execute Step 0 → Step 1 → conditionally Step 2”. Phase 7 = “Step 3” above. metrics.csv rows phase7_verify_brave_* will hold the binding numbers.
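Appending the binding row could look roughly like the sketch below. Only the phase7_verify_brave_ prefix comes from this page; the column names are hypothetical because the real metrics.csv schema is not shown here, and the values in the usage lines are placeholders rather than results.

```python
import csv
import io

# Hypothetical column layout; align with the real metrics.csv header before use.
FIELDS = [
    "row_id",
    "c1_drops_total",
    "c1_drops_post_warmup",
    "c2_llc_load_misses_per_10s",
    "c3_drm_ioctl_per_sec",
    "c4_fd_passing",
]


def append_verify_row(fh, suffix, values):
    """Write one phase7_verify_brave_<suffix> row to an open CSV file object."""
    row = {"row_id": f"phase7_verify_brave_{suffix}", **values}
    # DictWriter raises ValueError on keys outside FIELDS, which guards
    # against silently writing a misnamed metric.
    csv.DictWriter(fh, fieldnames=FIELDS).writerow(row)
    return row["row_id"]


# Usage with an in-memory buffer and placeholder numbers.
buf = io.StringIO()
row_id = append_verify_row(
    buf,
    "run1",
    {
        "c1_drops_total": 0,
        "c1_drops_post_warmup": 0,
        "c2_llc_load_misses_per_10s": 0,
        "c3_drm_ioctl_per_sec": 0,
        "c4_fd_passing": "unknown",
    },
)
```

For the real file the same function would be called on open("metrics.csv", "a", newline=""), which appends without rewriting earlier rows.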

Phase 6 is long (weeks-to-months in elapsed wall time, not full-time). Sub-step boundaries inside Phase 6 are Phase-4-internal; no need to re-enter Phase 4 unless a step-level surprise demands re-planning (e.g. Step 0 turns up something that invalidates Step 1's direction).

The three loopback edges (Phase 1 revised §5):

  • C1 ✓ + C2 ✗ + C3 ✓ → flag, investigate. Surfaces a measurement classification issue.
  • C1 ✓ + C2 ✓ + C3 ✗ → Level-1 fixed, Level-2 missing. This is the expected post-Step-1 state if Step 0 said Step 2 is needed. Re-enter Phase 4 with Step 2 spec'd.
  • C1 ✗ at Phase 7 → drops still happen. Re-enter Phase 4 with new perf evidence.
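The three edges above amount to a small decision table, sketched here so the Phase 7 outcome maps to a next action without interpretation. Two branches are assumptions rather than part of Phase 1 revised §5: the all-pass case is treated as "no loopback", and the C1 ✓ + C2 ✗ + C3 ✗ combination, which the edges do not cover, is reported as unspecified.

```python
def next_action(c1, c2, c3):
    """Map C1-C3 pass/fail results to the loopback edge they trigger."""
    if not c1:
        # Drops still happen, regardless of C2/C3.
        return "re-enter Phase 4 with new perf evidence"
    if c2 and not c3:
        # Level-1 fixed, Level-2 missing.
        return "re-enter Phase 4 with Step 2 spec'd"
    if c3 and not c2:
        return "flag and investigate (measurement classification issue)"
    if c2 and c3:
        return "no loopback"  # ASSUMPTION: not one of the three named edges
    return "unspecified"      # ASSUMPTION: C2 and C3 both fail with C1 passing
```

Checking C1 first reflects the wording of the third edge, which applies whenever drops persist at Phase 7.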

9. Deferred / out of scope

  • Other libva consumers (mpv-via-vaapi, VLC-via-vaapi): the same Step 1 lifts them indirectly. Verification is Brave-only; gains on other libva consumers are documented at Phase 7 but not required for closure.
  • libavcodec hwaccel consumers (mpv gpu-next, ffplay, VLC qt): fix surface B from prior Phase 4 enumeration. Separate campaign.
  • Vulkan-anchored consumers (libplacebo Vulkan backend on Mali-G52). Fix surface C2 (panvk-1.2-fakeshim). Separate campaign.
  • HEVC, VP8, VP9, AV1. RK3566 hantro has H.264 + MPEG2 + VP8 HW only. AV1 / VP9 / HEVC are SW even after Step 1. Out of scope for this campaign's verification.
  • The S5 zero-drop regression (Phase 3 revised §6 + §8). Side investigation if pursued.
  • Other Mali-Bifrost-v7 hardware (G31 / G51 / G76: same panvk arch, different SBC stacks). Out of scope; Phase 1's "Mali-G52" framing is hardware-specific.
  • General-purpose Vulkan workloads. Phase 1 revised §6 explicit out-of-scope. SW-emulated mandatory-1.2 entry points in any future panvk-fakeshim are tolerated.

10. References

  • Phase 1 revised: measurable success criteria.
  • Phase 2: versions, V4L2 9-fd buffer pool, panvk gates, panfrost modifier surface.
  • Phase 3 revised (ohm_gl_fix:phase3_revised_2026-05-01): six-contender empirical bucket-attribution + boundary characterisation; the basis for §1's "Brave is libva, not libavcodec" pivot.
  • Original Phase 4 (ohm_gl_fix:phase4_2026-04-30): superseded by this page; preserved for audit trail.
  • fourier README L236-281: prior libva-v4l2-request investigation and partial multiplanar probe patches that form Step 1's starting point.
  • Bootlin libva-v4l2-request: https://github.com/bootlin/libva-v4l2-request
  • Local artefact: ~/fourier-test/libva-patches/fourier-local.patch (HEVC-stripped, missing-include fixed, format-enumeration extended for MPLANE).
  • marfrit-packages parallel: ffmpeg-v4l2-request-git/ is the template for the new libva-v4l2-request-ohm-gl-fix/ package layout.

Phase 4 ends here. Phase 6 (implementation) begins with Step 0, which produces a small attached decision document on this page. The first pacman -U on ohm marks Phase 6's first deliverable. Phase 7 is the metrics.csv phase7_verify_* row(s).

ohm_gl_fix/phase4_2026-05-01.1777636328.txt.gz · Last modified: by markus_fritsche