ohm_gl_fix — Phase 4, 2026-05-01
This page replaces both prior Phase 4 drafts: the original libplacebo fd-cache plan (retracted after perf record showed libplacebo at 0.41 % of CPU and the patched code path not on the hot path) and its in-place revision into a “documentation of the gap” page. Phase 4 is now a plan, not an enumeration. It picks one fix surface, names the implementation, states what gets measured at Phase 7, and identifies the loopback edges.
The driver of this rewrite: Phase 1 was refined on 2026-05-01 with machine-readable criteria (Phase 1 revised §4 — C1 drops, C2 LLC-load-misses, C3 DRM_IOCTL/sec, C4 boundary fd-passing) and Phase 3 was rebuilt on the same day with empirically-grounded boundary characterisation (Phase 3 revised §3, §4). With both anchors in place, Phase 4 can commit.
1. What this Phase 4 is targeting
Phase 1 revised §2 named the in-scope workloads:
- YouTube / HTML5 <video> in Brave
- Web browsing in Brave (compositor-side video + animation)
- VS Code (Electron + Chromium under the hood)
All three traverse the Chromium video pipeline:
VaapiVideoDecoder → libva → libva-v4l2-request → V4L2 stateless
This is not the libavcodec hwaccel chain that mpv, ffplay, and VLC use. Browsers vendor their own ffmpeg fork and gate hardware video decode through libva. Therefore: the fix surfaces from the prior Phase 4 enumeration that touch libavcodec (B “libavcodec drm_prime → linux-dmabuf-v1”) or libplacebo (C2 “panvk-1.2-fakeshim”) do not lift the in-scope use cases, however structurally clean they look in isolation. The empirical entrypoint for Brave is libva, and libva on this hardware fails at vaInitialize (Phase 3 revised §1, §8; also fourier README L236-281).
Phase 4 commits to fix surface A: libva-v4l2-request multiplanar port as the primary direction, with an explicit pre-implementation research step (Step 0) that may discover the campaign needs a follow-up Chromium-side patch.
2. Decision rationale
Three reasons to commit to A specifically:
- It is the only fix surface that touches Brave's actual chain. B (libavcodec) and C2 (libplacebo Vulkan layer) target consumers Markus does not use. D (compositor DRM-shim) is a Wayland-protocol proposal that does not exist upstream and would not survive a Phase 5 review.
- Substantial groundwork exists. fourier's local libva-v4l2-request patches (on ohm at ~/fourier-test/libva-patches/fourier-local.patch) already get the bootlin source past format enumeration on the multiplanar hantro device (fourier README L240-256). The starting point is not “from zero” — it is “from probe-passing, multiplanar buffer setup still single-plane”.
- It addresses the structural gap, not the symptom. Phase 1 revised's criteria all hold globally for libva consumers once A is delivered, not just for one application. fourier already flagged this as the right axis (“browser HW video decode on ohm is parked until a multiplanar libva-v4l2-request rework exists, either ours or someone else's”, fourier README L276-281).
Note explicitly: A alone may not suffice. Once the libva chain produces an NV12 dmabuf for Brave's VaapiVideoDecoder, the display side — Chromium's GPU-process compositor — still has to present that dmabuf without per-frame Mesa GL+DRM round-trips (Phase 1 revised's C3, ≤ 100 DRM_IOCTL/sec). Whether Chromium does this on Wayland today, or needs an additional patch, is the open question Step 0 below answers before code is written.
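To make the C3 bound concrete, here is a minimal counting sketch. It assumes strace prints decoded DRM requests as `ioctl(<fd>, DRM_IOCTL_<NAME>, ...)`; the helper names and the sample lines are hypothetical illustrations, not captured output. It also assumes 30 fps playback, as the reference clip's name suggests.

```python
import re

# Assumption: strace decodes DRM requests as "ioctl(<fd>, DRM_IOCTL_<NAME>, ...)".
DRM_CALL = re.compile(r"\bioctl\(\d+, DRM_IOCTL_\w+")


def count_drm_ioctls(lines):
    """Count strace lines that are DRM ioctls."""
    return sum(1 for line in lines if DRM_CALL.search(line))


def rate_per_sec(count, window_s):
    """Average rate over a measurement window of window_s seconds."""
    return count / window_s


def per_frame(rate, fps=30.0):
    """DRM ioctls available (or spent) per displayed frame."""
    return rate / fps


# Hypothetical sample lines, for illustration only.
sample = [
    "4242 ioctl(17, DRM_IOCTL_PRIME_FD_TO_HANDLE, 0xffffc8a0) = 0",
    "4242 ioctl(17, DRM_IOCTL_GEM_CLOSE, 0xffffc8b0) = 0",
    "4242 ioctl(9, VIDIOC_DQBUF, 0xffffc900) = 0",
    "4242 sendmsg(23, {msg_name=NULL}, MSG_NOSIGNAL) = 24",
]

drm_calls = count_drm_ioctls(sample)      # only the two DRM lines match
c3_budget = per_frame(100)                # C3 ceiling expressed per frame
today_low = per_frame(800)                # lower end of the S2/S3/S4 range
today_high = per_frame(1050)              # upper end of the S2/S3/S4 range
print(drm_calls, round(c3_budget, 2), round(today_low, 2), round(today_high, 2))
```

Read this way, C3 leaves roughly three DRM ioctls per frame, against roughly 27 to 35 per frame on the contenders measured so far, which is why a per-frame GL composite path is unlikely to fit under the bound.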
3. Implementation plan
Step 0 — Research: characterise Chromium's Wayland video presentation path
Duration: 3–7 days. Output: decision document attached to this Phase 4 plan, naming whether Step 2 is required.
Question to answer: when VaapiVideoDecoder produces a NativePixmap (= dmabuf-backed VA-API surface) on chrome --ozone-platform=wayland, does Chromium's GPU process present it via a zwp_linux_dmabuf_v1 subsurface (Wayland direct overlay) or via Skia GL composite onto the page's main surface?
Concrete sub-tasks:
- Source archaeology in Chromium (current Brave-bin's underlying Chromium version, likely M138-class):
  - ui/ozone/platform/wayland/host/wayland_buffer_manager_host.cc and surrounding files — Wayland buffer attachment.
  - components/viz/service/display_embedder/ — overlay candidate surface processing.
  - media/gpu/vaapi/ — VA-API surface to native-pixmap conversion.
  - gpu/ipc/service/gpu_video_decode_accelerator_helpers.cc — dmabuf flow from decoder to compositor.
- Empirical synthesis test: with current Brave (libva broken), can we coax Chromium into the dmabuf-overlay path using a different content source — e.g. WebGL canvas, or a video element with software decode where the decoded YUV is uploaded once to a GL texture and we observe whether composite uses the texture via Wayland subsurface or via Skia main-surface compositing? Look at DRM_IOCTL_* rate and SCM_RIGHTS fd-passing on the GPU process (already instrumented in Phase 3 revised §3).
- Feature flag inventory: check chrome://flags and --enable-features= for relevant entries: VaapiVideoDecoder, VaapiVideoDecodeLinuxGL, UseChromeOSDirectVideoDecoder, UseDelegatedCompositing, DelegatedCompositingLimitToUi, AcceleratedVideoDecodeLinuxGL, wayland-screen-coordinates, ozone-overlay-priority-hint.

Output gate: decision document records whether Chromium's GPU process under default flags will route a working VA-API dmabuf to zwp_linux_dmabuf_v1 (Step 2 not needed) or composite via Skia GL (Step 2 needed). The decision document attaches to this Phase 4 page after Step 0 completes.

==== Step 1 — libva-v4l2-request multiplanar port ====

Duration: 4–8 weeks of focused work. The lower end applies if fourier's local patches and the Phase 2 §3 substrate (9-fd capture pool, NV12 single-plane 1920×1088 sizeimage = 3 655 712) generalise. The upper end applies if hantro's request-API control set turns out to need additional reverse-engineering against the kernel driver (drivers/staging/media/rkvdec/ / drivers/staging/media/hantro/).

Source basis:
* Upstream fork: https://github.com/bootlin/libva-v4l2-request (last meaningful commit years ago per fourier; confirm at Step 1 start).
* fourier local patches: ~/fourier-test/libva-patches/fourier-local.patch — HEVC stripped (RK3566 has no HEVC HW), missing #include "utils.h" in src/h264.c restored, src/config.c format-enumeration extended to try both V4L2_BUF_TYPE_VIDEO_OUTPUT and V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE (fourier README L240-256).

Concrete work surface, in order:
- Fork + import groundwork. Set up marfrit-packages/libva-v4l2-request-ohm-gl-fix/. Apply fourier's patches as the patch-zero baseline. pkgname = libva-v4l2-request-ohm-gl-fix, provides + conflicts + replaces = libva-v4l2-request. Build via fermi (Gitea Actions runner archlinuxarm aarch64).
- Multiplanar buffer setup in src/v4l2.c. Replace single-plane v4l2_buffer / v4l2_format usage with MPLANE variants (VIDIOC_S_FMT on V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE for bitstream input, V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE for NV12 output; VIDIOC_QBUF / VIDIOC_DQBUF with planes[] arrays). The Phase 2 §3 strace evidence (ffmpeg -hwaccel v4l2request -hwaccel_output_format drm_prime producing 9 VIDIOC_EXPBUFs with NV12 single-plane sizeimage = 3 655 712) is the per-buffer template.
- Multiplanar context lifecycle in src/context.c. Replace the vaCreateContext single-plane buffer-pool setup with a multiplanar pool that mirrors the VIDIOC_REQBUFS + CREATE_BUFS, count=1-loop pattern Phase 2 captured. Capture ring depth = 9 (per Phase 2 §3). Output ring (bitstream input) depth = 4.
- Multiplanar slice submission in src/picture.c and src/h264.c. Adapt request-API frame submission: build V4L2_CTRL_*_HEADER control payloads (SPS, PPS, decode params, slice params, scaling matrix) attached to the request fd, VIDIOC_QBUF the bitstream input MPLANE buffer with the request fd, VIDIOC_DQBUF the capture MPLANE NV12 buffer after decode. The kernel UAPI is in include/uapi/linux/v4l2-controls.h as V4L2_CID_STATELESS_H264_* (note: the older V4L2_CID_MPEG_VIDEO_H264_* stateless controls were renamed to V4L2_CID_STATELESS_H264_*; the V4L2_CID_MPEG_VIDEO_HEVC_* controls were renamed likewise).
- NativePixmap export. Ensure each capture-side dmabuf fd flows out of libva to the caller (Chromium's VaapiPicture) as a NativePixmap with the right DRM format (DRM_FORMAT_NV12) and modifier (DRM_FORMAT_MOD_LINEAR per Phase 3 Finding 1). Verify the modifier matches what Chromium will accept.
- Test corpus. Run against:
  * bbb_1080p30_h264.mp4 (the campaign's reference clip).
  * vainfo (libva self-test) on the /dev/dri/renderD128 equivalent.
  * Any failure cases noted by fourier (README L319-340, “test corpus” — pull list at Step 1 start).
- Package + publish. PKGBUILD finalised, builds on fermi, pushes to marfrit-packages pacman repo.
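The per-buffer template above carries sizeimage = 3 655 712, which is larger than a bare NV12 frame at 1920×1088. The following arithmetic sketch shows one decomposition that reproduces the figure exactly. It assumes the hantro driver appends H.264 motion-vector storage of 64 bytes per 16×16 macroblock plus 32 bytes to the capture plane; that reading of the kernel driver is an assumption here, not a Phase 2 finding, and the function names are hypothetical.

```python
def nv12_payload_bytes(width, height):
    """Luma plane plus half-size interleaved chroma plane (4:2:0, 8 bit)."""
    return width * height * 3 // 2


def h264_mv_bytes(width, height, mb=16):
    """Assumed hantro extra: 64 bytes per macroblock, plus 32 bytes."""
    mb_cols = (width + mb - 1) // mb
    mb_rows = (height + mb - 1) // mb
    return 64 * mb_cols * mb_rows + 32


WIDTH, HEIGHT = 1920, 1088

payload = nv12_payload_bytes(WIDTH, HEIGHT)   # 3 133 440
mv_extra = h264_mv_bytes(WIDTH, HEIGHT)       # 522 272
total = payload + mv_extra

print(payload, mv_extra, total)  # 3133440 522272 3655712
```

If this decomposition holds, the multiplanar port should take plane sizes from the format the driver reports (VIDIOC_G_FMT after VIDIOC_S_FMT) and never from width × height × 3 / 2, otherwise every capture buffer would be about 510 KiB short.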
==== Step 2 (conditional) — Chromium display-side patch ====

Trigger: Step 0 finds Chromium does not auto-route VA-API NativePixmaps through zwp_linux_dmabuf_v1 on Wayland under the default feature flags — i.e. it composites via Skia GL and Phase 1 revised's C3 (≤ 100 DRM_IOCTL/sec) cannot be reached from Step 1 alone.

Shape (deferred — exact scope set by Step 0): patch Chromium to route VAAPI NativePixmaps as Wayland subsurfaces for video elements; or enable a feature flag set that does this. Build as chromium-ohm-gl-fix (or brave-ohm-gl-fix) on marfrit-packages.

If Step 0 finds Step 2 is not needed, Phase 4 implementation ends at Step 1 + Step 3.

==== Step 3 — Verification (Phase 7 prep) ====

After Step 1 (and conditionally Step 2) lands on ohm:
- Reinstall: sudo pacman -U libva-v4l2-request-ohm-gl-fix-*.pkg.tar.zst (and conditionally chromium-ohm-gl-fix-*).
- Re-run Phase 3 revised §3 v2 strace (ioctl, mmap, munmap, sendmsg, recvmsg) and §4 perf-stat (cache-misses, LLC-load-misses, cycles, instructions) on Brave + bbb_1080p30_h264.mp4 over a 60 s steady-state window. Capture renderer + GPU-process targets.
- Check Phase 1 revised C1-C4:
  * C1 drops ≤ 10 over 60 s, drops_post_warmup = 0
  * C2 LLC-load-misses ≤ 9 M / 10 s
  * C3 DRM_IOCTL/sec ≤ 100
  * C4 at least one of (a) VIDIOC_EXPBUF + SCM_RIGHTS OR (b) PRIME_FD_TO_HANDLE from V4L2 dmabuf observed
- Append result row(s) to metrics.csv as phase7_verify_*.

===== 4. What's touched, what's not =====

Touched:
* libva-v4l2-request — substantial multiplanar rewrite of src/v4l2.c, src/context.c, src/picture.c, src/h264.c. Public ABI preserved (libva-driver entrypoints unchanged); internal restructuring only.
* marfrit-packages — new libva-v4l2-request-ohm-gl-fix/ tree. Conditionally: chromium-ohm-gl-fix/ (Step 2 only).
* ohm system — pacman -U replaces stock libva-v4l2-request (and conditionally Chromium/Brave) with the campaign packages.

Not touched:
* mpv, ffplay, VLC, gst-* — these remain on their current paths. Their users will not benefit from Phase 4. Out of campaign scope.
* Mesa / panfrost / panvk / libplacebo — their state is unchanged. The panvk-1.2-fakeshim option from prior Phase 4 drafts is not pursued in this iteration.
* libavcodec / ffmpeg — Chromium statically vendors its own; the system ffmpeg-v4l2-request-git package is unchanged.
* Kernel drivers (hantro-vpu, panfrost). Step 1 builds against the existing UAPI surface; no kernel work.
* KWin / Wayland protocol. Step 1 produces dmabuf fds; the existing KWin zwp_linux_dmabuf_v1 implementation consumes them. No compositor work.
* The S5 regression (Phase 3 revised §6 / §8 — gst-launch waylandsink ~0.3 drops/sec on today's stack vs. fourier 2026-04-24's 0/62). Separate iteration if pursued.

===== 5. Predicted outcome (against Phase 1 revised C1-C4) =====

If Step 0 + Step 1 deliver and Step 2 turns out unnecessary (optimistic case):

^ Criterion ^ Current (Brave SW path) ^ Predicted (Phase 4 delivered) ^ How verified ^
| C1 drops ≤ 10 / 60 s, 0 post-warmup | not measured (estimated 100s+ based on Brave's CPU footprint) | 0 drops post-warmup; total drops ≤ 5 (Vulkan-init blip equivalent) | Phase 7 strace/perf-stat re-run |
| C2 LLC-load-misses ≤ 9 M / 10 s | Brave GPU process has heavy memcpy traffic (Phase 3 revised §2 — 12.92 % memcpy on GPU process) | ≤ 9 M / 10 s for GPU process (no per-frame dmabuf-to-shmem CPU copy) | perf-stat re-run |
| C3 DRM_IOCTL/sec ≤ 100 | not measured for Brave (S2/S3/S4 sit at 800–1 050; S5 at 1 046) | ≤ 100 if Chromium routes the dmabuf via zwp_linux_dmabuf_v1 overlay; otherwise Step 2 needed | strace v2 + boundary_counts.csv extension |
| C4 boundary fd-passing | NO (libva fails, no V4L2 path engaged) | YES — VIDIOC_EXPBUF from libva, then either SCM_RIGHTS to KWin or PRIME_FD_TO_HANDLE to GL (depending on Step 2 outcome) | strace v2 boundary inspection |

If Step 2 is required, the same outcome is reached via Step 1 + Step 2 in sequence, with Step 1's standalone result being C1+C2 met and C3+C4 partially met (Level 1 zero-copy at the decode boundary; Level 2 still not at the compositor boundary).

===== 6. Risks and mitigations =====

- R1 — Multiplanar port takes longer than 8 weeks. V4L2 stateless API + request-API + hantro-specific control set is intricate. Mitigation: scope to H.264 only initially. HEVC is moot (RK3566 hantro has no HEVC HW). VP8 / VP9 / AV1 follow only if H.264 lands cleanly. If a single sub-task slips by >3 weeks, surface to Markus for re-scoping.
- R2 — Chromium routes VA-API NativePixmap through Skia GL on Wayland by default (Step 0 negative finding). Mitigation: Step 2 patches Chromium. Engineering cost goes up materially but campaign scope is still tractable. If Step 2 itself looks >2 months, reconsider whether to ship Step 1 alone with C1+C2 met and document C3 as still missing.
- R3 — hantro's H.264 conformance is incomplete. Some streams (interlaced, certain profile/level combinations, Hi10P) may fail. Mitigation: cross-check against fourier's gst v4l2slh264dec working output on the same clip — that path uses the same kernel driver and is a known-good reference. Use the test corpus from fourier README L319-340 once enumerated.
- R4 — KWin's zwp_linux_dmabuf_v1 modifier handling on the NV12 DRM_FORMAT_MOD_LINEAR that hantro produces. Phase 3 Finding 1 already showed all panvk modifiers carry external_only=1; that's a panvk-side property, but KWin's own modifier acceptance for NV12 is independent. Mitigation: cross-check by running gst-launch v4l2slh264dec → waylandsink on today's stack — that path produces the same modifier and is accepted by KWin (the S1 zero-copy reference). If S1 still works, KWin's acceptance is fine for the Step 1 output.
- R5 — fourier's libva-v4l2-request local patches were against an older bootlin tree. They may not apply cleanly to current upstream. Mitigation: start by rebasing fourier's patches on current upstream as the first sub-task of Step 1. If upstream has moved more than expected, fall back to fourier's snapshot.
- R6 — Chromium's VAAPI gating (VaapiVideoDecoder, VaapiIgnoreDriverChecks). The driver-check path inspects the libva driver's reported profile set. fourier already saw vainfo enumerate H.264 profiles successfully with the probe patch; the multiplanar Step 1 should preserve that. Mitigation: after Step 1, re-run vainfo with LIBVA_DRIVER_NAME=v4l2_request and LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video1 set to confirm profile enumeration still passes. Then Brave's --enable-features=VaapiVideoDecoder,VaapiIgnoreDriverChecks invocation should engage.

===== 7. Phase 5 hand-over =====

Per ~/.claude/projects/-home-mfritsche-src/memory/feedback_dev_process.md, Phase 5 is second-model review of all Phase 1-4 artefacts. Markus pastes the materials uncurated:
* Phase 1 revised
* Phase 2 (substrate)
* Phase 3 revised
* This Phase 4 page
* Companion CSVs: metrics.csv, phase3/io_cache_2026-05-01/boundary_counts.csv, phase3/io_cache_2026-05-01/perfstat.csv

Specific questions for the second-model reviewer to challenge:
- Is fix surface A actually the right pick given Phase 1 revised's use-case priority? In particular: does the reviewer see a path Phase 4 missed where Brave's chain could be lifted without rewriting libva-v4l2-request multiplanar?
- Is Step 0's research scope sufficient to commit to or rule out Step 2 with confidence, or does Step 0 itself need a Phase 4-internal sub-plan?
- Risk R1 (slip) and R2 (Step 2 needed) — is the mitigation realistic given a single-engineer-with-Claude-assistance capacity?
- Test corpus from fourier README L319-340 — is it adequate for declaring Step 1 complete, or should we extend it?

===== 8. Phase 6 (implementation) and Phase 7 (verification) order =====

Phase 6 = “execute Step 0 → Step 1 → conditionally Step 2”. Phase 7 = “Step 3” above. metrics.csv rows phase7_verify_brave_* will hold the binding numbers.

Phase 6 is long (weeks-to-months in elapsed wall time, not full-time). Sub-step boundaries inside Phase 6 are Phase-4-internal; no need to re-enter Phase 4 unless a step-level surprise demands re-planning (e.g. Step 0 turns up something that invalidates Step 1's direction).

The three loopback edges (Phase 1 revised §5):
* C1 ✓ + C2 ✗ + C3 ✓ → flag, investigate. Surfaces a measurement classification issue.
* C1 ✓ + C2 ✓ + C3 ✗ → Level-1 fixed, Level-2 missing. This is the expected post-Step-1 state if Step 0 said Step 2 is needed. Re-enter Phase 4 with Step 2 spec'd.
* C1 ✗ at Phase 7 → drops still happen. Re-enter Phase 4 with new perf evidence.

===== 9. Deferred / out of scope =====

* Other libva consumers (mpv-via-vaapi, VLC-via-vaapi) — the same Step 1 lifts them indirectly. Verification is Brave-only; gains on other libva consumers are documented at Phase 7 but not required for closure.
* libavcodec hwaccel consumers (mpv gpu-next, ffplay, VLC qt) — fix surface B from prior Phase 4 enumeration. Separate campaign.
* Vulkan-anchored consumers (libplacebo Vulkan backend on Mali-G52). Fix surface C2 (panvk-1.2-fakeshim). Separate campaign.
* HEVC, VP8, VP9, AV1. RK3566 hantro has H.264 + MPEG2 + VP8 HW only. AV1 / VP9 / HEVC are SW even after Step 1. Out of scope for this campaign's verification.
* The S5 zero-drop regression (Phase 3 revised §6 + §8). Side investigation if pursued.
* Other Mali-Bifrost-v7 hardware (G31 / G51 / G76 — same panvk arch, different SBC stacks). Out of scope; Phase 1's “Mali-G52” framing is hardware-specific.
* General-purpose Vulkan workloads. Phase 1 revised §6 explicit out-of-scope. SW-emulated mandatory-1.2 entry points in any future panvk-fakeshim are tolerated.

===== 10. References =====

* Phase 1 revised — measurable success criteria.
* Phase 2 (substrate) — versions, V4L2 9-fd buffer pool, panvk gates, panfrost modifier surface.
* Phase 3 revised — six-contender empirical bucket-attribution + boundary characterisation; the basis for §1's “Brave is libva, not libavcodec” pivot.
* Original Phase 4 — superseded by this page; preserved for audit trail.
* fourier README L236-281 — prior libva-v4l2-request investigation and partial multiplanar probe patches that form Step 1's starting point.
* Bootlin libva-v4l2-request: https://github.com/bootlin/libva-v4l2-request
* Local artefact: ~/fourier-test/libva-patches/fourier-local.patch (HEVC-stripped, missing-include fixed, format-enumeration extended for MPLANE).
* marfrit-packages parallel: ffmpeg-v4l2-request-git/ is the template for the new libva-v4l2-request-ohm-gl-fix/ package layout.

----

Phase 4 ends here. Phase 6 (implementation) begins with Step 0, which produces a small attached decision document on this page. The first pacman -U on ohm marks Phase 6's first deliverable. Phase 7 is the metrics.csv phase7_verify_* row(s).
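The Step 3 thresholds and the §8 loopback edges can be written down as a small decision helper, which may be useful when filling in the phase7_verify_* rows. This is a minimal sketch: the Phase7Run type, its field names and the returned strings are hypothetical, and only the thresholds and the three edges come from this page.

```python
from dataclasses import dataclass


@dataclass
class Phase7Run:
    """One 60 s measurement window (hypothetical record layout)."""
    drops_total: int
    drops_post_warmup: int
    llc_load_misses_per_10s: float
    drm_ioctl_per_sec: float
    expbuf_scm_rights_seen: bool
    prime_fd_to_handle_seen: bool


def check_criteria(run):
    """Evaluate C1-C4 with the thresholds listed in Step 3."""
    return {
        "C1": run.drops_total <= 10 and run.drops_post_warmup == 0,
        "C2": run.llc_load_misses_per_10s <= 9_000_000,
        "C3": run.drm_ioctl_per_sec <= 100,
        "C4": run.expbuf_scm_rights_seen or run.prime_fd_to_handle_seen,
    }


def loopback_edge(c):
    """Map a C1-C3 result onto the three edges of section 8."""
    if not c["C1"]:
        return "re-enter Phase 4 with new perf evidence"
    if not c["C2"] and c["C3"]:
        return "flag and investigate (measurement classification issue)"
    if c["C2"] and not c["C3"]:
        return "Level-1 fixed, Level-2 missing: re-enter Phase 4 with Step 2"
    if c["C2"] and c["C3"]:
        return "no loopback"
    # C1 met but C2 and C3 both missed is not one of the three listed edges.
    return "not covered by the listed edges"


# Illustrative numbers only: the expected post-Step-1 state if Step 2 is needed.
after_step1 = Phase7Run(
    drops_total=3,
    drops_post_warmup=0,
    llc_load_misses_per_10s=7.5e6,
    drm_ioctl_per_sec=900,
    expbuf_scm_rights_seen=False,
    prime_fd_to_handle_seen=True,
)
result = check_criteria(after_step1)
print(result, "->", loopback_edge(result))
```

The helper deliberately returns a distinct value for the C2 ✗ + C3 ✗ combination, since this page does not assign that case to any edge.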
