ohm_gl_fix — Phase 4, 2026-05-01

This page replaces both prior Phase 4 drafts: the original libplacebo fd-cache plan (retracted after perf record showed libplacebo at 0.41 % of CPU and the patched code path not on the hot path) and its in-place revision into a “documentation of the gap” page. Phase 4 is now a plan, not an enumeration. It picks one fix surface, names the implementation, states what gets measured at Phase 7, and identifies the loopback edges.

The driver of this rewrite: Phase 1 was refined on 2026-05-01 with machine-readable criteria (Phase 1 revised §4 — C1 drops, C2 LLC-load-misses, C3 DRM_IOCTL/sec, C4 boundary fd-passing) and Phase 3 was rebuilt on the same day with empirically-grounded boundary characterisation (Phase 3 revised §3, §4). With both anchors in place, Phase 4 can commit.

1. What this Phase 4 is targeting

Phase 1 revised §2 named the in-scope workloads:

  • YouTube / HTML5 <video> in Brave
  • Web browsing in Brave (compositor-side video + animation)
  • VS Code (Electron + Chromium under the hood)

All three traverse the Chromium video pipeline:

VaapiVideoDecoder → libva → libva-v4l2-request → V4L2 stateless

This is not the libavcodec hwaccel chain that mpv, ffplay, and VLC use. Browsers vendor their own ffmpeg fork and gate hardware video decode through libva. Therefore: the fix surfaces from the prior Phase 4 enumeration that touch libavcodec (B “libavcodec drm_prime → linux-dmabuf-v1”) or libplacebo (C2 “panvk-1.2-fakeshim”) do not lift the in-scope use cases, however structurally clean they look in isolation. The empirical entrypoint for Brave is libva, and libva on this hardware fails at vaInitialize (Phase 3 revised §1, §8; also fourier README L236-281).

Phase 4 commits to fix surface A: libva-v4l2-request multiplanar port as the primary direction, with an explicit pre-implementation research step (Step 0) that may discover the campaign needs a follow-up Chromium-side patch.

2. Decision rationale

Three reasons to commit to A specifically:

  1. It is the only fix surface that touches Brave's actual chain. B (libavcodec) and C2 (libplacebo Vulkan layer) target consumers Markus does not use. D (compositor DRM-shim) is a Wayland-protocol proposal that does not exist upstream and would not survive a Phase 5 review.
  2. Substantial groundwork exists. fourier's local libva-v4l2-request patches (on ohm at ~/fourier-test/libva-patches/fourier-local.patch) already get the bootlin source past format enumeration on the multiplanar hantro device (fourier README L240-256). The starting point is not “from zero” — it is “from probe-passing, multiplanar buffer setup still single-plane”.
  3. It addresses the structural gap, not the symptom. Phase 1 revised's criteria all hold globally for libva consumers once A is delivered, not just for one application. fourier already flagged this as the right axis (“browser HW video decode on ohm is parked until a multiplanar libva-v4l2-request rework exists, either ours or someone else's”, fourier README L276-281).

Note explicitly: A alone may not suffice. Once the libva chain produces a NV12 dmabuf for Brave's VaapiVideoDecoder, the display side — Chromium's GPU-process compositor — still has to present that dmabuf without per-frame Mesa GL+DRM round-trips (Phase 1 revised's C3, ≤100 DRM_IOCTL/sec). Whether Chromium does this on Wayland today, or needs an additional patch, is the open question Step 0 below answers before code is written.
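As a rough illustration of how the C3 figure could be derived from a strace capture, the sketch below counts DRM ioctls and divides by the window length. The function names and the sample lines are hypothetical, and it assumes strace prints the request by name (DRM_IOCTL_...). Driver-specific requests that strace prints in raw _IOC(...) form are not counted by this sketch, so a real measurement would need to handle those too.

```python
import re

# Matches an strace-decoded ioctl whose request name starts with DRM_IOCTL_.
# Assumption: strace was able to decode the request name.
DRM_IOCTL_RE = re.compile(r"\bioctl\(\d+, DRM_IOCTL_[A-Z0-9_]+")


def drm_ioctl_rate(strace_lines, window_seconds):
    """Return DRM ioctls per second over a capture window of known length."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    count = sum(1 for line in strace_lines if DRM_IOCTL_RE.search(line))
    return count / window_seconds


def meets_c3(rate, limit=100.0):
    """C3 as stated on this page: at most 100 DRM ioctls per second."""
    return rate <= limit


# Hypothetical sample lines shaped like `strace -f` output; not real captures.
sample = [
    "4242 ioctl(17, DRM_IOCTL_PRIME_FD_TO_HANDLE, 0x7ffc1000) = 0",
    "4242 ioctl(17, DRM_IOCTL_SYNCOBJ_CREATE, 0x7ffc1010) = 0",
    "4242 ioctl(23, VIDIOC_EXPBUF, 0x7ffc1020) = 0",
    "4242 sendmsg(9, {msg_name=NULL}, MSG_NOSIGNAL) = 24",
]

rate = drm_ioctl_rate(sample, window_seconds=2.0)
print(rate, meets_c3(rate))
```

Dividing by a fixed window keeps the sketch independent of the timestamp format; a 60 s steady-state capture would pass window_seconds=60.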

3. Implementation plan

Step 0 — Research: characterise Chromium's Wayland video presentation path

Duration: 3–7 days. Output: decision document attached to this Phase 4 plan, naming whether Step 2 is required.

Question to answer: when VaapiVideoDecoder produces a NativePixmap (a dmabuf-backed VA-API surface) under chrome --ozone-platform=wayland, does Chromium's GPU process present it via a zwp_linux_dmabuf_v1 subsurface (Wayland direct overlay) or via a Skia GL composite onto the page's main surface?

Concrete sub-tasks:

  1. Source archaeology in Chromium (current Brave-bin's underlying Chromium version, likely M138-class):
    • ui/ozone/platform/wayland/host/wayland_buffer_manager_host.cc and surrounding files — Wayland buffer attachment.
    • components/viz/service/display_embedder/ — overlay candidate surface processing.
    • media/gpu/vaapi/ — VA-API surface to native-pixmap conversion.
    • gpu/ipc/service/gpu_video_decode_accelerator_helpers.cc — dmabuf flow from decoder to compositor.
  2. Empirical synthesis test: with current Brave (libva broken), try to coax Chromium into the dmabuf-overlay path using a different content source, e.g. a WebGL canvas, or a video element with software decode where the decoded YUV is uploaded once to a GL texture. Observe whether the composite uses the texture via a Wayland subsurface or via Skia main-surface compositing. The signals to watch are the DRM_IOCTL_* rate and SCM_RIGHTS fd-passing on the GPU process (already instrumented in Phase 3 revised §3).
  3. Feature flag inventory: check chrome://flags and --enable-features= for relevant entries: VaapiVideoDecoder, VaapiVideoDecodeLinuxGL, UseChromeOSDirectVideoDecoder, UseDelegatedCompositing, DelegatedCompositingLimitToUi, AcceleratedVideoDecodeLinuxGL, wayland-screen-coordinates, ozone-overlay-priority-hint.

Output gate: the decision document records whether Chromium's GPU process under default flags will route a working VA-API dmabuf to zwp_linux_dmabuf_v1 (Step 2 not needed) or composite via Skia GL (Step 2 needed). The decision document attaches to this Phase 4 page after Step 0 completes.

Step 1: libva-v4l2-request multiplanar port

Duration: 4–8 weeks of focused work. The lower end applies if fourier's local patches and the Phase 2 §3 substrate (9-fd capture pool, NV12 single-plane 1920×1088 sizeimage = 3 655 712) generalise. The upper end applies if hantro's request-API control set turns out to need additional reverse-engineering against the kernel driver (drivers/staging/media/rkvdec/ / drivers/staging/media/hantro/).

Source basis:

  • Upstream fork: https://github.com/bootlin/libva-v4l2-request (last meaningful commit ~years ago per fourier; confirm at Step 1 start).
  • fourier local patches: ~/fourier-test/libva-patches/fourier-local.patch. HEVC stripped (RK3566 hantro has no HEVC HW), missing #include "utils.h" in src/h264.c restored, src/config.c format-enumeration extended to try both V4L2_BUF_TYPE_VIDEO_OUTPUT and V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE (fourier README L240-256).

Concrete work surface, in order:

  1. Fork + import groundwork. Set up marfrit-packages/libva-v4l2-request-ohm-gl-fix/. Apply fourier's patches as the patch-zero baseline. pkgname=libva-v4l2-request-ohm-gl-fix, provides+conflicts+replaces=libva-v4l2-request. Build via fermi (Gitea Actions runner archlinuxarm aarch64).
  2. Multiplanar buffer setup in src/v4l2.c.
Replace single-plane v4l2_buffer / v4l2_format usage with the MPLANE variants: VIDIOC_S_FMT on V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE for bitstream input and on V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE for NV12 output; VIDIOC_QBUF / VIDIOC_DQBUF with planes[] arrays. The Phase 2 §3 strace evidence (ffmpeg -hwaccel v4l2request -hwaccel_output_format drm_prime producing 9 VIDIOC_EXPBUFs with NV12 single-plane sizeimage = 3 655 712) is the per-buffer template.
  3. Multiplanar context lifecycle in src/context.c. Replace the single-plane buffer-pool setup in vaCreateContext with a multiplanar pool that mirrors the VIDIOC_REQBUFS + CREATE_BUFS, count=1 loop pattern Phase 2 captured. Capture ring depth = 9 (per Phase 2 §3). Output ring (bitstream input) depth = 4.
  4. Multiplanar slice submission in src/picture.c and src/h264.c. Adapt request-API frame submission: build the stateless H.264 control payloads (SPS, PPS, decode params, slice params, scaling matrix) and attach them to the request fd, VIDIOC_QBUF the bitstream input MPLANE buffer with the request fd, then VIDIOC_DQBUF the capture MPLANE NV12 buffer after decode. The kernel UAPI is V4L2_CID_STATELESS_H264_* in include/uapi/linux/v4l2-controls.h (note: the staging-era V4L2_CID_MPEG_VIDEO_H264_* controls were renamed to V4L2_CID_STATELESS_H264_* when the stateless H.264 UAPI was stabilised; the HEVC controls followed in a later, separate rename).
  5. NativePixmap export. Ensure each capture-side dmabuf fd flows out of libva to the caller (Chromium's VaapiPicture) as a NativePixmap with the right DRM format (DRM_FORMAT_NV12) and modifier (DRM_FORMAT_MOD_LINEAR per Phase 3 Finding 1). Verify the modifier matches what Chromium will accept.
  6. Test corpus. Run against:
    • bbb_1080p30_h264.mp4 (the campaign's reference clip).
    • vainfo (libva self-test) on /dev/dri/renderD128 equivalent.
    • Any failure cases noted by fourier (README L319-340, "test corpus"; pull the list at Step 1 start).
  7. Package + publish. PKGBUILD finalised, builds on fermi, pushes to the marfrit-packages pacman repo.
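The sizeimage figure of 3 655 712 used as the per-buffer template above is larger than a bare 1920×1088 NV12 frame. A minimal arithmetic sketch follows, assuming the layout described in the hantro kernel driver's comments: 64 bytes of motion-vector storage per 16×16 macroblock plus 32 trailing sync bytes appended to the NV12 payload. That layout is an assumption from reading the driver, not something this page states, and the helper names are hypothetical; confirm against the running kernel at Step 1 start.

```python
def macroblocks(width, height, mb=16):
    """Number of 16x16 macroblocks covering a frame, rounding each axis up."""
    return -(-width // mb) * -(-height // mb)


def nv12_bytes(width, height):
    """Full-resolution luma plane plus interleaved half-resolution chroma."""
    return width * height * 3 // 2


def hantro_h264_sizeimage(width, height):
    """NV12 payload plus the assumed motion-vector area and 32 sync bytes."""
    mbs = macroblocks(width, height)
    return nv12_bytes(width, height) + 64 * mbs + 32


print(macroblocks(1920, 1088))            # 8160
print(nv12_bytes(1920, 1088))             # 3133440
print(hantro_h264_sizeimage(1920, 1088))  # 3655712
```

Under that assumption the result matches the Phase 2 §3 value, which suggests the multiplanar port should take sizeimage from the driver's VIDIOC_G_FMT / VIDIOC_S_FMT reply rather than computing width × height × 3 / 2 itself.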
Step 2 (conditional): Chromium display-side patch

Trigger: Step 0 finds that Chromium does not auto-route VA-API NativePixmaps through zwp_linux_dmabuf_v1 on Wayland under the default feature flags, i.e. it composites via Skia GL and Phase 1 revised's C3 (≤ 100 DRM_IOCTL/sec) cannot be reached from Step 1 alone.

Shape (deferred; exact scope set by Step 0): patch Chromium to route VAAPI NativePixmaps as Wayland subsurfaces for video elements, or enable a feature flag set that does this. Build as chromium-ohm-gl-fix (or brave-ohm-gl-fix) on marfrit-packages.

If Step 0 finds Step 2 is not needed, Phase 4 implementation ends at Step 1 + Step 3.

Step 3: Verification (Phase 7 prep)

After Step 1 (and conditionally Step 2) lands on ohm:

  1. Reinstall: sudo pacman -U libva-v4l2-request-ohm-gl-fix-*.pkg.tar.zst (and conditionally chromium-ohm-gl-fix-*).
  2. Re-run Phase 3 revised §3 v2 strace (ioctl,mmap,munmap,sendmsg,recvmsg) and §4 perf-stat (cache-misses,LLC-load-misses,cycles,instructions) on Brave + bbb_1080p30_h264.mp4 over a 60 s steady-state window. Capture renderer + GPU-process targets.
  3. Check Phase 1 revised C1-C4:
    • C1: drops ≤ 10 over 60 s, drops_post_warmup = 0
    • C2: LLC-load-misses ≤ 9 M / 10 s
    • C3: DRM_IOCTL/sec ≤ 100
    • C4: at least one of (a) VIDIOC_EXPBUF + SCM_RIGHTS or (b) PRIME_FD_TO_HANDLE from V4L2 dmabuf observed
  4. Append result row(s) to metrics.csv as phase7_verify_*.

4. What's touched, what's not

Touched:

  • libva-v4l2-request: substantial multiplanar rewrite of src/v4l2.c, src/context.c, src/picture.c, src/h264.c. Public ABI preserved (libva-driver entrypoints unchanged); internal restructuring only.
  • marfrit-packages: new libva-v4l2-request-ohm-gl-fix/ tree. Conditionally: chromium-ohm-gl-fix/ (Step 2 only).
  • ohm system: pacman -U replaces stock libva-v4l2-request (and conditionally Chromium/Brave) with the campaign packages.

Not touched:

  • mpv, ffplay, VLC, gst-*: these remain on their current paths.
Their users will not benefit from Phase 4. Out of campaign scope.
  • Mesa / panfrost / panvk / libplacebo: their state is unchanged. The panvk-1.2-fakeshim option from prior Phase 4 drafts is not pursued in this iteration.
  • libavcodec / ffmpeg: Chromium statically vendors its own; the system ffmpeg-v4l2-request-git package is unchanged.
  • Kernel drivers (hantro-vpu, panfrost): Step 1 builds against the existing UAPI surface; no kernel work.
  • KWin / Wayland protocol: Step 1 produces dmabuf fds; the existing KWin zwp_linux_dmabuf_v1 implementation consumes them. No compositor work.
  • The S5 regression (Phase 3 revised §6 / §8: gst-launch waylandsink ~0.3 drops/sec on today's stack vs. fourier 2026-04-24's 0/62). Separate iteration if pursued.

5. Predicted outcome (against Phase 1 revised C1-C4)

If Step 0 + Step 1 deliver and Step 2 turns out unnecessary (optimistic case):

^ Criterion ^ Current (Brave SW path) ^ Predicted (Phase 4 delivered) ^ How verified ^
| C1 drops ≤ 10 / 60 s, 0 post-warmup | not measured (estimated 100s+ based on Brave's CPU footprint) | 0 drops post-warmup; total drops ≤ 5 (Vulkan-init blip equivalent) | Phase 7 strace/perf-stat re-run |
| C2 LLC-load-misses ≤ 9 M / 10 s | Brave GPU process has heavy memcpy traffic (Phase 3 revised §2: 12.92 % memcpy on GPU process) | ≤ 9 M / 10 s for GPU process (no per-frame dmabuf-to-shmem CPU copy) | perf-stat re-run |
| C3 DRM_IOCTL/sec ≤ 100 | not measured for Brave (S2/S3/S4 sit at 800–1 050; S5 at 1 046) | ≤ 100 if Chromium routes the dmabuf via zwp_linux_dmabuf_v1 overlay; otherwise Step 2 needed | strace v2 + boundary_counts.csv extension |
| C4 boundary fd-passing | NO (libva fails, no V4L2 path engaged) | YES: VIDIOC_EXPBUF from libva, then either SCM_RIGHTS to KWin or PRIME_FD_TO_HANDLE to GL (depending on Step 2 outcome) | strace v2 boundary inspection |

If Step 2 is required, the same outcome is reached via Step 1 + Step 2 in sequence, with Step 1's standalone result being C1+C2 met and
C3+C4 partially met (Level 1 zero-copy at the decode boundary; Level 2 still not at the compositor boundary).

6. Risks and mitigations

  • R1: Multiplanar port takes longer than 8 weeks. The combination of V4L2 stateless API, request-API, and hantro-specific control set is intricate. Mitigation: scope to H.264 only initially. HEVC is moot (RK3566 hantro has no HEVC HW). VP8 / VP9 / AV1 follow only if H.264 lands cleanly. If a single sub-task slips by >3 weeks, surface to Markus for re-scoping.
  • R2: Chromium routes VA-API NativePixmap through Skia GL on Wayland by default (Step 0 negative finding). Mitigation: Step 2 patches Chromium. Engineering cost goes up materially but campaign scope is still tractable. If Step 2 itself looks >2 months, reconsider whether to ship Step 1 alone with C1+C2 met and document C3 as still missing.
  • R3: hantro's H.264 conformance is incomplete. Some streams (interlaced, certain profile/level combinations, Hi10P) may fail. Mitigation: cross-check against fourier's gst v4l2slh264dec working output on the same clip; that path uses the same kernel driver and is a known-good reference. Use the test corpus from fourier README L319-340 once enumerated.
  • R4: KWin's zwp_linux_dmabuf_v1 may not accept the NV12 DRM_FORMAT_MOD_LINEAR buffers that hantro produces. Phase 3 Finding 1 already showed all panvk modifiers carry external_only=1; that is a panvk-side property, but KWin's own modifier acceptance for NV12 is independent. Mitigation: cross-check by running gst-launch v4l2slh264dec → waylandsink on today's stack; that path produces the same modifier and is accepted by KWin (the S1 zero-copy reference). If S1 still works, KWin's acceptance is fine for the Step 1 output.
  • R5: fourier's libva-v4l2-request local patches were written against an older bootlin tree and may not apply cleanly to current upstream. Mitigation: start by rebasing fourier's patches on current upstream as the first sub-task of Step 1.
If upstream has moved more than expected, fall back to fourier's snapshot.
  • R6: Chromium's VAAPI gating (VaapiVideoDecoder, VaapiIgnoreDriverChecks). The driver-check path inspects the libva driver's reported profile set. fourier already saw vainfo enumerate H.264 profiles successfully with the probe patch; the multiplanar Step 1 should preserve that. Mitigation: after Step 1, re-run LIBVA_DRIVER_NAME=v4l2_request LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video1 vainfo to confirm profile enumeration still passes. Then Brave's --enable-features=VaapiVideoDecoder,VaapiIgnoreDriverChecks invocation should engage.

7. Phase 5 hand-over

Per ~/.claude/projects/-home-mfritsche-src/memory/feedback_dev_process.md, Phase 5 is second-model review of all Phase 1-4 artefacts. Markus pastes the materials uncurated:

  • Phase 1 revised
  • Phase 2 (substrate)
  • Phase 3 revised
  • This Phase 4 page
  • Companion CSVs: metrics.csv, phase3/io_cache_2026-05-01/boundary_counts.csv, phase3/io_cache_2026-05-01/perfstat.csv

Specific questions for the second-model reviewer to challenge:

  1. Is fix surface A actually the right pick given Phase 1 revised's use-case priority? In particular: does the reviewer see a path Phase 4 missed where Brave's chain could be lifted without rewriting libva-v4l2-request multiplanar?
  2. Is Step 0's research scope sufficient to commit to or rule out Step 2 with confidence, or does Step 0 itself need a Phase 4-internal sub-plan?
  3. Risk R1 (slip) and R2 (Step 2 needed): is the mitigation realistic given a single-engineer-with-Claude-assistance capacity?
  4. Test corpus from fourier README L319-340: is it adequate for declaring Step 1 complete, or should we extend it?

8. Phase 6 (implementation) and Phase 7 (verification) order

Phase 6 = "execute Step 0 → Step 1 → conditionally Step 2". Phase 7 = "Step 3" above. metrics.csv rows phase7_verify_brave_* will hold the binding numbers.

Phase 6 is long (weeks-to-months in elapsed wall time, not full-time).
Sub-step boundaries inside Phase 6 are Phase-4-internal; no need to re-enter Phase 4 unless a step-level surprise demands re-planning (e.g. Step 0 turns up something that invalidates Step 1's direction).

The three loopback edges (Phase 1 revised §5):

  • C1 ✓ + C2 ✗ + C3 ✓ → flag, investigate. Surfaces a measurement classification issue.
  • C1 ✓ + C2 ✓ + C3 ✗ → Level-1 fixed, Level-2 missing. This is the expected post-Step-1 state if Step 0 said Step 2 is needed. Re-enter Phase 4 with Step 2 spec'd.
  • C1 ✗ at Phase 7 → drops still happen. Re-enter Phase 4 with new perf evidence.

9. Deferred / out of scope

  • Other libva consumers (mpv-via-vaapi, VLC-via-vaapi): the same Step 1 lifts them indirectly. Verification is Brave-only; gains on other libva consumers are documented at Phase 7 but not required for closure.
  • libavcodec hwaccel consumers (mpv gpu-next, ffplay, VLC qt): fix surface B from the prior Phase 4 enumeration. Separate campaign.
  • Vulkan-anchored consumers (libplacebo Vulkan backend on Mali-G52): fix surface C2 (panvk-1.2-fakeshim). Separate campaign.
  • HEVC, VP8, VP9, AV1. RK3566 hantro has H.264 + MPEG2 + VP8 HW only. AV1 / VP9 / HEVC are SW even after Step 1. Out of scope for this campaign's verification.
  • The S5 zero-drop regression (Phase 3 revised §6 + §8). Side investigation if pursued.
  • Other Mali-Bifrost-v7 hardware (G31 / G51 / G76: same panvk arch, different SBC stacks). Out of scope; Phase 1's "Mali-G52" framing is hardware-specific.
  • General-purpose Vulkan workloads. Phase 1 revised §6 explicit out-of-scope. SW-emulated mandatory-1.2 entry points in any future panvk-fakeshim are tolerated.

10. References

  • Phase 1 revised: measurable success criteria.
  • Phase 2 (substrate): versions, V4L2 9-fd buffer pool, panvk gates, panfrost modifier surface.
  • Phase 3 revised: six-contender empirical bucket-attribution + boundary characterisation; the basis for §1's "Brave is libva, not libavcodec" pivot.
  • Original Phase 4: superseded by this page; preserved for audit trail.
  • fourier README L236-281: prior libva-v4l2-request investigation and partial multiplanar probe patches that form Step 1's starting point.
  • Bootlin libva-v4l2-request: https://github.com/bootlin/libva-v4l2-request
  • Local artefact: ~/fourier-test/libva-patches/fourier-local.patch (HEVC-stripped, missing-include fixed, format-enumeration extended for MPLANE).
  • marfrit-packages parallel: ffmpeg-v4l2-request-git/ is the template for the new libva-v4l2-request-ohm-gl-fix/ package layout.

----

Phase 4 ends here. Phase 6 (implementation) begins with Step 0, which produces a small attached decision document on this page. The first pacman -U on ohm marks Phase 6's first deliverable. Phase 7 is the metrics.csv phase7_verify_* row(s).
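To make the Phase 7 check and the three loopback edges concrete, the following is a minimal sketch that evaluates one measurement against the C1-C4 thresholds stated on this page and maps the result to the next action. The Phase7Sample type, its field names, and the example numbers are hypothetical; the actual metrics.csv column layout is not defined here.

```python
from dataclasses import dataclass


@dataclass
class Phase7Sample:
    # Hypothetical field names; not the real metrics.csv columns.
    drops_total: int               # dropped frames over the 60 s window
    drops_post_warmup: int         # dropped frames after warm-up
    llc_load_misses_per_10s: float
    drm_ioctl_per_sec: float
    expbuf_with_scm_rights: bool   # C4 option (a)
    prime_fd_to_handle: bool       # C4 option (b)


def evaluate(s):
    """Apply the C1-C4 thresholds from Step 3 to one sample."""
    return {
        "C1": s.drops_total <= 10 and s.drops_post_warmup == 0,
        "C2": s.llc_load_misses_per_10s <= 9_000_000,
        "C3": s.drm_ioctl_per_sec <= 100,
        "C4": s.expbuf_with_scm_rights or s.prime_fd_to_handle,
    }


def next_action(r):
    """Map a C1-C4 result to the loopback edges listed in section 8."""
    if not r["C1"]:
        return "re-enter Phase 4 with new perf evidence"
    if r["C2"] and not r["C3"]:
        return "re-enter Phase 4 with Step 2 specified"
    if not r["C2"] and r["C3"]:
        return "flag and investigate measurement classification"
    if all(r.values()):
        return "criteria met"
    return "not covered by the listed loopback edges"


# Hypothetical post-Step-1 numbers where the compositor path is still GL.
sample = Phase7Sample(3, 0, 6.5e6, 950.0, False, True)
result = evaluate(sample)
print(result, next_action(result))
```

The final fallback string is deliberate: combinations such as C2 and C3 both failing are not named among the three edges, so the sketch reports them rather than guessing a route.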
ohm_gl_fix/phase4_2026-05-01.1777640911.txt.gz · Last modified: by markus_fritsche