This is an old revision of the document!


ohm_gl_fix — Phase 3 (revised), 2026-05-01

This page replaces the original Phase 3 narrative after a methodological correction by Markus on 2026-04-30 (“did you actually trace, or take cheap stdout statistics?”). The original Phase 3 relied on mpv --term-status-msg counters and /usr/bin/time -v totals. The revised Phase 3, captured 2026-05-01, is grounded in perf record --call-graph=dwarf, perf stat -e cache-misses,LLC-load-misses, and strace -e trace=ioctl,mmap,munmap,sendmsg,recvmsg across six contender playback paths.

Validity criterion (Markus 2026-05-01): a measurement is valid only if it identifies, at each handoff boundary, whether a dmabuf fd was passed or a new anonymous mapping appeared. Stdout playback logs are not measurements. Five of six scenarios pass the criterion; the sixth (Brave) has perf record + DSO/symbol attribution but no v2 strace (see §8).

Source clip across all scenarios: bbb_1080p30_h264.mp4 (1920×1080 H.264 Main, 24 fps, sha-16 dcf8a7170fbd49bb). Hardware: ohm (PineTab2, RK3566, Mali-G52 MP2, hantro VPU, kernel 6.19.10-danctnix1-1-pinetab2, mesa 26.0.5, mpv 0.41.0, libplacebo 7.360.1, ffmpeg n8.1 runtime, KWin 6.6.4, Plasma 6.6.4). Compositor: KWin Wayland, ozone-platform=wayland for Brave.

1. Contenders

  1. S1 gst-launch v4l2slh264dec → waylandsink: the fourier reference path. HW decode via GStreamer's v4l2codecs plugin, present via the linux-dmabuf-v1 Wayland protocol.
  2. S2 mpv 0.41 --hwdec=v4l2request --vo=gpu-next: libavcodec + libplacebo. Falls back to SW decode because mpv's drmprime-overlay loader fails on Wayland (see §6 finding 5 of original phase3/findings.md).
  3. S3 ffplay default SW: libavcodec + SDL3 vout. -hwaccel v4l2request refuses to play because ffplay's required Vulkan renderer fails to initialise on Mali-G52 (panvk default-off gate).
  4. S4 VLC 3.0.22 --intf=qt: bundled libavcodec 58.x (ffmpeg 4.4 packaged at /usr/lib/ffmpeg4.4/) + Qt vout. The bundled libavcodec predates the v4l2request hwaccel landing.
  5. S5 gst-play-1.0 (GstPlayBin3): GStreamer auto-pipeline v4l2slh264dec0 → glimagesink. HW decode + GL composite.
  6. S6 Brave (Chromium) with autoplay file-URL: VAAPI initialisation fails (vaInitialize failed: unknown libva error); falls back to Chromium's statically linked SW decode in the renderer process; renderer→GPU IPC; GPU process composes via Mesa.

2. Bucket attribution (load stream / decode / display handoff)

DSO-level CPU attribution from perf record --call-graph=dwarf over 15 s steady-state per scenario.

^ Scenario ^ Load stream ^ Decode (libavcodec or static) ^ Display handoff (memcpy + GL/Mesa) ^ Other ^
| S1 gst-launch waylandsink | <0.1% libavformat | n/a (HW decode in V4L2 + libgstv4l2codecs 2.33%) | libwayland-client 3.92%, libc 12.73% (gst plumbing memcpy, not frame data) | kernel 32.24% (V4L2 ioctl + dmabuf protocol traffic) |
| S2 mpv gpu-next | 0.14% libavformat | 70.98% libavcodec | 15.98% libc (of which 14.28% is memcpy, 11.71% under pl_upload_plane → pl_tex_upload), 1.33% libgallium, 0.54% libplacebo | kernel 6.74%, mpv 1.84% |
| S3 ffplay default SW | <0.01% libavformat | 87.21% libavcodec | 5.38% libc (4.08% memcpy on SDL3 vout path), 1.13% libgallium, 0.51% libSDL3 | kernel 3.91% |
| S4 VLC qt | <0.01% libavformat | 75.52% libavcodec.so.58.134 (bundled ffmpeg 4.4) | 17.72% libc (17.35% memcpy on Qt vout), 0.55% libgallium, 0.27% Qt5* | kernel 3.07%, libfaad 2.12% (audio) |
| S5 gst-play playbin | <0.1% libavformat | n/a (HW decode) | 9.68% libgallium, 1.79% libgstgl, 8.07% libc (1.23% memcpy from gst plumbing) | kernel 27.72%, libfaad 21.45% (audio) |
| S6 Brave renderer | (static) | (static, 91.98% in brave binary) | 2.35% libc (0.86% memcpy in renderer) | kernel 5.60% |
| S6 Brave GPU |  |  | 17.26% libc (12.92% memcpy on renderer→GPU IPC + GL upload), 9.25% libgallium, 3.45% libGLESv2, 0.45% [panfrost] | 44.21% brave (Chromium GPU code), kernel 23.41% |

Headline: load-stream is ≤0.14% on every scenario; decode is the cost sink for SW paths (70-87% libavcodec); display-handoff memcpy ranges from 4.08% (S3 ffplay) through 14.28% (S2 mpv) to 17.35% (S4 VLC) on the libavcodec-using paths, and is 12.92% in Brave's GPU process. The two HW-decode paths (S1, S5) split: S1 has near-zero frame-data memcpy (its 12.73% libc is gst plumbing), S5 has minimal memcpy (1.23%) but significant libgallium/GL work.
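The bucket totals behind this headline can be recomputed from per-DSO percentages with a small mapping. The following is a minimal sketch, assuming the percentages have already been read off the perf report output; the `BUCKETS` mapping and the `bucket_totals` helper are illustrative names, not part of the campaign tooling, and the S2 figures are copied from the table above.

```python
from collections import defaultdict

# Illustrative DSO -> bucket mapping (an assumption that mirrors the table columns).
BUCKETS = {
    "libavformat": "load",
    "libavcodec": "decode",
    "libc": "display",
    "libgallium": "display",
    "libplacebo": "display",
}

def bucket_totals(dso_percent):
    """Sum per-DSO CPU percentages into load/decode/display/other buckets."""
    totals = defaultdict(float)
    for dso, pct in dso_percent.items():
        base = dso.split(".so")[0]  # "libc.so.6" -> "libc"; names without ".so" pass through
        totals[BUCKETS.get(base, "other")] += pct
    return {name: round(value, 2) for name, value in totals.items()}

# S2 (mpv gpu-next) values from the table; the ".so" suffixes are placeholders.
s2 = {
    "libavformat.so": 0.14,
    "libavcodec.so": 70.98,
    "libc.so.6": 15.98,
    "libgallium.so": 1.33,
    "libplacebo.so": 0.54,
    "[kernel.kallsyms]": 6.74,
    "mpv": 1.84,
}
s2_totals = bucket_totals(s2)
print(s2_totals)
```

Anything not named in the mapping falls into "other", which is how the kernel and the player binary end up in the last column.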

3. Boundary characterisation (validity-passing)

strace -f -e trace=ioctl,mmap,munmap,sendmsg,recvmsg over the ~25 s playback lifetime, post-processed for boundary signals.

3.1 Decode → presentation buffer boundary

^ Scenario ^ V4L2 EXPBUF (dmabuf out) ^ V4L2 DQBUF (frame deq) ^ Anon mmap ≥2 MB ^ Anon mmap total ^ Decode→buffer path ^
| S1 gst-launch waylandsink | 13 | 1200 | 10 | 459 MB | dmabuf fd (V4L2 hantro) |
| S2 mpv gpu-next | 0 | 0 | 67 | 2 320 MB | new anon mappings (CPU buffers) |
| S3 ffplay default SW | 0 | 0 | 47 | 1 313 MB | new anon mappings (CPU buffers) |
| S4 VLC qt | 0 | 0 | 53 | 2 483 MB | new anon mappings (CPU buffers) |
| S5 gst-play playbin | 13 | 958 | 34 | 1 723 MB | dmabuf fd (V4L2 hantro) |
| S6 Brave (renderer + gpu) | 0 (no V4L2 path) | 0 | not captured (v2 strace not run) || inferred: shmem-IPC from renderer to GPU (no V4L2 dmabuf at the decode boundary; libva failed earlier) |

The 13 EXPBUFs in S1/S5 correspond to 9 V4L2 capture buffers (NV12 1920×1088 single-plane, sizeimage = 3 655 712) plus 4 bitstream-input buffers — matches the Phase 2 §3 substrate finding.
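The post-processing step that produces these columns can be sketched as a line scanner over strace output. This is a minimal sketch under stated assumptions: the sample lines are synthetic, the ioctl arguments are abbreviated to `{...}`, and the regular expression assumes the usual strace rendering of `mmap`. `boundary_signals` and `decode_path` are hypothetical helper names, and unfinished/resumed lines from `strace -f` are ignored.

```python
import re

# Anonymous mappings have fd = -1 and offset = 0 in strace's mmap rendering.
ANON_MMAP = re.compile(r"mmap\([^,]+, (\d+), [^,]+, ([^,]+), -1, 0\)")
LARGE = 2 * 1024 * 1024  # the ">= 2 MB" threshold used in the table

def boundary_signals(lines):
    """Count decode->buffer boundary signals in strace output lines."""
    signals = {"expbuf": 0, "dqbuf": 0, "anon_large": 0, "anon_bytes": 0}
    for line in lines:
        if "VIDIOC_EXPBUF" in line:
            signals["expbuf"] += 1
        elif "VIDIOC_DQBUF" in line:
            signals["dqbuf"] += 1
        else:
            match = ANON_MMAP.search(line)
            if match and "MAP_ANONYMOUS" in match.group(2):
                size = int(match.group(1))
                signals["anon_bytes"] += size
                if size >= LARGE:
                    signals["anon_large"] += 1
    return signals

def decode_path(signals):
    """Apply the validity criterion: dmabuf fd exported, or CPU-side buffers only."""
    return "dmabuf fd" if signals["expbuf"] > 0 else "new anon mappings"

# Synthetic sample lines (not taken from the captured traces).
SAMPLE = [
    "[pid 4242] ioctl(9, VIDIOC_EXPBUF, {...}) = 0",
    "[pid 4242] ioctl(9, VIDIOC_DQBUF, {...}) = 0",
    "[pid 4242] mmap(NULL, 3655712, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff90000000",
    "[pid 4242] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff91000000",
    "[pid 4242] mmap(NULL, 8388608, PROT_READ, MAP_PRIVATE, 7, 0) = 0xffff92000000",
]
sample_signals = boundary_signals(SAMPLE)
print(sample_signals, decode_path(sample_signals))
```

The file-backed mapping in the last sample line is deliberately excluded, since only anonymous mappings count as evidence of CPU-side frame buffers.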

3.2 Presentation buffer → compositor boundary

^ Scenario ^ sendmsg ^ SCM_RIGHTS (fd-pass) ^ DRM_IOCTL_* ^ PRIME_FD_TO_HANDLE ^ buf→compositor path ^
| S1 gst-launch waylandsink | 1204 | 11 | 0 | 0 | fd passed via linux-dmabuf-v1 protocol: no Mesa, no DRM ioctls |
| S2 mpv gpu-next | 227 | 16 | 19 311 | 0 | libplacebo+Mesa GL composite (CPU upload first) |
| S3 ffplay default SW | 966 | 7 | 10 030 | 0 | SDL3 GL (CPU upload first) |
| S4 VLC qt | 1043 | 24 | 17 388 | 0 | Qt GL vout (CPU upload first) |
| S5 gst-play playbin | 486 | 7 | 23 019 | 22 | dmabuf import via PRIME_FD_TO_HANDLE → GL composite via Mesa |
| S6 Brave (gpu) | not captured | not captured | not captured (v2 strace not run) | not captured | inferred: renderer→GPU shmem + GPU-side GL upload + present (12.92% memcpy on this path per perf) |

Rates per second of playback (≈22 s window):

  • S2 mpv: 878 DRM_IOCTL/sec
  • S3 ffplay: 456 DRM_IOCTL/sec
  • S4 VLC: 790 DRM_IOCTL/sec
  • S5 gst-play: 1046 DRM_IOCTL/sec
  • S1 gst-launch: 0 DRM_IOCTL/sec client-side (the DRM work happens in the compositor)
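The per-second rates follow directly from the DRM_IOCTL_* counts in the §3.2 table divided by the playback window. A minimal sketch, assuming the ≈22 s window stated above; the variable names are illustrative.

```python
WINDOW_S = 22  # approximate playback window in seconds, as stated above

# Client-side DRM_IOCTL_* counts copied from the section 3.2 table.
drm_ioctls = {
    "S1 gst-launch": 0,
    "S2 mpv": 19311,
    "S3 ffplay": 10030,
    "S4 VLC": 17388,
    "S5 gst-play": 23019,
}

rates = {name: round(count / WINDOW_S) for name, count in drm_ioctls.items()}
for name, rate in rates.items():
    print(f"{name}: {rate} DRM_IOCTL/sec")
```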

4. Memory subsystem pressure (perf stat)

perf stat -e cache-misses,LLC-load-misses,cycles,instructions -p $PID sleep 10 (no strace overhead).

^ Scenario ^ cache-misses ^ LLC-load-misses ^ cycles ^ instructions ^ IPC ^
| S1 gst-launch waylandsink | 2.1 M | 3.0 M | 0.45 G | 0.13 G | 0.29 |
| S2 mpv gpu-next | 94.6 M | 65.1 M | 22.5 G | 9.2 G | 0.41 |
| S3 ffplay default SW | 93.5 M | 59.9 M | 19.5 G | 9.6 G | 0.49 |
| S4 VLC qt | 28.6 M | 15.4 M | 62.6 G | 5.6 G | 0.09 |
| S5 gst-play playbin | 14.3 M | 16.5 M | 4.3 G | 1.3 G | 0.30 |
| S6 Brave renderer | not captured | not captured | not captured | not captured |  |
| S6 Brave gpu | not captured | not captured | not captured | not captured |  |

LLC-load-misses ratios vs. S1 reference:

  • S2 mpv: 22×
  • S3 ffplay: 20×
  • S4 VLC: 5× (lower because VLC saturates 4 cores in tight per-core working sets)
  • S5 gst-play: 5.5×

VLC's 4-core saturation also shows in cycles (62.6 G in 10 s ≈ 6.26 G cycles/sec aggregate, or about 1.57 G cycles/sec per core over 4 cores, close to the RK3566's 1.8 GHz per-core maximum) and in its IPC of 0.09 (severely cache-bound). A memcpy of each 1080p NV12 frame (1920×1088×1.5 bytes, about 3 MB) at 24 fps moves ~72 MB/sec per copy, which is the kind of streaming workload that generates LLC-class miss pressure.
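The derived figures in this section (IPC, LLC ratios against S1, VLC's aggregate cycle rate, and the per-copy frame bandwidth) can be rechecked from the table values. A minimal sketch, assuming decoder-aligned 1920×1088 NV12 at 12 bits per pixel and reporting bandwidth in MiB/sec; all variable names are illustrative.

```python
# (LLC-load-misses in millions, cycles in G, instructions in G), from the perf stat table.
perf = {
    "S1": (3.0, 0.45, 0.13),
    "S2": (65.1, 22.5, 9.2),
    "S3": (59.9, 19.5, 9.6),
    "S4": (15.4, 62.6, 5.6),
    "S5": (16.5, 4.3, 1.3),
}

# Instructions per cycle for each scenario.
ipc = {name: round(ins / cyc, 2) for name, (_, cyc, ins) in perf.items()}

# LLC-load-miss ratio relative to the S1 reference.
s1_llc = perf["S1"][0]
llc_ratio = {name: round(llc / s1_llc, 1) for name, (llc, _, _) in perf.items() if name != "S1"}

# Aggregate cycle rate for S4 over the 10 s perf stat window.
s4_gcycles_per_sec = round(perf["S4"][1] / 10, 2)

# One NV12 frame: full-resolution luma plus half-size interleaved chroma.
frame_bytes = 1920 * 1088 * 3 // 2
copy_mib_per_sec = round(frame_bytes * 24 / 2**20, 1)

print(ipc, llc_ratio, s4_gcycles_per_sec, copy_mib_per_sec)
```

The bandwidth figure is per copy and per direction; a memcpy both reads and writes each frame, so actual memory traffic is higher.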

5. Kernel-side path attribution

perf report --dsos=[kernel.kallsyms] on the existing perf data (kernel symbols resolve via /proc/kallsyms).

  • No unfavorable paths detected in any scenario. Specifically: no rk_iommu_irq / rockchip_iommu_* (no iommu fault traffic), no panfrost_* software-fallback path symbols, no excessive page-table churn beyond what dmabuf allocation-and-free implies.
  • S1, S5 (HW decode + dmabuf): kernel time is V4L2 ioctl serving + DRM ioctl serving + __arch_copy_to_user for ioctl returns + _raw_spin_unlock_irqrestore for HW decoder completion interrupts + dma-buf allocation/destruction. All “doing the work” symbols.
  • S2, S3, S4 (SW decode): kernel time is just scheduler + page-table fixups + softirqs. Light kernel work as expected.
  • S6 Brave GPU process (23.41% kernel): notable share in kmem_cache_alloc_noprof 0.22% + vma_interval_tree_insert 0.26% + objects_lookup 0.29%. Together these suggest per-frame DRM object allocation rather than buffer reuse on Chromium's GPU side. That's a Chromium-side decision and adds visible kernel cost.

6. Two-level zero-copy structure

The validity-passing data exposes a structural distinction the Phase 3 (original) narrative missed:

Level 1 — decode → presentation buffer

The decoder either produces a dmabuf fd (visible as VIDIOC_EXPBUF followed by the fd flowing into the consumer) or it produces frames into CPU-side anonymous memory (visible as mmap(MAP_ANONYMOUS, ≥3 MB, …)). S1 and S5 do the former (13 EXPBUFs each). S2, S3, S4 do the latter (1.3-2.5 GB total anon allocation, of which the steady-state per-frame share is amortised across libavcodec's reused frame-buffer pool).

Level 2 — presentation buffer → compositor

Once the presentation buffer exists, the consumer either passes a dmabuf fd to the compositor via the Wayland linux-dmabuf-v1 protocol (visible as sendmsg + SCM_RIGHTS on the Wayland socket, 0 DRM_IOCTL_*) or it walks the buffer through Mesa GL+DRM to build a GL texture and present that (visible as thousands of DRM_IOCTL_*/sec).

S1 is the only scenario reaching Level 2. Every other path stops at Level 1 (S5, the best case) or fails Level 1 entirely (S2, S3, S4, S6).
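The two-level distinction can be expressed as a small decision rule over the strace counts from §3. A minimal sketch, assuming that any VIDIOC_EXPBUF indicates Level 1 and that zero client-side DRM_IOCTL_* calls on top of that indicates Level 2; `zero_copy_level` is a hypothetical helper, and S6 is omitted because its strace counts were not captured.

```python
def zero_copy_level(expbuf, drm_ioctls):
    """Return the highest zero-copy level a playback path reaches.

    0: frames land in CPU-side anonymous memory (no dmabuf export seen)
    1: decoder exports dmabuf fds, but presentation goes through Mesa GL + DRM
    2: dmabuf fds go straight to the compositor (no client-side DRM ioctls)
    """
    if expbuf == 0:
        return 0
    return 2 if drm_ioctls == 0 else 1

# (VIDIOC_EXPBUF count, DRM_IOCTL_* count) per scenario, from the section 3 tables.
scenarios = {
    "S1": (13, 0),
    "S2": (0, 19311),
    "S3": (0, 10030),
    "S4": (0, 17388),
    "S5": (13, 23019),
}
levels = {name: zero_copy_level(*counts) for name, counts in scenarios.items()}
print(levels)
```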

Why this matters

Same decode (HW), same source clip, same compositor — but S1 vs. S5 differ by:

  • 6.8× fewer cache-misses (2.1 M vs. 14.3 M)
  • 5.5× fewer LLC-load-misses (3.0 M vs. 16.5 M)
  • no client-side DRM ioctl traffic at all (0 vs. 23 019 over 22 s)
  • 5× lower CPU footprint (7% vs. 38% paced)

Going through Mesa's GL+DRM path for compositing alone costs ~30% of CPU on this hardware class compared with going through Wayland's linux-dmabuf-v1 protocol directly. That gap exists even when Level 1 is solved. The “buffer-to-display without CPU copy” predicament Markus has been naming is specifically about Level 2.

7. What this implies for the fix surface

Re-stating the Phase 4 fix surfaces, ranked against this evidence:

  1. A. Complete libva-v4l2-request multiplanar port: lifts S6 (Brave) Level 1 only. Browser still composes via Chromium's GPU-process Mesa GL path; Level 2 stays.
  2. B. libavcodec drm_prime export to linux-dmabuf-v1: would lift Level 1 and Level 2 for libavcodec consumers (mpv, ffplay, VLC if linked against current libavcodec). Highest leverage on the libavcodec ecosystem.
  3. C2. panvk-1.2-fakeshim Vulkan layer: unblocks Vulkan-side consumers for Level 2 if they use Vulkan-direct dmabuf-import + swapchain present instead of GL. Doesn't help GL-anchored consumers (libplacebo's GL backend, SDL3 GL, Qt GL).

The empirical rank: B > A > C2 for Markus's stated use cases (Brave, VS Code, web browsing). A lifts the highest-traffic individual workload (browser video decode); B lifts the most consumers across the libavcodec ecosystem with a single change at the right layer. C2 has a narrower lift but is the smallest engineering footprint and would serve as a feasibility vehicle.

8. Brave-specific gap acknowledgement

Five of six scenarios have validity-passing v2 strace + perf-stat data. Brave (S6) has only the earlier perf record (DSO/symbol attribution, 35 992 renderer + 9 289 GPU samples) plus the Brave subprocess CPU distribution captured 2026-05-01.

What we do know about Brave:

  • Renderer: 71.5% of one core, 91.98% in brave (Chromium statically links libavcodec, invisible at DSO level but inside that 92%), 0.86% memcpy in renderer.
  • GPU process: 21.5% of one core, 17.26% libc (of which 12.92% memcpy), 9.25% libgallium, 23.41% kernel (DRM/dma-buf object-table churn).
  • Per-frame DRM object allocation pattern in the GPU process (kernel-side kmem_cache_alloc_noprof, objects_lookup, vma_interval_tree_insert).
  • VAAPI initialisation fails (vaInitialize failed: unknown libva error), confirming fourier's S4 finding: libva-v4l2-request is the chokepoint for browser HW decode.

What we don't know (gap):

  • No v2 strace capture: the Brave-specific automation (fresh isolated profile, autoplay file-URL) didn't reach video-decode steady state within the 12 s settle window in three retry attempts. Manual measurement (Markus opens the video) yielded perf record but not strace-from-start.

The architectural picture for Brave is consistent with what perf shows and with what fourier documented earlier (see fourier README L236-281): no HW decode (libva-v4l2-request multiplanar gap), SW decode in renderer's static ffmpeg, IPC via shared memory to GPU process, GPU process uploads to GL texture and composites. Both Level 1 and Level 2 are CPU-copy.

9. Artefact references

  • phase3/cross_player_perf_2026-04-30/: original perf record DSO/symbol/callgraph for S1-S5 + Brave renderer/gpu (samples-based).
  • phase3/io_cache_2026-05-01/: v2 strace traces (full lifetime, widened filter) + perf-stat .perfstat files for S1-S5.
  • phase3/findings.md: the original Phase 3 narrative (Findings 1-6) plus the methodology corrections that led here.
  • phase3/research_2026-04-30_panvk_brokenness.md: the panvk/v7 Vulkan-API-version analysis (PAN_I_WANT_A_BROKEN_VULKAN_DRIVER gate, apiVersion = 1.0.335 wall against libplacebo's ≥1.2 minimum).
  • phase3/INDEX.md: full evidence-file map per finding.

10. Methodology lessons captured

Saved to project memory (~/.claude/projects/-home-mfritsche-src-ohm-gl-fix/memory/):

  • feedback_profile_dont_proxy.md: when locating cycles, run perf/strace, don't infer from program-self counters.
  • feedback_kpi_vs_detail_knowledge.md: before producing an artefact, check whether the facts in reach mandate the content.
  • feedback_measurement_archival.md: every probe writes to a named file in the campaign repo at run time.
  • feedback_outscoping.md: for “find the gap” goals, the deliverable is the gap, never a workaround.
  • feedback_pre_think_problem_space.md: slow-down requests are for territory mapping, not solution selection.
  • feedback_ask_before_user_visible.md: when automation fails on shared user state, asking the user is cheaper than retrying.


Phase 3 (revised) ends here. Phase 4 (“the gap” structural documentation, with use-case scoping) is in phase4_2026-04-30; it predates this revised data but its fix-surface ranking is reinforced by §7 above.

ohm_gl_fix/phase3_revised_2026-05-01.1777627671.txt.gz · Last modified: by markus_fritsche