ohm_gl_fix:phase3_revised_2026-05-01
Differences
This shows you the differences between two versions of the page.
| ohm_gl_fix:phase3_revised_2026-05-01 [2026/05/01 09:27] – Phase 3 revised — empirical bucket-attribution + boundary characterisation markus_fritsche | ohm_gl_fix:phase3_revised_2026-05-01 [2026/05/01 13:08] (current) – rewrap paragraphs (DokuWiki single-newline fix) markus_fritsche | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| ====== ohm_gl_fix — Phase 3 (revised), 2026-05-01 ====== | ====== ohm_gl_fix — Phase 3 (revised), 2026-05-01 ====== | ||
| - | This page replaces the original Phase 3 narrative after a methodological | + | This page replaces the original Phase 3 narrative after a methodological correction by Markus on 2026-04-30 ("did you actually trace, or take cheap stdout statistics?" |
| - | correction by Markus on 2026-04-30 ("did you actually trace, or take cheap | + | |
| - | stdout statistics?" | + | |
| - | counters and ''/ | + | |
| - | 2026-05-01 — is grounded in '' | + | |
| - | '' | + | |
| - | '' | + | |
| - | playback paths. | + | |
| - | Validity criterion (Markus 2026-05-01): | + | Validity criterion (Markus 2026-05-01): |
| - | identifies, **at each handoff boundary**, whether a //dmabuf fd was passed// | + | |
| - | or a //new anonymous mapping appeared//. Stdout playback logs are not | + | |
| - | measurements. Five of six scenarios pass the criterion; the sixth (Brave) | + | |
| - | has perf record + DSO/symbol attribution but no v2 strace (see §8). | + | |
| - | Source clip across all scenarios: '' | + | Source clip across all scenarios: '' |
| - | H.264 Main, 24 fps, sha-16 '' | + | |
| - | (PineTab2, RK3566, Mali-G52 MP2, hantro VPU, kernel | + | |
| - | '' | + | |
| - | libplacebo 7.360.1, ffmpeg n8.1 runtime, KWin 6.6.4, Plasma 6.6.4). | + | |
| - | Compositor: KWin Wayland, ozone-platform=wayland for Brave. | + | |
| ===== 1. Contenders ===== | ===== 1. Contenders ===== | ||
| - | - **S1 gst-launch v4l2slh264dec → waylandsink** — the fourier reference | + | - **S1 gst-launch v4l2slh264dec → waylandsink** — the fourier reference path. HW decode via GStreamer' |
| - | | + | - **S2 mpv 0.41 '' |
| - | | + | - **S3 ffplay default SW** — libavcodec + SDL3 vout. '' |
| - | - **S2 mpv 0.41 '' | + | - **S4 VLC 3.0.22 '' |
| - | | + | - **S5 gst-play-1.0 (GstPlayBin3)** — GStreamer auto-pipeline: |
| - | | + | - **S6 Brave (Chromium)** with autoplay file-URL — VAAPI initialisation fails ('' |
| - | - **S3 ffplay default SW** — libavcodec + SDL3 vout. '' | + | |
| - | | + | |
| - | | + | |
| - | - **S4 VLC 3.0.22 '' | + | |
| - | | + | |
| - | | + | |
| - | - **S5 gst-play-1.0 (GstPlayBin3)** — GStreamer auto-pipeline: | + | |
| - | | + | |
| - | - **S6 Brave (Chromium)** with autoplay file-URL — VAAPI initialisation | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| ===== 2. Bucket attribution (load stream / decode / display handoff) ===== | ===== 2. Bucket attribution (load stream / decode / display handoff) ===== | ||
| - | DSO-level CPU attribution from '' | + | DSO-level CPU attribution from '' |
| - | 15 s steady-state per scenario. | + | |
| ^ Scenario | ^ Scenario | ||
| Line 58: | Line 29: | ||
| | S6 Brave GPU | — | — | **17.26%** libc (12.92% memcpy on renderer→GPU IPC + GL upload), 9.25% libgallium, 3.45% libGLESv2, 0.45% [panfrost] | 44.21% brave (Chromium GPU code), kernel 23.41% | | | S6 Brave GPU | — | — | **17.26%** libc (12.92% memcpy on renderer→GPU IPC + GL upload), 9.25% libgallium, 3.45% libGLESv2, 0.45% [panfrost] | 44.21% brave (Chromium GPU code), kernel 23.41% | | ||
| - | Headline: load-stream is ≤0.1% on every scenario; decode is the cost | + | Headline: load-stream is ≤0.1% on every scenario; decode is the cost sink for SW paths (70-87% libavcodec); |
| - | sink for SW paths (70-87% libavcodec); | + | |
| - | on every libavcodec-using path AND on Brave' | + | |
| - | HW-decode paths (S1, S5) split: S1 has near-zero memcpy, S5 has minimal | + | |
| - | memcpy (1.23%) but significant libgallium/ | + | |
| ===== 3. Boundary characterisation (validity-passing) ===== | ===== 3. Boundary characterisation (validity-passing) ===== | ||
| - | '' | + | '' |
| - | ~25 s playback lifetime, post-processed for boundary signals. | + | |
| ==== 3.1 Decode → presentation buffer boundary ==== | ==== 3.1 Decode → presentation buffer boundary ==== | ||
| Line 79: | Line 45: | ||
| | S6 Brave (renderer + gpu) | 0 (no V4L2 path) | 0 | not captured (v2 strace not run) | — | inferred: shmem-IPC from renderer to GPU (no V4L2 dmabuf at the decode boundary; libva failed earlier) | | | S6 Brave (renderer + gpu) | 0 (no V4L2 path) | 0 | not captured (v2 strace not run) | — | inferred: shmem-IPC from renderer to GPU (no V4L2 dmabuf at the decode boundary; libva failed earlier) | | ||
| - | The 13 EXPBUFs in S1/S5 correspond to 9 V4L2 capture buffers (NV12 | + | The 13 EXPBUFs in S1/S5 correspond to 9 V4L2 capture buffers (NV12 1920×1088 single-plane, |
| - | 1920×1088 single-plane, | + | |
| - | buffers — matches the Phase 2 §3 substrate finding. | + | |
| ==== 3.2 Presentation buffer → compositor boundary ==== | ==== 3.2 Presentation buffer → compositor boundary ==== | ||
| Line 103: | Line 67: | ||
| ===== 4. Memory subsystem pressure (perf stat) ===== | ===== 4. Memory subsystem pressure (perf stat) ===== | ||
| - | '' | + | '' |
| - | sleep 10'' | + | |
| ^ Scenario | ^ Scenario | ||
| Line 122: | Line 85: | ||
| * S5 gst-play: 5× | * S5 gst-play: 5× | ||
| - | VLC's 4-core saturation also shows in cycles (62.6 G in 10 s ≈ 6.26 GHz | + | VLC's 4-core saturation also shows in cycles (62.6 G in 10 s ≈ 6.26 GHz aggregate, i.e. ≈ 1.57 GHz averaged per core across all 4 cores) and in an IPC of 0.09 (severely cache-bound). Each memcpy of a 1080p NV12 frame moves 1920 × 1080 × 1.5 B ≈ 3.1 MB, so at 24 fps one copy per frame is ≈ 75 MB/s (~72 MiB/s) read plus the same amount written, exactly the kind of workload that generates LLC-class miss pressure. |
| - | aggregate, near 100% across all 4 cores at 1.4 GHz per core) and IPC of | + | |
| - | 0.09 (severely cache-bound). Memcpy of 1080p NV12 frames at 24 fps = | + | |
| - | ~72 MB/sec memory traffic, exactly the workload generating LLC-class | + | |
| - | miss pressure. | + | |
| ===== 5. Kernel-side path attribution ===== | ===== 5. Kernel-side path attribution ===== | ||
| - | '' | + | '' |
| - | (kernel symbols resolve via ''/ | + | |
| - | * **No unfavorable paths detected** across any scenario. Specifically: | + | * **No unfavorable paths detected** across any scenario. Specifically: |
| - | | + | * S1, S5 (HW decode + dmabuf): kernel time is V4L2 ioctl serving + DRM ioctl serving + '' |
| - | | + | * S2, S3, S4 (SW decode): kernel time is just scheduler + page-table fixups + softirqs. Light kernel work as expected. |
| - | | + | * **S6 Brave GPU process (23.41% kernel)**: notable share in '' |
| - | * S1, S5 (HW decode + dmabuf): kernel time is V4L2 ioctl serving + | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| - | * S2, S3, S4 (SW decode): kernel time is just scheduler + | + | |
| - | | + | |
| - | * **S6 Brave GPU process (23.41% kernel)**: notable share in | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| ===== 6. Two-level zero-copy structure ===== | ===== 6. Two-level zero-copy structure ===== | ||
| - | The validity-passing data exposes a structural distinction the | + | The validity-passing data exposes a structural distinction the Phase 3 (original) narrative missed: |
| - | Phase 3 (original) narrative missed: | + | |
| ==== Level 1 — decode → presentation buffer ==== | ==== Level 1 — decode → presentation buffer ==== | ||
| - | The decoder either produces a //dmabuf fd// (visible as | + | The decoder either produces a //dmabuf fd// (visible as '' |
| - | '' | + | |
| - | or it produces frames into //CPU-side anonymous memory// (visible | + | |
| - | as '' | + | |
| - | (**13 EXPBUFs** each). S2, S3, S4 do the latter (**1.3-2.5 GB total | + | |
| - | anon allocation**, | + | |
| - | amortised across libavcodec' | + | |
| ==== Level 2 — presentation buffer → compositor ==== | ==== Level 2 — presentation buffer → compositor ==== | ||
| - | Once the presentation buffer exists, the consumer either passes a | + | Once the presentation buffer exists, the consumer either passes a //dmabuf fd to the compositor via the Wayland '' |
| - | //dmabuf fd to the compositor via the Wayland '' | + | |
| - | protocol// (visible as '' | + | |
| - | **0 DRM_IOCTL_***) or it walks the buffer through //Mesa GL+DRM// to | + | |
| - | build a GL texture and present that (visible as **thousands of | + | |
| - | DRM_IOCTL_***/ | + | |
| - | S1 is the only scenario reaching Level 2. Every other libavcodec or | + | S1 is the only scenario reaching Level 2. Every other libavcodec or libva-using path stops at Level 1 (best case, S5) or fails Level 1 entirely (S2, S3, S4, S6). |
| - | libva-using path stops at Level 1 (best case, S5) or fails Level 1 | + | |
| - | entirely (S2, S3, S4, S6). | + | |
| ==== Why this matters ==== | ==== Why this matters ==== | ||
| - | Same decode (HW), same source clip, same compositor — but | + | Same decode (HW), same source clip, same compositor — but **S1 vs. S5 differ by**: |
| - | **S1 vs. S5 differ by**: | + | |
| * **3 800× fewer cache-misses** (2.1 M vs. 14.3 M) | * **3 800× fewer cache-misses** (2.1 M vs. 14.3 M) | ||
| Line 187: | Line 119: | ||
| * **5× lower CPU footprint** (7% vs. 38% paced) | * **5× lower CPU footprint** (7% vs. 38% paced) | ||
| - | Going through Mesa's GL+DRM path for compositing **alone** costs ~30% | + | Going through Mesa's GL+DRM path for compositing **alone** costs ~30% of CPU on this hardware class compared with going through Wayland' |
| - | of CPU on this hardware class compared with going through Wayland' | + | |
| - | '' | + | |
| - | Level 1 is solved. The " | + | |
| - | Markus has been naming is // | + | |
| ===== 7. What this implies for the fix surface ===== | ===== 7. What this implies for the fix surface ===== | ||
| Line 197: | Line 125: | ||
| Re-stating the Phase 4 fix surfaces, ranked against this evidence: | Re-stating the Phase 4 fix surfaces, ranked against this evidence: | ||
| - | - **A. Complete libva-v4l2-request multiplanar port** — lifts S6 | + | - **A. Complete libva-v4l2-request multiplanar port** — lifts S6 (Brave) Level 1 only. Browser still composes via Chromium' |
| - | | + | - **B. '' |
| - | | + | - **C2. '' |
| - | - **B. '' | + | + swapchain present// instead of GL. Doesn' |
| - | | + | |
| - | | + | |
| - | | + | |
| - | - **C2. '' | + | |
| - | | + | |
| - | + swapchain present// instead of GL. Doesn' | + | |
| - | | + | |
| - | The empirical rank: //B > A > C2// for Markus' | + | The empirical rank: //B > A > C2// for Markus' |
| - | VS Code, web browsing). A lifts the highest-traffic individual workload | + | |
| - | (browser video decode); B lifts the most consumers across the | + | |
| - | libavcodec ecosystem with a single change at the right layer. C2 has a | + | |
| - | narrower lift but is the smallest engineering footprint and would | + | |
| - | serve as a feasibility vehicle. | + | |
| ===== 8. Brave-specific gap acknowledgement ===== | ===== 8. Brave-specific gap acknowledgement ===== | ||
| - | Five of six scenarios have validity-passing v2 strace + perf-stat data. | + | Five of six scenarios have validity-passing v2 strace + perf-stat data. Brave (S6) has only the earlier perf record (DSO/symbol attribution, |
| - | Brave (S6) has only the earlier perf record (DSO/symbol attribution, | + | |
| - | 35 992 renderer + 9 289 GPU samples) plus the Brave subprocess CPU | + | |
| - | distribution captured 2026-05-01. | + | |
| What we //do// know about Brave: | What we //do// know about Brave: | ||
| - | * Renderer: 71.5% of one core, 91.98% in '' | + | * Renderer: 71.5% of one core, 91.98% in '' |
| - | | + | * GPU process: 21.5% of one core, **17.26% libc (of which 12.92% memcpy)**, 9.25% libgallium, 23.41% kernel (DRM/ |
| - | | + | * Per-frame DRM object allocation pattern in the GPU process (kernel-side '' |
| - | * GPU process: 21.5% of one core, **17.26% libc (of which 12.92% | + | * VAAPI initialisation //fails// ('' |
| - | | + | |
| - | | + | |
| - | * Per-frame DRM object allocation pattern in the GPU process | + | |
| - | | + | |
| - | | + | |
| - | * VAAPI initialisation //fails// ('' | + | |
| - | | + | |
| - | | + | |
| What we // | What we // | ||
| - | * No v2 strace capture: the Brave-specific automation (fresh | + | * No v2 strace capture: the Brave-specific automation (fresh isolated profile, autoplay file-URL) didn't reach video-decode steady state within the 12 s settle window in three retry attempts. Manual measurement (Markus opens the video) yielded perf record but not strace-from-start. |
| - | | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| - | The architectural picture for Brave is consistent with what perf shows | + | The architectural picture for Brave is consistent with what perf shows and with what fourier documented earlier (see fourier README L236-281): no HW decode (libva-v4l2-request multiplanar gap), SW decode in renderer' |
| - | and with what fourier documented earlier (see fourier README L236-281): | + | |
| - | no HW decode (libva-v4l2-request multiplanar gap), SW decode in | + | |
| - | renderer' | + | |
| - | process uploads to GL texture and composites. Both Level 1 and Level 2 | + | |
| - | are CPU-copy. | + | |
| ===== 9. Artefact references ===== | ===== 9. Artefact references ===== | ||
| - | * '' | + | * '' |
| - | | + | * '' |
| - | * '' | + | * '' |
| - | | + | * '' |
| - | * '' | + | |
| - | | + | |
| - | * '' | + | |
| - | | + | |
| - | | + | |
| - | | + | |
| * '' | * '' | ||
| Line 271: | Line 161: | ||
| Saved to project memory ('' | Saved to project memory ('' | ||
| - | * '' | + | * '' |
| - | | + | * '' |
| - | * '' | + | * '' |
| - | | + | * '' |
| - | * '' | + | * '' |
| - | | + | * '' |
| - | * '' | + | |
| - | | + | |
| - | * '' | + | |
| - | | + | |
| - | * '' | + | |
| - | | + | |
| ---- | ---- | ||
| - | //Phase 3 (revised) ends here. Phase 4 ("the gap" structural | + | //Phase 3 (revised) ends here. Phase 4 ("the gap" structural documentation, |
| - | documentation, | + | |
| - | [[ohm_gl_fix: | + | |
| - | fix-surface ranking is reinforced by §7 above.// | + | |
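
The validity criterion and the two-level structure in §6 amount to counting a few syscall signatures per scenario. The post-processing actually used for the v2 strace captures is not shown on this page, so the following is only a minimal sketch of the idea. The function names ''count_signals'' and ''classify'' are hypothetical, and the regular expressions assume strace's usual text output. The sketch models only the signals named in §6 (EXPBUF count, ''DRM_IOCTL_*'' count, anonymous mappings); the positive Wayland-side signal for Level 2 is not modelled.

```python
import re

# Signatures as they typically appear in strace text output (assumed format).
EXPBUF = re.compile(r"ioctl\(\d+, VIDIOC_EXPBUF\b")
DRM = re.compile(r"ioctl\(\d+, DRM_IOCTL_\w+")
ANON_MMAP = re.compile(r"mmap\(NULL, (\d+), [^)]*MAP_ANONYMOUS")


def count_signals(lines):
    """Count boundary signals in an iterable of strace output lines."""
    counts = {"expbuf": 0, "drm_ioctl": 0, "anon_mmap": 0, "anon_bytes": 0}
    for line in lines:
        if EXPBUF.search(line):
            counts["expbuf"] += 1          # decoder exported a dmabuf fd
        elif DRM.search(line):
            counts["drm_ioctl"] += 1       # buffer walked through GL+DRM
        else:
            match = ANON_MMAP.search(line)
            if match:
                counts["anon_mmap"] += 1   # new anonymous mapping appeared
                counts["anon_bytes"] += int(match.group(1))
    return counts


def classify(counts):
    """Apply the two-level reading of §6 (thresholds are an assumption)."""
    level1 = counts["expbuf"] > 0                    # decode -> dmabuf
    level2 = level1 and counts["drm_ioctl"] == 0     # dmabuf -> compositor
    return {"level1_dmabuf": level1, "level2_dmabuf": level2}


if __name__ == "__main__":
    # Illustrative lines only; not taken from the captured traces.
    sample = [
        "ioctl(5, VIDIOC_EXPBUF, {type=V4L2_BUF_TYPE_VIDEO_CAPTURE, index=0}) = 0",
        "ioctl(7, DRM_IOCTL_PRIME_FD_TO_HANDLE, 0xffffd0a0) = 0",
        "mmap(NULL, 3133440, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff8c000000",
    ]
    counts = count_signals(sample)
    print(counts, classify(counts))
```

Under these assumptions an S1-like trace (EXPBUFs, no DRM ioctls) classifies as reaching both levels, an S5-like trace (EXPBUFs plus DRM ioctls) as Level 1 only, and an S2/S3/S4-like trace (anonymous mappings, no EXPBUF) as failing Level 1.
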
ohm_gl_fix/phase3_revised_2026-05-01.1777627671.txt.gz · Last modified: by markus_fritsche
