Phase 2 — KWin source archaeology
This file is the synthesised source-read for the kwin_overlay_subsurface campaign. It opens with the Phase 1 leading question answer (per worklist), follows with the architectural diagram of the per-frame path, and ends with file-level findings ordered by the priority list in worklist.
Discipline guard: no patches are written before this file is committed. Re-scoping is documented honestly with the deferral target named.
Phase 1 — leading question answer
Status: LOCKED 2026-05-02. This section is the Phase 1 deliverable. Future Phase 2 / Phase 3 measurements may modify the file-level findings below this section but must not silently move the Phase 1 answer; if reality after measurement contradicts this, the contradiction is documented as a new section, not by editing this one.
Question (from worklist):
On what condition does KWin promote a wp_linux_dmabuf_v1 surface to direct scanout versus falling back to GPU composite, and does the hantro NV12 DRM_FORMAT_MOD_LINEAR output satisfy those conditions on this DRM driver (rockchip-drm on RK3568, Mesa 26.0.5)?
Short answer — NO
Neither of KWin v6.6.4's two scanout-promotion paths can place the hantro NV12 LINEAR buffer on a DRM plane on this hardware in the windowed Brave case, for two distinct structural reasons. The Phase 4 design space narrows to the import-caching hypothesis only. This aligns with — does not contradict — the architect's prior from the A2 trajectory hint.
KWin's two scanout-promotion paths
KWin v6.6.4 has two distinct paths that could in principle promote a wp_linux_dmabuf_v1 surface to scanout. Both pass through the same per-layer feasibility check (OutputLayer::importScanoutBuffer) but differ in how the candidate is chosen.
Path A — single-plane direct scanout
Entry: Compositor::prepareDirectScanout (src/compositor.cpp:379, prepareDirectScanout(view, logicalOutput, backendOutput, frame)).
- view→scanoutCandidates(1) (compositor.cpp:385). On WorkspaceScene this calls WorkspaceScene::scanoutCandidates (src/scene/workspacescene.cpp:281), which walks m_containerItem→sortedChildItems() top-to-bottom via the recursive helper addCandidates (workspacescene.cpp:197). addCandidates produces up to maxCount + 1 = 2 candidate SurfaceItems. The walk requires every traversed item to have opacity == 1.0 and no effects (workspacescene.cpp:202-203).
- After the walk, the back element of the candidate list must be either absent OR a 1×1 single-pixel black-background buffer per checkForBlackBackground (workspacescene.cpp:263-279). Otherwise scanoutCandidates returns {} (workspacescene.cpp:306-308).
- If a candidate is returned, prepareDirectScanout requires it to be a SurfaceItemWayland with a valid surface, valid buffer, and dmabuf attributes (compositor.cpp:391-402). It then takes the format/modifier intersection: layer→supportedDrmFormats() for non-tearing or supportedAsyncDrmFormats() for tearing must contain attrs→format AND attrs→modifier (compositor.cpp:404-409).
- If the intersection passes, layer→importScanoutBuffer(buffer, frame) is invoked (compositor.cpp:416). For the DRM backend that resolves to EglGbmLayer::importScanoutBuffer (src/backends/drm/drm_egl_layer.cpp:81).
Path B — multi-overlay plane scanout
Entry: Compositor::repaint per-layer loop (compositor.cpp:680+), through Scene::overlayCandidates and assignOverlays. Candidate filtering for an overlay is findOverlayCandidates (workspacescene.cpp:335), which accepts an item iff:
- it is a SurfaceItem,
- non-empty rect,
- frameTimeEstimation < 50 ms (≥ 20 fps source frame cadence),
- surfaceItem→buffer()→dmabufAttributes() is non-null,
- opacity == 1.0 (TODO comment on line 381 says item-opacity is not yet handled),
- not entirely covered by other opaque windows,
- if the region is occupied or rounded-corner-clipped, the item must be fully opaque to qualify as an underlay (workspacescene.cpp:386-389).
Per-layer feasibility is then the same EglGbmLayer::importScanoutBuffer gate, with the layer being a non-Primary OutputLayerType.
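The filter list above can be read as a single predicate. The following is a minimal sketch of that reading, not KWin's code: `CandidateInfo`, `passesOverlayFilter` and `braveVideoSubsurface` are hypothetical names, and each field is an assumed stand-in for the property the corresponding bullet inspects.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for what findOverlayCandidates inspects per item.
// Field names are illustrative and do not mirror KWin's real types.
struct CandidateInfo {
    bool isSurfaceItem = false;
    bool rectEmpty = true;
    std::chrono::milliseconds frameTimeEstimation{0};
    bool hasDmabufAttributes = false;
    double opacity = 1.0;
    bool fullyCoveredByOpaqueWindows = false;
    bool regionOccupiedOrCornerClipped = false;
    bool fullyOpaque = false;
};

// Returns true iff every filter from the list is satisfied.
bool passesOverlayFilter(const CandidateInfo &c)
{
    assert(c.opacity >= 0.0 && c.opacity <= 1.0);
    if (!c.isSurfaceItem || c.rectEmpty) {
        return false;
    }
    if (c.frameTimeEstimation >= std::chrono::milliseconds(50)) {
        return false; // source cadence too slow
    }
    if (!c.hasDmabufAttributes || c.opacity != 1.0) {
        return false;
    }
    if (c.fullyCoveredByOpaqueWindows) {
        return false;
    }
    // Underlay case: an occupied or corner-clipped region needs full opacity.
    if (c.regionOccupiedOrCornerClipped && !c.fullyOpaque) {
        return false;
    }
    return true;
}

// Assumed description of Brave's 30 fps NV12 video subsurface.
CandidateInfo braveVideoSubsurface()
{
    CandidateInfo c;
    c.isSurfaceItem = true;
    c.rectEmpty = false;
    c.frameTimeEstimation = std::chrono::milliseconds(33);
    c.hasDmabufAttributes = true;
    return c;
}
```

Under these assumptions the video subsurface passes the filter, which matches the later finding that Path B is rejected further downstream rather than here.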
EglGbmLayer::importScanoutBuffer — the conjunct list
(src/backends/drm/drm_egl_layer.cpp:81-127, top-to-bottom)
- Env var KWIN_DRM_NO_DIRECT_SCANOUT unset.
- Layer is Primary OR drmOutput()→shouldDisableNonPrimaryPlanes() is false. The latter is only true in PresentationMode::Async or AdaptiveAsync (drm_output.cpp:112-117), i.e. tearing modes, so this conjunct is inactive for Brave's default 30 fps playback.
- gpu()→needsModeset() is false (no pending modeset).
- drmOutput()→needsShadowBuffer() is false (no display-side shadow buffer required, e.g. for HDR/colour conversion).
- gpu() == gpu()→platform()→primaryGpu() (no cross-GPU scanout).
- Color pipeline is identity OR colorPowerTradeoff != PreferAccuracy.
- sourceRect() == sourceRect().toRect(): the source rect must be integer-aligned. Sub-pixel cropping → reject. Comment cites the kernel doc note that “devices that don't support subpixel plane coordinates can ignore the fractional part.”
- If offloadTransform() is non-identity, the plane must support that transform via m_plane→supportsTransformation.
- gpu()→importBuffer(buffer, FileDescriptor{}) returns non-null (gbm import succeeds for this dmabuf format/modifier/stride).
The doc comment on OutputLayer::importScanoutBuffer (src/core/outputlayer.h:101-106) notes that even when this returns true, “a presentation request on the output must however be used afterwards to find out if it's actually successful” — i.e. the final filter is the kernel's DRM atomic-test.
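To make the ordering of the conjunct list concrete, here is a hedged model that reports the first conjunct to fail. `ScanoutGateState` and `firstFailingConjunct` are hypothetical names introduced for illustration; the real function returns only success or failure, and the kernel's atomic test still has the last word.

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Hypothetical snapshot of the state the gate inspects (not KWin's members).
struct ScanoutGateState {
    bool noDirectScanoutEnv = false;      // KWIN_DRM_NO_DIRECT_SCANOUT set?
    bool layerIsPrimary = false;
    bool disableNonPrimaryPlanes = false; // true only in tearing modes
    bool needsModeset = false;
    bool needsShadowBuffer = false;
    bool onPrimaryGpu = true;
    bool colorPipelineIdentity = true;
    bool preferAccuracy = false;
    double srcX = 0, srcY = 0, srcW = 1920, srcH = 1080;
    bool transformIdentity = true;
    bool planeSupportsTransform = false;
    bool gbmImportOk = true;
};

static bool isWhole(double v) { return std::floor(v) == v; }

// Walks the conjuncts top-to-bottom; returns "" when all of them hold.
std::string firstFailingConjunct(const ScanoutGateState &s)
{
    assert(s.srcW >= 0 && s.srcH >= 0);
    if (s.noDirectScanoutEnv) return "env var";
    if (!s.layerIsPrimary && s.disableNonPrimaryPlanes) return "non-primary planes disabled";
    if (s.needsModeset) return "pending modeset";
    if (s.needsShadowBuffer) return "shadow buffer";
    if (!s.onPrimaryGpu) return "cross-GPU";
    if (!s.colorPipelineIdentity && s.preferAccuracy) return "color pipeline";
    if (!isWhole(s.srcX) || !isWhole(s.srcY) || !isWhole(s.srcW) || !isWhole(s.srcH)) {
        return "integer source rect";
    }
    if (!s.transformIdentity && !s.planeSupportsTransform) return "transform";
    if (!s.gbmImportOk) return "gbm import";
    return "";
}
```

A sub-pixel crop such as `srcX = 0.5` is reported as the integer-source-rect conjunct, which is the one flagged later in this file as potentially load-bearing.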
Where supportedDrmFormats() comes from
(src/backends/drm/drm_plane.cpp:84-142)
DrmPlane::updateProperties() reads the kernel's IN_FORMATS blob via drmModeFormatModifierBlobIterNext. Each (fmt, mod) pair the kernel advertises is added to m_supportedFormats. EglGbmLayer returns this dictionary verbatim from supportedDrmFormats().
So whatever the kernel's rockchip-drm driver advertises in IN_FORMATS for a given DRM plane is what KWin treats as scanout-eligible for that layer. There is no further KWin-side filter on top.
Hardware: rockchip-drm plane format/modifier table on ohm
Raw evidence: ohm_drm_info_2026-05-02.json (inlined), ohm_modetest_planes_2026-05-02.txt (inlined).
DRM driver: rockchip-drm (RockChip Soc DRM, 1.0.0). Active connector: DSI-1 (the PineTab2's internal panel), 800×1280 mode preferred. Two CRTCs visible (51 inactive, 52 active, fb=60).
Three planes (full set on the SoC):
| Plane ID | DRM type | possible_crtcs | KWin OutputLayerType | NV12 LINEAR? | Notes |
|---|---|---|---|---|---|
| 33 | Primary | 0x01 (CRTC 51 only — inactive) | Primary | No (RGB-only LINEAR) | This CRTC has no display attached |
| 39 | Primary | 0x02 (CRTC 52 only — active) | Primary | YES (LINEAR(0x0)) | Currently driving fb=60 (the GL framebuffer) |
| 45 | Overlay | 0x03 (either CRTC) | GenericLayer | No | XR30/XB30/XR/XB/AR/AB 24/RG/BG 24/16, YU08/YU10/YUYV/Y210, all in AFBC modifiers (ARM_BLOCK_SIZE=16×16 family). No NV12 in any modifier. |
CRTC index mapping is positional: CRTC ID 51 = index 0 (bit 0), CRTC ID 52 = index 1 (bit 1). Plane 39 is restricted to CRTC 52; Plane 45 can drive either CRTC. KWin's planeToLayerType (drm_layer.cpp:34-49) maps DRM Primary→OutputLayerType::Primary and DRM Overlay→OutputLayerType::GenericLayer directly.
So on the active CRTC 52, the OutputLayer set KWin sees is:
- 1 × OutputLayerType::Primary from Plane 39: supports NV12 LINEAR.
- 1 × OutputLayerType::GenericLayer from Plane 45: does not support NV12 in any modifier.
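The positional possible_crtcs decoding used above can be checked with a few lines of bit arithmetic. This is a sketch only: `PlaneInfo`, `planeCanDriveCrtcIndex`, `planesForCrtcIndex` and `ohmPlanes` are hypothetical helpers, and the plane data is copied from the table rather than read from the kernel.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record holding the two table columns needed here.
struct PlaneInfo {
    uint32_t id;
    uint32_t possibleCrtcs; // bit i set => plane can drive the CRTC at index i
};

bool planeCanDriveCrtcIndex(uint32_t possibleCrtcs, unsigned crtcIndex)
{
    assert(crtcIndex < 32);
    return ((possibleCrtcs >> crtcIndex) & 1u) != 0;
}

// Returns the IDs of all planes usable on the CRTC at the given index.
std::vector<uint32_t> planesForCrtcIndex(const std::vector<PlaneInfo> &planes, unsigned crtcIndex)
{
    std::vector<uint32_t> ids;
    for (const PlaneInfo &p : planes) {
        if (planeCanDriveCrtcIndex(p.possibleCrtcs, crtcIndex)) {
            ids.push_back(p.id);
        }
    }
    return ids;
}

// Plane data transcribed from the table above (CRTC 51 = index 0, CRTC 52 = index 1).
std::vector<PlaneInfo> ohmPlanes()
{
    return {{33, 0x01}, {39, 0x02}, {45, 0x03}};
}
```

For index 1 (CRTC 52) this yields planes 39 and 45, the same two-layer set listed above.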
Why the answer is NO — the failing conjunct, named
For Brave's windowed parent + wp_subsurface case:
Path A is rejected at the scene-walk stage
addCandidates (workspacescene.cpp:197-261) walks the Brave window top-to-bottom. The walk would produce two candidates: the wp_subsurface (video) — added first because it has higher z than its parent — and the parent surface (chrome UI). With maxCount=1, WorkspaceScene::scanoutCandidates calls addCandidates with maxCount + 1 = 2, so two candidates are gathered before the inner size check rejects.
After the walk, workspacescene.cpp:306-308 checks ret.size() == maxCount + 1 && !checkForBlackBackground(ret.back()). The back of the list is the parent surface (Brave UI). It is not a 1×1 single-pixel SHM/single-pixel buffer. Therefore checkForBlackBackground returns false, and the function returns {}. Path A returns empty for windowed Brave by construction.
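The size-and-background check described here can be modelled in isolation. The sketch below assumes only the condition quoted from workspacescene.cpp:306-308; `Candidate` and `candidatesAccepted` are hypothetical names, and the boolean field stands in for the result of checkForBlackBackground.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical candidate record; isSinglePixelBlack stands in for the
// result of checkForBlackBackground on that item.
struct Candidate {
    std::string name;
    bool isSinglePixelBlack;
};

// Mirrors the quoted condition: reject when the walk gathered maxCount + 1
// candidates and the back element is not a black-background buffer.
bool candidatesAccepted(const std::vector<Candidate> &ret, std::size_t maxCount)
{
    assert(maxCount >= 1);
    if (ret.size() == maxCount + 1 && !ret.back().isSinglePixelBlack) {
        return false; // scanoutCandidates would return {}
    }
    return true;
}
```

With `maxCount = 1`, the windowed Brave list (video subsurface, then the parent UI surface) is rejected, while a lone candidate or a candidate backed by a single-pixel black buffer is accepted.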
The “black background” idiom is from 8473b90a20 (Xaver Hugl, 2025-09-03, “compositor: move the 'black background' check to workspacescene”) which moved the check from compositor.cpp into the scene. The check exists for fullscreen-on-black-window patterns (some games / video players render a 1×1 black parent window with their actual content as a child surface, to bypass compositor work) — Brave does not use that pattern.
Path B is rejected at the format/modifier intersection
The wp_subsurface (video) clears every findOverlayCandidates filter at 30 fps with NV12 LINEAR dmabufs. The candidate makes it to prepareDirectScanout for a non-Primary OutputLayer. On CRTC 52, the only non-Primary OutputLayer is Plane 45 (GenericLayer). Plane 45 advertises no NV12 modifier in its IN_FORMATS blob.
Therefore compositor.cpp:404-409:
    const auto formats = ... layer->supportedDrmFormats();
    if (auto it = formats.find(attrs->format); it == formats.end() || !it->contains(attrs->modifier)) {
        layer->setScanoutCandidate(candidate);
        candidate->setScanoutHint(layer->scanoutDevice(), formats);
        return false;
    }
returns false: formats.find(DRM_FORMAT_NV12) == formats.end() for Plane 45 → reject. Path B is rejected at the format/modifier intersection. No further conjunct in EglGbmLayer::importScanoutBuffer is even evaluated.
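The same lookup can be reproduced against abbreviated copies of the two plane tables. This is a sketch under stated assumptions: `FormatTable`, `planeAccepts`, `plane39Formats` and `plane45Formats` are hypothetical, the tables list only one or two entries each, and the AFBC modifier value is an illustrative placeholder rather than a value read from this device.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

// Little-endian fourcc packing, as used by DRM format codes.
constexpr uint32_t fourcc(char a, char b, char c, char d)
{
    return uint32_t(a) | (uint32_t(b) << 8) | (uint32_t(c) << 16) | (uint32_t(d) << 24);
}

constexpr uint32_t kFormatNV12 = fourcc('N', 'V', '1', '2');
constexpr uint32_t kFormatXR24 = fourcc('X', 'R', '2', '4');
constexpr uint64_t kModLinear = 0;                         // DRM_FORMAT_MOD_LINEAR
constexpr uint64_t kModAfbcPlaceholder = (0x08ULL << 56) | 1; // illustrative AFBC value
static_assert(kFormatNV12 == 0x3231564E, "NV12 fourcc");

// format -> set of modifiers, the shape of an IN_FORMATS-derived table.
using FormatTable = std::map<uint32_t, std::set<uint64_t>>;

bool planeAccepts(const FormatTable &formats, uint32_t format, uint64_t modifier)
{
    auto it = formats.find(format);
    return it != formats.end() && it->second.count(modifier) > 0;
}

// Abbreviated, hand-transcribed tables; the real blobs contain more entries.
FormatTable plane39Formats()
{
    return {{kFormatNV12, {kModLinear}}, {kFormatXR24, {kModLinear}}};
}

FormatTable plane45Formats()
{
    return {{kFormatXR24, {kModAfbcPlaceholder}}};
}
```

Plane 39 accepts NV12 LINEAR and Plane 45 does not, which is the asymmetry that disqualifies Path B on this hardware.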
The Primary plane (39) does support NV12 LINEAR, but it is in use as the GL framebuffer surface (OutputLayerType::Primary is the single-framebuffer canonical role in KWin). KWin v6.6.4 does not have logic to swap plane roles dynamically (move the GL framebuffer to Plane 45 in AFBC, free Plane 39 for video). That would be a substantial KWin design change.
Implications for Phase 4 design space
Per worklist Phase 1 contract — “yes/no plus a paragraph naming the specific conjunct(s) that pass or fail”:
- Architect's hypothesis (a): cache the dmabuf-to-GL-texture import. Remains the primary candidate. Aligns with the A2 trajectory data (drops in three bursts during ~30 s warmup, then steady 0/sec). Phase 2 source-read prioritises: src/wayland/linuxdmabufv1clientbuffer.cpp, src/scene/surfaceitem_wayland.cpp, src/scene/itemrenderer_opengl.cpp.
- Architect's hypothesis (b): promote single-color-plane subsurface video to direct scanout via wp_drm_lease_v1. STRUCTURALLY UNREACHABLE on this hardware/driver combo. Two reasons:
  - wp_drm_lease_v1 is the wrong protocol for this case: it leases an entire connector/output to a client (typical consumer: VR HMDs). It is not the mechanism for putting a subsurface on its own DRM plane within a managed Plasma session. The protocol-correct mechanism would be KWin's existing multi-overlay path (Path B above), which fails at the format/modifier intersection on rockchip-drm.
  - Even if KWin gained dynamic plane-role swapping, the rockchip-drm overlay plane (Plane 45) does not advertise NV12 in any modifier. That is a kernel-side gap, out of this campaign's scope per README.
Bug-report shape — narrowed
Per worklist: “Either answer also informs the bug-report shape … Different messages, different audiences.”
- The “missed scanout-promotion” framing has two possible audiences, neither well-suited to this campaign:
  - KWin maintainers: would require a design-discussion patch (dynamic plane-role swap). Out of scope.
  - rockchip-drm maintainers: kernel patch to expose NV12 on the overlay plane (if the VOP2 hardware actually supports it on the overlay window; needs separate VOP2 archaeology). Out of scope per README.
- The “your subsurface composite is slow” framing (Phase 4 hypothesis a — import-caching) has one audience: KWin maintainers, with a measurement-grounded patch description. This is the Phase 5 bug-report shape this campaign should pursue.
Caveats and Phase 1-step-3 deferral
- This answer rests on the assumption that “windowed Brave with chrome UI visible” is the in-scope case (per Phase 1 lock). Fullscreen Brave (F11) would change Path A's outcome — the parent surface might fill the viewport with no second candidate, in which case Path A could potentially succeed if Plane 39 is available. Not measured in Phase 0 / not in scope.
- Phase 1 step 3 from worklist (“does KWin require the subsurface to be the only damageable region of a given plane”) is partially answered by the conjunct list above (rounded-corner clipping is in findOverlayCandidates, opacity == 1.0 is required, not entirely-covered is required). The deeper question, whether Brave's parent renders content behind the video subsurface region, is deferred to Phase 2 source-read per the proposal accepted on 2026-05-02. It does not change Phase 1's answer because Path B is already disqualified at the format intersection upstream of any geometric considerations.
- The integer-source-rect requirement (drm_egl_layer.cpp:117) is noted but not load-bearing for Phase 1's answer. It would be a load-bearing conjunct if Path B reached importScanoutBuffer, which it does not on this hardware. Banked for Phase 2.
Architectural map
End-to-end per-frame path for a Brave wp_subsurface presenting an NV12 LINEAR dmabuf at 1080p30 on the windowed parent (Brave UI), Plasma 6.6.4 + EglBackend + panfrost.
Buffer ingress (one-time per wl_buffer)
- Brave commits wp_linux_buffer_params.create_immed per V4L2 capture buffer slot. KWin instantiates a LinuxDmaBufV1ClientBuffer (src/wayland/linuxdmabufv1clientbuffer.cpp:216), which IS-A GraphicsBuffer storing DmaBufAttributes (fd, offset, pitch per plane, modifier, format, width, height), see :354-358.
- RenderBackend::testImportBuffer(clientBuffer) (:217) validates that the rendering backend can import this dmabuf at all. Successful → wl_buffer.created sent, the wl_buffer enters Brave's reuse pool. No GL texture exists yet.
- Lifetime: until Brave destroys the wl_buffer (:339). For Chromium with a V4L2 capture pool of N buffers, this means N stable LinuxDmaBufV1ClientBuffer / GraphicsBuffer * identities, reused round-robin across frames.
Per-attach (every wl_surface.commit with a buffer attached)
- SurfaceInterface::bufferChanged fires → SurfaceItemWayland::handleBufferChanged (src/scene/surfaceitem_wayland.cpp:103-106) → setBuffer(…). KWin 6.6.4 negotiates wp_linux_drm_syncobj_v1 explicit sync with Chromium-class clients, so the buffer commit goes through Transaction::watchSyncObj (src/wayland/transaction.cpp:244-249), NOT watchDmaBuf. (Source: kwin-fourier MR body, zero DMA_BUF_IOCTL_EXPORT_SYNC_FILE over 60 s playback.) Fourier patches only touch watchDmaBuf, so they are confirmed irrelevant.
- Damage region updated; per-frame compositor wakes.
Per-frame texture-update (the hot path)
SurfaceItem::preprocess() (src/scene/surfaceitem.cpp:187-208): if m_texture exists and size matches, call m_texture→update(damageRegion); else call m_texture→create(). Brave's video buffers are all the same size (1920×1080), so after the first frame the same OpenGLSurfaceTexture is re-used — update() is the steady-state path, not create().
OpenGLSurfaceTexture::updateDmabufTexture(buffer) (src/scene/surfaceitem.cpp:472-501):
    // for NV12 (s_drmConversions match in src/utils/drm_format_helper.h:35-44):
    // plane 0 = R8 full-size (Y), plane 1 = GR88 half-size (CbCr)
    for (uint plane = 0; plane < itConv->plane.count(); ++plane) {
        ...
        m_texture.planes[plane]->bind(); // glBindTexture, cheap
        glEGLImageTargetTexture2DOES(GL_TEXTURE_2D, m_backend->importBufferAsImage(buffer, plane, currentPlane.format, size)); // *** suspect ***
        m_texture.planes[plane]->unbind();
    }
EglBackend::importBufferAsImage(buffer, plane, format, size) (src/opengl/eglbackend.cpp:279-299):
    std::pair key(buffer, plane);
    auto it = m_importedBuffers.constFind(key);
    if (Q_LIKELY(it != m_importedBuffers.constEnd())) {
        return *it; // CACHE HIT after warmup
    }
    // MISS: create fresh EGLImage from DmaBufAttributes via EglDisplay::importDmaBufAsImage
    // (the genuinely expensive cold-path EGL_LINUX_DMA_BUF_EXT import)
The cache key is (GraphicsBuffer *, plane_index). After warmup, every frame for every plane is a cache hit on EGLImage but glEGLImageTargetTexture2DOES is still called every frame to re-target the persistent GLTexture to whatever EGLImage corresponds to the current frame's buffer.
For NV12 video, this re-target happens 2× per frame (Y plane R8 + CbCr plane GR88). For RGBA single-surface (e.g. cage's fullscreen output, or any non-YUV client), it happens 1×.
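The split between cold imports and per-frame re-targets can be illustrated with a small counting model. It is a sketch, not KWin code: `ImportCacheModel`, `ImportStats` and `simulatePlayback` are hypothetical, an integer buffer ID stands in for the GraphicsBuffer pointer, and the pool size and frame count in the usage note are assumed values.

```cpp
#include <cassert>
#include <map>
#include <utility>

struct ImportStats {
    int coldImports = 0; // cache misses, i.e. fresh EGLImage imports
    int retargets = 0;   // glEGLImageTargetTexture2DOES-style calls
};

// Counting model of a cache keyed by (buffer, plane).
class ImportCacheModel
{
public:
    // Returns a fake image handle, importing only on first use of the key.
    int importBufferAsImage(int bufferId, int plane)
    {
        const std::pair<int, int> key(bufferId, plane);
        auto it = m_images.find(key);
        if (it != m_images.end()) {
            return it->second; // cache hit
        }
        ++m_stats.coldImports;
        const int handle = m_nextHandle++;
        m_images.emplace(key, handle);
        return handle;
    }

    // One texture update: every plane is re-targeted, hit or miss.
    void updateDmabufTexture(int bufferId, int planeCount)
    {
        for (int plane = 0; plane < planeCount; ++plane) {
            importBufferAsImage(bufferId, plane);
            ++m_stats.retargets;
        }
    }

    ImportStats stats() const { return m_stats; }

private:
    std::map<std::pair<int, int>, int> m_images;
    ImportStats m_stats;
    int m_nextHandle = 1;
};

// Cycles a pool of buffers round-robin for the given number of frames.
ImportStats simulatePlayback(int poolSize, int frames, int planeCount)
{
    assert(poolSize > 0);
    ImportCacheModel model;
    for (int frame = 0; frame < frames; ++frame) {
        model.updateDmabufTexture(frame % poolSize, planeCount);
    }
    return model.stats();
}
```

For an assumed pool of 4 buffers over 300 frames, `simulatePlayback(4, 300, 2)` counts 8 cold imports and 600 re-targets for NV12, against 4 and 300 for a single-plane format. The model says nothing about what each re-target costs on a given driver.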
Per-frame rendering (also hot but well-understood)
ItemRendererOpenGL::renderItem (src/scene/itemrenderer_opengl.cpp:334): standard quad render. vbo→bindArrays, glActiveTexture(GL_TEXTURE0+i) + texture[i]→bind() per plane (:473-474), draw, unbind. No suspicious work; the texture binds here are plain glBindTexture, not glEGLImageTargetTexture2DOES.
File-level findings — Phase 2 reading list
- [x] src/wayland/linuxdmabufv1clientbuffer.cpp: protocol-only. Creates LinuxDmaBufV1ClientBuffer once per wl_buffer (:165, :216). renderBackend→testImportBuffer validates at creation time (:166, :217). NO GL texture import here; that lives in the EglBackend / surface-texture code. Lifetime tied to wl_buffer; for Chromium's V4L2 capture pool this is N stable buffers reused round-robin.
- [x] src/scene/surfaceitem_wayland.cpp: slot-driven. handleBufferChanged (:103) just stores the new buffer pointer and emits damage. Subsurface tree built in handleChildSubSurfacesChanged (:142) once per surface tree change, not per frame. No per-frame slow path here. The actual texture work is in surfaceitem.cpp::OpenGLSurfaceTexture::updateDmabufTexture.
- [x] src/scene/itemrenderer_opengl.cpp: renderItem (:334-499) does standard quad rendering with glBindTexture on the already-imported GLTextures. Not the cost site. Per-plane texture binds at :473-474 are plain glBindTexture. No special-case for parent+subsurface vs single-surface.
- [x] src/scene/composite.cpp + scene scheduling: promotion predicate. Done in Phase 1.
- [x] src/backends/drm/: DRM atomic plane-probe, format/modifier acceptance per output. Done in Phase 1 to the depth needed for the leading question.
Phase 2 hypothesis — concrete, file:line
SUPERSEDED 2026-05-02 by Phase 3 measurement. Verdict: H1 rejected at N=1 across C0/C1/C2 + exploratory C3 stock-Brave. The symbol's self-time peaks at 0.15 % vs the 20 % threshold. See phase3_findings. The hypothesis text below is preserved as the “what we believed before measurement” record per the discipline rule (feedback_phase_discipline.md); do not edit it. New working hypothesis H1' (per-frame Wayland-protocol dispatch dominates) emerges from Phase 3 and gets its own Phase 2-prime source-read.
Per-frame cost in KWin's parent + wp_subsurface composite path on Mali-G52 panfrost lives in OpenGLSurfaceTexture::updateDmabufTexture (src/scene/surfaceitem.cpp:472-501), specifically the glEGLImageTargetTexture2DOES call at line 490 (multi-plane YUV) / line 496 (single-plane).
Mechanism
The EglBackend::m_importedBuffers cache (src/opengl/eglbackend.h:116, src/opengl/eglbackend.cpp:279-321) does cache the EGLImage per (GraphicsBuffer *, plane), so after warmup the EGLImage lookup is a hash hit. But the EGLImage and GLTexture are decoupled: a single per-surface m_texture.planes[plane] GLTexture is re-targeted to a different EGLImage every frame via glEGLImageTargetTexture2DOES, because OpenGLSurfaceTexture::updateDmabufTexture is unconditional — it calls the function on every update(), regardless of whether the underlying EGLImage actually changed.
For Brave's V4L2 capture pool of N buffers cycling round-robin:
- Warmup (≤ ~30 s on ohm): each new GraphicsBuffer * miss in the cache → fresh EglDisplay::importDmaBufAsImage (kernel-side dmabuf-to-EGLImage import). 6-9 expensive first-imports correlate with the three drop-bursts in the A2 trajectory (ohm_gl_fix/phase3_remeasure_2026-05-02/A2_brave_drops_findings.md) at t ≈ 0–5 / 10–12 / 20–30 s. Pool grows in response to scene complexity (B-frame depth, motion-vector load), explaining the discrete bursts.
- Steady state (post-warmup): every frame pays 2× glEGLImageTargetTexture2DOES for NV12 (Y plane + CbCr plane). On panfrost, this rebind has non-trivial cost even when the (texture, image) pair is unchanged or when the new image was previously bound to the same texture in a recent frame.
cage's parity here is informative: cage composites a single fullscreen RGBA surface, so its OpenGLSurfaceTexture::updateDmabufTexture runs the single-plane branch (:493-498) — 1× rebind per frame. KWin direct on the same workload runs the multi-plane branch — 2× rebind per frame, plus the warmup re-import bursts that cage does not exhibit (cage's surfaces are GL-rendered framebuffers KWin imports once, not a V4L2-cycled video pool).
Predicted Phase 4 patch shape
Cache the GLTexture alongside the EGLImage in EglBackend::m_importedBuffers, keyed by (GraphicsBuffer *, plane). On updateDmabufTexture, look up the per-(buffer, plane) GLTexture and re-target the per-surface m_texture.planes[plane] to that GLTexture's name (or, more invasively, swap the GLTexture pointer entirely). Eliminates per-frame glEGLImageTargetTexture2DOES after warmup. Concrete edit site: src/scene/surfaceitem.cpp:472-501 (updateDmabufTexture) plus the cache extension in src/opengl/eglbackend.cpp:279-321 and its header.
The rebind pattern was introduced in the original NV12 Wayland dmabuf support (commit 3568829216 opengl: Add support for NV12 on Wayland dmabufs, pre-2024); no commit message documents a defensive rationale. The merge commit 8c37d1926a (BasicEGLSurfaceTextureWayland → OpenGLSurfaceTexture) and refactor cf8ee656a9 (move surface-texture business to scene/) preserved the pattern unchanged. Phase 5 patch description must explain the mechanism (glEGLImageTargetTexture2DOES is idempotent for an unchanged image binding, and the buffer's contents change doesn't require a re-bind because the texture is already backed by the dmabuf via the EGLImage) — not just cite the symptom.
Phase 3 measurement that validates this hypothesis
perf record -p $(pgrep kwin_wayland) during 70 s playback under the locked phase1_lock protocol. Expectation: hot symbols include glEGLImageTargetTexture2DOES (or its panfrost-side implementation, e.g. panvk_* / panfrost_resource_setup) at a non-trivial fraction of kwin_wayland self-time during steady-state. If hot, hypothesis confirmed at the file:line. If cold (i.e. glEGLImageTargetTexture2DOES doesn't show up), the cost is elsewhere and Phase 2 must re-open. Cage perf record under the same workload provides the differential — cage should NOT show the same symbol at the same heat.
Caveats
- The hypothesis is consistent with A2 trajectory (warmup bursts + steady-state CPU) but is not yet validated by hot-path data. Phase 3 perf record is the highest-value remaining measurement (per architect, see phase0_findings).
- The rebind cost on panfrost specifically is asserted from first principles. If the panfrost implementation of glEGLImageTargetTexture2DOES happens to short-circuit identical re-binds, the steady-state cost is elsewhere (possibly in m_texture.planes[plane]→bind() GL state churn or further upstream in damage tracking).
- OpenGLSurfaceContents (the m_texture field, surfaceitem.h:174) is per-OpenGLSurfaceTexture, not per-buffer. Caching the GLTexture per-buffer requires a different ownership model. This is a Phase 4 design decision, not a Phase 2 fact.
