kwin_overlay_subsurface:phase2_source_findings

Phase 2 — KWin source archaeology

This file is the synthesised source-read for the kwin_overlay_subsurface campaign. It opens with the Phase 1 leading question answer (per worklist), follows with the architectural diagram of the per-frame path, and ends with file-level findings ordered by the priority list in worklist.

Discipline guard: no patches are written before this file is committed. Re-scoping is documented honestly with the deferral target named.

Phase 1 — leading question answer

Status: LOCKED 2026-05-02. This section is the Phase 1 deliverable. Future Phase 2 / Phase 3 measurements may modify the file-level findings below this section but must not silently move the Phase 1 answer; if reality after measurement contradicts this, the contradiction is documented as a new section, not by editing this one.

Question (from worklist):

On what condition does KWin promote a wp_linux_dmabuf_v1 surface to direct scanout versus falling back to GPU composite, and does the hantro NV12 DRM_FORMAT_MOD_LINEAR output satisfy those conditions on this DRM driver (rockchip-drm on RK3568, Mesa 26.0.5)?

Short answer — NO

Neither of KWin v6.6.4's two scanout-promotion paths can place the hantro NV12 LINEAR buffer on a DRM plane on this hardware in the windowed Brave case, for two distinct structural reasons. The Phase 4 design space narrows to the import-caching hypothesis only. This aligns with — does not contradict — the architect's prior from the A2 trajectory hint.

KWin's two scanout-promotion paths

KWin v6.6.4 has two distinct paths that could in principle promote a wp_linux_dmabuf_v1 surface to scanout. Both pass through the same per-layer feasibility check (OutputLayer::importScanoutBuffer) but differ in how the candidate is chosen.

Path A — single-plane direct scanout

Entry: Compositor::prepareDirectScanout (src/compositor.cpp:379, prepareDirectScanout(view, logicalOutput, backendOutput, frame)).

  1. view→scanoutCandidates(1) (compositor.cpp:385). On WorkspaceScene this calls WorkspaceScene::scanoutCandidates (src/scene/workspacescene.cpp:281), which walks m_containerItem→sortedChildItems() top-to-bottom via the recursive helper addCandidates (workspacescene.cpp:197).
  2. addCandidates produces up to maxCount + 1 = 2 candidate SurfaceItems. The walk requires every traversed item to have opacity == 1.0 and no effects (workspacescene.cpp:202-203).
  3. After the walk, the back element of the candidate list must be either absent OR a 1×1 single-pixel black-background buffer per checkForBlackBackground (workspacescene.cpp:263-279). Otherwise scanoutCandidates returns {} (workspacescene.cpp:306-308).
  4. If a candidate is returned, prepareDirectScanout requires it to be a SurfaceItemWayland with a valid surface, valid buffer, and dmabuf attributes (compositor.cpp:391-402). It then takes the format/modifier intersection: layer→supportedDrmFormats() for non-tearing or supportedAsyncDrmFormats() for tearing must contain attrs→format AND attrs→modifier (compositor.cpp:404-409).
  5. If the intersection passes, layer→importScanoutBuffer(buffer, frame) is invoked (compositor.cpp:416). For the DRM backend that resolves to EglGbmLayer::importScanoutBuffer (src/backends/drm/drm_egl_layer.cpp:81).
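
The tail check in steps 2 and 3 can be modelled in isolation. The following is a minimal sketch, not KWin code: Candidate and tailCheckPasses are hypothetical names, and the isSinglePixelBlack flag stands in for whatever checkForBlackBackground actually inspects.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a SurfaceItem; it carries only the one property
// the tail check reads.
struct Candidate {
    bool isSinglePixelBlack; // models checkForBlackBackground() returning true
};

// Models the post-walk rule: the walk gathers at most maxCount + 1 items, and
// a list of exactly maxCount + 1 items is acceptable only when its last
// (bottom-most) element is the black-background buffer.
bool tailCheckPasses(const std::vector<Candidate> &walked, std::size_t maxCount)
{
    const bool hasExtraItem = walked.size() == maxCount + 1;
    // back() is safe here: hasExtraItem implies at least one element.
    return !(hasExtraItem && !walked.back().isSinglePixelBlack);
}
```

With maxCount = 1, a list of two ordinary surfaces (a video subsurface above an application window) fails the check, while a list whose second entry is a black-background buffer passes.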

Path B — multi-overlay plane scanout

Entry: Compositor::repaint per-layer loop (compositor.cpp:680+), through Scene::overlayCandidates and assignOverlays. Candidate filtering for an overlay is findOverlayCandidates (workspacescene.cpp:335), which accepts an item iff:

  • it is a SurfaceItem,
  • non-empty rect,
  • frameTimeEstimation < 50 ms (≥ 20 fps source frame cadence),
  • surfaceItem→buffer()→dmabufAttributes() is non-null,
  • opacity == 1.0 (TODO comment on line 381 says item-opacity is not yet handled),
  • not entirely covered by other opaque windows,
  • if the region is occupied or rounded-corner-clipped, the item must be fully opaque to qualify as an underlay (workspacescene.cpp:386-389).

Per-layer feasibility is then the same EglGbmLayer::importScanoutBuffer gate, with the layer being a non-Primary OutputLayerType.
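
The filter list above reduces to a single predicate. The sketch below is an illustration under assumptions: OverlayProbe is a hypothetical flattened view of the item properties, not a KWin type, and the order of the checks is not claimed to match findOverlayCandidates.

```cpp
#include <chrono>

// Hypothetical flattened view of what the overlay filter reads from an item.
struct OverlayProbe {
    bool isSurfaceItem;
    bool rectEmpty;
    std::chrono::milliseconds frameTimeEstimation;
    bool hasDmabufAttributes;
    double opacity;
    bool fullyCoveredByOpaqueWindows;
    bool regionOccupiedOrCornerClipped;
    bool fullyOpaque;
};

// Returns true when every filter in the list accepts the item.
bool passesOverlayFilter(const OverlayProbe &p)
{
    using namespace std::chrono_literals;
    if (!p.isSurfaceItem || p.rectEmpty) {
        return false;
    }
    if (p.frameTimeEstimation >= 50ms) {
        return false; // source cadence too slow
    }
    if (!p.hasDmabufAttributes) {
        return false;
    }
    if (p.opacity != 1.0) {
        return false; // item opacity is not handled yet
    }
    if (p.fullyCoveredByOpaqueWindows) {
        return false;
    }
    if (p.regionOccupiedOrCornerClipped && !p.fullyOpaque) {
        return false; // could only be an underlay, which requires full opacity
    }
    return true;
}
```

A 30 fps video subsurface has a frame time of roughly 33 ms, so it clears the cadence filter with margin.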

EglGbmLayer::importScanoutBuffer — the conjunct list

(src/backends/drm/drm_egl_layer.cpp:81-127, top-to-bottom)

  1. Env var KWIN_DRM_NO_DIRECT_SCANOUT unset.
  2. Layer is Primary OR drmOutput()→shouldDisableNonPrimaryPlanes() is false. The latter is only true in PresentationMode::Async or AdaptiveAsync (drm_output.cpp:112-117) — i.e. tearing modes — so this conjunct is inactive for Brave's default 30 fps playback.
  3. gpu()→needsModeset() is false (no pending modeset).
  4. drmOutput()→needsShadowBuffer() is false (no display-side shadow buffer required, e.g. for HDR/colour conversion).
  5. gpu() == gpu()→platform()→primaryGpu() (no cross-GPU scanout).
  6. Color pipeline is identity OR colorPowerTradeoff != PreferAccuracy.
  7. sourceRect() == sourceRect().toRect() — the source rect must be integer-aligned. Sub-pixel cropping → reject. Comment cites the kernel doc note that “devices that don't support subpixel plane coordinates can ignore the fractional part.”
  8. If offloadTransform() is non-identity, the plane must support that transform via m_plane→supportsTransformation.
  9. gpu()→importBuffer(buffer, FileDescriptor{}) returns non-null (gbm import succeeds for this dmabuf format/modifier/stride).

The doc comment on OutputLayer::importScanoutBuffer (src/core/outputlayer.h:101-106) notes that even when this returns true, “a presentation request on the output must however be used afterwards to find out if it's actually successful” — i.e. the final filter is the kernel's DRM atomic-test.
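
For bookkeeping, the nine conjuncts can be written as a table that reports the first failure. This is a hedged sketch: ScanoutGateInputs and firstFailingConjunct are hypothetical names, the fields are pre-computed booleans rather than live KWin queries, and unlike the real function the sketch evaluates every conjunct eagerly.

```cpp
#include <array>
#include <cstddef>

// Hypothetical snapshot of the state the gate consults. The defaults describe
// a state in which every conjunct passes.
struct ScanoutGateInputs {
    bool noDirectScanoutEnvSet = false;   // KWIN_DRM_NO_DIRECT_SCANOUT is set
    bool layerIsPrimary = false;
    bool disableNonPrimaryPlanes = false; // true only in tearing modes
    bool needsModeset = false;
    bool needsShadowBuffer = false;
    bool onPrimaryGpu = true;
    bool colorPipelineIdentity = true;
    bool prefersAccuracy = false;
    bool sourceRectIntegerAligned = true;
    bool transformIdentity = true;
    bool planeSupportsTransform = false;
    bool gbmImportSucceeds = true;
};

// Returns the 1-based number of the first failing conjunct, using the
// numbering of the list above, or 0 when all nine pass.
int firstFailingConjunct(const ScanoutGateInputs &in)
{
    const std::array<bool, 9> conjuncts = {
        !in.noDirectScanoutEnvSet,
        in.layerIsPrimary || !in.disableNonPrimaryPlanes,
        !in.needsModeset,
        !in.needsShadowBuffer,
        in.onPrimaryGpu,
        in.colorPipelineIdentity || !in.prefersAccuracy,
        in.sourceRectIntegerAligned,
        in.transformIdentity || in.planeSupportsTransform,
        in.gbmImportSucceeds,
    };
    for (std::size_t i = 0; i < conjuncts.size(); ++i) {
        if (!conjuncts[i]) {
            return static_cast<int>(i) + 1;
        }
    }
    return 0;
}
```

A return value of 0 only means the KWin-side gate passed; the kernel's atomic test can still reject the commit afterwards.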

Where supportedDrmFormats() comes from

(src/backends/drm/drm_plane.cpp:84-142)

DrmPlane::updateProperties() reads the kernel's IN_FORMATS blob via drmModeFormatModifierBlobIterNext. Each (fmt, mod) pair the kernel advertises is added to m_supportedFormats. EglGbmLayer returns this dictionary verbatim from supportedDrmFormats().

So whatever the kernel's rockchip-drm driver advertises in IN_FORMATS for a given DRM plane is what KWin treats as scanout-eligible for that layer. There is no further KWin-side filter on top.

Hardware: rockchip-drm plane format/modifier table on ohm

Raw evidence: ohm_drm_info_2026-05-02.json (inlined), ohm_modetest_planes_2026-05-02.txt (inlined).

DRM driver: rockchip-drm (RockChip Soc DRM, 1.0.0). Active connector: DSI-1 (the PineTab2's internal panel), 800×1280 mode preferred. Two CRTCs visible (51 inactive, 52 active, fb=60).

Three planes (full set on the SoC):

Plane ID | DRM type | possible_crtcs | KWin OutputLayerType | NV12 LINEAR? | Notes
33 | Primary | 0x01 (CRTC 51 only, inactive) | Primary | No (RGB-only LINEAR) | This CRTC has no display attached
39 | Primary | 0x02 (CRTC 52 only, active) | Primary | YES (LINEAR(0x0)) | Currently driving fb=60 (the GL framebuffer)
45 | Overlay | 0x03 (either CRTC) | GenericLayer | No | XR30/XB30/XR/XB/AR/AB 24/RG/BG 24/16, YU08/YU10/YUYV/Y210, all in AFBC modifiers (ARM_BLOCK_SIZE=16×16 family). No NV12 in any modifier.

CRTC index mapping is positional: CRTC ID 51 = index 0 (bit 0), CRTC ID 52 = index 1 (bit 1). Plane 39 is restricted to CRTC 52; Plane 45 can drive either CRTC. KWin's planeToLayerType (drm_layer.cpp:34-49) maps DRM Primary→OutputLayerType::Primary and DRM Overlay→OutputLayerType::GenericLayer directly.
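
The positional decoding of possible_crtcs can be checked mechanically. A minimal sketch follows; crtcsForPlane is a hypothetical helper, and the CRTC ID list is passed in the order the kernel enumerates it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Decodes a plane's possible_crtcs bitmask. Bit i refers to the CRTC at
// position i of the enumerated CRTC list, not to the CRTC object ID.
std::vector<uint32_t> crtcsForPlane(uint32_t possibleCrtcs,
                                    const std::vector<uint32_t> &crtcIds)
{
    std::vector<uint32_t> reachable;
    for (std::size_t i = 0; i < crtcIds.size() && i < 32; ++i) {
        if (possibleCrtcs & (1u << i)) {
            reachable.push_back(crtcIds[i]);
        }
    }
    return reachable;
}
```

With the CRTC list {51, 52}, the mask 0x02 decodes to CRTC 52 only and 0x03 decodes to both, matching the table above.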

So on the active CRTC 52, the OutputLayer set KWin sees is:

  • 1 × OutputLayerType::Primary from Plane 39 — supports NV12 LINEAR.
  • 1 × OutputLayerType::GenericLayer from Plane 45 — does not support NV12 in any modifier.

Why the answer is NO — the failing conjunct, named

For Brave's windowed parent + wp_subsurface case:

Path A is rejected at the scene-walk stage

addCandidates (workspacescene.cpp:197-261) walks the Brave window top-to-bottom. The walk would produce two candidates: the wp_subsurface (video) — added first because it has higher z than its parent — and the parent surface (chrome UI). With maxCount=1, WorkspaceScene::scanoutCandidates calls addCandidates with maxCount + 1 = 2, so two candidates are gathered before the inner size check rejects.

After the walk, workspacescene.cpp:306-308 checks ret.size() == maxCount + 1 && !checkForBlackBackground(ret.back()). The back of the list is the parent surface (Brave UI). It is not a 1×1 single-pixel SHM/single-pixel buffer. Therefore checkForBlackBackground returns false, and the function returns {}. Path A returns empty for windowed Brave by construction.

The “black background” idiom is from 8473b90a20 (Xaver Hugl, 2025-09-03, “compositor: move the 'black background' check to workspacescene”) which moved the check from compositor.cpp into the scene. The check exists for fullscreen-on-black-window patterns (some games / video players render a 1×1 black parent window with their actual content as a child surface, to bypass compositor work) — Brave does not use that pattern.

Path B is rejected at the format/modifier intersection

The wp_subsurface (video) clears every findOverlayCandidates filter at 30 fps with NV12 LINEAR dmabufs. The candidate makes it to prepareDirectScanout for a non-Primary OutputLayer. On CRTC 52, the only non-Primary OutputLayer is Plane 45 (GenericLayer). Plane 45 advertises no NV12 modifier in its IN_FORMATS blob.

Therefore compositor.cpp:404-409:

const auto formats = ... layer->supportedDrmFormats();
if (auto it = formats.find(attrs->format); it == formats.end() || !it->contains(attrs->modifier)) {
    layer->setScanoutCandidate(candidate);
    candidate->setScanoutHint(layer->scanoutDevice(), formats);
    return false;
}

returns false: formats.find(DRM_FORMAT_NV12) == formats.end() for Plane 45 → reject. Path B is rejected at the format/modifier intersection. No further conjunct in EglGbmLayer::importScanoutBuffer is even evaluated.
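
The same lookup can be reproduced outside KWin with abridged plane tables. This is a sketch under assumptions: FormatTable uses standard containers instead of KWin's Qt types, the tables hold one entry per plane instead of the full IN_FORMATS contents, and kModAfbcExample is an illustrative AFBC-family value that was not copied from the drm_info dump.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// format -> list of modifiers, mirroring the shape of the (fmt, mod) table.
using FormatTable = std::unordered_map<uint32_t, std::vector<uint64_t>>;

// Same packing as the fourcc_code macro in drm_fourcc.h.
constexpr uint32_t fourcc(char a, char b, char c, char d)
{
    return static_cast<uint32_t>(static_cast<unsigned char>(a))
        | (static_cast<uint32_t>(static_cast<unsigned char>(b)) << 8)
        | (static_cast<uint32_t>(static_cast<unsigned char>(c)) << 16)
        | (static_cast<uint32_t>(static_cast<unsigned char>(d)) << 24);
}

constexpr uint32_t kFormatNV12 = fourcc('N', 'V', '1', '2');
constexpr uint32_t kFormatXR24 = fourcc('X', 'R', '2', '4');
constexpr uint64_t kModLinear = 0;                           // DRM_FORMAT_MOD_LINEAR
constexpr uint64_t kModAfbcExample = 0x0800000000000001ULL;  // illustrative only

// True when the plane advertises this exact (format, modifier) pair.
bool planeAccepts(const FormatTable &formats, uint32_t format, uint64_t modifier)
{
    const auto it = formats.find(format);
    if (it == formats.end()) {
        return false; // format absent: the rejection taken for Plane 45
    }
    const auto &mods = it->second;
    return std::find(mods.begin(), mods.end(), modifier) != mods.end();
}

// Abridged: only the entry relevant to the NV12 question.
FormatTable plane39Abridged()
{
    FormatTable table;
    table[kFormatNV12] = {kModLinear};
    return table;
}

// Abridged: one RGB entry in an AFBC modifier, no NV12 entry at all.
FormatTable plane45Abridged()
{
    FormatTable table;
    table[kFormatXR24] = {kModAfbcExample};
    return table;
}
```

planeAccepts(plane45Abridged(), kFormatNV12, kModLinear) is false because the format key is missing, so the modifier list is never consulted.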

The Primary plane (39) does support NV12 LINEAR, but it is in use as the GL framebuffer surface (OutputLayerType::Primary is the single-framebuffer canonical role in KWin). KWin v6.6.4 does not have logic to swap plane roles dynamically (move the GL framebuffer to Plane 45 in AFBC, free Plane 39 for video). That would be a substantial KWin design change.

Implications for Phase 4 design space

Per worklist Phase 1 contract — “yes/no plus a paragraph naming the specific conjunct(s) that pass or fail”:

  • Architect's hypothesis (a) — cache the dmabuf-to-GL-texture import. Remains the primary candidate. Aligns with the A2 trajectory data (drops in three bursts during ~30 s warmup, then steady 0/sec). Phase 2 source-read prioritises: src/wayland/linuxdmabufv1clientbuffer.cpp, src/scene/surfaceitem_wayland.cpp, src/scene/itemrenderer_opengl.cpp.
  • Architect's hypothesis (b) — promote single-color-plane subsurface video to direct scanout via wp_drm_lease_v1. STRUCTURALLY UNREACHABLE on this hardware/driver combo. Two reasons:
    1. wp_drm_lease_v1 is the wrong protocol for this case — it leases an entire connector/output to a client (typical consumer: VR HMDs). It is not the mechanism for putting a subsurface on its own DRM plane within a managed Plasma session. The protocol-correct mechanism would be KWin's existing multi-overlay path (Path B above), which fails at the format/modifier intersection on rockchip-drm.
    2. Even if KWin gained dynamic plane-role swapping, the rockchip-drm overlay plane (Plane 45) does not advertise NV12 in any modifier — that is a kernel-side gap, out of this campaign's scope per README.

Bug-report shape — narrowed

Per worklist: “Either answer also informs the bug-report shape … Different messages, different audiences.”

  • The “missed scanout-promotion” framing has two possible audiences, neither well-suited to this campaign:
    • KWin maintainers: would require a design-discussion patch (dynamic plane-role swap). Out of scope.
    • rockchip-drm maintainers: kernel patch to expose NV12 on the overlay plane (if the VOP2 hardware actually supports it on the overlay window — needs separate VOP2 archaeology). Out of scope per README.
  • The “your subsurface composite is slow” framing (Phase 4 hypothesis a — import-caching) has one audience: KWin maintainers, with a measurement-grounded patch description. This is the Phase 5 bug-report shape this campaign should pursue.

Caveats and Phase 1-step-3 deferral

  • This answer rests on the assumption that “windowed Brave with chrome UI visible” is the in-scope case (per Phase 1 lock). Fullscreen Brave (F11) would change Path A's outcome — the parent surface might fill the viewport with no second candidate, in which case Path A could potentially succeed if Plane 39 is available. Not measured in Phase 0 / not in scope.
  • Phase 1 step 3 from worklist (“does KWin require the subsurface to be the only damageable region of a given plane”) is partially answered by the conjunct list above (rounded-corner clipping is in findOverlayCandidates, opacity == 1.0 is required, not entirely-covered is required). The deeper question — whether Brave's parent renders content behind the video subsurface region — is deferred to Phase 2 source-read per the proposal accepted on 2026-05-02. It does not change Phase 1's answer because Path B is already disqualified at the format intersection upstream of any geometric considerations.
  • The integer-source-rect requirement (drm_egl_layer.cpp:117) is noted but not load-bearing for Phase 1's answer. It would be a load-bearing conjunct if Path B reached importScanoutBuffer, which it does not on this hardware. Banked for Phase 2.

Architectural map

End-to-end per-frame path for a Brave wp_subsurface presenting an NV12 LINEAR dmabuf at 1080p30 on the windowed parent (Brave UI), Plasma 6.6.4 + EglBackend + panfrost.

Buffer ingress (one-time per wl_buffer)

  • Brave commits wp_linux_buffer_params.create_immed per V4L2 capture buffer slot. KWin instantiates a LinuxDmaBufV1ClientBuffer (src/wayland/linuxdmabufv1clientbuffer.cpp:216), which IS-A GraphicsBuffer storing DmaBufAttributes (fd, offset, pitch per plane, modifier, format, width, height) — :354-358.
  • RenderBackend::testImportBuffer(clientBuffer) (:217) validates that the rendering backend can import this dmabuf at all. Successful → wl_buffer.created sent, the wl_buffer enters Brave's reuse pool. No GL texture exists yet.
  • Lifetime: until Brave destroys the wl_buffer (:339). For Chromium with a V4L2 capture pool of N buffers, this means N stable LinuxDmaBufV1ClientBuffer / GraphicsBuffer* identities, reused round-robin across frames.

Per-attach (every wl_surface.commit with a buffer attached)

  • SurfaceInterface::bufferChanged fires → SurfaceItemWayland::handleBufferChanged (src/scene/surfaceitem_wayland.cpp:103-106) → setBuffer(…). KWin 6.6.4 negotiates wp_linux_drm_syncobj_v1 explicit sync with Chromium-class clients, so the buffer commit goes through Transaction::watchSyncObj (src/wayland/transaction.cpp:244-249), NOT watchDmaBuf. (Source: kwin-fourier MR body, zero DMA_BUF_IOCTL_EXPORT_SYNC_FILE over 60 s playback.) Fourier patches only touch watchDmaBuf — confirmed irrelevant.
  • Damage region updated; per-frame compositor wakes.

Per-frame texture-update (the hot path)

SurfaceItem::preprocess() (src/scene/surfaceitem.cpp:187-208): if m_texture exists and size matches, call m_texture→update(damageRegion); else call m_texture→create(). Brave's video buffers are all the same size (1920×1080), so after the first frame the same OpenGLSurfaceTexture is re-used — update() is the steady-state path, not create().

OpenGLSurfaceTexture::updateDmabufTexture(buffer) (src/scene/surfaceitem.cpp:472-501):

// for NV12 (s_drmConversions match in src/utils/drm_format_helper.h:35-44):
//   plane 0 = R8 full-size (Y), plane 1 = GR88 half-size (CbCr)
for (uint plane = 0; plane < itConv->plane.count(); ++plane) {
    ...
    m_texture.planes[plane]->bind();                                           // glBindTexture, cheap
    glEGLImageTargetTexture2DOES(GL_TEXTURE_2D,
        m_backend->importBufferAsImage(buffer, plane, currentPlane.format, size));  // *** suspect ***
    m_texture.planes[plane]->unbind();
}

EglBackend::importBufferAsImage(buffer, plane, format, size) (src/opengl/eglbackend.cpp:279-299):

std::pair key(buffer, plane);
auto it = m_importedBuffers.constFind(key);
if (Q_LIKELY(it != m_importedBuffers.constEnd())) {
    return *it;                                                                // CACHE HIT after warmup
}
// MISS: create fresh EGLImage from DmaBufAttributes via EglDisplay::importDmaBufAsImage
// — the genuinely expensive cold-path EGL_LINUX_DMA_BUF_EXT import

The cache key is (GraphicsBuffer *, plane_index). After warmup, every plane of every frame is an EGLImage cache hit, but glEGLImageTargetTexture2DOES is still called each frame to re-target the persistent GLTexture to the EGLImage that belongs to the current frame's buffer.

For NV12 video, this re-target happens 2× per frame (Y plane R8 + CbCr plane GR88). For RGBA single-surface (e.g. cage's fullscreen output, or any non-YUV client), it happens 1×.
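
The split between cold imports, cache hits and re-targets can be counted with a small simulation. It is a sketch under assumptions: the pool has a fixed size and cycles strictly round-robin, buffers are identified by index instead of pointer, and ImportStats and simulatePlayback are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>

struct ImportStats {
    std::size_t coldImports = 0; // cache misses: a fresh EGLImage is created
    std::size_t cacheHits = 0;   // EGLImage found under (buffer, plane)
    std::size_t retargets = 0;   // glEGLImageTargetTexture2DOES calls
};

// Simulates `frames` frames drawn from a pool of `poolSize` buffers, each
// with `planes` planes, against a cache keyed by (buffer, plane).
ImportStats simulatePlayback(std::size_t poolSize, std::size_t planes, std::size_t frames)
{
    assert(poolSize > 0);
    std::map<std::pair<std::size_t, std::size_t>, bool> importedBuffers;
    ImportStats stats;
    for (std::size_t frame = 0; frame < frames; ++frame) {
        const std::size_t buffer = frame % poolSize; // round-robin reuse
        for (std::size_t plane = 0; plane < planes; ++plane) {
            const auto key = std::make_pair(buffer, plane);
            if (importedBuffers.find(key) != importedBuffers.end()) {
                ++stats.cacheHits;
            } else {
                importedBuffers[key] = true;
                ++stats.coldImports;
            }
            ++stats.retargets; // issued on hit and on miss alike
        }
    }
    return stats;
}
```

For a made-up pool of 6 NV12 buffers over 900 frames, this gives 12 cold imports, 1788 cache hits and 1800 re-targets: the import cost is bounded by the pool size, while the re-target count grows with every frame.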

Per-frame rendering (also hot but well-understood)

ItemRendererOpenGL::renderItem (src/scene/itemrenderer_opengl.cpp:334): standard quad render. vbo→bindArrays, glActiveTexture(GL_TEXTURE0+i) + texture[i]→bind() per plane (:473-474), draw, unbind. No suspicious work; the texture binds here are plain glBindTexture, not glEGLImageTargetTexture2DOES.


File-level findings — Phase 2 reading list

  • [x] src/wayland/linuxdmabufv1clientbuffer.cpp — protocol-only. Creates LinuxDmaBufV1ClientBuffer once per wl_buffer (:165, :216). renderBackend→testImportBuffer validates at creation time (:166, :217). NO GL texture import here; that lives in the EglBackend / surface-texture code. Lifetime tied to wl_buffer; for Chromium's V4L2 capture pool this is N stable buffers reused round-robin.
  • [x] src/scene/surfaceitem_wayland.cpp — slot-driven: handleBufferChanged (:103) just stores the new buffer pointer and emits damage. Subsurface tree built in handleChildSubSurfacesChanged (:142) once per surface tree change — not per frame. No per-frame slow path here. The actual texture work is in surfaceitem.cpp::OpenGLSurfaceTexture::updateDmabufTexture.
  • [x] src/scene/itemrenderer_opengl.cpp: renderItem (:334-499) does standard quad rendering with glBindTexture on the already-imported GLTextures. Not the cost site. Per-plane texture binds at :473-474 are plain glBindTexture. No special-case for parent+subsurface vs single-surface.
  • [x] src/scene/composite.cpp + scene scheduling — promotion predicate. Done in Phase 1.
  • [x] src/backends/drm/ — DRM atomic plane-probe, format/modifier acceptance per output. Done in Phase 1 to the depth needed for the leading question.

Phase 2 hypothesis — concrete, file:line

SUPERSEDED 2026-05-02 by Phase 3 measurement. Verdict: H1 rejected at N=1 across C0/C1/C2 + exploratory C3 stock-Brave. The symbol's self-time peaks at 0.15 % vs the 20 % threshold. See phase3_findings. The hypothesis text below is preserved as the “what we believed before measurement” record per the discipline rule (feedback_phase_discipline.md); do not edit it. New working hypothesis H1' (per-frame Wayland-protocol dispatch dominates) emerges from Phase 3 and gets its own Phase 2-prime source-read.

Per-frame cost in KWin's parent + wp_subsurface composite path on Mali-G52 panfrost lives in OpenGLSurfaceTexture::updateDmabufTexture (src/scene/surfaceitem.cpp:472-501), specifically the glEGLImageTargetTexture2DOES call at line 490 (multi-plane YUV) / line 496 (single-plane).

Mechanism

The EglBackend::m_importedBuffers cache (src/opengl/eglbackend.h:116, src/opengl/eglbackend.cpp:279-321) does cache the EGLImage per (GraphicsBuffer *, plane), so after warmup the EGLImage lookup is a hash hit. But the EGLImage and GLTexture are decoupled: a single per-surface m_texture.planes[plane] GLTexture is re-targeted to a different EGLImage every frame via glEGLImageTargetTexture2DOES, because OpenGLSurfaceTexture::updateDmabufTexture is unconditional — it calls the function on every update(), regardless of whether the underlying EGLImage actually changed.

For Brave's V4L2 capture pool of N buffers cycling round-robin:

  • Warmup (≤ ~30 s on ohm): each new GraphicsBuffer * miss in the cache → fresh EglDisplay::importDmaBufAsImage (kernel-side dmabuf-to-EGLImage import). 6-9 expensive first-imports correlate with the three drop-bursts in the A2 trajectory (ohm_gl_fix/phase3_remeasure_2026-05-02/A2_brave_drops_findings.md) at t ≈ 0–5 / 10–12 / 20–30 s. Pool grows in response to scene complexity (B-frame depth, motion-vector load), explaining the discrete bursts.
  • Steady state (post-warmup): every frame pays 2× glEGLImageTargetTexture2DOES for NV12 (Y plane + CbCr plane). On panfrost, this rebind has non-trivial cost even when the (texture, image) pair is unchanged or when the new image was previously bound to the same texture in a recent frame.

cage's parity here is informative: cage composites a single fullscreen RGBA surface, so its OpenGLSurfaceTexture::updateDmabufTexture runs the single-plane branch (:493-498) — 1× rebind per frame. KWin direct on the same workload runs the multi-plane branch — 2× rebind per frame, plus the warmup re-import bursts that cage does not exhibit (cage's surfaces are GL-rendered framebuffers KWin imports once, not a V4L2-cycled video pool).

Predicted Phase 4 patch shape

Cache the GLTexture alongside the EGLImage in EglBackend::m_importedBuffers, keyed by (GraphicsBuffer *, plane). On updateDmabufTexture, look up the per-(buffer, plane) GLTexture and re-target the per-surface m_texture.planes[plane] to that GLTexture's name (or, more invasively, swap the GLTexture pointer entirely). Eliminates per-frame glEGLImageTargetTexture2DOES after warmup. Concrete edit site: src/scene/surfaceitem.cpp:472-501 (updateDmabufTexture) plus the cache extension in src/opengl/eglbackend.cpp:279-321 and its header.

The rebind pattern was introduced in the original NV12 Wayland dmabuf support (commit 3568829216 opengl: Add support for NV12 on Wayland dmabufs, pre-2024); no commit message documents a defensive rationale. The merge commit 8c37d1926a (BasicEGLSurfaceTextureWayland → OpenGLSurfaceTexture) and refactor cf8ee656a9 (move surface-texture business to scene/) preserved the pattern unchanged. Phase 5 patch description must explain the mechanism (glEGLImageTargetTexture2DOES is idempotent for an unchanged image binding, and the buffer's contents change doesn't require a re-bind because the texture is already backed by the dmabuf via the EGLImage) — not just cite the symptom.

Phase 3 measurement that validates this hypothesis

perf record -p $(pgrep kwin_wayland) during 70 s playback under the locked phase1_lock protocol. Expectation: hot symbols include glEGLImageTargetTexture2DOES (or its panfrost-side implementation, e.g. panvk_* / panfrost_resource_setup) at a non-trivial fraction of kwin_wayland self-time during steady-state. If hot, hypothesis confirmed at the file:line. If cold (i.e. glEGLImageTargetTexture2DOES doesn't show up), the cost is elsewhere and Phase 2 must re-open. Cage perf record under the same workload provides the differential — cage should NOT show the same symbol at the same heat.

Caveats

  • The hypothesis is consistent with A2 trajectory (warmup bursts + steady-state CPU) but is not yet validated by hot-path data. Phase 3 perf record is the highest-value remaining measurement (per architect, see phase0_findings).
  • The rebind cost on panfrost specifically is asserted from first principles. If the panfrost implementation of glEGLImageTargetTexture2DOES happens to short-circuit identical re-binds, the steady-state cost is elsewhere (possibly in m_texture.planes[plane]→bind() GL state churn or further upstream in damage tracking).
  • OpenGLSurfaceContents (the m_texture field, surfaceitem.h:174) is per-OpenGLSurfaceTexture, not per-buffer. Caching the GLTexture per-buffer requires a different ownership model. This is a Phase 4 design decision, not a Phase 2 fact.
kwin_overlay_subsurface/phase2_source_findings.txt · Last modified: by markus_fritsche