rk3588_ddr

Differences

This shows you the differences between two versions of the page.

rk3588_ddr [2026/04/15 15:32] – Append 2026-04-15 late evening: bootrom emu + gitea SSH + PineBuds PR #122 markus_fritsche
rk3588_ddr [2026/04/20 21:58] (current) – MVP2 session 2026-04-20 recap (matching-decomp blitz, 33/118, 15/16 poll sites) markus_fritsche
Line 438:
  
 //Last updated: 2026-04-15 late evening//
 +
 +
 +===== 2026-04-15 late night: counted-loop v3 is cold-boot-broken =====
 +
 +**Project-defining finding.** The counted-loop trampoline approach (any counter
 +value we tested — 16 Ki, 1 Mi, 16 Mi iterations) **cannot** replace the stock
 +blob's infinite polls for the PHY firmware handshake that fires during F1
 +frequency retrain on the GenBook RK3588. The evening-long bisection turned out
 +to be a warm-PHY illusion; the cold-boot control experiments at the end of the
 +night showed that only the stock blob cold-boots reliably.
 +
 +==== The warm-PHY trap ====
 +
 +Every "known-good" baseline earlier in the evening (''stock'', ''early'',
 +''midlate'', ''0-8'' through ''0-11'') was tested via ''rkdeveloptool rd''
 +which only fires after ''rkdeveloptool db <spl-loader>'' has pushed its own
 +SPL into SRAM and **run a full DDR init at 2400 MHz** (visible in UART captures
 +as the ''DDR ff1a08bde6 typ 25/03/13-15:39:39'' banner preceding our patched
 +blob's ''typ 25/04/21'' banner). PHY comes up warm and our patched TPL inherits
 +a trained PHY state where the F1-retrain code path that kills cold boots either
 +never fires or side-steps site 1.
 +
 +Cold-tested ''early'' at end of night via RK806 power-off + physical power-on:
 +**same** ''0:1!2:3:4:'' marker chain as the full-patched variant. Stock cold-tested:
 +full boot. Bisection was theatre.
 +
 +==== Diagnostic chain ====
 +
 +The UART trace rewriter ended up being the tool that cracked it. Each trampoline
 +emits a unique byte to UART2 (''0xFEB50000'') on entry (''0''–''9'', ''A''–''F''),
 +a colon on successful exit, an exclamation mark on timeout exit. Typical cold-boot hang tail:
 +
 +  change to F1: 534MHz
 +  0:1!2:3:4:
 +  (hang)
 +
 +Reads: site 0 succeeded, site 1 **timed out**, sites 2-4 succeeded, then hang
 +somewhere after site 4 (no trampoline → no marker).
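The marker chain can be decoded mechanically. The following is a minimal sketch, not part of the project tooling: ''parse_marker_chain'' and the ''no-exit'' outcome label are hypothetical names, and the format is assumed to be exactly what is described above (one hex digit per site entry, then '':'' or ''!'' on exit).

```python
import re

# One uppercase hex digit marks trampoline entry; ':' = success exit,
# '!' = timeout exit. A digit with no exit marker means the site was
# entered but never left (label "no-exit" is our own, hypothetical name).
_MARKER = re.compile(r"([0-9A-F])([:!]?)")
_OUTCOME = {":": "ok", "!": "timeout", "": "no-exit"}


def parse_marker_chain(chain: str):
    """Turn a trace such as '0:1!2:3:4:' into (site, outcome) pairs."""
    return [(int(site, 16), _OUTCOME[exit_char])
            for site, exit_char in _MARKER.findall(chain)]


def timed_out_sites(chain: str):
    """Return the site numbers whose poll ended in a timeout."""
    return [site for site, outcome in parse_marker_chain(chain)
            if outcome == "timeout"]


if __name__ == "__main__":
    for site, outcome in parse_marker_chain("0:1!2:3:4:"):
        print(f"site {site}: {outcome}")
```

Because the chain ends on a success marker for site 4, the parser alone cannot say where the hang is; it only confirms that no instrumented site was still polling.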
 +
 +**Site 1 context** (blob offset ''0x7b9c''):
 +
 +  7b90: orr  w0, w0, #0x2
 +  7b94: str  w0, [x26, #2948]    ; trigger write (+0xB84)
 +  7b98: mov  w0, #0x36000000     ; mask = bits 25,26,28,29
 +  7b9c: ldr  w1, [x26, #2952]    ; body[0]: poll (+0xB88)
 +  7ba0: bics wzr, w0, w1          ; body[1]: flags
 +  7ba4: b.ne 0x7b9c              ; (stock: retry forever)
 +
 +Register ''+0xB88'' is TRM-undocumented — Synopsys DWC PHY PUB space, not
 +uMCTL2 territory. Stock infinite-poll always succeeds cold; our 1 Mi and 16 Mi
 +counted loops both time out every time.
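The exit condition of the site-1 poll can be modelled in a few lines. This is a sketch of the logic only: ''poll_done'' and ''counted_poll'' are hypothetical names, the register read is a plain Python callable, and nothing here reproduces the cycle timing that the next section suspects is the real variable.

```python
MASK = 0x36000000  # from the mov at 0x7b98: bits 25, 26, 28, 29


def poll_done(reg_value: int, mask: int = MASK) -> bool:
    """Model 'bics wzr, w0, w1' followed by 'b.ne'.

    BICS computes mask AND NOT value and sets the flags. The loop
    repeats while the result is non-zero, so the poll is done only
    when every bit of the mask is set in the register value.
    """
    return ((mask & ~reg_value) & 0xFFFFFFFF) == 0


def counted_poll(read_reg, limit: int) -> bool:
    """Counted variant: give up after 'limit' reads (True = success)."""
    for _ in range(limit):
        if poll_done(read_reg()):
            return True
    return False


if __name__ == "__main__":
    print([bit for bit in range(32) if (MASK >> bit) & 1])
    values = iter([0x00000000, 0x02000000, 0x36000000])
    print(counted_poll(lambda: next(values), limit=16))
```

The stock loop corresponds to ''counted_poll'' with an unbounded limit; the patched trampolines correspond to a finite one.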
 +
 +==== Likely root cause ====
 +
 +The PHY firmware state machine is sensitive to either the polling cadence or
 +the CPU-cycle count before the first LDR. Our trampoline adds a 3-instruction
 +UART-marker prolog + 1-instruction counter init ≈ 10 cycles of extra latency
 +before the first read. Stock has zero extra cycles between the ''b'' from the
 +caller and the ''ldr'' at ''0x7b9c''. If PHY firmware advances state only when
 +reads arrive inside a specific window, our prolog pushes the first read outside
 +that window and the handshake silently aborts — no subsequent polling recovers.
 +
 +Not proven (tonight didn't have time to build a non-trace counter-bump variant
 +and cold-test it to isolate UART-marker latency from counter-logic latency),
 +but the evidence pattern fits: stock works, trace-enabled variants fail, counter
 +size doesn't matter past ~5 ms. Time isn't the independent variable — cycle
 +count before first read is.
 +
 +==== Shipping deliverables ====
 +
 +Tonight we built working tooling. A working **fix** is future work.
 +
 +  * ''spi_check.py'' — RKNS wrapper + TPL entry-signature gate, run before every flash.
 +  * ''blob_emu.py'' — position-correct Unicorn emulator at ''0xFF001000'' with MSR/MRS
 +    skip and DW_apb_uart shim; prints byte-identical DDR banner to real hardware.
 +  * ''patch_timeouts_v3.py'' — now has ''--counter'' (any MOVZ-encodable imm32) and
 +    ''--uart-trace'' (per-site entry + success/timeout exit markers).
 +  * ''build_genbook_sites.sh'' — wrapper for arbitrary site-list subsets.
 +  * Meitner ''~/ampere/captures/'' — full UART archive of tonight's 11+ variants.
 +
 +==== Methodology lessons (captured in memory) ====
 +
 +  * **Warm-PHY illusion** — ''feedback_warm_phy_illusion.md''. Always cold-test the
 +    baseline BEFORE bisecting any hardware init bug. ''rkdeveloptool rd'' is a
 +    warm boot, not a cold boot — results are not portable to cold deployment.
 +  * Linear bisection that looks "too clean for a hard problem" is a signal of a
 +    methodology leak. Tonight's neat ''0-8 boots, 0-9 boots, 0-10 boots, 0-11
 +    boots, 0-12 hangs'' progression was entirely warm-PHY artifact.
 +
 +==== Next session direction ====
 +
 +Re-scope from "patch all 16 timeout-less polls" to "patch only the safe subset":
 +
 +  - Read each site's body + base register, cross-reference with TRM §2.4 +
 +    Synopsys DWC uMCTL2 docs.
 +  - Classify: PHY-firmware handshake polls (DO NOT patch) vs SGRF/firewall/PLL/
 +    BUS_GRF polls (safe to patch).
 +  - Rebuild subset patcher, cold-test. If a non-empty safe subset exists, ship that.
 +
 +Stock stays on the GenBook SPI as the reliable cold-boot variant. Board is
 +currently running Arch from stock.
 +
 +//Last updated: 2026-04-15 23:51//
 +
 +
 +===== 2026-04-16: MVP1 delivered — root cause was reseating =====
 +
 +The original "board craps out at 2400 MHz" problem that started the entire
 +MegabitChip project was **hardware, not firmware**. Two physical interventions
 +resolved it:
 +
 +  - **Reseating the CM5 module** in its PCIe-style socket → restored LPDDR5
 +    signal integrity at 2400 MT/s. User confirmed: "Definitely reseating."
 +  - **Copperfield copper-shim cooling mod** → improved thermal margin at
 +    elevated temps.
 +
 +After reseating + swapping to the stock 2400 MHz DDR blob
 +(''rk3588_ddr_lp4_2112MHz_lp5_2400MHz_v1.19.bin''), the GenBook cold-boots
 +reliably at 2400 MHz, survives full kernel compiles at 84 °C avg core temp,
 +and passes ''memtester'' on 16 GB (previously failed).
 +
 +==== MVP1 shipped deliverables ====
 +
 +^ Deliverable ^ Location ^ Status ^
 +| Unicorn blob emulator | ''boltzmann:blob_emu.py'' | Byte-identical DDR banner |
 +| SPI pre-flash validator | ''boltzmann:spi_check.py'' | Wired into build scripts |
 +| UART trace rewriter | ''patch_timeouts_v3.py --uart-trace'' | Entry + exit markers |
 +| Configurable counted-loop patcher | ''patch_timeouts_v3.py --counter --sites'' | Cold-boot-broken for PHY polls |
 +| GenBook flash pipeline | ''meitner:~/ampere/'' | 90 s iteration |
 +| Ghidra LLM auto-renamer | ''oppenheimer:LLMRename.java'' | ~25% yield on fresh projects |
 +| Cold-boot methodology | ''feedback_warm_phy_illusion.md'' | Lesson captured |
 +| UART capture archive | ''meitner:~/ampere/captures/'' | 11+ variants |
 +| 2400 MHz stock GenBook SPI | ''meitner:~/ampere/u-boot-rockchip-spi-2400MHz-stock-genbook-8mb.bin'' | Cold-boot-proven |
 +
 +==== MVP2 goal ====
 +
 +Boot from **source-regenerated blob**: matching-decomp all 118 functions →
 +clang recompile → byte-identical binary → then **modify**. Currently at 1/118
 +functions matched (''train_phy_block'' at 96%+). Once source exists, the
 +community can rewrite training algorithms, expose OC knobs, and do things
 +Rockchip never intended. Question of principle.
 +
 +//Last updated: 2026-04-16 00:xx//
 +
 +====== MVP2 session 2026-04-20 — matching-decomp blitz ======
 +
 +Single session, **1/118 → 33/118 functions matching-decomped**.
 +Canonical compile line settled + poll-site coverage jumped to 15/16.
 +
 +===== Canonical compile line =====
 +
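 +<code bash>
 +# Base flags, always used (candidate.c is the per-function source;
 +# the output name candidate.o is illustrative):
 +clang -O2 -ffreestanding -mgeneral-regs-only -c candidate.c -o candidate.o
 +# Optional flags, added per function:
 +#   -fno-pic            when referencing extern data symbols
 +#   -fno-builtin        when lifting memcpy/memset
 +#   -fno-unroll-loops   for small fixed-count loops
 +</code>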
 +
 +  * **Hard required:** ''-mgeneral-regs-only''. EL3 TPL has no
 +    FPU/NEON enabled; any ''q0/q1'' vector insn would fault.
 +    Without the flag, clang's vectorizer replaces byte/word loops
 +    with 128-bit NEON ldp/stp (observed on FUN_00000ac8: 428 B of
 +    NEON code vs 112 B of scalar code in the vendor blob).
 +  * ''gcc -O2 -ffreestanding'' stays acceptable; on some small
 +    helpers (FUN_000027e0) gcc byte-matches vendor where clang
 +    picks different register allocation.
 +
 +===== Workspace =====
 +
 +All lifts live in ''boltzmann:~/projects/AMPere/benchmark/NN_<name>/''
 +with 5 files each:
 +
 +  * ''func.bin''  — raw slice from
 +    ''rkbin/bin/rk35/rk3588_ddr_lp4_1848MHz_lp5_2112MHz_v1.19.bin''
 +  * ''func.s''    — objdump -D
 +  * ''reference.c'' — annotated ground truth
 +  * ''candidate.c'' — clang-friendly source
 +  * ''GRIND_LOG.md'' — per-function summary + vendor-vs-clang deltas
 +
 +===== Poll-site coverage: 4/16 → 15/16 =====
 +
 +^ site ^ containing fn ^ benchmark dir ^ semantic role ^
 +| 0 | FUN_00007730 | 15_site0_block | PHY train interlock disable |
 +| 1 | FUN_00007730 | 14_site1_block | DFI shadow handshake (bit 1 / 4-lane ack) |
 +| 2 | FUN_00007730 | 07_site2_block | Enter Normal operating-mode |
 +| 3 | FUN_00007730 | 11_site3_block | DDRCTL_DFISTAT bits[2:1] clear |
 +| 4 | FUN_00007730 | 18_site4_block | Enter Self-refresh |
 +| 5 | FUN_00007730 | 19_site5_block | Wait selfref_type == auto |
 +| 6 | FUN_00007730 | 20_site6_block | DFI shadow handshake (bit 0 / 2-lane ack) |
 +| 7 | FUN_00007730 | 21_site7_block | Exit Self-refresh |
 +| 8 | FUN_00008b40 | 35_site8_block | Enable auto-ctrlupd + wait Normal |
 +| 9 | FUN_00009a90 | 40_site9_block | Exit SREF, 2-bit variant |
 +| 10 | FUN_00009a90 | **pending** | absolute 0xff000024 access — SRAM mirror? |
 +| 11 | FUN_0000d10c | 05_prep_freq_change | wait PHY state 1 |
 +| 12-15 | FUN_0000d328 | 04_train_phy_block | PHY training step |
 +
 +Only **site 10** remains — sits in the 9044-byte FUN_00009a90 monster,
 +uses an absolute address (not a ch_base + offset) so needs wider
 +context before extraction.
 +
 +===== Highlights — what landed this session =====
 +
 +  * **FUN_00002340** — MR-submit (TRM-verified DDRCTL_MRCTRL0/1/STAT
 +    registers). Highest-leverage dispatcher callee; every MR write
 +    in FUN_6c8c (LP4/x) and FUN_6d90 (LP5) goes through this.
 +  * **FUN_0000337c** — freq→timing LUT. LP5 thresholds 533/800/1600/
 +    2133 MHz, LP4 thresholds 400/613/1066 MHz. Returns a pointer
 +    into the blob's 0x11C78/0x11CE0 data-region timing tables.
 +  * **FUN_00006c8c** (LP4/x) + **FUN_00006d90** (LP5) — MR dispatch.
 +    6d90 compiled to **exactly 364 B** matching vendor (size-exact).
 +    Together: 16 MR writes for each per-channel, per-rank iteration.
 +  * **FUN_00000ac8** — memcpy_aligned with same-ptr shortcut and
 +    8-byte fast path.
 +  * **FUN_00000b38** — xorshift-seeded buffer hash, seed 0x47C6A7E6
 +    (DJB-variant with XOR fold).
 +  * **FUN_00000b88** — ATAGS magic validator, accepts {0,
 +    0x54410001} ∪ [0x54410050, 0x544100FF].
 +  * **FUN_00000bd8** — SRAM_BOOT range + overflow validator for
 +    ATAGS reads (SRAM window 0x1FE000..0x200000, 8 KB).
 +  * **Print chain closed:**
 +    - ''FUN_000104b8'' puts (CRLF-expanding)
 +    - ''FUN_000104f8'' recursive decimal print
 +    - ''FUN_00001194'' "channel[N] " dispatcher (tail-calls FUN_f60)
 +  * **Timer chain closed:**
 +    - ''FUN_00010a38'' udelay via CNTPCT_EL0 + CNTFRQ_EL0
 +    - ''FUN_00010a70'' system_timer_init (STIMER @ 0xFD8C8000)
 +  * **Prep/restore freq-change pair** — FUN_d10c save + FUN_d1d0
 +    restore, with matching save-area offsets 0x238/0x240/0x244/
 +    0x248/0x24C.
 +  * **FUN_0000cb44** (1088 B training-timing pack) — **full port**
 +    from Ghidra decompile. Compiles clean with -Wall -Wextra at
 +    944 B. The −13 memory-op delta vs vendor is clang's legitimate
 +    RAM-access coalescing. **Cross-validation under blob_emu.py
 +    still pending — backlog item #36.**
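The two ATAGS validators in the list above reduce to simple range checks. The sketch below restates them in Python for reference; it is not the lifted C. The function names are hypothetical, the magic ranges and SRAM window come from the list, and two details are assumptions: that the window end ''0x200000'' is exclusive, and that the overflow check is a 32-bit wrap check on ''addr + size''.

```python
SRAM_START = 0x1FE000
SRAM_END = 0x200000  # assumed exclusive; 0x200000 - 0x1FE000 = 8 KB


def atags_magic_ok(magic: int) -> bool:
    """Accept {0, 0x54410001} plus the closed range 0x54410050..0x544100FF."""
    return magic in (0, 0x54410001) or 0x54410050 <= magic <= 0x544100FF


def sram_range_ok(addr: int, size: int) -> bool:
    """Accept a read only if it lies inside the SRAM window.

    The 32-bit wrap check is an assumption about what 'overflow'
    means in the vendor routine.
    """
    end = addr + size
    if end > 0xFFFFFFFF:  # would wrap in 32-bit arithmetic
        return False
    return SRAM_START <= addr and end <= SRAM_END


if __name__ == "__main__":
    print(atags_magic_ok(0x54410050), atags_magic_ok(0x54410002))
    print(sram_range_ok(0x1FE000, 0x2000), sram_range_ok(0x1FF000, 0x2000))
```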
 +
 +===== Context-map decoded =====
 +
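 +''FUN_0000d390'' (init_ctx_pointers) writes 25 constants to the
 +ctx struct (noted as 208 bytes, although the last slot in the table
 +sits at offset 0xD0 = 208, so the struct must extend past that). The
 +constants decode as the blob's RK3588 physical-address dictionary: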
 +
 +^ ctx offset ^ value ^ role ^
 +| 0x00..0x60 (stride 0x20) | 0xF7..0xFA000000 | 4-ch DDR channel bases |
 +| 0x08..0x68 | 0xFE0C..0x0F0000 | 4-ch CRU-DDR |
 +| 0x10..0x70 | 0xFD80..0x0C000 | 4-ch DDRPHY (16K stride) |
 +| 0x18..0x78 | 0xFE00..0x06000 | 4-ch DDRCTL (8K stride) |
 +| 0x80 | 0xFD58A000 | GRF sideband |
 +| 0x88 | 0xFD7C0000 | CRU |
 +| 0x90 | 0xFD59E000 | GRF alt |
 +| 0x98 | 0xFD586000 | GRF (3rd) |
 +| 0xA0 | 0xFD587000 | GRF (4th) |
 +| 0xB8 | 0xFD8D0000 | GRF DDR |
 +| 0xC0 | 0xFD588000 | GRF (5th) |
 +| **0xC8** | **0xFD59C000** | **DMC sec_a** (prep/restore + setup sec_table) |
 +| **0xD0** | **0xFD59D000** | **DMC sec_b** |
 +
 +Confirms: the secondary-table pointers used in prep_freq_change,
 +restore_freq_change, and setup_channels point into DMC (Dynamic
 +Memory Controller) timing-register regions at 0xFD59C000/0xFD59D000
 +— Rockchip-vendor register islands separate from the uMCTL2 DDRCTL
 +block.
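The table can be expanded into a flat offset-to-address map as a sanity check on the "25 constants" count. This is a sketch under assumptions: the abbreviated ranges are read as ''0xFE0C0000..0xFE0F0000'' (CRU-DDR), ''0xFD800000..0xFD80C000'' (DDRPHY) and ''0xFE000000..0xFE006000'' (DDRCTL), which is our interpretation of the shorthand, and ''build_ctx_map'' is a hypothetical helper name.

```python
CHANNELS = 4
CHANNEL_GROUP = 0x20  # ctx bytes per channel group

# slot offset inside a channel group -> (channel-0 value, address stride).
# Full start addresses are our reading of the abbreviated table ranges.
PER_CHANNEL = {
    0x00: (0xF7000000, 0x01000000),  # DDR channel base
    0x08: (0xFE0C0000, 0x00010000),  # CRU-DDR
    0x10: (0xFD800000, 0x00004000),  # DDRPHY, 16K stride
    0x18: (0xFE000000, 0x00002000),  # DDRCTL, 8K stride
}

# Single-instance pointers, copied from the table.
GLOBAL = {
    0x80: 0xFD58A000, 0x88: 0xFD7C0000, 0x90: 0xFD59E000,
    0x98: 0xFD586000, 0xA0: 0xFD587000, 0xB8: 0xFD8D0000,
    0xC0: 0xFD588000, 0xC8: 0xFD59C000, 0xD0: 0xFD59D000,
}


def build_ctx_map():
    """Return {ctx offset: physical address} for all decoded slots."""
    ctx = {}
    for ch in range(CHANNELS):
        for slot, (first, stride) in PER_CHANNEL.items():
            ctx[ch * CHANNEL_GROUP + slot] = first + ch * stride
    ctx.update(GLOBAL)
    return ctx


if __name__ == "__main__":
    for offset, value in sorted(build_ctx_map().items()):
        print(f"ctx+0x{offset:02X} = 0x{value:08X}")
```

Sixteen per-channel slots plus nine single-instance slots gives the 25 constants mentioned above.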
 +
 +===== Strings decoded =====
 +
 +^ offset ^ content ^
 +| 0x10C36 | ''"Magic is not support\n"'' |
 +| 0x10C4C | ''"Tag is overflow\n"'' |
 +| 0x10DA4 | ''"unsupported dram type\n"'' |
 +| 0x113D1 | ''", "'' |
 +| 0x11491 | ''"MHz\n"'' |
 +| 0x114E9 | ''"channel["'' |
 +| 0x114F2 | ''"] "'' |
 +
 +===== Caveat — to validate before relying on =====
 +
 +''FUN_0000cb44'' (1088 B, per-channel training-timing pack) is a
 +full port of the Ghidra decompile. Compiles clean at 944 B. The
 +−13 memory-op delta vs vendor is clang's legitimate RAM-access
 +coalescing for a non-volatile struct — post-function RAM state
 +should match, but **hasn't been cross-validated under blob_emu.py**.
 +
 +**Backlog item #36** = "Run both vendor and candidate under
 +blob_emu.py with identical input state (ctx, ch_idx, ch_array_base)
 +and compare post-function RAM state at ctx+ch_idx*0x6C and
 +target+0x10..0x24."
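The comparison step of that backlog item could look roughly like the sketch below. It deliberately does not call ''blob_emu.py'', whose interface is not described here; it only diffs two RAM snapshots taken after each run. All function names are hypothetical. Two window sizes are assumptions: that the per-channel record at ''ctx+ch_idx*0x6C'' is 0x6C bytes long, and that ''target+0x10..0x24'' includes a 32-bit word at ''+0x24''.

```python
def windows_for(ctx: int, ch_idx: int, target: int):
    """Address windows (start, end-exclusive) named in backlog item #36."""
    record = ctx + ch_idx * 0x6C
    return [
        (record, record + 0x6C),            # assumed record size = stride
        (target + 0x10, target + 0x24 + 4),  # assumed 32-bit word at +0x24
    ]


def diff_windows(vendor: bytes, candidate: bytes, base: int, windows):
    """Compare two RAM snapshots that both start at address 'base'.

    Returns a list of (address, vendor_byte, candidate_byte) for every
    differing byte inside the given windows; an empty list means the
    post-function state matches.
    """
    mismatches = []
    for start, end in windows:
        for addr in range(start, end):
            v, c = vendor[addr - base], candidate[addr - base]
            if v != c:
                mismatches.append((addr, v, c))
    return mismatches


if __name__ == "__main__":
    base = 0x1000
    vendor = bytearray(0x400)
    candidate = bytearray(vendor)
    candidate[0x1070 - base] = 0xFF  # inject one difference
    wins = windows_for(ctx=0x1000, ch_idx=1, target=0x1200)
    for addr, v, c in diff_windows(vendor, candidate, base, wins):
        print(f"0x{addr:X}: vendor={v:#04x} candidate={c:#04x}")
```

Restricting the diff to these windows keeps stack and scratch differences from clang's different register allocation out of the result.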
 +
 +===== Backlog staged =====
 +
 +Next 10 units (tasks #37–46 in session state, of which tasks 37–43
 +are **complete as of EOD 2026-04-20**):
 +
 +  * 37 FUN_000104b8 puts ✔
 +  * 38 FUN_000104f8 print_decimal ✔
 +  * 39 FUN_00010a38 udelay ✔
 +  * 40 site-9 poll block ✔
 +  * 41 FUN_00000e5c freq_log ✔
 +  * 42 FUN_00010a70 system_timer_init ✔
 +  * 43 FUN_00002110 dram_type → timing base ✔
 +  * 44 FUN_0000bf7c (tiny thunk)
 +  * 45 FUN_000016bc
 +  * 46 FUN_00002e88
 +
 +After those, the larger targets still on the shelf:
 +
 +  * site 10 extraction (FUN_00009a90 body)
 +  * FUN_000027f8 (508 B, 7730-callee)
 +  * FUN_00005540 (2636 B monster)
 +  * FUN_00009a90 non-site-9/10 body (~6500 B remaining)
 +  * FUN_00008b40 non-site-8 body (~2100 B)
 +
 +===== Numbers =====
 +
 +^ metric ^ start of session ^ end ^
 +| matching-decomp units | 1 | 33 (7 more in-flight tonight) |
 +| poll-sites covered | 4/16 | 15/16 |
 +| benchmark directories | 5 | 36+ |
 +| cumulative bytes of vendor asm lifted | ~104 B | ~6.0 KB |
  
rk3588_ddr.1776267176.txt.gz · Last modified: by markus_fritsche