====== MVP2 session 2026-04-20 — matching-decomp blitz ======
Single session, **1/118 → 33/118 functions matching-decomped**.
Canonical compile line settled + poll-site coverage jumped to 15/16.
===== Canonical compile line =====
  clang -O2 -ffreestanding -mgeneral-regs-only \
      [-fno-pic]           # when referencing extern data symbols
      [-fno-builtin]       # when lifting memcpy/memset
      [-fno-unroll-loops]  # for small fixed-count loops
* **Hard required:** ''-mgeneral-regs-only''. EL3 TPL has no
FPU/NEON enabled, so any vector instruction touching ''q0/q1'' would fault.
Without the flag, clang's vectorizer replaces byte/word loops
with 128-bit NEON ldp/stp (observed on FUN_00000ac8: 428 B of
NEON code vs. the vendor's 112 B of scalar code).
* ''gcc -O2 -ffreestanding'' stays acceptable; on some small
helpers (FUN_000027e0) gcc byte-matches vendor where clang
picks a different register allocation.
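The kind of loop the vectorizer goes after can be sketched as follows. This is a hypothetical stand-in for a memcpy_aligned-style helper (same-pointer shortcut, 8-byte fast path, byte tail), not the project's ''candidate.c''; the name ''copy_aligned'' and the ''may_alias'' typedef are assumptions made for illustration. Built with ''-mgeneral-regs-only'', both loops have to stay in general-purpose registers, and ''-fno-builtin'' is the flag the compile line lists for memcpy-style lifts.

```c
#include <stddef.h>
#include <stdint.h>

/* may_alias (a GCC/clang extension) lets the 8-byte path read and write
 * through uint64_t without breaking strict-aliasing rules. */
typedef uint64_t __attribute__((may_alias)) u64_alias;

/* Hypothetical sketch; not the vendor routine or the lifted candidate. */
void *copy_aligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (d == s)                 /* same-pointer shortcut */
        return dst;

    /* 8-byte fast path, taken only when both pointers are 8-byte aligned. */
    if ((((uintptr_t)d | (uintptr_t)s) & 7u) == 0) {
        while (n >= 8) {
            *(u64_alias *)d = *(const u64_alias *)s;
            d += 8;
            s += 8;
            n -= 8;
        }
    }

    /* Byte tail, or the whole copy when the pointers are not aligned. */
    while (n--)
        *d++ = *s++;

    return dst;
}
```

Like memcpy, the sketch assumes the two buffers do not overlap.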
===== Workspace =====
All lifts live in ''boltzmann:~/projects/AMPere/benchmark/NN_/''
with 5 files each:
* ''func.bin'' — raw slice from
''rkbin/bin/rk35/rk3588_ddr_lp4_1848MHz_lp5_2112MHz_v1.19.bin''
* ''func.s'' — objdump -D
* ''reference.c'' — annotated ground truth
* ''candidate.c'' — clang-friendly source
* ''GRIND_LOG.md'' — per-function summary + vendor-vs-clang deltas
===== Poll-site coverage: 4/16 → 15/16 =====
^ site ^ containing fn ^ benchmark dir ^ semantic role ^
| 0 | FUN_00007730 | 15_site0_block | PHY train interlock disable |
| 1 | FUN_00007730 | 14_site1_block | DFI shadow handshake (bit 1 / 4-lane ack) |
| 2 | FUN_00007730 | 07_site2_block | Enter Normal operating-mode |
| 3 | FUN_00007730 | 11_site3_block | DDRCTL_DFISTAT bits[2:1] clear |
| 4 | FUN_00007730 | 18_site4_block | Enter Self-refresh |
| 5 | FUN_00007730 | 19_site5_block | Wait selfref_type == auto |
| 6 | FUN_00007730 | 20_site6_block | DFI shadow handshake (bit 0 / 2-lane ack) |
| 7 | FUN_00007730 | 21_site7_block | Exit Self-refresh |
| 8 | FUN_00008b40 | 35_site8_block | Enable auto-ctrlupd + wait Normal |
| 9 | FUN_00009a90 | 40_site9_block | Exit SREF, 2-bit variant |
| 10 | FUN_00009a90 | **pending** | absolute 0xff000024 access — SRAM mirror? |
| 11 | FUN_0000d10c | 05_prep_freq_change | wait PHY state 1 |
| 12-15 | FUN_0000d328 | 04_train_phy_block | PHY training step |
Only **site 10** remains. It sits in the 9044-byte FUN_00009a90 monster
and uses an absolute address (not ''ch_base + offset''), so it needs
wider context before extraction.
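Most rows in the table reduce to the same shape: read a register, mask a field, spin until it has the wanted value. A minimal sketch of that shape, assuming a hypothetical helper ''poll_field''; the spin bound is an addition made here so the loop can terminate in a host-side test, and the vendor blocks may well spin without one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Spin until (*reg & mask) == want. Returns true on a match and false
 * once max_spins reads have been made without one.
 * Hypothetical helper; the bound is not taken from the vendor code. */
bool poll_field(const volatile uint32_t *reg, uint32_t mask,
                uint32_t want, uint32_t max_spins)
{
    while (max_spins--) {
        if ((*reg & mask) == want)
            return true;
    }
    return false;
}
```

Under this sketch, site 3 ("DDRCTL_DFISTAT bits[2:1] clear") would be ''poll_field(dfistat, 0x6, 0, spins)'', where ''dfistat'' is a pointer to the register and ''spins'' is whatever budget the caller picks.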
===== Highlights — what landed this session =====
* **FUN_00002340** — MR-submit (TRM-verified DDRCTL_MRCTRL0/1/STAT
registers). Highest-leverage dispatcher callee; every MR write
in FUN_00006c8c (LP4/x) and FUN_00006d90 (LP5) goes through this.
* **FUN_0000337c** — freq→timing LUT. LP5 thresholds 533/800/1600/
2133 MHz, LP4 thresholds 400/613/1066 MHz. Returns a pointer
into the blob's 0x11C78/0x11CE0 data-region timing tables.
* **FUN_00006c8c** (LP4/x) + **FUN_00006d90** (LP5) — MR dispatch.
FUN_00006d90 compiles to **exactly 364 B**, size-exact with vendor.
Together: 16 MR writes per (channel, rank) iteration.
* **FUN_00000ac8** — memcpy_aligned with same-ptr shortcut and
8-byte fast path.
* **FUN_00000b38** — xorshift-seeded buffer hash, seed 0x47C6A7E6
(DJB-variant with XOR fold).
* **FUN_00000b88** — ATAGS magic validator, accepts {0,
0x54410001} ∪ [0x54410050, 0x544100FF].
* **FUN_00000bd8** — SRAM_BOOT range + overflow validator for
ATAGS reads (SRAM window 0x1FE000..0x200000, 8 KB).
* **Print chain closed:**
- ''FUN_000104b8'' puts (CRLF-expanding)
- ''FUN_000104f8'' recursive decimal print
- ''FUN_00001194'' "channel[N] " dispatcher (tail-calls FUN_00000f60)
* **Timer chain closed:**
- ''FUN_00010a38'' udelay via CNTPCT_EL0 + CNTFRQ_EL0
- ''FUN_00010a70'' system_timer_init (STIMER @ 0xFD8C8000)
* **Prep/restore freq-change pair** — FUN_d10c save + FUN_d1d0
restore, with matching save-area offsets 0x238/0x240/0x244/
0x248/0x24C.
* **FUN_0000cb44** (1088 B training-timing pack) — **full port**
from Ghidra decompile. Compiles clean with -Wall -Wextra at
944 B. The −13 memory-op delta vs vendor is clang's legitimate
RAM-access coalescing. **Cross-validation under blob_emu.py
still pending — backlog item #36.**
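The two ATAGS validators listed above are small enough to sketch from their descriptions alone. The following is a hedged reconstruction, not the lifted ''candidate.c'': the function names are hypothetical, and treating 0x200000 as an exclusive end and accepting zero-length reads are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Accepts {0, 0x54410001} and the closed range [0x54410050, 0x544100FF]. */
bool atags_magic_ok(uint32_t magic)
{
    if (magic == 0u || magic == 0x54410001u)
        return true;
    return magic >= 0x54410050u && magic <= 0x544100FFu;
}

#define SRAM_WIN_START 0x1FE000u
#define SRAM_WIN_END   0x200000u   /* assumed exclusive: 8 KB window */

/* True when [addr, addr + len) lies inside the SRAM window.
 * Comparing len against the remaining space avoids computing
 * addr + len, so a huge len cannot wrap around and pass. */
bool sram_range_ok(uint32_t addr, uint32_t len)
{
    if (addr < SRAM_WIN_START || addr > SRAM_WIN_END)
        return false;
    return len <= SRAM_WIN_END - addr;
}
```

Writing the bound as ''len <= end - addr'' rather than ''addr + len <= end'' is what makes the check double as the overflow validator.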
===== Context-map decoded =====
''FUN_0000d390'' (init_ctx_pointers) writes 25 constants to the
208-byte ctx struct — decoded as the blob's RK3588 physical-address
dictionary:
^ ctx offset ^ value ^ role ^
| 0x00..0x60 (stride 0x20) | 0xF7000000..0xFA000000 | 4-ch DDR channel bases |
| 0x08..0x68 (stride 0x20) | 0xFE0C0000..0xFE0F0000 | 4-ch CRU-DDR |
| 0x10..0x70 (stride 0x20) | 0xFD800000..0xFD80C000 | 4-ch DDRPHY (16K stride) |
| 0x18..0x78 (stride 0x20) | 0xFE000000..0xFE006000 | 4-ch DDRCTL (8K stride) |
| 0x80 | 0xFD58A000 | GRF sideband |
| 0x88 | 0xFD7C0000 | CRU |
| 0x90 | 0xFD59E000 | GRF alt |
| 0x98 | 0xFD586000 | GRF (3rd) |
| 0xA0 | 0xFD587000 | GRF (4th) |
| 0xB8 | 0xFD8D0000 | GRF DDR |
| 0xC0 | 0xFD588000 | GRF (5th) |
| **0xC8** | **0xFD59C000** | **DMC sec_a** (prep/restore + setup sec_table) |
| **0xD0** | **0xFD59D000** | **DMC sec_b** |
Confirms: the secondary-table pointers used in prep_freq_change,
restore_freq_change, and setup_channels point into DMC (Dynamic
Memory Controller) timing-register regions at 0xFD59C000/0xFD59D000
— Rockchip-vendor register islands separate from the uMCTL2 DDRCTL
block.
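The per-channel part of the map is regular enough to express as arithmetic. The sketch below is illustrative only: the enum and function names are hypothetical, the 8-byte slot spacing is read off the table's 0x00/0x08/0x10/0x18 offsets, and the abbreviated value ranges are read as 0xF7000000..0xFA000000, 0xFE0C0000..0xFE0F0000, 0xFD800000..0xFD80C000 and 0xFE000000..0xFE006000. Only the DDRPHY and DDRCTL strides are stated in the table; the other two are inferred by assuming four evenly spaced channels.

```c
#include <stdint.h>

/* Per-channel fields, in the order they appear inside each 0x20-byte group. */
enum ctx_ch_field {
    CTX_CH_BASE    = 0,   /* DDR channel base */
    CTX_CH_CRU_DDR = 1,
    CTX_CH_DDRPHY  = 2,
    CTX_CH_DDRCTL  = 3
};

/* Byte offset of a per-channel slot inside the ctx struct. */
static inline uint32_t ctx_ch_slot(unsigned ch, enum ctx_ch_field f)
{
    return ch * 0x20u + (uint32_t)f * 0x08u;
}

/* Value expected in that slot: first-channel address plus ch * stride.
 * Strides for CTX_CH_BASE and CTX_CH_CRU_DDR are inferred, not documented. */
static inline uint32_t ctx_ch_value(unsigned ch, enum ctx_ch_field f)
{
    static const uint32_t first[4]  = { 0xF7000000u, 0xFE0C0000u,
                                        0xFD800000u, 0xFE000000u };
    static const uint32_t stride[4] = { 0x01000000u, 0x00010000u,
                                        0x00004000u, 0x00002000u };
    return first[f] + ch * stride[f];
}
```

For example, channel 3's DDRPHY pointer would sit at ctx offset 0x70 and hold 0xFD80C000 under these assumptions.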
===== Strings decoded =====
^ offset ^ content ^
| 0x10C36 | ''"Magic is not support\n"'' |
| 0x10C4C | ''"Tag is overflow\n"'' |
| 0x10DA4 | ''"unsupported dram type\n"'' |
| 0x113D1 | ''", "'' |
| 0x11491 | ''"MHz\n"'' |
| 0x114E9 | ''"channel["'' |
| 0x114F2 | ''"] "'' |
===== Caveat — to validate before relying on =====
''FUN_0000cb44'' (1088 B, per-channel training-timing pack) is a
full port of the Ghidra decompile. Compiles clean at 944 B. The
−13 memory-op delta vs vendor is clang's legitimate RAM-access
coalescing for a non-volatile struct — post-function RAM state
should match, but **hasn't been cross-validated under blob_emu.py**.
**Backlog item #36** = "Run both vendor and candidate under
blob_emu.py with identical input state (ctx, ch_idx, ch_array_base)
and compare post-function RAM state at ctx+ch_idx*0x6C and
target+0x10..0x24."
===== Backlog staged =====
Next 10 units (tasks #37–46 in session state, of which tasks 37–43
are **complete as of EOD 2026-04-20**):
* 37 FUN_000104b8 puts ✔
* 38 FUN_000104f8 print_decimal ✔
* 39 FUN_00010a38 udelay ✔
* 40 site-9 poll block ✔
* 41 FUN_00000e5c freq_log ✔
* 42 FUN_00010a70 system_timer_init ✔
* 43 FUN_00002110 dram_type → timing base ✔
* 44 FUN_0000bf7c (tiny thunk)
* 45 FUN_000016bc
* 46 FUN_00002e88
After those, the larger targets still on the shelf:
* site 10 extraction (FUN_00009a90 body)
* FUN_000027f8 (508 B, 7730-callee)
* FUN_00005540 (2636 B monster)
* FUN_00009a90 non-site-9/10 body (~6500 B remaining)
* FUN_00008b40 non-site-8 body (~2100 B)
===== Numbers =====
^ metric ^ start of session ^ end ^
| matching-decomp units | 1 | 33 (7 more in-flight tonight) |
| poll-sites covered | 4/16 | 15/16 |
| benchmark directories | 5 | 36+ |
| cumulative bytes of vendor asm lifted | ~104 B | ~6.0 KB |