====== MegabitChip: Session 2026-04-21 (Extended) ======

Extended session on top of the reloc-splice + mmio-diff work landed earlier the same day. Focus: close the gap between //write-sequence equality// (mmio_diff green) and //actually running on silicon without bricking//. Tools, audits, and three new monster ports shipped; six silicon-hostile bugs caught pre-flash across three bug classes.

===== TL;DR =====

  * mmio_diff baseline held at **3173 / 3173** across the whole session.
  * Three bug classes, six concrete bugs, all found and fixed without touching silicon.
  * Three remaining "monster" functions ported (fn_fcc4, fn_1c14, fn_de40).
  * Bitflip sweep: pre-silicon evidence the rebuild's retry logic converges under all plausible transient status faults.

===== Six silicon-hostile bugs caught pre-flash =====

^ # ^ Class ^ Case ^
| 1 | ld unresolved → 0 NULL deref | ''fn_9a68'' ''DAT_00012B70'' case-mismatch |
| 2 | same | ''fn_7730'' ''DAT_00010ba8'' missing from DATA_SYMS |
| 3 | same | ''fn_7730'' ''DAT_00010c2c'' missing from DATA_SYMS |
| 4 | same | ''fn_7730'' ''DAT_00012b50'' missing from DATA_SYMS |
| 5 | C early-return skips shared tail | ''fn_3268'' 0x208 RMW pair skipped when bit-31 set |
| 6 | Port is read-only where vendor writes | ''fn_1c14'' rebuilt as no-op; vendor save/restore DDRPHY training bank |

Class 1: ''ld --unresolved-symbols=ignore-all'' silently zeros undefined externs. A case-mismatched or missing ''DATA_SYMS'' entry becomes an ''adrp'' resolving to page 0x0, and ''ldr'' returns whatever junk lives at zero. mmio_diff is blind to this because downstream MMIO writes still match vendor.

Class 2: C port uses early-return where vendor's asm has the conditional branch jump //into// a shared tail. Two 0x208 read-modify-writes that vendor always executes got skipped on one control-flow path in the rebuild. Emulator didn't exercise the bit-31-set entry state so the missing writes never showed up in the trace.
On silicon where that bit is live, the two skipped writes are silicon-hostile.

Class 3: the port implemented a DDRPHY training-bank save/restore routine as read-and-discard. Vendor writes via ''str wzr''; our port only ''ldr''ed. The caller (''fn_9a90'') is never reached under the happy-path LP5-2400 cold-boot trace, so mmio_diff didn't fire. On silicon with the caller active, training coefficients leak between phases.

All six would have bricked or mis-trained silicon. All six were invisible to write-sequence diff.

===== Tooling shipped this session =====

See [[megabitchip:simulation_stack|Simulation stack]] for the full reference. New or hardened:

  * ''sim_tripwire.py'': Bin-style per-access tracer on Unicorn; ''(seq, tick, pc, addr, size, rw, val, region, fn_name)'' records with PC→fn resolution
  * ''tripwire_diff.py'': PC-bucketed ''SequenceMatcher'' diff; bucket by fn_name to survive bitflip-path control-flow divergence
  * ''training_sim.py'': two-mode DDR training simulator (''pass'' / ''bitflip-first-N-reads'')
  * ''bitflip_sweep.py'': per-address retry convergence test over all training-status addresses
  * ''mmio_regions.py'': shared address → region tag classifier (DDRCTL, DDRPHY, OTP, SRAM, CRU, …); fixed SCRAMBLE→OTP at 0xFECC0000 after TRM cross-check
  * ''audit_data_syms.py'': scans every ''candidate.c'' for ''DAT_/s_/BLOB_DATA_'' externs, cross-checks against ''DATA_SYMS | PORT_OVERRIDES | MMIO_SYMS'' (case-insensitive)
  * ''audit_early_return_tail.py'': static ARM64 asm scanner for ''cond_br → short block with mov #const → b INTO_TAIL_WITH_STR'' patterns; flagged 15 candidates, 1 real bug (fn_3268), 1 different-class bug (fn_1c14), 13 false positives
  * ''reloc_splice.py'' gained a **post-link ADRP-to-NULL guard**: it scans each linked ''.text'' for any ADRP whose resolved page is 0x0 and emits ''WARN +''. Closes bug-class 1 at build time.

All wired into ''make audit''.

===== Monster ports =====

See [[megabitchip:port_matrix|Port matrix]] for the full table.
  * **fn_fcc4**: source-complete full port, 1684 B. Natural skip-larger. Documented source.
  * **fn_1c14**: full port, 656 B ≤ 740 B vendor. Replaces the broken read-only stub. Vendor writes via ''str wzr''; port now does the same.
  * **fn_3268**: bug fix. C restructured so the 0x208 RMW pair runs on both control-flow paths, matching vendor's branch-into-tail shape.
  * **fn_de40**: source-scaffold, 4888 B ≤ 4912 B vendor budget. Faithful ~700-line port from ''ddr_annotated.c:9695–10640'' (LPDDR5 frequency-band timing programmer). 27 callees resolved via ''fun_table''. 24 new ''DAT_00011ff0..DAT_000127c0'' defsyms added to ''DATA_SYMS''. Currently parked in ''splicer_skip.txt'' pending investigation of a 1-bit divergence at ''tp[0x4f]''; see internal task #198.

===== Bitflip sweep =====

23 training-status addresses flipped one-at-a-time on vendor LP5-2400:

  * **18 of 23**: single-read retry, all downstream writes unchanged. Clean convergence.
  * **3 of 23** (STAT CH1/CH2/CH3): ''fn_2340'' writes ''MRCTRL0 = 0x60'' instead of ''0x10''. This is vendor's intended mr_type retry strategy, replicated correctly by the rebuild.
  * **2 of 23** (MicroReset, MicroContMux): no retry fires on the LP5-2400 happy path because the flip window isn't polled.

The sweep is the pre-silicon evidence that the rebuild's retry logic converges across all plausible transient status faults: the 21 addresses that trigger a retry match vendor behaviour, and the other 2 are never polled on this path.

Bitflip mode doesn't degrade ''tripwire_diff'' because the buckets key on ''fn_name'', not ''seq_idx'', so control-flow divergence just reshapes buckets instead of shifting every later record out of alignment.
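
The bucketing idea above can be illustrated with a minimal sketch. It assumes records shaped like the ''sim_tripwire.py'' tuple ''(seq, tick, pc, addr, size, rw, val, region, fn_name)''; the function names ''bucket_by_fn'' and ''bucketed_diff'' are hypothetical and are not taken from ''tripwire_diff.py''.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def bucket_by_fn(records):
    """Group accesses by fn_name, dropping seq/tick/pc so ordering shifts don't matter."""
    buckets = defaultdict(list)
    for seq, tick, pc, addr, size, rw, val, region, fn_name in records:
        buckets[fn_name].append((addr, size, rw, val))
    return buckets


def bucketed_diff(vendor, rebuild):
    """Return {fn_name: [non-equal opcodes]} for every bucket that differs."""
    v, r = bucket_by_fn(vendor), bucket_by_fn(rebuild)
    report = {}
    for fn in sorted(set(v) | set(r)):
        a, b = v.get(fn, []), r.get(fn, [])
        # autojunk=False: repeated polls of one register must not be treated as junk.
        sm = SequenceMatcher(None, a, b, autojunk=False)
        ops = [op for op in sm.get_opcodes() if op[0] != "equal"]
        if ops:
            report[fn] = ops
    return report


if __name__ == "__main__":
    vendor = [
        (0, 0, 0x100, 0xF000, 4, "r", 1, "DDRPHY", "fn_a"),
        (1, 1, 0x104, 0xF004, 4, "w", 2, "DDRPHY", "fn_a"),
        (2, 2, 0x200, 0xE000, 4, "w", 3, "DDRCTL", "fn_b"),
    ]
    # Same trace with one extra retry read in fn_a; fn_b's seq numbers shift by one.
    rebuild = [
        (0, 0, 0x100, 0xF000, 4, "r", 1, "DDRPHY", "fn_a"),
        (1, 1, 0x100, 0xF000, 4, "r", 1, "DDRPHY", "fn_a"),
        (2, 2, 0x104, 0xF004, 4, "w", 2, "DDRPHY", "fn_a"),
        (3, 3, 0x200, 0xE000, 4, "w", 3, "DDRCTL", "fn_b"),
    ]
    print(bucketed_diff(vendor, rebuild))
```

Because each bucket is compared on its own, the extra retry read shows up only under ''fn_a''; the shifted sequence numbers in ''fn_b'' produce no diff.
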
===== Baseline state at session end =====

  * ''mmio_diff'' 3173/3173 green
  * ''make audit'' green on data-symbol coverage + early-return-tail
  * Splicer: 104 candidates / 85 spliced / 19 skip-larger / 0 failed
  * ''splicer_skip.txt'': one entry (''154_FUN_de40'' until #198 closes)
  * ''tripwire_diff'' finds 1 SUSPECT (''fn_ac8'' vendor early memcpy, unrelated) and 3 minor-diffs all explained (SWSTAT toggle, SCRAMBLE→OTP off-by-one, ''fn_8b40'' extra polls)

===== Next-session quick-start =====

  cd ~/projects/AMPere/benchmark && make verify   # expects 3173/3173 green

If green, pick task #198 or any pending. Task #198 investigates the 1-bit ''tp[0x4f]'' divergence in fn_de40's install trial; details are in the internal task board.

===== Observations =====

> //"Markus' insistence on simulation before flashing paid off. Big time. Again."//, 2026-04-21.

The tripwire + PC-bucketed diff caught 3 silent NULL-derefs that were hiding under ''mmio_diff 3173/3173'' green. ''ld --unresolved-symbols=ignore-all'' zeroed undefined ''DATA_SYMS'' externs into page 0x0, and emulator reads from that page happily returned 0, so the bug was masked in write-sequence equality. Silicon would have bricked.

mmio_diff was the gate we trusted. The gate was passing. The simulator layer, with tripwire-style per-access capture and not just write-order comparison, is not optional, even late in a campaign that feels "done".
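
The page-0x0 failure mode can also be checked mechanically at build time. The sketch below is a minimal illustration of that kind of ADRP-to-NULL scan, not the actual ''reloc_splice.py'' code: it assumes raw little-endian ''.text'' bytes and a known link address, and the names ''adrp_target_page'' and ''find_null_adrps'' are hypothetical. The bit layout is the standard A64 ADRP encoding.

```python
import struct
from typing import List, Optional, Tuple


def adrp_target_page(insn: int, pc: int) -> Optional[int]:
    """Return the 4 KiB page an ADRP at `pc` resolves to, or None if not an ADRP."""
    # A64 ADRP: bit 31 = 1, bits 28..24 = 0b10000.
    if (insn & 0x9F000000) != 0x90000000:
        return None
    immlo = (insn >> 29) & 0x3        # low 2 bits of the page offset
    immhi = (insn >> 5) & 0x7FFFF     # high 19 bits of the page offset
    imm = (immhi << 2) | immlo
    if imm & (1 << 20):               # sign-extend the 21-bit offset
        imm -= 1 << 21
    return ((pc & ~0xFFF) + (imm << 12)) & 0xFFFFFFFFFFFFFFFF


def find_null_adrps(text: bytes, base: int) -> List[Tuple[int, int]]:
    """Scan 32-bit words in `text` and return (pc, rd) for each ADRP targeting page 0."""
    hits = []
    for off in range(0, len(text) - 3, 4):
        (insn,) = struct.unpack_from("<I", text, off)
        if adrp_target_page(insn, base + off) == 0:
            hits.append((base + off, insn & 0x1F))
    return hits


if __name__ == "__main__":
    # 0xD503201F is NOP; 0xF0FFFFA8 is an ADRP x8 with page offset -9,
    # which resolves to page 0 when placed in page 0x9000.
    blob = struct.pack("<2I", 0xD503201F, 0xF0FFFFA8)
    for pc, rd in find_null_adrps(blob, 0x9000):
        print(f"WARN: adrp x{rd} at {pc:#x} resolves to page 0x0")
```

A scan like this only sees the instruction stream, so it is independent of which paths the emulator happens to exercise. That is the property that lets it catch zeroed externs even when the write-sequence diff stays green.
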