Fold the low 12 bits of an immediate offset into the offset field of the using instruction. That using instruction will be a load, store, or addi, each of which adds a signed 12-bit immediate as part of its operation. Splitting out the low bits allows the high bits to be generated via a single LUI instead of needing an LUI/ADDI pair.
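For reviewers who want the arithmetic spelled out, here is a minimal standalone sketch of the split described above. It is not the code in the patch: `sext`, `OffsetParts`, and `splitOffset` are hypothetical names for illustration, with `sext` standing in for `llvm::SignExtend64`. It assumes arithmetic right shift on signed values, which C++20 guarantees.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for llvm::SignExtend64<B>: sign-extend the low B bits of X.
template <unsigned B> int64_t sext(uint64_t X) {
  return static_cast<int64_t>(X << (64 - B)) >> (64 - B);
}

struct OffsetParts {
  int64_t Hi;   // multiple of 4096; the part a LUI would be asked to produce
  int64_t Lo12; // in [-2048, 2047]; folded into the using instruction
};

// Split Val so that Hi + Lo12 == Val when added at full width.
OffsetParts splitOffset(int64_t Val) {
  int64_t Lo12 = sext<12>(static_cast<uint64_t>(Val));
  OffsetParts P{Val - Lo12, Lo12};
  assert((P.Hi & 0xFFF) == 0 && "high part must have its low 12 bits clear");
  return P;
}
```

Because the low part is signed, `Hi` is rounded up by 4096 whenever bit 11 of `Val` is set. For example, `splitOffset(0x12345FFF)` gives `Hi == 0x12346000` and `Lo12 == -1`.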
The main codegen effect is that cases which currently hit the "split addi" lowering become an LUI plus a folded offset. There are a couple of outright dynamic instruction count wins, and where the dynamic instruction count is equal, a canonical LUI is probably better than a chain of SP adds.
I'd appreciate careful review here. I'm not entirely sure this is correct without additional guards, and this is the type of bit math I have trouble reasoning about. My main concern is whether the signed add performed by the using instruction is equivalent to the ADDIW used on RV64, or whether it needs additional special handling.
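To make that concern concrete, here is a small standalone model of the two adds. This is only a sketch under stated assumptions, not the patch's code: `sext64`, `luiRV64`, `viaAddi`, and `viaAddiw` are hypothetical helpers. It assumes that LUI on RV64 sign-extends its 32-bit result, that ADDI and load/store address arithmetic add at 64 bits, and that ADDIW sign-extends the low 32 bits of the sum.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for llvm::SignExtend64<B>: sign-extend the low B bits of X.
template <unsigned B> int64_t sext64(uint64_t X) {
  return static_cast<int64_t>(X << (64 - B)) >> (64 - B);
}

// Model of LUI on RV64: the 32-bit result is sign-extended to 64 bits.
int64_t luiRV64(int64_t Hi) { return sext64<32>(static_cast<uint64_t>(Hi)); }

// LUI followed by a full-width add of the signed low 12 bits
// (ADDI, or the address computation of a load/store).
int64_t viaAddi(int64_t Val) {
  assert(Val >= INT32_MIN && Val <= INT32_MAX && "sketch covers int32 only");
  int64_t Lo12 = sext64<12>(static_cast<uint64_t>(Val));
  return luiRV64(Val - Lo12) + Lo12;
}

// LUI followed by ADDIW: same sum, truncated to 32 bits and sign-extended.
int64_t viaAddiw(int64_t Val) {
  return sext64<32>(static_cast<uint64_t>(viaAddi(Val)));
}
```

In this model the two agree for most inputs, for example `viaAddi(0x12345FFF) == 0x12345FFF`. They differ at `Val = 0x7FFFF800`: the high part becomes `0x80000000`, LUI sign-extends it to a negative value, and the 64-bit add yields `-2147485696`, while the ADDIW form wraps back to `0x7FFFF800`. As far as I can tell, for inputs within the int32 range the mismatch is confined to `0x7FFFF800` through `0x7FFFFFFF`. The sketch does not say whether the patch already excludes that range.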
Can this just be SignExtend64<32>(Val - Lo12)?