Find two consecutive 32-bit loads together with two consecutive 32-bit stores that write the loaded values, and transform the pairs into a single 64-bit load and a single 64-bit store.
When the wide load/store is unscaled (ldur/stur), the offset does not need to
be changed. e.g.,

  %vreg2 = LDURWi %vreg0, -76
  %vreg3 = LDURWi %vreg0, -72
  STURWi %vreg2, %vreg1, -44
  STURWi %vreg3, %vreg1, -40

 becomes

  %vreg2 = LDURXi %vreg0, -76
  STURXi %vreg2, %vreg1, -44

When the wide load/store is scaled (ldr/str), the offset should be half of the
original value, since ldr/str offsets are scaled by the access size. e.g.,

  %vreg2 = LDRWui %vreg0, 4
  %vreg3 = LDRWui %vreg0, 5
  STRWui %vreg2, %vreg1, 2
  STRWui %vreg3, %vreg1, 3

 becomes

  %vreg2 = LDRXui %vreg0, 2
  STRXui %vreg2, %vreg1, 1

When the original load/store is scaled (ldr/str) and its offset is odd, it
cannot be halved exactly, but it can still be widened by switching to the
unscaled form, provided the unscaled offset satisfies
-256 <= unscaled offset value < 256,
where unscaled offset value = scaled offset value * memory scale size. e.g.,

  %vreg2 = LDRWui %vreg0, 13
  %vreg3 = LDRWui %vreg0, 14
  STRWui %vreg2, %vreg1, 37
  STRWui %vreg3, %vreg1, 38

 becomes

  %vreg2 = LDURXi %vreg0, 52    ; 52 = 13 * 4
  STURXi %vreg2, %vreg1, 148    ; 148 = 37 * 4
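For illustration, here is a minimal standalone C++ sketch of the offset logic
described above (it is not the actual AArch64LoadStoreOptimizer code; the
names widenOffset and WideAccess are hypothetical). Given the offset of the
lower-addressed 32-bit access, it decides whether the wide 64-bit access is
legal and which form/offset it should use:

  #include <cstdint>
  #include <optional>

  struct WideAccess {
    bool Unscaled;  // true -> LDURXi/STURXi, false -> LDRXui/STRXui
    int64_t Offset; // offset encoded in the wide instruction
  };

  // IsUnscaled: the narrow pair uses LDURWi/STURWi (byte offsets).
  // Offset:     offset of the lower-addressed 32-bit access.
  std::optional<WideAccess> widenOffset(bool IsUnscaled, int64_t Offset) {
    if (IsUnscaled) {
      // ldur/stur already carry byte offsets; the wide form keeps the
      // offset unchanged (it already fits the signed 9-bit range).
      return WideAccess{true, Offset};
    }
    // ldr/str offsets are scaled by the access size (x4 for 32-bit,
    // x8 for 64-bit), so an even 32-bit offset halves cleanly.
    if (Offset % 2 == 0)
      return WideAccess{false, Offset / 2};
    // Odd scaled offset: fall back to the unscaled wide form using the
    // byte offset (scaled offset * 4), legal only within [-256, 256).
    int64_t ByteOff = Offset * 4;
    if (ByteOff < -256 || ByteOff >= 256)
      return std::nullopt;
    return WideAccess{true, ByteOff};
  }

Checking it against the examples above: widenOffset(false, 4) yields a scaled
offset of 2 (LDRXui), while widenOffset(false, 13) yields an unscaled byte
offset of 52 (LDURXi), matching 13 * 4.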
ExynosOpt? :)