Consider sample code which copies a 4x4 matrix row by row (see cascade-vld-vst.ll). The current revision generates the following code (AArch32):
mov r2, #48
mov r3, r0
vld1.32 {d16, d17}, [r3], r2
vld1.64 {d18, d19}, [r3]
add r3, r0, #32
add r0, r0, #16
vld1.64 {d22, d23}, [r0]
add r0, r1, #16
vld1.64 {d20, d21}, [r3]
vst1.64 {d22, d23}, [r0]
add r0, r1, #32
vst1.64 {d20, d21}, [r0]
vst1.32 {d16, d17}, [r1], r2
vst1.64 {d18, d19}, [r1]
mov pc, lr

After this patch is applied:
vld1.32 {d16, d17}, [r0]!
vld1.32 {d18, d19}, [r0]!
vld1.32 {d20, d21}, [r0]!
vld1.64 {d22, d23}, [r0]
vst1.32 {d16, d17}, [r1]!
vst1.32 {d18, d19}, [r1]!
vst1.32 {d20, d21}, [r1]!
vst1.64 {d22, d23}, [r1]
mov pc, lr

The patch also speeds up our matrix multiplication function by 15%. Some of the existing LLVM test cases now contain approximately 25% fewer instructions than before.
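For context, here is a minimal sketch of the kind of source that lowers to the code above. This is an assumption for illustration only: the actual test input lives in cascade-vld-vst.ll, and the element type and function name here are invented.

// Hypothetical source for the 4x4 matrix copy; float elements are an
// assumption that matches the 128-bit (16-byte) vld1/vst1 accesses above.
void copy4x4(float *dst, const float *src) {
  for (int row = 0; row < 4; ++row)   // one 16-byte row per iteration
    for (int col = 0; col < 4; ++col)
      dst[row * 4 + col] = src[row * 4 + col];
}

Each row is 16 bytes, exactly the access size of a 128-bit vld1/vst1, so after the patch every access folds its 16-byte pointer advance into the writeback ("!") form instead of materializing addresses with separate add instructions.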
The improvement is based on two major changes to CombineBaseUpdate:
- When selecting an address-increment instruction to fold, we prefer one whose increment equals the access size of the load/store.
- If no such address increment is bound to the address operand of the current load/store, we walk up the SelectionDAG chain and try to borrow the address-increment instruction bound to the address operand of a parent VST{X}_UPD or VLD{X}_UPD that we processed earlier (see the sketch after this list).
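A minimal standalone model of the selection preference follows. This is a sketch under assumptions, not the actual LLVM code: the real logic lives in CombineBaseUpdate, and the struct, the helper name, and the flattening of the SelectionDAG walk into a FromParentUpd flag are all invented for illustration.

#include <cstdint>
#include <vector>

// A candidate constant increment that could be folded into a
// post-incremented VLDx/VSTx.
struct IncCandidate {
  uint64_t Inc;        // constant added to the address
  bool FromParentUpd;  // borrowed from an earlier VLDx_UPD/VSTx_UPD?
};

// Change 1: prefer an increment equal to the access size, since it folds
// into the register-free "[rN]!" writeback form.
// Change 2: if the current instruction's own address operand offers no
// such increment, fall back to one borrowed from a parent update node.
const IncCandidate *pickIncrement(const std::vector<IncCandidate> &Cands,
                                  uint64_t AccessSize) {
  for (const IncCandidate &C : Cands)
    if (!C.FromParentUpd && C.Inc == AccessSize)
      return &C;
  for (const IncCandidate &C : Cands)
    if (C.FromParentUpd && C.Inc == AccessSize)
      return &C;
  return nullptr; // no profitable fold; keep the plain addressing form
}

In the "before" code above, the only increment folded is the 48-byte one (mov r2, #48), which forces the remaining addresses to be materialized with add instructions; preferring the 16-byte increment lets all four accesses chain through a single post-incremented pointer.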

But you are checking only VLD1/VST1, so you may want to change the function name accordingly (e.g. to mention VLD1OrVST1).