Consider sample code which copies a 4x4 matrix row by row (see cascade-vld-vst.ll); a C++ sketch of this pattern follows the two listings below. The current revision generates the following code (AArch32):
  mov     r2, #48
  mov     r3, r0
  vld1.32 {d16, d17}, [r3], r2
  vld1.64 {d18, d19}, [r3]
  add     r3, r0, #32
  add     r0, r0, #16
  vld1.64 {d22, d23}, [r0]
  add     r0, r1, #16
  vld1.64 {d20, d21}, [r3]
  vst1.64 {d22, d23}, [r0]
  add     r0, r1, #32
  vst1.64 {d20, d21}, [r0]
  vst1.32 {d16, d17}, [r1], r2
  vst1.64 {d18, d19}, [r1]
  mov     pc, lr
After this patch is applied:
  vld1.32 {d16, d17}, [r0]!
  vld1.32 {d18, d19}, [r0]!
  vld1.32 {d20, d21}, [r0]!
  vld1.64 {d22, d23}, [r0]
  vst1.32 {d16, d17}, [r1]!
  vst1.32 {d18, d19}, [r1]!
  vst1.32 {d20, d21}, [r1]!
  vst1.64 {d22, d23}, [r1]
  mov     pc, lr
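For reference, a C++ analogue of the pattern the test exercises might look like the sketch below. This is a hypothetical reconstruction: the actual test is the LLVM IR in cascade-vld-vst.ll, and the function name copyMatrix is invented here.

  // Hypothetical C++ analogue of cascade-vld-vst.ll: copy a 4x4 matrix of
  // 32-bit elements row by row, i.e. four contiguous 16-byte rows.
  #include <cstdint>
  #include <cstring>

  void copyMatrix(uint32_t (&dst)[4][4], const uint32_t (&src)[4][4]) {
    for (int row = 0; row < 4; ++row)
      std::memcpy(dst[row], src[row], sizeof(dst[row])); // 16 bytes per row
  }

Each 16-byte row maps to one VLD1/VST1 pair, which is why an increment equal to the 16-byte access size folds into the post-indexed [r0]!/[r1]! forms above.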
It also speeds up our matrix multiplication function by 15%. Some of the existing LLVM test cases now contain approximately 25% fewer instructions than before.
The improvement is based on two major changes to CombineBaseUpdate:
- When we select an address increment instruction to fold, we prefer one whose increment is equal to the access size of the load/store.
- If we can't find such an address increment bound to the current load/store instruction's address operand, we walk up the SelectionDAG chain and try to borrow the address increment bound to the address operand of a parent VST{X}_UPD or VLD{X}_UPD that we processed earlier. A simplified sketch of both heuristics follows this list.
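The sketch below models both heuristics with toy data structures. The real code operates on SelectionDAG nodes (SDNode/SDValue) inside CombineBaseUpdate in ARMISelLowering.cpp, so every type and function name here (MemOp, pickLocalInc, pickIncrement) is illustrative, not the actual implementation.

  #include <cstdint>
  #include <optional>
  #include <vector>

  // Toy stand-in for a VLD1/VST1 node and its address-arithmetic context.
  struct MemOp {
    uint64_t AccessSize;                   // bytes transferred by this VLD1/VST1
    std::vector<uint64_t> AddrIncs;        // constant increments of its address operand
    const MemOp *CombinedParent = nullptr; // VLD{X}_UPD/VST{X}_UPD combined earlier
  };

  // Change 1: among the increments bound to this op's address operand, prefer
  // one equal to the access size; it folds to the register-free "[rN]!" form.
  std::optional<uint64_t> pickLocalInc(const MemOp &Op) {
    for (uint64_t Inc : Op.AddrIncs)
      if (Inc == Op.AccessSize)
        return Inc;
    return std::nullopt;
  }

  // Change 2: if no suitable increment is bound to the current op, walk up
  // the chain of parent _UPD nodes processed earlier and borrow one.
  std::optional<uint64_t> pickIncrement(const MemOp &Op) {
    if (auto Inc = pickLocalInc(Op))
      return Inc;
    for (const MemOp *P = Op.CombinedParent; P; P = P->CombinedParent)
      for (uint64_t Inc : P->AddrIncs)
        if (Inc == Op.AccessSize)
          return Inc;
    return std::nullopt; // no profitable fold; leave the add/mov as-is
  }

In the 4x4 copy, each VLD1/VST1 moves 16 bytes and a matching 16-byte increment can always be found or borrowed, so all the standalone add/mov address arithmetic disappears.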
But you are checking only VLD1/VST1, so you may want to change the function name accordingly (e.g. to mention VLD1OrVST1).