
[ARM] Simplify address calculation for NEON load/store
ClosedPublic

Authored by asavonic on Aug 31 2021, 5:28 AM.

Details

Summary

The patch attempts to optimize a sequence of SIMD loads from the same
base pointer:

%0 = getelementptr float, float* %base, i32 4
%1 = bitcast float* %0 to <4 x float>*
%2 = load <4 x float>, <4 x float>* %1
...
%n1 = getelementptr float, float* %base, i32 N
%n2 = bitcast float* %n1 to <4 x float>*
%n3 = load <4 x float>, <4 x float>* %n2

For AArch64, the compiler generates a sequence of LDR Qt, [Xn, #imm]
instructions. However, 32-bit NEON VLD1/VST1 lack the [Rn, #imm]
addressing mode, so the address is computed before every ld/st
instruction:

add r2, r0, #32
add r0, r0, #16
vld1.32 {d18, d19}, [r2]
vld1.32 {d22, d23}, [r0]

This can be improved by computing the address for the first load, and
then using the post-indexed form of VLD1/VST1 to load the rest:

add r0, r0, #16
vld1.32 {d18, d19}, [r0]!
vld1.32 {d22, d23}, [r0]

In order to do that, the patch adds more patterns to DAGCombine:

  • (load (add ptr inc1)) and (add ptr inc2) are now folded if inc1 and inc2 are constants.
  • (or ptr inc) is now recognized as a pointer increment if ptr is sufficiently aligned.

In addition to that, we now search for all possible base updates and
then pick the best one.

Diff Detail

Event Timeline

asavonic created this revision. Aug 31 2021, 5:28 AM
asavonic requested review of this revision. Aug 31 2021, 5:28 AM
Herald added a project: Restricted Project. Aug 31 2021, 5:28 AM

My first thought was why can't this be handled by LSR, but I can see how that might not work very well trying to precisely match VLD offsets. And the tests you added have no loops :)

(I also had some thoughts about whether this was useful in general, or if a sufficiently powerful cpu would break these into microops in either case, leading to the same performance in the end. But the code does look cleaner now, I can see how it would improve things)

The way we handled this in MVE was to "distribute" the increments in the ARMLoadStoreOptimizer pass. The instructions in MVE are different, and that does involve checking through Machine Instructions for Adds that can be better distributed into postinc instructions. LSR got it mostly right, DAG Combine did an OKish job most of the time, and we fixed up what went wrong later in the pipeline.

It seems to have worked out OK as far as I can tell, is there a reason we can't do the same thing here? Adding the new pass seems fine if we need it, but I'm less sanguine about having to disable a lot of Add folds in DAGCombiner.

llvm/test/CodeGen/ARM/arm-post-indexing-opt.ll
2–3

It is best not to mix llc and opt test files. They are more easily kept as separate tests, with autogenerated check lines in each.
It's also good to show the before and after in the review by pre-committing the tests, which makes the review easier by making it obvious what has changed.

My first thought was why can't this be handled by LSR, but I can see how that might not work very well trying to precisely match VLD offsets. And the tests you added have no loops :)

The patch is focused on sequential loads that share the same base
pointer and have constant offsets, but it can also work if such a
sequence is in a loop body:

void test(float *a, float *b, int n) {
  for (int i = 0; i < n; ++i) {
    v4f32 A1 = vld1q_f32(a + 16 * i);
    v4f32 A2 = vld1q_f32(a + 16 * i + 4);
    v4f32 A3 = vld1q_f32(a + 16 * i + 8);
    v4f32 A4 = vld1q_f32(a + 16 * i + 12);
    vst1q_f32(b + 16 * i, A1);
    vst1q_f32(b + 16 * i + 4, A2);
    vst1q_f32(b + 16 * i + 8, A3);
    vst1q_f32(b + 16 * i + 12, A4);
  }
}

LSR seems to handle only values that are loop induction variables, so
these constant offsets are not optimized. The loop body is compiled to:

add     lr, r0, r3
subs    r2, r2, #1
mov     r4, lr
vld1.32 {d16, d17}, [r4], r12
vld1.32 {d18, d19}, [r4]
add     r4, lr, #32              ; <-- extra address computation
vld1.32 {d20, d21}, [r4]
add     r4, lr, #16              ; <--
vld1.32 {d22, d23}, [r4]
add     r4, r1, r3
add     r5, r4, #16              ; <--
add     r3, r3, #64
mov     lr, r4
add     r4, r4, #32              ; <--
vst1.32 {d16, d17}, [lr], r12
vst1.32 {d22, d23}, [r5]
vst1.32 {d20, d21}, [r4]
vst1.32 {d18, d19}, [lr]
bne     .LBB0_2

In the first revision of this patch, ARMPostIndexingOpt was confused by
GEP patterns produced by LSR. This is now fixed, and the sequence is
optimized to:

add     r3, r0, r12
subs    r2, r2, #1
vld1.32 {d16, d17}, [r3]!
vld1.32 {d18, d19}, [r3]!
vld1.32 {d20, d21}, [r3]!
vld1.32 {d22, d23}, [r3]
add     r3, r1, r12
add     r12, r12, #64
vst1.32 {d16, d17}, [r3]!
vst1.32 {d18, d19}, [r3]!
vst1.32 {d20, d21}, [r3]!
vst1.32 {d22, d23}, [r3]
bne     .LBB0_2

(I also had some thoughts about whether this was useful in general, or if a sufficiently powerful cpu would break these into microops in either case, leading to the same performance in the end. But the code does look cleaner now, I can see how it would improve things)

I've measured the execution time of the loop above, and it is ~7%
faster on Cortex-A72. It may differ on other hardware, though.

The way we handled this in MVE was to "distribute" the increments in the ARMLoadStoreOptimizer pass. The instructions in MVE are different, and that does involve checking through Machine Instructions for Adds that can be better distributed into postinc instructions. LSR got it mostly right, DAG Combine did an OKish job most of the time, and we fixed up what went wrong later in the pipeline.

It seems to have worked out OK as far as I can tell, is there a reason we can't do the same thing here?

I think the approach is still the same; the new pass just handles
cases that LSR does not.

Adding the new pass seems fine if we need it, but I'm less sanguine about having to disable a lot of Add folds in DAGCombiner.

Agreed, this is potentially the most problematic change. It is limited
to just (load/store (add)) and runs only before legalization, so this
/hopefully/ limits its impact to just the patterns we need.

asavonic updated this revision to Diff 371409. Sep 8 2021, 11:55 AM
  • Added handling for GEP patterns generated by LSR.
  • Split llc and opt LIT tests.
  • Pre-committed llc LIT test.

The way we handled this in MVE was to "distribute" the increments in the ARMLoadStoreOptimizer pass. The instructions in MVE are different, and that does involve checking through Machine Instructions for Adds that can be better distributed into postinc instructions. LSR got it mostly right, DAG Combine did an OKish job most of the time, and we fixed up what went wrong later in the pipeline.

It seems to have worked out OK as far as I can tell, is there a reason we can't do the same thing here?

I think the approach is still the same; the new pass just handles
cases that LSR does not.

The MVE method works quite differently, I feel. It fixes up problems later in the pipeline at the MIR level, rather than trying to get them perfectly correct before ISel. Dealing with MIR can be awkward, though: it is difficult to look through all the instructions that can be generated for ways to distribute post-increments more evenly when the results are not already close. I don't think I would recommend it for this problem; it would probably be simpler to fix things up in the DAG than to do it later in MIR.

Adding the new pass seems fine if we need it, but I'm less sanguine about having to disable a lot of Add folds in DAGCombiner.

Agreed, this is potentially the most problematic change. It is limited
to just (load/store (add)) and runs only before legalization, so this
/hopefully/ limits its impact to just the patterns we need.

Unfortunately, I'm not sure we can say this in general, even if it is quite rare. Having this reassociationCanBreakPostIndexingPattern method in the generic DAG combine code disabling so many of the ADD folds is not a good sign; it feels like it's working around the fact that this is quite a fragile way of getting post-indexing to work. The main issues I found could, I think, be fixed with one-use checks in reassociationCanBreakPostIndexingPattern, but is there a way to make this work without disabling the generic folds? Perhaps by adding new folds for fixing postinc patterns, and getting "add-like-ors" to behave like adds in more places?

asavonic updated this revision to Diff 378619. Mon, Oct 11, 4:12 AM
asavonic edited the summary of this revision.

is there a way to make this work without disabling the generic folds? Perhaps by adding new folds for fixing postinc patterns, and getting "add-like-ors" to behave like adds in more places?

Done. Everything is moved to DAGCombine now, so the new pass is no
longer required. This catches more cases and requires no changes
around the ADD-to-OR transformation.

Code from CombineBaseUpdate is moved to TryCombineBaseUpdate without
major changes, but we now look for more candidates (pointer updates)
and try to pick the best one.

Nice work. I'm glad this works this way.

llvm/lib/Target/ARM/ARMISelLowering.cpp
15581

Can this use haveNoCommonBitsSet, similar to ARMDAGToDAGISel::SelectAddLikeOr?

15599

Do we need to check both operands? The Add/Or should be canonicalized to have the constant on the RHS.

15650

How useful are these error messages do you think, in the long run?

The number of combines tried can be quite high, and these may just end up adding noise for people not looking at CombineBaseUpdate combines. It seems to be fairly uncommon to add debug messages for DAG combines.

But if you think they are useful, feel free to keep them around.

llvm/test/CodeGen/ARM/alloc-no-stack-realign.ll
19–30

How come this test is changing?

asavonic updated this revision to Diff 379057. Tue, Oct 12, 8:52 AM
  • Used DAG.haveNoCommonBitsSet to check for ADD-like OR.
  • Removed handling for non-canonical DAG.
  • Removed extra debug messages.
asavonic added inline comments. Tue, Oct 12, 8:54 AM
llvm/lib/Target/ARM/ARMISelLowering.cpp
15581

Done.

15599

Thanks. I was not sure we could rely on that.
Removed the extra check.

15650

Let's remove them.
They were useful for debugging, but standard messages from DAG combiner are good enough.

llvm/test/CodeGen/ARM/alloc-no-stack-realign.ll
19–30

Oh, this one is tricky.
The original test expected to see an OR instruction when the stack is aligned, and an ADD instruction when it is not. After the patch, both the ADD and the OR are folded into the loads/stores, so the two sequences (test1 and test2) become completely identical.

I changed the IR slightly so that the OR is not folded.

dmgreen accepted this revision. Thu, Oct 14, 12:23 AM

Thanks. LGTM

llvm/lib/Target/ARM/ARMISelLowering.cpp
15463

LLVM tends to leave the braces off single-statement if blocks.

This revision is now accepted and ready to land. Thu, Oct 14, 12:23 AM
This revision was landed with ongoing or failed builds. Thu, Oct 14, 5:26 AM
This revision was automatically updated to reflect the committed changes.
asavonic added inline comments. Thu, Oct 14, 5:27 AM
llvm/lib/Target/ARM/ARMISelLowering.cpp
15463

Thank you. Fixed that before landing.