
[SVE][LSR] Teach LSR to enable simple scaled-index addressing mode generation for SVE.
ClosedPublic

Authored by huihuiz on Jun 8 2021, 6:08 PM.

Details

Summary

Currently, Loop Strength Reduction (LSR) does not handle loops with a scalable stride very well.

Take, for instance, a loop vectorized with the scalable vector type <vscale x 8 x i16>
(refer to test/CodeGen/AArch64/sve-lsr-scaled-index-addressing-mode.ll, added in this patch).

Memory accesses are incremented by "16*vscale", while the induction variable is incremented
by "8*vscale". The scaling factor "2" needs to be extracted to build the candidate formula,
i.e., "reg(%in) + 2*reg({0,+,(8 * %vscale)})", so that the addrec register reg({0,+,(8*vscale)})
can be reused among the Address and ICmpZero LSRUses to enable optimal solution selection.

This patch allows LSR's getExactSDiv to recognize special cases like "C1*X*Y /s C2*X*Y",
and to pull out "C1 /s C2" as the scaling factor whenever possible. Without this change, LSR
misses candidate formulae with the proper scale factor needed to leverage the target's
scaled-index addressing mode.

Note: This patch doesn't fully fix AArch64's isLegalAddressingMode for scalable
vectors, but it allows simple valid scales to pass through.

Diff Detail

Event Timeline

huihuiz created this revision. Jun 8 2021, 6:08 PM
huihuiz requested review of this revision. Jun 8 2021, 6:08 PM
efriedma added inline comments. Jun 8 2021, 9:24 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11811

"AM.Scale > 0" check is redundant.

llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
743

LOps == ROps?

dmgreen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11810

What does AM.Scale == 1 relate to? Using ld1b?

llvm/test/CodeGen/AArch64/sve-lsr-scaled-index-addressing-mode.ll
42

Why is this load volatile?

58

This test isn't using masked loads/stores, but it would be good to make sure they work sensibly.

huihuiz updated this revision to Diff 350996. Jun 9 2021, 2:36 PM
huihuiz marked 2 inline comments as done.
huihuiz added a reviewer: dmgreen.

Thanks @efriedma @dmgreen for the feedback!
Addressed review comments; please let me know if there is anything I missed.

huihuiz added inline comments. Jun 9 2021, 2:38 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11810

For ld1b cases, AM.Scale could be 1 when trying to match [r+r] with a base register.
But that case is covered by "(uint64_t)AM.Scale == VecElemNumBytes", since VecElemNumBytes is also 1.

I simplified this check to "(AM.Scale == 0 || (uint64_t)AM.Scale == VecElemNumBytes)".
Let me know if there is any case I missed.

llvm/test/CodeGen/AArch64/sve-lsr-scaled-index-addressing-mode.ll
42

Removed, not needed.

58

Thanks for catching it! A masked load/store case has been added and works as expected.

Matt added a subscriber: Matt. Jun 9 2021, 5:20 PM
sdesmalen added inline comments. Jun 10 2021, 2:04 PM
llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
737

Does the MulRHS also need checks for IgnoreSignificantBits || isMulSExtable(MulRHS, SE) ?

huihuiz updated this revision to Diff 351295. Jun 10 2021, 4:36 PM

To be more conservative, added the check "IgnoreSignificantBits || isMulSExtable(MulRHS, SE)" for MulRHS as well.

huihuiz added inline comments. Jun 10 2021, 4:38 PM
llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
737

I didn't include this check originally because, even with "IgnoreSignificantBits" equal to false and MulRHS a negative value, we can still extract the factor.
For instance, for (16*X) /s (-8*X), -2 can be extracted.

But this seems to violate the intention of setting IgnoreSignificantBits to false.

Therefore, I am applying the check to MulRHS as well.

The change looks good to me; I just have an open question about the test.

@huihuiz Thanks for working on this by the way, this has been on our wish-list for quite a while! :)

llvm/test/CodeGen/AArch64/sve-lsr-scaled-index-addressing-mode.ll
11

nit: I'm not really sure what the convention is here, but I wonder if it's better to just have two RUN lines: one that checks the full output of the IR itself (more than just the IR-NOT, and instead of checking the debug output of the pass), and one that checks the full output of the asm (or at least more than just the load/store). The reason I'm suggesting checking more of the IR/asm is to verify that there are no redundant instructions left over for other purposes. I'm also not sure whether testing debug output is normally desirable, but if we just check the output, we can remove the REQUIRES: asserts line.

huihuiz updated this revision to Diff 351564. Jun 11 2021, 2:22 PM
huihuiz marked an inline comment as done.
huihuiz edited the summary of this revision. (Show Details)

Thanks @sdesmalen for the review! Happy to help out! ;)

Updated the test check lines as suggested.

This revision is now accepted and ready to land. Jun 14 2021, 1:09 PM