This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Improve load/store optimizer to handle LDUR + LDR.
ClosedPublic

Authored by ab on Aug 18 2015, 2:12 PM.

Details

Summary

This patch allows the mixing of scaled and unscaled load/stores to form
load/store pairs.

PR24465

Diff Detail

Repository
rL LLVM

Event Timeline

mcrosier updated this revision to Diff 32449.Aug 18 2015, 2:12 PM
mcrosier retitled this revision from to [AArch64] Improve load/store optimizer to handle LDUR + LDR..
mcrosier updated this object.
mcrosier added reviewers: ab, gberry, mssimpso.
mcrosier set the repository for this revision to rL LLVM.

I haven't done any testing beyond basic lit tests. Hopefully, I'll have correctness/perf testing in flight by tomorrow. Regardless, I think this is in good enough shape to review.

kristof.beyls added inline comments.Aug 19 2015, 3:26 AM
test/CodeGen/AArch64/ldp-stp-scaled-unscaled-pairs.ll
1 ↗(On Diff #32449)

My preference is to use -march=aarch64 instead of -march=arm64 as much as possible, as aarch64 is the official name for the architecture.
Is there a specific reason -mcpu=cortex-a57 needs to be added to the test line? I don't have a strong opinion on whether it should be there or not, just wondering.

81 ↗(On Diff #32449)

Looking at this test case, I see that before this patch, the following code is produced:

ldur    x8, [x0, #-8]
ldr      x9, [x0]

If I'm not mistaken, ldur x8, [x0, #-8] has the same functionality as ldr x8, [x0, #-1]? If so, wouldn't it be better to make sure we produce ldr instead of ldur in the first place?
If we did that, and there were still a good reason to have special code to convert LDR + LDUR into LDP, I guess none of the above test cases really show it (although I haven't investigated every single test case in detail)?

kristof.beyls added inline comments.Aug 19 2015, 4:08 AM
test/CodeGen/AArch64/ldp-stp-scaled-unscaled-pairs.ll
81 ↗(On Diff #32449)

D'oh! The ARMARM clearly states that the scaled immediate offsets in ldr x8, [x0, #imm] can only be positive/unsigned. Please ignore my comment above!
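
For reference, the encoding constraint Kristof is pointing at: the scaled LDR (immediate) form takes an unsigned offset that is a multiple of the access size, while LDUR takes a signed 9-bit byte offset. A rough C++ sketch of that classification for illustration only; the names are hypothetical and the ranges follow the ARMARM description, not code from this patch:

#include <cstdint>

enum class LoadForm { LDR, LDUR, Neither };

// Which single-register load form can encode a given byte offset?
LoadForm classifyOffset(int64_t ByteOffset, int64_t MemSize) {
  // LDR (immediate, unsigned offset): non-negative multiple of the access
  // size, with the scaled immediate in [0, 4095].
  if (ByteOffset >= 0 && ByteOffset % MemSize == 0 &&
      ByteOffset / MemSize <= 4095)
    return LoadForm::LDR;
  // LDUR: signed 9-bit byte offset, unscaled.
  if (ByteOffset >= -256 && ByteOffset <= 255)
    return LoadForm::LDUR;
  return LoadForm::Neither;
}

So classifyOffset(-8, 8) yields LDUR, which is why the ldur in the example above cannot simply be emitted as a scaled ldr.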

mcrosier updated this revision to Diff 32819.Aug 21 2015, 6:07 AM
mcrosier added a reviewer: mzolotukhin.

Addressed Kristof's comments. Included full diff.

mcrosier updated this revision to Diff 32820.Aug 21 2015, 6:17 AM

Fix test7, which I modified/broke while doing additional testing.

mcrosier updated this revision to Diff 32821.Aug 21 2015, 6:19 AM
mzolotukhin edited edge metadata.Aug 22 2015, 12:04 PM

Hi Chad,

I checked on a small testcase, and with this patch we do merge STUR and STR. There is one potential issue though: in some cases we intentionally split STP into two STUR/STR, as there is a big performance penalty if STP crosses a cache line (see e.g. performSTORECombine in AArch64ISelLowering.cpp and getMemoryOpCost in AArch64TargetTransformInfo.cpp). So, I think we might want to perform this combining only for non-temporal or known-to-be-well-aligned memory accesses. What do you think, does it make sense?

Thanks,
Michael
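
A rough illustration of the cache-line concern Michael describes (hypothetical helper, not code from AArch64ISelLowering.cpp; the 64-byte line size is an assumption):

#include <cstdint>

// Could a store of StoreSize bytes, known only to be KnownAlign-aligned,
// straddle a cache-line boundary? Alignments and sizes are powers of two.
bool mayCrossCacheLine(uint64_t KnownAlign, uint64_t StoreSize,
                       uint64_t CacheLineSize = 64) {
  // Aligned to at least its own size and no larger than a line: the access
  // stays inside one line. Otherwise it might straddle a boundary.
  return !(KnownAlign >= StoreSize && StoreSize <= CacheLineSize);
}

For example, a 16-byte STP that is only known to be 8-byte aligned (mayCrossCacheLine(8, 16)) might cross a line, which is the case where splitting it into two 8-byte stores avoids the penalty.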

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
455

This looks strange. Is it expected that the scaled and unscaled offsets differ by a factor of MemSize^2? (In one case you multiply by MemSize, in the other you divide.)

708

Same here.

Hi Michael,
So, you're saying that performSTORECombine splits 16B stores that cross cache-line boundaries for performance reasons. AFAICT, this combine is only enabled for Cyclone. Later, the AArch64 load/store optimizer runs and combines these split stores, undoing the optimization performed during ISelLowering.

If I understand things correctly, I think this makes sense. However, I'd probably implement this in a separate patch and test it on other subtargets (e.g., A57). Sound reasonable?

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
455

My assumption is that the scaled offset is always the unscaled offset divided by the MemSize and, conversely, the unscaled offset is always the scaled offset multiplied by the MemSize; I believe this is valid.

I.e.,
Unscaled Offset = Scaled Offset * MemSize;
Scaled Offset = Unscaled Offset / MemSize;

What did I miss? I'm trying to simplify the logic so that both offsets are either scaled or unscaled, but not a mix of the two. I'm guessing you're concerned my conversion is incorrect.
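
A minimal sketch of the conversion being described, with illustrative names (not the actual variables in AArch64LoadStoreOptimizer.cpp):

#include <cstdint>

// LDUR/STUR offsets are in bytes; LDR/STR offsets are in units of MemSize.
int64_t toUnscaled(int64_t ScaledOffset, int64_t MemSize) {
  return ScaledOffset * MemSize;
}

int64_t toScaled(int64_t UnscaledOffset, int64_t MemSize) {
  // Assumes UnscaledOffset is an exact multiple of MemSize.
  return UnscaledOffset / MemSize;
}

Note that toScaled silently assumes divisibility; that assumption comes up again later in the review.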

Hi Chad,

As far as I understand, currently ISelLowering splits the unaligned stores, from which we happen to get STUR and STR, which we can't combine to STP without this patch. With this patch, we'll be able to merge them back, so we'll undo that optimization.

I think that being able to combine STUR/STR to STP is good by itself, but from the patch alone we'll probably get regressions (AFAIR, the biggest one was matmul from LNT testsuite). That's my only concern, but I don't mind landing the patch and addressing this issue later if we actually observe it.

Also, I can't provide a high-quality review of this code, as I'm not very familiar with it, so I'd appreciate other people's feedback on the patch.

Thanks!

Michael

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
455

Yep, I was concerned that it was incorrect, but I just read the code incorrectly. Sorry for that!
(I thought that we do PairedOffset = UnscaledOffset * MemSize in one case and PairedOffset = UnscaledOffset / MemSize in the other.)

ab edited edge metadata.Aug 31 2015, 6:00 PM

This LGTM (modulo nits), but let's see what the testsuite says first.

As far as I understand, currently ISelLowering splits the unaligned stores, from which we happen to get STUR and STR, which we can't combine to STP without this patch. With this patch, we'll be able to merge them back, so we'll undo that optimization.

But this issue isn't specific to this patch, right? If the unaligned store was split to STR+STR we would have generated an STP even before this change. I agree we'll need to do something about this though, but separately, and for both mixed and non-mixed STR/STUR pairs.

lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
450

MI -> Paired, FirstMI -> I ?

607–610

What about removing CanMergeOpc? It's assigned several times with different values, and all of them could be folded into ifs/returns.

619–620

Same: fold the condition into the if, remove CanMergeOpc?

627–629

I see this is the only use of the get*From* helpers. What about:

getMatchingPairOpcode(Opc) == getMatchingPairOpcode(PairOpc)

That'll let us get rid of the added functions.
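
Roughly, the suggested check in context (the surrounding names are assumptions; getMatchingPairOpcode is the existing helper in this file):

unsigned getMatchingPairOpcode(unsigned Opc); // existing helper in this file

bool opcodesCanPair(unsigned Opc, unsigned PairOpc) {
  // Scaled (LDR/STR) and unscaled (LDUR/STUR) forms of the same access map
  // to the same pair opcode, so mixed forms compare equal without the extra
  // get*From* helpers.
  return getMatchingPairOpcode(Opc) == getMatchingPairOpcode(PairOpc);
}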

630–633

Same: fold the condition into the return, remove CanMergeOpc?

721

To make sure I understand: the previous code was "incorrect" in checking MI rather than FirstMI, but it didn't make a difference because we didn't allow mismatched opcodes, right?

test/CodeGen/AArch64/ldp-stp-scaled-unscaled-pairs.ll
8 ↗(On Diff #32821)

Kill the "entry:"s ?

Thanks for the feedback, Ahmed.

In D12116#236954, @ab wrote:

This LGTM (modulo nits), but let's see what the testsuite says first.

Unfortunately, I don't have a setup that can test with the testsuite at the moment. That is in the works...

This transformation is very narrow and did not hit anything in SPEC2000. Therefore, I'm going to move on to other work with a higher ROI. However, feel free to push this one along.

As far as I understand, currently ISelLowering splits the unaligned stores, from which we happen to get STUR and STR, which we can't combine to STP without this patch. With this patch, we'll be able to merge them back, so we'll undo that optimization.

But this issue isn't specific to this patch, right? If the unaligned store was split to STR+STR we would have generated an STP even before this change. I agree we'll need to do something about this though, but separately, and for both mixed and non-mixed STR/STUR pairs.

That is absolutely correct! This patch only applies to the very narrow case of STR/STUR. The common case is when we're pairing STUR/STUR, which has already been committed.

mcrosier added inline comments.Sep 1 2015, 8:36 AM
lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
721

I'm not sure the checking was "incorrect", but your latter comment is correct; because we didn't allow mismatched opcodes it didn't make a difference. However, because we're converting MI to be either scaled or unscaled to match FirstMI, we need to use IsUnscaled in the context of this patch.

Hi Chad, Ahmed,

I ran some testing for this patch and found a bug there. The issue is that when we scale offset, we need to make sure that the original value is divisible by the scale factor. Here is a test to illustrate the issue:

define <2 x i64> @test4(i64* %p) nounwind {
  %a1 = bitcast i64* %p to <2 x i64>*
  %tmp1 = load <2 x i64>, <2 x i64>* %a1, align 8            ; load <p[0], p[1]>

  %add.ptr2 = getelementptr inbounds i64, i64* %p, i64 3
  %a2 = bitcast i64* %add.ptr2 to <2 x i64>*
  %tmp2 = load <2 x i64>, <2 x i64>* %a2, align 8            ; load <p[3], p[4]>
  %add = add nsw <2 x i64> %tmp1, %tmp2
  ret <2 x i64> %add
}

The current patch will combine these two loads, which is incorrect.

Michael

ab commandeered this revision.Sep 2 2015, 8:12 AM
ab edited reviewers, added: mcrosier; removed: ab.

Thanks Michael. Chad, I'll fix that and resubmit, thanks for the patch!

ab updated this revision to Diff 33810.Sep 2 2015, 8:13 AM
ab edited edge metadata.
ab removed rL LLVM as the repository for this revision.

And this should catch the unscaled offset not being a multiple of the scale (with abundant asserts).
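
A minimal sketch of the kind of guard being described (hypothetical names, not the actual diff): before reinterpreting an unscaled byte offset as a scaled one, bail out unless it is an exact multiple of the access size.

#include <cassert>
#include <cstdint>

bool getScaledOffset(int64_t UnscaledOffset, int64_t MemSize,
                     int64_t &ScaledOffset) {
  if (UnscaledOffset % MemSize != 0)
    return false; // e.g. test4 above: a 24-byte offset with 16-byte accesses
  ScaledOffset = UnscaledOffset / MemSize;
  assert(ScaledOffset * MemSize == UnscaledOffset && "scaling lost bits");
  return true;
}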

mcrosier edited edge metadata.Sep 2 2015, 8:14 AM

Thanks, Ahmed.

ab marked 14 inline comments as done.Sep 2 2015, 8:15 AM

Hi,

I just finished an LNT+externals run and saw no failures. Thanks for the fix!

Michael

mcrosier accepted this revision.Sep 3 2015, 5:30 AM
mcrosier edited edge metadata.

Thanks for seeing this one through, Ahmed. LGTM.

This revision is now accepted and ready to land.Sep 3 2015, 5:30 AM
mcrosier closed this revision.Sep 3 2015, 7:44 AM

Committed r246769.

test/CodeGen/AArch64/ldp-stp-scaled-unscaled-pairs.ll
1 ↗(On Diff #33810)

BTW, I added -aarch64-neon-syntax=apple to the final commit, so that this will pass on all platforms.

I speculatively reverted r246769 in r246782, as the compiler looks to be ICEing on MultiSource/Benchmarks/tramp3d-v4.