Page MenuHomePhabricator

[AArch64] Adds a pre-indexed paired Load/Store optimization for LDR-STR.
ClosedPublic

Authored by stelios-arm on Mar 24 2021, 8:50 AM.

Details

Summary

This patch merges STR<S,D,Q,W,X>pre-STR<S,D,Q,W,X>ui and LDR<S,D,Q,W,X>pre-LDR<S,D,Q,W,X>ui instruction pairs into a single STP<S,D,Q,W,X>pre and LDP<S,D,Q,W,X>pre instruction, respectively.

For each pair, there is a MIR test that verifies this optimization.


This was a missed opportunity in the AArch64 load/store optimiser for cases like this:

#define float32_t float
#define uint32_t unsigned

void test(float32_t * S, float32_t * D, uint32_t N) {
  for (uint32_t i = 0; i < N; i++) {
    D[i] = D[i] + S[i];
  }
}

When compiled with:

-Ofast -target aarch64-arm-none-eabi -mcpu=cortex-a55 -mllvm -lsr-preferred-addressing-mode=preindexed

It results in:

.LBB0_9:                                // =>This Inner Loop Header: Depth=1
        ldr     q0, [x11, #32]!
        ldr     q1, [x11, #16]
        subs    x12, x12, #8                    // =8
        ldr     q2, [x10, #32]!
        ldr     q3, [x10, #16]
        fadd    v0.4s, v2.4s, v0.4s
        fadd    v1.4s, v3.4s, v1.4s
        stp     q0, q1, [x11]
        b.ne    .LBB0_9

where:

ldr     q0, [x11, #32]!
ldr     q1, [x11, #16]

should be:

ldp	q0, q1, [x11, #32]!

Additionally for cases like:

define <4 x i32>* @strqpre-strqui-merge(<4 x i32>* %p, <4 x i32> %a, <4 x i32> %b) {
entry:
  %p0 = getelementptr <4 x i32>, <4 x i32>* %p, i32 2
  store <4 x i32> %a, <4 x i32>* %p0
  %p1 = getelementptr <4 x i32>, <4 x i32>* %p, i32 3
  store <4 x i32> %b, <4 x i32>* %p1
  ret <4 x i32>* %p0
}

It results in:

"strqpre-strqui-merge":                 // @strqpre-strqui-merge
        str     q0, [x0, #32]!
        str     q1, [x0, #16]
        ret

where the store instruction should be merged with:

stp	q0, q1, [x0, #32]!

This patch covers such cases including the various forms of STR<>pre/LDR<>pre.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
stelios-arm requested review of this revision.Mar 24 2021, 8:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 24 2021, 8:50 AM
stelios-arm added inline comments.Mar 24 2021, 9:07 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2386

For LDRQpre , MI.hasOrderedMemoryRef()results to true because the instruction has no memory reference information, and conservatively assumes it wasn't preserved. Therefore, I added:

&& MI.getOpcode() != AArch64::LDRQpre

to ignore it for this instruction. I suppose there is a better way of doing it, but I am not yet sure how.

2403–2409

Note for LDRQpre instructions it should be MI.getOperand(2).getReg(), and also BaseReg should be Register BaseReg = MI.getOperand(2).getReg(). This can be easily fixed, however it won’t do much because MI.modifiesRegister(BaseReg, TRI) will again result to true. Any suggestions?

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
1021

Mis-indentation. I am going to fix this is in an updated revision.

1306

Ditto.

Hi Stelios, many thanks for putting this together, good stuff.
I will do a code-review a bit later, but as there's potential for some corner cases here, first a testing question. Did you do a bootstrap build and e.g. ran the llvm test suite?

dmgreen added inline comments.Mar 25 2021, 5:23 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2255

We should try and add all the various forms of STR?pre/LDR?pre. Hopefully they all work the same way, with the same operands.

2386

Why does the LDRQpre have no memory operand?

2394

-> operand
Formatting looks a bit off, which might or might not be fixed by just running clang-format on the patch.

llvm/test/CodeGen/AArch64/strqpre-strqui-merge.mir
64 ↗(On Diff #333004)

This should presumably be the before-STPQpre code, that is then converted to a STPQpre by the pass.

You can often remove a lot of the stuff above, like the frameInfo and all the regBankSelected stuff.

And there is an update_mir_test_checks for generating check lines.

stelios-arm marked an inline comment as done.
stelios-arm retitled this revision from [AArch64] Adds a pre-indexed Load/Store optimization for LDRQ-STRQ. to [AArch64] Adds a pre-indexed paired Load/Store optimization for LDR-STR..
stelios-arm edited the summary of this revision. (Show Details)
  1. Added all the various forms of STR<>pre/LDR<>pre.
  2. Added additional test cases for the MIR tests to cover the various forms of STR<>pre/LDR<>pre.
  3. Added constraints so that it optimizes cases where the offset of the second LDR/STR<>ui is equal to the size of the destination register. Additionally, it only optimizes cases where the base register of the pre-index LDR/STRpre<> is not used or modified.
  4. Did a bootstrap build and ran the llvm test-suite on an AArch64 machine. Both the test-suite and regression tests results in no errors.
  5. Currently there is a hack to avoid the memoperands_empty() check for LDR<>pre instructions. This is because they are missing the load memory operand. See below:
early-clobber renamable $x1, renamable $w0 = LDRWpre killed renamable $x1, 20

instead it should look something similar to:

early-clobber renamable $x1, renamable $w0 = LDRWpre killed renamable $x1, 20 :: (load 4)

This is going to be addressed in another patch, and then this patch will be updated to remove the hack that is in place.

stelios-arm marked 2 inline comments as done.Apr 9 2021, 4:44 AM

Hi Stelios, many thanks for putting this together, good stuff.
I will do a code-review a bit later, but as there's potential for some corner cases here, first a testing question. Did you do a bootstrap build and e.g. ran the llvm test suite?

This was done for the second revision.

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2386

I added an explanation in the new revision (Point 5). This is going to be addressed in another patch and this patch will be updated accordingly.

stelios-arm added inline comments.Apr 9 2021, 4:50 AM
llvm/test/CodeGen/AArch64/arm64-memset-inline.ll
67

In case you are wondering why with the new patch this is changed to and STP here is the full check commands pre-patch:

; CHECK-LABEL: bzero_8_stack:
; CHECK:       // %bb.0:
; CHECK-NEXT:    str x30, [sp, #-16]! // 8-byte Folded Spill
; CHECK-NEXT:    .cfi_def_cfa_offset 16
; CHECK-NEXT:    .cfi_offset w30, -16
; CHECK-NEXT:    add x0, sp, #8 // =8
; CHECK-NEXT:    str xzr, [sp, #8]
; CHECK-NEXT:    bl something
; CHECK-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
; CHECK-NEXT:    ret
233

Similarly:

; CHECK-LABEL: memset_8_stack:
; CHECK:       // %bb.0:
; CHECK-NEXT:    str x30, [sp, #-16]! // 8-byte Folded Spill
; CHECK-NEXT:    .cfi_def_cfa_offset 16
; CHECK-NEXT:    .cfi_offset w30, -16
; CHECK-NEXT:    mov x8, #-6148914691236517206
; CHECK-NEXT:    add x0, sp, #8 // =8
; CHECK-NEXT:    str x8, [sp, #8]
; CHECK-NEXT:    bl something
; CHECK-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
; CHECK-NEXT:    ret
SjoerdMeijer added inline comments.Apr 9 2021, 11:12 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2406

Just make this dependent on D100215 and modify the code here accordingly. I guess that simply means removing the FIXME and the IsPreLD check.

Matt added a subscriber: Matt.Mon, Apr 12, 5:25 AM

Removed the hack that was used to avoid the memoperands_empty() check for LDR<>pre instructions.

It would be good to see some extra tests for various edge cases, like offsets near to the boundaries and different pairs of instructions being combined/not.

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2121

I feel like "Unscaled" instructions are a set of instructions on their own right. Can we rename the function to something like hasUnscaledLdStOffset to make it clear what it means now?

2403–2409

The IsPreLd can move to the outer if, and can be IsPreSt too? The register it is using isn't correct, but it's unpredictable for a writeback load/store to load/store the same register as the operand. So the check is OK to skip for preinc loads/stores.

2983

Should this get some updates to make it more precise for pre-inc pairs?

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
572

This is only used in one place, where it could be checking isPreLdSt on the original opcode?

627

+= 1?

1599–1600

Why does this not already handle the combining of PreLdStPair?

The existing code can combine in both directions. Presumably it's only valid for forward now?

1602

-> Additionally
-> operations

stelios-arm marked 5 inline comments as done.Fri, Apr 16, 8:54 AM
stelios-arm added inline comments.
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2983

Could be. Currently, this method is called for load/stores when getMemOperandWithOffset returns true. However, getMemOperandWithOffset has a call to getMemOpInfo:

// If this returns false, then it's an instruction we don't want to handle.
if (!getMemOpInfo(LdSt.getOpcode(), Scale, Width, Dummy1, Dummy2))
  return false;

Currently, the getMemOpInfo does not include the pre-inc instructions. Additionally, for the post-inc instructions, there is only one variant available for STRWpost/LDRWpost.

I am not sure why the other variants are not included, so I am bit skeptical If the pre-inc instructions should be added.

What do you think?

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
1599–1600

For example, the following instructions:

str w1 [x0, #20]!
str w2 [x0, #4]

Can be paired to:

Stp w1, w2, [x0, #20]!

The offset of the first and second instruction is 20 and 4, respectively. The offset stride is 4. Therefore, the check (Offset == MIOffset + OffsetStride) and (Offset + OffsetStride == MIOffset) will return false.

That’s is why it’s needed. And yes, for such cases it’s only valid for forward now, since the order of the instructions matters for this optimization.

  1. Added more test cases.
  2. Addressed the remarks.

I presume that everything that uses hasUnscaledLdStOffset is still OK? It would either handle pre loads/stores or not reach them as pre loads/stores?

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2384

Can these use isPreLd?

2388–2390

Is this assert valid/sensible for PreInc?

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
1015

Does this need the second condition in the if? Should that always be true if the first part is a pre?

1599–1600

OK, but it looks like the existing (Offset == MIOffset + OffsetStride) conditions could be true for preinc where we don't want them to be. Can we switch it around to be something like:

bools IsPreLdSt = isPreLdStPairCandidate(..)
if (!IsPreLdSt) {
  check conditions else continue
} else {
  check pre conditions else continue
}

That way we don't need the extra indenting, and the conditions don't get muddled together.

llvm/test/CodeGen/AArch64/ldrpre-ldr-merge.mir
401

Comments can be added here with ;

llvm/test/CodeGen/AArch64/strpre-str-merge.mir
268

-257?

If the strp has to be aligned, is it worth adding a test for +/-260 too?

329

store 4

358

What happens if this is 16? (Or 4)

390

I tried to come up with a list of tests. You have most of the covered, I also came up with these, some of which might be good to make sure are covered:
Given ldrqpre a, [b, c] and ldrqui d, [e, f]

  • q with d
  • load with store
  • a == b? No sure what happens then
  • b != e. That's probably tested naturally anyway.
  • second instruction is ldruqui.

There were some others but they sound less useful.

stelios-arm marked 8 inline comments as done.

Addressed the comments made by @dmgreen (Thanks for the comments!)

  1. Added more test cases
  2. Refactoring
  3. Added support for LDUR<>i<>Ri/STUR<>i
  4. Changed the code so that the same Pre Ld/St Opcodes are not candidates to merge/pair.
stelios-arm added inline comments.Thu, Apr 22, 7:15 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2388–2390

The assert is valid for pre-inc ld/st because MI.getOperand(1) is the destination register.

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
1015

If it reaches at this point, the second condition is redundant. Therefore, I will remove it.

llvm/test/CodeGen/AArch64/strpre-str-merge.mir
358

It exhibits the same behaviour.

358

They will not be merged.

dmgreen added inline comments.Mon, Apr 26, 4:27 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2388–2390

Valid - sure. But sensible? It's trying to check that the address operand is a Reg/FrameIndex. With Preinc, that will be shifted one operand. Can we make sure it checks the correct operand for those too?

2983

We may want to make it more precise, but for the moment it's probably fine so long as it's not going to get anything drastically wrong. It clusters memory during scheduling to potentially increase the number of combined loads later on.

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
1311

Maybe phrase this as "Opcodes match: If the opcodes are pre ld/st there is nothing more to check."

llvm/test/CodeGen/AArch64/ldrpre-ldr-merge.mir
447

store 8

616

According to this the limit of a LDP is 1008: https://godbolt.org/z/613xsozqP.
So 1024 is the first multiple of 16 that should be invalid.

dmgreen added inline comments.Mon, Apr 26, 5:22 AM
llvm/test/CodeGen/AArch64/ldrpre-ldr-merge.mir
18

Although I don't think it's an issue with this patch exactly, should this not have two MMO's? Either be :: (load 8) or :: (load 4) (load 4) ?

stelios-arm marked 4 inline comments as done.Fri, Apr 30, 4:11 AM
stelios-arm added inline comments.
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2388–2390

Oh yes, sure!

2983

I agree, maybe that's the next step.

llvm/test/CodeGen/AArch64/ldrpre-ldr-merge.mir
18

Yes, it should have "(load 4), (load 4)" instead.

616

According to this, the offset of LDR<>pre/STR<>pre is in range [-256,255]. Assuming that x∈[-256,255] and c = size of <S,D,Q,W,X> , for this type of optimization the resulted LDP<>pre/STP<>pre will be in range of [min(x mod c == 0), max(x mod c == 0)].

stelios-arm marked 3 inline comments as done.

Addressed the comments made by @dmgreen.

dmgreen accepted this revision.Fri, Apr 30, 6:07 AM

Thanks for adding all these tests. LGTM.

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2395

I think it's probably OK to check

assert((MI.getOperand(IsPreLdSt ? 2 : 1).isReg() ||
        MI.getOperand(IsPreLdSt ? 2 : 1).isFI())...
llvm/test/CodeGen/AArch64/ldrpre-ldr-merge.mir
18

OK. It's because they are identical, so are elided during the merging of memory operands. It's a little strange to not see both, but fine. You don't get any extra information out of having both in this case.

This revision is now accepted and ready to land.Fri, Apr 30, 6:07 AM
This revision was automatically updated to reflect the committed changes.
stelios-arm marked an inline comment as done.
nickdesaulniers added a comment.EditedTue, May 4, 4:10 PM

As a heads up, this commit seems to be causing some pretty spooky boot failures for aarch64 Linux kernels built with LTO (non-thin). We're trying to isolate the bug in https://github.com/ClangBuiltLinux/linux/issues/1368, but so far all we know is that modifications to AArch64::STRXpre seem to be solely responsible for the boot failures.

Hello. This is for the 64bit store? Hmm. That might mean that the address and the operands are being re-used. Maybe?

Are the linux builds easy to reproduce? Do you have details anywhere?

Do we have a test case for this example?

early-clobber renamable $x0 = STRXpre killed renamable $x1, killed renamable $x0, 24 :: (store 8)
STRXui killed renamable $x0, renamable $x0, 1 :: (store 8)

@nickdesaulniers This patch is a possible fix for the issue, do you mind testing if it's now OK?
@dmgreen I added one in the patch.

@nickdesaulniers This patch is a possible fix for the issue, do you mind testing if it's now OK?
@dmgreen I added one in the patch.

I tested that patch and it resolves the issue for me, thanks!

@nathanchance Great! Thanks for the update.

hans added a subscriber: hans.Wed, May 5, 9:22 AM

We tracked some test failures in Chromium down to this patch (https://crbug.com/1205459) and it appears the fix (D101888) fixed it for us too.