This is an archive of the discontinued LLVM Phabricator instance.

[AArch64LoadStoreOptimizer] Generate more STPs by renaming registers earlier
ClosedPublic

Authored by MeeraN on Jun 3 2021, 1:48 AM.

Details

Summary

Our initial motivating case was memcpy's with alignments > 16. The loads/stores, to which small memcpy's expand, are kept together in several places so that we get a sequence like this for a 64 bit copy:
LD w0
LD w1
ST w0
ST w1
The load/store optimiser can generate a LDP/STP w0, w1 from this because the registers read/written are consecutive. In our case however, the sequence is optimised during ISel, resulting in:
LD w0
ST w0
LD w0
ST w0
This instruction reordering allows reuse of registers. Since the registers are no longer consecutive (i.e. they are the same), it inhibits LDP/STP creation. The approach here is to perform renaming:
LD w0
ST w0
LD w1
ST w1
to enable the folding of the stores into a STP. We do not yet generate the LDP due to a limitation in the renaming implementation, but plan to look at that in a follow-up so that we fully support this case. While this was initially motivated by certain memcpy's, this is a general approach and thus is beneficial for other cases too, as can be seen in some test changes.

Diff Detail

Event Timeline

MeeraN created this revision.Jun 3 2021, 1:48 AM
MeeraN requested review of this revision.Jun 3 2021, 1:48 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 3 2021, 1:48 AM

Hi Meera,

Thanks for working on this.

Correct me if I am wrong, but I think the problem here is that the LDP/STP creation is very fragile. The load/store optimiser relies on the pairs to be in order, otherwise it won't recognise them. For loads/stores with a bigger alignment (than usual), there is an optimisation in ISel that optimises the chain, and removes the TokenFactor that normally glues these things together, which is sensible optimisation. There have been some attempts to fix to keep the loads/stores for memcpy together in different places, but that is part of the fragile story. Your approach here is to perform renaming earlier for a sequence like this:

LDR w1
STR w1
LDR w1
STR w1

So that we get:

LDR w1
STR w1
LDR w2
STR w2

and can create a LDP w1, w2 and STP w1, w2.

I think it is worth clarifying some of these things, both in the description of the patch and a code comment.

Ideally we want to generate a LDP and STP for the test cases that you added. Why do we not yet get the LDP?

Speaking about the tests, I think they should be MIR tests. Then we see better what's going on, like the order of instructions etc.

MeeraN added a comment.Jun 3 2021, 2:35 AM

Hi Sjoerd,

I can definitely add a comment in and some more information to the description as well in order to explain in more detail what is going on. I will also add a MIR test as well.

Ideally we want to generate a LDP and STP for the test cases that you added. Why do we not yet get the LDP?

We don't get the LDP since it fails at the very start of this function (called on line 1529) since FirstMI is a load instruction:

 static bool
 canRenameUpToDef(MachineInstr &FirstMI, LiveRegUnits &UsedInBetween,
                  SmallPtrSetImpl<const TargetRegisterClass *> &RequiredClasses,
                  const TargetRegisterInfo *TRI) {
   if (!FirstMI.mayStore())
     return false;

   ...
}

Because this returns false, the renaming of registers isn't attempted for the loads and so they don't get recognised as a pair. It is something I still need to look into in more detail to see if there is something else we can try.

Matt added a subscriber: Matt.Jun 3 2021, 8:26 AM

Ideally we want to generate a LDP and STP for the test cases that you added. Why do we not yet get the LDP?

We don't get the LDP since it fails at the very start of this function (called on line 1529) since FirstMI is a load instruction:

 static bool
 canRenameUpToDef(MachineInstr &FirstMI, LiveRegUnits &UsedInBetween,
                  SmallPtrSetImpl<const TargetRegisterClass *> &RequiredClasses,
                  const TargetRegisterInfo *TRI) {
   if (!FirstMI.mayStore())
     return false;

   ...
}

Because this returns false, the renaming of registers isn't attempted for the loads and so they don't get recognised as a pair. It is something I still need to look into in more detail to see if there is something else we can try.

Okay, I see. I am not that familiar with that helper function, but this is a minor code reshuffle and a codegen improvement, so I would be okay to address the LDPs in a follow up if that is more convenient.

MeeraN updated this revision to Diff 349590.Jun 3 2021, 9:46 AM
MeeraN edited the summary of this revision. (Show Details)

Added the inline comment to explain the changes, also added an MIR test as well as changing the existing test to show the difference when aarch64-load-store-renaming is enabled.

SjoerdMeijer added inline comments.Jun 4 2021, 12:27 AM
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
1712

Hmmmm, this hard coded 16 here....

1722

... and here, makes this look like a not very general approach.

I was hoping that we wouldn't need these alignment checks. We would possibly optimise other cases too, but that would be good. Or is there something else going on?

llvm/test/CodeGen/AArch64/memcpy.ll
1 ↗(On Diff #349590)

Now that we have the MIR tests, I don't think we need these llc tests, it doesn't add anything, so you can just remove this file.

MeeraN updated this revision to Diff 350334.Jun 7 2021, 9:44 AM

Removed llc test and also removed alignment check for values greater than 16. Updated the tests that failed because of this change with the new assembly now that the STP instruction is being generated.

fhahn added a comment.Jun 7 2021, 9:56 AM

This does not seem memcpy specific. If that's the case, could you please update the title/description of the patch? Also the test cases in llvm/test/CodeGen/AArch64/memcpy.mir seem to specifically test additional renaming cases, so it might be good to put them into the existing llvm/test/CodeGen/AArch64/stp-opt-with-renaming.mir or into a file named along those lines. it would also be good to include a brief comment with the tests what is different to the existing re-naming tests.

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
1521

The comment and the name seem to indicate that this function checks whether a register has been renamed or not, but I don't think that's the case.

The function checks if ice can rename from FirstMI to the beginning of the block and if it can, then it updates Flags with a register that can be used for renaming, if such a register has been found.

MeeraN updated this revision to Diff 350581.Jun 8 2021, 5:43 AM
MeeraN retitled this revision from [AArch64] Modified AArch64LoadStoreOptimizer to generate STP instructions for memcpys to [AArch64LoadStoreOptimizer] Generate more STPs by renaming registers earlier.
MeeraN edited the summary of this revision. (Show Details)

Changed patch title, moved tests to stp-opt-with-renaming.mir and corrected the function name and comment for the helper function in AArch64LoadStoreOptimizer.cpp

SjoerdMeijer accepted this revision.Jun 9 2021, 12:46 AM

LGTM

Perhaps you can elaborate a little bit in the commit message:

When the order of loads and stores has been optimized during ISel, it prevents LDPs/STPs from being generated, so we attempt to rename registers at an earlier stage so that the instructions can be recognised as a pair in this order.

It's not clear which loads/stores. Perhaps you can expand a little bit on this, and say something along the lines of:

Our motivating case initially were memcpy's with an alignment > 16. The loads/stores to which small memcpy's expand to, are tried to be kept together in several places so that we get a sequence like this for a 64 bits copy:

LD w0
LD w1
ST w0
ST w1

The load/store optimiser can fold this in a LDP/STP w0, w1 because the registers read/written are consecutive. In our case however, the loads and stores chains has been optimised during ISel so that we end up with a sequence like this:

LD w0
ST w0
LD w0
ST w0

This instruction reordering/scheduling allows reuse of registers, and since the registers of the loads/stores are no longer consecutive (i.e. they are the same), it inhibits LDP/STP creation. The approach here is to perform renaming:

LD w0
ST w0
LD w1
ST w1

enabling the folding of the stores into a STP. We do not yet generate the LDP due to a limitation in the renaming implementation, but plan to look at that in a follow up so that we fully support this case. And while this was initially motivated by certain memcpy's, this is a general approach and thus beneficial for other cases too as can be seen in some test changes.

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
1706

Nit: yet -> yet.

This revision is now accepted and ready to land.Jun 9 2021, 12:46 AM
This revision was landed with ongoing or failed builds.Jun 9 2021, 4:33 AM
This revision was automatically updated to reflect the committed changes.
fhahn added inline comments.Jun 9 2021, 5:24 AM
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
49

Is there a reason to do the whole check for rename registers twice, or could we determine whether re-naming is needed once for all cases? In particular, by doing it at this state, we may miss out on some legality checks, e.g. the one at line 1734 to check if the base register has been modified?

Can you add a test case where the base is modified?

50

nit: I don't think this matches LLVM's coding style (should be camel case with upper case start, https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly)

llvm/test/CodeGen/AArch64/stp-opt-with-renaming.mir
548 ↗(On Diff #350850)

IIUC the *real* difference between the existing tests in the file and the new ones is that in the new ones, the merge-able STR instructions have the same register as stored operand? If that's the case, it would be good to clarify the comment, as in the other cases things could also originate from re-ordering.

554 ↗(On Diff #350850)

can you also add a negative test, like one where there is no register available for renaming? I think you can do that by adding a couple of live-ins.

564 ↗(On Diff #350850)

please reduce the values here, most are not needed (see the tests above).

MeeraN added a comment.Jun 9 2021, 7:30 AM

@fhahn in response to your inline comment about performing the renaming registers check twice, if we don't have it we won't try to rename the registers at all, since the check on line 1764 is not reached in our case as we never enter the if statement that contains it. So we brought it earlier to make sure that it tries to rename the registers, if it can. I'll follow-up the rest of the comments in a new diff.

FYI I've bisected down a misoptimization to this particular commit

haowei added a subscriber: haowei.Jun 10 2021, 12:09 AM

We are seeing stage 2 clang build crashes on arm64 platform and confirmed it is caused by this change after bisecting. Could you revert the change please?

The problem I'm seeing can be reproduced with this source file: https://martin.st/temp/cdf-preproc.c

To reproduce:

$ clang -target aarch64-w64-mingw32 -S -O3 cdf-preproc.c

One difference in the generated asm looks like this, where I think the error is visible:

        ldrh    w10, [x2, #6032]
        strh    wzr, [x1, #6034]
        strh    w10, [x1, #6032]
-       ldr     q0, [x2, #16]
-       str     q0, [x1, #16]
-       ldr     q0, [x2]
+       ldp     q0, q1, [x2]
        strh    wzr, [x1, #24]
-       str     q0, [x1]
-       ldr     q0, [x2, #48]
-       str     q0, [x1, #48]
-       ldr     q0, [x2, #32]
+       stp     q0, q1, [x1]
+       ldp     q0, q1, [x2, #32]

Previously the strh wzr, [x1, #24] was done after the str q0, [x1, #16] (so it would overwrite one part of the written qword with zeros), but now the strh remains intact and is done before the new stp q0, q1, [x1].

Many thanks for the reproducer!
We are going to have a look at that.

@fhahn in response to your inline comment about performing the renaming registers check twice, if we don't have it we won't try to rename the registers at all, since the check on line 1764 is not reached in our case as we never enter the if statement that contains it. So we brought it earlier to make sure that it tries to rename the registers, if it can. I'll follow-up the rest of the comments in a new diff.

As I mentioned inline, this seems to be skipping a legality check (around line 1734), which is probably the source of the mis-compiles (please add a test for that).

I understand the motivation about trying to find a rename register earlier, but I was wondering if we can detect the need for renaming at a single (possible new) place. It may not be possible, because the register to check for renaming depends on whether we merge forward or backwards.

thakis added a subscriber: thakis.Jun 10 2021, 3:31 AM

We were also seeing several downstream test failures on arm64 after this. Thanks for the speedy revert!

MeeraN reopened this revision.Jun 15 2021, 3:18 AM
This revision is now accepted and ready to land.Jun 15 2021, 3:18 AM
MeeraN updated this revision to Diff 352084.Jun 15 2021, 3:24 AM
MeeraN edited the summary of this revision. (Show Details)

Added checks to ensure that STPs are only generated when the base register is not in use or has not been modified. Also added test cases when the base register has been modified and when there are no registers available for renaming.

Besides some nits/observations inline, one question about testing. The reproducer that was attached does not get miscompiled anymore? And it is exactly one of these cases that was added, i.e. the modified base register? Just asking to see if there's not another case or test that we are missing.

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
1526

Style nit: can we return early here to reduce indentation:

if (!DebugCounter::shouldExecute(RegRenamingCounter))
  return false;
1697

Diffing with previous revision does not seem to work, so just double checking this.
You have added here the check if the base reg has been modified, and is available here, similarly like that is done from line 1726, right?

That seems okay to be me now.

llvm/test/CodeGen/AArch64/consthoist-gep.ll
3

Unrelated nit: perhaps align the assembly of this block for readability, indentation is all over the place

llvm/test/CodeGen/AArch64/stp-opt-with-renaming.mir
666 ↗(On Diff #352084)

And here you have added a lot of registers as live-in, which results in no registers being available for renaming. Nice one, that looks good.

701 ↗(On Diff #352084)

Yep, and here the base register is modified. Looks good.

MeeraN updated this revision to Diff 352677.Jun 17 2021, 5:07 AM

Fixed the style points raised in the helper function updateFlagsWithRenameReg and in the test consthoist-gep.ll

@SjoerdMeijer, in response to your questions, yes I added the check to see if a base register is used/modified on line 1698/1699 so that we still perform that legality check as we won't reach the one on 1726 if the renaming of registers is successful. Also, the modified base register test is taken from the reproducer and so it should no longer generate extra, unwanted STPs.

fhahn added inline comments.Jun 17 2021, 5:18 AM
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
1746

Would it just be possible to check for renaming here, similar to like it is done below?

@fhahn I don't think that would be possible, at least not in our case, since it won't actually reach that line.

SjoerdMeijer accepted this revision.Jun 21 2021, 1:16 AM

LGTM, but please wait a day with committing this in case there are more comments.

This revision was landed with ongoing or failed builds.Jun 22 2021, 8:33 AM
This revision was automatically updated to reflect the committed changes.

The problem I'm seeing can be reproduced with this source file: https://martin.st/temp/cdf-preproc.c

To reproduce:

$ clang -target aarch64-w64-mingw32 -S -O3 cdf-preproc.c

There still are issues that are reproducible with this particular sample. (I'm also seeing miscompilations still in some other samples that I haven't narrowed down yet, hoping that they just are more cases of this same issue.) The diff of the assembly output of that particular sample looks like this:

--- good.s      2021-06-23 09:42:24.381499762 +0300
+++ bad.s       2021-06-23 09:42:40.521154128 +0300
@@ -1955,17 +1955,14 @@
        ldrh    w10, [x2, #5756]
        strh    wzr, [x1, #5758]
        strh    w10, [x1, #5756]
-       ldr     q0, [x2, #976]
+       ldr     q1, [x2, #976]
        add     x10, x1, #960                   // =960
-       str     q0, [x1, #976]
        ldr     q0, [x2, #960]
        strh    wzr, [x10, #30]
-       str     q0, [x1, #960]
-       ldr     q0, [x2, #1008]
-       str     q0, [x1, #1008]
-       ldr     q0, [x2, #992]
+       stp     q0, q1, [x1, #960]
+       ldp     q0, q2, [x2, #992]
        strh    wzr, [x10, #62]
-       str     q0, [x1, #992]
+       stp     q0, q2, [x1, #992]
        ldr     q0, [x2, #1040]
        str     q0, [x1, #1040]
        ldr     q0, [x2, #1024]

Note that x10 = x1 + 960. The strh wzr, [x10, #30] clears a halfword at [x1, #990] (i.e. bytes x1[990-991]); this used to be done after str q0, [x1, #976] (which writes x1[976-991]), but now this write is done before the stp q0, q1, [x1, #960] (which writes x1[960-991]).

fhahn added a comment.Jun 23 2021, 2:25 AM

@fhahn I don't think that would be possible, at least not in our case, since it won't actually reach that line.

The question is *why* does it not reach that code? I think we need to understand this first and provide reasons why skipping the other legality checks is fine. As the mis-compiles indicate, skipping the legality checks does not work well. Would it be possible to adjust the places that need & can legally do renaming in a more targeted approach?

The question is *why* does it not reach that code? I think we need to understand this first and provide reasons why skipping the other legality checks is fine. As the mis-compiles indicate, skipping the legality checks does not work well. Would it be possible to adjust the places that need & can legally do renaming in a more targeted approach?

After taking a look into it, it doesn't reach that line because !mayAlias(MI, MemInsns, AA) will return false. I think that is the only check that is missing. But I'm not sure how we could adjust the place where renaming originally takes place as !mayAlias will return false for the store instructions we are trying to pair.

dewen added a subscriber: dewen.Thu, Nov 23, 7:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptThu, Nov 23, 7:10 PM
Herald added a subscriber: StephenFan. · View Herald Transcript