Our initial motivating case was memcpy's with alignments > 16. The loads/stores, to which small memcpy's expand, are kept together in several places so that we get a sequence like this for a 64 bit copy:
LD w0
LD w1
ST w0
ST w1
The load/store optimiser can generate a LDP/STP w0, w1 from this because the registers read/written are consecutive. In our case however, the sequence is optimised during ISel, resulting in:
LD w0
ST w0
LD w0
ST w0
This instruction reordering allows reuse of registers. Since the registers are no longer consecutive (i.e. they are the same), it inhibits LDP/STP creation. The approach here is to perform renaming:
LD w0
ST w0
LD w1
ST w1
to enable the folding of the stores into a STP. We do not yet generate the LDP due to a limitation in the renaming implementation, but plan to look at that in a follow-up so that we fully support this case. While this was initially motivated by certain memcpy's, this is a general approach and thus is beneficial for other cases too, as can be seen in some test changes.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Hi Meera,
Thanks for working on this.
Correct me if I am wrong, but I think the problem here is that the LDP/STP creation is very fragile. The load/store optimiser relies on the pairs to be in order, otherwise it won't recognise them. For loads/stores with a bigger alignment (than usual), there is an optimisation in ISel that optimises the chain, and removes the TokenFactor that normally glues these things together, which is sensible optimisation. There have been some attempts to fix to keep the loads/stores for memcpy together in different places, but that is part of the fragile story. Your approach here is to perform renaming earlier for a sequence like this:
LDR w1 STR w1 LDR w1 STR w1
So that we get:
LDR w1 STR w1 LDR w2 STR w2
and can create a LDP w1, w2 and STP w1, w2.
I think it is worth clarifying some of these things, both in the description of the patch and a code comment.
Ideally we want to generate a LDP and STP for the test cases that you added. Why do we not yet get the LDP?
Speaking about the tests, I think they should be MIR tests. Then we see better what's going on, like the order of instructions etc.
Hi Sjoerd,
I can definitely add a comment in and some more information to the description as well in order to explain in more detail what is going on. I will also add a MIR test as well.
We don't get the LDP since it fails at the very start of this function (called on line 1529) since FirstMI is a load instruction:
static bool canRenameUpToDef(MachineInstr &FirstMI, LiveRegUnits &UsedInBetween, SmallPtrSetImpl<const TargetRegisterClass *> &RequiredClasses, const TargetRegisterInfo *TRI) { if (!FirstMI.mayStore()) return false; ... }
Because this returns false, the renaming of registers isn't attempted for the loads and so they don't get recognised as a pair. It is something I still need to look into in more detail to see if there is something else we can try.
Okay, I see. I am not that familiar with that helper function, but this is a minor code reshuffle and a codegen improvement, so I would be okay to address the LDPs in a follow up if that is more convenient.
Added the inline comment to explain the changes, also added an MIR test as well as changing the existing test to show the difference when aarch64-load-store-renaming is enabled.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp | ||
---|---|---|
37 | ... and here, makes this look like a not very general approach. I was hoping that we wouldn't need these alignment checks. We would possibly optimise other cases too, but that would be good. Or is there something else going on? | |
54 | Hmmmm, this hard coded 16 here.... | |
llvm/test/CodeGen/AArch64/memcpy.ll | ||
1 ↗ | (On Diff #349590) | Now that we have the MIR tests, I don't think we need these llc tests, it doesn't add anything, so you can just remove this file. |
Removed llc test and also removed alignment check for values greater than 16. Updated the tests that failed because of this change with the new assembly now that the STP instruction is being generated.
This does not seem memcpy specific. If that's the case, could you please update the title/description of the patch? Also the test cases in llvm/test/CodeGen/AArch64/memcpy.mir seem to specifically test additional renaming cases, so it might be good to put them into the existing llvm/test/CodeGen/AArch64/stp-opt-with-renaming.mir or into a file named along those lines. it would also be good to include a brief comment with the tests what is different to the existing re-naming tests.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp | ||
---|---|---|
40 | The comment and the name seem to indicate that this function checks whether a register has been renamed or not, but I don't think that's the case. The function checks if ice can rename from FirstMI to the beginning of the block and if it can, then it updates Flags with a register that can be used for renaming, if such a register has been found. |
Changed patch title, moved tests to stp-opt-with-renaming.mir and corrected the function name and comment for the helper function in AArch64LoadStoreOptimizer.cpp
LGTM
Perhaps you can elaborate a little bit in the commit message:
When the order of loads and stores has been optimized during ISel, it prevents LDPs/STPs from being generated, so we attempt to rename registers at an earlier stage so that the instructions can be recognised as a pair in this order.
It's not clear which loads/stores. Perhaps you can expand a little bit on this, and say something along the lines of:
Our motivating case initially were memcpy's with an alignment > 16. The loads/stores to which small memcpy's expand to, are tried to be kept together in several places so that we get a sequence like this for a 64 bits copy:
LD w0 LD w1 ST w0 ST w1
The load/store optimiser can fold this in a LDP/STP w0, w1 because the registers read/written are consecutive. In our case however, the loads and stores chains has been optimised during ISel so that we end up with a sequence like this:
LD w0 ST w0 LD w0 ST w0
This instruction reordering/scheduling allows reuse of registers, and since the registers of the loads/stores are no longer consecutive (i.e. they are the same), it inhibits LDP/STP creation. The approach here is to perform renaming:
LD w0 ST w0 LD w1 ST w1
enabling the folding of the stores into a STP. We do not yet generate the LDP due to a limitation in the renaming implementation, but plan to look at that in a follow up so that we fully support this case. And while this was initially motivated by certain memcpy's, this is a general approach and thus beneficial for other cases too as can be seen in some test changes.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp | ||
---|---|---|
48 | Nit: yet -> yet. |
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp | ||
---|---|---|
1707 | Is there a reason to do the whole check for rename registers twice, or could we determine whether re-naming is needed once for all cases? In particular, by doing it at this state, we may miss out on some legality checks, e.g. the one at line 1734 to check if the base register has been modified? Can you add a test case where the base is modified? | |
1708 | nit: I don't think this matches LLVM's coding style (should be camel case with upper case start, https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly) | |
llvm/test/CodeGen/AArch64/stp-opt-with-renaming.mir | ||
548 | IIUC the *real* difference between the existing tests in the file and the new ones is that in the new ones, the merge-able STR instructions have the same register as stored operand? If that's the case, it would be good to clarify the comment, as in the other cases things could also originate from re-ordering. | |
554 | can you also add a negative test, like one where there is no register available for renaming? I think you can do that by adding a couple of live-ins. | |
564 | please reduce the values here, most are not needed (see the tests above). |
@fhahn in response to your inline comment about performing the renaming registers check twice, if we don't have it we won't try to rename the registers at all, since the check on line 1764 is not reached in our case as we never enter the if statement that contains it. So we brought it earlier to make sure that it tries to rename the registers, if it can. I'll follow-up the rest of the comments in a new diff.
We are seeing stage 2 clang build crashes on arm64 platform and confirmed it is caused by this change after bisecting. Could you revert the change please?
The problem I'm seeing can be reproduced with this source file: https://martin.st/temp/cdf-preproc.c
To reproduce:
$ clang -target aarch64-w64-mingw32 -S -O3 cdf-preproc.c
One difference in the generated asm looks like this, where I think the error is visible:
ldrh w10, [x2, #6032] strh wzr, [x1, #6034] strh w10, [x1, #6032] - ldr q0, [x2, #16] - str q0, [x1, #16] - ldr q0, [x2] + ldp q0, q1, [x2] strh wzr, [x1, #24] - str q0, [x1] - ldr q0, [x2, #48] - str q0, [x1, #48] - ldr q0, [x2, #32] + stp q0, q1, [x1] + ldp q0, q1, [x2, #32]
Previously the strh wzr, [x1, #24] was done after the str q0, [x1, #16] (so it would overwrite one part of the written qword with zeros), but now the strh remains intact and is done before the new stp q0, q1, [x1].
As I mentioned inline, this seems to be skipping a legality check (around line 1734), which is probably the source of the mis-compiles (please add a test for that).
I understand the motivation about trying to find a rename register earlier, but I was wondering if we can detect the need for renaming at a single (possible new) place. It may not be possible, because the register to check for renaming depends on whether we merge forward or backwards.
We were also seeing several downstream test failures on arm64 after this. Thanks for the speedy revert!
Added checks to ensure that STPs are only generated when the base register is not in use or has not been modified. Also added test cases when the base register has been modified and when there are no registers available for renaming.
Besides some nits/observations inline, one question about testing. The reproducer that was attached does not get miscompiled anymore? And it is exactly one of these cases that was added, i.e. the modified base register? Just asking to see if there's not another case or test that we are missing.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp | ||
---|---|---|
39 | Diffing with previous revision does not seem to work, so just double checking this. That seems okay to be me now. | |
45 | Style nit: can we return early here to reduce indentation: if (!DebugCounter::shouldExecute(RegRenamingCounter)) return false; | |
llvm/test/CodeGen/AArch64/consthoist-gep.ll | ||
2–1 | Unrelated nit: perhaps align the assembly of this block for readability, indentation is all over the place | |
llvm/test/CodeGen/AArch64/stp-opt-with-renaming.mir | ||
121 | And here you have added a lot of registers as live-in, which results in no registers being available for renaming. Nice one, that looks good. | |
156 | Yep, and here the base register is modified. Looks good. |
Fixed the style points raised in the helper function updateFlagsWithRenameReg and in the test consthoist-gep.ll
@SjoerdMeijer, in response to your questions, yes I added the check to see if a base register is used/modified on line 1698/1699 so that we still perform that legality check as we won't reach the one on 1726 if the renaming of registers is successful. Also, the modified base register test is taken from the reproducer and so it should no longer generate extra, unwanted STPs.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp | ||
---|---|---|
36–1 | Would it just be possible to check for renaming here, similar to like it is done below? |
@fhahn I don't think that would be possible, at least not in our case, since it won't actually reach that line.
There still are issues that are reproducible with this particular sample. (I'm also seeing miscompilations still in some other samples that I haven't narrowed down yet, hoping that they just are more cases of this same issue.) The diff of the assembly output of that particular sample looks like this:
--- good.s 2021-06-23 09:42:24.381499762 +0300 +++ bad.s 2021-06-23 09:42:40.521154128 +0300 @@ -1955,17 +1955,14 @@ ldrh w10, [x2, #5756] strh wzr, [x1, #5758] strh w10, [x1, #5756] - ldr q0, [x2, #976] + ldr q1, [x2, #976] add x10, x1, #960 // =960 - str q0, [x1, #976] ldr q0, [x2, #960] strh wzr, [x10, #30] - str q0, [x1, #960] - ldr q0, [x2, #1008] - str q0, [x1, #1008] - ldr q0, [x2, #992] + stp q0, q1, [x1, #960] + ldp q0, q2, [x2, #992] strh wzr, [x10, #62] - str q0, [x1, #992] + stp q0, q2, [x1, #992] ldr q0, [x2, #1040] str q0, [x1, #1040] ldr q0, [x2, #1024]
Note that x10 = x1 + 960. The strh wzr, [x10, #30] clears a halfword at [x1, #990] (i.e. bytes x1[990-991]); this used to be done after str q0, [x1, #976] (which writes x1[976-991]), but now this write is done before the stp q0, q1, [x1, #960] (which writes x1[960-991]).
The question is *why* does it not reach that code? I think we need to understand this first and provide reasons why skipping the other legality checks is fine. As the mis-compiles indicate, skipping the legality checks does not work well. Would it be possible to adjust the places that need & can legally do renaming in a more targeted approach?
After taking a look into it, it doesn't reach that line because !mayAlias(MI, MemInsns, AA) will return false. I think that is the only check that is missing. But I'm not sure how we could adjust the place where renaming originally takes place as !mayAlias will return false for the store instructions we are trying to pair.
... and here, makes this look like a not very general approach.
I was hoping that we wouldn't need these alignment checks. We would possibly optimise other cases too, but that would be good. Or is there something else going on?