Page MenuHomePhabricator

[llvm][sve] Lowering for VLS truncating stores

Authored by DavidTruby on Jun 17 2021, 9:58 AM.



This adds custom lowering for truncating stores when operating on
fixed length vectors in SVE.

Currently truncating stores are not used in certain cases where
the size of the vector is larger than the target vector width.

It also includes an AArch64-specific DAG combine to prevent a
regression where truncates would be merged in to stores before
being cancelled out with a prior extend; this DAG combine
merges the extends and truncating stores to obtain a normal store.

Diff Detail

Event Timeline

DavidTruby created this revision.Jun 17 2021, 9:58 AM
DavidTruby requested review of this revision.Jun 17 2021, 9:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 17 2021, 9:58 AM

This patch is making way too many seemingly unrelated code changes without much explanation. Is all of this really necessary for the initial patch?


Can we change this to make sure we only mark stores that we can actually lower "custom"? (i.e. the value type is a legal integer vector, and the memory type has a legal element type.) That should make the rest of the patch simpler, I think.


Why don't we want to optimize store_trunc_v2i64i8 when SVE registers are 128 bits wide?


Can we "uzp1 z0, z0, z1" or something like that in store_trunc_v16i32i16?

Matt added a subscriber: Matt.Jun 18 2021, 11:37 AM

This patch is making way too many seemingly unrelated code changes without much explanation. Is all of this really necessary for the initial patch?

Hi, thanks for the review! I think maybe I didn't explain that well in the summary why the extra changes are needed. I've hopefully added better explanations below.


I'm not sure what doing this would allow us to remove for the rest of the patch, could you point me at it?


This unrelated-appearing addition is necessary to prevent a regression in some code generation we already had.
In essence, sometimes we were generating an instruction, extending the result, truncating that and storing; we were relying on a later fold of an extend and truncate but since the fold into the truncating store happens first we were left with an extra extend then a truncating store.
This function removes that extra extend that we were generating in those cases.

I realise it's somewhat unrelated appearing to the rest of the patch but I didn't want to introduce a regression and change all the tests and then fix the regression in a separate patch and change the tests back.


The code generation of these legalized types is poor and needs fixing in a separate patch, as these should really be using truncating stores as well. These tests just check that legalised code is correct rather than good at the moment.

Looking again with your explanation, I understand a bit more how the pieces fit together; it just wasn't obvious at first glance.


Well, the change to guard LowerTruncateVectorStore is only necessary because of this, I think. And probably it messes with DAGCombine heuristics a little. But neither of those is a big deal, I guess.

I think it's still worth doing to illustrate the intent here, though; it's hard to someone reading this to understand why you're custom-lowering every possible truncstore.


Not sure what this change is doing here; I don't see any test coverage for masked stores?

DavidTruby added inline comments.Jun 28 2021, 5:51 PM

This is a mistake, I was working on masked truncating stores alongside this before deciding they should be two separate patches due to the size. Thanks for spotting it!

paulwalker-arm added inline comments.Jun 29 2021, 3:58 AM

Perhaps this should be moved into addTypeForFixedLengthSVE or at least part of it, then here we just need to worry about the 128 and 64bit vectors data vectors, which is consistent with how we handle the other nodes when it comes to SVE's fixed-length lowering support.

When doing this we can also speed up the InnerVT loop since we only want to care about a small subset of types (namely starting at i8 element types and doubling until we reach VT's element type. For example:

// if VT.isInteger
InnerVT = VT.changeVectorElementType(i8);
while (InnerVT != VT)
  setTruncStoreAction(VT, InnerVT, Custom);
  InnerVT = InnerVT.widenIntegerVectorElementType();

When is comes to the 128 and 64bit vectors data vectors support we may not even need a loop given we know the exact MVTs we want to custom lower.

DavidTruby edited the summary of this revision. (Show Details)Jun 29 2021, 5:27 AM

Fixes based on review comments

DavidTruby marked an inline comment as done.Jun 29 2021, 7:36 AM

I seem to be causing some failures on MIPS and AMDGPU with this patch and I can't really work out why. Don't suppose anyone has any pointers?


I couldn't use widenIntegerVectorElementType here because for some reason it requires a Context. I think what I've done should be equivalent though.

I seem to be causing some failures on MIPS and AMDGPU with this patch and I can't really work out why. Don't suppose anyone has any pointers?

This is obviously the DAGCombiner changes. Not sure what sort of pointers you're looking for; either the changes are desirable for mips/amdgpu, or they're not and the combine needs to be restricted somehow.

DavidTruby updated this revision to Diff 355926.Jul 1 2021, 9:55 AM

Here I've fixed the test breakages on MIPS and AMDGPU

I believe the new MIPS code generated is both correct and better
than the existing code generation but I'd like someone more
familiar with the architecture than me to verify that!

The AMDGPU change is more involved, as it revolves around the fact
that truncating stores of vectors are not supported on R600 from
what I understand. As such we don't want to tell the DAG combiner
that we want more truncating stores. I'm not really sure what
I've done here is the "correct" way to handle this, it feels a bit
hacky to me to be honest, so if anyone has any better suggestions
I would like to hear them

For MIPS, it looks like we're somehow eliminating zero-extension operations which aren't getting eliminated otherwise. Not sure why we don't manage to perform the transform otherwise, but seems fine. Actually, maybe this is an argument for generating truncstores even more aggressively than your patch; in theory, before operation legalization, we could do the transform even if it's "expand".

For R600, it looks like combining to a truncstore blocks an important target-specific DAGCombine; the combines don't support loadext/truncstore. See AMDGPUTargetLowering::shouldCombineMemoryType. You could try to fix the target-specific combines, or just add a target hook, I guess.


If you're going to add a target hook for this, please make it a separate function from isTruncStoreLegalOrCustom, so it's clear what it's doing.

efriedma added inline comments.Jul 1 2021, 12:56 PM

There's a potential infinite loop here: if a truncstore is "custom", and expands to trunc+store, combining never ends. You can avoid this by checking for LegalOperations.

DavidTruby updated this revision to Diff 356747.Tue, Jul 6, 8:59 AM

Renamed R600 target hook and added check for LegalOperations.

DavidTruby marked 2 inline comments as done.Tue, Jul 6, 8:59 AM
bsmith added inline comments.Tue, Jul 6, 9:03 AM

This test change looks incorrect

efriedma added inline comments.Tue, Jul 6, 9:51 AM
1012 ↗(On Diff #356747)

Not sure what this change is about?

DavidTruby updated this revision to Diff 357192.Thu, Jul 8, 4:33 AM

Fixed changes in tests caused by rebase

DavidTruby marked 2 inline comments as done.Thu, Jul 8, 4:34 AM
DavidTruby added inline comments.
1012 ↗(On Diff #356747)

Looks like this snuck in when I rebased, I've removed it as a change.

DavidTruby marked an inline comment as done.Thu, Jul 8, 4:36 AM
This revision is now accepted and ready to land.Thu, Jul 8, 11:02 AM
This revision was landed with ongoing or failed builds.Mon, Jul 12, 3:14 AM
This revision was automatically updated to reflect the committed changes.

Seems like this change caused a seg fault in chrome android PGO builds.

reproduces with: clang -cc1 -triple aarch64-unknown-linux-android21 -emit-obj -Oz t.cpp
and cpp file is attached

will revert this for now

@akhuang thanks for the reproducer, I've committed this again with a small fix that prevents that issue from occuring. In essence, I was not checking in the dag combine here if the store was an indexed store (at which point we can't perform the optimisation), I've added a check for that now and your example passes. If you see any more issues please revert again and let me know. Thanks!