This is an archive of the discontinued LLVM Phabricator instance.

[SLP][AArch64] Implement lookahead operand reordering score of splat loads for AArch64
ClosedPublic

Authored by vporpo on Apr 12 2022, 3:24 PM.

Details

Summary

The original patch (https://reviews.llvm.org/D121354) targets x86 and adjusts
the lookahead score of splat loads ad they can be done by the movddup
instruction that combines the load and the broadcast and is cheap to execute.

A similar issue shows up on AArch64. The ld1r instruction performs a broadcast
load and is cheap to execute.

This patch implements the TargetTransformInfo hooks for AArch64.

Diff Detail

Event Timeline

vporpo created this revision.Apr 12 2022, 3:24 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 12 2022, 3:24 PM
vporpo requested review of this revision.Apr 12 2022, 3:24 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 12 2022, 3:24 PM

Thanks for looking into this! I was hoping to take a look at some point too, this has saved me a job :)

Do you have any performance results to suggest it is a good idea, past the obvious that it sounds like it should be better? Some very quick runs didn't look amazing, but perhaps they are in some ways unlucky.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2715

This code can be before the table. The table and the CostTableLookup should be together.

2717

I would expect it to have one Arg for a splat.

I guess this would not match the canonical representation for a splat load too, except from where it is being created from scalar code. Not sure what to do about that, but it seems unfortunate that the cost will go up after it has been vectorized.

%l = load i8, i8 *%p
%i = insertelement <16 x i8> poison, i8 %l, i32 0
%s = shufflevector <16 x i8> %i, <16 x i8> poison, <16 x i32> zeroinitializer
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
283

I'm not a fan of these "isLegal" methods for things that should just have a cost.
It should either be passed a VectorType or the NumElements should be an ElementCount, to allow for scalable vector types.

285

The "legal" types for NEON would be something where the legalized type is one of
{v8i8, v16i8, v4i16, v8i16, v2i32, v4i32, v1i64, v2i64}.
But also the floating point equivalents of them. I guess I don't really know what "legal" means here. The X86 version of this method seems to account for NumElements, but if a v4i64 can just be legalized to two v2i64's, it can still generate a splatload efficiently and copy that to another value.

Hmm. For now lets say that the ElementCount size needs to be one of {8, 16, 32, 64}, and the size of the vector would be at least 64bits. That should rule out types we don't have at least, and larger vectors can still be treated as cheap.

vporpo marked an inline comment as done.Apr 20 2022, 8:28 AM

Thank you Dave for the review.

Do you have any performance results to suggest it is a good idea, past the obvious that it sounds like it should be better? Some very quick runs didn't look amazing, but perhaps they are in some ways unlucky.

Yes, I tried this on an AArch64 machine. It improves the BM_Dot_RealComplex_EigenDotFixed<double_16> test from the Eigen benchmark by more than 10%. But this only tests for a 2 x double type. I am not sure how it performs for other types.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2717

Yes, I think this won't be matched. Args is meant to be used for scalar code, but that's a great point, we could improve this in a future patch.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
283

I agree, ideally this should be handled transparently by cost functions. The reason why we need this is that we are not currently using the TTI cost model functions for the operand reordering scores in getShallowScore(). So we use this function as a way to check whether the target supports this type of combined load + broadcast instructions.

Regarding using ElementCount this needs some refactoring on the X86 side too, so I will upload one refactoring patch before this, with some of these changes.

285

"Legal" means that we can efficiently generate an instruction that handles a load + broadcast. In x86 this is done with the movddup instruction which seems to do this efficiently for two 64-bit elements, which is what gets accepted by isLegalBroadcast().

The ld1r instruction in AArch64 seems to support a wide range of type, like the ones you listed, though I am not sure how efficient all of these are. I guess we can accept the element count sizes you propose {8, 16, 32, 64}, which sounds right, or we can stick to a 2 x double that we know for sure that it works well.

vporpo updated this revision to Diff 423930.Apr 20 2022, 9:14 AM

Addressed comments.

fhahn added inline comments.Apr 20 2022, 10:39 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
285

It looks like there is at least some test coverage missing, e.g. I think we need a test with scalable vectors and with different element types other than double.

dmgreen added inline comments.Apr 20 2022, 11:26 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2616

This assert looks like it will fire a lot of the time. Should we use IsLoad && isLegalBroadcastLoad(...)?

That might make this testable from the cost model too, Even if it's slightly unorthodox to use a vector load for such cases.

2717

Three instructions patterns are tough to cost-model nicely is llvm at the moment. It needs too much code to check down through uses and up through operands.

The first part of my comment was meaning that Args should only have 1 element for a splat.

IsLoad = Args.size() == 1 && isa<LoadInstr>(Args[0]);

Do we expect it get called when there are multiple loads too?

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
285

It is the "efficiently generate" that gets me, it being a pretty imprecise term. Would an architecture that uses two operations count? Many architectures will split the operation into two micro-ops in any case. Why not just give it a cost like other functions. The reciprocal throughput is 1. It's more clear what that means compared to other instructions. The other isLegalMaskedLoad/isLegalMaskedGather functions I can see a place for - they effectively tell you what the canonical representation for a gather should be - should it be treated as a vector operation that has vector optimizations on it or as a group of scalar operations that go through scalar optimizations.

Ignore me though. It's a fairly minor complaint, and it working is much better than it not working :)

285

I don't think we _can_ test scalable vectors, this is only callable from the slp vectorizer, unfortunately.

vporpo added inline comments.Apr 20 2022, 2:23 PM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2616

It is already guarded by if (IsLoad) so I guess IsLoad && is not needed.

That might make this testable from the cost model too, Even if it's slightly unorthodox to use a vector load for such cases.

I am not sure I understand. Is this about the assertion?

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
285

Yeah, I am not sure how I would test scalable vectors, given that the test will run through the SLP vectorizer.

I added a couple more tests for float, i32 and i64. I tried writing a test for i16 and i8 but I think it is not possible to expose the issue with the current operand reordering. These require 4x or 8x vectors respectively and a deep reduction tree. SLP would need to perform deep operand reordering across the leaf nodes of the deep reduction tree, which I think we don't support currently.

vporpo updated this revision to Diff 424031.Apr 20 2022, 2:27 PM

Updated test.

dmgreen added inline comments.Apr 21 2022, 4:09 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2616

I meant remove the assert and turn it into a condition. This function might be checking Types which we do not consider to be legal. https://godbolt.org/z/dv56WMaP7 has an example (apparently :) ).

We could presumably have a test in llvm/test/Analysis/CostModel/AArch64 that tests load <vector>, splat-shuffle, and it will trigger this and be treated as free?

2717

Do we need to check more than 1 arg?

vporpo added inline comments.Apr 21 2022, 6:25 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2616

Oops, yes this is totally broken, sorry about that. I had just copied this part of the code from X86 TTI where we have a cost table that checks the types but I skipped the table. This assertion is meant to check that the table and isLegalBroadcastLoad are in sync.
To fix this I can either add a cost table like this:

static const CostTblEntry NeonBroadcastLoadTbl[] = { 
   {TTI::SK_Broadcast, MVT::v8i8, 0 },
   {TTI::SK_Broadcast, MVT::v16i8, 0 },
   {TTI::SK_Broadcast, MVT::v4i16, 0 },
   {TTI::SK_Broadcast, MVT::v8i16, 0 },
   {TTI::SK_Broadcast, MVT::v2i32, 0 },
   {TTI::SK_Broadcast, MVT::v4i32, 0 },
   {TTI::SK_Broadcast, MVT::v2f64, 0},
   {TTI::SK_Broadcast, MVT::v4f32, 0 },
};

using a similar logic as in X86TargetTransformInfo.cpp:1558, or rely on an if (isLegalBroadcastLoad()).
I think adding the table makes it a bit more explicit, and I would prefer it. What do you think?

Btw would you happen to know if a v2f32 broadcast is handled by ld1r in neon? I think this is the only 64-bit entry missing from the table.

2717

Passing the whole vector seems a bit more natural from the SLP point of view. But yes, I agree, a single argument when Kind == TTI::SK_Broadcast would probably make more sense from the TTI point of view. This needs changes in the x86 side too, so I will update it in a separate patch.

dmgreen added inline comments.Apr 21 2022, 6:47 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2616

v2f32 should work OK - there is nothing very different between a v2f32 and a v2i32 load. There are also fp16 types and possibly bf16 types (but they might not work at the moment).

My slight preference would probably be towards re-using isLegalBroadcastLoad because they are then insync by design and we don't need to repeat the logic. Up to you though.

2717

I was expecting the operands to be the two inputs of the shuffle (or the scalar equivalents that would become those operands). But if it doesn't work that way, then it sounds OK to check them all.

vporpo added inline comments.Apr 21 2022, 8:24 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2717

Let me rephrase what I mean because I think I misunderstood your comments earlier. I think you raised several issues (please correct me if I am wrong):
(i) The number of Args passed to getShuffleCost() and whether we need a second one. Having one Args is enough for a splat, but yes I agree, in the general case getShuffleCost() should model any type of shuffle, which would require us to have two Args like getShuffleCost(..., Args1, Args2), one for each operand. For the splat case we could leave one of them empty.
(ii) Populating Args with only one element in case of a splat. I think this makes sense too.
(iii) Whether Args should be a vector of scalar operands or a single vector operand. Given that we are currently using this only from within SLP, I think we can stick to a vector of scalar operands for now. But I guess it could be useful to support both.

Since these are TTI design issues I think they should be part of a separate patch. I will work on a patch for (i) and (ii).

dmgreen added inline comments.Apr 21 2022, 8:35 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2717

Yeah it can be separately to this patch. I was expecting it to work like getArithmeticInstrCost which takes an array of the Args representing the operands for the instruction (which from the loop vectorizer will be the scalar instructions that will be converted to vector operands, for example). The getIntrinsicCost does the same with the Args array.

vporpo added inline comments.Apr 21 2022, 9:25 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2616

We could presumably have a test in llvm/test/Analysis/CostModel/AArch64 that tests load <vector>, splat-shuffle, and it will trigger this and be treated as free?

This won't work with the current code. TTI's getUserCost() won't pass shuffle's operands to TargetTTI->getShuffleCost(). This would have to be part of the refactoring patches.

vporpo updated this revision to Diff 424232.Apr 21 2022, 9:28 AM

Fixed getShuffleCost logic related to isLegalBroadcastLoad().

dmgreen accepted this revision.Apr 22 2022, 2:38 AM

Thanks. This sounds like a sensible idea, even if some of the benchmarks I have don't show it, it still LGTM.

This revision is now accepted and ready to land.Apr 22 2022, 2:38 AM
This revision was landed with ongoing or failed builds.Apr 22 2022, 7:48 AM
This revision was automatically updated to reflect the committed changes.