This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Fix lookahead operand reordering for splat loads.
ClosedPublic

Authored by vporpo on Mar 9 2022, 9:45 PM.

Details

Summary

Splat loads are inexpensive on X86. For a 2-lane vector, we need just one
instruction: movddup (%reg), %xmm0. Using the standard Splat score leads
to worse code. This patch adds a new score dedicated to splat loads.

Please note that a splat usually takes three IR instructions:

  • Usually it is a load and two inserts:
%ld = load double, double* %gep
%ins1 = insertelement <2 x double> poison, double %ld, i32 0
%ins2 = insertelement <2 x double> %ins1, double %ld, i32 1
  • But it can also be a load, an insert and a shuffle:
%ld = load double, double* %gep
%ins = insertelement <2 x double> poison, double %ld, i32 0
%shf = shufflevector <2 x double> %ins, <2 x double> poison, <2 x i32> zeroinitializer

Because of this, some of the lit tests contain more IR instructions.
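For illustration, here is a minimal sketch of the kind of check the lookahead operand-reordering score can use; the helper name and the score values are assumptions, not the patch's exact code:

  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // Sketch: when both operand candidates are the same load, reward them with
  // a dedicated splat-load score, because the broadcast can be folded into
  // the load (e.g. movddup (%reg), %xmm0 on X86).
  static int getSplatScore(Value *V1, Value *V2) {
    constexpr int ScoreSplat = 1;      // assumed generic splat score
    constexpr int ScoreSplatLoads = 3; // assumed dedicated splat-load score
    if (V1 != V2)
      return 0;
    return isa<LoadInst>(V1) ? ScoreSplatLoads : ScoreSplat;
  }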

Diff Detail

Event Timeline

vporpo created this revision.Mar 9 2022, 9:45 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 9 2022, 9:45 PM
vporpo requested review of this revision.Mar 9 2022, 9:45 PM
RKSimon added inline comments.Mar 9 2022, 11:37 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1130

why test both V1 and V2 if we know they are the same value?

vporpo updated this revision to Diff 414286.Mar 10 2022, 12:21 AM
vporpo marked an inline comment as done.

Removed isa<LoadInst>(V2).

ABataev added inline comments.Mar 10 2022, 2:58 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1091

This is target-specific for x86. What about other targets?

lebedev.ri added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1130

This should be a lot more principled (i.e., query the actual cost model).
E.g., movddup is only for double, but I think this will also trigger for float?
Also, what about broadcasting loads that are available in AVX/AVX2/AVX512?

vporpo added inline comments.Mar 10 2022, 10:52 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1091

Please see the discussion below.

1130

I agree, this needs to be target-specific. I could add a new ShuffleKind, something like SK_BroadcastLoad. But this is not strictly a shuffle pattern; it is a combined Load + Shuffle. So this might need its own function, like TargetTransformInfo::getBroadcastLoadCost(VectorType *VecTy). What do you think?

RKSimon added inline comments.Mar 10 2022, 11:01 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1130

What about adding an IsLoad bool flag to TTI::getShuffleCost? Or a TTI::IsLegalBroadcastLoad (similar to the existing IsLegal*Load ops)?

vporpo added inline comments.Mar 10 2022, 11:09 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1130

Yeah, something like TTI::IsLegalBroadcastLoad may be more suitable for this since the score that we are using in the reordering is not an actual TTI cost.

vporpo updated this revision to Diff 414478.Mar 10 2022, 1:14 PM

Changed the logic to use TTI::isLegalBroadcastLoad().
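For context, a rough sketch of what such a hook and its X86 implementation could look like; the element-type/element-count signature is only suggested later in this review, so treat the exact form as an assumption:

  // TargetTransformInfo.h (sketch): can the target fold a broadcast of a
  // loaded element of type ElementTy into a single broadcast-load for
  // NumElements lanes?
  bool isLegalBroadcastLoad(Type *ElementTy, ElementCount NumElements) const;

  // X86TargetTransformInfo.cpp (sketch): movddup covers a <2 x double> splat.
  bool X86TTIImpl::isLegalBroadcastLoad(Type *ElementTy,
                                        ElementCount NumElements) const {
    return ST->hasSSE3() && !NumElements.isScalable() &&
           NumElements.getFixedValue() == 2 && ElementTy->isDoubleTy();
  }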

This is blocking our internal process, so any help in getting this reviewed would be greatly appreciated.

Ping (sorry for the repeated pings but this is quite urgent as it is blocking our process).

ABataev added inline comments.Mar 14 2022, 12:39 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
5117

I assume, you need to tweak the cost model for broadcast with loads.

5117–5121
return ST->hasSSSE3() && VecTy && VecTy->getElementCount().getKnownMinValue() == 2 &&
       VecTy->getElementType() == Type::getDoubleTy(VecTy->getContext());
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1090–1091

We don't care about the instruction count for SLP, but the throughput.

1124–1125

Maybe pass a scalar type and number of elements to avoid constructing vector type?

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
181–191

This block of code has throughput 2.5 instead of 2.0 before. I assume there are some other cases; this needs some extra investigation.

vporpo updated this revision to Diff 415261.Mar 14 2022, 4:13 PM
vporpo marked 2 inline comments as done.

Addressed comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1090–1091

Agreed, but usually code with fewer instructions is better for various reasons. How would you want me to rephrase this?

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
181–191

How did you come up with these throughput values?

The assembly code that comes out of llc for the original code is:

movl      8(%esp), %eax
movl      4(%esp), %ecx
vmovupd   (%ecx), %xmm0
vpermilpd $1, %xmm0, %xmm1    ## xmm1 = xmm0[1,0]
vmovq     %xmm0, %xmm0        ## xmm0 = xmm0[0],zero
vaddpd    %xmm1, %xmm0, %xmm0
vmovupd   %xmm0, (%eax)

The new code is:

movl      8(%esp), %eax
movl      4(%esp), %ecx
vmovsd    8(%ecx), %xmm0      ## xmm0 = mem[0],zero
vmovddup  (%ecx), %xmm1       ## xmm1 = mem[0,0]
vaddpd    %xmm1, %xmm0, %xmm0
vmovupd   %xmm0, (%eax)

I ran the function in a loop on a Skylake and the new code is 25% faster.

ABataev added inline comments.Mar 14 2022, 4:29 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1090–1091

Use the throughput, not number of instructions.

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
181–191

https://godbolt.org/z/3rhGajsaT

The first page is the result without the patch, the second with the patch.

vporpo updated this revision to Diff 415297.Mar 14 2022, 7:10 PM

Updated cost model to handle broadcast loads.
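As an illustration of the direction (names and values are assumptions; the table only gets its final name later in this review), the X86 shuffle cost tables could gain an entry for a broadcast that folds its load:

  // Sketch: a broadcast whose source is a foldable load is effectively free
  // as a shuffle, since movddup (%reg) does the load and the splat together.
  static const CostTblEntry SSE3BroadcastLoadTbl[] = {
      {TTI::SK_Broadcast, MVT::v2f64, 0}, // broadcast folded into movddup (mem)
  };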

RKSimon added inline comments.Mar 15 2022, 6:28 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1049

Why do we need this? Why not just use getShuffleCost?

vporpo added inline comments.Mar 15 2022, 9:54 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1049

We could use getShuffleCost() instead and introduce a new enumerator like ShuffleKind::SK_BroadcastLoad.
But I feel it is better to use a separate function, because getShuffleCost() is all about the cost of data shuffling (broadcast, reverse, select, transpose, etc.), while a BroadcastLoad includes both a load from memory and a data shuffle. I don't have a strong preference though; what do you think?

My concern is that we've never worked out how we should account for folded load costs, and this is setting a precedent that might backfire.

Wdyt about something like ShuffleKind::SK_BroadcastForBroadcastLoad that only takes into consideration the shuffle part of a broadcast load? A normal broadcast of a double to a 2-wide vector costs 1, so this could cost 0.

@ABataev wdyt about these options, any preference?

@ABataev wdyt about these options, any preference?

I agree with @RKSimon here. Better to teach existing functions about possibly foldable loads somehow. There might be some other cases where we'll need similar info, e.g. some AVX512 targets have slow gathers for registers but not for loads; we need to pass this info somehow too.

vporpo updated this revision to Diff 415690.Mar 15 2022, 9:07 PM

Removed getBroadcastLoadCost() and replaced it with getShuffleCost() which now has a new IsLoad argument.

ABataev added inline comments.Mar 16 2022, 6:56 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1056

I believe it is better to pass the instruction here, just like in other functions, and then do the analysis of this instruction if it was passed.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1124

Probably we need to also check the number of uses + external uses.

RKSimon added inline comments.Mar 16 2022, 8:39 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1056

Passing ArrayRef<const Value *> Args like we do for the arith ops makes sense here.

RKSimon added inline comments.Mar 16 2022, 8:41 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1558

Won't this fail on pre-SSE3 targets?

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
2

; RUN: opt < %s -basic-aa -slp-vectorizer -slp-threshold=-100 -instcombine -dce -S -mtriple=i386-apple-macosx10.8.0 -mattr=+sse2 | FileCheck %s

Add a pre-SSSE3 run?

vporpo updated this revision to Diff 415944.Mar 16 2022, 12:53 PM
vporpo marked 2 inline comments as done.

Replaced IsLoad argument with Args in getShuffleCost(), and addressed comments.
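A hedged sketch of how the Args-based interface might be consumed on the X86 side; the signature is approximated from the discussion above, and LT (the legalized type pair) plus the table name are assumed surrounding context:

  // TargetTransformInfo.h (sketch): Args carries the IR operands feeding the
  // shuffle, so targets can detect a foldable broadcast load.
  InstructionCost getShuffleCost(ShuffleKind Kind, VectorType *Tp,
                                 ArrayRef<int> Mask, int Index,
                                 VectorType *SubTp,
                                 ArrayRef<const Value *> Args = None) const;

  // X86TargetTransformInfo.cpp (sketch): if the broadcast source is a
  // single-use load, consult the broadcast-load table instead of the
  // generic shuffle tables.
  bool IsLoad = !Args.empty() && isa<LoadInst>(Args.front()) &&
                Args.front()->hasOneUse();
  if (Kind == TTI::SK_Broadcast && IsLoad && ST->hasSSE3())
    if (const auto *Entry =
            CostTableLookup(SSE3BroadcastLoadTbl, Kind, LT.second))
      return LT.first * Entry->Cost;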

vporpo added inline comments.Mar 16 2022, 12:55 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1558

Good catch, I added a check for SSE3.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1124

Good point, I added a check which should only allow this if we don't have any internal/external uses. I will add checks for internal uses in a follow-up patch, because it requires more changes + tests.
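Roughly the shape of the guard being described (a sketch; TTI, NumLanes, and the score constants are assumed surrounding context, not the patch's exact code):

  // Credit the splat-load score only when the loaded value has no other
  // uses, so the broadcast load does not end up coexisting with a scalar
  // load of the same address.
  auto *Ld = dyn_cast<LoadInst>(V1);
  if (Ld && Ld->hasOneUse() &&
      TTI.isLegalBroadcastLoad(Ld->getType(), ElementCount::getFixed(NumLanes)))
    return ScoreSplatLoads;
  return ScoreSplat;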

ABataev added inline comments.Mar 16 2022, 1:36 PM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1056

You can drop const in const Value *

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1556–1558

Maybe rename it to SSE3BroadcastLoadTbl?

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
181–191

What about this?

vporpo updated this revision to Diff 415963.Mar 16 2022, 1:58 PM
vporpo marked 2 inline comments as done.

Changed ArrayRef<const Value*> to ArrayRef<Value *> and renamed table to SSE3BroadcastLoadTbl.

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
181–191

I am not sure what to do about this; it may have lower throughput, but it has lower latency, so it runs faster. Are we always considering throughput? It looks like in TTI we are mostly counting instructions, at least from what I can see in getShuffleCost():

  {TTI::SK_PermuteSingleSrc, MVT::v2i64, 1},  // pshufd
  {TTI::SK_PermuteSingleSrc, MVT::v4i32, 1},  // pshufd
  {TTI::SK_PermuteSingleSrc, MVT::v8i16, 5},  // 2*pshuflw + 2*pshufhw + pshufd/unpck
  {TTI::SK_PermuteSingleSrc, MVT::v16i8, 10}, // 2*pshuflw + 2*pshufhw + 2*pshufd + 2*unpck + 2*packus
ABataev added inline comments.Mar 16 2022, 2:03 PM
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
181–191

It is still in terms of throughput.
Yeah, it may be faster on Skylake but not on corei7-avx. And there might be similar cases for other CPUs. We need to tweak the estimation criteria.

vporpo added inline comments.Mar 17 2022, 2:24 PM
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
181–191

What kind of tweaking are you proposing?

vporpo added inline comments.Mar 17 2022, 3:43 PM
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
181–191

As far as I understand the reason for the lower throughput is that we are adding pressure to the memory units ([4] and [5]). This is happening because we are loading twice, once for the scalar load, and once with vmovddup. This means that if we try to pack many instances of this code in parallel, then the pipeline would stall more than in the original code. But in any other case, like if this is part of some other code which is not heavy on the load units, it is the latency that matters, and this code is better with respect to latency.

Modeling throughput would require an accurate pipeline model, which we are not using in SLP. The cost modeling that we are doing looks more like a latency model than a throughput one, since it has no concept of pipeline resources. If we were actually modeling the pipeline, then we could check both the code latency and the code throughput and decide which one to choose in each case. Given that we are not actually doing this, I would argue against trying to fix a potential pipeline stall in an ad hoc way.

This revision is now accepted and ready to land.Mar 17 2022, 4:16 PM
This revision was landed with ongoing or failed builds.Mar 17 2022, 6:06 PM
This revision was automatically updated to reflect the committed changes.