This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Fix lookahead operand reordering for splat loads.
ClosedPublic

Authored by vporpo on Mar 9 2022, 9:45 PM.

Details

Summary

Splat loads are inexpensive on X86. For a 2-lane vector we need just one
instruction: movddup (%reg), %xmm0. Using the standard Splat score leads
to worse code. This patch adds a new score dedicated to splat loads.

Please note that a splat usually takes three IR instructions:

  • It is usually a load and two inserts:
%ld = load double, double* %gep
%ins1 = insertelement <2 x double> poison, double %ld, i32 0
%ins2 = insertelement <2 x double> %ins1, double %ld, i32 1
  • But it can also be a load, an insert and a shuffle:
%ld = load double, double* %gep
%ins = insertelement <2 x double> poison, double %ld, i32 0
%shf = shufflevector <2 x double> %ins, <2 x double> poison, <2 x i32> zeroinitializer

Because of this, some of the lit tests contain more IR instructions.
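
As a rough sketch (not the committed code; the constant names and values below are only illustrative), the patch boils down to giving a splat whose scalar operand is a broadcastable load its own, higher lookahead score:

// Illustrative only: a generic splat keeps its usual score, while a splat of a
// load that the target can fold into a broadcast (e.g. movddup) scores higher.
static const int ScoreSplat = 1;
static const int ScoreSplatLoads = 3;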

Diff Detail

Event Timeline

vporpo created this revision. · Mar 9 2022, 9:45 PM
Herald added a project: Restricted Project. · View Herald Transcript · Mar 9 2022, 9:45 PM
vporpo requested review of this revision. · Mar 9 2022, 9:45 PM
Herald added a project: Restricted Project. · View Herald Transcript · Mar 9 2022, 9:45 PM
RKSimon added inline comments. · Mar 9 2022, 11:37 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1189

Why test both V1 and V2 if we know they are the same value?

vporpo updated this revision to Diff 414286. · Mar 10 2022, 12:21 AM
vporpo marked an inline comment as done.

Removed isa<LoadInst>(V2).

ABataev added inline comments. · Mar 10 2022, 2:58 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1147

This is target-specific for x86. What about other targets?

lebedev.ri added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1189

This should be a lot more principled (i.e., query the actual cost model).
E.g., movddup is only for double, but I think this will also trigger for float?
Also, what about the broadcast loads that are available in AVX/AVX2/AVX512?

vporpo added inline comments. · Mar 10 2022, 10:52 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1147

Please see the discussion below.

1189

I agree, this needs to be target-specific. I could add a new ShuffleKind, something like SK_BroadcastLoad. But this is not strictly a shuffle pattern; it is a combined Load + Shuffle. So this might need its own function, like TargetTransformInfo::getBroadcastLoadCost(VectorType *VecTy). What do you think?

RKSimon added inline comments. · Mar 10 2022, 11:01 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1189

What about adding an IsLoad bool flag to TTI::getShuffleCost? Or a TTI::isLegalBroadcastLoad (similar to the existing isLegal*Load ops)?

vporpo added inline comments. · Mar 10 2022, 11:09 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1189

Yeah, something like TTI::IsLegalBroadcastLoad may be more suitable for this since the score that we are using in the reordering is not an actual TTI cost.

vporpo updated this revision to Diff 414478. · Mar 10 2022, 1:14 PM

Changed the logic to use TTI::isLegalBroadcastLoad().

This is blocking our internal process, so any help in getting this reviewed would be greatly appreciated.

Ping (sorry for the repeated pings but this is quite urgent as it is blocking our process).

ABataev added inline comments. · Mar 14 2022, 12:39 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
5136

I assume you need to tweak the cost model for broadcasts with loads.

5136–5140
return ST->hasSSSE3() && VecTy && VecTy->getElementCount().getKnownMinValue() == 2 && 
       Ty->getElementType() == Type::getDoubleTy(Ty->getContext());
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1146–1147

For SLP we don't care about the instruction count, but about the throughput.

1181–1182

Maybe pass a scalar type and the number of elements to avoid constructing a vector type?

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
256–266

This block of code has throughput 2.5 instead of 2.0 as before. I assume there are some other cases; this needs some extra investigation.

vporpo updated this revision to Diff 415261. · Mar 14 2022, 4:13 PM
vporpo marked 2 inline comments as done.

Addressed comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1146–1147

Agreed, but usually code with fewer instructions is better for various reasons. How would you want me to rephrase this?

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
256–266

How did you come up with these throughput values?

The assembly code that comes out of llc for the original code is:

movl      8(%esp), %eax
movl      4(%esp), %ecx
vmovupd   (%ecx), %xmm0
vpermilpd $1, %xmm0, %xmm1        ## xmm1 = xmm0[1,0]
vmovq     %xmm0, %xmm0            ## xmm0 = xmm0[0],zero
vaddpd    %xmm1, %xmm0, %xmm0
vmovupd   %xmm0, (%eax)

The new code is:

movl      8(%esp), %eax
movl      4(%esp), %ecx
vmovsd    8(%ecx), %xmm0          ## xmm0 = mem[0],zero
vmovddup  (%ecx), %xmm1           ## xmm1 = mem[0,0]
vaddpd    %xmm1, %xmm0, %xmm0
vmovupd   %xmm0, (%eax)

I ran the function in a loop on a Skylake and the new code is 25% faster.

ABataev added inline comments. · Mar 14 2022, 4:29 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1146–1147

Use the throughput, not the number of instructions.

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
256–266

https://godbolt.org/z/3rhGajsaT

The first page is the result without the patch, the second with the patch.

vporpo updated this revision to Diff 415297. · Mar 14 2022, 7:10 PM

Updated cost model to handle broadcast loads.

RKSimon added inline comments. · Mar 15 2022, 6:28 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1049

Why do we need this? Why not just use getShuffleCost?

vporpo added inline comments. · Mar 15 2022, 9:54 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1049

We could use getShuffleCost() instead and introduce a new enumerator like ShuffleKind::SK_BroadcastLoad.
But I feel it is better to use a separate function, because getShuffleCost() is all about the cost of data shuffling (broadcast, reverse, select, transpose, etc.), while a BroadcastLoad includes both a load from memory and a data shuffle. I don't have a strong preference though; what do you think?

My concern is that we've never worked out how we should account for folded load costs, and this is setting a precedent that might backfire.

Wdyt about something like ShuffleKind::SK_BroadcastForBroadcastLoad that only takes into consideration the shuffle part of a broadcast load? A normal broadcast of a double to a 2-wide vector costs 1, so this could cost 0.

@ABataev wdyt about these options, any preference?

I agree with @RKSimon here. Better to teach the existing functions about possibly foldable loads somehow. There might be some other cases where we'll need similar info, e.g. some AVX512 targets have slow gathers for registers but not for loads; we need to pass this info somehow too.

vporpo updated this revision to Diff 415690. · Mar 15 2022, 9:07 PM

Removed getBroadcastLoadCost() and replaced it with getShuffleCost() which now has a new IsLoad argument.

ABataev added inline comments. · Mar 16 2022, 6:56 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1058

I believe it is better to pass the instruction here, just like in other functions, and then do the analysis of this instruction if it was passed.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1181

Probably we also need to check the number of uses + external uses.

RKSimon added inline comments. · Mar 16 2022, 8:39 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1058

Passing ArrayRef<const Value *> Args like we do for the arithmetic ops makes sense here.

RKSimon added inline comments. · Mar 16 2022, 8:41 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1559

Won't this fail on pre-SSSE3 targets?

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
2–3

; RUN: opt < %s -basic-aa -slp-vectorizer -slp-threshold=-100 -instcombine -dce -S -mtriple=i386-apple-macosx10.8.0 -mattr=+sse2 | FileCheck %s

Add a pre-SSSE3 run?

vporpo updated this revision to Diff 415944. · Mar 16 2022, 12:53 PM
vporpo marked 2 inline comments as done.

Replaced IsLoad argument with Args in getShuffleCost(), and addressed comments.
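
For orientation, the interface at this point is roughly the following (a sketch based on this thread; the exact committed signature and defaults may differ):

// TargetTransformInfo.h (sketch): getShuffleCost() gains an optional operand
// list so targets can see that the broadcast source is a foldable load.
InstructionCost getShuffleCost(ShuffleKind Kind, VectorType *Tp,
                               ArrayRef<int> Mask = None, int Index = 0,
                               VectorType *SubTp = nullptr,
                               ArrayRef<const Value *> Args = None) const;

// Caller side (sketch), where Ld is the load feeding the broadcast:
InstructionCost C = TTI.getShuffleCost(TargetTransformInfo::SK_Broadcast,
                                       VecTy, None, 0, nullptr, {Ld});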

vporpo added inline comments. · Mar 16 2022, 12:55 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1559

Good catch, I added a check for SSE3.
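
For reference, the guarded lookup being discussed presumably looks something like this (a sketch; the entry value and the surrounding variables such as Args, Kind, and LT are taken from context and may differ from the committed code):

// X86TargetTransformInfo.cpp (sketch): a broadcast whose source value (passed
// via Args) is a load is modeled as free on SSE3+, since movddup folds the load.
bool IsLoad = !Args.empty() && isa<LoadInst>(Args[0]);
static const CostTblEntry SSE3BroadcastLoadTbl[] = {
    {TTI::SK_Broadcast, MVT::v2f64, 0}, // broadcast load folded into movddup
};
if (ST->hasSSE3() && IsLoad)
  if (const auto *Entry =
          CostTableLookup(SSE3BroadcastLoadTbl, Kind, LT.second))
    return LT.first * Entry->Cost;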

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1181

Good point, I added a check which only allows this if we don't have any internal/external uses. I will add checks for internal uses in a follow-up patch, because it requires more changes + tests.
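
Putting the pieces together, the lookahead check presumably ends up shaped roughly like this (a sketch; ScoreSplat/ScoreSplatLoads, NumLanes, and the NoExtraUses() guard are placeholders for what is described above, not verbatim code from the patch):

// SLPVectorizer.cpp (sketch): when both operand candidates are the same load,
// use the dedicated splat-load score if the target can fold the load into a
// broadcast (e.g. movddup) and the load has no extra uses requiring extraction.
if (V1 == V2) {
  if (auto *LI = dyn_cast<LoadInst>(V1))
    if (NoExtraUses(LI) &&
        TTI.isLegalBroadcastLoad(LI->getType(),
                                 ElementCount::getFixed(NumLanes)))
      return ScoreSplatLoads;
  return ScoreSplat;
}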

ABataev added inline comments. · Mar 16 2022, 1:36 PM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1058

You can drop the const in const Value *.

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
1557–1559

Maybe rename it to SSE3BroadcastLoadTbl?

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
256–266

What about this?

vporpo updated this revision to Diff 415963. · Mar 16 2022, 1:58 PM
vporpo marked 2 inline comments as done.

Changed ArrayRef<const Value *> to ArrayRef<Value *> and renamed the table to SSE3BroadcastLoadTbl.

llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
256–266

I am not sure what to do about this; it may have lower throughput, but it has lower latency, so it runs faster. Are we always considering throughput? It looks like in TTI we are mostly counting instructions, at least from what I can see in getShuffleCost():

  {TTI::SK_PermuteSingleSrc, MVT::v2i64, 1},  // pshufd
  {TTI::SK_PermuteSingleSrc, MVT::v4i32, 1},  // pshufd
  {TTI::SK_PermuteSingleSrc, MVT::v8i16, 5},  // 2*pshuflw + 2*pshufhw
                                              // + pshufd/unpck
  {TTI::SK_PermuteSingleSrc, MVT::v16i8, 10}, // 2*pshuflw + 2*pshufhw
                                              // + 2*pshufd + 2*unpck + 2*packus
ABataev added inline comments. · Mar 16 2022, 2:03 PM
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
256–266

It is still in terms of throughput.
Yeah, it may be faster on Skylake but not on corei7-avx, and there might be similar cases for other CPUs. We need to tweak the estimation criteria.

vporpo added inline comments. · Mar 17 2022, 2:24 PM
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
256–266

What kind of tweaking are you proposing?

vporpo added inline comments. · Mar 17 2022, 3:43 PM
llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll
256–266

As far as I understand, the reason for the lower throughput is that we are adding pressure on the memory units ([4] and [5]). This is happening because we are loading twice, once with the scalar load and once with vmovddup. This means that if we try to pack many instances of this code in parallel, then the pipeline would stall more than with the original code. But in any other case, like if this is part of some other code which is not heavy on the load units, it is the latency that matters, and this code is better with respect to latency.

Modeling throughput would require an accurate pipeline model, which we are not using in SLP. The cost modeling that we are doing looks more like a latency model than a throughput one, since it has no concept of pipeline resources. If we were actually modeling the pipeline, then we could check both the code latency and the code throughput and decide which one to choose in each case. Given that we are not actually doing this, I would argue against trying to fix a potential pipeline stall in an ad hoc way.

This revision is now accepted and ready to land. · Mar 17 2022, 4:16 PM
This revision was landed with ongoing or failed builds. · Mar 17 2022, 6:06 PM
This revision was automatically updated to reflect the committed changes.