This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Better estimate cost of no-op extracts on target vectors.
ClosedPublic

Authored by fhahn on Apr 1 2021, 5:17 AM.

Details

Summary

The motivation for this patch is to better estimate the cost of
extracelement instructions in cases were they are going to be free,
because the source vector can be used directly.

A simple example is

%v1.lane.0 = extractelement <2 x double> %v.1, i32 0
%v1.lane.1 = extractelement <2 x double> %v.1, i32 1

%a.lane.0 = fmul double %v1.lane.0, %x
%a.lane.1 = fmul double %v1.lane.1, %y

Currently we only consider the extracts free, if there are no other
users.

In this particular case, on AArch64 which can fit <2 x double> in a
vector register, the extracts should be free, independently of other
users, because the source vector of the extracts will be in a vector
register directly, so it should be free to use the vector directly.

The SLP vectorized version of noop_extracts_9_lanes is 30%-50% faster on
certain AArch64 CPUs.

It looks like this does not impact any code in
SPEC2000/SPEC2006/MultiSource both on X86 and AArch64 with -O3 -flto.

This originally regressed after D80773, so if there's a better
alternative to explore, I'd be more than happy to do that.

Diff Detail

Event Timeline

fhahn created this revision.Apr 1 2021, 5:17 AM
fhahn requested review of this revision.Apr 1 2021, 5:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 1 2021, 5:17 AM
ABataev added inline comments.Apr 1 2021, 6:48 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3548–3551

I think it is better to use TLI->getTypeLegalizationCost(DL, cast<ExtractElementInst>(V)->getVectorOperandType()); to get the real machine vector type and the number of splits.

3562

->getVectorOperand() instead of getOperand(0)

3566

I think you need to use the real extract indices here to be more correct, i.e.

for (unsigned I = Idx - EltsPerVector; I <= Idx; ++I)
Cost -= TTI->getVectorInstrCost(Instruction::ExtractElement,
                                                cast<ExtractElementInst>(VL[I])->getVectorOperandType(), *getExtractIndex(cast<Instruction>(VL[I])));
fhahn updated this revision to Diff 334678.Apr 1 2021, 7:13 AM

Address comments, thanks!

fhahn marked an inline comment as done.Apr 1 2021, 7:16 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3548–3551

I think using TargetLoweringInfo would indeed be better, but unfortunately I don't think we can access it here, as it is defined in CodeGen? I tried to see if there are any other such uses in llvm/lib/Transforms but couldn't. Perhaps there's a way to use it I am missing?

3566

Thanks, I think that's much better, updated.

ABataev added inline comments.Apr 1 2021, 7:29 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3548–3551

You can use TTI->getNumberOfParts() to get the number of registers and then calculate EltsPerVector.
Also, what if there are extracts from 2 different vectors with the different numbers of elements?

fhahn added inline comments.Apr 1 2021, 8:01 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3548–3551

That's convenient, thanks! I just gave it a try, but I stumbled over a problem. For example, on AArch64, <2 x i32> fits and can be used as the lower half of a vector register, so EltsPerVector would be 2 (and rightly so). But this has the unfortunate effect that in some cases we would vectorize some operations earlier with <2 x i32>, rather than vectorizing a larger expression with <4 x i32>. By using the larger vector register, we make sure to only do so to use the largest VF.

Arguably using getNumberOfParts is the right thing to use here, but I really want to avoid introducing any regressions and I don't think there's a way at the moment to skip vectorizing eagerly if it would prevent optimizing with a wider VF later on. WDYT?

Also, what if there are extracts from 2 different vectors with the different numbers of elements?

At the moment all extracts in a block need to have the same vector register, so the types should also be the same. The extracts_first_2_lanes_different_vectors test should check for that case.

ABataev added inline comments.Apr 1 2021, 8:13 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3548–3551
  1. Could you give an example, please?
  2. Then maybe guard these extra checks with something like:
if (*ShuffleKind == TargetTransformInfo::SK_PermuteSingleSrc) {
 ...
}

?

fhahn updated this revision to Diff 334712.Apr 1 2021, 8:42 AM

Subtract shufflecost for vector part, rather than multiple extractelement costs, to be symmetric to the cost computed earlier.

fhahn added inline comments.Apr 1 2021, 8:49 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3548–3551

Just had another look at the failure and it was caused by computing Cost = TTI->getShuffleCost(ShuffleKind.getValue(), VecTy, Mask); up-front, but then subtracting the cost of individual extracts.

I think this may reduce the cost too aggressively.

For example in the code below, the shuffle-cost on AArch64 is 1, but the cost to extract from lane 1 is 3. The new code should just cancel out the cost of the shuffle, but in this case it made the cost more profitable than it should be! I think it should be more correct to subtract the cost of a single shuffle for a vector with EltsPerVector elements. That should be more symmetric to the computed Cost.

%v0.0 = extractelement <4 x i32> %v0, i32 0
%v0.1 = extractelement <4 x i32> %v0, i32 1

I think there's a similar problem in the AllUsersVectorized && !ScalarToTreeEntry.count() case, but it is not obvious to me how to improve that, given that we potentially need to subtract the cost for only a subset of extracts. Perhaps we should ensure Cost is at least 0, to avoid the problem?

ABataev added inline comments.Apr 1 2021, 9:06 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3548–3551

I think AllUsersVectorized && !ScalarToTreeEntry.count() case is correct since we removing the ExtractElement instruction from the code completely.

3566

Hmm, I think more correct would be to do something like this:

Cost += TTI->getShuffleCost(
                  ShuffleKind.getValue(),
                  FixedVectorType::get(VecTy->getElementType(), EltsPerVector), Mask);

in case if AllConsecutive is false and ignore the initial shuffle cost completely. Also, you need to calculate the correct Mask here or pass llvm::None

So, I think you need to split it into separate parts.
The first one for *ShuffleKind == TargetTransformInfo::SK_PermuteSingleSrc with improved shuffle cost for non-consecutive extracts and the generic one with the old functionality

fhahn added inline comments.Apr 1 2021, 10:01 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3566

That sounds good! I put up D99745. Was that what you had in mind for preparation? (Still need to add some comments, but I wanted to make sure that's what you actually had in mind)

ABataev added inline comments.Apr 1 2021, 10:03 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3566

Sorry, I did not mean to split the patch into 2 parts :) I meant to split processing into 2 parts. Plus, the patch cannot be NFC since it changes the functionality.

fhahn updated this revision to Diff 334797.Apr 1 2021, 12:40 PM

Move new code to separate function.

fhahn added inline comments.Apr 1 2021, 12:41 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3566

Oh right! I think I see what you mean now. I move the new logic to a separate function and changed it to only add the shuffle costs for blocks that are not consecutive. I think the code should be much clearer now, thanks!

ABataev added inline comments.Apr 1 2021, 12:54 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3453–3455

Can you make it static local?

3458–3460

Is this possible? Or better to make it an assert?

3477–3481

Use *getExtractIndex(cast<Instruction>(V)); to get the indeces

3504–3505
Optional<TargetTransformInfo::ShuffleKind> ShuffleKind = isShuffle(VL.slice(StartIdx, std::min(EltsPerVector, VL.size() - StartIdx)), Mask);
  1. Can we skip this call? Just TargetTransformInfo::SK_PermuteSingleSrc and None as a Mask?
fhahn updated this revision to Diff 334807.Apr 1 2021, 1:12 PM

address comments: removed mask computation, made static.

fhahn marked an inline comment as done.Apr 1 2021, 1:14 PM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3453–3455

Yes, originally I was still using areAllUsesVectorized, but I removed that now, as it does not really fit logically any more. This impacted 2 X86 tests, but I think it is the patch working as expected.

3458–3460

Unfortunately that can happen when compiling without a specific target and there's a test that tiggers an assert otherwise.

3504–3505

Done, much simple now!

ABataev added inline comments.Apr 1 2021, 1:17 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3458–3460

Maybe just exit in this case? And rely on the conservative cost model?

3487

No need for this check anymore

fhahn updated this revision to Diff 334810.Apr 1 2021, 1:23 PM

Address comments: Add early exit if getNumberOfParts returns 0, remove vector operand checks

fhahn marked an inline comment as done.Apr 1 2021, 1:24 PM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3458–3460

yes, I updated it to just return the shuffle cost for the whole VecTy.

ABataev added inline comments.Apr 1 2021, 1:28 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3457

Move this after if

3546–3549

Since you have a call for TTI->getShuffleCost(ShuffleKind.getValue(), VecTy, Mask) in computeExtractCostForPermuteSingleSrc maybe just check for shuffle kind in this function rathr than here? And go the conservatiave way if *ShuffleKind != TargetTransformInfo::SK_PermuteSingleSrc? In this case you would need to rename computeExtractCostForPermuteSingleSrc to something like computeExtractCost

fhahn updated this revision to Diff 334818.Apr 1 2021, 1:38 PM

Addressed comments: renamed to computeExtractCost and moved shuffle cost computation to function. Thanks again!

This revision is now accepted and ready to land.Apr 1 2021, 1:39 PM
fhahn marked 2 inline comments as done.Apr 1 2021, 1:40 PM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3546–3549

Done, everything is now done in the function. Overall that should make the code simpler again, thanks.

Harbormaster completed remote builds in B96796: Diff 334807.
This revision was automatically updated to reflect the committed changes.
fhahn marked an inline comment as done.
MaskRay added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3476

I have seen a case where

VecTy->getNumElements() == 2
NumOfParts == 4
EltsPerVector == 0
Idx == 0

So divide-by-zero SIGFPE. Trying to reduce to a test case, but how should I paper over the problem quickly?

fhahn added inline comments.Apr 2 2021, 10:52 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3476

Oh right, that's interesting! Perhaps a vector with an element type that does not fit into a single register?

For those cases, the logic below should not apply I think, so perhaps adding a build out on ExltsPerVector == 0 after computing it?

MaskRay added inline comments.Apr 2 2021, 10:53 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3476

if (EltsPerVector == 0) return TTI.getShuffleCost(ShuffleKind, VecTy, Mask); ?

fhahn added inline comments.Apr 2 2021, 10:55 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3476

Yes that was what I was thinking! It would be great to add a test case as well.