This is an archive of the discontinued LLVM Phabricator instance.

AVX-512 cost calculation for interleave load/store patterns
ClosedPublic

Authored by delena on Dec 26 2016, 7:12 AM.

Details

Summary

X86 target does not provide any target specific cost calculation for interleave patterns. It uses the common target-independent calculation, which gives very high numbers.
As a result, the scalar version is chosen in many cases. The situation on AVX-512 is even worse, since we have 3-src shuffles that significantly reduce the cost.

In this patch I calculate the cost on AVX-512. It will allow to compare interleave pattern with gather/scatter and choose a better solution (PR31426).

Diff Detail

Repository
rL LLVM

Event Timeline

delena updated this revision to Diff 82501.Dec 26 2016, 7:12 AM
delena retitled this revision from to AVX-512 cost calculation for interleave load/store patterns.
delena updated this object.
delena added reviewers: mkuper, Farhana.
delena set the repository for this revision to rL LLVM.
delena added subscribers: llvm-commits, Ayal.
mkuper added inline comments.Dec 27 2016, 11:16 AM
../include/llvm/Analysis/TargetTransformInfo.h
467 ↗(On Diff #82501)

Can you make this clearer? It's not obvious what "merge" means. Does the order matter?

469 ↗(On Diff #82501)

Is the extra space intentional?
Also, maybe "one" -> " a single"?

../include/llvm/CodeGen/BasicTTIImpl.h
65 ↗(On Diff #82501)

Why "All Permutations"?
getPermuteShuffleOverhead(), maybe? Also, I'm not sure what this has to do with permutations, especially given the example below. (TBH, It didn't have anything to do with SK_Alternate, it was just used that way. That wasn't really good either).

359 ↗(On Diff #82501)

This code makes very little sense to me. Not your change, but the original code.
Why does this special-case "alt shuffle", of all things? And why should this special-case only these specific shuffles after the change?

I think this is backwards - there may be specific shuffle types that are cheap by default - e.g. broadcast makes sense. Reverse? Not so much.

../lib/Target/X86/X86TargetTransformInfo.cpp
804 ↗(On Diff #82501)

Are you sure this is correct? I mean, this is fine for SK_Reverse, but I don't think it works for general shuffles.

I mean, let's say you have a shuffle of two v256i8.
Legalization will give you two sets of 4 * v64i8, but you don't end up with 4 two-input shuffles, since each of the 4 output vectors may depend on any subset of the 8 input vectors, so you may need a lot more shuffles.

Am I missing something?

851 ↗(On Diff #82501)

Same as above, I don't think this works for PermuteOneSrc either.

886 ↗(On Diff #82501)

Are you planning on adding SSE4, AVX and AVX2 costs as well?
This isn't a blocker for this patch, and should be a separate patch in any case, I'm just curious.

2131 ↗(On Diff #82501)

This line is > 80 chars.

zvi added a subscriber: zvi.Dec 28 2016, 7:34 AM
delena marked 2 inline comments as done.Dec 29 2016, 1:16 AM
delena added inline comments.
../include/llvm/Analysis/TargetTransformInfo.h
467 ↗(On Diff #82501)

The order does not matter. I meant any permutation of elements from 2 source vectors. I've changed the comment.

../include/llvm/CodeGen/BasicTTIImpl.h
65 ↗(On Diff #82501)

In the worst case, the shuffle is being replaced with "extracts" and "inserts". In my opinion, SK_Reverse should also have this overhead. And the SK_Broadcast is not always 1 instruction. But I don't want to fix everything in this patch.

359 ↗(On Diff #82501)

Reverse is not cheap for all types, VPERMW, for example, appears on AVX-512-BW. And broadcast for i16/i8 was added to AVX2 ISA.

../lib/Target/X86/X86TargetTransformInfo.cpp
804 ↗(On Diff #82501)

You are not missing anything. The cost, given here is the right cost for the legal types.
After split the cost should be (NumOfSrc*2 -1)*Entry->Cost. I took this into account in AVX-512 calculations.
I'll fix.

851 ↗(On Diff #82501)

I fixed and added a test. thanks.

886 ↗(On Diff #82501)

It should be added, I'm not sure that I'll be able to take it immediately after this patch. (and I have one more patch that should compare "interleave" with gather/scatter).
I'll try to find an example, where high "interleaving" cost on AVX2 prevents vectorization and fill PR.
But on AVX2, where we do not have any gather (or at least do not consider it as an option), we should compare "interleave" with strided-scalar. Mohamed is working on reducing scalar cost for strided access. We'll see what happens on AVX2 and earlier ISAs after his patch.

delena updated this revision to Diff 82656.Dec 29 2016, 1:20 AM

Some fixes in shuffle cost calculation after Michael comments. Added 2 more tests for shuffle cost model.

D27811 is working on much the same code, but I'm fine with you getting these changes in first and I'll carry on the refactor afterwards.

../lib/Analysis/CostModel.cpp
115 ↗(On Diff #82656)

I've added a ShuffleVectorInst::isSplat helper in D27811 for the same purpose. It's probably worth you merging that part (inc the CodeGenPrepare.cpp change as well) here.

delena added inline comments.Dec 29 2016, 3:31 AM
../lib/Analysis/CostModel.cpp
115 ↗(On Diff #82656)

The "broadcast" was not the matter of my patch and I do not really investigated in it, just added for completeness. And the broadcast-test is not full. I'd rather remove these changes at all and let you to proceed.

mkuper added inline comments.Dec 29 2016, 9:57 AM
../include/llvm/CodeGen/BasicTTIImpl.h
65 ↗(On Diff #82501)

I didn't mean to imply you should, just trying to figure out how we can at least move this in the right direction - even if just in terms of naming/documentation. :-)

359 ↗(On Diff #82501)

Sure.
And this isn't even X86-specific, this is just trying to give sane defaults for all targets, and I don't think the current code does that.
Could you please add a FIXME here?

../lib/Analysis/CostModel.cpp
112 ↗(On Diff #82656)

Not 100% related, but I would expect us to canonicalize shuffles to avoid the "all elements come from the second input" case.
(I'm not saying you should remove the check, since the cost is fairly minor... but I'm curious whether this is the case)

../lib/Target/X86/X86TargetTransformInfo.cpp
886 ↗(On Diff #82501)

SGTM.

2207 ↗(On Diff #82656)

Could you add an assert here?

../test/Analysis/CostModel/X86/shuffle-reverse.ll
126 ↗(On Diff #82656)

I thought this patch was not supposed to touch the costs of SK_Reverse shuffles.
Did we end up considering this shuffle SK_PermuteSingleSrc?

../test/Analysis/CostModel/X86/shuffle-two-src.ll
4 ↗(On Diff #82656)

1 -> 2

delena marked 2 inline comments as done.Jan 1 2017, 4:36 AM
delena added inline comments.
../test/Analysis/CostModel/X86/shuffle-reverse.ll
126 ↗(On Diff #82656)

In some cases "reverse" is cheaper in some not, but definitely not more expensive than SK_PermuteSingleSrc. I fixed one line in the "reverse" cost to make it consistent. But in order to redirect SK_Reverse to SK_PermuteSingleSrc I need to refactor the whole function. I don't want to do this in this patch. I'm also taking in account one more pending patch to the same place: https://reviews.llvm.org/D27811

delena updated this revision to Diff 82783.Jan 1 2017, 4:47 AM

Some minor changes after Michael's comments.

zvi added a subscriber: magabari.Jan 1 2017, 11:49 PM
mkuper edited edge metadata.Jan 2 2017, 12:27 AM

This LGTM, but please also wait for Simon's approval (or disapproval :-) ), just so your patches stay in sync.

../lib/Analysis/CostModel.cpp
115 ↗(On Diff #82656)

So, if this goes in first, it'll get removed by Simon's clean-up, right?

RKSimon accepted this revision.Jan 2 2017, 1:35 AM
RKSimon edited edge metadata.

LGTM, with a few minors. I can do the NFC refactor on the shuffle kinds after that and the finish off the broadcast patch.

I wonder whether we should consider adding a version of getShuffleCost that takes a raw shuffle mask instead of a shuffle kind enum - similar to getIntrinsicInstrCost. There is always going to be cases where the target code can identify more cheap shuffle cases. But that is a problem for another day - we would still need the SK_PermuteTwoSrc/SK_PermuteSingleSrc enums.

../lib/Analysis/CostModel.cpp
115 ↗(On Diff #82656)

Yes, but that's not a problem.

../lib/Target/X86/X86TargetTransformInfo.cpp
617 ↗(On Diff #82783)

I missed this when I did the earlier work on Reverse - better just to commit this (and the fixed cost test) separately.

This revision is now accepted and ready to land.Jan 2 2017, 1:35 AM
This revision was automatically updated to reflect the committed changes.