Page MenuHomePhabricator

[SLP] Initial support for the vectorization of the non-power-of-2 vectors.
Needs ReviewPublic

Authored by ABataev on Jan 22 2019, 8:30 AM.

Details

Summary

Possibly vectorized operations are extended to the power-of-2 number with UndefValues to allow to use regular vector operations.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ABataev marked an inline comment as done.Sep 22 2020, 8:57 AM

Rebase

some very minor style comments - a general comment would be to try and pre-commit the style/NFC refactor/cleanup changes so the size of this patch is smaller

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2781

Do these trivial style refactors separately now to reduce the size of the patch?

2797

Do these trivial style refactors separately now to reduce the size of the patch?

3598

duplicate cast

3629

duplicate cast

3643

duplicate cast

3697

duplicate cast

4345–4350

trivial style refactor - pull out of patch?

RKSimon added inline comments.Sep 23 2020, 4:24 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
199–200

Are we going to have a problem if VL[0] is UndefValue?

ABataev added inline comments.Sep 23 2020, 4:30 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
199–200

Yeah, will fix it.

ABataev updated this revision to Diff 293701.Sep 23 2020, 5:30 AM

Rebase + fix

RKSimon added inline comments.Sep 23 2020, 5:52 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
16

These "feel" like regressions to me - any idea whats going on?

ABataev added inline comments.Sep 23 2020, 5:55 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
16

The cost model problem, if I recall it correctly. I investigated it before and found out that the cost model for AArch64 is not defined for long vectors in some cases and we fall back to the generic cost model evaluation which is not quite correct in many cases. Need to tweak the cost model for AArch64.

RKSimon added inline comments.Sep 23 2020, 6:15 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
16

Any instruction cost type (extract/shuffle/store?) in particular that needs better costs? It'd be good to at least raise a specific bug report to the aarch64 team

ABataev added inline comments.Sep 23 2020, 6:31 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
16

Do not remember already, need some time to investigate it again. Hope to do it by the end of this week.
PS. There was a question about this test already.

spatel added inline comments.Sep 23 2020, 7:18 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7540

Is it necessary to copy these? If so, it would be better to name this function something like "getCopyOfExtraArgValues" to make that explicit.

If not, we can just make this a standard 'get' method:

const MapVector<Instruction *, Value *> &getExtraArgs() const {
  return ExtraArgs;
}

And then access the 'second' data in the user code?

ABataev added inline comments.Sep 24 2020, 6:26 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7540

We don't need to expose the first element of the MapVector here, it is not good from the general design point of view. I'll rename the member function.

ABataev updated this revision to Diff 294040.Sep 24 2020, 6:33 AM

Rebase + rename

ABataev added inline comments.Sep 24 2020, 7:19 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
16

Found the reason. It is the cost of shuffle of TTI::SK_PermuteSingleSrc kind. Before this patch, the test operated with the vector <2 x i16>, which is transformed to llvm::MVT::v2i32 by type legalization function and the cost of this shuffle is tweaked to be 1 (see llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp, AArch64TTIImpl::getShuffleCost). The cost of this operation is 1, per table.

With this patch, the original vector type is <4 x i16> which is transformed to llvm::MVT::v4i16 and there is no optimized value for TTI::SK_PermuteSingleSrc in the table for this type and the function falls back to the pessimistic cost model and returns 18.

There are several TODOs int the file already about fixing the cost model for different shuffle operations.

Does anyone have any more comments?

spatel added inline comments.Sep 30 2020, 6:23 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3207–3209

Use isValidElementType() or check for undef directly?
I still can't tell from the debug statement exactly what we are guarding against. Should the type check already be here even without this patch?

3903

Are we always creating a masked load for a vector with 2 elements? This logic needs a code comment to explain the cases.

4828

Please add code comment/example to explain what the difference is between these 2 clauses.

4902

Is Passthrough a full vector of undef elements? If so, it should be created/named that way (or directly in the call to CreateMaskedLoad()) rather than in the loop.

4943–4944

Similar to above (so can we add a helper function to avoid duplicating the code?):
Please add code comment/example to explain what the difference is between these 2 clauses.

reverse ping

reverse ping

Will update the patch as soon as I'm back to work, in 2-3 weeks.

reverse ping?

reverse ping?

Need some time to setup my dev environment, will update ASAP

ABataev added inline comments.Mon, Nov 9, 10:37 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3207–3209

I was just trying to protect the code and try to support it only for simple types at first. There are some doubts that the cost for masked loads/stores is completed and I protected it to make it work only for simple types. I can remove this check if the cost model for masked ops is good enough.

3903

No, no need to do it for 2 elements, removed it.

4828

Fixed it, thanks.

4902

Fixed

4943–4944

Fixed, thanks!

RKSimon added inline comments.Mon, Nov 9, 11:14 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3207–3209

masked load/store costs for constant masks should be good enough now (getScalarizationOverhead should now provide us with a reasonable fallback)

ABataev updated this revision to Diff 304196.Tue, Nov 10, 8:11 AM

Rebase, updates and fixes

All of my comments were addressed, so LGTM. But please wait for an official 'accept' from at least 1 other reviewer.

RKSimon added inline comments.Thu, Nov 12, 9:35 AM
llvm/test/Transforms/SLPVectorizer/X86/insert-element-build-vector.ll
278

There isn't a ANY check-prefix atm (it was cleaned out in rG119e4550ddedc75e4 as part of the unused prefix cleanup) - please can you review?

ABataev added inline comments.Thu, Nov 12, 9:39 AM
llvm/test/Transforms/SLPVectorizer/X86/insert-element-build-vector.ll
278

Yes, need to remove it, I think. Most probably, caused but not quite clean merge.

ABataev updated this revision to Diff 304968.Thu, Nov 12, 1:49 PM

Rebase, test cleanup + small code improvements.

xbolva00 added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
26 ↗(On Diff #306735)

Regression on avx?

ABataev added inline comments.Fri, Nov 20, 11:01 AM
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
26 ↗(On Diff #306735)

Yes, looks like the issue with the cost of @llvm.masked.gather for masked gather with some undefs in the mask

craig.topper added inline comments.Fri, Nov 20, 11:07 AM
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
26 ↗(On Diff #306735)

Gather is slow on CPUs prior to AVX512. And its cost is proportional to the number of elements. I don't think the value of the mask should be a factor.

ABataev added inline comments.Fri, Nov 20, 11:18 AM
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
26 ↗(On Diff #306735)

True, but in some cases it can be optimized into gather + shuffle instead of wide gather, if there are undefs in mask.

ABataev updated this revision to Diff 307098.Mon, Nov 23, 9:09 AM

Rebase + improve handling of masked gathers.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2601

Could it be actually no one Instruction in UseEntry or should it be assert?

2605

"Lane 0" seems outdated here, but not sure about better description.

2739

assert(NumberOfInstructions != 0 && "...") and if (NumberOfInstructions == 1)?

3484–3487

Comment typo: aggrgate.

4861–4862

emplace_back()

6285

typo: "indeces"

ABataev updated this revision to Diff 307204.Mon, Nov 23, 2:45 PM

Fixed according to comments

ABataev updated this revision to Diff 307205.Mon, Nov 23, 3:02 PM

Fixed function name.

Btw, I've observed significant compile-time regression with this patch: http://llvm-compile-time-tracker.com/compare.php?from=99d82412f822190a6caa3e3a5b9f87b71f56de47&to=81b636bae72c967f526bcd18de45a6f4a76daa41&stat=instructions (thanks to @nikic for awesome service). This could be justified in case of comparable performance improvements but have you done any benchmarking?

dtemirbulatov added a comment.EditedTue, Dec 1, 3:37 PM

Btw, I've observed significant compile-time regression with this patch: http://llvm-compile-time-tracker.com/compare.php?from=99d82412f822190a6caa3e3a5b9f87b71f56de47&to=81b636bae72c967f526bcd18de45a6f4a76daa41&stat=instructions (thanks to @nikic for awesome service). This could be justified in case of comparable performance improvements but have you done any benchmarking?

I have done a while back with SPECINT 2006 and as I remember results were good, but I am not sure that I could find those now. Yes, for me, having this new functionality with presented compile-time regression looks ok.

I dont think (geomean) 0.20% is significant compile time problem. TBH, I expected bigger CT regressions - up to 0.5% is fine IMHO.