Page MenuHomePhabricator

[SLP]Better detection of perfect/shuffles matches for gather nodes.
ClosedPublic

Authored by ABataev on May 21 2021, 7:23 AM.

Details

Summary

Implemented better scheme for perfect/shuffled matches of the gather
nodes which allows to fix the performance regressions introduced by
earlier patches. Starting detecting matches for broadcast nodes and
extractelement gathering.

Diff Detail

Event Timeline

ABataev created this revision.May 21 2021, 7:23 AM
ABataev requested review of this revision.May 21 2021, 7:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 21 2021, 7:23 AM
ABataev added inline comments.May 21 2021, 7:27 AM
llvm/test/Transforms/SLPVectorizer/X86/blending-shuffle-inseltpoison.ll
23–45

Regression is caused by the incorrect cost model. It returns cost 12 for mul <4 x i8> and it is compensated by the fact that we calculate the cost of gather of extractelement instructions twice (its cost is -3). After this patch we correctly calculate the cost for the gather node only once (-3 for the first gather and 0 for the second one, perfect diamond match). llvm-mca returns the cost of code is 2 or 4 (for normalized mul <16 x i8> it is 4, for the original code it is 2). Need to tweak the cost model.

Given the iterations here I think that it needs a lot of inline documentation for cost model of what you're doing and what it's expecting here.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3719

This needs to be documented as to what you're doing and what it means in the code.

3743

Same :)

4552

Please document this :)

ABataev updated this revision to Diff 347078.May 21 2021, 10:50 AM

Address comments

RKSimon added inline comments.May 21 2021, 2:24 PM
llvm/test/Transforms/SLPVectorizer/X86/blending-shuffle-inseltpoison.ll
23–45

I'll handle this regression - I'm reviewing the cost tables against llvm-mca reports at the moment.

Please can you rebase? The vXi8 MUL costs should be better now

ABataev added inline comments.May 24 2021, 5:55 AM
llvm/test/Transforms/SLPVectorizer/X86/blending-shuffle.ll
84–125

Regressions caused by the incorrect cost of mul <2 x i8>. Per mca tool the cost is 2, cost model reports 3.

RKSimon added inline comments.May 24 2021, 6:45 AM
llvm/test/Transforms/SLPVectorizer/X86/blending-shuffle.ll
84–125

Are we counting the costs of the v4i8 mul twice here?

ABataev added inline comments.May 24 2021, 6:47 AM
llvm/test/Transforms/SLPVectorizer/X86/blending-shuffle.ll
84–125

Yes, but the cost of mul of v2i8. It is extended to mul v4i8 by the instcombine.

RKSimon added inline comments.May 26 2021, 7:57 AM
llvm/test/Transforms/SLPVectorizer/X86/blending-shuffle.ll
84–125

The cost is trickier than that as the costs tables aren't usually cpu specific- the worst case for v2i8 multiply is at least 4, so that's what the cost table reports - incidently, this test is for bdver2 (cost = 3.5),

But tbh, I wouldn't worry too much about this scalarization.

RKSimon accepted this revision.May 28 2021, 3:21 AM

LGTM

This revision is now accepted and ready to land.May 28 2021, 3:21 AM
This revision was landed with ongoing or failed builds.Jun 1 2021, 7:09 AM
This revision was automatically updated to reflect the committed changes.
nikic added a subscriber: nikic.Jun 1 2021, 8:47 AM

FYI there seems to be some (but not very large) compile-time impact on ClamAV: https://llvm-compile-time-tracker.com/compare.php?from=e60f147324b64f7740de58e6b936cdc0e26daadd&to=36911971a58d1ba8b15e97790ac816eaadb0603e&stat=instructions

The file with the largest impact is libclamav_htmlnorm.c with 3.4% regression in NewPM-O3 (and 5.6% in NewPM-ReleaseLTO-g). Might be worth taking a look.

FYI there seems to be some (but not very large) compile-time impact on ClamAV: https://llvm-compile-time-tracker.com/compare.php?from=e60f147324b64f7740de58e6b936cdc0e26daadd&to=36911971a58d1ba8b15e97790ac816eaadb0603e&stat=instructions

The file with the largest impact is libclamav_htmlnorm.c with 3.4% regression in NewPM-O3 (and 5.6% in NewPM-ReleaseLTO-g). Might be worth taking a look.

I'll check if it can be improved, thanks!

FYI there seems to be some (but not very large) compile-time impact on ClamAV: https://llvm-compile-time-tracker.com/compare.php?from=e60f147324b64f7740de58e6b936cdc0e26daadd&to=36911971a58d1ba8b15e97790ac816eaadb0603e&stat=instructions

The file with the largest impact is libclamav_htmlnorm.c with 3.4% regression in NewPM-O3 (and 5.6% in NewPM-ReleaseLTO-g). Might be worth taking a look.

Checked it, caused by the different order of the checks for gathered scalars introduced in this patch. Unfortunately, we cannot change the order back, it was incorrect and produced incorrect cost in some cases. Had to implement a special check to fix this bug, it affects compile time in some cases.