Implemented better scheme for perfect/shuffled matches of the gather
nodes which allows to fix the performance regressions introduced by
earlier patches. Starting detecting matches for broadcast nodes and
extractelement gathering.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/test/Transforms/SLPVectorizer/X86/blending-shuffle-inseltpoison.ll | ||
---|---|---|
25–66 | Regression is caused by the incorrect cost model. It returns cost 12 for mul <4 x i8> and it is compensated by the fact that we calculate the cost of gather of extractelement instructions twice (its cost is -3). After this patch we correctly calculate the cost for the gather node only once (-3 for the first gather and 0 for the second one, perfect diamond match). llvm-mca returns the cost of code is 2 or 4 (for normalized mul <16 x i8> it is 4, for the original code it is 2). Need to tweak the cost model. |
Given the iterations here I think that it needs a lot of inline documentation for cost model of what you're doing and what it's expecting here.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
3717 | This needs to be documented as to what you're doing and what it means in the code. | |
3741 | Same :) | |
4550 | Please document this :) |
llvm/test/Transforms/SLPVectorizer/X86/blending-shuffle-inseltpoison.ll | ||
---|---|---|
25–66 | I'll handle this regression - I'm reviewing the cost tables against llvm-mca reports at the moment. |
llvm/test/Transforms/SLPVectorizer/X86/blending-shuffle.ll | ||
---|---|---|
113–154 | Regressions caused by the incorrect cost of mul <2 x i8>. Per mca tool the cost is 2, cost model reports 3. |
llvm/test/Transforms/SLPVectorizer/X86/blending-shuffle.ll | ||
---|---|---|
113–154 | Are we counting the costs of the v4i8 mul twice here? |
llvm/test/Transforms/SLPVectorizer/X86/blending-shuffle.ll | ||
---|---|---|
113–154 | Yes, but the cost of mul of v2i8. It is extended to mul v4i8 by the instcombine. |
llvm/test/Transforms/SLPVectorizer/X86/blending-shuffle.ll | ||
---|---|---|
113–154 | The cost is trickier than that as the costs tables aren't usually cpu specific- the worst case for v2i8 multiply is at least 4, so that's what the cost table reports - incidently, this test is for bdver2 (cost = 3.5), But tbh, I wouldn't worry too much about this scalarization. |
FYI there seems to be some (but not very large) compile-time impact on ClamAV: https://llvm-compile-time-tracker.com/compare.php?from=e60f147324b64f7740de58e6b936cdc0e26daadd&to=36911971a58d1ba8b15e97790ac816eaadb0603e&stat=instructions
The file with the largest impact is libclamav_htmlnorm.c with 3.4% regression in NewPM-O3 (and 5.6% in NewPM-ReleaseLTO-g). Might be worth taking a look.
Checked it, caused by the different order of the checks for gathered scalars introduced in this patch. Unfortunately, we cannot change the order back, it was incorrect and produced incorrect cost in some cases. Had to implement a special check to fix this bug, it affects compile time in some cases.
This needs to be documented as to what you're doing and what it means in the code.