When generating masked gather nodes, the SLP vectorizer accounts for the cost
of the GEPs for loads as part of the scalar-to-vector transformation cost
estimation. It does not do the same for vectorized loads/stores, even though
vectorization may remove some of those GEPs completely. Because of this, in
some cases a masked gather operation can look much more profitable than
regular vectorization (masked-gather cost + vector GEP - (scalar
loads + GEPs), compared to vectorized loads - scalar loads).
Added analysis of the removed scalar GEPs for vectorized load/store nodes for better cost estimation.
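The imbalance described in the summary can be illustrated with a toy cost model. This is only a sketch: the `Costs` struct, its numeric values, the function names, and the `RemovedGEPs` count are all hypothetical and are not taken from SLPVectorizer.cpp or from any real TargetTransformInfo implementation. A negative delta means the vectorized form looks cheaper than the scalar one.

```cpp
// Toy cost model; all names and numbers are illustrative assumptions,
// not values or APIs from LLVM.
struct Costs {
  int ScalarLoad = 1;
  int ScalarGEP = 1;    // each scalar GEP priced like one ADD
  int VectorLoad = 1;
  int VectorGEP = 1;    // one vector ADD covers all lanes
  int MaskedGather = 3; // assumed to be more expensive than a plain vector load
};

// Masked gather: vector cost includes the vector GEP, scalar cost includes
// the per-lane GEPs.
constexpr int gatherDelta(Costs C, int Lanes) {
  return (C.MaskedGather + C.VectorGEP) - Lanes * (C.ScalarLoad + C.ScalarGEP);
}

// Vector load before the patch: the scalar GEPs are ignored entirely.
constexpr int vectorLoadDeltaOld(Costs C, int Lanes) {
  return C.VectorLoad - Lanes * C.ScalarLoad;
}

// Vector load after the patch: scalar GEPs that vectorization removes are
// credited as savings. RemovedGEPs is a hypothetical input here.
constexpr int vectorLoadDeltaNew(Costs C, int Lanes, int RemovedGEPs) {
  return C.VectorLoad - Lanes * C.ScalarLoad - RemovedGEPs * C.ScalarGEP;
}
```

With these assumed numbers and four lanes, the gather delta is -4 while the old vector-load delta is only -3, so the gather looks more profitable even though its memory operation is costlier. Crediting three removed scalar GEPs brings the vector-load delta to -6.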
Repository: rG LLVM Github Monorepo
Event Timeline
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6654 | Ptr->hasAllConstantIndices() ? |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6646 | Better to use OperandValueInfo() instead of the initialization? |
LGTM - in the medium term I think we should be trying to move more of this into getGEPCost - but that callback needs improvement first as it's barely used.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6656 | We see quite a significant performance regression related to this patch. Vectorized code: [1] [2] [3] [4] [5] [6] Instructions: Original: [1] [2] [3] [4] [5] [6] Instructions: |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6656 | I can say it was expected. That's why there was a discussion about using getGEPCost instead of this. This change just syncs the cost estimation for masked gathers and vector loads. As you noted, we already had the issue with the GEP costs. We need to fix this. It would help if Intel tried to implement their part of getGEPCost so we could start using it here for better cost estimation. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6656 | You got me wrong. I did not say we already had issues with GEPs. What I did say is that we already had issues with vectorizing sequences when we should not. Most issues with these wrongful vectorizations come from the shuffles and permutations generated. And the example, which is the test case in this patch (remark_not_all_parts.ll), merely shows that the vectorized code is worse than the original.
Aligning with gather loads is not enough justification IMO. Maybe gathers have this same issue? |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6656 | It is not to improve the vectorization but to fix the cost difference between vector loads and masked gather. If we're going to revert it, we need to remove the geps cost estimation for masked gathers. Otherwise there are cases, where consecutive loads are less profitable than the masked gather, and it leads to the perf regressions. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6656 | Yes, we had better focus on these cases and find out why gather loads look more profitable (instead of making unit stride loads look less profitable). It can be because of the same issue with GEPs, or it can be that the gather load cost itself is too optimistic. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6656 | The reason is known: we need to fix the cost estimation for GEPs, which means fixing the getGEPCost function. Without this we are overoptimistic about GEPs. Once the cost in getGEPCost is fixed and used in the cost estimation, that will fix the regression introduced by this patch and the general problem. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6656 | Could you point out the test case (where GEPS are incorrectly estimated) please? |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6656 | I don't remember exactly right now; I found it while working on strided loads vectorization. We ignored the cost of the GEPs for vector loads, and because of that vector loads became less profitable than the masked gather with vectorized GEPs (the strided load cost is higher than the vector load cost, but vectorized GEPs are currently more profitable than the scalar ones, since the cost of each GEP is currently calculated as the cost of an ADD).
Yes, right. And we need to fix a couple of things about it: the cost of GEPs (which are free in many cases on X86) and the cost of scalar/vector loads (which are also free in many cases on X86). But I just don't have enough time to do it. I would appreciate it if you could try to implement the GEP cost for X86 so we could switch to getGEPCost instead of using the cost of a simple ADD for GEPs. |
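The difference between pricing every GEP as an ADD and asking the target can be sketched as follows. This is a toy model under stated assumptions: `ToyGEP`, its fields, and both cost functions are hypothetical names invented for illustration. They are not LLVM's `getGEPCost` signature, and the folding rule is a simplification of the claim above that GEPs are free in many cases on X86.

```cpp
// Toy model only; none of these names exist in LLVM.
struct ToyGEP {
  bool AllConstantIndices;  // offset is a compile-time constant
  bool FoldsIntoAddressing; // assumption: target folds it into the memory operand
};

constexpr int AddCost = 1; // assumed cost of a simple ADD

// Approach described in the thread: every GEP is priced as one ADD.
constexpr int gepCostAsAdd(ToyGEP) { return AddCost; }

// Hypothetical target-aware alternative: a GEP that folds into the
// addressing mode of its load/store is treated as free.
constexpr int gepCostTargetAware(ToyGEP G) {
  return (G.AllConstantIndices && G.FoldsIntoAddressing) ? 0 : AddCost;
}
```

Under the ADD-based rule, removing a scalar GEP always counts as a saving, even when the target would never have emitted a separate instruction for it. A target-aware hook would report zero for such GEPs, so removing them would not make the vectorized form look cheaper than it is.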
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6656 | To be honest, I still do not understand what the problem with GEPs is. GEPs only have a cost when the stride is unknown. But if we end up with a "vectorize" state node here, we are already assured that the stride is known and that it is a unit stride (i.e. we load or store adjacent elements in memory). That applies equally to loads and stores, but here in SLP we don't issue scatter stores yet (if I've not missed something). So it is still unclear what problem you intended to solve by applying this GEP adjustment to stores. It's unfortunate that you lost the test case that motivated this patch. Any chance of recovering it somehow? |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6656 |
Different cost models for GEPs for masked loads and vector loads.
For X86, but there are other archs.
I did not lose it, I just need some time to find it. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6656 |
Different does not mean incorrect.
Thanks for admitting this was an inappropriate place to fix the problem. It should be part of the target-dependent TTI implementation for the target that you meant to fix. Now it applies equally to all targets. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6656 |
In this case it is incorrect.
That's why I asked you to help with the implementation of getGEPCost in TTI for X86, so we could use it here instead of the ADD cost. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6656 | I'm willing to work on it. But I cannot begin for a couple of weeks (not before Dec 6th, to be precise). And we need to meet somewhere to discuss, share ideas and sync on this issue (phab is not the right place for that). It would be nice if you could find test cases that can be reduced (if not already) to showcase the issue to address. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
6656 | Sounds good. We can discuss it via e-mail or set up a meeting. |