During the reordering transformation we should avoid reordering bundles like fadd/fsub, because doing so may prevent them from being matched into a single vector instruction (addsub) on x86.
We do this by checking whether a TreeEntry matches such a pattern and, if so, adding it to the list of TreeEntries whose orders need to be considered.
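A minimal sketch of the opcode check, assuming an AltShuffle-style TreeEntry that records a main and an alternate opcode (the helper name and its placement are illustrative, not taken from the diff):

```cpp
#include "llvm/IR/Instruction.h"

using namespace llvm;

// Sketch only: a bundle that alternates fadd/fsub can potentially lower to a
// single x86 (v)addsub instruction, so reordering should treat its current
// order as a candidate instead of freely permuting the lanes.
static bool isAddSubLikePattern(unsigned MainOpcode, unsigned AltOpcode) {
  return (MainOpcode == Instruction::FAdd && AltOpcode == Instruction::FSub) ||
         (MainOpcode == Instruction::FSub && AltOpcode == Instruction::FAdd);
}
```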
Event Timeline
Do you have many more examples of these patterns? If the addsub<->subadd pattern is the main problem, it shouldn't take much to fix it later on in the backend.
I am not sure if there is any other pattern, but this one showed up in a regression in the Eigen benchmark.
Hmm, yeah, fixing it in the backend may be an option, but what if we end up not vectorizing the code because of this? The alternative is a blend + add + sub, which should increase the cost quite a bit. So it is probably best to teach reordering not to mess up this pattern in the first place.
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll:121–127
I don't quite understand what the difference is here. Could you explain, please?
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll:121–127
Before this patch the pattern shuffle + fadd + fsub lowers to 3 instructions: blend + vector add + vector sub (the shuffle selects TMP3[0],TMP2[1], which is fadd[0],fsub[1], the inverse of the addsub pattern). With this patch shuffle + fadd + fsub lowers to a single addsub instruction (the shuffle selects TMP2[0],TMP3[2], which is fsub[0],fadd[1]).
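For concreteness, here is what the two lowerings compute on a <2 x double>, written with x86 intrinsics (an illustrative sketch, not code from the patch; the function names are made up):

```cpp
#include <immintrin.h>

// The addsub pattern: fsub in lane 0, fadd in lane 1. A single (v)addsubpd
// computes dst = { a[0] - b[0], a[1] + b[1] }.
__m128d addsub(__m128d a, __m128d b) {
  return _mm_addsub_pd(a, b);
}

// The inverse interleaving (fadd[0], fsub[1]) has no single-instruction
// form, so it takes the 3-instruction add + sub + blend sequence instead.
__m128d subadd(__m128d a, __m128d b) {
  __m128d Add = _mm_add_pd(a, b);
  __m128d Sub = _mm_sub_pd(a, b);
  return _mm_blend_pd(Add, Sub, /*imm8=*/2); // { Add[0], Sub[1] }
}
```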
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll:121–127
Ok, why not fix it in the backend? This new function you added does not affect the cost, but ignoring the shuffle actually increases the cost of the tree.
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll:121–127
I think it is better to do it here because we may end up not vectorizing some code because of the additional cost of the blend + add + sub pattern. I think I can fix the tree cost with a follow-up patch that fixes the cost of the AltShuffle pattern when it corresponds to the addsub instruction.
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll:121–127
Ah, I see, thanks. What about trying to do both: lowering in the backend and the cost adjustment? Is it possible?
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll:121–127
Hmm, I guess in the backend we won't have access to a cost-benefit analysis like the one we are doing during the reordering step (i.e., finding the most popular order). So we would have to do a simple conversion of any sub-add pattern to an add-sub + shuffles, and I am not sure that would always be profitable. I think the addsub pattern should be taken into consideration when looking for the most popular ordering so that it can influence the decision, but I am not so sure we can always justify the cost of the extra shuffles.
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll:121–127
The backend does not need the cost; it just checks for the pattern and lowers the sequence to the instructions. Why does it need the cost?
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll:121–127
I think there is a difference between doing the transformation here vs in the backend. In the backend we can't easily check whether reordering a sub-add to an add-sub can also remove some of the shuffles that are already in the code. For example, if we have code like this:

```llvm
%vsub = fsub <2 x double> %subop, ...
%vadd = fadd <2 x double> %addop, ...
%shuffle = shufflevector <2 x double> %vsub, <2 x double> %vadd, <2 x i32> <i32 0, i32 3> ; vsub[0], vadd[1]
store <2 x double> %shuffle, ...
```

this will be lowered to: add + sub + blend + store. But if we convert the sub-add pattern (i.e., add + sub + blend) to an add-sub + shuffles in the backend, then this will introduce 3 shuffles: 2 for the operands and 1 for the user, resulting in the pattern: pshuf + pshuf + addsub + pshuf + store. This is clearly worse than the original code. But the transformation could be profitable if we could remove some of the shuffles, for example if the code looked like this:

```llvm
%vsub = fsub <2 x double> %subop, ...
%vadd = fadd <2 x double> %addop, ...
%shuffle = shufflevector <2 x double> %vsub, <2 x double> %vadd, <2 x i32> <i32 0, i32 3>               ; vsub[0], vadd[1]
%shuffle2 = shufflevector <2 x double> %shuffle, <2 x double> zeroinitializer, <2 x i32> <i32 1, i32 0> ; shuffle[1], shuffle[0]
store <2 x double> %shuffle2, ...
```

Converting this to add-sub in the backend would result in pshuf + pshuf + addsub + pshuf + pshuf + store, but the latter two pshuf instructions negate each other, so they could be optimized away. A similar optimization could happen if the inputs were already reordered with a shuffle earlier. What I am trying to say is that simply converting a sub-add pattern to an add-sub pattern does not look profitable unless we can also get rid of some of the shuffles that cancel each other out. I can't think of how we could do this in the backend more effectively than we could do it here, but perhaps I am missing something.
llvm/include/llvm/Analysis/TargetTransformInfo.h:685
If we're going to do this (and I'm still not convinced we should) we're going to need a shuffle mask; otherwise addsub matching is not going to be accurate.
llvm/include/llvm/Analysis/TargetTransformInfo.h:685
I agree, the shuffle mask will make matching a lot more accurate; I will update the patches. Why do you think we should not go ahead with this? To me this is very similar to the shuffle-reducing transformations that we already have, so it seems to fit the scope of the SLP pass pretty well.
I'm concerned it's yet another thing being added to SLP to make up for a weakness somewhere else (usually the cost model mechanism), and it may take years to unravel again.
In this case we don't have a way to determine realistic costs for instruction sequences; at best we can optionally provide operands when getting the cost of a single instruction. We already have this problem with folded loads, and this could end up being a lot more complex :-(
I understand your concerns, but I feel that this pattern is not as tricky as folded loads. We can accurately detect the add-sub pattern within the SLP pass itself, since it matches the AltShuffle TreeEntry, so we can even make the tree cost more accurate (this is in a follow-up patch). The only issue I see is with cost modeling of this pattern from within a pass other than SLP, since that would require special logic to skip the cost of the vector add and vector sub. Given the current infrastructure I don't think we can do anything about that, but it shouldn't stop us from making getShuffleCost() more accurate, even if it will only benefit SLP.
Let's get the shuffle mask arg added; we need that to correctly cost alt patterns. We might be able to reduce the SLP impact to a getAltCost() wrapper of some kind.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:2381
I'd say we don't really need this interface.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:2381
Well, the issue is that the TTI function can't check whether this is an actual add-sub pattern, because it doesn't have all the information needed. For this to work, it would have to know how we are planning to combine the scalar instructions into vectors, which is an SLP-specific thing. So it feels that this logic should not belong to TTI; instead it should all be in the SLP pass. Regarding the mask, I don't really like the way this is done now. Actually, I am not even sure passing the shuffle mask is a good idea, because it doesn't actually help with checking the pattern. It is probably best to just pass the even/odd opcodes and the vector type and ask TTI to verify whether these are compatible with the addsub instructions. Wdyt?
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:2381
My 2 cents wrt the TTI interface.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:2381
Yeah, this makes sense. I will update the patch.
Replaced the shuffle mask argument with a bitvector and moved all the opcode-checking code to TTI->isLegalAltInstr().
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:2381
I updated the isLegalAltInstr() TTI interface function. I think it looks more like what you described in your previous comment.
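For reference, the resulting hook has roughly this shape (paraphrased from the discussion above; the exact documentation and the bit convention of the opcode mask are in the diff):

```cpp
// llvm/include/llvm/Analysis/TargetTransformInfo.h (paraphrased):
// Returns true if a vector instruction sequence on VecTy that alternates
// between Opcode0 and Opcode1 (OpcodeMask holds one bit per lane selecting
// which opcode that lane uses) can be lowered to a legal target
// instruction, e.g. the x86 addsub.
bool isLegalAltInstr(VectorType *VecTy, unsigned Opcode0, unsigned Opcode1,
                     const SmallBitVector &OpcodeMask) const;
```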
Don't we need to account for cost somehow?
The cost calculation changes are in a follow-up patch: https://reviews.llvm.org/D126432