It really requires some additional improvements, because I see a lot of regressions for the cmp instructions. I think you should first try to vectorize cmp instructions using horizontal reductions, and only if that is unsuccessful should you try to vectorize the operands of the instruction itself.
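To illustrate the suggested ordering, here is a hypothetical scalar sketch (not SLP vectorizer code; the names are made up): lanewise compares feeding a boolean OR reduction, which a vectorizer can handle as a single horizontal reduction rather than scalarizing each cmp.

```cpp
#include <cstddef>

// Hypothetical pattern: cmp results reduced with OR. Vectorizing the whole
// reduction horizontally handles all the compares at once; only if that
// fails would we fall back to vectorizing the cmp operands individually.
static bool anyGreater(const int *a, const int *b, size_t n) {
  bool any = false;
  for (size_t i = 0; i < n; ++i)
    any |= (a[i] > b[i]); // one cmp per lane, OR-reduced
  return any;
}
```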
- Don't match undef lanes unless they exist on both halves of the mask.
- New test to show that difference.
Sun, Mar 24
The x86 regression test changes to preserve behavior look good. Pre-commit those, so we're left with just the actual diffs here?
Sat, Mar 23
The more I look at the motivating patterns, the less hopeful I am that we can get optimal code by delaying the transforms until SDAG.
Fri, Mar 22
Thu, Mar 21
This LG, but I'm not sure I understand how this is related to D59066?
Here, we clearly end up with no select in ASM.
But in D59066 we expand to this pattern.
So there is something else that is able to do the transform that we do manually in D59066?
Should D59066 be doing something else to simply trigger the existing transform?
LGTM - could you add a TODO comment here about using m_APInt() instead of m_ConstantInt()...there's no reason to limit this to scalars AFAICT.
Wed, Mar 20
We improved the generic expansion slightly with D59066. That leaves customization for x86 which is required because umin/umax are custom lowered even if we don't actually have the instructions pmaxud/pmaxuq. That's not a generic lowering problem; that's an x86 problem.
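For illustration only (this is a scalar C++ sketch, not the actual x86 lowering): without a native unsigned-max instruction, umax expands to an unsigned compare plus a select; a related classic trick, when only a signed max is available, flips the sign bits before and after. Whether a given backend uses the latter is target-specific.

```cpp
#include <cstdint>

// Generic expansion: unsigned setcc + select.
static uint32_t umaxExpand(uint32_t a, uint32_t b) {
  bool cmp = a > b; // unsigned compare
  return cmp ? a : b; // select
}

// Sign-bit-flip trick: umax via a signed max. Flipping the top bit maps
// the unsigned ordering onto the signed ordering, so smax on the flipped
// values, flipped back, yields the unsigned max.
static uint32_t umaxViaSmax(uint32_t a, uint32_t b) {
  int32_t sa = (int32_t)(a ^ 0x80000000u);
  int32_t sb = (int32_t)(b ^ 0x80000000u);
  int32_t m = sa > sb ? sa : sb;
  return (uint32_t)m ^ 0x80000000u;
}
```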
Tue, Mar 19
Logic looks fine, so I won't hold it up, but seems better to not duplicate code for sibling transforms?
Mon, Mar 18
Seems good, but let Simon have a look too in case there's some uarch concern that I'm not aware of.
If we want to be conservative for compile-time, we could use the simple pattern matchers (m_SMax...) rather than the heavier ValueTracking call. But we don't have that option currently for abs/nabs.
It seems like we've accomplished the improvement for almost no extra cost though, so that's probably a moot point now.
I'd prefer that we IR simplify our way out of this infinite loop instead of looking the other way though. Ie, can we get this in instsimplify using a ConstantRange?
That should be relatively simple to do; we just need to support constant-range calculation for min/max-flavor selects in computeConstantRange().
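The range rule itself is straightforward. Here is a toy stand-in for llvm::ConstantRange (the struct and function names are made up for illustration) showing the smax case; the analogous rules cover smin/umin/umax.

```cpp
#include <algorithm>
#include <cstdint>

// Toy stand-in for a constant range: inclusive [Lo, Hi].
struct Range { int64_t Lo, Hi; };

// Range of smax(x, y) given x in X and y in Y: the smallest possible
// result is max(X.Lo, Y.Lo), and the largest is max(X.Hi, Y.Hi).
static Range smaxRange(Range X, Range Y) {
  return {std::max(X.Lo, Y.Lo), std::max(X.Hi, Y.Hi)};
}
```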
One thing I didn't mention is that this opportunity was exposed only by InstCombine's instruction sinking. I captured the code after sinking to create the test case.
Also, is it reasonable to assume prior passes will transform the code to avoid triggering possible infinite loops in later passes like this? Or do you mean moving part of this transformation out of InstCombine and into InstSimplify?
Sun, Mar 17
Any particular reason to limit this to rotates only? This should be valid for funnel shifts in general.
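To spell out why the rotate case generalizes: fshl(a, b, s) conceptually concatenates a:b, shifts left by s (mod width), and keeps the high word; a rotate is just the special case where both inputs are the same value, i.e. rotl(x, s) == fshl(x, x, s). A minimal scalar sketch:

```cpp
#include <cstdint>

// fshl semantics for 32-bit operands. The s == 0 check avoids shifting
// by the full bit width, which is undefined behavior in C++.
static uint32_t fshl32(uint32_t a, uint32_t b, uint32_t s) {
  s &= 31; // shift amount is taken modulo the bit width
  if (s == 0)
    return a;
  return (a << s) | (b >> (32 - s));
}
```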
Please check in the test that provides the missing coverage as an NFC preliminary patch.
Hmm, this looks reasonable to me, @spatel?
Are there any concerns of performance impact of using ConstantRange() here?
Though I do think it pulls its weight.
Fri, Mar 15
This patch causes 5% regression of one of our eigen benchmarks on Haswell.
The problem is that when it combines the CMP in a hot block with a SUB in a cold block into a single SUB in the hot block, on a two-address architecture like x86, if the operand of the CMP has other uses, an extra COPY must be inserted before the original CMP, so there is one more instruction in the hot block.
Another patch r355823 papered over the problem in our code, but it didn't fix the root cause.
The regression is only observed on Haswell; it doesn't impact Skylake.
Thu, Mar 14