Looks good to me.
Sep 2 2020
Sep 1 2020
Actually, it doesn't seem to work: I get undefined references to LLVMAddInstructionSimplifyPass at link time in mesa.
Not entirely sure why (would InstSimplifyPass.cpp have to include Scalar.h?), but moving the LLVMAddInstructionSimplifyPass() function to Scalar.cpp (all the other scalar passes have the wrapper function there already) fixes it.
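For reference, the fix looks roughly like this - the wrapper living in Scalar.cpp alongside the other LLVMAdd*Pass functions (a minimal sketch; the exact include set is my assumption):

```cpp
// Sketch: C API wrapper placed in lib/Transforms/Scalar/Scalar.cpp next to
// the other LLVMAdd*Pass wrappers (includes assumed, not verified).
#include "llvm-c/Transforms/Scalar.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Transforms/Scalar/InstSimplifyPass.h"

using namespace llvm;

void LLVMAddInstructionSimplifyPass(LLVMPassManagerRef PM) {
  // unwrap() turns the C handle back into the legacy PassManagerBase.
  unwrap(PM)->add(createInstSimplifyLegacyPass());
}
```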
Aug 28 2020
Looks great to me; that'll certainly fit the bill for mesa. (Not sure about the name, though - should it have "Legacy" in it too, for consistency? Either way looks fine to me.)
Aug 27 2020
mesa/llvmpipe was using this pass, so it doesn't build anymore. If the proposed replacement is InstSimplifyLegacyPass, that's fine; however, the old pass was conveniently available via the C interface (LLVMAddConstantPropagationPass), whereas InstSimplifyLegacyPass is not.
(I'd note that we only use very few passes, and they are perhaps not optimal, but the goal is to avoid any expensive passes while still getting some optimization. So I'm not really insisting on InstSimplifyLegacyPass either, if there are other suitable replacements doing roughly the same thing - it would be nice, though, if we didn't have to write our own C++ wrapper, as we try to avoid that where possible.)
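To make the constraint concrete, here's a rough sketch of the style of pipeline I mean - everything set up purely through the C API (the specific pass list below is illustrative, not our actual one):

```cpp
#include "llvm-c/Core.h"
#include "llvm-c/Transforms/Scalar.h"

// Illustrative only: a cheap function-pass pipeline built purely through
// the C API, in the style llvmpipe uses (this exact pass list is an
// assumption, not mesa's real one).
static LLVMPassManagerRef create_cheap_pipeline(LLVMModuleRef module) {
  LLVMPassManagerRef pm = LLVMCreateFunctionPassManagerForModule(module);
  LLVMAddConstantPropagationPass(pm); /* the pass being removed here */
  LLVMAddCFGSimplificationPass(pm);
  LLVMAddEarlyCSEPass(pm);
  LLVMInitializeFunctionPassManager(pm);
  return pm;
}
```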
Sep 13 2018
Looks alright to me, and cleaner than what I suggested in bug 31151. And the test results look great too.
Jul 20 2018
May 17 2018
I don't really feel qualified to review this. But FWIW, in mesa we never used rotate instructions, and for that matter we weren't relying on the shift intrinsics either (we've got some cases where we actually need modulo-bit-width behavior, so we mask off the excess bits of the shift count ourselves).
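For illustration, the masking I mean looks roughly like this at IR-build time (a sketch with IRBuilder; the helper name is made up):

```cpp
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Sketch: mask the shift amount with (bit width - 1) so the IR-level shift
// never sees a count >= the bit width (which has an undefined result),
// giving modulo-width behavior. Assumes a power-of-two bit width.
Value *emitShlModuloWidth(IRBuilder<> &B, Value *Val, Value *Amt) {
  unsigned Bits = Val->getType()->getScalarSizeInBits();
  Value *Mask = ConstantInt::get(Amt->getType(), Bits - 1);
  return B.CreateShl(Val, B.CreateAnd(Amt, Mask, "amt.masked"));
}
```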
But omg, do the larger-than-or-equal-to-bit-width cases make things more complex... Shift is such a nice instruction - if only everybody could agree on what the behavior should be when the shift count is at least the bit width...
May 16 2018
The addus looks correct now, thanks.
May 15 2018
FWIW, as said, I think I could live with autoupgrade of at least the unsigned sat ones. These have similar complexity to pabs (pabs is cmp/sel/neg, whereas these are add (or sub)/cmp/sel, although the add one also requires a constant), so if there were no concerns about optimizations breaking pabs recognition, there might not be any here either - provided it actually works (which doesn't always seem to be the case, according to the test cases?). (Although personally I'd have liked to keep the pabs intrinsics too...)
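For reference, the add/cmp/sel sequence I mean is roughly this (a sketch with IRBuilder; the helper name is made up):

```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Sketch of the unsigned-saturating-add pattern: do the add, detect
// wraparound (unsigned overflow iff sum < operand), clamp to all-ones.
// Works for scalars and vectors alike.
Value *emitAddUSat(IRBuilder<> &B, Value *X, Value *Y) {
  Value *Sum = B.CreateAdd(X, Y, "sum");
  Value *Wrapped = B.CreateICmpULT(Sum, X, "wrapped");
  Value *AllOnes = Constant::getAllOnesValue(X->getType());
  return B.CreateSelect(Wrapped, AllOnes, Sum, "addus");
}
```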
That disappearing intrinsics cause a null call in the generated code rather than some error at compile time remains a concern - but I certainly know about this one now, so it's easily debugged :-).
Other than that, there's a bug in the addus logic; the rest looks reasonable to me, though I'm not an expert on LLVM internals.
May 10 2018
I'm also concerned that the more complex patterns are easier for other optimizations to simplify slightly and thereby break. With the simpler things like pmin/pmax or pabs, failing to match them after they get optimized a little isn't so bad.
May 9 2018
We're currently discussing the possible solutions for the JIT pipeline issue with @DavidKreitzer.
May 7 2018
Hmm I suppose that's "more stuff which breaks our JIT compilation".
Removing these intrinsics is actually looking a bit ugly from my point of view (as a mesa llvmpipe developer).
One issue is that we take care to emit only the instructions that are necessary - that is, if the packxxx instructions aren't available, we do only whatever is needed to accomplish the pack. So if we already know the values can't be outside the target range (or maybe they can, but we don't care for some reason), we make sure to emit just a shuffle. But if packxxx is available, we still prefer that, of course.
And if we can't use the intrinsics, we have to emit loads of other IR instructions (and they need to match what the autoupgrade would produce), but we still need a separate path for when we know the backend won't be able to use the packxxx instructions - which is getting really ugly.
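Roughly, the cheap no-packxxx path looks like this (a sketch against the newer IRBuilder API, with a made-up helper name; the packxxx-intrinsic path and any clamping are omitted, since here we assume the values are already known to be in range):

```cpp
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Sketch of the cheap path: the two <8 x i16> halves are already known to
// fit in 8 bits, so just truncate each and concatenate with a shuffle -
// no clamping, no packsswb. (FixedVectorType and the ArrayRef shuffle
// mask are the newer IRBuilder API.)
Value *emitCheapPack(IRBuilder<> &B, Value *Lo, Value *Hi) {
  auto *NarrowTy = FixedVectorType::get(B.getInt8Ty(), 8);
  Value *LoT = B.CreateTrunc(Lo, NarrowTy);
  Value *HiT = B.CreateTrunc(Hi, NarrowTy);
  int Mask[16] = {0, 1, 2,  3,  4,  5,  6,  7,
                  8, 9, 10, 11, 12, 13, 14, 15};
  return B.CreateShuffleVector(LoT, HiT, Mask);
}
```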
Apr 25 2018
Is the IR being generated on the fly and then fed to the JIT? That would mean it doesn't go through the autoupgrade code, since that is only run by the bitcode reader and the .ll parser. If that's the case, you'll need to generate the replacement sequence directly instead of using the intrinsic.
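For example, the replacement for a packsswb-style signed-saturating pack would clamp and then truncate, along these lines (a sketch; the helper name is made up):

```cpp
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Sketch: clamp each element to the signed i8 range before truncating,
// mirroring what autoupgrade would emit for a packsswb-style intrinsic.
// Works elementwise on vectors, since ConstantInt::get splats for them.
Value *emitSignedSatClamp(IRBuilder<> &B, Value *V) {
  Type *Ty = V->getType();
  Value *Hi = ConstantInt::get(Ty, 127);
  Value *Lo = ConstantInt::get(Ty, -128, /*IsSigned=*/true);
  V = B.CreateSelect(B.CreateICmpSGT(V, Hi), Hi, V);
  V = B.CreateSelect(B.CreateICmpSLT(V, Lo), Lo, V);
  return V; // followed by trunc + shuffle to build the packed result
}
```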
Dec 20 2016
This looks correct to me - always nice to shave off some instructions...
Dec 15 2016
Looks great to me.
Dec 13 2016
I'd like to propose the following:
1 - we get this patch and D27684 approved and committed, providing v4i32 lowering to shufps and avoiding some of the unnecessary domain switches.
2 - get shufps lowering added to target shuffle combining; I added shufpd recently, and it's just the domain issues that I wanted to tidy up before adding shufps as well
3 - add support for v8i32 (and v16i32?) lowering to shufps
4 - other missing domain switch patterns (scalar stores and vpermilps/vpshufd come to mind)
5 - add support for domain switching to target shuffle combine when the shuffle depth is 3 or more - this will allow pshufd use on pre-AVX targets and seems to introduce some good uses of insertps as well.
That seems within scope for 4.0 and doesn't involve anything too exotic. After 4.0 we should be in a better position to begin moving some of this work to MC combines, to make better use of the specific scheduler models.
I'm not sure I completely agree on the details (in particular, I'm not certain the older non-AVX arches where the mov is needed have zero-latency movs), but I mostly agree. Especially since it looks like the same platforms that need the mov are the ones that *do* have a transition penalty. So using a shufps may be worth it only with something like -march=nehalem -mtune=ivb, where we generate SSE code but expect to run it on a newer platform.
I think this should go in; just forget about the domain penalties, as they shouldn't be an issue on most CPUs. When I looked at this in Agner Fog's guides, my conclusion was that it's probably only really an issue on Nehalem and the VIA Nanos. If a CPU has just one clock of additional latency each way, it's still worth replacing 3 shuffle instructions with one from the wrong domain (albeit the latency chain will be the same then) - and if it only manages to replace 2 shuffle instructions from the right domain, it might be worse or better in such a case. (Whether it is actually worse on Nehalem, with its 2-clock penalty each way, would of course depend on whether a given instruction mix is latency-bound or throughput-bound.)
Also, plenty of the more odd CPUs either place all shuffles in the int domain anyway, or do something even more odd (like the original core2 merom). I suppose ideally the shuffle lowering code would take such hw cost differences into account, but the truth is that right now it doesn't really model any of this (e.g. on some CPUs unpacks might not have the same cost as pshufd either, and so on).
So, I'm all for it (and suggested fixing it the same in https://llvm.org/bugs/show_bug.cgi?id=27885).