Differential D6526
[X86][SSE] Keep 4i32 vector insertions in integer domain on pre-SSE4.1 targets
Authored by RKSimon on Dec 4 2014, 6:42 AM.

Details

4i32 shuffles for single insertions into zero vectors lower to X86vzmovl, which was using movss - causing domain switch stalls. This patch fixes this by using movq(punpckldq $SRC, ZERO) instead.

Follow up to D6458, which dealt with SSE4.1 targets.
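To illustrate the two lowerings for inserting an i32 (x0, the low dword of %xmm0) into a zero vector - the register assignments and comments here are illustrative, not lifted from the patch:

    # previous lowering (float domain)
    xorps   %xmm1, %xmm1      # xmm1 = [0, 0, 0, 0]
    movss   %xmm0, %xmm1      # xmm1 = [x0, 0, 0, 0], float-domain move

    # this patch (integer domain)
    pxor      %xmm1, %xmm1    # xmm1 = [0, 0, 0, 0]
    punpckldq %xmm1, %xmm0    # xmm0 = [x0, 0, x1, 0], interleave low dwords with zero
    movq      %xmm0, %xmm0    # xmm0 = [x0, 0, 0, 0], zero-extend the low quadword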
Comment Actions

I think an even better pattern is: movq, pshufd 0,2,2,2?

Also, do we correctly match to movd when the source is a foldable load? I can't remember if there is a test case for that, but it's really important to not do a shuffle when just loading a single i32 from memory into an xmm register.

Comment Actions

Added comments - I'll add a new patch using that movq/pshufd pattern shortly.
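As a sketch of the suggested sequence (the immediate encoding and the movd load form are my reading of the suggestion, not taken from the review):

    movq   %xmm0, %xmm0          # xmm0 = [x0, x1, 0, 0], zero-extends the low quadword
    pshufd $0xa8, %xmm0, %xmm0   # 0xa8 selects lanes {0,2,2,2} -> xmm0 = [x0, 0, 0, 0]

Like the punpckldq form, this stays in the integer domain, and it only touches one register. For the foldable-load case the desired codegen is a single movd with no shuffle at all:

    movd   (%eax), %xmm0         # xmm0 = [x0, 0, 0, 0], loaded and zero-extended from memory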
Comment Actions

I'm rather dubious of this. Do you have timing info to support this? Even in the best-case scenario, you're trading a domain-crossing penalty (which is 1 or 2 cycles of latency with no additional µops) for an additional real instruction (shuffle), increasing execution pressure. e.g.:

    xorps %xmm1, %xmm1   [handled in rename, does not consume execute resources on recent HW]

vs.

    movq %xmm0, %xmm0    [1 µop, 1 cycle latency]

You have *maybe* shaved a latency cycle (but probably not) at the cost of an extra µop. Domain-crossing penalties are worth avoiding, but usually not at the cost of additional µops. They aren't *that* painful. In the worst cases (uint to fp), this creates a domain-crossing penalty where there wasn't one before, and also adds execute pressure.

For the z4zz, zz4z, zzz4 cases, I expect we should usually be generating XORPS + SHUFPS: 1 µop executed (possibly with a domain-crossing penalty) vs. 3 µops.

I'm not saying that this can't possibly be a win; I'm saying that I'd like to see numbers to back it up before we do this.

Comment Actions

Fixed typo in shuffle immediate.

Regarding the cost/benefit of domain switches/data bypass delays - Chandler has covered it quite thoroughly. I'd like to add that this is for pre-SSE4.1 targets, older hardware that tends to see greater losses in these situations; that these shuffles are often performed between operations that must be done in the integer domain (i.e. domain switches before/after the float shuffle); and that this particular lowering can now be done with a single register (my personal favourite, as I'm on a never-ending crusade to get rid of spills!). On top of all that, staying within a domain is what is recommended in both the Intel & AMD optimization guides (old and new), and Agner goes on about it at some length in his docs too.

Comment Actions

So, I've done some simple loop timing tests (doing paddd's before + after the shuffle code to ensure we're using the integer domain) on the following older CPUs:

Intel Core 2 Duo 1.83 GHz (T5600) Merom

And after all that I'm seeing no discernible difference in performance between the two implementations - the movss version can be made faster if we don't have to generate the zero (e.g. if it's already generated and this is the last use of the register), but that is it. With that in mind, I'm recommending that we do go ahead with this patch, primarily for the lower use of registers and because it matches the general rule of avoiding domain swaps - but don't expect any big improvement on old hardware!

Comment Actions

Added X86vzmovl folded loads tests.

I looked at using a pand with a constant mask as an alternative and saw a minimal regression (nearly in the noise) compared to the movq/movss versions I was already testing against. I'm worried about pursuing that route though - it adds an additional memory access, and the mask approach might make it more difficult for future optimizations of the multiple pshufd ops that are still in the vector-shuffle-128-v4.ll tests.
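For reference, the pand alternative discussed above would look something like this (the constant-pool label is hypothetical):

    # constant pool: .LCPI0_0 = [0xffffffff, 0x0, 0x0, 0x0]
    pand .LCPI0_0, %xmm0    # xmm0 = [x0, 0, 0, 0], a single integer op but an extra memory access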
Comment Actions

Right, I'm going to try and get this done. Chandler - if you have no complaints I'd like to commit the additional X86vzmovl folded tests right away, as it requires no alterations to codegen - it's just adding missing tests. For replacing the MOVQ+PSHUFD with an AND with a constant mask - is there a decent way of doing this in the X86InstrSSE.td patterns? I can't fathom how I'd declare the constant vector, nor how I'd make it fold or pre-load the constant. I can add it to lowerVectorShuffleAsElementInsertion easily enough, but that's kind of missing the point.

Comment Actions

Stale review, feel free to close if it's obsolete. That said, feel free to submit the missing tests separately. I'm happy to take a look if you get the patch into a good place. I don't know enough about the pattern stuff to be a lot of help, I'm afraid.

Comment Actions

The vector-zmov.ll tests were added in rL224284.

Abandoning the rest of the patch - I can possibly revisit this if I ever find a decent way to create vector constants in tablegen.