This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Keep 4i32 vector insertions in integer domain on pre-SSE4.1 targets
AbandonedPublic

Authored by RKSimon on Dec 4 2014, 6:42 AM.

Details

Summary

4i32 shuffles for single insertions into zero vectors lower to X86vzmovl, which was using movss - causing domain switch stalls.

This patch fixes this by using movq(punpckldq $SRC, ZERO) instead.

Follow-up to D6458, which dealt with SSE4.1 targets.
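For the <A, 0, 0, 0> case the two lowerings look roughly like this (an illustrative AT&T-syntax sketch; register choices are examples, not the exact emitted code):

Before (float domain):
xorps %xmm1, %xmm1 # zero vector
movss %xmm0, %xmm1 # xmm1 = <a0, 0, 0, 0>, float-domain insert

After (integer domain):
pxor %xmm1, %xmm1 # zero vector
punpckldq %xmm1, %xmm0 # xmm0 = <a0, 0, a1, 0>
movq %xmm0, %xmm0 # xmm0 = <a0, 0, 0, 0>, upper 64 bits zeroed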

Diff Detail

Event Timeline

RKSimon updated this revision to Diff 16925.Dec 4 2014, 6:42 AM
RKSimon retitled this revision from to [X86][SSE] Keep 4i32 vector insertions in integer domain on pre-SSE4.1 targets.
RKSimon updated this object.
RKSimon edited the test plan for this revision. (Show Details)
RKSimon added reviewers: chandlerc, andreadb.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: Unknown Object (MLST).
chandlerc added inline comments.Dec 4 2014, 7:44 AM
test/CodeGen/X86/vector-shuffle-128-v4.ll
663–665

I think an even better pattern is: movq, pshufd 0,2,2,2?

Also, do we correctly match to movd when the source is a foldable load? I can't remember if there is a test case for that, but it's really important not to do a shuffle when just loading a single i32 from memory into an xmm register.
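A rough sketch of both suggestions (AT&T syntax; registers and addressing are just examples):

# single-register integer-domain insert
movq %xmm0, %xmm0 # zero the upper 64 bits: xmm0 = <a0, a1, 0, 0>
pshufd $0xa8, %xmm0, %xmm0 # select lanes 0,2,2,2: xmm0 = <a0, 0, 0, 0>

# foldable-load case: a single movd already zero-extends the i32
movd (%eax), %xmm0 # xmm0 = <mem32, 0, 0, 0>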

700–703

This highlights that our lowering for this is completely wrong. movq + pshufd is better even with SSE4.1, and movd + pshufd is better when we can fold the load....

Added comments - I'll add a new patch using that movq/pshufd pattern shortly.

test/CodeGen/X86/vector-shuffle-128-v4.ll
663–665

Yup - that'd be a nicer pattern (single register!) - easy enough to change.

There is an existing movd folded load pattern using VMOVDI2PDIrm - I haven't seen any tests for it but it does seem to work alright.

700–703

I am seeing lowerVectorShuffleAsElementInsertion interfere with a number of better shuffle candidates for these kinds of patterns.

I'm also finding that we don't do a good job of tracking elements that are known to be zero - X86ISelLowering has computeZeroableShuffleElements, but I'm starting to think about providing a better implementation inside the DAGCombiner instead. It'd need to know the difference between known zeros and zeroable elements, peek inside more ops, etc. - but it could help a lot, and there's no reason for this to be target specific.

RKSimon updated this revision to Diff 16940.Dec 4 2014, 11:33 AM
RKSimon set the repository for this revision to rL LLVM.

Converted pattern to movq, pshufd 0,2,2,2

scanon added a subscriber: scanon.EditedDec 4 2014, 12:24 PM

I'm rather dubious of this. Do you have timing info to support this?

Even in the best-case scenario, you're trading a domain-crossing penalty (which is 1 or 2 cycles of latency with no additional µops) for an additional real instruction (shuffle), increasing execution pressure, e.g.:

xorps %xmm1, %xmm1 [handled in rename, does not consume execute resources on recent HW]
movss %xmm0, %xmm1 [single µop, 1 cycle latency]
[1 (2 on Nehalem, IIRC) cycle latency hit if next instruction is not in FP domain]

vs.

movq %xmm0, %xmm0 [1 µop, 1 cycle latency]
pshufd %xmm0, %xmm0, blah [1 µop, 1 cycle latency]

You have *maybe* shaved a latency cycle (but probably not) at the cost of an extra µop. Domain-crossing penalties are worth avoiding, but usually not at the cost of additional µops. They aren't *that* painful.

In the worst cases (uint to fp), this creates a domain-crossing penalty where there wasn't one before, and also adds execute pressure.

For the z4zz, zz4z, zzz4 cases, I expect we should usually be generating XORPS + SHUFPS. 1 µop executed (possibly with a domain-crossing penalty) vs. 3 µops.

I'm not saying that this can't possibly be a win; I'm saying that I'd like to see numbers to back it up before we do this.

RKSimon updated this revision to Diff 16959.Dec 4 2014, 3:53 PM
RKSimon set the repository for this revision to rL LLVM.

Fixed typo in shuffle immediate.

Regarding the cost/benefit of domain switches/data bypass delays - Chandler has covered it quite thoroughly. I'd like to add that this is for pre-SSE4.1 targets, older hardware that tends to see greater losses in these situations; that these shuffles are often performed between operations that must be done in the integer domain (i.e. domain switches before/after the float shuffle); and that this particular lowering can now be done with a single register (my personal favourite, as I'm on a never-ending crusade to get rid of spills!). On top of all that, staying within a domain is what is recommended in both the Intel and AMD optimization guides (old and new), and Agner goes on about it at some length in his docs too.

RKSimon updated this revision to Diff 17025.Dec 7 2014, 5:48 AM

Updated patch to work with recently committed tests.

So, I've done some simple loop timing tests (doing paddd's before and after the shuffle code to ensure we're using the integer domain) on the following older CPUs (a rough sketch of the test loop follows the list):

Intel Core 2 Duo 1.83 GHz (T5600) Merom
Intel Core 2 Duo 3.06 GHz (E7600) Wolfdale
Pentium M 1.60 GHz Dothan
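Roughly the shape of the timed loop, for reference (illustrative only - the harness, iteration counts and registers here are assumptions rather than the exact code that was run):

.Lloop:
paddd %xmm2, %xmm0 # integer op before the shuffle, keeping the data in the integer domain
xorps %xmm1, %xmm1 # shuffle sequence under test (the movss version shown here)
movss %xmm0, %xmm1
paddd %xmm2, %xmm1 # integer op after - any bypass delay shows up here
decl %ecx
jnz .Lloop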

And after all that I'm seeing no discernible difference in performance between the two implementations - the movss version can be made faster if we don't have to generate the zero (e.g. if it's already generated and this is the last use of the register), but that is it.

With that in mind I'm recommending that we do go ahead with this patch, primarily for the lower register usage and because it matches the general rule of avoiding domain swaps - but don't expect any big improvement on old hardware!

RKSimon updated this revision to Diff 17026.Dec 7 2014, 8:35 AM

Added X86vzmovl folded loads tests.

I looked at using a pand with a constant mask as an alternative and saw a minimal regression (nearly in the noise) compared to the movq/movss versions I was already testing against. I'm worried about pursuing that route though - it adds an additional memory access, and the mask approach might make it more difficult for future optimizations of the multiple pshufd ops that are still in the vector-shuffle-128-v4.ll tests.
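For clarity, the pand alternative being weighed is roughly this (illustrative; the constant-pool label is made up, and the constant load is the extra memory access mentioned above):

.LCPI_mask: .long 0xffffffff, 0, 0, 0
...
pand .LCPI_mask, %xmm0 # keep lane 0, zero the rest - a single op, but it needs the constant from memory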

Right, I’m going to try and get this done.

Chandler - if you have no complaints I’d like to commit the additional X86vzmovl folded tests right away, as it requires no alterations to codegen - it’s just adding missing tests.

For replacing the MOVQ+PSHUFD with an AND with constant mask - is there a decent way of doing this in the X86InstrSSE.td patterns? I can’t fathom how I’d declare the constant vector, nor how I’d make it fold or pre-load the constant. I can add it to lowerVectorShuffleAsElementInsertion easily enough, but that’s kind of missing the point.

chandlerc requested changes to this revision.Mar 29 2015, 2:13 PM
chandlerc edited edge metadata.

Stale review, feel free to close if it's obsolete.

That said, feel free to submit the missing tests separately.

I'm happy to take a look if you get the patch into a good place. I don't know enough about the pattern stuff to be a lot of help I'm afraid.

This revision now requires changes to proceed.Mar 29 2015, 2:13 PM
RKSimon abandoned this revision.Mar 30 2015, 3:30 AM

The vector-zmov.ll tests were added in rL224284.

Abandoning the rest of the patch - I can possibly revisit this if I ever find a decent way to create vector constants in tablegen.