This is an archive of the discontinued LLVM Phabricator instance.

[X86, AVX] use blends instead of insert128 with index 0
ClosedPublic

Authored by spatel on Mar 16 2015, 3:23 PM.

Details

Summary

Another case of x86-specific shuffle strength reduction: avoid generating insert*128 instructions with index 0 because they are slower than their non-lane-changing blend equivalents.
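To make the equivalence concrete, here is a minimal Python sketch (not from the patch; the function and variable names are illustrative) modeling the lane semantics that make a `vinsertf128` with index 0 interchangeable with a `vblendps`: inserting a 128-bit value into the low half of a 256-bit register selects exactly the same elements as a blend whose mask takes the low four lanes from the new value.

```python
# Model a 256-bit float vector as a list of eight 32-bit lanes.

def insert128(dst, src, index):
    """Semantics of vinsertf128: replace one 128-bit half of dst with src."""
    out = list(dst)
    out[index * 4:(index + 1) * 4] = src
    return out

def blend(a, b, mask):
    """Semantics of vblendps: per-lane select; bit i set -> take b[i]."""
    return [b[i] if (mask >> i) & 1 else a[i] for i in range(8)]

ymm = [float(i) for i in range(8)]      # existing 256-bit value
xmm = [100.0, 101.0, 102.0, 103.0]      # 128-bit value to insert

# Inserting at index 0 only touches the low lane, so it is exactly a
# blend that takes its low four elements from the (widened) new value.
assert insert128(ymm, xmm, 0) == blend(ymm, xmm + [0.0] * 4, 0x0F)

# An insert at index 1 changes which 128-bit lane is written, so it
# corresponds to the complementary blend mask instead.
assert insert128(ymm, xmm, 1) == blend(ymm, [0.0] * 4 + xmm, 0xF0)
```

The blend form is preferable on real hardware because it does not change lanes, whereas the insert is treated as a lane-crossing operation and has higher latency on common microarchitectures.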

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 22057.Mar 16 2015, 3:23 PM
spatel retitled this revision from to [X86, AVX] use blends instead of insert128 with index 0.
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added reviewers: andreadb, chandlerc, bruno.
spatel added a subscriber: Unknown Object (MLST).
spatel updated this revision to Diff 22104.Mar 17 2015, 10:54 AM

Updated patch: the previous version of the patch added logic to PerformShuffleCombine256(), but that wouldn't catch every case where we create an INSERT_SUBVECTOR. And yes, there was one more shuffle regression test where we expected an 'inserti128 $0'.

In this version, I've moved the code into Insert128BitVector(). No other functional changes.

This should catch every creation of an INSERT_SUBVECTOR that can be optimized with a BLENDI.

andreadb edited edge metadata.Mar 18 2015, 10:24 AM

Hi Sanjay,

test/CodeGen/X86/avx-cast.ll
43–46 ↗(On Diff #22104)

So, the reason your code doesn't optimize this case for AVX1 is that AVX1 doesn't support integer blends on YMM registers.

However, wouldn't the following code be faster in this case (at least on Intel cpus)?

vxorps %ymm1, %ymm1, %ymm1
vblendps $0, %ymm0, %ymm1, %ymm0

I understand that we want to avoid domain-crossing as much as possible.
However, in this particular case I don't think it is possible (please correct me if I am wrong).
Your code would fall back to selecting a 'vinsertf128'. However, as far as I know, 'vinsertf128' executes in the floating point cluster anyway, so I expect (though I haven't tested it) that using 'vblendps/d' would give us the same (or better on Haswell?) throughput. What do you think?

spatel added inline comments.Mar 18 2015, 2:17 PM
test/CodeGen/X86/avx-cast.ll
43–46 ↗(On Diff #22104)

Hi Andrea -

Thanks for the close reading!

Yes - if you only have AVX and you get to this point, then there's no avoiding the domain-crossing because you won't have vinserti128 either.

I'll redo the check to account for this case.

spatel updated this revision to Diff 22282.Mar 19 2015, 11:06 AM
spatel edited edge metadata.

Patch updated based on feedback from Andrea: for integer ops on AVX1, it's better to generate a wrong-domain vblend than a wrong-domain vinsertf128.

Thanks, Sanjay.
I made a couple of comments (see below). Otherwise the patch looks good to me.

lib/Target/X86/X86ISelLowering.cpp
185–186 ↗(On Diff #22282)

So, the INSERT_SUBVECTOR node is always generated regardless of whether 'ScalarType' is floating point or not. I think it makes sense to factor out the common logic between the floating point and the integer case. For example, you can create the INSERT_SUBVECTOR node immediately after line 175. This will allow you to get rid of the code at around line 203.

test/CodeGen/X86/avx-cast.ll
9–12 ↗(On Diff #22282)

Would it be possible to also have a test where the vector insertion is not performed on a zero vector? All the test cases you modified only seem to cover the case where a vector is inserted into the low 128-bit lane of a zero vector.

spatel added inline comments.Mar 19 2015, 2:13 PM
test/CodeGen/X86/avx-cast.ll
9–12 ↗(On Diff #22282)

Let me make sure I understand the scenario. Is it different than this:

define <4 x double> @shuffle_v4f64_0167(<4 x double> %a, <4 x double> %b) {
; ALL-LABEL: shuffle_v4f64_0167:
; ALL:       # BB#0:
; ALL-NEXT:    vblendpd {{.*#+}} ymm0 = ymm0[0,1],ymm1[2,3]
; ALL-NEXT:    retq
  %shuffle = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 0, i32 1, i32 6, i32 7>
  ret <4 x double> %shuffle
}

I think the existing shuffle lowering was already detecting the non-zero version, so there are existing test cases that cover it (the above is in vector-shuffle-256-v4.ll). Please let me know if I've misunderstood.

andreadb accepted this revision.Mar 19 2015, 2:52 PM
andreadb edited edge metadata.
andreadb added inline comments.
test/CodeGen/X86/avx-cast.ll
9–12 ↗(On Diff #22282)

That matches what I originally thought: the non-zero cases are handled by other parts of the shuffle lowering logic.
I just wanted to make sure that we had good test coverage :-).
I think your patch is OK. Thanks!

This revision is now accepted and ready to land.Mar 19 2015, 2:52 PM
This revision was automatically updated to reflect the committed changes.

Thanks, Andrea. I hoisted the common INSERT_SUBVECTOR and checked in at r232773.