Another case of x86-specific shuffle strength reduction: avoid generating insert*128 instructions with index 0 because they are slower than their non-lane-changing blend equivalents.
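A minimal scalar model of why the two forms are interchangeable (this is illustrative C++, not LLVM code; the function names are made up): inserting a 128-bit value into the low lane of a 256-bit vector produces exactly the elements a lane-preserving blend would select.

```cpp
#include <array>
#include <cassert>

// Model a 256-bit vector of 4 doubles and a 128-bit vector of 2 doubles.
using V256 = std::array<double, 4>;
using V128 = std::array<double, 2>;

// Models vinsertf128 $0: replace the low 128-bit lane of Dst with Src.
V256 insert128_low(V256 Dst, V128 Src) {
  Dst[0] = Src[0];
  Dst[1] = Src[1];
  return Dst;
}

// Models a low-lane blend (e.g. vblendpd with the low-lane bits set):
// take elements 0,1 from the blended-in value, 2,3 from the original.
V256 blend_low(V256 A, V128 B) {
  return {B[0], B[1], A[2], A[3]};
}
```

Both produce the same result for any inputs, which is why the index-0 insert can be strength-reduced to the (faster, non-lane-crossing) blend.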
Updated patch: the previous version added logic to PerformShuffleCombine256(), but that wouldn't catch every case where we create an INSERT_SUBVECTOR. And yes, there was one more shuffle regression test where we expected an inserti128 $0.
In this version, I've moved the code into Insert128BitVector(). No other functional changes.
This should catch every creation of an INSERT_SUBVECTOR that can be optimized with a BLENDI.
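As a hedged sketch of the decision being made (the helper name and parameters below are hypothetical, not the patch's actual code), the choice in the lowering amounts to: an index-0 insert becomes a blend whenever possible, falling back to vinsert*128 only for high-lane inserts.

```cpp
#include <cassert>
#include <string>

// Illustrative model of the instruction-selection choice; not LLVM's
// real helper. Given a 256-bit INSERT_SUBVECTOR of a 128-bit value,
// pick the instruction to emit.
std::string lowerInsert128(unsigned Index, bool IsFloat, bool HasAVX2) {
  if (Index != 0) {
    // Inserting into the high lane genuinely needs a lane insert.
    return (IsFloat || !HasAVX2) ? "vinsertf128 $1" : "vinserti128 $1";
  }
  // Index 0: a lane-preserving blend is at least as fast.
  if (IsFloat)
    return "vblendpd"; // same-domain FP blend
  if (HasAVX2)
    return "vpblendd"; // AVX2 has an integer blend on YMM registers
  // AVX1 has no 256-bit integer blend: per the discussion below, a
  // wrong-domain vblendps still beats a wrong-domain vinsertf128.
  return "vblendps";
}
```

The AVX1 integer branch reflects the revision Andrea requests later in this thread.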
Hi Sanjay,
test/CodeGen/X86/avx-cast.ll, lines 43–46:

So, the reason your code doesn't optimize this case for AVX1 is that AVX1 doesn't support integer blend on YMM registers. However, wouldn't the following code be faster in this case (at least on Intel CPUs)?

    vxorps %ymm1, %ymm1, %ymm1
    vblendps $0, %ymm0, %ymm1, %ymm0

I understand that we want to avoid domain-crossing as much as possible.
test/CodeGen/X86/avx-cast.ll, lines 43–46:

Hi Andrea - thanks for the close reading! Yes - if you only have AVX and you get to this point, then there's no avoiding the domain-crossing, because you won't have vinserti128 either. I'll redo the check to account for this case.
Patch updated based on feedback from Andrea: for integer ops on AVX1, it's better to generate a wrong-domain vblend than a wrong-domain vinsertf128.
Thanks Sanjay.
I made a couple of comments (see below). Otherwise the patch looks good to me.
lib/Target/X86/X86ISelLowering.cpp, lines 185–186:

So, the INSERT_SUBVECTOR node is always generated regardless of whether 'ScalarType' is floating point or not. I think it makes sense to factor out the common logic between the floating-point and the integer case. For example, you can create the INSERT_SUBVECTOR node immediately after line 175. This will allow you to get rid of the code at around line 203.
test/CodeGen/X86/avx-cast.ll, lines 9–12:

Would it be possible to also have a test where the vector insertion is not performed on a zero vector? All the test cases you modified only seem to cover the case where a vector is inserted into the low 128-bit lane of a zero vector.
test/CodeGen/X86/avx-cast.ll, lines 9–12:

Let me make sure I understand the scenario. Is it different than this?

    define <4 x double> @shuffle_v4f64_0167(<4 x double> %a, <4 x double> %b) {
    ; ALL-LABEL: shuffle_v4f64_0167:
    ; ALL:       # BB#0:
    ; ALL-NEXT:    vblendpd {{.*#+}} ymm0 = ymm0[0,1],ymm1[2,3]
    ; ALL-NEXT:    retq
      %shuffle = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 0, i32 1, i32 6, i32 7>
      ret <4 x double> %shuffle
    }

I think the existing shuffle lowering was already detecting the non-zero version, so there are existing test cases that cover it (the above is in vector-shuffle-256-v4.ll). Please let me know if I've misunderstood.
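For reference, the vblendpd immediate checked in output like the above follows mechanically from the shuffle mask: bit i of the immediate is set when result element i comes from the second source. A small helper (illustrative only, not LLVM's actual code) for a <4 x double> shuffle:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Compute a vblendpd-style immediate for a 4-element shuffle mask,
// where mask values 0..3 pick from the first source and 4..7 pick
// from the second. Bit i of the immediate selects element i from
// the second source.
uint8_t blendImmediate(const std::array<int, 4> &Mask) {
  uint8_t Imm = 0;
  for (int i = 0; i < 4; ++i)
    if (Mask[i] >= 4)
      Imm |= 1 << i;
  return Imm;
}
```

For the mask <0, 1, 6, 7> above, elements 2 and 3 come from the second source, giving the immediate 0b1100.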
test/CodeGen/X86/avx-cast.ll, lines 9–12:

That matches what I originally thought: the non-zero cases are handled by other parts of the shuffle lowering logic.