This is an archive of the discontinued LLVM Phabricator instance.

Generate better code for shuffles
ClosedPublic

Authored by mkuper on Dec 16 2014, 2:57 AM.

Details

Summary

This fixes PR15872, the code improves from:

vpextrd $1, %xmm0, %eax
vmovd   %xmm0, %ecx
vmovd   %ecx, %xmm1
vpinsrd $1, %eax, %xmm1, %xmm1
vextractf128    $1, %ymm0, %xmm2
vmovd   %xmm2, %eax
vpinsrd $2, %eax, %xmm1, %xmm1
vpextrd $1, %xmm2, %eax
vpinsrd $3, %eax, %xmm1, %xmm1
vpextrd $3, %xmm0, %eax
vpextrd $2, %xmm0, %ecx
vmovd   %ecx, %xmm0
vpinsrd $1, %eax, %xmm0, %xmm0
vpextrd $2, %xmm2, %eax
vpinsrd $2, %eax, %xmm0, %xmm0
vpextrd $3, %xmm2, %eax
vpinsrd $3, %eax, %xmm0, %xmm0
vmovdqa %xmm1, (%rdi)
vzeroupper
retq

to

vextractf128    $1, %ymm0, %xmm1
vpunpcklqdq     %xmm1, %xmm0, %xmm2 # xmm2 = xmm0[0],xmm1[0]
vpunpckhqdq     %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[1],xmm1[1]
vmovdqa %xmm2, (%rdi)
vzeroupper
retq

Sanjay has a fix for PR21711 which apparently has the same underlying issue here: http://reviews.llvm.org/D6622

This version is more general, but it may be too general, I'm fine with anything in this vein that fixes both PRs.

Diff Detail

Repository
rL LLVM

Event Timeline

mkuper updated this revision to Diff 17327.Dec 16 2014, 2:57 AM
mkuper retitled this revision from to Generate better code for shuffles.
mkuper updated this object.
mkuper edited the test plan for this revision. (Show Details)
mkuper added reviewers: nadav, andreadb, spatel, chandlerc, delena.
mkuper added a subscriber: Unknown Object (MLST).
spatel edited edge metadata.Dec 16 2014, 8:52 AM

Hi Michael -

Thanks for notifying about this patch. Can you please rebase the patch against a newer trunk rev? Each hunk failed to patch for me on r224339.

mkuper updated this revision to Diff 17337.Dec 16 2014, 9:13 AM
mkuper edited edge metadata.

Rebased patch.

By the way, Sanjay, I've verified that the test-cases you have in vec_extract-avx.ll pass with this.

Rebased patch.

This turned out to just be a problem with Windows line-endings in your patch. It didn't patch on a Mac because of that. Not sure if that gets cleaned up on submission or if you need to change that before committing.

By the way, Sanjay, I've verified that the test-cases you have in vec_extract-avx.ll pass with this.

Excellent! I'll leave it to the other reviewers to decide if we need a TLI hook for this; otherwise, LGTM.

This patch is better than D6622 because it handles cases that that one bails out on. Eg, when we're extracting across the 128-bit boundary as in your test case or:

define void @low_v8f32_to_v4f32(<8 x float> %v, <4 x float>* %ptr) {
 %ext0 = extractelement <8 x float> %v, i32 1   
 %ext1 = extractelement <8 x float> %v, i32 2
 %ext2 = extractelement <8 x float> %v, i32 3
 %ext3 = extractelement <8 x float> %v, i32 4   ; crossing the 128-bit lane
 %ins0 = insertelement <4 x float> undef, float %ext0, i32 0
 %ins1 = insertelement <4 x float> %ins0, float %ext1, i32 1
 %ins2 = insertelement <4 x float> %ins1, float %ext2, i32 2
 %ins3 = insertelement <4 x float> %ins2, float %ext3, i32 3
 store <4 x float> %ins3, <4 x float>* %ptr, align 16
 ret void
}

You're avoiding the x86 miscompile that I saw by only generating extract_subvectors along half-vector-length boundaries. The case that I saw failing would be caused by a node like this:

DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, VT, VecIn1, DAG.getIntPtrConstant(1)); // not a half-vector extract

When that ends up in X86ISelLowering's Extract128BitVector(), we quietly round the index down with this integer math:

unsigned NormalizedIdxVal = (((IdxVal * ElVT.getSizeInBits()) / vectorWidth) * ElemsPerChunk);

The comment around there says:

// Idx is an index in the 128 bits we want. It need not be aligned to a 128-bit boundary.

While that matches what the integer math does, I think this code should be asserting that the index is a correct multiple of the number of elements. Otherwise, we're generating an extract that doesn't execute what the input node is asking for.

The line-endings get cleaned up on submission, thankfully.

Regarding the miscompile - yes, that's what I thought. Wasn't sure it was a miscompile, or some magic is going on behind the scenes, but avoided it anyway since generating "unaligned" extracts didn't make sense to me from a performance stand-point.

Regarding D6622 - it's good to hear I didn't miss anything. I'll add the test-case from D6622 to my commit.

spatel added inline comments.Dec 16 2014, 11:23 AM
test/CodeGen/X86/vector-shuffle-combining.ll
1566 ↗(On Diff #17337)

This isn't passing 'make check' for me. On an AVX2 machine, we generate 'vextracti128' (the integer flavor of the extract instruction).

mkuper updated this revision to Diff 17349.Dec 16 2014, 11:55 AM

Fixed my own test (thanks Sanjay) and added the test case from D6622

andreadb accepted this revision.Dec 17 2014, 3:23 AM
andreadb edited edge metadata.

Hi Michael,

thanks for the patch. It looks good to me!

Cheers,
Andrea

This revision is now accepted and ready to land.Dec 17 2014, 3:23 AM
This revision was automatically updated to reflect the committed changes.