This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] add insertelement + shuffle demanded element fold
ClosedPublic

Authored by spatel on Aug 28 2017, 3:15 PM.

Details

Summary

I started drafting a patch for this as an add-on for simplify demanded elements that would have been similar to SimplifyMultipleUseDemandedBits(), but since I don't know if any more folds like this exist, it seems unnecessary to make a larger patch for this special case.

The problem arises because we (correctly, I think) bail out of SimplifyDemandedVectorElts() if a value has multiple uses. But then we call recognizeIdentityMask() which ignores undef shuffle mask elements. That means we might replace a shuffle with an insertelement that doesn't need to exist. At that point, there's no going back - we've lost the information (via the shuffle mask) that would have let any other transforms know that the insertelement was not actually needed. If we don't ignore undefs in recognizeIdentityMask(), we'll currently fail to simplify several other patterns, so chasing down all of those didn't seem like a better option.

This can manifest as bad codegen for things like horizontal ops (PR34111) because the bogus inserts in IR can't be ignored by the SLP vectorizer, and then the backend doesn't have enough pattern matching to see the redundancies:

define <4 x float> @add_ps_002(<4 x float> %a) {
  %a0 = extractelement <4 x float> %a, i32 0
  %a1 = extractelement <4 x float> %a, i32 1
  %a2 = extractelement <4 x float> %a, i32 2
  %a3 = extractelement <4 x float> %a, i32 3
  %a0_again = extractelement <4 x float> %a, i32 0
  %a1_again = extractelement <4 x float> %a, i32 1
  %a2_again = extractelement <4 x float> %a, i32 2
  %a3_again = extractelement <4 x float> %a, i32 3
  %add01 = fadd float %a0, %a1
  %add23 = fadd float %a2, %a3
  %add01_again = fadd float %a0_again, %a1_again
  %add23_again = fadd float %a2_again, %a3_again
  %out0 = insertelement <4 x float> undef, float %add01, i32 0
  %out01 = insertelement <4 x float> %out0, float %add23, i32 1
  %out012 = insertelement <4 x float> %out01, float %add01_again, i32 2
  %out0123 = insertelement <4 x float> %out012, float %add23_again, i32 3
  %shuffle = shufflevector <4 x float> %out0123, <4 x float> %a, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
  ret <4 x float> %shuffle
}

$ ./opt -instcombine -slp-vectorizer -instcombine -S hadd_bogus_undef.ll | ./llc -o - -mattr=avx

vmovshdup	%xmm0, %xmm1    ## xmm1 = xmm0[1,1,3,3] <-- didn't need that...
vaddss	%xmm1, %xmm0, %xmm1   <--- or that...
vhaddps	%xmm0, %xmm0, %xmm0
vinsertps	$32, %xmm1, %xmm0, %xmm0 ## xmm0 = xmm0[0,1],xmm1[0],xmm0[3]  <-- or that

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Aug 28 2017, 3:15 PM
craig.topper edited edge metadata.Aug 28 2017, 4:35 PM

Why doesn't SimplifyDemandedBits drop %out012? It doesn't have multiple users.

It looks like once we evaluate %out0123 in SimplifyDemandedVectorElts, we stop the recursion right there and don't go back any further. This might have been ok if visitShuffleVectorInst had immediately returned based on SimplifyDemandedVectorElts returning non-null. If it had we would have re-added the shuffle to the worklist and reevaluated the SimplifyDemandedVectorElts again. But instead it just sets a MadeChange flag and goes on.

Oh the use count is high because we haven't DCEd the insert_element we bypassed yet.

Oh the use count is high because we haven't DCEd the insert_element we bypassed yet.

Right - I played around with returning early instead of setting MadeChange, but it didn't change anything in my experiments.

This comment was removed by craig.topper.
craig.topper added a comment.EditedAug 28 2017, 4:58 PM

What about

--- a/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
+++ b/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
@@ -996,19 +996,21 @@ Value *InstCombiner::SimplifyDemandedVectorElts(Value *V, APInt DemandedElts,
     // If this is inserting an element that isn't demanded, remove this
     // insertelement.
     unsigned IdxNo = Idx->getZExtValue();
-    if (IdxNo >= VWidth || !DemandedElts[IdxNo]) {
-      Worklist.Add(I);
-      return I->getOperand(0);
-    }

     // Otherwise, the element inserted overwrites whatever was there, so the
     // input demanded set is simpler than the output set.
     APInt DemandedElts2 = DemandedElts;
-    DemandedElts2.clearBit(IdxNo);
+    if (IdxNo < VWidth)
+      DemandedElts2.clearBit(IdxNo);
     TmpV = SimplifyDemandedVectorElts(I->getOperand(0), DemandedElts2,
                                       UndefElts, Depth + 1);
     if (TmpV) { I->setOperand(0, TmpV); MadeChange = true; }

+    if (IdxNo >= VWidth || !DemandedElts[IdxNo]) {
+      Worklist.Add(I);
+      return I->getOperand(0);
+    }
+
     // The inserted element is defined.
     UndefElts.clearBit(IdxNo);
     break;

What about

--- a/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
+++ b/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
@@ -996,19 +996,21 @@ Value *InstCombiner::SimplifyDemandedVectorElts(Value *V, APInt DemandedElts,
     // If this is inserting an element that isn't demanded, remove this
     // insertelement.
     unsigned IdxNo = Idx->getZExtValue();
-    if (IdxNo >= VWidth || !DemandedElts[IdxNo]) {
-      Worklist.Add(I);
-      return I->getOperand(0);
-    }

     // Otherwise, the element inserted overwrites whatever was there, so the
     // input demanded set is simpler than the output set.
     APInt DemandedElts2 = DemandedElts;
-    DemandedElts2.clearBit(IdxNo);
+    if (IdxNo < VWidth)
+      DemandedElts2.clearBit(IdxNo);
     TmpV = SimplifyDemandedVectorElts(I->getOperand(0), DemandedElts2,
                                       UndefElts, Depth + 1);
     if (TmpV) { I->setOperand(0, TmpV); MadeChange = true; }

+    if (IdxNo >= VWidth || !DemandedElts[IdxNo]) {
+      Worklist.Add(I);
+      return I->getOperand(0);
+    }
+
     // The inserted element is defined.
     UndefElts.clearBit(IdxNo);
     break;

This patch will work on the examples given here, so maybe we want to do it anyway, but I think it's more fragile.

Ie, it works for a chain of ops that we know how to handle here in SimplifyDemandedVectorElts(), but as soon as something unexpected appears in the pattern, we might fail again. Here's the smallest variant that I could come up with to show the problem:

define <4 x float> @inselt_shuf_no_demand_bogus_insert_index_in_chain(float %a1, float %a2, float %a3, i32 %variable_index) {
  %out1 = insertelement <4 x float> undef, float %a1, i32 1
  %out12 = insertelement <4 x float> %out1, float %a2, i32 undef ; something unexpected
  %out123 = insertelement <4 x float> %out12, float %a3, i32 3
  %shuffle = shufflevector <4 x float> %out123, <4 x float> undef, <4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>
  ret <4 x float> %shuffle
}

This will be left as:

%out1 = insertelement <4 x float> undef, float %a1, i32 1
ret <4 x float> %out1

Even if we make the other fix in visitShuffleVectorInst() to return instead of set MadeChange, this still won't get folded because %out1 has multiple uses.

I agree there are definitely cases where your change is required. I was just confused how we got a multiple use that didn't exist in the original IR.

My fix seems to be needed to get rid of %out012 in this test case

define <4 x i32> @add_ps_002(<4 x i32> %a, <4 x i32> %b) {
  %a0 = extractelement <4 x i32> %a, i32 0
  %a1 = extractelement <4 x i32> %a, i32 1
  %a2 = extractelement <4 x i32> %a, i32 2
  %a3 = extractelement <4 x i32> %a, i32 3
  %a0_again = extractelement <4 x i32> %a, i32 0
  %a1_again = extractelement <4 x i32> %a, i32 1
  %a2_again = extractelement <4 x i32> %a, i32 2
  %a3_again = extractelement <4 x i32> %a, i32 3
  %add01 = add i32 %a0, %a1
  %add23 = add i32 %a2, %a3
  %add01_again = add i32 %a0_again, %a1_again
  %add23_again = add i32 %a2_again, %a3_again
  %out0 = insertelement <4 x i32> undef, i32 %add01, i32 0
  %out01 = insertelement <4 x i32> %out0, i32 %add23, i32 1
  %out012 = insertelement <4 x i32> %out01, i32 %add01_again, i32 2
  %foo = add <4 x i32> %out012, %b
  %out0123 = insertelement <4 x i32> %foo, i32 %add23_again, i32 3
  %shuffle = shufflevector <4 x i32> %out0123, <4 x i32> %a, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
  ret <4 x i32> %shuffle
}

I agree there are definitely cases where your change is required. I was just confused how we got a multiple use that didn't exist in the original IR.

My fix seems to be needed to get rid of %out012 in this test case

define <4 x i32> @add_ps_002(<4 x i32> %a, <4 x i32> %b) {
  %a0 = extractelement <4 x i32> %a, i32 0
  %a1 = extractelement <4 x i32> %a, i32 1
  %a2 = extractelement <4 x i32> %a, i32 2
  %a3 = extractelement <4 x i32> %a, i32 3
  %a0_again = extractelement <4 x i32> %a, i32 0
  %a1_again = extractelement <4 x i32> %a, i32 1
  %a2_again = extractelement <4 x i32> %a, i32 2
  %a3_again = extractelement <4 x i32> %a, i32 3
  %add01 = add i32 %a0, %a1
  %add23 = add i32 %a2, %a3
  %add01_again = add i32 %a0_again, %a1_again
  %add23_again = add i32 %a2_again, %a3_again
  %out0 = insertelement <4 x i32> undef, i32 %add01, i32 0
  %out01 = insertelement <4 x i32> %out0, i32 %add23, i32 1
  %out012 = insertelement <4 x i32> %out01, i32 %add01_again, i32 2
  %foo = add <4 x i32> %out012, %b
  %out0123 = insertelement <4 x i32> %foo, i32 %add23_again, i32 3
  %shuffle = shufflevector <4 x i32> %out0123, <4 x i32> %a, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
  ret <4 x i32> %shuffle
}

Yep - nice example. So we need both of these changes and might as well do the early return too.

Do you think that covers all cases?

Since I've demystified for myself how to write an opt pass, I'm wondering if we should just move SimplifyDemandedVectorElts() into its own pass similar to BDCE. I think it would run twice (once near the start and then again after loopvectorizer/SLP). And then instcombine would have one less task in its job description.

spatel updated this revision to Diff 113319.Aug 30 2017, 2:53 PM

Patch updated:
Use @craig.topper 's suggested change to keep recursing in SimplifyDemandedVectorElts before replacing an operand.

This is a small change that fixes the case that motivated me to look here in the first place. I added a test case that would have been caught by my initial patch in rL312172, so we don't lose it. Eventually, I think we should make a separate pass to better simplify based on vector demanded elements without further burdening instcombine.

This revision is now accepted and ready to land.Aug 30 2017, 4:02 PM
This revision was automatically updated to reflect the committed changes.