This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] allow more than one use for vector cast folding with selects
ClosedPublic

Authored by spatel on Jun 9 2016, 10:50 AM.

Details

Summary

The motivating example for this transform is similar to D20774, where bitcasts interfere with a single cmp/select sequence, but in this case we have two uses of each bitcast to produce min and max ops:

define void @minmax_bc_store(<4 x float> %a, <4 x float> %b, <4 x float>* %ptr1, <4 x float>* %ptr2) {
  %cmp = fcmp olt <4 x float> %a, %b
  %bc1 = bitcast <4 x float> %a to <4 x i32>
  %bc2 = bitcast <4 x float> %b to <4 x i32>
  %sel1 = select <4 x i1> %cmp, <4 x i32> %bc1, <4 x i32> %bc2
  %sel2 = select <4 x i1> %cmp, <4 x i32> %bc2, <4 x i32> %bc1
  %bc3 = bitcast <4 x float>* %ptr1 to <4 x i32>*
  store <4 x i32> %sel1, <4 x i32>* %bc3
  %bc4 = bitcast <4 x float>* %ptr2 to <4 x i32>*
  store <4 x i32> %sel2, <4 x i32>* %bc4
  ret void
}

With this patch, we move the selects up to use the input arguments, which allows getting rid of all of the bitcasts:

define void @minmax_bc_store(<4 x float> %a, <4 x float> %b, <4 x float>* %ptr1, <4 x float>* %ptr2) {
  %cmp = fcmp olt <4 x float> %a, %b
  %sel1.v = select <4 x i1> %cmp, <4 x float> %a, <4 x float> %b
  %sel2.v = select <4 x i1> %cmp, <4 x float> %b, <4 x float> %a
  store <4 x float> %sel1.v, <4 x float>* %ptr1, align 16
  store <4 x float> %sel2.v, <4 x float>* %ptr2, align 16
  ret void
}

The asm for x86 SSE then improves from:

movaps	%xmm0, %xmm2
cmpltps	%xmm1, %xmm2
movaps	%xmm2, %xmm3
andnps	%xmm1, %xmm3
movaps	%xmm2, %xmm4
andnps	%xmm0, %xmm4
andps	%xmm2, %xmm0
orps	%xmm3, %xmm0
andps	%xmm1, %xmm2
orps	%xmm4, %xmm2
movaps	%xmm0, (%rdi)
movaps	%xmm2, (%rsi)

To:

movaps	%xmm0, %xmm2
minps	%xmm1, %xmm2
maxps	%xmm0, %xmm1
movaps	%xmm2, (%rdi)
movaps	%xmm1, (%rsi)
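In InstCombine terms, the transform hoists the select above the bitcasts. The following is a minimal C++ sketch of that shape using LLVM's PatternMatch and IRBuilder APIs; the function name and structure are illustrative only, not the actual code in InstCombineSelect.cpp:

#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/PatternMatch.h"
using namespace llvm;
using namespace PatternMatch;

// Illustrative only: fold select(C, bitcast A, bitcast B) into
// bitcast(select(C, A, B)) when A and B share a type.
static Value *foldSelectOfBitcasts(SelectInst &Sel, IRBuilder<> &Builder) {
  Value *Cond = Sel.getCondition();
  Value *A, *B;
  if (!match(Sel.getTrueValue(), m_BitCast(m_Value(A))) ||
      !match(Sel.getFalseValue(), m_BitCast(m_Value(B))) ||
      A->getType() != B->getType())
    return nullptr;

  // A vector condition selects per element, so the source type must keep
  // the same element count (a bitcast may change it, e.g. <4 x i32> to
  // <2 x i64>).
  if (auto *CondVTy = dyn_cast<VectorType>(Cond->getType())) {
    auto *SrcVTy = dyn_cast<VectorType>(A->getType());
    if (!SrcVTy || SrcVTy->getNumElements() != CondVTy->getNumElements())
      return nullptr;
  }

  // Deliberately no one-use check on the bitcasts: they are free at
  // codegen time, so they may stay alive for other users.
  Value *NewSel = Builder.CreateSelect(Cond, A, B, Sel.getName() + ".v");
  return Builder.CreateBitCast(NewSel, Sel.getType());
}

Because a bitcast is a no-op at codegen time, leaving the original bitcasts alive for their other users (here, the twin select that forms the max) costs nothing, which is what justifies relaxing the one-use restriction for bitcasts.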

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 60197. Jun 9 2016, 10:50 AM
spatel retitled this revision from to [InstCombine] allow more than one use for vector cast folding with selects.
spatel updated this object.
spatel added reviewers: majnemer, eli.friedman, RKSimon.
spatel added a subscriber: llvm-commits.
eli.friedman edited edge metadata. Jun 9 2016, 11:52 AM

Is it possible that this could cause an infinite loop in instcombine? Previously, this transform always reduced the total number of bitcasts in the function, but this doesn't. (Consider the case where the select uses its own result.)

lib/Transforms/InstCombine/InstCombineSelect.cpp
151 ↗(On Diff #60197)

Need to fix comment here?

spatel added a comment.

> Is it possible that this could cause an infinite loop in instcombine? Previously, this transform always reduced the total number of bitcasts in the function, but this doesn't. (Consider the case where the select uses its own result.)

Sorry for not seeing it - the select uses its own result via a phi? Can you show the construct that you have in mind?

eli.friedman added a comment.

Something like this?

loop:
  %sel = select <4 x i1> %cmp, <4 x i32> %selx, <4 x i32> %selx
  %selx = bitcast <4 x i32> %sel to <4 x float>
  br label %loop

Your optimization flips the types of the select and the bitcast, I think. Obviously not an actual testcase because it's missing PHI nodes, but that's the basic idea.

spatel added a comment. Jun 9 2016, 1:45 PM

> Something like this?
>
> loop:
>   %sel = select <4 x i1> %cmp, <4 x i32> %selx, <4 x i32> %selx
>   %selx = bitcast <4 x i32> %sel to <4 x float>
>   br label %loop
>
> Your optimization flips the types of the select and the bitcast, I think. Obviously not an actual testcase because it's missing PHI nodes, but that's the basic idea.

Interesting... I'm trying to be more diabolical, but I can't get it yet. :)
In order for the bitcasted output to feed back into the original instruction, it would have to be bitcasted back to the original type somewhere along the way? In that case, I think we'd eliminate the bitcasts as they get paired up:

define void @infloop(<4 x i1> %cmp, <4 x i32> %a, <4 x i32> %b) {
entry:
  br label %loop

loop:
  %phi1 = phi <4 x i32> [ %a, %entry ], [ %self1, %loop ]
  %phi2 = phi <4 x i32> [ %b, %entry ], [ %self2, %loop ]
  %selx1 = bitcast <4 x i32> %phi1 to <4 x float>
  %selx2 = bitcast <4 x i32> %phi2 to <4 x float>
  %sel1 = select <4 x i1> %cmp, <4 x float> %selx1, <4 x float> %selx2
  %sel2 = select <4 x i1> %cmp, <4 x float> %selx2, <4 x float> %selx1
  %self1 = bitcast <4 x float> %sel1 to <4 x i32>
  %self2 = bitcast <4 x float> %sel2 to <4 x i32>
  br label %loop

  ret void
}

$ ./opt -instcombine infloop.ll -S

define void @infloop(<4 x i1> %cmp, <4 x i32> %a, <4 x i32> %b) {
entry:
  br label %loop

loop:                                             ; preds = %loop, %entry
  %phi1 = phi <4 x i32> [ %a, %entry ], [ %sel1.v, %loop ]
  %phi2 = phi <4 x i32> [ %b, %entry ], [ %sel2.v, %loop ]
  %sel1.v = select <4 x i1> %cmp, <4 x i32> %phi1, <4 x i32> %phi2
  %sel2.v = select <4 x i1> %cmp, <4 x i32> %phi2, <4 x i32> %phi1
  br label %loop
                                                  ; No predecessors!
  ret void
}

eli.friedman added a comment.

Looking a bit more, I think the infinite loop isn't possible because of the way instcombine works with PHI nodes and the reachable code restriction. Sorry about the false alarm.

spatel added a comment. Jun 9 2016, 3:01 PM

> Looking a bit more, I think the infinite loop isn't possible because of the way instcombine works with PHI nodes and the reachable code restriction. Sorry about the false alarm.

No problem - thanks for making me look harder at the possibilities. I'll upload a new draft with a code comment change.

spatel updated this revision to Diff 60246. Jun 9 2016, 3:04 PM
spatel edited edge metadata.

Patch updated:
Add comment to better explain the one-use restriction.
Also, add a TODO comment for cleanup because there's a strange combo of isa/dyn_cast/llvm_unreachable below here.
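To make the one-use distinction concrete, here is a hypothetical sketch (not the committed diff) of the shape such a check can take: a bitcast is free at codegen time and so may keep other users, while any other cast must be one-use so the fold cannot duplicate real work.

#include "llvm/IR/Instructions.h"
using namespace llvm;

// Hypothetical helper, not the committed code: may a select be hoisted
// above this cast?
static bool canHoistSelectAboveCast(CastInst *CI) {
  // A bitcast lowers to nothing, so keeping it alive for other users
  // (e.g. the twin select that forms the max) costs nothing.
  if (isa<BitCastInst>(CI))
    return true;
  // Any other cast (trunc, zext, sitofp, ...) does real work, so it must
  // have exactly one use or the fold would duplicate that work.
  return CI->hasOneUse();
}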

eli.friedman accepted this revision. Jun 16 2016, 10:48 AM
eli.friedman edited edge metadata.

This patch doesn't only apply to bitcasts, which can lead to gigantic codegen changes in some cases. That isn't a new problem, though; consider:

define void @min_max_trunc(<4 x float> %a, <4 x float> %b, <4 x i64> %c, <4 x i64> %d, <4 x i32>* %ptr1, <4 x i32>* %ptr2) {
  %cmp = fcmp olt <4 x float> %a, %b
  %bc1 = trunc <4 x i64> %c to <4 x i32>
  %bc2 = trunc <4 x i64> %d to <4 x i32>
  %sel1 = select <4 x i1> %cmp, <4 x i32> %bc1, <4 x i32> %bc2
  store <4 x i32> %sel1, <4 x i32>* %ptr1
  ret void
}

instcombine makes this generate much worse code for SSE2, although I'm pretty sure we can blame SelectionDAG for some part of that because it's generating absolutely terrible code. Feel free to just file a bug for this, but I'm pretty sure we need some sort of target-hook for this.
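No such hook exists in TargetTransformInfo at the time of this review; purely as an illustration, a heuristic of that flavor might look like the sketch below, where the function name, its parameters, and the MaxLegalVectorBits threshold are all invented for this example.

#include "llvm/IR/Type.h"
using namespace llvm;

// Invented for illustration; no such hook exists in TargetTransformInfo.
// Idea: only hoist a select above a cast if the select in the wider
// source type still fits the target's vector registers.
static bool shouldHoistSelectAboveCast(Type *SrcTy, Type *DstTy,
                                       unsigned MaxLegalVectorBits) {
  // Bitcasts never change the total bit width, so they are always fine.
  if (SrcTy->getPrimitiveSizeInBits() == DstTy->getPrimitiveSizeInBits())
    return true;
  // For a widening hoist (the select moves from DstTy up to SrcTy),
  // require the source type to fit in a legal vector register.
  return SrcTy->getPrimitiveSizeInBits() <= MaxLegalVectorBits;
}

With a 128-bit limit for SSE2, the <4 x i64> operands in the trunc example above would be rejected, keeping the select in the narrow type.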

It would be nice to throw together a few testcases for zext/sext/trunc/sitofp/fptoui just to make sure we have coverage.

Otherwise LGTM.

This revision is now accepted and ready to land. Jun 16 2016, 10:48 AM

spatel added a comment.

> This patch doesn't only apply to bitcasts, which can lead to gigantic codegen changes in some cases.
>
> ...
>
> Feel free to just file a bug for this, but I'm pretty sure we need some sort of target-hook for this.

Thanks:
https://llvm.org/bugs/show_bug.cgi?id=28160

I think I should limit this patch to only bitcasts to make it safer?

eli.friedman added a comment.

> I think I should limit this patch to only bitcasts to make it safer?

If you like. Probably not a big deal either way, given the transforms we already perform.

This revision was automatically updated to reflect the committed changes.