This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Remove check for sext of vector icmp from shouldOptimizeCast
ClosedPublic

Authored by craig.topper on Aug 1 2017, 11:53 PM.

Details

Summary

It looks like for 'and' and 'or' we end up performing at least some of the transformations this check is blocking in a roundabout way anyway.

For 'and sext(cmp1), sext(cmp2)' we end up later turning it into 'select cmp1, sext(cmp2), 0'. Then we optimize that back to 'sext (and cmp1, cmp2)'. This is the same result we would have gotten if shouldOptimizeCast hadn't blocked the fold. We do something analogous for 'or'.
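To make that concrete, here is a sketch of the 'and' case on a vector compare like the ones this patch affects; the function names are made up for illustration:

; Input pattern: bitwise 'and' of two sign-extended vector compares.
define <4 x i32> @and_sext_cmps(<4 x i32> %a, <4 x i32> %b) {
  %cmp1 = icmp sgt <4 x i32> %a, zeroinitializer
  %cmp2 = icmp sgt <4 x i32> %b, zeroinitializer
  %sext1 = sext <4 x i1> %cmp1 to <4 x i32>
  %sext2 = sext <4 x i1> %cmp2 to <4 x i32>
  %and = and <4 x i32> %sext1, %sext2
  ret <4 x i32> %and
}

; End result of the existing roundabout folds, and of the direct fold with this patch:
define <4 x i32> @and_sext_cmps_folded(<4 x i32> %a, <4 x i32> %b) {
  %cmp1 = icmp sgt <4 x i32> %a, zeroinitializer
  %cmp2 = icmp sgt <4 x i32> %b, zeroinitializer
  %and1 = and <4 x i1> %cmp1, %cmp2
  %r = sext <4 x i1> %and1 to <4 x i32>
  ret <4 x i32> %r
}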

With this patch we allow that transformation to happen directly in foldCastedBitwiseLogic, and we now support the same thing for 'xor'. This definitely opens up other cases, but since we already worked around the check for some of them, hopefully that's OK.
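For 'xor', which didn't have the roundabout select path, a minimal before/after sketch (again with made-up names):

define <4 x i32> @xor_sext_cmps(<4 x i32> %a, <4 x i32> %b) {
  %cmp1 = icmp sgt <4 x i32> %a, zeroinitializer
  %cmp2 = icmp sgt <4 x i32> %b, zeroinitializer
  %sext1 = sext <4 x i1> %cmp1 to <4 x i32>
  %sext2 = sext <4 x i1> %cmp2 to <4 x i32>
  %xor = xor <4 x i32> %sext1, %sext2
  ret <4 x i32> %xor
}

; With this patch the same fold should apply directly here, producing roughly:
;   %xor1 = xor <4 x i1> %cmp1, %cmp2
;   %r = sext <4 x i1> %xor1 to <4 x i32>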

Diff Detail

Repository
rL LLVM

Event Timeline

craig.topper created this revision.Aug 1 2017, 11:53 PM
spatel edited edge metadata.Aug 2 2017, 9:19 AM
spatel added subscribers: efriedma, mcrosier, t.p.northover.

I pushed 'test7' through llc for x86 and PPC64LE with no problems. But then I tried AArch64 and ARM, and they went nuts whether the logic op was an 'xor' or an 'and':

define <2 x i64> @test7(<4 x float> %a, <4 x float> %b) {
  %cmp = fcmp ult <4 x float> %a, zeroinitializer
  %cmp4 = fcmp ult <4 x float> %b, zeroinitializer
  %sext = sext <4 x i1> %cmp to <4 x i32>
  %sext5 = sext <4 x i1> %cmp4 to <4 x i32>
  %and = and <4 x i32> %sext, %sext5
  %conv = bitcast <4 x i32> %and to <2 x i64>
  ret <2 x i64> %conv
}

define <2 x i64> @test7_better(<4 x float> %a, <4 x float> %b) {
  %cmp = fcmp ult <4 x float> %a, zeroinitializer
  %cmp4 = fcmp ult <4 x float> %b, zeroinitializer
  %and1 = and <4 x i1> %cmp, %cmp4
  %and = sext <4 x i1> %and1 to <4 x i32>
  %conv = bitcast <4 x i32> %and to <2 x i64>
  ret <2 x i64> %conv
}

$ ./llc -o - vcmp.ll -mtriple=aarch64

test7:                           // @test7
	fcmge	v0.4s, v0.4s, #0.0
	mvn	 v0.16b, v0.16b
	fcmge	v1.4s, v1.4s, #0.0
	bic	v0.16b, v0.16b, v1.16b
	ret
test7_better:                           // @test7_better
// BB#0:
	fcmge	v0.4s, v0.4s, #0.0
	fcmge	v1.4s, v1.4s, #0.0
	mvn	 v0.16b, v0.16b
	mvn	 v1.16b, v1.16b
	xtn	v0.4h, v0.4s
	xtn	v1.4h, v1.4s
	and	v0.8b, v0.8b, v1.8b
	ushll	v0.4s, v0.4h, #0
	shl	v0.4s, v0.4s, #31
	sshr	v0.4s, v0.4s, #31
	ret

Given that the more common problem patterns already exist independently of this patch, I would agree to proceed. But let's ping people with an ARM stake for their opinions - @t.p.northover @efriedma @mcrosier ?

Our handling of i1 masks in SelectionDAG is generally terrible.

x86 doesn't have this particular problem for <4 x float> because it doesn't have 64-bit vectors, and it doesn't have the problem for <8 x float> when AVX is enabled because a target-specific DAGCombine (WidenMaskArithmetic) works around the issue. But it does show up in other cases. For example, try the following on x86 without AVX:

define <8 x i32> @testa(<8 x float> %a, <8 x float> %b) {
  %cmp = fcmp ult <8 x float> %a, zeroinitializer
  %cmp4 = fcmp ult <8 x float> %b, zeroinitializer
  %and1 = and <8 x i1> %cmp, %cmp4
  %and = sext <8 x i1> %and1 to <8 x i32>
  ret <8 x i32> %and
}

Or the following with AVX (but not AVX512):

define <16 x i32> @testb(<16 x float> %a, <16 x float> %b) {
  %cmp = fcmp ult <16 x float> %a, zeroinitializer
  %cmp4 = fcmp ult <16 x float> %b, zeroinitializer
  %and1 = and <16 x i1> %cmp, %cmp4
  %and = sext <16 x i1> %and1 to <16 x i32>
  ret <16 x i32> %and
}
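
For reference, one way to push these through llc; the filename and the exact feature strings are my guess at reproducing the no-AVX and AVX-without-AVX-512 configurations:

$ llc -o - wide-cmp.ll -mtriple=x86_64 -mattr=-avx
$ llc -o - wide-cmp.ll -mtriple=x86_64 -mattr=+avx,-avx512f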

That said, I don't think it makes sense to block this patch on that issue; it's an existing problem.

spatel accepted this revision.Aug 22 2017, 4:33 PM

LGTM.

I filed an AArch64 bug to track the backend problem:
https://bugs.llvm.org/show_bug.cgi?id=34290

This revision is now accepted and ready to land.Aug 22 2017, 4:33 PM
This revision was automatically updated to reflect the committed changes.