This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] replace bitcast to scalar + insertelement with widening shuffle + vector bitcast
Abandoned · Public

Authored by spatel on Sep 27 2017, 7:04 AM.

Details

Reviewers
efriedma
igorb
zvi
Summary

insert undef, (bitcast vType X to scalar), C --> bitcast (shuffle X, undef, Mask)
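
For example (a sketch with illustrative value names, assuming C = 0, a <2 x i32> source, and an <8 x i64> destination), the fold turns:

  %s = bitcast <2 x i32> %x to i64
  %r = insertelement <8 x i64> undef, i64 %s, i32 0

into:

  %w = shufflevector <2 x i32> %x, <2 x i32> undef, <16 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %r = bitcast <16 x i32> %w to <8 x i64>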

I think this is a universal improvement for vector IR code because it removes a vector-to-scalar-to-vector transition, but I'm not sure if the pattern is relevant to anything besides x86 AVX. In the motivating example from PR34716 ( https://bugs.llvm.org/show_bug.cgi?id=34716 ), we have:

define <8 x i64> @test(i32 %x0, i32 %x1) {
  %1 = insertelement <2 x i32> undef, i32 %x0, i32 0
  %2 = insertelement <2 x i32> %1, i32 %x1, i32 1
  %3 = bitcast <2 x i32> %2 to i64
  %4 = insertelement <8 x i64> undef, i64 %3, i32 0
  %5 = shufflevector <8 x i64> %4, <8 x i64> undef, <8 x i32> zeroinitializer
  ret <8 x i64> %5
}

This leads to inefficient movement between scalar GPRs and vector registers. With this patch, other vector instcombines will fire, reducing the IR to:

define <8 x i64> @test(i32 %x0, i32 %x1) {
  %1 = insertelement <16 x i32> undef, i32 %x0, i32 0   ; wide vec insert
  %2 = insertelement <16 x i32> %1, i32 %x1, i32 1       ; wide vec insert
  %3 = bitcast <16 x i32> %2 to <8 x i64>                       ; free bitcast
  %4 = shufflevector <8 x i64> %3, <8 x i64> undef, <8 x i32> zeroinitializer   ; splat
  ret <8 x i64> %4
}

And with the right backend folds, a 32-bit AVX512 target could load the two 32-bit scalars and splat them in one instruction, although that doesn't quite happen yet; with this patch we currently get:

vmovsd	4(%esp), %xmm0          # xmm0 = mem[0],zero
vbroadcastsd	%xmm0, %zmm0

Event Timeline

spatel created this revision. Sep 27 2017, 7:04 AM
efriedma edited edge metadata. Sep 27 2017, 10:58 AM
efriedma added subscribers: arsenm, tra.

ARM/AArch64 are very similar in this respect, since there are multiple vector register sizes. You'll see a similar result for your examples on aarch64. (On 32-bit ARM, we manage to optimize away the extra copy after isel.) I'm not quite sure how much of this logic it makes sense to put into instcombine, given most of the benefit here has to do with the way CPUs split integer and vector registers, but this is probably okay for other targets.

Does it make sense to do this transform even if the operand of the insertelement isn't undef? I guess you'd need a second shuffle in that case?

Took me a minute to reason it out, but I think this is right on big-endian targets: semantically, the bitcast is essentially equivalent in both forms.

ARM/AArch64 are very similar in this respect, since there are multiple vector register sizes. You'll see a similar result for your examples on aarch64. (On 32-bit ARM, we manage to optimize away the extra copy after isel.) I'm not quite sure how much of this logic it makes sense to put into instcombine, given most of the benefit here has to do with the way CPUs split integer and vector registers, but this is probably okay for other targets.

Yes, I'm biased by my view of split register files. If we disregard that, then we could just look at this as a canonicalization rule for equivalent IR constructs. But we need to be sure that targets can accept the relatively simple shuffle mask either way.

Does it make sense to do this transform even if the operand of the insertelement isn't undef? I guess you'd need a second shuffle in that case?

Right - I thought about that case, but I couldn't justify it if we need another IR instruction.
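
For reference, here's a sketch (illustrative names, not part of this patch) of what the non-undef case could look like - the trailing shufflevector is that extra instruction:

define <8 x i64> @insert_into_nonundef_sketch(<2 x i32> %x, <8 x i64> %v) {
  ; widen the <2 x i32> into the destination's bit layout
  %wide = shufflevector <2 x i32> %x, <2 x i32> undef, <16 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %bc = bitcast <16 x i32> %wide to <8 x i64>
  ; a second shuffle is needed to blend lane 0 into the non-undef base vector %v
  %r = shufflevector <8 x i64> %bc, <8 x i64> %v, <8 x i32> <i32 0, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  ret <8 x i64> %r
}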

Took me a minute to reason it out, but I think this is right on big-endian targets: semantically, the bitcast is essentially equivalent in both forms.

Took me more than a minute, and I still wasn't entirely sure. :)
Thanks for thinking about that - I should've mentioned it in the description.

Right - I thought about that case, but I couldn't justify it if we need another IR instruction.

So how do you plan to fix x86 lowering for this case? If we need a DAGCombine for this anyway, not sure there's much point in handling it in instcombine.

Right - I thought about that case, but I couldn't justify it if we need another IR instruction.

So how do you plan to fix x86 lowering for this case? If we need a DAGCombine for this anyway, not sure there's much point in handling it in instcombine.

The only thing missing is a load fold. I.e., with this patch we'll produce this asm for 32-bit with AVX512F (the motivating target in the bug report):

vmovsd	4(%esp), %xmm0          # xmm0 = mem[0],zero
vbroadcastsd	%xmm0, %zmm0

I think the ideal code would fuse those into a broadcast directly from memory:

vbroadcastsd 4(%esp), %zmm0

I meant, "how do we fix x86 in the general case"? Consider the following (with -mtriple=x86_64 -mattr=+xop):

define <8 x i64> @test(i32 %x0, i32 %x1, <8 x i64> %v) {
  %1 = insertelement <2 x i32> undef, i32 %x0, i32 0
  %2 = insertelement <2 x i32> %1, i32 %x1, i32 1
  %3 = bitcast <2 x i32> %2 to i64
  %4 = insertelement <8 x i64> %v, i64 %3, i32 0
  ret <8 x i64> %4
}

We currently generate a five-instruction sequence for something which can be done in two instructions. And the instcombine here won't trigger.

I meant, "how do we fix x86 in the general case"? Consider the following (with -mtriple=x86_64 -mattr=+xop):

Ah! Sorry, I misread the question.

define <8 x i64> @test(i32 %x0, i32 %x1, <8 x i64> %v) {
  %1 = insertelement <2 x i32> undef, i32 %x0, i32 0
  %2 = insertelement <2 x i32> %1, i32 %x1, i32 1
  %3 = bitcast <2 x i32> %2 to i64
  %4 = insertelement <8 x i64> %v, i64 %3, i32 0
  ret <8 x i64> %4
}

We currently generate a five-instruction sequence for something which can be done in two instructions. And the instcombine here won't trigger.

Ok - so this example is an extension of a different instcombine that I was originally thinking of to solve this case (https://bugs.llvm.org/show_bug.cgi?id=34716#c1). We could trade an insert for a bitcast:

define <8 x i64> @test_not_undef_bc(i32 %x0, i32 %x1, <8 x i64> %v) {
  %bc = bitcast <8 x i64> %v to <16 x i32>
  %i1 = insertelement <16 x i32> %bc, i32 %x0, i32 0
  %i2 = insertelement <16 x i32> %i1, i32 %x1, i32 1
  %bc2 = bitcast <16 x i32> %i2 to <8 x i64>
  ret <8 x i64> %bc2
}

I think we'd get that by applying a fold like:
(ins (bitcast (ins ...))) --> (bitcast (ins (bitcast ...)))
...twice. That's still 3 instructions though? For that IR, the backend gives:

vpinsrd	$0, %edi, %xmm0, %xmm2
vpinsrd	$1, %esi, %xmm2, %xmm2
vblendps	$15, %ymm2, %ymm0, %ymm0 ## ymm0 = ymm2[0,1,2,3],ymm0[4,5,6,7]

No, you're right, that's still three instructions; I didn't realize there wasn't a ymm version of vpinsrd.

No, you're right, that's still three instructions; I didn't realize there wasn't a ymm version of vpinsrd.

But looking at that IR again, it's not a simple match applied twice. Let me see what this would look like in the DAG with insert_subvector...

I guess the question is: does this IR patch have enough value to stand by itself? It does let us kill one IR instruction for the case in the bug report. But I agree we need something else in the DAG.

spatel abandoned this revision. Oct 13 2017, 8:59 AM

Abandoning - we solved this in the DAG with D38388, in a way that's more general in one respect but more restricted in another. If we find other gaps in this kind of code, we could resurrect this.