insert undef, (bitcast vType X to scalar), C --> bitcast (shuffle X, undef, Mask)
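To make the pattern concrete, here is a minimal sketch on a small case (the value names %x, %s, %w, %r are illustrative, not taken from the patch): the insert of a bitcast scalar becomes a widening shuffle followed by a free vector-to-vector bitcast:

  ; before: vector -> scalar -> vector transition
  %s = bitcast <2 x i32> %x to i64
  %r = insertelement <2 x i64> undef, i64 %s, i32 0

  ; after: the 2 source elements land in the low 8 bytes; the rest is undef
  %w = shufflevector <2 x i32> %x, <2 x i32> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
  %r = bitcast <4 x i32> %w to <2 x i64>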
I think this is a universal improvement for vector IR code because it removes a vector-to-scalar-to-vector transition, but I'm not sure if the pattern is relevant to anything besides x86 AVX. In the motivating example from PR34716 ( https://bugs.llvm.org/show_bug.cgi?id=34716 ), we have:
define <8 x i64> @test(i32 %x0, i32 %x1) {
  %1 = insertelement <2 x i32> undef, i32 %x0, i32 0
  %2 = insertelement <2 x i32> %1, i32 %x1, i32 1
  %3 = bitcast <2 x i32> %2 to i64
  %4 = insertelement <8 x i64> undef, i64 %3, i32 0
  %5 = shufflevector <8 x i64> %4, <8 x i64> undef, <8 x i32> zeroinitializer
  ret <8 x i64> %5
}
This leads to inefficient movement between scalar GPRs and vector registers. With this patch, other vector instcombines will fire, reducing the IR to:
define <8 x i64> @test(i32 %x0, i32 %x1) {
  %1 = insertelement <16 x i32> undef, i32 %x0, i32 0  ; wide vec insert
  %2 = insertelement <16 x i32> %1, i32 %x1, i32 1     ; wide vec insert
  %3 = bitcast <16 x i32> %2 to <8 x i64>              ; free bitcast
  %4 = shufflevector <8 x i64> %3, <8 x i64> undef, <8 x i32> zeroinitializer ; splat
  ret <8 x i64> %4
}
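For reference, the fold can be observed in isolation by running instcombine on the original IR; a sketch assuming the legacy pass syntax and a hypothetical input file test.ll:

  opt -instcombine -S test.ll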
And through backend folds, a 32-bit AVX512 target could manage to load the two 32-bit scalars and splat them in a single instruction (although this doesn't quite happen yet; the current output is two instructions):
  vmovsd 4(%esp), %xmm0      # xmm0 = mem[0],zero
  vbroadcastsd %xmm0, %zmm0
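To inspect the backend output, something like the following llc invocation should work (the exact triple and feature string are assumptions for illustration):

  llc -mtriple=i686-unknown-unknown -mattr=+avx512f test.ll -o -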