This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] shrink truncated insertelement with constant operand
ClosedPublic

Authored by spatel on Feb 18 2017, 10:53 AM.

Details

Summary

This is the 2nd part of solving:
http://lists.llvm.org/pipermail/llvm-dev/2017-February/110293.html

D30123 would move the trunc ahead of the shuffle, and this patch will move the trunc ahead of the insertelement. I'm assuming that we're OK transforming insertelement instructions in general (unlike shufflevector), since those are simple ops that any target should handle.
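
As a rough illustration (hypothetical IR, not taken from the patch's tests), the fold rewrites a truncate of an insert involving a constant vector so that both the constant lanes and the inserted scalar are truncated first:

define <4 x i16> @trunc_after_insert(i32 %x) {
  %vec = insertelement <4 x i32> <i32 1, i32 2, i32 3, i32 4>, i32 %x, i32 0
  %trunc = trunc <4 x i32> %vec to <4 x i16>
  ret <4 x i16> %trunc
}

; would become (constant lanes truncated at compile time):
define <4 x i16> @trunc_before_insert(i32 %x) {
  %xtr = trunc i32 %x to i16
  %vec = insertelement <4 x i16> <i16 1, i16 2, i16 3, i16 4>, i16 %xtr, i32 0
  ret <4 x i16> %vec
}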

I neglected to include FP truncate in D30123, so if this looks correct, I can update that patch to do a similar conversion or make a follow-up patch for that.
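
The FP analogue, if it were added, would look roughly like this (hypothetical IR; neither D30123 nor this patch handles it yet):

define <2 x float> @fptrunc_after_insert(double %x) {
  %vec = insertelement <2 x double> undef, double %x, i32 0
  %trunc = fptrunc <2 x double> %vec to <2 x float>
  ret <2 x float> %trunc
}

; would become:
;   %xtr = fptrunc double %x to float
;   %vec = insertelement <2 x float> undef, float %xtr, i32 0
;   ret <2 x float> %vec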

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision. Feb 18 2017, 10:53 AM
zvi added a subscriber: zvi. Feb 21 2017, 6:45 AM
efriedma edited edge metadata. Mar 7 2017, 11:01 AM

On the integer side, I'm sort of worried this could make code generation worse on targets which don't have insert instructions for all relevant widths (for example, SSE2 has i16 insertelement, but not i8 insertelement).

For fptrunc, you could in theory hurt performance if the insert is overwriting a denormal float, but I guess that's unlikely to happen in practice.

On the integer side, I'm sort of worried this could make code generation worse on targets which don't have insert instructions for all relevant widths (for example, SSE2 has i16 insertelement, but not i8 insertelement).

If it's any consolation, the current x86 codegen for tests similar to the ones in this patch -- even with newer features like AVX2 -- looks awful. I can file a bug with this example and others:

define <8 x i16> @wide_insert_into_constant_vector(i32 %x) {
  %vec = insertelement <8 x i32> <i32 1, i32 2, i32 3, i32 4, i32 5, i32 3, i32 -2, i32 65536>, i32 %x, i32 1
  %trunc = trunc <8 x i32> %vec to <8 x i16>
  ret <8 x i16> %trunc
}

$ ./llc -o - shrinkinselt.ll -mattr=avx2
...
	movl	$1, %eax
	vmovd	%eax, %xmm0
	vpinsrd	$1, %edi, %xmm0, %xmm0
	movl	$3, %eax
	vpinsrd	$2, %eax, %xmm0, %xmm0
	movl	$4, %eax
	vpinsrd	$3, %eax, %xmm0, %xmm0
	vinserti128	$1, LCPI0_0(%rip), %ymm0, %ymm0
	vpshufb	LCPI0_1(%rip), %ymm0, %ymm0 ## ymm0 = ymm0[0,1,4,5,8,9,12,13,8,9,12,13,12,13,14,15,16,17,20,21,24,25,28,29,24,25,28,29,28,29,30,31]
	vpermq	$232, %ymm0, %ymm0      ## ymm0 = ymm0[0,2,2,3]
                                        ## kill: %XMM0<def> %XMM0<kill> %YMM0<kill>
	vzeroupper
	retq

Not sure how to explain this...

We can ignore deficiencies which are obviously bugs; I was thinking of something more like this:

define <16 x i8> @trunc_inselt1(<16 x i16> %a, i16 %x) {
  %vec = insertelement <16 x i16> %a, i16 %x, i32 1
  %trunc = trunc <16 x i16> %vec to <16 x i8>
  ret <16 x i8> %trunc
}

define <16 x i8> @trunc_inselt2(<16 x i16> %a, i8 %x) {
  %trunc = trunc <16 x i16> %a to <16 x i8>
  %vec = insertelement <16 x i8> %trunc, i8 %x, i32 1
  ret <16 x i8> %vec
}

For trunc_inselt2, the x86 backend produces:

trunc_inselt2:                          # @trunc_inselt2
        .cfi_startproc
# BB#0:
        movdqa  .LCPI1_0(%rip), %xmm2   # xmm2 = [255,255,255,255,255,255,255,255]
        pand    %xmm2, %xmm1
        pand    %xmm2, %xmm0
        packuswb        %xmm1, %xmm0
        movdqa  .LCPI1_1(%rip), %xmm1   # xmm1 = [255,0,255,255,255,255,255,255,255,255,255,255,255,255,255,255]
        pand    %xmm1, %xmm0
        movd    %edi, %xmm2
        psllw   $8, %xmm2
        pandn   %xmm2, %xmm1
        por     %xmm1, %xmm0
        retq

The pand+movd+psllw+pandn+por sequence is clearly a lot worse than a pinsrw.

This patch won't fire on your example because the scalar or the vector must be a constant. I'm trying to find an affected case that isn't blindingly bad already, but I haven't seen it yet. :)

Replace %x in my testcase with 3, and you get the same result. Granted, there are better ways to generate code than what the x86 backend currently does, but the end result would still be worse than the version with pinsrw.
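
Concretely, with the constant 3 in place of %x, the original form of the fold would rewrite (hypothetical IR, not an actual test in the patch):

define <16 x i8> @trunc_inselt1_const(<16 x i16> %a) {
  %vec = insertelement <16 x i16> %a, i16 3, i32 1
  %trunc = trunc <16 x i16> %vec to <16 x i8>
  ret <16 x i8> %trunc
}

; into the insert-after-trunc form, which, per the discussion above, still
; lowers to a blend-style sequence rather than a single pinsrw:
define <16 x i8> @trunc_inselt1_const_folded(<16 x i16> %a) {
  %trunc = trunc <16 x i16> %a to <16 x i8>
  %vec = insertelement <16 x i8> %trunc, i8 3, i32 1
  ret <16 x i8> %vec
}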

I'd be more comfortable special-casing insertelement into an undef vector (so there's an obvious efficient lowering on any architecture).

spatel added a comment. Mar 7 2017, 2:16 PM

I'd be more comfortable special-casing insertelement into an undef vector (so there's an obvious efficient lowering on any architecture).

Yes, I think that's a safe intermediate step that still handles the motivating case. I'll update the patch.
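
With that restriction, a minimal sketch of a case the narrowed transform still handles (hypothetical IR, assuming the vector operand must be undef):

define <8 x i16> @trunc_inselt_undef(i32 %x) {
  %vec = insertelement <8 x i32> undef, i32 %x, i32 0
  %trunc = trunc <8 x i32> %vec to <8 x i16>
  ret <8 x i16> %trunc
}

; expected to become a truncate of the scalar plus an insert into a
; narrower undef vector, which has an obvious efficient lowering:
;   %xtr = trunc i32 %x to i16
;   %vec = insertelement <8 x i16> undef, i16 %xtr, i32 0
;   ret <8 x i16> %vec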

spatel updated this revision to Diff 90948. Mar 7 2017, 3:11 PM

Patch updated:

  1. Limit the transform to insertion into an undef vector to avoid backend problems.
  2. Add tests for undef.
  3. Add TODO comments for original tests.
efriedma accepted this revision. Mar 7 2017, 3:18 PM

LGTM

This revision is now accepted and ready to land. Mar 7 2017, 3:18 PM
This revision was automatically updated to reflect the committed changes.