This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] fold trunc ([lshr] (bitcast vector) ) --> extractelement (PR25543)
Closed, Public

Authored by spatel on Dec 9 2015, 1:12 PM.

Details

Reviewers

kuhar
majnemer
hfinkel

Commits

rGf727e387be5f: [InstCombine] fold trunc ([lshr] (bitcast vector) ) --> extractelement (PR25543)
rL255504: [InstCombine] fold trunc ([lshr] (bitcast vector) ) --> extractelement (PR25543)

Summary

This is a fix for PR25543:
https://llvm.org/bugs/show_bug.cgi?id=25543

The idea is to take the existing fold of:
bitcast ( trunc ( lshr ( bitcast X))) --> extractelement (bitcast X)
( http://reviews.llvm.org/rL112232 )

And break it into 2 less specific transforms so we'll catch more cases such as the example in the bug report:
bitcast ( trunc ( lshr ( bitcast X))) --> bitcast ( extractelement (bitcast X)) --> extractelement (bitcast X)

D14879 handles the 2nd transform: folding of bitcasts around the extractelement.

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 42329. Dec 9 2015, 1:12 PM

spatel retitled this revision from to [InstCombine] fold trunc ([lshr] (bitcast vector) ) --> extractelement (PR25543).

spatel updated this object.

spatel added reviewers: hfinkel, kuhar, majnemer.

spatel added a subscriber: llvm-commits.

hfinkel added inline comments. Dec 10 2015, 1:13 AM

test/Transforms/InstCombine/bitcast-bigendian.ll
24 ↗	(On Diff #42329)	I'm a bit surprised by this change. We used to prefer the vector bitcast over the scalar one, and with this change, we prefer the scalar bitcast after the extract. I can see benefits to this as a canonical form (even though some backends will need to work somewhat harder to retain good code quality: vector bitcasts are often cheaper than scalar ones). However, what happens when both elements are extracted? Do we end up with multiple scalar bitcasts?

spatel added inline comments. Dec 10 2015, 10:12 AM

test/Transforms/InstCombine/bitcast-bigendian.ll
24 ↗	(On Diff #42329)

I didn't think much of that difference for the existing test case, but for the case that I think you're asking about:

define float @test2(<2 x i32> %A) {
  %tmp28 = bitcast <2 x i32> %A to i64
  %tmp23 = trunc i64 %tmp28 to i32
  %tmp24 = bitcast i32 %tmp23 to float

  %tmp = bitcast <2 x i32> %A to i64
  %lshr = lshr i64 %tmp, 32
  %tmp2 = trunc i64 %lshr to i32
  %tmp4 = bitcast i32 %tmp2 to float 

  %add = fadd float %tmp24, %tmp4
  ret float %add
}

Yes, we'll get multiple scalar bitcasts:

define float @test2(<2 x i32> %A) {
  %tmp23 = extractelement <2 x i32> %A, i32 0
  %tmp24 = bitcast i32 %tmp23 to float
  %tmp2 = extractelement <2 x i32> %A, i32 1
  %tmp4 = bitcast i32 %tmp2 to float
  %add = fadd float %tmp24, %tmp4
  ret float %add
}

And the codegen for that is decidedly worse on all of PPC64+Altivec, AArch64, and x86-64 than what we used to get:

define float @test2(<2 x i32> %A) {
  %1 = bitcast <2 x i32> %A to <2 x float>
  %tmp24 = extractelement <2 x float> %1, i32 0
  %2 = bitcast <2 x i32> %A to <2 x float>
  %tmp4 = extractelement <2 x float> %2, i32 1
  %add = fadd float %tmp24, %tmp4
  ret float %add
}

Should I create a bitcast canonicalization instcombine to hoist bitcasts ahead of extracts?

hfinkel added inline comments. Dec 10 2015, 10:25 AM

test/Transforms/InstCombine/bitcast-bigendian.ll
24 ↗	(On Diff #42329)
@spatel wrote:
Should I create a bitcast canonicalization instcombine to hoist bitcasts ahead of extracts?

I think it is likely better to emulate the current apparent preference for the vector bitcast over the scalar one(s). However, given that your code creates a vector bitcast, what code is undoing that to prefer the scalar bitcasts?

spatel added inline comments. Dec 10 2015, 10:51 AM

test/Transforms/InstCombine/bitcast-bigendian.ll
24 ↗	(On Diff #42329)	I was digging around for that, but it's an illusion. :)

If we have this sequence:

  %bc1 = bitcast <2 x i32> %A to i64
  %trunc = trunc i64 %bc1 to i32
  %bc2 = bitcast i32 %trunc to float

This patch fires at the trunc without ever seeing the 2nd bitcast, and says, "That's an extract!" and it eliminates the need for the first bitcast:

  %ext = extractelement <2 x i32> %A, i32 0
  %bc2 = bitcast i32 %ext to float

So the remaining bitcasts that we're seeing in these cases are just the original instructions. Without this patch, there was no direct transform of the trunc; we only matched a (bc(trunc(bc x))) pattern, so we could hoist the 2nd bitcast.
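
As a sanity check on the "that's an extract" claim, here is a minimal standalone C++ sketch. It is not LLVM code: it models a <4 x i16> vector bitcast to i64 (a scaled-down stand-in for the i128 cases in this patch, since standard C++ has no 128-bit integer), and every helper name in it (bitcastToInt, lshrTrunc, laneForShift, foldIsSound) is hypothetical. It assumes the usual layout rule that lane 0 occupies the least significant bits on little-endian targets and the most significant bits on big-endian targets.

```cpp
#include <cassert>
#include <cstdint>

// Scaled-down model: a <4 x i16> vector viewed as an i64.
// All names here are hypothetical; none of this is LLVM API.
const unsigned NumLanes = 4;
const unsigned LaneBits = 16;

// Models `bitcast <4 x i16> %v to i64`. On little-endian, lane 0 lands in
// the least significant bits; on big-endian, in the most significant bits.
uint64_t bitcastToInt(const uint16_t (&V)[4], bool IsBigEndian) {
  uint64_t Bits = 0;
  for (unsigned I = 0; I < NumLanes; ++I) {
    unsigned Pos = IsBigEndian ? NumLanes - 1 - I : I; // slot counted from LSB
    Bits |= static_cast<uint64_t>(V[I]) << (Pos * LaneBits);
  }
  return Bits;
}

// Models `trunc (lshr i64 Bits, Shift) to i16`.
uint16_t lshrTrunc(uint64_t Bits, unsigned Shift) {
  return static_cast<uint16_t>(Bits >> Shift);
}

// The lane that the shift+trunc selects, using the same arithmetic as the
// patch: Shift / DestWidth, mirrored for big-endian.
unsigned laneForShift(unsigned Shift, bool IsBigEndian) {
  unsigned Elt = Shift / LaneBits;
  return IsBigEndian ? NumLanes - 1 - Elt : Elt;
}

// Checks every lane-aligned shift under both byte orders.
bool foldIsSound(const uint16_t (&V)[4]) {
  for (int BE = 0; BE < 2; ++BE) {
    bool IsBigEndian = (BE != 0);
    for (unsigned Shift = 0; Shift < NumLanes * LaneBits; Shift += LaneBits)
      if (lshrTrunc(bitcastToInt(V, IsBigEndian), Shift) !=
          V[laneForShift(Shift, IsBigEndian)])
        return false;
  }
  return true;
}
```

With lanes {0x1111, 0x2222, 0x3333, 0x4444}, a shift of 16 selects lane 1 under the little-endian layout and lane 2 under the big-endian layout, which is the mirroring that the DL.isBigEndian() branch in the patch accounts for.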

To take this to its logical (I hope) conclusion...
We shouldn't need the transform in D14879 at all if we:

1. Canonicalize bitcasts before extractelements
2. Transform (bitcast (bitcast X)) --> bitcast X

I'm not sure why we wouldn't allow the 2nd for any types, but CastInst::isEliminableCastPair() says we're not allowed to simplify those if vector and scalar types are intermingled and it's not a roundtrip to the original type:

// If either of the casts are a bitcast from scalar to vector, disallow the
// merging. However, bitcast of A->B->A are allowed.

So for the 8 vector/scalar type combinations for a pair of bitcasts, we don't optimize the middle 6 in this list:

define ppc_fp128 @bitcast_bitcast_scalar_scalar_scalar(i128 %A) {
  %bc1 = bitcast i128 %A to fp128
  %bc2 = bitcast fp128 %bc1 to ppc_fp128
  ret ppc_fp128 %bc2
}

define <2 x i32> @bitcast_bitcast_scalar_scalar_vector(i64 %A) {
  %bc1 = bitcast i64 %A to double
  %bc2 = bitcast double %bc1 to <2 x i32>
  ret <2 x i32> %bc2
}

define double @bitcast_bitcast_scalar_vector_scalar(i64 %A) {
  %bc1 = bitcast i64 %A to <2 x i32>
  %bc2 = bitcast <2 x i32> %bc1 to double
  ret double %bc2
}

define <2 x i32> @bitcast_bitcast_scalar_vector_vector(i64 %A) {
  %bc1 = bitcast i64 %A to <4 x i16>
  %bc2 = bitcast <4 x i16> %bc1 to <2 x i32>
  ret <2 x i32> %bc2
}

define i64 @bitcast_bitcast_vector_scalar_scalar(<2 x i32> %A) {
  %bc1 = bitcast <2 x i32> %A to double
  %bc2 = bitcast double %bc1 to i64
  ret i64 %bc2
}

define <4 x i16> @bitcast_bitcast_vector_scalar_vector(<2 x i32> %A) {
  %bc1 = bitcast <2 x i32> %A to double
  %bc2 = bitcast double %bc1 to <4 x i16>
  ret <4 x i16> %bc2
}

define double @bitcast_bitcast_vector_vector_scalar(<2 x float> %A) {
  %bc1 = bitcast <2 x float> %A to <4 x i16>
  %bc2 = bitcast <4 x i16> %bc1 to double
  ret double %bc2
}

define <2 x i32> @bitcast_bitcast_vector_vector_vector(<2 x float> %A) {
  %bc1 = bitcast <2 x float> %A to <4 x i16>
  %bc2 = bitcast <4 x i16> %bc1 to <2 x i32>
  ret <2 x i32> %bc2
}
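
The rule quoted above can be modelled in a few lines of standalone C++. This is only a sketch of the behaviour as described in this comment, not the actual CastInst::isEliminableCastPair() implementation (which handles every cast opcode); the Ty struct and canMergeBitcastPair are hypothetical names, and types are compared by their printed name purely for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for an LLVM type: its printed name plus a flag
// saying whether it is a vector type.
struct Ty {
  std::string Name;
  bool IsVector;
};

// Models the described restriction for a pair of bitcasts Src -> Mid -> Dst:
// if either cast crosses the scalar/vector boundary, merging is only allowed
// for a round trip back to the original type (A -> B -> A).
bool canMergeBitcastPair(const Ty &Src, const Ty &Mid, const Ty &Dst) {
  bool CrossesScalarVector =
      (Src.IsVector != Mid.IsVector) || (Mid.IsVector != Dst.IsVector);
  if (!CrossesScalarVector)
    return true;
  return Src.Name == Dst.Name;
}
```

Applied to the eight functions listed above, this model accepts only the scalar_scalar_scalar and vector_vector_vector cases, matching the "middle 6" observation. A round trip such as <2 x i32> to i64 and back to <2 x i32> is still accepted.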

hfinkel added a subscriber: nadav. Dec 10 2015, 3:47 PM

In D15392#307685, @spatel wrote:
To take this to its logical (I hope) conclusion...
We shouldn't need the transform in D14879 at all if we:

1. Canonicalize bitcasts before extractelements
2. Transform (bitcast (bitcast X)) --> bitcast X

I'm not sure why we wouldn't allow the 2nd for any types, but CastInst::isEliminableCastPair() says we're not allowed to simplify those if vector and scalar types are intermingled and it's not a roundtrip to the original type:
// If either of the casts are a bitcast from scalar to vector, disallow the
// merging. However, bitcast of A->B->A are allowed.

I've added Nadav, in case he remembers what this is all about. It seems to have been related to a fix to:

https://llvm.org/bugs/show_bug.cgi?id=7311

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/InstCombine/InstCombineCasts.cpp	102 lines
llvm/trunk/test/Transforms/InstCombine/trunc.ll	18 lines

Diff 42724

llvm/trunk/lib/Transforms/InstCombine/InstCombineCasts.cpp

@@ static bool canEvaluateTruncated(Value *V, Type *Ty, InstCombiner &IC,
   default:
     // TODO: Can handle more cases here.
     break;
   }

   return false;
 }

+/// Given a vector that is bitcast to an integer, optionally logically
+/// right-shifted, and truncated, convert it to an extractelement.
+/// Example (big endian):
+///   trunc (lshr (bitcast <4 x i32> %X to i128), 32) to i32
+///   --->
+///   extractelement <4 x i32> %X, 1
+static Instruction *foldVecTruncToExtElt(TruncInst &Trunc, InstCombiner &IC,
+                                         const DataLayout &DL) {
+  Value *TruncOp = Trunc.getOperand(0);
+  Type *DestType = Trunc.getType();
+  if (!TruncOp->hasOneUse() || !isa<IntegerType>(DestType))
+    return nullptr;
+
+  Value *VecInput = nullptr;
+  ConstantInt *ShiftVal = nullptr;
+  if (!match(TruncOp, m_CombineOr(m_BitCast(m_Value(VecInput)),
+                                  m_LShr(m_BitCast(m_Value(VecInput)),
+                                         m_ConstantInt(ShiftVal)))) ||
+      !isa<VectorType>(VecInput->getType()))
+    return nullptr;
+
+  VectorType *VecType = cast<VectorType>(VecInput->getType());
+  unsigned VecWidth = VecType->getPrimitiveSizeInBits();
+  unsigned DestWidth = DestType->getPrimitiveSizeInBits();
+  unsigned ShiftAmount = ShiftVal ? ShiftVal->getZExtValue() : 0;
+
+  if ((VecWidth % DestWidth != 0) || (ShiftAmount % DestWidth != 0))
+    return nullptr;
+
+  // If the element type of the vector doesn't match the result type,
+  // bitcast it to a vector type that we can extract from.
+  unsigned NumVecElts = VecWidth / DestWidth;
+  if (VecType->getElementType() != DestType) {
+    VecType = VectorType::get(DestType, NumVecElts);
+    VecInput = IC.Builder->CreateBitCast(VecInput, VecType, "bc");
+  }
+
+  unsigned Elt = ShiftAmount / DestWidth;
+  if (DL.isBigEndian())
+    Elt = NumVecElts - 1 - Elt;
+
+  return ExtractElementInst::Create(VecInput, IC.Builder->getInt32(Elt));
+}
+
 Instruction *InstCombiner::visitTrunc(TruncInst &CI) {
   if (Instruction *Result = commonCastTransforms(CI))
     return Result;

   // Test if the trunc is the user of a select which is part of a
   // minimum or maximum operation. If so, don't do any more simplification.
   // Even simplifying demanded bits can break the canonical form of a
   // min/max.
@@ Instruction *InstCombiner::visitTrunc(TruncInst &CI) {
   if (Src->hasOneUse() && isa<IntegerType>(SrcTy) &&
       ShouldChangeType(SrcTy, DestTy) &&
       match(Src, m_And(m_Value(A), m_ConstantInt(Cst)))) {
     Value *NewTrunc = Builder->CreateTrunc(A, DestTy, A->getName() + ".tr");
     return BinaryOperator::CreateAnd(NewTrunc,
                                      ConstantExpr::getTrunc(Cst, DestTy));
   }

+  if (Instruction *I = foldVecTruncToExtElt(CI, *this, DL))
+    return I;
+
   return nullptr;
 }

 /// Transform (zext icmp) to bitwise / integer operations in order to eliminate
 /// the icmp.
 Instruction *InstCombiner::transformZExtICmp(ICmpInst *ICI, Instruction &CI,
                                              bool DoXform) {
   // If we are just checking for a icmp eq of a single bit and zext'ing it
@@ static Instruction *canonicalizeBitCastExtElt(BitCastInst &BitCast,

   unsigned NumElts = ExtElt->getVectorOperandType()->getNumElements();
   auto *NewVecType = VectorType::get(DestType, NumElts);
   auto *NewBC = IC.Builder->CreateBitCast(ExtElt->getVectorOperand(),
                                           NewVecType, "bc");
   return ExtractElementInst::Create(NewBC, ExtElt->getIndexOperand());
 }

-static Instruction *foldVecTruncToExtElt(Value *VecInput, Type *DestTy,
-                                         unsigned ShiftAmt, InstCombiner &IC,
-                                         const DataLayout &DL) {
-  VectorType *VecTy = cast<VectorType>(VecInput->getType());
-  unsigned DestWidth = DestTy->getPrimitiveSizeInBits();
-  unsigned VecWidth = VecTy->getPrimitiveSizeInBits();
-
-  if ((VecWidth % DestWidth != 0) || (ShiftAmt % DestWidth != 0))
-    return nullptr;
-
-  // If the element type of the vector doesn't match the result type,
-  // bitcast it to be a vector type we can extract from.
-  unsigned NumVecElts = VecWidth / DestWidth;
-  if (VecTy->getElementType() != DestTy) {
-    VecTy = VectorType::get(DestTy, NumVecElts);
-    VecInput = IC.Builder->CreateBitCast(VecInput, VecTy);
-  }
-
-  unsigned Elt = ShiftAmt / DestWidth;
-  if (DL.isBigEndian())
-    Elt = NumVecElts - 1 - Elt;
-
-  return ExtractElementInst::Create(VecInput, IC.Builder->getInt32(Elt));
-}
-
-/// See if we can optimize an integer->float/double bitcast.
-/// The various long double bitcasts can't get in here.
-static Instruction *optimizeIntToFloatBitCast(BitCastInst &CI, InstCombiner &IC,
-                                              const DataLayout &DL) {
-  Value *Src = CI.getOperand(0);
-  Type *DstTy = CI.getType();
-
-  // If this is a bitcast from int to float, check to see if the int is an
-  // extraction from a vector.
-  Value *VecInput = nullptr;
-  // bitcast(trunc(bitcast(somevector)))
-  if (match(Src, m_Trunc(m_BitCast(m_Value(VecInput)))) &&
-      isa<VectorType>(VecInput->getType()))
-    return foldVecTruncToExtElt(VecInput, DstTy, 0, IC, DL);
-
-  // bitcast(trunc(lshr(bitcast(somevector), cst))
-  ConstantInt *ShAmt = nullptr;
-  if (match(Src, m_Trunc(m_LShr(m_BitCast(m_Value(VecInput)),
-                                m_ConstantInt(ShAmt)))) &&
-      isa<VectorType>(VecInput->getType()))
-    return foldVecTruncToExtElt(VecInput, DstTy, ShAmt->getZExtValue(), IC, DL);
-
-  return nullptr;
-}
-
 Instruction *InstCombiner::visitBitCast(BitCastInst &CI) {
   // If the operands are integer typed then apply the integer transforms,
   // otherwise just apply the common ones.
   Value *Src = CI.getOperand(0);
   Type *SrcTy = Src->getType();
   Type *DestTy = CI.getType();

   // Get rid of casts from one type to the same type. These are useless and can
@@ if (PointerType *DstPTy = dyn_cast<PointerType>(DestTy)) {

     // If we found a path from the src to dest, create the getelementptr now.
     if (SrcElTy == DstElTy) {
       SmallVector<Value *, 8> Idxs(NumZeros + 1, Builder->getInt32(0));
       return GetElementPtrInst::CreateInBounds(Src, Idxs);
     }
   }

-  // Try to optimize int -> float bitcasts.
-  if ((DestTy->isFloatTy() || DestTy->isDoubleTy()) && isa<IntegerType>(SrcTy))
-    if (Instruction *I = optimizeIntToFloatBitCast(CI, *this, DL))
-      return I;
-
   if (VectorType *DestVTy = dyn_cast<VectorType>(DestTy)) {
     if (DestVTy->getNumElements() == 1 && !SrcTy->isVectorTy()) {
       Value *Elem = Builder->CreateBitCast(Src, DestVTy->getElementType());
       return InsertElementInst::Create(UndefValue::get(DestTy), Elem,
                      Constant::getNullValue(Type::getInt32Ty(CI.getContext())));
       // FIXME: Canonicalize bitcast(insertelement) -> insertelement(bitcast)
     }


llvm/trunk/test/Transforms/InstCombine/trunc.ll

@@
 ; CHECK-LABEL: @test10(
 ; CHECK: trunc
 ; CHECK: and
 ; CHECK: ret
 }

 ; PR25543
 ; https://llvm.org/bugs/show_bug.cgi?id=25543
-; TODO: This could be extractelement.
+; This is an extractelement.

 define i32 @trunc_bitcast1(<4 x i32> %v) {
   %bc = bitcast <4 x i32> %v to i128
   %shr = lshr i128 %bc, 32
   %ext = trunc i128 %shr to i32
   ret i32 %ext

 ; CHECK-LABEL: @trunc_bitcast1(
-; CHECK-NEXT: %bc = bitcast <4 x i32> %v to i128
-; CHECK-NEXT: %shr = lshr i128 %bc, 32
-; CHECK-NEXT: %ext = trunc i128 %shr to i32
+; CHECK-NEXT: %ext = extractelement <4 x i32> %v, i32 1
 ; CHECK-NEXT: ret i32 %ext
 }

-; TODO: This could be bitcast + extractelement.
+; A bitcast may still be required.

 define i32 @trunc_bitcast2(<2 x i64> %v) {
   %bc = bitcast <2 x i64> %v to i128
   %shr = lshr i128 %bc, 64
   %ext = trunc i128 %shr to i32
   ret i32 %ext

 ; CHECK-LABEL: @trunc_bitcast2(
-; CHECK-NEXT: %bc = bitcast <2 x i64> %v to i128
-; CHECK-NEXT: %shr = lshr i128 %bc, 64
-; CHECK-NEXT: %ext = trunc i128 %shr to i32
+; CHECK-NEXT: %bc1 = bitcast <2 x i64> %v to <4 x i32>
+; CHECK-NEXT: %ext = extractelement <4 x i32> %bc1, i32 2
 ; CHECK-NEXT: ret i32 %ext
 }

-; TODO: The shift is optional. This could be extractelement.
+; The right shift is optional.

 define i32 @trunc_bitcast3(<4 x i32> %v) {
   %bc = bitcast <4 x i32> %v to i128
   %ext = trunc i128 %bc to i32
   ret i32 %ext

 ; CHECK-LABEL: @trunc_bitcast3(
-; CHECK-NEXT: %bc = bitcast <4 x i32> %v to i128
-; CHECK-NEXT: %ext = trunc i128 %bc to i32
+; CHECK-NEXT: %ext = extractelement <4 x i32> %v, i32 0
 ; CHECK-NEXT: ret i32 %ext
 }
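
The element indices expected by the three trunc_bitcast tests can be replayed outside LLVM. The following is a minimal standalone C++ sketch of the width and shift checks in foldVecTruncToExtElt, under stated assumptions: ExtractPlan and planExtract are hypothetical names, types are reduced to bit widths (the patch compares the element type itself, so an integer/float mismatch of equal width would also need a bitcast), and the guard against shifts past the end of the vector is an extra defensive check that is not in the patch.

```cpp
#include <cassert>

// Hypothetical description of the extractelement the fold would create.
struct ExtractPlan {
  unsigned NumElts;  // lane count of the vector being extracted from
  unsigned Index;    // lane index operand of the extractelement
  bool NeedsBitcast; // source lanes are not DestWidth bits wide
};

// Models: trunc (lshr (bitcast <vector> to iVecWidth), ShiftAmount)
//         to iDestWidth
// Returns false when the pattern is not lane-aligned, mirroring the
// `return nullptr` paths in the patch.
bool planExtract(unsigned VecWidth, unsigned SrcEltWidth, unsigned DestWidth,
                 unsigned ShiftAmount, bool IsBigEndian, ExtractPlan &Out) {
  if (DestWidth == 0 || VecWidth % DestWidth != 0 ||
      ShiftAmount % DestWidth != 0)
    return false;

  unsigned NumElts = VecWidth / DestWidth;
  unsigned Elt = ShiftAmount / DestWidth;
  if (Elt >= NumElts) // assumption: reject shifts past the vector's width
    return false;
  if (IsBigEndian)
    Elt = NumElts - 1 - Elt;

  Out.NumElts = NumElts;
  Out.Index = Elt;
  Out.NeedsBitcast = (SrcEltWidth != DestWidth); // width-only approximation
  return true;
}
```

For a little-endian layout this gives index 1 with no bitcast for trunc_bitcast1, index 2 of a <4 x i32> after a bitcast for trunc_bitcast2, and index 0 for trunc_bitcast3, in line with the CHECK-NEXT lines above.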