This is an archive of the discontinued LLVM Phabricator instance.

I was trying out some examples to see how this performs on ARM architectures. This pattern was relatively rare and didn't come up a lot in the benchmarks I tried (or it was optimized to the same performance/code size).

Testing the specific pattern, the first one I tried was i8's, a type not legal on those archs. It doesn't do great with this (although aarch64 does do OK given the instructions it can use):
https://godbolt.org/z/hbP445xsT
i32 looks OK for the same thing. AArch64 even looks better, I think because ARM already uses addcarry and splits things in a way that turns out to be equivalent.
https://godbolt.org/z/3zbhKPs1e

I also tried to look at the cross-basic-block case, but didn't do very well at getting something that wasn't just optimized away:
i8: https://godbolt.org/z/n8sxv3rqE
i32: https://godbolt.org/z/1sEa7Wd68
And then there was also this, but it crashed putting instructions in the wrong places:
https://godbolt.org/z/48abYPq8M

I have no strong objections to creating uadd_with_overflow in instcombine, but we have not done that in the past and I don't know if I see a huge reason to start now. It appears that we should at least be limiting it to legal types.

Rebased; Addressed review comments.

In D107552#2930663, @dmgreen wrote:

It appears that we should at least be limiting it to legal types.

I agree. The i8 cases don't look good on thumbv6m and thumbv7m. I've added DL.isLegalInteger() checks now.

Most of the non-correctness bailouts here don't really make sense to me.
Presumably, if it is profitable for a particular backend, said backend should be expanding whatever "bad" intrinsics into its original form.

Harbormaster completed remote builds in B118628: Diff 365119. Aug 9 2021, 1:56 AM

In D107552#2933906, @lebedev.ri wrote:

Most of the non-correctness bailouts here don't really make sense to me.
Presumably, if it is profitable for a particular backend, said backend should be expanding whatever "bad" intrinsics into its original form.

I agree that ideally we shouldn't add non-correctness bailouts, but I would like to do whatever I can in this patch to avoid regressing other backends.

Fixed the case in which the add from the uaddo wouldn't dominate the original trunc use. For example:

target datalayout = "n32"

declare void @true()
declare void @false()

define i32 @lshr_32_add_zext_trunc(i32 %a, i32 %b) {

%zext.a = zext i32 %a to i64
%zext.b = zext i32 %b to i64
%add = add i64 %zext.a, %zext.b
%trunc.add = trunc i64 %add to i32
%cmp = icmp ult i32 %a, %b
br i1 %cmp, label %brTrue, label %brFalse

brTrue:

%shr = lshr i64 %add, 32
%trunc.shr = trunc i64 %shr to i32
%retTrue = add i32 %trunc.add, %trunc.shr
call void @true()
br label %brMerge

brFalse:

%retFalse = add i32 %trunc.add, 12
call void @false()
br label %brMerge

brMerge:

%ret = phi i32 [ %retTrue, %brTrue ], [ %retFalse, %brFalse ]
ret i32 %ret

}
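
For orientation, the IR above corresponds roughly to the C++ sketch below. The function names are hypothetical and not part of the patch. `__builtin_add_overflow` is a GCC/Clang builtin; that Clang lowers it to `llvm.uadd.with.overflow` for unsigned operands is background knowledge, not something stated in this review. The point is that the low half of the sum is used in both branches while the carry is only used in one, so the replacement add has to sit where it dominates every use of the original trunc.

```cpp
#include <cassert>
#include <cstdint>

// Wide form: mirrors the zext/add/trunc/lshr sequence in the IR above.
inline uint32_t wideForm(uint32_t a, uint32_t b) {
  uint64_t add = static_cast<uint64_t>(a) + static_cast<uint64_t>(b);
  uint32_t truncAdd = static_cast<uint32_t>(add); // used in both branches
  if (a < b) {
    uint32_t truncShr = static_cast<uint32_t>(add >> 32); // carry: 0 or 1
    return truncAdd + truncShr;
  }
  return truncAdd + 12u;
}

// Overflow form: the shape the fold is aiming for. The add-with-overflow
// is computed once, before the branch, so both uses are dominated by it.
inline uint32_t overflowForm(uint32_t a, uint32_t b) {
  uint32_t sum;
  bool carry = __builtin_add_overflow(a, b, &sum);
  if (a < b)
    return sum + static_cast<uint32_t>(carry);
  return sum + 12u;
}

// Spot-check that both forms agree, including inputs that carry.
inline void checkForms() {
  assert(wideForm(1u, 0xFFFFFFFFu) == overflowForm(1u, 0xFFFFFFFFu));
  assert(wideForm(0xFFFFFFFFu, 1u) == overflowForm(0xFFFFFFFFu, 1u));
  assert(wideForm(3u, 4u) == overflowForm(3u, 4u));
}
```

Calling `checkForms()` from a test driver exercises one carrying input in each branch plus a non-carrying one.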

Harbormaster completed remote builds in B118636: Diff 365130. Aug 9 2021, 3:15 AM

lebedev.ri added inline comments. Aug 9 2021, 4:05 AM

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1060	I really don't think we should be doing this..
1069	Early return
1083–1084	Presumably you should just set the expansion point to be right after the old `add`?

craig.topper added inline comments. Aug 9 2021, 9:50 AM

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1105	We probably need to do something here to schedule the dead truncates to be revisited so they get removed. But I'm not sure.

craig.topper added inline comments. Aug 9 2021, 9:54 AM

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1105	I think this should use InstCombinerImpl::replaceInstUsesWith rather than Instruction::replaceAllUsesWith. Then I think it should call InstCombinerImpl::eraseInstFromFunction. @spatel or @lebedev.ri does that sound right?

lebedev.ri added inline comments. Aug 9 2021, 10:23 AM

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1105	Right, the pass-specific call should be used. I don't think we need to call `eraseInstFromFunction()` explicitly; that should be handled automatically once you return non-null from this function. Generally, I would think not having an intermediate vector would be better, but I'm not sure.

craig.topper added inline comments. Aug 9 2021, 10:33 AM

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1105	The truncates aren’t directly part of the chain of instructions we’re changing, so I’m not sure if they get queued properly for DCE once we return.

lebedev.ri added inline comments. Aug 9 2021, 10:37 AM

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1105	`InstCombinerImpl::replaceInstUsesWith()` adds the instruction whose uses we replaced to the worklist, so the next time it is revisited, it will be dropped.

Rebased; Addressed review comments.

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1060	I think I'm being too cautious here. I just realized that InstCombinerImpl::foldICmpWithConstant() can produce sadd_with_overflow and, if I read it correctly, it doesn't seem to worry about cross-basic-block uses. I have removed this check now.

Harbormaster completed remote builds in B119006: Diff 365652. Aug 10 2021, 9:10 PM

Matt added a subscriber: Matt. Aug 11 2021, 9:05 AM

Ping.

In D107552#2930663, @dmgreen wrote:

I was trying out some examples to see how this performs on ARM architectures. This pattern was relatively rare and didn't come up a lot in the benchmarks I tried (or it was optimized to the same performance/code size).

Testing the specific pattern, the first one I tried was i8's, a type not legal on those archs. It doesn't do great with this (although aarch64 does do OK given the instructions it can use):
https://godbolt.org/z/hbP445xsT
i32 looks OK for the same thing. AArch64 even looks better, I think because ARM already uses addcarry and splits things in a way that turns out to be equivalent.
https://godbolt.org/z/3zbhKPs1e

I also tried to look at the cross-basic-block case, but didn't do very well at getting something that wasn't just optimized away:
i8: https://godbolt.org/z/n8sxv3rqE
i32: https://godbolt.org/z/1sEa7Wd68
And then there was also this, but it crashed putting instructions in the wrong places:
https://godbolt.org/z/48abYPq8M

I have no strong objections to creating uadd_with_overflow in instcombine, but we have not done that in the past and I don't know if I see a huge reason to start now. It appears that we should at least be limiting it to legal types.

It's arguably a worse canonical form since fewer optimizations will understand this (e.g. ScalarEvolution, though not sure it really will matter here).

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1056	Demorgan this
1059–1060	I also think a legality check doesn't belong here. I also don't think the legal types in the datalayout are particularly well defined.
1063	demorgan
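
As an illustration of the "Demorgan this" requests: the guard at line 1056 has the shape `!(A && B)`, which De Morgan's law rewrites as `!A || !B`. Below is a minimal standalone sketch. The helper names are hypothetical, `isPow2` is a local stand-in for `isPowerOf2_32` so that the snippet compiles without LLVM headers, and `Ty->isIntegerTy(ShAmt * 2)` is simplified to a plain bit-width comparison (an assumption made only for this example).

```cpp
#include <cassert>

// Local stand-in for llvm::isPowerOf2_32 (assumed semantics: true only for
// nonzero powers of two).
constexpr bool isPow2(unsigned V) { return V != 0 && (V & (V - 1)) == 0; }

// Shape of the guard as posted: bail out unless both conditions hold.
constexpr bool bailOutOriginal(unsigned BitWidth, unsigned ShAmt) {
  return !(isPow2(BitWidth) && BitWidth == ShAmt * 2);
}

// The same guard after applying De Morgan's law.
constexpr bool bailOutDeMorgan(unsigned BitWidth, unsigned ShAmt) {
  return !isPow2(BitWidth) || BitWidth != ShAmt * 2;
}

// Both spellings agree on an accepted case and on two rejected ones.
static_assert(bailOutOriginal(64, 32) == bailOutDeMorgan(64, 32), "accept");
static_assert(bailOutOriginal(64, 16) == bailOutDeMorgan(64, 16), "bad shift");
static_assert(bailOutOriginal(48, 24) == bailOutDeMorgan(48, 24), "not pow2");
```

The rewritten form reads as a list of independent reasons to bail out, which is presumably why it was requested.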

Herald added a project: Restricted Project. Jul 27 2022, 5:34 AM

Herald added a subscriber: StephenFan.

Pierre-vh mentioned this in D137705: [AMDGPU] Add DAG Combine for right-shift carry add to uaddo. Nov 22 2022, 12:15 AM

Pierre-vh mentioned this in D138814: [InstCombine] Combine lshr of add -> (a + b < a). Nov 28 2022, 6:57 AM

Pierre-vh mentioned this in rGb3fdb7b0cba4: [InstCombine] Combine lshr of add -> (a + b < a). Jan 10 2023, 12:37 AM
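
The `(a + b < a)` in the titles above refers to the usual unsigned-carry idiom: a wrapped unsigned sum is smaller than an operand exactly when the addition carried. The sketch below compares that idiom with the shift-based form discussed in this review. The function names are hypothetical and the code is not taken from either patch.

```cpp
#include <cstdint>

// Carry extracted the way this review's pattern does it: widen, add, shift.
constexpr uint32_t carryViaShift(uint32_t a, uint32_t b) {
  return static_cast<uint32_t>((static_cast<uint64_t>(a) + b) >> 32);
}

// Carry extracted with a narrow add and an unsigned compare.
constexpr uint32_t carryViaCompare(uint32_t a, uint32_t b) {
  return static_cast<uint32_t>(a + b) < a ? 1u : 0u;
}

// Compile-time spot checks around the wrap-around boundary.
static_assert(carryViaShift(0u, 0u) == carryViaCompare(0u, 0u), "");
static_assert(carryViaShift(0xFFFFFFFFu, 1u) == carryViaCompare(0xFFFFFFFFu, 1u), "");
static_assert(carryViaShift(0x7FFFFFFFu, 0x80000000u) ==
              carryViaCompare(0x7FFFFFFFu, 0x80000000u), "");
static_assert(carryViaShift(0x80000000u, 0x80000000u) ==
              carryViaCompare(0x80000000u, 0x80000000u), "");
```

The compare form needs no wider type and no intrinsic, which fits the concern raised earlier in the thread about how many optimizations understand `uadd.with.overflow`.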

D138814

Revision Contents

Path                                                     Size
llvm/lib/Transforms/InstCombine/InstCombineInternal.h    1 line
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp    76 lines
llvm/test/Transforms/InstCombine/shift-add.ll            49 lines

Diff 365652

llvm/lib/Transforms/InstCombine/InstCombineInternal.h

[353 lines not shown] private:
Value *foldLogicOfFCmps(FCmpInst *LHS, FCmpInst *RHS, bool IsAnd);

Value *foldAndOrOfICmpsOfAndWithPow2(ICmpInst *LHS, ICmpInst *RHS,
Instruction *CxtI, bool IsAnd,
bool IsLogical = false);
Value *matchSelectFromAndOr(Value *A, Value *B, Value *C, Value *D);
Value *getSelectCondition(Value *A, Value *B);

		Instruction *foldXShrToOverflow(BinaryOperator &I);
Instruction *foldIntrinsicWithOverflowCommon(IntrinsicInst *II);
Instruction *foldFPSignBitOps(BinaryOperator &I);

// Optimize one of these forms:
// and i1 Op, SI / select i1 Op, i1 SI, i1 false (if IsAnd = true)
// or i1 Op, SI / select i1 Op, i1 true, i1 SI (if IsAnd = false)
// into simplier select instruction using isImpliedCondition.
Instruction *foldAndOrOfSelectUsingImpliedCond(Value *Op, SelectInst &SI,
[423 lines not shown]

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp

[1,029 lines not shown] Instruction *InstCombinerImpl::visitShl(BinaryOperator &I) {
if (match(Op0, m_One()) &&
match(Op1, m_Sub(m_SpecificInt(BitWidth - 1), m_Value(X))))
return BinaryOperator::CreateLShr(
ConstantInt::get(Ty, APInt::getSignMask(BitWidth)), X);

return nullptr;
}

		// Tries to perform (Xshr (add (zext a, i2^n), (zext b, i2^n)), 2^(n-1)) ->
		// (llvm.uadd.with.overflow a, b).overflow where Xshr can be ashr or lshr; a and
		// b have type i2^(n-1).
		Instruction *InstCombinerImpl::foldXShrToOverflow(BinaryOperator &I) {
		assert(I.getOpcode() == Instruction::AShr ||
		I.getOpcode() == Instruction::LShr);

		Instruction *Op0 = dyn_cast<Instruction>(I.getOperand(0));
		Value *Op1 = I.getOperand(1);
		Type *Ty = I.getType();
		const APInt *ShAmtAPInt;

		if (!Op0 || !match(Op1, m_APInt(ShAmtAPInt)))
		return nullptr;

		unsigned ShAmt = ShAmtAPInt->getZExtValue();
		unsigned BitWidth = Ty->getScalarSizeInBits();

		if (!(isPowerOf2_32(BitWidth) && Ty->isIntegerTy(ShAmt * 2)))
		return nullptr;

		if (!DL.isLegalInteger(ShAmt))
		return nullptr;

		Value *X = nullptr, *Y = nullptr;
		if (!(match(Op0, m_Add(m_ZExt(m_Value(X)), m_ZExt(m_Value(Y)))) &&
		X->getType()->getScalarSizeInBits() == ShAmt &&
		Y->getType()->getScalarSizeInBits() == ShAmt))
		return nullptr;

		// Make sure that `Op0` is only used by `I` and `ShAmt`-truncates.
		SmallVector<TruncInst *, 2> Truncs;
		for (User *Usr : Op0->users()) {
		if (Usr == &I)
		continue;

		TruncInst *Trunc = dyn_cast<TruncInst>(Usr);
		if (!Trunc)
		return nullptr;

		if (Trunc->getType()->getScalarSizeInBits() != ShAmt)
		return nullptr;

		Truncs.push_back(Trunc);
		}

		// If we get here, we can be sure that `Op0` is only used by `Truncs` and
		// `I`.

		BasicBlock::iterator RestoreInsPt = Builder.GetInsertPoint();
		// Insert at Op0 so that the newly created `UAdd` will dominate its users
		// (i.e. `Op0`'s users).
		Builder.SetInsertPoint(Op0);

		Value *UAddOverflow = Builder.CreateBinaryIntrinsic(
		Intrinsic::uadd_with_overflow, X, Y, /* FMFSource */ nullptr, "uaddo");
		Value *UAdd = Builder.CreateExtractValue(UAddOverflow, 0,
		UAddOverflow->getName() + ".add");
		Value *Overflow = Builder.CreateExtractValue(
		UAddOverflow, 1, UAddOverflow->getName() + ".overflow");

		// Replace the uses of truncated `Op0` with `UAdd` since `UAddOverflow`
		// performs the truncated version of the addition performed by `Op0`.
		for (TruncInst *Trunc : Truncs) {
		replaceInstUsesWith(*Trunc, UAdd);
		}

		Builder.SetInsertPoint(&(*RestoreInsPt));

		// Replace the use of `Op0` by `I` with `Overflow`.
		return new ZExtInst(Overflow, Ty);
		}

Instruction *InstCombinerImpl::visitLShr(BinaryOperator &I) {
if (Value *V = SimplifyLShrInst(I.getOperand(0), I.getOperand(1), I.isExact(),
SQ.getWithInstruction(&I)))
return replaceInstUsesWith(I, V);

if (Instruction *X = foldVectorBinop(I))
return X;

[130 lines not shown] if (match(Op1, m_APInt(ShAmtAPInt))) {
// If the shifted-out value is known-zero, then this is an exact shift.
if (!I.isExact() &&
MaskedValueIsZero(Op0, APInt::getLowBitsSet(BitWidth, ShAmt), 0, &I)) {
I.setIsExact();
return &I;
}
}

		if (Instruction *Overflow = foldXShrToOverflow(I))
		return Overflow;

// Transform (x << y) >> y to x & (-1 >> y)
Value *X;
if (match(Op0, m_OneUse(m_Shl(m_Value(X), m_Specific(Op1))))) {
Constant *AllOnes = ConstantInt::getAllOnesValue(Ty);
Value *Mask = Builder.CreateLShr(AllOnes, Op1);
return BinaryOperator::CreateAnd(Mask, X);
}

[190 lines not shown]

llvm/test/Transforms/InstCombine/shift-add.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; This test makes sure that these instructions are properly eliminated.
;
; RUN: opt < %s -instcombine -S | FileCheck %s

				target datalayout = "n32"

define i32 @shl_C1_add_A_C2_i32(i16 %A) {
; CHECK-LABEL: @shl_C1_add_A_C2_i32(
; CHECK-NEXT: [[B:%.*]] = zext i16 [[A:%.*]] to i32
; CHECK-NEXT: [[D:%.*]] = shl i32 192, [[B]]
; CHECK-NEXT: ret i32 [[D]]
;
%B = zext i16 %A to i32
%C = add i32 %B, 5
[101 lines not shown]
;
%A = zext i16 %I to i32
%B = insertelement <4 x i32> undef, i32 %A, i32 0
%C = shufflevector <4 x i32> %B, <4 x i32> undef, <4 x i32> zeroinitializer
%D = add <4 x i32> %C, <i32 0, i32 1, i32 50, i32 16>
%E = lshr <4 x i32> <i32 6, i32 2, i32 1, i32 -7>, %D
ret <4 x i32> %E
}

				define i64 @lshr_32_add_zext_basic(i32 %a, i32 %b) {
				; CHECK-LABEL: define i64 @lshr_32_add_zext_basic(
				; CHECK-NEXT: [[uaddo:%.*]] = call { i32, i1 } @llvm.uadd.with.overflow.i32(i32 %a, i32 %b)
				; CHECK-NEXT: [[overflow:%.*]] = extractvalue { i32, i1 } [[uaddo]], 1
				; CHECK-NEXT: [[zextOverflow:%.*]] = zext i1 %uaddo.overflow to i64
				; CHECK-NEXT: ret i64 [[zextOverflow]]

				%zext.a = zext i32 %a to i64
				%zext.b = zext i32 %b to i64
				%add = add i64 %zext.a, %zext.b
				%lshr = lshr i64 %add, 32
				ret i64 %lshr
				}

				define i64 @ashr_32_add_zext_basic(i32 %a, i32 %b) {
				; CHECK-LABEL: define i64 @ashr_32_add_zext_basic(
				; CHECK-NEXT: [[uaddo:%.*]] = call { i32, i1 } @llvm.uadd.with.overflow.i32(i32 %a, i32 %b)
				; CHECK-NEXT: [[overflow:%.*]] = extractvalue { i32, i1 } [[uaddo]], 1
				; CHECK-NEXT: [[zextOverflow:%.*]] = zext i1 %uaddo.overflow to i64
				; CHECK-NEXT: ret i64 [[zextOverflow]]

				%zext.a = zext i32 %a to i64
				%zext.b = zext i32 %b to i64
				%add = add i64 %zext.a, %zext.b
				%lshr = ashr i64 %add, 32
				ret i64 %lshr
				}

				define i32 @lshr_32_add_zext_trunc(i32 %a, i32 %b) {
				; CHECK-LABEL: define i32 @lshr_32_add_zext_trunc(
				; CHECK-NEXT: [[uaddo:%.*]] = call { i32, i1 } @llvm.uadd.with.overflow.i32(i32 %a, i32 %b)
				; CHECK-NEXT: [[add:%.*]] = extractvalue { i32, i1 } [[uaddo]], 0
				; CHECK-NEXT: [[overflow:%.*]] = extractvalue { i32, i1 } [[uaddo]], 1
				; CHECK-NEXT: [[zextOverflow:%.*]] = zext i1 [[overflow]] to i32
				; CHECK-NEXT: [[ret:%.*]] = add i32 [[add]], [[zextOverflow]]
				; CHECK-NEXT: ret i32 [[ret]]

				%zext.a = zext i32 %a to i64
				%zext.b = zext i32 %b to i64
				%add = add i64 %zext.a, %zext.b
				%trunc.add = trunc i64 %add to i32
				%shr = lshr i64 %add, 32
				%trunc.shr = trunc i64 %shr to i32
				%ret = add i32 %trunc.add, %trunc.shr
				ret i32 %ret
				}

[InstCombine] Combine lshr of add that intends to get the carry as llvm.uadd.with.overflow
Abandoned, Public