This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] try multi-use demanded bits folds for 'add'
Closed · Public

Authored by spatel on Sep 13 2022, 10:09 AM.

Details

Reviewers

craig.topper
RKSimon
nikic

Commits

rG73919a87e9a6: [InstCombine] try multi-use demanded bits folds for 'add'

Summary

This patch enables a multi-use demanded bits fold (motivated by issue #57576):
https://alive2.llvm.org/ce/z/DsZakh

This mimics transforms that we already do on the single-use path.

Originally, this patch did not include the last part to form a constant, but that can be removed independently to reduce risk. It's not clear what the effect of either change will be when viewed end-to-end.

See the "add-demand2" series for experimental timing results:
https://llvm-compile-time-tracker.com/?config=NewPM-O3&stat=instructions&remote=rotateright
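
The fold described in the summary can be sanity-checked outside LLVM. The following is a minimal standalone sketch, not code from the patch: it models the i8 test case in this revision with uint8_t and checks exhaustively that, when one add operand is a multiple of 32, the low five bits of the sum equal the low five bits of the other operand. The helper name lowBitsMatchOtherOperand is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Exhaustively check, for all 8-bit X and Y, that the low `DemandedLowBits`
// bits of (X * Mul + Y) equal the low bits of Y alone.
// Hypothetical helper for illustration only; it is not part of LLVM.
bool lowBitsMatchOtherOperand(uint8_t Mul, unsigned DemandedLowBits) {
  const uint8_t Mask = static_cast<uint8_t>((1u << DemandedLowBits) - 1);
  for (unsigned X = 0; X < 256; ++X) {
    for (unsigned Y = 0; Y < 256; ++Y) {
      // Wrap to 8 bits, as an i8 mul and add would.
      uint8_t M = static_cast<uint8_t>(X * Mul);
      uint8_t A = static_cast<uint8_t>(M + Y);
      if ((A & Mask) != (Y & Mask))
        return false;
    }
  }
  return true;
}
```

With Mul = 0xE0 (that is, -32 as i8), the check holds for 5 demanded low bits and fails for 6, which matches the positive test and the "demands one more bit" negative test in add.ll.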

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spatel created this revision. Sep 13 2022, 10:09 AM

Herald added a project: Restricted Project. Sep 13 2022, 10:09 AM

Herald added subscribers: StephenFan, zzheng, hiraditya, mcrosier.

spatel requested review of this revision. Sep 13 2022, 10:09 AM

Herald added a project: Restricted Project. Sep 13 2022, 10:09 AM

Herald added subscribers: llvm-commits, pcwang-thead.

Nice - have you looked if DAG SimplifyMultipleUseDemandedBits would benefit?

In D133788#3786981, @RKSimon wrote:

Nice - have you looked if DAG SimplifyMultipleUseDemandedBits would benefit?

I saw that we don't have this in codegen either, but I didn't try to adapt it yet. We can do the same thing for Op1 (RHS) of a sub, so that's another potential follow-up.

Harbormaster completed remote builds in B186401: Diff 459787. Sep 13 2022, 11:11 AM

spatel mentioned this in D133799: [SDAG] try multi-use demanded bits folds for 'add'. Sep 13 2022, 12:18 PM

In D133788#3787026, @spatel wrote:

In D133788#3786981, @RKSimon wrote:

Nice - have you looked if DAG SimplifyMultipleUseDemandedBits would benefit?

I saw that we don't have this in codegen either, but I didn't try to adapt it yet. We can do the same thing for Op1 (RHS) of a sub, so that's another potential follow-up.

D133799 shows the DAG version...looks like a slog to make that worthwhile.

The 2nd part avoids computing known bits for every other multi-use add (giving the small improvement in compile-time). That also results in the test diffs that show large unsigned constants becoming small negative numbers. That should make analysis easier and codegen better in most cases.

I think it might make sense to separate these two changes if we can do so reasonably cheaply. Rather than calling computeKnownBits on the add, which will unnecessarily repeat the two recursive calls, can we call KnownBits::computeForAddSub() on the computed results? I expect that would be about compile-time neutral, while retaining existing behavior.

In D133788#3787430, @nikic wrote:

The 2nd part avoids computing known bits for every other multi-use add (giving the small improvement in compile-time). That also results in the test diffs that show large unsigned constants becoming small negative numbers. That should make analysis easier and codegen better in most cases.

I think it might make sense to separate these two changes if we can do so reasonably cheaply. Rather than calling computeKnownBits on the add, which will unnecessarily repeat the two recursive calls, can we call KnownBits::computeForAddSub() on the computed results? I expect that would be about compile-time neutral, while retaining existing behavior.

Yes - good suggestion and your guess about timing looks correct. I drafted the changes as independent patches and pushed them up to llvm-compile-time-tracker as perf/add-demand2:
https://llvm-compile-time-tracker.com/?config=NewPM-O3&stat=instructions&remote=rotateright

The timing diffs are nearly invisible in this set, so we can mostly judge the patches on the test diffs, and if it saves a bit of time too, that's a nice bonus. The sibling change in the codegen patch shows that the overall results may be unpredictable, but we can probably deal with any fallout with backend fixes.
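
The idea discussed in this exchange, deriving the known bits of the add from the operands' already-computed known bits instead of recursing into the operands a second time, can be illustrated with a small standalone model. This is a conservative ripple-carry sketch over 8 bits, not LLVM's KnownBits::computeForAddSub, and it may be less precise than the real implementation. The type KnownBits8 and the function knownBitsForAdd are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Simplified 8-bit stand-in for llvm::KnownBits (hypothetical, illustration only).
struct KnownBits8 {
  uint8_t Zero; // bits known to be 0
  uint8_t One;  // bits known to be 1
};

// Conservative known bits of L + R, walking from the lowest bit upward and
// tracking whether the carry into each bit position is known.
KnownBits8 knownBitsForAdd(KnownBits8 L, KnownBits8 R) {
  KnownBits8 Out{0, 0};
  bool CarryKnown = true; // the carry into bit 0 is known to be 0
  bool Carry = false;
  for (unsigned I = 0; I < 8; ++I) {
    const uint8_t Bit = static_cast<uint8_t>(1u << I);
    const bool LKnown = ((L.Zero | L.One) & Bit) != 0;
    const bool RKnown = ((R.Zero | R.One) & Bit) != 0;
    const bool LOne = (L.One & Bit) != 0;
    const bool ROne = (R.One & Bit) != 0;
    if (LKnown && RKnown && CarryKnown) {
      // All three inputs are known, so the result bit and carry-out are too.
      const unsigned Sum = LOne + ROne + Carry;
      if (Sum & 1)
        Out.One |= Bit;
      else
        Out.Zero |= Bit;
      Carry = Sum >= 2;
    } else {
      // The result bit is unknown. The carry-out is the majority of the three
      // inputs, so it is still known if two inputs are known and agree.
      const unsigned Ones = (LKnown && LOne) + (RKnown && ROne) + (CarryKnown && Carry);
      const unsigned Zeros = (LKnown && !LOne) + (RKnown && !ROne) + (CarryKnown && !Carry);
      CarryKnown = Ones >= 2 || Zeros >= 2;
      Carry = Ones >= 2;
    }
  }
  return Out;
}
```

For example, if L is known to be a multiple of 32 (Zero = 0x1F) and R is the constant 3, the low five bits of the sum are fully known. A user that demands only those bits could then be given a constant, which is the situation the final check in the patch looks for.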

spatel updated this revision to Diff 460039. Sep 14 2022, 5:07 AM

spatel edited the summary of this revision.

This revision is now accepted and ready to land. Sep 14 2022, 5:30 AM

Harbormaster completed remote builds in B186582: Diff 460039. Sep 14 2022, 5:53 AM

This revision was landed with ongoing or failed builds. Sep 14 2022, 6:39 AM

Closed by commit rG73919a87e9a6: [InstCombine] try multi-use demanded bits folds for 'add' (authored by spatel).

This revision was automatically updated to reflect the committed changes.

spatel added a commit: rG73919a87e9a6: [InstCombine] try multi-use demanded bits folds for 'add'.

spatel mentioned this in rGd6498abc241b: [InstCombine] remove multi-use add demanded constant fold. Sep 18 2022, 11:24 AM

spatel mentioned this in rG64d309131a87: [InstCombine] try multi-use demanded bits fold for 'sub'. Sep 21 2022, 11:15 AM

Revision Contents

Path                                                              Size
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp   21 lines
llvm/test/Transforms/InstCombine/add.ll                           14 lines
llvm/test/Transforms/InstCombine/shift.ll                         7 lines

Diff 460066

llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp

[1,058 preceding lines not shown; enclosing context: case Instruction::Xor: {]
     // If all of the demanded bits are known zero on one side, return the other.
     if (DemandedMask.isSubsetOf(RHSKnown.Zero))
       return I->getOperand(0);
     if (DemandedMask.isSubsetOf(LHSKnown.Zero))
       return I->getOperand(1);

     break;
   }
+  case Instruction::Add: {
+    unsigned NLZ = DemandedMask.countLeadingZeros();
+    APInt DemandedFromOps = APInt::getLowBitsSet(BitWidth, BitWidth - NLZ);
+
+    // If an operand adds zeros to every bit below the highest demanded bit,
+    // that operand doesn't change the result. Return the other side.
+    computeKnownBits(I->getOperand(1), RHSKnown, Depth + 1, CxtI);
+    if (DemandedFromOps.isSubsetOf(RHSKnown.Zero))
+      return I->getOperand(0);
+
+    computeKnownBits(I->getOperand(0), LHSKnown, Depth + 1, CxtI);
+    if (DemandedFromOps.isSubsetOf(LHSKnown.Zero))
+      return I->getOperand(1);
+
+    bool NSW = cast<OverflowingBinaryOperator>(I)->hasNoSignedWrap();
+    Known = KnownBits::computeForAddSub(true, NSW, LHSKnown, RHSKnown);
+    if (DemandedMask.isSubsetOf(Known.Zero|Known.One))
+      return Constant::getIntegerValue(ITy, Known.One);
+
+    break;
+  }
   case Instruction::AShr: {
     // Compute the Known bits to simplify things downstream.
     computeKnownBits(I, Known, Depth, CxtI);

     // If this user is only demanding bits that we know, return the known
     // constant.
     if (DemandedMask.isSubsetOf(Known.Zero | Known.One))
       return Constant::getIntegerValue(ITy, Known.One);
[621 following lines not shown]
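
The mask arithmetic in the new Add case can be modeled without LLVM. The sketch below is a hedged illustration on 8-bit values, using plain uint8_t in place of APInt; the function names countLeadingZeros8, demandedFromOps, and operandIsNoOp are hypothetical. It shows why the check uses every bit at or below the highest demanded bit rather than the demanded mask itself: carries in an add only travel upward, so any nonzero bit below a demanded bit could still influence it.

```cpp
#include <cassert>
#include <cstdint>

// Count leading zero bits of an 8-bit value (returns 8 for zero).
unsigned countLeadingZeros8(uint8_t V) {
  unsigned N = 0;
  for (unsigned Bit = 0x80; Bit != 0 && !(V & Bit); Bit >>= 1)
    ++N;
  return N;
}

// Mask of all bits at or below the highest demanded bit. This plays the role
// of APInt::getLowBitsSet(BitWidth, BitWidth - NLZ) for BitWidth == 8.
uint8_t demandedFromOps(uint8_t DemandedMask) {
  const unsigned NLZ = countLeadingZeros8(DemandedMask);
  return static_cast<uint8_t>(0xFFu >> NLZ);
}

// True if an add operand whose known-zero bits are `KnownZero` cannot affect
// any demanded bit of the sum, so the other operand can be used directly.
bool operandIsNoOp(uint8_t DemandedMask, uint8_t KnownZero) {
  const uint8_t Mask = demandedFromOps(DemandedMask);
  return (Mask & ~KnownZero) == 0; // Mask is a subset of KnownZero
}
```

In the commuted test from add.ll, the operand `and i8 %x, -64` has its low six bits known zero (0x3F). A user that demands the low six bits sees a no-op operand, while one that demands seven bits does not, which is why only the first of the two tests changes.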

llvm/test/Transforms/InstCombine/add.ll

[2,177 preceding lines not shown]
   ret i8 %add
 }

 define i5 @demand_low_bits_uses(i8 %x, i8 %y) {
 ; CHECK-LABEL: @demand_low_bits_uses(
 ; CHECK-NEXT: [[M:%.*]] = mul i8 [[X:%.*]], -32
 ; CHECK-NEXT: [[A:%.*]] = add i8 [[M]], [[Y:%.*]]
 ; CHECK-NEXT: call void @use(i8 [[A]])
-; CHECK-NEXT: [[R:%.*]] = trunc i8 [[A]] to i5
+; CHECK-NEXT: [[R:%.*]] = trunc i8 [[Y]] to i5
 ; CHECK-NEXT: ret i5 [[R]]
 ;
   %m = mul i8 %x, -32 ; 0xE0
   %a = add i8 %m, %y
   call void @use(i8 %a)
   %r = trunc i8 %a to i5
   ret i5 %r
 }

+; negative test - demands one more bit
+
 define i6 @demand_low_bits_uses_extra_bit(i8 %x, i8 %y) {
 ; CHECK-LABEL: @demand_low_bits_uses_extra_bit(
 ; CHECK-NEXT: [[M:%.*]] = mul i8 [[X:%.*]], -32
 ; CHECK-NEXT: [[A:%.*]] = add i8 [[M]], [[Y:%.*]]
 ; CHECK-NEXT: call void @use(i8 [[A]])
 ; CHECK-NEXT: [[R:%.*]] = trunc i8 [[A]] to i6
 ; CHECK-NEXT: ret i6 [[R]]
 ;
   %m = mul i8 %x, -32 ; 0xE0
   %a = add i8 %m, %y
   call void @use(i8 %a)
   %r = trunc i8 %a to i6
   ret i6 %r
 }

 define i8 @demand_low_bits_uses_commute(i8 %x, i8 %p, i8 %z) {
 ; CHECK-LABEL: @demand_low_bits_uses_commute(
 ; CHECK-NEXT: [[Y:%.*]] = mul i8 [[P:%.*]], [[P]]
 ; CHECK-NEXT: [[M:%.*]] = and i8 [[X:%.*]], -64
 ; CHECK-NEXT: [[A:%.*]] = add i8 [[Y]], [[M]]
 ; CHECK-NEXT: call void @use(i8 [[A]])
-; CHECK-NEXT: [[S:%.*]] = sub i8 [[A]], [[Z:%.*]]
+; CHECK-NEXT: [[S:%.*]] = sub i8 [[Y]], [[Z:%.*]]
 ; CHECK-NEXT: [[R:%.*]] = shl i8 [[S]], 2
 ; CHECK-NEXT: ret i8 [[R]]
 ;
   %y = mul i8 %p, %p ; thwart complexity-based canonicalization
   %m = and i8 %x, -64 ; 0xC0
   %a = add i8 %y, %m
   call void @use(i8 %a)
   %s = sub i8 %a, %z
   %r = shl i8 %s, 2
   ret i8 %r
 }

-define i8 @demand_low_bits_uses_commutei_extra_bit(i8 %x, i8 %p, i8 %z) {
-; CHECK-LABEL: @demand_low_bits_uses_commutei_extra_bit(
+; negative test - demands one more bit
+
+define i8 @demand_low_bits_uses_commute_extra_bit(i8 %x, i8 %p, i8 %z) {
+; CHECK-LABEL: @demand_low_bits_uses_commute_extra_bit(
 ; CHECK-NEXT: [[Y:%.*]] = mul i8 [[P:%.*]], [[P]]
 ; CHECK-NEXT: [[M:%.*]] = and i8 [[X:%.*]], -64
 ; CHECK-NEXT: [[A:%.*]] = add i8 [[Y]], [[M]]
 ; CHECK-NEXT: call void @use(i8 [[A]])
 ; CHECK-NEXT: [[S:%.*]] = sub i8 [[A]], [[Z:%.*]]
 ; CHECK-NEXT: [[R:%.*]] = shl i8 [[S]], 1
 ; CHECK-NEXT: ret i8 [[R]]
 ;
[12 lines not shown]
 ; CHECK-NEXT: [[ZY:%.*]] = zext i64 [[Y:%.*]] to i128
 ; CHECK-NEXT: [[ZW:%.*]] = zext i64 [[W:%.*]] to i128
 ; CHECK-NEXT: [[ZZ:%.*]] = zext i64 [[Z:%.*]] to i128
 ; CHECK-NEXT: [[SHY:%.*]] = shl nuw i128 [[ZY]], 64
 ; CHECK-NEXT: [[MW:%.*]] = mul i128 [[ZW]], -18446744073709551616
 ; CHECK-NEXT: [[XY:%.*]] = or i128 [[SHY]], [[ZX]]
 ; CHECK-NEXT: [[SUB:%.*]] = sub i128 [[XY]], [[ZZ]]
 ; CHECK-NEXT: [[ADD:%.*]] = add i128 [[SUB]], [[MW]]
-; CHECK-NEXT: [[T:%.*]] = trunc i128 [[ADD]] to i64
+; CHECK-NEXT: [[T:%.*]] = trunc i128 [[SUB]] to i64
 ; CHECK-NEXT: [[H:%.*]] = lshr i128 [[ADD]], 64
 ; CHECK-NEXT: [[T2:%.*]] = trunc i128 [[H]] to i64
 ; CHECK-NEXT: [[R1:%.*]] = insertvalue { i64, i64 } poison, i64 [[T]], 0
 ; CHECK-NEXT: [[R2:%.*]] = insertvalue { i64, i64 } [[R1]], i64 [[T2]], 1
 ; CHECK-NEXT: ret { i64, i64 } [[R2]]
 ;
   %zx = zext i64 %x to i128
   %zy = zext i64 %y to i128
[14 following lines not shown]

llvm/test/Transforms/InstCombine/shift.ll

[1,744 preceding lines not shown]
   ret void
 }

 ; OSS Fuzz #38078
 ; https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=38078
 define void @ossfuzz_38078(i32 %arg, i32 %arg1, i32* %ptr, i1* %ptr2, i32* %ptr3, i1* %ptr4, i32* %ptr5, i32* %ptr6, i1* %ptr7) {
 ; CHECK-LABEL: @ossfuzz_38078(
 ; CHECK-NEXT: bb:
+; CHECK-NEXT: [[I2:%.*]] = add nsw i32 [[ARG:%.*]], [[ARG1:%.*]]
+; CHECK-NEXT: [[B3:%.*]] = or i32 [[I2]], 2147483647
 ; CHECK-NEXT: [[G1:%.*]] = getelementptr i32, i32* [[PTR:%.*]], i64 -1
-; CHECK-NEXT: [[I2:%.*]] = sub i32 0, [[ARG1:%.*]]
-; CHECK-NEXT: [[I5:%.*]] = icmp eq i32 [[I2]], [[ARG:%.*]]
+; CHECK-NEXT: [[I5:%.*]] = icmp eq i32 [[I2]], 0
 ; CHECK-NEXT: call void @llvm.assume(i1 [[I5]])
-; CHECK-NEXT: store volatile i32 2147483647, i32* [[G1]], align 4
+; CHECK-NEXT: store volatile i32 [[B3]], i32* [[G1]], align 4
 ; CHECK-NEXT: br label [[BB:%.*]]
 ; CHECK: BB:
 ; CHECK-NEXT: unreachable
 ;
 bb:
   %i = or i32 0, -1
   %B24 = urem i32 %i, -2147483648
   %B21 = or i32 %i, %i
[138 following lines not shown]