Currently we only match bswap intrinsics from or(shl(), lshr())-style patterns, when we could often match bitreverse intrinsics almost as cheaply.
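For illustration (these helpers are not part of the patch, just source-level examples of the patterns involved): a byte swap written out with shifts/masks/ors is the kind of pattern InstCombine already folds to llvm.bswap, and a branch-free bit reversal is built from the same kinds of operations, which is why matching it as llvm.bitreverse should be nearly as cheap.

```cpp
#include <cstdint>

// A 32-bit byte swap expressed as or(shl(), lshr()) - the shift/mask
// pattern InstCombine already recognizes and replaces with llvm.bswap.
static uint32_t bswap32(uint32_t x) {
  return (x << 24) | ((x << 8) & 0x00FF0000u) |
         ((x >> 8) & 0x0000FF00u) | (x >> 24);
}

// A branch-free 32-bit bit reversal built from the same shift/and/or
// operations: swap adjacent bits, then pairs, then nibbles, then bytes.
// Matching this whole tree as llvm.bitreverse is what the patch enables.
static uint32_t bitreverse32(uint32_t x) {
  x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
  x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
  x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
  return bswap32(x); // the final byte reversal completes the bit reversal
}
```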
Diff Detail
- Repository: rG LLVM Github Monorepo
Unit Tests

| Time | Test |
|---|---|
| 370 ms | linux > HWAddressSanitizer-x86_64.TestCases::sizes.cpp |
Event Timeline
Do you have any background on where the current design comes from? If I understand right, we're currently matching bswap in InstCombine, and matching bitreverse in CGP. Why is that? Is that a compile-time consideration, or a code quality consideration (e.g., it is only profitable to materialize a bitreverse intrinsic if it is actually legal, which we do guard on in CGP)?
I think it's mainly a compile-time concern - bswap is a common pattern (and we bail out in cases where we perform sub-byte operations), bitreverse is not.
If compile-time is the only concern, then these are the numbers I get for enabling the bitreverse matching without any type size limit on CTMark: http://llvm-compile-time-tracker.com/compare.php?from=27f647d117087ca11959e232e6443f4aee31e966&to=b9042c581410835940a253af86646edeabfc858e&stat=instructions So this doesn't seem too problematic. Of course, there's the usual "death by a thousand cuts" to consider with InstCombine.
If we want to make things better rather than just not make things worse, we should run an experiment: move the whole bswap / bitreverse matching to aggressive-instcombine without the bitwidth limit.
That may require a pass manager adjustment to run aggressive-instcombine at -O2 to avoid perf regressions, but that could pay for itself by not running these costly matchers so many times when there's really no hope of finding a match.
Based on feedback (@nikic Are those timings still representative?) and my own local testing, I'm not seeing any really bad slowdowns due to matching for bitreverse patterns as well as byteswaps, so I've updated the patch without any i16 limitation.
If we're happy with this, I think we can then remove the bitreverse matching entirely from CodeGenPrepare.
Added a minor compile-time saving - reduced BitPartRecursionMaxDepth from 64 to 48, which was pretty high for a maximum bitwidth of 128.
LGTM, but from the llvm-xray traces of LLVM I've seen, IIRC matchBSwapOrBitReverse/collectBitParts is pretty high up in the list of cycle eaters.
OK, I'm going to reduce the traversal tree depth in a separate patch first, so that any nasty build-time regressions can be attributed specifically to the bitreverse matching.
To be noted, I don't have any specific issues with *this* patch, just with the transform as a whole.
I would maybe suggest dynamically computing BitPartRecursionMaxDepth based on the bitwidth, and perhaps changing it from a depth-based cut-off to a cut-off on the number of visited Values.
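A minimal sketch of that last suggestion (hypothetical code, not LLVM's actual collectBitParts): instead of a fixed recursion-depth cap, the walk spends a budget of distinct visited Values, so the cut-off tracks the work actually performed and can be derived from the bitwidth.

```cpp
#include <unordered_set>

// Hypothetical walker: gives up once the number of distinct values
// analyzed exceeds a budget (e.g. some multiple of the bitwidth),
// rather than cutting off at a fixed recursion depth.
struct BitPartWalker {
  unsigned Budget; // assumption: caller derives this from the bitwidth
  std::unordered_set<const void *> Visited;

  // Returns false when the value graph is too large to keep analyzing.
  bool visit(const void *V) {
    if (!Visited.insert(V).second)
      return true; // already analyzed: revisits cost nothing extra
    if (Visited.size() > Budget)
      return false; // budget exhausted: bail out of the match
    return true;
  }
};
```

Unlike a depth limit, this also charges for wide-but-shallow expression trees, which is where the matcher's cycle cost actually comes from.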