This is an archive of the discontinued LLVM Phabricator instance.

Differential D33572

[PPC CodeGen] Expand the bitreverse.i32 intrinsic.
ClosedPublic

Authored by jtony on May 25 2017, 3:36 PM.

Details

Reviewers

hfinkel
kbarton
echristo
Carrot
nemanjai
sfertile
syzaara
lei
stefanp
gyiu
inouehrs

Commits

rGc260e0eb565f: [PPC CodeGen] Expand the bitreverse.i32 intrinsic.
rL307413: [PPC CodeGen] Expand the bitreverse.i32 intrinsic.

Summary

This patch fixes pr33093.

The current PPCDAGToDAGISel doesn't handle bit reversal efficiently: it generates bit-by-bit moves, even when given the following fast algorithm.

unsigned int ReverseBits(unsigned int n) {
  n = ((n >> 1) & 0x55555555) | ((n & 0x55555555) << 1);
  n = ((n >> 2) & 0x33333333) | ((n & 0x33333333) << 2);
  n = ((n >> 4) & 0x0F0F0F0F) | ((n & 0x0F0F0F0F) << 4);
  return ((n & 0xff000000u) >> 24) | ((n & 0x00ff0000u) >> 8) |
         ((n & 0x0000ff00u) << 8) | ((n & 0x000000ffu) << 24);
}

This patch recognizes bit reverse in BitPermutationSelector, and generates the fast code shown above.
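As a sanity check on the algorithm in the summary, the following minimal sketch pairs it with a naive one-bit-per-iteration reference. The function names `reverse_bits_fast` and `reverse_bits_naive` are illustrative and do not appear in the patch.

```c
#include <stdint.h>

/* Fast reversal from the summary: swap adjacent bits, then bit pairs,
 * then nibbles, and finally reverse the four bytes. */
static uint32_t reverse_bits_fast(uint32_t n) {
    n = ((n >> 1) & 0x55555555u) | ((n & 0x55555555u) << 1);
    n = ((n >> 2) & 0x33333333u) | ((n & 0x33333333u) << 2);
    n = ((n >> 4) & 0x0F0F0F0Fu) | ((n & 0x0F0F0F0Fu) << 4);
    return ((n & 0xFF000000u) >> 24) | ((n & 0x00FF0000u) >> 8) |
           ((n & 0x0000FF00u) << 8)  | ((n & 0x000000FFu) << 24);
}

/* Reference implementation: shift one bit out of n and into r per step. */
static uint32_t reverse_bits_naive(uint32_t n) {
    uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        r = (r << 1) | (n & 1u);
        n >>= 1;
    }
    return r;
}
```

Both functions should agree on every input; for example, 0x12345678 reverses to 0x1E6A2C48.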

Diff Detail

Repository: rL LLVM

Event Timeline

Carrot created this revision. May 25 2017, 3:36 PM

Herald added a subscriber: nemanjai. May 25 2017, 3:36 PM

I'd rather not address the problem this way. Can we canonicalize the code sequence at the IR level into the @llvm.bitreverse intrinsic, and then match the intrinsic efficiently in the backend (preferably in TableGen, where complicated patterns are more succinct to write)? This will give us the additional advantage of good lowering for @llvm.bitreverse and the opportunity for IR-level optimizations to deal with the canonical representation. What the bit permutation selector is doing should be relatively easy to replicate at the IR level.

The code sequence is pretty standard, and I'd love to generalize it, but it is not entirely clear how to usefully suggest doing so in this framework. It is doing the bit reversal by first interchanging adjacent bits, then adjacent bit pairs, etc. Straight out of Hacker's Delight :-) We can do this without a completely-serial dependency chain only because of the complete symmetry of the reversal operation. The bit-permutation selector could certainly recognize "partial reversals", and integrate using a sequence like this for those parts of the overall permutation. I'm not sure how worthwhile this would be.
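The "adjacent bits, then adjacent pairs, etc." structure can be written as a loop over doubling swap widths. This is only a sketch of the idea, not something the patch contains: the function name `reverse_bits_by_stages` and its `stages` parameter are hypothetical, and stopping early is one loose reading of the "partial reversals" mentioned above.

```c
#include <stdint.h>

/* Run the first `stages` swap steps (0..5) of the log-step reversal.
 * Stage k swaps neighbouring groups of 2^k bits.
 * stages == 3 reverses the bits inside each byte;
 * stages == 5 reverses the whole 32-bit word. */
static uint32_t reverse_bits_by_stages(uint32_t n, unsigned stages) {
    static const uint32_t low_masks[5] = {
        0x55555555u, 0x33333333u, 0x0F0F0F0Fu, 0x00FF00FFu, 0x0000FFFFu
    };
    if (stages > 5)
        stages = 5;
    for (unsigned k = 0; k < stages; ++k) {
        unsigned width = 1u << k;      /* 1, 2, 4, 8, 16 */
        uint32_t lo = low_masks[k];    /* selects the low group of each pair */
        n = ((n >> width) & lo) | ((n & lo) << width);
    }
    return n;
}
```

Within one stage the two shifted halves are independent of each other, which is why the sequence is not a fully serial dependency chain.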

There is also a larger issue potentially worth discussing. The code we currently produce looks like this:

	rlwinm 4, 3, 1, 0, 31 
	rlwimi 4, 3, 3, 30, 30
	rlwimi 4, 3, 5, 29, 29
	rlwimi 4, 3, 7, 28, 28
	rlwimi 4, 3, 9, 27, 27
	rlwimi 4, 3, 11, 26, 26
	rlwimi 4, 3, 13, 25, 25
	rlwimi 4, 3, 15, 24, 24
	rlwimi 4, 3, 17, 23, 23
	rlwimi 4, 3, 19, 22, 22
	rlwimi 4, 3, 21, 21, 21
	rlwimi 4, 3, 23, 20, 20
	rlwimi 4, 3, 25, 19, 19
	rlwimi 4, 3, 27, 18, 18
	rlwimi 4, 3, 29, 17, 17
	rlwimi 4, 3, 31, 16, 16
	rlwimi 4, 3, 3, 14, 14
	rlwimi 4, 3, 5, 13, 13
	rlwimi 4, 3, 7, 12, 12
	rlwimi 4, 3, 9, 11, 11
	rlwimi 4, 3, 11, 10, 10
	rlwimi 4, 3, 13, 9, 9
	rlwimi 4, 3, 15, 8, 8
	rlwimi 4, 3, 17, 7, 7
	rlwimi 4, 3, 19, 6, 6
	rlwimi 4, 3, 21, 5, 5
	rlwimi 4, 3, 23, 4, 4
	rlwimi 4, 3, 25, 3, 3
	rlwimi 4, 3, 27, 2, 2
	rlwimi 4, 3, 29, 1, 1
	rlwimi 4, 3, 31, 0, 0
	mr 3, 4
	blr

and that's one large dependency chain (each instruction updating r4). It also clearly does not need to be that way. We could insert the reversed bits into n registers, as they're all independent, and then 'and' the results together at the end. In this way, we could create lots of independent streams of computation. Have you experimented with whether this is faster than the original sequence on the P8? If it is, then I'll partially take back what I said about putting a pattern in TableGen, and recommend that we implement dependency-chain splitting in the bit-permutation selector (and rank options by taking throughput into account instead of just counting instructions), or alternatively, implement dependency-chain splitting in the MachineCombiner.
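The dependency-chain splitting idea can be illustrated in C. This is a hedged sketch only: the choice of four accumulators and the helper names are assumptions, a C compiler is free to reschedule the code, and it says nothing about actual P8 performance. The partial results are merged with OR here because each accumulator is zero outside the bits it owns.

```c
#include <stdint.h>

/* Move source bit i (LSB = 0) to destination bit 31 - i. */
static uint32_t place_bit(uint32_t n, unsigned i) {
    return ((n >> i) & 1u) << (31u - i);
}

/* Single chain: every step reads and writes the same accumulator,
 * mirroring the rlwimi sequence that keeps updating r4. */
static uint32_t reverse_single_chain(uint32_t n) {
    uint32_t acc = 0;
    for (unsigned i = 0; i < 32; ++i)
        acc |= place_bit(n, i);
    return acc;
}

/* Split chains: four accumulators that never depend on one another,
 * merged only at the very end. */
static uint32_t reverse_split_chains(uint32_t n) {
    uint32_t a = 0, b = 0, c = 0, d = 0;
    for (unsigned i = 0; i < 8; ++i) {
        a |= place_bit(n, i);
        b |= place_bit(n, i + 8);
        c |= place_bit(n, i + 16);
        d |= place_bit(n, i + 24);
    }
    return (a | b) | (c | d);
}
```

The split version does the same number of bit moves but its longest chain is roughly a quarter as long, at the cost of extra registers and the final merge.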

test/CodeGen/PowerPC/pr33093.ll
Line 37 (on Diff #100314): Please check for the desired sequence here, including regex-recognized operands. Same below.

This revision now requires changes to proceed. May 30 2017, 7:28 PM

jtony added a subscriber: jtony. Jun 22 2017, 8:02 AM

jtony commandeered this revision. Jun 26 2017, 6:23 PM

jtony added a reviewer: Carrot.

Re-implemented this patch according to Hal's comments.
Note that this is the first patch of the CodeGen part for the llvm.bitreverse.i32 intrinsic.
A follow-up patch will implement the llvm.bitreverse.i64 intrinsic,
and another patch will add idiom recognition in LLVM opt to generate llvm.bitreverse.

hfinkel added inline comments. Jun 26 2017, 8:01 PM

lib/Target/PowerPC/PPCInstrInfo.td
Line 4454 (on Diff #104066): I realize that there are plenty of places online that explain the algorithm, but please add an explanation here as well (i.e. that we're exchanging pairs of bits, and then exchanging groups of two bits, etc.).
Line 4469 (on Diff #104066): Please break this line, and the other swap patterns, so they're not so long. There is no need to exceed 80 columns here; it's likely more readable to break after the first argument to OR.

You may not need to implement idiom recognition for the bit reversal operation yourself (see makeBitReverse in CodeGenPrepare.cpp).

jtony retitled this revision from [PPC] Implement fast bit reverse in PPCDAGToDAGISel to [PPC CodeGen] Expand the bitreverse.i32 intrinsic. Jun 28 2017, 8:54 AM

Addressed comments from Hal Finkel and added one more IR test case to cover the original situation in Bugzilla (the IR is an equivalent form of the fast bit reverse, but NOT the intrinsic).

jtony marked an inline comment as done. Jun 28 2017, 11:26 AM

In D33572#794146, @jtony wrote:

Addressed comments from Hal Finkel and added one more IR test case to cover the original situation in Bugzilla (the IR is an equivalent form of the fast bit reverse, but NOT the intrinsic).

So, we don't currently recognize the 32-bit version as a bit permutation in 64-bit mode? Otherwise, we'd end up in the same situation as the PR and the test would fail right now, right?

In D33572#794182, @hfinkel wrote:

So, we don't currently recognize the 32-bit version as a bit permutation in 64-bit mode? Otherwise, we'd end up in the same situation as the PR and the test would fail right now, right?

I think that the point that needs to be mentioned in this patch is that idiom recognition will fire now that we've legalized ISD::BITREVERSE and we will not consider this for the bit permutation handling. Namely, what we'll have in the SDAG is just the ISD::BITREVERSE node.

In D33572#794920, @nemanjai wrote:

I think that the point that needs to be mentioned in this patch is that idiom recognition will fire now that we've legalized ISD::BITREVERSE and we will not consider this for the bit permutation handling. Namely, what we'll have in the SDAG is just the ISD::BITREVERSE node.

Ah, okay. We're doing idiom recognition in the backend (in CGP). That might be suboptimal (now or in the future), but that's a separate matter. This LGTM.

This revision is now accepted and ready to land. Jun 29 2017, 7:48 AM

Closed by commit rL307413: [PPC CodeGen] Expand the bitreverse.i32 intrinsic. (authored by jtony). Jul 7 2017, 9:42 AM

This revision was automatically updated to reflect the committed changes.

jtony mentioned this in D35188: Add bitreverse LNT benchmark. Jul 10 2017, 6:48 AM

Revision Contents

Path                                                  Size
llvm/trunk/lib/Target/PowerPC/PPCISelLowering.cpp     3 lines
llvm/trunk/lib/Target/PowerPC/PPCInstrInfo.td         68 lines
llvm/trunk/test/CodeGen/PowerPC/bitreverse.ll         23 lines
llvm/trunk/test/CodeGen/PowerPC/pr33093.ll            67 lines
llvm/trunk/test/CodeGen/PowerPC/testBitReverse.ll     42 lines

Diff 105658

llvm/trunk/lib/Target/PowerPC/PPCISelLowering.cpp

PPCTargetLowering::PPCTargetLowering(const PPCTargetMachine &TM,

   // Set up the register classes.
   addRegisterClass(MVT::i32, &PPC::GPRCRegClass);
   if (!useSoftFloat()) {
     addRegisterClass(MVT::f32, &PPC::F4RCRegClass);
     addRegisterClass(MVT::f64, &PPC::F8RCRegClass);
   }

+  // Match BITREVERSE to customized fast code sequence in the td file.
+  setOperationAction(ISD::BITREVERSE, MVT::i32, Legal);
+
   // PowerPC has an i16 but no i8 (or i1) SEXTLOAD.
   for (MVT VT : MVT::integer_valuetypes()) {
     setLoadExtAction(ISD::SEXTLOAD, VT, MVT::i1, Promote);
     setLoadExtAction(ISD::SEXTLOAD, VT, MVT::i8, Expand);
   }

   setTruncStoreAction(MVT::f64, MVT::f32, Expand);

llvm/trunk/lib/Target/PowerPC/PPCInstrInfo.td

 // Message Synchronize
 def MSGSYNC : XForm_0<31, 886, (outs), (ins), "msgsync", IIC_SprMSGSYNC, []>;

 // Power-Saving Mode Instruction:
 def STOP : XForm_0<19, 370, (outs), (ins), "stop", IIC_SprSTOP, []>;

 } // IsISA3_0

+// Fast 32-bit reverse bits algorithm:
+// Step 1: 1-bit swap (swap odd 1-bit and even 1-bit):
+// n = ((n >> 1) & 0x55555555) | ((n << 1) & 0xAAAAAAAA);
+// Step 2: 2-bit swap (swap odd 2-bit and even 2-bit):
+// n = ((n >> 2) & 0x33333333) | ((n << 2) & 0xCCCCCCCC);
+// Step 3: 4-bit swap (swap odd 4-bit and even 4-bit):
+// n = ((n >> 4) & 0x0F0F0F0F) | ((n << 4) & 0xF0F0F0F0);
+// Step 4: byte reverse (Suppose n = [B1,B2,B3,B4]):
+// Step 4.1: Put B4,B2 in the right position (rotate left 3 bytes):
+// n' = (n rotl 24); After which n' = [B4, B1, B2, B3]
+// Step 4.2: Insert B3 to the right position:
+// n' = rlwimi n', n, 8, 8, 15; After which n' = [B4, B3, B2, B3]
+// Step 4.3: Insert B1 to the right position:
+// n' = rlwimi n', n, 8, 24, 31; After which n' = [B4, B3, B2, B1]
+def MaskValues {
+  dag Lo1 = (ORI (LIS 0x5555), 0x5555);
+  dag Hi1 = (ORI (LIS 0xAAAA), 0xAAAA);
+  dag Lo2 = (ORI (LIS 0x3333), 0x3333);
+  dag Hi2 = (ORI (LIS 0xCCCC), 0xCCCC);
+  dag Lo4 = (ORI (LIS 0x0F0F), 0x0F0F);
+  dag Hi4 = (ORI (LIS 0xF0F0), 0xF0F0);
+}
+
+def Shift1 {
+  dag Right = (RLWINM $A, 31, 1, 31);
+  dag Left = (RLWINM $A, 1, 0, 30);
+}
+
+def Swap1 {
+  dag Bit = (OR (AND Shift1.Right, MaskValues.Lo1),
+            (AND Shift1.Left, MaskValues.Hi1));
+}
+
+def Shift2 {
+  dag Right = (RLWINM Swap1.Bit, 30, 2, 31);
+  dag Left = (RLWINM Swap1.Bit, 2, 0, 29);
+}
+
+def Swap2 {
+  dag Bits = (OR (AND Shift2.Right, MaskValues.Lo2),
+             (AND Shift2.Left, MaskValues.Hi2));
+}
+
+def Shift4 {
+  dag Right = (RLWINM Swap2.Bits, 28, 4, 31);
+  dag Left = (RLWINM Swap2.Bits, 4, 0, 27);
+}
+
+def Swap4 {
+  dag Bits = (OR (AND Shift4.Right, MaskValues.Lo4),
+             (AND Shift4.Left, MaskValues.Hi4));
+}
+
+def Rotate {
+  dag Left3Bytes = (RLWINM Swap4.Bits, 24, 0, 31);
+}
+
+def RotateInsertByte3 {
+  dag Left = (RLWIMI Rotate.Left3Bytes, Swap4.Bits, 8, 8, 15);
+}
+
+def RotateInsertByte1 {
+  dag Left = (RLWIMI RotateInsertByte3.Left, Swap4.Bits, 8, 24, 31);
+}
+
+def : Pat<(i32 (bitreverse i32:$A)),
+  (RLDICL_32 RotateInsertByte1.Left, 0, 32)>;
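Steps 4.1 to 4.3 of the comment above can be traced with a small C model of the rotate-and-mask instructions. This is a simplified sketch under stated assumptions: it models only the 32-bit results of `rlwinm` and `rlwimi`, it does not handle wrap-around masks (MB greater than ME), and the helper names are illustrative rather than taken from LLVM.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by sh bits (sh taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned sh) {
    sh &= 31u;
    return sh ? (x << sh) | (x >> (32u - sh)) : x;
}

/* Ones in bits mb..me using IBM numbering (bit 0 is the MSB).
 * Assumes mb <= me <= 31; wrap-around masks are not modelled. */
static uint32_t ppc_mask(unsigned mb, unsigned me) {
    uint32_t from_mb = 0xFFFFFFFFu >> mb;        /* bits mb..31 */
    uint32_t up_to_me = 0xFFFFFFFFu << (31u - me); /* bits 0..me */
    return from_mb & up_to_me;
}

/* rlwinm: rotate left, then AND with the mask. */
static uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me) {
    return rotl32(rs, sh) & ppc_mask(mb, me);
}

/* rlwimi: rotate left, then insert the masked bits into ra. */
static uint32_t rlwimi(uint32_t ra, uint32_t rs,
                       unsigned sh, unsigned mb, unsigned me) {
    uint32_t m = ppc_mask(mb, me);
    return (rotl32(rs, sh) & m) | (ra & ~m);
}

/* Byte reverse of n = [B1, B2, B3, B4] following steps 4.1 to 4.3. */
static uint32_t byte_reverse(uint32_t n) {
    uint32_t r = rlwinm(n, 24, 0, 31);  /* [B4, B1, B2, B3] */
    r = rlwimi(r, n, 8, 8, 15);         /* [B4, B3, B2, B3] */
    r = rlwimi(r, n, 8, 24, 31);        /* [B4, B3, B2, B1] */
    return r;
}
```

In this model, `rlwinm(x, 31, 1, 31)` behaves like a logical right shift by one and `rlwinm(x, 1, 0, 30)` like a left shift by one, which matches how `Shift1.Right` and `Shift1.Left` are used in the pattern.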

llvm/trunk/test/CodeGen/PowerPC/bitreverse.ll

	; RUN: llc -verify-machineinstrs -march=ppc64 %s -o - | FileCheck %s

	; These tests just check that the plumbing is in place for @llvm.bitreverse. The
	; actual output is massive at the moment as llvm.bitreverse is not yet legal.

	declare <2 x i16> @llvm.bitreverse.v2i16(<2 x i16>) readnone

	define <2 x i16> @f(<2 x i16> %a) {
	; CHECK-LABEL: f:
	; CHECK: rlwinm
	%b = call <2 x i16> @llvm.bitreverse.v2i16(<2 x i16> %a)
	ret <2 x i16> %b
	}

	declare i8 @llvm.bitreverse.i8(i8) readnone

	define i8 @g(i8 %a) {
	; CHECK-LABEL: g:
	; CHECK: rlwinm
	; CHECK: rlwimi
	%b = call i8 @llvm.bitreverse.i8(i8 %a)
	ret i8 %b
	}

llvm/trunk/test/CodeGen/PowerPC/pr33093.ll

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=powerpc64-unknown-linux-gnu -mcpu=pwr8 < %s | FileCheck %s
				; RUN: llc -mtriple=powerpc64le-unknown-linux-gnu -mcpu=pwr8 < %s | FileCheck %s

				define zeroext i32 @ReverseBits(i32 zeroext %n) {
				; CHECK-LABEL: ReverseBits:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: lis 4, -21846
				; CHECK-NEXT: lis 5, 21845
				; CHECK-NEXT: slwi 6, 3, 1
				; CHECK-NEXT: srwi 3, 3, 1
				; CHECK-NEXT: lis 7, -13108
				; CHECK-NEXT: lis 8, 13107
				; CHECK-NEXT: ori 4, 4, 43690
				; CHECK-NEXT: ori 5, 5, 21845
				; CHECK-NEXT: lis 10, -3856
				; CHECK-NEXT: lis 11, 3855
				; CHECK-NEXT: and 3, 3, 5
				; CHECK-NEXT: and 4, 6, 4
				; CHECK-NEXT: ori 5, 8, 13107
				; CHECK-NEXT: or 3, 3, 4
				; CHECK-NEXT: ori 4, 7, 52428
				; CHECK-NEXT: slwi 9, 3, 2
				; CHECK-NEXT: srwi 3, 3, 2
				; CHECK-NEXT: and 3, 3, 5
				; CHECK-NEXT: and 4, 9, 4
				; CHECK-NEXT: ori 5, 11, 3855
				; CHECK-NEXT: or 3, 3, 4
				; CHECK-NEXT: ori 4, 10, 61680
				; CHECK-NEXT: slwi 12, 3, 4
				; CHECK-NEXT: srwi 3, 3, 4
				; CHECK-NEXT: and 4, 12, 4
				; CHECK-NEXT: and 3, 3, 5
				; CHECK-NEXT: or 3, 3, 4
				; CHECK-NEXT: rotlwi 4, 3, 24
				; CHECK-NEXT: rlwimi 4, 3, 8, 8, 15
				; CHECK-NEXT: rlwimi 4, 3, 8, 24, 31
				; CHECK-NEXT: rldicl 3, 4, 0, 32
				; CHECK-NEXT: clrldi 3, 3, 32
				; CHECK-NEXT: blr
				entry:
				%shr = lshr i32 %n, 1
				%and = and i32 %shr, 1431655765
				%and1 = shl i32 %n, 1
				%shl = and i32 %and1, -1431655766
				%or = or i32 %and, %shl
				%shr2 = lshr i32 %or, 2
				%and3 = and i32 %shr2, 858993459
				%and4 = shl i32 %or, 2
				%shl5 = and i32 %and4, -858993460
				%or6 = or i32 %and3, %shl5
				%shr7 = lshr i32 %or6, 4
				%and8 = and i32 %shr7, 252645135
				%and9 = shl i32 %or6, 4
				%shl10 = and i32 %and9, -252645136
				%or11 = or i32 %and8, %shl10
				%shr13 = lshr i32 %or11, 24
				%and14 = lshr i32 %or11, 8
				%shr15 = and i32 %and14, 65280
				%and17 = shl i32 %or11, 8
				%shl18 = and i32 %and17, 16711680
				%shl21 = shl i32 %or11, 24
				%or16 = or i32 %shl21, %shr13
				%or19 = or i32 %or16, %shr15
				%or22 = or i32 %or19, %shl18
				ret i32 %or22
				}
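The decimal immediates in the CHECK lines and in the IR are the same hexadecimal masks used by the algorithm. The sketch below models only the low 32 bits of a `lis`/`ori` pair to show the correspondence; the helper names are hypothetical, and the sign extension that `lis` performs into the upper half of a 64-bit register is deliberately ignored.

```c
#include <stdint.h>

/* lis: place a signed 16-bit immediate in the upper halfword
 * (low 32 bits of the result only). */
static uint32_t lis32(int16_t imm) {
    return (uint32_t)(uint16_t)imm << 16;
}

/* ori: OR an unsigned 16-bit immediate into the lower halfword. */
static uint32_t ori32(uint32_t ra, uint16_t imm) {
    return ra | imm;
}

/* Build a 32-bit constant the way the CHECK lines do,
 * e.g. "lis 4, -21846" followed by "ori 4, 4, 43690". */
static uint32_t build_mask(int16_t hi, uint16_t lo) {
    return ori32(lis32(hi), lo);
}
```

For example, `build_mask(-21846, 43690)` yields 0xAAAAAAAA, which is the IR constant -1431655766 viewed as an unsigned 32-bit value, and `build_mask(21845, 21845)` yields 0x55555555 (1431655765).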

llvm/trunk/test/CodeGen/PowerPC/testBitReverse.ll

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -verify-machineinstrs -mtriple=powerpc64-unknown-linux-gnu -mcpu=pwr8 < %s | FileCheck %s
				; RUN: llc -verify-machineinstrs -mtriple=powerpc64le-unknown-linux-gnu -mcpu=pwr8 < %s | FileCheck %s
				declare i32 @llvm.bitreverse.i32(i32)
				define i32 @testBitReverseIntrinsicI32(i32 %arg) {
				; CHECK-LABEL: testBitReverseIntrinsicI32:
				; CHECK: # BB#0:
				; CHECK-NEXT: lis 4, -21846
				; CHECK-NEXT: lis 5, 21845
				; CHECK-NEXT: slwi 6, 3, 1
				; CHECK-NEXT: srwi 3, 3, 1
				; CHECK-NEXT: lis 7, -13108
				; CHECK-NEXT: lis 8, 13107
				; CHECK-NEXT: ori 4, 4, 43690
				; CHECK-NEXT: ori 5, 5, 21845
				; CHECK-NEXT: lis 10, -3856
				; CHECK-NEXT: lis 11, 3855
				; CHECK-NEXT: and 3, 3, 5
				; CHECK-NEXT: and 4, 6, 4
				; CHECK-NEXT: ori 5, 8, 13107
				; CHECK-NEXT: or 3, 3, 4
				; CHECK-NEXT: ori 4, 7, 52428
				; CHECK-NEXT: slwi 9, 3, 2
				; CHECK-NEXT: srwi 3, 3, 2
				; CHECK-NEXT: and 3, 3, 5
				; CHECK-NEXT: and 4, 9, 4
				; CHECK-NEXT: ori 5, 11, 3855
				; CHECK-NEXT: or 3, 3, 4
				; CHECK-NEXT: ori 4, 10, 61680
				; CHECK-NEXT: slwi 12, 3, 4
				; CHECK-NEXT: srwi 3, 3, 4
				; CHECK-NEXT: and 4, 12, 4
				; CHECK-NEXT: and 3, 3, 5
				; CHECK-NEXT: or 3, 3, 4
				; CHECK-NEXT: rotlwi 4, 3, 24
				; CHECK-NEXT: rlwimi 4, 3, 8, 8, 15
				; CHECK-NEXT: rlwimi 4, 3, 8, 24, 31
				; CHECK-NEXT: rldicl 3, 4, 0, 32
				; CHECK-NEXT: blr
				%res = call i32 @llvm.bitreverse.i32(i32 %arg)
				ret i32 %res
				}