This is an archive of the discontinued LLVM Phabricator instance.

[PPC CodeGen] Expand the bitreverse.i32 intrinsic.
ClosedPublic

Authored by jtony on May 25 2017, 3:36 PM.

Details

Summary

This patch fixes pr33093.

Current PPCDAGToDAGISel doesn't handle bit reverse efficiently, it generates bit by bit moves for it, even if we give it the following fast algorithm.

unsigned int ReverseBits(unsigned int n) {

n = ((n >> 1) & 0x55555555) | ((n & 0x55555555) << 1);
n = ((n >> 2) & 0x33333333) | ((n & 0x33333333) << 2);
n = ((n >> 4) & 0x0F0F0F0F) | ((n & 0x0F0F0F0F) << 4);
return ((n & 0xff000000u) >> 24) | ((n & 0x00ff0000u) >> 8) | ((n & 0x0000ff00u) << 8) | ((n & 0x000000ffu) << 24);

}

This patch recognizes bit reverse in BitPermutationSelector, and generates the fast code shown above.

Diff Detail

Event Timeline

Carrot created this revision.May 25 2017, 3:36 PM
hfinkel requested changes to this revision.May 30 2017, 7:28 PM

I'd rather not address the problem this way. Can we canonicalize the code sequence at the IR level into the @llvm.bitreverse intrinsic, and then match the intrinsic efficiently in the backend (preferably in TableGen, where complicated patterns are more succinct to write)? This will give us the additional advantage of good lowering for @llvm.bitreverse and the opportunity for IR-level optimizations to deal with the canonical representation. What the bit permutation selector is doing should be relatively easy to replicate at the IR level.

The code sequence is pretty standard, and I'd love to generalize it, but it is not entirely clear how to usefully suggest doing so in this framework. It is doing the bit reversal by first interchanging adjacent bits, then adjacent bit pairs, etc. Straight out of Hacker's Delight :-) We can do this without a completely-serial dependency chain only because of the complete symmetry of the reversal operation. The bit-permutation selector could certainly recognize "partial reversals", and integrate using a sequence like this for those parts of the overall permutation. I'm not sure how worthwhile this would be.

There is also a larger issue potentially worth discussing. The code we currently produce looks like this:

	rlwinm 4, 3, 1, 0, 31 
	rlwimi 4, 3, 3, 30, 30
	rlwimi 4, 3, 5, 29, 29
	rlwimi 4, 3, 7, 28, 28
	rlwimi 4, 3, 9, 27, 27
	rlwimi 4, 3, 11, 26, 26
	rlwimi 4, 3, 13, 25, 25
	rlwimi 4, 3, 15, 24, 24
	rlwimi 4, 3, 17, 23, 23
	rlwimi 4, 3, 19, 22, 22
	rlwimi 4, 3, 21, 21, 21
	rlwimi 4, 3, 23, 20, 20
	rlwimi 4, 3, 25, 19, 19
	rlwimi 4, 3, 27, 18, 18
	rlwimi 4, 3, 29, 17, 17
	rlwimi 4, 3, 31, 16, 16
	rlwimi 4, 3, 3, 14, 14
	rlwimi 4, 3, 5, 13, 13
	rlwimi 4, 3, 7, 12, 12
	rlwimi 4, 3, 9, 11, 11
	rlwimi 4, 3, 11, 10, 10
	rlwimi 4, 3, 13, 9, 9
	rlwimi 4, 3, 15, 8, 8
	rlwimi 4, 3, 17, 7, 7
	rlwimi 4, 3, 19, 6, 6
	rlwimi 4, 3, 21, 5, 5
	rlwimi 4, 3, 23, 4, 4
	rlwimi 4, 3, 25, 3, 3
	rlwimi 4, 3, 27, 2, 2
	rlwimi 4, 3, 29, 1, 1
	rlwimi 4, 3, 31, 0, 0
	mr 3, 4
	blr

and that's one large dependency chain (each instruction updating r4). It also clearly does not need to be that way. We could insert the reversed bits into n registers, as they're all independent, and then 'and' the results together at the end. In this way, we could create lots of independent streams of computation. Have you experimented with whether this is faster than the original sequence on the P8? If it is, then I'll partially take back what I said about putting a pattern in TableGen, and recommend that we implement dependency-chain splitting in the bit-permutation selector (and rank options by taking throughput into account instead of just counting instructions), or alternatively, implement dependency-chain splitting in the MachineCombiner.

test/CodeGen/PowerPC/pr33093.ll
38

Please check for the desired sequence here, including regex-recognized operands. Same below.

This revision now requires changes to proceed.May 30 2017, 7:28 PM
jtony added a subscriber: jtony.Jun 22 2017, 8:02 AM
jtony commandeered this revision.Jun 26 2017, 6:23 PM
jtony added a reviewer: Carrot.
jtony updated this revision to Diff 104066.Jun 26 2017, 6:50 PM
jtony edited edge metadata.
jtony removed subscribers: jtony, nemanjai.

Re-implement this patch according to Hal's comments.
Note this is the first patch of the CodeGen part for intrinsic llvm.bitreverse.i32
There will be a follow-up patch to implement intrinsic llvm.bitreverse.i64
and another patch to do idiom recognition in llvm opt to generate llvm.bitreverse

hfinkel added inline comments.Jun 26 2017, 8:01 PM
lib/Target/PowerPC/PPCInstrInfo.td
4454

I realize that there are plenty of places online that explain the algorithm, but please add an explanation here as well (i.e. that we're exchanging pairs of bits, and then exchanging groups of two bits, etc.).

4469

Please break this line, and the other swap patterns, so they're not so long (no need to exceed 80 cols here, it's likely more readable breaking this after the first argument to OR).

inouehrs edited edge metadata.Jun 27 2017, 12:26 AM

Maybe, you do not need to implement idiom recognition for the bit reversal operation by your self (see makeBitReverse in CodeGenPrepare.cpp).

jtony retitled this revision from [PPC] Implement fast bit reverse in PPCDAGToDAGISel to [PPC CodeGen] Expand the bitreverse.i32 intrinsic..Jun 28 2017, 8:54 AM
jtony updated this revision to Diff 104479.Jun 28 2017, 11:25 AM
jtony marked 2 inline comments as done.

Address comments from Hal Finkel and add one more IR test case to test the original situation in Bugzilla (the IR is equivalent form of fast bit-reverse but NOT the intrinsic).

jtony marked an inline comment as done.Jun 28 2017, 11:26 AM

Address comments from Hal Finkel and add one more IR test case to test the original situation in Bugzilla (the IR is equivalent form of fast bit-reverse but NOT the intrinsic).

So, we don't currently recognize the 32-bit version as a bit permutation in 64-bit mode? Otherwise, we'd end up in the same situation as the PR and the test would fail right now, right?

nemanjai edited edge metadata.Jun 29 2017, 12:30 AM

Address comments from Hal Finkel and add one more IR test case to test the original situation in Bugzilla (the IR is equivalent form of fast bit-reverse but NOT the intrinsic).

So, we don't currently recognize the 32-bit version as a bit permutation in 64-bit mode? Otherwise, we'd end up in the same situation as the PR and the test would fail right now, right?

I think that the point that needs to be mentioned in this patch is that idiom recognition will fire now that we've legalized ISD::BITREVERSE and we will not consider this for the bit permutation handling. Namely, what we'll have in the SDAG is just the ISD::BITREVERSE node.

hfinkel accepted this revision.Jun 29 2017, 7:48 AM

Address comments from Hal Finkel and add one more IR test case to test the original situation in Bugzilla (the IR is equivalent form of fast bit-reverse but NOT the intrinsic).

So, we don't currently recognize the 32-bit version as a bit permutation in 64-bit mode? Otherwise, we'd end up in the same situation as the PR and the test would fail right now, right?

I think that the point that needs to be mentioned in this patch is that idiom recognition will fire now that we've legalized ISD::BITREVERSE and we will not consider this for the bit permutation handling. Namely, what we'll have in the SDAG is just the ISD::BITREVERSE node.

Ah, okay. We're doing idiom recognition in the backend (in CGP). That might be suboptimal (now or in the future), but that's a separate matter. This LGTM.

This revision is now accepted and ready to land.Jun 29 2017, 7:48 AM
This revision was automatically updated to reflect the committed changes.
lib/Target/PowerPC/PPCInstrInfo.td