This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] InstCombinerImpl::visitOr - enable bitreverse matching
ClosedPublic

Authored by RKSimon on Oct 26 2020, 10:10 AM.

Details

Summary

Currently we only match bswap intrinsics from or(shl(),lshr()) style patterns when we could often match bitreverse intrinsics almost as cheaply.

Diff Detail

Event Timeline

RKSimon created this revision.Oct 26 2020, 10:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 26 2020, 10:10 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
RKSimon requested review of this revision.Oct 26 2020, 10:10 AM
nikic added a comment.Oct 31 2020, 5:06 AM

Do you have any background on where the current design comes from? If I understand right, we're currently matching bswap in InstCombine, and matching bitreverse in CGP. Why is that? Is that a compile-time consideration, or a code quality consideration (e.g., it is only profitable to materialize a bitreverse intrinsic if it is actually legal, which we do guard on in CGP)?

I think its mainly a compile-time concern - bswap is a common pattern (and we bail out in cases where we perform sub-byte operations), bitreverse is not.

nikic added a comment.Nov 1 2020, 1:40 AM

If compile-time is the only concern, then these are the numbers I get for enabling the bitreverse matching without any type size limit on CTMark: http://llvm-compile-time-tracker.com/compare.php?from=27f647d117087ca11959e232e6443f4aee31e966&to=b9042c581410835940a253af86646edeabfc858e&stat=instructions So this doesn't seem too problematic. Of course, there's the usual "death by a thousand cuts" to consider with InstCombine.

spatel added a comment.Nov 1 2020, 6:10 AM

If we want to make things better rather than just not make things worse, we should run an experiment: move the whole bswap / bitreverse matching to aggressive-instcombine without the bitwidth limit.
That may require a pass manager adjustment to run aggressive-instcombine at -O2 to avoid perf regressions, but that could pay for itself by not running these costly matchers so many times when there's really no hope of finding a match.

RKSimon planned changes to this revision.Nov 12 2020, 4:29 AM
RKSimon updated this revision to Diff 343032.May 5 2021, 6:59 AM
RKSimon retitled this revision from [InstCombine] InstCombinerImpl::visitOr - enable limited bitreverse matching to [InstCombine] InstCombinerImpl::visitOr - enable bitreverse matching.
RKSimon edited the summary of this revision. (Show Details)

Based on feedback (@nikic Are those timings still representable?) and my own local testing, I'm not seeing any really bad slow downs due to matching for bitreverse patterns as well as as byteswaps, so I've updated the patch without any i16 limitation.

If we're happy with this I think we can then remove the bitreverse matching entirely from codegenprepare.

RKSimon updated this revision to Diff 343841.May 8 2021, 7:55 AM

Added a minor compile time saving - reducing BitPartRecursionMaxDepth from 64 to 48, which was pretty high for a maximum bitwidth of 128.

lebedev.ri accepted this revision.May 14 2021, 1:59 AM

LGTM, but from the llvm-xray traces of LLVM i've seen, IIRC matchBSwapOrBitReverse/collectbitparts is pretty high up in the list of cycle eaters.

This revision is now accepted and ready to land.May 14 2021, 1:59 AM

LGTM, but from the llvm-xray traces of LLVM i've seen, IIRC matchBSwapOrBitReverse/collectbitparts is pretty high up in the list of cycle eaters.

OK, I'm going to reduce the traversal tree depth in a separate patch first to make any nasty build time regressions due to the bitreverse matching more specific.

LGTM, but from the llvm-xray traces of LLVM i've seen, IIRC matchBSwapOrBitReverse/collectbitparts is pretty high up in the list of cycle eaters.

OK, I'm going to reduce the traversal tree depth in a separate patch first to make any nasty build time regressions due to the bitreverse matching more specific.

To be noted, i don't have any specific issues with *this* patch, just that whole transform as a whole.
I would maybe suggest to dyamically compute the BitPartRecursionMaxDepth based on the bitwidth,
and maybe change it to be a not depth-based cut-off, but a number of visited Values.