Currently we only match bswap intrinsics from or(shl(), lshr())-style patterns, when we could often match bitreverse intrinsics almost as cheaply.
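For illustration (these helpers are not part of the patch, just source-level examples of the patterns involved): a byte swap written out with shifts/masks/ors is the kind of pattern InstCombine already folds to llvm.bswap, and a branch-free bit reversal is built from the same kinds of operations, which is why matching it as llvm.bitreverse should be nearly as cheap.

```cpp
#include <cstdint>

// A 32-bit byte swap expressed as or(shl(), lshr()) - the shift/mask
// pattern InstCombine already recognizes and replaces with llvm.bswap.
static uint32_t bswap32(uint32_t x) {
  return (x << 24) | ((x << 8) & 0x00FF0000u) |
         ((x >> 8) & 0x0000FF00u) | (x >> 24);
}

// A branch-free 32-bit bit reversal built from the same shift/and/or
// operations: swap adjacent bits, then pairs, then nibbles, then bytes.
// Matching this whole tree as llvm.bitreverse is what the patch enables.
static uint32_t bitreverse32(uint32_t x) {
  x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
  x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
  x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
  return bswap32(x); // the final byte reversal completes the bit reversal
}
```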
Diff Detail
- Repository: rG LLVM Github Monorepo
Unit Tests

| Time | Test |
|---|---|
| 370 ms | linux > HWAddressSanitizer-x86_64.TestCases::sizes.cpp |
Event Timeline
Do you have any background on where the current design comes from? If I understand right, we're currently matching bswap in InstCombine, and matching bitreverse in CGP. Why is that? Is that a compile-time consideration, or a code quality consideration (e.g., it is only profitable to materialize a bitreverse intrinsic if it is actually legal, which we do guard on in CGP)?
I think it's mainly a compile-time concern - bswap is a common pattern (and we bail out in cases where we perform sub-byte operations), bitreverse is not.
If compile-time is the only concern, then these are the numbers I get for enabling the bitreverse matching without any type size limit on CTMark: http://llvm-compile-time-tracker.com/compare.php?from=27f647d117087ca11959e232e6443f4aee31e966&to=b9042c581410835940a253af86646edeabfc858e&stat=instructions So this doesn't seem too problematic. Of course, there's the usual "death by a thousand cuts" to consider with InstCombine.
If we want to make things better rather than just not make things worse, we should run an experiment: move the whole bswap / bitreverse matching to aggressive-instcombine without the bitwidth limit.
That may require a pass manager adjustment to run aggressive-instcombine at -O2 to avoid perf regressions, but that could pay for itself by not running these costly matchers so many times when there's really no hope of finding a match.
Based on feedback (@nikic Are those timings still representative?) and my own local testing, I'm not seeing any really bad slowdowns due to matching for bitreverse patterns as well as byteswaps, so I've updated the patch without any i16 limitation.
If we're happy with this, I think we can then remove the bitreverse matching entirely from CodeGenPrepare.
Added a minor compile-time saving - reduced BitPartRecursionMaxDepth from 64 to 48, which was pretty high for a maximum bitwidth of 128.
LGTM, but from the llvm-xray traces of LLVM I've seen, IIRC matchBSwapOrBitReverse/collectBitParts is pretty high up in the list of cycle eaters.
OK, I'm going to reduce the traversal tree depth in a separate patch first, so that any nasty build-time regressions can be attributed specifically to the bitreverse matching.
To be noted, I don't have any specific issues with *this* patch, just with the transform as a whole.
I would maybe suggest dynamically computing BitPartRecursionMaxDepth based on the bitwidth, and perhaps changing it from a depth-based cut-off to a cut-off on the number of visited Values.
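A minimal sketch of that last suggestion (hypothetical code, not LLVM's actual collectBitParts): instead of a fixed recursion-depth cap, the walk spends a budget of distinct visited Values, so the cut-off tracks the work actually performed and can be derived from the bitwidth.

```cpp
#include <unordered_set>

// Hypothetical walker: gives up once the number of distinct values
// analyzed exceeds a budget (e.g. some multiple of the bitwidth),
// rather than cutting off at a fixed recursion depth.
struct BitPartWalker {
  unsigned Budget; // assumption: caller derives this from the bitwidth
  std::unordered_set<const void *> Visited;

  // Returns false when the value graph is too large to keep analyzing.
  bool visit(const void *V) {
    if (!Visited.insert(V).second)
      return true; // already analyzed: revisits cost nothing extra
    if (Visited.size() > Budget)
      return false; // budget exhausted: bail out of the match
    return true;
  }
};
```

Unlike a depth limit, this also charges for wide-but-shallow expression trees, which is where the matcher's cycle cost actually comes from.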