This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Dropping redundant masking before left-shift [0/5] (PR42563)
ClosedPublic

Authored by lebedev.ri on Jul 10 2019, 10:08 AM.

Details

Summary

If we have some pattern that leaves only some low bits set, and then performs a
left-shift of those bits, and none of the bits that remain after the final
shift are modified by the mask, then we can omit the mask.

There are many variants to this pattern:
a. (x & ((1 << MaskShAmt) - 1)) << ShiftShAmt
All these patterns can be simplified to just:
x << ShiftShAmt
iff:
a. (MaskShAmt+ShiftShAmt) u>= bitwidth(x)

alive proof:
a: https://rise4fun.com/Alive/wi9

Indeed, not all of these patterns are canonical.
But since this fold will only ever produce a single instruction,
I'm really interested in handling even the non-canonical patterns:
I have this general kind of pattern in hotpaths,
and it is not totally outlandish for bit-twiddling code.

For now, let's start with the patterns where both shift amounts are variable,
with a trivial constant "offset" between them, since I believe this is
both the simplest to handle and the most common case.
But again, there are likely other variants where we could use
ValueTracking/ConstantRange to handle more cases.

https://bugs.llvm.org/show_bug.cgi?id=42563

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision.Jul 10 2019, 10:08 AM
lebedev.ri edited the summary of this revision. (Show Details)

Rebased over re-combed tests, NFC.

lebedev.ri marked 2 inline comments as done.Jul 12 2019, 2:28 PM
lebedev.ri added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
73 ↗(On Diff #209047)

Been thinking about this: while the case where ShiftShAmt is a constant will already work,
the case where the mask is a constant is a separate case:
https://rise4fun.com/Alive/mc8
Will add that in a follow-up patch.

103–104 ↗(On Diff #209047)

It may be possible to use SimplifyICmpInst() here to catch more cases; I may look into that after this patchset.

Ping.
I'm perfectly happy to do trivial follow-up patches in post-commit review mode,
but I feel the base patch/approach should get a review first.

xbolva00 accepted this revision.Jul 16 2019, 5:41 AM

Looks fine

llvm/test/Transforms/InstCombine/redundant-left-shift-input-masking-variant-a.ll
343 ↗(On Diff #209047)

What about 'exact'?

This revision is now accepted and ready to land.Jul 16 2019, 5:41 AM
lebedev.ri marked an inline comment as done.Jul 16 2019, 6:32 AM

@spatel any comments?

llvm/test/Transforms/InstCombine/redundant-left-shift-input-masking-variant-a.ll
343 ↗(On Diff #209047)

In all these tests the last instruction is shl, not lshr; shl can only carry nuw, nsw, or both.
In any case, the new shift must not have any of these flags, because now that we have dropped the masking,
we no longer know that we won't shift out non-zero bits.
alive obviously agrees:
https://rise4fun.com/Alive/NDG
https://rise4fun.com/Alive/sliGT
https://rise4fun.com/Alive/TWI5
https://rise4fun.com/Alive/EiG
https://rise4fun.com/Alive/Fsly

xbolva00 added inline comments.Jul 16 2019, 6:37 AM
llvm/test/Transforms/InstCombine/redundant-left-shift-input-masking-variant-a.ll
343 ↗(On Diff #209047)

Thanks

spatel added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
81–83 ↗(On Diff #209047)

assert on opcode and assign operands rather than using match - the input must be 'shl'?

103–104 ↗(On Diff #209047)

Using instsimplify like that (or as currently in this patch with SimplifyBinOp) is where we go into the InstCombine gray area of, "Is it worth the cost?"

With InstSimplify, we're using a potentially expensive/recursive analysis (similar to ValueTracking), and we know that's where InstCombine spends a lot of time.

cc'ing @efriedma for further thoughts.

108 ↗(On Diff #209047)

Add a code comment similar to what you explained here in the review:
"We no longer know that we won't shift-out non-0 bits."

lebedev.ri added inline comments.Jul 16 2019, 8:40 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
103–104 ↗(On Diff #209047)

I'm only seeing two big alternatives here:
a) not doing these kinds of optimizations at all. I sure hope this isn't the answer :)
b) pattern-match proliferation - basically reimplementing, right here, every interesting fold that we'd otherwise get via SimplifyBinOp. That's code bloat and wasted dev time; I'd rather not.

These patterns can come up in bit-stream manipulation codepaths (think llvm/lib/Bitcode),
and those may end up being the hotpath, so it's good not to have the redundant masking there.
I don't have exact numbers at hand, but as a ballpark, once all these shift optimizations are in
(including those I haven't done yet), I'm hoping to reclaim* -5%..-25% of runtime.

As a middle-ground, can the recursion depth be specified when calling SimplifyBinOp()?
I'm not seeing any cleverer fix here.

*reclaim - when changing this buffer abstraction from

return (cache >> (fillLevel - count)) & ((1U << count) - 1U);

to

return (cache >> (32 - count));

that is the perf regression I'm getting; and looking at the crude IR in the regressed benchmarks,
I think I will reclaim some of that loss.

spatel added inline comments.Jul 16 2019, 9:21 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
103–104 ↗(On Diff #209047)

I agree - I'd like to have the optimization. I'm only acknowledging our long-standing problem with instcombine.

There's 1 more possibility: move this to AggressiveInstCombine (which is currently only enabled at -O3...).

lebedev.ri added a comment.EditedJul 16 2019, 10:04 AM

To follow up on the inline comment - right now (llvm master vs. llvm master with this patchset; rawspeed develop branch with no patches on top) this fold happens once:

$ /repositories/llvm-test-suite/utils/compare.py -m instcombine.MasksDroped /builddirs/llvm-project/build-llvm-test-suite-{old,new}/results.json
/repositories/llvm-test-suite/utils/compare.py:109: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  d = pd.concat(datasets, axis=0, names=['run'], keys=datasetnames)
Tests: 198
Metric: instcombine.MasksDroped

Program                                        results results0 diff 
test-suite :: RawSpeed/RawSpeed.test           NaN      1.00    nan%
test-suite...AllocatorAdaptorBenchmark.test    NaN     NaN      nan%
test-suite...lateDecompressorBenchmark.test    NaN     NaN      nan%
test-suite...sRawInterpolatorBenchmark.test    NaN     NaN      nan%
test-suite...eed/io/BitStreamBenchmark.test    NaN     NaN      nan%
test-suite...a/CameraMetaDataBenchmark.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-ArwDecoder.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-Cr2Decoder.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-DcrDecoder.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-DcsDecoder.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-DngDecoder.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-ErfDecoder.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-IiqDecoder.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-KdcDecoder.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-MefDecoder.test    NaN     NaN      nan%
Geomean difference                                              nan%
       results  results0  diff
count  0.0      1.0       0.0 
mean  NaN       1.0      NaN  
std   NaN      NaN       NaN  
min   NaN       1.0      NaN  
25%   NaN       1.0      NaN  
50%   NaN       1.0      NaN  
75%   NaN       1.0      NaN  
max   NaN       1.0      NaN  
/builddirs/llvm-project/build-llvm-test-suite-new$ grep -r "instcombine.MasksDroped"
RawSpeed/build/CMakeFiles/rawspeed.dir/src/librawspeed/decompressors/SamsungV0Decompressor.stats:       "instcombine.MasksDroped": 1,
results.json:        "instcombine.MasksDroped": 1.0,

I'm expecting that this number will be better in the end, when all the bits are in place.

While we're here: the previous fold (+reassociateShiftAmtsOfTwoSameDirectionShifts(), D63812) fires more often:

$ /repositories/llvm-test-suite/utils/compare.py -m instcombine.ShiftsCombined /builddirs/llvm-project/build-llvm-test-suite-{old,new}/results.json
Tests: 198
Metric: instcombine.ShiftsCombined

Program                                        results results0 diff 
test-suite :: RawSpeed/RawSpeed.test           26.00   26.00    0.0%
test-suite...AllocatorAdaptorBenchmark.test    NaN     NaN      nan%
test-suite...lateDecompressorBenchmark.test    NaN     NaN      nan%
test-suite...sRawInterpolatorBenchmark.test    NaN     NaN      nan%
test-suite...eed/io/BitStreamBenchmark.test    NaN     NaN      nan%
test-suite...a/CameraMetaDataBenchmark.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-ArwDecoder.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-Cr2Decoder.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-DcrDecoder.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-DcsDecoder.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-DngDecoder.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-ErfDecoder.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-IiqDecoder.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-KdcDecoder.test    NaN     NaN      nan%
test-suite...fDecoderFuzzer-MefDecoder.test    NaN     NaN      nan%
Geomean difference                                              nan%
       results  results0  diff
count  1.0      1.0       1.0 
mean   26.0     26.0      0.0 
std   NaN      NaN       NaN  
min    26.0     26.0      0.0 
25%    26.0     26.0      0.0 
50%    26.0     26.0      0.0 
75%    26.0     26.0      0.0 
max    26.0     26.0      0.0 

The performance implications of this patchset are as I expected:

raw.pixls.us-unique/Samsung/NX30$ //usr/src/googlebenchmark/tools/compare.py -a benchmarks /builddirs/llvm-project/build-llvm-test-suite-{old,new}/RawSpeed/build/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_repetitions=128 --benchmark_min_time=0.00000001 2015-03-07-163604_sam_7204.srw 
RUNNING: /builddirs/llvm-project/build-llvm-test-suite-old/RawSpeed/build/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_repetitions=128 --benchmark_min_time=0.00000001 2015-03-07-163604_sam_7204.srw --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmp3aOcOJ
2019-07-16 19:55:51
Running /builddirs/llvm-project/build-llvm-test-suite-old/RawSpeed/build/src/utilities/rsbench/rsbench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 0.88, 0.76, 1.53
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                       Time             CPU   Iterations  CPUTime,s CPUTime/WallTime     Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2015-03-07-163604_sam_7204.srw/threads:1/process_time/real_time_mean          136 ms          136 ms          128   0.135973          0.99989   20.5978M       151.485M        151.468M      7.35441        7.3536   0.135988
2015-03-07-163604_sam_7204.srw/threads:1/process_time/real_time_median        136 ms          136 ms          128   0.135954                1   20.5978M       151.507M        151.507M      7.35546       7.35547   0.135953
2015-03-07-163604_sam_7204.srw/threads:1/process_time/real_time_stddev      0.212 ms        0.193 ms          128   193.542u         237.857u          0       215.294k        236.262k    0.0104522     0.0114703   212.466u
RUNNING: /builddirs/llvm-project/build-llvm-test-suite-new/RawSpeed/build/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_repetitions=128 --benchmark_min_time=0.00000001 2015-03-07-163604_sam_7204.srw --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmpWIQdvn
2019-07-16 19:56:10
Running /builddirs/llvm-project/build-llvm-test-suite-new/RawSpeed/build/src/utilities/rsbench/rsbench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 0.92, 0.78, 1.52
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                       Time             CPU   Iterations  CPUTime,s CPUTime/WallTime     Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2015-03-07-163604_sam_7204.srw/threads:1/process_time/real_time_mean          131 ms          131 ms          128   0.131231         0.999992   20.5978M       156.959M        156.958M      7.62015        7.6201   0.131232
2015-03-07-163604_sam_7204.srw/threads:1/process_time/real_time_median        131 ms          131 ms          128   0.131175                1   20.5978M       157.026M        157.024M      7.62343        7.6233   0.131177
2015-03-07-163604_sam_7204.srw/threads:1/process_time/real_time_stddev      0.218 ms        0.218 ms          128   217.942u         33.6202u          0       259.861k        259.966k    0.0126159      0.012621   218.033u
Comparing /builddirs/llvm-project/build-llvm-test-suite-old/RawSpeed/build/src/utilities/rsbench/rsbench to /builddirs/llvm-project/build-llvm-test-suite-new/RawSpeed/build/src/utilities/rsbench/rsbench
Benchmark                                                                                Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
2015-03-07-163604_sam_7204.srw/threads:1/process_time/real_time_pvalue                 0.0000          0.0000      U Test, Repetitions: 128 vs 128
2015-03-07-163604_sam_7204.srw/threads:1/process_time/real_time_mean                  -0.0350         -0.0349           136           131           136           131
2015-03-07-163604_sam_7204.srw/threads:1/process_time/real_time_median                -0.0351         -0.0352           136           131           136           131
2015-03-07-163604_sam_7204.srw/threads:1/process_time/real_time_stddev                +0.0260         +0.1253             0             0             0             0

The -4% improvement is notable.

With InstSimplify, we're using a potentially expensive/recursive analysis (similar to ValueTracking), and we know that's where InstCombine spends a lot of time.

cc'ing @efriedma for further thoughts.

There's 1 more possibility: move this to AggressiveInstCombine (which is currently only enabled at -O3...).

Note that before we even get to the SimplifyBinOp() we first need to match 3 or 4 instructions,
which is stricter than the previously-added reassociateShiftAmtsOfTwoSameDirectionShifts() fold.
@efriedma If retargeting *this* fold into AggressiveInstCombine is the only acceptable way forward,
please let me know, to avoid stalling forward progress.

As discussed at http://lists.llvm.org/pipermail/llvm-dev/2017-March/111257.html, instcombine is probably doing too much work in general given the number of times it's run. But I don't think this specific use of instsimplify here is likely to be a problem in practice; given the pattern guard, it should trigger very rarely.

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
96 ↗(On Diff #209047)

SimplifyAddInst?

lebedev.ri marked 3 inline comments as done.Jul 18 2019, 3:18 PM

@xbolva00, @spatel, @efriedma thank you for the review!

As discussed at http://lists.llvm.org/pipermail/llvm-dev/2017-March/111257.html, instcombine is probably doing too much work in general given the number of times it's run.

Yep, I agree. Compile time is important.

But I don't think this specific use of instsimplify here is likely to be a problem in practice; given the pattern guard, it should trigger very rarely.

Okay then, thank you for taking a look!
I will then proceed with this design (keeping it in instcombine).

Will rebase this, address the inline notes, and land this set - the follow-up patches are
trivial extensions that simply teach it about more similar patterns.
If anyone wants to review those - be my guest, but I'm not expecting any surprises there.

lebedev.ri marked 5 inline comments as done.

Rebased, addressed review notes.

This revision was automatically updated to reflect the committed changes.