This is an archive of the discontinued LLVM Phabricator instance.

[LoopIdiom] 'logical right-shift until zero' ('count active bits') "on steroids" idiom recognition.
ClosedPublic

Authored by lebedev.ri on May 8 2021, 3:17 PM.

Details

Summary

I think i've added exhaustive test coverage, and i have verified that alive2 is happy with all the tests,
so in principle i'm fine with landing this without review, but just in case..

This adds support for the "count active bits" pattern, i.e.:

int countActiveBits(unsigned val) {
    int cnt = 0;
    for( ; (val >> cnt) != 0; ++cnt)
        ;
    return cnt;
}

but a somewhat more general one, since that is what i need:

int countActiveBits(unsigned val, int start, int off) {
    int cnt;
    for (cnt = start; val >> (cnt + off); cnt++)
        ;
    return cnt;
}

I've followed in footstep of 'left-shift until bittest' idiom (D91038),
in the sense that iff the ctlz intrinsic is cheap, we'll transform,
regardless of all other factors.

This can have a shocking effect on certain benchmarks:

raw.pixls.us-unique/Olympus/XZ-1$ /repositories/googlebenchmark/tools/compare.py -a benchmarks ~/rawspeed/build-{old,new}/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_min_time=0.00000001 --benchmark_repetitions=128 p1319978.orf
RUNNING: /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_min_time=0.00000001 --benchmark_repetitions=128 p1319978.orf --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmp49_28zcm
2021-05-09T01:06:05+03:00
Running /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench
Run on (32 X 3600.24 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 5.26, 6.29, 3.49
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations  CPUTime,s CPUTime/WallTime     Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
p1319978.orf/threads:32/process_time/real_time_mean          145 ms          145 ms          128   0.145319         0.999981   10.1568M       69.8949M        69.8936M      6.88159       6.88146   0.145322
p1319978.orf/threads:32/process_time/real_time_median        145 ms          145 ms          128   0.145317         0.999986   10.1568M       69.8941M        69.8931M      6.88151       6.88141   0.145319
p1319978.orf/threads:32/process_time/real_time_stddev      0.766 ms        0.766 ms          128   766.586u         15.1302u          0       354.167k        354.098k    0.0348699     0.0348631   766.469u
RUNNING: /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_min_time=0.00000001 --benchmark_repetitions=128 p1319978.orf --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmpwb9sw2x0
2021-05-09T01:06:24+03:00
Running /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
Run on (32 X 3599.95 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 4.05, 5.95, 3.43
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations  CPUTime,s CPUTime/WallTime     Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
p1319978.orf/threads:32/process_time/real_time_mean         99.8 ms         99.8 ms          128  0.0997758         0.999972   10.1568M       101.797M        101.794M      10.0225       10.0222  0.0997786
p1319978.orf/threads:32/process_time/real_time_median       99.7 ms         99.7 ms          128  0.0997165         0.999985   10.1568M       101.857M        101.854M      10.0284       10.0281  0.0997195
p1319978.orf/threads:32/process_time/real_time_stddev      0.224 ms        0.224 ms          128   224.166u          34.345u          0        226.81k        227.231k    0.0223309     0.0223723   224.586u
Comparing /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench to /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
Benchmark                                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------
p1319978.orf/threads:32/process_time/real_time_pvalue                 0.0000          0.0000      U Test, Repetitions: 128 vs 128
p1319978.orf/threads:32/process_time/real_time_mean                  -0.3134         -0.3134           145           100           145           100
p1319978.orf/threads:32/process_time/real_time_median                -0.3138         -0.3138           145           100           145           100
p1319978.orf/threads:32/process_time/real_time_stddev                -0.7073         -0.7078             1             0             1             0

Diff Detail

Event Timeline

lebedev.ri created this revision.May 8 2021, 3:17 PM
lebedev.ri requested review of this revision.May 8 2021, 3:17 PM
lebedev.ri updated this revision to Diff 343871.May 8 2021, 3:23 PM

Actually add changed tests.

Hmm, looks like i might also need at least the "[shifty] count leading zeros" variation of this.

craig.topper added inline comments.May 11 2021, 3:48 PM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
2605

This should check SuccessorBB?

lebedev.ri marked an inline comment as done.

@craig.topper thank you for taking a look!
Nit addressed.

zhuhan0 added inline comments.May 12 2021, 11:49 AM
llvm/test/Transforms/LoopIdiom/X86/logical-right-shift-until-zero.ll
9–65

I could be wrong but would this mis-compile if %nbits results in unsigned overflow? For example,

%val = 0x10000000
%start = 1
%extraoffset.= 255

Loop trip count is 8 before transformation but 1 after.

lebedev.ri marked an inline comment as done.May 12 2021, 12:00 PM
lebedev.ri added inline comments.
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
2605

Indeed. Same with the previous transform.

llvm/test/Transforms/LoopIdiom/X86/logical-right-shift-until-zero.ll
9–65

Could you please specify, for which bitwidth your counterexample is?
I'm going to guess i32, so we have

%iv = i32 1
%nbits = i32 256
%val.shifted = lshr i32 %val, 256

We then navigate to https://llvm.org/docs/LangRef.html#lshr-instruction

If op2 is (statically or dynamically) equal to or larger than the number of bits in op1, this instruction returns a poison value.

So i'm not seeing a miscompile.

As i have said, i've verified each of the tests here with alive2, and they are all fine.

zhuhan0 added inline comments.May 12 2021, 12:22 PM
llvm/test/Transforms/LoopIdiom/X86/logical-right-shift-until-zero.ll
9–65

8 bits.

%start = i8 1
%extraoffset = i8 255  ; unsigned

Sorry I'm not familiar with alive2 so forgive me if I'm questioning something that's obviously proven to be correct.

lebedev.ri marked an inline comment as done.May 12 2021, 12:36 PM
lebedev.ri added inline comments.
llvm/test/Transforms/LoopIdiom/X86/logical-right-shift-until-zero.ll
9–65

Then i do not understand what %val = 0x10000000 means.
What's the decimal value of %val?

lebedev.ri marked an inline comment as done.May 12 2021, 2:54 PM
lebedev.ri added inline comments.
llvm/test/Transforms/LoopIdiom/X86/logical-right-shift-until-zero.ll
9–65

Ah, you mean %val = 128.
So we have

$ cat /tmp/test.ll 
declare void @escape_inner(i8, i8, i8, i1, i8)
declare void @escape_outer(i8, i8, i8, i1, i8)

define i8 @p0() {
entry:
  br label %loop

loop:
  %iv = phi i8 [ 1, %entry ], [ %iv.next, %loop ]
  %nbits = add nsw i8 %iv, 255
  %val.shifted = lshr i8 128, %nbits
  %val.shifted.iszero = icmp eq i8 %val.shifted, 0
  %iv.next = add i8 %iv, 1

  call void @escape_inner(i8 %iv, i8 %nbits, i8 %val.shifted, i1 %val.shifted.iszero, i8 %iv.next)

  br i1 %val.shifted.iszero, label %end, label %loop

end:
  %iv.res = phi i8 [ %iv, %loop ]
  %nbits.res = phi i8 [ %nbits, %loop ]
  %val.shifted.res = phi i8 [ %val.shifted, %loop ]
  %val.shifted.iszero.res = phi i1 [ %val.shifted.iszero, %loop ]
  %iv.next.res = phi i8 [ %iv.next, %loop ]

  call void @escape_outer(i8 %iv.res, i8 %nbits.res, i8 %val.shifted.res, i1 %val.shifted.iszero.res, i8 %iv.next.res)

  ret i8 %iv.res
}
$ ./bin/opt -loop-idiom -mtriple=x86_64 -mcpu=core-avx2 -o - -S /tmp/test.ll | ./bin/opt -O3 -S -o - -
; ModuleID = '<stdin>'
source_filename = "/tmp/test.ll"
target triple = "x86_64"

declare void @escape_inner(i8, i8, i8, i1, i8) local_unnamed_addr #0

declare void @escape_outer(i8, i8, i8, i1, i8) local_unnamed_addr #0

define i8 @p0() local_unnamed_addr #0 {
entry:
  tail call void @escape_inner(i8 1, i8 0, i8 -128, i1 false, i8 2)
  tail call void @escape_inner(i8 2, i8 1, i8 64, i1 false, i8 3)
  tail call void @escape_inner(i8 3, i8 2, i8 32, i1 false, i8 4)
  tail call void @escape_inner(i8 4, i8 3, i8 16, i1 false, i8 5)
  tail call void @escape_inner(i8 5, i8 4, i8 8, i1 false, i8 6)
  tail call void @escape_inner(i8 6, i8 5, i8 4, i1 false, i8 7)
  tail call void @escape_inner(i8 7, i8 6, i8 2, i1 false, i8 8)
  tail call void @escape_inner(i8 8, i8 7, i8 1, i1 false, i8 9)
  tail call void @escape_inner(i8 9, i8 8, i8 undef, i1 true, i8 10)
  tail call void @escape_outer(i8 9, i8 8, i8 undef, i1 true, i8 10)
  ret i8 9
}

attributes #0 = { "target-cpu"="core-avx2" }

@escape_inner() is called 9 times, not 1.

zhuhan0 accepted this revision.May 13 2021, 1:02 PM

LGTM.

llvm/test/Transforms/LoopIdiom/X86/logical-right-shift-until-zero.ll
9–65

Thanks for the test.

This revision is now accepted and ready to land.May 13 2021, 1:02 PM

@craig.topper ?

LGTM.

Thanks for taking a look!

Hmm, looks like i might also need at least the "[shifty] count leading zeros" variation of this.

... and i guess it might be an extension of recognizeShiftUntilBitTest().

LGTM

Thank you for the review!

I plan to hopefully look into the next idiom soon after.

This revision was landed with ongoing or failed builds.May 17 2021, 10:34 AM
This revision was automatically updated to reflect the committed changes.

Hi,

THere seems to be something problematic with this patch.
With

opt -enable-new-pm=0 -o /dev/null bbi-56362_2.ll -loop-idiom

I get

Instruction does not dominate all uses!
  %1 = load i16, i16* getelementptr inbounds ({ i16, i16 }, { i16, i16 }* @v_92, i32 0, i32 0), align 1
  %.numleadingzeros = call i16 @llvm.ctlz.i16(i16 %1, i1 false)
in function main

and with

opt -enable-new-pm=1 -o /dev/null bbi-56362_2.ll -loop-idiom

I get

opt: ../lib/Transforms/Scalar/LoopPassManager.cpp:253: auto llvm::FunctionToLoopPassAdaptor::run(llvm::Function &, llvm::FunctionAnalysisManager &)::(anonymous class)::operator()(llvm::StringRef, llvm::Any) const: Assertion `L->isRecursivelyLCSSAForm(LAR.DT, LI) && "Loops must remain in LCSSA form!"' failed.