This is an archive of the discontinued LLVM Phabricator instance.

[LoopIntWrapPredication] Loop Integer Wrapping Predication Pass
Needs ReviewPublic

Authored by kachkov98 on Aug 19 2022, 12:57 AM.

Details

Summary

Motivation for the LoopIntWrapPredication pass
Consider the following example:

for (unsigned i = 0; i < N; ++i)
  for (unsigned j = 0; j < N; ++j)
    C[i*N+j] = foo();

With C pointer size = 64 bit and i, j variables with 32-bit size. According to the C standard, unsigned overflow is defined behavior, so i*N+j calculation will be done with 32-bit types, zero-extented to 64 bit and will be used as offset in GEP instruction. However, if we replace induction variables types to signed integer, this calculation will have nsw flag, and IndVarSimplify pass will be able to promote both induction variables to 64-bit types and get rid of sext inside hot loop. Since we can't do the same thing with unsigned variables, but it's quite common pattern in code, we try to do versioning of this loop: generate some runtime check that ensures that overflow will never occur, and set NUW flags on this chain of address calculation. We use scalar evolution to get possible range of this expression and insert loop-invariant condition that this range is not overflowing. To simplify pass, we don't do loop versioning directly, but inserting branching code for this calculation chain inside the loop, relying that this branch will be unswitched by the subsequent pass (because this branch has loop-invariant condition). This how it will look like in pseudo-code:

for (unsigned i = 0; i < N; ++i)
  for (unsigned j = 0; j < N; ++j)
    if (overflow(N*N))
      *(C + zext(i*N+j)) = foo();
    else
      *(C + zext(i*N+j /*nuw*/)) = foo();

After unswitching:

if (overflow(N*N))
  for (unsigned i = 0; i < N; ++i)
    for (unsigned j = 0; j < N; ++j)
      *(C + zext(i*N+j)) = foo();
else
  for (unsigned i = 0; i < N; ++i)
    for (unsigned j = 0; j < N; ++j)
      *(C + zext(i*N+j /*nuw*/)) = foo();

After IndVarSimplify:

if (overflow(N*N))
  for (unsigned i = 0; i < N; ++i)
    for (unsigned j = 0; j < N; ++j)
      *(C + zext(i*N+j)) = foo();
else
  for (uint64_t i = 0; i < N; ++i)
    for (uint64_t j = 0; j < N; ++j)
      *(C + (i*N+j)) = foo();

Results
This pass shows some good results in Coremark benchmark on our RISC-V hardware, increasing score on 18 %. It has exaclty the same pattern as described before: uses unsigned types for induction variables, that are never overflowed in runtime (this issue was also mentioned here: https://github.com/eembc/coremark/issues/22). Analyzing code of matrix functions (especially matrix_mul_matrix) shows that significant ammount of instructions inside innermost loop are doing this zero extension, but unsigned overflow of array index is never occured.

Diff Detail

Event Timeline

kachkov98 created this revision.Aug 19 2022, 12:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 19 2022, 12:57 AM
kachkov98 requested review of this revision.Aug 19 2022, 12:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 19 2022, 12:57 AM
kachkov98 edited the summary of this revision. (Show Details)Aug 19 2022, 1:20 AM
kachkov98 added a subscriber: anton-afanasyev.
kachkov98 edited the summary of this revision. (Show Details)Aug 19 2022, 2:02 AM

test-suite statistics:

Tests: 2428
Metric: loop-int-wrap-predication.NumPredicatedChains

Program                                       loop-int-wrap-predication.NumPredicatedChains                              
MultiSourc.../Applications/JM/ldecod/ldecod    30.00                                       
MultiSourc...chmarks/Prolangs-C/agrep/agrep    18.00                                       
MultiSourc.../Applications/JM/lencod/lencod    11.00                                       
MultiSourc...ch/consumer-lame/consumer-lame     7.00                                       
MultiSourc...e/Applications/ClamAV/clamscan     6.00                                       
MultiSourc...reeBench/fourinarow/fourinarow     5.00                                       
MultiSourc...Benchmarks/7zip/7zip-benchmark     4.00                                       
MultiSourc.../DOE-ProxyApps-C++/CLAMR/CLAMR     4.00                                       
MultiSourc...nch/mpeg2/mpeg2dec/mpeg2decode     2.00                                       
MultiSourc...arks/Rodinia/backprop/backprop     2.00                                       
MultiSourc...ch/consumer-jpeg/consumer-jpeg     1.00                                       
MultiSourc.../mediabench/jpeg/jpeg-6a/cjpeg     1.00                                       
MultiSourc...enchmarks/mafft/pairlocalalign     1.00
fhahn added a subscriber: fhahn.Aug 19 2022, 2:21 AM

test-suite statistics:

Tests: 2428
Metric: loop-int-wrap-predication.NumPredicatedChains

Program                                       loop-int-wrap-predication.NumPredicatedChains                              
MultiSourc.../Applications/JM/ldecod/ldecod    30.00                                       
MultiSourc...chmarks/Prolangs-C/agrep/agrep    18.00                                       
MultiSourc.../Applications/JM/lencod/lencod    11.00                                       
MultiSourc...ch/consumer-lame/consumer-lame     7.00                                       
MultiSourc...e/Applications/ClamAV/clamscan     6.00                                       
MultiSourc...reeBench/fourinarow/fourinarow     5.00                                       
MultiSourc...Benchmarks/7zip/7zip-benchmark     4.00                                       
MultiSourc.../DOE-ProxyApps-C++/CLAMR/CLAMR     4.00                                       
MultiSourc...nch/mpeg2/mpeg2dec/mpeg2decode     2.00                                       
MultiSourc...arks/Rodinia/backprop/backprop     2.00                                       
MultiSourc...ch/consumer-jpeg/consumer-jpeg     1.00                                       
MultiSourc.../mediabench/jpeg/jpeg-6a/cjpeg     1.00                                       
MultiSourc...enchmarks/mafft/pairlocalalign     1.00

Do you have also have numbers on impact of runtime performance on a wider range of benchmarks like SPEC & co?

craig.topper added inline comments.Aug 19 2022, 6:57 PM
llvm/lib/Transforms/Scalar/LoopIntWrapPredication.cpp
71

The description on line 32 mentions shl, but Shl isn't here

247

What guarantees the loop has a preheader? Placement in the pipeline would probably guarantee it but I'm not sure anything does when running the pass by itself. Unless I missed something.

249

This assert is probably unnecessary. A terminator is required for the Preheader to even be recognized as a predecessor of the loop.

276

proceed -> process?

kachkov98 edited the summary of this revision. (Show Details)

Review changes

kachkov98 marked 2 inline comments as done.Aug 22 2022, 1:03 AM
kachkov98 added inline comments.
llvm/lib/Transforms/Scalar/LoopIntWrapPredication.cpp
71

It looks like processing of shl is not profitable, since it's not handled by SimplifyIndVar (https://llvm.org/doxygen/SimplifyIndVar_8cpp_source.html#l01551)

247

Loop Pass Manager should run LoopSimplify pass, and this cannonical form ensures that loop has a preheader: https://llvm.org/docs/LoopTerminology.html#loop-simplify-form

neat! it'll be great if for loop bounds that don't overflow and can be inferred at compile time, the first version just get DCE'd.

neat! it'll be great if for loop bounds that don't overflow and can be inferred at compile time, the first version just get DCE'd.

From the experiment, if SCEV can be proven to has NUW, IndVarSimplifyPass already handles this case and successfully widens induction variable, so it makes sence to insert only non-tirivial runtime checks.

hiraditya added inline comments.Aug 23 2022, 7:33 AM
llvm/lib/Transforms/Scalar/LoopIntWrapPredication.cpp
86

What happens when Inst->use_empty() == true?

hiraditya added a comment.EditedAug 23 2022, 7:36 AM

Is there another induction variable optimization pass that this can be merged with?

kachkov98 added inline comments.Aug 23 2022, 7:45 AM
llvm/lib/Transforms/Scalar/LoopIntWrapPredication.cpp
86

We are checking that Inst has exactly one use (line 82)

craig.topper added inline comments.Aug 23 2022, 8:20 AM
llvm/lib/Transforms/Scalar/LoopIntWrapPredication.cpp
247

I agree that it works if run as part of the Loop Pass Manager. But if you run the pass standalone from the opt command line, it may not be in loop simplify form. The pass either needs to protect itself or require the LoopSimplify analysis.

kachkov98 added inline comments.Aug 23 2022, 8:37 AM
llvm/lib/Transforms/Scalar/LoopIntWrapPredication.cpp
247

I've tried to run only this pass from opt with -debug-pass-manager option, and this is the output:

Running pass: VerifierPass on [module]
Running analysis: VerifierAnalysis on [module]
Running analysis: InnerAnalysisManagerProxy<llvm::FunctionAnalysisManager, llvm::Module> on [module]
Running analysis: PreservedCFGCheckerAnalysis on foo
Running pass: LoopSimplifyPass on foo (14 instructions)
Running analysis: LoopAnalysis on foo
Running analysis: DominatorTreeAnalysis on foo
Running analysis: AssumptionAnalysis on foo
Running analysis: TargetIRAnalysis on foo
Running pass: LCSSAPass on foo (14 instructions)
Running analysis: AAManager on foo
Running analysis: TargetLibraryAnalysis on foo
Running analysis: BasicAA on foo
Running analysis: ScopedNoAliasAA on foo
Running analysis: TypeBasedAA on foo
Running analysis: OuterAnalysisManagerProxy<llvm::ModuleAnalysisManager, llvm::Function> on foo
Running analysis: ScalarEvolutionAnalysis on foo
Running analysis: InnerAnalysisManagerProxy<llvm::LoopAnalysisManager, llvm::Function> on foo
Running pass: LoopIntWrapPredicationPass on Loop at depth 1 containing: %for.body<header><latch><exiting>
Invalidating analysis: PreservedCFGCheckerAnalysis on foo
Invalidating analysis: VerifierAnalysis on [module]
Running pass: VerifierPass on [module]
Running analysis: VerifierAnalysis on [module]
Running pass: PrintModulePass on [module]

So LoopSimplify pass is launched even in standalone mode as a part of loop pass manager pipeline.

kachkov98 updated this revision to Diff 454854.Aug 23 2022, 8:38 AM

More fixes

Statistics with SPEC2017:

Metric: loop-int-wrap-predication.NumPredicatedChains,loop-int-wrap-predication.NumProcessedLoops

Program                                                                         loop-int-wrap-predication.NumPredicatedChains loop-int-wrap-predication.NumProcessedLoops
                   test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  30.00                                         13.00                                     
              test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test  26.00                                         20.00                                     
             test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test  26.00                                         20.00                                     
               test-suite :: MultiSource/Benchmarks/Prolangs-C/agrep/agrep.test  18.00                                         10.00                                     
       test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test  18.00                                         16.00                                     
                   test-suite :: MultiSource/Applications/JM/lencod/lencod.test  11.00                                         11.00                                     
  test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   7.00                                          7.00                                     
                    test-suite :: MultiSource/Applications/ClamAV/clamscan.test   6.00                                          6.00                                     
      test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test   5.00                                          5.00                                     
           test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   4.00                                          4.00                                     
            test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   4.00                                          4.00                                     
        test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test   4.00                                          4.00                                     
                  test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test   4.00                                          3.00                                     
  test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test   4.00                                          4.00                                     
 test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test   4.00                                          4.00                                     
test-suite :: MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode.test   2.00                                          2.00                                     
            test-suite :: MultiSource/Benchmarks/Rodinia/backprop/backprop.test   2.00                                          2.00                                     
               test-suite :: External/SPEC/CINT2017speed/657.xz_s/657.xz_s.test   1.00                                          1.00                                     
                test-suite :: External/SPEC/CINT2017rate/557.xz_r/557.xz_r.test   1.00                                          1.00                                     
                 test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test   1.00                                          1.00                                     
        test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test   1.00                                          1.00                                     
  test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test   1.00                                          1.00

I've checked that all tests (including SPEC) are now passing on x86 machine. Getting reliable performance results on RISC-V will take some time - will try to update this information soon.

Isn;t this pattern handled by LoopFlatten? cc @SjoerdMeijer

Isn;t this pattern handled by LoopFlatten? cc @SjoerdMeijer

Yes, it's a very similar pattern, but the goal of this pass is different: the insertion of runtime overflow check to simplify code under condition when this doesn't happen. Even after LoopFlatten the resulted SCEV can't be proven to not overflow at compile-time.

Is there another induction variable optimization pass that this can be merged with?

I haven't found any other passes which can benefit from this transformation. In general, the idea looks similar to LICMLoopVersioning, but in that pass runtime checks are inserted to relax pointer aliasing restrictions, and it was hard to reuse some code from it.

yakush added a subscriber: yakush.Sep 23 2022, 11:55 AM
kachkov98 updated this revision to Diff 481188.EditedDec 8 2022, 1:04 AM

Rebased on ToT
Add predication on separate conditions
Add simple cost model (do not predicate big loops)

Compile-time results on llvm-test-suite + SPEC2006 + SPEC2017:

Program                                                                         loop-int-wrap-predication.NumPredicatedChains

         test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2029.00                                      
                 test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test  250.00                                      
              test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   24.00                                      
             test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   24.00                                      
                    test-suite :: SingleSource/UnitTests/matrix-types-spec.test   17.00                                      
                      test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test   12.00                                      
                   test-suite :: MultiSource/Applications/JM/lencod/lencod.test    8.00                                      
  test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test    7.00                                      
 test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test    7.00                                      
       test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test    7.00                                      
          test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test    7.00                                      
                  test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test    7.00                                      
                    test-suite :: MultiSource/Applications/ClamAV/clamscan.test    6.00                                      
            test-suite :: SingleSource/UnitTests/Vectorizer/runtime-checks.test    6.00                                      
               test-suite :: MultiSource/Benchmarks/Prolangs-C/agrep/agrep.test    6.00                                      
      test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test    5.00                                      
  test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test    5.00                                      
              test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test    4.00                                      
        test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test    4.00                                      
                          test-suite :: MultiSource/Benchmarks/PAQ8p/paq8p.test    4.00                                      
            test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test    3.00                                      
           test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test    3.00                                      
            test-suite :: MultiSource/Benchmarks/Rodinia/backprop/backprop.test    2.00                                      
 test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test    2.00                                      
                test-suite :: SingleSource/Benchmarks/Misc-C++/oopack_v1p8.test    2.00                                      
  test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test    2.00                                      
test-suite :: MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode.test    2.00                                      
                 test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test    1.00                                      
               test-suite :: External/SPEC/CINT2017speed/657.xz_s/657.xz_s.test    1.00                                      
                test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test    1.00                                      
                test-suite :: External/SPEC/CINT2017rate/557.xz_r/557.xz_r.test    1.00                                      
  test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test    1.00                                      
        test-suite :: External/SPEC/CINT2006/462.libquantum/462.libquantum.test    1.00                                      
        test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test    1.00                                      
         test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test    1.00

Big number of cases in 510.parest_r is due to template instantiations of almost the same code (sparse_matrix class methods). In this benchmark (on ref data), I've observed 1.7 % improvement on T-head RVB-ICE RISC-V board with XuanTie C910 core (before: 3 090 706 ms, after: 3 038 376 ms). Results on SPEC2006:

Benchmark	baseline, ms	optimized, ms	diff
400.perlbench	1 493 603,64	1 497 198,00	100,24%
401.bzip2	1 905 760,43	1 903 672,13	99,89%
403.gcc	        1 495 701,76	1 492 812,78	99,81%
429.mcf	        2 009 662,24	2 018 930,79	100,46%
445.gobmk	1 502 378,31	1 495 364,06	99,53%
456.hmmer	1 900 147,45	1 860 074,98	97,89%
458.sjeng	1 837 567,39	1 839 056,42	100,08%
462.libquantum	1 171 120,66	1 156 115,71	98,72%
464.h264ref	2 386 369,44	2 359 021,70	98,85%
471.omnetpp	1 688 319,54	1 685 899,41	99,86%
473.astar	1 670 467,01	1 661 233,42	99,45%
483.xalancbmk	1 450 736,57	1 463 321,58	100,87%

Ping (this pass is mostly profitable for RISC-V target, because zext on this target has non-zero cost, and the intention is to report Coremark results for RISC-V with this optimization enabled, like with DFA jump threading)