This is an archive of the discontinued LLVM Phabricator instance.

[LoopSimplify] Convert loop with multiple latches to nested loop using dominator tree
Changes PlannedPublic

Authored by jaykang10 on Jul 9 2021, 7:45 AM.

Details

Summary

With certain condition of phi node, LoopSimplify pass converts loop with multiple latches to a nested loop as below.

// Loop with multiple latches
//
//             |     /----------------\
//    +--------v----v------+          |
//    |      header        <---\      |
//    +--------------------+   |      |
//             |               |      |
//    +--------v-----------+   |      |
//    |      latch1        >---/      |
//    +--------------------+          |
//             |                      |
//    +--------v-----------+          |
//    |      latch2        >----------/
//    +--------------------+
//             |
//==============================>
//
// A nested loop
//
//             |     /----------------\
//    +-------------v------+          |
//    |   outer header     |          |
//    +--------------------+          |
//             |                      |
//    +--------v----v------+          |
//    |   inner header     <---\      |
//    +--------------------+   |      |
//             |               |      |
//    +--------v-----------+   |      |
//    |      latch1        >---/      |
//    +--------------------+          |
//             |                      |
//    +--------v-----------+          |
//    |      latch2        >----------/
//    +--------------------+
//             |

This patch enables above transformation more using dominator tree. If one latch dominates other latches, the latch can be inner loop's one.

Diff Detail

Event Timeline

jaykang10 created this revision.Jul 9 2021, 7:45 AM
jaykang10 requested review of this revision.Jul 9 2021, 7:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 9 2021, 7:45 AM
jaykang10 updated this revision to Diff 357886.Jul 12 2021, 3:48 AM

Fixed regressions more.

If you feel the tests are modified weirdly, please let me know.

The transforms here is clearly correct... the part I'm not sure about is whether it's profitable in general. I'm particularly worried about cases where we make PHI nodes in the outer loop more difficult to analyze. Have you don't any experiments to try to determine the performance impact?

The transforms here is clearly correct... the part I'm not sure about is whether it's profitable in general. I'm particularly worried about cases where we make PHI nodes in the outer loop more difficult to analyze. Have you don't any experiments to try to determine the performance impact?

I have checked the performance number from below benchmarks.

llvm-test-suite for x86

Tests: 2939
Short Running: 2374 (filtered out)
Remaining: 565
Metric: exec_time

Program                                        results.org results.multi.latches diff 
 test-suite...emCmp<8, GreaterThanZero, Mid>   601.59      1056.69               75.7%
 test-suite...emCmp<15, LessThanZero, First>   299.16      517.11                72.9%
 test-suite...Cmp<8, GreaterThanZero, First>   577.34      902.53                56.3%
 test-suite...Cmp<31, GreaterThanZero, None>   410.89      600.97                46.3%
 test-suite...MemCmp<15, LessThanZero, None>   453.93      658.92                45.2%
 test-suite...aw.test:BM_MAT_X_MAT_RAW/44217   391924.61   536335.61             36.8%
 test-suite...sCRaw.test:BM_HYDRO_2D_RAW/171     8.22       10.95                33.2%
 test-suite...Cmp<31, GreaterThanZero, Last>   420.00      553.94                31.9%
 test-suite...t:BM_MemCmp<31, EqZero, First>   155.73      204.14                31.1%
 test-suite...Source/Benchmarks/sim/sim.test     2.12        2.78                30.7%
 test-suite...lications/sqlite3/sqlite3.test     1.70        2.12                25.0%
 test-suite....test:BENCHMARK_HARRIS/512/512   2272.75     2806.47               23.5%
 test-suite...tions/lambda-0.1.3/lambda.test     2.09        2.57                22.6%
 test-suite...s-C/Pathfinder/PathFinder.test     1.66        2.02                21.9%
 test-suite...C/Packing-flt/Packing-flt.test     1.64        2.00                21.7%
 Geomean difference                                                               nan%
         results.org  results.multi.latches        diff
count  564.000000     562.000000             561.000000
mean   2062.547024    2174.176751            0.007214  
std    23151.613672   25917.675478           0.091953  
min    0.614800       0.616300              -0.490838  
25%    2.715425       2.744382              -0.007251  
50%    98.130479      98.001753             -0.000128  
75%    517.301815     525.848058             0.010050  
max    391924.609000  536335.615000          0.756508
SPEC2006 for AArch64

Benchmark		diff(%)
400.perlbench		0.361501846
401.bzip2		0.279018415
403.gcc		-0.565577003
429.mcf		-2.208262958
445.gobmk		-0.31082178
456.hmmer		-0.467206165
458.sjeng		-0.235398032
462.libquantum		-3.935842025
464.h264ref		0.057919804
471.omnetpp		-1.779197968
473.astar		-0.518484615
483.xalancbmk		-0.36306218
SPEC2017 for AArch64

Benchmark		diff(%)
500.perlbench_r		1.547117225
502.gcc_r		-0.76960912
505.mcf_r		0
520.omnetpp_r		-1.730885443
523.xalancbmk_r		-0.697377016
525.x264_r		0.052501261
531.deepsjeng_r		-0.150069504
541.leela_r		-0.042836837
548.exchange2_r		0.011397094
557.xz_r		-0.416533465

As you mentioned, the existing analyses could be failed with the cascaded phi nodes from outer loop. I was able to see a test in which CanProveNotTakenFirstIteration is failed with LICM pass.
In order to canonicalize the loop with multiple latches, normally, the LoopSimplify pass creates a new latch and connects the old latches to the new one. If the pass detects a certain condition of phi node in the loop with multiple latches, it converts the loop into a nested loop. Therefore, it could be a question about which one is better between a loop, which has multiple induction variables and conditional branch, and a nested loop.

As you mentioned, the existing analyses could be failed with the cascaded phi nodes from outer loop. I was able to see a test in which CanProveNotTakenFirstIteration is failed with LICM pass.

Right, that's the sort of thing I'm worried about. Which is why I don't think it makes sense to separate out nested loops unless we have some heuristic that indicates it's actually profitable, as opposed to just legal.

As you mentioned, the existing analyses could be failed with the cascaded phi nodes from outer loop. I was able to see a test in which CanProveNotTakenFirstIteration is failed with LICM pass.

Right, that's the sort of thing I'm worried about. Which is why I don't think it makes sense to separate out nested loops unless we have some heuristic that indicates it's actually profitable, as opposed to just legal.

Thanks for the suggestion. Let me check the tests and try to find some heuristics that indicates it's actually profitable.

Following comments of @eli.friedman, added a heuristic and updated tests.

@eli.friedman I have checked the tests affected by this transform and I have found a case which we need to avoid.

For example, you can see it on test-multiple-latch function in llvm/test/Transforms/LICM/hoist-mustexec.ll. With previous commit, the LICM was failed with the test.

If you feel something wrong with this update, please let me know.

Fixed typo on comments.

jaykang10 updated this revision to Diff 361036.Jul 22 2021, 5:17 PM

Fixed a bug.

jaykang10 updated this revision to Diff 361140.Jul 23 2021, 2:33 AM

Rebased and updated patch to resolve merge conflict

jaykang10 updated this revision to Diff 367431.Aug 19 2021, 2:23 AM

Following comment of @eli.friedman, added profitable check

jaykang10 planned changes to this revision.Sep 10 2021, 11:58 AM