This is an archive of the discontinued LLVM Phabricator instance.

[LoopUtils] Populate sibling loops in reverse program order on new pass manager
AbandonedPublic

Authored by jaykang10 on Apr 1 2021, 3:19 PM.

Details

Summary

New pass manager is populating loops in forward program order rather than legacy pass manager.

Let's see a worse case with the order of new loop pass manager as below.

loop1:
  %6 = phi i8 [ 0, %1 ], [ %9, %5 ]
  %7 = phi i32 [ 1, %ph ], [ %8, %loop1 ]
  %8 = add nuw nsw i8 %7, 1
  %9 = select i1 %4, i8 1, i8 %6
  %10 = and i8 %9, 
  %11 = icmp eq i32 %10, 0
  br i1 %11, label %loop1, label %ph2

ph2:
  %13 = phi i32 [ %8, %loop1 ]
  br label %14 

loop2:
  %15 = phi i32 [ %16, %loop2 ], [ %13, %ph2 ]
  %16 = add nsw i32 %15, -1
  %17 = icmp sgt i32 %15, 0
  br i1 %17, label %14, label %exit
...
}

The loop2 is read-only loop and the IndVarSimplify pass can delete the loop. If we use reverse program order of loops, the loop2 can be deleted and it causes loop1 is also redundant because %8 has no use any more. If we use forward program order of loops, the loop1 can not be removed because the loop2 uses %8 during optimizing loop1.

This change populates the loops in reverse program order on new pass manager like legacy pass manager.

Diff Detail

Event Timeline

jaykang10 created this revision.Apr 1 2021, 3:19 PM
jaykang10 requested review of this revision.Apr 1 2021, 3:19 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 1 2021, 3:19 PM
jaykang10 edited the summary of this revision. (Show Details)Apr 2 2021, 12:50 AM
jaykang10 edited the summary of this revision. (Show Details)Apr 2 2021, 12:57 AM

It does make sense to me that visiting uses first could be better. I'm not sure if there was actually any specific reason to change the order in the NPM. We'll do some performance testing with this.

Actually, can you give a concrete example? The example in the description isn't clear enough

There will probably be cases where one way is better and cases where the other way is better. Might just be better to try to fix your use case directly in the passes themselves

jaykang10 updated this revision to Diff 335485.Apr 6 2021, 4:54 AM

Fixed failed tests

Actually, can you give a concrete example? The example in the description isn't clear enough

There will probably be cases where one way is better and cases where the other way is better. Might just be better to try to fix your use case directly in the passes themselves

@aeubanks Thanks for comments. Let me try to create a example for it.

jaykang10 updated this revision to Diff 335594.Apr 6 2021, 10:56 AM

Added a test

Actually, can you give a concrete example? The example in the description isn't clear enough

There will probably be cases where one way is better and cases where the other way is better. Might just be better to try to fix your use case directly in the passes themselves

@aeubanks Thanks for comments. Let me try to create a example for it.

@aeubanks I have added a reduced example loop-order-on-new-loop-pass-manager.ll.

As you can see, the second loop is read-only loop and the IndVarSimplify pass deletes the loop. If we use reverse order of loops, the second loop is gone and the %7 and %8 are redundant on first loop. In the end, with -O2, it causes empty function. If we use forward order of loops, the first loop is not removed because the second loop uses %8 during optimizing first loop. I am not sure we can fix this issue directly in the passes because it needs to look at other sibling loops. From my opinion, it would be better to change the order of loops in reverse on new loop pass manager for dead uses.

@aeubanks Can we push this change please?

Traversing forward for the loops can cause a value in an earlier loop to be evaluated, which could help a later loop, right? I'm now not sure that it's such a clear tradeoff.
For example, scev-expander-preserve-lcssa.ll looks worse now.

It could be interesting to revisit a previous loop if a loop was deleted.

I think there are more knowledgeable people about this sort of stuff on llvm-dev, could you make a post there?

Traversing forward for the loops can cause a value in an earlier loop to be evaluated, which could help a later loop, right? I'm now not sure that it's such a clear tradeoff.
For example, scev-expander-preserve-lcssa.ll looks worse now.

It could be interesting to revisit a previous loop if a loop was deleted.

I think there are more knowledgeable people about this sort of stuff on llvm-dev, could you make a post there?

I agree with you. It could be better to discuss this issue with llvm-dev. Let me make a post to llvm-dev.

fhahn added a subscriber: fhahn.Apr 15 2021, 10:02 AM
fhahn added inline comments.
llvm/test/Transforms/IndVarSimplify/scev-expander-preserve-lcssa.ll
122

This seems like a regression?

I am seeing the forward one is worse than reverse one on IRCE pass. The pass checks whether it safely calculates the bounds of a new loop using isLoopEntryGuardedByCond which climbs up the predecessor chain of loop header. If the previous loop has already been transformed with the IRCE pass, the current loop sometimes fails to calculate the bounds of a new loop.
The IRCE pass is function pass and it uses appendLoopsToWorklist which adds loops in forward program order. Can we add appendReverseLoopsToWorklist or something like that please? It is not easy to guarantee which order is always better but each pass could know which one is better. How do you think about it?

llvm/test/Transforms/IndVarSimplify/scev-expander-preserve-lcssa.ll
122

Yep, the reverse one is not always better than forward one...

@aeubanks As I mentioned on llvm-dev, how do you think about new pass manager allows each pass to choose the order of loops on new pass manager?

I think we should stick with one way unless there's specific need otherwise. This patch overall is fine, multiple people have said that we should just stick with the legacy order unless somebody has a better idea.

The description is a bit verbose with unnecessary code snippets, could you clean that up a bit?

llvm/lib/Transforms/Utils/LoopUtils.cpp
1507

this name and various comments need to be updated

jaykang10 edited the summary of this revision. (Show Details)Apr 21 2021, 1:51 AM
jaykang10 added inline comments.Apr 21 2021, 3:21 AM
llvm/lib/Transforms/Utils/LoopUtils.cpp
1507

I am sorry. I missed there is already a templated function for the forward program order. I will update code with it.

jaykang10 updated this revision to Diff 339183.Apr 21 2021, 4:35 AM

Used existing templated function for populating loops in reverse program order on new pass manager.

@aeubanks I have updated description and code. If you feel something wrong from them, please let me know.

fhahn added a comment.Apr 21 2021, 5:46 AM

Could you share any data of the impact this change has on runtime-performance/code-size on benchmarks?

Could you share any data of the impact this change has on runtime-performance/code-size on benchmarks?

I have checked exec_time/size with llvm-test-suite on x86 and the result is as below.

Tests: 2892
Metric: exec_time

/usr/local/lib/python2.7/dist-packages/scipy/stats/stats.py:308: RuntimeWarning: divide by zero encountered in log
  log_a = np.log(np.array(a, dtype=dtype))
Program                                        base   new    diff 
 test-suite...te/GCC-C-execute-980505-2.test     0.00   0.00  inf%
 test-suite...ute/GCC-C-execute-pr49768.test     0.00   0.00  inf%
 test-suite...cute/GCC-C-execute-simd-6.test     0.00   0.00  inf%
 test-suite...ute/GCC-C-execute-pr68328.test     0.00   0.00  inf%
 test-suite...d_ops_test_op_paddusw_152.test     0.00   0.00  inf%
 test-suite...te/GCC-C-execute-stkalign.test     0.00   0.00  inf%
 test-suite.../GCC-C-execute-20071216-1.test     0.00   0.00  inf%
 test-suite...ute/GCC-C-execute-pr47538.test     0.00   0.00  inf%
 test-suite...ecute/GCC-C-execute-mod-1.test     0.00   0.00  inf%
 test-suite...ute/GCC-C-execute-pr46316.test     0.00   0.00  inf%
 test-suite...md_ops_test_op_pmaxsw_177.test     0.00   0.00  inf%
 test-suite...execute-struct-aliasing-1.test     0.00   0.00  inf%
 test-suite...cute/GCC-C-execute-enum-1.test     0.00   0.00  inf%
 test-suite.../GCC-C-execute-20010114-1.test     0.00   0.00  inf%
 test-suite...ute/GCC-C-execute-pr46309.test     0.00   0.00  inf%
 Geomean difference                                           nan%
                base            new         diff
count  2876.000000    2876.000000    2874.000000
mean   485.557850     456.966905     inf        
std    13703.055405   12752.463246  NaN         
min    0.000000       0.000000      -1.000000   
25%    0.000400       0.000400      -0.250000   
50%    0.000700       0.000700       0.000000   
75%    0.010925       0.010925       0.295752   
max    526452.168000  550562.615000  inf
Tests: 2892
Metric: size

Program                                        base    new     diff 
 test-suite...ute/GCC-C-execute-pr41917.test   8016.00 8024.00  0.1%
 test-suite...ute/GCC-C-execute-pr60072.test   8016.00 8024.00  0.1%
 test-suite...ute/GCC-C-execute-pr47155.test   8040.00 8048.00  0.1%
 test-suite...ute/GCC-C-execute-pr82387.test   8048.00 8056.00  0.1%
 test-suite.../GCC-C-execute-20001009-2.test   8048.00 8056.00  0.1%
 test-suite.../GCC-C-execute-20041201-1.test   8056.00 8064.00  0.1%
 test-suite...ute/GCC-C-execute-pr57321.test   8072.00 8080.00  0.1%
 test-suite...ute/GCC-C-execute-pr58385.test   8072.00 8080.00  0.1%
 test-suite...cute/GCC-C-execute-pure-1.test   8080.00 8088.00  0.1%
 test-suite...ute/GCC-C-execute-pr81503.test   8096.00 8104.00  0.1%
 test-suite...ute/GCC-C-execute-pr48717.test   8104.00 8112.00  0.1%
 test-suite...ute/GCC-C-execute-pr65216.test   8120.00 8128.00  0.1%
 test-suite.../GCC-C-execute-20040302-1.test   8136.00 8144.00  0.1%
 test-suite...e/GCC-C-execute-pr52979-1.test   8176.00 8184.00  0.1%
 test-suite...e/GCC-C-execute-pr52979-2.test   8176.00 8184.00  0.1%
 Geomean difference                                             nan%
               base           new         diff
count  2.484000e+03  2.484000e+03  2484.000000
mean   3.340861e+04  3.340918e+04  0.000047   
std    1.803869e+05  1.803872e+05  0.000201   
min    7.992000e+03  7.992000e+03  0.000000   
25%    8.264000e+03  8.264000e+03  0.000000   
50%    8.344000e+03  8.344000e+03  0.000000   
75%    1.742200e+04  1.742200e+04  0.000000   
max    7.648424e+06  7.648432e+06  0.000998

@aeubanks Can we push this change please? If you need something more, please let me know.

fhahn added a comment.Apr 22 2021, 3:42 AM

Could you share any data of the impact this change has on runtime-performance/code-size on benchmarks?

I have checked exec_time/size with llvm-test-suite on x86 and the result is as below.

Thanks for sharing the data, but I am not sure the excerpt you shared motivates a change? The SingleSource/short-running ones probably do not provide much insight. I think the results for MultiSource/SPEC would be more interesting. In how many binaries does the patch cause changes? How does it impact runtime performance for longer benchmarks?

Could you share any data of the impact this change has on runtime-performance/code-size on benchmarks?

I have checked exec_time/size with llvm-test-suite on x86 and the result is as below.

Thanks for sharing the data, but I am not sure the excerpt you shared motivates a change? The SingleSource/short-running ones probably do not provide much insight. I think the results for MultiSource/SPEC would be more interesting. In how many binaries does the patch cause changes? How does it impact runtime performance for longer benchmarks?

On AArch64, the performance difference from spec2006 with this change is as below.

                diff(%)
400.perlbench          0
401.bzip2        1.04712
403.gcc         -0.37736
429.mcf         0.571429
445.gobmk              0
456.hmmer              0
458.sjeng              0
462.libquantum  0.615385
464.h264ref      -0.9009
471.omnetpp     0.684932
473.astar       0.571429
483.xalancbmk   -0.37453

Additionally, with this change, IRCE pass handles more cases and it helps loop vectorizer vectorize more loops. Once this change is merged, I will start discussion to turn on IRCE pass as default on new pass manager.

aeubanks added inline comments.Apr 22 2021, 11:32 AM
llvm/include/llvm/Transforms/Utils/LoopUtils.h
467

this is now wrong?

468–471

This comment is what worries me the most. The forward order was deliberately chosen with motivation.

Sorry for all the scrutiny, but this is an important change. If everybody else is ok with this change, then I'm fine too. But I would like to know what the source of your test case is. Is it from some C code?

I still think it would be nice if we could somehow, after a loop has been deleted, revisit other loops that contained defs for uses in the deleted loop. That way we still get advantages of forward traversal along with some benefit of reverse traversal. But perhaps that's not reasonable for the loop pass manager to do.

jaykang10 added inline comments.Apr 23 2021, 12:57 AM
llvm/include/llvm/Transforms/Utils/LoopUtils.h
467

No, it is not wrong. It calls different function in this change now. The comment was for forward program order.

468–471

This comment is what worries me the most. The forward order was deliberately chosen with motivation.

I agree with the comment but there could be other cases get worse with forward order.

Sorry for all the scrutiny, but this is an important change. If everybody else is ok with this change, then I'm fine too. But I would like to know what the source of your test case is. Is it from some C code?

Yep, the test case comes from spec benchmark and I have reduced it.

I still think it would be nice if we could somehow, after a loop has been deleted, revisit other loops that contained defs for uses in the deleted loop. That way we still get advantages of forward traversal along with some benefit of reverse traversal. But perhaps that's not reasonable for the loop pass manager to do.

I agree with it. Each pass could re-visit the target loops.

I feel it is not easy to get agreement with this change from everyone. Let's just keep open this issue. Later, down stream folks could raise this issue again because they could meet different input on their own passes based on legacy pass manager.

jaykang10 planned changes to this revision.May 27 2021, 10:11 AM
jaykang10 abandoned this revision.May 30 2021, 12:39 PM