This is an archive of the discontinued LLVM Phabricator instance.

[MachineSinking] sink more profitable loads
ClosedPublic

Authored by shchenz on Aug 31 2020, 2:11 AM.

Details

Summary

Instead of setting Store to true conservatively, handle some cases where Store can be determined by some analysis if MI.parent() dom SuccToSinkTo and SuccToSinkTo post dom MI.parent().

Thus we can get more profitable loads to be sunk.

Diff Detail

Event Timeline

shchenz created this revision.Aug 31 2020, 2:11 AM
shchenz requested review of this revision.Aug 31 2020, 2:11 AM
shchenz edited the summary of this revision. (Show Details)Aug 31 2020, 5:06 AM
shchenz updated this revision to Diff 289112.Sep 1 2020, 2:57 AM

handle further more cases to sink profitable loads

Do we need some threshold to limit the analysis? I'm worried this could get very expensive.

llvm/lib/CodeGen/MachineSink.cpp
896

Are the post-dominance checks necessary? I'm not sure what invariants you're trying to enforce.

shchenz updated this revision to Diff 289389.Sep 2 2020, 3:46 AM

1: can not check alias for call instruction by using mayAlias
2: add one more cache to save compiling time.

Do we need some threshold to limit the analysis? I'm worried this could get very expensive.

I add a new cache to save compiling time. Could you be more specific on how to add the threshold? Thanks.

llvm/lib/CodeGen/MachineSink.cpp
896

they are not necessary. I add the post-dominance check for the convenience of finding all blocks which are in the path from block From to To.
With these checks, it is easy to judge if a block BB(reachable from From) is in the path from From to To. Just check PDT->dominates(To, BB))
`

lenary removed a subscriber: lenary.Sep 2 2020, 3:57 AM

gentle ping

Hi @efriedma,

Given you've already taken a look, can you do the review or do you want me to step in?

Cheers,
-Quentin

I'd appreciate it if you'd step in, Quentin. I have a backlog of reviews, and I still haven't quite wrapped my head around the use of post-dominance here.

shchenz added a comment.EditedSep 30 2020, 6:58 PM

@efriedma @qcolombet haha, thanks for your review. Please take your time. ^-^ . I want to make the alias check simple(find all basic blocks in all paths from block From to block To), so I guard the dominance between From and To as From dominates To and To post dominates From. This will make us still miss some cases in which load instruction can be sunk. But it can make the complexity simple.

gentle ping ^^

Hi,

Looks reasonable but I'd like to have an idea of the compile time impact of this patch.

Make sure to fix the Lint comments as well.

Cheers,
-Quentin

llvm/lib/CodeGen/MachineSink.cpp
348

Would it make sense to preserve the cache as long as we don't move stores, calls and loads with ordered memory ref?

I'm worried about the compile time impact of this patch so I am wondering if we want to be smarter about invalidating the cache.

919

While we traverse BB, should we create an entry for the pair (To, BB).

Put differently, how is the compile time impact of this patch and do we have to do more to avoid computations?

shchenz updated this revision to Diff 299890.Oct 22 2020, 1:46 AM

1: clear the cache just before function returns
2: cache handled bb while calling depth_first() for a block

Thanks very much for your review. @qcolombet Update accordingly. I collected some compiling time numbers for llvm test-suite. No obivious degradation found.

llvm/lib/CodeGen/MachineSink.cpp
348

Good idea. But seems we still need to clear the cache before we return from MachineSinking::runOnMachineFunction. (MachineSink instance is shared with different callings to MachineSinking::runOnMachineFunction for different functions), otherwise I meet some weird memory issue.

919

I collected some data from PowerPC, seems the compiling time difference is not obivious in llvm test-suite. the tool hides some small diff tests. Among them, I saw some tests takes 10s and some of them takes more than 30s, but no big reg for them. The biggest diff 9.6% and 7.4%, compile time are around 0.1s. And these two degradations can not reproduciable in other runs.

./utils/compare.py -m compile_time base.json fix.json  
Tests: 309
Metric: compile_time

Program                                        base  fix   diff 
                                                                  
 test-suite...enchmarks/Stanford/Queens.test       9.6%
 test-suite...d-warshall/floyd-warshall.test       7.4%
 test-suite...ImageProcessing/Blur/blur.test      -4.7%
 test-suite...SubsetBRawLoops/lcalsBRaw.test       2.7%
 test-suite...math/automotive-basicmath.test     -2.5%
 test-suite.../Benchmarks/Stanford/Perm.test      -2.4%
 test-suite...rks/tramp3d-v4/tramp3d-v4.test     -2.2%
 test-suite...ow-dbl/GlobalDataFlow-dbl.test      -2.2%
 test-suite...enchmarks/Stanford/RealMM.test      -2.0%
 test-suite...ctions-flt/Reductions-flt.test      -1.8%
 test-suite...t/StatementReordering-flt.test      -1.7%
 test-suite.../Trimaran/enc-rc4/enc-rc4.test      -1.7%
 test-suite...l/StatementReordering-dbl.test      1.6%
 test-suite...pansion-dbl/Expansion-dbl.test      -1.4%
 test-suite...bl/IndirectAddressing-dbl.test      -1.2%
 Geomean difference                                           nan%
            base        fix        diff
count  309.000000  309.000000  277.000000
mean   2.696964    2.685603   -0.003232  
std    6.283106    6.251353    0.009456  
min    0.000000    0.000000   -0.047321  
25%    0.077944    0.077862   -0.005599  
50%    0.291912    0.289780   -0.003404  
75%    2.475318    2.478657   -0.001450  
max    64.257411   64.088470   0.095601
shchenz updated this revision to Diff 299903.Oct 22 2020, 2:43 AM

1: address Lint suggestion

shchenz added a subscriber: nikic.EditedOct 23 2020, 7:07 AM

Ah, I used @nikic perf testing tool for this patch on the X86 target(?). (Thanks for this great tool ^-^) Here is the result of the compiling time: https://llvm-compile-time-tracker.com/compare.php?from=2d25004a137223a02aa06e8bfd512a648f3b3f94&to=abf39366779bed8b9f2d276cd199c3b78471e3b1&stat=instructions

GEO degradates about 0.02% and no obivious outlier. I guess the compiling time increase on these benchmarks should be acceptable?

qcolombet accepted this revision.Oct 27 2020, 9:15 AM

I guess the compiling time increase on these benchmarks should be acceptable?

Yes, that's fine by me.

llvm/lib/CodeGen/MachineSink.cpp
348

Yeah, that's expected.

This revision is now accepted and ready to land.Oct 27 2020, 9:15 AM
This revision was landed with ongoing or failed builds.Nov 1 2020, 6:16 PM
This revision was automatically updated to reflect the committed changes.

Hi, @shchenz. Our several opecncl benchmarks have appeared great compile time regression.
For only one function, the time consume on Machine code sinking pass increased form 6.0711s to 366.5713.
According to your algorithm, this patch will obviously increase the compile time for some cases.

shchenz added a comment.EditedDec 22 2020, 6:56 PM

Hi, @shchenz. Our several opecncl benchmarks have appeared great compile time regression.
For only one function, the time consume on Machine code sinking pass increased form 6.0711s to 366.5713.
According to your algorithm, this patch will obviously increase the compile time for some cases.

Yes, we have been aware that this patch may introduce compiling time degradations. And as you can see in previous comments, I already tested the compiling time on X86 arch. Sadly, the tested benchmarks don't expose any regressions.

Could you please help to send me your regression function/IR? So I can have a look about how to fix it? Thanks.

Yes, we have been aware that this patch may introduce compiling time degradations. And as you can see in previous comments, I already tested the compiling time on X86 arch. Sadly, the tested benchmarks don't expose any regressions.

Could you please help to send me your regression function/IR? So I can have a look about how to fix it? Thanks.

I am sorry that I can't provide the case to you directly. I am trying making a small reproduce but encountered some problems.
However, can you adding an option to control it?
Thanks.

Yes, we have been aware that this patch may introduce compiling time degradations. And as you can see in previous comments, I already tested the compiling time on X86 arch. Sadly, the tested benchmarks don't expose any regressions.

Could you please help to send me your regression function/IR? So I can have a look about how to fix it? Thanks.

I am sorry that I can't provide the case to you directly. I am trying making a small reproduce but encountered some problems.
However, can you adding an option to control it?
Thanks.

I can surely do that. But I think the most reasonable solution would be fix the compiling time issue. Since compiling time tests I did before does not expose any regression, your test case must be a little special. Could you find out the special point, for example the function has too many blocks or some/many blocks in the function has too many instructions? Thanks.

I can surely do that. But I think the most reasonable solution would be fix the compiling time issue. Since compiling time tests I did before does not expose any regression, your test case must be a little special. Could you find out the special point, for example the function has too many blocks or some/many blocks in the function has too many instructions? Thanks.

I think the increase in compile time is because the function has too many instructions and blocks. The function has Thousands of lines of instructions. Can you add limitation for the number of instructions or the number of blocks so the check for 'store' can end early?

I can surely do that. But I think the most reasonable solution would be fix the compiling time issue. Since compiling time tests I did before does not expose any regression, your test case must be a little special. Could you find out the special point, for example the function has too many blocks or some/many blocks in the function has too many instructions? Thanks.

I think the increase in compile time is because the function has too many instructions and blocks. The function has Thousands of lines of instructions. Can you add limitation for the number of instructions or the number of blocks so the check for 'store' can end early?

Could you please help to check if the following change can solve your issue?

C
diff --git a/llvm/lib/CodeGen/MachineSink.cpp b/llvm/lib/CodeGen/MachineSink.cpp
index 0abdf89..8ca3520 100644
--- a/llvm/lib/CodeGen/MachineSink.cpp
+++ b/llvm/lib/CodeGen/MachineSink.cpp
@@ -79,6 +79,12 @@ static cl::opt<unsigned> SplitEdgeProbabilityThreshold(
         "splitted critical edge"),
     cl::init(40), cl::Hidden);
 
+static cl::opt<unsigned> SinkLoadInstsPerBlockThreshold(
+    "machine-sink-load-instrs-threshold",
+    cl::desc("Do not try to find alias store for a load if there is a in-path "
+             "block whose instruction number is higher than this threshold."),
+    cl::init(2000), cl::Hidden);
+
 STATISTIC(NumSunk,      "Number of machine instructions sunk");
 STATISTIC(NumSplit,     "Number of critical edges split");
 STATISTIC(NumCoalesces, "Number of copies coalesced");
@@ -1036,6 +1042,12 @@ bool MachineSinking::hasStoreBetween(MachineBasicBlock *From,
     HandledBlocks.insert(BB);
     // To post dominates BB, it must be a path from block From.
     if (PDT->dominates(To, BB)) {
+      // If this BB is too big, stop searching to save compiling time.
+      if (BB->size() > SinkLoadInstsPerBlockThreshold) {
+        HasStoreCache[BlockPair] = true;
+        return true;
+      }
+
       for (MachineInstr &I : *BB) {
         // Treat as alias conservatively for a call or an ordered memory
         // operation.

I can surely do that. But I think the most reasonable solution would be fix the compiling time issue. Since compiling time tests I did before does not expose any regression, your test case must be a little special. Could you find out the special point, for example the function has too many blocks or some/many blocks in the function has too many instructions? Thanks.

I think the increase in compile time is because the function has too many instructions and blocks. The function has Thousands of lines of instructions. Can you add limitation for the number of instructions or the number of blocks so the check for 'store' can end early?

Could you please help to check if the following change can solve your issue?

C
diff --git a/llvm/lib/CodeGen/MachineSink.cpp b/llvm/lib/CodeGen/MachineSink.cpp
index 0abdf89..8ca3520 100644
--- a/llvm/lib/CodeGen/MachineSink.cpp
+++ b/llvm/lib/CodeGen/MachineSink.cpp
@@ -79,6 +79,12 @@ static cl::opt<unsigned> SplitEdgeProbabilityThreshold(
         "splitted critical edge"),
     cl::init(40), cl::Hidden);
 
+static cl::opt<unsigned> SinkLoadInstsPerBlockThreshold(
+    "machine-sink-load-instrs-threshold",
+    cl::desc("Do not try to find alias store for a load if there is a in-path "
+             "block whose instruction number is higher than this threshold."),
+    cl::init(2000), cl::Hidden);
+
 STATISTIC(NumSunk,      "Number of machine instructions sunk");
 STATISTIC(NumSplit,     "Number of critical edges split");
 STATISTIC(NumCoalesces, "Number of copies coalesced");
@@ -1036,6 +1042,12 @@ bool MachineSinking::hasStoreBetween(MachineBasicBlock *From,
     HandledBlocks.insert(BB);
     // To post dominates BB, it must be a path from block From.
     if (PDT->dominates(To, BB)) {
+      // If this BB is too big, stop searching to save compiling time.
+      if (BB->size() > SinkLoadInstsPerBlockThreshold) {
+        HasStoreCache[BlockPair] = true;
+        return true;
+      }
+
       for (MachineInstr &I : *BB) {
         // Treat as alias conservatively for a call or an ordered memory
         // operation.

Unfortunately, this patch doesn't work. I don't think the number of blocks is necessarily related to the number of instructions.
For one function, with your default threshold, The time spent on Machine code sinking is 11.2464s, no obvious difference from before. When set the threshold to 1, the time spent on Machine code sinking reduced 0.6937.
I think it would be better if you can limit the number of MIs.
Anyway, this patch provides us with an option. Thanks for your help.

I can surely do that. But I think the most reasonable solution would be fix the compiling time issue. Since compiling time tests I did before does not expose any regression, your test case must be a little special. Could you find out the special point, for example the function has too many blocks or some/many blocks in the function has too many instructions? Thanks.

I think the increase in compile time is because the function has too many instructions and blocks. The function has Thousands of lines of instructions. Can you add limitation for the number of instructions or the number of blocks so the check for 'store' can end early?

Could you please help to check if the following change can solve your issue?

C
diff --git a/llvm/lib/CodeGen/MachineSink.cpp b/llvm/lib/CodeGen/MachineSink.cpp
index 0abdf89..8ca3520 100644
--- a/llvm/lib/CodeGen/MachineSink.cpp
+++ b/llvm/lib/CodeGen/MachineSink.cpp
@@ -79,6 +79,12 @@ static cl::opt<unsigned> SplitEdgeProbabilityThreshold(
         "splitted critical edge"),
     cl::init(40), cl::Hidden);
 
+static cl::opt<unsigned> SinkLoadInstsPerBlockThreshold(
+    "machine-sink-load-instrs-threshold",
+    cl::desc("Do not try to find alias store for a load if there is a in-path "
+             "block whose instruction number is higher than this threshold."),
+    cl::init(2000), cl::Hidden);
+
 STATISTIC(NumSunk,      "Number of machine instructions sunk");
 STATISTIC(NumSplit,     "Number of critical edges split");
 STATISTIC(NumCoalesces, "Number of copies coalesced");
@@ -1036,6 +1042,12 @@ bool MachineSinking::hasStoreBetween(MachineBasicBlock *From,
     HandledBlocks.insert(BB);
     // To post dominates BB, it must be a path from block From.
     if (PDT->dominates(To, BB)) {
+      // If this BB is too big, stop searching to save compiling time.
+      if (BB->size() > SinkLoadInstsPerBlockThreshold) {
+        HasStoreCache[BlockPair] = true;
+        return true;
+      }
+
       for (MachineInstr &I : *BB) {
         // Treat as alias conservatively for a call or an ordered memory
         // operation.

Unfortunately, this patch doesn't work. I don't think the number of blocks is necessarily related to the number of instructions.
For one function, with your default threshold, The time spent on Machine code sinking is 11.2464s, no obvious difference from before. When set the threshold to 1, the time spent on Machine code sinking reduced 0.6937.
I think it would be better if you can limit the number of MIs.
Anyway, this patch provides us with an option. Thanks for your help.

The above threshold is for number of MIs. BB->size() is to get instruction number of BB. I committed 31c2b93d83f63ce7f9bb4977f58de2e00bf18e0f to further reduce compiling time. You can have a try

Is it better to have the check before 'if (PDT->dominates(To, BB)) {' ?

Is it better to have the check before 'if (PDT->dominates(To, BB)) {' ?

I don't think so. If there is a huge block that is reachable from block From but is not dominated by From and can not flow to To, we will mark {From, To} as hasStore. This is not true, we should only care about the blocks which are dominated by From and can flow to To.

The above threshold is for number of MIs. BB->size() is to get instruction number of BB. I committed 31c2b93d83f63ce7f9bb4977f58de2e00bf18e0f to further reduce compiling time. You can have a try

Thanks. It works for our tests.