This is an archive of the discontinued LLVM Phabricator instance.

Teach the DominatorTree to fall back to recalculation when applying updates, to speed up JT (PR37929)
Closed, Public

Authored by NutshellySima on Oct 13 2018, 7:04 AM.

Details

Summary

This patch makes the dominator tree recalculate itself when the update vector passed to applyUpdates is larger than a threshold, since directly applying updates is usually slower than recalculating the whole domtree in that case. This fixes an issue that causes JumpThreading (JT) to run slowly on some inputs.

In bug 37929, the dominator tree is trying to apply 19,000+ updates several times, which takes several minutes.

After this patch, the time used by DT.applyUpdates:

Input                       | Before (s) | After (s) | Speedup
the 2nd Reproducer in 37929 | 297        | 0.15      | 1980x
clang-5.0.0.0.bc            | 9.7        | 4.3       | 2.26x
clang-5.0.0.4.bc            | 11.6       | 2.6       | 4.46x
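As an illustration of the decision this patch adds, here is a minimal sketch of the apply-vs-rebuild choice; the function name updateDomTree and the explicit Threshold parameter are illustrative, and in the patch itself the check happens inside the updater, after the pending updates have been legalized:

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/IR/Dominators.h"
  #include "llvm/IR/Function.h"
  using namespace llvm;

  // Illustrative wrapper: rebuild the whole tree when the update batch is
  // large, otherwise apply the batch incrementally.
  void updateDomTree(DominatorTree &DT, Function &F,
                     ArrayRef<DominatorTree::UpdateType> Updates,
                     unsigned Threshold) {
    if (Updates.size() > Threshold)
      DT.recalculate(F);        // one full rebuild, O(n^2) worst case
    else
      DT.applyUpdates(Updates); // incremental batch update
  }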

Diff Detail

Event Timeline

NutshellySima created this revision.Oct 13 2018, 7:04 AM
kuhar added a comment.Oct 13 2018, 8:34 AM

Thanks for looking into this. A couple of high-level questions:

  • how was the constant "650" picked? Is it just playing with different values and choosing something that seemed reasonable?
  • why is the threshold independent of the tree size? Intuitively the policy could be to recalculate if the number of updates is greater than some multiple of size. Maybe we can refer to the average-case complexity from Loukas' paper?
  • why do we want to put it in the incremental updater instead of the DomTreeUpdater class?
NutshellySima added a comment.EditedOct 13 2018, 9:24 AM

Thanks for looking into this. A couple of high-level questions:

  • how was the constant "650" picked? Is it just playing with different values and choosing something that seemed reasonable?
  • why is the threshold independent of the tree size? Intuitively the policy could be to recalculate if the number of updates is greater than some multiple of size. Maybe we can refer to the average-case complexity from Loukas' paper?
  • why do we want to put it in the incremental updater instead of the DomTreeUpdater class?
  • For the constant "650", I used the 2nd reproducer in bug PR37929 and collected some statistics on the size of the updates after legalization and the time used by applying updates. The statistics show that the time complexity is about Θ(k^2.4) (let k be the size of the update vector after legalization) [I have just checked the original paper: it is O(m*min{n, k} + kn), so about Ω(k*(k+n)) in the case of the reproducer], and it seems that, at least for CFG sizes no larger than the one in PR37929, "650" is a reasonable value at which recalculating is sure to take less time than applying the updates directly. (I think there should surely be better heuristics for deciding this.)
  • It is definitely better to make it relative to the tree size, but I lack inputs with different CFG sizes to experiment on.
  • I see this as an issue of the incremental updater not handling a huge update vector well; it definitely should fall back to recalculating if that is much faster. I believe this constant is related to the size of the updates after legalization, so I put it in the incremental updater. [Currently, there is similar functionality in DTU which is Ω(n^2); I believe it is slow (and redundant if JT submits updates normally), so I am considering removing it from DTU, after which it would be impossible to know the size of the legalized update vector there.] Besides, placing it in the incremental updater also benefits other places not using the DomTreeUpdater.
kuhar added a comment.Oct 13 2018, 2:37 PM
  • For the constant "650", I used the 2nd reproducer in bug PR37929 and collected some statistics on the size of the updates after legalization and the time used by applying updates. The statistics show that the time complexity is about Θ(k^2.4) (let k be the size of the update vector after legalization) [I have just checked the original paper: it is O(m*min{n, k} + kn), so about Ω(k*(k+n)) in the case of the reproducer], and it seems that, at least for CFG sizes no larger than the one in PR37929, "650" is a reasonable value at which recalculating is sure to take less time than applying the updates directly. (I think there should surely be better heuristics for deciding this.)

Thanks for the detailed explanation. This is what I expected. I don't think it's enough to use 650 in the code without any explanation or justification -- even if it works in practice, I think it deserves a comment.

  • It is definitely better to make it relative to the tree size, but I lack inputs with different CFG sizes to experiment on.

You can take intermediate bitcode files from a full-LTO build using lld. I think the option that does the trick is --save-temps=DIR or similar.

  • I see this as an issue of the incremental updater not handling a huge update vector well; it definitely should fall back to recalculating if that is much faster. I believe this constant is related to the size of the updates after legalization, so I put it in the incremental updater. [Currently, there is similar functionality in DTU which is Ω(n^2); I believe it is slow (and redundant if JT submits updates normally), so I am considering removing it from DTU, after which it would be impossible to know the size of the legalized update vector there.] Besides, placing it in the incremental updater also benefits other places not using the DomTreeUpdater.

I see. If we expect to see many redundant updates, then it makes sense to check that after legalization.

How about this heuristic instead: since we expect the complexity of a batch update to be O(k^2 + nk) and full recomputation is O(n^2) in the worst case, we can fall back to recomputation when k > n. The size of the tree should be accessible with just DT.size(), and it seems like a conservative choice we could improve later. What do you think?
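Read literally, this proposal is a one-line predicate over k (legalized updates) and n (tree size); a sketch with illustrative names:

  #include <cstddef>

  // Expected incremental cost is roughly O(k^2 + nk); a full rebuild is
  // O(n^2) worst case, so once k exceeds n the rebuild is the safer bet.
  bool shouldFallBackToRecalculation(std::size_t NumLegalizedUpdates /*k*/,
                                     std::size_t TreeSize /*n, e.g. DT.size()*/) {
    return NumLegalizedUpdates > TreeSize;
  }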

NutshellySima edited the summary of this revision.

Make the threshold selected proportional to the domtree size.

  • For the constant "650", I used the 2nd reproducer in bug PR37929 and collected some statistics on the size of the updates after legalization and the time used by applying updates. The statistics show that the time complexity is about Θ(k^2.4) (let k be the size of the update vector after legalization) [I have just checked the original paper: it is O(m*min{n, k} + kn), so about Ω(k*(k+n)) in the case of the reproducer], and it seems that, at least for CFG sizes no larger than the one in PR37929, "650" is a reasonable value at which recalculating is sure to take less time than applying the updates directly. (I think there should surely be better heuristics for deciding this.)

Thanks for the detailed explanation. This is what I expected. I don't think it's enough to use 650 in the code without any explanation or justification -- even if it works in practice, I think it deserves a comment.

  • It is definitely better to make it relative to the tree size, but I lack inputs with different CFG sizes to experiment on.

You can take intermediate bitcode files from a full-LTO build using lld. I think the option that does the trick is --save-temps=DIR or similar.

  • I see this as an issue of the incremental updater not handling a huge update vector well; it definitely should fall back to recalculating if that is much faster. I believe this constant is related to the size of the updates after legalization, so I put it in the incremental updater. [Currently, there is similar functionality in DTU which is Ω(n^2); I believe it is slow (and redundant if JT submits updates normally), so I am considering removing it from DTU, after which it would be impossible to know the size of the legalized update vector there.] Besides, placing it in the incremental updater also benefits other places not using the DomTreeUpdater.

I see. If we expect to see many redundant updates, then it makes sense to check that after legalization.

How about this heuristic instead: since we expect the complexity of a batch update to be O(k^2 + nk) and full recomputation is O(n^2) in the worst case, we can fall back to recomputation when k > n. The size of the tree should be accessible with just DT.size(), and it seems like a conservative choice we could improve later. What do you think?

I agree with you. Though the paper says that the complexity is related to the number of edges in the graph after modification, I think it's sufficient in practice to make the threshold on k proportional to n. It also seems to work well and makes other inputs (besides the reproducer) much faster than the previous version of this patch did.

NutshellySima edited the summary of this revision. Oct 14 2018, 7:50 AM

Interesting. Nice improvements. What about small trees? It would seem that any tree with fewer than 75 nodes would always be recalculated. Do the timings you ran include anything to show that this is better? Or were you just looking at larger trees at the time?

Interesting. Nice improvements. What about small trees? It would seem that any tree with fewer than 75 nodes would always be recalculated. Do the timings you ran include anything to show that this is better? Or were you just looking at larger trees at the time?

I haven't conducted dedicated tests on graphs with fewer than 75 vertices, but among all the domtrees recalculated in the two clang-*.bc files I mentioned above, approximately 63% have n smaller than 75.

In fact, the time used by directly applying updates to the domtree also depends on whether the graph is sparse or dense. It is hard to come up with an accurate heuristic for when to recalculate the domtree, because edge insertions have a complexity of O(m*min{n,k}+nk), edge deletions have a complexity of O(m*k), and recalculation has a complexity of O(n^2) (let n denote the number of vertices of the graph, m the number of edges after updates, and k the number of updates). I think that if we want to fine-tune this heuristic, we need to collect a lot of statistics and mathematically fit the function (n,m,k)->bool (whether to recalculate or not).
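For concreteness, a fitted (n, m, k) -> bool predicate could look roughly like the sketch below, using the asymptotic bounds above as stand-ins for measured cost models; the function and its coefficients are hypothetical and not part of this patch:

  #include <algorithm>
  #include <cstdint>

  bool shouldRecalculate(uint64_t N /*vertices*/, uint64_t M /*edges after updates*/,
                         uint64_t K /*legalized updates*/) {
    const uint64_t InsertCost  = M * std::min(N, K) + N * K; // O(m*min{n,k} + nk)
    const uint64_t DeleteCost  = M * K;                      // O(m*k)
    const uint64_t RebuildCost = N * N;                      // O(n^2)
    return InsertCost + DeleteCost > RebuildCost;            // rebuild when cheaper
  }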

@kuhar , do you think we need to use another heuristic involving m? Though I think maybe using only n is sufficient? :)

kuhar added a comment.Oct 15 2018, 7:07 AM

Interesting. Nice improvements. What about small trees? It would seem that any tree with fewer than 75 nodes would always be recalculated. Do the timings you ran include anything to show that this is better? Or were you just looking at larger trees at the time?

I haven't conducted dedicated tests on graphs with fewer than 75 vertices, but among all the domtrees recalculated in the two clang-*.bc files I mentioned above, approximately 63% have n smaller than 75.

I'd be much more comfortable if this change didn't always recalculate for graphs smaller than 75. I think there is a lot of value in making sure the incremental updater is behaving correctly on small examples (e.g., existing unit tests, most regression tests, etc.). If for some reason it stops working, it will be much easier to detect and fix it on these small examples.

In fact, the time used by directly applying updates to the domtree also depends on whether the graph is sparse or dense. It is hard to come up with an accurate heuristic for when to recalculate the domtree, because edge insertions have a complexity of O(m*min{n,k}+nk), edge deletions have a complexity of O(m*k), and recalculation has a complexity of O(n^2) (let n denote the number of vertices of the graph, m the number of edges after updates, and k the number of updates). I think that if we want to fine-tune this heuristic, we need to collect a lot of statistics and mathematically fit the function (n,m,k)->bool (whether to recalculate or not).

Since we cannot measure running time directly in any realistic scenario, I think that the best way to approach this would be to count the number of graph nodes visited during DFS walks and see how many of them we get with different heuristics.
I think that the bugs already established that we need to have some fallback heuristic. I'd suggest being careful here and trying at least a couple of different policies on different projects (maybe clang, sqlite, the lnt test suite). There are a couple of properties I propose for the heuristic:

  1. Don't always recalculate small trees (say, n < 100)
  2. Recalculate when k > n
  3. Recalculate in the case of the performance bugs we saw
  4. Visit fewer nodes than the current updater for Clang + SQLite + Oggenc + ...

@kuhar , do you think we need to use another heuristic involving m? Though I think maybe using only n is sufficient? :)

Intuitively, I think that the heuristic can be some linear function: a * n + b * k > threshold, where a and b are coefficients and the threshold is a constant. I don't think it's possible to calculate m efficiently, is it?

NutshellySima added a comment.EditedOct 15 2018, 7:34 AM

Interesting. Nice improvements. What about small trees? It would seem that any tree with fewer than 75 nodes would always be recalculated. Do the timings you ran include anything to show that this is better? Or were you just looking at larger trees at the time?

I haven't conducted dedicated tests on graphs with fewer than 75 vertices, but among all the domtrees recalculated in the two clang-*.bc files I mentioned above, approximately 63% have n smaller than 75.

I'd be much more comfortable if this change didn't always recalculate for graphs smaller than 75. I think there is a lot of value in making sure the incremental updater is behaving correctly on small examples (e.g., existing unit tests, most regression tests, etc.). If for some reason it stops working, it will be much easier to detect and fix it on these small examples.

Thanks for the feedback. It makes sense to stop recalculating on small graphs but I need to have a check on whether it performs well.

In fact, the time used by directly applying updates to the domtree also depends on whether the graph is sparse or dense. It is hard to come up with an accurate heuristic for when to recalculate the domtree, because edge insertions have a complexity of O(m*min{n,k}+nk), edge deletions have a complexity of O(m*k), and recalculation has a complexity of O(n^2) (let n denote the number of vertices of the graph, m the number of edges after updates, and k the number of updates). I think that if we want to fine-tune this heuristic, we need to collect a lot of statistics and mathematically fit the function (n,m,k)->bool (whether to recalculate or not).

Since we cannot measure running time directly in any realistic scenario, I think that the best way to approach this would be to count the number of graph nodes visited during DFS walks and see how many of them we get with different heuristics.
I think that the bugs already established that we need to have some fallback heuristic. I'd suggest being careful here and trying at least a couple of different policies on different projects (maybe clang, sqlite, the lnt test suite). There are a couple of properties I propose for the heuristic:

  1. Don't always recalculate small trees (say, n < 100)
  2. Recalculate when k > n
  3. Recalculate in the case of the performance bugs we saw
  4. Visit fewer nodes than the current updater for Clang + SQLite + Oggenc + ...

For the k>n you mentioned, I think you mean k>n/α (α a constant), because for the reproducer in 37929 the minimum value of α that makes the time consumed by clang acceptable is about 3 (so α should definitely be larger than 3). But selecting a higher value like "75" also speeds up other inputs more (about 2x~4.5x on the clang*.bc files).

@kuhar , do you think we need to use another heuristic involving m? Though I think maybe using only n is sufficient? :)

Intuitively, I think that the heuristic can be some linear function: a * n + b * k > threshold, where a and b are coefficients and the threshold is a constant. I don't think it's possible to calculate m efficiently, is it?

m is the number of edges after updates. I think m can only be calculated by traversing two levels of linked lists to count the edges, which seems to be O(m). I just wonder whether it is worth finding some heuristic like n^2/m (I'm not sure whether it works better than the current one) that can avoid some cases where the graph is really sparse but still triggers the recalculation.

kuhar added a comment.Oct 15 2018, 7:57 AM

For the k>n you mentioned, I think you mean k>n/α (α a constant), because for the reproducer in 37929 the minimum value of α that makes the time consumed by clang acceptable is about 3 (so α should definitely be larger than 3). But selecting a higher value like "75" also speeds up other inputs more (about 2x~4.5x on the clang*.bc files).

OK, makes sense if that's the case in the bug report.

Would you be able to produce some numbers for clang for a heuristic like this:
if (n > 100) check k * a > n; else check k > n
with different values of a?

NutshellySima added a comment.EditedOct 15 2018, 8:06 AM

For the k>n you mentioned, I think you mean k>n/α (α a constant), because for the reproducer in 37929 the minimum value of α that makes the time consumed by clang acceptable is about 3 (so α should definitely be larger than 3). But selecting a higher value like "75" also speeds up other inputs more (about 2x~4.5x on the clang*.bc files).

OK, makes sense if that's the case in the bug report.

Would you be able to produce some numbers for clang for a heuristic like this:
if (n > 100) check k * a > n; else check k > n
with different values of a?

I have tested the heuristic k>n/α with α=10/40/65/70/75/80/100/300/1000 on the clang .bc files (but I didn't save the results to a file :\). 10 and 1000 produce very bad results ("10" is still better than the current implementation), 300 produces a better result than 1000, and 40/100 produce better results than 10/300/1000. Values from 65 to 80 don't make much difference and produce the best results among all the αs I selected.

kuhar added a comment.Oct 15 2018, 8:13 AM

I have tested the heuristic k>n/α with α=10/40/65/70/75/80/100/300/1000 on the clang .bc files (but I didn't save the results to a file :\). 10 and 1000 produce very bad results ("10" is still better than the current implementation), 300 produces a better result than 1000, and 40/100 produce better results than 10/300/1000. Values from 65 to 80 don't make much difference and produce the best results among all the αs I selected.

I'd be very interested to see the numbers for these constants.

I have tested the heuristic k>n/α with α=10/40/65/70/75/80/100/300/1000 on the clang .bc files (but I didn't save the results to a file :\). 10 and 1000 produce very bad results ("10" is still better than the current implementation), 300 produces a better result than 1000, and 40/100 produce better results than 10/300/1000. Values from 65 to 80 don't make much difference and produce the best results among all the αs I selected.

I'd be very interested to see the numbers for these constants.

OK, I'll try to get those numbers again.
For the unit tests of the incremental updater, do you think the updater should directly apply updates for small k (no recalculation with few updates) or for small n (no recalculation on a small graph)?

kuhar added a comment.Oct 15 2018, 8:25 AM

For the unit tests of the incremental updater, do you think the updater should directly apply updates for small k (no recalculation with few updates) or for small n (no recalculation on a small graph)?

I think that for small n it shouldn't recalculate when k < n.

It might also be worth investigating making these constants hidden command-line parameters. For one, it makes testing faster, and it also gives users an emergency knob (the cl::opt() mechanism).
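For reference, a hidden knob of this kind normally lives in a .cpp file and looks roughly like the sketch below; the flag name and default are illustrative, and the later comments discuss why a header-only class makes this awkward:

  #include "llvm/Support/CommandLine.h"
  using namespace llvm;

  // Hidden option: not listed in -help, but available as an emergency knob.
  static cl::opt<unsigned> DomTreeRecalcFactor(
      "dom-tree-recalc-factor", cl::Hidden, cl::init(40),
      cl::desc("Recalculate the dominator tree when the legalized update "
               "batch is larger than tree-size divided by this factor"));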

include/llvm/Support/GenericDomTreeConstruction.h
1198

I think this comment is still too vague. It doesn't need to be altered until the final algorithm is selected, but I'd like to see more details here. Seemingly arbitrary constants need supporting information, more than "trust me, I ran some tests".

NutshellySima added a comment.EditedOct 16 2018, 6:54 AM

@kuhar, here are the statistics for k*α>n:
Let y = (time used by DT.applyUpdates) / (time used by all domtree updating operations).
raw statistics:
clang: https://paste.ubuntu.com/p/yC8sSQqbsx/
SQLite: https://paste.ubuntu.com/p/cTpqDthsNC/
oggenc: https://paste.ubuntu.com/p/4RJkCxjhkY/

α                          | y/clang | y/sqlite | y/oggenc
0 (always update directly) | 33.20%  | 24.90%   | 28.40%
5                          | 22.10%  | 19.60%   | 20.80%
10                         | 20.90%  | 20.00%   | 20.30%
20                         | 20.70%  | 21.70%   | 22.50%
30                         | 13.10%  | 23.10%   | 23.70%
40                         | 13.60%  | 24.10%   | 24.90%
55                         | 14.20%  | 26.00%   | 28.00%
65                         | 13.80%  | 27%      | 28.70%
70                         | 14.10%  | 27.30%   | 29.30%
75                         | 14.00%  | 28.80%   | 30.10%
80                         | 14.20%  | 28.10%   | 30.20%
85                         | 14.20%  | 29.50%   | 30.80%
100                        | 14.70%  | 30.60%   | 31.00%
300                        | 17.60%  | 42.90%   | 34.60%
1000                       | 56.40%  | 43.00%   | 34.50%

I think maybe "40" is a better choice?

kuhar added a comment.Oct 16 2018, 8:34 AM

@kuhar , here is the statistics for k*α>n:

Awesome, thanks for the numbers. It's much easier to reason about it this way. But I'm a bit confused by why you measured it like this. Can we have statistics with respect to nodes visited during DFS? That should be highly correlated and give us more confidence in the measured times.

I think maybe "40" is a better choice?

30 seems best to me :)

I'd still like to see the constant as a hidden command line argument cl::opt(). Otherwise I think this is an excellent patch with considerable compile-time speedups. Nice work @NutshellySima .

kuhar added a comment.Oct 24 2018, 8:06 AM

I'd still like to see the constant as a hidden command line argument cl::opt(). Otherwise I think this is an excellent patch with considerable compile-time speedups. Nice work @NutshellySima .

The issue I find with cl::opt here is that this is a generic class that may not necessarily have a corresponding .cpp file -- there can be multiple clients of GenericDomTreeConstruction, each with different requirements. Because of that, it's not clear to me which translation unit one would put the cl::opt for the constant in.
Is that a solved problem somewhere else in LLVM? I'm not familiar with using cl::opt in header files.

kuhar added a comment.Oct 24 2018, 8:17 AM

I'd still like to see the constant as a hidden command line argument cl::opt(). Otherwise I think this is an excellent patch with considerable compile-time speedups. Nice work @NutshellySima .

The issue I find with cl::opt here is that this is a generic class that may not necessarily have a corresponding .cpp file -- there can be multiple clients of GenericDomTreeConstruction, each with different requirements. Because of that, it's not clear to me which translation unit one would put the cl::opt for the constant in.
Is that a solved problem somewhere else in LLVM? I'm not familiar with using cl::opt in header files.

One reasonable alternative I can think of would be adding an extra template or function parameter to applyUpdates that can be defaulted to some reasonable constant. This would also allow us to adjust the constant in optimizations that are known to be sensitive to the length of the final legalized batch.
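A sketch of the shape this alternative could take; the signature is hypothetical, not an existing LLVM API:

  #include "llvm/ADT/ArrayRef.h"

  // Hypothetical: let each caller override how aggressively to fall back,
  // defaulting to the constant settled on in this review.
  template <typename DomTreeT, typename UpdateT>
  void applyUpdatesWithFallback(DomTreeT &DT, llvm::ArrayRef<UpdateT> Updates,
                                unsigned RecalcFactor = 40) {
    if (Updates.size() * RecalcFactor > DT.size()) {
      // Fall back to a full rebuild (DT.recalculate(...) in the real tree).
    } else {
      // Apply the batch incrementally.
    }
  }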

NutshellySima updated this revision to Diff 171112.EditedOct 25 2018, 8:53 AM
NutshellySima set the repository for this revision to rL LLVM.
  1. Adjust the constant to be 40.
  2. Disable recalculation when n<=100 && k<=n, so the unit tests of the incremental algorithm work as before (see the sketch below).
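In predicate form, the updated policy amounts to the following sketch (names illustrative; in the patch the check runs inside the updater, after legalization):

  #include <cstddef>

  // n = tree size, k = number of legalized updates.
  bool shouldRecalculate(std::size_t N, std::size_t K) {
    if (N <= 100)        // small trees keep exercising the incremental path
      return K > N;      //   unless the batch outgrows the tree
    return K * 40 > N;   // larger trees: rebuild once k exceeds n / 40
  }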

The issue I find with cl::opt here is that this is a generic class that may not necessarily have a corresponding .cpp file -- there can be multiple clients of GenericDomTreeConstruction, each with different requirements. Because of that, it's not clear to me which translation unit one would put the cl::opt for the constant in.
Is that a solved problem somewhere else in LLVM? I'm not familiar with using cl::opt in header files.

Fair enough. I'm more used to working on function passes that have several cl::opt() lines at the top of the .cpp file. If we don't have a corresponding .cpp, it's non-trivial to add it as a command-line argument. I too have never tried to use cl::opt() in a .h.

The issue I find with cl::opt here is that this is a generic class that may not necessarily have a corresponding .cpp file -- there can be multiple clients of GenericDomTreeConstruction, each with different requirements. Because of that, it's not clear to me which translation unit one would put the cl::opt for the constant in.
Is that a solved problem somewhere else in LLVM? I'm not familiar with using cl::opt in header files.

Fair enough. I'm more used to working on function passes that have several cl::opt() lines at the top of the .cpp file. If we don't have a corresponding .cpp, it's non-trivial to add it as a command-line argument. I too have never tried to use cl::opt() in a .h.

I'm also not familiar with cl::opt() in headers and I think adding a function parameter is an option.

The constant "40" was selected based on the previous table and additional tests with "25" and "50" (see below):

α            | y/clang | y/sqlite | y/oggenc
0 (baseline) | 33.20%  | 25.60%   | 28.40%
20           | 20.70%  | 21.70%   | 22.50%
25           | 20.40%  | ---      | ---
30           | 13.10%  | 23.10%   | 23.70%
40           | 13.60%  | 24.10%   | 24.90%
50           | ---     | 25.60%   | ---
55           | 14.20%  | 26.00%   | 28.00%
40 (n>=100)  | 13.30%  | 21.20%   | 24.50%

I'll test n>=200/300/400 to see whether it is better.

The constant "40" was selected based on the previous table and additional tests with "25" and "50" (see below):

α            | y/clang | y/sqlite | y/oggenc
0 (baseline) | 33.20%  | 25.60%   | 28.40%
20           | 20.70%  | 21.70%   | 22.50%
25           | 20.40%  | ---      | ---
30           | 13.10%  | 23.10%   | 23.70%
40           | 13.60%  | 24.10%   | 24.90%
50           | ---     | 25.60%   | ---
55           | 14.20%  | 26.00%   | 28.00%
40 (n>=100)  | 13.30%  | 21.20%   | 24.50%

I'll test n>=200/300/400 to see whether it is better.

For the oggenc case, n>=200/300 doesn't seem to be better.

α           | y/oggenc
40 (n>=100) | 24.5%
40 (n>=200) | 25.8%
40 (n>=300) | 27.2%

Given that the current heuristic, 40 (n>=100), outperforms updating directly without recalculation (the 0/baseline row) on clang, sqlite, and oggenc, I think it is conservative enough and is not expected to harm the performance of any input significantly. :)

NutshellySima marked an inline comment as done.Oct 25 2018, 10:08 AM
brzycki accepted this revision.Oct 25 2018, 10:19 AM

LGTM. I have no doubt someone will speak up if they detect a regression. :)

This revision is now accepted and ready to land.Oct 25 2018, 10:19 AM
kuhar accepted this revision.Oct 25 2018, 10:23 AM

LGTM. Thanks for looking into this!

This revision was automatically updated to reflect the committed changes.

Since bug 37929 is a regression in version 7.0, should we backport this patch to the 7.0.1 branch later (maybe in November) if no one reports issues with it?

Since bug 37929 is a regression in version 7.0, should we backport this patch to the 7.0.1 branch later (maybe in November) if no one reports issues with it?

Sounds like a good idea

Since bug 37929 is a regression in version 7.0, should we backport this patch to the 7.0.1 branch later (maybe in November) if no one reports issues with it?

Sounds like a good idea

I recommend waiting about 2 weeks for the patch to get through all the various automated regression test frameworks out there. If there's a performance regression you'll likely hear about it within that time frame.

Since bug 37929 is a regression in version 7.0, should we backport this patch to the 7.0.1 branch later (maybe in November) if no one reports issues with it?

Sounds like a good idea

I recommend waiting about 2 weeks for the patch to get through all the various automated regression test frameworks out there. If there's a performance regression you'll likely hear about it within that time frame.

Thanks for your advice. :)