This is an archive of the discontinued LLVM Phabricator instance.

[DomTree] Fix order of domtree updates in MergeBlockIntoPredecessor.
AbandonedPublic

Authored by efriedma on Nov 19 2018, 3:00 PM.

Details

Summary

If the "delete" update is ordered before the "insert" update, DeleteUnreachable will delete all the descendants of the basic block, so the update isn't really incremental. Changing the order makes the update incremental and fast. This makes an extreme C++ testcase compile three times faster.
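To illustrate the effect, here is a toy Python sketch of CFG reachability (not LLVM's actual updater): merging a block into its predecessor schedules one edge deletion and one insertion, and if the deletion is applied first, every descendant of the merged block becomes transiently unreachable.

```python
from collections import deque

def reachable(succs, entry):
    """BFS over a successor map; returns the set of reachable blocks."""
    seen, work = {entry}, deque([entry])
    while work:
        for s in succs.get(work.popleft(), []):
            if s not in seen:
                seen.add(s)
                work.append(s)
    return seen

# Chain CFG: entry -> b0 -> b1 -> ... -> b9; 11 blocks total.
# Merging b0 into entry needs Delete(entry, b0) and Insert(entry, b1).
succs = {"entry": ["b0"]}
succs.update({f"b{i}": [f"b{i + 1}"] for i in range(9)})

# Delete-first: after the deletion alone, the entire chain is
# unreachable, so an updater processing updates in order walks
# every descendant in DeleteUnreachable.
after_delete = dict(succs)
after_delete["entry"] = []
unreachable_mid = 11 - len(reachable(after_delete, "entry"))

# Insert-first: adding (entry, b1) before removing (entry, b0) keeps
# every block reachable at every intermediate step.
after_insert = dict(succs)
after_insert["entry"] = ["b0", "b1"]
still_reachable = len(reachable(after_insert, "entry"))

print(unreachable_mid)   # 10 blocks transiently unreachable
print(still_reachable)   # all 11 blocks remain reachable
```

With delete-first, the intermediate CFG disconnects all 10 chain blocks; with insert-first, no block ever becomes unreachable except the one being merged away.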

I don't currently have a testcase I can upload, but I can try to construct one if it would be helpful.

I'm not sure this is the right fix; I'm not deeply familiar with how domtree updating works. Should the batch updating code handle this itself, somehow?

Diff Detail

Repository
rL LLVM

Event Timeline

efriedma created this revision. Nov 19 2018, 3:00 PM
kuhar added a comment. Nov 19 2018, 3:12 PM

We experimented with different orderings of updates before, including scheduling deletions before insertions, but didn't discover any promising strategies. Because of that, updates are performed in the exact order they were scheduled (except for optimizing out trivially redundant ones).

The preferred API to use is currently DomTreeUpdater, but I don't think it would affect performance here.

Any thoughts @NutshellySima?

NutshellySima accepted this revision. Nov 20 2018, 10:38 AM

> We experimented with different orderings of updates before, including scheduling deletions before insertions, but didn't discover any promising strategies. Because of that, updates are performed in the exact order they were scheduled (except for optimizing out trivially redundant ones).
>
> The preferred API to use is currently DomTreeUpdater, but I don't think it would affect performance here.
>
> Any thoughts @NutshellySima?

I have seen similar results, namely that scheduling edge insertions before deletions can sometimes be faster, when I recently tried to make UpdateAnalysisInformation in BasicBlockUtils use the incremental updater. In the case of UpdateAnalysisInformation, the number of edge insertions and deletions is nearly the same.

If I schedule edge deletions before insertions, I get

===-------------------------------------------------------------------------===
                              DomTree Calculation
===-------------------------------------------------------------------------===
  Total Execution Time: 144.1258 seconds (144.0620 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  90.6385 ( 63.9%)   1.4306 ( 63.0%)  92.0691 ( 63.9%)  92.0375 ( 63.9%)  delete-reachable -- DomTree
  45.4411 ( 32.0%)   0.6180 ( 27.2%)  46.0591 ( 32.0%)  46.0432 ( 32.0%)  delete-unreachable -- DomTree
   3.8668 (  2.7%)   0.1652 (  7.3%)   4.0320 (  2.8%)   4.0172 (  2.8%)  insert-unreachable -- DomTree
   1.9094 (  1.3%)   0.0562 (  2.5%)   1.9657 (  1.4%)   1.9640 (  1.4%)  insert-reachable -- DomTree
  141.8558 (100.0%)   2.2700 (100.0%)  144.1258 (100.0%)  144.0620 (100.0%)  Total

But if I schedule edge insertions before deletions, I get

===-------------------------------------------------------------------------===
                              DomTree Calculation
===-------------------------------------------------------------------------===
  Total Execution Time: 95.0652 seconds (95.0225 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  91.7329 ( 98.4%)   1.7553 ( 95.1%)  93.4882 ( 98.3%)  93.4364 ( 98.3%)  delete-reachable -- DomTree
   1.2455 (  1.3%)   0.0733 (  4.0%)   1.3188 (  1.4%)   1.3275 (  1.4%)  insert-unreachable -- DomTree
   0.1300 (  0.1%)   0.0074 (  0.4%)   0.1374 (  0.1%)   0.1376 (  0.1%)  delete-unreachable -- DomTree
   0.1107 (  0.1%)   0.0102 (  0.6%)   0.1209 (  0.1%)   0.1210 (  0.1%)  insert-reachable -- DomTree
  93.2191 (100.0%)   1.8462 (100.0%)  95.0652 (100.0%)  95.0225 (100.0%)  Total

Though I don't have bitcode samples to reproduce your findings, I guess the three-times speedup is because the DomTree is spending a lot of time in DeleteUnreachable().

Maybe we can add some high-level APIs to DTU that encode these best practices, such as updateAfterSpliceBlocks and updateAfterChangePredecessorTo, or just sort the updates after legalization. :)
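The "sort after legalization" idea could look roughly like the following Python sketch. This is hypothetical, not LLVM's actual DTU code; updateAfterSpliceBlocks and updateAfterChangePredecessorTo are just the names floated above, and the "last update per edge wins" deduplication is a simplification of real legalization.

```python
def legalize_and_sort(updates):
    """Drop trivially redundant updates (a later insert/delete of the
    same edge supersedes an earlier one), then schedule all insertions
    ahead of all deletions. Hypothetical sketch only."""
    last = {}
    for kind, src, dst in updates:
        last[(src, dst)] = kind  # the last update on an edge wins
    result = [(kind, src, dst) for (src, dst), kind in last.items()]
    result.sort(key=lambda u: u[0] != "insert")  # stable sort: inserts first
    return result

print(legalize_and_sort([("delete", "pred", "bb"),
                         ("insert", "pred", "succ")]))
# [('insert', 'pred', 'succ'), ('delete', 'pred', 'bb')]
```

The stable sort preserves the relative order within insertions and within deletions, so only the insert/delete interleaving changes.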

This revision is now accepted and ready to land. Nov 20 2018, 10:38 AM
kuhar added inline comments. Nov 20 2018, 11:34 AM
lib/Transforms/Utils/BasicBlockUtils.cpp
163

I would appreciate a comment explaining that the order of updates matters in this place, in case we want to revisit it some time later.

167

nit: We could use a range-based for loop here instead (and also in the loop below)

> Though I don't have bitcode samples to reproduce your findings, I guess the three-times speedup is because the DomTree is spending a lot of time in DeleteUnreachable().
>
> Maybe we can add some high-level APIs to DTU that encode these best practices, such as updateAfterSpliceBlocks and updateAfterChangePredecessorTo, or just sort the updates after legalization. :)

Intuitively, it would make sense to schedule insertions first when we anticipate many DeleteUnreachable deletions -- insertions cause the CFG to become more dense, which turns unreachable deletions into reachable ones. I don't know of any way to guess whether a batch of deletions will cause more unreachable than reachable deletions, and I don't believe scheduling insertions before deletions is always going to be profitable.

Has anyone run CTMark test-suite numbers on this patch before and after? @efriedma is it possible to post the extreme C++ case to analyze exactly what's happening in the Dominator update routines and why?

lib/Transforms/Utils/BasicBlockUtils.cpp
163

+1

> Has anyone run CTMark test-suite numbers on this patch before and after?

I would expect this has very little impact if you don't hit the quadratic case.

> is it possible to post the extreme C++ case to analyze exactly what's happening in the Dominator update routines and why?

The original code is proprietary, so I can't post it... I'll try to come up with a similar artificial testcase.

My first attempt at producing an artificial testcase is the following... but the patch doesn't fix this one. I'll look into why my testcase is different. (Produce an IR file with Python, then run opt -gvn on the result.)

count = 10000
print("define void @f(i1 %x) {entry:br i1 0, label %bb0, label %bb" + str(count))
for i in range(count):
  print("bb{}: br label %bb{}".format(i, i + 1))
print("bb{}: ret void}}".format(count))

Another testcase variation (again, generate IR with Python, run opt -gvn):

count = 2000
print("define void @f(i1 *%p) {entry:%x = load volatile i1, i1* %p br i1 %x, label %bb0, label %bb" + str(count))
for i in range(count):
  print("bb{}: %x{} = load volatile i1, i1* %p br i1 %x{}, label %bbt{}, label %bbf{}".format(i, i, i, i, i))
  print("bbt{}: br label %bb{}".format(i, i + 1))
  print("bbf{}: br label %bbfx{}".format(i, i))
  print("bbfx{}: br label %bb{}".format(i, i + 1))
print("bb{}: ret void}}".format(count))

This structure is helped by my patch, although not as much as my original testcase. I'll try to dig a little more to see if I can come up with a structure that reproduces the "three times" speedup; from the debug output, I think insert-first also limits renumbering for certain domtree structures.

Fundamentally, does this have to be an expensive operation? It seems like there should be some way to keep a "local" operation like this from propagating across the entire tree.

kuhar added a comment. Nov 20 2018, 2:06 PM

> Fundamentally, does this have to be an expensive operation? It seems like there should be some way to keep a "local" operation like this from propagating across the entire tree.

Historically, this and other cases were handled by manual tree modifications based on local tree properties. The problem was that there were dozens of small and very difficult-to-detect bugs in that manual update code, which I replaced with automatic incremental updates over the summer of 2017. Even though most of them seemed simple and reasonably documented, there are non-trivial edge cases (e.g., inside irreducible control flow) that people missed in virtually all the code that performed manual updates. The consensus was that we'd rather take a small performance hit and use a single correct updating algorithm, only falling back to manual updates when running time degraded significantly.

So far we haven't had many people complaining about the performance of the incremental updater; it was actually quite surprising that the first optimized version performed so well in practice. There were a handful of performance problems in various pieces of the optimizer that we either fixed or mitigated, but we were only able to do so with good repros. It's easy to create big and weird graphs (like ladder graphs), but we hardly ever have to deal with IR like this coming from a real frontend. It'd be great if you could produce a real repro out of your code. It should be enough to take a snapshot of the IR just before you start scheduling updates in this pass, and recreate it with branching only -- we don't need any non-branching instructions or real value names.

@efriedma I ran CTMark on this patch, with the following commit as tip (both compilers are Release builds):

commit 9b6b09070e777106e6e91c15406e197c7d18e1fb (origin/master, origin/HEAD)
Author: Ilya Biryukov <ibiryukov@google.com>
Date:   Mon Nov 26 15:38:01 2018 +0000

    [clangd] Add type boosting in code completion

    Reviewers: sammccall, ioeric

    Reviewed By: sammccall

    Subscribers: MaskRay, jkorous, arphaman, kadircet, cfe-commits

    Differential Revision: https://reviews.llvm.org/D52276

The results show it's mostly good, except for tramp3d-v4, which regresses about 9%:

#  left column: D54730/results.json
# right column: tip/results.json
#  metric name: compile_time
     41.3040 -> 37.9320      [  8.89%]  CTMark/tramp3d-v4/tramp3d-v4.test
    125.0240 <- 132.6200     [  6.08%]  CTMark/7zip/7zip-benchmark.test
     23.3760 -> 22.8840      [  2.15%]  CTMark/mafft/pairlocalalign.test
     43.6280 <- 44.3680      [  1.70%]  CTMark/SPASS/SPASS.test
     52.6560 <- 53.0440      [  0.74%]  CTMark/lencod/lencod.test
     31.9480 <- 32.1480      [  0.63%]  CTMark/consumer-typeset/consumer-typeset.test
     35.1400 <- 35.3160      [  0.50%]  CTMark/kimwitu++/kc.test
     51.0240 <- 51.2560      [  0.45%]  CTMark/ClamAV/clamscan.test
     26.1120 <- 26.2080      [  0.37%]  CTMark/sqlite3/sqlite3.test
     86.8120 <- 86.9920      [  0.21%]  CTMark/Bullet/bullet.test

I'm slightly concerned this patch may regress compile time for others whose code shape resembles tramp3d.

I just re-ran both to make sure it wasn't caused by other factors on the host. There's some movement in the <= 2% range but tramp3d-v4 is the same regression.

#  left column: D54730/results.json
# right column: tip/results.json
#  metric name: compile_time
     43.7640 -> 39.6280      [ 10.44%]  CTMark/tramp3d-v4/tramp3d-v4.test
     26.4680 <- 27.0880      [  2.34%]  CTMark/sqlite3/sqlite3.test
    130.5600 -> 128.1000     [  1.92%]  CTMark/7zip/7zip-benchmark.test
     35.4600 <- 36.1000      [  1.80%]  CTMark/kimwitu++/kc.test
     32.2400 -> 31.7520      [  1.54%]  CTMark/consumer-typeset/consumer-typeset.test
     23.3600 <- 23.6480      [  1.23%]  CTMark/mafft/pairlocalalign.test
     52.7280 -> 52.5520      [  0.33%]  CTMark/lencod/lencod.test
     50.8120 <- 50.9560      [  0.28%]  CTMark/ClamAV/clamscan.test
     44.1040 <- 44.2160      [  0.25%]  CTMark/SPASS/SPASS.test
     86.8800 -> 86.8200      [  0.07%]  CTMark/Bullet/bullet.test

@efriedma have you had any luck generating an artificial testcase?

Since I sent out D62981 with identical changes and the same motivation (thank you for pointing this out @NutshellySima), please let me add a small comment here.

> Intuitively, it would make sense to schedule insertions first when we anticipate many DeleteUnreachable deletions -- insertions cause the CFG to become more dense, which turns unreachable deletions into reachable ones. I don't know of any way to guess whether a batch of deletions will cause more unreachable than reachable deletions, and I don't believe scheduling insertions before deletions is always going to be profitable.

I think this is entirely correct. We have no way of knowing *in general* that insertions are more profitable to do before deletions. So I take back the comment of "let's sort these in the DTU".
In this *particular case* however (inside MergeBlockIntoPredecessor), I think it *is* always profitable. This method only merges two blocks, PredBB and BB, where PredBB has no other successors and BB has no other predecessors. Deleting the edges first is very likely to make some blocks unreachable, only to become reachable again after an insert.
So my suggestion would be to make the change only inside MergeBlockIntoPredecessor with a detailed comment explaining why the order matters here.
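That intuition can be checked on a toy CFG. The following Python reachability sketch (an illustration, not LLVM code) applies the merge pattern's updates in each order: delete-first leaves BB's whole subtree transiently unreachable, while insert-first only ever disconnects BB itself.

```python
from collections import deque

def reachable(edges, entry):
    """BFS over an edge set; returns the set of reachable blocks."""
    seen, work = {entry}, deque([entry])
    while work:
        n = work.popleft()
        for s, d in edges:
            if s == n and d not in seen:
                seen.add(d)
                work.append(d)
    return seen

# Merge pattern: pred's only successor is bb, bb's only predecessor is
# pred, and bb branches to left/right.
nodes = {"entry", "pred", "bb", "left", "right", "exit"}
edges = {("entry", "pred"), ("pred", "bb"), ("bb", "left"),
         ("bb", "right"), ("left", "exit"), ("right", "exit")}

inserts = [("insert", ("pred", "left")), ("insert", ("pred", "right"))]
deletes = [("delete", ("pred", "bb")), ("delete", ("bb", "left")),
           ("delete", ("bb", "right"))]

def worst_unreachable(updates):
    """Apply updates one at a time; return the maximum number of blocks
    unreachable at any intermediate point (bb itself ends up
    unreachable by design: it is being merged away)."""
    cur, worst = set(edges), 0
    for kind, e in updates:
        if kind == "insert":
            cur.add(e)
        else:
            cur.discard(e)
        worst = max(worst, len(nodes) - len(reachable(cur, "entry")))
    return worst

print(worst_unreachable(deletes + inserts))  # 4: bb, left, right, exit
print(worst_unreachable(inserts + deletes))  # 1: only bb
```

This matches the argument above: for this specific merge shape, insert-first never disconnects anything except the block being removed.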

Thoughts?

@efriedma: Do you have the time to continue this, or should I take over in D62981 instead?

Herald added a project: Restricted Project. Jun 7 2019, 9:33 AM

Feel free to take it over, particularly since it looks like you have a testcase you can easily share.

kuhar resigned from this revision. Sep 30 2019, 9:48 AM

This already landed in D62981.

efriedma abandoned this revision. Sep 30 2019, 12:02 PM