Page MenuHomePhabricator

[Dominators] Simplify and optimize path compression used in link-eval forest.
ClosedPublic

Authored by MaskRay on Feb 17 2019, 5:49 AM.

Details

Summary
  • NodeToInfo[*] have been allocated so the addresses are stable. We can store them instead of NodePtr to save NumToNode lookups.
  • Nodes are traversed twice. Using Visited to check the traversal number is expensive and obscure. Just split the two traversals into two loops explicitly.
  • The check VInInfo.DFSNum < LastLinked is redundant as it is implied by VInInfo->Parent < LastLinked
  • VLabelInfo PLabelInfo are used to save a NodeToInfo lookup in the second traversal.

Also add some comments explaining eval().

This shows a ~4.5% improvement (9.8444s -> 9.3996s) on

perf stat -r 10 taskset -c 0 ~/llvm/Release/bin/opt -passes=$(printf '%.0srequire<domtree>,invalidate<domtree>,' {1..1000})'require<domtree>' -disable-output sqlite-autoconf-3270100/sqlite3.bc

Diff Detail

Event Timeline

MaskRay created this revision.Feb 17 2019, 5:49 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 17 2019, 5:49 AM
kuhar requested changes to this revision.Feb 17 2019, 10:58 AM

Hi @MaskRay,

This looks like a fantastic improvement! The only issue is that the code, both original and after your changes, lacks comments.

For example, I can see you put a comment that identified that the second part of the function does path compression, but it's not clear how exactly, and what is the first part of the function doing.

Next, what is the motivation behind this change? Have you identified some bottlenecks in the path compression part, have you suffered form the pathological quadratic complexity of SemiNCA? Perhaps it would make more sense to implement a hybrid algorithm that uses Lengauer-Tarjan for the initial full construction and SemiNCA for incremental updates? I have been thinking about it for quite some time now, but there weren't any complaints about the quadratic complexity yet, thus I figured the additional maintenance cost is not justified.

What was your evaluation for this patch? What benchmarks did you run? How do the numbers look like?

This revision now requires changes to proceed.Feb 17 2019, 10:58 AM
kuhar added a subscriber: brzycki.
MaskRay updated this revision to Diff 187187.Feb 17 2019, 7:29 PM
MaskRay retitled this revision from [Dominators] Simplify and optimize path compression used in eval() to [Dominators] Simplify and optimize path compression used in link-eval forest..

Add comments

Thanks for the quick feedback!

This looks like a fantastic improvement! The only issue is that the code, both original and after your changes, lacks comments.

For example, I can see you put a comment that identified that the second part of the function does path compression, but it's not clear how exactly, and what is the first part of the function doing.

I added some comments explaining the eval() function as used in the link-eval forest. The file header didn't say Semi-NCA is O(N^2) so I have changed it to mention that, as the original one discussing eval() time complexity may give people false impression that this is almost linear or O(N log N) (at least me when I was reading the implementation).

Next, what is the motivation behind this change? Have you identified some bottlenecks in the path compression part, have you suffered form the pathological quadratic complexity of SemiNCA? Perhaps it would make more sense to implement a hybrid algorithm that uses Lengauer-Tarjan for the initial full construction and SemiNCA for incremental updates? I have been thinking about it for quite some time now, but there weren't any complaints about the quadratic complexity yet, thus I figured the additional maintenance cost is not justified.

No particular motivation when I started doing this :) I just came across this file and learned some dominators in the weekend. I haven't suffered from the worst-case O(N^2) complexity and I doubt it may happen in practice. sncaworst in Linear-Time Algorithms for Dominators and Related Problems requires an O(N) dominator tree path with O(N) children reachable from the top of the tree path. Anyway, I've changed the file header to mention this may require attention.

What was your evaluation for this patch? What benchmarks did you run? How do the numbers look like?

This patch should be obvious improvement but it may be small. I think there are several other aspects that can be improved and may have a larger impact. The biggest problem from my view is the representation. The NumToNode hash table lookup immediately followed by NodeToInfo lookup is a common pattern in the implementation and this may be expensive. Label can be an unsigned instead of a NodePtr to save some indirect lookups (this is the optimization opportunity we can leverage after the transition from SLT to Semi-NCA). IDom can probably do the same but that seems to require more efforts. DFSNum may be a candidate to be removed, etc. I don't understand why numbering starts from 1 but it is only a small aesthetic issue.

I'll have to read An Experimental Study of Dynamic Dominators and the other parts of the dominator implementation in llvm first to get a better idea where can be improved :)

Seeking for suggestions of tests. What workflow do you use when changing this file? I'm currently thinking of IRTests and check-llvm-analysis when experimenting and check-llvm or check-all for a comprehensive testing.

Could you please share yours? Especially the statistics in the talk Dominator Trees and incremental updates that transcend time...

kuhar added a comment.Feb 17 2019, 9:46 PM

Thanks for the comments, it looks amazing now.

No particular motivation when I started doing this :) I just came across this file and learned some dominators in the weekend.

Cool.

I haven't suffered from the worst-case O(N^2) complexity and I doubt it may happen in practice. sncaworst in Linear-Time Algorithms for Dominators and Related Problems requires an O(N) dominator tree path with O(N) children reachable from the top of the tree path. Anyway, I've changed the file header to mention this may require attention.

Great. It was indirectly mentioned in the part that described various checks, but it seems much more explicit now.

This patch should be obvious improvement but it may be small. I think there are several other aspects that can be improved and may have a larger impact. The biggest problem from my view is the representation. The NumToNode hash table lookup immediately followed by NodeToInfo lookup is a common pattern in the implementation and this may be expensive. Label can be an unsigned instead of a NodePtr to save some indirect lookups (this is the optimization opportunity we can leverage after the transition from SLT to Semi-NCA).

The patch changes the stack size for eval, which may have unexpected impact on caching and code layout. This is often very surprising and difficult to reason about.
As expected, the current version of SNCA spends most of time on DenseMap lookups, and profiles are not very informative (at least to me). Reducing the use of hashtables can be quite some improvement.

> IDom can probably do the same but that seems to require more efforts. DFSNum may be a candidate to be removed, etc. I don't understand why numbering starts from 1 but it is only a small aesthetic issue.
Keep in mind that IDom is a part of the public interface of DomTreeNode.

Seeking for suggestions of tests. What workflow do you use when changing this file? I'm currently thinking of IRTests and check-llvm-analysis when experimenting and check-llvm or check-all for a comprehensive testing.
Could you please share yours? Especially the statistics in the talk Dominator Trees and incremental updates that transcend time...

For the incremental updater, I'd usually manually instrument some of the GenericDomTree functions and count the number of nodes visited, times spent doing insertions, deletions, SNCA, etc. You can find some code for it in D50303 and D36884. For incremental updates, I just run -O3 on some big bitcode files, like clang, opt, sqlite.

For the construction algorithm itself, I had an llvm tool that loaded an LLVM module and randomly run construction on each functions a couple of times. See D36897.

I usually run all tests in llvm (ninja check-all), on the llvm-test-suite with LNT, run it on the it on bitcode linked into a single file (on clang, opt, sqlite, and more recently webassembly and rippled) -- I use gllvm for it. In order to make it work with gllvm, I usually compile projects with something like this:

export LLVM_COMPILER_PATH=my_path/llvm-9.0/bin
cmake .. -GNinja -DCMAKE_CXX_COMPILER=/home/kuba/go/bin/gclang++ DCMAKE_C_COMPILER=/home/kuba/go/bin/gclang -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS='-Xclang -disable-llvm-optzns' -DCMAKE_C_FLAGS='-Xclang -disable-llvm-optzns'

And then extract bitcode with get-bc and feed it into opt with -O3 -- aggressive inlining helps with exercising dominator construction due to larger CFGs.

kuhar added a comment.Feb 17 2019, 9:48 PM

Another idea for benchmarking is to count the number of iterations of loops that you expect to be affected by your patches and making sure it decreases, as an indirect measure of running time improvement.

The improvement is quite fantastic!

For the incremental updater, I'd usually manually instrument some of the GenericDomTree functions and count the number of nodes visited, times spent doing insertions, deletions, SNCA, etc. You can find some code for it in D50303 and D36884. For incremental updates, I just run -O3 on some big bitcode files, like clang, opt, sqlite.
...
And then extract bitcode with get-bc and feed it into opt with -O3 -- aggressive inlining helps with exercising dominator construction due to larger CFGs.

And I'd like to mention that it's needed to remove this piece of code which makes the incremental updater fallback to DominatorTree reconstruction when the number of updates is quite large relative to the number of DomTree nodes from GenericDomTreeConstruction.h before testing the performance of the incremental updater. I believe it can make improvement of the incremental updater more observable.

And I'd like to mention that it's needed to remove this piece of code which makes the incremental updater fallback to DominatorTree reconstruction when the number of updates is quite large relative to the number of DomTree nodes from GenericDomTreeConstruction.h before testing the performance of the incremental updater. I believe it can make improvement of the incremental updater more observable.

Thanks for the note. I have a better understanding now. See my other 2 patches D58369 D58373 in the area :)

I am now testing with IRTests check-llvm-analysis during development and check-llvm check-all when about to send out a patch.

Comparing the results of opt -passes='default<O3>,function(print<domtree>)' -disable-output (llvm-link.bc,llvm-as.bc,sqlite3.bc,...) give some more confidence.

I don't expect this patch to have perceivable performance boost.. Maybe there is a bit. Ran time opt -domtree -disable-output < ~/llvm/GLLVM/bin/clang-9.bc several times with and without the patch:

Before: 1:05.27 1:05.45 1:05.46 1:06.96 1:05.25 1:04.58
After: 1:05.71 1:04.49 1:03.50 1:05.14 1:05.04 1:03.77

kuhar added a comment.Feb 19 2019, 6:28 AM

Comparing the results of opt -passes='default<O3>,function(print<domtree>)' -disable-output (llvm-link.bc,llvm-as.bc,sqlite3.bc,...) give some more confidence.

I don't expect this patch to have perceivable performance boost.. Maybe there is a bit. Ran time opt -domtree -disable-output < ~/llvm/GLLVM/bin/clang-9.bc several times with and without the patch:

Before: 1:05.27 1:05.45 1:05.46 1:06.96 1:05.25 1:04.58
After: 1:05.71 1:04.49 1:03.50 1:05.14 1:05.04 1:03.77

If you are building DT/PDT once per function, then your times is going to be dominated by reading bitcode, parsing it, building a module, verifying it, etc. -- I suggest to visit each function a number of times (say 1000), in randomized order, and measure time of a single pass / tool that does it.
In addition, make sure that your performance governor is set to performance and that you pin the process to a single cpu core with its sibling disabled. (like in https://llvm.org/docs/Benchmarking.html#linux). Benchmarking on x86 is hard.

In addition, make sure that your performance governor is set to performance and that you pin the process to a single cpu core with its sibling disabled. (like in https://llvm.org/docs/Benchmarking.html#linux). Benchmarking on x86 is hard.

+1

One test I like to use is the LLVM test-suite subdirectory CTMark with running the top-level lit process through taskset, as in:

cmake -G Ninja -D TEST_SUITE_SUBDIRS=CTMark -D TEST_SUITE_RUN_BENCHMARKS=0 ...

ninja

taskset -c $core lit -v -j 1 . -o results.json

In addition, make sure that your performance governor is set to performance and that you pin the process to a single cpu core with its sibling disabled. (like in https://llvm.org/docs/Benchmarking.html#linux). Benchmarking on x86 is hard.

Thanks for the pointer! Here is the performance number of perf stat -r 10 taskset -c 0 ~/llvm/Release/bin/opt -passes=$(printf '%.0srequire<domtree>,invalidate<domtree>,' {1..1000})'require<domtree>' -disable-output /tmp/p/sqlite-autoconf-3270100/sqlite3.bc
I'm very new to the optimization part of llvm.. Let me know if there is better way to measure domtree construction.

Before: 9.8444 +- 0.0279 seconds time elapsed ( +- 0.28% )
After: 9.3996 +- 0.0170 seconds time elapsed ( +- 0.18% )

I've tried this thrice and can reproduce similar comparison results.

kuhar accepted this revision.Feb 19 2019, 7:35 PM

Awesome, so it seems like we have ~4.5% improvement. Thanks for running the experiments.

This revision is now accepted and ready to land.Feb 19 2019, 7:35 PM
MaskRay updated this revision to Diff 187509.Feb 19 2019, 8:35 PM

Add performance number to the description

MaskRay updated this revision to Diff 187510.Feb 19 2019, 8:36 PM
MaskRay edited the summary of this revision. (Show Details)

.

This revision was automatically updated to reflect the committed changes.

One test I like to use is the LLVM test-suite subdirectory CTMark with running the top-level lit process through taskset, as in:

@brzycki Thanks for the tip! Do you know how to create testing .bc from LLVM test-suite for benchmarks?

And how can I benchmark dynamic dominator tree? CalculateFromScratch is easy to benchmark as I can just repeatedly run the function analysis domtree and discard the result but I don't know how for the dynamic case...

@brzycki Thanks for the tip! Do you know how to create testing .bc from LLVM test-suite for benchmarks?

Hello @MaskRay, test-suite has a subdirectory named LLVMSource that contains IR variants of tests. However, I see no reference to that sub-directory in the CMake build files and the last update to that directory was in 2012. I have never tried it and have no idea if lit can run tests in there. The LLVMSource/Makefile does pull in all files in that directory so if it worked you should be able to just add a *.ll file there for it to be picked up.

And how can I benchmark dynamic dominator tree? CalculateFromScratch is easy to benchmark as I can just repeatedly run the function analysis domtree and discard the result but I don't know how for the dynamic case...

The testing I've done on Dominators was as customer to the external API and all the benchmarking I performed was in conjunction with the pass. I don't know of a way to profile just the DTU.

There is a directory under test-suite's MicroBenchmarks that use the Google benchmark framework to time small sections of code. I don't know if that could be easily harnessed to test LLVM internals though. I suspect given test-suite's standalone nature it won't be easy to create a benchmark that relies on LLVM as a library. But I don't know for sure.

And how can I benchmark dynamic dominator tree? CalculateFromScratch is easy to benchmark as I can just repeatedly run the function analysis domtree and discard the result but I don't know how for the dynamic case...

The best method I came up with is to instrument the updater primitives (applyUpdates, insertEdge, deleteEdge, recalculate, etc.), and measure time within the running optimizer. It should be best to get some big bitcode file and run opt with O3 on it. clang is reasonably sized in this configuration, you should also have luck with webassembly and rippled. If you are with google, you can try some bigger internal targets (there is a script to compile a given target to bitcode).

...
And how can I benchmark dynamic dominator tree? CalculateFromScratch is easy to benchmark as I can just repeatedly run the function analysis domtree and discard the result but I don't know how for the dynamic case...

You can reference the code in D50300 to time incremental updater primitives. I recommend you to find some projects which have large functions, probably machine generated code, for example, the one mentioned in Bug 37929. Bitcodes which have large functions can make improvement more observable. I remember incremental DT updating happens intensively in some passes like JumpThreading, so you can run O3 and save a .bc before jumpthreading. It can save you some time when benchmarking.