Change the original algorithm so that it scales better when meeting very large bitcode where every instruction does not implies a global.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
Can you add a description to the revision summarizing what you're doing before I try to infer it from the diff?
Thanks!
Do you have numbers showing up the improvement?
What's the effect of memoization on memory usage?
include/llvm/Transforms/IPO/GlobalDCE.h | ||
---|---|---|
35 | Why did you rename this? Some part of this diff could disappear without this change. | |
36 | DenseMap is intended to contains small value, SmallPtrSet seems a bit fat here? Also, after reading the algorithm, this could be a vector, you should not need dedup by construction. | |
38 | We don't suffix member names. | |
lib/Transforms/IPO/GlobalDCE.cpp | ||
101 | The recursion is bound by the fact that we only recurse on constant and stop at the first GV reached. I'd rather have this as a comment in the code. | |
103 | So the memorization is only used for constants (GV initializers, GEP, what else?), I'd rename it to make it explicit that this cache only applies to constants. Also can such constant have multiple uses? I'm failing to remember in which case right now. If they only have a single use we could just traverse and remove the memorization entirely. | |
113 | Why not passing GV to ComputeDependencies instead of Deps and populate directly GVDependencies_ ? | |
114 | No braces | |
118 | GlobalValue & GV does not seems formatted, I'd suggest running git clang-format on this patch. | |
120 | Early return. | |
121 | What does this do? Isn't it redundant with the one two lines above? | |
126 | What's the bound on the recursion? How do we guarantee that we won't run out of stack? Worth a comment. | |
187 | pop_back_val() | |
190 | No brace |
The complexity of the original algorithm is roughly in O(|live function instruction|), and the complexity of this algorithm is roughly in O(|globals users|).
So if there's a big global function, previous algorithm is more costly (and that's the original motivating example).
If there's a big dead function, this algorithm is more costly.
On average test cases, it does not make much difference, It's only noticeable on very large functions (> 1000000 instructions) which happens to be my use case.
What's the effect of memoization on memory usage?
memoization of constants was already implemented in the previous algorithm, so not much difference here.
This algorithm contains an additional map that holds the global dependencies, so it should have a bigger memory consumption.
include/llvm/Transforms/IPO/GlobalDCE.h | ||
---|---|---|
36 | According to my tests, DenseMap performs better on large test case than std::unorderd_map (by a factor of two) Correct for Set->Vector | |
lib/Transforms/IPO/GlobalDCE.cpp | ||
103 | See r179458 | |
113 | I think it's clearer like this, as it separates the dependencies build from the GVDependencies update. | |
126 | There's a finite number of globals and they can only be visited once because of the set check. |
include/llvm/Transforms/IPO/GlobalDCE.h | ||
---|---|---|
36 | I'm not sure what you mean by "large test case"? Do you mean lots of entries? My point was about sizeof(Map[key]) and memory consumption. | |
lib/Transforms/IPO/GlobalDCE.cpp | ||
103 | Thanks for the pointer! | |
126 | I'm not worried about *infinite* recursion, but *too large* recursion for the limit stack size in some environment. If the recursion is in O(global) that's a potential stack overflow. |
Can you please add a comment in the code explaining how the new algorithm works? I'd like to see figures on a case where this actually matters (e.g. take the bitcode, run through opt -globaldce -time-passes with the old/new algorithm and see what's the performance difference)
lib/Transforms/IPO/GlobalDCE.cpp | ||
---|---|---|
173 | Capital letter, ending with a dot (here and everywhere else). | |
185 | This whole patch needs to be clang-formatted, this should look like | |
241 | stray newline. |
include/llvm/Transforms/IPO/GlobalDCE.h | ||
---|---|---|
36 | Yeah I meant lots of entries. I've done the tests with an std::vector instead and it does not impact the execution time, and it's indeeded better on the memory usage, switching to this. | |
lib/Transforms/IPO/GlobalDCE.cpp | ||
126 | Previous algorithm worst case recursion depth was in `O(|global|)` and that happened, for instance, when a global function references a static function that references a static function that ... etc. I'm not 100% sure but this function has a worst case recursion depth of two, has it only propagates the liveness to the whole comdat section. If you agree with my analyse, I'll update the comment. |
I've updated the diff, formated with git clang-format as suggested. However there's no -U option to that command so the diff does not appear with the whole contxt, and sipmly running clang-format on the files produces too much noise.
FYI, My workflow is:
- git clang-format HEAD~ (will format only the last commit).
- git diff (just to check the formatting changes)
- git commit -a --amend
- generate the patch for phab, I'm using arc diff HEAD~ which handles everything, otherwise manually with git diff -U9999 HEAD~ you get the full context patched to be updated.
I wonder if your diff is correct, it seems that you updated the diff compare to the previous diff state (like if you didn't squash commits).
include/llvm/Transforms/IPO/GlobalDCE.h | ||
---|---|---|
42 | Put the comments on the line before the variables, and use /// (doxygen-like prefix). | |
lib/Transforms/IPO/GlobalDCE.cpp | ||
126 | Yes I guess a GV can belong only to a single comdat so it should be fine! |
Patch correctly formatted, thanks @mehdi_amini for sharing your workflow.
Also updated the comments.
include/llvm/Transforms/IPO/GlobalDCE.h | ||
---|---|---|
22 | Are you still using SmallSet or is this a dead include? | |
lib/Transforms/IPO/GlobalDCE.cpp | ||
132 | stray newline? | |
188 | Please use an explicit type here | |
189 | Why did you remove this comment? | |
189 | and here, and I guess everywhere else the type is not really obvious (for consistency with what's already in the pass). |
include/llvm/Transforms/IPO/GlobalDCE.h | ||
---|---|---|
22 | It's still used for `ConstantDependeciesCache`. |
This looks much better now. Can you please add a comment in the code explaining the algorithm (if I missed it, please point out) while I take a last look? Thanks.
include/llvm/Transforms/IPO/GlobalDCE.h | ||
---|---|---|
39 | DenseMap? |
To be fair: the algorithm isn't changed, it is just the "uses" graph edges that are reversed as a pre-step.
I.e. the question is "how to you get all the globals referenced by another global"?
Before this patch, it was doing this by walking the body (or the initializer) and collecting the references. What this patch is doing, it precomputing the answer to this query for the whole module by walking the use-list of every global instead.
I believe it is the only change.
include/llvm/Transforms/IPO/GlobalDCE.h | ||
---|---|---|
39 | I've kept the original Moving to a std::unordered_multimap for GVDEpendencies, not sure about what you expect for the ConstantDependenciesCache. What's the pro/con of using a DenseMap? |
include/llvm/Transforms/IPO/GlobalDCE.h | ||
---|---|---|
39 | DenseMap is an open-addressing hashmap where the bucket are allocated with the key, hence "dense". This is very performance but for the memory consumption: empty buckets are taking the same amount of space as occupied ones. |
include/llvm/Transforms/IPO/GlobalDCE.h | ||
---|---|---|
39 | ok, thanks for the extra info. Everything looks much nicer now. |
Hi all,
the review seems ok now, do you have any other suggestion/advices/flames?
As a side note : I don't have the commit right in llvm trunk, so if one of you could take care of that step?
LGTM. Serge, one of us can commit that for you once llvm.org decides to go back online.
Are you still using SmallSet or is this a dead include?