LICM may speculatively hoist instructions into the loop preheader. Before code generation, we need to sink the hoisted instructions back into the loop if that is beneficial. This pass is the reverse of LICM: it looks at instructions in the preheader and sinks them to basic blocks inside the loop body when the basic block frequency is smaller than the preheader frequency.
lib/Transforms/Scalar/LoopSink.cpp | ||
---|---|---|
38 | Add statistics for number of instructions sunk etc. | |
79 | LoopBase class has a member method 'contains' which can be used. | |
99 | Remove dead code | |
134 | add documentation for the method. | |
140 | To speed up the pass, perhaps it is better to do a quick scan of the BB of the loop to see if there are any blocks that are really cold (colder than preheader). If there are not, early return. | |
144 | add a comment for this variable. | |
160 | if -> is | |
179 | Perhaps compare cdt's frequency with the sum of bb and sink bb? If it is greater than the sum, do not use cdt. | |
200 | if T >= ... early return | |
205 | Add debug trace here |
Have you talked to anyone about the design for this?
I know Daniel Jasper, Quentin, and several others have looked at similar things before. Previous attempts have focused on using MachineLICM to do sinking as well as hoisting. While I don't have a strong opinion about one design over the other, we should be consistent about the plan here, and possibly consolidate some of the logic.
lib/Transforms/Scalar/LoopSink.cpp | ||
---|---|---|
123 | This is not correct -- it should merge in SubLoop's AST too. See LICM's CollectAliasInfoForLoop -- that should be extracted as a common utility. | |
132 | Is the first check needed? | |
148 | Continue if BBs is empty | |
156 | This logic seems to be inverted -- Using CDT should be encouraged if its frequency equals the sum of SinkBB and N. In other words, division should be used, not multiplication | |
184 | Is the formatting correct here? |
update
lib/Transforms/Scalar/LoopSink.cpp | ||
---|---|---|
123 | Yes, we can refactor the code to reuse CollectAliasInfoForLoop from LICM for the legacy pass manager:
This is definitely doable, but seems overkill to me because:
Comments? | |
132 | I think yes, because if there is a loop-variant value among its operands, sinking it into the loop may change the value every iteration. | |
156 | It's actually expected: if CDT's frequency is equal to or only a little larger than the frequency of SinkBB+N, then the check will fail and we go to the "else" branch, picking the CDT instead of SinkBB. | |
184 | I used clang-format --style=llvm for the formatting. |
test/Transforms/LICM/loopsink.ll | ||
---|---|---|
18 | b6 --> b7 | |
24 | add check-not of @g after preheader. | |
27 | add check-not @g after b3 and b4 | |
66 | Add a branch profile data here. | |
75 | B6 --> b6 | |
77 | B3 --> b3 | |
81 | This should be b7 | |
132 | annotate with branch profile data | |
143 | B3 -> b3 | |
147 | b6 -> b7 | |
211 | but this loop will be executed at least once per call of t4, so the loop body frequency should not be lower than entry frequency | |
272 | This test can be simplified a little by just making an external call here. |
test/Transforms/LICM/loopsink.ll | ||
---|---|---|
211 | So the current algorithm is that even if the frequency is equal (as in this case), we still tend to sink because it will reduce live range. |
Are you going to have a separate patch to hook this into the pass manager?
lib/Transforms/Scalar/LoopSink.cpp | ||
---|---|---|
142 | Isn't it still okay to try to sink to outside the loop if the user block is cold enough? | |
162 | Why not early-return if the frequency of SinkBB is greater than PreheaderFreq? | |
179 | I guess you intend L->contains(LI->getLoopFor(N)) ? |
include/llvm/Transforms/Utils/LoopUtils.h | ||
---|---|---|
474 |
First, this is used for both hoisting and sinking. In fact, this function is really checking "is aliased with loop", you probably should call it something like that and use that. | |
477 | Ditto above | |
480 | This should be "Return true if a non-memory instruction can be handled by the hoister/sinker" Please don't call it isHoistableInst and then use it for sinking :) | |
lib/Transforms/Scalar/LICM.cpp | ||
511 | This is not really right, it returns whether a non-memory instruction is hoistable. :) | |
lib/Transforms/Scalar/LoopSink.cpp | ||
42 | "Don't sink instructions that require cloning unless they execute less than this percent of the time" (or whatever) | |
87 | "sunk into loop body" | |
89 | CurLoop is unused? | |
90 | s/hoist/sink/ | |
110 | bool HasColdBB = llvm::any_of(L->blocks(), [&](const BasicBlock *BB) { return BFI->getBlockFreq(BB) <= PreHeaderFreq; }); | |
133 | I'm confused. | |
170 |
I am also confused by this placement strategy. You are not ordering the blocks in any particular processing order, so you may not actually choose the best sink points, as once you NCA something high in the domtree and something low, NCA will always be something high in the domtree. If you ordered it so it was the lowest things first (using the DFS numbers or whatever), you may decide multiple intermediate placements are cheaper than what you are doing here. | |
191 | Needs a comment | |
199 | This looks ... interesting. Instead, why not add SinkBBS.size() == 0 check above (so it skips if it's empty). Then you should simply move the i == 0 case outside of the loop, and the loop is just doing the insertions. | |
205 | This is Local.cpp's replaceDominatedUsesWith (if you need a Instruction, Use version, make one :P) |
refactor
include/llvm/Transforms/Utils/LoopUtils.h | ||
---|---|---|
480 | Refactored code to remove these functions. | |
lib/Transforms/Scalar/LICM.cpp | ||
511–512 | refactored code to remove this. | |
lib/Transforms/Scalar/LoopSink.cpp | ||
134 | We need a reverse iterator because instructions in the back of the BB may depend on instructions in the front, so they need to be sunk first before the other instructions can be sunk. | |
171 | Good point. Refactored the code and updated the algorithm to iterate from the coldest blocks to ensure optimal placement. | |
200 | SinkBBs.size() == 0 check is already moved above the total frequency check, so it will not reach here. The i==0 check is to distinguish between the first SinkBB (that we use move instead of insert) and the later SinkBB (that we make a copy for each insert). |
minor drive by comments
include/llvm/Transforms/Scalar.h | ||
---|---|---|
141 | Missing comments | |
include/llvm/Transforms/Utils/Local.h | ||
325 | Clarification needed. This doesn't really tell me what the parameter does. In fact, there's nothing that says what the BB param is for either. Can you fix that? | |
lib/Transforms/Scalar/LoopSink.cpp | ||
10 | How does this relate to the existing Sink.cpp pass? (Not asking for an answer in review, but updating the comment with context might be helpful.) |
So, I don't think the code is quite ready to go into the tree yet. There are a bunch of minor issues that should be cleaned up. None of these are really big (it seems like you all have sorted out the algorithmic and high-level design stuff), but I think they should be addressed before the code goes in. Especially the refactoring bit I suggest below.
lib/Transforms/Scalar/LoopSink.cpp | ||
---|---|---|
128–138 | I think this code needs significantly better variable names. You have BB, B, and B again all for basic blocks in *different* containers. And AddedBBs doesn't really tell the reader much about what the container is doing. Compare that to SortedLoopBBs which says exactly what it is. SinkBBs might also benefit from a slightly better name (and the function name might similarly benefit). | |
129 | Usually it is better to keep a set outside the loop and clear it on each iteration... | |
159–161 | Why the variable rather than inlining this? Also, can you just call any_of directly since we have a using declaration and this is a range variant that doesn't exist in the standard? | |
172–175 | This comment again doesn't parse for me, but isn't this dead code now that you're just directly using the reverse iterators? |
lib/Transforms/Scalar/LICM.cpp | ||
---|---|---|
438–449 | Can you split these refactorings into a separate patch please? They seem strictly beneficial and unrelated, and will make for example reverts messier if left in as part of this patch. I have several minor and boring comments on the refactorings, but it seems better to make them on a dedicated patch than to clutter this thread with them. (Just to be clear, I'd leave it a static function, and just get the API you want and the implementation improvements. Then in this patch you can just make it an external function.) |
lib/Transforms/Scalar/LoopSink.cpp | ||
89–92 | I'm having a hard time understanding what this comment is trying to say. Can you try to re-word this to be more clear (and more clearly worded)? | |
102 | Since this is new code, please use the more modern form of doxygen throughout: http://llvm.org/docs/CodingStandards.html#doxygen-use-in-documentation-comments (I've also updated those to more accurately reflect that we use auto-brief new rather than explicit '\brief ...' sections.) | |
102–106 | I feel like this comment too could be much more clear. First, it isn't clear without a lot of context what the purpose of this would be. I'm guessing you mean something like find a candidate set of basic blocks to sink *into*? "Dominate BBs" - this is ambiguous. Do *all* returned basic blocks need to dominate the set of blocks in BBs? Or is it more that for each block in BBs, at least one block in the returned set must dominate that block? A better name for the parameter than "BBs" would probably help here. The frequency constraint isn't really explained. Why is that important? Give the reader more help to understand what the code will end up doing. | |
107–108 | Do you expect these sets to be large? Naively, I wouldn't... If small is likely to be common, it would be better to use SmallPtrSetImpl as an input and SmallPtrSet as a result with a reasonable small size optimization. | |
126–127 | This comment doesn't parse: "that are dominated it" seems to have a grammar error. |
(Trying to first clarify the split-off of the patch I'm suggesting...)
lib/Transforms/Scalar/LICM.cpp | ||
---|---|---|
438–449 | I see that we got confused here and in the other review. The part of this refactoring I *do* think makes sense to split out and send for review independently is changing the signature (for example, removing TargetLibraryInfo) and re-organizing the implementation. The only part I think needs to happen with this patch is making this routine be a public routine in the 'llvm' namespace. Does that make more sense? |
lib/Transforms/Scalar/LICM.cpp | ||
---|---|---|
438–449 | Sure. The changes in LICM are reduced to minimal in https://reviews.llvm.org/D24168, PTAL. Please do not look at LICM changes in this patch for now because it also includes the refactoring bit. I'll rebase once D24168 is closed. |
First batch of comments. While I'll probably have a some more minor comments later, there are a couple of particularly interesting ones that I wanted to go ahead and send out.
include/llvm/Transforms/Utils/LoopUtils.h | ||
---|---|---|
474–476 | Here and elsewhere in comments, I would just say "is null" rather than "is nullptr". | |
lib/Transforms/Scalar/LoopSink.cpp | ||
11–15 | Some grammar issues here:
I also think the wording could be improved to be more clear when reading it. For example "This pass does the inverse transformation of what LICM does" reads more clearly to me. Lastly, I would lead with that high-level description, then go into specifics. I would separate the comparison with the other Sink pass into a second paragraph. So something along the lines of: This pass does the inverse transformation of what LICM does. It traverses all of the instructions ... It differs from the Sink pass ... Does that make sense? | |
42 | Do you want to count separately the number of instruction clones created as part of this? Not sure if that has been an interesting metric while working on this patch or not. | |
49 | I would suggest sinking the LegacyLoopSinkPass to the bottom of the file to avoid needing this forward declaration. | |
102 | Naming convention: adjustedSumFreq | |
130 | naming: findBBsToSinkInto | |
136–144 | This shouldn't be done each time we try to sink an instruction. This should be pre-computed once for the loop and re-used for each instruction we try to sink. | |
149–171 | This seems really expensive. By my reading this is O(N * L * M) where L is the number of basic blocks in a loop, N is the number of instructions we try to sink into that loop, and M is the number of basic blocks within the loop that use the instructions. If there is for example one hot basic block in the loop and a large number of cold basic blocks and all of the uses are in those cold basic blocks, it seems like this could become quite large. Have you looked at other algorithms? Is there a particular reason to go with this one? (I've not thought about the problem very closely yet...) | |
181–182 | This doxygen is still in the old form. Also, this should be 'sinkLoop'. Use AAResults here rather than the old name? Can any of these parameters be null? If not, pass references? I would generally partition the arguments into those that are required and pass references for them, and then pass the optional ones as pointers. Then you can document that they are optional and the types will reinforce that fact. | |
190–192 | I think a comment along the lines of "If there are no basic blocks with lower frequency than the preheader then we can avoid the detailed analysis as we will never find profitable sinking opportunities." I would also find this easier to read without the negation as: if (all_of(... return BFI->getBlockFreq(BB) > PreheaderFreq; | |
211 | This comment is a little confusing. It seems to be describing a thing (like a variable, for example BBs) but is also right above a loop that populates that variable. Generally, once comments can be read as implementation comments about the code, I try to make them describe the behavior of the code, as that reads a bit better IMO. So "Compute the set of blocks which contain a use of I and ..." would read a bit better for me. Also "are in the sub loop of L" doesn't parse very well, although I understand what you mean. I think it would be more clear to say "... blocks in the loop L which ..." rather than going into the issue of subloops. | |
215 | If this is the case we can't sink I at all though, right? I think that is what the code already does, maybe just update the comment? | |
216–217 | I would use two ifs here since one needs its own comment (and it is a nice comment!) | |
217 | Why not use L->contains(UI)? | |
223 | Check for an empty BBs here to handle the case of a use that can't be handled? This combined with the below BBsToSinkInto makes me think this should be extracted to a helper that tries to sink one instruction so that we can use early exit from that function. | |
235 | So, this has an important problem: it introduces a non-determinism into the compiler. The initial problem is that SmallPtrSet does not provide stable iteration order, and so there is no predicting which basic block gets the original instruction and which one gets the clone. However, merely using something like SetVector helps but isn't fully satisfying here because the insertion order is *also* something we would very much like to not depend on: the use list order. I would suggest essentially numbering the basic blocks in the loop and use a vector of the BBs sorted by their number here. You can just create a map out of the blocks range with something like: int i = 0; for (auto *BB : L->blocks()) LoopBlockNumber[BB] = ++i; (Just pseudo code, but you get the idea.) That will punt the ordering requirement to LoopInfo which is I think the right place for it. | |
251 | I'd indicate the difference between this and the previous debug message. Maybe "Sinking a clone of " instead of just "Sinking". |
Thanks for the reviews!
lib/Transforms/Scalar/LoopSink.cpp | ||
---|---|---|
149–171 | I initially started with an ad-hoc algorithm which is O(L * M), but Danny pointed out it is not optimal, so I changed to this optimal algorithm. The lower bound for any sinking algorithm is O(L * M), but if an optimal solution is desired, O(N * L * M) is the best I can get. Yes, this could be expensive when N is large. In practice, I did not see a noticeable compile time increase in speccpu2006 benchmarks after applying this patch (and enabling the pass in the frontend). How about we limit N to be no more than a certain number to avoid expensive computation in extreme cases? | |
215 | Not sure if I get this right, do you mean update the comment (as I just did) to make it less redundant? | |
217 | Because it does not check for sub loops. e.g.:
Loop1 { I1; Loop2 { I2 } }
Loop1->contains(I1) --> true
For this check we want to make sure I1 and I2 both return true. |
M is the number of use BBs.
The pass already filters out loops which do not have any cold blocks -- this effectively filters out most of the loops in reality, so the compile time impact will be minimal. Furthermore, the following can be done:
- only collect cold BBs in the loop body that are colder than the preheader, and sort them instead
- skip any instructions with use BBs that are not members of the cold BBs collected in 1).
Will try to make a full pass through, thanks for the extensive updates Dehao! One specific point of discussion below:
lib/Transforms/Scalar/LoopSink.cpp | ||
---|---|---|
149–171 | I'm not worried about N being large. I'm worried about the fact that L is >= to M and so it is quadratic in the number of basic blocks that use each instruction. The other thing is that if this scales in compile time by N then it scales in compile time *by how much effect it is having*. If it scales in compile time by M^2, then we pay more and more compile time as loops get larger even if we only sink very few instructions. I would either bound M to a small number, and/or look for some way to not have this be quadratic. It seems like a bunch of this should be pre-computable for the loop? |
This is really close. Some minor nit picks and a few more interesting comments below.
include/llvm/Transforms/Utils/LoopUtils.h | ||
---|---|---|
474–476 | You changed one 'nullptr' to 'null' but missed the other. | |
lib/Transforms/Scalar/LoopSink.cpp | ||
12–13 | "and sink them/ to" -> "and sinks them to" | |
102 | "/p" -> "\p" | |
160 | The number of user *instructions* isn't really the right thing to apply the threshold to, as that doesn't directly change the cost. The idea is that we need the size of BBsToSinkInto to be a small constant in order for the search for the coldest dominating set to be "just" linear in the number of blocks in the loop. So while a threshold of "40" may make sense for the number of user instructions, I suspect the threshold should be much smaller when applied to the size of BBsToSinkInto. I also think you should add two comments about this. One, you should add a comment to the findBBsToSinkInto function clarifying the algorithmic complexity (that it is O(N * M) or O(M^2), where N is SortedLoopBBs.size() and M is BBsToSinkInto.size()), and you should mention where you check this threshold that the reason is that we're going to call findBBsToSinkInto, which would go quadratic if we didn't set a cap. The reason for all of this is that I'm worried some future maintainer will come along and not really understand how risky it is to adjust these thresholds, so I think it is important to document the implications. I still think we will long-term need a better algorithmic approach here, as I suspect we'll find annoying cases where the threshold blocks an important optimization (such as when there are way too many initial BBsToInsertInto but there are a small number of common dominating blocks). But I understand this is one of the really hard problems (it's the same as optimal PHI placement and a bunch of other ones), and I don't want to hold up your patch on a comprehensive approach here. On an unrelated note, you should also document that this threshold has a secondary function: it places an upper bound on how much code growth we may trigger here. I'd document this in particular as that seems somewhat accidental, and I suspect we may long-term want a better threshold for that. I would in fact encourage you to leave a FIXME to adjust this for min_size and opt_size. | |
179–180 | We generally prefer calling .empty() to testing .size() against zero. | |
185 | Here you don't need stable sort since this is a total ordering. You should just use std::sort and mention that this is a known total ordering in a comment. You could do that in an overall comment that explains what you're doing here: // Copy the final BBs into a vector and sort them using the total ordering // of the loop block numbers as iterating the set doesn't give a useful // order. No need to stable sort as the block numbers are a total ordering. | |
195 | You didn't actually switch to the sorted list here. Also, you can just use a range based for loop here. | |
219 | This routine doesn't sink the loop, so a name sinkLoop seems confusing. Maybe sinkLoopInvariantInstructions? Also, I think this should be a static function. | |
242–247 | It wasn't obvious to me reading this that SortedLoopBBs only contained the loop BBs that are less hot than the preheader. I think it might be nice to clue the reader in that this isn't *all* the loop BBs. Maybe SortedColdLoopBBs? Or just ColdLoopBBs? If you make this change, I'd keep the name consistent throughout, of course. Also, you use <= here, but < everywhere else I see; any particular reason to include BBs in this list with the same frequency? |
Thanks for the reviews.
I also added an overall algorithm description at file level per Madhur's suggestion.
lib/Transforms/Scalar/LoopSink.cpp | ||
---|---|---|
160 | You actually mean the size of BBs should be a small constant? Because computing BBsToSinkInto is the most computation-intensive part of the algorithm. | |
195 | The reason I used an iterator here is that we need to handle the first entry differently. |
Very cool. I think this patch LGTM with the comments below addressed (documentation fixes, simple changes, a fixme, a bunch of minor test cleanup). Feel free to submit. I've asked for one somewhat immediate follow-up patch, and it'd also be good to get a patch to put this into the pipeline behind a flag so folks can test the size impact.
lib/Transforms/Scalar/LoopSink.cpp | ||
---|---|---|
14–23 | The differences seem to be a bit duplicated at this point. Sorry if this is the result of my suggestions. I think that you only need one of the prose "It Differs ..." and the bulleted list. If you want the detail in the list, I would just clean up the wording of that section so the English reads cleanly: "in a way that" -> "in the following ways" | |
85 | I would use SmallPtrSetImpl<BasicBlock *> here and elsewhere on API boundaries where you can below. | |
213 | So, this technically will break the verifier if you ever look at the IR at this point. While that is allowed, it seems fairly easy to avoid this by first creating all the clones and rewriting uses to the clones before moving the instruction. By the time you move it, the only uses remaining should be the ones dominated by the destination insertion point. | |
221–229 | This causes the cloning to be quadratic in the number of uses as we may have a single clone for each use. I know we have an upper bound, but this is still pretty slow (the use list is slow to walk). I'd suggest you fix this in a follow-up patch though (I think it'll be a lot of code and easier to review as a follow-up), just leave a short FIXME saying that this is slow and may be quadratic in the number of uses. (For the follow-up patch, the approach I'd suggest is that as you build up BBsToInsertInto, you also build up a map from UseBBs to the dominating BB that will be inserted into. Then you can insert into each BBsToInsertInto here without rewriting any uses but building up a map from those BBs to the inserted clones. Finally, you can do a single walk over the uses and for each one look up the inserted BB in the first map and then the inserted clone in the second map and rewrite the use. Or maybe you see a simpler way? This was just the first that came to mind.) | |
265 | Use a SmallDenseMap? Good to dodge allocations for small loops. | |
test/Transforms/LICM/loopsink.ll | ||
9 | You can prune out these "Function Attrs" comments... See below. | |
293 | Please try to minimize function attributes you have in your test cases. You may not need any. If you do need them, you can attach the textual form directly to the functions which is much more friendly for test cases (and makes the comments explaining what the '#0' attribute set contains unnecessary). | |
295–304 | Please prune out all the metadata your test isn't directly using (TBAA stuff, Clang stuff). | |
test/Transforms/LICM/sink.ll | ||
5–6 | Unless your tests depend on specific datalayout or triple, please avoid including them in the IR test cases so that things are more generic and less tied to platforms. | |
23 | Same comments as above about function attributes. Also, please don't use C++ mangled names, but instead provide clean and easy to read names directly. |