Page MenuHomePhabricator

Add Loop Sink pass to reverse the LICM based of basic block frequency.
ClosedPublic

Authored by danielcdh on Jul 25 2016, 2:05 PM.

Details

Summary

LICM may hoist instructions to preheader speculatively. Before code generation, we need to sink down the hoisted instructions inside to loop if it's beneficial. This pass is a reverse of LICM: looking at instructions in preheader and sinks the instruction to basic blocks inside the loop body if basic block frequency is smaller than the preheader frequency.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
chandlerc added inline comments.Sep 1 2016, 3:47 PM
lib/Transforms/Scalar/LICM.cpp
434–438

Can you split these refactorings into a separate patch please? They seem strictly beneficial and unrelated, and will make for example reverts messier if left in as part of this patch.

I have several minor and boring comments on the refactorings, but it seems better to make them on a dedicated patch than to clutter this thread with them.

(Just to be clear, I'd would leave it a static function, and just get the API you want and the implementation improvements. Then in this patch you can just make it an external function.)

lib/Transforms/Scalar/LoopSink.cpp
89–92

I'm having a hard time understanding what this comment is trying to say. Can you try to re-word this to be more clear (and more clearly worded)?

102

Since this is new code, please use the more modern form of doxygen throughout: http://llvm.org/docs/CodingStandards.html#doxygen-use-in-documentation-comments

(I've also updated those to more accurately reflect that we use auto-brief new rather than explicit '\brief ...' sections.)

102–106

I feel like this comment too could be much more clear.

First, it isn't clear without a lot of context what the purpose of this would be. I'm guessing you mean something like find a candidate set of basic blocks to sink *into*?

"Dominate BBs" - this is ambiguous. Do *all* returned basic blocks need to dominate the set of blocks in BBs? Or is it more that for each block in BBs, at least one block in the returned set must dominate that block? A better name for the parameter than "BBs" would probably help here.

The frequency constraint isn't really explained. Why is that important? Give the reader more help to understand what the code will end up doing.

107–108

Do you expect these sets to be large? Naively, I wouldn't...

If small is likely to be common, it would be better to use SmallPtrSetImpl as an input and SmallPtrSet as a result with a reasonable small size optimization.

126–127

This comment doesn't parse: "that are dominated it" seems to have a grammar error.

128–138

I think this code needs significantly better variable names.

You have BB, B, and B again all for basic blocks in *different* containers.

And AddedBBs doesn't really tell the reader much about what the container is doing. Compare that to SortedLoopBBs which says exactly what it is. SinkBBs might also benefit from a slightly better name (and the function name might similarly benefit).

danielcdh updated this revision to Diff 70123.Sep 1 2016, 6:11 PM
danielcdh marked 13 inline comments as done.
danielcdh edited edge metadata.

Integrate Chandler's comment.

danielcdh updated this revision to Diff 70124.Sep 1 2016, 6:17 PM

clang-format

(Trying to first clarify the split-off of the patch I'm suggesting...)

lib/Transforms/Scalar/LICM.cpp
434–438

I see that we got confused here and in the other review.

The part of this refactoring I *do* think makes sense to split out and send for review independently is changing the signature (for example, removing TargetLibraryInfo) and re-organizing the implementation.

The only part I think needs to happen with this patch is making this routine be a public routine in the 'llvm' namespace.

Does that make more sense?

danielcdh updated this revision to Diff 70721.Sep 8 2016, 10:29 AM

rebase and update

danielcdh added inline comments.Sep 8 2016, 10:37 AM
lib/Transforms/Scalar/LICM.cpp
434–438

Sure.

The changes in LICM are reduced to minimal in https://reviews.llvm.org/D24168, PTAL.

Please do not look at LICM changes in this patch for now because it also includes the refactoring bit. I'll rebase once D24168 is closed.

danielcdh updated this revision to Diff 71441.Sep 14 2016, 2:47 PM

update the logic to replace dominated uses after sinking.

First batch of comments. While I'll probably have a some more minor comments later, there are a couple of particularly interesting ones that I wanted to go ahead and send out.

include/llvm/Transforms/Utils/LoopUtils.h
472–474

Here and elsewhere in comments, I would just say "is null" rather than "is nullptr".

lib/Transforms/Scalar/LoopSink.cpp
11–15

Some grammar issues here:

  • "all instructions" -> "all of the instructions"
  • "in loop preheader" -> "in the loop preheader"
  • "sink it" -> "sink them"
  • "the Sink pass that it only" -> "the Sink pass in that it only"
  • "in loop's preheader" -> "in the loop's preheader".

I also think the wording could be improved to be more clear when reading it. For example "This pass does the inverse transformation of what LICM does" reads more clearly to me. Lastly, I would lead with that high-level description, then go into specifics. I would separate the comparison with the other Sink pass into a second paragraph. So something along the lines of:

This pass does the inverse transformation of what LICM does. It
traverses all of the instructions ...

It differs from the Sink pass ...

Does that make sense?

42

Do you want to count separately the number of instruction clones created as part of this? Not sure if that has been an interesting metric while working on this patch or not.

49

I would suggest sinking the LegacyLoopSinkPass to the bottom of the file to avoid needing this forward declaration.

102

Naming convention: adjustedSumFreq

130

naming: findBBsToSinkInto

136–144

This shouldn't be done each time we try to sink an instruction. This should be pre-computed once for the loop and re-used for each instruction we try to sink.

149–171

This seems really expensive. By my reading this is O(N * L * M) where L is the number of basic blocks in a loop, N is the number of instructions we try to sink into that loop, and M is the number of basic blocks within the loop that use the instructions. If there is for example one hot basic block in the loop and a large number of cold basic blocks and all of the uses are in those cold basic blocks, it seems like this could become quite large.

Have you looked at other algorithms? Is there a particular reason to go with this one? (I've not thought about the problem very closely yet...)

181–182

This doxygen is still in the old form. Also, this should be 'sinkLoop'.

Use AARseluts here rather than the old name?

Can any of these parameters be null? If not, pass references? I would generally partition the arguments into those that are required and pass references for them and then pass the optional ones as pointers. Then you can document that they are optional and the types will reinforce that fact.

190–192

I think a comment along the lines of "If there are no basic blocks with lower frequency than the preheader then we can avoid the detailed analysis as we will never find profitable sinking opportunities."

I would also find this easier to read without the negation as:

if (all_of(...
      return BFI->getBlockFreq(BB) > PreheaderFreq;
211

This comment is a little confusing. It seems to be describing a think (like a variable, for example BBs) but is also right above a loop that populates that variable. Generally, once comments can be read as implementation comments about the code, I try to make them describe behavior of the code as that reads a bit better IMO. So "Compute the set of blocks which contain a use of I and ..." would read a bit better for me.

Also "are in the sub loop of L" doesn't parse very well although I understand what you mean. I think it would be more clear to say "... blocks in the loop L which ..." rather than going into the issue of subloops.

215

If this is the case we can't sink I at all though, right? I think that is what the code already does, maybe just update the comment?

216–217

I would use two ifs here since one needs its own comment (and it is a nice comment!)

217

Why not use L->contains(UI)?

223

Check for an empty BBs here to handle the case of a use that can't be handled?

This combined with the below BBsToSinkInto makes me think this should be extracted to a helper that tries to sink one instruction so that we can use early exit from that function.

235

So, this has an important problem: it introduces a non-determinism into the compiler.

The initial problem is that SmallPtrSet does not provide stable iteration order, and so there is no predicting which basic block gets the original instruction and which one gets the clone.

However, merely using something like SetVector helps but isn't fully satisfying here because the insertion order is *also* something we would very much like to not depend on: the use list order.

I would suggest essentially numbering the basic blocks in the loop and use a vector of the BBs sorted by their number here. You can just create a map out of the blocks range with something like:

int i = 0;
for (auto *BB : L->blocks())
  LoopBlockNumber[BB] = ++i;

(Just pseudo code, but you get the idea.)

That will punt the ordering requirement to LoopInfo which is I think the right place for it.

251

I'd indicate the difference between this and the previous debug message. Maybe "Sinking a clone of " instead of just "Sinking".

danielcdh marked 13 inline comments as done.Oct 4 2016, 11:26 AM

Thanks for the reviews!

lib/Transforms/Scalar/LoopSink.cpp
149–171

I initially started with an adhoc algorithm which is O(L * M), but Danny pointed out it is not optimal, so I changed to this optimal algorithm. The lower bound for any sinking algorithm is O(L*M), but if optimal solution is desired, O(N*L*M) is the best I can get.

Yes, this could be expensive when N is large. I practice, I did not see noticeable compile time increase in speccpu2006 benchmarks after applying this patch (and enable the pass in frontend).

How about we limit the N to be no more than a certain number to avoid expensive computation in extreme cases?

215

Not sure if I get this right, do you mean update the comment (as I just did) to make it less redundant?

217

Because it does not check for sub loops.

e.g.

Loop1 {

I1
Loop2 {
  I2
}

}

Loop1->contains(I1) --> true
Loop2->contains(I2) --> true
Loop1->contains(I2) --> false

For this check we want to make sure I1 and I2 both return true.

danielcdh updated this revision to Diff 73512.Oct 4 2016, 11:27 AM

integrate Chandler's reviews

danielcdh updated this revision to Diff 73513.Oct 4 2016, 11:31 AM
danielcdh marked an inline comment as done.

missed one comment

What is the status of this patch? Any more comments need to be addressed?

Chandler said he has more comment.

M is the number of use BBs.

The pass already filters out loops which do not have any cold blocks -- this effectively filters out most of the loops in reality so the compile time impact will be minimal. Further more, the following can be done:

  1. only collect cold bbs in the loop body that is colder than header and sort them instead
  2. skip any instructions with use BBs that are not member of the cold BBs collected in 1).

Example of parent BB being colder than the use BB?

Will try to make a full pass through, thanks for the extensive updates Dehao! One specific point of discussion below:

lib/Transforms/Scalar/LoopSink.cpp
149–171

I'm not worried about N being large. I'm worried about the fact that L is >= to M and so it is quadratic in the number of basic blocks that use each instruction.

The other thing is that if this scales in compile time by N then it scales in compile time *by how much effect it is having*. If it scales in compile time by M^2, then we pay more and more compile time as loops get larger even if we only sink very few instructions.

I would either bound M to a small number, and/or look for some way to not have this be quadratic. It seems like a bunch of this should be pre-computable for the loop?

danielcdh updated this revision to Diff 74907.Oct 17 2016, 2:38 PM

add max use threshold

This is really close. Some minor nit picks and a few more interesting comments below.

include/llvm/Transforms/Utils/LoopUtils.h
472–474

You changed one 'nullptr' to 'null' but missed the other.

lib/Transforms/Scalar/LoopSink.cpp
12–13

"and sink them/ to" -> "and sinks them to"

102

"/p" -> "\p"

160

The number of user *instructions* isn't really the right thing to apply the threshold to as that doesn't directly change the cost.

The idea is that we need the size of BBsToSinkInto to be a small constant in order for the search for the coldest dominating set to be "just" linear in the number of blocks in the loop.

So while a threshold of "40" may make sense for number of user instructions, I suspect the threshold should be much smaller when applied to the size of BBsToSinkInto.

I also think you should add two comments about this. One, you should comment to the findBBsToSinkInto function clarifying the algorithmic complexity (That it O(N * M) or O(M^2) where N is SortedLoopBBs.size() and M is BBsToSinkInto.size()), and you should mention where you check this threshold that the reason is because we're going to call findBBsToSinkInto which would go quadratic if we didn't set a cap.

The reason for all of this is that I'm worried some future maintainer will come along and not really understand how risk it is to adjust these thresholds so I think it is important to document the implications.

I still think we will long-term need a better algorithmic approach here as I suspect we'll find annoying cases where the threshold blocks an important optimization (such as when there are way too many initial BBsToInsertInto but there are a small number of common dominating blocks). But I understand this is one of the really hard problems (its the same as optimal PHI placement and a bunch of other ones), and I don't want to hold up your patch on a comprehensive approach here.

On an unrelated note, you should also document that this threshold has a secondary function: it places an upper bound on how much code growth we may trigger here. I'd document this in particular as that seems somewhat accidental and I suspect we may long-term want a better threshold for that. I would in fact encourage you to leave a FIXME to adjust this for min_size and opt_size.

179–180

We generally prefer calling .empty() to testing .size() against zero.

185

Here you don't need stable sort since this is a total ordering. You should just use std::sort and mention that this is a known total ordering in a comment. You could do that in an overall comment htat explains what you're doing here:

// Copy the final BBs into a vector and sort them using the total ordering
// of the loop block numbers as iterating the set doesn't give a useful
// order. No need to stable sort as the block numbers are a total ordering.
195

You didn't actually switch to the sorted list here.

Also, you can just use a range based for loop here.

219

This routine doesn't sink the loop, so a name sinkLoop seems confusing. Maybe sinkLoopInvariantInstructions?

Also, I think this should be a static function.

242–247

It wasn't obvious to me reading this that SortedLoopBBs only contained the oop BBs that are less hot than the preheader. I think it might be nice to clue the reader in that this isn't *all* the loop BBs. Maybe SortedColdLoopBBs? Or just ColdLoopBBs? If you make this change, I'd keep the name consistent throughout of course.

Also, you use <= here, but < everywhere else I see, any particular reason to include BBs in this list with the same frequency?

danielcdh updated this revision to Diff 75159.Oct 19 2016, 8:38 AM
danielcdh marked 7 inline comments as done.

update

Thanks for the reviews.

I also added an overall algorithm description at file level per Madhur's suggest.

lib/Transforms/Scalar/LoopSink.cpp
160

You actually mean the size of BBs to be small constant? Because computing BBsToSinkInto is the most computation intensive part of the algorithm.

195

The reason I used iterator here is because we need to handle the first entry in a different way.

chandlerc accepted this revision.Oct 27 2016, 12:58 AM
chandlerc added a reviewer: chandlerc.

Very cool. I think this patch LGTM with the comments below addressed (documentation fixes, simple changes, a fixme, a bunch of minor test cleanup). Feel free to submit. I've asked for one somewhat immediate follow-up patch, and it'd also be good to get a patch to put this into the pipeline behind a flag so folks can test the size impact.

lib/Transforms/Scalar/LoopSink.cpp
14–23

The differences seem to be a bit duplicated at this point. Sorry if this is the result of my suggestions.

I think that you only need one of the prose "It Differs ..." and the bulleted list. If you want the detail in the list, I would just clean up the wording of that section so the English reads cleanly:

"in a way that" -> "in the following ways"
"prehead" -> "preahader"
"find optimal" -> "find the optimal"

85

I would use SmallPtrSetImpl<BasicBlock *> here and elsewhere on API boundaries where you can below.

213

So, this technically will break the verifier if you ever look at the IR at this point. While that is allowed, it seems fairly easy to avoid this by first creating all the clones and rewriting uses to the clones before moving the instruction. By the time you move it, the only uses remaining should be the ones dominated by the destination insertion point.

221–229

This causes the cloning to be quadratic in the number of uses as we may have a single clone for each use.

I know we have an upper bound, but this is still pretty slow (the use list is slow to walk).

I'd suggest you fix this in a follow-up patch though (I think it'll be a lot of code and easier to review as a follow-up), just leave a short FIXME saying that this is slow and may be quadratic in the number of uses.

(For the follow-up patch, the approach I'd suggest is that as you build up BBsToInsertInto, you also build up a map from UseBBs to the dominating BB that will be inserted into. Then you can insert into each BBsToInsertInto here without rewriting any uses but building up a map from those BBs to the inserted clones. Finally, you can do a single walk over the uses and for each one look up the inserted BB in the first map and then the inserted clone in the second map and rewrite the use. Or maybe you see a simpler way? This was just the first that came to mind.)

265

Use a SmallDenseMap? Good to dodge allocations for small loops.

test/Transforms/LICM/loopsink.ll
9

You can prune out these "Function Attrs" comments... See below.

293

Please try to minimize function attributes you have in your test cases. You may not need any.

If you do need them, you can attach the textual form directly to the functions which is much more friendly for test cases (and makes the comments explaining what the '#0' attribute set contains unnecessary).

295–304

Please prune out all the metadata your test isn't directly using (TBAA stuff, Clang stuff).

test/Transforms/LICM/sink.ll
5–6

Unless your tests depend on specific datalayout or triple, please avoid including them in the IR test cases so that things are more generic and less tied to platforms.

23

Same comments as above about function attributes. Also, please don't use C++ mangled names, but instead provide clean and easy to read names directly.

danielcdh updated this revision to Diff 76055.Oct 27 2016, 9:24 AM
danielcdh marked 21 inline comments as done.
danielcdh edited edge metadata.

Thanks for the review!

danielcdh closed this revision.Oct 27 2016, 9:39 AM