This is an archive of the discontinued LLVM Phabricator instance.

[MachineLICM] Sink instructions only if they are unlikely to be executed
Needs ReviewPublic

Authored by djasper on Mar 19 2015, 10:37 AM.

Details

Reviewers
qcolombet
Summary

This is a more limited change which we can iterate on later.

Also remove the isCopy test, which probably isn't useful anymore.

Diff Detail

Event Timeline

djasper updated this revision to Diff 22264.Mar 19 2015, 10:37 AM
djasper retitled this revision from to [MachineLICM] Sink instructions only if they are unlikely to be executed.
djasper updated this object.
djasper edited the test plan for this revision. (Show Details)
djasper added a reviewer: qcolombet.
djasper added subscribers: Unknown Object (MLST), chandlerc.
qcolombet added inline comments.Mar 19 2015, 10:50 AM
lib/CodeGen/MachineLICM.cpp
805

Could you add a comment on the choice of the magic constant?
It’s the same used in the loop vectorizer IIRC, but having some hints why we choose that is a good thing.

Would it may send to make the denominator parametrizable so that it is easy to check different threshold?

849

Shouldn’t we have a threshold in the other direction as well, i.e., Freq(B) < Freq(Preheader) * <some threshold>?

test/CodeGen/X86/sink-cheap-instructions.ll
25

What is the purpose of this change?

djasper updated this revision to Diff 22305.Mar 19 2015, 2:43 PM
djasper added inline comments.
lib/CodeGen/MachineLICM.cpp
805

Done.

849

Could you elaborate what that might accomplish?

Also the frequency of the *pre*header doesn't really relate strongly to the frequencies within the loop AFAICT.

test/CodeGen/X86/sink-cheap-instructions.ll
25

The purpose is to ensure that the getelementptr for %6 is not pulled into the loop as it is used in every iteration. I realize that I have forgotten to actually test that. Fixed.

qcolombet added inline comments.Mar 19 2015, 3:30 PM
lib/CodeGen/MachineLICM.cpp
849

Sure, but the instructions we are moving come from the pre header and we do not want them to explode the number of time they are executed.
Let say the pre-header is executed once and the loop 1,000,000 times. Then, even in the cold path of the loop, this is still rather expensive to sink the instructions.

djasper added inline comments.Mar 20 2015, 2:15 AM
lib/CodeGen/MachineLICM.cpp
849

I think we will almost never have accurate information about that. And the cost model is also not easy. We are generally weighing up the cost of the additional computation in each loop vs. the cost of the additional live ranges of these registers and the register spills that might be a result. The latter might actually make the computation inside the loop *more* expensive as demonstrated in the changed test code. Here, we are very frequently accessing one of the struct's fields. If we hoist all of the GEPs out of the loop, we end up spilling them all onto the stack and each load becomes a load from memory. With this change, we keep the GEPs of the less frequently accessed fields inside the loops (where they are folded into the LEAs and potentially don't cause significant overhead over loading them from the stack onto which they would be spilled). We still pull out the GEP for the frequently accessed field and can actually keep that in a register using only a cheap MOV inside the loop.

I think the long-term solution might be to order all instructions we could sink by frequency and then sink them into the loop starting from the least frequently executed until there is no more register pressure. Not entirely sure how to implement that correctly yet. My hope is that this patch is an incremental step towards that.

djasper updated this revision to Diff 22347.Mar 20 2015, 8:00 AM

Turns out the IsCopy check was actually important and papered over the fact that we cannot sink something that is used by a PHI (because we sink to before the first non-PHI instruction later). Sinking something that is used by a PHI probably also doesn't make much sense (or at least needs a separate investigation) as e.g. we might sink along the loop-entry edge.

Also simplified the implementation.

qcolombet added inline comments.Mar 20 2015, 10:05 AM
lib/CodeGen/MachineLICM.cpp
835

Aborting on PHI does not make much sense.
Instead, you should look for the common dominator for the use of the PHI, e.g., by looking at the dominator of the terminator of the related block. In that case, you wouldn’t insert at after the first non-PHI instruction.

849

We are generally weighing up the cost of the additional computation in each loop vs. the cost of the additional live ranges of these registers and the register spills that might be a result.

I agree, but checking for frequencies to derive a heuristic for that sounds wrong to me.
To simplify, let us assume that the spiller would insert the spill instructions at the exact same locations as the sinking algorithm.
The first question we should ask ourselves is: do the reload instructions are more expensive than the things we sunk?
If the answer is no, then there is no point in sinking. Frequency does not help for that.

Second, when do we consider the live-range we shorten/extend?
E.g., what if you sink a = add b<kill> + c<kill>?
You end up increasing the register pressure by one in the whole loop body and pushed the instruction in a more expensive location. Frequency does not help for that too.

The bottom line is, I believe that frequency has nothing to do, with the heuristic we want, to achieve better spill placement.
Therefore, I do not think this is a step toward fixing the problem.

I am fine if you want to do experimentation to gather ideas, but seeing gains out of that study seems more luck based than actual improvements.

djasper added inline comments.Mar 20 2015, 3:25 PM
lib/CodeGen/MachineLICM.cpp
835

I don't think I understand this.

We are talking about an instruction that is currently in the loop preheader and its use is in a PHI. Doesn't that (at least in the vast majority of cases) mean that we are needing the value only when branching from the loop preheader into the loop? Thus, the instruction is currently at the best possible place.

849

The first question we should ask ourselves is: do the reload instructions are more expensive than the things we sunk?
If the answer is no, then there is no point in sinking. Frequency does not help for that.

I am not sure it is as easy. Register spills have other costs as well. At the very least, they are increasing the stack frame size and incur the cost of writing to the stack. If the instructions are only executed a handful of times, as in the case when they are unlikely to be executed, this additional cost is significant.

The model I have in mind in the long run is the following. Lets call the difference between instruction cost and cost of spill-reload C. It is the additional cost that a sunken instruction will incur each time it is executed in the loop. You are saying that we should sink all the instructions where C is negative (spill reload is cheaper). This is correct to a certain extent although we have to factor in register pressure because we might not need to spill at all. But that a bit of a separate consideration. Lets further assume that the likelihood of execution is L. Now, for positive C, we should sink instructions if C*L is smaller than some threshold that factors in the cost of the spill itself which I mentioned above. It is the latter part that I am getting at with this patch and I do think that looking at probabilities is a step into the right direction (although maybe not the most important first step).

Now, to explain where I am coming from. I have code with a loop, which basically does deserialization and inside the loop there is a large switch statement. In practice, each of the spilled registers is usually accessed either 0 or 1 times. Thus, sinking is very important.

Also, there is already code in MachineLICM, which kind of does the same thing in line 743. It basically checks whether the loop header has more than 25 successors in which case it cops out of hoisting instructions. Now, of course 25 is significantly greater than the 20% probability I am using. But also, just looking at the successor count is not very helpful as I think this condition would not fire if the loop contained a switch statement inside a surrounding branch in the loop body. We should probably also make this decision based on execution probabilities.

qcolombet edited edge metadata.Mar 20 2015, 3:58 PM

Hi Daniel,

Register spills have other costs as well. At the very least, they are increasing the stack frame size and incur the cost of writing to the stack. If the instructions are only executed a handful of times, as in the case when they are unlikely to be executed, this additional cost is significant.

Ok I see where you are going now. That being said, I think that the overhead of stores + stack update (i.e., what we are talking about) is not that relevant. Anyway, I agree that it should be taken into account at some point.

You are saying that we should sink all the instructions where C is negative (spill reload is cheaper).

The opposite :). We should *not* sink those instructions. Of course, we agree that this is useless to sink if register pressure is not a problem, which we do not check at all here.

Now, for positive C, we should sink instructions if C*L is smaller than some threshold that factors in the cost of the spill itself which I mentioned above. It is the latter part that I am getting at with this patch and I do think that looking at probabilities is a step into the right direction (although maybe not the most important first step).

Aside from register pressure, which now I understand you want to consider later, I am still not convinced that the suggested check represent a meaningful cost model.
Anyhow, we can rework that when we will look at the register pressure thing.
The bottom line is I am fine with whatever path you think is worth pursuing.

Cheers,
-Quentin

lib/CodeGen/MachineLICM.cpp
835

Shouldn’t have we filter out those cases with this check:
!HasLoopPHIUse(I)

I thought we were speaking of a PHI within a diamond in the loop, not the PHI of the header.

Hi Daniel,

Register spills have other costs as well. At the very least, they are increasing the stack frame size and incur the cost of writing to the stack. If the instructions are only executed a handful of times, as in the case when they are unlikely to be executed, this additional cost is significant.

Ok I see where you are going now. That being said, I think that the overhead of stores + stack update (i.e., what we are talking about) is not that relevant. Anyway, I agree that it should be taken into account at some point.

I think that largely depends on whether we think the instruction inside the loop is normally executed (more than once).

You are saying that we should sink all the instructions where C is negative (spill reload is cheaper).

The opposite :). We should *not* sink those instructions. Of course, we agree that this is useless to sink if register pressure is not a problem, which we do not check at all here.

Yeah, sorry, I meant we should sink the instructions where spill reload is more expensive than the actual instruction (as long as there is register pressure). I think I should look first look at register pressure first. Will do.

Now, for positive C, we should sink instructions if C*L is smaller than some threshold that factors in the cost of the spill itself which I mentioned above. It is the latter part that I am getting at with this patch and I do think that looking at probabilities is a step into the right direction (although maybe not the most important first step).

Aside from register pressure, which now I understand you want to consider later, I am still not convinced that the suggested check represent a meaningful cost model.
Anyhow, we can rework that when we will look at the register pressure thing.
The bottom line is I am fine with whatever path you think is worth pursuing.

Cheers,
-Quentin

lib/CodeGen/MachineLICM.cpp
835

Ah, right. Thanks for pointing that out. So, the trouble I was actually running into was that there is a PHI use outside of the loop.

I'll need to look closer at the cases where that can actually happen. Will update as soon as I have more information.