This is an archive of the discontinued LLVM Phabricator instance.

Add a new pass to speculate around PHI nodes with constant (integer) operands when profitable.
ClosedPublic

Authored by chandlerc on Sep 5 2017, 4:34 AM.

Details

Summary

The core idea is to (re-)introduce some redundancies where their cost is
hidden by the cost of materializing immediates for constant operands of
PHI nodes. When the cost of the redundancies is covered by this,
avoiding materializing the immediate has numerous benefits:

  1. Less register pressure
  2. Potential for further folding / combining
  3. Potential for more efficient instructions due to immediate operands

As a motivating example, consider the remarkably different cost on x86
of a SHL instruction with an immediate operand versus a register
operand.
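
For illustration, a minimal before/after sketch of the transform on hypothetical IR (names invented, not from the actual tests):

  ; Before: the shift amount is a PHI of constants, so each predecessor
  ; must materialize its constant into a register for the shl.
  merge:
    %amt = phi i32 [ 3, %a ], [ 7, %b ]
    %r = shl i32 %arg, %amt

  ; After: the shl is duplicated into each predecessor, where the shift
  ; amount becomes an immediate operand, and the PHI merges the results.
  a:
    %r.a = shl i32 %arg, 3
    br label %merge
  b:
    %r.b = shl i32 %arg, 7
    br label %merge
  merge:
    %r = phi i32 [ %r.a, %a ], [ %r.b, %b ]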

This pattern turns up surprisingly frequently, but is somewhat rarely
obvious as a significant performance problem.

The pass is entirely target independent, but it does rely on the target
cost model in TTI to decide when to speculate things around the PHI
node. I've included x86-focused tests, but any target that sets up its
immediate cost model should benefit from this pass.

There is probably more that can be done in this space, but the pass
as-is is enough to recover some important performance on our internal
benchmarks. It should be generally performance-neutral elsewhere; help
with more extensive benchmarking is always welcome.

One awkward part is that this pass has to be scheduled after
*everything* that can eliminate these kinds of redundancies. This
includes SimplifyCFG, GVN, etc. I'm open to suggestions about better
places to put this. We could in theory make it part of the codegen pass
pipeline, but there doesn't really seem to be a good reason for that --
it isn't "lowering" in any sense and only relies on pretty standard cost
model based TTI queries, so it seems to fit well with the "optimization"
pipeline model. Still, further thoughts on the pipeline position are
welcome.

I've also only implemented this in the new pass manager. If folks are
very interested, I can try to add it to the old PM as well, but I didn't
really see much point (my use case is already switched over to the new
PM).

I have some testing in place, but can probably add some more. However,
I've built a reasonable amount of code with this pass enabled (test
suite, SPEC, and a decent pile of internal code).

Diff Detail

Repository
rL LLVM

Event Timeline

chandlerc created this revision.Sep 5 2017, 4:34 AM
davidxl edited edge metadata.Sep 5 2017, 9:51 AM

Just some initial comments around the test cases.

It seems that all test cases are testing non-speculative code hoisting -- there are no speculative hoisting tests.

test/Transforms/SpeculateAroundPHIs/X86/basic.ll
141 ↗(On Diff #113835)

This is simple code hoisting without speculation.

It is unclear whether this is a win for this case -- it increases code size (and icache pressure), and it may not reduce register pressure either.

176 ↗(On Diff #113835)

What is the cost model here? If there are two uses, it will still be a net win to hoist -- the code size does not change, but the runtime cost is strictly reduced. If there are more than 2 uses, then there is tradeoff to make between size increase vs runtime cost reduction.

davide added a subscriber: davide.Sep 5 2017, 10:09 AM
hfinkel added inline comments.Sep 5 2017, 2:15 PM
include/llvm/Transforms/Scalar/SpeculateAroundPHIs.h
28 ↗(On Diff #113835)

parte -> part

30 ↗(On Diff #113835)

I don't think that you need the word totaly (which is also spelled incorrectly).

32 ↗(On Diff #113835)

I think that it would be helpful to show a little pseudo-code example of what this means here.

lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
55 ↗(On Diff #113835)

scane -> scan

200 ↗(On Diff #113835)

An outdated comment? The "function" operand is last. Bundle operands are before that. Regular function arguments are first.

298 ↗(On Diff #113835)

linear to compute -> linear-to-compute (dashes for a compound adjective)

306 ↗(On Diff #113835)

computattion -> computation

568 ↗(On Diff #113835)

relpcae -> replace

Just some initial comments around the test cases.

It seems that all test cases are testing non-speculative code hoisting -- there are no speculative hoisting tests.

Sorry, I can add some non-hoisted code as well, for example a call that may not return. That's the way in which this is technically speculation.

But I agree "speculation" is perhaps a frustrating term here. I'm using it only because a very similar transform in SROA where loads are lifted around PHI operands is called "speculating" and so it seemed the least bad term I knew of... That said, if there is a better term, I'm happy to switch to it.

chandlerc updated this revision to Diff 114301.Sep 7 2017, 6:45 PM
chandlerc marked 8 inline comments as done.

Update based on code review from David and Hal.

Still need to add some more interesting test cases.

chandlerc added inline comments.Sep 7 2017, 6:45 PM
include/llvm/Transforms/Scalar/SpeculateAroundPHIs.h
30 ↗(On Diff #113835)

I totaly don't. And I can totaly spell things just fine. ;]

(Thanks for catching.)

32 ↗(On Diff #113835)

I may have gone overboard. Lemme know if you have something else in mind.

test/Transforms/SpeculateAroundPHIs/X86/basic.ll
141 ↗(On Diff #113835)

Can you explain how it would increase code size? I'm not really seeing it, but maybe I'm just missing something. All of these trunc instructions won't require any materialized code on x86...

As for hoisting vs. speculation -- I can add a call that might never return after the PHI and before the other instructions, and we will still hoist. That becomes, technically, speculative. But see my other email -- happy to use different/better terminology, but so far all of the ideas I've had are less clear.

176 ↗(On Diff #113835)

See the code? We actually look at the uses...

dberlin edited edge metadata.EditedSep 7 2017, 8:57 PM

I'm unclear on what you think the safety conditions of this transform are. I believe yours are off a little :)

The actual "can you perform the transform" safety conditions are, as far as I know, as follows:

for each pred of a phi:
   for each operand, translate it through the phi block to the pred:
      if it was translated, it's safe
      if it was not translated:
        depth_first_search of def chain of operand
           if you hit a constant - safe
           if you hit something that properly dominates the phi block - safe
           if you hit a phi node in the same block - unsafe
          // If it's something in the phi block but has no phis, you could duplicate it if it is safe to speculate. Otherwise it is unsafe

The reason is there are only three cases:

  1. Either the operand chain is somehow defined by the phi in the same block - this is unsafe. By definition, if you had speculated this already, it would have been replaced by a phi that could be translated through, and you would have speculated the rest of the operand chain into phis as well. IE the fact that your current expression only indirectly depends on a phi means you already made a decision not to speculate the operand chain, either for safety or cost reasons.

(Unless your cost model is very strange, I don't see why you are re-evaluating this on each go-around.)

  2. Or it is defined but not phi-dependent in the block - these can be copied as a chain if you wanted.
  3. Or it must dominate the block, by definition.

If it didn't dominate the block, and didn't flow through the phi, this would be a path around the dominator to the block, which would mean it wasn't really a dominator.

Example of 1:

for.body.i:
  %tmp1 = phi i1 [ true, %entry ], [ false, %cond.end.i ]
  %f.08.i = phi i32 [ 0, %entry ], [ %inc.i, %cond.end.i ]
  %mul.i = select i1 %cmp1.i, i32 %f.08.i, i32 0
  br i1 %tmp1, label %cond.end.i, label %cond.true.i

cond.true.i:
  ;; Ensure we don't replace this divide with a phi of ops that merges the wrong loop iteration value
  %div.i = udiv i32 %mul.i, %f.08.i
  br label %cond.end.i
cond.end.i:
     ....
  %inc.i = add nuw nsw i32 %f.08.i, 1

f.08.i is a safe operand. It translates directly through the phi.
mul.i is not safe[1]. It depends on a phi recursively. As mentioned, if you had chosen to speculate the operands, it would now be a direct use of a phi (not an indirect one), because you would have converted the op of phis for the operand into a phi of ops. Thus, had it been converted, you would have translated it above, and it would have been safe.

IE

a = phi(0, b)
c = add a, 2
d = mul c, 30

the only way d is safe is if you had chosen to speculate c as well, because then you would have

pred2:
cop = add b, 2

a = phi(0, b)
cphi = phi(2, cop)
d = mul cphi, 30

and now it's a direct use :)

But you didn't choose to speculate c, so either c is itself unsafe for some reason, or it wasn't cost effective. I can't see how you could sanely decide it is safe or cost effective to speculate d, which also requires speculating c.

Note:
This is essentially a duplicate of OpIsSafeForPhiOfOps in NewGVN.cpp. I cache both safety and non-safety (but didn't bother to make it non-recursive). You only cache safety, which means you are N^2 if everything is safe except the last operand in the depth-first chain. You will re-explore that subgraph again and again, AFAICT. The subgraphs don't get safer over time, so you can cache non-safety as well.

[1]
An interesting problem here:
If you translate %div.i through the entry predecessor into udiv i32 %mul.i, 0, it will simplify to 0, which is even right.
if you then translate udiv i32 %mul.i, %f.08.i into udiv i32 %mul.i, %inc.i, you may even think this is safe. The simplifier will simplify it to zero as well (inc.i == f.08.i + 1, mul = select f.08.i, 0. Because it can prove the second op to udiv is greater than the first, the answer must be zero). It's constant, awesome!

You will get phi(0, 0), which is not right.

So you have to be careful. You can't call the simplifier on the intermediate expressions.
The real translation would be
foo = select i1 %cmp1.i, i32 %inc.i, i32 0
udiv i32 %foo, %inc.i

Which is 0 | 1, not 0

include/llvm/Transforms/Scalar/SpeculateAroundPHIs.h
91 ↗(On Diff #114301)

None of these are speculation; they are transformations from an op of phis to a phi of ops :)
(NewGVN performs this transform when it requires no code insertion)

In that respect, it's a code hoisting transform.

dberlin added a comment.EditedSep 7 2017, 9:24 PM

Other random note:
NewGVN can do this value-wise, not just lexically.
So in some future when NewGVN is the default, you may just want to make this part of one of the things it does after the analysis (probably after elimination too).
The eliminator will already have eliminated all cases where doing the above eliminates redundancies without costing anything (IE the part where you see how it simplifies)
We could also, for zero cost, track the things where it would have just been code motion (IE one operand constant, one operand not), and where we didn't do it because it required speculation of multiple operands.

Then your cost model does not need to take into account how much something simplifies, look for constants, etc. We could just tell you what the case is already, and you could make a decision whether to do the duplication or not.

(Side note: I remember that the ARM folks found something very similar to be profitable on aarch64 when I interned there ~2 years ago. See D11648.)
@jmolloy @aadg @silviu.baranga

So, I'm still parsing your first comment Danny, but seeing the second, I think there may be some confusion here...

I'm not actually modeling any simplifications at all, and I'm not really trying to. While that is interesting, I completely agree that this is the wrong place to do it and doesn't have the machinery you would need to do a good job. I think something based around a GVN-like analysis makes way more sense.

The "cost model" is trying to handle one fairly narrow and specific case: when the target's cost for materializing N constants along N incoming edges is equal to the cost of duplicating the user instructions of the incoming constant (and those users' dependencies) along the N incoming edges and letting the constant get materialized as part of the instruction.

I don't actually expect this to ever be profitable for large numbers of duplicated instructions... That typically doesn't make sense. The example I added to the top-level comment on the pass is really the case I care about. The theory is that all simplification-based, or non-duplicating variants of this transform, if advisable, would have *already happened*, likely by NewGVN or similar. This pass is just trying to handle the case where there are target-specific costs associated with materializing constants that don't occur when the constant is an operand of an instruction.
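
To make that concrete with a back-of-the-envelope instance (costs invented purely for illustration): with %p = phi i32 [ 7, %a ], [ 11, %b ] feeding a single add, the cost before is M(7) + M(11) + one register-register add (a mov-immediate on each incoming edge plus the shared add), while the cost after is one immediate-form add per edge. On x86 an add with an immediate operand is no bigger or slower than the mov-immediate it displaces, so M(7) + M(11) + add_rr >= add_ri(7) + add_ri(11), and the cost of the duplication is fully covered.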

Anyways, hopefully this clarified some of the intent behind the cost modeling. Like I said, I'm still going through the larger comment around safety checks and complexity. I actually had another data structure there and I suspect that Danny is correct that removing it does make this quadratic. And quite possibly there is a better way to avoid that, but yeah, need to go through that more to be confident of anything.

I'll admit to not staring at the cost model too hard (I rarely have useful input on that kind of target-specific thing), but at a glance it looked like you were trying to calculate what might constant-fold as well.
If not, or it's not part of the goal, awesome.

i'm happy to whiteboard the safety part with you.
What you are doing would be safe, but maybe not cost effective, if you check and move the entire operand chain until you hit a phi in the same block. It looks like you might be, but it's not entirely clear you are doing that.
The logic in speculatePHIs for visiting deps/users is hard enough for me to follow that I can't be sure it comes up with the right list.

The core point, however, to take the very simple example above:

a = phi(0, b)
c = add a, 2
d = mul c, a

It would not be safe to speculate mul c,a without copying + translating add a, 2 as well.

A valid translation to each predecessor for mul is not:
mul c, 0
and
mul c, b

it's
cop1 = add 0, 2
mul cop1, 0

and

cop2 = add b, 2
mul cop2, b

So you must copy and translate c to translate d.
Also, if you do the speculation in postorder, you are going to presumably reevaluate the same ops again and again in smaller chains (first you do d above, then you do c).

It's unclear to me that you are really doing it in postorder, and not preorder :)
If you are, I imagine it would be better to go the other way around, in preorder, because you could save redundant checks and cut off earlier.
This assumes, however, there is no reason to speculate d above if you choose not to speculate c.

Rebase (no substantive changes yet...)

I'll admit to not staring at the cost model too hard (I rarely have useful input on that kind of target specific thing), but it looked at a glance like you trying to calculate which might constant fold as well.
If not, or it's not part of the goal, awesome.

Correct, I'm not trying to calculate that, and it isn't part of the goal. The idea being, by the point this pass runs, any profitable folding-through-PHI-speculation should be handled by something much more like GVN. The intent is for this to catch cases where the minimal, canonical form has edge-dependent constant inputs and, without any *folding*, it is profitable on the target to speculate users along the edge. The things I imagine firing here are things like x86's immediate operands and other architectures' barrel-shifter operands.

Anyways, hopefully that clarifies. Also, the examples in the code now probably help. I've added a comment to specifically talk about the fact that we *don't* try to handle this case.

i'm happy to whiteboard the safety part with you.
What you are doing would be safe, but maybe not cost effective, if you check and move the entire operand chain until you hit a phi in the same block. It looks like you might be, but it's not entirely clear you are doing that.

I may have bugs, but that is exactly the intent of the code.

The logic in speculatephis and visiting deps/users is hard to follow for me to ensure it comes up with the right list.

Yeah, I tried to find a cleaner way to write this to make it clear, but failed to come up with one. It's in an awkward part of LLVM's IR for walking in the precise way we want. That is compounded by the fact that we expect this whole thing to happen relatively rarely, and to be profitable even more rarely. So the entire thing tries to avoid walking large amounts of the IR until there is at least some evidence that doing so is useful.

The core point, however, to take the very simple example above:

a = phi(0, b)
c = add a, 2
d = mul c, a

It would not be safe to speculate mul c,a without copying + translating add a, 2 as well.

A valid translation to each predecessor for mul is not:
mul c, 0
and
mul c, b

it's
cop1 = add 0, 2
mul cop1, 0

and

cop2 = add b, 2
mul cop2, b

So you must copy and translate c to translate d.

Correct. And I think I have test cases covering this, for example @test_speculate_free_insts, where we are required to copy and speculate trunc instructions even though they don't use any PHI nodes, just as we would have to do for the add in your example.
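
Roughly the shape that test takes, sketched and simplified here (names invented, not the literal test):

  merge:
    %t = trunc i64 %x to i32    ; defined in the PHI's block, so it must be
    %sum = add i32 %t, %p       ; duplicated along with the add

becomes

  a:
    %t.a = trunc i64 %x to i32
    %sum.a = add i32 %t.a, 7
    br label %merge
  b:
    %t.b = trunc i64 %x to i32
    %sum.b = add i32 %t.b, 11
    br label %merge
  merge:
    %sum = phi i32 [ %sum.a, %a ], [ %sum.b, %b ]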

Also, if you do the speculation in postorder, you are going to presumably reevaluate the same ops again and again in smaller chains (first you do d above, then you do c).

It's unclear to me that you are really doing it in postorder, and not preorder :)
If you are, I imagine it would be better to go the other way around, in preorder, because you could save redundant checks and cut off earlier.
This assumes, however, there is no reason to speculate d above if you choose not to speculate c.

I think the profitability walk is OK -- it is in postorder, but of the operand graph rather than the use graph -- essentially, the inverse graph. It is specifically designed to check the shorter chains first and memoize that result when checking the longer chains.

It turns out that we have to do this to get good profitability measurements anyways, even setting aside the complexity. If we don't find small chains that are themselves profitable and mark them as profitable on their own, we would incorrectly count their speculation cost against the benefit of speculating the large chain. We still can't get this to be perfect because it is a greedy algorithm and there are cases it doesn't handle, but by being greedy in the right order we handle important, obvious cases such as exactly the one you describe, where we want to first consider the chain rooted at c, then at d, and so on, where each is a superset of the one before and it is important to consider the cost incrementally. I have some specific test cases for that which motivate this.
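
A small sketch of that memoized order on an invented chain:

  %a = phi i32 [ 7, %p1 ], [ 11, %p2 ]
  %c = add i32 %a, 2
  %d = mul i32 %c, 5

The postorder walk over operands costs %c first and memoizes the result; when it then costs %d, it adds only the mul on top of the memoized cost of %c, so %c's operands are not re-walked and an independently profitable %c is not double-counted against %d.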

However, the *safety* check is I think still a problem (as you pointed out in your first comment). Specifically, it caches safe subgraphs but not unsafe subgraphs and so it can walk the unsafe subgraph over and over again and go superlinear. I'm working on a fix to that, but wanted to write up this much and see if we're on the same page.

chandlerc updated this revision to Diff 117643.Oct 4 2017, 2:36 AM

Rebase (and move to mono-repo layout, sorry for any disruption).

Address the feedback from Danny by rewriting the safety check to work
depth-first (so that we can effectively cache *unsafe* operations rather than
re-exploring them) and preorder (so that we prune early rather than late). Last
but not least, this also allows us to cache safe-to-speculate sub-regions even
if the whole is not safe, avoiding re-exploring those sub-regions repeatedly.

I've also added various comments and cleaned up the code a bit to reflect the
more robust design.

I actually took a stab at sharing code between preorder and postorder walks
here, but it made the code substantially harder to read. This is in large part
because the preorder checks are distinctly for *operands* only and thus the
structure ends up surprisingly different.

FWIW, I think with the latest patch update this has now addressed Danny's primary concerns, and should be much more effective at avoiding re-visiting regions of IR.

The biggest remaining issue of revisiting regions of IR is the fact that we don't cache things between basic blocks. We could move to do that, but I'm worried about how much more complex the logic would become. Would appreciate thoughts on that from others.

I've taken some time to go through the code in NewGVN that computes similar things based on the suggestion from Danny. These now do *very* similar things in terms of walk. The differences I've seen are:

  1. While the safety checks are both preorder, my code uses DFS instead of BFS. The reason is that we walk up the stack to mark more nodes as unsafe when we discover an unsafe node.
  2. The safety checks in this pass are stricter than in NewGVN because other parts of NewGVN handle various safety concerns of code motion. We essentially merge them here and use stricter checks.
  3. The cost modeling does DFS postorder. I don't see a way around this given what the cost model is trying to do. It lets us use dynamic programming to avoid recomputing the same cost multiple times. Fortunately, this walk never fails and we use the above preorder-checked safety walk to establish the total set of nodes traversed.

I suspect this will make it hard and unlikely to be useful to share a lot of code between the two. =/ They are solving similar but ultimately different problems and have different tradeoffs as a consequence.

Anyways, I think the patch as is is probably good for review now.

davidxl added inline comments.Oct 5 2017, 2:14 PM
llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
36 ↗(On Diff #117643)

wether --> whether

67 ↗(On Diff #117643)

This is too limiting. The scenario is commonly created by the store sinking optimization -- which involves memory operations.

107 ↗(On Diff #117643)

Why not check this first?

112 ↗(On Diff #117643)

What is the second check about? OpI is not a code motion candidate.

The first check seems wrong to me too. At this point, we have determined that OpI is not in a block that dominates the target blocks of the code motion (hoisting), so it should return false immediately.

I don't see the need to do a DFS walk to check the chains of operands either.

The caching here seems also wrong and confusing.

185 ↗(On Diff #117643)

This cost should be weighted by the block frequency of the constant materializing block.

186 ↗(On Diff #117643)

This should remember the total frequency for incomingC

256 ↗(On Diff #117643)

The cost analysis here assumes the non-speculative case where the uses of the phi are fully anticipated. Consider the real speculative case:

/// entry:
///     br i1 %flag, label %a, label %b
///
///   a:
///     br label %merge
///
///   b:
///     br label %merge
///
///   merge:
///     %p = phi i32 [ 7, %a ], [ 11, %b ]
///     br i1 %flag2, label %use, label %exit
///
///   use:
///     %sum = add i32 %arg, %p
///     br label %exit
///
///   exit:
///    %s = phi ([0, ...], [%sum, ..])
///     ret i32 %s

Hoisting the add into %a and %b is speculative. Additional speculation cost is involved, depending on the cost of the add and the relative frequency of %a+%b vs. %use.

Consider another case, with a triangular CFG:

/// entry:
///     br i1 %flag, label %a, label %merge
///   a:
///     br label %merge
///
///   merge:
///     %p = phi i32 [ 7, %a ], [ 11, %entry ]
///     %sum = add i32 %arg, %p
///    ret i32 %sum

Hoisting the 'add' into %entry and %a introduces runtime overhead.

684 ↗(On Diff #117643)

Looks like PotentialSpecSet and UnsafeSet are shared across different PN candidates -- this does not look correct.

(quick responses, still working on code updates from the review)

llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
67 ↗(On Diff #117643)

There is a FIXME discussing this already?

However, this pass is already able to handle the motivating case, so it seems reasonable to extend this cautiously in subsequent commits?

107 ↗(On Diff #117643)

Sorry, this order came from when we didn't update the set as aggressively. Yeah, we should check this first.

112 ↗(On Diff #117643)

If OpI isn't in a block that dominates the target blocks, we will need to hoist OpI as well as the original UI. That's the point of this walk and the discussion w/ Danny: we may need to hoist more than one instruction in order to hoist the user of the PHI.

This actually was necessary to get basic tests to work at all because in many cases the operand is a trivial/free instruction such as a trunc or zext between i64 and i32. We want to hoist those in addition to the PHI use. There is a test case that demonstrates this as well I think.

Because we're doing that, we need to actually recurse in some way. DFS vs. BFS is just a tradeoff in terms of how you visit things and my previous emails discussed why I think DFS is probably right.

Hopefully this also makes the caching more clear, but please let me know what is unclear/confusing if not.

185 ↗(On Diff #117643)

I'm not sure.

Currently, the heuristic is to only do this when it is a strict win in terms of cost (including size). Given that, I don't see much to do with BFI.

We could potentially go with something more aggressive and that would definitely need BFI to do accurately. But I'm not sure what the right heuristics would be there. The only places I've seen this really matter were trivially profitable. So I'm inclined to start with the simple model. We can always add more smarts to the cost model when test cases motivating it arrive.

186 ↗(On Diff #117643)

(I assume this is w.r.t. using BFI, so will defer unless we add that logic...)

256 ↗(On Diff #117643)

Ah, this explains perhaps the comment above regarding BFI.

The second case was definitely unintended. I can add something to prevent this. I'm not sure we *ever* want to do that case.

For the first case, I'm not so sure. If the 'add' is the same cost as materializing the constant, and we'll have to materialize the constant anyways, is this really a problem? We can suppress this too if so.

I'm somewhat interested in simpler cost modeling and just eliminating the hard cases here if possible (IE, until we see places where the fancier approach is needed).

684 ↗(On Diff #117643)

Only for PNs in the same basic block, which should make all of these the same and correct to cache?

I can add an assert w.r.t. the PNs being in the same block though.

I've taken some time to go through the code in NewGVN that computes similar things based on the suggestion from Danny. These now do *very* similar things in terms of walk. The differences I've seen are:

  1. While the safety checks are both preorder, my code uses DFS instead of BFS. The reason is that we walk up the stack to mark more nodes as unsafe when we discover an unsafe node.

I originally wrote mine as a DFS; I may go back to it.
I actually originally had GraphTraits set up so it could be used with a depth_first_iterator on operands and walk the def-use chain. I wonder if I should do that and we could share *that*, or whether it's not worth it (obviously, not something we have to do this second, and I could look at it as a followup).

  2. The safety checks in this pass are stricter than in NewGVN because other parts of NewGVN handle various safety concerns of code motion. We essentially merge them here and use stricter checks.
  3. The cost modeling does DFS postorder. I don't see a way around this given what the cost model is trying to do. It lets us use dynamic programming to avoid recomputing the same cost multiple times. Fortunately, this walk never fails and we use the above preorder-checked safety walk to establish the total set of nodes traversed.

I suspect this will make it hard and unlikely to be useful to share a lot of code between the two. =/

:(
They are solving similar but ultimately different problems and have different tradeoffs as a consequence.

Anyways, I think the patch as is is probably good for review now.

I'm on vacation, but I'll try to look at it late next week if nobody beats me to it.

davidxl added inline comments.Oct 6 2017, 10:38 AM
llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
67 ↗(On Diff #117643)

This might affect the design decisions: use MemSSA or AS?

It is probably good to handle the easiest cases (which cover 90% of the cases), where the memory instruction is just hoisted to its direct predecessor blocks.

112 ↗(On Diff #117643)

So dependent OpI are also hoisting candidates. However, I don't see the cost model actually considering them.

Besides, speculative and non-speculative hoisting candidates are all handled here. For the real speculative ones, there are not enough checks for safe speculation -- e.g. instructions that may trap or result in exceptions.

117 ↗(On Diff #117643)

What is the point of tracking UnsafeSet?

The code will be much more readable if the following is done:

  1. collect the set of all operands (transitively) UI depends on, which are not defined in a dominating block of the hoisting target blocks;
  2. check if it is safe to hoist those dependent operands. If any one of them is not safe, bail.

119 ↗(On Diff #117643)

Why not pop the DFSstack?

256 ↗(On Diff #117643)

As an example for the first case, suppose there are two incoming constants C1 and C2. M(C1) and M(C2) are the costs of materializing the constants, and F(C1) and F(C2) are the costs of folding them into UI. Assume the frequencies of the two incoming blocks are P1 and P2.

The total cost before will be P1 * M(C1) + P2 * M(C2), and the total cost after is P1 * F(C1) + P2 * F(C2).

The comparison can be reduced to comparing M(C1) + M(C2) with F(C1) + F(C2) only when

  1. P1 == P2

or

  2. M(C1) == M(C2) and F(C1) == F(C2)

Depending on the value of C1 and C2, 2) may not be true.
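
For a concrete (invented) instance of that gap: let P1 = 0.9, P2 = 0.1, M(C1) = M(C2) = 1, F(C1) = 0, F(C2) = 3. The unweighted comparison rejects the transform, since F(C1) + F(C2) = 3 > 2 = M(C1) + M(C2), yet the weighted cost after, 0.9 * 0 + 0.1 * 3 = 0.3, beats the weighted cost before, 0.9 * 1 + 0.1 * 1 = 1.0.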

Hopefully responding to all of the comments. Note the last one which is probably the most interesting point of discussion around cost modeling.

llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
67 ↗(On Diff #117643)

FWIW, I think the current design handles all of the cases I've seen in the wild. Anyways, I don't disagree about handling this, it just seems easy to do in a follow-up patch.

107 ↗(On Diff #117643)

Coming back to this -- the checks above this are actually probably cheaper. Neither require walking the operands at all and so they should allow us to completely avoid the potentially speculated set queries entirely. They're also both cases where we won't actually speculate the instruction at all. So I think the current order makes sense. I've added some comments to help clarify this.

112 ↗(On Diff #117643)

The cost model does consider the transitive operands that have to be speculated? There are tests around this.

Also, mayBeMemoryDependent handles all of the cases you mention. It requires the instruction to both be safe to speculatively execute (and therefore not control or data dependent in any way) and not read or write memory. It is essentially the strictest predicate we could use (hence the FIXMEs around using more nuanced checks in the future).

117 ↗(On Diff #117643)

See my conversation with Danny about why this is needed?

We are doing exactly what you suggest, but memoizing the result to reduce the total complexity of this algorithm.

119 ↗(On Diff #117643)

This is less code and avoids mutating the stack on each step of the loop. Return will free the memory anyways.

256 ↗(On Diff #117643)

I should clarify why the second case you mentioned originally cannot in fact happen:

The code splits critical edges so that it should not be changing the total number of instructions executed along any given path. The speculation goes into a non-critical edge, and so it shouldn't go onto a path that wouldn't have executed that exact instruction anyways.

And again, there are already tests of this.
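
To sketch what the splitting buys on the triangular CFG from the earlier example (block names invented): the critical edge from %entry to %merge gets its own block, so every path still executes exactly one add:

entry:
  br i1 %flag, label %a, label %entry.split

a:
  %sum.a = add i32 %arg, 7
  br label %merge

entry.split:
  %sum.e = add i32 %arg, 11
  br label %merge

merge:
  %sum = phi i32 [ %sum.a, %a ], [ %sum.e, %entry.split ]
  ret i32 %sum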

For the first case, I see what you're getting at with this example. However, I'm worried about scaling by probability: it risks doing way too much speculation just to remove a constant materialization on extremely hot paths. A very minor improvement along a very hot path could have nearly unbounded code-size regressions along the cold path.

An alternative, simpler approach would be to check that, in addition to F(C1) + F(C2) < M(C1) + M(C2) being true, F(C1) <= M(C1) && F(C2) <= M(C2) also holds. That should ensure that no path regresses its dynamic cost, and that the total isn't a size regression overall. I haven't looked at it in detail, but something like this might make for an easier model than a profile-based one and still avoid regressions.
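
Using the invented numbers from the example above (P = 0.9/0.1, M = 1/1, F = 0/3): F(C2) = 3 > 1 = M(C2) fails the per-edge F(Ci) <= M(Ci) check, so the hot-path win is given up in exchange for a guarantee that no individual path gets slower.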

Naturally, if you're aware of cases where we *need* to specifically bias towards a hot path with this kind of optimization, that's a different story and we should absolutely use the profile there. I'm just not currently aware of any such cases.

chandlerc updated this revision to Diff 119279.Oct 17 2017, 3:24 AM

Add a stricter form of checking for profitability to avoid one potential issue
with this patch. Also clean up comments in various places to try and help
address code review feedback.

chandlerc added inline comments.Oct 17 2017, 3:26 AM
llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
256 ↗(On Diff #117643)

I've implemented this and it isn't bad at all.

When initially establishing whether there are any savings to be had from speculating, we check that no incoming constant becomes worse with folding, and then compute the total savings.

I found a way to test for it, but it is somewhat lame because the *speculation* cost (which is separate from the cost mentioned here) ends up too high anyways. However, I've added the test case and verified that in fact we catch the lack of profitability earlier with the new code. Hopefully this addresses the last concern here. We can of course revisit this and add profile-based heuristics if there is a need for them, but all the test cases I have are happy with the strict "no regression" approach.

davidxl added inline comments.Oct 17 2017, 10:47 AM
llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
256 ↗(On Diff #117643)

Can you clarify "the second case you mentioned originally cannot in fact happen"?

I don't see a test case with it. The closest one is @test_no_spec_dominating_phi, but the phi use in that case still post-dominates the phi, so it is not really speculative hoisting. If you move the use into the %a or %b block, the hoisting will actually change the dynamic instruction counts.

Add a test case covering critical edge splitting.

llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
256 ↗(On Diff #117643)

Wow, I missed adding this test case, sorry about that. Just updated the patch to include it. You can find the code that handles this by looking at the critical edge splitting.

davidxl added inline comments.Oct 17 2017, 11:57 AM
llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
256 ↗(On Diff #117643)

Thanks.

I am actually expecting a test case like the following:

entry:
  br i1 %flag1, label %x.block, label %y.block

x.block:
  br label %merge

y.block:
  br label %merge

merge:
  %p = phi i32 [ 7, %x.block ], [ 11, %y.block ]
  br i1 %flag2, label %a, label %b

a:
 %sum = add i32 %5, %p
  br label %exit

b:
  br label %exit

exit:
 %s = phi i32 [%sum, %a], [0, %b]
  ret i32 %s

Perhaps we also need a test to cover the case where another dependent operand of the use instruction is itself a PHI. This case can be handled for free.

entry:
  br i1 %flag1, label %x.block, label %y.block

x.block:
  br label %merge

y.block:
  br label %merge

merge:
  %p = phi i32 [ 7, %x.block ], [ 11, %y.block ]
  %xy.phi = phi i32 [ %x, %x.block ], [ %y, %y.block ]

  %sum = add i32 %xy.phi, %p
  ret i32 %sum

MatzeB added a subscriber: MatzeB.Oct 17 2017, 2:22 PM

Without having read the code yet:

We could in theory make it part of the codegen pass
pipeline, but there doesn't really seem to be a good reason for that --
it isn't "lowering" in any sense and only relies on pretty standard cost
model based TTI queries, so it seems to fit well with the "optimization"
pipeline model.

This does sound like a CodeGen pass to me! Not all of CodeGen is lowering, we do have optimization passes there too. And the fact that TTI is used is another indicator that this is a machine specific pass. This being part of CodeGen would also allow targets to not add the pass to the pipeline if it doesn't help them.

Without having read the code yet:

We could in theory make it part of the codegen pass
pipeline, but there doesn't really seem to be a good reason for that --
it isn't "lowering" in any sense and only relies on pretty standard cost
model based TTI queries, so it seems to fit well with the "optimization"
pipeline model.

This does sound like a CodeGen pass to me! Not all of CodeGen is lowering, we do have optimization passes there too. And the fact that TTI is used is another indicator that this is a machine specific pass. This being part of CodeGen would also allow targets to not add the pass to the pipeline if it doesn't help them.

I responded to Matthias in person about this, but wanted to add it here.

Essentially, we have a nice "optimization" phase of the main pass pipeline now where there is no ambiguity about this being a non-canonicalization transform.

It would be pretty hard to implement this today as an MI pass because of how folding of addressing modes and such happens currently. I don't really know how to do this at all.

We could make this a CodeGen IR pass, but there doesn't seem to be any reason to do so. The API needed for this is already in TTI and seems to fit the TTI model very well: a completely generic cost model.

So it seems fine to add it here, and if we ever want to sink it to a CodeGen IR pass, we can, but I don't see any reason to speculatively do that. (Cue the puns...)

chandlerc updated this revision to Diff 123657.Nov 20 2017, 2:30 PM

Update to address the issue highlighted by David and add more test cases.

chandlerc added inline comments.Nov 20 2017, 2:31 PM
llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
256 ↗(On Diff #117643)

Ok, both of these are handled.

I used a very blunt tool for this: insisting the uses are in the same BB as the PHI node. I left a FIXME about potentially relaxing this to just be a post-dominance check. Since this is about cost modeling, we don't even need a full reachability test.

Anyways, hopefully this looks better now. I also added test cases covering the PHI translation (which was already implemented).

This revision is now accepted and ready to land.Nov 20 2017, 10:45 PM
MatzeB accepted this revision.Nov 27 2017, 2:50 PM

LGTM

I've also only implemented this in the new pass manager. If folks are

very interested, I can try to add it to the old PM as well, but I didn't
really see much point (my use case is already switched over to the new
PM).

Isn't the legacy PM still the default we use for most frontends, including clang, with the new PM only used for (Thin)LTO so far?

llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
53–58 ↗(On Diff #123657)

Too much auto for my taste. I think writing the types is friendlier to code readers (unless you have something like a cast<XXX> on the RHS, which makes the type obvious).

llvm/test/Transforms/SpeculateAroundPHIs/X86/lit.local.cfg
1–2 ↗(On Diff #123657)

As this is only a single test, you could use REQUIRES: x86-registered-target in the test itself instead of creating an x86 specific subdirectory.

LGTM

I've also only implemented this in the new pass manager. If folks are

very interested, I can try to add it to the old PM as well, but I didn't
really see much point (my use case is already switched over to the new
PM).

Isn't the legacy PM still the default we use for most frontends, including clang, with the new PM only used for (Thin)LTO so far?

Clang has a flag to use the new PM everywhere. Google is already using it by default, and I have an email thread on llvm-dev about the (very few) things left before we can make it the default for Clang.

chandlerc marked 2 inline comments as done.Nov 28 2017, 3:16 AM

Thanks for all the feedback, landing with suggested fixes.

I've also run SPEC with this change and saw no significant changes, so it seems safe performance-wise. It also didn't regress a wide range of benchmarks internally.

This change alone is, however, worth between 2% and 4% latency on one unusually important benchmark: what is for us the hottest path through tcmalloc code. It helps optimize the size class computation that has long been part of tcmalloc:
https://github.com/gperftools/gperftools/blob/master/src/common.h#L204-L210

I also have at least one improvement to this that I'm working on, but it should be a good follow-up patch. Nothing is really wrong as-is, and it's good to start breaking things into better increments.

llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
53–58 ↗(On Diff #123657)

The RHS says uses() and the type is Use. That seemed sufficiently redundant to me? But the tree has a bit of a mixture, combined with places that are just amazingly confusing. I even use Use elsewhere in this file, so happily changed.

llvm/test/Transforms/SpeculateAroundPHIs/X86/lit.local.cfg
1–2 ↗(On Diff #123657)

Ah, yeah, for something this tiny that's likely to be way cleaner long term.

This revision was automatically updated to reflect the committed changes.
chandlerc marked 2 inline comments as done.
reames added a subscriber: reames.Jan 2 2018, 2:12 PM

Catching up on old review traffic. Nothing critical.

llvm/trunk/include/llvm/Transforms/Scalar/SpeculateAroundPHIs.h
84

Isn't this snippet wrong? It looks like we're adding 7 or 18 here, not 7 or 11. I think you're missing a conditional branch. Alternatively, you could make the second add be an "addl $4, %edi" (since 4+7 == 11).

If we don't do that later transformation, maybe we should. Extracting out the common component of the two constants, adding that unconditionally, and then conditionally adding the difference seems like a near ideal lowering here. We'd end up with something like:

  testq %eax, %eax
  addl $7, %edi
  jne .L
  addl $4, %edi    ; <-- different constant
.L:
  movl %edi, %eax
  retq

thopre added a subscriber: thopre.May 28 2021, 8:45 AM

@chandlerc

This pass does not seem to account for the extra branch instruction (1 for exiting the loop plus 1 for looping back, vs. a single branch when there's no critical edge), nor does it account for the critical edge preventing the use of a hardware loop instruction. The example that causes us trouble is:

unsigned KnownDec(unsigned *arr) {
  unsigned x = 0x2000;
  unsigned z = 0;
  while(x) {
    z += arr[x-1];
    x--;
  }
  return z;
}

When compiling for our (non-upstream) target, we get the following IR after this pass:

entry:
  %sub.0 = add nsw i32 2000, -1
  br label %while.body

while.body:                                       ; preds = %while.body.while.body_crit_edge, %entry
  %z.07 = phi i32 [ 0, %entry ], [ %add, %while.body.while.body_crit_edge ]
  %sub.phi = phi i32 [ %sub.0, %entry ], [ %sub.1, %while.body.while.body_crit_edge ]
  %arrayidx = getelementptr inbounds i32, i32* %arr, i32 %sub.phi
  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2
  %add = add i32 %0, %z.07
  %tobool.not = icmp eq i32 %sub.phi, 0
  br i1 %tobool.not, label %while.end, label %while.body.while.body_crit_edge, !llvm.loop !6

while.body.while.body_crit_edge:                  ; preds = %while.body
  %sub.1 = add nsw i32 %sub.phi, -1
  br label %while.body

while.end:                                        ; preds = %while.body
  ret i32 %add

The fact that the condition check is separate from the loop latch means we cannot use a hardware loop instruction. Similar code gets generated for PowerPC (using 0x10000 instead of 2000), but they run EarlyCSE afterwards, which sinks the value back down into the PHI, and by chance they have a pass to canonicalize the loop form for their addressing modes, which removes the getelementptr in the critical edge (inserted there by the loop strength reduction pass). This feels kinda lucky: the EarlyCSE and ppc-loop-instr-form-prep passes were added to the PPC pipeline long before the PHI speculation one. Is the expectation that targets with hardware loops deal with the result of PHI speculation? If that's the case, could we have a hook for those targets to disable the pass when they suspect a hardware loop instruction might be used?
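
(For reference, a sketch of the single-block latch form that hardware-loop formation typically wants -- the decrement, exit test, and backedge all in one block with no critical-edge block; the 8192 assumes the 0x2000 from the C source:

while.body:
  %z.07 = phi i32 [ 0, %entry ], [ %add, %while.body ]
  %x = phi i32 [ 8192, %entry ], [ %sub, %while.body ]
  %sub = add nsw i32 %x, -1
  %arrayidx = getelementptr inbounds i32, i32* %arr, i32 %sub
  %0 = load i32, i32* %arrayidx, align 4
  %add = add i32 %0, %z.07
  %tobool.not = icmp eq i32 %sub, 0
  br i1 %tobool.not, label %while.end, label %while.body)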

Best regards.

Herald added a project: Restricted Project.May 28 2021, 8:45 AM

I guess I still didn't post a reproducer, but I've seen that the IR after this transform
undergoes LSR and ends up having a pretty negative performance effect in the end.
I'm rather unconvinced that this optimization belongs in the IR.
We just can't be sure that in the end we will actually have the instructions we cost-model here.
I would much rather see this being a (late) codegen pass.