This is an archive of the discontinued LLVM Phabricator instance.

Add a new pass to speculate around PHI nodes with constant (integer) operands when profitable.
ClosedPublic

Authored by chandlerc on Sep 5 2017, 4:34 AM.

Details

Summary

The core idea is to (re-)introduce some redundancies where their cost is
hidden by the cost of materializing immediates for constant operands of
PHI nodes. When the cost of the redundancies is covered by this,
avoiding materializing the immediate has numerous benefits:

  1. Less register pressure
  2. Potential for further folding / combining
  3. Potential for more efficient instructions due to immediate operands

As a motivating example, consider the remarkably different cost on x86
of a SHL instruction with an immediate operand versus a register
operand.
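
For illustration, a minimal before/after sketch of the transform on hypothetical IR (names invented, not from the actual tests):

  ; Before: the shift amount is a PHI of constants, so each predecessor
  ; must materialize its constant into a register for the shl.
  merge:
    %amt = phi i32 [ 3, %a ], [ 7, %b ]
    %r = shl i32 %arg, %amt

  ; After: the shl is duplicated into each predecessor, where the shift
  ; amount becomes an immediate operand, and the PHI merges the results.
  a:
    %r.a = shl i32 %arg, 3
    br label %merge
  b:
    %r.b = shl i32 %arg, 7
    br label %merge
  merge:
    %r = phi i32 [ %r.a, %a ], [ %r.b, %b ]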

This pattern turns up surprisingly frequently, but is somewhat rarely
obvious as a significant performance problem.

The pass is entirely target independent, but it does rely on the target
cost model in TTI to decide when to speculate things around the PHI
node. I've included x86-focused tests, but any target that sets up its
immediate cost model should benefit from this pass.

There is probably more that can be done in this space, but the pass
as-is is enough to recover some important performance on our internal
benchmarks. It should be generally performance-neutral elsewhere; help
with more extensive benchmarking is always welcome.

One awkward part is that this pass has to be scheduled after
*everything* that can eliminate these kinds of redundancies. This
includes SimplifyCFG, GVN, etc. I'm open to suggestions about better
places to put this. We could in theory make it part of the codegen pass
pipeline, but there doesn't really seem to be a good reason for that --
it isn't "lowering" in any sense and only relies on pretty standard cost
model based TTI queries, so it seems to fit well with the "optimization"
pipeline model. Still, further thoughts on the pipeline position are
welcome.

I've also only implemented this in the new pass manager. If folks are
very interested, I can try to add it to the old PM as well, but I didn't
really see much point (my use case is already switched over to the new
PM).

I have some testing in place, but can probably add some more. However,
I've built a reasonable amount of code with this pass enabled (test
suite, SPEC, and a decent pile of internal code).

Diff Detail

Repository
rL LLVM

Event Timeline

chandlerc created this revision.Sep 5 2017, 4:34 AM
davidxl edited edge metadata.Sep 5 2017, 9:51 AM

Just some initial comments around the test cases.

It seems that all test cases are testing non-speculative code hoisting -- there are no speculative hoisting tests.

test/Transforms/SpeculateAroundPHIs/X86/basic.ll
141 ↗(On Diff #113835)

This is simple code hoisting without speculation.

It is unclear whether this is a win for this case -- it increases code size (and icache pressure), and it may not reduce register pressure either.

176 ↗(On Diff #113835)

What is the cost model here? If there are two uses, it will still be a net win to hoist -- the code size does not change, but the runtime cost is strictly reduced. If there are more than 2 uses, then there is tradeoff to make between size increase vs runtime cost reduction.

davide added a subscriber: davide.Sep 5 2017, 10:09 AM
hfinkel added inline comments.Sep 5 2017, 2:15 PM
include/llvm/Transforms/Scalar/SpeculateAroundPHIs.h
28 ↗(On Diff #113835)

parte -> part

30 ↗(On Diff #113835)

I don't think that you need the word totaly (which is also spelled incorrectly).

32 ↗(On Diff #113835)

I think that it would be helpful to show a little pseudo-code example of what this means here.

lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
55 ↗(On Diff #113835)

scane -> scan

200 ↗(On Diff #113835)

An outdated comment? The "function" operand is last. Bundle operands are before that. Regular function arguments are first.

298 ↗(On Diff #113835)

linear to compute -> linear-to-compute (dashes for a compound adjective)

306 ↗(On Diff #113835)

computattion -> computation

568 ↗(On Diff #113835)

relpcae -> replace

Just some initial comments around the test cases.

It seems that all test cases are testing non-speculative code hoisting -- there are no speculative hoisting tests.

Sorry, I can add some non-hoisted code as well, for example a call that may not return. That's the way in which this is technically speculation.

But I agree "speculation" is perhaps a frustrating term here. I'm using it only because a very similar transform in SROA where loads are lifted around PHI operands is called "speculating" and so it seemed the least bad term I knew of... That said, if there is a better term, I'm happy to switch to it.

chandlerc updated this revision to Diff 114301.Sep 7 2017, 6:45 PM
chandlerc marked 8 inline comments as done.

Update based on code review from David and Hal.

Still need to add some more interesting test cases.

chandlerc added inline comments.Sep 7 2017, 6:45 PM
include/llvm/Transforms/Scalar/SpeculateAroundPHIs.h
30 ↗(On Diff #113835)

I totaly don't. And I can totaly spell things just fine. ;]

(Thanks for catching.)

32 ↗(On Diff #113835)

I may have gone overboard. Lemme know if you have something else in mind.

test/Transforms/SpeculateAroundPHIs/X86/basic.ll
141 ↗(On Diff #113835)

Can you explain how it would increase code size? I'm not really seeing it, but maybe I'm just missing something. All of these trunc instructions won't require any materialized code on x86...

As for hoisting vs. speculation -- I can add a call that might never return after the PHI and before the other instructions, and we will still hoist. That becomes, technically, speculative. But see my other email -- happy to use different/better terminology, but so far all of the ideas I've had are less clear.

176 ↗(On Diff #113835)

See the code? We actually look at the uses...

dberlin edited edge metadata.EditedSep 7 2017, 8:57 PM

I'm unclear on what you think the safety conditions of this transform are. I believe yours are off a little :)

The actual "can you perform the transform" safety conditions are, as far as I know, as follows:

for each pred of a phi:
   for each operand, translate it through the phi block to the pred:
      if it was translated, it's safe
      if it was not translated:
        depth_first_search of def chain of operand
           if you hit a constant - safe
           if you hit something that properly dominates the phi block - safe
           if you hit a phi node in the same block - unsafe
          // If it's something in the phi block but has no phis, you could duplicate it if it is safe to speculate. Otherwise it is unsafe

The reason is there are only three cases:

  1. Either the operand chain is somehow defined by the phi in the same block - this is unsafe. By definition, if you had speculated this already, it would have been replaced by a phi that could be translated through, and you would have speculated the rest of the operand chain into phis as well. IE the fact that your current expression only indirectly depends on a phi means you already made a decision not to speculate the operand chain, either for safety or cost reasons.

(Unless your cost model is very strange, I don't see why you are re-evaluating this on each go-around.)

  2. Or it is defined but not phi-dependent in the block - these can be copied as a chain if you wanted.
  3. Or it must dominate the block, by definition.

If it didn't dominate the block, and didn't flow through the phi, this would be a path around the dominator to the block, which would mean it wasn't really a dominator.

Example of 1:

for.body.i:
  %tmp1 = phi i1 [ true, %entry ], [ false, %cond.end.i ]
  %f.08.i = phi i32 [ 0, %entry ], [ %inc.i, %cond.end.i ]
  %mul.i = select i1 %cmp1.i, i32 %f.08.i, i32 0
  br i1 %tmp1, label %cond.end.i, label %cond.true.i

cond.true.i:
  ;; Ensure we don't replace this divide with a phi of ops that merges the wrong loop iteration value
  %div.i = udiv i32 %mul.i, %f.08.i
  br label %cond.end.i
cond.end.i:
     ....
  %inc.i = add nuw nsw i32 %f.08.i, 1

f.08.i is a safe operand. It translates directly through the phi.
mul.i is not safe[1]. It depends on a phi recursively. As mentioned, if you had chosen to speculate the operands, it would now be a direct use of a phi (not an indirect one), because you would have converted the op of phis for the operand into a phi of ops. Thus, had it been converted, you would have translated it above, and it would have been safe.

IE

a = phi(0, b)
c = add a, 2
d = mul c, 30

the only way d is safe is if you had chosen to speculate c as well, because then you would have

pred2:
cop = add b, 2

a = phi(0, b)
cphi = phi(2, cop)
d = mul cphi, 30

and now it's a direct use :)

But you didn't choose to speculate c, so either c is itself unsafe for some reason, or it wasn't cost effective. I can't see how you could sanely decide it is safe or cost effective to speculate d, which also requires speculating c.

Note:
This is essentially a duplicate of OpIsSafeForPhiOfOps in NewGVN.cpp. I cache both safety and non-safety (but didn't bother to make it non-recursive). You only cache safety, which means you are N^2 if everything is safe except the last operand in the depth-first chain. You will re-explore that subgraph again and again, AFAICT. The subgraphs don't get safer over time, so you can cache non-safety as well.

[1]
An interesting problem here:
If you translate %div.i through the entry predecessor into udiv i32 %mul.i, 0, it will simplify to 0, which is even right.
if you then translate udiv i32 %mul.i, %f.08.i into udiv i32 %mul.i, %inc.i, you may even think this is safe. The simplifier will simplify it to zero as well (inc.i == f.08.i + 1, mul = select f.08.i, 0. Because it can prove the second op to udiv is greater than the first, the answer must be zero). It's constant, awesome!

You will get phi(0, 0), which is not right.

So you have to be careful. You can't call the simplifier on the intermediate expressions.
The real translation would be
foo = select i1 %cmp1.i, i32 %inc.i, i32 0
udiv i32 %foo, %inc.i

Which is 0 | 1, not 0

include/llvm/Transforms/Scalar/SpeculateAroundPHIs.h
91 ↗(On Diff #114301)

None of these are speculation; they are transformations from an op of phis to a phi of ops :)
(NewGVN performs this transform when it requires no code insertion)

In that respect, it's a code hoisting transform.

dberlin added a comment.EditedSep 7 2017, 9:24 PM

Other random note:
NewGVN can do this value-wise, not just lexically.
So in some future when NewGVN is the default, you may just want to make this part of one of the things it does after the analysis (probably after elimination too).
The eliminator will already have eliminated all cases where doing the above eliminates redundancies without costing anything (IE the part where you see how it simplifies)
We could also, for zero cost, track the things where it would have just been code motion (IE one operand constant, one operand not), and where we didn't do it because it required speculation of multiple operands.

Then your cost model does not need to take into account how much something simplifies, look for constants, etc. We could just tell you what the case is already, and you could make a decision whether to do the duplication or not.

(Side note: I remember that the ARM folks found something very similar to be profitable on aarch64 when I interned there ~2 years ago. See D11648.)
@jmolloy @aadg @silviu.baranga

So, I'm still parsing your first comment Danny, but seeing the second, I think there may be some confusion here...

I'm not actually modeling any simplifications at all, and I'm not really trying to. While that is interesting, I completely agree that this is the wrong place to do it and doesn't have the machinery you would need to do a good job. I think something based around a GVN-like analysis makes way more sense.

The "cost model" is trying to handle one fairly narrow and specific case: when the target's cost for materializing N constants along N incoming edges is equal to the cost of duplicating the user instructions of the incoming constant (and those users' dependencies) along the N incoming edges and letting the constant get materialized as part of the instruction.

I don't actually expect this to ever be profitable for large numbers of duplicated instructions... That typically doesn't make sense. The example I added to the top-level comment on the pass is really the case I care about. The theory is that all simplification-based, or non-duplicating variants of this transform, if advisable, would have *already happened*, likely by NewGVN or similar. This pass is just trying to handle the case where there are target-specific costs associated with materializing constants that don't occur when the constant is an operand of an instruction.
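
To make that concrete with a back-of-the-envelope instance (costs invented purely for illustration): with %p = phi i32 [ 7, %a ], [ 11, %b ] feeding a single add, the cost before is M(7) + M(11) + one register-register add (a mov-immediate on each incoming edge plus the shared add), while the cost after is one immediate-form add per edge. On x86 an add with an immediate operand is no bigger or slower than the mov-immediate it displaces, so M(7) + M(11) + add_rr >= add_ri(7) + add_ri(11), and the cost of the duplication is fully covered.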

Anyways, hopefully this clarified some of the intent behind the cost modeling. Like I said, I'm still going through the larger comment around safety checks and complexity. I actually had another data structure there and I suspect that Danny is correct that removing it does make this quadratic. And quite possibly there is a better way to avoid that, but yeah, need to go through that more to be confident of anything.

I'll admit to not staring at the cost model too hard (I rarely have useful input on that kind of target-specific thing), but at a glance it looked like you were trying to calculate what might constant-fold as well.
If not, or it's not part of the goal, awesome.

i'm happy to whiteboard the safety part with you.
What you are doing would be safe, but maybe not cost effective, if you check and move the entire operand chain until you hit a phi in the same block. It looks like you might be, but it's not entirely clear you are doing that.
The logic in speculatePHIs for visiting deps/users is hard enough for me to follow that I can't be sure it comes up with the right list.

The core point, however, to take the very simple example above:

a = phi(0, b)
c = add a, 2
d = mul c, a

It would not be safe to speculate mul c,a without copying + translating add a, 2 as well.

A valid translation to each predecessor for mul is not:
mul c, 0
and
mul c, b

it's
cop1 = add 0, 2
mul cop1, 0

and

cop2 = add b, 2
mul cop2, b

So you must copy and translate c to translate d.
Also, if you do the speculation in postorder, you are going to presumably reevaluate the same ops again and again in smaller chains (first you do d above, then you do c).

It's unclear to me that you are really doing it in postorder, and not preorder :)
If you are, I imagine it would be better to go the other way around, in preorder, because you could save redundant checks and cut off earlier.
This assumes, however, there is no reason to speculate d above if you choose not to speculate c.

Rebase (no substantive changes yet...)

I'll admit to not staring at the cost model too hard (I rarely have useful input on that kind of target specific thing), but it looked at a glance like you trying to calculate which might constant fold as well.
If not, or it's not part of the goal, awesome.

Correct, I'm not trying to calculate that, and it isn't part of the goal. The idea being, by the point this pass runs, any profitable folding-through-PHI-speculation should be handled by something much more like GVN. The intent is for this to catch cases where the minimal, canonical form has edge-dependent constant inputs and, without any *folding*, it is profitable on the target to speculate users along the edge. The things I imagine firing here are things like x86's immediate operands and other architectures' barrel-shifter operands.

Anyways, hopefully that clarifies. Also, the examples in the code now probably help. I've added a comment to specifically talk about the fact that we *don't* try to handle this case.

i'm happy to whiteboard the safety part with you.
What you are doing would be safe, but maybe not cost effective, if you check and move the entire operand chain until you hit a phi in the same block. It looks like you might be, but it's not entirely clear you are doing that.

I may have bugs, but that is exactly the intent of the code.

The logic in speculatephis and visiting deps/users is hard to follow for me to ensure it comes up with the right list.

Yeah, I tried to find a cleaner way to write this to make it clear, but failed to come up with one. It's in an awkward part of LLVM's IR for walking in the precise way we want. That is compounded by the fact that we expect this whole thing to happen relatively rarely, and to be profitable even more rarely. So the entire thing tries to avoid walking large amounts of the IR until there is at least some evidence that doing so is useful.

The core point, however, to take the very simple example above:

a = phi(0, b)
c = add a, 2
d = mul c, a

It would not be safe to speculate mul c,a without copying + translating add a, 2 as well.

A valid translation to each predecessor for mul is not:
mul c, 0
and
mul c, b

it's
cop1 = add 0, 2
mul cop1, 0

and

cop2 = add b, 2
mul cop2, b

So you must copy and translate c to translate d.

Correct. And I think I have test cases covering this, for example @test_speculate_free_insts, where we are required to copy and speculate trunc instructions even though they don't use any PHI nodes, just as we would have to do for the add in your example.
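
Roughly the shape that test takes, sketched and simplified here (names invented, not the literal test):

  merge:
    %t = trunc i64 %x to i32    ; defined in the PHI's block, so it must be
    %sum = add i32 %t, %p       ; duplicated along with the add

becomes

  a:
    %t.a = trunc i64 %x to i32
    %sum.a = add i32 %t.a, 7
    br label %merge
  b:
    %t.b = trunc i64 %x to i32
    %sum.b = add i32 %t.b, 11
    br label %merge
  merge:
    %sum = phi i32 [ %sum.a, %a ], [ %sum.b, %b ]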

Also, if you do the speculation in postorder, you are going to presumably reevaluate the same ops again and again in smaller chains (first you do d above, then you do c).

It's unclear to me that you are really doing it in postorder, and not preorder :)
If you are, I imagine it would be better to go the other way around, in preorder, because you could save redundant checks and cut off earlier.
This assumes, however, there is no reason to speculate d above if you choose not to speculate c.

I think the profitability walk is OK -- it is in postorder, but of the operand graph rather than the use graph -- essentially, the inverse graph. It is specifically designed to check the shorter chains first and memoize that result when checking the longer chains.

It turns out that we have to do this to get good profitability measurements anyways, even setting aside the complexity. If we don't find small chains that are themselves profitable and mark them as profitable on their own, we would incorrectly count their speculation cost against the benefit of speculating the large chain. We still can't get this to be perfect because it is a greedy algorithm and there are cases it doesn't handle, but by being greedy in the right order we handle important, obvious cases such as exactly the one you describe, where we want to first consider the chain rooted at c, then at d, and so on, where each is a superset of the one before and it is important to consider the cost incrementally. I have some specific test cases for that which motivate this.
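
A small sketch of that memoized order on an invented chain:

  %a = phi i32 [ 7, %p1 ], [ 11, %p2 ]
  %c = add i32 %a, 2
  %d = mul i32 %c, 5

The postorder walk over operands costs %c first and memoizes the result; when it then costs %d, it adds only the mul on top of the memoized cost of %c, so %c's operands are not re-walked and an independently profitable %c is not double-counted against %d.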

However, the *safety* check is I think still a problem (as you pointed out in your first comment). Specifically, it caches safe subgraphs but not unsafe subgraphs and so it can walk the unsafe subgraph over and over again and go superlinear. I'm working on a fix to that, but wanted to write up this much and see if we're on the same page.

chandlerc updated this revision to Diff 117643.Oct 4 2017, 2:36 AM

Rebase (and move to mono-repo layout, sorry for any disruption).

Address the feedback from Danny by rewriting the safety check to work
depth-first (so that we can effectively cache *unsafe* operations rather than
re-exploring them) and preorder (so that we prune early rather than late). Last
but not least, this also allows us to cache safe-to-speculate sub-regions even
if the whole is not safe, avoiding re-exploring those sub-regions repeatedly.

I've also added various comments and cleaned up the code a bit to reflect the
more robust design.

I actually took a stab at sharing code between preorder and postorder walks
here, but it made the code substantially harder to read. This is in large part
because the preorder checks are distinctly for *operands* only and thus the
structure ends up surprisingly different.

FWIW, I think with the latest patch update this has now addressed Danny's primary concerns, and should be much more effective at avoiding re-visiting regions of IR.

The biggest remaining issue of revisiting regions of IR is the fact that we don't cache things between basic blocks. We could move to do that, but I'm worried about how much more complex the logic would become. Would appreciate thoughts on that from others.

I've taken some time to go through the code in NewGVN that computes similar things based on the suggestion from Danny. These now do *very* similar things in terms of walk. The differences I've seen are:

  1. While the safety checks are both preorder, my code uses DFS instead of BFS. The reason is that we walk up the stack to mark more nodes as unsafe when we discover an unsafe node.
  2. The safety checks in this pass are stricter than in NewGVN because other parts of NewGVN handle various safety concerns of code motion. We essentially merge them here and use stricter checks.
  3. The cost modeling does DFS postorder. I don't see a way around this given what the cost model is trying to do. It lets us use dynamic programming to avoid recomputing the same cost multiple times. Fortunately, this walk never fails and we use the above preorder-checked safety walk to establish the total set of nodes traversed.

I suspect this will make it hard and unlikely to be useful to share a lot of code between the two. =/ They are solving similar but ultimately different problems and have different tradeoffs as a consequence.

Anyways, I think the patch as is is probably good for review now.

davidxl added inline comments.Oct 5 2017, 2:14 PM
llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
36 ↗(On Diff #117643)

wether --> whether

67 ↗(On Diff #117643)

This is too limiting. The scenario is commonly created by the store sinking optimization -- which involves memory operations.

107 ↗(On Diff #117643)

Why not check this first?

112 ↗(On Diff #117643)

What is the second check about? OpI is not a code motion candidate.

The first check seems wrong to me too. At this point, we have determined that OpI is not in a block that dominates the target blocks of the code motion (hoisting), so it should return false immediately.

I don't see the need to do a DFS walk to check the chains of operands either.

The caching here seems also wrong and confusing.

185 ↗(On Diff #117643)

This cost should be weighted by the block frequency of the constant materializing block.

186 ↗(On Diff #117643)

This should remember the total frequency for incomingC

256 ↗(On Diff #117643)

The cost analysis here assumes the non-speculative case where the uses of the phi are fully anticipated. Consider the real speculative case:

/// entry:
///     br i1 %flag, label %a, label %b
///
///   a:
///     br label %merge
///
///   b:
///     br label %merge
///
///   merge:
///     %p = phi i32 [ 7, %a ], [ 11, %b ]
///     br i1 %flag2, label %use, label %exit
///
///   use:
///     %sum = add i32 %arg, %p
///     br label %exit
///
///   exit:
///    %s = phi ([0, ...], [%sum, ..])
///     ret i32 %s

Hoisting the add into %a and %b is speculative. Additional speculation cost is involved, depending on the cost of the add and the relative frequency of %a+%b vs. %use.

Consider another case, with a triangular CFG:

/// entry:
///     br i1 %flag, label %a, label %merge
///   a:
///     br label %merge
///
///   merge:
///     %p = phi i32 [ 7, %a ], [ 11, %entry ]
///     %sum = add i32 %arg, %p
///    ret i32 %sum

Hoisting the 'add' into %entry and %a introduces runtime overhead.

684 ↗(On Diff #117643)

Looks like PotentialSpecSet and UnsafeSet are shared across different PN candidates -- this does not look correct.

(quick responses, still working on code updates from the review)

llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
67 ↗(On Diff #117643)

There is a FIXME discussing this already?

However, this pass is already able to handle the motivating case, so it seems reasonable to extend this cautiously in subsequent commits?

107 ↗(On Diff #117643)

Sorry, this order came from when we didn't update the set as aggressively. Yeah, we should check this first.

112 ↗(On Diff #117643)

If OpI isn't in a block that dominates the target blocks, we will need to hoist OpI as well as the original UI. That's the point of this walk and the discussion w/ Danny: we may need to hoist more than one instruction in order to hoist the user of the PHI.

This actually was necessary to get basic tests to work at all because in many cases the operand is a trivial/free instruction such as a trunc or zext between i64 and i32. We want to hoist those in addition to the PHI use. There is a test case that demonstrates this as well I think.

Because we're doing that, we need to actually recurse in some way. DFS vs. BFS is just a tradeoff in terms of how you visit things and my previous emails discussed why I think DFS is probably right.

Hopefully this also makes the caching more clear, but please let me know what is unclear/confusing if not.

185 ↗(On Diff #117643)

I'm not sure.

Currently, the heuristic is to only do this when it is a strict win in terms of cost (including size). Given that, I don't see much to do with BFI.

We could potentially go with something more aggressive and that would definitely need BFI to do accurately. But I'm not sure what the right heuristics would be there. The only places I've seen this really matter were trivially profitable. So I'm inclined to start with the simple model. We can always add more smarts to the cost model when test cases motivating it arrive.

186 ↗(On Diff #117643)

(I assume this is w.r.t. using BFI, so will defer unless we add that logic...)

256 ↗(On Diff #117643)

Ah, this explains perhaps the comment above regarding BFI.

The second case was definitely unintended. I can add something to prevent this. I'm not sure we *ever* want to do that case.

For the first case, I'm not so sure. If the 'add' is the same cost as materializing the constant, and we'll have to materialize the constant anyways, is this really a problem? We can suppress this too if so.

I'm somewhat interested in simpler cost modeling and just eliminating the hard cases here if possible (IE, until we see places where the fancier approach is needed).

684 ↗(On Diff #117643)

Only for PNs in the same basic block, which should make all of these the same and correct to cache?

I can add an assert w.r.t. the PNs being in the same block though.

I've taken some time to go through the code in NewGVN that computes similar things based on the suggestion from Danny. These now do *very* similar things in terms of walk. The differences I've seen are:

  1. While the safety checks are both preorder, my code uses DFS instead of BFS. The reason is that we walk up the stack to mark more nodes as unsafe when we discover an unsafe node.

I originally wrote mine as a DFS; I may go back to it.
I actually originally had GraphTraits set up so it could be used with a depth_first_iterator on operands and walk the def-use chain. I wonder if I should do that and we could share *that*, or whether it's not worth it (obviously, not something we have to do this second, and I could look at it as a followup).

  2. The safety checks in this pass are stricter than in NewGVN because other parts of NewGVN handle various safety concerns of code motion. We essentially merge them here and use stricter checks.
  3. The cost modeling does DFS postorder. I don't see a way around this given what the cost model is trying to do. It lets us use dynamic programming to avoid recomputing the same cost multiple times. Fortunately, this walk never fails and we use the above preorder-checked safety walk to establish the total set of nodes traversed.

I suspect this will make it hard and unlikely to be useful to share a lot of code between the two. =/

:(
They are solving similar but ultimately different problems and have different tradeoffs as a consequence.

Anyways, I think the patch as is is probably good for review now.

I'm on vacation, but I'll try to look at it late next week if nobody beats me to it.

davidxl added inline comments.Oct 6 2017, 10:38 AM
llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
67 ↗(On Diff #117643)

This might affect the design decisions: use MemSSA or AS?

It is probably good to handle the easiest cases (which cover 90% of the cases), where the memory instruction is just hoisted to its direct predecessor blocks.

112 ↗(On Diff #117643)

So dependent OpI are also hoisting candidates. However, I don't see the cost model actually considering them.

Besides, speculative and non-speculative hoisting candidates are all handled here. For the real speculative ones, there are not enough checks for safe speculation -- e.g. instructions that may trap or result in exceptions.

117 ↗(On Diff #117643)

What is the point of tracking UnsafeSet?

The code will be much more readable if the following is done:

  1. collect the set of all operands (transitively) UI depends on, which are not defined in a dominating block of the hoisting target blocks;
  2. check if it is safe to hoist those dependent operands. If any one of them is not safe, bail.

119 ↗(On Diff #117643)

Why not pop the DFSstack?

256 ↗(On Diff #117643)

As an example for the first case, suppose there are two incoming constants C1 and C2. M(C1) and M(C2) are the costs of materializing the constants, and F(C1) and F(C2) are the costs of folding them into UI. Assume the frequencies of the two incoming blocks are P1 and P2.

The total cost before will be P1 * M(C1) + P2 * M(C2), and the total cost after is P1 * F(C1) + P2 * F(C2).

The comparison can be reduced to comparing M(C1) + M(C2) with F(C1) + F(C2) only when

  1. P1 == P2

or

  2. M(C1) == M(C2) and F(C1) == F(C2)

Depending on the value of C1 and C2, 2) may not be true.
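
For a concrete (invented) instance of that gap: let P1 = 0.9, P2 = 0.1, M(C1) = M(C2) = 1, F(C1) = 0, F(C2) = 3. The unweighted comparison rejects the transform, since F(C1) + F(C2) = 3 > 2 = M(C1) + M(C2), yet the weighted cost after, 0.9 * 0 + 0.1 * 3 = 0.3, beats the weighted cost before, 0.9 * 1 + 0.1 * 1 = 1.0.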

Hopefully responding to all of the comments. Note the last one which is probably the most interesting point of discussion around cost modeling.

llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
67 ↗(On Diff #117643)

FWIW, I think the current design handles all of the cases I've seen in the wild. Anyways, I don't disagree about handling this, it just seems easy to do in a follow-up patch.

107 ↗(On Diff #117643)

Coming back to this -- the checks above this are actually probably cheaper. Neither require walking the operands at all and so they should allow us to completely avoid the potentially speculated set queries entirely. They're also both cases where we won't actually speculate the instruction at all. So I think the current order makes sense. I've added some comments to help clarify this.

112 ↗(On Diff #117643)

The cost model does consider the transitive operands that have to be speculated? There are tests around this.

Also, mayBeMemoryDependent handles all of the cases you mention. It requires the instruction to both be safe to speculatively execute (and therefore not control or data dependent in any way) and not read or write memory. It is essentially the strictest predicate we could use (hence the FIXMEs around using more nuanced checks in the future).

117 ↗(On Diff #117643)

See my conversation with Danny about why this is needed?

We are doing exactly what you suggest, but memoizing the result to reduce the total complexity of this algorithm.

119 ↗(On Diff #117643)

This is less code and avoids mutating the stack on each step of the loop. Return will free the memory anyways.

256 ↗(On Diff #117643)

I should clarify why the second case you mentioned originally cannot in fact happen:

The code splits critical edges so that it should not be changing the total number of instructions executed along any given path. The speculation goes into a non-critical edge, and so it shouldn't go onto a path that wouldn't have executed that exact instruction anyways.

And again, there are already tests of this.
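
To sketch what the splitting buys on the triangular CFG from the earlier example (block names invented): the critical edge from %entry to %merge gets its own block, so every path still executes exactly one add:

entry:
  br i1 %flag, label %a, label %entry.split

a:
  %sum.a = add i32 %arg, 7
  br label %merge

entry.split:
  %sum.e = add i32 %arg, 11
  br label %merge

merge:
  %sum = phi i32 [ %sum.a, %a ], [ %sum.e, %entry.split ]
  ret i32 %sum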

For the first case, I see what you're getting at with this example. However, I'm worried about scaling by probability: it risks doing way too much speculation just to remove a constant materialization on extremely hot paths. A very minor improvement along a very hot path could have nearly unbounded code-size regressions along the cold path.

An alternative, simpler approach would be to check that, in addition to F(C1) + F(C2) < M(C1) + M(C2) being true, F(C1) <= M(C1) && F(C2) <= M(C2) also holds. That should ensure that no path regresses its dynamic cost, and that the total isn't a size regression overall. I haven't looked at it in detail, but something like this might make for an easier model than a profile-based one and still avoid regressions.
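
Using the invented numbers from the example above (P = 0.9/0.1, M = 1/1, F = 0/3): F(C2) = 3 > 1 = M(C2) fails the per-edge F(Ci) <= M(Ci) check, so the hot-path win is given up in exchange for a guarantee that no individual path gets slower.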

Naturally, if you're aware of cases where we *need* to specifically bias towards a hot path with this kind of optimization, that's a different story and we should absolutely use the profile there. I'm just not currently aware of any such cases.

chandlerc updated this revision to Diff 119279.Oct 17 2017, 3:24 AM

Add a stricter form of checking for profitability to avoid one potential issue
with this patch. Also clean up comments in various places to try and help
address code review feedback.

chandlerc added inline comments.Oct 17 2017, 3:26 AM
llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
256 ↗(On Diff #117643)

I've implemented this and it isn't bad at all.

When initially establishing whether there are any savings to be had from speculating, we check that no incoming constant becomes worse with folding, and then compute the total savings.

I found a way to test for it, but it is somewhat lame because the *speculation* cost (which is separate from the cost mentioned here) ends up too high anyways. However, I've added the test case and verified that in fact we catch the lack of profitability earlier with the new code. Hopefully this addresses the last concern here. We can of course revisit this and add profile-based heuristics if there is a need for them, but all the test cases I have are happy with the strict "no regression" approach.

davidxl added inline comments.Oct 17 2017, 10:47 AM
llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
256 ↗(On Diff #117643)

Can you clarify "the second case you mentioned originally cannot in fact happen"?

I don't see a test case with it. The closest one is @test_no_spec_dominating_phi, but the phi use in that case still post-dominates the phi, so it is not really speculative hoisting. If you move the use into the %a or %b block, the hoisting will actually change the dynamic instruction counts.

Add a test case covering critical edge splitting.

llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
256 ↗(On Diff #117643)

Wow, I missed adding this test case, sorry about that. Just updated the patch to include it. You can find the code that handles this by looking at the critical edge splitting.

davidxl added inline comments.Oct 17 2017, 11:57 AM
llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
256 ↗(On Diff #117643)

Thanks.

I am actually expecting a test case like the following:

entry:
  br i1 %flag1, label %x.block, label %y.block

x.block:
  br label %merge

y.block:
  br label %merge

merge:
  %p = phi i32 [ 7, %x.block ], [ 11, %y.block ]
  br i1 %flag2, label %a, label %b

a:
 %sum = add i32 %5, %p
  br label %exit

b:
  br label %exit

exit:
 %s = phi i32 [%sum, %a], [0, %b]
  ret i32 %s

Perhaps we also need a test to cover the case where another dependent operand of the use instruction is itself a PHI. This case can be handled for free.

entry:
  br i1 %flag1, label %x.block, label %y.block

x.block:
  br label %merge

y.block:
  br label %merge

merge:
  %p = phi i32 [ 7, %x.block ], [ 11, %y.block ]
  %xy.phi = phi i32 [ %x, %x.block ], [ %y, %y.block ]

  %sum = add i32 %xy.phi, %p
  ret i32 %sum

MatzeB added a subscriber: MatzeB.Oct 17 2017, 2:22 PM

Without having read the code yet:

We could in theory make it part of the codegen pass
pipeline, but there doesn't really seem to be a good reason for that --
it isn't "lowering" in any sense and only relies on pretty standard cost
model based TTI queries, so it seems to fit well with the "optimization"
pipeline model.

This does sound like a CodeGen pass to me! Not all of CodeGen is lowering, we do have optimization passes there too. And the fact that TTI is used is another indicator that this is a machine specific pass. This being part of CodeGen would also allow targets to not add the pass to the pipeline if it doesn't help them.

Without having read the code yet:

We could in theory make it part of the codegen pass
pipeline, but there doesn't really seem to be a good reason for that --
it isn't "lowering" in any sense and only relies on pretty standard cost
model based TTI queries, so it seems to fit well with the "optimization"
pipeline model.

This does sound like a CodeGen pass to me! Not all of CodeGen is lowering, we do have optimization passes there too. And the fact that TTI is used is another indicator that this is a machine specific pass. This being part of CodeGen would also allow targets to not add the pass to the pipeline if it doesn't help them.

I responded to Matthias in person about this, but wanted to add it here.

Essentially, we have a nice "optimization" phase of the main pass pipeline now where there is no ambiguity about this being a non-canonicalization transform.

It would be pretty hard to implement this today as an MI pass because of how folding of addressing modes and such happens currently. I don't really know how to do this at all.

We could make this a CodeGen IR pass, but there doesn't seem to be any reason to do so. The API needed for this is already in TTI and seems to fit the TTI model very well: a completely generic cost model.

So it seems fine to add it here, and if we ever want to sink it to a CodeGen IR pass, we can, but I don't see any reason to speculatively do that. (Cue the puns...)

chandlerc updated this revision to Diff 123657.Nov 20 2017, 2:30 PM

Update to address the issue highlighted by David and add more test cases.

chandlerc added inline comments.Nov 20 2017, 2:31 PM
llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
256 ↗(On Diff #117643)

Ok, both of these are handled.

I used a very blunt tool for this: insisting the uses are in the same BB as the PHI node. I left a FIXME about potentially relaxing this to just be a post-dominance check. Since this is about cost modeling, we don't even need a full reachability test.

Anyways, hopefully this looks better now. I also added test cases covering the PHI translation (which was already implemented).

This revision is now accepted and ready to land.Nov 20 2017, 10:45 PM
MatzeB accepted this revision.Nov 27 2017, 2:50 PM

LGTM

I've also only implemented this in the new pass manager. If folks are

very interested, I can try to add it to the old PM as well, but I didn't
really see much point (my use case is already switched over to the new
PM).

Isn't the legacy PM still the default we use for most frontends, including clang, with the new PM only used for (Thin)LTO so far?

llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
53–58 ↗(On Diff #123657)

Too much auto for my taste. I think writing the types is friendlier to code readers (unless you have something like a cast<XXX> on the RHS, which makes the type obvious).

llvm/test/Transforms/SpeculateAroundPHIs/X86/lit.local.cfg
1–2 ↗(On Diff #123657)

As this is only a single test, you could use REQUIRES: x86-registered-target in the test itself instead of creating an x86 specific subdirectory.

LGTM

I've also only implemented this in the new pass manager. If folks are

very interested, I can try to add it to the old PM as well, but I didn't
really see much point (my use case is already switched over to the new
PM).

Isn't the legacy PM still the default we use for most frontends, including clang, with the new PM only used for (Thin)LTO so far?

Clang has a flag to use the new PM everywhere. Google is already using it by default, and I have an email thread on llvm-dev about the (very few) things left before we can make it the default for Clang.

chandlerc marked 2 inline comments as done.Nov 28 2017, 3:16 AM

Thanks for all the feedback, landing with suggested fixes.

I've also run SPEC with this change and saw no significant changes, so it seems safe performance-wise. It also didn't regress a wide range of benchmarks internally.

This change alone is, however, worth between 2% and 4% latency on one unusually important benchmark: what is for us the hottest path through tcmalloc code. It helps optimize the size class computation that has long been part of tcmalloc:
https://github.com/gperftools/gperftools/blob/master/src/common.h#L204-L210

I also have at least one improvement to this that I'm working on, but it should be a good follow-up patch. Nothing is really wrong as-is, and it's good to start breaking things into better increments.

llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp
53–58 ↗(On Diff #123657)

The RHS says uses() and the type is Use. That seemed sufficiently redundant to me? But the tree has a bit of a mixture, combined with places that are just amazingly confusing. I even use Use elsewhere in this file, so happily changed.

llvm/test/Transforms/SpeculateAroundPHIs/X86/lit.local.cfg
1–2 ↗(On Diff #123657)

Ah, yeah, for something this tiny that's likely to be way cleaner long term.

This revision was automatically updated to reflect the committed changes.
chandlerc marked 2 inline comments as done.
reames added a subscriber: reames.Jan 2 2018, 2:12 PM

Catching up on old review traffic. Nothing critical.

llvm/trunk/include/llvm/Transforms/Scalar/SpeculateAroundPHIs.h
84

Isn't this snippet wrong? It looks like we're adding 7 or 18 here, not 7 or 11. I think you're missing a conditional branch. Alternatively, you could make the second add be an "addl $4, %edi" (since 4+7 == 11).

If we don't do that later transformation, maybe we should. Extracting out the common component of the two constants, adding that unconditionally, and then conditionally adding the difference seems like a near ideal lowering here. We'd end up with something like:

  testq %eax, %eax
  addl $7, %edi
  jne .L
  addl $4, %edi    ; <-- different constant
.L:
  movl %edi, %eax
  retq

thopre added a subscriber: thopre.May 28 2021, 8:45 AM

@chandlerc

This pass does not seem to account for the extra branch instruction (1 for exiting the loop plus 1 for looping back, vs. a single branch when there's no critical edge), nor does it account for the critical edge preventing the use of a hardware loop instruction. The example that causes us trouble is:

unsigned KnownDec(unsigned *arr) {
  unsigned x = 0x2000;
  unsigned z = 0;
  while(x) {
    z += arr[x-1];
    x--;
  }
  return z;
}

When compiling for our (non-upstream) target, we get the following IR after this pass:

entry:
  %sub.0 = add nsw i32 2000, -1
  br label %while.body

while.body:                                       ; preds = %while.body.while.body_crit_edge, %entry
  %z.07 = phi i32 [ 0, %entry ], [ %add, %while.body.while.body_crit_edge ]
  %sub.phi = phi i32 [ %sub.0, %entry ], [ %sub.1, %while.body.while.body_crit_edge ]
  %arrayidx = getelementptr inbounds i32, i32* %arr, i32 %sub.phi
  %0 = load i32, i32* %arrayidx, align 4, !tbaa !2
  %add = add i32 %0, %z.07
  %tobool.not = icmp eq i32 %sub.phi, 0
  br i1 %tobool.not, label %while.end, label %while.body.while.body_crit_edge, !llvm.loop !6

while.body.while.body_crit_edge:                  ; preds = %while.body
  %sub.1 = add nsw i32 %sub.phi, -1
  br label %while.body

while.end:                                        ; preds = %while.body
  ret i32 %add

The fact that the condition check is separate from the loop latch means we cannot use a hardware loop instruction. Similar code gets generated for PowerPC (using 0x10000 instead of 2000), but they run EarlyCSE afterwards, which sinks the value back down into the PHI, and by chance they have a pass to canonicalize the loop form for their addressing modes, which removes the getelementptr in the critical edge (inserted there by the loop strength reduction pass). This feels kinda lucky: the EarlyCSE and ppc-loop-instr-form-prep passes were added to the PPC pipeline long before the PHI speculation one. Is the expectation that targets with hardware loops deal with the result of PHI speculation? If that's the case, could we have a hook for those targets to disable the pass when they suspect a hardware loop instruction might be used?
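
(For reference, a sketch of the single-block latch form that hardware-loop formation typically wants -- the decrement, exit test, and backedge all in one block with no critical-edge block; the 8192 assumes the 0x2000 from the C source:

while.body:
  %z.07 = phi i32 [ 0, %entry ], [ %add, %while.body ]
  %x = phi i32 [ 8192, %entry ], [ %sub, %while.body ]
  %sub = add nsw i32 %x, -1
  %arrayidx = getelementptr inbounds i32, i32* %arr, i32 %sub
  %0 = load i32, i32* %arrayidx, align 4
  %add = add i32 %0, %z.07
  %tobool.not = icmp eq i32 %sub, 0
  br i1 %tobool.not, label %while.end, label %while.body)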

Best regards.

Herald added a project: Restricted Project.May 28 2021, 8:45 AM

I guess I still didn't post a reproducer, but I've seen that the IR after this transform
undergoes LSR and ends up having a pretty negative performance effect in the end.
I'm rather unconvinced that this optimization belongs in the IR.
We just can't be sure that in the end we will actually have the instructions we cost-model here.
I would much rather see this being a (late) codegen pass.