This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Remove single use restriction from InstCombine's explicit sinking code.
Needs Review · Public

Authored by craig.topper on Sep 12 2017, 1:25 PM.

Details

Summary

The explicit sinking code in InstCombine checks to make sure the candidate for sinking only has a single use. But that doesn't really make sense. All that should really matter is that all uses are in the same basic block.
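
In effect, the restriction changes from requiring something like I->hasOneUse() to requiring that all users share a single block. A minimal sketch of that predicate (illustrative only, not the literal diff; the helper name is made up):

  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // Return the single block containing every user of I, or nullptr if the
  // users span multiple blocks.
  static BasicBlock *getSingleUserBlock(Instruction &I) {
    BasicBlock *UserParent = nullptr;
    for (User *U : I.users()) {
      auto *UserInst = dyn_cast<Instruction>(U);
      // Phi uses occur on incoming edges, not in the phi's block; keep the
      // sketch simple and just give up on them here.
      if (!UserInst || isa<PHINode>(UserInst))
        return nullptr;
      if (!UserParent)
        UserParent = UserInst->getParent();
      else if (UserParent != UserInst->getParent())
        return nullptr; // Users span several blocks; can't sink this way.
    }
    return UserParent; // nullptr if I has no users.
  }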

The test case is derived from a real example I saw in a benchmark where we ended up with a chain of conditional ORs that we were unable to sink completely because %select1 has two uses in the successor block.

I fear, based on other discussions on the mailing list, that this patch may be controversial because InstCombine shouldn't really be doing this sinking at all. But I'm just trying to make the existing logic consistent until a new sinking pass is implemented.

Diff Detail

Event Timeline

craig.topper created this revision.

Remove a stray change that snuck in

davide edited edge metadata. Sep 12 2017, 3:17 PM

I think what matters in this case is the (post)dominance relation between the block of the DEF and the block(s) [potentially > 1] of the USEs.
Doing this xform when all the uses are in the same block is correct, but restrictive. So I think your logic is fine, but this makes me even less convinced that we shouldn't use the dominator tree to drive this analysis (and therefore that it should be a separate pass :)

hfinkel edited edge metadata. Sep 13 2017, 2:14 AM

> I think what matters in this case is the (post)dominance relation between the block of the DEF and the block(s) [potentially > 1] of the USEs. [...] (and therefore it should be a separate pass :)

Do we see any performance effects from removing this entirely? It's not immediately obvious to me what this enables. Maybe SimplifyCFG later removes some blocks as empty? I don't see why anything in InstCombine would care.

Here's a longer version of what I saw in the benchmark I was looking at:

y1 = 0;
y2 = (x & 2)  ? (y1 | C1) : y1
y3 = (x & 4)  ? (y2 | C2) : y2
y4 = (x & 8)  ? (y3 | C3) : y3
y5 = (x & 16) ? (y4 | C4) : y4
y6 = (x & 32) ? (y5 | C5) : y5
y7 = (x & 64) ? (y6 | C6) : y6
if (x & 254) {
  y8 = (x & 128) ? (y7 | C7) : y7
  store y8 to memory
}

As you can see, y7 is only used inside the if body, but we didn't sink it. The select for y8, the (y7 | C7) for its left-hand side, and the (x & 128) all started out above the if (x & 254) and were pushed down because they were only needed by the store.

Is there some other pass that I should expect to sink this code so that we don't waste time decoding/executing the resulting ORs and cmovs when (x & 254) is false?

> Here's a longer version of what I saw in the benchmark I was looking at: [example quoted above]
>
> Is there some other pass that I should expect to sink this code so that we don't waste time decoding/executing the resulting ORs and cmovs when (x & 254) is false?

Interesting. Nothing occurs to me (at the IR level - we do have MachineSink in CodeGen). Should we do this in CGP, or is there a reason to do it earlier?

I may have said this before, but we should figure out what we want our canonical form to be. One possibility: execute things as early as possible (i.e., aggressive hoisting), but as few times as possible (i.e., still sink out of loops). Lacking other motivations, that's what I'd recommend.

Has there been any sort of discussion on expanding/using the existing IR-level code sinking pass? I am referring to the SinkingPass in scalar/sink.cpp. AFAIK it's only used in the AMDGPU pre-isel pipeline. I don't know its current state/usability, but the description of the pass is:

> This pass moves instructions into successor blocks, when possible, so that
> they aren't executed on paths where their results aren't needed.
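
(For reference, a minimal sketch of scheduling that pass late via the legacy pass manager; the pipeline position shown is an assumption for illustration, not a recommendation from this thread. createSinkingPass() is the existing factory for the pass in Scalar/Sink.cpp.)

  #include "llvm/IR/LegacyPassManager.h"
  #include "llvm/Transforms/Scalar.h"
  using namespace llvm;

  void addLateSinking(legacy::PassManagerBase &PM) {
    // Run sinking after the main optimization pipeline, so earlier passes
    // (e.g. InstCombine) no longer need to do ad-hoc sinking themselves.
    PM.add(createSinkingPass());
  }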

OK, so maybe I'm chasing a deficiency in machine sinking? Here's a related example: https://godbolt.org/g/YnEvgg
Most of the code should have been able to sink below the 'je' at line 11.

> Has there been any sort of discussion on expanding/using the existing IR-level code sinking pass? I am referring to the SinkingPass in scalar/sink.cpp. [...]

Doing this as a late IR pass makes sense. Taking a quick look at the implementation, there may be some improvements that would help (e.g., making it use MemorySSA, better handling of debug info, using a postdom tree). Adding this to the default codegen IR pipeline could be a good idea (and maybe better than trying to do a better job at the MI level?).

davide added a subscriber: kuhar. Sep 13 2017, 3:16 PM

> > Has there been any sort of discussion on expanding/using the existing IR-level code sinking pass? [...]
>
> Doing this as a late IR pass makes sense. [...] Adding this to the default codegen IR pipeline could be a good idea (and maybe better than trying to do a better job at the MI level?).

One of the reasons this was blocked for improvement is that, until recently, we didn't have a functional post-dominator tree construction (with respect to unreachable blocks and the like).
@kuhar and Danny fixed that recently, so it shouldn't be a problem anymore. We might seriously consider removing this code from InstCombine.

dberlin added a subscriber: dberlin. Edited Sep 13 2017, 3:16 PM

A couple of things:

  1. A proper and complete scalar PRE implementation would eliminate the redundant computation parts in your godbolt example (but not the redundant stores).

GCC's PRE at -O3 (which is mostly complete) will transform this into a selection of constants and 3 stores, all using those constants.
A proper NewGVN PRE would do the same.

NewGVN right now will catch the full redundancy; you will get:

  %phiofops = phi i32 [ 131074, %2 ], [ 196611, %6 ]
  %.0 = phi i32 [ 65537, %6 ], [ 0, %2 ]
  ...
  %.1 = phi i32 [ %phiofops, %10 ], [ %.0, %7 ]

So it eliminates one of the ORs already, even without PRE, because it is fully redundant.
The others would require PRE.

The "proper" theoretical way to eliminate the stores in both (and do the sinking above) is Partial Dead Store/Partial Dead Code elimination.

It should put it in the optimal place as well.
There is a PDSE implementation under review (I have someone taking it over).

I would expect PDSE to take care of the first case completely, but as said, it will not sink the computations in your other example, only sink the store.
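
To make "partially dead" concrete, here's a tiny illustration (my example, not taken from the PDSE review): the first store is overwritten on one path, so PDSE-style sinking moves it to the path where its value is actually live.

  // Before: '*p = 0' is dead on the path where c is true (overwritten).
  void before(int *p, bool c) {
    *p = 0;
    if (c)
      *p = 1;
  }

  // After PDSE-style sinking: each store executes only where its value
  // is live, so nothing is stored twice.
  void after(int *p, bool c) {
    if (c)
      *p = 1;
    else
      *p = 0;
  }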

  2. GCC has no PDSE; it has a simple sinking pass that I implemented to catch the common case (on by default).

It's an IR-level pass.

What it does is: for an instruction I in block BB, take all the uses of I (with a phi node use treated as occurring in the appropriate predecessor), and find the nearest common dominator of all the uses.
This place, NCD, is a guaranteed safe location as long as BB dominates it (NCD may actually be above BB in some loop cases).

Then, in the dominator tree between BB and NCD, we want the block that is the most control dependent and in the shallowest loop nest:
Shallower loop nests are always considered better.
The same loop nest level is considered better if its execution frequency is significantly lower than NCD's execution frequency.
Otherwise, use NCD.

The resulting block is always a safe place to sink (because BB is an ancestor of NCD in the dominator tree, and we are only walking the dominator tree until we hit BB).
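
In LLVM terms, that search could look roughly like the sketch below. This is my reconstruction of the description above, not GCC's code; the helper name and the plain frequency comparison (the "significantly lower" threshold is omitted) are assumptions.

  #include "llvm/Analysis/BlockFrequencyInfo.h"
  #include "llvm/Analysis/LoopInfo.h"
  #include "llvm/IR/Dominators.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  static BasicBlock *pickSinkDestination(Instruction *I, DominatorTree &DT,
                                         LoopInfo &LI,
                                         BlockFrequencyInfo &BFI) {
    BasicBlock *BB = I->getParent();
    // Find the nearest common dominator (NCD) of all uses, attributing a
    // phi use to the corresponding incoming predecessor.
    BasicBlock *NCD = nullptr;
    for (Use &U : I->uses()) {
      auto *UserInst = cast<Instruction>(U.getUser());
      BasicBlock *UseBB = UserInst->getParent();
      if (auto *PN = dyn_cast<PHINode>(UserInst))
        UseBB = PN->getIncomingBlock(U);
      NCD = NCD ? DT.findNearestCommonDominator(NCD, UseBB) : UseBB;
    }
    // NCD is only usable if the def's block still dominates it.
    if (!NCD || !DT.dominates(BB, NCD))
      return nullptr;
    // Walk the dominator tree from NCD up toward BB, preferring shallower
    // loop nests and, at equal depth, colder blocks than the current best.
    BasicBlock *Best = NCD;
    for (BasicBlock *Cur = NCD; Cur != BB;
         Cur = DT.getNode(Cur)->getIDom()->getBlock()) {
      unsigned CurDepth = LI.getLoopDepth(Cur);
      unsigned BestDepth = LI.getLoopDepth(Best);
      if (CurDepth < BestDepth ||
          (CurDepth == BestDepth &&
           BFI.getBlockFreq(Cur) < BFI.getBlockFreq(Best)))
        Best = Cur;
    }
    // Every candidate dominates all uses and is dominated by BB, so any
    // choice here is a legal sink destination.
    return Best == BB ? nullptr : Best;
  }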

You could also use real control dependences to find the thing inside the most non-loop branches :)

In the above case, this sinks it into the pred of the phi, as you want.
In the godbolt example, as mentioned, this is a combination of PRE and PDSE.
The simple sinking pass I wrote is not smart enough to handle the PDSE case you've presented.

I believe LLVM has a similar simple sinking pass.

Looks like machine sinking fails because it can't handle the cmovs reading eflags. And it only considers one instruction at a time, but to sink the cmov you have to sink the eflags producer and the cmov together.

Here's something even closer to the real benchmark: https://godbolt.org/g/ePkjTu

This is also hitting an issue with SimplifyCFG's conditional-store merging that resulted in two stores instead of one.

So GCC 7 definitely eliminates all computation in your benchmark at -O3 except for one AND, and we should be able to do the same with a NewGVN PRE.
It won't do it at -O2 because that requires partial antic, which GCC only computes at -O3.

I've attached the dump of intermediate code.

My conclusion: it's not sane to try to eliminate this computation with the current GVN's scalar PRE.

The store sinking would require PDSE to do optimally in one pass.

Is the code GCC generates at -O3 really "better"? According to godbolt, the resulting assembly looks quite large, with lots of spills and reloads.

dberlin added a comment. Edited Sep 13 2017, 6:00 PM

The assembly is clearly worse; the IR is clearly better.
The IR has removed all computation except a mask of the input.
(Note: in the dump I gave, it has not run DCE, so there's a bunch of dead phis.)
In essence, it has turned the entire function into a nested if statement based on the bar & x values, where all branches store a constant value
(or selects, or however you want to canonicalize this).

GCC then does not do a good job of doing something with that in the backend.

This could be codegened precisely the way a switch could be (i.e., lookup table and perfect hash, multiple lookup tables, binary tree, etc.).
It can use as many or as few registers as you like.
In fact, in this form, it's also ripe for vectorization.
You could load the constants into AVX512 registers (it should take 2), the bar & values into another, and use the & to reduce and select the right constant from the vector registers. I think you can do it in AVX2 as well, but haven't double-checked.

So I'm going to say: yes, the IR is better; it could probably use canonicalization or something to take it out of the phi form and put it into a canonical form.

So I think we all agree that doing sinking in InstCombine isn't a good idea in the long term. But I suspect doing something to fix that is going to take some more work.

Do we think there are any issues with this patch in the short term? Either significant performance regressions or compile-time issues?

> I think what matters in this case is the (post)dominance relation between the block of the DEF and the block(s) [potentially > 1] of the USEs.

Agreed with this.

> Doing this xform when all the uses are in the same block is correct, but restrictive. So I think your logic is fine, but this makes me even less convinced that we shouldn't use the dominator tree to drive this analysis (and therefore that it should be a separate pass :)

I think GVNSink could be improved to do this. Also, the current implementation in InstCombiner looks like a hack: we are sinking an instruction without an appropriate cost model, and delaying execution of an instruction may not always be a good idea.

> Looks like machine sinking fails because it can't handle the cmovs reading eflags. And it only considers one instruction at a time, but to sink the cmov you have to sink the eflags producer and the cmov together.

I have been working on a global scheduling pass (https://reviews.llvm.org/D32140) which can help achieve this. What we need is a good cost model (reducing live ranges could be one) to drive the transformation.