This is an archive of the discontinued LLVM Phabricator instance.

[CGP] transform select instructions into branches and sink expensive operands
ClosedPublic

Authored by spatel on Sep 30 2015, 9:35 AM.

Details

Summary

This is a follow-up to the discussion in D12882.

Ideally, we would like SimplifyCFG to be able to form select instructions even when the operands are expensive (as defined by the TTI cost model) because that may expose further optimizations. However, we would then like a later pass like CodeGenPrepare to undo that transformation if the target would likely benefit from not speculatively executing an expensive op (this patch).

Once we have this safety mechanism in place, we can adjust SimplifyCFG to restore its select-formation behavior that changed with r248439.

Diff Detail

Event Timeline

spatel updated this revision to Diff 36118.Sep 30 2015, 9:35 AM
spatel retitled this revision from to [CGP] transform select instructions into branches and sink expensive operands.
spatel updated this object.
spatel added reviewers: reames, hfinkel, mehdi_amini, bkramer.
spatel added a subscriber: llvm-commits.
mehdi_amini edited edge metadata.Sep 30 2015, 3:42 PM

knowing that SelectionDAG operates per basic-block, this makes me worried about implications on the ability of SelectionDAG to combine. I feel this is something that would better be done at the MI level.

Did you check what is the performance impact on the test suite? (if any)

In D13297#256946, @joker.eph wrote:

Did you check what is the performance impact on the test suite? (if any)

Not yet - I'm collecting data now.

Note: my test-suite system has an AMD Jaguar CPU; it would be great if someone else could try this patch on a recent Intel-based system.

In D13297#256946, @joker.eph wrote:

Did you check what is the performance impact on the test suite? (if any)

Not yet - I'm collecting data now.

Note: my test-suite system has an AMD Jaguar CPU; it would be great if someone else could try this patch on a recent Intel-based system.

Or better still for the sake of variety: a non-x86 system.

In D13297#256940, @joker.eph wrote:

knowing that SelectionDAG operates per basic-block, this makes me worried about implications on the ability of SelectionDAG to combine. I feel this is something that would better be done at the MI level.

I've learned that the longstanding existence of something in LLVM does not make it right. :)
However, I want to note for the record that we have similar 'undo' transforms already in IR and/or DAG. These are the cases that I'm aware of:

  1. SimplifyCFG creates selects from branches using TTI (this behavior was toggled for some cases at r228826 and r248439).
  2. SimplifyCFG combines branches with logic ops in FoldBranchToCommonDest() using a target-independent heuristic.
  3. SelectionDAGBuilder may undo #2 in visitBr() using TLI.
  4. CGP may undo #2 in splitBranchCondition() using TLI (and has a FIXME stating #3 should be removed).
  5. CGP may undo #1 in optimizeSelectInst() using TLI (that's the code this patch is building on).

I wonder if any of these raise the same concern? Again, I don't know what the right answer is...just asking a question. Also, note that this patch was proposed in D12882, and I thought there was some consensus for it. I may have misunderstood.

In D13297#256940, @joker.eph wrote:

knowing that SelectionDAG operates per basic-block, this makes me worried about implications on the ability of SelectionDAG to combine. I feel this is something that would better be done at the MI level.

I've learned that the longstanding existence of something in LLVM does not make it right. :)

Definitively. This is not something "right" this is a limitation of SelectionDAG and I know people are actively working on getting rid of it.

However, I want to note for the record that we have similar 'undo' transforms already in IR and/or DAG. These are the cases that I'm aware of:

  1. SimplifyCFG creates selects from branches using TTI (this behavior was toggled for some cases at r228826 and r248439).
  2. SimplifyCFG combines branches with logic ops in FoldBranchToCommonDest() using a target-independent heuristic.
  3. SelectionDAGBuilder may undo #2 in visitBr() using TLI.
  4. CGP may undo #2 in splitBranchCondition() using TLI (and has a FIXME stating #3 should be removed).
  5. CGP may undo #1 in optimizeSelectInst() using TLI (that's the code this patch is building on).

I wonder if any of these raise the same concern? Again, I don't know what the right answer is...just asking a question. Also, note that this patch was proposed in D12882, and I thought there was some consensus for it. I may have misunderstood.

Indeed the "correct" approach may be to not "optimize" for the way SelectionDAG work and its limitation, and instead relying on MI-level combine pass to figure out later combine across basic blocks.

reames edited edge metadata.Oct 1 2015, 9:17 AM

Overall, I think this is the right direction. I might consider enhancing it slightly by sinking operands of the select's operands as long as there's only a single use, but that's gravy. It would be entirely reasonable to do that in a separate change.

Trying to do this at the MI layer would be more complicated, less in alignment with how similar issues are already handled, and doesn't seem to give any real advantage.

I would like to see some performance numbers before submission.

lib/CodeGen/CodeGenPrepare.cpp
3845

This would seem better structured as a boolean predicate on the Value.

3881

clarify comment: expensive and only needed on one side of the select

3929

I'd suggest revising this example. In principal, we might create if/else/end blocks and sink into both if and else blocks. I'd suggest formulating it that way and then mentioning that we optimize away one of the blocks if we can.

3960

auto would be fine here.

3971

!landing pad. That has a very specific meaning in LLVM which this is not. "input value", or "incoming edge" would be fine.

3972

I'd add an assert inside this that TrueBlock == EndBlock just to make it obvious.

3986

I think that if you start the true and false blocks as being StartBlock you can simplify away this code entirely?

While I agree that it is something that seems to be done at the IR level "conceptually", I think you are underestimating the importance of the MI level: there are many things the DAG can fold if it has a view on it, and something that seems costly at the IR level can become "free" in some cases with target knowledge.
As mentioned previously, doing this at the IR level stresses a bit more a limitation of SelectionDAG, which does not mean it is not the right thing to do, but on the other side we should just make sure the MI level combine is able to cope with it (it becomes a target problem to deal with it). A global instruction selection instead of SelectionDAG wouldn't have such issue.
Now maybe x86 and/or ARM either does not have much DAG combine that would be lost in this process, or have a strong enough MI layer to recover, so my concern may not have any practical implication. In the end, let's just run a few benchmarks and see :)
I'll try to run the test-suite on Haswell and ARM64, hopefully this week.

These are test-suite benchmarking subset exec_time results for 3 baseline runs (r248957) followed by 3 runs with this patch applied running on an otherwise quiet AMD Jaguar 1.5GHz Ubuntu 14.04 system.

I don't see anything outside of the noise.

This is what I expected given that SimplifyCFG is not currently forming the TCC_Expensive selects that would be undone by this patch. I'll try again with r248439 reverted.

spatel added a comment.Oct 1 2015, 2:37 PM

This data comes from the same system and has the same baseline as before (r248957) but reverts r248439, so if there are expensive ops (div/rem) that can be speculatively executed and folded into selects, this data should show the perf effect.

But again, I don't see any differences.

spatel marked 6 inline comments as done.Oct 8 2015, 9:37 AM
spatel added inline comments.
lib/CodeGen/CodeGenPrepare.cpp
3986

We can't do that because we want to use these as the labels for the branch, so they have to start off pointing to the EndBlock. This is more confusing that it needs to be. I'll just init these to nullptr instead, and then we'll change them as needed and create the branch at the same time. It's a couple of extra lines of code, but easier to follow IMO.

spatel updated this revision to Diff 36864.Oct 8 2015, 9:40 AM
spatel edited edge metadata.

Patch updated based on Philip's feedback (thanks!).

spatel updated this revision to Diff 37497.Oct 15 2015, 10:18 AM

[Patch rebased; also, the new test case file was missing in the last revision.]

Ping.

Mehdi, did you run any perf tests?

Sorry I haven't had time to setup an ARM device yet, too many other things came up :(
That said, don't consider it a blocker for committing, this LGTM (if Philip is OK with the patch).

reames accepted this revision.Oct 15 2015, 7:14 PM
reames edited edge metadata.

LGTM

This revision is now accepted and ready to land.Oct 15 2015, 7:14 PM
This revision was automatically updated to reflect the committed changes.