This is an archive of the discontinued LLVM Phabricator instance.

Instcombine: try to avoid wasted work in ConstantFold
Needs ReviewPublic

Authored by escha on Mar 14 2016, 11:58 AM.

Details

Summary

This is a bit tricky, and if someone has a better idea, please feel free to suggest it!

InstCombine currently does the following:

  1. Call ConstantFoldInstruction on the instruction, and replace it with a constant if appropriate. This involves (internally) iterating over its operands and optimizing each ConstantExpr operand with ConstantFoldConstantExpression, then calling ConstantFold on the instruction with operands.
  2. If that didn't work, optimize each ConstantExpr operand with ConstantFoldConstantExpression.

However, if you have lots of instructions with ConstantExpr operands but which cannot be folded (ex: loads from constant locations in memory -- in our case, uniform buffers are a common case), this means you call ConstantFoldConstantExpression twice on every operand, every time, because the constant folding fails.

One possible solution (as written here) is to provide an option to ConstantFoldInstruction to tell it not to try to optimize the ConstantExpr arguments, and then to reorder 1) and 2) so that we optimize the ConstantExprs first. I don't know if this is the best option, but it saves ~3-4% of time in instcombine for us.

Diff Detail

Repository
rL LLVM

Event Timeline

escha updated this revision to Diff 50624.Mar 14 2016, 11:58 AM
escha retitled this revision from to Instcombine: try to avoid wasted work in ConstantFold.
escha updated this object.
escha added reviewers: resistor, bogner.
escha set the repository for this revision to rL LLVM.
escha added a subscriber: llvm-commits.
reames requested changes to this revision.Mar 15 2016, 5:23 PM
reames added a reviewer: reames.
reames added a subscriber: reames.

Please upload with full context; I can't review this in the current form.

In general, I'm bothered by this change. Let me back up and ask a question, why do we need to fold the operand of the constant explicitly? Isn't the canonical form fully folded? (i.e. why isn't the argument already folded?) Is there some subtlety specific to constantexprs here?

This revision now requires changes to proceed.Mar 15 2016, 5:23 PM
escha added a comment.Mar 15 2016, 5:30 PM

What git option do I use to get a diff with full context?

If it's already folded, why does InstCombine try to optimize it? I did this under the assumption that InstCombine was doing something reasonable (and just being redundant about it); are you saying InstCombine shouldn't be doing this at all?

escha updated this revision to Diff 50787.Mar 15 2016, 5:48 PM
escha edited edge metadata.

Updated diff with some context.

If it's already folded, why does InstCombine try to optimize it? I did this under the assumption that InstCombine was doing something reasonable (and just being redundant about it); are you saying InstCombine shouldn't be doing this at all?

This is exactly what I'm asking. In general, our canonical form is to have constants fully folded. Given InstCombine is iterative, we only need to do a single folding step at a time to reach an optimal fixed point. Why does this code need to fold constantexprs specially? Is there a reason we can't have a loop per instruction which tries to fold constexpr operands and updates the instruction operands?

escha added a comment.Mar 15 2016, 6:14 PM

I believe this code is in the initial portion of instcombine that creates the initial worklist; it isn't run iteratively at all, at least if I remember right.

escha added a comment.Mar 15 2016, 6:18 PM

To try to be more clear, this code appears to:

  1. Drop instructions that are dead.
  2. If constant, fold the constant. [this step folds ConstExprs as part of its internals, because that's how ConstantFolding works]
  3. If not constant-foldable, check to see if any ConstExpr operands are foldable [this is where we duplicate work]
  4. Add to the initial instcombine worklist.

To try to be more clear, this code appears to:

  1. Drop instructions that are dead.
  2. If constant, fold the constant. [this step folds ConstExprs as part of its internals, because that's how ConstantFolding works]
  3. If not constant-foldable, check to see if any ConstExpr operands are foldable [this is where we duplicate work]
  4. Add to the initial instcombine worklist.

Why not do the following:

  1. Drop instructions that are dead.
  2. Check to see if any ConstExpr operands are foldable, and update the instruction
  3. If constant, fold the constant. [no longer needs to consider constexprs]
  4. Add to the initial instcombine worklist.

Also, steps 1-3 are repeated inside the InstCombine run loop. See InstCombiner::run. I believe the code AddReachableCodeToWorklist is merely intended to be an optimization to avoid adding trivially constant instructions to the worklist (i.e. reduce the memory consumed by the worklist).

escha added a comment.Mar 15 2016, 6:32 PM

Why not do the following:

  1. Drop instructions that are dead.
  2. Check to see if any ConstExpr operands are foldable, and update the instruction
  3. If constant, fold the constant. [no longer needs to consider constexprs]
  4. Add to the initial instcombine worklist.

that's literally what this patch does

You're correct, I got confused by the changes to the ConstantFolding interface.

What happens if you *only* reorder the two steps? (i.e. do we see cases where we have non-foldable constant exprs?) What performance impact do you see there?

What's really confusing me about this code is that my expectation would be that constant expressions were fully folded *at construction* and that there would be very few places (linking two modules?) which would introduce new foldable constructs in a previously fully folded constant expr. If that's correct, why aren't we conicalizing in those presumable very rare places? (i.e. why is this code in ConstantFoldingInstruction at *all*?)

escha added a comment.Mar 15 2016, 9:54 PM

Yeah, I don't understand that part either!

Reordering them alone isn't enough because the slow part isn't the creation of a new simplified ConstExpr; it's running the simplify function, which takes time even if it ends up doing nothing, AFAIK.

Ah okay, I think now (looking at this a bit more) I understand why this cost keeps showing up in the profile. Like you said, it also happens during the main InstCombine run as well. Specifically, we have lots of loads and intrinsics with all constant operands that can't be folded (since they're runtime, not compile-time constants). So when running ConstantFoldInstruction, it's forced to call SimplifyConstantExpr on every operand, even though that ends up doing absolutely nothing. The results then get thrown away because you can't constant-fold a load to something whose value isn't known at compile time.

This cost shows up about 3 times in InstCombine:

  1. ~4% of time: constantfoldinstruction on init (this patch tries to eliminate that cost)
  2. ~4% of time: constantfoldconstantexpr on init (in this patch, but not saved)
  3. ~4% of time: constantfoldinstruction during the main loop

My guess is most of these are doing absolutely nothing useful, but I don't know enough about constexprs or the design of instcombine to understand why it's done this way.

My guess is most of these are doing absolutely nothing useful, but I don't know enough about constexprs or the design of instcombine to understand why it's done this way.

My suspicion is that you've stumbled onto a deeper issue with the design of constexpr canonicalization. I think it would be good to raise this on llvm-dev for further discussion. I suspect there's a better design here which could address InstCombine and all other users of ConstantFolding in a single go. In particular, it *really* bothers me that ConstantFolding is doing a recursive search in this case.

p.s. If you want to split out the reordering change, I'd be happy to sign off on that part without waiting for the broader design discussion. Just make sure you make the change in the main loop as well. That's obviously a good idea.

reames resigned from this revision.Feb 25 2020, 9:08 AM

Resigning from a stale review. Feel free to re-add if thread ever revived.

Herald added a project: Restricted Project. · View Herald TranscriptFeb 25 2020, 9:08 AM
nikic added a subscriber: nikic.Mar 5 2020, 12:37 PM

I know I'm a bit late here, but wondering if you have any IR / test case which can be used for perf testing? I've created D75634 as a different approach to this issue, but don't have a test case for this myself.