Add a new pass to speculate around PHI nodes with constant (integer) operands…


Add a new pass to speculate around PHI nodes with constant (integer) operands when profitable.

The core idea is to (re-)introduce some redundancies where their cost is
hidden by the cost of materializing immediates for constant operands of
PHI nodes. When the cost of the redundancies is covered by this,
avoiding materializing the immediate has numerous benefits:

  1. Less register pressure
  2. Potential for further folding / combining
  3. Potential for more efficient instructions due to immediate operand

As a motivating example, consider the remarkably different cost on x86
of a SHL instruction with an immediate operand versus a register

This pattern turns up surprisingly frequently, but is somewhat rarely
obvious as a significant performance problem.

The pass is entirely target independent, but it does rely on the target
cost model in TTI to decide when to speculate things around the PHI
node. I've included x86-focused tests, but any target that sets up its
immediate cost model should benefit from this pass.

There is probably more that can be done in this space, but the pass
as-is is enough to get some important performance on our internal
benchmarks, and should be generally performance neutral, but help with
more extensive benchmarking is always welcome.

One awkward part is that this pass has to be scheduled after
*everything* that can eliminate these kinds of redundancies. This
includes SimplifyCFG, GVN, etc. I'm open to suggestions about better
places to put this. We could in theory make it part of the codegen pass
pipeline, but there doesn't really seem to be a good reason for that --
it isn't "lowering" in any sense and only relies on pretty standard cost
model based TTI queries, so it seems to fit well with the "optimization"
pipeline model. Still, further thoughts on the pipeline position are

I've also only implemented this in the new pass manager. If folks are
very interested, I can try to add it to the old PM as well, but I didn't
really see much point (my use case is already switched over to the new

I've tested this pretty heavily without issue. A wide range of
benchmarks internally show no change outside the noise, and I don't see
any significant changes in SPEC either. However, the size class
computation in tcmalloc is substantially improved by this, which turns
into a 2% to 4% win on the hottest path through tcmalloc for us, so
there are definitely important cases where this is going to make
a substantial difference.

Differential revision: https://reviews.llvm.org/D37467

Event Timeline

Hahnfeld added inline comments.

Hi, just read about this pass in LLVM Weekly and wonder if this Assembly snippet is actually correct: This looks like 18 is added to %edi in case the branch is not taken. So I think the code should say something like

  testq %eax, %eax
  je .B
  addl $7, %edi
  jmp .L
  addl $11, %edi
  movl %edi, %eax

Please disagree if I'm missing something...