This is an archive of the discontinued LLVM Phabricator instance.

[SimplifyCFG] do not speculate fdiv by default; it's expensive (PR24818)
ClosedPublic

Authored by spatel on Sep 15 2015, 10:49 AM.

Details

Summary

I'm not sure if this is the right fix, but PR24818 and PR24343 show that we're now speculating costly ops on x86 when we shouldn't. I've confirmed that we're hoisting fdiv on all in-tree targets without this patch, and that seems wrong, so I think we should fix this in the base class rather than everyone's overrides.

The test just below the one I'm proposing assumes that fsqrt is cheap enough to speculate, but that's not generally true either. Do we want SimplifyCFG to make this speculative transform and then have backends undo it for expensive ops?

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 34809.Sep 15 2015, 10:49 AM
spatel updated this revision to Diff 34810.
spatel retitled this revision from to [SimplifyCFG] do not speculate fdiv by default; it's expensive (PR24818).
spatel updated this object.
spatel added reviewers: hfinkel, arsenm, bkramer.
spatel added a subscriber: llvm-commits.

Updated patch: full-context diff.

dim added a subscriber: dim.EditedSep 16 2015, 2:46 AM

FWIW, this works for my test case in bug 24343. Before D12882:

double foo(int n, double d)
{
  if (n > 0)
    d = 1.0 / d;
  return d;
}

compiles to:

foo:                                    # @foo
# BB#0:                                 # %entry
        pushl       %ebp
        movl        %esp, %ebp
        fldl        12(%ebp)
        cmpl        $0, 8(%ebp)
        fld1
        fdiv        %st(1)
        jg  .LBB0_2
# BB#1:                                 # %entry
        fstp        %st(0)
        fldz
        fxch        %st(1)
.LBB0_2:                                # %entry
        fstp        %st(1)
        popl        %ebp
        retl

After D12882, it compiles to:

foo:                                    # @foo
# BB#0:                                 # %entry
        pushl       %ebp
        movl        %esp, %ebp
        fldl        12(%ebp)
        cmpl        $0, 8(%ebp)
        jle .LBB0_2
# BB#1:                                 # %if.then
        fld1
        fdivp       %st(1)
.LBB0_2:                                # %if.end
        popl        %ebp
        retl

So the fdiv is no longer executed before the comparison.

Thanks for verifying that, Dimitry.

Adding some more reviewers who may be able to provide guidance.

Is it only an "optimization fix"? The "correctness" fix for PR24343 seems to requires that an fdiv can't be speculated with FPE enabled and if the denominator is not zero, right?
It seems that LLVM Optimizations are still not very friendly with respect to that, there were some patches at some point last year but it seems they have not been integrated: http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20141110/243778.html

In D12882#247388, @joker.eph wrote:

Is it only an "optimization fix"? The "correctness" fix for PR24343 seems to requires that an fdiv can't be speculated with FPE enabled and if the denominator is not zero, right?
It seems that LLVM Optimizations are still not very friendly with respect to that, there were some patches at some point last year but it seems they have not been integrated: http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20141110/243778.html

Hi Mehdi -

Yes, my patch is only intended to solve the perf problem (PR24818)...and raise the question about the existing sqrt hoisting test in test/Transforms/SimplifyCFG/speculate-math.ll.

This patch just happens to prevent the FPE in the case of PR24343, but that's a happy accident. Clang/LLVM still need major work to support FPE, and that's tracked (and unsolved for years) by PR8100 and related bugs.

Can you give some context on specifically which speculative
optimizations you're hoping to avoid here? Are you only worried about
if-then-else-to-select from SimplifyCFG? What about cases like LICM?
Do you think those should hoist or not?

My general feeling is that we *should* speculate these instructions if
is legal and we have reason to predict it as being profitable (e.g. in
LICM). We do not currently have a global scheduler (or even a global
schedule fixer-uper beyond some hacks in CodeGenPrep), but I've been
thinking about this for other reasons. In cases where the only use is
down a particular path, adding a "fixup" which undoes hoisting seems
plausible and reasonable.

For this particular case, it really seems like this should be handled
via a select-to-control flow conversion for expensive operations done
late in the pipeline. IMO, selects is a better intermediate form for
optimization, and we should undo the transformation late when heading
into the backend. In fact, we already have the framework for this in
CodeGenPrep::OptimizeSelectInst. Have you considered tweaking that instead?

Philip

Can you give some context on specifically which speculative
optimizations you're hoping to avoid here? Are you only worried about
if-then-else-to-select from SimplifyCFG? What about cases like LICM?
Do you think those should hoist or not?

I was only considering the if-then case in SimplifyCFG. I'm not familiar with LICM, but I don't see it using TTI. Maybe that's just an oversight though?

My general feeling is that we *should* speculate these instructions if
is legal and we have reason to predict it as being profitable (e.g. in
LICM). We do not currently have a global scheduler (or even a global
schedule fixer-uper beyond some hacks in CodeGenPrep), but I've been
thinking about this for other reasons. In cases where the only use is
down a particular path, adding a "fixup" which undoes hoisting seems
plausible and reasonable.

For this particular case, it really seems like this should be handled
via a select-to-control flow conversion for expensive operations done
late in the pipeline. IMO, selects is a better intermediate form for
optimization, and we should undo the transformation late when heading
into the backend. In fact, we already have the framework for this in
CodeGenPrep::OptimizeSelectInst. Have you considered tweaking that instead?

I was hoping there was some later hook in place where we could do this, but I hadn't looked further yet, so I'm glad to see that CGP has it.

But if that is the right spot to do (undo) this transform, that implies that using TTI here in SimplifyCFG was the wrong choice, doesn't it? Ie, the TTI cost model cites this exact case as a reason for its existence:

TCC_Expensive = 4 ///< The cost of a 'div' instruction on x86.

If we're not going to honor the TTI cost model here in SimplifyCFG, we should go back to whatever logic was used before r228826. (Note: I'm not advocating one way or the other...yet; I'm just looking for the right answer.)

This is another way of asking the general question I've run into several times lately ( PR24818, PR24766, PR22723, PR24743 ):
What defines the "canonical" or "optimal" IR? Less instructions? Less basic blocks? Less register pressure? Is it art or science?

hfinkel accepted this revision.Sep 22 2015, 4:13 PM
hfinkel edited edge metadata.

LGTM. Please go ahead and fix the FIXME and make UDiv and SDiv (and URem/SRem?) expensive by default. The targets can always override these if appropriate.

What defines the "canonical" or "optimal" IR?

These are different things. Canonical form is designed to be simple, yes, but to expose further optimizations and limit the implementation complexity of different optimizations and analysis. Optimal IR need to be defined with respect to a particular backend and architecture, and might not be in canonical form.

Less instructions? Less basic blocks?

Generally, yes.

Less register pressure?

During canonicalization, we ignore this by design. During lowering (including late IR passes such as vectorization), it can be considered explicitly.

Is it art or science?

Both.

This revision is now accepted and ready to land.Sep 22 2015, 4:13 PM

Is it art or science?

Both.

Aha, now it's all starting to make more sense. That's much more fun. :)
Many thanks to you and Philip for explaining this.

My plan to push this forward is:

  1. Fix up the TTI cost model (this patch and possibly others).
  2. Add code to CGP that uses the cost model to undo this sort of speculation for targets that don't want it.
  3. Change this part of SimplifyCFG to not use the TTI cost model for purposes of speculative transforms.

I've just started looking at CGP for the first time...it has this ominous comment:

// This pass munges the code in the input function to better prepare it for
// SelectionDAG-based code generation. This works around limitations in it's
// basic-block-at-a-time approach. It should eventually be removed.

Is there a proposed replacement for CGP?

This revision was automatically updated to reflect the committed changes.