Page MenuHomePhabricator

[BreakFalseDeps] ignore function with minsize attribute
Needs ReviewPublic

Authored by spatel on Mon, Sep 9, 12:02 PM.

Details

Summary

This came up in the x86-specific:
https://bugs.llvm.org/show_bug.cgi?id=43239
...but it seems like a general problem for the BreakFalseDeps pass. Dependencies are always broken by adding some other instruction, so that should be avoided if the overall goal is to minimize size.

Diff Detail

Event Timeline

spatel created this revision.Mon, Sep 9, 12:02 PM
Herald added a project: Restricted Project. · View Herald TranscriptMon, Sep 9, 12:02 PM

This would make it policy for -Oz builds to not bother to break dependencies but -Os/-O0+ builds would still do.

Does anything else use BreakFalseDeps?

llvm/lib/CodeGen/BreakFalseDeps.cpp
15–16

NFC

This would make it policy for -Oz builds to not bother to break dependencies but -Os/-O0+ builds would still do.

Correct - that matches my interpretation of the vague clang docs:

-Os Like -O2 with extra optimizations to reduce code size.
-Oz Like -Os (and thus -O2), but reduces code size further.

In my experience, -Oz ("minsize") is used to ensure that code fits in some constrained space, so any extras that can be eliminated should be eliminated. Perf is a secondary concern. This is different than -Os ("optsize") where we are trying to balance speed and size.

Does anything else use BreakFalseDeps?

x86 and ARM are the only targets in trunk that use it from what I can see. Both use it the same way: insert dummy (not required for correctness) instructions to avoid known uarch hazards.

spatel updated this revision to Diff 219531.Tue, Sep 10, 6:03 AM

Patch updated:
Pre-committed NFC code comment diffs in rL371516.

spatel marked an inline comment as done.Tue, Sep 10, 6:03 AM

Adding some ARM folks since ARM is (the only) other in-tree target that would be affected by this change.
It's not strictly stated that dependency-breaking always means adding an instruction, but that’s the practical definition based on x86 and ARM's implementation of this pass's hooks.

For VEX instructions don’t we just use the other input register to break the dependency without adding an instruction?

For VEX instructions don’t we just use the other input register to break the dependency without adding an instruction?

That sounds right, but it's not controlled by this pass. That's part of memory op folding?
If I'm seeing it correctly, that's already more aggressively optimizing for size than what we're doing here:

// Avoid partial and undef register update stalls unless optimizing for size.
if (!MF.getFunction().hasOptSize() &&
    (hasPartialRegUpdate(MI.getOpcode(), Subtarget, /*ForLoadFold*/true) ||
     shouldPreventUndefRegUpdateMemFold(MF, MI)))
  return nullptr;
spatel updated this revision to Diff 219546.Tue, Sep 10, 8:07 AM

Patch updated:
Rebased with improved test coverage for ARM and x86:
rL371526
rL371525

spatel marked an inline comment as done.Tue, Sep 10, 8:12 AM
spatel added inline comments.
llvm/test/CodeGen/X86/sqrt-partial.ll
78

The AVX form of sqrtsd doesn't have a partial reg update, so no difference here.

For VEX instructions don’t we just use the other input register to break the dependency without adding an instruction?

That sounds right, but it's not controlled by this pass. That's part of memory op folding?
If I'm seeing it correctly, that's already more aggressively optimizing for size than what we're doing here:

// Avoid partial and undef register update stalls unless optimizing for size.
if (!MF.getFunction().hasOptSize() &&
    (hasPartialRegUpdate(MI.getOpcode(), Subtarget, /*ForLoadFold*/true) ||
     shouldPreventUndefRegUpdateMemFold(MF, MI)))
  return nullptr;

We disable folding if it takes away our opportunity to use the other input register. But something still needs to pick a register for the implicit_def operand. I though that was this pass? I think in absense of this pass it may just be set to the xmm0 by the register allocator. Do we have tests where xmm0 isn't the right choice?

I believe this test case compiled with avx needs this pass.

define double @minsize(double %x, double %y) minsize {

%t6 = tail call fast double @llvm.sqrt.f64(double %y)
%t= fadd fast double %t6, %y
ret double %t6

}
declare double @llvm.sqrt.f64(double)

I believe this test case compiled with avx needs this pass.

define double @minsize(double %x, double %y) minsize {

%t6 = tail call fast double @llvm.sqrt.f64(double %y)
%t= fadd fast double %t6, %y
ret double %t6

}
declare double @llvm.sqrt.f64(double)

Thanks! I didn't understand where that happened - it's in here (and no existing regression tests appear to capture that).
So yes, to limit this pass, we need to use a slightly finer check to distinguish between transforms that are only changing registers vs. adding instructions.
Added more tests with rL371528 and rL371551.

spatel updated this revision to Diff 219595.Tue, Sep 10, 1:29 PM

Patch updated:
Refine the places where minsize comes into play (where the target is called to add an instruction via TII->breakPartialRegDependency()).

spatel marked an inline comment as done.Tue, Sep 10, 1:31 PM
spatel added inline comments.
llvm/test/CodeGen/X86/sqrt-partial.ll
99

This is the test showing that the pass is still capable of changing a reg.

X86 looks good to me.

Ping - ARM ok?