This came up in the x86-specific:
https://bugs.llvm.org/show_bug.cgi?id=43239
...but it seems like a general problem for the BreakFalseDeps pass. Dependencies are always broken by adding some other instruction, so that should be avoided if the overall goal is to minimize size.
Details
Diff Detail
Event Timeline
This would make it policy for -Oz builds to not bother to break dependencies but -Os/-O0+ builds would still do.
Does anything else use BreakFalseDeps?
llvm/lib/CodeGen/BreakFalseDeps.cpp | ||
---|---|---|
15–16 | NFC |
Correct - that matches my interpretation of the vague clang docs:
-Os Like -O2 with extra optimizations to reduce code size. -Oz Like -Os (and thus -O2), but reduces code size further.
In my experience, -Oz ("minsize") is used to ensure that code fits in some constrained space, so any extras that can be eliminated should be eliminated. Perf is a secondary concern. This is different than -Os ("optsize") where we are trying to balance speed and size.
Does anything else use BreakFalseDeps?
x86 and ARM are the only targets in trunk that use it from what I can see. Both use it the same way: insert dummy (not required for correctness) instructions to avoid known uarch hazards.
Adding some ARM folks since ARM is (the only) other in-tree target that would be affected by this change.
It's not strictly stated that dependency-breaking always means adding an instruction, but that’s the practical definition based on x86 and ARM's implementation of this pass's hooks.
For VEX instructions don’t we just use the other input register to break the dependency without adding an instruction?
That sounds right, but it's not controlled by this pass. That's part of memory op folding?
If I'm seeing it correctly, that's already more aggressively optimizing for size than what we're doing here:
// Avoid partial and undef register update stalls unless optimizing for size. if (!MF.getFunction().hasOptSize() && (hasPartialRegUpdate(MI.getOpcode(), Subtarget, /*ForLoadFold*/true) || shouldPreventUndefRegUpdateMemFold(MF, MI))) return nullptr;
llvm/test/CodeGen/X86/sqrt-partial.ll | ||
---|---|---|
41–42 | The AVX form of sqrtsd doesn't have a partial reg update, so no difference here. |
We disable folding if it takes away our opportunity to use the other input register. But something still needs to pick a register for the implicit_def operand. I though that was this pass? I think in absense of this pass it may just be set to the xmm0 by the register allocator. Do we have tests where xmm0 isn't the right choice?
I believe this test case compiled with avx needs this pass.
define double @minsize(double %x, double %y) minsize {
%t6 = tail call fast double @llvm.sqrt.f64(double %y) %t= fadd fast double %t6, %y ret double %t6
}
declare double @llvm.sqrt.f64(double)
Thanks! I didn't understand where that happened - it's in here (and no existing regression tests appear to capture that).
So yes, to limit this pass, we need to use a slightly finer check to distinguish between transforms that are only changing registers vs. adding instructions.
Added more tests with rL371528 and rL371551.
Patch updated:
Refine the places where minsize comes into play (where the target is called to add an instruction via TII->breakPartialRegDependency()).
llvm/test/CodeGen/X86/sqrt-partial.ll | ||
---|---|---|
55–56 | This is the test showing that the pass is still capable of changing a reg. |
llvm/test/CodeGen/ARM/a15-partial-update.ll | ||
---|---|---|
59 ↗ | (On Diff #219595) | There's a missed optimization here: we could avoid both the false dependence and the extra instruction by modifying the vld1 to a splat load. But generally looks fine. |
Marking patch as accepted based on "looks good" and "looks fine". I'll hold off on commit for a ~day in case there are any more comments.
NFC