This should do the right thing on X86 and resolve PR31455.
I'm not sure it makes sense for MIPS and SystemZ though, which were the original targets of the pass.
Details
Diff Detail
Event Timeline
As mentioned on PR31455, on X86 btver2, this changes goes from being slower to gcc (88cy vs 84cy) (which hoists the sqrtsd) to actually being slightly faster (82cy). This is for a tight loop of ::sqrt() calls across an array of 65535 pre-randomized doubles (~10% of which use the sqrt call and the rest use sqrtsd). This will be mainly due to reduced speculative usage of the FSQRT unit.
This change is increasing the branch density for MIPS in the supplied test case and register pressure, as LLVM now has to synthesise 0.0 into a floating point register. This in turn also decreases code density for MIPS as we can't load 0.0 in a single instruction like x86 in all cases.
lib/Transforms/Scalar/PartiallyInlineLibCalls.cpp | ||
---|---|---|
42–45 | Shouldn't this be: // if (src > 0) // v0 = sqrt_noreadmem(src) # native sqrt instruction // else // v1 = sqrt(src) # library call // dst = phi(v0, v1) | |
test/CodeGen/Mips/optimize-fp-math.ll | ||
7 ↗ | (On Diff #83120) | This should be: ; 32-LABEL: test_sqrtf_float_: Similarly for the 64 case. |
21 ↗ | (On Diff #83120) | Similar to my comment above, except only the first mtc1 has to be matched. |
I suspect that we need to use TTI here to pick the behavior here based on target preferences. What we want to know, I suspect, is: is a floating-point comparison against zero as cheap as, or cheaper than, a floating-point NaN test? -- We don't have an interface yet to make that query, but we could add one.
Would a TTI isFastMaterializeConstant be enough (similar to what we have in FastISel)?
I think this depends on how heuristic we want to make this decision process. There are tradeoffs here with OOO processing, register pressure, etc. We might just want a dedicated TTI interface. This issue with the FP materialization seems just one of many factors. As I understand it, the issue here is giving the x86 processor more time to compute the branch condition (which is difficult to do through the sqrt instruction).
Yes avoiding unnecessary sqrtsd is my main interest, but similar issues have been found on other tickets: PR31510 (constant folding complex pow) also attempts the fast path and only then calls the lib func if any of the results are not finite - again testing the inputs may be more sensible.
Can you clarify? The fundamental assumption here is that the non-finite-output case is rare. As such, we're not "avoiding" the sqrt instruction - we'll almost always need to execute it. Is the problem you're seeing is that the non-finite-output case is not rare, that the rare non-finite-output case is nevertheless expensive enough to worry about, or that testing early is better because of branch-handling effects?
Simon, would you like to commandeer this patch?
I don't really have a strong opinion on the trade-offs here, was just curious about how we ended up with PR31455.
lib/Transforms/Scalar/PartiallyInlineLibCalls.cpp | ||
---|---|---|
42–45 | The IR-level transformation is actually as described, see the good-prototype.ll test. The sqrt_noreadmem() call gets sunk into the branch later. | |
test/CodeGen/Mips/optimize-fp-math.ll | ||
7 ↗ | (On Diff #83120) | Sure. |
Taking over this patch from @mkuper - I'm going to get some tests done (well for x86 at least) and come with a proposal for TTI/TLI to control whether we should test the inputs or fast-path outputs of LibCalls.
Patch updated:
Add a specialized TTI hook as suggested. So now this becomes a functional change only for x86 which overrides the default hook.
Looking at this again, I'm seeing some possible cases where MIPS can do better but that's no reason to hold this patch back.
LGTM.
Shouldn't this be: