This is an archive of the discontinued LLVM Phabricator instance.

Change sqrt partial inlining to depend on sqrt argument rather than result.
ClosedPublic

Authored by spatel on Jan 4 2017, 1:44 PM.

Details

Summary

This should do the right thing on X86 and resolve PR31455.
I'm not sure it makes sense for MIPS and SystemZ though, which were the original targets of the pass.

Diff Detail

Repository
rL LLVM

Event Timeline

mkuper updated this revision to Diff 83120.Jan 4 2017, 1:44 PM
mkuper retitled this revision from to Change sqrt partial inlining to depend on sqrt argument rather than result..
mkuper updated this object.
mkuper added reviewers: RKSimon, uweigand, sdardis.
mkuper added a subscriber: llvm-commits.
RKSimon edited edge metadata.Jan 5 2017, 3:46 AM
RKSimon added a subscriber: spatel.

As mentioned on PR31455, on X86 btver2, this changes goes from being slower to gcc (88cy vs 84cy) (which hoists the sqrtsd) to actually being slightly faster (82cy). This is for a tight loop of ::sqrt() calls across an array of 65535 pre-randomized doubles (~10% of which use the sqrt call and the rest use sqrtsd). This will be mainly due to reduced speculative usage of the FSQRT unit.

sdardis edited edge metadata.Jan 5 2017, 4:04 AM

This change is increasing the branch density for MIPS in the supplied test case and register pressure, as LLVM now has to synthesise 0.0 into a floating point register. This in turn also decreases code density for MIPS as we can't load 0.0 in a single instruction like x86 in all cases.

lib/Transforms/Scalar/PartiallyInlineLibCalls.cpp
45–48 ↗(On Diff #83120)

Shouldn't this be:

// if (src > 0)
//   v0 = sqrt_noreadmem(src) # native sqrt instruction
// else
//   v1 = sqrt(src) # library call
// dst = phi(v0, v1)
test/CodeGen/Mips/optimize-fp-math.ll
7 ↗(On Diff #83120)

This should be:

; 32-LABEL: test_sqrtf_float_:
; 32: mtc1 $zero, $f[[R0:[0-9]+]]
; 32: c.ult.s $f12, $f[[R0]]
; 32: bc1t $BB0_[[BB0:[0-9]+]]
; 32: sqrt.s $f0, $f12
; 32: $BB0_[[BB0]]:
; 32: jal sqrtf

Similarly for the 64 case.

21 ↗(On Diff #83120)

Similar to my comment above, except only the first mtc1 has to be matched.

hfinkel edited edge metadata.Jan 5 2017, 6:40 AM

This change is increasing the branch density for MIPS in the supplied test case and register pressure, as LLVM now has to synthesise 0.0 into a floating point register. This in turn also decreases code density for MIPS as we can't load 0.0 in a single instruction like x86 in all cases.

I suspect that we need to use TTI here to pick the behavior here based on target preferences. What we want to know, I suspect, is: is a floating-point comparison against zero as cheap as, or cheaper than, a floating-point NaN test? -- We don't have an interface yet to make that query, but we could add one.

This change is increasing the branch density for MIPS in the supplied test case and register pressure, as LLVM now has to synthesise 0.0 into a floating point register. This in turn also decreases code density for MIPS as we can't load 0.0 in a single instruction like x86 in all cases.

I suspect that we need to use TTI here to pick the behavior here based on target preferences. What we want to know, I suspect, is: is a floating-point comparison against zero as cheap as, or cheaper than, a floating-point NaN test? -- We don't have an interface yet to make that query, but we could add one.

Would a TTI isFastMaterializeConstant be enough (similar to what we have in FastISel)?

This change is increasing the branch density for MIPS in the supplied test case and register pressure, as LLVM now has to synthesise 0.0 into a floating point register. This in turn also decreases code density for MIPS as we can't load 0.0 in a single instruction like x86 in all cases.

I suspect that we need to use TTI here to pick the behavior here based on target preferences. What we want to know, I suspect, is: is a floating-point comparison against zero as cheap as, or cheaper than, a floating-point NaN test? -- We don't have an interface yet to make that query, but we could add one.

Would a TTI isFastMaterializeConstant be enough (similar to what we have in FastISel)?

I think this depends on how heuristic we want to make this decision process. There are tradeoffs here with OOO processing, register pressure, etc. We might just want a dedicated TTI interface. This issue with the FP materialization seems just one of many factors. As I understand it, the issue here is giving the x86 processor more time to compute the branch condition (which is difficult to do through the sqrt instruction).

Would a TTI isFastMaterializeConstant be enough (similar to what we have in FastISel)?

I think this depends on how heuristic we want to make this decision process. There are tradeoffs here with OOO processing, register pressure, etc. We might just want a dedicated TTI interface. This issue with the FP materialization seems just one of many factors. As I understand it, the issue here is giving the x86 processor more time to compute the branch condition (which is difficult to do through the sqrt instruction).

Yes avoiding unnecessary sqrtsd is my main interest, but similar issues have been found on other tickets: PR31510 (constant folding complex pow) also attempts the fast path and only then calls the lib func if any of the results are not finite - again testing the inputs may be more sensible.

Would a TTI isFastMaterializeConstant be enough (similar to what we have in FastISel)?

I think this depends on how heuristic we want to make this decision process. There are tradeoffs here with OOO processing, register pressure, etc. We might just want a dedicated TTI interface. This issue with the FP materialization seems just one of many factors. As I understand it, the issue here is giving the x86 processor more time to compute the branch condition (which is difficult to do through the sqrt instruction).

Yes avoiding unnecessary sqrtsd is my main interest, but similar issues have been found on other tickets: PR31510 (constant folding complex pow) also attempts the fast path and only then calls the lib func if any of the results are not finite - again testing the inputs may be more sensible.

Can you clarify? The fundamental assumption here is that the non-finite-output case is rare. As such, we're not "avoiding" the sqrt instruction - we'll almost always need to execute it. Is the problem you're seeing is that the non-finite-output case is not rare, that the rare non-finite-output case is nevertheless expensive enough to worry about, or that testing early is better because of branch-handling effects?

mkuper added a comment.Jan 5 2017, 3:29 PM

Simon, would you like to commandeer this patch?

I don't really have a strong opinion on the trade-offs here, was just curious about how we ended up with PR31455.

lib/Transforms/Scalar/PartiallyInlineLibCalls.cpp
45–48 ↗(On Diff #83120)

The IR-level transformation is actually as described, see the good-prototype.ll test. The sqrt_noreadmem() call gets sunk into the branch later.
Whether this *should* be the case or not is a different issue. :-)

test/CodeGen/Mips/optimize-fp-math.ll
7 ↗(On Diff #83120)

Sure.
The update script doesn't work for MIPS yet, right?

mkuper added a comment.Jan 5 2017, 5:23 PM

(Just to be clear, I meant @RKSimon, not @sdardis)

RKSimon commandeered this revision.Jan 6 2017, 5:31 AM
RKSimon edited reviewers, added: mkuper; removed: RKSimon.

Taking over this patch from @mkuper - I'm going to get some tests done (well for x86 at least) and come with a proposal for TTI/TLI to control whether we should test the inputs or fast-path outputs of LibCalls.

spatel commandeered this revision.Nov 19 2017, 10:28 AM
spatel added a reviewer: RKSimon.

Commandeering from the commandeer...er.

spatel updated this revision to Diff 123504.Nov 19 2017, 10:34 AM
spatel added a reviewer: craig.topper.

Patch updated:
Add a specialized TTI hook as suggested. So now this becomes a functional change only for x86 which overrides the default hook.

sdardis accepted this revision.Nov 27 2017, 11:57 AM

Looking at this again, I'm seeing some possible cases where MIPS can do better but that's no reason to hold this patch back.

LGTM.

This revision is now accepted and ready to land.Nov 27 2017, 11:57 AM
This revision was automatically updated to reflect the committed changes.