This is an archive of the discontinued LLVM Phabricator instance.

[SLC] Simplify pow(x, 0.25) to sqrt(sqrt(x))
Abandoned (Public)

Authored by evandro on Jul 13 2018, 10:08 AM.

Details

Reviewers
spatel
efriedma
Summary

This transformation helps some benchmarks in SPEC CPU2006, such as 447.dealII, as well as some proprietary benchmarks. Otherwise, it causes no significant regressions on x86-64 or AArch64.
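As a minimal illustration of the identity this patch relies on (my sketch, not the patch itself; it assumes fast-math, i.e. non-negative x and no errno/exception concerns):

    #include <cmath>

    // pow(x, 0.25) and sqrt(sqrt(x)) agree for non-negative x, and two sqrt
    // instructions are typically much cheaper than a pow() library call.
    double quarter_power_libcall(double x) { return std::pow(x, 0.25); }
    double quarter_power_folded(double x)  { return std::sqrt(std::sqrt(x)); }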

Diff Detail

Event Timeline

evandro created this revision.Jul 13 2018, 10:08 AM
evandro set the repository for this revision to rL LLVM.

¡Ping! 🔔🔔

evandro updated this revision to Diff 158162.Jul 30 2018, 7:07 PM
evandro edited reviewers, added: efriedma; removed: eli.friedman.

Why is this the right IR canonicalization?

We have a bug report requesting the opposite transform here:
https://bugs.llvm.org/show_bug.cgi?id=35600
...citing the case of a chain of many nested sqrt() calls. If we're going to canonicalize a sequence of many sqrt() calls to pow() in IR, then we might as well do the same for sqrt(sqrt()) to be consistent. Then, in the backend, we convert pow(x, 0.25) back to sqrt(sqrt()) as a special case, because that's a perf win.
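As an illustration of the direction question (my sketch, assuming fast-math and non-negative x): a chain of nested sqrt() calls is equivalent to a single pow() whose exponent is repeatedly halved, which is why PR35600 asks for the sqrt-chain-to-pow() direction, while this patch proposes the reverse for the 0.25 case.

    #include <cmath>

    // Three nested sqrt() calls...
    double nested_sqrt(double x) { return std::sqrt(std::sqrt(std::sqrt(x))); }
    // ...compute the same value as pow() with exponent 0.5 * 0.5 * 0.5 = 0.125.
    double as_pow(double x)      { return std::pow(x, 0.125); }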

Why is this the right IR canonicalization?

It seems to be beneficial on targets that have a fast instruction for sqrt(). I expanded testing to targets with a slow sqrt() instruction, though, and the results were not as promising.

Why is this the right IR canonicalization?

It seems to be beneficial on targets that have a fast instruction for sqrt(). I expanded testing to targets with a slow sqrt() instruction, though, and the results were not as promising.

We are in the middle end here, though; targets are a back-end concern.
We do not try to produce IR that is optimal for some specific target.
We want to produce IR that is optimal at the middle level.
The back end (DAGCombine, ...) can and should further refine the IR into whatever is optimal for the actual target.

So +1 to the question of whether we should be doing the opposite transform here, and doing this one in the backend instead.

evandro abandoned this revision.Jul 31 2018, 2:30 PM

Agreed.

And the same remarks are probably applicable to D49040 and D50036 too, even though they only extend the existing code. Canonicalizing all the sqrt/cbrt/... to just one pow will make it simpler for everything else, since it's just one function to be aware of instead of three.

spatel added a comment.Aug 1 2018, 5:19 AM

And the same remarks are probably applicable to D49040 and D50036 too, even though they only extend the existing code. Canonicalizing all the sqrt/cbrt/... to just one pow will make it simpler for everything else, since it's just one function to be aware of instead of three.

It's fuzzy. If we had decided on pow() as the canonical form for sqrt() from the start, that would be true. But now we have an intrinsic, analysis, transforms, cost modeling, lowering, etc. for sqrt() directly. So if we wanted to reverse that, we'd have to modify all of that to recognize pow(x, 0.5) as the equivalent. I think it's better as we have it now because sqrt() is the more recognized form both in source code and hardware.

IMO, what made this patch different from the others is that we're replacing 1 call with 2 calls (and potentially other ops, as shown in the tests here). Again though, it's fuzzy. We could argue that because sqrt() has existing analysis and pow() doesn't, the longer sequence in IR is better. That isn't the usual way to canonicalize, but it's not unprecedented. For example, there are existing transforms where we turn 1 sdiv/udiv into logic+cmp+select.
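A hypothetical C++ illustration of that style of expansion (not necessarily the exact fold being referred to): when the divisor is known to exceed half of the dividend's range, the quotient can only be 0 or 1, so a single udiv can be rewritten as a compare plus select.

    #include <cstdint>

    // One udiv instruction.
    uint32_t div_by_large_const(uint32_t x) { return x / 0xC0000000u; }
    // The same result as icmp + select, with no division at all.
    uint32_t div_by_large_const_expanded(uint32_t x) { return x >= 0xC0000000u ? 1u : 0u; }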

I would lean towards converting cbrt() to pow() here in IR. AFAIK, there's no existing IR benefit to the cbrt() form. Plus, there is an intrinsic for pow(), so that makes vectorization easier. If we canonicalize in the other direction (pow(x, 0.33) --> cbrt(x)), then we have a canonicalization difference based on scalar/vector...or we have to add an intrinsic for cbrt?
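A sketch of that direction (my illustration, assuming fast-math and non-negative x; the exact exponent constant is whatever the existing pow() folds already use):

    #include <cmath>

    // Canonical IR form: fold the specialized call into the generic one...
    double canonical(double x) { return std::pow(x, 1.0 / 3.0); }
    // ...and let the backend turn it back into cbrt() where that is faster.
    // Note that cbrt() and pow(x, 1.0/3.0) differ for negative x, hence fast-math.
    double backend_form(double x) { return std::cbrt(x); }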

evandro added a comment (edited).Aug 1 2018, 8:08 AM

A specialized transcendental function is inherently faster than a generic one. In the case of cbrt(), this is measurable in popular benchmarks like SPEC CPU2000 and CPU2006. I can see the appeal of converging particular cases back to one generic case in the middle end, but IMO the run-time environment favors the opposite direction. From this perspective, I think that the case for a cbrt() intrinsic is stronger than the case for folding it into pow().

spatel added a comment.Aug 1 2018, 8:59 AM

A specialized transcendental function is inherently faster than a generic one. In the case of cbrt(), this is measurable in popular benchmarks like SPEC CPU2000 and CPU2006.

No disagreement there. I understand the motivation and would like to see this fixed too.

I can see the appeal of converging particular cases back to one generic case in the middle end, but IMO the run-time environment favors the opposite direction. From this perspective, I think that the case for a cbrt() intrinsic is stronger than the case for folding it into pow().

I think we need a stronger argument to add another LLVM math intrinsic for cbrt(). E.g., what IR transforms/analyses would that intrinsic enable vs. something we could already do for llvm.pow(x, 0.33)? If there's no added value, then we should convert this to cbrt() in the backend.

evandro added a subscriber: fhahn.Aug 17 2018, 8:02 AM