This transformation helps some benchmarks in SPEC CPU2006, such as 447.dealII, as well as some proprietary benchmarks. Otherwise, there are no significant regressions on x86-64 or AArch64.
Why is this the right IR canonicalization?
We have a bug report requesting the opposite transform here:
https://bugs.llvm.org/show_bug.cgi?id=35600
...citing the case of a chain of many nested sqrt calls. If we're going to canonicalize a long chain of sqrt() calls to pow() in IR, then we might as well do the same for sqrt(sqrt()) to be consistent. We can then convert back to sqrt(sqrt()) as a special case of pow(x, 0.25) in the backend, because that's a perf win there.
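For concreteness, here is a minimal IR sketch of the canonicalization under discussion (function names are placeholders; fast-math flags are assumed, since sqrt(sqrt(x)) and pow(x, 0.25) are not bit-identical in every edge case):

```
declare double @llvm.sqrt.f64(double)
declare double @llvm.pow.f64(double, double)

; Before: two nested square roots.
define double @sqrt_sqrt(double %x) {
  %s1 = call fast double @llvm.sqrt.f64(double %x)
  %s2 = call fast double @llvm.sqrt.f64(double %s1)
  ret double %s2
}

; After canonicalization: a single pow(x, 0.25).
; The backend can later special-case this back into two
; sqrt instructions where that is the faster sequence.
define double @pow_quarter(double %x) {
  %p = call fast double @llvm.pow.f64(double %x, double 2.500000e-01)
  ret double %p
}
```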
This seems to be a win on targets that have a fast sqrt() instruction. When I expanded testing to targets with a slow sqrt() instruction, though, the results were not as promising.
We are in the middle end here though. Targets are back end.
We do not try to produce some-target-specific-optimal IR.
We want to produce middle-level optimal IR.
The back-end (DAGCombine, ...) can and should further refine the IR into whatever is optimal for the actual target.
So +1 to the question of whether we should be doing the opposite transform here, and doing this one in the backend.
It's fuzzy. If we had decided on pow() as the canonical form for sqrt() from the start, that would be true. But now we have an intrinsic, analysis, transforms, cost modeling, lowering, etc. for sqrt() directly. So if we wanted to reverse that, we'd have to modify all of that to recognize pow(x, 0.5) as the equivalent. I think it's better as we have it now because sqrt() is the more recognized form both in source code and hardware.
IMO, what made this patch different from the others is that we're replacing 1 call with 2 calls (and potentially other ops, as shown in the tests here). Again though, it's fuzzy. We could argue that because sqrt() has existing analysis and pow() doesn't, the longer sequence in IR is better. That isn't the usual way to canonicalize, but it's not unprecedented. For example, there are existing transforms where we turn 1 sdiv/udiv into logic + cmp + select.
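As one hedged illustration of that kind of expansion (a sketch of the general shape, not necessarily the exact transform referenced): a signed division by a power of two, which must round toward zero, can be rewritten as cmp + select + add + shift:

```
; Expanded form of 'sdiv i32 %x, 4': add a bias of (divisor - 1)
; to negative inputs so that the arithmetic shift, which rounds
; toward negative infinity, ends up rounding toward zero.
define i32 @sdiv_by_4_expanded(i32 %x) {
  %isneg = icmp slt i32 %x, 0
  %bias = select i1 %isneg, i32 3, i32 0
  %adj = add i32 %x, %bias
  %res = ashr i32 %adj, 2
  ret i32 %res
}
```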
I would lean towards converting cbrt() to pow() here in IR. AFAIK, there's no existing IR benefit to the cbrt() form. Plus, there is an intrinsic for pow(), so that makes vectorization easier. If we canonicalize in the other direction (pow(x, 0.33) --> cbrt(x)), then we have a canonicalization difference based on scalar/vector...or we have to add an intrinsic for cbrt?
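A sketch of what that proposed direction could look like in IR (function names are placeholders; fast-math flags are assumed, since pow(x, 1/3) only matches cbrt(x) for non-negative inputs: cbrt is defined for negative arguments while pow is not):

```
declare double @cbrt(double)
declare double @llvm.pow.f64(double, double)

; Before: a libcall with no LLVM intrinsic equivalent.
define double @cbrt_call(double %x) {
  %r = call fast double @cbrt(double %x)
  ret double %r
}

; After: the generic pow intrinsic, with the closest double
; to 1/3 (0x3FD5555555555555) as the exponent.
define double @pow_third(double %x) {
  %r = call fast double @llvm.pow.f64(double %x, double 0x3FD5555555555555)
  ret double %r
}
```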
A specialized transcendental function is inherently faster than a generic one. In the case of cbrt(), this is measurable in popular benchmarks such as SPEC CPU2000 and CPU2006. I can see the appeal of converging particular cases back to one generic case in the middle end, but IMO it's the opposite way in the runtime environment. From this perspective, methinks that the case for a cbrt() intrinsic is stronger than the case for folding it into pow().
No disagreement there. I understand the motivation and would like to see this fixed too.
> I can see the appeal of converging particular cases back to one generic case in the middle end, but IMO it's the opposite way in the runtime environment. From this perspective, methinks that the case for a cbrt() intrinsic is stronger than the case for folding it into pow().
I think we need a stronger argument to add another LLVM math intrinsic for cbrt(). E.g., what IR transforms or analyses would that intrinsic enable vs. something we could already do for llvm.pow(x, 0.33)? If there's no added value, then we should convert this to cbrt() in the backend.