The generic infrastructure that computes the Newton series for reciprocal and reciprocal square root was designed to let a target compute the series itself. However, the original code did not properly handle the case where a target chose to do so. This patch addresses those issues so that a target can compute the series on its own.
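For context, the series in question is the standard Newton-Raphson refinement for these two operations (added here for reference; it is not spelled out in the patch description):

$$x_{n+1} = x_n \,(2 - a\,x_n) \;\longrightarrow\; \frac{1}{a}$$

$$x_{n+1} = \frac{x_n}{2}\,(3 - a\,x_n^2) \;\longrightarrow\; \frac{1}{\sqrt{a}}$$

Each extra iteration roughly doubles the number of correct bits in the estimate.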
llvm/include/llvm/CodeGen/ISDOpcodes.h | ||
---|---|---|
540–542 ↗ | (On Diff #66169) | You should probably mention the argument order and precision requirements here (specifically that multiple rounding steps may occur). |
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
14522 | Isn't this more a question of legality? | |
llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp | ||
203–204 ↗ | (On Diff #66169) | Probably should be all lower-case looking at the surrounding entries. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
7832–7833 | Isn't this really just a pattern? It looks like we could mark FRECPI as legal (which is more what the DAGCombiner should be asking anyway) and write: def : Pat<(f32 (frecpi f32:$arg, f32:$est)), (FMULSrr (FRECPS32 $arg, $est), $est)>; Similarly for the sqrt case. | |
llvm/test/CodeGen/AArch64/recp-fastmath.ll | ||
16 | We should be checking data-flow too in these tests. It's not enough to know that LLVM managed to cobble together an frecps instruction. |
llvm/include/llvm/CodeGen/ISDOpcodes.h | ||
---|---|---|
546 ↗ | (On Diff #66169) | I don't quite understand why you need to introduce new ARM-specific nodes into the ISD namespace. As an alternative, you could ask the target to build ARMISD::FRSQRTI instructions directly from the buildSqrtNRNative function. |
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
14695 | A more consistent way to choose a refinement method would be to replace the boolean UseOneConstNR value with an enum, let the target set that enum, and choose the refinement method based on its value (see the sketch after this table). Yet another option would be to return an already refined value from ARMTargetLowering::getRsqrtEstimate and set Iterations to 0; in that case we wouldn't need the buildSqrtNRNative function. |
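A minimal sketch of the enum alternative, assuming hypothetical names (NRRefinement and the revised hook signature are illustrative, not from the patch):

```cpp
// Hypothetical replacement for the boolean UseOneConstNR out-parameter:
// the target reports which refinement flavor it wants, or that it already
// produced the final value itself.
enum class NRRefinement {
  None,         // target emitted the whole sequence; no generic iterations
  OneConstant,  // one-constant Newton-Raphson variant
  TwoConstant,  // two-constant Newton-Raphson variant
};

// The hook would then look something like:
//   SDValue getRsqrtEstimate(SDValue Operand, SelectionDAG &DAG,
//                            int &Iterations, NRRefinement &Refinement) const;
// and the DAGCombiner would switch on Refinement instead of testing a bool.
```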
Thank you.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
14695 | I like this suggestion. I'll explore it and get back to y'all. |
Refactored the previous patch extensively to implement the series in the AArch64 backend.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
14500–14506 | All of the changes in this file appear to be refactoring; is that right? Mostly they look OK (though they should be committed separately). But I'm not convinced that moving the sqrt(0) handling into BuildReciprocalEstimate is an improvement. | |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
4598 | A for loop would probably be better; people expect while loops to do crazy things. (A sketch of the suggested shape follows this table.) |
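A hedged sketch of the suggested shape; the node and variable names are illustrative of this patch's AArch64 refinement code, not quoted from it:

```cpp
// Refine the initial FRSQRTE estimate a fixed number of times with a for
// loop instead of a while loop: one FRSQRTS step plus a multiply per pass.
for (int I = 0; I < Iterations; ++I) {
  SDValue Step = DAG.getNode(AArch64ISD::FRSQRTS, DL, VT, Operand, Estimate);
  Estimate = DAG.getNode(ISD::FMUL, DL, VT, Estimate, Step);
}
```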
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
14500–14506 | It's actually necessary to inform getRsqrtEstimate() whether the estimate is for sqrt(): if the target prefers to generate the whole sequence, it has to know whether it's computing sqrt() or its reciprocal, since the final product, in the case of sqrt(), is currently done inside buildSqrtNR{One,Two}Const(). (See the identity after this table.) |
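For reference, the "final product" is the usual identity for recovering the square root from the reciprocal-square-root estimate:

$$\sqrt{x} = x \cdot \frac{1}{\sqrt{x}}$$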
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
14500 | OK, I see that change now. But what about the block of 0-handling code moved from buildSqrtEstimate to buildSqrtEstimateImpl? I still don't see how that's an improvement. |
llvm/test/CodeGen/X86/sqrt-fastmath.ll | ||
---|---|---|
42–45 ↗ | (On Diff #66719) | vblendvps is a regression vs. the vandnps that we had. Do you know what caused that? Also, why did the "-NEXT" part of the assertions disappear? For any x86 codegen test changes, please use the script that is noted on the first line of the test file to avoid that problem. |
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
14500 | For now it does improve grouping, IMO. But I'm mulling whether the handling code could be profitably moved inside getSqrtEstimate. | |
14669 | @spatel I changed Zero to Op here, since Op is then known to be 0. The motivation was that AArch64 has an immediate version that implements SETEQ, and materializing a 0 can be expensive. This probably caused the side effect of changing vandnps to vblendvps on X86. | |
llvm/test/CodeGen/X86/sqrt-fastmath.ll | ||
42–45 ↗ | (On Diff #66719) | Sorry about that. |
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
14669 | Ah, I see the diff now. But this is a target-independent transform, so isn't using 'Zero' in the select the more specific, and therefore the better, construct? This suggests that AArch64 is missing a fold that checks if an operand of a select is a zero; x86 must have this somewhere to allow the transform from blendv to andn? | |
llvm/test/CodeGen/X86/sqrt-fastmath.ll | ||
42–45 ↗ | (On Diff #66840) | No worries. Note that I've used a modified version of that script to generate checks for targets besides x86 - in case anyone would like to enhance the script and make test generation easier for AArch64. :) |
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
14669 | Using Op also gets the correct value for -0. I'm not entirely sure how much that matters, though, given that we're in fast-math anyway. (A condensed sketch of the guarded estimate follows this table.) |
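A condensed sketch of the guarded estimate under discussion, in plain C++ for illustration only; this is not the DAGCombiner code, and 1.0f / std::sqrt(x) stands in for the hardware rsqrt estimate:

```cpp
#include <cmath>

// Why the select is needed, and what selecting Op (x) instead of a literal
// zero buys: rsqrt(0) is +inf, so x * rsqrt(x) is 0 * inf == NaN at x == 0.
float guardedSqrtEstimate(float x) {
  float est = x * (1.0f / std::sqrt(x)); // NaN when x == 0
  // Selecting 'x' rather than 0.0f avoids materializing a zero constant on
  // AArch64, and also happens to return -0.0f for x == -0.0f.
  return (x == 0.0f) ? x : est;
}
```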
Please can you recreate the diff with context? `svn diff --diff-cmd=diff -x -U999999`
llvm/test/CodeGen/X86/sqrt-fastmath.ll | ||
---|---|---|
42–45 ↗ | (On Diff #66840) | As Sanjay said, the use of vblendvps over vandnps is a regression that could affect throughput quite badly. |
llvm/test/CodeGen/X86/sqrt-fastmath.ll | ||
---|---|---|
42–45 ↗ | (On Diff #66840) | @t.p.northover, is Sanjay right that AArch64 could use a fold here instead? Otherwise, I could move the check for 0.0 inside getSqrtEstimate(). |
llvm/test/CodeGen/X86/sqrt-fastmath.ll | ||
---|---|---|
42–45 ↗ | (On Diff #66840) | We seem to catch the simple cases, based on the IR below. (I haven't checked, but I suspect it's actually the generic DAG combiner that's doing it.) I'm not sure why we don't get this one, but fixing it would improve performance, probably beyond using Op. |

```llvm
define <4 x float> @foo(<4 x float> %lhs, <4 x float> %rhs, <4 x float> %val) {
  %tst = fcmp oeq <4 x float> %lhs, %rhs
  %res = select <4 x i1> %tst, <4 x float> %val, <4 x float> zeroinitializer
  ret <4 x float> %res
}
```
Side note regarding select folding with -1/0: inspired by this patch, I filed PR28895 ( https://llvm.org/bugs/show_bug.cgi?id=28895 ).
There are a few different paths and optimizations for this in x86. Some of it (e.g., D23337) could be lifted into the generic DAG combiner, I think.
When I looked at how AArch64 handled the case in PR28895, I noticed that the select always gets cracked into and/andn/or and then re-matched into a vbsl. That seems like a better general policy than what x86 is doing (matching to ISD::VSELECT early).
Regardless of all that, we really do want to avoid vblendv on x86 in this patch. As Simon hinted, some cores suffer greatly because vblendv is cracked into the base logic ops (and/andn/or) by the hardware, so that instruction has three times worse latency/throughput than a simple op. (The illustration below shows why a single AND suffices here.)
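For illustration only (plain C++, not LLVM's lowering code; the helper name is hypothetical): because IEEE-754 +0.0f is the all-zero bit pattern, a select between a value and zero under an all-ones/all-zeros compare mask reduces to a single bitwise AND, which is why vandnps/vandps beats the three-uop vblendvps here.

```cpp
#include <cstdint>
#include <cstring>

// select(mask, val, 0.0f) == bits(val) & mask, when mask is all-ones or
// all-zeros, because +0.0f is the all-zero bit pattern.
float selectOrZero(std::uint32_t mask, float val) {
  std::uint32_t bits;
  std::memcpy(&bits, &val, sizeof bits);
  bits &= mask; // all-ones keeps val; all-zeros yields +0.0f
  float res;
  std::memcpy(&res, &bits, sizeof res);
  return res;
}
```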
I folded the zero-argument check into getSqrtEstimate(). I think that makes sense when it returns no iterations, since it allows the target to handle everything in whatever way is optimal for it.
It's better to use VSELECT for scalar types as well. How do I go about stuffing the scalars into vectors and back out around the VSELECT? (One possible shape is sketched below.)
Thank you.
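A hedged sketch of one possible shape, assuming single-element vector types are (or can be made) legal on the target; the helper name and type choices are illustrative, not from the patch:

```cpp
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Wrap the scalar operands in one-element vectors, select with VSELECT, and
// extract the lane again.
SDValue scalarSelectViaVSELECT(SelectionDAG &DAG, const SDLoc &DL,
                               SDValue Cond, SDValue TrueV, SDValue FalseV) {
  EVT VT = TrueV.getValueType();                          // e.g. f32
  EVT VecVT = EVT::getVectorVT(*DAG.getContext(), VT, 1); // e.g. v1f32
  EVT MaskVT =
      EVT::getVectorVT(*DAG.getContext(), Cond.getValueType(), 1);
  SDValue VCond = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, MaskVT, Cond);
  SDValue VTrue = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, VecVT, TrueV);
  SDValue VFalse = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, VecVT, FalseV);
  SDValue VSel = DAG.getNode(ISD::VSELECT, DL, VecVT, VCond, VTrue, VFalse);
  return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, VT, VSel,
                     DAG.getConstant(0, DL, MVT::i64));
}
```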
The generic combiner parts look alright to me, and since there are no test changes for PPC or x86, I assume that those changes only enable the AArch64 diffs. Someone with AArch64 experience should confirm that that part of the patch looks good.
It makes sense to split the fix for this issue from the native AArch64 implementation that uses it.
This part retains the target-independent changes as before; the AArch64-specific changes will follow in a separate patch.