Diff Detail
- Build Status
Buildable 10076 Build 10076: arc lint + arc unit
Event Timeline
There is a functional difference from this improvement (several FP intrinsics should now correctly propagate flags), but I'm not sure if it's visible without fixing something else in DAGCombiner to recognize the flags.
How does this work if an instruction maps to multiple nodes? For example, the FMA intrinsic can map to 2 nodes?
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp | ||
---|---|---|
95 | Formatting/spacing is non-standard here and below. Run clang-format? | |
99–100 | shorten: if (SDNode *Node = SelDB->getDAGNode(Instr)) { | |
7935–7939 | Don't need this anymore? | |
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h | ||
666 | Formatting/spacing is non-standard here and below. Run clang-format? |
Adding Adam and Akira as reviewers since they helped review earlier changes to SDNodeFlags.
Here's a potential test case that would show a difference from having FMF on a sqrt intrinsic:
define float @fast_recip_sqrt(float %x) { %y = call fast float @llvm.sqrt.f32(float %x) %z = fdiv fast float 1.0, %y ret float %z } declare float @llvm.sqrt.f32(float) nounwind readonly
...but as I said earlier, we need to fix the DAGCombiner code where this fold is implemented to recognize the flags on the individual nodes. Currently, it just checks the global state:
if (Options.UnsafeFPMath) {
On x86 currently, this will use the full-precision sqrtss+divss, but it should be using rsqrtss followed by mulss/addss to refine the estimate.
This propagation happens during SelectionDAGBuilder::visit, I scanned through various instructions and there is 1:1 mapping b/w instructions and initial SDNode created for it.
Like
for llvm.fma (Intrinsic Instruction ) -> SDNode(ISD::FMA )
for llvm.minnum (Insturction) -> SDNode(ISD::FMINNUM/FMINNAN)
etc. These initial SDNode are later lowered/expanded during following DAG phases.
Ok, we also have another usage of Fast Maths flage in reviev D37616. Can you please file a bugzilla to track suggested potential improvment.
case Intrinsic::fmuladd: { EVT VT = TLI.getValueType(DAG.getDataLayout(), I.getType()); if (TM.Options.AllowFPOpFusion != FPOpFusion::Strict && TLI.isFMAFasterThanFMulAndFAdd(VT)) { setValue(&I, DAG.getNode(ISD::FMA, sdl, getValue(I.getArgOperand(0)).getValueType(), getValue(I.getArgOperand(0)), getValue(I.getArgOperand(1)), getValue(I.getArgOperand(2)))); } else { // TODO: Intrinsic calls should have fast-math-flags. SDValue Mul = DAG.getNode(ISD::FMUL, sdl, getValue(I.getArgOperand(0)).getValueType(), getValue(I.getArgOperand(0)), getValue(I.getArgOperand(1))); SDValue Add = DAG.getNode(ISD::FADD, sdl, getValue(I.getArgOperand(0)).getValueType(), Mul, getValue(I.getArgOperand(2))); setValue(&I, Add); }
Ok, we also have another usage of Fast Maths flage in reviev D37616. Can you please file a bugzilla to track suggested potential improvment.
@reviewers, are there any more comments apart from last comments, this is just to save iteration, thanks for your time in reviews.
I pointed out 2 more places where I think we can eliminate the existing transfer of flags. I think you should do a complete audit for those.
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp | ||
---|---|---|
2732–2750 | Delete? | |
2760–2762 | Delete? | |
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h | ||
666 | Same question as earlier (I don't think it was answered). Can we use the existing SelectionDAGBuilder::getValue() to get to the node's flags? |
include/llvm/CodeGen/SelectionDAGNodes.h | ||
---|---|---|
364 ↗ | (On Diff #115056) | Okay. Somewhere we need a description of the algorithm. Something like: When processing an instruction with X kind of flags, we set flag Y on something in order to ensure something. This maintains the invariant that whatever. Then, after doing whatever, we set /unset the flag Z. |
- Rebase from trunk.
- More changes to cover review comments.
- Test usage of fast-math flags over nodes at some places, it fixes PR34558.
- More places where flags over node needs to be checked, to be done incrementally.
Please see the section under "Sometimes code reviews will take longer than you would hope for" regarding ping time:
https://llvm.org/docs/DeveloperPolicy.html#code-reviews
Also, this patch has grown in functionality since the last rev, but there are still no tests. If you want to demonstrate the effect of propagating the flags, pick just one DAG combine where that happens (ideally, the simplest case) and add tests to show the functional difference.
include/llvm/CodeGen/SelectionDAGNodes.h | ||
---|---|---|
933 ↗ | (On Diff #115536) | I don't think this is going to work as you're hoping for. If possible, please split this and any related changes into a separate follow-up patch. |
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h | ||
666 | Asking for the 3rd time: is this necessary? |
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h | ||
---|---|---|
666 | SelectionDAGBuilder::getValue() creates a new Values and puts it into a NodeMap if it does not exist and SelectionDAGBuilder::getDAGNode() check NodeMap and returns a DAG node only if it exists. |
I've added some more FMF tests at rL313893 which I think this patch will miscompile. Please rebase/update.
As I suggested before, this patch shouldn't try to enable multiple DAG combines with node-level FMF. It's not as straightforward as you might think.
Pick exactly one combine if you want to show that this patch is working as intended. The llvm.muladd intrinsic test that you have here with a target that supports 'fma' (not plain x86) seems like a good choice to me. If we have a strict op in IR, it should produce an fma instruction. If we have a fast op in IR, it should produce the simpler fmul instruction?
My understanding and code changes are based LLVM Ref Manual 's section about Fast-Math flags" (http://llvm.org/docs/LangRef.html#fast-math-flags)
Which say for FMF flag NaN "Allow optimizations to assume the arguments and result are not NaN".
Now in following case which has been added by you
%y = call float @llvm.sqrt.f32(float %x) %z = fdiv fast float 1.0, %y ret float %z
We dont have fast flag over intrinsic but DAGCombining for fdiv sees a fast flag and assume result (%z) and arguments (constant , %y) as not a Nan and goes ahead and generates a reciprocal sqrt. If you remove fast from fdiv and add it to intrinsic then FMF opt at fdiv will not kick in.
Can you please let me know what you expected here.
I expect that the sqrt result is strict. Ie, it should use sqrtss if this is x86-64. We're not allowed to use rsqrtss and lose precision on that op.
That said, my memory of exactly how op-level FMF should work is fuzzy. If anyone else remembers or can link to threads where we've discussed this, please feel free to jump in. :)
Exactly, that is why I added a routine to get Unified flags which intersects flags of a node with flags of its operands in the earlier version of this patch , i think it will be right to inject that code with this patch [it was removed from current version of patch as per your comments]
I think there's a good argument either way here...
While fast does imply nnan, and nnan does propagate backward, and it also implies arcp, and arcp does not propagate backward. arcp applied only to the instruction to which it's attached. In this case, we're allowed to use a reciprocal approximation to the division, and not the sqrt. However, we could argue that using the rsqrt is like doing the sqrt exactly and then just approximating the division. If there are no other users of the sqrt itself, there seems to be little semantic difference.
That said, my memory of exactly how op-level FMF should work is fuzzy. If anyone else remembers or can link to threads where we've discussed this, please feel free to jump in. :)
I think you're mixing up flag propagation as it applies to creation of nodes with flag combining when folding operations. These are 2 different things. This patch is about propagating from IR to nodes at creation time (at least that's what I think it should be limited to based on the title).
I thought we leaned to the more conservative (less propagating) model in IR. Not sure if that would be different here in the DAG or if rsqrt is a special case. Either way, it doesn't affect the logic at node creation time?
Here's a sqrt multi-use case to think about:
define float @multiuse_strict_sqrt(float %x, i1 %cmp) { %y = call float @llvm.sqrt.f32(float %x) %z = fdiv fast float 1.0, %y br i1 %cmp, label %t, label %f t: ret float %z f: %add = fadd float %y, 1.0 ret float %add }
Should this be:
## BB#0: sqrtss %xmm0, %xmm0 <--- shared sqrt op because sqrt is treated as strict testb $1, %dil je LBB0_2 ## BB#1: ## %t movss LCPI0_0(%rip), %xmm1 divss %xmm0, %xmm1 movaps %xmm1, %xmm0 retq LBB0_2: ## %f addss LCPI0_0(%rip), %xmm0 retq
Or:
## BB#0: testb $1, %dil je LBB0_2 ## BB#1: ## %t vrsqrtss %xmm0, %xmm0, %xmm1 <--- fast sqrt is assumed part of fast div vmulss %xmm1, %xmm0, %xmm0 vmulss %xmm1, %xmm0, %xmm0 vaddss LCPI0_0(%rip), %xmm0, %xmm0 vmulss LCPI0_1(%rip), %xmm1, %xmm1 vmulss %xmm0, %xmm1, %xmm0 retq LBB0_2: ## %f vsqrtss %xmm0, %xmm0, %xmm0 <--- strict sqrt only applies on this path vaddss LCPI0_2(%rip), %xmm0, %xmm0 retq
It's a special case in that it's our only combined reciprocal operation.
Either way, it doesn't affect the logic at node creation time?
I don't think that it needs to do so.
Here's a sqrt multi-use case to think about:
define float @multiuse_strict_sqrt(float %x, i1 %cmp) { %y = call float @llvm.sqrt.f32(float %x) %z = fdiv fast float 1.0, %y br i1 %cmp, label %t, label %f t: ret float %z f: %add = fadd float %y, 1.0 ret float %add }Should this be:
## BB#0: sqrtss %xmm0, %xmm0 <--- shared sqrt op because sqrt is treated as strict testb $1, %dil je LBB0_2 ## BB#1: ## %t movss LCPI0_0(%rip), %xmm1 divss %xmm0, %xmm1 movaps %xmm1, %xmm0 retq LBB0_2: ## %f addss LCPI0_0(%rip), %xmm0 retq
This is one valid option. The divss could also be a rcpss (+mul) if we'd like.
Or:
## BB#0: testb $1, %dil je LBB0_2 ## BB#1: ## %t vrsqrtss %xmm0, %xmm0, %xmm1 <--- fast sqrt is assumed part of fast div vmulss %xmm1, %xmm0, %xmm0 vmulss %xmm1, %xmm0, %xmm0 vaddss LCPI0_0(%rip), %xmm0, %xmm0 vmulss LCPI0_1(%rip), %xmm1, %xmm1 vmulss %xmm0, %xmm1, %xmm0 retq LBB0_2: ## %f vsqrtss %xmm0, %xmm0, %xmm0 <--- strict sqrt only applies on this path vaddss LCPI0_2(%rip), %xmm0, %xmm0 retq
This is another valid option. Either of these seem allowable.
I haven't actually looked at the builder changes since you revised them, so I defer to @hfinkel is he's already approved that part.
I still don't understand why we should put the combiner changes into this patch. I'd like to see progress on using the flags too, but I think those should be separate patches with tests that cover all of the potential ambiguity that we've raised here.
I don't think that I approved anything yet. I can take a holistic look at the patch later today.
I still don't understand why we should put the combiner changes into this patch. I'd like to see progress on using the flags too, but I think those should be separate patches with tests that cover all of the potential ambiguity that we've raised here.
@hfinkel , Just a reminder for review clearence.
I still don't understand why we should put the combiner changes into this patch. I'd like to see progress on using the flags too, but I think those should be separate patches with tests that cover all of the potential ambiguity that we've raised here.
include/llvm/CodeGen/SelectionDAGNodes.h | ||
---|---|---|
396 ↗ | (On Diff #116305) | We need a comment here explaining what the Commit parameter does (and how/when it is used). |
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp | ||
95 | shared b/w -> shared by | |
123 | I'm a bit worried about propagating the integer flags automatically this way. Maybe this is fine in practice, but if we're adding some kind of implicit contract here, we should clearly document it. An operation that is exact, or does not overflow in some way, could be implemented in terms of operations that do (and, then, having the flags on those intermediate nodes wouldn't be correct). |
returns -> Returns