[DAG] Consolidating Instruction->SDNode Flags propagation in one class for better code management.
Needs ReviewPublic

Authored by jbhateja on Mon, Sep 11, 6:45 AM.
jbhateja created this revision.Mon, Sep 11, 6:45 AM
jbhateja added a subscriber: llvm-commits.
RKSimon added inline comments.Mon, Sep 11, 7:29 AM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
112

auto *OFBinOp

118

auto *ExactOp

jbhateja added a comment.EditedMon, Sep 11, 7:53 AM

@RKSimon Anything else or should I check this in as NFC.

@RKSimon Anything else or should I check this in as NFC.

@spatel needs to review

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6622

remove newline diff

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
666

is this needed in this patch?

There is a functional difference from this improvement (several FP intrinsics should now correctly propagate flags), but I'm not sure if it's visible without fixing something else in DAGCombiner to recognize the flags.

How does this work if an instruction maps to multiple nodes? For example, the FMA intrinsic can map to 2 nodes?

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
95

Formatting/spacing is non-standard here and below. Run clang-format?

99–100

shorten: if (SDNode *Node = SelDB->getDAGNode(Instr)) {

7956–7959

Don't need this anymore?

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
666

Formatting/spacing is non-standard here and below. Run clang-format?

Adding potential reviewers from D37616.

Adding Adam and Akira as reviewers since they helped review earlier changes to SDNodeFlags.

Here's a potential test case that would show a difference from having FMF on a sqrt intrinsic:

define float @fast_recip_sqrt(float %x) {
  %y = call fast float @llvm.sqrt.f32(float %x)
  %z = fdiv fast float 1.0,  %y
  ret float %z
}
declare float @llvm.sqrt.f32(float) nounwind readonly

...but as I said earlier, we need to fix the DAGCombiner code where this fold is implemented to recognize the flags on the individual nodes. Currently, it just checks the global state:

if (Options.UnsafeFPMath) {

On x86 currently, this will use the full-precision sqrtss+divss, but it should be using rsqrtss followed by mulss/addss to refine the estimate.

jbhateja added a comment.EditedMon, Sep 11, 9:15 AM

There is a functional difference from this improvement (several FP intrinsics should now correctly propagate flags), but I'm not sure if it's visible without fixing something else in DAGCombiner to recognize the flags.

How does this work if an instruction maps to multiple nodes? For example, the FMA intrinsic can map to 2 nodes?

This propagation happens during SelectionDAGBuilder::visit, I scanned through various instructions and there is 1:1 mapping b/w instructions and initial SDNode created for it.
Like
for llvm.fma (Intrinsic Instruction ) -> SDNode(ISD::FMA )
for llvm.minnum (Insturction) -> SDNode(ISD::FMINNUM/FMINNAN)

etc. These initial SDNode are later lowered/expanded during following DAG phases.

Here's a potential test case that would show a difference from having FMF on a sqrt intrinsic:

define float @fast_recip_sqrt(float %x) {
  %y = call fast float @llvm.sqrt.f32(float %x)
  %z = fdiv fast float 1.0,  %y
  ret float %z
}
declare float @llvm.sqrt.f32(float) nounwind readonly

...but as I said earlier, we need to fix the DAGCombiner code where this fold is implemented to recognize the flags on the individual nodes. Currently, it just checks the global state:

if (Options.UnsafeFPMath) {

On x86 currently, this will use the full-precision sqrtss+divss, but it should be using rsqrtss followed by mulss/addss to refine the estimate.

Ok, we also have another usage of Fast Maths flage in reviev D37616. Can you please file a bugzilla to track suggested potential improvment.

How does this work if an instruction maps to multiple nodes? For example, the FMA intrinsic can map to 2 nodes?

This propagation happens during SelectionDAGBuilder::visit, I scanned through various instructions and there is 1:1 mapping b/w instructions and initial SDNode created for it.

case Intrinsic::fmuladd: {
  EVT VT = TLI.getValueType(DAG.getDataLayout(), I.getType());
  if (TM.Options.AllowFPOpFusion != FPOpFusion::Strict &&
      TLI.isFMAFasterThanFMulAndFAdd(VT)) {
    setValue(&I, DAG.getNode(ISD::FMA, sdl,
                             getValue(I.getArgOperand(0)).getValueType(),
                             getValue(I.getArgOperand(0)),
                             getValue(I.getArgOperand(1)),
                             getValue(I.getArgOperand(2))));
  } else {
    // TODO: Intrinsic calls should have fast-math-flags.
    SDValue Mul = DAG.getNode(ISD::FMUL, sdl,
                              getValue(I.getArgOperand(0)).getValueType(),
                              getValue(I.getArgOperand(0)),
                              getValue(I.getArgOperand(1)));
    SDValue Add = DAG.getNode(ISD::FADD, sdl,
                              getValue(I.getArgOperand(0)).getValueType(),
                              Mul,
                              getValue(I.getArgOperand(2)));
    setValue(&I, Add);
  }

Ok, we also have another usage of Fast Maths flage in reviev D37616. Can you please file a bugzilla to track suggested potential improvment.

https://bugs.llvm.org/show_bug.cgi?id=34558

jbhateja updated this revision to Diff 114847.Tue, Sep 12, 8:16 AM
  • Review comments resolution + flags propagation over operands.

Ping @reviewers

hfinkel added inline comments.
include/llvm/CodeGen/SelectionDAGNodes.h
360

These flags need comments explaining what they are and how/when they're used.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
664

returns -> Returns

@reviewers, are there any more comments apart from last comments, this is just to save iteration, thanks for your time in reviews.

jbhateja updated this revision to Diff 115022.Wed, Sep 13, 6:18 AM
  • Review comments handling.
hfinkel added inline comments.Wed, Sep 13, 8:12 AM
include/llvm/CodeGen/SelectionDAGNodes.h
360

Add a comma after true.

364

Please explain here when these flags are set/unset and why.

jbhateja updated this revision to Diff 115056.Wed, Sep 13, 9:16 AM
  • Review comments resolution.

I pointed out 2 more places where I think we can eliminate the existing transfer of flags. I think you should do a complete audit for those.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2765–2766

Delete?

2773–2774

Delete?

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
666

Same question as earlier (I don't think it was answered).

Can we use the existing SelectionDAGBuilder::getValue() to get to the node's flags?

hfinkel added inline comments.Wed, Sep 13, 3:57 PM
include/llvm/CodeGen/SelectionDAGNodes.h
367

Okay. Somewhere we need a description of the algorithm. Something like:

When processing an instruction with X kind of flags, we set flag Y on something in order to ensure something. This maintains the invariant that whatever. Then, after doing whatever, we set /unset the flag Z.

jbhateja updated this revision to Diff 115536.EditedSat, Sep 16, 6:24 AM
  • Rebase from trunk.
  • More changes to cover review comments.
  • Test usage of fast-math flags over nodes at some places, it fixes PR34558.
  • More places where flags over node needs to be checked, to be done incrementally.

ping @ reviewers.

Ping @reviewers.

Please see the section under "Sometimes code reviews will take longer than you would hope for" regarding ping time:
https://llvm.org/docs/DeveloperPolicy.html#code-reviews

Also, this patch has grown in functionality since the last rev, but there are still no tests. If you want to demonstrate the effect of propagating the flags, pick just one DAG combine where that happens (ideally, the simplest case) and add tests to show the functional difference.

include/llvm/CodeGen/SelectionDAGNodes.h
933

I don't think this is going to work as you're hoping for. If possible, please split this and any related changes into a separate follow-up patch.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
666

Asking for the 3rd time: is this necessary?

jbhateja updated this revision to Diff 116159.Thu, Sep 21, 3:07 AM
  • Review comments resolutions.
jbhateja added inline comments.Thu, Sep 21, 3:10 AM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
666

SelectionDAGBuilder::getValue() creates a new Values and puts it into a NodeMap if it does not exist and SelectionDAGBuilder::getDAGNode() check NodeMap and returns a DAG node only if it exists.

I've added some more FMF tests at rL313893 which I think this patch will miscompile. Please rebase/update.

As I suggested before, this patch shouldn't try to enable multiple DAG combines with node-level FMF. It's not as straightforward as you might think.

Pick exactly one combine if you want to show that this patch is working as intended. The llvm.muladd intrinsic test that you have here with a target that supports 'fma' (not plain x86) seems like a good choice to me. If we have a strict op in IR, it should produce an fma instruction. If we have a fast op in IR, it should produce the simpler fmul instruction?

I've added some more FMF tests at rL313893 which I think this patch will miscompile. Please rebase/update.

As I suggested before, this patch shouldn't try to enable multiple DAG combines with node-level FMF. It's not as straightforward as you might think.

Pick exactly one combine if you want to show that this patch is working as intended. The llvm.muladd intrinsic test that you have here with a target that supports 'fma' (not plain x86) seems like a good choice to me. If we have a strict op in IR, it should produce an fma instruction. If we have a fast op in IR, it should produce the simpler fmul instruction?

My understanding and code changes are based LLVM Ref Manual 's section about Fast-Math flags" (http://llvm.org/docs/LangRef.html#fast-math-flags)

Which say for FMF flag NaN "Allow optimizations to assume the arguments and result are not NaN".

Now in following case which has been added by you

%y = call float @llvm.sqrt.f32(float %x)
%z = fdiv fast float 1.0, %y
ret float %z

We dont have fast flag over intrinsic but DAGCombining for fdiv sees a fast flag and assume result (%z) and arguments (constant , %y) as not a Nan and goes ahead and generates a reciprocal sqrt. If you remove fast from fdiv and add it to intrinsic then FMF opt at fdiv will not kick in.

Can you please let me know what you expected here.

I've added some more FMF tests at rL313893 which I think this patch will miscompile. Please rebase/update.

As I suggested before, this patch shouldn't try to enable multiple DAG combines with node-level FMF. It's not as straightforward as you might think.

Pick exactly one combine if you want to show that this patch is working as intended. The llvm.muladd intrinsic test that you have here with a target that supports 'fma' (not plain x86) seems like a good choice to me. If we have a strict op in IR, it should produce an fma instruction. If we have a fast op in IR, it should produce the simpler fmul instruction?

My understanding and code changes are based LLVM Ref Manual 's section about Fast-Math flags" (http://llvm.org/docs/LangRef.html#fast-math-flags)

Which say for FMF flag NaN "Allow optimizations to assume the arguments and result are not NaN".

Now in following case which has been added by you

%y = call float @llvm.sqrt.f32(float %x)
%z = fdiv fast float 1.0, %y
ret float %z

We dont have fast flag over intrinsic but DAGCombining for fdiv sees a fast flag and assume result (%z) and arguments (constant , %y) as not a Nan and goes ahead and generates a reciprocal sqrt. If you remove fast from fdiv and add it to intrinsic then FMF opt at fdiv will not kick in.

Can you please let me know what you expected here.

I expect that the sqrt result is strict. Ie, it should use sqrtss if this is x86-64. We're not allowed to use rsqrtss and lose precision on that op.

That said, my memory of exactly how op-level FMF should work is fuzzy. If anyone else remembers or can link to threads where we've discussed this, please feel free to jump in. :)

I've added some more FMF tests at rL313893 which I think this patch will miscompile. Please rebase/update.

As I suggested before, this patch shouldn't try to enable multiple DAG combines with node-level FMF. It's not as straightforward as you might think.

Pick exactly one combine if you want to show that this patch is working as intended. The llvm.muladd intrinsic test that you have here with a target that supports 'fma' (not plain x86) seems like a good choice to me. If we have a strict op in IR, it should produce an fma instruction. If we have a fast op in IR, it should produce the simpler fmul instruction?

My understanding and code changes are based LLVM Ref Manual 's section about Fast-Math flags" (http://llvm.org/docs/LangRef.html#fast-math-flags)

Which say for FMF flag NaN "Allow optimizations to assume the arguments and result are not NaN".

Now in following case which has been added by you

%y = call float @llvm.sqrt.f32(float %x)
%z = fdiv fast float 1.0, %y
ret float %z

We dont have fast flag over intrinsic but DAGCombining for fdiv sees a fast flag and assume result (%z) and arguments (constant , %y) as not a Nan and goes ahead and generates a reciprocal sqrt. If you remove fast from fdiv and add it to intrinsic then FMF opt at fdiv will not kick in.

Can you please let me know what you expected here.

I expect that the sqrt result is strict. Ie, it should use sqrtss if this is x86-64. We're not allowed to use rsqrtss and lose precision on that op.

That said, my memory of exactly how op-level FMF should work is fuzzy. If anyone else remembers or can link to threads where we've discussed this, please feel free to jump in. :)

Exactly, that is why I added a routine to get Unified flags which intersects flags of a node with flags of its operands in the earlier version of this patch , i think it will be right to inject that code with this patch [it was removed from current version of patch as per your comments]

I've added some more FMF tests at rL313893 which I think this patch will miscompile. Please rebase/update.

As I suggested before, this patch shouldn't try to enable multiple DAG combines with node-level FMF. It's not as straightforward as you might think.

Pick exactly one combine if you want to show that this patch is working as intended. The llvm.muladd intrinsic test that you have here with a target that supports 'fma' (not plain x86) seems like a good choice to me. If we have a strict op in IR, it should produce an fma instruction. If we have a fast op in IR, it should produce the simpler fmul instruction?

My understanding and code changes are based LLVM Ref Manual 's section about Fast-Math flags" (http://llvm.org/docs/LangRef.html#fast-math-flags)

Which say for FMF flag NaN "Allow optimizations to assume the arguments and result are not NaN".

Now in following case which has been added by you

%y = call float @llvm.sqrt.f32(float %x)
%z = fdiv fast float 1.0, %y
ret float %z

We dont have fast flag over intrinsic but DAGCombining for fdiv sees a fast flag and assume result (%z) and arguments (constant , %y) as not a Nan and goes ahead and generates a reciprocal sqrt. If you remove fast from fdiv and add it to intrinsic then FMF opt at fdiv will not kick in.

Can you please let me know what you expected here.

I expect that the sqrt result is strict. Ie, it should use sqrtss if this is x86-64. We're not allowed to use rsqrtss and lose precision on that op.

I think there's a good argument either way here...

While fast does imply nnan, and nnan does propagate backward, and it also implies arcp, and arcp does not propagate backward. arcp applied only to the instruction to which it's attached. In this case, we're allowed to use a reciprocal approximation to the division, and not the sqrt. However, we could argue that using the rsqrt is like doing the sqrt exactly and then just approximating the division. If there are no other users of the sqrt itself, there seems to be little semantic difference.

That said, my memory of exactly how op-level FMF should work is fuzzy. If anyone else remembers or can link to threads where we've discussed this, please feel free to jump in. :)

I've added some more FMF tests at rL313893 which I think this patch will miscompile. Please rebase/update.

As I suggested before, this patch shouldn't try to enable multiple DAG combines with node-level FMF. It's not as straightforward as you might think.

Pick exactly one combine if you want to show that this patch is working as intended. The llvm.muladd intrinsic test that you have here with a target that supports 'fma' (not plain x86) seems like a good choice to me. If we have a strict op in IR, it should produce an fma instruction. If we have a fast op in IR, it should produce the simpler fmul instruction?

My understanding and code changes are based LLVM Ref Manual 's section about Fast-Math flags" (http://llvm.org/docs/LangRef.html#fast-math-flags)

Which say for FMF flag NaN "Allow optimizations to assume the arguments and result are not NaN".

Now in following case which has been added by you

%y = call float @llvm.sqrt.f32(float %x)
%z = fdiv fast float 1.0, %y
ret float %z

We dont have fast flag over intrinsic but DAGCombining for fdiv sees a fast flag and assume result (%z) and arguments (constant , %y) as not a Nan and goes ahead and generates a reciprocal sqrt. If you remove fast from fdiv and add it to intrinsic then FMF opt at fdiv will not kick in.

Can you please let me know what you expected here.

I expect that the sqrt result is strict. Ie, it should use sqrtss if this is x86-64. We're not allowed to use rsqrtss and lose precision on that op.

That said, my memory of exactly how op-level FMF should work is fuzzy. If anyone else remembers or can link to threads where we've discussed this, please feel free to jump in. :)

Exactly, that is why I added a routine to get Unified flags which intersects flags of a node with flags of its operands in the earlier version of this patch , i think it will be right to inject that code with this patch [it was removed from current version of patch as per your comments]

I think you're mixing up flag propagation as it applies to creation of nodes with flag combining when folding operations. These are 2 different things. This patch is about propagating from IR to nodes at creation time (at least that's what I think it should be limited to based on the title).

RKSimon resigned from this revision.Thu, Sep 21, 12:53 PM

While fast does imply nnan, and nnan does propagate backward, and it also implies arcp, and arcp does not propagate backward. arcp applied only to the instruction to which it's attached. In this case, we're allowed to use a reciprocal approximation to the division, and not the sqrt. However, we could argue that using the rsqrt is like doing the sqrt exactly and then just approximating the division. If there are no other users of the sqrt itself, there seems to be little semantic difference.

I thought we leaned to the more conservative (less propagating) model in IR. Not sure if that would be different here in the DAG or if rsqrt is a special case. Either way, it doesn't affect the logic at node creation time?

Here's a sqrt multi-use case to think about:

define float @multiuse_strict_sqrt(float %x, i1 %cmp) {
  %y = call float @llvm.sqrt.f32(float %x)
  %z = fdiv fast float 1.0,  %y
  br i1 %cmp, label %t, label %f

t:
  ret float %z
f:
  %add = fadd float %y, 1.0
  ret float %add
}

Should this be:

## BB#0:
	sqrtss	%xmm0, %xmm0 <--- shared sqrt op because sqrt is treated as strict
	testb	$1, %dil
	je	LBB0_2
## BB#1:                                ## %t
	movss	LCPI0_0(%rip), %xmm1  
	divss	%xmm0, %xmm1
	movaps	%xmm1, %xmm0
	retq
LBB0_2:                                 ## %f
	addss	LCPI0_0(%rip), %xmm0
	retq

Or:

## BB#0:
	testb	$1, %dil
	je	LBB0_2
## BB#1:                                ## %t
	vrsqrtss	%xmm0, %xmm0, %xmm1   <--- fast sqrt is assumed part of fast div
	vmulss	%xmm1, %xmm0, %xmm0
	vmulss	%xmm1, %xmm0, %xmm0
	vaddss	LCPI0_0(%rip), %xmm0, %xmm0
	vmulss	LCPI0_1(%rip), %xmm1, %xmm1
	vmulss	%xmm0, %xmm1, %xmm0
	retq
LBB0_2:                                 ## %f
	vsqrtss	%xmm0, %xmm0, %xmm0  <--- strict sqrt only applies on this path
	vaddss	LCPI0_2(%rip), %xmm0, %xmm0
	retq

While fast does imply nnan, and nnan does propagate backward, and it also implies arcp, and arcp does not propagate backward. arcp applied only to the instruction to which it's attached. In this case, we're allowed to use a reciprocal approximation to the division, and not the sqrt. However, we could argue that using the rsqrt is like doing the sqrt exactly and then just approximating the division. If there are no other users of the sqrt itself, there seems to be little semantic difference.

I thought we leaned to the more conservative (less propagating) model in IR. Not sure if that would be different here in the DAG or if rsqrt is a special case.

It's a special case in that it's our only combined reciprocal operation.

Either way, it doesn't affect the logic at node creation time?

I don't think that it needs to do so.

Here's a sqrt multi-use case to think about:

define float @multiuse_strict_sqrt(float %x, i1 %cmp) {
  %y = call float @llvm.sqrt.f32(float %x)
  %z = fdiv fast float 1.0,  %y
  br i1 %cmp, label %t, label %f

t:
  ret float %z
f:
  %add = fadd float %y, 1.0
  ret float %add
}

Should this be:

## BB#0:
	sqrtss	%xmm0, %xmm0 <--- shared sqrt op because sqrt is treated as strict
	testb	$1, %dil
	je	LBB0_2
## BB#1:                                ## %t
	movss	LCPI0_0(%rip), %xmm1  
	divss	%xmm0, %xmm1
	movaps	%xmm1, %xmm0
	retq
LBB0_2:                                 ## %f
	addss	LCPI0_0(%rip), %xmm0
	retq

This is one valid option. The divss could also be a rcpss (+mul) if we'd like.

Or:

## BB#0:
	testb	$1, %dil
	je	LBB0_2
## BB#1:                                ## %t
	vrsqrtss	%xmm0, %xmm0, %xmm1   <--- fast sqrt is assumed part of fast div
	vmulss	%xmm1, %xmm0, %xmm0
	vmulss	%xmm1, %xmm0, %xmm0
	vaddss	LCPI0_0(%rip), %xmm0, %xmm0
	vmulss	LCPI0_1(%rip), %xmm1, %xmm1
	vmulss	%xmm0, %xmm1, %xmm0
	retq
LBB0_2:                                 ## %f
	vsqrtss	%xmm0, %xmm0, %xmm0  <--- strict sqrt only applies on this path
	vaddss	LCPI0_2(%rip), %xmm0, %xmm0
	retq

This is another valid option. Either of these seem allowable.

jbhateja updated this revision to Diff 116305.Thu, Sep 21, 10:58 PM
  • Updating test case with more than one uses of sqrt / mul.
jbhateja added a comment.EditedSat, Sep 23, 6:21 PM

@spatel , @reviewiews , can this land now into trunk ?