This is an archive of the discontinued LLVM Phabricator instance.

[DAG] Consolidating Instruction->SDNode Flags propagation in one class for better code management.
Needs ReviewPublic

Authored by jbhateja on Sep 11 2017, 6:45 AM.
Tokens
"Y So Serious" token, awarded by post.kadirselcuk.

Event Timeline

jbhateja created this revision.Sep 11 2017, 6:45 AM
jbhateja added a subscriber: llvm-commits.
RKSimon added inline comments.Sep 11 2017, 7:29 AM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
112

auto *OFBinOp

118

auto *ExactOp

jbhateja added a comment.EditedSep 11 2017, 7:53 AM

@RKSimon Anything else, or should I check this in as NFC?

RKSimon edited edge metadata.Sep 11 2017, 8:00 AM

@RKSimon Anything else, or should I check this in as NFC?

@spatel needs to review

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6632

remove newline diff

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
666

is this needed in this patch?

spatel edited edge metadata.Sep 11 2017, 8:04 AM

There is a functional difference from this improvement (several FP intrinsics should now correctly propagate flags), but I'm not sure if it's visible without fixing something else in DAGCombiner to recognize the flags.

How does this work if an instruction maps to multiple nodes? For example, the FMA intrinsic can map to 2 nodes?

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
95

Formatting/spacing is non-standard here and below. Run clang-format?

99–100

shorten: if (SDNode *Node = SelDB->getDAGNode(Instr)) {

7966–7969

Don't need this anymore?

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
666

Formatting/spacing is non-standard here and below. Run clang-format?

Adding potential reviewers from D37616.

Adding Adam and Akira as reviewers since they helped review earlier changes to SDNodeFlags.

Here's a potential test case that would show a difference from having FMF on a sqrt intrinsic:

define float @fast_recip_sqrt(float %x) {
  %y = call fast float @llvm.sqrt.f32(float %x)
  %z = fdiv fast float 1.0,  %y
  ret float %z
}
declare float @llvm.sqrt.f32(float) nounwind readonly

...but as I said earlier, we need to fix the DAGCombiner code where this fold is implemented to recognize the flags on the individual nodes. Currently, it just checks the global state:

if (Options.UnsafeFPMath) {

On x86 currently, this will use the full-precision sqrtss+divss, but it should be using rsqrtss followed by mulss/addss to refine the estimate.
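
To make the distinction concrete, here is a minimal, self-contained sketch (illustrative names only, not the real LLVM API) of the difference between gating this fold on the global option versus on per-node flags:

```cpp
#include <cassert>

// Hypothetical simplified model: a combine that rewrites 1/sqrt(x) into an
// rsqrt estimate should consult per-node flags rather than only the global
// UnsafeFPMath option.
struct Flags { bool AllowReciprocal = false; bool ApproxFuncs = false; };
struct Options { bool UnsafeFPMath = false; };

// Global-state check: fires for every fdiv once unsafe-fp-math is enabled.
bool canFormRsqrtGlobal(const Options &O) { return O.UnsafeFPMath; }

// Node-level check: fires only when this particular fdiv (and the sqrt it
// consumes) carry the relevant flags.
bool canFormRsqrtPerNode(const Options &O, const Flags &Div,
                         const Flags &Sqrt) {
  return O.UnsafeFPMath || (Div.AllowReciprocal && Sqrt.ApproxFuncs);
}
```

With the per-node check, a strict sqrt feeding a fast fdiv no longer triggers the fold just because the global option is set.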

jbhateja added a comment.EditedSep 11 2017, 9:15 AM

There is a functional difference from this improvement (several FP intrinsics should now correctly propagate flags), but I'm not sure if it's visible without fixing something else in DAGCombiner to recognize the flags.

How does this work if an instruction maps to multiple nodes? For example, the FMA intrinsic can map to 2 nodes?

This propagation happens during SelectionDAGBuilder::visit. I scanned through the various instructions, and there is a 1:1 mapping between an instruction and the initial SDNode created for it, e.g.:

llvm.fma (intrinsic instruction) -> SDNode(ISD::FMA)
llvm.minnum (intrinsic instruction) -> SDNode(ISD::FMINNUM/FMINNAN)

etc. These initial SDNodes are later lowered/expanded during the following DAG phases.
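
A minimal sketch of that 1:1 propagation (illustrative types, not the real SelectionDAGBuilder API): the initial node built for an instruction simply inherits that instruction's fast-math flags.

```cpp
#include <cassert>

// Toy model of flag propagation at node-creation time.
struct FastMathFlags { bool NoNaNs = false; bool AllowReciprocal = false; };
struct Instr { FastMathFlags FMF; }; // stands in for an IR instruction
struct Node  { FastMathFlags Flags; }; // stands in for the initial SDNode

// 1:1 propagation: the single initial node inherits the instruction's flags.
Node buildInitialNode(const Instr &I) {
  Node N;
  N.Flags = I.FMF;
  return N;
}
```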

Here's a potential test case that would show a difference from having FMF on a sqrt intrinsic:

define float @fast_recip_sqrt(float %x) {
  %y = call fast float @llvm.sqrt.f32(float %x)
  %z = fdiv fast float 1.0,  %y
  ret float %z
}
declare float @llvm.sqrt.f32(float) nounwind readonly

...but as I said earlier, we need to fix the DAGCombiner code where this fold is implemented to recognize the flags on the individual nodes. Currently, it just checks the global state:

if (Options.UnsafeFPMath) {

On x86 currently, this will use the full-precision sqrtss+divss, but it should be using rsqrtss followed by mulss/addss to refine the estimate.

OK, we also have another usage of fast-math flags in review D37616. Can you please file a Bugzilla to track the suggested potential improvement?

How does this work if an instruction maps to multiple nodes? For example, the FMA intrinsic can map to 2 nodes?

This propagation happens during SelectionDAGBuilder::visit. I scanned through the various instructions, and there is a 1:1 mapping between an instruction and the initial SDNode created for it.

case Intrinsic::fmuladd: {
  EVT VT = TLI.getValueType(DAG.getDataLayout(), I.getType());
  if (TM.Options.AllowFPOpFusion != FPOpFusion::Strict &&
      TLI.isFMAFasterThanFMulAndFAdd(VT)) {
    setValue(&I, DAG.getNode(ISD::FMA, sdl,
                             getValue(I.getArgOperand(0)).getValueType(),
                             getValue(I.getArgOperand(0)),
                             getValue(I.getArgOperand(1)),
                             getValue(I.getArgOperand(2))));
  } else {
    // TODO: Intrinsic calls should have fast-math-flags.
    SDValue Mul = DAG.getNode(ISD::FMUL, sdl,
                              getValue(I.getArgOperand(0)).getValueType(),
                              getValue(I.getArgOperand(0)),
                              getValue(I.getArgOperand(1)));
    SDValue Add = DAG.getNode(ISD::FADD, sdl,
                              getValue(I.getArgOperand(0)).getValueType(),
                              Mul,
                              getValue(I.getArgOperand(2)));
    setValue(&I, Add);
  }

OK, we also have another usage of fast-math flags in review D37616. Can you please file a Bugzilla to track the suggested potential improvement?

https://bugs.llvm.org/show_bug.cgi?id=34558

jbhateja updated this revision to Diff 114847.Sep 12 2017, 8:16 AM
  • Review comments resolution + flags propagation over operands.

Ping @reviewers

hfinkel added inline comments.
include/llvm/CodeGen/SelectionDAGNodes.h
357

These flags need comments explaining what they are and how/when they're used.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
664

returns -> Returns

@reviewers, are there any more comments apart from the last ones? This is just to save an iteration. Thanks for your time on the reviews.

jbhateja updated this revision to Diff 115022.Sep 13 2017, 6:18 AM
  • Review comments handling.
hfinkel added inline comments.Sep 13 2017, 8:12 AM
include/llvm/CodeGen/SelectionDAGNodes.h
357

Add a comma after true.

361

Please explain here when these flags are set/unset and why.

jbhateja updated this revision to Diff 115056.Sep 13 2017, 9:16 AM
  • Review comments resolution.

I pointed out 2 more places where I think we can eliminate the existing transfer of flags. I think you should do a complete audit for those.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2762–2780

Delete?

2790–2792

Delete?

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
666

Same question as earlier (I don't think it was answered).

Can we use the existing SelectionDAGBuilder::getValue() to get to the node's flags?

hfinkel added inline comments.Sep 13 2017, 3:57 PM
include/llvm/CodeGen/SelectionDAGNodes.h
364

Okay. Somewhere we need a description of the algorithm. Something like:

When processing an instruction with X kind of flags, we set flag Y on something in order to ensure something. This maintains the invariant that whatever. Then, after doing whatever, we set /unset the flag Z.

jbhateja updated this revision to Diff 115536.EditedSep 16 2017, 6:24 AM
  • Rebase from trunk.
  • More changes to cover review comments.
  • Test usage of fast-math flags on nodes in some places; this fixes PR34558.
  • More places where flags on nodes need to be checked, to be done incrementally.

Ping @reviewers.

Ping @reviewers.

Please see the section under "Sometimes code reviews will take longer than you would hope for" regarding ping time:
https://llvm.org/docs/DeveloperPolicy.html#code-reviews

Also, this patch has grown in functionality since the last rev, but there are still no tests. If you want to demonstrate the effect of propagating the flags, pick just one DAG combine where that happens (ideally, the simplest case) and add tests to show the functional difference.

include/llvm/CodeGen/SelectionDAGNodes.h
912

I don't think this is going to work as you're hoping for. If possible, please split this and any related changes into a separate follow-up patch.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
666

Asking for the 3rd time: is this necessary?

jbhateja updated this revision to Diff 116159.Sep 21 2017, 3:07 AM
  • Review comments resolutions.
jbhateja added inline comments.Sep 21 2017, 3:10 AM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
666

SelectionDAGBuilder::getValue() creates a new value and puts it into the NodeMap if one does not exist, while SelectionDAGBuilder::getDAGNode() checks the NodeMap and returns a DAG node only if it already exists.
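
The difference can be sketched with a toy map (illustrative names; the real NodeMap maps IR values to SDValues):

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy model of the two lookup styles described above: getValue() creates and
// caches an entry on first use, while getDAGNode() is a pure lookup that
// returns null when nothing has been created yet.
struct Builder {
  std::map<std::string, int> NodeMap;

  int *getValue(const std::string &K) { // create-on-miss
    return &NodeMap.try_emplace(K, 0).first->second;
  }

  int *getDAGNode(const std::string &K) { // lookup-only
    auto It = NodeMap.find(K);
    return It == NodeMap.end() ? nullptr : &It->second;
  }
};
```

This is why getDAGNode() is the safer query when you only want to inspect flags and must not force node creation as a side effect.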

I've added some more FMF tests at rL313893 which I think this patch will miscompile. Please rebase/update.

As I suggested before, this patch shouldn't try to enable multiple DAG combines with node-level FMF. It's not as straightforward as you might think.

Pick exactly one combine if you want to show that this patch is working as intended. The llvm.muladd intrinsic test that you have here with a target that supports 'fma' (not plain x86) seems like a good choice to me. If we have a strict op in IR, it should produce an fma instruction. If we have a fast op in IR, it should produce the simpler fmul instruction?


My understanding and code changes are based on the LLVM Language Reference's section about fast-math flags (http://llvm.org/docs/LangRef.html#fast-math-flags), which says, for the nnan flag: "Allow optimizations to assume the arguments and result are not NaN".

Now in the following case, which was added by you:

%y = call float @llvm.sqrt.f32(float %x)
%z = fdiv fast float 1.0, %y
ret float %z

We don't have a fast flag on the intrinsic, but the DAGCombine for the fdiv sees a fast flag, assumes the result (%z) and arguments (the constant and %y) are not NaN, and goes ahead and generates a reciprocal sqrt. If you remove fast from the fdiv and add it to the intrinsic instead, the FMF optimization at the fdiv will not kick in.

Can you please let me know what you expected here?


I expect that the sqrt result is strict, i.e., it should use sqrtss if this is x86-64. We're not allowed to use rsqrtss and lose precision on that op.

That said, my memory of exactly how op-level FMF should work is fuzzy. If anyone else remembers or can link to threads where we've discussed this, please feel free to jump in. :)


Exactly; that is why, in the earlier version of this patch, I added a routine to get unified flags, which intersects a node's flags with the flags of its operands. I think it would be right to include that code in this patch (it was removed from the current version per your comments).
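
A sketch of that unified-flags idea (hypothetical types; the real SDNodeFlags has more bits, and not every flag necessarily combines by simple intersection):

```cpp
#include <cassert>

// Toy model: when a fold consumes both a node and its operand, only flags
// present on BOTH can safely be assumed for the combined operation.
struct FMF { bool NoNaNs = false; bool AllowReciprocal = false; };

FMF intersectFlags(const FMF &A, const FMF &B) {
  FMF R;
  R.NoNaNs = A.NoNaNs && B.NoNaNs;                 // keep only shared guarantees
  R.AllowReciprocal = A.AllowReciprocal && B.AllowReciprocal;
  return R;
}
```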


I expect that the sqrt result is strict. Ie, it should use sqrtss if this is x86-64. We're not allowed to use rsqrtss and lose precision on that op.

I think there's a good argument either way here...

fast does imply nnan, and nnan does propagate backward; fast also implies arcp, but arcp does not propagate backward: arcp applies only to the instruction to which it's attached. In this case, we're allowed to use a reciprocal approximation of the division, but not of the sqrt. However, we could argue that using rsqrt is like doing the sqrt exactly and then just approximating the division. If there are no other users of the sqrt itself, there seems to be little semantic difference.



Exactly; that is why, in the earlier version of this patch, I added a routine to get unified flags, which intersects a node's flags with the flags of its operands. I think it would be right to include that code in this patch (it was removed from the current version per your comments).

I think you're mixing up flag propagation as it applies to creation of nodes with flag combining when folding operations. These are 2 different things. This patch is about propagating from IR to nodes at creation time (at least that's what I think it should be limited to based on the title).

RKSimon resigned from this revision.Sep 21 2017, 12:53 PM

fast does imply nnan, and nnan does propagate backward; fast also implies arcp, but arcp does not propagate backward: arcp applies only to the instruction to which it's attached. In this case, we're allowed to use a reciprocal approximation of the division, but not of the sqrt. However, we could argue that using rsqrt is like doing the sqrt exactly and then just approximating the division. If there are no other users of the sqrt itself, there seems to be little semantic difference.

I thought we leaned to the more conservative (less propagating) model in IR. Not sure if that would be different here in the DAG or if rsqrt is a special case. Either way, it doesn't affect the logic at node creation time?

Here's a sqrt multi-use case to think about:

define float @multiuse_strict_sqrt(float %x, i1 %cmp) {
  %y = call float @llvm.sqrt.f32(float %x)
  %z = fdiv fast float 1.0,  %y
  br i1 %cmp, label %t, label %f

t:
  ret float %z
f:
  %add = fadd float %y, 1.0
  ret float %add
}

Should this be:

## BB#0:
	sqrtss	%xmm0, %xmm0 <--- shared sqrt op because sqrt is treated as strict
	testb	$1, %dil
	je	LBB0_2
## BB#1:                                ## %t
	movss	LCPI0_0(%rip), %xmm1  
	divss	%xmm0, %xmm1
	movaps	%xmm1, %xmm0
	retq
LBB0_2:                                 ## %f
	addss	LCPI0_0(%rip), %xmm0
	retq

Or:

## BB#0:
	testb	$1, %dil
	je	LBB0_2
## BB#1:                                ## %t
	vrsqrtss	%xmm0, %xmm0, %xmm1   <--- fast sqrt is assumed part of fast div
	vmulss	%xmm1, %xmm0, %xmm0
	vmulss	%xmm1, %xmm0, %xmm0
	vaddss	LCPI0_0(%rip), %xmm0, %xmm0
	vmulss	LCPI0_1(%rip), %xmm1, %xmm1
	vmulss	%xmm0, %xmm1, %xmm0
	retq
LBB0_2:                                 ## %f
	vsqrtss	%xmm0, %xmm0, %xmm0  <--- strict sqrt only applies on this path
	vaddss	LCPI0_2(%rip), %xmm0, %xmm0
	retq


I thought we leaned to the more conservative (less propagating) model in IR. Not sure if that would be different here in the DAG or if rsqrt is a special case.

It's a special case in that it's our only combined reciprocal operation.

Either way, it doesn't affect the logic at node creation time?

I don't think that it needs to do so.

Here's a sqrt multi-use case to think about:

define float @multiuse_strict_sqrt(float %x, i1 %cmp) {
  %y = call float @llvm.sqrt.f32(float %x)
  %z = fdiv fast float 1.0,  %y
  br i1 %cmp, label %t, label %f

t:
  ret float %z
f:
  %add = fadd float %y, 1.0
  ret float %add
}

Should this be:

## BB#0:
	sqrtss	%xmm0, %xmm0 <--- shared sqrt op because sqrt is treated as strict
	testb	$1, %dil
	je	LBB0_2
## BB#1:                                ## %t
	movss	LCPI0_0(%rip), %xmm1  
	divss	%xmm0, %xmm1
	movaps	%xmm1, %xmm0
	retq
LBB0_2:                                 ## %f
	addss	LCPI0_0(%rip), %xmm0
	retq

This is one valid option. The divss could also be a rcpss (+mul) if we'd like.

Or:

## BB#0:
	testb	$1, %dil
	je	LBB0_2
## BB#1:                                ## %t
	vrsqrtss	%xmm0, %xmm0, %xmm1   <--- fast sqrt is assumed part of fast div
	vmulss	%xmm1, %xmm0, %xmm0
	vmulss	%xmm1, %xmm0, %xmm0
	vaddss	LCPI0_0(%rip), %xmm0, %xmm0
	vmulss	LCPI0_1(%rip), %xmm1, %xmm1
	vmulss	%xmm0, %xmm1, %xmm0
	retq
LBB0_2:                                 ## %f
	vsqrtss	%xmm0, %xmm0, %xmm0  <--- strict sqrt only applies on this path
	vaddss	LCPI0_2(%rip), %xmm0, %xmm0
	retq

This is another valid option. Either of these seem allowable.

jbhateja updated this revision to Diff 116305.Sep 21 2017, 10:58 PM
  • Updating test case with more than one use of sqrt/mul.
jbhateja added a comment.EditedSep 23 2017, 6:21 PM

@spatel, @reviewers, can this land in trunk now?

@spatel, @reviewers, can this land in trunk now?

I haven't actually looked at the builder changes since you revised them, so I'll defer to @hfinkel if he's already approved that part.

I still don't understand why we should put the combiner changes into this patch. I'd like to see progress on using the flags too, but I think those should be separate patches with tests that cover all of the potential ambiguity that we've raised here.

@spatel, @reviewers, can this land in trunk now?

I haven't actually looked at the builder changes since you revised them, so I'll defer to @hfinkel if he's already approved that part.

I don't think that I approved anything yet. I can take a holistic look at the patch later today.

I still don't understand why we should put the combiner changes into this patch. I'd like to see progress on using the flags too, but I think those should be separate patches with tests that cover all of the potential ambiguity that we've raised here.

@spatel, @reviewers, can this land in trunk now?

I haven't actually looked at the builder changes since you revised them, so I'll defer to @hfinkel if he's already approved that part.

I don't think that I approved anything yet. I can take a holistic look at the patch later today.

@hfinkel, just a reminder for review clearance.

I still don't understand why we should put the combiner changes into this patch. I'd like to see progress on using the flags too, but I think those should be separate patches with tests that cover all of the potential ambiguity that we've raised here.

@reviewers, please let me know if there are any more comments on this patch.

hfinkel added inline comments.Dec 16 2017, 10:23 AM
include/llvm/CodeGen/SelectionDAGNodes.h
393–394

We need a comment here explaining what the Commit parameter does (and how/when it is used).

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
95

shared b/w -> shared by

123

I'm a bit worried about propagating the integer flags automatically this way. Maybe this is fine in practice, but if we're adding some kind of implicit contract here, we should clearly document it. An operation that is exact, or does not overflow in some way, could be implemented in terms of operations that do (and, then, having the flags on those intermediate nodes wouldn't be correct).
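
The concern can be illustrated in plain C++ (not LLVM): a no-wrap guarantee on a whole operation does not transfer to the intermediate operations of its expansion.

```cpp
#include <cassert>
#include <cstdint>

// x * 6 with a "does not wrap" guarantee on the PRODUCT does not mean the
// expansion (x << 3) - (x << 1) is wrap-free: the intermediate x << 3 can
// wrap in 8 bits even when the final result fits. Blindly copying nuw/nsw
// style flags onto such intermediate nodes would therefore be incorrect.
uint8_t mulBy6Expanded(uint8_t X) {
  uint8_t Shl3 = static_cast<uint8_t>(X << 3); // may wrap even if X*6 doesn't
  uint8_t Shl1 = static_cast<uint8_t>(X << 1);
  return static_cast<uint8_t>(Shl3 - Shl1);    // final result is still correct
}
```

For example, with X = 40 the product 240 fits in a uint8_t, but the intermediate 40 << 3 = 320 wraps to 64; the subtraction happens to cancel the wrap, so the result is right while the intermediate flag would be wrong.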

post.kadirselcuk added a subscriber: Restricted Project.Jul 10 2021, 8:43 PM
sanjoy resigned from this revision.Jan 29 2022, 5:33 PM