
[X86] PR34149 Suboptimal codegen for fast minnum and maxnum.
Needs ReviewPublic

Authored by jbhateja on Sep 8 2017, 2:37 AM.

Details

Summary

Perform DAG combining of FMINNUM/FMAXNUM based on fast-math flags, which are propagated from the IR Instruction to the SDNode.

Event Timeline

jbhateja created this revision.Sep 8 2017, 2:37 AM
jbhateja added a subscriber: llvm-commits.
spatel edited edge metadata.Sep 8 2017, 3:14 PM

The way you're transferring the flags from the IR to the node works for this particular case, but I'd prefer that we clean that up as a preliminary step for this patch if possible. The existing flags transfer code is limited because we started with SDNodeFlags only on binary operators. So there's a different blob of code to handle flags in SelectionDAGBuilder::visitBinary(). Can we unify that? Is it possible to handle this for all opcodes/nodes in SelectionDAGBuilder::visit()?

spatel added a comment.Sep 8 2017, 3:40 PM

I might be missing some context here. If we have fast/nnan on these calls, then can't we simplify this in IR to fcmp+select and not have to deal with this in the backend? The intrinsics only exist to make sure that NaN behavior in IR meets the higher level standards, so if we have nnan, then we don't need the intrinsic?
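To make the NaN point concrete, here is a minimal Python model (helper names are hypothetical, not LLVM APIs) of the two semantics under discussion: llvm.minnum follows libm fmin and ignores a lone NaN operand, while a plain ordered compare plus select can let the NaN through. Under nnan, NaN inputs are assumed absent, so the two forms coincide and the fcmp+select form becomes safe.

```python
import math

# Hypothetical helpers modelling the two semantics; not LLVM code.

def minnum(x, y):
    # llvm.minnum / libm fmin semantics: if exactly one operand is
    # NaN, the other (numeric) operand is returned.
    if math.isnan(x):
        return y
    if math.isnan(y):
        return x
    return min(x, y)

def cmp_select_olt(x, y):
    # Plain 'fcmp olt' + select: (x < y) ? x : y. An ordered compare
    # involving NaN is false, so a NaN in y leaks through the select.
    return x if x < y else y

nan = float("nan")
print(minnum(2.0, nan))          # 2.0 -- the NaN operand is ignored
print(cmp_select_olt(2.0, nan))  # nan -- compare is false, select picks y
print(minnum(1.0, 2.0), cmp_select_olt(1.0, 2.0))  # 1.0 1.0 -- agree off NaN
```

On non-NaN inputs the two functions always agree, which is exactly why nnan licenses the IR-level rewrite spatel is proposing.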

jbhateja updated this revision to Diff 114491.Sep 9 2017, 9:54 AM
  • Consolidating Instruction->SDNode Flags propagation in one class.

> I might be missing some context here. If we have fast/nnan on these calls, then can't we simplify this in IR to fcmp+select and not have to deal with this in the backend? The intrinsics only exist to make sure that NaN behavior in IR meets the higher level standards, so if we have nnan, then we don't need the intrinsic?

Intrinsic functions defer code generation/expansion to the backend; this gives the backend control over generating efficient code for the specific target. As of now, SelectionDAGBuilder boils an intrinsic call down to fminnum (or fminnan if the target supports it and nnan is true), which is then handled differently by different targets.


> Intrinsic functions defer code generation/expansion to the backend; this gives the backend control over generating efficient code for the specific target.

It's incorrect that intrinsics are passed unaltered to the backend for expansion/optimization. See the optimizations for both generic and target-specific intrinsics in InstCombiner::visitCallInst().

Again, I may be missing some context - who created this IR? Creating a 'call fast llvm.maxnum()' just doesn't make sense to me, so if we can fix that in IR, we should do that. The intrinsic inhibits the large number of potential optimizations for fcmp+select that we have in IR. No target should benefit from having extra NaN semantics requirements provided by the intrinsic that are then overridden by FMF.

Please split the FlagsAcquirer diff into a separate patch.


Your point that 'call fast @llvm.minnum' should not have been generated in the first place, and that fcmp+select should have been generated instead, is valid. But it is still syntactically and semantically valid IR that could be thrown at the backend.

Please consider the following case, which is being compiled for ARM (-mcpu=cortex-r52 -march=arm test.ll -mattr=fp-armv8):

define <4 x double> @CASEA(<4 x double> %x, <4 x double> %y) {
  %z = call fast <4 x double> @llvm.minnum.v4f64(<4 x double> %x, <4 x double> %y) readnone
  ret <4 x double> %z
}

define <4 x double> @CASEB(<4 x double> %x, <4 x double> %y) {
  %c = fcmp ule <4 x double> %x, %y
  %z = select <4 x i1> %c, <4 x double> %x, <4 x double> %y
  ret <4 x double> %z
}

The instruction selector does not generate vminnm for CASEB, whereas it does for CASEA. As of now, SelectionDAGBuilder generates an fminnum (or fminnan) SDNode for the llvm.minnum intrinsic, and different targets then lower fminnum differently.

So handling fast-math with the intrinsic here should be fine?

Otherwise, I can put the Acquirer in a separate patch.
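The CASEA/CASEB codegen gap has a semantic cause, sketched below in a minimal Python model (helper names are hypothetical, not LLVM APIs): 'fcmp ule' is an unordered-or-less-or-equal compare, so the select prefers %x whenever an operand is NaN, while vminnm implements IEEE-754 minNum and never prefers the NaN. Without nnan/'fast' flags, the instruction selector therefore cannot legally match CASEB to vminnm.

```python
import math

# Hypothetical helpers modelling the semantics; not LLVM's implementation.

def fcmp_ule(x, y):
    # LLVM 'fcmp ule': true if unordered (either operand is NaN)
    # or x <= y.
    return math.isnan(x) or math.isnan(y) or x <= y

def caseb_select(x, y):
    # The CASEB pattern: select(fcmp ule x, y), x, y.
    return x if fcmp_ule(x, y) else y

def vminnm(x, y):
    # ARM vminnm implements IEEE-754 minNum: a single NaN operand
    # loses to the numeric operand.
    if math.isnan(x):
        return y
    if math.isnan(y):
        return x
    return min(x, y)

nan = float("nan")
print(caseb_select(nan, 2.0))  # nan -- ule is true when unordered, so %x wins
print(vminnm(nan, 2.0))        # 2.0 -- minNum never prefers the NaN
print(caseb_select(1.0, 2.0), vminnm(1.0, 2.0))  # 1.0 1.0 -- agree off NaN
```

The two only diverge on NaN inputs, which is why a 'fast' (or nnan) flag on the fcmp is exactly what would let the backend use vminnm for CASEB.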


> Your point that 'call fast @llvm.minnum' should not have been generated in the first place, and that fcmp+select should have been generated instead, is valid. But it is still syntactically and semantically valid IR that could be thrown at the backend.

I don't see the point of this argument. There are an infinite number of valid IR patterns that we can send to the backend, but the main point of the IR optimizer is to limit that set, so we don't have to increase the complexity of the backend. That's what I'm requesting here: solve this in IR, so the backend never has to worry about it. I'll ask again: who is producing this IR or these nodes? If I'm missing some scenario in which this pattern can be created in the backend, then I agree that we will have to handle it there (but still not the target-specific way you've proposed). If not, let's not add unnecessary code.


I acknowledge there may be other bugs here, but I think the ARM backend is behaving correctly for the example as shown (cc'ing @efriedma for expertise).

If we want to create equivalent patterns for these two examples, then we must add 'fast' to the fcmp in the second case:

%c = fcmp fast ule <4 x double> %x, %y

If we do that, we see that the ARM backend is still behaving correctly and optimally to produce vminnm:

$ ./llc -o - minnum.ll -mcpu=cortex-r52 -march=arm -mattr=fp-armv8 | egrep 'CASE|minnm'
_CASEA:
vminnm.f64 d25, d22, d17
vminnm.f64 d24, d23, d16
vminnm.f64 d17, d21, d19
vminnm.f64 d16, d20, d18
_CASEB:
vminnm.f64 d25, d22, d17
vminnm.f64 d24, d23, d16
vminnm.f64 d17, d21, d19
vminnm.f64 d16, d20, d18

asbirlea edited edge metadata.Sep 12 2017, 2:39 PM

FWIW, we've seen a similar case of min/maxnum being generated in the IR and leading to suboptimal codegen.
The solution was to replace the generated IR with fcmp+select (coupled with D27846). That seems like the right approach to avoid adding complexity to the backend.

RKSimon resigned from this revision.Sep 29 2017, 3:18 AM