Perform DAG combining of FMIN/MAXNUM based on Fast-Math flags which are propagated
from Instruction to SDNode.
Details
- Reviewers
- spatel - craig.topper - sanjoy - RKSimon - asbirlea 
Diff Detail
- Build Status
- Buildable 10060 - Build 10060: arc lint + arc unit 
Event Timeline
The way you're transferring the flags from the IR to the node works for this particular case, but I'd prefer that we clean that up as a preliminary step for this patch if possible. The existing flags transfer code is limited because we started with SDNodeFlags only on binary operators. So there's a different blob of code to handle flags in SelectionDAGBuilder::visitBinary(). Can we unify that? Is it possible to handle this for all opcodes/nodes in SelectionDAGBuilder::visit()?
I might be missing some context here. If we have fast/nnan on these calls, then can't we simplify this in IR to fmp+select and not have to deal with this in the backend? The intrinsics only exist to make sure that NaN behavior in IR meets the higher level standards, so if we have nnan, then we don't need the intrinsic?
Intrinsic function defer code geneation/expansion to backend this give backend control over geneating efficient code as per specific target. As of now SelectionDAGBuilder boils down intrinsic call to fminnum (or fminnan if target supports it and nnan is true) which is then handled by different targets differently.
It's incorrect that intrinsics are passed unaltered to the backend for expansion/optimization. See the optimizations for both generic and target-specific intrinsics in InstCombiner::visitCallInst().
Again, I may be missing some context - who created this IR? Creating a 'call fast llvm.maxnum()' just doesn't make sense to me, so if we can fix that in IR, we should do that. The intrinsic inhibits the large number of potential optimizations for fcmp+select that we have in IR. No target should benefit from having extra NaN semantics requirements provided by the intrinsic that are then overridden by FMF.
Please split the FlagsAcquirer diff into a separate patch.
Your point of not genrating call fast @llvm.minnum in the first place and instead fcmp+setcc should have been generated is valid. But, its still a valid IR syntactically and semantically which could be thrown at backend.
Please consider following case which is being compiled for arm( -mcpu=cortex-r52 -march=arm test.ll -mattr=fp-armv8)
define <4 x double> @CASEA(<4 x double> %x, <4 x double> %y) {
%z = call fast <4 x double> @llvm.minnum.v4f64(<4 x double> %x, <4 x double> %y) readnone ret <4 x double> %z
}
define <4 x double> @CASEB(<4 x double> %x, <4 x double> %y) {
%c = fcmp ule <4 x double> %x, %y %z = select <4 x i1> %c, <4 x double> %x, <4 x double> %y ret <4 x double> %z
}
Instruction selector does not generates vminnm for CASEB where as same is generated for CASEA. 
As of now SelectionDAGBuilder generates fminnum (or fminnnan) SDNode for llvm.minnum intrinsic. 
Thus different targets lowers fminnum (SDNode) differently.
So handling for fast math with intrinsic here should be fine ?
Otherwise, I can put Acquirer in a seperate patch.
I don't see the point of this argument. There are an infinite number of valid IR patterns that we can send to the backend, but the main point of the IR optimizer is to limit that set, so we don't have to increase the complexity of the backend. That's what I'm requesting here: solve this in IR, so the backend never has to worry about it. I'll ask again: who is producing this IR or these nodes? If I'm missing some scenario in which this pattern can be created in the backend, then I agree that we will have to handle it there (but still not the target-specific way you've proposed). If not, let's not add unnecessary code.
Please consider following case which is being compiled for arm( -mcpu=cortex-r52 -march=arm test.ll -mattr=fp-armv8)
define <4 x double> @CASEA(<4 x double> %x, <4 x double> %y) {
%z = call fast <4 x double> @llvm.minnum.v4f64(<4 x double> %x, <4 x double> %y) readnone ret <4 x double> %z}
define <4 x double> @CASEB(<4 x double> %x, <4 x double> %y) {
%c = fcmp ule <4 x double> %x, %y %z = select <4 x i1> %c, <4 x double> %x, <4 x double> %y ret <4 x double> %z}
Instruction selector does not generates vminnm for CASEB where as same is generated for CASEA.
As of now SelectionDAGBuilder generates fminnum (or fminnnan) SDNode for llvm.minnum intrinsic.
Thus different targets lowers fminnum (SDNode) differently.
I acknowledge there may be other bugs here, but I think the ARM backend is behaving correctly for the example as shown (cc'ing @efriedma for expertise).
If we want to create equivalent patterns for these 2 examples, then we must add 'fast' to the fcmp in the 2nd case:
%c = fcmp fast ule <4 x double> %x, %y
If we do that, we see that the ARM backend is still behaving correctly and optimally to produce vminnm:
$ ./llc -o - minnum.ll -mcpu=cortex-r52 -march=arm  -mattr=fp-armv8 | egrep 'CASE|minnm'
_CASEA:
	vminnm.f64	d25, d22, d17
	vminnm.f64	d24, d23, d16
	vminnm.f64	d17, d21, d19
	vminnm.f64	d16, d20, d18
_CASEB:
	vminnm.f64	d25, d22, d17
	vminnm.f64	d24, d23, d16
	vminnm.f64	d17, d21, d19
	vminnm.f64	d16, d20, d18
FWIW, we've seen a similar case of min/maxnum being generated in the IR and leading to suboptimal codegen. 
The solution was replacing the IR generated with fcmp+select (coupled with D27846). Seems like the right approach to avoid adding complexity to the backend.