
Details

Reviewers

spatel
hfinkel
arsenm

Summary

This patch addresses a single SelectionDAG optimization: a few special cases of constant folding in getNode. It extends the guard on those folds for fadd, fsub, fmul, fdiv and frem so that they also fire when isFast is set on the SDNodeFlags passed to getNode.

Diff Detail

Event Timeline

mcberg2017 created this revision. May 16 2018, 11:39 AM

Herald added subscribers: nhaehnle, wdng. May 16 2018, 11:39 AM

This patch is split off of D46562.

What is the test actually testing? I don't think there should be any difference from this in the floor lowering, and I don't really want a separate test file duplicating all of the lowering testing

I can simplify this somewhat. I will make a new test that hits only the folding patterns for all ops of concern; then we will have a way to test with and without unsafe, using the new guard to ensure that the CHECK result is the same for both.

By way of explanation for the floor test: when I simply commented out the unsafe check and used only Flags.isFast() to guard the optimization, the original floor test broke. Matt, would you like me to include the overload for the floor test with the isFast path in the original file, or is a specific functional lit test sufficient?

Added fast math flag capture from visitBinary so that getNode can make decisions in cases where SelectionDAGBuilder::visit has not yet been called. Also updated the test for just these cases and dropped the floor test.

missed a file...

mcberg2017 mentioned this in D46973: Extending undef support for float arithmetic to isFast IR flags. May 16 2018, 3:05 PM

mcberg2017 mentioned this in D46562: Utilize new SDNode flag functionality to expand current support model. May 16 2018, 3:27 PM

Any issues remaining here?

spatel added a subscriber: andreadb. May 17 2018, 6:23 AM

spatel added inline comments.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2772–2775 ↗	(On Diff #147183)	I don't think we want to repeat this here (otherwise, we need to set the flags like this for all FPMO). This reminds me of a comment that @andreadb made (probably >3 years ago; I'm not finding it now) regarding DAGBuilder/DAGCombiner. IIRC, because of the way the DAG operates, we may need to repeat folds in the builder to ensure consistency. So I think the right fix is to duplicate the simplifications, so the unnecessary node is not created in the first place.
test/CodeGen/X86/fp-fold.ll
5–6 ↗	(On Diff #147183)	This isn't quite true. In all cases, we do not need full 'fast' to enable these transforms. All of the transforms should show the minimal FMF requirement in instsimplify at this point, so that's what we should emulate here. Eg, x * 1.0 = x is true without any FMF.
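To make the "x * 1.0 = x is true without any FMF" point concrete, here is a minimal C++ sketch (not LLVM code). The helper names bitsOf, mulByOneIsIdentity and demoMulByOne are hypothetical and exist only for this illustration. It assumes the host uses IEEE-754 doubles with the default rounding mode, and compares bit patterns so that +0.0 and -0.0 are told apart.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Return the raw bit pattern of a double so that +0.0 and -0.0 compare unequal.
static std::uint64_t bitsOf(double V) {
  std::uint64_t Bits;
  std::memcpy(&Bits, &V, sizeof(Bits));
  return Bits;
}

// True if X * 1.0 has exactly the same bit pattern as X.
// 'volatile' keeps the host compiler from folding the multiply itself.
bool mulByOneIsIdentity(double X) {
  volatile double One = 1.0;
  volatile double In = X;
  double Product = In * One;
  return bitsOf(Product) == bitsOf(X);
}

// Usage: the identity holds for ordinary values, both zeros and infinity,
// so a fold of 'x * 1.0 --> x' needs no fast-math flags for these inputs.
void demoMulByOne() {
  assert(mulByOneIsIdentity(3.5));
  assert(mulByOneIsIdentity(0.0));
  assert(mulByOneIsIdentity(-0.0));
  assert(mulByOneIsIdentity(std::numeric_limits<double>::infinity()));
}
```

NaN inputs are left out of the bit-exact check on purpose: the product is still a NaN, but its exact bit pattern is not something this sketch tries to pin down.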

spatel added inline comments. May 17 2018, 6:27 AM

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
4440–4462	This code structure really doesn't make sense as-is: we are putting independent cases together even though they share no common code.

Cleaned up constant folding to follow flag rules used in other locations to guard optimization and reordered code for context. The tests also reflect the min flag usage and duality with unsafe.

spatel added inline comments. May 17 2018, 2:58 PM

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
4442–4444	This doesn't match instsimplify:
// fadd X, 0 ==> X, when we know X is not -0
// fsub X, +0 ==> X (no flags needed)
But make sure (add tests) for -0.0 too, so we get that right:
// fadd X, -0 ==> X (no flags needed)
// fsub X, -0 ==> X, when we know X is not -0
4443–4445	Again, these are not the correct predicates:
// fmul nnan nsz X, 0 ==> 0
test/CodeGen/X86/fp-fold.ll
26–35 ↗	(On Diff #147380)	Make sure I understand this - we need to change the code in DAGCombiner to fold this? Why is this case different than the others? It's probably better if you check in the baseline tests first, so we're only getting the functional diffs from this patch.
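The predicate quoted above, fmul nnan nsz X, 0 ==> 0, can be motivated with a small host-side C++ sketch (not LLVM code; mulByZero, isPositiveZero and demoMulByZero are hypothetical names used only here). Assuming IEEE-754 doubles, multiplying by +0.0 yields NaN for NaN or infinite inputs, which is what nnan rules out, and yields -0.0 for negative inputs, which is what nsz allows the fold to ignore.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Multiply by +0.0 at run time ('volatile' blocks compile-time folding).
double mulByZero(double X) {
  volatile double Zero = 0.0;
  volatile double In = X;
  return In * Zero;
}

// True if V is exactly +0.0, i.e. the value a fold 'x*0 --> 0' would return.
bool isPositiveZero(double V) {
  return V == 0.0 && !std::signbit(V);
}

// Usage: only the first case matches the unguarded fold.
void demoMulByZero() {
  const double Inf = std::numeric_limits<double>::infinity();
  const double NaN = std::numeric_limits<double>::quiet_NaN();
  assert(isPositiveZero(mulByZero(2.0)));    // fold is exact
  assert(!isPositiveZero(mulByZero(-2.0)));  // result is -0.0: needs nsz
  assert(std::isnan(mulByZero(NaN)));        // result is NaN: needs nnan
  assert(std::isnan(mulByZero(Inf)));        // inf * 0 is NaN: needs nnan
}
```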

These changes reflect what is in Transforms/InstSimplify/fast-math.ll
I have a baseline version of this test ready to roll, will check that in first.

As for the fmul_zero case, the getNode call comes from visitBinary, which cannot provide flags: they are created and passed in without any representation from the current Instruction (which does have the flags). I believe we need to add new coverage in the DAGCombiner for the case above so that we do not get into this scenario, as the STRICT case is unoptimized. For now the test marks the shortfall.

spatel added inline comments. May 18 2018, 10:25 AM

test/CodeGen/X86/fp-fold.ll
32–36 ↗	(On Diff #147434)	This is wrong when x is -0.0: (-0.0) - (-0.0) --> (-0.0) + 0.0 --> 0.0
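The counterexample above can be reproduced on the host with a short C++ sketch (not LLVM code; subNegZero and demoSubNegZero are hypothetical names). It assumes IEEE-754 doubles and the default round-to-nearest mode, under which (-0.0) - (-0.0) evaluates to +0.0, so replacing x - (-0.0) with x changes the sign of the result when x is -0.0.

```cpp
#include <cassert>
#include <cmath>

// Compute X - (-0.0) at run time ('volatile' blocks compile-time folding).
double subNegZero(double X) {
  volatile double NegZero = -0.0;
  volatile double In = X;
  return In - NegZero;
}

// Usage: the fold 'x - (-0.0) --> x' is fine for nonzero x and for +0.0,
// but it flips the sign bit of the result when x is -0.0.
void demoSubNegZero() {
  assert(subNegZero(1.5) == 1.5);
  assert(!std::signbit(subNegZero(0.0)));
  assert(!std::signbit(subNegZero(-0.0)));  // result is +0.0, input was -0.0
}
```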

Something to keep in mind: because D46562 was split, with part of its code landing here, we are missing the flag-based visit functionality where unsafe behavior differs, so this patch will have some holes. That is why fmul_zero diverges and, when you see the next upload, why fsub_zero_2 will also diverge in the tests. However, D46562 will attempt to deal with most of the remaining issues via sub-flag usage in the visit functions.

Special cases for fsub: we will need mirrored functionality in the fsub visit function, via flags, to pick up the matching optimizations done under unsafe. The same is true for other operations.

mcberg2017 added inline comments. May 18 2018, 12:25 PM

test/CodeGen/X86/fp-fold.ll
32–36 ↗	(On Diff #147434)	A related question: unsafe does do that optimization. Is it doing so incorrectly?

spatel added inline comments. May 18 2018, 1:58 PM

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
4441–4445	Still not quite right. Add and sub should have similarly structured code: for 1 version of 0.0, we don't need 'nsz' and for the other version of 0.0, we do need 'nsz'.
test/CodeGen/X86/fp-fold.ll
32–36 ↗	(On Diff #147434)	That goes back to how we interpret "-enable-unsafe-fp-math". My reading of the text:
cl::desc("Enable optimizations that may decrease FP precision")
says that yes, we have a bug (ignoring signed-zero is not covered by that parameter). But as we noted in one of the earlier reviews, clang only sets this when you pile together a bunch of independent FP loosening:
if (!MathErrno && AssociativeMath && ReciprocalMath && !SignedZeros && !TrappingMath)
  CmdArgs.push_back("-menable-unsafe-fp-math");
...so the bug is probably not easily visible for anyone besides compiler hackers?

Updated after rL332756 for just the test changes. Are we ready to roll then?

In D46968#1105074, @mcberg2017 wrote:

Updated after rL332756 for just the test changes. Are we ready to roll then?

There are 4 cases each for fadd/fsub:

0.0
-0.0
nsz 0.0
nsz -0.0

I added those tests in rL332780. Can you rebase with that?
But now I'm not sure why we're bothering with these folds in getNode(). We have to repeat them in DAGCombiner anyway, right? Isn't this the equivalent of doing simplifications in IRBuilder? Are we doing this because it's a big perf win (seems unlikely)?

Rebased a test in fp-fold.ll.

I am hoping to remove part of these changes once the DAGCombiner work is in place.

I know we'll probably just kill this whole chunk of code in a subsequent patch, but the logic should be complete/correct. I think we want this for fadd/fsub:

case ISD::FADD:
  // fadd X, -0.0 ==> X (no flags needed)
  // fadd X, 0.0 ==> X, when we know X is not -0
  // FIXME: Unsafe math doesn't imply no-signed-zeros.
  if (N2CFP && N2CFP->isZero())
    if (N2CFP->isNegative() || getTarget().Options.UnsafeFPMath ||
        Flags.hasNoSignedZeros())
      return N1;
  break;
case ISD::FSUB:
  // fsub X, 0.0 ==> X (no flags needed)
  // fsub X, -0.0 ==> X, when we know X is not -0
  // FIXME: Unsafe math doesn't imply no-signed-zeros.
  if (N2CFP && N2CFP->isZero())
    if (!N2CFP->isNegative() || getTarget().Options.UnsafeFPMath ||
        Flags.hasNoSignedZeros())
      return N1;
  break;
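As a sanity check of the four fadd/fsub cases, the suggested predicates can be modeled outside LLVM and compared against real arithmetic. The sketch below is a stand-alone C++ approximation, not the SelectionDAG code: Op, canFoldToLHS, evaluate and foldIsExact are hypothetical names, the NoSignedZeros parameter stands in for Flags.hasNoSignedZeros(), and the UnsafeFPMath term is deliberately left out because the FIXME says it should not imply no-signed-zeros. IEEE-754 doubles and round-to-nearest are assumed.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

enum class Op { FAdd, FSub };

// Model of the suggested guard: may 'X op C' be replaced by X?
bool canFoldToLHS(Op Opcode, double C, bool NoSignedZeros) {
  if (C != 0.0)  // only +0.0 and -0.0 qualify (NaN also fails this test)
    return false;
  bool IsNegZero = std::signbit(C);
  // fadd is safe without flags for -0.0; fsub is safe without flags for +0.0.
  bool SafeWithoutFlags = (Opcode == Op::FAdd) ? IsNegZero : !IsNegZero;
  return SafeWithoutFlags || NoSignedZeros;
}

// Raw bit pattern, so that +0.0 and -0.0 are distinguished.
static std::uint64_t bitsOf(double V) {
  std::uint64_t Bits;
  std::memcpy(&Bits, &V, sizeof(Bits));
  return Bits;
}

// Evaluate the operation at run time ('volatile' blocks host-side folding).
double evaluate(Op Opcode, double X, double C) {
  volatile double L = X, R = C;
  return Opcode == Op::FAdd ? L + R : L - R;
}

// True if replacing 'X op C' with X is bit-exact for this particular X.
bool foldIsExact(Op Opcode, double X, double C) {
  return bitsOf(evaluate(Opcode, X, C)) == bitsOf(X);
}
```

With X = -0.0 as the probe value, foldIsExact agrees with canFoldToLHS(..., false) in all four combinations of opcode and zero sign, which is the symmetry the review asks for: one sign of zero needs no flag, the other needs nsz.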

arsenm added inline comments. May 23 2018, 4:37 AM

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
4440	Not particularly a comment on this patch, but I was wondering what's the reasoning for doing this in getNode? I've never liked how getNode attempts to optimize nodes on creation, especially since a lot of node visitors in DAGCombiner attempt the same folds (and kind of have to since a ReplaceNodeResults may trigger the same situation). Have you found an explanation?

spatel added inline comments. May 23 2018, 6:13 AM

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
4440	I hinted at the only explanation I could think of: it's being done as a compile-time optimization. I doubt that's a meaningful win, so I wouldn't mind skipping this step entirely - just fix the folds in DAGCombiner and ignore this.

Given the last round of comments, I am going to spend some time on the DAG combiner work in D46562 to see if I can remove the need for these changes. More later...

mcberg2017 abandoned this revision. Jun 14 2018, 12:33 PM

Diff 147145

lib/CodeGen/SelectionDAG/SelectionDAG.cpp


SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, EVT VT,
   case ISD::SMIN:
   case ISD::SMAX:
   case ISD::UMIN:
   case ISD::UMAX:
     assert(VT.isInteger() && "This operator does not apply to FP types!");
     assert(N1.getValueType() == N2.getValueType() &&
            N1.getValueType() == VT && "Binary operator types must match!");
     break;
   case ISD::FADD:
   case ISD::FSUB:
   case ISD::FMUL:
   case ISD::FDIV:
   case ISD::FREM:
-    if (getTarget().Options.UnsafeFPMath) {
+    if (getTarget().Options.UnsafeFPMath || Flags.isFast()) {
       if (Opcode == ISD::FADD) {
         // x+0 --> x
         if (N2CFP && N2CFP->getValueAPF().isZero())
           return N1;
       } else if (Opcode == ISD::FSUB) {
         // x-0 --> x
         if (N2CFP && N2CFP->getValueAPF().isZero())
           return N1;
       } else if (Opcode == ISD::FMUL) {
         // x*0 --> 0
         if (N2CFP && N2CFP->isZero())
           return N2;
         // x*1 --> x
         if (N2CFP && N2CFP->isExactlyValue(1.0))
           return N1;
       }
     }
     assert(VT.isFloatingPoint() && "This operator only applies to FP types!");
     assert(N1.getValueType() == N2.getValueType() &&
            N1.getValueType() == VT && "Binary operator types must match!");
     break;
   case ISD::FCOPYSIGN: // N1 and result must match. N1/N2 need not match.
     assert(N1.getValueType() == VT &&
            N1.getValueType().isFloatingPoint() &&
            N2.getValueType().isFloatingPoint() &&

test/CodeGen/AMDGPU/ffloor.f64_fmf.ll

				; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s
				; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs < %s | FileCheck -check-prefix=CI -check-prefix=FUNC %s
				; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=CI -check-prefix=FUNC %s

				declare double @llvm.fabs.f64(double %Val)
				declare double @llvm.floor.f64(double) nounwind readnone
				declare <2 x double> @llvm.floor.v2f64(<2 x double>) nounwind readnone
				declare <3 x double> @llvm.floor.v3f64(<3 x double>) nounwind readnone
				declare <4 x double> @llvm.floor.v4f64(<4 x double>) nounwind readnone
				declare <8 x double> @llvm.floor.v8f64(<8 x double>) nounwind readnone
				declare <16 x double> @llvm.floor.v16f64(<16 x double>) nounwind readnone

				; FUNC-LABEL: {{^}}ffloor_f64:
				; CI: v_floor_f64_e32
				; SI: v_fract_f64_e32
				; SI-DAG: v_min_f64
				; SI-DAG: v_cmp_class_f64_e64 vcc
				; SI: v_cndmask_b32_e32
				; SI: v_cndmask_b32_e32
				; SI: v_add_f64
				; SI: s_endpgm
				define amdgpu_kernel void @ffloor_f64(double addrspace(1)* %out, double %x) {
				%y = call fast double @llvm.floor.f64(double %x) nounwind readnone
				store double %y, double addrspace(1)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}ffloor_f64_neg:
				; CI: v_floor_f64_e64
				; SI: v_fract_f64_e64 {{v[[0-9]+:[0-9]+]}}, -[[INPUT:s[[0-9]+:[0-9]+]]]
				; SI-DAG: v_min_f64
				; SI-DAG: v_cmp_class_f64_e64 vcc
				; SI: v_cndmask_b32_e32
				; SI: v_cndmask_b32_e32
				; SI: v_add_f64 {{v[[0-9]+:[0-9]+]}}, -[[INPUT]]
				; SI: s_endpgm
				define amdgpu_kernel void @ffloor_f64_neg(double addrspace(1)* %out, double %x) {
				%neg = fsub fast double 0.0, %x
				%y = call fast double @llvm.floor.f64(double %neg) nounwind readnone
				store double %y, double addrspace(1)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}ffloor_f64_neg_abs:
				; CI: v_floor_f64_e64
				; SI: v_fract_f64_e64 {{v[[0-9]+:[0-9]+]}}, -|[[INPUT:s[[0-9]+:[0-9]+]]]|
				; SI-DAG: v_min_f64
				; SI-DAG: v_cmp_class_f64_e64 vcc
				; SI: v_cndmask_b32_e32
				; SI: v_cndmask_b32_e32
				; SI: v_add_f64 {{v[[0-9]+:[0-9]+]}}, -|[[INPUT]]|
				; SI: s_endpgm
				define amdgpu_kernel void @ffloor_f64_neg_abs(double addrspace(1)* %out, double %x) {
				%abs = call fast double @llvm.fabs.f64(double %x)
				%neg = fsub fast double 0.0, %abs
				%y = call double @llvm.floor.f64(double %neg) nounwind readnone
				store double %y, double addrspace(1)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}ffloor_v2f64:
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				define amdgpu_kernel void @ffloor_v2f64(<2 x double> addrspace(1)* %out, <2 x double> %x) {
				%y = call fast <2 x double> @llvm.floor.v2f64(<2 x double> %x) nounwind readnone
				store <2 x double> %y, <2 x double> addrspace(1)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}ffloor_v3f64:
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI-NOT: v_floor_f64_e32
				define amdgpu_kernel void @ffloor_v3f64(<3 x double> addrspace(1)* %out, <3 x double> %x) {
				%y = call fast <3 x double> @llvm.floor.v3f64(<3 x double> %x) nounwind readnone
				store <3 x double> %y, <3 x double> addrspace(1)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}ffloor_v4f64:
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				define amdgpu_kernel void @ffloor_v4f64(<4 x double> addrspace(1)* %out, <4 x double> %x) {
				%y = call fast <4 x double> @llvm.floor.v4f64(<4 x double> %x) nounwind readnone
				store <4 x double> %y, <4 x double> addrspace(1)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}ffloor_v8f64:
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				define amdgpu_kernel void @ffloor_v8f64(<8 x double> addrspace(1)* %out, <8 x double> %x) {
				%y = call fast <8 x double> @llvm.floor.v8f64(<8 x double> %x) nounwind readnone
				store <8 x double> %y, <8 x double> addrspace(1)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}ffloor_v16f64:
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				; CI: v_floor_f64_e32
				define amdgpu_kernel void @ffloor_v16f64(<16 x double> addrspace(1)* %out, <16 x double> %x) {
				%y = call fast <16 x double> @llvm.floor.v16f64(<16 x double> %x) nounwind readnone
				store <16 x double> %y, <16 x double> addrspace(1)* %out
				ret void
				}

This is an archive of the discontinued LLVM Phabricator instance.

Extending constant folding for float arithmetic to isFast IR flags
AbandonedPublic
