Page MenuHomePhabricator

[WIP][FPEnv] Don't transform FSUB(-0.0,X)->FNEG(X) when flushing denormals
AbandonedPublic

Authored by cameron.mcinally on Feb 4 2020, 10:07 AM.

Details

Summary

When in a mode that flushes denormals, we don't want to transform FSUB(-0.0,X) -> FNEG(X). The former is an arith operation that will flush a denormal input to 0. The latter is a bitwise operation that will only flip the sign bit.

Marked as [WIP] since the logic is a little weird. Hoping @arsenm and others can offer some guidance...

  1. Notice that we still perform the transformation when in DenormalMode::IEEE. This is counter-intuitive. IEEE-754 is what specifies that these operations are distinct, but only in regards to side-effects, not denormal flushing. LLVM optimizations do not preserve side-effects, and both operation results will be bitwise identical when we're not flushing denormals, so I think this is the correct thing to do.

Although, there's also the problem of this transform changing the sign of a NaN in DenormalMode::IEEE. Do we want to take that into consideration? E.g. an FSUB(-0.0, NaN) should produce a canonical NaN with the same payload, while FNEG(NaN) produces -NaN. If I'm not mistaken, IEEE-754 doesn't specify the sign of a NaN result, besides being a canonical NaN.

  1. Also notice that we still perform the transformation when in DenormalMode::Invalid. I believe that Invalid is actually a flush to zero mode. However, I think it makes sense to leave the default mode unchanged wrt disabling this transform. There could be a very small (and hard to measure) performance penalty for using a proper FSUB on some targets.

Thoughts about any of this?

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptFeb 4 2020, 10:07 AM
arsenm added inline comments.Feb 4 2020, 10:17 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12358–12360

This will need updating for the splitting the input and output patch I just committed

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2997–3015

Why does SelectionDAGBuilder bother doing this fold at all? It should just directly translate the fsub?

arsenm added inline comments.Feb 4 2020, 10:19 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2997–3015

I think this should be just ripped out in a separate patch. The same problem seems to have been copied to IRTranslator, which should also be removed

The DenormalMode::Invalid is a temporary state and should not really be a concern. It should be invalid and never seen after D69989

cameron.mcinally marked 2 inline comments as done.Feb 4 2020, 11:14 AM
cameron.mcinally added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12358–12360

Ok, thanks. Will wait for the builds to go green and then update.

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2997–3015

@sanjay, what do you think? Seems reasonable to me.

I think it made sense to do this when FNEG(X) was the canonical form of FSUB(-0.0, X). Wouldn't want two forms floating around for even a small amount of time. But now that there are cases where the operations are distinct through llc, it seems ok to wait until DAGCombine.

spatel added inline comments.Feb 4 2020, 12:31 PM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2997–3015

Yes, now that we can use DenormMode to distinguish target behavior, it seems better to do it later in DAGCombiner if that would make sense for the target.

cameron.mcinally planned changes to this revision.Feb 5 2020, 7:21 AM
cameron.mcinally marked an inline comment as done.
cameron.mcinally added inline comments.
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2997–3015

I'll have to put a pin in this for now. Removing this block is causing regressions in about 15 tests. The regressions appear to be subtle lowering differences, so I suspect it will take some time to straighten them out.

cameron.mcinally marked an inline comment as not done.Feb 12 2020, 7:32 AM
cameron.mcinally added inline comments.
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2997–3015

I looked into removing this and there are warts underneath. Some are surmountable (different lowerings), but one is worrisome. I.e. the case where the FNeg operand is undef:

FNEG(undef) -> undef
FSUB(-0.0, undef) -> NaN

That is, removing this transform propagates NaNs where we previously had undef values.

Any thoughts on how to proceed? Do we want to minimize code differences by keeping this transform in place? Or are we okay moving forward with the undef->NaN change?

spatel added inline comments.Feb 12 2020, 11:42 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2997–3015

Did that difference show up as real regressions or something benign?
We could add a special-case fold for this here or getNode() if it helps:
fsub C, undef --> undef (as long as C is not NaN or Inf?)

arsenm added inline comments.Feb 12 2020, 12:58 PM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2997–3015

This sounds more correct to me. I don't see why this would be special cased

spatel added inline comments.Feb 12 2020, 1:56 PM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2997–3015

It's a special-case in the sense that folding to NaN is correct in general. Just dealing with this particular pattern is also a special-case because we could do something similar for all FP ops, not just fsub with constant operand 0. But we'll need to work out if/how the corner cases differ per opcode.

cameron.mcinally marked an inline comment as not done.Feb 13 2020, 7:18 AM
cameron.mcinally added inline comments.
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2997–3015

Benign llc regression tests. The NaN and undef propagate differently, so the asm differences appear worse than they are.

We could add a special-case fold for this here or getNode() if it helps:
fsub C, undef --> undef (as long as C is not NaN or Inf?)

That's a good idea. I don't feel strongly about it, but the current transform might be more obvious than adding a special case fold though.

spatel added inline comments.Feb 20 2020, 6:43 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2997–3015

I tried to generalize this over in D74713, but there doesn't appear to be support for bending the theoretical definition of undef to the practical constraints of real-life floating-point.

So our options are:

  1. Add a constant fold for this exact case: fsub -0.0, undef --> undef
  2. Ignore the diffs caused by removing this transform.

I'd lean toward #1 (I can limit D74713 to that case as the start of that effort).

Thanks, Sanjay. I'm okay with either approach.

I'll pick this up again in the near future. I've been distracted with another project...

Thanks, Sanjay. I'm okay with either approach.

rGa253a2a793cd: [SDAG] fold fsub -0.0, undef to undef rather than NaN.

Thanks for that patch, Sanjay.

I have another issue which I hope you can help me sort out. There's a transform in narrowExtractedVectorBinOp(...) in DAGCombiner.cpp:

// extract (binop B0, B1), N --> binop (extract B0, N), (extract B1, N)

This transform only happens for binops, so we don't see it when SelectionDAGBuilder converts the FSUB->FNEG.

The IR is...

%rhs_neg = fsub <4 x float> <float -0.0, float -0.0, float -0.0, float -0.0>, %rhs
%splat = shufflevector <4 x float> %rhs_neg, <4 x float> undef, <2 x i32> <i32 3, i32 3>

and after DAGCombine we end up with DAGs like this...

FNEG:
<               t9: v4f32 = bitcast t8
<             t24: v4f32 = fneg t9
<           t15: v2f32 = extract_subvector t24, Constant:i64<2>
<         t17: v2f32 = vector_shuffle<1,1> t15, undef:v2f32

FSUB:
>               t29: v1i64 = extract_subvector t8, Constant:i64<1>
>             t30: v2f32 = bitcast t29
>           t32: v2f32 = fneg t30
>         t17: v2f32 = vector_shuffle<1,1> t32, undef:v2f32

Moving the extract to the operands (FSUB) is a problem on AArch64 since the extract could be rolled into the shuffle (FNEG). E.g.:

FNEG:
<             t9: v4f32 = bitcast t8
<           t24: v4f32 = fneg t9
<         t26: v2f32 = AArch64ISD::DUPLANE32 t24, Constant:i64<3>

FSUB:
>                 t29: v1i64 = extract_subvector t8, Constant:i64<1>
>               t30: v2f32 = bitcast t29
>             t32: v2f32 = fneg t30
>           t36: v4f32 = insert_subvector undef:v4f32, t32, Constant:i32<0>
>         t37: v2f32 = AArch64ISD::DUPLANE32 t36, Constant:i64<1>

Any insight on the best way to correct this difference? I suppose I could fix up the extract+insert at the MachineInstruction level, but that doesn't seem like the correct fix since other targets could have the same problem.

I'm also a little skeptical about moving the extracts to the operands, and if it's a win in the general case. Seems like it would be stronger after any extract+insert peeps have occurred, but I suppose that's why it's done in DAGCombine. :/

Thanks for that patch, Sanjay.

I have another issue which I hope you can help me sort out. There's a transform in narrowExtractedVectorBinOp(...) in DAGCombiner.cpp:

// extract (binop B0, B1), N --> binop (extract B0, N), (extract B1, N)

This transform only happens for binops, so we don't see it when SelectionDAGBuilder converts the FSUB->FNEG.

The IR is...

%rhs_neg = fsub <4 x float> <float -0.0, float -0.0, float -0.0, float -0.0>, %rhs
%splat = shufflevector <4 x float> %rhs_neg, <4 x float> undef, <2 x i32> <i32 3, i32 3>

and after DAGCombine we end up with DAGs like this...

FNEG:
<               t9: v4f32 = bitcast t8
<             t24: v4f32 = fneg t9
<           t15: v2f32 = extract_subvector t24, Constant:i64<2>
<         t17: v2f32 = vector_shuffle<1,1> t15, undef:v2f32

FSUB:
>               t29: v1i64 = extract_subvector t8, Constant:i64<1>
>             t30: v2f32 = bitcast t29
>           t32: v2f32 = fneg t30
>         t17: v2f32 = vector_shuffle<1,1> t32, undef:v2f32

Moving the extract to the operands (FSUB) is a problem on AArch64 since the extract could be rolled into the shuffle (FNEG). E.g.:

FNEG:
<             t9: v4f32 = bitcast t8
<           t24: v4f32 = fneg t9
<         t26: v2f32 = AArch64ISD::DUPLANE32 t24, Constant:i64<3>

FSUB:
>                 t29: v1i64 = extract_subvector t8, Constant:i64<1>
>               t30: v2f32 = bitcast t29
>             t32: v2f32 = fneg t30
>           t36: v4f32 = insert_subvector undef:v4f32, t32, Constant:i32<0>
>         t37: v2f32 = AArch64ISD::DUPLANE32 t36, Constant:i64<1>

Any insight on the best way to correct this difference? I suppose I could fix up the extract+insert at the MachineInstruction level, but that doesn't seem like the correct fix since other targets could have the same problem.

I'm also a little skeptical about moving the extracts to the operands, and if it's a win in the general case. Seems like it would be stronger after any extract+insert peeps have occurred, but I suppose that's why it's done in DAGCombine. :/

The motivation for narrowExtractedVectorBinOp() was to shrink unnecessarily wide vector ops on x86 (256/512-bit vector code can run much slower than 128-bit vector code).
But we want to avoid moving fneg around too much because it can be folded into some other op for free in many cases. We can show there's an inconsistency in the handling in an independent example, so:
rGb3d0c798367d

Let me know if that works to remove the problem here.

Thanks again, Sanjay. That did help. I have other issues to work through on AMDGPU, but it's getting closer...

Thanks again, Sanjay. That did help. I have other issues to work through on AMDGPU, but it's getting closer...

As a heads up, AMDGPU doesn’t respect the denormal attribute yet and still uses the custom subtarget features. The patch to switch is posted but held up by its dependencies

Rebase and AMDGPU test changes to elucidate a problem with this Diff.

@arsenm, The problem in the AMDGPU tests is that FSUB(-0.0, X) is not folding into the following instruction, as it would if it was transformed into an FNEG(X).

It's probably okay to fold some of these. E.g.

-  %fneg.a = fsub float -0.000000e+00, %a
+  %fneg.a = fneg float %a
   %add = fadd float %fneg.a, %b

If we're flushing input to zero, it's probably okay to fold a FSUB(-0,X) into the FADD, since the FADD will flush denorms. Although, if we're flushing output to zero, that probably is NOT ok, since something like FADD(largest_denorm, largest_denorm) would return a normal.

I guess what I'm really asking is how important is this to AMDGPU? It seems to be the only target that is upset about the changes in this Diff.

Would it be enough to update the CHECK lines to not expect a FSUB(-0,X) to fold? Or does this need more peeps to fold the cases where it's safe? And if the latter, should we move ahead with this Diff and optimize later?

Rebase and AMDGPU test changes to elucidate a problem with this Diff.

@arsenm, The problem in the AMDGPU tests is that FSUB(-0.0, X) is not folding into the following instruction, as it would if it was transformed into an FNEG(X).

It's probably okay to fold some of these. E.g.

-  %fneg.a = fsub float -0.000000e+00, %a
+  %fneg.a = fneg float %a
   %add = fadd float %fneg.a, %b

If we're flushing input to zero, it's probably okay to fold a FSUB(-0,X) into the FADD, since the FADD will flush denorms. Although, if we're flushing output to zero, that probably is NOT ok, since something like FADD(largest_denorm, largest_denorm) would return a normal.

I guess what I'm really asking is how important is this to AMDGPU? It seems to be the only target that is upset about the changes in this Diff.

AMDGPU isn't respecting the new attributes yet. My patches to switch to it are still working their way through the review/commit process

arsenm added inline comments.Mar 31 2020, 2:45 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12590

Shouldn't be considering invalid anymore

Remove DenormalMode::Invalid check as suggested by @arsenm.

cameron.mcinally marked 2 inline comments as done.Apr 1 2020, 7:43 AM
arsenm added a comment.Apr 2 2020, 2:20 PM

AMDGPU should now be properly respecting the new attributes

Thanks, Matt. It looks like preventing the FSUB->FNEG transform is still causing trouble with folding the negate into instructions. E.g.

<scrubbed>/clang/llvm-project/llvm/test/CodeGen/AMDGPU/v_mac_f16.ll:125:7: error: SI: expected string not found in input
; SI: v_mad_f32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}, -v{{[0-9]+}}

It seems that there are a handful of new test failures too.

Any suggestions on how to proceed?

Should we not expect the explicit FSUB(-0,X) to fold under denormal flushing modes? (Too big a hammer, but correct)

Or maybe fold the FSUB(-0,X) into the instruction in the backend where possible? (Might cause some slightly wrong answers, unless we're careful)

Thanks, Matt. It looks like preventing the FSUB->FNEG transform is still causing trouble with folding the negate into instructions. E.g.

<scrubbed>/clang/llvm-project/llvm/test/CodeGen/AMDGPU/v_mac_f16.ll:125:7: error: SI: expected string not found in input
; SI: v_mad_f32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}, -v{{[0-9]+}}

It seems that there are a handful of new test failures too.

Any suggestions on how to proceed?

Should we not expect the explicit FSUB(-0,X) to fold under denormal flushing modes? (Too big a hammer, but correct)

That's what I would expect. Additional context is needed to know the flush will be performed elsewhere

Or maybe fold the FSUB(-0,X) into the instruction in the backend where possible? (Might cause some slightly wrong answers, unless we're careful)

I don't think we need to fold this in the target, we should be able to fold based on another instruction we know will flush. In the sample you gave there, the f16 operation was promoted to f32 and the conversion should also flush

Should we not expect the explicit FSUB(-0,X) to fold under denormal flushing modes? (Too big a hammer, but correct)

That's what I would expect. Additional context is needed to know the flush will be performed elsewhere

Or maybe fold the FSUB(-0,X) into the instruction in the backend where possible? (Might cause some slightly wrong answers, unless we're careful)

I don't think we need to fold this in the target, we should be able to fold based on another instruction we know will flush. In the sample you gave there, the f16 operation was promoted to f32 and the conversion should also flush

Good point. I suppose we'd need a switch to check if the user's opcode is a flushing operation.

That's kind of ugly though. Anyone know of a better way to do it?

arsenm added a comment.Apr 3 2020, 5:21 PM

Should we not expect the explicit FSUB(-0,X) to fold under denormal flushing modes? (Too big a hammer, but correct)

That's what I would expect. Additional context is needed to know the flush will be performed elsewhere

Or maybe fold the FSUB(-0,X) into the instruction in the backend where possible? (Might cause some slightly wrong answers, unless we're careful)

I don't think we need to fold this in the target, we should be able to fold based on another instruction we know will flush. In the sample you gave there, the f16 operation was promoted to f32 and the conversion should also flush

Good point. I suppose we'd need a switch to check if the user's opcode is a flushing operation.

That's kind of ugly though. Anyone know of a better way to do it?

Also the if the input is flushing

ychen added a subscriber: ychen.Apr 3 2020, 6:52 PM
cameron.mcinally planned changes to this revision.May 14 2020, 8:54 AM

Sorry for the long wait time. I'm still working on this. The AMDGPU tests are proving hard to clean up. Update hopefully coming soon...

Made some more progress on sorting out the AMDGPU backend, but I'm running up against walls: some optimization opportunities that will need further work; some newly exposed bugs in existing code; and some are my lack of experience with the AMDGPU instruction set. I added FIXME comments with some details about the cases I'm not familiar with. @arsenm Any comments on these changes?

The general intent of this patch is to check if the FSUB(+-0, X)->FNEG(X) transform is safe while in a DAZ/FTZ mode. This is done by checking if all uses of a FSUB(+-0, X) will flush denormals. If so, the transform is safe to do. Most cases are caught okay, but some are trickier. I couldn't solve them all.

(Digressing: I think we need a TableGen flag for instructions that could flush denormals.)

arsenm added inline comments.Jun 16 2020, 10:15 AM
llvm/include/llvm/CodeGen/TargetLowering.h
464

AMDGPU basically already has this, but it requires a depth argument similar to computeKnownBits.

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
834

We're already doing this, but I'm made somewhat uncomfortable by how constant folding is done. We don't insert a canonicalize when constant folding, so if you check isCanonicalized(x), but x is constant folded away into something that should have flushed, this won't be quite right. I guess the way it's defined, this only matters when folding canonicalize inputs?

846

Weird to have FMAXNUM but not FMINNUM. I also think we have a defective implementation for subtargets where the instructions don't read the FP mode. We inspect the inputs of the generic node rather than introducing a target specific wrapper with the broken behavior

llvm/test/CodeGen/AMDGPU/fneg-combines.ll
11

Correct, this most of these are for source modifier folding purposes only

cameron.mcinally marked an inline comment as done.Jun 16 2020, 12:23 PM
cameron.mcinally added inline comments.
llvm/include/llvm/CodeGen/TargetLowering.h
464

I did see isCanonicalized(...), but it looks like it goes the other direction. I.e. isCanonicalized(...) checks to see if the predecessor is already canonicalized. willCanonicalize(...) checks to see if the successors will canonicalize the result of the operation. There are a number of existing tests that begin with an FSUB(-0, X), so that's why I choose this solution.

I think we'd eventually want both directions, for completeness. But I noticed that isCanonicalized(...) only exists in SIISelLowering, and I didn't want to mess around with something I didn't fully understand.

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
846

Agreed. This switch is only opcodes that existed in current testing, so there are some gaps. I should probably add FMINNUM under a separate patch.

That said, I could introduce a test case pre-commit and then fix it in this patch. That's probably the right way to go forward.

This switch is also likely incorrect at the edges (e.g. FMED3, FMA). I don't fully understand all the intricacies of AMDGPU flushing [as seen in isCanonicalized(...)]. There's more work needed here.

Remove FMAXNUM and a couple other opcodes from the willCanonicalize(...) switch. They are not currently tested, but rather leftover junk from building out this code. I was mistaken.

cameron.mcinally marked an inline comment as not done.Jun 25 2020, 8:13 AM

Ping. @arsenm

I know there are some problems with the current implementation, but I think it's a good first step. Landing the DAGCombiner changes is probably worth the edge-case precision bugs, so that other backends don't regress. In particular, the current DAGCombiner::visitFSub(...) code is vulnerable now. Thoughts on any of this?

The thing I'm somewhat worried about is a subtlety with constant folding. Constant folding will blindly fold unaware of whatever canonicalization needed to happen. willCanonicalize may have lied if something happened later that caused the canonicalizing operation to constant fold away

The thing I'm somewhat worried about is a subtlety with constant folding. Constant folding will blindly fold unaware of whatever canonicalization needed to happen. willCanonicalize may have lied if something happened later that caused the canonicalizing operation to constant fold away

Ah, good point. I remember you saying that before, but I didn't absorb it at the time.

That's a sticky problem. We could wait until the MachineInstr level to do the FSUB->FNEG transform, to ensure that constant folding completed. But I suspect (pretty certain) that we'll have missed other FNEG peeps we'd want by then. So that won't work.

In general, it would be good to go for functional correctness first, and then try to optimize. That's kind of a problem for this specific project though, since so many existing tests would need to be updated. I'm not sure what to do. Will need to think about it...

The thing I'm somewhat worried about is a subtlety with constant folding. Constant folding will blindly fold unaware of whatever canonicalization needed to happen. willCanonicalize may have lied if something happened later that caused the canonicalizing operation to constant fold away

Ah, good point. I remember you saying that before, but I didn't absorb it at the time.

That's a sticky problem. We could wait until the MachineInstr level to do the FSUB->FNEG transform, to ensure that constant folding completed. But I suspect (pretty certain) that we'll have missed other FNEG peeps we'd want by then. So that won't work.

In general, it would be good to go for functional correctness first, and then try to optimize. That's kind of a problem for this specific project though, since so many existing tests would need to be updated. I'm not sure what to do. Will need to think about it...

If we modeled everything correctly, the correct thing to do would be to insert canonicalizes whenever occurs (but that's a massive change). I'm not a huge fan of getNode doing constant folding, so maybe eliminating that at least would help?

cameron.mcinally abandoned this revision.Aug 20 2020, 2:21 PM

Abandoning this Diff since most of it was covered in D84056. Will prepare a new patch to remove the problematic FSUB DAGCombine soon.