Page MenuHomePhabricator

[WIP][FPEnv] Don't transform FSUB(-0.0,X)->FNEG(X) when flushing denormals
Needs ReviewPublic

Authored by cameron.mcinally on Feb 4 2020, 10:07 AM.

Details

Summary

When in a mode that flushes denormals, we don't want to transform FSUB(-0.0,X) -> FNEG(X). The former is an arith operation that will flush a denormal input to 0. The latter is a bitwise operation that will only flip the sign bit.

Marked as [WIP] since the logic is a little weird. Hoping @arsenm and others can offer some guidance...

  1. Notice that we still perform the transformation when in DenormalMode::IEEE. This is counter-intuitive. IEEE-754 is what specifies that these operations are distinct, but only in regards to side-effects, not denormal flushing. LLVM optimizations do not preserve side-effects, and both operation results will be bitwise identical when we're not flushing denormals, so I think this is the correct thing to do.

Although, there's also the problem of this transform changing the sign of a NaN in DenormalMode::IEEE. Do we want to take that into consideration? E.g. an FSUB(-0.0, NaN) should produce a canonical NaN with the same payload, while FNEG(NaN) produces -NaN. If I'm not mistaken, IEEE-754 doesn't specify the sign of a NaN result, besides being a canonical NaN.

  1. Also notice that we still perform the transformation when in DenormalMode::Invalid. I believe that Invalid is actually a flush to zero mode. However, I think it makes sense to leave the default mode unchanged wrt disabling this transform. There could be a very small (and hard to measure) performance penalty for using a proper FSUB on some targets.

Thoughts about any of this?

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptFeb 4 2020, 10:07 AM
arsenm added inline comments.Feb 4 2020, 10:17 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12589–12591

This will need updating for the splitting the input and output patch I just committed

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2986–2987

Why does SelectionDAGBuilder bother doing this fold at all? It should just directly translate the fsub?

arsenm added inline comments.Feb 4 2020, 10:19 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2986–2987

I think this should be just ripped out in a separate patch. The same problem seems to have been copied to IRTranslator, which should also be removed

The DenormalMode::Invalid is a temporary state and should not really be a concern. It should be invalid and never seen after D69989

cameron.mcinally marked 2 inline comments as done.Feb 4 2020, 11:14 AM
cameron.mcinally added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12589–12591

Ok, thanks. Will wait for the builds to go green and then update.

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2986–2987

@sanjay, what do you think? Seems reasonable to me.

I think it made sense to do this when FNEG(X) was the canonical form of FSUB(-0.0, X). Wouldn't want two forms floating around for even a small amount of time. But now that there are cases where the operations are distinct through llc, it seems ok to wait until DAGCombine.

spatel added inline comments.Feb 4 2020, 12:31 PM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2986–2987

Yes, now that we can use DenormMode to distinguish target behavior, it seems better to do it later in DAGCombiner if that would make sense for the target.

cameron.mcinally planned changes to this revision.Feb 5 2020, 7:21 AM
cameron.mcinally marked an inline comment as done.
cameron.mcinally added inline comments.
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2986–2987

I'll have to put a pin in this for now. Removing this block is causing regressions in about 15 tests. The regressions appear to be subtle lowering differences, so I suspect it will take some time to straighten them out.

cameron.mcinally marked an inline comment as not done.Feb 12 2020, 7:32 AM
cameron.mcinally added inline comments.
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2986–2987

I looked into removing this and there are warts underneath. Some are surmountable (different lowerings), but one is worrisome. I.e. the case where the FNeg operand is undef:

FNEG(undef) -> undef
FSUB(-0.0, undef) -> NaN

That is, removing this transform propagates NaNs where we previously had undef values.

Any thoughts on how to proceed? Do we want to minimize code differences by keeping this transform in place? Or are we okay moving forward with the undef->NaN change?

spatel added inline comments.Feb 12 2020, 11:42 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2986–2987

Did that difference show up as real regressions or something benign?
We could add a special-case fold for this here or getNode() if it helps:
fsub C, undef --> undef (as long as C is not NaN or Inf?)

arsenm added inline comments.Feb 12 2020, 12:58 PM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2986–2987

This sounds more correct to me. I don't see why this would be special cased

spatel added inline comments.Feb 12 2020, 1:56 PM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2986–2987

It's a special-case in the sense that folding to NaN is correct in general. Just dealing with this particular pattern is also a special-case because we could do something similar for all FP ops, not just fsub with constant operand 0. But we'll need to work out if/how the corner cases differ per opcode.

cameron.mcinally marked an inline comment as not done.Feb 13 2020, 7:18 AM
cameron.mcinally added inline comments.
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2986–2987

Benign llc regression tests. The NaN and undef propagate differently, so the asm differences appear worse than they are.

We could add a special-case fold for this here or getNode() if it helps:
fsub C, undef --> undef (as long as C is not NaN or Inf?)

That's a good idea. I don't feel strongly about it, but the current transform might be more obvious than adding a special case fold though.

spatel added inline comments.Feb 20 2020, 6:43 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2986–2987

I tried to generalize this over in D74713, but there doesn't appear to be support for bending the theoretical definition of undef to the practical constraints of real-life floating-point.

So our options are:

  1. Add a constant fold for this exact case: fsub -0.0, undef --> undef
  2. Ignore the diffs caused by removing this transform.

I'd lean toward #1 (I can limit D74713 to that case as the start of that effort).

Thanks, Sanjay. I'm okay with either approach.

I'll pick this up again in the near future. I've been distracted with another project...

Thanks, Sanjay. I'm okay with either approach.

rGa253a2a793cd: [SDAG] fold fsub -0.0, undef to undef rather than NaN.

Thanks for that patch, Sanjay.

I have another issue which I hope you can help me sort out. There's a transform in narrowExtractedVectorBinOp(...) in DAGCombiner.cpp:

// extract (binop B0, B1), N --> binop (extract B0, N), (extract B1, N)

This transform only happens for binops, so we don't see it when SelectionDAGBuilder converts the FSUB->FNEG.

The IR is...

%rhs_neg = fsub <4 x float> <float -0.0, float -0.0, float -0.0, float -0.0>, %rhs
%splat = shufflevector <4 x float> %rhs_neg, <4 x float> undef, <2 x i32> <i32 3, i32 3>

and after DAGCombine we end up with DAGs like this...

FNEG:
<               t9: v4f32 = bitcast t8
<             t24: v4f32 = fneg t9
<           t15: v2f32 = extract_subvector t24, Constant:i64<2>
<         t17: v2f32 = vector_shuffle<1,1> t15, undef:v2f32

FSUB:
>               t29: v1i64 = extract_subvector t8, Constant:i64<1>
>             t30: v2f32 = bitcast t29
>           t32: v2f32 = fneg t30
>         t17: v2f32 = vector_shuffle<1,1> t32, undef:v2f32

Moving the extract to the operands (FSUB) is a problem on AArch64 since the extract could be rolled into the shuffle (FNEG). E.g.:

FNEG:
<             t9: v4f32 = bitcast t8
<           t24: v4f32 = fneg t9
<         t26: v2f32 = AArch64ISD::DUPLANE32 t24, Constant:i64<3>

FSUB:
>                 t29: v1i64 = extract_subvector t8, Constant:i64<1>
>               t30: v2f32 = bitcast t29
>             t32: v2f32 = fneg t30
>           t36: v4f32 = insert_subvector undef:v4f32, t32, Constant:i32<0>
>         t37: v2f32 = AArch64ISD::DUPLANE32 t36, Constant:i64<1>

Any insight on the best way to correct this difference? I suppose I could fix up the extract+insert at the MachineInstruction level, but that doesn't seem like the correct fix since other targets could have the same problem.

I'm also a little skeptical about moving the extracts to the operands, and if it's a win in the general case. Seems like it would be stronger after any extract+insert peeps have occurred, but I suppose that's why it's done in DAGCombine. :/

Thanks for that patch, Sanjay.

I have another issue which I hope you can help me sort out. There's a transform in narrowExtractedVectorBinOp(...) in DAGCombiner.cpp:

// extract (binop B0, B1), N --> binop (extract B0, N), (extract B1, N)

This transform only happens for binops, so we don't see it when SelectionDAGBuilder converts the FSUB->FNEG.

The IR is...

%rhs_neg = fsub <4 x float> <float -0.0, float -0.0, float -0.0, float -0.0>, %rhs
%splat = shufflevector <4 x float> %rhs_neg, <4 x float> undef, <2 x i32> <i32 3, i32 3>

and after DAGCombine we end up with DAGs like this...

 FNEG:
 <               t9: v4f32 = bitcast t8
 <             t24: v4f32 = fneg t9
 <           t15: v2f32 = extract_subvector t24, Constant:i64<2>
 <         t17: v2f32 = vector_shuffle<1,1> t15, undef:v2f32
 
 FSUB:
>               t29: v1i64 = extract_subvector t8, Constant:i64<1>
>             t30: v2f32 = bitcast t29
>           t32: v2f32 = fneg t30
>         t17: v2f32 = vector_shuffle<1,1> t32, undef:v2f32

Moving the extract to the operands (FSUB) is a problem on AArch64 since the extract could be rolled into the shuffle (FNEG). E.g.:

 FNEG:
 <             t9: v4f32 = bitcast t8
 <           t24: v4f32 = fneg t9
 <         t26: v2f32 = AArch64ISD::DUPLANE32 t24, Constant:i64<3>
 
 FSUB:
>                 t29: v1i64 = extract_subvector t8, Constant:i64<1>
>               t30: v2f32 = bitcast t29
>             t32: v2f32 = fneg t30
>           t36: v4f32 = insert_subvector undef:v4f32, t32, Constant:i32<0>
>         t37: v2f32 = AArch64ISD::DUPLANE32 t36, Constant:i64<1>

Any insight on the best way to correct this difference? I suppose I could fix up the extract+insert at the MachineInstruction level, but that doesn't seem like the correct fix since other targets could have the same problem.

I'm also a little skeptical about moving the extracts to the operands, and if it's a win in the general case. Seems like it would be stronger after any extract+insert peeps have occurred, but I suppose that's why it's done in DAGCombine. :/

The motivation for narrowExtractedVectorBinOp() was to shrink unnecessarily wide vector ops on x86 (256/512-bit vector code can run much slower than 128-bit vector code).
But we want to avoid moving fneg around too much because it can be folded into some other op for free in many cases. We can show there's an inconsistency in the handling in an independent example, so:
rGb3d0c798367d

Let me know if that works to remove the problem here.

Thanks again, Sanjay. That did help. I have other issues to work through on AMDGPU, but it's getting closer...

Thanks again, Sanjay. That did help. I have other issues to work through on AMDGPU, but it's getting closer...

As a heads up, AMDGPU doesn’t respect the denormal attribute yet and still uses the custom subtarget features. The patch to switch is posted but held up by its dependencies

Rebase and AMDGPU test changes to elucidate a problem with this Diff.

@arsenm, The problem in the AMDGPU tests is that FSUB(-0.0, X) is not folding into the following instruction, as it would if it was transformed into an FNEG(X).

It's probably okay to fold some of these. E.g.

-  %fneg.a = fsub float -0.000000e+00, %a
+  %fneg.a = fneg float %a
   %add = fadd float %fneg.a, %b

If we're flushing input to zero, it's probably okay to fold a FSUB(-0,X) into the FADD, since the FADD will flush denorms. Although, if we're flushing output to zero, that probably is NOT ok, since something like FADD(largest_denorm, largest_denorm) would return a normal.

I guess what I'm really asking is how important is this to AMDGPU? It seems to be the only target that is upset about the changes in this Diff.

Would it be enough to update the CHECK lines to not expect a FSUB(-0,X) to fold? Or does this need more peeps to fold the cases where it's safe? And if the latter, should we move ahead with this Diff and optimize later?

Rebase and AMDGPU test changes to elucidate a problem with this Diff.

@arsenm, The problem in the AMDGPU tests is that FSUB(-0.0, X) is not folding into the following instruction, as it would if it was transformed into an FNEG(X).

It's probably okay to fold some of these. E.g.

-  %fneg.a = fsub float -0.000000e+00, %a
+  %fneg.a = fneg float %a
   %add = fadd float %fneg.a, %b

If we're flushing input to zero, it's probably okay to fold a FSUB(-0,X) into the FADD, since the FADD will flush denorms. Although, if we're flushing output to zero, that probably is NOT ok, since something like FADD(largest_denorm, largest_denorm) would return a normal.

I guess what I'm really asking is how important is this to AMDGPU? It seems to be the only target that is upset about the changes in this Diff.

AMDGPU isn't respecting the new attributes yet. My patches to switch to it are still working their way through the review/commit process

arsenm added inline comments.Tue, Mar 31, 2:45 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12590

Shouldn't be considering invalid anymore

Remove DenormalMode::Invalid check as suggested by @arsenm.

cameron.mcinally marked 2 inline comments as done.Wed, Apr 1, 7:43 AM
arsenm added a comment.Thu, Apr 2, 2:20 PM

AMDGPU should now be properly respecting the new attributes

Thanks, Matt. It looks like preventing the FSUB->FNEG transform is still causing trouble with folding the negate into instructions. E.g.

<scrubbed>/clang/llvm-project/llvm/test/CodeGen/AMDGPU/v_mac_f16.ll:125:7: error: SI: expected string not found in input
; SI: v_mad_f32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}, -v{{[0-9]+}}

It seems that there are a handful of new test failures too.

Any suggestions on how to proceed?

Should we not expect the explicit FSUB(-0,X) to fold under denormal flushing modes? (Too big a hammer, but correct)

Or maybe fold the FSUB(-0,X) into the instruction in the backend where possible? (Might cause some slightly wrong answers, unless we're careful)

Thanks, Matt. It looks like preventing the FSUB->FNEG transform is still causing trouble with folding the negate into instructions. E.g.

<scrubbed>/clang/llvm-project/llvm/test/CodeGen/AMDGPU/v_mac_f16.ll:125:7: error: SI: expected string not found in input
; SI: v_mad_f32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}, -v{{[0-9]+}}

It seems that there are a handful of new test failures too.

Any suggestions on how to proceed?

Should we not expect the explicit FSUB(-0,X) to fold under denormal flushing modes? (Too big a hammer, but correct)

That's what I would expect. Additional context is needed to know the flush will be performed elsewhere

Or maybe fold the FSUB(-0,X) into the instruction in the backend where possible? (Might cause some slightly wrong answers, unless we're careful)

I don't think we need to fold this in the target, we should be able to fold based on another instruction we know will flush. In the sample you gave there, the f16 operation was promoted to f32 and the conversion should also flush

Should we not expect the explicit FSUB(-0,X) to fold under denormal flushing modes? (Too big a hammer, but correct)

That's what I would expect. Additional context is needed to know the flush will be performed elsewhere

Or maybe fold the FSUB(-0,X) into the instruction in the backend where possible? (Might cause some slightly wrong answers, unless we're careful)

I don't think we need to fold this in the target, we should be able to fold based on another instruction we know will flush. In the sample you gave there, the f16 operation was promoted to f32 and the conversion should also flush

Good point. I suppose we'd need a switch to check if the user's opcode is a flushing operation.

That's kind of ugly though. Anyone know of a better way to do it?

arsenm added a comment.Fri, Apr 3, 5:21 PM

Should we not expect the explicit FSUB(-0,X) to fold under denormal flushing modes? (Too big a hammer, but correct)

That's what I would expect. Additional context is needed to know the flush will be performed elsewhere

Or maybe fold the FSUB(-0,X) into the instruction in the backend where possible? (Might cause some slightly wrong answers, unless we're careful)

I don't think we need to fold this in the target, we should be able to fold based on another instruction we know will flush. In the sample you gave there, the f16 operation was promoted to f32 and the conversion should also flush

Good point. I suppose we'd need a switch to check if the user's opcode is a flushing operation.

That's kind of ugly though. Anyone know of a better way to do it?

Also the if the input is flushing

ychen added a subscriber: ychen.Fri, Apr 3, 6:52 PM