Page MenuHomePhabricator

[AMDGPU] Support a fdot2 pattern.
ClosedPublic

Authored by FarhanaAleen on Jul 10 2018, 9:36 AM.

Details

Summary

Optimize fma((float)S0.x, (float)S1.x fma((float)S0.y, (float)S1.y, z)) -> fdot2((v2f16)S0, (v2f16)S1, (float)z)

Diff Detail

Repository
rL LLVM

Event Timeline

FarhanaAleen created this revision.Jul 10 2018, 9:36 AM

Does fdot2 perform rounding of intermediates?
Basically you start with two FMAs: FMA - Perform a * b + c with no intermediate rounding step. So the expression you are converting is quite fancy in terms of rounding:

fma(S0.x, S1.x fma(S0.y, S1.y, z)) => round((S0.x * S1.x) + round((S0.y * S1.y) + z))

I.e. you have two operations with rounding and two without. I am really unsure that is what the v_dot2_f2_f16 instruction does.
Then you probably need to have some conditions to check if denorms are supported or not.

This operation only rounds a single time, and unfortunately always flushes f32 denorms. Thus this transformation should only be done when unsafe math is requested.

This operation only rounds a single time, and unfortunately always flushes f32 denorms. Thus this transformation should only be done when unsafe math is requested.

As far as I understand it should be also legal with -mattr=-fp32-denormals,-fp64-fp16-denormals. I.e. when both 32 and 16 denorms are not supported. Right? Not that is really helps in the real world.
Otherwise it shall be legal if either UnsafeAlgebra or AllowContract flag is set on both FMA nodes.

Agreed.

By the way, since types are being mixed, shouldn't the summary say something like optimize fma((float)S0.x, (float)S1.x, fma((float)S0.y, (float)S1.y, S2)) --> fdot2(S0, S1, S2)? We only want this transformation if S0 and S1 are <2 x f16>.

As far as I understand it should be also legal with -mattr=-fp32-denormals,-fp64-fp16-denormals. I.e. when both 32 and 16 denorms are not supported. Right? Not that is really helps in the real world.
Otherwise it shall be legal if either UnsafeAlgebra or AllowContract flag is set on both FMA nodes.

Having the FMA node already grantees that either UnsafeAlgebra is set or AllowContract flag set is on the FAdd/FMUL nodes. We don't need to check them again during the FMA combine, right?

By the way, since types are being mixed, shouldn't the summary say something like optimize fma((float)S0.x, (float)S1.x, fma((float)S0.y, (float)S1.y, S2)) --> fdot2(S0, S1, S2)? We only want this transformation if S0 and S1 are <2 x f16>.

Current pattern matching does not support float element type yet, it will be supported next.

You are right, there is a typo in the summary. It should be:
fma((f16)S0.x, (f16)S1.x fma((f16)S0.y, (f16)S1.y, (f16)z)) -> ftrunc(fdot2(S0, S1, (f32)z))

By the way, since types are being mixed, shouldn't the summary say something like optimize fma((float)S0.x, (float)S1.x, fma((float)S0.y, (float)S1.y, S2)) --> fdot2(S0, S1, S2)? We only want this transformation if S0 and S1 are <2 x f16>.

Current pattern matching does not support float element type yet, it will be supported next.

You are right, there is a typo in the summary. It should be:
fma((f16)S0.x, (f16)S1.x fma((f16)S0.y, (f16)S1.y, (f16)z)) -> ftrunc(fdot2(S0, S1, (f32)z))

I think I had it right. FMA requires all 3 arguments to be the same type. And we don't want to do this transformation if the first two arguments to each FMA weren't cast from f16 to f32.

As far as I understand it should be also legal with -mattr=-fp32-denormals,-fp64-fp16-denormals. I.e. when both 32 and 16 denorms are not supported. Right? Not that is really helps in the real world.
Otherwise it shall be legal if either UnsafeAlgebra or AllowContract flag is set on both FMA nodes.

Having the FMA node already grantees that either UnsafeAlgebra is set or AllowContract flag set is on the FAdd/FMUL nodes. We don't need to check them again during the FMA combine, right?

FMA really restricts only the intermediate result rounding, but result of fma itself is rounded according to -mattr and other options. So I think you would need to check it anyway.

As far as I understand it should be also legal with -mattr=-fp32-denormals,-fp64-fp16-denormals. I.e. when both 32 and 16 denorms are not supported. Right? Not that is really helps in the real world.
Otherwise it shall be legal if either UnsafeAlgebra or AllowContract flag is set on both FMA nodes.

Having the FMA node already grantees that either UnsafeAlgebra is set or AllowContract flag set is on the FAdd/FMUL nodes. We don't need to check them again during the FMA combine, right?

The fma absolutely does not guarantee this

test/CodeGen/AMDGPU/dotproduct.ll
1 ↗(On Diff #154822)

Test should probable be named fdot2.ll

2 ↗(On Diff #154822)

The check prefixes are broken. They both need to include gcn for the check labels to work

23 ↗(On Diff #154822)

Remove fast and only use the required algebra flag (whatever it’s called now), plus some tests with the different permutations with a missing flag

30 ↗(On Diff #154822)

Needs tests with source and output modifiers, also with wider vector types

arsenm added inline comments.Jul 10 2018, 2:33 PM
lib/Target/AMDGPU/SIISelLowering.cpp
7474 ↗(On Diff #154822)

Probably should only do after legalization

FarhanaAleen added inline comments.Jul 12 2018, 11:20 AM
lib/Target/AMDGPU/SIISelLowering.cpp
7474 ↗(On Diff #154822)

Doing this after legalization makes the analysis complex. Due to type(v2f16)/node legalization we get a long chain of bitcasts in the IR after legalization. Also, the identification of the vector element is not straightforward after the legalization.

Would it be illegal to do this before legalization?

FarhanaAleen edited the summary of this revision. (Show Details)

Added fast-math flag+allow-contract flag and more test-cases.

rampitec added inline comments.Jul 13 2018, 10:13 AM
lib/Target/AMDGPU/SIISelLowering.cpp
7488 ↗(On Diff #155416)

That is not exactly true. If denorms are not supported on both f32 and f16 folding is also legal.

arsenm added inline comments.Jul 13 2018, 10:25 AM
lib/Target/AMDGPU/SIISelLowering.cpp
7474 ↗(On Diff #154822)

No, I just expect introducing custom nodes earlier to break generic optimizations

Reworded the comment about the flag requirements.

rampitec added inline comments.Jul 13 2018, 10:36 AM
lib/Target/AMDGPU/SIISelLowering.cpp
7488 ↗(On Diff #155416)

The comment is still wrong. It is in fact illegal to do it without respect of the denorm mode. It is only legal under some conditions, like unsafe math or allowed contraction for that reason.

FarhanaAleen added inline comments.
lib/Target/AMDGPU/SIISelLowering.cpp
7488 ↗(On Diff #155416)

I thought there are two separate things.

  1. fdot2 handles denormals and requires denorm mode to be set for that. With this configuration under the fast-math/fp-contract, we can still reject to generate fdot2 since the denorm support is not there but we want to support denormals.
  2. fdot2 does not handle denormals ever regardless of the denorm mode setting. This configuration does not leave much choices. So, we don't need to worry about the denorm support, fast-math/fp-contract is sufficient to decide whether fdot2 should be generated or not.

I thought it is already implied that we can only do this transformation under unsafe-fp-math/fp-contract, so are any other aggressive floating point optimizations. So, we don't need to mention about the unsafe-math/allowed contraction explicitly in the comment. There are cases even when the unsafe-fp-math/fp-contract is there, we don't do the transformation because the hardware does not support subnormals and fdot2 is an exception to those cases. That's why my comment was trying to emphasize on the denorm part with an implication of unsafe-math is there. But I see your point.

This revision is now accepted and ready to land.Jul 13 2018, 3:05 PM
Closed by commit rL337198: [AMDGPU] [AMDGPU] Support a fdot2 pattern. (authored by faaleen, committed by ). · Explain WhyJul 16 2018, 11:25 AM
This revision was automatically updated to reflect the committed changes.
arsenm added inline comments.Aug 13 2018, 8:30 AM
llvm/trunk/test/CodeGen/AMDGPU/fdot2.ll
232

Probably should try matching the negated form too