Optimize fma((float)S0.x, (float)S1.x fma((float)S0.y, (float)S1.y, z)) -> fdot2((v2f16)S0, (v2f16)S1, (float)z)
Details
Diff Detail
Event Timeline
Does fdot2 perform rounding of intermediates?
Basically you start with two FMAs: FMA - Perform a * b + c with no intermediate rounding step. So the expression you are converting is quite fancy in terms of rounding:
fma(S0.x, S1.x fma(S0.y, S1.y, z)) => round((S0.x * S1.x) + round((S0.y * S1.y) + z))
I.e. you have two operations with rounding and two without. I am really unsure that is what the v_dot2_f2_f16 instruction does.
Then you probably need to have some conditions to check if denorms are supported or not.
This operation only rounds a single time, and unfortunately always flushes f32 denorms. Thus this transformation should only be done when unsafe math is requested.
As far as I understand it should be also legal with -mattr=-fp32-denormals,-fp64-fp16-denormals. I.e. when both 32 and 16 denorms are not supported. Right? Not that is really helps in the real world.
Otherwise it shall be legal if either UnsafeAlgebra or AllowContract flag is set on both FMA nodes.
Agreed.
By the way, since types are being mixed, shouldn't the summary say something like optimize fma((float)S0.x, (float)S1.x, fma((float)S0.y, (float)S1.y, S2)) --> fdot2(S0, S1, S2)? We only want this transformation if S0 and S1 are <2 x f16>.
As far as I understand it should be also legal with -mattr=-fp32-denormals,-fp64-fp16-denormals. I.e. when both 32 and 16 denorms are not supported. Right? Not that is really helps in the real world.
Otherwise it shall be legal if either UnsafeAlgebra or AllowContract flag is set on both FMA nodes.
Having the FMA node already grantees that either UnsafeAlgebra is set or AllowContract flag set is on the FAdd/FMUL nodes. We don't need to check them again during the FMA combine, right?
By the way, since types are being mixed, shouldn't the summary say something like optimize fma((float)S0.x, (float)S1.x, fma((float)S0.y, (float)S1.y, S2)) --> fdot2(S0, S1, S2)? We only want this transformation if S0 and S1 are <2 x f16>.
Current pattern matching does not support float element type yet, it will be supported next.
You are right, there is a typo in the summary. It should be:
fma((f16)S0.x, (f16)S1.x fma((f16)S0.y, (f16)S1.y, (f16)z)) -> ftrunc(fdot2(S0, S1, (f32)z))
I think I had it right. FMA requires all 3 arguments to be the same type. And we don't want to do this transformation if the first two arguments to each FMA weren't cast from f16 to f32.
FMA really restricts only the intermediate result rounding, but result of fma itself is rounded according to -mattr and other options. So I think you would need to check it anyway.
The fma absolutely does not guarantee this
test/CodeGen/AMDGPU/dotproduct.ll | ||
---|---|---|
1 ↗ | (On Diff #154822) | Test should probable be named fdot2.ll |
2 ↗ | (On Diff #154822) | The check prefixes are broken. They both need to include gcn for the check labels to work |
23 ↗ | (On Diff #154822) | Remove fast and only use the required algebra flag (whatever it’s called now), plus some tests with the different permutations with a missing flag |
30 ↗ | (On Diff #154822) | Needs tests with source and output modifiers, also with wider vector types |
lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
7474 | Probably should only do after legalization |
lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
7474 | Doing this after legalization makes the analysis complex. Due to type(v2f16)/node legalization we get a long chain of bitcasts in the IR after legalization. Also, the identification of the vector element is not straightforward after the legalization. Would it be illegal to do this before legalization? |
lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
7488 | That is not exactly true. If denorms are not supported on both f32 and f16 folding is also legal. |
lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
7474 | No, I just expect introducing custom nodes earlier to break generic optimizations |
lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
7488 | The comment is still wrong. It is in fact illegal to do it without respect of the denorm mode. It is only legal under some conditions, like unsafe math or allowed contraction for that reason. |
lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
7488 | I thought there are two separate things.
I thought it is already implied that we can only do this transformation under unsafe-fp-math/fp-contract, so are any other aggressive floating point optimizations. So, we don't need to mention about the unsafe-math/allowed contraction explicitly in the comment. There are cases even when the unsafe-fp-math/fp-contract is there, we don't do the transformation because the hardware does not support subnormals and fdot2 is an exception to those cases. That's why my comment was trying to emphasize on the denorm part with an implication of unsafe-math is there. But I see your point. |
llvm/trunk/test/CodeGen/AMDGPU/fdot2.ll | ||
---|---|---|
232 ↗ | (On Diff #155722) | Probably should try matching the negated form too |
Probably should only do after legalization