As the name suggests, correctly-rounded-divide-sqrt specifies the result of divion/sqrt to be rounded, and
thus we need to generate the correct sequence of code even when we flush the denormals.
Details
Diff Detail
Event Timeline
This looks OK to me, although tuning on correctly rounded division any time denorms are enabled is not actually required by OpenCL.
The attribute should not de directly checked (we probably shouldn’t even be putting it on the function). The proper thing to check is the fpmath metadata on the individual instruction. This isn’t propagated into the DAG, so AMDGPUCodeGenPrepare inserts intrinsic calls which isn’t ideal
:
So what's your suggestion here? The current logic in AMDGPUCodeGenPrepare is to find cases that we can insert the intrinsic to generate "Faster 2.5 ULP division that does not support denormals."
Otherwise SIISelLowering will lower FDIV32 UnsafeMath and Demorm support.
Do you want to change this logic to insert new intrinsics to generate the expected sequence of code for fdiv32?
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp | ||
---|---|---|
568–569 | The attribute should not be considered at all. Only the fpmath metadata matters. If -cl-fp32-correctly-rounded-divide-sqrt is specified, the regular fdiv instruction should behave correctly. | |
571–575 | An intrinsic should only be introduced when the fdiv differs from the default FP environment. Here you are doing the opposite, and not even considering the denormal mode. You should be inhibiting the insertion of the fdiv.fast if denormals are enabled, not introducing a new intrinsic. You can also consider the afn fast flag and use that to ignore the denormal mode | |
628–631 | There's no need to check the attribute | |
llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll | ||
245 | The attribute should be removed |
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp | ||
---|---|---|
568–569 | Do you mean that in AMDGPUCodeGenPrepare, we should check the fpmath metadata to keep regular fdiv (instead of an intrinsic) when -cl-fp32-correctly-rounded-divide-sqrt is specified? The issue is, when -cl-fp32-correctly-rounded-divide-sqrt is specified, a simple v_rcp is generated for a fdiv. Apparently the codegen produces the wrong sequence of code for a "regular" fdiv. |
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp | ||
---|---|---|
568–569 | Yes. We shouldn't even have an IR attribute for this flag. The interpretation of the flag is entirely represented in the use of the !fpmath metadata. |
Implement rcp optimization for fdiv in AMGGPUCodegenPrepare to insert amdgcn_rcp intrinsic. For f32 type fdiv,
if fpmath metadata is unavailable, we could not do rcp optimization unless fast unsafe math is specified.
The GlobalISel path should also be fixed, but that can be a follow up patch
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp | ||
---|---|---|
595 | Needs comment explaining the interaction between !fpmath requirements and denormals. This could use a chart of different fast math options, FP math and denormal handling and the expected lowering | |
636–641 | I don't think just allow reciprocal is sufficient without either checking FPMath or afn. I think this needs to be something more like | |
647 | Typo metadat | |
648 | It would be clearer to invert this, instead of the logic below relying on the double negative | |
648–650 | It would be clearer to do something like | |
649 | FPMath should be checked once, and in relation to it's value only. Checking for the lack of metadata here is imprecise | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
7540 | Based on the original problem, Flags.hasAllowReciprocal() isn't sufficient here. Without knowledge of !fpmath, this also needs approximate function | |
7542 | Needs comment explaining why | |
8726 | Braces |
For GlobalISel, I'm not sure this should reproduce the same fix. We can more plausibly preserve the !fpmath in the gMIR and handle it the right way, instead of hacking around it in AMDGPUCodeGenPrepare. I think a few asserts and the verifier would need to be updated, but it should be possible to allow arbitrary MDNode operands on an instruction, similar to how implicit registers can be added. I think we should disallow implicit register operands on G_* instructions, and instead only allow implicit metadata arguments. The fdiv lowering can then do the right thing with the original !fpmath information
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp | ||
---|---|---|
636–641 | Can you explain what is exactly "denormal hasLowAccuracy" here? |
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp | ||
---|---|---|
600 | UnsafeDiv is too imprecise here. This should explain in concrete terms why we need to insert the intrinsics and not just refer to the variable names. We need fdiv.fast when we only need 2.5 ULP and denormals are flushed | |
624 | I think this should maybe be rephrased into RcpLegal and UseFDivFast |
Update based on the comments.
- Rewrite the comments of the function visitFDiv;
- Rename a few variables.
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp | ||
---|---|---|
616 | You can just initialize this below with the logical value instead of setting the value conditionally | |
640 | I think this still isn't quite right. I think this should be (FMF.allowReciprocal() && ((!HasFP32Denormals && !NeedHighAccuracy) || FMF.approxFunc())). As is, this will allow reciprocal when denormals are flushed, but the higher fdiv precision is required, which was the case you were trying to fix in the first place | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
7542 | This still needs the denormal and type checks | |
llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll | ||
88 | This should not have produced rcp since denormals are enabled and it doesn't have afn. | |
91–92 | The name says high accuracy, but 5 ulp is lower accuracy. This didn't form rcp, but I think for the wrong reason | |
94–98 | These two I think are OK because of afn |
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp | ||
---|---|---|
640 | Yes, this also needs to account for FP32denormals. RCP for f16 doesn't' care about the fp16 denormal mode |
update based on feedback.
- Using arcp && (( no denormals && fpmath>=2.5) || afn)
- update arcp related LIT tests.
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp | ||
---|---|---|
122 | Set UseFDivFast once based on the logical expression below and never mutate it. UseFDivFast should be const | |
llvm/test/CodeGen/AMDGPU/fdiv.f16.ll | ||
253 ↗ | (On Diff #239724) | I don't know what ulp the f16 rcp instruction provides. This test change looks incomplete if there isn't already a case without !fpmath |
llvm/test/CodeGen/AMDGPU/fdiv.f16.ll | ||
---|---|---|
253 ↗ | (On Diff #239724) | I found a document stating this provides "~0.5ulp", so I guess check that value for f16? |
llvm/test/CodeGen/AMDGPU/fdiv.f16.ll | ||
---|---|---|
253 ↗ | (On Diff #239724) | Currently the logic in DAG lowering does "1/x -> rcp(x)" for fp16 without checking fpmath accuracy. We need to revisit that logic in DAG lowering. But I would rather to do that in a follow-up patch. |
Update based on feedback:
- const for UseFDivFast variable
- Remove the added "!fpmath !0" for an arcp f16 test, because the current logic in DAG loweing generates the same code with/without !fpmath.
TODO (in an follow up patch maybe): Change the accuracy threshold and apply the threshold to all types. Also need to re-visit
the rcp logic in DAG Lowering as long as the work in AMDGPUCodegenPrepare is done.
commit 2531535984ad989ce88aeee23cb92a827da6686e
Author: Changpeng Fang <changpeng.fang@gmail.com>
Date: Thu Jan 23 16:57:43 2020 -0800
The attribute should not be considered at all. Only the fpmath metadata matters. If -cl-fp32-correctly-rounded-divide-sqrt is specified, the regular fdiv instruction should behave correctly.