This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Generate the correct sequence of code for FDIV32 when correctly-rounded-divide-sqrt is set
ClosedPublic

Authored by cfang on Dec 10 2019, 11:21 AM.

Details

Summary

As the name suggests, correctly-rounded-divide-sqrt specifies the result of divion/sqrt to be rounded, and
thus we need to generate the correct sequence of code even when we flush the denormals.

Diff Detail

Event Timeline

cfang created this revision.Dec 10 2019, 11:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 10 2019, 11:21 AM

This looks OK to me, although tuning on correctly rounded division any time denorms are enabled is not actually required by OpenCL.

arsenm requested changes to this revision.Dec 10 2019, 8:51 PM

The attribute should not de directly checked (we probably shouldn’t even be putting it on the function). The proper thing to check is the fpmath metadata on the individual instruction. This isn’t propagated into the DAG, so AMDGPUCodeGenPrepare inserts intrinsic calls which isn’t ideal

This revision now requires changes to proceed.Dec 10 2019, 8:51 PM
cfang added a comment.Dec 12 2019, 1:28 PM

The attribute should not de directly checked (we probably shouldn’t even be putting it on the function). The proper thing to check is the fpmath metadata on the individual instruction. This isn’t propagated into the DAG, so AMDGPUCodeGenPrepare inserts intrinsic calls which isn’t ideal

:
So what's your suggestion here? The current logic in AMDGPUCodeGenPrepare is to find cases that we can insert the intrinsic to generate "Faster 2.5 ULP division that does not support denormals."
Otherwise SIISelLowering will lower FDIV32 UnsafeMath and Demorm support.

Do you want to change this logic to insert new intrinsics to generate the expected sequence of code for fdiv32?

cfang updated this revision to Diff 236698.Jan 7 2020, 3:09 PM

Introduce an intrinsic in AMDGPUCodeGenPrepare to generate correctly rounded fdiv32.

arsenm added inline comments.Jan 9 2020, 2:03 PM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
568–569

The attribute should not be considered at all. Only the fpmath metadata matters. If -cl-fp32-correctly-rounded-divide-sqrt is specified, the regular fdiv instruction should behave correctly.

571–575

An intrinsic should only be introduced when the fdiv differs from the default FP environment. Here you are doing the opposite, and not even considering the denormal mode. You should be inhibiting the insertion of the fdiv.fast if denormals are enabled, not introducing a new intrinsic. You can also consider the afn fast flag and use that to ignore the denormal mode

628–631

There's no need to check the attribute

llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll
245

The attribute should be removed

cfang marked 2 inline comments as done.Jan 10 2020, 9:52 AM
cfang added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
568–569

Do you mean that in AMDGPUCodeGenPrepare, we should check the fpmath metadata to keep regular fdiv (instead of an intrinsic) when -cl-fp32-correctly-rounded-divide-sqrt is specified?

The issue is, when -cl-fp32-correctly-rounded-divide-sqrt is specified, a simple v_rcp is generated for a fdiv. Apparently the codegen produces the wrong sequence of code for a "regular" fdiv.

arsenm added inline comments.Jan 10 2020, 10:56 AM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
568–569

Yes. We shouldn't even have an IR attribute for this flag. The interpretation of the flag is entirely represented in the use of the !fpmath metadata.

cfang updated this revision to Diff 239207.Jan 20 2020, 4:21 PM
cfang marked an inline comment as done.

Implement rcp optimization for fdiv in AMGGPUCodegenPrepare to insert amdgcn_rcp intrinsic. For f32 type fdiv,
if fpmath metadata is unavailable, we could not do rcp optimization unless fast unsafe math is specified.

The GlobalISel path should also be fixed, but that can be a follow up patch

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
595

Needs comment explaining the interaction between !fpmath requirements and denormals. This could use a chart of different fast math options, FP math and denormal handling and the expected lowering

636–641

I don't think just allow reciprocal is sufficient without either checking FPMath or afn. I think this needs to be something more like
UnsafeFP || isFast || (allowReciprocal && (denormal hasLowAccuracy || approximateFunction))

647

Typo metadat

648

It would be clearer to invert this, instead of the logic below relying on the double negative

648–650

It would be clearer to do something like
bool NeedHighAccuracy = !FPMath || FPMath->getFPAccuracy() < 2.5

649

FPMath should be checked once, and in relation to it's value only. Checking for the lack of metadata here is imprecise

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7540

Based on the original problem, Flags.hasAllowReciprocal() isn't sufficient here. Without knowledge of !fpmath, this also needs approximate function

7542

Needs comment explaining why

8726

Braces

For GlobalISel, I'm not sure this should reproduce the same fix. We can more plausibly preserve the !fpmath in the gMIR and handle it the right way, instead of hacking around it in AMDGPUCodeGenPrepare. I think a few asserts and the verifier would need to be updated, but it should be possible to allow arbitrary MDNode operands on an instruction, similar to how implicit registers can be added. I think we should disallow implicit register operands on G_* instructions, and instead only allow implicit metadata arguments. The fdiv lowering can then do the right thing with the original !fpmath information

cfang marked 5 inline comments as done.Jan 21 2020, 8:33 AM
cfang added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
648–650

Is < 2.5 ulp the limiting factor that we can not do 1/x -> rcp(x) ?

649

Do you mean here we should check like this:

(Ty->isFloatTy() && (HasFP32Denormals || NeedHighAccuracy));

where NeedHighAccuracy is checked like a previous comment?

cfang marked 3 inline comments as done.Jan 21 2020, 9:33 AM
cfang added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
636–641

Can you explain what is exactly "denormal hasLowAccuracy" here?

cfang updated this revision to Diff 239462.Jan 21 2020, 4:59 PM

Update based on feedback from the reviewer.

arsenm added inline comments.Jan 22 2020, 7:08 AM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
87–89

This should be redundant with the below logic

96–98

This should be fully captured by the logic above

arsenm added inline comments.Jan 22 2020, 7:13 AM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
600

UnsafeDiv is too imprecise here. This should explain in concrete terms why we need to insert the intrinsics and not just refer to the variable names. We need fdiv.fast when we only need 2.5 ULP and denormals are flushed

624

I think this should maybe be rephrased into RcpLegal and UseFDivFast

cfang updated this revision to Diff 239666.Jan 22 2020, 11:59 AM

Update based on the comments.

  1. Rewrite the comments of the function visitFDiv;
  2. Rename a few variables.
arsenm added inline comments.Jan 22 2020, 12:58 PM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
616

You can just initialize this below with the logical value instead of setting the value conditionally

640

I think this still isn't quite right.

I think this should be (FMF.allowReciprocal() && ((!HasFP32Denormals && !NeedHighAccuracy) || FMF.approxFunc())).

As is, this will allow reciprocal when denormals are flushed, but the higher fdiv precision is required, which was the case you were trying to fix in the first place

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7542

This still needs the denormal and type checks

llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll
88

This should not have produced rcp since denormals are enabled and it doesn't have afn.

91–92

The name says high accuracy, but 5 ulp is lower accuracy. This didn't form rcp, but I think for the wrong reason

94–98

These two I think are OK because of afn

cfang marked 3 inline comments as done.Jan 22 2020, 2:29 PM
cfang added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
616

Thanks, Will do like that.

640

How could we handle fp16 and fp64? I think HasFP32Denormals only matter for fp32.

Also, the issue I am working on seems not related to FMF.allowReciprocal() at all unless arcp is default.

arsenm added inline comments.Jan 22 2020, 2:36 PM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
640

Yes, this also needs to account for FP32denormals. RCP for f16 doesn't' care about the fp16 denormal mode

cfang updated this revision to Diff 239724.Jan 22 2020, 3:56 PM
cfang marked an inline comment as done.

update based on feedback.

  1. Using arcp && (( no denormals && fpmath>=2.5) || afn)
  2. update arcp related LIT tests.
arsenm added inline comments.Jan 23 2020, 7:30 AM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
122

Set UseFDivFast once based on the logical expression below and never mutate it. UseFDivFast should be const

llvm/test/CodeGen/AMDGPU/fdiv.f16.ll
253 ↗(On Diff #239724)

I don't know what ulp the f16 rcp instruction provides. This test change looks incomplete if there isn't already a case without !fpmath

arsenm added inline comments.Jan 23 2020, 7:42 AM
llvm/test/CodeGen/AMDGPU/fdiv.f16.ll
253 ↗(On Diff #239724)

I found a document stating this provides "~0.5ulp", so I guess check that value for f16?

cfang marked an inline comment as done.Jan 23 2020, 11:47 AM
cfang added inline comments.
llvm/test/CodeGen/AMDGPU/fdiv.f16.ll
253 ↗(On Diff #239724)

Currently the logic in DAG lowering does "1/x -> rcp(x)" for fp16 without checking fpmath accuracy.
Actually it always does "1/x -> rcp(x)" for fp16 because v_rcp_f16 supports denormals.

We need to revisit that logic in DAG lowering. But I would rather to do that in a follow-up patch.

cfang updated this revision to Diff 239963.Jan 23 2020, 11:56 AM

Update based on feedback:

  1. const for UseFDivFast variable
  2. Remove the added "!fpmath !0" for an arcp f16 test, because the current logic in DAG loweing generates the same code with/without !fpmath.

TODO (in an follow up patch maybe): Change the accuracy threshold and apply the threshold to all types. Also need to re-visit
the rcp logic in DAG Lowering as long as the work in AMDGPUCodegenPrepare is done.

arsenm accepted this revision.Jan 23 2020, 1:11 PM

LGTM

This revision is now accepted and ready to land.Jan 23 2020, 1:11 PM
cfang closed this revision.Feb 7 2020, 1:01 PM

commit 2531535984ad989ce88aeee23cb92a827da6686e
Author: Changpeng Fang <changpeng.fang@gmail.com>
Date: Thu Jan 23 16:57:43 2020 -0800