This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Correctly expand f64 sqrt intrinsic
AcceptedPublic

Authored by arsenm on Jun 21 2023, 2:40 PM.

Details

Reviewers
jhuber6
rampitec
Pierre-vh
foad
Group Reviewers
Restricted Project
Summary

rocm-device-libs and llpc were avoiding using f64 sqrt
intrinsics in favor of their own expansions. Port the
expansion into the backend. Both of these users should be
updated to call the intrinsic instead.

The library and llpc expansions are slightly different.
llpc uses an ldexp to do the scale; the library uses a multiply.

Use ldexp to do the scale instead of the multiply.
I believe v_ldexp_f64 and v_mul_f64 are always the same number of
cycles, but it's cheaper to materialize the 32-bit integer constant
than the 64-bit double constant.

The libraries have another fast version of sqrt which will
be handled separately.

I am tempted to do this in an IR expansion instead. In the IR
we could take advantage of computeKnownFPClass to avoid
the 0-or-inf argument check.

Diff Detail

Event Timeline

arsenm created this revision.Jun 21 2023, 2:40 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 21 2023, 2:40 PM
arsenm requested review of this revision.Jun 21 2023, 2:40 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 21 2023, 2:40 PM
Herald added a subscriber: wdng. · View Herald Transcript

Needs a few more test updates

arsenm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
4871

todo: propagate flags without nnan/ninf

I am tempted to do this in an IR expansion instead. In the IR
we could take advantage of computeKnownFPClass to avoid
the 0-or-inf argument check.

Wouldn't this be a good fit for CGP, and avoid repeating logic for GISel+DAGISel?
Is there a drawback to doing it in IR?

llvm/test/CodeGen/AMDGPU/fsqrt.f64.ll
44

I think I need some context, why is v_sqrt_f64 so bad that this expansion is preferred? Accuracy/semantics?

I am tempted to do this in an IR expansion instead. In the IR
we could take advantage of computeKnownFPClass to avoid
the 0-or-inf argument check.

Wouldn't this be a good fit for CGP, and avoid repeating logic for GISel+DAGISel?
Is there a drawback to doing it in IR?

CGP is an optimization pass only. If we did IR expansion, we would break if anything later tried to introduce new sqrts for some reason. I don't like legality holes where we're hoping certain things don't happen after the expansion, so we'd need both

arsenm updated this revision to Diff 542865.Jul 21 2023, 4:54 AM

Rebase

llvm/test/CodeGen/AMDGPU/fsqrt.f64.ll
44

It's nowhere near accurate enough, I don't really know what the point of the instruction is. The raw instruction definitely just fails OpenCL conformance

Pierre-vh accepted this revision as: Pierre-vh.Jul 25 2023, 4:37 AM

LGTM with some nits

llvm/include/llvm/CodeGen/GlobalISel/MachineIRBuilder.h
1183–1184
1186

Nit: can't mask just be int64_t directly?

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
4844

Add that they fail OCL conformance + many users avoid using it because of that, just for some context.

llvm/test/Analysis/CostModel/AMDGPU/arith-fp.ll
55

Why did this change?

This revision is now accepted and ready to land.Jul 25 2023, 4:37 AM
arsenm added inline comments.Jul 25 2023, 4:55 AM
llvm/include/llvm/CodeGen/GlobalISel/MachineIRBuilder.h
1186

I think the real question is whether it should be FPClassTest

llvm/test/Analysis/CostModel/AMDGPU/arith-fp.ll
55

legal->custom. Really it should go up more

arsenm added inline comments.Jul 25 2023, 4:58 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
4844

Seems like historical noise that doesn't explain what's going on

foad added a comment.Jul 27 2023, 2:54 AM

I am tempted to do this in an IR expansion instead. In the IR
we could take advantage of computeKnownFPClass to avoid
the 0-or-inf argument check.

Could you use computeKnownFPClass to set ninf and nsz if the argument is known not to be inf/0, and then check those flags in your existing legalization code?

llvm/include/llvm/CodeGen/GlobalISel/MachineIRBuilder.h
1184
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
4861

It's nice that you have a comment showing the algorithm, but it'd be way more useful if the names in the comment matched the names in the code.

Also what do the => mean? It looks like they show alternative expansions, and I guess we implement the versions on the RHS?? I can see that the second one avoids the need to calculate h2, but what is the point of the first one?

4873

What is this magic number and why is scaling required? If it is to avoid denormal inputs to amdgcn_rsq then please say that in a comment.

llvm/test/CodeGen/AMDGPU/fsqrt.f64.ll
44

v_sqrt_f64 is documented as having "(2**29)ULP accuracy"(!). My guess is that it just does the f64/f32 format conversions for you, and uses the same underlying sqrt expansion as v_sqrt_f32.

I am tempted to do this in an IR expansion instead. In the IR
we could take advantage of computeKnownFPClass to avoid
the 0-or-inf argument check.

Could you use computeKnownFPClass to set ninf and nsz if the argument is known not to be inf/0, and then check those flags in your existing legalization code?

I am tempted to do this in an IR expansion instead. In the IR
we could take advantage of computeKnownFPClass to avoid
the 0-or-inf argument check.

Could you use computeKnownFPClass to set ninf and nsz if the argument is known not to be inf/0, and then check those flags in your existing legalization code?

Yes, in this case it seems feasible. We don't in general infer fast math flags anywhere. It's usually difficult because they constrain both inputs and outputs

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
4861

Copy paste, this is ported code

4873

Your guess is as good as mine, this is all ported and none of this stuff was ever well documented

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.rcp.ll