This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Overhaul and improve rcp and rsq f32 formation
ClosedPublic

Authored by arsenm on Jul 18 2023, 6:09 AM.

Details

Reviewers
foad
b-sumner
rampitec
Pierre-vh
jhuber6
yaxunl
jdoerfert
Group Reviewers
Restricted Project
Summary

The highlight change is a new denormal safe 1ulp lowering which uses
rcp after using frexp to perform input scaling. This saves 2
instructions compared to other implementations which performed an
explicit denormal range change. This improves the OpenCL default, and
requires a flag for HIP. I don't believe there's any flag wired up for
OpenMP to emit the necessary fpmath metadata.

This provides several improvements and changes that were hard to
separate without regressing one case or another. Disturbingly the
OpenCL conformance test seems to have the reciprocal test commented
out. I locally hacked it back in to test this.

Starts introducing f32 rsq intrinsics in AMDGPUCodeGenPrepare. Like
the rcp case, we could do this in codegen if !fpmath were preserved
(although we would lose some computeKnownFPClass tricks). Start
requiring contract flags to form rsq. The rsq fusion actually improves
the result from ~2ulp to ~1ulp. We have some older fusion in codegen
which only keys off unsafe math which should be refined.

Expand rsq patterns by checking for denormal inputs and pre/post
multiplying like the current library code does. We also take advantage
of computeKnownFPClass to avoid the scaling when we can statically
prove the input cannot be a denormal. We could do the same for the rcp
case, but unlike rsq a large input can underflow to denormal. We need
additional upper bound exponent checks on the input in order to do the
same for rcp.

This rsq handling also now starts handling the negated case. We
introduce rsq with an fneg. In the case the fneg doesn't fold into its
user, it's a neutral change but provides improvement if it is foldable
as a source modifier.

Also starts respecting the arcp attribute properly, and more strictly
interprets afn. We were previously interpreting afn as implying you
could do the reciprocal expansion of an fdiv. The codegen handling of
these also needs to be revisited.

This also effectively introduces the optimization
combineRepeatedFPDivisors enables, just done in the IR instead (and
only for f32).

This is almost across the board better. The one minor regression is
for gfx6/buggy frexp case where for multiple reciprocals, we could
previously reuse rematerialized constants per instance (it's neutral
for a single rcp).

The fdiv.fast and sqrt handling need to be revisited next.

Diff Detail

Event Timeline

arsenm created this revision.Jul 18 2023, 6:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 18 2023, 6:09 AM
arsenm requested review of this revision.Jul 18 2023, 6:09 AM
Herald added a project: Restricted Project. · View Herald Transcript
arsenm updated this revision to Diff 541500.Jul 18 2023, 6:27 AM

Fix some test rebase diff errors

arsenm added inline comments.Jul 18 2023, 12:47 PM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
893

I'm not really sure how to interpret arcp and afn. The backend interpretation continues to be aggressive with afn, so I don't know which is correct.

arsenm updated this revision to Diff 541988.Jul 19 2023, 6:24 AM

Defer the afn/unsafe-fp-math case to codegen for now, although it's really aggressive

ping, I want to get this in before the branch

rampitec accepted this revision.Jul 20 2023, 3:10 PM
This revision is now accepted and ready to land.Jul 20 2023, 3:10 PM
chapuni added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
822

Used only in +Asserts

llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f32.ll