This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Always use v_rcp_f16 and v_rsq_f16
ClosedPublic

Authored by arsenm on Jul 5 2023, 9:02 AM.

Details

Reviewers
foad
rampitec
b-sumner
Pierre-vh
jhuber6
Group Reviewers
Restricted Project
Summary

These inherited the fast math checks from f32, but the manual suggests
these should be accurate enough for unconditional use. The definition
of correctly rounded is 0.5ulp, but the manual says "0.51ulp". I've
been a bit nervous about changing this as the OpenCL conformance test
does not cover half. Brute force produces identical values compared to
a reference host implementation for all values.

Diff Detail

Event Timeline

arsenm created this revision.Jul 5 2023, 9:02 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 5 2023, 9:02 AM
arsenm requested review of this revision.Jul 5 2023, 9:02 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 5 2023, 9:02 AM
Herald added subscribers: wangpc, wdng. · View Herald Transcript
rampitec accepted this revision.Jul 5 2023, 12:32 PM

There is 'half' subtest of the OCL conformance. Not sure it covers it, but I would run it. Otherwise LGTM.

This revision is now accepted and ready to land.Jul 5 2023, 12:32 PM

There is 'half' subtest of the OCL conformance. Not sure it covers it, but I would run it. Otherwise LGTM.

It doesn't. Even more disturbing is the regular reciprocal test for f32/f64 is commented out

foad added a comment.Jul 6 2023, 3:36 AM

Brute force produces identical values compared to a reference host implementation for all values.

Have you tested v_sqrt_f16 or any other f16 trans instructions?

arsenm added a comment.Jul 6 2023, 2:40 PM

Brute force produces identical values compared to a reference host implementation for all values.

Have you tested v_sqrt_f16 or any other f16 trans instructions?

Haven't gotten there yet

Brute force produces identical values compared to a reference host implementation for all values.

Have you tested v_sqrt_f16 or any other f16 trans instructions?

Haven't gotten there yet

v_sqrt_f16 is identical.
v_log_f16 is is identical.
v_exp_f16 has a single value differ: ref=0x1p+0 inst=0x1.004p+0

I'm also comparing these by cast to float host implementations, maybe a proper f16 implementation would have rounded these differences differently?

I think I'm doing something wrong with the pre-scaling for sin/cos, those results just seem totally wrong

foad added a comment.Jul 14 2023, 4:34 AM

v_exp_f16 has a single value differ: ref=0x1p+0 inst=0x1.004p+0

For what input value?

foad added inline comments.Oct 13 2023, 5:17 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
4152–4155

Comments don't match the code. https://github.com/llvm/llvm-project/pull/68982 fixes the comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
9172–9175

Same.