This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add support for FMA in the GPU utilities
ClosedPublic

Authored by jhuber6 on Jun 14 2023, 7:42 AM.

Details

Summary

This adds the generic FMA utilities for the GPU. We implement these
through the builtins, which map to the FMA instructions in the ISA. These
may not strictly comply with other assumptions in the libc, such as
rounding modes. I've included the relevant information on how the GPU
vendors map the behavior. This should make it easier to implement some
future generic versions.

Depends on D152486

Diff Detail

Event Timeline

jhuber6 created this revision.Jun 14 2023, 7:42 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJun 14 2023, 7:42 AM
jhuber6 requested review of this revision.Jun 14 2023, 7:42 AM

These may not have strict compliance

I expect every FMA instruction in existence to be strictly compliant. There's not even errno to worry about. For AMDGPU, FP exceptions should even work with strictfp.

libc/src/__support/FPUtil/gpu/FMA.h
14–16

The NVPTX description sounds like you're just describing what FMA is

These may not have strict compliance

I expect every FMA instruction in existence to be strictly compliant. There's not even errno to worry about. For AMDGPU, FP exceptions should even work with strictfp.

I may be misusing terminology here; they are definitely IEEE-compliant, but maybe not compliant with libc's desire for all math to be correct under every rounding mode. I don't have a complete understanding of the libc team's requirements or desires here.

Could you explain strictfp here? I've never encountered it in AMDGPU.

libc/src/__support/FPUtil/gpu/FMA.h
14–16

NVPTX has different versions for all the rounding modes; AFAICT __builtin_fma maps to the round-to-nearest version, while AMDGPU has no such facilities. So I'm assuming this doesn't work "correctly" if the user changes the rounding mode, but it's unlikely we'd want to support that on the GPU anyway.

I may be misusing terminology here; they are definitely IEEE-compliant, but maybe not compliant with libc's desire for all math to be correct under every rounding mode. I don't have a complete understanding of the libc team's requirements or desires here.

There's a parallel set of FP intrinsics for strictfp functions if you're using fenv access. If you're not using strictfp/experimental.constrained intrinsics, you don't get control over the rounding mode and everything assumes round-to-nearest-even. Really, we'd want separate regular and fenv-access builds.

arsenm added inline comments.Jun 14 2023, 9:31 AM
libc/src/__support/FPUtil/gpu/FMA.h
13

I don't really see the point of documenting this here. It's a weird place to give platform specifics, and FMA is about as well defined as an operation can get.

lntue added a comment.Jun 14 2023, 9:32 AM

I would assume that the fma instructions on GPUs are more performant than a separate multiply + add. Do you want to let the generic math functions use fma for GPUs?

https://github.com/llvm/llvm-project/blob/main/libc/src/__support/macros/properties/cpu_features.h#L39
https://github.com/llvm/llvm-project/blob/main/libc/src/__support/FPUtil/multiply_add.h#L30

jhuber6 updated this revision to Diff 531386.Jun 14 2023, 9:34 AM

Addressing comments

lntue accepted this revision.Jun 14 2023, 9:38 AM
This revision is now accepted and ready to land.Jun 14 2023, 9:38 AM

I would assume that the fma instructions on GPUs are more performant than a separate multiply + add. Do you want to let the generic math functions use fma for GPUs?

https://github.com/llvm/llvm-project/blob/main/libc/src/__support/macros/properties/cpu_features.h#L39
https://github.com/llvm/llvm-project/blob/main/libc/src/__support/FPUtil/multiply_add.h#L30

This is trying to resolve the problem the fmuladd intrinsic solves. The target macros should be dropped and you should simply implement multiply_add with FP_CONTRACT on and let the backend decide

I would assume that the fma instructions on GPUs are more performant than a separate multiply + add. Do you want to let the generic math functions use fma for GPUs?

https://github.com/llvm/llvm-project/blob/main/libc/src/__support/macros/properties/cpu_features.h#L39
https://github.com/llvm/llvm-project/blob/main/libc/src/__support/FPUtil/multiply_add.h#L30

This is trying to resolve the problem the fmuladd intrinsic solves. The target macros should be dropped and you should simply implement multiply_add with FP_CONTRACT on and let the backend decide

Currently these macros are used for a few things:

  • Resolving when FMA instructions are available, which is straightforward for most architectures but not for early x86-64 AVX and AVX2 CPUs.
  • We cannot rely on __builtin_fma, since it can generate a call back into libc's own fma functions.
  • We use this to build and test the math functions both with and without FMA instructions in the current settings, somewhat similar to the memory functions.
  • Also, when FMA instructions are not available, we need precise control over falling back to either emulated fma functions or a plain multiply + add. With the current setup, we can control this simply by calling either fputil::fma or fputil::multiply_add.
This revision was automatically updated to reflect the committed changes.
  • Resolving when FMA instructions are available, which is straightforward for most architectures but not for early x86-64 AVX and AVX2 CPUs.

This is the backend's job. At best you are papering over gaps / bugs in legalization.

  • We cannot rely on __builtin_fma, since it can generate a call back into libc's own fma functions.

This is just a bug. The backend should always be able to handle llvm.fma. Whether or not x86 respects nobuiltin when lowering it is another question, but it should always be able to inline-expand it or call into compiler-rt.

  • Also, when fma instructions are not available, we need precise control over falling back to either emulated fma functions or a plain multiply + add. With the current setup, we can control this simply by calling either fputil::fma or fputil::multiply_add.

If you do not care about the precision semantics of FMA, you really don't need to know anything about the target. You should just emit fmul contract or the fmuladd intrinsic (which you get by using FP_CONTRACT on the basic expression). The backend then introduces an fma only if it's profitable.

If you do not care about the precision semantics of FMA, you really don't need to know anything about the target. You should just emit fmul contract or the fmuladd intrinsic (which you get by using FP_CONTRACT on the basic expression). The backend then introduces an fma only if it's profitable.

Unfortunately, relying completely on the backend is not enough for us. There are cases (at least for math functions) where knowing exactly when fma instructions are available or unavailable is critical for performance and accuracy, such as choosing between different efficient algorithms:

https://github.com/llvm/llvm-project/blob/main/libc/src/math/generic/log1p.cpp#L997 ,
https://github.com/llvm/llvm-project/blob/main/libc/src/math/generic/tanhf.cpp#L64 ,

or different exceptional values:

https://github.com/llvm/llvm-project/blob/main/libc/src/math/generic/expm1f.cpp#L42

This is just a bug. The backend should always be able to handle llvm.fma. Whether or not x86 respects nobuiltin when lowering it is another question, but it should always be able to inline-expand it or call into compiler-rt.

I totally agree with you on this.