This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add support for FMA in the GPU utilities
ClosedPublic

Authored by jhuber6 on Jun 14 2023, 7:42 AM.

Details

Summary

This adds the generic FMA utilities for the GPU. We implement these
through the builtins, which map to the FMA instructions in the ISA. These
may not strictly comply with other assumptions in the libc, such as
rounding modes. I've included the relevant information on how the GPU
vendors map the behavior. This should make it easier to implement some
future generic versions.

Depends on D152486

Diff Detail

Event Timeline

jhuber6 created this revision.Jun 14 2023, 7:42 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJun 14 2023, 7:42 AM
jhuber6 requested review of this revision.Jun 14 2023, 7:42 AM

These may not have strict compliance

I expect every FMA instruction in existence to be strictly compliant. There's not even errno to worry about. For AMDGPU, FP exceptions should even work with strictfp.

libc/src/__support/FPUtil/gpu/FMA.h
14–16

The NVPTX description sounds like you're just describing what FMA is

These may not have strict compliance

I expect every FMA instruction in existence to be strictly compliant. There's not even errno to worry about. For AMDGPU, FP exceptions should even work with strictfp.

I may be misusing terminology here; they are definitely IEEE-compliant, but maybe not compliant with libc's desire for all math to be correct under every rounding mode. I don't have a complete understanding of the libc team's requirements or desires here.

Could you explain strictfp here? I've never encountered it in AMDGPU.

libc/src/__support/FPUtil/gpu/FMA.h
14–16

NVPTX has different versions for all the rounding modes; AFAICT __builtin_fma maps to the round-to-nearest version, while AMDGPU has no such facilities. So I'm assuming this doesn't work "correctly" if the user changes the rounding mode, but it's unlikely we'd want to support that on the GPU anyway.

I may be misusing terminology here; they are definitely IEEE-compliant, but maybe not compliant with libc's desire for all math to be correct under every rounding mode. I don't have a complete understanding of the libc team's requirements or desires here.

There's a parallel set of FP intrinsics for strictfp functions if you're using fenv access. If you're not using strictfp/experimental.constrained intrinsics, you don't get control over the rounding mode and everything assumes round-to-nearest-even. Really, we'd want separate regular and fenv-access builds.

arsenm added inline comments.Jun 14 2023, 9:31 AM
libc/src/__support/FPUtil/gpu/FMA.h
13

I don't really see the point of documenting this here. It's a weird place to give platform specifics, and FMA is about as well defined as an operation can get.

lntue added a comment.Jun 14 2023, 9:32 AM

I would assume that the fma instructions on GPUs are more performant than a separate multiply + add. Do you want to let the generic math functions use fma for GPUs?

https://github.com/llvm/llvm-project/blob/main/libc/src/__support/macros/properties/cpu_features.h#L39
https://github.com/llvm/llvm-project/blob/main/libc/src/__support/FPUtil/multiply_add.h#L30

jhuber6 updated this revision to Diff 531386.Jun 14 2023, 9:34 AM

Addressing comments

lntue accepted this revision.Jun 14 2023, 9:38 AM
This revision is now accepted and ready to land.Jun 14 2023, 9:38 AM

I would assume that the fma instructions on GPUs are more performant than a separate multiply + add. Do you want to let the generic math functions use fma for GPUs?

https://github.com/llvm/llvm-project/blob/main/libc/src/__support/macros/properties/cpu_features.h#L39
https://github.com/llvm/llvm-project/blob/main/libc/src/__support/FPUtil/multiply_add.h#L30

This is trying to resolve the problem the fmuladd intrinsic solves. The target macros should be dropped and you should simply implement multiply_add with FP_CONTRACT on and let the backend decide

I would assume that the fma instructions on GPUs are more performant than a separate multiply + add. Do you want to let the generic math functions use fma for GPUs?

https://github.com/llvm/llvm-project/blob/main/libc/src/__support/macros/properties/cpu_features.h#L39
https://github.com/llvm/llvm-project/blob/main/libc/src/__support/FPUtil/multiply_add.h#L30

This is trying to resolve the problem the fmuladd intrinsic solves. The target macros should be dropped and you should simply implement multiply_add with FP_CONTRACT on and let the backend decide

Currently these macros are used for a few things:

  • Resolving when FMA instructions are available, which is straightforward for most architectures but not for early x86-64 AVX and AVX2 CPUs.
  • We cannot rely on __builtin_fma, since it can generate a call back into libc's own fma functions.
  • We use this to build and test the math functions both with and without FMA instructions in the current settings, somewhat similar to the memory functions.
  • Also, when FMA instructions are not available, we need precise control over falling back to either emulated fma functions or a plain multiply + add. With the current setup, we can control this simply by calling either fputil::fma or fputil::multiply_add.
This revision was automatically updated to reflect the committed changes.
  • Resolving when FMA instructions are available, which is straightforward for most architectures but not for early x86-64 AVX and AVX2 CPUs.

This is the backend's job. At best you are papering over gaps / bugs in legalization.

  • We cannot rely on __builtin_fma, since it can generate a call back into libc's own fma functions.

This is just a bug. The backend should always be able to handle llvm.fma. Whether or not x86 respects nobuiltin when lowering it is another question, but it should always be able to inline-expand it or call into compiler-rt.

  • Also, when fma instructions are not available, we need precise control over falling back to either emulated fma functions or a plain multiply + add. With the current setup, we can control this simply by calling either fputil::fma or fputil::multiply_add.

If you do not care about the precision semantics of FMA, you really don't need to know anything about the target. You should just emit fmul contract or the fmuladd intrinsic (which you get by using FP_CONTRACT on the basic expression). The backend then introduces an fma only if it's profitable.

If you do not care about the precision semantics of FMA, you really don't need to know anything about the target. You should just emit fmul contract or the fmuladd intrinsic (which you get by using FP_CONTRACT on the basic expression). The backend then introduces an fma only if it's profitable.

Unfortunately, relying completely on the backend is not enough for us. There are cases (at least for math functions) where knowing exactly when fma instructions are available or unavailable is critical for performance and accuracy, such as choosing between different efficient algorithms:

https://github.com/llvm/llvm-project/blob/main/libc/src/math/generic/log1p.cpp#L997 ,
https://github.com/llvm/llvm-project/blob/main/libc/src/math/generic/tanhf.cpp#L64 ,

or different exceptional values:

https://github.com/llvm/llvm-project/blob/main/libc/src/math/generic/expm1f.cpp#L42

This is just a bug. The backend should always be able to handle llvm.fma. Whether or not x86 respects nobuiltin when lowering it is another question, but it should always be able to inline-expand it or call into compiler-rt.

I totally agree with you on this.