Page MenuHomePhabricator

[CUDA] Enable fusing FP ops for CUDA by default.
ClosedPublic

Authored by tra on May 17 2016, 3:44 PM.

Details

Summary

This matches default nvcc behavior and gives substantial performance boost on GPU where fmad is much cheaper compared to add+mul.

Diff Detail

Repository
rL LLVM

Event Timeline

tra updated this revision to Diff 57534.May 17 2016, 3:44 PM
tra retitled this revision from to [CUDA] Enable fusing FP ops for CUDA by default..
tra updated this object.
tra added reviewers: jlebar, jingyue.
tra added a subscriber: cfe-commits.
jlebar edited edge metadata.May 17 2016, 3:54 PM

I am not sure we want this? Although it matches nvcc, it does not match our floating-point behavior for C++ in general -- it makes us non-IEEE-whatever compliant by default.

Although I agree that if we don't do this, lots of people are not going to pass -fp-contract=fast and resultantly will think that we're slower than nvcc. There's no way to win. :(

I am not sure we want this? Although it matches nvcc, it does not match our floating-point behavior for C++ in general -- it makes us non-IEEE-whatever compliant by default.

Although I agree that if we don't do this, lots of people are not going to pass -fp-contract=fast and resultantly will think that we're slower than nvcc. There's no way to win. :(

But people also don't expect IEEE compliance on GPUs, and also, the system default for forming FMAs has long been system specific. The default on IBM systems, for example, is generally the equivalent of -ffp-contract=fast (in both XLC and GCC).

That having been said, is this change the equivalent of -ffp-contract=fast or -ffp-contract=on? I think it is the latter and we want the former (i.e. where we let the backend be as aggressive as possible *after* inlining).

tra added a subscriber: scanon.May 17 2016, 4:17 PM

Things are even more interesting. -ffp-contract=fast is *not* what this change does. :-)

We have two places where we can fuse FP instructions -- in clang and in LLVM back-end.
Clang fuses add+mul into llvm.fmuladd intrinsic if -ffp-contract=on (default) and DefaultFPContract=1 (which is only set for OpenCL for some reason) and back-end then decides whether it's profitable to emit fused operation or not. NVPTX does emit fmad.

Compare this to -ffp-contract=fast which actually *disables* fusing in clang and instead allows LLVM backend to do fusing wherever it sees fit (as opposed to 'fuse intrinsics only'. It may potentially fuse any suitable multiply/add pair, not only those vetted by front-end.

Currently there's no way to enable front-end fusing via command line, unless you compile OpenCL source. With this patch in place for CUDA compilation we can pick either no fusing, controlled fusing by front-end or more aggressive fusing by back-end.

Setting DefaultFPContract=1 for CUDA seems to be the least evil -- it's somewhat controlled in scope and gives us a way to disable fusing completely or make it more aggressive if it's needed.

Perhaps @scanon and @hfinkel can weigh in.

tra added a comment.May 17 2016, 4:21 PM

That having been said, is this change the equivalent of -ffp-contract=fast or -ffp-contract=on? I think it is the latter and we want the former (i.e. where we let the backend be as aggressive as possible *after* inlining).

It is -ffp-contract=on. As it happens, it appears to produce better code compared to -ffp-contract=fast at least on some benchmarks. Apparently smaller IR (smaller number of intrinsic call instructions vs multiple separate mul+add) makes job easier for straight line strength reduction pass and it's able to remove more redundant calculations in unrolled loops.

-ffp-contract=on obeys the semantics of C's FP_CONTRACT pragma. In particular, it will not fuse:

float m = x*y;
float a = m + z;

Whereas you probably want that to fuse for your purposes. -ffp-contract=fast seems more in line with your needs.

In D20341#432525, @tra wrote:

That having been said, is this change the equivalent of -ffp-contract=fast or -ffp-contract=on? I think it is the latter and we want the former (i.e. where we let the backend be as aggressive as possible *after* inlining).

It is -ffp-contract=on. As it happens, it appears to produce better code compared to -ffp-contract=fast at least on some benchmarks. Apparently smaller IR (smaller number of intrinsic call instructions vs multiple separate mul+add) makes job easier for straight line strength reduction pass and it's able to remove more redundant calculations in unrolled loops.

That's certainly interesting, and frankly, something I don't immediately understand. Given that, at that level, the IR for -ffo-contract=fast is the same as -ffp-contract=off, this seems to point to some more-general problem that we should likely fix anyway.

I will say that, once templated C++ libraries become involved, the per-statement C rules for fusion often don't apply in enough places to be useful. You really need to perform the fusion after inlining. Obviously, however, for more-directly-programmed expressions, this concern does not apply.

tra added a comment.May 17 2016, 4:35 PM

OK. Consensus seems to be that -ffp-contract=fast is the way to go. I'll update the patch.
I've just checked Steve's example with nvcc and indeed it fused mul+add.

tra updated this revision to Diff 57540.May 17 2016, 4:56 PM
tra edited edge metadata.

Changed default to -ffp-contract=fast.

tra updated this revision to Diff 57541.May 17 2016, 5:07 PM

Added test case.

Is there a better way to test that correct options are passed to back-end?
This test resorts to checking assembly generated by back-end which is way too far away from what actually needs testing.

But people also don't expect IEEE compliance on GPUs

Is that true? You have a lot more experience with this than I do, but my observation of nvidia's hardware is that it's moved to add *more* IEEE compliance as it's matured. For example, older hardware didn't support denormals, but newer chips do. Surely that's in response to some users.

One of our goals with CUDA in clang is to make device code as similar as possible to host code. Throwing out IEEE compliance seems counter to that goal.

I also don't see the bright line here. Like, if we can FMA to our heart's content, where do we draw the line wrt IEEE compliance? Do we turn on flush-denormals-to-zero by default? Do we use approximate transcendental functions instead of the more accurate ones? Do we assume floating point arithmetic is associative? What is the principle that leads us to do FMAs but not these other optimizations?

In addition, CUDA != GPUs. Maybe this is something to turn on by default for NVPTX, although I'm still pretty uncomfortable with that. Prior art in other compilers is interesting, but I think it's notable that clang doesn't do this for any other targets (afaict?) despite the fact that gcc does.

The main argument I see for this is "nvcc does it, and people will think clang is slow if we don't". That's maybe not a bad argument, but it makes me sad. :(

tra added a comment.May 18 2016, 10:49 AM

I don't think using FMA throws away IEEE compliance.

IEEE 784-2008 says:

A language standard should also define, and require implementations to provide, attributes that allow and
disallow value-changing optimizations, separately or collectively, for a block. These optimizations might
include, but are not limited to:
...
― Synthesis of a fusedMultiplyAdd operation from a multiplication and an addition

It sounds like FMA use is up to user/language and IEEE standard is fine with it either way.

We need to establish what is the language standard that we need to adhere to. C++ standard itself does not seem to say much about FP precision or particular FP format.

C11 standard (ISO/IEC 9899:201x draft, 7.12.2) says:

The default state (‘‘on’’ or ‘‘off’’) for the [FP_CONTRACT] pragma is implementation-defined.

Nvidia has fairly detailed description of their FP.
http://docs.nvidia.com/cuda/floating-point/index.html#fused-multiply-add-fma

The fused multiply-add operator on the GPU has high performance and increases the accuracy of computations. No special flags or function calls are needed to gain this benefit in CUDA programs. Understand that a hardware fused multiply-add operation is not yet available on the CPU, which can cause differences in numerical results.

At the moment it's the most specific guideline I managed to find regarding expected FP behavior applicable to CUDA.

But people also don't expect IEEE compliance on GPUs

Is that true?

Yes.

You have a lot more experience with this than I do, but my observation of nvidia's hardware is that it's moved to add *more* IEEE compliance as it's matured. For example, older hardware didn't support denormals, but newer chips do. Surely that's in response to some users.

This is also true, but user expectations change slowly.

One of our goals with CUDA in clang is to make device code as similar as possible to host code. Throwing out IEEE compliance seems counter to that goal.

I also don't see the bright line here. Like, if we can FMA to our heart's content, where do we draw the line wrt IEEE compliance? Do we turn on flush-denormals-to-zero by default? Do we use approximate transcendental functions instead of the more accurate ones? Do we assume floating point arithmetic is associative? What is the principle that leads us to do FMAs but not these other optimizations?

In addition, CUDA != GPUs. Maybe this is something to turn on by default for NVPTX, although I'm still pretty uncomfortable with that. Prior art in other compilers is interesting, but I think it's notable that clang doesn't do this for any other targets (afaict?) despite the fact that gcc does.

The main argument I see for this is "nvcc does it, and people will think clang is slow if we don't". That's maybe not a bad argument, but it makes me sad. :(

In D20341#433344, @tra wrote:

I don't think using FMA throws away IEEE compliance.

IEEE 784-2008 says:

A language standard should also define, and require implementations to provide, attributes that allow and
disallow value-changing optimizations, separately or collectively, for a block. These optimizations might
include, but are not limited to:
...
― Synthesis of a fusedMultiplyAdd operation from a multiplication and an addition

It sounds like FMA use is up to user/language and IEEE standard is fine with it either way.

That's correct. FMA formation is allowed, although the default for this, and how it's done is unfortunately a function of many aspects of the programming environment (language, target platform, etc.).

We need to establish what is the language standard that we need to adhere to. C++ standard itself does not seem to say much about FP precision or particular FP format.

C11 standard (ISO/IEC 9899:201x draft, 7.12.2) says:

The default state (‘‘on’’ or ‘‘off’’) for the [FP_CONTRACT] pragma is implementation-defined.

Nvidia has fairly detailed description of their FP.
http://docs.nvidia.com/cuda/floating-point/index.html#fused-multiply-add-fma

The fused multiply-add operator on the GPU has high performance and increases the accuracy of computations. No special flags or function calls are needed to gain this benefit in CUDA programs. Understand that a hardware fused multiply-add operation is not yet available on the CPU, which can cause differences in numerical results.

At the moment it's the most specific guideline I managed to find regarding expected FP behavior applicable to CUDA.

I think this is the most important point. IEEE allows an implementation choice here, and users who already have working CUDA code have tested that code within that context. This is different from the host's choice (at least on x86), but users already expect this. There is a performance impact, but there's also a numerical impact, and I don't think we do our users any favors by differing from NVIDIA here.

jlebar accepted this revision.May 19 2016, 11:41 AM
jlebar edited edge metadata.

Well, if the CUDA documentation says so...let's do it. :) Thanks for your patience, everyone.

This revision is now accepted and ready to land.May 19 2016, 11:41 AM
tra added a subscriber: chandlerc.May 19 2016, 11:43 AM

Short version of offline discussion with @chandlerc : Default of -ffp-contract=fast for CUDA is fine.

This revision was automatically updated to reflect the committed changes.