
Require 'contract' fast-math flag for FMA generation
Needs ReviewPublic

Authored by andrew.w.kaylor on Oct 28 2021, 2:35 PM.

Details

Summary

Currently, the target-independent backend codegen will allow the generation of FMA instructions if *any* of the following holds: the 'contract' fast-math flag is set, TargetOptions::AllowFPOpFusion is set to FPOpFusion::Fast, or the TargetOptions::UnsafeFPMath flag is set. This allows FP contraction to be controlled by means other than the IR, and it prevents a front end from using the IR to enable fusion in some functions while disabling it in others.

Note: This change would render the clang -ffp-contract=fast-honor-pragma option obsolete. It also makes the llc -fp-contract option non-functional. These options will be removed in a later patch.

Also note: There are 17 additional lit tests that fail with this change. I updated the tests for the X86 and AMDGPU backends (one backend I was familiar with and one I wasn't). It's tedious work, so I didn't want to update all of the tests before getting feedback on this direction. Obviously, I'd fix all the tests before committing this patch. There may be a change needed in the front end for CUDA and HIP support before this patch is committed, but I'd like to keep that separate.

I'll send an RFC to llvm-dev to draw more attention to this proposed change.

Diff Detail

Event Timeline

andrew.w.kaylor requested review of this revision. Oct 28 2021, 2:35 PM
Herald added a project: Restricted Project. Oct 28 2021, 2:35 PM

I would rephrase the description as removing the global flag for contraction

llvm/test/CodeGen/AMDGPU/fdot2.ll
37

This is somewhat surprising but I guess it works

qiucf added a subscriber: qiucf. Oct 31 2021, 11:17 PM
tra added a subscriber: tra. Nov 1 2021, 11:54 AM

There may be a change needed to the front end for CUDA and HIP support before this patch is committed, but I'd like to keep that separate.

Do I understand correctly that the proposal is to explicitly annotate each instruction, instead of relying on LLVM to infer FP contraction from the global flag? We should still be able to use -ffp-contract=fast in clang; only now it would affect IR generation directly.

I would rephrase the description as removing the global flag for contraction

Maybe we should be even more specific -- replace the global flag for contraction with explicit flags on the instructions.

clang/test/CodeGenCUDA/fp-contract.cu
3–7

We do need to have a way to preserve current behavior for CUDA compilation. There are many existing users that implicitly assume it.

I would rephrase the description as removing the global flag for contraction

This change also removes the behavior of the "unsafe-fp-math" function attribute enabling FP contraction. When I started, I didn't realize that the handling of the global option and the handling of the function attributes were so closely related. I guess this needs some more discussion. I'd like to see the function attributes phased out where the same semantics can be expressed by fast-math flags, but as discussed on the llvm-dev email thread, this may require updates to keep from breaking out-of-tree front ends.

clang/test/CodeGenCUDA/fp-contract.cu
3–7

When you say you want to preserve "the current behavior" do you mean using "fast" as the default "fp-contract" setting, or also ignoring fp-contract-related pragmas when fp-contract=fast is used?

I certainly understand wanting fp-contract=fast to be the default behavior, but I'm puzzled by the difference in behavior that was introduced between HIP and CUDA, wherein HIP respects the pragmas but CUDA doesn't.

arsenm added a comment. Nov 1 2021, 3:05 PM

I would rephrase the description as removing the global flag for contraction

This change also removes the behavior of the "unsafe-fp-math" function attribute enabling FP contraction.

This attribute has never been handled consistently in the backend. We have attributes approximately corresponding to each of the individual fast math flags, so the mirror of the IR controls would be to just set all of those attributes. I've never been sure how unsafe-fp-math fits in here, other than just a convenience to avoid setting all the others. I would also like to get rid of it, since you should be checking the individual properties anyway.

tra added inline comments. Nov 1 2021, 3:28 PM
clang/test/CodeGenCUDA/fp-contract.cu
3–7

The idea is not to disturb the status quo. Major CUDA users are sort of used to clang being reasonably close to what NVCC does by default; what that is, exactly, is not always clear. The current state of affairs has been working well enough. Changing how FP code gets compiled will likely trigger a noticeable number of test failures due to both numerical differences and performance regressions. For the former we have somewhat decent coverage in TensorFlow; performance regressions would be harder to spot.

I can test the patch on our tensorflow tests and see how it fares.

If there are nontrivial failures, we will need to consider how to phase in the changes without causing unnecessary trouble for the users and/or give them an escape hatch option to keep things working until they can fix their code or tests.

puzzled by the difference in behavior that was introduced between HIP and CUDA

The details on HIP's need for fp-honor-pragma is in https://github.com/llvm/llvm-project/commit/cb08558caa3bad69213b08e6361586491232c745

For CUDA things were still working well enough with -ffp-contract=fast, so there was no need to change things.

clang/test/CodeGenCUDA/fp-contract.cu
3–7

What I'd like to understand is whether CUDA requires ignoring the pragma when fp-contract=fast is set or if it just needs to use fp-contract=fast by default and doesn't mind that the pragma is ignored. I understand why HIP would want to honor the pragma, and I'd like that to be the normal behavior of fp-contract=fast for all targets.

I see that CUDA does respect "#pragma clang fp contract(off)" as a way to disable contraction if the global setting is "fp-contract=on" (https://godbolt.org/z/4d7En36En), so I don't understand why we wouldn't want the pragma to also work with "fp-contract=fast".

Also, Zahira Ammarguellat is working on a patch to align the clang behavior and documentation (https://reviews.llvm.org/D107994). We're trying not to break the CUDA behavior in the process. Could you take a look at that patch and provide feedback? Thanks!

tra added inline comments. Nov 2 2021, 12:38 PM
clang/test/CodeGenCUDA/fp-contract.cu
3–7

What I'd like to understand is whether CUDA requires ignoring the pragma when fp-contract=fast is set

I don't think so, but I'm not sure. I think ignoring the pragmas was an unfortunate side effect of the backend's 'Fast' FP fuse option not respecting the contract flag, as mentioned in D90174. Whether someone happens to rely on it is hard to tell.

or if it just needs to use fp-contract=fast by default and doesn't mind that the pragma is ignored.

I think this is roughly the case. We used fp-contract=fast because it matched what we get from nvcc, which does ignore clang pragmas: https://godbolt.org/z/fGW33fo4v

If that behavior changes, it *may* be visible to CUDA users. On the other hand, it should not be widespread and, arguably, clang pragmas that may come from clang headers or the standard library should be respected.

I see that CUDA does respect "#pragma clang fp contract(off)" as a way to disable contraction if the global setting is "fp-contract=on" (https://godbolt.org/z/4d7En36En), so I don't understand why we wouldn't want the pragma to also work with "fp-contract=fast".

We want GPU-side code to behave the same as the host-side code. For GPUs we rely on some math functions to be provided by the standard library and others are provided by CUDA headers and libdevice bitcode. I suspect that may be one of the reasons for CUDA compilation to ignore clang pragmas, because they would not be present in the GPU-side variants.

AFAICT, it's not that we want fp-contract=fast to ignore pragmas, but that it just conveniently happened to enable fma everywhere.
We could use a separate option to disable fp contract pragmas for that and let fp-contract=fast honor the pragmas.

andrew.w.kaylor added inline comments. Nov 2 2021, 3:02 PM
clang/test/CodeGenCUDA/fp-contract.cu
3–7

It's not surprising that nvcc would ignore the clang-specific pragma. Also not surprisingly, if I use "#pragma STDC FP_CONTRACT OFF" instead, it doesn't respect that either (though in that case I get a warning saying "unrecognized STDC pragma").

It seems very likely that CUDA developers just won't use these pragmas. If for whatever reason they do, I think it makes sense to have the pragma behave the same as it would for other targets. I'd also be OK with having clang issue a diagnostic and always disregard the pragma when compiling CUDA code (though I don't think that's the best choice). The current situation where the pragma works with fp-contract=on but not with fp-contract=fast just seems like a bug to me. Basically what I'm saying is that when compiling CUDA code, we should either not allow the pragma or it should work.

One of the things I'm trying to accomplish with this patch is to fix the target-independent backend to have it generate instructions that implement the semantics of the IR produced by the front end. So if the IR says no contract, there would be no global option that would cause the backend to do otherwise. The upshot of this would be that the pragma could be used to disable contraction, even if fp-contract=fast were used as the base setting for the compilation unit.

I feel very strongly that this is the way the backend should work relative to the IR. It sounds like we can have the front end continue working as it does for all targets, including HIP and CUDA (which would be my preference). This would effectively eliminate the need for the fp-contract=fast-honor-pragmas setting, since all targets would always honor the pragmas for fp-contract=fast.

Are there other CUDA stakeholders that I should reach out to for feedback on this?

This attribute has never been handled consistently in the backend. We have attributes approximately corresponding to each of the individual fast math flags, so the mirror of the IR controls would be to just set all of those attributes. I've never been sure how unsafe-fp-math fits in here, other than just a convenience to avoid setting all the others. I would also like to get rid of it, since you should be checking the individual properties anyway.

I'm glad to hear that. It looked like you (or someone else) had put a good bit of work into making various combinations of these options behave consistently with the AMDGPU backend, and I wasn't sure if you needed the global settings for some use case.

Can you add your feedback on my RFC thread (https://lists.llvm.org/pipermail/llvm-dev/2021-October/153460.html)? I think we're approaching consensus there, but your input would be very valuable.

tra added a subscriber: jdoerfert. Nov 2 2021, 4:20 PM
tra added inline comments.
clang/test/CodeGenCUDA/fp-contract.cu
3–7

I agree with your argument in principle.

I also think that we do need a special case to preserve this CUDA quirk. I'm fine with making the standard options behave the way you describe.
But we do need to give CUDA users an escape hatch option to ignore pragmas if they run into issues.

Clang is in an unfortunate position: we want host and GPU code to behave identically, but parts of the GPU-side implementation are provided by NVIDIA and we can't change them. The only practical option I see is to have a way to preserve the current behavior when we have to.

Without the escape hatch option, the only recourse we'll have is to revert the patch.
FP compilation changes are known to bring surprises, so the question is what we're going to do about the issues, not whether we'll see them. IMO an escape hatch combined with incremental follow-up improvements is a better strategy compared to multiple patch/revert cycles.

Are there other CUDA stakeholders that I should reach out to for feedback on this?

For CUDA front-end, it's probably myself and @yaxunl as HIP shares most of the front-end functionality. OpenMP folks' (@jdoerfert ?) work overlaps some of the areas (they use NVPTX back-end). They probably would have more input on numerics, but they likely have different constraints as OpenMP doesn't need to match NVCC.