This is an archive of the discontinued LLVM Phabricator instance.

LangRef: Add "dynamic" option to "denormal-fp-math"
ClosedPublic

Authored by arsenm on Jan 30 2023, 9:16 AM.

Details

Summary

This is stricter than the default "ieee", and should probably be the
default. This patch leaves the default alone. I can change this in a
future patch.

There are non-reversible transforms I would like to perform which are
legal under IEEE denormal handling, but illegal with flushing zero
behavior. Namely, conversions between llvm.is.fpclass and fcmp with
zeroes.

Under "ieee" handling, it is legal to translate between
llvm.is.fpclass(x, fcZero) and fcmp x, 0.

Under "preserve-sign" handling, it is legal to translate between
llvm.is.fpclass(x, fcSubnormal|fcZero) and fcmp x, 0.

I would like to compile and distribute some math library functions in
a mode where it's callable from code with and without denormals
enabled, which requires not changing the compares with denormals or
zeroes.

If an IEEE function transforms an llvm.is.fpclass call into an fcmp 0,
it is no longer possible to call the function from code with denormals
enabled, or write an optimization to move the function into a denormal
flushing mode. For the original function, if x was a denormal, the
class would evaluate to false. If the function compiled with denormal
handling was converted to or called from a preserve-sign function, the
fcmp now evaluates to true.

This could also be of use for strictfp handling, where code may be
changing the denormal mode.

Alternative name could be "unknown".

Replaces the old AMDGPU custom inlining logic with more conservative
logic which tries to permit inlining for callees with dynamic handling
and avoids inlining other mismatched modes.

Diff Detail

Event Timeline

arsenm created this revision.Jan 30 2023, 9:16 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 30 2023, 9:16 AM
arsenm requested review of this revision.Jan 30 2023, 9:16 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 30 2023, 9:16 AM
Herald added a subscriber: wdng. · View Herald Transcript
tra added a comment.Jan 30 2023, 11:20 AM

My $.02, mostly on the style and the mechanics of applying the new attribute. FP semantics aspects are above my pay grade.

clang/lib/CodeGen/CGCall.cpp
2059–2061

IIUIC, this changes denorm mode attributes on the functions with dynamic denorm mode that we link in.
Will that be a problem if the same function, when linked into different modules, would end up with different attributes? E.g. if a function is externally visible and is intended to be common'ed across multiple modules. Should dynamic denorm mode be restricted to the functions private to the module only? We do typically internalize linked bitcode for CUDA, but I don't think it's something we can always implicitly assume.

clang/test/CodeGenCUDA/Inputs/ocml-sample.cl
12

Cosmetic nit: order functions as f16/f32/f64?

clang/test/CodeGenCUDA/link-builtin-bitcode-denormal-fp-mode.cu
16

Do we want to verify that the compiled samples have the correct function attributes?

21

Would it be useful to check the attributes the functions have w/o linking the sample bitcode? In these tests one can infer expected attributes from the flags and comments, so I'm fine not having explicit checks for that.

78

Nit: CHECK-LABEL ?

91–92

I assume these refer to linked in functions, not their calls. It may be useful to include match define/call to make it obvious.

95

I'm not sure whether it does what it's intended to.
AFAICT, at this point we will be past the call sites, so if it's intended to check the call sites in kernel_*, it will likely always succeed, even if we do litter call sites with unwanted attributes.

It's also possible that I have a wrong idea about what the expected IR looks like. If you could post it for reference, that would be helpful.

llvm/test/CodeGen/Generic/denormal-fp-math-cl-opt.ll
4

Edit: Check that the command line flag annotates the IR with the appropriate attributes.

arsenm marked 2 inline comments as done.Jan 31 2023, 7:01 AM
arsenm added inline comments.
clang/lib/CodeGen/CGCall.cpp
2059–2061

The whole point of -mlink-builtin-bitcode is to apply the attributes for the current compilation to what's linked in. The linked functions are always internalized. The only case where we might not want to internalize is for weak symbols (but it looks like we do internalize those today, but this is something I've thought about changing). I'll add a test with a weak library function

In the weak case the right thing to do is probably to not change from dynamic, simply because this linking process is outside of the user's control.

clang/test/CodeGenCUDA/link-builtin-bitcode-denormal-fp-mode.cu
16

Maybe, but that's already tested separately. This test is a bit complex as it is (and could maybe use a few more combinations)

78

error: found 'CHECK-LABEL:' with variable definition or use

95

I can drop this, I later added the -implicit-check-not=denormal-fp-math to all the FileChecks

arsenm marked an inline comment as done.Jan 31 2023, 7:21 AM
arsenm added inline comments.
clang/lib/CodeGen/CGCall.cpp
2059–2061

It turns out we apply attributes prior to internalization. As a separate patch, we can either:

  1. Skip functions that start as interposable, which has an observable change in the IR as it is
  2. Move the link and internalize before setting attributes. This would be unobservable but would catch it if the internalization behavior ever changed
arsenm updated this revision to Diff 493639.Jan 31 2023, 8:49 AM

Fix losing target-cpu

arsenm updated this revision to Diff 493653.Jan 31 2023, 9:49 AM

Fix dropping target-cpu. Also skip interposable functions if we aren't internalizing (this seems to be a theoretical concern, since PropagateAttrs and Internalize are set as a pair)

tra added a comment.Jan 31 2023, 10:36 AM

LGTM for the parts I've commented on.

clang/test/CodeGenCUDA/link-builtin-bitcode-denormal-fp-mode.cu
78

Interesting.
In that case label and attribute checks could be separated into something like this:

CHECK-LABEL: name
CHECK-SAME: [[attribute]]

Up to you.

llvm/test/CodeGen/Generic/denormal-fp-math-cl-opt.ll
4

^^ The comment still needs to be edited.

kpn added a comment.Feb 1 2023, 11:26 AM

We use "dynamic" for the constrained intrinsics. I'd stay consistent with our terminology and stick with "dynamic" here.

I like the amount of testing. You may have gotten every single combination of cases, but I didn't go far enough to check.

llvm/test/Transforms/Inline/AMDGPU/inline-denormal-fp-math.ll
78

Are we changing the behavior in a way that may cause regressions? It looks like we've changed behavior in the absence of "dynamic".

arsenm marked an inline comment as done.Feb 1 2023, 11:29 AM
arsenm added inline comments.
llvm/test/Transforms/Inline/AMDGPU/inline-denormal-fp-math.ll
78

This case is broken to begin with, calling ieee from daz code. This makes the inlining more conservative / noticeable for debugging

kpn added inline comments.Feb 1 2023, 11:36 AM
llvm/test/Transforms/Inline/AMDGPU/inline-denormal-fp-math.ll
78

Can I talk you into mentioning this in your commit message?

arsenm updated this revision to Diff 494417.Feb 2 2023, 12:52 PM
arsenm edited the summary of this revision. (Show Details)

Address comments

Looking at the attribute logic here, there is conceptual room for both a dynamic and an unknown mode (i.e., you get a top and a bottom value), but I don't think there is value in distinguishing between them, so I'm fine with keeping just a dynamic.

I didn't bother to look at the clang changes, and my only comments are some minor ones around documentation:

llvm/docs/LangRef.rst
2241–2258

This isn't your fault, but I noticed when reading the LangRef online that this paragraph has a slightly-different indentation that causes most of this attribute's documentation to gain an extra level of indentation.

2243–2244

I feel like the description of this mode should mention that whether or not denormals are flushed is derived from the dynamic state of the FP environment.

arsenm updated this revision to Diff 494465.Feb 2 2023, 4:18 PM
arsenm marked 2 inline comments as done.

Documentation fixes

In general, it seems like the denormal mode should be considered part of the floating point environment (though as far as I know the C standard, at least, doesn't document it as such). If it were considered part of the floating point environment, the LLVM rules would tell us we could assume the default setting, which I'd assume to be IEEE, and it would only be legal to change this mode in strict mode. However, for your use case preserving the behavior of fpclass seems like what users would want, even in fast-math modes. In this sense, this is a lot like the problem we have with preserving isnan() behavior when fast-math is enabled. Our rules allow it, but it's not what most people would want.

I think the new denormal mode is a good addition.

Do you need to do something with the inliner to handle the case where functions with different denormal modes are inlined into one another? We don't seem to handle that case correctly now (https://godbolt.org/z/PEsWaMEq6), but with the dynamic mode we could handle it without blocking inlining completely.

In general, it seems like the denormal mode should be considered part of the floating point environment (though as far as I know the C standard, at least, doesn't document it as such).

There’s no standardization of denormal flushing. OpenCL defines a flag for it but doesn’t really specify what it really means.

If it were considered part of the floating point environment, the LLVM rules would tell us we could assume the default setting, which I'd assume to be IEEE, and it would only be legal to change this mode in strict mode.

It is. The attribute is informative of the default mode. If we really wanted, a similar attribute could declare the assumed rounding mode if we really wanted for the function. It doesn’t imply you can change it dynamically, just that it should be in that mode before the function executes.

Do you need to do something with the inliner to handle the case where functions with different denormal modes are inlined into one another? We don't seem to handle that case correctly now (https://godbolt.org/z/PEsWaMEq6), but with the dynamic mode we could handle it without blocking inlining completely.

This patch fixed some bugs in the inlining to make it more conservative in undefined looking cases. The inlining of dynamic functions fully works

Not entirely sure where the best place to effect this (I think somewhere in the clang driver code?), but on further reflection, it feels like strict fp-model in clang should set the denormal mode to dynamic.

Not entirely sure where the best place to effect this (I think somewhere in the clang driver code?), but on further reflection, it feels like strict fp-model in clang should set the denormal mode to dynamic.

I was thinking of changing the default in general to dynamic. I was going to at least change the strictfp default in a follow up

kpn added a comment.Feb 16 2023, 10:19 AM

What's the plan for tying this to strictfp? Because I don't it should be tied to cases where we use the constrained intrinsics but the exceptions are ignored and the default rounding is in stated. Those instructions are supposed to behave the same as the non-constrained instructions. So keying off the presence of the strictfp attribute on the function definition, or the (equivalent) presence of constrained intrinsics, would be too simple.

I don't see an obvious connection between denormals and exception behavior, and the rounding mode has the same problem. It would be surprising if changing the rounding mode changed denormal handling or optimization even when the new rounding mode would have identical results. It would also be surprising if changing the rounding mode back to the default round-to-nearest changed how denormals are handled.

Would we get different denormal behavior with a clang flag vs using a #pragma at the top of a source file? That seems surprising as well.

In the abstract I can see lumping in denormal handing with the rest of the FP environment handling. But in the LLVM context I don't see how we can tie use of the constrained intrinsics to denormals.

What's the plan for tying this to strictfp? Because I don't it should be tied to cases where we use the constrained intrinsics but the exceptions are ignored and the default rounding is in stated. Those instructions are supposed to behave the same as the non-constrained instructions. So keying off the presence of the strictfp attribute on the function definition, or the (equivalent) presence of constrained intrinsics, would be too simple.

The denormal mode is exactly parallel to the rounding mode, we just don't have a mirrored field in the constrained intrinsic metadata operands. If we defaulted to using the dynamic mode if you were to use strictfp, everything would be OK. You just couldn't optimize based on knowledge of the denormal mode. I don't really think it's worth putting in the same optimization effort as the rounding mode.

I was thinking of changing the default in general to dynamic. I was going to at least change the strictfp default in a follow up

I had the same thought too, but I reflected a little further that the default fp model implying that the environment being in the default state means we can assume the FTZ/DAZ are also in a default (IEEE) state.

What's the plan for tying this to strictfp? Because I don't it should be tied to cases where we use the constrained intrinsics but the exceptions are ignored and the default rounding is in stated. Those instructions are supposed to behave the same as the non-constrained instructions. So keying off the presence of the strictfp attribute on the function definition, or the (equivalent) presence of constrained intrinsics, would be too simple.

The way I see it, strictfp is an assertion that every FP instruction has a dependency on the FP environment, which is largely orthogonal to the denormal-mode attribute asserting that the FTZ/DAZ bits in the FP environment have a particular value. The constrained intrinsics also have the ability to assert some properties of the FP environment (specifically, rounding mode and exception behavior) on individual instructions. By not adding any metadata to constrained intrinsics at the same time, we don't get the ability to set the denormal-mode on a per-instruction basis-but I don't think there's much value to be gained by doing so (giving that we already have it at a per-function level).

Would we get different denormal behavior with a clang flag vs using a #pragma at the top of a source file? That seems surprising as well.

One of the consequences of having so many different ways of controlling compiler FP environment assumptions is that there's a crazy amount of interactions to consider. But I think there is ultimately a workable solution for the clang frontend to generate interactions that make sense.

pengfei added inline comments.Feb 16 2023, 10:31 PM
llvm/docs/LangRef.rst
2244
  1. Does it mean users must specify dynamic when they change FTZ/DAZ in a function?
  2. If 1) is true, is there a way to partially set on functions in its call stack, e.g.,
main
  f0
  f1
    f10
      setFtzDaz(true);
  f2
  f3

Ideally, users may want to tell compiler main, f1, f10 is dynamic, while f0 is ieee and f2, f3 is positive-zero, rather than dynamic for all.

  1. If 2) is true, it looks silly to do it one by one manually. Should compiler help to deduce this information itself?
arsenm added inline comments.Feb 17 2023, 2:05 PM
llvm/docs/LangRef.rst
2244

This is really a question of how strictfp should interact with the default mode and shouldn't be a different policy from how strictfp functions treat the rounding mode. Arbitrary strictfp functions don't make assumptions based on the default rounding mode, assuming it's not the default. In that sense, denormal-fp-mode doesn't really matter for strictfp functions. They just can't make use of it to optimize. If we had a denormal annotation like the rounding mode, we could make use of it in the same way

This isn't an area for the backend to deduce, semantic meaning needs to be specific and explicit. I have no interest in making changing the denormal mode simple or easy. Turning on flushing isn't really semantically desirable and is basically obsolete on modern hardware.

I was thinking of changing the default in general to dynamic. I was going to at least change the strictfp default in a follow up

I had the same thought too, but I reflected a little further that the default fp model implying that the environment being in the default state means we can assume the FTZ/DAZ are also in a default (IEEE) state.

What's the plan for tying this to strictfp? Because I don't it should be tied to cases where we use the constrained intrinsics but the exceptions are ignored and the default rounding is in stated. Those instructions are supposed to behave the same as the non-constrained instructions. So keying off the presence of the strictfp attribute on the function definition, or the (equivalent) presence of constrained intrinsics, would be too simple.

The way I see it, strictfp is an assertion that every FP instruction has a dependency on the FP environment, which is largely orthogonal to the denormal-mode attribute asserting that the FTZ/DAZ bits in the FP environment have a particular value. The constrained intrinsics also have the ability to assert some properties of the FP environment (specifically, rounding mode and exception behavior) on individual instructions. By not adding any metadata to constrained intrinsics at the same time, we don't get the ability to set the denormal-mode on a per-instruction basis-but I don't think there's much value to be gained by doing so (giving that we already have it at a per-function level)

I think this is not really useful for strictfp. We could make use of this if we were to inline non-strictfp functions into strictfp functions if we had a denormal mode annotation on the constrained intrinsic, which we don't. Without that, strictfp functions can't make assumptions based on the default rounding mode so this doesn't inform anything useful in the presence of a changeable mode

I'm studiously ignoring the Clang and LLVM codegen changes here, but otherwise, I think the direction of this change is generally good.

llvm/lib/Analysis/ConstantFolding.cpp
1380–1382

You should change the doxygen documentation to indicate that this method returns nullptr if the denormal mode is dynamic.

Ditto for ConstantFoldFPInstOperands.

arsenm updated this revision to Diff 505274.Mar 14 2023, 2:52 PM
arsenm marked an inline comment as done.

Update doxygen comment

arsenm added reviewers: Restricted Project, rampitec, foad, Pierre-vh.Apr 7 2023, 3:53 PM

I'm having trouble understanding the changes on the clang side.

If I'm following correctly; the "denormal-fp-math" setting is a promise from the user to the compiler: if the setting is not "dynamic", the user promises that the definition will only execute in the specified denormal mode. This is similar to, for example, the rounding mode pragmas: the user promises a specific rounding mode unless they explicitly request dynamic rounding.

Given that, I don't follow the whole "merging" thing... we should just be setting whatever mode is active. The attribute setting should not depend on whether the function is interposable. If you have a ODR function, all definitions must have a mode compatible with whatever mode will be used at runtime. If you have a non-ODR weak function, optimizations shouldn't propagate that mode from the callee to the caller.

Given that, I don't follow the whole "merging" thing... we should just be setting whatever mode is active. The attribute setting should not depend on whether the function is interposable. If you have a ODR function, all definitions must have a mode compatible with whatever mode will be used at runtime. If you have a non-ODR weak function, optimizations shouldn't propagate that mode from the callee to the caller.

The point is to have code that works with either mode. The intent is to avoid duplicating code for an extremely marginal difference. Having to duplicate every library function for denormal flushing and ieee handling defeats the purpose. The specialization will be realized after linking / inlining

If you have a library function that's built with "denormal-fp-math"="dynamic,dynamic", you can link it into code built in any mode, and LTO should be able to propagate that mode from the caller to the callee. That doesn't require clang to do anything special; you can just specify -fdenormal-fp-math=dynamic while building the library, and the user specifies -fdenormal-fp-math=ieee while building their code.

I guess you're worried specifically about ODR inline functions, defined in headers? The user specifies a specific mode because they know their code honors it... but the user might not be aware of the effect on functions defined in library headers. Other libraries in the same binary might use the same header, but specify a different mode. So if the user specifies a denormal mode, we should ignore it for ODR inline functions, because they didn't actually mean to apply the denormal mode to those definitions?

I'm not sure about applying those semantics automatically; I don't think there's any precedent in clang for anything like this. The closest thing I can think of is -fvisibility-inlines-hidden. I'd prefer to RFC it separately from the rest of the patch, and loop in clang frontend owners, since the precedent we set here will apply to other sorts of attributes.

If you have a library function that's built with "denormal-fp-math"="dynamic,dynamic", you can link it into code built in any mode, and LTO should be able to propagate that mode from the caller to the callee. That doesn't require clang to do anything special; you can just specify -fdenormal-fp-math=dynamic while building the library, and the user specifies -fdenormal-fp-math=ieee while building their code.

That's essentially what this does. I think the part you are missing is the existing special treatment of the builtin device library functions. The default set of attributes for the current translation unit is forcibly set on functions in bitcode libraries linked in and internalized with -mlink-builtin-bitcode. We need to logically merge with the current translation unit's mode, or else we're potentially breaking the linked in function. The main reason I'm doing this in the first place is to move towards a model with less special treatment of these libraries.

I guess you're worried specifically about ODR inline functions, defined in headers? The user specifies a specific mode because they know their code honors it... but the user might not be aware of the effect on functions defined in library headers. Other libraries in the same binary might use the same header, but specify a different mode. So if the user specifies a denormal mode, we should ignore it for ODR inline functions, because they didn't actually mean to apply the denormal mode to those definitions?

No, the clang changes are to handle the headerless bitcode-only device libraries only. The user is supposed to be unaware the builtin libraries exist. They're an implementation detail managed by the clang driver (-mlink-builtin-bitcode is a cc1 only flag) and have a special contract with the compiler.

I'm not sure about applying those semantics automatically; I don't think there's any precedent in clang for anything like this. The closest thing I can think of is -fvisibility-inlines-hidden. I'd prefer to RFC it separately from the rest of the patch, and loop in clang frontend owners, since the precedent we set here will apply to other sorts of attributes.

This isn't new, isn't end user facing, and isn't general purpose. This is the minimum required update to the existing -mlink-builtin-bitcode handling.

efriedma accepted this revision.Apr 25 2023, 10:10 AM

Oh, sorry, I missed that the new code specifically runs on functions imported using -mlink-builtin-bitcode. I somehow thought it was running on all functions.

LGTM

This revision is now accepted and ready to land.Apr 25 2023, 10:10 AM