Page MenuHomePhabricator

Consoldiate internal denormal flushing controls
ClosedPublic

Authored by arsenm on Nov 5 2019, 9:06 PM.

Details

Summary

Currently there are 4 different mechanisms for controlling denormal
flushing behavior, and about as many equivalent frontend controls.

- AMDGPU uses the fp32-denormals and fp64-f16-denormals subtarget features
- NVPTX uses the nvptx-f32ftz attribute
- ARM directly uses the denormal-fp-math attribute
- Other targets indirectly use denormal-fp-math in one DAGCombine
- cl-denorms-are-zero has a corresponding denorms-are-zero attribute

AMDGPU wants a distinct control for f32 flushing from f16/f64, and as
far as I can tell the same is true for NVPTX (based on the attribute
name).

Work on consolidating these into the denormal-fp-math attribute, and a
new type specific denormal-fp-math-f32 variant. Only ARM seems to
support the two different flush modes, so this is overkill for the
other use cases. Ideally we would error on the unsupported
positive-zero mode on other targets from somewhere.

Move the logic for selecting the flush mode into the compiler driver,
instead of handling it in cc1. denormal-fp-math/denormal-fp-math-f32
are now both cc1 flags, but denormal-fp-math-f32 is not yet exposed as
a user flag.

-cl-denorms-are-zero, -fcuda-flush-denormals-to-zero and
-fno-cuda-flush-denormals-to-zero will be mapped to
-fp-denormal-math-f32=ieee or preserve-sign rather than the old
-attributes.

Stop emitting the denorms-are-zero attribute for the OpenCL flag. It
has no in-tree users. The meaning would also be target dependent, such
as the AMDGPU choice to treat this as only meaning allow flushing of
f32 and not f16 or f64. The naming is also potentially confusing,
since DAZ in other contexts refers to instructions implicitly treating
input denormals as zero, not necessarily flushing output denormals to
zero.

This also does not attempt to change the behavior for the current
attribute. The LangRef now states that the default is ieee behavior,
but this is inaccurate for the current implementation. The clang
handling is slightly hacky to avoid touching the existing
denormal-fp-math uses. Fixing this will be left for a future patch.

AMDGPU is still using the subtarget feature to control the denormal
mode, but the new attribute are now emitted. A future change will
switch this and remove the subtarget features.

Diff Detail

Event Timeline

arsenm created this revision.Nov 5 2019, 9:06 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 5 2019, 9:06 PM
arsenm marked an inline comment as done.Nov 6 2019, 10:28 AM
arsenm added inline comments.
llvm/docs/LangRef.rst
1828–1831

On second thought I think this may be too permissive. I think based on the use in DAGCombiner, that flushing of outputs is compulsory.

arsenm marked an inline comment as done.Nov 6 2019, 3:00 PM
arsenm added inline comments.
llvm/docs/LangRef.rst
1828–1831

It turns out the fast sqrt usage really cares about input denormals being implicitly treated as 0, not the output flushing (i.e. this only needs DAZ, not FTZ). I think being permissive on the output is OK, but if implicit input flushing is required then it's compulsory and a target is responsible for inserting a flush of some kind if the use instruction isn't known to follow this mode.

Because of this, I do think it's necessary to treat this as two separate modes. I'm thinking to comma separate output-mode,input-mode, and assume input-mode=output-mode if the second half isn't specified for compatibility with the existing attribute.

arsenm updated this revision to Diff 228167.Nov 6 2019, 4:59 PM

Rename subnormal to denormal. Will defer splitting input and output setting into a future patch before switching default behavior

Stop emitting the denorms-are-zero attribute for the OpenCL flag. It
has no in-tree users. The meaning would also be target dependent, such
as the AMDGPU choice to treat this as only meaning allow flushing of
f32 and not f16 or f64. The naming is also potentially confusing,
since DAZ in other contexts refers to instructions implicitly treating
input denormals as zero, not necessarily flushing output denormals to
zero.

Would the targets supporting OpenCL need to define their own behavior in getDefaultDenormalModeForType?

clang/include/clang/Driver/ToolChain.h
619

Can you elaborate what has to be done in order to fix this?

arsenm marked an inline comment as done.Nov 7 2019, 1:03 PM

Stop emitting the denorms-are-zero attribute for the OpenCL flag. It
has no in-tree users. The meaning would also be target dependent, such
as the AMDGPU choice to treat this as only meaning allow flushing of
f32 and not f16 or f64. The naming is also potentially confusing,
since DAZ in other contexts refers to instructions implicitly treating
input denormals as zero, not necessarily flushing output denormals to
zero.

Would the targets supporting OpenCL need to define their own behavior in getDefaultDenormalModeForType?

Yes. The future ieee default should be conservatively correct though

clang/include/clang/Driver/ToolChain.h
619

The main problem is the current user assumes non-ieee by default. The main blocker is knowing what platforms should default to something different to avoid performance regressions. I have the patch almost ready to switch the default, it’s just missing toolchain overrides

arsenm updated this revision to Diff 228348.Nov 7 2019, 6:05 PM

Fix name in documentation

AMDGPU wants a distinct control for f32 flushing from f16/f64, and as far as I can tell the same is true for NVPTX (based on the attribute name).

I may be corrected, but I believe nvptx only supports ftz for f32.

Double-precision instructions support subnormal inputs and results. Single-precision instructions support subnormal inputs and results by default for sm_20 and subsequent targets, and flush subnormal inputs and results to sign-preserving zero for sm_1x targets. The optional .ftz modifier on single-precision instructions provides backward compatibility with sm_1x targets by flushing subnormal inputs and results to sign-preserving zero regardless of the target architecture.

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#floating-point-instructions

Anastasia added inline comments.Nov 19 2019, 8:23 AM
clang/lib/CodeGen/CGCall.cpp
1781

so where would denorms-are-zero be emitted now (in case some out of tree implementations rely on this)?

arsenm marked an inline comment as done.Nov 19 2019, 8:42 AM
arsenm added inline comments.
clang/lib/CodeGen/CGCall.cpp
1781

Rely on in what sense? Do you have a concrete use of this?

Anastasia added inline comments.Nov 26 2019, 9:14 AM
clang/lib/CodeGen/CGCall.cpp
1781

Since it has been emitted before in the module potentially some LLVM implementations could be using that attribute?

arsenm marked an inline comment as done.Dec 1 2019, 10:47 PM
arsenm added inline comments.
clang/lib/CodeGen/CGCall.cpp
1781

I'm disinclined to leave things around just in case some unknown user might have been using them. We've dropped attributes like this before (I think the less-precise-fp-mad one for disuse). This also isn't needed for correctness, so it should be pretty safe to drop

pengfei added a subscriber: pengfei.Dec 2 2019, 4:43 PM
scanon added inline comments.Dec 4 2019, 5:35 AM
llvm/docs/LangRef.rst
1834

Can you clarify this a little bit? I'd prefer something like "Same as `"denorm-fp-math"`, but only controls the behavior of the 32-bit float type.".

arsenm updated this revision to Diff 233059.Dec 10 2019, 4:53 AM

Reword langref, fix name in langref

Anastasia added inline comments.Dec 10 2019, 8:58 AM
clang/lib/CodeGen/CGCall.cpp
1781

Ok, fair enough!

This is looking pretty good to me, but I'm ignoring some of the target specific code that I'm not familiar with.

Is denormal-fp-math influenced by -Ofast? Or are there plans for that? Seems like -Ofast should imply DAZ and FTZ (if supported by target).

I think we discussed this before, but it's worth repeating. If denormal-fp-math isn't specified, we default to IEEE behavior, right? When this lands in master, there could be an unexpected performance hit for targets that aren't paying attention. E.g. I want to use denormal-fp-math to toggle whether a FSUB(-0.0,X) is converted to a FNEG(X) in SelectionDAGBuilder.

Apologies in advance if this has been discussed recently. I've been distracted with another project for the passed few months...

clang/lib/Driver/ToolChains/Clang.cpp
2499

Last line of comment was not removed.

Also, is it safe to remove TrappingMathPresent? Is that part of the work-in-progress to support ffp-exception-behavior?

arsenm marked an inline comment as done.Jan 6 2020, 6:58 AM

This is looking pretty good to me, but I'm ignoring some of the target specific code that I'm not familiar with.

Is denormal-fp-math influenced by -Ofast? Or are there plans for that? Seems like -Ofast should imply DAZ and FTZ (if supported by target).

Yes, through the toolchain handling. I copied the logic for when crtfastmath is linked for the default mode for x86.

I think we discussed this before, but it's worth repeating. If denormal-fp-math isn't specified, we default to IEEE behavior, right? When this lands in master, there could be an unexpected performance hit for targets that aren't paying attention. E.g. I want to use denormal-fp-math to toggle whether a FSUB(-0.0,X) is converted to a FNEG(X) in SelectionDAGBuilder.

Apologies in advance if this has been discussed recently. I've been distracted with another project for the passed few months...

Yes, ieee should be the default. The dependent patches start adding the attribute by default for platforms with flushing enabled with fast math

clang/lib/Driver/ToolChains/Clang.cpp
2499

I think this is a rebase gone bad. The patch changing the strict math was revered and recommitted and I probably broke this

arsenm added a comment.Jan 6 2020, 7:40 AM

This is looking pretty good to me, but I'm ignoring some of the target specific code that I'm not familiar with.

Is denormal-fp-math influenced by -Ofast? Or are there plans for that? Seems like -Ofast should imply DAZ and FTZ (if supported by target).

Yes, through the toolchain handling. I copied the logic for when crtfastmath is linked for the default mode for x86.

I think we discussed this before, but it's worth repeating. If denormal-fp-math isn't specified, we default to IEEE behavior, right? When this lands in master, there could be an unexpected performance hit for targets that aren't paying attention. E.g. I want to use denormal-fp-math to toggle whether a FSUB(-0.0,X) is converted to a FNEG(X) in SelectionDAGBuilder.

Apologies in advance if this has been discussed recently. I've been distracted with another project for the passed few months...

Yes, ieee should be the default. The dependent patches start adding the attribute by default for platforms with flushing enabled with fast math

To clarify this patch leaves the default and defers changing that to a later patch

Ok, thanks for the clarifications. Looks good to me, but it would be good to have experts in OpenCL/Cuda/AMDGPU review the target specific changes.

llvm/docs/LangRef.rst
1837

Can you document which targets do support the option? What happens if I try to use the option on a target where it is not supported?

arsenm marked an inline comment as done.Jan 6 2020, 10:52 AM
arsenm added inline comments.
llvm/docs/LangRef.rst
1837

I'm not sure where to document this, or if/how/where to diagnose it. I don't think the high level LangRef description is the right place to discuss specific target handling.

Currently it won't error or anything. Code checking the denorm mode will see the f32 specific mode, even if the target in the end isn't really going to respect this.

One problem is this potentially does require coordination with other toolchain components. For AMDGPU, the compiler can directly tell the driver what FP mode to set on each entry point, but for x86 it requires linking in crtfastmath to set the default mode bits. If another target had a similar runtime environment requirement, I don't think we can be sure the attribute is correct or not.

llvm/docs/LangRef.rst
1837

There is precedent for describing target-specific behavior in LangRef. It just doesn't seem useful to say that not all targets support the attribute without saying which ones do. We should also say what is expected if a target doesn't support the attribute. It seems reasonable for the function attribute to be silently ignored.

One problem is this potentially does require coordination with other toolchain components. For AMDGPU, the compiler can directly tell the driver what FP mode to set on each entry point, but for x86 it requires linking in crtfastmath to set the default mode bits.

This is a point I'm interested in. I don't like the current crtfastmath.o handling. It feels almost accidental when FTZ works as expected. My understanding is we link crtfastmath.o if we find it but if not everything just goes about its business. The Intel compiler injects code into main() to explicitly set the FTZ/DAZ control modes. That obviously has problems too, but it's at least consistent and predictable. As I understand it, crtfastmath.o sets these modes from a static initializer, but I'm not sure anything is done to determine the order of that initializer relative to others.

How does the compiler identify entry points for AMDGPU? And does it emit code to set FTZ based on the function attribute here?

arsenm marked an inline comment as done.Jan 9 2020, 1:46 PM
arsenm added inline comments.
llvm/docs/LangRef.rst
1837

The entry points are a specific calling convention. There's no real concept of main. Each kernel has an associated blob of metadata the driver uses to set up various config registers on dispatch.

I don't think specially recognizing main in the compiler is fundamentally different than having it done in a static constructor. It's still a construct not associated with any particular function or anything.

andrew.w.kaylor added inline comments.Jan 9 2020, 6:43 PM
llvm/docs/LangRef.rst
1837

The problem with having it done in a static constructor is that you have no certainty of when it will be done relative to other static constructors. If it's in main you can at least say that it's after all the static constructors (assuming main is your entry point).

llvm/docs/LangRef.rst
1837

Yes and no. The linker should honor static constructor priorities. But, yeah, there's no guarantee that this constructor will run before other priority 101 constructors.

The performance penalty for setting denormal flushing in main could be significant (think C++). Also, there's precedent for using static constructors, like GCC's crtfastmath.o.

llvm/docs/LangRef.rst
1837

Fair enough. I don't necessarily like how icc handles this. I don't have a problem with how gcc handles it. I just really don't like how LLVM does it. If we want to take the static constructor approach we should define our own, not depend on whether or not the GNU object file happens to be around.

Static initialization doesn't help for AMDGPU, and I suppose that's likely to be the case for any offload execution model. Since this patch is moving us toward a more consistent implementation I'm wondering if we can define some general rules for how this is supposed to work. Like when the function attribute will result in injected instructions setting the control flags and when it won't.

arsenm marked an inline comment as done.Jan 10 2020, 11:38 AM
arsenm added inline comments.
llvm/docs/LangRef.rst
1837

I think the most we can expect of this attribute as informing codegen of the expected FP denormal handling mode, and not something responsible for ensuring the mode will really be set. AMDGPU conceptually could have a separate set of attributes for setting the denormal FP mode, but since it would look identical, this gets a bonus usage for setting it for kernels. This doesn't protect you from calling functions in modules compiled with different attributes, so similar problems outside the view of the compiler still exist

llvm/docs/LangRef.rst
1837

If we want to take the static constructor approach we should define our own, not depend on whether or not the GNU object file happens to be around.

That's a good idea. There's subtle differences between targets in the GNU implementation. It would be good to standardize them.

arsenm updated this revision to Diff 237665.Jan 13 2020, 6:48 AM

Mention support in langref

clang/lib/Driver/ToolChains/Clang.cpp
2499

Looks like this is still wrong. You didn't intend to change either TrappingMath flag, did you?

2715–2716

Shouldn't this also restore DenormalFP32Math to its default value?

arsenm updated this revision to Diff 238670.Jan 16 2020, 5:30 PM

Forgot clang parts

arsenm marked an inline comment as done.Jan 16 2020, 5:40 PM
arsenm added inline comments.
clang/lib/Driver/ToolChains/Clang.cpp
2757

I think this should just follow along with DenormalFPMath, but I'll put this off to the later patch since this one still is slightly awkward trying to avoid changing the meaning of the absence of the option

andrew.w.kaylor accepted this revision.Jan 17 2020, 11:26 AM

I don't know if there were other reviewers who haven't commented on how you addressed their concerns, but this looks good to me.

Thanks for taking the time to improve our handling of this!

This revision is now accepted and ready to land.Jan 17 2020, 11:26 AM
cameron.mcinally accepted this revision.Jan 17 2020, 2:39 PM

LGTM too. Would be good if an expert reviewed the target-specific changes, but they seem safe enough either way.