
Let clang atomic builtins fetch add/sub support floating point types
Needs Review · Public

Authored by yaxunl on Dec 19 2019, 1:14 PM.

Details

Summary

Recently atomicrmw started to support fadd/fsub:

https://reviews.llvm.org/D53965

However, the clang atomic fetch add/sub builtins still do not support emitting atomicrmw fadd/fsub.

This patch adds that.

This patch also adds a target hook for atomic operation support, and diagnostics for unsupported atomic operations.


Event Timeline

yaxunl created this revision. Dec 19 2019, 1:14 PM

This generally seems fine. Does it work on most backends? I want to make sure it doesn't fail in backends :)

Also, @ldionne / @EricWF / @mclow.lists do you need this in libc++ for floating-point atomic support?

jfb added a subscriber: __simt__. Dec 19 2019, 2:53 PM
yaxunl added a comment (edited). Dec 20 2019, 7:35 AM
In D71726#1791904, @jfb wrote:

This generally seems fine. Does it work on most backends? I want to make sure it doesn't fail in backends :)

For x86_64, amdgcn, aarch64, armv7, and mips64, it is translated to cmpxchg by AtomicExpandPass, and the backends generated code successfully.

For hexagon and riscv32, it is translated to a call to __atomic_fetch_add_4 for fadd on float. This is concerning. We probably need to add __atomic_fetch_{add|sub}_{f16|f32|f64}?
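To make the concern concrete: a sized integer libcall such as __atomic_fetch_add_4 adds raw bit patterns, which is not the same operation as a floating-point add, so routing fadd float through it would be wrong without a compare-exchange loop around a real fadd. A minimal illustration (add_as_bits is a made-up helper for this sketch, not part of any runtime):

```cpp
#include <cstdint>
#include <cstring>

// Integer addition on float bit patterns is not floating-point addition.
// This is why an integer fetch-add libcall cannot implement fadd directly.
inline float add_as_bits(float a, float b) {
  uint32_t ia, ib;
  std::memcpy(&ia, &a, sizeof ia);
  std::memcpy(&ib, &b, sizeof ib);
  uint32_t sum = ia + ib;  // what an integer fetch-add would compute
  float r;
  std::memcpy(&r, &sum, sizeof r);
  return r;
}
```

For example, the bit patterns of 1.0f (0x3F800000) and 2.0f (0x40000000) sum to 0x7F800000, which is +infinity, not 3.0f.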

arsenm added a subscriber: arsenm. Jan 2 2020, 7:53 AM
arsenm added inline comments.
clang/lib/CodeGen/CGAtomic.cpp
612–614

Should this really be based on the type, or should the builtin name be different for FP?

In D71726#1791904, @jfb wrote:

This generally seems fine. Does it work on most backends? I want to make sure it doesn't fail in backends :)

For x86_64, amdgcn, aarch64, armv7, and mips64, it is translated to cmpxchg by AtomicExpandPass, and the backends generated code successfully.

For hexagon and riscv32, it is translated to a call to __atomic_fetch_add_4 for fadd on float. This is concerning. We probably need to add __atomic_fetch_{add|sub}_{f16|f32|f64}?

For architectures with load-link/store-conditional, like ARM / PPC / base RISC-V without extensions, I would imagine that using a cmpxchg loop is much worse than simply doing the floating-point add/sub in the middle of the atomic mini-transaction. I'm sure we want back-ends to be capable of implementing this better than what this pass is doing, even when they don't have "native" fp atomics.

You listed amdgcn... what does this do on nvptx?

In D71726#1791904, @jfb wrote:

This generally seems fine. Does it work on most backends? I want to make sure it doesn't fail in backends :)

For x86_64, amdgcn, aarch64, armv7, and mips64, it is translated to cmpxchg by AtomicExpandPass, and the backends generated code successfully.

For hexagon and riscv32, it is translated to a call to __atomic_fetch_add_4 for fadd on float. This is concerning. We probably need to add __atomic_fetch_{add|sub}_{f16|f32|f64}?

For architectures with load-link/store-conditional, like ARM / PPC / base RISC-V without extensions, I would imagine that using a cmpxchg loop is much worse than simply doing the floating-point add/sub in the middle of the atomic mini-transaction. I'm sure we want back-ends to be capable of implementing this better than what this pass is doing, even when they don't have "native" fp atomics.

You listed amdgcn... what does this do on nvptx?

Targets can implement shouldExpandAtomicRMWInIR for the desired behavior, which NVPTX currently does not implement. Looking at AtomicExpandPass, it looks like either the cmpxchg or LL/SC expansions should work for the FP atomics already.

yaxunl updated this revision to Diff 265325. May 20 2020, 12:38 PM
yaxunl added a reviewer: arsenm.

rebase

yaxunl marked 2 inline comments as done. May 20 2020, 1:28 PM
In D71726#1791904, @jfb wrote:

This generally seems fine. Does it work on most backends? I want to make sure it doesn't fail in backends :)

For x86_64, amdgcn, aarch64, armv7, and mips64, it is translated to cmpxchg by AtomicExpandPass, and the backends generated code successfully.

For hexagon and riscv32, it is translated to a call to __atomic_fetch_add_4 for fadd on float. This is concerning. We probably need to add __atomic_fetch_{add|sub}_{f16|f32|f64}?

For architectures with load-link/store-conditional, like ARM / PPC / base RISC-V without extensions, I would imagine that using a cmpxchg loop is much worse than simply doing the floating-point add/sub in the middle of the atomic mini-transaction. I'm sure we want back-ends to be capable of implementing this better than what this pass is doing, even when they don't have "native" fp atomics.

You listed amdgcn... what does this do on nvptx?

Targets can implement shouldExpandAtomicRMWInIR for the desired behavior, which NVPTX currently does not implement. Looking at AtomicExpandPass, it looks like either the cmpxchg or LL/SC expansions should work for the FP atomics already.

nvptx is similar to hexagon and riscv32: fp atomics are translated to calls to __atomic_fetch_add_4.

Since currently only amdgcn supports fp atomics, I am going to add a TargetInfo hook indicating whether fp atomics are supported, and only emit fp atomics for targets that support them.

clang/lib/CodeGen/CGAtomic.cpp
612–614

I think the original names are better; the builtins do exactly what they are intended to do. They previously could not handle fp types, and therefore emitted diagnostics when fp types were passed to them. Now they can handle fp types.

In D71726#1791904, @jfb wrote:

This generally seems fine. Does it work on most backends? I want to make sure it doesn't fail in backends :)

Also, @ldionne / @EricWF / @mclow.lists do you need this in libc++ for floating-point atomic support?

Yes, I guess we do in order to implement fetch_add & friends on floating point types (https://wg21.link/P0020R6).

The builtins would need to work on float, double, and long double. The code seems to suggest they do; however, the tests only check float. Does this support __atomic_fetch_{add,sub} on double and long double?
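For reference, a fetch add on any of these types can already be emulated with the type-generic __atomic builtins (GCC/clang extensions that accept arbitrary trivially copyable types); this is roughly the cmpxchg-loop fallback discussed below. atomic_fetch_add_fp is a hypothetical helper name, not an existing API:

```cpp
// Emulate an atomic floating-point fetch-add with a compare-exchange loop,
// using the type-generic __atomic builtins (GCC/clang extensions).
template <class T>
T atomic_fetch_add_fp(T *p, T v) {
  T old_val;
  __atomic_load(p, &old_val, __ATOMIC_RELAXED);
  T desired;
  do {
    desired = old_val + v;  // the floating-point add itself
    // On failure, old_val is refreshed with the current value; retry.
  } while (!__atomic_compare_exchange(p, &old_val, &desired,
                                      /*weak=*/true, __ATOMIC_SEQ_CST,
                                      __ATOMIC_RELAXED));
  return old_val;  // like __atomic_fetch_add, return the previous value
}
```

The same template instantiates for float, double, and long double, though for 16-byte long double the compiler may fall back to a lock-based libatomic path.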

ldionne added inline comments. May 20 2020, 2:08 PM
clang/test/CodeGen/atomic-ops.c
296 (On Diff #265325)

Sorry if that's a dumb question, but I'm a bit confused: p is a float*, but then we add a double 1.0 to it. Is that intended, or should that be double *p instead (or 1.0f)?

yaxunl marked 3 inline comments as done. May 20 2020, 4:14 PM
In D71726#1791904, @jfb wrote:

This generally seems fine. Does it work on most backends? I want to make sure it doesn't fail in backends :)

Also, @ldionne / @EricWF / @mclow.lists do you need this in libc++ for floating-point atomic support?

Yes, I guess we do in order to implement fetch_add & friends on floating point types (https://wg21.link/P0020R6).

The builtins would need to work on float, double, and long double. The code seems to suggest they do; however, the tests only check float. Does this support __atomic_fetch_{add,sub} on double and long double?

It depends on the target. For x86_64, __atomic_fetch_{add,sub} on double and long double are translated to __atomic_fetch_sub_8 and __atomic_fetch_sub_16.
For amdgcn, __atomic_fetch_{add,sub} on double is translated to fp atomic instructions; long double is the same as double on amdgcn.

clang/test/CodeGen/atomic-ops.c
296 (On Diff #265325)

In this case, the value type is converted to the pointee type of the pointer operand.

jfb added a comment. May 20 2020, 6:20 PM
In D71726#1791904, @jfb wrote:

This generally seems fine. Does it work on most backends? I want to make sure it doesn't fail in backends :)

Also, @ldionne / @EricWF / @mclow.lists do you need this in libc++ for floating-point atomic support?

Yes, I guess we do in order to implement fetch_add & friends on floating point types (https://wg21.link/P0020R6).

The builtins would need to work on float, double, and long double. The code seems to suggest they do; however, the tests only check float. Does this support __atomic_fetch_{add,sub} on double and long double?

libc++ could implement atomic<float> using a cmpxchg loop with bit_cast and the FP instruction in most cases, and only use these builtins if available.
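A sketch of that scheme for float, assuming the value is stored as a lock-free std::atomic<uint32_t>; std::memcpy stands in for C++20 std::bit_cast so the sketch stays C++11-compatible, and fp_fetch_add is a name invented for this illustration:

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Compare-exchange loop: the FP add happens on a private copy of the bits,
// and the result is installed only if the stored value has not changed.
inline float fp_fetch_add(std::atomic<uint32_t> &bits, float arg) {
  uint32_t old_bits = bits.load(std::memory_order_relaxed);
  float old_val;
  uint32_t new_bits;
  do {
    std::memcpy(&old_val, &old_bits, sizeof old_val);
    float new_val = old_val + arg;
    std::memcpy(&new_bits, &new_val, sizeof new_bits);
    // On failure, old_bits is reloaded with the current value; retry.
  } while (!bits.compare_exchange_weak(old_bits, new_bits,
                                       std::memory_order_seq_cst,
                                       std::memory_order_relaxed));
  return old_val;  // previous value, matching fetch_add semantics
}
```

The builtins from this patch would let libc++ replace such a loop with a single atomicrmw fadd where the target supports it.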

yaxunl updated this revision to Diff 265479. May 21 2020, 5:24 AM
yaxunl marked an inline comment as done.
yaxunl edited the summary of this revision.

Added TargetInfo::isFPAtomicFetchAddSubSupported to guard fp atomics.

tra added a subscriber: tra. May 21 2020, 10:49 AM
tra added inline comments.
clang/include/clang/Basic/TargetInfo.h
1477

I think it should be predicated on the specific type.
E.g. NVPTX supports atomic ops on fp32 ~everywhere, but fp64 atomic add/sub is only supported on newer GPUs.
And then there's fp16...

ldionne added inline comments. May 21 2020, 11:45 AM
clang/test/CodeGen/atomic-ops.c
296 (On Diff #265325)

Ok, thanks for the clarification. Yeah, it was a dumb question after all. I still think it should be made clearer by using 1.0f.

yaxunl marked 3 inline comments as done. May 21 2020, 2:32 PM
yaxunl added inline comments.
clang/include/clang/Basic/TargetInfo.h
1477

Will do, and will add tests for fp16.

clang/test/CodeGen/atomic-ops.c
296 (On Diff #265325)

This test has been removed. The new tests do not have this issue.

yaxunl updated this revision to Diff 265606. May 21 2020, 2:39 PM
yaxunl marked 2 inline comments as done.
yaxunl edited the summary of this revision. (Show Details)
yaxunl added a reviewer: tra.

Check supported fp atomics by bit width.

ldionne added inline comments. May 25 2020, 11:10 AM
clang/test/CodeGenCUDA/amdgpu-atomic-ops.cu
27

Nitpick, but this should be 1.0L to be consistent.

tra added inline comments. May 26 2020, 9:58 AM
clang/include/clang/Basic/TargetInfo.h
1477

The number of bits alone may not be sufficient to differentiate the FP variants.
E.g. 16-bit floats currently have two variants: IEEE FP16 and BFloat16 (supported by Intel and newer NVIDIA GPUs).
CUDA 11 has introduced the TF32 FP format, so we're likely to have more than one 32-bit FP type, too.
I think PPC has an odd long double variant represented as a pair of 64-bit doubles.

yaxunl marked 4 inline comments as done. Jul 18 2020, 3:44 PM
yaxunl added inline comments.
clang/include/clang/Basic/TargetInfo.h
1477

Will use llvm::fltSemantics for the check, which should cover the different fp types.

clang/test/CodeGenCUDA/amdgpu-atomic-ops.cu
27

done

yaxunl updated this revision to Diff 279032. Jul 18 2020, 3:46 PM
yaxunl marked 2 inline comments as done.

use llvm::fltSemantics for checking

Why not have clang always emit atomicrmw for floats, and let AtomicExpandPass handle legalizing that into integer atomics if necessary, rather than adding a target hook in clang?

Why not have clang always emit atomicrmw for floats, and let AtomicExpandPass handle legalizing that into integer atomics if necessary, rather than adding a target hook in clang?

Not all targets can legalize fp atomics by AtomicExpandPass. Some targets need library support.

jfb added a comment. Jul 21 2020, 3:22 PM

Why not have clang always emit atomicrmw for floats, and let AtomicExpandPass handle legalizing that into integer atomics if necessary, rather than adding a target hook in clang?

Not all targets can legalize fp atomics by AtomicExpandPass. Some targets need library support.

What are they missing? It can be expanded to a cmpxchg loop with a bitcast to an integer type of the same size.

Why not have clang always emit atomicrmw for floats, and let AtomicExpandPass handle legalizing that into integer atomics if necessary, rather than adding a target hook in clang?

Not all targets can legalize fp atomics by AtomicExpandPass. Some targets need library support.

That isn't true, because you can do so generically with a cmpxchg loop, assuming that the size of the atomic is supported by the target. This might not be the most efficient lowering choice, but it's always possible as a fallback. (And if the size is too large, then AtomicExpandPass will lower the cmpxchg to the libatomic call.)

If a target wants to tell AtomicExpandPass that fp add/sub are supported, and then lower the resulting ATOMIC_LOAD_FSUB sdag node into a libcall of its choice, that's also ok (as long as the libcall is lock-free).

Why not have clang always emit atomicrmw for floats, and let AtomicExpandPass handle legalizing that into integer atomics if necessary, rather than adding a target hook in clang?

Not all targets can legalize fp atomics by AtomicExpandPass. Some targets need library support.

That isn't true, because you can do so generically with a cmpxchg loop, assuming that the size of the atomic is supported by the target. This might not be the most efficient lowering choice, but it's always possible as a fallback. (And if the size is too large, then AtomicExpandPass will lower the cmpxchg to the libatomic call.)

If a target wants to tell AtomicExpandPass that fp add/sub are supported, and then lower the resulting ATOMIC_LOAD_FSUB sdag node into a libcall of its choice, that's also ok (as long as the libcall is lock-free).

What about other fp types, e.g. bf16, half, long double? Do we need to diagnose them or not?

yaxunl updated this revision to Diff 281296. Jul 28 2020, 11:07 AM

Mark the IEEE single and double types as supported for fp atomics on all targets by default. This is based on the assumption that AtomicExpandPass, or the ongoing work on it, is sufficient to support fp atomics on all targets, and is meant to facilitate middle-end and backend development of fp atomic support.

If a target would like to treat single and double fp atomics as unsupported, it can override the default behavior in its own TargetInfo.

Ping. I think I have addressed all the issues in the frontend; issues in AtomicExpandPass should be addressed by separate patches. Can we land this? Thanks.

tra added a comment. Jul 29 2020, 10:45 AM

LGTM, modulo a couple of nits.

@jyknight are you OK with this?

Mark the IEEE single and double types as supported for fp atomics on all targets by default. This is based on the assumption that AtomicExpandPass, or the ongoing work on it, is sufficient to support fp atomics on all targets, and is meant to facilitate middle-end and backend development of fp atomic support.

If a target would like to treat single and double fp atomics as unsupported, it can override the default behavior in its own TargetInfo.

Do we have sufficient test coverage on all platforms to make sure we're not generating something that LLVM can't handle everywhere?
If not, perhaps we should default to unsupported and only enable it for known working targets.

clang/lib/CodeGen/CGAtomic.cpp
913–915

ShouldCastToIntPtrTy = !MemTy->isFloatingType();

clang/test/Sema/atomic-ops.c
102–104

Rename arguments? d -> f, d2 -> d, d3 -> ld ?

yaxunl updated this revision to Diff 283028. Aug 4 2020, 2:11 PM
yaxunl marked 2 inline comments as done.

Revised per Artem's comments.

yaxunl updated this revision to Diff 283067. Aug 4 2020, 5:09 PM

Added tests for targets supporting fp atomics.

yaxunl added a comment. Aug 4 2020, 5:17 PM
In D71726#2182667, @tra wrote:

LGTM, modulo a couple of nits.

@jyknight are you OK with this?

Mark the IEEE single and double types as supported for fp atomics on all targets by default. This is based on the assumption that AtomicExpandPass, or the ongoing work on it, is sufficient to support fp atomics on all targets, and is meant to facilitate middle-end and backend development of fp atomic support.

If a target would like to treat single and double fp atomics as unsupported, it can override the default behavior in its own TargetInfo.

Do we have sufficient test coverage on all platforms to make sure we're not generating something that LLVM can't handle everywhere?
If not, perhaps we should default to unsupported and only enable it for known working targets.

I updated TargetInfo for fp atomic support for the common targets. By default, fp atomic support is now off; it is enabled only for targets that do not generate lib calls for fp atomics. This is because the availability of the lib calls depends on the platform, so it is up to the target owners to determine whether support is available when a lib call is needed. For targets that are able to generate LLVM fp atomic instructions, fp atomic support is enabled in clang, and tests are added to cover them.

clang/lib/CodeGen/CGAtomic.cpp
913–915

done

clang/test/Sema/atomic-ops.c
102–104

done

yaxunl edited the summary of this revision. Aug 4 2020, 5:20 PM
ldionne removed a subscriber: ldionne. Aug 5 2020, 9:05 AM
In D71726#2182667, @tra wrote:

If a target would like to treat single and double fp atomics as unsupported, it can override the default behavior in its own TargetInfo.

I really don't think this should be a target option at all. Every target can support the atomic fadd/fsub IR instructions (via lowering to a cmpxchg loop if nothing else). If it doesn't work, that's a bug in LLVM. We shouldn't be adding target hooks in Clang to work around LLVM bugs; rather, we should fix them.

There is one nit -- atomicrmw doesn't (yet) support specifying alignment. There's work now to fix that, but until that's submitted, only naturally-aligned atomicrmw instructions can be created. So, for now, supporting only a naturally-aligned floating-point add would be a reasonable temporary measure.

Do we have sufficient test coverage on all platforms to make sure we're not generating something that LLVM can't handle everywhere?

Probably not.

If not, perhaps we should default to unsupported and only enable it for known working targets.

No, I don't think that's a good way to go. We should fix LLVM if it's broken.

clang/lib/CodeGen/CGAtomic.cpp
959

convertToAtomicIntPointer does more than just cast to an int pointer; are you sure the rest is not necessary for fp types?

clang/lib/Sema/SemaChecking.cpp
5034–5036

This is confusing, and took me a bit to understand what you're doing. I'd suggest reordering the clauses, putting the pointer case first, e.g.:

if (Form == Arithmetic && ValType->isPointerType())
  Ty = Context.getPointerDiffType();
else if (Form == Init || Form == Arithmetic)
  Ty = ValType;
else if (Form == Copy || Form == Xchg) .....
else ......
...

Oh, one more note: C11 has -- and clang already supports -- _Atomic long double x; x += 4; via lowering to a cmpxchg loop. Now that we have an LLVM IR representation for atomicrmw fadd/fsub, clang should be lowering the _Atomic += to that, too. (Doesn't need to be in this patch, but it should be done.)

yaxunl marked 2 inline comments as done. Aug 10 2020, 11:46 AM
In D71726#2182667, @tra wrote:

If a target would like to treat single and double fp atomics as unsupported, it can override the default behavior in its own TargetInfo.

I really don't think this should be a target option at all. Every target can support the atomic fadd/fsub IR instructions (via lowering to a cmpxchg loop if nothing else). If it doesn't work, that's a bug in LLVM. We shouldn't be adding target hooks in Clang to work around LLVM bugs; rather, we should fix them.

There is one nit -- atomicrmw doesn't (yet) support specifying alignment. There's work now to fix that, but until that's submitted, only naturally-aligned atomicrmw instructions can be created. So, for now, supporting only a naturally-aligned floating-point add would be a reasonable temporary measure.

Clang does not always emit atomic instructions for atomic builtins; it may emit lib calls instead. Basically, clang checks the target info's max atomic inline width, and if the desired atomic operation exceeds the supported inline width, clang emits lib calls for the atomic builtins. The rationale is that the lib calls may be faster than the IR generated by the LLVM pass. This behavior has long existed, and it also applies to fp atomics. I don't think emitting lib calls for atomic builtins is a bug. However, it does introduce the question of whether the library functions for atomics are available on a specific target. As I said, only the target owners have the answer, and that is why I introduced the target hook.
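The width check described here can be sketched as follows (the names are illustrative only, not clang's actual TargetInfo API):

```cpp
// Illustrative sketch: emit a libcall when the operation is misaligned or
// wider than the target's maximum inline atomic width.
struct TargetInfoSketch {
  unsigned MaxAtomicInlineWidthBits;
};

inline bool shouldUseLibcall(const TargetInfoSketch &TI, unsigned OpSizeBits,
                             bool NaturallyAligned) {
  return !NaturallyAligned || OpSizeBits > TI.MaxAtomicInlineWidthBits;
}
```

For example, on a target with a 128-bit inline width, a naturally aligned 64-bit fetch add stays inline, while a 256-bit one goes to the library.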

Do we have sufficient test coverage on all platforms to make sure we're not generating something that LLVM can't handle everywhere?

Probably not.

In clang, we only test IR generation, as is done for the other atomic builtins. FP atomics do not have less coverage than the other atomic builtins; in fact, for the other atomic builtins we do not even test on different targets. ISA generation for fp atomics should be tested in LLVM and should not block the clang change.

clang/lib/CodeGen/CGAtomic.cpp
959

It is not needed for fp types. If the value type does not match the pointee type, clang automatically inserts the proper LLVM instructions to convert the value to the pointee type. Two codegen tests were added (atomic_fetch_add(double*, float) and atomic_fetch_add(double*, int)) to cover these situations.

clang/lib/Sema/SemaChecking.cpp
5034–5036

done

yaxunl updated this revision to Diff 284463. Aug 10 2020, 11:49 AM
yaxunl marked 2 inline comments as done.

Revised per James's comments.

clang does not always emit atomic instructions for atomic builtins. Clang may emit lib calls for atomic builtins. Basically clang checks target info about max atomic inline width and if the desired atomic operation exceeds the supported atomic inline width, clang will emit lib calls for atomic builtins. The rationale is that the lib calls may be faster than the IR generated by the LLVM pass. This behavior has long existed and it also applies to fp atomics. I don't think emitting lib calls for atomic builtins is a bug. However, this does introduce the issue about whether the library functions for atomics are available for a specific target. As I said, only the target owners have the answer and therefore I introduced the target hook.

If we want the frontend to emit an error when the target doesn't support library-based atomics, that seems fine, but there's no reason to only do so for floating-point types. That is, we should have a TargetInfo method that asks whether atomics at a given size and alignment are supported at all, similar to what we have for "builtin" (lock-free) atomics, and we should check it for all the atomic types and operations.

Actually, maybe we should take the existing hook and have it return one of { LockFree, Library, Unsupported }.
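The tri-state hook suggested here might look roughly like this (illustrative only; these names are not clang's actual interface):

```cpp
// Illustrative tri-state support query: lock-free when the operation fits
// the inline width, library when atomic libcalls exist, else unsupported.
enum class AtomicSupportKind { LockFree, Library, Unsupported };

inline AtomicSupportKind atomicSupport(unsigned SizeBits,
                                       unsigned MaxInlineBits,
                                       bool HasAtomicLibcalls) {
  if (SizeBits <= MaxInlineBits)
    return AtomicSupportKind::LockFree;
  return HasAtomicLibcalls ? AtomicSupportKind::Library
                           : AtomicSupportKind::Unsupported;
}
```

A target like amdgpu, which lacks atomic libcalls, would then report Unsupported for any oversized operation and the frontend could diagnose it uniformly.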

clang does not always emit atomic instructions for atomic builtins. Clang may emit lib calls for atomic builtins. Basically clang checks target info about max atomic inline width and if the desired atomic operation exceeds the supported atomic inline width, clang will emit lib calls for atomic builtins. The rationale is that the lib calls may be faster than the IR generated by the LLVM pass. This behavior has long existed and it also applies to fp atomics. I don't think emitting lib calls for atomic builtins is a bug. However, this does introduce the issue about whether the library functions for atomics are available for a specific target. As I said, only the target owners have the answer and therefore I introduced the target hook.

The LLVM AtomicExpandPass is _also_ introducing libcalls (or cmpxchg loops), as is appropriate for a given target. We currently, redundantly, support the same thing in two places. It's a long-term goal of mine to simplify the atomics code in clang, by deferring more of it to LLVM, but some prerequisites (e.g. supporting misaligned atomicrmw) are not yet in place. The intent is that it is always valid to emit the LLVM atomic IR, and it will be transformed into whatever is best on a given target. As such, there's no reason to restrict these clang intrinsics.

Yes, there are no generically available libcalls for atomic float math -- but that's okay -- let LLVM handle the transform into a cmpxchg loop when required.

Yes, there are no generically available libcalls for atomic float math -- but that's okay -- let LLVM handle the transform into a cmpxchg loop when required.

I suspect Yaxun's target cannot provide libcalls at all, which is why he wants to diagnose up-front. But I agree that we should be thinking about this uniformly, and that his target should be diagnosing *all* unsupported atomics.

yaxunl updated this revision to Diff 307206. Mon, Nov 23, 3:12 PM
yaxunl edited the summary of this revision.

Revised per John's comments. Added a target hook and diagnostics for generic atomic operations.

Yes, there are no generically available libcalls for atomic float math -- but that's okay -- let LLVM handle the transform into a cmpxchg loop when required.

I suspect Yaxun's target cannot provide libcalls at all, which is why he wants to diagnose up-front. But I agree that we should be thinking about this uniformly, and that his target should be diagnosing *all* unsupported atomics.

The amdgpu target currently does not support atomic libcalls. Per John's comments, I added a target hook for atomic operation support and diagnostics for generic atomic operations.

Clang has existing diagnostics for unsupported atomic load/store on some platforms, plus functions concerning atomic support scattered across TargetInfo, ASTContext, and CodeGen. This change refactors that code and unifies it into a target hook.