This is an archive of the discontinued LLVM Phabricator instance.

[libomptarget] Implement most hip atomic functions in terms of intrinsics
ClosedPublic

Authored by JonChesterfield on Jan 20 2020, 6:05 PM.

Details

Summary

[libomptarget] Implement hip atomic functions in terms of intrinsics

All but atomicInc can be implemented using type generic clang intrinsics.
There is not yet a corresponding intrinsic for atomicInc in clang, only one in
LLVM. This patch leaves atomicInc as an unresolved symbol.

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptJan 20 2020, 6:05 PM

Is is necessary, or useful, to have 3 different kinds of intrinsics here? We have __atomic_fetch_max and __atomic_max_fetch in clang, unsure if they could replace the OpenCL intrinsic. Similarly, there is __atomic_compare_exchange and __atomic_exchange. I would also be fine with only amdgcn intrinsics or only opencl ones but I am a little worried about mixing these 3 kinds.

openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt
73 ↗(On Diff #239220)

Add a TODO here and in the ll file to state this is a *temporary workaround*. Also state the proper solution please.

Is is necessary, or useful, to have 3 different kinds of intrinsics here? We have __atomic_fetch_max and __atomic_max_fetch in clang, unsure if they could replace the OpenCL intrinsic. Similarly, there is __atomic_compare_exchange and __atomic_exchange. I would also be fine with only amdgcn intrinsics or only opencl ones but I am a little worried about mixing these 3 kinds.

I share this concern. I especially dislike the forced cast to _Atomic T* for the opencl intrinsics as that's skirting a bit close to an aliasing violation.

atomic_fetch_max I missed because gcc doesn't document it. In clang, it only compiles for 32 bit integer types (and we have a use of 64 bit in loop.cu). Some context in D46386.

I remember seeing different codegen for the opencl compare_exchange and the builtin but am not able to reproduce that now. There was some trial and error choosing intrinsics while comparing the codegen to hand written IR.

How about generic intrinsics where available:

  • fetch_add
  • exchange
  • compare_exchange

Amdgcn prefixed IR functions for:

  • atomic_inc
  • atomic_fetch_max

with the intent to provide amdgcn intrinsics for inc and fetch_max?

  • Prefer non-opencl atomics where available, comment the IR function

I like this better. Interestingly, we could even think about sharing the clang builtin versions across targets. I'm fine with us keeping it this way for now. What do u think?

openmp/libomptarget/deviceRTLs/amdgcn/src/hip_atomics.h
53

Add a TODO where we want different builtins later on.

arsenm added inline comments.Jan 21 2020, 5:49 PM
openmp/libomptarget/deviceRTLs/amdgcn/src/atomic_inc.ll
12 ↗(On Diff #239446)

We should also have a builtin for this. It would not be difficult to add

15 ↗(On Diff #239446)

This should be false. Where are you seeing this imply volatile? I think we started hacking around OpenCL volatile for atomics in device-libs

JonChesterfield marked 2 inline comments as done.
  • review comments
openmp/libomptarget/deviceRTLs/amdgcn/src/atomic_inc.ll
12 ↗(On Diff #239446)

Agreed and hopefully. It takes a lot of parameters so I was unsure about testing the codegen. Are you willing to review such a patch?

15 ↗(On Diff #239446)

Fun story. We were using functions from hc_atomic.ll though a wrapper and those functions specified volatile. This does the same to preserve that behaviour.

However, dropping volatile from all of these functions doesn't actually break the build. Compiles about as cleanly as before, same set of tests pass. Therefore dropped the qualifier from all of these functions.

arsenm added inline comments.Jan 21 2020, 6:39 PM
openmp/libomptarget/deviceRTLs/amdgcn/src/amdgcn_atomic.ll
20 ↗(On Diff #239476)

There must be an existing builtin for this?

JonChesterfield marked 2 inline comments as done.Jan 21 2020, 6:42 PM
JonChesterfield added inline comments.
openmp/libomptarget/deviceRTLs/common/src/loop.cu
796 ↗(On Diff #239476)

Buffer and elem are already uint64_t, and unsigned long long == 8 bytes on nvptx and amdgcn, so I'm not sure why there was a cast here.

JonChesterfield marked an inline comment as done.Jan 21 2020, 6:43 PM
JonChesterfield added inline comments.
openmp/libomptarget/deviceRTLs/amdgcn/src/amdgcn_atomic.ll
20 ↗(On Diff #239476)

There is an opencl builtin, but it wants the pointer type cast to an atomic. I'm reluctant to use that when the underlying type is not atomic qualified. Used in the previous version of this diff.

There is also a clang builtin, but only for 32 bits.

Discussed in the biweekly openmp call. Agreed that we should write the clang intrinsics first as deviceRTL isn't on the critical path for offloading to amdgcn anyway. I'll update this patch once they're available.

  • Use atomic max, reformat
JonChesterfield added a comment.EditedFeb 12 2020, 2:13 PM

Atomic max is implemented on trunk, as of D55562. Thanks Tim! I didn't notice it land. And failed to spot the difference between fetch_max and max_fetch.

Generates the same IR as the previous revision, updated diff accordingly.

  • Checked cuda documentation, replace max_fetch with fetch_max to match their semantics
sri added a subscriber: sri.Feb 12 2020, 8:21 PM

I have an amdgcn intrinsic for atomic_inc working without any semantic checks. SemaChecking is rather complicated though so it's going to take a little while to work out what changes need to go there.

  • Simplify, drop atomicInc impl
JonChesterfield retitled this revision from [libomptarget] Implement hip atomic functions in terms of intrinsics to [libomptarget] Implement most hip atomic functions in terms of intrinsics.Mar 3 2020, 10:32 AM
JonChesterfield edited the summary of this revision. (Show Details)

At some point in the future we may want to change over to using atomic functions that closer match the C++ ones, and let nvptx implement those in terms of cuda.

This patch is now very minimal - map the few cuda style atomic functions used onto clang's intrinsics, where available.

clang-tidy failed because it tried to compile this file for x86, which doesn't work, and because it doesn't like the naming conventions. I can do T r; => T R; and capitalise various names if we like, but the unknown attributes and error on looking for -DAMDGCN will need an infrastructure change to resolve.

I'm fine with this, @arsenm ?

arsenm accepted this revision.Mar 4 2020, 9:29 AM
This revision is now accepted and ready to land.Mar 4 2020, 9:29 AM
This revision was automatically updated to reflect the committed changes.

Only one of the atomicInc functions is used and clang (trunk) now has an intrinsic that can be used to implement it.