This is an archive of the discontinued LLVM Phabricator instance.

[libomptarget][nfc] Introduce atomic wrapper function
ClosedPublic

Authored by JonChesterfield on Dec 12 2019, 2:46 AM.

Details

Summary

[libomptarget][nfc] Introduce atomic wrapper function

Wraps atomic functions in a template prefixed __kmpc_atomic that
dispatches to cuda or hip atomic functions. Intended to be easily extended
to dispatch to OpenCL or C++ atomics for a third target.

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptDec 12 2019, 2:46 AM
  • consistent underbar

Actually - turns out I don't need this for amdgcn. It appears HIP explicitly supports the same functions as cuda with the same names. https://github.com/ROCm-Developer-Tools/HIP/blob/master/docs/markdown/hip_kernel_language.md

This abstraction is probably still a good idea for opencl based builds, but could perhaps be postponed until then

JonChesterfield added a comment.EditedDec 12 2019, 4:55 AM

Related: D71412. Inclined to postpone this renaming / wrapping exercise until a non cuda, non hip target is proposed. Thoughts?

I think it is worth having this layer of abstraction. Aren't there other atmomics except add? I remember writing a patch like this for more than atomic add (which never got in obviously, one should be able to find it though).

I think it is worth having this layer of abstraction. Aren't there other atmomics except add?

Yep. Complete list of functions used by the current deviceRTL is add, inc, max, exch, cas in a few integer variants. The others were left out for the (admittedly minor) reduction in work available from choosing a naming convention before the find&replace.

I think I'd also prefer to have the abstraction layer. It's part of viewing common/ as C++ plus well defined extensions.

I think it is worth having this layer of abstraction. Aren't there other atmomics except add?

Yep. Complete list of functions used by the current deviceRTL is add, inc, max, exch, cas in a few integer variants. The others were left out for the (admittedly minor) reduction in work available from choosing a naming convention before the find&replace.

I think I'd also prefer to have the abstraction layer. It's part of viewing common/ as C++ plus well defined extensions.

I would propose:

Name them __kmpc_aotmic_XXX as that matches the runtime naming scheme. Changing it locally seems not helpful.
Do the search and replace for this commit already, let's get all atomics moved in a single swoop.
Consider a template solution for both the declaration and implementation. If we get more repetition and types it might be worth it. (At least hat is what I thought in D64217)

__kmpc_atomic_foo works for me, as does running sed once.

Templates I can't see in this instance. Could you sketch the implementation for one of the functions, e.g. the add example from here?

__kmpc_atomic_foo works for me, as does running sed once.

Templates I can't see in this instance. Could you sketch the implementation for one of the functions, e.g. the add example from here?

See D64217

/// Atomically exchange the pointee of \p Ptr with \p Val and return the
/// original value of the pointee.
template <typename T> T __kmpc_impl_atomic_exchange(T *Ptr, T Val) {
  return atomicExch(Ptr, Val);
}

also no need for stdint.h.

JonChesterfield added a comment.EditedDec 12 2019, 5:31 PM
template <typename T> T __kmpc_impl_atomic_exchange(T *Ptr, T Val) {
  return atomicExch(Ptr, Val);
}

OK, cool. So we still need the list of underlying functions from cuda.h or equivalent, and if the instantiation type isn't (implicitly convertible to one) on the list we get a slightly less readable compile time error. That seems like a good tradeoff. Agreed

Edit: a drawback is the underlying symbols must be exposed in the header with the template wrapper, meaning no compile time error for calling them directly. With extra
code instead, the declaration can be separate and we get a degree of checking that the wrappers are used consistently. I'm still cautiously in favour of the templates.

JonChesterfield added a comment.EditedDec 12 2019, 5:52 PM

There's an outstanding design point here.

Logically, the implementation is per target so should be in arch/src/target_atomic.h, with call sites including target_atomic.

However, the nvptx and amdgcn implementations will be (somewhat spuriously) the same. So either copy & paste time, or each needs to include a header from common containing the implementation which is otherwise not included anywhere.

This awkward redirection is necessary under the current scheme to allow a new arch to provide a header that is picked up by common/src/foo.cpp.

I wonder if that means a different redirection scheme is better, e.g. where headers are used from common unless one with the same name is provided under the arch. That seems error prone however.

There's an outstanding design point here.

Logically, the implementation is per target so should be in arch/src/target_atomic.h, with call sites including target_atomic.

However, the nvptx and amdgcn implementations will be (somewhat spuriously) the same. So either copy & paste time, or each needs to include a header from common containing the implementation which is otherwise not included anywhere.

This awkward redirection is necessary under the current scheme to allow a new arch to provide a header that is picked up by common/src/foo.cpp.

I wonder if that means a different redirection scheme is better, e.g. where headers are used from common unless one with the same name is provided under the arch. That seems error prone however.

To be honest, I did not follow this so if my response doesn't make sense let me know.

Could we have a generic atomic header with the template definition that is something like this:

template <typename T> T __kmpc_impl_atomic_exchange(T *Ptr, T Val) {
  T Tmp;
  #pragma omp atomic
  { Tmp = *Ptr; *Ptr = Val; }
  return Tmp;
}

In the target_impl.h you can provide the specialized implementation.

template <typename T> T __kmpc_impl_atomic_exchange(T *Ptr, T Val) {
  return atomicExch(Ptr, Val);
}

Only one template version will be visible at any time. Would that solve the problem?

JonChesterfield added a comment.EditedDec 13 2019, 3:25 AM

To be honest, I did not follow this so if my response doesn't make sense let me know.

Apologies. I'll rephrase the problem now that it's a more reasonable time of day.

Could we have a generic atomic header with the template definition that is something like this:

...

In the target_impl.h you can provide the specialized implementation.

Only one template version will be visible at any time. Would that solve the problem?

I don't believe so.

Consider common/src/loop.cu, which calls __kmpc_atomic_add. Today, that atomic add could be implemented under common/atomic.h and included as common/atomic.h or similar and all would work well.

In the future, a third target wants to provide implementations for the atomics using C++ or OpenCL, so writes their own arch/src/atomic.h. This file would then be ignored by the source under common, because it's in the unexpected place.

We can guard against this by writing the target specific implementation under nvptx/src/atomic.h, in which case common/src will include atomic.h and everything will work out as intended for the third target. This is why we have #include "target_impl.h" and #include "common/debug.h", with careful include paths - to disambiguate.

However, the amdgcn implementation would be very similar to the nvptx one. So this is a new use case for the shared source model:

  • Let nvptx use generic foo.h
  • Let amdgcn use generic foo.h
  • Let third party use specialised foo.h

such that common code picks up the generic case or the specialised one if available.

This could be implemented by:

  • code duplication
  • making fooi.h and putting it under common, with an amdgcn stub which declares some functions then includes fooi.h
  • adding complexity to cmake
  • possibly by playing games with include resolution order
  • a defaults plus override model, e.g. D68310, where we hope the defaults are broadly applicable. Can fail on a forth target
  • always call unqualified headers (no common prefix) from everywhere, and include stubs that include shared text from the targets. So we pay with files like debug.h: #include "common/debug.h", but everything composes exactly correctly.

The general problem is how to statically compose source code such that the end result combines common and target specific code without collapsing under the complexity of the scheme. While using somewhat crude tools.

The last bullet point is the totally explicit representation of what one wants to happen, various other schemes move noise from the source into the build sytem.

jdoerfert added a comment.EditedDec 13 2019, 9:30 AM

To be honest, I did not follow this so if my response doesn't make sense let me know.

Apologies. I'll rephrase the problem now that it's a more reasonable time of day.

Could we have a generic atomic header with the template definition that is something like this:

...

In the target_impl.h you can provide the specialized implementation.

Only one template version will be visible at any time. Would that solve the problem?

I don't believe so.

Consider common/src/loop.cu, which calls __kmpc_atomic_add. Today, that atomic add could be implemented under common/atomic.h and included as common/atomic.h or similar and all would work well.

In the future, a third target wants to provide implementations for the atomics using C++ or OpenCL, so writes their own arch/src/atomic.h. This file would then be ignored by the source under common, because it's in the unexpected place.

We can guard against this by writing the target specific implementation under nvptx/src/atomic.h, in which case common/src will include atomic.h and everything will work out as intended for the third target. This is why we have #include "target_impl.h" and #include "common/debug.h", with careful include paths - to disambiguate.

However, the amdgcn implementation would be very similar to the nvptx one. So this is a new use case for the shared source model:

  • Let nvptx use generic foo.h
  • Let amdgcn use generic foo.h
  • Let third party use specialised foo.h

such that common code picks up the generic case or the specialised one if available.

This could be implemented by:

  • code duplication
  • making fooi.h and putting it under common, with an amdgcn stub which declares some functions then includes fooi.h
  • adding complexity to cmake
  • possibly by playing games with include resolution order
  • a defaults plus override model, e.g. D68310, where we hope the defaults are broadly applicable. Can fail on a forth target
  • always call unqualified headers (no common prefix) from everywhere, and include stubs that include shared text from the targets. So we pay with files like debug.h: #include "common/debug.h", but everything composes exactly correctly.

The general problem is how to statically compose source code such that the end result combines common and target specific code without collapsing under the complexity of the scheme. While using somewhat crude tools.

The last bullet point is the totally explicit representation of what one wants to happen, various other schemes move noise from the source into the build sytem.

We have two targets only right now, let's not complicate things too much. I'm fine with any solution that is reasonable and working now. I have a proposal below but other ways are fine as well.

If both targets agree on the atomics, put the code in common/atomics.h. Once we have third target that doesn't, we put the target code in {nvptx,amdgcn,third}/atomics.h and provide a pseudo/cpu target cpu/atomics.h with the generic openmp based template. In cmake we choose which include path you get, as we do now. (= I would not force no code duplication)

That sounds pragmatic. A single header located under common/ will work for nvptx and amdgcn without #ifdef, so lets go with that. We can rearrange the code if necessary when a third target arises or when amdgcn/nvptx wants to implement the atomics differently.

(amdgcn presently has a choice between clang builtins or one of the hip, opencl, hc libraries for the implementation)

That sounds pragmatic. A single header located under common/ will work for nvptx and amdgcn without #ifdef, so lets go with that. We can rearrange the code if necessary when a third target arises or when amdgcn/nvptx wants to implement the atomics differently.

(amdgcn presently has a choice between clang builtins or one of the hip, opencl, hc libraries for the implementation)

Let's do that then.

Let's do that then.

Yep. Just a note to say this is not forgotten, merely hasn't hit the top of the priority queue just yet

  • Change to template implementation
  • Change to template implementation
jdoerfert accepted this revision.Dec 18 2019, 9:16 AM

LGTM.

Thanks for changing all atomics now.

This revision is now accepted and ready to land.Dec 18 2019, 9:16 AM
JonChesterfield edited the summary of this revision. (Show Details)Dec 18 2019, 12:05 PM
This revision was automatically updated to reflect the committed changes.