This is an archive of the discontinued LLVM Phabricator instance.

[libomptarget] Build a minimal deviceRTL for amdgcn
ClosedPublic

Authored by JonChesterfield on Dec 3 2019, 10:28 AM.

Details

Summary

[libomptarget] Build a minimal deviceRTL for amdgcn

Repeat of D70414, with an include path fixed. Diff for sanity checking.

The CMakeLists.txt file is functionally identical to the one used in the aomp fork.
Whitespace changes were made based on nvptx/CMakeLists.txt, plus the
copyright notice updated to match (Greg was the original author so would
like his sign off on that here).

This change will build a small subset of the deviceRTL if an appropriate toolchain is
available, e.g. a local install of rocm. Support.h is moved from nvptx as a dependency
of debug.h.

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald Transcript

It builds, but I have questions about this patch

openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt
17–19

We can't build the runtime with regular clang?

openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

Agan this thing. Can we make it common if it is really required? Again, why do we need these new fields? Could you remove it from the current patch?

JonChesterfield marked 2 inline comments as done.Dec 3 2019, 11:05 AM

Glad it builds. Replies to comments inline

openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt
17–19

Sure, it builds fine with trunk. I'm using a local build of clang 9.0 on one machine and the openmp/fortran fork on a second.

The find package seems to be a list of anywhere that clang might plausibly be found. Copied straight out of our fork, happy to remove any entries you don't like the look of.

The NVPTX_CUDA_COMPILER_DIR doesn't seem especially appropriately named. I use the same compiler for nvptx and for amdgcn.

openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

It's different on nvptx and amdgcn, so not a great candidate for common. The extra fields are used to implement a couple of functions that are stubs in trunk which you weren't sure are necessary for nvptx.

ABataev added inline comments.Dec 3 2019, 11:15 AM
openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt
17–19

Can we then just use the clang (I think, you proposed this for nvcc) to build the runtime? Why shall we lookup for rocm/aomp?

openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

Maybe, it is better to exclude this from this patch? Maybe it would be good to have these fields for nvptx too or we don't need them to for amdgcn. Could you remind where we discussed this structure?

JonChesterfield marked 2 inline comments as done.Dec 3 2019, 11:26 AM
JonChesterfield added inline comments.
openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt
17–19

Which clang? CMAKE_CXX_COMPILER_DIR may resolve to gcc. Or do you mean introduce LIBOMPTARGET_AMDGCN_CUDA_COMPILER_DIR and expect the user to match that up to a clang build?

openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

Discussion was in D69424. I could drop the file from this patch, but it's included by debug.h so that'll break the amdgcn build

JonChesterfield marked an inline comment as done.Dec 3 2019, 11:34 AM
JonChesterfield added inline comments.
openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt
17–19

rocm/aomp is a likely location to find clang if using the rocm developer tools stack, and I think it gets installed into different places depending on whether it's from apt-get or github.

I'm a little unclear on how in tree compiler plus out of tree library (e.g. libc, opencl runtime, or the cuda toolchain) is supposed to work logistically.

ABataev added inline comments.Dec 3 2019, 11:36 AM
openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt
17–19

Freshly built clang which is used to build .bc library for nvptx. Plus, if need it in some cases, I think it would be good to have a cmake configuration option for the user so he could point to his own installation of AOMP. Thoughts?

openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

Ah, found it. Again, the same question. These fields are required to add 2 new omp_... functions. The question is the same. Are they required by the standard? If yes, we must have the same fields for nvptx. Otherwise, I don't think we need these fields at all.

jdoerfert added inline comments.Dec 3 2019, 11:40 AM
openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt
17–19

Take the cmake flag that is used to build the .bc file of the library.

JonChesterfield marked 2 inline comments as done.Dec 3 2019, 12:08 PM
JonChesterfield added inline comments.
openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt
17–19

There's some discussion on another diff about building libomp independently of the compiler so there may not be a freshly built clang available.

I've just had a go at tracing the logic for deciding what directory will be used for clang and it appears to be LLVM_INSTALL_PREFIX if set, otherwise LLVM_BUILD_BINARY_DIR, where find_package looks in:

  • the AOMP environment variable (I think our tests use this)
  • various install places
  • wherever the nvptx compiler is (I'm OK with dropping this)
  • the CXX compiler dir (one of my local builds uses this)

It's a bit of a mess but looks like canonical cmake to me.

openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

"When called from a target region the effect of this routing is unspecified"

So nvptx returning zero is fine, and andgcn returning a meaningful value is also fine, as far as the standard is concerned.

ABataev added inline comments.Dec 3 2019, 12:20 PM
openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

Why do we need to store these values? And if we want these functions to return some meaningful values, better to implement the same behavior for both nvptx and amdgcn.

JonChesterfield marked an inline comment as done.Dec 3 2019, 12:33 PM
JonChesterfield added inline comments.
openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

The plugin knows how many devices are available, so it writes the values to memory on the device, such that the device can answer the query accurately. I want the function to return a meaningful value on amdgcn as a quality of implementation point - it doesn't directly affect me whether nvptx supports the same. It'll cost a few bytes of memory, in difficult to optimise out fashion.

ABataev added inline comments.Dec 3 2019, 12:36 PM
openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

WE need to support the same behavior on all platforms.

JonChesterfield marked an inline comment as done.Dec 3 2019, 12:47 PM
JonChesterfield added inline comments.
openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

Who is we in this context, and do you consider unconditionally returning zero everywhere to be better than returning meaningful values on some targets and zero on others?

ABataev added inline comments.Dec 3 2019, 12:53 PM
openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

The behavior of the runtime library must be the same on all platforms.

JonChesterfield marked an inline comment as done.Dec 3 2019, 1:06 PM
JonChesterfield added inline comments.
openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

I don't think that's the right requirement. The runtime library is correct on a given target if it meets the openmp standard. If one target fails to meet a part of the standard, or chooses to exceed it, that should neither hobble nor demand a comparable extension from all other targets.

ABataev added inline comments.Dec 3 2019, 1:14 PM
openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

No, sorry, I can't agree with you here. The same library must behave the same way everywhere, on each target, on each platform. Supporting different behavior of the same library on different platforms is the bad design and leads to a lot of pain with support.

JonChesterfield marked an inline comment as done.Dec 3 2019, 2:09 PM
JonChesterfield added inline comments.
openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

It's easier for the user if changing platform causes no change in behaviour, sure. But that also means the user can only use a feature that is supported on all systems.

This is the advantage of implementation defined behaviour. People can write portable code, winning easy porting at the cost of performance. Or they can depend on features that are only available on some systems, knowing roughly the price paid for that.

Both are reasonable positions for developers to take. I don't think we should mandate portable application code, especially given the popularity of openmp on HPC where code may only ever run on one machine.

ABataev requested changes to this revision.Dec 3 2019, 2:20 PM
ABataev added inline comments.
openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

No, not reasonable at all. I insist that the target dependent code must contain only really target dependent code, the functiinality must be the same on all platforms.

This revision now requires changes to proceed.Dec 3 2019, 2:20 PM
JonChesterfield marked an inline comment as done.Dec 3 2019, 3:09 PM
JonChesterfield added inline comments.
openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

In which direction would you like this one resolved? Force all targets to return zero or enhance nvptx and future targets beyond the requirements of the standard?

On the other hand, take the openmp locking code. On trunk, as far as I can tell, this is functionally correct on volta and deadlocks on earlier nvidia hardware. Imagine for a moment that we are unable to make it work on non-volta cards, would the fix be to deadlock everywhere or to drop support for non-volta cards? Or to make the locks a nop, a serious QOI degradation on volta.

That one is of particular interest as it is unclear whether amdgcn is able to support unstructured locks.

ABataev added inline comments.Dec 3 2019, 3:46 PM
openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

Does not matter. I would impelement it to have better performance but you can reuse the original amdgcn solution.

As to locks, they works on all gpus, like pascal, volta, etc. And this code is platform-dependent. If the target does not support it, it is hardware limitation. In your case the code does not have any platform specific dependencies.

JonChesterfield marked an inline comment as done.
  • drop a couple of fields
JonChesterfield added inline comments.Dec 3 2019, 4:29 PM
openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18–26 ↗(On Diff #231946)

I've deleted the extra fields from this patch. Given amdgcn doesn't have a libcall implementation yet the discussion about implementation divergence is arguably premature.

I don't understand your requirements here. Are hardware limitations acceptable as a reason for different behaviour between systems? E.g. floating point arithmetic will inevitably return different values on different architectures when compiled with fast-math.

I don't understand your requirements here. Are hardware limitations acceptable as a reason for different behaviour between systems? E.g. floating point arithmetic will inevitably return different values on different architectures when compiled with fast-math.

Each particular case must be investigated individually. Precision is platform dependent here and thus it is ok if the results may vary.

openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18 ↗(On Diff #232010)

I think this can be moved to the common part of the code since the structure is absolutely the same.

JonChesterfield marked an inline comment as done.Dec 3 2019, 5:31 PM

I don't understand your requirements here. Are hardware limitations acceptable as a reason for different behaviour between systems? E.g. floating point arithmetic will inevitably return different values on different architectures when compiled with fast-math.

Each particular case must be investigated individually. Precision is platform dependent here and thus it is ok if the results may vary.

Case by case I can deal with. Thanks for clarifying.

openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18 ↗(On Diff #232010)

It's only the same because I've deleted the fields you disliked. This interface between plugin and deviceRTL is unlikely to be uniform across architectures and is known to be different for nvptx and amdgcn.

ABataev added inline comments.Dec 3 2019, 6:11 PM
openmp/libomptarget/deviceRTLs/amdgcn/src/device_environment.h
18 ↗(On Diff #232010)

Currently they are the same. So, we don't need to copy entities and can use the common one.

  • merge temporarily equivalent headers as requested
JonChesterfield marked an inline comment as done.Dec 3 2019, 6:45 PM
  • use device macro

@ABataev Accept when you are ready, don't wait for me.

This revision is now accepted and ready to land.Dec 4 2019, 7:04 AM
This revision was automatically updated to reflect the committed changes.