Download Raw Diff

Details

Reviewers

jdoerfert
ABataev
bollu
jfb
tra
grokos
Hahnfeld
guansong
xtian
gregrodgers
ronlieb
hfinkel
gtbercea
guraypp
arpith-jacob

Commits

rGed3324f6b6e3: Factor architecture dependent code out of loop.cu
rOMP368751: Factor architecture dependent code out of loop.cu
rL368751: Factor architecture dependent code out of loop.cu

Summary

[libomptarget] Factor architecture dependent code out of loop.cu

Related to the patch series starting D64217. Added subscribers to said series as reviewers. This effort is smaller in scope.

This patch factors out just enough architecture dependent code from loop.cu to allow the same source to be used with amdgcn, given a different target_impl.h. Testing is that the same bitcode (modulo variable names) is generated for libomptarget before and after the refactor, for nvptx and the out of tree amdgcn.

Diff Detail

Repository: rL LLVM

Event Timeline

JonChesterfield created this revision.Aug 6 2019, 5:14 PM

Herald added a project: Restricted Project. · View Herald TranscriptAug 6 2019, 5:14 PM

Herald added subscribers: openmp-commits, dexonsmith. · View Herald Transcript

JonChesterfield edited the summary of this revision. (Show Details)Aug 6 2019, 5:15 PM

Harbormaster completed remote builds in B36282: Diff 213755.Aug 6 2019, 5:15 PM

Couple of comments from me inline.

This is working from the branch at https://github.com/ROCm-Developer-Tools/llvm-project. I'm hoping to move the openmp repo incrementally towards a point where it makes few enough nvptx-specific assumptions that adding the amdgcn target only involves a different version of target_impl.h and a few lines of CMake. Currently our repo has six identical files, fourteen different between the src directories. I've tried to pick a representative starting point with loop.cu.

Feedback very welcome.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
19 ↗	(On Diff #213755)	I would prefer to have declarations in this file and implementations in target_impl.cu. That works for amdgcn but the CMake for nvptx doesn't allow these to be inlined across translation units. This has the advantage that the bitcode library is unchanged. `__inline__` was not sufficient for that with nvptx.
31 ↗	(On Diff #213755)	Several differences between nvptx and amdgcn follow from warp size. The wrapper around `__ffs` allows the source to call an overloaded function instead of `#ifdef` between `__ffs` and `__ffsll`. `__SHFL_SYNC` is currently defined in `omptarget-nvptx.h`and similarly needs different implementations, deferred for a future diff.

drop omptarget-nvptx include

Harbormaster completed remote builds in B36287: Diff 213766.Aug 6 2019, 5:51 PM

I would suggest at first to come to an agreement on the design of this reworked library at first.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
17–18 ↗	(On Diff #213766)	Better to use original `INLINE` macro defined in the project rather than to define the new one.
19 ↗	(On Diff #213766)	Why pointers? Use references.
32 ↗	(On Diff #213766)	`uitn32_t`->`__kmpc_impl_lanemask_t`?

address review comments

Harbormaster completed remote builds in B36327: Diff 213873.Aug 7 2019, 7:31 AM

In D65836#1618858, @ABataev wrote:

I would suggest at first to come to an agreement on the design of this reworked library at first.

Part of the motivation behind this change is that smaller diffs are easier to discuss. Hopefully this contributes to reaching said agreement. I think moving inline nvptx behind an interface is prerequisite for any movement towards sharing code between architectures.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
17–18 ↗	(On Diff #213766)	I'd prefer that too, but INLINE maps to `__inline__`, rather than `__forceinline__`, and that leaves calls to these functions in the bitcode library for nvptx.
19 ↗	(On Diff #213766)	Habit. References look like pass by value at the call site so I tend to write out parameters as pointers. Changed.

ABataev added inline comments.Aug 7 2019, 7:42 AM

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
17–18 ↗	(On Diff #213766)	Then better to fix original `INLINE` macro and replace `__inline__` with `__forceinline__`. I assume we'd like to inline all the functions.

JonChesterfield mentioned this in D65876: Use forceinline. Necessary for nvcc to inline small functions within the bitcode library.Aug 7 2019, 7:53 AM

JonChesterfield marked 2 inline comments as done.Aug 7 2019, 7:55 AM

JonChesterfield added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
17–18 ↗	(On Diff #213766)	That works for me. I suspect everything marked INLINE was intended to be inlined. Diff at D65876.

JonChesterfield mentioned this in rL368177: Use forceinline. Necessary for nvcc to inline small functions within the….Aug 7 2019, 8:25 AM

JonChesterfield mentioned this in rGae0178bee72c: Use forceinline. Necessary for nvcc to inline small functions within the….

drop omptarget-nvptx include
address review comments
rebase, use INLINE from D65876

Harbormaster completed remote builds in B36345: Diff 213916.Aug 7 2019, 8:45 AM

I'm fine with this, @ABataev ?

openmp/libomptarget/deviceRTLs/nvptx/src/loop.cu
392 ↗	(On Diff #213916)	I would prefer `__SHFL_SYNC` `__ACTIVEMASK` etc. also to be function calls to `__kmpc_XXXX` functions but I won't require it.
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
39 ↗	(On Diff #213916)	`int` -> `(u)int32_t` ?
17–18 ↗	(On Diff #213766)	We can even have different definitions of `INLINE` if that becomes necessary.

JonChesterfield marked 2 inline comments as done.Aug 7 2019, 1:00 PM

JonChesterfield added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/loop.cu
392 ↗	(On Diff #213916)	Likewise. They're currently defined in `omptarget-nvptx.h`. I'd like to move them into target_impl and replace the macros with inline functions. That'll raise the question of how to handle implementations for different cuda versions, which I'd like to avoid for this first patch.
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
39 ↗	(On Diff #213916)	The cuda functions (plus `__ffsll` etc) return `int`. I'd slightly prefer `uint32_t` (on the basis that these can't return a negative integer). I'm happy either way.

jdoerfert added inline comments.Aug 7 2019, 1:40 PM

openmp/libomptarget/deviceRTLs/nvptx/src/loop.cu
392 ↗	(On Diff #213916)	Agreed, let's do it later but eventually. We can have the #ifdef cascade for cuda versions but that should be in the cuda subfolder (target_impl.h for example) and the "general logic" contains proper calls.
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
39 ↗	(On Diff #213916)	I think we should make it clear what we expect wrt. bit-width if we have an expectation. Given the `(u)int32` floating around I'd say we do have some expectations/restrictions.

JonChesterfield marked an inline comment as done.Aug 8 2019, 3:15 AM

JonChesterfield added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
39 ↗	(On Diff #213916)	I think unsigned is clearer for bit manipulation functions. Width is mostly determined by warp/wavefront width, at least for these functions. Return width, inclined to follow IR intrinsics and match the argument. There's a mix of signed and unsigned in the existing code, with implicit conversions between them. Amdgcn happens to use a slightly different mix. Considering that NextIter returns a uint64_t, what would you think of using uint32/64 for the functions in loop as well as in this header?

@ABataev, others, any concerns? If not, let's go ahead with this.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
39 ↗	(On Diff #213916)	what would you think of using uint32/64 for the functions in loop as well as in this header? I would prefer that. We might need a device specific type but (unsigned) int does make me worry every time. Let's table this discussion for now to make progress on this.

I like the general approach,

In D65836#1621104, @jdoerfert wrote:

@ABataev, others, any concerns? If not, let's go ahead with this.

Did you come to an agreement about the design of the new universal library? I would suggest starting with this.
We need to find a better way to files layout, design of the target-specific functions (template class with the specialization implementation for each particular target or just good old plain set of target-specific functions, controlled by the condition compilation), etc.
Then, I would suggest committing this new structure at first.

In D65836#1621112, @ABataev wrote:

In D65836#1621104, @jdoerfert wrote:

@ABataev, others, any concerns? If not, let's go ahead with this.

Did you come to an agreement about the design of the new universal library? I would suggest starting with this.

We have a proposal and no major complaints, I count that as agreement.

We need to find a better way to files layout,

See above.

design of the target-specific functions (template class with the specialization implementation for each particular target or just good old plain set of target-specific functions, controlled by the condition compilation), etc.

The design chosen here seems fine to me. Others didn't disagree. It is for sure a step in the right direction.

Then, I would suggest committing this new structure at first.

I think extracting the code makes more sense first. Also mentioned in the proposal and not objected.
The problem is we cannot really restructure as it is still interleaved with target dependent code.

In D65836#1621254, @jdoerfert wrote:

In D65836#1621112, @ABataev wrote:

In D65836#1621104, @jdoerfert wrote:

@ABataev, others, any concerns? If not, let's go ahead with this.

Did you come to an agreement about the design of the new universal library? I would suggest starting with this.

We have a proposal and no major complaints, I count that as agreement.

I would suggest to ping one more time, maybe someone just missed it.

We need to find a better way to files layout,

See above.

design of the target-specific functions (template class with the specialization implementation for each particular target or just good old plain set of target-specific functions, controlled by the condition compilation), etc.

The design chosen here seems fine to me. Others didn't disagree. It is for sure a step in the right direction.

Then, I would suggest committing this new structure at first.

I think extracting the code makes more sense first. Also mentioned in the proposal and not objected.
The problem is we cannot really restructure as it is still interleaved with target dependent code.

I'm wary of refactoring to support multiple architectures while there is only one in tree. It's too difficult to see where the abstractions should be, and it's usually easier to introduce said abstractions than to move them later.

^ obviously I'm proposing a refactor despite that - hoping to minimise risk by keeping the change small and the abstraction very thin.

Ping. I'd like to land this - Alexey?

No complaints on the list, none here. LGTM.

This revision is now accepted and ready to land.Aug 13 2019, 10:10 AM

What about namong convention here? Shall we use capital letters for the var names or it is fine as is? Also, did you come to an agreement about design, directory layout etc.?

Naming convention

What about namong convention here? Shall we use capital letters for the var names or it is fine as is?

Long term it's probably worth moving to the LLVM conventions throughout. Short term, I'd rather keep roughly the same style as the surrounding code. I believe that's what this patch does.

Also, did you come to an agreement about design, directory layout etc.?

Overall agreement on the final design is a work in progress. I believe there is agreement that we'll need at least one target specific header, so this patch follows the suggestion in D64217 and calls it target_impl.

At the moment, nvptx/src is the most sensible directory for it (as that's the only directory!). I think this is an uncontentious step in the right direction.

In D65836#1627635, @JonChesterfield wrote:

What about namong convention here? Shall we use capital letters for the var names or it is fine as is?

Long term it's probably worth moving to the LLVM conventions throughout. Short term, I'd rather keep roughly the same style as the surrounding code. I believe that's what this patch does.

Also, did you come to an agreement about design, directory layout etc.?

Overall agreement on the final design is a work in progress. I believe there is agreement that we'll need at least one target specific header, so this patch follows the suggestion in D64217 and calls it target_impl.

At the moment, nvptx/src is the most sensible directory for it (as that's the only directory!). I think this is an uncontentious step in the right direction.