Details

Reviewers

jdoerfert
ABataev
bollu
jfb
tra
grokos
Hahnfeld
guansong
xtian
gregrodgers
ronlieb
hfinkel
gtbercea
guraypp
arpith-jacob

Commits

rGed3324f6b6e3: Factor architecture dependent code out of loop.cu
rOMP368751: Factor architecture dependent code out of loop.cu
rL368751: Factor architecture dependent code out of loop.cu

Summary

[libomptarget] Factor architecture dependent code out of loop.cu

Related to the patch series starting D64217. Added subscribers to said series as reviewers. This effort is smaller in scope.

This patch factors out just enough architecture dependent code from loop.cu to allow the same source to be used with amdgcn, given a different target_impl.h. Testing is that the same bitcode (modulo variable names) is generated for libomptarget before and after the refactor, for nvptx and the out of tree amdgcn.

Diff Detail

Repository

rG LLVM Github Monorepo

Build Status

Buildable 36287
Build 36286: arc lint + arc unit

Event Timeline

JonChesterfield created this revision. Aug 6 2019, 5:14 PM

Herald added a project: Restricted Project. Aug 6 2019, 5:14 PM

Herald added subscribers: openmp-commits, dexonsmith.

JonChesterfield edited the summary of this revision. Aug 6 2019, 5:15 PM

Harbormaster completed remote builds in B36282: Diff 213755. Aug 6 2019, 5:15 PM

Couple of comments from me inline.

This is working from the branch at https://github.com/ROCm-Developer-Tools/llvm-project. I'm hoping to move the openmp repo incrementally towards a point where it makes few enough nvptx-specific assumptions that adding the amdgcn target only involves a different version of target_impl.h and a few lines of CMake. Currently our repo has six identical files, fourteen different between the src directories. I've tried to pick a representative starting point with loop.cu.

Feedback very welcome.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
20	I would prefer to have declarations in this file and implementations in target_impl.cu. That works for amdgcn but the CMake for nvptx doesn't allow these to be inlined across translation units. This has the advantage that the bitcode library is unchanged. `__inline__` was not sufficient for that with nvptx.
32	Several differences between nvptx and amdgcn follow from warp size. The wrapper around `__ffs` allows the source to call an overloaded function instead of `#ifdef` between `__ffs` and `__ffsll`. `__SHFL_SYNC` is currently defined in `omptarget-nvptx.h` and similarly needs different implementations, deferred for a future diff.
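The overloading idea in the comment above can be sketched on the host as follows. This is only an illustration: the names `sketch_ffs`, `sketch_popc` and `sketch_leader` are hypothetical, and the GCC/Clang builtins stand in for the CUDA intrinsics `__ffs`, `__ffsll` and `__popc`. The 64-bit overloads assume a target whose lane mask is 64 bits wide.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration; the patch itself uses __kmpc_impl_ffs
// and __kmpc_impl_popc. GCC/Clang builtins stand in for the device intrinsics.

// 32-bit lane mask, as on a 32-wide warp.
inline int sketch_ffs(uint32_t x) { return __builtin_ffs(static_cast<int>(x)); }
// 64-bit lane mask, assumed here for a 64-wide wavefront.
inline int sketch_ffs(uint64_t x) {
  return __builtin_ffsll(static_cast<long long>(x));
}

inline int sketch_popc(uint32_t x) { return __builtin_popcount(x); }
inline int sketch_popc(uint64_t x) { return __builtin_popcountll(x); }

// Generic code is written once; the mask type selects the overload, so no
// #ifdef between the 32-bit and 64-bit intrinsics is needed at the call site.
template <typename Mask> int sketch_leader(Mask active) {
  assert(active != 0 && "at least one lane must be active");
  return sketch_ffs(active) - 1; // index of the lowest set bit
}
```

Callers must pass a value of the mask type itself (for example `uint32_t{8}`); a plain `int` literal would be ambiguous between the two overloads.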

drop omptarget-nvptx include

Harbormaster completed remote builds in B36287: Diff 213766. Aug 6 2019, 5:51 PM

I would suggest first coming to an agreement on the design of this reworked library.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
17–18	Better to use original `INLINE` macro defined in the project rather than to define the new one.
19	Why pointers? Use references.
32	`uint32_t` -> `__kmpc_impl_lanemask_t`?

address review comments

Harbormaster completed remote builds in B36327: Diff 213873. Aug 7 2019, 7:31 AM

In D65836#1618858, @ABataev wrote:

I would suggest first coming to an agreement on the design of this reworked library.

Part of the motivation behind this change is that smaller diffs are easier to discuss. Hopefully this contributes to reaching said agreement. I think moving inline nvptx behind an interface is a prerequisite for any movement towards sharing code between architectures.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
17–18	I'd prefer that too, but INLINE maps to `__inline__`, rather than `__forceinline__`, and that leaves calls to these functions in the bitcode library for nvptx.
19	Habit. References look like pass by value at the call site so I tend to write out parameters as pointers. Changed.
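To make the pointer-versus-reference point concrete, here is a minimal host-side sketch of the pack/unpack pair using references. The names `sketch_unpack`, `sketch_pack` and `sketch_roundtrip_demo` are hypothetical, and plain shifts stand in for the `mov.b64` inline asm, assuming `lo` receives the low 32 bits and `hi` the high 32 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-ins for __kmpc_impl_unpack / __kmpc_impl_pack.
// Shifts replace the PTX mov.b64; lo is the low word, hi the high word.

// Reference parameters, as requested in review. With pointers the call site
// would read sketch_unpack(val, &lo, &hi), making the out-parameters visible.
inline void sketch_unpack(int64_t val, int32_t &lo, int32_t &hi) {
  const uint64_t bits = static_cast<uint64_t>(val);
  lo = static_cast<int32_t>(static_cast<uint32_t>(bits & 0xFFFFFFFFu));
  hi = static_cast<int32_t>(static_cast<uint32_t>(bits >> 32));
}

inline int64_t sketch_pack(int32_t lo, int32_t hi) {
  const uint64_t bits =
      (static_cast<uint64_t>(static_cast<uint32_t>(hi)) << 32) |
      static_cast<uint64_t>(static_cast<uint32_t>(lo));
  return static_cast<int64_t>(bits);
}

// Splitting a value and packing the halves again yields the original value.
inline void sketch_roundtrip_demo() {
  int32_t lo = 0, hi = 0;
  sketch_unpack(0x0123456789ABCDEFLL, lo, hi);
  assert(sketch_pack(lo, hi) == 0x0123456789ABCDEFLL);
}
```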

ABataev added inline comments. Aug 7 2019, 7:42 AM

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
17–18	Then better to fix original `INLINE` macro and replace `__inline__` with `__forceinline__`. I assume we'd like to inline all the functions.

JonChesterfield mentioned this in D65876: Use forceinline. Necessary for nvcc to inline small functions within the bitcode library. Aug 7 2019, 7:53 AM

JonChesterfield marked 2 inline comments as done. Aug 7 2019, 7:55 AM

JonChesterfield added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
17–18	That works for me. I suspect everything marked INLINE was intended to be inlined. Diff at D65876.

JonChesterfield mentioned this in rL368177: Use forceinline. Necessary for nvcc to inline small functions within the…. Aug 7 2019, 8:25 AM

JonChesterfield mentioned this in rGae0178bee72c: Use forceinline. Necessary for nvcc to inline small functions within the….

drop omptarget-nvptx include
address review comments
rebase, use INLINE from D65876

Harbormaster completed remote builds in B36345: Diff 213916. Aug 7 2019, 8:45 AM

I'm fine with this, @ABataev ?

openmp/libomptarget/deviceRTLs/nvptx/src/loop.cu
392	I would prefer `__SHFL_SYNC` `__ACTIVEMASK` etc. also to be function calls to `__kmpc_XXXX` functions but I won't require it.
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
17–18	We can even have different definitions of `INLINE` if that becomes necessary.
40	`int` -> `(u)int32_t` ?

JonChesterfield marked 2 inline comments as done. Aug 7 2019, 1:00 PM

JonChesterfield added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/loop.cu
392	Likewise. They're currently defined in `omptarget-nvptx.h`. I'd like to move them into target_impl and replace the macros with inline functions. That'll raise the question of how to handle implementations for different cuda versions, which I'd like to avoid for this first patch.
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
40	The cuda functions (plus `__ffsll` etc) return `int`. I'd slightly prefer `uint32_t` (on the basis that these can't return a negative integer). I'm happy either way.

jdoerfert added inline comments. Aug 7 2019, 1:40 PM

openmp/libomptarget/deviceRTLs/nvptx/src/loop.cu
392	Agreed, let's do it later but eventually. We can have the #ifdef cascade for cuda versions but that should be in the cuda subfolder (target_impl.h for example) and the "general logic" contains proper calls.
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
40	I think we should make it clear what we expect wrt. bit-width if we have an expectation. Given the `(u)int32` floating around I'd say we do have some expectations/restrictions.

JonChesterfield marked an inline comment as done. Aug 8 2019, 3:15 AM

JonChesterfield added inline comments.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
40	I think unsigned is clearer for bit manipulation functions. Width is mostly determined by warp/wavefront width, at least for these functions. For the return width, I'm inclined to follow the IR intrinsics and match the argument. There's a mix of signed and unsigned in the existing code, with implicit conversions between them. Amdgcn happens to use a slightly different mix. Considering that NextIter returns a uint64_t, what would you think of using uint32/64 for the functions in loop as well as in this header?
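One way to picture "width is determined by warp/wavefront width" is a mask type chosen from the lane count. The sketch below is an assumption for illustration, not what the patch does (the patch uses a plain `typedef uint32_t __kmpc_impl_lanemask_t`); `SketchLaneMask` and `sketch_lanemask_lt` are hypothetical names, and the 64-wide case is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical: choose an unsigned mask type with one bit per lane.
template <int Width> struct SketchLaneMask {
  static_assert(Width == 32 || Width == 64, "sketch covers 32 or 64 lanes");
  using type =
      typename std::conditional<Width == 32, uint32_t, uint64_t>::type;
};

// Mask of all lanes with an index lower than `lane`; a host-side stand-in
// for reading the %lanemask_lt register on nvptx.
template <int Width>
typename SketchLaneMask<Width>::type sketch_lanemask_lt(int lane) {
  using mask_t = typename SketchLaneMask<Width>::type;
  assert(lane >= 0 && lane < Width);
  return (mask_t{1} << lane) - 1;
}
```

Using unsigned fixed-width types keeps the shift and subtraction well defined for every lane index, which is part of the argument made above for preferring `uint32_t`/`uint64_t` over `int`.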

@ABataev, others, any concerns? If not, let's go ahead with this.

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h
40	"what would you think of using uint32/64 for the functions in loop as well as in this header?" I would prefer that. We might need a device specific type but (unsigned) int does make me worry every time. Let's table this discussion for now to make progress on this.

I like the general approach.

In D65836#1621104, @jdoerfert wrote:

@ABataev, others, any concerns? If not, let's go ahead with this.

Did you come to an agreement about the design of the new universal library? I would suggest starting with this.
We need to find a better file layout and settle the design of the target-specific functions (a template class with a specialized implementation for each target, or a plain set of target-specific functions controlled by conditional compilation), etc.
Then, I would suggest committing this new structure first.
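The two design options named in this comment can be contrasted in a short host-side sketch. Everything here is hypothetical: the tag types, `SketchTargetImpl`, the macro `SKETCH_TARGET_AMDGCN`, and the 64-bit mask for amdgcn are assumptions made for illustration, and GCC/Clang builtins stand in for device intrinsics.

```cpp
#include <cassert>
#include <cstdint>

// Option 1: a class template specialised for each target.
struct SketchNvptx {};
struct SketchAmdgcn {};
template <typename Target> struct SketchTargetImpl;
template <> struct SketchTargetImpl<SketchNvptx> {
  using lanemask_t = uint32_t;
  static int popc(lanemask_t x) { return __builtin_popcount(x); }
};
template <> struct SketchTargetImpl<SketchAmdgcn> {
  using lanemask_t = uint64_t; // assumed 64-wide wavefront
  static int popc(lanemask_t x) { return __builtin_popcountll(x); }
};

template <typename Target>
int sketch_count_active(typename SketchTargetImpl<Target>::lanemask_t active) {
  return SketchTargetImpl<Target>::popc(active);
}

// Option 2: plain functions, with the target chosen by conditional
// compilation (in practice, by which target header is on the include path).
#if defined(SKETCH_TARGET_AMDGCN)
typedef uint64_t sketch_lanemask_t;
inline int sketch_impl_popc(sketch_lanemask_t x) {
  return __builtin_popcountll(x);
}
#else
typedef uint32_t sketch_lanemask_t;
inline int sketch_impl_popc(sketch_lanemask_t x) {
  return __builtin_popcount(x);
}
#endif

inline int sketch_count_active_plain(sketch_lanemask_t active) {
  const int n = sketch_impl_popc(active);
  assert(n <= static_cast<int>(sizeof(sketch_lanemask_t) * 8));
  return n;
}
```

The patch under review follows the second style: `loop.cu` calls plain `__kmpc_impl_*` functions and the target supplies them through `target_impl.h`.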

In D65836#1621112, @ABataev wrote:

In D65836#1621104, @jdoerfert wrote:

@ABataev, others, any concerns? If not, let's go ahead with this.

Did you come to an agreement about the design of the new universal library? I would suggest starting with this.

We have a proposal and no major complaints, I count that as agreement.

We need to find a better file layout,

See above.

design of the target-specific functions (a template class with a specialized implementation for each target, or a plain set of target-specific functions controlled by conditional compilation), etc.

The design chosen here seems fine to me. Others didn't disagree. It is for sure a step in the right direction.

Then, I would suggest committing this new structure first.

I think extracting the code makes more sense first. This was also mentioned in the proposal and not objected to.
The problem is we cannot really restructure as it is still interleaved with target dependent code.

In D65836#1621254, @jdoerfert wrote:

In D65836#1621112, @ABataev wrote:

In D65836#1621104, @jdoerfert wrote:

@ABataev, others, any concerns? If not, let's go ahead with this.

Did you come to an agreement about the design of the new universal library? I would suggest starting with this.

We have a proposal and no major complaints, I count that as agreement.

I would suggest pinging one more time; maybe someone just missed it.

We need to find a better file layout,

See above.

design of the target-specific functions (a template class with a specialized implementation for each target, or a plain set of target-specific functions controlled by conditional compilation), etc.

The design chosen here seems fine to me. Others didn't disagree. It is for sure a step in the right direction.

Then, I would suggest committing this new structure first.

I think extracting the code makes more sense first. This was also mentioned in the proposal and not objected to.
The problem is we cannot really restructure as it is still interleaved with target dependent code.

I'm wary of refactoring to support multiple architectures while there is only one in tree. It's too difficult to see where the abstractions should be, and it's usually easier to introduce said abstractions than to move them later.

^ obviously I'm proposing a refactor despite that - hoping to minimise risk by keeping the change small and the abstraction very thin.

Ping. I'd like to land this - Alexey?

No complaints on the list, none here. LGTM.

This revision is now accepted and ready to land. Aug 13 2019, 10:10 AM

What about the naming convention here? Shall we use capital letters for the variable names, or is it fine as is? Also, did you come to an agreement about design, directory layout etc.?

Naming convention

What about the naming convention here? Shall we use capital letters for the variable names, or is it fine as is?

Long term it's probably worth moving to the LLVM conventions throughout. Short term, I'd rather keep roughly the same style as the surrounding code. I believe that's what this patch does.

Also, did you come to an agreement about design, directory layout etc.?

Overall agreement on the final design is a work in progress. I believe there is agreement that we'll need at least one target specific header, so this patch follows the suggestion in D64217 and calls it target_impl.

At the moment, nvptx/src is the most sensible directory for it (as that's the only directory!). I think this is an uncontentious step in the right direction.

In D65836#1627635, @JonChesterfield wrote:

What about the naming convention here? Shall we use capital letters for the variable names, or is it fine as is?

Long term it's probably worth moving to the LLVM conventions throughout. Short term, I'd rather keep roughly the same style as the surrounding code. I believe that's what this patch does.

Also, did you come to an agreement about design, directory layout etc.?

Overall agreement on the final design is a work in progress. I believe there is agreement that we'll need at least one target specific header, so this patch follows the suggestion in D64217 and calls it target_impl.

At the moment, nvptx/src is the most sensible directory for it (as that's the only directory!). I think this is an uncontentious step in the right direction.

Ok, LG

Closed by commit rL368751: Factor architecture dependent code out of loop.cu (authored by JonChesterfield). Aug 13 2019, 2:41 PM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. Aug 13 2019, 2:41 PM

Herald added a subscriber: llvm-commits.

Jon, please subscribe to openmp-commits so that commit emails get through immediately. Thanks!

In D65836#1628764, @Hahnfeld wrote:

Jon, please subscribe to openmp-commits so that commit emails get through immediately. Thanks!

Ah, so that's why I'm getting the rejection emails. Thanks. Subscribed.

The bounce message recommends emailing openmp-commits-owner@lists.llvm.org. It should probably mention subscribing to commits as well.

JonChesterfield mentioned this in D66809: Use target_impl functions to replace more inline asm. Aug 27 2019, 9:54 AM

JonChesterfield mentioned this in rL370216: Use target_impl functions to replace more inline asm. Aug 28 2019, 8:03 AM

JonChesterfield mentioned this in rG329442192625: Use target_impl functions to replace more inline asm.

Diff 213766

openmp/libomptarget/deviceRTLs/nvptx/src/loop.cu

  //===------------ loop.cu - NVPTX OpenMP loop constructs --------- CUDA -*-===//
  //
  // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
  // See https://llvm.org/LICENSE.txt for license information.
  // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
  //
  //===----------------------------------------------------------------------===//
  //
  // This file contains the implementation of the KMPC interface
  // for the loop construct plus other worksharing constructs that use the same
  // interface as loops.
  //
  //===----------------------------------------------------------------------===//

  #include "omptarget-nvptx.h"
+ #include "target_impl.h"

  ////////////////////////////////////////////////////////////////////////////////
  ////////////////////////////////////////////////////////////////////////////////
  // template class that encapsulate all the helper functions
  //
  // T is loop iteration type (32 | 64) (unsigned | signed)
  // ST is the signed version of T
  ////////////////////////////////////////////////////////////////////////////////
@@ 352 unchanged lines hidden @@ INLINE static void dispatch_init(kmp_Ident *loc, int32_t threadId,
  }
  }

  ////////////////////////////////////////////////////////////////////////////////
  // Support for dispatch next

  INLINE static int64_t Shuffle(unsigned active, int64_t val, int leader) {
    int lo, hi;
-   asm volatile("mov.b64 {%0,%1}, %2;" : "=r"(lo), "=r"(hi) : "l"(val));
+   __kmpc_impl_unpack(val, &lo, &hi);
    hi = __SHFL_SYNC(active, hi, leader);
    lo = __SHFL_SYNC(active, lo, leader);
-   asm volatile("mov.b64 %0, {%1,%2};" : "=l"(val) : "r"(lo), "r"(hi));
-   return val;
+   return __kmpc_impl_pack(lo, hi);
  }

  INLINE static uint64_t NextIter() {
-   unsigned int active = __ACTIVEMASK();
-   int leader = __ffs(active) - 1;
-   int change = __popc(active);
-   unsigned lane_mask_lt;
-   asm("mov.u32 %0, %%lanemask_lt;" : "=r"(lane_mask_lt));
-   unsigned int rank = __popc(active & lane_mask_lt);
+   __kmpc_impl_lanemask_t active = __ACTIVEMASK();
+   int leader = __kmpc_impl_ffs(active) - 1;
+   int change = __kmpc_impl_popc(active);
+   __kmpc_impl_lanemask_t lane_mask_lt = __kmpc_impl_lanemask_lt();
+   unsigned int rank = __kmpc_impl_popc(active & lane_mask_lt);
    uint64_t warp_res;
    if (rank == 0) {
      warp_res = atomicAdd(
          (unsigned long long *)&omptarget_nvptx_threadPrivateContext->Cnt(),
          change);
    }
    warp_res = Shuffle(active, warp_res, leader);
    return warp_res + rank;
@@ 402 unchanged lines hidden @@
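The logic of `NextIter` in the diff above can be followed with a serial host-side simulation of one 32-lane warp. This is a sketch under stated assumptions, not device code: `SketchLaneResult` and `sketch_next_iter` are hypothetical names, the atomic add and the shuffle are modelled by ordinary variables, and GCC/Clang builtins replace `__ffs` and `__popc`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical result record: which lane received which iteration number.
struct SketchLaneResult {
  int lane;
  uint64_t iter;
};

// Simulates one warp-wide call: the leader adds the number of active lanes
// to the shared counter once, and every active lane derives a distinct
// iteration number from the counter's old value plus its rank.
inline std::vector<SketchLaneResult> sketch_next_iter(uint32_t active,
                                                      uint64_t &counter) {
  assert(active != 0 && "at least one lane must be active");
  const int leader = __builtin_ffs(static_cast<int>(active)) - 1;
  const int change = __builtin_popcount(active);

  // Stand-in for atomicAdd, which returns the value before the addition.
  const uint64_t warp_res = counter;
  counter += static_cast<uint64_t>(change);

  std::vector<SketchLaneResult> out;
  for (int lane = 0; lane < 32; ++lane) {
    if (!(active & (uint32_t{1} << lane)))
      continue; // inactive lanes do not take part
    const uint32_t lane_mask_lt = (uint32_t{1} << lane) - 1;
    const int rank = __builtin_popcount(active & lane_mask_lt);
    // On the device the shuffle broadcasts warp_res from the leader lane.
    out.push_back({lane, warp_res + static_cast<uint64_t>(rank)});
  }
  assert(out.front().lane == leader); // the rank-0 lane is the leader
  return out;
}
```

With active lanes 0, 1 and 3 and a counter starting at 10, the lanes would receive 10, 11 and 12, and the counter would end at 13.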

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h

This file was added.

//===------------ target_impl.h - NVPTX OpenMP GPU options ------- CUDA -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// Definitions of target specific functions
//
//===----------------------------------------------------------------------===//
#ifndef _TARGET_IMPL_H_
#define _TARGET_IMPL_H_

#include <stdint.h>

#define FORCEINLINE __forceinline__ __device__

FORCEINLINE void __kmpc_impl_unpack(int64_t val, int32_t *lo, int32_t *hi) {
  asm volatile("mov.b64 {%0,%1}, %2;" : "=r"(*lo), "=r"(*hi) : "l"(val));
}

FORCEINLINE int64_t __kmpc_impl_pack(int32_t lo, int32_t hi) {
  int64_t val;
  asm volatile("mov.b64 %0, {%1,%2};" : "=l"(val) : "r"(lo), "r"(hi));
  return val;
}

typedef uint32_t __kmpc_impl_lanemask_t;

FORCEINLINE __kmpc_impl_lanemask_t __kmpc_impl_lanemask_lt() {
  uint32_t res;
  asm("mov.u32 %0, %%lanemask_lt;" : "=r"(res));
  return res;
}

FORCEINLINE int __kmpc_impl_ffs(uint32_t x) { return __ffs(x); }

FORCEINLINE int __kmpc_impl_popc(uint32_t x) { return __popc(x); }

#endif

This is an archive of the discontinued LLVM Phabricator instance.

Factor architecture dependent code out of loop.cu
ClosedPublic
