This is an archive of the discontinued LLVM Phabricator instance.

Differential D68310

Introduce an interface for target_impl that supports default implementations
Abandoned · Public

Authored by JonChesterfield on Oct 1 2019, 5:37 PM.

Details

Reviewers

jdoerfert
ABataev
grokos
ronlieb
gregrodgers
RaviNarayanaswamy

Summary

Introduce an interface for target_impl that supports default implementations

Design goals:

No runtime indirection
Permit header-only implementations for inlining under nvcc
Permit default implementations
Catch various errors at compile time
Syntactically reasonable
Familiar to C++ developers

The API is an adaptation of the curiously recurring template pattern, modified to
use static calls throughout. This gives the impl::Bits::pack syntax that matches
free functions in a namespace, which is essentially what the static functions in
a non-template class are.

Marking methods as = delete provides better diagnostics than missing symbols
at link time. The friend/private annotations are analogous to the non-virtual
interface of runtime dispatch.
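
As a rough illustration of the pattern described in this summary, the following host-only sketch shows a static CRTP interface with one default implementation and one deleted hook. The names StaticApi and HostBits are hypothetical, and plain C++ stands in for the CUDA code; the actual patch is the target_api.h diff in this revision.

```cpp
#include <cstdint>

// Hypothetical host-only sketch of a static CRTP interface.
template <typename T> class StaticApi {
public:
  // Client-facing entry points; resolved at compile time, no vtable.
  static uint64_t pack(uint32_t lo, uint32_t hi) { return T::packImpl(lo, hi); }
  static void unpack(uint64_t val, uint32_t &lo, uint32_t &hi) {
    T::unpackImpl(val, lo, hi);
  }

private:
  // Default: used when T does not declare its own packImpl.
  static uint64_t packImpl(uint32_t lo, uint32_t hi) {
    return (static_cast<uint64_t>(hi) << 32) | lo;
  }
  // No default: a target that omits unpackImpl fails to compile when
  // unpack is used, instead of failing at link time.
  static void unpackImpl(uint64_t, uint32_t &, uint32_t &) = delete;

  friend T;
  StaticApi() = delete;
};

// Hypothetical target: inherits the default pack, supplies its own unpack.
class HostBits : public StaticApi<HostBits> {
  friend class StaticApi<HostBits>;

private:
  static void unpackImpl(uint64_t val, uint32_t &lo, uint32_t &hi) {
    lo = static_cast<uint32_t>(val);
    hi = static_cast<uint32_t>(val >> 32);
  }
};
```

Callers write HostBits::pack(lo, hi), which reads like a call to a free function in a namespace.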

Diff Detail

Repository

rG LLVM Github Monorepo

Build Status

Buildable 38860
Build 38859: arc lint + arc unit

Event Timeline

JonChesterfield created this revision. · Oct 1 2019, 5:37 PM

Herald added a project: Restricted Project. · Oct 1 2019, 5:37 PM

Herald added a subscriber: openmp-commits.

Harbormaster completed remote builds in B38860: Diff 222735. · Oct 1 2019, 5:37 PM

nvptx/src/target_impl.h currently defines an implicit interface, made up of various free functions, that abstracts over low-level aspects of nvptx. I'd prefer a more explicit interface before introducing another target.

This is a big design space. I think this diff is a reasonable point, but failing that it gives us a starting point for discussion.

JonChesterfield added a reviewer: RaviNarayanaswamy. · Oct 1 2019, 5:46 PM

Could you motivate this more? The template solution is for sure more complex than the old proposal/solution and still requires the same cmake tricks, so I'd like to hear why it is worth it.

I'll also go through the bullet points you have (in order) and leave some feedback:

I don't see why there should be "runtime indirection" if you don't use the template stuff.
I also do not see why the "old design" would not allow header only (it was actually proposed that way after all).
I get the default implementations but (1) I doubt there are many (where do you expect default impls that are in the target_impl part?) and (2) we could deal with them with different headers, the same way as I proposed to work with different targets.
Unclear why declarations would not do the same, actually declarations would probably catch more than templates do.
Unsure what "syntactically reasonable" means.
Familiar to C++ developers, is not a good argument when the C++ stuff actually doesn't solve the problem (still cmake magic involved)

openmp/libomptarget/deviceRTLs/nvptx/src/target_api.h
34	It is odd to provide a default for pack but not for unpack; when would this ever be useful? It also seems overly complicated to redirect here in the first place. Subclasses could just as well provide pack/unpack directly, couldn't they?

In D68310#1690758, @jdoerfert wrote:

Could you motivate this more? The template solution is for sure more complex than the old proposal/solution and still requires the same cmake tricks, so I'd like to hear why it is worth it.

Thanks! If the end result of the review is we stick with free functions that's fine with me. I can imagine using the types to move some complexity out of cmake but haven't thought that through. I think the primary drawbacks to the current system are relying on cmake to work with multiple architectures and the requirement to implement every function for every architecture.

I'd like the target_impl layer to have default implementations where they're meaningful. It would mean a target can introduce a new function, along with a default, and every other target would continue to work as before, without changing any of the other target code.

popc can be implemented as a call to __builtin_popcount if one doesn't want to call the cuda function
Lanemask, smid probably don't have good defaults
Fences might do, in that localised ones could call through to more global ones
The dozen or so atomic operations used in deviceRTL could all have default implementations that call the standard C++ functions, while using the cuda calls when preferred
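
The defaults listed above might look roughly like the sketch below. It assumes a GCC- or Clang-compatible host compiler; the function names default_popc and default_atomic_add are hypothetical, and the __atomic_fetch_add builtin stands in for whichever standard C++ atomic facility a real default would use.

```cpp
#include <cstdint>

// Hypothetical default: population count via a compiler builtin
// rather than the CUDA function.
inline uint32_t default_popc(uint32_t x) {
  return static_cast<uint32_t>(__builtin_popcount(x));
}

// Hypothetical default: atomic add via a GCC/Clang builtin.
// Returns the value stored at addr before the addition.
inline uint32_t default_atomic_add(uint32_t *addr, uint32_t val) {
  return __atomic_fetch_add(addr, val, __ATOMIC_SEQ_CST);
}
```

A target that prefers the cuda calls would supply its own versions in place of these.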

It's worth noting that the current list of free functions hits most of the bullet points in the diff, except that I can't see a way to provide compile time default implementations without some extra plumbing. Above template is a suggestion for said plumbing.

I'll also go through the bullet points you have (in order) and leave some feedback:

Thanks! I'll try the same format.

I don't see why there should be "runtime indirection" if you don't use the template stuff.

The most popular interface scheme in C++ involves virtual functions (usually with heap allocation). I don't trust devirtualisation, and eliding the heap allocation is messy.

I also do not see why the "old design" would not allow header only (it was actually proposed that way after all).

- It does. My preferred option is declarations in a header and implementations at link time, but that degraded codegen under nvcc.

I get the default implementations but (1) I doubt there are many (where do you expect default impls that are in the target_impl part?) and (2) we could deal with them with different headers, the same way as I proposed to work with different targets.

(1) All the bit level functions, all atomics. Partly papering over cuda as opposed to nvptx, but they're closely related. (2) is interesting - deferred to below

Unclear why declarations would not do the same, actually declarations would probably catch more than templates do.

I think the crtp approach moves some errors from link time to compile time. Mostly though I was referring to the private constructor / friend stuff - trying to make the interface harder to implement wrong. The = delete syntax was nice in that regard.

Unsure what "syntactically reasonable" means.

Mostly that the interface shouldn't be woven out of macros and code generators

Familiar to C++ developers, is not a good argument when the C++ stuff actually doesn't solve the problem (still cmake magic involved)

Not so much that the code should look like C++, more that the code shouldn't look totally alien to C++. E.g. there could be variable fields that map to arbitrary function calls, threadIdx.x style, but that would be a bad thing.

Above you said

we could deal with them with different headers, the same way as I proposed to work with different targets.

Please could you expand on that? I think the current multitarget plan is a common folder containing as much code as we can manage, with a target_impl.h in each target, where #include paths are set by cmake to look in the target subdir when compiling things in the common subdir. I can see a path to default functions that involves a separate header per function, where the existence of files on disk and some cmake determines which set are pulled into an aggregate header. That's not necessarily what you have in mind though.

openmp/libomptarget/deviceRTLs/nvptx/src/target_api.h
34	Sure. We'd either want defaults for both, or defaults for neither until a target is introduced that would use the defaults. Providing one of each is good for discussion but probably not how it should be committed. The redirect costs some code in the base class but none in the subclass (well, other than the Impl suffix). I like the separation between the interface to the client and the interface to the target, but sure - there's lots of ways to wire things up.

In D68310#1691089, @JonChesterfield wrote:

Above you said

we could deal with them with different headers, the same way as I proposed to work with different targets.

Please could you expand on that? I think the current multitarget plan is a common folder containing as much code as we can manage, with a target_impl.h in each target, where #include paths are set by cmake to look in the target subdir when compiling things in the common subdir. I can see a path to default functions that involves a separate header per function, where the existence of files on disk and some cmake determines which set are pulled into an aggregate header. That's not necessarily what you have in mind though.

A header file per function, or per set of functions that belong together, was what I meant. I'm still unsure how much "target specific code" we want to provide as a default without it becoming completely target independent. If there is code to be shared now, I mean with hopefully soon two targets, we could just call it common code. Once we get to the situation with >2 targets where some but not all share code, we can reevaluate and determine the best solution for the actual use case at hand. I'm not strictly against templates or overloading, but they do not solve all problems, basically only the ones we do not face yet. Designing something for a future use case is generally to be avoided (IMO).

A header file per function, or per set of functions that belong together, was what I meant.

Cool. So something like kmpc_atomics.h in common, that can be #included or can be ignored based on the target's desires.
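
One way such an optional header could be wired up is sketched below. This is an assumption for illustration, not what the thread settled on: it uses the C++17 __has_include preprocessor check in place of cmake include-path logic, and the header name target_popc.h, the macro KMPC_HAVE_TARGET_POPC, and the function kmpc_popc are all hypothetical.

```cpp
#include <cstdint>

// If the target directory on the include path provides target_popc.h
// (hypothetical name), it is expected to define kmpc_popc and the
// macro KMPC_HAVE_TARGET_POPC.
#if defined(__has_include)
#if __has_include("target_popc.h")
#include "target_popc.h"
#endif
#endif

// Otherwise fall back to a common default written in plain C++.
#ifndef KMPC_HAVE_TARGET_POPC
inline uint32_t kmpc_popc(uint32_t x) {
  uint32_t n = 0;
  for (; x != 0; x &= x - 1) // clear the lowest set bit each step
    ++n;
  return n;
}
#endif
```

With no target header present, the common default is compiled; a target opts out simply by shipping the header.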

I'm still unsure how much "target specific code" we want to provide as a default without it becoming completely target independent.

The "target specific" idea can be refined a little. I think we have the following categories:

Operations that have to be done in asm or with target builtins (e.g. lanemask)
Operations that can be done in C, but the target wants to use target builtins or asm anyway (e.g. pack)
Operations that are done in C, but different architectures want different C (don't have any yet)

If there is code to be shared now, I mean with hopefully soon two targets, we could just call it common code. Once we get to the situation with >2 targets and not all but some share some code, we can reevaluate and determine the best solution for the actual use case at hand. I'm not strictly against templates or overloading but these do not solve all problems but basically only the ones we do not face yet. Designing something for a future use case is generally to be avoided (IMO).

That's fair. The 'default' I have in mind is essentially what amdgcn uses, as it could be used by other architectures without changes. However until such point as a third architecture is imminent (and indeed we don't have two yet), it's difficult to reliably distinguish common from target specific.

I'll leave this diff open for a while to see if it attracts more comments.

JonChesterfield abandoned this revision. · Oct 24 2019, 11:33 PM

JonChesterfield mentioned this in D71404: [libomptarget][nfc] Introduce atomic wrapper function. · Dec 13 2019, 3:25 AM

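Revision Contents

Path	Size
openmp/libomptarget/deviceRTLs/nvptx/src/loop.cu	4 lines
openmp/libomptarget/deviceRTLs/nvptx/src/reduction.cu	4 lines
openmp/libomptarget/deviceRTLs/nvptx/src/target_api.h	49 lines
openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h	30 lines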

Diff 222735

openmp/libomptarget/deviceRTLs/nvptx/src/loop.cu

[376 lines hidden] INLINE static void dispatch_init(kmp_Ident *loc, int32_t threadId,
 }
 }

 ////////////////////////////////////////////////////////////////////////////////
 // Support for dispatch next

 INLINE static int64_t Shuffle(unsigned active, int64_t val, int leader) {
   uint32_t lo, hi;
-  __kmpc_impl_unpack(val, lo, hi);
+  __kmpc_impl::Bits::unpack(val, lo, hi);
   hi = __kmpc_impl_shfl_sync(active, hi, leader);
   lo = __kmpc_impl_shfl_sync(active, lo, leader);
-  return __kmpc_impl_pack(lo, hi);
+  return __kmpc_impl::Bits::pack(lo, hi);
 }

 INLINE static uint64_t NextIter() {
   __kmpc_impl_lanemask_t active = __kmpc_impl_activemask();
   uint32_t leader = __kmpc_impl_ffs(active) - 1;
   uint32_t change = __kmpc_impl_popc(active);
   __kmpc_impl_lanemask_t lane_mask_lt = __kmpc_impl_lanemask_lt();
   unsigned int rank = __kmpc_impl_popc(active & lane_mask_lt);
[410 lines hidden]

openmp/libomptarget/deviceRTLs/nvptx/src/reduction.cu

[23 lines hidden]
 void __kmpc_nvptx_end_reduce_nowait(int32_t global_tid) {}

 EXTERN int32_t __kmpc_shuffle_int32(int32_t val, int16_t delta, int16_t size) {
   return __kmpc_impl_shfl_down_sync(0xFFFFFFFF, val, delta, size);
 }

 EXTERN int64_t __kmpc_shuffle_int64(int64_t val, int16_t delta, int16_t size) {
   uint32_t lo, hi;
-  __kmpc_impl_unpack(val, lo, hi);
+  __kmpc_impl::Bits::unpack(val, lo, hi);
   hi = __kmpc_impl_shfl_down_sync(0xFFFFFFFF, hi, delta, size);
   lo = __kmpc_impl_shfl_down_sync(0xFFFFFFFF, lo, delta, size);
-  return __kmpc_impl_pack(lo, hi);
+  return __kmpc_impl::Bits::pack(lo, hi);
 }

 INLINE static void gpu_regular_warp_reduce(void *reduce_data,
                                            kmp_ShuffleReductFctPtr shflFct) {
   for (uint32_t mask = WARPSIZE / 2; mask > 0; mask /= 2) {
     shflFct(reduce_data, /*LaneId - not used= */ 0,
             /*Offset = */ mask, /*AlgoVersion=*/0);
   }
[491 lines hidden]

openmp/libomptarget/deviceRTLs/nvptx/src/target_api.h

This file was added.

//===--- target_api.h - OpenMP GPU target abstraction interface --- c++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// CRTP style static interface for target specific functions
//
//===----------------------------------------------------------------------===//

#ifndef _TARGET_API_H_
#define _TARGET_API_H_

#include <stdint.h>

#include "option.h"

namespace __kmpc_impl {

// nvcc requires this to be <typename T>. Fails to compile <typename I>.
template <typename T> class Api {
public:
  INLINE static uint64_t pack(uint32_t lo, uint32_t hi) {
    return T::packImpl(lo, hi);
  }
  INLINE static void unpack(uint64_t val, uint32_t &lo, uint32_t &hi) {
    T::unpackImpl(val, lo, hi);
  }

private:
  INLINE static uint64_t packImpl(uint32_t lo, uint32_t hi);
  INLINE static void unpackImpl(uint64_t, uint32_t &, uint32_t &) = delete;

private:
  friend T;
  Api() = delete;
};

// Default implementations
template <typename T>
INLINE uint64_t Api<T>::packImpl(uint32_t lo, uint32_t hi) {
  return (((uint64_t)hi) << 32u) | (uint64_t)lo;
}

} // namespace __kmpc_impl

#endif

openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h

-//===------------ target_impl.h - NVPTX OpenMP GPU options ------- CUDA -*-===//
+//===-------- target_api.h - OpenMP GPU target abstraction ------- CUDA -*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
 //
 // Definitions of target specific functions
 //
 //===----------------------------------------------------------------------===//
 #ifndef _TARGET_IMPL_H_
 #define _TARGET_IMPL_H_

 #include <stdint.h>

+#include "target_api.h"
 #include "option.h"

-INLINE void __kmpc_impl_unpack(uint64_t val, uint32_t &lo, uint32_t &hi) {
-  asm volatile("mov.b64 {%0,%1}, %2;" : "=r"(lo), "=r"(hi) : "l"(val));
-}
+typedef uint32_t __kmpc_impl_lanemask_t;

-INLINE uint64_t __kmpc_impl_pack(uint32_t lo, uint32_t hi) {
-  uint64_t val;
-  asm volatile("mov.b64 %0, {%1,%2};" : "=l"(val) : "r"(lo), "r"(hi));
-  return val;
-}
+namespace __kmpc_impl {
+class Bits : public Api<Bits> {
+  friend class Api<Bits>;

-typedef uint32_t __kmpc_impl_lanemask_t;
+private:
+  INLINE static uint64_t packImpl(uint32_t lo, uint32_t hi) {
+    uint64_t val;
+    asm volatile("mov.b64 %0, {%1,%2};" : "=l"(val) : "r"(lo), "r"(hi));
+    return val;
+  }
+
+  INLINE static void unpackImpl(uint64_t val, uint32_t &lo, uint32_t &hi) {
+    asm volatile("mov.b64 {%0,%1}, %2;" : "=r"(lo), "=r"(hi) : "l"(val));
+  }
+};
+} // namespace __kmpc_impl

 INLINE __kmpc_impl_lanemask_t __kmpc_impl_lanemask_lt() {
   __kmpc_impl_lanemask_t res;
   asm("mov.u32 %0, %%lanemask_lt;" : "=r"(res));
   return res;
 }

 INLINE __kmpc_impl_lanemask_t __kmpc_impl_lanemask_gt() {
[62 lines hidden]