This is an archive of the discontinued LLVM Phabricator instance.


Differential D75546

[libomptarget] Implement locks for amdgcn
ClosedPublic

Authored by JonChesterfield on Mar 3 2020, 10:52 AM.

Details

Reviewers

jdoerfert
ABataev
grokos

Commits

rG221ada654b28: [libomptarget] Implement locks for amdgcn

Summary

The nvptx implementation deadlocks on amdgcn. An atomic_cas issued by multiple
active lanes can deadlock: if one lane succeeds, all the others are locked
out, and the lane that succeeded cannot advance to release the lock while
they spin. The set_lock implementation therefore runs on a single lane.
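
As a rough illustration of the single-lane election, the host-side C++ sketch below picks the lowest set bit of an active-lane mask. It assumes __kmpc_impl_ffs follows the usual ffs convention (1-based index of the lowest set bit, 0 for an empty mask), which the "- 1" in the patch suggests but this page does not confirm. The names find_first_set and lowest_active_lane are hypothetical and not part of the runtime.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for __kmpc_impl_ffs (assumed semantics):
// returns the 1-based index of the least significant set bit, or 0 when
// no bit is set.
inline int find_first_set(uint64_t mask) {
  if (mask == 0)
    return 0;
  int index = 1;
  while ((mask & 1u) == 0) {
    mask >>= 1;
    ++index;
  }
  return index;
}

// Picks the lane that would perform the CAS on behalf of the wavefront.
// The mask must be non-zero: the calling lane itself is always active.
inline uint64_t lowest_active_lane(uint64_t active_mask) {
  assert(active_mask != 0 && "at least one lane must be active");
  return static_cast<uint64_t>(find_first_set(active_mask)) - 1;
}
```

With lanes 3 and 5 active (mask 0x28), lane 3 would be the one to spin on the lock while lane 5 skips the loop.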

Also uses a sleep intrinsic instead of the system clock for a probably
minor performance improvement. The unset/test implementations may be revised
later, based on code size / performance or similar concerns.
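
The possible revision of unset/test can be sketched on the host with std::atomic. This is only an analogy: the function names are made up for illustration, and whether the device-side __kmpc_atomic_* wrappers offer matching store/load forms is not established on this page.

```cpp
#include <atomic>
#include <cstdint>

constexpr uint32_t kUnset = 0u;
constexpr uint32_t kSet = 1u;

// Shape used in the patch: a read-modify-write whose result is discarded.
inline void unset_with_exchange(std::atomic<uint32_t> &lock) {
  (void)lock.exchange(kUnset);
}

// Possible revision: a plain atomic store leaves the same final value.
inline void unset_with_store(std::atomic<uint32_t> &lock) {
  lock.store(kUnset);
}

// Shape used in the patch: adding zero returns the current value.
inline uint32_t test_with_add(std::atomic<uint32_t> &lock) {
  return lock.fetch_add(0u);
}

// Possible revision: an atomic load observes the same value without a write.
inline uint32_t test_with_load(const std::atomic<uint32_t> &lock) {
  return lock.load();
}
```

Both pairs leave the lock word in the same state; the difference would be in code size and in how much memory traffic the operation generates, which is the concern the summary mentions.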

This implements the lock at a per-wavefront scope. That's not strictly as
specified, since OpenMP describes locks in terms of threads. I think the
nvptx implementation provides true per-thread locking on Volta and the same
per-warp locking on other architectures.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

JonChesterfield created this revision. Mar 3 2020, 10:52 AM

Herald added a project: Restricted Project. Mar 3 2020, 10:52 AM

Herald added subscribers: openmp-commits, jfb, mgorny, jvesely.

s/warp/wavefront

Harbormaster completed remote builds in B47944: Diff 247964. Mar 3 2020, 11:17 AM

Harbormaster completed remote builds in B47942: Diff 247961.

jdoerfert added inline comments. Mar 4 2020, 1:47 PM

openmp/libomptarget/deviceRTLs/amdgcn/src/amdgcn_locks.hip
Line 15: Doesn't this mean we cannot implement locks properly at all?

If we say thread == lane of simd, and every lane executes the same instruction with some masked off, then we can 'lock' a thread with respect to the rest of the device if and only if it is the only active one in the warp.

It's a symptom of defining thread to have finer granularity than the instruction pointer, which I believe is a serious design mistake passed down from CUDA. I'd like OpenMP to map SIMD onto the warp instead, at which point we can lock the newly defined 'thread' using an implementation like this.

If we continue doing thread==lane, and want to support this API, I think it can be done by rewriting the control-flow graph (CFG). The end result will perform horrendously, but that is fairly likely for any code using spin locks.

The above is also a spin lock, which is not traditionally a great idea, so I'd like to add a futex syscall equivalent to the kernel driver. That's a longer term goal.

In the meantime, this is the most useful functionality I have been able to work out under the openmp lock API.
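
The thread==lane problem described in this reply can be made concrete with a toy host-side model. This is a deliberately simplified sketch, not a description of real hardware: it assumes all lanes of a wavefront share one program counter and that no lane reaches the code after the spin loop (where the lock would be released) until every lane has left the loop. The name spin_iterations is hypothetical.

```cpp
#include <cstdint>

// Toy lockstep model. `lanes_in_loop` is a bit mask of lanes that enter the
// CAS spin loop together. Returns the number of loop iterations needed for
// all lanes to leave the loop, or -1 if some lane is still spinning after
// `max_iterations` (the model's stand-in for a deadlock).
inline int spin_iterations(uint64_t lanes_in_loop, int max_iterations) {
  uint32_t lock = 0; // 0 = unset, 1 = set
  for (int iteration = 1; iteration <= max_iterations; ++iteration) {
    for (int lane = 0; lane < 64; ++lane) {
      const uint64_t bit = uint64_t{1} << lane;
      if ((lanes_in_loop & bit) == 0)
        continue; // lane is masked off or has already left the loop
      if (lock == 0) {
        lock = 1;              // this lane's CAS succeeded
        lanes_in_loop &= ~bit; // it leaves the loop but cannot run ahead
      }
    }
    if (lanes_in_loop == 0)
      return iteration;
    // The lock is never released here: the release sits after the loop,
    // and the wavefront only gets there once the mask is empty.
  }
  return -1;
}
```

In this model a single lane gets through in one iteration, while two or more lanes never do, which matches the "if and only if it is the only active one in the warp" condition above.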

Can we have a print("Locks are not supported in this thread mapping model") instead?

Warn instead of providing a partial implementation

In D75546#1907737, @jdoerfert wrote:

Can we have a print("Locks are not supported in this thread mapping model") instead?

Absolutely. Updated the diff to do so.

The CAS lock implementation will work correctly for a simd=>warp model, so it can be resurrected once that model is available.
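
For context, once each lock attempt has a single owner the CAS loop is an ordinary spin lock. A minimal host-side sketch with std::atomic follows; the function names are hypothetical, and std::this_thread::yield() merely stands in for the sleep intrinsic used on the device.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

constexpr uint32_t kUnset = 0u;
constexpr uint32_t kSet = 1u;

// One CAS attempt: returns true if this caller moved the lock word from
// unset to set, false if it was already held.
inline bool try_set_lock(std::atomic<uint32_t> &lock) {
  uint32_t expected = kUnset;
  return lock.compare_exchange_strong(expected, kSet);
}

// Spin until the CAS succeeds, backing off between attempts.
inline void set_lock(std::atomic<uint32_t> &lock) {
  while (!try_set_lock(lock)) {
    std::this_thread::yield(); // host stand-in for __builtin_amdgcn_s_sleep(0)
  }
}

// Release the lock so the next CAS attempt can succeed.
inline void unset_lock(std::atomic<uint32_t> &lock) {
  lock.store(kUnset);
}
```

The structure matches the set_lock in the diff: the only part that depends on the thread mapping is who is allowed to enter the loop.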

Harbormaster completed remote builds in B48218: Diff 248515. Mar 5 2020, 10:24 AM

LGTM.

This revision is now accepted and ready to land. Mar 5 2020, 11:59 AM

Closed by commit rG221ada654b28: [libomptarget] Implement locks for amdgcn (authored by JonChesterfield). Mar 5 2020, 12:40 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                          Size
openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt          1 line
openmp/libomptarget/deviceRTLs/amdgcn/src/amdgcn_locks.hip    51 lines

Diff 247964

openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt

 endif()

 get_filename_component(devicertl_base_directory
   ${CMAKE_CURRENT_SOURCE_DIR}
   DIRECTORY)

 set(cuda_sources
   ${CMAKE_CURRENT_SOURCE_DIR}/src/amdgcn_smid.hip
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/amdgcn_locks.hip
   ${CMAKE_CURRENT_SOURCE_DIR}/src/target_impl.hip
   ${devicertl_base_directory}/common/src/cancel.cu
   ${devicertl_base_directory}/common/src/critical.cu
   ${devicertl_base_directory}/common/src/data_sharing.cu
   ${devicertl_base_directory}/common/src/libcall.cu
   ${devicertl_base_directory}/common/src/loop.cu
   ${devicertl_base_directory}/common/src/omp_data.cu
   ${devicertl_base_directory}/common/src/omptarget.cu

openmp/libomptarget/deviceRTLs/amdgcn/src/amdgcn_locks.hip

This file was added.

//===-- amdgcn_locks.hip - AMDGCN OpenMP GPU lock implementation -- HIP -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// Definitions of openmp lock functions
// A 'thread' maps onto a lane of the wavefront. This means a per-thread lock
// cannot be implemented - if one thread gets the lock, it can't continue on to
// the next instruction in order to do anything as the other threads are waiting
// to take the lock.
// The closest approximation we can implement is to lock per-wavefront.
//
//===----------------------------------------------------------------------===//

#include "common/support.h"
#include "common/target_atomic.h"
#include "target_impl.h"

#define UNSET 0u
#define SET 1u

DEVICE void __kmpc_impl_init_lock(omp_lock_t *lock) {
  __kmpc_impl_unset_lock(lock);
}

DEVICE void __kmpc_impl_destroy_lock(omp_lock_t *lock) {
  __kmpc_impl_unset_lock(lock);
}

DEVICE void __kmpc_impl_set_lock(omp_lock_t *lock) {
  uint64_t lowestActiveThread = __kmpc_impl_ffs(__kmpc_impl_activemask()) - 1;
  if (GetLaneId() == lowestActiveThread) {
    while (__kmpc_atomic_cas(lock, UNSET, SET) != UNSET) {
      __builtin_amdgcn_s_sleep(0);
    }
  }
  // test_lock will now return true for any thread in the wavefront
}

DEVICE void __kmpc_impl_unset_lock(omp_lock_t *lock) {
  // Could be an atomic store of UNSET
  (void)__kmpc_atomic_exchange(lock, UNSET);
}

DEVICE int __kmpc_impl_test_lock(omp_lock_t *lock) {
  // Could be an atomic load
  return __kmpc_atomic_add(lock, 0u);
}