This is an archive of the discontinued LLVM Phabricator instance.

Differential D152035

[OpenMP] Only initialize a single queue/stream/event eagerly
Abandoned · Public

Authored by jdoerfert on Jun 2 2023, 2:43 PM.

Details

Reviewers

jplehr
mhalk
jhuber6
tianshilei1992
ye-luo

Summary

Initialization for many queues/streams/events might come at a cost even
if we do not use them. This patch lazily initializes them, otherwise
nothing (major) is supposed to change. A minor difference is the
handling of an error in the initialization of an AMD queue (other than
the first). We now report an error but continue with the first queue;
unsure if this will ever come up.
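
The behavior described in the summary can be illustrated with a minimal, self-contained sketch. `LazyQueue`, `QueuePool`, `FailOnCreate`, and `getNextQueueIndex` are hypothetical stand-ins rather than the plugin's types, and real HSA queue creation is replaced by a boolean flag. Under those assumptions, the pool clamps the queue count to at least one, creates only queue 0 eagerly, creates the others on first use, and falls back to queue 0 if a late creation fails.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a device queue; creation is simulated by a flag.
struct LazyQueue {
  bool Created = false;
  bool FailOnCreate = false; // Illustration hook to simulate a failing create.
  std::mutex Mutex;

  // Create the queue unconditionally. Returns false on simulated failure.
  bool init() {
    if (FailOnCreate)
      return false;
    Created = true;
    return true;
  }

  // Create the queue only if it has not been created yet.
  bool initLazy() {
    std::lock_guard<std::mutex> Lock(Mutex);
    return Created ? true : init();
  }
};

// Hypothetical pool that hands out queues round-robin.
struct QueuePool {
  std::vector<LazyQueue> Queues;
  std::atomic<uint32_t> NextQueue{0};

  // Clamp the requested count to [1, MaxQueues] and create queue 0 eagerly.
  QueuePool(uint32_t Requested, uint32_t MaxQueues)
      : Queues(std::max<uint32_t>(1, std::min(Requested, MaxQueues))) {
    bool Ok = Queues.front().init();
    assert(Ok && "eager creation of queue 0 failed");
    (void)Ok;
  }

  // Return the index of the queue to use next. Queues other than 0 are
  // created on first use; on failure the error is reported and queue 0 is used.
  uint32_t getNextQueueIndex() {
    uint32_t Current = NextQueue.fetch_add(1, std::memory_order_relaxed);
    uint32_t Idx = Current % static_cast<uint32_t>(Queues.size());
    if (Idx == 0)
      return 0;
    if (!Queues[Idx].initLazy()) {
      std::fprintf(stderr, "lazy init of queue %u failed, using queue 0\n",
                   static_cast<unsigned>(Idx));
      return 0;
    }
    return Idx;
  }

  // Count how many queues have actually been created so far.
  uint32_t numCreated() const {
    uint32_t N = 0;
    for (const LazyQueue &Q : Queues)
      N += Q.Created ? 1 : 0;
    return N;
  }
};
```

In this sketch a pool requested with 4 queues on a device that allows 3 starts with one created queue; further queues appear only as `getNextQueueIndex` cycles through them.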

Event Timeline

jdoerfert created this revision. Jun 2 2023, 2:43 PM

Herald added a project: Restricted Project. Jun 2 2023, 2:43 PM

Herald added subscribers: sunshaoce, kerbowa, guansong and 3 others.

jdoerfert requested review of this revision. Jun 2 2023, 2:43 PM

Herald added a subscriber: sstefan1. Jun 2 2023, 2:43 PM

jdoerfert added a reviewer: ye-luo. Jun 2 2023, 2:43 PM

Harbormaster completed remote builds in B236282: Diff 527993. Jun 2 2023, 2:47 PM

tianshilei1992 added inline comments. Jun 2 2023, 3:25 PM

openmp/docs/design/Runtimes.rst, line 1200:
tianshilei1992: FWIW, events are used more frequently than streams even for single thread offloading.

FWIW, currently talking to @jplehr and @mhalk about the queue <-> stream mapping. Might be a follow up.

openmp/docs/design/Runtimes.rst, line 1200:
jdoerfert: True, not sure what the right value is. I can keep them at 32 if ppl prefer.

mhalk added inline comments. Jun 6 2023, 4:20 AM

openmp/docs/design/Runtimes.rst, line 1200:
mhalk: Just wanted to point out (since it was not directly obvious to me): for value `LIBOMPTARGET_NUM_INITIAL_STREAMS=n`, `getNextQueue` will be called `n` times. So, the patch would have to be adapted if `LIBOMPTARGET_NUM_INITIAL_STREAMS` will be increased. Otherwise, one would have again `n <= OMPX_NumQueues` HSA queues from the get-go, even if they're not used.
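
To make the point in the last comment concrete, here is a small sketch that counts how many queues would exist right after startup. It assumes, as the comment describes, that each of the `n` pre-created streams takes one queue in round-robin order and that a queue is created the first time it is taken. The function name and the model are illustrative assumptions, not code from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: each pre-created stream takes the next queue in
// round-robin order, and a queue is created the first time it is taken.
uint32_t queuesCreatedAtStartup(uint32_t NumInitialStreams,
                                uint32_t NumQueues) {
  assert(NumQueues > 0 && "the model needs at least one queue");
  std::vector<bool> Created(NumQueues, false);
  uint32_t Next = 0;
  uint32_t Count = 0;
  for (uint32_t S = 0; S < NumInitialStreams; ++S) {
    uint32_t Idx = Next++ % NumQueues;
    if (!Created[Idx]) {
      Created[Idx] = true;
      ++Count;
    }
  }
  return Count;
}
```

Under this model the result is the smaller of the two arguments: one initial stream touches one queue, while 32 initial streams with 4 queues would create all 4 at startup, which is the situation the patch tries to avoid.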

mhalk added a child revision: D154523: [OpenMP][AMDGPU] Single eager resource init + HSA queue utilization tracking. Jul 5 2023, 9:32 AM

kevinsala added a subscriber: kevinsala. Jul 11 2023, 7:20 AM

kevinsala added inline comments.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp, line 585:
kevinsala: Is there any reason to maintain both `init` and `initLazy` instead of making `init` lazy by default?

I am unsure we need this after all. @mhalk, can u check, if we don't need this we can scrap it.

Just looked through this patch and D154523.

Don't have a clear "yes"/"no".
IMO merging the two makes sense as they achieve the same: "single HSA queue eager init".

The busy tracking will revert like half of this patch's changes in amdgpu/src/rtl.cpp.
Should you tend to scrap it, I will have to incorporate ~30ish LoC from this one.
That is, the docs & setting of default/initial values (esp. outside of amdgpu/src/rtl.cpp).

mhalk removed a child revision: D154523: [OpenMP][AMDGPU] Single eager resource init + HSA queue utilization tracking. Aug 1 2023, 11:03 AM

Subsumed by D154523.

Revision Contents

Path                                                                            Size
openmp/docs/design/Runtimes.rst                                                 5 lines
openmp/libomptarget/include/Utilities.h                                         6 lines
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp                          47 lines
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp  6 lines

Diff 527993

openmp/docs/design/Runtimes.rst

@@ ... @@
 """"""""""""""""""""""""""""""""

 This environment variable sets the number of pre-created streams in the plugin
 (if supported) at initialization. More streams will be created dynamically
 throughout the execution if needed. A stream is a queue of asynchronous
 operations (e.g., kernel launches and memory copies) that are executed
 sequentially. Parallelism is achieved by featuring multiple streams. The
 ``libomptarget`` leverages streams to exploit parallelism between plugin
-operations. The default value is ``32``.
+operations. The default value is ``1``, more streams are created as needed.

 LIBOMPTARGET_NUM_INITIAL_EVENTS
 """""""""""""""""""""""""""""""

 This environment variable sets the number of pre-created events in the
 plugin (if supported) at initialization. More events will be created
 dynamically throughout the execution if needed. An event is used to synchronize
-a stream with another efficiently. The default value is ``32``.
+a stream with another efficiently. The default value is ``1``, more events are
+created as needed.

 LIBOMPTARGET_LOCK_MAPPED_HOST_BUFFERS
 """""""""""""""""""""""""""""""""""""

 This environment variable indicates whether the host buffers mapped by the user
 should be automatically locked/pinned by the plugin. Pinned host buffers allow
 true asynchronous copies between the host and devices. Enabling this feature can
 increase the performance of applications that are intensive in host-device

openmp/libomptarget/include/Utilities.h

@@ ... @@ if (const char *EnvStr = getenv(Name.data())) {
       if (!IsPresent) {
         DP("Ignoring invalid value %s for envar %s\n", EnvStr, Name.data());
         Data = Default;
       }
     }
   }

+  Envar<Ty> &operator=(const Ty &V) {
+    Data = V;
+    Initialized = true;
+    return *this;
+  }
+
   /// Get the definitive value.
   const Ty &get() const {
     // Throw a runtime error in case this envar is not initialized.
     if (!Initialized)
       FATAL_MESSAGE0(1, "Consulting envar before initialization");

     return Data;
   }
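
The effect of the added assignment operator can be sketched in isolation. `MiniEnvar` below is a hypothetical, heavily reduced stand-in for `Envar`: it does no environment parsing, and it throws instead of calling `FATAL_MESSAGE0`. It only shows the idea that assigning a value marks the wrapper as initialized, so a later `get()` succeeds.

```cpp
#include <stdexcept>

// Hypothetical, reduced stand-in for the Envar wrapper in Utilities.h.
template <typename Ty> class MiniEnvar {
  Ty Data{};
  bool Initialized = false;

public:
  MiniEnvar() = default;

  // Assigning a value also marks the wrapper as initialized.
  MiniEnvar &operator=(const Ty &V) {
    Data = V;
    Initialized = true;
    return *this;
  }

  // Reading before initialization is an error (the real class aborts instead
  // of throwing).
  const Ty &get() const {
    if (!Initialized)
      throw std::logic_error("Consulting envar before initialization");
    return Data;
  }
};
```

This is what lets the rtl.cpp change write the clamped values back into `OMPX_NumQueues` and `OMPX_QueueSize` and read them again later in `getNextQueue`.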

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

@@ ... @@ struct AMDGPUQueueTy {
   /// Initialize a new queue belonging to a specific agent.
   Error init(hsa_agent_t Agent, int32_t QueueSize) {
     hsa_status_t Status =
         hsa_queue_create(Agent, QueueSize, HSA_QUEUE_TYPE_MULTI, callbackError,
                          nullptr, UINT32_MAX, UINT32_MAX, &Queue);
     return Plugin::check(Status, "Error in hsa_queue_create: %s");
   }

+  /// If the queue is not initialized, do it now.
+  Error initLazy(hsa_agent_t Agent, int32_t QueueSize) {
+    // Lock the queue during the lazy init
+    std::lock_guard<std::mutex> Lock(Mutex);
+    if (Queue)
+      return Plugin::success();
+    return init(Agent, QueueSize);
+  }
+
   /// Deinitialize the queue and destroy its resources.
   Error deinit() {
+    std::lock_guard<std::mutex> Lock(Mutex);
+    if (!Queue)
+      return Plugin::success();
     hsa_status_t Status = hsa_queue_destroy(Queue);
     return Plugin::check(Status, "Error in hsa_queue_destroy: %s");
   }

   /// Push a kernel launch to the queue. The kernel launch requires an output
   /// signal and can define an optional input signal (nullptr if none).
   Error pushKernelLaunch(const AMDGPUKernelTy &Kernel, void *KernelArgs,
                          uint32_t NumThreads, uint64_t NumBlocks,
                          uint32_t GroupSize, AMDGPUSignalTy *OutputSignal,
                          AMDGPUSignalTy *InputSignal) {
     assert(OutputSignal && "Invalid kernel output signal");

     // Lock the queue during the packet publishing process. Notice this blocks
     // the addition of other packets to the queue. The following piece of code
     // should be lightweight; do not block the thread, allocate memory, etc.
     std::lock_guard<std::mutex> Lock(Mutex);
+    assert(Queue && "Interacted with a non-initialized queue!");

     // Avoid defining the input dependency if already satisfied.
     if (InputSignal && !InputSignal->load())
       InputSignal = nullptr;

     // Add a barrier packet before the kernel packet in case there is a pending
     // preceding operation. The barrier packet will delay the processing of
     // subsequent queue's packets until the barrier input signal are satisfied.
@@ ... @@ struct AMDGPUQueueTy {

   /// Push a barrier packet that will wait up to two input signals. All signals
   /// are optional (nullptr if none).
   Error pushBarrier(AMDGPUSignalTy *OutputSignal,
                     const AMDGPUSignalTy *InputSignal1,
                     const AMDGPUSignalTy *InputSignal2) {
     // Lock the queue during the packet publishing process.
     std::lock_guard<std::mutex> Lock(Mutex);
+    assert(Queue && "Interacted with a non-initialized queue!");

     // Push the barrier with the lock acquired.
     return pushBarrierImpl(OutputSignal, InputSignal1, InputSignal2);
   }

 private:
   /// Push a barrier packet that will wait up to two input signals. Assumes the
   /// the queue lock is acquired.
@@ ... @@ Error initImpl(GenericPluginTy &Plugin) override {
     if (auto Err = getDeviceAttr(HSA_AGENT_INFO_QUEUE_MAX_SIZE, MaxQueueSize))
       return Err;

     uint32_t MaxQueues;
     if (auto Err = getDeviceAttr(HSA_AGENT_INFO_QUEUES_MAX, MaxQueues))
       return Err;

     // Compute the number of queues and their size.
-    const uint32_t NumQueues = std::min(OMPX_NumQueues.get(), MaxQueues);
-    const uint32_t QueueSize = std::min(OMPX_QueueSize.get(), MaxQueueSize);
+    OMPX_NumQueues = std::max(1U, std::min(OMPX_NumQueues.get(), MaxQueues));
+    OMPX_QueueSize = std::min(OMPX_QueueSize.get(), MaxQueueSize);

     // Construct and initialize each device queue.
-    Queues = std::vector<AMDGPUQueueTy>(NumQueues);
-    for (AMDGPUQueueTy &Queue : Queues)
-      if (auto Err = Queue.init(Agent, QueueSize))
-        return Err;
+    Queues = std::vector<AMDGPUQueueTy>(OMPX_NumQueues);
+    // Initialize one queue eagerly.
+    if (auto Err = Queues.front().init(Agent, OMPX_QueueSize))
+      return Err;

     // Initialize stream pool.
     if (auto Err = AMDGPUStreamManager.init(OMPX_InitialNumStreams))
       return Err;

     // Initialize event pool.
     if (auto Err = AMDGPUEventManager.init(OMPX_InitialNumEvents))
       return Err;
@@ ... @@ return utils::iterateAgentMemoryPools(
           AMDGPUMemoryPoolTy *MemoryPool =
               Plugin::get().allocate<AMDGPUMemoryPoolTy>();
           new (MemoryPool) AMDGPUMemoryPoolTy(HSAMemoryPool);
           AllMemoryPools.push_back(MemoryPool);
           return HSA_STATUS_SUCCESS;
         });
   }

-  /// Get the next queue in a round-robin fashion.
+  /// Get the next queue in a round-robin fashion, includes lazy initialization.
   AMDGPUQueueTy &getNextQueue() {
-    static std::atomic<uint32_t> NextQueue(0);
-
     uint32_t Current = NextQueue.fetch_add(1, std::memory_order_relaxed);
-    return Queues[Current % Queues.size()];
+    uint32_t Idx = Current % Queues.size();
+    auto &Queue = Queues[Idx];
+    // Only queue 0 has been initialized eagerly. Others might need lazy/late
+    // initialization.
+    if (Idx == 0)
+      return Queue;
+
+    if (auto Err = Queue.initLazy(Agent, OMPX_QueueSize)) {
+      // Gracefully handle late initialization errors, but report them anyway.
+      REPORT("%s\n", toString(std::move(Err)).data());
+      return Queues[0];
+    }
+    return Queue;
   }

 private:
   using AMDGPUStreamRef = AMDGPUResourceRef<AMDGPUStreamTy>;
   using AMDGPUEventRef = AMDGPUResourceRef<AMDGPUEventTy>;

   using AMDGPUStreamManagerTy = GenericDeviceResourceManagerTy<AMDGPUStreamRef>;
   using AMDGPUEventManagerTy = GenericDeviceResourceManagerTy<AMDGPUEventRef>;
@@ ... @@ private:
   /// The GPU architecture.
   std::string ComputeUnitKind;

   /// Reference to the host device.
   AMDHostDeviceTy &HostDevice;

   /// List of device packet queues.
   std::vector<AMDGPUQueueTy> Queues;
+
+  // The next queue to be used for a new stream.
+  std::atomic<uint32_t> NextQueue = {0};
 };

 Error AMDGPUDeviceImageTy::loadExecutable(const AMDGPUDeviceTy &Device) {
   hsa_status_t Status;
   Status = hsa_code_object_deserialize(getStart(), getSize(), "", &CodeObject);
   if (auto Err =
           Plugin::check(Status, "Error in hsa_code_object_deserialize: %s"))
     return Err;

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp

@@ ... @@ : MemoryManager(nullptr), OMP_TeamLimit("OMP_TEAM_LIMIT"),
       OMP_NumTeams("OMP_NUM_TEAMS"),
       OMP_TeamsThreadLimit("OMP_TEAMS_THREAD_LIMIT"),
       OMPX_DebugKind("LIBOMPTARGET_DEVICE_RTL_DEBUG"),
       OMPX_SharedMemorySize("LIBOMPTARGET_SHARED_MEMORY_SIZE"),
       // Do not initialize the following two envars since they depend on the
       // device initialization. These cannot be consulted until the device is
       // initialized correctly. We intialize them in GenericDeviceTy::init().
       OMPX_TargetStackSize(), OMPX_TargetHeapSize(),
-      // By default, the initial number of streams and events are 32.
-      OMPX_InitialNumStreams("LIBOMPTARGET_NUM_INITIAL_STREAMS", 32),
-      OMPX_InitialNumEvents("LIBOMPTARGET_NUM_INITIAL_EVENTS", 32),
+      // By default, the initial number of streams and events is 1.
+      OMPX_InitialNumStreams("LIBOMPTARGET_NUM_INITIAL_STREAMS", 1),
+      OMPX_InitialNumEvents("LIBOMPTARGET_NUM_INITIAL_EVENTS", 1),
       DeviceId(DeviceId), GridValues(OMPGridValues),
       PeerAccesses(NumDevices, PeerAccessState::PENDING), PeerAccessesLock(),
       PinnedAllocs(*this) {}

 Error GenericDeviceTy::init(GenericPluginTy &Plugin) {
   if (auto Err = initImpl(Plugin))
     return Err;