This is an archive of the discontinued LLVM Phabricator instance.

Differential D154523

[OpenMP][AMDGPU] Single eager resource init + HSA queue utilization tracking
ClosedPublic

Authored by mhalk on Jul 5 2023, 9:32 AM.

Details

Reviewers

jdoerfert
kevinsala

Commits

rG5b19f42b631d: [OpenMP][AMDGPU] Single eager resource init + HSA queue utilization tracking

Summary

This patch lazily initializes queues/streams/events since their initialization
might come at a cost even if we do not use them.

To further benefit from this, AMDGPU/HSA queue management is moved into the
AMDGPUStreamManager of an AMDGPUDevice. Streams may now use different HSA queues
during their lifetime and identify busy queues.

When a Stream is requested from the resource manager, it will search for and
try to assign an idle queue. During the search for an idle queue the manager
may initialize more queues, up to the set maximum (default: 4).
If no idle queue is found, the manager falls back to round-robin selection.
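
The selection policy described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: `QueueSketch`, `QueuePickerSketch`, and all of their members are hypothetical stand-ins, the HSA queue creation is replaced by a flag, and callers are assumed to already hold the manager's lock.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a queue object; not the real AMDGPUQueueTy.
struct QueueSketch {
  bool Initialized = false; // stands in for a non-null HSA queue handle
  int NumUsers = 0;         // the queue counts as busy while this is > 0

  bool isBusy() const { return NumUsers > 0; }
  void addUser() { ++NumUsers; }
  void removeUser() { --NumUsers; }
  void initLazy() { Initialized = true; } // real code would create the queue
};

// Hypothetical picker. Assumes the caller holds the manager's mutex.
class QueuePickerSketch {
  std::vector<QueueSketch> Queues;
  std::size_t NextQueue = 0;

public:
  explicit QueuePickerSketch(std::size_t MaxNumQueues) : Queues(MaxNumQueues) {
    assert(MaxNumQueues > 0 && "Need at least one queue slot");
  }

  // Probe every slot once, starting at NextQueue, and take the first idle
  // one. If all slots are busy, fall back to the starting slot, which
  // yields round-robin behavior because NextQueue advances on every call.
  std::size_t assignNextQueue() {
    const std::size_t Max = Queues.size();
    const std::size_t Start = NextQueue;
    NextQueue = (NextQueue + 1) % Max;

    std::size_t Index = Start;
    for (std::size_t I = 0; I < Max; ++I) {
      const std::size_t Candidate = (Start + I) % Max;
      if (Queues[Candidate].isBusy())
        continue;
      Index = Candidate;
      break;
    }

    Queues[Index].initLazy(); // no-op if the slot was initialized before
    Queues[Index].addUser();
    return Index;
  }

  void release(std::size_t Index) { Queues[Index].removeUser(); }
  int numUsers(std::size_t Index) const { return Queues[Index].NumUsers; }
  bool isInitialized(std::size_t Index) const {
    return Queues[Index].Initialized;
  }
};
```

With two slots, the first two requests receive distinct queues, and a third request shares a queue only because none is idle. Slots that are never selected stay uninitialized, which is the point of the lazy initialization.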

With contributions from Johannes Doerfert <johannes@jdoerfert.de>

Depends on D156245

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

mhalk created this revision. Jul 5 2023, 9:32 AM

Herald added a project: Restricted Project. Jul 5 2023, 9:32 AM

Herald added subscribers: sunshaoce, kerbowa, guansong and 5 others.

mhalk requested review of this revision. Jul 5 2023, 9:32 AM

Herald added a reviewer: jdoerfert. Jul 5 2023, 9:32 AM

Herald added a project: Restricted Project.

Herald added subscribers: openmp-commits, jplehr, sstefan1, wdng.

Harbormaster completed remote builds in B243246: Diff 537396. Jul 5 2023, 9:35 AM

jdoerfert added inline comments. Jul 5 2023, 9:57 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
798	rename Busy into sth more meaningful, e.g., `NumUsers`. Spell out the inc/dec, maybe "addUser", "removeUser"
1459	Make the class final.
1467	Just resize.
1503	This ignores the queue size, doesn't it? We should not grow larger than that value.
1507	Start with `Current = NextQueue.fetch_add(1, std::memory_order_relaxed) % Size` when you iterate, that way you won't check the first queue all the time.

jplehr added inline comments. Jul 6 2023, 2:51 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1470	In case of error, I think we should report and return the error instead of reporting and keep going. This is the initial queue, so if this fails, there is no queue available.

mhalk added inline comments. Jul 6 2023, 4:25 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1467	Initially, I wanted to do that, but `AMDGPUQueueTy` has a member (Mutex) with deleted copy ctor.
1470	Agreed; as discussed, we'll return `Err` and not report it at this level. It will be reported by the higher layer, for example like this: "PluginInterface" error: Failure to initialize device 0: Error in hsa_queue_create: HSA_STATUS_ERROR_INVALID_ARGUMENT: One of the actual arguments does not meet a precondition stated in the documentation of the corresponding formal argument.
1503	To the best of my knowledge we should not exceed the set (max) number of queues. This will only resize the streams, as the size of `Queues` is only set once, during `init(...)`.
1507	Agreed; as discussed, we're going with two loops to search for an idle queue.
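
The constraint mentioned in the reply for line 1467 can be reproduced in isolation. The sketch below is an assumption-laden illustration: `QueueWithMutex` and `makeQueues` are hypothetical names, not the types from the patch. A struct holding a `std::mutex` is neither copyable nor movable, so `std::vector::resize` is not available for it, whereas constructing the vector with its final size only needs default construction.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <type_traits>
#include <vector>

// Hypothetical queue-like type with a mutex member.
struct QueueWithMutex {
  void *Queue = nullptr; // stands in for the native queue handle
  std::mutex Mutex;      // makes the type non-copyable and non-movable
};

static_assert(!std::is_copy_constructible_v<QueueWithMutex>,
              "std::mutex deletes the copy constructor");
static_assert(!std::is_move_constructible_v<QueueWithMutex>,
              "std::mutex is not movable either");

// Building the vector with its final size default-constructs each element
// in place, so no copy or move of QueueWithMutex is required.
std::vector<QueueWithMutex> makeQueues(std::size_t NumQueues) {
  return std::vector<QueueWithMutex>(NumQueues);
}
```

The design consequence is that the number of slots has to be fixed when the container is created, which matches the reply above that the size of `Queues` is only set once.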

Thank you very much for the comments.
This should implement the current feedback.

Harbormaster completed remote builds in B243435: Diff 537666. Jul 6 2023, 4:56 AM

kevinsala added a subscriber: kevinsala. Jul 6 2023, 7:14 AM

kevinsala added inline comments. Jul 11 2023, 6:29 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
608	Do we need these three atomic operations with `memory_order_seq_cst`? Or a more relaxed memory order could be enough?
1498	I'm afraid most of the code in `getResource` and `returnResource` is duplicated from the generic resource manager. Would it be possible to use the original `GenericDeviceResourceManagerTy::getResource()` instead? For instance:

    ResourceRef getResource() override {
      ResourceRef resource = GenericDeviceResourceManagerTy::getResource();
      // Perform any change on resource ...
      return resource;
    }

(Note that with this modification, the changes on the resource are no longer protected by the stream manager mutex)
1531	I don't think `NextQueue` needs atomic operations if this function is called while holding the stream manager mutex.
1568	I feel this patch implements a Queue manager inside a Stream manager. Wouldn't it be better to define this logic inside a new `AMDGPUQueueManagerTy` and just have a reference of it in the `AMDGPUStreamManagerTy`?

kevinsala added inline comments. Jul 11 2023, 7:12 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1535	Also, shouldn't `NextQueue` be incremented as we advance in the search for non-busy queues?
1548	I believe we can simplify and merge these two loops into a single one

ye-luo added a subscriber: ye-luo. Jul 11 2023, 7:42 AM

ye-luo added inline comments.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
608	Is this function already under the stream manager mutex when being called? I feel NumUsers doesn't need to be atomic.

kevinsala added inline comments. Jul 11 2023, 7:46 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
608	Yes, as it is now (everything protected by the stream manager mutex), it seems it shouldn't be atomic.

jdoerfert added inline comments. Jul 11 2023, 10:25 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1471–1473
1505	As @kevinsala said, the above, and the `auto &Resource =` part below, can be replaced with `ResourceRef Resource = GenericDeviceResourceManagerTy::getResource();`
1523	The above, except removeUser, should just be `GenericDeviceResourceManagerTy::returnResource(Resource);`, no?
1537	early exit if (busy) continue
1548	Yeah, my bad, I suggested this. We can do something like:

    for (I = 0; I < MaxNumQueues; ++I) {
      Idx = StartIndex++;
      if (StartIndex == MaxNumQueues)
        StartIndex = 0;
      // use Idx not I
      ...
    }
1568	We do not really manage the queues the same way. we can do more reuse, see above.

Thanks for all the valuable feedback -- this should implement it.

Still trying to figure out which parts really need to be protected by Locks.
So, please take a close look at these parts of the code.

Since they are currently guarded, I removed:
- the atomics, as suggested
- the Lock from initLazy

Harbormaster completed remote builds in B245130: Diff 540045. Jul 13 2023, 8:00 AM

mhalk added inline comments. Jul 13 2023, 8:27 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1499–1502	Rather unhappy with this particular piece of code. Now there's a chance for interleaving. I don't know if that's good, "ok" or bad, but if I had to guess: the latter. But without unlocking, this will cause a deadlock. (Could also revert to a `lock_guard` and call its dtor, if preferred.) Thoughts and comments on this would be highly appreciated.

mhalk added inline comments. Jul 13 2023, 9:31 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1027–1033	Oversight: Will be removed.

Looks pretty good to me, others?

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1519–1529

kevinsala added inline comments. Jul 17 2023, 3:57 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1499–1502	Yes, this seems correct, but I agree that it may produce unnecessary overhead. To cover this use case, we could extend the generic `getResource` and `returnResource` to accept an (optional) template functor that processes the element before returning (i.e., with the lock acquired). Something like:

    class GenericDeviceResourceManagerTy {
    protected:
      template <typename Func>
      ResourceRef getResourceAndProcess(Func processor) {
        ResourceRef ref = ...;
        processor(ref);
        return ref;
      }
    public:
      virtual ResourceRef getResource() {
        return getResource([](ResourceRef &) { /* do nothing */ });
      }
    }

And your `getResource` function would look something like:

    ResourceRef getResource() override {
      return GenericDeviceResourceManagerTy::getResource(
          [this](ResourceRef &ref) { assignNextQueue(ref); });
    }

I will implement this change in another patch. For the moment, this looks fine to me. Thanks

kevinsala added inline comments. Jul 17 2023, 4:02 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1499–1502	Fix to my previous comment:

    template <typename Func>
    ResourceRef getResource(Func processor) {
      std::lock_guard<std::mutex> Lock(Mutex);
      ResourceRef ref = ...;
      processor(ref);
      return ref;
    }
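
For readers who want to try the idea outside the plugin, here is a self-contained sketch of a pool whose `getResource` runs a caller-supplied functor while the pool lock is held. It only illustrates the pattern discussed in the comments above; `ResourcePoolSketch` and its members are hypothetical, and the interface that was eventually implemented in D156245 may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical pool; not the GenericDeviceResourceManagerTy interface.
template <typename ResourceTy> class ResourcePoolSketch {
  std::mutex Mutex;
  // std::deque keeps pointers to existing elements valid when it grows
  // at the end, so handed-out pointers stay usable.
  std::deque<ResourceTy> Pool;
  std::size_t NextAvailable = 0;

public:
  explicit ResourcePoolSketch(std::size_t InitialSize) : Pool(InitialSize) {}

  // Hand out the next resource and let the caller adjust it before the
  // lock is released, so no second lock/unlock round trip is needed.
  template <typename FuncTy> ResourceTy *getResource(FuncTy Processor) {
    std::lock_guard<std::mutex> Lock(Mutex);
    if (NextAvailable == Pool.size())
      Pool.emplace_back(); // grow on demand
    ResourceTy *Resource = &Pool[NextAvailable++];
    Processor(*Resource);
    return Resource;
  }

  // Convenience overload that performs no extra processing.
  ResourceTy *getResource() {
    return getResource([](ResourceTy &) { /* do nothing */ });
  }
};
```

A stream manager built on such a pool could pass a functor that assigns a queue to the stream, which keeps the assignment under the same lock as the pool bookkeeping.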

Thanks for checking and letting me know!

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1499–1502	Ok, great -- will leave this as it is, for now.

kevinsala mentioned this in D156245: [OpenMP][libomptarget] Process resources when getting/returning from managers. Jul 25 2023, 8:27 AM

Adapted to D156245 @kevinsala
Thanks for the heads-up!

Please take a look at the signature of the two lambdas. I decided to go with
AMDGPUStreamTy * instead of duplicating the private ResourceHandleTy.

I'll try to do some further testing in the next days.

Harbormaster completed remote builds in B248652: Diff 544877. Jul 27 2023, 11:55 AM

mhalk edited the summary of this revision. Jul 27 2023, 12:00 PM

mhalk added a parent revision: D156245: [OpenMP][libomptarget] Process resources when getting/returning from managers.

mhalk mentioned this in D152035: [OpenMP] Only initialize a single queue/stream/event eagerly. Jul 28 2023, 5:20 AM

kevinsala added inline comments. Jul 31 2023, 3:57 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1490	I would rename the parameter to `Resource`, `Handle`, or similar. The parameter is no longer a `ResourceReference`. The same for the lambda's parameter and the `returnResource` function.
1492	You can make this function return the error and propagate it outside. Then, the `assignNextQueue` function can return the error when failing to initialize a queue, instead of handling it.
1500
1508	You can work with a specific parameter `AMDGPUStreamTy *Stream` directly.
1522	if (auto Err = Q->initLazy(Agent, QueueSize)) return Err;

Merged this diff with D152035 as discussed with @jdoerfert.
Updated commit message accordingly.

Implemented feedback from @kevinsala + rebased.
Thank you both!

Harbormaster completed remote builds in B249538: Diff 546128. Aug 1 2023, 10:38 AM

mhalk retitled this revision from [OpenMP][AMDGPU] Tracking of busy HSA queues to [OpenMP][AMDGPU] Single eager resource init + HSA queue utilization tracking. Aug 1 2023, 10:39 AM

mhalk edited the summary of this revision.

Looks good to me now. Thanks!

Fixed oversight and made AMDGPUQueueTy::init lazy by default.

Harbormaster completed remote builds in B249546: Diff 546138. Aug 1 2023, 10:52 AM

mhalk removed a parent revision: D152035: [OpenMP] Only initialize a single queue/stream/event eagerly. Aug 1 2023, 11:03 AM

kevinsala accepted this revision. Aug 1 2023, 12:45 PM

This revision is now accepted and ready to land. Aug 1 2023, 12:45 PM

jdoerfert added inline comments. Aug 1 2023, 12:56 PM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1466	Style: I would suggest HSA rather than Hsa.

Sure!
Changed Hsa to HSA within names.

Thanks for all the feedback and patience :)

Harbormaster completed remote builds in B249592: Diff 546211. Aug 1 2023, 1:53 PM

Closed by commit rG5b19f42b631d: [OpenMP][AMDGPU] Single eager resource init + HSA queue utilization tracking (authored by mhalk). Aug 2 2023, 5:24 AM

This revision was automatically updated to reflect the committed changes.

mhalk added a commit: rG5b19f42b631d: [OpenMP][AMDGPU] Single eager resource init + HSA queue utilization tracking.

mhalk mentioned this in D156996: [OpenMP][AMDGPU] Add Envar for controlling HSA busy queue tracking. Aug 3 2023, 6:01 AM

Revision Contents

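Path	Size
openmp/docs/design/Runtimes.rst	5 lines
openmp/libomptarget/include/Utilities.h	6 lines
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp	171 lines
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h	4 lines
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp	6 lines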

Diff 546422

openmp/docs/design/Runtimes.rst

[1,187 lines not shown]
 """"""""""""""""""""""""""""""""

 This environment variable sets the number of pre-created streams in the plugin
 (if supported) at initialization. More streams will be created dynamically
 throughout the execution if needed. A stream is a queue of asynchronous
 operations (e.g., kernel launches and memory copies) that are executed
 sequentially. Parallelism is achieved by featuring multiple streams. The
 ``libomptarget`` leverages streams to exploit parallelism between plugin
-operations. The default value is ``32``.
+operations. The default value is ``1``, more streams are created as needed.

 LIBOMPTARGET_NUM_INITIAL_EVENTS
 """""""""""""""""""""""""""""""

 This environment variable sets the number of pre-created events in the
 plugin (if supported) at initialization. More events will be created
 dynamically throughout the execution if needed. An event is used to synchronize
-a stream with another efficiently. The default value is ``32``.
+a stream with another efficiently. The default value is ``1``, more events are
+created as needed.

 LIBOMPTARGET_LOCK_MAPPED_HOST_BUFFERS
 """""""""""""""""""""""""""""""""""""

 This environment variable indicates whether the host buffers mapped by the user
 should be automatically locked/pinned by the plugin. Pinned host buffers allow
 true asynchronous copies between the host and devices. Enabling this feature can
 increase the performance of applications that are intensive in host-device
[272 lines not shown]

openmp/libomptarget/include/Utilities.h

[77 lines not shown] if (const char *EnvStr = getenv(Name.data())) {
       if (!IsPresent) {
         DP("Ignoring invalid value %s for envar %s\n", EnvStr, Name.data());
         Data = Default;
       }
     }
   }

+  Envar<Ty> &operator=(const Ty &V) {
+    Data = V;
+    Initialized = true;
+    return *this;
+  }
+
   /// Get the definitive value.
   const Ty &get() const {
     // Throw a runtime error in case this envar is not initialized.
     if (!Initialized)
       FATAL_MESSAGE0(1, "Consulting envar before initialization");

     return Data;
   }
[162 lines not shown]
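
The effect of the added assignment operator can be illustrated with a reduced, standalone wrapper. This is only a sketch under stated assumptions: `UInt32EnvarSketch` is a hypothetical class that mimics the visible parts of the excerpt (a value, an `Initialized` flag, and a guarded `get()`), uses `assert` in place of `FATAL_MESSAGE0`, and does not reproduce the real `Envar` template.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical, simplified envar wrapper for unsigned values.
class UInt32EnvarSketch {
  unsigned Data = 0;
  bool Initialized = false;

public:
  // An object created this way must be assigned before get() is used.
  UInt32EnvarSketch() = default;

  // Read Name from the environment; keep Default if it is unset or invalid.
  UInt32EnvarSketch(const char *Name, unsigned Default)
      : Data(Default), Initialized(true) {
    if (const char *EnvStr = std::getenv(Name)) {
      char *End = nullptr;
      unsigned long Value = std::strtoul(EnvStr, &End, 10);
      if (End != EnvStr && *End == '\0')
        Data = static_cast<unsigned>(Value);
    }
  }

  // Mirrors the operator added in the excerpt: assigning a value also
  // marks the object as initialized.
  UInt32EnvarSketch &operator=(unsigned V) {
    Data = V;
    Initialized = true;
    return *this;
  }

  bool isInitialized() const { return Initialized; }

  unsigned get() const {
    assert(Initialized && "Consulting envar before initialization");
    return Data;
  }
};
```

One plausible use, consistent with the documentation change to a default of ``1``, is for a plugin to overwrite a default-constructed value programmatically and still pass the initialization check in `get()`.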

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

[577 lines not shown]
 /// Classes for holding AMDGPU signals and managing signals.
 using AMDGPUSignalRef = AMDGPUResourceRef<AMDGPUSignalTy>;
 using AMDGPUSignalManagerTy = GenericDeviceResourceManagerTy<AMDGPUSignalRef>;

 /// Class holding an HSA queue to submit kernel and barrier packets.
 struct AMDGPUQueueTy {
   /// Create an empty queue.
-  AMDGPUQueueTy() : Queue(nullptr), Mutex() {}
+  AMDGPUQueueTy() : Queue(nullptr), Mutex(), NumUsers(0) {}

-  /// Initialize a new queue belonging to a specific agent.
+  /// Lazily initialize a new queue belonging to a specific agent.
   Error init(hsa_agent_t Agent, int32_t QueueSize) {
+    if (Queue)
+      return Plugin::success();
     hsa_status_t Status =
         hsa_queue_create(Agent, QueueSize, HSA_QUEUE_TYPE_MULTI, callbackError,
                          nullptr, UINT32_MAX, UINT32_MAX, &Queue);
     return Plugin::check(Status, "Error in hsa_queue_create: %s");
   }

   /// Deinitialize the queue and destroy its resources.
   Error deinit() {
+    std::lock_guard<std::mutex> Lock(Mutex);
+    if (!Queue)
+      return Plugin::success();
     hsa_status_t Status = hsa_queue_destroy(Queue);
     return Plugin::check(Status, "Error in hsa_queue_destroy: %s");
   }

+  /// Returns if this queue is considered busy
+  bool isBusy() const { return NumUsers > 0; }

kevinsala: Do we need these three atomic operations with `memory_order_seq_cst`? Or a more relaxed memory order could be enough?

ye-luo: Is this function already under the stream manager mutex when being called? I feel NumUsers doesn't need to be atomic.

kevinsala: Yes, as it is now (everything protected by the stream manager mutex), it seems it shouldn't be atomic.

+  /// Decrement user count of the queue object
+  void removeUser() { --NumUsers; }
+
+  /// Increase user count of the queue object
+  void addUser() { ++NumUsers; }
+
   /// Push a kernel launch to the queue. The kernel launch requires an output
   /// signal and can define an optional input signal (nullptr if none).
   Error pushKernelLaunch(const AMDGPUKernelTy &Kernel, void *KernelArgs,
                          uint32_t NumThreads, uint64_t NumBlocks,
                          uint32_t GroupSize, AMDGPUSignalTy *OutputSignal,
                          AMDGPUSignalTy *InputSignal) {
     assert(OutputSignal && "Invalid kernel output signal");

     // Lock the queue during the packet publishing process. Notice this blocks
     // the addition of other packets to the queue. The following piece of code
     // should be lightweight; do not block the thread, allocate memory, etc.
     std::lock_guard<std::mutex> Lock(Mutex);
+    assert(Queue && "Interacted with a non-initialized queue!");

     // Avoid defining the input dependency if already satisfied.
     if (InputSignal && !InputSignal->load())
       InputSignal = nullptr;

     // Add a barrier packet before the kernel packet in case there is a pending
     // preceding operation. The barrier packet will delay the processing of
     // subsequent queue's packets until the barrier input signal are satisfied.
[32 lines not shown] struct AMDGPUQueueTy {
   /// Push a barrier packet that will wait up to two input signals. All signals
   /// are optional (nullptr if none).
   Error pushBarrier(AMDGPUSignalTy *OutputSignal,
                     const AMDGPUSignalTy *InputSignal1,
                     const AMDGPUSignalTy *InputSignal2) {
     // Lock the queue during the packet publishing process.
     std::lock_guard<std::mutex> Lock(Mutex);
+    assert(Queue && "Interacted with a non-initialized queue!");

     // Push the barrier with the lock acquired.
     return pushBarrierImpl(OutputSignal, InputSignal1, InputSignal2);
   }

 private:
   /// Push a barrier packet that will wait up to two input signals. Assumes the
   /// the queue lock is acquired.
[102 lines not shown] private:
   /// published in a multi-thread scenario. Without a queue lock, a thread T1
   /// could acquire packet P and thread T2 acquire packet P+1. Thread T2 could
   /// publish its packet P+1 (signaling the queue's doorbell) before packet P
   /// from T1 is ready to be processed. That scenario should be invalid. Thus,
   /// we use the following mutex to make packet acquiring and publishing atomic.
   /// TODO: There are other more advanced approaches to avoid this mutex using
   /// atomic operations. We can further investigate it if this is a bottleneck.
   std::mutex Mutex;
+
+  /// Indicates that the queue is busy when > 0
+  int NumUsers;

jdoerfert: rename Busy into sth more meaningful, e.g., `NumUsers`. Spell out the inc/dec, maybe "addUser", "removeUser"

 };

 /// Struct that implements a stream of asynchronous operations for AMDGPU
 /// devices. This class relies on signals to implement streams and define the
 /// dependencies between asynchronous operations.
 struct AMDGPUStreamTy {
 private:
   /// Utility struct holding arguments for async H2H memory copies.
[93 lines not shown] Error performAction() {
       return Plugin::success();
     }
   };

   /// The device agent where the stream was created.
   hsa_agent_t Agent;

   /// The queue that the stream uses to launch kernels.
-  AMDGPUQueueTy &Queue;
+  AMDGPUQueueTy *Queue;

   /// The manager of signals to reuse signals.
   AMDGPUSignalManagerTy &SignalManager;

   /// A reference to the associated device.
   GenericDeviceTy &Device;

   /// Array of stream slots. Use std::deque because it can dynamically grow
[75 lines not shown] Error complete() {
     return Plugin::success();
   }

   /// Make the current stream wait on a specific operation of another stream.
   /// The idea is to make the current stream waiting on two signals: 1) the last
   /// signal of the current stream, and 2) the last signal of the other stream.
   /// Use a barrier packet with two input signals.
   Error waitOnStreamOperation(AMDGPUStreamTy &OtherStream, uint32_t Slot) {
+    if (Queue == nullptr)
+      return Plugin::error("Target queue was nullptr");
+
     /// The signal that we must wait from the other stream.
     AMDGPUSignalTy *OtherSignal = OtherStream.Slots[Slot].Signal;

     // Prevent the release of the other stream's signal.
     OtherSignal->increaseUseCount();

     // Retrieve an available signal for the operation's output.
     AMDGPUSignalTy *OutputSignal = nullptr;
     if (auto Err = SignalManager.getResource(OutputSignal))
       return Err;
     OutputSignal->reset();
     OutputSignal->increaseUseCount();

     // Consume stream slot and compute dependencies.
     auto [Curr, InputSignal] = consume(OutputSignal);

     // Setup the post action to release the signal.
     if (auto Err = Slots[Curr].schedReleaseSignal(OtherSignal, &SignalManager))
       return Err;

     // Push a barrier into the queue with both input signals.
-    return Queue.pushBarrier(OutputSignal, InputSignal, OtherSignal);
+    return Queue->pushBarrier(OutputSignal, InputSignal, OtherSignal);
   }

   /// Callback for running a specific asynchronous operation. This callback is
   /// used for hsa_amd_signal_async_handler. The argument is the operation that
   /// should be executed. Notice we use the post action mechanism to codify the
   /// asynchronous operation.
   static bool asyncActionCallback(hsa_signal_value_t Value, void *Args) {
     StreamSlotTy *Slot = reinterpret_cast<StreamSlotTy *>(Args);
     assert(Slot && "Invalid slot");

mhalk: Oversight: Will be removed.

assert(Slot->Signal && "Invalid signal"); assert(Slot->Signal && "Invalid signal");

// This thread is outside the stream mutex. Make sure the thread sees the // This thread is outside the stream mutex. Make sure the thread sees the

// changes on the slot. // changes on the slot.

std::atomic_thread_fence(std::memory_order_acquire); std::atomic_thread_fence(std::memory_order_acquire);

// Peform the operation. // Peform the operation.

if (auto Err = Slot->performAction()) if (auto Err = Slot->performAction())

▲ Show 20 Lines • Show All 60 Lines • ▼ Show 20 Lines public:

/// Push a asynchronous kernel to the stream. The kernel arguments must be /// Push a asynchronous kernel to the stream. The kernel arguments must be

/// placed in a special allocation for kernel args and must keep alive until /// placed in a special allocation for kernel args and must keep alive until

    /// the kernel finalizes. Once the kernel is finished, the stream will release
    /// the kernel args buffer to the specified memory manager.
    Error pushKernelLaunch(const AMDGPUKernelTy &Kernel, void *KernelArgs,
                           uint32_t NumThreads, uint64_t NumBlocks,
                           uint32_t GroupSize,
                           AMDGPUMemoryManagerTy &MemoryManager) {
+     if (Queue == nullptr)
+       return Plugin::error("Target queue was nullptr");
      // Retrieve an available signal for the operation's output.
      AMDGPUSignalTy *OutputSignal = nullptr;
      if (auto Err = SignalManager.getResource(OutputSignal))
        return Err;
      OutputSignal->reset();
      OutputSignal->increaseUseCount();
      std::lock_guard<std::mutex> StreamLock(Mutex);
      // Consume stream slot and compute dependencies.
      auto [Curr, InputSignal] = consume(OutputSignal);
      // Setup the post action to release the kernel args buffer.
      if (auto Err = Slots[Curr].schedReleaseBuffer(KernelArgs, MemoryManager))
        return Err;
      // Push the kernel with the output signal and an input signal (optional)
-     return Queue.pushKernelLaunch(Kernel, KernelArgs, NumThreads, NumBlocks,
-                                   GroupSize, OutputSignal, InputSignal);
+     return Queue->pushKernelLaunch(Kernel, KernelArgs, NumThreads, NumBlocks,
+                                    GroupSize, OutputSignal, InputSignal);
    }

    /// Push an asynchronous memory copy between pinned memory buffers.
    Error pushPinnedMemoryCopyAsync(void *Dst, const void *Src,
                                    uint64_t CopySize) {
      // Retrieve an available signal for the operation's output.
      AMDGPUSignalTy *OutputSignal = nullptr;
      if (auto Err = SignalManager.getResource(OutputSignal))

[211 lines not shown; context: Expected<bool> query() {]

      return true;
    }
    /// Record the state of the stream on an event.
    Error recordEvent(AMDGPUEventTy &Event) const;
    /// Make the stream wait on an event.
    Error waitEvent(const AMDGPUEventTy &Event);
+   friend struct AMDGPUStreamManagerTy;
  };
  /// Class representing an event on AMDGPU. The event basically stores some
  /// information regarding the state of the recorded stream.
  struct AMDGPUEventTy {
    /// Create an empty event.
    AMDGPUEventTy(AMDGPUDeviceTy &Device)
        : RecordedStream(nullptr), RecordedSlot(-1), RecordedSyncCycle(-1) {}

[81 lines not shown; context: Error AMDGPUStreamTy::waitEvent(const AMDGPUEventTy &Event) {]

    // operation's output signal is satisfied.
    if (!RecordedStream.Slots[Event.RecordedSlot].Signal->load())
      return Plugin::success();
    // Otherwise, make the current stream wait on the other stream's operation.
    return waitOnStreamOperation(RecordedStream, Event.RecordedSlot);
  }

struct AMDGPUStreamManagerTy final

: GenericDeviceResourceManagerTy<AMDGPUResourceRef<AMDGPUStreamTy>> {

jdoerfertUnsubmitted

Not Done

Make the class final.


using ResourceRef = AMDGPUResourceRef<AMDGPUStreamTy>;

using ResourcePoolTy = GenericDeviceResourceManagerTy<ResourceRef>;

AMDGPUStreamManagerTy(GenericDeviceTy &Device, hsa_agent_t HSAAgent)

: GenericDeviceResourceManagerTy(Device), NextQueue(0), Agent(HSAAgent) {}

Error init(uint32_t InitialSize, int NumHSAQueues, int HSAQueueSize) {

jdoerfertUnsubmitted

Not Done

Style: I would suggest HSA rather than Hsa.


Queues = std::vector<AMDGPUQueueTy>(NumHSAQueues);

jdoerfertUnsubmitted

Not Done

Just resize.


mhalkAuthorUnsubmitted

Done

Initially, I wanted to do that, but AMDGPUQueueTy has a member (Mutex) with deleted copy ctor.
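The constraint described in this reply can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical `QueueLike` type that stands in for `AMDGPUQueueTy` (only its `std::mutex` member matters here). Constructing `std::vector<T>(N)` only needs the element type to be default-constructible, whereas `resize` additionally requires it to be move-insertable, so a `resize` call on such a vector is expected not to compile.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for AMDGPUQueueTy (an assumption for illustration):
// the std::mutex member makes the type neither copyable nor movable.
struct QueueLike {
  std::mutex Mutex;
  int NumUsers = 0;
};

static_assert(!std::is_copy_constructible<QueueLike>::value, "not copyable");
static_assert(!std::is_move_constructible<QueueLike>::value, "not movable");

// Build a vector holding N default-constructed elements. Only default
// construction of the element type is required by this constructor.
inline std::vector<QueueLike> makeQueues(std::size_t N) {
  return std::vector<QueueLike>(N);
}

// Small usage check: assigning a freshly built vector transfers the buffer
// and never copies or moves individual elements.
inline void demoMakeQueues() {
  std::vector<QueueLike> Queues;
  Queues = makeQueues(4);
  assert(Queues.size() == 4);
}
```

This matches the shape of the `Queues = std::vector<AMDGPUQueueTy>(NumHSAQueues);` line under review: the whole vector is replaced rather than resized in place.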


QueueSize = HSAQueueSize;

MaxNumQueues = NumHSAQueues;

// Initialize one queue eagerly

jplehrUnsubmitted

Not Done

In case of error, I think we should report and return the error instead of reporting and continuing.

This is the initial queue, so if this fails, there is no queue available.


mhalkAuthorUnsubmitted

Done

Agreed; as discussed, we'll return Err and not report it at this level.

It will be reported by the higher layer, for example like this:
"PluginInterface" error: Failure to initialize device 0: Error in hsa_queue_create: HSA_STATUS_ERROR_INVALID_ARGUMENT: One of the actual arguments does not meet a precondition stated in the documentation of the corresponding formal argument.


if (auto Err = Queues.front().init(Agent, QueueSize))

return Err;

jdoerfertUnsubmitted

Not Done

// Initialize one queue eagerly

- if (auto Err = Queues.front().init(Agent, QueueSize)) {

+ if (auto Err = Queues.front().init(Agent, QueueSize))

return Err;

- }

return GenericDeviceResourceManagerTy::init(InitialSize);

jdoerfert:

return GenericDeviceResourceManagerTy::init(InitialSize);

}

/// Deinitialize the resource pool and delete all resources. This function

/// must be called before the destructor.

Error deinit() override {

// De-init all queues

for (AMDGPUQueueTy &Queue : Queues) {

if (auto Err = Queue.deinit())

return Err;

}

return GenericDeviceResourceManagerTy::deinit();

}

/// Get a single stream from the pool or create new resources.

virtual Error getResource(AMDGPUStreamTy *&StreamHandle) override {

kevinsalaUnsubmitted

Not Done

I would rename the parameter to Resource, Handle, or similar. The parameter is no longer a ResourceReference. The same goes for the lambda's parameter and the returnResource function.


return getResourcesImpl(1, &StreamHandle, [this](AMDGPUStreamTy *&Handle) {

return assignNextQueue(Handle);

kevinsalaUnsubmitted

Not Done

You can make this function return the error and propagate it outside. Then, the assignNextQueue function can return the error when failing to initialize a queue, instead of handling it.


});

}

/// Return stream to the pool.

virtual Error returnResource(AMDGPUStreamTy *StreamHandle) override {

return returnResourceImpl(StreamHandle, [](AMDGPUStreamTy *Handle) {

kevinsalaUnsubmitted

Not Done

I'm afraid most of the code in getResource and returnResource is duplicated from the generic resource manager. Would it be possible to use the original GenericDeviceResourceManagerTy::getResource() instead? For instance:

ResourceRef getResource() override {
  ResourceRef resource = GenericDeviceResourceManagerTy::getResource();
  // Perform any change on resource
  ...

  return resource;
}

(Note that with this modification, the changes on the resource are no longer protected by the stream manager mutex)


Handle->Queue->removeUser();

return Plugin::success();

kevinsalaUnsubmitted

Not Done

return returnResourceImpl(Reference, [](AMDGPUStreamTy *Ref) {

- (*Ref).Queue->removeUser();

+ Ref->Queue->removeUser();

return Plugin::success();

kevinsala:

});

}

mhalkAuthorUnsubmitted

Done

Rather unhappy with this particular piece of code.
Now there's a chance for interleaving.
I don't know if that's good, "ok" or bad, but if I had to guess: the latter.

But without unlocking, this will cause a deadlock.
(Could also revert to a lock_guard and call its dtor, if preferred.)

Thoughts and comments on this would be highly appreciated.


kevinsalaUnsubmitted

Not Done

Yes, this seems correct, but I agree that it may produce unnecessary overhead.

To cover this use case, we could extend the generic getResource and returnResource to accept an (optional) template functor that processes the element before returning (i.e., with the lock acquired). Something like:

class GenericDeviceResourceManagerTy {
protected:
    template <typename Func>
    ResourceRef getResourceAndProcess(Func processor)
    {
        ResourceRef ref = ...;
        processor(ref);
        return ref;
    }

public:
    virtual ResourceRef getResource()
    {
        return getResource([](ResourceRef &) { /* do nothing */ });
    }
}

And your getResource function would look something like:

ResourceRef getResource() override {
    return GenericDeviceResourceManagerTy::getResource(
        [this](ResourceRef &ref) {
            assignNextQueue(ref);
        });
}

I will implement this change in another patch. For the moment, this looks fine to me. Thanks.


kevinsalaUnsubmitted

Not Done

Fix to my previous comment:

template <typename Func>
ResourceRef getResource(Func processor)
{
    std::lock_guard<std::mutex> Lock(Mutex);
    ResourceRef ref = ...;
    processor(ref);
    return ref;
}
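The idea in the two comments above (run a caller-supplied functor on the resource while the pool mutex is still held) can be sketched as a self-contained example. `SimplePool` and `StreamLike` are hypothetical names used only for illustration; this is not the `GenericDeviceResourceManagerTy` interface, and a real pool would grow instead of returning `nullptr` when exhausted.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical resource type; QueueId plays the role of the assigned queue.
struct StreamLike {
  int QueueId = -1;
};

// Hypothetical, simplified pool (an assumption, not the plugin's class).
template <typename ResourceTy> class SimplePool {
public:
  explicit SimplePool(std::size_t Size) : Pool(Size), NextAvailable(0) {}

  // Hand out the next resource and run Processor on it while the pool
  // mutex is held, so the processing cannot interleave with other callers.
  template <typename Func> ResourceTy *getResource(Func Processor) {
    std::lock_guard<std::mutex> Lock(Mutex);
    if (NextAvailable == Pool.size())
      return nullptr; // Pool exhausted in this sketch.
    ResourceTy *Res = &Pool[NextAvailable++];
    Processor(*Res);
    return Res;
  }

  // Default variant: no extra processing.
  ResourceTy *getResource() {
    return getResource([](ResourceTy &) { /* do nothing */ });
  }

private:
  std::mutex Mutex;
  std::vector<ResourceTy> Pool;
  std::size_t NextAvailable;
};

// Small usage check of both variants.
inline void demoSimplePool() {
  SimplePool<StreamLike> Pool(2);
  StreamLike *S = Pool.getResource([](StreamLike &St) { St.QueueId = 3; });
  assert(S != nullptr && S->QueueId == 3);
  StreamLike *T = Pool.getResource();
  assert(T != nullptr && T->QueueId == -1);
  assert(Pool.getResource() == nullptr);
}
```

Because the functor runs inside the locked region, the unlock-then-modify window that the author was unhappy about does not arise in this shape.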


mhalkAuthorUnsubmitted

Done

Ok, great -- will leave this as it is, for now.


jdoerfertUnsubmitted

Not Done

This ignores the queue size, doesn't it? We should not grow larger than that value.


mhalkAuthorUnsubmitted

Done

To the best of my knowledge we should not exceed the set (max) number of queues.
This will only resize the streams, as the size of Queues is only set once, during init(...).


private:

/// Search for and assign a preferably idle queue to the given Stream. If

jdoerfertUnsubmitted

Not Done

As @kevinsala said, the above, and the auto &Resource = part below, can be replaced with ResourceRef Resource = GenericDeviceResourceManagerTy::getResource();


/// there is no queue without current users, resort to round robin selection.

inline Error assignNextQueue(AMDGPUStreamTy *Stream) {

jdoerfertUnsubmitted

Not Done

Start with Current = NextQueue.fetch_add(1, std::memory_order_relaxed) % Size when you iterate, that way you won't check the first queue all the time.
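The suggested starting index can be illustrated on its own. This is a minimal sketch, assuming a hypothetical free function `nextStartIndex`; in the patch the counter would be a member of the stream manager, and `Size` is assumed to be at least 1.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Return the index at which the next search should start and advance the
// shared counter. Relaxed ordering is enough here because the counter is
// only used to spread the starting points, not to publish other data.
inline std::uint32_t nextStartIndex(std::atomic<std::uint32_t> &NextQueue,
                                    std::uint32_t Size) {
  return NextQueue.fetch_add(1, std::memory_order_relaxed) % Size;
}

// Small usage check: successive calls walk through the indices and wrap.
inline void demoNextStartIndex() {
  std::atomic<std::uint32_t> Counter(0);
  assert(nextStartIndex(Counter, 3) == 0);
  assert(nextStartIndex(Counter, 3) == 1);
  assert(nextStartIndex(Counter, 3) == 2);
  assert(nextStartIndex(Counter, 3) == 0);
}
```

Each caller starts its search one position further along, so the first queue is not inspected first on every request.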


mhalkAuthorUnsubmitted

Done

Agreed; as discussed, we're going with two loops to search for an idle queue.


uint32_t StartIndex = NextQueue % MaxNumQueues;

kevinsalaUnsubmitted

Not Done

You can work with a specific parameter AMDGPUStreamTy *Stream directly.


AMDGPUQueueTy *Q = nullptr;

for (int i = 0; i < MaxNumQueues; ++i) {

Q = &Queues[StartIndex++];

if (StartIndex == MaxNumQueues)

StartIndex = 0;

if (Q->isBusy())

continue;

else {

if (auto Err = Q->init(Agent, QueueSize))

return Err;

Q->addUser();

kevinsalaUnsubmitted

Not Done

if (auto Err = Q->initLazy(Agent, QueueSize))
  return Err;


Stream->Queue = Q;

jdoerfertUnsubmitted

Not Done

The above, except removeUser, should just be GenericDeviceResourceManagerTy::returnResource(Resource);, no?


return Plugin::success();

}

// All queues busy: Round robin (StartIndex has the initial value again)

Queues[StartIndex].addUser();

jdoerfertUnsubmitted

Not Done

continue;

- else {

if (auto Err = Q->initLazy(Agent, QueueSize)) {

REPORT("Failure during queue init: %s\n",

toString(std::move(Err)).data());

Q = &Queues[0];

}

Q->addUser();

(*Resource).Queue = Q;

return;

- }

}

// All queues busy: Round robin (StartIndex has the original value again)

jdoerfert:

Stream->Queue = &Queues[StartIndex];

++NextQueue;

kevinsalaUnsubmitted

Not Done

I don't think NextQueue needs atomic operations if this function is called while holding the stream manager mutex.


return Plugin::success();

}

/// The next queue index to use for round robin selection.

kevinsalaUnsubmitted

Not Done

Also, shouldn't NextQueue be incremented while we advance in the search for non-busy queues?


uint32_t NextQueue;

jdoerfertUnsubmitted

Not Done

early exit if (busy) continue


/// The queues which are assigned to requested streams.

std::vector<AMDGPUQueueTy> Queues;

/// The corresponding device as HSA agent.

hsa_agent_t Agent;

/// The maximum number of queues.

int MaxNumQueues;

/// The size of created queues.

int QueueSize;

kevinsalaUnsubmitted

Not Done

I believe we can simplify and merge these two loops into a single one


jdoerfertUnsubmitted

Not Done

Yeah, my bad, I suggested this, we can do something like

for (I = 0; I < MaxNumQueues; ++I) {
  Idx = StartIndex++;
  if (StartIndex == MaxNumQueues) StartIndex = 0;
  // use Idx not I
...
}
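The single-loop search sketched above, combined with the round-robin fallback described in the summary, can be written out as a self-contained example. `TrackedQueue` and `QueueSelector` are hypothetical names; lazy HSA queue initialization is omitted, the caller is assumed to hold the stream manager mutex, and the number of queues is assumed to be at least 1.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical queue with a user count; busy means at least one user.
struct TrackedQueue {
  std::uint32_t NumUsers = 0;
  bool isBusy() const { return NumUsers > 0; }
};

// Hypothetical selector (an assumption for illustration only).
struct QueueSelector {
  std::vector<TrackedQueue> Queues;
  std::uint32_t NextQueue = 0;

  explicit QueueSelector(std::uint32_t MaxNumQueues) : Queues(MaxNumQueues) {}

  // Return the index of the assigned queue: the first idle queue found when
  // scanning once around the ring, else the round-robin choice.
  std::uint32_t assignNextQueue() {
    const std::uint32_t MaxNumQueues =
        static_cast<std::uint32_t>(Queues.size());
    std::uint32_t StartIndex = NextQueue % MaxNumQueues;
    for (std::uint32_t I = 0; I < MaxNumQueues; ++I) {
      std::uint32_t Idx = StartIndex++;
      if (StartIndex == MaxNumQueues)
        StartIndex = 0;
      if (Queues[Idx].isBusy())
        continue;
      ++Queues[Idx].NumUsers;
      return Idx;
    }
    // All queues busy: StartIndex has wrapped back to its initial value.
    ++Queues[StartIndex].NumUsers;
    ++NextQueue;
    return StartIndex;
  }

  // Counterpart of removeUser(): release one use of the given queue.
  void release(std::uint32_t Idx) { --Queues[Idx].NumUsers; }
};

// Small usage check with two queues.
inline void demoQueueSelector() {
  QueueSelector Selector(2);
  assert(Selector.assignNextQueue() == 0); // Queue 0 is idle.
  assert(Selector.assignNextQueue() == 1); // Queue 0 busy, queue 1 idle.
  assert(Selector.assignNextQueue() == 0); // All busy: round robin.
  assert(Selector.assignNextQueue() == 1); // All busy: next in turn.
}
```

In this sketch the round-robin counter only advances when every queue is busy, which mirrors the `++NextQueue` placement in the code under review.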


};

  /// Abstract class that holds the common members of the actual kernel devices
  /// and the host device. Both types should inherit from this class.
  struct AMDGenericDeviceTy {
    AMDGenericDeviceTy() {}
    virtual ~AMDGenericDeviceTy() {}
    /// Create all memory pools which the device has access to and classify them.
    Error initMemoryPools() {
      // Retrieve all memory pools from the device agent(s).
      Error Err = retrieveAllMemoryPools();
      if (Err)
        return Err;
      for (AMDGPUMemoryPoolTy *MemoryPool : AllMemoryPools) {
        // Initialize the memory pool and retrieve some basic info.
        Error Err = MemoryPool->init();
        if (Err)

kevinsalaUnsubmitted

Not Done

I feel this patch implements a Queue manager inside a Stream manager. Wouldn't it be better to define this logic inside a new AMDGPUQueueManagerTy and just have a reference of it in the AMDGPUStreamManagerTy?


jdoerfertUnsubmitted

Not Done

We do not really manage the queues the same way. We can do more reuse, see above.


          return Err;
        if (!MemoryPool->isGlobal())
          continue;
        // Classify the memory pools depending on their properties.
        if (MemoryPool->isFineGrained()) {
          FineGrainedMemoryPools.push_back(MemoryPool);

[145 lines not shown; context: AMDGPUDeviceTy(int32_t DeviceId, int32_t NumDevices,]

          OMPX_NumQueues("LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES", 4),
          OMPX_QueueSize("LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE", 512),
          OMPX_DefaultTeamsPerCU("LIBOMPTARGET_AMDGPU_TEAMS_PER_CU", 4),
          OMPX_MaxAsyncCopyBytes("LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES",
                                 1 * 1024 * 1024), // 1MB
          OMPX_InitialNumSignals("LIBOMPTARGET_AMDGPU_NUM_INITIAL_HSA_SIGNALS",
                                 64),
          OMPX_StreamBusyWait("LIBOMPTARGET_AMDGPU_STREAM_BUSYWAIT", 2000000),
-         AMDGPUStreamManager(*this), AMDGPUEventManager(*this),
-         AMDGPUSignalManager(*this), Agent(Agent), HostDevice(HostDevice),
-         Queues() {}
+         AMDGPUStreamManager(*this, Agent), AMDGPUEventManager(*this),
+         AMDGPUSignalManager(*this), Agent(Agent), HostDevice(HostDevice) {}
    ~AMDGPUDeviceTy() {}
    /// Initialize the device, its resources and get its properties.
    Error initImpl(GenericPluginTy &Plugin) override {
      // First setup all the memory pools.
      if (auto Err = initMemoryPools())
        return Err;

[50 lines not shown; context: Error initImpl(GenericPluginTy &Plugin) override {]

      if (auto Err = getDeviceAttr(HSA_AGENT_INFO_QUEUE_MAX_SIZE, MaxQueueSize))
        return Err;
      uint32_t MaxQueues;
      if (auto Err = getDeviceAttr(HSA_AGENT_INFO_QUEUES_MAX, MaxQueues))
        return Err;
      // Compute the number of queues and their size.
-     const uint32_t NumQueues = std::min(OMPX_NumQueues.get(), MaxQueues);
-     const uint32_t QueueSize = std::min(OMPX_QueueSize.get(), MaxQueueSize);
-     // Construct and initialize each device queue.
-     Queues = std::vector<AMDGPUQueueTy>(NumQueues);
-     for (AMDGPUQueueTy &Queue : Queues)
-       if (auto Err = Queue.init(Agent, QueueSize))
-         return Err;
+     OMPX_NumQueues = std::max(1U, std::min(OMPX_NumQueues.get(), MaxQueues));
+     OMPX_QueueSize = std::min(OMPX_QueueSize.get(), MaxQueueSize);
      // Initialize stream pool.
-     if (auto Err = AMDGPUStreamManager.init(OMPX_InitialNumStreams))
+     if (auto Err = AMDGPUStreamManager.init(OMPX_InitialNumStreams,
+                                             OMPX_NumQueues, OMPX_QueueSize))
        return Err;
      // Initialize event pool.
      if (auto Err = AMDGPUEventManager.init(OMPX_InitialNumEvents))
        return Err;
      // Initialize signal pool.
      if (auto Err = AMDGPUSignalManager.init(OMPX_InitialNumSignals))

[22 lines not shown; context: if (!LoadedImages.empty()) {]

            static_cast<AMDGPUDeviceImageTy &>(*Image);
        // Unload the executable of the image.
        if (auto Err = AMDImage.unloadExecutable())
          return Err;
      }
-     for (AMDGPUQueueTy &Queue : Queues) {
-       if (auto Err = Queue.deinit())
-         return Err;
-     }
      // Invalidate agent reference.
      Agent = {0};
      return Plugin::success();
    }
    const uint64_t getStreamBusyWaitMicroseconds() const {
      return OMPX_StreamBusyWait;

[670 lines not shown; context: return utils::iterateAgentMemoryPools(]

            AMDGPUMemoryPoolTy *MemoryPool =
                Plugin::get().allocate<AMDGPUMemoryPoolTy>();
            new (MemoryPool) AMDGPUMemoryPoolTy(HSAMemoryPool);
            AllMemoryPools.push_back(MemoryPool);
            return HSA_STATUS_SUCCESS;
          });
    }
-   /// Get the next queue in a round-robin fashion.
-   AMDGPUQueueTy &getNextQueue() {
-     static std::atomic<uint32_t> NextQueue(0);
-     uint32_t Current = NextQueue.fetch_add(1, std::memory_order_relaxed);
-     return Queues[Current % Queues.size()];
-   }
  private:
-   using AMDGPUStreamRef = AMDGPUResourceRef<AMDGPUStreamTy>;
    using AMDGPUEventRef = AMDGPUResourceRef<AMDGPUEventTy>;
-   using AMDGPUStreamManagerTy = GenericDeviceResourceManagerTy<AMDGPUStreamRef>;
    using AMDGPUEventManagerTy = GenericDeviceResourceManagerTy<AMDGPUEventRef>;
    /// Envar for controlling the number of HSA queues per device. High number of
    /// queues may degrade performance.
    UInt32Envar OMPX_NumQueues;
    /// Envar for controlling the size of each HSA queue. The size is the number
    /// of HSA packets a queue is expected to hold. It is also the number of HSA

[39 lines not shown; context: private:]

    /// The GPU architecture.
    std::string ComputeUnitKind;
    /// The frequency of the steady clock inside the device.
    uint64_t ClockFrequency;
    /// Reference to the host device.
    AMDHostDeviceTy &HostDevice;
-   /// List of device packet queues.
-   std::vector<AMDGPUQueueTy> Queues;
  };
  Error AMDGPUDeviceImageTy::loadExecutable(const AMDGPUDeviceTy &Device) {
    hsa_status_t Status;
    Status = hsa_code_object_deserialize(getStart(), getSize(), "", &CodeObject);
    if (auto Err =
            Plugin::check(Status, "Error in hsa_code_object_deserialize: %s"))
      return Err;

[55 lines not shown; context: Error AMDGPUResourceRef<ResourceTy>::create(GenericDeviceTy &Device) {]

    AMDGPUDeviceTy &AMDGPUDevice = static_cast<AMDGPUDeviceTy &>(Device);
    Resource = new ResourceTy(AMDGPUDevice);
    return Resource->init();
  }
  AMDGPUStreamTy::AMDGPUStreamTy(AMDGPUDeviceTy &Device)
-     : Agent(Device.getAgent()), Queue(Device.getNextQueue()),
+     : Agent(Device.getAgent()), Queue(nullptr),
        SignalManager(Device.getSignalManager()), Device(Device),
        // Initialize the std::deque with some empty positions.
        Slots(32), NextSlot(0), SyncCycle(0), RPCServer(nullptr),
        StreamBusyWaitMicroseconds(Device.getStreamBusyWaitMicroseconds()) {}
  /// Class implementing the AMDGPU-specific functionalities of the global
  /// handler.
  struct AMDGPUGlobalHandlerTy final : public GenericGlobalHandlerTy {

[469 lines not shown]

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h

[1,162 lines not shown; context: public:]

    /// Initialize the resource pool.
    Error init(uint32_t InitialSize) {
      assert(ResourcePool.empty() && "Resource pool already initialized");
      return ResourcePoolTy::resizeResourcePool(InitialSize);
    }

    /// Deinitialize the resource pool and delete all resources. This function
    /// must be called before the destructor.
-   Error deinit() {
+   virtual Error deinit() {
      if (NextAvailable)
        DP("Missing %d resources to be returned\n", NextAvailable);

      // TODO: This prevents a bug on libomptarget to make the plugins fail. There
      // may be some resources not returned. Do not destroy these ones.
      if (auto Err = ResourcePoolTy::resizeResourcePool(NextAvailable))
        return Err;

[67 lines not shown; context: if (auto Err = Processor(Handle))]

        return Err;

      assert(NextAvailable > 0 && "Resource pool is corrupted");
      ResourcePool[--NextAvailable] = Handle;

      return Plugin::success();
    }

- private:
+ protected:
    /// The resources between \p OldSize and \p NewSize need to be created or
    /// destroyed. The mutex is locked when this function is called.
    Error resizeResourcePoolImpl(uint32_t OldSize, uint32_t NewSize) {
      assert(OldSize != NewSize && "Resizing to the same size");

      if (auto Err = Device.setContext())
        return Err;

[60 lines not shown]

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp

[390 lines not shown; context: : MemoryManager(nullptr), OMP_TeamLimit("OMP_TEAM_LIMIT"),]

        OMP_NumTeams("OMP_NUM_TEAMS"),
        OMP_TeamsThreadLimit("OMP_TEAMS_THREAD_LIMIT"),
        OMPX_DebugKind("LIBOMPTARGET_DEVICE_RTL_DEBUG"),
        OMPX_SharedMemorySize("LIBOMPTARGET_SHARED_MEMORY_SIZE"),
        // Do not initialize the following two envars since they depend on the
        // device initialization. These cannot be consulted until the device is
        // initialized correctly. We intialize them in GenericDeviceTy::init().
        OMPX_TargetStackSize(), OMPX_TargetHeapSize(),
-       // By default, the initial number of streams and events are 32.
-       OMPX_InitialNumStreams("LIBOMPTARGET_NUM_INITIAL_STREAMS", 32),
-       OMPX_InitialNumEvents("LIBOMPTARGET_NUM_INITIAL_EVENTS", 32),
+       // By default, the initial number of streams and events is 1.
+       OMPX_InitialNumStreams("LIBOMPTARGET_NUM_INITIAL_STREAMS", 1),
+       OMPX_InitialNumEvents("LIBOMPTARGET_NUM_INITIAL_EVENTS", 1),
        DeviceId(DeviceId), GridValues(OMPGridValues),
        PeerAccesses(NumDevices, PeerAccessState::PENDING), PeerAccessesLock(),
        PinnedAllocs(*this), RPCServer(nullptr) {
  #ifdef OMPT_SUPPORT
    OmptInitialized.store(false);
    // Bind the callbacks to this device's member functions
  #define bindOmptCallback(Name, Type, Code) \
    if (ompt::Initialized && ompt::lookupCallbackByCode) { \

[1,260 lines not shown]

Revision Contents

Diff 546422
