
Details

Reviewers

jdoerfert
jhuber6
JonChesterfield
tianshilei1992
ye-luo

Commits

rGf238a98e8447: [OpenMP][libomptarget][AMDGPU] Enable active HSA wait state

Summary

[OpenMP][libomptarget][AMDGPU] Enable active HSA wait state

Adds HSA timeout hint of 2 seconds to the AMDGPU nextgen-plugin to improve
performance of small kernels.
The HSA runtime may stay in HSA_WAIT_STATE_ACTIVE for up to the timeout
value before switching to HSA_WAIT_STATE_BLOCKED. This can improve
latency from which small kernels can benefit.
The value was determined via experimentation w/ different benchmarks.

The timeout value can be overridden using the environment variable
LIBOMPTARGET_AMDGPU_STREAM_BUSYWAIT with a value in microseconds.
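
The active-then-blocked behaviour described in the summary can be illustrated without HSA. The following is only a minimal sketch using standard C++ primitives; `TwoPhaseSignal` and its members are hypothetical names that do not appear in the patch, and the real HSA runtime treats the timeout as a hint rather than implementing this exact loop.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for an HSA signal: a counter that reaches zero when
// the associated operation has completed.
class TwoPhaseSignal {
public:
  explicit TwoPhaseSignal(std::int64_t Initial) : Value(Initial) {}

  // Producer side: mark completion and wake any blocked waiter.
  void complete() {
    {
      // Store under the lock so a waiter cannot miss the notification.
      std::lock_guard<std::mutex> Lock(Mutex);
      Value.store(0, std::memory_order_release);
    }
    CV.notify_all();
  }

  // Consumer side: poll for up to ActiveTimeout, then block.
  // Returns true if completion was observed during the active phase and
  // false if the waiter had to fall back to the blocking phase.
  bool wait(std::chrono::microseconds ActiveTimeout) {
    assert(ActiveTimeout.count() >= 0 && "timeout must not be negative");
    const auto Deadline = std::chrono::steady_clock::now() + ActiveTimeout;

    // Active phase: keep the thread on the core and poll the value.
    while (std::chrono::steady_clock::now() < Deadline) {
      if (Value.load(std::memory_order_acquire) == 0)
        return true;
    }

    // Blocked phase: sleep until the producer signals completion.
    std::unique_lock<std::mutex> Lock(Mutex);
    CV.wait(Lock,
            [this] { return Value.load(std::memory_order_acquire) == 0; });
    return false;
  }

private:
  std::atomic<std::int64_t> Value;
  std::mutex Mutex;
  std::condition_variable CV;
};
```

A short kernel that finishes within the timeout is picked up by the polling loop without the waiter being descheduled, which is the latency benefit the summary refers to. A timeout of zero skips the polling loop and blocks immediately.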

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jplehr created this revision. Apr 20 2023, 8:17 AM

Herald added a project: Restricted Project. Apr 20 2023, 8:17 AM

Herald added subscribers: sunshaoce, kosarev, kerbowa and 6 others.

jplehr requested review of this revision. Apr 20 2023, 8:17 AM

Herald added a project: Restricted Project. Apr 20 2023, 8:17 AM

Herald added subscribers: openmp-commits, sstefan1, wdng.

Harbormaster completed remote builds in B226884: Diff 515343. Apr 20 2023, 8:20 AM

Why separate KERNEL_BUSYWAIT and DATA_BUSYWAIT? I don't feel the need of two separate controls.
What are the units? Documentation?

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
521	When reading the implementation of wait(), I'm still wondering how errors are handled when kernels finish (signal equals 0) with an error.
1248	Should this be private?
1575	Should they be private?

kevinsala added a subscriber: kevinsala. Apr 20 2023, 9:02 AM

kevinsala added inline comments.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1248	Can we place this class data member just after the existing ones?
1575	Are these fields really needed? We can access the envars directly (which have the values cached in a variable), right?

In D148808#4283842, @ye-luo wrote:

Why separate KERNEL_BUSYWAIT and DATA_BUSYWAIT? I don't feel the need of two separate controls.
What are the units? Documentation?

It is a matter of priorities. If you are doing async copies, you may not want them to ever busy-wait because you have more important things to do on the CPU. However, in that same situation you may have high-priority short kernels for which you prefer the performance of busy-waiting.

For long-term tuning, we may actually want more control over busy-wait timeouts. For now, we just separate the timeouts on data transfers and kernel completions.
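
The separation argued for here (one timeout for kernel completions, another for data transfers) might be sketched as follows. All names are hypothetical and are not taken from any revision of the patch; the review later converged on a single variable.

```cpp
#include <cstdint>

// Hypothetical classification of the operation a stream is waiting on.
enum class StreamOp { KernelLaunch, DataTransfer };

// Hypothetical policy holding one busy-wait timeout (in microseconds) per
// kind of operation.
struct BusyWaitPolicy {
  std::uint64_t KernelMicros;
  std::uint64_t DataMicros;

  // Select the timeout that applies to the given operation.
  std::uint64_t timeoutFor(StreamOp Op) const {
    return Op == StreamOp::KernelLaunch ? KernelMicros : DataMicros;
  }
};
```

With such a policy, a user could busy-wait on short kernels while passing a timeout of zero for copies, so that waits on data transfers block right away and leave the CPU free for other work.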

In D148808#4286873, @gregrodgers wrote:

In D148808#4283842, @ye-luo wrote:

Why separate KERNEL_BUSYWAIT and DATA_BUSYWAIT? I don't feel the need of two separate controls.
What are the units? Documentation?

It is a matter of priorities. If you are doing async copies, you may not want them to ever busy-wait because you have more important things to do on the CPU.

In the OpenMP context, when an OpenMP target region without nowait reaches a signal wait, the user's intention is busy/active waiting. There are no more important things to do on the core where the OpenMP thread resides. If you were thinking of CPU cores being oversubscribed by threads, I would say one may gain or lose depending on the application, and there is no need to give such cases high priority in this discussion.
If nowait is added, events are used instead of a signal wait. I think active waiting is always preferred.

Thanks for all the comments, I think there is a bit of confusion due to my lack of documentation of the code.

1. Units of the variables: microseconds.
2. Wait state: This indicates whether the HSA runtime should actively wait. As far as I understand it, this means that it will likely not perform a context switch while waiting, which can improve the responsiveness to the signal value change. I did some basic profiling with babelstream using standard waiting and the timeout values we used in the old plugin. I can see improvements in the time the system spends within hsa_signal_wait_scacquire. This suggests that active waiting is indeed helping with latency, and therefore with the runtime "realizing" that a short-running kernel has finished. Babelstream results improve reproducibly.
3. As a result of 2., I think, we used different values for kernels and data movements, the idea being that data transfers may just be a little slower.

I will go ahead and update the patch to address all code-related comments.

rebase the patch
add documentation
address reviewer comments

Harbormaster completed remote builds in B228257: Diff 517116. Apr 26 2023, 3:42 AM

From the meeting discussion.

default to active waiting rather than 0.
Consider a single LIBOMPTARGET_AMDGPU_STREAM_BUSYWAIT.

One environment variable, defaulting to 2 seconds timeout for active waiting, rename, const
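
Reading a single microsecond-valued variable with a 2-second default could look roughly like the sketch below. The helper names are hypothetical; the plugin itself uses its own `UInt32Envar` class (visible in the diff), whose parsing rules are not reproduced here. Falling back to the default on malformed input is an assumption of this sketch, not documented plugin behaviour.

```cpp
#include <cctype>
#include <cerrno>
#include <cstdint>
#include <cstdlib>

// Default from the review: 2,000,000 microseconds, i.e. 2 seconds.
constexpr std::uint64_t DefaultBusyWaitMicros = 2000000;

// Hypothetical helper: parse a non-negative decimal microsecond count.
// Returns Fallback if the text is missing, empty, or not a plain number.
std::uint64_t parseBusyWaitMicros(const char *Text, std::uint64_t Fallback) {
  // Require a leading digit; this rejects null, empty strings, whitespace,
  // and signs (strtoull would otherwise silently wrap negative input).
  if (Text == nullptr || !std::isdigit(static_cast<unsigned char>(*Text)))
    return Fallback;

  errno = 0;
  char *End = nullptr;
  const unsigned long long Parsed = std::strtoull(Text, &End, 10);

  // Reject out-of-range values and trailing characters such as "12ms".
  if (errno != 0 || *End != '\0')
    return Fallback;
  return static_cast<std::uint64_t>(Parsed);
}

// Hypothetical wrapper that consults the variable named in this review.
std::uint64_t readBusyWaitMicros() {
  return parseBusyWaitMicros(
      std::getenv("LIBOMPTARGET_AMDGPU_STREAM_BUSYWAIT"),
      DefaultBusyWaitMicros);
}
```

Because the value is in microseconds, setting the variable to `500` would request half a millisecond of active waiting. In the diff, a timeout of `0` skips the active wait entirely and goes straight to the blocking wait.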

Harbormaster completed remote builds in B229672: Diff 519049. May 3 2023, 6:09 AM

ye-luo accepted this revision. May 3 2023, 7:11 AM

ye-luo added inline comments.

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
903	Could you move this right below the mutex on line 891? Prefer not to have variables stuck between functions.

This revision is now accepted and ready to land. May 3 2023, 7:11 AM

Updating the patch description is also needed.

jplehr mentioned this in D146849: [OpenMP][libomptarget] Active and blocking HSA wait states. May 4 2023, 2:36 AM

Address inline comment

Harbormaster completed remote builds in B229927: Diff 519402. May 4 2023, 2:47 AM

jplehr retitled this revision from [OpenMP][libomptarget][AMDGPU] Enable optional active HSA wait state to [OpenMP][libomptarget][AMDGPU] Enable active HSA wait state. May 4 2023, 2:51 AM

jplehr edited the summary of this revision.

Closed by commit rGf238a98e8447: [OpenMP][libomptarget][AMDGPU] Enable active HSA wait state (authored by gregrodgers, committed by jplehr). May 4 2023, 3:02 AM

This revision was automatically updated to reflect the committed changes.

jplehr added a commit: rGf238a98e8447: [OpenMP][libomptarget][AMDGPU] Enable active HSA wait state.

Diff 519413

openmp/docs/design/Runtimes.rst

	[...]
	* ``LIBOMPTARGET_NUM_INITIAL_STREAMS``			* ``LIBOMPTARGET_NUM_INITIAL_STREAMS``
	* ``LIBOMPTARGET_NUM_INITIAL_EVENTS``			* ``LIBOMPTARGET_NUM_INITIAL_EVENTS``
	* ``LIBOMPTARGET_LOCK_MAPPED_HOST_BUFFERS``			* ``LIBOMPTARGET_LOCK_MAPPED_HOST_BUFFERS``
	* ``LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES``			* ``LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES``
	* ``LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE``			* ``LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE``
	* ``LIBOMPTARGET_AMDGPU_TEAMS_PER_CU``			* ``LIBOMPTARGET_AMDGPU_TEAMS_PER_CU``
	* ``LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES``			* ``LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES``
	* ``LIBOMPTARGET_AMDGPU_NUM_INITIAL_HSA_SIGNALS``			* ``LIBOMPTARGET_AMDGPU_NUM_INITIAL_HSA_SIGNALS``
				* ``LIBOMPTARGET_AMDGPU_STREAM_BUSYWAIT``

	The environment variables ``LIBOMPTARGET_SHARED_MEMORY_SIZE``,			The environment variables ``LIBOMPTARGET_SHARED_MEMORY_SIZE``,
	``LIBOMPTARGET_STACK_SIZE`` and ``LIBOMPTARGET_HEAP_SIZE`` are described in			``LIBOMPTARGET_STACK_SIZE`` and ``LIBOMPTARGET_HEAP_SIZE`` are described in
	:ref:`libopenmptarget_environment_vars`.			:ref:`libopenmptarget_environment_vars`.

	LIBOMPTARGET_NUM_INITIAL_STREAMS			LIBOMPTARGET_NUM_INITIAL_STREAMS
	""""""""""""""""""""""""""""""""			""""""""""""""""""""""""""""""""

	[...]
	"""""""""""""""""""""""""""""""""""""""""""			"""""""""""""""""""""""""""""""""""""""""""

	This environment variable controls the initial number of HSA signals per device			This environment variable controls the initial number of HSA signals per device
	in the AMDGPU plugin. There is one resource manager of signals per device			in the AMDGPU plugin. There is one resource manager of signals per device
	managing several pre-created signals. These signals are mainly used by AMDGPU			managing several pre-created signals. These signals are mainly used by AMDGPU
	streams. More HSA signals will be created dynamically throughout the execution			streams. More HSA signals will be created dynamically throughout the execution
	if needed. The default value is ``64``.			if needed. The default value is ``64``.

				LIBOMPTARGET_AMDGPU_STREAM_BUSYWAIT
				"""""""""""""""""""""""""""""""""""

				This environment variable controls the timeout hint in microseconds for the
				HSA wait state within the AMDGPU plugin. For the duration of this value
				the HSA runtime may busy wait. This can reduce overall latency.
				The default value is ``2000000``.

	.. _remote_offloading_plugin:			.. _remote_offloading_plugin:

	Remote Offloading Plugin:			Remote Offloading Plugin:
	^^^^^^^^^^^^^^^^^^^^^^^^^			^^^^^^^^^^^^^^^^^^^^^^^^^

	The remote offloading plugin permits the execution of OpenMP target regions			The remote offloading plugin permits the execution of OpenMP target regions
	on devices in remote hosts in addition to the devices connected to the local			on devices in remote hosts in addition to the devices connected to the local
	host. All target devices on the remote host will be exposed to the			host. All target devices on the remote host will be exposed to the
	[...]

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

[...]	struct AMDGPUSignalTy {

/// Deinitialize the signal.		/// Deinitialize the signal.
Error deinit() {		Error deinit() {
hsa_status_t Status = hsa_signal_destroy(Signal);		hsa_status_t Status = hsa_signal_destroy(Signal);
return Plugin::check(Status, "Error in hsa_signal_destroy: %s");		return Plugin::check(Status, "Error in hsa_signal_destroy: %s");
}		}

/// Wait until the signal gets a zero value.		/// Wait until the signal gets a zero value.
Error wait() const {		Error wait(const uint64_t ActiveTimeout = 0) const {
// TODO: Is it better to use busy waiting or blocking the thread?		if (ActiveTimeout) {
		hsa_signal_value_t Got = 1;
		Got = hsa_signal_wait_scacquire(Signal, HSA_SIGNAL_CONDITION_EQ, 0,
		ActiveTimeout, HSA_WAIT_STATE_ACTIVE);
		if (Got == 0)
		return Plugin::success();
		}
		ye-luo: When reading the implementation of wait(), I'm still wondering how errors are handled when kernels finish (signal equals 0) with an error.
while (hsa_signal_wait_scacquire(Signal, HSA_SIGNAL_CONDITION_EQ, 0,		while (hsa_signal_wait_scacquire(Signal, HSA_SIGNAL_CONDITION_EQ, 0,
UINT64_MAX, HSA_WAIT_STATE_BLOCKED) != 0)		UINT64_MAX, HSA_WAIT_STATE_BLOCKED) != 0)
;		;
return Plugin::success();		return Plugin::success();
}		}

/// Load the value on the signal.		/// Load the value on the signal.
hsa_signal_value_t load() const { return hsa_signal_load_scacquire(Signal); }		hsa_signal_value_t load() const { return hsa_signal_load_scacquire(Signal); }
[...]	private:
/// The synchronization id. This number is increased each time the stream is		/// The synchronization id. This number is increased each time the stream is
/// synchronized. It is useful to detect if an AMDGPUEventTy points to an		/// synchronized. It is useful to detect if an AMDGPUEventTy points to an
/// operation that was already finalized in a previous stream sycnhronize.		/// operation that was already finalized in a previous stream sycnhronize.
uint32_t SyncCycle;		uint32_t SyncCycle;

/// Mutex to protect stream's management.		/// Mutex to protect stream's management.
mutable std::mutex Mutex;		mutable std::mutex Mutex;

		/// Timeout hint for HSA actively waiting for signal value to change
		const uint64_t StreamBusyWaitMicroseconds;

/// Return the current number of asychronous operations on the stream.		/// Return the current number of asychronous operations on the stream.
uint32_t size() const { return NextSlot; }		uint32_t size() const { return NextSlot; }

/// Return the last valid slot on the stream.		/// Return the last valid slot on the stream.
uint32_t last() const { return size() - 1; }		uint32_t last() const { return size() - 1; }

/// Consume one slot from the stream. Since the stream uses signals on demand		/// Consume one slot from the stream. Since the stream uses signals on demand
/// and releases them once the slot is no longer used, the function requires		/// and releases them once the slot is no longer used, the function requires
		ye-luo: Could you move this right below the mutex on line 891? Prefer not to have variables stuck between functions.
/// an idle signal for the new consumed slot.		/// an idle signal for the new consumed slot.
std::pair<uint32_t, AMDGPUSignalTy > consume(AMDGPUSignalTy OutputSignal) {		std::pair<uint32_t, AMDGPUSignalTy > consume(AMDGPUSignalTy OutputSignal) {
// Double the stream size if needed. Since we use std::deque, this operation		// Double the stream size if needed. Since we use std::deque, this operation
// does not invalidate the already added slots.		// does not invalidate the already added slots.
if (Slots.size() == NextSlot)		if (Slots.size() == NextSlot)
Slots.resize(Slots.size() * 2);		Slots.resize(Slots.size() * 2);

// Update the next available slot and the stream size.		// Update the next available slot and the stream size.
[...]	if (InputSignal && InputSignal->load()) {
&InputSignalRaw, OutputSignal->get());		&InputSignalRaw, OutputSignal->get());
} else		} else
Status = hsa_amd_memory_async_copy(Dst, Agent, Inter, Agent, CopySize, 0,		Status = hsa_amd_memory_async_copy(Dst, Agent, Inter, Agent, CopySize, 0,
nullptr, OutputSignal->get());		nullptr, OutputSignal->get());

return Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s");		return Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s");
}		}

/// Synchronize with the stream. The current thread waits until all operations		/// Synchronize with the stream. The current thread waits until all operations
		ye-luo: Should this be private?
		kevinsala: Can we place this class data member just after the existing ones?
/// are finalized and it performs the pending post actions (i.e., releasing		/// are finalized and it performs the pending post actions (i.e., releasing
/// intermediate buffers).		/// intermediate buffers).
Error synchronize() {		Error synchronize() {
std::lock_guard<std::mutex> Lock(Mutex);		std::lock_guard<std::mutex> Lock(Mutex);

// No need to synchronize anything.		// No need to synchronize anything.
if (size() == 0)		if (size() == 0)
return Plugin::success();		return Plugin::success();

// Wait until all previous operations on the stream have completed.		// Wait until all previous operations on the stream have completed.
if (auto Err = Slots[last()].Signal->wait())		if (auto Err = Slots[last()].Signal->wait(StreamBusyWaitMicroseconds))
return Err;		return Err;

// Reset the stream and perform all pending post actions.		// Reset the stream and perform all pending post actions.
return complete();		return complete();
}		}

/// Query the stream and complete pending post actions if operations finished.		/// Query the stream and complete pending post actions if operations finished.
/// Return whether all the operations completed. This operation does not block		/// Return whether all the operations completed. This operation does not block
[...]	AMDGPUDeviceTy(int32_t DeviceId, int32_t NumDevices,
: GenericDeviceTy(DeviceId, NumDevices, {0}), AMDGenericDeviceTy(),		: GenericDeviceTy(DeviceId, NumDevices, {0}), AMDGenericDeviceTy(),
OMPX_NumQueues("LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES", 4),		OMPX_NumQueues("LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES", 4),
OMPX_QueueSize("LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE", 512),		OMPX_QueueSize("LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE", 512),
OMPX_DefaultTeamsPerCU("LIBOMPTARGET_AMDGPU_TEAMS_PER_CU", 4),		OMPX_DefaultTeamsPerCU("LIBOMPTARGET_AMDGPU_TEAMS_PER_CU", 4),
OMPX_MaxAsyncCopyBytes("LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES",		OMPX_MaxAsyncCopyBytes("LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES",
1 * 1024 * 1024), // 1MB		1 * 1024 * 1024), // 1MB
OMPX_InitialNumSignals("LIBOMPTARGET_AMDGPU_NUM_INITIAL_HSA_SIGNALS",		OMPX_InitialNumSignals("LIBOMPTARGET_AMDGPU_NUM_INITIAL_HSA_SIGNALS",
64),		64),
		OMPX_StreamBusyWait("LIBOMPTARGET_AMDGPU_STREAM_BUSYWAIT", 2000000),
AMDGPUStreamManager(this), AMDGPUEventManager(this),		AMDGPUStreamManager(this), AMDGPUEventManager(this),
AMDGPUSignalManager(*this), Agent(Agent), HostDevice(HostDevice),		AMDGPUSignalManager(*this), Agent(Agent), HostDevice(HostDevice),
Queues() {}		Queues() {}

~AMDGPUDeviceTy() {}		~AMDGPUDeviceTy() {}

/// Initialize the device, its resources and get its properties.		/// Initialize the device, its resources and get its properties.
Error initImpl(GenericPluginTy &Plugin) override {		Error initImpl(GenericPluginTy &Plugin) override {
		ye-luo: Should they be private?
		kevinsala: Are these fields really needed? We can access the envars directly (which have the values cached in a variable), right?
// First setup all the memory pools.		// First setup all the memory pools.
if (auto Err = initMemoryPools())		if (auto Err = initMemoryPools())
return Err;		return Err;

char GPUName[64];		char GPUName[64];
if (auto Err = getDeviceAttr(HSA_AGENT_INFO_NAME, GPUName))		if (auto Err = getDeviceAttr(HSA_AGENT_INFO_NAME, GPUName))
return Err;		return Err;
ComputeUnitKind = GPUName;		ComputeUnitKind = GPUName;
[...]	Error deinitImpl() override {
}		}

// Invalidate agent reference.		// Invalidate agent reference.
Agent = {0};		Agent = {0};

return Plugin::success();		return Plugin::success();
}		}

		const uint64_t getStreamBusyWaitMicroseconds() const {
		return OMPX_StreamBusyWait;
		}

Expected<std::unique_ptr<MemoryBuffer>>		Expected<std::unique_ptr<MemoryBuffer>>
doJITPostProcessing(std::unique_ptr<MemoryBuffer> MB) const override {		doJITPostProcessing(std::unique_ptr<MemoryBuffer> MB) const override {

// TODO: We should try to avoid materialization but there seems to be no		// TODO: We should try to avoid materialization but there seems to be no
// good linker interface w/o file i/o.		// good linker interface w/o file i/o.
SmallString<128> LinkerOutputFilePath;		SmallString<128> LinkerOutputFilePath;
std::error_code EC = sys::fs::createTemporaryFile(		std::error_code EC = sys::fs::createTemporaryFile(
"amdgpu-pre-link-jit", ".out", LinkerOutputFilePath);		"amdgpu-pre-link-jit", ".out", LinkerOutputFilePath);
[...]	if (Size >= OMPX_MaxAsyncCopyBytes) {
return Err;		return Err;

Status = hsa_amd_memory_async_copy(TgtPtr, Agent, PinnedHstPtr, Agent,		Status = hsa_amd_memory_async_copy(TgtPtr, Agent, PinnedHstPtr, Agent,
Size, 0, nullptr, Signal.get());		Size, 0, nullptr, Signal.get());
if (auto Err =		if (auto Err =
Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s"))		Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s"))
return Err;		return Err;

if (auto Err = Signal.wait())		if (auto Err = Signal.wait(getStreamBusyWaitMicroseconds()))
return Err;		return Err;

if (auto Err = Signal.deinit())		if (auto Err = Signal.deinit())
return Err;		return Err;

Status = hsa_amd_memory_unlock(const_cast<void *>(HstPtr));		Status = hsa_amd_memory_unlock(const_cast<void *>(HstPtr));
return Plugin::check(Status, "Error in hsa_amd_memory_unlock: %s\n");		return Plugin::check(Status, "Error in hsa_amd_memory_unlock: %s\n");
}		}
[...]	if (Size >= OMPX_MaxAsyncCopyBytes) {
return Err;		return Err;

Status = hsa_amd_memory_async_copy(PinnedHstPtr, Agent, TgtPtr, Agent,		Status = hsa_amd_memory_async_copy(PinnedHstPtr, Agent, TgtPtr, Agent,
Size, 0, nullptr, Signal.get());		Size, 0, nullptr, Signal.get());
if (auto Err =		if (auto Err =
Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s"))		Plugin::check(Status, "Error in hsa_amd_memory_async_copy: %s"))
return Err;		return Err;

if (auto Err = Signal.wait())		if (auto Err = Signal.wait(getStreamBusyWaitMicroseconds()))
return Err;		return Err;

if (auto Err = Signal.deinit())		if (auto Err = Signal.deinit())
return Err;		return Err;

Status = hsa_amd_memory_unlock(const_cast<void *>(HstPtr));		Status = hsa_amd_memory_unlock(const_cast<void *>(HstPtr));
return Plugin::check(Status, "Error in hsa_amd_memory_unlock: %s\n");		return Plugin::check(Status, "Error in hsa_amd_memory_unlock: %s\n");
}		}
[...]	private:
UInt32Envar OMPX_MaxAsyncCopyBytes;		UInt32Envar OMPX_MaxAsyncCopyBytes;

/// Envar controlling the initial number of HSA signals per device. There is		/// Envar controlling the initial number of HSA signals per device. There is
/// one manager of signals per device managing several pre-allocated signals.		/// one manager of signals per device managing several pre-allocated signals.
/// These signals are mainly used by AMDGPU streams. If needed, more signals		/// These signals are mainly used by AMDGPU streams. If needed, more signals
/// will be created.		/// will be created.
UInt32Envar OMPX_InitialNumSignals;		UInt32Envar OMPX_InitialNumSignals;

		/// Environment variables to set the time to wait in active state before
		/// switching to blocked state. The default 2000000 busywaits for 2 seconds
		/// before going into a blocking HSA wait state. The unit for these variables
		/// are microseconds.
		UInt32Envar OMPX_StreamBusyWait;

/// Stream manager for AMDGPU streams.		/// Stream manager for AMDGPU streams.
AMDGPUStreamManagerTy AMDGPUStreamManager;		AMDGPUStreamManagerTy AMDGPUStreamManager;

/// Event manager for AMDGPU events.		/// Event manager for AMDGPU events.
AMDGPUEventManagerTy AMDGPUEventManager;		AMDGPUEventManagerTy AMDGPUEventManager;

/// Signal manager for AMDGPU signals.		/// Signal manager for AMDGPU signals.
AMDGPUSignalManagerTy AMDGPUSignalManager;		AMDGPUSignalManagerTy AMDGPUSignalManager;
[...]	Error AMDGPUResourceRef<ResourceTy>::create(GenericDeviceTy &Device) {

return Resource->init();		return Resource->init();
}		}

AMDGPUStreamTy::AMDGPUStreamTy(AMDGPUDeviceTy &Device)		AMDGPUStreamTy::AMDGPUStreamTy(AMDGPUDeviceTy &Device)
: Agent(Device.getAgent()), Queue(Device.getNextQueue()),		: Agent(Device.getAgent()), Queue(Device.getNextQueue()),
SignalManager(Device.getSignalManager()),		SignalManager(Device.getSignalManager()),
// Initialize the std::deque with some empty positions.		// Initialize the std::deque with some empty positions.
Slots(32), NextSlot(0), SyncCycle(0) {}		Slots(32), NextSlot(0), SyncCycle(0),
		StreamBusyWaitMicroseconds(Device.getStreamBusyWaitMicroseconds()) {}

/// Class implementing the AMDGPU-specific functionalities of the global		/// Class implementing the AMDGPU-specific functionalities of the global
/// handler.		/// handler.
struct AMDGPUGlobalHandlerTy final : public GenericGlobalHandlerTy {		struct AMDGPUGlobalHandlerTy final : public GenericGlobalHandlerTy {
/// Get the metadata of a global from the device. The name and size of the		/// Get the metadata of a global from the device. The name and size of the
/// global is read from DeviceGlobal and the address of the global is written		/// global is read from DeviceGlobal and the address of the global is written
/// to DeviceGlobal.		/// to DeviceGlobal.
Error getGlobalMetadataFromDevice(GenericDeviceTy &Device,		Error getGlobalMetadataFromDevice(GenericDeviceTy &Device,
[...]
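
The control flow of the updated `AMDGPUSignalTy::wait()` in the diff above can be exercised in isolation by injecting the wait primitive. This is a sketch under stated assumptions: `WaitState`, `WaitFn`, `WaitTrace`, and `waitForZero` are hypothetical names standing in for the HSA wait states and `hsa_signal_wait_scacquire`, and the call counters exist only to make the behaviour observable. Like the code in the diff, it distinguishes only zero from nonzero signal values and does not model error reporting, which is the open question raised in the inline comment on line 521.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical stand-ins for HSA_WAIT_STATE_ACTIVE and HSA_WAIT_STATE_BLOCKED.
enum class WaitState { Active, Blocked };

// Hypothetical wait primitive: waits until the signal is zero or the timeout
// expires, and returns the signal value it observed.
using WaitFn =
    std::function<std::int64_t(std::uint64_t Timeout, WaitState State)>;

// Records how many waits of each kind were issued.
struct WaitTrace {
  unsigned ActiveCalls = 0;
  unsigned BlockedCalls = 0;
};

// Mirrors the structure of wait() in the diff: one optional active wait
// bounded by ActiveTimeout, then blocking waits until the value is zero.
WaitTrace waitForZero(const WaitFn &Wait, std::uint64_t ActiveTimeout) {
  WaitTrace Trace;

  // A timeout of zero disables the active phase entirely.
  if (ActiveTimeout != 0) {
    ++Trace.ActiveCalls;
    if (Wait(ActiveTimeout, WaitState::Active) == 0)
      return Trace;
  }

  // Fall back to blocking waits; retry if a wait returns a nonzero value.
  do {
    ++Trace.BlockedCalls;
  } while (Wait(UINT64_MAX, WaitState::Blocked) != 0);
  return Trace;
}
```

Passing a fake `WaitFn` that returns zero immediately shows that a completed operation costs exactly one active wait and no blocking wait, whereas a nonzero result from the active phase leads into the original blocking loop.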

This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][libomptarget][AMDGPU] Enable active HSA wait state
ClosedPublic
