This is an archive of the discontinued LLVM Phabricator instance.

I rewrote the function trying to be a bit simplier. Conceptually, I think the only difference is that NextQueue is always increased. Would it make sense?

inline Error assignNextQueue(AMDGPUStreamTy *Stream) {
  uint32_t StartIndex = NextQueue % MaxNumQueues;

  if (OMPX_QueueTracking) {
    // Find the first idle queue.
    for (uint32_t I = 0; I < MaxNumQueues; ++I) {
      if (!Queues[StartIndex].isBusy())
        break;

      StartIndex = (StartIndex + 1) % MaxNumQueues;
    }
  }

  // Try to initialize queue and increase its users.
  if (Queues[StartIndex].init(Agent, QueueSize))
    return Err;
  Queues[StartIndex].addUser();

  // Assign the queue.
  Stream->Queue = &Queues[StartIndex];

  // Always move to the next queue.
  ++NextQueue;

  return Plugin::success();
}

Also, in the case OMPX_QueueTracking is enabled, when no queues are idle, we could fallback to the queue with less users. The minimum can be computed while iterating the queues in that loop.

Added documentation.

Thank you -- much appreciated!

Harbormaster completed remote builds in B250102: Diff 546912.Aug 3 2023, 9:59 AM

mhalk added inline comments.Aug 3 2023, 10:30 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1466	Good catch, thanks! I wasn't aware of that.
1530	Really like the reduced complexity. Now that `NextQueue` is always increased wouldn't that forfeit the purpose of handling busy queues? -- at least for the first `MaxNumQueues` Stream-requests. Because I think we will now always look at a Queue that has not been initialized and is therefore considered "idle"/not busy. I guess keeping `NextQueue` at zero until the maximum number of queues is actually initialized is important, unless we find sth. equivalent. I'll think about it a bit more.

kevinsala added inline comments.Aug 3 2023, 11:28 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1530	I don't think updating `NextQueue` will have an impact on the tracking mechanism. For the tracking, `NextQueue` just indicates which queue to start looking at. But you will traverse the whole array of queues if we are not finding idle queues. So, in the first `MaxNumQueues` Stream-request, the mechanism will do the following: 1st request: NextQueue is 0 -> check queue 0 (idle) -> break 2nd request: NextQueue is 1 -> check queue 1 (idle) -> break 3rd request: NextQueue is 2 -> check queue 2 (idle) -> break ... In these first `MaxNumQueues` requests, we will always break at the first loop iteration because no one will be using the queue at`NextQueue` positions yet. Then, after those first requests, we will probably start having to iterate over the queues to find if any is idle.

mhalk added inline comments.Aug 3 2023, 11:43 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1530	Yes, but what if upon the 2nd request queue_0 is idle. Won't we initialize queue_1 albeit there is an already initialized one ready? Or do we not care about this? If so & should you have time -- please share some insight in this.

kevinsala added inline comments.Aug 3 2023, 11:51 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1530	True, it would make sense to re-use the already initialized one. That can be changed on the simplified function easily. I've some other doubts regarding falling back to round robin when there is no idle queue. Since the number of queues is very low, I think that improving the fallback (i.e., when all queues are busy) is important. The default number of queues is four, so we will already move to the fallback mechanism (simple round-robin) when having 4 streams working concurrently. For the same cost as now, we can chose the queue with the minimum number of users instead. I believe this could keep the users (more or less) well-balanced among queue. Does it make sense?

mhalk added inline comments.Aug 3 2023, 12:11 PM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
1530	Yes, that does make sense to me. When you say "for the same cost", what do you have in mind? I guess, I'll check if there are measurable drawbacks to round robin vs. "least contention", performance-wise.

Just as a heads-up, what I'll be looking at.
Maybe this far away from what you had in mind.

One thing I noticed when refactoring this to the selection dependent on "user count" is that:
In conjunction with an early exit, which I definitely wanted, I think we can eliminate one iteration on the for loop?!
(Alleviates the re-introduced complexity a bit, I guess.)
I'm positive we find a good solution to this.

Changed isBusy to getUserCount and added isInitialized.

@kevinsala Let me know what you think.
Also, tell me if I should just update the diff rather than posting code snippets.

inline Error assignNextQueue(AMDGPUStreamTy *Stream) {
    uint32_t SelectedIndex = NextQueue % MaxNumQueues;

    if (OMPX_QueueTracking) {
      // Take utilization into account, begin at SelectedIndex
      uint32_t Index = SelectedIndex;

      for (uint32_t I = 1; I < MaxNumQueues; ++I) {
        // Early exit when an initialized queue is idle
        if (Queues[SelectedIndex].isInitialized() &&
            Queues[SelectedIndex].getUserCount() == 0)
          break;

        // Increment Index & potential wrap around
        if (++Index >= MaxNumQueues)
          Index = 0;

        // Update the least contested queue
        if (Queues[SelectedIndex].getUserCount() > Queues[Index].getUserCount())
          SelectedIndex = Index;
      }
    }

    // Make sure we assign an initialized queue, then add user & assign
    if (auto Err = Queues[SelectedIndex].init(Agent, QueueSize))
      return Err;
    Queues[SelectedIndex].addUser();
    Stream->Queue = &Queues[SelectedIndex];

    // Move cursor to the next queue
    ++NextQueue;
    return Plugin::success();
  }

I would keep it simple:

inline Error assignNextQueue(AMDGPUStreamTy *Stream) {
  uint32_t SelectedIndex = 0;

  if (OMPX_QueueTracking) {
    // Find the least used queue.
    for (uint32_t I = 0; I < MaxNumQueues; ++I) {
      // Early exit when an initialized queue is idle
      if (Queues[I].isInitialized() && Queues[I].getUserCount() == 0) {
        SelectedIndex = I;
        break;
      }
      
      // Update the least contested queue
      if (Queues[SelectedIndex].getUserCount() > Queues[I].getUserCount())
        SelectedIndex = I;
    }
  } else {
    // Round-robin policy.
    SelectedIndex = NextQueue++ % MaxNumQueues;
  }
  
  // Make sure we assign an initialized queue, then add user & assign
  if (auto Err = Queues[SelectedIndex].init(Agent, QueueSize))
    return Err;
  Queues[SelectedIndex].addUser();
  Stream->Queue = &Queues[SelectedIndex];
  
  return Plugin::success();
}

This variant is based on your last snippet. NextQueue is not used at all when queue tracking is enabled; it's only used when working in round-robin policy. In queue tracking mode, we start from the first queue. It features the early exit you wrote. Although it doesn't skip any iteration, I don't think it'll impact the performance.

Implemented feedback and adapted docs.

Thanks @kevinsala for the helpful feedback!
IMHO think this is a very readable solution.

Changes to the snippet:
Decided to rename the Index variable and change its assignment.

Harbormaster completed remote builds in B250377: Diff 547277.Aug 4 2023, 11:05 AM

LGTM. Thanks!

Closed by commit rG7eba3e58d5a3: [OpenMP][AMDGPU] Add Envar for controlling HSA busy queue tracking (authored by mhalk). · Explain WhyAug 7 2023, 7:49 AM

This revision was automatically updated to reflect the committed changes.

mhalk added a commit: rG7eba3e58d5a3: [OpenMP][AMDGPU] Add Envar for controlling HSA busy queue tracking.

Revision Contents

Path

Size

openmp/

docs/

design/

Runtimes.rst

12 lines

libomptarget/

plugins-nextgen/

amdgpu/

src/

rtl.cpp

68 lines

Diff 547794

openmp/docs/design/Runtimes.rst

	Show First 20 Lines • Show All 1,169 Lines • ▼ Show 20 Lines
	* ``LIBOMPTARGET_SHARED_MEMORY_SIZE``			* ``LIBOMPTARGET_SHARED_MEMORY_SIZE``
	* ``LIBOMPTARGET_STACK_SIZE``			* ``LIBOMPTARGET_STACK_SIZE``
	* ``LIBOMPTARGET_HEAP_SIZE``			* ``LIBOMPTARGET_HEAP_SIZE``
	* ``LIBOMPTARGET_NUM_INITIAL_STREAMS``			* ``LIBOMPTARGET_NUM_INITIAL_STREAMS``
	* ``LIBOMPTARGET_NUM_INITIAL_EVENTS``			* ``LIBOMPTARGET_NUM_INITIAL_EVENTS``
	* ``LIBOMPTARGET_LOCK_MAPPED_HOST_BUFFERS``			* ``LIBOMPTARGET_LOCK_MAPPED_HOST_BUFFERS``
	* ``LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES``			* ``LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES``
	* ``LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE``			* ``LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE``
				* ``LIBOMPTARGET_AMDGPU_HSA_QUEUE_BUSY_TRACKING``
	* ``LIBOMPTARGET_AMDGPU_TEAMS_PER_CU``			* ``LIBOMPTARGET_AMDGPU_TEAMS_PER_CU``
	* ``LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES``			* ``LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES``
	* ``LIBOMPTARGET_AMDGPU_NUM_INITIAL_HSA_SIGNALS``			* ``LIBOMPTARGET_AMDGPU_NUM_INITIAL_HSA_SIGNALS``
	* ``LIBOMPTARGET_AMDGPU_STREAM_BUSYWAIT``			* ``LIBOMPTARGET_AMDGPU_STREAM_BUSYWAIT``

	The environment variables ``LIBOMPTARGET_SHARED_MEMORY_SIZE``,			The environment variables ``LIBOMPTARGET_SHARED_MEMORY_SIZE``,
	``LIBOMPTARGET_STACK_SIZE`` and ``LIBOMPTARGET_HEAP_SIZE`` are described in			``LIBOMPTARGET_STACK_SIZE`` and ``LIBOMPTARGET_HEAP_SIZE`` are described in
	:ref:`libopenmptarget_environment_vars`.			:ref:`libopenmptarget_environment_vars`.
	Show All 40 Lines
	LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE			LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE
	""""""""""""""""""""""""""""""""""			""""""""""""""""""""""""""""""""""

	This environment variable controls the size of each HSA queue in the AMDGPU			This environment variable controls the size of each HSA queue in the AMDGPU
	plugin. The size is the number of AQL packets an HSA queue is expected to hold.			plugin. The size is the number of AQL packets an HSA queue is expected to hold.
	It is also the number of AQL packets that can be pushed into each queue without			It is also the number of AQL packets that can be pushed into each queue without
	waiting the driver to process them. The default value is ``512``.			waiting the driver to process them. The default value is ``512``.

				LIBOMPTARGET_AMDGPU_HSA_QUEUE_BUSY_TRACKING
				"""""""""""""""""""""""""""""""""""""""""""

				This environment variable controls if idle HSA queues will be preferentially
				assigned to streams, for example when they are requested for a kernel launch.
				Should all queues be considered busy, a new queue is initialized and returned,
				until we reach the set maximum. Otherwise, we will select the least utilized
				queue. If this is disabled, each time a stream is requested a new HSA queue
				will be initialized, regardless of their utilization. Additionally, queues will
				be selected using round robin selection. The default value is ``true``.

	.. _libomptarget_amdgpu_teams_per_cu:			.. _libomptarget_amdgpu_teams_per_cu:

	LIBOMPTARGET_AMDGPU_TEAMS_PER_CU			LIBOMPTARGET_AMDGPU_TEAMS_PER_CU
	""""""""""""""""""""""""""""""""			""""""""""""""""""""""""""""""""

	This environment variable controls the default number of teams relative to the			This environment variable controls the default number of teams relative to the
	number of compute units (CUs) of the AMDGPU device. The default number of teams			number of compute units (CUs) of the AMDGPU device. The default number of teams
	is ``#default_teams = #teams_per_CU * #CUs``. The default value of teams per CU			is ``#default_teams = #teams_per_CU * #CUs``. The default value of teams per CU
	▲ Show 20 Lines • Show All 214 Lines • Show Last 20 Lines

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

Show First 20 Lines • Show All 588 Lines • ▼ Show 20 Lines	struct AMDGPUQueueTy {
Error deinit() {		Error deinit() {
std::lock_guard<std::mutex> Lock(Mutex);		std::lock_guard<std::mutex> Lock(Mutex);
if (!Queue)		if (!Queue)
return Plugin::success();		return Plugin::success();
hsa_status_t Status = hsa_queue_destroy(Queue);		hsa_status_t Status = hsa_queue_destroy(Queue);
return Plugin::check(Status, "Error in hsa_queue_destroy: %s");		return Plugin::check(Status, "Error in hsa_queue_destroy: %s");
}		}

/// Returns if this queue is considered busy		/// Returns the number of streams, this queue is currently assigned to.
bool isBusy() const { return NumUsers > 0; }		bool getUserCount() const { return NumUsers; }

/// Decrement user count of the queue object		/// Returns if the underlying HSA queue is initialized.
		bool isInitialized() { return Queue != nullptr; }

		/// Decrement user count of the queue object.
void removeUser() { --NumUsers; }		void removeUser() { --NumUsers; }

/// Increase user count of the queue object		/// Increase user count of the queue object.
void addUser() { ++NumUsers; }		void addUser() { ++NumUsers; }

/// Push a kernel launch to the queue. The kernel launch requires an output		/// Push a kernel launch to the queue. The kernel launch requires an output
/// signal and can define an optional input signal (nullptr if none).		/// signal and can define an optional input signal (nullptr if none).
Error pushKernelLaunch(const AMDGPUKernelTy &Kernel, void *KernelArgs,		Error pushKernelLaunch(const AMDGPUKernelTy &Kernel, void *KernelArgs,
uint32_t NumThreads, uint64_t NumBlocks,		uint32_t NumThreads, uint64_t NumBlocks,
uint32_t GroupSize, AMDGPUSignalTy *OutputSignal,		uint32_t GroupSize, AMDGPUSignalTy *OutputSignal,
AMDGPUSignalTy *InputSignal) {		AMDGPUSignalTy *InputSignal) {
▲ Show 20 Lines • Show All 167 Lines • ▼ Show 20 Lines	private:
/// could acquire packet P and thread T2 acquire packet P+1. Thread T2 could		/// could acquire packet P and thread T2 acquire packet P+1. Thread T2 could
/// publish its packet P+1 (signaling the queue's doorbell) before packet P		/// publish its packet P+1 (signaling the queue's doorbell) before packet P
/// from T1 is ready to be processed. That scenario should be invalid. Thus,		/// from T1 is ready to be processed. That scenario should be invalid. Thus,
/// we use the following mutex to make packet acquiring and publishing atomic.		/// we use the following mutex to make packet acquiring and publishing atomic.
/// TODO: There are other more advanced approaches to avoid this mutex using		/// TODO: There are other more advanced approaches to avoid this mutex using
/// atomic operations. We can further investigate it if this is a bottleneck.		/// atomic operations. We can further investigate it if this is a bottleneck.
std::mutex Mutex;		std::mutex Mutex;

/// Indicates that the queue is busy when > 0		/// The number of streams, this queue is currently assigned to. A queue is
int NumUsers;		/// considered idle when this is zero, otherwise: busy.
		uint32_t NumUsers;
};		};

/// Struct that implements a stream of asynchronous operations for AMDGPU		/// Struct that implements a stream of asynchronous operations for AMDGPU
/// devices. This class relies on signals to implement streams and define the		/// devices. This class relies on signals to implement streams and define the
/// dependencies between asynchronous operations.		/// dependencies between asynchronous operations.
struct AMDGPUStreamTy {		struct AMDGPUStreamTy {
private:		private:
/// Utility struct holding arguments for async H2H memory copies.		/// Utility struct holding arguments for async H2H memory copies.
▲ Show 20 Lines • Show All 649 Lines • ▼ Show 20 Lines
}		}

struct AMDGPUStreamManagerTy final		struct AMDGPUStreamManagerTy final
: GenericDeviceResourceManagerTy<AMDGPUResourceRef<AMDGPUStreamTy>> {		: GenericDeviceResourceManagerTy<AMDGPUResourceRef<AMDGPUStreamTy>> {
using ResourceRef = AMDGPUResourceRef<AMDGPUStreamTy>;		using ResourceRef = AMDGPUResourceRef<AMDGPUStreamTy>;
using ResourcePoolTy = GenericDeviceResourceManagerTy<ResourceRef>;		using ResourcePoolTy = GenericDeviceResourceManagerTy<ResourceRef>;

AMDGPUStreamManagerTy(GenericDeviceTy &Device, hsa_agent_t HSAAgent)		AMDGPUStreamManagerTy(GenericDeviceTy &Device, hsa_agent_t HSAAgent)
: GenericDeviceResourceManagerTy(Device), NextQueue(0), Agent(HSAAgent) {}		: GenericDeviceResourceManagerTy(Device),
		OMPX_QueueTracking("LIBOMPTARGET_AMDGPU_HSA_QUEUE_BUSY_TRACKING", true),
		NextQueue(0), Agent(HSAAgent) {}

Error init(uint32_t InitialSize, int NumHSAQueues, int HSAQueueSize) {		Error init(uint32_t InitialSize, int NumHSAQueues, int HSAQueueSize) {
Queues = std::vector<AMDGPUQueueTy>(NumHSAQueues);		Queues = std::vector<AMDGPUQueueTy>(NumHSAQueues);
QueueSize = HSAQueueSize;		QueueSize = HSAQueueSize;
MaxNumQueues = NumHSAQueues;		MaxNumQueues = NumHSAQueues;
// Initialize one queue eagerly		// Initialize one queue eagerly
		kevinsalaUnsubmitted Not Done Reply Inline Actions This shouldn't be needed. Envar objects load the value from the environment in their constructor. kevinsala: This shouldn't be needed. Envar objects load the value from the environment in their…
		mhalkAuthorUnsubmitted Done Reply Inline Actions Good catch, thanks! I wasn't aware of that. mhalk: Good catch, thanks! I wasn't aware of that.
if (auto Err = Queues.front().init(Agent, QueueSize))		if (auto Err = Queues.front().init(Agent, QueueSize))
return Err;		return Err;

return GenericDeviceResourceManagerTy::init(InitialSize);		return GenericDeviceResourceManagerTy::init(InitialSize);
}		}

/// Deinitialize the resource pool and delete all resources. This function		/// Deinitialize the resource pool and delete all resources. This function
/// must be called before the destructor.		/// must be called before the destructor.
Show All 19 Lines	virtual Error returnResource(AMDGPUStreamTy *StreamHandle) override {
return returnResourceImpl(StreamHandle, [](AMDGPUStreamTy *Handle) {		return returnResourceImpl(StreamHandle, [](AMDGPUStreamTy *Handle) {
Handle->Queue->removeUser();		Handle->Queue->removeUser();
return Plugin::success();		return Plugin::success();
});		});
}		}

private:		private:
/// Search for and assign an prefereably idle queue to the given Stream. If		/// Search for and assign an prefereably idle queue to the given Stream. If
/// there is no queue without current users, resort to round robin selection.		/// there is no queue without current users, choose the queue with the lowest
		/// user count. If utilization is ignored: use round robin selection.
inline Error assignNextQueue(AMDGPUStreamTy *Stream) {		inline Error assignNextQueue(AMDGPUStreamTy *Stream) {
uint32_t StartIndex = NextQueue % MaxNumQueues;		// Start from zero when tracking utilization, otherwise: round robin policy.
AMDGPUQueueTy *Q = nullptr;		uint32_t Index = OMPX_QueueTracking ? 0 : NextQueue++ % MaxNumQueues;

for (int i = 0; i < MaxNumQueues; ++i) {
Q = &Queues[StartIndex++];
if (StartIndex == MaxNumQueues)
StartIndex = 0;

if (Q->isBusy())		if (OMPX_QueueTracking) {
continue;		// Find the least used queue.
else {		for (uint32_t I = 0; I < MaxNumQueues; ++I) {
if (auto Err = Q->init(Agent, QueueSize))		// Early exit when an initialized queue is idle.
return Err;		if (Queues[I].isInitialized() && Queues[I].getUserCount() == 0) {
		Index = I;
		break;
		}

Q->addUser();		// Update the least used queue.
Stream->Queue = Q;		if (Queues[Index].getUserCount() > Queues[I].getUserCount())
return Plugin::success();		Index = I;
}		}
}		}

// All queues busy: Round robin (StartIndex has the initial value again)		// Make sure the queue is initialized, then add user & assign.
Queues[StartIndex].addUser();		if (auto Err = Queues[Index].init(Agent, QueueSize))
Stream->Queue = &Queues[StartIndex];		return Err;
++NextQueue;		Queues[Index].addUser();
		Stream->Queue = &Queues[Index];

return Plugin::success();		return Plugin::success();
}		}
		kevinsalaUnsubmitted Not Done Reply Inline Actions I rewrote the function trying to be a bit simplier. Conceptually, I think the only difference is that `NextQueue` is always increased. Would it make sense? inline Error assignNextQueue(AMDGPUStreamTy Stream) { uint32_t StartIndex = NextQueue % MaxNumQueues; if (OMPX_QueueTracking) { // Find the first idle queue. for (uint32_t I = 0; I < MaxNumQueues; ++I) { if (!Queues[StartIndex].isBusy()) break; StartIndex = (StartIndex + 1) % MaxNumQueues; } } // Try to initialize queue and increase its users. if (Queues[StartIndex].init(Agent, QueueSize)) return Err; Queues[StartIndex].addUser(); // Assign the queue. Stream->Queue = &Queues[StartIndex]; // Always move to the next queue. ++NextQueue; return Plugin::success(); } Also, in the case `OMPX_QueueTracking` is enabled, when no queues are idle, we could fallback to the queue with less users. The minimum can be computed while iterating the queues in that loop. kevinsala:* I rewrote the function trying to be a bit simplier. Conceptually, I think the only difference…
		mhalkAuthorUnsubmitted Done Reply Inline Actions Really like the reduced complexity. Now that `NextQueue` is always increased wouldn't that forfeit the purpose of handling busy queues? -- at least for the first `MaxNumQueues` Stream-requests. Because I think we will now always look at a Queue that has not been initialized and is therefore considered "idle"/not busy. I guess keeping `NextQueue` at zero until the maximum number of queues is actually initialized is important, unless we find sth. equivalent. I'll think about it a bit more. mhalk: Really like the reduced complexity. Now that `NextQueue` is always increased wouldn't that…
		kevinsalaUnsubmitted Not Done Reply Inline Actions I don't think updating `NextQueue` will have an impact on the tracking mechanism. For the tracking, `NextQueue` just indicates which queue to start looking at. But you will traverse the whole array of queues if we are not finding idle queues. So, in the first `MaxNumQueues` Stream-request, the mechanism will do the following: 1st request: NextQueue is 0 -> check queue 0 (idle) -> break 2nd request: NextQueue is 1 -> check queue 1 (idle) -> break 3rd request: NextQueue is 2 -> check queue 2 (idle) -> break ... In these first `MaxNumQueues` requests, we will always break at the first loop iteration because no one will be using the queue at`NextQueue` positions yet. Then, after those first requests, we will probably start having to iterate over the queues to find if any is idle. kevinsala: I don't think updating `NextQueue` will have an impact on the tracking mechanism. For the…
		mhalkAuthorUnsubmitted Done Reply Inline Actions Yes, but what if upon the 2nd request queue_0 is idle. Won't we initialize queue_1 albeit there is an already initialized one ready? Or do we not care about this? If so & should you have time -- please share some insight in this. mhalk: Yes, but what if upon the 2nd request queue_0 is idle. Won't we initialize queue_1 albeit there…
		kevinsalaUnsubmitted Not Done Reply Inline Actions True, it would make sense to re-use the already initialized one. That can be changed on the simplified function easily. I've some other doubts regarding falling back to round robin when there is no idle queue. Since the number of queues is very low, I think that improving the fallback (i.e., when all queues are busy) is important. The default number of queues is four, so we will already move to the fallback mechanism (simple round-robin) when having 4 streams working concurrently. For the same cost as now, we can chose the queue with the minimum number of users instead. I believe this could keep the users (more or less) well-balanced among queue. Does it make sense? kevinsala: True, it would make sense to re-use the already initialized one. That can be changed on the…
		mhalkAuthorUnsubmitted Done Reply Inline Actions Yes, that does make sense to me. When you say "for the same cost", what do you have in mind? I guess, I'll check if there are measurable drawbacks to round robin vs. "least contention", performance-wise. mhalk: Yes, that does make sense to me. When you say "for the same cost", what do you have in mind?

		/// Envar for controlling the tracking of busy HSA queues.
		BoolEnvar OMPX_QueueTracking;

/// The next queue index to use for round robin selection.		/// The next queue index to use for round robin selection.
uint32_t NextQueue;		uint32_t NextQueue;

/// The queues which are assigned to requested streams.		/// The queues which are assigned to requested streams.
std::vector<AMDGPUQueueTy> Queues;		std::vector<AMDGPUQueueTy> Queues;

/// The corresponding device as HSA agent.		/// The corresponding device as HSA agent.
hsa_agent_t Agent;		hsa_agent_t Agent;
▲ Show 20 Lines • Show All 1,588 Lines • Show Last 20 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][AMDGPU] Add Envar for controlling HSA busy queue trackingClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 547794

openmp/docs/design/Runtimes.rst

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

[OpenMP][AMDGPU] Add Envar for controlling HSA busy queue tracking
ClosedPublic