This is an archive of the discontinued LLVM Phabricator instance.

Differential D140264

[OpenMP] Improve AMDGPU Plugin
ClosedPublic

Authored by jdoerfert on Dec 17 2022, 12:49 PM.

Details

Reviewers

ye-luo
kevinsala
jhuber6
tianshilei1992
JonChesterfield
carlo.bertolli

Summary

With this patch we:

pick more sensible defaults for the number of teams, inspired by the old plugin, and configured via LIBOMPTARGET_AMDGPU_TEAMS_PER_CU.
check the input signal of a kernel launch late, after the queue lock was taken, to avoid a barrier packet more often.
copy the kernel arguments in one swoop into the appropriate memory.
manually specialize the callbacks to avoid potential indirect calls.
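
As a rough illustration of the first item, the default team count can be derived from the device's compute-unit count and a per-CU factor read from the environment. The sketch below is standalone: `parseUInt32OrDefault` and `defaultNumTeams` are hypothetical helpers (not the plugin's `UInt32Envar` class), and the compute-unit count is passed in rather than queried through HSA. The fallback factor of 4 matches the default given for `LIBOMPTARGET_AMDGPU_TEAMS_PER_CU` in the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: parse an unsigned decimal value from an environment
// string, falling back to Default when it is unset or malformed.
inline uint32_t parseUInt32OrDefault(const char *Value, uint32_t Default) {
  if (!Value || Value[0] < '0' || Value[0] > '9')
    return Default;
  char *End = nullptr;
  unsigned long Parsed = std::strtoul(Value, &End, 10);
  if (*End != '\0' || Parsed > UINT32_MAX)
    return Default;
  return static_cast<uint32_t>(Parsed);
}

// Default teams = teams-per-CU * number of compute units.
inline uint32_t defaultNumTeams(uint32_t ComputeUnits, const char *EnvValue) {
  assert(ComputeUnits > 0 && "device must report at least one compute unit");
  uint32_t TeamsPerCU = parseUInt32OrDefault(EnvValue, /*Default=*/4);
  return ComputeUnits * TeamsPerCU;
}
```

A caller would typically pass `std::getenv("LIBOMPTARGET_AMDGPU_TEAMS_PER_CU")` as `EnvValue`; taking the string as a parameter keeps the helper easy to test.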

Diff Detail

Event Timeline

jdoerfert created this revision. Dec 17 2022, 12:49 PM

Herald added a project: Restricted Project. Dec 17 2022, 12:49 PM

Herald added subscribers: kosarev, kerbowa, guansong and 6 others.

jdoerfert requested review of this revision. Dec 17 2022, 12:49 PM

Herald added a project: Restricted Project. Dec 17 2022, 12:49 PM

Herald added subscribers: llvm-commits, sstefan1, wdng.

Harbormaster completed remote builds in B203776: Diff 483774. Dec 17 2022, 12:50 PM

tianshilei1992 added inline comments. Dec 17 2022, 4:08 PM

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
117	How is the default value calculated here?

jdoerfert added inline comments. Dec 17 2022, 5:22 PM

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
117	IIRC, it's what CUB uses as max block count for big reductions. So it's probably a good number. I needed something.

kevinsala added inline comments. Dec 18 2022, 11:59 AM

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
69	nit
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
802	We could pass a specialized pointer to the arguments too:

if (auto Err = releaseSignalAction((ReleaseSignalArgsTy *) &ActionArgs))
  return Err;

This would remove the cast on all action functions.
805	I like this approach. But since we don't allow yet other actions than the pre-defined ones, calling unknown actions may hide future errors. I would return error if the action is not recognized.
1546	I think here we're still overwriting the value of max teams determined by `OMP_NUM_TEAMS` in the `GenericDeviceTy` constructor. The same for `GV_Max_WG_Size` and `OMP_TEAMS_THREAD_LIMIT`. To fix it, we could move the code below from the `GenericDeviceTy` constructor to the `GenericDeviceTy::init`:

Error GenericDeviceTy::init(GenericPluginTy &Plugin) {
  // NOTE: This call will initialize GV_Max_Teams and GV_Max_WG_Size with
  // device's maximum values.
  if (auto Err = initImpl(Plugin))
    return Err;

  // Moved code. Originally in GenericDeviceTy constructor.
  if (OMP_NumTeams > 0)
    GridValues.GV_Max_Teams =
        std::min(GridValues.GV_Max_Teams, uint32_t(OMP_NumTeams));

  if (OMP_TeamsThreadLimit > 0)
    GridValues.GV_Max_WG_Size =
        std::min(GridValues.GV_Max_WG_Size, uint32_t(OMP_TeamsThreadLimit));

jdoerfert added inline comments. Dec 19 2022, 11:54 AM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
802	The cast is a no-op, is it not? And the explicit one in your example is just implicitly done by the compiler in my code. `void* -> arg_ty *` is convertible.
805	That works too. We can require to list them, won't be too many.
1546	I don't follow completely. Can you make it a follow up?
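
The dispatch discussed for lines 802 and 805 can be sketched in isolation. This is a minimal illustration, not the plugin's code: `ErrorTy` is a string stand-in for `llvm::Error`, and `MemcpyArgsTy`, `ReleaseArgsTy`, and both action functions are hypothetical. It compares the stored function pointer against the known actions so each call is direct, and it reports an unknown action as an error, as suggested in the comment on line 805.

```cpp
#include <cassert>
#include <string>

// Stand-in for llvm::Error: an empty string means success.
using ErrorTy = std::string;
using ActionFnTy = ErrorTy (*)(void *);

// Hypothetical argument structs for two pre-defined actions.
struct MemcpyArgsTy { int *Dst; const int *Src; };
struct ReleaseArgsTy { bool *Released; };

// Each action receives a type-erased pointer and casts it back.
inline ErrorTy memcpyAction(void *Data) {
  auto *Args = static_cast<MemcpyArgsTy *>(Data);
  *Args->Dst = *Args->Src;
  return {};
}
inline ErrorTy releaseAction(void *Data) {
  auto *Args = static_cast<ReleaseArgsTy *>(Data);
  *Args->Released = true;
  return {};
}

// Compare the stored pointer against the known actions so that each call is
// a direct call; anything else is rejected instead of called indirectly.
inline ErrorTy performAction(ActionFnTy &ActionFunction, void *ActionArgs) {
  if (!ActionFunction)
    return {};
  ErrorTy Err;
  if (ActionFunction == memcpyAction)
    Err = memcpyAction(ActionArgs);
  else if (ActionFunction == releaseAction)
    Err = releaseAction(ActionArgs);
  else
    return "unknown action";
  if (!Err.empty())
    return Err;
  ActionFunction = nullptr; // Invalidate the action.
  return {};
}
```

Passing `&Args` of a concrete struct type to a `void *` parameter needs no explicit cast at the call site; the cast back to the concrete type happens inside each action.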

Looks fine if you address the nits. I don't like the number of TODOs increasing but we can address them later.

This revision is now accepted and ready to land. Dec 19 2022, 11:59 AM

kevinsala added inline comments. Dec 19 2022, 2:38 PM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

802

Yes, no performance difference. It was to pass directly the corresponding pointer type and improve readability on action functions. It's a detail with no importance.

1546

The version on main branch has a problem because we are ignoring the envars OMP_NUM_TEAMS and OMP_TEAMS_THREAD_LIMIT. It's not related to this patch, but we could fix it here, or in a separate patch. Basically, the current code is:

GenericDeviceTy::GenericDeviceTy(int32_t DeviceId, int32_t NumDevices,
                                 const llvm::omp::GV &OMPGridValues)
    : MemoryManager(nullptr), OMP_TeamLimit("OMP_TEAM_LIMIT"),
      OMP_NumTeams("OMP_NUM_TEAMS"),
      OMP_TeamsThreadLimit("OMP_TEAMS_THREAD_LIMIT"),
      /*...  */ {
  // Initialize GV_Max_Teams and GV_Max_WG_Size
  if (OMP_NumTeams > 0)
    GridValues.GV_Max_Teams =
        std::min(GridValues.GV_Max_Teams, uint32_t(OMP_NumTeams));

  if (OMP_TeamsThreadLimit > 0)
    GridValues.GV_Max_WG_Size =
        std::min(GridValues.GV_Max_WG_Size, uint32_t(OMP_TeamsThreadLimit));
}

Error GenericDeviceTy::init(GenericPluginTy &Plugin) {
  // Both CUDA and AMDGPU overwrite the previous values of GV_Max_Teams and
  // GV_Max_WG_Size, ignoring the values of OMP_NUM_TEAMS and OMP_TEAMS_THREAD_LIMIT
  if (auto Err = initImpl(Plugin))
    return Err;

  // ...
}

We initialize GV_Max_Teams and GV_Max_WG_Size considering the envars, but then, the plugin-specific code (in initImpl) overwrites them with the device maximums.
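
The ordering problem can be reproduced with a small standalone model. Everything here is hypothetical (`DeviceSketch`, `GridValuesTy`, and passing the hardware maxima as parameters); it only shows why clamping after the plugin-specific initialization keeps the environment limits, whereas clamping in the constructor would be overwritten.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct GridValuesTy {
  uint32_t MaxTeams = 0;
  uint32_t MaxWGSize = 0;
};

// Hypothetical device model. EnvNumTeams and EnvThreadLimit hold the parsed
// OMP_NUM_TEAMS and OMP_TEAMS_THREAD_LIMIT values, with 0 meaning "not set".
struct DeviceSketch {
  GridValuesTy GridValues;
  uint32_t EnvNumTeams;
  uint32_t EnvThreadLimit;

  DeviceSketch(uint32_t NumTeams, uint32_t ThreadLimit)
      : EnvNumTeams(NumTeams), EnvThreadLimit(ThreadLimit) {}

  // Stand-in for the plugin-specific initImpl: overwrites the grid values
  // with the device maximums.
  void initImpl(uint32_t HWMaxTeams, uint32_t HWMaxWGSize) {
    assert(HWMaxTeams > 0 && HWMaxWGSize > 0);
    GridValues.MaxTeams = HWMaxTeams;
    GridValues.MaxWGSize = HWMaxWGSize;
  }

  // Clamp only after initImpl has run, so the envars are not lost.
  void init(uint32_t HWMaxTeams, uint32_t HWMaxWGSize) {
    initImpl(HWMaxTeams, HWMaxWGSize);
    if (EnvNumTeams > 0)
      GridValues.MaxTeams = std::min(GridValues.MaxTeams, EnvNumTeams);
    if (EnvThreadLimit > 0)
      GridValues.MaxWGSize = std::min(GridValues.MaxWGSize, EnvThreadLimit);
  }
};
```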

kevinsala added inline comments. Dec 19 2022, 2:59 PM

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
2437	We should avoid copying the arguments if `NumKernelArgs` is zero.

LGTM, if we fix the copy of args when there aren't any.
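
A guarded single-copy helper along the lines requested for line 2437 might look as follows. This is a sketch under assumptions: `copyKernelArgs` is a hypothetical name, and the source values are assumed to be contiguous and pointer-sized, which is what a single `std::memcpy` of `sizeof(void *) * NumKernelArgs` bytes implies. The early return also avoids handing a possibly null pointer to `std::memcpy`, which has historically been undefined behavior even for a zero size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy NumKernelArgs pointer-sized arguments from Src to Dst in one call.
// Does nothing when there are no arguments, so Dst and Src may be null then.
inline void copyKernelArgs(void *Dst, const void *Src, int32_t NumKernelArgs) {
  assert(NumKernelArgs >= 0 && "negative argument count");
  if (NumKernelArgs == 0)
    return;
  assert(Dst && Src && "non-empty copy needs valid buffers");
  std::memcpy(Dst, Src, sizeof(void *) * static_cast<size_t>(NumKernelArgs));
}
```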

Committed in https://reviews.llvm.org/rGfb2c42df41cb01e1122fd4e9c81e1f4bc5592b12.

Revision Contents

Path	Size
llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h	6 lines
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp	58 lines
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h	3 lines
openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp	18 lines

Diff 483774

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h

(60 lines not shown)
struct GV {
  unsigned GV_Warp_Size;

  constexpr unsigned warpSlotSize() const {
    return GV_Warp_Size * GV_Slot_Size;
  }

  /// the maximum number of teams.
  unsigned GV_Max_Teams;
  // The default number teams
      kevinsala (inline comment, not done): nit. Suggested change:
          unsigned GV_Max_Teams;
        - // The default number teams
        + /// The default number of teams.
          unsigned GV_Default_Num_Teams;
  unsigned GV_Default_Num_Teams;
  // An alternative to the heavy data sharing infrastructure that uses global
  // memory is one that uses device __shared__ memory. The amount of such space
  // (in bytes) reserved by the OpenMP runtime is noted here.
  unsigned GV_SimpleBufferSize;
  // The absolute maximum team size for a working group
  unsigned GV_Max_WG_Size;
  // The default maximum team size for a working group
  unsigned GV_Default_WG_Size;

  constexpr unsigned maxWarpNumber() const {
    return GV_Max_WG_Size / GV_Warp_Size;
  }
};

/// For AMDGPU GPUs
static constexpr GV AMDGPUGridValues64 = {
    256,  // GV_Slot_Size
    64,   // GV_Warp_Size
    128,  // GV_Max_Teams
    440,  // GV_Default_Num_Teams
    896,  // GV_SimpleBufferSize
    1024, // GV_Max_WG_Size,
    256,  // GV_Default_WG_Size
};

static constexpr GV AMDGPUGridValues32 = {
    256,  // GV_Slot_Size
    32,   // GV_Warp_Size
    128,  // GV_Max_Teams
    440,  // GV_Default_Num_Teams
    896,  // GV_SimpleBufferSize
    1024, // GV_Max_WG_Size,
    256,  // GV_Default_WG_Size
};

template <unsigned wavesize> constexpr const GV &getAMDGPUGridValues() {
  static_assert(wavesize == 32 || wavesize == 64, "Unexpected wavesize");
  return wavesize == 32 ? AMDGPUGridValues32 : AMDGPUGridValues64;
}

/// For Nvidia GPUs
static constexpr GV NVPTXGridValues = {
    256,  // GV_Slot_Size
    32,   // GV_Warp_Size
    1024, // GV_Max_Teams
    3200, // GV_Default_Num_Teams
      tianshilei1992 (inline comment, not done): How is the default value calculated here?
      jdoerfert (inline comment, done): IIRC, it's what CUB uses as max block count for big reductions. So it's probably a good number. I needed something.
    896,  // GV_SimpleBufferSize
    1024, // GV_Max_WG_Size
    128,  // GV_Default_WG_Size
};

} // namespace omp
} // namespace llvm

#endif // LLVM_FRONTEND_OPENMP_OMPGRIDVALUES_H

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp

(115 lines not shown)
template <typename ResourceTy>		template <typename ResourceTy>
struct AMDGPUResourceRef : public GenericDeviceResourceRef {		struct AMDGPUResourceRef : public GenericDeviceResourceRef {
/// Create an empty reference to an invalid resource.		/// Create an empty reference to an invalid resource.
AMDGPUResourceRef() : Resource(nullptr) {}		AMDGPUResourceRef() : Resource(nullptr) {}

/// Create a reference to an existing resource.		/// Create a reference to an existing resource.
AMDGPUResourceRef(ResourceTy *Resource) : Resource(Resource) {}		AMDGPUResourceRef(ResourceTy *Resource) : Resource(Resource) {}

		virtual ~AMDGPUResourceRef() {}

/// Create a new resource and save the reference. The reference must be empty		/// Create a new resource and save the reference. The reference must be empty
/// before calling to this function.		/// before calling to this function.
Error create(GenericDeviceTy &Device) override;		Error create(GenericDeviceTy &Device) override;

/// Destroy the referenced resource and invalidate the reference. The		/// Destroy the referenced resource and invalidate the reference. The
/// reference must be to a valid event before calling to this function.		/// reference must be to a valid event before calling to this function.
Error destroy(GenericDeviceTy &Device) override {		Error destroy(GenericDeviceTy &Device) override {
if (!Resource)		if (!Resource)
(403 lines not shown)	Error pushKernelLaunch(const AMDGPUKernelTy &Kernel, void *KernelArgs,
AMDGPUSignalTy *InputSignal) {		AMDGPUSignalTy *InputSignal) {
assert(OutputSignal && "Invalid kernel output signal");		assert(OutputSignal && "Invalid kernel output signal");

// Lock the queue during the packet publishing process. Notice this blocks		// Lock the queue during the packet publishing process. Notice this blocks
// the addition of other packets to the queue. The following piece of code		// the addition of other packets to the queue. The following piece of code
// should be lightweight; do not block the thread, allocate memory, etc.		// should be lightweight; do not block the thread, allocate memory, etc.
std::lock_guard<std::mutex> Lock(Mutex);		std::lock_guard<std::mutex> Lock(Mutex);

		// Avoid defining the input dependency if already satisfied.
		if (InputSignal && !InputSignal->load())
		InputSignal = nullptr;

// Add a barrier packet before the kernel packet in case there is a pending		// Add a barrier packet before the kernel packet in case there is a pending
// preceding operation. The barrier packet will delay the processing of		// preceding operation. The barrier packet will delay the processing of
// subsequent queue's packets until the barrier input signal are satisfied.		// subsequent queue's packets until the barrier input signal are satisfied.
// No need output signal needed because the dependency is already guaranteed		// No need output signal needed because the dependency is already guaranteed
// by the queue barrier itself.		// by the queue barrier itself.
if (InputSignal)		if (InputSignal)
if (auto Err = pushBarrierImpl(nullptr, InputSignal))		if (auto Err = pushBarrierImpl(nullptr, InputSignal))
return Err;		return Err;
(230 lines not shown)	struct StreamSlotTy {
}		}

// Perform the action if needed.		// Perform the action if needed.
Error performAction() {		Error performAction() {
if (!ActionFunction)		if (!ActionFunction)
return Plugin::success();		return Plugin::success();

// Perform the action.		// Perform the action.
		if (ActionFunction == memcpyAction) {
		if (auto Err = memcpyAction(&ActionArgs))
		return Err;
		} else if (ActionFunction == releaseBufferAction) {
		if (auto Err = releaseBufferAction(&ActionArgs))
		return Err;
		} else if (ActionFunction == releaseSignalAction) {
		if (auto Err = releaseSignalAction(&ActionArgs))
		(inline comments by kevinsala and jdoerfert; see the Event Timeline entries for line 802)
		return Err;
		} else {
if (auto Err = (*ActionFunction)(&ActionArgs))		if (auto Err = (*ActionFunction)(&ActionArgs))
		(inline comments by kevinsala and jdoerfert; see the Event Timeline entries for line 805)
return Err;		return Err;
		}

// Invalidate the action.		// Invalidate the action.
ActionFunction = nullptr;		ActionFunction = nullptr;

return Plugin::success();		return Plugin::success();
}		}
};		};

(186 lines not shown)	Error pushKernelLaunch(const AMDGPUKernelTy &Kernel, void *KernelArgs,
OutputSignal->reset();		OutputSignal->reset();
OutputSignal->increaseUseCount();		OutputSignal->increaseUseCount();

std::lock_guard<std::mutex> StreamLock(Mutex);		std::lock_guard<std::mutex> StreamLock(Mutex);

// Consume stream slot and compute dependencies.		// Consume stream slot and compute dependencies.
auto [Curr, InputSignal] = consume(OutputSignal);		auto [Curr, InputSignal] = consume(OutputSignal);

// Avoid defining the input dependency if already satisfied.
if (InputSignal && !InputSignal->load())
InputSignal = nullptr;

// Setup the post action to release the kernel args buffer.		// Setup the post action to release the kernel args buffer.
if (auto Err = Slots[Curr].schedReleaseBuffer(KernelArgs, MemoryManager))		if (auto Err = Slots[Curr].schedReleaseBuffer(KernelArgs, MemoryManager))
return Err;		return Err;

// Push the kernel with the output signal and an input signal (optional)		// Push the kernel with the output signal and an input signal (optional)
return Queue.pushKernelLaunch(Kernel, KernelArgs, NumThreads, NumBlocks,		return Queue.pushKernelLaunch(Kernel, KernelArgs, NumThreads, NumBlocks,
OutputSignal, InputSignal);		OutputSignal, InputSignal);
}		}
(475 lines not shown)

/// Class implementing the AMDGPU device functionalities which derives from the		/// Class implementing the AMDGPU device functionalities which derives from the
/// generic device class.		/// generic device class.
struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {		struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
// Create an AMDGPU device with a device id and default AMDGPU grid values.		// Create an AMDGPU device with a device id and default AMDGPU grid values.
AMDGPUDeviceTy(int32_t DeviceId, int32_t NumDevices,		AMDGPUDeviceTy(int32_t DeviceId, int32_t NumDevices,
AMDHostDeviceTy &HostDevice, hsa_agent_t Agent)		AMDHostDeviceTy &HostDevice, hsa_agent_t Agent)
: GenericDeviceTy(DeviceId, NumDevices, {0}), AMDGenericDeviceTy(),		: GenericDeviceTy(DeviceId, NumDevices, {0}), AMDGenericDeviceTy(),
OMPX_NumQueues("LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES", 8),		OMPX_NumQueues("LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES", 4),
OMPX_QueueSize("LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE", 1024),		OMPX_QueueSize("LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE", 512),
		OMPX_DefaultTeamsPerCU("LIBOMPTARGET_AMDGPU_TEAMS_PER_CU", 4),
OMPX_MaxAsyncCopyBytes("LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES",		OMPX_MaxAsyncCopyBytes("LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES",
1 * 1024 * 1024), // 1MB		1 * 1024 * 1024), // 1MB
OMPX_InitialNumSignals("LIBOMPTARGET_AMDGPU_NUM_INITIAL_HSA_SIGNALS",		OMPX_InitialNumSignals("LIBOMPTARGET_AMDGPU_NUM_INITIAL_HSA_SIGNALS",
64),		64),
AMDGPUStreamManager(this), AMDGPUEventManager(this),		AMDGPUStreamManager(this), AMDGPUEventManager(this),
AMDGPUSignalManager(*this), Agent(Agent), HostDevice(HostDevice),		AMDGPUSignalManager(*this), Agent(Agent), HostDevice(HostDevice),
Queues() {}		Queues() {}

(25 lines not shown)	if (auto Err =
getDeviceAttr(HSA_AGENT_INFO_WORKGROUP_MAX_DIM, WorkgroupMaxDim))		getDeviceAttr(HSA_AGENT_INFO_WORKGROUP_MAX_DIM, WorkgroupMaxDim))
return Err;		return Err;
GridValues.GV_Max_WG_Size = WorkgroupMaxDim[0];		GridValues.GV_Max_WG_Size = WorkgroupMaxDim[0];

// Get maximum number of workgroups.		// Get maximum number of workgroups.
hsa_dim3_t GridMaxDim;		hsa_dim3_t GridMaxDim;
if (auto Err = getDeviceAttr(HSA_AGENT_INFO_GRID_MAX_DIM, GridMaxDim))		if (auto Err = getDeviceAttr(HSA_AGENT_INFO_GRID_MAX_DIM, GridMaxDim))
return Err;		return Err;

GridValues.GV_Max_Teams = GridMaxDim.x / GridValues.GV_Max_WG_Size;		GridValues.GV_Max_Teams = GridMaxDim.x / GridValues.GV_Max_WG_Size;
		(inline comments by kevinsala and jdoerfert; see the Event Timeline entries for line 1546)
if (GridValues.GV_Max_Teams == 0)		if (GridValues.GV_Max_Teams == 0)
return Plugin::error("Maximum number of teams cannot be zero");		return Plugin::error("Maximum number of teams cannot be zero");

		// Compute the default number of teams.
		uint32_t ComputeUnits = 0;
		if (auto Err =
		getDeviceAttr(HSA_AMD_AGENT_INFO_COMPUTE_UNIT_COUNT, ComputeUnits))
		return Err;
		GridValues.GV_Default_Num_Teams = ComputeUnits * OMPX_DefaultTeamsPerCU;

// Get maximum size of any device queues and maximum number of queues.		// Get maximum size of any device queues and maximum number of queues.
uint32_t MaxQueueSize;		uint32_t MaxQueueSize;
if (auto Err = getDeviceAttr(HSA_AGENT_INFO_QUEUE_MAX_SIZE, MaxQueueSize))		if (auto Err = getDeviceAttr(HSA_AGENT_INFO_QUEUE_MAX_SIZE, MaxQueueSize))
return Err;		return Err;

uint32_t MaxQueues;		uint32_t MaxQueues;
if (auto Err = getDeviceAttr(HSA_AGENT_INFO_QUEUES_MAX, MaxQueues))		if (auto Err = getDeviceAttr(HSA_AGENT_INFO_QUEUES_MAX, MaxQueues))
return Err;		return Err;
(466 lines not shown)	private:
UInt32Envar OMPX_NumQueues;		UInt32Envar OMPX_NumQueues;

/// Envar for controlling the size of each HSA queue. The size is the number		/// Envar for controlling the size of each HSA queue. The size is the number
/// of HSA packets a queue is expected to hold. It is also the number of HSA		/// of HSA packets a queue is expected to hold. It is also the number of HSA
/// packets that can be pushed into each queue without waiting the driver to		/// packets that can be pushed into each queue without waiting the driver to
/// process them.		/// process them.
UInt32Envar OMPX_QueueSize;		UInt32Envar OMPX_QueueSize;

		/// Envar for controlling the default number of teams relative to the number
		/// of compute units (CUs) the device has:
		/// #default_teams = OMPX_DefaultTeamsPerCU * #CUs.
		UInt32Envar OMPX_DefaultTeamsPerCU;

/// Envar specifying the maximum size in bytes where the memory copies are		/// Envar specifying the maximum size in bytes where the memory copies are
/// asynchronous operations. Up to this transfer size, the memory copies are		/// asynchronous operations. Up to this transfer size, the memory copies are
/// asychronous operations pushed to the corresponding stream. For larger		/// asychronous operations pushed to the corresponding stream. For larger
/// transfers, they are synchronous transfers.		/// transfers, they are synchronous transfers.
UInt32Envar OMPX_MaxAsyncCopyBytes;		UInt32Envar OMPX_MaxAsyncCopyBytes;

/// Envar controlling the initial number of HSA signals per device. There is		/// Envar controlling the initial number of HSA signals per device. There is
/// one manager of signals per device managing several pre-allocated signals.		/// one manager of signals per device managing several pre-allocated signals.
(196 lines not shown)	auto Err = utils::iterateAgents([&](hsa_agent_t Agent) {
hsa_status_t Status =		hsa_status_t Status =
hsa_agent_get_info(Agent, HSA_AGENT_INFO_DEVICE, &DeviceType);		hsa_agent_get_info(Agent, HSA_AGENT_INFO_DEVICE, &DeviceType);
if (Status != HSA_STATUS_SUCCESS)		if (Status != HSA_STATUS_SUCCESS)
return Status;		return Status;

// Classify the agents into kernel (GPU) and host (CPU) kernels.		// Classify the agents into kernel (GPU) and host (CPU) kernels.
if (DeviceType == HSA_DEVICE_TYPE_GPU) {		if (DeviceType == HSA_DEVICE_TYPE_GPU) {
// Ensure that the GPU agent supports kernel dispatch packets.		// Ensure that the GPU agent supports kernel dispatch packets.
hsa_agent_feature_t features;		hsa_agent_feature_t Features;
Status = hsa_agent_get_info(Agent, HSA_AGENT_INFO_FEATURE, &features);		Status = hsa_agent_get_info(Agent, HSA_AGENT_INFO_FEATURE, &Features);
if (features & HSA_AGENT_FEATURE_KERNEL_DISPATCH)		if (Features & HSA_AGENT_FEATURE_KERNEL_DISPATCH)
KernelAgents.push_back(Agent);		KernelAgents.push_back(Agent);
} else if (DeviceType == HSA_DEVICE_TYPE_CPU) {		} else if (DeviceType == HSA_DEVICE_TYPE_CPU) {
HostAgents.push_back(Agent);		HostAgents.push_back(Agent);
}		}
return HSA_STATUS_SUCCESS;		return HSA_STATUS_SUCCESS;
});		});

if (Err)		if (Err)
(160 lines not shown)	Error AMDGPUKernelTy::launchImpl(GenericDeviceTy &GenericDevice,
utils::AMDGPUImplicitArgsTy *ImplArgs =		utils::AMDGPUImplicitArgsTy *ImplArgs =
reinterpret_cast<utils::AMDGPUImplicitArgsTy *>(		reinterpret_cast<utils::AMDGPUImplicitArgsTy *>(
static_cast<char *>(AllArgs) + KernelArgsSize);		static_cast<char *>(AllArgs) + KernelArgsSize);

// Initialize the implicit arguments to zero.		// Initialize the implicit arguments to zero.
std::memset(ImplArgs, 0, ImplicitArgsSize);		std::memset(ImplArgs, 0, ImplicitArgsSize);

// Copy the explicit arguments.		// Copy the explicit arguments.
for (int32_t ArgId = 0; ArgId < NumKernelArgs; ++ArgId) {		// TODO: We should expose the args memory manager alloc to the common part as
void *Dst = (char *)AllArgs + sizeof(void *) * ArgId;		// alternative to copying them twice.
void *Src = *((void **)KernelArgs + ArgId);		std::memcpy(AllArgs, *static_cast<void **>(KernelArgs),
		(inline comment by kevinsala; see the Event Timeline entry for line 2437)
std::memcpy(Dst, Src, sizeof(void *));		sizeof(void *) * NumKernelArgs);
}

AMDGPUDeviceTy &AMDGPUDevice = static_cast<AMDGPUDeviceTy &>(GenericDevice);		AMDGPUDeviceTy &AMDGPUDevice = static_cast<AMDGPUDeviceTy &>(GenericDevice);
AMDGPUStreamTy &Stream = AMDGPUDevice.getStream(AsyncInfoWrapper);		AMDGPUStreamTy &Stream = AMDGPUDevice.getStream(AsyncInfoWrapper);

// Push the kernel launch into the stream.		// Push the kernel launch into the stream.
return Stream.pushKernelLaunch(*this, AllArgs, NumThreads, NumBlocks,		return Stream.pushKernelLaunch(*this, AllArgs, NumThreads, NumBlocks,
ArgsMemoryManager);		ArgsMemoryManager);
}		}
(101 lines not shown)
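
The late input-signal check from the summary can be modelled without HSA. The sketch below is an assumption-laden illustration: `SignalSketch` and `QueueSketch` are hypothetical, packets are only counted rather than published, and a signal value of zero is taken to mean "already satisfied", matching the `!InputSignal->load()` test in the diff. Checking under the queue lock, right before the barrier decision, gives the producer as much time as possible to finish, so fewer barrier packets are needed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical signal: a non-zero value means the producer is still pending.
struct SignalSketch {
  std::atomic<int64_t> Value{1};
  int64_t load() const { return Value.load(std::memory_order_acquire); }
  void signal() { Value.store(0, std::memory_order_release); }
};

// Hypothetical queue that only counts the packets it would publish.
struct QueueSketch {
  std::mutex Mutex;
  unsigned NumBarrierPackets = 0;
  unsigned NumKernelPackets = 0;

  void pushKernelLaunch(SignalSketch *InputSignal) {
    std::lock_guard<std::mutex> Lock(Mutex);

    // Check as late as possible: drop the dependency if already satisfied.
    if (InputSignal && !InputSignal->load())
      InputSignal = nullptr;

    // A barrier packet is only needed for a still-pending dependency.
    if (InputSignal)
      ++NumBarrierPackets;
    ++NumKernelPackets;
  }
};
```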

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.h

(366 lines not shown)	struct GenericDeviceTy : public DeviceAllocatorTy {
/// Getters of the grid values.		/// Getters of the grid values.
uint32_t getWarpSize() const { return GridValues.GV_Warp_Size; }		uint32_t getWarpSize() const { return GridValues.GV_Warp_Size; }
uint32_t getThreadLimit() const { return GridValues.GV_Max_WG_Size; }		uint32_t getThreadLimit() const { return GridValues.GV_Max_WG_Size; }
uint64_t getBlockLimit() const { return GridValues.GV_Max_Teams; }		uint64_t getBlockLimit() const { return GridValues.GV_Max_Teams; }
uint32_t getDefaultNumThreads() const {		uint32_t getDefaultNumThreads() const {
return GridValues.GV_Default_WG_Size;		return GridValues.GV_Default_WG_Size;
}		}
uint64_t getDefaultNumBlocks() const {		uint64_t getDefaultNumBlocks() const {
// TODO: Introduce a default num blocks value.		return GridValues.GV_Default_Num_Teams;
return GridValues.GV_Default_WG_Size;
}		}
uint32_t getDynamicMemorySize() const { return OMPX_SharedMemorySize; }		uint32_t getDynamicMemorySize() const { return OMPX_SharedMemorySize; }

private:		private:
/// Register offload entry for global variable.		/// Register offload entry for global variable.
Error registerGlobalOffloadEntry(DeviceImageTy &DeviceImage,		Error registerGlobalOffloadEntry(DeviceImageTy &DeviceImage,
const __tgt_offload_entry &GlobalEntry,		const __tgt_offload_entry &GlobalEntry,
__tgt_offload_entry &DeviceEntry);		__tgt_offload_entry &DeviceEntry);
(436 lines not shown)

openmp/libomptarget/plugins-nextgen/common/PluginInterface/PluginInterface.cpp

(94 lines not shown)	uint32_t GenericKernelTy::getNumThreads(GenericDeviceTy &GenericDevice,
return std::min(MaxNumThreads, (ThreadLimitClause > 0) ? ThreadLimitClause		return std::min(MaxNumThreads, (ThreadLimitClause > 0) ? ThreadLimitClause
: PreferredNumThreads);		: PreferredNumThreads);
}		}

uint64_t GenericKernelTy::getNumBlocks(GenericDeviceTy &GenericDevice,		uint64_t GenericKernelTy::getNumBlocks(GenericDeviceTy &GenericDevice,
uint64_t NumTeamsClause,		uint64_t NumTeamsClause,
uint64_t LoopTripCount,		uint64_t LoopTripCount,
uint32_t NumThreads) const {		uint32_t NumThreads) const {
uint64_t PreferredNumBlocks = getDefaultNumBlocks(GenericDevice);
if (NumTeamsClause > 0) {		if (NumTeamsClause > 0) {
PreferredNumBlocks = NumTeamsClause;		// TODO: We need to honor any value and consequently allow more than the
} else if (LoopTripCount > 0) {		// block limit. For this we might need to start multiple kernels or let the
		// blocks start again until the requested number has been started.
		return std::min(NumTeamsClause, GenericDevice.getBlockLimit());
		}

		uint64_t TripCountNumBlocks = std::numeric_limits<uint64_t>::max();
		if (LoopTripCount > 0) {
if (isSPMDMode()) {		if (isSPMDMode()) {
// We have a combined construct, i.e. `target teams distribute		// We have a combined construct, i.e. `target teams distribute
// parallel for [simd]`. We launch so many teams so that each thread		// parallel for [simd]`. We launch so many teams so that each thread
// will execute one iteration of the loop. round up to the nearest		// will execute one iteration of the loop. round up to the nearest
// integer		// integer
PreferredNumBlocks = ((LoopTripCount - 1) / NumThreads) + 1;		TripCountNumBlocks = ((LoopTripCount - 1) / NumThreads) + 1;
} else {		} else {
assert((isGenericMode() || isGenericSPMDMode()) &&		assert((isGenericMode() || isGenericSPMDMode()) &&
"Unexpected execution mode!");		"Unexpected execution mode!");
// If we reach this point, then we have a non-combined construct, i.e.		// If we reach this point, then we have a non-combined construct, i.e.
// `teams distribute` with a nested `parallel for` and each team is		// `teams distribute` with a nested `parallel for` and each team is
// assigned one iteration of the `distribute` loop. E.g.:		// assigned one iteration of the `distribute` loop. E.g.:
//		//
// #pragma omp target teams distribute		// #pragma omp target teams distribute
// for(...loop_tripcount...) {		// for(...loop_tripcount...) {
// #pragma omp parallel for		// #pragma omp parallel for
// for(...) {}		// for(...) {}
// }		// }
//		//
// Threads within a team will execute the iterations of the `parallel`		// Threads within a team will execute the iterations of the `parallel`
// loop.		// loop.
PreferredNumBlocks = LoopTripCount;		TripCountNumBlocks = LoopTripCount;
}		}
}		}
		// If the loops are long running we rather reuse blocks than spawn too many.
		uint64_t PreferredNumBlocks =
		std::min(TripCountNumBlocks, getDefaultNumBlocks(GenericDevice));
return std::min(PreferredNumBlocks, GenericDevice.getBlockLimit());		return std::min(PreferredNumBlocks, GenericDevice.getBlockLimit());
}		}

GenericDeviceTy::GenericDeviceTy(int32_t DeviceId, int32_t NumDevices,		GenericDeviceTy::GenericDeviceTy(int32_t DeviceId, int32_t NumDevices,
const llvm::omp::GV &OMPGridValues)		const llvm::omp::GV &OMPGridValues)
: MemoryManager(nullptr), OMP_TeamLimit("OMP_TEAM_LIMIT"),		: MemoryManager(nullptr), OMP_TeamLimit("OMP_TEAM_LIMIT"),
OMP_NumTeams("OMP_NUM_TEAMS"),		OMP_NumTeams("OMP_NUM_TEAMS"),
OMP_TeamsThreadLimit("OMP_TEAMS_THREAD_LIMIT"),		OMP_TeamsThreadLimit("OMP_TEAMS_THREAD_LIMIT"),
(777 lines not shown)
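
The new block-count selection on the right-hand side of the `getNumBlocks` diff can be restated as a free function for experimentation. This is a sketch, not the plugin's API: `pickNumBlocks` is a hypothetical name, and the execution mode, default block count, and block limit are passed as plain parameters instead of being read from the kernel and device objects.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

inline uint64_t pickNumBlocks(uint64_t NumTeamsClause, uint64_t LoopTripCount,
                              uint32_t NumThreads, bool IsSPMD,
                              uint64_t DefaultNumBlocks, uint64_t BlockLimit) {
  assert(NumThreads > 0 && "need at least one thread per block");

  // An explicit num_teams clause wins, capped by the device block limit.
  if (NumTeamsClause > 0)
    return std::min(NumTeamsClause, BlockLimit);

  // Without a known trip count the default alone decides.
  uint64_t TripCountNumBlocks = std::numeric_limits<uint64_t>::max();
  if (LoopTripCount > 0) {
    // SPMD: one iteration per thread, rounded up to whole blocks.
    // Otherwise: one iteration of the distribute loop per block.
    TripCountNumBlocks =
        IsSPMD ? ((LoopTripCount - 1) / NumThreads) + 1 : LoopTripCount;
  }

  // For long loops, reuse blocks rather than spawn more than the default.
  uint64_t Preferred = std::min(TripCountNumBlocks, DefaultNumBlocks);
  return std::min(Preferred, BlockLimit);
}
```

With 256 threads, a default of 440 blocks, and a limit of 1024, a 1000-iteration SPMD loop gets 4 blocks, while a 1,000,000-iteration loop is capped at the default of 440.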
