This is an archive of the discontinued LLVM Phabricator instance.

[libomptarget] Tune the number of teams and threads for kernel launch.
Needs Review · Public

Authored by dhruvachak on Mar 17 2021, 5:52 PM.

Details

Summary

Change the default number of teams.

Based on kernel register usage, adjust the number of threads in a team.

Includes a corner case fix.

This change is dependent on https://reviews.llvm.org/D98829


Event Timeline

dhruvachak created this revision. · Mar 17 2021, 5:52 PM
dhruvachak requested review of this revision. · Mar 17 2021, 5:52 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald Transcript · Mar 17 2021, 5:52 PM

This is really interesting. The idea seems to be to choose the dispatch parameters based on the kernel metadata and the limits of the machine.

What's the underlying heuristic? Break across N CUs in chunks that match the occupancy limits of each CU?

If so we probably want to compare LDS usage as well to avoid partitioning poorly for that.

Maybe others - there might be a performance cliff on amount of private memory too.

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
102

Side point, there is too much redundancy in this table of numbers (e.g. the log2 fields) and warp_size_32 = 32 looks suspect

This is really interesting. The idea seems to be to choose the dispatch parameters based on the kernel metadata and the limits of the machine.

What's the underlying heuristic? Break across N CUs in chunks that match the occupancy limits of each CU?

Yes, that's the idea.

If so we probably want to compare LDS usage as well to avoid partitioning poorly for that.

Maybe others - there might be a performance cliff on amount of private memory too.

Agreed. However, I don't see LDS usage in the metadata table in the image. Is it present there?

In theory, a very high sgpr count can limit the number of available workgroups if that's not factored in for determining the number of threads. But in practice, VGPRs tend to be the primary limiting factor. So perhaps we can start with using VGPRs for this purpose and have experience guide us in the future.
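
For example, a minimal sketch of the VGPR-limited part; the per-SIMD VGPR budget and the wave cap below are placeholder constants for illustration, not values taken from this patch:

#include <algorithm>

// How many waves fit per SIMD given the kernel's per-thread VGPR count.
// Both limits are illustrative placeholders.
unsigned vgprLimitedWaves(unsigned KernelVGPRs) {
  constexpr unsigned VGPRBudgetPerSIMD = 256; // placeholder register budget
  constexpr unsigned ArchWaveCap = 8;         // placeholder architectural cap
  if (KernelVGPRs == 0)
    return ArchWaveCap;
  return std::max(1u, std::min(VGPRBudgetPerSIMD / KernelVGPRs, ArchWaveCap));
}

A thread limit per team could then be derived from this wave count times the wavefront size, with LDS/SGPR terms added later.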

Could you upload patches with full context please

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
91

Vector registers? Like xmm? Or registers in general?

Added full context to the updated patch.

Could you upload patches with full context please

Updated with the full context.

Like xmm. Here in particular, I am referring to the vector register file of a GPU.

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
91

Like xmm. Here in particular, I am referring to the vector register file of a GPU.

...
Agreed. However, I don't see LDS usage in the metadata table in the image. Is it present there?

Yes, see https://llvm.org/docs/AMDGPUUsage.html for the list of what we can expect. What may not be obvious is that the metadata calls it ".group_segment_fixed_size". I don't know the origin of the terminology, maybe opencl?

In theory, a very high sgpr count can limit the number of available workgroups if that's not factored in for determining the number of threads. But in practice, VGPRs tend to be the primary limiting factor. So perhaps we can start with using VGPRs for this purpose and have experience guide us in the future.

If I understand correctly, occupancy rules all look something like (resource available / resource used) == number simultaneous, where one of the resources tends to be limiting. Offhand, I think that's VGPR, SGPR, LDS (group segment). I think there's also an architecture-dependent upper bound on how many things can run at once even if they use very little of those, maybe 8 for gfx9 and 16 for gfx10.

If that's right, perhaps the calculation should look something like:

unsigned vgpr_occupancy = vgpr_available / vgpr_used;
unsigned sgpr_occupancy = sgpr_available / sgpr_used;
unsigned lds_occupancy = lds_available / lds_used;
unsigned limiting_occupancy = std::min({vgpr_occupancy, sgpr_occupancy, lds_occupancy});

and then we derive threadsPerGroup from that occupancy and the various other considerations.
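
As a made-up worked example of that, with placeholder per-CU budgets and the divisions done as available/used so the results are wave counts:

#include <algorithm>
#include <cstdio>

int main() {
  // Placeholder budgets and kernel usage, not real hardware values.
  unsigned vgpr_available = 256, vgpr_used = 64;    // 256/64  -> 4
  unsigned sgpr_available = 800, sgpr_used = 100;   // 800/100 -> 8
  unsigned lds_available = 65536, lds_used = 16384; // 64k/16k -> 4
  unsigned arch_wave_cap = 8;                       // architectural upper bound

  unsigned limiting_occupancy =
      std::min({vgpr_available / vgpr_used, sgpr_available / sgpr_used,
                lds_available / lds_used, arch_wave_cap});
  std::printf("limiting occupancy: %u\n", limiting_occupancy); // prints 4
}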

openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
823

This looks like a drive by copy/paste error fix, maybe post that separately?

If you're currently uploading diffs through the GUI (based on the missing-context comment), that's quite labour intensive. If you change to arcanist, the flow becomes:

git checkout main
git checkout -b some_feature
...edit
git add -u && git commit -m "message"
arc diff main # opens an editor

...
Agreed. However, I don't see LDS usage in the metadata table in the image. Is it present there?

Yes, see https://llvm.org/docs/AMDGPUUsage.html for the list of what we can expect. What may not be obvious is that the metadata calls it ".group_segment_fixed_size". I don't know the origin of the terminology, maybe opencl?

In theory, a very high sgpr count can limit the number of available workgroups if that's not factored in for determining the number of threads. But in practice, VGPRs tend to be the primary limiting factor. So perhaps we can start with using VGPRs for this purpose and have experience guide us in the future.

If I understand correctly, occupancy rules all look something like (resource available / resource used) == number simultaneous, where one of the resources tends to be limiting. Offhand, I think that's VGPR, SGPR, LDS (group segment). I think there's also an architecture-dependent upper bound on how many things can run at once even if they use very little of those, maybe 8 for gfx9 and 16 for gfx10.

If that's right, perhaps the calculation should look something like:

unsigned vgpr_occupancy = vgpr_available / vgpr_used;
unsigned sgpr_occupancy = sgpr_available / sgpr_used;
unsigned lds_occupancy = lds_available / lds_used;
unsigned limiting_occupancy = std::min({vgpr_occupancy, sgpr_occupancy, lds_occupancy});

and then we derive threadsPerGroup from that occupancy and the various other considerations.

Thanks for the pointer to the group segment. Yes, in general, my idea is similar to what you outlined above. However, note that SGPRs and LDS are at different granularities compared to VGPRs. VGPRs are per-thread, SGPRs are shared within a wavefront, and LDS is shared within a workgroup. So while VGPRs can be used to limit the number of threads, perhaps SGPRs and LDS can be used to limit the number of teams.

Let me split up this patch further. I would like to land the default num_teams change sooner rather than later since that's a simple change and has shown improved performance. So let me separate that out. Incorporating SGPRs/LDS to constrain teams/threads will need more experimentation.
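
As a rough sketch of that split, with VGPRs constraining the team size and LDS constraining the number of resident teams; every name and constant below is a placeholder for illustration, not code from this patch:

#include <algorithm>

// Illustrative device limits; the real values would come from HSA queries and
// the kernel's code object metadata.
struct DeviceLimits {
  unsigned VGPRsPerSIMD = 256;
  unsigned LDSBytesPerCU = 65536;
  unsigned WaveSize = 64;
  unsigned MaxWavesPerCU = 32;
};

// VGPRs are per-thread, so they bound how many threads we put in a team.
unsigned threadLimit(const DeviceLimits &D, unsigned KernelVGPRs, unsigned Default) {
  if (KernelVGPRs == 0)
    return Default;
  unsigned Waves = std::max(1u, std::min(D.VGPRsPerSIMD / KernelVGPRs, D.MaxWavesPerCU));
  return std::min(Default, Waves * D.WaveSize);
}

// LDS is shared per workgroup, so it bounds how many teams stay resident on a CU.
// An SGPR term would be analogous, at wavefront granularity.
unsigned teamsPerCU(const DeviceLimits &D, unsigned KernelLDSBytes) {
  if (KernelLDSBytes == 0)
    return D.MaxWavesPerCU;
  return std::max(1u, D.LDSBytesPerCU / KernelLDSBytes);
}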

[libomptarget] [amdgpu] Set number of teams and threads based on GPU occupancy.

Determine the total number of teams for a kernel and the number of threads in each
team in order to maximize occupancy. This change considers the register and LDS
usage of the kernel during occupancy computation.

I haven't tried to understand the control flow yet. Is the idea to map a target region to as large a fraction of a CU as we can, scaling it back when occupancy constraints would force some of it to be idle anyway?

I haven't tried to understand the control flow yet. Is the idea to map a target region to as large a fraction of a CU as we can, scaling it back when occupancy constraints would force some of it to be idle anyway?

Yes, we start with the goal of filling up a CU with a pre-defined number of wavefronts. Given that goal, we try to choose the team count and team size so that their product approaches that pre-defined number of wavefronts. The choices of team count and team size are constrained by register/LDS usage.
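
Roughly, and with all names and constants below as placeholders for how such a tuning could look rather than the exact code in this patch:

#include <algorithm>

// Pick a team size and team count whose product of waves approaches a target
// number of wavefronts per CU, under pre-computed occupancy limits.
void tuneLaunch(unsigned TargetWavesPerCU, unsigned NumCUs, unsigned WaveSize,
                unsigned MaxWavesPerTeam, unsigned MaxTeamsPerCU,
                unsigned &TeamSize, unsigned &NumTeams) {
  // Waves per team, clipped by the register-derived limit.
  unsigned WavesPerTeam = std::max(1u, std::min(MaxWavesPerTeam, TargetWavesPerCU));
  TeamSize = WavesPerTeam * WaveSize;
  // Teams per CU, filling up to the target but not past the LDS-derived limit.
  unsigned TeamsPerCU =
      std::max(1u, std::min(MaxTeamsPerCU, TargetWavesPerCU / WavesPerTeam));
  NumTeams = TeamsPerCU * NumCUs;
}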

[libomptarget] [amdgpu] Set number of teams and threads based on GPU occupancy.

Perform teams/threads tuning in non-generic execution modes.
Do not tune if OMP_TEAMS_THREAD_LIMIT is set.

Determine the total number of teams for a kernel and the number of threads in each
team in order to maximize occupancy. This change considers the register and LDS
usage of the kernel during occupancy computation.

[libomptarget] [amdgpu] Set number of teams and threads based on GPU occupancy.

Ensure that thread count is within the limit.
Perform teams/threads tuning in non-generic execution modes.
Do not tune if OMP_TEAMS_THREAD_LIMIT is set.

Determine the total number of teams for a kernel and the number of threads in each
team in order to maximize occupancy. This change considers the register and LDS
usage of the kernel during occupancy computation.

This stuff definitely needs to be tested.

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
95

Also, this should be a struct, not an array of unsigned with enums for looking up fields
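
For example, something shaped like this, with field names and values made up for illustration:

// Illustrative only: named fields instead of an unsigned array indexed by enums.
struct GridValues {
  unsigned MaxTeams;
  unsigned MaxThreadsPerTeam;
  unsigned WarpSize;
  unsigned SlotSize;
};

// Made-up example values, not the actual grid values.
constexpr GridValues AMDGPUGridValues = {128, 1024, 64, 256};
constexpr GridValues NVPTXGridValues = {1024, 1024, 32, 256};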

109–110

I don't think these should be added to grid values.

They're not used by clang or LLVM, so they don't need to be shared. I'm not convinced they're architecture-independent constants (I think something has 32k LDS), and it looks like the plugin could use values discovered at runtime.

I think we're better off minimising the quantity of shared magic numbers, so I suggest we write the 64k / 3200 / 64k as constants in the plugin instead.

openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
352

This is (usually) 32 for gfx10, and the plugin is architecture-agnostic, so this probably can't be a compile-time enum.

These should all be unsigned too; negative numbers don't make sense for any of these.
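
For reference, a sketch of discovering the wavefront size at runtime via the standard HSA agent query rather than a compile-time enum; the error handling and fallback value here are placeholders, not a claim about how this patch does it:

#include <hsa/hsa.h>

// Ask the agent for its wavefront size (commonly 32 on gfx10, 64 on gfx9)
// instead of hard-coding it in a shared header.
uint32_t getWavefrontSize(hsa_agent_t Agent) {
  uint32_t WavefrontSize = 0;
  hsa_status_t Err =
      hsa_agent_get_info(Agent, HSA_AGENT_INFO_WAVEFRONT_SIZE, &WavefrontSize);
  return (Err == HSA_STATUS_SUCCESS && WavefrontSize != 0) ? WavefrontSize : 64;
}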