This is an archive of the discontinued LLVM Phabricator instance.

[libomptarget] Tune the number of teams and threads for kernel launch.
Needs ReviewPublic

Authored by dhruvachak on Mar 17 2021, 5:52 PM.

Download Raw Diff

Details

Reviewers

JonChesterfield
jdoerfert
ronlieb
carlo.bertolli

Summary

Change the default number of teams.

Based on kernel register usage, adjust the number of threads in a team.

Includes a corner case fix.

This change is dependent on https://reviews.llvm.org/D98829

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

dhruvachak created this revision.Mar 17 2021, 5:52 PM

Herald added subscribers: kerbowa, nhaehnle, jvesely. · View Herald TranscriptMar 17 2021, 5:52 PM

dhruvachak requested review of this revision.Mar 17 2021, 5:52 PM

Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptMar 17 2021, 5:52 PM

Herald added a subscriber: openmp-commits. · View Herald Transcript

Harbormaster completed remote builds in B94364: Diff 331423.Mar 17 2021, 7:00 PM

Herald added a subscriber: sstefan1. · View Herald TranscriptMar 17 2021, 7:00 PM

This is really interesting. The idea seems to be to choose the dispatch parameters based on the kernel metadata and the limits of the machine.

What's the underlying heuristic? Break across N CU's in chunks that match the occupancy limits of each CU?

If so we probably want to compare LDS usage as well to avoid partitioning poorly for that.

Maybe others - there might be a performance cliff on amount of private memory too.

JonChesterfield added inline comments.Mar 18 2021, 4:06 AM

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
102	Side point, there is too much redundancy in this table of numbers (e.g. the log2 fields) and warp_size_32 = 32 looks suspect

In D98832#2634168, @JonChesterfield wrote:

This is really interesting. The idea seems to be to choose the dispatch parameters based on the kernel metadata and the limits of the machine.

What's the underlying heuristic? Break across N CU's in chunks that match the occupancy limits of each CU?

Yes, that's the idea.

If so we probably want to compare LDS usage as well to avoid partitioning poorly for that.

Maybe others - there might be a performance cliff on amount of private memory too.

Agreed. However, I don't see LDS usage in the metadata table in the image. Is it present there?

In theory, a very high sgpr count can limit the number of available workgroups if that's not factored in for determining the number of threads. But in practice, VGPRs tend to be the primary limiting factor. So perhaps we can start with using VGPRs for this purpose and have experience guide us in the future.

Could you upload patches with full context please

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
91	Vector registers? Like xmm? or registers?

Added full context to the updated patch.

In D98832#2635396, @jdoerfert wrote:

Could you upload patches with full context please

Updated with the full context.

Like xmm. Here in particular, I am referring to the vector register file of a GPU.

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
91	Like xmm. Here in particular, I am referring to the vector register file of a GPU.

Harbormaster completed remote builds in B94575: Diff 331712.Mar 18 2021, 5:14 PM

In D98832#2635305, @dhruvachak wrote:

...
Agreed. However, I don't see LDS usage in the metadata table in the image. Is it present there?

Yes, see https://llvm.org/docs/AMDGPUUsage.html for the list of what we can expect. What may not be obvious is that the metadata calls it ".group_segment_fixed_size". I don't know the origin of the terminology, maybe opencl?

In theory, a very high sgpr count can limit the number of available workgroups if that's not factored in for determining the number of threads. But in practice, VGPRs tend to be the primary limiting factor. So perhaps we can start with using VGPRs for this purpose and have experience guide us in the future.

If I understand correctly, occupancy rules all look something like (resource used / resource available) == number simultaneous, where one of the resources tends to be limiting. Offhand, I think that's VGPR, SGPR, LDS (group segment). I think there's also an architecture dependent upper bound on how many things can run at once even if they use very little of those, maybe 8 for gfx9 and 16 for gfx10.

If that's right, perhaps the calculation should look something like:

uint vgpr_occupancy = vgpr_used / vgpr_available;
uint sgpr_occupancy = sgpr_used / sgpr_available;
uint lds_occupancy = lds_used / lds_available;
uint limiting_occupancy = min(vgpr_occupancy, sgpr_occupacny, lds_occupancy);

and then we derive threadsPerGroup from that occupancy and the various other considerations.

openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
1132–1133	This looks like a drive by copy/paste error fix, maybe post that separately? If you're currently uploading diffs through the gui (based on the missing context comment) that's quite labour intensive. If you change to arcanist, the flow becomes git checkout main git checkout -b some_feature ...edit git add -u && git commit -m "message" arc diff main # opens an editor

In D98832#2637285, @JonChesterfield wrote:
In D98832#2635305, @dhruvachak wrote:

...
Agreed. However, I don't see LDS usage in the metadata table in the image. Is it present there?

Yes, see https://llvm.org/docs/AMDGPUUsage.html for the list of what we can expect. What may not be obvious is that the metadata calls it ".group_segment_fixed_size". I don't know the origin of the terminology, maybe opencl?

In theory, a very high sgpr count can limit the number of available workgroups if that's not factored in for determining the number of threads. But in practice, VGPRs tend to be the primary limiting factor. So perhaps we can start with using VGPRs for this purpose and have experience guide us in the future.

If I understand correctly, occupancy rules all look something like (resource used / resource available) == number simultaneous, where one of the resources tends to be limiting. Offhand, I think that's VGPR, SGPR, LDS (group segment). I think there's also an architecture dependent upper bound on how many things can run at once even if they use very little of those, maybe 8 for gfx9 and 16 for gfx10.

If that's right, perhaps the calculation should look something like:
uint vgpr_occupancy = vgpr_used / vgpr_available;
uint sgpr_occupancy = sgpr_used / sgpr_available;
uint lds_occupancy = lds_used / lds_available;
uint limiting_occupancy = min(vgpr_occupancy, sgpr_occupacny, lds_occupancy);
and then we derive threadsPerGroup from that occupancy and the various other considerations.

Thanks for the pointer to the group segment. Yes, in general, my idea is similar to what you outlined above. However, note that SGPRs and LDS are at different granularities compared to VGPRs. VGPRs are per-thread, SGPRs are shared within a wavefront, and LDS is shared within a workgroup. So while VGPRs can be used to limit the number of threads, perhaps SGPRs and LDS can be used to limit the number of teams.

Let me split up this patch further. I would like to land the default num_teams change sooner rather than later since that's a simple change and has shown improved performance. So let me separate that out. Incorporating SGPRs/LDS to constrain teams/threads will need more experimentation.

dhruvachak mentioned this in D99003: [libomptarget] [amdgpu] Change default number of teams per computation unit.Mar 19 2021, 7:00 PM

[libomptarget] [amdgpu] Set number of teams and threads based on GPU occupancy.

Determine total number of teams in a kernel and the number of threads in each
team in order to maximize occupancy. This change considers register and LDS
usage of the kernel during occupancy computation.

dhruvachak added a reviewer: carlo.bertolli.Jun 16 2021, 10:12 AM

I haven't tried to understand the control flow yet. Is the idea to map a target region to as large a fraction of a CU as we can, scaling it back when occupancy constraints would force some of it to be idle anyway?

In D98832#2822837, @JonChesterfield wrote:

I haven't tried to understand the control flow yet. Is the idea to map a target region to as large a fraction of a CU as we can, scaling it back when occupancy constraints would force some of it to be idle anyway?

Yes, we start with the goal of filling up a CU with a pre-defined number of wavefronts. Given that goal, we try to choose team-count and team-size in a way so that their product approaches the pre-defined number of wavefronts. And the choices of team-count/team-size are constrained by register/LDS usage.

Harbormaster completed remote builds in B109541: Diff 352470.Jun 16 2021, 8:30 PM

[libomptarget] [amdgpu] Set number of teams and threads based on GPU occupancy.

Perform teams/threads tuning in non-generic execution modes.
Do not tune if OMP_TEAMS_THREAD_LIMIT is set.

Harbormaster completed remote builds in B111339: Diff 354975.Jun 28 2021, 1:09 PM

[libomptarget] [amdgpu] Set number of teams and threads based on GPU occupancy.

Ensure that thread count is within the limit.
Perform teams/threads tuning in non-generic execution modes.
Do not tune if OMP_TEAMS_THREAD_LIMIT is set.

Harbormaster completed remote builds in B111419: Diff 355083.Jun 28 2021, 7:17 PM

This stuff definitely needs to be tested.

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
99	Also, this should be a struct, not an array of unsigned with enums for looking up fields
113	I don't think these should be added to grid values. They're not used by clang or LLVM, so don't need to be shared. I'm not convinced they're architecture independent constants (I think something has 32k LDS), and it looks like the plugin could use values discovered at runtime. I think we're better off minimising the quantity of shared magic numbers so suggest we write the 64k / 3200 / 64 k as constants in the plugin instead.
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
506	this is (usually) 32 for gfx10, and the plugin is architecture-agnostic, so this probably can't be a compile time enum should all be unsigned too, negative numbers don't make sense for any of these

dhruvachak mentioned this in rGe0b713a0357a: [libomptarget] [amdgpu] Change default number of teams per computation unit.Jun 29 2021, 3:35 PM

Revision Contents

Path

Size

llvm/

include/

llvm/

Frontend/

OpenMP/

OMPGridValues.h

18 lines

openmp/

libomptarget/

plugins/

amdgpu/

src/

rtl.cpp

127 lines

Diff 355083

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h

Show First 20 Lines • Show All 80 Lines • ▼ Show 20 Lines	enum GVIDX {
// The absolute maximum team size for a working group		// The absolute maximum team size for a working group
GV_Max_WG_Size,		GV_Max_WG_Size,
// The default maximum team size for a working group		// The default maximum team size for a working group
GV_Default_WG_Size,		GV_Default_WG_Size,
// This is GV_Max_WG_Size / GV_WarpSize. 32 for NVPTX and 16 for AMDGCN.		// This is GV_Max_WG_Size / GV_WarpSize. 32 for NVPTX and 16 for AMDGCN.
GV_Max_Warp_Number,		GV_Max_Warp_Number,
/// The slot size that should be reserved for a working warp.		/// The slot size that should be reserved for a working warp.
/// (~0u >> (GV_Warp_Size - GV_Warp_Size_Log2))		/// (~0u >> (GV_Warp_Size - GV_Warp_Size_Log2))
GV_Warp_Size_Log2_MaskL		GV_Warp_Size_Log2_MaskL,
		/// Total number of vector registers per CU or SM
		GV_Vector_Register_Count,
		jdoerfertUnsubmitted Not Done Reply Inline Actions Vector registers? Like xmm? or registers? jdoerfert: Vector registers? Like xmm? or registers?
		dhruvachakAuthorUnsubmitted Not Done Reply Inline Actions Like xmm. Here in particular, I am referring to the vector register file of a GPU. dhruvachak: Like xmm. Here in particular, I am referring to the vector register file of a GPU.
		/// Total number of scalar registers per CU or SM
		GV_Scalar_Register_Count,
		/// Total shared memory size per CU or SM in bytes
		GV_Shared_Memory_Size
};		};

/// For AMDGPU GPUs		/// For AMDGPU GPUs
static constexpr unsigned AMDGPUGpuGridValues[] = {		static constexpr unsigned AMDGPUGpuGridValues[] = {
		JonChesterfieldUnsubmitted Not Done Reply Inline Actions Also, this should be a struct, not an array of unsigned with enums for looking up fields JonChesterfield: Also, this should be a struct, not an array of unsigned with enums for looking up fields
448, // GV_Threads		448, // GV_Threads
256, // GV_Slot_Size		256, // GV_Slot_Size
64, // GV_Warp_Size		64, // GV_Warp_Size
		JonChesterfieldUnsubmitted Not Done Reply Inline Actions Side point, there is too much redundancy in this table of numbers (e.g. the log2 fields) and warp_size_32 = 32 looks suspect JonChesterfield: Side point, there is too much redundancy in this table of numbers (e.g. the log2 fields) and…
32, // GV_Warp_Size_32		32, // GV_Warp_Size_32
6, // GV_Warp_Size_Log2		6, // GV_Warp_Size_Log2
64 * 256, // GV_Warp_Slot_Size		64 * 256, // GV_Warp_Slot_Size
128, // GV_Max_Teams		128, // GV_Max_Teams
256, // GV_Mem_Align		256, // GV_Mem_Align
63, // GV_Warp_Size_Log2_Mask		63, // GV_Warp_Size_Log2_Mask
896, // GV_SimpleBufferSize		896, // GV_SimpleBufferSize
1024, // GV_Max_WG_Size,		1024, // GV_Max_WG_Size,
256, // GV_Defaut_WG_Size		256, // GV_Defaut_WG_Size
1024 / 64, // GV_Max_WG_Size / GV_WarpSize		1024 / 64, // GV_Max_WG_Size / GV_WarpSize
63 // GV_Warp_Size_Log2_MaskL		63, // GV_Warp_Size_Log2_MaskL
		JonChesterfieldUnsubmitted Not Done Reply Inline Actions I don't think these should be added to grid values. They're not used by clang or LLVM, so don't need to be shared. I'm not convinced they're architecture independent constants (I think something has 32k LDS), and it looks like the plugin could use values discovered at runtime. I think we're better off minimising the quantity of shared magic numbers so suggest we write the 64k / 3200 / 64 k as constants in the plugin instead. JonChesterfield: I don't think these should be added to grid values. They're not used by clang or LLVM, so…
		64 * 1024, // GV_Vector_Register_Count
		4 * 800, // GV_Scalar_Register_Count
		64 * 1024 // GV_Shared_Memory_Size
};		};

/// For Nvidia GPUs		/// For Nvidia GPUs
static constexpr unsigned NVPTXGpuGridValues[] = {		static constexpr unsigned NVPTXGpuGridValues[] = {
992, // GV_Threads		992, // GV_Threads
256, // GV_Slot_Size		256, // GV_Slot_Size
32, // GV_Warp_Size		32, // GV_Warp_Size
32, // GV_Warp_Size_32		32, // GV_Warp_Size_32
5, // GV_Warp_Size_Log2		5, // GV_Warp_Size_Log2
32 * 256, // GV_Warp_Slot_Size		32 * 256, // GV_Warp_Slot_Size
1024, // GV_Max_Teams		1024, // GV_Max_Teams
256, // GV_Mem_Align		256, // GV_Mem_Align
(~0u >> (32 - 5)), // GV_Warp_Size_Log2_Mask		(~0u >> (32 - 5)), // GV_Warp_Size_Log2_Mask
896, // GV_SimpleBufferSize		896, // GV_SimpleBufferSize
1024, // GV_Max_WG_Size		1024, // GV_Max_WG_Size
128, // GV_Defaut_WG_Size		128, // GV_Defaut_WG_Size
1024 / 32, // GV_Max_WG_Size / GV_WarpSize		1024 / 32, // GV_Max_WG_Size / GV_WarpSize
31 // GV_Warp_Size_Log2_MaskL		31, // GV_Warp_Size_Log2_MaskL
		64 * 1024, // GV_Vector_Register_Count
		0, // GV_Scalar_Register_Count (not applicable)
		32 * 1024 // GV_Shared_Memory_Size (configurable)
};		};

} // namespace omp		} // namespace omp
} // namespace llvm		} // namespace llvm

#endif // LLVM_FRONTEND_OPENMP_OMPGRIDVALUES_H		#endif // LLVM_FRONTEND_OPENMP_OMPGRIDVALUES_H

openmp/libomptarget/plugins/amdgpu/src/rtl.cpp

Show All 13 Lines
#include <assert.h>		#include <assert.h>
#include <cstdio>		#include <cstdio>
#include <cstdlib>		#include <cstdlib>
#include <cstring>		#include <cstring>
#include <elf.h>		#include <elf.h>
#include <fstream>		#include <fstream>
#include <functional>		#include <functional>
#include <iostream>		#include <iostream>
#include <libelf.h>		#include <libelf.h>
		Lint: Pre-merge checks Inline Actions clang-tidy: error: 'libelf.h' file not found [clang-diagnostic-error] not useful Lint: Pre-merge checks: clang-tidy: error: 'libelf.h' file not found [clang-diagnostic-error] [[https://github.
#include <list>		#include <list>
#include <memory>		#include <memory>
#include <mutex>		#include <mutex>
#include <shared_mutex>		#include <shared_mutex>
#include <thread>		#include <thread>
#include <unordered_map>		#include <unordered_map>
#include <vector>		#include <vector>

Show All 34 Lines
hostrpc_assign_buffer(hsa_agent_t, hsa_queue_t *, uint32_t device_id) {		hostrpc_assign_buffer(hsa_agent_t, hsa_queue_t *, uint32_t device_id) {
DP("Warning: Attempting to assign hostrpc to device %u, but hostrpc library "		DP("Warning: Attempting to assign hostrpc to device %u, but hostrpc library "
"missing\n",		"missing\n",
device_id);		device_id);
return 0;		return 0;
}		}
}		}

		// Number of SIMDs in a CU
		static const uint32_t NumSIMDsPerCU = 4;
		// Heuristic parameters used for kernel launch parameters
		// Default number of waves per team is chosen equal to the number of SIMDs
		static const uint32_t DefaultNumWavesPerTeam = NumSIMDsPerCU;
		// Default number of teams per CU is chosen for scheduling flexibility within a
		// SIMD
		static const uint32_t DefaultNumTeamsPerCU = 4;

int print_kernel_trace;		int print_kernel_trace;

#ifdef OMPTARGET_DEBUG		#ifdef OMPTARGET_DEBUG
#define check(msg, status) \		#define check(msg, status) \
if (status != HSA_STATUS_SUCCESS) { \		if (status != HSA_STATUS_SUCCESS) { \
DP(#msg " failed\n"); \		DP(#msg " failed\n"); \
} else { \		} else { \
DP(#msg " succeeded\n"); \		DP(#msg " succeeded\n"); \
▲ Show 20 Lines • Show All 408 Lines • ▼ Show 20 Lines	public:

// device_State shared across loaded binaries, error if inconsistent size		// device_State shared across loaded binaries, error if inconsistent size
std::vector<std::pair<std::unique_ptr<void, atmiFreePtrDeletor>, uint64_t>>		std::vector<std::pair<std::unique_ptr<void, atmiFreePtrDeletor>, uint64_t>>
deviceStateStore;		deviceStateStore;

static const unsigned HardTeamLimit =		static const unsigned HardTeamLimit =
(1 << 16) - 1; // 64K needed to fit in uint16		(1 << 16) - 1; // 64K needed to fit in uint16
static const int DefaultNumTeams = 128;		static const int DefaultNumTeams = 128;
static const int Max_Teams =
llvm::omp::AMDGPUGpuGridValues[llvm::omp::GVIDX::GV_Max_Teams];
static const int Warp_Size =		static const int Warp_Size =
		JonChesterfieldUnsubmitted Not Done Reply Inline Actions this is (usually) 32 for gfx10, and the plugin is architecture-agnostic, so this probably can't be a compile time enum should all be unsigned too, negative numbers don't make sense for any of these JonChesterfield: this is (usually) 32 for gfx10, and the plugin is architecture-agnostic, so this probably can't…
llvm::omp::AMDGPUGpuGridValues[llvm::omp::GVIDX::GV_Warp_Size];		llvm::omp::AMDGPUGpuGridValues[llvm::omp::GVIDX::GV_Warp_Size];
		static const int Max_Teams =
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'Max_Teams' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'Max_Teams' [readability-identifier…
		llvm::omp::AMDGPUGpuGridValues[llvm::omp::GVIDX::GV_Max_Teams];
static const int Max_WG_Size =		static const int Max_WG_Size =
llvm::omp::AMDGPUGpuGridValues[llvm::omp::GVIDX::GV_Max_WG_Size];		llvm::omp::AMDGPUGpuGridValues[llvm::omp::GVIDX::GV_Max_WG_Size];
static const int Default_WG_Size =		static const int Default_WG_Size =
llvm::omp::AMDGPUGpuGridValues[llvm::omp::GVIDX::GV_Default_WG_Size];		llvm::omp::AMDGPUGpuGridValues[llvm::omp::GVIDX::GV_Default_WG_Size];
		static const int Max_Warp_Number =
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'Max_Warp_Number' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'Max_Warp_Number' [readability-identifier…
		llvm::omp::AMDGPUGpuGridValues[llvm::omp::GVIDX::GV_Max_Warp_Number];
		static const int Vector_Register_Count = llvm::omp::AMDGPUGpuGridValues
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'Vector_Register_Count' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'Vector_Register_Count' [readability…
		[llvm::omp::GVIDX::GV_Vector_Register_Count];
		static const int Scalar_Register_Count = llvm::omp::AMDGPUGpuGridValues
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'Scalar_Register_Count' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'Scalar_Register_Count' [readability…
		[llvm::omp::GVIDX::GV_Scalar_Register_Count];
		static const int LDS_Size =
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'LDS_Size' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'LDS_Size' [readability-identifier-naming]…
		llvm::omp::AMDGPUGpuGridValues[llvm::omp::GVIDX::GV_Shared_Memory_Size];

using MemcpyFunc = hsa_status_t ()(hsa_signal_t, void , const void *,		using MemcpyFunc = hsa_status_t ()(hsa_signal_t, void , const void *,
size_t size, hsa_agent_t);		size_t size, hsa_agent_t);
hsa_status_t freesignalpool_memcpy(void dest, const void src, size_t size,		hsa_status_t freesignalpool_memcpy(void dest, const void src, size_t size,
MemcpyFunc Func, int32_t deviceId) {		MemcpyFunc Func, int32_t deviceId) {
hsa_agent_t agent = HSAAgents[deviceId];		hsa_agent_t agent = HSAAgents[deviceId];
hsa_signal_t s = FreeSignalPool.pop();		hsa_signal_t s = FreeSignalPool.pop();
if (s.handle == 0) {		if (s.handle == 0) {
▲ Show 20 Lines • Show All 594 Lines • ▼ Show 20 Lines	if (DeviceInfo.EnvTeamThreadLimit > 0 &&
DP("Capping max number of threads to OMP_TEAMS_THREAD_LIMIT=%d\n",		DP("Capping max number of threads to OMP_TEAMS_THREAD_LIMIT=%d\n",
DeviceInfo.EnvTeamThreadLimit);		DeviceInfo.EnvTeamThreadLimit);
}		}

// Set default number of threads		// Set default number of threads
DeviceInfo.NumThreads[device_id] = RTLDeviceInfoTy::Default_WG_Size;		DeviceInfo.NumThreads[device_id] = RTLDeviceInfoTy::Default_WG_Size;
DP("Default number of threads set according to library's default %d\n",		DP("Default number of threads set according to library's default %d\n",
RTLDeviceInfoTy::Default_WG_Size);		RTLDeviceInfoTy::Default_WG_Size);
if (enforce_upper_bound(&DeviceInfo.NumThreads[device_id],		if (enforce_upper_bound(&DeviceInfo.NumThreads[device_id],
DeviceInfo.ThreadsPerGroup[device_id])) {		DeviceInfo.ThreadsPerGroup[device_id])) {
		JonChesterfieldUnsubmitted Not Done Reply Inline Actions This looks like a drive by copy/paste error fix, maybe post that separately? If you're currently uploading diffs through the gui (based on the missing context comment) that's quite labour intensive. If you change to arcanist, the flow becomes git checkout main git checkout -b some_feature ...edit git add -u && git commit -m "message" arc diff main # opens an editor JonChesterfield: This looks like a drive by copy/paste error fix, maybe post that separately? If you're…
DP("Default number of threads exceeds device limit, capping at %d\n",		DP("Default number of threads exceeds device limit, capping at %d\n",
DeviceInfo.ThreadsPerGroup[device_id]);		DeviceInfo.ThreadsPerGroup[device_id]);
}		}

DP("Device %d: default limit for groupsPerDevice %d & threadsPerGroup %d\n",		DP("Device %d: default limit for groupsPerDevice %d & threadsPerGroup %d\n",
device_id, DeviceInfo.GroupsPerDevice[device_id],		device_id, DeviceInfo.GroupsPerDevice[device_id],
DeviceInfo.ThreadsPerGroup[device_id]);		DeviceInfo.ThreadsPerGroup[device_id]);

▲ Show 20 Lines • Show All 746 Lines • ▼ Show 20 Lines	int32_t __tgt_rtl_data_delete(int device_id, void *tgt_ptr) {
err = core::Runtime::Memfree(tgt_ptr);		err = core::Runtime::Memfree(tgt_ptr);
if (err != HSA_STATUS_SUCCESS) {		if (err != HSA_STATUS_SUCCESS) {
DP("Error when freeing CUDA memory\n");		DP("Error when freeing CUDA memory\n");
return OFFLOAD_FAIL;		return OFFLOAD_FAIL;
}		}
return OFFLOAD_SUCCESS;		return OFFLOAD_SUCCESS;
}		}

		/*
		Determine total number of teams in a kernel and the number of
		threads in each team in order to maximize occupancy.

		Here is the overall idea: We aim to schedule a certain number of
		wavefronts, expressed as the product of the number of teams and the
		number of wavefronts per team. Each CU has 4 SIMDs. To account for
		each SIMD, we start with at least 4 wavefronts in a team. If the
		total number of teams per CU is constrained by LDS usage, we try to
		increase the number of wavefronts per team. If the total number of
		wavefronts per SIMD is constrained by VGPR or SGPR usage, we reduce
		the number of teams while keeping the number of wavefronts unchanged.
		*/
		void adjustTeamsAndThreadsBasedOnResources(int NumTeams, int NumThreads,
		uint32_t lds_usage,
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for parameter 'lds_usage' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'lds_usage' [readability-identifier…
		uint32_t sgpr_count,
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for parameter 'sgpr_count' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'sgpr_count' [readability-identifier…
		uint32_t vgpr_count,
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for parameter 'vgpr_count' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'vgpr_count' [readability-identifier…
		int32_t device_id) {
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for parameter 'device_id' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'device_id' [readability-identifier…
		// Initialize the number of waves per team to the default
		uint32_t num_waves_per_team = DefaultNumWavesPerTeam;
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'num_waves_per_team' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'num_waves_per_team' [readability…

		// lds_usage is reported per workgroup (i.e. team). So initialize
		// the number of teams per CU based on lds_usage
		uint32_t num_teams_per_cu =
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'num_teams_per_cu' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'num_teams_per_cu' [readability-identifier…
		std::min(RTLDeviceInfoTy::LDS_Size / (lds_usage ? lds_usage : 1),
		DefaultNumTeamsPerCU);

		// Compute the maximum number of waves per SIMD based on VGPR and SGPR usage
		uint32_t vgprs_avail_per_simd =
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'vgprs_avail_per_simd' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'vgprs_avail_per_simd' [readability…
		RTLDeviceInfoTy::Vector_Register_Count / NumSIMDsPerCU;
		// vgpr_count is per workitem (i.e. thread)
		uint32_t vgpr_usage_per_wave = vgpr_count * RTLDeviceInfoTy::Warp_Size;
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'vgpr_usage_per_wave' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'vgpr_usage_per_wave' [readability…
		uint32_t vgpr_constrained_max_waves_per_simd =
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'vgpr_constrained_max_waves_per_simd' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'vgpr_constrained_max_waves_per_simd'…
		vgprs_avail_per_simd / (vgpr_usage_per_wave ? vgpr_usage_per_wave : 1);

		uint32_t sgprs_avail_per_simd =
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'sgprs_avail_per_simd' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'sgprs_avail_per_simd' [readability…
		RTLDeviceInfoTy::Scalar_Register_Count / NumSIMDsPerCU;
		// sgpr_count is per wavefront
		uint32_t sgpr_constrained_max_waves_per_simd =
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'sgpr_constrained_max_waves_per_simd' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'sgpr_constrained_max_waves_per_simd'…
		sgprs_avail_per_simd / (sgpr_count ? sgpr_count : 1);

		uint32_t max_waves_per_simd =
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'max_waves_per_simd' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'max_waves_per_simd' [readability…
		std::min(std::min(vgpr_constrained_max_waves_per_simd,
		sgpr_constrained_max_waves_per_simd),
		DefaultNumTeamsPerCU);

		uint32_t default_occupancy_factor =
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'default_occupancy_factor' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'default_occupancy_factor' [readability…
		DefaultNumTeamsPerCU * DefaultNumWavesPerTeam;
		// Compute occupancy factor based on constraints from LDS usage
		uint32_t lds_constrained_occupancy_factor =
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'lds_constrained_occupancy_factor' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'lds_constrained_occupancy_factor'…
		num_teams_per_cu * DefaultNumWavesPerTeam;
		// Compute occupancy factor based on constraints from VGPR and SGPR usage
		uint32_t gpr_constrained_occupancy_factor =
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'gpr_constrained_occupancy_factor' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'gpr_constrained_occupancy_factor'…
		max_waves_per_simd * DefaultNumWavesPerTeam;

		// First, we examine whether LDS is limiting the number of teams,
		// regardless of any limits imposed by GPR usage
		if (lds_constrained_occupancy_factor < default_occupancy_factor) {
		// No benefit in increasing num_teams_per_cu
		// But try to increase the number of waves per team subject to
		// constraints imposed by GPR usage
		num_waves_per_team = std::min(gpr_constrained_occupancy_factor /
		(num_teams_per_cu ? num_teams_per_cu : 1),
		(uint32_t)RTLDeviceInfoTy::Max_Warp_Number);
		} else if (gpr_constrained_occupancy_factor < default_occupancy_factor) {
		// Not much can be done, this kernel will run with reduced
		// occupancy. Lower the number of teams to reflect that aspect.
		num_teams_per_cu = std::min(num_teams_per_cu, max_waves_per_simd);
		}

		// Total number of teams in the kernel
		NumTeams = num_teams_per_cu DeviceInfo.ComputeUnits[device_id];

		// Total number of threads in each team
		NumThreads = num_waves_per_team RTLDeviceInfoTy::Warp_Size;
		}

// Determine launch values for threadsPerGroup and num_groups.		// Determine launch values for threadsPerGroup and num_groups.
// Outputs: treadsPerGroup, num_groups		// Outputs: treadsPerGroup, num_groups
// Inputs: Max_Teams, Max_WG_Size, Warp_Size, ExecutionMode,		// Inputs: Max_Teams, Max_WG_Size, Warp_Size, ExecutionMode,
// EnvTeamLimit, EnvNumTeams, num_teams, thread_limit,		// EnvTeamLimit, EnvNumTeams, num_teams, thread_limit,
// loop_tripcount.		// loop_tripcount.
void getLaunchVals(int &threadsPerGroup, int &num_groups, int ConstWGSize,		void getLaunchVals(int &threadsPerGroup, int &num_groups, int ConstWGSize,
int ExecutionMode, int EnvTeamLimit, int EnvNumTeams,		int ExecutionMode, int EnvTeamLimit, int EnvNumTeams,
int num_teams, int thread_limit, uint64_t loop_tripcount,		int num_teams, int thread_limit, uint64_t loop_tripcount,
		uint32_t lds_usage, uint32_t sgpr_count, uint32_t vgpr_count,
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for parameter 'lds_usage' [readability-identifier-naming] not useful clang-tidy: warning: invalid case style for parameter 'sgpr_count' [readability-identifier-naming] not useful clang-tidy: warning: invalid case style for parameter 'vgpr_count' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for parameter 'lds_usage' [readability-identifier…
int32_t device_id) {		int32_t device_id) {

		// In non-generic mode, if the user did not specify number of teams
		// or threads, adjust them based on resources
		int NumTeamsBasedOnResources = 0;
		int NumThreadsBasedOnResources = 0;
		if (ExecutionMode != GENERIC && EnvNumTeams <= 0 && EnvTeamLimit <= 0 &&
		DeviceInfo.EnvTeamThreadLimit <= 0 &&
		DeviceInfo.EnvMaxTeamsDefault <= 0 && num_teams <= 0 &&
		thread_limit <= 0) {
		adjustTeamsAndThreadsBasedOnResources(
		&NumTeamsBasedOnResources, &NumThreadsBasedOnResources, lds_usage,
		sgpr_count, vgpr_count, device_id);
		}

int Max_Teams = DeviceInfo.EnvMaxTeamsDefault > 0		int Max_Teams = DeviceInfo.EnvMaxTeamsDefault > 0
? DeviceInfo.EnvMaxTeamsDefault		? DeviceInfo.EnvMaxTeamsDefault
: DeviceInfo.NumTeams[device_id];		: DeviceInfo.NumTeams[device_id];
if (Max_Teams > DeviceInfo.HardTeamLimit)		if (Max_Teams > DeviceInfo.HardTeamLimit)
Max_Teams = DeviceInfo.HardTeamLimit;		Max_Teams = DeviceInfo.HardTeamLimit;

if (print_kernel_trace & STARTUP_DETAILS) {		if (print_kernel_trace & STARTUP_DETAILS) {
fprintf(stderr, "RTLDeviceInfoTy::Max_Teams: %d\n",		fprintf(stderr, "RTLDeviceInfoTy::Max_Teams: %d\n",
RTLDeviceInfoTy::Max_Teams);		RTLDeviceInfoTy::Max_Teams);
fprintf(stderr, "Max_Teams: %d\n", Max_Teams);		fprintf(stderr, "Max_Teams: %d\n", Max_Teams);
		fprintf(stderr, "NumTeamsBasedOnResources: %d\n", NumTeamsBasedOnResources);
fprintf(stderr, "RTLDeviceInfoTy::Warp_Size: %d\n",		fprintf(stderr, "RTLDeviceInfoTy::Warp_Size: %d\n",
RTLDeviceInfoTy::Warp_Size);		RTLDeviceInfoTy::Warp_Size);
fprintf(stderr, "RTLDeviceInfoTy::Max_WG_Size: %d\n",		fprintf(stderr, "RTLDeviceInfoTy::Max_WG_Size: %d\n",
RTLDeviceInfoTy::Max_WG_Size);		RTLDeviceInfoTy::Max_WG_Size);
fprintf(stderr, "RTLDeviceInfoTy::Default_WG_Size: %d\n",		fprintf(stderr, "RTLDeviceInfoTy::Default_WG_Size: %d\n",
RTLDeviceInfoTy::Default_WG_Size);		RTLDeviceInfoTy::Default_WG_Size);
fprintf(stderr, "thread_limit: %d\n", thread_limit);		fprintf(stderr, "thread_limit: %d\n", thread_limit);
fprintf(stderr, "threadsPerGroup: %d\n", threadsPerGroup);		fprintf(stderr, "threadsPerGroup: %d\n", threadsPerGroup);
		fprintf(stderr, "NumThreadsBasedOnResources: %d\n",
		NumThreadsBasedOnResources);
fprintf(stderr, "ConstWGSize: %d\n", ConstWGSize);		fprintf(stderr, "ConstWGSize: %d\n", ConstWGSize);
}		}

		if (NumTeamsBasedOnResources > 0 &&
		NumTeamsBasedOnResources <= DeviceInfo.HardTeamLimit &&
		NumThreadsBasedOnResources > 0 &&
		NumThreadsBasedOnResources <= RTLDeviceInfoTy::Max_WG_Size) {
		Max_Teams = NumTeamsBasedOnResources;
		threadsPerGroup = NumThreadsBasedOnResources;
		DP("Modifying Max_Teams based on resources: %d\n", Max_Teams);
		DP("Modifying threadsPerGroup based on resources: %d\n", threadsPerGroup);
		}

// check for thread_limit() clause		// check for thread_limit() clause
if (thread_limit > 0) {		if (thread_limit > 0) {
threadsPerGroup = thread_limit;		threadsPerGroup = thread_limit;
DP("Setting threads per block to requested %d\n", thread_limit);		DP("Setting threads per block to requested %d\n", thread_limit);
if (ExecutionMode == GENERIC) { // Add master warp for GENERIC		if (ExecutionMode == GENERIC) { // Add master warp for GENERIC
threadsPerGroup += RTLDeviceInfoTy::Warp_Size;		threadsPerGroup += RTLDeviceInfoTy::Warp_Size;
DP("Adding master wavefront: +%d threads\n", RTLDeviceInfoTy::Warp_Size);		DP("Adding master wavefront: +%d threads\n", RTLDeviceInfoTy::Warp_Size);
}		}
▲ Show 20 Lines • Show All 183 Lines • ▼ Show 20 Lines	int32_t __tgt_rtl_run_target_team_region_locked(
int threadsPerGroup = RTLDeviceInfoTy::Default_WG_Size;		int threadsPerGroup = RTLDeviceInfoTy::Default_WG_Size;

getLaunchVals(threadsPerGroup, num_groups, KernelInfo->ConstWGSize,		getLaunchVals(threadsPerGroup, num_groups, KernelInfo->ConstWGSize,
KernelInfo->ExecutionMode, DeviceInfo.EnvTeamLimit,		KernelInfo->ExecutionMode, DeviceInfo.EnvTeamLimit,
DeviceInfo.EnvNumTeams,		DeviceInfo.EnvNumTeams,
num_teams, // From run_region arg		num_teams, // From run_region arg
thread_limit, // From run_region arg		thread_limit, // From run_region arg
loop_tripcount, // From run_region arg		loop_tripcount, // From run_region arg
		group_segment_size, sgpr_count, vgpr_count,
KernelInfo->device_id);		KernelInfo->device_id);

if (print_kernel_trace >= LAUNCH) {		if (print_kernel_trace >= LAUNCH) {
// enum modes are SPMD, GENERIC, NONE 0,1,2		// enum modes are SPMD, GENERIC, NONE 0,1,2
// if doing rtl timing, print to stderr, unless stdout requested.		// if doing rtl timing, print to stderr, unless stdout requested.
bool traceToStdout = print_kernel_trace & (RTL_TO_STDOUT \| RTL_TIMING);		bool traceToStdout = print_kernel_trace & (RTL_TO_STDOUT \| RTL_TIMING);
fprintf(traceToStdout ? stdout : stderr,		fprintf(traceToStdout ? stdout : stderr,
"DEVID:%2d SGN:%1d ConstWGSize:%-4d args:%2d teamsXthrds:(%4dX%4d) "		"DEVID:%2d SGN:%1d ConstWGSize:%-4d args:%2d teamsXthrds:(%4dX%4d) "
▲ Show 20 Lines • Show All 171 Lines • Show Last 20 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[libomptarget] Tune the number of teams and threads for kernel launch.Needs ReviewPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 355083

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h

openmp/libomptarget/plugins/amdgpu/src/rtl.cpp

[libomptarget] Tune the number of teams and threads for kernel launch.
Needs ReviewPublic