This is an archive of the discontinued LLVM Phabricator instance.

[libomptarget] Tune the number of teams and threads for kernel launch.
Needs Review · Public

Authored by dhruvachak on Mar 17 2021, 5:52 PM.

Details

Summary

Change the default number of teams.

Based on kernel register usage, adjust the number of threads in a team.

Includes a corner case fix.

This change is dependent on https://reviews.llvm.org/D98829


Event Timeline

dhruvachak created this revision. · Mar 17 2021, 5:52 PM
dhruvachak requested review of this revision. · Mar 17 2021, 5:52 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald Transcript · Mar 17 2021, 5:52 PM

This is really interesting. The idea seems to be to choose the dispatch parameters based on the kernel metadata and the limits of the machine.

What's the underlying heuristic? Break across N CUs in chunks that match the occupancy limits of each CU?

If so we probably want to compare LDS usage as well to avoid partitioning poorly for that.

Maybe others - there might be a performance cliff on amount of private memory too.

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
102

Side point, there is too much redundancy in this table of numbers (e.g. the log2 fields) and warp_size_32 = 32 looks suspect

This is really interesting. The idea seems to be to choose the dispatch parameters based on the kernel metadata and the limits of the machine.

What's the underlying heuristic? Break across N CUs in chunks that match the occupancy limits of each CU?

Yes, that's the idea.

If so we probably want to compare LDS usage as well to avoid partitioning poorly for that.

Maybe others - there might be a performance cliff on amount of private memory too.

Agreed. However, I don't see LDS usage in the metadata table in the image. Is it present there?

In theory, a very high sgpr count can limit the number of available workgroups if that's not factored in for determining the number of threads. But in practice, VGPRs tend to be the primary limiting factor. So perhaps we can start with using VGPRs for this purpose and have experience guide us in the future.
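
For example, a minimal sketch of the VGPR-limited part; the per-SIMD VGPR budget and the wave cap below are placeholder constants for illustration, not values taken from this patch:

#include <algorithm>

// How many waves fit per SIMD given the kernel's per-thread VGPR count.
// Both limits are illustrative placeholders.
unsigned vgprLimitedWaves(unsigned KernelVGPRs) {
  constexpr unsigned VGPRBudgetPerSIMD = 256; // placeholder register budget
  constexpr unsigned ArchWaveCap = 8;         // placeholder architectural cap
  if (KernelVGPRs == 0)
    return ArchWaveCap;
  return std::max(1u, std::min(VGPRBudgetPerSIMD / KernelVGPRs, ArchWaveCap));
}

A thread limit per team could then be derived from this wave count times the wavefront size, with LDS/SGPR terms added later.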

Could you upload patches with full context please

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
91

Vector registers? Like xmm? Or registers in general?

Added full context to the updated patch.

Could you upload patches with full context please

Updated with the full context.

Like xmm. Here in particular, I am referring to the vector register file of a GPU.

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
91

Like xmm. Here in particular, I am referring to the vector register file of a GPU.

...
Agreed. However, I don't see LDS usage in the metadata table in the image. Is it present there?

Yes, see https://llvm.org/docs/AMDGPUUsage.html for the list of what we can expect. What may not be obvious is that the metadata calls it ".group_segment_fixed_size". I don't know the origin of the terminology, maybe opencl?

In theory, a very high sgpr count can limit the number of available workgroups if that's not factored in for determining the number of threads. But in practice, VGPRs tend to be the primary limiting factor. So perhaps we can start with using VGPRs for this purpose and have experience guide us in the future.

If I understand correctly, occupancy rules all look something like (resource available / resource used) == number simultaneous, where one of the resources tends to be limiting. Offhand, I think that's VGPR, SGPR, LDS (group segment). I think there's also an architecture-dependent upper bound on how many things can run at once even if they use very little of those, maybe 8 for gfx9 and 16 for gfx10.

If that's right, perhaps the calculation should look something like:

unsigned vgpr_occupancy = vgpr_available / vgpr_used;
unsigned sgpr_occupancy = sgpr_available / sgpr_used;
unsigned lds_occupancy = lds_available / lds_used;
unsigned limiting_occupancy = std::min({vgpr_occupancy, sgpr_occupancy, lds_occupancy});

and then we derive threadsPerGroup from that occupancy and the various other considerations.
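
As a made-up worked example of that, with placeholder per-CU budgets and the divisions done as available/used so the results are wave counts:

#include <algorithm>
#include <cstdio>

int main() {
  // Placeholder budgets and kernel usage, not real hardware values.
  unsigned vgpr_available = 256, vgpr_used = 64;    // 256/64  -> 4
  unsigned sgpr_available = 800, sgpr_used = 100;   // 800/100 -> 8
  unsigned lds_available = 65536, lds_used = 16384; // 64k/16k -> 4
  unsigned arch_wave_cap = 8;                       // architectural upper bound

  unsigned limiting_occupancy =
      std::min({vgpr_available / vgpr_used, sgpr_available / sgpr_used,
                lds_available / lds_used, arch_wave_cap});
  std::printf("limiting occupancy: %u\n", limiting_occupancy); // prints 4
}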

openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
823

This looks like a drive by copy/paste error fix, maybe post that separately?

If you're currently uploading diffs through the GUI (based on the missing-context comment), that's quite labour intensive. If you change to arcanist, the flow becomes:

git checkout main
git checkout -b some_feature
...edit
git add -u && git commit -m "message"
arc diff main # opens an editor

...
Agreed. However, I don't see LDS usage in the metadata table in the image. Is it present there?

Yes, see https://llvm.org/docs/AMDGPUUsage.html for the list of what we can expect. What may not be obvious is that the metadata calls it ".group_segment_fixed_size". I don't know the origin of the terminology, maybe opencl?

In theory, a very high sgpr count can limit the number of available workgroups if that's not factored in for determining the number of threads. But in practice, VGPRs tend to be the primary limiting factor. So perhaps we can start with using VGPRs for this purpose and have experience guide us in the future.

If I understand correctly, occupancy rules all look something like (resource available / resource used) == number simultaneous, where one of the resources tends to be limiting. Offhand, I think that's VGPR, SGPR, LDS (group segment). I think there's also an architecture-dependent upper bound on how many things can run at once even if they use very little of those, maybe 8 for gfx9 and 16 for gfx10.

If that's right, perhaps the calculation should look something like:

unsigned vgpr_occupancy = vgpr_available / vgpr_used;
unsigned sgpr_occupancy = sgpr_available / sgpr_used;
unsigned lds_occupancy = lds_available / lds_used;
unsigned limiting_occupancy = std::min({vgpr_occupancy, sgpr_occupancy, lds_occupancy});

and then we derive threadsPerGroup from that occupancy and the various other considerations.

Thanks for the pointer to the group segment. Yes, in general, my idea is similar to what you outlined above. However, note that SGPRs and LDS are at different granularities compared to VGPRs. VGPRs are per-thread, SGPRs are shared within a wavefront, and LDS is shared within a workgroup. So while VGPRs can be used to limit the number of threads, perhaps SGPRs and LDS can be used to limit the number of teams.

Let me split up this patch further. I would like to land the default num_teams change sooner rather than later since that's a simple change and has shown improved performance. So let me separate that out. Incorporating SGPRs/LDS to constrain teams/threads will need more experimentation.
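
As a rough sketch of that split, with VGPRs constraining the team size and LDS constraining the number of resident teams; every name and constant below is a placeholder for illustration, not code from this patch:

#include <algorithm>

// Illustrative device limits; the real values would come from HSA queries and
// the kernel's code object metadata.
struct DeviceLimits {
  unsigned VGPRsPerSIMD = 256;
  unsigned LDSBytesPerCU = 65536;
  unsigned WaveSize = 64;
  unsigned MaxWavesPerCU = 32;
};

// VGPRs are per-thread, so they bound how many threads we put in a team.
unsigned threadLimit(const DeviceLimits &D, unsigned KernelVGPRs, unsigned Default) {
  if (KernelVGPRs == 0)
    return Default;
  unsigned Waves = std::max(1u, std::min(D.VGPRsPerSIMD / KernelVGPRs, D.MaxWavesPerCU));
  return std::min(Default, Waves * D.WaveSize);
}

// LDS is shared per workgroup, so it bounds how many teams stay resident on a CU.
// An SGPR term would be analogous, at wavefront granularity.
unsigned teamsPerCU(const DeviceLimits &D, unsigned KernelLDSBytes) {
  if (KernelLDSBytes == 0)
    return D.MaxWavesPerCU;
  return std::max(1u, D.LDSBytesPerCU / KernelLDSBytes);
}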

[libomptarget] [amdgpu] Set number of teams and threads based on GPU occupancy.

Determine the total number of teams for a kernel and the number of threads in each
team in order to maximize occupancy. This change considers the register and LDS
usage of the kernel during occupancy computation.

I haven't tried to understand the control flow yet. Is the idea to map a target region to as large a fraction of a CU as we can, scaling it back when occupancy constraints would force some of it to be idle anyway?

I haven't tried to understand the control flow yet. Is the idea to map a target region to as large a fraction of a CU as we can, scaling it back when occupancy constraints would force some of it to be idle anyway?

Yes, we start with the goal of filling up a CU with a pre-defined number of wavefronts. Given that goal, we try to choose the team count and team size so that their product approaches that pre-defined number of wavefronts. The choices of team count and team size are constrained by register/LDS usage.
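
Roughly, and with all names and constants below as placeholders for how such a tuning could look rather than the exact code in this patch:

#include <algorithm>

// Pick a team size and team count whose product of waves approaches a target
// number of wavefronts per CU, under pre-computed occupancy limits.
void tuneLaunch(unsigned TargetWavesPerCU, unsigned NumCUs, unsigned WaveSize,
                unsigned MaxWavesPerTeam, unsigned MaxTeamsPerCU,
                unsigned &TeamSize, unsigned &NumTeams) {
  // Waves per team, clipped by the register-derived limit.
  unsigned WavesPerTeam = std::max(1u, std::min(MaxWavesPerTeam, TargetWavesPerCU));
  TeamSize = WavesPerTeam * WaveSize;
  // Teams per CU, filling up to the target but not past the LDS-derived limit.
  unsigned TeamsPerCU =
      std::max(1u, std::min(MaxTeamsPerCU, TargetWavesPerCU / WavesPerTeam));
  NumTeams = TeamsPerCU * NumCUs;
}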

[libomptarget] [amdgpu] Set number of teams and threads based on GPU occupancy.

Perform teams/threads tuning in non-generic execution modes.
Do not tune if OMP_TEAMS_THREAD_LIMIT is set.

Determine the total number of teams for a kernel and the number of threads in each
team in order to maximize occupancy. This change considers the register and LDS
usage of the kernel during occupancy computation.

[libomptarget] [amdgpu] Set number of teams and threads based on GPU occupancy.

Ensure that thread count is within the limit.
Perform teams/threads tuning in non-generic execution modes.
Do not tune if OMP_TEAMS_THREAD_LIMIT is set.

Determine the total number of teams for a kernel and the number of threads in each
team in order to maximize occupancy. This change considers the register and LDS
usage of the kernel during occupancy computation.

This stuff definitely needs to be tested.

llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
95

Also, this should be a struct, not an array of unsigned with enums for looking up fields
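
For example, something shaped like this, with field names and values made up for illustration:

// Illustrative only: named fields instead of an unsigned array indexed by enums.
struct GridValues {
  unsigned MaxTeams;
  unsigned MaxThreadsPerTeam;
  unsigned WarpSize;
  unsigned SlotSize;
};

// Made-up example values, not the actual grid values.
constexpr GridValues AMDGPUGridValues = {128, 1024, 64, 256};
constexpr GridValues NVPTXGridValues = {1024, 1024, 32, 256};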

109–110

I don't think these should be added to grid values.

They're not used by clang or LLVM, so they don't need to be shared. I'm not convinced they're architecture-independent constants (I think something has 32k LDS), and it looks like the plugin could use values discovered at runtime.

I think we're better off minimising the quantity of shared magic numbers, so I suggest we write the 64k / 3200 / 64k as constants in the plugin instead.

openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
352

This is (usually) 32 for gfx10, and the plugin is architecture-agnostic, so this probably can't be a compile-time enum.

These should all be unsigned too; negative numbers don't make sense for any of these.
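
For reference, a sketch of discovering the wavefront size at runtime via the standard HSA agent query rather than a compile-time enum; the error handling and fallback value here are placeholders, not a claim about how this patch does it:

#include <hsa/hsa.h>

// Ask the agent for its wavefront size (commonly 32 on gfx10, 64 on gfx9)
// instead of hard-coding it in a shared header.
uint32_t getWavefrontSize(hsa_agent_t Agent) {
  uint32_t WavefrontSize = 0;
  hsa_status_t Err =
      hsa_agent_get_info(Agent, HSA_AGENT_INFO_WAVEFRONT_SIZE, &WavefrontSize);
  return (Err == HSA_STATUS_SUCCESS && WavefrontSize != 0) ? WavefrontSize : 64;
}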