
dhruvachak (Dhruva Chakrabarti)
User

Projects

User does not belong to any projects.

User Details

User Since
Mar 15 2021, 11:51 AM (8 w, 3 h)

Recent Activity

Mar 19 2021

dhruvachak requested review of D99003: [libomptarget] [amdgpu] Change default number of teams per computation unit.
Mar 19 2021, 7:00 PM · Restricted Project
dhruvachak committed rG451e7001a097: Empty test commit, verifying commit access (authored by dhruvachak).
Mar 19 2021, 5:43 PM
dhruvachak added a comment to D98832: [libomptarget] Tune the number of teams and threads for kernel launch..

...
Agreed. However, I don't see LDS usage in the metadata table in the image. Is it present there?

Yes, see https://llvm.org/docs/AMDGPUUsage.html for the list of what we can expect. What may not be obvious is that the metadata calls it ".group_segment_fixed_size". I don't know the origin of the terminology, maybe OpenCL?

In theory, a very high sgpr count can limit the number of available workgroups if that's not factored in for determining the number of threads. But in practice, VGPRs tend to be the primary limiting factor. So perhaps we can start with using VGPRs for this purpose and have experience guide us in the future.

If I understand correctly, occupancy rules all look something like (resource available / resource used) == number simultaneous, where one of the resources tends to be the limiting one. Offhand, I think those are VGPRs, SGPRs, and LDS (the group segment). I think there's also an architecture-dependent upper bound on how many things can run at once even if they use very little of those, maybe 8 for gfx9 and 16 for gfx10.

If that's right, perhaps the calculation should look something like:

uint vgpr_occupancy = vgpr_available / vgpr_used;
uint sgpr_occupancy = sgpr_available / sgpr_used;
uint lds_occupancy = lds_available / lds_used;
uint limiting_occupancy = min(vgpr_occupancy, sgpr_occupancy, lds_occupancy);

and then we derive threadsPerGroup from that occupancy and the various other considerations.

Mar 19 2021, 12:37 PM · Restricted Project, Restricted Project

Mar 18 2021

dhruvachak added a comment to D98832: [libomptarget] Tune the number of teams and threads for kernel launch..

Could you upload patches with full context, please?

Mar 18 2021, 4:03 PM · Restricted Project, Restricted Project
dhruvachak updated the diff for D98832: [libomptarget] Tune the number of teams and threads for kernel launch..

Added full context to the updated patch.

Mar 18 2021, 3:55 PM · Restricted Project, Restricted Project
dhruvachak added a comment to D98832: [libomptarget] Tune the number of teams and threads for kernel launch..

This is really interesting. The idea seems to be to choose the dispatch parameters based on the kernel metadata and the limits of the machine.

What's the underlying heuristic? Break across N CU's in chunks that match the occupancy limits of each CU?

Mar 18 2021, 11:25 AM · Restricted Project, Restricted Project
dhruvachak added a comment to D98829: [libomptarget] Add register usage info to kernel metadata.

Looks good to me, thanks.

We should probably use uint64_t everywhere, instead of sometimes truncating to uint32_t, but that pattern and the 0,0,0,0 one are pre-existing.

Let's go with this and report errors on implausible (e.g. > 4 billion) register counts in a separate patch, along with sanity-checking requested LDS, etc.

Mar 18 2021, 9:51 AM · Restricted Project

Mar 17 2021

dhruvachak requested review of D98832: [libomptarget] Tune the number of teams and threads for kernel launch..
Mar 17 2021, 5:52 PM · Restricted Project, Restricted Project
dhruvachak requested review of D98829: [libomptarget] Add register usage info to kernel metadata.
Mar 17 2021, 5:32 PM · Restricted Project