This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: allow specifying a workgroup size that needs to fit in a compute unit
ClosedPublic

Authored by bnieuwenhuizen on Mar 21 2016, 3:38 PM.

Details

Summary

For GL_ARB_compute_shader we need to support workgroup sizes of at least 1024. However, if we want to allow large workgroup sizes, we may need to use less registers, as we have to run more waves per SIMD.

This patch adds an attribute to specify the maximum work group size the compiled program needs to support. It defaults, to 256, as that has no wave restrictions.

Reducing the number of registers available is done similarly to how the registers were reserved for chips with the sgpr init bug.

Diff Detail

Event Timeline

bnieuwenhuizen retitled this revision from to AMDGPU: allow specifying a workgroup size that needs to fit in a compute unit.
bnieuwenhuizen updated this object.
arsenm edited edge metadata.Mar 21 2016, 3:50 PM

I have another patch which already fixes a hardcoded maximum of 256: http://reviews.llvm.org/D16834

You can try merging this what that one.

New attributes should be lowercased,use '-' separated words and start with an "amdgpu-" prefix

lib/Target/AMDGPU/AMDGPU.h
131–132 ↗(On Diff #51238)

I think theses should be distinguished by the ABI rather than introducing a new shader type. Also semi-related, I would like to replace shader types with calling conventions

Thanks for the quick feedback.

Which parts of which patch do you want merged? subtargets + adding register reservation and then calling convention in a separate patch?

On second thought I think I would prefer a function attribute, as we get exact size information per program. Similarly we have in OpenCL the reqd_work_group_size to specify that a specific kernel will always be used with that size. As we can have multiple kernels in OpenCL and as far as I understand a subtarget is global, I think a function attribute would be more suited.

nhaehnle edited edge metadata.Mar 21 2016, 9:06 PM

Thanks for looking at compute :)

I also think the function attribute is the better solution for workgroup size. We want to create the TargetMachine object only once per context (or possibly per compile thread in the future) in Mesa.

For your tests, it might be good to check the spilling stride as well, though I'm not sure how robust that is. Spill tests tend to be annoying.

lib/Target/AMDGPU/AMDGPU.h
131–132 ↗(On Diff #51238)

I agree, the GL_COMPUTE thing should be redundant. For the graphics shader types, there might be a very small number of use cases (e.g. the WholeQuadMode pass is completely skipped unless we're compiling a pixel shader).

lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
208–211

I think it would be nicer to just set the MaximumWorkGroup variable to the correct value during initialization.

lib/Target/AMDGPU/SIRegisterInfo.cpp
28–31

I find the min/max naming here confusing. Surely this function should be named getMaxWaveCount?

Also, what about getMaxWaveCountPerSIMD?

43

VI+ rounds SGPR allocations to multiples of 16.

45–53

As Marek has pointed out in the past, the calculations surrounding SGPR allocations are a bit questionable.

The 102/104 limit comes from the instruction encoding, which can encode 104 SGPRs in <= CIK and 102 SGPRs in >= VI. The subtraction of 2/6 comes from VCC/XNACK/FLAT_SCR - but we don't have to address those as regular SGPRs!

So on VI+, it's perfectly fine to use 102 SGPRs, you just then have to tell the hardware to reserve >= 108 slots in the SGPR register bank. Due to rounding up to multiples of 16, this ends up allocating 112 SGPRs, which does waste four slots per wave, but whatever.

In practice, what this means for your calculation is that it should be std::min(AllowedSGPRCount - 2/6, 104/102) (except for the InitBug thing, where you should indeed clamp AllowedSGPRCount before taking space for VCC & friends.

Scratch spilling is per wave though and not per workgroup? Therefore the strides should be the same. For LDS spilling it would be different, but even though we have some functions (SIInstrInfo.cpp, calculateLDSSpillAddress) for LDS spilling I don't see them getting used anywhere?

I will import code + tests from D16834 to handle LDS allocation of variables.

bnieuwenhuizen updated this object.
bnieuwenhuizen edited edge metadata.
v2: - changed naming
    - fixed SGPR register limits
    - also adjusted AMDGPUPromoteAlloca pass

Thanks. And yes, I think you're right about the spills, I got confused there.

lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
498–506

Fix the comment :)

lib/Target/AMDGPU/SIRegisterInfo.cpp
62

I was confused by that closing brace for a moment. I'm not sure what the LLVM coding style says if anything, but personally I find something like // anonymous namespace helpful.

lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
128

According to Matt, there should be an amdgpu- prefix here.

arsenm added inline comments.Mar 23 2016, 9:19 AM
lib/Target/AMDGPU/SIRegisterInfo.cpp
39

This should not be a magic number. Also MaxWorkgroupSGPRCount would be a better name

40

There is a alignTo function you should use

bnieuwenhuizen added inline comments.Mar 23 2016, 9:31 AM
lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
498–506

We don't technically handle that function attribute yet. I can move it somewhere more appropiate, like getMaxWorkGroupSize or also make getMaxWorkGroupSize detect and use reqd_work_group_size.

I noticed though that we have all the information for reqd_work_group_size in mesa (unless we want to support ARB_compute_variable_group_size). Maybe it is worth it to switch completely to reqd_work_group_size.

lib/Target/AMDGPU/SIRegisterInfo.cpp
62

Seems like The LLVM coding style wants static instead of anonymous namespaces for functions (http://llvm.org/docs/CodingStandards.html#anonymous-namespaces). I will change that.

bnieuwenhuizen added inline comments.Mar 23 2016, 9:57 AM
lib/Target/AMDGPU/SIRegisterInfo.cpp
40

Note that alignTo rounds up, this rounds down.

nhaehnle added inline comments.Mar 24 2016, 10:44 AM
lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
498–506

I see what you mean. The semantics are also slightly different. It does makes sense to keep a comment about this.

lib/Target/AMDGPU/SIRegisterInfo.cpp
62

Thanks. With that changed, and arguably with a small alignDown helper, the patch LGTM.

v3: - adjust reqd_work_group_size FIXME
    - removed anonymous namespace
    - Use one set of calculations for getMaxWorkGroupSGPRCount
    - Add alignDown

I noticed another place in the patch where alignDown would improve the code.

My apologies for not noticing it for v3.

I think this looks OK overall. I'm a little nervous about the changes to scratch handling introducing bugs, since this is a little hard to test. I would recommend doing a shader-db run on this code, because that has more register spills than piglit.

bnieuwenhuizen edited edge metadata.

There are no changes in shader-db. Also rebased the patch (minor changes related to the calling convention change).

nhaehnle accepted this revision.Apr 12 2016, 2:58 PM
nhaehnle edited edge metadata.

LGTM

This revision is now accepted and ready to land.Apr 12 2016, 2:58 PM
arsenm added inline comments.Apr 12 2016, 3:07 PM
lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
484

& should be on RHS and this should be moved down to uses

lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
51

Should be set to 0 here

125

Why is this function needed? You can just compare to F->getCallingConv directly

lib/Target/AMDGPU/SIRegisterInfo.cpp
143

Space after if

144–146

Magic numbers

lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
128

I would also probably reduce this to -max- rather than -maximum- since most other maximums do this

test/CodeGen/AMDGPU/large-work-group-promote-alloca.ll
26

Attributes should be moved to the bottom. Also run the test through opt with no arguments to simplify the sets

bnieuwenhuizen added inline comments.Apr 12 2016, 4:28 PM
lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
125

I want to have everything in the compute stage here. Note that we have multiple calling conventions for that: at least one for compute kernels and one for mesa shaders.

This is a helper to check if it is one of the compute stage calling conventions, so we only have one place we need to remember to put all the conventions.

bnieuwenhuizen edited edge metadata.

Should fix all comments from Matt except for the isCompute call.