This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Calculate minimum allowed occupancy based on threads per lane
Needs RevisionPublic

Authored by foad on Jul 13 2020, 6:11 AM.

Details

Summary

Previously the minimum allowed occupancy (for memory-bound functions)
was a constant 4 waves per SIMD. Now it is a constant 16 threads per
SIMD lane, which better accounts for the architectural changes in GFX10.

Diff Detail

Event Timeline

foad created this revision.Jul 13 2020, 6:11 AM
rampitec requested changes to this revision.Jul 13 2020, 11:25 AM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
200

This is not expressable in a number of VGPRs. The limiter is about memory and not about register budget at all. Meanwhile it looks like 8 waves might be more reasonable for wave32, but wave64 shall still use the same value across the targets, because at the end of the day it boils down to a number of concurrent memory operations.

This revision now requires changes to proceed.Jul 13 2020, 11:25 AM
nhaehnle added inline comments.Jul 16 2020, 12:48 PM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
200

I agree that using number of VGPRs here is an awkward roundabout way of doing it. The point isn't really about VGPRs, it's about how many threads are alive per lane in order to allow a good mix of VALU and VMEM instruction issue in the face of certain (IMHO underspecified) memory pressure situations.

I also disagree that 8 is the right answer for Wave32, for precisely the same reason: it is about the number of concurrent memory operations. 16x Wave32 gives you the same amount of memory operations per CU (2 SIMD * 16 waves * 32 = 1024 workitems) as 4 waves on <=gfx9 (4 SIMD * 4 waves * 64 = 1024 workitems).

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
919

The ST argument requirement seems annoying. Wouldn't it make sense for the MachineFunctionInfo to know the underlying MachineFunction, which would allow access to the GCNSubtarget (i.e. most-derived TargetSubtargetInfo)?

arsenm added inline comments.Jul 16 2020, 12:52 PM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
919

No, I'm actively trying to remove references to the machine function state from MachineFunctionInfo

rampitec added inline comments.Jul 16 2020, 12:59 PM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
200

I can understand the logic behind it, but using of TotalNumVGPRs() is not really what we want here. It probably needs to use getMaxWavesPerEU() and wavefront size to estimate a number of threads in flight. Besides the same logic will give answer 16 even on gfx1030 where 16 is the absolute maximum. The latter does not sound right.

nhaehnle added inline comments.Jul 16 2020, 1:01 PM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
919

Why is that? MachineFunctionInfo is explicitly created with the MachineFunction as a constructor argument?

arsenm added inline comments.Jul 16 2020, 1:02 PM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
919

That's what I'm trying to fix in D80249

nhaehnle added inline comments.Jul 16 2020, 1:03 PM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
200

I agree that gfx10 can probably drop the requirement from the 16 that would be equivalent to gfx9. I've heard numbers in the 10-12 range being thrown around as where dropping lower than that hurts performance. On gfx9, the consensus seemed to be 4 waves for the same kind of folklore. I don't think we really have enough experience and thorough performance studies with gfx10 yet to decide this finally. Maybe it should be a cl::opt... (not a function attribute, since this is specifically about the default heuristic).

rampitec added inline comments.Jul 16 2020, 1:05 PM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
200

Probably, and I do not think we can use simple divide x by y logic here.

nhaehnle added inline comments.Jul 16 2020, 1:11 PM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
919

I see. I commented on that change because I don't think it's quite right, but regardless of the outcome of that discussion: D80249 still gives the MFI accesses to the TargetSubtargetInfo, which is what we're talking about here. Even with that change, passing the ST yet again into this method should be unnecessary.