Previously the minimum allowed occupancy (for memory-bound functions)
was a constant 4 waves per SIMD. Now it is a constant 16 threads per
SIMD lane, which better accounts for the architectural changes in GFX10.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp | ||
---|---|---|
200 | This is not expressable in a number of VGPRs. The limiter is about memory and not about register budget at all. Meanwhile it looks like 8 waves might be more reasonable for wave32, but wave64 shall still use the same value across the targets, because at the end of the day it boils down to a number of concurrent memory operations. |
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp | ||
---|---|---|
200 | I agree that using number of VGPRs here is an awkward roundabout way of doing it. The point isn't really about VGPRs, it's about how many threads are alive per lane in order to allow a good mix of VALU and VMEM instruction issue in the face of certain (IMHO underspecified) memory pressure situations. I also disagree that 8 is the right answer for Wave32, for precisely the same reason: it is about the number of concurrent memory operations. 16x Wave32 gives you the same amount of memory operations per CU (2 SIMD * 16 waves * 32 = 1024 workitems) as 4 waves on <=gfx9 (4 SIMD * 4 waves * 64 = 1024 workitems). | |
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h | ||
919 | The ST argument requirement seems annoying. Wouldn't it make sense for the MachineFunctionInfo to know the underlying MachineFunction, which would allow access to the GCNSubtarget (i.e. most-derived TargetSubtargetInfo)? |
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h | ||
---|---|---|
919 | No, I'm actively trying to remove references to the machine function state from MachineFunctionInfo |
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp | ||
---|---|---|
200 | I can understand the logic behind it, but using of TotalNumVGPRs() is not really what we want here. It probably needs to use getMaxWavesPerEU() and wavefront size to estimate a number of threads in flight. Besides the same logic will give answer 16 even on gfx1030 where 16 is the absolute maximum. The latter does not sound right. |
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h | ||
---|---|---|
919 | Why is that? MachineFunctionInfo is explicitly created with the MachineFunction as a constructor argument? |
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp | ||
---|---|---|
200 | I agree that gfx10 can probably drop the requirement from the 16 that would be equivalent to gfx9. I've heard numbers in the 10-12 range being thrown around as where dropping lower than that hurts performance. On gfx9, the consensus seemed to be 4 waves for the same kind of folklore. I don't think we really have enough experience and thorough performance studies with gfx10 yet to decide this finally. Maybe it should be a cl::opt... (not a function attribute, since this is specifically about the default heuristic). |
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp | ||
---|---|---|
200 | Probably, and I do not think we can use simple divide x by y logic here. |
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h | ||
---|---|---|
919 | I see. I commented on that change because I don't think it's quite right, but regardless of the outcome of that discussion: D80249 still gives the MFI accesses to the TargetSubtargetInfo, which is what we're talking about here. Even with that change, passing the ST yet again into this method should be unnecessary. |
The ST argument requirement seems annoying. Wouldn't it make sense for the MachineFunctionInfo to know the underlying MachineFunction, which would allow access to the GCNSubtarget (i.e. most-derived TargetSubtargetInfo)?