"Per CU" is a bit simplistic for gfx10, but I couldn't think of a better
name.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
This affects the calculation of how many VGPRs we can use for a gfx10 cu-mode compute shader that specifies amdgpu-flat-work-group-size. How should I test it?
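For reviewers who want to sanity-check expected limits when writing the test, here is a rough sketch of the arithmetic involved. It is a simplified model, not the code in the patch: the struct, the helper names, and every hardware number (wave size, SIMDs shared by a work group, VGPR file size, allocation granule) are assumptions chosen for illustration and should be checked against the real subtarget queries.

```cpp
#include <algorithm>

// Hypothetical model of the occupancy inputs. None of these names come from
// LLVM; the values you plug in are assumptions about the target.
struct OccupancyModel {
  unsigned WaveSize;           // work-items per wave (32 or 64)
  unsigned SIMDsPerWorkGroup;  // SIMDs that the waves of one work group share
  unsigned TotalVGPRsPerSIMD;  // size of the VGPR file on one SIMD
  unsigned VGPRAllocGranule;   // VGPRs are allocated in multiples of this
  unsigned AddressableVGPRs;   // most VGPRs a single wave can address
};

// Integer division rounding up.
inline unsigned divideCeil(unsigned A, unsigned B) { return (A + B - 1) / B; }

// Fewest waves that must be resident on one SIMD so that a whole work group
// of FlatWorkGroupSize work-items fits on the SIMDs it shares.
inline unsigned minWavesPerSIMD(const OccupancyModel &M,
                                unsigned FlatWorkGroupSize) {
  unsigned WavesPerWorkGroup = divideCeil(FlatWorkGroupSize, M.WaveSize);
  return divideCeil(WavesPerWorkGroup, M.SIMDsPerWorkGroup);
}

// VGPR budget per wave: split the register file across the required waves,
// round down to the allocation granule, and clamp to what is addressable.
inline unsigned maxVGPRsPerWave(const OccupancyModel &M,
                                unsigned FlatWorkGroupSize) {
  unsigned PerWave = M.TotalVGPRsPerSIMD / minWavesPerSIMD(M, FlatWorkGroupSize);
  PerWave -= PerWave % M.VGPRAllocGranule;
  return std::min(PerWave, M.AddressableVGPRs);
}
```

Under the assumed values {32, 2, 1024, 8, 256}, a work group size of 1024 gives a budget of 64 VGPRs per wave, while sharing across 4 SIMDs instead of 2 gives 128. The point of the sketch is only that the number of SIMDs a work group shares changes the limit.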
llvm/test/CodeGen/AMDGPU/attr-amdgpu-flat-work-group-size-vgpr-limit.ll | ||
---|---|---|
3 | Specifying -regalloc=fast is not reliable. With fast register allocation, LIS = getAnalysisIfAvailable<LiveIntervals>(); returns nullptr in the "si-lower-sgpr-spills" pass, so the pass does not create slot indexes for the instructions it inserts. The machine instruction verifier then fails when it checks the slot indexes. This can be reproduced with the test case below. Is it possible to use the greedy register allocator and reduce the compile time for this test case? define internal void @use256vgprs() { %v0 = call i32 asm sideeffect "; def $0", "=v"() %v1 = call i32 asm sideeffect "; def $0", "=v"() call void asm sideeffect "; use $0", "v"(i32 %v0) call void asm sideeffect "; use $0", "v"(i32 %v1) ret void } define amdgpu_kernel void @f256() #256 { call void @use256vgprs() ret void } attributes #256 = { nounwind "amdgpu-flat-work-group-size"="256,256" } define amdgpu_kernel void @f512() #512 { call void @foo() call void @use256vgprs() ret void } attributes #512 = { nounwind "amdgpu-flat-work-group-size"="512,512" } define amdgpu_kernel void @f1024() #1024 { call void @foo() call void @use256vgprs() ret void } attributes #1024 = { nounwind "amdgpu-flat-work-group-size"="1024,1024" } declare void @foo() |
llvm/test/CodeGen/AMDGPU/attr-amdgpu-flat-work-group-size-vgpr-limit.ll | ||
---|---|---|
3 | That sounds like a bug in SILowerSGPRSpills. It should not claim to preserve SlotIndexes if it does not preserve them. |
llvm/test/CodeGen/AMDGPU/attr-amdgpu-flat-work-group-size-vgpr-limit.ll | ||
---|---|---|
3 | The pass only updates SlotIndexes through LiveIntervals, assuming the two always come as a pair. With -verify-machineinstrs, we somehow end up with SlotIndexes but without LiveIntervals. |
llvm/test/CodeGen/AMDGPU/attr-amdgpu-flat-work-group-size-vgpr-limit.ll | ||
---|---|---|
3 | Apparently StackColoring runs at -O0 and only uses SlotIndexes |