This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Correct the number of SGPR blocks used for GFX9
AbandonedPublic

Authored by rochauha on Jul 20 2020, 11:58 AM.

Details

Summary

Edit : Updating the summary based on comments

Even though granularity is 8, the roundup must be an even number of 8-granules for GFX9.
Probably this also needs to be mentioned in https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-compute-pgm-rsrc1-gfx6-gfx10-table for GRANULATED_WAVEFRONT_SGPR_COUNT.

The difference is seen when a the rounded value aligns to 8 but not to 16. (for example 40, 56).
This patch corrects the roundup for GFX9, hence correcting the number of SGPRBlocks.

Diff Detail

Event Timeline

rochauha created this revision.Jul 20 2020, 11:58 AM
arsenm requested changes to this revision.Jul 21 2020, 7:13 AM

Needs test

This revision now requires changes to proceed.Jul 21 2020, 7:13 AM
foad added a subscriber: foad.Jul 21 2020, 7:54 AM
foad added inline comments.
llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
439–441

Why have you changed this?

rochauha marked an inline comment as done.Jul 21 2020, 8:28 AM
rochauha added inline comments.
llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
439–441

To follow the computation of GRANULATED_WAVEFRONT_SGPR_COUNT for GFX9, as mentioned in https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-compute-pgm-rsrc1-gfx6-gfx10-table

rochauha marked an inline comment as not done.Jul 21 2020, 8:29 AM
rochauha added a comment.EditedJul 21 2020, 8:40 AM

Needs test

I think these changes are tested using the test in https://reviews.llvm.org/D80713.
In fact the issue was found when testing round tripping for the above patch. I guess this can only be verified by a round trip test when we assemble->disassemble->re-assemble? Such a test is already present in the patch for D80713.

I am not sure how else can we look at the value of GRANULATED_WAVEFRONT_SGPR_COUNT in a test case.

Needs test

I think these changes are tested using the test in https://reviews.llvm.org/D80713.
In fact the issue was found when testing round tripping for the above patch. I guess this can only be verified by a round trip test when we assemble->disassemble->re-assemble? Such a test is already present in the patch for D80713.

I am not sure how else can we look at the value of GRANULATED_WAVEFRONT_SGPR_COUNT in a test case.

I think you should be able to test this by adding another KD case to llvm/test/MC/AMDGPU/hsa-v3.s, and just checking the hexdump of the KD as for the other cases there. It should be pretty painless, you can just copy-paste the minimal one, set the SGPR count to trigger the bug, and update the GRANULATED_WAVEFRONT_SGPR_COUNT bits in the expected dump.

scott.linder added inline comments.Jul 21 2020, 3:08 PM
llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
348

I don't know if this is actually accurate, I think the reason for the "2 *" in the equation for GFX9 is not because the allocation granule is 16. It is still 8 for gfx9, but there is an additional constraint that you must allocate an even number of granules.

It is a bit confusing, and I would like @kzhuravl to weigh in as IIRC he was who originally helped me understand this when we were updating the assembler.

439–440

If the above is true, and the granule for gfx9 is in fact 8, then I would just move all of the handling of the "even" requirement into this function, i.e. change this to:

unsigned NumSGPRBlocks = NumSGPRs / (isGFX9(*STI) ? 2 * getSGPREncodingGranule(STI) : getSGPREncodingGranule(STI)) - 1;
foad added inline comments.Jul 22 2020, 12:32 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
439–440

The current patch does have the advantage that it closely matches the documentation that Ronak pointed to. Though I suppose we could update the documentation too.

foad added inline comments.Jul 22 2020, 12:34 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
434–440

Incidentally the alignTo and the division could be combined into a single call to divideCeil.

t-tye added inline comments.Jul 22 2020, 12:59 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
348

For GFX9 the granularity is as specified in AMDGPUUsage which is 8. As @scott.linder mentions SPI rounds up to an even number of 8-granules. From the hardware spec:

Number of SGPRS, granularity 8. SPI rounds up reg setting and allocs gran16. Range is from 0-13 allocating (SGPRS/2+1)*16: 16,16,32,32 ... 112,112

rochauha edited the summary of this revision. (Show Details)Jul 22 2020, 11:45 AM
rochauha updated this revision to Diff 279902.EditedJul 22 2020, 11:57 AM

Updated patch based on comments.
Updated old tests.
Added new test.

rochauha marked 3 inline comments as done.Jul 22 2020, 11:59 AM
rochauha added inline comments.
llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
434–440

Done.

foad added inline comments.Jul 23 2020, 1:20 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
439

Don't you still need a std::max somewhere in here to cope with the NumSGPRs==0 case?

rochauha updated this revision to Diff 280046.Jul 23 2020, 1:54 AM
rochauha marked an inline comment as done.

Added missing std::max.

rochauha marked 2 inline comments as done.Jul 23 2020, 1:55 AM
rochauha added inline comments.
llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
439

Done. Thanks!

I discussed with Tony today, and I was thinking about this the wrong way.

SPI does not require the granule count to be even, it just rounds up the granule count before actually performing the allocation. This means, from the compiler's perspective, when it is calculating things like the AMDGPU::IsaInfo::getMaxNumSGPRs it must consider the "allocation" granule size (IsaInfo::getSGPRAllocGranule). Conversely, from the assembler/diassembler perspective, it must consider the "encoding" granule size (IsaInfo::getSGPREncodingGranule). It is perfectly OK to have a GFX9 code object with a granulated SGPR count of 1, and we should allow emitting that in the assembler so that the disassembler can accurately reproduce those code objects.

I don't think there is any fix needed here, we already separate these two concepts and correctly apply them elsewhere. I think I just led you astray in the disassembly patch; you should only be using the encoding granule size, and shouldn't need any special handling for e.g. GFX9 to handle the fact that the allocation and encoding granule sizes are not equal.

rochauha marked an inline comment as done.Jul 25 2020, 3:13 AM

I discussed with Tony today, and I was thinking about this the wrong way.

SPI does not require the granule count to be even, it just rounds up the granule count before actually performing the allocation. This means, from the compiler's perspective, when it is calculating things like the AMDGPU::IsaInfo::getMaxNumSGPRs it must consider the "allocation" granule size (IsaInfo::getSGPRAllocGranule). Conversely, from the assembler/diassembler perspective, it must consider the "encoding" granule size (IsaInfo::getSGPREncodingGranule). It is perfectly OK to have a GFX9 code object with a granulated SGPR count of 1, and we should allow emitting that in the assembler so that the disassembler can accurately reproduce those code objects.

I don't think there is any fix needed here, we already separate these two concepts and correctly apply them elsewhere. I think I just led you astray in the disassembly patch; you should only be using the encoding granule size, and shouldn't need any special handling for e.g. GFX9 to handle the fact that the allocation and encoding granule sizes are not equal.

Correct me if I'm wrong. So we must not take inverse of the mentioned GFX9 calculation (the one where we divide by 16 before roundup) as it is for allocation granule size? And hence the disassembly computation will be same for GFX6-8 and GFX9 (because the encoding granule size is the same)?

I discussed with Tony today, and I was thinking about this the wrong way.

SPI does not require the granule count to be even, it just rounds up the granule count before actually performing the allocation. This means, from the compiler's perspective, when it is calculating things like the AMDGPU::IsaInfo::getMaxNumSGPRs it must consider the "allocation" granule size (IsaInfo::getSGPRAllocGranule). Conversely, from the assembler/diassembler perspective, it must consider the "encoding" granule size (IsaInfo::getSGPREncodingGranule). It is perfectly OK to have a GFX9 code object with a granulated SGPR count of 1, and we should allow emitting that in the assembler so that the disassembler can accurately reproduce those code objects.

I don't think there is any fix needed here, we already separate these two concepts and correctly apply them elsewhere. I think I just led you astray in the disassembly patch; you should only be using the encoding granule size, and shouldn't need any special handling for e.g. GFX9 to handle the fact that the allocation and encoding granule sizes are not equal.

Correct me if I'm wrong. So we must not take inverse of the mentioned GFX9 calculation (the one where we divide by 16 before roundup) as it is for allocation granule size? And hence the disassembly computation will be same for GFX6-8 and GFX9 (because the encoding granule size is the same)?

Correct, you can treat all hardware the same and calculate:

NumSGPRs = (NumSGPRBlocks + 1) * getSGPREncodingGranule()

I still think it might be good to make this into a function in AMDGPU::IsaInfo to be the inverse of getNumSGPRBlocks

rochauha abandoned this revision.Aug 8 2020, 1:00 AM

Based on comments and discussion, the difference for GFX9 is being handled using allocation granule sizes and no change is required.