This is an archive of the discontinued LLVM Phabricator instance.


Differential D84194

[AMDGPU] Correct the number of SGPR blocks used for GFX9
Abandoned · Public

Authored by rochauha on Jul 20 2020, 11:58 AM.


Details

Reviewers

scott.linder
t-tye
arsenm

Summary

Edit: Updated the summary based on review comments.

Even though granularity is 8, the roundup must be an even number of 8-granules for GFX9.
Probably this also needs to be mentioned in https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-compute-pgm-rsrc1-gfx6-gfx10-table for GRANULATED_WAVEFRONT_SGPR_COUNT.

The difference is seen when the rounded value aligns to 8 but not to 16 (for example 40 or 56).
This patch corrects the roundup for GFX9, hence correcting the number of SGPRBlocks.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	1,050 ms	windows > LLVM.CodeGen/AMDGPU/GlobalISel::Unknown Unit Message ("")
	980 ms	windows > LLVM.MC/AMDGPU::Unknown Unit Message ("")

Event Timeline

rochauha created this revision. (Jul 20 2020, 11:58 AM)

Herald added a project: Restricted Project. (Jul 20 2020, 11:58 AM)

Herald added subscribers: llvm-commits, kerbowa, hiraditya and 8 others.

Harbormaster failed remote builds in B64971: Diff 279319! (Jul 20 2020, 10:46 PM)

Needs test

This revision now requires changes to proceed. (Jul 21 2020, 7:13 AM)

foad added a subscriber: foad. (Jul 21 2020, 7:54 AM)

foad added inline comments.

llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
437–438	Why have you changed this?

rochauha marked an inline comment as done. (Jul 21 2020, 8:28 AM)

rochauha added inline comments.

llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
437–438	To follow the computation of `GRANULATED_WAVEFRONT_SGPR_COUNT` for GFX9, as mentioned in https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-compute-pgm-rsrc1-gfx6-gfx10-table

rochauha marked an inline comment as not done. (Jul 21 2020, 8:29 AM)

In D84194#2164227, @arsenm wrote:

Needs test

I think these changes are tested using the test in https://reviews.llvm.org/D80713.
In fact the issue was found when testing round tripping for the above patch. I guess this can only be verified by a round trip test when we assemble->disassemble->re-assemble? Such a test is already present in the patch for D80713.

I am not sure how else we can look at the value of GRANULATED_WAVEFRONT_SGPR_COUNT in a test case.

In D84194#2164406, @rochauha wrote:

I think these changes are tested using the test in https://reviews.llvm.org/D80713.
In fact the issue was found when testing round tripping for the above patch. I guess this can only be verified by a round trip test when we assemble->disassemble->re-assemble? Such a test is already present in the patch for D80713.

I am not sure how else we can look at the value of GRANULATED_WAVEFRONT_SGPR_COUNT in a test case.

I think you should be able to test this by adding another KD case to llvm/test/MC/AMDGPU/hsa-v3.s, and just checking the hexdump of the KD as for the other cases there. It should be pretty painless, you can just copy-paste the minimal one, set the SGPR count to trigger the bug, and update the GRANULATED_WAVEFRONT_SGPR_COUNT bits in the expected dump.

scott.linder added inline comments. (Jul 21 2020, 3:08 PM)

llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
348	I don't know if this is actually accurate. I think the reason for the "2 *" in the equation for GFX9 is not that the allocation granule is 16. It is still 8 for gfx9, but there is an additional constraint that you must allocate an even number of granules. It is a bit confusing, and I would like @kzhuravl to weigh in, as IIRC he was the one who originally helped me understand this when we were updating the assembler.
437	If the above is true, and the granule for gfx9 is in fact 8, then I would just move all of the handling of the "even" requirement into this function, i.e. change this to: unsigned NumSGPRBlocks = NumSGPRs / (isGFX9(STI) ? 2 * getSGPREncodingGranule(STI) : getSGPREncodingGranule(STI)) - 1;

foad added inline comments. (Jul 22 2020, 12:32 AM)

llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
437	The current patch does have the advantage that it closely matches the documentation that Ronak pointed to. Though I suppose we could update the documentation too.

foad added inline comments. (Jul 22 2020, 12:34 AM)

llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
435–437	Incidentally the alignTo and the division could be combined into a single call to divideCeil.
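The equivalence behind this suggestion can be sketched in a few lines. This is only an illustration: `alignToSketch` and `divideCeilSketch` are hypothetical stand-ins written for this note that mimic the unsigned arithmetic of `alignTo` and `divideCeil`; they are not the LLVM implementations, and the granule value of 8 is taken from the discussion.

```cpp
#include <cassert>

// Hypothetical stand-in: round Value up to the next multiple of Align.
constexpr unsigned alignToSketch(unsigned Value, unsigned Align) {
  return (Value + Align - 1) / Align * Align;
}

// Hypothetical stand-in: integer division that rounds up.
constexpr unsigned divideCeilSketch(unsigned Numerator, unsigned Denominator) {
  return (Numerator + Denominator - 1) / Denominator;
}

// Two-step form: align to the granule, then divide by it.
constexpr unsigned granulesTwoStep(unsigned NumSGPRs, unsigned Granule) {
  return alignToSketch(NumSGPRs, Granule) / Granule;
}

// Single-call form: one rounding-up division.
constexpr unsigned granulesOneStep(unsigned NumSGPRs, unsigned Granule) {
  return divideCeilSketch(NumSGPRs, Granule);
}

// Returns true if both forms agree for every count in [0, MaxSGPRs].
inline bool formsAgree(unsigned Granule, unsigned MaxSGPRs) {
  for (unsigned N = 0; N <= MaxSGPRs; ++N)
    if (granulesTwoStep(N, Granule) != granulesOneStep(N, Granule))
      return false;
  return true;
}
```

Calling `formsAgree(8, 128)` should return true, which is why the two-step computation can be folded into a single rounding-up division.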

t-tye added inline comments. (Jul 22 2020, 12:59 AM)

llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
348	For GFX9 the granularity is as specified in AMDGPUUsage which is 8. As @scott.linder mentions SPI rounds up to an even number of 8-granules. From the hardware spec: Number of SGPRS, granularity 8. SPI rounds up reg setting and allocs gran16. Range is from 0-13 allocating (SGPRS/2+1)*16: 16,16,32,32 ... 112,112
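The allocation rule quoted from the hardware spec can be written out as a small sketch. The function name `allocatedSGPRsGFX9` is hypothetical and exists only for this illustration; the formula and the expected sequence (16, 16, 32, 32 ... 112, 112) come from the comment above.

```cpp
#include <cassert>

// Hypothetical helper: number of SGPRs that SPI allocates for a given
// granulated count in the range 0-13, per the quoted formula
// (SGPRS / 2 + 1) * 16. Integer division pairs up consecutive counts.
constexpr unsigned allocatedSGPRsGFX9(unsigned GranulatedCount) {
  return (GranulatedCount / 2 + 1) * 16;
}

// Compile-time checks against the sequence quoted from the spec.
static_assert(allocatedSGPRsGFX9(0) == 16, "0 allocates 16");
static_assert(allocatedSGPRsGFX9(1) == 16, "1 allocates 16");
static_assert(allocatedSGPRsGFX9(2) == 32, "2 allocates 32");
static_assert(allocatedSGPRsGFX9(3) == 32, "3 allocates 32");
static_assert(allocatedSGPRsGFX9(12) == 112, "12 allocates 112");
static_assert(allocatedSGPRsGFX9(13) == 112, "13 allocates 112");
```

Both an even count and the odd count that follows it yield the same allocation, which matches the description that the encoded granularity stays at 8 while the allocation happens in units of 16.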

rochauha edited the summary of this revision. (Jul 22 2020, 11:45 AM)

Updated patch based on comments.
Updated old tests.
Added new test.

rochauha marked 3 inline comments as done. (Jul 22 2020, 11:59 AM)

rochauha added inline comments.

llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
435–437	Done.

Harbormaster failed remote builds in B65271: Diff 279902! (Jul 22 2020, 12:28 PM)

foad added inline comments. (Jul 23 2020, 1:20 AM)

llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
437–438	Don't you still need a std::max somewhere in here to cope with the NumSGPRs==0 case?

Added missing std::max.
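The reason the guard matters can be shown with a short sketch. It assumes the block count is computed with unsigned arithmetic as "rounded-up granule count minus 1"; the helper names below are hypothetical and are not the functions from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>

// Hypothetical stand-in: integer division that rounds up.
constexpr unsigned divideCeilSketch(unsigned Numerator, unsigned Denominator) {
  return (Numerator + Denominator - 1) / Denominator;
}

// Without a guard: for NumSGPRs == 0 the granule count is 0, and the
// unsigned subtraction 0 - 1 wraps around to the largest unsigned value.
constexpr unsigned blocksUnguarded(unsigned NumSGPRs, unsigned Granule) {
  return divideCeilSketch(NumSGPRs, Granule) - 1;
}

// With the guard: treat 0 SGPRs like 1, so at least one granule is counted
// and the subtraction cannot wrap.
constexpr unsigned blocksGuarded(unsigned NumSGPRs, unsigned Granule) {
  return divideCeilSketch(std::max(1u, NumSGPRs), Granule) - 1;
}
```

With a granule of 8, `blocksUnguarded(0, 8)` wraps to `std::numeric_limits<unsigned>::max()`, while `blocksGuarded(0, 8)` yields 0; for any non-zero count the two agree.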

rochauha marked 2 inline comments as done. (Jul 23 2020, 1:55 AM)

rochauha added inline comments.

llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp
437–438	Done. Thanks!

Harbormaster failed remote builds in B65344: Diff 280046! (Jul 23 2020, 2:25 AM)

I discussed with Tony today, and I was thinking about this the wrong way.

SPI does not require the granule count to be even; it just rounds up the granule count before actually performing the allocation. This means, from the compiler's perspective, when it is calculating things like the AMDGPU::IsaInfo::getMaxNumSGPRs it must consider the "allocation" granule size (IsaInfo::getSGPRAllocGranule). Conversely, from the assembler/disassembler perspective, it must consider the "encoding" granule size (IsaInfo::getSGPREncodingGranule). It is perfectly OK to have a GFX9 code object with a granulated SGPR count of 1, and we should allow emitting that in the assembler so that the disassembler can accurately reproduce those code objects.

I don't think there is any fix needed here, we already separate these two concepts and correctly apply them elsewhere. I think I just led you astray in the disassembly patch; you should only be using the encoding granule size, and shouldn't need any special handling for e.g. GFX9 to handle the fact that the allocation and encoding granule sizes are not equal.

In D84194#2170882, @scott.linder wrote:

I discussed with Tony today, and I was thinking about this the wrong way.

SPI does not require the granule count to be even; it just rounds up the granule count before actually performing the allocation. This means, from the compiler's perspective, when it is calculating things like the AMDGPU::IsaInfo::getMaxNumSGPRs it must consider the "allocation" granule size (IsaInfo::getSGPRAllocGranule). Conversely, from the assembler/disassembler perspective, it must consider the "encoding" granule size (IsaInfo::getSGPREncodingGranule). It is perfectly OK to have a GFX9 code object with a granulated SGPR count of 1, and we should allow emitting that in the assembler so that the disassembler can accurately reproduce those code objects.

I don't think there is any fix needed here, we already separate these two concepts and correctly apply them elsewhere. I think I just led you astray in the disassembly patch; you should only be using the encoding granule size, and shouldn't need any special handling for e.g. GFX9 to handle the fact that the allocation and encoding granule sizes are not equal.

Correct me if I'm wrong. So we must not take the inverse of the mentioned GFX9 calculation (the one where we divide by 16 before roundup), as it is for the allocation granule size? And hence the disassembly computation will be the same for GFX6-8 and GFX9 (because the encoding granule size is the same)?

In D84194#2173959, @rochauha wrote:

Correct me if I'm wrong. So we must not take the inverse of the mentioned GFX9 calculation (the one where we divide by 16 before roundup), as it is for the allocation granule size? And hence the disassembly computation will be the same for GFX6-8 and GFX9 (because the encoding granule size is the same)?

Correct, you can treat all hardware the same and calculate:

NumSGPRs = (NumSGPRBlocks + 1) * getSGPREncodingGranule()

I still think it might be good to make this into a function in AMDGPU::IsaInfo to be the inverse of getNumSGPRBlocks
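A minimal sketch of such an inverse, assuming an encoding granule of 8 as discussed above. The names `numSGPRBlocksSketch` and `numSGPRsFromBlocksSketch` are hypothetical and only illustrate the arithmetic; they are not functions in AMDGPU::IsaInfo.

```cpp
#include <algorithm>
#include <cassert>

// Assumption for this sketch: the encoding granule is 8 on all targets
// under discussion, as stated in the review.
constexpr unsigned EncodingGranuleSketch = 8;

// Forward direction: SGPR count to the encoded field
// (number of granules, rounded up, minus 1).
constexpr unsigned numSGPRBlocksSketch(unsigned NumSGPRs) {
  return (std::max(1u, NumSGPRs) + EncodingGranuleSketch - 1) /
             EncodingGranuleSketch - 1;
}

// Inverse direction, as given in the comment above:
// NumSGPRs = (NumSGPRBlocks + 1) * granule.
constexpr unsigned numSGPRsFromBlocksSketch(unsigned NumSGPRBlocks) {
  return (NumSGPRBlocks + 1) * EncodingGranuleSketch;
}

// An odd granulated count such as 1 survives the round trip unchanged.
static_assert(numSGPRBlocksSketch(numSGPRsFromBlocksSketch(1)) == 1,
              "round trip preserves the encoded value");
```

Because the inverse always returns a multiple of the encoding granule, re-encoding its result gives back the original field value, which is the property a disassemble and re-assemble round trip relies on.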

Based on comments and discussion, the difference for GFX9 is being handled using allocation granule sizes and no change is required.

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp	6 lines

Diff 279319

llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp

[338 lines not shown]
@@ unsigned getSGPRAllocGranule(const MCSubtargetInfo *STI) {
   if (Version.Major >= 10)
     return getAddressableNumSGPRs(STI);
   if (Version.Major >= 8)
     return 16;
   return 8;
 }

 unsigned getSGPREncodingGranule(const MCSubtargetInfo *STI) {
-  return 8;
+  // 16 for GFX9, 8 for GFX6-8
+  return isGFX9(*STI) ? 16 : 8;
 }

 unsigned getTotalNumSGPRs(const MCSubtargetInfo *STI) {
   IsaVersion Version = getIsaVersion(STI->getCPU());
   if (Version.Major >= 8)
     return 800;
   return 512;
 }

[70 lines not shown]

 unsigned getNumExtraSGPRs(const MCSubtargetInfo *STI, bool VCCUsed,
                           bool FlatScrUsed) {
   return getNumExtraSGPRs(STI, VCCUsed, FlatScrUsed,
                           STI->getFeatureBits().test(AMDGPU::FeatureXNACK));
 }

 unsigned getNumSGPRBlocks(const MCSubtargetInfo *STI, unsigned NumSGPRs) {
   NumSGPRs = alignTo(std::max(1u, NumSGPRs), getSGPREncodingGranule(STI));
   // SGPRBlocks is actual number of SGPR blocks minus 1.
-  return NumSGPRs / getSGPREncodingGranule(STI) - 1;
+  unsigned NumSGPRBlocks = NumSGPRs / getSGPREncodingGranule(STI) - 1;
+  return isGFX9(STI) ? NumSGPRBlocks * 2 : NumSGPRBlocks;
 }

 unsigned getVGPRAllocGranule(const MCSubtargetInfo *STI,
                              Optional<bool> EnableWavefrontSize32) {
   bool IsWave32 = EnableWavefrontSize32 ?
       *EnableWavefrontSize32 :
       STI->getFeatureBits().test(FeatureWavefrontSize32);

[1,059 lines not shown]
