This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Remove the s_buffer workaround for GFX9 chips
Closed · Public

Authored by mareko on Jan 31 2018, 11:47 AM.

Details

Reviewers

arsenm
nhaehnle

Commits

rGb2cc77985b1c: AMDGPU: Remove the s_buffer workaround for GFX9 chips
rL324486: AMDGPU: Remove the s_buffer workaround for GFX9 chips

Summary

I checked the AMD closed source compiler and the workaround is only
needed when x3 is emulated as x4, which we don't do in LLVM.

SMEM x3 opcodes don't exist, and instead there is a possibility to use x4
with the last component being unused. If the last component is out of
buffer bounds and falls on the next 4K page, the hw hangs.
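The hazard described above can be pictured with a small predicate. This is only a sketch based on the wording of the summary: the function name, its parameters, and the exact condition are assumptions made for illustration, and none of it is LLVM or hardware API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the hazard from the summary; names are illustrative.
const uint64_t PageSize = 4096; // the "4K page" mentioned in the summary
const uint64_t DwordSize = 4;

// Returns true if an x3 load emulated as an x4 load could hit the hazard:
// the unused fourth dword is out of buffer bounds and sits on a different
// 4K page than the third (last needed) dword.
inline bool emulatedX3MayHang(uint64_t BufferBase, uint64_t BufferSize,
                              uint64_t ByteOffset) {
  const uint64_t LastUsed = BufferBase + ByteOffset + 2 * DwordSize;
  const uint64_t Unused = BufferBase + ByteOffset + 3 * DwordSize;
  const bool OutOfBounds = ByteOffset + 4 * DwordSize > BufferSize;
  const bool OnNextPage = Unused / PageSize != LastUsed / PageSize;
  return OutOfBounds && OnNextPage;
}

// A 12-byte buffer that ends exactly at a page boundary is the bad case.
inline void checkHazardExamples() {
  assert(emulatedX3MayHang(/*BufferBase=*/4084, /*BufferSize=*/12, 0));
  assert(!emulatedX3MayHang(/*BufferBase=*/0, /*BufferSize=*/12, 0));
}
```

Since LLVM never emits the emulated x4 form, this condition cannot arise from its output, which is the argument for dropping the workaround.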

Diff Detail

Repository: rL LLVM

Event Timeline

mareko created this revision. Jan 31 2018, 11:47 AM

Herald added subscribers: t-tye, tpr, dstuttard and 3 others. Jan 31 2018, 11:47 AM

Harbormaster completed remote builds in B14485: Diff 132236. Jan 31 2018, 11:48 AM

In D42756#995040, @arsenm wrote:

We probably do want to do that optimization at some point, although in that case I would hope we would avoid producing them in the buggy case. Can you add more details to the comment here, and possibly leave it?

In D42756#995044, @mareko wrote:

What details? Can you be more specific about what you're asking here?

In D42756#995101, @arsenm wrote:

Like you mentioned in the commit message that there is a problem with x3 loads only.

@mareko wrote:

SMEM x3 opcodes don't exist, and instead there is a possibility to use x4 with the last component being unused. If the last component is out of buffer bounds and falls on the next 4K page, the hw hangs.

I checked the AMD closed source compiler and the workaround is only
needed when x3 is emulated as x4, which we don't do in LLVM.

Harbormaster completed remote builds in B14517: Diff 132453. Feb 1 2018, 12:39 PM

Ping

This revision was not accepted when it landed; it landed in state Needs Review. Feb 7 2018, 8:05 AM

Closed by commit rL324486: AMDGPU: Remove the s_buffer workaround for GFX9 chips (authored by mareko).

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Target/AMDGPU/AMDGPUSubtarget.h	8 lines
llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp	5 lines
llvm/trunk/test/CodeGen/AMDGPU/smrd.ll	10 lines

Diff 133215

llvm/trunk/lib/Target/AMDGPU/AMDGPUSubtarget.h

... 327 lines not shown (context: public:) ...

   bool hasMin3Max3_16() const {
     return getGeneration() >= GFX9;
   }

   bool hasMadMixInsts() const {
     return HasMadMixInsts;
   }

-  bool hasSBufferLoadStoreAtomicDwordxN() const {
-    // Only use the "x1" variants on GFX9 or don't use the buffer variants.
-    // For x2 and higher variants, if the accessed region spans 2 VM pages and
-    // the second page is unmapped, the hw hangs.
-    // TODO: There is one future GFX9 chip that doesn't have this bug.
-    return getGeneration() != GFX9;
-  }
-
   bool hasCARRY() const {
     return (getGeneration() >= EVERGREEN);
   }

   bool hasBORROW() const {
     return (getGeneration() >= EVERGREEN);
   }

... 606 lines not shown ...

llvm/trunk/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp

... 847 lines not shown (context: if (Opc == AMDGPU::DS_READ_B32 || Opc == AMDGPU::DS_READ_B64 ||) ...

         Modified = true;
         I = mergeWrite2Pair(CI);
       } else {
         ++I;
       }

       continue;
     }
-    if (STM->hasSBufferLoadStoreAtomicDwordxN() &&
-        (Opc == AMDGPU::S_BUFFER_LOAD_DWORD_IMM ||
-         Opc == AMDGPU::S_BUFFER_LOAD_DWORDX2_IMM)) {
+    if (Opc == AMDGPU::S_BUFFER_LOAD_DWORD_IMM ||
+        Opc == AMDGPU::S_BUFFER_LOAD_DWORDX2_IMM) {
       // EltSize is in units of the offset encoding.
       CI.InstClass = S_BUFFER_LOAD_IMM;
       CI.EltSize = AMDGPU::getSMRDEncodedOffset(*STM, 4);
       CI.IsX2 = Opc == AMDGPU::S_BUFFER_LOAD_DWORDX2_IMM;
       if (findMatchingInst(CI)) {
         Modified = true;
         I = mergeSBufferLoadImmPair(CI);
         if (!CI.IsX2)

... 83 lines not shown ...
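The pairwise merging that this hunk now enables on GFX9 can be sketched roughly as follows. The struct and function here are hypothetical stand-ins written for this note; they are not the CombineInfo type or the mergeSBufferLoadImmPair function from SILoadStoreOptimizer.cpp, and the real pass checks more than this (same base descriptor, no intervening dependencies, and so on).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified view of one s_buffer_load with an immediate offset.
struct SBufferLoad {
  uint32_t DwordOffset; // offset of the first dword, in dwords
  uint32_t Width;       // number of dwords loaded: 1 (x1) or 2 (x2)
};

// Tries to merge two loads of equal width that cover adjacent dwords.
// x1 + x1 gives x2, and x2 + x2 gives x4. Returns false if they cannot merge.
inline bool tryMergePair(const SBufferLoad &A, const SBufferLoad &B,
                         SBufferLoad &Merged) {
  if (A.Width != B.Width || (A.Width != 1 && A.Width != 2))
    return false;
  const SBufferLoad &Lo = A.DwordOffset <= B.DwordOffset ? A : B;
  const SBufferLoad &Hi = A.DwordOffset <= B.DwordOffset ? B : A;
  if (Lo.DwordOffset + Lo.Width != Hi.DwordOffset)
    return false; // not adjacent
  Merged.DwordOffset = Lo.DwordOffset;
  Merged.Width = Lo.Width * 2;
  return true;
}

// Two adjacent x1 loads merge into one x2 load.
inline void checkMergeExample() {
  SBufferLoad M = {0, 0};
  assert(tryMergePair(SBufferLoad{1, 1}, SBufferLoad{2, 1}, M));
  assert(M.DwordOffset == 1 && M.Width == 2);
}
```

Applying the same step again to two merged x2 results is how four consecutive dword loads can end up as a single x4 load.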

llvm/trunk/test/CodeGen/AMDGPU/smrd.ll

... 211 lines not shown (context: main_body:) ...

   %r = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 %off)
   ret float %r
 }

 ; GCN-LABEL: {{^}}smrd_imm_merged:
 ; GCN-NEXT: %bb.
 ; SICI-NEXT: s_buffer_load_dwordx4 s[{{[0-9]}}:{{[0-9]}}], s[0:3], 0x1
 ; SICI-NEXT: s_buffer_load_dwordx2 s[{{[0-9]}}:{{[0-9]}}], s[0:3], 0x7
-; VI-NEXT: s_buffer_load_dwordx4 s[{{[0-9]}}:{{[0-9]}}], s[0:3], 0x4
-; VI-NEXT: s_buffer_load_dwordx2 s[{{[0-9]}}:{{[0-9]}}], s[0:3], 0x1c
-; GFX9-NEXT: s_buffer_load_dword s{{[0-9]}}
-; GFX9-NEXT: s_buffer_load_dword s{{[0-9]}}
-; GFX9-NEXT: s_buffer_load_dword s{{[0-9]}}
-; GFX9-NEXT: s_buffer_load_dword s{{[0-9]}}
-; GFX9-NEXT: s_buffer_load_dword s{{[0-9]}}
-; GFX9-NEXT: s_buffer_load_dword s{{[0-9]}}
+; VIGFX9-NEXT: s_buffer_load_dwordx4 s[{{[0-9]}}:{{[0-9]}}], s[0:3], 0x4
+; VIGFX9-NEXT: s_buffer_load_dwordx2 s[{{[0-9]}}:{{[0-9]}}], s[0:3], 0x1c
 define amdgpu_ps void @smrd_imm_merged(<4 x i32> inreg %desc) #0 {
 main_body:
   %r1 = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 4)
   %r2 = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 8)
   %r3 = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 12)
   %r4 = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 16)
   %r5 = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 28)
   %r6 = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 32)

... 67 lines not shown ...
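The immediate offsets in the CHECK lines above (0x1 and 0x7 for SICI, 0x4 and 0x1c for VI and GFX9) follow from how each generation encodes the offset. The sketch below assumes that SI/CI encode the immediate in dwords while VI and GFX9 encode it in bytes; the enum and function are hypothetical names for this note and are not the signature of AMDGPU::getSMRDEncodedOffset.

```cpp
#include <cstdint>

// Hypothetical enum for this sketch only; not LLVM's generation enum.
enum class Gen { SICI, VIGFX9 };

// Assumption: SI/CI immediates count dwords, VI/GFX9 immediates count bytes.
constexpr uint32_t encodedImmOffset(Gen G, uint32_t ByteOffset) {
  return G == Gen::SICI ? ByteOffset / 4 : ByteOffset;
}

// The test loads at byte offsets 4..16 merge into an x4 load starting at
// byte 4, and the loads at byte offsets 28 and 32 into an x2 load at byte 28.
static_assert(encodedImmOffset(Gen::SICI, 4) == 0x1, "x4 on SI/CI");
static_assert(encodedImmOffset(Gen::SICI, 28) == 0x7, "x2 on SI/CI");
static_assert(encodedImmOffset(Gen::VIGFX9, 4) == 0x4, "x4 on VI/GFX9");
static_assert(encodedImmOffset(Gen::VIGFX9, 28) == 0x1c, "x2 on VI/GFX9");
```

With the workaround removed, GFX9 shares the VI expectations, which is why the two prefixes collapse into VIGFX9 in the updated test.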