
Details

Reviewers

t-tye
foad
rampitec

Commits

rG8967d044fc26: [AMDGPU] Add SIMemoryLegalizer comments to clarify bit usage

Summary

Attempt to further document the intended cache policies requested
by different combinations of GLC, SLC and DLC bits.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

critson created this revision. · Nov 22 2021, 12:57 AM

Herald added subscribers: kerbowa, hiraditya, tpr and 6 others. · Nov 22 2021, 12:57 AM

critson requested review of this revision. · Nov 22 2021, 12:57 AM

Herald added a project: Restricted Project. · Nov 22 2021, 12:57 AM

Herald added subscribers: llvm-commits, wdng.

LGTM.

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
881	Not really related to your patch, but why do we return here? Doesn't that mean that IsNonTemporal is effectively ignored if IsVolatile is true? Wouldn't it be both better and simpler to fall through to the IsNonTemporal handling here?

critson added inline comments. · Nov 22 2021, 1:25 AM

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
881	This is a good point. From a bit setting perspective it would be fine. Of course there is the question of what it semantically means to have a volatile nontemporal access when we seem to define volatile as "bypasses all caches".

Harbormaster completed remote builds in B135348: Diff 388810. · Nov 22 2021, 1:40 AM

t-tye added inline comments. · Nov 22 2021, 9:11 AM

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
881	I suspect this is because at one time relaxed atomics were marked as volatile. This may have been because the C/C++/OpenCL standards defined them that way, or because LLVM back then did not fully support atomics so marking them all as volatile made the existing passes "do the right thing". So this code may have been an attempt not to pessimize normal relaxed atomics. LLVM does not support non-temporal atomics currently. I am not sure if these reasons are still the case so would be good to investigate and potentially fix this code (or at least document why it is the way it is with a FIXME). That can be a separate review I think.
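
The early return discussed in this thread can be illustrated with a small standalone model. This is only a sketch, not code from SIMemoryLegalizer.cpp: the function names `bitsWithEarlyReturn` and `bitsWithFallThrough`, the `MemOp` enum, and the bit values are hypothetical, and the `insertWait` call is left out. The first function mirrors the shape of the existing GFX6 path; the second shows the fall-through that foad asks about, which is not adopted in this review.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-ins for the cache policy bits. The values are arbitrary and are
// not the real instruction encodings.
constexpr std::uint8_t GLC = 1;
constexpr std::uint8_t SLC = 2;

enum class MemOp { Load, Store };

// Mirrors the shape of the current GFX6 path: the volatile branch returns
// early, so a nontemporal request on the same access is never examined.
inline std::uint8_t bitsWithEarlyReturn(MemOp Op, bool IsVolatile,
                                        bool IsNonTemporal) {
  std::uint8_t Bits = 0;
  if (IsVolatile) {
    if (Op == MemOp::Load)
      Bits |= GLC;
    return Bits; // IsNonTemporal is ignored from here on.
  }
  if (IsNonTemporal)
    Bits |= GLC | SLC;
  return Bits;
}

// Hypothetical variant (assumption, not in the patch): handle volatile and
// then fall through to the nontemporal handling.
inline std::uint8_t bitsWithFallThrough(MemOp Op, bool IsVolatile,
                                        bool IsNonTemporal) {
  std::uint8_t Bits = 0;
  if (IsVolatile && Op == MemOp::Load)
    Bits |= GLC;
  if (IsNonTemporal)
    Bits |= GLC | SLC;
  return Bits;
}

// The two variants only differ when both flags are set.
inline void checkEarlyReturnModel() {
  assert(bitsWithEarlyReturn(MemOp::Store, true, true) == 0);
  assert(bitsWithFallThrough(MemOp::Store, true, true) == (GLC | SLC));
}
```

In this model a volatile nontemporal store gets no cache bits with the early return and both GLC and SLC with the fall-through. Whether the second result is semantically desirable is the open question raised above.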

t-tye added inline comments. · Nov 22 2021, 10:45 AM

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
844	Please add back: /// There is no bypass control for the L2 cache at the isa level. The modified comment is only explaining the L1 cache and both caches are involved for system scope.
867–868	How about: // Request L1 cache policy to be MISS_EVICT for load instructions and MISS_LRU for store instructions. Note that there is no L2 cache bypass policy at the isa level.
885–887	I had not remembered that the GLC and SLC bits are used together to set the L1 cache policy. So how about: // Setting GLC and SLC both to 1 sets the L1 cache policy to MISS_EVICT for both loads and stores, and the L2 cache policy to STREAM. and delete the comment below.
1108	MISS_EVICT is a policy that only applies when both SLC and GLC are set. Here only GLC is being set which specifies MISS_LRU and L2 LRU. How about: // Set the L1 cache policy to MISS_LRU. Note that there is no L2 cache bypass policy at the isa level.
1219–1220	How about: // Request L1 cache policy to be MISS_EVICT for load instructions and MISS_LRU for store instructions. Note that there is no L2 cache bypass policy at the isa level.
1237–1240	How about: // Setting GLC and SLC both to 1 sets the L1 cache policy to MISS_EVICT for both loads and stores, and the L2 cache policy to STREAM.
1400	How about: // Set the L0 and L1 cache policies to MISS_EVICT. Note that there is no L2 cache bypass policy at the isa level.
1450–1452	How about: // Request L0 and L1 cache policy to be MISS_EVICT for load instructions and MISS_LRU for store instructions. Note that there is no L2 cache coherent bypass policy at the isa level.
1468–1475	It appears that this should be setting GLC=1 for stores so that L0 will be HIT_EVICT instead of MISS_EVICT. This must not be done for loads as that would make L0 MISS_EVICT. How about: // For loads setting GLC to 1 sets the L0 and L1 cache policy to HIT_EVICT and the L2 cache policy to STREAM. For stores setting GLC and SLC both to 1 sets the L0 and L1 cache policy to MISS_EVICT and the L2 cache policy to STREAM.

t-tye requested changes to this revision. · Nov 22 2021, 10:46 AM

This revision now requires changes to proceed. · Nov 22 2021, 10:46 AM

critson marked 2 inline comments as done. · Nov 22 2021, 9:50 PM

critson added inline comments.

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
844	I deleted that text because there is a bypass for L2 stores and atomics on GFX10: SLC=0 DLC=1. I can put it back but only contextualised for targets before GFX10? (And the same for all the similar references in comments below.)
881	Yes, a separate investigation and review.
1468–1475	Do you have MISS_EVICT and HIT_EVICT flipped in your description? Do you mean: // For loads setting SLC to 1 sets the L0 and L1 cache policy to HIT_EVICT and the L2 cache policy to STREAM. For stores setting GLC and SLC both to 1 sets the L0 cache policy to MISS_EVICT and the L2 cache policy to STREAM. L1 is always bypassed for stores. I can add the GLC bit for stores and this ceases to be NFC.

t-tye added inline comments. · Nov 23 2021, 12:16 AM

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
844	In GFX10 the bypass is only available for stores and not loads, and is not coherent so cannot be used anyway. That is why I added the word "coherent" in one of the comments below. So probably should do that here too.
1468–1475	I believe I have it right according to the hardware GFX10 memory model spec. // For loads setting SLC to 1 sets the L0 and L1 cache policy to HIT_EVICT and the L2 cache policy to STREAM. For stores setting GLC and SLC both to 1 sets the L0 and L1 cache policy to MISS_EVICT and the L2 cache policy to STREAM. We have to state the policy for L1 too even though the hardware documentation does not state it. The L1 MUST be evict or a subsequent load could see stale data. Yes, this change ceases to be NFC and will need thorough testing.

Address reviewer comments.
Normalise language to use "set" instead of "request".
Add GLC to non-temporal stores on GFX10 meaning this is no longer NFC.
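
As a rough illustration of the resulting GFX10 behaviour, the bit selection can be modelled as a pure function. This is a sketch under stated assumptions: `gfx10Bits`, `MemOp`, and the bit values are hypothetical stand-ins, and the real `SIGfx10CacheControl::enableVolatileAndOrNonTemporal` mutates a machine instruction and also inserts a wait for volatile accesses, which is omitted here.

```cpp
#include <cassert>
#include <cstdint>

// Arbitrary toy values, not the real instruction encodings.
constexpr std::uint8_t GLC = 1;
constexpr std::uint8_t SLC = 2;
constexpr std::uint8_t DLC = 4;

enum class MemOp { Load, Store };

// Returns the cache policy bits this model would set for a GFX10 access.
inline std::uint8_t gfx10Bits(MemOp Op, bool IsVolatile, bool IsNonTemporal) {
  std::uint8_t Bits = 0;

  if (IsVolatile) {
    // Volatile loads: L0 and L1 MISS_EVICT. Volatile stores set no bits.
    if (Op == MemOp::Load)
      Bits |= GLC | DLC;
    return Bits; // Volatile takes precedence over nontemporal.
  }

  if (IsNonTemporal) {
    // Loads: SLC only, giving L0/L1 HIT_EVICT and L2 STREAM.
    // Stores: GLC and SLC, giving L0/L1 MISS_EVICT and L2 STREAM.
    if (Op == MemOp::Store)
      Bits |= GLC;
    Bits |= SLC;
  }

  return Bits;
}

// The functional change in this revision: nontemporal stores now carry GLC.
inline void checkGfx10Model() {
  assert(gfx10Bits(MemOp::Store, false, true) == (GLC | SLC));
  assert(gfx10Bits(MemOp::Load, false, true) == SLC);
}
```

Before this revision the nontemporal branch set only SLC for both loads and stores, which is why adding GLC for stores means the patch is no longer NFC.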

critson retitled this revision from [AMDGPU] Add SIMemoryLegalizer comments to clarify bit usage (NFC) to [AMDGPU] Add SIMemoryLegalizer comments to clarify bit usage. · Nov 23 2021, 1:32 AM

critson added inline comments.

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
1108	"The load intentionally misses the GPU L1 and reads from L2. If there was a line in the GPU L1 that matched, it is invalidated; L2 is reread." -- (CDNA1 Shader ISA, p67). This sounds like MISS_EVICT to me. As far as I know MISS_LRU only exists for stores and means write-combine, whereas MISS_EVICT is write-through.
1468–1475	Technically the L1 only has one policy described as "bypassed (but is coherent)" (RDNA1 Shader ISA p69), but on paper the behaviour of this looks the same as MISS_EVICT. So I guess I can accept just calling it that. Do you have any test cases which use non-temporal?

Harbormaster completed remote builds in B135570: Diff 389119. · Nov 23 2021, 3:15 AM

t-tye added inline comments. · Nov 23 2021, 1:52 PM

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
1108	GFX90A cache policies are different to other GFX9 and GFX10. If GLC=1 then the L1 policy is MISS_LRU for loads: any existing line is invalidated, then the cache line is loaded and remains in the cache with LRU policy.
1468–1475	In the GFX10 memory model spec I have, it does not state the L1 policy for stores, but my understanding is that it behaves as MISS_EVICT. To me "bypass but coherent" is semantically the same thing as MISS_EVICT and it seems odd not to use the terminology that is already well defined.
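
The per-target difference for agent-scope and system-scope loads that comes out of this exchange can be summarised as a small lookup. This is only a sketch: the `CacheControl` enum, the `LoadBypass` struct, and `loadBypassFor` are hypothetical names, and the entries simply restate the comments agreed in this review rather than any hardware documentation.

```cpp
#include <cassert>
#include <cstring>

// One entry per cache control class touched by this review.
enum class CacheControl { Gfx6, Gfx90A, Gfx10 };

struct LoadBypass {
  const char *Bits;   // Bits set on the load.
  const char *Caches; // Caches whose policy is affected.
  const char *Policy; // Resulting policy name.
};

// Policy applied by enableLoadCacheBypass at agent or system scope,
// according to the comments in the final diff.
inline LoadBypass loadBypassFor(CacheControl CC) {
  switch (CC) {
  case CacheControl::Gfx90A:
    // GLC alone: the line is invalidated, reloaded, and kept with LRU.
    return {"GLC", "L1", "MISS_LRU"};
  case CacheControl::Gfx10:
    return {"GLC|DLC", "L0 and L1", "MISS_EVICT"};
  case CacheControl::Gfx6:
    break;
  }
  return {"GLC", "L1", "MISS_EVICT"};
}

// GFX90A is the odd one out: the same GLC bit yields a different policy.
inline void checkLoadBypassTable() {
  assert(std::strcmp(loadBypassFor(CacheControl::Gfx90A).Policy,
                     "MISS_LRU") == 0);
  assert(std::strcmp(loadBypassFor(CacheControl::Gfx6).Policy,
                     "MISS_EVICT") == 0);
}
```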

Update GFX90A comment in line with Tony's comments

t-tye added inline comments. · Nov 25 2021, 3:59 AM

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
868	I would not add "(write-combine)" as it really is not an exact match for that term.
1220	Would eliminate "(write-combine)" as mentioned above.
1451	Eliminate "(write-combine)". Add: // Note that there is no L2 cache coherent bypass policy at the isa level.

Harbormaster completed remote builds in B136019: Diff 389721. · Nov 25 2021, 4:09 AM

Address reviewer comments

Harbormaster completed remote builds in B136025: Diff 389732. · Nov 25 2021, 4:50 AM

LGTM

This revision is now accepted and ready to land. · Nov 25 2021, 5:40 AM

Update tests for non-temporal GLC bit addition in GFX10

This revision was landed with ongoing or failed builds. · Nov 26 2021, 4:06 AM

Closed by commit rG8967d044fc26: [AMDGPU] Add SIMemoryLegalizer comments to clarify bit usage (authored by critson).

This revision was automatically updated to reflect the committed changes.

critson added a commit: rG8967d044fc26: [AMDGPU] Add SIMemoryLegalizer comments to clarify bit usage.

Harbormaster completed remote builds in B136188: Diff 389980. · Nov 26 2021, 4:07 AM

foad added inline comments. · Nov 29 2021, 6:21 AM

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
1473	Does this change the documented code sequences: https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-memory-model-code-sequences-gfx10-table ?

foad mentioned this in D114707: [AMDGPU] Update docs for nontemporal store. · Nov 29 2021, 6:46 AM

foad added inline comments. · Nov 29 2021, 6:46 AM

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
1473	D114707

foad mentioned this in rG5d602120c3a3: [AMDGPU] Update docs for nontemporal store. · Nov 30 2021, 1:44 AM

Diff 389721

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp

[...]	bool SIGfx6CacheControl::enableLoadCacheBypass(
SIAtomicAddrSpace AddrSpace) const {		SIAtomicAddrSpace AddrSpace) const {
assert(MI->mayLoad() && !MI->mayStore());		assert(MI->mayLoad() && !MI->mayStore());
bool Changed = false;		bool Changed = false;

if ((AddrSpace & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE) {		if ((AddrSpace & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE) {
switch (Scope) {		switch (Scope) {
case SIAtomicScope::SYSTEM:		case SIAtomicScope::SYSTEM:
case SIAtomicScope::AGENT:		case SIAtomicScope::AGENT:
		// Set L1 cache policy to MISS_EVICT.
		// Note: there is no L2 cache bypass policy at the ISA level.
Changed |= enableGLCBit(MI);		Changed |= enableGLCBit(MI);
break;		break;
case SIAtomicScope::WORKGROUP:		case SIAtomicScope::WORKGROUP:
case SIAtomicScope::WAVEFRONT:		case SIAtomicScope::WAVEFRONT:
case SIAtomicScope::SINGLETHREAD:		case SIAtomicScope::SINGLETHREAD:
// No cache to bypass.		// No cache to bypass.
break;		break;
default:		default:
[...]

bool SIGfx6CacheControl::enableRMWCacheBypass(		bool SIGfx6CacheControl::enableRMWCacheBypass(
const MachineBasicBlock::iterator &MI,		const MachineBasicBlock::iterator &MI,
SIAtomicScope Scope,		SIAtomicScope Scope,
SIAtomicAddrSpace AddrSpace) const {		SIAtomicAddrSpace AddrSpace) const {
assert(MI->mayLoad() && MI->mayStore());		assert(MI->mayLoad() && MI->mayStore());
bool Changed = false;		bool Changed = false;

/// The L1 cache is write through so does not need to be bypassed. There is no		/// Do not set GLC for RMW atomic operations as L0/L1 cache is automatically
/// bypass control for the L2 cache at the isa level.		/// bypassed, and the GLC bit is instead used to indicate if they are
		/// return or no-return.
		/// Note: there is no L2 cache coherent bypass control at the ISA level.

return Changed;		return Changed;
}		}

bool SIGfx6CacheControl::enableVolatileAndOrNonTemporal(		bool SIGfx6CacheControl::enableVolatileAndOrNonTemporal(
MachineBasicBlock::iterator &MI, SIAtomicAddrSpace AddrSpace, SIMemOp Op,		MachineBasicBlock::iterator &MI, SIAtomicAddrSpace AddrSpace, SIMemOp Op,
bool IsVolatile, bool IsNonTemporal) const {		bool IsVolatile, bool IsNonTemporal) const {
// Only handle load and store, not atomic read-modify-write insructions. The		// Only handle load and store, not atomic read-modify-write insructions. The
// latter use glc to indicate if the atomic returns a result and so must not		// latter use glc to indicate if the atomic returns a result and so must not
// be used for cache control.		// be used for cache control.
assert(MI->mayLoad() ^ MI->mayStore());		assert(MI->mayLoad() ^ MI->mayStore());

// Only update load and store, not LLVM IR atomic read-modify-write		// Only update load and store, not LLVM IR atomic read-modify-write
// instructions. The latter are always marked as volatile so cannot sensibly		// instructions. The latter are always marked as volatile so cannot sensibly
// handle it as do not want to pessimize all atomics. Also they do not support		// handle it as do not want to pessimize all atomics. Also they do not support
// the nontemporal attribute.		// the nontemporal attribute.
assert(Op == SIMemOp::LOAD || Op == SIMemOp::STORE);		assert(Op == SIMemOp::LOAD || Op == SIMemOp::STORE);

bool Changed = false;		bool Changed = false;

if (IsVolatile) {		if (IsVolatile) {
		// Set L1 cache policy to be MISS_EVICT for load instructions and
		// MISS_LRU (write-combine) for store instructions.
		// Note: there is no L2 cache bypass policy at the ISA level.
if (Op == SIMemOp::LOAD)		if (Op == SIMemOp::LOAD)
Changed |= enableGLCBit(MI);		Changed |= enableGLCBit(MI);

// Ensure operation has completed at system scope to cause all volatile		// Ensure operation has completed at system scope to cause all volatile
// operations to be visible outside the program in a global order. Do not		// operations to be visible outside the program in a global order. Do not
// request cross address space as only the global address space can be		// request cross address space as only the global address space can be
// observable outside the program, so no need to cause a waitcnt for LDS		// observable outside the program, so no need to cause a waitcnt for LDS
// address space operations.		// address space operations.
Changed |= insertWait(MI, SIAtomicScope::SYSTEM, AddrSpace, Op, false,		Changed |= insertWait(MI, SIAtomicScope::SYSTEM, AddrSpace, Op, false,
Position::AFTER);		Position::AFTER);

return Changed;		return Changed;
}		}

if (IsNonTemporal) {		if (IsNonTemporal) {
// Request L1 MISS_EVICT and L2 STREAM for load and store instructions.		// Setting both GLC and SLC configures L1 cache policy to MISS_EVICT
		// for both loads and stores, and the L2 cache policy to STREAM.
Changed |= enableGLCBit(MI);		Changed |= enableGLCBit(MI);
Changed |= enableSLCBit(MI);		Changed |= enableSLCBit(MI);
return Changed;		return Changed;
}		}

return Changed;		return Changed;
}		}

bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,		bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
[...]	bool SIGfx90ACacheControl::enableLoadCacheBypass(
SIAtomicAddrSpace AddrSpace) const {		SIAtomicAddrSpace AddrSpace) const {
assert(MI->mayLoad() && !MI->mayStore());		assert(MI->mayLoad() && !MI->mayStore());
bool Changed = false;		bool Changed = false;

if ((AddrSpace & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE) {		if ((AddrSpace & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE) {
switch (Scope) {		switch (Scope) {
case SIAtomicScope::SYSTEM:		case SIAtomicScope::SYSTEM:
case SIAtomicScope::AGENT:		case SIAtomicScope::AGENT:
		// Set the L1 cache policy to MISS_LRU.
		// Note: there is no L2 cache bypass policy at the ISA level.
Changed |= enableGLCBit(MI);		Changed |= enableGLCBit(MI);
break;		break;
case SIAtomicScope::WORKGROUP:		case SIAtomicScope::WORKGROUP:
// In threadgroup split mode the waves of a work-group can be executing on		// In threadgroup split mode the waves of a work-group can be executing on
// different CUs. Therefore need to bypass the L1 which is per CU.		// different CUs. Therefore need to bypass the L1 which is per CU.
// Otherwise in non-threadgroup split mode all waves of a work-group are		// Otherwise in non-threadgroup split mode all waves of a work-group are
// on the same CU, and so the L1 does not need to be bypassed.		// on the same CU, and so the L1 does not need to be bypassed.
if (ST.isTgSplitEnabled())		if (ST.isTgSplitEnabled())
[...]	bool SIGfx90ACacheControl::enableVolatileAndOrNonTemporal(
// instructions. The latter are always marked as volatile so cannot sensibly		// instructions. The latter are always marked as volatile so cannot sensibly
// handle it as do not want to pessimize all atomics. Also they do not support		// handle it as do not want to pessimize all atomics. Also they do not support
// the nontemporal attribute.		// the nontemporal attribute.
assert(Op == SIMemOp::LOAD || Op == SIMemOp::STORE);		assert(Op == SIMemOp::LOAD || Op == SIMemOp::STORE);

bool Changed = false;		bool Changed = false;

if (IsVolatile) {		if (IsVolatile) {
		// Set L1 cache policy to be MISS_EVICT for load instructions and
		// MISS_LRU (write-combine) for store instructions.
		// Note: there is no L2 cache bypass policy at the ISA level.
if (Op == SIMemOp::LOAD)		if (Op == SIMemOp::LOAD)
Changed |= enableGLCBit(MI);		Changed |= enableGLCBit(MI);

// Ensure operation has completed at system scope to cause all volatile		// Ensure operation has completed at system scope to cause all volatile
// operations to be visible outside the program in a global order. Do not		// operations to be visible outside the program in a global order. Do not
// request cross address space as only the global address space can be		// request cross address space as only the global address space can be
// observable outside the program, so no need to cause a waitcnt for LDS		// observable outside the program, so no need to cause a waitcnt for LDS
// address space operations.		// address space operations.
Changed |= insertWait(MI, SIAtomicScope::SYSTEM, AddrSpace, Op, false,		Changed |= insertWait(MI, SIAtomicScope::SYSTEM, AddrSpace, Op, false,
Position::AFTER);		Position::AFTER);

return Changed;		return Changed;
}		}

if (IsNonTemporal) {		if (IsNonTemporal) {
// Request L1 MISS_EVICT and L2 STREAM for load and store instructions.		// Setting both GLC and SLC configures L1 cache policy to MISS_EVICT
		// for both loads and stores, and the L2 cache policy to STREAM.
Changed |= enableGLCBit(MI);		Changed |= enableGLCBit(MI);
Changed |= enableSLCBit(MI);		Changed |= enableSLCBit(MI);
return Changed;		return Changed;
}		}

return Changed;		return Changed;
}		}

bool SIGfx90ACacheControl::insertWait(MachineBasicBlock::iterator &MI,		bool SIGfx90ACacheControl::insertWait(MachineBasicBlock::iterator &MI,
SIAtomicScope Scope,		SIAtomicScope Scope,
[...]
bool SIGfx10CacheControl::enableLoadCacheBypass(		bool SIGfx10CacheControl::enableLoadCacheBypass(
const MachineBasicBlock::iterator &MI,		const MachineBasicBlock::iterator &MI,
SIAtomicScope Scope,		SIAtomicScope Scope,
SIAtomicAddrSpace AddrSpace) const {		SIAtomicAddrSpace AddrSpace) const {
assert(MI->mayLoad() && !MI->mayStore());		assert(MI->mayLoad() && !MI->mayStore());
bool Changed = false;		bool Changed = false;

if ((AddrSpace & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE) {		if ((AddrSpace & SIAtomicAddrSpace::GLOBAL) != SIAtomicAddrSpace::NONE) {
/// TODO Do not set glc for rmw atomic operations as they
/// implicitly bypass the L0/L1 caches.

switch (Scope) {		switch (Scope) {
case SIAtomicScope::SYSTEM:		case SIAtomicScope::SYSTEM:
case SIAtomicScope::AGENT:		case SIAtomicScope::AGENT:
		// Set the L0 and L1 cache policies to MISS_EVICT.
		// Note: there is no L2 cache coherent bypass control at the ISA level.
Changed |= enableGLCBit(MI);		Changed |= enableGLCBit(MI);
Changed |= enableDLCBit(MI);		Changed |= enableDLCBit(MI);
break;		break;
case SIAtomicScope::WORKGROUP:		case SIAtomicScope::WORKGROUP:
// In WGP mode the waves of a work-group can be executing on either CU of		// In WGP mode the waves of a work-group can be executing on either CU of
// the WGP. Therefore need to bypass the L0 which is per CU. Otherwise in		// the WGP. Therefore need to bypass the L0 which is per CU. Otherwise in
// CU mode all waves of a work-group are on the same CU, and so the L0		// CU mode all waves of a work-group are on the same CU, and so the L0
// does not need to be bypassed.		// does not need to be bypassed.
[...]	bool SIGfx10CacheControl::enableVolatileAndOrNonTemporal(
// instructions. The latter are always marked as volatile so cannot sensibly		// instructions. The latter are always marked as volatile so cannot sensibly
// handle it as do not want to pessimize all atomics. Also they do not support		// handle it as do not want to pessimize all atomics. Also they do not support
// the nontemporal attribute.		// the nontemporal attribute.
assert(Op == SIMemOp::LOAD || Op == SIMemOp::STORE);		assert(Op == SIMemOp::LOAD || Op == SIMemOp::STORE);

bool Changed = false;		bool Changed = false;

if (IsVolatile) {		if (IsVolatile) {
		// Set L0 and L1 cache policy to be MISS_EVICT for load instructions
		// and MISS_LRU (write-combine) for store instructions.
if (Op == SIMemOp::LOAD) {		if (Op == SIMemOp::LOAD) {
Changed |= enableGLCBit(MI);		Changed |= enableGLCBit(MI);
Changed |= enableDLCBit(MI);		Changed |= enableDLCBit(MI);
}		}

// Ensure operation has completed at system scope to cause all volatile		// Ensure operation has completed at system scope to cause all volatile
// operations to be visible outside the program in a global order. Do not		// operations to be visible outside the program in a global order. Do not
// request cross address space as only the global address space can be		// request cross address space as only the global address space can be
// observable outside the program, so no need to cause a waitcnt for LDS		// observable outside the program, so no need to cause a waitcnt for LDS
// address space operations.		// address space operations.
Changed |= insertWait(MI, SIAtomicScope::SYSTEM, AddrSpace, Op, false,		Changed |= insertWait(MI, SIAtomicScope::SYSTEM, AddrSpace, Op, false,
Position::AFTER);		Position::AFTER);
return Changed;		return Changed;
}		}

if (IsNonTemporal) {		if (IsNonTemporal) {
// Request L0/L1 HIT_EVICT and L2 STREAM for load and store instructions.		// For loads setting SLC configures L0 and L1 cache policy to HIT_EVICT
		// and L2 cache policy to STREAM.
		// For stores setting both GLC and SLC configures L0 and L1 cache policy
		// to MISS_EVICT and the L2 cache policy to STREAM.
		if (Op == SIMemOp::STORE)
		Changed |= enableGLCBit(MI);
Changed |= enableSLCBit(MI);		Changed |= enableSLCBit(MI);

return Changed;		return Changed;
}		}

return Changed;		return Changed;
}		}

bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,		bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
SIAtomicScope Scope,		SIAtomicScope Scope,
[...]

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add SIMemoryLegalizer comments to clarify bit usage
ClosedPublic
