This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add SIMemoryLegalizer comments to clarify bit usage
ClosedPublic

Authored by critson on Nov 22 2021, 12:57 AM.

Details

Summary

Attempt to further document the intended cache policies requested
by different combinations of GLC, SLC and DLC bits.

Diff Detail

Event Timeline

critson created this revision.Nov 22 2021, 12:57 AM
critson requested review of this revision.Nov 22 2021, 12:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 22 2021, 12:57 AM
foad added a comment.Nov 22 2021, 1:20 AM

LGTM.

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
881

Not really related to your patch, but why do we return here? Doesn't that mean that IsNonTemporal is effectively ignored if IsVolatile is true? Wouldn't it be both better and simpler to fall through to the IsNonTemporal handling here?

critson added inline comments.Nov 22 2021, 1:25 AM
llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
881

This is a good point. From a bit setting perspective it would be fine.
Of course there is the question of what it semantically means to have a volatile nontemporal access when we seem to define volatile as "bypasses all caches".

t-tye added inline comments.Nov 22 2021, 9:11 AM
llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
881

I suspect this is because at one time relaxed atomics were marked as volatile. This may have been because the C/C++/OpenCL standards defined them that way, or because LLVM back then did not fully support atomics so marking them all as volatile made the existing passes "do the right thing". So this code may have been an attempt not to pessimize normal relaxed atomics. LLVM does not support non-temporal atomics currently.

I am not sure if these reasons are still the case so would be good to investigate and potentially fix this code (or at least document why it is the way it is with a FIXME). That can be a separate review I think.

t-tye added inline comments.Nov 22 2021, 10:45 AM
llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
844

Please add back:

/// There is no bypass control for the L2 cache at the isa level.

The modified comment is only explaining the L1 cache and both caches are involved for system scope.

867–868

How about:

// Request L1 cache policy to be MISS_EVICT for load instructions and MISS_LRU for store instructions. Note that there is no L2 cache bypass policy at the isa level.
885–887

I had not remembered that the GLC and SLC bits are used together to set the L1 cache policy. So how about:

// Setting GLC and SLC both to 1 sets the L1 cache policy to MISS_EVICT for both loads and stores, and the L2 cache policy to STREAM.

and delete the comment below.

1108

MISS_EVICT is a policy that only applies when both SLC and GLC are set. Here only GLC is being set which specifies MISS_LRU and L2 LRU. How about:

// Set the L1 cache policy to MISS_LRU. Note that there is no L2 cache bypass policy at the isa level.
1219–1220

How about:

// Request L1 cache policy to be MISS_EVICT for load instructions and MISS_LRU for store instructions. Note that there is no L2 cache bypass policy at the isa level.

1237–1240

How about:

// Setting GLC and SLC both to 1 sets the L1 cache policy to MISS_EVICT for both loads and stores, and the L2 cache policy to STREAM.
1400

How about:

// Set the L0 and L1 cache policies to MISS_EVICT. Note that there is no L2 cache bypass policy at the isa level.
1450–1452

How about:

// Request L0 and L1 cache policy to be MISS_EVICT for load instructions and MISS_LRU for store instructions. Note that there is no L2 cache coherent bypass policy at the isa level.
1468–1475

It appears that this should be setting GLC=1 for stores so that L0 will be HIT_EVICT instead of MISS_EVICT. This must not be done for loads as that would make Lo MISS_EVICT. How about:

// For loads setting GLC to 1 sets the L0 and L1 cache policy to HIT_EVICT and the L2 cache policy to STREAM. For stores setting GLC and SLC both to 1 sets the L0 and L1 cache policy to MISS_EVICT and the L2 cache policy to STREAM.

t-tye requested changes to this revision.Nov 22 2021, 10:46 AM
This revision now requires changes to proceed.Nov 22 2021, 10:46 AM
critson marked 2 inline comments as done.Nov 22 2021, 9:50 PM
critson added inline comments.
llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
844

I deleted that text because there is a bypass for L2 stores and atomics on GFX10: SLC=0 DLC=1.
I can put it back but only contextualised for targets before GFX10?
(And the same for all the similar references in comments below.)

881

Yes, a separate investigation and review.

1468–1475

Do you have MISS_EVICT and HIT_EVICT flipped in your description?

Do you mean:
// For loads setting SLC to 1 sets the L0 and L1 cache policy to HIT_EVICT and the L2 cache policy to STREAM. For stores setting GLC and SLC both to 1 sets the L0 cache policy to MISS_EVICT and the L2 cache policy to STREAM. L1 is always bypassed for stores.

I can add the GLC bit for stores and this ceases to be NFC.

t-tye added inline comments.Nov 23 2021, 12:16 AM
llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
844

In GFX10 the bypass is only available for stores and not loads, and is not coherent so cannot be used anyway. That is why I added the word "coherent" in one of the comments below. So probably should do that here too.

1468–1475

I believe I have it right according to the hardware GFX10 memory model spec.

// For loads setting SLC to 1 sets the L0 and L1 cache policy to HIT_EVICT and the L2 cache policy to STREAM. For stores setting GLC and SLC both to 1 sets the L0 and L1 cache policy to MISS_EVICT and the L2 cache policy to STREAM.

We have to state the policy for L1 too even though the hardware documentation does not state it. The L1 MUST be evict or a subsequent load could see stale data.

Yes this changes ceases to be NFC and will need thorough testing.

critson updated this revision to Diff 389119.Nov 23 2021, 1:31 AM
critson marked 9 inline comments as done.
  • Address reviewer comments.
  • Normalise language to use "set" instead of "request".
  • Add GLC to non-temporal stores on GFX10 meaning this is no longer NFC.
critson retitled this revision from [AMDGPU] Add SIMemoryLegalizer comments to clarify bit usage (NFC) to [AMDGPU] Add SIMemoryLegalizer comments to clarify bit usage.Nov 23 2021, 1:32 AM
critson added inline comments.
llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
1108

"The load intentionally misses the GPU L1 and reads from L2. If there was a line in the GPU L1 that matched, it is invalidated; L2 is reread." -- (CDNA1 Shader ISA, p67).

This sounds like MISS_EVICT to me.
As far as I know MISS_LRU only exists for stores and means write-combine, whereas MISS_EVICT is write-through.

1468–1475

Technically the L1 only has one policy described as "bypassed (but is coherent)" (RDNA1 Shader ISA p69), but on paper the behaviour of this looks the same as MISS_EVICT. So I guess I can accept just calling it that.

Do you have any test cases which use non-temporal?

t-tye added inline comments.Nov 23 2021, 1:52 PM
llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
1108

GFX90A cache policies are different to other GFX9 and GFX10. If GLC=1 then the L1 policy is MISS_LRU for loads: any existing line is invalidated, then the cache line is loaded and remains in the cache with LRU policy.

1468–1475

In the GFX10 memory model spec I have, it does not state the L1 policy for stores, but my understanding is that it behaves as MISS_EVICT. To me "bypass but coherent" is semantically the same thing as MISS_EVICT and it seems odd not to use the terminology that is already well defined.

critson updated this revision to Diff 389721.Nov 25 2021, 3:42 AM
critson marked 5 inline comments as done.
  • Update GFX90A comment inline with Tony's comments
t-tye added inline comments.Nov 25 2021, 3:59 AM
llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
868

I would not add "{write-combine)" as it really is not an exact match for that term.

1220

Would eliminate "(write-combine)" as mentioned above.

1451

Eliminate "(write-cobine)". Add:

// Note that there is no L2 cache coherent bypass policy at the isa level.
critson updated this revision to Diff 389732.Nov 25 2021, 4:10 AM
critson marked 3 inline comments as done.
  • Address reviewer comments
t-tye accepted this revision.Nov 25 2021, 5:40 AM

LGTM

This revision is now accepted and ready to land.Nov 25 2021, 5:40 AM
critson updated this revision to Diff 389980.Nov 26 2021, 3:27 AM
  • Update tests for non-temporal GLC bit addition in GFX10
This revision was landed with ongoing or failed builds.Nov 26 2021, 4:06 AM
This revision was automatically updated to reflect the committed changes.
foad added inline comments.Nov 29 2021, 6:21 AM
llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
1473
foad added inline comments.Nov 29 2021, 6:46 AM
llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
1473