This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Removed unnecessary cache invalidations.
Needs ReviewPublic

Authored by s-perron on Mar 22 2021, 5:14 PM.

Details

Summary

The SPIR-V memory barriers are translated in a very pessimistic way by
LLPC. When translating, LLPC does not know where memory will be stored, so
it will introduce a fence instruction. The fence will eventually be
turned into an instruction to invalidate the memory cache.

If the barrier was for workgroup shared memory and all workgroup shared
variabled are allocated to LDS, then the L1 cache does not have to be
invalidated. However, the code will still invalidate it.

This commit modifies the SI-Insert-waitcnts pass to remove cache
invalidation instructions it can prove will not be needed. If no store
or load to memory that is cached reaches the cache invalidation instruction
without passing through another cache invalidation instruction, then it is
safe to remove.

Diff Detail

Event Timeline

s-perron created this revision.Mar 22 2021, 5:14 PM
s-perron requested review of this revision.Mar 22 2021, 5:14 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 22 2021, 5:14 PM
s-perron edited the summary of this revision. (Show Details)
arsenm added inline comments.
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1848–1849

Range loop?

foad added a subscriber: foad.Mar 23 2021, 6:56 AM

"When translating, LLPC does know"

Did you mean "does not know"?

foad added inline comments.Mar 23 2021, 6:59 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
325

Typo "pontentially".

329

Typo "pontentially".

sameerds added inline comments.Mar 23 2021, 8:18 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1625

This condition and the previous one need to be captured as a function with a meaningful name. Or if TableGen is involved in this enum, then perhaps a property on the instruction.

s-perron updated this revision to Diff 332746.Mar 23 2021, 11:44 AM

Fix tests and clean up code

  • Fix store instructions in tests.
  • Small refactoring to simplify removing instructions.
s-perron edited the summary of this revision. (Show Details)Mar 23 2021, 11:45 AM
s-perron updated this revision to Diff 332748.Mar 23 2021, 11:50 AM
s-perron marked 2 inline comments as done.
  • Fix typos.
s-perron marked 2 inline comments as done.Mar 23 2021, 11:52 AM

"When translating, LLPC does know"

Did you mean "does not know"?

Yes that is what I meant. I've fixed it up.

t-tye requested changes to this revision.Mar 23 2021, 1:57 PM
t-tye added inline comments.
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
92

gfx90a also has L2 cache control. See https://llvm.org/docs/AMDGPUUsage.html#memory-model-gfx90a .

198–212

Should this be a query on an instruction as the glc, et al bit can control the caches affected? There is also buffer_invl2. Should this be driven by a table gen property?

Should cache writeback also be tracked? GFX90A has L2 writeback controls.

499

Invalidating is about removing stale data. Writeback is about removing dirty data. So suggest renaming these operations.

502

Also have similar for removing unnecessary writeback instructions.

725

Is GDS needed here? Seems this is only dealing with VMEM. GDS is like LDS and not part of VMEM.

Should SMEM writes be considered? Some targets support that.

726–727

Should the cache bypass of the instructions be considered? When the cache is bypassed the data is not left in the cache and so does not cause the need for an invalidate. But the cache bypass can specify which caches it is bypassing. For example, performing a series of relaxed atomics at system scope would require no invalidates.

Should the impact of writes be considered in deciding on the need for a cache writeback? Again there can be cache bypass, and some caches are readonly or write-through.

732

LDS has no cache control so should if be considered?

Why only level 0 as loading into the L0 can also cause data to be loaded into the other caches.

794

Seems this is tracking possible stale data, not dirty data.

938

Is this correct? These instructions do not ensure the waicnt is 0. This is also missing GFX90A cache instructions.

1732

Should hasPending be generalized to include all pending "things"? Seems merge has been generalized to merge all the "things".

This revision now requires changes to proceed.Mar 23 2021, 1:57 PM
s-perron updated this revision to Diff 332994.Mar 24 2021, 8:00 AM
s-perron marked 3 inline comments as done.
  • Fix lint issues
  • Use "stale" in place of "dirty" to describe the cache.

I've made some changes. I will look into some of the improvements that have been suggested.

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
725

Is GDS needed here?

Does a load from GDS go through the L1 cache? I was under the impression that it does, but I'm not all that familiar with the architecture. If it does not then it is not needed.

Should SMEM writes be considered?

The SMEM_ACCESS event covers both reads and writes to SMEM, so this should already be covered. See the comment by the definition of the enum.

726–727

Should the cache bypass of the instructions be considered?

I did not know about the cache bypass. I'll see what I can do about that.

some caches are readonly or write-through?

That could be useful. Is this something that can be easily queried? If determining the behaviour of the cache would require a lot of new code, I would prefer to make that change in a subsequent patch. The analysis would still be correct, even if it is a bit conservative. The same for the cache bypass.

732

I guess I misunderstood how the arch is designed. Would a read from LDS load data into the L1 cache? I was told "Another subtlety is that on gfx10 in WGP mode, the L0 cache needs to be invalidated even for LDS." I'm not being precise enough here, I should be checking which mode we are in and possibly which GFX version.

I also want to ask about the export instructions. I made a conservative guess that they would require the L0 cache invalidation, but I am not sure how the output buffer is implemented.

938

I do not know. I just tried keep this functionally the same. I could revert my change to this, and then it could be fixed or removed with a different patch. I don't want to fix completely unrelated fixes with this patch.

t-tye added inline comments.Mar 24 2021, 10:18 PM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
725

Is GDS needed here?

Does a load from GDS go through the L1 cache? I was under the impression that it does, but I'm not all that familiar with the architecture. If it does not then it is not needed.

No it does not.

Should SMEM writes be considered?

The SMEM_ACCESS event covers both reads and writes to SMEM, so this should already be covered. See the comment by the definition of the enum.

Still seems SMEM reads and writes should be separate.

732

I guess I misunderstood how the arch is designed. Would a read from LDS load data into the L1 cache? I was told "Another subtlety is that on gfx10 in WGP mode, the L0 cache needs to be invalidated even for LDS." I'm not being precise enough here, I should be checking which mode we are in and possibly which GFX version.

I also want to ask about the export instructions. I made a conservative guess that they would require the L0 cache invalidation, but I am not sure how the output buffer is implemented.

It would probably be easier to just talk through how the architecture works so you can then make updates.

kuhar added a subscriber: kuhar.Apr 21 2021, 8:16 AM