This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Remove unnecessary waitcnts
AbandonedPublic

Authored by OutOfCache on Mar 24 2023, 10:39 AM.

Details

Reviewers
arsenm
mareko
tsymalla
msearles
sebastian-ne
foad
Group Reviewers
Restricted Project
Summary

Some ds_* instructions do not access LDS memory.
Therefore, @sebastian-ne suggested removing the waitcnts.

Diff Detail

Unit TestsFailed

Event Timeline

OutOfCache created this revision.Mar 24 2023, 10:39 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 24 2023, 10:39 AM
OutOfCache requested review of this revision.Mar 24 2023, 10:39 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 24 2023, 10:39 AM
OutOfCache edited the summary of this revision. (Show Details)Mar 24 2023, 10:44 AM
OutOfCache added a project: Restricted Project.
OutOfCache added a subscriber: Flakebi.
foad added inline comments.Mar 25 2023, 5:02 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1393 ↗(On Diff #508156)

Might be better to make usesLGKM_CNT itself more accurate, by tweaking DSInstructions.td so that LGKM_CNT is set to 0 for instructions that don't use it.

  • editing DSInstructions.td instead of SIInsertWaitcnts.cpp
OutOfCache edited the summary of this revision. (Show Details)Mar 27 2023, 3:27 AM
OutOfCache edited reviewers, added: sebastian-ne, Restricted Project; removed: Flakebi.
OutOfCache marked an inline comment as done.
OutOfCache added a subscriber: sebastian-ne.
OutOfCache added inline comments.
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1393 ↗(On Diff #508156)

Thank you! For some reason, I only found the definition of the intrinsic. I did not know there was a separate .td file for DS Instructions.

foad accepted this revision.Mar 27 2023, 3:45 AM

LGTM, thanks.

This revision is now accepted and ready to land.Mar 27 2023, 3:45 AM

Are you sure about this? lgkmcnt(0) isn't about accessing LDS memory, but about waiting for the result to be received from the LDS block.

OutOfCache marked an inline comment as done.Mar 27 2023, 8:38 AM

Are you sure about this? lgkmcnt(0) isn't about accessing LDS memory, but about waiting for the result to be received from the LDS block.

In that case I am not certain. Is there a way to verify whether this is acceptable?

t-tye added a subscriber: t-tye.Mar 27 2023, 9:57 AM

The waitcnt's serve two purposes. They notify that the result of the operation is available to the thread that requested it, and they ensure that the effect of the operation is visible to other threads before this thread continues to do other operations. This latter purpose is used to ensure the happens-before relationship in the memory model. So for example, if a VMEM release atomic is done at workgroup scope, should these operations be visible to other threads before the result that is store-released onto VMEM?

If these operations go down the LDS queues (even if they are not performed in the LDS itself), then there are 2 queues for the waves of a workgroup, but a single L1 shared by all waves of a workgroup for VMEM. So to ensure visibility to all waves in the workgroup the LDS operation must be waited to complete before starting the VMEM operation if there needs to be a happens-before relation. That waiting is achieved by the waitcnt on LGKM before executing the VMEM instruction.

The waitcnt's serve two purposes. They notify that the result of the operation is available to the thread that requested it, and they ensure that the effect of the operation is visible to other threads before this thread continues to do other operations. This latter purpose is used to ensure the happens-before relationship in the memory model. So for example, if a VMEM release atomic is done at workgroup scope, should these operations be visible to other threads before the result that is store-released onto VMEM?

If these operations go down the LDS queues (even if they are not performed in the LDS itself), then there are 2 queues for the waves of a workgroup, but a single L1 shared by all waves of a workgroup for VMEM. So to ensure visibility to all waves in the workgroup the LDS operation must be waited to complete before starting the VMEM operation if there needs to be a happens-before relation. That waiting is achieved by the waitcnt on LGKM before executing the VMEM instruction.

Thank you for taking the time to explain! If I understand correctly, the waitcnt does not only notify the current lane that the result is available, but also the other lanes within the same workgroup. So without the waitcnt, there is a possibility that the other lanes see the result of the VMEM instruction first?

OutOfCache abandoned this revision.Mar 28 2023, 9:01 AM

After further discussion, @mareko is right and the waitcnts are necessary. Thanks for bringing that up!

mareko added a comment.EditedMar 28 2023, 9:59 AM

All lanes receive the result at the same time because it's really just v_mov under the hood. s_waitcnt only waits for it.

the waitcnt does not only notify the current lane that the result is available, but also the other lanes within the same workgroup

FWIW, it's lanes of the same wave, not workgroup.

t-tye added a comment.Mar 28 2023, 2:49 PM

The waitcnt's serve two purposes. They notify that the result of the operation is available to the thread that requested it, and they ensure that the effect of the operation is visible to other threads before this thread continues to do other operations. This latter purpose is used to ensure the happens-before relationship in the memory model. So for example, if a VMEM release atomic is done at workgroup scope, should these operations be visible to other threads before the result that is store-released onto VMEM?

If these operations go down the LDS queues (even if they are not performed in the LDS itself), then there are 2 queues for the waves of a workgroup, but a single L1 shared by all waves of a workgroup for VMEM. So to ensure visibility to all waves in the workgroup the LDS operation must be waited to complete before starting the VMEM operation if there needs to be a happens-before relation. That waiting is achieved by the waitcnt on LGKM before executing the VMEM instruction.

Thank you for taking the time to explain! If I understand correctly, the waitcnt does not only notify the current lane that the result is available, but also the other lanes within the same workgroup. So without the waitcnt, there is a possibility that the other lanes see the result of the VMEM instruction first?

waitcnt causes the thread to stall until the previous operations that change the counter are completed. This can be used to ensure a result has been returned to the thread. It can also be used to delay executing following instructions until the previous operations have completed and so are globally visible. For the memory model it is the latter that it gets used for. I the thread ensures that the LDS/... operation are complete before executing a VMEM operation, it ensures all waves will see the updates to LDS and VMEM in the same order which is a requirement for seq_cst and release memory orders.