Some ds_* instructions do not access LDS memory.
Therefore, @sebastian-ne suggested removing the waitcnts.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Time | Test | |
---|---|---|
60,030 ms | x64 debian > MLIR.Examples/standalone::test.toy |
Event Timeline
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
1393 ↗ | (On Diff #508156) | Might be better to make usesLGKM_CNT itself more accurate, by tweaking DSInstructions.td so that LGKM_CNT is set to 0 for instructions that don't use it. |
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp | ||
---|---|---|
1393 ↗ | (On Diff #508156) | Thank you! For some reason, I only found the definition of the intrinsic. I did not know there was a separate .td file for DS Instructions. |
Are you sure about this? lgkmcnt(0) isn't about accessing LDS memory, but about waiting for the result to be received from the LDS block.
The waitcnt's serve two purposes. They notify that the result of the operation is available to the thread that requested it, and they ensure that the effect of the operation is visible to other threads before this thread continues to do other operations. This latter purpose is used to ensure the happens-before relationship in the memory model. So for example, if a VMEM release atomic is done at workgroup scope, should these operations be visible to other threads before the result that is store-released onto VMEM?
If these operations go down the LDS queues (even if they are not performed in the LDS itself), then there are 2 queues for the waves of a workgroup, but a single L1 shared by all waves of a workgroup for VMEM. So to ensure visibility to all waves in the workgroup the LDS operation must be waited to complete before starting the VMEM operation if there needs to be a happens-before relation. That waiting is achieved by the waitcnt on LGKM before executing the VMEM instruction.
Thank you for taking the time to explain! If I understand correctly, the waitcnt does not only notify the current lane that the result is available, but also the other lanes within the same workgroup. So without the waitcnt, there is a possibility that the other lanes see the result of the VMEM instruction first?
After further discussion, @mareko is right and the waitcnts are necessary. Thanks for bringing that up!
All lanes receive the result at the same time because it's really just v_mov under the hood. s_waitcnt only waits for it.
the waitcnt does not only notify the current lane that the result is available, but also the other lanes within the same workgroup
FWIW, it's lanes of the same wave, not workgroup.
waitcnt causes the thread to stall until the previous operations that change the counter are completed. This can be used to ensure a result has been returned to the thread. It can also be used to delay executing following instructions until the previous operations have completed and so are globally visible. For the memory model it is the latter that it gets used for. I the thread ensures that the LDS/... operation are complete before executing a VMEM operation, it ensures all waves will see the updates to LDS and VMEM in the same order which is a requirement for seq_cst and release memory orders.