This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Do not wait for vscnt on function entry and return
ClosedPublic

Authored by foad on Jun 22 2023, 5:31 AM.

Details

Summary

SIInsertWaitcnts inserts waitcnt instructions to resolve data
dependencies. The GFX10+ vscnt (VMEM store count) counter is never used
in this way. It is only used to resolve memory dependencies, and that is
handled by SIMemoryLegalizer. Hence there is no need to conservatively
wait for vscnt to be 0 on function entry and before returns.

Diff Detail

Event Timeline

foad created this revision.Jun 22 2023, 5:31 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 22 2023, 5:31 AM
foad requested review of this revision.Jun 22 2023, 5:31 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 22 2023, 5:31 AM
foad added inline comments.Jun 22 2023, 5:33 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1231

@kerbowa unlike the rest of SIInsertWaitcnts, I assume this part does want to wait for vscnt==0 since it is handling memory dependencies?

arsenm added inline comments.Jun 22 2023, 7:24 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/bswap.ll
70

These were super annoying

kerbowa added inline comments.Jun 25 2023, 4:42 PM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1231

Technically it should not matter on HW with VScnt since they all can back off barriers, so this 'if' should never be true on Navi. There is an exception currently with gfx11 because of the memory model description bug with cumode, but that should be temporary.

foad added a comment.Jul 3 2023, 7:29 AM

Ping!

This does not change the ABI. It just removes a bunch of wait instructions that were never required in the first place.

arsenm added a comment.Jul 3 2023, 7:56 AM

It is only used to resolve memory dependencies, and that is handled by SIMemoryLegalizer.

The idea was the stores on the caller side don't need to wait to finish because the prolog wait will take care of it after the latency of the jump. Is the memory legalizer not taking advantage of this?

foad added a comment.Jul 3 2023, 8:02 AM

It is only used to resolve memory dependencies, and that is handled by SIMemoryLegalizer.

The idea was the stores on the caller side don't need to wait to finish because the prolog wait will take care of it after the latency of the jump. Is the memory legalizer not taking advantage of this?

I don't see how it can be taking advantage of that, because it only considers each load or store in isolation.

arsenm added a comment.Jul 3 2023, 8:14 AM

It is only used to resolve memory dependencies, and that is handled by SIMemoryLegalizer.

The idea was the stores on the caller side don't need to wait to finish because the prolog wait will take care of it after the latency of the jump. Is the memory legalizer not taking advantage of this?

I don't see how it can be taking advantage of that, because it only considers each load or store in isolation.

Would that just be an implementation deficiency? With something like https://godbolt.org/z/55YEn473W where does the wait happen?

foad added a comment.Jul 3 2023, 8:20 AM

With something like https://godbolt.org/z/55YEn473W where does the wait happen?

No wait is required. The store and the load stay in order. SIMemoryLegalizer does not insert one, irrespective of whether the store and load are in the same function or not.

arsenm accepted this revision.Jul 3 2023, 9:18 AM

LGTM I guess wait would only be needed in the atomic case anyway?

This revision is now accepted and ready to land.Jul 3 2023, 9:18 AM
foad updated this revision to Diff 537051.Jul 4 2023, 4:25 AM

Rebase.

This revision was landed with ongoing or failed builds.Jul 4 2023, 4:26 AM
This revision was automatically updated to reflect the committed changes.