This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Flush vmcnt with any loop extraneous defs
Needs Review · Public

Authored by kerbowa on Jul 5 2023, 12:51 AM.

Details

Summary

Start hoisting the waitcnt out of loops that use a value loaded outside of the loop and that also contain any VMEM load defining a value used outside of the loop.

example:

v0 = load(...)
loop {
  ...
  use(v0)
  v1 = load(...)
  ...
  use(v1)
  v2 = load(...)
}
use(v2)

Previously, we would not hoist the waitcnt to the preheader of any loop containing a use/def pair in which any subregister was both defined and used wholly within the loop. It seems somewhat arbitrary to limit the optimization to loops that load values but never use them, but I may be missing something. There is a concern about increased compile time with this change, but it is essentially what was done before with FLAT/GLOBAL instructions.

A more thorough approach would try to estimate the minimum number of cycles gained or lost by hoisting the waitcnt, but this would increase compile time further.
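As a rough illustration of the condition described in the summary, here is a minimal toy model in C++. It is only a sketch of one reading of the summary, not the code in SIInsertWaitcnts.cpp: `ToyInstr`, `ToyLoop`, `shouldFlushInPreheader`, and `makeSummaryExample` are hypothetical names, registers are plain strings, and subregisters and redefinitions of the outside-loaded value are ignored.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical toy types; nothing here is an LLVM API.
struct ToyInstr {
  bool IsVMemLoad;                // true for a VMEM load
  std::vector<std::string> Defs;  // registers written by the instruction
  std::vector<std::string> Uses;  // registers read by the instruction
};

struct ToyLoop {
  std::vector<ToyInstr> Body;            // instructions in the loop, in order
  std::set<std::string> PendingAtEntry;  // loaded before the loop, load still outstanding
  std::set<std::string> LiveOut;         // registers read after the loop
};

// Assumed reading of the summary: hoist the wait to the preheader when the
// loop (1) uses a value loaded outside the loop and (2) contains a VMEM load
// whose result is used outside the loop.
bool shouldFlushInPreheader(const ToyLoop &L) {
  bool UsesValueLoadedOutside = false;
  bool HasLoadUsedOutside = false;
  for (const ToyInstr &I : L.Body) {
    for (const std::string &R : I.Uses)
      if (L.PendingAtEntry.count(R))
        UsesValueLoadedOutside = true;
    if (I.IsVMemLoad)
      for (const std::string &R : I.Defs)
        if (L.LiveOut.count(R))
          HasLoadUsedOutside = true;
  }
  return UsesValueLoadedOutside && HasLoadUsedOutside;
}

// Builds the loop from the example in the summary.
ToyLoop makeSummaryExample() {
  ToyLoop L;
  L.PendingAtEntry = {"v0"};  // v0 = load(...) before the loop
  L.Body = {
      {false, {}, {"v0"}},  // use(v0)
      {true, {"v1"}, {}},   // v1 = load(...)
      {false, {}, {"v1"}},  // use(v1)
      {true, {"v2"}, {}},   // v2 = load(...)
  };
  L.LiveOut = {"v2"};  // use(v2) after the loop
  return L;
}
```

In this model the example loop qualifies even though v1 is both loaded and used inside the loop, which is the case the summary says was previously rejected. Dropping either the use of v0 or the live-out v2 makes the function return false.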

Event Timeline

kerbowa created this revision. · Jul 5 2023, 12:51 AM
Herald added a project: Restricted Project. · Jul 5 2023, 12:51 AM
kerbowa requested review of this revision. · Jul 5 2023, 12:51 AM
Herald added a project: Restricted Project. · Jul 5 2023, 12:52 AM
kerbowa edited the summary of this revision. · Jul 5 2023, 12:54 AM
foad added a comment. · Jul 5 2023, 2:56 AM

You haven't added any tests that show the effect of your patch.

arsenm added inline comments. · Jul 5 2023, 10:14 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1770

How does this set distinguish subregister and full-register defs?

1805

.empty()?

llvm/test/CodeGen/AMDGPU/waitcnt-vmcnt-loop.mir
788

-NEXT is much better than -NOT