This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Allow scalar loads after barrier
ClosedPublic

Authored by rampitec on Jan 27 2022, 4:39 PM.

Details

Summary

Currently we cannot convert a vector load into a scalar load if there
is a dominating barrier or fence. Such an instruction is considered a
clobbering memory access to prevent reordering of memory operations.
However, while reordering is indeed not possible, the actual memory is
not clobbered by a barrier or fence, so we can still use a scalar load
for a uniform pointer.

The solution is not to bail out on the first clobbering access, but to
traverse MemorySSA up to the root, excluding barriers and fences.
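The traversal described above can be sketched as a generic worklist walk over a def chain. This is a simplified model for illustration only, not the actual LLVM code: `MemDef`, its `Kind` values, and `isClobbered` are hypothetical stand-ins for MemorySSA's access classes.

```cpp
#include <unordered_set>
#include <vector>

// Simplified model of a MemorySSA def chain: each def points at the
// defining access(es) above it; the root (liveOnEntry) has none.
struct MemDef {
    enum Kind { Root, BarrierOrFence, Store } kind;
    std::vector<MemDef*> preds;   // defining accesses above this one
};

// Walk up from the load's defining access toward the root. Instead of
// bailing at the first def, skip barriers/fences and only report a
// clobber when a real store is found.
bool isClobbered(MemDef* start) {
    std::vector<MemDef*> worklist{start};
    std::unordered_set<MemDef*> visited;  // phis can create cycles
    while (!worklist.empty()) {
        MemDef* def = worklist.back();
        worklist.pop_back();
        if (!visited.insert(def).second)
            continue;
        switch (def->kind) {
        case MemDef::Store:
            return true;                  // a genuine clobber
        case MemDef::BarrierOrFence:
            // Barriers and fences order memory but do not modify it:
            // keep walking up toward the root.
            for (MemDef* p : def->preds)
                worklist.push_back(p);
            break;
        case MemDef::Root:
            break;                        // liveOnEntry: no clobber here
        }
    }
    return false;                         // exhausted the chain: no clobber
}
```

With this shape, a fence or barrier between the load and the root no longer forces the conservative "clobbered" answer; only an intervening store does.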

Diff Detail

Event Timeline

rampitec created this revision.Jan 27 2022, 4:39 PM
rampitec requested review of this revision.Jan 27 2022, 4:39 PM
Herald added a project: Restricted Project. Jan 27 2022, 4:39 PM
Herald added a subscriber: wdng.

Can we just mark the barriers as inaccessiblememonly instead

It would not help MemorySSA, and it would likely allow unwanted reordering around barriers. Besides, there are fences accompanying every barrier anyway.

arsenm added inline comments.Jan 27 2022, 4:45 PM
llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
80

Don't need an extra newline

93

Don't need an extra newline

rampitec marked 2 inline comments as done.Jan 27 2022, 4:48 PM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
80

It is not extra. This is the actual output on my console:

Checking clobbering of:   %i2 = load i32, i32 addrspace(1)* %i1, align 4, !tbaa !15
  Def:   fence syncscope("workgroup") acquire
  Def:   tail call void @llvm.amdgcn.s.barrier() #3
  Def:   fence syncscope("workgroup") release
      -> no clobber
93

Ditto.

My gut feeling is this is OK, but I'm also not convinced this is entirely sound. Can you add a test where this does not happen if there is an atomic load?

llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
75–136

This needs a comment explaining why this loop is OK

80

Weird that the Instruction and MachineInstr behavior is different

rampitec marked 3 inline comments as done.Jan 27 2022, 5:51 PM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
75–136

Will add, but in a nutshell: as with any SSA, we can walk up the tree to the root. Along that route we will see all defs of the memory location in question, provided that phis, which can create loops in the worst case, are carefully processed. What we are practically looking for in that traversal is a store invalidating the scalar cache.

As some confirmation, GVN does a similar thing. In fact, even the Visited set is not strictly needed: we would eventually reach the root and exhaust the list anyway, just a bit more slowly. The memory_phi tests cover this case.

Anyhow, I am running PSDB on this to double check.

80

It is.

rampitec updated this revision to Diff 404130.Jan 28 2022, 12:19 PM
rampitec marked 2 inline comments as done.
  • Added atomic load test
  • Added comment

PSDB passed.

rampitec updated this revision to Diff 404701.Jan 31 2022, 1:30 PM

Further study showed that a barrier or fence not only defines the load of interest but also clobbers any other load it dominates (for the same reason). The change adds aliasing disambiguation while walking up the tree, so that only defs affecting the original memory location are considered.

  • Use getClobberingMemoryAccess() in the loop with the memory location of the original load to skip any store which may not alias it.
  • Added more tests and generated opt metadata checks as well.
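The aliasing refinement can be modeled by filtering the walk with an alias check. Again a toy sketch, not the committed code: `Def`, `mayAlias`, and the string-based location are hypothetical stand-ins for `MemoryLocation` and the alias query that `getClobberingMemoryAccess()` performs internally.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Simplified def node carrying a coarse stand-in for a MemoryLocation.
struct Def {
    enum Kind { Root, BarrierOrFence, Store } kind;
    std::string loc;              // e.g. "lds" or "global"
    std::vector<Def*> preds;
};

// Coarse aliasing model: identical location tags may alias.
bool mayAlias(const std::string& a, const std::string& b) { return a == b; }

// Walk up from the load's defining access; only a store that may alias
// the loaded location counts as a clobber. Barriers, fences, and
// non-aliasing stores are walked through.
bool isClobbered(Def* start, const std::string& loadLoc) {
    std::vector<Def*> worklist{start};
    std::unordered_set<Def*> visited;
    while (!worklist.empty()) {
        Def* d = worklist.back();
        worklist.pop_back();
        if (!visited.insert(d).second)
            continue;
        if (d->kind == Def::Store && mayAlias(d->loc, loadLoc))
            return true;          // aliasing store: a real clobber
        for (Def* p : d->preds)
            worklist.push_back(p);
    }
    return false;
}
```

For example, an LDS store between a barrier and a global uniform load would no longer block the scalar-load conversion, because it cannot alias the global location.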
rampitec updated this revision to Diff 404745.Jan 31 2022, 2:57 PM

Added -amdgpu-aa to the opt run line in the test to do LDS disambiguation.

Looks almost good

llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
92

This can be moved to the bottom of the loop, so you don't need to push the first MA and can just use it directly on the first iteration. The loop would become a do-while then. If you are lucky, you won't hit any push at all.

110

It may be worth making a function like isReallyAClobber just to deduplicate the code a bit

rampitec updated this revision to Diff 405013.Feb 1 2022, 11:17 AM
rampitec marked an inline comment as done.

Added helper isReallyAClobber.
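The intent of such a predicate can be imagined along these lines. This is a hypothetical sketch, not the committed helper: `Inst` and its fields are invented stand-ins for `llvm::Instruction` queries, though `llvm.amdgcn.s.barrier` is the real barrier intrinsic discussed in this review.

```cpp
#include <string>

// Simplified instruction record standing in for llvm::Instruction.
struct Inst {
    bool isFence;
    std::string calledIntrinsic;  // empty if not an intrinsic call
};

// A def is only a *real* clobber if it is neither a fence nor a
// workgroup barrier: those order memory but never modify it.
bool isReallyAClobber(const Inst& I) {
    if (I.isFence)
        return false;             // fences only order memory
    if (I.calledIntrinsic == "llvm.amdgcn.s.barrier")
        return false;             // barriers do not write memory
    // Conservatively treat everything else as a clobber.
    return true;
}
```

Centralizing this check means both the initial walk and the per-def alias query apply the same "barriers and fences are not clobbers" rule.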

llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
92

There are a lot of 'continue's here, and I need to pop_back the worklist somewhere. Besides, this is a SmallVector, so some stack storage is already preallocated and merely initialized by the push.

vpykhtin accepted this revision.Feb 1 2022, 11:27 AM

LGTM.

This revision is now accepted and ready to land.Feb 1 2022, 11:27 AM
This revision was landed with ongoing or failed builds.Feb 1 2022, 11:43 AM
This revision was automatically updated to reflect the committed changes.