This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add two TSFlags: IsAtomicNoRtn and IsAtomicRtn
ClosedPublic

Authored by rampitec on Feb 12 2021, 3:49 PM.

Details

Summary

We are using AtomicNoRet map in multiple places to determine
if an instruction atomic, rtn or nortn atomic. This method
does not work always since we have some instructions which
only has rtn or nortn version.

One such instruction is ds_wrxchg_rtn_b32 which does not have
nortn version. This has caused changes in memory legalizer
tests.

Diff Detail

Event Timeline

rampitec created this revision.Feb 12 2021, 3:49 PM
rampitec requested review of this revision.Feb 12 2021, 3:49 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 12 2021, 3:49 PM
Herald added a subscriber: wdng. · View Herald Transcript

@t-tye memory legalizer changes are due to the fact that SIMemoryLegalizer::isAtomicRet() was returning false for the ds_wrxchg_rtn_b32.

Please verify if the tests now look correctly or not. I am checking memory model for GFX10 and I do not see vmcnt required for these situations, but I do not see vscnt which was generated before there as well. Looks like this has uncovered a bug in either legalizer (likely) or memory model description. On top of that I do not see memory model describing atomicrmw acquire agent local combination.

@t-tye memory legalizer changes are due to the fact that SIMemoryLegalizer::isAtomicRet() was returning false for the ds_wrxchg_rtn_b32.

Please verify if the tests now look correctly or not. I am checking memory model for GFX10 and I do not see vmcnt required for these situations, but I do not see vscnt which was generated before there as well. Looks like this has uncovered a bug in either legalizer (likely) or memory model description. On top of that I do not see memory model describing atomicrmw acquire agent local combination.

I.e. I understand why vscnt is replaced with vmcnt, it was treated as store and now as load. This is fine. I don't understand what does it do with vmem at all.

rampitec added inline comments.Feb 12 2021, 5:05 PM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
572

@foad do I get it right, all MIMG atomics are rtn only? MIMG has never used AtomicNoRet map, so this branch was dead. Maybe we were missing to wait for exp, maybe they all mayStore, or maybe it is really dead an unused. Potentially you will see some extra exp waits, but it didn't appear in our tests. That said I am afraid test coverage for waitcounts is not sufficient.

t-tye added a comment.Feb 12 2021, 7:55 PM

On top of that I do not see memory model describing atomicrmw acquire agent local combination.

It is not present because the local address space is per workgroup and so only supports up to workgroup scope. Using any larger scope "decays" to workgroup scope.

t-tye accepted this revision.Feb 12 2021, 8:21 PM

@t-tye memory legalizer changes are due to the fact that SIMemoryLegalizer::isAtomicRet() was returning false for the ds_wrxchg_rtn_b32.

Please verify if the tests now look correctly or not. I am checking memory model for GFX10 and I do not see vmcnt required for these situations, but I do not see vscnt which was generated before there as well. Looks like this has uncovered a bug in either legalizer (likely) or memory model description. On top of that I do not see memory model describing atomicrmw acquire agent local combination.

I.e. I understand why vscnt is replaced with vmcnt, it was treated as store and now as load. This is fine. I don't understand what does it do with vmem at all.

The test result changes look correct. The legalizer generated code was not incorrect before, it was just not optimal. The generation of the vmcnt or vscnt is not required at all as the ds_wrxchg_rtn instruction only acts on the local memory which is only accessibly up to workgroup scope. Accessing at a wider scope "decays" to workgroup scope. It would be good to consider updating the memory legalizer to exploit this fact.

This revision is now accepted and ready to land.Feb 12 2021, 8:21 PM

On top of that I do not see memory model describing atomicrmw acquire agent local combination.

It is not present because the local address space is per workgroup and so only supports up to workgroup scope. Using any larger scope "decays" to workgroup scope.

That's my understanding, but probably deserves some explanation of scope nesting in the documentation. Maybe references in the memory model to scpoes shall refer to ">= a scope"? I am just trying to play a game, an unsophisticated user came to read it. I understand why is it not written, why does a wider scope decays a narrower, but coming after a long day of bit cracking into this table stunned me for a moment. Just a personal experience, I didn't want to think, I just wanted an answer which was not readily available. I had to think where I probably shouldn't. In many cases thinking leads to unhealthy ideas where it comes to memory model. After all we test it for some reason yet don't document it.

On top of that I do not see memory model describing atomicrmw acquire agent local combination.

It is not present because the local address space is per workgroup and so only supports up to workgroup scope. Using any larger scope "decays" to workgroup scope.

That's my understanding, but probably deserves some explanation of scope nesting in the documentation. Maybe references in the memory model to scpoes shall refer to ">= a scope"? I am just trying to play a game, an unsophisticated user came to read it. I understand why is it not written, why does a wider scope decays a narrower, but coming after a long day of bit cracking into this table stunned me for a moment. Just a personal experience, I didn't want to think, I just wanted an answer which was not readily available. I had to think where I probably shouldn't. In many cases thinking leads to unhealthy ideas where it comes to memory model. After all we test it for some reason yet don't document it.

The AMDGPU memory scopes are defined in https://llvm.org/docs/AMDGPUUsage.html#amdgpu-memory-scopes which is referenced at the beginning of the memory model section. Reading that I think addresses all the questions. That section also references the HSA and OpenCL specifications that more formally define the memory model supported by AMDGPU. I agree that thinking about memory models can be detremental to your mental health and should be avoided when working on code on Friday evening's when possible:-)

rampitec updated this revision to Diff 323634.Feb 14 2021, 2:19 PM

Rebased. Unfortunately D96643 did not remove waits on vmem for GFX10 LDS accesses.

foad added a comment.Feb 15 2021, 7:02 AM

Looks good. Can you remove the getAtomicRetOp table if it's not used for anything?

llvm/lib/Target/AMDGPU/SIDefines.h
111

This one lacks a comment.

Can you spell it "IsAtomicNoRet"? I don't think we use the "rtn"/"nortn" spelling anywhere else.

114

"IsAtomicRet"?

foad added inline comments.Feb 15 2021, 7:15 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
572

That's not true, architecturally all image atomics have ret/noret versions controlled by the glc bit just like buffer instructions. But I guess in LLVM we have never bothered to select the noret instructions.

foad added inline comments.Feb 15 2021, 7:38 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.h
557

I am surprised that there isn't already a MachineInstr::IsAtomic method.

foad added inline comments.Feb 15 2021, 8:12 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
572

Incidentally I don't understand this code at all. Surely all atomics are both mayLoad and mayStore, so this will have taken the "if" path on line 570?

t-tye added inline comments.Feb 15 2021, 8:15 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
572

The hardware can be configured as to whether IMG instructions behave as non-IMG instruction wit respect to the waitcnts and the ordering. For HSA the hardware is configured to make them behave the same.

foad added inline comments.Feb 15 2021, 8:43 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
572

Incidentally I don't understand this code at all. Surely all atomics are both mayLoad and mayStore, so this will have taken the "if" path on line 570?

Just to prove my point, this still passes the test suite: https://reviews.llvm.org/differential/diff/323776/

rampitec added inline comments.Feb 15 2021, 8:50 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
572

I am afraid we do not have tests covering every path in this pass.

foad added inline comments.Feb 15 2021, 8:53 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
572

I am saying this is dead code because every atomic is mayLoad and mayStore. Do you disagree?

rampitec added inline comments.Feb 15 2021, 8:54 AM
llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
572

Theoretically an atomic load can eecape that.

llvm/lib/Target/AMDGPU/SIInstrInfo.h
557

Me either.

rampitec added inline comments.Feb 15 2021, 9:00 AM
llvm/lib/Target/AMDGPU/SIDefines.h
111

Actually we are using "rtn" in the td files. I was looking what do we use more, "ret" or "rtn" and got the impression "rtn" is used more often. I do not have a strong preference though.

rampitec updated this revision to Diff 323798.Feb 15 2021, 10:55 AM
rampitec marked 2 inline comments as done.

Addressed Jay's comments.

foad accepted this revision.Feb 15 2021, 11:17 AM

LGTM, thanks.

llvm/lib/Target/AMDGPU/SIFormMemoryClauses.cpp
115

I think this is dead too since it is covered by the mayStore just above.

This revision was automatically updated to reflect the committed changes.
rampitec added inline comments.Feb 15 2021, 11:31 AM
llvm/lib/Target/AMDGPU/SIFormMemoryClauses.cpp
115

There are atomic loads. Although these are not marked as atomics, which may be a problem. As far as I understand this is just a load with glc.

t-tye added inline comments.Feb 15 2021, 2:32 PM
llvm/lib/Target/AMDGPU/SIFormMemoryClauses.cpp
115

There are atomic loads. Although these are not marked as atomics, which may be a problem. As far as I understand this is just a load with glc.

Not all atomics are read-modify-write. There are atomics loads, atmic stores, and rmw atomics that do not consume the result.