This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Expand remaining system atomic operations
Needs Review · Public

Authored by arsenm on Aug 8 2023, 1:38 PM.

Details

Reviewers
rampitec
doru1004
yaxunl
kerbowa
JonChesterfield
b-sumner
Group Reviewers
Restricted Project
Summary

System scope atomics need to use cmpxchg
loops. aea5980e26e6a87dab9f8acb10eb3a59dd143cb1 started this; this change
expands the set to cover the remaining integer operations.

Don't expand xchg and add; those theoretically should work over PCIe.
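
For orientation, a minimal sketch of the decision this change amounts to in SIISelLowering.cpp's shouldExpandAtomicRMWInIR, written as a simplified standalone helper (classifySystemScopeRMW is hypothetical; the real hook covers more cases and subtarget checks):

#include "llvm/CodeGen/TargetLowering.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/LLVMContext.h"

using namespace llvm;

static TargetLowering::AtomicExpansionKind
classifySystemScopeRMW(const AtomicRMWInst *RMW) {
  // PCIe defines fetch-add, swap, and compare-and-swap, so xchg and add
  // should theoretically keep working even over the bus.
  switch (RMW->getOperation()) {
  case AtomicRMWInst::Xchg:
  case AtomicRMWInst::Add:
    return TargetLowering::AtomicExpansionKind::None;
  default:
    break;
  }
  // Everything else at the default (system) scope is rewritten by
  // AtomicExpandPass into a load + cmpxchg retry loop.
  if (RMW->getSyncScopeID() == SyncScope::System)
    return TargetLowering::AtomicExpansionKind::CmpXChg;
  return TargetLowering::AtomicExpansionKind::None;
}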

Diff Detail

Event Timeline

arsenm created this revision. Aug 8 2023, 1:38 PM
Herald added a project: Restricted Project. · View Herald Transcript · Aug 8 2023, 1:38 PM
arsenm requested review of this revision. Aug 8 2023, 1:38 PM
Herald added a subscriber: wdng. · View Herald Transcript

PCIe supports integer fetch-add, swap, and compare-and-swap.

arsenm updated this revision to Diff 548397. Aug 8 2023, 4:32 PM
arsenm edited the summary of this revision. (Show Details)

Don't expand xchg or add

It seems unfair to me: we give an option to override it for FP atomics, but not for INT.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
14166

It does not support it "for system scope atomics"; it just supports it (PCIe has no notion of scope).

arsenm added a comment. Aug 8 2023, 4:57 PM

It seems unfair to me: we give an option to override it for FP atomics, but not for INT.

The FP option is also broken and should be removed. There are two separate properties. We should have something for the FP properties that can be ignored on the specific instruction, and one for ignoring the scope (although I don't know why the unsafe usage can't just use the wrong scope).

arsenm added inline comments. Aug 8 2023, 4:58 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
14166

So what's the distinction? It's there but useless?

rampitec added inline comments. Aug 8 2023, 5:02 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
14166

'system' here is just an (admittedly imperfect) indicator that it can go over PCIe. PCIe itself just supports these operations and does not know about scopes.

yaxunl added a comment. Aug 8 2023, 6:44 PM

It seems unfair to me: we give an option to override it for FP atomics, but not for INT.

The FP option is also broken and should be removed. There are two separate properties. We should have something for the FP properties that can be ignored on the specific instruction, and one for ignoring the scope (although I don't know why the unsafe usage can't just use the wrong scope).

My understanding is that there are 3 situations:

coarse-grained memory: all integer atomic ops work, and FP atomics (if available) work
fine-grained memory with PCIe: only integer add/xchg/cmpxchg work; the others need to be transformed into a cmpxchg loop (sketched after this comment)
fine-grained memory with XGMI: all integer atomic ops work, but FP atomics do not

so we need separate options for unsafe-fp-atomics and unsafe-pcie-integer-atomics.

If users make sure they only use certain memories, then they can specify suitable options for their use case:

coarse-grained memory only: -unsafe-fp-atomics -unsafe-pcie-integer-atomics
fine-grained memory with PCIe: (no unsafe options)
fine-grained memory with XGMI: -unsafe-pcie-integer-atomics
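
For illustration, here is the cmpxchg retry loop that a fetch_or gets expanded into under the fine-grained PCIe case, written as portable C++ (fetch_or_via_cas is a hypothetical helper; the backend emits the equivalent loop in IR):

#include <atomic>
#include <cstdint>

uint32_t fetch_or_via_cas(std::atomic<uint32_t> &A, uint32_t Mask) {
  uint32_t Old = A.load(std::memory_order_relaxed);
  // Retry until the compare-and-swap succeeds; compare_exchange_weak
  // refreshes Old with the current value whenever it fails.
  while (!A.compare_exchange_weak(Old, Old | Mask,
                                  std::memory_order_seq_cst,
                                  std::memory_order_relaxed)) {
  }
  return Old; // the value observed before the OR, as fetch_or returns
}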

If users make sure they only use certain memories, then they can specify suitable options for their use case:

coarse-grained memory only: -unsafe-fp-atomics -unsafe-pcie-integer-atomics
fine-grained memory with PCIe: (no unsafe options)
fine-grained memory with XGMI: -unsafe-pcie-integer-atomics

I'm not totally sure I'm following the setup here. It sounds like someone writes fetch_or to a global in addrspace(1) and gets it expanded to CAS unless they also pass a scary command-line flag, which will then miscompile other atomic operations that do actually need the expansion.

We could use 'system' scope to mean 'flat CAS instruction', but as pointed out above that'll kill performance on non-PCIe systems; plus, it's really easy to emit system scope instructions unintentionally, since that's the default.

Is the root problem that the addrspace annotation is inaccurate? It seems that the architecture plus which memory contains the variable is exactly sufficient to pick the instruction, at which point I don't see what the command-line flags would be for.

@jhuber6 I think libc uses system scope everywhere because the only intrinsics that give access to device scope are the OpenCL ones, with constraints on the right kind of atomic type argument. If it expands fetch_or to CAS, we're losing performance.

If users make sure they only use certain memories, then they can specify suitable options for their use case:

coarse-grained memory only: -unsafe-fp-atomics -unsafe-pcie-integer-atomics
fine-grained memory with PCIe: (no unsafe options)
fine-grained memory with XGMI: -unsafe-pcie-integer-atomics

I'm not totally sure I'm following the setup here. It sounds like someone writes fetch_or to a global in addrspace(1) and gets it expanded to CAS unless they also pass a scary command-line flag, which will then miscompile other atomic operations that do actually need the expansion.

We kind of need to have a parameter or set of atomic functions which correspond to the various restrictions, and that end up emitting some kind of qualifier on the regular atomicrmw. The current unsafe-fp-atomics flag is bogus: it came to mean too many things (both ignore memory safety and ignore FP mode mismatches), and the core issue has absolutely nothing to do with floating point (there is a dimension of floating-point issues, which is also underspecified), but we can ignore that part for the moment. The corresponding attribute also breaks on inlining, such that unsafe code can infect safe code.
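
For reference, a hypothetical sketch of how such a string-valued function attribute gets consulted (hasUnsafeFPAtomics is not code from this patch); because the query is per-function, after inlining a single setting governs both the caller's and the inlined callee's atomics, which is how unsafe code can infect safe code:

#include "llvm/IR/Function.h"

// Reads the string-valued function attribute; returns true only if it is
// present and set to "true".
static bool hasUnsafeFPAtomics(const llvm::Function &F) {
  return F.getFnAttribute("amdgpu-unsafe-fp-atomics").getValueAsString() ==
         "true";
}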

We could use 'system' scope to mean 'flat CAS instruction', but as pointed out above that'll kill performance on non-PCIe systems; plus, it's really easy to emit system scope instructions unintentionally, since that's the default.

Yes, that is what we must do. It doesn't matter what the performance is; an undecorated atomic should always work, no matter what.

Is the root problem that the addrspace annotation is inaccurate? It seems that the architecture plus which memory contains the variable is exactly sufficient to pick the instruction, at which point I don't see what the command-line flags would be for.

This isn't covered by address space; it possibly should be a few new scopes. If not a scope, then some other per-instruction metadata.

@jhuber6 I think libc uses system scope everywhere because the only intrinsics that give access to device scope are the OpenCL ones, with constraints on the right kind of atomic type argument. If it expands fetch_or to CAS, we're losing performance.

These just need to thread a choice through. I think the OpenMP device RTL isn't threading through the appropriate scope either.
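
A hedged sketch of what threading that choice through could look like with the IRBuilder API, using AMDGPU's "agent" (device) sync scope; emitAgentScopedOr and the hard-coded alignment are assumptions for illustration:

#include "llvm/IR/IRBuilder.h"

using namespace llvm;

AtomicRMWInst *emitAgentScopedOr(IRBuilder<> &B, Value *Ptr, Value *Mask) {
  // An undecorated atomicrmw defaults to SyncScope::System, which is what
  // triggers the conservative cmpxchg expansion.
  SyncScope::ID Agent = B.getContext().getOrInsertSyncScopeID("agent");
  return B.CreateAtomicRMW(AtomicRMWInst::Or, Ptr, Mask, MaybeAlign(4),
                           AtomicOrdering::SequentiallyConsistent, Agent);
}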

Ping. I think we just need to go with this and add in the new thing later. Also, I would like clarification on whether add/xchg at system scope are supposed to work.

Ping. I think we just need to go with this and add in the new thing later. Also, I would like clarification on whether add/xchg at system scope are supposed to work.

This will cause a perf regression for XGMI users. We cannot afford that.

Maybe add another option -amdgpu-safe-pcie-atomics to enable the new changes (default off).

Ping. I think we just need to go with this and add in the new thing later. Also, I would like clarification on whether add/xchg at system scope are supposed to work.

This will cause a perf regression for XGMI users. We cannot afford that.

Nobody complained when the last batch was fixed.

Maybe add another option -amdgpu-safe-pcie-atomics to enable the new changes (default off).

Absolutely not.

Ping. I think we just need to go with this and add in the new thing later. Also, I would like clarification on whether add/xchg at system scope are supposed to work.

This will cause a perf regression for XGMI users. We cannot afford that.

Nobody complained when the last batch was fixed.

Also, correctness trumps speed. We should establish a working baseline.

Ping. I think we just need to go with this and add in the new thing later. Also, I would like clarification on whether add/xchg at system scope are supposed to work.

This will cause a perf regression for XGMI users. We cannot afford that.

Nobody complained when the last batch was fixed.

Also, correctness trumps speed. We should establish a working baseline.

If the XGMI users complain, do we have an existing approach to recover the performance by modifying their code?

How about making the default always work, but adding an option -amdgpu-unsafe-integer-ops to allow the integer atomic RMW ops? Then at least they have a way to recover the performance.

llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll