System scope atomics need to use cmpxchg
loops. aea5980e26e6a87dab9f8acb10eb3a59dd143cb1 started this, this
expands the set to cover the remaining integer operations.
Don't expand xchg and add, those theoretically should work over PCIe.
Paths
| Differential D157437
AMDGPU: Expand remaining system atomic operations Needs ReviewPublic Authored by arsenm on Aug 8 2023, 1:38 PM.
Details Summary System scope atomics need to use cmpxchg Don't expand xchg and add, those theoretically should work over PCIe.
Diff Detail
Unit TestsFailed Event TimelineComment Actions It seems unfair to me, we give an option to override it for FP atomics, but not for INT.
Comment Actions
The FP option is also broken and should be removed. There are two separate properties. We should have something for the FP properties that can be ignored on the specific instruction, and one for ignore the scope (although I don't know why the unsafe usage can't just use the wrong scope)
Comment Actions
My understanding is that there are 3 situations: corse-grained memory: all integer atomic ops work, fp atomics (if available) work so we need separate options for unsafe-fp-atomics and unsafe-pcie-integer-atomics If users make sure they only use certain memories, then they can specify suitable options for their usecase: corse-grained memory only: -unsafe-fp-atomics -unsafe-pcie-integer-atomics Comment Actions
I'm not totally sure I'm following the setup here. It sounds like someone writes fetch_or to a global in addrspace(1) and gets it expanded to CAS unless they also pass a scary command line flag which will miscompile other atomic operations which do actually need the expansion. We could use 'system' scope to mean 'flat CAS instruction' but as pointed out above that'll kill performance on non-pcie systems, plus it's really easy to emit system scope instructions unintentionally since that's the default. Is the root problem that addrspace annotation is inaccurate? It seems that architecture + which memory contains the variable is exactly sufficient to pick the instruction, at which point I don't see what the command line flags would be for. @jhuber6 I think libc uses system scope everywhere because the only intrinsics that give access to device scope are opencl ones with constraints on the right type of atomic type argument. If it expands fetch_or to CAS we're losing performance. Comment Actions
We kind of need to have a parameter or set of atomic functions which correspond to the various restrictions, that end up emitting some kind of qualifier on regular atomicrmw. The current unsafe-fp-atomics flag is bogus, it came to mean too many things (both ignore memory safety, and ignore FP mode mismatches) and the core issue has absolutely nothing to do with floating point (there is a dimension of floating point issues, which is also underspecified) but we can ignore that part for the moment. The corresponding attribute also breaks on inlining, such that unsafe code can infect safe code.
Yes, that is what we must do. Doesn't matter what the performance, an undecorated atomic should always work no matter what.
This isn't covered by address space, it's possibly should be a few new scopes. If not a scope, some other per-instruction metadata.
These just need to thread a choice through. I think the openmp device RTL isn't threading through the appropriate scope either Comment Actions ping, I think we just need to go with this and add in the new thing later. Also would like clarification on whether add/xchg at system scope are supposed to work Comment Actions
This will cause perf regression for XGMI users. We cannot afford that. Maybe add another option -amdgpu-safe-pcie-atomics to enable the new changes (default off). Comment Actions
Nobody complained when the last batch were fixed
Absolutely not Comment Actions
Also correctness trumps speed. We should establish a working baseline Comment Actions
If the XGMI users complain, do we have an existing approach to recover the performance by modifying their code? How about making the default always work but adding an option -amdgpu-unsafe-integer-ops to allow integer atomic rmw ops? Then at least they have a way to recover the performance.
Revision Contents
Diff 548397 llvm/lib/Target/AMDGPU/SIISelLowering.cpp
llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_udec_wrap.ll
llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_uinc_wrap.ll
llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll
llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll
llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll
llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll
llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-i16-system.ll
llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-i8-system.ll
|
It does not support it for system scope atomics, it just supports it.