Add just fadd/fsub for now.
Is this to support wg21.link/p0020 ?
Can you do fsub while you're there? It should be exactly the same thing.
AFAIK we don't care about FP flags at all here, correct? Even strictfp is irrelevant.
Yes
> Can you do fsub while you're there? It should be exactly the same thing.
AMDGPU has hardware for fadd, fmin, and fmax, so I was focusing on those first, but it's trivial to add fsub. fmin/fmax are problematic since they require deciding NaN behavior (so I fear we'll end up with two or three versions of each).
> AFAIK we don't care about FP flags at all here, correct? Even strictfp is irrelevant.
We can't algebraically combine this, so the flags shouldn't matter.
> AFAIK we don't care about FP flags at all here, correct? Even strictfp is irrelevant.
> We can't algebraically combine this, so the flags shouldn't matter.
To clarify my concern: were LLVM to support fenv properly, this instruction could spuriously set FP exception flags if, say, the fadd executes in a loop that can fail and retry. The final store could hold a result whose computation raised no FP exception, but discarded intermediate results could have raised one. But LLVM doesn't support fenv properly yet, so that's not an issue (I think!).
fenv support requires using the new constrained-operation intrinsics, but the support there is incomplete (and there's still some confusion about how they should interact with the regular IR operations, which are what the cmpxchg loop expansion uses).
I'd note that this change isn't a necessary prerequisite to implementing the C++ feature: atomic FP math is already available in C11, and clang already implements it by lowering to a cmpxchg loop. I'd be very skeptical of this addition if you hadn't mentioned that AMDGPU supports it in hardware. AFAIK, there are no other ISAs that support atomic fadd in hardware. However, that AMDGPU does have hardware support seems like a sufficient reason to add support for it to LLVM.
On LL/SC architectures, while you could theoretically execute a floating-point instruction inside the LL/SC loop, there's often a bunch of complexity regarding the potential for a trap to be raised (e.g. for software FPU emulation, or for edge cases like denormals), which may deterministically cancel the reservation, and create an infinite loop. So I'm pretty wary of any proposal to lower to an LL/SC loop containing FP instructions -- I think even on LL/SC architectures, this should generally be lowered to a cmpxchg loop.
On a general note -- this patch doesn't seem complete. I think this should be a complete implementation of the feature -- specifying an "atomicrmw fadd" should function on all architectures after this, even if not most optimally on AMDGPU (since it's not possible to lower into SelectionDAG until after the next patch).
That means pulling in the AtomicExpand changes, and also updating every architecture's shouldExpandAtomicRMWInIR to return CmpXchg for these operations.
docs/LangRef.rst, line 8443: Lost some words.

include/llvm/IR/Instructions.h, line 813: Unused?

lib/CodeGen/AtomicExpandPass.cpp, lines 1559-1560: This isn't great. And I see in your next patch you fix it, but splitting that into another patch doesn't make sense.
include/llvm/IR/Instructions.h, line 813: It's not used here, but included for completeness since all of the other instructions do this (e.g. ICmpInst::getUnsignedPredicate).

lib/CodeGen/AtomicExpandPass.cpp, lines 1559-1560: I'd disagree, since with IR changes there's usually a split to add the minimum assembler and bitcode support before making them functional in a separate commit. This is just needed to quiet the covered-switch warning.
include/llvm/IR/Instructions.h, line 813: I found a use for it when updating all the targets anyway.
LGTM
docs/LangRef.rst, line 8470: "using floating-point rules" -> "using floating-point arithmetic"