Details
- Reviewers
arsenm b-sumner foad cdevadas - Group Reviewers
Restricted Project - Commits
- rGedb9fab39022: [AMDGPU] Support FMin/FMax in AMDGPUAtomicOptimizer.
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
For FMin and FMax cases, clang itself is emitting CAS loop for both
__device__ inline float atomicMax(float* addr, float val) and __device__ inline float unsafeAtomicMax(float* addr, float val)
I am not sure how to potentially avoid this CAS loop before we reach atomic optimization pass.
CC: @b-sumner @arsenm
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
382 | @arsenm you earlier suggested to use minnum/maxnum intrinsics for this. This also seems to give correct behavior. I am not sure what I am missing here |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
382 | Yes, it is wrong to use fcmp and select here. For example for fmax what you have returns the wrong result if LHS is a nan. select (ugt nan, rhs), nan, rhs -> nan |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
382 |
Are you referring to @llvm.amdgcn.fcmp.f32(float, float, i32) intrinsic here right ? |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
382 | No, IRBuilder.CreateMinNum and CreateMaxNum |
Also should have some end to end codegen tests, just this won't catch the interaction between the atomic expand and atomic optimizer
llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll | ||
---|---|---|
543 ↗ | (On Diff #548193) | Should also test with different scopes, at least default system and agent |
547 ↗ | (On Diff #548193) | Should take this from the command line |
llvm/test/CodeGen/AMDGPU/global_atomic_optimizer_fp_rtn.ll | ||
---|---|---|
729 | This broke the strictfp handling (I thought this was supposed to fail the verifier now?) You probably need something like Builder.setIsFPConstrained( RMWI->getFunction()->hasFnAttribute(Attribute::StrictFP)); |
llvm/test/CodeGen/AMDGPU/global_atomic_optimizer_fp_rtn.ll | ||
---|---|---|
631–633 | The canonical way to do this extract in the IR is trunc and trunc (lshr x, 32) |
I don't want to implicitly convert the intrinsics here. We should move towards getting rid of the intrinsics and autoupgrading them