This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Support FMin/FMax in AMDGPUAtomicOptimizer.
ClosedPublic

Authored by pravinjagtap on Aug 8 2023, 5:40 AM.

Diff Detail

Event Timeline

pravinjagtap created this revision.Aug 8 2023, 5:40 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 8 2023, 5:40 AM
pravinjagtap requested review of this revision.Aug 8 2023, 5:40 AM
arsenm added inline comments.Aug 8 2023, 5:54 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
314

I don't want to implicitly convert the intrinsics here. We should move towards getting rid of the intrinsics and autoupgrading them

382

This is incorrect, you should create minnum/maxnum

For the FMin and FMax cases, clang itself emits a CAS loop for both

__device__ inline float atomicMax(float* addr, float val) and
__device__ inline float unsafeAtomicMax(float* addr, float val)

I am not sure how to avoid this CAS loop before we reach the atomic optimizer pass.
CC: @b-sumner @arsenm

llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
382

@arsenm you earlier suggested using the minnum/maxnum intrinsics for this. This also seems to give correct behavior; I am not sure what I am missing here.

arsenm added inline comments.Aug 8 2023, 6:05 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
382

Yes, it is wrong to use fcmp and select here. For example, for fmax, what you have returns the wrong result if the LHS is a NaN.

select (ugt nan, rhs), nan, rhs -> nan
maxnum(nan, rhs) -> rhs
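The difference is easy to reproduce in plain C++ (a hedged sketch, not the pass code; viaSelect and viaMaxnum are hypothetical helpers). std::fmax follows the same quiet-NaN semantics as llvm.maxnum, while an unordered-greater-than compare followed by a select propagates the NaN:

```cpp
#include <cmath>

// Models "select (fcmp ugt lhs, rhs), lhs, rhs": ugt is true when the
// operands are unordered (either is NaN) or lhs > rhs, so a NaN lhs is
// selected and propagated.
float viaSelect(float lhs, float rhs) {
  bool ugt = std::isnan(lhs) || std::isnan(rhs) || lhs > rhs;
  return ugt ? lhs : rhs;
}

// Models maxnum (IEEE-754 maxNum): a quiet NaN is treated as missing
// data, so the non-NaN operand is returned.
float viaMaxnum(float lhs, float rhs) { return std::fmax(lhs, rhs); }
```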

arsenm added a comment.Aug 8 2023, 6:05 AM

For the FMin and FMax cases, clang itself emits a CAS loop for both

__device__ inline float atomicMax(float* addr, float val) and
__device__ inline float unsafeAtomicMax(float* addr, float val)

I am not sure how to avoid this CAS loop before we reach the atomic optimizer pass.
CC: @b-sumner @arsenm

Clang should not be expanding any atomics itself

pravinjagtap added inline comments.Aug 8 2023, 6:08 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
382

you should create minnum/maxnum

Are you referring to the @llvm.amdgcn.fcmp.f32(float, float, i32) intrinsic here?

arsenm added inline comments.Aug 8 2023, 6:09 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
382

No, IRBuilder.CreateMinNum and CreateMaxNum

Switched to CreateMinNum/CreateMaxNum instead of fcmp and select
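In the pass this maps onto the IRBuilder API roughly as follows (a sketch, not the committed code; B, LHS, and RHS stand in for the pass's actual builder and operands):

```cpp
// CreateMinNum/CreateMaxNum emit calls to llvm.minnum/llvm.maxnum,
// which keep IEEE minNum/maxNum quiet-NaN semantics, unlike an
// fcmp+select pair.
case AtomicRMWInst::FMin:
  return B.CreateMinNum(LHS, RHS);
case AtomicRMWInst::FMax:
  return B.CreateMaxNum(LHS, RHS);
```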

arsenm added a comment.Aug 8 2023, 6:46 AM

Should also have some end-to-end codegen tests; this alone won't catch the interaction between atomic expansion and the atomic optimizer.

llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll
543 ↗(On Diff #548193)

Should also test with different scopes, at least default system and agent

547 ↗(On Diff #548193)

Should take this from the command line

Added a few more test points

For the FMin and FMax cases, clang itself emits a CAS loop for both

__device__ inline float atomicMax(float* addr, float val) and
__device__ inline float unsafeAtomicMax(float* addr, float val)

I am not sure how to avoid this CAS loop before we reach the atomic optimizer pass.
CC: @b-sumner @arsenm

Clang should not be expanding any atomics itself

All of this is implemented outside of LLVM: the logic for creating the CAS loop lives in hipamd/include/hip/amd_detail/amd_hip_atomic.h. How should we go about this? Do I need to update the hipamd repo? CC: @b-sumner @arsenm

Rebase and test updates

pravinjagtap added a reviewer: Restricted Project.Aug 17 2023, 3:47 AM

Addressed review comments and rebase

arsenm added inline comments.Aug 23 2023, 4:41 PM
llvm/test/CodeGen/AMDGPU/global_atomic_optimizer_fp_rtn.ll
729

This broke the strictfp handling (I thought this was supposed to fail the verifier now?)

You probably need something like

Builder.setIsFPConstrained(
    RMWI->getFunction()->hasFnAttribute(Attribute::StrictFP));

Rebased against updated dependencies.

arsenm accepted this revision.Aug 29 2023, 3:09 PM
arsenm added inline comments.
llvm/test/CodeGen/AMDGPU/global_atomic_optimizer_fp_rtn.ll
631–633

The canonical way to do this extraction in the IR is trunc x and trunc (lshr x, 32)
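As a plain-arithmetic illustration of that IR pattern (hypothetical lowHalf/highHalf helpers, not the test's actual IR), the two halves of a 64-bit value are obtained like this:

```cpp
#include <cstdint>

// "trunc i64 %x to i32": keeps the low 32 bits.
uint32_t lowHalf(uint64_t x) { return static_cast<uint32_t>(x); }

// "trunc (lshr i64 %x, 32) to i32": shifts the high 32 bits down first.
uint32_t highHalf(uint64_t x) { return static_cast<uint32_t>(x >> 32); }
```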

This revision is now accepted and ready to land.Aug 29 2023, 3:09 PM

Addressed review comments

This revision was landed with ongoing or failed builds.Aug 30 2023, 9:11 AM
This revision was automatically updated to reflect the committed changes.
llvm/test/CodeGen/AMDGPU/global_atomics_optimizer_fp_no_rtn.ll