This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Support FMin/FMax in AMDGPUAtomicOptimizer.
ClosedPublic

Authored by pravinjagtap on Aug 8 2023, 5:40 AM.

Diff Detail

Unit TestsFailed

Event Timeline

pravinjagtap created this revision.Aug 8 2023, 5:40 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 8 2023, 5:40 AM
pravinjagtap requested review of this revision.Aug 8 2023, 5:40 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 8 2023, 5:40 AM
arsenm added inline comments.Aug 8 2023, 5:54 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
320

I don't want to implicitly convert the intrinsics here. We should move towards getting rid of the intrinsics and autoupgrading them

405

This is incorrect, you should create minnum/maxnum

For FMin and FMax cases, clang itself is emitting CAS loop for both

__device__ inline float atomicMax(float* addr, float val) and
__device__ inline float unsafeAtomicMax(float* addr, float val)

I am not sure how to potentially avoid this CAS loop before we reach atomic optimization pass.
CC: @b-sumner @arsenm

llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
405

@arsenm you earlier suggested to use minnum/maxnum intrinsics for this. This also seems to give correct behavior. I am not sure what I am missing here

arsenm added inline comments.Aug 8 2023, 6:05 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
405

Yes, it is wrong to use fcmp and select here. For example for fmax what you have returns the wrong result if LHS is a nan.

select (ugt nan, rhs), nan, rhs -> nan
maxnum(nan, rhs) -> rhs

arsenm added a comment.Aug 8 2023, 6:05 AM

For FMin and FMax cases, clang itself is emitting CAS loop for both

__device__ inline float atomicMax(float* addr, float val) and
__device__ inline float unsafeAtomicMax(float* addr, float val)

I am not sure how to potentially avoid this CAS loop before we reach atomic optimization pass.
CC: @b-sumner @arsenm

Clang should not be expanding any atomics itself

pravinjagtap added inline comments.Aug 8 2023, 6:08 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
405

you should create minnum/maxnum

Are you referring to @llvm.amdgcn.fcmp.f32(float, float, i32) intrinsic here right ?

arsenm added inline comments.Aug 8 2023, 6:09 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
405

No, IRBuilder.CreateMinNum and CreateMaxNum

Switched to CreateMinNum/CreateMaxNum than fcmp and select

arsenm added a comment.Aug 8 2023, 6:46 AM

Also should have some end to end codegen tests, just this won't catch the interaction between the atomic expand and atomic optimizer

llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll
0

Should take this from the command line

259

Should also test with different scopes, at least default system and agent

Added few more test points

For FMin and FMax cases, clang itself is emitting CAS loop for both

__device__ inline float atomicMax(float* addr, float val) and
__device__ inline float unsafeAtomicMax(float* addr, float val)

I am not sure how to potentially avoid this CAS loop before we reach atomic optimization pass.
CC: @b-sumner @arsenm

Clang should not be expanding any atomics itself

All this is being implemented outside of llvm. The logic for creating CAS loop is inserted in hipamd/include/hip/amd_detail/amd_hip_atomic.h. How should we go about this? do I need to update hipamd repo ? CC: @b-sumner @arsenm

Rebase & test-updates

pravinjagtap added a reviewer: Restricted Project.Aug 17 2023, 3:47 AM

Addressed review comments and rebase

arsenm added inline comments.Aug 23 2023, 4:41 PM
llvm/test/CodeGen/AMDGPU/global_atomic_optimizer_fp_rtn.ll
809 ↗(On Diff #551428)

This broke the strictfp handling (I thought this was supposed to fail the verifier now?)

You probably need something like

Builder.setIsFPConstrained(
    RMWI->getFunction()->hasFnAttribute(Attribute::StrictFP));

Rebased agaist updated dependancies.

arsenm accepted this revision.Aug 29 2023, 3:09 PM
arsenm added inline comments.
llvm/test/CodeGen/AMDGPU/global_atomic_optimizer_fp_rtn.ll
631–633 ↗(On Diff #553027)

The canonical way to do this extract in the IR is trunc and trunc (lshr x, 32)

This revision is now accepted and ready to land.Aug 29 2023, 3:09 PM

Addressed review comments

This revision was landed with ongoing or failed builds.Aug 30 2023, 9:11 AM
This revision was automatically updated to reflect the committed changes.
llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll