This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Support FAdd/FSub global atomics in AMDGPUAtomicOptimizer.
ClosedPublic

Authored by pravinjagtap on Jul 26 2023, 1:45 AM.

Details

Summary

Reduction and scan are implemented using the iterative and DPP strategies for the float type.

Diff Detail

Event Timeline

pravinjagtap created this revision.Jul 26 2023, 1:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 26 2023, 1:45 AM
pravinjagtap requested review of this revision.Jul 26 2023, 1:45 AM
pravinjagtap edited the summary of this revision. (Show Details)Jul 26 2023, 1:54 AM
pravinjagtap added reviewers: foad, b-sumner.

The fmin/fmax and fadd/fsub cases have nothing to do with each other; you're probably better off handling them in separate patches.

llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
140

AtomicRMWInst already has isFloatingPointOperation/isFPOperation for this, which also picks up fsub
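
A minimal sketch of the suggested check (assuming Op is the AtomicRMWInst::BinOp being optimized):

  // isFloatingPointOperation covers FAdd, FSub, FMin and FMax in one
  // query, so fsub needs no special-casing.
  if (AtomicRMWInst::isFloatingPointOperation(Op)) {
    // ... take the floating-point path ...
  }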

207

Should also handle fsub

380

You can't do it like this; you should use the minnum/maxnum intrinsics.

637

This would be +infinity for fmax.

For fadd there isn't really an identity value, since fadd -0, 0 -> -0. You probably can't do this without nsz, which we don't have a way of representing.

I have a draft patch for unsafe FP atomic metadata I don't have time to pick up.

639

This would be -infinity

810

I don't follow how this can be a convert and multiply

arsenm added inline comments.Jul 26 2023, 7:11 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
637

For fadd you can use -0 as the identity value. For fsub I think 0 works:

Check instcombine:

define float @fsub_fold(float %x) {
  %sub = fsub float %x, 0.0
  ret float %sub
}

define float @fadd_fold_n0(float %x) {
  %add = fadd float %x, -0.0
  ret float %add
}

This is of course ignoring signaling NaN quieting and denormal flushes.

foad added inline comments.Jul 27 2023, 3:01 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
639

No, the identity should be +inf for fmin and -inf for fmax.

Splitting the support for floating-point ops into two separate patches.

pravinjagtap retitled this revision from [WIP] Support FP global atomics in AMDGPUAtomicOptimizer. to [WIP] Support FAdd/FSub global atomics in AMDGPUAtomicOptimizer..Jul 28 2023, 11:01 AM
pravinjagtap edited the summary of this revision. (Show Details)
arsenm added inline comments.Jul 28 2023, 11:29 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
213

I think this is a bad interpretation of the strategy option. Doing nothing just because you wanted something else is worse than just using an implemented path. Also you can just implement this with dpp?

317

Doesn't consider half

Should also handle <2 x half>, but atomicrmw doesn't support vectors now (you need the intrinsics for those)

605

You shouldn't need a cast after D147732

arsenm added inline comments.Jul 28 2023, 11:51 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
820

These belong with the other patch

pravinjagtap added inline comments.Jul 28 2023, 9:05 PM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
213

Also you can just implement this with dpp?

If I understand correctly, the current DPP intrinsic that we need for reduction & scan (llvm.amdgcn.update.dpp) can return only integer types (it accepts inputs of any type). @foad Is it possible to extend the current DPP implementation to float types as well?

pravinjagtap added inline comments.Jul 28 2023, 9:13 PM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
213

Also you can just implement this with dpp?

If I understand correctly, the current DPP intrinsic that we need for reduction & scan (llvm.amdgcn.update.dpp) can return only integer types (it accepts inputs of any type).

I was wrong: this intrinsic is lowered to V_MOV_B32_dpp when matched with i32 types. I think we should be able to implement DPP for floats with some bitcast noise.

pravinjagtap added inline comments.Jul 29 2023, 11:56 PM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
213

I am able to generate functionally correct code for the scan with the DPP strategy, but it needs a lot of bitcast mess for llvm.amdgcn.set.inactive.i32 and llvm.amdgcn.update.dpp.i32. Is there any better way of doing this?

%16 = bitcast float %9 to i32
%17 = call i32 @llvm.amdgcn.set.inactive.i32(i32 %16, i32 0)
%18 = bitcast i32 %17 to float
%19 = bitcast i32 %16 to float
%20 = bitcast float %18 to i32
%21 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %20, i32 273, i32 15, i32 15, i1 false)
%22 = bitcast i32 %21 to float
%23 = bitcast i32 %20 to float
%24 = fadd float %23, %22
%25 = bitcast float %24 to i32
%26 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %25, i32 274, i32 15, i32 15, i1 false)
%27 = bitcast i32 %26 to float
%28 = bitcast i32 %25 to float
%29 = fadd float %28, %27
%30 = bitcast float %29 to i32
%31 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %30, i32 276, i32 15, i32 15, i1 false)
%32 = bitcast i32 %31 to float
%33 = bitcast i32 %30 to float
%34 = fadd float %33, %32
%35 = bitcast float %34 to i32
%36 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %35, i32 280, i32 15, i32 15, i1 false)
%37 = bitcast i32 %36 to float
%38 = bitcast i32 %35 to float
%39 = fadd float %38, %37
%40 = bitcast float %39 to i32
%41 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %40, i32 322, i32 10, i32 15, i1 false)
%42 = bitcast i32 %41 to float
%43 = bitcast i32 %40 to float
%44 = fadd float %43, %42
%45 = bitcast float %44 to i32
%46 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %45, i32 323, i32 12, i32 15, i1 false)
%47 = bitcast i32 %46 to float
%48 = bitcast i32 %45 to float
%49 = fadd float %48, %47
%50 = bitcast float %49 to i32
%51 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %50, i32 312, i32 15, i32 15, i1 false)
%52 = bitcast i32 %51 to float
%53 = bitcast float %49 to i32
%54 = call i32 @llvm.amdgcn.readlane(i32 %53, i32 63)
%55 = bitcast i32 %54 to float
%56 = call float @llvm.amdgcn.strict.wwm.f32(float %55)
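; For reference, the dpp_ctrl immediates above appear to decode as follows
; (per the DPP control encoding): 273/274/276/280 = row_shr:1/2/4/8,
; 322 = row_bcast:15, 323 = row_bcast:31, and 312 = wave_shr:1.
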
pravinjagtap added inline comments.Jul 31 2023, 2:32 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
317

Doesn't consider half

Appears that _Float16 is not supported for atomics in HIP: https://cuda.godbolt.org/z/Gf7so4Y9K

arsenm added inline comments.Jul 31 2023, 5:26 PM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
317

Doesn't matter, the IR does. You should select the types you do handle, not try to exclude ones you don't

Supported float type for atomic ops in the atomic optimizer's DPP strategy.
This mostly requires bitcast noise before and after:

  • amdgcn.set.inactive
  • amdgcn.update.dpp
  • amdgcn.readlane
  • amdgcn.writelane
  • amdgcn.permlanex16
  • amdgcn.permlane64

We can get rid of this noise after D147732 and D156647.
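
For illustration, the recurring pattern around each i32-only intrinsic looks roughly like this (a hedged sketch with illustrative names such as DppCtrl, not the exact patch code):

  // Round-trip the float payload through i32 for an intrinsic that is
  // only overloaded on i32, e.g. llvm.amdgcn.update.dpp.i32.
  Value *AsInt = B.CreateBitCast(V, B.getInt32Ty());
  Value *Upd = B.CreateIntrinsic(
      Intrinsic::amdgcn_update_dpp, {B.getInt32Ty()},
      {B.getInt32(0), AsInt, B.getInt32(DppCtrl), B.getInt32(0xf),
       B.getInt32(0xf), B.getFalse()});
  Value *AsFP = B.CreateBitCast(Upd, Ty);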

pravinjagtap added inline comments.Aug 2 2023, 1:33 AM
llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll
146

This & the next test point are already covered above. Will remove this.

Fixed/updated lit tests

pravinjagtap edited the summary of this revision. (Show Details)Aug 2 2023, 4:13 AM
pravinjagtap retitled this revision from [WIP] Support FAdd/FSub global atomics in AMDGPUAtomicOptimizer. to [AMDGPU] Support FAdd/FSub global atomics in AMDGPUAtomicOptimizer..Aug 2 2023, 8:18 PM
pravinjagtap edited the summary of this revision. (Show Details)
pravinjagtap added a reviewer: cdevadas.
cdevadas added inline comments.Aug 2 2023, 8:36 PM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
607

You could use the ternary operator to initialize them.

pravinjagtap added inline comments.Aug 2 2023, 9:11 PM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
607

You could use the ternary operator to initialize them.

Wherever there are two bitcast statements I have used an if statement, and a ternary operator where there is a single bitcast statement. I will update this to use the ternary everywhere.
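
Presumably something like (illustrative names):

  Value *CastedV = NeedsCast ? B.CreateBitCast(V, IntNTy) : V;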

pravinjagtap added a reviewer: Restricted Project.Aug 7 2023, 7:44 AM
arsenm added inline comments.Aug 10 2023, 2:49 PM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
316

The intrinsics should just be deleted, everything should move to atomicrmw

419–423

You can just unconditionally call CreateBitCast, it's a no-op if the type matches anyway
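
That reduces each cast site to a single unconditional line (sketch):

  // IRBuilder::CreateBitCast returns V unchanged when the types
  // already match, so the integer path is unaffected.
  V = B.CreateBitCast(V, IntNTy);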

637

Identity value for fadd is -0, you got these backwards

639

identity for fsub is +0, so no 'true' here

736–740

Can you just make getIdentityValueForAtomicOp return a Constant? Or add a variant that does?
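
One possible shape for such a variant (a hypothetical sketch; the identity values follow the discussion above):

  static Constant *getIdentityValueForAtomicOp(AtomicRMWInst::BinOp Op,
                                               Type *Ty) {
    switch (Op) {
    case AtomicRMWInst::FAdd:
      return ConstantFP::get(Ty, -0.0); // -0.0 is the fadd identity
    case AtomicRMWInst::FSub:
      return ConstantFP::get(Ty, +0.0); // +0.0 is the fsub identity
    default:
      llvm_unreachable("unhandled atomic op");
    }
  }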

Addressed review comments

foad added inline comments.Aug 11 2023, 2:26 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
78

No need to pass in isAtomicFloatingPointTy to all these functions. It is just V->getType()->isFloatingPointTy().

315

Don't need to change this

631–634

You can derive C from Ty, and BitWidth from Ty, so the arguments should just be: AtomicRMWInst::BinOp Op, Type *Ty
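
That is (sketch):

  LLVMContext &C = Ty->getContext();
  unsigned BitWidth = Ty->getPrimitiveSizeInBits();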

810

In general fmul will not give the exact same answer as a sequence of fadds, so you probably need to check some fast math flags before doing this.
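
In other words, the uniform-value shortcut is effectively the following (sketch; NumLanes and V are illustrative):

  // N * V computed as uitofp + fmul only matches N sequential fadds
  // under reassociation-style assumptions.
  Value *NumLanesFP = B.CreateUIToFP(NumLanes, Ty);
  Value *Sum = B.CreateFMul(NumLanesFP, V);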

Code cleanup.

pravinjagtap added inline comments.Aug 11 2023, 4:41 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
636

Is it safe to get BitWidth like this? We don't need this for float types.

foad added inline comments.Aug 11 2023, 4:57 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
636

Simpler to call Ty->getPrimitiveSizeInBits() unconditionally.

731–732

Might be clearer as:

  Mbcnt = isAtomicFloatingPointTy ? B.CreateUIToFP(Mbcnt, Ty)
                                  : B.CreateIntCast(Mbcnt, Ty, false);

(instead of doing the fp cast on line 996), since in both cases we want to convert Mbcnt to type Ty.

foad added inline comments.Aug 11 2023, 4:59 AM
llvm/test/CodeGen/AMDGPU/local-atomics-fp.ll
250–1

Please pre-commit the conversion to generated checks

llvm/test/CodeGen/AMDGPU/shl_add_ptr_global.ll
1 ↗(On Diff #549344)

Please pre-commit the conversion to generated checks

pravinjagtap added inline comments.Aug 11 2023, 5:46 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
731–732

If we convert Mbcnt to float here, the integer comparison will fail at line 869.

foad added inline comments.Aug 11 2023, 6:00 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
731–732

Then I suggest moving the casts (both the int and fp cases) down to line 976.

Currently, for a 64-bit integer atomic, we will cast mbcnt to i64 here, so the comparison on line 869 will be an i64 comparison. That is silly. There is no need for the comparison to be wider than i32.

Addressed review comments.

pravinjagtap added inline comments.Aug 11 2023, 7:10 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
837

I hope this stops 64-bit comparisons for 64-bit atomic values. Please check the effect of this in llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll

foad added inline comments.Aug 11 2023, 7:16 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
837

I don't actually see any 64-bit cmp instructions in that test, even before your patch. I guess we already managed to shrink them back to 32-bit comparisons.

pravinjagtap added inline comments.Aug 11 2023, 7:30 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
837

Having a 32-bit comparison here for all the cases (int, long, float, wavefront size 32/64) is fine, right? Or do I need to revert this change?

foad added inline comments.Aug 11 2023, 7:42 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
837

It is fine. We are talking about the laneid == 0 comparison, which should always be 32-bit even for a 64-bit atomic, since the laneid is just a small integer in the range 0..63.
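
That is, something like this regardless of the atomic's width (sketch):

  // The mbcnt result is a lane index in 0..63, so an i32 compare suffices.
  Value *Cond = B.CreateICmpEQ(Mbcnt, B.getInt32(0));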

arsenm added inline comments.Aug 11 2023, 9:20 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
78

This is wrong in the case of FP typed xchg, which the pass just happens to not handle

foad added inline comments.Aug 15 2023, 2:23 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
392–393

Simplify this, here and in other functions

527

Do these bitcasts unconditionally, here and below.

Added unconditional bitcasts. Also, code cleanup.

arsenm added inline comments.Aug 18 2023, 5:41 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
402

Do you want to switch to the float overloads for the DPP intrinsic here or in a follow up?

pravinjagtap added inline comments.Aug 18 2023, 6:04 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
402

I would prefer a follow-up patch.

Switched to the float overloads for the DPP intrinsic

Missing IR check lines? I thought you added some in a previous diff

llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
355

Can you use B.CreateFAdd instead of the low level CreateBinOp? You'll need that to handle strictfp correctly
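
A sketch of the suggested change (LHS and RHS are illustrative):

  // CreateFAdd respects the builder's constrained-FP state, so under
  // strictfp it can emit llvm.experimental.constrained.fadd rather
  // than a plain fadd; a raw CreateBinOp would not.
  Value *Sum = B.CreateFAdd(LHS, RHS);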

358–359

Ditto

808–810

We don't have fast-math flags on atomics, but without some kind of reassociate flag you would need to expand to the add sequence.

Missing IR check lines? I thought you added some in a previous diff

IR checks have been added in files:

llvm/test/CodeGen/AMDGPU/global_atomics_optimizer_fp_no_rtn.ll
llvm/test/CodeGen/AMDGPU/global_atomic_optimizer_fp_rtn.ll

Fixed the strictfp handling

pravinjagtap added inline comments.Aug 23 2023, 9:31 PM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
808–810

If the number-of-active-lanes * uniform-float-value logic is not valid here for the uniform-value case, then can we use the logic implemented in buildScanIteratively for divergent values (even if the input value is uniform in atomics)?

Or do we want a sequence of additions, avoiding the loop (branch instructions) that we have in buildScanIteratively? We would also need to write back the intermediate values of the sequence of additions if the result is needed later in the kernel.

pravinjagtap added inline comments.Aug 23 2023, 10:12 PM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
808–810
arsenm accepted this revision.Aug 29 2023, 4:12 PM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
808–810

I suppose this is fine. You didn't have any addition-order guarantee before.

This revision is now accepted and ready to land.Aug 29 2023, 4:12 PM

Addressed review comment: trunc instead of bitcast.

foad added inline comments.Sep 12 2023, 3:06 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
651–654

These are the wrong way round. You want +0 for fadd and -0 for fsub.

arsenm added inline comments.Sep 12 2023, 3:12 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
651–654

No? This was wrong before and corrected. InstCombine uses -0 as fadd identity and +0 as fsub identity

foad added inline comments.Sep 12 2023, 3:16 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
651–654

Oh yeah, you're right. Sorry for the noise.

foad added inline comments.Sep 12 2023, 3:23 AM
llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll
172–241

This fsub code does not look right (both strategies). First you do an fsub-reduction, and then you do an atomic fsub of the reduced value. That is like a double negative - you will end up adding the values to the memory location. I think you need to do an fadd reduction followed by an atomic fsub, or vice versa.

Have you run any conformance tests that exercise this code?

pravinjagtap added inline comments.Sep 12 2023, 4:04 AM
llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll
172–241

This holds true for integer sub also, right? I have run psdb and the gfx pipeline, which run some conformance tests. I will take a closer look to see what test coverage is required to exercise this.

pravinjagtap added inline comments.Sep 12 2023, 4:44 AM
llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll
172–241

This did not get caught because atomic fsub is transformed to fadd before we reach the atomic optimizer: https://cuda.godbolt.org/z/56ToP79Pb

foad added inline comments.Sep 12 2023, 5:05 AM
llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll
172–241

For integer sub this is already handled by:

const AtomicRMWInst::BinOp ScanOp =
    Op == AtomicRMWInst::Sub ? AtomicRMWInst::Add : Op;
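
Presumably the floating-point analogue would be something like (sketch):

  const AtomicRMWInst::BinOp ScanOp =
      Op == AtomicRMWInst::FSub ? AtomicRMWInst::FAdd : Op;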