Reduction and scan are implemented using the Iterative and DPP strategies for the float type.
Diff Detail
Repository: rG LLVM Github Monorepo

Event Timeline
The fmin/fmax case and the fadd/fsub cases have nothing to do with each other; you're probably better off handling them in separate patches.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 142: AtomicRMWInst already has isFloatingPointOperation/isFPOperation for this, which also picks up fsub.
Line 224: Should also handle fsub.
Line 410: You can't do it like this; you should use the minnum/maxnum intrinsics (see the sketch after this comment block).
Line 650: This would be +infinity for fmax. For fadd there isn't really an identity value, since fadd -0, 0 -> -0. You probably can't do this without nsz, which we don't have a way of representing. I have a draft patch for unsafe FP atomic metadata that I don't have time to pick up.
Line 652: This would be -infinity.
Line 822: I don't follow how this can be a convert and multiply.
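For illustration, a minimal sketch of what "use the minnum/maxnum intrinsics" could look like through IRBuilder; the helper name is hypothetical and this is not the patch's actual code:

```cpp
#include "llvm/IR/IRBuilder.h"
using namespace llvm;

// Hypothetical helper: build the combining step for FP min/max atomics via
// the llvm.minnum/llvm.maxnum intrinsics rather than fcmp+select, which
// would have different NaN semantics.
static Value *buildFPMinMax(IRBuilder<> &B, AtomicRMWInst::BinOp Op,
                            Value *LHS, Value *RHS) {
  return Op == AtomicRMWInst::FMin ? B.CreateMinNum(LHS, RHS)
                                   : B.CreateMaxNum(LHS, RHS);
}
```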
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 650: For fadd you can use -0 as the identity value. For fsub I think 0 works. Check instcombine:

```llvm
define float @fsub_fold(float %x) {
  %add = fsub float %x, 0.0
  ret float %add
}

define float @fadd_fold_n0(float %x) {
  %add = fadd float %x, -0.0
  ret float %add
}
```

This is of course ignoring signaling nan quieting and denormal flushes.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 652: No, the identity should be +inf for fmin and -inf for fmax.
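Taken together, a hedged sketch of the identity values the review settles on (the function name is hypothetical, though a later comment suggests making getIdentityValueForAtomicOp return a Constant much like this):

```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/ErrorHandling.h"
using namespace llvm;

// Hypothetical sketch of the FP identity values discussed in this review.
static Constant *getFPIdentity(AtomicRMWInst::BinOp Op, Type *Ty) {
  switch (Op) {
  case AtomicRMWInst::FAdd:
    return ConstantFP::getNegativeZero(Ty); // fadd x, -0.0 folds to x
  case AtomicRMWInst::FSub:
    return ConstantFP::get(Ty, 0.0);        // fsub x, +0.0 folds to x
  case AtomicRMWInst::FMin:
    return ConstantFP::getInfinity(Ty);     // fmin x, +inf folds to x
  case AtomicRMWInst::FMax:
    return ConstantFP::getInfinity(Ty, /*Negative=*/true); // fmax x, -inf
  default:
    llvm_unreachable("not a floating-point atomic op");
  }
}
```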
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 231: I think this is a bad interpretation of the strategy option. Doing nothing just because you wanted something else is worse than just using an implemented path. Also, can't you just implement this with dpp?
Line 349: Doesn't consider half. Should also handle <2 x half>, but atomicrmw doesn't support vectors now (you need the intrinsics for those).
Line 622: You shouldn't need a cast after D147732.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 832: These belong with the other patch.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 231: If I understand correctly, the current dpp intrinsics that we need for reduction & scan (llvm.amdgcn.update.dpp) can return only integer types (they accept inputs of any type). @foad Is it possible to extend the current dpp implementation to float types as well?
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 231: I am wrong; this intrinsic is lowered to V_MOV_B32_dpp when matched with i32 types. I think we should be able to implement dpp for floats with some bitcast noise.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 231: I am able to generate functionally correct code for scan with the DPP strategy, but it needs a lot of bitcast mess for llvm.amdgcn.set.inactive.i32 and llvm.amdgcn.update.dpp.i32. Is there any better way of doing this?

```llvm
%16 = bitcast float %9 to i32
%17 = call i32 @llvm.amdgcn.set.inactive.i32(i32 %16, i32 0)
%18 = bitcast i32 %17 to float
%19 = bitcast i32 %16 to float
%20 = bitcast float %18 to i32
%21 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %20, i32 273, i32 15, i32 15, i1 false)
%22 = bitcast i32 %21 to float
%23 = bitcast i32 %20 to float
%24 = fadd float %23, %22
%25 = bitcast float %24 to i32
%26 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %25, i32 274, i32 15, i32 15, i1 false)
%27 = bitcast i32 %26 to float
%28 = bitcast i32 %25 to float
%29 = fadd float %28, %27
%30 = bitcast float %29 to i32
%31 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %30, i32 276, i32 15, i32 15, i1 false)
%32 = bitcast i32 %31 to float
%33 = bitcast i32 %30 to float
%34 = fadd float %33, %32
%35 = bitcast float %34 to i32
%36 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %35, i32 280, i32 15, i32 15, i1 false)
%37 = bitcast i32 %36 to float
%38 = bitcast i32 %35 to float
%39 = fadd float %38, %37
%40 = bitcast float %39 to i32
%41 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %40, i32 322, i32 10, i32 15, i1 false)
%42 = bitcast i32 %41 to float
%43 = bitcast i32 %40 to float
%44 = fadd float %43, %42
%45 = bitcast float %44 to i32
%46 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %45, i32 323, i32 12, i32 15, i1 false)
%47 = bitcast i32 %46 to float
%48 = bitcast i32 %45 to float
%49 = fadd float %48, %47
%50 = bitcast float %49 to i32
%51 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %50, i32 312, i32 15, i32 15, i1 false)
%52 = bitcast i32 %51 to float
%53 = bitcast float %49 to i32
%54 = call i32 @llvm.amdgcn.readlane(i32 %53, i32 63)
%55 = bitcast i32 %54 to float
%56 = call float @llvm.amdgcn.strict.wwm.f32(float %55)
```
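One way to contain that noise (a hedged sketch, not the patch's actual code; the helper name is hypothetical) is to hide the bitcasts in a small wrapper around the i32 intrinsic:

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicsAMDGPU.h"
using namespace llvm;

// Hypothetical wrapper: call the i32-only llvm.amdgcn.update.dpp intrinsic
// with the float bitcasts folded into one place. For i32 inputs the
// bitcasts are no-ops and fold away.
static Value *buildUpdateDPP(IRBuilder<> &B, Value *V, unsigned DppCtrl,
                             unsigned RowMask, unsigned BankMask,
                             bool BoundCtrl) {
  Type *Ty = V->getType();
  Type *I32 = B.getInt32Ty();
  Value *Cast = B.CreateBitCast(V, I32);
  Value *DPP = B.CreateIntrinsic(
      Intrinsic::amdgcn_update_dpp, {I32},
      {B.getInt32(0), Cast, B.getInt32(DppCtrl), B.getInt32(RowMask),
       B.getInt32(BankMask), B.getInt1(BoundCtrl)});
  return B.CreateBitCast(DPP, Ty);
}
```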
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 349: It appears that _Float16 is not supported for atomics in HIP: https://cuda.godbolt.org/z/Gf7so4Y9K
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 349: Doesn't matter, the IR does. You should select the types you do handle, not try to exclude the ones you don't (see the sketch below).
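A minimal sketch of such an allow-list check; the exact supported set shown here (32/64-bit integers and floats) is an assumption for illustration, not taken from the patch:

```cpp
#include "llvm/IR/Type.h"
using namespace llvm;

// Hypothetical allow-list: accept only types the pass is known to handle,
// instead of rejecting specific unsupported ones (half, <2 x half>, ...).
static bool isSupportedAtomicTy(Type *Ty) {
  return Ty->isIntegerTy(32) || Ty->isIntegerTy(64) ||
         Ty->isFloatTy() || Ty->isDoubleTy();
}
```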
llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll
Line 146: This and the next test point are already covered above. Will remove this.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 624: You could use the ternary operator to initialize them.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 624: Wherever there are two bitcast statements I have used an if statement, and a ternary operator where there is a single bitcast statement. I will update this to use the ternary operator everywhere (see the sketch below).
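For illustration, a hedged sketch of the ternary-style initialization being discussed (the helper and names are hypothetical):

```cpp
#include "llvm/IR/IRBuilder.h"
using namespace llvm;

// Hypothetical example: initialize with a ternary instead of an if/else
// pair of assignments. Note a later comment points out CreateBitCast is a
// no-op when the types already match, which makes even the ternary optional.
static Value *asInt(IRBuilder<> &B, Value *V, Type *IntNTy) {
  return V->getType()->isFloatingPointTy() ? B.CreateBitCast(V, IntNTy) : V;
}
```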
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 348: The intrinsics should just be deleted; everything should move to atomicrmw.
Line 454: You can just unconditionally call CreateBitCast; it's a no-op if the type matches anyway.
Line 650: The identity value for fadd is -0; you got these backwards.
Line 652: The identity for fsub is +0, so this is not right either.
Line 750–754: Can you just make getIdentityValueForAtomicOp return a Constant? Or add a variant that does?
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 78: No need to pass isAtomicFloatingPointTy into all these functions. It is just V->getType()->isFloatingPointTy().
Line 347: Don't need to change this.
Line 657: You can derive C from Ty, and BitWidth from Ty, so the arguments should just be: AtomicRMWInst::BinOp Op, Type *Ty.
Line 822: In general fmul will not give exactly the same answer as a sequence of fadds, so you probably need to check some fast-math flags before doing this.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 662: Is it safe to get BitWidth like this? We don't need this for float types.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 662: Simpler to call Ty->getPrimitiveSizeInBits() unconditionally.
Line 745: Might be clearer as:
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 745: If we convert Mbcnt to float here, the integer comparison at line 869 will fail.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 745: Then I suggest moving the casts (both the int and fp cases) down to line 976. Currently, for a 64-bit integer atomic, we cast mbcnt to i64 here, so the comparison on line 869 will be an i64 comparison. That is silly; there is no need for the comparison to be wider than i32.
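A hedged sketch of the suggested restructuring (the helper and names are hypothetical): keep the lane-id comparison in i32 and cast mbcnt to the atomic's type only where the value itself is consumed.

```cpp
#include "llvm/IR/IRBuilder.h"
#include <utility>
using namespace llvm;

// Hypothetical helper: the laneid == 0 test stays a 32-bit icmp regardless
// of the atomic's type; the widening/fp cast happens only for the value use.
static std::pair<Value *, Value *>
splitMbcntUses(IRBuilder<> &B, Value *Mbcnt, Type *Ty) {
  Value *IsLane0 = B.CreateICmpEQ(Mbcnt, B.getInt32(0));
  Value *MbcntAsTy = Ty->isFloatingPointTy()
                         ? B.CreateUIToFP(Mbcnt, Ty)
                         : B.CreateIntCast(Mbcnt, Ty, /*isSigned=*/false);
  return {IsLane0, MbcntAsTy};
}
```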
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 851: I hope this stops 64-bit comparisons for 64-bit atomic values. Please check the effect of this in llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 851: I don't actually see any 64-bit cmp instructions in that test, even before your patch. I guess we already managed to shrink them back to 32-bit comparisons.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 851: Having a 32-bit comparison here for all the cases (int, long, float, wavefront size 32/64) is fine, right? Or do I need to revert this change?
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 851: It is fine. We are talking about the laneid == 0 comparison, which should always be 32-bit even for a 64-bit atomic, since the laneid is just a small integer in the range 0..63.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 78: This is wrong in the case of an FP-typed xchg, which the pass just happens not to handle.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 437: Do you want to switch to the float overloads for the DPP intrinsic here or in a follow-up?
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 437: I would prefer a follow-up patch.
Missing IR check lines? I thought you added some in a previous diff.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 387: Can you use B.CreateFAdd instead of the low-level CreateBinOp? You'll need that to handle strictfp correctly (see the sketch below).
Line 390–391: Ditto.
Line 820–822: We don't have fast-math flags on atomics, but you would need to expand to the add sequence without some kind of reassociate flag.
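For illustration, a hedged sketch of the suggested builder call (names hypothetical):

```cpp
#include "llvm/IR/IRBuilder.h"
using namespace llvm;

// Hypothetical example: the typed CreateFAdd, unlike the low-level
// CreateBinOp(Instruction::FAdd, ...), goes through the builder's
// FP-constrained handling, which is what strictfp support needs here.
static Value *combine(IRBuilder<> &B, Value *LHS, Value *RHS) {
  return B.CreateFAdd(LHS, RHS);
}
```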
IR checks have been added in these files:
- llvm/test/CodeGen/AMDGPU/global_atomics_optimizer_fp_no_rtn.ll
- llvm/test/CodeGen/AMDGPU/global_atomic_optimizer_fp_rtn.ll
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 820–822: If the number-of-active-lanes * uniform-float-value logic is not valid here for the uniform-value case, then can we use the logic implemented in buildScanIteratively for divergent values (even if the input value is uniform in atomics)? Or do we want a sequence of additions, avoiding the loop (branch instructions) that we have in buildScanIteratively? We would also need to write back the intermediate values of that sequence of additions if the result is needed later in the kernel.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 820–822: I suppose this is fine. You didn't have any adding-order guarantee before.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 674–677: These are the wrong way round. You want +0 for fadd and -0 for fsub.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 674–677: No? This was wrong before and has been corrected. InstCombine uses -0 as the fadd identity and +0 as the fsub identity.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
Line 674–677: Oh yeah, you're right. Sorry for the noise.
llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll
Line 173–242: This fsub code does not look right (in both strategies). First you do an fsub reduction, and then you do an atomic fsub of the reduced value. That is like a double negative: you will end up adding the values to the memory location. I think you need to do an fadd reduction followed by an atomic fsub, or vice versa. Have you run any conformance tests that exercise this code?
llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll
Line 173–242: This holds true for integer sub also, right? I have run psdb and the gfx pipeline, which run some conformance tests. I will take a closer look to see what test coverage is required to exercise this.
llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll
Line 173–242: This did not get caught because atomic fsub is transformed to fadd before we reach the atomic optimizer: https://cuda.godbolt.org/z/56ToP79Pb
llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll
Line 173–242: For integer sub this is already handled by:

```cpp
const AtomicRMWInst::BinOp ScanOp =
    Op == AtomicRMWInst::Sub ? AtomicRMWInst::Add : Op;
```
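The floating-point case would presumably mirror that handling; a hedged sketch, not taken from the patch:

```cpp
// Hypothetical: scan/reduce FSub as FAdd so the negation is applied only
// once, by the final atomic fsub of the reduced value.
const AtomicRMWInst::BinOp ScanOp =
    Op == AtomicRMWInst::FSub ? AtomicRMWInst::FAdd : Op;
```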