Reduction and Scan are implemented using the Iterative and DPP strategies for the float type.
Event Timeline
The fmin/fmax case and the fadd/fsub cases have nothing to do with each other; you're probably better off handling them in separate patches
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
140 | AtomicRMWInst already has isFloatingPointOperation/isFPOperation for this, which also picks up fsub | |
207 | Should also handle fsub | |
380 | You can't do it like this; you should use the minnum/maxnum intrinsics | |
637 | This would be +infinity for fmax. For fadd there isn't really an identity value, since fadd -0, 0 -> -0. You probably can't do this without nsz, which we don't have a way of representing. I have a draft patch for unsafe FP atomic metadata that I don't have time to pick up. | |
639 | This would be -infinity | |
810 | I don't follow how this can be a convert and multiply |
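As a rough illustration of the minnum/maxnum suggestion for line 380, a reduction step could be built with the binary intrinsics rather than a compare-and-select. This is only a hedged sketch with placeholder names (`B`, `LHS`, `RHS`), not code from the patch:

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Intrinsics.h"

using namespace llvm;

// One fmin/fmax combine step for the scan/reduction, expressed with the
// llvm.minnum/llvm.maxnum intrinsics (the overload is chosen from the
// operand type, e.g. f32 here).
static Value *buildFMinMaxStep(IRBuilder<> &B, AtomicRMWInst::BinOp Op,
                               Value *LHS, Value *RHS) {
  Intrinsic::ID IID =
      Op == AtomicRMWInst::FMin ? Intrinsic::minnum : Intrinsic::maxnum;
  return B.CreateBinaryIntrinsic(IID, LHS, RHS);
}
```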
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
637 | For fadd you can use -0 as the identity value. For fsub I think 0 works. Check instcombine (IR below). This is of course ignoring signaling nan quieting and denormal flushes. |

```llvm
define float @fsub_fold(float %x) {
  %add = fsub float %x, 0.0
  ret float %add
}

define float @fadd_fold_n0(float %x) {
  %add = fadd float %x, -0.0
  ret float %add
}
```
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
639 | No, the identity should be +inf for fmin and -inf for fmax. |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
213 | I think this is a bad interpretation of the strategy option. Doing nothing just because you wanted something else is worse than just using an implemented path. Also you can just implement this with dpp? | |
317 | Doesn't consider half. Should also handle <2 x half>, but atomicrmw doesn't support vectors now (you need the intrinsics for those) | |
605 | You shouldn't need a cast after D147732 |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
820 | These belong with the other patch |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
213 | If I understand correctly, the current dpp intrinsics that we need for reduction & scan (llvm.amdgcn.update.dpp) can return only integer types (they accept inputs of any type). @foad Is it possible to extend the current dpp implementation for float types as well? |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
213 | I was wrong; this intrinsic is lowered to V_MOV_B32_dpp when matched with i32 types. I think we should be able to implement dpp for floats with some bitcast noise. |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
213 | I am able to generate functionally correct code for scan with the DPP strategy, but it needs a lot of bitcast mess for llvm.amdgcn.set.inactive.i32 and llvm.amdgcn.update.dpp.i32. Is there a better way of doing this? |

```llvm
%16 = bitcast float %9 to i32
%17 = call i32 @llvm.amdgcn.set.inactive.i32(i32 %16, i32 0)
%18 = bitcast i32 %17 to float
%19 = bitcast i32 %16 to float
%20 = bitcast float %18 to i32
%21 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %20, i32 273, i32 15, i32 15, i1 false)
%22 = bitcast i32 %21 to float
%23 = bitcast i32 %20 to float
%24 = fadd float %23, %22
%25 = bitcast float %24 to i32
%26 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %25, i32 274, i32 15, i32 15, i1 false)
%27 = bitcast i32 %26 to float
%28 = bitcast i32 %25 to float
%29 = fadd float %28, %27
%30 = bitcast float %29 to i32
%31 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %30, i32 276, i32 15, i32 15, i1 false)
%32 = bitcast i32 %31 to float
%33 = bitcast i32 %30 to float
%34 = fadd float %33, %32
%35 = bitcast float %34 to i32
%36 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %35, i32 280, i32 15, i32 15, i1 false)
%37 = bitcast i32 %36 to float
%38 = bitcast i32 %35 to float
%39 = fadd float %38, %37
%40 = bitcast float %39 to i32
%41 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %40, i32 322, i32 10, i32 15, i1 false)
%42 = bitcast i32 %41 to float
%43 = bitcast i32 %40 to float
%44 = fadd float %43, %42
%45 = bitcast float %44 to i32
%46 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %45, i32 323, i32 12, i32 15, i1 false)
%47 = bitcast i32 %46 to float
%48 = bitcast i32 %45 to float
%49 = fadd float %48, %47
%50 = bitcast float %49 to i32
%51 = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %50, i32 312, i32 15, i32 15, i1 false)
%52 = bitcast i32 %51 to float
%53 = bitcast float %49 to i32
%54 = call i32 @llvm.amdgcn.readlane(i32 %53, i32 63)
%55 = bitcast i32 %54 to float
%56 = call float @llvm.amdgcn.strict.wwm.f32(float %55)
```
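One way to keep that bitcast noise contained is a small helper that does the i32 round trip around the DPP intrinsic. This is only a sketch under the assumptions of this thread (an IRBuilder `B`, a float lane value `V`, and immediate DPP operands); it is not the code in the patch:

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicsAMDGPU.h"

using namespace llvm;

// Bitcast a float lane value to i32, apply llvm.amdgcn.update.dpp.i32, and
// bitcast the result back to float, so callers never see the integer detour.
static Value *buildUpdateDppF32(IRBuilder<> &B, Value *V, unsigned DppCtrl,
                                unsigned RowMask, unsigned BankMask,
                                bool BoundCtrl) {
  Type *I32 = B.getInt32Ty();
  Value *AsInt = B.CreateBitCast(V, I32);
  Value *Dpp = B.CreateIntrinsic(
      Intrinsic::amdgcn_update_dpp, {I32},
      {B.getInt32(0), AsInt, B.getInt32(DppCtrl), B.getInt32(RowMask),
       B.getInt32(BankMask), B.getInt1(BoundCtrl)});
  return B.CreateBitCast(Dpp, V->getType());
}
```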
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
317 | It appears that _Float16 is not supported for atomics in HIP: https://cuda.godbolt.org/z/Gf7so4Y9K |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
317 | Doesn't matter, the IR does. You should select the types you do handle, not try to exclude ones you don't |
llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll | ||
---|---|---|
146 | This and the next test point are already covered above. Will remove this. |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
607 | You could use the ternary operator to initialize them. |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
607 | Wherever there are two bitcast statements I have used an if statement, and a ternary operator where there is a single bitcast statement. I will update this to use the ternary everywhere. |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
316 | The intrinsics should just be deleted, everything should move to atomicrmw | |
419–423 | You can just unconditionally call CreateBitCast, it's a no-op if the type matches anyway | |
637 | Identity value for fadd is -0, you got these backwards | |
639 | identity for fsub is +0, so no true | |
736–740 | Can you just make getIdentityValueForAtomicOp return a Constant? Or add a variant that does? |
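A minimal sketch of the "return a Constant" variant asked for above, restricted to the FP identities settled on in this thread (the function name and shape are my assumptions, not the patch's):

```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/ErrorHandling.h"

using namespace llvm;

// FP identity elements as discussed in this review: -0.0 for fadd, +0.0 for
// fsub, +inf for fmin, -inf for fmax. Integer cases are elided here.
static Constant *getIdentityValueForFPAtomicOp(AtomicRMWInst::BinOp Op,
                                               Type *Ty) {
  switch (Op) {
  case AtomicRMWInst::FAdd:
    return ConstantFP::get(Ty, -0.0);
  case AtomicRMWInst::FSub:
    return ConstantFP::get(Ty, +0.0);
  case AtomicRMWInst::FMin:
    return ConstantFP::getInfinity(Ty, /*Negative=*/false);
  case AtomicRMWInst::FMax:
    return ConstantFP::getInfinity(Ty, /*Negative=*/true);
  default:
    llvm_unreachable("integer identities elided in this sketch");
  }
}
```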
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
78 | No need to pass in isAtomicFloatingPointTy to all these functions. It is just V->getType()->isFloatingPointTy(). | |
315 | Don't need to change this | |
631–634 | You can derive C from Ty, and BitWidth from Ty, so the arguments should just be: AtomicRMWInst::BinOp Op, Type *Ty | |
810 | In general fmul will not give the exact same answer as a sequence of fadds, so you probably need to check some fast math flags before doing this. |
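For the point about deriving everything from the type, a tiny hedged example (names assumed) of what the trimmed-down arguments still give you:

```cpp
#include "llvm/IR/Instructions.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Type.h"

using namespace llvm;

// Both the context and the bit width are recoverable from Ty, so the helper
// only needs (AtomicRMWInst::BinOp Op, Type *Ty) as suggested above.
static void identityHelperSignatureSketch(AtomicRMWInst::BinOp Op, Type *Ty) {
  LLVMContext &C = Ty->getContext();
  unsigned BitWidth = Ty->getPrimitiveSizeInBits().getFixedValue();
  (void)Op;
  (void)C;
  (void)BitWidth;
}
```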
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
636 | Is it safe to get BitWidth like this? We don't need it for float types |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
636 | Simpler to call Ty->getPrimitiveSizeInBits() unconditionally. | |
731–732 | Might be clearer as: |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
731–732 | If we convert Mbcnt to float here, the integer comparison at line 869 will fail |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
731–732 | Then I suggest moving the casts (both the int and fp cases) down to line 976. Currently, for a 64-bit integer atomic, we will cast mbcnt to i64 here, so the comparison on line 869 will be an i64 comparison. That is silly. There is no need for the comparison to be wider than i32. |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
837 | I hope this stops 64-bit comparisons for 64-bit atomic values. Please check the effect of this in llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
837 | I don't actually see any 64-bit cmp instructions in that test, even before your patch. I guess we already managed to shrink them back to 32-bit comparisons. |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
837 | Having a 32-bit comparison here for all the cases (int, long, float, wavefront size 32/64) is fine, right? Or do I need to revert this change? |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
837 | It is fine. We are talking about the laneid == 0 comparison, which should always be 32-bit even for a 64-bit atomic, since the laneid is just a small integer in the range 0..63. |
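For reference, a minimal sketch (placeholder names) of the lane-id check being described, which stays i32 even when the atomic payload is 64-bit:

```cpp
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Mbcnt is the i32 lane id from llvm.amdgcn.mbcnt.lo/hi, always in [0, 63],
// so the "am I the first active lane?" test never needs to be widened.
static Value *buildIsFirstLane(IRBuilder<> &B, Value *Mbcnt) {
  return B.CreateICmpEQ(Mbcnt, B.getInt32(0));
}
```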
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
78 | This is wrong in the case of FP typed xchg, which the pass just happens to not handle |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
402 | Do you want to switch to the float overloads for the DPP intrinsic here or in a follow up? |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
402 | I would prefer to do it in a follow-up patch. |
Missing IR check lines? I thought you added some in a previous diff
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
355 | Can you use B.CreateFAdd instead of the low level CreateBinOp? You'll need that to handle strictfp correctly | |
358–359 | Ditto | |
808–810 | We don't have fast math flags on atomics, but you would need to expand to the add sequence without some kind of reassociate flag |
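A small sketch of the CreateFAdd point (assumed names): the high-level builder call is what picks up constrained-FP (strictfp) handling, while the raw CreateBinOp path does not.

```cpp
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// When the builder is in FP-constrained mode, CreateFAdd emits the
// llvm.experimental.constrained.fadd form instead of a plain fadd.
static Value *buildScanFAdd(IRBuilder<> &B, Value *LHS, Value *RHS) {
  return B.CreateFAdd(LHS, RHS);
}
```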
IR checks have been added in these files:
- llvm/test/CodeGen/AMDGPU/global_atomics_optimizer_fp_no_rtn.ll
- llvm/test/CodeGen/AMDGPU/global_atomic_optimizer_fp_rtn.ll
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
808–810 | If the logic of number-of-active-lanes * uniform float value is not valid here for the uniform-value case, can we instead use the logic implemented in buildScanIteratively for divergent values (even though the input value is uniform)? Or do we want a sequence of additions that avoids the loop (branch instructions) we have in buildScanIteratively? We would also need to write back the intermediate values of that sequence of additions if the result is needed later in the kernel. |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
808–810 | I suppose this is fine. You didn't have any adding order guarantee before |
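For clarity, a hedged sketch of the uniform-value shortcut being debated here (all names assumed): convert the active-lane count and multiply, instead of emitting a chain of fadds. As noted above, this implicitly reorders the additions, which plain IEEE fadd evaluation does not allow.

```cpp
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// NumActiveLanes is an i32 count of active lanes; UniformV is the uniform
// float operand. The product stands in for NumActiveLanes repeated fadds.
static Value *buildUniformFAddReduction(IRBuilder<> &B, Value *UniformV,
                                        Value *NumActiveLanes) {
  Value *CountFP = B.CreateUIToFP(NumActiveLanes, UniformV->getType());
  return B.CreateFMul(UniformV, CountFP);
}
```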
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
651–654 | These are the wrong way round. You want +0 for fadd and -0 for fsub. |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
651–654 | No? This was wrong before and corrected. InstCombine uses -0 as fadd identity and +0 as fsub identity |
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp | ||
---|---|---|
651–654 | Oh yeah, you're right. Sorry for the noise. |
llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll | ||
---|---|---|
172–241 | This fsub code does not look right (both strategies). First you do an fsub-reduction, and then you do an atomic fsub of the reduced value. That is like a double negative - you will end up adding the values to the memory location. I think you need to do an fadd reduction followed by an atomic fsub, or vice versa. Have you run any conformance tests that exercise this code? |
llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll | ||
---|---|---|
172–241 | This holds true for integer sub also, right? I have run psdb and the gfx pipeline, which run some conformance tests. I will take a closer look to see what test coverage is required to exercise this. |
llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll | ||
---|---|---|
172–241 | This did not get caught because atomic fsub is transformed to fadd before we reach atomic-optimizer: https://cuda.godbolt.org/z/56ToP79Pb |
llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan_fp.ll | ||
---|---|---|
172–241 | For integer sub this is already handled by: `const AtomicRMWInst::BinOp ScanOp = Op == AtomicRMWInst::Sub ? AtomicRMWInst::Add : Op;` |
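An illustrative extension of that snippet to the FP case raised above, purely as a sketch (the thread notes that atomic fsub is usually canonicalized to fadd before this pass ever sees it):

```cpp
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Pick the op used for the intermediate scan: subtraction is reduced with the
// corresponding addition, and the final atomicrmw keeps the original sub.
static AtomicRMWInst::BinOp getScanOp(AtomicRMWInst::BinOp Op) {
  if (Op == AtomicRMWInst::Sub)
    return AtomicRMWInst::Add;
  if (Op == AtomicRMWInst::FSub)
    return AtomicRMWInst::FAdd;
  return Op;
}
```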