This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Iterative scan implementation for atomic optimizer.
ClosedPublic

Authored by pravinjagtap on Apr 2 2023, 2:16 AM.

Details

Summary

This patch provides an alternative implementation to DPP for Scan Computations.

An alternative implementation iterates over all active lanes of Wavefront
using llvm.cttz and performs the following steps:

  1. Read the value that needs to be atomically incremented using llvm.amdgcn.readlane intrinsic
  2. Accumulate the result.
  3. Update the scan result using llvm.amdgcn.writelane intrinsic if intermediate scan results are needed later in the kernel.

Diff Detail

Event Timeline

pravinjagtap created this revision.Apr 2 2023, 2:16 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 2 2023, 2:16 AM
pravinjagtap requested review of this revision.Apr 2 2023, 2:16 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 2 2023, 2:16 AM

Shouldn't this new lowering get enabled for device functions too?

foad added a comment.Apr 3 2023, 12:26 AM

This is the same as D147303?

This is the same as D147303?

Yes Jay. D147303 was created for testing purpose. I have abandoned it. Sorry for confusion.

Addressed @cdevadas comment. Used isGraphics to guard graphic shaders DPP implementation against this new iterative approach using readlane and writelane

Need a test for device functions.

llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
672–697

I'm not sure if this should get enabled for all graphics CCs. @foad can you confirm?

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
282

You're turning this flag on by default and it is going to change the default behavior for shaders. Can you run the shader tests too?

ruiling added inline comments.Apr 3 2023, 11:34 PM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
507

Why do we choose to unroll the loop over wave-front-size? I think this makes the sp3 assembly hard to read. Shouldn't a loop over active lanes just work?

pravinjagtap added inline comments.Apr 4 2023, 2:32 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
507

Hello @ruiling,

One of the considerations for selecting this approach is its simplicity and efforts required for implementation. We know that the most optimized implementation for the scan is DPP with WWM. In the future, this iterative approach will become redundant when concerns related to WWM robustness are addressed. If you and everyone else think that loop over active lanes is the right thing to do, I will start implementing it.

arsenm added inline comments.Apr 4 2023, 6:45 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
672–697

I think part of the point of doing this is to stop special casing graphics usage. Semantically the shaderiness shouldn't matter. A strategy switch would be a separate control if we wanted such a thing

llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
264–267

Does this need to be rebased? I wouldn't expect pass changes

b-sumner added inline comments.Apr 4 2023, 7:22 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
507

Scalar branches may be the most expensive aspect of this algorithm . If the loop is fully unrolled, we still end up with 64 in wave64. If we didn't unroll the active-lane approach loop, then if the wave is full, then couldn't we end up with 126 scalar branches?

foad added a comment.Apr 4 2023, 8:13 AM

Scalar branches may be the most expensive aspect of this algorithm

If not-taken conditional branches are cheap then we could do something like this. It only has one taken branch, when we have finished handling all the active lanes.

  // Inclusive plus-scan v0 into v1. Also leaves the result of the plus-reduction in s3.
  s_mov s0, exec
  s_mov s3, 0 // accumulator
// repeat this section 32 or 64 times:
  s_ff1 s1, s0 // find lowest remaining active lane
  s_cmp_eq s1, -1
  s_cbranch_scc1 end
  s_bitset0 s0, s1
  v_readlane s2, v0, s1
  s_add s3, s2
  v_writelane v1, s3, s1
// end of repeated section
end:

Scalar branches may be the most expensive aspect of this algorithm

If not-taken conditional branches are cheap then we could do something like this. It only has one taken branch, when we have finished handling all the active lanes.

  // Inclusive plus-scan v0 into v1. Also leaves the result of the plus-reduction in s3.
  s_mov s0, exec
  s_mov s3, 0 // accumulator
// repeat this section 32 or 64 times:
  s_ff1 s1, s0 // find lowest remaining active lane
  s_cmp_eq s1, -1
  s_cbranch_scc1 end
  s_bitset0 s0, s1
  v_readlane s2, v0, s1
  s_add s3, s2
  v_writelane v1, s3, s1
// end of repeated section
end:

Yes, that looks like what we want. The challenge will be creating IR that will lower to that.

arsenm added a comment.Apr 4 2023, 9:06 AM

Scalar branches may be the most expensive aspect of this algorithm

If not-taken conditional branches are cheap then we could do something like this. It only has one taken branch, when we have finished handling all the active lanes.

  // Inclusive plus-scan v0 into v1. Also leaves the result of the plus-reduction in s3.
  s_mov s0, exec
  s_mov s3, 0 // accumulator
// repeat this section 32 or 64 times:
  s_ff1 s1, s0 // find lowest remaining active lane
  s_cmp_eq s1, -1
  s_cbranch_scc1 end
  s_bitset0 s0, s1
  v_readlane s2, v0, s1
  s_add s3, s2
  v_writelane v1, s3, s1
// end of repeated section
end:

Yes, that looks like what we want. The challenge will be creating IR that will lower to that.

I increasingly think we should just have intrinsics for reduction ops and move all this into codegen

nhaehnle added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
672–697

Let's be clear:

  • Using the loop is bound to be slower in almost all cases, often significantly so.
  • The fast path is currently always used in graphics.
  • We cannot cause such significant performance regressions for graphics.

I agree that if we do have two different paths here, it doesn't make sense to make them "graphics" vs. "compute", but to instead have a dedicated switch. The important part is that that switch defaults to the existing, fast path for graphics.

If not-taken conditional branches are cheap then we could do something like this. It only has one taken branch, when we have finished handling all the active lanes.

  // Inclusive plus-scan v0 into v1. Also leaves the result of the plus-reduction in s3.
  s_mov s0, exec
  s_mov s3, 0 // accumulator
// repeat this section 32 or 64 times:
  s_ff1 s1, s0 // find lowest remaining active lane
  s_cmp_eq s1, -1
  s_cbranch_scc1 end
  s_bitset0 s0, s1
  v_readlane s2, v0, s1
  s_add s3, s2
  v_writelane v1, s3, s1
// end of repeated section
end:

The LLVM IR that can do this:

bb0:
  %value = ... 
  %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 1)
  br label %bb1

bb1:
  %accum = phi i32 [ 0, %entry ], [ %new_accum, %bb1 ]
  %old_value_phi = phi i32 [ poison, %entry ], [ %old_value, %bb1 ]
  %active_bits = phi i32 [ %ballot, %entry ], [ %new_active_bits, %bb1 ]
  %ff1 = call i32 @llvm.cttz.i32(i32 %active_bits, i1 true)

  %lane_value = call i32 @llvm.amdgcn.readlane(i32 %value, i32 %ff1)
  %old_value = call i32 @llvm.amdgcn.writelane(i32 %accum, i32 %ff1, i32 %old_value_phi)
  %new_accum = add i32 %accum, %lane_value

  %mask = shl i32 1, %ff1
  %inverse_mask = xor i32 %mask, -1
  %new_active_bits = and i32 %active_bits, %inverse_mask
  %is_end = icmp eq i32 %new_active_bits, 0
  br i1 %is_end, label %bb2, label %bb1

bb2:
pravinjagtap updated this revision to Diff 512415.EditedApr 11 2023, 6:08 AM
pravinjagtap edited the summary of this revision. (Show Details)

Implemented @ruiling suggestions. In this approach, we iterate over only active lanes of a wavefront using llvm.cttz to precompute an exclusive scan scan.

I have attempted the unrolled version of this loop to avoid the conditional cost of taken branch, but, unfortunately compile time cost increases proportionally as we need to create 64 basic blocks for one atomic operation (one for each active lane). So, compile time becomes the function of no of atomic ops into wavefront size.

TODO:

  • Not finalized the dedicated switch between graphics vs compute. I am not sure about how this can be addressed. If we default to DDP then users of compute need to explicitly set the flag for selecting this iterative approach for compute.
  • Device function test.
pravinjagtap marked an inline comment as done.Apr 11 2023, 8:33 PM
pravinjagtap marked 2 inline comments as done.Apr 11 2023, 11:18 PM
pravinjagtap added inline comments.
llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
264–267

Code is rebased correctly. The AMDGPUAtomicOptimizer pass has dependency on UniformityInfoWrapperPass. Enabling atomic optimizer pass is scheduling Cycle Info and Uniformity analysis.

pravinjagtap marked an inline comment as done.Apr 12 2023, 2:52 AM

Shouldn't this new lowering get enabled for device functions too?

Hello @cdevadas, The current visitor of AtomicRMWInst considers only AMDGPUAS::GLOBAL_ADDRESS and AMDGPUAS::LOCAL_ADDRESS as potential candidates for atomic optimizations and *NOT* the AMDGPUAS::FLAT_ADDRESS. In cases of device functions, I am observing that input argument (if device function is doing atomic add then we need to pass the address to device function) are addrSpaceCasted to AMDGPUAS::FLAT_ADDRESS in the caller (i.e global function) before passing it to device function. Thats the reason why this lowering is not getting enabled for device functions. Will talk to @b-sumner and @arsenm about handling of this.

pravinjagtap added inline comments.Apr 13 2023, 3:58 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
488

Hello @ruiling.

Your suggestions i.e., loop based iterative approach have been implemented to perform scan operation. Now, we iterate over only active lanes using @llvm.cttz and clear the associated bit when processed so that for the next iteration we will be branching out to next active lane.

ruiling added inline comments.Apr 13 2023, 6:11 PM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
488

This part LGTM.

pravinjagtap updated this revision to Diff 515605.EditedApr 21 2023, 12:01 AM
pravinjagtap retitled this revision from [AMDGPU] Enable AMDGPU Atomic Optimizer Pass by default. to [AMDGPU] Iterative scan implementation for atomic optimizer..
pravinjagtap edited the summary of this revision. (Show Details)

Introduced new command line flag amdgpu-atomic-optimizer-use-dpp which selects the scan implementation(DPP/Iterative).

Following is the plan for submission:

  1. Submit this change with -amdgpu-atomic-optimizer-use-dpp=on and -amdgpu-atomic-optimizations=off
  2. Downstream clients can opt in using -amdgpu-atomic-optimizer-use-dpp=on and -amdgpu-atomic-optimizations=on
  3. Change default -amdgpu-atomic-optimizations to on and -amdgpu-atomic-optimizer-use-dpp to off so that compute uses iterative approach

In the lit tests you could see that iterative approach is being selected when we turned off -amdgpu-atomic-optimizer-use-dpp

pravinjagtap marked 3 inline comments as done.Apr 21 2023, 9:50 AM

Rebased & Ping

pravinjagtap marked 2 inline comments as done.Apr 28 2023, 1:19 AM
foad added inline comments.Apr 28 2023, 5:35 AM
llvm/lib/Target/AMDGPU/AMDGPU.h
89

Just "UseDpp"?

245

No reason for this to be public.

llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
658

Remove llvm::, use getFunction

664–665

Sink this down to line 700, where you use them?

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
272–275

Description seems a bit misleading to me since this option doesn't enable the whole atomic optimizer pass. I would suggest changing it to "Enable use of DPP in the atomic optimizer" or just "Use DPP in the atomic optimizer"

foad added inline comments.Apr 28 2023, 5:35 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
501–502

Don't need these. They will always be the same as WaveTy and WaveFrontSize.

515

I think you can use an undef value here instead of V.

521–522

Use B.CreateIntrinsic, then you don't need M.

529

Typo "perform", "intermediate"

533–536

Use B.CreateIntrinsic

553

Maybe change "Compute" to "ComputeLoop" everywhere, since it is the body of the loop?

740–746

I don't think you need to change any of this. The original way of doing the icmp should work in all cases.

767

Why do you need to clone? Could you just do Terminator->removeFromParent then B.Insert(Terminator)?

foad added inline comments.Apr 28 2023, 5:47 AM
llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll
459–462

Not your fault, but we really ought to be able to select s_ff1_i32_b64 here.

468

Not your fault, but we really ought to be able to select s_bitset0_b64 here.

pravinjagtap marked 13 inline comments as done.

Thank you @foad for comments. Addressed most of them.

pravinjagtap added inline comments.Apr 28 2023, 9:44 PM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
664–665

Sink this down to line 700, where you use them?

ComputeEnd is required at line 766 & 770 after if ValDivergent loop. Thats why it is hoisted here.

740–746

Actually No. In the WWM, only the 0th lane (its always the case) will update the final value in a wavefront whereas in the iterative approach first active lane will update the final value (first active lane will not be 0th always in iterative approach).

pravinjagtap added inline comments.Apr 28 2023, 9:47 PM
llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll
459–462

Not your fault, but we really ought to be able to select s_ff1_i32_b64 here.

I am not sure how to address this. May be, we need to teach ISel this specific pattern.

foad added inline comments.Apr 29 2023, 12:59 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
664–665

I am suggesting to put these two lines immediately before the call to buildScanIteratively (line 695 now).

740–746

No, even in the DPP case, the atomic is executed by the first active lane, not lane 0. This happens after exiting the WWM section.

Addressed review comments

pravinjagtap marked 2 inline comments as done.Apr 29 2023, 2:42 AM
pravinjagtap added inline comments.May 18 2023, 5:40 AM
llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll
459–462

Hello @foad,
Can we consider generating s_ff1_i32_b64 and s_bitset0_b64 as independent task (future enhancement) and unblock this to move forward since we need to submit this in stages ?

foad added inline comments.May 18 2023, 6:02 AM
llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll
459–462

Can we consider generating s_ff1_i32_b64 and s_bitset0_b64 as independent task (future enhancement)

Yes of course, it does not need to block this patch.

Rebased. I think, work is in good shape now. Please let me know if there are any other concerns, if not, we can move forward.

Note: This work needs to be submitted in stages, LLPC needs to adapt to the new flag (-amdgpu-atomic-optimizer-use-dpp). By default this flag is turned ON now. Once LLPC adapts to this change, compute can turned OFF this flag.

Ping & Rebase

cdevadas accepted this revision.May 29 2023, 5:27 AM

LGTM

llvm/test/CodeGen/AMDGPU/GlobalISel/atomic_optimizations_mul_one.ll
3

-verify-machineinstrs won't do anything here.

This revision is now accepted and ready to land.May 29 2023, 5:27 AM
ruiling added inline comments.May 29 2023, 7:21 PM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
765

I think you also need to update Dominator tree through DTU as we are inserting two extra blocks. And the branch setup process sounds messy. We insert a branch to ComputeLoop in the middle of the entry block. And then we further split the entry block for inserting the single-lane block. It might be more clear to first split the entry block before inserting the branch to ComputeLoop.

arsenm requested changes to this revision.May 30 2023, 3:04 AM

Should have some checks on the IR output

llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
45

Should probably add header comment explaining the different strategies

492

east const throughout

493

ST->getWavefrontSize()

512

Use poison for missing values

541

You already have WaveTy above

542

You already have WaveTy above

547

You already have WaveTy above

659

F->getContext()

700

Use consistent naming style for the blocks? other places used snake case

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
272–280

I would expect this to be one cl::enum flag for the optimizer strategy. It would also still be better if this was a parsable pass parameter

llvm/test/CodeGen/AMDGPU/GlobalISel/atomic_optimizations_mul_one.ll
3

-verify-machineinstrs should be removed

This revision now requires changes to proceed.May 30 2023, 3:04 AM
pravinjagtap marked 9 inline comments as done.

Addressed review comments of @ruiling and @arsenm

pravinjagtap marked 4 inline comments as done.May 31 2023, 5:41 AM

Thanks for the new version, looks good to me.

pravinjagtap updated this revision to Diff 528288.EditedJun 4 2023, 10:46 PM

Added parsable cl::enum flag for the optimizer strategy as suggested by @arsenm.

We need to enable atomic optimizer using -amdgpu-atomic-optimizations=true and then
we can use -amdgpu-atomic-optimizer-strategy=DPP/Iterative flag to select the strategy for scan (defaulted to DPP for now and will be changed to Iterative for compute pipeline once LLPC adapts to this change).

This will allow LLPC to adjust to new change and then we can enable and default to Iterative strategy using only one flag -amdgpu-atomic-optimizer-strategy=DPP/Iterative
and deprecate -amdgpu-atomic-optimizations

pravinjagtap marked an inline comment as done.Jun 4 2023, 10:51 PM
arsenm added inline comments.Jun 5 2023, 5:28 PM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
492

These weren't done, const is still weirdly placed

764

Won't the regular post-pass verifier catch this?

789–793

Could use initializer list with all of these instead of push_back x 3. Also can use std::array

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
272–280

You shouldn't nede to come up with your own parsing with clEnumVal. For an example see the recent amdgpu-lower-module-lds-strategy

Addressed review comments of @arsenm

pravinjagtap added inline comments.Jun 6 2023, 2:41 AM
llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
764

Won't the regular post-pass verifier catch this?

Yes, During my experimentation I found that -verify-dom-info is not catching the issues related to updates required in dominator tree (both with opt and llc). Although, assert here is giving desired behavior.

pravinjagtap marked 4 inline comments as done.Jun 6 2023, 2:43 AM
arsenm accepted this revision.Jun 8 2023, 4:31 AM

LGTM with nit

llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
549

{OldValue, NewAccumulator}

787

You don't need std::initializer_list, you can just use std::array

This revision is now accepted and ready to land.Jun 8 2023, 4:31 AM
arsenm added inline comments.Jun 8 2023, 4:32 AM
llvm/test/CodeGen/AMDGPU/global_atomics_iterative_scan.ll
100 ↗(On Diff #529550)

Use named values

119 ↗(On Diff #529550)

Can drop most attributes

Addressed review comments.

arsenm accepted this revision.Jun 8 2023, 7:23 AM
foad added a comment.Jun 9 2023, 2:06 AM

We need to enable atomic optimizer using -amdgpu-atomic-optimizations=true and then
we can use -amdgpu-atomic-optimizer-strategy=DPP/Iterative flag to select the strategy for scan (defaulted to DPP for now and will be changed to Iterative for compute pipeline once LLPC adapts to this change).

This will allow LLPC to adjust to new change and then we can enable and default to Iterative strategy using only one flag -amdgpu-atomic-optimizer-strategy=DPP/Iterative
and deprecate -amdgpu-atomic-optimizations

Here's the LLPC patch to set -amdgpu-atomic-optimizer-strategy=DPP: https://github.com/GPUOpen-Drivers/llpc/pull/2506

llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll