[AMDGPU] Add an AMDGPU specific atomic optimizer.
ClosedPublic

Authored by sheredom on Sep 12 2018, 2:36 AM.

Details

Summary

This commit adds a new IR level pass to the AMDGPU backend to perform atomic optimizations. It works by:

  • Running through a function and finding atomicrmw add/sub or uses of the atomic buffer intrinsics for add/sub.
  • If all arguments except the value to be added/subtracted are uniform, record the value to be optimized.
  • Run through the atomic operations we can optimize and, depending on whether the value is uniform/divergent use wavefront wide operations (DPP in the divergent case) to calculate the total amount to be atomically added/subtracted.
  • Then let only a single lane of each wavefront perform the atomic operation, reducing the total number of atomic operations in flight.
  • Lastly we recombine the result from the single lane to each lane of the wavefront, and calculate our individual lanes offset into the final result.

Diff Detail

Repository
rL LLVM
sheredom created this revision.Sep 12 2018, 2:36 AM
sheredom updated this revision to Diff 165041.Sep 12 2018, 2:44 AM

Update to include tip LLVM changes (DivergenceAnalysis -> LegacyDivergenceAnalysis, AMDGPUAS is become just an enum now).

Should include tests running only the IR pass

lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
57 ↗(On Diff #165041)

This can be dropped

60–61 ↗(On Diff #165041)

This needs work, but should preserve the dominator tree

176 ↗(On Diff #165041)

No getPrimitiveSizeInBits

207 ↗(On Diff #165041)

All the consts are in the wrong place

212–213 ↗(On Diff #165041)

I'm pretty sure there's a utility function that deals with what you're trying to do, that will also handle the DominatorTree updates for this

219 ↗(On Diff #165041)

You pretty much should never use getPrimitiveSizeInBits directly. Use the DataLayout

229–234 ↗(On Diff #165041)

You need to mark this call site as convergent

243–244 ↗(On Diff #165041)

CreateIntrinsic

294–295 ↗(On Diff #165041)

Can you use enums to derive these values

367–378 ↗(On Diff #165041)

The canonical way to do this in IR is lshr x, 32 and truncate

arsenm added inline comments.Sep 13 2018, 5:11 AM
lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
379 ↗(On Diff #165041)

Checking a different value is suspicious

arsenm added inline comments.Sep 13 2018, 5:14 AM
test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll
71 ↗(On Diff #165041)

Missing -LABELs

sheredom added inline comments.Sep 13 2018, 6:11 AM
lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
57 ↗(On Diff #165041)

You mean I don't need to include a getPassName at all? I was just including it for consistency with the other AMDGPU level passes (a bunch at least have the method overridden).

207 ↗(On Diff #165041)

I ran clang-format -i -style=file AMDGPUAtomicOptimizer.cpp and this is what it did!

243–244 ↗(On Diff #165041)

I can't use the IRBuilder's CreateIntrinsic because it will then try and mangle the amdgcn_mbcnt_lo like llvm.amdgcn.mbcnt.lo.i32, whereas we expect the name to be sans-type-suffix.

nhaehnle added inline comments.Sep 14 2018, 1:52 AM
lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
243–244 ↗(On Diff #165041)

Ugh, this is a bug in CreateIntrinsic. It does:

Function *Fn = Intrinsic::getDeclaration(M, ID, { Args.front()->getType() });

... when in this particular case, the type vector should be empty.

There aren't that many users of CreateIntrinsic -- can you just fix CreateIntrinsic so that it takes an ArrayRef<Type*> like Intrinsic::getDeclaration? (As a separate preparatory patch obviously.)

sheredom added inline comments.Sep 14 2018, 2:00 AM
lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
243–244 ↗(On Diff #165041)

Just to be clear - do you want me to get the IRBuilder fix in first? Or is it fine to let this patch in as is, and then fixup during the IRBuilder change?

nhaehnle added inline comments.Sep 14 2018, 2:22 AM
lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
243–244 ↗(On Diff #165041)

I think we should do it right the first time.

sheredom added inline comments.Sep 14 2018, 4:36 AM
lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
243–244 ↗(On Diff #165041)
sheredom updated this revision to Diff 165490.Sep 14 2018, 6:25 AM

Added an interaction with the DominatorTree, so that if its present in the PassManager we can preserve + update it.

sheredom marked 8 inline comments as done.Sep 14 2018, 6:26 AM
sheredom marked an inline comment as done.

This still needs to be adjusted for D52087, and I have one minor comment, but mostly LGTM.

lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
340 ↗(On Diff #165490)

The last two arguments are defaults, so you can just leave them out.

arsenm added inline comments.Sep 17 2018, 6:56 AM
lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
57 ↗(On Diff #165041)

Yes. We should be deleting them. I'm not sure why they exist in the first place - they are redundant with the name passed in the macro

arsenm added inline comments.Sep 17 2018, 6:58 AM
lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
139–140 ↗(On Diff #165490)

You can use I.getPointerOperand/getValOperand

sheredom added inline comments.Sep 24 2018, 4:23 AM
lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp
139–140 ↗(On Diff #165490)

So I had that originally - but because I am using ValIdx so that I can have a single optimizeAtomic call that works for atomicrmw insts and our atomic intrinsics, I thought this was slightly nicer overall.

mareko added a subscriber: mareko.Sep 24 2018, 6:38 PM

What happens if a shader already does "if (threadID == 0) { do_atomic(); }"? Is the optimization skipped in this case?

What happens if a shader already does "if (threadID == 0) { do_atomic(); }"? Is the optimization skipped in this case?

It is not skipped - the optimization would be applied twice. To be able to skip that kind of thing we need to track a further axis for a given value - we need to know if the value is divergent, uniform, or scalar. A scalar value would be like the one in the if statement of your example, its a value that is known to be only active for a single lane of the wavefront.

That is a much bigger change, and so we deemed it better to get this in as is - even if it will apply the optimization twice if the user had applied it, rather than not doing the optimization at all (which is a much worse performance penalty).

sheredom updated this revision to Diff 168643.Oct 8 2018, 4:48 AM

Rebased ontop of tip to get the IRBuilder change in https://reviews.llvm.org/D52087, and incorporate the requested changes by @nhaehnle resulting from that.

sheredom marked 14 inline comments as done.Oct 8 2018, 4:49 AM
nhaehnle accepted this revision.Oct 8 2018, 8:18 AM
This revision is now accepted and ready to land.Oct 8 2018, 8:18 AM
This revision was automatically updated to reflect the committed changes.