This commit adds a new IR-level pass to the AMDGPU backend to perform atomic optimizations. It works by:
- Running through a function and finding atomicrmw add/sub or uses of the atomic buffer intrinsics for add/sub.
- If all arguments except the value to be added/subtracted are uniform, record the operation as a candidate for optimization.
- Run through the atomic operations we can optimize and, depending on whether the value is uniform or divergent, use wavefront-wide operations (DPP in the divergent case) to calculate the total amount to be atomically added/subtracted.
- Then let only a single lane of each wavefront perform the atomic operation, reducing the total number of atomic operations in flight.
- Lastly, we broadcast the result from the single lane to every lane of the wavefront, and calculate each lane's offset into the final result.
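
The steps above can be sketched on the CPU as follows. This is only an illustrative model under stated assumptions, not the code of the pass itself: the function name `wavefrontAtomicAdd` is hypothetical, a `std::vector` stands in for the active lanes of one wavefront, a plain loop stands in for the DPP scan, and `std::atomic::fetch_add` stands in for the single atomic operation.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU model of one wavefront (not the actual pass).
// laneValues[i] is the amount that active lane i wants to add.
// Returns, per lane, the value that lane would have observed had
// the lanes issued their atomic adds one after another in lane order.
std::vector<uint32_t> wavefrontAtomicAdd(std::atomic<uint32_t> &counter,
                                         const std::vector<uint32_t> &laneValues) {
  std::vector<uint32_t> results(laneValues.size());
  if (laneValues.empty())
    return results; // no active lanes, so no atomic is issued

  // Wavefront-wide reduction: an exclusive prefix sum gives each lane
  // its offset, and the running sum ends up as the wavefront total.
  // (Assumption: on hardware this scan would use cross-lane operations.)
  std::vector<uint32_t> exclusivePrefix(laneValues.size());
  uint32_t total = 0;
  for (std::size_t lane = 0; lane < laneValues.size(); ++lane) {
    exclusivePrefix[lane] = total;
    total += laneValues[lane];
  }

  // A single lane performs one atomic add for the whole wavefront.
  const uint32_t oldValue = counter.fetch_add(total);

  // Broadcast the returned value to every lane and add the lane's offset.
  for (std::size_t lane = 0; lane < laneValues.size(); ++lane)
    results[lane] = oldValue + exclusivePrefix[lane];

  return results;
}
```

In this model, a counter starting at 100 and lane values {3, 1, 4, 1} need one atomic add of 9 instead of four separate atomics, and each lane still receives a distinct return value. When the value is uniform, the scan reduces to a multiplication: the total is the value times the number of active lanes, and each lane's offset is the value times the number of active lanes below it.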