This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Special case divergence analysis for wave ID computation
Needs Review · Public

Authored by arsenm on Apr 25 2022, 6:52 AM.

Details

Reviewers
None
Group Reviewers
Restricted Project
Summary

Specially recognize the 1D computation as wave uniform. Insert a
readfirstlane during codegen prepare to assert the result is uniform
even if the operand is not. It would be better if the DAG divergence
analysis could maintain the same property after the division is
optimized into shifts.
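
A minimal IR sketch of the shape being recognized (declarations elided; the constant 64 stands in for both the workgroup size and the wave size and is purely illustrative):

  %gid = call i32 @llvm.amdgcn.workgroup.id.x()
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %mul = mul i32 %gid, 64                ; group id * group size
  %flat = add i32 %mul, %tid             ; global 1D thread id
  %wave = udiv i32 %flat, 64             ; wave id: same value in every lane of a wave
  ; codegen prepare then asserts the uniformity:
  %uni = call i32 @llvm.amdgcn.readfirstlane(i32 %wave)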

Addresses issue 54010.

There's currently a potential pass-ordering issue because m_Mul
doesn't automatically handle commuted operands. The order in which
the group ID appears in the multiply is non-canonical after the
lowering for uniform-work-group-size replaces the operand.
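
For illustration (value names invented), these are the two operand orders the matcher can encounter:

  %a = mul i32 %group.id, %group.size    ; the order m_Mul(id, size) expects
  %b = mul i32 %group.size, %group.id    ; commuted form, matched only by m_c_Mul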

Diff Detail

Event Timeline

arsenm created this revision. · Apr 25 2022, 6:52 AM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 25 2022, 6:52 AM
arsenm requested review of this revision. · Apr 25 2022, 6:52 AM
Herald added a subscriber: wdng. · View Herald Transcript
arsenm updated this revision to Diff 426178. · Apr 29 2022, 3:25 PM

Use m_c_Mul, don't handle srem/urem

I have a bunch of questions about this.

I don't understand why / how this code is trying to compute a wave ID. My understanding is that workitem.id.{x,y,z} is the workgroup-local ID of a thread/lane, so the first thread always has (0,0,0), the next one has (1,0,0) (assuming the x-size of the workgroup is at least 2) and so on. We can compute the wave ID using implicit knowledge about how the hardware unrolls the threads of a workgroup in a wave, but I would not expect to see the workgroup ID as part of the calculation. In other words, something looks very suspicious about the incoming code.

The logic of the uniformity check also looks wrong. We can't just treat X, Y, and Z as all the same.

Here's what I would expect a wave ID calculation to look like -- this uses the knowledge that the HW uses a "typewriter" unrolling of threads:

# workitem.id.linear is LocalInvocationIndex in SPIR-V
workitem.id.linear = workitem.id.x + workgroup.size.x * (workitem.id.y + workgroup.size.y * workitem.id.z)
waveid = workitem.id.linear / wavesize
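
A sketch of the same computation with today's intrinsics (in real code the workgroup sizes are loaded from the dispatch packet; the constants 8, 4, and 64 here are illustrative):

  %tid.x = call i32 @llvm.amdgcn.workitem.id.x()
  %tid.y = call i32 @llvm.amdgcn.workitem.id.y()
  %tid.z = call i32 @llvm.amdgcn.workitem.id.z()
  %zy = mul i32 %tid.z, 4                ; * workgroup.size.y
  %yx = add i32 %zy, %tid.y
  %row = mul i32 %yx, 8                  ; * workgroup.size.x
  %linear = add i32 %row, %tid.x         ; workitem.id.linear
  %waveid = udiv i32 %linear, 64         ; / wavesize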

Also, there's a caveat when partial workgroups are used: the workgroups "on the fringes" of the grid then use a different workgroup size, which must be derived by combining the base workgroup size with the grid size.

Assuming no partial workgroups, the calculation can be simplified if we know (based on kernel attributes) that workgroup.size.x or (workgroup.size.x * workgroup.size.y) are multiples of the wavesize.
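
For instance, since the wavesize is a power of two, the unsigned division in the sketch above folds to a plain shift (the [al]shr form that actually shows up after optimization):

  %waveid = lshr i32 %linear, 6          ; udiv i32 %linear, 64 rewritten as a shift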

To be honest, it would probably be best in the long run if we did something along the lines of:

  • Introduce a bunch of new intrinsics: wave.id, workitem.id.linear, workgroup.size.{x,y,z}.
  • Add InstCombine patterns to canonicalize known code patterns to these intrinsics.
  • The divergence analysis improvement is now trivial (see the sketch after this list).
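
Loosely, the canonical form after such a rewrite might read as follows (hypothetical intrinsics from the list above; they do not exist today):

  %linear = call i32 @llvm.amdgcn.workitem.id.linear()   ; hypothetical
  %waveid = call i32 @llvm.amdgcn.wave.id()              ; hypothetical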

(We would also do well to have an intrinsic as the canonical form for the lane ID instead of mbcnt(-1) as a first step to improving codegen for some related branching patterns, e.g. there's a bunch of code out there that branches on laneid == 0 or workitem.id.linear == 0.)
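
For reference, the mbcnt-based lane ID idiom mentioned above, written with today's intrinsics (wave64 assumed; wave32 would use mbcnt.lo alone):

  %lo = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %laneid = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %lo)
  %lane0 = icmp eq i32 %laneid, 0        ; the "branch on lane 0" pattern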

llvm/test/Analysis/DivergenceAnalysis/AMDGPU/wave-id-computation.ll:28

Why is this division exact?

> I have a bunch of questions about this.

I've come back to this several times and been confused each time. I did write a program that convinced me that this is correct, and doesn't need to consider partial workgroups.

llvm/test/Analysis/DivergenceAnalysis/AMDGPU/wave-id-computation.ll:28

That's what comes out of clang.

> > I have a bunch of questions about this.
>
> I've come back to this several times and been confused each time. I did write a program that convinced me that this is correct, and doesn't need to consider partial workgroups.

The partial workgroup check is implicit in the pattern checked here. The uniform workgroup size attribute triggers optimizations on the group calculation. There's additional complexity on the group size load if that didn't trigger.

The description speaks about 1D, while the code targets multidimensional grids. In fact, for 1D you would not need the workgroup size to be uniform; it is only needed for 2D and 3D.

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp:943

Can this constant be defined somewhere?

The code in isAlwaysUniform is weirdly non-orthogonal. There are cases above the diff that match a straight-up workitem.id.x-based calculation of a wave ID, but they match [al]shr and and. And then there's the new code, but that one matches [su]div. Why doesn't the [su]div become an [al]shr?

The logic overall is also not quite convincing.

  • Why do the existing checks bail out based on workgroup Y and Z sizes? Okay, probably because of the FIXME above, though that seems sort of fundamental?
  • If I understand the new code correctly, it matches the patterns (groupSize.? * groupId.? + threadId.?). If you're covering y and z, isn't it more likely that one finds expressions like threadId.y * groupSize.y + threadId.x?
  • Also, it seems like checks against 2D/3D group sizes or odd 1D sizes are still missing.