This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Special case uniformity info for single lane workgroups
ClosedPublic

Authored by arsenm on May 24 2023, 8:51 AM.

Details

Reviewers
sameerds
jhuber6
JonChesterfield
jdoerfert
yassingh
Group Reviewers
Restricted Project
Summary

Constructors/destructors and OpenMP make use of single lane groups
in some cases.

Diff Detail

Event Timeline

arsenm created this revision.May 24 2023, 8:51 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 24 2023, 8:51 AM
arsenm requested review of this revision.May 24 2023, 8:51 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 24 2023, 8:51 AM

Shouldn't this similarly apply to wave syncs and ballots?

Shouldn't this similarly apply to wave syncs and ballots?

Ballots are forcibly uniform all the time. I guess this works for every operation and don't actually need to look for specific intrinsics

Ballots are forcibly uniform all the time. I guess this works for every operation and don't actually need to look for specific intrinsics

We should be able to remove every convergent attribute under this special case, right.

yassingh added inline comments.May 24 2023, 11:33 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
910–914

Will this also work for GMIR?
eg "%4:_(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.workitem.id.x)"

llvm/test/Analysis/UniformityAnalysis/AMDGPU/workitem-intrinsics.ll
44–50

Same as above, equivalent GMIR test?

Ballots are forcibly uniform all the time. I guess this works for every operation and don't actually need to look for specific intrinsics

We should be able to remove every convergent attribute under this special case, right.

Maybe, in the future with convergence tokens we should have a convergence reduction optimization.

I'm thinking a cleaner way to handle this would be to just have the divergence analysis early exit for single lane functions. Divergence cannot exist in the first place

This could be a top-level check in isSourceOfDivergence: if a source of divergence executes with a single lane, it de facto stops being a source of divergence.

But even beyond that: it is reasonably common to have compute shaders that do something like this:

if (threadid == 0) {
  ...
}

So it would be interesting to be able to handle this more generally in UniformityAnalysis, perhaps by having a callback that analyzes conditional branches and returns whether one side of the branch is entered by at most one lane.

Once that's the long-term goal, it seems plausible to me to make isSingleLaneExecution(Function &) a callback that is accessible to UniformityAnalysis, as opposed to querying it from isSourceOfDivergence.

But even beyond that: it is reasonably common to have compute shaders that do something like this:

if (threadid == 0) {
  ...
}

So it would be interesting to be able to handle this more generally in UniformityAnalysis, perhaps by having a callback that analyzes conditional branches and returns whether one side of the branch is entered by at most one lane.

Once that's the long-term goal, it seems plausible to me to make isSingleLaneExecution(Function &) a callback that is accessible to UniformityAnalysis, as opposed to querying it from isSourceOfDivergence.

We have a similar check for that explicit case in the OpenMPOpt pass https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/IPO/OpenMPOpt.cpp#L2760.

We should be able to remove every convergent attribute under this special case, right.

I'm not convinced that this is true. Consider:

uint64_t helper() {
  return ballot(true);
}

uint64_t tricky() {
  for (int i = 0; i < 64; ++i)
    if (thread == i)
      return helper();
  unreachable();
}

We can miscompile this as follows:

  • Mark helper as being "single lane only"
  • Remove convergent attributes based on that
  • Inline and eliminate the loop, reducing everything down to:
uint64_t tricky() {
  return ballot(true);
}

I'm not convinced that this is true. Consider:

uint64_t helper() {
  return ballot(true);
}

uint64_t tricky() {
  for (int i = 0; i < 64; ++i)
    if (thread == i)
      return helper();
  unreachable();
}

We can miscompile this as follows:

I think this program is illegal to begin with. The constraint isn't that helper runs with a single lane, it's that it only executes a workgroup of size 1. If the loop in tricky ever went over 1 iteration, that implies execution with a larger workgroup

Ballots are forcibly uniform all the time. I guess this works for every operation and don't actually need to look for specific intrinsics

We should be able to remove every convergent attribute under this special case, right.

Maybe, in the future with convergence tokens we should have a convergence reduction optimization.

I'm thinking a cleaner way to handle this would be to just have the divergence analysis early exit for single lane functions. Divergence cannot exist in the first place

That sounds right to me. Much simpler and assertive than checking for just a handful of intrinsics. Eventually, this can be an input to deciding whether to remove the convergent attribute, but that's a different task.

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
910–914

The change is in TTI, which is clearly meant for LLVM IR. But from the rest of the comments, if we were to make this change in the UA instead, then we could make it work for both LLVM IR and MIR.

arsenm updated this revision to Diff 527816.Jun 2 2023, 5:28 AM
arsenm retitled this revision from AMDGPU: Special case uniformity info for singlethreaded workitem IDs to AMDGPU: Special case uniformity info for single lane workgroups.

Skip analysis and make everything uniform

sameerds accepted this revision.Jun 3 2023, 3:15 AM
This revision is now accepted and ready to land.Jun 3 2023, 3:15 AM
arsenm added inline comments.Jun 16 2023, 3:51 PM
llvm/test/Analysis/UniformityAnalysis/AMDGPU/workitem-intrinsics.ll
44–50

Can't test MIR as-is, the TTI based machine uniformity query is broken as mentioned in the last patch