This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Detect uniformness of TID / wavefrontsize
ClosedPublic

Authored by rampitec on Aug 23 2022, 3:40 PM.

Details

Summary

A value of 'workitemid / wavefrontsize' or 'workitemid & ~(wavefrontsize - 1)'
is wave uniform.
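As an illustrative sketch (plain Python, not LLVM code), the claim can be checked by modeling a wave as 64 consecutive workitem ids: dividing by the wavefront size, or masking off the low lane bits, gives the same value for every lane of a wave, whereas masking with (wavefrontsize - 1) yields the per-lane id:

```python
# Illustrative sketch, not LLVM code: model waves of 64 consecutive
# workitem ids and check which derived values are wave uniform.
WAVEFRONT_SIZE = 64

uniform_div, uniform_mask = True, True
for start in range(0, 256, WAVEFRONT_SIZE):          # four full waves
    wave = range(start, start + WAVEFRONT_SIZE)
    # tid / wavefrontsize: the same for every lane of the wave.
    uniform_div &= len({tid // WAVEFRONT_SIZE for tid in wave}) == 1
    # tid & ~(wavefrontsize - 1): clears the lane bits, also uniform.
    uniform_mask &= len({tid & ~(WAVEFRONT_SIZE - 1) for tid in wave}) == 1
    # By contrast, tid & (wavefrontsize - 1) is the lane id, i.e. divergent.
    assert len({tid & (WAVEFRONT_SIZE - 1) for tid in wave}) == WAVEFRONT_SIZE

assert uniform_div and uniform_mask
print("TID / wavefrontsize and TID & ~(wavefrontsize - 1) are wave uniform")
```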

Diff Detail

Event Timeline

rampitec created this revision.Aug 23 2022, 3:40 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 23 2022, 3:40 PM
rampitec requested review of this revision.Aug 23 2022, 3:40 PM
Herald added a subscriber: wdng. · View Herald Transcript

A bit of explanation of why the change to isUniformLoad is needed. The DA is a stateless analysis, and we can attach amdgpu.uniform metadata to the GEP, but when it comes to the SDag the DA cannot help, and the SDag sees the node as divergent. Even if the divergent bit is unset on the node during DAG creation, that hardly helps after DAG combines lose it when killing the original shift node. Moreover, DAG combines create a pattern which is much more difficult to recognize. So we check the MMO metadata instead (much like GlobalISel does; it relies on the MMO metadata completely for that).

rampitec updated this revision to Diff 455308.Aug 24 2022, 11:41 AM

This looks good to me. The patch enables the compiler to generate s_load when the user writes code that divides threadIdx.x by the wavefront size. The only suggestion I have is to add some test cases showing explicitly that the amdgpu.uniform metadata is added via the divergence analysis (by the AnnotateUniformValues pass). The test cases provided rely upon that working correctly, though they show the end result rather than the steps needed to get there.

> This looks good to me. The patch enables the compiler to generate s_load when the user writes code that divides threadIdx.x by the wavefront size. The only suggestion I have is to add some test cases showing explicitly that the amdgpu.uniform metadata is added via the divergence analysis (by the AnnotateUniformValues pass). The test cases provided rely upon that working correctly, though they show the end result rather than the steps needed to get there.

I will update the same testcase to run only the annotate-uniform-values pass and check the metadata.

What if the operation happens in divergent control flow, e.g.

int x = 0;
x = threadIdx.x > 32 ? threadIdx.x / 64 : 0;

will this patch still work?

> What if the operation happens in divergent control flow, e.g.
>
> int x = 0;
> x = threadIdx.x > 32 ? threadIdx.x / 64 : 0;
>
> will this patch still work?

Regardless of the CFG, the value of TID / 64 is always uniform. The value of 'x' here is another value, derived from that uniform one. It is no different from x = cc ? sgpr0 : sgpr1; the LHS and RHS are uniform, but 'x' is not.
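A small Python sketch of the point above (the values are made up): selecting between two wave-uniform values under a divergent condition produces a divergent result, even though both operands are uniform.

```python
# Hypothetical sketch: a wave of 64 lanes selects between two wave-uniform
# values (sgpr0, sgpr1) under a per-lane (divergent) condition.
WAVEFRONT_SIZE = 64
sgpr0, sgpr1 = 7, 9                      # uniform: same value in every lane

lanes = range(WAVEFRONT_SIZE)
cc = [lane > 32 for lane in lanes]       # divergent condition
x = [sgpr0 if c else sgpr1 for c in cc]  # x = cc ? sgpr0 : sgpr1, per lane

# Both operands are uniform, but the selected value differs across lanes.
assert len(set(x)) == 2
print("x is divergent although both select operands are uniform")
```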

rampitec updated this revision to Diff 455974.Aug 26 2022, 11:43 AM

Added test run lines to check amdgpu-annotate-uniform metadata directly.

bcahoon accepted this revision.Aug 26 2022, 2:42 PM

LGTM. Thanks for adding this patch!

This revision is now accepted and ready to land.Aug 26 2022, 2:42 PM
This revision was automatically updated to reflect the committed changes.
bcl5980 added a subscriber: bcl5980.EditedAug 27 2022, 5:10 AM

> What if the operation happens in divergent control flow, e.g.
>
> int x = 0;
> x = threadIdx.x > 32 ? threadIdx.x / 64 : 0;
>
> will this patch still work?

> Regardless of the CFG, the value of TID / 64 is always uniform. The value of 'x' here is another value, derived from that uniform one. It is no different from x = cc ? sgpr0 : sgpr1; the LHS and RHS are uniform, but 'x' is not.

What if blockDim.x is not a multiple of 64, like 65, and blockDim.y is not 1?
For example, if the workgroup shape is <65, 2, 1>:
warp 0 is [0,0] to [63,0]
warp 1 is [64,0] to [62,1]
warp 2 is [63,1] to [64,1]
threadIdx.x / 64 in warps 1 and 2 should still be divergent.
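This corner case can be reproduced with a small Python model (a hypothetical sketch; work items are linearized as x + blockDim.x * y and packed into 64-lane waves):

```python
# Hypothetical model of the <65, 2, 1> workgroup example: linearize work
# items as x + 65 * y and split the flat order into 64-lane waves.
DIM_X, DIM_Y, WAVE = 65, 2, 64

items = [(x, y) for y in range(DIM_Y) for x in range(DIM_X)]   # flat order
waves = [items[i:i + WAVE] for i in range(0, len(items), WAVE)]

# Wave 0 spans (0,0)..(63,0); wave 1 spans (64,0)..(62,1); wave 2 the rest.
assert waves[0][0] == (0, 0) and waves[0][-1] == (63, 0)
assert waves[1][0] == (64, 0) and waves[1][-1] == (62, 1)
assert waves[2][0] == (63, 1) and waves[2][-1] == (64, 1)

# threadIdx.x // 64 is uniform in wave 0 but divergent in waves 1 and 2.
assert len({x // WAVE for x, _ in waves[0]}) == 1
assert len({x // WAVE for x, _ in waves[1]}) > 1
assert len({x // WAVE for x, _ in waves[2]}) > 1
print("threadIdx.x / 64 is divergent in waves 1 and 2")
```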

rampitec added a comment.EditedAug 29 2022, 10:49 AM

> What if blockDim.x is not a multiple of 64, like 65, and blockDim.y is not 1?
> For example, if the workgroup shape is <65, 2, 1>:
> warp 0 is [0,0] to [63,0]
> warp 1 is [64,0] to [62,1]
> warp 2 is [63,1] to [64,1]
> threadIdx.x / 64 in warps 1 and 2 should still be divergent.

Yes, you are right, thanks! It looks like I need to limit this to the case where the amdgpu-no-workitem-id-y attribute is present on the function.

>> What if blockDim.x is not a multiple of 64, like 65, and blockDim.y is not 1?
>> For example, if the workgroup shape is <65, 2, 1>:
>> warp 0 is [0,0] to [63,0]
>> warp 1 is [64,0] to [62,1]
>> warp 2 is [63,1] to [64,1]
>> threadIdx.x / 64 in warps 1 and 2 should still be divergent.
>
> Yes, you are right, thanks! It looks like I need to limit this to the case where the amdgpu-no-workitem-id-y attribute is present on the function.

D132879 limits it.

Does this address the same issue as D124385?

> Does this address the same issue as D124385?

Yes, although in practice the divisor is a power of 2, and in fact the wavefront size (that is how people use it), so there is no SDiv or UDiv; there is a shift, as in this patch.
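For unsigned values this equivalence is exact, which a quick Python check illustrates (the constants here are illustrative): dividing by a power of two is a right shift, and the remainder is a bitwise AND, so only shift and mask patterns need to be matched.

```python
# For unsigned values, division by a power of two equals a right shift,
# and the remainder equals a bitwise AND with (divisor - 1); that is why
# the patch matches a shift rather than SDiv/UDiv nodes.
WAVEFRONT_SIZE = 64
SHIFT = WAVEFRONT_SIZE.bit_length() - 1       # log2(64) == 6

for tid in range(1024):
    assert tid // WAVEFRONT_SIZE == tid >> SHIFT
    assert tid % WAVEFRONT_SIZE == tid & (WAVEFRONT_SIZE - 1)
print("x / 64 == x >> 6 and x % 64 == x & 63 for unsigned x")
```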