This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Fix divergence analysis of control flow intrinsics
ClosedPublic

Authored by arsenm on Jan 31 2020, 3:40 PM.

Details

Summary

The mask results of these should be uniform. The trickier part is the
dummy booleans used as IR glue need to be treated as divergent. This
should make the divergence analysis results correct for the IR the DAG
is constructed from.

This should allow us to eliminate requiresUniformRegister, which has
an expensive, recursive scan over all users looking for control flow
intrinsics. This should avoid recent compile time regressions.

Diff Detail

Event Timeline

arsenm created this revision.Jan 31 2020, 3:40 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 31 2020, 3:40 PM

Let's Alex review, but could you run a PSDB in the meanwhile?

arsenm marked an inline comment as done.Jan 31 2020, 5:03 PM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
712–739

I think this works for the def cases, but the uses don't currently have the analog in DivergenceAnalysis for values that have to be uniform uses

This looks plausible enough, but feels like the kind of change that should ideally be run through a thorough set of real execution checks.

arsenm updated this revision to Diff 242180.Feb 3 2020, 2:02 PM
arsenm retitled this revision from AMDGPU: Stop scanning for control flow intrinsic users to AMDGPU: Fix divergence analysis of control flow intrinsics.
arsenm edited the summary of this revision. (Show Details)

Now with more passing

arsenm updated this revision to Diff 242191.Feb 3 2020, 2:24 PM

Split removal into separate patch

arsenm added a comment.Feb 3 2020, 3:18 PM

I think the only obstacle to eliminating requiresUniformRegister is due to the treatment of phis with always uniform inputs. This case for example, the LCSSA phi in the return block was incorrectly concluded to be divergent, despite there only being one always uniform input

Printing analysis 'Legacy Divergence Analysis' for function 'atomic_nand_i32_lds':
DIVERGENT: i32 addrspace(3)* %ptr

           :
DIVERGENT:       %1 = load i32, i32 addrspace(3)* %ptr, align 4
                 br label %atomicrmw.start

           atomicrmw.start:
                 %phi.broken = phi i64 [ %4, %atomicrmw.start ], [ 0, %0 ]
DIVERGENT:       %loaded = phi i32 [ %1, %0 ], [ %newloaded, %atomicrmw.start ]
DIVERGENT:       %2 = and i32 %loaded, 4
DIVERGENT:       %new = xor i32 %2, -1
DIVERGENT:       %3 = cmpxchg i32 addrspace(3)* %ptr, i32 %loaded, i32 %new seq_cst seq_cst
DIVERGENT:       %success = extractvalue { i32, i1 } %3, 1
DIVERGENT:       %newloaded = extractvalue { i32, i1 } %3, 0
                 %4 = call i64 @llvm.amdgcn.if.break.i64(i1 %success, i64 %phi.broken)
DIVERGENT:       %5 = call i1 @llvm.amdgcn.loop.i64(i64 %4)
DIVERGENT:       br i1 %5, label %atomicrmw.end, label %atomicrmw.start

           atomicrmw.end:
DIVERGENT:       %newloaded.lcssa = phi i32 [ %newloaded, %atomicrmw.start ]
DIVERGENT:       %.lcssa = phi i64 [ %4, %atomicrmw.start ]
DIVERGENT:       call void @llvm.amdgcn.end.cf.i64(i64 %.lcssa)
DIVERGENT:       ret i32 %newloaded.lcssa
alex-t added a comment.Feb 4 2020, 7:51 AM

I agree that current expensive walk on IR should be removed.
I initially consider it as a temporary solution that allows to switch to the new concept
of assignment correct register classes keeping in mind to eventually find another way to workaround CF results exceptions.

What I conceptually don't like here is mixing the "divergence" and register class notions.
The mask produced by the CF intrinsics is always scalar - not same as always uniform.
The isAlwaysUniform is queried by the DA in the divergence propagation.
I'm afraid we can get some subtle bugs claiming CF intrinsic "result 1" always uniform.
Although I cannot yet provide any example to illustrate my concerns.

I would suggest to run extended testing: MESA tests, BLAS tests.... something else that is tremendous enough.

arsenm added a comment.Feb 4 2020, 7:58 AM

I agree that current expensive walk on IR should be removed.
I initially consider it as a temporary solution that allows to switch to the new concept
of assignment correct register classes keeping in mind to eventually find another way to workaround CF results exceptions.

What I conceptually don't like here is mixing the "divergence" and register class notions.
The mask produced by the CF intrinsics is always scalar - not same as always uniform.

Scalar is the only modeled concept of uniform we have. We don't currently try to model workgroup uniform, so these should be the same

alex-t added a comment.Feb 4 2020, 8:43 AM

I agree that current expensive walk on IR should be removed.
I initially consider it as a temporary solution that allows to switch to the new concept
of assignment correct register classes keeping in mind to eventually find another way to workaround CF results exceptions.

What I conceptually don't like here is mixing the "divergence" and register class notions.
The mask produced by the CF intrinsics is always scalar - not same as always uniform.

Scalar is the only modeled concept of uniform we have. We don't currently try to model workgroup uniform, so these should be the same

It should work if we can guarantee that the values produced by the CF intrinsics are only used by the another CF intrinsics.
I mean something like AND/OR the uniform mask with some divergent value and then feeding result to another CF...
but normally this should not happen.
Once again: I like getting rid of requiresUniformRegister but insist on the extended testing.
Also, the trouble with LCSSA use of the uniform mask out of the divergent loop need to be solved anyway.

arsenm added a comment.Feb 4 2020, 9:10 AM

I agree that current expensive walk on IR should be removed.
I initially consider it as a temporary solution that allows to switch to the new concept
of assignment correct register classes keeping in mind to eventually find another way to workaround CF results exceptions.

What I conceptually don't like here is mixing the "divergence" and register class notions.
The mask produced by the CF intrinsics is always scalar - not same as always uniform.

Scalar is the only modeled concept of uniform we have. We don't currently try to model workgroup uniform, so these should be the same

It should work if we can guarantee that the values produced by the CF intrinsics are only used by the another CF intrinsics.
I mean something like AND/OR the uniform mask with some divergent value and then feeding result to another CF...
but normally this should not happen.

In no context should this ever happen. The control flow intrinsics are not general use, and are just a hack to get some control flow representation through the DAG. We should not be trying to worry about handling some bug case where somehow a divergent operation was introduced between a control flow intrinsic and the user. That is a bug and it should break. We shouldn't need to bend over backwards to support this case.

Once again: I like getting rid of requiresUniformRegister but insist on the extended testing.
Also, the trouble with LCSSA use of the uniform mask out of the divergent loop need to be solved anyway.

I've split that into an independent problem. The control flow intrinsics can still have correct divergence information without yet ripping out requiresUniformRegister

alex-t accepted this revision.Feb 5 2020, 8:07 AM
This revision is now accepted and ready to land.Feb 5 2020, 8:07 AM