This is an archive of the discontinued LLVM Phabricator instance.

[WIP] DivergenceAnalysis: Infer uniformity from assume calls
Needs ReviewPublic

Authored by arsenm on Oct 31 2022, 6:28 PM.

Details

Reviewers
sameerds
nhaehnle
simoll
jdoerfert
Group Reviewers
Restricted Project
Summary

I believe this patch is OK as is, but is currently useless in practice and
I'm not sure how useful this really will be. Theoretically this should allow
something like:

kernel void foo(global int* global* arg_ptr) {

global int* ptr = arg_ptr[get_global_id(0)];
__builtin_assume(sub_group_all(ptr != NULL));
if (ptr != NULL) {
    *ptr += 1;
}

}

to use a scalar branch around the pointer dereference. There are a few obstacles
to this working today. First, using sub_group_all generates this warning
for some reason:

warning: the argument to '__builtin_assume' has side effects that will be discarded

Second, the device libraries are still using the legacy llvm.amdgcn.icmp intrinsics
instead of ballot.

Third, the device libraries are still using an inline assembly hack in lieu of
convergence tokens.

Fourth, even if those issues are avoided, the branch is still treated
as divergent when ultimately selected.

Diff Detail

Event Timeline

arsenm created this revision.Oct 31 2022, 6:28 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 31 2022, 6:28 PM
arsenm requested review of this revision.Oct 31 2022, 6:28 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 31 2022, 6:28 PM
Herald added a subscriber: wdng. · View Herald Transcript

I generally think this is worthwhile. As you noted, there are still problems but we could move this part ahead I think.

warning: the argument to '__builtin_assume' has side effects that will be discarded

__builtin_assume is dropped if we cannot show that the expression is side-effect free as part of the lowering. The expression you used has arbitrary side-effects, I think (there is no godbold for HIP):
https://github.com/llvm/llvm-project/blob/6c8995649afac04a9eb0e71affd997e493c9b93a/clang/lib/Sema/OpenCLBuiltins.td#L1718

That said, multiple ways around this, easiest for now:

int assumption = sub_group_all(P != NULL);
__builtin_assume(assumption);
llvm/lib/Analysis/DivergenceAnalysis.cpp
154

I'm not super sure this must hold. I'd just check it.

nhaehnle added a comment.EditedNov 2 2022, 8:34 AM

I don't quite see the point of this change. For test cases like @assume_ballot_eq_0, what we really should be doing here is optimize the branch away entirely because the llvm.assume implies that %cmp == 0.

It sounds like what we really want here is a sort of llvm.assume.uniform intrinsic. Or maybe an llvm.amdgcn.is.uniform intrinsic and then do llvm.assume(llvm.amdgcn.is.uniform)

I don't quite see the point of this change. For test cases like @assume_ballot_eq_0, what we really should be doing here is optimize the branch away entirely because the llvm.assume implies that %cmp == 0.

It sounds like what we really want here is a sort of llvm.assume.uniform intrinsic. Or maybe an llvm.amdgcn.is.uniform intrinsic and then do llvm.assume(llvm.amdgcn.is.uniform)

Not a lot of time right now to follow up on this.. and as much as i dislike passerby comments:
For the DA in isolation, ideally, we'd have something like:

%Y = llvm.assume.uniform(%X)
foo(%Y) ; <- Rewritten to use %Y instead of %X.

The DA would automatically pick up on the uniformity without any changes.
The assume intrinsic is non-speculatable to keep the control dependences around.
Obfuscating the %X -> foo def-use chain may inflict some damage to other analyses, though.

foad added a comment.Jan 12 2023, 2:38 AM

For the DA in isolation, ideally, we'd have something like:

%Y = llvm.assume.uniform(%X)
foo(%Y) ; <- Rewritten to use %Y instead of %X.

This is roughly what the target-specific @llvm.amdgcn.readfirstlane does today, and some frontends do use it to assert and/or enforce uniformity of particular values. There is some sublety about exactly what it means (or exactly what @llvm.assume.uniform should mean): Read the first active lane? Read an arbitrary active lane? Undefined/poison if active lanes do not all have the same value?

For the DA in isolation, ideally, we'd have something like:

%Y = llvm.assume.uniform(%X)
foo(%Y) ; <- Rewritten to use %Y instead of %X.

This is roughly what the target-specific @llvm.amdgcn.readfirstlane does today, and some frontends do use it to assert and/or enforce uniformity of particular values. There is some sublety about exactly what it means (or exactly what @llvm.assume.uniform should mean): Read the first active lane? Read an arbitrary active lane? Undefined/poison if active lanes do not all have the same value?

Read all active lanes. The intrinsic only tells us that we can assume uniformity among the active lanes in each instance, it could not be used to enforce it. Not so sure about the values on inactive lanes, I'd say it simply passes through the incoming values.. you may just want poison here though..

For the DA in isolation, ideally, we'd have something like:

%Y = llvm.assume.uniform(%X)
foo(%Y) ; <- Rewritten to use %Y instead of %X.

This is roughly what the target-specific @llvm.amdgcn.readfirstlane does today, and some frontends do use it to assert and/or enforce uniformity of particular values. There is some sublety about exactly what it means (or exactly what @llvm.assume.uniform should mean): Read the first active lane? Read an arbitrary active lane? Undefined/poison if active lanes do not all have the same value?

Read all active lanes. The intrinsic only tells us that we can assume uniformity among the active lanes in each instance, it could not be used to enforce it. Not so sure about the values on inactive lanes, I'd say it simply passes through the incoming values.. you may just want poison here though..

We do need to say what happens if the assumptions is wrong. I believe at a minimum we need to say that the result is poison, because of what happens when the result feeds into a conditional branch: divergence analysis uses the assumption, which can affect codegen. So UB on that branch if the assumption is wrong seems like the minimum we need.

Though immediate UB is a legitimate alternative, since it would allow us to replace other uses of %X by %Y.

simoll added a comment.EditedMay 12 2023, 3:25 AM

For the DA in isolation, ideally, we'd have something like:

%Y = llvm.assume.uniform(%X)
foo(%Y) ; <- Rewritten to use %Y instead of %X.

This is roughly what the target-specific @llvm.amdgcn.readfirstlane does today, and some frontends do use it to assert and/or enforce uniformity of particular values. There is some sublety about exactly what it means (or exactly what @llvm.assume.uniform should mean): Read the first active lane? Read an arbitrary active lane? Undefined/poison if active lanes do not all have the same value?

Read all active lanes. The intrinsic only tells us that we can assume uniformity among the active lanes in each instance, it could not be used to enforce it. Not so sure about the values on inactive lanes, I'd say it simply passes through the incoming values.. you may just want poison here though..

We do need to say what happens if the assumptions is wrong. I believe at a minimum we need to say that the result is poison, because of what happens when the result feeds into a conditional branch: divergence analysis uses the assumption, which can affect codegen. So UB on that branch if the assumption is wrong seems like the minimum we need.

Though immediate UB is a legitimate alternative, since it would allow us to replace other uses of %X by %Y.

.. and by extension you can take the control conditions of the call as pre-conditions whereas if it's 'just' poison you can only do that for the instructions that actually trigger ub upon poison. You may want to turn the pre-condition into an explicit parameter, as in:

%Y = llvm.assume.uniform(%X, %mask) <-- triggers immediate UB where %X is not uniform among the threads that actively execute this in lock-step and where %mask is true.

You are then free to rewrite uses of %X into uses of %Y where the use is dominated by the intrinsic to improve DA precision.