I believe this patch is OK as is, but it is currently useless in practice, and
I'm not sure how useful it will really be. Theoretically this should allow
something like:
kernel void foo(global int* global* arg_ptr) {
  global int* ptr = arg_ptr[get_global_id(0)];
  __builtin_assume(sub_group_all(ptr != NULL));
  if (ptr != NULL) {
    *ptr += 1;
  }
}
to use a scalar branch around the pointer dereference. There are a few obstacles
to this working today. First, using sub_group_all generates this warning
for some reason:
warning: the argument to '__builtin_assume' has side effects that will be discarded
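My guess at the cause, sketched in plain C rather than OpenCL: as far as I can tell, clang treats a call as potentially side-effecting unless the callee is declared `pure` or `const`, so any unattributed call inside `__builtin_assume` triggers this diagnostic and the argument is dropped. The functions below are hypothetical stand-ins for sub_group_all, not the real declaration, and the `ASSUME` macro is only there so the snippet also builds with compilers that lack the builtin. Whether `pure` is a legitimate attribute for a convergent cross-lane operation is a separate question that I have not checked.

```c
/* Use clang's __builtin_assume where it exists; expand to nothing elsewhere. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_assume)
#    define ASSUME(cond) __builtin_assume(cond)
#  endif
#endif
#ifndef ASSUME
#  define ASSUME(cond) ((void)0)
#endif

/* Hypothetical stand-ins for sub_group_all; neither is a real library function. */
int all_true_plain(int pred) { return pred; }
__attribute__((pure)) int all_true_pure(int pred) { return pred; }

void bump_plain(int *ptr) {
  /* Unattributed callee: clang is expected to warn and discard the argument. */
  ASSUME(all_true_plain(ptr != 0));
  if (ptr != 0) *ptr += 1;
}

void bump_pure(int *ptr) {
  /* Callee declared pure: the side-effects warning is not expected here. */
  ASSUME(all_true_pure(ptr != 0));
  if (ptr != 0) *ptr += 1;
}
```

Compiling this with clang should show the warning only for `bump_plain`. Both functions behave identically at runtime when passed a non-null pointer.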
Second, the device libraries are still using the legacy llvm.amdgcn.icmp intrinsics
instead of ballot.
Third, the device libraries are still using an inline assembly hack in lieu of
convergence tokens.
Fourth, even if those issues are avoided, the branch is still treated
as divergent when ultimately selected.
I'm not super sure this must hold. I'd just check it.