This is an archive of the discontinued LLVM Phabricator instance.

[Clang][Attr] Skip adding noundef attribute to arguments when function has convergent attribute
AbandonedPublic

Authored by skc7 on Apr 21 2022, 2:54 AM.

Details

Summary

Change https://reviews.llvm.org/D105169 enables noundef attribute by default. This is causing issue with functions tagged with convergent attribute.

For Ex: SimplifyCFG pass removes the branch leading to a BB which has an incoming value that will always trigger undefined behavior. This basically modifies the CFG and combines the basic blocks. This works for CPU execution. But on a GPU, there are intrinsics like "__shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize)", Where the exchange of variable occurs simultaneously for all active threads within the warp. So, here in the cuda/hip kernel, variable var in shuffl_sync may not be initialised, and LLVM IR treats it as undef. Currently all the arguments are tagged with noundef attribute and the above mentioned optimization by SimplifyCFG gets applied and the kernel execution becomes ambiguous. So, the proposed change is to skip adding noundef attribute to arguments when a function has been tagged with convergent attribute.

Diff Detail

Event Timeline

skc7 created this revision.Apr 21 2022, 2:54 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2022, 2:54 AM
skc7 requested review of this revision.Apr 21 2022, 2:54 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald Transcript
skc7 added a reviewer: arsenm.Apr 26 2022, 7:42 AM
arsenm added inline comments.Apr 26 2022, 9:41 AM
clang/lib/CodeGen/CGCall.cpp
2435

Missing space before (.

Needs comment explaining this

llvm/test/Transforms/SimplifyCFG/tautological-conditional-branch-convergent-noundef.ll
26

Aren't the cases with defined handling of undef lanes still defined for the result?

skc7 updated this revision to Diff 425393.Apr 26 2022, 7:08 PM

clang-format to CGCall.cpp. Added comment for the change

skc7 added inline comments.Apr 26 2022, 7:30 PM
clang/lib/CodeGen/CGCall.cpp
2435

Updated

llvm/test/Transforms/SimplifyCFG/tautological-conditional-branch-convergent-noundef.ll
26

ret double %i4?

skc7 updated this revision to Diff 425442.Apr 27 2022, 1:59 AM

fix failing tests

skc7 updated this revision to Diff 425458.Apr 27 2022, 3:06 AM

skip adding noundef to return type

skc7 updated this revision to Diff 425463.Apr 27 2022, 3:18 AM

update test

skc7 marked an inline comment as not done.Apr 27 2022, 3:19 AM
skc7 added inline comments.
llvm/test/Transforms/SimplifyCFG/tautological-conditional-branch-convergent-noundef.ll
26

Updated patch to skip adding noundef attribute to return types aswell

For Ex: SimplifyCFG pass removes the branch leading to a BB which has an incoming value that will always trigger undefined behavior. This basically modifies the CFG and combines the basic blocks. This works for CPU execution. But on a GPU, there are intrinsics like "__shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize)", Where the exchange of variable occurs simultaneously for all active threads within the warp. So, here in the cuda/hip kernel, variable var in shuffl_sync may not be initialised, and LLVM IR treats it as undef. Currently all the arguments are tagged with noundef attribute and the above mentioned optimization by SimplifyCFG gets applied and the kernel execution becomes ambiguous. So, the proposed change is to skip adding noundef attribute to arguments when a function has been tagged with convergent attribute.

Can we please have an example for this. I don't know what would be broken w/ noundef + convergent and I somewhat doubt noundef is the problem.

skc7 added a comment.Apr 27 2022, 9:31 AM

For Ex: SimplifyCFG pass removes the branch leading to a BB which has an incoming value that will always trigger undefined behavior. This basically modifies the CFG and combines the basic blocks. This works for CPU execution. But on a GPU, there are intrinsics like "__shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize)", Where the exchange of variable occurs simultaneously for all active threads within the warp. So, here in the cuda/hip kernel, variable var in shuffl_sync may not be initialised, and LLVM IR treats it as undef. Currently all the arguments are tagged with noundef attribute and the above mentioned optimization by SimplifyCFG gets applied and the kernel execution becomes ambiguous. So, the proposed change is to skip adding noundef attribute to arguments when a function has been tagged with convergent attribute.

Can we please have an example for this. I don't know what would be broken w/ noundef + convergent and I somewhat doubt noundef is the problem.

For the below source kernel from hypre, the optimisation by simplifyCFG pass caused issue with kernel execution on GPU.
We figured out that enabling noudef analysis by default is triggering this optimization.

source kernel:
Note: variable t is uninitialised intially and gets initialiazed when lane is 0.
void kernel{

double t, measure_row;
int lane = hypre_cuda_get_lane_id<1>();

...

if (lane == 0) {t = read_only_load(measure_diag + row);}
measure_row = __shfl_sync(HYPRE_WARP_FULL_MASK, t, 0);

...
}

Example LLVM IR for a similar scenario:
define void @func(i32 noundef %arg17) {
bb1:

%i1 = icmp eq i32 %arg17, 0
br i1 %i1, label %bb2, label %bb3

bb2: ; preds = %bb1

%i2 = call noundef double @read_only_load()
br label %bb3

bb3: ; preds = %bb2, %bb1

%i3 = phi double [ %i2, %bb2 ], [ undef, %bb1 ]
%i4 = call noundef double @__shfl_sync(double noundef %i3)
ret void

}

declare double @read_only_load()
declare double @__shfl_sync(double noundef) convergent

IR Dump After SimplifyCFGPass on func:
define void @func(i32 noundef %arg17) {
bb1:

%i1 = icmp eq i32 %arg17, 0
call void @llvm.assume(i1 %i1)
%i2 = call noundef double @read_only_load()
%i4 = call noundef double @__shfl_sync(double noundef %i2)
ret void

}

The issue you're describing sounds like it's specific to @__shfl_sync. In general, in C++, you aren't allowed to read from an uninitialized variable; see [basic.indet] in the standard. But if your testcase doesn't have undefined behavior, CUDA language rules must somehow allow this particular builtin function to take undef variables as input. (Is this documented somewhere?)

That isn't related to the "convergent" attribute; the transform you're describing doesn't break convergence rules.

jdoerfert added a comment.EditedApr 27 2022, 2:50 PM

The issue you're describing sounds like it's specific to @__shfl_sync. In general, in C++, you aren't allowed to read from an uninitialized variable; see [basic.indet] in the standard. But if your testcase doesn't have undefined behavior, CUDA language rules must somehow allow this particular builtin function to take undef variables as input. (Is this documented somewhere?)

That isn't related to the "convergent" attribute; the transform you're describing doesn't break convergence rules.

I concur, especially on the last part. So far I have not seen why this is tied in any way to convergent. It might be a shfl oddity in which case the proper solution is to freeze all shuffle arguments in clang.
EDIT: https://godbolt.org/z/dnv63bzjn

As far as I know this is supposed to be a broadcast from lane 0 to every lane, so not sure why the control flow really matters

The issue you're describing sounds like it's specific to @__shfl_sync. In general, in C++, you aren't allowed to read from an uninitialized variable; see [basic.indet] in the standard. But if your testcase doesn't have undefined behavior, CUDA language rules must somehow allow this particular builtin function to take undef variables as input. (Is this documented somewhere?)

That isn't related to the "convergent" attribute; the transform you're describing doesn't break convergence rules.

I concur, especially on the last part. So far I have not seen why this is tied in any way to convergent. It might be a shfl oddity in which case the proper solution is to freeze all shuffle arguments in clang.
EDIT: https://godbolt.org/z/dnv63bzjn

I'm thinking noundef is a bit of red herring here. The real problem seems to be arising from the assume call which is inserted, which now introduces the assumption that the lane ID must be 0

I'm thinking noundef is a bit of red herring here. The real problem seems to be arising from the assume call which is inserted, which now introduces the assumption that the lane ID must be 0

The optimizer is creating the llvm.assume call based on the violation of the noundef attribute.

I'm thinking noundef is a bit of red herring here. The real problem seems to be arising from the assume call which is inserted, which now introduces the assumption that the lane ID must be 0

The optimizer is creating the llvm.assume call based on the violation of the noundef attribute.

I agree. As far as I can tell you have two options, both are specific to the shuffle functions:

  1. Do not set noundef for calls to them as they allow undef values for all lanes we don't read the value.
  2. Freeze the inputs unconditionally.

Convergent is unrelated to this.

I agree. As far as I can tell you have two options, both are specific to the shuffle functions:

  1. Do not set noundef for calls to them as they allow undef values for all lanes we don't read the value.
  2. Freeze the inputs unconditionally.

Right, with some nitpicks. Option #1 is semantically more accurate: __shfl_sync, subgroupShuffe, and all similar instructions across GPU programming languages are meant to be conceptually similar to select in that they select a value from a lane. The "data" argument and return value should allow undef and poison. The incoming value is simply returned as-is, and so poison is propagated instead of causing immediate UB, just as it is for a select instruction. However, the lane argument can (and arguably should) still be noundef.

There's a separate curious issue in that apparently, reading from an uninitialized variable is not UB in CUDA/HIP/GLSL/HLSL/etc. If it was, a lot of code existing out there in the wild would be broken. But that's a matter for the relevant language standards to decide (for the subset of languages that have proper standards to begin with).

arsenm resigned from this revision.Nov 16 2022, 3:40 PM

This is obsolete and can be abandoned

skc7 abandoned this revision.Nov 16 2022, 9:47 PM