The new llvm.amdgcn.convergent.copy intrinsic creates a convergent copy.
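For illustration, a minimal sketch of what the declaration of such an intrinsic could look like; the concrete name mangling and attribute set here are assumptions, not taken from the patch:

```
; Sketch only (assumed signature): an identity copy whose convergent
; attribute keeps optimizers from moving or duplicating it across control
; flow that could change the set of active lanes.
declare i32 @llvm.amdgcn.convergent.copy.i32(i32) #0

attributes #0 = { convergent nounwind readnone }
```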
This convergent copy intrinsic acts here as a form of barrier: it makes sure that all active lanes of the VGPR (i.e., the result of the intrinsic) are computed before its use.
Consider the following input LLVM IR:
```
%sub.i = sub nsw i32 0, %11
%12 = atomicrmw add ptr addrspace(1) %1, i32 %sub.i syncscope("agent-one-as") monotonic, align 4
```
Here, I am transforming the computation of %12 into a for loop that is executed by only the first active lane. One of the basic blocks associated with this for loop needs to read values from a VGPR (use) corresponding to %sub.i.
So with the convergent copy I want to make sure that:
- All lanes of a VGPR associated with %sub.i are computed before starting the loop.
- It will also prevent the computation of %sub.i from being sunk into the basic blocks associated with that for loop (see the sketch below).
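For concreteness, a hypothetical sketch of the intended placement (the call shape is assumed, not taken from the patch):

```
; Hypothetical placement: the convergent copy pins %sub.i before the loop,
; so every lane's value exists and the sub cannot be sunk past this point.
%sub.i  = sub nsw i32 0, %11
%pinned = call i32 @llvm.amdgcn.convergent.copy.i32(i32 %sub.i)
br label %LoopHeader
```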
And you want to use @llvm.amdgcn.readlane inside the loop to read arbitrary (probably inactive) lanes from %sub.i?
I'm not sure that this is a robust solution. What about an example like this:
```
%sub.i = sub nsw i32 0, %11
%cc = call @llvm.amdgcn.convergent.copy(%sub.i)
if (divergent condition) {
  some code here
} else {
  ... = call @llvm.amdgcn.readlane(%cc, %n)
}
```
What's to stop "some code here" from clobbering some lanes of the vgpr that was holding %cc?
I don't think this patch solves any real problem, it just raises a bunch of questions about what you're trying to do.
If you want to read values from inactive lanes of a vgpr robustly then you need something like WWM - but I guess you don't trust the WWM implementation, so you're back to square one.
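For reference, the WWM route being alluded to would look roughly like this, using the existing llvm.amdgcn.strict.wwm intrinsic; this is a sketch only, and whether its guarantees fit this use case is exactly what is in question:

```
; Sketch: ask for the value to be computed in whole wave mode, so that
; inactive lanes of the source are defined as well.
%neg = sub nsw i32 0, %11
%wwm = call i32 @llvm.amdgcn.strict.wwm.i32(i32 %neg)
```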
However... why are you doing readlane inside the part of the code that only has a single active lane? You can write a reduction across all active lanes like this without changing EXEC. (This is the unrolled version but you can put it in a loop if you want; that's irrelevant.)
```
s_mov s0, 0              ; initialize accumulator

; conditionally add in lane 0
v_readlane s1, v0, 0
s_bitcmp1 exec, 0
s_cselect s1, s1, 0
s_add s0, s0, s1

; conditionally add in lane 1
v_readlane s1, v0, 1
s_bitcmp1 exec, 1
s_cselect s1, s1, 0
s_add s0, s0, s1

...

; conditionally add in lane 31
v_readlane s1, v0, 31
s_bitcmp1 exec, 31
s_cselect s1, s1, 0
s_add s0, s0, s1

; result is in s0
```
You should be able to generate code like this from regular IR using the readlane intrinsic, which is already marked as Convergent. Once you've done the reduction you can do your atomic operation with only one lane active by generating regular IR like the AtomicOptimizer pass does.
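For example, one unrolled step of that reduction could be written in regular IR along these lines (names are illustrative):

```
; One step of the reduction in plain IR (illustrative sketch).
%exec = call i64 @llvm.amdgcn.ballot.i64(i1 true)  ; current exec mask
%v    = call i32 @llvm.amdgcn.readlane(i32 %val, i32 0)
%bit  = and i64 %exec, 1                           ; is lane 0 active?
%act  = icmp ne i64 %bit, 0
%sel  = select i1 %act, i32 %v, i32 0              ; the s_cselect equivalent
%acc1 = add i32 %acc0, %sel                        ; accumulate into the running sum
```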
FWIW, there is no desire to read from inactive lanes. The loop is only supposed to read from, and write to, lanes that were active before the for loop is executed by a single selected lane.
Then I'm back to not understanding what this convergent copy is for. I'd need to see a more complete example.
As I understand it, if an expression is involved, the computation of the expression sinks into the loop, and that must be prevented.
Hello Jay, I am making sure that I am accessing only active lanes inside the for loop. Consider the input IR to the atomic optimizer:
```
define protected amdgpu_kernel void @_Z3SumPiS_S_(...) local_unnamed_addr #0 {
entry:
  ...
  %11 = load i32, ptr addrspace(1) %arrayidx, align 4, !tbaa !9, !amdgpu.noclobber !6
  %sub.i = sub nsw i32 0, %11
  %12 = atomicrmw add ptr addrspace(1) %1, i32 %sub.i syncscope("agent-one-as") monotonic, align 4
  %arrayidx8 = getelementptr inbounds i32, ptr addrspace(1) %2, i64 %idxprom
  store i32 %12, ptr addrspace(1) %arrayidx8, align 4, !tbaa !9
  ret void
}
```
I am transforming the computation of %sub.i as follows:
```
VAL = ...  // VGPR
RES = ...  // FINAL result of scan, active lanes will write to this VGPR
sum = 0;   // SGPR, holds the partial sum
for (int lane = 0; lane < 64; lane++) {
  if (IsActive(lane)) {                  // check whether the lane is active or not
    elementToAdd = readlane(VAL, lane);  // SGPR, read the value we want to add from VAL at lane id
    sum = sum + elementToAdd;            // SGPR, update the value
    writelane(RES, sum, lane);           // write the value sum (SGPR) to VGPR RES at lane
  }
}
```
which will be transformed into the following LLVM IR based on the above design:
```
define protected amdgpu_kernel void @_Z3SumPiS_S_(...) local_unnamed_addr #0 {
entry:
  %Alloca = alloca i32, align 4, addrspace(5)
  %ExclScan = load i32, ptr addrspace(5) %Alloca, align 4
  ....
  ....
  %sub.i = sub nsw i32 0, %11
  %12 = call i64 @llvm.amdgcn.ballot.i64(i1 true)
  %13 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %14 = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %13)
  %15 = call i32 @llvm.amdgcn.readfirstlane(i32 %14)
  %16 = icmp eq i32 %14, %15
  br i1 %16, label %LoopHeader, label %19

17:                                               ; preds = %ForElse
  %18 = atomicrmw add ptr addrspace(1) %1, i32 %29 syncscope("agent-one-as") monotonic, align 4
  br label %19

19:                                               ; preds = %entry, %17
  %20 = phi i32 [ poison, %entry ], [ %18, %17 ]
  %21 = call i32 @llvm.amdgcn.readfirstlane(i32 %20)
  %22 = add i32 %21, %30
  %arrayidx8 = getelementptr inbounds i32, ptr addrspace(1) %2, i64 %idxprom
  store i32 %22, ptr addrspace(1) %arrayidx8, align 4, !tbaa !9
  ret void

LoopHeader:                                       ; preds = %entry
  br label %Loop

Loop:                                             ; preds = %ForElse, %LoopHeader
  %lane = phi i32 [ 0, %LoopHeader ], [ %nextlane, %ForElse ]
  %iSum = phi i32 [ 0, %LoopHeader ], [ %iSum14, %ForElse ]
  %scan = phi i32 [ %ExclScan, %LoopHeader ], [ %30, %ForElse ]
  %iVal = phi i32 [ %sub.i, %LoopHeader ], [ %28, %ForElse ]
  %23 = zext i32 %lane to i64
  %24 = shl i64 1, %23
  %25 = and i64 %12, %24
  %26 = icmp ne i64 %25, 0
  br i1 %26, label %ForIf, label %ForElse

ForIf:                                            ; preds = %Loop
  %27 = call i32 @llvm.amdgcn.readlane(i32 %14, i32 %lane)
  %28 = call i32 @llvm.amdgcn.readlane(i32 %iVal, i32 %27)
  %29 = add i32 %iSum, %28
  %30 = call i32 @llvm.amdgcn.writelane(i32 %29, i32 %27, i32 %scan)
  br label %ForElse

ForElse:                                          ; preds = %ForIf, %Loop
  %iSum14 = phi i32 [ %29, %ForIf ], [ %iSum, %Loop ]
  %iVal15 = phi i32 [ %28, %ForIf ], [ %iVal, %Loop ]
  %nextlane = add i32 1, %lane
  %31 = icmp eq i32 %nextlane, 64
  br i1 %31, label %17, label %Loop
}
```
Here, the entry basic block branches out to the scan loop, which consists of LoopHeader, Loop, ForIf, ForElse, and 17, and is responsible for precomputing the partial sums for each active lane as well as the final reduced value (which is updated only by BB 17). I am making sure that readlane and writelane operate only on active lanes.
Now, coming to the problem I am facing: the machine code sinking optimization sinks the computation of %sub.i from the entry basic block into LoopHeader, which is executed by only one lane. This leads to incorrect computations.
```
_Z3SumPiS_S_:                           ; @_Z3SumPiS_S_
; %bb.0:                                ; %entry
        ...
        global_load_dword v5, v[4:5], off
        s_load_dwordx4 s[4:7], s[6:7], 0x0
        v_mbcnt_lo_u32_b32 v3, -1, 0
        v_mbcnt_hi_u32_b32 v3, -1, v3
        v_readfirstlane_b32 s8, v3
        s_mov_b32 s13, 0
        v_cmp_eq_u32_e32 vcc, s8, v3
        ; implicit-def: $vgpr4
        s_and_saveexec_b64 s[8:9], vcc
        s_cbranch_execz .LBB0_6
; %bb.1:                                ; %LoopHeader
        ** s_waitcnt vmcnt(0)
        v_sub_u32_e32 v4, 0, v5 **
        s_mov_b32 s15, 0
        s_mov_b32 s14, 0
        ; implicit-def: $sgpr16
        s_lshr_b64 s[18:19], s[10:11], s15
        s_and_b32 s12, s18, 1
        s_cmp_eq_u64 s[12:13], 0
        s_cbranch_scc1 .LBB0_3
.LBB0_2:                                ; %ForIf
        v_readlane_b32 s12, v3, s15
        s_mov_b32 m0, s12
        s_nop 2
        v_readlane_b32 s16, v4, s12
        s_add_i32 s14, s14, s16
        v_writelane_b32 v2, s14, m0
```
You can see that after sinking the computation of v4 into %bb.1 (the two instructions marked with **), v4 only holds a valid value for the first active lane. This v4 is read later in .LBB0_2, which is not the desired behavior in this case.
So with this convergent copy I want to prevent this sinking and make sure that %sub.i (i.e., v4 here) is fully computed before the loop starts. Please let me know if there is a better way to achieve this. I will discuss your concerns with Matt and Brian.
I am making sure that I am accessing only active lanes inside a for loop.
No you're not. In your example, all of LoopHeader, Loop, ForIf, ForElse run inside the part that only enables a single lane. (Why?) But they access arbitrary lanes of %sub.i.
Hello Jay, can I extend this assembly sequence as follows to store the intermediate scan results for each active lane using writelane? There will be cases where we need these intermediate scan results later in the kernel.
```
s_mov s0, 0              ; initialize accumulator
v_mov v1, 0              ; initialize a vgpr to store scan results of each active lane

; conditionally add in lane 0
v_readlane s1, v0, 0
s_bitcmp1 exec, 0
s_cselect s1, s1, 0
s_add s0, s0, s1
v_writelane v1, s0, 0

; conditionally add in lane 1
v_readlane s1, v0, 1
s_bitcmp1 exec, 1
s_cselect s1, s1, 0
s_add s0, s0, s1
v_writelane v1, s0, 1

...

; conditionally add in lane 31
v_readlane s1, v0, 31
s_bitcmp1 exec, 31
s_cselect s1, s1, 0
s_add s0, s0, s1
v_writelane v1, s0, 31
```
Actually on second thoughts, that will write a value to inactive lanes of v1, so it seems dangerous. I think you might have to use a conditional branch to skip the writelane for inactive lanes.
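In IR terms that guarded shape could look like this (an illustrative sketch; the ballot bit test is uniform, so the branch is a scalar branch):

```
Check0:                                  ; test the ballot bit for lane 0
  %bit = and i64 %exec, 1
  %act = icmp ne i64 %bit, 0
  br i1 %act, label %Write0, label %Skip0

Write0:                                  ; executed only if lane 0 is active
  %r0 = call i32 @llvm.amdgcn.writelane(i32 %sum, i32 0, i32 %res)
  br label %Skip0

Skip0:
  %res0 = phi i32 [ %r0, %Write0 ], [ %res, %Check0 ]
```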
```
VAL = ...  // VGPR
RES = ...  // FINAL result of scan, active lanes will write to this VGPR
sum = 0;   // SGPR, holds the partial sum
for (int lane = 0; lane < 64; lane++) {
  if (IsActive(lane)) {                  // check whether the lane is active or not
    elementToAdd = readlane(VAL, lane);  // SGPR, read the value we want to add from VAL at lane id
    sum = sum + elementToAdd;            // SGPR, update the value
    writelane(RES, sum, lane);           // write the value sum (SGPR) to VGPR RES at lane
  }
}
```
The idea here is a dangerous way to program our GPU. Please check the comment below to see why we should not do this.
A possible safe way is to do something like:
```
// all the active threads should enter the loop.
bool hasUnprocessedLane = true;
do {
  get_first_active_lane();
  if (is_first_active_lane) {
    // only the first active lane will go here; other threads skip to the loop end.
    // ... do the work for this active lane ...
    hasUnprocessedLane = false;
  }
} while (hasUnprocessedLane);
```
hasUnprocessedLane is used so that the first active lane, the one processed in this iteration, exits the loop.
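In IR the loop entry would then be unconditional for all active lanes, with the first-active-lane test inside the loop; a rough sketch (block names and details are illustrative, not from the patch):

```
; Rough sketch: every active lane iterates; each iteration processes only
; the current first active lane, which then leaves the loop.
Loop:
  %id      = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %first   = call i32 @llvm.amdgcn.readfirstlane(i32 %id)
  %isfirst = icmp eq i32 %id, %first
  br i1 %isfirst, label %Work, label %Latch

Work:
  ; ... do the work for this active lane ...
  br label %Exit                         ; i.e. hasUnprocessedLane = false

Latch:
  br label %Loop
```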
```
define protected amdgpu_kernel void @_Z3SumPiS_S_(...) local_unnamed_addr #0 {
entry:
  %Alloca = alloca i32, align 4, addrspace(5)
  %ExclScan = load i32, ptr addrspace(5) %Alloca, align 4
  ....
  ....
  %sub.i = sub nsw i32 0, %11
  %12 = call i64 @llvm.amdgcn.ballot.i64(i1 true)
  %13 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %14 = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %13)
  %15 = call i32 @llvm.amdgcn.readfirstlane(i32 %14)
  %16 = icmp eq i32 %14, %15
  br i1 %16, label %LoopHeader, label %19
```
The branch condition here is broken: it effectively says that only the first active lane will branch to LoopHeader. That's why the sunk instruction got executed only once. In fact, all the active threads should branch to LoopHeader.
The rule is that the branch condition should correctly reflect which threads will take the true successor; we should not break this rule.
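Concretely, the difference is in the branch feeding the loop (sketch):

```
; Broken: only the first active lane reaches the scan loop.
%16 = icmp eq i32 %14, %15
br i1 %16, label %LoopHeader, label %19

; Intended shape (sketch): every active lane enters the scan loop, and the
; first-active-lane selection is applied inside it instead.
br label %LoopHeader
```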
[...]