This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU]: Add new intrinsic llvm.amdgcn.convergent.copy
AbandonedPublic

Authored by pravinjagtap on Mar 21 2023, 6:25 AM.

Details

Reviewers
arsenm
b-sumner
foad
Group Reviewers
Restricted Project
Summary

The new llvm.amdgcn.convergent.copy intrinsic creates a convergent copy.

Diff Detail

Event Timeline

pravinjagtap created this revision.Mar 21 2023, 6:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 21 2023, 6:25 AM
pravinjagtap requested review of this revision.Mar 21 2023, 6:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 21 2023, 6:26 AM

Clang format

foad added a reviewer: Restricted Project.Mar 21 2023, 7:42 AM

What does this do and what is it for?

What does this do and what is it for?

This convergent copy intrinsic acts as a form of barrier which makes sure that all the active lanes of the VGPR (i.e. the result of the intrinsic) are computed before its use.

foad added a comment.Mar 21 2023, 8:44 AM

What does this do and what is it for?

This convergent copy intrinsic acts as a form of barrier which makes sure that all the active lanes of the VGPR (i.e. the result of the intrinsic) are computed before its use.

Can you give an example?

pravinjagtap added a comment.EditedMar 21 2023, 9:43 AM

What does this do and what is it for?

This convergent copy intrinsic acts as a form of barrier which makes sure that all the active lanes of the VGPR (i.e. the result of the intrinsic) are computed before its use.

Can you give an example?

Consider the following input LLVM IR:

%sub.i = sub nsw i32 0, %11
%12 = atomicrmw add ptr addrspace(1) %1, i32 %sub.i syncscope("agent-one-as") monotonic, align 4

Here, I am transforming the computation of %12 into a for loop which will be executed by only the first active lane. One of the basic blocks associated with this for loop needs to read values from a VGPR (a use) corresponding to %sub.i.

So with the convergent copy I want to make sure that:

  1. All lanes of the VGPR associated with %sub.i are computed before the loop starts.
  2. The computation of %sub.i is not sunk into the basic blocks associated with that for loop.

What does this do and what is it for?

This convergent copy intrinsic acts as a form of barrier which makes sure that all the active lanes of the VGPR (i.e. the result of the intrinsic) are computed before its use.

Can you give an example?

Consider the following input LLVM IR:

%sub.i = sub nsw i32 0, %11
%12 = atomicrmw add ptr addrspace(1) %1, i32 %sub.i syncscope("agent-one-as") monotonic, align 4

Here, I am transforming the computation of %12 into a for loop which will be executed by only the first active lane. One of the basic blocks associated with this for loop needs to read values from a VGPR (a use) corresponding to %sub.i.

So with the convergent copy I want to make sure that:

  1. All lanes of the VGPR associated with %sub.i are computed before the loop starts.
  2. The computation of %sub.i is not sunk into the basic blocks associated with that for loop.

And you want to use @llvm.amdgcn.readlane inside the loop to read arbitrary (probably inactive) lanes from %sub.i?

I'm not sure that this is a robust solution. What about an example like this:

%sub.i = sub nsw i32 0, %11
%cc = call @llvm.amdgcn.convergent.copy(%sub.i)
if (divergent condition) {
  some code here
} else {
  ... = call @llvm.amdgcn.readlane(%cc, %n)
}

What's to stop "some code here" from clobbering some lanes of the vgpr that was holding %cc?

Adding @ruiling @cdevadas

foad requested changes to this revision.Mar 21 2023, 12:49 PM

I don't think this patch solves any real problem, it just raises a bunch of questions about what you're trying to do.

If you want to read values from inactive lanes of a vgpr robustly then you need something like WWM - but I guess you don't trust the WWM implementation, so you're back to square one.

However... why are you doing readlane inside the part of the code that only has a single active lane? You can write a reduction across all active lanes like this without changing EXEC. (This is the unrolled version but you can put it in a loop if you want; that's irrelevant.)

s_mov s0, 0 ; initialize accumulator
; conditionally add in lane 0
v_readlane s1, v0, 0
s_bitcmp1 exec, 0
s_cselect s1, s1, 0
s_add s0, s0, s1
; conditionally add in lane 1
v_readlane s1, v0, 1
s_bitcmp1 exec, 1
s_cselect s1, s1, 0
s_add s0, s0, s1
...
; conditionally add in lane 31
v_readlane s1, v0, 31
s_bitcmp1 exec, 31
s_cselect s1, s1, 0
s_add s0, s0, s1
; result is in s0

You should be able to generate code like this from regular IR using the readlane intrinsic, which is already marked as Convergent. Once you've done the reduction you can do your atomic operation with only one lane active by generating regular IR like the AtomicOptimizer pass does.
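The exec-masked reduction sketched above can be modeled on the host; the following Python sketch mirrors the readlane/bitcmp/cselect/add sequence (lane count, exec mask, and values are illustrative assumptions, not generated code):

```python
def masked_reduce(vgpr, exec_mask, num_lanes=32):
    """Model of the unrolled reduction: each lane's value is added to
    the SGPR accumulator only if that lane's EXEC bit is set."""
    s0 = 0  # accumulator (s0)
    for lane in range(num_lanes):
        s1 = vgpr[lane]                   # v_readlane s1, v0, lane
        active = (exec_mask >> lane) & 1  # s_bitcmp1 exec, lane
        s1 = s1 if active else 0          # s_cselect s1, s1, 0
        s0 += s1                          # s_add s0, s0, s1
    return s0

# Lanes 0 and 2 active: only their values contribute.
print(masked_reduce([5, 7, 9, 11], 0b0101, num_lanes=4))  # 14
```

Because every lane executes the same straight-line code, EXEC never needs to change; inactive lanes simply contribute zero.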

This revision now requires changes to proceed.Mar 21 2023, 12:49 PM

FWIW, there is no desire to read from inactive lanes. The loop is supposed to read from, and write to, only lanes that were active before the for loop is executed by a single selected lane.

foad added a comment.Mar 21 2023, 1:19 PM

FWIW, there is no desire to read from inactive lanes. The loop is supposed to read from, and write to, only lanes that were active before the for loop is executed by a single selected lane.

Then I'm back to not understanding what this convergent copy is for. I'd need to see a more complete example.

FWIW, there is no desire to read from inactive lanes. The loop is supposed to read from, and write to, only lanes that were active before the for loop is executed by a single selected lane.

Then I'm back to not understanding what this convergent copy is for. I'd need to see a more complete example.

As I understand it, if an expression is involved, the computation of the expression is sunk into the loop, and that must be prevented.

FWIW, there is no desire to read from inactive lanes. The loop is supposed to read from, and write to, only lanes that were active before the for loop is executed by a single selected lane.

Then I'm back to not understanding what this convergent copy is for. I'd need to see a more complete example.

Hello Jay, I am making sure that I am accessing only active lanes inside the for loop. Consider the input IR to the atomic optimizer:

define protected amdgpu_kernel void @_Z3SumPiS_S_(...) local_unnamed_addr #0 {
entry:
   ...
  %11 = load i32, ptr addrspace(1) %arrayidx, align 4, !tbaa !9, !amdgpu.noclobber !6
  %sub.i = sub nsw i32 0, %11
  %12 = atomicrmw add ptr addrspace(1) %1, i32 %sub.i syncscope("agent-one-as") monotonic, align 4
  %arrayidx8 = getelementptr inbounds i32, ptr addrspace(1) %2, i64 %idxprom
  store i32 %12, ptr addrspace(1) %arrayidx8, align 4, !tbaa !9
  ret void
}

I am transforming the computation of %12 as follows:

VAL = ...  // VGPR holding the per-lane values
RES = ...  // final result of the scan; active lanes will write to this VGPR
sum = 0;   // SGPR, holds the partial sum
for (int lane = 0; lane < 64; lane++) {
    if (IsActive(lane)) {                    // check whether this lane is active
        elementToAdd = readlane(VAL, lane);  // SGPR, read the value to add from VAL at this lane
        sum = sum + elementToAdd;            // SGPR, update the partial sum
        writelane(RES, sum, lane);           // write sum (SGPR) to VGPR RES at this lane
    }
}

which will be transformed into the following LLVM IR based on the above design:

define protected amdgpu_kernel void @_Z3SumPiS_S_(...) local_unnamed_addr #0 {
entry:
  %Alloca = alloca i32, align 4, addrspace(5)
  %ExclScan = load i32, ptr addrspace(5) %Alloca, align 4
  ....
  ....
  %sub.i = sub nsw i32 0, %11
  %12 = call i64 @llvm.amdgcn.ballot.i64(i1 true)
  %13 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %14 = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %13)
  %15 = call i32 @llvm.amdgcn.readfirstlane(i32 %14)
  %16 = icmp eq i32 %14, %15
  br i1 %16, label %LoopHeader, label %19

17:                                               ; preds = %ForElse
  %18 = atomicrmw add ptr addrspace(1) %1, i32 %29 syncscope("agent-one-as") monotonic, align 4
  br label %19

19:                                               ; preds = %entry, %17
  %20 = phi i32 [ poison, %entry ], [ %18, %17 ]
  %21 = call i32 @llvm.amdgcn.readfirstlane(i32 %20)
  %22 = add i32 %21, %30
  %arrayidx8 = getelementptr inbounds i32, ptr addrspace(1) %2, i64 %idxprom
  store i32 %22, ptr addrspace(1) %arrayidx8, align 4, !tbaa !9
  ret void

LoopHeader:                                       ; preds = %entry
  br label %Loop

Loop:                                             ; preds = %ForElse, %LoopHeader
  %lane = phi i32 [ 0, %LoopHeader ], [ %nextlane, %ForElse ]
  %iSum = phi i32 [ 0, %LoopHeader ], [ %iSum14, %ForElse ]
  %scan = phi i32 [ %ExclScan, %LoopHeader ], [ %30, %ForElse ]
  %iVal = phi i32 [ %sub.i, %LoopHeader ], [ %28, %ForElse ]
  %23 = zext i32 %lane to i64
  %24 = shl i64 1, %23
  %25 = and i64 %12, %24
  %26 = icmp ne i64 %25, 0
  br i1 %26, label %ForIf, label %ForElse

ForIf:                                            ; preds = %Loop
  %27 = call i32 @llvm.amdgcn.readlane(i32 %14, i32 %lane)
  %28 = call i32 @llvm.amdgcn.readlane(i32 %iVal, i32 %27)
  %29 = add i32 %iSum, %28
  %30 = call i32 @llvm.amdgcn.writelane(i32 %29, i32 %27, i32 %scan)
  br label %ForElse

ForElse:                                          ; preds = %ForIf, %Loop
  %iSum14 = phi i32 [ %29, %ForIf ], [ %iSum, %Loop ]
  %iVal15 = phi i32 [ %28, %ForIf ], [ %iVal, %Loop ]
  %nextlane = add i32 1, %lane
  %31 = icmp eq i32 %nextlane, 64
  br i1 %31, label %17, label %Loop
}

Here, the entry basic block branches to the scan loop, consisting of LoopHeader, Loop, ForIf, ForElse, and 17, which is responsible for precomputing the partial sums for each active lane as well as the final reduced value (updated only by BB 17). I am making sure that readlane and writelane work on only active lanes.
Now, coming to the problem I am facing: the machine-instruction sinking optimization is sinking the computation of %sub.i from the entry basic block into LoopHeader, which is executed by only one lane. This leads to incorrect computations.

_Z3SumPiS_S_:                           ; @_Z3SumPiS_S_
; %bb.0:                                ; %entry
        ... 
	global_load_dword v5, v[4:5], off
	s_load_dwordx4 s[4:7], s[6:7], 0x0
	v_mbcnt_lo_u32_b32 v3, -1, 0
	v_mbcnt_hi_u32_b32 v3, -1, v3
	v_readfirstlane_b32 s8, v3
	s_mov_b32 s13, 0
	v_cmp_eq_u32_e32 vcc, s8, v3
                                        ; implicit-def: $vgpr4
	s_and_saveexec_b64 s[8:9], vcc
	s_cbranch_execz .LBB0_6
; %bb.1:                                ; %LoopHeader
	s_waitcnt vmcnt(0)
	v_sub_u32_e32 v4, 0, v5         ; sunk computation of %sub.i
	s_mov_b32 s15, 0
	s_mov_b32 s14, 0
                                        ; implicit-def: $sgpr16
	s_lshr_b64 s[18:19], s[10:11], s15
	s_and_b32 s12, s18, 1
	s_cmp_eq_u64 s[12:13], 0
	s_cbranch_scc1 .LBB0_3
.LBB0_2:                                ; %ForIf
	v_readlane_b32 s12, v3, s15
	s_mov_b32 m0, s12
	s_nop 2
	v_readlane_b32 s16, v4, s12
	s_add_i32 s14, s14, s16
	v_writelane_b32 v2, s14, m0

You can notice that after sinking the computation of v4 into %bb.1, v4 will be computed for only the first active lane. This v4 is read later in .LBB0_2, which is not the desired behavior in this case.
So with this convergent copy I want to prevent this sinking and make sure that %sub.i, i.e. v4 here, is fully computed before the loop starts. Please let me know if there is a better way to achieve this. I will discuss your concerns with Matt and Brian.
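For reference, the intended semantics of the scan loop (before any sinking) can be modeled on the host. This is a sketch mirroring the readlane/writelane pseudocode with an illustrative ballot mask, not the generated code:

```python
def active_lane_scan(val, ballot, num_lanes=64):
    """Inclusive prefix-sum over active lanes only; inactive lanes of
    RES are never written."""
    res = [None] * num_lanes  # RES vgpr; None marks never-written lanes
    total = 0                 # sum, held in an SGPR
    for lane in range(num_lanes):
        if (ballot >> lane) & 1:   # IsActive(lane)
            total += val[lane]     # readlane(VAL, lane) + add
            res[lane] = total      # writelane(RES, sum, lane)
    return res, total

# Lanes 0, 1, and 3 active.
res, total = active_lane_scan([1, 2, 3, 4], 0b1011, num_lanes=4)
print(res, total)  # [1, 3, None, 7] 7
```

The model makes the correctness requirement explicit: every active lane's entry in VAL must already hold its final value when the scan starts, which is exactly what the sinking of %sub.i breaks.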

foad added a comment.Mar 22 2023, 12:12 AM

I am making sure that I am accessing only active lanes inside a for loop.

No, you're not. In your example, all of LoopHeader, Loop, ForIf, and ForElse run inside the part that enables only a single lane. (Why?) But they access arbitrary lanes of %sub.i.

I don't think this patch solves any real problem, it just raises a bunch of questions about what you're trying to do.

If you want to read values from inactive lanes of a vgpr robustly then you need something like WWM - but I guess you don't trust the WWM implementation, so you're back to square one.

However... why are you doing readlane inside the part of the code that only has a single active lane? You can write a reduction across all active lanes like this without changing EXEC. (This is the unrolled version but you can put it in a loop if you want; that's irrelevant.)

s_mov s0, 0 ; initialize accumulator
; conditionally add in lane 0
v_readlane s1, v0, 0
s_bitcmp1 exec, 0
s_cselect s1, s1, 0
s_add s0, s0, s1
; conditionally add in lane 1
v_readlane s1, v0, 1
s_bitcmp1 exec, 1
s_cselect s1, s1, 0
s_add s0, s0, s1
...
; conditionally add in lane 31
v_readlane s1, v0, 31
s_bitcmp1 exec, 31
s_cselect s1, s1, 0
s_add s0, s0, s1
; result is in s0

You should be able to generate code like this from regular IR using the readlane intrinsic, which is already marked as Convergent. Once you've done the reduction you can do your atomic operation with only one lane active by generating regular IR like the AtomicOptimizer pass does.

Hello Jay, can I extend this assembly sequence as follows to store the intermediate scan result for each active lane using writelane? There are cases where we may need these intermediate scan results of each active lane later in the kernel.

s_mov s0, 0 ; initialize accumulator
v_mov v1, 0 ; initialize a vgpr to store scan results of each active lane
; conditionally add in lane 0
v_readlane s1, v0, 0
s_bitcmp1 exec, 0
s_cselect s1, s1, 0
s_add s0, s0, s1
v_writelane v1, s0, 0
; conditionally add in lane 1
v_readlane s1, v0, 1
s_bitcmp1 exec, 1
s_cselect s1, s1, 0
s_add s0, s0, s1
v_writelane v1, s0, 1
...
; conditionally add in lane 31
v_readlane s1, v0, 31
s_bitcmp1 exec, 31
s_cselect s1, s1, 0
s_add s0, s0, s1
v_writelane v1, s0, 31
foad added a comment.Mar 22 2023, 2:37 AM

I don't think this patch solves any real problem, it just raises a bunch of questions about what you're trying to do.

If you want to read values from inactive lanes of a vgpr robustly then you need something like WWM - but I guess you don't trust the WWM implementation, so you're back to square one.

However... why are you doing readlane inside the part of the code that only has a single active lane? You can write a reduction across all active lanes like this without changing EXEC. (This is the unrolled version but you can put it in a loop if you want; that's irrelevant.)

s_mov s0, 0 ; initialize accumulator
; conditionally add in lane 0
v_readlane s1, v0, 0
s_bitcmp1 exec, 0
s_cselect s1, s1, 0
s_add s0, s0, s1
; conditionally add in lane 1
v_readlane s1, v0, 1
s_bitcmp1 exec, 1
s_cselect s1, s1, 0
s_add s0, s0, s1
...
; conditionally add in lane 31
v_readlane s1, v0, 31
s_bitcmp1 exec, 31
s_cselect s1, s1, 0
s_add s0, s0, s1
; result is in s0

You should be able to generate code like this from regular IR using the readlane intrinsic, which is already marked as Convergent. Once you've done the reduction you can do your atomic operation with only one lane active by generating regular IR like the AtomicOptimizer pass does.

Hello Jay, can I extend this assembly sequence as follows to store the intermediate scan result for each active lane using writelane? There are cases where we may need these intermediate scan results of each active lane later in the kernel.

s_mov s0, 0 ; initialize accumulator
v_mov v1, 0 ; initialize a vgpr to store scan results of each active lane
; conditionally add in lane 0
v_readlane s1, v0, 0
s_bitcmp1 exec, 0
s_cselect s1, s1, 0
s_add s0, s0, s1
v_writelane v1, s0, 0
; conditionally add in lane 1
v_readlane s1, v0, 1
s_bitcmp1 exec, 1
s_cselect s1, s1, 0
s_add s0, s0, s1
v_writelane v1, s0, 1
...
; conditionally add in lane 31
v_readlane s1, v0, 31
s_bitcmp1 exec, 31
s_cselect s1, s1, 0
s_add s0, s0, s1
v_writelane v1, s0, 31

Yes, that seems reasonable to me.

foad added a comment.EditedMar 22 2023, 6:12 AM

I don't think this patch solves any real problem, it just raises a bunch of questions about what you're trying to do.

If you want to read values from inactive lanes of a vgpr robustly then you need something like WWM - but I guess you don't trust the WWM implementation, so you're back to square one.

However... why are you doing readlane inside the part of the code that only has a single active lane? You can write a reduction across all active lanes like this without changing EXEC. (This is the unrolled version but you can put it in a loop if you want; that's irrelevant.)

s_mov s0, 0 ; initialize accumulator
; conditionally add in lane 0
v_readlane s1, v0, 0
s_bitcmp1 exec, 0
s_cselect s1, s1, 0
s_add s0, s0, s1
; conditionally add in lane 1
v_readlane s1, v0, 1
s_bitcmp1 exec, 1
s_cselect s1, s1, 0
s_add s0, s0, s1
...
; conditionally add in lane 31
v_readlane s1, v0, 31
s_bitcmp1 exec, 31
s_cselect s1, s1, 0
s_add s0, s0, s1
; result is in s0

You should be able to generate code like this from regular IR using the readlane intrinsic, which is already marked as Convergent. Once you've done the reduction you can do your atomic operation with only one lane active by generating regular IR like the AtomicOptimizer pass does.

Hello Jay, can I extend this assembly sequence as follows to store the intermediate scan result for each active lane using writelane? There are cases where we may need these intermediate scan results of each active lane later in the kernel.

s_mov s0, 0 ; initialize accumulator
v_mov v1, 0 ; initialize a vgpr to store scan results of each active lane
; conditionally add in lane 0
v_readlane s1, v0, 0
s_bitcmp1 exec, 0
s_cselect s1, s1, 0
s_add s0, s0, s1
v_writelane v1, s0, 0
; conditionally add in lane 1
v_readlane s1, v0, 1
s_bitcmp1 exec, 1
s_cselect s1, s1, 0
s_add s0, s0, s1
v_writelane v1, s0, 1
...
; conditionally add in lane 31
v_readlane s1, v0, 31
s_bitcmp1 exec, 31
s_cselect s1, s1, 0
s_add s0, s0, s1
v_writelane v1, s0, 31

Yes, that seems reasonable to me.

Actually on second thoughts, that will write a value to inactive lanes of v1, so it seems dangerous. I think you might have to use a conditional branch to skip the writelane for inactive lanes.

It might be better to loop over only active lanes, found using ctz or clz builtins.
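That ctz-driven iteration over only the active lanes could look like this on the host (a Python sketch under assumed names; the values and mask are hypothetical):

```python
def scan_active_lanes(val, exec_mask):
    """Inclusive prefix-sum visiting only the set bits of the exec
    mask, lowest first, via count-trailing-zeros, so writelane is
    never applied to an inactive lane."""
    res = {}      # lane -> running sum; inactive lanes are untouched
    total = 0
    remaining = exec_mask
    while remaining:
        lane = (remaining & -remaining).bit_length() - 1  # ctz(remaining)
        total += val[lane]      # readlane + add
        res[lane] = total       # writelane, active lanes only
        remaining &= remaining - 1  # clear the processed bit
    return res, total

# Lanes 1 and 3 active.
print(scan_active_lanes([10, 20, 30, 40], 0b1010))  # ({1: 20, 3: 60}, 60)
```

Unlike the unconditional writelane version, this never touches a lane whose EXEC bit is clear, which addresses the clobbering concern above.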

VAL = ...  // VGPR holding the per-lane values
RES = ...  // final result of the scan; active lanes will write to this VGPR
sum = 0;   // SGPR, holds the partial sum
for (int lane = 0; lane < 64; lane++) {
    if (IsActive(lane)) {                    // check whether this lane is active
        elementToAdd = readlane(VAL, lane);  // SGPR, read the value to add from VAL at this lane
        sum = sum + elementToAdd;            // SGPR, update the partial sum
        writelane(RES, sum, lane);           // write sum (SGPR) to VGPR RES at this lane
    }
}

Programming our GPU this way is dangerous. Please check the comment below to see why we should not do this.
A possible safe way is to do something like:

// All the active threads should enter the loop.
bool hasUnprocessedLane = true;
do {
  if (is_first_active_lane()) {
    // Only the first active lane gets here; the other threads skip to the loop end.
    // ... do the work for this active lane ...
    hasUnprocessedLane = false;
  }
} while (hasUnprocessedLane);

hasUnprocessedLane is used so that the first active lane, once it has been processed in this iteration, exits the loop.
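This serialization pattern can be modeled on the host; the following Python sketch simulates the loop's effect (the exec mask and the recorded visit order are illustrative assumptions):

```python
def serialize_over_active_lanes(exec_mask, work):
    """Model of the do/while pattern: every active lane stays in the
    loop; each iteration the first still-unprocessed active lane does
    its work and drops out (hasUnprocessedLane = false), so the body
    runs exactly once per active lane, in lane order."""
    order = []
    unprocessed = exec_mask
    while unprocessed:
        # First active lane = lowest set bit of the remaining mask.
        first = (unprocessed & -unprocessed).bit_length() - 1
        work(first)                    # only this lane does the work
        order.append(first)
        unprocessed &= unprocessed - 1  # this lane exits the loop
    return order

done = []
serialize_over_active_lanes(0b0110, done.append)
print(done)  # [1, 2]
```

The key property, as noted above, is that all active threads enter the loop, so the branch condition correctly reflects which threads take each successor.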

define protected amdgpu_kernel void @_Z3SumPiS_S_(...) local_unnamed_addr #0 {
entry:

%Alloca = alloca i32, align 4, addrspace(5)
%ExclScan = load i32, ptr addrspace(5) %Alloca, align 4
....
....
%sub.i = sub nsw i32 0, %11
%12 = call i64 @llvm.amdgcn.ballot.i64(i1 true)
%13 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
%14 = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %13)
%15 = call i32 @llvm.amdgcn.readfirstlane(i32 %14)
%16 = icmp eq i32 %14, %15
br i1 %16, label %LoopHeader, label %19

The branching condition here is broken: it says that only the first active lane will branch to LoopHeader. That is why the sunk instruction is executed only once. In fact, all the active threads should branch to LoopHeader.
The rule is that the branch condition should correctly reflect which threads take the true successor; we should not break this rule.
[...]

Is this still relevant?

pravinjagtap abandoned this revision.Jun 22 2023, 7:29 AM