The new llvm.amdgcn.convergent.copy intrinsic creates a convergent copy.
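For illustration, a minimal sketch of what the declaration of such an intrinsic could look like; the concrete name mangling and attribute set here are assumptions, not taken from the patch:

```
; Sketch only (assumed signature): an identity copy whose convergent
; attribute keeps optimizers from moving or duplicating it across control
; flow that could change the set of active lanes.
declare i32 @llvm.amdgcn.convergent.copy.i32(i32) #0

attributes #0 = { convergent nounwind readnone }
```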
This convergent copy intrinsic acts here as a form of barrier: it makes sure that all active lanes of the VGPR (i.e., the result of the intrinsic) are computed before its use.
Consider the following input LLVM IR:
```
%sub.i = sub nsw i32 0, %11
%12 = atomicrmw add ptr addrspace(1) %1, i32 %sub.i syncscope("agent-one-as") monotonic, align 4
```
Here, I am transforming the computation of %12 into a for loop that is executed by only the first active lane. One of the basic blocks associated with this for loop needs to read values from a VGPR (use) corresponding to %sub.i.
So with the convergent copy I want to make sure that:
- All lanes of a VGPR associated with %sub.i are computed before starting the loop.
- It will also prevent the computation of %sub.i from being sunk into the basic blocks associated with that for loop (see the sketch below).
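For concreteness, a hypothetical sketch of the intended placement (the call shape is assumed, not taken from the patch):

```
; Hypothetical placement: the convergent copy pins %sub.i before the loop,
; so every lane's value exists and the sub cannot be sunk past this point.
%sub.i  = sub nsw i32 0, %11
%pinned = call i32 @llvm.amdgcn.convergent.copy.i32(i32 %sub.i)
br label %LoopHeader
```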
And you want to use @llvm.amdgcn.readlane inside the loop to read arbitrary (probably inactive) lanes from %sub.i?
I'm not sure that this is a robust solution. What about an example like this:
```
%sub.i = sub nsw i32 0, %11
%cc = call @llvm.amdgcn.convergent.copy(%sub.i)
if (divergent condition) {
  some code here
} else {
  ... = call @llvm.amdgcn.readlane(%cc, %n)
}
```
What's to stop "some code here" from clobbering some lanes of the vgpr that was holding %cc?
I don't think this patch solves any real problem, it just raises a bunch of questions about what you're trying to do.
If you want to read values from inactive lanes of a vgpr robustly then you need something like WWM - but I guess you don't trust the WWM implementation, so you're back to square one.
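For reference, the WWM route being alluded to would look roughly like this, using the existing llvm.amdgcn.strict.wwm intrinsic; this is a sketch only, and whether its guarantees fit this use case is exactly what is in question:

```
; Sketch: ask for the value to be computed in whole wave mode, so that
; inactive lanes of the source are defined as well.
%neg = sub nsw i32 0, %11
%wwm = call i32 @llvm.amdgcn.strict.wwm.i32(i32 %neg)
```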
However... why are you doing readlane inside the part of the code that only has a single active lane? You can write a reduction across all active lanes like this without changing EXEC. (This is the unrolled version but you can put it in a loop if you want; that's irrelevant.)
```
s_mov s0, 0              ; initialize accumulator

; conditionally add in lane 0
v_readlane s1, v0, 0
s_bitcmp1 exec, 0
s_cselect s1, s1, 0
s_add s0, s0, s1

; conditionally add in lane 1
v_readlane s1, v0, 1
s_bitcmp1 exec, 1
s_cselect s1, s1, 0
s_add s0, s0, s1

...

; conditionally add in lane 31
v_readlane s1, v0, 31
s_bitcmp1 exec, 31
s_cselect s1, s1, 0
s_add s0, s0, s1

; result is in s0
```
You should be able to generate code like this from regular IR using the readlane intrinsic, which is already marked as Convergent. Once you've done the reduction you can do your atomic operation with only one lane active by generating regular IR like the AtomicOptimizer pass does.
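For example, one unrolled step of that reduction could be written in regular IR along these lines (names are illustrative):

```
; One step of the reduction in plain IR (illustrative sketch).
%exec = call i64 @llvm.amdgcn.ballot.i64(i1 true)  ; current exec mask
%v    = call i32 @llvm.amdgcn.readlane(i32 %val, i32 0)
%bit  = and i64 %exec, 1                           ; is lane 0 active?
%act  = icmp ne i64 %bit, 0
%sel  = select i1 %act, i32 %v, i32 0              ; the s_cselect equivalent
%acc1 = add i32 %acc0, %sel                        ; accumulate into the running sum
```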
FWIW, there is no desire to read from inactive lanes. The loop is only supposed to read from, and write to, lanes that were active before the for loop is executed by a single selected lane.
Then I'm back to not understanding what this convergent copy is for. I'd need to see a more complete example.
As I understand it, if an expression is involved, the computation of the expression sinks into the loop, and that must be prevented.
Hello Jay, I am making sure that I am accessing only active lanes inside the for loop. Consider the input IR to the atomic optimizer:
```
define protected amdgpu_kernel void @_Z3SumPiS_S_(...) local_unnamed_addr #0 {
entry:
  ...
  %11 = load i32, ptr addrspace(1) %arrayidx, align 4, !tbaa !9, !amdgpu.noclobber !6
  %sub.i = sub nsw i32 0, %11
  %12 = atomicrmw add ptr addrspace(1) %1, i32 %sub.i syncscope("agent-one-as") monotonic, align 4
  %arrayidx8 = getelementptr inbounds i32, ptr addrspace(1) %2, i64 %idxprom
  store i32 %12, ptr addrspace(1) %arrayidx8, align 4, !tbaa !9
  ret void
}
```
I am transforming the computation of %sub.i as follows:
```
VAL = ...  // VGPR
RES = ...  // FINAL result of scan, active lanes will write to this VGPR
sum = 0;   // SGPR, holds the partial sum
for (int lane = 0; lane < 64; lane++) {
  if (IsActive(lane)) {                  // check whether the lane is active or not
    elementToAdd = readlane(VAL, lane);  // SGPR, read the value we want to add from VAL at lane id
    sum = sum + elementToAdd;            // SGPR, update the value
    writelane(RES, sum, lane);           // write the value sum (SGPR) to VGPR RES at lane
  }
}
```
which will be transformed into the following LLVM IR based on the above design:
```
define protected amdgpu_kernel void @_Z3SumPiS_S_(...) local_unnamed_addr #0 {
entry:
  %Alloca = alloca i32, align 4, addrspace(5)
  %ExclScan = load i32, ptr addrspace(5) %Alloca, align 4
  ....
  ....
  %sub.i = sub nsw i32 0, %11
  %12 = call i64 @llvm.amdgcn.ballot.i64(i1 true)
  %13 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %14 = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %13)
  %15 = call i32 @llvm.amdgcn.readfirstlane(i32 %14)
  %16 = icmp eq i32 %14, %15
  br i1 %16, label %LoopHeader, label %19

17:                                               ; preds = %ForElse
  %18 = atomicrmw add ptr addrspace(1) %1, i32 %29 syncscope("agent-one-as") monotonic, align 4
  br label %19

19:                                               ; preds = %entry, %17
  %20 = phi i32 [ poison, %entry ], [ %18, %17 ]
  %21 = call i32 @llvm.amdgcn.readfirstlane(i32 %20)
  %22 = add i32 %21, %30
  %arrayidx8 = getelementptr inbounds i32, ptr addrspace(1) %2, i64 %idxprom
  store i32 %22, ptr addrspace(1) %arrayidx8, align 4, !tbaa !9
  ret void

LoopHeader:                                       ; preds = %entry
  br label %Loop

Loop:                                             ; preds = %ForElse, %LoopHeader
  %lane = phi i32 [ 0, %LoopHeader ], [ %nextlane, %ForElse ]
  %iSum = phi i32 [ 0, %LoopHeader ], [ %iSum14, %ForElse ]
  %scan = phi i32 [ %ExclScan, %LoopHeader ], [ %30, %ForElse ]
  %iVal = phi i32 [ %sub.i, %LoopHeader ], [ %28, %ForElse ]
  %23 = zext i32 %lane to i64
  %24 = shl i64 1, %23
  %25 = and i64 %12, %24
  %26 = icmp ne i64 %25, 0
  br i1 %26, label %ForIf, label %ForElse

ForIf:                                            ; preds = %Loop
  %27 = call i32 @llvm.amdgcn.readlane(i32 %14, i32 %lane)
  %28 = call i32 @llvm.amdgcn.readlane(i32 %iVal, i32 %27)
  %29 = add i32 %iSum, %28
  %30 = call i32 @llvm.amdgcn.writelane(i32 %29, i32 %27, i32 %scan)
  br label %ForElse

ForElse:                                          ; preds = %ForIf, %Loop
  %iSum14 = phi i32 [ %29, %ForIf ], [ %iSum, %Loop ]
  %iVal15 = phi i32 [ %28, %ForIf ], [ %iVal, %Loop ]
  %nextlane = add i32 1, %lane
  %31 = icmp eq i32 %nextlane, 64
  br i1 %31, label %17, label %Loop
}
```
Here, the entry basic block branches out to the scan loop, which consists of LoopHeader, Loop, ForIf, ForElse, and 17, and is responsible for precomputing the partial sums for each active lane as well as the final reduced value (which is updated only by BB 17). I am making sure that readlane and writelane operate only on active lanes.
Now, coming to the problem I am facing: the machine code sinking optimization sinks the computation of %sub.i from the entry basic block into LoopHeader, which is executed by only one lane. This leads to incorrect computations.
```
_Z3SumPiS_S_:                           ; @_Z3SumPiS_S_
; %bb.0:                                ; %entry
        ...
        global_load_dword v5, v[4:5], off
        s_load_dwordx4 s[4:7], s[6:7], 0x0
        v_mbcnt_lo_u32_b32 v3, -1, 0
        v_mbcnt_hi_u32_b32 v3, -1, v3
        v_readfirstlane_b32 s8, v3
        s_mov_b32 s13, 0
        v_cmp_eq_u32_e32 vcc, s8, v3
        ; implicit-def: $vgpr4
        s_and_saveexec_b64 s[8:9], vcc
        s_cbranch_execz .LBB0_6
; %bb.1:                                ; %LoopHeader
        ** s_waitcnt vmcnt(0)
        v_sub_u32_e32 v4, 0, v5 **
        s_mov_b32 s15, 0
        s_mov_b32 s14, 0
        ; implicit-def: $sgpr16
        s_lshr_b64 s[18:19], s[10:11], s15
        s_and_b32 s12, s18, 1
        s_cmp_eq_u64 s[12:13], 0
        s_cbranch_scc1 .LBB0_3
.LBB0_2:                                ; %ForIf
        v_readlane_b32 s12, v3, s15
        s_mov_b32 m0, s12
        s_nop 2
        v_readlane_b32 s16, v4, s12
        s_add_i32 s14, s14, s16
        v_writelane_b32 v2, s14, m0
```
You can see that after sinking the computation of v4 into %bb.1 (the two instructions marked with **), v4 only holds a valid value for the first active lane. This v4 is read later in .LBB0_2, which is not the desired behavior in this case.
So with this convergent copy I want to prevent this sinking and make sure that %sub.i (i.e., v4 here) is fully computed before the loop starts. Please let me know if there is a better way to achieve this. I will discuss your concerns with Matt and Brian.
I am making sure that I am accessing only active lanes inside a for loop.
No you're not. In your example, all of LoopHeader, Loop, ForIf, ForElse run inside the part that only enables a single lane. (Why?) But they access arbitrary lanes of %sub.i.
Hello Jay, can I extend this assembly sequence as follows to store the intermediate scan results for each active lane using writelane? There will be cases where we need these intermediate scan results later in the kernel.
```
s_mov s0, 0              ; initialize accumulator
v_mov v1, 0              ; initialize a vgpr to store scan results of each active lane

; conditionally add in lane 0
v_readlane s1, v0, 0
s_bitcmp1 exec, 0
s_cselect s1, s1, 0
s_add s0, s0, s1
v_writelane v1, s0, 0

; conditionally add in lane 1
v_readlane s1, v0, 1
s_bitcmp1 exec, 1
s_cselect s1, s1, 0
s_add s0, s0, s1
v_writelane v1, s0, 1

...

; conditionally add in lane 31
v_readlane s1, v0, 31
s_bitcmp1 exec, 31
s_cselect s1, s1, 0
s_add s0, s0, s1
v_writelane v1, s0, 31
```
Actually on second thoughts, that will write a value to inactive lanes of v1, so it seems dangerous. I think you might have to use a conditional branch to skip the writelane for inactive lanes.
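In IR terms that guarded shape could look like this (an illustrative sketch; the ballot bit test is uniform, so the branch is a scalar branch):

```
Check0:                                  ; test the ballot bit for lane 0
  %bit = and i64 %exec, 1
  %act = icmp ne i64 %bit, 0
  br i1 %act, label %Write0, label %Skip0

Write0:                                  ; executed only if lane 0 is active
  %r0 = call i32 @llvm.amdgcn.writelane(i32 %sum, i32 0, i32 %res)
  br label %Skip0

Skip0:
  %res0 = phi i32 [ %r0, %Write0 ], [ %res, %Check0 ]
```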
```
VAL = ...  // VGPR
RES = ...  // FINAL result of scan, active lanes will write to this VGPR
sum = 0;   // SGPR, holds the partial sum
for (int lane = 0; lane < 64; lane++) {
  if (IsActive(lane)) {                  // check whether the lane is active or not
    elementToAdd = readlane(VAL, lane);  // SGPR, read the value we want to add from VAL at lane id
    sum = sum + elementToAdd;            // SGPR, update the value
    writelane(RES, sum, lane);           // write the value sum (SGPR) to VGPR RES at lane
  }
}
```
The idea here is a dangerous way to program our GPU. Please check the comment below to see why we should not do this.
A possible safe way is to do something like:
```
// all the active threads should enter the loop.
bool hasUnprocessedLane = true;
do {
  get_first_active_lane();
  if (is_first_active_lane) {
    // only the first active lane will go here; other threads skip to the loop end.
    // ... do the work for this active lane ...
    hasUnprocessedLane = false;
  }
} while (hasUnprocessedLane);
```
hasUnprocessedLane is used so that the first active lane, the one processed in this iteration, exits the loop.
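In IR the loop entry would then be unconditional for all active lanes, with the first-active-lane test inside the loop; a rough sketch (block names and details are illustrative, not from the patch):

```
; Rough sketch: every active lane iterates; each iteration processes only
; the current first active lane, which then leaves the loop.
Loop:
  %id      = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %first   = call i32 @llvm.amdgcn.readfirstlane(i32 %id)
  %isfirst = icmp eq i32 %id, %first
  br i1 %isfirst, label %Work, label %Latch

Work:
  ; ... do the work for this active lane ...
  br label %Exit                         ; i.e. hasUnprocessedLane = false

Latch:
  br label %Loop
```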
```
define protected amdgpu_kernel void @_Z3SumPiS_S_(...) local_unnamed_addr #0 {
entry:
  %Alloca = alloca i32, align 4, addrspace(5)
  %ExclScan = load i32, ptr addrspace(5) %Alloca, align 4
  ....
  ....
  %sub.i = sub nsw i32 0, %11
  %12 = call i64 @llvm.amdgcn.ballot.i64(i1 true)
  %13 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %14 = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %13)
  %15 = call i32 @llvm.amdgcn.readfirstlane(i32 %14)
  %16 = icmp eq i32 %14, %15
  br i1 %16, label %LoopHeader, label %19
```
The branch condition here is broken: it effectively says that only the first active lane will branch to LoopHeader. That's why the sunk instruction got executed only once. In fact, all the active threads should branch to LoopHeader.
The rule is that the branch condition should correctly reflect which threads will take the true successor; we should not break this rule.
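Concretely, the difference is in the branch feeding the loop (sketch):

```
; Broken: only the first active lane reaches the scan loop.
%16 = icmp eq i32 %14, %15
br i1 %16, label %LoopHeader, label %19

; Intended shape (sketch): every active lane enters the scan loop, and the
; first-active-lane selection is applied inside it instead.
br label %LoopHeader
```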
[...]