Details

Reviewers

rampitec
t-tye
kzhuravl
arsenm
tstellar

Commits

rG74da350b850e: AMDGPU : Fix common dominator of two incoming blocks terminates with uniform…
rL300142: AMDGPU : Fix common dominator of two incoming blocks terminates with uniform…

Summary

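We don't need to fix the PHI if the common dominator of the two incoming blocks terminates with a uniform branch. However, the current condition is not strong enough to identify two blocks with a uniform branch, which causes a loop iteration order regression. The dump below shows the PHI (MI), its two incoming blocks (MBB0 and MBB1), and their nearest common dominator (NCD).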

MI:
  %vreg5<def> = PHI %vreg125, <BB#1>, %vreg12, <BB#3>; SReg_64:%vreg5,%vreg125,%vreg12
  
MBB0:
BB#1: derived from LLVM BB %for.body.preheader
    Predecessors according to CFG: BB#0
	%vreg126<def> = S_MOV_B32 0; SReg_32_XM0:%vreg126
	%vreg125<def> = S_MOV_B64 0; SReg_64:%vreg125
	S_BRANCH <BB#3>
    Successors according to CFG: BB#3(?%)

MBB1:
BB#3: derived from LLVM BB %for.body
    Predecessors according to CFG: BB#1 BB#3
	%vreg5<def> = PHI %vreg125, <BB#1>, %vreg12, <BB#3>; SReg_64:%vreg5,%vreg125,%vreg12
	%vreg6<def> = PHI %vreg1, <BB#1>, %vreg9, <BB#3>; VGPR_32:%vreg6,%vreg1,%vreg9
	%vreg7<def> = PHI %vreg126, <BB#1>, %vreg11, <BB#3>; SReg_32_XM0:%vreg7,%vreg126,%vreg11
	%vreg8<def> = PHI %vreg2, <BB#1>, %vreg10, <BB#3>; VGPR_32:%vreg8,%vreg2,%vreg10
	%vreg127<def,tied6> = V_MAC_F32_e64 0, %vreg6, 0, %vreg6, 0, %vreg1<tied0>, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg127,%vreg6,%vreg6,%vreg1
	%vreg9<def> = V_MAD_F32 1, %vreg8, 0, %vreg8, 0, %vreg127<kill>, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg9,%vreg8,%vreg8,%vreg127
	%vreg128<def> = V_ADD_F32_e64 0, %vreg6, 0, %vreg6, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg128,%vreg6,%vreg6
	%vreg10<def,tied6> = V_MAC_F32_e64 0, %vreg128<kill>, 0, %vreg8, 0, %vreg2<tied0>, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg10,%vreg128,%vreg8,%vreg2
	%vreg129<def> = S_MOV_B32 1; SReg_32_XM0:%vreg129
	%vreg11<def> = S_ADD_I32 %vreg7, %vreg129<kill>, %SCC<imp-def,dead>; SReg_32_XM0:%vreg11,%vreg7,%vreg129
	%vreg130<def> = V_MUL_F32_e64 0, %vreg10, 0, %vreg10, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg130,%vreg10,%vreg10
	%vreg131<def,tied6> = V_MAC_F32_e64 0, %vreg9, 0, %vreg9, 0, %vreg130<tied0>, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg131,%vreg9,%vreg9,%vreg130
	%vreg132<def> = V_MOV_B32_e32 1082130432, %EXEC<imp-use>; VGPR_32:%vreg132
	%vreg133<def> = V_CMP_NLE_F32_e64 0, %vreg131<kill>, 0, %vreg132<kill>, 0, 0, %EXEC<imp-use>; SReg_64:%vreg133 VGPR_32:%vreg131,%vreg132
	%vreg135<def> = COPY %vreg21; VGPR_32:%vreg135 SReg_32_XM0:%vreg21
	%vreg134<def> = V_CMP_GE_U32_e64 %vreg11, %vreg135, %EXEC<imp-use>; SReg_64:%vreg134 SReg_32_XM0:%vreg11 VGPR_32:%vreg135
	%vreg136<def> = S_OR_B64 %vreg134<kill>, %vreg133<kill>, %SCC<imp-def,dead>; SReg_64:%vreg136,%vreg134,%vreg133
	%vreg12<def> = SI_IF_BREAK %vreg136<kill>, %vreg5, %SCC<imp-def,dead>; SReg_64:%vreg12,%vreg136,%vreg5
	SI_LOOP %vreg12, <BB#3>, %EXEC<imp-def,dead>, %SCC<imp-def,dead>, %EXEC<imp-use>; SReg_64:%vreg12
	S_BRANCH <BB#4>
    Successors according to CFG: BB#4(0x04000000 / 0x80000000 = 3.12%) BB#3(0x7c000000 / 0x80000000 = 96.88%)

NCD: 
BB#1: derived from LLVM BB %for.body.preheader
    Predecessors according to CFG: BB#0
	%vreg126<def> = S_MOV_B32 0; SReg_32_XM0:%vreg126
	%vreg125<def> = S_MOV_B64 0; SReg_64:%vreg125
	S_BRANCH <BB#3>
    Successors according to CFG: BB#3(?%)

Diff Detail

Repository: rL LLVM

Event Timeline

wdng created this revision. Mar 24 2017, 1:08 PM

Herald added subscribers: t-tye, tpr, dstuttard and 2 others. Mar 24 2017, 1:08 PM

wdng edited the summary of this revision. Mar 24 2017, 1:10 PM

wdng edited the summary of this revision.

wdng edited reviewers, added: t-tye; removed: tony-tye. Mar 24 2017, 1:18 PM

This needs a test and is not correct. The direct predecessors could themselves be divergent blocks. You need to identify whether this PHI is in a divergent region.

Add conditions to identify whether this phi is in a divergent region.

wdng edited the summary of this revision. Mar 30 2017, 4:06 PM

wdng added a reviewer: rampitec.

Completely remove the weak if condition that checked whether the two blocks have a uniform branch. The current implementation uses divergence analysis to detect this: if all conditions on all paths leading to a block are uniform, the block is uniform.

In D31350#714741, @wdng wrote:

Completely remove the weak if condition that checked whether the two blocks have a uniform branch. The current implementation uses divergence analysis to detect this: if all conditions on all paths leading to a block are uniform, the block is uniform.

I do not see divergence analysis used. I see that you are checking for PHI to be a constant, which is way stronger than just uniform.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
393 ↗	(On Diff #93563)	Please use nullptr instead of NULL.

arsenm added inline comments. Mar 30 2017, 6:34 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
393 ↗	(On Diff #93563)	if (IPostDom)
395 ↗	(On Diff #93563)	This isn't going to do anything. PHINodes are IR constructs.

This is still approaching this from the wrong direction. You need to move all SGPR phis in the entire region unless the inputs are from the control flow intrinsics. There also needs to be a test.

Based on Matt's comments, check whether the inputs are from the control flow intrinsics.

Tests will be added later once we decide to use this approach. Thanks!

rampitec added inline comments. Apr 3 2017, 12:06 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
393 ↗	(On Diff #93668)	I do not understand this. If we are talking about %vreg5 from the code in the description, it is not defined by the post dominator. Its definitions are in BB#1 and BB#3, and both need to be checked.

Address code reviews: check terminators for all predecessors.

wdng edited the summary of this revision. Apr 6 2017, 1:01 PM

rampitec added inline comments. Apr 6 2017, 1:13 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
331 ↗	(On Diff #94425)	You need a recursive search to do it.
334 ↗	(On Diff #94425)	Weird brace indentation.

wdng added inline comments. Apr 6 2017, 1:22 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
331 ↗	(On Diff #94425)	According to http://llvm.org/docs/ProgrammersManual.html#iterating-over-predecessors-successors-of-blocks, it looks like this already iterates over all predecessors of a BB, right?

rampitec added inline comments. Apr 6 2017, 1:23 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
331 ↗	(On Diff #94425)	Only the immediate predecessors.

Use BFS to visit all intermediate predecessors and check their terminators.

wdng marked 5 inline comments as done. Apr 6 2017, 2:49 PM

rampitec added inline comments. Apr 6 2017, 2:53 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
333 ↗	(On Diff #94442)	You do not need two queues. In fact it is usually done with a SmallVector.

wdng added inline comments. Apr 6 2017, 3:02 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
333 ↗	(On Diff #94442)	BFS can be implemented in different ways, using either one queue or two. Are there any benefits to using SmallVector? Thanks!

rampitec added inline comments. Apr 6 2017, 3:05 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
333 ↗	(On Diff #94442)	You definitely do not need two. Also, LLVM prefers to use its own containers over std.

wdng added inline comments. Apr 6 2017, 4:09 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
333 ↗	(On Diff #94442)	I agree on using LLVM's own containers, but a vector is not a good data structure choice for implementing BFS, although it's doable.
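
To make the queue-versus-vector question concrete, here is a minimal standalone sketch. It uses only standard C++ containers and is not LLVM code; `PredList`, `collectPredsFifo` and `collectPredsLifo` are hypothetical names, and the CFG is assumed to be given as a predecessor list indexed by block number. The two worklist disciplines visit blocks in a different order but end up with the same set of transitive predecessors, which is all that a "does any predecessor satisfy X" query depends on.

```cpp
#include <cstddef>
#include <deque>
#include <set>
#include <vector>

// Hypothetical CFG encoding: preds[b] lists the predecessors of block b.
using PredList = std::vector<std::vector<std::size_t>>;

// FIFO worklist (a queue): blocks are visited in breadth-first order.
std::set<std::size_t> collectPredsFifo(const PredList &preds,
                                       std::size_t start) {
  std::set<std::size_t> seen;
  std::deque<std::size_t> worklist(preds[start].begin(), preds[start].end());
  while (!worklist.empty()) {
    std::size_t b = worklist.front();
    worklist.pop_front();
    if (!seen.insert(b).second)
      continue; // already processed
    worklist.insert(worklist.end(), preds[b].begin(), preds[b].end());
  }
  return seen;
}

// LIFO worklist (a vector used as a stack): a different visiting order,
// but the same set of blocks is reached.
std::set<std::size_t> collectPredsLifo(const PredList &preds,
                                       std::size_t start) {
  std::set<std::size_t> seen;
  std::vector<std::size_t> worklist(preds[start].begin(), preds[start].end());
  while (!worklist.empty()) {
    std::size_t b = worklist.back();
    worklist.pop_back();
    if (!seen.insert(b).second)
      continue; // already processed
    worklist.insert(worklist.end(), preds[b].begin(), preds[b].end());
  }
  return seen;
}
```

Since the check only asks whether some predecessor has a particular terminator, the visiting order does not affect the answer, so a single vector-backed worklist is sufficient.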

arsenm added inline comments. Apr 6 2017, 4:41 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
335 ↗	(On Diff #94442)	There needs to be a set because there can be loops.
test/CodeGen/AMDGPU/sgprcopies.ll
1 ↗	(On Diff #94442)	s/SI/GCN/. Also needs -verify-machineinstrs.
3 ↗	(On Diff #94442)	Should have a better name and a description of what this is really checking.
12 ↗	(On Diff #94442)	Should have instnamer run on this, and the test could be simplified a bit.

Add a hash set to avoid reprocessing blocks when the CFG contains cycles.
This LIT test is based on changes of ocltst; renamed the LIT test function.
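
The need for the set can be illustrated with a small standalone sketch. It is an analogue built on standard containers, not the patch itself: `Block`, its `divergentTerminator` flag (standing in for the result of `hasTerminatorThatModifiesExec`) and `predsHaveDivergentTerminator` are hypothetical names. Without the `visited` check, a loop block that lists itself as a predecessor, like BB#3 in the summary, would be pushed onto the worklist forever.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a machine basic block.
struct Block {
  std::vector<std::size_t> preds;   // indices of predecessor blocks
  bool divergentTerminator = false; // assumed precomputed per block
};

// Returns true if any transitive predecessor of `start` has a divergent
// terminator. The visited set guarantees termination on cyclic CFGs.
bool predsHaveDivergentTerminator(const std::vector<Block> &cfg,
                                  std::size_t start) {
  assert(start < cfg.size());
  std::unordered_set<std::size_t> visited;
  std::vector<std::size_t> worklist(cfg[start].preds.begin(),
                                    cfg[start].preds.end());
  while (!worklist.empty()) {
    std::size_t b = worklist.back();
    worklist.pop_back();
    if (!visited.insert(b).second)
      continue; // seen before: do not expand its predecessors again
    if (cfg[b].divergentTerminator)
      return true;
    worklist.insert(worklist.end(), cfg[b].preds.begin(), cfg[b].preds.end());
  }
  return false;
}
```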

Upload correct diff.

Use LLVM provided container for implementation.

Ping.

rampitec added inline comments. Apr 10 2017, 5:28 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
333 ↗	(On Diff #94735)	Usually it is called Visited.
334 ↗	(On Diff #94735)	And this is usually Worklist.
337 ↗	(On Diff #94735)	Not needed.
339 ↗	(On Diff #94735)	SmallVector<MachineBasicBlock*, 4> Worklist(MBB->pred_begin(), MBB->pred_end()); while (!Worklist.empty())
341 ↗	(On Diff #94735)	pop_back_val.
343 ↗	(On Diff #94735)	if (Visited.insert(mbb).second) continue;
348 ↗	(On Diff #94735)	pred, not preds. Anyway, Worklist.insert(Worklist.end(), mbb->pred_begin(), mbb->pred_end()).

Code changes based on code reviews.

wdng marked 6 inline comments as done. Apr 10 2017, 10:27 PM

rampitec added inline comments. Apr 11 2017, 10:36 AM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
344 ↗	(On Diff #94780)	If you use Visited.insert(mbb).second it is one search in the set instead of two.

wdng added inline comments. Apr 11 2017, 11:34 AM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
344 ↗	(On Diff #94780)	This does not look correct. We need to check whether this node has been visited before and add it to Visited only if it is not found, instead of inserting first and then searching, correct?

rampitec added inline comments. Apr 11 2017, 11:38 AM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
344 ↗	(On Diff #94780)	http://llvm.org/docs/doxygen/html/DenseMap_8h_source.html#l00169
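
The point of the linked source is that `insert` already reports whether the element was new, so a separate lookup is unnecessary. A minimal sketch of the idiom follows, using `std::unordered_set` as a stand-in because it needs no LLVM headers; `DenseSet::insert` likewise returns a pair whose `second` member is true only when the element was actually inserted. The helper name `markVisited` is hypothetical.

```cpp
#include <unordered_set>

// Returns true the first time `id` is seen and false on every later call.
// A single insert() performs both the membership test and the insertion.
bool markVisited(std::unordered_set<int> &visited, int id) {
  return visited.insert(id).second;
}
```

In a worklist loop this is used in negated form: `if (!Visited.insert(mbb).second) continue;` skips a block that was already in the set.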

Address code reviews. Thanks Stas!

Fix regressions in 3 other LIT tests caused by the code changes.

arsenm added inline comments. Apr 11 2017, 5:22 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
331 ↗	(On Diff #94909)	This is a bad name because most blocks have terminators. Almost all should at this point; the name needs to specify that it is a divergent terminator.
333 ↗	(On Diff #94909)	You could maybe make this a map and cache the result, so that every single block that needs to be visited doesn't need to walk all the way up each time.

wdng added inline comments. Apr 11 2017, 11:47 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
333 ↗	(On Diff #94909)	Visited is used to check whether the node being visited has been visited before; once it has been, its predecessors won't be processed again. For example:

      D
     / \
    A   C
    | \ |
    B   E
     \ /
      F

Assume we start from node F and trace F->B->A->D. When we then go from F to E, A and D won't be processed again. So I don't think there is a need to save results like A and D.

Change function name.

Ping.

arsenm added inline comments. Apr 12 2017, 12:35 PM

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
333 ↗	(On Diff #94909)	If you consider the entire function, the same predecessors will be visited for phis in other blocks.
78 ↗	(On Diff #94979)	This should be the first included llvm header.
test/CodeGen/AMDGPU/sgprcopies.ll
6 ↗	(On Diff #94979)	This should be marked amdgpu_kernel.

Address code reviews.
A separate patch will be created to optimize the search for divergent terminators.
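
One possible shape for that follow-up is sketched below. This is an assumption about how per-function caching could work, not the contents of any later patch; `Block`, `DivergentPredCache`, `query` and `cachedCount` are hypothetical names, and the sketch uses standard containers instead of LLVM ones. A negative answer for a block also holds for every block visited while computing it, because each of those blocks' transitive predecessors were covered by the same walk, so all of them can be cached as "no". A positive answer is cached only for the queried block.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a machine basic block.
struct Block {
  std::vector<std::size_t> preds;
  bool divergentTerminator = false;
};

// Caches "does any transitive predecessor have a divergent terminator?"
// per block. The cache is only valid while the CFG is unchanged.
class DivergentPredCache {
public:
  explicit DivergentPredCache(const std::vector<Block> &cfg) : cfg_(cfg) {}

  bool query(std::size_t start) {
    assert(start < cfg_.size());
    auto hit = cache_.find(start);
    if (hit != cache_.end())
      return hit->second;

    std::unordered_set<std::size_t> visited;
    std::vector<std::size_t> worklist(cfg_[start].preds.begin(),
                                      cfg_[start].preds.end());
    bool found = false;
    while (!worklist.empty() && !found) {
      std::size_t b = worklist.back();
      worklist.pop_back();
      if (!visited.insert(b).second)
        continue;
      if (cfg_[b].divergentTerminator) {
        found = true;
        break;
      }
      auto known = cache_.find(b);
      if (known != cache_.end()) {
        // A cached "yes" for b implies "yes" for start; a cached "no"
        // means b's predecessors need not be walked again.
        found = known->second;
        continue;
      }
      worklist.insert(worklist.end(), cfg_[b].preds.begin(),
                      cfg_[b].preds.end());
    }

    cache_[start] = found;
    if (!found)
      for (std::size_t b : visited)
        cache_.emplace(b, false); // everything reached is clean as well
    return found;
  }

  std::size_t cachedCount() const { return cache_.size(); }

private:
  const std::vector<Block> &cfg_;
  std::unordered_map<std::size_t, bool> cache_;
};
```

On the D/A/C/B/E/F example from the earlier comment, one query for F with no divergent terminators fills the cache for all six blocks, so later queries for phis in B or E return without walking the graph.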

rampitec added inline comments. Apr 12 2017, 4:35 PM

test/CodeGen/AMDGPU/sgprcopies.ll
6 ↗	(On Diff #95055)	Calling convention, not function name.

Address code review.

LGTM

This revision is now accepted and ready to land. Apr 12 2017, 4:46 PM

wdng marked 3 inline comments as done. Apr 12 2017, 4:46 PM

wdng added inline comments.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
333 ↗	(On Diff #94909)	A separate patch will be created to optimize the search for divergent terminators.

Closed by commit rL300142: AMDGPU : Fix common dominator of two incoming blocks terminates with uniform… (authored by wdng). Apr 12 2017, 5:04 PM

This revision was automatically updated to reflect the committed changes.

wdng marked an inline comment as done.

Diff 95059

llvm/trunk/lib/Target/AMDGPU/SIFixSGPRCopies.cpp

@@ ... @@
 /// In order to avoid this problem, this pass searches for PHI instructions
 /// which define a <vsrc> register and constrains its definition class to
 /// <vgpr> if the user of the PHI's definition register is a vector instruction.
 /// If the PHI's definition class is constrained to <vgpr> then the coalescer
 /// will be unable to perform the COPY removal from the above example which
 /// ultimately led to the creation of an illegal COPY.
 //===----------------------------------------------------------------------===//
 
+#include "llvm/ADT/DenseSet.h"
 #include "AMDGPU.h"
 #include "AMDGPUSubtarget.h"
 #include "SIInstrInfo.h"
 #include "llvm/CodeGen/MachineDominators.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/CodeGen/MachineInstrBuilder.h"
 #include "llvm/CodeGen/MachineRegisterInfo.h"
 #include "llvm/Support/Debug.h"
@@ ... @@ static bool isSafeToFoldImmIntoCopy(const MachineInstr *Copy,
   case AMDGPU::V_MOV_B64_PSEUDO:
     SMovOp = AMDGPU::S_MOV_B64;
     break;
   }
   Imm = ImmOp->getImm();
   return true;
 }
 
+static bool predsHasDivergentTerminator(MachineBasicBlock *MBB,
+                                        const TargetRegisterInfo *TRI) {
+  DenseSet<MachineBasicBlock*> Visited;
+  SmallVector<MachineBasicBlock*, 4> Worklist(MBB->pred_begin(),
+                                              MBB->pred_end());
+
+  while (!Worklist.empty()) {
+    MachineBasicBlock *mbb = Worklist.back();
+    Worklist.pop_back();
+
+    if (!Visited.insert(mbb).second)
+      continue;
+    if (hasTerminatorThatModifiesExec(mbb, TRI))
+      return true;
+
+    Worklist.insert(Worklist.end(), mbb->pred_begin(), mbb->pred_end());
+  }
+
+  return false;
+}
+
 bool SIFixSGPRCopies::runOnMachineFunction(MachineFunction &MF) {
   const SISubtarget &ST = MF.getSubtarget<SISubtarget>();
   MachineRegisterInfo &MRI = MF.getRegInfo();
   const SIRegisterInfo *TRI = ST.getRegisterInfo();
   const SIInstrInfo *TII = ST.getInstrInfo();
   MDT = &getAnalysis<MachineDominatorTree>();
 
   SmallVector<MachineInstr *, 16> Worklist;
@@ ... @@ for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();
           break;
 
         // We don't need to fix the PHI if the common dominator of the
         // two incoming blocks terminates with a uniform branch.
         if (MI.getNumExplicitOperands() == 5) {
           MachineBasicBlock *MBB0 = MI.getOperand(2).getMBB();
           MachineBasicBlock *MBB1 = MI.getOperand(4).getMBB();
 
-          MachineBasicBlock *NCD = MDT->findNearestCommonDominator(MBB0, MBB1);
-          if (NCD && !hasTerminatorThatModifiesExec(NCD, TRI)) {
+          if (!predsHasDivergentTerminator(MBB0, TRI) &&
+              !predsHasDivergentTerminator(MBB1, TRI)) {
             DEBUG(dbgs() << "Not fixing PHI for uniform branch: " << MI << '\n');
             break;
           }
         }
 
         // If a PHI node defines an SGPR and any of its operands are VGPRs,
         // then we need to move it to the VALU.
         //

llvm/trunk/test/CodeGen/AMDGPU/loop_break.ll

@@ ... @@
 ; OPT: call void @llvm.amdgcn.end.cf(i64
 
 ; TODO: Can remove exec fixes in return block
 ; GCN-LABEL: {{^}}break_loop:
 ; GCN: s_mov_b64 [[INITMASK:s\[[0-9]+:[0-9]+\]]], 0{{$}}
 
 ; GCN: [[LOOP_ENTRY:BB[0-9]+_[0-9]+]]: ; %bb1
 ; GCN: s_or_b64 [[MASK:s\[[0-9]+:[0-9]+\]]], exec, [[INITMASK]]
-; GCN: s_cmp_gt_i32 s{{[0-9]+}}, -1
-; GCN-NEXT: s_cbranch_scc1 [[FLOW:BB[0-9]+_[0-9]+]]
+; GCN: v_cmp_lt_i32_e32 vcc, -1
+; GCN: s_and_b64 vcc, exec, vcc
+; GCN-NEXT: s_cbranch_vccnz [[FLOW:BB[0-9]+_[0-9]+]]
 
 ; GCN: ; BB#2: ; %bb4
 ; GCN: buffer_load_dword
 ; GCN: v_cmp_ge_i32_e32 vcc,
 ; GCN: s_or_b64 [[MASK]], vcc, [[INITMASK]]
 
 ; GCN: [[FLOW]]:
 ; GCN: s_mov_b64 [[INITMASK]], [[MASK]]

llvm/trunk/test/CodeGen/AMDGPU/sgprcopies.ll

@@ ... @@
+; RUN: llc < %s -march=amdgcn -verify-machineinstrs | FileCheck -check-prefix=GCN %s
+
+; GCN-LABEL: {{^}}checkTwoBlocksWithUniformBranch
+; GCN: BB0_2
+; GCN: v_add
+define amdgpu_kernel void @checkTwoBlocksWithUniformBranch(i32 addrspace(1)* nocapture %out, i32 %width, float %xPos, float %yPos, float %xStep, float %yStep, i32 %maxIter) {
+entry:
+  %conv = call i32 @llvm.amdgcn.workitem.id.x() #1
+  %rem = urem i32 %conv, %width
+  %div = udiv i32 %conv, %width
+  %conv1 = sitofp i32 %rem to float
+  %x = tail call float @llvm.fmuladd.f32(float %xStep, float %conv1, float %xPos)
+  %conv2 = sitofp i32 %div to float
+  %y = tail call float @llvm.fmuladd.f32(float %yStep, float %conv2, float %yPos)
+  %yy = fmul float %y, %y
+  %xy = tail call float @llvm.fmuladd.f32(float %x, float %x, float %yy)
+  %cmp01 = fcmp ole float %xy, 4.000000e+00
+  %cmp02 = icmp ne i32 %maxIter, 0
+  %cond01 = and i1 %cmp02, %cmp01
+  br i1 %cond01, label %for.body.preheader, label %for.end
+
+for.body.preheader: ; preds = %entry
+  br label %for.body
+
+for.body: ; preds = %for.body.preheader, %for.body
+  %x_val = phi float [ %call8, %for.body ], [ %x, %for.body.preheader ]
+  %iter_val = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
+  %y_val = phi float [ %call9, %for.body ], [ %y, %for.body.preheader ]
+  %sub = fsub float -0.000000e+00, %y_val
+  %call7 = tail call float @llvm.fmuladd.f32(float %x_val, float %x_val, float %x) #1
+  %call8 = tail call float @llvm.fmuladd.f32(float %sub, float %y_val, float %call7) #1
+  %mul = fmul float %x_val, 2.000000e+00
+  %call9 = tail call float @llvm.fmuladd.f32(float %mul, float %y_val, float %y) #1
+  %inc = add nuw i32 %iter_val, 1
+  %mul3 = fmul float %call9, %call9
+  %0 = tail call float @llvm.fmuladd.f32(float %call8, float %call8, float %mul3)
+  %cmp = fcmp ole float %0, 4.000000e+00
+  %cmp5 = icmp ult i32 %inc, %maxIter
+  %or.cond = and i1 %cmp5, %cmp
+  br i1 %or.cond, label %for.body, label %for.end.loopexit
+
+for.end.loopexit: ; preds = %for.body
+  br label %for.end
+
+for.end: ; preds = %for.end.loopexit, %entry
+  %iter.0.lcssa = phi i32 [ 0, %entry ], [ %inc, %for.end.loopexit ]
+  %idxprom = ashr exact i32 %conv, 32
+  %arrayidx = getelementptr inbounds i32, i32 addrspace(1)* %out, i32 %idxprom
+  store i32 %iter.0.lcssa, i32 addrspace(1)* %arrayidx, align 4
+  ret void
+}
+
+; Function Attrs: nounwind readnone
+declare i32 @llvm.amdgcn.workitem.id.x() #0
+declare float @llvm.fmuladd.f32(float, float, float) #1
+
+attributes #0 = { nounwind readnone }
+attributes #1 = { readnone }

llvm/trunk/test/CodeGen/AMDGPU/uniform-loop-inside-nonuniform.ll

@@ ... @@
 ; RUN: llc -march=amdgcn -mcpu=verde < %s | FileCheck %s
 
 ; Test a simple uniform loop that lives inside non-uniform control flow.
 
 ; CHECK-LABEL: {{^}}test1:
 ; CHECK: v_cmp_ne_u32_e32 vcc, 0
 ; CHECK: s_and_saveexec_b64
 ; CHECK-NEXT: s_xor_b64
 ; CHECK-NEXT: ; mask branch
+; CHECK-NEXT: s_cbranch_execz BB{{[0-9]+_[0-9]+}}
 ; CHECK-NEXT: BB{{[0-9]+_[0-9]+}}: ; %loop_body.preheader
 
 ; CHECK: [[LOOP_BODY_LABEL:BB[0-9]+_[0-9]+]]:
-; CHECK: s_cbranch_scc0 [[LOOP_BODY_LABEL]]
+; CHECK: s_cbranch_vccz [[LOOP_BODY_LABEL]]
 
 ; CHECK: s_endpgm
 define amdgpu_ps void @test1(<8 x i32> inreg %rsrc, <2 x i32> %addr.base, i32 %y, i32 %p) {
 main_body:
   %cc = icmp eq i32 %p, 0
   br i1 %cc, label %out, label %loop_body
 
 loop_body:

llvm/trunk/test/CodeGen/AMDGPU/valu-i1.ll

@@ ... @@
 ; SI: s_xor_b64 [[BR_SREG]], exec, [[BR_SREG]]
 ; SI: s_cbranch_execz [[LABEL_EXIT:BB[0-9]+_[0-9]+]]
 
 ; SI: s_mov_b64 {{s\[[0-9]+:[0-9]+\]}}, 0{{$}}
 
 ; SI: [[LABEL_LOOP:BB[0-9]+_[0-9]+]]:
 ; SI: buffer_load_dword
 ; SI-DAG: buffer_store_dword
-; SI-DAG: s_cmpk_eq_i32 s{{[0-9]+}}, 0x100
-; SI: s_cbranch_scc0 [[LABEL_LOOP]]
+; SI-DAG: v_cmp_eq_u32_e32 vcc, 0x100
+; SI: s_cbranch_vccz [[LABEL_LOOP]]
 ; SI: [[LABEL_EXIT]]:
 ; SI: s_endpgm
 
 define amdgpu_kernel void @simple_test_v_loop(i32 addrspace(1)* %dst, i32 addrspace(1)* %src) #1 {
 entry:
   %tid = call i32 @llvm.amdgcn.workitem.id.x() nounwind readnone
   %is.0 = icmp ne i32 %tid, 0
   %limit = add i32 %tid, 64

This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU : Fix common dominator of two incoming blocks terminates with uniform branch issue.
ClosedPublic
