
[AMDGPU] Add pseudo "old" and "wqm_mode" source to all DPP instructions
ClosedPublic

Authored by cwabbott on Jun 27 2017, 3:47 PM.

Details

Summary

All instructions with the DPP modifier may not write to certain lanes of
the output if bound_ctrl=1 is set or any bits in bank_mask or row_mask
aren't set, so the destination register may be both defined and modified.
The right way to handle this is to add a constraint that the destination
register is the same as one of the inputs. We could tie the destination
to the first source, but that would be too restrictive for some use-cases
where we want the destination to be some other value before the
instruction executes. Instead, add a fake "old" source and tie it to the
destination. Effectively, the "old" source defines what value unwritten
lanes will get. We'll expose this functionality to users with a new
intrinsic later.
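To make the write gating concrete, here is a simplified per-lane model (a sketch for illustration only -- `dpp_mov` and its parameters are hypothetical names, and bound_ctrl handling is elided; in this model, unwritten lanes simply keep the tied "old" value):

```python
# Simplified model of DPP write gating: lanes masked off by row_mask or
# bank_mask, or whose DPP fetch falls outside the wavefront, are not
# written, so the destination keeps the value of the tied "old" source.
# (bound_ctrl handling is elided for brevity.)

def dpp_mov(src, old, row_mask=0xF, bank_mask=0xF, dpp_ctrl=None):
    n = len(src)  # wavefront size: 64 lanes on GCN
    out = []
    for lane in range(n):
        row, bank = lane // 16, (lane % 16) // 4
        if not (row_mask >> row) & 1 or not (bank_mask >> bank) & 1:
            out.append(old[lane])  # lane gated off: keep "old" value
            continue
        src_lane = dpp_ctrl(lane) if dpp_ctrl else lane
        if 0 <= src_lane < n:
            out.append(src[src_lane])
        else:
            out.append(old[lane])  # invalid fetch: lane not written
    return out
```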

Also, we want to use DPP instructions for computing derivatives, which
means we need to set WQM for them. We also need to enable the entire
wavefront when using DPP intrinsics to implement nonuniform subgroup
reductions, since otherwise we'll get incorrect results in some cases.
To accommodate this, add a new operand to all DPP instructions which will
be interpreted by the SI WQM pass. This will be exposed with a new
intrinsic later. We'll also add support for Whole Wavefront Mode later.

I also fixed llvm.amdgcn.mov.dpp to overwrite the source and fixed up
the test. However, I could also keep the old behavior (where lanes that
aren't written are undefined) if people want it.

This change is more of an RFC, since some assembler tests are failing
and I have no idea why. Also, this seemed quite hairy, and I'm not sure
if this is the best way to hook everything up. Should I be creating
separate pseudo-instructions with these extra sources? Any guidance on
that would be appreciated.

Diff Detail

Repository
rL LLVM

Event Timeline

cwabbott created this revision.Jun 27 2017, 3:47 PM
tpr added a comment.Jun 28 2017, 5:14 AM

Hi Connor. We have also been thinking about issues around dpp, wqm and wwm inside AMD. Something we may want to see is a way in machine instructions to express that a dpp operand can be combined into an alu op, and then the write gating (bound_ctrl=1, row and bank masks) affects the result of the alu op, not the result of the dpp move. However I guess that is a more extensive change to the definition of a whole class of instructions. I'd be interested to hear your thoughts.

Being able to combine dpp move, alu op and write gating also affects how to express it at the IR intrinsic level. It seems to me that the write gating needs to be a separate intrinsic, with instruction selection spotting that it can all be combined into a single instruction.

cwabbott added a comment.EditedJun 28 2017, 10:50 AM
In D34716#793614, @tpr wrote:

Hi Connor. We have also been thinking about issues around dpp, wqm and wwm inside AMD. Something we may want to see is a way in machine instructions to express that a dpp operand can be combined into an alu op, and then the write gating (bound_ctrl=1, row and bank masks) affects the result of the alu op, not the result of the dpp move. However I guess that is a more extensive change to the definition of a whole class of instructions. I'd be interested to hear your thoughts.

Being able to combine dpp move, alu op and write gating also affects how to express it at the IR intrinsic level. It seems to me that the write gating needs to be a separate intrinsic, with instruction selection spotting that it can all be combined into a single instruction.

Yeah, I've already added such an intrinsic in D34718. You can also see my implementation of the inclusive scan kernel in Mesa here. Each round generates IR like:

%1 = call i32 @llvm.amdgcn.update.dpp(i32 0, i32 %0, <dpp_ctrl, etc.>)
%3 = add i32 %1, %2

The key part here is putting the identity of the operation as the "old" source, which works regardless of operation -- this lets it be folded to something like:

V_ADD_I32_dpp  %3, old:%1, src0:%0, src1:%2 <dpp_ctrl, etc.>

This becomes a single instruction with %3 and %1 tied to the same register. This should just be a matter of writing a few ISel patterns, although I haven't done that yet, since I've been concentrating on getting it working first.
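As a sanity check on why the identity goes in the "old" slot, here's a lane-level sketch of the scan (illustrative Python, not the Mesa code; `row_shr` row-boundary behavior is ignored and the whole wavefront is treated as one row):

```python
# Sketch of the inclusive scan: each DPP shift seeds shifted-out lanes
# with the operation's identity (0 for iadd), so the following add is a
# no-op for those lanes regardless of which operation is being scanned.

def shifted(vals, shift, identity):
    # models V_MOV_B32_dpp row_shr:<shift> with the "old" source = identity
    return [vals[i - shift] if i - shift >= 0 else identity
            for i in range(len(vals))]

def inclusive_scan_add(vals):
    n, acc = len(vals), list(vals)
    shift = 1
    while shift < n:
        sh = shifted(acc, shift, 0)             # identity of iadd is 0
        acc = [a + b for a, b in zip(acc, sh)]  # the fused V_ADD_I32_dpp step
        shift *= 2
    return acc
```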

So, one thing that's not clear to me is the semantics of how the update.dpp intrinsic is supposed to enable WQM or WWM. In your sequence of instructions, if you just put a WQM/WWM flag on the update.dpp intrinsic, how does LLVM know whether the regular ALU intrinsics in between should run in WQM/WWM or not?

Tim had an interesting proposal for that, which involved a pair of intrinsics:

llvm.amdgcn.helpervalue(src, helpervalue) --> returns src for active lanes and helpervalue for other lanes

llvm.amdgcn.wwm(src) --> returns src for active lanes and undefined/poison (my choice of words, not Tim's) for other lanes, but guarantees that the computations leading to src are executed "as-if" in WWM.

llvm.amdgcn.wqm(src) --> analogous

I'm writing "as-if", because not all computations leading up to src actually need to be in WWM: llvm.amdgcn.helpervalue can act as a "barrier" to the propagation of WWM. So if you think of the graph of WWM computations, .helpervalue acts as a source, and .wwm acts as a sink.
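A toy lane-vector model of that source/sink picture (the names and semantics here are my paraphrase of the proposal, not LLVM's actual intrinsics):

```python
# helpervalue acts as a WWM "source": inactive lanes receive a defined
# helper value. wwm acts as the "sink": the computation in between runs
# across all lanes, and only active lanes are defined afterwards.

UNDEF = object()  # stands in for undefined/poison

def helpervalue(src, helper, exec_mask):
    return [s if live else helper for s, live in zip(src, exec_mask)]

def wwm(src, exec_mask):
    return [s if live else UNDEF for s, live in zip(src, exec_mask)]
```

A reduction would then bracket its computation between the two: helpervalue seeds the inactive lanes with the operation's identity, and wwm hands the full-width result back to active lanes.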

I think this proposal goes a long way towards clarifying which operations actually need WQM/WWM. One issue that occurred to me today is that the semantics are unclear when control flow is involved. Two basic examples to think about:

v = some computation
if (cond) {
   t1 = f(v)
   r1 = wwm(t1)
} else {
   t2 = f(v)
   r2 = wwm(t2)
}

I believe the desirable semantics here are clear, though they may require some compiler work. Basically, you want the entire vector of v to be equal at the start of both blocks. This requires ensuring that no part of it gets overwritten during the first block we go through.

The much more problematic case is:

if (cond) {
  v1 = ...
} else {
  v2 = ...
}
v = wwm(phi(v1, v2))

What does v look like? Specifically, what's in the inactive lanes? Perhaps the best thing we can do is say that the active lanes come from the predecessor block they went through, and all the other lanes come from one of the two blocks, though it is undefined which one.

So, one thing that's not clear to me is the semantics of how the update.dpp intrinsic is supposed to enable WQM or WWM. In your sequence of instructions, if you just put a WQM/WWM flag on the update.dpp intrinsic, how does LLVM know whether the regular ALU intrinsics in between should run in WQM/WWM or not?

Tim had an interesting proposal for that, which involved a pair of intrinsics:

llvm.amdgcn.helpervalue(src, helpervalue) --> returns src for active lanes and helpervalue for other lanes

In D34719, I added llvm.amdgcn.set.inactive, which does exactly what you describe. I left that out of the example in my comment, but you can see it in the Mesa implementation I posted (in particular, look at ac_build_reduce(), ac_build_inclusive_scan(), and ac_build_exclusive_scan()).

llvm.amdgcn.wwm(src) --> returns src for active lanes and undefined/poison (my choice of words, not Tim's) for other lanes, but guarantees that the computations leading to src are executed "as-if" in WWM.

llvm.amdgcn.wqm(src) --> analogous

I'm writing "as-if", because not all computations leading up to src actually need to be in WWM: llvm.amdgcn.helpervalue can act as a "barrier" to the propagation of WWM. So if you think of the graph of WWM computations, .helpervalue acts as a source, and .wwm acts as a sink.

Hmm, this might be an interesting approach. I think that setting wqm_ctrl to WQM on a DPP instruction is essentially equivalent to calling llvm.amdgcn.wqm on the result and then replacing all uses with the result of llvm.amdgcn.wqm (and similarly for llvm.amdgcn.wwm). I can see how having a separate pseudo-instruction might be a little cleaner, though. And it would be nice for us to stop pretending that we can figure out what needs WWM/WQM based on the instruction itself, since that very much depends on what you're using the instruction for and what the API demands.

One thing that strikes me is that while your definition is sufficient for WWM, it isn't for WQM -- for derivatives, GL says that we actually do have to care about the values of things in helper invocations. The program has to behave as if it's always in WQM, except for loads and stores, so just assuming that helper lanes are undefined/poison isn't valid (except where loads from memory are involved). I think we can just strengthen the definition of llvm.amdgcn.wqm a little, to say that helper lanes must have the correct value as if everything was computed in WQM.

Also, with these two intrinsics, we still wouldn't be able to express that some computation must happen in exact mode. This matters for DPP instructions and store instructions with side effects. I'm not sure if we'll ever want to use DPP instructions in exact mode, but we definitely need to care about store instructions. I guess we can just keep the current logic for making sure that stores are executed in exact mode, although it certainly seems kinda hack-ish, especially if the goal is to get rid of special assumptions about instructions needing WQM/WWM/Exact.

I think this proposal goes a long way towards clarifying which operations actually need WQM/WWM. One issue that occurred to me today is that the semantics are unclear when control flow is involved. Two basic examples to think about:

v = some computation
if (cond) {
   t1 = f(v)
   r1 = wwm(t1)
} else {
   t2 = f(v)
   r2 = wwm(t2)
}

I believe the desirable semantics here are clear, though they may require some compiler work. Basically, you want the entire vector of v to be equal at the start of both blocks. This requires ensuring that no part of it gets overwritten during the first block we go through.

I think the extra edge will guarantee that that's the case already. And we certainly already do have similar problems with WQM, where you have to consider v live during the first block in case some WQM operation clobbers it.

The much more problematic case is:

if (cond) {
  v1 = ...
} else {
  v2 = ...
}
v = wwm(phi(v1, v2))

What does v look like? Specifically, what's in the inactive lanes? Perhaps the best thing we can do is say that the active lanes come from the predecessor block they went through, and all the other lanes come from one of the two blocks, though it is undefined which one.

If you take the "as-if" semantics to heart, then the inactive lanes should have the value they would have if the whole program were executed in WWM -- that is, the block they come from should depend on what the value of cond would be if we executed the entire thing in WWM. In fact, if you replace "WWM" with "WQM" everywhere, then GL already mandates this behavior, and we implement it in the existing WQM pass. I chose not to implement it in WWM, since we're only ever generating WWM things ourselves with a matching llvm.amdgcn.set.inactive that tightly contains the "WWM-ness", and I doubt we'll ever need to care about these types of examples.

One thing that strikes me is that while your definition is sufficient for WWM, it isn't for WQM -- for derivatives, GL says that we actually do have to care about the values of things in helper invocations. The program has to behave as if it's always in WQM, except for loads and stores, so just assuming that helper lanes are undefined/poison isn't valid (except where loads from memory are involved). I think we can just strengthen the definition of llvm.amdgcn.wqm a little, to say that helper lanes must have the correct value as if everything was computed in WQM.

Actually, scratch that -- after thinking about it some more, the "as-if" semantics should be enough for GL. You can't really observe the effects of helper invocations except for when you take a derivative, so as long as you stick a llvm.amdgcn.wqm intrinsic after every derivative, you should be fine.
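For reference, the reason helper lanes must hold real values at derivatives can be seen in a 2x2-quad model (an assumed lane layout, for illustration only):

```python
# Fine-grained dFdx over one 2x2 pixel quad, laid out [v00, v10, v01, v11]:
# each pixel's derivative subtracts its horizontal neighbor, so even a
# helper (inactive) lane's value feeds an active pixel's result.

def dfdx_fine(quad):
    d_top = quad[1] - quad[0]  # shared by both top pixels
    d_bot = quad[3] - quad[2]  # shared by both bottom pixels
    return [d_top, d_top, d_bot, d_bot]
```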

cwabbott updated this revision to Diff 106995.Jul 17 2017, 5:54 PM

Remove wqm_mode in favor of llvm.amdgcn.wqm and llvm.amdgcn.wwm intrinsics.

nhaehnle edited edge metadata.Jul 20 2017, 2:58 PM

One question; apart from that, this looks good.

lib/Target/AMDGPU/VOP2Instructions.td
278 ↗(On Diff #106995)

Is the wqm_ctrl still correct?

cwabbott added inline comments.Jul 26 2017, 1:26 PM
lib/Target/AMDGPU/VOP2Instructions.td
278 ↗(On Diff #106995)

No, good catch!

cwabbott updated this revision to Diff 108738.Jul 28 2017, 3:54 PM

Remove leftover $wqm_ctrl.

cwabbott marked 2 inline comments as done.Jul 28 2017, 3:55 PM
nhaehnle accepted this revision.Aug 2 2017, 2:26 AM

LGTM

This revision is now accepted and ready to land.Aug 2 2017, 2:26 AM
cwabbott updated this revision to Diff 109666.Aug 3 2017, 5:48 PM

Fix assembling DPP instructions. Also, adopt a more conservative version of
D34715. In particular, we ignore Constraints/DisableEncoding from the original
instruction for the DPP version. The only instruction with any special
constraints is MAC, because of its fake third source, and there it doesn't make
sense to keep the fake third source since it has to be the same as the normal
"old" source anyways. We can revisit this if something else comes up, but I
think this is a good plan for now.

Some comments on what I noticed. Probably best if somebody else has a look as well.

lib/Target/AMDGPU/VOP1Instructions.td
84–85 ↗(On Diff #109666)

Duplicate setting of Constraints/DisableEncoding.

lib/Target/AMDGPU/VOP2Instructions.td
105–106 ↗(On Diff #109666)

Duplicate setting of Constraints/DisableEncoding.

lib/Target/AMDGPU/VOPInstructions.td
451–452 ↗(On Diff #109666)

Both here and in VOP_SDWA_Real, it looks like there are duplicated pairs of let Constraints/DisableEncoding lines. One of those pairs should be removed.

cwabbott added inline comments.Aug 4 2017, 8:48 AM
lib/Target/AMDGPU/VOPInstructions.td
451–452 ↗(On Diff #109666)

These duplicates already exist in master now, and removing them would be unrelated to the current change, so I thought I'd keep them for now -- I can make a separate change that removes them.

cwabbott updated this revision to Diff 109779.Aug 4 2017, 10:54 AM

Remove spurious change to AMDGPUAsmParser.cpp

It looks like Sam has worked a lot on the assembler, including adding support for DPP instructions, so I'm adding him for the assembler bits. I'd like to get this in before I leave next week, though.

SamWot accepted this revision.Aug 5 2017, 12:57 AM
This revision was automatically updated to reflect the committed changes.