This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Produce madak and madmk from the two-address pass
ClosedPublic

Authored by rampitec on Sep 1 2017, 12:10 PM.

Details

Reviewers

Commits

rG710da42b86bb: [AMDGPU] Produce madak and madmk from the two-address pass
rL312928: [AMDGPU] Produce madak and madmk from the two-address pass

Summary

These two instructions are normally selected, but when the
two address pass converts mac into mad we end up with the
mad where we could have one of these.
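
As a rough illustration of the conversion the summary describes, the following toy model picks madak when the addend is a known literal, madmk when one of the multiplicands is, and falls back to mad otherwise. It is not the LLVM MachineInstr API: `Operand`, `Form`, `Converted`, and `convertMac` are hypothetical names, and the sketch ignores the source-modifier, clamp/omod, f16, and operand-legality checks that the actual patch performs.

```cpp
#include <cassert>

// Toy operand: a value plus a flag saying whether the register is known to be
// defined by a move of a literal constant (hypothetical, for illustration).
struct Operand {
  float value;
  bool isLiteral;
};

enum class Form { Mad, Madak, Madmk };

struct Converted {
  Form form;
  float result;
};

// Arithmetic of the three-address forms.
float mad(float s0, float s1, float s2) { return s0 * s1 + s2; }  // s0 * s1 + s2
float madak(float s0, float s1, float k) { return s0 * s1 + k; }  // s0 * s1 + K
float madmk(float s0, float k, float s1) { return s0 * k + s1; }  // s0 * K + s1

// mac computes dst = src0 * src1 + dst, so src2 is tied to dst. Untying it
// gives a three-address form; which one depends on where the literal sits.
Converted convertMac(Operand src0, Operand src1, Operand src2) {
  if (src2.isLiteral)
    return {Form::Madak, madak(src0.value, src1.value, src2.value)};
  if (src1.isLiteral)
    return {Form::Madmk, madmk(src0.value, src1.value, src2.value)};
  if (src0.isLiteral) // multiplication commutes, so swap src0 and src1
    return {Form::Madmk, madmk(src1.value, src0.value, src2.value)};
  return {Form::Mad, mad(src0.value, src1.value, src2.value)};
}
```

Every branch computes the same value as the original mac; only the encoding of the literal changes.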

Diff Detail

Event Timeline

rampitec created this revision. Sep 1 2017, 12:10 PM

Herald added subscribers: t-tye, tpr, dstuttard and 4 others. Sep 1 2017, 12:10 PM

I don't think this is the right place for this. This is converting a two address instruction into another two address instruction. Why wasn't this matched to a v_madak originally?

This looks like it is undoing the specific handling for 2 uses done in FoldImmediate. This also probably should be handled there, and specifically for 2 uses. Also needs a test with 3+ uses, where it probably still shouldn't be folded.

In D37389#859395, @arsenm wrote:

I don't think this is the right place for this. This is converting a two address instruction into another two address instruction. Why wasn't this matched to a v_madak originally?

It was selected as mac, and mac is preferable over mad or madmk.

In D37389#859399, @arsenm wrote:

This looks like it is undoing the specific handling for 2 uses done in FoldImmediate. This also probably should be handled there, and specifically for 2 uses. Also needs a test with 3+ uses, where it probably still shouldn't be folded.

There is no point in the folding. Both mad and madak are 64 bits, and both use the constant bus. With folding we also use one extra register, and madak/madmk are VOP2, so they are preferable.

In D37389#859432, @rampitec wrote:

In D37389#859395, @arsenm wrote:

I don't think this is the right place for this. This is converting a two address instruction into another two address instruction. Why wasn't this matched to a v_madak originally?

It was selected as mac, and mac is preferable over mad or madmk.

In fact I do not see how madak differs from mad with respect to two/three address. It is really worth noting that one of the register operands can also be folded into the instruction as an immediate. If we produced mad here and folded the immediate later, the result would be the same, just with more work. The pass needs to simplify operands, and we do it. As for the selection part, I also do not see how it could efficiently deal with register pressure issues; it is just too early.

In D37389#861163, @rampitec wrote:

In D37389#859432, @rampitec wrote:

In D37389#859395, @arsenm wrote:

I don't think this is the right place for this. This is converting a two address instruction into another two address instruction. Why wasn't this matched to a v_madak originally?

It was selected as mac, and mac is preferable over mad or madmk.

In fact I do not see how madak differs from mad with respect to two/three address. It is really worth noting that one of the register operands can also be folded into the instruction as an immediate. If we produced mad here and folded the immediate later, the result would be the same, just with more work. The pass needs to simplify operands, and we do it. As for the selection part, I also do not see how it could efficiently deal with register pressure issues; it is just too early.

OK, I was thinking of s_mulk_i32/s_addk_i32, which are in-place updates of the register.

test/CodeGen/AMDGPU/madak.ll
40	Should add a test with 3 uses. We should consider not doing it for > 2 uses if optsize

rampitec added inline comments. Sep 6 2017, 3:25 PM

test/CodeGen/AMDGPU/madak.ll
40	We shall always use madak/madmk instead of v_mad_f32. The instruction size is the same: mad is VOP3, while madak/madmk are VOP2 + literal, i.e. a 64-bit VOP3 vs. a 32-bit VOP2 + 32-bit literal. At the same time we can save a register for the literal and move into that register. So even for optsize we shall prefer these.
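
The size argument can be checked with simple arithmetic. A minimal sketch using only the encoding widths quoted in the comment above; the constant and function names are made up for illustration:

```cpp
#include <cassert>

// Encoding widths in bits, as quoted in the review comment.
constexpr int kVop2Bits = 32;    // VOP2 base encoding
constexpr int kVop3Bits = 64;    // VOP3 encoding
constexpr int kLiteralBits = 32; // trailing 32-bit literal

// v_mad_f32 is VOP3.
constexpr int madBits() { return kVop3Bits; }

// v_madak_f32 / v_madmk_f32 are VOP2 followed by a literal.
constexpr int madakBits() { return kVop2Bits + kLiteralBits; }

// Both forms occupy the same number of bits, so replacing mad with
// madak/madmk does not grow the code.
static_assert(madBits() == madakBits(), "mad and madak/madmk have equal size");
```

Since the sizes are equal, the choice between the two forms comes down to register usage rather than code size, which is the point made for optsize.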

arsenm added inline comments. Sep 6 2017, 6:27 PM

lib/Target/AMDGPU/SIInstrInfo.cpp
2088	You don't need the isFullCopy check. COPY can't be used to materialize an immediate. Def->isMoveImm() instead of the separate opcode checks
2133	Braces
2134	What if this is an f16 mac?

rampitec updated this revision to Diff 114118. Sep 6 2017, 7:30 PM

rampitec marked 2 inline comments as done.

rampitec added inline comments.

lib/Target/AMDGPU/SIInstrInfo.cpp
2088	isMoveImm() will catch cmov, so will not work. isFullCopy removed.
2134	"if (!IsF16 ..." on line 2132.

arsenm added inline comments. Sep 7 2017, 3:25 PM

lib/Target/AMDGPU/SIInstrInfo.cpp
2088	It won't catch cmov (I assume you mean s_cmov_b32, which we also don't emit).
2134	But we can handle f16 here as well, so might as well

rampitec added inline comments. Sep 7 2017, 3:30 PM

lib/Target/AMDGPU/SIInstrInfo.cpp
2088	It doesn't matter that we do not produce it right now. As long as it is marked as move imm I cannot use isMoveImm(); that is a time bomb. In fact I could remove the flag from S_CMOV_B*. Objections?
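
A toy sketch of the concern, assuming (as rampitec states and arsenm disputes) that the move-immediate flag is also set on a conditional move. `Opcode`, `InstrDesc`, `foldableByFlag`, and `foldableByOpcode` are hypothetical and are not LLVM APIs:

```cpp
#include <cassert>

// Hypothetical stand-ins for a few opcodes discussed in the thread.
enum class Opcode { S_MOV_B32, V_MOV_B32_e32, S_CMOV_B32 };

// Hypothetical descriptor: an opcode, a "move immediate" property bit, and
// whether the source operand is an immediate.
struct InstrDesc {
  Opcode opcode;
  bool moveImmFlag;
  bool srcIsImm;
};

// Broad check: trust the property bit. This accepts any instruction that
// carries the flag, including a conditional move if it is flagged.
bool foldableByFlag(const InstrDesc &d) {
  return d.moveImmFlag && d.srcIsImm;
}

// Narrow check: accept only the two unconditional moves by opcode.
bool foldableByOpcode(const InstrDesc &d) {
  return (d.opcode == Opcode::S_MOV_B32 ||
          d.opcode == Opcode::V_MOV_B32_e32) &&
         d.srcIsImm;
}
```

Under that assumption, a flagged conditional move would pass the broad check even though its destination is not guaranteed to hold the literal, while the explicit opcode list rejects it.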

Added f16 case.

LGTM with the minor cleanup

lib/Target/AMDGPU/SIInstrInfo.cpp
2135	Assign the opcode to a variable instead of repeating this in 3 places

This revision is now accepted and ready to land. Sep 11 2017, 9:18 AM

In D37389#866644, @arsenm wrote:

LGTM with the minor cleanup

We only know what to assign inside each if statement.

Closed by commit rL312928: [AMDGPU] Produce madak and madmk from the two-address pass (authored by rampitec). Sep 11 2017, 10:15 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Target/AMDGPU/SIInstrInfo.cpp	38 lines
test/CodeGen/AMDGPU/madak.ll	2 lines
test/CodeGen/AMDGPU/twoaddr-mad.mir	56 lines
test/CodeGen/AMDGPU/v_madak_f16.ll	2 lines

Diff 113569

lib/Target/AMDGPU/SIInstrInfo.cpp

[... 2,073 lines not shown ...]
     if (isFLAT(MIb))
       return checkInstOffsetsDoNotOverlap(MIa, MIb);

     return false;
   }

   return false;
 }

+static int64_t getFoldableImm(const MachineOperand* MO) {
+  if (!MO->isReg())
+    return false;
+  const MachineFunction *MF = MO->getParent()->getParent()->getParent();
+  const MachineRegisterInfo &MRI = MF->getRegInfo();
+  auto Def = MRI.getUniqueVRegDef(MO->getReg());
+  if (Def && (Def->isFullCopy() || Def->getOpcode() == AMDGPU::S_MOV_B32 ||
+              Def->getOpcode() == AMDGPU::V_MOV_B32_e32) &&
+      Def->getOperand(1).isImm())
+    return Def->getOperand(1).getImm();
+  return AMDGPU::NoRegister;
+}
+
 MachineInstr *SIInstrInfo::convertToThreeAddress(MachineFunction::iterator &MBB,
                                                  MachineInstr &MI,
                                                  LiveVariables *LV) const {
   bool IsF16 = false;

   switch (MI.getOpcode()) {
   default:
     return nullptr;
[... 21 lines not shown ...]
   const MachineOperand *Src0Mods =
     getNamedOperand(MI, AMDGPU::OpName::src0_modifiers);
   const MachineOperand *Src1 = getNamedOperand(MI, AMDGPU::OpName::src1);
   const MachineOperand *Src1Mods =
     getNamedOperand(MI, AMDGPU::OpName::src1_modifiers);
   const MachineOperand *Src2 = getNamedOperand(MI, AMDGPU::OpName::src2);
   const MachineOperand *Clamp = getNamedOperand(MI, AMDGPU::OpName::clamp);
   const MachineOperand *Omod = getNamedOperand(MI, AMDGPU::OpName::omod);

+  if (!IsF16 && !Src0Mods && !Src1Mods && !Clamp && !Omod) {
+    if (auto Imm = getFoldableImm(Src2))
+      return BuildMI(*MBB, MI, MI.getDebugLoc(), get(AMDGPU::V_MADAK_F32))
+               .add(*Dst)
+               .add(*Src0)
+               .add(*Src1)
+               .addImm(Imm);
+
+    if (auto Imm = getFoldableImm(Src1))
+      return BuildMI(*MBB, MI, MI.getDebugLoc(), get(AMDGPU::V_MADMK_F32))
+               .add(*Dst)
+               .add(*Src0)
+               .addImm(Imm)
+               .add(*Src2);
+
+    if (auto Imm = getFoldableImm(Src0))
+      if (isOperandLegal(MI, AMDGPU::getNamedOperandIdx(AMDGPU::V_MADMK_F32,
+                           AMDGPU::OpName::src0), Src1))
+        return BuildMI(*MBB, MI, MI.getDebugLoc(), get(AMDGPU::V_MADMK_F32))
+                 .add(*Dst)
+                 .add(*Src1)
+                 .addImm(Imm)
+                 .add(*Src2);
+  }
+
   return BuildMI(*MBB, MI, MI.getDebugLoc(),
                  get(IsF16 ? AMDGPU::V_MAD_F16 : AMDGPU::V_MAD_F32))
       .add(*Dst)
       .addImm(Src0Mods ? Src0Mods->getImm() : 0)
       .add(*Src0)
       .addImm(Src1Mods ? Src1Mods->getImm() : 0)
       .add(*Src1)
       .addImm(0) // Src mods
[... 2,298 lines not shown ...]

test/CodeGen/AMDGPU/madak.ll

[... 28 lines not shown ...]
 ; optimization and if we fold the immediate multiple times, we'll undo
 ; it.

 ; GCN-LABEL: {{^}}madak_2_use_f32:
 ; GCN-DAG: buffer_load_dword [[VA:v[0-9]+]], {{v\[[0-9]+:[0-9]+\]}}, {{s\[[0-9]+:[0-9]+\]}}, 0 addr64{{$}}
 ; GCN-DAG: buffer_load_dword [[VB:v[0-9]+]], {{v\[[0-9]+:[0-9]+\]}}, {{s\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:4
 ; GCN-DAG: buffer_load_dword [[VC:v[0-9]+]], {{v\[[0-9]+:[0-9]+\]}}, {{s\[[0-9]+:[0-9]+\]}}, 0 addr64 offset:8
 ; GCN-DAG: v_mov_b32_e32 [[VK:v[0-9]+]], 0x41200000
-; GCN-DAG: v_mad_f32 {{v[0-9]+}}, [[VA]], [[VB]], [[VK]]
+; GCN-DAG: v_madak_f32 {{v[0-9]+}}, [[VA]], [[VB]], 0x41200000
 ; GCN-DAG: v_mac_f32_e32 [[VK]], [[VA]], [[VC]]
 ; GCN: s_endpgm
 define amdgpu_kernel void @madak_2_use_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) nounwind {
   %tid = tail call i32 @llvm.amdgcn.workitem.id.x() nounwind readnone

   %in.gep.0 = getelementptr float, float addrspace(1)* %in, i32 %tid
   %in.gep.1 = getelementptr float, float addrspace(1)* %in.gep.0, i32 1
   %in.gep.2 = getelementptr float, float addrspace(1)* %in.gep.0, i32 2

   %out.gep.0 = getelementptr float, float addrspace(1)* %out, i32 %tid
   %out.gep.1 = getelementptr float, float addrspace(1)* %in.gep.0, i32 1
[... 174 lines not shown ...]

test/CodeGen/AMDGPU/twoaddr-mad.mir

This file was added.

+# RUN: llc -march=amdgcn %s -run-pass twoaddressinstruction -verify-machineinstrs -o - | FileCheck -check-prefix=GCN %s
+
+# GCN-LABEL: name: test_madmk_reg_imm
+# GCN: V_MADMK_F32 killed %0.sub0, 1078523331, killed %1, implicit %exec
+---
+name: test_madmk_reg_imm
+registers:
+  - { id: 0, class: vreg_64 }
+  - { id: 1, class: vgpr_32 }
+  - { id: 2, class: vgpr_32 }
+  - { id: 3, class: vgpr_32 }
+body: |
+  bb.0:
+
+    %0 = IMPLICIT_DEF
+    %1 = COPY %0.sub1
+    %2 = V_MOV_B32_e32 1078523331, implicit %exec
+    %3 = V_MAC_F32_e32 killed %0.sub0, %2, killed %1, implicit %exec
+
+...
+
+# GCN-LABEL: name: test_madmk_imm_reg
+# GCN: V_MADMK_F32 killed %0.sub0, 1078523331, killed %1, implicit %exec
+---
+name: test_madmk_imm_reg
+registers:
+  - { id: 0, class: vreg_64 }
+  - { id: 1, class: vgpr_32 }
+  - { id: 2, class: vgpr_32 }
+  - { id: 3, class: vgpr_32 }
+body: |
+  bb.0:
+
+    %0 = IMPLICIT_DEF
+    %1 = COPY %0.sub1
+    %2 = V_MOV_B32_e32 1078523331, implicit %exec
+    %3 = V_MAC_F32_e32 %2, killed %0.sub0, killed %1, implicit %exec
+
+...
+
+# GCN-LABEL: name: test_madak
+# GCN: V_MADAK_F32 killed %0.sub0, %0.sub1, 1078523331, implicit %exec
+---
+name: test_madak
+registers:
+  - { id: 0, class: vreg_64 }
+  - { id: 1, class: vgpr_32 }
+  - { id: 2, class: vgpr_32 }
+body: |
+  bb.0:
+
+    %0 = IMPLICIT_DEF
+    %1 = V_MOV_B32_e32 1078523331, implicit %exec
+    %2 = V_MAC_F32_e32 killed %0.sub0, %0.sub1, %1, implicit %exec
+
+...

test/CodeGen/AMDGPU/v_madak_f16.ll

[... 17 lines not shown ...]
 entry:
   %t.val = fmul half %a.val, %b.val
   %r.val = fadd half %t.val, 10.0

   store half %r.val, half addrspace(1)* %r
   ret void
 }

 ; GCN-LABEL: {{^}}madak_f16_use_2
-; SI: v_mad_f32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
+; SI: v_madak_f32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}, 0x41200000
 ; SI: v_mac_f32_e32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
 ; VI: v_mad_f16 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
 ; VI: v_mac_f16_e32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
 ; GCN: s_endpgm
 define amdgpu_kernel void @madak_f16_use_2(
     half addrspace(1)* %r0,
     half addrspace(1)* %r1,
     half addrspace(1)* %a,
[... 16 lines not shown ...]