This is an archive of the discontinued LLVM Phabricator instance.

Differential D55380

[AMDGPU] Shrink scalar AND, OR, XOR instructions
ClosedPublic

Authored by grahamsellers on Dec 6 2018, 11:36 AM.

Details

Reviewers

arsenm
nhaehnle

Summary

This change attempts to shrink scalar AND, OR and XOR instructions which take an immediate that isn't inlineable.

It performs:
AND s0, s0, ~(1 << n) -> BITSET0 s0, n
OR s0, s0, (1 << n) -> BITSET1 s0, n
AND s0, s1, x -> ANDN2 s0, s1, ~x
OR s0, s1, x -> ORN2 s0, s1, ~x
XOR s0, s1, x -> XNOR s0, s1, ~x

In particular, this catches clearing and setting the sign bit for fabs (and x, 0x7fffffff -> bitset0 x, 31; or x, 0x80000000 -> bitset1 x, 31).
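
The rewrites listed in the summary rest on a few 32-bit identities. The following is a minimal sketch that checks them at compile time. The helper names (bitset0, bitset1, andn2, orn2, xnor) are hypothetical stand-ins that model only the arithmetic of the corresponding scalar instructions; they are not code from this patch.

```cpp
#include <cstdint>

// Hypothetical models of the scalar ALU results (arithmetic only).
constexpr uint32_t bitset0(uint32_t s, unsigned n) { return s & ~(UINT32_C(1) << n); }
constexpr uint32_t bitset1(uint32_t s, unsigned n) { return s | (UINT32_C(1) << n); }
constexpr uint32_t andn2(uint32_t a, uint32_t b) { return a & ~b; }
constexpr uint32_t orn2(uint32_t a, uint32_t b) { return a | ~b; }
constexpr uint32_t xnor(uint32_t a, uint32_t b) { return ~(a ^ b); }

// Arbitrary sample register value and a literal x whose inverse is small.
constexpr uint32_t kValue = 0xDEADBEEFu;
constexpr uint32_t kX = 0xFFFFFFCDu; // -51, so ~kX == 50

// AND with ~(1 << n) clears bit n; OR with (1 << n) sets bit n.
static_assert((kValue & 0x7FFFFFFFu) == bitset0(kValue, 31), "and -> bitset0");
static_assert((kValue | 0x80000000u) == bitset1(kValue, 31), "or -> bitset1");

// Inverting the literal and switching to the N2/XNOR form keeps the result.
static_assert((kValue & kX) == andn2(kValue, ~kX), "and -> andn2");
static_assert((kValue | kX) == orn2(kValue, ~kX), "or -> orn2");
static_assert((kValue ^ kX) == xnor(kValue, ~kX), "xor -> xnor");
```

The last identity is the one spelled out in the patch's doc comment: a ^ b == ~(a ^ ~b).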

Event Timeline

grahamsellers created this revision. Dec 6 2018, 11:36 AM

Herald added subscribers: llvm-commits, t-tye, tpr and 5 others. Dec 6 2018, 11:36 AM

arsenm added inline comments. Dec 6 2018, 11:49 AM

lib/Target/AMDGPU/SIShrinkInstructions.cpp
515	This loop is starting to get big, can you move this to a function?
529–532	Is there any real reason you need to handle this? Constants are canonicalized to the RHS (we just undo this for VALU instructions because that's the only way to shrink)

arsenm added inline comments. Dec 6 2018, 12:18 PM

test/CodeGen/AMDGPU/andorbitset.ll
2–7	The check labels are confusing. You should stop using FUNC here at all, and just use GCN. These are also both SI run lines, so I don't see why that's separate
16	Why only SI?

Addressing review comments.

grahamsellers marked 4 inline comments as done. Dec 6 2018, 1:22 PM

grahamsellers added inline comments.

test/CodeGen/AMDGPU/andorbitset.ll
16	I'm not sure what's going on but if you supply only -march=amdgcn and no -mcpu, then it appears the register allocator doesn't end up allocating the source and destination of the s_or_b32 instruction to the same register even though we use setRegAllocationHint. That part of the code was modeled on a similar chunk for S_MULK_I32 earlier in the function. Interestingly, the s_mulk.ll test doesn't have a RUN line which doesn't specify the CPU. I'll do the same.

grahamsellers marked an inline comment as done. Dec 6 2018, 1:22 PM

arsenm added inline comments. Dec 6 2018, 6:47 PM

test/CodeGen/AMDGPU/andorbitset.ll
16	You should generally just pick an explicit target. The default cpu is something approximating Tahiti but isn't quite the same

LGTM

This revision is now accepted and ready to land. Dec 6 2018, 6:54 PM

Do you need someone to commit this?

Herald added a subscriber: kerbowa. Mar 30 2020, 4:39 PM

foad added a subscriber: foad. Mar 31 2020, 1:45 AM

foad added inline comments.

lib/Target/AMDGPU/SIShrinkInstructions.cpp
215–218	Is it worth trying to do the converse too (ANDN2 -> AND, ORN2 -> OR, XNOR -> XOR)? Or would those cases never appear in the first place?
231–232	Return early if `!SrcImm->isImm() || AMDGPU::isInlinableLiteral32(SrcImm->getImm(), ST.hasInv2PiInlineImm())`.
261–262	What does this do? I can't see how `SrcImm == Src0` would ever be true here.
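
foad's suggestion at 231–232 amounts to a guard clause. As a rough standalone sketch (not the patch's code), the selection logic with an early return might look like the following. The Op enum, the Shrunk struct and all helper names are hypothetical, and isInlinableInt is a simplified stand-in that covers only the integer inline constants -16..64, whereas AMDGPU::isInlinableLiteral32 also accepts certain floating-point bit patterns.

```cpp
#include <cstdint>

// Hypothetical model types; the real pass rewrites MachineInstr operands.
enum class Op { And, Or, Xor, BitSet0, BitSet1, AndN2, OrN2, XNor };

struct Shrunk {
  Op op;
  uint32_t imm;
};

// Simplified stand-in: integer inline constants only (-16..64).
constexpr bool isInlinableInt(uint32_t imm) {
  return imm <= 64u || imm >= 0xFFFFFFF0u;
}

constexpr bool isPowerOf2(uint32_t v) { return v != 0 && (v & (v - 1)) == 0; }

// Index of the lowest set bit; only called with a non-zero argument.
constexpr unsigned lowestSetBit(uint32_t v) {
  unsigned n = 0;
  while ((v & 1u) == 0) {
    v >>= 1;
    ++n;
  }
  return n;
}

constexpr Shrunk shrinkLogicOp(Op op, uint32_t imm) {
  // Guard clause: an inlinable literal is already free, so leave it alone.
  if (isInlinableInt(imm))
    return {op, imm};

  switch (op) {
  case Op::And:
    if (isPowerOf2(~imm))
      return {Op::BitSet0, lowestSetBit(~imm)};
    if (isInlinableInt(~imm))
      return {Op::AndN2, ~imm};
    break;
  case Op::Or:
    if (isPowerOf2(imm))
      return {Op::BitSet1, lowestSetBit(imm)};
    if (isInlinableInt(~imm))
      return {Op::OrN2, ~imm};
    break;
  case Op::Xor:
    if (isInlinableInt(~imm))
      return {Op::XNor, ~imm};
    break;
  default:
    break;
  }
  return {op, imm}; // no cheaper encoding found
}
```

With these stand-ins, and 0x7fffffff maps to BitSet0 31 and or -51 maps to OrN2 50, in line with the expectations in andorbitset.ll and andorxorinvimm.ll, while and -2 and or 1 are left alone because both literals are already inlinable.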

This looks like it was already committed way back in b297379ef07829ac7f06c0e2058a889366c46a82

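Revision Contents

Path	Size
lib/Target/AMDGPU/SIShrinkInstructions.cpp	84 lines
test/CodeGen/AMDGPU/andorbitset.ll	49 lines
test/CodeGen/AMDGPU/andorxorinvimm.ll	49 lines
test/CodeGen/AMDGPU/fabs.ll	9 lines
test/CodeGen/AMDGPU/fneg-fabs.ll	5 lines
test/CodeGen/AMDGPU/gep-address-space.ll	2 lines
test/CodeGen/AMDGPU/local-64.ll	2 lines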

Diff 177034

lib/Target/AMDGPU/SIShrinkInstructions.cpp

	[... 92 lines hidden ...]
	const MCInstrDesc &NewDesc = TII->get(SOPKOpc);			const MCInstrDesc &NewDesc = TII->get(SOPKOpc);

	if ((TII->sopkIsZext(SOPKOpc) && isKUImmOperand(TII, Src1)) ||			if ((TII->sopkIsZext(SOPKOpc) && isKUImmOperand(TII, Src1)) ||
	(!TII->sopkIsZext(SOPKOpc) && isKImmOperand(TII, Src1))) {			(!TII->sopkIsZext(SOPKOpc) && isKImmOperand(TII, Src1))) {
	MI.setDesc(NewDesc);			MI.setDesc(NewDesc);
	}			}
	}			}

				/// Attempt to shink AND/OR/XOR operations requiring non-inlineable literals.
				/// For AND or OR, try using S_BITSET{0,1} to clear or set bits.
				/// If the inverse of the immediate is legal, use ANDN2, ORN2 or
				/// XNOR (as a ^ b == ~(a ^ ~b)).
				[inline comment by foad] Is it worth trying to do the converse too (ANDN2 -> AND, ORN2 -> OR, XNOR -> XOR)? Or would those cases never appear in the first place?
				/// \returns true if the caller should continue the machine function iterator
				static bool shrinkScalarLogicOp(const GCNSubtarget &ST,
				MachineRegisterInfo &MRI,
				const SIInstrInfo *TII,
				MachineInstr &MI) {
				unsigned Opc = MI.getOpcode();
				const MachineOperand *Dest = &MI.getOperand(0);
				MachineOperand *Src0 = &MI.getOperand(1);
				MachineOperand *Src1 = &MI.getOperand(2);
				MachineOperand *SrcReg = Src0;
				MachineOperand *SrcImm = Src1;

				if (SrcImm->isImm() &&
				!AMDGPU::isInlinableLiteral32(SrcImm->getImm(), ST.hasInv2PiInlineImm())) {
				[inline comment by foad] Return early if `!SrcImm->isImm() || AMDGPU::isInlinableLiteral32(SrcImm->getImm(), ST.hasInv2PiInlineImm())`.
				uint32_t Imm = static_cast<uint32_t>(SrcImm->getImm());
				uint32_t NewImm = 0;

				if (Opc == AMDGPU::S_AND_B32) {
				if (isPowerOf2_32(~Imm)) {
				NewImm = countTrailingOnes(Imm);
				Opc = AMDGPU::S_BITSET0_B32;
				} else if (AMDGPU::isInlinableLiteral32(~Imm, ST.hasInv2PiInlineImm())) {
				NewImm = ~Imm;
				Opc = AMDGPU::S_ANDN2_B32;
				}
				} else if (Opc == AMDGPU::S_OR_B32) {
				if (isPowerOf2_32(Imm)) {
				NewImm = countTrailingZeros(Imm);
				Opc = AMDGPU::S_BITSET1_B32;
				} else if (AMDGPU::isInlinableLiteral32(~Imm, ST.hasInv2PiInlineImm())) {
				NewImm = ~Imm;
				Opc = AMDGPU::S_ORN2_B32;
				}
				} else if (Opc == AMDGPU::S_XOR_B32) {
				if (AMDGPU::isInlinableLiteral32(~Imm, ST.hasInv2PiInlineImm())) {
				NewImm = ~Imm;
				Opc = AMDGPU::S_XNOR_B32;
				}
				} else {
				llvm_unreachable("unexpected opcode");
				}

				if ((Opc == AMDGPU::S_ANDN2_B32 || Opc == AMDGPU::S_ORN2_B32) &&
				SrcImm == Src0) {
				[inline comment by foad] What does this do? I can't see how `SrcImm == Src0` would ever be true here.
				if (!TII->commuteInstruction(MI, false, 1, 2))
				NewImm = 0;
				}

				if (NewImm != 0) {
				if (TargetRegisterInfo::isVirtualRegister(Dest->getReg()) &&
				SrcReg->isReg()) {
				MRI.setRegAllocationHint(Dest->getReg(), 0, SrcReg->getReg());
				MRI.setRegAllocationHint(SrcReg->getReg(), 0, Dest->getReg());
				return true;
				}

				if (SrcReg->isReg() && SrcReg->getReg() == Dest->getReg()) {
				MI.setDesc(TII->get(Opc));
				if (Opc == AMDGPU::S_BITSET0_B32 ||
				Opc == AMDGPU::S_BITSET1_B32) {
				Src0->ChangeToImmediate(NewImm);
				MI.RemoveOperand(2);
				} else {
				SrcImm->setImm(NewImm);
				}
				}
				}
				}

				return false;
				}

	// This is the same as MachineInstr::readsRegister/modifiesRegister except			// This is the same as MachineInstr::readsRegister/modifiesRegister except
	// it takes subregs into account.			// it takes subregs into account.
	static bool instAccessReg(iterator_range<MachineInstr::const_mop_iterator> &&R,			static bool instAccessReg(iterator_range<MachineInstr::const_mop_iterator> &&R,
	unsigned Reg, unsigned SubReg,			unsigned Reg, unsigned SubReg,
	const SIRegisterInfo &TRI) {			const SIRegisterInfo &TRI) {
	for (const MachineOperand &MO : R) {			for (const MachineOperand &MO : R) {
	if (!MO.isReg())			if (!MO.isReg())
	continue;			continue;
	[... 108 lines hidden ...]
	// =>			// =>
	// s_nop (N + M)			// s_nop (N + M)
	if (MI.getOpcode() == AMDGPU::S_NOP &&			if (MI.getOpcode() == AMDGPU::S_NOP &&
	Next != MBB.end() &&			Next != MBB.end() &&
	(*Next).getOpcode() == AMDGPU::S_NOP) {			(*Next).getOpcode() == AMDGPU::S_NOP) {

	MachineInstr &NextMI = *Next;			MachineInstr &NextMI = *Next;
	// The instruction encodes the amount to wait with an offset of 1,			// The instruction encodes the amount to wait with an offset of 1,
	// i.e. 0 is wait 1 cycle. Convert both to cycles and then convert back			// i.e. 0 is wait 1 cycle. Convert both to cycles and then convert back
				[inline comment by arsenm] This loop is starting to get big, can you move this to a function?
	// after adding.			// after adding.
	uint8_t Nop0 = MI.getOperand(0).getImm() + 1;			uint8_t Nop0 = MI.getOperand(0).getImm() + 1;
	uint8_t Nop1 = NextMI.getOperand(0).getImm() + 1;			uint8_t Nop1 = NextMI.getOperand(0).getImm() + 1;

	// Make sure we don't overflow the bounds.			// Make sure we don't overflow the bounds.
	if (Nop0 + Nop1 <= 8) {			if (Nop0 + Nop1 <= 8) {
	NextMI.getOperand(0).setImm(Nop0 + Nop1 - 1);			NextMI.getOperand(0).setImm(Nop0 + Nop1 - 1);
	MI.eraseFromParent();			MI.eraseFromParent();
	}			}

	continue;			continue;
	}			}

	// FIXME: We also need to consider movs of constant operands since			// FIXME: We also need to consider movs of constant operands since
	// immediate operands are not folded if they have more than one use, and			// immediate operands are not folded if they have more than one use, and
	// the operand folding pass is unaware if the immediate will be free since			// the operand folding pass is unaware if the immediate will be free since
	// it won't know if the src == dest constraint will end up being			// it won't know if the src == dest constraint will end up being
				[inline comment by arsenm] Is there any real reason you need to handle this? Constants are canonicalized to the RHS (we just undo this for VALU instructions because that's the only way to shrink)
	// satisfied.			// satisfied.
	if (MI.getOpcode() == AMDGPU::S_ADD_I32 ||			if (MI.getOpcode() == AMDGPU::S_ADD_I32 ||
	MI.getOpcode() == AMDGPU::S_MUL_I32) {			MI.getOpcode() == AMDGPU::S_MUL_I32) {
	const MachineOperand *Dest = &MI.getOperand(0);			const MachineOperand *Dest = &MI.getOperand(0);
	MachineOperand *Src0 = &MI.getOperand(1);			MachineOperand *Src0 = &MI.getOperand(1);
	MachineOperand *Src1 = &MI.getOperand(2);			MachineOperand *Src1 = &MI.getOperand(2);

	if (!Src0->isReg() && Src1->isReg()) {			if (!Src0->isReg() && Src1->isReg()) {
	[... 42 lines hidden ...]
	MI.setDesc(TII->get(AMDGPU::S_BREV_B32));			MI.setDesc(TII->get(AMDGPU::S_BREV_B32));
	Src.setImm(ReverseImm);			Src.setImm(ReverseImm);
	}			}
	}			}

	continue;			continue;
	}			}

				// Shrink scalar logic operations.
				if (MI.getOpcode() == AMDGPU::S_AND_B32 ||
				MI.getOpcode() == AMDGPU::S_OR_B32 ||
				MI.getOpcode() == AMDGPU::S_XOR_B32) {
				if (shrinkScalarLogicOp(ST, MRI, TII, MI))
				continue;
				}

	if (!TII->hasVALU32BitEncoding(MI.getOpcode()))			if (!TII->hasVALU32BitEncoding(MI.getOpcode()))
	continue;			continue;

	if (!TII->canShrink(MI, MRI)) {			if (!TII->canShrink(MI, MRI)) {
	// Try commuting the instruction and see if that enables us to shrink			// Try commuting the instruction and see if that enables us to shrink
	// it.			// it.
	if (!MI.isCommutable() || !TII->commuteInstruction(MI) ||			if (!MI.isCommutable() || !TII->commuteInstruction(MI) ||
	!TII->canShrink(MI, MRI))			!TII->canShrink(MI, MRI))
	[... 86 lines hidden ...]

test/CodeGen/AMDGPU/andorbitset.ll

This file was added.

				; RUN: llc -march=amdgcn -mcpu=tahiti -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=SI %s

				; SI-LABEL: {{^}}s_clear_msb:
				; SI: s_bitset0_b32 s{{[0-9]+}}, 31
				define amdgpu_kernel void @s_clear_msb(i32 addrspace(1)* %out, i32 %in) {
				%x = and i32 %in, 2147483647
				store i32 %x, i32 addrspace(1)* %out
				[inline comment by arsenm] The check labels are confusing. You should stop using FUNC here at all, and just use GCN. These are also both SI run lines, so I don't see why that's separate
				ret void
				}

				; SI-LABEL: {{^}}s_set_msb:
				; SI: s_bitset1_b32 s{{[0-9]+}}, 31
				define amdgpu_kernel void @s_set_msb(i32 addrspace(1)* %out, i32 %in) {
				%x = or i32 %in, 2147483648
				store i32 %x, i32 addrspace(1)* %out
				ret void
				[inline comment by arsenm] Why only SI?
				[inline comment by grahamsellers] I'm not sure what's going on but if you supply only -march=amdgcn and no -mcpu, then it appears the register allocator doesn't end up allocating the source and destination of the s_or_b32 instruction to the same register even though we use setRegAllocationHint. That part of the code was modeled on a similar chunk for S_MULK_I32 earlier in the function. Interestingly, the s_mulk.ll test doesn't have a RUN line which doesn't specify the CPU. I'll do the same.
				[inline comment by arsenm] You should generally just pick an explicit target. The default cpu is something approximating Tahiti but isn't quite the same
				}

				; SI-LABEL: {{^}}s_clear_lsb:
				; SI: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, -2
				define amdgpu_kernel void @s_clear_lsb(i32 addrspace(1)* %out, i32 %in) {
				%x = and i32 %in, 4294967294
				store i32 %x, i32 addrspace(1)* %out
				ret void
				}

				; SI-LABEL: {{^}}s_set_lsb:
				; SI: s_or_b32 s{{[0-9]+}}, s{{[0-9]+}}, 1
				define amdgpu_kernel void @s_set_lsb(i32 addrspace(1)* %out, i32 %in) {
				%x = or i32 %in, 1
				store i32 %x, i32 addrspace(1)* %out
				ret void
				}

				; SI-LABEL: {{^}}s_clear_midbit:
				; SI: s_bitset0_b32 s{{[0-9]+}}, 8
				define amdgpu_kernel void @s_clear_midbit(i32 addrspace(1)* %out, i32 %in) {
				%x = and i32 %in, 4294967039
				store i32 %x, i32 addrspace(1)* %out
				ret void
				}

				; SI-LABEL: {{^}}s_set_midbit:
				; SI: s_bitset1_b32 s{{[0-9]+}}, 8
				define amdgpu_kernel void @s_set_midbit(i32 addrspace(1)* %out, i32 %in) {
				%x = or i32 %in, 256
				store i32 %x, i32 addrspace(1)* %out
				ret void
				}

test/CodeGen/AMDGPU/andorxorinvimm.ll

This file was added.

				; RUN: llc -march=amdgcn -mcpu=tahiti -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=SI %s

				; SI-LABEL: {{^}}s_or_to_orn2:
				; SI: s_orn2_b32 s{{[0-9]+}}, s{{[0-9]+}}, 50
				define amdgpu_kernel void @s_or_to_orn2(i32 addrspace(1)* %out, i32 %in) {
				%x = or i32 %in, -51
				store i32 %x, i32 addrspace(1)* %out
				ret void
				}

				; SI-LABEL: {{^}}s_or_to_orn2_imm0:
				; SI: s_orn2_b32 s{{[0-9]+}}, s{{[0-9]+}}, 50
				define amdgpu_kernel void @s_or_to_orn2_imm0(i32 addrspace(1)* %out, i32 %in) {
				%x = or i32 -51, %in
				store i32 %x, i32 addrspace(1)* %out
				ret void
				}

				; SI-LABEL: {{^}}s_and_to_andn2:
				; SI: s_andn2_b32 s{{[0-9]+}}, s{{[0-9]+}}, 50
				define amdgpu_kernel void @s_and_to_andn2(i32 addrspace(1)* %out, i32 %in) {
				%x = and i32 %in, -51
				store i32 %x, i32 addrspace(1)* %out
				ret void
				}

				; SI-LABEL: {{^}}s_and_to_andn2_imm0:
				; SI: s_andn2_b32 s{{[0-9]+}}, s{{[0-9]+}}, 50
				define amdgpu_kernel void @s_and_to_andn2_imm0(i32 addrspace(1)* %out, i32 %in) {
				%x = and i32 -51, %in
				store i32 %x, i32 addrspace(1)* %out
				ret void
				}

				; SI-LABEL: {{^}}s_xor_to_xnor:
				; SI: s_xnor_b32 s{{[0-9]+}}, s{{[0-9]+}}, 50
				define amdgpu_kernel void @s_xor_to_xnor(i32 addrspace(1)* %out, i32 %in) {
				%x = xor i32 %in, -51
				store i32 %x, i32 addrspace(1)* %out
				ret void
				}

				; SI-LABEL: {{^}}s_xor_to_xnor_imm0:
				; SI: s_xnor_b32 s{{[0-9]+}}, s{{[0-9]+}}, 50
				define amdgpu_kernel void @s_xor_to_xnor_imm0(i32 addrspace(1)* %out, i32 %in) {
				%x = xor i32 -51, %in
				store i32 %x, i32 addrspace(1)* %out
				ret void
				}

test/CodeGen/AMDGPU/fabs.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=SI -check-prefix=FUNC %s			; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=SI -check-prefix=FUNC %s
	; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VI -check-prefix=FUNC %s			; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VI -check-prefix=FUNC %s
	; RUN: llc -march=r600 -mcpu=redwood < %s | FileCheck -check-prefix=R600 -check-prefix=FUNC %s			; RUN: llc -march=r600 -mcpu=redwood < %s | FileCheck -check-prefix=R600 -check-prefix=FUNC %s


	; DAGCombiner will transform:			; DAGCombiner will transform:
	; (fabs (f32 bitcast (i32 a))) => (f32 bitcast (and (i32 a), 0x7FFFFFFF))			; (fabs (f32 bitcast (i32 a))) => (f32 bitcast (and (i32 a), 0x7FFFFFFF))
	; unless isFabsFree returns true			; unless isFabsFree returns true

	; FUNC-LABEL: {{^}}s_fabs_fn_free:			; FUNC-LABEL: {{^}}s_fabs_fn_free:
	; R600-NOT: AND			; R600-NOT: AND
	; R600: |PV.{{[XYZW]}}|			; R600: |PV.{{[XYZW]}}|

	; GCN: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 0x7fffffff			; SI: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 0x7fffffff
				; VI: s_bitset0_b32 s{{[0-9]+}}, 31
	define amdgpu_kernel void @s_fabs_fn_free(float addrspace(1)* %out, i32 %in) {			define amdgpu_kernel void @s_fabs_fn_free(float addrspace(1)* %out, i32 %in) {
	%bc= bitcast i32 %in to float			%bc= bitcast i32 %in to float
	%fabs = call float @fabs(float %bc)			%fabs = call float @fabs(float %bc)
	store float %fabs, float addrspace(1)* %out			store float %fabs, float addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}s_fabs_free:			; FUNC-LABEL: {{^}}s_fabs_free:
	; R600-NOT: AND			; R600-NOT: AND
	; R600: |PV.{{[XYZW]}}|			; R600: |PV.{{[XYZW]}}|

	; GCN: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 0x7fffffff			; SI: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 0x7fffffff
				; VI: s_bitset0_b32 s{{[0-9]+}}, 31
	define amdgpu_kernel void @s_fabs_free(float addrspace(1)* %out, i32 %in) {			define amdgpu_kernel void @s_fabs_free(float addrspace(1)* %out, i32 %in) {
	%bc= bitcast i32 %in to float			%bc= bitcast i32 %in to float
	%fabs = call float @llvm.fabs.f32(float %bc)			%fabs = call float @llvm.fabs.f32(float %bc)
	store float %fabs, float addrspace(1)* %out			store float %fabs, float addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}s_fabs_f32:			; FUNC-LABEL: {{^}}s_fabs_f32:
	; R600: |{{(PV|T[0-9])\.[XYZW]}}|			; R600: |{{(PV|T[0-9])\.[XYZW]}}|

	; GCN: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 0x7fffffff			; SI: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 0x7fffffff
				; VI: s_bitset0_b32 s{{[0-9]+}}, 31
	define amdgpu_kernel void @s_fabs_f32(float addrspace(1)* %out, float %in) {			define amdgpu_kernel void @s_fabs_f32(float addrspace(1)* %out, float %in) {
	%fabs = call float @llvm.fabs.f32(float %in)			%fabs = call float @llvm.fabs.f32(float %in)
	store float %fabs, float addrspace(1)* %out			store float %fabs, float addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}fabs_v2f32:			; FUNC-LABEL: {{^}}fabs_v2f32:
	; R600: |{{(PV|T[0-9])\.[XYZW]}}|			; R600: |{{(PV|T[0-9])\.[XYZW]}}|
	[... 68 lines hidden ...]

test/CodeGen/AMDGPU/fneg-fabs.ll

	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s			; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=SI -check-prefix=FUNC %s
	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s			; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=VI -check-prefix=FUNC %s
	; RUN: llc -amdgpu-scalarize-global-loads=false -march=r600 -mcpu=redwood < %s | FileCheck -check-prefix=R600 -check-prefix=FUNC %s			; RUN: llc -amdgpu-scalarize-global-loads=false -march=r600 -mcpu=redwood < %s | FileCheck -check-prefix=R600 -check-prefix=FUNC %s

	; FUNC-LABEL: {{^}}fneg_fabs_fadd_f32:			; FUNC-LABEL: {{^}}fneg_fabs_fadd_f32:
	; SI-NOT: and			; SI-NOT: and
	; SI: v_sub_f32_e64 {{v[0-9]+}}, {{v[0-9]+}}, |{{s[0-9]+}}|			; SI: v_sub_f32_e64 {{v[0-9]+}}, {{v[0-9]+}}, |{{s[0-9]+}}|
	define amdgpu_kernel void @fneg_fabs_fadd_f32(float addrspace(1)* %out, float %x, float %y) {			define amdgpu_kernel void @fneg_fabs_fadd_f32(float addrspace(1)* %out, float %x, float %y) {
	%fabs = call float @llvm.fabs.f32(float %x)			%fabs = call float @llvm.fabs.f32(float %x)
	%fsub = fsub float -0.000000e+00, %fabs			%fsub = fsub float -0.000000e+00, %fabs
	[... 19 lines hidden ...]
	; unless isFabsFree returns true			; unless isFabsFree returns true

	; FUNC-LABEL: {{^}}fneg_fabs_free_f32:			; FUNC-LABEL: {{^}}fneg_fabs_free_f32:
	; R600-NOT: AND			; R600-NOT: AND
	; R600: |PV.{{[XYZW]}}|			; R600: |PV.{{[XYZW]}}|
	; R600: -PV			; R600: -PV

	; SI: s_or_b32 s{{[0-9]+}}, s{{[0-9]+}}, 0x80000000			; SI: s_or_b32 s{{[0-9]+}}, s{{[0-9]+}}, 0x80000000
				; VI: s_bitset1_b32 s{{[0-9]+}}, 31
	define amdgpu_kernel void @fneg_fabs_free_f32(float addrspace(1)* %out, i32 %in) {			define amdgpu_kernel void @fneg_fabs_free_f32(float addrspace(1)* %out, i32 %in) {
	%bc = bitcast i32 %in to float			%bc = bitcast i32 %in to float
	%fabs = call float @llvm.fabs.f32(float %bc)			%fabs = call float @llvm.fabs.f32(float %bc)
	%fsub = fsub float -0.000000e+00, %fabs			%fsub = fsub float -0.000000e+00, %fabs
	store float %fsub, float addrspace(1)* %out			store float %fsub, float addrspace(1)* %out
	ret void			ret void
	}			}

	[... 67 lines hidden ...]

test/CodeGen/AMDGPU/gep-address-space.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs< %s | FileCheck --check-prefix=SI --check-prefix=CHECK %s			; RUN: llc -march=amdgcn -verify-machineinstrs< %s | FileCheck --check-prefix=SI --check-prefix=CHECK %s
	; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs< %s | FileCheck --check-prefix=CI --check-prefix=CHECK %s			; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs< %s | FileCheck --check-prefix=CI --check-prefix=CHECK %s
	; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs< %s | FileCheck --check-prefix=CI --check-prefix=CHECK %s			; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs< %s | FileCheck --check-prefix=CI --check-prefix=CHECK %s

	define amdgpu_kernel void @use_gep_address_space([1024 x i32] addrspace(3)* %array) nounwind {			define amdgpu_kernel void @use_gep_address_space([1024 x i32] addrspace(3)* %array) nounwind {
	; CHECK-LABEL: {{^}}use_gep_address_space:			; CHECK-LABEL: {{^}}use_gep_address_space:
	; CHECK: v_mov_b32_e32 [[PTR:v[0-9]+]], s{{[0-9]+}}			; CHECK: v_mov_b32_e32 [[PTR:v[0-9]+]], s{{[0-9]+}}
	; CHECK: ds_write_b32 [[PTR]], v{{[0-9]+}} offset:64			; CHECK: ds_write_b32 [[PTR]], v{{[0-9]+}} offset:64
	%p = getelementptr [1024 x i32], [1024 x i32] addrspace(3)* %array, i16 0, i16 16			%p = getelementptr [1024 x i32], [1024 x i32] addrspace(3)* %array, i16 0, i16 16
	store i32 99, i32 addrspace(3)* %p			store i32 99, i32 addrspace(3)* %p
	ret void			ret void
	}			}

	; CHECK-LABEL: {{^}}use_gep_address_space_large_offset:			; CHECK-LABEL: {{^}}use_gep_address_space_large_offset:
	; The LDS offset will be 65536 bytes, which is larger than the size of LDS on			; The LDS offset will be 65536 bytes, which is larger than the size of LDS on
	; SI, which is why it is being OR'd with the base pointer.			; SI, which is why it is being OR'd with the base pointer.
	; SI: s_or_b32			; SI: s_bitset1_b32
	; CI: s_add_i32			; CI: s_add_i32
	; CHECK: ds_write_b32			; CHECK: ds_write_b32
	define amdgpu_kernel void @use_gep_address_space_large_offset([1024 x i32] addrspace(3)* %array) nounwind {			define amdgpu_kernel void @use_gep_address_space_large_offset([1024 x i32] addrspace(3)* %array) nounwind {
	%p = getelementptr [1024 x i32], [1024 x i32] addrspace(3)* %array, i16 0, i16 16384			%p = getelementptr [1024 x i32], [1024 x i32] addrspace(3)* %array, i16 0, i16 16384
	store i32 99, i32 addrspace(3)* %p			store i32 99, i32 addrspace(3)* %p
	ret void			ret void
	}			}
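
The switch from s_or_b32 to s_bitset1_b32 in use_gep_address_space_large_offset follows from the byte offset being a single bit. A small sketch of the arithmetic, assuming 4-byte i32 elements; the constant and helper names are made up for illustration:

```cpp
#include <cstdint>

constexpr uint32_t kElemBytes = 4;   // size of one i32 element (assumed)
constexpr uint32_t kIndex = 16384;   // second GEP index used by the test
constexpr uint32_t kByteOffset = kIndex * kElemBytes;

// True when exactly one bit is set.
constexpr bool isSingleBit(uint32_t v) { return v != 0 && (v & (v - 1)) == 0; }

// Position of the highest set bit; for a single-bit value, the bit index.
constexpr unsigned bitIndex(uint32_t v) {
  unsigned n = 0;
  while (v > 1u) {
    v >>= 1;
    ++n;
  }
  return n;
}

static_assert(kByteOffset == 0x10000u, "offset is 65536 bytes");
static_assert(isSingleBit(kByteOffset), "OR with one bit can become bitset1");
static_assert(bitIndex(kByteOffset) == 16, "bit 16 is the one being set");
```

This agrees with the updated local-64.ll check, which expects s_bitset1_b32 with an operand of 16 for the same 0x10000 offset.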

	[... 46 lines hidden ...]

test/CodeGen/AMDGPU/local-64.ll

	[... 42 lines hidden ...]
	}			}

	; GCN-LABEL: {{^}}local_i8_load_over_i16_max_offset:			; GCN-LABEL: {{^}}local_i8_load_over_i16_max_offset:
	; SICIVI-DAG: s_mov_b32 m0			; SICIVI-DAG: s_mov_b32 m0
	; GFX9-NOT: m0			; GFX9-NOT: m0

	; The LDS offset will be 65536 bytes, which is larger than the size of LDS on			; The LDS offset will be 65536 bytes, which is larger than the size of LDS on
	; SI, which is why it is being OR'd with the base pointer.			; SI, which is why it is being OR'd with the base pointer.
	; SI-DAG: s_or_b32 [[ADDR:s[0-9]+]], s{{[0-9]+}}, 0x10000			; SI-DAG: s_bitset1_b32 [[ADDR:s[0-9]+]], 16
	; CI-DAG: s_add_i32 [[ADDR:s[0-9]+]], s{{[0-9]+}}, 0x10000			; CI-DAG: s_add_i32 [[ADDR:s[0-9]+]], s{{[0-9]+}}, 0x10000
	; VI-DAG: s_add_i32 [[ADDR:s[0-9]+]], s{{[0-9]+}}, 0x10000			; VI-DAG: s_add_i32 [[ADDR:s[0-9]+]], s{{[0-9]+}}, 0x10000
	; GFX9-DAG: s_add_i32 [[ADDR:s[0-9]+]], s{{[0-9]+}}, 0x10000			; GFX9-DAG: s_add_i32 [[ADDR:s[0-9]+]], s{{[0-9]+}}, 0x10000

	; GCN-DAG: v_mov_b32_e32 [[VREGADDR:v[0-9]+]], [[ADDR]]			; GCN-DAG: v_mov_b32_e32 [[VREGADDR:v[0-9]+]], [[ADDR]]
	; GCN: ds_read_u8 [[REG:v[0-9]+]], [[VREGADDR]]			; GCN: ds_read_u8 [[REG:v[0-9]+]], [[VREGADDR]]
	; GCN: buffer_store_byte [[REG]],			; GCN: buffer_store_byte [[REG]],
	define amdgpu_kernel void @local_i8_load_over_i16_max_offset(i8 addrspace(1)* %out, i8 addrspace(3)* %in) nounwind {			define amdgpu_kernel void @local_i8_load_over_i16_max_offset(i8 addrspace(1)* %out, i8 addrspace(3)* %in) nounwind {
	[... 92 lines hidden ...]