This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Supported ds_read_b128 generation; Widened vector length for local address-space
ClosedPublic

Authored by FarhanaAleen on Mar 7 2018, 8:06 AM.

Details

Reviewers

Commits

rGa7cb31123c25: [AMDGPU] Supported ds_read_b128 generation; Widened vector length for local…
rL327153: [AMDGPU] Supported ds_read_b128 generation; Widened vector length for local…

Summary

Starting with the second GCN generation, the ISA supports ds_read_b128 in addition to ds_read_b64. This patch adds an instruction selection pattern for ds_read_b128 so that the instruction can be generated.

In the vectorizer, this patch also widens the vector length for the local address space, so that the vectorizer generates 128-bit loads, which are then translated to ds_read_b128.
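
The widening described in the summary can be sketched in isolation. This is a minimal illustration and not the patch itself: the `AddrSpace` enum and the function names are hypothetical stand-ins, only GCN targets are modeled, and the private address space is left out.

```cpp
// Hypothetical stand-ins for the address spaces touched by the patch.
enum class AddrSpace { Global, Constant, Flat, Local, Region };

// Widest vector access, in bits, the vectorizer may form before the patch.
constexpr unsigned widthBeforePatch(AddrSpace AS) {
  return (AS == AddrSpace::Global || AS == AddrSpace::Constant) ? 512u
         : (AS == AddrSpace::Flat)                              ? 128u
                                                                : 64u;
}

// After the patch, local and region are grouped with flat at 128 bits.
constexpr unsigned widthAfterPatch(AddrSpace AS) {
  return (AS == AddrSpace::Global || AS == AddrSpace::Constant) ? 512u : 128u;
}

// Number of adjacent 32-bit accesses that fit into one access of WidthBits.
constexpr unsigned maxMergedDwords(unsigned WidthBits) { return WidthBits / 32; }
```

Under these assumptions, `maxMergedDwords(widthAfterPatch(AddrSpace::Local))` is 4 where it was 2 before the change, which is what allows four adjacent 32-bit local loads to be merged into a single 128-bit load.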

Diff Detail

Repository: rL LLVM

Event Timeline

FarhanaAleen created this revision. Mar 7 2018, 8:06 AM

Herald added subscribers: t-tye, tpr, dstuttard and 5 others. Mar 7 2018, 8:06 AM

Have you tested this on real hardware? I remember reading that there is a hardware bug on gfx7 with this instruction. The bug may apply only to early gfx7 chips.

test/CodeGen/AMDGPU/reorder-stores.ll
2 ↗	(On Diff #137397)	What does SEA mean? We usually use CI for Sea Islands.

I've implemented this before: https://github.com/arsenm/llvm/tree/ds-128

This looks mostly the same. It's not clear to me it's always better to use this. I don't think this executes any faster, and at least for ds_write_b128, this has an additional constraint that the inputs must now be in a contiguous 128-bit register instead of 2 independent 64-bit pairs, which increases register pressure and may require copies. It might be better to defer forming this until later, like in the LoadStoreOptimizer pass. Jeff had a benchmark he wanted to try with this.

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
246–248 ↗	(On Diff #137397)	It might be OK to say 128 anyway. You could still do adjacent ds_read2_b64 even when not using ds_read_b128. I don't think we try to do the same trick we do with 4-byte aligned 8 byte reads for the 64-bit equivalent, but you might want to look into that. Anything you change here would also equally apply to REGION_ADDRESS
lib/Target/AMDGPU/SIISelLowering.cpp
5427 ↗	(On Diff #137397)	This should be hidden inside a Subtarget->hasDS128() check
5428 ↗	(On Diff #137397)	You don't need the isAligned16 helper. You just need to check that the alignment is >= 16, not % 16
test/CodeGen/AMDGPU/ds_read2_superreg.ll
100–135 ↗	(On Diff #137397)	These tests have the unfortunate side effect of breaking what this test intended, which was the pass forming the read2. Maybe change all of these to reduce the alignment so you still get read2?
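
The alignment remark on SIISelLowering.cpp line 5428 rests on alignments being powers of two, in which case `>= 16` and `% 16 == 0` give the same answer. A small sketch of that reasoning follows; the helper names are hypothetical and do not appear in the patch.

```cpp
// The two candidate checks discussed in the review.
constexpr bool alignedByModulo(unsigned Align) { return Align % 16 == 0; }
constexpr bool alignedByCompare(unsigned Align) { return Align >= 16; }

// Walks the powers of two from Align up to MaxAlign and reports whether both
// checks agree on every one of them.
constexpr bool checksAgreeFrom(unsigned Align, unsigned MaxAlign) {
  return Align > MaxAlign ||
         (alignedByModulo(Align) == alignedByCompare(Align) &&
          checksAgreeFrom(Align * 2, MaxAlign));
}
```

The two checks differ only for values that are not powers of two (24, for example), which is why the simpler comparison is sufficient for memory-operand alignments.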

Given that the performance benefit is somewhat unclear, can you put it under an option?

lib/Target/AMDGPU/SIISelLowering.cpp
5428 ↗	(On Diff #137397)	Second to that.

Enabled ds_read_b128 under a switch and incorporated additional comments.
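
The gating can be summarized with a simplified model of the local-address-space branch in the committed SIISelLowering.cpp hunk. This is a sketch under stated assumptions: `LocalLoad` and `LocalLoadAction` are hypothetical types introduced for illustration, and the two boolean parameters stand in for the subtarget's `CIInsts` flag and the `-amdgpu-ds128` option.

```cpp
// Hypothetical result type: either leave the load for pattern selection or
// split it into smaller vector loads.
enum class LocalLoadAction { SelectAsIs, Split };

// Hypothetical summary of the properties the lowering code inspects.
struct LocalLoad {
  unsigned NumElements;
  unsigned AlignBytes;
  unsigned StoreSizeBytes;
};

// Mirrors useDS128: the target must support the instruction and the user must
// have enabled it.
constexpr bool useDS128(bool HasCIInsts, bool UserEnable) {
  return HasCIInsts && UserEnable;
}

// A 16-byte, 16-byte-aligned load is kept whole when ds_read_b128 is usable;
// otherwise anything wider than two elements is split.
constexpr LocalLoadAction lowerLocalVectorLoad(LocalLoad L, bool HasCIInsts,
                                               bool UserEnable) {
  return (useDS128(HasCIInsts, UserEnable) && L.AlignBytes >= 16 &&
          L.StoreSizeBytes == 16)
             ? LocalLoadAction::SelectAsIs
             : (L.NumElements > 2 ? LocalLoadAction::Split
                                  : LocalLoadAction::SelectAsIs);
}
```

In this model a `<4 x i32>` load with `align 16` is kept whole only when both flags are set; with the option off, on a target without CI instructions, or with a smaller alignment it is split as before.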

rampitec added inline comments. Mar 8 2018, 4:21 PM

lib/Target/AMDGPU/SIISelLowering.cpp
5434 ↗	(On Diff #137670)	You only have a pattern for v4i32, but you enable the operation for all 128-bit types. Will it work with v8i16, for example?

FarhanaAleen updated this revision to Diff 137761. Mar 9 2018, 8:42 AM

FarhanaAleen added inline comments.

lib/Target/AMDGPU/SIISelLowering.cpp
5434 ↗	(On Diff #137670)	Yes, it works for i16/i8. During DAG combine, the AMDGPU loadCombiner converts vectors of 8-, 16-, and 64-bit elements into vectors of 32-bit elements.
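
The type conversion mentioned in this reply can be illustrated with a little arithmetic. The sketch below only computes how many 32-bit lanes a vector occupies; it does not reproduce the actual DAG combine, and all names in it are hypothetical.

```cpp
// Total size of a vector in bits.
constexpr unsigned totalBits(unsigned ElemBits, unsigned NumElems) {
  return ElemBits * NumElems;
}

// Number of 32-bit lanes an equivalent vector would have, or 0 when the
// vector is not a whole number of 32-bit lanes.
constexpr unsigned equivalentDwordLanes(unsigned ElemBits, unsigned NumElems) {
  return totalBits(ElemBits, NumElems) % 32 == 0
             ? totalBits(ElemBits, NumElems) / 32
             : 0;
}

// True when the vector has the same shape as v4i32, the only type for which
// the patch adds a ds_read_b128 pattern.
constexpr bool hasV4I32Shape(unsigned ElemBits, unsigned NumElems) {
  return equivalentDwordLanes(ElemBits, NumElems) == 4;
}
```

Under this view `<8 x i16>`, `<16 x i8>` and `<2 x i64>` all have the shape of `<4 x i32>`, which is consistent with the added tests expecting ds_read_b128 for each of them.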

LGTM. Thanks.

This revision is now accepted and ready to land. Mar 9 2018, 9:28 AM

Closed by commit rL327153: [AMDGPU] Supported ds_read_b128 generation; Widened vector length for local… (authored by faaleen). Mar 9 2018, 9:46 AM

This revision was automatically updated to reflect the committed changes.

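Revision Contents

Path	Size
llvm/trunk/lib/Target/AMDGPU/AMDGPUInstructions.td	8 lines
llvm/trunk/lib/Target/AMDGPU/AMDGPUSubtarget.h	6 lines
llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp	8 lines
llvm/trunk/lib/Target/AMDGPU/DSInstructions.td	1 line
llvm/trunk/lib/Target/AMDGPU/SIISelLowering.cpp	16 lines
llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.td	4 lines
llvm/trunk/test/CodeGen/AMDGPU/ (six files, names not shown)	19, 18, 18, 19, 14 and 17 lines
llvm/trunk/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-stores.ll	3 lines
llvm/trunk/test/Transforms/LoadStoreVectorizer/AMDGPU/multiple_tails.ll	3 lines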

Diff 137776

llvm/trunk/lib/Target/AMDGPU/AMDGPUInstructions.td

	Show First 20 Lines • Show All 242 Lines • ▼ Show 20 Lines
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// Load/Store Pattern Fragments			// Load/Store Pattern Fragments
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	class Aligned8Bytes <dag ops, dag frag> : PatFrag <ops, frag, [{			class Aligned8Bytes <dag ops, dag frag> : PatFrag <ops, frag, [{
	return cast<MemSDNode>(N)->getAlignment() % 8 == 0;			return cast<MemSDNode>(N)->getAlignment() % 8 == 0;
	}]>;			}]>;

				class Aligned16Bytes <dag ops, dag frag> : PatFrag <ops, frag, [{
				return cast<MemSDNode>(N)->getAlignment() >= 16;
				}]>;

	class LoadFrag <SDPatternOperator op> : PatFrag<(ops node:$ptr), (op node:$ptr)>;			class LoadFrag <SDPatternOperator op> : PatFrag<(ops node:$ptr), (op node:$ptr)>;

	class StoreFrag<SDPatternOperator op> : PatFrag <			class StoreFrag<SDPatternOperator op> : PatFrag <
	(ops node:$value, node:$ptr), (op node:$value, node:$ptr)			(ops node:$value, node:$ptr), (op node:$value, node:$ptr)
	>;			>;

	class StoreHi16<SDPatternOperator op> : PatFrag <			class StoreHi16<SDPatternOperator op> : PatFrag <
	(ops node:$value, node:$ptr), (op (srl node:$value, (i32 16)), node:$ptr)			(ops node:$value, node:$ptr), (op (srl node:$value, (i32 16)), node:$ptr)
	▲ Show 20 Lines • Show All 107 Lines • ▼ Show 20 Lines
	def truncstorei16_local : LocalStore <truncstorei16>;			def truncstorei16_local : LocalStore <truncstorei16>;
	def store_local_hi16 : StoreHi16 <truncstorei16>, LocalAddress;			def store_local_hi16 : StoreHi16 <truncstorei16>, LocalAddress;
	def truncstorei8_local_hi16 : StoreHi16<truncstorei8>, LocalAddress;			def truncstorei8_local_hi16 : StoreHi16<truncstorei8>, LocalAddress;

	def load_align8_local : Aligned8Bytes <			def load_align8_local : Aligned8Bytes <
	(ops node:$ptr), (load_local node:$ptr)			(ops node:$ptr), (load_local node:$ptr)
	>;			>;

				def load_align16_local : Aligned16Bytes <
				(ops node:$ptr), (load_local node:$ptr)
				>;

	def store_align8_local : Aligned8Bytes <			def store_align8_local : Aligned8Bytes <
	(ops node:$val, node:$ptr), (store_local node:$val, node:$ptr)			(ops node:$val, node:$ptr), (store_local node:$val, node:$ptr)
	>;			>;


	def load_flat : FlatLoad <load>;			def load_flat : FlatLoad <load>;
	def az_extloadi8_flat : FlatLoad <az_extloadi8>;			def az_extloadi8_flat : FlatLoad <az_extloadi8>;
	def sextloadi8_flat : FlatLoad <sextloadi8>;			def sextloadi8_flat : FlatLoad <sextloadi8>;
	▲ Show 20 Lines • Show All 364 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/AMDGPU/AMDGPUSubtarget.h

Show First 20 Lines • Show All 408 Lines • ▼ Show 20 Lines	public:
bool enableIEEEBit(const MachineFunction &MF) const {		bool enableIEEEBit(const MachineFunction &MF) const {
return AMDGPU::isCompute(MF.getFunction().getCallingConv());		return AMDGPU::isCompute(MF.getFunction().getCallingConv());
}		}

bool useFlatForGlobal() const {		bool useFlatForGlobal() const {
return FlatForGlobal;		return FlatForGlobal;
}		}

		/// \returns If target supports ds_read/write_b128 and user enables generation
		/// of ds_read/write_b128.
		bool useDS128(bool UserEnable) const {
		return CIInsts && UserEnable;
		}

/// \returns If MUBUF instructions always perform range checking, even for		/// \returns If MUBUF instructions always perform range checking, even for
/// buffer resources used for private memory access.		/// buffer resources used for private memory access.
bool privateMemoryResourceIsRangeChecked() const {		bool privateMemoryResourceIsRangeChecked() const {
return getGeneration() < AMDGPUSubtarget::GFX9;		return getGeneration() < AMDGPUSubtarget::GFX9;
}		}

bool hasAutoWaitcntBeforeBarrier() const {		bool hasAutoWaitcntBeforeBarrier() const {
return AutoWaitcntBeforeBarrier;		return AutoWaitcntBeforeBarrier;
▲ Show 20 Lines • Show All 530 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp

Show First 20 Lines • Show All 259 Lines • ▼ Show 20 Lines	unsigned AMDGPUTTIImpl::getLoadStoreVecRegBitWidth(unsigned AddrSpace) const {
   if (AddrSpace == AS.GLOBAL_ADDRESS ||
       AddrSpace == AS.CONSTANT_ADDRESS ||
       AddrSpace == AS.CONSTANT_ADDRESS_32BIT) {
     if (ST->getGeneration() <= AMDGPUSubtarget::NORTHERN_ISLANDS)
       return 128;
     return 512;
   }

-  if (AddrSpace == AS.FLAT_ADDRESS)
-    return 128;
-  if (AddrSpace == AS.LOCAL_ADDRESS ||
-      AddrSpace == AS.REGION_ADDRESS)
-    return 64;
+  if (AddrSpace == AS.FLAT_ADDRESS ||
+      AddrSpace == AS.LOCAL_ADDRESS ||
+      AddrSpace == AS.REGION_ADDRESS)
+    return 128;

   if (AddrSpace == AS.PRIVATE_ADDRESS)
     return 8 * ST->getMaxPrivateElementSize();

   if (ST->getGeneration() <= AMDGPUSubtarget::NORTHERN_ISLANDS &&
       (AddrSpace == AS.PARAM_D_ADDRESS ||
        AddrSpace == AS.PARAM_I_ADDRESS ||
        (AddrSpace >= AS.CONSTANT_BUFFER_0 &&
         AddrSpace <= AS.CONSTANT_BUFFER_15)))
▲ Show 20 Lines • Show All 321 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/AMDGPU/DSInstructions.td

	Show First 20 Lines • Show All 643 Lines • ▼ Show 20 Lines
	defm : DSReadPat_mc <DS_READ_I16, i32, "sextloadi16_local">;			defm : DSReadPat_mc <DS_READ_I16, i32, "sextloadi16_local">;
	defm : DSReadPat_mc <DS_READ_U16, i32, "az_extloadi16_local">;			defm : DSReadPat_mc <DS_READ_U16, i32, "az_extloadi16_local">;
	defm : DSReadPat_mc <DS_READ_U16, i16, "load_local">;			defm : DSReadPat_mc <DS_READ_U16, i16, "load_local">;
	defm : DSReadPat_mc <DS_READ_B32, i32, "load_local">;			defm : DSReadPat_mc <DS_READ_B32, i32, "load_local">;

	let AddedComplexity = 100 in {			let AddedComplexity = 100 in {

	defm : DSReadPat_mc <DS_READ_B64, v2i32, "load_align8_local">;			defm : DSReadPat_mc <DS_READ_B64, v2i32, "load_align8_local">;
				defm : DSReadPat_mc <DS_READ_B128, v4i32, "load_align16_local">;

	} // End AddedComplexity = 100			} // End AddedComplexity = 100

	let OtherPredicates = [HasD16LoadStore] in {			let OtherPredicates = [HasD16LoadStore] in {
	let AddedComplexity = 100 in {			let AddedComplexity = 100 in {
	defm : DSReadPat_Hi16<DS_READ_U16_D16_HI, load_local>;			defm : DSReadPat_Hi16<DS_READ_U16_D16_HI, load_local>;
	defm : DSReadPat_Hi16<DS_READ_U8_D16_HI, az_extloadi8_local>;			defm : DSReadPat_Hi16<DS_READ_U8_D16_HI, az_extloadi8_local>;
	defm : DSReadPat_Hi16<DS_READ_I8_D16_HI, sextloadi8_local>;			defm : DSReadPat_Hi16<DS_READ_I8_D16_HI, sextloadi8_local>;
	▲ Show 20 Lines • Show All 488 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/AMDGPU/SIISelLowering.cpp

Show First 20 Lines • Show All 88 Lines • ▼ Show 20 Lines

STATISTIC(NumTailCalls, "Number of tail calls");		STATISTIC(NumTailCalls, "Number of tail calls");

static cl::opt<bool> EnableVGPRIndexMode(		static cl::opt<bool> EnableVGPRIndexMode(
"amdgpu-vgpr-index-mode",		"amdgpu-vgpr-index-mode",
cl::desc("Use GPR indexing mode instead of movrel for vector indexing"),		cl::desc("Use GPR indexing mode instead of movrel for vector indexing"),
cl::init(false));		cl::init(false));

		static cl::opt<bool> EnableDS128(
		"amdgpu-ds128",
		cl::desc("Use DS_read/write_b128"),
		cl::init(false));

static cl::opt<unsigned> AssumeFrameIndexHighZeroBits(		static cl::opt<unsigned> AssumeFrameIndexHighZeroBits(
"amdgpu-frame-index-zero-bits",		"amdgpu-frame-index-zero-bits",
cl::desc("High bits of frame index assumed to be zero"),		cl::desc("High bits of frame index assumed to be zero"),
cl::init(5),		cl::init(5),
cl::ReallyHidden);		cl::ReallyHidden);

static unsigned findFirstFreeSGPR(CCState &CCInfo) {		static unsigned findFirstFreeSGPR(CCState &CCInfo) {
unsigned NumSGPRs = AMDGPU::SGPR_32RegClass.getNumRegs();		unsigned NumSGPRs = AMDGPU::SGPR_32RegClass.getNumRegs();
▲ Show 20 Lines • Show All 5,315 Lines • ▼ Show 20 Lines	case 16:
       // Same as global/flat
       if (NumElements > 4)
         return SplitVectorLoad(Op, DAG);
       return SDValue();
     default:
       llvm_unreachable("unsupported private_element_size");
     }
   } else if (AS == AMDGPUASI.LOCAL_ADDRESS) {
-    if (NumElements > 2)
-      return SplitVectorLoad(Op, DAG);
-
-    if (NumElements == 2)
-      return SDValue();
-
-    // If properly aligned, if we split we might be able to use ds_read_b64.
-    return SplitVectorLoad(Op, DAG);
+    // Use ds_read_b128 if possible.
+    if (Subtarget->useDS128(EnableDS128) && Load->getAlignment() >= 16 &&
+        MemVT.getStoreSize() == 16)
+      return SDValue();
+
+    if (NumElements > 2)
+      return SplitVectorLoad(Op, DAG);
   }
   return SDValue();
 }

 SDValue SITargetLowering::LowerSELECT(SDValue Op, SelectionDAG &DAG) const {
   if (Op.getValueType() != MVT::i64)
     return SDValue();

▲ Show 20 Lines • Show All 2,329 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.td

	Show First 20 Lines • Show All 404 Lines • ▼ Show 20 Lines

	def sextloadi16_glue : PatFrag<(ops node:$ptr), (sextload_glue node:$ptr), [{			def sextloadi16_glue : PatFrag<(ops node:$ptr), (sextload_glue node:$ptr), [{
	return cast<LoadSDNode>(N)->getMemoryVT() == MVT::i16;			return cast<LoadSDNode>(N)->getMemoryVT() == MVT::i16;
	}]>;			}]>;

	def load_glue_align8 : Aligned8Bytes <			def load_glue_align8 : Aligned8Bytes <
	(ops node:$ptr), (load_glue node:$ptr)			(ops node:$ptr), (load_glue node:$ptr)
	>;			>;
				def load_glue_align16 : Aligned16Bytes <
				(ops node:$ptr), (load_glue node:$ptr)
				>;


	def load_local_m0 : LoadFrag<load_glue>, LocalAddress;			def load_local_m0 : LoadFrag<load_glue>, LocalAddress;
	def sextloadi8_local_m0 : LoadFrag<sextloadi8_glue>, LocalAddress;			def sextloadi8_local_m0 : LoadFrag<sextloadi8_glue>, LocalAddress;
	def sextloadi16_local_m0 : LoadFrag<sextloadi16_glue>, LocalAddress;			def sextloadi16_local_m0 : LoadFrag<sextloadi16_glue>, LocalAddress;
	def az_extloadi8_local_m0 : LoadFrag<az_extloadi8_glue>, LocalAddress;			def az_extloadi8_local_m0 : LoadFrag<az_extloadi8_glue>, LocalAddress;
	def az_extloadi16_local_m0 : LoadFrag<az_extloadi16_glue>, LocalAddress;			def az_extloadi16_local_m0 : LoadFrag<az_extloadi16_glue>, LocalAddress;
	def load_align8_local_m0 : LoadFrag <load_glue_align8>, LocalAddress;			def load_align8_local_m0 : LoadFrag <load_glue_align8>, LocalAddress;
				def load_align16_local_m0 : LoadFrag <load_glue_align16>, LocalAddress;


	def AMDGPUst_glue : SDNode <"ISD::STORE", SDTStore,			def AMDGPUst_glue : SDNode <"ISD::STORE", SDTStore,
	[SDNPHasChain, SDNPMayStore, SDNPMemOperand, SDNPInGlue]			[SDNPHasChain, SDNPMayStore, SDNPMemOperand, SDNPInGlue]
	>;			>;

	def unindexedstore_glue : PatFrag<(ops node:$val, node:$ptr),			def unindexedstore_glue : PatFrag<(ops node:$val, node:$ptr),
	(AMDGPUst_glue node:$val, node:$ptr), [{			(AMDGPUst_glue node:$val, node:$ptr), [{
	▲ Show 20 Lines • Show All 1,713 Lines • Show Last 20 Lines

llvm/trunk/test/CodeGen/AMDGPU/load-local-f32.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s			; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s
	; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s			; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s
	; RUN: llc -march=r600 -mcpu=redwood < %s \| FileCheck -check-prefixes=EG,FUNC %s			; RUN: llc -march=r600 -mcpu=redwood < %s \| FileCheck -check-prefixes=EG,FUNC %s

				; Testing for ds_read_128
				; RUN: llc -march=amdgcn -mcpu=tahiti -amdgpu-ds128 < %s | FileCheck -check-prefixes=SI,FUNC %s
				; RUN: llc -march=amdgcn -mcpu=tonga -amdgpu-ds128 < %s | FileCheck -check-prefixes=CIVI,FUNC %s
				; RUN: llc -march=amdgcn -mcpu=gfx900 -amdgpu-ds128 < %s | FileCheck -check-prefixes=CIVI,FUNC %s

	; FUNC-LABEL: {{^}}load_f32_local:			; FUNC-LABEL: {{^}}load_f32_local:
	; SICIVI: s_mov_b32 m0			; SICIVI: s_mov_b32 m0
	; GFX9-NOT: m0			; GFX9-NOT: m0
	; GCN: ds_read_b32			; GCN: ds_read_b32

	; EG: LDS_READ_RET			; EG: LDS_READ_RET
	define amdgpu_kernel void @load_f32_local(float addrspace(1)* %out, float addrspace(3)* %in) #0 {			define amdgpu_kernel void @load_f32_local(float addrspace(1)* %out, float addrspace(3)* %in) #0 {
	entry:			entry:
	▲ Show 20 Lines • Show All 104 Lines • ▼ Show 20 Lines
	; EG: LDS_READ_RET			; EG: LDS_READ_RET
	define amdgpu_kernel void @local_load_v16f32(<16 x float> addrspace(3)* %out, <16 x float> addrspace(3)* %in) #0 {			define amdgpu_kernel void @local_load_v16f32(<16 x float> addrspace(3)* %out, <16 x float> addrspace(3)* %in) #0 {
	entry:			entry:
	%tmp0 = load <16 x float>, <16 x float> addrspace(3)* %in			%tmp0 = load <16 x float>, <16 x float> addrspace(3)* %in
	store <16 x float> %tmp0, <16 x float> addrspace(3)* %out			store <16 x float> %tmp0, <16 x float> addrspace(3)* %out
	ret void			ret void
	}			}

				; Tests if ds_read_b128 gets generated for the 16 byte aligned load.
				; FUNC-LABEL: {{^}}local_v4f32_to_128:
				; SI-NOT: ds_read_b128
				; CIVI: ds_read_b128
				; EG: LDS_READ_RET
				; EG: LDS_READ_RET
				; EG: LDS_READ_RET
				; EG: LDS_READ_RET
				define amdgpu_kernel void @local_v4f32_to_128(<4 x float> addrspace(3)* %out, <4 x float> addrspace(3)* %in) {
				%ld = load <4 x float>, <4 x float> addrspace(3)* %in, align 16
				store <4 x float> %ld, <4 x float> addrspace(3)* %out
				ret void
				}

	attributes #0 = { nounwind }			attributes #0 = { nounwind }

llvm/trunk/test/CodeGen/AMDGPU/load-local-f64.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s			; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s
	; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=kaveri -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s			; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=kaveri -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s
	; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s			; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s
	; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,GFX9,FUNC %s			; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,GFX9,FUNC %s
	; RUN: llc -march=r600 -mcpu=redwood < %s \| FileCheck -check-prefixes=EG,FUNC %s			; RUN: llc -march=r600 -mcpu=redwood < %s \| FileCheck -check-prefixes=EG,FUNC %s

				; Testing for ds_read_b128
				; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs -amdgpu-ds128 < %s | FileCheck -check-prefixes=CIVI,FUNC %s
				; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs -amdgpu-ds128 < %s | FileCheck -check-prefixes=CIVI,FUNC %s

	; FUNC-LABEL: {{^}}local_load_f64:			; FUNC-LABEL: {{^}}local_load_f64:
	; SICIV: s_mov_b32 m0			; SICIV: s_mov_b32 m0
	; GFX9-NOT: m0			; GFX9-NOT: m0

	; GCN: ds_read_b64 [[VAL:v\[[0-9]+:[0-9]+\]]], v{{[0-9]+}}{{$}}			; GCN: ds_read_b64 [[VAL:v\[[0-9]+:[0-9]+\]]], v{{[0-9]+}}{{$}}
	; GCN: ds_write_b64 v{{[0-9]+}}, [[VAL]]			; GCN: ds_write_b64 v{{[0-9]+}}, [[VAL]]

	; EG: LDS_READ_RET			; EG: LDS_READ_RET
	▲ Show 20 Lines • Show All 150 Lines • ▼ Show 20 Lines
	; EG: LDS_READ_RET			; EG: LDS_READ_RET
	define amdgpu_kernel void @local_load_v16f64(<16 x double> addrspace(3)* %out, <16 x double> addrspace(3)* %in) #0 {			define amdgpu_kernel void @local_load_v16f64(<16 x double> addrspace(3)* %out, <16 x double> addrspace(3)* %in) #0 {
	entry:			entry:
	%ld = load <16 x double>, <16 x double> addrspace(3)* %in			%ld = load <16 x double>, <16 x double> addrspace(3)* %in
	store <16 x double> %ld, <16 x double> addrspace(3)* %out			store <16 x double> %ld, <16 x double> addrspace(3)* %out
	ret void			ret void
	}			}

				; Tests if ds_read_b128 gets generated for the 16 byte aligned load.
				; FUNC-LABEL: {{^}}local_load_v2f64_to_128:
				; CIVI: ds_read_b128
				; EG: LDS_READ_RET
				; EG: LDS_READ_RET
				; EG: LDS_READ_RET
				; EG: LDS_READ_RET
				define amdgpu_kernel void @local_load_v2f64_to_128(<2 x double> addrspace(3)* %out, <2 x double> addrspace(3)* %in) {
				entry:
				%ld = load <2 x double>, <2 x double> addrspace(3)* %in, align 16
				store <2 x double> %ld, <2 x double> addrspace(3)* %out
				ret void
				}

	attributes #0 = { nounwind }			attributes #0 = { nounwind }

llvm/trunk/test/CodeGen/AMDGPU/load-local-i16.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SI,SICIVI,FUNC %s			; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SI,SICIVI,FUNC %s
	; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,GFX89,FUNC %s			; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,GFX89,FUNC %s
	; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,GFX9,GFX89,FUNC %s			; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,GFX9,GFX89,FUNC %s
	; RUN: llc -march=r600 -mcpu=redwood -verify-machineinstrs < %s \| FileCheck -check-prefix=EG -check-prefix=FUNC %s			; RUN: llc -march=r600 -mcpu=redwood -verify-machineinstrs < %s \| FileCheck -check-prefix=EG -check-prefix=FUNC %s

				; Testing for ds_read_b128
				; RUN: llc -march=amdgcn -mcpu=tonga -amdgpu-ds128 < %s | FileCheck -check-prefixes=CIVI,FUNC %s
				; RUN: llc -march=amdgcn -mcpu=gfx900 -amdgpu-ds128 < %s | FileCheck -check-prefixes=CIVI,FUNC %s

	; FUNC-LABEL: {{^}}local_load_i16:			; FUNC-LABEL: {{^}}local_load_i16:
	; GFX9-NOT: m0			; GFX9-NOT: m0
	; SICIVI: s_mov_b32 m0			; SICIVI: s_mov_b32 m0

	; GCN: ds_read_u16 v{{[0-9]+}}			; GCN: ds_read_u16 v{{[0-9]+}}

	; EG: MOV {{[* ]*}}[[FROM:T[0-9]+\.[XYZW]]], KC0[2].Z			; EG: MOV {{[* ]*}}[[FROM:T[0-9]+\.[XYZW]]], KC0[2].Z
	; EG: LDS_USHORT_READ_RET {{.*}} [[FROM]]			; EG: LDS_USHORT_READ_RET {{.*}} [[FROM]]
	▲ Show 20 Lines • Show All 916 Lines • ▼ Show 20 Lines
	; ; XFUNC-LABEL: {{^}}local_sextload_v64i16_to_v64i64:			; ; XFUNC-LABEL: {{^}}local_sextload_v64i16_to_v64i64:
	; define amdgpu_kernel void @local_sextload_v64i16_to_v64i64(<64 x i64> addrspace(3)* %out, <64 x i16> addrspace(3)* %in) #0 {			; define amdgpu_kernel void @local_sextload_v64i16_to_v64i64(<64 x i64> addrspace(3)* %out, <64 x i16> addrspace(3)* %in) #0 {
	; %load = load <64 x i16>, <64 x i16> addrspace(3)* %in			; %load = load <64 x i16>, <64 x i16> addrspace(3)* %in
	; %ext = sext <64 x i16> %load to <64 x i64>			; %ext = sext <64 x i16> %load to <64 x i64>
	; store <64 x i64> %ext, <64 x i64> addrspace(3)* %out			; store <64 x i64> %ext, <64 x i64> addrspace(3)* %out
	; ret void			; ret void
	; }			; }

				; Tests if ds_read_b128 gets generated for the 16 byte aligned load.
				; FUNC-LABEL: {{^}}local_v8i16_to_128:
				; SI-NOT: ds_read_b128
				; CIVI: ds_read_b128
				; EG: LDS_READ_RET
				; EG: LDS_READ_RET
				; EG: LDS_READ_RET
				; EG: LDS_READ_RET
				define amdgpu_kernel void @local_v8i16_to_128(<8 x i16> addrspace(3)* %out, <8 x i16> addrspace(3)* %in) {
				%ld = load <8 x i16>, <8 x i16> addrspace(3)* %in, align 16
				store <8 x i16> %ld, <8 x i16> addrspace(3)* %out
				ret void
				}

	attributes #0 = { nounwind }			attributes #0 = { nounwind }

llvm/trunk/test/CodeGen/AMDGPU/load-local-i32.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s			; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s
	; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s			; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s
	; RUN: llc -march=amdgcn -mcpu=gfx900 -mattr=-flat-for-global -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,VI,FUNC %s			; RUN: llc -march=amdgcn -mcpu=gfx900 -mattr=-flat-for-global -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,VI,FUNC %s
	; RUN: llc -march=r600 -mcpu=redwood < %s \| FileCheck -check-prefix=EG -check-prefix=FUNC %s			; RUN: llc -march=r600 -mcpu=redwood < %s \| FileCheck -check-prefix=EG -check-prefix=FUNC %s

				; Testing for ds_read_128
				; RUN: llc -march=amdgcn -mcpu=tahiti -amdgpu-ds128 < %s | FileCheck -check-prefixes=SI,FUNC %s
				; RUN: llc -march=amdgcn -mcpu=tonga -amdgpu-ds128 < %s | FileCheck -check-prefixes=CIVI,FUNC %s
				; RUN: llc -march=amdgcn -mcpu=gfx900 -amdgpu-ds128 < %s | FileCheck -check-prefixes=CIVI,FUNC %s

	; FUNC-LABEL: {{^}}local_load_i32:			; FUNC-LABEL: {{^}}local_load_i32:
	; GCN-NOT: s_wqm_b64			; GCN-NOT: s_wqm_b64
	; SICIVI: s_mov_b32 m0, -1			; SICIVI: s_mov_b32 m0, -1
	; GFX9-NOT: m0			; GFX9-NOT: m0
	; GCN: ds_read_b32			; GCN: ds_read_b32

	; EG: LDS_READ_RET			; EG: LDS_READ_RET
	define amdgpu_kernel void @local_load_i32(i32 addrspace(3)* %out, i32 addrspace(3)* %in) #0 {			define amdgpu_kernel void @local_load_i32(i32 addrspace(3)* %out, i32 addrspace(3)* %in) #0 {
	▲ Show 20 Lines • Show All 156 Lines • ▼ Show 20 Lines

	define amdgpu_kernel void @local_sextload_v4i32_to_v4i64(<4 x i64> addrspace(3)* %out, <4 x i32> addrspace(3)* %in) #0 {			define amdgpu_kernel void @local_sextload_v4i32_to_v4i64(<4 x i64> addrspace(3)* %out, <4 x i32> addrspace(3)* %in) #0 {
	%ld = load <4 x i32>, <4 x i32> addrspace(3)* %in			%ld = load <4 x i32>, <4 x i32> addrspace(3)* %in
	%ext = sext <4 x i32> %ld to <4 x i64>			%ext = sext <4 x i32> %ld to <4 x i64>
	store <4 x i64> %ext, <4 x i64> addrspace(3)* %out			store <4 x i64> %ext, <4 x i64> addrspace(3)* %out
	ret void			ret void
	}			}

				; Tests if ds_read_b128 gets generated for the 16 byte aligned load.
				; FUNC-LABEL: {{^}}local_v4i32_to_128:
				; SI-NOT: ds_read_b128
				; CIVI: ds_read_b128
				; EG: LDS_READ_RET
				; EG: LDS_READ_RET
				; EG: LDS_READ_RET
				; EG: LDS_READ_RET
				define amdgpu_kernel void @local_v4i32_to_128(<4 x i32> addrspace(3)* %out, <4 x i32> addrspace(3)* %in) {
				%ld = load <4 x i32>, <4 x i32> addrspace(3)* %in, align 16
				store <4 x i32> %ld, <4 x i32> addrspace(3)* %out
				ret void
				}

	; FUNC-LABEL: {{^}}local_zextload_v8i32_to_v8i64:			; FUNC-LABEL: {{^}}local_zextload_v8i32_to_v8i64:
	; SICIVI: s_mov_b32 m0, -1			; SICIVI: s_mov_b32 m0, -1
	; GFX9-NOT: m0			; GFX9-NOT: m0

	define amdgpu_kernel void @local_zextload_v8i32_to_v8i64(<8 x i64> addrspace(3)* %out, <8 x i32> addrspace(3)* %in) #0 {			define amdgpu_kernel void @local_zextload_v8i32_to_v8i64(<8 x i64> addrspace(3)* %out, <8 x i32> addrspace(3)* %in) #0 {
	%ld = load <8 x i32>, <8 x i32> addrspace(3)* %in			%ld = load <8 x i32>, <8 x i32> addrspace(3)* %in
	%ext = zext <8 x i32> %ld to <8 x i64>			%ext = zext <8 x i32> %ld to <8 x i64>
	store <8 x i64> %ext, <8 x i64> addrspace(3)* %out			store <8 x i64> %ext, <8 x i64> addrspace(3)* %out
	▲ Show 20 Lines • Show All 59 Lines • Show Last 20 Lines

llvm/trunk/test/CodeGen/AMDGPU/load-local-i64.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s			; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s
	; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=kaveri -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s			; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=kaveri -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s
	; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s			; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,SICIVI,FUNC %s
	; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,GFX9,FUNC %s			; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,GFX9,FUNC %s
	; RUN: llc -march=r600 -mcpu=redwood < %s \| FileCheck -check-prefixes=EG,FUNC %s			; RUN: llc -march=r600 -mcpu=redwood < %s \| FileCheck -check-prefixes=EG,FUNC %s

				; Testing for ds_read_b128
				; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs -amdgpu-ds128 < %s | FileCheck -check-prefixes=CIVI,FUNC %s
				; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs -amdgpu-ds128 < %s | FileCheck -check-prefixes=CIVI,FUNC %s

	; FUNC-LABEL: {{^}}local_load_i64:			; FUNC-LABEL: {{^}}local_load_i64:
	; SICIVI: s_mov_b32 m0			; SICIVI: s_mov_b32 m0
	; GFX9-NOT: m0			; GFX9-NOT: m0

	; GCN: ds_read_b64 [[VAL:v\[[0-9]+:[0-9]+\]]], v{{[0-9]+}}{{$}}			; GCN: ds_read_b64 [[VAL:v\[[0-9]+:[0-9]+\]]], v{{[0-9]+}}{{$}}
	; GCN: ds_write_b64 v{{[0-9]+}}, [[VAL]]			; GCN: ds_write_b64 v{{[0-9]+}}, [[VAL]]

	; EG: LDS_READ_RET			; EG: LDS_READ_RET
	Show All 16 Lines
	; EG: LDS_READ_RET			; EG: LDS_READ_RET
	define amdgpu_kernel void @local_load_v2i64(<2 x i64> addrspace(3)* %out, <2 x i64> addrspace(3)* %in) #0 {			define amdgpu_kernel void @local_load_v2i64(<2 x i64> addrspace(3)* %out, <2 x i64> addrspace(3)* %in) #0 {
	entry:			entry:
	%ld = load <2 x i64>, <2 x i64> addrspace(3)* %in			%ld = load <2 x i64>, <2 x i64> addrspace(3)* %in
	store <2 x i64> %ld, <2 x i64> addrspace(3)* %out			store <2 x i64> %ld, <2 x i64> addrspace(3)* %out
	ret void			ret void
	}			}

				; Tests if ds_read_b128 gets generated for the 16 byte aligned load.
				; FUNC-LABEL: {{^}}local_load_v2i64_to_128:
				; CIVI: ds_read_b128
				define amdgpu_kernel void @local_load_v2i64_to_128(<2 x i64> addrspace(3)* %out, <2 x i64> addrspace(3)* %in) {
				entry:
				%ld = load <2 x i64>, <2 x i64> addrspace(3)* %in
				store <2 x i64> %ld, <2 x i64> addrspace(3)* %out
				ret void
				}

	; FUNC-LABEL: {{^}}local_load_v3i64:			; FUNC-LABEL: {{^}}local_load_v3i64:
	; SICIVI: s_mov_b32 m0			; SICIVI: s_mov_b32 m0
	; GFX9-NOT: m0			; GFX9-NOT: m0

	; GCN-DAG: ds_read2_b64			; GCN-DAG: ds_read2_b64
	; GCN-DAG: ds_read_b64			; GCN-DAG: ds_read_b64

	; EG: LDS_READ_RET			; EG: LDS_READ_RET
	▲ Show 20 Lines • Show All 127 Lines • Show Last 20 Lines

llvm/trunk/test/CodeGen/AMDGPU/load-local-i8.ll

	; RUN: llc -march=amdgcn -mtriple=amdgcn---amdgiz -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,SI,SICIVI,FUNC %s			; RUN: llc -march=amdgcn -mtriple=amdgcn---amdgiz -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,SI,SICIVI,FUNC %s
	; RUN: llc -march=amdgcn -mtriple=amdgcn---amdgiz -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,VI,SICIVI,FUNC %s			; RUN: llc -march=amdgcn -mtriple=amdgcn---amdgiz -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,VI,SICIVI,FUNC %s
	; RUN: llc -march=amdgcn -mtriple=amdgcn---amdgiz -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX9,FUNC %s			; RUN: llc -march=amdgcn -mtriple=amdgcn---amdgiz -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX9,FUNC %s
	; RUN: llc -march=r600 -mtriple=r600---amdgiz -mcpu=redwood -verify-machineinstrs < %s | FileCheck -check-prefix=EG -check-prefix=FUNC %s			; RUN: llc -march=r600 -mtriple=r600---amdgiz -mcpu=redwood -verify-machineinstrs < %s | FileCheck -check-prefix=EG -check-prefix=FUNC %s

				; Testing for ds_read_b128
				; RUN: llc -march=amdgcn -mcpu=tonga -amdgpu-ds128 < %s | FileCheck -check-prefixes=CIVI,FUNC %s
				; RUN: llc -march=amdgcn -mcpu=gfx900 -amdgpu-ds128 < %s | FileCheck -check-prefixes=CIVI,FUNC %s

	; FUNC-LABEL: {{^}}local_load_i8:			; FUNC-LABEL: {{^}}local_load_i8:
	; GCN-NOT: s_wqm_b64			; GCN-NOT: s_wqm_b64
	; SICIVI: s_mov_b32 m0			; SICIVI: s_mov_b32 m0
	; GFX9-NOT: m0			; GFX9-NOT: m0
	; GCN: ds_read_u8			; GCN: ds_read_u8

	; EG: LDS_UBYTE_READ_RET			; EG: LDS_UBYTE_READ_RET
	; XFUNC-LABEL: {{^}}local_sextload_v64i8_to_v64i16:			; XFUNC-LABEL: {{^}}local_sextload_v64i8_to_v64i16:
	; define amdgpu_kernel void @local_sextload_v64i8_to_v64i16(<64 x i16> addrspace(3)* %out, <64 x i8> addrspace(3)* %in) #0 {			; define amdgpu_kernel void @local_sextload_v64i8_to_v64i16(<64 x i16> addrspace(3)* %out, <64 x i8> addrspace(3)* %in) #0 {
	; %load = load <64 x i8>, <64 x i8> addrspace(3)* %in			; %load = load <64 x i8>, <64 x i8> addrspace(3)* %in
	; %ext = sext <64 x i8> %load to <64 x i16>			; %ext = sext <64 x i8> %load to <64 x i16>
	; store <64 x i16> %ext, <64 x i16> addrspace(3)* %out			; store <64 x i16> %ext, <64 x i16> addrspace(3)* %out
	; ret void			; ret void
	; }			; }

				; Tests if ds_read_b128 gets generated for the 16 byte aligned load.
				; FUNC-LABEL: {{^}}local_v16i8_to_128:
				; SI-NOT: ds_read_b128
				; CIVI: ds_read_b128
				; EG: LDS_READ_RET
				; EG: LDS_READ_RET
				; EG: LDS_READ_RET
				; EG: LDS_READ_RET
				define amdgpu_kernel void @local_v16i8_to_128(<16 x i8> addrspace(3)* %out, <16 x i8> addrspace(3)* %in) {
				%ld = load <16 x i8>, <16 x i8> addrspace(3)* %in, align 16
				store <16 x i8> %ld, <16 x i8> addrspace(3)* %out
				ret void
				}

	attributes #0 = { nounwind }			attributes #0 = { nounwind }

llvm/trunk/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-stores.ll

define amdgpu_kernel void @merge_local_store_2_constants_i32_align_2(i32 addrspace(3)* %out) #0 {		define amdgpu_kernel void @merge_local_store_2_constants_i32_align_2(i32 addrspace(3)* %out) #0 {
%out.gep.1 = getelementptr i32, i32 addrspace(3)* %out, i32 1		%out.gep.1 = getelementptr i32, i32 addrspace(3)* %out, i32 1

store i32 123, i32 addrspace(3)* %out.gep.1, align 2		store i32 123, i32 addrspace(3)* %out.gep.1, align 2
store i32 456, i32 addrspace(3)* %out, align 2		store i32 456, i32 addrspace(3)* %out, align 2
ret void		ret void
}		}

; CHECK-LABEL: @merge_local_store_4_constants_i32		; CHECK-LABEL: @merge_local_store_4_constants_i32
; CHECK: store <2 x i32> <i32 456, i32 333>, <2 x i32> addrspace(3)*		; CHECK: store <4 x i32> <i32 1234, i32 123, i32 456, i32 333>, <4 x i32> addrspace(3)*
; CHECK: store <2 x i32> <i32 1234, i32 123>, <2 x i32> addrspace(3)*
define amdgpu_kernel void @merge_local_store_4_constants_i32(i32 addrspace(3)* %out) #0 {		define amdgpu_kernel void @merge_local_store_4_constants_i32(i32 addrspace(3)* %out) #0 {
%out.gep.1 = getelementptr i32, i32 addrspace(3)* %out, i32 1		%out.gep.1 = getelementptr i32, i32 addrspace(3)* %out, i32 1
%out.gep.2 = getelementptr i32, i32 addrspace(3)* %out, i32 2		%out.gep.2 = getelementptr i32, i32 addrspace(3)* %out, i32 2
%out.gep.3 = getelementptr i32, i32 addrspace(3)* %out, i32 3		%out.gep.3 = getelementptr i32, i32 addrspace(3)* %out, i32 3

store i32 123, i32 addrspace(3)* %out.gep.1		store i32 123, i32 addrspace(3)* %out.gep.1
store i32 456, i32 addrspace(3)* %out.gep.2		store i32 456, i32 addrspace(3)* %out.gep.2
store i32 333, i32 addrspace(3)* %out.gep.3		store i32 333, i32 addrspace(3)* %out.gep.3
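The CHECK change in @merge_local_store_4_constants_i32 (two <2 x i32> stores becoming one <4 x i32> store) is what one would expect if the maximum vector width for the local address space went from 64 to 128 bits. A minimal C++ sketch of that arithmetic follows. The helper name splitChain and its parameters are hypothetical and are not taken from the patch; the real LoadStoreVectorizer also takes alignment and other legality checks into account, which this sketch ignores.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of LLVM): split a chain of `numElts`
// adjacent accesses, each `eltBits` wide, into vector pieces that are
// no wider than `maxVecBits`. Returns the element count of each piece.
std::vector<unsigned> splitChain(unsigned numElts, unsigned eltBits,
                                 unsigned maxVecBits) {
  assert(eltBits > 0 && maxVecBits >= eltBits);
  // How many elements fit into one vector access of the allowed width.
  const unsigned maxEltsPerVec = maxVecBits / eltBits;
  std::vector<unsigned> pieces;
  while (numElts > 0) {
    // Take as many elements as fit, or whatever remains of the chain.
    const unsigned n = numElts < maxEltsPerVec ? numElts : maxEltsPerVec;
    pieces.push_back(n);
    numElts -= n;
  }
  return pieces;
}
```

Under these assumptions, splitChain(4, 32, 64) yields two pieces of two elements each, which mirrors the old CHECK lines, while splitChain(4, 32, 128) yields a single piece of four elements, which mirrors the new one.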

llvm/trunk/test/Transforms/LoadStoreVectorizer/AMDGPU/multiple_tails.ll

define amdgpu_kernel void @no_crash(i32 %arg) {		define amdgpu_kernel void @no_crash(i32 %arg) {

ret void		ret void
}		}

; Check adjacent memory locations are properly matched and the		; Check adjacent memory locations are properly matched and the
; longest chain vectorized		; longest chain vectorized

; CHECK-LABEL: @interleave_get_longest		; CHECK-LABEL: @interleave_get_longest
; CHECK: load <2 x i32>		; CHECK: load <4 x i32>
; CHECK: load i32		; CHECK: load i32
; CHECK: store <2 x i32> zeroinitializer		; CHECK: store <2 x i32> zeroinitializer
; CHECK: load i32		; CHECK: load i32
; CHECK: load <2 x i32>
; CHECK: load i32		; CHECK: load i32
; CHECK: load i32		; CHECK: load i32

define amdgpu_kernel void @interleave_get_longest(i32 %arg) {		define amdgpu_kernel void @interleave_get_longest(i32 %arg) {
%a1 = add i32 %arg, 1		%a1 = add i32 %arg, 1
%a2 = add i32 %arg, 2		%a2 = add i32 %arg, 2
%a3 = add i32 %arg, 3		%a3 = add i32 %arg, 3
%a4 = add i32 %arg, 4		%a4 = add i32 %arg, 4

Revision Contents

Diff 137776

llvm/trunk/lib/Target/AMDGPU/AMDGPUInstructions.td

llvm/trunk/lib/Target/AMDGPU/AMDGPUSubtarget.h

llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp

llvm/trunk/lib/Target/AMDGPU/DSInstructions.td

llvm/trunk/lib/Target/AMDGPU/SIISelLowering.cpp

llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.td

llvm/trunk/test/CodeGen/AMDGPU/load-local-f32.ll

llvm/trunk/test/CodeGen/AMDGPU/load-local-f64.ll

llvm/trunk/test/CodeGen/AMDGPU/load-local-i16.ll

llvm/trunk/test/CodeGen/AMDGPU/load-local-i32.ll

llvm/trunk/test/CodeGen/AMDGPU/load-local-i64.ll

llvm/trunk/test/CodeGen/AMDGPU/load-local-i8.ll

llvm/trunk/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-stores.ll

llvm/trunk/test/Transforms/LoadStoreVectorizer/AMDGPU/multiple_tails.ll
