This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Adjusted alignment-check for local address space;
Abandoned, Public

Authored by FarhanaAleen on Mar 2 2018, 2:25 PM.

Details

Reviewers

rampitec
arsenm

Summary

LDS is allocated with 64-dword alignment. To prove that a memory address is 8-byte aligned, it is sufficient to check that the offset is a multiple of 8.

This allows generating ds_read/write_b64 instead of ds_read/write2_b32.

Diff Detail

Event Timeline

FarhanaAleen created this revision. Mar 2 2018, 2:25 PM

Herald added subscribers: t-tye, tpr, dstuttard and 5 others. Mar 2 2018, 2:25 PM

rampitec added inline comments. Mar 2 2018, 3:22 PM

lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
667	It is not immediately clear to me if the base pointer is always an LDS allocated symbol. I guess it can also be a GEP, right? Let's consider the situation: sym is a 64-dword aligned LDS allocation. gep1 = &sym[1] -> it is only 4 bytes aligned. gep2 = &gep1[8] -> it is still 4 bytes aligned. I suppose for the gep2 your logic will incorrectly report 8 byte alignment.
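
The GEP scenario in this comment can be replayed with plain integer arithmetic. The following is a minimal sketch, not code from the patch: `isAligned8`, `gepDword`, and `offsetCheckIsUnsound` are hypothetical names, and addresses are modelled as byte offsets into LDS with 4-byte elements.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if a byte address is a multiple of 8.
constexpr bool isAligned8(std::uint32_t Addr) { return (Addr & 7u) == 0; }

// Hypothetical model of a GEP over 4-byte elements: Base + 4 * Index.
constexpr std::uint32_t gepDword(std::uint32_t Base, std::uint32_t Index) {
  return Base + 4u * Index;
}

// Replays the example from the comment. Returns true when the last offset is
// a multiple of 8 although the resulting address is not 8-byte aligned, i.e.
// when an offset-only check would give the wrong answer.
inline bool offsetCheckIsUnsound(std::uint32_t SymAddr) {
  assert(SymAddr % 256 == 0 && "sym is assumed 64-dword (256-byte) aligned");
  const std::uint32_t Gep1 = gepDword(SymAddr, 1); // &sym[1]: 4-byte aligned
  const std::uint32_t Gep2 = gepDword(Gep1, 8);    // &gep1[8]: +32 bytes
  const bool OffsetIsMultipleOf8 = ((Gep2 - Gep1) & 7u) == 0;
  return OffsetIsMultipleOf8 && !isAligned8(Gep2);
}
```

Under these assumptions, `offsetCheckIsUnsound(256)` is true: the last step adds 32 bytes, but the address 256 + 4 + 32 is only 4-byte aligned.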

I think it is simpler than that. If a local symbol must be 64-dword aligned, it should be declared as such, and not as 4-byte aligned as we have it now.
Such logic should not be needed in the first place; LLVM should be able to deduce the proper alignment given proper input.

Although I am not really sure it is true that it is always 64-dword aligned. Consider:

local int x;
local int y;

Do you mean this allocation would take 128 dwords? I highly doubt it.

I suppose only the first symbol is 64-dword aligned, and everything after it is just naturally aligned with respect to its element type size. So logic that leverages the actual allocation alignment can be useful only after all LDS is allocated and the allocation is flattened into a single LDS memory array.
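
The layout described in this comment can be illustrated with a toy allocator. This is a simplified sketch under stated assumptions, not the AMDGPU backend's actual LDS allocation code: `LdsSymbol`, `alignUp`, and `packLds` are hypothetical names, and symbols are simply packed in declaration order at their own declared alignment, starting from a block base at offset 0.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one LDS symbol: size and alignment in bytes.
struct LdsSymbol {
  std::uint32_t Size;
  std::uint32_t Align;
};

// Rounds V up to the next multiple of A (A must be a nonzero power of two).
inline std::uint32_t alignUp(std::uint32_t V, std::uint32_t A) {
  assert(A != 0 && (A & (A - 1)) == 0 && "alignment must be a power of two");
  return (V + A - 1) & ~(A - 1);
}

// Packs symbols back to back, each at its own alignment, and returns the
// byte offset assigned to each symbol within the block.
inline std::vector<std::uint32_t> packLds(const std::vector<LdsSymbol> &Syms) {
  std::vector<std::uint32_t> Offsets;
  std::uint32_t Cur = 0; // block base; only this is assumed 64-dword aligned
  for (const LdsSymbol &S : Syms) {
    Cur = alignUp(Cur, S.Align);
    Offsets.push_back(Cur);
    Cur += S.Size;
  }
  return Offsets;
}
```

For the two `local int` variables above, modelled as `{4, 4}` each, this sketch places `x` at offset 0 and `y` at offset 4, so `y` is only 4-byte aligned even though the block as a whole starts on a 64-dword boundary.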

I don't understand this. You should only be using the node's alignment here. If there's a way to infer a higher alignment for something, that should be an optimization much earlier than selection.
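
One way to read this comment: the alignment that can be proven for an access is bounded by the alignment recorded for its base, however the offset looks. A minimal sketch of that rule follows; `knownAlign` is a hypothetical helper written for illustration, not a function from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the largest power of two guaranteed to divide
// Base + Offset when Base is only known to be a multiple of BaseAlign.
// This is the lowest set bit of (BaseAlign | Offset).
inline std::uint64_t knownAlign(std::uint64_t BaseAlign, std::uint64_t Offset) {
  assert(BaseAlign != 0 && (BaseAlign & (BaseAlign - 1)) == 0 &&
         "alignment must be a power of two");
  const std::uint64_t Bits = BaseAlign | Offset;
  return Bits & (~Bits + 1); // isolate the lowest set bit
}
```

With a base declared `align 4`, as in the test globals of this revision, `knownAlign(4, 32)` is 4: an offset that is a multiple of 8 cannot raise the result above the base alignment. Only a base known to be more strongly aligned, for example `knownAlign(256, 32)` giving 32, justifies an 8-byte access.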

test/CodeGen/AMDGPU/ds_read2.ll
661	No reason to use an external function call here

This revision now requires changes to proceed. Mar 2 2018, 5:48 PM

arsenm added inline comments. Mar 2 2018, 5:54 PM

lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
661–663	Also if you're referring to the allocation granularity for the entire program, that doesn't reflect the individual symbols allocated. We certainly don't allocate the individual globals with at least 8 byte alignment

Thank you guys. My assumption was wrong, I was thinking that each allocation gets 64-dword alignment.

FarhanaAleen abandoned this revision. Mar 2 2018, 6:35 PM

You can leverage the higher alignment once you have the actual allocation map.

Revision Contents

Path	Size
lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp	20 lines
lib/Target/AMDGPU/AMDGPUInstructions.td	2 lines
test/CodeGen/AMDGPU/ds_read2.ll	66 lines
test/CodeGen/AMDGPU/ds_read2_offset_order.ll	2 lines
test/CodeGen/AMDGPU/ds_read2_superreg.ll	11 lines
test/CodeGen/AMDGPU/ds_write2.ll	11 lines

Diff 136851

lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp

(99 lines not shown; enclosing context: private:)
bool isInlineImmediate(const SDNode *N) const;		bool isInlineImmediate(const SDNode *N) const;
bool FoldOperand(SDValue &Src, SDValue &Sel, SDValue &Neg, SDValue &Abs,		bool FoldOperand(SDValue &Src, SDValue &Sel, SDValue &Neg, SDValue &Abs,
const R600InstrInfo *TII);		const R600InstrInfo *TII);
bool FoldOperands(unsigned, const R600InstrInfo *, std::vector<SDValue> &);		bool FoldOperands(unsigned, const R600InstrInfo *, std::vector<SDValue> &);
bool FoldDotOperands(unsigned, const R600InstrInfo *, std::vector<SDValue> &);		bool FoldDotOperands(unsigned, const R600InstrInfo *, std::vector<SDValue> &);

bool isConstantLoad(const MemSDNode *N, int cbID) const;		bool isConstantLoad(const MemSDNode *N, int cbID) const;
bool isUniformBr(const SDNode *N) const;		bool isUniformBr(const SDNode *N) const;
		bool is8ByteAligned(const MemSDNode *N) const;

SDNode *glueCopyToM0(SDNode *N) const;		SDNode *glueCopyToM0(SDNode *N) const;

const TargetRegisterClass *getOperandRegClass(SDNode *N, unsigned OpNo) const;		const TargetRegisterClass *getOperandRegClass(SDNode *N, unsigned OpNo) const;
bool SelectGlobalValueConstantOffset(SDValue Addr, SDValue& IntPtr);		bool SelectGlobalValueConstantOffset(SDValue Addr, SDValue& IntPtr);
bool SelectGlobalValueVariableOffset(SDValue Addr, SDValue &BaseReg,		bool SelectGlobalValueVariableOffset(SDValue Addr, SDValue &BaseReg,
SDValue& Offset);		SDValue& Offset);
virtual bool SelectADDRVTX_READ(SDValue Addr, SDValue &Base, SDValue &Offset);		virtual bool SelectADDRVTX_READ(SDValue Addr, SDValue &Base, SDValue &Offset);
(529 lines not shown)

bool AMDGPUDAGToDAGISel::isUniformBr(const SDNode *N) const {		bool AMDGPUDAGToDAGISel::isUniformBr(const SDNode *N) const {
const BasicBlock *BB = FuncInfo->MBB->getBasicBlock();		const BasicBlock *BB = FuncInfo->MBB->getBasicBlock();
const Instruction *Term = BB->getTerminator();		const Instruction *Term = BB->getTerminator();
return Term->getMetadata("amdgpu.uniform") ||		return Term->getMetadata("amdgpu.uniform") ||
Term->getMetadata("structurizecfg.uniform");		Term->getMetadata("structurizecfg.uniform");
}		}

		bool AMDGPUDAGToDAGISel::is8ByteAligned(const MemSDNode *N) const {
		if ((N->getAlignment() & 7) == 0)
		return true;

		if (N->getAddressSpace() != AMDGPUASI.LOCAL_ADDRESS)
		return false;

		// LDS space is allocated to a work-group or wavefront in contiguous blocks of
		// 64 Dwords on 64-Dword alignment; checking offset being multiple of 8 is
		// sufficient to prove that the address is 8 byte aligned.
		if ((N->getSrcValueOffset() & 7) != 0)
		return false;

		SDValue Offset = N->getBasePtr();
		KnownBits Known;
		CurDAG->computeKnownBits(Offset, Known);
		return Known.countMinTrailingZeros() >= Log2_32(8);
		}

StringRef AMDGPUDAGToDAGISel::getPassName() const {		StringRef AMDGPUDAGToDAGISel::getPassName() const {
return "AMDGPU DAG->DAG Pattern Instruction Selection";		return "AMDGPU DAG->DAG Pattern Instruction Selection";
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Complex Patterns		// Complex Patterns
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

(1,582 lines not shown)

lib/Target/AMDGPU/AMDGPUInstructions.td

(239 lines not shown)
	>;			>;


	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// Load/Store Pattern Fragments			// Load/Store Pattern Fragments
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	class Aligned8Bytes <dag ops, dag frag> : PatFrag <ops, frag, [{			class Aligned8Bytes <dag ops, dag frag> : PatFrag <ops, frag, [{
	return cast<MemSDNode>(N)->getAlignment() % 8 == 0;			return is8ByteAligned(cast<MemSDNode>(N));
	}]>;			}]>;

	class LoadFrag <SDPatternOperator op> : PatFrag<(ops node:$ptr), (op node:$ptr)>;			class LoadFrag <SDPatternOperator op> : PatFrag<(ops node:$ptr), (op node:$ptr)>;

	class StoreFrag<SDPatternOperator op> : PatFrag <			class StoreFrag<SDPatternOperator op> : PatFrag <
	(ops node:$value, node:$ptr), (op node:$value, node:$ptr)			(ops node:$value, node:$ptr), (op node:$value, node:$ptr)
	>;			>;

(489 lines not shown)

test/CodeGen/AMDGPU/ds_read2.ll

(436 lines not shown)

@foo = addrspace(3) global [4 x i32] undef, align 4		@foo = addrspace(3) global [4 x i32] undef, align 4

; GCN-LABEL: @load_constant_adjacent_offsets		; GCN-LABEL: @load_constant_adjacent_offsets
; CI-DAG: s_mov_b32 m0		; CI-DAG: s_mov_b32 m0
; GFX9-NOT: m0		; GFX9-NOT: m0

; GCN-DAG: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0{{$}}		; GCN-DAG: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0{{$}}
; GCN: ds_read2_b32 v{{\[[0-9]+:[0-9]+\]}}, [[ZERO]] offset1:1		; GCN: ds_read_b64 v{{\[[0-9]+:[0-9]+\]}}, [[ZERO]]
define amdgpu_kernel void @load_constant_adjacent_offsets(i32 addrspace(1)* %out) {		define amdgpu_kernel void @load_constant_adjacent_offsets(i32 addrspace(1)* %out) {
%val0 = load i32, i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @foo, i32 0, i32 0), align 4		%val0 = load i32, i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @foo, i32 0, i32 0), align 4
%val1 = load i32, i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @foo, i32 0, i32 1), align 4		%val1 = load i32, i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @foo, i32 0, i32 1), align 4
%sum = add i32 %val0, %val1		%sum = add i32 %val0, %val1
store i32 %sum, i32 addrspace(1)* %out, align 4		store i32 %sum, i32 addrspace(1)* %out, align 4
ret void		ret void
}		}

(13 lines not shown)

@bar = addrspace(3) global [4 x i64] undef, align 4		@bar = addrspace(3) global [4 x i64] undef, align 4

; GCN-LABEL: @load_misaligned64_constant_offsets		; GCN-LABEL: @load_misaligned64_constant_offsets
; CI-DAG: s_mov_b32 m0		; CI-DAG: s_mov_b32 m0
; GFX9-NOT: m0		; GFX9-NOT: m0

; GCN-DAG: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0{{$}}		; GCN-DAG: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0{{$}}
; GCN: ds_read2_b32 v{{\[[0-9]+:[0-9]+\]}}, [[ZERO]] offset1:1		; GCN: ds_read2_b64 v{{\[[0-9]+:[0-9]+\]}}, [[ZERO]] offset1:1
; GCN: ds_read2_b32 v{{\[[0-9]+:[0-9]+\]}}, [[ZERO]] offset0:2 offset1:3
define amdgpu_kernel void @load_misaligned64_constant_offsets(i64 addrspace(1)* %out) {		define amdgpu_kernel void @load_misaligned64_constant_offsets(i64 addrspace(1)* %out) {
%val0 = load i64, i64 addrspace(3)* getelementptr inbounds ([4 x i64], [4 x i64] addrspace(3)* @bar, i32 0, i32 0), align 4		%val0 = load i64, i64 addrspace(3)* getelementptr inbounds ([4 x i64], [4 x i64] addrspace(3)* @bar, i32 0, i32 0), align 4
%val1 = load i64, i64 addrspace(3)* getelementptr inbounds ([4 x i64], [4 x i64] addrspace(3)* @bar, i32 0, i32 1), align 4		%val1 = load i64, i64 addrspace(3)* getelementptr inbounds ([4 x i64], [4 x i64] addrspace(3)* @bar, i32 0, i32 1), align 4
%sum = add i64 %val0, %val1		%sum = add i64 %val0, %val1
store i64 %sum, i64 addrspace(1)* %out, align 8		store i64 %sum, i64 addrspace(1)* %out, align 8
ret void		ret void
}		}

@bar.large = addrspace(3) global [4096 x i64] undef, align 4		@bar.large = addrspace(3) global [4096 x i64] undef, align 4

; GCN-LABEL: @load_misaligned64_constant_large_offsets		; GCN-LABEL: @load_misaligned64_constant_large_offsets
; CI-DAG: s_mov_b32 m0		; CI-DAG: s_mov_b32 m0
; GFX9-NOT: m0		; GFX9-NOT: m0

; GCN-DAG: v_mov_b32_e32 [[BASE0:v[0-9]+]], 0x7ff8{{$}}		; GCN-DAG: v_mov_b32_e32 [[BASE:v[0-9]+]], 0
; GCN-DAG: v_mov_b32_e32 [[BASE1:v[0-9]+]], 0x4000		; GCN-DAG: ds_read_b64 v{{\[[0-9]+:[0-9]+\]}}, [[BASE]] offset:16384
; GCN-DAG: ds_read2_b32 v{{\[[0-9]+:[0-9]+\]}}, [[BASE0]] offset1:1		; GCN-DAG: ds_read_b64 v{{\[[0-9]+:[0-9]+\]}}, [[BASE]] offset:32760
; GCN-DAG: ds_read2_b32 v{{\[[0-9]+:[0-9]+\]}}, [[BASE1]] offset1:1
; GCN: s_endpgm		; GCN: s_endpgm
define amdgpu_kernel void @load_misaligned64_constant_large_offsets(i64 addrspace(1)* %out) {		define amdgpu_kernel void @load_misaligned64_constant_large_offsets(i64 addrspace(1)* %out) {
%val0 = load i64, i64 addrspace(3)* getelementptr inbounds ([4096 x i64], [4096 x i64] addrspace(3)* @bar.large, i32 0, i32 2048), align 4		%val0 = load i64, i64 addrspace(3)* getelementptr inbounds ([4096 x i64], [4096 x i64] addrspace(3)* @bar.large, i32 0, i32 2048), align 4
%val1 = load i64, i64 addrspace(3)* getelementptr inbounds ([4096 x i64], [4096 x i64] addrspace(3)* @bar.large, i32 0, i32 4095), align 4		%val1 = load i64, i64 addrspace(3)* getelementptr inbounds ([4096 x i64], [4096 x i64] addrspace(3)* @bar.large, i32 0, i32 4095), align 4
%sum = add i64 %val0, %val1		%sum = add i64 %val0, %val1
store i64 %sum, i64 addrspace(1)* %out, align 8		store i64 %sum, i64 addrspace(1)* %out, align 8
ret void		ret void
}		}
(142 lines not shown; enclosing context: define amdgpu_ps <2 x float> @ds_read_interp_read(i32 inreg %prims, float addrspace(3)* %inptr) {)
%ptr1 = getelementptr float, float addrspace(3)* %inptr, i32 4		%ptr1 = getelementptr float, float addrspace(3)* %inptr, i32 4
%v1 = load float, float addrspace(3)* %ptr1, align 4		%v1 = load float, float addrspace(3)* %ptr1, align 4
%v1b = fadd float %v1, %intrp		%v1b = fadd float %v1, %intrp
%r0 = insertelement <2 x float> undef, float %v0, i32 0		%r0 = insertelement <2 x float> undef, float %v0, i32 0
%r1 = insertelement <2 x float> %r0, float %v1b, i32 1		%r1 = insertelement <2 x float> %r0, float %v1b, i32 1
ret <2 x float> %r1		ret <2 x float> %r1
}		}

		@localMemory = internal unnamed_addr addrspace(3) constant [819 x float] undef, align 4
		; Function Attrs: nounwind readnone
		declare i64 @hc_get_workitem_id(i32)

		; lowerload() should not adjust alignment;
		; GCN-LABEL: load_v2f32_local_unaligned8:
		; GCN: ds_read2_b32
		; GCN: ds_read2_b32
		define amdgpu_kernel void @load_v2f32_local_unaligned8(float addrspace(1)* %out) {
		entry:
		%call = tail call i64 @hc_get_workitem_id(i32 0)
		%conv = trunc i64 %call to i32
		%index = and i32 %conv, 7
		%localPtr = getelementptr inbounds [819 x float], [819 x float] addrspace(3)* @localMemory, i32 0, i32 %index
		%elt1 = load float, float addrspace(3)* %localPtr, align 4
		%arrayidx1 = getelementptr inbounds float, float addrspace(3)* %localPtr, i32 1
		%elt2 = load float, float addrspace(3)* %arrayidx1, align 4
		%arrayidx2 = getelementptr inbounds float, float addrspace(3)* %localPtr, i32 2
		%elt3 = load float, float addrspace(3)* %arrayidx2, align 4
		%arrayidx3 = getelementptr inbounds float, float addrspace(3)* %localPtr, i32 3
		%elt4 = load float, float addrspace(3)* %arrayidx3, align 4
		%mul1 = fmul float %elt1, %elt2
		%mul2 = fmul float %elt3, %elt4
		%add = fadd float %mul1, %mul2
		%arrayidx11 = getelementptr inbounds float, float addrspace(1)* %out, i64 0
		store float %add, float addrspace(1)* %arrayidx11, align 4
		ret void
		}

		; load address is 8 byte aligned; lowerload() should not adjust alignment.
		; GCN-LABEL: load_v2f32_local_aligned8:
		; GCN: ds_read2_b64
		define amdgpu_kernel void @load_v2f32_local_aligned8(float addrspace(1)* %out) {
		entry:
		%call = tail call i64 @hc_get_workitem_id(i32 0)
		%conv = trunc i64 %call to i32
		%rem = and i32 %conv, 7
		%index = shl nuw nsw i32 %rem, 3
		%localPtr = getelementptr inbounds [819 x float], [819 x float] addrspace(3)* @localMemory, i32 0, i32 %index
		%elt1 = load float, float addrspace(3)* %localPtr, align 4
		%arrayidx1 = getelementptr inbounds float, float addrspace(3)* %localPtr, i32 1
		%elt2 = load float, float addrspace(3)* %arrayidx1, align 4
		%arrayidx2 = getelementptr inbounds float, float addrspace(3)* %localPtr, i32 2
		%elt3 = load float, float addrspace(3)* %arrayidx2, align 4
		%arrayidx3 = getelementptr inbounds float, float addrspace(3)* %localPtr, i32 3
		%elt4 = load float, float addrspace(3)* %arrayidx3, align 4
		%mul1 = fmul float %elt1, %elt2
		%mul2 = fmul float %elt3, %elt4
		%add = fadd float %mul1, %mul2
		%arrayidx11 = getelementptr inbounds float, float addrspace(1)* %out, i64 0
		store float %add, float addrspace(1)* %arrayidx11, align 4
		ret void
		}

declare void @void_func_void() #3		declare void @void_func_void() #3

declare i32 @llvm.amdgcn.workgroup.id.x() #1		declare i32 @llvm.amdgcn.workgroup.id.x() #1
declare i32 @llvm.amdgcn.workgroup.id.y() #1		declare i32 @llvm.amdgcn.workgroup.id.y() #1
declare i32 @llvm.amdgcn.workitem.id.x() #1		declare i32 @llvm.amdgcn.workitem.id.x() #1
declare i32 @llvm.amdgcn.workitem.id.y() #1		declare i32 @llvm.amdgcn.workitem.id.y() #1

declare float @llvm.amdgcn.interp.mov(i32, i32, i32, i32) nounwind readnone		declare float @llvm.amdgcn.interp.mov(i32, i32, i32, i32) nounwind readnone

declare void @llvm.amdgcn.s.barrier() #2		declare void @llvm.amdgcn.s.barrier() #2

attributes #0 = { nounwind }		attributes #0 = { nounwind }
attributes #1 = { nounwind readnone speculatable }		attributes #1 = { nounwind readnone speculatable }
attributes #2 = { convergent nounwind }		attributes #2 = { convergent nounwind }
attributes #3 = { nounwind noinline }		attributes #3 = { nounwind noinline }

test/CodeGen/AMDGPU/ds_read2_offset_order.ll

	; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs < %s | FileCheck -strict-whitespace -check-prefix=SI %s			; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs < %s | FileCheck -strict-whitespace -check-prefix=SI %s
	; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -strict-whitespace -check-prefix=SI %s			; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -strict-whitespace -check-prefix=SI %s

	@lds = addrspace(3) global [512 x float] undef, align 4			@lds = addrspace(3) global [512 x float] undef, align 4

	; offset0 is larger than offset1			; offset0 is larger than offset1

	; SI-LABEL: {{^}}offset_order:			; SI-LABEL: {{^}}offset_order:
	; SI-DAG: ds_read2st64_b32 v[{{[0-9]+}}:{{[0-9]+}}], v{{[0-9]+}} offset1:4{{$}}			; SI-DAG: ds_read2st64_b32 v[{{[0-9]+}}:{{[0-9]+}}], v{{[0-9]+}} offset1:4{{$}}
	; SI-DAG: ds_read2_b32 v[{{[0-9]+}}:{{[0-9]+}}], v{{[0-9]+}} offset0:2 offset1:3			; SI-DAG: ds_read_b64 v[{{[0-9]+}}:{{[0-9]+}}], v{{[0-9]+}} offset:8
	; SI-DAG: ds_read_b32 v{{[0-9]+}}, v{{[0-9]+}} offset:56			; SI-DAG: ds_read_b32 v{{[0-9]+}}, v{{[0-9]+}} offset:56
	; SI-DAG: ds_read2_b32 v[{{[0-9]+}}:{{[0-9]+}}], v{{[0-9]+}} offset0:11 offset1:12			; SI-DAG: ds_read2_b32 v[{{[0-9]+}}:{{[0-9]+}}], v{{[0-9]+}} offset0:11 offset1:12
	define amdgpu_kernel void @offset_order(float addrspace(1)* %out) {			define amdgpu_kernel void @offset_order(float addrspace(1)* %out) {
	entry:			entry:
	%ptr0 = getelementptr inbounds [512 x float], [512 x float] addrspace(3)* @lds, i32 0, i32 0			%ptr0 = getelementptr inbounds [512 x float], [512 x float] addrspace(3)* @lds, i32 0, i32 0
	%val0 = load float, float addrspace(3)* %ptr0			%val0 = load float, float addrspace(3)* %ptr0

	%ptr1 = getelementptr inbounds [512 x float], [512 x float] addrspace(3)* @lds, i32 0, i32 256			%ptr1 = getelementptr inbounds [512 x float], [512 x float] addrspace(3)* @lds, i32 0, i32 256
(25 lines not shown)

test/CodeGen/AMDGPU/ds_read2_superreg.ll

; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs -mattr=+load-store-opt < %s | FileCheck -check-prefix=CI %s		; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs -mattr=+load-store-opt < %s | FileCheck -check-prefix=CI %s

@lds = addrspace(3) global [512 x float] undef, align 4		@lds = addrspace(3) global [512 x float] undef, align 4
@lds.v2 = addrspace(3) global [512 x <2 x float>] undef, align 4		@lds.v2 = addrspace(3) global [512 x <2 x float>] undef, align 4
@lds.v3 = addrspace(3) global [512 x <3 x float>] undef, align 4		@lds.v3 = addrspace(3) global [512 x <3 x float>] undef, align 4
@lds.v4 = addrspace(3) global [512 x <4 x float>] undef, align 4		@lds.v4 = addrspace(3) global [512 x <4 x float>] undef, align 4
@lds.v8 = addrspace(3) global [512 x <8 x float>] undef, align 4		@lds.v8 = addrspace(3) global [512 x <8 x float>] undef, align 4
@lds.v16 = addrspace(3) global [512 x <16 x float>] undef, align 4		@lds.v16 = addrspace(3) global [512 x <16 x float>] undef, align 4

; CI-LABEL: {{^}}simple_read2_v2f32_superreg_align4:		; CI-LABEL: {{^}}simple_read2_v2f32_superreg_align4:
; CI: ds_read2_b32 [[RESULT:v\[[0-9]+:[0-9]+\]]], v{{[0-9]+}} offset1:1{{$}}		; CI: ds_read_b64 [[RESULT:v\[[0-9]+:[0-9]+\]]], v{{[0-9]+}}
; CI: s_waitcnt lgkmcnt(0)		; CI: s_waitcnt lgkmcnt(0)
; CI: buffer_store_dwordx2 [[RESULT]]		; CI: buffer_store_dwordx2 [[RESULT]]
; CI: s_endpgm		; CI: s_endpgm
define amdgpu_kernel void @simple_read2_v2f32_superreg_align4(<2 x float> addrspace(1)* %out) #0 {		define amdgpu_kernel void @simple_read2_v2f32_superreg_align4(<2 x float> addrspace(1)* %out) #0 {
%x.i = tail call i32 @llvm.amdgcn.workitem.id.x() #1		%x.i = tail call i32 @llvm.amdgcn.workitem.id.x() #1
%arrayidx0 = getelementptr inbounds [512 x <2 x float>], [512 x <2 x float>] addrspace(3)* @lds.v2, i32 0, i32 %x.i		%arrayidx0 = getelementptr inbounds [512 x <2 x float>], [512 x <2 x float>] addrspace(3)* @lds.v2, i32 0, i32 %x.i
%val0 = load <2 x float>, <2 x float> addrspace(3)* %arrayidx0, align 4		%val0 = load <2 x float>, <2 x float> addrspace(3)* %arrayidx0, align 4
%out.gep = getelementptr inbounds <2 x float>, <2 x float> addrspace(1)* %out, i32 %x.i		%out.gep = getelementptr inbounds <2 x float>, <2 x float> addrspace(1)* %out, i32 %x.i
(11 lines not shown; enclosing context: define amdgpu_kernel void @simple_read2_v2f32_superreg(<2 x float> addrspace(1)* %out) #0 {)
%arrayidx0 = getelementptr inbounds [512 x <2 x float>], [512 x <2 x float>] addrspace(3)* @lds.v2, i32 0, i32 %x.i		%arrayidx0 = getelementptr inbounds [512 x <2 x float>], [512 x <2 x float>] addrspace(3)* @lds.v2, i32 0, i32 %x.i
%val0 = load <2 x float>, <2 x float> addrspace(3)* %arrayidx0		%val0 = load <2 x float>, <2 x float> addrspace(3)* %arrayidx0
%out.gep = getelementptr inbounds <2 x float>, <2 x float> addrspace(1)* %out, i32 %x.i		%out.gep = getelementptr inbounds <2 x float>, <2 x float> addrspace(1)* %out, i32 %x.i
store <2 x float> %val0, <2 x float> addrspace(1)* %out.gep		store <2 x float> %val0, <2 x float> addrspace(1)* %out.gep
ret void		ret void
}		}

; CI-LABEL: {{^}}simple_read2_v4f32_superreg_align4:		; CI-LABEL: {{^}}simple_read2_v4f32_superreg_align4:
; CI-DAG: ds_read2_b32 v{{\[}}[[REG_X:[0-9]+]]:[[REG_Y:[0-9]+]]{{\]}}, v{{[0-9]+}} offset1:1{{$}}		; CI-DAG: ds_read2_b64 v{{\[}}[[REG_X:[0-9]+]]:[[REG_Y:[0-9]+]]{{\]}}, v{{[0-9]+}} offset1:1
; CI-DAG: ds_read2_b32 v{{\[}}[[REG_Z:[0-9]+]]:[[REG_W:[0-9]+]]{{\]}}, v{{[0-9]+}} offset0:2 offset1:3{{$}}		; CI-DAG: v_add_f32_e32 v[[ADD0:[0-9]+]], v[[REG_X]]
; CI-DAG: v_add_f32_e32 v[[ADD0:[0-9]+]], v[[REG_X]], v[[REG_Z]]		; CI-DAG: v_add_f32_e32 v[[ADD1:[0-9]+]], v{{[0-9]+}}, v[[REG_Y]]
; CI-DAG: v_add_f32_e32 v[[ADD1:[0-9]+]], v[[REG_Y]], v[[REG_W]]
; CI: v_add_f32_e32 v[[ADD2:[0-9]+]], v[[ADD0]], v[[ADD1]]		; CI: v_add_f32_e32 v[[ADD2:[0-9]+]], v[[ADD0]], v[[ADD1]]
; CI: buffer_store_dword v[[ADD2]]		; CI: buffer_store_dword v[[ADD2]]
; CI: s_endpgm		; CI: s_endpgm
define amdgpu_kernel void @simple_read2_v4f32_superreg_align4(float addrspace(1)* %out) #0 {		define amdgpu_kernel void @simple_read2_v4f32_superreg_align4(float addrspace(1)* %out) #0 {
%x.i = tail call i32 @llvm.amdgcn.workitem.id.x() #1		%x.i = tail call i32 @llvm.amdgcn.workitem.id.x() #1
%arrayidx0 = getelementptr inbounds [512 x <4 x float>], [512 x <4 x float>] addrspace(3)* @lds.v4, i32 0, i32 %x.i		%arrayidx0 = getelementptr inbounds [512 x <4 x float>], [512 x <4 x float>] addrspace(3)* @lds.v4, i32 0, i32 %x.i
%val0 = load <4 x float>, <4 x float> addrspace(3)* %arrayidx0, align 4		%val0 = load <4 x float>, <4 x float> addrspace(3)* %arrayidx0, align 4
%elt0 = extractelement <4 x float> %val0, i32 0		%elt0 = extractelement <4 x float> %val0, i32 0
%elt1 = extractelement <4 x float> %val0, i32 1		%elt1 = extractelement <4 x float> %val0, i32 1
%elt2 = extractelement <4 x float> %val0, i32 2		%elt2 = extractelement <4 x float> %val0, i32 2
%elt3 = extractelement <4 x float> %val0, i32 3		%elt3 = extractelement <4 x float> %val0, i32 3

%add0 = fadd float %elt0, %elt2		%add0 = fadd float %elt0, %elt2
%add1 = fadd float %elt1, %elt3		%add1 = fadd float %elt1, %elt3
%add2 = fadd float %add0, %add1		%add2 = fadd float %add0, %add1

%out.gep = getelementptr inbounds float, float addrspace(1)* %out, i32 %x.i		%out.gep = getelementptr inbounds float, float addrspace(1)* %out, i32 %x.i
store float %add2, float addrspace(1)* %out.gep		store float %add2, float addrspace(1)* %out.gep
ret void		ret void
}		}

; CI-LABEL: {{^}}simple_read2_v3f32_superreg_align4:		; CI-LABEL: {{^}}simple_read2_v3f32_superreg_align4:
; CI-DAG: ds_read2_b32 v{{\[}}[[REG_X:[0-9]+]]:[[REG_Y:[0-9]+]]{{\]}}, v{{[0-9]+}} offset1:1{{$}}		; CI-DAG: ds_read_b64 v{{\[}}[[REG_X:[0-9]+]]:[[REG_Y:[0-9]+]]{{\]}}, v{{[0-9]+}}
; CI-DAG: ds_read_b32 v[[REG_Z:[0-9]+]], v{{[0-9]+}} offset:8{{$}}		; CI-DAG: ds_read_b32 v[[REG_Z:[0-9]+]], v{{[0-9]+}} offset:8{{$}}
; CI-DAG: v_add_f32_e32 v[[ADD0:[0-9]+]], v[[REG_X]], v[[REG_Z]]		; CI-DAG: v_add_f32_e32 v[[ADD0:[0-9]+]], v[[REG_X]], v[[REG_Z]]
; CI-DAG: v_add_f32_e32 v[[ADD1:[0-9]+]], v[[ADD0]], v[[REG_Y]]		; CI-DAG: v_add_f32_e32 v[[ADD1:[0-9]+]], v[[ADD0]], v[[REG_Y]]
; CI: buffer_store_dword v[[ADD1]]		; CI: buffer_store_dword v[[ADD1]]
; CI: s_endpgm		; CI: s_endpgm
define amdgpu_kernel void @simple_read2_v3f32_superreg_align4(float addrspace(1)* %out) #0 {		define amdgpu_kernel void @simple_read2_v3f32_superreg_align4(float addrspace(1)* %out) #0 {
%x.i = tail call i32 @llvm.amdgcn.workitem.id.x() #1		%x.i = tail call i32 @llvm.amdgcn.workitem.id.x() #1
%arrayidx0 = getelementptr inbounds [512 x <3 x float>], [512 x <3 x float>] addrspace(3)* @lds.v3, i32 0, i32 %x.i		%arrayidx0 = getelementptr inbounds [512 x <3 x float>], [512 x <3 x float>] addrspace(3)* @lds.v3, i32 0, i32 %x.i
(136 lines not shown)

test/CodeGen/AMDGPU/ds_write2.ll

(384 lines not shown)

	@foo = addrspace(3) global [4 x i32] undef, align 4			@foo = addrspace(3) global [4 x i32] undef, align 4

	; GCN-LABEL: @store_constant_adjacent_offsets			; GCN-LABEL: @store_constant_adjacent_offsets
	; CI-DAG: s_mov_b32 m0			; CI-DAG: s_mov_b32 m0
	; GFX9-NOT: m0			; GFX9-NOT: m0

	; GCN-DAG: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0{{$}}			; GCN-DAG: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0{{$}}
	; GCN: ds_write2_b32 [[ZERO]], v{{[0-9]+}}, v{{[0-9]+}} offset1:1			; GCN: ds_write_b64 [[ZERO]], {{v\[[0-9]+:[0-9]+\]}}
	define amdgpu_kernel void @store_constant_adjacent_offsets() {			define amdgpu_kernel void @store_constant_adjacent_offsets() {
	store i32 123, i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @foo, i32 0, i32 0), align 4			store i32 123, i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @foo, i32 0, i32 0), align 4
	store i32 123, i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @foo, i32 0, i32 1), align 4			store i32 123, i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @foo, i32 0, i32 1), align 4
	ret void			ret void
	}			}

	; GCN-LABEL: @store_constant_disjoint_offsets			; GCN-LABEL: @store_constant_disjoint_offsets
	; CI-DAG: s_mov_b32 m0			; CI-DAG: s_mov_b32 m0
(10 lines not shown)

	@bar = addrspace(3) global [4 x i64] undef, align 4			@bar = addrspace(3) global [4 x i64] undef, align 4

	; GCN-LABEL: @store_misaligned64_constant_offsets			; GCN-LABEL: @store_misaligned64_constant_offsets
	; CI-DAG: s_mov_b32 m0			; CI-DAG: s_mov_b32 m0
	; GFX9-NOT: m0			; GFX9-NOT: m0

	; GCN-DAG: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0{{$}}			; GCN-DAG: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0{{$}}
	; GCN-DAG: ds_write2_b32 [[ZERO]], v{{[0-9]+}}, v{{[0-9]+}} offset1:1			; GCN-DAG: ds_write2_b64 [[ZERO]], {{v\[[0-9]+:[0-9]+\]}}, {{v\[[0-9]+:[0-9]+\]}} offset1:1
	; GCN-DAG: ds_write2_b32 [[ZERO]], v{{[0-9]+}}, v{{[0-9]+}} offset0:2 offset1:3
	; GCN: s_endpgm			; GCN: s_endpgm
	define amdgpu_kernel void @store_misaligned64_constant_offsets() {			define amdgpu_kernel void @store_misaligned64_constant_offsets() {
	store i64 123, i64 addrspace(3)* getelementptr inbounds ([4 x i64], [4 x i64] addrspace(3)* @bar, i32 0, i32 0), align 4			store i64 123, i64 addrspace(3)* getelementptr inbounds ([4 x i64], [4 x i64] addrspace(3)* @bar, i32 0, i32 0), align 4
	store i64 123, i64 addrspace(3)* getelementptr inbounds ([4 x i64], [4 x i64] addrspace(3)* @bar, i32 0, i32 1), align 4			store i64 123, i64 addrspace(3)* getelementptr inbounds ([4 x i64], [4 x i64] addrspace(3)* @bar, i32 0, i32 1), align 4
	ret void			ret void
	}			}

	@bar.large = addrspace(3) global [4096 x i64] undef, align 4			@bar.large = addrspace(3) global [4096 x i64] undef, align 4

	; GCN-LABEL: @store_misaligned64_constant_large_offsets			; GCN-LABEL: @store_misaligned64_constant_large_offsets
	; CI-DAG: s_mov_b32 m0			; CI-DAG: s_mov_b32 m0
	; GFX9-NOT: m0			; GFX9-NOT: m0

	; GCN-DAG: v_mov_b32_e32 [[BASE0:v[0-9]+]], 0x7ff8{{$}}			; GCN-DAG: ds_write_b64 v{{[0-9]+}}, {{v\[[0-9]+:[0-9]+\]}} offset:16384
	; GCN-DAG: v_mov_b32_e32 [[BASE1:v[0-9]+]], 0x4000{{$}}			; GCN-DAG: ds_write_b64 v{{[0-9]+}}, {{v\[[0-9]+:[0-9]+\]}} offset:32760
	; GCN-DAG: ds_write2_b32 [[BASE0]], v{{[0-9]+}}, v{{[0-9]+}} offset1:1
	; GCN-DAG: ds_write2_b32 [[BASE1]], v{{[0-9]+}}, v{{[0-9]+}} offset1:1
	; GCN: s_endpgm			; GCN: s_endpgm
	define amdgpu_kernel void @store_misaligned64_constant_large_offsets() {			define amdgpu_kernel void @store_misaligned64_constant_large_offsets() {
	store i64 123, i64 addrspace(3)* getelementptr inbounds ([4096 x i64], [4096 x i64] addrspace(3)* @bar.large, i32 0, i32 2048), align 4			store i64 123, i64 addrspace(3)* getelementptr inbounds ([4096 x i64], [4096 x i64] addrspace(3)* @bar.large, i32 0, i32 2048), align 4
	store i64 123, i64 addrspace(3)* getelementptr inbounds ([4096 x i64], [4096 x i64] addrspace(3)* @bar.large, i32 0, i32 4095), align 4			store i64 123, i64 addrspace(3)* getelementptr inbounds ([4096 x i64], [4096 x i64] addrspace(3)* @bar.large, i32 0, i32 4095), align 4
	ret void			ret void
	}			}

	@sgemm.lA = internal unnamed_addr addrspace(3) global [264 x float] undef, align 4			@sgemm.lA = internal unnamed_addr addrspace(3) global [264 x float] undef, align 4
(60 lines not shown)