This is an archive of the discontinued LLVM Phabricator instance.

Differential D96421

[AMDGPU] Better selection of base offset when merging DS reads/writes
ClosedPublic

Authored by foad on Feb 10 2021, 7:25 AM.

Details

Reviewers

rampitec
arsenm
s-perron

Commits

rG23db2d363fd3: [AMDGPU] Better selection of base offset when merging DS reads/writes

Summary

When merging a pair of DS reads or writes needs to materialize the base
offset in a vgpr, choose a value that is aligned to as high a power of
two as possible. This maximises the chance that different pairs can use
the same base offset, in which case the base offset registers can be
commoned up by MachineCSE.
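As a rough illustration of the idea in the summary, here is a minimal standalone sketch of picking the most aligned value in a range. It is not the patch's code: the patch relies on LLVM's `maskLeadingOnes` and `countLeadingZeros`, while this sketch substitutes a hand-written loop so that it builds without LLVM headers. The names `countLeadingZeros32`, `countTrailingZeros32`, `mostAlignedBruteForce` and `checkSmallRanges` are hypothetical helpers added for illustration, and the early return for a zero XOR is an addition of this sketch that keeps the shift amount in range.

```cpp
#include <cassert>
#include <cstdint>

// Number of leading zero bits in a 32-bit value; returns 32 for X == 0.
inline unsigned countLeadingZeros32(uint32_t X) {
  unsigned N = 0;
  for (uint32_t Bit = 0x80000000u; Bit != 0 && (X & Bit) == 0; Bit >>= 1)
    ++N;
  return N;
}

// Number of trailing zero bits; returns 32 for X == 0, so 0 counts as the
// most aligned value of all.
inline unsigned countTrailingZeros32(uint32_t X) {
  if (X == 0)
    return 32;
  unsigned N = 0;
  while ((X & 1) == 0) {
    X >>= 1;
    ++N;
  }
  return N;
}

// Value in the inclusive range [Lo, Hi] that is aligned to the highest power
// of two. Find the highest bit where Lo - 1 and Hi differ, keep the bits of
// Hi down to and including that bit, and clear everything below it.
inline uint32_t mostAlignedValueInRange(uint32_t Lo, uint32_t Hi) {
  uint32_t Diff = (Lo - 1) ^ Hi; // unsigned wraparound is intended
  if (Diff == 0)
    return 0; // the (wrapped) range covers every 32-bit value, including 0
  unsigned Keep = countLeadingZeros32(Diff) + 1; // in [1, 32]
  return Hi & (~uint32_t(0) << (32 - Keep));
}

// Reference answer by exhaustive search; only meant for small Lo <= Hi.
inline uint32_t mostAlignedBruteForce(uint32_t Lo, uint32_t Hi) {
  uint32_t Best = Lo;
  for (uint32_t V = Lo; V <= Hi; ++V)
    if (countTrailingZeros32(V) > countTrailingZeros32(Best))
      Best = V;
  return Best;
}

// Cross-check the bit trick against the exhaustive search on small ranges.
inline void checkSmallRanges(uint32_t Limit) {
  for (uint32_t Lo = 0; Lo < Limit; ++Lo)
    for (uint32_t Hi = Lo; Hi < Limit; ++Hi)
      assert(mostAlignedValueInRange(Lo, Hi) == mostAlignedBruteForce(Lo, Hi));
}
```

For example, with element offsets 200 and 300 and an 8-bit offset field, any base in the range [45, 200] would work, and the sketch picks 128.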

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

Time	Test
290 ms	x64 debian > MemProfiler-x86_64-linux-dynamic.TestCases::test_malloc_load_store.c
430 ms	x64 debian > MemProfiler-x86_64-linux.TestCases::test_malloc_load_store.c

Event Timeline

foad created this revision. Feb 10 2021, 7:25 AM

Herald added subscribers: kerbowa, hiraditya, t-tye and 6 others. Feb 10 2021, 7:25 AM

foad requested review of this revision. Feb 10 2021, 7:25 AM

Herald added a project: Restricted Project. Feb 10 2021, 7:25 AM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B88633: Diff 322680. Feb 10 2021, 7:27 AM

arsenm added inline comments. Feb 10 2021, 8:12 AM

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
812–813	Should probably use uint32_t throughout here
815–816	const
824–826	This bit magic is a bit fancy. Can you extract it to a separate function

Nice, thank you! LGTM with Matt's comments addressed.

This revision is now accepted and ready to land. Feb 10 2021, 10:01 AM

Split out mostAlignedValueInRange.

foad marked 2 inline comments as done. Feb 11 2021, 7:33 AM

foad added inline comments.

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
812–813	I'd prefer not to, since the rest of the file is pretty consistent in using `unsigned` throughout.

arsenm added inline comments. Feb 11 2021, 7:47 AM

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
812–813	LLVM is pretty consistently wrong in using unsigned for 32-bit values when it's not guaranteed by the standard

Use uint32_t.

Harbormaster completed remote builds in B88814: Diff 323004. Feb 11 2021, 8:43 AM

arsenm accepted this revision. Feb 11 2021, 8:54 AM

This revision was landed with ongoing or failed builds. Feb 11 2021, 9:46 AM

Closed by commit rG23db2d363fd3: [AMDGPU] Better selection of base offset when merging DS reads/writes (authored by foad).

This revision was automatically updated to reflect the committed changes.

foad added a commit: rG23db2d363fd3: [AMDGPU] Better selection of base offset when merging DS reads/writes.

Harbormaster completed remote builds in B88830: Diff 323030. Feb 11 2021, 11:46 AM

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp	40 lines
llvm/test/CodeGen/AMDGPU/ds-combine-large-stride.ll	120 lines
llvm/test/CodeGen/AMDGPU/fence-lds-read2-write2.ll	11 lines
llvm/test/CodeGen/AMDGPU/merge-load-store-vreg.mir	8 lines

Diff 323004

llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp

[719 lines not shown]	if (!NewFormatInfo)
return 0;		return 0;

assert(NewFormatInfo->NumFormat == OldFormatInfo->NumFormat &&		assert(NewFormatInfo->NumFormat == OldFormatInfo->NumFormat &&
NewFormatInfo->BitsPerComp == OldFormatInfo->BitsPerComp);		NewFormatInfo->BitsPerComp == OldFormatInfo->BitsPerComp);

return NewFormatInfo->Format;		return NewFormatInfo->Format;
}		}

		// Return the value in the inclusive range [Lo,Hi] that is aligned to the
		// highest power of two. Note that the result is well defined for all inputs
		// including corner cases like:
		// - if Lo == Hi, return that value
		// - if Lo == 0, return 0 (even though the "- 1" below underflows)
		// - if Lo > Hi, return 0 (as if the range wrapped around)
		static unsigned mostAlignedValueInRange(unsigned Lo, unsigned Hi) {
		return Hi & maskLeadingOnes<unsigned>(countLeadingZeros((Lo - 1) ^ Hi) + 1);
		}

bool SILoadStoreOptimizer::offsetsCanBeCombined(CombineInfo &CI,		bool SILoadStoreOptimizer::offsetsCanBeCombined(CombineInfo &CI,
const GCNSubtarget &STI,		const GCNSubtarget &STI,
CombineInfo &Paired,		CombineInfo &Paired,
bool Modify) {		bool Modify) {
assert(CI.InstClass != MIMG);		assert(CI.InstClass != MIMG);

// XXX - Would the same offset be OK? Is there any reason this would happen or		// XXX - Would the same offset be OK? Is there any reason this would happen or
// be useful?		// be useful?
[58 lines not shown]	if (isUInt<8>(EltOffset0) && isUInt<8>(EltOffset1)) {
if (Modify) {		if (Modify) {
CI.Offset = EltOffset0;		CI.Offset = EltOffset0;
Paired.Offset = EltOffset1;		Paired.Offset = EltOffset1;
}		}
return true;		return true;
}		}

// Try to shift base address to decrease offsets.		// Try to shift base address to decrease offsets.
unsigned OffsetDiff = std::abs((int)EltOffset1 - (int)EltOffset0);		unsigned Min = std::min(EltOffset0, EltOffset1);
CI.BaseOff = std::min(CI.Offset, Paired.Offset);		unsigned Max = std::max(EltOffset0, EltOffset1);

if ((OffsetDiff % 64 == 0) && isUInt<8>(OffsetDiff / 64)) {		const unsigned Mask = maskTrailingOnes<unsigned>(8) * 64;
		if (((Max - Min) & ~Mask) == 0) {
if (Modify) {		if (Modify) {
CI.Offset = (EltOffset0 - CI.BaseOff / CI.EltSize) / 64;		// From the range of values we could use for BaseOff, choose the one that
Paired.Offset = (EltOffset1 - CI.BaseOff / CI.EltSize) / 64;		// is aligned to the highest power of two, to maximise the chance that
		// the same offset can be reused for other load/store pairs.
		unsigned BaseOff = mostAlignedValueInRange(Max - 0xff * 64, Min);
		// Copy the low bits of the offsets, so that when we adjust them by
		// subtracting BaseOff they will be multiples of 64.
		BaseOff |= Min & maskTrailingOnes<unsigned>(6);
		CI.BaseOff = BaseOff * CI.EltSize;
		CI.Offset = (EltOffset0 - BaseOff) / 64;
		Paired.Offset = (EltOffset1 - BaseOff) / 64;
CI.UseST64 = true;		CI.UseST64 = true;
}		}
return true;		return true;
}		}

if (isUInt<8>(OffsetDiff)) {		if (isUInt<8>(Max - Min)) {
if (Modify) {		if (Modify) {
CI.Offset = EltOffset0 - CI.BaseOff / CI.EltSize;		// From the range of values we could use for BaseOff, choose the one that
Paired.Offset = EltOffset1 - CI.BaseOff / CI.EltSize;		// is aligned to the highest power of two, to maximise the chance that
		// the same offset can be reused for other load/store pairs.
		unsigned BaseOff = mostAlignedValueInRange(Max - 0xff, Min);
		CI.BaseOff = BaseOff * CI.EltSize;
		CI.Offset = EltOffset0 - BaseOff;
		Paired.Offset = EltOffset1 - BaseOff;
}		}
return true;		return true;
}		}

return false;		return false;
}		}

bool SILoadStoreOptimizer::widthsFit(const GCNSubtarget &STM,		bool SILoadStoreOptimizer::widthsFit(const GCNSubtarget &STM,
[1,369 lines not shown]
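To make the effect on the non-ST64 path concrete, the following sketch contrasts the old base choice (the smaller of the two offsets) with the new one. It assumes, as in the patch, that the two element offsets differ by at most 255 and do not both already fit in 8 bits. `Read2Operands`, `selectOldBase` and `selectNewBase` are hypothetical names for illustration only; the real logic lives in `SILoadStoreOptimizer::offsetsCanBeCombined`, and the helper is restated here with a plain loop instead of LLVM's bit utilities.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Restatement of the patch's helper without LLVM headers: keep the bits of Hi
// down to the highest bit where Lo - 1 and Hi differ, and clear the rest.
inline uint32_t mostAlignedValueInRange(uint32_t Lo, uint32_t Hi) {
  uint32_t Diff = (Lo - 1) ^ Hi; // unsigned wraparound is intended
  if (Diff == 0)
    return 0;
  uint32_t Top = 0x80000000u;
  while ((Diff & Top) == 0)
    Top >>= 1; // Top is now the highest differing bit
  return Hi & ~(Top - 1);
}

// Hypothetical summary of what a merged ds_read2/ds_write2 would use.
struct Read2Operands {
  uint32_t BaseOffBytes; // constant added to the address register, in bytes
  uint32_t Offset0;      // first 8-bit offset field, in elements
  uint32_t Offset1;      // second 8-bit offset field, in elements
};

// Old behaviour: the base is simply the smaller element offset.
inline Read2Operands selectOldBase(uint32_t EltOffset0, uint32_t EltOffset1,
                                   uint32_t EltSize) {
  uint32_t Base = std::min(EltOffset0, EltOffset1);
  return {Base * EltSize, EltOffset0 - Base, EltOffset1 - Base};
}

// New behaviour: any base in [Max - 0xff, Min] keeps both offsets within
// 8 bits, so pick the most aligned one.
inline Read2Operands selectNewBase(uint32_t EltOffset0, uint32_t EltOffset1,
                                   uint32_t EltSize) {
  uint32_t Min = std::min(EltOffset0, EltOffset1);
  uint32_t Max = std::max(EltOffset0, EltOffset1);
  assert(Max - Min <= 0xff && "offsets must be within 255 elements");
  uint32_t Base = mostAlignedValueInRange(Max - 0xff, Min);
  return {Base * EltSize, EltOffset0 - Base, EltOffset1 - Base};
}
```

With 4-byte elements at offsets (400, 420), (440, 460), (480, 500) and (520, 540), as in `ds_read32_combine_stride_20`, the old choice gives four distinct bases (0x640, 0x6e0, 0x780, 0x820), while the new choice gives 0x400 three times and 0x800 once, so only two base registers need to be materialized.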

llvm/test/CodeGen/AMDGPU/ds-combine-large-stride.ll

; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=tonga -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,VI %s		; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=tonga -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,VI %s
; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GFX9 %s		; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GFX9 %s

; GCN-LABEL: ds_read32_combine_stride_400:		; GCN-LABEL: ds_read32_combine_stride_400:
; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0		; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0
; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]		; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]

; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B2:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B2:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B3:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B3:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]

; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 0x320, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 0x200, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B2:v[0-9]+]], 0x640, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B2:v[0-9]+]], 0x400, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B3:v[0-9]+]], 0x960, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B3:v[0-9]+]], 0x800, [[BASE]]

; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[BASE]] offset1:100		; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[BASE]] offset1:100
; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B1]] offset1:100		; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B1]] offset0:72 offset1:172
; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B2]] offset1:100		; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B2]] offset0:144 offset1:244
; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B3]] offset1:100		; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B3]] offset0:88 offset1:188
define amdgpu_kernel void @ds_read32_combine_stride_400(float addrspace(3)* nocapture readonly %arg, float *nocapture %arg1) {		define amdgpu_kernel void @ds_read32_combine_stride_400(float addrspace(3)* nocapture readonly %arg, float *nocapture %arg1) {
bb:		bb:
%tmp = load float, float addrspace(3)* %arg, align 4		%tmp = load float, float addrspace(3)* %arg, align 4
%tmp2 = fadd float %tmp, 0.000000e+00		%tmp2 = fadd float %tmp, 0.000000e+00
%tmp3 = getelementptr inbounds float, float addrspace(3)* %arg, i32 100		%tmp3 = getelementptr inbounds float, float addrspace(3)* %arg, i32 100
%tmp4 = load float, float addrspace(3)* %tmp3, align 4		%tmp4 = load float, float addrspace(3)* %tmp3, align 4
%tmp5 = fadd float %tmp2, %tmp4		%tmp5 = fadd float %tmp2, %tmp4
%tmp6 = getelementptr inbounds float, float addrspace(3)* %arg, i32 200		%tmp6 = getelementptr inbounds float, float addrspace(3)* %arg, i32 200
[19 lines not shown]
}		}

; GCN-LABEL: ds_read32_combine_stride_20:		; GCN-LABEL: ds_read32_combine_stride_20:
; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0		; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0
; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]		; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]

; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B2:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B2:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B3:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B4:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]

; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 0x640, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 0x400, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B2:v[0-9]+]], 0x6e0, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B2:v[0-9]+]], 0x800, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B3:v[0-9]+]], 0x780, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B4:v[0-9]+]], 0x820, [[BASE]]		; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B1]] offset0:144 offset1:164
		; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B1]] offset0:184 offset1:204
; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B1]] offset1:20		; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B1]] offset0:224 offset1:244
; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B2]] offset1:20		; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B2]] offset0:8 offset1:28
; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B3]] offset1:20
; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B4]] offset1:20
define amdgpu_kernel void @ds_read32_combine_stride_20(float addrspace(3)* nocapture readonly %arg, float *nocapture %arg1) {		define amdgpu_kernel void @ds_read32_combine_stride_20(float addrspace(3)* nocapture readonly %arg, float *nocapture %arg1) {
bb:		bb:
%tmp = getelementptr inbounds float, float addrspace(3)* %arg, i32 400		%tmp = getelementptr inbounds float, float addrspace(3)* %arg, i32 400
%tmp1 = load float, float addrspace(3)* %tmp, align 4		%tmp1 = load float, float addrspace(3)* %tmp, align 4
%tmp2 = fadd float %tmp1, 0.000000e+00		%tmp2 = fadd float %tmp1, 0.000000e+00
%tmp3 = getelementptr inbounds float, float addrspace(3)* %arg, i32 420		%tmp3 = getelementptr inbounds float, float addrspace(3)* %arg, i32 420
%tmp4 = load float, float addrspace(3)* %tmp3, align 4		%tmp4 = load float, float addrspace(3)* %tmp3, align 4
%tmp5 = fadd float %tmp2, %tmp4		%tmp5 = fadd float %tmp2, %tmp4
[22 lines not shown]
; GCN-LABEL: ds_read32_combine_stride_400_back:		; GCN-LABEL: ds_read32_combine_stride_400_back:
; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0		; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0
; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]		; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]

; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B2:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B2:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B3:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B3:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]

; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 0x320, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 0x800, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B2:v[0-9]+]], 0x640, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B2:v[0-9]+]], 0x400, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B3:v[0-9]+]], 0x960, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B3:v[0-9]+]], 0x200, [[BASE]]

; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[BASE]] offset1:100		; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[BASE]] offset1:100
; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B1]] offset1:100		; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B1]] offset0:88 offset1:188
; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B2]] offset1:100		; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B2]] offset0:144 offset1:244
; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B3]] offset1:100		; GCN-DAG: ds_read2_b32 v[{{[0-9]+:[0-9]+}}], [[B3]] offset0:72 offset1:172
define amdgpu_kernel void @ds_read32_combine_stride_400_back(float addrspace(3)* nocapture readonly %arg, float *nocapture %arg1) {		define amdgpu_kernel void @ds_read32_combine_stride_400_back(float addrspace(3)* nocapture readonly %arg, float *nocapture %arg1) {
bb:		bb:
%tmp = getelementptr inbounds float, float addrspace(3)* %arg, i32 700		%tmp = getelementptr inbounds float, float addrspace(3)* %arg, i32 700
%tmp2 = load float, float addrspace(3)* %tmp, align 4		%tmp2 = load float, float addrspace(3)* %tmp, align 4
%tmp3 = fadd float %tmp2, 0.000000e+00		%tmp3 = fadd float %tmp2, 0.000000e+00
%tmp4 = getelementptr inbounds float, float addrspace(3)* %arg, i32 600		%tmp4 = getelementptr inbounds float, float addrspace(3)* %arg, i32 600
%tmp5 = load float, float addrspace(3)* %tmp4, align 4		%tmp5 = load float, float addrspace(3)* %tmp4, align 4
%tmp6 = fadd float %tmp3, %tmp5		%tmp6 = fadd float %tmp3, %tmp5
[54 lines not shown]	bb:
ret void		ret void
}		}

; GCN-LABEL: ds_read32_combine_stride_8192_shifted:		; GCN-LABEL: ds_read32_combine_stride_8192_shifted:
; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0		; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0
; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]		; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]

; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, 8, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, 8, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B2:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B3:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]

; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 8, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 8, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B2:v[0-9]+]], 0x4008, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B3:v[0-9]+]], 0x8008, [[BASE]]

; GCN-DAG: ds_read2st64_b32 v[{{[0-9]+:[0-9]+}}], [[B1]] offset1:32		; GCN-DAG: ds_read2st64_b32 v[{{[0-9]+:[0-9]+}}], [[B1]] offset1:32
; GCN-DAG: ds_read2st64_b32 v[{{[0-9]+:[0-9]+}}], [[B2]] offset1:32		; GCN-DAG: ds_read2st64_b32 v[{{[0-9]+:[0-9]+}}], [[B1]] offset0:64 offset1:96
; GCN-DAG: ds_read2st64_b32 v[{{[0-9]+:[0-9]+}}], [[B3]] offset1:32		; GCN-DAG: ds_read2st64_b32 v[{{[0-9]+:[0-9]+}}], [[B1]] offset0:128 offset1:160
define amdgpu_kernel void @ds_read32_combine_stride_8192_shifted(float addrspace(3)* nocapture readonly %arg, float *nocapture %arg1) {		define amdgpu_kernel void @ds_read32_combine_stride_8192_shifted(float addrspace(3)* nocapture readonly %arg, float *nocapture %arg1) {
bb:		bb:
%tmp = getelementptr inbounds float, float addrspace(3)* %arg, i32 2		%tmp = getelementptr inbounds float, float addrspace(3)* %arg, i32 2
%tmp2 = load float, float addrspace(3)* %tmp, align 4		%tmp2 = load float, float addrspace(3)* %tmp, align 4
%tmp3 = fadd float %tmp2, 0.000000e+00		%tmp3 = fadd float %tmp2, 0.000000e+00
%tmp4 = getelementptr inbounds float, float addrspace(3)* %arg, i32 2050		%tmp4 = getelementptr inbounds float, float addrspace(3)* %arg, i32 2050
%tmp5 = load float, float addrspace(3)* %tmp4, align 4		%tmp5 = load float, float addrspace(3)* %tmp4, align 4
%tmp6 = fadd float %tmp3, %tmp5		%tmp6 = fadd float %tmp3, %tmp5
[13 lines not shown]	bb:
ret void		ret void
}		}
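The updated checks for `ds_read32_combine_stride_8192_shifted` can be reproduced with a small sketch of the ST64 path, where the offset fields count units of 64 elements. It assumes the pair has already passed the patch's test that the offset difference is a multiple of 64 that fits in 8 bits after division. `Read2ST64Operands` and `selectST64Base` are hypothetical names for illustration, and the helper is again restated with a plain loop rather than LLVM's bit utilities.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Restatement of the patch's helper without LLVM headers.
inline uint32_t mostAlignedValueInRange(uint32_t Lo, uint32_t Hi) {
  uint32_t Diff = (Lo - 1) ^ Hi; // unsigned wraparound is intended
  if (Diff == 0)
    return 0;
  uint32_t Top = 0x80000000u;
  while ((Diff & Top) == 0)
    Top >>= 1; // Top is now the highest differing bit
  return Hi & ~(Top - 1);
}

// Hypothetical summary of what a merged ds_read2st64/ds_write2st64 would use.
struct Read2ST64Operands {
  uint32_t BaseOffBytes; // constant added to the address register, in bytes
  uint32_t Offset0;      // first offset field, in units of 64 elements
  uint32_t Offset1;      // second offset field, in units of 64 elements
};

inline Read2ST64Operands selectST64Base(uint32_t EltOffset0,
                                        uint32_t EltOffset1,
                                        uint32_t EltSize) {
  uint32_t Min = std::min(EltOffset0, EltOffset1);
  uint32_t Max = std::max(EltOffset0, EltOffset1);
  // Same precondition as the patch: the difference is a multiple of 64 whose
  // quotient fits in 8 bits.
  assert(((Max - Min) & ~(0xffu * 64)) == 0 && "not an ST64 candidate");
  // If Max - 0xff * 64 wraps around, Lo > Hi and the helper returns 0.
  uint32_t Base = mostAlignedValueInRange(Max - 0xff * 64, Min);
  // Copy the low 6 bits so that both adjusted offsets are multiples of 64.
  Base |= Min & 63;
  return {Base * EltSize, (EltOffset0 - Base) / 64, (EltOffset1 - Base) / 64};
}
```

For the 4-byte element pairs (2, 2050), (4098, 6146) and (8194, 10242), the sketch yields a base of 8 bytes every time, with offset fields 0/32, 64/96 and 128/160. All three merged reads can therefore share one base register instead of needing three.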

; GCN-LABEL: ds_read64_combine_stride_400:		; GCN-LABEL: ds_read64_combine_stride_400:
; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0		; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0
; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]		; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]

; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 0x960, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 0x800, [[BASE]]

; GCN-DAG: ds_read2_b64 v[{{[0-9]+:[0-9]+}}], [[BASE]] offset1:50		; GCN-DAG: ds_read2_b64 v[{{[0-9]+:[0-9]+}}], [[BASE]] offset1:50
; GCN-DAG: ds_read2_b64 v[{{[0-9]+:[0-9]+}}], [[BASE]] offset0:100 offset1:150		; GCN-DAG: ds_read2_b64 v[{{[0-9]+:[0-9]+}}], [[BASE]] offset0:100 offset1:150
; GCN-DAG: ds_read2_b64 v[{{[0-9]+:[0-9]+}}], [[BASE]] offset0:200 offset1:250		; GCN-DAG: ds_read2_b64 v[{{[0-9]+:[0-9]+}}], [[BASE]] offset0:200 offset1:250
; GCN-DAG: ds_read2_b64 v[{{[0-9]+:[0-9]+}}], [[B1]] offset1:50		; GCN-DAG: ds_read2_b64 v[{{[0-9]+:[0-9]+}}], [[B1]] offset0:44 offset1:94
define amdgpu_kernel void @ds_read64_combine_stride_400(double addrspace(3)* nocapture readonly %arg, double *nocapture %arg1) {		define amdgpu_kernel void @ds_read64_combine_stride_400(double addrspace(3)* nocapture readonly %arg, double *nocapture %arg1) {
bb:		bb:
%tmp = load double, double addrspace(3)* %arg, align 8		%tmp = load double, double addrspace(3)* %arg, align 8
%tmp2 = fadd double %tmp, 0.000000e+00		%tmp2 = fadd double %tmp, 0.000000e+00
%tmp3 = getelementptr inbounds double, double addrspace(3)* %arg, i32 50		%tmp3 = getelementptr inbounds double, double addrspace(3)* %arg, i32 50
%tmp4 = load double, double addrspace(3)* %tmp3, align 8		%tmp4 = load double, double addrspace(3)* %tmp3, align 8
%tmp5 = fadd double %tmp2, %tmp4		%tmp5 = fadd double %tmp2, %tmp4
%tmp6 = getelementptr inbounds double, double addrspace(3)* %arg, i32 100		%tmp6 = getelementptr inbounds double, double addrspace(3)* %arg, i32 100
[18 lines not shown]	bb:
ret void		ret void
}		}

; GCN-LABEL: ds_read64_combine_stride_8192_shifted:		; GCN-LABEL: ds_read64_combine_stride_8192_shifted:
; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0		; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0
; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]		; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]

; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, 8, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, 8, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B2:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B3:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]

; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 8, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 8, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B2:v[0-9]+]], 0x4008, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B3:v[0-9]+]], 0x8008, [[BASE]]

; GCN-DAG: ds_read2st64_b64 v[{{[0-9]+:[0-9]+}}], [[B1]] offset1:16		; GCN-DAG: ds_read2st64_b64 v[{{[0-9]+:[0-9]+}}], [[B1]] offset1:16
; GCN-DAG: ds_read2st64_b64 v[{{[0-9]+:[0-9]+}}], [[B2]] offset1:16		; GCN-DAG: ds_read2st64_b64 v[{{[0-9]+:[0-9]+}}], [[B1]] offset0:32 offset1:48
; GCN-DAG: ds_read2st64_b64 v[{{[0-9]+:[0-9]+}}], [[B3]] offset1:16		; GCN-DAG: ds_read2st64_b64 v[{{[0-9]+:[0-9]+}}], [[B1]] offset0:64 offset1:80
define amdgpu_kernel void @ds_read64_combine_stride_8192_shifted(double addrspace(3)* nocapture readonly %arg, double *nocapture %arg1) {		define amdgpu_kernel void @ds_read64_combine_stride_8192_shifted(double addrspace(3)* nocapture readonly %arg, double *nocapture %arg1) {
bb:		bb:
%tmp = getelementptr inbounds double, double addrspace(3)* %arg, i32 1		%tmp = getelementptr inbounds double, double addrspace(3)* %arg, i32 1
%tmp2 = load double, double addrspace(3)* %tmp, align 8		%tmp2 = load double, double addrspace(3)* %tmp, align 8
%tmp3 = fadd double %tmp2, 0.000000e+00		%tmp3 = fadd double %tmp2, 0.000000e+00
%tmp4 = getelementptr inbounds double, double addrspace(3)* %arg, i32 1025		%tmp4 = getelementptr inbounds double, double addrspace(3)* %arg, i32 1025
%tmp5 = load double, double addrspace(3)* %tmp4, align 8		%tmp5 = load double, double addrspace(3)* %tmp4, align 8
%tmp6 = fadd double %tmp3, %tmp5		%tmp6 = fadd double %tmp3, %tmp5
[16 lines not shown]
; GCN-LABEL: ds_write32_combine_stride_400:		; GCN-LABEL: ds_write32_combine_stride_400:
; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0		; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0
; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]		; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]

; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B2:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B2:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B3:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B3:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]

; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 0x320, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 0x200, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B2:v[0-9]+]], 0x640, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B2:v[0-9]+]], 0x400, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B3:v[0-9]+]], 0x960, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B3:v[0-9]+]], 0x800, [[BASE]]

; GCN-DAG: ds_write2_b32 [[BASE]], v{{[0-9]+}}, v{{[0-9]+}} offset1:100		; GCN-DAG: ds_write2_b32 [[BASE]], v{{[0-9]+}}, v{{[0-9]+}} offset1:100
; GCN-DAG: ds_write2_b32 [[B1]], v{{[0-9]+}}, v{{[0-9]+}} offset1:100		; GCN-DAG: ds_write2_b32 [[B1]], v{{[0-9]+}}, v{{[0-9]+}} offset0:72 offset1:172
; GCN-DAG: ds_write2_b32 [[B2]], v{{[0-9]+}}, v{{[0-9]+}} offset1:100		; GCN-DAG: ds_write2_b32 [[B2]], v{{[0-9]+}}, v{{[0-9]+}} offset0:144 offset1:244
; GCN-DAG: ds_write2_b32 [[B3]], v{{[0-9]+}}, v{{[0-9]+}} offset1:100		; GCN-DAG: ds_write2_b32 [[B3]], v{{[0-9]+}}, v{{[0-9]+}} offset0:88 offset1:188
define amdgpu_kernel void @ds_write32_combine_stride_400(float addrspace(3)* nocapture %arg) {		define amdgpu_kernel void @ds_write32_combine_stride_400(float addrspace(3)* nocapture %arg) {
bb:		bb:
store float 1.000000e+00, float addrspace(3)* %arg, align 4		store float 1.000000e+00, float addrspace(3)* %arg, align 4
%tmp = getelementptr inbounds float, float addrspace(3)* %arg, i32 100		%tmp = getelementptr inbounds float, float addrspace(3)* %arg, i32 100
store float 1.000000e+00, float addrspace(3)* %tmp, align 4		store float 1.000000e+00, float addrspace(3)* %tmp, align 4
%tmp1 = getelementptr inbounds float, float addrspace(3)* %arg, i32 200		%tmp1 = getelementptr inbounds float, float addrspace(3)* %arg, i32 200
store float 1.000000e+00, float addrspace(3)* %tmp1, align 4		store float 1.000000e+00, float addrspace(3)* %tmp1, align 4
%tmp2 = getelementptr inbounds float, float addrspace(3)* %arg, i32 300		%tmp2 = getelementptr inbounds float, float addrspace(3)* %arg, i32 300
[12 lines not shown]
; GCN-LABEL: ds_write32_combine_stride_400_back:		; GCN-LABEL: ds_write32_combine_stride_400_back:
; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0		; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0
; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]		; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]

; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B2:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B2:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B3:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B3:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]

; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 0x320, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 0x800, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B2:v[0-9]+]], 0x640, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B2:v[0-9]+]], 0x400, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B3:v[0-9]+]], 0x960, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B3:v[0-9]+]], 0x200, [[BASE]]

		; GCN-DAG: ds_write2_b32 [[B1]], v{{[0-9]+}}, v{{[0-9]+}} offset0:88 offset1:188
		; GCN-DAG: ds_write2_b32 [[B2]], v{{[0-9]+}}, v{{[0-9]+}} offset0:144 offset1:244
		; GCN-DAG: ds_write2_b32 [[B3]], v{{[0-9]+}}, v{{[0-9]+}} offset0:72 offset1:172
; GCN-DAG: ds_write2_b32 [[BASE]], v{{[0-9]+}}, v{{[0-9]+}} offset1:100		; GCN-DAG: ds_write2_b32 [[BASE]], v{{[0-9]+}}, v{{[0-9]+}} offset1:100
; GCN-DAG: ds_write2_b32 [[B1]], v{{[0-9]+}}, v{{[0-9]+}} offset1:100
; GCN-DAG: ds_write2_b32 [[B2]], v{{[0-9]+}}, v{{[0-9]+}} offset1:100
; GCN-DAG: ds_write2_b32 [[B3]], v{{[0-9]+}}, v{{[0-9]+}} offset1:100
define amdgpu_kernel void @ds_write32_combine_stride_400_back(float addrspace(3)* nocapture %arg) {		define amdgpu_kernel void @ds_write32_combine_stride_400_back(float addrspace(3)* nocapture %arg) {
bb:		bb:
%tmp = getelementptr inbounds float, float addrspace(3)* %arg, i32 700		%tmp = getelementptr inbounds float, float addrspace(3)* %arg, i32 700
store float 1.000000e+00, float addrspace(3)* %tmp, align 4		store float 1.000000e+00, float addrspace(3)* %tmp, align 4
%tmp1 = getelementptr inbounds float, float addrspace(3)* %arg, i32 600		%tmp1 = getelementptr inbounds float, float addrspace(3)* %arg, i32 600
store float 1.000000e+00, float addrspace(3)* %tmp1, align 4		store float 1.000000e+00, float addrspace(3)* %tmp1, align 4
%tmp2 = getelementptr inbounds float, float addrspace(3)* %arg, i32 500		%tmp2 = getelementptr inbounds float, float addrspace(3)* %arg, i32 500
store float 1.000000e+00, float addrspace(3)* %tmp2, align 4		store float 1.000000e+00, float addrspace(3)* %tmp2, align 4
[35 lines not shown]	bb:
store float 1.000000e+00, float addrspace(3)* %tmp6, align 4		store float 1.000000e+00, float addrspace(3)* %tmp6, align 4
ret void		ret void
}		}

; GCN-LABEL: ds_write32_combine_stride_8192_shifted:		; GCN-LABEL: ds_write32_combine_stride_8192_shifted:
; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0		; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0
; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]		; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]

; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, 4, [[BASE]]		; VI-DAG: v_add_u32_e32 [[BASE:v[0-9]+]], vcc, 4, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B2:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[BASE:v[0-9]+]], 4, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B3:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]

; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 4, [[BASE]]		; GCN-DAG: ds_write2st64_b32 [[BASE]], v{{[0-9]+}}, v{{[0-9]+}} offset1:32
; GFX9-DAG: v_add_u32_e32 [[B2:v[0-9]+]], 0x4004, [[BASE]]		; GCN-DAG: ds_write2st64_b32 [[BASE]], v{{[0-9]+}}, v{{[0-9]+}} offset0:64 offset1:96
; GFX9-DAG: v_add_u32_e32 [[B3:v[0-9]+]], 0x8004, [[BASE]]		; GCN-DAG: ds_write2st64_b32 [[BASE]], v{{[0-9]+}}, v{{[0-9]+}} offset0:128 offset1:160

; GCN-DAG: ds_write2st64_b32 [[B1]], v{{[0-9]+}}, v{{[0-9]+}} offset1:32
; GCN-DAG: ds_write2st64_b32 [[B2]], v{{[0-9]+}}, v{{[0-9]+}} offset1:32
; GCN-DAG: ds_write2st64_b32 [[B3]], v{{[0-9]+}}, v{{[0-9]+}} offset1:32
define amdgpu_kernel void @ds_write32_combine_stride_8192_shifted(float addrspace(3)* nocapture %arg) {		define amdgpu_kernel void @ds_write32_combine_stride_8192_shifted(float addrspace(3)* nocapture %arg) {
bb:		bb:
%tmp = getelementptr inbounds float, float addrspace(3)* %arg, i32 1		%tmp = getelementptr inbounds float, float addrspace(3)* %arg, i32 1
store float 1.000000e+00, float addrspace(3)* %tmp, align 4		store float 1.000000e+00, float addrspace(3)* %tmp, align 4
%tmp1 = getelementptr inbounds float, float addrspace(3)* %arg, i32 2049		%tmp1 = getelementptr inbounds float, float addrspace(3)* %arg, i32 2049
store float 1.000000e+00, float addrspace(3)* %tmp1, align 4		store float 1.000000e+00, float addrspace(3)* %tmp1, align 4
%tmp2 = getelementptr inbounds float, float addrspace(3)* %arg, i32 4097		%tmp2 = getelementptr inbounds float, float addrspace(3)* %arg, i32 4097
store float 1.000000e+00, float addrspace(3)* %tmp2, align 4		store float 1.000000e+00, float addrspace(3)* %tmp2, align 4
%tmp3 = getelementptr inbounds float, float addrspace(3)* %arg, i32 6145		%tmp3 = getelementptr inbounds float, float addrspace(3)* %arg, i32 6145
store float 1.000000e+00, float addrspace(3)* %tmp3, align 4		store float 1.000000e+00, float addrspace(3)* %tmp3, align 4
%tmp4 = getelementptr inbounds float, float addrspace(3)* %arg, i32 8193		%tmp4 = getelementptr inbounds float, float addrspace(3)* %arg, i32 8193
store float 1.000000e+00, float addrspace(3)* %tmp4, align 4		store float 1.000000e+00, float addrspace(3)* %tmp4, align 4
%tmp5 = getelementptr inbounds float, float addrspace(3)* %arg, i32 10241		%tmp5 = getelementptr inbounds float, float addrspace(3)* %arg, i32 10241
store float 1.000000e+00, float addrspace(3)* %tmp5, align 4		store float 1.000000e+00, float addrspace(3)* %tmp5, align 4
ret void		ret void
}		}

; GCN-LABEL: ds_write64_combine_stride_400:		; GCN-LABEL: ds_write64_combine_stride_400:
; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0		; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0
; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]		; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]

; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 0x960, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 0x800, [[BASE]]

; GCN-DAG: ds_write2_b64 [[BASE]], v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}] offset1:50		; GCN-DAG: ds_write2_b64 [[BASE]], v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}] offset1:50
; GCN-DAG: ds_write2_b64 [[BASE]], v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}] offset0:100 offset1:150		; GCN-DAG: ds_write2_b64 [[BASE]], v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}] offset0:100 offset1:150
; GCN-DAG: ds_write2_b64 [[BASE]], v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}] offset0:200 offset1:250		; GCN-DAG: ds_write2_b64 [[BASE]], v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}] offset0:200 offset1:250
; GCN-DAG: ds_write2_b64 [[B1]], v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}] offset1:50		; GCN-DAG: ds_write2_b64 [[B1]], v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}] offset0:44 offset1:94
define amdgpu_kernel void @ds_write64_combine_stride_400(double addrspace(3)* nocapture %arg) {		define amdgpu_kernel void @ds_write64_combine_stride_400(double addrspace(3)* nocapture %arg) {
bb:		bb:
store double 1.000000e+00, double addrspace(3)* %arg, align 8		store double 1.000000e+00, double addrspace(3)* %arg, align 8
%tmp = getelementptr inbounds double, double addrspace(3)* %arg, i32 50		%tmp = getelementptr inbounds double, double addrspace(3)* %arg, i32 50
store double 1.000000e+00, double addrspace(3)* %tmp, align 8		store double 1.000000e+00, double addrspace(3)* %tmp, align 8
%tmp1 = getelementptr inbounds double, double addrspace(3)* %arg, i32 100		%tmp1 = getelementptr inbounds double, double addrspace(3)* %arg, i32 100
store double 1.000000e+00, double addrspace(3)* %tmp1, align 8		store double 1.000000e+00, double addrspace(3)* %tmp1, align 8
%tmp2 = getelementptr inbounds double, double addrspace(3)* %arg, i32 150		%tmp2 = getelementptr inbounds double, double addrspace(3)* %arg, i32 150
store double 1.000000e+00, double addrspace(3)* %tmp2, align 8		store double 1.000000e+00, double addrspace(3)* %tmp2, align 8
%tmp3 = getelementptr inbounds double, double addrspace(3)* %arg, i32 200		%tmp3 = getelementptr inbounds double, double addrspace(3)* %arg, i32 200
store double 1.000000e+00, double addrspace(3)* %tmp3, align 8		store double 1.000000e+00, double addrspace(3)* %tmp3, align 8
%tmp4 = getelementptr inbounds double, double addrspace(3)* %arg, i32 250		%tmp4 = getelementptr inbounds double, double addrspace(3)* %arg, i32 250
store double 1.000000e+00, double addrspace(3)* %tmp4, align 8		store double 1.000000e+00, double addrspace(3)* %tmp4, align 8
%tmp5 = getelementptr inbounds double, double addrspace(3)* %arg, i32 300		%tmp5 = getelementptr inbounds double, double addrspace(3)* %arg, i32 300
store double 1.000000e+00, double addrspace(3)* %tmp5, align 8		store double 1.000000e+00, double addrspace(3)* %tmp5, align 8
%tmp6 = getelementptr inbounds double, double addrspace(3)* %arg, i32 350		%tmp6 = getelementptr inbounds double, double addrspace(3)* %arg, i32 350
store double 1.000000e+00, double addrspace(3)* %tmp6, align 8		store double 1.000000e+00, double addrspace(3)* %tmp6, align 8
ret void		ret void
}		}
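
The GFX9 check changes above (base 0x960 with offset1:50 becoming base 0x800 with offset0:44 offset1:94) can be reproduced with a little arithmetic. The following is a minimal C++ sketch, not the code from SILoadStoreOptimizer.cpp. The helper names `mostAlignedInRange` and `pickBase` are hypothetical, and the sketch assumes both offsets are counted in 8-byte elements and differ by at most 255.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not taken from the patch): return the value in
// [lo, hi] that is a multiple of the largest possible power of two.
// Assumes lo <= hi.
constexpr uint32_t mostAlignedInRange(uint32_t lo, uint32_t hi) {
  for (int bit = 31; bit >= 0; --bit) {
    // Round hi down to a multiple of 2^bit.
    uint32_t candidate = hi & ~((uint32_t{1} << bit) - 1);
    if (candidate >= lo)
      return candidate;
  }
  return hi; // Not reached: bit 0 always yields candidate == hi >= lo.
}

// Hypothetical helper: pick a base (in elements) for a pair of element
// offsets. Each offset field is 8 bits wide, so after subtracting the base
// both offsets must land in 0..255; the base therefore has to lie in
// [max - 255, min], clamped at zero.
constexpr uint32_t pickBase(uint32_t off0, uint32_t off1) {
  uint32_t mn = std::min(off0, off1);
  uint32_t mx = std::max(off0, off1);
  uint32_t lo = mx > 255 ? mx - 255 : 0;
  return mostAlignedInRange(lo, mn);
}

// ds_write64_combine_stride_400: the last pair stores doubles 300 and 350.
static_assert(pickBase(300, 350) * 8 == 0x800, "byte base in the new check");
static_assert(300 - pickBase(300, 350) == 44, "offset0:44");
static_assert(350 - pickBase(300, 350) == 94, "offset1:94");
```

Under these assumptions the base lands on a round value (0x800) rather than on the smaller of the two offsets (0x960). Round values are more likely to coincide across different pairs, which is what lets MachineCSE share one base register.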

; GCN-LABEL: ds_write64_combine_stride_8192_shifted:		; GCN-LABEL: ds_write64_combine_stride_8192_shifted:
; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0		; GCN: s_load_dword [[ARG:s[0-9]+]], s[4:5], 0x0
; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]		; GCN: v_mov_b32_e32 [[BASE:v[0-9]+]], [[ARG]]

; VI-DAG: v_add_u32_e32 [[B1:v[0-9]+]], vcc, 8, [[BASE]]		; VI-DAG: v_add_u32_e32 [[BASE]], vcc, 8, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B2:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]		; GFX9-DAG: v_add_u32_e32 [[BASE]], 8, [[BASE]]
; VI-DAG: v_add_u32_e32 [[B3:v[0-9]+]], vcc, {{s[0-9]+}}, [[BASE]]

; GFX9-DAG: v_add_u32_e32 [[B1:v[0-9]+]], 8, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B2:v[0-9]+]], 0x4008, [[BASE]]
; GFX9-DAG: v_add_u32_e32 [[B3:v[0-9]+]], 0x8008, [[BASE]]

; GCN-DAG: ds_write2st64_b64 [[B1]], v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}] offset1:16		; GCN-DAG: ds_write2st64_b64 [[BASE]], v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}] offset1:16
; GCN-DAG: ds_write2st64_b64 [[B2]], v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}] offset1:16		; GCN-DAG: ds_write2st64_b64 [[BASE]], v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}] offset0:32 offset1:48
; GCN-DAG: ds_write2st64_b64 [[B3]], v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}] offset1:16		; GCN-DAG: ds_write2st64_b64 [[BASE]], v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}] offset0:64 offset1:80
define amdgpu_kernel void @ds_write64_combine_stride_8192_shifted(double addrspace(3)* nocapture %arg) {		define amdgpu_kernel void @ds_write64_combine_stride_8192_shifted(double addrspace(3)* nocapture %arg) {
bb:		bb:
%tmp = getelementptr inbounds double, double addrspace(3)* %arg, i32 1		%tmp = getelementptr inbounds double, double addrspace(3)* %arg, i32 1
store double 1.000000e+00, double addrspace(3)* %tmp, align 8		store double 1.000000e+00, double addrspace(3)* %tmp, align 8
%tmp1 = getelementptr inbounds double, double addrspace(3)* %arg, i32 1025		%tmp1 = getelementptr inbounds double, double addrspace(3)* %arg, i32 1025
store double 1.000000e+00, double addrspace(3)* %tmp1, align 8		store double 1.000000e+00, double addrspace(3)* %tmp1, align 8
%tmp2 = getelementptr inbounds double, double addrspace(3)* %arg, i32 2049		%tmp2 = getelementptr inbounds double, double addrspace(3)* %arg, i32 2049
store double 1.000000e+00, double addrspace(3)* %tmp2, align 8		store double 1.000000e+00, double addrspace(3)* %tmp2, align 8
%tmp3 = getelementptr inbounds double, double addrspace(3)* %arg, i32 3073		%tmp3 = getelementptr inbounds double, double addrspace(3)* %arg, i32 3073
store double 1.000000e+00, double addrspace(3)* %tmp3, align 8		store double 1.000000e+00, double addrspace(3)* %tmp3, align 8
%tmp4 = getelementptr inbounds double, double addrspace(3)* %arg, i32 4097		%tmp4 = getelementptr inbounds double, double addrspace(3)* %arg, i32 4097
store double 1.000000e+00, double addrspace(3)* %tmp4, align 8		store double 1.000000e+00, double addrspace(3)* %tmp4, align 8
%tmp5 = getelementptr inbounds double, double addrspace(3)* %arg, i32 5121		%tmp5 = getelementptr inbounds double, double addrspace(3)* %arg, i32 5121
store double 1.000000e+00, double addrspace(3)* %tmp5, align 8		store double 1.000000e+00, double addrspace(3)* %tmp5, align 8
ret void		ret void
}		}
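
In the stride-8192 test above, all three ds_write2st64_b64 instructions end up sharing one base of 8 bytes. A possible way to arrive at that, sketched in C++ under stated assumptions: the ST64 offset fields count units of 64 elements, so the base must match the low 6 bits of both element offsets. The names `mostAlignedInRange` and `pickBaseSt64` are hypothetical and this is not the patch's implementation.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: the value in [lo, hi] that is a multiple of the
// largest possible power of two. Assumes lo <= hi.
constexpr uint32_t mostAlignedInRange(uint32_t lo, uint32_t hi) {
  for (int bit = 31; bit >= 0; --bit) {
    uint32_t candidate = hi & ~((uint32_t{1} << bit) - 1);
    if (candidate >= lo)
      return candidate;
  }
  return hi; // Not reached: bit 0 always yields candidate == hi >= lo.
}

// Hypothetical helper for the ST64 form. Offsets are in elements and are
// assumed to differ by a multiple of 64 that is at most 255 * 64.
constexpr uint32_t pickBaseSt64(uint32_t off0, uint32_t off1) {
  uint32_t mn = std::min(off0, off1);
  uint32_t mx = std::max(off0, off1);
  uint32_t lo = mx > 255 * 64 ? mx - 255 * 64 : 0;
  uint32_t base = mostAlignedInRange(lo, mn);
  // Copy the low 6 bits of the offsets into the base so that the adjusted
  // offsets become exact multiples of 64.
  return base | (mn & 63);
}

// ds_write64_combine_stride_8192_shifted: doubles 1/1025, 2049/3073, 4097/5121.
static_assert(pickBaseSt64(1, 1025) * 8 == 8, "v_add_u32 ..., 8, [[BASE]]");
static_assert((2049 - pickBaseSt64(2049, 3073)) / 64 == 32, "offset0:32");
static_assert((3073 - pickBaseSt64(2049, 3073)) / 64 == 48, "offset1:48");
static_assert((4097 - pickBaseSt64(4097, 5121)) / 64 == 64, "offset0:64");
static_assert((5121 - pickBaseSt64(4097, 5121)) / 64 == 80, "offset1:80");
```

With these inputs every pair yields the same base of one element (8 bytes), consistent with the new check lines that use [[BASE]] for all three writes.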

llvm/test/CodeGen/AMDGPU/fence-lds-read2-write2.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s

	@lds = internal addrspace(3) global [576 x double] undef, align 16			@lds = internal addrspace(3) global [576 x double] undef, align 16

	; Stores to the same address appear multiple places in the same			; Stores to the same address appear multiple places in the same
	; block. When sorted by offset, the merges would fail. We should form			; block. When sorted by offset, the merges would fail. We should form
	; two groupings of ds_write2_b64 on either side of the fence.			; two groupings of ds_write2_b64 on either side of the fence.
	define amdgpu_kernel void @same_address_fence_merge_write2() #0 {			define amdgpu_kernel void @same_address_fence_merge_write2() #0 {
	; GCN-LABEL: same_address_fence_merge_write2:			; GCN-LABEL: same_address_fence_merge_write2:
	; GCN: ; %bb.0: ; %bb			; GCN: ; %bb.0: ; %bb
	; GCN-NEXT: s_mov_b32 s0, 0			; GCN-NEXT: s_mov_b32 s0, 0
	; GCN-NEXT: v_lshlrev_b32_e32 v2, 3, v0			; GCN-NEXT: v_lshlrev_b32_e32 v2, 3, v0
	; GCN-NEXT: s_mov_b32 s1, 0x40100000			; GCN-NEXT: s_mov_b32 s1, 0x40100000
	; GCN-NEXT: v_mov_b32_e32 v0, s0			; GCN-NEXT: v_mov_b32_e32 v0, s0
	; GCN-NEXT: v_mov_b32_e32 v1, s1			; GCN-NEXT: v_mov_b32_e32 v1, s1
	; GCN-NEXT: v_add_u32_e32 v3, 0x840, v2			; GCN-NEXT: v_add_u32_e32 v3, 0x800, v2
	; GCN-NEXT: v_add_u32_e32 v4, 0xc60, v2
	; GCN-NEXT: ds_write2_b64 v2, v[0:1], v[0:1] offset1:66			; GCN-NEXT: ds_write2_b64 v2, v[0:1], v[0:1] offset1:66
	; GCN-NEXT: ds_write2_b64 v2, v[0:1], v[0:1] offset0:132 offset1:198			; GCN-NEXT: ds_write2_b64 v2, v[0:1], v[0:1] offset0:132 offset1:198
	; GCN-NEXT: ds_write2_b64 v3, v[0:1], v[0:1] offset1:66			; GCN-NEXT: ds_write2_b64 v3, v[0:1], v[0:1] offset0:8 offset1:74
	; GCN-NEXT: ds_write2_b64 v4, v[0:1], v[0:1] offset1:66			; GCN-NEXT: ds_write2_b64 v3, v[0:1], v[0:1] offset0:140 offset1:206
	; GCN-NEXT: s_mov_b32 s1, 0x3ff00000			; GCN-NEXT: s_mov_b32 s1, 0x3ff00000
	; GCN-NEXT: v_mov_b32_e32 v0, s0			; GCN-NEXT: v_mov_b32_e32 v0, s0
	; GCN-NEXT: v_mov_b32_e32 v1, s1			; GCN-NEXT: v_mov_b32_e32 v1, s1
	; GCN-NEXT: s_waitcnt lgkmcnt(0)			; GCN-NEXT: s_waitcnt lgkmcnt(0)
	; GCN-NEXT: s_barrier			; GCN-NEXT: s_barrier
	; GCN-NEXT: s_waitcnt lgkmcnt(0)			; GCN-NEXT: s_waitcnt lgkmcnt(0)
	; GCN-NEXT: ds_write2_b64 v2, v[0:1], v[0:1] offset1:66			; GCN-NEXT: ds_write2_b64 v2, v[0:1], v[0:1] offset1:66
	; GCN-NEXT: ds_write2_b64 v2, v[0:1], v[0:1] offset0:132 offset1:198			; GCN-NEXT: ds_write2_b64 v2, v[0:1], v[0:1] offset0:132 offset1:198
	; GCN-NEXT: ds_write2_b64 v3, v[0:1], v[0:1] offset1:66			; GCN-NEXT: ds_write2_b64 v3, v[0:1], v[0:1] offset0:8 offset1:74
	; GCN-NEXT: ds_write2_b64 v4, v[0:1], v[0:1] offset1:66			; GCN-NEXT: ds_write2_b64 v3, v[0:1], v[0:1] offset0:140 offset1:206
	; GCN-NEXT: s_endpgm			; GCN-NEXT: s_endpgm
	bb:			bb:
	%tmp = tail call i32 @llvm.amdgcn.workitem.id.x(), !range !0			%tmp = tail call i32 @llvm.amdgcn.workitem.id.x(), !range !0
	%tmp1 = getelementptr inbounds [576 x double], [576 x double] addrspace(3)* @lds, i32 0, i32 %tmp			%tmp1 = getelementptr inbounds [576 x double], [576 x double] addrspace(3)* @lds, i32 0, i32 %tmp
	store double 4.000000e+00, double addrspace(3)* %tmp1, align 8			store double 4.000000e+00, double addrspace(3)* %tmp1, align 8
	%tmp2 = getelementptr inbounds double, double addrspace(3)* %tmp1, i32 66			%tmp2 = getelementptr inbounds double, double addrspace(3)* %tmp1, i32 66
	store double 4.000000e+00, double addrspace(3)* %tmp2, align 8			store double 4.000000e+00, double addrspace(3)* %tmp2, align 8
	%tmp3 = getelementptr inbounds double, double addrspace(3)* %tmp1, i32 132			%tmp3 = getelementptr inbounds double, double addrspace(3)* %tmp1, i32 132

llvm/test/CodeGen/AMDGPU/merge-load-store-vreg.mir

# RUN: llc -march=amdgcn -mcpu=gfx803 -verify-machineinstrs -run-pass si-load-store-opt -o - %s | FileCheck -check-prefixes=GCN,VI %s		# RUN: llc -march=amdgcn -mcpu=gfx803 -verify-machineinstrs -run-pass si-load-store-opt -o - %s | FileCheck -check-prefixes=GCN,VI %s
# RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs -run-pass si-load-store-opt -o - %s | FileCheck -check-prefixes=GCN,GFX9 %s		# RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs -run-pass si-load-store-opt -o - %s | FileCheck -check-prefixes=GCN,GFX9 %s

# If there's a base offset, check that SILoadStoreOptimizer creates		# If there's a base offset, check that SILoadStoreOptimizer creates
# V_ADD_{I|U}32_e64 for that offset; _e64 uses a vreg for the carry (rather than		# V_ADD_{I|U}32_e64 for that offset; _e64 uses a vreg for the carry (rather than
# $vcc, which is used in _e32); this ensures that $vcc is not inadvertently		# $vcc, which is used in _e32); this ensures that $vcc is not inadvertently
# clobbered.		# clobbered.

# GCN-LABEL: name: ds_combine_base_offset{{$}}		# GCN-LABEL: name: ds_combine_base_offset{{$}}

# VI: V_ADD_CO_U32_e64 %6, %0,		# VI: V_ADD_CO_U32_e64 %6, %0,
# VI-NEXT: DS_WRITE2_B32 killed %7, %0, %3, 0, 8,		# VI-NEXT: DS_WRITE2_B32 killed %7, %0, %3, 0, 8,
# VI: V_ADD_CO_U32_e64 %10, %3,		# VI: V_ADD_CO_U32_e64 %10, %3,
# VI-NEXT: DS_READ2_B32 killed %11, 0, 8,		# VI-NEXT: DS_READ2_B32 killed %11, 16, 24,

# GFX9: V_ADD_U32_e64 %6, %0,		# GFX9: V_ADD_U32_e64 %6, %0,
# GFX9-NEXT: DS_WRITE2_B32_gfx9 killed %7, %0, %3, 0, 8,		# GFX9-NEXT: DS_WRITE2_B32_gfx9 killed %7, %0, %3, 0, 8,
# GFX9: V_ADD_U32_e64 %9, %3,		# GFX9: V_ADD_U32_e64 %9, %3,
# GFX9-NEXT: DS_READ2_B32_gfx9 killed %10, 0, 8,		# GFX9-NEXT: DS_READ2_B32_gfx9 killed %10, 16, 24,

--- |		--- |
@0 = internal unnamed_addr addrspace(3) global [256 x float] undef, align 4		@0 = internal unnamed_addr addrspace(3) global [256 x float] undef, align 4

define amdgpu_kernel void @ds_combine_base_offset() {		define amdgpu_kernel void @ds_combine_base_offset() {
bb.0:		bb.0:
br label %bb2		br label %bb2

bb.2:		bb.2:
S_BRANCH %bb.1		S_BRANCH %bb.1
...		...

# GCN-LABEL: name: ds_combine_base_offset_subreg{{$}}		# GCN-LABEL: name: ds_combine_base_offset_subreg{{$}}

# VI: V_ADD_CO_U32_e64 %6, %0.sub0,		# VI: V_ADD_CO_U32_e64 %6, %0.sub0,
# VI-NEXT: DS_WRITE2_B32 killed %7, %0.sub0, %3.sub0, 0, 8,		# VI-NEXT: DS_WRITE2_B32 killed %7, %0.sub0, %3.sub0, 0, 8,
# VI: V_ADD_CO_U32_e64 %10, %3.sub0,		# VI: V_ADD_CO_U32_e64 %10, %3.sub0,
# VI-NEXT: DS_READ2_B32 killed %11, 0, 8,		# VI-NEXT: DS_READ2_B32 killed %11, 16, 24,

# GFX9: V_ADD_U32_e64 %6, %0.sub0,		# GFX9: V_ADD_U32_e64 %6, %0.sub0,
# GFX9-NEXT: DS_WRITE2_B32_gfx9 killed %7, %0.sub0, %3.sub0, 0, 8,		# GFX9-NEXT: DS_WRITE2_B32_gfx9 killed %7, %0.sub0, %3.sub0, 0, 8,
# GFX9: V_ADD_U32_e64 %9, %3.sub0,		# GFX9: V_ADD_U32_e64 %9, %3.sub0,
# GFX9-NEXT: DS_READ2_B32_gfx9 killed %10, 0, 8,		# GFX9-NEXT: DS_READ2_B32_gfx9 killed %10, 16, 24,
---		---
name: ds_combine_base_offset_subreg		name: ds_combine_base_offset_subreg
body: |		body: |
bb.0:		bb.0:
%0:vreg_64 = IMPLICIT_DEF		%0:vreg_64 = IMPLICIT_DEF
S_BRANCH %bb.2		S_BRANCH %bb.2

bb.1:		bb.1: