This is an archive of the discontinued LLVM Phabricator instance.

DAGCombiner: Don't stop finding better chain on 2 aliases
ClosedPublic

Authored by arsenm on Oct 5 2015, 9:04 AM.

Details

Summary

The comment says this was stopped because it was unlikely to be
profitable. This is not true if you want to combine vector loads
with multiple components.

For a simple case that looks like

t0 = load t0 ...
t1 = load t0 ...
t2 = load t0 ...
t3 = load t0 ...

t4 = store t0:1, t0:1
  t5 = store t4, t1:0
    t6 = store t5, t2:0
      t7 = store t6, t3:0

We want to get all of these stores onto a chain
that is a TokenFactor of these N loads. This mostly
solves the AMDGPU merge-stores.ll regressions
with -combiner-alias-analysis for merging vector
stores of vector loads.
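
The effect of the two-alias cutoff can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the DAGCombiner code: `ToyNode`, `toyAlias`, `gatherAliases`, and `demoAliasCounts` are hypothetical names, the alias rule is invented for illustration, and the graph assumes the first store's chain is a TokenFactor joining four loads, which is one plausible reading of the case above.

```cpp
#include <cstddef>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical toy stand-in for a chain node; this is not LLVM's SDNode.
struct ToyNode {
  enum Kind { Entry, Load, Store, TokenFactor };
  Kind kind;
  int base;    // toy buffer id: 0 = input, 1 = output (-1 when unused)
  int offset;  // element index within the buffer
  std::vector<const ToyNode *> chains;  // incoming chain operands
};

// Assumed toy alias rule: two loads never conflict; accesses to the same
// buffer conflict only at equal offsets; different buffers may overlap.
bool toyAlias(const ToyNode &a, const ToyNode &b) {
  if (a.kind == ToyNode::Load && b.kind == ToyNode::Load) return false;
  if (a.base == b.base) return a.offset == b.offset;
  return true;
}

// Simplified walk in the spirit of GatherAllAliases. stopAtTwoAliases = true
// models the old early exit; false models the check after this patch.
std::vector<const ToyNode *> gatherAliases(const ToyNode &n,
                                           const ToyNode *originalChain,
                                           bool stopAtTwoAliases) {
  std::vector<const ToyNode *> aliases;
  std::vector<const ToyNode *> worklist{originalChain};
  std::unordered_set<const ToyNode *> visited;
  unsigned depth = 0;

  while (!worklist.empty()) {
    const ToyNode *chain = worklist.back();
    worklist.pop_back();

    // Give up and fall back to the original chain.
    if (depth > 6 || (stopAtTwoAliases && aliases.size() == 2))
      return {originalChain};

    // Skip nodes that were already examined.
    if (!visited.insert(chain).second) continue;

    switch (chain->kind) {
    case ToyNode::Entry:
      break;
    case ToyNode::Load:
    case ToyNode::Store:
      if (toyAlias(n, *chain)) {
        aliases.push_back(chain);  // real dependency: stop on this path
      } else {
        worklist.push_back(chain->chains.front());  // keep looking upward
        ++depth;
      }
      break;
    case ToyNode::TokenFactor:
      // Queue every operand; reverse order so the first operand pops first.
      for (auto it = chain->chains.rbegin(); it != chain->chains.rend(); ++it)
        worklist.push_back(*it);
      ++depth;
      break;
    }
  }
  return aliases;
}

// Builds four loads joined by a TokenFactor, followed by four serially
// chained stores, and returns the number of chains gathered for the last
// store as {with the two-alias cutoff, without it}.
std::pair<std::size_t, std::size_t> demoAliasCounts() {
  ToyNode entry{ToyNode::Entry, -1, 0, {}};
  ToyNode l0{ToyNode::Load, 0, 0, {&entry}};
  ToyNode l1{ToyNode::Load, 0, 1, {&entry}};
  ToyNode l2{ToyNode::Load, 0, 2, {&entry}};
  ToyNode l3{ToyNode::Load, 0, 3, {&entry}};
  ToyNode tf{ToyNode::TokenFactor, -1, 0, {&l0, &l1, &l2, &l3}};
  ToyNode s0{ToyNode::Store, 1, 0, {&tf}};
  ToyNode s1{ToyNode::Store, 1, 1, {&s0}};
  ToyNode s2{ToyNode::Store, 1, 2, {&s1}};
  ToyNode s3{ToyNode::Store, 1, 3, {&s2}};
  return {gatherAliases(s3, &s2, true).size(),
          gatherAliases(s3, &s2, false).size()};
}
```

In this model, the walk with the cutoff enabled gives up once two of the loads have been collected and returns only the original chain, so the last store stays serialized behind the earlier stores. With the cutoff disabled, the same walk reaches all four loads, which is the set a TokenFactor of the N loads would be built from.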

Event Timeline

arsenm updated this revision to Diff 36523. Oct 5 2015, 9:04 AM

arsenm retitled this revision to DAGCombiner: Don't stop finding better chain on 2 aliases.

arsenm updated this object.

arsenm added a reviewer: hfinkel.

arsenm added a subscriber: llvm-commits.

LGTM.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14365	It would be nice, IMHO, to get rid of these hard-coded depth limits and turn them into cl::opts. This is a general comment, as I feel the same way about all of these depth limits everywhere. [I'm not requesting that you change this here].

This revision is now accepted and ready to land. Oct 5 2015, 9:09 AM

arsenm added inline comments. Oct 5 2015, 9:31 AM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14365	I have a later patch that turns this into a TargetLowering preference. The depth limit is a problem if you want to find adjacent large vectors.

r250138 with some test changes since the x8 tests depend on a patch I've decided to submit later

Revision Contents

Path	Size
lib/CodeGen/SelectionDAG/DAGCombiner.cpp	6 lines
test/CodeGen/AMDGPU/merge-stores.ll	48 lines

Diff 36523

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

[... 14,351 lines hidden ...]	void DAGCombiner::GatherAllAliases(SDNode *N, SDValue OriginalChain,

// Look at each chain and determine if it is an alias. If so, add it to the		// Look at each chain and determine if it is an alias. If so, add it to the
// aliases list. If not, then continue up the chain looking for the next		// aliases list. If not, then continue up the chain looking for the next
// candidate.		// candidate.
while (!Chains.empty()) {		while (!Chains.empty()) {
SDValue Chain = Chains.pop_back_val();		SDValue Chain = Chains.pop_back_val();

// For TokenFactor nodes, look at each operand and only continue up the		// For TokenFactor nodes, look at each operand and only continue up the
// chain until we find two aliases. If we've seen two aliases, assume we'll		// chain until we reach the depth limit.
// find more and revert to original chain since the xform is unlikely to be
// profitable.
//		//
// FIXME: The depth check could be made to return the last non-aliasing		// FIXME: The depth check could be made to return the last non-aliasing
// chain we found before we hit a tokenfactor rather than the original		// chain we found before we hit a tokenfactor rather than the original
// chain.		// chain.
if (Depth > 6 || Aliases.size() == 2) {		if (Depth > 6) {
		hfinkel: It would be nice, IMHO, to get rid of these hard-coded depth limits and turn them into cl::opts. This is a general comment, as I feel the same way about all of these depth limits everywhere. [I'm not requesting that you change this here].
		arsenm (author): I have a later patch that turns this into a TargetLowering preference. The depth limit is a problem if you want to find adjacent large vectors.
Aliases.clear();		Aliases.clear();
Aliases.push_back(OriginalChain);		Aliases.push_back(OriginalChain);
return;		return;
}		}

// Don't bother if we've been before.		// Don't bother if we've been before.
if (!Visited.insert(Chain.getNode()).second)		if (!Visited.insert(Chain.getNode()).second)
continue;		continue;
[... 206 lines hidden ...]

test/CodeGen/AMDGPU/merge-stores.ll

; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=GCN %s		; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=GCN -check-prefix=GCN-NOAA %s
; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=GCN %s		; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=GCN -check-prefix=GCN-NOAA %s

		; RUN: llc -march=amdgcn -verify-machineinstrs -combiner-alias-analysis < %s | FileCheck -check-prefix=SI -check-prefix=GCN -check-prefix=GCN-AA %s
		; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs -combiner-alias-analysis < %s | FileCheck -check-prefix=SI -check-prefix=GCN -check-prefix=GCN-AA %s

; Run with devices with different unaligned load restrictions.		; Run with devices with different unaligned load restrictions.

; TODO: Vector element tests		; TODO: Vector element tests
; TODO: Non-zero base offset for load and store combinations		; TODO: Non-zero base offset for load and store combinations
; TODO: Same base addrspacecasted		; TODO: Same base addrspacecasted


[... 135 lines hidden ...]	define void @merge_global_store_4_constants_f32(float addrspace(1)* %out) #0 {
store float 2.0, float addrspace(1)* %out.gep.2		store float 2.0, float addrspace(1)* %out.gep.2
store float 4.0, float addrspace(1)* %out.gep.3		store float 4.0, float addrspace(1)* %out.gep.3
store float 8.0, float addrspace(1)* %out		store float 8.0, float addrspace(1)* %out
ret void		ret void
}		}

; FIXME: Should be able to merge this		; FIXME: Should be able to merge this
; GCN-LABEL: {{^}}merge_global_store_4_constants_mixed_i32_f32:		; GCN-LABEL: {{^}}merge_global_store_4_constants_mixed_i32_f32:
; XGCN: buffer_store_dwordx4		; GCN-NOAA: buffer_store_dword v
; GCN: buffer_store_dword		; GCN-NOAA: buffer_store_dword v
; GCN: buffer_store_dword		; GCN-NOAA: buffer_store_dword v
; GCN: buffer_store_dword		; GCN-NOAA: buffer_store_dword v
; GCN: buffer_store_dword
		; GCN-AA: buffer_store_dwordx2
		; GCN-AA: buffer_store_dword v
		; GCN-AA: buffer_store_dword v

; GCN: s_endpgm		; GCN: s_endpgm
define void @merge_global_store_4_constants_mixed_i32_f32(float addrspace(1)* %out) #0 {		define void @merge_global_store_4_constants_mixed_i32_f32(float addrspace(1)* %out) #0 {
%out.gep.1 = getelementptr float, float addrspace(1)* %out, i32 1		%out.gep.1 = getelementptr float, float addrspace(1)* %out, i32 1
%out.gep.2 = getelementptr float, float addrspace(1)* %out, i32 2		%out.gep.2 = getelementptr float, float addrspace(1)* %out, i32 2
%out.gep.3 = getelementptr float, float addrspace(1)* %out, i32 3		%out.gep.3 = getelementptr float, float addrspace(1)* %out, i32 3

%out.gep.1.bc = bitcast float addrspace(1)* %out.gep.1 to i32 addrspace(1)*		%out.gep.1.bc = bitcast float addrspace(1)* %out.gep.1 to i32 addrspace(1)*
%out.gep.3.bc = bitcast float addrspace(1)* %out.gep.3 to i32 addrspace(1)*		%out.gep.3.bc = bitcast float addrspace(1)* %out.gep.3 to i32 addrspace(1)*
[... 312 lines hidden ...]	define void @merge_global_store_4_adjacent_loads_i8_natural_align(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #0 {
store i8 %z, i8 addrspace(1)* %out.gep.2		store i8 %z, i8 addrspace(1)* %out.gep.2
store i8 %w, i8 addrspace(1)* %out.gep.3		store i8 %w, i8 addrspace(1)* %out.gep.3
ret void		ret void
}		}

; This works once AA is enabled on the subtarget		; This works once AA is enabled on the subtarget
; GCN-LABEL: {{^}}merge_global_store_4_vector_elts_loads_v4i32:		; GCN-LABEL: {{^}}merge_global_store_4_vector_elts_loads_v4i32:
; GCN: buffer_load_dwordx4 [[LOAD:v\[[0-9]+:[0-9]+\]]]		; GCN: buffer_load_dwordx4 [[LOAD:v\[[0-9]+:[0-9]+\]]]
; XGCN: buffer_store_dwordx4 [[LOAD]]
; GCN: buffer_store_dword v		; GCN-NOAA: buffer_store_dword v
; GCN: buffer_store_dword v		; GCN-NOAA: buffer_store_dword v
; GCN: buffer_store_dword v		; GCN-NOAA: buffer_store_dword v
; GCN: buffer_store_dword v		; GCN-NOAA: buffer_store_dword v

		; GCN-AA: buffer_store_dwordx4 [[LOAD]]

		; GCN: s_endpgm
define void @merge_global_store_4_vector_elts_loads_v4i32(i32 addrspace(1)* %out, <4 x i32> addrspace(1)* %in) #0 {		define void @merge_global_store_4_vector_elts_loads_v4i32(i32 addrspace(1)* %out, <4 x i32> addrspace(1)* %in) #0 {
%out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 1		%out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 1
%out.gep.2 = getelementptr i32, i32 addrspace(1)* %out, i32 2		%out.gep.2 = getelementptr i32, i32 addrspace(1)* %out, i32 2
%out.gep.3 = getelementptr i32, i32 addrspace(1)* %out, i32 3		%out.gep.3 = getelementptr i32, i32 addrspace(1)* %out, i32 3
%vec = load <4 x i32>, <4 x i32> addrspace(1)* %in		%vec = load <4 x i32>, <4 x i32> addrspace(1)* %in

%x = extractelement <4 x i32> %vec, i32 0		%x = extractelement <4 x i32> %vec, i32 0
%y = extractelement <4 x i32> %vec, i32 1		%y = extractelement <4 x i32> %vec, i32 1
[... 101 lines hidden ...]	define void @merge_global_store_7_constants_i32(i32 addrspace(1)* %out) {
store i32 98, i32 addrspace(1)* %idx4, align 4		store i32 98, i32 addrspace(1)* %idx4, align 4
%idx5 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 5		%idx5 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 5
store i32 91, i32 addrspace(1)* %idx5, align 4		store i32 91, i32 addrspace(1)* %idx5, align 4
%idx6 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 6		%idx6 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 6
store i32 212, i32 addrspace(1)* %idx6, align 4		store i32 212, i32 addrspace(1)* %idx6, align 4
ret void		ret void
}		}

		; FIXME: This should do 2 dwordx4 loads
; GCN-LABEL: {{^}}merge_global_store_8_constants_i32:		; GCN-LABEL: {{^}}merge_global_store_8_constants_i32:
; GCN: buffer_store_dwordx4
; GCN: buffer_store_dwordx4		; GCN-NOAA: buffer_store_dwordx4
		; GCN-NOAA: buffer_store_dwordx4

		; GCN-AA: buffer_store_dwordx4
		; GCN-AA: buffer_store_dwordx2
		; GCN-AA: buffer_store_dwordx2

		; GCN: s_endpgm

define void @merge_global_store_8_constants_i32(i32 addrspace(1)* %out) {		define void @merge_global_store_8_constants_i32(i32 addrspace(1)* %out) {
store i32 34, i32 addrspace(1)* %out, align 4		store i32 34, i32 addrspace(1)* %out, align 4
%idx1 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 1		%idx1 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 1
store i32 999, i32 addrspace(1)* %idx1, align 4		store i32 999, i32 addrspace(1)* %idx1, align 4
%idx2 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 2		%idx2 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 2
store i32 65, i32 addrspace(1)* %idx2, align 4		store i32 65, i32 addrspace(1)* %idx2, align 4
%idx3 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 3		%idx3 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 3
store i32 33, i32 addrspace(1)* %idx3, align 4		store i32 33, i32 addrspace(1)* %idx3, align 4
[... 15 lines hidden ...]