When selectScalarSSELoad is looking for a scalar_to_vector of a scalar load, it makes sure the load is only used by the scalar_to_vector, but it doesn't make sure the scalar_to_vector itself is only used once. This can cause the same load to be folded into multiple instructions, which can hurt performance.
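For reference, a minimal sketch of the kind of check being discussed; this is a hypothetical, simplified helper, not the actual selectScalarSSELoad code in X86ISelDAGToDAG.cpp:

  // Simplified illustration of the one-use requirements (hypothetical helper).
  static bool looksLikeFoldableScalarLoad(SDValue N) {
    if (N.getOpcode() != ISD::SCALAR_TO_VECTOR)
      return false;
    SDValue Ld = N.getOperand(0);
    // Existing check: the load must feed only the scalar_to_vector.
    if (!ISD::isNormalLoad(Ld.getNode()) || !Ld.hasOneUse())
      return false;
    // Missing check this patch adds: the scalar_to_vector itself must have a
    // single use, otherwise the load ends up folded into several instructions.
    if (!N.hasOneUse())
      return false;
    return true;
  }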
test/CodeGen/X86/vec_ss_load_fold.ll:384 (On Diff #78321)
As the change in generated code shows, the tradeoff is increased register pressure. Do we have reason to believe that this change puts us on the preferred side?
test/CodeGen/X86/vec_ss_load_fold.ll:384 (On Diff #78321)
I agree that this change has the potential of increasing register pressure. Note also that this is not the only place where we check for 'hasOneUse()' in this file (see for example X86DAGToDAGISel::IsProfitableToFold). I hope this makes sense.
The vector version of this already works as expected:
define <4 x float> @double_fold(<4 x float>* %x, <4 x float> %y) {
  %t0 = load <4 x float>, <4 x float>* %x, align 1
  %t1 = tail call <4 x float> @llvm.x86.sse.min.ps(<4 x float> %y, <4 x float> %t0)
  %t2 = tail call <4 x float> @llvm.x86.sse.max.ps(<4 x float> %y, <4 x float> %t0)
  %t3 = fadd <4 x float> %t1, %t2
  ret <4 x float> %t3
}

declare <4 x float> @llvm.x86.sse.min.ps(<4 x float>, <4 x float>)
declare <4 x float> @llvm.x86.sse.max.ps(<4 x float>, <4 x float>)
$ ./llc -o - foldfold.ll -mattr=avx
...
  vmovups (%rdi), %xmm1
  vminps  %xmm1, %xmm0, %xmm2
  vmaxps  %xmm1, %xmm0, %xmm0
  vaddps  %xmm0, %xmm2, %xmm0
  retq
The divergence begins when we map the vector intrinsics to x86-specific nodes:
  X86_INTRINSIC_DATA(sse_max_ps, INTR_TYPE_2OP, X86ISD::FMAX, 0),
  X86_INTRINSIC_DATA(sse_min_ps, INTR_TYPE_2OP, X86ISD::FMIN, 0),
...but there's no equivalent mapping for the scalar intrinsics. Would that be a better/another fix (assuming it works, I didn't actually try it)?
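Purely as an untested sketch of what such a mapping might look like; it assumes scalar-aware opcodes (written here as X86ISD::FMINS/FMAXS) exist or would be added, since a plain whole-vector FMIN/FMAX would not preserve the pass-through of the upper elements that the *_ss intrinsics require:

  // Hypothetical entries, mirroring the _ps ones above (not in the tree).
  X86_INTRINSIC_DATA(sse_max_ss, INTR_TYPE_2OP, X86ISD::FMAXS, 0),
  X86_INTRINSIC_DATA(sse_min_ss, INTR_TYPE_2OP, X86ISD::FMINS, 0),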
The min/max intrinsic patterns in tablegen try to use sse_load_f32 and sse_load_f64, which use this function to look for a load that is zero extended from f32/f64 to v4f32/v2f64, or a scalar_to_vector from f32/f64 to v4f32/v2f64. The intrinsics themselves take a v4f32/v2f64. Ultimately I want to extend this function to also allow a regular v4f32/v2f64 load. Currently those cases are folded later using the folding tables, but isel should have been able to get it right without the peephole.
Another possible fix is to lower the intrinsics to a scalar max SDNode with inserts and extracts around it, like this:

  (insert_vector_elt src1,
    (X86max (extract_vector_elt src1, 0), (extract_vector_elt src2, 0)),
    0)

Then pattern match it back to the min/max intrinsic instructions. This would be equivalent to how clang emits the FADD/FSUB/FMUL/FDIV intrinsics. We would need to do this for every pattern that currently uses sse_load_f32/f64. This would probably also fix PR31032, so maybe it's worth doing?
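A rough sketch of that lowering, assuming it were done when the intrinsic is turned into SelectionDAG nodes and that a suitable scalar opcode is passed in as ScalarOpc; the helper name and placement are illustrative only, not the actual patch:

  // Hypothetical helper: lower a *_ss min/max intrinsic to a scalar node
  // wrapped in extract/insert so isel can match it back to MINSS/MAXSS.
  SDValue lowerScalarMinMax(SDValue Src1, SDValue Src2, const SDLoc &DL,
                            SelectionDAG &DAG, unsigned ScalarOpc) {
    EVT VT = Src1.getValueType();           // v4f32 or v2f64
    EVT EltVT = VT.getVectorElementType();  // f32 or f64
    SDValue Zero = DAG.getIntPtrConstant(0, DL);
    SDValue A = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, EltVT, Src1, Zero);
    SDValue B = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, EltVT, Src2, Zero);
    SDValue Res = DAG.getNode(ScalarOpc, DL, EltVT, A, B);
    // Keep the upper elements of Src1, matching the *_ss semantics.
    return DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, VT, Src1, Res, Zero);
  }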
I was also planning to fix AVX-512 to use sse_load_f32/f64 for all the instructions that are equivalent to SSE/AVX instructions that are already using it.
That's what I was imagining - just so we try to standardize on a path for handling various opcodes and scalar vs. vector. But it's probably not possible for all opcodes/intrinsics.
Given Andrea's comment that we can undo this to avoid spilling, I have no objections to the patch.
But should this be gated on optimizing for size, or is it already?
There's a potential correctness problem here too. This IR finishes isel with the store only depending on the chain from the minss instruction but not the maxss instruction. So the duplication of the load breaks the chain dependency.
define void @double_fold(float* %x, <4 x float> %y, <4 x float>* %z) {
entry:
  %0 = load float, float* %x, align 1
  %vecinit.i = insertelement <4 x float> undef, float %0, i32 0
  %1 = tail call <4 x float> @llvm.x86.sse.min.ss(<4 x float> %y, <4 x float> %vecinit.i)
  %2 = tail call <4 x float> @llvm.x86.sse.max.ss(<4 x float> %y, <4 x float> %vecinit.i)
  %3 = fadd <4 x float> %1, %2
  store <4 x float> %3, <4 x float>* %z
  ret void
}
declare <4 x float> @llvm.x86.sse.min.ss(<4 x float>, <4 x float>)
declare <4 x float> @llvm.x86.sse.max.ss(<4 x float>, <4 x float>)
Given the problem with the chain, I don't think I can gate this on any optimization flag. I think we need this check to be consistent with the behavior that OPC_CheckFoldableChainNode has for normal loads during isel. That check makes sure all nodes preceding the load up to the root have a single use. So adding the check here would ensure the same behavior.
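Conceptually the requirement is something like this (a simplified illustration, not the real matcher code):

  // Every node on the path from the folded load up to (but excluding) the
  // root must have a single use; otherwise the fold duplicates the load and
  // can lose chain dependencies, as in the example above.
  static bool intermediatesHaveOneUse(ArrayRef<SDValue> PathFromLoadToRoot) {
    for (SDValue N : PathFromLoadToRoot.drop_back()) // drop the root itself
      if (!N.hasOneUse())
        return false;
    return true;
  }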
LGTM, but please add a FIXME code comment about the chain problem, so we don't lose track of that.