This is an archive of the discontinued LLVM Phabricator instance.

[X86] Add a hasOneUse check to selectScalarSSELoad to keep the same load from being folded multiple times
ClosedPublic

Authored by craig.topper on Nov 16 2016, 10:04 PM.

Details

Summary

When selectScalarSSELoad is looking for a scalar_to_vector of a scalar load, it makes sure the load is only used by the scalar_to_vector. But it doesn't make sure the scalar_to_vector itself is only used once. This can cause the same load to be folded multiple times, which is bad for performance.
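
A minimal sketch of the kind of guard this adds in X86DAGToDAGISel::selectScalarSSELoad (not the verbatim diff; only the hasOneUse check is new, the surrounding structure is illustrative):

// When matching (scalar_to_vector (load x)), also require that the
// scalar_to_vector node itself has exactly one use; otherwise two users
// (e.g. a minss and a maxss) would each fold their own copy of the load.
if (N.getOpcode() == ISD::SCALAR_TO_VECTOR &&
    N.getNode()->hasOneUse() &&                 // the check added here
    ISD::isNON_EXTLoad(N.getOperand(0).getNode())) {
  // ... existing logic: verify the load's only user is the
  // scalar_to_vector, then return its address operands so the
  // sse_load_f32/f64 ComplexPattern can fold it ...
}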

Diff Detail

Repository
rL LLVM

Event Timeline

craig.topper retitled this revision from to [X86] Add a hasOneUse check to selectScalarSSELoad to keep the same load from being folded multiple times.
craig.topper updated this object.
craig.topper added reviewers: RKSimon, spatel, zvi, delena.
craig.topper added a subscriber: llvm-commits.
zvi added inline comments. Nov 16 2016, 11:46 PM
test/CodeGen/X86/vec_ss_load_fold.ll
384 (On Diff #78321)

As the change in generated code shows, the tradeoff is increased register pressure. Do we have reason to believe that this change puts us on the preferred side of that tradeoff?

andreadb added inline comments.
test/CodeGen/X86/vec_ss_load_fold.ll
384 (On Diff #78321)

I agree that this change has the potential to increase register pressure.
However, regalloc is smart enough to avoid allocating a spill slot if the spill candidate can be folded as a load into the user instructions. In this case, the loaded value would feed into a minss and a maxss. Those instructions both appear in the memory folding tables (see X86InstrInfo.cpp); the InlineSpiller would take advantage of this and fold the load instead of reserving a stack slot (and therefore inserting a reload).
Basically, my opinion is that this may not be a very problematic case, because regalloc should be smart enough to fall back to the other codegen if we run out of registers.
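
For reference, a hedged illustration of the folding-table entries being referred to; the real tables in X86InstrInfo.cpp are much larger and the exact struct name and flags should be treated as indicative rather than exact:

// Illustrative excerpt only: register-form -> memory-form mappings that let
// the InlineSpiller fold a reload directly into MINSS/MAXSS instead of
// materializing the value in a register first.
static const X86MemoryFoldTableEntry MemoryFoldTable2[] = {
  // RegOp          MemOp          Flags
  { X86::MAXSSrr,   X86::MAXSSrm,  0 },
  { X86::MINSSrr,   X86::MINSSrm,  0 },
  // ...
};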

Note also that this is not the only place where we check for 'hasOneUse()' in this file (see for example X86DAGToDAGISel::IsProfitableToFold).

I hope this makes sense.

spatel edited edge metadata. Nov 17 2016, 9:45 AM

The vector version of this already works as expected:

define <4 x float> @double_fold(<4 x float>* %x, <4 x float> %y) {
  %t0 = load <4 x float>, <4 x float>* %x, align 1
  %t1 = tail call <4 x float> @llvm.x86.sse.min.ps(<4 x float> %y, <4 x float> %t0)
  %t2 = tail call <4 x float> @llvm.x86.sse.max.ps(<4 x float> %y, <4 x float> %t0)
  %t3 = fadd <4 x float> %t1, %t2
  ret <4 x float> %t3
}

$ ./llc -o - foldfold.ll -mattr=avx
...

vmovups	(%rdi), %xmm1
vminps	%xmm1, %xmm0, %xmm2
vmaxps	%xmm1, %xmm0, %xmm0
vaddps	%xmm0, %xmm2, %xmm0
retq

The divergence begins when we map the vector intrinsics to x86-specific nodes:

X86_INTRINSIC_DATA(sse_max_ps,        INTR_TYPE_2OP, X86ISD::FMAX, 0),
X86_INTRINSIC_DATA(sse_min_ps,        INTR_TYPE_2OP, X86ISD::FMIN, 0),

...but there's no equivalent mapping for the scalar intrinsics. Would that be a better/another fix (assuming it works, I didn't actually try it)?
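
For illustration only, hypothetical entries showing what an analogous scalar mapping might look like; these did not exist at the time of this review, and on their own they would presumably not be enough, since X86ISD::FMAX/FMIN operate on the whole vector while the ss intrinsics pass the upper elements of the first source through unchanged:

// Hypothetical, for discussion only - not committed code.
X86_INTRINSIC_DATA(sse_max_ss,        INTR_TYPE_2OP, X86ISD::FMAX, 0),
X86_INTRINSIC_DATA(sse_min_ss,        INTR_TYPE_2OP, X86ISD::FMIN, 0),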

The min/max intrinsics patterns in tablegen try to use sse_load_f32 and sse_load_f64, which use this function to look for a load that is zero extended from f32/f64 to v4f32/v2f64, or a scalar_to_vector from f32/f64 to v4f32/v2f64. The intrinsics themselves take a v4f32/v2f64. Ultimately, I want to extend this function to also allow a regular v4f32/v2f64 load. Currently those cases are folded later using the folding tables, but isel should have been able to get it right without the peephole.

Another possible fix is to lower the intrinsics to a scalar max SDNode with inserts and extracts around it, like this: (insert_vector_elt src1 (X86max (extract_vector_elt src1, 0), (extract_vector_elt src2, 0)), 0). Then pattern match that back to the min/max intrinsic instructions. This would be equivalent to how clang emits the FADD/FSUB/FMUL/FDIV intrinsics. We would need to do this for every pattern that currently uses sse_load_f32/f64. This would probably also fix PR31032, so maybe it's worth doing?
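
A hedged sketch of what that insert/extract lowering might look like; the helper name and placement are hypothetical (assuming it is called where X86 lowers the INTRINSIC_WO_CHAIN node for llvm.x86.sse.max.ss/min.ss), not committed code:

static SDValue lowerScalarMinMaxIntrinsic(SDValue Op, SelectionDAG &DAG,
                                          unsigned X86Opc /* X86ISD::FMAX or FMIN */) {
  SDLoc dl(Op);
  SDValue Src1 = Op.getOperand(1);  // operand 0 is the intrinsic ID
  SDValue Src2 = Op.getOperand(2);
  // Extract element 0 from both sources and do the operation as a scalar node.
  SDValue Elt1 = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::f32, Src1,
                             DAG.getIntPtrConstant(0, dl));
  SDValue Elt2 = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::f32, Src2,
                             DAG.getIntPtrConstant(0, dl));
  SDValue Res = DAG.getNode(X86Opc, dl, MVT::f32, Elt1, Elt2);
  // Reinsert into element 0 of the first source, preserving its upper elements.
  return DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, Op.getValueType(), Src1, Res,
                     DAG.getIntPtrConstant(0, dl));
}

Tablegen patterns would then need to match this insert/extract form back to the MINSS/MAXSS instructions (and their memory forms).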

I was also planning to fix AVX-512 to use sse_load_f32/f64 for all the instructions that are equivalent to SSE/AVX instructions that are already using it.

> Another possible fix is to lower the intrinsics to a scalar max SDNode with inserts and extracts around it, like this: (insert_vector_elt src1 (X86max (extract_vector_elt src1, 0), (extract_vector_elt src2, 0)), 0). Then pattern match that back to the min/max intrinsic instructions. This would be equivalent to how clang emits the FADD/FSUB/FMUL/FDIV intrinsics. We would need to do this for every pattern that currently uses sse_load_f32/f64. This would probably also fix PR31032, so maybe it's worth doing?

That's what I was imagining - just so we try to standardize on a path for handling various opcodes and scalar vs. vector. But it's probably not possible for all opcodes/intrinsics.

Given Andrea's comment that we can undo this to avoid spilling, I have no objections to the patch.

But should this be gated on optimizing for size, or is it already?

There's a potential correctness problem here too. The IR below finishes isel with the store depending only on the chain from the minss instruction, not the maxss instruction, so duplicating the load breaks the chain dependency.

define void @double_fold(float* %x, <4 x float> %y, <4 x float>* %z) {
entry:
  %0 = load float, float* %x, align 1
  %vecinit.i = insertelement <4 x float> undef, float %0, i32 0
  %1 = tail call <4 x float> @llvm.x86.sse.min.ss(<4 x float> %y, <4 x float> %vecinit.i)
  %2 = tail call <4 x float> @llvm.x86.sse.max.ss(<4 x float> %y, <4 x float> %vecinit.i)
  %3 = fadd <4 x float> %1, %2
  store <4 x float> %3, <4 x float>* %z
  ret void
}

declare <4 x float> @llvm.x86.sse.min.ss(<4 x float>, <4 x float>)

declare <4 x float> @llvm.x86.sse.max.ss(<4 x float>, <4 x float>)

Given the problem with the chain, I don't think I can gate this on any optimization flag. I think we need this check to be consistent with the behavior that OPC_CheckFoldableChainNode has for normal loads during isel. That check makes sure all nodes preceding the load up to the root have a single use. So adding the check here ensures the same behavior.
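
For context, the single-use walk being referred to looks roughly like this in the generated-matcher interpreter (a paraphrase of the OPC_CheckFoldableChainNode handling in SelectionDAGISel, not a verbatim excerpt):

// Every node on the path between the candidate load and the pattern root
// must have a single use, otherwise the fold is rejected.
bool HasMultipleUses = false;
for (unsigned i = 1, e = NodeStack.size() - 1; i != e; ++i)
  if (!NodeStack[i].hasOneUse()) {
    HasMultipleUses = true;
    break;
  }
if (HasMultipleUses)
  break;  // reject this candidate and keep matching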

spatel accepted this revision. Nov 26 2016, 8:04 AM
spatel edited edge metadata.

LGTM, but please add a FIXME code comment about the chain problem, so we don't lose track of that.

This revision is now accepted and ready to land. Nov 26 2016, 8:04 AM
This revision was automatically updated to reflect the committed changes.