When selectScalarSSELoad is looking for a scalar_to_vector of a scalar load, it makes sure the load is only used by the scalar_to_vector, but it doesn't make sure the scalar_to_vector itself is only used once. This can cause the same load to be folded into multiple instructions, which can hurt performance.
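For reference, a minimal sketch of the kind of check being discussed; this is a hypothetical, simplified helper, not the actual selectScalarSSELoad code in X86ISelDAGToDAG.cpp:

  // Simplified illustration of the one-use requirements (hypothetical helper).
  static bool looksLikeFoldableScalarLoad(SDValue N) {
    if (N.getOpcode() != ISD::SCALAR_TO_VECTOR)
      return false;
    SDValue Ld = N.getOperand(0);
    // Existing check: the load must feed only the scalar_to_vector.
    if (!ISD::isNormalLoad(Ld.getNode()) || !Ld.hasOneUse())
      return false;
    // Missing check this patch adds: the scalar_to_vector itself must have a
    // single use, otherwise the load ends up folded into several instructions.
    if (!N.hasOneUse())
      return false;
    return true;
  }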
test/CodeGen/X86/vec_ss_load_fold.ll:384 (On Diff #78321)
As the change in generated code shows, the tradeoff is increased register pressure. Do we have reason to believe that this change puts us on the preferred side?
test/CodeGen/X86/vec_ss_load_fold.ll:384 (On Diff #78321)
I agree that this change has the potential of increasing register pressure. Note also that this is not the only place where we check for 'hasOneUse()' in this file (see for example X86DAGToDAGISel::IsProfitableToFold). I hope this makes sense.
The vector version of this already works as expected:
define <4 x float> @double_fold(<4 x float>* %x, <4 x float> %y) {
  %t0 = load <4 x float>, <4 x float>* %x, align 1
  %t1 = tail call <4 x float> @llvm.x86.sse.min.ps(<4 x float> %y, <4 x float> %t0)
  %t2 = tail call <4 x float> @llvm.x86.sse.max.ps(<4 x float> %y, <4 x float> %t0)
  %t3 = fadd <4 x float> %t1, %t2
  ret <4 x float> %t3
}

declare <4 x float> @llvm.x86.sse.min.ps(<4 x float>, <4 x float>)
declare <4 x float> @llvm.x86.sse.max.ps(<4 x float>, <4 x float>)
$ ./llc -o - foldfold.ll -mattr=avx
...
  vmovups (%rdi), %xmm1
  vminps  %xmm1, %xmm0, %xmm2
  vmaxps  %xmm1, %xmm0, %xmm0
  vaddps  %xmm0, %xmm2, %xmm0
  retq
The divergence begins when we map the vector intrinsics to x86-specific nodes:
  X86_INTRINSIC_DATA(sse_max_ps, INTR_TYPE_2OP, X86ISD::FMAX, 0),
  X86_INTRINSIC_DATA(sse_min_ps, INTR_TYPE_2OP, X86ISD::FMIN, 0),
...but there's no equivalent mapping for the scalar intrinsics. Would that be a better/another fix (assuming it works, I didn't actually try it)?
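Purely as an untested sketch of what such a mapping might look like; it assumes scalar-aware opcodes (written here as X86ISD::FMINS/FMAXS) exist or would be added, since a plain whole-vector FMIN/FMAX would not preserve the pass-through of the upper elements that the *_ss intrinsics require:

  // Hypothetical entries, mirroring the _ps ones above (not in the tree).
  X86_INTRINSIC_DATA(sse_max_ss, INTR_TYPE_2OP, X86ISD::FMAXS, 0),
  X86_INTRINSIC_DATA(sse_min_ss, INTR_TYPE_2OP, X86ISD::FMINS, 0),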
The min/max intrinsic patterns in tablegen try to use sse_load_f32 and sse_load_f64, which use this function to look for a load that is zero extended from f32/f64 to v4f32/v2f64, or a scalar_to_vector from f32/f64 to v4f32/v2f64. The intrinsics themselves take a v4f32/v2f64. Ultimately I want to extend this function to also allow a regular v4f32/v2f64 load. Currently those cases are folded later using the folding tables, but isel should have been able to get it right without the peephole.
Another possible fix is to lower the intrinsics to a scalar max SDNode with inserts and extracts around it, like this:

  (insert_vector_elt src1,
    (X86max (extract_vector_elt src1, 0), (extract_vector_elt src2, 0)),
    0)

Then pattern match it back to the min/max intrinsic instructions. This would be equivalent to how clang emits the FADD/FSUB/FMUL/FDIV intrinsics. We would need to do this for every pattern that currently uses sse_load_f32/f64. This would probably also fix PR31032, so maybe it's worth doing?
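A rough sketch of that lowering, assuming it were done when the intrinsic is turned into SelectionDAG nodes and that a suitable scalar opcode is passed in as ScalarOpc; the helper name and placement are illustrative only, not the actual patch:

  // Hypothetical helper: lower a *_ss min/max intrinsic to a scalar node
  // wrapped in extract/insert so isel can match it back to MINSS/MAXSS.
  SDValue lowerScalarMinMax(SDValue Src1, SDValue Src2, const SDLoc &DL,
                            SelectionDAG &DAG, unsigned ScalarOpc) {
    EVT VT = Src1.getValueType();           // v4f32 or v2f64
    EVT EltVT = VT.getVectorElementType();  // f32 or f64
    SDValue Zero = DAG.getIntPtrConstant(0, DL);
    SDValue A = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, EltVT, Src1, Zero);
    SDValue B = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, EltVT, Src2, Zero);
    SDValue Res = DAG.getNode(ScalarOpc, DL, EltVT, A, B);
    // Keep the upper elements of Src1, matching the *_ss semantics.
    return DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, VT, Src1, Res, Zero);
  }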
I was also planning to fix AVX-512 to use sse_load_f32/f64 for all the instructions that are equivalent to SSE/AVX instructions that are already using it.
That's what I was imagining - just so we try to standardize on a path for handling various opcodes and scalar vs. vector. But it's probably not possible for all opcodes/intrinsics.
Given Andrea's comment that we can undo this to avoid spilling, I have no objections to the patch.
But should this be gated on optimizing for size, or is it already?
There's a potential correctness problem here too. This IR finishes isel with the store only depending on the chain from the minss instruction but not the maxss instruction. So the duplication of the load breaks the chain dependency.
define void @double_fold(float* %x, <4 x float> %y, <4 x float>* %z) {
entry:
  %0 = load float, float* %x, align 1
  %vecinit.i = insertelement <4 x float> undef, float %0, i32 0
  %1 = tail call <4 x float> @llvm.x86.sse.min.ss(<4 x float> %y, <4 x float> %vecinit.i)
  %2 = tail call <4 x float> @llvm.x86.sse.max.ss(<4 x float> %y, <4 x float> %vecinit.i)
  %3 = fadd <4 x float> %1, %2
  store <4 x float> %3, <4 x float>* %z
  ret void
}
declare <4 x float> @llvm.x86.sse.min.ss(<4 x float>, <4 x float>)
declare <4 x float> @llvm.x86.sse.max.ss(<4 x float>, <4 x float>)
Given the problem with the chain, I don't think I can gate this on any optimization flag. I think we need this check to be consistent with the behavior that OPC_CheckFoldableChainNode has for normal loads during isel. That check makes sure all nodes preceding the load up to the root have a single use. So adding the check here would ensure the same behavior.
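Conceptually the requirement is something like this (a simplified illustration, not the real matcher code):

  // Every node on the path from the folded load up to (but excluding) the
  // root must have a single use; otherwise the fold duplicates the load and
  // can lose chain dependencies, as in the example above.
  static bool intermediatesHaveOneUse(ArrayRef<SDValue> PathFromLoadToRoot) {
    for (SDValue N : PathFromLoadToRoot.drop_back()) // drop the root itself
      if (!N.hasOneUse())
        return false;
    return true;
  }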
LGTM, but please add a FIXME code comment about the chain problem, so we don't lose track of that.