This is an archive of the discontinued LLVM Phabricator instance.

[X86] EltsFromConsecutiveLoads - support common source loads
ClosedPublic

Authored by RKSimon on Jul 11 2019, 3:00 AM.

Details

Summary

This patch enables us to find the source loads for each element, splitting them into a Load and ByteOffset, and attempts to recognise consecutive loads that are in fact from the same source load.

A helper function, FindEltLoadSrc, recurses through to find a LoadSDNode and determines the element's byte offset within it. When attempting to match consecutive loads, byte offsetted loads then attempt to matched against a previous load that has already been confirmed to be a consecutive match.

Next step towards PR16739 - after this we just need to account for shuffling/repeated elements to create a vector load + shuffle.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Jul 11 2019, 3:00 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 11 2019, 3:00 AM

Does this need to do anything to ensure that there are no interferences, in the sense of non-known-noalias writes?

Does this need to do anything to ensure that there are no interferences, in the sense of non-known-noalias writes?

That's what areNonVolatileConsecutiveLoads handles no?

Does this need to do anything to ensure that there are no interferences, in the sense of non-known-noalias writes?

That's what areNonVolatileConsecutiveLoads handles no?

It isn't fully obvious whether that only checks that the loads are from sequential locations in memory,
or whether it also checks that it is *legal* to perform those loads as one, at least to me.
I.e. i'm expecting this doesn't fold:

define <4 x float> @load_float4_float3_as_float2_float__with_write(<4 x float>* nocapture readonly dereferenceable(16)) {
  %2 = bitcast <4 x float>* %0 to <2 x float>*
  %3 = load <2 x float>, <2 x float>* %2, align 4
  %4 = extractelement <2 x float> %3, i32 0
  %5 = insertelement <4 x float> undef, float %4, i32 0
  %6 = extractelement <2 x float> %3, i32 1
  %7 = insertelement <4 x float> %5, float %6, i32 1
  %8 = getelementptr inbounds <4 x float>, <4 x float>* %0, i64 0, i64 2
  store float 42.0, float* %8 ; !!!
  %9 = load float, float* %8, align 4
  %10 = insertelement <4 x float> %7, float %9, i32 2
  ret <4 x float> %10
}

Is that what if (LD->getChain() != Base->getChain()) return false; does?

Is that what if (LD->getChain() != Base->getChain()) return false; does?

Yes, chains will handle these kinds of dependencies - do you want me to add that test ?

Is that what if (LD->getChain() != Base->getChain()) return false; does?

Yes, chains will handle these kinds of dependencies

Great to know! No other comments from me.

  • do you want me to add that test ?

Hmm, i'm not fully sure it's really useful as-is - in *that* case we can just reorder that store before these loads, so *that* case could still be folded.

Any more comments? I'm keen to try and get PR16739 fixed before the release branch.

spatel accepted this revision.Jul 18 2019, 5:41 AM

It's worth noting here in the review that this patch depends on the dereferenceable attribute (see D64205), and that attribute could change meaning as part of the larger changes related to the Attributor pass (D63243).
Based on current definitions, I think this is correct and allowable, so LGTM.

lib/Target/X86/X86ISelLowering.cpp
7505

formatting:
FindElt... --> findElt

7519–7521

Inconsistent methods for getting the constant (APInt vs. uin64_t). I don't think a logic difference is possible given other limits of LLVM, so go with getConstantOperandVal() in both places?

This revision is now accepted and ready to land.Jul 18 2019, 5:41 AM
This revision was automatically updated to reflect the committed changes.

This broke compilation for me, with https://martin.st/temp/simd_cmp_avx2.cpp, built with clang -target i686-w64-mingw32 -c -O3 -mavx2 simd_cmp_avx2.cpp, triggering failed asserts. I can file a proper bug later if necessary.

This broke compilation for me, with https://martin.st/temp/simd_cmp_avx2.cpp, built with clang -target i686-w64-mingw32 -c -O3 -mavx2 simd_cmp_avx2.cpp, triggering failed asserts. I can file a proper bug later if necessary.

Filed as a bug at https://bugs.llvm.org/show_bug.cgi?id=42727.

yubing added a subscriber: yubing.Sep 4 2019, 7:56 AM

Hi, Simon. This patch has produced a bug in llvm:

the attached t.ll can reproduce this bug:
For t.ll, llvm without this patch produces the correct asm while llvm with this patch produces bad asm:
llvm with this patch:

vmovd   xmm0, dword ptr [rdi + 260] # xmm0 = mem[0],zero,zero,zero
vmovd   xmm1, dword ptr [rdi + 252] # xmm1 = mem[0],zero,zero,zero
vpunpcklqdq     xmm0, xmm1, xmm0 # xmm0 = xmm1[0],xmm0[0]
vxorps  xmm1, xmm1, xmm1
vinserti128     ymm0, ymm1, xmm0, 1
vmovdqa ymmword ptr [rsi + 672], ymm0

llvm without this patch:

vmovq   xmm0, qword ptr [rdi + 252] # xmm0 = mem[0],zero
movabs  rax, offset .LCPI0_0
vmovdqa xmm1, xmmword ptr [rax]
vpshufb xmm0, xmm0, xmm1
vmovd   xmm1, dword ptr [rdi + 260] # xmm1 = mem[0],zero,zero,zero
vpunpcklqdq     xmm0, xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0]
vxorps  xmm1, xmm1, xmm1
vinserti128     ymm0, ymm1, xmm0, 1
vmovdqa ymmword ptr [rsi + 672], ymm0

When I compared the ‘llc -debug’ output, I found the procedure of Combining: t53: v4i32 = X86ISD::VZEXT_MOVL t43 are different:
llvm without this patch:

Legalizing: t53: v4i32 = X86ISD::VZEXT_MOVL t43
Legal node: nothing to do

Combining: t53: v4i32 = X86ISD::VZEXT_MOVL t43
Creating new node: t55: v2i64 = undef
Creating constant: t56: i8 = Constant<4>
Creating constant: t57: i8 = Constant<5>
Creating constant: t58: i8 = Constant<6>
Creating constant: t59: i8 = Constant<7>
Creating constant: t60: i8 = Constant<-1>

llvm with this patch:

Legalizing: t53: v4i32 = X86ISD::VZEXT_MOVL t43
Legal node: nothing to do

Combining: t53: v4i32 = X86ISD::VZEXT_MOVL t43

Then I used ‘dbg llc’ and found that EltsFromConsecutiveLoads will return SDValue() while llvm with this patch won’t.

The return SDValue() was deleted by your patch:

if (!ISD::isNON_EXTLoad(Elt.getNode()))
   return SDValue();

So, I added these two lines back to llvm with this patch, then output asm is correct and the same with llvm without this patch:

I've submit a patch to solve the bug which I commented yesterday.
https://reviews.llvm.org/D67210

llvm/trunk/test/CodeGen/X86/load-partial.ll
64 ↗(On Diff #210563)

This is also not correct, according to IR.

86 ↗(On Diff #210563)

This is not correct, since we are moving 3 float numbers to %xmm0, according to IR, instead of 4 float.
Before your patch, the testcase's CHECK is correct:
; AVX: # %bb.0:
; AVX-NEXT: vmovss (%rdi), %xmm0 # xmm0 = mem[0],zero,zero,zero
; AVX-NEXT: vinsertps $16, 4(%rdi), %xmm0, %xmm0 # xmm0 = xmm0[0],mem[0],xmm0[2,3]
; AVX-NEXT: vinsertps $32, 8(%rdi), %xmm0, %xmm0 # xmm0 = xmm0[0,1],mem[0],xmm0[3]
; AVX-NEXT: retq

RKSimon marked 2 inline comments as done.Sep 5 2019, 4:28 AM
RKSimon added inline comments.
llvm/trunk/test/CodeGen/X86/load-partial.ll
64 ↗(On Diff #210563)

why?

86 ↗(On Diff #210563)

The pointer is 16 byte dereferencable - loading all 4 floats is safe.

yubing added inline comments.Sep 5 2019, 9:51 AM
llvm/trunk/test/CodeGen/X86/load-partial.ll
86 ↗(On Diff #210563)

Sorry, It seems I am not familiar with 'dereferencable'. Could you please explain it in detail?
All I see is that moving only 3 float numbers to %xmm0, according to IR.

lebedev.ri added inline comments.Sep 5 2019, 10:06 AM
llvm/trunk/test/CodeGen/X86/load-partial.ll
86 ↗(On Diff #210563)

... moving 3 floats to %xmm, with 4'th channel being undef.
As per dereferenceable attribute, we can load 16 bytes here,
and since we are allowed to replace undef with anything,
we can just simply load the entire %xmm0.
So i'm not seeing an issue *here*. Can you explain what issue do you see?

yubing added inline comments.Sep 5 2019, 7:05 PM
llvm/trunk/test/CodeGen/X86/load-partial.ll
86 ↗(On Diff #210563)

You're right. Now I get your point.