This is an archive of the discontinued LLVM Phabricator instance.

[X86][AVX] Prohibit creating X86ISD::VBROADCAST(128->256) when it is AVX in combineConcatVectorOps
AbandonedPublic

Authored by yubing on Sep 7 2021, 2:16 AM.

Diff Detail

Event Timeline

yubing created this revision.Sep 7 2021, 2:16 AM
yubing requested review of this revision.Sep 7 2021, 2:16 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 7 2021, 2:16 AM
yubing added a reviewer: lebedev.ri.
yubing added a comment.EditedSep 7 2021, 2:24 AM

Hi, @lebedev.ri .

  1. I think https://reviews.llvm.org/D105390 also has the same issue where you might create X86ISD::VBROADCAST(128->256) in AVX.
  2. Besides, Before https://reviews.llvm.org/D105390, we have nicer asm output for the testcase we added:
movq    qa_@GOTPCREL(%rip), %rax
movl    $1091567616, 30256(%rax)        # imm = 0x41100000
movabsq $4294967297, %rcx               # imm = 0x100000001
movq    %rcx, 46348(%rax)
vbroadcastf128  .LCPI0_0(%rip), %ymm0   # ymm0 = [7.812501848093234E-3,7.812501848093234E-3,7.812501848093234E-3,7.812501848093234E-3]
                                # ymm0 = mem[0,1,0,1]
vmovups %ymm0, 48296(%rax)
vmovsd  .LCPI0_1(%rip), %xmm0           # xmm0 = mem[0],zero
vmovsd  %xmm0, 47372(%rax)
vzeroupper
retq

Would you take a look at the issue? I guess there is opportunity for
broadcast(extract_vector_elt(load(y), 0)) -> sub_broadcast_load(y). but need to do a deep investigation.

RKSimon added inline comments.Sep 7 2021, 2:32 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
50997

Why not add isel patterns to handle 128 ->256 like we do for scalar broadcasts?

I haven't gotten around to look at this yet, is this really the problem in the test in question?

yubing added inline comments.Sep 7 2021, 7:52 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
50997

I didn't got your point. Would you show the code?
BTW, X86ISD::VBROADCAST(A:v2f64->B:v4f64) is broadcasting A's low f64 to B's four elements, the corresponding instruction only exist in avx2(or after avx2). Do you mean lowering into broadcast_load or something else in isel pattern?

RKSimon added inline comments.Sep 7 2021, 8:33 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
50997

This is what I had in mind - expand what is already in X86InstrSSE.td:

let Predicates = [HasAVX1Only] in {
  def : Pat<(v4f32 (X86VBroadcast FR32:$src)),
            (VPERMILPSri (v4f32 (COPY_TO_REGCLASS FR32:$src, VR128)), 0)>;
  def : Pat<(v8f32 (X86VBroadcast FR32:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v8f32 (IMPLICIT_DEF)),
              (v4f32 (VPERMILPSri (v4f32 (COPY_TO_REGCLASS FR32:$src, VR128)), 0)), sub_xmm),
              (v4f32 (VPERMILPSri (v4f32 (COPY_TO_REGCLASS FR32:$src, VR128)), 0)), 1)>;
  def : Pat<(v8f32 (X86VBroadcast v4f32:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v8f32 (IMPLICIT_DEF)),
              (v4f32 (VPERMILPSri VR128:$src, 0)), sub_xmm),
              (v4f32 (VPERMILPSri VR128:$src, 0)), 1)>;
  def : Pat<(v4f64 (X86VBroadcast FR64:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v4f64 (IMPLICIT_DEF)),
              (v2f64 (VMOVDDUPrr (v2f64 (COPY_TO_REGCLASS FR64:$src, VR128)))), sub_xmm),
              (v2f64 (VMOVDDUPrr (v2f64 (COPY_TO_REGCLASS FR64:$src, VR128)))), 1)>;
  def : Pat<(v4f64 (X86VBroadcast v2f64:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v4f64 (IMPLICIT_DEF)),
              (v2f64 (VMOVDDUPrr VR128:$src)), sub_xmm),
              (v2f64 (VMOVDDUPrr VR128:$src)), 1)>;

  def : Pat<(v4i32 (X86VBroadcast GR32:$src)),
            (VPSHUFDri (VMOVDI2PDIrr GR32:$src), 0)>;
  def : Pat<(v8i32 (X86VBroadcast GR32:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v8i32 (IMPLICIT_DEF)),
              (v4i32 (VPSHUFDri (VMOVDI2PDIrr GR32:$src), 0)), sub_xmm),
              (v4i32 (VPSHUFDri (VMOVDI2PDIrr GR32:$src), 0)), 1)>;
  def : Pat<(v8i32 (X86VBroadcast v4i32:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v8i32 (IMPLICIT_DEF)),
              (v4i32 (VPSHUFDri VR128:$src, 0)), sub_xmm),
              (v4i32 (VPSHUFDri VR128:$src, 0)), 1)>;
  def : Pat<(v4i64 (X86VBroadcast GR64:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v4i64 (IMPLICIT_DEF)),
              (v4i32 (VPSHUFDri (VMOV64toPQIrr GR64:$src), 0x44)), sub_xmm),
              (v4i32 (VPSHUFDri (VMOV64toPQIrr GR64:$src), 0x44)), 1)>;
  def : Pat<(v4i64 (X86VBroadcast v2i64:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v4i64 (IMPLICIT_DEF)),
              (v4i32 (VPSHUFDri VR128:$src, 0x44)), sub_xmm),
              (v4i32 (VPSHUFDri VR128:$src, 0x44)), 1)>;

  def : Pat<(v2i64 (X86VBroadcast i64:$src)),
            (VPSHUFDri (VMOV64toPQIrr GR64:$src), 0x44)>;
  def : Pat<(v2i64 (X86VBroadcastld64 addr:$src)),
            (VMOVDDUPrm addr:$src)>;
}
yubing added inline comments.Sep 7 2021, 11:27 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
50997

It confused me now because what you said in comments is expanding vbroadcast into several instructions while what we do in lowering is combine several nodes into vbroadcast. I think there is redundant work here. Shouldn't we prohibit redundant combine at the beginning?

RKSimon added inline comments.Sep 8 2021, 4:57 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
50997

Creating the broadcast node in DAG (even if it expands later on) has the nice effect that the input will often then have hasOneUse, which permits further load folding, optimizations, SimplifyDemandedVectorElts etc.

yubing added inline comments.Sep 8 2021, 6:07 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
50997

However, When we are trying the combine: broadcast(extract_vector_elt(x, 0)) -> broadcast(x), we never check x has oneuse which is located at extract_vector_elt(x, 0), as shown in https://reviews.llvm.org/D105390 .

yubing abandoned this revision.Sep 8 2021, 8:24 PM