Details
- Reviewers
RKSimon, pengfei, LuoYuanke, lebedev.ri
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Hi, @lebedev.ri.
- I think https://reviews.llvm.org/D105390 also has the same issue, where you might create X86ISD::VBROADCAST(128->256) on AVX.
- Besides, before https://reviews.llvm.org/D105390 we had nicer asm output for the test case we added:
```
movq    qa_@GOTPCREL(%rip), %rax
movl    $1091567616, 30256(%rax)     # imm = 0x41100000
movabsq $4294967297, %rcx            # imm = 0x100000001
movq    %rcx, 46348(%rax)
vbroadcastf128 .LCPI0_0(%rip), %ymm0 # ymm0 = [7.812501848093234E-3,7.812501848093234E-3,7.812501848093234E-3,7.812501848093234E-3]
                                     # ymm0 = mem[0,1,0,1]
vmovups %ymm0, 48296(%rax)
vmovsd  .LCPI0_1(%rip), %xmm0        # xmm0 = mem[0],zero
vmovsd  %xmm0, 47372(%rax)
vzeroupper
retq
```
Would you take a look at the issue? I guess there is an opportunity for
broadcast(extract_vector_elt(load(y), 0)) -> sub_broadcast_load(y), but it needs deeper investigation.
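For illustration, the suggested fold can be modeled on plain arrays (a value-level sketch with illustrative names, not LLVM code): broadcasting element 0 of a freshly loaded vector produces the same value as splatting the scalar read directly from the same address.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Value-level model of the proposed fold. Names are illustrative only.
using V4 = std::array<float, 4>;

V4 load_v4(const float *mem) { return {mem[0], mem[1], mem[2], mem[3]}; }
float extract_vector_elt(const V4 &v, std::size_t idx) { return v[idx]; }
V4 broadcast(float s) { return {s, s, s, s}; }

// Before the fold: load the whole vector, extract element 0, splat it.
V4 before(const float *y) {
  return broadcast(extract_vector_elt(load_v4(y), 0));
}

// After the fold: a single broadcast-load of the scalar at y.
V4 after(const float *y) { return broadcast(y[0]); }
```

The folded form touches only `y[0]`, which is the point: the full vector load becomes dead once the extract was its only user.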
llvm/lib/Target/X86/X86ISelLowering.cpp:50997
> Why not add isel patterns to handle 128 -> 256 like we do for scalar broadcasts?
I haven't gotten around to looking at this yet; is this really the problem in the test in question?
llvm/lib/Target/X86/X86ISelLowering.cpp:50997
> I didn't get your point. Would you show the code?
llvm/lib/Target/X86/X86ISelLowering.cpp:50997
> This is what I had in mind - expand what is already in X86InstrSSE.td:

```
let Predicates = [HasAVX1Only] in {
  def : Pat<(v4f32 (X86VBroadcast FR32:$src)),
            (VPERMILPSri (v4f32 (COPY_TO_REGCLASS FR32:$src, VR128)), 0)>;
  def : Pat<(v8f32 (X86VBroadcast FR32:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v8f32 (IMPLICIT_DEF)),
                             (v4f32 (VPERMILPSri (v4f32 (COPY_TO_REGCLASS FR32:$src, VR128)), 0)), sub_xmm),
                           (v4f32 (VPERMILPSri (v4f32 (COPY_TO_REGCLASS FR32:$src, VR128)), 0)), 1)>;
  def : Pat<(v8f32 (X86VBroadcast v4f32:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v8f32 (IMPLICIT_DEF)),
                             (v4f32 (VPERMILPSri VR128:$src, 0)), sub_xmm),
                           (v4f32 (VPERMILPSri VR128:$src, 0)), 1)>;
  def : Pat<(v4f64 (X86VBroadcast FR64:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v4f64 (IMPLICIT_DEF)),
                             (v2f64 (VMOVDDUPrr (v2f64 (COPY_TO_REGCLASS FR64:$src, VR128)))), sub_xmm),
                           (v2f64 (VMOVDDUPrr (v2f64 (COPY_TO_REGCLASS FR64:$src, VR128)))), 1)>;
  def : Pat<(v4f64 (X86VBroadcast v2f64:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v4f64 (IMPLICIT_DEF)),
                             (v2f64 (VMOVDDUPrr VR128:$src)), sub_xmm),
                           (v2f64 (VMOVDDUPrr VR128:$src)), 1)>;
  def : Pat<(v4i32 (X86VBroadcast GR32:$src)),
            (VPSHUFDri (VMOVDI2PDIrr GR32:$src), 0)>;
  def : Pat<(v8i32 (X86VBroadcast GR32:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v8i32 (IMPLICIT_DEF)),
                             (v4i32 (VPSHUFDri (VMOVDI2PDIrr GR32:$src), 0)), sub_xmm),
                           (v4i32 (VPSHUFDri (VMOVDI2PDIrr GR32:$src), 0)), 1)>;
  def : Pat<(v8i32 (X86VBroadcast v4i32:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v8i32 (IMPLICIT_DEF)),
                             (v4i32 (VPSHUFDri VR128:$src, 0)), sub_xmm),
                           (v4i32 (VPSHUFDri VR128:$src, 0)), 1)>;
  def : Pat<(v4i64 (X86VBroadcast GR64:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v4i64 (IMPLICIT_DEF)),
                             (v4i32 (VPSHUFDri (VMOV64toPQIrr GR64:$src), 0x44)), sub_xmm),
                           (v4i32 (VPSHUFDri (VMOV64toPQIrr GR64:$src), 0x44)), 1)>;
  def : Pat<(v4i64 (X86VBroadcast v2i64:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v4i64 (IMPLICIT_DEF)),
                             (v4i32 (VPSHUFDri VR128:$src, 0x44)), sub_xmm),
                           (v4i32 (VPSHUFDri VR128:$src, 0x44)), 1)>;
  def : Pat<(v2i64 (X86VBroadcast i64:$src)),
            (VPSHUFDri (VMOV64toPQIrr GR64:$src), 0x44)>;
  def : Pat<(v2i64 (X86VBroadcastld64 addr:$src)),
            (VMOVDDUPrm addr:$src)>;
}
```
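To make the shape of those patterns concrete, here is a small value-level model (a sketch, not LLVM code; names mirror the instructions above) of the v8f32 case: VPERMILPSri with immediate 0 splats element 0 within a 128-bit register, and VINSERTF128rr places that xmm value into both halves of the ymm.

```cpp
#include <array>
#include <cassert>

using Xmm = std::array<float, 4>; // 4 x f32, one 128-bit register
using Ymm = std::array<float, 8>; // 8 x f32, one 256-bit register

// VPERMILPSri src, 0: every 2-bit field of imm selects element 0.
Xmm vpermilps_splat0(const Xmm &src) {
  return {src[0], src[0], src[0], src[0]};
}

// VINSERTF128rr dst, src, imm: overwrite 128-bit half `imm` of dst.
Ymm vinsertf128(Ymm dst, const Xmm &src, int imm) {
  for (int i = 0; i < 4; ++i)
    dst[imm * 4 + i] = src[i];
  return dst;
}

// Model of the (v8f32 (X86VBroadcast v4f32:$src)) pattern above.
Ymm broadcast_v8f32(const Xmm &src) {
  Xmm splat = vpermilps_splat0(src);
  Ymm ymm{};                        // IMPLICIT_DEF
  ymm = vinsertf128(ymm, splat, 0); // INSERT_SUBREG ..., sub_xmm
  ymm = vinsertf128(ymm, splat, 1); // VINSERTF128rr ..., 1
  return ymm;
}
```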
llvm/lib/Target/X86/X86ISelLowering.cpp:50997
> I'm confused now: what you describe in the comments expands vbroadcast into several instructions, while what we do in lowering combines several nodes into vbroadcast. I think there is redundant work here. Shouldn't we prohibit the redundant combine in the first place?
llvm/lib/Target/X86/X86ISelLowering.cpp:50997
> Creating the broadcast node in the DAG (even if it expands later on) has the nice effect that the input will often then have hasOneUse, which permits further load folding, other optimizations, SimplifyDemandedVectorElts, etc.
llvm/lib/Target/X86/X86ISelLowering.cpp:50997
> However, when we try the combine broadcast(extract_vector_elt(x, 0)) -> broadcast(x), we never check that x has only one use (the extract_vector_elt(x, 0) itself), as shown in https://reviews.llvm.org/D105390.