Details
- Reviewers
RKSimon, pengfei, LuoYuanke, lebedev.ri
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Hi, @lebedev.ri.
- I think https://reviews.llvm.org/D105390 also has the same issue, where you might create X86ISD::VBROADCAST(128->256) on AVX.
- Besides, before https://reviews.llvm.org/D105390 we had nicer asm output for the test case we added:
```
movq    qa_@GOTPCREL(%rip), %rax
movl    $1091567616, 30256(%rax)     # imm = 0x41100000
movabsq $4294967297, %rcx            # imm = 0x100000001
movq    %rcx, 46348(%rax)
vbroadcastf128 .LCPI0_0(%rip), %ymm0 # ymm0 = [7.812501848093234E-3,7.812501848093234E-3,7.812501848093234E-3,7.812501848093234E-3]
                                     # ymm0 = mem[0,1,0,1]
vmovups %ymm0, 48296(%rax)
vmovsd  .LCPI0_1(%rip), %xmm0        # xmm0 = mem[0],zero
vmovsd  %xmm0, 47372(%rax)
vzeroupper
retq
```
Would you take a look at the issue? I guess there is an opportunity for
broadcast(extract_vector_elt(load(y), 0)) -> sub_broadcast_load(y), but it needs deeper investigation.
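For illustration, the suggested fold can be modeled on plain arrays (a value-level sketch with illustrative names, not LLVM code): broadcasting element 0 of a freshly loaded vector produces the same value as splatting the scalar read directly from the same address.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Value-level model of the proposed fold. Names are illustrative only.
using V4 = std::array<float, 4>;

V4 load_v4(const float *mem) { return {mem[0], mem[1], mem[2], mem[3]}; }
float extract_vector_elt(const V4 &v, std::size_t idx) { return v[idx]; }
V4 broadcast(float s) { return {s, s, s, s}; }

// Before the fold: load the whole vector, extract element 0, splat it.
V4 before(const float *y) {
  return broadcast(extract_vector_elt(load_v4(y), 0));
}

// After the fold: a single broadcast-load of the scalar at y.
V4 after(const float *y) { return broadcast(y[0]); }
```

The folded form touches only `y[0]`, which is the point: the full vector load becomes dead once the extract was its only user.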
llvm/lib/Target/X86/X86ISelLowering.cpp:50997
> Why not add isel patterns to handle 128 -> 256 like we do for scalar broadcasts?
I haven't gotten around to looking at this yet; is this really the problem in the test in question?
llvm/lib/Target/X86/X86ISelLowering.cpp:50997
> I didn't get your point. Would you show the code?
llvm/lib/Target/X86/X86ISelLowering.cpp:50997
> This is what I had in mind - expand what is already in X86InstrSSE.td:

```
let Predicates = [HasAVX1Only] in {
  def : Pat<(v4f32 (X86VBroadcast FR32:$src)),
            (VPERMILPSri (v4f32 (COPY_TO_REGCLASS FR32:$src, VR128)), 0)>;
  def : Pat<(v8f32 (X86VBroadcast FR32:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v8f32 (IMPLICIT_DEF)),
                             (v4f32 (VPERMILPSri (v4f32 (COPY_TO_REGCLASS FR32:$src, VR128)), 0)), sub_xmm),
                           (v4f32 (VPERMILPSri (v4f32 (COPY_TO_REGCLASS FR32:$src, VR128)), 0)), 1)>;
  def : Pat<(v8f32 (X86VBroadcast v4f32:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v8f32 (IMPLICIT_DEF)),
                             (v4f32 (VPERMILPSri VR128:$src, 0)), sub_xmm),
                           (v4f32 (VPERMILPSri VR128:$src, 0)), 1)>;
  def : Pat<(v4f64 (X86VBroadcast FR64:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v4f64 (IMPLICIT_DEF)),
                             (v2f64 (VMOVDDUPrr (v2f64 (COPY_TO_REGCLASS FR64:$src, VR128)))), sub_xmm),
                           (v2f64 (VMOVDDUPrr (v2f64 (COPY_TO_REGCLASS FR64:$src, VR128)))), 1)>;
  def : Pat<(v4f64 (X86VBroadcast v2f64:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v4f64 (IMPLICIT_DEF)),
                             (v2f64 (VMOVDDUPrr VR128:$src)), sub_xmm),
                           (v2f64 (VMOVDDUPrr VR128:$src)), 1)>;
  def : Pat<(v4i32 (X86VBroadcast GR32:$src)),
            (VPSHUFDri (VMOVDI2PDIrr GR32:$src), 0)>;
  def : Pat<(v8i32 (X86VBroadcast GR32:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v8i32 (IMPLICIT_DEF)),
                             (v4i32 (VPSHUFDri (VMOVDI2PDIrr GR32:$src), 0)), sub_xmm),
                           (v4i32 (VPSHUFDri (VMOVDI2PDIrr GR32:$src), 0)), 1)>;
  def : Pat<(v8i32 (X86VBroadcast v4i32:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v8i32 (IMPLICIT_DEF)),
                             (v4i32 (VPSHUFDri VR128:$src, 0)), sub_xmm),
                           (v4i32 (VPSHUFDri VR128:$src, 0)), 1)>;
  def : Pat<(v4i64 (X86VBroadcast GR64:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v4i64 (IMPLICIT_DEF)),
                             (v4i32 (VPSHUFDri (VMOV64toPQIrr GR64:$src), 0x44)), sub_xmm),
                           (v4i32 (VPSHUFDri (VMOV64toPQIrr GR64:$src), 0x44)), 1)>;
  def : Pat<(v4i64 (X86VBroadcast v2i64:$src)),
            (VINSERTF128rr (INSERT_SUBREG (v4i64 (IMPLICIT_DEF)),
                             (v4i32 (VPSHUFDri VR128:$src, 0x44)), sub_xmm),
                           (v4i32 (VPSHUFDri VR128:$src, 0x44)), 1)>;
  def : Pat<(v2i64 (X86VBroadcast i64:$src)),
            (VPSHUFDri (VMOV64toPQIrr GR64:$src), 0x44)>;
  def : Pat<(v2i64 (X86VBroadcastld64 addr:$src)),
            (VMOVDDUPrm addr:$src)>;
}
```
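To make the shape of those patterns concrete, here is a small value-level model (a sketch, not LLVM code; names mirror the instructions above) of the v8f32 case: VPERMILPSri with immediate 0 splats element 0 within a 128-bit register, and VINSERTF128rr places that xmm value into both halves of the ymm.

```cpp
#include <array>
#include <cassert>

using Xmm = std::array<float, 4>; // 4 x f32, one 128-bit register
using Ymm = std::array<float, 8>; // 8 x f32, one 256-bit register

// VPERMILPSri src, 0: every 2-bit field of imm selects element 0.
Xmm vpermilps_splat0(const Xmm &src) {
  return {src[0], src[0], src[0], src[0]};
}

// VINSERTF128rr dst, src, imm: overwrite 128-bit half `imm` of dst.
Ymm vinsertf128(Ymm dst, const Xmm &src, int imm) {
  for (int i = 0; i < 4; ++i)
    dst[imm * 4 + i] = src[i];
  return dst;
}

// Model of the (v8f32 (X86VBroadcast v4f32:$src)) pattern above.
Ymm broadcast_v8f32(const Xmm &src) {
  Xmm splat = vpermilps_splat0(src);
  Ymm ymm{};                        // IMPLICIT_DEF
  ymm = vinsertf128(ymm, splat, 0); // INSERT_SUBREG ..., sub_xmm
  ymm = vinsertf128(ymm, splat, 1); // VINSERTF128rr ..., 1
  return ymm;
}
```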
llvm/lib/Target/X86/X86ISelLowering.cpp:50997
> I'm confused now: what you describe in the comments expands vbroadcast into several instructions, while what we do in lowering combines several nodes into vbroadcast. I think there is redundant work here. Shouldn't we prohibit the redundant combine in the first place?
llvm/lib/Target/X86/X86ISelLowering.cpp:50997
> Creating the broadcast node in the DAG (even if it expands later on) has the nice effect that the input will often then have hasOneUse, which permits further load folding, other optimizations, SimplifyDemandedVectorElts, etc.
llvm/lib/Target/X86/X86ISelLowering.cpp:50997
> However, when we try the combine broadcast(extract_vector_elt(x, 0)) -> broadcast(x), we never check that x has only one use (the extract_vector_elt(x, 0) itself), as shown in https://reviews.llvm.org/D105390.