This is an archive of the discontinued LLVM Phabricator instance.

[X86] Add isel pattern infrastructure to begin recognizing when we're inserting 0s into the upper portions of a vector register and the producing instruction has already produced the zeros.
ClosedPublic

Authored by craig.topper on Sep 8 2017, 4:40 PM.

Details

Reviewers

RKSimon
spatel
zvi
igorb
aymanmus

Commits

rG143797eb8970: [X86] Add isel pattern infrastructure to begin recognizing when we're inserting…
rL313365: [X86] Add isel pattern infrastructure to begin recognizing when we're inserting…

Summary

Currently, if we're inserting 0s into the upper elements of a vector register, we insert an explicit move of the smaller register to implicitly zero the upper bits. But if we can prove that the upper bits are already zero, we can skip that move. This is similar to what we do to avoid emitting explicit zero extends for GR32->GR64.

Unfortunately, this is harder for vector registers because several opcodes can write to XMM registers but have no VEX-encoded equivalent, so they leave the upper bits untouched. Among these are the SHA instructions and an MMX->XMM move. Bitcasts can also get in the way.
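The encoding distinction above can be illustrated with a small model. This is only a sketch: it treats a 256-bit register as a Python integer, and the helper names `write_xmm_legacy` and `write_xmm_vex` are invented for illustration, not taken from LLVM. It relies on the architectural rule that a VEX-encoded 128-bit write zeroes the bits above 127, while a legacy SSE-encoded write preserves them.

```python
MASK128 = (1 << 128) - 1
MASK256 = (1 << 256) - 1


def write_xmm_legacy(ymm_old: int, value128: int) -> int:
    """Model a legacy SSE-encoded write: bits 255:128 keep their old contents."""
    upper = ymm_old & ~MASK128 & MASK256
    return upper | (value128 & MASK128)


def write_xmm_vex(ymm_old: int, value128: int) -> int:
    """Model a VEX-encoded 128-bit write: bits above 127 are zeroed."""
    return value128 & MASK128


# A register whose upper half holds stale data from an earlier 256-bit operation.
stale = (0xDEADBEEF << 128) | 0x1234

after_legacy = write_xmm_legacy(stale, 0x42)
after_vex = write_xmm_vex(stale, 0x42)

print(hex(after_legacy >> 128))  # 0xdeadbeef: an explicit zeroing move is still required
print(hex(after_vex >> 128))     # 0x0: an explicit zeroing move would be redundant
```

Only producers that behave like the second helper are safe candidates for skipping the move, which is why the patch whitelists opcodes rather than assuming every XMM write qualifies.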

So for now I'm starting by explicitly allowing only VPMADDWD, because we emit zeros in combineLoopMAddPattern, and that currently places an extra instruction in the reduction loop.

I'd like to allow PSADBW as well after D37453, but that's currently blocked by a bitcast. We either need to peek through bitcasts or canonicalize insert_subvectors with zeros to remove bitcasts on the value being inserted.
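To make the bitcast problem concrete, here is a minimal sketch of a whitelist check over a toy node type. The `Node` class, the string opcode names, and the `peek_through_bitcasts` flag are all hypothetical stand-ins for this illustration. The patch itself only compares the node's opcode against X86ISD::VPMADDWD and does not look through bitcasts.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class Node:
    """Toy stand-in for a DAG node; one operand is enough for this sketch."""
    opcode: str
    operand: Optional["Node"] = None


def zeroes_upper(node: Node, whitelist: FrozenSet[str],
                 peek_through_bitcasts: bool = False) -> bool:
    """Return True if the node's producer is whitelisted as zeroing the upper elements."""
    if peek_through_bitcasts:
        # Walk past any chain of bitcasts to reach the real producer.
        while node.opcode == "BITCAST" and node.operand is not None:
            node = node.operand
    return node.opcode in whitelist


patch_whitelist = frozenset({"VPMADDWD"})          # what this patch allows
extended = patch_whitelist | {"PSADBW"}            # hypothetical future extension

madd = Node("VPMADDWD")
sad_behind_cast = Node("BITCAST", Node("PSADBW"))

print(zeroes_upper(madd, patch_whitelist))                                   # True
print(zeroes_upper(sad_behind_cast, extended))                               # False
print(zeroes_upper(sad_behind_cast, extended, peek_through_bitcasts=True))   # True
```

The second call shows why adding PSADBW to the whitelist alone would not help: the pattern sees the bitcast, not the producer behind it.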

Longer term, we should probably have a cleanup pass that removes superfluous zeroing moves even when the producer is in another basic block, which is something these isel tricks can't do. See PR32544.
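As a rough idea of what such a cleanup could look like, the sketch below removes redundant self-moves from a straight-line instruction list. It is an assumption-laden toy, not the pass proposed in PR32544: instructions are hypothetical `(mnemonic, sources, dest)` tuples, the whitelist mirrors the single opcode allowed by this patch, and it does not handle the cross-basic-block case, which would need dataflow analysis.

```python
# Assumption: mirrors the patch's single-opcode whitelist.
ZEROING_MNEMONICS = {"vpmaddwd"}


def split_reg(name):
    """Split a register name such as 'ymm1' into ('ymm', '1')."""
    return name[:3], name[3:]


def remove_redundant_zeroing_moves(instrs):
    """Drop 'vmovdqa %r, %r' when the last write to r came from a whitelisted producer."""
    zeroed_above = {}  # register number -> width ('xmm'/'ymm') of the zeroing write
    kept = []
    for mnemonic, sources, dest in instrs:
        width, number = split_reg(dest)
        is_self_move = mnemonic == "vmovdqa" and sources == (dest,)
        if is_self_move and zeroed_above.get(number) == width:
            continue  # upper bits are already zero, so the move does nothing
        kept.append((mnemonic, sources, dest))
        if mnemonic in ZEROING_MNEMONICS:
            zeroed_above[number] = width
        else:
            # Conservatively forget what we knew about this register.
            zeroed_above.pop(number, None)
    return kept


loop_body = [
    ("vmovdqu", ("mem",), "xmm1"),
    ("vpmaddwd", ("mem", "xmm1"), "xmm1"),
    ("vmovdqa", ("xmm1",), "xmm1"),
    ("vpaddd", ("ymm0", "ymm1"), "ymm0"),
]
cleaned = remove_redundant_zeroing_moves(loop_body)

not_whitelisted = [
    ("vmovdqu", ("mem",), "xmm1"),
    ("vmovdqa", ("xmm1",), "xmm1"),
]
untouched = remove_redundant_zeroing_moves(not_whitelisted)
```

Here `cleaned` keeps three instructions and drops the `vmovdqa`, matching the change to the reduction loop in madd.ll, while `untouched` keeps both instructions because `vmovdqu` is not on the whitelist.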

Diff Detail

Repository: rL LLVM

Event Timeline

craig.topper created this revision. Sep 8 2017, 4:40 PM

Ping

LGTM

This revision is now accepted and ready to land. Sep 15 2017, 1:36 AM

Closed by commit rL313365: [X86] Add isel pattern infrastructure to begin recognizing when we're inserting… (authored by ctopper). Sep 15 2017, 10:10 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                Size
llvm/trunk/lib/Target/X86/X86InstrVecCompiler.td    59 lines
llvm/trunk/test/CodeGen/X86/madd.ll                 3 lines

Diff 115419

llvm/trunk/lib/Target/X86/X86InstrVecCompiler.td

   defm : subvector_zero_lowering<"DQAY", VR256, v8i64, v4i64, v16i32,
                                  loadv4i64, sub_ymm>;
   defm : subvector_zero_lowering<"DQAY", VR256, v16i32, v8i32, v16i32,
                                  loadv4i64, sub_ymm>;
   defm : subvector_zero_lowering<"DQAY", VR256, v32i16, v16i16, v16i32,
                                  loadv4i64, sub_ymm>;
   defm : subvector_zero_lowering<"DQAY", VR256, v64i8, v32i8, v16i32,
                                  loadv4i64, sub_ymm>;
 }
+
+// List of opcodes that are guaranteed to zero the upper elements of vector regs.
+// TODO: Ideally this would be a blacklist instead of a whitelist. But SHA
+// intrinsics and some MMX->XMM move instructions that aren't VEX encoded make
+// this difficult. So starting with a couple opcodes used by reduction loops
+// where we explicitly insert zeros.
+class veczeroupper<ValueType vt, RegisterClass RC> :
+  PatLeaf<(vt RC:$src), [{
+    return N->getOpcode() == X86ISD::VPMADDWD;
+  }]>;
+
+def zeroupperv2f64 : veczeroupper<v2f64, VR128>;
+def zeroupperv4f32 : veczeroupper<v4f32, VR128>;
+def zeroupperv2i64 : veczeroupper<v2i64, VR128>;
+def zeroupperv4i32 : veczeroupper<v4i32, VR128>;
+def zeroupperv8i16 : veczeroupper<v8i16, VR128>;
+def zeroupperv16i8 : veczeroupper<v16i8, VR128>;
+
+def zeroupperv4f64 : veczeroupper<v4f64, VR256>;
+def zeroupperv8f32 : veczeroupper<v8f32, VR256>;
+def zeroupperv4i64 : veczeroupper<v4i64, VR256>;
+def zeroupperv8i32 : veczeroupper<v8i32, VR256>;
+def zeroupperv16i16 : veczeroupper<v16i16, VR256>;
+def zeroupperv32i8 : veczeroupper<v32i8, VR256>;
+
+
+// If we can guarantee the upper elements have already been zeroed we can elide
+// an explicit zeroing.
+multiclass subvector_zero_ellision<RegisterClass RC, ValueType DstTy,
+                                   ValueType SrcTy, ValueType ZeroTy,
+                                   SubRegIndex SubIdx, PatLeaf Zeroupper> {
+  def : Pat<(DstTy (insert_subvector (bitconvert (ZeroTy immAllZerosV)),
+                                     Zeroupper:$src, (iPTR 0))),
+            (SUBREG_TO_REG (i64 0), RC:$src, SubIdx)>;
+}
+
+// 128->256
+defm: subvector_zero_ellision<VR128, v4f64, v2f64, v8i32, sub_xmm, zeroupperv2f64>;
+defm: subvector_zero_ellision<VR128, v8f32, v4f32, v8i32, sub_xmm, zeroupperv4f32>;
+defm: subvector_zero_ellision<VR128, v4i64, v2i64, v8i32, sub_xmm, zeroupperv2i64>;
+defm: subvector_zero_ellision<VR128, v8i32, v4i32, v8i32, sub_xmm, zeroupperv4i32>;
+defm: subvector_zero_ellision<VR128, v16i16, v8i16, v8i32, sub_xmm, zeroupperv8i16>;
+defm: subvector_zero_ellision<VR128, v32i8, v16i8, v8i32, sub_xmm, zeroupperv16i8>;
+
+// 128->512
+defm: subvector_zero_ellision<VR128, v8f64, v2f64, v16i32, sub_xmm, zeroupperv2f64>;
+defm: subvector_zero_ellision<VR128, v16f32, v4f32, v16i32, sub_xmm, zeroupperv4f32>;
+defm: subvector_zero_ellision<VR128, v8i64, v2i64, v16i32, sub_xmm, zeroupperv2i64>;
+defm: subvector_zero_ellision<VR128, v16i32, v4i32, v16i32, sub_xmm, zeroupperv4i32>;
+defm: subvector_zero_ellision<VR128, v32i16, v8i16, v16i32, sub_xmm, zeroupperv8i16>;
+defm: subvector_zero_ellision<VR128, v64i8, v16i8, v16i32, sub_xmm, zeroupperv16i8>;
+
+// 256->512
+defm: subvector_zero_ellision<VR256, v8f64, v4f64, v16i32, sub_ymm, zeroupperv4f64>;
+defm: subvector_zero_ellision<VR256, v16f32, v8f32, v16i32, sub_ymm, zeroupperv8f32>;
+defm: subvector_zero_ellision<VR256, v8i64, v4i64, v16i32, sub_ymm, zeroupperv4i64>;
+defm: subvector_zero_ellision<VR256, v16i32, v8i32, v16i32, sub_ymm, zeroupperv8i32>;
+defm: subvector_zero_ellision<VR256, v32i16, v16i16, v16i32, sub_ymm, zeroupperv16i16>;
+defm: subvector_zero_ellision<VR256, v64i8, v32i8, v16i32, sub_ymm, zeroupperv32i8>;

llvm/trunk/test/CodeGen/X86/madd.ll

 ; AVX2-NEXT: movl %edx, %eax
 ; AVX2-NEXT: vpxor %xmm0, %xmm0, %xmm0
 ; AVX2-NEXT: xorl %ecx, %ecx
 ; AVX2-NEXT: .p2align 4, 0x90
 ; AVX2-NEXT: .LBB0_1: # %vector.body
 ; AVX2-NEXT: # =>This Inner Loop Header: Depth=1
 ; AVX2-NEXT: vmovdqu (%rsi,%rcx,2), %xmm1
 ; AVX2-NEXT: vpmaddwd (%rdi,%rcx,2), %xmm1, %xmm1
-; AVX2-NEXT: vmovdqa %xmm1, %xmm1
 ; AVX2-NEXT: vpaddd %ymm0, %ymm1, %ymm0
 ; AVX2-NEXT: addq $8, %rcx
 ; AVX2-NEXT: cmpq %rcx, %rax
 ; AVX2-NEXT: jne .LBB0_1
 ; AVX2-NEXT: # BB#2: # %middle.block
 ; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1
 ; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0
 ; AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
 ; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0
 ; AVX2-NEXT: vphaddd %ymm0, %ymm0, %ymm0
 ; AVX2-NEXT: vmovd %xmm0, %eax
 ; AVX2-NEXT: vzeroupper
 ; AVX2-NEXT: retq
 ;
 ; AVX512-LABEL: _Z10test_shortPsS_i:
 ; AVX512: # BB#0: # %entry
 ; AVX512-NEXT: movl %edx, %eax
 ; AVX512-NEXT: vpxor %xmm0, %xmm0, %xmm0
 ; AVX512-NEXT: xorl %ecx, %ecx
 ; AVX512-NEXT: .p2align 4, 0x90
 ; AVX512-NEXT: .LBB0_1: # %vector.body
 ; AVX512-NEXT: # =>This Inner Loop Header: Depth=1
 ; AVX512-NEXT: vmovdqu (%rsi,%rcx,2), %xmm1
 ; AVX512-NEXT: vpmaddwd (%rdi,%rcx,2), %xmm1, %xmm1
-; AVX512-NEXT: vmovdqa %xmm1, %xmm1
 ; AVX512-NEXT: vpaddd %ymm0, %ymm1, %ymm0
 ; AVX512-NEXT: addq $8, %rcx
 ; AVX512-NEXT: cmpq %rcx, %rax
 ; AVX512-NEXT: jne .LBB0_1
 ; AVX512-NEXT: # BB#2: # %middle.block
 ; AVX512-NEXT: vextracti128 $1, %ymm0, %xmm1
 ; AVX512-NEXT: vpaddd %ymm1, %ymm0, %ymm0
 ; AVX512-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]

 ; AVX512-NEXT: vpxor %xmm0, %xmm0, %xmm0
 ; AVX512-NEXT: xorl %ecx, %ecx
 ; AVX512-NEXT: .p2align 4, 0x90
 ; AVX512-NEXT: .LBB2_1: # %vector.body
 ; AVX512-NEXT: # =>This Inner Loop Header: Depth=1
 ; AVX512-NEXT: vpmovsxbw (%rdi,%rcx), %ymm1
 ; AVX512-NEXT: vpmovsxbw (%rsi,%rcx), %ymm2
 ; AVX512-NEXT: vpmaddwd %ymm1, %ymm2, %ymm1
-; AVX512-NEXT: vmovdqa %ymm1, %ymm1
 ; AVX512-NEXT: vpaddd %zmm0, %zmm1, %zmm0
 ; AVX512-NEXT: addq $16, %rcx
 ; AVX512-NEXT: cmpq %rcx, %rax
 ; AVX512-NEXT: jne .LBB2_1
 ; AVX512-NEXT: # BB#2: # %middle.block
 ; AVX512-NEXT: vextracti64x4 $1, %zmm0, %ymm1
 ; AVX512-NEXT: vpaddd %zmm1, %zmm0, %zmm0
 ; AVX512-NEXT: vextracti128 $1, %ymm0, %xmm1