Differential D14151
[X86][AVX] Fix lowering of X86ISD::VZEXT_MOVL for 128-bit -> 256-bit extension
Closed, Public · Authored by RKSimon on Oct 28 2015, 9:04 AM
Summary

The lowering patterns for X86ISD::VZEXT_MOVL for 128-bit to 256-bit vectors were just copying the lower xmm instead of actually masking off the first scalar using an xmm blend (we make use of the implicit zeroing of the upper ymm). Fix for PR25320.
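To make the failure mode concrete, here is a minimal IR sketch of the pattern in question; the function name and exact types are illustrative, not taken from the committed test.

; Keep lane 0 of a <4 x i32> source, zero the other seven lanes of the
; <8 x i32> result.
define <8 x i32> @vzmovl_v4i32_v8i32(<4 x i32> %x) {
  ; Widen the 128-bit source to 256 bits with an undef upper half.
  %w = shufflevector <4 x i32> %x, <4 x i32> undef,
       <8 x i32> <i32 0, i32 1, i32 2, i32 3,
                  i32 undef, i32 undef, i32 undef, i32 undef>
  ; Select lane 0 from %w and zero everywhere else: the 256-bit
  ; X86ISD::VZEXT_MOVL pattern.
  %r = shufflevector <8 x i32> %w, <8 x i32> zeroinitializer,
       <8 x i32> <i32 0, i32 9, i32 10, i32 11,
                  i32 12, i32 13, i32 14, i32 15>
  ret <8 x i32> %r
}

The removed patterns lowered this to a bare xmm-to-xmm copy, which (via VEX implicit zeroing) clears only bits 255:128 and leaves lanes 1-3 of the source intact in the result; a correct lowering also has to blend those lanes with zero in the xmm half (e.g. vxorps + vblendps) before relying on the implicit zeroing of the upper half.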
Event Timeline

andreadb: Hi Simon,

I don't think these new patterns are needed. We already have SSE4.1/AVX patterns to select a blend from a vzmovl node. If your goal is just to fix the miscompile, then the minimal fix consists of removing the offending patterns between lines 939 and 952.

The poor codegen reported by Jeroen is caused by the lack of smart x86 combine rules for 256-bit shuffles in the function 'PerformShuffleCombine256'. That function implements a very simple rule for when there is a shuffle between two concat_vectors nodes. Ideally we should extend it and add rules for the case where the second operand is a build_vector of all zeroes.

Currently we check whether a shuffle takes two concat_vectors as input, and we try to fold it to a zero-extending load or to an insert of a 128-bit vector into a zero vector.

I think we are just missing rules for the case where we are inserting a 64/32-bit quantity into a zero vector.
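As a sketch of the case andreadb describes (names and types here are illustrative, not Jeroen's actual test case), consider a 64-bit quantity, the low two floats of a 128-bit source, inserted into an all-zero 256-bit vector:

; A 64-bit quantity (lanes 0-1 of %x) inserted into a zero <8 x float>.
define <8 x float> @insert_64bit_into_zero(<4 x float> %x) {
  ; Widen the source to 256 bits with undef lanes beyond the low two.
  %w = shufflevector <4 x float> %x, <4 x float> undef,
       <8 x i32> <i32 0, i32 1, i32 undef, i32 undef,
                  i32 undef, i32 undef, i32 undef, i32 undef>
  ; Shuffle against a zero vector: only lanes 0-1 of %w survive.
  %r = shufflevector <8 x float> %w, <8 x float> zeroinitializer,
       <8 x i32> <i32 0, i32 1, i32 10, i32 11,
                  i32 12, i32 13, i32 14, i32 15>
  ret <8 x float> %r
}

With such a rule this could presumably become something like a single vmovq, since vmovq zeroes bits 127:64 and the VEX encoding implicitly zeroes bits 255:128.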
RKSimon: I'll update this patch to be just the fix, with just the removed lines (and the altered tests), and then prepare a second patch that improves the code quality. We are missing a potentially big performance gain for cases where the upper half of a register can be implicitly zeroed by VEX-encoded instructions, especially on 128-bit ALUs such as Jaguar and Sandy Bridge.
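An illustrative example of that gain (the function name is mine, and the expected codegen is an assumption rather than output from this patch): when the 128-bit value being widened is itself produced by a VEX-encoded instruction, that instruction already zeroes bits 255:128 of the destination ymm register, so widening with a zero upper half should cost no extra instructions.

; The fadd below becomes a VEX-encoded vaddps when AVX is enabled; any
; VEX instruction that writes an xmm register also zeroes bits 255:128
; of the corresponding ymm register, so widening %s with a zero upper
; half should be free.
define <8 x float> @widen_after_add(<4 x float> %a, <4 x float> %b) {
  %s = fadd <4 x float> %a, %b
  %r = shufflevector <4 x float> %s, <4 x float> zeroinitializer,
       <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x float> %r
}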
RKSimon: Simplified patch to just the bugfix; improved codegen to follow in a future patch.

andreadb: Thanks Simon for working on this! The patch LGTM. Cheers,

This revision is now accepted and ready to land. Nov 19 2015, 3:17 AM

Closed by commit rL253561: [X86][AVX] Fix lowering of X86ISD::VZEXT_MOVL for 128-bit -> 256-bit extension (authored by RKSimon). Nov 19 2015, 4:21 AM

This revision was automatically updated to reflect the committed changes.
Revision Contents
Diff 40565
lib/Target/X86/X86InstrSSE.td
test/CodeGen/X86/vec_extract-avx.ll