[X86][AVX] Fix lowering of X86ISD::VZEXT_MOVL for 128-bit -> 256-bit extension
ClosedPublic

Authored by RKSimon on Oct 28 2015, 9:04 AM.

Details

Summary

The lowering patterns for X86ISD::VZEXT_MOVL for 128-bit to 256-bit vectors were just copying the lower xmm instead of actually zeroing everything but the first scalar with an xmm blend (we make use of the implicit zeroing of the upper ymm).

Fix for PR25320.
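To make the miscompile concrete, here is a minimal LLVM IR sketch of the affected shuffle (the function name is hypothetical, not taken from the patch); lane 0 must come from the source and every other lane must be zero:

; A minimal sketch of the miscompiled pattern (function name hypothetical).
define <4 x i64> @vzmovl_4i64(<4 x i64> %a) {
  ; Lane 0 from %a, lanes 1-3 from the zero vector: an X86ISD::VZEXT_MOVL.
  %v = shufflevector <4 x i64> %a, <4 x i64> zeroinitializer,
                     <4 x i32> <i32 0, i32 5, i32 6, i32 7>
  ret <4 x i64> %v
}
; The buggy patterns emitted a plain copy of the lower xmm: the implicit
; VEX zeroing clears lanes 2-3 (the upper ymm half) but leaves lane 1
; holding stale data from %a. A correct lowering must also clear lane 1,
; e.g. with a blend against zero or a vmovq %xmm0, %xmm0.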

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 38662. Oct 28 2015, 9:04 AM
RKSimon retitled this revision from to [X86][AVX] Fix lowering of X86ISD::VZEXT_MOVL for 128-bit -> 256-bit extension.
RKSimon updated this object.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: llvm-commits.
jketema added inline comments.
test/CodeGen/X86/vec_extract-avx.ll
112 (On Diff #38662)

I think there still might be something wrong here. If I understand this correctly, only the lower 48 bytes will now be copied into xmm0.

RKSimon updated this revision to Diff 38677. Oct 28 2015, 11:53 AM
RKSimon marked an inline comment as done.
RKSimon added inline comments.
test/CodeGen/X86/vec_extract-avx.ll
112 (On Diff #38677)

Nice catch!

RKSimon marked an inline comment as done. Oct 29 2015, 7:05 AM

Jeroen has confirmed that this fixes PR24935.

andreadb edited edge metadata. Nov 5 2015, 8:25 AM

Hi Simon,

lib/Target/X86/X86InstrSSE.td
7207–7227 (On Diff #38677)

I don't think these new patterns are needed. We already have sse4.1/avx patterns to select a blend from a vzmovl node.

If your goal is just to fix the miscompile, then the minimal fix is to remove the offending patterns between lines 939 and 952.

The poor codegen reported by Jeroen is caused by the lack of smart x86 combine rules for 256-bit shuffles in the function 'PerformShuffleCombine256'. That function implements a very simple rule for the case where a shuffle takes two concat_vector nodes as inputs. Ideally we should extend it with rules for the case where the second operand is a build_vector of all zeroes.

Currently we check whether a shuffle takes two concat_vectors as input, and we try to fold it to a zero-extending load or an insert of a 128-bit vector into a zero vector.
I think we are just missing rules for the case where a 64-bit or 32-bit quantity is inserted into a zero vector, as sketched below.
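As an illustration of the missing case (a sketch under assumed names, not a test from the patch), inserting the low 64 bits of a vector into a 256-bit zero vector:

; A sketch of the missing case (hypothetical function name): insert a
; 64-bit quantity (two i32 lanes) into a 256-bit zero vector.
define <8 x i32> @vzmovl_low64_8i32(<8 x i32> %x) {
  %v = shufflevector <8 x i32> %x, <8 x i32> zeroinitializer,
                     <8 x i32> <i32 0, i32 1, i32 8, i32 9,
                                i32 10, i32 11, i32 12, i32 13>
  ret <8 x i32> %v
}
; Ideally this folds to a single vmovq %xmm0, %xmm0: the move zero-extends
; the low 64 bits within the xmm, and the VEX encoding implicitly zeroes
; the upper ymm half.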

I'll update this patch to be purely the bugfix, with just the removed lines (and the altered tests), and then prepare a second patch that improves the code quality. We are missing a potentially big perf gain for cases where the upper half of a register can be implicitly zeroed with VEX-encoded instructions, especially on CPUs with 128-bit ALUs such as Jaguar and Sandy Bridge.

lib/Target/X86/X86InstrSSE.td
7207–7227 (On Diff #38677)

I can confirm that just removing lines 939 to 952 fixes the problem. It does, however, leave AVX1 targets with a lot of integer/float domain-crossing stalls when dealing with 256-bit vectors.
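For example (a hedged sketch with made-up function names, not output from this patch), a 256-bit integer blend against zero on AVX1 has to go through float-domain instructions, since vpblendd is AVX2-only:

; Hypothetical illustration of the AVX1 domain-crossing concern.
define <8 x i32> @zero_odd_lanes(<8 x i32> %a, <8 x i32> %b) {
  %sum = add <8 x i32> %a, %b            ; integer-domain ops (vpaddd pair)
  %v = shufflevector <8 x i32> %sum, <8 x i32> zeroinitializer,
                     <8 x i32> <i32 0, i32 8, i32 2, i32 10,
                                i32 4, i32 12, i32 6, i32 14>
  ret <8 x i32> %v                       ; blend with zero in the odd lanes
}
; AVX1 has no 256-bit integer blend, so the integer result is expected to
; pass through float-domain instructions such as vblendps or vandps, which
; can incur bypass-delay (domain-crossing) stalls on some CPUs.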

RKSimon updated this revision to Diff 40565. Nov 18 2015, 3:00 PM
RKSimon edited edge metadata.

Simplified patch to just the bugfix; improved codegen to follow in a future patch.

andreadb accepted this revision. Nov 19 2015, 3:17 AM
andreadb edited edge metadata.

Thanks Simon for working on this!

The patch LGTM.

Cheers,
Andrea

This revision is now accepted and ready to land. Nov 19 2015, 3:17 AM
This revision was automatically updated to reflect the committed changes.