This is an archive of the discontinued LLVM Phabricator instance.

Redundant vmov instruction generated with vcvtph2ps
ClosedPublic

Authored by rob.lougher on Jan 11 2016, 7:50 AM.


Details

Reviewers

qcolombet
RKSimon

Commits

rG6abd69a60b74: The isel pattern that selects the memory-register form of VCVTPH2PS (64 to 128…
rL257470: The isel pattern that selects the memory-register form of VCVTPH2PS

Summary

Since revision 248784 the following code generates a redundant vmov:

__m128 test1(__m128i const *src) {
  return _mm_cvtph_ps(_mm_loadl_epi64(src));
}

Output:

vmovq (%rdi), %xmm0 # xmm0 = mem[0],zero
vcvtph2ps %xmm0, %xmm0
retq

The regression was caused by a change made in revision 247504, which teaches the instruction combiner that only the lower 64 bits of the source operand of the 128-bit vcvtph2ps are used. The IR for the above code is:

define <4 x float> @_Z4testPKDv2_x(<2 x i64>* nocapture readonly %src) #0 {
entry:
  %__u.i = getelementptr inbounds <2 x i64>, <2 x i64>* %src, i64 0, i64 0
  %0 = load i64, i64* %__u.i, align 1, !tbaa !1
  %vecinit.i = insertelement <2 x i64> undef, i64 %0, i32 0
  %vecinit1.i = insertelement <2 x i64> %vecinit.i, i64 0, i32 1
  %1 = bitcast <2 x i64> %vecinit1.i to <8 x i16>
  %2 = call <4 x float> @llvm.x86.vcvtph2ps.128(<8 x i16> %1) #2
  ret <4 x float> %2
}

After r247504 the instruction combiner can see that the second insertelement is redundant and it is deleted. While this is correct, it interferes with the isel patterns that select the memory-register form of VCVTPH2PS. Previously, the load followed by the two insertelements would have been recognized by the pattern fragment 'vzmovl_v2i64'. This would then have selected a VCVTPH2PSrm via the following pattern:

// Pattern match vcvtph2ps of a scalar i64 load.
def : Pat<(int_x86_vcvtph2ps_128 (vzmovl_v2i64 addr:$src)),
          (VCVTPH2PSrm addr:$src)>;

However, after r247504 we only have a single insertelement, so vzmovl_v2i64 no longer matches. The problem only shows up from r248784 onwards because, before that revision, the combiner was unable to look through the bitcast operation.
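The reasoning above rests on the 128-bit vcvtph2ps reading only the low 64 bits of its source, which is why zeroing the upper half is redundant. A minimal scalar sketch of that behaviour follows. It is a portable C model, not the instruction or the F16C intrinsic itself, and the names half_to_float, model_cvtph2ps_128 and check_upper_lanes_ignored are hypothetical helpers introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Convert one IEEE 754 binary16 bit pattern to a binary32 value.
 * Scalar model for illustration only; assumes float is IEEE binary32. */
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    float f;

    if (exp == 0) {
        if (man == 0) {
            bits = sign;                              /* signed zero */
        } else {
            /* Subnormal half: shift the mantissa until the implicit
             * leading bit appears, adjusting the exponent to match. */
            uint32_t shift = 0;
            while ((man & 0x400u) == 0) {
                man <<= 1;
                shift++;
            }
            man &= 0x3FFu;
            bits = sign | ((113u - shift) << 23) | (man << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);      /* infinity or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    }
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Model of the 128-bit conversion: the source has eight 16-bit lanes,
 * but only lanes 0..3 (the low 64 bits) are ever read. */
static void model_cvtph2ps_128(const uint16_t src[8], float dst[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = half_to_float(src[i]);
}

/* The zero-extended source (two insertelements) and a source with
 * arbitrary upper lanes (one insertelement, upper half undef) must
 * convert to identical results. */
static void check_upper_lanes_ignored(void)
{
    const uint16_t zeroed[8] = {0x3C00, 0xC000, 0x3800, 0x0001, 0, 0, 0, 0};
    const uint16_t undef[8]  = {0x3C00, 0xC000, 0x3800, 0x0001,
                                0xFFFF, 0x1234, 0x7C00, 0x0001};
    float a[4], b[4];

    model_cvtph2ps_128(zeroed, a);
    model_cvtph2ps_128(undef, b);
    assert(memcmp(a, b, sizeof a) == 0);
}
```

Calling check_upper_lanes_ignored() passes because the upper four lanes never reach the conversion loop. That is the property the combiner exploits when it deletes the second insertelement, and the reason the memory form of the instruction only needs a 64-bit load.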

Diff Detail

Repository: rL LLVM

Event Timeline

rob.lougher updated this revision to Diff 44509. Jan 11 2016, 7:50 AM

rob.lougher retitled this revision from to Redundant vmov instruction generated with vcvtph2ps.

rob.lougher updated this object.

rob.lougher added reviewers: RKSimon, qcolombet.

rob.lougher added a subscriber: llvm-commits.

andreadb added a subscriber: andreadb. Jan 11 2016, 7:51 AM

Hi,

LGTM.

Should we have something similar for the 256bit variant in a following patch?

Thanks,
-Quentin

This revision is now accepted and ready to land. Jan 11 2016, 2:06 PM

Hi Quentin,

Thanks for the review! I've done a little investigation into the 256-bit variant and I don't think we have an issue there. In this case we're converting from a full XMM register to a YMM register, so we don't have the zero-extension that we have in the 64-bit case.

Thanks,
Rob.

Closed by commit rL257470: The isel pattern that selects the memory-register form of VCVTPH2PS (authored by rlougher). Jan 12 2016, 3:52 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                              Size
llvm/trunk/lib/Target/X86/X86InstrSSE.td          3 lines
llvm/trunk/test/CodeGen/X86/f16c-intrinsics.ll    12 lines

Diff 44617

llvm/trunk/lib/Target/X86/X86InstrSSE.td


  let Predicates = [HasF16C] in {
    defm VCVTPS2PH : f16c_ps2ph<VR128, f64mem, int_x86_vcvtps2ph_128>;
    defm VCVTPS2PHY : f16c_ps2ph<VR256, f128mem, int_x86_vcvtps2ph_256>, VEX_L;

    // Pattern match vcvtph2ps of a scalar i64 load.
    def : Pat<(int_x86_vcvtph2ps_128 (vzmovl_v2i64 addr:$src)),
              (VCVTPH2PSrm addr:$src)>;
    def : Pat<(int_x86_vcvtph2ps_128 (vzload_v2i64 addr:$src)),
              (VCVTPH2PSrm addr:$src)>;
+   def : Pat<(int_x86_vcvtph2ps_128 (bitconvert
+               (v2i64 (scalar_to_vector (loadi64 addr:$src))))),
+             (VCVTPH2PSrm addr:$src)>;

    def : Pat<(store (f64 (extractelt (bc_v2f64 (v8i16
                    (int_x86_vcvtps2ph_128 VR128:$src1, i32:$src2))), (iPTR 0))),
                     addr:$dst),
              (VCVTPS2PHmr addr:$dst, VR128:$src1, imm:$src2)>;
    def : Pat<(store (i64 (extractelt (bc_v2i64 (v8i16
                    (int_x86_vcvtps2ph_128 VR128:$src1, i32:$src2))), (iPTR 0))),
                     addr:$dst),

llvm/trunk/test/CodeGen/X86/f16c-intrinsics.ll

  ; CHECK: vcvtph2ps (%
    %load = load i64, i64* %ptr
    %ins1 = insertelement <2 x i64> undef, i64 %load, i32 0
    %ins2 = insertelement <2 x i64> %ins1, i64 0, i32 1
    %bc = bitcast <2 x i64> %ins2 to <8 x i16>
    %res = tail call <4 x float> @llvm.x86.vcvtph2ps.128(<8 x i16> %bc) #2
    ret <4 x float> %res
  }

+ define <4 x float> @test_x86_vcvtps2ph_128_scalar2(i64* %ptr) {
+ ; CHECK-LABEL: test_x86_vcvtps2ph_128_scalar2:
+ ; CHECK-NOT: vmov
+ ; CHECK: vcvtph2ps (%
+
+   %load = load i64, i64* %ptr
+   %ins = insertelement <2 x i64> undef, i64 %load, i32 0
+   %bc = bitcast <2 x i64> %ins to <8 x i16>
+   %res = tail call <4 x float> @llvm.x86.vcvtph2ps.128(<8 x i16> %bc)
+   ret <4 x float> %res
+ }
+
  define void @test_x86_vcvtps2ph_256_m(<8 x i16>* nocapture %d, <8 x float> %a) nounwind {
  entry:
  ; CHECK-LABEL: test_x86_vcvtps2ph_256_m:
  ; CHECK-NOT: vmov
  ; CHECK: vcvtps2ph $3, %ymm0, (%
    %0 = tail call <8 x i16> @llvm.x86.vcvtps2ph.256(<8 x float> %a, i32 3)
    store <8 x i16> %0, <8 x i16>* %d, align 16
    ret void