The SSE versions of these instructions perform a partial update of the output register.
The AVX and AVX512 versions of the same instructions avoid the partial register update by taking an additional operand, whose value is copied to the upper part of the output register.
No partial register update => the folding should be performed normally (not only when optimizing for size).
Reviewers: RKSimon, zvi, craig.topper, igorb
lib/Target/X86/X86InstrSSE.td
3457 | UseAVX and HasAVX are not the same predicates.
lib/Target/X86/X86InstrAVX512.td
7189 | This one should not be removed; only remove the OptForSize predicate.
7209 | This one should not be removed; only remove the predicate. Please add a test case.
7214 | This one should not be removed; only remove the predicate. Please add a test case.
lib/Target/X86/X86InstrInfo.cpp
1928 | TB_NO_REVERSE is used when the load size of the memory form is smaller than the register size of the register form. In that case unfolding is not legal, since it would produce a load of the full register size.
1932 | Do you mean these changes should be in a separate patch?
lib/Target/X86/X86InstrInfo.cpp
1928 | I may be reading the avx512_fp14_s definitions incorrectly, but it appears to be using the scalar f32x_info type, not the packed equivalents. @craig.topper / @igorb, please can you confirm?
1932 | OK - but please add them to the stack-folding-*.ll tests as well.
lib/Target/X86/X86InstrInfo.cpp
1928 | That's right, but if you follow the definitions further you can see that the "rr" version is defined with an xmm register operand, while the "rm" version is defined with a 32-bit memory operand (see the sketch below).
1932 | Already added (in this patch).
test/CodeGen/X86/stack-folding-fp-avx512.ll
758 | Please keep these in alphabetical order to make it easier to track missing instructions.
lib/Target/X86/X86InstrInfo.cpp
1928 | Can leave a TODO here. We really should fix that so that we have a separate VRCP14SSrr_Int that uses the XMM type, like we do for similar instructions.
lib/Target/X86/X86InstrSSE.td
1789 | This doesn't need a Requires. There's no pattern.
1855 | Not sure this even needs a Requires. There's no pattern defined.
1866 | Will this fit on the previous line now?
test/CodeGen/X86/avx-arith.ll
353 | This now has a register read dependency on xmm0, which I believe is what the OptForSize was originally protecting against. I know work has gone into ExeDepFix for UndefRegClearance. Do we believe that is sufficient to allow this folding now?
test/CodeGen/X86/avx-arith.ll
353 | The OptForSize was incorrectly copied from the SSE version multiclass, where the memory form instructions performed a partial register update (the optimization guide states that partial updates come with a penalty and should therefore be avoided).
test/CodeGen/X86/avx-arith.ll
353 | Yes, there was a read of xmm0 on the dart, but it was from the movss instruction, which wrote all bits. After this patch the xmm0 read depends on an unknown instruction.
test/CodeGen/X86/avx-arith.ll
353 | Oops. Autocorrect turned sqrt into dart.
test/CodeGen/X86/avx-arith.ll
353 | I see what you mean.
I'm not sure I believe the OptForSize was naively copied from SSE. Someone went to the trouble of using AVX instructions with the correct number of operands in this comment block that your patch removes. It demonstrates exactly the issue I was pointing out.
```
// We don't want to fold scalar loads into these instructions unless
// optimizing for size. This is because the folded instruction will have a
// partial register update, while the unfolded sequence will not, e.g.
//   vmovss mem, %xmm0
//   vrcpss %xmm0, %xmm0, %xmm0
// which has a clobber before the rcp, vs.
//   vrcpss mem, %xmm0, %xmm0
// TODO: In theory, we could fold the load, and avoid the stall caused by
// the partial register store, either in ExeDepFix or with smarter RA.
```
After consulting an architect about the general problem, this is the answer I got:
For the following sequence:
```
vmovss (%rax), %xmm0
vsqrtss %xmm0, %xmm0, %xmm0
```
memory folding should be avoided (to avoid generating a new read dependency).
But for the following sequence:
```
vmovss (%rax), %xmm1
vsqrtss %xmm1, %xmm0, %xmm1
```
the memory folded sequence gives better performance (the read dependency on %xmm0 is already there).
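In IR terms, the two cases might look like the following sketch (function names are illustrative, and llvm.sqrt.f32 is assumed to select vsqrtss):

```
declare float @llvm.sqrt.f32(float)

; Case 1: the load feeds the only live input. Folding it into vsqrtss
; would add a fresh read of the destination register's previous value,
; i.e. a new dependency that the unfolded sequence does not have.
define float @case1(float* %p) {
  %x = load float, float* %p
  %r = call float @llvm.sqrt.f32(float %x)
  ret float %r
}

; Case 2: the pass-through vector %upper is read either way, so folding
; the load adds no new dependency and saves the separate vmovss.
define <4 x float> @case2(<4 x float> %upper, float* %p) {
  %x = load float, float* %p
  %s = call float @llvm.sqrt.f32(float %x)
  %r = insertelement <4 x float> %upper, float %s, i32 0
  ret <4 x float> %r
}
```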
This applies to all AVX/AVX512 scalar instructions which accept an extra input operand and copy its upper part to the upper part of the output operand.
Adding OptForSize in this case disables the folding of these specific instructions (and only them) in all cases except when optimizing for size.
The ideal way of dealing with this would be:
- Distinguish between the two cases (listed above), and then decide whether or not to fold.
- Apply this to all AVX/AVX512 scalar instructions with this behavior.
Other instructions with the same behavior (like vscalefss and vreducess) do not have any folding patterns and are not included in the folding tables, which means folding is not allowed at all (we can improve that).
So what I suggest is not committing this patch (even though the OptForSize is there for the wrong reason), in order to avoid a performance degradation, and opening a bug on this issue.
Do you agree with that?
This one should not be removed; only remove the OptForSize predicate.
Please add a test case; I think llvm.sqrt.f32 can be used (please ensure the AVX512 instruction is selected).
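For reference, a minimal sketch of such a test, modeled on the existing stack-folding tests (the RUN line, clobber list, and CHECK pattern are assumptions, not taken from the patch):

```
; RUN: llc -mtriple=x86_64-unknown-unknown -mattr=+avx512f < %s | FileCheck %s

declare float @llvm.sqrt.f32(float)

define float @stack_fold_sqrtss(float %a0) {
; CHECK-LABEL: stack_fold_sqrtss
; CHECK: vsqrtss {{.*}}(%rsp), {{%xmm[0-9]+}}, {{%xmm[0-9]+}} {{.*}} Folded Reload
  ; Clobber all XMM registers so %a0 is spilled; the reload should then be
  ; folded into the memory form of vsqrtss.
  tail call void asm sideeffect "nop", "~{xmm0},~{xmm1},~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()
  %1 = call float @llvm.sqrt.f32(float %a0)
  ret float %1
}
```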