This is an archive of the discontinued LLVM Phabricator instance.

[X86] Fix non-intrinsic roundss/roundsd to not read the destination register
Closed, Public

Authored by mkuper on Dec 1 2016, 3:07 PM.

Details

Reviewers

zvi
craig.topper

Commits

rGe3036abcf9cd: [X86] Fix non-intrinsic roundss/roundsd to not read the destination register
rL288703: [X86] Fix non-intrinsic roundss/roundsd to not read the destination register

Summary

This changes the scalar, non-intrinsic, non-AVX roundss/roundsd instruction definitions so that they no longer read their destination register, which allows partial dependency breaking.

This fixes PR31143.

Diff Detail

Repository: rL LLVM

Event Timeline

mkuper updated this revision to Diff 79989. Dec 1 2016, 3:07 PM

mkuper retitled this revision from to [X86] Fix non-intrinsic roundss/roundsd to not read the destination register.

mkuper updated this object.

mkuper added reviewers: craig.topper, zvi.

mkuper added a subscriber: llvm-commits.

Should these rows in X86InstrInfo.cpp be moved to MemoryFoldTable1 from MemoryFoldTable2?

{ X86::ROUNDSDr,        X86::ROUNDSDm,      0 },
{ X86::ROUNDSSr,        X86::ROUNDSSm,      0 },

And should ROUNDSSr_Int and ROUNDSDr_Int be added to MemoryFoldTable2? And, related to that, should RSQRTSSr_Int, RCPSSr_Int, SQRTSDr_Int, and SQRTSSr_Int actually be in MemoryFoldTable2 instead of MemoryFoldTable1?

Never mind the question about RSQRTSSr_Int, RCPSSr_Int, SQRTSDr_Int, and SQRTSSr_Int. I didn't realize they don't take two arguments.

And ROUNDSSr_Int is already in the folding table; I don't know how I missed that earlier.

So only my question about moving ROUNDSSr and ROUNDSDr to the operand 1 table stands.
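The table question comes down to which operand index holds the register that gets replaced by a memory operand. Below is a minimal C++ sketch of a lookup keyed by operand index. The names `FoldTable`, `foldOpcode`, and `makeTablesAfterPatch` are hypothetical and are not LLVM's actual data structures; the sketch only assumes that operand 0 is the destination, so a unary instruction's single source is operand 1.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the memory-fold tables:
// tables[i] maps a register-form opcode to its memory-form opcode
// when operand i is the one being folded into a load.
using FoldTable = std::map<std::string, std::string>;

// Returns the memory-form opcode, or an empty string if operand
// opIndex of regOpcode cannot be folded.
std::string foldOpcode(const std::vector<FoldTable> &tables,
                       unsigned opIndex, const std::string &regOpcode) {
  assert(opIndex < tables.size() && "no table for this operand index");
  const FoldTable &table = tables[opIndex];
  FoldTable::const_iterator it = table.find(regOpcode);
  return it == table.end() ? std::string() : it->second;
}

// Table contents mirroring the state after this patch: the non-intrinsic
// forms have a single source (operand 1); the _Int forms keep two sources,
// and the foldable one is operand 2.
std::vector<FoldTable> makeTablesAfterPatch() {
  std::vector<FoldTable> tables(3);
  tables[1]["ROUNDSSr"] = "ROUNDSSm";
  tables[1]["ROUNDSDr"] = "ROUNDSDm";
  tables[2]["ROUNDSSr_Int"] = "ROUNDSSm_Int";
  tables[2]["ROUNDSDr_Int"] = "ROUNDSDm_Int";
  return tables;
}
```

With these tables, looking up `ROUNDSSr` at operand index 2 finds nothing, which is the situation the original test change ran into while the rows were still in the operand 2 table.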

RKSimon added a subscriber: RKSimon. Dec 2 2016, 12:50 AM

Yes, thanks a lot for catching this!

The test change I made for folding was wrong. I didn't notice that those are minsize tests, where we *should* be folding; the only reason we stopped folding was that the instructions were in the wrong table.

Fixed folding to use the right operand.

LGTM

This revision is now accepted and ready to land. Dec 3 2016, 11:05 AM

mkuper mentioned this in D27391: Fix for false dependency identification (pr31143) - reading undef values shouldn't be considers as a use. Dec 4 2016, 1:48 AM

As Michael has already noticed, I also have a solution for this false dependency bug (https://reviews.llvm.org/D27391).
What happens is that some instructions (like ROUNDSSr) read an undef value from one of their source operands.
The logic that searches for false dependencies decides that a dependency can be broken only if the instruction doesn't read the operand.
I think that a read of an undef value should not be considered a real read, and this is the fix in my patch. I believe this approach will catch more cases.

Thanks,
Marina

Worth adding an equivalent roundsd test to pr31143.ll?

In D27323#612920, @myatsina wrote:

As Michael has already noticed, I also have a solution for this false dependency bug (https://reviews.llvm.org/D27391).
What happens is that some instructions (like ROUNDSSr) read an undef value from one of their source operands.
The logic that searches for false dependencies decides that a dependency can be broken only if the instruction doesn't read the operand.
I think that a read of an undef value should not be considered a real read, and this is the fix in my patch. I believe this approach will catch more cases.

Thanks,
Marina

To reiterate what I wrote on the other patch - the only reason we end up with reading an undef value for the ROUNDSS/Drr is because we have a source operand tied with the destination, and then all patterns that match these instructions shove an IMPLICIT_DEF into that operand - making it, in practice, a dummy operand. This is in contrast to the other SSE instructions that have a false dependence on the high lanes, that simply don't have this operand.

I think we should be modeling all these instructions in the same way.
One option is to sync ROUNDSS/Drr with the rest, by removing the operand, like this patch does.
The other is to add the operand to all relevant instructions (making them binary instead of unary), and put something like D27391 in place.
Or am I missing a reason why ROUND should be modeled differently from the rest?

Regarding D27391 catching more cases - Marina, do you have anything specific in mind? Which instructions would that apply to?
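To make the two modelling options concrete, here is a minimal C++ sketch. It is a toy model, not LLVM's MachineInstr/MachineOperand API: `Operand`, `Instr`, `readsRegister`, and `canBreakDependency` are hypothetical names, and the register names are only illustrative. It contrasts the old ROUNDSSr shape (a dummy source tied to the destination and fed by IMPLICIT_DEF) with the shape this patch gives it, and shows how ignoring undef reads, the idea discussed for D27391, changes the answer for the old shape.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified operand model (not LLVM's MachineOperand).
struct Operand {
  std::string reg;
  bool isDef;    // operand is written by the instruction
  bool isUndef;  // operand is a use of an undefined value (e.g. IMPLICIT_DEF)
};

struct Instr {
  std::string opcode;
  std::vector<Operand> operands;  // assumption: operand 0 is the destination
};

// True if instr has a use of reg. With ignoreUndefUses set, uses flagged
// undef are not counted as reads.
bool readsRegister(const Instr &instr, const std::string &reg,
                   bool ignoreUndefUses) {
  for (const Operand &op : instr.operands) {
    if (op.isDef || op.reg != reg)
      continue;
    if (op.isUndef && ignoreUndefUses)
      continue;
    return true;
  }
  return false;
}

// A false dependency on the destination can be broken only if the
// instruction does not read the destination register.
bool canBreakDependency(const Instr &instr, bool ignoreUndefUses) {
  return !readsRegister(instr, instr.operands.front().reg, ignoreUndefUses);
}

// Old modelling: dst, dummy tied source (undef), real source.
Instr makeTiedRound() {
  Instr i;
  i.opcode = "ROUNDSSr";
  i.operands.push_back(Operand{"xmm1", true, false});
  i.operands.push_back(Operand{"xmm1", false, true});
  i.operands.push_back(Operand{"xmm0", false, false});
  return i;
}

// Modelling after this patch: dst, real source.
Instr makeUnaryRound() {
  Instr i;
  i.opcode = "ROUNDSSr";
  i.operands.push_back(Operand{"xmm1", true, false});
  i.operands.push_back(Operand{"xmm0", false, false});
  return i;
}
```

In this toy model the unary form is breakable without any special handling of undef, while the tied form is breakable only when undef uses are ignored. That matches the two routes described above: remove the operand, or keep it and teach the dependency check about undef reads.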

In D27323#612947, @RKSimon wrote:

Worth adding an equivalent roundsd test to pr31143.ll?

Sure, will do.

Updated with SD testcase.

Craig, does your LGTM still hold?

In D27323#612970, @mkuper wrote:

To reiterate what I wrote on the other patch - the only reason we end up with reading an undef value for the ROUNDSS/Drr is because we have a source operand tied with the destination, and then all patterns that match these instructions shove an IMPLICIT_DEF into that operand - making it, in practice, a dummy operand. This is in contrast to the other SSE instructions that have a false dependence on the high lanes, that simply don't have this operand.

I think we should be modeling all these instructions in the same way.
One option is to sync ROUNDSS/Drr with the rest, by removing the operand, like this patch does.
The other is to add the operand to all relevant instructions (making them binary instead of unary), and put something like D27391 in place.
Or am I missing a reason why ROUND should be modeled differently from the rest?

Regarding D27391 catching more cases - Marina, do you have anything specific in mind? Which instructions would that apply to?

You are right, ROUND shouldn't be modeled differently than the other cases, so your change should go in.
I found other instructions and intrinsics that look for uses and find uses of undef values:
%XMM0<def,tied1> = RCPSSm_Int %XMM0<undef,tied0>, %RDI<kill>, 1, %noreg, 0, %noreg
So there is a deeper problem here than just ROUND.

In D27323#613687, @myatsina wrote:

You are right, ROUND shouldn't be modeled differently than the other cases, so your change should go in.

Ok, great, then I'll commit this.

I found other instructions and intrinsics that look for uses and find uses of undef values:
%XMM0<def,tied1> = RCPSSm_Int %XMM0<undef,tied0>, %RDI<kill>, 1, %noreg, 0, %noreg
So there is a deeper problem here than just ROUND.

This looks extremely weird.
(And now I understand what Craig's email was about! :-) )

Closed by commit rL288703: [X86] Fix non-intrinsic roundss/roundsd to not read the destination register (authored by mkuper). Dec 5 2016, 1:07 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                        Size
llvm/trunk/lib/Target/X86/X86InstrInfo.cpp  4 lines
llvm/trunk/lib/Target/X86/X86InstrSSE.td    174 lines
llvm/trunk/test/CodeGen/X86/pr31143.ll      60 lines

Diff 80316

llvm/trunk/lib/Target/X86/X86InstrInfo.cpp

@@ 586 lines not shown @@ static const X86MemoryFoldTableEntry MemoryFoldTable1[] = {
 { X86::PSHUFHWri, X86::PSHUFHWmi, TB_ALIGN_16 },
 { X86::PSHUFLWri, X86::PSHUFLWmi, TB_ALIGN_16 },
 { X86::PTESTrr, X86::PTESTrm, TB_ALIGN_16 },
 { X86::RCPPSr, X86::RCPPSm, TB_ALIGN_16 },
 { X86::RCPSSr, X86::RCPSSm, 0 },
 { X86::RCPSSr_Int, X86::RCPSSm_Int, TB_NO_REVERSE },
 { X86::ROUNDPDr, X86::ROUNDPDm, TB_ALIGN_16 },
 { X86::ROUNDPSr, X86::ROUNDPSm, TB_ALIGN_16 },
+{ X86::ROUNDSDr, X86::ROUNDSDm, 0 },
+{ X86::ROUNDSSr, X86::ROUNDSSm, 0 },
 { X86::RSQRTPSr, X86::RSQRTPSm, TB_ALIGN_16 },
 { X86::RSQRTSSr, X86::RSQRTSSm, 0 },
 { X86::RSQRTSSr_Int, X86::RSQRTSSm_Int, TB_NO_REVERSE },
 { X86::SQRTPDr, X86::SQRTPDm, TB_ALIGN_16 },
 { X86::SQRTPSr, X86::SQRTPSm, TB_ALIGN_16 },
 { X86::SQRTSDr, X86::SQRTSDm, 0 },
 { X86::SQRTSDr_Int, X86::SQRTSDm_Int, TB_NO_REVERSE },
 { X86::SQRTSSr, X86::SQRTSSm, 0 },
@@ 604 lines not shown @@ static const X86MemoryFoldTableEntry MemoryFoldTable2[] = {
 { X86::PUNPCKHDQrr, X86::PUNPCKHDQrm, TB_ALIGN_16 },
 { X86::PUNPCKHQDQrr, X86::PUNPCKHQDQrm, TB_ALIGN_16 },
 { X86::PUNPCKHWDrr, X86::PUNPCKHWDrm, TB_ALIGN_16 },
 { X86::PUNPCKLBWrr, X86::PUNPCKLBWrm, TB_ALIGN_16 },
 { X86::PUNPCKLDQrr, X86::PUNPCKLDQrm, TB_ALIGN_16 },
 { X86::PUNPCKLQDQrr, X86::PUNPCKLQDQrm, TB_ALIGN_16 },
 { X86::PUNPCKLWDrr, X86::PUNPCKLWDrm, TB_ALIGN_16 },
 { X86::PXORrr, X86::PXORrm, TB_ALIGN_16 },
-{ X86::ROUNDSDr, X86::ROUNDSDm, 0 },
-{ X86::ROUNDSSr, X86::ROUNDSSm, 0 },
 { X86::ROUNDSDr_Int, X86::ROUNDSDm_Int, TB_NO_REVERSE },
 { X86::ROUNDSSr_Int, X86::ROUNDSSm_Int, TB_NO_REVERSE },
 { X86::SBB32rr, X86::SBB32rm, 0 },
 { X86::SBB64rr, X86::SBB64rm, 0 },
 { X86::SHUFPDrri, X86::SHUFPDrmi, TB_ALIGN_16 },
 { X86::SHUFPSrri, X86::SHUFPSrmi, TB_ALIGN_16 },
 { X86::SUB16rr, X86::SUB16rm, 0 },
 { X86::SUB32rr, X86::SUB32rm, 0 },
@@ 8,379 lines not shown @@

llvm/trunk/lib/Target/X86/X86InstrSSE.td

@@ 6,262 lines not shown @@ def : Pat<(v4f32 (X86insertps (v4f32 VR128:$src1),
 (X86VBroadcast (loadv4f32 addr:$src2)), imm:$src3)),
 (VINSERTPSrm VR128:$src1, addr:$src2, imm:$src3)>;
 }

 //===----------------------------------------------------------------------===//
 // SSE4.1 - Round Instructions
 //===----------------------------------------------------------------------===//

-multiclass sse41_fp_unop_rm<bits<8> opcps, bits<8> opcpd, string OpcodeStr,
+multiclass sse41_fp_unop_p<bits<8> opcps, bits<8> opcpd, string OpcodeStr,
 X86MemOperand x86memop, RegisterClass RC,
 PatFrag mem_frag32, PatFrag mem_frag64,
 Intrinsic V4F32Int, Intrinsic V2F64Int> {
 let ExeDomain = SSEPackedSingle in {
 // Intrinsic operation, reg.
 // Vector intrinsic operation, reg
 def PSr : SS4AIi8<opcps, MRMSrcReg,
 (outs RC:$dst), (ins RC:$src1, i32u8imm:$src2),
 !strconcat(OpcodeStr,
 "ps\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
 [(set RC:$dst, (V4F32Int RC:$src1, imm:$src2))],
@@ 24 lines not shown @@ def PDm : SS4AIi8<opcpd, MRMSrcMem,
 !strconcat(OpcodeStr,
 "pd\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
 [(set RC:$dst,
 (V2F64Int (mem_frag64 addr:$src1),imm:$src2))],
 IIC_SSE_ROUNDPS_REG>, Sched<[WriteFAddLd]>;
 } // ExeDomain = SSEPackedDouble
 }

-multiclass sse41_fp_binop_rm<bits<8> opcss, bits<8> opcsd,
-string OpcodeStr,
-Intrinsic F32Int,
-Intrinsic F64Int, bit Is2Addr = 1> {
-let ExeDomain = GenericDomain in {
-// Operation, reg.
-let hasSideEffects = 0 in
+multiclass avx_fp_unop_rm<bits<8> opcss, bits<8> opcsd,
+string OpcodeStr> {
+let ExeDomain = GenericDomain, hasSideEffects = 0 in {
 def SSr : SS4AIi8<opcss, MRMSrcReg,
 (outs FR32:$dst), (ins FR32:$src1, FR32:$src2, i32u8imm:$src3),
-!if(Is2Addr,
 !strconcat(OpcodeStr,
-"ss\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
-!strconcat(OpcodeStr,
-"ss\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}")),
+"ss\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}"),
 []>, Sched<[WriteFAdd]>;

-// Operation, mem.
-let mayLoad = 1, hasSideEffects = 0 in
+let mayLoad = 1 in
 def SSm : SS4AIi8<opcss, MRMSrcMem,
 (outs FR32:$dst), (ins FR32:$src1, f32mem:$src2, i32u8imm:$src3),
-!if(Is2Addr,
 !strconcat(OpcodeStr,
-"ss\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
+"ss\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}"),
+[]>, Sched<[WriteFAddLd, ReadAfterLd]>;
+
+def SDr : SS4AIi8<opcsd, MRMSrcReg,
+(outs FR64:$dst), (ins FR64:$src1, FR64:$src2, i32u8imm:$src3),
 !strconcat(OpcodeStr,
-"ss\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}")),
+"sd\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}"),
+[]>, Sched<[WriteFAdd]>;
+
+let mayLoad = 1 in
+def SDm : SS4AIi8<opcsd, MRMSrcMem,
+(outs FR64:$dst), (ins FR64:$src1, f64mem:$src2, i32u8imm:$src3),
+!strconcat(OpcodeStr,
+"sd\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}"),
 []>, Sched<[WriteFAddLd, ReadAfterLd]>;
+} // ExeDomain = GenericDomain, hasSideEffects = 0
+}

-// Intrinsic operation, reg.
-let isCodeGenOnly = 1 in
+multiclass sse41_fp_unop_s<bits<8> opcss, bits<8> opcsd,
+string OpcodeStr> {
+let ExeDomain = GenericDomain, hasSideEffects = 0 in {
+def SSr : SS4AIi8<opcss, MRMSrcReg,
+(outs FR32:$dst), (ins FR32:$src1, i32u8imm:$src2),
+!strconcat(OpcodeStr,
+"ss\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
+[]>, Sched<[WriteFAdd]>;
+
+let mayLoad = 1 in
+def SSm : SS4AIi8<opcss, MRMSrcMem,
+(outs FR32:$dst), (ins f32mem:$src1, i32u8imm:$src2),
+!strconcat(OpcodeStr,
+"ss\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
+[]>, Sched<[WriteFAddLd, ReadAfterLd]>;
+
+def SDr : SS4AIi8<opcsd, MRMSrcReg,
+(outs FR64:$dst), (ins FR64:$src1, i32u8imm:$src2),
+!strconcat(OpcodeStr,
+"sd\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
+[]>, Sched<[WriteFAdd]>;
+
+let mayLoad = 1 in
+def SDm : SS4AIi8<opcsd, MRMSrcMem,
+(outs FR64:$dst), (ins f64mem:$src1, i32u8imm:$src2),
+!strconcat(OpcodeStr,
+"sd\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
+[]>, Sched<[WriteFAddLd, ReadAfterLd]>;
+} // ExeDomain = GenericDomain, hasSideEffects = 0
+}
+
+multiclass sse41_fp_binop_s<bits<8> opcss, bits<8> opcsd,
+string OpcodeStr,
+Intrinsic F32Int,
+Intrinsic F64Int, bit Is2Addr = 1> {
+let ExeDomain = GenericDomain, isCodeGenOnly = 1 in {
 def SSr_Int : SS4AIi8<opcss, MRMSrcReg,
 (outs VR128:$dst), (ins VR128:$src1, VR128:$src2, i32u8imm:$src3),
 !if(Is2Addr,
 !strconcat(OpcodeStr,
 "ss\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
 !strconcat(OpcodeStr,
 "ss\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}")),
 [(set VR128:$dst, (F32Int VR128:$src1, VR128:$src2, imm:$src3))]>,
 Sched<[WriteFAdd]>;

-// Intrinsic operation, mem.
-let isCodeGenOnly = 1 in
 def SSm_Int : SS4AIi8<opcss, MRMSrcMem,
 (outs VR128:$dst), (ins VR128:$src1, ssmem:$src2, i32u8imm:$src3),
 !if(Is2Addr,
 !strconcat(OpcodeStr,
 "ss\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
 !strconcat(OpcodeStr,
 "ss\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}")),
 [(set VR128:$dst,
 (F32Int VR128:$src1, sse_load_f32:$src2, imm:$src3))]>,
 Sched<[WriteFAddLd, ReadAfterLd]>;

-// Operation, reg.
-let hasSideEffects = 0 in
-def SDr : SS4AIi8<opcsd, MRMSrcReg,
-(outs FR64:$dst), (ins FR64:$src1, FR64:$src2, i32u8imm:$src3),
-!if(Is2Addr,
-!strconcat(OpcodeStr,
-"sd\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
-!strconcat(OpcodeStr,
-"sd\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}")),
-[]>, Sched<[WriteFAdd]>;
-
-// Operation, mem.
-let mayLoad = 1, hasSideEffects = 0 in
-def SDm : SS4AIi8<opcsd, MRMSrcMem,
-(outs FR64:$dst), (ins FR64:$src1, f64mem:$src2, i32u8imm:$src3),
-!if(Is2Addr,
-!strconcat(OpcodeStr,
-"sd\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
-!strconcat(OpcodeStr,
-"sd\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}")),
-[]>, Sched<[WriteFAddLd, ReadAfterLd]>;
-
-// Intrinsic operation, reg.
-let isCodeGenOnly = 1 in
 def SDr_Int : SS4AIi8<opcsd, MRMSrcReg,
 (outs VR128:$dst), (ins VR128:$src1, VR128:$src2, i32u8imm:$src3),
 !if(Is2Addr,
 !strconcat(OpcodeStr,
 "sd\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
 !strconcat(OpcodeStr,
 "sd\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}")),
 [(set VR128:$dst, (F64Int VR128:$src1, VR128:$src2, imm:$src3))]>,
 Sched<[WriteFAdd]>;

-// Intrinsic operation, mem.
-let isCodeGenOnly = 1 in
 def SDm_Int : SS4AIi8<opcsd, MRMSrcMem,
 (outs VR128:$dst), (ins VR128:$src1, sdmem:$src2, i32u8imm:$src3),
 !if(Is2Addr,
 !strconcat(OpcodeStr,
 "sd\t{$src3, $src2, $dst|$dst, $src2, $src3}"),
 !strconcat(OpcodeStr,
 "sd\t{$src3, $src2, $src1, $dst|$dst, $src1, $src2, $src3}")),
 [(set VR128:$dst,
 (F64Int VR128:$src1, sse_load_f64:$src2, imm:$src3))]>,
 Sched<[WriteFAddLd, ReadAfterLd]>;
-} // ExeDomain = GenericDomain
+} // ExeDomain = GenericDomain, isCodeGenOnly = 1
 }

 // FP round - roundss, roundps, roundsd, roundpd
 let Predicates = [HasAVX] in {
 // Intrinsic form
-defm VROUND : sse41_fp_unop_rm<0x08, 0x09, "vround", f128mem, VR128,
+defm VROUND : sse41_fp_unop_p<0x08, 0x09, "vround", f128mem, VR128,
 loadv4f32, loadv2f64,
 int_x86_sse41_round_ps,
 int_x86_sse41_round_pd>, VEX;
-defm VROUNDY : sse41_fp_unop_rm<0x08, 0x09, "vround", f256mem, VR256,
+defm VROUNDY : sse41_fp_unop_p<0x08, 0x09, "vround", f256mem, VR256,
 loadv8f32, loadv4f64,
 int_x86_avx_round_ps_256,
 int_x86_avx_round_pd_256>, VEX, VEX_L;
-defm VROUND : sse41_fp_binop_rm<0x0A, 0x0B, "vround",
+defm VROUND : sse41_fp_binop_s<0x0A, 0x0B, "vround",
 int_x86_sse41_round_ss,
 int_x86_sse41_round_sd, 0>, VEX_4V, VEX_LIG;
+defm VROUND : avx_fp_unop_rm<0x0A, 0x0B, "vround">, VEX_4V, VEX_LIG;
 }

 let Predicates = [UseAVX] in {
 def : Pat<(ffloor FR32:$src),
 (VROUNDSSr (f32 (IMPLICIT_DEF)), FR32:$src, (i32 0x9))>;
 def : Pat<(f64 (ffloor FR64:$src)),
 (VROUNDSDr (f64 (IMPLICIT_DEF)), FR64:$src, (i32 0x9))>;
 def : Pat<(f32 (fnearbyint FR32:$src)),
@@ 55 lines not shown @@ let Predicates = [HasAVX] in {
 def : Pat<(v4f64 (fceil VR256:$src)),
 (VROUNDYPDr VR256:$src, (i32 0xA))>;
 def : Pat<(v4f64 (frint VR256:$src)),
 (VROUNDYPDr VR256:$src, (i32 0x4))>;
 def : Pat<(v4f64 (ftrunc VR256:$src)),
 (VROUNDYPDr VR256:$src, (i32 0xB))>;
 }

-defm ROUND : sse41_fp_unop_rm<0x08, 0x09, "round", f128mem, VR128,
-memopv4f32, memopv2f64,
-int_x86_sse41_round_ps, int_x86_sse41_round_pd>;
+defm ROUND : sse41_fp_unop_p<0x08, 0x09, "round", f128mem, VR128,
+memopv4f32, memopv2f64, int_x86_sse41_round_ps,
+int_x86_sse41_round_pd>;

+defm ROUND : sse41_fp_unop_s<0x0A, 0x0B, "round">;
+
 let Constraints = "$src1 = $dst" in
-defm ROUND : sse41_fp_binop_rm<0x0A, 0x0B, "round",
+defm ROUND : sse41_fp_binop_s<0x0A, 0x0B, "round",
 int_x86_sse41_round_ss, int_x86_sse41_round_sd>;

 let Predicates = [UseSSE41] in {
 def : Pat<(ffloor FR32:$src),
-(ROUNDSSr (f32 (IMPLICIT_DEF)), FR32:$src, (i32 0x9))>;
+(ROUNDSSr FR32:$src, (i32 0x9))>;
 def : Pat<(f64 (ffloor FR64:$src)),
-(ROUNDSDr (f64 (IMPLICIT_DEF)), FR64:$src, (i32 0x9))>;
+(ROUNDSDr FR64:$src, (i32 0x9))>;
 def : Pat<(f32 (fnearbyint FR32:$src)),
-(ROUNDSSr (f32 (IMPLICIT_DEF)), FR32:$src, (i32 0xC))>;
+(ROUNDSSr FR32:$src, (i32 0xC))>;
 def : Pat<(f64 (fnearbyint FR64:$src)),
-(ROUNDSDr (f64 (IMPLICIT_DEF)), FR64:$src, (i32 0xC))>;
+(ROUNDSDr FR64:$src, (i32 0xC))>;
 def : Pat<(f32 (fceil FR32:$src)),
-(ROUNDSSr (f32 (IMPLICIT_DEF)), FR32:$src, (i32 0xA))>;
+(ROUNDSSr FR32:$src, (i32 0xA))>;
 def : Pat<(f64 (fceil FR64:$src)),
-(ROUNDSDr (f64 (IMPLICIT_DEF)), FR64:$src, (i32 0xA))>;
+(ROUNDSDr FR64:$src, (i32 0xA))>;
 def : Pat<(f32 (frint FR32:$src)),
-(ROUNDSSr (f32 (IMPLICIT_DEF)), FR32:$src, (i32 0x4))>;
+(ROUNDSSr FR32:$src, (i32 0x4))>;
 def : Pat<(f64 (frint FR64:$src)),
-(ROUNDSDr (f64 (IMPLICIT_DEF)), FR64:$src, (i32 0x4))>;
+(ROUNDSDr FR64:$src, (i32 0x4))>;
 def : Pat<(f32 (ftrunc FR32:$src)),
-(ROUNDSSr (f32 (IMPLICIT_DEF)), FR32:$src, (i32 0xB))>;
+(ROUNDSSr FR32:$src, (i32 0xB))>;
 def : Pat<(f64 (ftrunc FR64:$src)),
-(ROUNDSDr (f64 (IMPLICIT_DEF)), FR64:$src, (i32 0xB))>;
+(ROUNDSDr FR64:$src, (i32 0xB))>;

 def : Pat<(v4f32 (ffloor VR128:$src)),
 (ROUNDPSr VR128:$src, (i32 0x9))>;
 def : Pat<(v4f32 (fnearbyint VR128:$src)),
 (ROUNDPSr VR128:$src, (i32 0xC))>;
 def : Pat<(v4f32 (fceil VR128:$src)),
 (ROUNDPSr VR128:$src, (i32 0xA))>;
 def : Pat<(v4f32 (frint VR128:$src)),
@@ 2,218 lines not shown @@

llvm/trunk/test/CodeGen/X86/pr31143.ll

+; RUN: llc -mtriple=x86_64-pc-linux-gnu -mattr=+sse4.2 < %s | FileCheck %s
+
+; CHECK-LABEL: testss:
+; CHECK: movss {{.*}}, %[[XMM0:xmm[0-9]+]]
+; CHECK: xorps %[[XMM1:xmm[0-9]+]], %[[XMM1]]
+; CHECK: roundss $9, %[[XMM0]], %[[XMM1]]
+
+define void @testss(float* nocapture %a, <4 x float>* nocapture %b, i32 %k) {
+entry:
+  br label %for.body
+
+for.body:
+  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+  %arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
+  %v = load float, float* %arrayidx, align 4
+  %floor = call float @floorf(float %v)
+  %sub = fsub float %floor, %v
+  %v1 = insertelement <4 x float> undef, float %sub, i32 0
+  %br = shufflevector <4 x float> %v1, <4 x float> undef, <4 x i32> <i32 0, i32 0, i32 0, i32 0>
+  store volatile <4 x float> %br, <4 x float>* %b, align 4
+  %indvars.iv.next = add i64 %indvars.iv, 1
+  %lftr.wideiv = trunc i64 %indvars.iv.next to i32
+  %exitcond = icmp eq i32 %lftr.wideiv, %k
+  br i1 %exitcond, label %for.end, label %for.body
+
+for.end:
+  ret void
+}
+
+; CHECK-LABEL: testsd:
+; CHECK: movsd {{.*}}, %[[XMM0:xmm[0-9]+]]
+; CHECK: xorps %[[XMM1:xmm[0-9]+]], %[[XMM1]]
+; CHECK: roundsd $9, %[[XMM0]], %[[XMM1]]
+
+define void @testsd(double* nocapture %a, <2 x double>* nocapture %b, i32 %k) {
+entry:
+  br label %for.body
+
+for.body:
+  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+  %arrayidx = getelementptr inbounds double, double* %a, i64 %indvars.iv
+  %v = load double, double* %arrayidx, align 4
+  %floor = call double @floor(double %v)
+  %sub = fsub double %floor, %v
+  %v1 = insertelement <2 x double> undef, double %sub, i32 0
+  %br = shufflevector <2 x double> %v1, <2 x double> undef, <2 x i32> <i32 0, i32 0>
+  store volatile <2 x double> %br, <2 x double>* %b, align 4
+  %indvars.iv.next = add i64 %indvars.iv, 1
+  %lftr.wideiv = trunc i64 %indvars.iv.next to i32
+  %exitcond = icmp eq i32 %lftr.wideiv, %k
+  br i1 %exitcond, label %for.end, label %for.body
+
+for.end:
+  ret void
+}
+
+declare float @floorf(float) nounwind readnone
+
+declare double @floor(double) nounwind readnone