This is an archive of the discontinued LLVM Phabricator instance.

Differential D44428

[X86][SSE] Treat (V)MOVAPD/(V)MOVUPD + (V)MOVAPS/(V)MOVUPS reg-reg instructions as moves not shuffles
Abandoned • Public

Authored by RKSimon on Mar 13 2018, 7:39 AM.

Details

Reviewers

craig.topper
gadi.haber
andreadb
spatel
courbet

Summary

Oddly, the (V)MOVAPDrr/(V)MOVUPDrr and (V)MOVAPSrr/(V)MOVUPSrr instructions were classed as WriteFShuffle.

This patch changes the class to WriteMove, which matches what we already do for (V)MOVDQArr/(V)MOVDQUrr.

Found by llvm-mca.

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon created this revision. Mar 13 2018, 7:39 AM

RKSimon added a reviewer: courbet. Mar 13 2018, 7:54 AM

Hi Simon,

Can you elaborate on how you used llvm-mca to derive this?

Using the compute_itineraries tool that I've mentioned in the past on a Haswell machine, I see that VMOVUPSrr and other variants use only HWPort5, which would make WriteFShuffle more accurate.
For Sandy Bridge, the results are consistent (only HWPort5 is used).
(Note that we have an LLVM version of this tool, for which we'll send an RFC shortly.)

Also note that most CPU models override the sched class specifically for these instructions, e.g. on haswell:

def HWWriteResGroup4 : SchedWriteRes<[HWPort5]> {
  let Latency = 1;
  let NumMicroOps = 1;
  let ResourceCycles = [1];
}
def: InstRW<[HWWriteResGroup4], (instregex "VMOVUPSrr")>;

So this change will not affect CPUs whose models already override the class for these instructions.

For reference, for VMOVDQArr I see P0156 on haswell (consistent with the "Move" profile).
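The override behaviour described in this comment can be sketched with a small toy model. This is not LLVM's implementation: the `resolve` helper and the dictionaries are hypothetical, the resource names are simply the ones quoted in this thread, and matching the regex against the whole opcode name is a simplifying assumption.

```python
import re

# Toy model (hypothetical, not LLVM code): each CPU maps generic scheduling
# classes to resources and may carry per-instruction regex overrides.
CPU_MODELS = {
    "haswell": {
        # Resource names as reported in this thread, for illustration only.
        "classes": {"WriteFShuffle": ("HWPort5",), "WriteMove": ("HWPort0156",)},
        "overrides": [(r"VMOVUPSrr", ("HWPort5",))],
    },
    "btver2": {
        "classes": {"WriteFShuffle": ("JFPU0", "JFPU1"), "WriteMove": ("JALU0", "JALU1")},
        "overrides": [],  # no per-instruction override in this toy model
    },
}


def resolve(cpu, opcode, default_class):
    """Return the resources used by `opcode`: an override wins over the default class."""
    model = CPU_MODELS[cpu]
    for pattern, resources in model["overrides"]:
        if re.fullmatch(pattern, opcode):
            return resources
    return model["classes"][default_class]


# Changing the default class is invisible where an override exists...
print(resolve("haswell", "VMOVUPSrr", "WriteFShuffle"))
print(resolve("haswell", "VMOVUPSrr", "WriteMove"))
# ...but it changes the result on a model without one.
print(resolve("btver2", "MOVAPSrr", "WriteFShuffle"))
print(resolve("btver2", "MOVAPSrr", "WriteMove"))
```

In this sketch both Haswell lookups return the same resources, while the btver2 lookups differ, which is the asymmetry discussed in the comments that follow.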

I saw this on btver2 - MOVDQA is reported as using JALU0/JALU1 while MOVAPS/MOVAPD reports JFPU0/JFPU1. And it appears to be affecting a couple of other targets with non-exhaustive scheduler model overloads.

Note that this shouldn't affect SB etc.; as you've said, those models already override the schedules for reg-reg moves.

What we could do is start splitting vector/scalar moves/loads/stores, but this patch was all that was necessary for the cases I saw.

I haven't looked at the sched model details recently, but this seems like a step in the right direction...although we really need:
https://bugs.llvm.org/show_bug.cgi?id=36671 ?

I.e., most reg-reg moves on Zen, IvyBridge or later should be special-cased if we want an accurate simulation.

For example, Agner has this in section 19.13 of the micro-arch doc:
"Register-to-register move instructions are resolved at the register rename stage without using any execution units.
These instructions have zero latency. It is possible to do six such register renamings per clock cycle, and it is even
possible to rename the same register several times in one clock cycle."
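To make the gap concrete, here is a rough arithmetic sketch comparing a model that charges every reg-reg move to an execution port with one that resolves moves at rename. The function names are hypothetical, only throughput is modelled (no dependencies or other instructions), and the rate of six renames per cycle is taken from the quote above.

```python
import math


def cycles_on_ports(num_moves, num_ports):
    """Throughput-bound cycles if every move occupies one port for one cycle."""
    return math.ceil(num_moves / num_ports)


def cycles_at_rename(num_moves, renames_per_cycle=6):
    """Throughput-bound cycles if moves are resolved at rename and use no execution unit."""
    return math.ceil(num_moves / renames_per_cycle)


print(cycles_on_ports(12, 1))  # 12 cycles when a single port handles all moves
print(cycles_on_ports(12, 4))  # 3 cycles when four ports share them
print(cycles_at_rename(12))    # 2 cycles at six renames per cycle
```

A scheduler model that always assigns a port to these moves would therefore overestimate their cost on such cores, which is the motivation for special-casing them.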

In D44428#1036073, @RKSimon wrote:

I saw this on btver2 - MOVDQA is reported as using JALU0/JALU1 while MOVAPS/MOVAPD reports JFPU0/JFPU1. And it appears to be affecting a couple of other targets with non-exhaustive scheduler model overloads.

I see. If there's nothing in common between microarchitectures, then ideally shouldn't we avoid putting a default? I guess what happened here is that someone measured it on Intel and put that here, which ended up hurting btver2... I could see the opposite happening with this change if any Intel uarch forgets to override this. Did you check that all Intel uarchs override this?

As you can see from the test changes in the patch, these are limited to btver2/znver1/slm/glm. Agner's tables show no difference in pipe usage between dq/ps/pd aligned/unaligned rr moves for these CPUs. The tests have pretty much complete coverage of the SSE + AVX1/AVX2 cases.

IMO it's much safer to default to WriteMove than WriteFShuffle for these instructions.
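The test changes referred to here are edits to the `# sched: [latency:reciprocal throughput]` annotations on the CHECK lines. A minimal sketch of reading those two numbers from a line follows; `parse_sched` is a hypothetical helper, not part of any LLVM tool, and the two sample lines are copied from the avx-schedule.ll change in this diff.

```python
import re

# Matches annotations such as "sched: [1:0.50]".
SCHED_RE = re.compile(r"sched: \[(\d+):(\d+\.\d+)\]")


def parse_sched(line):
    """Return (latency, reciprocal_throughput) from a 'sched: [L:T]' annotation."""
    match = SCHED_RE.search(line)
    if match is None:
        raise ValueError(f"no sched annotation in: {line!r}")
    return int(match.group(1)), float(match.group(2))


before = "; ZNVER1-NEXT: vmovapd %xmm2, %xmm0 # sched: [1:0.50]"
after = "; ZNVER1-NEXT: vmovapd %xmm2, %xmm0 # sched: [1:0.25]"

lat_before, rtp_before = parse_sched(before)
lat_after, rtp_after = parse_sched(after)
print(lat_before == lat_after)  # True: the latency is unchanged
print(rtp_before / rtp_after)   # 2.0: the reciprocal throughput halves
```

Read this way, the znver1 change leaves the move's latency at one cycle and only alters how many such moves the model allows per cycle.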

What about introducing WriteVMove then? This can be the same as WriteMove for btver2 and WriteFShuffle for Intel CPUs.
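The idea of a dedicated class that each CPU maps to whichever existing class fits can be sketched as a toy lookup. Everything here is hypothetical and for illustration only: the tables, the `resources_for` helper, and the resource names (taken from this thread) are not the definitions that eventually landed.

```python
# Toy model (hypothetical): resources of the existing classes per CPU.
BASE_CLASSES = {
    "btver2": {"WriteMove": ("JALU0", "JALU1"), "WriteFShuffle": ("JFPU0", "JFPU1")},
    "haswell": {"WriteMove": ("HWPort0156",), "WriteFShuffle": ("HWPort5",)},
}

# The proposed vector-move class resolves to a different existing class per CPU.
VMOVE_ALIAS = {"btver2": "WriteMove", "haswell": "WriteFShuffle"}


def resources_for(cpu, sched_class):
    """Look up resources, resolving the proposed WriteVMove class per CPU first."""
    if sched_class == "WriteVMove":
        sched_class = VMOVE_ALIAS[cpu]
    return BASE_CLASSES[cpu][sched_class]


print(resources_for("btver2", "WriteVMove"))
print(resources_for("haswell", "WriteVMove"))
```

The design point is that the instruction definition names a single class, and the per-CPU choice moves out of the shared default and into each model.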

OK, I'll do that this morning.

RKSimon mentioned this in rL327505: [X86][SSE] Use WriteFShuffleLd for MOVDDUP/MOVSHDUP/MOVSLDUP reg-mem…. Mar 14 2018, 6:25 AM

RKSimon mentioned this in D44471: [X86][SSE] Introduce WriteVecMove, WriteVecLoad and WriteVecStore scheduler classes. Mar 14 2018, 8:12 AM

RKSimon mentioned this in rL327524: [X86][AVX] Use WriteFShuffleLd for broadcast reg-mem instructions. Mar 14 2018, 8:51 AM

RKSimon mentioned this in rL327630: [X86][SSE] Introduce Float/Vector WriteMove, WriteLoad and Writetore scheduler…. Mar 15 2018, 7:49 AM

Abandoning this now that D44471 has landed.

Revision Contents

Path                                  Revision   Size
lib/Target/X86/X86InstrSSE.td         327396     6 lines
test/CodeGen/X86/avx-schedule.ll      327396     8 lines
test/CodeGen/X86/sha-schedule.ll      327396     12 lines
test/CodeGen/X86/sse-schedule.ll      327396     6 lines
test/CodeGen/X86/sse2-schedule.ll     327396     8 lines
test/CodeGen/X86/sse41-schedule.ll    327396     22 lines

Diff 138188

lib/Target/X86/X86InstrSSE.td

[643 lines not shown]

  multiclass sse12_mov_packed<bits<8> opc, RegisterClass RC,
  X86MemOperand x86memop, PatFrag ld_frag,
  string asm, Domain d,
  OpndItins itins> {
  let hasSideEffects = 0 in
  def rr : PI<opc, MRMSrcReg, (outs RC:$dst), (ins RC:$src),
  !strconcat(asm, "\t{$src, $dst|$dst, $src}"), [], itins.rr, d>,
- Sched<[WriteFShuffle]>;
+ Sched<[WriteMove]>;
  let canFoldAsLoad = 1, isReMaterializable = 1 in
  def rm : PI<opc, MRMSrcMem, (outs RC:$dst), (ins x86memop:$src),
  !strconcat(asm, "\t{$src, $dst|$dst, $src}"),
  [(set RC:$dst, (ld_frag addr:$src))], itins.rm, d>,
  Sched<[WriteLoad]>;
  }

  let Predicates = [HasAVX, NoVLX] in {
[73 lines not shown]
  def VMOVUPDYmr : VPDI<0x11, MRMDestMem, (outs), (ins f256mem:$dst, VR256:$src),
  "movupd\t{$src, $dst|$dst, $src}",
  [(store (v4f64 VR256:$src), addr:$dst)],
  IIC_SSE_MOVU_P_MR>, VEX, VEX_L, VEX_WIG;
  } // SchedRW

  // For disassembler
  let isCodeGenOnly = 1, ForceDisassemble = 1, hasSideEffects = 0,
- SchedRW = [WriteFShuffle] in {
+ SchedRW = [WriteMove] in {
  def VMOVAPSrr_REV : VPSI<0x29, MRMDestReg, (outs VR128:$dst),
  (ins VR128:$src),
  "movaps\t{$src, $dst|$dst, $src}", [],
  IIC_SSE_MOVA_P_RR>, VEX, VEX_WIG,
  FoldGenData<"VMOVAPSrr">;
  def VMOVAPDrr_REV : VPDI<0x29, MRMDestReg, (outs VR128:$dst),
  (ins VR128:$src),
  "movapd\t{$src, $dst|$dst, $src}", [],
[66 lines not shown]
  def MOVUPDmr : PDI<0x11, MRMDestMem, (outs), (ins f128mem:$dst, VR128:$src),
  "movupd\t{$src, $dst|$dst, $src}",
  [(store (v2f64 VR128:$src), addr:$dst)],
  IIC_SSE_MOVU_P_MR>;
  } // SchedRW

  // For disassembler
  let isCodeGenOnly = 1, ForceDisassemble = 1, hasSideEffects = 0,
- SchedRW = [WriteFShuffle] in {
+ SchedRW = [WriteMove] in {
  def MOVAPSrr_REV : PSI<0x29, MRMDestReg, (outs VR128:$dst), (ins VR128:$src),
  "movaps\t{$src, $dst|$dst, $src}", [],
  IIC_SSE_MOVA_P_RR>, FoldGenData<"MOVAPSrr">;
  def MOVAPDrr_REV : PDI<0x29, MRMDestReg, (outs VR128:$dst), (ins VR128:$src),
  "movapd\t{$src, $dst|$dst, $src}", [],
  IIC_SSE_MOVA_P_RR>, FoldGenData<"MOVAPDrr">;
  def MOVUPSrr_REV : PSI<0x11, MRMDestReg, (outs VR128:$dst), (ins VR128:$src),
  "movups\t{$src, $dst|$dst, $src}", [],
[7,813 lines not shown]

test/CodeGen/X86/avx-schedule.ll

[2,097 lines not shown]
  ; BTVER2-NEXT: vmaskmovpd %xmm1, %xmm0, (%rdi) # sched: [6:2.00]
  ; BTVER2-NEXT: vmovapd %xmm2, %xmm0 # sched: [1:0.50]
  ; BTVER2-NEXT: retq # sched: [4:1.00]
  ;
  ; ZNVER1-LABEL: test_maskmovpd:
  ; ZNVER1: # %bb.0:
  ; ZNVER1-NEXT: vmaskmovpd (%rdi), %xmm0, %xmm2 # sched: [8:0.50]
  ; ZNVER1-NEXT: vmaskmovpd %xmm1, %xmm0, (%rdi) # sched: [4:0.50]
- ; ZNVER1-NEXT: vmovapd %xmm2, %xmm0 # sched: [1:0.50]
+ ; ZNVER1-NEXT: vmovapd %xmm2, %xmm0 # sched: [1:0.25]
  ; ZNVER1-NEXT: retq # sched: [1:0.50]
  %1 = call <2 x double> @llvm.x86.avx.maskload.pd(i8* %a0, <2 x i64> %a1)
  call void @llvm.x86.avx.maskstore.pd(i8* %a0, <2 x i64> %a1, <2 x double> %a2)
  ret <2 x double> %1
  }
  declare <2 x double> @llvm.x86.avx.maskload.pd(i8*, <2 x i64>) nounwind readonly
  declare void @llvm.x86.avx.maskstore.pd(i8*, <2 x i64>, <2 x double>) nounwind

[46 lines not shown]
  ; BTVER2-NEXT: vmaskmovpd %ymm1, %ymm0, (%rdi) # sched: [6:2.00]
  ; BTVER2-NEXT: vmovapd %ymm2, %ymm0 # sched: [1:0.50]
  ; BTVER2-NEXT: retq # sched: [4:1.00]
  ;
  ; ZNVER1-LABEL: test_maskmovpd_ymm:
  ; ZNVER1: # %bb.0:
  ; ZNVER1-NEXT: vmaskmovpd (%rdi), %ymm0, %ymm2 # sched: [8:1.00]
  ; ZNVER1-NEXT: vmaskmovpd %ymm1, %ymm0, (%rdi) # sched: [5:1.00]
- ; ZNVER1-NEXT: vmovapd %ymm2, %ymm0 # sched: [1:0.50]
+ ; ZNVER1-NEXT: vmovapd %ymm2, %ymm0 # sched: [1:0.25]
  ; ZNVER1-NEXT: retq # sched: [1:0.50]
  %1 = call <4 x double> @llvm.x86.avx.maskload.pd.256(i8* %a0, <4 x i64> %a1)
  call void @llvm.x86.avx.maskstore.pd.256(i8* %a0, <4 x i64> %a1, <4 x double> %a2)
  ret <4 x double> %1
  }
  declare <4 x double> @llvm.x86.avx.maskload.pd.256(i8*, <4 x i64>) nounwind readonly
  declare void @llvm.x86.avx.maskstore.pd.256(i8*, <4 x i64>, <4 x double>) nounwind

[46 lines not shown]
  ; BTVER2-NEXT: vmaskmovps %xmm1, %xmm0, (%rdi) # sched: [6:2.00]
  ; BTVER2-NEXT: vmovaps %xmm2, %xmm0 # sched: [1:0.50]
  ; BTVER2-NEXT: retq # sched: [4:1.00]
  ;
  ; ZNVER1-LABEL: test_maskmovps:
  ; ZNVER1: # %bb.0:
  ; ZNVER1-NEXT: vmaskmovps (%rdi), %xmm0, %xmm2 # sched: [8:0.50]
  ; ZNVER1-NEXT: vmaskmovps %xmm1, %xmm0, (%rdi) # sched: [4:0.50]
- ; ZNVER1-NEXT: vmovaps %xmm2, %xmm0 # sched: [1:0.50]
+ ; ZNVER1-NEXT: vmovaps %xmm2, %xmm0 # sched: [1:0.25]
  ; ZNVER1-NEXT: retq # sched: [1:0.50]
  %1 = call <4 x float> @llvm.x86.avx.maskload.ps(i8* %a0, <4 x i32> %a1)
  call void @llvm.x86.avx.maskstore.ps(i8* %a0, <4 x i32> %a1, <4 x float> %a2)
  ret <4 x float> %1
  }
  declare <4 x float> @llvm.x86.avx.maskload.ps(i8*, <4 x i32>) nounwind readonly
  declare void @llvm.x86.avx.maskstore.ps(i8*, <4 x i32>, <4 x float>) nounwind

[46 lines not shown]
  ; BTVER2-NEXT: vmaskmovps %ymm1, %ymm0, (%rdi) # sched: [6:2.00]
  ; BTVER2-NEXT: vmovaps %ymm2, %ymm0 # sched: [1:0.50]
  ; BTVER2-NEXT: retq # sched: [4:1.00]
  ;
  ; ZNVER1-LABEL: test_maskmovps_ymm:
  ; ZNVER1: # %bb.0:
  ; ZNVER1-NEXT: vmaskmovps (%rdi), %ymm0, %ymm2 # sched: [8:1.00]
  ; ZNVER1-NEXT: vmaskmovps %ymm1, %ymm0, (%rdi) # sched: [5:1.00]
- ; ZNVER1-NEXT: vmovaps %ymm2, %ymm0 # sched: [1:0.50]
+ ; ZNVER1-NEXT: vmovaps %ymm2, %ymm0 # sched: [1:0.25]
  ; ZNVER1-NEXT: retq # sched: [1:0.50]
  %1 = call <8 x float> @llvm.x86.avx.maskload.ps.256(i8* %a0, <8 x i32> %a1)
  call void @llvm.x86.avx.maskstore.ps.256(i8* %a0, <8 x i32> %a1, <8 x float> %a2)
  ret <8 x float> %1
  }
  declare <8 x float> @llvm.x86.avx.maskload.ps.256(i8*, <8 x i32>) nounwind readonly
  declare void @llvm.x86.avx.maskstore.ps.256(i8*, <8 x i32>, <8 x float>) nounwind

[3,126 lines not shown]

test/CodeGen/X86/sha-schedule.ll

[204 lines not shown]
  ; GENERIC-NEXT: movaps %xmm2, %xmm0 # sched: [1:1.00]
  ; GENERIC-NEXT: sha256rnds2 %xmm0, %xmm1, %xmm3 # sched: [5:1.00]
  ; GENERIC-NEXT: sha256rnds2 %xmm0, (%rdi), %xmm3 # sched: [9:1.00]
  ; GENERIC-NEXT: movaps %xmm3, %xmm0 # sched: [1:1.00]
  ; GENERIC-NEXT: retq # sched: [1:1.00]
  ;
  ; GOLDMONT-LABEL: test_sha256rnds2:
  ; GOLDMONT: # %bb.0:
- ; GOLDMONT-NEXT: movaps %xmm0, %xmm3 # sched: [1:1.00]
- ; GOLDMONT-NEXT: movaps %xmm2, %xmm0 # sched: [1:1.00]
+ ; GOLDMONT-NEXT: movaps %xmm0, %xmm3 # sched: [1:0.50]
+ ; GOLDMONT-NEXT: movaps %xmm2, %xmm0 # sched: [1:0.50]
  ; GOLDMONT-NEXT: sha256rnds2 %xmm0, %xmm1, %xmm3 # sched: [4:1.00]
  ; GOLDMONT-NEXT: sha256rnds2 %xmm0, (%rdi), %xmm3 # sched: [7:1.00]
- ; GOLDMONT-NEXT: movaps %xmm3, %xmm0 # sched: [1:1.00]
+ ; GOLDMONT-NEXT: movaps %xmm3, %xmm0 # sched: [1:0.50]
  ; GOLDMONT-NEXT: retq # sched: [4:1.00]
  ;
  ; CANNONLAKE-LABEL: test_sha256rnds2:
  ; CANNONLAKE: # %bb.0:
  ; CANNONLAKE-NEXT: vmovaps %xmm0, %xmm3 # sched: [1:0.33]
  ; CANNONLAKE-NEXT: vmovaps %xmm2, %xmm0 # sched: [1:0.33]
  ; CANNONLAKE-NEXT: sha256rnds2 %xmm0, %xmm1, %xmm3 # sched: [5:1.00]
  ; CANNONLAKE-NEXT: sha256rnds2 %xmm0, (%rdi), %xmm3 # sched: [10:1.00]
  ; CANNONLAKE-NEXT: vmovaps %xmm3, %xmm0 # sched: [1:0.33]
  ; CANNONLAKE-NEXT: retq # sched: [7:1.00]
  ;
  ; ZNVER1-LABEL: test_sha256rnds2:
  ; ZNVER1: # %bb.0:
- ; ZNVER1-NEXT: vmovaps %xmm0, %xmm3 # sched: [1:0.50]
- ; ZNVER1-NEXT: vmovaps %xmm2, %xmm0 # sched: [1:0.50]
+ ; ZNVER1-NEXT: vmovaps %xmm0, %xmm3 # sched: [1:0.25]
+ ; ZNVER1-NEXT: vmovaps %xmm2, %xmm0 # sched: [1:0.25]
  ; ZNVER1-NEXT: sha256rnds2 %xmm0, %xmm1, %xmm3 # sched: [4:1.00]
  ; ZNVER1-NEXT: sha256rnds2 %xmm0, (%rdi), %xmm3 # sched: [11:1.00]
- ; ZNVER1-NEXT: vmovaps %xmm3, %xmm0 # sched: [1:0.50]
+ ; ZNVER1-NEXT: vmovaps %xmm3, %xmm0 # sched: [1:0.25]
  ; ZNVER1-NEXT: retq # sched: [1:0.50]
  %1 = load <4 x i32>, <4 x i32>* %a3
  %2 = tail call <4 x i32> @llvm.x86.sha256rnds2(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> %a2)
  %3 = tail call <4 x i32> @llvm.x86.sha256rnds2(<4 x i32> %2, <4 x i32> %1, <4 x i32> %a2)
  ret <4 x i32> %3
  }
  declare <4 x i32> @llvm.x86.sha256rnds2(<4 x i32>, <4 x i32>, <4 x i32>)

test/CodeGen/X86/sse-schedule.ll

[2,551 lines not shown]
  ; ATOM-NEXT: movaps %xmm1, %xmm0 # sched: [1:0.50]
  ; ATOM-NEXT: retq # sched: [79:39.50]
  ;
  ; SLM-LABEL: test_rcpps:
  ; SLM: # %bb.0:
  ; SLM-NEXT: rcpps (%rdi), %xmm1 # sched: [8:1.00]
  ; SLM-NEXT: rcpps %xmm0, %xmm0 # sched: [5:1.00]
  ; SLM-NEXT: addps %xmm0, %xmm1 # sched: [3:1.00]
- ; SLM-NEXT: movaps %xmm1, %xmm0 # sched: [1:1.00]
+ ; SLM-NEXT: movaps %xmm1, %xmm0 # sched: [1:0.50]
  ; SLM-NEXT: retq # sched: [4:1.00]
  ;
  ; SANDY-LABEL: test_rcpps:
  ; SANDY: # %bb.0:
  ; SANDY-NEXT: vrcpps %xmm0, %xmm0 # sched: [5:1.00]
  ; SANDY-NEXT: vrcpps (%rdi), %xmm1 # sched: [11:1.00]
  ; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
  ; SANDY-NEXT: retq # sched: [1:1.00]
[155 lines not shown]
  ; ATOM-NEXT: movaps %xmm1, %xmm0 # sched: [1:0.50]
  ; ATOM-NEXT: retq # sched: [79:39.50]
  ;
  ; SLM-LABEL: test_rsqrtps:
  ; SLM: # %bb.0:
  ; SLM-NEXT: rsqrtps (%rdi), %xmm1 # sched: [8:1.00]
  ; SLM-NEXT: rsqrtps %xmm0, %xmm0 # sched: [5:1.00]
  ; SLM-NEXT: addps %xmm0, %xmm1 # sched: [3:1.00]
- ; SLM-NEXT: movaps %xmm1, %xmm0 # sched: [1:1.00]
+ ; SLM-NEXT: movaps %xmm1, %xmm0 # sched: [1:0.50]
  ; SLM-NEXT: retq # sched: [4:1.00]
  ;
  ; SANDY-LABEL: test_rsqrtps:
  ; SANDY: # %bb.0:
  ; SANDY-NEXT: vrsqrtps %xmm0, %xmm0 # sched: [5:1.00]
  ; SANDY-NEXT: vrsqrtps (%rdi), %xmm1 # sched: [11:1.00]
  ; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
  ; SANDY-NEXT: retq # sched: [1:1.00]
[292 lines not shown]
  ; ATOM-NEXT: addps %xmm1, %xmm0 # sched: [5:5.00]
  ; ATOM-NEXT: retq # sched: [79:39.50]
  ;
  ; SLM-LABEL: test_sqrtps:
  ; SLM: # %bb.0:
  ; SLM-NEXT: sqrtps (%rdi), %xmm1 # sched: [18:1.00]
  ; SLM-NEXT: sqrtps %xmm0, %xmm0 # sched: [15:1.00]
  ; SLM-NEXT: addps %xmm0, %xmm1 # sched: [3:1.00]
- ; SLM-NEXT: movaps %xmm1, %xmm0 # sched: [1:1.00]
+ ; SLM-NEXT: movaps %xmm1, %xmm0 # sched: [1:0.50]
  ; SLM-NEXT: retq # sched: [4:1.00]
  ;
  ; SANDY-LABEL: test_sqrtps:
  ; SANDY: # %bb.0:
  ; SANDY-NEXT: vsqrtps %xmm0, %xmm0 # sched: [14:1.00]
  ; SANDY-NEXT: vsqrtps (%rdi), %xmm1 # sched: [20:1.00]
  ; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
  ; SANDY-NEXT: retq # sched: [1:1.00]
[717 lines not shown]

test/CodeGen/X86/sse2-schedule.ll

[3,558 lines not shown]
  ; ATOM-NEXT: nop # sched: [1:0.50]
  ; ATOM-NEXT: nop # sched: [1:0.50]
  ; ATOM-NEXT: nop # sched: [1:0.50]
  ; ATOM-NEXT: retq # sched: [79:39.50]
  ;
  ; SLM-LABEL: test_movsd_reg:
  ; SLM: # %bb.0:
  ; SLM-NEXT: movlhps {{.*#+}} xmm1 = xmm1[0],xmm0[0] sched: [1:1.00]
- ; SLM-NEXT: movaps %xmm1, %xmm0 # sched: [1:1.00]
+ ; SLM-NEXT: movaps %xmm1, %xmm0 # sched: [1:0.50]
  ; SLM-NEXT: retq # sched: [4:1.00]
  ;
  ; SANDY-LABEL: test_movsd_reg:
  ; SANDY: # %bb.0:
  ; SANDY-NEXT: vmovlhps {{.*#+}} xmm0 = xmm1[0],xmm0[0] sched: [1:1.00]
  ; SANDY-NEXT: retq # sched: [1:1.00]
  ;
  ; HASWELL-LABEL: test_movsd_reg:
[5,175 lines not shown]
  ; ATOM-NEXT: addpd %xmm1, %xmm0 # sched: [6:3.00]
  ; ATOM-NEXT: retq # sched: [79:39.50]
  ;
  ; SLM-LABEL: test_sqrtpd:
  ; SLM: # %bb.0:
  ; SLM-NEXT: sqrtpd (%rdi), %xmm1 # sched: [18:1.00]
  ; SLM-NEXT: sqrtpd %xmm0, %xmm0 # sched: [15:1.00]
  ; SLM-NEXT: addpd %xmm0, %xmm1 # sched: [3:1.00]
- ; SLM-NEXT: movapd %xmm1, %xmm0 # sched: [1:1.00]
+ ; SLM-NEXT: movapd %xmm1, %xmm0 # sched: [1:0.50]
  ; SLM-NEXT: retq # sched: [4:1.00]
  ;
  ; SANDY-LABEL: test_sqrtpd:
  ; SANDY: # %bb.0:
  ; SANDY-NEXT: vsqrtpd %xmm0, %xmm0 # sched: [22:1.00]
  ; SANDY-NEXT: vsqrtpd (%rdi), %xmm1 # sched: [28:1.00]
  ; SANDY-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
  ; SANDY-NEXT: retq # sched: [1:1.00]
[511 lines not shown]
  ; ATOM-NEXT: unpcklpd {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [1:1.00]
  ; ATOM-NEXT: addpd %xmm0, %xmm1 # sched: [6:3.00]
  ; ATOM-NEXT: movapd %xmm1, %xmm0 # sched: [1:0.50]
  ; ATOM-NEXT: retq # sched: [79:39.50]
  ;
  ; SLM-LABEL: test_unpcklpd:
  ; SLM: # %bb.0:
  ; SLM-NEXT: unpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0] sched: [1:1.00]
- ; SLM-NEXT: movapd %xmm0, %xmm1 # sched: [1:1.00]
+ ; SLM-NEXT: movapd %xmm0, %xmm1 # sched: [1:0.50]
  ; SLM-NEXT: unpcklpd {{.*#+}} xmm1 = xmm1[0],mem[0] sched: [4:1.00]
  ; SLM-NEXT: addpd %xmm0, %xmm1 # sched: [3:1.00]
- ; SLM-NEXT: movapd %xmm1, %xmm0 # sched: [1:1.00]
+ ; SLM-NEXT: movapd %xmm1, %xmm0 # sched: [1:0.50]
  ; SLM-NEXT: retq # sched: [4:1.00]
  ;
  ; SANDY-LABEL: test_unpcklpd:
  ; SANDY: # %bb.0:
  ; SANDY-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0] sched: [1:1.00]
  ; SANDY-NEXT: vunpcklpd {{.*#+}} xmm1 = xmm0[0],mem[0] sched: [7:1.00]
  ; SANDY-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
  ; SANDY-NEXT: retq # sched: [1:1.00]
[131 lines not shown]

test/CodeGen/X86/sse41-schedule.ll

[157 lines not shown]
  ; GENERIC-NEXT: movaps %xmm2, %xmm0 # sched: [1:1.00]
  ; GENERIC-NEXT: blendvpd %xmm0, %xmm1, %xmm3 # sched: [2:1.00]
  ; GENERIC-NEXT: blendvpd %xmm0, (%rdi), %xmm3 # sched: [8:1.00]
  ; GENERIC-NEXT: movapd %xmm3, %xmm0 # sched: [1:1.00]
  ; GENERIC-NEXT: retq # sched: [1:1.00]
  ;
  ; SLM-LABEL: test_blendvpd:
  ; SLM: # %bb.0:
- ; SLM-NEXT: movapd %xmm0, %xmm3 # sched: [1:1.00]
- ; SLM-NEXT: movaps %xmm2, %xmm0 # sched: [1:1.00]
+ ; SLM-NEXT: movapd %xmm0, %xmm3 # sched: [1:0.50]
+ ; SLM-NEXT: movaps %xmm2, %xmm0 # sched: [1:0.50]
  ; SLM-NEXT: blendvpd %xmm0, %xmm1, %xmm3 # sched: [1:1.00]
  ; SLM-NEXT: blendvpd %xmm0, (%rdi), %xmm3 # sched: [4:1.00]
- ; SLM-NEXT: movapd %xmm3, %xmm0 # sched: [1:1.00]
+ ; SLM-NEXT: movapd %xmm3, %xmm0 # sched: [1:0.50]
  ; SLM-NEXT: retq # sched: [4:1.00]
  ;
  ; SANDY-LABEL: test_blendvpd:
  ; SANDY: # %bb.0:
  ; SANDY-NEXT: vblendvpd %xmm2, %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
  ; SANDY-NEXT: vblendvpd %xmm2, (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
  ; SANDY-NEXT: retq # sched: [1:1.00]
  ;
	[...]
	; GENERIC-NEXT: movaps %xmm2, %xmm0 # sched: [1:1.00]			; GENERIC-NEXT: movaps %xmm2, %xmm0 # sched: [1:1.00]
	; GENERIC-NEXT: blendvps %xmm0, %xmm1, %xmm3 # sched: [2:1.00]			; GENERIC-NEXT: blendvps %xmm0, %xmm1, %xmm3 # sched: [2:1.00]
	; GENERIC-NEXT: blendvps %xmm0, (%rdi), %xmm3 # sched: [8:1.00]			; GENERIC-NEXT: blendvps %xmm0, (%rdi), %xmm3 # sched: [8:1.00]
	; GENERIC-NEXT: movaps %xmm3, %xmm0 # sched: [1:1.00]			; GENERIC-NEXT: movaps %xmm3, %xmm0 # sched: [1:1.00]
	; GENERIC-NEXT: retq # sched: [1:1.00]			; GENERIC-NEXT: retq # sched: [1:1.00]
	;			;
	; SLM-LABEL: test_blendvps:			; SLM-LABEL: test_blendvps:
	; SLM: # %bb.0:			; SLM: # %bb.0:
	; SLM-NEXT: movaps %xmm0, %xmm3 # sched: [1:1.00]			; SLM-NEXT: movaps %xmm0, %xmm3 # sched: [1:0.50]
	; SLM-NEXT: movaps %xmm2, %xmm0 # sched: [1:1.00]			; SLM-NEXT: movaps %xmm2, %xmm0 # sched: [1:0.50]
	; SLM-NEXT: blendvps %xmm0, %xmm1, %xmm3 # sched: [1:1.00]			; SLM-NEXT: blendvps %xmm0, %xmm1, %xmm3 # sched: [1:1.00]
	; SLM-NEXT: blendvps %xmm0, (%rdi), %xmm3 # sched: [4:1.00]			; SLM-NEXT: blendvps %xmm0, (%rdi), %xmm3 # sched: [4:1.00]
	; SLM-NEXT: movaps %xmm3, %xmm0 # sched: [1:1.00]			; SLM-NEXT: movaps %xmm3, %xmm0 # sched: [1:0.50]
	; SLM-NEXT: retq # sched: [4:1.00]			; SLM-NEXT: retq # sched: [4:1.00]
	;			;
	; SANDY-LABEL: test_blendvps:			; SANDY-LABEL: test_blendvps:
	; SANDY: # %bb.0:			; SANDY: # %bb.0:
	; SANDY-NEXT: vblendvps %xmm2, %xmm1, %xmm0, %xmm0 # sched: [2:1.00]			; SANDY-NEXT: vblendvps %xmm2, %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
	; SANDY-NEXT: vblendvps %xmm2, (%rdi), %xmm0, %xmm0 # sched: [8:1.00]			; SANDY-NEXT: vblendvps %xmm2, (%rdi), %xmm0, %xmm0 # sched: [8:1.00]
	; SANDY-NEXT: retq # sched: [1:1.00]			; SANDY-NEXT: retq # sched: [1:1.00]
	;			;
	[...]
	; GENERIC-NEXT: pblendvb %xmm0, %xmm1, %xmm3 # sched: [8:1.00]			; GENERIC-NEXT: pblendvb %xmm0, %xmm1, %xmm3 # sched: [8:1.00]
	; GENERIC-NEXT: pblendvb %xmm0, (%rdi), %xmm3 # sched: [6:1.00]			; GENERIC-NEXT: pblendvb %xmm0, (%rdi), %xmm3 # sched: [6:1.00]
	; GENERIC-NEXT: movdqa %xmm3, %xmm0 # sched: [1:0.33]			; GENERIC-NEXT: movdqa %xmm3, %xmm0 # sched: [1:0.33]
	; GENERIC-NEXT: retq # sched: [1:1.00]			; GENERIC-NEXT: retq # sched: [1:1.00]
	;			;
	; SLM-LABEL: test_pblendvb:			; SLM-LABEL: test_pblendvb:
	; SLM: # %bb.0:			; SLM: # %bb.0:
	; SLM-NEXT: movdqa %xmm0, %xmm3 # sched: [1:0.50]			; SLM-NEXT: movdqa %xmm0, %xmm3 # sched: [1:0.50]
	; SLM-NEXT: movaps %xmm2, %xmm0 # sched: [1:1.00]			; SLM-NEXT: movaps %xmm2, %xmm0 # sched: [1:0.50]
	; SLM-NEXT: pblendvb %xmm0, %xmm1, %xmm3 # sched: [1:1.00]			; SLM-NEXT: pblendvb %xmm0, %xmm1, %xmm3 # sched: [1:1.00]
	; SLM-NEXT: pblendvb %xmm0, (%rdi), %xmm3 # sched: [4:1.00]			; SLM-NEXT: pblendvb %xmm0, (%rdi), %xmm3 # sched: [4:1.00]
	; SLM-NEXT: movdqa %xmm3, %xmm0 # sched: [1:0.50]			; SLM-NEXT: movdqa %xmm3, %xmm0 # sched: [1:0.50]
	; SLM-NEXT: retq # sched: [4:1.00]			; SLM-NEXT: retq # sched: [4:1.00]
	;			;
	; SANDY-LABEL: test_pblendvb:			; SANDY-LABEL: test_pblendvb:
	; SANDY: # %bb.0:			; SANDY: # %bb.0:
	; SANDY-NEXT: vpblendvb %xmm2, %xmm1, %xmm0, %xmm0 # sched: [2:1.00]			; SANDY-NEXT: vpblendvb %xmm2, %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
	[...]
	; GENERIC-NEXT: addpd %xmm1, %xmm0 # sched: [3:1.00]			; GENERIC-NEXT: addpd %xmm1, %xmm0 # sched: [3:1.00]
	; GENERIC-NEXT: retq # sched: [1:1.00]			; GENERIC-NEXT: retq # sched: [1:1.00]
	;			;
	; SLM-LABEL: test_roundpd:			; SLM-LABEL: test_roundpd:
	; SLM: # %bb.0:			; SLM: # %bb.0:
	; SLM-NEXT: roundpd $7, (%rdi), %xmm1 # sched: [6:1.00]			; SLM-NEXT: roundpd $7, (%rdi), %xmm1 # sched: [6:1.00]
	; SLM-NEXT: roundpd $7, %xmm0, %xmm0 # sched: [3:1.00]			; SLM-NEXT: roundpd $7, %xmm0, %xmm0 # sched: [3:1.00]
	; SLM-NEXT: addpd %xmm0, %xmm1 # sched: [3:1.00]			; SLM-NEXT: addpd %xmm0, %xmm1 # sched: [3:1.00]
	; SLM-NEXT: movapd %xmm1, %xmm0 # sched: [1:1.00]			; SLM-NEXT: movapd %xmm1, %xmm0 # sched: [1:0.50]
	; SLM-NEXT: retq # sched: [4:1.00]			; SLM-NEXT: retq # sched: [4:1.00]
	;			;
	; SANDY-LABEL: test_roundpd:			; SANDY-LABEL: test_roundpd:
	; SANDY: # %bb.0:			; SANDY: # %bb.0:
	; SANDY-NEXT: vroundpd $7, %xmm0, %xmm0 # sched: [3:1.00]			; SANDY-NEXT: vroundpd $7, %xmm0, %xmm0 # sched: [3:1.00]
	; SANDY-NEXT: vroundpd $7, (%rdi), %xmm1 # sched: [9:1.00]			; SANDY-NEXT: vroundpd $7, (%rdi), %xmm1 # sched: [9:1.00]
	; SANDY-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]			; SANDY-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
	; SANDY-NEXT: retq # sched: [1:1.00]			; SANDY-NEXT: retq # sched: [1:1.00]
	[...]
	; GENERIC-NEXT: addps %xmm1, %xmm0 # sched: [3:1.00]			; GENERIC-NEXT: addps %xmm1, %xmm0 # sched: [3:1.00]
	; GENERIC-NEXT: retq # sched: [1:1.00]			; GENERIC-NEXT: retq # sched: [1:1.00]
	;			;
	; SLM-LABEL: test_roundps:			; SLM-LABEL: test_roundps:
	; SLM: # %bb.0:			; SLM: # %bb.0:
	; SLM-NEXT: roundps $7, (%rdi), %xmm1 # sched: [6:1.00]			; SLM-NEXT: roundps $7, (%rdi), %xmm1 # sched: [6:1.00]
	; SLM-NEXT: roundps $7, %xmm0, %xmm0 # sched: [3:1.00]			; SLM-NEXT: roundps $7, %xmm0, %xmm0 # sched: [3:1.00]
	; SLM-NEXT: addps %xmm0, %xmm1 # sched: [3:1.00]			; SLM-NEXT: addps %xmm0, %xmm1 # sched: [3:1.00]
	; SLM-NEXT: movaps %xmm1, %xmm0 # sched: [1:1.00]			; SLM-NEXT: movaps %xmm1, %xmm0 # sched: [1:0.50]
	; SLM-NEXT: retq # sched: [4:1.00]			; SLM-NEXT: retq # sched: [4:1.00]
	;			;
	; SANDY-LABEL: test_roundps:			; SANDY-LABEL: test_roundps:
	; SANDY: # %bb.0:			; SANDY: # %bb.0:
	; SANDY-NEXT: vroundps $7, %xmm0, %xmm0 # sched: [3:1.00]			; SANDY-NEXT: vroundps $7, %xmm0, %xmm0 # sched: [3:1.00]
	; SANDY-NEXT: vroundps $7, (%rdi), %xmm1 # sched: [9:1.00]			; SANDY-NEXT: vroundps $7, (%rdi), %xmm1 # sched: [9:1.00]
	; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]			; SANDY-NEXT: vaddps %xmm1, %xmm0, %xmm0 # sched: [3:1.00]
	; SANDY-NEXT: retq # sched: [1:1.00]			; SANDY-NEXT: retq # sched: [1:1.00]
	[...]
	; GENERIC-NEXT: movapd %xmm0, %xmm2 # sched: [1:1.00]			; GENERIC-NEXT: movapd %xmm0, %xmm2 # sched: [1:1.00]
	; GENERIC-NEXT: roundsd $7, %xmm1, %xmm2 # sched: [3:1.00]			; GENERIC-NEXT: roundsd $7, %xmm1, %xmm2 # sched: [3:1.00]
	; GENERIC-NEXT: roundsd $7, (%rdi), %xmm0 # sched: [9:1.00]			; GENERIC-NEXT: roundsd $7, (%rdi), %xmm0 # sched: [9:1.00]
	; GENERIC-NEXT: addpd %xmm2, %xmm0 # sched: [3:1.00]			; GENERIC-NEXT: addpd %xmm2, %xmm0 # sched: [3:1.00]
	; GENERIC-NEXT: retq # sched: [1:1.00]			; GENERIC-NEXT: retq # sched: [1:1.00]
	;			;
	; SLM-LABEL: test_roundsd:			; SLM-LABEL: test_roundsd:
	; SLM: # %bb.0:			; SLM: # %bb.0:
	; SLM-NEXT: movapd %xmm0, %xmm2 # sched: [1:1.00]			; SLM-NEXT: movapd %xmm0, %xmm2 # sched: [1:0.50]
	; SLM-NEXT: roundsd $7, (%rdi), %xmm0 # sched: [6:1.00]			; SLM-NEXT: roundsd $7, (%rdi), %xmm0 # sched: [6:1.00]
	; SLM-NEXT: roundsd $7, %xmm1, %xmm2 # sched: [3:1.00]			; SLM-NEXT: roundsd $7, %xmm1, %xmm2 # sched: [3:1.00]
	; SLM-NEXT: addpd %xmm2, %xmm0 # sched: [3:1.00]			; SLM-NEXT: addpd %xmm2, %xmm0 # sched: [3:1.00]
	; SLM-NEXT: retq # sched: [4:1.00]			; SLM-NEXT: retq # sched: [4:1.00]
	;			;
	; SANDY-LABEL: test_roundsd:			; SANDY-LABEL: test_roundsd:
	; SANDY: # %bb.0:			; SANDY: # %bb.0:
	; SANDY-NEXT: vroundsd $7, %xmm1, %xmm0, %xmm1 # sched: [3:1.00]			; SANDY-NEXT: vroundsd $7, %xmm1, %xmm0, %xmm1 # sched: [3:1.00]
	[...]
	; GENERIC-NEXT: movaps %xmm0, %xmm2 # sched: [1:1.00]			; GENERIC-NEXT: movaps %xmm0, %xmm2 # sched: [1:1.00]
	; GENERIC-NEXT: roundss $7, %xmm1, %xmm2 # sched: [3:1.00]			; GENERIC-NEXT: roundss $7, %xmm1, %xmm2 # sched: [3:1.00]
	; GENERIC-NEXT: roundss $7, (%rdi), %xmm0 # sched: [9:1.00]			; GENERIC-NEXT: roundss $7, (%rdi), %xmm0 # sched: [9:1.00]
	; GENERIC-NEXT: addps %xmm2, %xmm0 # sched: [3:1.00]			; GENERIC-NEXT: addps %xmm2, %xmm0 # sched: [3:1.00]
	; GENERIC-NEXT: retq # sched: [1:1.00]			; GENERIC-NEXT: retq # sched: [1:1.00]
	;			;
	; SLM-LABEL: test_roundss:			; SLM-LABEL: test_roundss:
	; SLM: # %bb.0:			; SLM: # %bb.0:
	; SLM-NEXT: movaps %xmm0, %xmm2 # sched: [1:1.00]			; SLM-NEXT: movaps %xmm0, %xmm2 # sched: [1:0.50]
	; SLM-NEXT: roundss $7, (%rdi), %xmm0 # sched: [6:1.00]			; SLM-NEXT: roundss $7, (%rdi), %xmm0 # sched: [6:1.00]
	; SLM-NEXT: roundss $7, %xmm1, %xmm2 # sched: [3:1.00]			; SLM-NEXT: roundss $7, %xmm1, %xmm2 # sched: [3:1.00]
	; SLM-NEXT: addps %xmm2, %xmm0 # sched: [3:1.00]			; SLM-NEXT: addps %xmm2, %xmm0 # sched: [3:1.00]
	; SLM-NEXT: retq # sched: [4:1.00]			; SLM-NEXT: retq # sched: [4:1.00]
	;			;
	; SANDY-LABEL: test_roundss:			; SANDY-LABEL: test_roundss:
	; SANDY: # %bb.0:			; SANDY: # %bb.0:
	; SANDY-NEXT: vroundss $7, %xmm1, %xmm0, %xmm1 # sched: [3:1.00]			; SANDY-NEXT: vroundss $7, %xmm1, %xmm0, %xmm1 # sched: [3:1.00]