This is an archive of the discontinued LLVM Phabricator instance.

[X86] Remove wrong ReadAdvance from multiclass sse_fp_unop_s
ClosedPublic

Authored by andreadb on Aug 31 2018, 7:56 AM.

Download Raw Diff

Details

Reviewers

RKSimon
courbet
craig.topper
lebedev.ri
mattd
spatel

Commits

rGfb3d9e1449b1: [X86] Remove wrong ReadAdvance from multiclass sse_fp_unop_s.
rL341326: [X86] Remove wrong ReadAdvance from multiclass sse_fp_unop_s.

Summary

A ReadAdvance was incorrectly added to the list of SchedReadWrite for the following opcodes:

sqrtss
sqrtsd
rsqrtss
rcpss

As a consequence, a wrong operand latency was computed for the register operand used by the folded load.

This patch removes the wrong ReadAdvance, and updates the llvm-mca test cases. Now the llvm-mca timeline report shows correct timings for those unary fp SSE1/SSE2 instructions.

Please let me know if okay to commit.

Thanks,
-Andrea

Diff Detail

Repository: rL LLVM

Event Timeline

andreadb created this revision.Aug 31 2018, 7:56 AM

Herald added a subscriber: gbedwell. · View Herald TranscriptAug 31 2018, 7:56 AM

andreadb added a reviewer: spatel.Aug 31 2018, 8:46 AM

I think this requires an understanding of the intent of ReadAfterLd:

// Instructions with folded loads need to read the memory operand immediately,
// but other register operands don't have to be read until the load is ready.
// These operands are marked with ReadAfterLd.

...that D51534 did not. That's because a broadcast only has one source operand, so ReadAfterLd doesn't even make sense on that instruction?

In this case, we have 2 source operands:

The loaded value that we're doing the math on.
The unchanging vector lanes of the second source (destination) register.

The patch has the intended effect of making the math op depend on the load operand, but it's not clear to me what is or should be happening in a case like this on skylake:

Trunk:

[0,0]     DeeeeeeeeeER.   dppd	$1, %xmm1, %xmm2 <--- long latency, but pipelined
[0,1]     D=eE-------R.   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D=eeeeeeeeeER   rsqrtss	(%rax), %xmm2 <--- wrong: this can't start executing before %rax is loaded

Apply this patch (remove ReadAfterLd:)

[0,0]     DeeeeeeeeeER .   dppd	$1, %xmm1, %xmm2
[0,1]     D=eE-------R .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D==eeeeeeeeeER   rsqrtss	(%rax), %xmm2  <--- is this right? the calc can begin before xmm2 is known?

But with AVX the 2nd source is explicit, and ReadAfterLd has a different effect:

[0,0]     DeeeeeeeeeER   .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D====eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3  <--- execution delayed by vdppd?

No ReadAfterLd:

[0,0]     DeeeeeeeeeER   .    .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .    .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D=========eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3   <--- execution delayed until xmm2 is known

Hi Sanjay,

Thanks for the feedback.

In D51542#1221074, @spatel wrote:
I think this requires an understanding of the intent of ReadAfterLd:
// Instructions with folded loads need to read the memory operand immediately,
// but other register operands don't have to be read until the load is ready.
// These operands are marked with ReadAfterLd.
...that D51534 did not. That's because a broadcast only has one source operand, so ReadAfterLd doesn't even make sense on that instruction?

Not only it didn't make any sense. It was even harmful because it was decreasing the use latency of the register used as the base address for the folded load by 'ReadAfterLd' cycles..

In this case, we have 2 source operands:

The loaded value that we're doing the math on.

The unchanging vector lanes of the second source (destination) register.

I think you are getting confused. These are not instructions with 2 input operands.
These are just SSE1/SSE2 unary operations (one def, and one use; see below the tablegen definition).

def m : I<opc, MRMSrcMem, (outs RC:$dst), (ins x86memop:$src1)

The patch has the intended effect of making the math op depend on the load operand, but it's not clear to me what is or should be happening in a case like this on skylake:

Trunk:
[0,0]     DeeeeeeeeeER.   dppd	$1, %xmm1, %xmm2 <--- long latency, but pipelined
[0,1]     D=eE-------R.   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D=eeeeeeeeeER   rsqrtss	(%rax), %xmm2 <--- wrong: this can't start executing before %rax is loaded
Apply this patch (remove ReadAfterLd:)
[0,0]     DeeeeeeeeeER .   dppd	$1, %xmm1, %xmm2
[0,1]     D=eE-------R .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D==eeeeeeeeeER   rsqrtss	(%rax), %xmm2  <--- is this right? the calc can begin before xmm2 is known?

Here rsqrtss cannot start if %rax is not available.
The value of %xmm2 is not needed because a) it is not an input register, and b) false dependencies on %xmm2 are eliminated by the register renamer.

But with AVX the 2nd source is explicit, and ReadAfterLd has a different effect:

[0,0]     DeeeeeeeeeER   .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D====eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3  <--- execution delayed by vdppd?

Here the use latency of %xmm2 is decreased by a number of cycles specified by the read-advance.
From the look of that timeline, you are probably testing for Skylake. Skylake defines ReadAfterLd as follows:

def : ReadAdvance<ReadAfterLd, 5>;

So the vrsqrtss can assume that %xmm2 is available 5 cycles in advance (i.e the use latency is decreased by 5 cycles).

No ReadAfterLd:

[0,0]     DeeeeeeeeeER   .    .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .    .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D=========eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3   <--- execution delayed until xmm2 is known

My patch doesn't modify/remove that ReadAfterLd. It would be wrong.
As I wrote, this patch only affects SSE1/2 definitions.

I hope this helps
-Andrea

In D51542#1221366, @andreadb wrote:
Hi Sanjay,

Thanks for the feedback.
In D51542#1221074, @spatel wrote:
I think this requires an understanding of the intent of ReadAfterLd:
// Instructions with folded loads need to read the memory operand immediately,
// but other register operands don't have to be read until the load is ready.
// These operands are marked with ReadAfterLd.
...that D51534 did not. That's because a broadcast only has one source operand, so ReadAfterLd doesn't even make sense on that instruction?
Not only it didn't make any sense. It was even harmful because it was decreasing the use latency of the register used as the base address for the folded load by 'ReadAfterLd' cycles..

In this case, we have 2 source operands:

The loaded value that we're doing the math on.

The unchanging vector lanes of the second source (destination) register.

I think you are getting confused. These are not instructions with 2 input operands.
These are just SSE1/SSE2 unary operations (one def, and one use; see below the tablegen definition).

Sorry. You are right. The output register is only partially updated.
Forget about my other comment on the register renamer; if the processor doesn't keep track of XMM registers whose upper portions have been cleared to zeros, then the instruction has to wait on the previous value of the destination register.
This is quite tricky to model for llvm-mca without adding extra domain knowledge.

That being said, the ReadAfterLd is still wrong as it applies to the register containing the base address of $src1.

There seems to be 2 things here - (1) ReadAfterLd is being used incorrectly to try to make a dependency on the dst reg and (2) we have no mechanism to handle partial dependencies like for the scalar sse instructions.

AFAICT this patch fixes the (1) issue but (2) isn't handled at all by the scheduler models yet.

lib/Target/X86/X86InstrSSE.td
2725 ↗	(On Diff #163533)	Intrinsic version needs updating as well? Although AFAICT this can't be tested within llvm-mca....

In D51542#1221933, @RKSimon wrote:

There seems to be 2 things here - (1) ReadAfterLd is being used incorrectly to try to make a dependency on the dst reg and (2) we have no mechanism to handle partial dependencies like for the scalar sse instructions.

AFAICT this patch fixes the (1) issue but (2) isn't handled at all by the scheduler models yet.

Yep.

I will raise a bug for 2. This patch only addresses the issue with ReadAfterLd being incorrectly applied to a register used as the base address of the folded load.

lib/Target/X86/X86InstrSSE.td
2725 ↗	(On Diff #163533)	I think the intrinsic version is fine because it uses three operands, and ReadAdvance correctly applies to the register operand (i.e. $src1), and not the memory operand. Also, src1 is tied to $dst. It would not be easy to test it. As you wrote, we cannot write an llvm-mca since it is a "codegen only" instruction...

In D51542#1221947, @andreadb wrote:

In D51542#1221933, @RKSimon wrote:

There seems to be 2 things here - (1) ReadAfterLd is being used incorrectly to try to make a dependency on the dst reg and (2) we have no mechanism to handle partial dependencies like for the scalar sse instructions.

AFAICT this patch fixes the (1) issue but (2) isn't handled at all by the scheduler models yet.

Yep.

I will raise a bug for 2. This patch only addresses the issue with ReadAfterLd being incorrectly applied to a register used as the base address of the folded load.

Raised bug https://bugs.llvm.org/show_bug.cgi?id=38813.

In D51542#1222120, @andreadb wrote:

In D51542#1221947, @andreadb wrote:

In D51542#1221933, @RKSimon wrote:

There seems to be 2 things here - (1) ReadAfterLd is being used incorrectly to try to make a dependency on the dst reg and (2) we have no mechanism to handle partial dependencies like for the scalar sse instructions.

AFAICT this patch fixes the (1) issue but (2) isn't handled at all by the scheduler models yet.

Yep.

I will raise a bug for 2. This patch only addresses the issue with ReadAfterLd being incorrectly applied to a register used as the base address of the folded load.

Raised bug https://bugs.llvm.org/show_bug.cgi?id=38813.

Thanks! I think everyone agrees that the ReadAfterLd is not behaving as intended on the SSE instructions. I am still wondering about the AVX case (and the SSE intrinsic is the same?):
As noted, this is testing with llvm-mca with skylake CPU:

With ReadAfterLd:

[0,0]     DeeeeeeeeeER   .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D====eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3  <--- execution delayed by vdppd?

No ReadAfterLd:

[0,0]     DeeeeeeeeeER   .    .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .    .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D=========eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3   <--- execution delayed until xmm2 is known

Do we have confirmation from hardware testing or Intel docs that the 1st timeline is correct? We can punt this question to another patch to not delay this one if needed, but I'd like to understand what the hardware does in this case.

In D51542#1222161, @spatel wrote:
In D51542#1222120, @andreadb wrote:

In D51542#1221947, @andreadb wrote:

In D51542#1221933, @RKSimon wrote:

There seems to be 2 things here - (1) ReadAfterLd is being used incorrectly to try to make a dependency on the dst reg and (2) we have no mechanism to handle partial dependencies like for the scalar sse instructions.

AFAICT this patch fixes the (1) issue but (2) isn't handled at all by the scheduler models yet.

Yep.

I will raise a bug for 2. This patch only addresses the issue with ReadAfterLd being incorrectly applied to a register used as the base address of the folded load.

Raised bug https://bugs.llvm.org/show_bug.cgi?id=38813.

Thanks! I think everyone agrees that the ReadAfterLd is not behaving as intended on the SSE instructions. I am still wondering about the AVX case (and the SSE intrinsic is the same?):
As noted, this is testing with llvm-mca with skylake CPU:
With ReadAfterLd:

[0,0]     DeeeeeeeeeER   .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D====eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3  <--- execution delayed by vdppd?

No ReadAfterLd:

[0,0]     DeeeeeeeeeER   .    .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .    .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D=========eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3   <--- execution delayed until xmm2 is known
Do we have confirmation from hardware testing or Intel docs that the 1st timeline is correct? We can punt this question to another patch to not delay this one if needed, but I'd like to understand what the hardware does in this case.

VRSQRTSS depends on XMM2, so it delayed by vdppd.
XMM2 doesn't have to necessarily be ready immediately when VRSQRTSS starts executing. That is because the uOp for the folded load is executed first, and then the loaded value is forwarded to the pipeline that executes VRSQRTSS. Only at that point, XMM2 has to be available.

I don't have a skylake to microbenchmark that sequence. However, (in my mental model) it kind of makes sense to have a ReadAfterLd there. I would be a "strange" otherwise.

In D51542#1222176, @andreadb wrote:
In D51542#1222161, @spatel wrote:
In D51542#1222120, @andreadb wrote:

In D51542#1221947, @andreadb wrote:

In D51542#1221933, @RKSimon wrote:

There seems to be 2 things here - (1) ReadAfterLd is being used incorrectly to try to make a dependency on the dst reg and (2) we have no mechanism to handle partial dependencies like for the scalar sse instructions.

AFAICT this patch fixes the (1) issue but (2) isn't handled at all by the scheduler models yet.

Yep.

I will raise a bug for 2. This patch only addresses the issue with ReadAfterLd being incorrectly applied to a register used as the base address of the folded load.

Raised bug https://bugs.llvm.org/show_bug.cgi?id=38813.

Thanks! I think everyone agrees that the ReadAfterLd is not behaving as intended on the SSE instructions. I am still wondering about the AVX case (and the SSE intrinsic is the same?):
As noted, this is testing with llvm-mca with skylake CPU:
With ReadAfterLd:

[0,0]     DeeeeeeeeeER   .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D====eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3  <--- execution delayed by vdppd?

No ReadAfterLd:

[0,0]     DeeeeeeeeeER   .    .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     D=eE-------R   .    .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     D=========eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3   <--- execution delayed until xmm2 is known
Do we have confirmation from hardware testing or Intel docs that the 1st timeline is correct? We can punt this question to another patch to not delay this one if needed, but I'd like to understand what the hardware does in this case.
VRSQRTSS depends on XMM2, so it delayed by vdppd.
XMM2 doesn't have to necessarily be ready immediately when VRSQRTSS starts executing. That is because the uOp for the folded load is executed first, and then the loaded value is forwarded to the pipeline that executes VRSQRTSS. Only at that point, XMM2 has to be available.

I don't have a skylake to microbenchmark that sequence. However, (in my mental model) it kind of makes sense to have a ReadAfterLd there. I would be a "strange" otherwise.

Note: we can observe similar behavior with btver2 as the CPU model with llvm-mca (because the ReadAfterLd is on the default instruction definition):

[0,0]     DeeeeeeeeeER   .   vdppd	$1, %xmm0, %xmm1, %xmm2
[0,1]     .DeeE------R   .   leaq	8(%rsp,%rdi,2), %rax
[0,2]     . D====eeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3

And I was thinking the opposite - it seems strange to me that hardware would devote resources to make this case faster, but maybe it's not as difficult as I'm imagining.

We don't need to hold this patch up before we answer that, so LGTM.

This revision is now accepted and ready to land.Sep 3 2018, 7:50 AM

Closed by commit rL341326: [X86] Remove wrong ReadAdvance from multiclass sse_fp_unop_s. (authored by adibiagio). · Explain WhySep 3 2018, 9:48 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path

Size

llvm/

trunk/

lib/

Target/

X86/

X86InstrSSE.td

2 lines

test/

tools/

llvm-mca/

X86/

sqrt-rsqrt-rcp-memop.s

176 lines

Diff 163733

llvm/trunk/lib/Target/X86/X86InstrSSE.td

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 2,706 Lines • ▼ Show 20 Lines	multiclass sse_fp_unop_s<bits<8> opc, string OpcodeStr, RegisterClass RC,
def r : I<opc, MRMSrcReg, (outs RC:$dst), (ins RC:$src1),		def r : I<opc, MRMSrcReg, (outs RC:$dst), (ins RC:$src1),
!strconcat(OpcodeStr, "\t{$src1, $dst\|$dst, $src1}"),		!strconcat(OpcodeStr, "\t{$src1, $dst\|$dst, $src1}"),
[(set RC:$dst, (OpNode RC:$src1))], d>, Sched<[sched]>,		[(set RC:$dst, (OpNode RC:$src1))], d>, Sched<[sched]>,
Requires<[target]>;		Requires<[target]>;
let mayLoad = 1 in		let mayLoad = 1 in
def m : I<opc, MRMSrcMem, (outs RC:$dst), (ins x86memop:$src1),		def m : I<opc, MRMSrcMem, (outs RC:$dst), (ins x86memop:$src1),
!strconcat(OpcodeStr, "\t{$src1, $dst\|$dst, $src1}"),		!strconcat(OpcodeStr, "\t{$src1, $dst\|$dst, $src1}"),
[(set RC:$dst, (OpNode (load addr:$src1)))], d>,		[(set RC:$dst, (OpNode (load addr:$src1)))], d>,
Sched<[sched.Folded, ReadAfterLd]>,		Sched<[sched.Folded]>,
Requires<[target, OptForSize]>;		Requires<[target, OptForSize]>;

let isCodeGenOnly = 1, Constraints = "$src1 = $dst", ExeDomain = d in {		let isCodeGenOnly = 1, Constraints = "$src1 = $dst", ExeDomain = d in {
def r_Int : I<opc, MRMSrcReg, (outs VR128:$dst), (ins VR128:$src1, VR128:$src2),		def r_Int : I<opc, MRMSrcReg, (outs VR128:$dst), (ins VR128:$src1, VR128:$src2),
!strconcat(OpcodeStr, "\t{$src2, $dst\|$dst, $src2}"), []>,		!strconcat(OpcodeStr, "\t{$src2, $dst\|$dst, $src2}"), []>,
Sched<[sched]>;		Sched<[sched]>;
let mayLoad = 1 in		let mayLoad = 1 in
def m_Int : I<opc, MRMSrcMem, (outs VR128:$dst), (ins VR128:$src1, intmemop:$src2),		def m_Int : I<opc, MRMSrcMem, (outs VR128:$dst), (ins VR128:$src1, intmemop:$src2),
▲ Show 20 Lines • Show All 5,483 Lines • Show Last 20 Lines

llvm/trunk/test/tools/llvm-mca/X86/sqrt-rsqrt-rcp-memop.s

	Show All 23 Lines
	leaq 8(%rsp, %rdi, 2), %rax			leaq 8(%rsp, %rdi, 2), %rax
	rcpss (%rax), %xmm1			rcpss (%rax), %xmm1
	#LLVM-MCA-END			#LLVM-MCA-END

	# ALL: [0] Code Region - test_sqrtss			# ALL: [0] Code Region - test_sqrtss

	# ALL: Timeline view:			# ALL: Timeline view:

	# BROADWELL-NEXT: 012345678			# BROADWELL-NEXT: 0123456789
	# BROADWELL-NEXT: Index 0123456789			# BROADWELL-NEXT: Index 0123456789

	# BTVER2-NEXT: 0123456789			# BTVER2-NEXT: 0123456789 0
	# BTVER2-NEXT: Index 0123456789 012345678			# BTVER2-NEXT: Index 0123456789 0123456789

	# HASWELL-NEXT: 012345678			# HASWELL-NEXT: 0123456789
	# HASWELL-NEXT: Index 0123456789			# HASWELL-NEXT: Index 0123456789

	# SKYLAKE-NEXT: 0123456789			# SKYLAKE-NEXT: 0123456789
	# SKYLAKE-NEXT: Index 0123456789			# SKYLAKE-NEXT: Index 0123456789 0

	# ZNVER1-NEXT: 0123456789			# ZNVER1-NEXT: 0123456789 0
	# ZNVER1-NEXT: Index 0123456789 0123456789			# ZNVER1-NEXT: Index 0123456789 0123456789

	# ZNVER1: [0,0] DeER . . . . . . leaq 8(%rsp,%rdi,2), %rax			# ZNVER1: [0,0] DeER . . . . . . leaq 8(%rsp,%rdi,2), %rax
	# ZNVER1-NEXT: [0,1] DeeeeeeeeeeeeeeeeeeeeeeeeeeeER sqrtss (%rax), %xmm1			# ZNVER1-NEXT: [0,1] D=eeeeeeeeeeeeeeeeeeeeeeeeeeeER sqrtss (%rax), %xmm1

	# SKYLAKE: [0,0] DeER . . . . leaq 8(%rsp,%rdi,2), %rax			# SKYLAKE: [0,0] DeER . . . . leaq 8(%rsp,%rdi,2), %rax
	# SKYLAKE-NEXT: [0,1] DeeeeeeeeeeeeeeeeeER sqrtss (%rax), %xmm1			# SKYLAKE-NEXT: [0,1] D=eeeeeeeeeeeeeeeeeER sqrtss (%rax), %xmm1

	# BROADWELL: [0,0] DeER . . . . leaq 8(%rsp,%rdi,2), %rax			# BROADWELL: [0,0] DeER . . . . leaq 8(%rsp,%rdi,2), %rax
	# BROADWELL-NEXT: [0,1] DeeeeeeeeeeeeeeeeER sqrtss (%rax), %xmm1			# BROADWELL-NEXT: [0,1] D=eeeeeeeeeeeeeeeeER sqrtss (%rax), %xmm1

	# HASWELL: [0,0] DeER . . . . leaq 8(%rsp,%rdi,2), %rax			# HASWELL: [0,0] DeER . . . . leaq 8(%rsp,%rdi,2), %rax
	# HASWELL-NEXT: [0,1] DeeeeeeeeeeeeeeeeER sqrtss (%rax), %xmm1			# HASWELL-NEXT: [0,1] D=eeeeeeeeeeeeeeeeER sqrtss (%rax), %xmm1

	# BTVER2: [0,0] DeeER. . . . . . leaq 8(%rsp,%rdi,2), %rax			# BTVER2: [0,0] DeeER. . . . . . leaq 8(%rsp,%rdi,2), %rax
	# BTVER2-NEXT: [0,1] DeeeeeeeeeeeeeeeeeeeeeeeeeeER sqrtss (%rax), %xmm1			# BTVER2-NEXT: [0,1] D==eeeeeeeeeeeeeeeeeeeeeeeeeeER sqrtss (%rax), %xmm1

	# ALL: Average Wait times (based on the timeline view):			# ALL: Average Wait times (based on the timeline view):
	# ALL-NEXT: [0]: Executions			# ALL-NEXT: [0]: Executions
	# ALL-NEXT: [1]: Average time spent waiting in a scheduler's queue			# ALL-NEXT: [1]: Average time spent waiting in a scheduler's queue
	# ALL-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready			# ALL-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready
	# ALL-NEXT: [3]: Average time elapsed from WB until retire stage			# ALL-NEXT: [3]: Average time elapsed from WB until retire stage

	# ALL: [0] [1] [2] [3]			# ALL: [0] [1] [2] [3]
	# ALL-NEXT: 0. 1 1.0 1.0 0.0 leaq 8(%rsp,%rdi,2), %rax			# ALL-NEXT: 0. 1 1.0 1.0 0.0 leaq 8(%rsp,%rdi,2), %rax
	# ALL-NEXT: 1. 1 1.0 0.0 0.0 sqrtss (%rax), %xmm1
				# BROADWELL-NEXT: 1. 1 2.0 0.0 0.0 sqrtss (%rax), %xmm1
				# BTVER2-NEXT: 1. 1 3.0 0.0 0.0 sqrtss (%rax), %xmm1
				# HASWELL-NEXT: 1. 1 2.0 0.0 0.0 sqrtss (%rax), %xmm1
				# SKYLAKE-NEXT: 1. 1 2.0 0.0 0.0 sqrtss (%rax), %xmm1
				# ZNVER1-NEXT: 1. 1 2.0 0.0 0.0 sqrtss (%rax), %xmm1

	# ALL: [1] Code Region - test_sqrtsd			# ALL: [1] Code Region - test_sqrtsd

	# ALL: Timeline view:			# ALL: Timeline view:

	# BROADWELL-NEXT: 0123456789			# BROADWELL-NEXT: 0123456789
	# BROADWELL-NEXT: Index 0123456789 0123			# BROADWELL-NEXT: Index 0123456789 01234

	# BTVER2-NEXT: 0123456789 01234			# BTVER2-NEXT: 0123456789 0123456
	# BTVER2-NEXT: Index 0123456789 0123456789			# BTVER2-NEXT: Index 0123456789 0123456789

	# HASWELL-NEXT: 0123456789			# HASWELL-NEXT: 0123456789
	# HASWELL-NEXT: Index 0123456789 0123			# HASWELL-NEXT: Index 0123456789 01234

	# SKYLAKE-NEXT: 0123456789			# SKYLAKE-NEXT: 0123456789
	# SKYLAKE-NEXT: Index 0123456789 012345			# SKYLAKE-NEXT: Index 0123456789 0123456

	# ZNVER1-NEXT: 0123456789			# ZNVER1-NEXT: 0123456789 0
	# ZNVER1-NEXT: Index 0123456789 0123456789			# ZNVER1-NEXT: Index 0123456789 0123456789

	# ZNVER1: [0,0] DeER . . . . . . leaq 8(%rsp,%rdi,2), %rax			# ZNVER1: [0,0] DeER . . . . . . leaq 8(%rsp,%rdi,2), %rax
	# ZNVER1-NEXT: [0,1] DeeeeeeeeeeeeeeeeeeeeeeeeeeeER sqrtsd (%rax), %xmm1			# ZNVER1-NEXT: [0,1] D=eeeeeeeeeeeeeeeeeeeeeeeeeeeER sqrtsd (%rax), %xmm1

	# SKYLAKE: [0,0] DeER . . . . . leaq 8(%rsp,%rdi,2), %rax			# SKYLAKE: [0,0] DeER . . . . .. leaq 8(%rsp,%rdi,2), %rax
	# SKYLAKE-NEXT: [0,1] DeeeeeeeeeeeeeeeeeeeeeeeER sqrtsd (%rax), %xmm1			# SKYLAKE-NEXT: [0,1] D=eeeeeeeeeeeeeeeeeeeeeeeER sqrtsd (%rax), %xmm1

	# BROADWELL: [0,0] DeER . . . . . leaq 8(%rsp,%rdi,2), %rax			# BROADWELL: [0,0] DeER . . . . . leaq 8(%rsp,%rdi,2), %rax
	# BROADWELL-NEXT: [0,1] DeeeeeeeeeeeeeeeeeeeeeER sqrtsd (%rax), %xmm1			# BROADWELL-NEXT: [0,1] D=eeeeeeeeeeeeeeeeeeeeeER sqrtsd (%rax), %xmm1

	# HASWELL: [0,0] DeER . . . . . leaq 8(%rsp,%rdi,2), %rax			# HASWELL: [0,0] DeER . . . . . leaq 8(%rsp,%rdi,2), %rax
	# HASWELL-NEXT: [0,1] DeeeeeeeeeeeeeeeeeeeeeER sqrtsd (%rax), %xmm1			# HASWELL-NEXT: [0,1] D=eeeeeeeeeeeeeeeeeeeeeER sqrtsd (%rax), %xmm1

	# BTVER2: [0,0] DeeER. . . . . . . leaq 8(%rsp,%rdi,2), %rax			# BTVER2: [0,0] DeeER. . . . . . .. leaq 8(%rsp,%rdi,2), %rax
	# BTVER2-NEXT: [0,1] DeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeER sqrtsd (%rax), %xmm1			# BTVER2-NEXT: [0,1] D==eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeER sqrtsd (%rax), %xmm1

	# ALL: Average Wait times (based on the timeline view):			# ALL: Average Wait times (based on the timeline view):
	# ALL-NEXT: [0]: Executions			# ALL-NEXT: [0]: Executions
	# ALL-NEXT: [1]: Average time spent waiting in a scheduler's queue			# ALL-NEXT: [1]: Average time spent waiting in a scheduler's queue
	# ALL-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready			# ALL-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready
	# ALL-NEXT: [3]: Average time elapsed from WB until retire stage			# ALL-NEXT: [3]: Average time elapsed from WB until retire stage

	# ALL: [0] [1] [2] [3]			# ALL: [0] [1] [2] [3]
	# ALL-NEXT: 0. 1 1.0 1.0 0.0 leaq 8(%rsp,%rdi,2), %rax			# ALL-NEXT: 0. 1 1.0 1.0 0.0 leaq 8(%rsp,%rdi,2), %rax
	# ALL-NEXT: 1. 1 1.0 0.0 0.0 sqrtsd (%rax), %xmm1

	# ALL: [2] Code Region - test_rsqrtss			# BROADWELL-NEXT: 1. 1 2.0 0.0 0.0 sqrtsd (%rax), %xmm1
				# BTVER2-NEXT: 1. 1 3.0 0.0 0.0 sqrtsd (%rax), %xmm1
				# HASWELL-NEXT: 1. 1 2.0 0.0 0.0 sqrtsd (%rax), %xmm1
				# SKYLAKE-NEXT: 1. 1 2.0 0.0 0.0 sqrtsd (%rax), %xmm1
				# ZNVER1-NEXT: 1. 1 2.0 0.0 0.0 sqrtsd (%rax), %xmm1

	# SKYLAKE: Timeline view:			# ALL: [2] Code Region - test_rsqrtss
	# SKYLAKE-NEXT: 01
	# SKYLAKE-NEXT: Index 0123456789

	# BROADWELL: Timeline view:
	# BROADWELL-NEXT: 012
	# BROADWELL-NEXT: Index 0123456789

	# HASWELL: Timeline view:			# ALL: Timeline view:
	# HASWELL-NEXT: 012
	# HASWELL-NEXT: Index 0123456789

	# ZNVER1: Timeline view:			# BROADWELL-NEXT: 0123
	# ZNVER1-NEXT: 01234			# BTVER2-NEXT: 01
	# ZNVER1-NEXT: Index 0123456789			# HASWELL-NEXT: 0123
				# SKYLAKE-NEXT: 012
				# ZNVER1-NEXT: 012345

	# BTVER2: Timeline view:			# ALL-NEXT: Index 0123456789
	# BTVER2-NEXT: Index 0123456789

	# ZNVER1: [0,0] DeER . . . leaq 8(%rsp,%rdi,2), %rax			# ZNVER1: [0,0] DeER . . . leaq 8(%rsp,%rdi,2), %rax
	# ZNVER1-NEXT: [0,1] DeeeeeeeeeeeeER rsqrtss (%rax), %xmm1			# ZNVER1-NEXT: [0,1] D=eeeeeeeeeeeeER rsqrtss (%rax), %xmm1

	# BROADWELL: [0,0] DeER . . . leaq 8(%rsp,%rdi,2), %rax			# BROADWELL: [0,0] DeER . . . leaq 8(%rsp,%rdi,2), %rax
	# BROADWELL-NEXT: [0,1] DeeeeeeeeeeER rsqrtss (%rax), %xmm1			# BROADWELL-NEXT: [0,1] D=eeeeeeeeeeER rsqrtss (%rax), %xmm1

	# HASWELL: [0,0] DeER . . . leaq 8(%rsp,%rdi,2), %rax			# HASWELL: [0,0] DeER . . . leaq 8(%rsp,%rdi,2), %rax
	# HASWELL-NEXT: [0,1] DeeeeeeeeeeER rsqrtss (%rax), %xmm1			# HASWELL-NEXT: [0,1] D=eeeeeeeeeeER rsqrtss (%rax), %xmm1

	# SKYLAKE: [0,0] DeER . .. leaq 8(%rsp,%rdi,2), %rax			# SKYLAKE: [0,0] DeER . . . leaq 8(%rsp,%rdi,2), %rax
	# SKYLAKE-NEXT: [0,1] DeeeeeeeeeER rsqrtss (%rax), %xmm1			# SKYLAKE-NEXT: [0,1] D=eeeeeeeeeER rsqrtss (%rax), %xmm1

	# BTVER2: [0,0] DeeER. . leaq 8(%rsp,%rdi,2), %rax			# BTVER2: [0,0] DeeER. .. leaq 8(%rsp,%rdi,2), %rax
	# BTVER2-NEXT: [0,1] DeeeeeeeER rsqrtss (%rax), %xmm1			# BTVER2-NEXT: [0,1] D==eeeeeeeER rsqrtss (%rax), %xmm1

	# ALL: Average Wait times (based on the timeline view):			# ALL: Average Wait times (based on the timeline view):
	# ALL-NEXT: [0]: Executions			# ALL-NEXT: [0]: Executions
	# ALL-NEXT: [1]: Average time spent waiting in a scheduler's queue			# ALL-NEXT: [1]: Average time spent waiting in a scheduler's queue
	# ALL-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready			# ALL-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready
	# ALL-NEXT: [3]: Average time elapsed from WB until retire stage			# ALL-NEXT: [3]: Average time elapsed from WB until retire stage

	# ALL: [0] [1] [2] [3]			# ALL: [0] [1] [2] [3]
	# ALL-NEXT: 0. 1 1.0 1.0 0.0 leaq 8(%rsp,%rdi,2), %rax			# ALL-NEXT: 0. 1 1.0 1.0 0.0 leaq 8(%rsp,%rdi,2), %rax
	# ALL-NEXT: 1. 1 1.0 0.0 0.0 rsqrtss (%rax), %xmm1

	# ALL: [3] Code Region - test_rcp

	# SKYLAKE: Timeline view:			# BROADWELL-NEXT: 1. 1 2.0 0.0 0.0 rsqrtss (%rax), %xmm1
	# SKYLAKE-NEXT: 01			# BTVER2-NEXT: 1. 1 3.0 0.0 0.0 rsqrtss (%rax), %xmm1
	# SKYLAKE-NEXT: Index 0123456789			# HASWELL-NEXT: 1. 1 2.0 0.0 0.0 rsqrtss (%rax), %xmm1
				# SKYLAKE-NEXT: 1. 1 2.0 0.0 0.0 rsqrtss (%rax), %xmm1
				# ZNVER1-NEXT: 1. 1 2.0 0.0 0.0 rsqrtss (%rax), %xmm1

	# BROADWELL: Timeline view:			# ALL: [3] Code Region - test_rcp
	# BROADWELL-NEXT: 012
	# BROADWELL-NEXT: Index 0123456789

	# HASWELL: Timeline view:			# ALL: Timeline view:
	# HASWELL-NEXT: 012
	# HASWELL-NEXT: Index 0123456789

	# ZNVER1: Timeline view:			# BROADWELL-NEXT: 0123
	# ZNVER1-NEXT: 01234			# BTVER2-NEXT: 01
	# ZNVER1-NEXT: Index 0123456789			# HASWELL-NEXT: 0123
				# SKYLAKE-NEXT: 012
				# ZNVER1-NEXT: 012345

	# BTVER2: Timeline view:			# ALL-NEXT: Index 0123456789
	# BTVER2-NEXT: Index 0123456789

	# ZNVER1: [0,0] DeER . . . leaq 8(%rsp,%rdi,2), %rax			# ZNVER1: [0,0] DeER . . . leaq 8(%rsp,%rdi,2), %rax
	# ZNVER1-NEXT: [0,1] DeeeeeeeeeeeeER rcpss (%rax), %xmm1			# ZNVER1-NEXT: [0,1] D=eeeeeeeeeeeeER rcpss (%rax), %xmm1

	# BROADWELL: [0,0] DeER . . . leaq 8(%rsp,%rdi,2), %rax			# BROADWELL: [0,0] DeER . . . leaq 8(%rsp,%rdi,2), %rax
	# BROADWELL-NEXT: [0,1] DeeeeeeeeeeER rcpss (%rax), %xmm1			# BROADWELL-NEXT: [0,1] D=eeeeeeeeeeER rcpss (%rax), %xmm1

	# HASWELL: [0,0] DeER . . . leaq 8(%rsp,%rdi,2), %rax			# HASWELL: [0,0] DeER . . . leaq 8(%rsp,%rdi,2), %rax
	# HASWELL-NEXT: [0,1] DeeeeeeeeeeER rcpss (%rax), %xmm1			# HASWELL-NEXT: [0,1] D=eeeeeeeeeeER rcpss (%rax), %xmm1

	# SKYLAKE: [0,0] DeER . .. leaq 8(%rsp,%rdi,2), %rax			# SKYLAKE: [0,0] DeER . . . leaq 8(%rsp,%rdi,2), %rax
	# SKYLAKE-NEXT: [0,1] DeeeeeeeeeER rcpss (%rax), %xmm1			# SKYLAKE-NEXT: [0,1] D=eeeeeeeeeER rcpss (%rax), %xmm1

	# BTVER2: [0,0] DeeER. . leaq 8(%rsp,%rdi,2), %rax			# BTVER2: [0,0] DeeER. .. leaq 8(%rsp,%rdi,2), %rax
	# BTVER2-NEXT: [0,1] DeeeeeeeER rcpss (%rax), %xmm1			# BTVER2-NEXT: [0,1] D==eeeeeeeER rcpss (%rax), %xmm1

	# ALL: Average Wait times (based on the timeline view):			# ALL: Average Wait times (based on the timeline view):
	# ALL-NEXT: [0]: Executions			# ALL-NEXT: [0]: Executions
	# ALL-NEXT: [1]: Average time spent waiting in a scheduler's queue			# ALL-NEXT: [1]: Average time spent waiting in a scheduler's queue
	# ALL-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready			# ALL-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready
	# ALL-NEXT: [3]: Average time elapsed from WB until retire stage			# ALL-NEXT: [3]: Average time elapsed from WB until retire stage

	# ALL: [0] [1] [2] [3]			# ALL: [0] [1] [2] [3]
	# ALL-NEXT: 0. 1 1.0 1.0 0.0 leaq 8(%rsp,%rdi,2), %rax			# ALL-NEXT: 0. 1 1.0 1.0 0.0 leaq 8(%rsp,%rdi,2), %rax
	# ALL-NEXT: 1. 1 1.0 0.0 0.0 rcpss (%rax), %xmm1
				# BROADWELL-NEXT: 1. 1 2.0 0.0 0.0 rcpss (%rax), %xmm1
				# BTVER2-NEXT: 1. 1 3.0 0.0 0.0 rcpss (%rax), %xmm1
				# HASWELL-NEXT: 1. 1 2.0 0.0 0.0 rcpss (%rax), %xmm1
				# SKYLAKE-NEXT: 1. 1 2.0 0.0 0.0 rcpss (%rax), %xmm1
				# ZNVER1-NEXT: 1. 1 2.0 0.0 0.0 rcpss (%rax), %xmm1