This is an archive of the discontinued LLVM Phabricator instance.

Differential D48877

[X86][BtVer2][MCA] Recognize CMPEQ one-idioms
AbandonedPublic

Authored by lebedev.ri on Jul 3 2018, 6:35 AM.

Details

Reviewers

RKSimon
courbet
andreadb

Summary

Commit message of rL334303 said:

As detailed on Agner's Microarchitecture doc (21.8 AMD Bobcat and
Jaguar pipeline - Dependency-breaking instructions), these instructions
are dependency breaking and fast-path zero the destination register
(and appropriate EFLAGS bits).

That same section also lists PCMPEQx, right before PCMPGTx.
So these are also dependency-breaking, although they produce ones and still consume resources.

Found accidentally while continuing to look into the bdver2 scheduling profile.

Diff Detail

Repository: rL LLVM

Event Timeline

lebedev.ri created this revision. Jul 3 2018, 6:35 AM

Herald added a subscriber: gbedwell. Jul 3 2018, 6:35 AM

lebedev.ri added a parent revision: D48876: [X86][BtVer2][MCA][NFC] Add CMPEQ one-idioms tests. Jul 3 2018, 6:35 AM

lebedev.ri added a child revision: D48878: [X86][BtVer2][NFC] Regexify zero-idioms.

I think you're confusing zero-idioms with dependency breaking instructions.

Zero idioms break dependencies, don't use any resources (the frontend just sets the PRF value to zero) and then retire.

PCMPEQ 'ones' patterns can break dependencies but still use resources to set the bits to all ones before retiring.

In D48877#1150824, @RKSimon wrote:

I think you're confusing zero-idioms with dependency breaking instructions.

Yes, indeed.

Zero idioms break dependencies, don't use any resources (the frontend just sets the PRF value to zero) and then retire.

PCMPEQ 'ones' patterns can break dependencies but still use resources to set the bits to all ones before retiring.

Would it be valuable to salvage those into one-idioms.s with a comment that these are dep-breaking?
I'm not sure any special handling is needed in the sched model.

Hmm, so if they still consume resources, does this mean that the lack of latency is what makes them special?

FIXME: how do i test a custom snippet in llvm-exegesis? CC @courbet

$ perl -e 'print "pcmpgtb %xmm5,%xmm5\n"x10000; print "retq"' > /tmp/snippet-ones,txt

On Intel hardware, in register renaming it basically rewrites the input register to an internal all zeros register. The instruction still needs to execute and now compares zero with zero and gets all ones. But by rewriting the input register to all zeros instead of the real input, the instruction doesn't need to wait for the previous writer of the input register to execute.
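
As a rough illustration of the renaming trick described above, here is a toy Python sketch. Everything in it (`ZERO_REG`, `rename_sources`, the mnemonic check) is a hypothetical name invented for the example; it is not how any real renamer or LLVM component is implemented.

```python
# Toy model only; ZERO_REG and rename_sources are hypothetical names.
ZERO_REG = "zero"  # stands in for an internal register that is always ready


def rename_sources(mnemonic, srcs):
    """Return the registers an instruction has to wait on after renaming."""
    is_compare_equal = mnemonic.lower().startswith(("pcmpeq", "vpcmpeq"))
    same_register = len(set(srcs)) == 1
    if is_compare_equal and same_register:
        # x == x is true for every x, so comparing zero with zero gives the
        # same all-ones result without waiting for the real register.
        return [ZERO_REG] * len(srcs)
    return list(srcs)


print(rename_sources("vpcmpeqb", ["xmm0", "xmm0"]))  # ['zero', 'zero']
print(rename_sources("vpcmpeqb", ["xmm1", "xmm0"]))  # ['xmm1', 'xmm0']
```

The instruction itself is unchanged and still has to execute; only the set of registers it waits on differs.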

In D48877#1151042, @lebedev.ri wrote:
Hmm, so if they still consume resources, does this mean that the lack of latency is what makes them special?

FIXME: how do i test a custom snippet in llvm-exegesis? CC @courbet
$ perl -e 'print "pcmpgtb %xmm5,%xmm5\n"x10000; print "retq"' > /tmp/snippet-ones,txt

For now, you can't, but it's very easy to add, so we should support it. I've created PR38048.

In D48877#1151042, @lebedev.ri wrote:

Hmm, so if they still consume resources, does this mean that the lack of latency is what makes them special?

The PCMPEQ 'all ones' idiom is a regular instruction - it consumes resources and has a latency before its result is available for any instructions that depend on it.

What it doesn't have to do is wait for its source registers to be available:

VDIVPS %xmm1, %xmm0, %xmm0    <---- Big latency 
VPCMPEQB %xmm1, %xmm0, %xmm0  <---- Must wait a loooooong time until VDIVPS has completed

VDIVPS %xmm1, %xmm0, %xmm0    <---- Big latency 
VPCMPEQB %xmm0, %xmm0, %xmm0  <---- 'Ones Idiom' - can execute immediately, doesn't wait for VDIVPS
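
The difference between the two sequences can be sketched with a toy timing model in Python. The latencies (19 cycles for the divide, 1 for the compare) are placeholders picked for illustration, not BtVer2 numbers, and the model ignores dispatch width and execution resources entirely; `schedule` and `is_ones_idiom` are made-up helper names.

```python
# Toy model: unlimited resources, made-up latencies, hypothetical helper names.
LATENCY = {"vdivps": 19, "vpcmpeqb": 1}  # placeholder values


def is_ones_idiom(mnemonic, srcs):
    """A compare-equal of a register with itself always yields all ones."""
    return mnemonic.startswith("vpcmpeq") and len(set(srcs)) == 1


def schedule(program):
    """Return the cycle at which each instruction can start executing."""
    ready_at = {}  # register -> cycle at which its newest value is available
    starts = []
    for mnemonic, srcs, dst in program:
        if is_ones_idiom(mnemonic, srcs):
            start = 0  # dependency broken: no wait for the source registers
        else:
            start = max(ready_at.get(reg, 0) for reg in srcs)
        starts.append(start)
        # The result still takes the full latency to become available.
        ready_at[dst] = start + LATENCY[mnemonic]
    return starts


div = ("vdivps", ["xmm1", "xmm0"], "xmm0")
regular = [div, ("vpcmpeqb", ["xmm1", "xmm0"], "xmm0")]
idiom = [div, ("vpcmpeqb", ["xmm0", "xmm0"], "xmm0")]

print(schedule(regular))  # [0, 19]
print(schedule(idiom))    # [0, 0]
```

In both cases the compare keeps its own latency; only the wait on the divide goes away.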

In D48877#1151974, @RKSimon wrote:

The PCMPEQ 'all ones' idiom is a regular instruction - it consumes resources and has a latency before its result is available for any instructions that depend on it.

What it doesn't have to do is wait for its source registers to be available:
VDIVPS %xmm1, %xmm0, %xmm0    <---- Big latency 
VPCMPEQB %xmm1, %xmm0, %xmm0  <---- Must wait a loooooong time until VDIVPS has completed
vs
VDIVPS %xmm1, %xmm0, %xmm0    <---- Big latency 
VPCMPEQB %xmm0, %xmm0, %xmm0  <---- 'Ones Idiom' - can execute immediately, doesn't wait for VDIVPS

Ok, well, i guess what i was trying to ask/understand is, is that already properly represented https://godbolt.org/g/9rYPYA, or not?

In D48877#1152009, @lebedev.ri wrote:

Ok, well, i guess what i was trying to ask/understand is, is that already properly represented https://godbolt.org/g/9rYPYA, or not?

No, we don't properly model dependency breaking instructions yet - zero-idioms are making use of a special case of llvm-mca that assumes dependency breaking if no resources are used - IMO that's something that should be removed and we come up with a better way to model this.

In D48877#1152098, @RKSimon wrote:

Ok, well, i guess what i was trying to ask/understand is, is that already properly represented https://godbolt.org/g/9rYPYA, or not?

No, we don't properly model dependency breaking instructions yet - zero-idioms are making use of a special case of llvm-mca that assumes dependency breaking if no resources are used - IMO that's something that should be removed and we come up with a better way to model this.

Simon is right on this.
We still don't model dependency breaking instructions. There is already a plan to teach llvm-mca how to identify those instructions, and that is next on my TODO list. Once we have that system in place, we can remove the "zero-latency implies dependency-breaking" hack in llvm-mca.

This patch doesn't do the right thing. The timeline clearly shows how dependencies are not broken.

-Andrea

In D48877#1152115, @andreadb wrote:

Simon is right on this.
We still don't model dependency breaking instructions. There is already a plan to teach llvm-mca how to identify those instructions, and that is next on my TODO list. Once we have that system in place, we can remove the "zero-latency implies dependency-breaking" hack in llvm-mca.

This patch doesn't do the right thing. The timeline clearly shows how dependencies are not broken.

Ok, that is actually good, i was starting to question my [rudimentary] understanding of all this.

-Andrea

Then, back to square one, are D48876 tests of any use? :)

In D48877#1152117, @lebedev.ri wrote:

Ok, that is actually good, i was starting to question my [rudimentary] understanding of all this.

Then, back to square one, are D48876 tests of any use? :)

You can commit those tests to show that we don't correctly model dependency breaking packed compare instructions on BtVer2. However, I would remove the padd from the tests.

Great, thank you all for pointing out that this is bogus :)

Diffusion mentioned this in rL336292: [X86][BtVer2][MCA][NFC] Add CMPEQ dependency-breaking one-idioms tests. Jul 4 2018, 10:37 AM

Revision Contents

Path                                           Size
lib/Target/X86/X86ScheduleBtVer2.td            24 lines
test/tools/llvm-mca/X86/BtVer2/one-idioms.s    114 lines

Diff 153936

lib/Target/X86/X86ScheduleBtVer2.td

 def : InstRW<[JWriteVZeroIdiomALUX], (instrs PSUBBrr, VPSUBBrr,
                                              PSUBDrr, VPSUBDrr,
                                              PSUBQrr, VPSUBQrr,
                                              PSUBWrr, VPSUBWrr,
                                              PCMPGTBrr, VPCMPGTBrr,
                                              PCMPGTDrr, VPCMPGTDrr,
                                              PCMPGTQrr, VPCMPGTQrr,
                                              PCMPGTWrr, VPCMPGTWrr)>;

+// There are also dependency-breaking one-idioms, but they do consume resources.
+def JWriteOneLatency : SchedWriteRes<[JFPU01, JVALU]> {
+  let Latency = 0;
+}
+
+def JWriteVOneIdiomALU : SchedWriteVariant<[
+    SchedVar<MCSchedPredicate<ZeroIdiomPredicate>, [JWriteOneLatency]>,
+    SchedVar<MCSchedPredicate<TruePred>,           [WriteVecALU]>
+]>;
+def : InstRW<[JWriteVOneIdiomALU], (instrs MMX_PCMPEQBirr,
+                                           MMX_PCMPEQDirr,
+                                           // MMX_PCMPGTQirr is invalid
+                                           MMX_PCMPEQWirr)>;
+
+def JWriteVOneIdiomALUX : SchedWriteVariant<[
+    SchedVar<MCSchedPredicate<ZeroIdiomPredicate>, [JWriteOneLatency]>,
+    SchedVar<MCSchedPredicate<TruePred>,           [WriteVecALUX]>
+]>;
+def : InstRW<[JWriteVOneIdiomALUX], (instrs PCMPEQBrr, VPCMPEQBrr,
+                                            PCMPEQDrr, VPCMPEQDrr,
+                                            PCMPEQQrr, VPCMPEQQrr,
+                                            PCMPEQWrr, VPCMPEQWrr)>;
+
 } // SchedModel

test/tools/llvm-mca/X86/BtVer2/one-idioms.s

	Show All 28 Lines

	vpcmpeqb %xmm3, %xmm3, %xmm5			vpcmpeqb %xmm3, %xmm3, %xmm5
	vpcmpeqd %xmm3, %xmm3, %xmm5			vpcmpeqd %xmm3, %xmm3, %xmm5
	vpcmpeqq %xmm3, %xmm3, %xmm5			vpcmpeqq %xmm3, %xmm3, %xmm5
	vpcmpeqw %xmm3, %xmm3, %xmm5			vpcmpeqw %xmm3, %xmm3, %xmm5

	# CHECK: Iterations: 1			# CHECK: Iterations: 1
	# CHECK-NEXT: Instructions: 19			# CHECK-NEXT: Instructions: 19
	# CHECK-NEXT: Total Cycles: 15			# CHECK-NEXT: Total Cycles: 14
	# CHECK-NEXT: Dispatch Width: 2			# CHECK-NEXT: Dispatch Width: 2
	# CHECK-NEXT: IPC: 1.27			# CHECK-NEXT: IPC: 1.36
	# CHECK-NEXT: Block RThroughput: 9.5			# CHECK-NEXT: Block RThroughput: 9.5

	# CHECK: Instruction Info:			# CHECK: Instruction Info:
	# CHECK-NEXT: [1]: #uOps			# CHECK-NEXT: [1]: #uOps
	# CHECK-NEXT: [2]: Latency			# CHECK-NEXT: [2]: Latency
	# CHECK-NEXT: [3]: RThroughput			# CHECK-NEXT: [3]: RThroughput
	# CHECK-NEXT: [4]: MayLoad			# CHECK-NEXT: [4]: MayLoad
	# CHECK-NEXT: [5]: MayStore			# CHECK-NEXT: [5]: MayStore
	# CHECK-NEXT: [6]: HasSideEffects			# CHECK-NEXT: [6]: HasSideEffects

	# CHECK: [1] [2] [3] [4] [5] [6] Instructions:			# CHECK: [1] [2] [3] [4] [5] [6] Instructions:
	# CHECK-NEXT: 1 1 0.50 paddb %mm2, %mm2			# CHECK-NEXT: 1 1 0.50 paddb %mm2, %mm2
	# CHECK-NEXT: 1 1 0.50 paddb %mm2, %mm2			# CHECK-NEXT: 1 1 0.50 paddb %mm2, %mm2
	# CHECK-NEXT: 1 1 0.50 pcmpeqb %mm2, %mm2			# CHECK-NEXT: 1 0 0.50 pcmpeqb %mm2, %mm2
	# CHECK-NEXT: 1 1 0.50 pcmpeqd %mm2, %mm2			# CHECK-NEXT: 1 0 0.50 pcmpeqd %mm2, %mm2
	# CHECK-NEXT: 1 1 0.50 pcmpeqw %mm2, %mm2			# CHECK-NEXT: 1 0 0.50 pcmpeqw %mm2, %mm2
	# CHECK-NEXT: 1 1 0.50 paddb %xmm2, %xmm2			# CHECK-NEXT: 1 1 0.50 paddb %xmm2, %xmm2
	# CHECK-NEXT: 1 1 0.50 paddb %xmm2, %xmm2			# CHECK-NEXT: 1 1 0.50 paddb %xmm2, %xmm2
	# CHECK-NEXT: 1 1 0.50 pcmpeqb %xmm2, %xmm2			# CHECK-NEXT: 1 0 0.50 pcmpeqb %xmm2, %xmm2
	# CHECK-NEXT: 1 1 0.50 pcmpeqd %xmm2, %xmm2			# CHECK-NEXT: 1 0 0.50 pcmpeqd %xmm2, %xmm2
	# CHECK-NEXT: 1 1 0.50 pcmpeqq %xmm2, %xmm2			# CHECK-NEXT: 1 0 0.50 pcmpeqq %xmm2, %xmm2
	# CHECK-NEXT: 1 1 0.50 pcmpeqw %xmm2, %xmm2			# CHECK-NEXT: 1 0 0.50 pcmpeqw %xmm2, %xmm2
	# CHECK-NEXT: 1 1 0.50 vpcmpeqb %xmm3, %xmm3, %xmm3			# CHECK-NEXT: 1 0 0.50 vpcmpeqb %xmm3, %xmm3, %xmm3
	# CHECK-NEXT: 1 1 0.50 vpcmpeqd %xmm3, %xmm3, %xmm3			# CHECK-NEXT: 1 0 0.50 vpcmpeqd %xmm3, %xmm3, %xmm3
	# CHECK-NEXT: 1 1 0.50 vpcmpeqq %xmm3, %xmm3, %xmm3			# CHECK-NEXT: 1 0 0.50 vpcmpeqq %xmm3, %xmm3, %xmm3
	# CHECK-NEXT: 1 1 0.50 vpcmpeqw %xmm3, %xmm3, %xmm3			# CHECK-NEXT: 1 0 0.50 vpcmpeqw %xmm3, %xmm3, %xmm3
	# CHECK-NEXT: 1 1 0.50 vpcmpeqb %xmm3, %xmm3, %xmm5			# CHECK-NEXT: 1 0 0.50 vpcmpeqb %xmm3, %xmm3, %xmm5
	# CHECK-NEXT: 1 1 0.50 vpcmpeqd %xmm3, %xmm3, %xmm5			# CHECK-NEXT: 1 0 0.50 vpcmpeqd %xmm3, %xmm3, %xmm5
	# CHECK-NEXT: 1 1 0.50 vpcmpeqq %xmm3, %xmm3, %xmm5			# CHECK-NEXT: 1 0 0.50 vpcmpeqq %xmm3, %xmm3, %xmm5
	# CHECK-NEXT: 1 1 0.50 vpcmpeqw %xmm3, %xmm3, %xmm5			# CHECK-NEXT: 1 0 0.50 vpcmpeqw %xmm3, %xmm3, %xmm5

	# CHECK: Register File statistics:			# CHECK: Register File statistics:
	# CHECK-NEXT: Total number of mappings created: 19			# CHECK-NEXT: Total number of mappings created: 19
	# CHECK-NEXT: Max number of mappings used: 10			# CHECK-NEXT: Max number of mappings used: 8

	# CHECK: * Register File #1 -- JFpuPRF:			# CHECK: * Register File #1 -- JFpuPRF:
	# CHECK-NEXT: Number of physical registers: 72			# CHECK-NEXT: Number of physical registers: 72
	# CHECK-NEXT: Total number of mappings created: 19			# CHECK-NEXT: Total number of mappings created: 19
	# CHECK-NEXT: Max number of mappings used: 10			# CHECK-NEXT: Max number of mappings used: 8

	# CHECK: * Register File #2 -- JIntegerPRF:			# CHECK: * Register File #2 -- JIntegerPRF:
	# CHECK-NEXT: Number of physical registers: 64			# CHECK-NEXT: Number of physical registers: 64
	# CHECK-NEXT: Total number of mappings created: 0			# CHECK-NEXT: Total number of mappings created: 0
	# CHECK-NEXT: Max number of mappings used: 0			# CHECK-NEXT: Max number of mappings used: 0

	# CHECK: Resources:			# CHECK: Resources:
	# CHECK-NEXT: [0] - JALU0			# CHECK-NEXT: [0] - JALU0
	Show All 15 Lines
	# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]			# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]
	# CHECK-NEXT: - - - - - 9.00 10.00 - - - - 9.00 10.00 -			# CHECK-NEXT: - - - - - 9.00 10.00 - - - - 9.00 10.00 -

	# CHECK: Resource pressure by instruction:			# CHECK: Resource pressure by instruction:
	# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:			# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:
	# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - paddb %mm2, %mm2			# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - paddb %mm2, %mm2
	# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - paddb %mm2, %mm2			# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - paddb %mm2, %mm2
	# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - pcmpeqb %mm2, %mm2			# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - pcmpeqb %mm2, %mm2
	# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - pcmpeqd %mm2, %mm2			# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - pcmpeqd %mm2, %mm2
	# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - pcmpeqw %mm2, %mm2			# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - pcmpeqw %mm2, %mm2
	# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - paddb %xmm2, %xmm2			# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - paddb %xmm2, %xmm2
	# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - paddb %xmm2, %xmm2			# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - paddb %xmm2, %xmm2
	# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - pcmpeqb %xmm2, %xmm2			# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - pcmpeqb %xmm2, %xmm2
	# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - pcmpeqd %xmm2, %xmm2			# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - pcmpeqd %xmm2, %xmm2
	# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - pcmpeqq %xmm2, %xmm2			# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - pcmpeqq %xmm2, %xmm2
	# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - pcmpeqw %xmm2, %xmm2			# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - pcmpeqw %xmm2, %xmm2
	# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - vpcmpeqb %xmm3, %xmm3, %xmm3			# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - vpcmpeqb %xmm3, %xmm3, %xmm3
	# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - vpcmpeqd %xmm3, %xmm3, %xmm3			# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - vpcmpeqd %xmm3, %xmm3, %xmm3
	# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - vpcmpeqq %xmm3, %xmm3, %xmm3			# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - vpcmpeqq %xmm3, %xmm3, %xmm3
	# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - vpcmpeqw %xmm3, %xmm3, %xmm3			# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - vpcmpeqw %xmm3, %xmm3, %xmm3
	# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - vpcmpeqb %xmm3, %xmm3, %xmm5			# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - vpcmpeqb %xmm3, %xmm3, %xmm5
	# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - vpcmpeqd %xmm3, %xmm3, %xmm5			# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - vpcmpeqd %xmm3, %xmm3, %xmm5
	# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - vpcmpeqq %xmm3, %xmm3, %xmm5			# CHECK-NEXT: - - - - - 1.00 - - - - - 1.00 - - vpcmpeqq %xmm3, %xmm3, %xmm5
	# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - vpcmpeqw %xmm3, %xmm3, %xmm5			# CHECK-NEXT: - - - - - - 1.00 - - - - - 1.00 - vpcmpeqw %xmm3, %xmm3, %xmm5

	# CHECK: Timeline view:			# CHECK: Timeline view:
	# CHECK-NEXT: 01234			# CHECK-NEXT: 0123
	# CHECK-NEXT: Index 0123456789			# CHECK-NEXT: Index 0123456789

	# CHECK: [0,0] DeER . . . paddb %mm2, %mm2			# CHECK: [0,0] DeER . . . paddb %mm2, %mm2
	# CHECK-NEXT: [0,1] D=eER. . . paddb %mm2, %mm2			# CHECK-NEXT: [0,1] D=eER. . . paddb %mm2, %mm2
	# CHECK-NEXT: [0,2] .D=eER . . pcmpeqb %mm2, %mm2			# CHECK-NEXT: [0,2] .D=ER. . . pcmpeqb %mm2, %mm2
	# CHECK-NEXT: [0,3] .D==eER . . pcmpeqd %mm2, %mm2			# CHECK-NEXT: [0,3] .D=E-R . . pcmpeqd %mm2, %mm2
	# CHECK-NEXT: [0,4] . D==eER . . pcmpeqw %mm2, %mm2			# CHECK-NEXT: [0,4] . D=ER . . pcmpeqw %mm2, %mm2
	# CHECK-NEXT: [0,5] . DeE--R . . paddb %xmm2, %xmm2			# CHECK-NEXT: [0,5] . D=eER . . paddb %xmm2, %xmm2
	# CHECK-NEXT: [0,6] . DeE--R . . paddb %xmm2, %xmm2			# CHECK-NEXT: [0,6] . D=eER . . paddb %xmm2, %xmm2
	# CHECK-NEXT: [0,7] . D=eE-R . . pcmpeqb %xmm2, %xmm2			# CHECK-NEXT: [0,7] . D==ER . . pcmpeqb %xmm2, %xmm2
	# CHECK-NEXT: [0,8] . D=eE-R. . pcmpeqd %xmm2, %xmm2			# CHECK-NEXT: [0,8] . D=E-R . . pcmpeqd %xmm2, %xmm2
	# CHECK-NEXT: [0,9] . D==eER. . pcmpeqq %xmm2, %xmm2			# CHECK-NEXT: [0,9] . D==ER . . pcmpeqq %xmm2, %xmm2
	# CHECK-NEXT: [0,10] . D==eER . pcmpeqw %xmm2, %xmm2			# CHECK-NEXT: [0,10] . D=E-R. . pcmpeqw %xmm2, %xmm2
	# CHECK-NEXT: [0,11] . DeE--R . vpcmpeqb %xmm3, %xmm3, %xmm3			# CHECK-NEXT: [0,11] . D==ER. . vpcmpeqb %xmm3, %xmm3, %xmm3
	# CHECK-NEXT: [0,12] . .DeE--R . vpcmpeqd %xmm3, %xmm3, %xmm3			# CHECK-NEXT: [0,12] . .D=E-R . vpcmpeqd %xmm3, %xmm3, %xmm3
	# CHECK-NEXT: [0,13] . .D=eE-R . vpcmpeqq %xmm3, %xmm3, %xmm3			# CHECK-NEXT: [0,13] . .D==ER . vpcmpeqq %xmm3, %xmm3, %xmm3
	# CHECK-NEXT: [0,14] . . D=eE-R . vpcmpeqw %xmm3, %xmm3, %xmm3			# CHECK-NEXT: [0,14] . . D=E-R . vpcmpeqw %xmm3, %xmm3, %xmm3
	# CHECK-NEXT: [0,15] . . D==eER . vpcmpeqb %xmm3, %xmm3, %xmm5			# CHECK-NEXT: [0,15] . . D==ER . vpcmpeqb %xmm3, %xmm3, %xmm5
	# CHECK-NEXT: [0,16] . . D=eE-R. vpcmpeqd %xmm3, %xmm3, %xmm5			# CHECK-NEXT: [0,16] . . D=E-R. vpcmpeqd %xmm3, %xmm3, %xmm5
	# CHECK-NEXT: [0,17] . . D==eER. vpcmpeqq %xmm3, %xmm3, %xmm5			# CHECK-NEXT: [0,17] . . D==ER. vpcmpeqq %xmm3, %xmm3, %xmm5
	# CHECK-NEXT: [0,18] . . D=eE-R vpcmpeqw %xmm3, %xmm3, %xmm5			# CHECK-NEXT: [0,18] . . D=E-R vpcmpeqw %xmm3, %xmm3, %xmm5

	# CHECK: Average Wait times (based on the timeline view):			# CHECK: Average Wait times (based on the timeline view):
	# CHECK-NEXT: [0]: Executions			# CHECK-NEXT: [0]: Executions
	# CHECK-NEXT: [1]: Average time spent waiting in a scheduler's queue			# CHECK-NEXT: [1]: Average time spent waiting in a scheduler's queue
	# CHECK-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready			# CHECK-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready
	# CHECK-NEXT: [3]: Average time elapsed from WB until retire stage			# CHECK-NEXT: [3]: Average time elapsed from WB until retire stage

	# CHECK: [0] [1] [2] [3]			# CHECK: [0] [1] [2] [3]
	# CHECK-NEXT: 0. 1 1.0 1.0 0.0 paddb %mm2, %mm2			# CHECK-NEXT: 0. 1 1.0 1.0 0.0 paddb %mm2, %mm2
	# CHECK-NEXT: 1. 1 2.0 0.0 0.0 paddb %mm2, %mm2			# CHECK-NEXT: 1. 1 2.0 0.0 0.0 paddb %mm2, %mm2
	# CHECK-NEXT: 2. 1 2.0 0.0 0.0 pcmpeqb %mm2, %mm2			# CHECK-NEXT: 2. 1 2.0 0.0 0.0 pcmpeqb %mm2, %mm2
	# CHECK-NEXT: 3. 1 3.0 0.0 0.0 pcmpeqd %mm2, %mm2			# CHECK-NEXT: 3. 1 2.0 0.0 1.0 pcmpeqd %mm2, %mm2
	# CHECK-NEXT: 4. 1 3.0 0.0 0.0 pcmpeqw %mm2, %mm2			# CHECK-NEXT: 4. 1 2.0 1.0 0.0 pcmpeqw %mm2, %mm2
	# CHECK-NEXT: 5. 1 1.0 1.0 2.0 paddb %xmm2, %xmm2			# CHECK-NEXT: 5. 1 2.0 2.0 0.0 paddb %xmm2, %xmm2
	# CHECK-NEXT: 6. 1 1.0 0.0 2.0 paddb %xmm2, %xmm2			# CHECK-NEXT: 6. 1 2.0 0.0 0.0 paddb %xmm2, %xmm2
	# CHECK-NEXT: 7. 1 2.0 0.0 1.0 pcmpeqb %xmm2, %xmm2			# CHECK-NEXT: 7. 1 3.0 0.0 0.0 pcmpeqb %xmm2, %xmm2
	# CHECK-NEXT: 8. 1 2.0 0.0 1.0 pcmpeqd %xmm2, %xmm2			# CHECK-NEXT: 8. 1 2.0 0.0 1.0 pcmpeqd %xmm2, %xmm2
	# CHECK-NEXT: 9. 1 3.0 0.0 0.0 pcmpeqq %xmm2, %xmm2			# CHECK-NEXT: 9. 1 3.0 1.0 0.0 pcmpeqq %xmm2, %xmm2
	# CHECK-NEXT: 10. 1 3.0 0.0 0.0 pcmpeqw %xmm2, %xmm2			# CHECK-NEXT: 10. 1 2.0 0.0 1.0 pcmpeqw %xmm2, %xmm2
	# CHECK-NEXT: 11. 1 1.0 1.0 2.0 vpcmpeqb %xmm3, %xmm3, %xmm3			# CHECK-NEXT: 11. 1 3.0 3.0 0.0 vpcmpeqb %xmm3, %xmm3, %xmm3
	# CHECK-NEXT: 12. 1 1.0 0.0 2.0 vpcmpeqd %xmm3, %xmm3, %xmm3			# CHECK-NEXT: 12. 1 2.0 0.0 1.0 vpcmpeqd %xmm3, %xmm3, %xmm3
	# CHECK-NEXT: 13. 1 2.0 0.0 1.0 vpcmpeqq %xmm3, %xmm3, %xmm3			# CHECK-NEXT: 13. 1 3.0 1.0 0.0 vpcmpeqq %xmm3, %xmm3, %xmm3
	# CHECK-NEXT: 14. 1 2.0 0.0 1.0 vpcmpeqw %xmm3, %xmm3, %xmm3			# CHECK-NEXT: 14. 1 2.0 0.0 1.0 vpcmpeqw %xmm3, %xmm3, %xmm3
	# CHECK-NEXT: 15. 1 3.0 0.0 0.0 vpcmpeqb %xmm3, %xmm3, %xmm5			# CHECK-NEXT: 15. 1 3.0 1.0 0.0 vpcmpeqb %xmm3, %xmm3, %xmm5
	# CHECK-NEXT: 16. 1 2.0 0.0 1.0 vpcmpeqd %xmm3, %xmm3, %xmm5			# CHECK-NEXT: 16. 1 2.0 1.0 1.0 vpcmpeqd %xmm3, %xmm3, %xmm5
	# CHECK-NEXT: 17. 1 3.0 1.0 0.0 vpcmpeqq %xmm3, %xmm3, %xmm5			# CHECK-NEXT: 17. 1 3.0 2.0 0.0 vpcmpeqq %xmm3, %xmm3, %xmm5
	# CHECK-NEXT: 18. 1 2.0 1.0 1.0 vpcmpeqw %xmm3, %xmm3, %xmm5			# CHECK-NEXT: 18. 1 2.0 2.0 1.0 vpcmpeqw %xmm3, %xmm3, %xmm5
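
For reference, the IPC values in the summary lines of the test above are the instruction count divided by the total cycle count. A quick Python check, using only the numbers shown in the two columns of the diff (the helper name `ipc` is made up for this example):

```python
def ipc(instructions, total_cycles):
    """Instructions per cycle, formatted to two decimals like the report."""
    return f"{instructions / total_cycles:.2f}"


print(ipc(19, 15))  # 1.27 (left column)
print(ipc(19, 14))  # 1.36 (right column)
```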