
Details

Reviewers

craig.topper
RKSimon
atrick
spatel
gbedwell
filcab

Commits

rG39e5a5695fd3: [RFC][patch 3/3] Add support for variant scheduling classes in llvm-mca.
rL333909: [RFC][patch 3/3] Add support for variant scheduling classes in llvm-mca.

Summary

This patch is the third of a sequence of three patches related to LLVM-dev RFC "MC support for variant scheduling classes". http://lists.llvm.org/pipermail/llvm-dev/2018-May/123181.html

This patch requires D47077 to be applied first.

The main goal of this patch is to teach llvm-mca how to resolve variant scheduling classes.
In addition, it adds a variant scheduling class to the BtVer2 scheduling model to identify so-called zero-idioms (dependency-breaking instructions that are known to produce zero and that are optimized out at the register renaming stage).

Without the BtVer2 change, this patch would not have had any tests.
This patch is effectively the union of two changes:

1. the llvm-mca change that enables the resolution of variant scheduling classes, and
2. the change to the BtVer2 scheduling model.

Point 2. (partially) fixes PR36671.
Point 1. fixes PR36672.

@RKSimon and @craig.topper, the new scheduling predicate for the XOR zero-idiom is quite simple. In the future, we could move predicates that are valid for multiple processor models into a common .td file.
For now, I am keeping that predicate check in the BtVer2 model.
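
As a rough illustration of what such a zero-idiom predicate checks, here is a minimal C++ sketch. It assumes the operand layout implied by the `CheckSameRegOperand<1, 2>` definition in the final diff (operand 0 is the destination, operands 1 and 2 are the sources). `ToyInst`, `sameRegOperands` and `isXorZeroIdiom` are hypothetical names used only for this example; they are not LLVM APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an MCInst: only register operands are modeled.
// Assumption: operand 0 is the destination, operands 1 and 2 are the sources.
struct ToyInst {
  std::vector<unsigned> RegOperands;
};

// True when the operands at the two given indices name the same register.
bool sameRegOperands(const ToyInst &Inst, std::size_t First,
                     std::size_t Second) {
  assert(First < Inst.RegOperands.size() &&
         Second < Inst.RegOperands.size() && "operand index out of range");
  return Inst.RegOperands[First] == Inst.RegOperands[Second];
}

// An XOR is treated as a zero-idiom when both source operands match,
// regardless of which register is written.
bool isXorZeroIdiom(const ToyInst &Inst) {
  return sameRegOperands(Inst, 1, 2);
}
```

Under this layout, `vxorps %xmm4, %xmm4, %xmm5` would count as a zero-idiom because only the two sources are compared.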

Please let me know if okay to commit.

Thanks
-Andrea

Diff Detail

Repository: rL LLVM

Event Timeline

andreadb created this revision.May 25 2018, 8:14 AM

Herald added a subscriber: tschuett. · May 25 2018, 8:14 AM

RKSimon added inline comments.May 25 2018, 8:38 AM

lib/Target/X86/X86ScheduleBtVer2.td
550 ↗	(On Diff #148608)	Add a TODO saying this may go into X86Schedule.td in the future?
575 ↗	(On Diff #148608)	Reference Agner's microarchitecture doc as well. He explicitly says this works for Jaguar; the AMD 16h SOG unfortunately misses it, and just having a reference to a different CPU is confusing.
test/CodeGen/X86/sse-schedule.ll
6084 ↗	(On Diff #148608)	This is unfortunate - ideally we'd keep [1:0.50] here and zero idioms would get [0:0.50]
test/tools/llvm-mca/X86/BtVer2/zero-idioms.s
32 ↗	(On Diff #148608)	Should the RThroughput still be limited by the issue width?
tools/llvm-mca/InstrBuilder.cpp
385 ↗	(On Diff #148608)	Drop braces - also is it worth pulling out SM.getProcessorID()? It might get this down to a single line for tidiness.

andreadb added inline comments.May 25 2018, 8:53 AM

lib/Target/X86/X86ScheduleBtVer2.td
550 ↗	(On Diff #148608)	Will do.
575 ↗	(On Diff #148608)	Okay. I will add that reference.
test/CodeGen/X86/sse-schedule.ll
6084 ↗	(On Diff #148608)	I should have mentioned this in the summary. The existing functionality used to obtain the latency from an MCInst doesn't know how to resolve variant scheduling classes. So, it accepts a machine opcode ID, and it bails out if the scheduling class identifier from the opcode descriptor refers to a variant class. The idea is to fix it, but in a follow-up patch. I can add a FIXME comment there.
test/tools/llvm-mca/X86/BtVer2/zero-idioms.s
32 ↗	(On Diff #148608)	It doesn't take into account the issue width. In this case, the MCSchedInfo method returned an invalid RThroughput (an Optional<double>). On the other hand, the Block RThroughput correctly reports 3.0 for the sequence of 6 instructions (for a total of 6 uOps).
tools/llvm-mca/InstrBuilder.cpp
385 ↗	(On Diff #148608)	Okay. I will fix it.
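
The Block RThroughput figure quoted in the zero-idioms.s reply above can be reproduced with simple arithmetic. The sketch below is an assumption about the dispatch-width bound only, not llvm-mca's actual implementation: when no processor resource is consumed (as with the zero-idioms in this test), the number of micro-opcodes divided by the dispatch width gives the cycles per iteration. `dispatchBoundRThroughput` is a hypothetical helper name.

```cpp
#include <cassert>

// Lower bound on cycles per iteration imposed by the dispatch width alone.
// Assumption: the instructions consume no processor resources, so the
// dispatch width is the only limiting factor.
double dispatchBoundRThroughput(unsigned TotalMicroOps,
                                unsigned DispatchWidth) {
  assert(DispatchWidth > 0 && "dispatch width must be positive");
  return static_cast<double>(TotalMicroOps) / DispatchWidth;
}
```

With the dispatch width of 2 reported for btver2 in the test, 6 uOps give 3.0, and the 9 uOps of the final test file give the 4.5 shown in its CHECK lines.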

Patch updated.

Addressed review comments.

andreadb mentioned this in D46698: [RFC][llvm-mca][patch 3/3] Add support for variant scheduling classes in llvm-mca..May 25 2018, 9:52 AM

RKSimon mentioned this in D47377: [X86][Sched] Fix WriteZero sched class for all CPUs..May 25 2018, 10:21 AM

courbet added a subscriber: courbet.May 25 2018, 11:06 AM

courbet added inline comments.

lib/Target/X86/X86ScheduleBtVer2.td
569 ↗	(On Diff #148623)	I think you should also set NumMicroOps to 0 here.

andreadb added inline comments.May 25 2018, 11:14 AM

lib/Target/X86/X86ScheduleBtVer2.td
569 ↗	(On Diff #148623)	Here, the number of opcodes is what gets subtracted from the IssueWidth budget for the current cycle. On Jaguar, that quantity matches what the docs define as a COP (complex opcode). Think of it as a container of uOps. An SSE xor is always decoded into a single COP. That COP is then sent to the RCU (where it takes one slot in the reorder buffer), and it is eliminated at the register renaming stage. Basically, although it is zero-latency, it still has to be retired. It consumes one slot in the ROB and one slot in the dispatch group (subtracted from what llvm-mca calls "DispatchWidth", the equivalent of the IssueWidth in the scheduling model).
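
The point that a zero-latency instruction still consumes a dispatch slot can be sketched with a toy model. This is a simplification under stated assumptions, not how llvm-mca is implemented: every instruction is a single opcode, dispatch is in order, each instruction takes one slot of the per-cycle budget, and it retires in the cycle after dispatch. `ToyRunStats` and `simulateZeroLatencyStream` are hypothetical names.

```cpp
#include <cassert>

struct ToyRunStats {
  unsigned TotalCycles;
  double IPC;
};

// Toy in-order dispatch model. Assumptions: one opcode per instruction,
// zero latency, no resource usage, retire one cycle after dispatch.
ToyRunStats simulateZeroLatencyStream(unsigned NumInstructions,
                                      unsigned DispatchWidth) {
  assert(NumInstructions > 0 && DispatchWidth > 0);
  unsigned Cycle = 0;
  unsigned Budget = DispatchWidth; // dispatch slots left in this cycle
  unsigned LastDispatchCycle = 0;
  for (unsigned I = 0; I < NumInstructions; ++I) {
    if (Budget == 0) { // dispatch group is full: move to the next cycle
      ++Cycle;
      Budget = DispatchWidth;
    }
    --Budget; // even a zero-latency opcode takes one slot
    LastDispatchCycle = Cycle;
  }
  // One extra cycle for the final retire; cycles are counted from zero.
  unsigned TotalCycles = LastDispatchCycle + 2;
  return {TotalCycles, static_cast<double>(NumInstructions) / TotalCycles};
}
```

For 45 instructions and a dispatch width of 2, this toy model yields 24 cycles and an IPC of 1.875, in line with the `Total Cycles: 24` and `IPC: 1.88` lines of the final zero-idioms.s test.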

andreadb edited the summary of this revision. May 25 2018, 11:20 AM

RKSimon added inline comments.May 27 2018, 7:05 AM

lib/Target/X86/X86ScheduleBtVer2.td
560 ↗	(On Diff #148623)	Just to be clear, can this work for GPR zero-idioms as well (XOR32rr/XOR64rr etc)?

andreadb added inline comments.May 27 2018, 12:41 PM

lib/Target/X86/X86ScheduleBtVer2.td
560 ↗	(On Diff #148623)	Yes. `JZeroIdiomPredicate` can be used to describe XOR32rr and XOR64rr zero-idioms too. You would still need a fall-back strategy for when the XOR is not a zero idiom. That means, you would need to add the following case to `JWriteZeroIdiom`: `SchedVar<MCSchedPredicate<JIntXOR>, [WriteALU]>` where JIntXOR is defined as a `CheckOpcode<[XOR32rr, XOR64rr]>`.
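
The fall-back strategy described above can be pictured as an ordered list of (predicate, write) pairs in which the first matching predicate selects the write. The C++ sketch below models that idea only. `ToyInst`, `ToySchedVar`, `selectWrite` and `makeToyZeroIdiomVariant` are hypothetical names, and the operand layout (destination at index 0, sources at indices 1 and 2) is an assumption.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for an MCInst.
struct ToyInst {
  std::string Opcode;
  std::vector<unsigned> RegOperands; // assumed layout: dst, src1, src2
};

// One case of a variant: a predicate plus the write it selects.
struct ToySchedVar {
  std::function<bool(const ToyInst &)> Predicate;
  std::string Write;
};

bool isZeroIdiom(const ToyInst &Inst) {
  assert(Inst.RegOperands.size() >= 3 && "expected dst, src1, src2");
  return Inst.RegOperands[1] == Inst.RegOperands[2];
}

// Plays the role of the JIntXOR opcode check mentioned in the reply.
bool isIntXor(const ToyInst &Inst) {
  return Inst.Opcode == "XOR32rr" || Inst.Opcode == "XOR64rr";
}

// Cases are tried in order; the first matching predicate wins.
std::string selectWrite(const std::vector<ToySchedVar> &Variants,
                        const ToyInst &Inst) {
  for (const ToySchedVar &V : Variants)
    if (V.Predicate(Inst))
      return V.Write;
  return std::string(); // no case matched
}

std::vector<ToySchedVar> makeToyZeroIdiomVariant() {
  return {{isZeroIdiom, "JWriteZeroLatency"}, {isIntXor, "WriteALU"}};
}
```

Because the zero-idiom case comes first, an `XOR32rr` with identical sources selects the zero-latency write, while any other `XOR32rr` falls through to `WriteALU`.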

andreadb mentioned this in D47536: [MCSchedule] Add the ability to correctly compute the latency and throughput information for MCInst..May 30 2018, 8:08 AM

Patch updated.

This patch is now dependent on D47536.
D47536 avoids the loss of test coverage in -print-schedule tests.

Diffusion mentioned this in rL333650: [MCSchedule] Add the ability to compute the latency and throughput information….May 31 2018, 6:38 AM

Ping.

Patch updated.

Simplified the scheduling predicates as suggested off-line by SimonP.

Patch updated (sorry for the spam).

This time, with an updated/extended test case.

RKSimon added inline comments.Jun 2 2018, 9:05 AM

lib/Target/X86/X86ScheduleBtVer2.td
558 ↗	(On Diff #149496)	This should probably be moved to X86Schedule.td straightaway. I see no benefit in keeping this here, as zero-idioms are something we're going to want in all x86 scheduler models. Rename it to either X86ZeroIdiomPredicate or X86SameRegOps01 (or something like that).

Address review comments.

Patch updated. This time with the correct diff...

RKSimon added inline comments.Jun 4 2018, 5:52 AM

CodeGen/X86/sse-schedule.ll
6228 ↗	(On Diff #149712)	These are unfortunate. Could you please raise an upstream Bugzilla report about throughput printing defaulting to the issue width?
llvm-mca/InstrBuilder.cpp
388 ↗	(On Diff #149712)	So !SchedClassID can only occur here due to variant resolution failing?

andreadb added inline comments.Jun 4 2018, 6:43 AM

CodeGen/X86/sse-schedule.ll
6228 ↗	(On Diff #149712)	Makes sense. I will raise a bug for it.
llvm-mca/InstrBuilder.cpp
388 ↗	(On Diff #149712)	Good point. Strictly speaking, the scheduling class ID associated with MCI should always be valid, but I cannot guarantee it. I will add an if-stmt to guard against invalid classes.
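
The guard discussed above can be sketched as a loop that keeps resolving while the current class is a variant and treats ID 0 as an invalid result. This is a toy model under stated assumptions: in the real code the resolution depends on the instruction being analyzed, whereas here a fixed table stands in for it. `ToySchedClass` and `resolveSchedClass` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical scheduling class entry. ResolvesTo is only meaningful for
// variant classes; 0 means "the variant could not be resolved".
struct ToySchedClass {
  bool IsVariant;
  unsigned ResolvesTo;
};

// Returns the first non-variant class reached from SchedClassID, or 0 when
// the input is invalid or the resolution fails.
unsigned resolveSchedClass(const std::map<unsigned, ToySchedClass> &Classes,
                           unsigned SchedClassID) {
  std::size_t Steps = 0;
  while (SchedClassID != 0) {
    auto It = Classes.find(SchedClassID);
    if (It == Classes.end())
      return 0; // unknown class ID
    if (!It->second.IsVariant)
      return SchedClassID; // fully resolved
    ++Steps;
    assert(Steps <= Classes.size() && "cycle in variant resolution");
    SchedClassID = It->second.ResolvesTo;
  }
  return 0; // invalid input or failed resolution
}
```

A caller would check for a zero result before looking up the class descriptor, which is the role played by the `report_fatal_error` branch in the committed diff.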

Address Simon's review comment.

Raised https://bugs.llvm.org/show_bug.cgi?id=37678 for the -print-schedule issue with the throughput computation of zero-latency instructions.

Patch updated.

You've lost the diff context, but LGTM - thanks.

This revision is now accepted and ready to land.Jun 4 2018, 8:15 AM

Closed by commit rL333909: [RFC][patch 3/3] Add support for variant scheduling classes in llvm-mca. (authored by adibiagio). · Jun 4 2018, 8:47 AM

This revision was automatically updated to reflect the committed changes.

spatel mentioned this in D47723: [CodeGen] print max throughput for 0-latency insts.Jun 4 2018, 8:57 AM

spatel mentioned this in rL334055: [CodeGen] assume max/default throughput for unspecified instructions.Jun 5 2018, 4:39 PM

Diff 149775

llvm/trunk/lib/Target/X86/X86Schedule.td

	Show First 20 Lines • Show All 553 Lines • ▼ Show 20 Lines
	def SchedWriteFSqrtSizes			def SchedWriteFSqrtSizes
	: X86SchedWriteSizes<SchedWriteFSqrt, SchedWriteFSqrt64>;			: X86SchedWriteSizes<SchedWriteFSqrt, SchedWriteFSqrt64>;
	def SchedWriteFLogicSizes			def SchedWriteFLogicSizes
	: X86SchedWriteSizes<SchedWriteFLogic, SchedWriteFLogic>;			: X86SchedWriteSizes<SchedWriteFLogic, SchedWriteFLogic>;
	def SchedWriteFShuffleSizes			def SchedWriteFShuffleSizes
	: X86SchedWriteSizes<SchedWriteFShuffle, SchedWriteFShuffle>;			: X86SchedWriteSizes<SchedWriteFShuffle, SchedWriteFShuffle>;

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
				// Common MCInstPredicate definitions used by variant scheduling classes.

				def ZeroIdiomPredicate : CheckSameRegOperand<1, 2>;

				//===----------------------------------------------------------------------===//
	// Generic Processor Scheduler Models.			// Generic Processor Scheduler Models.

	// IssueWidth is analogous to the number of decode units. Core and its			// IssueWidth is analogous to the number of decode units. Core and its
	// descendents, including Nehalem and SandyBridge have 4 decoders.			// descendents, including Nehalem and SandyBridge have 4 decoders.
	// Resources beyond the decoder operate on micro-ops and are bufferred			// Resources beyond the decoder operate on micro-ops and are bufferred
	// so adjacent micro-ops don't directly compete.			// so adjacent micro-ops don't directly compete.
	//			//
	// MicroOpBufferSize > 1 indicates that RAW dependencies can be			// MicroOpBufferSize > 1 indicates that RAW dependencies can be
	Show All 27 Lines

llvm/trunk/lib/Target/X86/X86ScheduleBtVer2.td

	Show First 20 Lines • Show All 540 Lines • ▼ Show 20 Lines
	}			}
	def : InstRW<[JWriteJVZEROALL], (instrs VZEROALL)>;			def : InstRW<[JWriteJVZEROALL], (instrs VZEROALL)>;

	def JWriteJVZEROUPPER: SchedWriteRes<[]> {			def JWriteJVZEROUPPER: SchedWriteRes<[]> {
	let Latency = 46;			let Latency = 46;
	let NumMicroOps = 37;			let NumMicroOps = 37;
	}			}
	def : InstRW<[JWriteJVZEROUPPER], (instrs VZEROUPPER)>;			def : InstRW<[JWriteJVZEROUPPER], (instrs VZEROUPPER)>;
	} // SchedModel

				///////////////////////////////////////////////////////////////////////////////
				// SchedWriteVariant definitions.
				///////////////////////////////////////////////////////////////////////////////

				def JWriteZeroLatency : SchedWriteRes<[]> {
				let Latency = 0;
				}

				// Vector XOR instructions that use the same register for both source
				// operands do not have a real dependency on the previous contents of the
				// register, and thus, do not have to wait before completing. They can be
				// optimized out at register renaming stage.
				// Reference: Section 10.8 of the "Software Optimization Guide for AMD Family
				// 15h Processors".
				// Reference: Agner's Fog "The microarchitecture of Intel, AMD and VIA CPUs",
				// Section 21.8 [Dependency-breaking instructions].

				def JWriteFZeroIdiom : SchedWriteVariant<[
				SchedVar<MCSchedPredicate<ZeroIdiomPredicate>, [JWriteZeroLatency]>,
				SchedVar<MCSchedPredicate<TruePred>, [WriteFLogic]>
				]>;

				def : InstRW<[JWriteFZeroIdiom], (instrs XORPSrr, VXORPSrr, XORPDrr, VXORPDrr)>;

				def JWriteVZeroIdiom : SchedWriteVariant<[
				SchedVar<MCSchedPredicate<ZeroIdiomPredicate>, [JWriteZeroLatency]>,
				SchedVar<MCSchedPredicate<TruePred>, [WriteVecLogicX]>
				]>;

				def : InstRW<[JWriteVZeroIdiom], (instrs PXORrr, VPXORrr)>;

				} // SchedModel

llvm/trunk/test/CodeGen/X86/sse-schedule.ll

	Show First 20 Lines • Show All 6,219 Lines • ▼ Show 20 Lines
	; SKX-NEXT: #APP			; SKX-NEXT: #APP
	; SKX-NEXT: nop # sched: [1:0.25]			; SKX-NEXT: nop # sched: [1:0.25]
	; SKX-NEXT: #NO_APP			; SKX-NEXT: #NO_APP
	; SKX-NEXT: vxorps %xmm0, %xmm0, %xmm0 # sched: [1:0.33]			; SKX-NEXT: vxorps %xmm0, %xmm0, %xmm0 # sched: [1:0.33]
	; SKX-NEXT: retq # sched: [7:1.00]			; SKX-NEXT: retq # sched: [7:1.00]
	;			;
	; BTVER2-SSE-LABEL: test_fnop:			; BTVER2-SSE-LABEL: test_fnop:
	; BTVER2-SSE: # %bb.0:			; BTVER2-SSE: # %bb.0:
	; BTVER2-SSE-NEXT: xorps %xmm0, %xmm0 # sched: [1:0.50]			; BTVER2-SSE-NEXT: xorps %xmm0, %xmm0 # sched: [0:?]
	; BTVER2-SSE-NEXT: #APP			; BTVER2-SSE-NEXT: #APP
	; BTVER2-SSE-NEXT: nop # sched: [1:0.50]			; BTVER2-SSE-NEXT: nop # sched: [1:0.50]
	; BTVER2-SSE-NEXT: #NO_APP			; BTVER2-SSE-NEXT: #NO_APP
	; BTVER2-SSE-NEXT: retq # sched: [4:1.00]			; BTVER2-SSE-NEXT: retq # sched: [4:1.00]
	;			;
	; BTVER2-LABEL: test_fnop:			; BTVER2-LABEL: test_fnop:
	; BTVER2: # %bb.0:			; BTVER2: # %bb.0:
	; BTVER2-NEXT: vxorps %xmm0, %xmm0, %xmm0 # sched: [1:0.50]			; BTVER2-NEXT: vxorps %xmm0, %xmm0, %xmm0 # sched: [0:?]
	; BTVER2-NEXT: #APP			; BTVER2-NEXT: #APP
	; BTVER2-NEXT: nop # sched: [1:0.50]			; BTVER2-NEXT: nop # sched: [1:0.50]
	; BTVER2-NEXT: #NO_APP			; BTVER2-NEXT: #NO_APP
	; BTVER2-NEXT: retq # sched: [4:1.00]			; BTVER2-NEXT: retq # sched: [4:1.00]
	;			;
	; ZNVER1-SSE-LABEL: test_fnop:			; ZNVER1-SSE-LABEL: test_fnop:
	; ZNVER1-SSE: # %bb.0:			; ZNVER1-SSE: # %bb.0:
	; ZNVER1-SSE-NEXT: xorps %xmm0, %xmm0 # sched: [1:0.25]			; ZNVER1-SSE-NEXT: xorps %xmm0, %xmm0 # sched: [1:0.25]
	Show All 17 Lines

llvm/trunk/test/tools/llvm-mca/X86/BtVer2/zero-idioms.s

				# NOTE: Assertions have been autogenerated by utils/update_mca_test_checks.py
				# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -timeline -register-file-stats -iterations=5 < %s | FileCheck %s

				xorps %xmm0, %xmm0
				xorpd %xmm1, %xmm1
				vxorps %xmm2, %xmm2, %xmm2
				vxorpd %xmm1, %xmm1, %xmm1
				pxor %xmm2, %xmm2
				vpxor %xmm3, %xmm3, %xmm3

				vxorps %xmm4, %xmm4, %xmm5
				vxorpd %xmm1, %xmm1, %xmm3
				vpxor %xmm3, %xmm3, %xmm5

				# CHECK: Iterations: 5
				# CHECK-NEXT: Instructions: 45
				# CHECK-NEXT: Total Cycles: 24
				# CHECK-NEXT: Dispatch Width: 2
				# CHECK-NEXT: IPC: 1.88
				# CHECK-NEXT: Block RThroughput: 4.5

				# CHECK: Instruction Info:
				# CHECK-NEXT: [1]: #uOps
				# CHECK-NEXT: [2]: Latency
				# CHECK-NEXT: [3]: RThroughput
				# CHECK-NEXT: [4]: MayLoad
				# CHECK-NEXT: [5]: MayStore
				# CHECK-NEXT: [6]: HasSideEffects

				# CHECK: [1] [2] [3] [4] [5] [6] Instructions:
				# CHECK-NEXT: 1 0 - xorps %xmm0, %xmm0
				# CHECK-NEXT: 1 0 - xorpd %xmm1, %xmm1
				# CHECK-NEXT: 1 0 - vxorps %xmm2, %xmm2, %xmm2
				# CHECK-NEXT: 1 0 - vxorpd %xmm1, %xmm1, %xmm1
				# CHECK-NEXT: 1 0 - pxor %xmm2, %xmm2
				# CHECK-NEXT: 1 0 - vpxor %xmm3, %xmm3, %xmm3
				# CHECK-NEXT: 1 0 - vxorps %xmm4, %xmm4, %xmm5
				# CHECK-NEXT: 1 0 - vxorpd %xmm1, %xmm1, %xmm3
				# CHECK-NEXT: 1 0 - vpxor %xmm3, %xmm3, %xmm5

				# CHECK: Register File statistics:
				# CHECK-NEXT: Total number of mappings created: 0
				# CHECK-NEXT: Max number of mappings used: 0

				# CHECK: * Register File #1 -- JFpuPRF:
				# CHECK-NEXT: Number of physical registers: 72
				# CHECK-NEXT: Total number of mappings created: 0
				# CHECK-NEXT: Max number of mappings used: 0

				# CHECK: * Register File #2 -- JIntegerPRF:
				# CHECK-NEXT: Number of physical registers: 64
				# CHECK-NEXT: Total number of mappings created: 0
				# CHECK-NEXT: Max number of mappings used: 0

				# CHECK: Resources:
				# CHECK-NEXT: [0] - JALU0
				# CHECK-NEXT: [1] - JALU1
				# CHECK-NEXT: [2] - JDiv
				# CHECK-NEXT: [3] - JFPA
				# CHECK-NEXT: [4] - JFPM
				# CHECK-NEXT: [5] - JFPU0
				# CHECK-NEXT: [6] - JFPU1
				# CHECK-NEXT: [7] - JLAGU
				# CHECK-NEXT: [8] - JMul
				# CHECK-NEXT: [9] - JSAGU
				# CHECK-NEXT: [10] - JSTC
				# CHECK-NEXT: [11] - JVALU0
				# CHECK-NEXT: [12] - JVALU1
				# CHECK-NEXT: [13] - JVIMUL

				# CHECK: Resource pressure per iteration:
				# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]
				# CHECK-NEXT: - - - - - - - - - - - - - -

				# CHECK: Resource pressure by instruction:
				# CHECK-NEXT: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:
				# CHECK-NEXT: - - - - - - - - - - - - - - xorps %xmm0, %xmm0
				# CHECK-NEXT: - - - - - - - - - - - - - - xorpd %xmm1, %xmm1
				# CHECK-NEXT: - - - - - - - - - - - - - - vxorps %xmm2, %xmm2, %xmm2
				# CHECK-NEXT: - - - - - - - - - - - - - - vxorpd %xmm1, %xmm1, %xmm1
				# CHECK-NEXT: - - - - - - - - - - - - - - pxor %xmm2, %xmm2
				# CHECK-NEXT: - - - - - - - - - - - - - - vpxor %xmm3, %xmm3, %xmm3
				# CHECK-NEXT: - - - - - - - - - - - - - - vxorps %xmm4, %xmm4, %xmm5
				# CHECK-NEXT: - - - - - - - - - - - - - - vxorpd %xmm1, %xmm1, %xmm3
				# CHECK-NEXT: - - - - - - - - - - - - - - vpxor %xmm3, %xmm3, %xmm5

				# CHECK: Timeline view:
				# CHECK-NEXT: 0123456789
				# CHECK-NEXT: Index 0123456789 0123

				# CHECK: [0,0] DR . . . . . xorps %xmm0, %xmm0
				# CHECK-NEXT: [0,1] DR . . . . . xorpd %xmm1, %xmm1
				# CHECK-NEXT: [0,2] .DR . . . . . vxorps %xmm2, %xmm2, %xmm2
				# CHECK-NEXT: [0,3] .DR . . . . . vxorpd %xmm1, %xmm1, %xmm1
				# CHECK-NEXT: [0,4] . DR . . . . . pxor %xmm2, %xmm2
				# CHECK-NEXT: [0,5] . DR . . . . . vpxor %xmm3, %xmm3, %xmm3
				# CHECK-NEXT: [0,6] . DR. . . . . vxorps %xmm4, %xmm4, %xmm5
				# CHECK-NEXT: [0,7] . DR. . . . . vxorpd %xmm1, %xmm1, %xmm3
				# CHECK-NEXT: [0,8] . DR . . . . vpxor %xmm3, %xmm3, %xmm5
				# CHECK-NEXT: [1,0] . DR . . . . xorps %xmm0, %xmm0
				# CHECK-NEXT: [1,1] . DR . . . . xorpd %xmm1, %xmm1
				# CHECK-NEXT: [1,2] . DR . . . . vxorps %xmm2, %xmm2, %xmm2
				# CHECK-NEXT: [1,3] . .DR . . . . vxorpd %xmm1, %xmm1, %xmm1
				# CHECK-NEXT: [1,4] . .DR . . . . pxor %xmm2, %xmm2
				# CHECK-NEXT: [1,5] . . DR . . . . vpxor %xmm3, %xmm3, %xmm3
				# CHECK-NEXT: [1,6] . . DR . . . . vxorps %xmm4, %xmm4, %xmm5
				# CHECK-NEXT: [1,7] . . DR. . . . vxorpd %xmm1, %xmm1, %xmm3
				# CHECK-NEXT: [1,8] . . DR. . . . vpxor %xmm3, %xmm3, %xmm5
				# CHECK-NEXT: [2,0] . . DR . . . xorps %xmm0, %xmm0
				# CHECK-NEXT: [2,1] . . DR . . . xorpd %xmm1, %xmm1
				# CHECK-NEXT: [2,2] . . DR . . . vxorps %xmm2, %xmm2, %xmm2
				# CHECK-NEXT: [2,3] . . DR . . . vxorpd %xmm1, %xmm1, %xmm1
				# CHECK-NEXT: [2,4] . . .DR . . . pxor %xmm2, %xmm2
				# CHECK-NEXT: [2,5] . . .DR . . . vpxor %xmm3, %xmm3, %xmm3
				# CHECK-NEXT: [2,6] . . . DR . . . vxorps %xmm4, %xmm4, %xmm5
				# CHECK-NEXT: [2,7] . . . DR . . . vxorpd %xmm1, %xmm1, %xmm3
				# CHECK-NEXT: [2,8] . . . DR. . . vpxor %xmm3, %xmm3, %xmm5
				# CHECK-NEXT: [3,0] . . . DR. . . xorps %xmm0, %xmm0
				# CHECK-NEXT: [3,1] . . . DR . . xorpd %xmm1, %xmm1
				# CHECK-NEXT: [3,2] . . . DR . . vxorps %xmm2, %xmm2, %xmm2
				# CHECK-NEXT: [3,3] . . . DR . . vxorpd %xmm1, %xmm1, %xmm1
				# CHECK-NEXT: [3,4] . . . DR . . pxor %xmm2, %xmm2
				# CHECK-NEXT: [3,5] . . . .DR . . vpxor %xmm3, %xmm3, %xmm3
				# CHECK-NEXT: [3,6] . . . .DR . . vxorps %xmm4, %xmm4, %xmm5
				# CHECK-NEXT: [3,7] . . . . DR . . vxorpd %xmm1, %xmm1, %xmm3
				# CHECK-NEXT: [3,8] . . . . DR . . vpxor %xmm3, %xmm3, %xmm5
				# CHECK-NEXT: [4,0] . . . . DR. . xorps %xmm0, %xmm0
				# CHECK-NEXT: [4,1] . . . . DR. . xorpd %xmm1, %xmm1
				# CHECK-NEXT: [4,2] . . . . DR . vxorps %xmm2, %xmm2, %xmm2
				# CHECK-NEXT: [4,3] . . . . DR . vxorpd %xmm1, %xmm1, %xmm1
				# CHECK-NEXT: [4,4] . . . . DR . pxor %xmm2, %xmm2
				# CHECK-NEXT: [4,5] . . . . DR . vpxor %xmm3, %xmm3, %xmm3
				# CHECK-NEXT: [4,6] . . . . .DR. vxorps %xmm4, %xmm4, %xmm5
				# CHECK-NEXT: [4,7] . . . . .DR. vxorpd %xmm1, %xmm1, %xmm3
				# CHECK-NEXT: [4,8] . . . . . DR vpxor %xmm3, %xmm3, %xmm5

				# CHECK: Average Wait times (based on the timeline view):
				# CHECK-NEXT: [0]: Executions
				# CHECK-NEXT: [1]: Average time spent waiting in a scheduler's queue
				# CHECK-NEXT: [2]: Average time spent waiting in a scheduler's queue while ready
				# CHECK-NEXT: [3]: Average time elapsed from WB until retire stage

				# CHECK: [0] [1] [2] [3]
				# CHECK-NEXT: 0. 5 0.0 0.0 0.0 xorps %xmm0, %xmm0
				# CHECK-NEXT: 1. 5 0.0 0.0 0.0 xorpd %xmm1, %xmm1
				# CHECK-NEXT: 2. 5 0.0 0.0 0.0 vxorps %xmm2, %xmm2, %xmm2
				# CHECK-NEXT: 3. 5 0.0 0.0 0.0 vxorpd %xmm1, %xmm1, %xmm1
				# CHECK-NEXT: 4. 5 0.0 0.0 0.0 pxor %xmm2, %xmm2
				# CHECK-NEXT: 5. 5 0.0 0.0 0.0 vpxor %xmm3, %xmm3, %xmm3
				# CHECK-NEXT: 6. 5 0.0 0.0 0.0 vxorps %xmm4, %xmm4, %xmm5
				# CHECK-NEXT: 7. 5 0.0 0.0 0.0 vxorpd %xmm1, %xmm1, %xmm3
				# CHECK-NEXT: 8. 5 0.0 0.0 0.0 vpxor %xmm3, %xmm3, %xmm5

llvm/trunk/tools/llvm-mca/InstrBuilder.h

	Show All 34 Lines
	/// Information from the machine scheduling model is used to identify processor			/// Information from the machine scheduling model is used to identify processor
	/// resources that are consumed by an instruction.			/// resources that are consumed by an instruction.
	class InstrBuilder {			class InstrBuilder {
	const llvm::MCSubtargetInfo &STI;			const llvm::MCSubtargetInfo &STI;
	const llvm::MCInstrInfo &MCII;			const llvm::MCInstrInfo &MCII;
	llvm::SmallVector<uint64_t, 8> ProcResourceMasks;			llvm::SmallVector<uint64_t, 8> ProcResourceMasks;

	llvm::DenseMap<unsigned short, std::unique_ptr<const InstrDesc>> Descriptors;			llvm::DenseMap<unsigned short, std::unique_ptr<const InstrDesc>> Descriptors;
				llvm::DenseMap<const llvm::MCInst *, std::unique_ptr<const InstrDesc>>
				VariantDescriptors;

	const InstrDesc &createInstrDescImpl(const llvm::MCInst &MCI);			const InstrDesc &createInstrDescImpl(const llvm::MCInst &MCI);

	InstrBuilder(const InstrBuilder &) = delete;			InstrBuilder(const InstrBuilder &) = delete;
	InstrBuilder &operator=(const InstrBuilder &) = delete;			InstrBuilder &operator=(const InstrBuilder &) = delete;

	public:			public:
	InstrBuilder(const llvm::MCSubtargetInfo &sti, const llvm::MCInstrInfo &mcii)			InstrBuilder(const llvm::MCSubtargetInfo &sti, const llvm::MCInstrInfo &mcii)
	: STI(sti), MCII(mcii),			: STI(sti), MCII(mcii),
	ProcResourceMasks(STI.getSchedModel().getNumProcResourceKinds()) {			ProcResourceMasks(STI.getSchedModel().getNumProcResourceKinds()) {
	computeProcResourceMasks(STI.getSchedModel(), ProcResourceMasks);			computeProcResourceMasks(STI.getSchedModel(), ProcResourceMasks);
	Show All 16 Lines

llvm/trunk/tools/llvm-mca/InstrBuilder.cpp

Show First 20 Lines • Show All 390 Lines • ▼ Show 20 Lines	const InstrDesc &InstrBuilder::createInstrDescImpl(const MCInst &MCI) {

unsigned short Opcode = MCI.getOpcode();		unsigned short Opcode = MCI.getOpcode();
// Obtain the instruction descriptor from the opcode.		// Obtain the instruction descriptor from the opcode.
const MCInstrDesc &MCDesc = MCII.get(Opcode);		const MCInstrDesc &MCDesc = MCII.get(Opcode);
const MCSchedModel &SM = STI.getSchedModel();		const MCSchedModel &SM = STI.getSchedModel();

// Then obtain the scheduling class information from the instruction.		// Then obtain the scheduling class information from the instruction.
unsigned SchedClassID = MCDesc.getSchedClass();		unsigned SchedClassID = MCDesc.getSchedClass();
const MCSchedClassDesc &SCDesc = *SM.getSchedClassDesc(SchedClassID);		unsigned CPUID = SM.getProcessorID();

		// Try to solve variant scheduling classes.
		if (SchedClassID) {
		while (SchedClassID && SM.getSchedClassDesc(SchedClassID)->isVariant())
		SchedClassID = STI.resolveVariantSchedClass(SchedClassID, &MCI, CPUID);

		if (!SchedClassID)
		llvm::report_fatal_error("unable to resolve this variant class.");
		}

// Create a new empty descriptor.		// Create a new empty descriptor.
std::unique_ptr<InstrDesc> ID = llvm::make_unique<InstrDesc>();		std::unique_ptr<InstrDesc> ID = llvm::make_unique<InstrDesc>();

if (SCDesc.isVariant()) {		const MCSchedClassDesc &SCDesc = *SM.getSchedClassDesc(SchedClassID);
WithColor::warning() << "don't know how to model variant opcodes.\n";
WithColor::note() << "assume 1 micro opcode.\n";
ID->NumMicroOps = 1U;
} else {
ID->NumMicroOps = SCDesc.NumMicroOps;		ID->NumMicroOps = SCDesc.NumMicroOps;
}

if (MCDesc.isCall()) {		if (MCDesc.isCall()) {
// We don't correctly model calls.		// We don't correctly model calls.
WithColor::warning() << "found a call in the input assembly sequence.\n";		WithColor::warning() << "found a call in the input assembly sequence.\n";
WithColor::note() << "call instructions are not correctly modeled. "		WithColor::note() << "call instructions are not correctly modeled. "
<< "Assume a latency of 100cy.\n";		<< "Assume a latency of 100cy.\n";
}		}

Show All 11 Lines	const InstrDesc &InstrBuilder::createInstrDescImpl(const MCInst &MCI) {
computeMaxLatency(*ID, MCDesc, SCDesc, STI);		computeMaxLatency(*ID, MCDesc, SCDesc, STI);
populateWrites(*ID, MCI, MCDesc, SCDesc, STI);		populateWrites(*ID, MCI, MCDesc, SCDesc, STI);
populateReads(*ID, MCI, MCDesc, SCDesc, STI);		populateReads(*ID, MCI, MCDesc, SCDesc, STI);

LLVM_DEBUG(dbgs() << "\t\tMaxLatency=" << ID->MaxLatency << '\n');		LLVM_DEBUG(dbgs() << "\t\tMaxLatency=" << ID->MaxLatency << '\n');
LLVM_DEBUG(dbgs() << "\t\tNumMicroOps=" << ID->NumMicroOps << '\n');		LLVM_DEBUG(dbgs() << "\t\tNumMicroOps=" << ID->NumMicroOps << '\n');

// Now add the new descriptor.		// Now add the new descriptor.
Descriptors[Opcode] = std::move(ID);		SchedClassID = MCDesc.getSchedClass();
return *Descriptors[Opcode];		if (!SM.getSchedClassDesc(SchedClassID)->isVariant()) {
		Descriptors[MCI.getOpcode()] = std::move(ID);
		return *Descriptors[MCI.getOpcode()];
		}

		VariantDescriptors[&MCI] = std::move(ID);
		return *VariantDescriptors[&MCI];
}		}

const InstrDesc &InstrBuilder::getOrCreateInstrDesc(const MCInst &MCI) {		const InstrDesc &InstrBuilder::getOrCreateInstrDesc(const MCInst &MCI) {
if (Descriptors.find_as(MCI.getOpcode()) == Descriptors.end())		if (Descriptors.find_as(MCI.getOpcode()) != Descriptors.end())
return createInstrDescImpl(MCI);
return *Descriptors[MCI.getOpcode()];		return *Descriptors[MCI.getOpcode()];

		if (VariantDescriptors.find(&MCI) != VariantDescriptors.end())
		return *VariantDescriptors[&MCI];

		return createInstrDescImpl(MCI);
}		}

std::unique_ptr<Instruction>		std::unique_ptr<Instruction>
InstrBuilder::createInstruction(const MCInst &MCI) {		InstrBuilder::createInstruction(const MCInst &MCI) {
const InstrDesc &D = getOrCreateInstrDesc(MCI);		const InstrDesc &D = getOrCreateInstrDesc(MCI);
std::unique_ptr<Instruction> NewIS = llvm::make_unique<Instruction>(D);		std::unique_ptr<Instruction> NewIS = llvm::make_unique<Instruction>(D);

// Initialize Reads first.		// Initialize Reads first.
Show All 38 Lines

llvm/trunk/tools/llvm-mca/InstructionInfoView.cpp

Show All 30 Lines	void InstructionInfoView::printView(raw_ostream &OS) const {
TempStream << "\n\nInstruction Info:\n";		TempStream << "\n\nInstruction Info:\n";
TempStream << "[1]: #uOps\n[2]: Latency\n[3]: RThroughput\n"		TempStream << "[1]: #uOps\n[2]: Latency\n[3]: RThroughput\n"
<< "[4]: MayLoad\n[5]: MayStore\n[6]: HasSideEffects\n\n";		<< "[4]: MayLoad\n[5]: MayStore\n[6]: HasSideEffects\n\n";

TempStream << "[1] [2] [3] [4] [5] [6] Instructions:\n";		TempStream << "[1] [2] [3] [4] [5] [6] Instructions:\n";
for (unsigned I = 0, E = Instructions; I < E; ++I) {		for (unsigned I = 0, E = Instructions; I < E; ++I) {
const MCInst &Inst = Source.getMCInstFromIndex(I);		const MCInst &Inst = Source.getMCInstFromIndex(I);
const MCInstrDesc &MCDesc = MCII.get(Inst.getOpcode());		const MCInstrDesc &MCDesc = MCII.get(Inst.getOpcode());
const MCSchedClassDesc &SCDesc =
*SM.getSchedClassDesc(MCDesc.getSchedClass());

		// Obtain the scheduling class information from the instruction.
		unsigned SchedClassID = MCDesc.getSchedClass();
		unsigned CPUID = SM.getProcessorID();

		// Try to solve variant scheduling classes.
		while (SchedClassID && SM.getSchedClassDesc(SchedClassID)->isVariant())
		SchedClassID = STI.resolveVariantSchedClass(SchedClassID, &Inst, CPUID);

		const MCSchedClassDesc &SCDesc = *SM.getSchedClassDesc(SchedClassID);
unsigned NumMicroOpcodes = SCDesc.NumMicroOps;		unsigned NumMicroOpcodes = SCDesc.NumMicroOps;
unsigned Latency = MCSchedModel::computeInstrLatency(STI, SCDesc);		unsigned Latency = MCSchedModel::computeInstrLatency(STI, SCDesc);
Optional<double> RThroughput =		Optional<double> RThroughput =
MCSchedModel::getReciprocalThroughput(STI, SCDesc);		MCSchedModel::getReciprocalThroughput(STI, SCDesc);

TempStream << ' ' << NumMicroOpcodes << " ";		TempStream << ' ' << NumMicroOpcodes << " ";
if (NumMicroOpcodes < 10)		if (NumMicroOpcodes < 10)
TempStream << " ";		TempStream << " ";
Show All 36 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[RFC][patch 3/3] Add support for variant scheduling classes in llvm-mca.
ClosedPublic
