This is an archive of the discontinued LLVM Phabricator instance.

[RFC][patch 3/3] Add support for variant scheduling classes in llvm-mca.
ClosedPublic

Authored by andreadb on May 25 2018, 8:14 AM.

Details

Summary

This patch is the third of a sequence of three patches related to LLVM-dev RFC "MC support for variant scheduling classes". http://lists.llvm.org/pipermail/llvm-dev/2018-May/123181.html

This patch requires D47077 to be applied first.

The main goal of this patch is to teach llvm-mca how to solve variant scheduling classes.
This patch does that, plus it adds a variant scheduling class to the BtVer2 scheduling model to identify so-called zero-idioms (data dependency breaking instructions that are known to produce zero, and that are optimized out at register renaming stage).

Without the BtVer2 change, this patch would not have had any tests.
This patch is effectively the union of two changes:

  1. the llvm-mca change that enables the resolution of variant scheduling classes, and
  2. the change to the BtVer2 scheduling model.

Point 2. (partially) fixes PR36671.
Point 1. fixes PR36672.

@RKSimon and @craig.topper , the new scheduling predicate for the XOR zero-idiom is quite simple. In future, we could move predicates that are valid for multiple processor models into a common .td file.
For now, I keep that predicate check into the BtVer2 model.

Please let me know if okay to commit.

Thanks
-Andrea

Diff Detail

Event Timeline

andreadb created this revision.May 25 2018, 8:14 AM
RKSimon added inline comments.May 25 2018, 8:38 AM
lib/Target/X86/X86ScheduleBtVer2.td
550 ↗(On Diff #148608)

Add a TODO saying this may go into X86Schedule.td in the future?

575 ↗(On Diff #148608)

Reference Agner's microarchitecture doc as well - he explicity says this works for Jaguar - AMD 16h SOG unfortunately misses it and just having a reference to a different cpu is confusing.

test/CodeGen/X86/sse-schedule.ll
6084 ↗(On Diff #148608)

This is unfortunate - ideally we'd keep [1:0.50] here and zero idioms would get [0:0.50]

test/tools/llvm-mca/X86/BtVer2/zero-idioms.s
32 ↗(On Diff #148608)

Should the RThroughput still be limited to by issue width?

tools/llvm-mca/InstrBuilder.cpp
385 ↗(On Diff #148608)

Drop braces - also is it worth pulling out SM.getProcessorID()? It might get this down to a single line for tidiness.

andreadb added inline comments.May 25 2018, 8:53 AM
lib/Target/X86/X86ScheduleBtVer2.td
550 ↗(On Diff #148608)

Will do.

575 ↗(On Diff #148608)

Okay. I will add that reference.

test/CodeGen/X86/sse-schedule.ll
6084 ↗(On Diff #148608)

I should have mentioned this in the summary. The existing functionality used to obtain the latency from a MCInst doesn't know how to resolve variant scheduling classes. So, it accepts a machine opcode ID, and it bails out if the scheduling class identifier from the opcode descriptor refers to a variant class.
The idea is to fix it. However, it would be a follow-up patch. I can add a FIXME comment there.

test/tools/llvm-mca/X86/BtVer2/zero-idioms.s
32 ↗(On Diff #148608)

It doesn't take into account the issue width. In this case, the MCSchedInfo method returned an invalid RThroughput (an Optional<double>).

On the other hand, the Block RThroughput correctly reports 3.0 for the sequence of 6 instructions (for a total of 6 uOps).

tools/llvm-mca/InstrBuilder.cpp
385 ↗(On Diff #148608)

Okay. I will fix it.

andreadb updated this revision to Diff 148623.May 25 2018, 9:42 AM
andreadb marked 3 inline comments as done.

Patch updated.

Addressed review comments.

courbet added inline comments.
lib/Target/X86/X86ScheduleBtVer2.td
569 ↗(On Diff #148623)

I think you should also set NumMicroOps to 0 here.

andreadb added inline comments.May 25 2018, 11:14 AM
lib/Target/X86/X86ScheduleBtVer2.td
569 ↗(On Diff #148623)

Here, the number of opcodes is what gets subtracted to the IssueWidth budget for the current cycle.
On Jaguar, that quantity matches what the docs define as COP (complex opcode). Think of it as a container of uOps. An SSE xor is always decoded into a single COP. That COP is then sent to the RCU (and it takes one slot in the reorder buffer), and eliminated at register renaming stage.
Basically, although it is zero-latency, it still has to be retired. It consumes one slot in the ROB, and one slot in the dispatch group (subtracted to what llvm-mca calls "DispatchWidth" - the equivalent of the IssueWidth in the scheduling model).

andreadb edited the summary of this revision. (Show Details)May 25 2018, 11:20 AM
RKSimon added inline comments.May 27 2018, 7:05 AM
lib/Target/X86/X86ScheduleBtVer2.td
560 ↗(On Diff #148623)

Just to be clear, can this work for GPR zero-idioms as well (XOR32rr/XOR64rr etc)?

andreadb added inline comments.May 27 2018, 12:41 PM
lib/Target/X86/X86ScheduleBtVer2.td
560 ↗(On Diff #148623)

Yes. JZeroIdiomPredicate can be used to describe XOR32rr and XOR64rr zero-idioms too.

You would still need a fall-back strategy for when the XOR is not a zero idiom.
That means, you would need to add the following case to JWriteZeroIdiom:

SchedVar<MCSchedPredicate<JIntXOR>, [WriteALU]>

where JIntXOR is defined as a CheckOpcode<[XOR32rr, XOR64rr]>.

andreadb updated this revision to Diff 149123.May 30 2018, 8:27 AM

Patch updated.

This patch is now dependent on D47536.
D47536 avoids the loss of test coverage in -print-schedule tests.

andreadb updated this revision to Diff 149482.Jun 1 2018, 8:39 AM

Patch updated.

Simplified the scheduling predicates as suggested off-line by SimonP.

andreadb updated this revision to Diff 149496.Jun 1 2018, 9:03 AM

Patch updated (sorry for the spam).

This time, with an updated/extended test case.

RKSimon added inline comments.Jun 2 2018, 9:05 AM
lib/Target/X86/X86ScheduleBtVer2.td
558 ↗(On Diff #149496)

This should probably be moved to X86Schedule.td straightaway - I see no benefit keeping this here as zeroidioms are something we're going to want in all x86 scheduler models.

Rename either to X86ZeroIdiomPredicate or X86SameRegOps01 (or something like that.....).

andreadb updated this revision to Diff 149710.Jun 4 2018, 3:51 AM
andreadb marked 3 inline comments as done.

Address review comments.

andreadb updated this revision to Diff 149712.Jun 4 2018, 4:00 AM

Patch updated. This time with the correct diff...

RKSimon added inline comments.Jun 4 2018, 5:52 AM
CodeGen/X86/sse-schedule.ll
6228

These are unfortunate - please can you raise an upstream bugzilla about throughput printing defaulting to issues width.

llvm-mca/InstrBuilder.cpp
388

So !SchedClassID can only occur here due to variant resolution failing?

andreadb added inline comments.Jun 4 2018, 6:43 AM
CodeGen/X86/sse-schedule.ll
6228

Makes sense. I will raise a bug for it.

llvm-mca/InstrBuilder.cpp
388

Good point.
Strictly speaking, the scheduling class ID associated with MCI *should* always be valid. But I cannot guarantee it.. I will add an if-stmt to guard against invalid classes.

andreadb updated this revision to Diff 149757.Jun 4 2018, 7:22 AM

Address Simon's review comment.

Raised https://bugs.llvm.org/show_bug.cgi?id=37678 for the -print-schedule issue with the throughput computation of zero-latency instructions.

andreadb updated this revision to Diff 149767.Jun 4 2018, 7:49 AM

Patch updated.

RKSimon accepted this revision.Jun 4 2018, 8:15 AM

You've lost the diff context, but LGTM - thanks.

This revision is now accepted and ready to land.Jun 4 2018, 8:15 AM
This revision was automatically updated to reflect the committed changes.