This is an archive of the discontinued LLVM Phabricator instance.

Better testing of schedule model instruction latencies/throughputs
ClosedPublic

Authored by avt77 on Mar 14 2017, 7:51 AM.

Details

Summary

This patch should close the issue raised in PR32216. The patch introduce a new llc switch "-print-latency" which allows to add comments with latencies in output .s files. The patch opens several questions:

  • there is real mess in terminology: the sources mix throughput and latency completely - we should fix it
  • sometimes the selection of our sched model is unclear because the selected alternative code sequence (sometimes) looks worse than the initial code
  • there are some issues with FMA support inside the model (in fact it does not work at the moment)
  • some values of latency (reported by model) look rather strange

This is the initial version of the patch and I'm sure all the above issues will be resolved soon.

Diff Detail

Event Timeline

avt77 created this revision.Mar 14 2017, 7:51 AM
RKSimon added inline comments.Mar 14 2017, 8:44 AM
lib/Target/X86/InstPrinter/X86InstComments.h
22 ↗(On Diff #91724)

Style guide - AC_PRINT_LATENCY

test/CodeGen/X86/recip-fastmath2.ll
20

Where have the retq instructions gone?

hfinkel added inline comments.Mar 14 2017, 9:28 AM
lib/Target/X86/X86MCInstLower.cpp
1269

This seems really useful, but is not target dependent. Can you please move this hook into the target-independent code? Maybe in void AsmPrinter::EmitFunctionBody(), around here:

default:
  EmitInstruction(&MI);
  break;

(right before the call to EmitInstruction).

1278

Can you call TII->getInstrLatency here instead of computing it in this loop?

(if you just call SCModel->computeInstrLatency, as suggested below, this will take care of itself).

1291

Can't you just call SCModel->computeInstrLatency return the result? The logic there seems like exactly what you want:

unsigned
TargetSchedModel::computeInstrLatency(const MachineInstr *MI,
                                      bool UseDefaultDefLatency) const {
  // For the itinerary model, fall back to the old subtarget hook.
  // Allow subtargets to compute Bundle latencies outside the machine model.
  if (hasInstrItineraries() || MI->isBundle() ||
      (!hasInstrSchedModel() && !UseDefaultDefLatency))
    return TII->getInstrLatency(&InstrItins, *MI);
 
  if (hasInstrSchedModel()) {
    const MCSchedClassDesc *SCDesc = resolveSchedClass(MI);
    if (SCDesc->isValid())
      return computeInstrLatency(*SCDesc);
  }
  return TII->defaultDefLatency(SchedModel, *MI);
}
RKSimon edited edge metadata.Mar 14 2017, 10:15 AM

How feasible is it to include the instruction's throughput in the comment as well?

avt77 added a comment.Mar 15 2017, 3:17 AM

hfinkel, could you help me? First of all could you give me a link(s) to any doc(s) related to our MCSchedModel except sources?

Next, I was told that ResourceCycles here:

class ProcWriteResources<list<ProcResourceKind> resources> {

list<ProcResourceKind> ProcResources = resources;
list<int> ResourceCycles = [];
int Latency = 1;
int NumMicroOps = 1;

could be used as Throughput of the given instruction. Is it right? Does it mean I could include it in generated comment as well? If YES I suppose it should be the max of the Cycles, right?

spatel edited edge metadata.EditedMar 15 2017, 7:11 AM

hfinkel, could you help me? First of all could you give me a link(s) to any doc(s) related to our MCSchedModel except sources?

Next, I was told that ResourceCycles here:

class ProcWriteResources<list<ProcResourceKind> resources> {

list<ProcResourceKind> ProcResources = resources;
list<int> ResourceCycles = [];
int Latency = 1;
int NumMicroOps = 1;

could be used as Throughput of the given instruction. Is it right? Does it mean I could include it in generated comment as well? If YES I suppose it should be the max of the Cycles, right?

I don't know if there are any docs besides the code and the code comments, but I think you are correct - the max of ResourceCycles is the inverse throughput for the instruction:

This is from include/llvm/Target/TargetSchedule.td :

// Optionally, ResourceCycles indicates the number of cycles the
// resource is consumed. Each ResourceCycles item is paired with the
// ProcResource item at the same position in its list. Since
// ResourceCycles are rarely specialized, the list may be
// incomplete. By default, resources are consumed for a single cycle,
// regardless of latency, which models a fully pipelined processing
// unit. A value of 0 for ResourceCycles means that the resource must
// be available but is not consumed, which is only relevant for
// unbuffered resources.

And this is in MachineScheduler.cpp:

// For reserved resources, record the highest cycle using the resource.
// For top-down scheduling, this is the cycle in which we schedule this
// instruction plus the number of cycles the operations reserves the
// resource.

We could abbreviate the comment string that you are adding like: [7:2].
I'm biased because that's the way I've always formatted [ latency : inverse throughput ], but I think that people that care about CPU timing will recognize that format, so you don't have to print out the words "latency" or "throughput".

hfinkel edited edge metadata.Mar 15 2017, 9:11 AM

hfinkel, could you help me? First of all could you give me a link(s) to any doc(s) related to our MCSchedModel except sources?

Next, I was told that ResourceCycles here:

class ProcWriteResources<list<ProcResourceKind> resources> {

list<ProcResourceKind> ProcResources = resources;
list<int> ResourceCycles = [];
int Latency = 1;
int NumMicroOps = 1;

could be used as Throughput of the given instruction. Is it right? Does it mean I could include it in generated comment as well? If YES I suppose it should be the max of the Cycles, right?

I don't know if there are any docs besides the code and the code comments,

The documentation is primarily in the header and TableGen files (for better or worse).

but I think you are correct - the max of ResourceCycles is the inverse throughput for the instruction:

That's correct if the instruction can only dispatch through that one resource.

First, for itineraries, I think you can do something like this:

double Unknown = std::numeric_limits<double>::infinity()
double Throughput = Unknown;
if (IID.isEmpty())
  return Throughput;

for (const InstrStage *IS = IID.beginStage(ItinClassIndx),
           *E = IID.endStage(ItinClassIndx); IS != E; ++IS) {
  unsigned Cycles = IS->getCycles();
  if (!Cycles)
    continue;

  Throughput = std::min(Throughput, popcnt(IS->getUnits()) * 1.0/Cycles);
}

return Throughput;

For resource descriptions, I think that you want the inverse of ResourceCycles multiplied by the number of applicable resources. Something like this:

for (MCWriteProcResEntry *WPR = STI.getWriteProcResBegin(SCClass),
                                                     *WEnd = STI.getWriteProcResEnd(SCClass); WPR != WEnd; ++WPR) {
  unsigned Cycles = WPR->Cycles;
  if (!Cycles)
    return Unknown;

  unsigned NumUnits = SCModel->getProcResource(WPR->ProcResourceIdx)->NumUnits;
  Throughput = std::min(Throughput, NumUnits * 1.0/Cycles);
}

This is from include/llvm/Target/TargetSchedule.td :

// Optionally, ResourceCycles indicates the number of cycles the
// resource is consumed. Each ResourceCycles item is paired with the
// ProcResource item at the same position in its list. Since
// ResourceCycles are rarely specialized, the list may be
// incomplete. By default, resources are consumed for a single cycle,
// regardless of latency, which models a fully pipelined processing
// unit. A value of 0 for ResourceCycles means that the resource must
// be available but is not consumed, which is only relevant for
// unbuffered resources.

And this is in MachineScheduler.cpp:

// For reserved resources, record the highest cycle using the resource.
// For top-down scheduling, this is the cycle in which we schedule this
// instruction plus the number of cycles the operations reserves the
// resource.

We could abbreviate the comment string that you are adding like: [7:2].
I'm biased because that's the way I've always formatted [ latency : inverse throughput ], but I think that people that care about CPU timing will recognize that format, so you don't have to print out the words "latency" or "throughput".

avt77 updated this revision to Diff 91987.Mar 16 2017, 5:52 AM

The implementation was moved to target independent area and all Hal's comments were applied. I did not do anything with Throughput: it will be done in the patch.

The implementation was moved to target independent area and all Hal's comments were applied. I did not do anything with Throughput: it will be done in the patch.

Do you mean that the throughput will be done in a follow-up patch (or a later revision of this one)? Either is fine with me, although if we're going to add more than latencies we might pick a different name. More later...

The implementation was moved to target independent area and all Hal's comments were applied. I did not do anything with Throughput: it will be done in the patch.

Do you mean that the throughput will be done in a follow-up patch (or a later revision of this one)? Either is fine with me, although if we're going to add more than latencies we might pick a different name. More later...

If possible please can we use the term 'print schedule' now so that future improvements (throughput, etc.) make sense

avt77 added a comment.Mar 17 2017, 1:33 AM

hfinkel, yes I mean I'll do it in the next version of this patch soon.
rksimon, do you mean we should rename the compiler option like "-print-schedule"?

hfinkel, yes I mean I'll do it in the next version of this patch soon.
rksimon, do you mean we should rename the compiler option like "-print-schedule"?

It is not clear to me why you won't just always do this when in verbose-asm mode. Thoughts on not having a separate option at all?

avt77 updated this revision to Diff 92153.Mar 17 2017, 9:37 AM

Throughput calculation is implemented.

avt77 added a comment.Mar 17 2017, 9:41 AM

hfinkel, yes I mean I'll do it in the next version of this patch soon.
rksimon, do you mean we should rename the compiler option like "-print-schedule"?

It is not clear to me why you won't just always do this when in verbose-asm mode. Thoughts on not having a separate option at all?

We can do it without special option of course. Should I remove it and use "verbose-asm mode" instead?

RKSimon added inline comments.Mar 20 2017, 11:31 AM
include/llvm/CodeGen/TargetSchedule.h
192

Add doxygen comment.

avt77 added a comment.Mar 21 2017, 5:53 AM

Hal,
I removed the special option (-print-schedule) and tried to check-all. The result was very unpleseant but predictable:

Expected Passes    : 29409
Expected Failures  : 160
Unsupported Tests  : 442
Unexpected Failures: 643

Should I fix all these 643 failures or it's better to keep the option?

Hal,
I removed the special option (-print-schedule) and tried to check-all. The result was very unpleseant but predictable:

Expected Passes    : 29409
Expected Failures  : 160
Unsupported Tests  : 442
Unexpected Failures: 643

Should I fix all these 643 failures or it's better to keep the option?

No. Mostly because it is not entirely mechanical (it does not make sense to update the tests without understanding whether the results make sense; the person who updates the test should hopefully have that background). FWIW, this also does not make sense for higher-level targets (e.g. NVPTX).

I recommend the following:

  1. Add a target callback to enable the printing of the scheduling information during verbose mode
  2. Provide a command-line option (cl::opt) that can override that target callback (for easy testing), as in:

    bool Enable = OptionName.getNumOccurrences() ? OptionName : STI->shouldPrintStuff();
  3. Enable this only on x86 (fix those tests, if any) (*)

Once we think this works reasonably well, we can encourage other backend maintainers to look at updating their tests and enabling it on their targets if they'd like.

(*) Given that FileCheck does prefix matching, and we're adding this information last on the line, I'm curious to understand why the tests are matching that is causing these failures.

lib/CodeGen/AsmPrinter/AsmPrinter.cpp
776

I think that one ? is enough (three looks excessive to me).

778

One only ? here please.

This comment was removed by RKSimon.
avt77 updated this revision to Diff 92624.Mar 22 2017, 6:24 AM

I did everything accordingly to Hal's requirements except one: the default value of "print-schedule" switch is false because otherewise we have "Unexpected Failures: 530" and it's X86 tests ony. The problem is very simple: update_llc_test_checks.py generates CHECKs like here

; XOP-AVX1-NEXT: vextractf128 $1, %ymm2, %xmm5

It means FileCheck has to check the whole line but this patch adds the comment at the end of line and as result the line can't be checked properly.
Finaly what we have now:

    • we have option "-print-schedule" allowing to print [latency:throughput" in output (default is false)
    • we have enablePrintSchedInfo() - default is false
  • X86 overrides enablePrintSchedInfo()

Hope that's exactly what was required.

I did everything accordingly to Hal's requirements except one: the default value of "print-schedule" switch is false because otherewise we have "Unexpected Failures: 530" and it's X86 tests ony. The problem is very simple: update_llc_test_checks.py generates CHECKs like here

; XOP-AVX1-NEXT: vextractf128 $1, %ymm2, %xmm5

I did not realize that CHECK-NEXT always matched the whole line. That's interesting.

In any case, if this is an update_llc_test_checks.py problem, why don't you use it to update the tests?

It means FileCheck has to check the whole line but this patch adds the comment at the end of line and as result the line can't be checked properly.
Finaly what we have now:

    • we have option "-print-schedule" allowing to print [latency:throughput" in output (default is false)
    • we have enablePrintSchedInfo() - default is false
  • X86 overrides enablePrintSchedInfo()

Hope that's exactly what was required.

include/llvm/CodeGen/AsmPrinter.h
118

Variable names should start with a capital letter: EnablePrintSchedInfo

include/llvm/MC/MCTargetOptions.h
50 ↗(On Diff #92624)

Remove this option.

lib/CodeGen/AsmPrinter/AsmPrinter.cpp
1405

We don't add { } around single-statement blocks.

1407

When you remove the '}' here, I'd keep the blank line ;)

1408

The command-line option should be declared in this file. That way you can predicate using the command-line option value on getNumOccurrences() as I had indicated previously. Plus, there's no need to have the option name depend on the tool.

Full disclosure: There is another option: You can use a tristate setup, look for DefaultOnOff in lib/CodeGen/AsmPrinter/DwarfDebug.cpp.

lib/CodeGen/TargetSchedule.cpp
342

Please remove this comment and the UseDefaultRThroughput variable. We can always add it later if we'd like.

348

This Unknown value should never be returned (this is an implementation detail). Please return an Optional<double>.

avt77 added a comment.Mar 22 2017, 7:30 AM

I did everything accordingly to Hal's requirements except one: the default value of "print-schedule" switch is false because otherewise we have "Unexpected Failures: 530" and it's X86 tests ony. The problem is very simple: update_llc_test_checks.py generates CHECKs like here

; XOP-AVX1-NEXT: vextractf128 $1, %ymm2, %xmm5

I did not realize that CHECK-NEXT always matched the whole line. That's interesting.

If you'd like to CHECK the prefix only you should use something like

CHECK: vextractf128 $1, %ymm2, %xmm5{{*}}

In any case, if this is an update_llc_test_checks.py problem, why don't you use it to update the tests?

I can use it to update the tests but it means I should update 530 tests. Is it acceptable? Should I do it? For me it is not a problem but is it OK for review?

I did everything accordingly to Hal's requirements except one: the default value of "print-schedule" switch is false because otherewise we have "Unexpected Failures: 530" and it's X86 tests ony. The problem is very simple: update_llc_test_checks.py generates CHECKs like here

; XOP-AVX1-NEXT: vextractf128 $1, %ymm2, %xmm5

I did not realize that CHECK-NEXT always matched the whole line. That's interesting.

If you'd like to CHECK the prefix only you should use something like

CHECK: vextractf128 $1, %ymm2, %xmm5{{*}}

In any case, if this is an update_llc_test_checks.py problem, why don't you use it to update the tests?

I can use it to update the tests but it means I should update 530 tests. Is it acceptable? Should I do it? For me it is not a problem but is it OK for review?

I think that you should update them. I assume we'd want to do it at some point. The reason not to do it would be that the numbers aren't generally right yet. The goal is that, because of the MemoryCombiner, etc. these numbers will affect not only scheduling but also instruction selection (etc.). we want to make them very visible in the tests. So, let's make them very visible... (if others disagree, please say so).

avt77 updated this revision to Diff 93498.Mar 30 2017, 10:01 AM

The problem with failed tests raised because of new lines of comments added as result of this patch. I was wrong when I told that FileCheck does not allow adding of new comments at EOL.
I redesigned the patch to make it possible to add Latency:Throughput at the end of exisiting comment (if any). As result I was forced to change API of EmitInstruction from MCStreamer. I don't like this change because there are a lot of successors of MCStreamer but it works perfectly and maybe useful for other targets.
I regenerated (with help of update_llc_test_checks.py) 34 tests and now we have only 16 failed tests: I'm going to fix them asap.

Please can you put back the "## sched: [LAT:RCP]" sched prefix?

If at all possible I'd like this first patch to just do the minimum - provide a -print-sched option (default = false) that adds the scheduling comments. I'd probably not even include the comment 'appending' in this first patch.

After this initial patch I'm much more interested in then getting a high amount of test coverage of as many instructions as possible across each scheduler model - imo that is what we need to improve first to try and find the existing issues with the models. If we don't and we then fix them later, it will require the regeneration of a lot of tests that don't actually care about the schedule.

Only then should we begin investigating how to include this in more asm output.

avt77 updated this revision to Diff 93611.Mar 31 2017, 3:14 AM

Accordingly to requirements from Simon I inserted prefix "sched: " for scheduler comments and made "false" as default value for -print-schedule option. As result I restored original versions of all X86-tests excepting 2 ones to demonstrate the changes. Now we don't have any failed test.

A number of minor style comments.

@hfinkel does this look alright to you as a patch for initial support for scheduler comments?

include/llvm/CodeGen/AsmPrinter.h
117

style - newline before comment, add '.' to end of comment.

include/llvm/CodeGen/TargetSchedule.h
192

newline

include/llvm/Target/TargetSubtargetInfo.h
23

Any clang-format / include 'juggling' should be done as a NFC cleanup commit, not as part of a larger patch.

147

Comment says 'enable' but function says 'support' - which is it? Full stop at end of comment (style)

lib/CodeGen/AsmPrinter/AsmPrinter.cpp
771

Add brief comment explaining what you're doing.

lib/CodeGen/TargetSchedule.cpp
281

Shouldn't this be an assert?

lib/MC/MCAsmStreamer.cpp
106–109

Explain how PrintSchedInfo will be used in comment

1586

Comment this

lib/MC/MCObjectStreamer.cpp
241

You're inconsistent leaving some unused argument without names and other with.

lib/Object/RecordStreamer.cpp
82

'B' is meaningless - please use a real name to explain its usage.

lib/Target/Mips/MCTargetDesc/MipsELFStreamer.h
49

Why have you include a default value here??

lib/Target/X86/InstPrinter/X86InstComments.h
26 ↗(On Diff #93611)

Should this be here or inside the cpp file(s) ?

lib/Target/X86/X86Subtarget.cpp
378 ↗(On Diff #93611)

Probably best to bring this inside creatSchedInfoStr?

379 ↗(On Diff #93611)

typo: createSchedInfoStr

lib/Target/X86/X86Subtarget.h
631

newline

avt77 added inline comments.Apr 7 2017, 9:51 AM
lib/CodeGen/TargetSchedule.cpp
281

No, that was the issue: we can have opcodes with invalid SCDesc

avt77 updated this revision to Diff 94538.Apr 7 2017, 9:54 AM

Hope, I fixed all comments raised by RKSimon.
hfinkel, what do you think about?

RKSimon added inline comments.Apr 7 2017, 11:19 AM
lib/MC/MCAsmStreamer.cpp
106–109

Don't use @param - you have to provide all params otherwise Wdocumentation fires.

lib/Target/X86/X86Subtarget.cpp
380 ↗(On Diff #94538)

const char *SchedPrefix = " sched: [";

avt77 updated this revision to Diff 94784.Apr 11 2017, 12:23 AM

I fixed the latest requirements from RKSimon. Please, give me your feedback.

hfinkel added inline comments.Apr 11 2017, 11:25 AM
lib/CodeGen/AsmPrinter/AsmPrinter.cpp
1412

Why is this commented out? You have a target hook and it defaults to false, so that should be fine.

lib/Target/X86/X86Subtarget.cpp
394 ↗(On Diff #94784)

Why are these functions X86-specific?

lib/Target/X86/X86Subtarget.h
627

Put the false and the TODO here.

avt77 updated this revision to Diff 94973.Apr 12 2017, 8:15 AM

I implemeted all requirements from hfinkel.
Please, review again.

hfinkel accepted this revision.Apr 12 2017, 8:42 AM

I implemeted all requirements from hfinkel.
Please, review again.

LGTM, thanks!

lib/Target/X86/X86Subtarget.h
627

I'd just say:

// TODO: Update the regression tests and return true.

This revision is now accepted and ready to land.Apr 12 2017, 8:42 AM

llvm/include/llvm/MC/MCSubtargetInfo.h:36:7: warning: 'llvm::MCSubtargetInfo' has virtual functions but non-virtual destructor [-Wnon-virtual-dtor]
Currently running into warnings relating to this

llvm/include/llvm/MC/MCSubtargetInfo.h:36:7: warning: 'llvm::MCSubtargetInfo' has virtual functions but non-virtual destructor [-Wnon-virtual-dtor]
Currently running into warnings relating to this

Should be fixed by rL300322

Perfect, thanks for following up @RKSimon