This is an archive of the discontinued LLVM Phabricator instance.

[MC][TableGen] Add optional libpfm counter names for ProcResUnits.
ClosedPublic

Authored by courbet on Apr 6 2018, 2:51 AM.

Details

Summary

Subtargets can define the libpfm counter names that can be used to
measure cycles and uops issued on ProcResUnits.
This allows making llvm-exegesis available on more targets.
Fixes PR36984.

Diff Detail

Event Timeline

courbet created this revision.Apr 6 2018, 2:51 AM

Hi Clement,

I prefer a more flexible design where perf events can be defined for all resources, and not just for ProcResourceUnits.
In future, we may want to add profiling information to resources that are not just pipeline ports. For example, we could add profiling support for hardware scheduler resources. Essentially groups should be considered too.

I also prefer that we move all the "optional" information into MCExtraProcessorInfo. I don't like the idea of adding an extra pointer to every single resource defined by a processor (especially if the processor doesn't care about describing perf counters).

This is just an idea for an alternative design:
You could define a separate tablegen class to describe the mapping of resources to perf events. Something like this:

class PFMCounter;
class DispatchPort0Counter : PFMCounter;
class DispatchPort1Counter : PFMCounter;
 /*etc.,  ... enumerate all counters here ... */

class PerfCounter<ProcResourceKind Kind, list<PFMCounter> Counters> {
  ProcResourceKind Resource = Kind;
  list<PFMCounter> Counters;  // A list of counter IDs.
}

Here, PFMCounters would be used to construct a (subtarget specific) enum type, where each enum value is a reference to a PMC string name.
The generic "unhalted_core_cycles" event would get an enum ID too.

This is just an idea for an alternative design. Regardless of whether you end up adopting this solution or not, I believe that it is cleaner if we move anything which is optional into MCExtraProcessorInfo.

Just my opinion.
-Andrea

include/llvm/MC/MCSchedule.h
261–263

I think that this should be moved to the MCExtraProcessorInfo data structure.

utils/TableGen/SubtargetEmitter.cpp
775

Can this be moved to MCExtraProcessorInfo?

I am of the opinion that this information is optional, and it shouldn't be included by default in MCProcResourceDesc. See my other comments.

I'm a bit worried about creating a pfm dependency like this in the core - pfm doesn't come close to covering all targets that we'll want to check in the long term so it's likely that some targets will require a different library. (or raw msrs.....)

Is there anyway that we can keep more of this in llvm-exegesis project - config files or embedding in source?

I'm a bit worried about creating a pfm dependency like this in the core - pfm doesn't come close to covering all targets that we'll want to check in the long term so it's likely that some targets will require a different library. (or raw msrs.....)

Is there anyway that we can keep more of this in llvm-exegesis project - config files or embedding in source?

I tend to agree with Simon.
All of this info could live in a config file within exegesis. Something that describes the mapping between abstract counters for perf events and actual pfm counters.

courbet marked 2 inline comments as done.Apr 6 2018, 6:31 AM

Hi Clement,

I prefer that we move all the "optional" information into MCExtraProcessorInfo. I don't like the idea of adding an extra pointer to every single resource defined by a processor (especially if the processor doesn't care about describing perf counters).

Agreed. I've moved all counters to MCExtraProcessorInfo.

I prefer a more flexible design where perf events can be defined for all resources, and not just for ProcResourceUnits.
In future, we may want to add profiling information to resources that are not just pipeline ports. For example, we could add profiling support for hardware scheduler resources. Essentially groups should be considered too.

In terms of generated code, the current design has counters index by ProcResIdx, so nothing needs to change if we decide that we want to add counters to groups too.
I agree that the schema you propose would be more generic but I'm reluctant to add more complexity to the TableGen schema until it's clear that we need to support counting other things.

I'm a bit worried about creating a pfm dependency like this in the core - pfm doesn't come close to covering all targets that we'll want to check in the long term so it's likely that some targets will require a different library. (or raw msrs.....)

Is there anyway that we can keep more of this in llvm-exegesis project - config files or embedding in source?

I tend to agree with Simon.
All of this info could live in a config file within exegesis. Something that describes the mapping between abstract counters for perf events and actual pfm counters.

If we were using raw events, I would agree. But the whole point of libpfm is to abstract counters. libpfm pretty much *is* this mapping.

Let's summarize what we need and how we can address those needs.

For each subtarget, we need to know:

  • How to count cycles.
  • How to count uops issued on each resource.

There are two orthogonal technical aspects to decide on, which are how to name the counters, and where to store the mapping from subtarget/procres to abstract counter

As for naming counters, the options that have been cited up to now are:
1 - (this design and Andrea's proposal) counters are identified by an abstract string (I happened to pick the libpfm convention, because that's the lib we've used).
2 - (raw counters) counters are identified by the architecture-specific id, which is an int on Intel but could be a something else on other architectures.

I think that (1) is as good as (2), because (2) will have to be something complex to handle all microarchitectures, so I'd rather leave this complexity to be dealt with by libpfm, which was designed for this very purpose.

As for storing the mapping, the options that have been cited up to now are:
A - (this design) counters are a property of the class they semantically relate to.
B - (Andrea's proposal) the mapping is a separate table of pairs of (counter, class it semantically relates to)
C - (Simon's proposal 2) the mapping is a separate config file. I believe this to be a variation of (B) where the table is stored in a different file.
D - (Simon's proposal 1) the mapping is hardcoded in llvm-exegesis.

Let's summarize what we need and how we can address those needs.

For each subtarget, we need to know:

  • How to count cycles.
  • How to count uops issued on each resource.

There are two orthogonal technical aspects to decide on, which are how to name the counters, and where to store the mapping from subtarget/procres to abstract counter

As for naming counters, the options that have been cited up to now are:
1 - (this design and Andrea's proposal) counters are identified by an abstract string (I happened to pick the libpfm convention, because that's the lib we've used).
2 - (raw counters) counters are identified by the architecture-specific id, which is an int on Intel but could be a something else on other architectures.

I think that (1) is as good as (2), because (2) will have to be something complex to handle all microarchitectures, so I'd rather leave this complexity to be dealt with by libpfm, which was designed for this very purpose.

As for storing the mapping, the options that have been cited up to now are:
A - (this design) counters are a property of the class they semantically relate to.
B - (Andrea's proposal) the mapping is a separate table of pairs of (counter, class it semantically relates to)
C - (Simon's proposal 2) the mapping is a separate config file. I believe this to be a variation of (B) where the table is stored in a different file.
D - (Simon's proposal 1) the mapping is hardcoded in llvm-exegesis.

Reflecting on this a bit more:

Option (B) has the advantage that the information is not stored alongside either resources or model but in a separate "table" (using relational language). That has the advantage that it imposes no specific place where the information has to be stored (which nicely fits with (C)).
Then (B) is simply "store the mapping for subtarget XYZ into file X86SchedXYZ.td" , which (C) is "store all mappings into a separate X86Counters.td"

As for (D), I'm very much opposed to storing anything outside of the TD files, because that would mean that we loose the advantage of all the object resolution done in SubtargetEmitter, so it would be very hard to maintain.

Let's summarize what we need and how we can address those needs.

For each subtarget, we need to know:

  • How to count cycles.
  • How to count uops issued on each resource.

There are two orthogonal technical aspects to decide on, which are how to name the counters, and where to store the mapping from subtarget/procres to abstract counter

As for naming counters, the options that have been cited up to now are:
1 - (this design and Andrea's proposal) counters are identified by an abstract string (I happened to pick the libpfm convention, because that's the lib we've used).
2 - (raw counters) counters are identified by the architecture-specific id, which is an int on Intel but could be a something else on other architectures.

I think that (1) is as good as (2), because (2) will have to be something complex to handle all microarchitectures, so I'd rather leave this complexity to be dealt with by libpfm, which was designed for this very purpose.

I am a bit confused. Does it mean that exegesis will always have the dependency on libpfm (which means, it would only work on systems that provide it, for the cpus known by the installed lib version on the system)?

As for storing the mapping, the options that have been cited up to now are:
A - (this design) counters are a property of the class they semantically relate to.
B - (Andrea's proposal) the mapping is a separate table of pairs of (counter, class it semantically relates to)
C - (Simon's proposal 2) the mapping is a separate config file. I believe this to be a variation of (B) where the table is stored in a different file.
D - (Simon's proposal 1) the mapping is hardcoded in llvm-exegesis.

If D is too complicated, then B or C. I don't want to pollute existing abstractions used by the LLVM schedulers with information which is only used by external tools.
I still want to allow targets to be able to strip off (or disable the emission of ) that information if they don't need it.

Let's summarize what we need and how we can address those needs.

For each subtarget, we need to know:

  • How to count cycles.
  • How to count uops issued on each resource.

There are two orthogonal technical aspects to decide on, which are how to name the counters, and where to store the mapping from subtarget/procres to abstract counter

As for naming counters, the options that have been cited up to now are:
1 - (this design and Andrea's proposal) counters are identified by an abstract string (I happened to pick the libpfm convention, because that's the lib we've used).
2 - (raw counters) counters are identified by the architecture-specific id, which is an int on Intel but could be a something else on other architectures.

I think that (1) is as good as (2), because (2) will have to be something complex to handle all microarchitectures, so I'd rather leave this complexity to be dealt with by libpfm, which was designed for this very purpose.

I am a bit confused. Does it mean that exegesis will always have the dependency on libpfm (which means, it would only work on systems that provide it, for the cpus known by the installed lib version on the system)?

It does not *have to*, but until a better solution comes along, yes :) Just peeking at the tables in libpfm to abstract both OS and hardware really makes me want to not have to handle that !

As for storing the mapping, the options that have been cited up to now are:
A - (this design) counters are a property of the class they semantically relate to.
B - (Andrea's proposal) the mapping is a separate table of pairs of (counter, class it semantically relates to)
C - (Simon's proposal 2) the mapping is a separate config file. I believe this to be a variation of (B) where the table is stored in a different file.
D - (Simon's proposal 1) the mapping is hardcoded in llvm-exegesis.

If D is too complicated, then B or C. I don't want to pollute existing abstractions used by the LLVM schedulers with information which is only used by external tools.
I still want to allow targets to be able to strip off (or disable the emission of ) that information if they don't need it.

I'll go with (C).

I think that (1) is as good as (2), because (2) will have to be something complex to handle all microarchitectures, so I'd rather leave this complexity to be dealt with by libpfm, which was designed for this very purpose.

I am a bit confused. Does it mean that exegesis will always have the dependency on libpfm (which means, it would only work on systems that provide it, for the cpus known by the installed lib version on the system)?

It does not *have to*, but until a better solution comes along, yes :) Just peeking at the tables in libpfm to abstract both OS and hardware really makes me want to not have to handle that !

I agree. You don't want to do that..
I was just curious. I guess, this is not going to hurt me since I mostly develop on linux. Also, based on this: http://perfmon2.sourceforge.net/ I think I should be able to get support for Jaguar (which is the cpu I mostly care about) and other AMD chips.

As for storing the mapping, the options that have been cited up to now are:
A - (this design) counters are a property of the class they semantically relate to.
B - (Andrea's proposal) the mapping is a separate table of pairs of (counter, class it semantically relates to)
C - (Simon's proposal 2) the mapping is a separate config file. I believe this to be a variation of (B) where the table is stored in a different file.
D - (Simon's proposal 1) the mapping is hardcoded in llvm-exegesis.

If D is too complicated, then B or C. I don't want to pollute existing abstractions used by the LLVM schedulers with information which is only used by external tools.
I still want to allow targets to be able to strip off (or disable the emission of ) that information if they don't need it.

I'll go with (C).

Cool. Thanks Clement :-)

courbet updated this revision to Diff 141598.Apr 9 2018, 2:04 AM
  • Implement option (C).
courbet updated this revision to Diff 141600.Apr 9 2018, 2:09 AM
  • Implement option (C).
courbet updated this revision to Diff 141625.Apr 9 2018, 5:23 AM
  • add missing license

Nice patch.

Would it be difficult to have a single string table for all the PfmIssueCounters defined by the scheduling models?

At the moment, your patch introduces a PfmIssueCounters table for every model:

static const char* HaswellModelPfmIssueCounters[] = {
  nullptr,
  nullptr,
  nullptr,
  "uops_dispatched_port:port_0,",  //HWPort0
  "uops_dispatched_port:port_1,",  //HWPort1
  "uops_dispatched_port:port_2,",  //HWPort2
  "uops_dispatched_port:port_3,",  //HWPort3
  "uops_dispatched_port:port_4,",  //HWPort4
  "uops_dispatched_port:port_5,",  //HWPort5
  "uops_dispatched_port:port_6,",  //HWPort6
  "uops_dispatched_port:port_7,",  //HWPort7
  nullptr,
  nullptr,
  nullptr,
  nullptr,
  nullptr,
  nullptr,
  nullptr,
  nullptr,
  nullptr,
  nullptr,
  nullptr,
  nullptr,
};
<snip>

static const char* SkylakeClientModelPfmIssueCounters[] = {
  nullptr,
  nullptr,
  nullptr,
  "uops_dispatched_port:port_0,",  //SKLPort0
  "uops_dispatched_port:port_1,",  //SKLPort1
  "uops_dispatched_port:port_2,",  //SKLPort2
  "uops_dispatched_port:port_3,",  //SKLPort3
  "uops_dispatched_port:port_4,",  //SKLPort4
  "uops_dispatched_port:port_5,",  //SKLPort5
  "uops_dispatched_port:port_6,",  //SKLPort6
  "uops_dispatched_port:port_7,",  //SKLPort7
  nullptr,
  nullptr,
  nullptr,
  nullptr,
  nullptr,
  nullptr,
  nullptr,
  nullptr,
  nullptr,
  nullptr,
  nullptr,
  nullptr,
};

However, I think this is suboptimal for two reasons:

  1. the per-model table is not well compressed because there is one entry per each processor resource. That means, we consume entries even for resources that don't have a counter associated.
  2. (as you can see from the code snippet above), most strings are actually duplicated in each table.

If possible, ideally we should have a single compressed string table accessible by all models:

nullptr  // invalid entry
"uops_dispatched_port:port_0,"
"uops_dispatched_port:port_1,"
"uops_dispatched_port:port_2,"
"uops_dispatched_port:port_3,"
"uops_dispatched_port:port_4,"
"uops_dispatched_port:port_5,"
"uops_dispatched_port:port_6,"
"uops_dispatched_port:port_7,"

The compressed table comes with the downside that it requires an extra mapping between processor resource IDs and indices to the string table. That mapping could be stored somewhere in the "ExtraInfo" table.

I hope this makes sense.

Overall, the patch looks good. I am also okay if you want to commit this patch first, and then improve the design in a follow-up patch.

-Andrea

Thanks Andrea,

Nice patch.

Would it be difficult to have a single string table for all the PfmIssueCounters defined by the scheduling models?

The compressed table comes with the downside that it requires an extra mapping between processor resource IDs and indices to the string table. That mapping could be stored somewhere in the "ExtraInfo" table.

I hope this makes sense.

Makes sense, this is similar to how scheduling info is stored.
I guess that what we gain from doing this also depends on whether we really do split ExtraInfo to a separate target that only llvm-mca and llvm-exegesis link. If that's the case, then not compressing is not a big deal.

Overall, the patch looks good. I am also okay if you want to commit this patch first, and then improve the design in a follow-up patch.

I'd rather do it in a separate patch to keep matters separate if you don't mind.

andreadb accepted this revision.Apr 9 2018, 6:58 AM

Thanks Andrea,

Nice patch.

Would it be difficult to have a single string table for all the PfmIssueCounters defined by the scheduling models?

The compressed table comes with the downside that it requires an extra mapping between processor resource IDs and indices to the string table. That mapping could be stored somewhere in the "ExtraInfo" table.

I hope this makes sense.

Makes sense, this is similar to how scheduling info is stored.
I guess that what we gain from doing this also depends on whether we really do split ExtraInfo to a separate target that only llvm-mca and llvm-exegesis link. If that's the case, then not compressing is not a big deal.

Overall, the patch looks good. I am also okay if you want to commit this patch first, and then improve the design in a follow-up patch.

I'd rather do it in a separate patch to keep matters separate if you don't mind.

Sounds good to me.

Thanks

This revision is now accepted and ready to land.Apr 9 2018, 6:58 AM

Created PR37068 to track compression.

This revision was automatically updated to reflect the committed changes.