This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Allow parallel uses of a resource
Abandoned · Public

Authored by dpenry on Jan 13 2021, 7:58 AM.

Details

Summary

At present, if a write resource is listed more than once by an
instruction, the resource is assumed to be used sequentially
across multiple cycles. The new ResourceUses attribute
of ProcWriteResource permits multiple simultaneous uses to be
specified.

Use the ResourceUses information when performing machine scheduling
and other codegen tasks.

A use of this new annotation is in revision: https://reviews.llvm.org/D94605

Diff Detail

Event Timeline

dpenry created this revision. Jan 13 2021, 7:58 AM
dpenry requested review of this revision. Jan 13 2021, 7:58 AM
Herald added a project: Restricted Project. Jan 13 2021, 7:58 AM
dpenry edited the summary of this revision. Jan 13 2021, 8:08 AM

I have only skimmed through this patch once; however, I think that you can fix the problem in https://reviews.llvm.org/D94605 without introducing your new field ResourceUses.

The "problematic" resource is M7UnitVPort

def M7UnitVPort  : ProcResource<2> { let BufferSize = 0; }

In your case, you want to allow the consumption of both resource units from a single write.
You can do that if you convert M7UnitVPort into a group (see the example below):

def M7UnitVPort0 : ProcResource<1> { let BufferSize = 0; }
def M7UnitVPort1 : ProcResource<1> { let BufferSize = 0; }

def M7UnitVPort : ProcResGroup<[M7UnitVPort0, M7UnitVPort1]>;

At that point, you simply enumerate the resource units in the list of consumed resources. So, something like this:

Example - before:

def : WriteRes<WriteFPMAC64, [M7UnitVFP, M7UnitVPort, M7UnitVPort]>

Example - after:

def : WriteRes<WriteFPMAC64, [M7UnitVFP, M7UnitVPort0, M7UnitVPort1]>

In conclusion, if the goal is to be able to do something like that, then I think the syntax is already expressive enough.
The obvious downside is that currently you need to declare multiple resources to do what you want to do.


Unfortunately, I have tried doing this with a resource group, with no success. ExpandProcResources ends up marking the resource group as used for multiple cycles:

From CortexM7ModelSchedClasses:

{DBGFIELD("IIC_fpFMAC64_WriteFPMAC64_ReadFPMAC_ReadFPMUL_ReadFPMUL") 1, true, false, 161, 4, 795, 1, 132, 3}, // #136

From ARMWriteProcResTable:

{ 9,  1}, // #161
{10,  2}, // #162
{11,  1}, // #163
{12,  1}, // #164

From CortexM7ModelProcResources:

{"M7UnitVFP",       1, 0, 0, nullptr}, // #9
{"M7UnitVPort",     2, 0, 0, CortexM7ModelProcResourceSubUnits + 1}, // #10
{"M7UnitVPort0",    1, 0, 0, nullptr}, // #11
{"M7UnitVPort1",    1, 0, 0, nullptr}, // #12

In the end, the test in lines 1139-1140 of SubtargetEmitter.cpp forces multiple uses of a resource -- whether they are explicitly stated in an InstRW, implied by using different resources in a resource group, or hierarchically stated as using subunits of the resource -- to take multiple cycles. That test seems so fundamental to the way that current schedule descriptions work that it seemed better to introduce the additional Uses notation than to change it.
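The summing behavior visible in the generated tables above can be sketched with a tiny toy model (plain Python, illustrative only; the dictionaries stand in for the generated tables and are not LLVM data structures):

```python
# Toy model of how a group's resource cycles are derived when a write
# names individual units of that group. Illustrative only, not LLVM code.

# Units consumed by the write, each for 1 cycle.
unit_cycles = {"M7UnitVPort0": 1, "M7UnitVPort1": 1}

# Groups and their member units.
groups = {"M7UnitVPort": ["M7UnitVPort0", "M7UnitVPort1"]}

# A group's resource cycles are the sum of its members' contributions.
table = dict(unit_cycles)
for group, members in groups.items():
    table[group] = sum(unit_cycles.get(m, 0) for m in members)

print(table)  # {'M7UnitVPort0': 1, 'M7UnitVPort1': 1, 'M7UnitVPort': 2}
```

This reproduces the `{10, 2}` entry for M7UnitVPort: two cycles booked against the group, one contributed by each unit.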


Maybe I am missing some context here (apologies in case), but why is that a problem in practice?

This is how I see it:

Resource-cycles are there to limit the resource throughput. The write from your example can only be issued when both ports (M7UnitVPort0 and M7UnitVPort1) are available. If group M7UnitVPort is partially or fully used, then your write needs to be delayed until both ports become available. The model assumes that micro-opcodes are all dispatched at the same cycle. We cannot currently model "delayed consumption of resources", so resource consumption starts immediately at the beginning of the issue cycle.
In practice, what that means is that ports are "consumed" during the entire duration of the issue cycle. The two resource cycles set by ExpandProcResources for group M7UnitVPort are in practice contributed by the underlying units (i.e. 1 cycle of M7UnitVPort0, and 1 cycle by M7UnitVPort1). So the group doesn't need to be consumed for any extra cycles.
That write alone is enough to maximise the throughput of M7UnitVPort; no other write that uses M7UnitVPort0 and/or M7UnitVPort1 can issue during that same cycle.


Perhaps I should state explicitly what it is that needs to be modeled for the Cortex-M7 scheduler to make sure we're on the same page:

  1. Some instructions require the entire FP datapath
  2. Other instructions require half of the FP datapath
  3. It is possible to dual-issue two instructions each requiring half of the FP datapath
  4. It is not possible to dual-issue instructions requiring the entire FP datapath with instructions requiring the entire FP datapath or half of the FP datapath.

I would love it if there was a way to just make this work out of the box. However, stating that a resource is used twice (that's what's in the current code) or that there's a resource group with two parts (as suggested) doesn't do the trick. Nor did trying to define VPort0 and VPort1 as sub-units of VPort.

I do get that resource consumption begins immediately, but the scheduling model certainly does allow a resource to be occupied for multiple cycles. And the MachineScheduler doesn't seem to care about how it came to be that way -- whether through resource groups, subclasses, or using the same resource twice in the InstRW. What it cares about is the list of resources and cycles in the WriteProcResTable.

So what appears to be happening when the resource-group method is used is that MachineScheduler, when scheduling one of these instructions that uses both M7UnitVPort0 and M7UnitVPort1, marks M7UnitVPort0 and M7UnitVPort1 as occupied until the next cycle and M7UnitVPort as occupied until two cycles from now (SchedBoundary::bumpNode, line 2427, for the top-down scheduling side). Similarly, the top-down check for scheduling this instruction looks for the first cycle in which all three of M7UnitVPort0, M7UnitVPort1, and M7UnitVPort are available, and doesn't try to find two units of M7UnitVPort (SchedBoundary::getNextResourceCycle). (Note that bottom-up scheduling takes the cycle count into account by adding it to the first available cycle rather than by incorporating it into the recorded occupancy, with essentially the same effect.) This does two unwanted things:

  1. It allows another M7UnitVPort user (e.g., a VLDR) to simultaneously issue in this cycle, which is not wanted.
  2. It prevents two M7UnitVPort users (e.g., VLDR.F32) from simultaneously issuing in the next cycle, when they should be able to.

It does succeed in preventing two dual-M7UnitVPort users from simultaneously issuing due to the limitations in M7UnitVPort0 and M7UnitVPort1.
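The flat bookkeeping described above can be sketched as a toy model (plain Python, illustrative only; `next_cycle` and `issue` are hypothetical stand-ins for the availability check and SchedBoundary::bumpNode, not their actual logic). It reproduces the second unwanted effect: a later single-port user is pushed back by the group reservation even though both physical ports are free again:

```python
# Toy sketch of flat per-resource bookkeeping: one "next free" cycle per
# named resource, with no notion that a group's cycles could be served
# by its member units. Illustrative only, not the MachineScheduler code.

reserved = {}  # resource -> first cycle at which it is free again

def next_cycle(uses, start=0):
    """Earliest cycle >= start at which every listed resource is free."""
    return max([start] + [reserved.get(r, 0) for r in uses])

def issue(uses, cycle):
    """Occupy each resource for its cycle count, starting at `cycle`."""
    for res, cycles in uses.items():
        reserved[res] = cycle + cycles

# Dual-port FP instruction: both units for 1cy each, group booked for 2cy.
dual = {"M7UnitVPort0": 1, "M7UnitVPort1": 1, "M7UnitVPort": 2}
c = next_cycle(dual)  # cycle 0
issue(dual, c)

# A single-port user consumes 1cy of the group only. The flat model makes
# it wait until cycle 2, although both ports are free again at cycle 1.
single = {"M7UnitVPort": 1}
print(next_cycle(single))  # 2
```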

yroux added a subscriber: yroux. Mar 9 2021, 1:39 AM

I'm not experienced enough in the machine scheduler to approve it, but the description of the problem and the proposed solution make sense to me.

Are there still any concerns about this patch?

<snip>

Perhaps I should state explicitly what it is that needs to be modeled for the Cortex-M7 scheduler to make sure we're on the same page:

  1. Some instructions require the entire FP datapath
  2. Other instructions require half of the FP datapath
  3. It is possible to dual-issue two instructions each requiring half of the FP datapath
  4. It is not possible to dual-issue instructions requiring the entire FP datapath with instructions requiring the entire FP datapath or half of the FP datapath.

I would love it if there was a way to just make this work out of the box. However, stating that a resource is used twice (that's what's in the current code) or that there's a resource group with two parts (as suggested) doesn't do the trick. Nor did trying to define VPort0 and VPort1 as sub-units of VPort.

What you have described is a classic scenario for processor resource groups. You can have a group for the entire FP datapath, and then model each half of the FP datapath separately with a resource unit. If this doesn't work for you, then it is a bug in the MachineScheduler (at least, in the logic that does the bookkeeping of resource cycles for groups).

More generally, an algorithm cannot ignore the resource-cycle contributions of individual units to a group. Otherwise, group latencies are incorrectly computed.

I do get that resource consumption begins immediately, but the scheduling model certainly does allow a resource to be occupied for multiple cycles. And the MachineScheduler doesn't seem to care about how it came to be that way -- whether through resource groups, subclasses, or using the same resource twice in the InstRW. What it cares about is the list of resources and cycles in the WriteProcResTable.

The number of cycles reported in WriteProcResTable is not a problem here. In fact, it is actually correct and it should be 2.
For groups, the number of resource cycles reported by WriteProcResTable doesn't necessarily translate to actual latency. Some of (if not all) the resource cycles consumed by a group may often map to the same runtime cycle. That is because each individual unit starts consumption at relative cycle #0. So there is clearly an overlap. It implies that resource cycles for groups can be consumed in parallel.

In WriteProcResTable, group resource cycles are computed by summing all the individual contributions from all the resource units (that, plus any extra cycles explicitly declared for the group). That's how you end up with 2cy for M7UnitVPort, and that is correct.

If MachineScheduler believes that those 2 resource cycles translate to a 2cy latency, then that's a bug in MachineScheduler.

In your case, group M7UnitVPort is consumed for 2 "resource cycles". However, it doesn't mean that the group will only be available every other cycle. Those two resource cycles are contributed by M7UnitVPort0 and M7UnitVPort1 (one resource cycle each), and the resource unit consumption always happens at relative cycle #0. In reality, those two resource cycles are effectively the same cycle (i.e. units are consumed in parallel for 1cy).

Again, apologies if I still don't get the full picture. But I strongly believe at this point that there might be a wrong assumption in the MachineScheduler on how resource cycles are set for groups.
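The distinction being drawn here, between resource cycles booked against a group and the wall-clock cycles for which the group is actually occupied, can be shown with a short sketch (plain Python, illustrative only):

```python
# Two units each contribute 1 resource cycle, and both start at relative
# cycle #0, so their contributions overlap in time. Illustrative only.

unit_busy = {"M7UnitVPort0": 1, "M7UnitVPort1": 1}  # cycles, from cycle 0

# Resource cycles booked against the group: the sum of the contributions.
group_resource_cycles = sum(unit_busy.values())

# Wall-clock cycles the group is occupied: the units run in parallel,
# so it is the max of the contributions, not the sum.
group_busy_cycles = max(unit_busy.values())

print(group_resource_cycles, group_busy_cycles)  # 2 1
```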

<snip>

Perhaps I should state explicitly what it is that needs to be modeled for the Cortex-M7 scheduler to make sure we're on the same page:

  1. Some instructions require the entire FP datapath
  2. Other instructions require half of the FP datapath
  3. It is possible to dual-issue two instructions each requiring half of the FP datapath
  4. It is not possible to dual-issue instructions requiring the entire FP datapath with instructions requiring the entire FP datapath or half of the FP datapath.

I would love it if there was a way to just make this work out of the box. However, stating that a resource is used twice (that's what's in the current code) or that there's a resource group with two parts (as suggested) doesn't do the trick. Nor did trying to define VPort0 and VPort1 as sub-units of VPort.

What you have described is a classic scenario for processor resource groups. You can have a group for the entire FP datapath, and then model each half of the FP datapath separately with a resource unit. If this doesn't work for you, then it is a bug in the MachineScheduler (at least, in the logic that does the bookkeeping of resource cycles for groups).

More generally, an algorithm cannot ignore the resource-cycle contributions of individual units to a group. Otherwise, group latencies are incorrectly computed.

That is certainly what I would have expected resource groups to do.

I do get that resource consumption begins immediately, but the scheduling model certainly does allow a resource to be occupied for multiple cycles. And the MachineScheduler doesn't seem to care about how it came to be that way -- whether through resource groups, subclasses, or using the same resource twice in the InstRW. What it cares about is the list of resources and cycles in the WriteProcResTable.

The number of cycles reported in WriteProcResTable is not a problem here. In fact, it is actually correct and it should be 2.
For groups, the number of resource cycles reported by WriteProcResTable doesn't necessarily translate to actual latency. Some of (if not all) the resource cycles consumed by a group may often map to the same runtime cycle. That is because each individual unit starts consumption at relative cycle #0. So there is clearly an overlap. It implies that resource cycles for groups can be consumed in parallel.

In WriteProcResTable, group resource cycles are computed by summing all the individual contributions from all the resource units (that, plus any extra cycles explicitly declared for the group). That's how you end up with 2cy for M7UnitVPort, and that is correct.

That is the computation I'm seeing.

If MachineScheduler believes that those 2 resource cycles translate to a 2cy latency, then that's a bug in MachineScheduler.

In your case, group M7UnitVPort is consumed for 2 "resource cycles". However, it doesn't mean that the group will only be available every other cycle. Those two resource cycles are contributed by M7UnitVPort0 and M7UnitVPort1 (one resource cycle each), and the resource unit consumption always happens at relative cycle #0. In reality, those two resource cycles are effectively the same cycle (i.e. units are consumed in parallel for 1cy).

That's what I don't see happening in MachineScheduler. It sees the group as if it were a separate resource which is consumed for two cycles and doesn't try to find parallel units within the group to provide those two cycles of resource consumption. As far as I can tell, the concept of groups does not exist at all in MachineScheduler.

Again, apologies if I still don't get the full picture. But I strongly believe at this point that there might be a wrong assumption in the MachineScheduler on how resource cycles are set for groups.

I think where we're getting to is that MachineScheduler has never been made to work "as expected" with groups. If groups are the accepted way to specify this sort of resource usage, then making MachineScheduler use group information would seem to be preferable to adding a new annotation. However, I do have one reservation. Groups are used fairly widely at present -- I see them in PowerPC, X86, AArch64, and ARM. I am not at all sanguine about changing something which would have such widespread effects without more people chiming in. Any idea who else should be part of this discussion?


When it comes to scheduling models, I personally think that @atrick is the most knowledgeable person. Most of the people interested in scheduling models have already been CC'd, so I don't know who else could be added...

On X86, we don't particularly care how resource bookkeeping is done because we still use the old post-RA scheduler.
The analysis conducted by that algorithm is purely "latency based"; it only tracks the completion of writes, so it is entirely data driven. I may be wrong, but the last time I looked at that algorithm, no particular checks were performed on processor resources. So I don't think that this issue affects X86, at least.

Speaking about X86 scheduling models, groups are mainly used in two cases:

  1. to model hardware schedulers (each one with its own buffer).
  2. to restrict the set of ports/pipes that can be consumed by instructions.

About 1: Intel (out-of-order) processors (at least those for which there is a scheduling model upstream) tend to implement a single unified scheduler in hardware.
AMD processors on the other hand, tend to use a mixed approach, with multiple schedulers in the Integer unit (typically one scheduler per ALU pipe), and a unified scheduler for the FPU. The FPU is often implemented using a coprocessor model; it is disjoint from the Integer cluster, and it often uses a unified scheduler to serve the underlying pipes.

This is taken from the Haswell model:

// 60 Entry Unified Scheduler
def HWPortAny : ProcResGroup<[HWPort0, HWPort1, HWPort2, HWPort3, HWPort4, HWPort5, HWPort6, HWPort7]> {
  let BufferSize=60;
}

The unified scheduler is basically our HWPortAny, which literally sees ALL the hardware ports. So, any instruction will always consume HWPortAny resource cycles.
There are also other groups, but those are fully contained in HWPortAny, and are only used to restrict which ports are usable for specific instructions (based on their scheduling class).

Since HWPortAny contains all the other groups and all the existing port units, the expectation is that each write will consume HWPortAny resource cycles too.
Actually, the more ports are consumed during a cycle, the higher the number of HWPortAny resource cycles consumed. To put it another way, the higher the issue throughput, the higher the number of resource cycles consumed by HWPortAny. It is essentially a measure of throughput; I definitely don't see it as a measure of latency.

You could easily imagine what problems it would cause for X86 models if we started using a scheduling algorithm which doesn't distinguish between groups and normal units. HWPortAny is just an example of a group which would be negatively impacted by that heuristic. HWPortAny would likely always stall for multiple cycles whenever the issue throughput is more than 1.

In reality, group availability should only depend on whether the contained resource units are available or not.
This is how I see it, and it is also how I have implemented resource usage in llvm-mca. The ResourceManager component in llvm-mca is responsible for the bookkeeping of processor resources; it internally tracks resource consumption. If all the units of a group are consumed during a cycle, then (during that same cycle) the group is unavailable for other instructions.
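The availability rule just described can be sketched as follows (plain Python, illustrative only; this toy class is not the actual llvm-mca ResourceManager, and the method names are invented for the example):

```python
# Toy version of the bookkeeping rule: a group is unavailable only while
# all of its member units are in use. Illustrative only, not llvm-mca code.

class ResourceManager:
    def __init__(self, groups):
        self.groups = groups  # group name -> list of member units
        self.in_use = set()   # units currently consumed this cycle

    def consume(self, unit):
        self.in_use.add(unit)

    def release(self, unit):
        self.in_use.discard(unit)

    def group_available(self, group):
        # Available as long as at least one member unit is free.
        return any(u not in self.in_use for u in self.groups[group])

rm = ResourceManager({"M7UnitVPort": ["M7UnitVPort0", "M7UnitVPort1"]})
rm.consume("M7UnitVPort0")
print(rm.group_available("M7UnitVPort"))  # True: one port still free
rm.consume("M7UnitVPort1")
print(rm.group_available("M7UnitVPort"))  # False: both ports busy
```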

OK. So X86 isn't a concern because it isn't using MachineScheduler (at least not post-RA).

Indeed, after looking a bit to brainstorm how MachineScheduler would change, I note that only when BufferSize of a resource is 0 does MachineScheduler update any of the resource usage (ReservedCycles). And with a bit more digging, there is currently only one upstream scheduling model which uses both BufferSize = 0 and resource groups -- and the resource group itself doesn't have BufferSize=0. So, a change to MachineScheduler to respect groups when BufferSize = 0 doesn't look like it's likely to cause much disruption.

I'll go try that out and see what happens.


Yeah. From what I remember, BufferSize=0 is kind of special because it is used to simulate in-order units in an otherwise out-of-order backend.
Which - if you think of groups like schedulers - kinda makes sense. A scheduler with no buffer is forced to issue instructions immediately at dispatch time, so dispatching is effectively equivalent to issuing. I had to specially support that in llvm-mca (it gave me some headaches).

There should be a nice comment about it in TargetSchedule.td.

dpenry abandoned this revision. Apr 19 2021, 1:31 PM

Abandoned because it was replaced by D98976.