This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Use ProcResGroup information in SchedBoundary
Closed, Public

Authored by dpenry on Mar 19 2021, 11:46 AM.

Details

Summary

When the ProcResGroup has BufferSize=0,

  1. if a subunit appears in the list of write resources for the scheduling class, do not attempt to schedule the ProcResGroup.
  2. if no subunit appears in the list of write resources for the scheduling class, choose a subunit to use instead of the ProcResGroup.
  3. having both the ProcResGroup and any of its subunits among the resources implied by an InstRW is not supported.

This is used to model parallel uses drawn from a pool of resources.
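
As a minimal sketch of the kind of model this targets (hypothetical names, not from the patch): two single-unit pipes pooled behind an unbuffered group, so a write that names the group can issue on whichever pipe is free.

let BufferSize = 0 in {
  def PipeA : ProcResource<1>;
  def PipeB : ProcResource<1>;
  def PipePool : ProcResGroup<[PipeA, PipeB]>;
}

// Consumes either PipeA or PipeB for one cycle, whichever is free.
def WritePooled : SchedWriteRes<[PipePool]> { let ResourceCycles = [1]; }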

Diff Detail

Event Timeline

dpenry created this revision. Mar 19 2021, 11:46 AM
dpenry requested review of this revision. Mar 19 2021, 11:46 AM
Herald added a project: Restricted Project. Mar 19 2021, 11:46 AM
dpenry edited the summary of this revision. Mar 19 2021, 11:54 AM
dpenry edited the summary of this revision.
RKSimon added a subscriber: RKSimon. Apr 8 2021, 3:05 AM

I'm not sure this avoids the criticisms from D94604 - we have resource groups and the child resource units, and we can control the width of both of these to model usage each cycle.

But for starters, I'd like to see tests of some kind (llvm-mca timeline or whatever you prefer) that clearly show the issue you think you're encountering.

The issue comes when using resource groups with BufferSize = 0 -- MachineScheduler just doesn't treat the group as indicating a parallel use of resources, though from the discussion in D94604, it seems clear that this is what resource groups are intended to model. It hasn't shown up as an issue before because resource groups with BufferSize = 0 hadn't been used before D98977 (which tries to use resource groups in the CortexM7 scheduling model instead of directly using multiple copies of a resource, since code comments indicate that using a resource twice in an in-order model is supposed to mean using it over multiple cycles).

Some sort of test showing the change in scheduling outcomes is a good idea and I'll come up with something.

This patch doesn't look right to me. I share the same concerns as Simon.

How resource cycles are contributed by individual units of a group has nothing to do with the value of BufferSize.
I don't understand why you think it should make a difference.

A BufferSize of zero simply means that the dispatch event must coincide with the issue event. In the absence of a buffer, opcodes are forced to be sent immediately for execution to the underlying resource units. Basically, there is no delayed execution, and resource consumption starts immediately. It doesn't say anything about how and which units are selected for execution.
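
A minimal sketch of that distinction (hypothetical definitions):

// Unbuffered: dispatch must coincide with issue, and consumption starts immediately.
let BufferSize = 0 in
def UnbufferedUnit : ProcResource<1>;

// Buffered: up to four uOPs may wait between dispatch and issue.
let BufferSize = 4 in
def BufferedUnit : ProcResource<1>;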

In MachineScheduler, BufferSize = 0 means more than just dispatch = issue. I don't know if this is what was intended originally, but it's certainly what's in the code. Only resources with BufferSize = 0 are "reserved" resources, and ReservedCycles is only updated for these resources. Thus, if BufferSize <> 0, the consumption is *not* tracked on a cycle-by-cycle basis. This is not anything I have changed; if it doesn't make sense, it hasn't made sense for a long time.

But this actually does make sense, I think. When BufferSize = 0, the structural hazard stalls both issue and dispatch; the resource consumption can be assigned to a particular cycle. When it isn't, it stalls dispatch but not issue, and the resource consumption can't really be assigned to a particular cycle. Thus, the current approach of not tracking it exactly but using it to determine overall resource pressure is a valid one when BufferSize <> 0.

TableGen doesn't allow a resource to be repeated in the list of processor resources for an instruction, stating in the comments that it's important to use them serially for in-order processors. (I'm not sure I agree with that, but that's what it says and does.) Thus, uses of two sub-resources of a resource group, two uses of the same resource, or uses of two resources with the same parent all end up with some resource being used for 2 cycles instead of two uses of 1 cycle. That's perfectly fine if BufferSize <> 0 for the resource, as the amount of resource pressure is the same for two uses of one cycle and one use of two cycles. But for BufferSize = 0, you get weird scheduling results instead. D94604 attempted to get around this by allowing an explicit statement in the scheduling model of how many parallel uses of a resource were intended and carrying that information in the WriteProcRes table. As that was deemed unnecessary because uses of resources within resource groups are supposed to imply parallel uses, this patch looks to see if the unbuffered resource at hand is a resource group and adjusts its cycle use as if the uses are intended to be in parallel.
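
An illustration of that folding, with hypothetical definitions in the style of the later examples:

let BufferSize = 0 in {
  def Pipe0 : ProcResource<1>;
  def Pipe1 : ProcResource<1>;
  def PipeGroup : ProcResGroup<[Pipe0, Pipe1]>;
}

// Intended: one cycle on each of two parallel pipes. The subtarget emitter
// also credits PipeGroup with 1cy + 1cy = 2 cycles, and with BufferSize = 0
// the scheduler treats that as PipeGroup held for two consecutive cycles
// rather than as two units used in parallel for one cycle each.
def WParallel : SchedWriteRes<[Pipe0, Pipe1]> { let ResourceCycles = [1, 1]; }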

If this all seems surprising, it is... who would have thought that it was impossible to schedule multiple uses of a resource by an in-order processor correctly? But it turns out that no processor scheduling model had tried to use BufferSize = 0 with a processor resource group before. The CortexM7 model uses BufferSize = 0 and multiple uses of the same resource, which doesn't produce good results; debugging that is what led to these patches. One use of two cycles and two uses of one cycle are not the same thing for an in-order processor.

I will work up a test that shows what is happening currently, and hopefully that will make it more clear.

Thanks for clarifying this particular aspect. I was under the (wrong) impression that resource consumption was always updated regardless of whether BufferSize was 0 or not.
Since that is not the case, your patch definitely makes more sense. Presumably, the algorithm was just trying to be very conservative, not making any sort of assumptions about the issue cycles in the presence of buffered resources. That indeed limits the ability to track resources; tracking is effectively limited to in-order dispatch/issue resources only.

Based on your last comment, my understanding is that CortexM7 is the only model really affected by this change.
I have only skimmed through the code, but it sounds reasonable overall.
@dmgreen would you be happy with this change?

You folks all know more about scheduling than I do, but if you are in accord, that is great. I can certainly verify that this improves the accuracy of the M7 schedule for the samples I've seen.

There are a number of loops where the if and else are very similar now. Is it possible to combine them, possibly with the use of some lambda functions? It might be worth having a getWriteProcRes that returns an iterator_range.

Some other nitpicks inline.

llvm/lib/CodeGen/MachineScheduler.cpp
2058

Don't need the brackets

2059

u -> U, ue -> UE.

2256–2267

ResTmp -> umm, AdjustedResources?

2258

I would add brackets to these, if they have large statements under them.

I had another look at the patch.

The resource cycle computation doesn't look correct. I think that there is a bit of confusion on how resource cycles of a group are allocated to the underlying sub-units. See my comments below.

llvm/lib/CodeGen/MachineScheduler.cpp
2061–2062

This doesn't look correct.

The same processor resource may belong to multiple groups, not just one.

Example:

def X : ProcResource<1>;
def Y : ProcResource<1>;
def Z : ProcResource<1>;

def A : ProcResGroup<[X, Y]>;
def B : ProcResGroup<[X, Z]>;
def C : ProcResGroup<[X, Y, Z]>;

Here X is included in A, B and C.

Your logic is only able to track a single group identifier. This is not enough.
It may be uncommon or nonsensical for your particular model. However, the syntax lets you define this particular nesting of groups.

2116–2124

Remove tabs.

2125

Unnecessary else after return.

2189

Early exit to reduce indentation if HasUnbufferedResourceGroup is false.

2199

Why do you need to keep track of the number of uses?

The number of users of a group isn't important.
The only thing that matters for the resource latency computation is how many cycles are contributed by the sub-units. See my comment below.

2204

This seems redundant. PR is already cleared on entry to this method.

2212

This doesn't look right.

Two problems:

  1. Different sub-units may be consumed for a different number of resource cycles.
  2. The number of resource cycles consumed by all the sub-units of a group may not match the value of PI->Cycles.

About 1.
Example:

def X : ProcResource<1>;
def Y : ProcResource<1>;
def A : ProcResGroup<[X, Y]>;

def MyWrite : SchedWriteRes<[X,Y]> {
  let Latency = 3;
  let ResourceCycles = [1, 2];
}

MyWrite is a write that consumes X for 1 cycle, and Y for 2 cycles.
Both X and Y are part of group A, so the subtarget emitter will generate the following profile:

X -> 1 resource cycle
Y -> 2 resource cycles
A -> 3 resource cycles

In this case, the resource cycles of group A are equivalent to the sum of the resource cycles of X and the resource cycles of Y.

Group A is not consumed for any extra cycles, so the resource latency of A is MAX(ResourceCycle_of(X), ResourceCycle_of(Y)) = 2cy. X is consumed for 1cy. Y is consumed for 2cy.

As far as I understand, your code assumes that X and Y will be consumed for the same number of cycles. This is not true in general.
Also, the number of group cycles may not match the total cycles contributed by the sub-units (see the following example).

About point 2.

def MyWrite : SchedWriteRes<[X,Y,A]> {
  let Latency = 3;
  let ResourceCycles = [1, 2, 1];
}

This is similar to the previous example; however, A is consumed for an extra cycle.
The profile generated by the subtarget emitter is:

X -> 1 resource cycle
Y -> 2 resource cycles
A -> 4 resource cycles

This situation is more uncommon, but still possible. The model doesn't prevent users from "overbooking" group resources.

If we subtract the resource cycles from X and Y, we find out that A is still consumed for an extra cycle. In this case, conservatively assume that A is consumed for MAX(ResourceCycle_of(X), ResourceCycle_of(Y)) + 1 -> 3cy.
So, 3cy instead of 2cy.

I didn't deal with resources assigned to more than one group as an intentional simplification; I had intended to document that. The code can certainly be made to accommodate that with more complexity.

Again as a simplification, there is an assumption of equal cycle usage among members of a group. You could also think of it as finding the average number of cycles the resource group is used. I wouldn't mind making it more exact, but max is not going to give the right answer for true hazarding; it would need to really recreate the original structure of uses (e.g. 1 instance for 2 cycles in parallel with 1 instance for 1 cycle). Not impossible, but more complexity.

However, there must be some tracking of how many uses have been made of a resource group by its members. The whole point of this exercise is to be able to model the simultaneous use of multiple instances of a resource; the resource group represents this resource. If we don't track them in some way, we can't use the right number of them.

andreadb added a comment. Edited Apr 14 2021, 2:58 AM

It sounds like you are making a lot of simplifications though.
Those simplifications might make sense for your particular scenario, but they don't make any sense in the general case.

I understand that part of the goal is to keep algorithmic complexity low, but the resource usage is definitely inaccurate for units that are in multiple groups.
I am not sure about how important this may be for future in-order processor definitions. But it is potentially a significant limitation, and I am not the right person to say that this is the right way to go. If people who work more with in-order processors think that this scenario is not so important/very unlikely for in-order processors, then fine.
I guess the next question would be: where do you plan to document this limitation?

About the number of cycles:
We know exactly the number of resource cycles for each resource unit, so we shouldn't be using the average (at least for resource units).

For groups, I understand your concern about using MAX (I actually meant to write MIN, to be honest - I wrote that message very late). However, I disagree that it is about computing the average. A group is "available" if at least one of its sub-units is available. Conversely, it is busy if _all_ the sub-units are also busy.

> However, there must be some tracking of how many uses have been made of a resource group by its members. The whole point of this exercise is to be able to model the simultaneous use of multiple instances of a resource; the resource group represents this resource. If we don't track them in some way, we can't use the right number of them.

Normally you would not need to track uses to solve that issue. It might make sense for your particular algorithm, but in normal circumstances, the resource consumption is never a function of the users, but only of the resource cycles contributed by the writes.

dpenry updated this revision to Diff 337583. Apr 14 2021, 5:11 PM

Reordered D98976 and D98977 and added a test case to show improvement in scheduling results.

dpenry added a comment. Edited Apr 14 2021, 5:26 PM

I have added a test case which might clarify how the scheduling improves with the scheduler changes; the t2ADDrr is able to dual-issue with VADDD, but VLDRS is not.

But more fundamentally, I am not sure that I'm understanding the semantics of ProcResGroup in the same way, so I'll just ask a few questions...

Assume:

let BufferSize = 0 in {
   def X : ProcResource<1>;
   def Y : ProcResource<1>; 
   def A : ProcResGroup<[X, Y]>;
}

def MyWideWrite : SchedWriteRes<[X,Y]> {
  let ResourceCycles = [1, 1];
}

def MyNarrowWrite: SchedWriteRes<[A]> {
  let ResourceCycles = [1];
}
  1. How many instances of A does the scheduling model intend to say that there are: one or two?
  2. How many instances of A is MyWideWrite intended to consume?
  3. How many instances of A is MyNarrowWrite intended to consume?
  4. The goal is to be able to allow two instructions which use MyNarrowWrite to issue together, but to prevent a pair of instructions, one using MyNarrowWrite and the other using MyWideWrite, from dispatching/issuing together. Is this the right way to represent that restriction?

This is how I see it:

A group behaves like a hardware scheduler. Consuming a group for 1cy is equivalent to consuming one of its underlying sub-units for 1cy.
That is because a group acts as a proxy/dispatcher of uOPs to its sub-units. When a group is consumed by a write, the consumption is effectively redirected towards one of its sub-units.
uOPs issued to a sub-unit always go through the parent group. Basically, a group is always either implicitly or explicitly used when one of its sub-units is consumed.
A group is available if at least one of its sub-units is also available. Conversely, a group is unavailable if all its sub-units are also unavailable. A group becomes available again when at least one of its sub-units becomes available.
How a "victim" resource sub-unit is selected by a group, is unfortunately unspecified, and therefore it is implementation dependent. llvm-mca allows users to customise the selection logic of individual groups for specific targets. By default, it assumes a (pseudo) LRU (so that, over time, resource consumption through a group is always uniformly spread among all the sub-units). That being said, since this is all unspecified, it is fine for algorithms to just pick the first sub-unit available from the set.

In your example, there is only one instance of group A, and it intercepts all the uses/consumptions of X and Y. So, both MyWideWrite and MyNarrowWrite use A.

Since MyWideWrite consumes both X and Y for 1cy, group A also becomes unavailable for 1cy. MyWideWrite is basically consuming the entire group, effectively stalling dispatch from A until at least one of X and Y becomes available again.

MyNarrowWrite only consumes one of X and Y for 1cy. That is because consuming A for 1cy is equivalent to consuming one of {X, Y} for 1cy.

Group A is still available after the issue of MyNarrowWrite if not all sub-units are busy. Note that this can only happen when X and Y are both available before the issue of MyNarrowWrite. Group A is fully consumed for at least 1cy if either X or Y was busy before MyNarrowWrite was issued.
For that reason, the sequence MyNarrowWrite, MyWideWrite can never be issued during the same cycle; MyWideWrite requires that both resources are available.

The sequence MyNarrowWrite, MyNarrowWrite can be issued during the same cycle. However, this can only happen if both X and Y are available at the beginning of the sequence.
If one of X and Y is unavailable at the beginning of that sequence, then the second MyNarrowWrite will need to wait for one cycle.

The sequence MyWideWrite, MyNarrowWrite can never be issued during the same cycle, because MyWideWrite consumes the entire group A, thus making resource A unavailable for 1cy.
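
To make those rules concrete, a hypothetical cycle-by-cycle reading of the three sequences, assuming X and Y both start free:

MyNarrowWrite, MyNarrowWrite: cycle 0 - the first takes X, the second takes Y; both issue together.
MyNarrowWrite, MyWideWrite: cycle 0 - MyNarrowWrite takes X; MyWideWrite needs both X and Y, so it issues in cycle 1.
MyWideWrite, MyNarrowWrite: cycle 0 - MyWideWrite takes X and Y; group A is fully busy, so MyNarrowWrite issues in cycle 1.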

Back to your patch. I am OK with a simplified approach like yours as long as a) it fixes your particular cases, b) your case is likely to be the most common one, and c) we document all the assumptions made by your algorithm somewhere in the MachineScheduler.
I don't want to block this progress because, as Dave pointed out, it does improve the runtime in some cases.

OK, I think we are in agreement about what they're supposed to allow/disallow when ResourceCycles is uniform. I'd like to make a stab at making this work in the general cases of SchedWriteRes you mentioned before as well:

  1. Resource is member of multiple resource groups: there's nothing conceptually too difficult about it; just a matter of coding it up.
  2. Mixed latencies: In the code below, in the cycle after an instruction with UnevenWideWrite issues, should it be possible to issue one or two MyNarrowWrites? I'm inclined to say it should be one... only X is actually available for use in that cycle.
  3. Leftover group time: If I've understood what you've remarked on before, after issuing OverWideWrite nothing else in the group can issue for this or the following cycle. After issuing OverNarrowWrite, only one NarrowWrite can issue in this cycle or the next cycle.
def X : ProcResource<1>;
def Y : ProcResource<1>;
def A : ProcResGroup<[X, Y]>;

def UnevenWideWrite : SchedWriteRes<[X,Y]> {
  let ResourceCycles = [1,2];
}
def MyNarrowWrite : SchedWriteRes<[A]> {
  let ResourceCycles = [1];
}
def OverWideWrite : SchedWriteRes<[X,Y,A]> {
  let ResourceCycles = [1,1,1];
}
def OverNarrowWrite : SchedWriteRes<[X,A]> {
  let ResourceCycles = [1,1];
}

As an algorithmic detail, though there's logically one instance of a ProcResGroup, the subtarget emitter actually reports it to have as many instances as the total instances of its members. There seem to be a few options for dealing with groups:

  1. Continue to instantiate that many instances in the scheduler and use them to track how many members of the group have been spoken for in each cycle. This will support all of the general cases and carries the most flavor of what I've tried to do before.
  2. Instantiate only one instance of the processor resource group, but don't track how many members have been spoken for (because tracking would be effectively the same thing as the previous option but with more changes to the code).
    1. Only mark it as in use if there is leftover group time. This will not permit dual issue of OverNarrowWrite with anything else from the group, as the group will have been marked used.
    2. OR mark it as in use if there is leftover group time and all members are used. In this case, we lose the extra time in which OverNarrowWrite prevents a dual-issue of NarrowWrite.
  3. Don't instantiate the processor resource group at all and assign the leftover time to some chosen victim.
  4. Don't instantiate the processor resource group at all and declare that leftover time gets ignored during hazarding (but is still part of pressure calculations) for BufferSize = 0.

Use of the second, third, or fourth alternative will require some means of choosing which member is the victim; first-available, LRU, or plugin are all viable options.

At this point, #4 is the way I'm leaning; it's clean and simple -- a resource group is fully used (and thus prevents further issue) when all its members are used. No extra tracking is required. It is also fairly simple to state what model features are not being handled -- write resources which mix the members and the group and for which the group has BufferSize = 0. #2 and #3 just seem like half-solutions that will sometimes work and sometimes not work. #1 will leave code very similar to what you see now (without dividing cycles by units used, but still requiring requests for multiple instances).
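
For concreteness, a hypothetical cycle-0 trace of #4 using the definitions above:

OverNarrowWrite issues: X is busy for 1cy; A's leftover group cycle is ignored for hazarding (but still counted toward pressure).
MyNarrowWrite issues: it takes Y, since A is not marked fully used.

Under #2.1 that second issue would have been blocked; that is exactly the restriction #4 gives up.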

Sounds good to me.
It is a reasonable compromise. Plus, it should be good enough for all your important use cases.

dpenry updated this revision to Diff 338259. Apr 16 2021, 4:43 PM
dpenry edited the summary of this revision.

Major rework to address review comments. Much simpler code!

dpenry marked 11 inline comments as done. Apr 16 2021, 4:46 PM

Marked comments as done which are no longer relevant to the code.

dpenry updated this revision to Diff 338260. Apr 16 2021, 4:50 PM

Fixed a comment

dpenry edited the summary of this revision. Apr 16 2021, 5:06 PM
andreadb accepted this revision. Apr 18 2021, 3:51 AM

Patch looks good. I left a couple of comments below.

Thanks!

llvm/include/llvm/CodeGen/MachineScheduler.h
678

uint64_t is fine in general because models tend to declare fewer than 30 processor resources on average. On x86 I have seen peaks of ~45 resources (iirc). I don't expect future models to end up declaring more than 63 resources (even mca makes that strong assumption).

That being said, if possible (and if it is not problematic for your code) it would be much safer to use APInt, so that we are not limited to 64 bits and can manipulate arbitrarily big bitmasks.

llvm/lib/CodeGen/MachineScheduler.cpp
2056–2057

Could you move this check into a helper method?
Something like IsUnbufferedGroup(). A similar check is repeated later on in the code; using a helper would make the code slightly more readable.

2115

See my previous comment.

This revision is now accepted and ready to land. Apr 18 2021, 3:51 AM
dpenry updated this revision to Diff 338562. Apr 19 2021, 10:59 AM

Created helper function and used APInt instead of uint64_t.

dpenry marked 3 inline comments as done. Apr 19 2021, 11:02 AM

Now using APInt (thanks for that... I'd wanted something like that but had forgotten what it was called!) and a helper function.

Thanks for the detailed reviews; I think this is much better code than I started with.

andreadb accepted this revision. Apr 19 2021, 11:30 AM

LGTM.

Thank you for bearing with me :)

This revision was landed with ongoing or failed builds. Apr 19 2021, 1:28 PM
This revision was automatically updated to reflect the committed changes.