This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
include/llvm/CodeGen/
-
llvm/
-
CodeGen/
1/1
MachineScheduler.h
-
lib/CodeGen/
-
CodeGen/
14/14
MachineScheduler.cpp

Differential D98976

[CodeGen] Use ProcResGroup information in SchedBoundary
ClosedPublic

Authored by dpenry on Mar 19 2021, 11:46 AM.

Download Raw Diff

Details

Reviewers

MatzeB
jroelofs
Paul-C-Anagnostopoulos
dmgreen
steven.zhang
atrick
evgeny777
andreadb
RKSimon

Commits

rGca8eef7e3da8: [CodeGen] Use ProcResGroup information in SchedBoundary

Summary

When the ProcResGroup has BufferSize=0,

if there is a subunit in the list of write resources for the scheduling class, do not attempt to schedule the ProcResGroup.
if there is not a subunit in the list of write resources for the scheduling class, choose a subunit to use instead of the ProcResGroup.
having both the ProcResGroup and any of its subunits in the resources implied by a InstRW is not supported.

Used to model parallel uses from a pool of resources.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit TestsFailed

	Time	Test
	90 ms	x64 windows > MLIR.Conversion/PDLToPDLInterp::pdl-to-pdl-interp-matcher.mlir

Event Timeline

dpenry created this revision.Mar 19 2021, 11:46 AM

Herald added subscribers: ecnelises, javed.absar, hiraditya, MatzeB. · View Herald TranscriptMar 19 2021, 11:46 AM

dpenry requested review of this revision.Mar 19 2021, 11:46 AM

Herald added a project: Restricted Project. · View Herald TranscriptMar 19 2021, 11:46 AM

Herald added a subscriber: llvm-commits. · View Herald Transcript

dpenry added a child revision: D98977: [ARM] Use ProcResGroup in Cortex-M7 scheduling model.Mar 19 2021, 11:52 AM

dpenry edited the summary of this revision. (Show Details)Mar 19 2021, 11:54 AM

dpenry added reviewers: MatzeB, jroelofs, Paul-C-Anagnostopoulos, dmgreen, steven.zhang, atrick, evgeny777, andreadb.

dpenry edited the summary of this revision. (Show Details)

Harbormaster completed remote builds in B94762: Diff 331962.Mar 19 2021, 1:47 PM

<ping>

RKSimon added a subscriber: RKSimon.Apr 8 2021, 3:05 AM

I'm not sure this avoids the criticisms from D94604 - we have resource groups and the child resource units, and we can control the width of both of these to model usage each cycle.

But for starters, I'd like to see tests of some kind (llvm-mca timeline or whatever you prefer) that clearly shows the issue you think you're encountering.

In D98976#2682694, @RKSimon wrote:

I'm not sure this avoids the criticisms from D94604 - we have resource groups and the child resource units, and we can control the width of both of these to model usage each cycle.

But for starters, I'd like to see tests of some kind (llvm-mca timeline or whatever you prefer) that clearly shows the issue you think you're encountering.

The issue comes when using resource groups with BufferSize = 0 -- MachineScheduler just doesn't care about the group indicating a parallel use of resources, though from the discussion in D94604, it seems clear that this is what resource groups are intended to model. It hasn't shown up as an issue before because resource groups with BufferSize = 0 haven't been used before D98977 (which tries to use resource groups in the CortexM7 scheduling model instead of directly using multiple copies of a resource, as code comments indicate that using a resource twice for an in-order model is supposed to mean using it over multiple cycles.)

Some sort of test showing the change in scheduling outcomes is a good idea and I'll come up with something.

In D98976#2684092, @dpenry wrote:

In D98976#2682694, @RKSimon wrote:

I'm not sure this avoids the criticisms from D94604 - we have resource groups and the child resource units, and we can control the width of both of these to model usage each cycle.

But for starters, I'd like to see tests of some kind (llvm-mca timeline or whatever you prefer) that clearly shows the issue you think you're encountering.

The issue comes when using resource groups with BufferSize = 0 -- MachineScheduler just doesn't care about the group indicating a parallel use of resources, though from the discussion in D94604, it seems clear that this is what resource groups are intended to model. It hasn't shown up as an issue before because resource groups with BufferSize = 0 haven't been used before D98977 (which tries to use resource groups in the CortexM7 scheduling model instead of directly using multiple copies of a resource, as code comments indicate that using a resource twice for an in-order model is supposed to mean using it over multiple cycles.)

Some sort of test showing the change in scheduling outcomes is a good idea and I'll come up with something.

This patch doesn't look right to me. I share the same concerns as Simon.

How resource cycles are contributed by individual units of a group has nothing to do with the value of BufferSize.
I don't understand why you think it should make a difference.

A BufferSize of zero simply means that the dispatch event must coincide with the issue event. In the absence of a buffer, opcodes are forced to be sent immediately for execution to the underlying resource units. Basically, there is no delayed execution, and resource consumption starts immediately. It doesn't say anything about how and which units are selected for execution.

In D98976#2685193, @andreadb wrote:

In D98976#2684092, @dpenry wrote:

In D98976#2682694, @RKSimon wrote:

I'm not sure this avoids the criticisms from D94604 - we have resource groups and the child resource units, and we can control the width of both of these to model usage each cycle.

But for starters, I'd like to see tests of some kind (llvm-mca timeline or whatever you prefer) that clearly shows the issue you think you're encountering.

The issue comes when using resource groups with BufferSize = 0 -- MachineScheduler just doesn't care about the group indicating a parallel use of resources, though from the discussion in D94604, it seems clear that this is what resource groups are intended to model. It hasn't shown up as an issue before because resource groups with BufferSize = 0 haven't been used before D98977 (which tries to use resource groups in the CortexM7 scheduling model instead of directly using multiple copies of a resource, as code comments indicate that using a resource twice for an in-order model is supposed to mean using it over multiple cycles.)

Some sort of test showing the change in scheduling outcomes is a good idea and I'll come up with something.

This patch doesn't look right to me. I share the same concerns as Simon.

How resource cycles are contributed by individual units of a group has nothing to do with the value of BufferSize.
I don't understand why you think it should make a difference.

A BufferSize of zero simply means that the dispatch event must coincide with the issue event. In the absence of a buffer, opcodes are forced to be sent immediately for execution to the underlying resource units. Basically, there is no delayed execution, and resource consumption starts immediately. It doesn't say anything about how and which units are selected for execution.

In MachineScheduler, BufferSize = 0 means more than just dispatch = issue. I don't know if this is what was intended originally, but it's certainly what's in the code. Only resources with BufferSize = 0 are "reserved" resources. ReservedCycles is only updated for these resources Thus, if BufferSize <> 0, the consumption is *not* tracked on a cycle-by-cycle basis. This is not anything I have changed; if it doesn't make sense, it hasn't made sense for a long time.

But this actually does make sense, I think. When BufferSize = 0, the structural hazard stalls both issue and dispatch; the resource consumption can be assigned to a particular cycle. When it isn't, it's stalling dispatch, but not issue, and the resource consumption can 't really be assigned to a particular cycle. Thus, the current approach of not tracking it exactly but using it to determine overall resource pressure is a valid one when BufferSize <> 0.

TableGen doesn't allow a resource to be repeated in the list of processor resources for an instruction, stating in the comments that it's important to use them serially for in-order processors. (I'm not sure I agree with that, but that's what it says and does.) Thus uses of two sub-resources of a resource group, two uses of the same resource, or uses of two resources with the same parent all end up with some resource being used for 2 cycles instead of two uses of 1 cycle. That's perfectly fine if BufferSize <> 0 for the resource, as the amount of resource pressure is the same for two uses of one cycle and one use of two cycles. But for BufferSize = 0, you get weird scheduling results instead. D94604 attempted to get around this by allowing an explicit statement in the scheduling model of how many parallel uses of a resource were intended and carrying that information in the WriteProcRes table. As that was deemed not necessary because uses of resources within resource groups are supposed to imply parallel uses, this patch looks to see if the unbuffered resource at hand is a resource group and adjusts its cycle use as if the uses are intended to be in parallel.

If this all seems surprising, it is... who would have thought that it was impossible to schedule multiple uses of a resource by an in-order processor correctly? But it turns out that no processor scheduling model had tried to use BufferSize = 0 with a processor resource group before. The CortexM7 model uses BufferSize = 0 and multiple uses of the same resource, which doesn't produce good results; debugging that is what led to these patches. One use of two cycles and two uses of one cycle are not the same thing for an in-order processor.

I will work up a test that shows what is happening currently, and hopefully that will make it more clear.

In MachineScheduler, BufferSize = 0 means more than just dispatch = issue. I don't know if this is what was intended originally, but it's certainly what's in the code. Only resources with BufferSize = 0 are "reserved" resources. ReservedCycles is only updated for these resources Thus, if BufferSize <> 0, the consumption is *not* tracked on a cycle-by-cycle basis. This is not anything I have changed; if it doesn't make sense, it hasn't made sense for a long time.

Thanks for clarifying this particular aspect. I was under the (wrong) impression that resource consumption was always updated regardless of whether BufferSize was 0 or not.
Since that is not the case, then your patch definitely makes more sense. Presumably, the algorithm was just trying to be very conservative, not making any sort of assumptions on the issue cycles in the presence of buffered resources. That indeed limits the ability to track resources; tracking is effectively limited to in-order dispatch/issue resources only.

Based on your last comment, my understanding is that CortexM7 is the only model really affected by this change.
I have only skimmed through the code, but it sounds reasonable overall.
@dmgreen would you be happy with this change?

You folks all know more about scheduling than I do, but if you are in accord that is great. I can certainly verify that this improves the accuracy of the M7 schedule for the samples I've seen.

There are a number of loops where the if and else are very similar now. Is it possible to combine them, possibly with the use of some lambda functions? It might be worth having a getWriteProcRes that returns an iterator_range.

Some other nitpicks inline.

llvm/lib/CodeGen/MachineScheduler.cpp
2058	Don't need the brackets
2059	u -> U, ue -> UE.
2256	ResTmp -> umm, AdjustedResources?
2258	I would add brackets to these, if they have large statements under them.

I had another look at the patch.

The resource cycle computation doesn't look correct. I think that there is a bit of confusion on how resource cycles of a group are allocated to the underlying sub-units. See my comments below.

llvm/lib/CodeGen/MachineScheduler.cpp
2061–2062	This doesn't look correct. A same processor resource may belong to multiple groups, not just one. Example: def X : ProcResource<1>; def Y : ProcResource<1>; def Z : ProcResource<1>; def A : ProcResGroup<[X, Y]>; def B : ProcResGroup<[X, Z]>; def C : ProcResGroup<[X, Y, Z]>; Here X is included in A, B and C. Your logic is only able to track a single group identifier. This is not enough. It may be uncommon or non-sensical for your particular model. However the syntax let's you define this particular nesting of groups.
2116–2124	Remove tabs.
2125	unnecessary else after return.
2189	Early exit to reduce indentation if `HasUnbufferedResourceGroup` is false.
2199	Why do you need to keep track of the number of uses? The number of users of a group isn't important. The only thing that matters for the resource latency computation is how many cycles are contributed by the sub-units. See my comment below.
2204	This seems redundant. PR is already cleared on entry to this method.
2212	This doesn't look right. Two problems: Different sub-units may be consumed for a different number of resource cycles. The number of resource cycles consumed by all the sub-units of a group may not match the value of `PI->Cycles`. About 1. Example: def X : ProcResource<1>; def Y : ProcResource<1>; def A : ProcResGroup<[X, Y]>; def MyWrite : SchedWriteRes<[X,Y]> { let Latency = 3; let ResourceCycles = [1, 2]; } MyWrite is a write that consumes X for 1 cycle, and Y for 2 cycles. Both X and Y are part of group A, so the subtarget emitter will generate the following profile: X -> 1 resource cycle Y -> 2 resource cycles A -> 3 resource cycles In this case, the resource cycles of group A are equivalent to the sum of the resource cycles of X and the resource cycles of Y. Group A is not consumed for any extra cycles, so the resource latency of A is `MAX(ResourceCycle_of(X), ResourceCycle_of(Y)) = 2cy`. X is consumed for 1cy. Y is consumed for 2cy. As far as I understand, your code assumes that X and Y will be consumed for the same number of cycles. This is not true in general. Also, the number of group cycles may not match the total cycles contributed by the sub-units (see the following example). About point 2. def MyWrite : SchedWriteRes<[X,Y,A]> { let Latency = 3; let ResourceCycles = [1, 2, 1]; } This is similar to the previous example. However , A is consumed for an extra cycle. the profile generated by the subtarget emitter is: X -> 1 resource cycle Y -> 2 resource cycles A -> 4 resource cycles This situation is more uncommon, but still possible. The model doesn't prevent users from "overbooking" group resources. If we subtract the resource cycles from X and Y, we find out that A is still consumed for an extra cycle. In this case, conservatively assume that A is consumed for `MAX(ResourceCycle_of(X), ResourceCycle_of(Y)) + 1 -> 3cy`. So, 3cy instead of 2cy.

In D98976#2686840, @andreadb wrote:

I had another look at the patch.

The resource cycle computation doesn't look correct. I think that there is a bit of confusion on how resource cycles of a group are allocated to the underlying sub-units. See my comments below.

I didn't deal with resources assigned to more than one group as an intentional simplification; I had intended to document that. The code can certainly be made to accommodate that with more complexity.

Again as a simplification there is an assumption of equal cycle usages among members of a group. You could also think of it as finding the average number of cycles the resource group is used. I wouldn't mind make it more exact, but max is not going to give the right answer for true hazarding; it would need to really recreate the original structure of uses (e.g. 1 instance for 2 cycles in parallel with 1 instance for 1 cycle). Not impossible, but more complexity.

However, there must be a tracking of how many uses have been made of a resource group by its members. The whole point of this exercise is to be able to model the simultaneous use of multiple instances of a resource; the resource group represents this resource. If we don't track them in some way, we can't use the right number of them.

llvm/lib/CodeGen/MachineScheduler.cpp
2199	Why do you need to keep track of the number of uses? The number of users of a group isn't important. The only thing that matters for the resource latency computation is how many cycles are contributed by the sub-units. See my comment below.

In D98976#2687108, @dpenry wrote:

In D98976#2686840, @andreadb wrote:

I had another look at the patch.

The resource cycle computation doesn't look correct. I think that there is a bit of confusion on how resource cycles of a group are allocated to the underlying sub-units. See my comments below.

I didn't deal with resources assigned to more than one group as an intentional simplification; I had intended to document that. The code can certainly be made to accommodate that with more complexity.

Again as a simplification there is an assumption of equal cycle usages among members of a group. You could also think of it as finding the average number of cycles the resource group is used. I wouldn't mind make it more exact, but max is not going to give the right answer for true hazarding; it would need to really recreate the original structure of uses (e.g. 1 instance for 2 cycles in parallel with 1 instance for 1 cycle). Not impossible, but more complexity.

It sounds like you are making a lot of simplifications though.
Those simplifications might make sense for your particular scenario, but they don't make any sense in the general case.

I understand that part of the goal is to keep algorithmic complexity low, but the resource usage is definitely inaccurate for units that are in multiple groups.
I am not sure about how important this may be for future in-order processor definitions. But it is potentially a significant limitation, and I am not the right person to say that this is the right way to go. If people who work more with in-order processors think that this scenario is not so important/very unlikely for in-order processors, then fine.
I guess the next question would be: where do you plan to document this limitation?

About the number of cycles:
We know exactly the number of resource cycles for each resource unit, so we shouldn't be using the average (at least for resource units).

For groups, I understand your concern about using MAX (I actually meant to write MIN to be honest - I wrote that message very late). However I disagree that it is about computing the average. A group is "available" if at least one of its sub-units is available. Conversely, it is busy if _all_ the sub-units are also busy.

However, there must be a tracking of how many uses have been made of a resource group by its members. The whole point of this exercise is to be able to model the simultaneous use of multiple instances of a resource; the resource group represents this resource. If we don't track them in some way, we can't use the right number of them.

Normally you would not need to track uses to solve that issue. It might makes sense for your particular algorithm, but in normal circumstances, the resource consumption is never a function of the users, but only of the resource cycles contributed by the writes.

Reordered D98976 and D98977 and added a test case
to show improvement in scheduling results

Harbormaster completed remote builds in B98783: Diff 337583.Apr 14 2021, 5:11 PM

dpenry removed a child revision: D98977: [ARM] Use ProcResGroup in Cortex-M7 scheduling model.Apr 14 2021, 5:12 PM

dpenry added a parent revision: D98977: [ARM] Use ProcResGroup in Cortex-M7 scheduling model.

I have added a test case which might clarify how the scheduling improves with the scheduler changes; the t2ADDrr is able to dual-issue with VADDD, but VLDRS is not.

But more fundamentally, I am not sure that I'm understanding the semantics of ProcResGroup in the same way, so I'll just ask a few questions...

Assume:

let BufferSize = 0 in {
   def X : ProcResource<1>;
   def Y : ProcResource<1>; 
   def A : ProcResGroup<[X, Y]>;
}

def MyWideWrite : SchedWriteRes<[X,Y]> {
  let ResourceCycles = [1, 1];
}

def MyNarrowWrite: SchedWriteRes<[A]> {
  let ResourceCycles = [1];
}

How many instances of A does the scheduling model intend to say that there are: one or two?
How many instances of A is MyWideWrite intended to consume?
How many instances of A is MyNarrowWrite intended to consume?
The goal is to be able to allow two instructions which use MyNarrowWrite to issue together, but to prevent a pair of instructions, one using MyNarrowWrite and the other using MyWideWrite, from dispatching/issuing together. Is this the right way to represent that restriction?

In D98976#2690205, @dpenry wrote:
I have added a test case which might clarify how the scheduling improves with the scheduler changes; the t2ADDrr is able to dual-issue with VADDD, but VLDRS is not.

But more fundamentally, I am not sure that I'm understanding the semantics of ProcResGroup in the same way, so I'll just ask a few questions...

Assume:
let BufferSize = 0 in {
   def X : ProcResource<1>;
   def Y : ProcResource<1>; 
   def A : ProcResGroup<[X, Y]>;
}

def MyWideWrite : SchedWriteRes<[X,Y]> {
  let ResourceCycles = [1, 1];
}

def MyNarrowWrite: SchedWriteRes<[A]> {
  let ResourceCycles = [1];
}
How many instances of A does the scheduling model intend to say that there are: one or two?

How many instances of A is MyWideWrite intended to consume?

How many instances of A is MyNarrowWrite intended to consume?

The goal is to be able to allow two instructions which use MyNarrowWrite to issue together, but to prevent a pair of instructions, one using MyNarrowWrite and the other using MyWideWrite, from dispatching/issuing together. Is this the right way to represent that restriction?

This is how I see it:

A group behaves like a hardware scheduler. Consuming a group for 1cy is equivalent to consuming one of its underlying sub-units for 1cy.
That is because a group acts as a proxy/dispatcher of uOPs to its sub-units. When a group is consumed by a write, the consumption is effectively redirected towards one of its sub-units.
uOPs issued to a sub-unit always go through the parent group. Basically, a group is always either implicitly or explicitly used when one of its sub-units is consumed.
A group is available if at least one of its sub-units is also available. Conversely, a group is unavailable if all its sub-units are also unavailable. A group becomes available again when at least one of its sub-units becomes available.
How a "victim" resource sub-unit is selected by a group, is unfortunately unspecified, and therefore it is implementation dependent. llvm-mca allows users to customise the selection logic of individual groups for specific targets. By default, it assumes a (pseudo) LRU (so that, over time, resource consumption through a group is always uniformly spread among all the sub-units). That being said, since this is all unspecified, it is fine for algorithms to just pick the first sub-unit available from the set.

In your example, there is only one instance of group A, and it intercepts all the uses/consumptions of X and Y. So, both MyWideWrite and MyNarrowWrite use A.

Since MyWideWrite consumes both X and Y for 1cy, group A becomes also unavailable for 1cy. MyWideWrite is basically consuming the entire group, effectively stalling the dispatch from A until at least one between X and Y becomes available again.

MyNarrowWrite only consumes one between X and Y for 1cy. That is because consuming A for 1cy is equivalent to consuming one of {X, Y} for 1cy.

Group A is still available after the issue of MyNarrowWrite if not all sub-units are busy. Note that this can only happen when X and Y are both available before the issue of MyNarrowWrite. Group A is fully consumed for at least 1cy if either X or Y was busy before MyNarrowWrite was issued.
For that reason, the sequence MyNarrowWrite, MyWideWrite can never be issued during a same cycle; MyWideWrite requires that both resources are avaible.

Sequence MyNarrowWrite, MyNarrowWrite can be issued during a same cycle. However, this can only happen if both X and Y are available at the beginning of the sequence.
If one between X and Y is unavailable at the beginning of that sequence, then the second MyWideWrite will need to wait for one cycle.

A sequence MyWideWrite, MyNarrowWrite can never be issued during a same cycle, because MyNarrowWrite consumes the entire group of A, this making resource A unavailable for 1cy.

Back to your patch. I am OK with a simplified approach like yours as long as a) it fixes your particular cases, b) your case is likely to be the most common one, and c) we document all the assumptions made by your algorithm somewhere in the MachineScheduler.
I don't want to block this progress because, as Dave pointed out, it does improve the runtime in some cases.

OK, I think we are in agreement about what they're supposed to allow/disallow when ResourceCycles is uniform. I'd like to make a stab at making this work in the general cases of SchedWriteRes you mentioned before as well:

Resource is member of multiple resource groups: there's nothing conceptually too difficult about it; just a matter of coding it up.
Mixed latencies: In the code below, in the cycle after an instruction with UnevenWideWrite issues, should it be possible to issue one or two MyNarrowWrites? I'm inclined to say it should be one.... only X is actually available for use in that cycle.
Leftover group time: If I've understood what you've remarked on before, after issuing OverWideWrite nothing else in the group can issue for this or the following cycle. After issuing OverNarrowWrite, only one NarrowWrite can issue in this cycle or the next cycle.

def X : ProcResource<1>;
def Y : ProcResource<1>;
def A : ProcResGroup<[X, Y]>;

def UnevenWideWrite : SchedWriteRes<[X,Y]> {
  let ResourceCycles = [1,2];
}
def MyNarrowWrite : SchedWriteRes<[A]> {
  let ResourceCycles = [1];
}
def OverWideWrite : SchedWriteRes<[X,Y,A]> {
  let ResourceCycles = [1,1,1];
}
def OverNarrowWrite : SchedWriteRes<[X,A]> {
  let ResourceCycles = [1,1];
}

As an algorithmic detail, though there's logically one instance of a ProcResGroup, the subtarget emitter actually reports it to have as many instances as the total instances of its members. There seem to be a few options for dealing with groups:

Continue to instantiate that many instances in the scheduler and use them to track how many members of the group have been spoken for in each cycle. This will support all of the general cases and carries the most flavor of what I've tried to do before.
Instantiate only one instance of the processor resource group, but don't track how many members have been spoken for (because tracking would be effectively the same thing as the previous option but with more changes to the code).
1. Only mark it as in use if there is leftover group time. This will not permit dual issue of OverNarrowWrite with anything else from the group, as the group will have been marked used.
2. OR mark it as in use if there is leftover group time and all members are used. In this case, we lose the extra time in which OverNarrowWrite prevents a dual-issue of NarrowWrite.
Don't instantiate the processor resource group at all and assign the leftover time to some chosen victim.
Don't instantiate the processor resource group at all and declare that leftover time gets ignored during hazarding (but is still part of pressure calculations) for BufferSize = 0.

Use of the second, third, or fourth alternative will require some means of choosing which member is the victim; first-available, LRU, or plugin are all viable options.

At this point, #4 is the way I'm leaning; it's clean and simple -- a resource group is fully used (and thus prevents further issue) when all its members are used. No extra tracking is required. It is also fairly simple to state what model features are not being handled -- write resources which mix the members and the group and for which the group has BufferSize = 0. #2 and #3 just seem like half-solutions that will sometimes work and sometimes not work. #1 will leave code very similar to what you see now (without dividing cycles by units used, but still requiring requests for multiple instances).

Sounds good to me.
It is a reasonable compromise. Plus, it should be good enough for all your important use cases.

Major rework to address review comments. Much simpler code!

Marked comments as done which are no longer relevant to the code.

Fixed a comment

dpenry edited the summary of this revision. (Show Details)Apr 16 2021, 5:06 PM

Harbormaster completed remote builds in B99284: Diff 338260.Apr 16 2021, 6:47 PM

Harbormaster completed remote builds in B99283: Diff 338259.Apr 16 2021, 7:14 PM

Patch looks good. I left a couple of comments below.

Thanks!

llvm/include/llvm/CodeGen/MachineScheduler.h
678	uint64_t is fine in general because models tend to declare less than 30 processor resources on average. On x86 I have seen peaks of ~45 resources (iirc). I don't expect future models to end up declaring more than 63 resources (even mca makes that strong assumption). That being said, if possible (and if it is not problematic for your code) it would be much safer to use APInt, so that we are not limited to 64-bits and we can manipulate arbitrary big bitmasks.
llvm/lib/CodeGen/MachineScheduler.cpp
2056–2057	Could you move this check into a helper method? Something like `IsUnbufferedGroup()`. A similar chek is repeated later on in the code; using a helper would make the code slightly more readable.
2115	See my previous comment.

This revision is now accepted and ready to land.Apr 18 2021, 3:51 AM

Created helper function and used APInt instead of uint64_t.

Now using APInt (thanks for that.... I'd wanted something like that but had forgotten what it was called!) and a helper function.

Thanks for detailed reviews; I think this is much better code than I started with.

LGTM.

Thank you for bearing with me :)

Harbormaster completed remote builds in B99511: Diff 338562.Apr 19 2021, 11:46 AM

This revision was landed with ongoing or failed builds.Apr 19 2021, 1:28 PM

Closed by commit rGca8eef7e3da8: [CodeGen] Use ProcResGroup information in SchedBoundary (authored by dpenry, committed by dmgreen). · Explain Why

This revision was automatically updated to reflect the committed changes.

dmgreen added a commit: rGca8eef7e3da8: [CodeGen] Use ProcResGroup information in SchedBoundary.

dpenry mentioned this in D94604: [CodeGen] Allow parallel uses of a resource.Apr 19 2021, 1:31 PM

dpenry mentioned this in D133572: [MachinePipeliner] Fix the interpretation of the scheduling model.Sep 13 2022, 8:20 AM

Revision Contents

Path

Size

llvm/

include/

llvm/

CodeGen/

MachineScheduler.h

23 lines

lib/

CodeGen/

MachineScheduler.cpp

255 lines

Diff 331962

llvm/include/llvm/CodeGen/MachineScheduler.h

Show First 20 Lines • Show All 668 Lines • ▼ Show 20 Lines	private:
// Record the highest cycle at which each resource has been reserved by a		// Record the highest cycle at which each resource has been reserved by a
// scheduled instruction.		// scheduled instruction.
SmallVector<unsigned, 16> ReservedCycles;		SmallVector<unsigned, 16> ReservedCycles;

// For each PIdx, stores first index into ReservedCycles that corresponds to		// For each PIdx, stores first index into ReservedCycles that corresponds to
// it.		// it.
SmallVector<unsigned, 16> ReservedCyclesIndex;		SmallVector<unsigned, 16> ReservedCyclesIndex;

		// For each PIdx, stores the resource group ID of which it is a member
		SmallVector<uint16_t, 16> ResourceGroupIDs;
		andreadbUnsubmitted Done Reply Inline Actions uint64_t is fine in general because models tend to declare less than 30 processor resources on average. On x86 I have seen peaks of ~45 resources (iirc). I don't expect future models to end up declaring more than 63 resources (even mca makes that strong assumption). That being said, if possible (and if it is not problematic for your code) it would be much safer to use APInt, so that we are not limited to 64-bits and we can manipulate arbitrary big bitmasks. andreadb: uint64_t is fine in general because models tend to declare less than 30 processor resources on…
		bool HasUnbufferedResourceGroup;

#ifndef NDEBUG		#ifndef NDEBUG
// Remember the greatest possible stall as an upper bound on the number of		// Remember the greatest possible stall as an upper bound on the number of
// times we should retry the pending queue because of a hazard.		// times we should retry the pending queue because of a hazard.
unsigned MaxObservedStall;		unsigned MaxObservedStall;
#endif		#endif

public:		public:
/// Pending queues extend the ready queues with the same ID and the		/// Pending queues extend the ready queues with the same ID and the
▲ Show 20 Lines • Show All 61 Lines • ▼ Show 20 Lines	public:

/// Get the difference between the given SUnit's ready time and the current		/// Get the difference between the given SUnit's ready time and the current
/// cycle.		/// cycle.
unsigned getLatencyStallCycles(SUnit *SU);		unsigned getLatencyStallCycles(SUnit *SU);

unsigned getNextResourceCycleByInstance(unsigned InstanceIndex,		unsigned getNextResourceCycleByInstance(unsigned InstanceIndex,
unsigned Cycles);		unsigned Cycles);

std::pair<unsigned, unsigned> getNextResourceCycle(unsigned PIdx,		std::pair<unsigned, unsigned>
unsigned Cycles);		getNextResourceCycle(unsigned PIdx, unsigned Cycles, unsigned Uses);

		typedef SmallVector<std::pair<unsigned, unsigned>, 8> NextResourceCycles;
		NextResourceCycles getNextResourceCycleList(unsigned PIdx, unsigned Cycles,
		unsigned Uses);

		struct AdjustedResEntry {
		uint16_t ProcResourceIdx;
		uint16_t Cycles;
		uint16_t Uses;
		};
		void adjustProcResources(const MCSchedClassDesc *SC,
		SmallVector<AdjustedResEntry, 16> &PR);

bool checkHazard(SUnit *SU);		bool checkHazard(SUnit *SU);

unsigned findMaxLatency(ArrayRef<SUnit*> ReadySUs);		unsigned findMaxLatency(ArrayRef<SUnit*> ReadySUs);

unsigned getOtherResourceCount(unsigned &OtherCritIdx);		unsigned getOtherResourceCount(unsigned &OtherCritIdx);

/// Release SU to make it ready. If it's not in hazard, remove it from		/// Release SU to make it ready. If it's not in hazard, remove it from
/// pending queue (if already in) and push into available queue.		/// pending queue (if already in) and push into available queue.
/// Otherwise, push the SU into pending queue.		/// Otherwise, push the SU into pending queue.
///		///
/// @param SU The unit to be released.		/// @param SU The unit to be released.
/// @param ReadyCycle Until which cycle the unit is ready.		/// @param ReadyCycle Until which cycle the unit is ready.
/// @param InPQueue Whether SU is already in pending queue.		/// @param InPQueue Whether SU is already in pending queue.
/// @param Idx Position offset in pending queue (if in it).		/// @param Idx Position offset in pending queue (if in it).
void releaseNode(SUnit *SU, unsigned ReadyCycle, bool InPQueue,		void releaseNode(SUnit *SU, unsigned ReadyCycle, bool InPQueue,
unsigned Idx = 0);		unsigned Idx = 0);

void bumpCycle(unsigned NextCycle);		void bumpCycle(unsigned NextCycle);

void incExecutedResources(unsigned PIdx, unsigned Count);		void incExecutedResources(unsigned PIdx, unsigned Count);

unsigned countResource(unsigned PIdx, unsigned Cycles, unsigned ReadyCycle);		unsigned countResource(unsigned PIdx, unsigned Cycles, unsigned Uses,
		unsigned ReadyCycle);

void bumpNode(SUnit *SU);		void bumpNode(SUnit *SU);

void releasePending();		void releasePending();

void removeReady(SUnit *SU);		void removeReady(SUnit *SU);

/// Call this before applying any other heuristics to the Available queue.		/// Call this before applying any other heuristics to the Available queue.
▲ Show 20 Lines • Show All 308 Lines • Show Last 20 Lines

llvm/lib/CodeGen/MachineScheduler.cpp

Show First 20 Lines • Show All 1,998 Lines • ▼ Show 20 Lines	void SchedBoundary::reset() {
ExpectedLatency = 0;		ExpectedLatency = 0;
DependentLatency = 0;		DependentLatency = 0;
RetiredMOps = 0;		RetiredMOps = 0;
MaxExecutedResCount = 0;		MaxExecutedResCount = 0;
ZoneCritResIdx = 0;		ZoneCritResIdx = 0;
IsResourceLimited = false;		IsResourceLimited = false;
ReservedCycles.clear();		ReservedCycles.clear();
ReservedCyclesIndex.clear();		ReservedCyclesIndex.clear();
		ResourceGroupIDs.clear();
#ifndef NDEBUG		#ifndef NDEBUG
// Track the maximum number of stall cycles that could arise either from the		// Track the maximum number of stall cycles that could arise either from the
// latency of a DAG edge or the number of cycles that a processor resource is		// latency of a DAG edge or the number of cycles that a processor resource is
// reserved (SchedBoundary::ReservedCycles).		// reserved (SchedBoundary::ReservedCycles).
MaxObservedStall = 0;		MaxObservedStall = 0;
#endif		#endif
// Reserve a zero-count for invalid CritResIdx.		// Reserve a zero-count for invalid CritResIdx.
ExecutedResCounts.resize(1);		ExecutedResCounts.resize(1);
Show All 25 Lines	init(ScheduleDAGMI dag, const TargetSchedModel smodel, SchedRemainder *rem) {
reset();		reset();
DAG = dag;		DAG = dag;
SchedModel = smodel;		SchedModel = smodel;
Rem = rem;		Rem = rem;
if (SchedModel->hasInstrSchedModel()) {		if (SchedModel->hasInstrSchedModel()) {
unsigned ResourceCount = SchedModel->getNumProcResourceKinds();		unsigned ResourceCount = SchedModel->getNumProcResourceKinds();
ReservedCyclesIndex.resize(ResourceCount);		ReservedCyclesIndex.resize(ResourceCount);
ExecutedResCounts.resize(ResourceCount);		ExecutedResCounts.resize(ResourceCount);
		ResourceGroupIDs.resize(ResourceCount, 0);
		HasUnbufferedResourceGroup = false;
unsigned NumUnits = 0;		unsigned NumUnits = 0;

for (unsigned i = 0; i < ResourceCount; ++i) {		for (unsigned i = 0; i < ResourceCount; ++i) {
ReservedCyclesIndex[i] = NumUnits;		ReservedCyclesIndex[i] = NumUnits;
NumUnits += SchedModel->getProcResource(i)->NumUnits;		NumUnits += SchedModel->getProcResource(i)->NumUnits;
		if (SchedModel->getProcResource(i)->SubUnitsIdxBegin) {
		HasUnbufferedResourceGroup \|=
		andreadbUnsubmitted Done Reply Inline Actions Could you move this check into a helper method? Something like `IsUnbufferedGroup()`. A similar chek is repeated later on in the code; using a helper would make the code slightly more readable. andreadb: Could you move this check into a helper method? Something like `IsUnbufferedGroup()`. A similar…
		!(SchedModel->getProcResource(i)->BufferSize);
		dmgreenUnsubmitted Done Reply Inline Actions Don't need the brackets dmgreen: Don't need the brackets
		for (unsigned u = 0, ue = SchedModel->getProcResource(i)->NumUnits;
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'u' [readability-identifier-naming] not useful clang-tidy: warning: invalid case style for variable 'ue' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'u' [readability-identifier-naming]…
		dmgreenUnsubmitted Done Reply Inline Actions u -> U, ue -> UE. dmgreen: u -> U, ue -> UE.
		u != ue; ++u) {
		ResourceGroupIDs[SchedModel->getProcResource(i)
		->SubUnitsIdxBegin[u]] = i;
		andreadbUnsubmitted Done Reply Inline Actions This doesn't look correct. A same processor resource may belong to multiple groups, not just one. Example: def X : ProcResource<1>; def Y : ProcResource<1>; def Z : ProcResource<1>; def A : ProcResGroup<[X, Y]>; def B : ProcResGroup<[X, Z]>; def C : ProcResGroup<[X, Y, Z]>; Here X is included in A, B and C. Your logic is only able to track a single group identifier. This is not enough. It may be uncommon or non-sensical for your particular model. However the syntax let's you define this particular nesting of groups. andreadb: This doesn't look correct. A same processor resource may belong to multiple groups, not just…
		}
		}
}		}

ReservedCycles.resize(NumUnits, InvalidCycle);		ReservedCycles.resize(NumUnits, InvalidCycle);
}		}
}		}

/// Compute the stall cycles based on this SUnit's ready time. Heuristics treat		/// Compute the stall cycles based on this SUnit's ready time. Heuristics treat
/// these "soft stalls" differently than the hard stall cycles based on CPU		/// these "soft stalls" differently than the hard stall cycles based on CPU
Show All 25 Lines	if (!isTop())
NextUnreserved += Cycles;		NextUnreserved += Cycles;
return NextUnreserved;		return NextUnreserved;
}		}

/// Compute the next cycle at which the given processor resource can be		/// Compute the next cycle at which the given processor resource can be
/// scheduled. Returns the next cycle and the index of the processor resource		/// scheduled. Returns the next cycle and the index of the processor resource
/// instance in the reserved cycles vector.		/// instance in the reserved cycles vector.
std::pair<unsigned, unsigned>		std::pair<unsigned, unsigned>
SchedBoundary::getNextResourceCycle(unsigned PIdx, unsigned Cycles) {		SchedBoundary::getNextResourceCycle(unsigned PIdx, unsigned Cycles,
		unsigned Uses) {
unsigned MinNextUnreserved = InvalidCycle;		unsigned MinNextUnreserved = InvalidCycle;
unsigned InstanceIdx = 0;		unsigned InstanceIdx = 0;
unsigned StartIndex = ReservedCyclesIndex[PIdx];		unsigned StartIndex = ReservedCyclesIndex[PIdx];
unsigned NumberOfInstances = SchedModel->getProcResource(PIdx)->NumUnits;		unsigned NumberOfInstances = SchedModel->getProcResource(PIdx)->NumUnits;
assert(NumberOfInstances > 0 &&		assert(NumberOfInstances > 0 &&
"Cannot have zero instances of a ProcResource");		"Cannot have zero instances of a ProcResource");

		if (Uses == 1) {
		andreadbUnsubmitted Done Reply Inline Actions See my previous comment. andreadb: See my previous comment.
for (unsigned I = StartIndex, End = StartIndex + NumberOfInstances; I < End;		for (unsigned I = StartIndex, End = StartIndex + NumberOfInstances; I < End;
++I) {		++I) {
unsigned NextUnreserved = getNextResourceCycleByInstance(I, Cycles);		unsigned NextUnreserved = getNextResourceCycleByInstance(I, Cycles);
if (MinNextUnreserved > NextUnreserved) {		if (MinNextUnreserved > NextUnreserved) {
InstanceIdx = I;		InstanceIdx = I;
MinNextUnreserved = NextUnreserved;		MinNextUnreserved = NextUnreserved;
}		}
}		}
return std::make_pair(MinNextUnreserved, InstanceIdx);		return std::make_pair(MinNextUnreserved, InstanceIdx);
		andreadbUnsubmitted Done Reply Inline Actions Remove tabs. andreadb: Remove tabs.
		} else {
		andreadbUnsubmitted Done Reply Inline Actions unnecessary else after return. andreadb: unnecessary else after return.
		SmallVector<std::pair<unsigned, unsigned>, 8> NextResCycles;
		for (unsigned I = StartIndex, End = StartIndex + NumberOfInstances; I < End;
		++I) {
		unsigned NextUnreserved = getNextResourceCycleByInstance(I, Cycles);
		NextResCycles.emplace_back(NextUnreserved, I);
		}
		// Find the earliest (Uses) resource instances
		std::nth_element(NextResCycles.begin(), NextResCycles.begin() + (Uses - 1),
		NextResCycles.end());
		return NextResCycles[Uses - 1];
		}
		}

		/// Compute the next cycle at which the given processor resource can be
		/// scheduled. Returns the next cycle and the indices of the processor resource
		/// instances in the reserved cycles vector.
		SchedBoundary::NextResourceCycles
		SchedBoundary::getNextResourceCycleList(unsigned PIdx, unsigned Cycles,
		unsigned Uses) {
		unsigned MinNextUnreserved = InvalidCycle;
		unsigned InstanceIdx = 0;
		unsigned StartIndex = ReservedCyclesIndex[PIdx];
		unsigned NumberOfInstances = SchedModel->getProcResource(PIdx)->NumUnits;
		assert(NumberOfInstances > 0 &&
		"Cannot have zero instances of a ProcResource");

		SmallVector<std::pair<unsigned, unsigned>, 8> NextResCycles;
		if (Uses == 1) {
		for (unsigned I = StartIndex, End = StartIndex + NumberOfInstances; I < End;
		++I) {
		unsigned NextUnreserved = getNextResourceCycleByInstance(I, Cycles);
		if (MinNextUnreserved > NextUnreserved) {
		InstanceIdx = I;
		MinNextUnreserved = NextUnreserved;
		}
		}
		NextResCycles.emplace_back(MinNextUnreserved, InstanceIdx);
		} else {
		for (unsigned I = StartIndex, End = StartIndex + NumberOfInstances; I < End;
		++I) {
		unsigned NextUnreserved = getNextResourceCycleByInstance(I, Cycles);
		NextResCycles.emplace_back(NextUnreserved, I);
		}
		// Find the earliest (Uses) resource instances
		std::nth_element(NextResCycles.begin(), NextResCycles.begin() + (Uses - 1),
		NextResCycles.end());
		NextResCycles.resize(Uses);
		}

		return NextResCycles;
		}

		/// Adjust the processor write resources to use for unbuffered ProcResGroups.
		/// The problem that must be solved is that the WriteProcRes table reports
		/// the "total" cycles*instances claim upon a group from direct uses of it to
		/// it as well as uses of its member. When BufferSize = 0, individual claims
		/// for instances of the group are needed. This is accomplished by generating
		/// a new write resource vector (in a passed-in temporary vector) which has
		/// individual claims on the group from each member of it as well as a claim
		/// upon the entire group which is reduced by those individual claims.
		void SchedBoundary::adjustProcResources(const MCSchedClassDesc *SC,
		SmallVector<AdjustedResEntry, 16> &PR) {
		PR.clear();
		if (HasUnbufferedResourceGroup) {
		andreadbUnsubmitted Done Reply Inline Actions Early exit to reduce indentation if `HasUnbufferedResourceGroup` is false. andreadb: Early exit to reduce indentation if `HasUnbufferedResourceGroup` is false.
		SmallVector<uint16_t, 32> GroupAdjustments;
		GroupAdjustments.resize(SchedModel->getNumProcResourceKinds(), 0);
		bool HasResGroup = false;
		for (TargetSchedModel::ProcResIter
		PI = SchedModel->getWriteProcResBegin(SC),
		PE = SchedModel->getWriteProcResEnd(SC);
		PI != PE; ++PI) {
		unsigned ResGrpID = ResourceGroupIDs[PI->ProcResourceIdx];
		if (ResGrpID && !SchedModel->getProcResource(ResGrpID)->BufferSize) {
		GroupAdjustments[ResGrpID]++;
		andreadbUnsubmitted Done Reply Inline Actions Why do you need to keep track of the number of uses? The number of users of a group isn't important. The only thing that matters for the resource latency computation is how many cycles are contributed by the sub-units. See my comment below. andreadb: Why do you need to keep track of the number of uses? The number of users of a group isn't…
		dpenryAuthorUnsubmitted Done Reply Inline Actions Why do you need to keep track of the number of uses? The number of users of a group isn't important. The only thing that matters for the resource latency computation is how many cycles are contributed by the sub-units. See my comment below. dpenry: > Why do you need to keep track of the number of uses? > > The number of users of a group…
		HasResGroup = true;
		}
		}
		if (HasResGroup) {
		PR.clear();
		andreadbUnsubmitted Done Reply Inline Actions This seems redundant. PR is already cleared on entry to this method. andreadb: This seems redundant. PR is already cleared on entry to this method.
		for (TargetSchedModel::ProcResIter
		PI = SchedModel->getWriteProcResBegin(SC),
		PE = SchedModel->getWriteProcResEnd(SC);
		PI != PE; ++PI)
		if (GroupAdjustments[PI->ProcResourceIdx])
		PR.emplace_back(AdjustedResEntry{
		PI->ProcResourceIdx,
		uint16_t(PI->Cycles / GroupAdjustments[PI->ProcResourceIdx]),
		andreadbUnsubmitted Done Reply Inline Actions This doesn't look right. Two problems: Different sub-units may be consumed for a different number of resource cycles. The number of resource cycles consumed by all the sub-units of a group may not match the value of `PI->Cycles`. About 1. Example: def X : ProcResource<1>; def Y : ProcResource<1>; def A : ProcResGroup<[X, Y]>; def MyWrite : SchedWriteRes<[X,Y]> { let Latency = 3; let ResourceCycles = [1, 2]; } MyWrite is a write that consumes X for 1 cycle, and Y for 2 cycles. Both X and Y are part of group A, so the subtarget emitter will generate the following profile: X -> 1 resource cycle Y -> 2 resource cycles A -> 3 resource cycles In this case, the resource cycles of group A are equivalent to the sum of the resource cycles of X and the resource cycles of Y. Group A is not consumed for any extra cycles, so the resource latency of A is `MAX(ResourceCycle_of(X), ResourceCycle_of(Y)) = 2cy`. X is consumed for 1cy. Y is consumed for 2cy. As far as I understand, your code assumes that X and Y will be consumed for the same number of cycles. This is not true in general. Also, the number of group cycles may not match the total cycles contributed by the sub-units (see the following example). About point 2. def MyWrite : SchedWriteRes<[X,Y,A]> { let Latency = 3; let ResourceCycles = [1, 2, 1]; } This is similar to the previous example. However , A is consumed for an extra cycle. the profile generated by the subtarget emitter is: X -> 1 resource cycle Y -> 2 resource cycles A -> 4 resource cycles This situation is more uncommon, but still possible. The model doesn't prevent users from "overbooking" group resources. If we subtract the resource cycles from X and Y, we find out that A is still consumed for an extra cycle. In this case, conservatively assume that A is consumed for `MAX(ResourceCycle_of(X), ResourceCycle_of(Y)) + 1 -> 3cy`. So, 3cy instead of 2cy. andreadb: This doesn't look right. Two problems: 1. Different sub-units may be consumed for a different…
		GroupAdjustments[PI->ProcResourceIdx]});
		else
		PR.emplace_back(AdjustedResEntry{PI->ProcResourceIdx, PI->Cycles, 1});
		}
		}
}		}

/// Does this SU have a hazard within the current instruction group.		/// Does this SU have a hazard within the current instruction group.
///		///
/// The scheduler supports two modes of hazard recognition. The first is the		/// The scheduler supports two modes of hazard recognition. The first is the
/// ScheduleHazardRecognizer API. It is a fully general hazard recognizer that		/// ScheduleHazardRecognizer API. It is a fully general hazard recognizer that
/// supports highly complicated in-order reservation tables		/// supports highly complicated in-order reservation tables
/// (ScoreboardHazardRecognizer) and arbitrary target-specific logic.		/// (ScoreboardHazardRecognizer) and arbitrary target-specific logic.
Show All 22 Lines	if (CurrMOps > 0 &&
(!isTop() && SchedModel->mustEndGroup(SU->getInstr())))) {		(!isTop() && SchedModel->mustEndGroup(SU->getInstr())))) {
LLVM_DEBUG(dbgs() << " hazard: SU(" << SU->NodeNum << ") must "		LLVM_DEBUG(dbgs() << " hazard: SU(" << SU->NodeNum << ") must "
<< (isTop() ? "begin" : "end") << " group\n");		<< (isTop() ? "begin" : "end") << " group\n");
return true;		return true;
}		}

if (SchedModel->hasInstrSchedModel() && SU->hasReservedResource) {		if (SchedModel->hasInstrSchedModel() && SU->hasReservedResource) {
const MCSchedClassDesc *SC = DAG->getSchedClass(SU);		const MCSchedClassDesc *SC = DAG->getSchedClass(SU);
		SmallVector<AdjustedResEntry, 16> ResTmp;
		dmgreenUnsubmitted Done Reply Inline Actions ResTmp -> umm, AdjustedResources? dmgreen: ResTmp -> umm, AdjustedResources?
		adjustProcResources(SC, ResTmp);
		if (!ResTmp.size())
		dmgreenUnsubmitted Done Reply Inline Actions I would add brackets to these, if they have large statements under them. dmgreen: I would add brackets to these, if they have large statements under them.
for (const MCWriteProcResEntry &PE :		for (const MCWriteProcResEntry &PE :
make_range(SchedModel->getWriteProcResBegin(SC),		make_range(SchedModel->getWriteProcResBegin(SC),
SchedModel->getWriteProcResEnd(SC))) {		SchedModel->getWriteProcResEnd(SC))) {
unsigned ResIdx = PE.ProcResourceIdx;		unsigned ResIdx = PE.ProcResourceIdx;
unsigned Cycles = PE.Cycles;		unsigned Cycles = PE.Cycles;
unsigned NRCycle, InstanceIdx;		unsigned NRCycle, InstanceIdx;
std::tie(NRCycle, InstanceIdx) = getNextResourceCycle(ResIdx, Cycles);		std::tie(NRCycle, InstanceIdx) =
		getNextResourceCycle(ResIdx, Cycles, 1);
		if (NRCycle > CurrCycle) {
		#ifndef NDEBUG
		MaxObservedStall = std::max(Cycles, MaxObservedStall);
		#endif
		LLVM_DEBUG(dbgs() << " SU(" << SU->NodeNum << ") "
		<< SchedModel->getResourceName(ResIdx) << '['
		<< InstanceIdx - ReservedCyclesIndex[ResIdx] << ']'
		<< "=" << NRCycle << "c\n");
		return true;
		}
		}
		else
		for (auto &PE : ResTmp) {
		unsigned ResIdx = PE.ProcResourceIdx;
		unsigned Cycles = PE.Cycles;
		unsigned Uses = PE.Uses;
		unsigned NRCycle, InstanceIdx;
		std::tie(NRCycle, InstanceIdx) =
		getNextResourceCycle(ResIdx, Cycles, Uses);
if (NRCycle > CurrCycle) {		if (NRCycle > CurrCycle) {
#ifndef NDEBUG		#ifndef NDEBUG
MaxObservedStall = std::max(Cycles, MaxObservedStall);		MaxObservedStall = std::max(Cycles, MaxObservedStall);
#endif		#endif
LLVM_DEBUG(dbgs() << " SU(" << SU->NodeNum << ") "		LLVM_DEBUG(dbgs() << " SU(" << SU->NodeNum << ") "
<< SchedModel->getResourceName(ResIdx)		<< SchedModel->getResourceName(ResIdx) << '['
<< '[' << InstanceIdx - ReservedCyclesIndex[ResIdx] << ']'		<< InstanceIdx - ReservedCyclesIndex[ResIdx] << ']'
<< "=" << NRCycle << "c\n");		<< "=" << NRCycle << "c\n");
return true;		return true;
}		}
}		}
}		}
return false;		return false;
}		}

// Find the unscheduled node in ReadySUs with the highest latency.		// Find the unscheduled node in ReadySUs with the highest latency.
unsigned SchedBoundary::		unsigned SchedBoundary::
findMaxLatency(ArrayRef<SUnit*> ReadySUs) {		findMaxLatency(ArrayRef<SUnit*> ReadySUs) {
SUnit *LateSU = nullptr;		SUnit *LateSU = nullptr;
▲ Show 20 Lines • Show All 122 Lines • ▼ Show 20 Lines

/// Add the given processor resource to this scheduled zone.		/// Add the given processor resource to this scheduled zone.
///		///
/// \param Cycles indicates the number of consecutive (non-pipelined) cycles		/// \param Cycles indicates the number of consecutive (non-pipelined) cycles
/// during which this resource is consumed.		/// during which this resource is consumed.
///		///
/// \return the next cycle at which the instruction may execute without		/// \return the next cycle at which the instruction may execute without
/// oversubscribing resources.		/// oversubscribing resources.
unsigned SchedBoundary::		unsigned SchedBoundary::countResource(unsigned PIdx, unsigned Cycles,
countResource(unsigned PIdx, unsigned Cycles, unsigned NextCycle) {		unsigned Uses, unsigned NextCycle) {
unsigned Factor = SchedModel->getResourceFactor(PIdx);		unsigned Factor = SchedModel->getResourceFactor(PIdx);
unsigned Count = Factor * Cycles;		unsigned Count = Factor * Cycles;
LLVM_DEBUG(dbgs() << " " << SchedModel->getResourceName(PIdx) << " +"		LLVM_DEBUG(dbgs() << " " << SchedModel->getResourceName(PIdx) << " +"
<< Cycles << "x" << Factor << "u\n");		<< Cycles << "x" << Factor << "u\n");

// Update Executed resources counts.		// Update Executed resources counts.
incExecutedResources(PIdx, Count);		incExecutedResources(PIdx, Count);
assert(Rem->RemainingCounts[PIdx] >= Count && "resource double counted");		assert(Rem->RemainingCounts[PIdx] >= Count && "resource double counted");
Rem->RemainingCounts[PIdx] -= Count;		Rem->RemainingCounts[PIdx] -= Count;

// Check if this resource exceeds the current critical resource. If so, it		// Check if this resource exceeds the current critical resource. If so, it
// becomes the critical resource.		// becomes the critical resource.
if (ZoneCritResIdx != PIdx && (getResourceCount(PIdx) > getCriticalCount())) {		if (ZoneCritResIdx != PIdx && (getResourceCount(PIdx) > getCriticalCount())) {
ZoneCritResIdx = PIdx;		ZoneCritResIdx = PIdx;
LLVM_DEBUG(dbgs() << " *** Critical resource "		LLVM_DEBUG(dbgs() << " *** Critical resource "
<< SchedModel->getResourceName(PIdx) << ": "		<< SchedModel->getResourceName(PIdx) << ": "
<< getResourceCount(PIdx) / SchedModel->getLatencyFactor()		<< getResourceCount(PIdx) / SchedModel->getLatencyFactor()
<< "c\n");		<< "c\n");
}		}
// For reserved resources, record the highest cycle using the resource.		// For reserved resources, record the highest cycle using the resource.
unsigned NextAvailable, InstanceIdx;		unsigned NextAvailable, InstanceIdx;
std::tie(NextAvailable, InstanceIdx) = getNextResourceCycle(PIdx, Cycles);		std::tie(NextAvailable, InstanceIdx) =
		getNextResourceCycle(PIdx, Cycles, Uses);
if (NextAvailable > CurrCycle) {		if (NextAvailable > CurrCycle) {
LLVM_DEBUG(dbgs() << " Resource conflict: "		LLVM_DEBUG(dbgs() << " Resource conflict: "
<< SchedModel->getResourceName(PIdx)		<< SchedModel->getResourceName(PIdx)
<< '[' << InstanceIdx - ReservedCyclesIndex[PIdx] << ']'		<< '[' << InstanceIdx - ReservedCyclesIndex[PIdx] << ']'
<< " reserved until @" << NextAvailable << "\n");		<< " reserved until @" << NextAvailable << "\n");
}		}
return NextAvailable;		return NextAvailable;
}		}
▲ Show 20 Lines • Show All 59 Lines • ▼ Show 20 Lines	if (ZoneCritResIdx) {
if ((int)(ScaledMOps - getResourceCount(ZoneCritResIdx))		if ((int)(ScaledMOps - getResourceCount(ZoneCritResIdx))
>= (int)SchedModel->getLatencyFactor()) {		>= (int)SchedModel->getLatencyFactor()) {
ZoneCritResIdx = 0;		ZoneCritResIdx = 0;
LLVM_DEBUG(dbgs() << " *** Critical resource NumMicroOps: "		LLVM_DEBUG(dbgs() << " *** Critical resource NumMicroOps: "
<< ScaledMOps / SchedModel->getLatencyFactor()		<< ScaledMOps / SchedModel->getLatencyFactor()
<< "c\n");		<< "c\n");
}		}
}		}
		SmallVector<AdjustedResEntry, 16> ResTmp;
		adjustProcResources(SC, ResTmp);
		if (!ResTmp.size())
for (TargetSchedModel::ProcResIter		for (TargetSchedModel::ProcResIter
PI = SchedModel->getWriteProcResBegin(SC),		PI = SchedModel->getWriteProcResBegin(SC),
PE = SchedModel->getWriteProcResEnd(SC); PI != PE; ++PI) {		PE = SchedModel->getWriteProcResEnd(SC);
		PI != PE; ++PI) {
unsigned RCycle =		unsigned RCycle =
countResource(PI->ProcResourceIdx, PI->Cycles, NextCycle);		countResource(PI->ProcResourceIdx, PI->Cycles, 1, NextCycle);
		if (RCycle > NextCycle)
		NextCycle = RCycle;
		}
		else
		for (auto &PI : ResTmp) {
		unsigned RCycle =
		countResource(PI.ProcResourceIdx, PI.Cycles, PI.Uses, NextCycle);
if (RCycle > NextCycle)		if (RCycle > NextCycle)
NextCycle = RCycle;		NextCycle = RCycle;
}		}
if (SU->hasReservedResource) {		if (SU->hasReservedResource) {
// For reserved resources, record the highest cycle using the resource.		// For reserved resources, record the highest cycle using the resource.
// For top-down scheduling, this is the cycle in which we schedule this		// For top-down scheduling, this is the cycle in which we schedule this
// instruction plus the number of cycles the operations reserves the		// instruction plus the number of cycles the operations reserves the
// resource. For bottom-up is it simply the instruction's cycle.		// resource. For bottom-up is it simply the instruction's cycle.
		if (!ResTmp.size())
for (TargetSchedModel::ProcResIter		for (TargetSchedModel::ProcResIter
PI = SchedModel->getWriteProcResBegin(SC),		PI = SchedModel->getWriteProcResBegin(SC),
PE = SchedModel->getWriteProcResEnd(SC); PI != PE; ++PI) {		PE = SchedModel->getWriteProcResEnd(SC);
		PI != PE; ++PI) {
unsigned PIdx = PI->ProcResourceIdx;		unsigned PIdx = PI->ProcResourceIdx;
if (SchedModel->getProcResource(PIdx)->BufferSize == 0) {		if (SchedModel->getProcResource(PIdx)->BufferSize == 0) {
unsigned ReservedUntil, InstanceIdx;		unsigned ReservedUntil, InstanceIdx;
std::tie(ReservedUntil, InstanceIdx) = getNextResourceCycle(PIdx, 0);		std::tie(ReservedUntil, InstanceIdx) =
		getNextResourceCycle(PIdx, 0, 1);
if (isTop()) {		if (isTop()) {
ReservedCycles[InstanceIdx] =		ReservedCycles[InstanceIdx] =
std::max(ReservedUntil, NextCycle + PI->Cycles);		std::max(ReservedUntil, NextCycle + PI->Cycles);
} else		} else
ReservedCycles[InstanceIdx] = NextCycle;		ReservedCycles[InstanceIdx] = NextCycle;
}		}
}		}
		else
		for (auto &PI : ResTmp) {
		unsigned PIdx = PI.ProcResourceIdx;
		if (SchedModel->getProcResource(PIdx)->BufferSize == 0) {
		auto NRC = getNextResourceCycleList(PIdx, 0, PI.Uses);
		for (auto &R : NRC) {
		unsigned ReservedUntil = R.first;
		unsigned InstanceIdx = R.second;
		if (isTop()) {
		ReservedCycles[InstanceIdx] =
		std::max(ReservedUntil, NextCycle + PI.Cycles);
		} else
		ReservedCycles[InstanceIdx] = NextCycle;
		}
		}
		}
}		}
}		}
// Update ExpectedLatency and DependentLatency.		// Update ExpectedLatency and DependentLatency.
unsigned &TopLatency = isTop() ? ExpectedLatency : DependentLatency;		unsigned &TopLatency = isTop() ? ExpectedLatency : DependentLatency;
unsigned &BotLatency = isTop() ? DependentLatency : ExpectedLatency;		unsigned &BotLatency = isTop() ? DependentLatency : ExpectedLatency;
if (SU->getDepth() > TopLatency) {		if (SU->getDepth() > TopLatency) {
TopLatency = SU->getDepth();		TopLatency = SU->getDepth();
LLVM_DEBUG(dbgs() << " " << Available.getName() << " TopLatency SU("		LLVM_DEBUG(dbgs() << " " << Available.getName() << " TopLatency SU("
▲ Show 20 Lines • Show All 1,465 Lines • Show Last 20 Lines