This is an archive of the discontinued LLVM Phabricator instance.

[MISched] Introduce and use ResourceSegments.
ClosedPublic

Authored by fpetrogalli on May 10 2023, 2:30 PM.

Details

Summary

The class ResourceSegments is used to keep track of the intervals
that represent resource usage of a collection of instruction that are
being scheduled by the machine scheduler.

The collection is made of intervals that are closed on the left and
open on the right (represented by the standard notation [a, b)).

These collection of intervals can be extended by adding new
intervals accordingly while scheduling a basic block.

Unit tests are added to verify the possible configurations of
intervals, and the relative possibility of scheduling a new
instruction in these configurations. Specifically, the methods
getFirstAvailableAtFromBottom and getFirstAvailableAtFromTop are
tested to make sure that both bottom-up and top-down scheduling work
when tracking resource usage across the basic block with
ResourceSegments.

Note that the scheduler tracks resource usage with two methods:

  1. counters (via std::vector<unsigned> ReservedCycles;);
  1. intervals (via std::map<unsigned, ResourceSegments> ReservedResourceSegments;).

This patch can be considered a NFC test for existing scheduling models
because the tracking system that uses intervals is turned off by
default (field bit EnableIntervals = false; in the tablegen class
SchedMachineModel).

Diff Detail

Event Timeline

fpetrogalli created this revision.May 10 2023, 2:30 PM
fpetrogalli requested review of this revision.May 10 2023, 2:30 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 10 2023, 2:30 PM
RKSimon added inline comments.May 12 2023, 7:44 AM
llvm/include/llvm/CodeGen/MachineScheduler.h
96

#include "llvm/Support/raw_ostream.h" ?

716

Use assert(all_of(_Intervals))

815

typedef std::pair<long, long> ? in fact we might be better off using a std::pair<int64_t,int64_t> ?

877

Some of these larger implementations might be better off in MachineScheduler.cpp

880

(style) avoid auto

887

(style) assertion message

llvm/include/llvm/MC/MCSchedule.h
308

Include the description here as well - don't just refer to TargetSchedule.td

llvm/lib/CodeGen/MachineScheduler.cpp
170

Test coverage?

llvm/unittests/CodeGen/CMakeLists.txt
41

These are supposed to be kept sorted :(

Also - please add a summary to the patch

fpetrogalli marked 8 inline comments as done.
fpetrogalli retitled this revision from [MISched] Introduce and use ResourceSegments. to [MISched] Introduce and use ResourceSegments..

Address code review.

llvm/lib/CodeGen/MachineScheduler.cpp
170

The test would just let the command line option, because we do not have any SchedModel upstream that use StartAtCycle.

Are you saying that I should add an llc invocation that uses this command line option, independently whether or not I can test the codegen changes caused by the value change?

fpetrogalli added inline comments.May 15 2023, 8:13 AM
llvm/lib/CodeGen/MachineScheduler.cpp
170

FWIW, the behaviour of changing the cutoff value is tested in the unit test TEST(ResourceSegments, AddWithCutOff)

@RKSimon - thank you for looking into this!

The sort() and merge() operations were always invoked in pair, therefore I merged them together in the method mergeAndSort().

andreadb added a comment.EditedMay 25 2023, 4:55 AM

Hi Francesco,

Apologies for the very late reply. I have been quite busy these days, and I am still trying to figure out mentally how well this new framework works in practice.

I am thinking about edge cases where writes of a same instruction somehow conflict in terms of resource cycles and/or StartAtCycle quantities.

If I understand it correctly, StartAtCycle forces the scheduling algorithm to find valid segments after that relative cycle. Scheduling won't fail if no segment can be allocated exactly at that cycle.
Basically, resource consumption is requested to start not before StartAtCycle. However, it is OK to start after StartAtCycle if slot allocation is unsuccessful.
Is that correct?

If so, then what happens if I declare the following write:

Write W = [ A (cycles=1, start_at=0), B (cycles=1, start_at=0), AB (cycles=3, start_at=1) ].

Where:
A and B are simple resource units (one unit each).
AB is a resource group containing A and B.

I want AB to be consumed after the first A and after the first B.

Ideally, I expect either this:

///       C      1      2      3      4      5  ...
/// ------|------|------|------|------|------|----->
/// 
/// A    [C,                       C+4)  -- includes the extra cycle from AB.
/// B    [C, C+1)

or

///       C      1      2      3      4      5  ...
/// ------|------|------|------|------|------|----->
/// 
/// A    [C, C+1)
/// B    [C,                       C+4)  -- includes the extra cycle from AB.

However, if resouce A is unavailable until cycle 2. Then what happens to the schedule?

///       C      1      2      3      4      5  ...
/// ------|------|------|------|------|------|----->
/// 
/// A     X      X      X        [C+3, C+4)
/// B    [C,  C+1)

At this point, when would AB be allocated? Could it be allocated to B from relative cycle 2?

I am asking this question because StartAtCycle could be used to describe unmodeled dependencies between multiple micro opcodes of a same instruction.
In that example, the user might have wanted to suggest that the consumption of group AB must start after A and B have been consumed for one cycle.

For example, an instruction may decode into 3 micro opcodes:

  • opcode #1 consumes one cycle of resource A
  • opcode #2 consumes one cycle of resource B
  • opcode #3 waits for opcode #1 and #2 to complete, and then consumes three cycles of A or B.

If we use StartAtCycle, I am not sure if we can guarantee that consumption of "A or B" would always start at the right time.

NOTE: On x86, a similar pattern could be used to implement horizontal operations which are often microcoded and expanded into a pair of independent shuffle operations, followed by an ADD/SUB.

I wonder if there is a way to delay the start cycle until both A and B have been consumed.
Again, apologies if I misunderstood all of this behaviour.
In retrospect, I wonder whether we need something more than StartAtCycle to properly describe these dependencies. I fear that this won't be expressive enough otherwise.
I should have raised this concern on the other patch. However, I only ended up thinking about this potential issue now. Sorry.

I also wonder about what happens to those cases where a write triggers the consumption of extra cycles of so-called "super" resources. I suspect that "super" resources must be treated specially, and they should inherit the same StartAtCycle?

I have only suggested some minor changes (see my comments below).
I am curious to see how much scheduling improves if we start using StartAtCycle in our scheduling models. Do you have some perf numbers to share?

On x86 (SimonP can correct me on this) I believe that x86 still uses the old latency-based post-ra scheduling algorithm, which doesn't do any bookkeeping on hw resources.
I think that using StartAtCycle can significantly improve the quality of most x86 models (horizontal operations would benefit from it a lot). However, it would be nicer if it was possible to express a StartAfter resource consumption.

llvm/include/llvm/CodeGen/MachineScheduler.h
691

I'd be tempted to just move your new ResourceSegments struct outside of SchedBoundary. Not sure what other reviewers think about it.
I feel like it would be slightly more readable all this part.
It was also suggested by Simon before. You might have missed his comment before.
Basically, this struct is big enough to be promoted to top level (mainly for readability reasons).

692

Was this meant to be public? If so, then it is redundant.

913–915

Can this be an inline lambda used by the std::is_sorted at line 868?

llvm/lib/CodeGen/MachineScheduler.cpp
2296

You don't need else after return.

2684–2693

Not entirely sure about the coding style. However, I suggest to use braces for those blocks because statements are particularly long and use four lines...

4247

Don't need to check NDEBUG

Thank you for the feedback @andreadb

I need some time to digest it to be able to give you an answer. I have however inlined a clarification of how I have interpreted StartAtCycle and ResourceCycle. (I'd be of course happy to revisit my interpretation if it makes things more clear)

Hi Francesco,

Apologies for the very late reply. I have been quite busy these days, and I am still trying to figure out mentally how well this new framework works in practice.

I am thinking about edge cases where writes of a same instruction somehow conflict in terms of resource cycles and/or StartAtCycle quantities.

If I understand it correctly, StartAtCycle forces the scheduling algorithm to find valid segments after that relative cycle. Scheduling won't fail if no segment can be allocated exactly at that cycle.
Basically, resource consumption is requested to start not before StartAtCycle. However, it is OK to start after StartAtCycle if slot allocation is unsuccessful.
Is that correct?

If so, then what happens if I declare the following write:

Write W = [ A (cycles=1, start_at=0), B (cycles=1, start_at=0), AB (cycles=3, start_at=1) ].

I do not have an answer (yet!) on the question following this set up, however I wanted to clarify that the way I have intended StartAtCycle and ResourceCycle in the tablegen description is as follows.

For a resource RES used in a WriteRes, that is used for 3 cycles starting at cycle 2, the tablegen description I expect to use is the following:

def : WriteRes<..., [RES]> {
  let ResourceCycles = [5];
  let StartAtCycle = [2];
}

This mens that the total number of cycle for resource RES is given by the difference between the corresponding values in ResourceCycles and StartAtCycle respectively, which results in 5 - 2 = 3.

The reason for this choice is the following. In the current code resource usage is always considered booked from 0 (resulting in overbooking). For example, given ResourceCycle = [1,3] for resources A, B, I assumed the meaning being A for 1 cycle, followed by B for 2 cycles (cycle 0, overlapping the use of A , is overbooked for B).

cycle 0 | 1 | 2 | 3 | 4 | 5
A     X
B     X   X   X

I decided to reinterpret the ResourceCycles as "ReleaseAtCyc;e" because I did not want to mess with the values of the current scheduling models in case people wanted to optimise the existing one with similar situation by just adding StartAtCycle = [0,1], without the need of changing any of the values in resourceCycles. By setting StartAtCYcle = [1,2] for the example, we would get the real resource usage without having to change ResourceCYcle from [1,3] to [1,2]:

cycle 0 | 1 | 2 | 3 | 4 | 5
A     X
B         X   X

Essentially to me resource usage is from StartAtCycle to resourceCycles , while I *think* that you intend resource being booked from StartAtCycle to StartAtCycle + ResourceCycles.

If we proceed with my interpretation, I think it will be easier to mass rename ResourceCycles to ReleaseAtCycle, instead of having to manually figure out the math that is needed to optimise the existing scheduling models.

Of course, this is my person preference, and I am totally fine in changing the code to your interpretation.

Having said that, I'll take a look into the issue you are reporting. It would really help if you could point me at some scheduling models in the sources that use the resource groups mechanism you describe, because I could use it as a starting point to play with it and see what happens.

Thanks!

andreadb added a comment.EditedMay 30 2023, 5:31 AM

Thank you for the feedback @andreadb

I need some time to digest it to be able to give you an answer. I have however inlined a clarification of how I have interpreted StartAtCycle and ResourceCycle. (I'd be of course happy to revisit my interpretation if it makes things more clear)

Hi Francesco,

Apologies for the very late reply. I have been quite busy these days, and I am still trying to figure out mentally how well this new framework works in practice.

I am thinking about edge cases where writes of a same instruction somehow conflict in terms of resource cycles and/or StartAtCycle quantities.

If I understand it correctly, StartAtCycle forces the scheduling algorithm to find valid segments after that relative cycle. Scheduling won't fail if no segment can be allocated exactly at that cycle.
Basically, resource consumption is requested to start not before StartAtCycle. However, it is OK to start after StartAtCycle if slot allocation is unsuccessful.
Is that correct?

If so, then what happens if I declare the following write:

Write W = [ A (cycles=1, start_at=0), B (cycles=1, start_at=0), AB (cycles=3, start_at=1) ].

I do not have an answer (yet!) on the question following this set up, however I wanted to clarify that the way I have intended StartAtCycle and ResourceCycle in the tablegen description is as follows.

For a resource RES used in a WriteRes, that is used for 3 cycles starting at cycle 2, the tablegen description I expect to use is the following:

def : WriteRes<..., [RES]> {
  let ResourceCycles = [5];
  let StartAtCycle = [2];
}

This mens that the total number of cycle for resource RES is given by the difference between the corresponding values in ResourceCycles and StartAtCycle respectively, which results in 5 - 2 = 3.

The reason for this choice is the following. In the current code resource usage is always considered booked from 0 (resulting in overbooking). For example, given ResourceCycle = [1,3] for resources A, B, I assumed the meaning being A for 1 cycle, followed by B for 2 cycles (cycle 0, overlapping the use of A , is overbooked for B).

cycle 0 | 1 | 2 | 3 | 4 | 5
A     X
B     X   X   X

I decided to reinterpret the ResourceCycles as "ReleaseAtCyc;e" because I did not want to mess with the values of the current scheduling models in case people wanted to optimise the existing one with similar situation by just adding StartAtCycle = [0,1], without the need of changing any of the values in resourceCycles. By setting StartAtCYcle = [1,2] for the example, we would get the real resource usage without having to change ResourceCYcle from [1,3] to [1,2].

Thanks, that makes sense.

For most devs, ResourceCycles is just a measure of latency; it declares for how many cycles a resource becomes unavailable after "instruction issue".
To avoid confusion, I suggest to add a code comment that further emphasizes how ResourceCycle actually means StopAtCycle. Otherwise, some people may wrongly assume that ResourceCycles is relative to StartAtCycle.

cycle 0 | 1 | 2 | 3 | 4 | 5
A     X
B         X   X

Essentially to me resource usage is from StartAtCycle to resourceCycles , while I *think* that you intend resource being booked from StartAtCycle to StartAtCycle + ResourceCycles.

If we proceed with my interpretation, I think it will be easier to mass rename ResourceCycles to ReleaseAtCycle, instead of having to manually figure out the math that is needed to optimise the existing scheduling models.

Of course, this is my person preference, and I am totally fine in changing the code to your interpretation.

I was essentially asking whether instruction issue still works the same way or not.

Is it required that ALL resource segments are reserved/allocated at issue cycle?
To put it in another way: Is the StartAtCycle a hard requirement for resource allocation? Would it prevent instruction issue if some but not all resources are available at their StartAtCycle?

Example:

A write W declaring the following:
 - 3 micro-opcodes
 - Resource consumption :
     A - 1cy
     B - 1cy
     C - 2cy   -- StartAtCycle=1

Given the following scenario:

cycle 0 | 1 | 2 | 3 |
A     -
B     -  
C     X   X

In this scenario, not all resources are available at their relative StartAtCycle. In particular, resource C cannot be allocated at relative cycle 1. If we interpret StartAtCycle as a hard constraint, then instruction issue will have to stall for one cycle.
C is busy for two more cycles, so issue of write W must be delayed for 1 extra cycle (even though A and B are already available).

The reason why I am asking these questions is because I want to fully understand how flexible is this new model in practice. In future, it would be nice if we could model resource consumption on a per-micro-opcode basis.

As you know, writes may declare multiple "micro-opcodes". However, there is no way currently to describe which micro-opcodes consume which hardware resources.
For that reason, scheduling algorithms don't track the lifetime of individual micro-opcodes; instead, instructions are essentially treated like atomic entities.
So, "issue event" - as a concept - can only apply to instructions as a whole and not individual micro-opcodes.

This obviously would have complicated the model, and it would have required per-micro-opcode knowledge which we don't have at the moment.
So, this is not feasible now, but it could be a future development (although unlikely).

I was wondering whether your StartAtCycle could have simplified that future development or not. I think your design is completely orthogonal, and it shouldn't prevent any further development in the area of modelling micro-opcode scheduling.

So, overall I think that your StartAtCycle is a nice and useful addition.

Having said that, I'll take a look into the issue you are reporting. It would really help if you could point me at some scheduling models in the sources that use the resource groups mechanism you describe, because I could use it as a starting point to play with it and see what happens.

Thanks!

Have a look at the Haswell model on x86.

defm : HWWriteResPair<WriteFHAdd,   [HWPort1, HWPort5], 5, [1,2], 3, 6>;

It describes the latency/throughput profile of an horizontal add.
An horizontal add is composed of 3 micro-opcodes:

  • 2 shuffles (can only execute on HWPort5).
  • 1 ADD (can only execute on HWPort1).

Shuffle opcodes are independent from each-other and can start execution immediately. The ADD opcode will have to wait for the shuffles to complete. The ADD could be marked as StartAtCycle=2.
HWPort5 is a bottleneck for shuffle opcodes; the 2 shuffle micro-opcodes must be serialised. That explains why it has a resource consumption of 2cy.

I am sure that there are several other (bad and good) examples in that file which also involve group resources.
On AMD platforms, shuffle opcodes can be issued to multiple pipes, so that bottleneck would not exist. For those models, the tablegen definition would be similar except that it would use a resource group instead of a unit for the two shuffles.

-Andrea

[...]
I was essentially asking whether instruction issue still works the same way or not.

Yes, indeed, it works the same as before - it issues an instruction only if all the resources used by the instruction are available.

Is it required that ALL resource segments are reserved/allocated at issue cycle?

Yep - no changes w.r.t. the requirements for issuing.

I will update the patch according to the comments inline.

fpetrogalli marked 6 inline comments as done.Jun 5 2023, 7:48 AM

Also - please add a summary to the patch

Done

llvm/include/llvm/CodeGen/MachineScheduler.h
913–915

Done, if not for the fact that I have moved the static method in MachineScheduler.cpp because I need it in two places (one of which is an assertion).

fpetrogalli edited the summary of this revision. (Show Details)Jun 5 2023, 7:48 AM
andreadb accepted this revision.Jun 5 2023, 7:49 AM

Thanks,

I don't have any other questions about this patch. It looks good to me.

This revision is now accepted and ready to land.Jun 5 2023, 7:49 AM
fpetrogalli marked an inline comment as done.

Address comments from @andreadb and @RKSimon.

Thank you both!

@RKSimon - any other concerns?

Thanks,

Francesco

This revision was automatically updated to reflect the committed changes.