This is an archive of the discontinued LLVM Phabricator instance.

[MachinePipeliner] Fix the interpretation of the scheduling model
ClosedPublic

Authored by ytmukai on Sep 9 2022, 5:43 AM.

Details

Summary

The method of counting resource consumption is modified to be based on
"Cycles" value when DFA is not used.

The calculation of ResMII is modified to total "Cycles" and divide it
by the number of units for each resource. Previously, ResMII was
excessive because it was assumed that resources were consumed for
the cycles of "Latency" value.

The method of resource reservation is modified similarly. When a
value of "Cycles" is larger than 1, the resource is considered to be
consumed by 1 for cycles of its length from the scheduled cycle.
To realize this, ResourceManager maintains a resource table for all
slots. Previously, resource consumption was always 1 for 1 cycle
regardless of the value of "Cycles" or "Latency".

In addition, the number of instructions per cycle is modified to be
constrained by "IssueWidth".

For the case of using DFA, the scheduling results are unchanged.

Example:

  • Command: $ llc -mtriple=powerpc64le-unknown-linux-gnu -mcpu=pwr9 --ppc-enable-pipeliner --debug-only=pipeliner --pipeliner-dbg-res --pipeliner-max-stages=10 llvm/test/CodeGen/PowerPC/sms-phi-2.ll
  • Previous result:
MII = 15 MAX_II = 25 (rec=2, res=15)
Schedule Found? 1 (II=15)
  • Modified result:
#Insts: 7, IssueWidth: 8, Cycles: 1
ID            Name     Units  Consumed    Cycles
 1             ALU         4         5         2
 2            ALUE         2         0         0
 3            ALUO         2         0         0
 4              BR         1         0         0
 5              CY         1         0         0
 6             DFU         1         0         0
 7        DISP_NBR         6        15         3
 8         DISP_SS         4         8         2
 9         DISPb01         2         0         0
10         DISPx02         2         4         2
11         DISPx13         2         4         2
12         DISPxab         2         3         2
13             DIV         2         8         4
14              DP         4         1         1
15             DPE         2         0         0
16             DPO         2         0         0
17         IP_AGEN         4         1         1
18         IP_EXEC         4         9         3
19        IP_EXECE         2         1         1
20        IP_EXECO         2         1         1
21              LS         4         1         1
22              PM         2         0         0
MII = 4 MAX_II = 14 (rec=2, res=4)

MRT:
Slot  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 #Insts
   0  0  0  0  0  0  0  3  2  0  2  2  1  2  1  0  0  0  2  1  1  0  0      2
   1  3  0  0  0  0  0  5  4  0  2  2  2  2  0  0  0  1  4  0  0  1  0      3
   2  0  0  0  0  0  0  3  2  0  0  0  0  2  0  0  0  0  1  0  0  0  0      0
   3  2  0  0  0  0  0  4  0  0  0  0  0  2  0  0  0  0  2  0  0  0  0      2

Schedule Found? 1 (II=4)

The modification provides a better ResMII and can actually schedule it at that value (although resource management during scheduling is changed to be more restrictive.)

The modifications will produce more aggressive schedule, but the final result may not be significantly different due to the limitation of the maximum number of stages (3).
I believe that only ARM Cortex-M7 enables pipeliner by default. What are your thoughts on the impact of this modification?

Diff Detail

Event Timeline

ytmukai created this revision.Sep 9 2022, 5:43 AM
ytmukai requested review of this revision.Sep 9 2022, 5:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 9 2022, 5:43 AM
dpenry added a comment.Sep 9 2022, 9:06 AM

Remarkably, I was just about to upload a patch to do this very thing for in-order non-DFA scheduling models, but this one has more functionality! (It handles multi-cycle occupancy better, and the pipeliner-force-ii option will be quite useful for testing.)

Overall, I expect this to be helpful for Cortex-M7. I can't speak to any side effects on models using DFA (Hexagon), or out-of-order non-DFA (PowerPC). (NOTE: DFA is used by default; PowerPC and ARM targets turn it off.)

Do update the failing Thumb2 test outputs so I can comment on how/whether they need to be modified. The use of pragmas in them to force the MII was precisely because of the problem this patch solves.

Also "unbuffered groups" -- resources which have subunits and whose BufferSize is zero -- are intended to consume resources differently -- or so the MachineScheduler code (SchedBoundary::getNextResourceCycle) implies. At least in some cases, they do not consume resources for themselves, but choose between their subunits. IIRC, this is used to model instructions having options in what resources they consume. I don't think that getting this behavior in place is necessary for this patch, but could be in a follow-on patch.

llvm/lib/CodeGen/MachinePipeliner.cpp
3055

Will need to adjust for use of NumMicroOps as described in later comments

3080

Should increment by SCDesc->NumMicroOps (see comment below while calculating ResMII)

3205

Use SM.resolveSchedClass here so that variant scheduling classes can be resolved.

3210

Increment by SCDesc->NumMicroOps, as NumMicroOps represents the amount of issue width taken by the instruction (at least that's how MachineScheduler interprets it).

ytmukai updated this revision to Diff 459680.Sep 13 2022, 5:02 AM
  • Update Thumb2 tests.
  • Refer to the number of micro operations when restricting the number of issues.
ytmukai marked 3 inline comments as done.Sep 13 2022, 5:43 AM

@dpenry Thank you for the reviews!

Thumb2 tests passed by disabling the restriction of issue width. (I had forgotten that the ARM swp tests are under Thumb2.)

The number of issues per cycle was changed to be counted in micro-ops. Micro-ops are assumed to be scheduled one per cycle, starting with the cycle in which the instruction is scheduled. (Reserved at reserveResources()) This may differ from actual operation, but it prevents an instruction from being unschedulable if it has a large number of micro-ops relative to the issue width. In MachineScheduler, when the number of micro-ops exceeds the limit, it seems that a next cycle's slot is used. In any case, the model does not have that much resolution, so I preferred a simple implementation.

Regarding problem on getting scheduling classes, I understand that the interface resolveSchedClass should be used, but the problem spans the entire code and I would like to solve it in a separate patch, is that ok?

Thanks for the info about unbuffered groups. I have not yet understood the details and have addressed it in a future patch if necessary.

dpenry accepted this revision.Sep 13 2022, 8:20 AM

LGTM

<snip>

Thumb2 tests passed by disabling the restriction of issue width. (I had forgotten that the ARM swp tests are under Thumb2.)

That's fine for now; I may tweak them in a new patch to ensure they're still adequately testing what they were intended to test.

Regarding problem on getting scheduling classes, I understand that the interface resolveSchedClass should be used, but the problem spans the entire code and I would like to solve it in a separate patch, is that ok?

Sure. Let's make it soon, though. IIRC, when I was doing my changes, not resolving the class led to occasionally odd results for Cortex-M7.

Thanks for the info about unbuffered groups. I have not yet understood the details and have addressed it in a future patch if necessary.

Yeah, this all gets pretty hairy. Look at https://reviews.llvm.org/D98976 for a thread with me trying to figure out what it all means once there are resource groups in play.

This revision is now accepted and ready to land.Sep 13 2022, 8:20 AM
ytmukai updated this revision to Diff 459989.Sep 13 2022, 10:49 PM
  • Refactoring
  • Deal with variants when getting scheduling classes
ytmukai marked an inline comment as done.Sep 13 2022, 11:08 PM

By refactoring, the modifications to resolve variant scheduling classes could be made naturally. calculateResMIIDFA() is not modified yet, but the only user of the function, Hexagon, has empty resolveSchedClass(), so it should not be a problem for the time being.

LGTM. Thanks!

This revision was landed with ongoing or failed builds.Sep 15 2022, 6:26 PM
This revision was automatically updated to reflect the committed changes.