We are going to set MicroOpBufferSize to 1 (one) for Power 9.
Despite the fact that Power9 is an out-of-order core this value still produces the best results for us across the board for various benchmarks that we tested.
The reason is that we are using the GenericScheduler which does not seem to update the value of CurrCycle correctly for Power PC. When we have reached the issue limit the GenericScheduler bumps the CurrCycle by 1 and then tries to fill the issue queue again. This is not always accurate since once we have reached the limit of the issue queue then we have dispatched to all of the slices and are using resources on those slices. Before issuing any more instructions some of those resources will have to become free first.
If in the future we create a PowerPC specific scheduler we can change the value of MicroOpBufferSize to something greater than 1 to indicate an out-of-order core.
Details
Diff Detail
Event Timeline
So, what's going on with the CurrCycle counter as we're going through? If we were, instead, to fix that would we still need this patch?
-eric
Hi Eric,
When I was looking at the scheduler code I came across the CurrCycle variable. The way I understand it is that this variable keeps track of the cycle that we are in the process of scheduling for. For example, an instruction that depends on no other instructions is scheduled at cycle 0 since it can be dispatched immediately. If instruction B depends on instruction A then it cannot be dispatched until A completes. If A takes two cycles then instruction B will not be scheduled until CurrCycle is at least 2. If I set MicroOpBufferSize=1 this is the way things generally work. It will dispatch everything it can at cycle 0 and then increase the cycle count to allow currently dispatched instructions to complete and then it will dispatch more.
However, if I set MicroOpBufferSize to a value greater than 1 this CurrCycle is not updated in the same way. As in the scenario above B depends on the result of the two cycle operation A. However, in this case we may dispatch both A and B when CurrCycle=0. I have seen situations where an instruction was dispatched at a given CurrCycle count even though I knew that the inputs to that instruction would require more cycles than that to compute.
The issue seems to come from here:
switch (SchedModel->getMicroOpBufferSize()) { case 0: assert(ReadyCycle <= CurrCycle && "Broken PendingQueue"); break; case 1: if (ReadyCycle > NextCycle) { NextCycle = ReadyCycle; DEBUG(dbgs() << " *** Stall until: " << ReadyCycle << "\n"); } break; default: // We don't currently model the OOO reorder buffer, so consider all // scheduled MOps to be "retired". We do loosely model in-order resource // latency. If this instruction uses an in-order resource, account for any // likely stall cycles. if (SU->isUnbuffered && ReadyCycle > NextCycle) NextCycle = ReadyCycle; break; }
In cases when MicroOpBufferSize=1 the NextCycle is bumped up by the ReadyCycle whenever we need to wait for things to finish. However, when MicroOpBufferSize>1 the NextCycle is bumped up only when we use a reserved or in-order resource. Based on the comment in the default case I believe that this is the way the scheduler was designed to run. I don't think that this is a bug or something that needs to be fixed in the MachineScheduler.cpp where it is now.
We have a few options here to get the performance we want:
- Use MicroOpBufferSize=1 as I did here
- Copy the MachineScheduler.cpp code into something like PPCMachineScheduler.cpp and then change that switch statement to have it do the same thing for the default case as for the 1 case.
- Create a proper custom PPC Machine Scheduler.
The reason I picked option 1 was mostly due to time constraints. I feel that options 1 and 2 are both non-ideal but that doing option 3 correctly will probably take months to complete. In the long term I think we should be moving toward option 3.
So, to answer your question, if we were to do option 3 we would be able to set MicroOpBufferSize to a value greater than 1 which makes more sense for the out-of-order Power PC architecture. Due to time constraints we are thinking of using a value of 1 for now and then in the future implement a custom scheduler and set the MicroOpBufferSize to a different value.
Think of CurrCycle as the cycle in which an instruction is issued.
With MicroOpsBufferSize = 0 (VLIW-ish): Instructions will not be scheduled until after CurrCycle >= ReadyCycle. Stalling on latency is a last resort.
With MicroOpsBufferSize = 1 (inorder): Any ready instruction may be scheduled, but it will effectively stall the issue pipe (CurrCycle = ReadyCycle). Heuristics decide when to stall.
With MicroOpsBufferSize <1(OOO): Instructions can be issued before they are ready without stalling the pipeline.
Realistically, the scheduler can't model precisely when OOO resources are full, but heuristics can take it into account.
You can add in-order resources though to model stalls for just a subset of operations (I'm not sure how well that works in practice).
Look at -debug-only=machine-schedule to see how it works.
Please add comments where you think it would be helpful. This should be more self-documenting.
I think that this is fine, as a work around, but I don't quite understand the problem based on this description. I'm thinking that either:
- Like on the P8 (and previous), there are some in-order resources in the decoder/dispatcher (which should be modeled as in-order resources).
- There are resources for which we're significantly underestimating the buffer size (the current P9 scheduling model only sets the BufferSize for BR (which is fine, but as we don't currently schedule branches, it's essentially documentation). Maybe we should be setting BufferSize for other things.
(or some combination of the two). Based on the description in the patch and in the patch description, it sounds like the latter, but I'm not sure.
It looks not a good work around to me. Because the problem is only in scheduler, but the patch inappropriately changes a very basic API isOutOfOrder(), it potentially can be used by many other optimizations. I personally have a patch depends on this API https://reviews.llvm.org/D39352.
A few more questions:
- Can you find another work around local to the scheduler?
- You mentioned an ideal solution is a long term P9 specific scheduler, do you have specific plan for it?
- In the source code, you mentioned
// If, in the future, we decide to implement a PPC scheduler ...
Can I understand it as in the future, you may also not implement a PPC scheduler?
Hi Everyone,
Thank you for your comments.
In light of the feedback I'm thinking of changing this patch a little. We can set the MicroOpBufferSize = 44 since that's the actual size of the buffer.
To get the scheduler behavior we want we can set some of the resources to BufferSize=1 so that CurrCycle will increase as we expect.
The solution will still be a workaround but it will provide the correct result for isOutOfOrder() and it won't take as long to implement as it would take to write a custom PPC scheduler. Unfortunately, there is no plan yet of when we could start on a PPC specific scheduler.
Feel free to add this part immediately :)
IMO "represent the actual hardware faithfully" is, as much as can be, a "non-controversial" change.
To get the scheduler behavior we want we can set some of the resources to BufferSize=1 so that CurrCycle will increase as we expect.
This should probably get reviewed.
Thanks!
-eric
I think this is no longer applicable as we have set this value correctly for the CPU already. Could you close this review if so?