This is an archive of the discontinued LLVM Phabricator instance.

A new software pipliner pass based on non-SSA form
Needs ReviewPublic

Authored by wwei on Nov 29 2018, 11:44 PM.

Details

Summary

Current Machine pipeliner framework, works on SSA code before phi elimination. This has created some problems for us. In particular, we have difficulty handling copy instructions generated for different reasons. Usually we need to rely on PreRA instruction scheduler to take care of scheduling these copy instructions, but as you will see in the below, some problems cannot be handled by instruction scheduler. Also instruction scheduler uses a list scheduling algorithm and in some cases we have had difficulty tuning it for pipelined loops. Two specific problems we have with copy instruction follow.

Before the examples, I should mention that copy instructions could be generated for different reasons such as
(1) A def and a use are scheduled more than II cycles away from each other during pipelining. So we need a copy to avoid overriding the value before it is used.
(2) Tied operands expanded during Two Address pass will generate copy instructions that sometimes cannot be optimized away depending on data dependency patterns in the loop.
(3) Similar issue exists for copy instructions generated during phi elimination.

While majority of copies that we see are due to (1), we do see copy instruction from (2) and (3) in very performance sensitive loops that make the problem worse.

1- Sometimes pipeliner finds a schedule with a good II. But we have a large number of copy instructions that cannot be necessarily bundled with existing instructions. So effective length of one iteration of the loop becomes II + X, for some X. Now if we had tried higher values for II, we could have found schedules with say II+2, where we have fewer copies that can be scheduler better. In the current framework, we don’t know how many copy instructions we will eventually have in the loop and how we can schedule them. There are workarounds that we have employed so far, but eventually these are hacks. If we do pipelining Before PreRA scheduling and even disable PreRA scheduling for pipelined loops, this issue, to a very large extend, will be taken care of. When we find a schedule we can look into all copies for the loop and potentially consider higher II values if we need to. Compare between different schedules (while the impact of copy instructions is taken care of) and choose the best schedule.

2- Current framework considers copy instructions as zero cost which is correct sometimes. When a copy is not optimized away, sometimes it is quite costly in our target. For example sometimes there are some stalls between copy instructions and its user. Treating such copies as zero cost results in suboptimal schedules.

Unfortunately we cannot post Machine IR examples due to confidentiality reasons. We do appreciate your feedback. If you think this is not the right approach to tackle this problem or you have any other comment please let us know.

This is the first patch and covers DDG data structures. Other patches will follow. But we would like to discuss the approach and get early feedback for this part of the work.

Diff Detail

Event Timeline

wwei created this revision.Nov 29 2018, 11:44 PM
materi added a subscriber: materi.Nov 30 2018, 4:17 AM
lsaba added a subscriber: lsaba.Dec 2 2018, 12:15 AM
MatzeB removed a reviewer: MatzeB.Dec 2 2018, 5:55 PM

I'm currently less active in LLVM and will not have the time to review this in depth. Some comments:

  • Given that this is already the 2nd machine pipelining implementation this should have more introductory comments, highlighting the differences and what to choose in which situation!
  • Which target is going to use this?
jwroorda added inline comments.Dec 13 2018, 11:29 AM
lib/CodeGen/MachineModuloSched.cpp
204

Can you please explain the difference between this function and SwingSchedulerDAG::addLoopCarriedDependences?

242

the code seems very similar to the function with the same name in MachinePipeliner.cpp. Is is possible to share code between the two implementations instead of duplicating it?

277

This function seems a verbatim copy of SwingSchedulerDAG::computeDelta(MachineInstr &MI, unsigned &Delta). Please, instead of copying, consider refactoring such that the code can be shared between the two implementations.

I think that there is no particular problem in this part itself.

However, I would like to know the plan of how to coexist with MachinePipeliner over the long term.
I believe that the core part of the algorithm is the same and that many codes can be shared with MachinePipeliner.
Also, if both MachinePipeliner and MachineModuloSched are present for a certain period, it is nice to be able to switch both by using them.
For example, I wrote the member function 'AArch64InstrInfo::analyzeLoop' to port MachinePipeliner to AArch64.
Rather than preparing such a function for both passes, I think that it is better to be able to utilize one function with parameter 'bool SSA' in both optimization passes.

I imagine the implementation like RegAlloc* for register allocation passes.
However, this may be difficult because the timing of executing each pass is different.