Current Machine pipeliner framework, works on SSA code before phi elimination. This has created some problems for us. In particular, we have difficulty handling copy instructions generated for different reasons. Usually we need to rely on PreRA instruction scheduler to take care of scheduling these copy instructions, but as you will see in the below, some problems cannot be handled by instruction scheduler. Also instruction scheduler uses a list scheduling algorithm and in some cases we have had difficulty tuning it for pipelined loops. Two specific problems we have with copy instruction follow.
Before the examples, I should mention that copy instructions could be generated for different reasons such as
(1) A def and a use are scheduled more than II cycles away from each other during pipelining. So we need a copy to avoid overriding the value before it is used.
(2) Tied operands expanded during Two Address pass will generate copy instructions that sometimes cannot be optimized away depending on data dependency patterns in the loop.
(3) Similar issue exists for copy instructions generated during phi elimination.
While majority of copies that we see are due to (1), we do see copy instruction from (2) and (3) in very performance sensitive loops that make the problem worse.
1- Sometimes pipeliner finds a schedule with a good II. But we have a large number of copy instructions that cannot be necessarily bundled with existing instructions. So effective length of one iteration of the loop becomes II + X, for some X. Now if we had tried higher values for II, we could have found schedules with say II+2, where we have fewer copies that can be scheduler better. In the current framework, we don’t know how many copy instructions we will eventually have in the loop and how we can schedule them. There are workarounds that we have employed so far, but eventually these are hacks. If we do pipelining Before PreRA scheduling and even disable PreRA scheduling for pipelined loops, this issue, to a very large extend, will be taken care of. When we find a schedule we can look into all copies for the loop and potentially consider higher II values if we need to. Compare between different schedules (while the impact of copy instructions is taken care of) and choose the best schedule.
2- Current framework considers copy instructions as zero cost which is correct sometimes. When a copy is not optimized away, sometimes it is quite costly in our target. For example sometimes there are some stalls between copy instructions and its user. Treating such copies as zero cost results in suboptimal schedules.
Unfortunately we cannot post Machine IR examples due to confidentiality reasons. We do appreciate your feedback. If you think this is not the right approach to tackle this problem or you have any other comment please let us know.
This is the first patch and covers DDG data structures. Other patches will follow. But we would like to discuss the approach and get early feedback for this part of the work.
Can you please explain the difference between this function and SwingSchedulerDAG::addLoopCarriedDependences?