This is an archive of the discontinued LLVM Phabricator instance.

Differential D16829

An implementation of Swing Modulo Scheduling
ClosedPublic

Authored by bcahoon on Feb 2 2016, 3:37 PM.

Details

Reviewers

qcolombet
sebpop
arvindsm
marksl
tstellarAMD
kparzysz

Summary

Software pipelining is an optimization for improving ILP by overlapping loop iterations. Swing Modulo Scheduling (SMS) is an implementation of software pipelining that attempts to reduce register pressure and generate efficient pipelines with a low compile-time cost.

This implementation of SMS is a target-independent back-end pass. When enabled, the pass runs just prior to the register allocation pass, while the machine IR is in SSA form. If software pipelining is successful, then the original loop is replaced by the optimized loop. The optimized loop contains one or more prolog blocks, the pipelined kernel, and one or more epilog blocks.

The SMS implementation is an extension of the ScheduleDAGInstrs class. We represent loop carried dependences in the DAG as order edges to the Phi nodes. We also perform several passes over the DAG to eliminate unnecessary edges that inhibit the ability to pipeline. The implementation uses the DFAPacketizer class to compute the minimum initiation interval and to check where an instruction may be inserted in the pipelined schedule.

In order for the SMS pass to work, several target specific hooks need to be implemented to get information about the loop structure and to rewrite instructions. This patch implements the target hooks for Hexagon.

The pipeliner should be easily extendable to work with compare-and-branch loops instead of just hardware loops. I assume that may require some additional support from a target. Also, the implementation assumes the target uses the DFA code for scheduling, but I think it could also be changed to work with the scoreboard structure (or some other mechanism).

The SMS algorithm consists of three main steps after computing the minimal initiation interval (MII).

1. Analyze the dependence graph and compute information about each instruction in the graph.
2. Order the nodes (instructions) by priority based upon the heuristics described in the algorithm.
3. Attempt to schedule the nodes in the specified order using the MII.

If all the instructions can be scheduled in the specified order, then the algorithm is successful. Otherwise, we increase the MII by one and try again. When the algorithm is successful, we need to replace the original loop with one or more prolog blocks, the optimized kernel, and one or more epilog blocks. The number of prolog and epilog blocks depends on the number of stages in the scheduled loop. When creating the new blocks, we need to generate new SSA names and generate Phis for the new kernel and each epilog block. The process of creating the new, pipelined loop is quite complex and hard to understand. This part of the code can certainly use some additional work to simplify and to improve the readability.
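
The retry loop described above (try to schedule at the current interval and, on failure, increase it by one) can be sketched with a toy model. This is a minimal illustration, not the code in MachinePipeliner.cpp. ToyDep, ToyLoop, ToySchedule, toyResourceMII, tryToySchedule, and toyModuloSchedule are hypothetical names. The sketch assumes each instruction occupies one unit for one cycle, places nodes in list order rather than in the SMS priority order, and starts from the resource bound only, so recurrence constraints are discovered by the retry loop rather than being folded into the MII up front.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model only; these types are hypothetical and not part of LLVM.
struct ToyDep {
  int Src, Dst;  // instruction indices
  int Latency;   // cycles from Src issue until Dst may issue
  int Distance;  // 0 = same iteration, N = value produced N iterations earlier
};

struct ToyLoop {
  std::vector<int> Resource;  // Resource[i] = unit used by instruction i for one cycle
  int NumResources;
  std::vector<ToyDep> Deps;
};

struct ToySchedule {
  int II = 0;              // 0 means no schedule was found
  int NumStages = 0;
  std::vector<int> Cycle;  // issue cycle of each instruction
};

// Resource bound: number of uses of the busiest unit (one copy of each unit).
int toyResourceMII(const ToyLoop &L) {
  std::vector<int> Uses(L.NumResources, 0);
  for (int R : L.Resource)
    ++Uses[R];
  int MII = 1;
  for (int U : Uses)
    MII = std::max(MII, U);
  return MII;
}

// Try to place every instruction, in list order, for a fixed interval II.
bool tryToySchedule(const ToyLoop &L, int II, std::vector<int> &Cycle) {
  assert(II >= 1);
  const int N = static_cast<int>(L.Resource.size());
  Cycle.assign(N, 0);
  // Modulo reservation table: Busy[unit][cycle % II].
  std::vector<std::vector<bool>> Busy(L.NumResources,
                                      std::vector<bool>(II, false));
  for (int I = 0; I < N; ++I) {
    int Earliest = 0;
    for (const ToyDep &D : L.Deps)
      if (D.Dst == I && D.Src < I)
        Earliest = std::max(Earliest,
                            Cycle[D.Src] + D.Latency - II * D.Distance);
    bool Placed = false;
    // II consecutive cycles cover every slot of the reservation table.
    for (int C = Earliest; C < Earliest + II && !Placed; ++C) {
      if (!Busy[L.Resource[I]][C % II]) {
        Busy[L.Resource[I]][C % II] = true;
        Cycle[I] = C;
        Placed = true;
      }
    }
    if (!Placed)
      return false;
  }
  // Every dependence, including loop-carried ones, must hold modulo II.
  for (const ToyDep &D : L.Deps)
    if (Cycle[D.Dst] + II * D.Distance < Cycle[D.Src] + D.Latency)
      return false;
  return true;
}

// Start at the resource bound and increase II by one until a schedule fits.
ToySchedule toyModuloSchedule(const ToyLoop &L, int MaxII) {
  ToySchedule S;
  for (int II = toyResourceMII(L); II <= MaxII; ++II) {
    if (tryToySchedule(L, II, S.Cycle)) {
      S.II = II;
      int Last = S.Cycle.empty()
                     ? 0
                     : *std::max_element(S.Cycle.begin(), S.Cycle.end());
      S.NumStages = Last / II + 1;
      return S;
    }
  }
  S.Cycle.clear();
  return S;
}
```

For a three-instruction loop on three different units with dependences 0->1 (latency 2), 1->2 (latency 3), and a loop-carried edge 2->1 (latency 1, distance 1), the sketch rejects intervals 1 through 3 and settles on an interval of 4 with issue cycles 0, 2, and 5, which spans two stages.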

Event Timeline

bcahoon updated this revision to Diff 46712.Feb 2 2016, 3:37 PM

bcahoon retitled this revision from to An implementation of Swing Modulo Scheduling.

bcahoon updated this object.

materi added a subscriber: materi.Feb 3 2016, 3:39 AM

asl added a subscriber: llvm-commits.Feb 3 2016, 3:47 AM

This looks like a nice SMS implementation!

I have also implemented SMS in LLVM (in a completely out-of-tree target). My target is also a VLIW where software pipelining is very important. We did the implementation after the register coalescer because that's where our normal VLIW scheduler operated. Your implementation is earlier which seems to be better in some ways (you can use phi-nodes in a nice way).

Have you considered implementing bundling before regalloc? It seems like this could be advantageous since some false live range interferences can be avoided.

lib/CodeGen/MachinePipeliner.cpp
3057–3059	I do not understand how this works when more than one iteration starts to execute in the prolog. For example if the runtime trip count is 1, and 2 iterations are started in the prolog. Don't you miss executing some instructions from the only loop iteration? If this is not a bug, maybe you can add a test case that shows how this works?

mssimpso added a subscriber: mssimpso.Feb 3 2016, 5:36 AM

pjcoup added a subscriber: pjcoup.Feb 3 2016, 7:16 AM

We have considered adding bundles much earlier in the back-end, but I think that means changing the register allocator (and other passes) to work on bundles. It's been on our list of things to attempt, but we haven't done it yet. We've worked around the major problems of not having bundles in the pipeliner by carefully ordering the pipelined instructions, though it's not the ideal solution.

lib/CodeGen/MachinePipeliner.cpp
3057–3059	If two iterations are started in the prolog, then we generate two prolog basic blocks, and two epilog basic blocks. At the end of each prolog basic block, we add a compare and branch to the corresponding epilog basic block (the fall through is to the next prolog block or the kernel). This means that the first prolog block contains instructions from stage 0 and the second prolog block contains instructions from stage 1 and the 2nd iteration of stage 0. In your example, with a run-time trip count of 1, the first prolog block branches to the last epilog block, and the instructions in the last epilog block are the first iteration of instructions scheduled in stage 1 and stage 2. The swp-max.ll test case shows a pipelined schedule with 2 prolog and epilog blocks.

materi added inline comments.Feb 3 2016, 11:11 AM

lib/CodeGen/MachinePipeliner.cpp
3057–3059	Thank you! I think I understand how it works now. The prolog and epilog blocks are not the "bundles" of the SWP prolog and epilog. The jump label for my trip count = 1 case is put in the middle of the first "epilog bundle". But what if there are loop carried 0-latency dependences in the graph? This will force a certain order within the kernel to allow correct bundling in a later step. Can this be handled?

arvindsm added a subscriber: arvindsm.Feb 3 2016, 11:17 AM

bcahoon added inline comments.Feb 3 2016, 2:52 PM

lib/CodeGen/MachinePipeliner.cpp
3057–3059	If I'm understanding your question, then yes - we do handle the case of a loop carried 0-latency instruction. The order of the instructions in the prolog and epilog blocks is different than the order in the pipelined schedule. The prolog/epilog instructions appear in the original instruction order (i.e., prior to pipelining), and they are grouped by the pipelined stage. As an example, let's say there are 3 stages, numbered 0,1,2, so there will be two prolog blocks and two epilog blocks. The first prolog contains instructions from stage 0 in the original order. The last epilog contains instructions from stages 1 and 2 in original order. If the loop contains only 1 iteration, then the stage 0 instructions in the first prolog are executed, and control jumps to the last epilog block to execute the first iteration of instructions from stages 1 and 2. In the second prolog, we first generate the instructions from stage 1 in the original order, and then stage 0 in original order. In the second to last epilog, we generate instructions for stage 2 in the original order. If the loop has 2 iterations, then the 2 prolog blocks execute instructions from stage 0 twice, and stage 1 once. The 2 epilog blocks execute instructions from stage 2 twice and stage 1 once. I hope this makes sense and answers your question correctly. Let me know.
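
The block layout described in this reply can be checked with a small simulation that counts how often each stage executes for a given trip count. This is only a sketch: the function names are hypothetical, and the rule for more than three stages (prolog block P holds stages P-1 down to 0, epilog block E holds the last E stages, and leaving prolog P early enters the matching epilog) is an assumption extrapolated from the 3-stage example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helpers; the general rule is extrapolated from the 3-stage example.

// Stages emitted in prolog block P (1-based): stage P-1 first, then down to stage 0.
std::vector<int> prologStages(int P) {
  std::vector<int> Stages;
  for (int S = P - 1; S >= 0; --S)
    Stages.push_back(S);
  return Stages;
}

// Stages emitted in epilog block E (1-based, E == 1 directly follows the kernel):
// the last E stages, in increasing order.
std::vector<int> epilogStages(int NumStages, int E) {
  std::vector<int> Stages;
  for (int S = NumStages - E; S < NumStages; ++S)
    Stages.push_back(S);
  return Stages;
}

// Count how many times each stage runs for a given run-time trip count.
std::vector<int> stageExecutionCounts(int NumStages, int TripCount) {
  assert(NumStages >= 1 && TripCount >= 1);
  std::vector<int> Count(NumStages, 0);
  const int NumBlocks = NumStages - 1;  // number of prolog (and epilog) blocks
  const int PrologsRun = std::min(TripCount, NumBlocks);
  for (int P = 1; P <= PrologsRun; ++P)
    for (int S : prologStages(P))
      ++Count[S];
  // The kernel runs only when the trip count exceeds the number of prolog
  // blocks; each kernel iteration executes every stage once.
  for (int K = 0; K < TripCount - NumBlocks; ++K)
    for (int S = 0; S < NumStages; ++S)
      ++Count[S];
  // Leaving after prolog P enters epilog NumBlocks - P + 1; falling out of
  // the kernel enters epilog 1.
  const int FirstEpilog = NumBlocks - PrologsRun + 1;
  for (int E = FirstEpilog; E <= NumBlocks; ++E)
    for (int S : epilogStages(NumStages, E))
      ++Count[S];
  return Count;
}
```

With 3 stages and a trip count of 1, the simulation runs stage 0 in the first prolog and stages 1 and 2 in the last epilog, so every stage executes exactly once, matching the description above.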

jonpa added a subscriber: jonpa.Feb 3 2016, 10:33 PM

lib/CodeGen/MachinePipeliner.cpp
1382	If you have a functional unit that issues in stages, such that another instruction needing the same FU can issue the very next cycle, then isn't the sum of the cycles too great? Example:

InstrItinData<IIC_MUL_rr, [InstrStage<1, [MUL_DSP_STAGE1]>, InstrStage<1, [MUL_DSP_STAGE2]>, InstrStage<1, [MUL_DSP_STAGE3]>]>,

In this case multiplies will issue back-to-back and require 3 cycles to complete. If we have 2 multiplies, is ResMII then (2 * 3) / 1 = 6, when in reality it will be 4 cycles since it's broken into stages?

bcahoon added inline comments.Feb 5 2016, 3:53 PM

lib/CodeGen/MachinePipeliner.cpp
1382	In the case with the two multiplies, I think the resource MII should be 2. The resource MII is just the cycle count of the most heavily used resource, and each of the resources MUL_DSP_STAGE1, MUL_DSP_STAGE2, and MUL_DSP_STAGE3 is used for 2 cycles by the 2 instructions. But I agree that the code here is not going to return 2. With your type of itinerary, I believe the code computes that there are 6 functional units used. Then, when iterating over the functional units, the code calls canReserveResources() for each instruction 3 times. Each time, the query returns false, and a new DFA is created, for a total of 6 cycles. It seems to me that the DFA requires an additional parameter to determine the cycle, or stage, but I don't believe that is possible currently. I'll need to think about how to get this to work correctly for this type of itinerary. Unfortunately, it is hard for me to test a solution to this since Hexagon doesn't have a similar itinerary.
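
The counting rule stated in this reply (the resource MII is the cycle count of the most heavily used resource) can be illustrated with a small sketch. It does not use the DFAPacketizer and is not the patch's implementation. ToyItinerary, busiestUnitMII, and summedStageCycles are hypothetical names, and an itinerary is modeled simply as a list of (unit name, cycles) pairs.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical itinerary model: one (unit name, cycles) pair per pipeline stage.
using ToyItinerary = std::vector<std::pair<std::string, int>>;

// Resource MII as described above: total cycles on the busiest unit.
int busiestUnitMII(const std::vector<ToyItinerary> &Instrs) {
  std::map<std::string, int> CyclesPerUnit;
  for (const ToyItinerary &I : Instrs)
    for (const auto &Stage : I)
      CyclesPerUnit[Stage.first] += Stage.second;
  int MII = 1;
  for (const auto &Entry : CyclesPerUnit)
    MII = std::max(MII, Entry.second);
  return MII;
}

// The over-estimate raised in the question: every stage cycle is summed,
// as if all stages shared a single unit.
int summedStageCycles(const std::vector<ToyItinerary> &Instrs) {
  int Total = 0;
  for (const ToyItinerary &I : Instrs)
    for (const auto &Stage : I)
      Total += Stage.second;
  return Total;
}
```

For two multiplies that each spend one cycle in MUL_DSP_STAGE1, MUL_DSP_STAGE2, and MUL_DSP_STAGE3, busiestUnitMII gives 2 while summedStageCycles gives 6, which are the two numbers contrasted in this exchange.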

marksl added inline comments.Feb 11 2016, 9:59 AM

include/llvm/Target/TargetInstrInfo.h
1049	Should this return true for pre & post increment & decrement?

marksl added inline comments.Feb 11 2016, 4:58 PM

lib/CodeGen/MachinePipeliner.cpp
877	I've seen CodeGenPrepare delete the preheader. Specifically, if the previous block is a loop and this loop immediately follows it. I don't have a test case, but it's basically two sequential for loops. I wonder if that has any impact here?

bcahoon added inline comments.Feb 12 2016, 7:09 AM

include/llvm/Target/TargetInstrInfo.h
1049	I've only tested this for post increment. I think returning true for the other cases would cause problems, since the offset value adjustment would probably be incorrect (at least for the post/pre decrement cases). The pre-increment case would end up ok. A more general solution could be useful if your target, or others, have these variants. The pipeliner uses this information to eliminate the dependence on the post incremented value w.r.t other instructions that use the same base value.
lib/CodeGen/MachinePipeliner.cpp
877	Yes, it would have an impact for loops without a preheader. On Hexagon, when we create a hardware loop, the preheader is added if it's not there already. Then, the pipeliner pass only sees loops with a preheader. It would be easy enough to add a preheader if one doesn't exist already.

Very nice work, Brendon.

This revision is now accepted and ready to land.Feb 12 2016, 4:39 PM

Good work, Brendon.

marksl added inline comments.Mar 1 2016, 7:58 AM

lib/CodeGen/MachinePipeliner.cpp
1701	What paper are you referring to?

marksl added inline comments.Mar 1 2016, 9:08 AM

lib/CodeGen/MachinePipeliner.cpp
1701	Sorry, I found it was Tanya Lattner's paper.

bcahoon added inline comments.Mar 1 2016, 2:09 PM

lib/CodeGen/MachinePipeliner.cpp
1701	The original paper is "Swing Modulo Scheduling: A Lifetime-Sensitive Approach" from PACT 1996. Though, Tanya's thesis provides a good description as well.

marksl added inline comments.Mar 9 2016, 1:43 PM

lib/CodeGen/MachinePipeliner.cpp
3	Are you contributing this code under the terms of the LLVM License?

bcahoon added inline comments.Mar 10 2016, 6:33 AM

lib/CodeGen/MachinePipeliner.cpp
3	Yes. I need to change that comment. I'll submit a patch with the change.

flyingforyou added a subscriber: flyingforyou.Mar 15 2016, 6:41 PM

Updated to the correct license and rebased the patch.

I'm just wondering how to generate a loop which has only one basic block. For example,

extern int a, b[10];

void foo(void)
{
  for (int i = 0; i < 10; ++i) {
    a += b[i];
  }
}

Running clang -emit-llvm -S --target=hexagon foo.c generates

for.cond: ; preds = %for.inc, %entry
  %0 = load i32, i32* %i, align 4
  %cmp = icmp slt i32 %0, 10
  br i1 %cmp, label %for.body, label %for.end

for.body: ; preds = %for.cond
  %1 = load i32, i32* %i, align 4
  %arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* @b, i32 0, i32 %1
  %2 = load i32, i32* %arrayidx, align 4
  %3 = load i32, i32* @a, align 4
  %add = add nsw i32 %3, %2
  store i32 %add, i32* @a, align 4
  br label %for.inc

...

Thus the swp will not be done due to the constraint.

(gdb) p L->getNumBlocks()
$2 = 2

I'm just wondering how to generate a loop which has only one basic block. For an example,

Run clang -emit-llvm -S --target=hexagon foo.c will generate

Try with -O2 -fno-unroll-loops. The -O2 flag will generate a canonical loop with a single basic block and the -fno-unroll-loops flag makes sure the loop is not unrolled.

srhines added a subscriber: srhines.Apr 20 2016, 4:31 PM

It would be great to get someone else with more familiarity to approve this patch before submitting. Even though the pass is off by default (except for Hexagon), there are other changes to small pieces that should get additional attention.

I tested the patch on the EEMBC benchmarks with hexagon-sim. The speedup of autcor00data_2_lite can reach 120%. For the test cases where SMS runs its analysis but finally reports "Schedule Found? 0", the instruction and cycle counts change slightly.

Ping? It would be great to see this patch get some attention from an official code owner for CodeGen, since the new pass provides a substantial improvement for VLIW architectures.

srhines added reviewers: qcolombet, tstellarAMD.Jun 3 2016, 9:09 AM

Quentin and Tom: It was suggested to me at last night's social to add you both as potential reviewers for this code. Do you think you could help with officially approving this? I think that the VLIW (and even non-VLIW) benefits for having a modulo scheduling pass could be quite interesting for LLVM in general. Thanks.

Hi Stephen,

I will have a look if no-one did at some point.
I am way behind in reviews and open source work in general and I do not expect to be able to look at it before at least a couple of weeks.

Therefore, be patient, keep pinging people, and eventually the review will happen :).

Thanks for working on this BTW!

Cheers,
-Quentin

PS: I tend to look at reviews that do not have active reviewers. For that thread, a bunch of people commented on it and thus I thought it was already well covered. Sorry for not having looked earlier.

In D16829#448414, @qcolombet wrote:

Hi Stephen,

I will have a look if no-one did at some point.
I am way behind in reviews and open source work in general and I do not expect to be able to look at it before at least a couple of weeks.

Therefore, be patient, keep pinging people, and eventually the review will happen :).

No problem. I just want to make sure that this has a reasonable chance of being submitted in a reasonable timeframe.

Thanks for working on this BTW!

I actually am just trying to help shepherd this along, as we have users who really want this feature enabled for Hexagon. I didn't do any of the work here. All of the credit goes to the original author(s).

Cheers,
-Quentin

PS: I tend to look at reviews that do not have active reviewers. For that thread, a bunch of people commented on it and thus I thought it was already well covered. Sorry for not having looked earlier.

Ah, there were indeed some great reviews earlier, but I think that the concern was one of adding a new pass to LLVM without more extensive approval (i.e. a more senior contributor actually saying "Accept"). If you think that the other reviewers have done a sufficient job here (or that no additional reviews are needed, and anything else can be fixed up post-commit), please say so. Thanks again for taking time to respond and for the quick glance.

There are probably several targets out there that care about software pipelining so it is good to code and share it between targets. Who will be the code owner and conduct reviews/maintenance in the future?

I am not experienced with software pipelining algorithms so I mainly concentrated on coding style and llvm infrastructure issues. General comments:

There is an odd mismatch between the class being 'MachineSMS', the filename being 'MachinePipeliner', the debug type 'modulo-sched', and the cl::opt switches being called 'swp-XXX'.
While there are more than enough comments in the code, the pass could use some high-level introduction/references to papers used etc.

Detailed comments:

include/llvm/CodeGen/Passes.h
675–677	Avoid repeating the name of the method/field in the doxygen comment as per coding convention. (There's more instances of this following).
include/llvm/Target/TargetInstrInfo.h
537–538	For parameters that cannot be nullptr I would recommend using a reference to make this fact clear (probably only MachineLoop *L here). Similar for the following functions.
1006–1007	The function returns just a bool but sets BasePos/OffsetPos.
lib/CodeGen/MachinePipeliner.cpp
11–18	Use three slash doxygen.
114–116	Would be good to integrate this into the OptBisect infrastructure in subsequent patches.
132–133	we can use C++11 member initializers now and write `MachineFunction *MF = nullptr;` above instead.
171	This field could use a comment.
192–194	swap order of public: and the comment?
195	Given that we are inside a .cpp file specific to modulo scheduling anyway we can probably move all those helper classes to the toplevel instead of nesting them into the class.
212–215	Shouldn't you rather have the iterator implicitly cast to const_iterator so you do not need to have a templated version of the constructor (and insert etc. below)? Could also use member initializers.
304–316	Why not move all those to the top of the class declaration to the other analysis info?
360	Many function comments are repeated here and at the place where the function is actually implemented. You should only mention the comment in one of the two places or risk that the two will get out of sync in time. (Which of the two you choose is a matter of personal preference and shouldn't matter).
493	Use doxygen comment (same for some following functions).
812	I believe this pass is not required for correctness and should therefore have a `if(skipFunction(..)) return false;` sequence first.
824–825	Use range based for. Similar in many following loops.
1703	It would be good to mention relevant papers in the comment at the beginning of the .cpp file.
1816	Should avoid `auto` here and mention the type name. Some more instances following.
3495	nullptr
lib/CodeGen/Passes.cpp
113–116	Even though you see both, I think the more typical style nowadays would be to move this flag into the .cpp file of the pass and make it the first thing that is checked in runOnMachineFunction().
552–556	Maybe we should leave adding the pass to the backends that support it and let them do it by overriding addPreRegAlloc()?
test/CodeGen/Hexagon/swp-const-tc.ll
7	Some comments about all the tests: Tests should start with ; CHECK-LABEL: functionname: to make them more stable against unrelated output that happens to be on stdout triggering check lines (pass debug output, filenames, etc.) The tests appear to contain unnecessary extra data (nocapture, readonly) flags, sometimes strange value or block names, function and alias metadata. I am sure they can and should be further simplified.
test/CodeGen/swp-multi-loops.ll
76–83	Is all this metadata really necessary for the test?
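
Several of the style points raised in this review (a default member initializer instead of constructor assignment, range-based for loops, and checking the enable flag and a skip request before doing any work) can be shown together in a small mock. This is an illustrative sketch only: MockLoop, MockFunction, and MockPipeliner are hypothetical stand-ins rather than LLVM classes, and the single-block test simply mirrors the one-basic-block constraint mentioned elsewhere in this thread.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not LLVM classes.
struct MockLoop {
  int NumBlocks;
};

struct MockFunction {
  bool Skip = false;  // models a skipFunction()-style opt-out
  std::vector<MockLoop> Loops;
};

class MockPipeliner {
public:
  explicit MockPipeliner(bool Enable) : Enabled(Enable) {}

  // Returns true if at least one loop was (notionally) pipelined.
  bool runOnFunction(MockFunction &F) {
    if (!Enabled)  // the enable flag is the first thing checked
      return false;
    if (F.Skip)    // the pass is optional, so honor a skip request
      return false;
    MF = &F;
    for (const MockLoop &L : MF->Loops)  // range-based for, no index loop
      if (L.NumBlocks == 1)              // only single-block loops qualify
        ++NumPipelined;
    return NumPipelined > 0;
  }

  int numPipelined() const { return NumPipelined; }

private:
  bool Enabled;
  MockFunction *MF = nullptr;  // C++11 default member initializer
  int NumPipelined = 0;
};
```

A function with loops of 1, 2, and 1 blocks yields two candidates when the mock is enabled, and none when the flag is off or the function asks to be skipped.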

Hi Matthias - thank you for the review and the comments. I really appreciate it. I think I've addressed all of your comments, unless I've inadvertently missed something. The new patch includes a lot of changes based upon your comments. The code has also been rebased.

Here's a summary of the changes:
I decided to use the name Pipeliner to be more consistent with naming, except for the test names which still start with swp. I don't really have a strong preference of Pipeliner vs. SWP vs. SMS in the naming scheme.

I've moved the code to add the pipeliner pass to HexagonTargetMachine.cpp in the addPreRegAlloc method instead of doing it in TargetPassConfig.cpp.

I've added some more high-level comments at the beginning with references to a couple of useful papers.

I've converted many loops to be range-based for loops.

I've removed the comments from the method declarations so that they are not repeated in the method implementation.

I've removed unnecessary metadata, etc. from the tests.

Thanks,
Brendon

sebpop added a subscriber: sebpop.Jun 15 2016, 7:29 AM

sebpop added inline comments.

lib/CodeGen/MachinePipeliner.cpp
2223	Both generateExistingPhis and generatePhis are passed the same parameters. Can we have this code factored up in a class: that would allow to split these two functions into smaller functions easier to follow.
2509	Could we have the code of this loop split up into smaller functions?
2514	I find the indexing in VRMap to be difficult to follow: could we have the arithmetic hidden behind some set/get interface for the prolog/epilog?

kparzysz added a subscriber: kparzysz.Jun 15 2016, 1:45 PM

After ISEL our compare instructions, multiply, and MAC instructions have real physical register side effects. I'm getting errors from SWP for loops containing these physical register dependencies. Are you aware of this? Is there a way to model physical register dependencies with loop carried dependencies such that we would generate correct code for them?

bcahoon added inline comments.Jun 16 2016, 2:00 PM

lib/CodeGen/MachinePipeliner.cpp
2223	Hi Sebastian - thanks for the comments. The code for generating Phis in the pipelined schedule really could use some work. It's a non-trivial effort, but I'll try to do some refactoring to improve it.

In D16829#460066, @marksl wrote:

After ISEL our compare instructions, multiply, and MAC instructions have real physical register side effects. I'm getting errors from SWP for loops containing these physical register dependencies. Are you aware of this? Is there a way to model physical register dependencies with loop carried dependencies such that we would generate correct code for them?

Hi Mark - yes, I am aware of the problem, but don't yet have a good fix for it. One potential fix is to not pipeline loops that end up with a loop carried physical register. That's what I've added to my local version. I added the following function, which is called from schedulePipeline() if a schedule is found.

bool SMSchedule::isValidSchedule(SwingSchedulerDAG *SSD) {
  const TargetRegisterInfo *TRI = ST.getRegisterInfo();
  for (int i = 0, e = SSD->SUnits.size(); i < e; ++i) {
    SUnit &SU = SSD->SUnits[i];
    if (!SU.hasPhysRegDefs)
      continue;
    int StageDef = stageScheduled(&SU);
    assert(StageDef != -1 && "Instruction should have been scheduled.");
    for (auto &SI : SU.Succs)
      if (SI.isAssignedRegDep())
        if (TRI->isPhysicalRegister(SI.getReg()))
          if (stageScheduled(SI.getSUnit()) != StageDef)
            return false;
  }
  return true;
}

Seems like all significant issues have been addressed. The remaining ones can be fixed in subsequent commits.

Agreed, cleanups should not hold up committing SMS.
LGTM.

Rebased the patch. There were some API changes to resolve.

I've included several bug fixes since the last update, including:

Updates to deal with physical registers. The register pressure code needs to check for physical registers. Also, the pipeliner does not allow a schedule that generates a loop carried dependence for a physical register.
A bug fix to the code that generates Phis for the pipelined loop.
A bug fix to the code that generates the final instruction order, when serializing the instructions.

Unless there is an objection, I'd like to be able to commit this code. There are several other folks that use the patch, and it would be great to enable them to make updates and improvements to the software pipeliner. This patch enables the pipeliner for the Hexagon target. I think there is a good opportunity for others to contribute to the pass, to improve the functionality, and to enable it for other targets.

In D16829#494869, @bcahoon wrote:

Unless there is an objection, I'd like to be able to commit this code. There are several other folks that use the patch, and it would be great to enable them to make updates and improvements to the software pipeliner. This patch enables the pipeliner for the Hexagon target. I think there is a good opportunity for others to contribute to the pass, to improve the functionality, and to enable it for other targets.

As this is only enabled on Hexagon, I would say go ahead and commit: it is a very useful performance feature for that target.
If you get more feedback in the future, you can address that in follow-up patches.

Thanks,
Sebastian

Committed in https://reviews.llvm.org/rL277169
Thanks Brendon!

HanyeWei added a subscriber: HanyeWei.Oct 11 2016, 10:06 PM

lihan2011 added a subscriber: lihan2011.Jan 6 2017, 11:42 PM

When I use SMS.enterRegion(MBB, MBB->begin(), MBB->getFirstTerminator(), size); I get
Assertion failed: VNI && "No value to read by operand"
but if I use SMS.enterRegion(MBB, MBB->getFirstNonPHI(), MBB->getFirstTerminator(), size2); there is no error.
Am I using the wrong version of LLVM?

In D16829#638891, @lihan2011 wrote:

When I use SMS.enterRegion(MBB, MBB->begin(), MBB->getFirstTerminator(), size); I get
Assertion failed: VNI && "No value to read by operand"
but if I use SMS.enterRegion(MBB, MBB->getFirstNonPHI(), MBB->getFirstTerminator(), size2); there is no error.
Am I using the wrong version of LLVM?

You do need to pass MBB->begin() to enterRegion for the pipeliner to work correctly. It looks like you're using an older version of LLVM. I believe this issue has been fixed with a commit that was made on Dec. 4 2015 to ScheduleDAGInstrs.cpp. If you're unable to use a newer version of LLVM, I'd suggest the following change to addVRegUseDeps in ScheduleDAGInstrs.cpp:

   // VNI will be valid because MachineOperand::readsReg() is checked by caller.
-  assert(VNI && "No value to read by operand");
-  MachineInstr *Def = LIS->getInstructionFromIndex(VNI->def);
+  MachineInstr *Def = (VNI ? LIS->getInstructionFromIndex(VNI->def) : 0);
   // Phis and other noninstructions (after coalescing) have a NULL Def

Thanks,
Brendon

Thanks for your suggestions!

Does this problem occur because the PHI node's use operands are defined in other basic blocks?

For example:
%vreg1<def> = PHI %vreg5, <BB#0>, %vreg4, <BB#1>; CPURegs:%vreg1,%vreg5,%vreg4
......
%vreg4<def> = ADDiu %vreg1, 4; CPURegs:%vreg4,%vreg1
When the code handles ADDiu %vreg1, it cannot find the definition of %vreg1 in the PHI node.
So does the function void SwingSchedulerDAG::updatePhiDependences() fix this problem?

Thanks,
Lee

javed.absar added a subscriber: javed.absar.Jun 28 2018, 5:39 AM

Herald added subscribers: mgrang, mgorny, wdng.Jun 28 2018, 5:39 AM

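Revision Contents

Path	Size
include/llvm/CodeGen/Passes.h	4 lines
include/llvm/InitializePasses.h	1 line
include/llvm/Target/TargetInstrInfo.h	40 lines
lib/CodeGen/CMakeLists.txt	1 line
lib/CodeGen/CodeGen.cpp	1 line
lib/CodeGen/MachinePipeliner.cpp	3986 lines
lib/CodeGen/Passes.cpp	9 lines
lib/Target/Hexagon/HexagonInstrInfo.h	33 lines
lib/Target/Hexagon/HexagonInstrInfo.cpp	108 lines
lib/Target/Hexagon/HexagonTargetMachine.cpp	4 lines
test/CodeGen/Hexagon/swp-const-tc.ll	52 lines
test/CodeGen/Hexagon/swp-dag-phi.ll	42 lines
test/CodeGen/Hexagon/swp-epilog-reuse.ll	68 lines
test/CodeGen/Hexagon/swp-matmul-bitext.ll	77 lines
test/CodeGen/Hexagon/swp-max.ll	42 lines
test/CodeGen/Hexagon/swp-vect-dotprod.ll	41 lines
test/CodeGen/Hexagon/swp-vmult.ll	33 lines
test/CodeGen/Hexagon/swp-vsum.ll	29 lines
test/CodeGen/swp-multi-loops.ll	82 lines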

Diff 46712

include/llvm/CodeGen/Passes.h

... 665 lines not shown ...	/// MachineDominanaceFrontier - This pass is a machine dominators analysis pass.
   ///
   Pass *createGlobalMergePass(const TargetMachine *TM, unsigned MaximalOffset,
                               bool OnlyOptimizeForSize = false,
                               bool MergeExternalByDefault = false);

   /// This pass splits the stack into a safe stack and an unsafe stack to
   /// protect against stack-based overflow vulnerabilities.
   FunctionPass *createSafeStackPass(const TargetMachine *TM = nullptr);

+  /// MachineSMS - This pass performs software pipelining on machine
+  /// instructions.
+  extern char &MachineSMSID;
 } // End llvm namespace

 /// Target machine pass initializer for passes with dependencies. Use with
 /// INITIALIZE_TM_PASS_END.
 #define INITIALIZE_TM_PASS_BEGIN INITIALIZE_PASS_BEGIN

 /// Target machine pass initializer for passes with dependencies. Use with
 /// INITIALIZE_TM_PASS_BEGIN.
... 21 lines not shown ...

include/llvm/InitializePasses.h

	Show First 20 Lines • Show All 193 Lines • ▼ Show 20 Lines
	void initializeMachineDominanceFrontierPass(PassRegistry&);			void initializeMachineDominanceFrontierPass(PassRegistry&);
	void initializeMachinePostDominatorTreePass(PassRegistry&);			void initializeMachinePostDominatorTreePass(PassRegistry&);
	void initializeMachineLICMPass(PassRegistry&);			void initializeMachineLICMPass(PassRegistry&);
	void initializeMachineLoopInfoPass(PassRegistry&);			void initializeMachineLoopInfoPass(PassRegistry&);
	void initializeMachineModuleInfoPass(PassRegistry&);			void initializeMachineModuleInfoPass(PassRegistry&);
	void initializeMachineRegionInfoPassPass(PassRegistry&);			void initializeMachineRegionInfoPassPass(PassRegistry&);
	void initializeMachineSchedulerPass(PassRegistry&);			void initializeMachineSchedulerPass(PassRegistry&);
	void initializeMachineSinkingPass(PassRegistry&);			void initializeMachineSinkingPass(PassRegistry&);
				void initializeMachineSMSPass(PassRegistry&);
	void initializeMachineTraceMetricsPass(PassRegistry&);			void initializeMachineTraceMetricsPass(PassRegistry&);
	void initializeMachineVerifierPassPass(PassRegistry&);			void initializeMachineVerifierPassPass(PassRegistry&);
	void initializeMemCpyOptPass(PassRegistry&);			void initializeMemCpyOptPass(PassRegistry&);
	void initializeMemDepPrinterPass(PassRegistry&);			void initializeMemDepPrinterPass(PassRegistry&);
	void initializeMemDerefPrinterPass(PassRegistry&);			void initializeMemDerefPrinterPass(PassRegistry&);
	void initializeMemoryDependenceAnalysisPass(PassRegistry&);			void initializeMemoryDependenceAnalysisPass(PassRegistry&);
	void initializeMergedLoadStoreMotionPass(PassRegistry &);			void initializeMergedLoadStoreMotionPass(PassRegistry &);
	void initializeMetaRenamerPass(PassRegistry&);			void initializeMetaRenamerPass(PassRegistry&);

include/llvm/Target/TargetInstrInfo.h


#ifndef LLVM_TARGET_TARGETINSTRINFO_H		#ifndef LLVM_TARGET_TARGETINSTRINFO_H
#define LLVM_TARGET_TARGETINSTRINFO_H		#define LLVM_TARGET_TARGETINSTRINFO_H

#include "llvm/ADT/DenseMap.h"		#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallSet.h"		#include "llvm/ADT/SmallSet.h"
#include "llvm/CodeGen/MachineCombinerPattern.h"		#include "llvm/CodeGen/MachineCombinerPattern.h"
#include "llvm/CodeGen/MachineFunction.h"		#include "llvm/CodeGen/MachineFunction.h"
		#include "llvm/CodeGen/MachineLoopInfo.h"
#include "llvm/MC/MCInstrInfo.h"		#include "llvm/MC/MCInstrInfo.h"
#include "llvm/Support/BranchProbability.h"		#include "llvm/Support/BranchProbability.h"
#include "llvm/Target/TargetRegisterInfo.h"		#include "llvm/Target/TargetRegisterInfo.h"

namespace llvm {		namespace llvm {

class InstrItineraryData;		class InstrItineraryData;
class LiveVariables;		class LiveVariables;
▲ Show 20 Lines • Show All 495 Lines • ▼ Show 20 Lines	public:
/// merging needs to be disabled.		/// merging needs to be disabled.
virtual unsigned InsertBranch(MachineBasicBlock &MBB, MachineBasicBlock *TBB,		virtual unsigned InsertBranch(MachineBasicBlock &MBB, MachineBasicBlock *TBB,
MachineBasicBlock *FBB,		MachineBasicBlock *FBB,
ArrayRef<MachineOperand> Cond,		ArrayRef<MachineOperand> Cond,
DebugLoc DL) const {		DebugLoc DL) const {
llvm_unreachable("Target didn't implement TargetInstrInfo::InsertBranch!");		llvm_unreachable("Target didn't implement TargetInstrInfo::InsertBranch!");
}		}

		/// AnalyzeLoop - Analyze the loop code, return true if it cannot be
		/// understood. Upon success, this function returns false and returns
		/// information about the induction variable and compare instruction
		/// used at the end.
		virtual bool AnalyzeLoop(MachineLoop *L, MachineInstr *&IndVarInst,
		MachineInstr *&CmpInst) const {
		MatzeB: For parameters that cannot be nullptr I would recommend using a reference to make this fact clear (probably only MachineLoop *L here). Similar for the following functions.
		return true;
		}

		/// ReduceLoopCount - Generate code to reduce the loop iteration by one
		/// and check if the loop is finished. Return the value/register of
		/// the new loop count. We need this function when peeling off one
		/// or more iterations of a loop. This function assumes the nth iteration
		/// is peeled first.
		virtual unsigned ReduceLoopCount(MachineBasicBlock &MBB,
		MachineInstr *IndVar, MachineInstr *Cmp,
		SmallVectorImpl<MachineOperand> &Cond,
		SmallVectorImpl<MachineInstr *> &PrevInsts,
		unsigned Iter, unsigned MaxIter) const {
		llvm_unreachable("Target didn't implement ReduceLoopCount");
		}

/// Delete the instruction OldInst and everything after it, replacing it with		/// Delete the instruction OldInst and everything after it, replacing it with
/// an unconditional branch to NewDest. This is used by the tail merging pass.		/// an unconditional branch to NewDest. This is used by the tail merging pass.
virtual void ReplaceTailWithBranchTo(MachineBasicBlock::iterator Tail,		virtual void ReplaceTailWithBranchTo(MachineBasicBlock::iterator Tail,
MachineBasicBlock *NewDest) const;		MachineBasicBlock *NewDest) const;

/// Get an instruction that performs an unconditional branch to the given		/// Get an instruction that performs an unconditional branch to the given
/// symbol.		/// symbol.
virtual void		virtual void
▲ Show 20 Lines • Show All 435 Lines • ▼ Show 20 Lines	public:
/// Get the base register and byte offset of an instruction that reads/writes		/// Get the base register and byte offset of an instruction that reads/writes
/// memory.		/// memory.
virtual bool getMemOpBaseRegImmOfs(MachineInstr *MemOp, unsigned &BaseReg,		virtual bool getMemOpBaseRegImmOfs(MachineInstr *MemOp, unsigned &BaseReg,
unsigned &Offset,		unsigned &Offset,
const TargetRegisterInfo *TRI) const {		const TargetRegisterInfo *TRI) const {
return false;		return false;
}		}

		/// For instructions with a base and offset, return the position of the
		/// base register and offset operands.
		MatzeB: The function returns just a bool but sets BasePos/OffsetPos.
		virtual bool getBaseAndOffsetPosition(const MachineInstr *MI,
		unsigned &BasePos,
		unsigned &OffsetPos) const {
		return false;
		}

		/// If the instruction is an increment of a constant value, return the amount.
		virtual bool getIncrementValue(const MachineInstr *MI, int &Value) const {
		return false;
		}

virtual bool enableClusterLoads() const { return false; }		virtual bool enableClusterLoads() const { return false; }

virtual bool shouldClusterLoads(MachineInstr *FirstLdSt,		virtual bool shouldClusterLoads(MachineInstr *FirstLdSt,
MachineInstr *SecondLdSt,		MachineInstr *SecondLdSt,
unsigned NumLoads) const {		unsigned NumLoads) const {
return false;		return false;
}		}

Show All 14 Lines	public:
/// Insert a noop into the instruction stream at the specified point.		/// Insert a noop into the instruction stream at the specified point.
virtual void insertNoop(MachineBasicBlock &MBB,		virtual void insertNoop(MachineBasicBlock &MBB,
MachineBasicBlock::iterator MI) const;		MachineBasicBlock::iterator MI) const;


/// Return the noop instruction to use for a noop.		/// Return the noop instruction to use for a noop.
virtual void getNoopForMachoTarget(MCInst &NopInst) const;		virtual void getNoopForMachoTarget(MCInst &NopInst) const;

		/// Return true for post-incremented instructions.
		marksl: Should this return true for pre & post increment & decrement?
		bcahoon (author): I've only tested this for post increment. I think returning true for the other cases would cause problems, since the offset value adjustment would probably be incorrect (at least for the post/pre decrement cases). The pre-increment case would end up ok. A more general solution could be useful if your target, or others, have these variants. The pipeliner uses this information to eliminate the dependence on the post incremented value w.r.t. other instructions that use the same base value.
		virtual bool isPostIncrement(const MachineInstr* MI) const {
		return false;
		}

/// Returns true if the instruction is already predicated.		/// Returns true if the instruction is already predicated.
virtual bool isPredicated(const MachineInstr *MI) const {		virtual bool isPredicated(const MachineInstr *MI) const {
return false;		return false;
}		}

/// Returns true if the instruction is a		/// Returns true if the instruction is a
/// terminator instruction that has not been predicated.		/// terminator instruction that has not been predicated.

lib/CodeGen/CMakeLists.txt

add_llvm_library(LLVMCodeGen
MachineFunctionPrinterPass.cpp		MachineFunctionPrinterPass.cpp
MachineInstrBundle.cpp		MachineInstrBundle.cpp
MachineInstr.cpp		MachineInstr.cpp
MachineLICM.cpp		MachineLICM.cpp
MachineLoopInfo.cpp		MachineLoopInfo.cpp
MachineModuleInfo.cpp		MachineModuleInfo.cpp
MachineModuleInfoImpls.cpp		MachineModuleInfoImpls.cpp
MachinePassRegistry.cpp		MachinePassRegistry.cpp
		MachinePipeliner.cpp
MachinePostDominators.cpp		MachinePostDominators.cpp
MachineRegionInfo.cpp		MachineRegionInfo.cpp
MachineRegisterInfo.cpp		MachineRegisterInfo.cpp
MachineScheduler.cpp		MachineScheduler.cpp
MachineSink.cpp		MachineSink.cpp
MachineSSAUpdater.cpp		MachineSSAUpdater.cpp
MachineTraceMetrics.cpp		MachineTraceMetrics.cpp
MachineVerifier.cpp		MachineVerifier.cpp

lib/CodeGen/CodeGen.cpp

void llvm::initializeCodeGen(PassRegistry &Registry) {
initializeMachineDominatorTreePass(Registry);		initializeMachineDominatorTreePass(Registry);
initializeMachineFunctionPrinterPassPass(Registry);		initializeMachineFunctionPrinterPassPass(Registry);
initializeMachineLICMPass(Registry);		initializeMachineLICMPass(Registry);
initializeMachineLoopInfoPass(Registry);		initializeMachineLoopInfoPass(Registry);
initializeMachineModuleInfoPass(Registry);		initializeMachineModuleInfoPass(Registry);
initializeMachinePostDominatorTreePass(Registry);		initializeMachinePostDominatorTreePass(Registry);
initializeMachineSchedulerPass(Registry);		initializeMachineSchedulerPass(Registry);
initializeMachineSinkingPass(Registry);		initializeMachineSinkingPass(Registry);
		initializeMachineSMSPass(Registry);
initializeMachineVerifierPassPass(Registry);		initializeMachineVerifierPassPass(Registry);
initializeOptimizePHIsPass(Registry);		initializeOptimizePHIsPass(Registry);
initializePEIPass(Registry);		initializePEIPass(Registry);
initializePHIEliminationPass(Registry);		initializePHIEliminationPass(Registry);
initializePeepholeOptimizerPass(Registry);		initializePeepholeOptimizerPass(Registry);
initializePostMachineSchedulerPass(Registry);		initializePostMachineSchedulerPass(Registry);
initializePostRASchedulerPass(Registry);		initializePostRASchedulerPass(Registry);
initializeProcessImplicitDefsPass(Registry);		initializeProcessImplicitDefsPass(Registry);

lib/CodeGen/MachinePipeliner.cpp

This file was added.

				//===-- MachinePipeliner.cpp - Machine Software Pipeliner Pass ------------===//
				//
				// (c) 2013 Qualcomm Innovation Center, Inc. All rights reserved.
				marksl: Are you contributing this code under the terms of the LLVM License?
				bcahoon (author): Yes. I need to change that comment. I'll submit a patch with the change.
				//
				// An implementation of the Swing Modulo Scheduling (SMS) software pipeliner.
				//
				// Software pipelining is an instruction scheduling technique for loops that
				// overlaps loop iterations and exploits ILP via a compiler transformation.
				//
				// Swing Modulo Scheduling (SMS) is an implementation of software pipelining
				// that generates schedules that are near optimal in terms of initiation
				// interval, register requirements, and stage count.
				//
				//===----------------------------------------------------------------------===//

				#include "llvm/ADT/DenseMap.h"
				#include "llvm/ADT/MapVector.h"
				#include "llvm/ADT/PriorityQueue.h"
				MatzeB: Use three slash doxygen.
				#include "llvm/ADT/SetVector.h"
				#include "llvm/ADT/SmallPtrSet.h"
				#include "llvm/ADT/SmallSet.h"
				#include "llvm/ADT/Statistic.h"
				#include "llvm/Analysis/AliasAnalysis.h"
				#include "llvm/Analysis/ValueTracking.h"
				#include "llvm/CodeGen/DFAPacketizer.h"
				#include "llvm/CodeGen/LiveIntervalAnalysis.h"
				#include "llvm/CodeGen/MachineBasicBlock.h"
				#include "llvm/CodeGen/MachineDominators.h"
				#include "llvm/CodeGen/MachineInstrBuilder.h"
				#include "llvm/CodeGen/MachineLoopInfo.h"
				#include "llvm/CodeGen/MachineRegisterInfo.h"
				#include "llvm/CodeGen/Passes.h"
				#include "llvm/CodeGen/RegisterClassInfo.h"
				#include "llvm/CodeGen/RegisterPressure.h"
				#include "llvm/CodeGen/ScheduleDAGInstrs.h"
				#include "llvm/MC/MCInstrItineraries.h"
				#include "llvm/Support/CommandLine.h"
				#include "llvm/Support/Debug.h"
				#include "llvm/Support/raw_ostream.h"
				#include "llvm/Target/TargetInstrInfo.h"
				#include "llvm/Target/TargetMachine.h"
				#include "llvm/Target/TargetRegisterInfo.h"
				#include "llvm/Target/TargetSubtargetInfo.h"
				#include <climits>
				#include <deque>
				#include <map>

				using namespace llvm;

				#define DEBUG_TYPE "modulo-sched"

				STATISTIC(NumTrytoPipeline, "Number of loops that we attempt to pipeline");
				STATISTIC(NumPipelined, "Number of loops software pipelined");

				// A command line option to enable SWP at -Os.
				static cl::opt<bool> EnableSWPOptSize("enable-swp-opt-size",
				cl::desc("Enable SWP at Os."), cl::Hidden,
				cl::init(false));

				// A command line argument to limit the minimum initiation interval for pipelining.
				static cl::opt<int> SwpMaxMii("swp-max-mii",
				cl::desc("Size limit for the MII."),
				cl::Hidden, cl::init(27));

				static cl::opt<int>
				SwpMaxStages("swp-max-stages",
				cl::desc("Maximum stages allowed in the generated schedule."),
				cl::Hidden, cl::init(3));

				// A command line option to disable the pruning of chain dependences due to
				// an unrelated Phi.
				static cl::opt<bool>
				SwpPruneDeps("swp-prune-deps",
				cl::desc("Prune dependences between unrelated Phi nodes."),
				cl::Hidden, cl::init(true));

				// A command line option to disable the pruning of loop carried order
				// dependences.
				static cl::opt<bool>
				SwpPruneLoopCarried("swp-prune-loop-carried",
				cl::desc("Prune loop carried order dependences."),
				cl::Hidden, cl::init(true));

				#ifndef NDEBUG
				static cl::opt<int> SwpLoopLimit("swp-max", cl::Hidden, cl::init(-1));
				#endif

				static cl::opt<bool> SwpIgnoreRecMII("swp-ignore-recmii", cl::ReallyHidden,
				cl::init(false), cl::ZeroOrMore,
				cl::desc("Ignore RecMII"));

				namespace {

				class SMSchedule;
				class SwingSchedulerDAG;

				// The main class in the implementation of the target independent
				// software pipeliner pass.
				class MachineSMS : public MachineFunctionPass {
				public:
				MachineFunction *MF;
				const MachineLoopInfo *MLI;
				const MachineDominatorTree *MDT;
				const InstrItineraryData *InstrItins;
				const TargetInstrInfo *TII;
				RegisterClassInfo RegClassInfo;

				#ifndef NDEBUG
				static int NumTries;
				#endif
				// Cache the target analysis information about the loop.
				struct LoopInfo {
				MachineBasicBlock *TBB;
				MachineBasicBlock *FBB;
				SmallVector<MachineOperand, 4> BrCond;
				MachineInstr *LoopInductionVar;
				MatzeB: Would be good to integrate this into the OptBisect infrastructure in subsequent patches.
				MachineInstr *LoopCompare;
				LoopInfo()
				: TBB(nullptr), FBB(nullptr), LoopInductionVar(nullptr),
				LoopCompare(nullptr) {}
				};
				LoopInfo LI;

				static char ID;
				MachineSMS()
				: MachineFunctionPass(ID), MF(nullptr), MLI(nullptr), MDT(nullptr),
				TII(nullptr) {
				initializeMachineSMSPass(*PassRegistry::getPassRegistry());
				}

				virtual bool runOnMachineFunction(MachineFunction &MF);

				virtual void getAnalysisUsage(AnalysisUsage &AU) const {
				MatzeB: We can use C++11 member initializers now and write `MachineFunction *MF = nullptr;` above instead.
				AU.addRequired<AAResultsWrapperPass>();
				AU.addPreserved<AAResultsWrapperPass>();
				AU.addRequired<MachineLoopInfo>();
				AU.addRequired<MachineDominatorTree>();
				AU.addRequired<LiveIntervals>();
				MachineFunctionPass::getAnalysisUsage(AU);
				}

				private:
				// Return true if the loop can be software pipelined. The algorithm is
				// restricted to loops with a single basic block that the target is
				// able to analyze.
				bool canPipelineLoop(MachineLoop *L);

				// Attempt to perform the SMS algorithm on the specified loop. This
				// function is the main entry point for the algorithm. The function
				// identifies candidate loops, calculates the minimum initiation
				// interval, and attempts to schedule the loop.
				bool scheduleLoop(MachineLoop *L);

				// The SMS algorithm consists of the following main steps:
				// 1. Computation and analysis of the dependence graph.
				// 2. Ordering of the nodes (instructions).
				// 3. Attempt to schedule the loop.
				bool swingModuloScheduler(MachineLoop *L);
				};

				// This class builds the dependence graph for the instructions in a loop,
				// and attempts to schedule the instructions using the SMS algorithm.
				class SwingSchedulerDAG : public ScheduleDAGInstrs {
				MachineSMS *Pass;
				unsigned MII;
				bool Scheduled;
				MachineLoop *Loop;
				LiveIntervals *LIS;

				/// A topological ordering of the SUnits, which is needed for changing
				/// dependences and iterating over the SUnits.
				MatzeB: This field could use a comment.
				ScheduleDAGTopologicalSort Topo;

				struct NodeInfo {
				int ASAP;
				int ALAP;
				NodeInfo() : ASAP(0), ALAP(0) {}
				};
				/// Computed properties for each node in the graph.
				std::vector<NodeInfo> ScheduleInfo;

				enum OrderKind { BottomUp = 0, TopDown = 1 };
				/// Computed node ordering for scheduling.
				SetVector<SUnit *> NodeOrder;

				// A NodeSet contains a set of SUnit DAG nodes with additional information
				// that assigns a priority to the set.
				public:
				class NodeSet {
				SetVector<SUnit *> Nodes;
				bool HasRecurrence;
				unsigned RecMII;
				int MaxMOV;
				int MaxDepth;
				MatzeB: Swap order of public: and the comment?
				unsigned Colocate;
				MatzeB: Given that we are inside a .cpp file specific to modulo scheduling anyway we can probably move all those helper classes to the toplevel instead of nesting them into the class.
				SUnit *ExceedPressure;

				public:
				typedef SetVector<SUnit *>::iterator iterator;
				typedef SetVector<SUnit *>::const_iterator const_iterator;

				NodeSet()
				: Nodes(), HasRecurrence(false), RecMII(0), MaxMOV(0), MaxDepth(0),
				Colocate(0), ExceedPressure(nullptr) {}

				template <typename It>
				NodeSet(It S, It E)
				: Nodes(S, E), HasRecurrence(true), RecMII(0), MaxMOV(0), MaxDepth(0),
				Colocate(0), ExceedPressure(nullptr) {}

				bool insert(SUnit *SU) { return Nodes.insert(SU); }

				template <typename It> void insert(It B, It E) { Nodes.insert(B, E); }

				template <typename UnaryPredicate> bool remove_if(UnaryPredicate P) {
				MatzeB: Shouldn't you rather have the iterator implicitly cast to const_iterator so you do not need to have a templated version of the constructor (and insert etc. below)? Could also use member initializers.
				return Nodes.remove_if(P);
				}

				unsigned count(SUnit *SU) const { return Nodes.count(SU); }

				bool hasRecurrence() { return HasRecurrence; };

				unsigned size() const { return Nodes.size(); }

				bool empty() const { return Nodes.empty(); }

				SUnit *getNode(unsigned i) const { return Nodes[i]; };

				void setRecMII(unsigned mii) { RecMII = mii; };

				void setColocate(unsigned c) { Colocate = c; };

				void setExceedPressure(SUnit *SU) { ExceedPressure = SU; }

				bool isExceedSU(SUnit *SU) { return ExceedPressure == SU; }

				int compareRecMII(NodeSet &RHS) { return RecMII - RHS.RecMII; }

				int getRecMII() { return RecMII; }

				/// Summarize node functions for the entire node set.
				void computeNodeSetInfo(SwingSchedulerDAG *SSD) {
				for (NodeSet::iterator I = begin(), E = end(); I != E; ++I) {
				MaxMOV = std::max(MaxMOV, SSD->getMOV(*I));
				MaxDepth = std::max(MaxDepth, SSD->getDepth(*I));
				}
				}

				void clear() {
				Nodes.clear();
				RecMII = 0;
				HasRecurrence = false;
				MaxMOV = 0;
				MaxDepth = 0;
				Colocate = 0;
				ExceedPressure = nullptr;
				}

				operator SetVector<SUnit *> &() { return Nodes; }

				/// Sort the node sets by importance. First, rank them by recurrence MII,
				/// then by mobility (least mobile done first), and finally by depth.
				/// Each node set may contain a colocate value which is used as the first
				/// tie breaker, if it's set.
				bool operator>(const NodeSet &RHS) const {
				if (RecMII == RHS.RecMII) {
				if (Colocate != 0 && RHS.Colocate != 0 && Colocate != RHS.Colocate)
				return Colocate < RHS.Colocate;
				if (MaxMOV == RHS.MaxMOV)
				return MaxDepth > RHS.MaxDepth;
				return MaxMOV < RHS.MaxMOV;
				}
				return RecMII > RHS.RecMII;
				}
				bool operator==(const NodeSet &RHS) const {
				return RecMII == RHS.RecMII && MaxMOV == RHS.MaxMOV &&
				MaxDepth == RHS.MaxDepth;
				}
				bool operator!=(const NodeSet &RHS) const { return !operator==(RHS); }

				iterator begin() { return Nodes.begin(); }
				const_iterator begin() const { return Nodes.begin(); }
				iterator end() { return Nodes.end(); }
				const_iterator end() const { return Nodes.end(); }

				void print(raw_ostream &os) const {
				os << "Num nodes " << size() << " rec " << RecMII << " mov " << MaxMOV
				<< " depth " << MaxDepth << " col " << Colocate << "\n";
				for (iterator I = begin(), E = end(); I != E; ++I)
				os << " SU(" << (*I)->NodeNum << ") " << *((*I)->getInstr());
				os << "\n";
				}

				void dump() const { print(dbgs()); }
				};
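
Editorial sketch: the ranking described in the `operator>` comment above can be tried in isolation. The struct below is a hypothetical stand-in that keeps only the four fields the comparison reads; `RankedSet` and `rankSets` are illustrative names, not part of the patch, and sorting in descending order with `std::greater` is an assumption about how the operator is used.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for NodeSet with only the ranking fields.
struct RankedSet {
  unsigned RecMII;   // recurrence MII of the set
  int MaxMOV;        // largest mobility of any node in the set
  int MaxDepth;      // largest depth of any node in the set
  unsigned Colocate; // 0 means "no colocate value"

  // Same comparison as the NodeSet::operator> shown in the diff.
  bool operator>(const RankedSet &RHS) const {
    if (RecMII == RHS.RecMII) {
      if (Colocate != 0 && RHS.Colocate != 0 && Colocate != RHS.Colocate)
        return Colocate < RHS.Colocate;
      if (MaxMOV == RHS.MaxMOV)
        return MaxDepth > RHS.MaxDepth;
      return MaxMOV < RHS.MaxMOV;
    }
    return RecMII > RHS.RecMII;
  }
};

// Return the sets ordered from highest to lowest priority.
inline std::vector<RankedSet> rankSets(std::vector<RankedSet> Sets) {
  assert(!Sets.empty() && "nothing to rank");
  std::stable_sort(Sets.begin(), Sets.end(), std::greater<RankedSet>());
  return Sets;
}
```

With no colocate values set, a higher recurrence MII comes first, then lower mobility, then greater depth.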

				private:
				typedef SmallVector<NodeSet, 8> NodeSetType;
				typedef DenseMap<unsigned, unsigned> ValueMapTy;
				typedef SmallVectorImpl<MachineBasicBlock *> MBBVectorTy;
				typedef DenseMap<MachineInstr *, MachineInstr *> InstrMapTy;

				/// Instructions to change when emitting the final schedule.
				DenseMap<SUnit *, std::pair<unsigned, int64_t>> InstrChanges;

				/// We may create a new instruction, so remember it because it
				/// must be deleted when the pass is finished.
				SmallPtrSet<MachineInstr *, 4> NewMIs;

				const RegisterClassInfo &RegClassInfo;

				/// Helper class to implement Johnson's circuit finding algorithm.
				class Circuits {
				std::vector<SUnit> &SUnits;
				SmallVector<SUnit *, 10> Stack;
				BitVector Blocked;
				MatzeB: Why not move all those to the top of the class declaration to the other analysis info?
				SmallVector<SmallPtrSet<SUnit *, 4>, 10> B;
				SmallVector<SmallVector<int, 4>, 16> AdjK;
				unsigned NumPaths;
				static unsigned MaxPaths;

				public:
				Circuits(std::vector<SUnit> &SUs)
				: SUnits(SUs), Stack(), Blocked(SUs.size()), B(SUs.size()),
				AdjK(SUs.size()) {}
				/// Reset the data structures used in the circuit algorithm.
				void reset() {
				Stack.clear();
				Blocked.reset();
				B.assign(SUnits.size(), SmallPtrSet<SUnit *, 4>());
				NumPaths = 0;
				}
				/// Create the Adjacency structure of the nodes in the graph.
				void createAdjacencyStructure(SwingSchedulerDAG *DAG);

				/// The circuit function from Johnson's algorithm finds a circuit starting
				/// at the specified node.
				bool circuit(int V, int S, NodeSetType &NodeSets, bool HasBackedge = false);

				/// Unblock a node in the circuit finding algorithm.
				void unblock(int U);
				};

				public:
				SwingSchedulerDAG(MachineSMS *P, MachineLoop *L, LiveIntervals *lis,
				const RegisterClassInfo &rci)
				: ScheduleDAGInstrs(*P->MF, P->MLI, false), Pass(P), MII(0),
				Scheduled(false), Loop(L), LIS(lis), Topo(SUnits, &ExitSU),
				RegClassInfo(rci) {}

				// We need to implement this pure virtual function to do the scheduling.
				void schedule();

				/// Clean up after the software pipeliner runs.
				void finishBlock();

				// Return true if the loop kernel has been scheduled.
				bool hasNewSchedule() { return Scheduled; }

				// Return the earliest time an instruction may be scheduled.
				MatzeB: Many function comments are repeated here and at the place where the function is actually implemented. You should only mention the comment in one of the two places or risk that the two will get out of sync in time. (Which of the two you choose is a matter of personal preference and shouldn't matter.)
				int getASAP(SUnit *Node) { return ScheduleInfo[Node->NodeNum].ASAP; }

				// Return the latest time an instruction may be scheduled.
				int getALAP(SUnit *Node) { return ScheduleInfo[Node->NodeNum].ALAP; }

				// The mobility function, which is the number of slots in which
				// an instruction may be scheduled.
				int getMOV(SUnit *Node) { return getALAP(Node) - getASAP(Node); }

				// The depth, in the dependence graph, for a node.
				int getDepth(SUnit *Node) { return Node->getDepth(); }

				// The height, in the dependence graph, for a node.
				int getHeight(SUnit *Node) { return Node->getHeight(); }
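
Editorial sketch: the `getASAP`, `getALAP`, and `getMOV` accessors above read values that are computed elsewhere in the patch. As a rough illustration of what those values mean, the code below computes them for a small acyclic graph. It assumes nodes are numbered in topological order and ignores loop-carried edges, which the real pass has to account for; `Edge`, `NodeTimes`, `computeNodeTimes`, and `mobility` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A dependence edge From -> To with a latency in cycles.
struct Edge {
  int From;
  int To;
  int Latency;
};

struct NodeTimes {
  std::vector<int> ASAP; // earliest cycle each node may start
  std::vector<int> ALAP; // latest cycle each node may start
};

// Assumes From < To for every edge (topological numbering).
inline NodeTimes computeNodeTimes(int NumNodes, const std::vector<Edge> &Edges) {
  NodeTimes T;
  T.ASAP.assign(NumNodes, 0);
  // Forward pass: a node starts after all predecessors have finished.
  for (int V = 0; V < NumNodes; ++V)
    for (const Edge &E : Edges)
      if (E.To == V) {
        assert(E.From < V && "nodes must be topologically numbered");
        T.ASAP[V] = std::max(T.ASAP[V], T.ASAP[E.From] + E.Latency);
      }
  // Backward pass: start from the largest ASAP value and pull nodes earlier
  // only as far as their successors require.
  int Last = NumNodes == 0 ? 0 : *std::max_element(T.ASAP.begin(), T.ASAP.end());
  T.ALAP.assign(NumNodes, Last);
  for (int V = NumNodes - 1; V >= 0; --V)
    for (const Edge &E : Edges)
      if (E.From == V)
        T.ALAP[V] = std::min(T.ALAP[V], T.ALAP[E.To] - E.Latency);
  return T;
}

// Mobility is the slack between the latest and earliest start.
inline int mobility(const NodeTimes &T, int V) { return T.ALAP[V] - T.ASAP[V]; }
```

Nodes on the longest path have mobility 0; nodes off that path have positive mobility and more freedom in where they are placed.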

				/// Return true if the dependence is a back-edge in the data dependence graph.
				/// Since the DAG doesn't contain cycles, we represent a cycle in the graph
				/// using an anti dependence from a Phi to an instruction.
				bool isBackedge(SUnit *Source, const SDep &Dep) {
				if (Dep.getKind() != SDep::Anti)
				return false;
				return Source->getInstr()->isPHI() || Dep.getSUnit()->getInstr()->isPHI();
				}

				/// Return true if the dependence is an order dependence between non-Phis.
				static bool isOrder(SUnit *Source, const SDep &Dep) {
				if (Dep.getKind() != SDep::Order)
				return false;
				return (!Source->getInstr()->isPHI() &&
				!Dep.getSUnit()->getInstr()->isPHI());
				}

				/// Return true for an order dependence that is potentially loop carried.
				/// An order dependence is loop carried if the destination defines a value
				/// that may be used by the source in a subsequent iteration.
				bool isLoopCarriedOrder(SUnit *Source, const SDep &Dep, bool isSucc = true);

				/// The latency of the dependence.
				unsigned getLatency(SUnit *Source, const SDep &Dep) {
				// Anti dependences represent recurrences, so use the latency of the
				// instruction on the back-edge.
				if (Dep.getKind() == SDep::Anti) {
				if (Source->getInstr()->isPHI())
				return Dep.getSUnit()->Latency;
				if (Dep.getSUnit()->getInstr()->isPHI())
				return Source->Latency;
				return Dep.getLatency();
				}
				return Dep.getLatency();
				}

				// The distance function, which indicates that operation V of iteration I
				// depends on operation U of iteration I-distance.
				unsigned getDistance(SUnit *U, SUnit *V, const SDep &Dep) {
				// Instructions that feed a Phi have a distance of 1. Computing larger
				// values for arrays requires data dependence information.
				if (V->getInstr()->isPHI() && Dep.getKind() == SDep::Anti)
				return 1;
				return 0;
				}

				// Set the Minimum Initiation Interval for this schedule attempt.
				void setMII(unsigned mii) { MII = mii; }

				/// Apply changes to the instruction, which are needed to improve the
				/// final schedule.
				MachineInstr *applyInstrChange(MachineInstr *MI, SMSchedule &Schedule,
				bool UpdateDAG = false);

				/// Return the new base register that was stored away for the changed
				/// instruction.
				unsigned getInstrBaseReg(SUnit *SU) {
				DenseMap<SUnit *, std::pair<unsigned, int64_t>>::iterator It =
				InstrChanges.find(SU);
				if (It != InstrChanges.end())
				return It->second.first;
				return 0;
				}

				private:
				/// Add a chain edge between a load and store if the store can be an
				/// alias of the load on a subsequent iteration.
				void addLoopCarriedDependences(AliasAnalysis *AA);

				/// Update the phi dependences to the DAG because ScheduleDAGInstrs no longer
				/// processes dependences for PHIs. We also remove unneeded chain
				/// dependences between Phis.
				void updatePhiDependences();

				/// Try to transform the instructions to improve the generated schedule.
				void changeDependences();

				/// Calculate the resource constrained minimum initiation interval.
				unsigned calculateResMII();

				/// Calculate the recurrence-constrained minimum initiation interval.
				/// A recurrence occurs if an operation in one iteration of the loop
				/// has a dependence upon the same operation from a prior iteration.
				unsigned calculateRecMII(NodeSetType &RecNodeSets);
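
Editorial sketch: in the usual modulo-scheduling formulation, a recurrence whose instructions take `Delay` cycles in total and whose dependence spans `Distance` iterations cannot start a new iteration more often than every ceil(Delay / Distance) cycles, and the minimum initiation interval is the larger of the resource and recurrence bounds. The helpers below show that arithmetic only; the names are hypothetical and the patch's `calculateResMII` and `calculateRecMII` may differ in detail.

```cpp
#include <algorithm>
#include <cassert>

// Recurrence bound for one circuit: ceil(Delay / Distance) in integers.
inline unsigned recMIIForCircuit(unsigned Delay, unsigned Distance) {
  assert(Distance > 0 && "a recurrence must span at least one iteration");
  return (Delay + Distance - 1) / Distance;
}

// The schedule can be no tighter than either bound allows.
inline unsigned minimumII(unsigned ResMII, unsigned RecMII) {
  return std::max(ResMII, RecMII);
}
```

For example, a circuit with a total delay of 5 cycles carried across 2 iterations gives a recurrence bound of 3.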

				/// Find all the elementary circuits in the dependence graph using Johnson's
				/// circuit algorithm.
				void findCircuits(NodeSetType &NodeSets);

				/// Merge the recurrence node sets that have the same initial node.
				void fuseRecs(NodeSetType &NodeSets);

				/// Remove nodes that have been scheduled in previous NodeSets.
				void removeDuplicateNodes(NodeSetType &NodeSets);

				// Several functions are needed by the algorithm for each node in the graph.
				// This function computes the necessary information.
				void computeNodeFunctions(NodeSetType &NodeSets);

				/// A heuristic to filter nodes in recurrent node-sets if the register
				/// pressure of a set is too high.
				void registerPressureFilter(NodeSetType &NodeSets);

				/// A heuristic to colocate node sets that have the same set of
				/// successors.
				void colocateNodeSets(NodeSetType &NodeSets);

				/// Check if the existing node-sets are profitable. If not, then ignore the
				/// recurrent node-sets, and attempt to schedule all nodes together.
				void checkNodeSets(NodeSetType &NodeSets);

				// Group the remaining nodes into sets based upon connected components.
				void groupRemainingNodes(NodeSetType &NodeSets);

				// Add the node to the set, and add all of its connected nodes to the set.
				void addConnectedNodes(SUnit *SU, NodeSet &NewSet,
				SetVector<SUnit *> &NodesAdded);

				MatzeB: Use doxygen comment (same for some following functions).
				// Generate an ordered list containing each node in the graph,
				// which is used as the order for scheduling the instructions.
				void computeNodeOrder(NodeSetType &NodeSets);

				// Generate the pipelined schedule for the instructions, if possible.
				// Return true if a schedule has been found.
				bool schedulePipeline(SMSchedule &Schedule);

				// Rearrange the instructions in the loop according to the schedule.
				void generatePipelinedLoop(SMSchedule &Schedule);

				// Generate the prolog code for the pipeline.
				void generateProlog(SMSchedule &Schedule, unsigned LastStage,
				MachineBasicBlock *KernelBB, ValueMapTy *VRMap,
				MBBVectorTy &PrologBBs);

				// Generate the epilog code for the software pipelined loop.
				void generateEpilog(SMSchedule &Schedule, unsigned LastStage,
				MachineBasicBlock *KernelBB, ValueMapTy *VRMap,
				MBBVectorTy &EpilogBBs, MBBVectorTy &PrologBBs);

				/// Generate Phis in the pipelined loop for Phis that existed in the
				/// original loop.
				void generateExistingPhis(MachineBasicBlock *NewBB, MachineBasicBlock *BB1,
				MachineBasicBlock *BB2, MachineBasicBlock *KernelBB,
				SMSchedule &Schedule, ValueMapTy *VRMap,
				InstrMapTy &InstrMap, unsigned LastStageNum,
				unsigned CurStageNum, bool IsLast);

				/// Generate Phis for the specific block in the generated pipelined code.
				void generatePhis(MachineBasicBlock *NewBB, MachineBasicBlock *BB1,
				MachineBasicBlock *BB2, MachineBasicBlock *KernelBB,
				SMSchedule &Schedule, ValueMapTy *VRMap,
				InstrMapTy &InstrMap, unsigned LastStageNum,
				unsigned CurStageNum, bool IsLast);

				// Remove instructions, in the epilog, that generate values with no uses.
				// Typically, these are induction variable operations that generate values
				// used in the loop itself.
				void removeDeadInstructions(MachineBasicBlock *KernelBB,
				MBBVectorTy &EpilogBBs);

				/// For loop carried definitions, we split the lifetime of a virtual register
				/// that has uses past the definition in the next iteration. A copy with a new
				/// virtual register is inserted before the definition, which helps with
				/// generating a better register assignment.
				void splitLifetimes(MachineBasicBlock *KernelBB, MBBVectorTy &EpilogBBs,
				SMSchedule &Schedule);

				// Add branches and Phis between the prolog and epilog blocks.
				void addBranches(MBBVectorTy &PrologBBs, MachineBasicBlock *KernelBB,
				MBBVectorTy &EpilogBBs, SMSchedule &Schedule,
				ValueMapTy *VRMap);

				/// Return true if we can compute the amount the instruction changes
				/// during each iteration. Set Delta to the amount of the change.
				bool computeDelta(MachineInstr *MI, unsigned &Delta);

				/// Update the memory operand with a new offset when the pipeliner
				/// generates a new copy of the instruction that refers to a
				/// different memory location.
				void updateMemOperands(MachineInstr *NewMI, MachineInstr *OldMI,
				unsigned Num);

				/// Clone the instruction for the new pipelined loop and update the
				/// memory operands, if needed.
				MachineInstr *cloneInstr(MachineInstr *OldMI, unsigned CurStageNum,
				unsigned InstStageNum);

				/// Clone the instruction for the new pipelined loop. If needed, this
				/// function updates the instruction using the values saved in the
				/// InstrChanges structure.
				MachineInstr *cloneAndChangeInstr(MachineInstr *OldMI, unsigned CurStageNum,
				unsigned InstStageNum,
				SMSchedule &Schedule);

				// Update the machine instruction with new virtual registers. This
				// function may change the definitions and/or uses.
				void updateInstruction(MachineInstr *NewMI, bool LastDef,
				unsigned CurStageNum, unsigned InstStageNum,
				SMSchedule &Schedule, ValueMapTy *VRMap);

				// Return the instruction in the loop that defines the register.
				// If the definition is a Phi, then follow the Phi operand to
				// the instruction in the loop.
				MachineInstr *findDefInLoop(unsigned Reg);

				/// Return the new name for the value from the previous stage.
				unsigned getPrevMapVal(unsigned StageNum, unsigned PhiStage, unsigned LoopVal,
				ValueMapTy *VRMap, MachineBasicBlock *BB);

				/// Rewrite the Phi values in the specified block to use the mappings
				/// from the initial operand.
				void rewritePhiValues(MachineBasicBlock *NewBB, unsigned StageNum,
				SMSchedule &Schedule, ValueMapTy *VRMap,
				InstrMapTy &InstrMap);

				/// Rewrite a previously scheduled instruction to use a new register value,
				/// which is created by a Phi typically. This version rewrites references
				/// that occur at a specific stage only.
				void rewriteScheduledInstr(MachineBasicBlock *BB, SMSchedule &Schedule,
				InstrMapTy &InstrMap, unsigned CurStageNum,
				unsigned PhiNum, MachineInstr *Phi,
				unsigned OldReg, unsigned NewReg,
				unsigned PrevReg = 0);

				/// Check if we can change the instruction to use an offset value from the
				/// previous iteration. If so, return true and set the base and offset values
				/// so that we can rewrite the load, if necessary.
				bool canUseLastOffsetValue(MachineInstr *MI, unsigned &BasePos,
				unsigned &OffsetPos, unsigned &NewBase,
				int64_t &NewOffset);
				};

				// This class represents the scheduled code. The main data structure is a
				// map from scheduled cycle to instructions. During scheduling, the
				// data structure explicitly represents all stages/iterations. When
				// the algorithm finishes, the schedule is collapsed into a single stage,
				// which represents instructions from different loop iterations.
				//
				// The SMS algorithm allows negative values for cycles, so the first cycle
				// in the schedule is the smallest cycle value.
				class SMSchedule {
				private:
				// Map from execution cycle to instructions.
				DenseMap<int, std::deque<SUnit *>> ScheduledInstrs;

				// Map from instruction to execution cycle.
				std::map<SUnit *, int> InstrToCycle;

				// Map for each register and the max difference between its uses and def.
				// The first element in the pair is the max difference in stages. The
				// second is true if the register defines a Phi value and loop value is
				// scheduled before the Phi.
				std::map<unsigned, std::pair<unsigned, bool>> RegToStageDiff;

				// Keep track of the first cycle value in the schedule. It starts
				// as zero, but the algorithm allows negative values.
				int FirstCycle;

				// Keep track of the last cycle value in the schedule.
				int LastCycle;

				// The initiation interval (II) for the schedule.
				int InitiationInterval;

				// Target machine information.
				const TargetSubtargetInfo &ST;

				// Virtual register information.
				MachineRegisterInfo &MRI;

				DFAPacketizer *Resources;

				public:
				SMSchedule(MachineFunction *mf)
				: ST(mf->getSubtarget()), MRI(mf->getRegInfo()),
				Resources(ST.getInstrInfo()->CreateTargetScheduleState(ST)) {
				FirstCycle = 0;
				LastCycle = 0;
				InitiationInterval = 0;
				}

				~SMSchedule() {
				ScheduledInstrs.clear();
				InstrToCycle.clear();
				RegToStageDiff.clear();
				delete Resources;
				}

				void reset() {
				ScheduledInstrs.clear();
				InstrToCycle.clear();
				RegToStageDiff.clear();
				FirstCycle = 0;
				LastCycle = 0;
				InitiationInterval = 0;
				}

				// Set the initiation interval for this schedule.
				void setInitiationInterval(int ii) { InitiationInterval = ii; }

				// Return the first cycle in the completed schedule. This
				// can be a negative value.
				int getFirstCycle() const { return FirstCycle; }

				// Return the last cycle in the finalized schedule.
				int getFinalCycle() const { return FirstCycle + InitiationInterval - 1; }

				// Return the cycle of the earliest scheduled instruction in the dependence
				// chain.
				int earliestCycleInChain(const SDep &Dep);

				// Return the cycle of the latest scheduled instruction in the dependence
				// chain.
				int latestCycleInChain(const SDep &Dep);

				// Compute the scheduling start slot for the instruction. The start slot
				// depends on any predecessor or successor nodes scheduled already.
				void computeStart(SUnit *SU, int *MaxEarlyStart, int *MinLateStart,
				int *MinEnd, int *MaxStart, int II, SwingSchedulerDAG *DAG);

				// Try to schedule the node at the specified StartCycle and continue
				// until the node is scheduled or the EndCycle is reached. This function
				// returns true if the node is scheduled. This routine may search either
				// forward or backward for a place to insert the instruction based upon
				// the relative values of StartCycle and EndCycle.
				bool insert(SUnit *SU, int StartCycle, int EndCycle, int II);

				// Iterators for the cycle to instruction map.
				typedef DenseMap<int, std::deque<SUnit *>>::iterator sched_iterator;
				typedef DenseMap<int, std::deque<SUnit *>>::const_iterator
				const_sched_iterator;

				// Return true if the instruction is scheduled at the specified stage.
				bool isScheduledAtStage(SUnit *SU, unsigned StageNum) {
				return (stageScheduled(SU) == (int)StageNum);
				}

				// Return the stage for a scheduled instruction. Return -1 if
				// the instruction has not been scheduled.
				int stageScheduled(SUnit *SU) const {
				std::map<SUnit *, int>::const_iterator it = InstrToCycle.find(SU);
				if (it == InstrToCycle.end())
				return -1;
				return (it->second - FirstCycle) / InitiationInterval;
				}

				/// Return the cycle for a scheduled instruction. This function normalizes
				/// the first cycle to be 0.
				unsigned cycleScheduled(SUnit *SU) const {
				std::map<SUnit *, int>::const_iterator it = InstrToCycle.find(SU);
				assert(it != InstrToCycle.end() && "Instruction hasn't been scheduled.");
				return (it->second - FirstCycle) % InitiationInterval;
				}

				// Return the maximum stage count needed for this schedule.
				unsigned getMaxStageCount() {
				return (LastCycle - FirstCycle) / InitiationInterval;
				}

				/// Return the max. number of stages/iterations that can occur between a
				/// register definition and its uses.
				unsigned getStagesForReg(int Reg, unsigned CurStage) {
				std::pair<unsigned, bool> Stages = RegToStageDiff[Reg];
				if (CurStage > getMaxStageCount() && Stages.first == 0 && Stages.second)
				return 1;
				return Stages.first;
				}

				/// The number of stages for a Phi is a little different than other
				/// instructions. The minimum value computed in RegToStageDiff is 1
				/// because we assume the Phi is needed for at least 1 iteration.
				/// This is not the case if the loop value is scheduled prior to the
				/// Phi in the same stage. This function returns the number of stages
				/// or iterations needed between the Phi definition and any uses.
				unsigned getStagesForPhi(int Reg) {
				std::pair<unsigned, bool> Stages = RegToStageDiff[Reg];
				if (Stages.second)
				return Stages.first;
				return Stages.first - 1;
				}

				/// Return the instructions that are scheduled at the specified cycle.
				std::deque<SUnit *> &getInstructions(int cycle) {
				return ScheduledInstrs[cycle];
				}

				// After the schedule has been formed, call this function to combine
				// the instructions from the different stages/cycles. That is, this
				// function creates a schedule that represents a single iteration.
				void finalizeSchedule(SwingSchedulerDAG *SSD);

				/// Order the instructions within a cycle so that the definitions occur
				/// before the uses.
				bool orderDependence(SwingSchedulerDAG *SSD, SUnit *SU,
				std::deque<SUnit *> &Insts);

				/// Return true if the scheduled Phi has a loop carried operand.
				bool isLoopCarried(SwingSchedulerDAG *SSD, MachineInstr *Phi);

				/// Return true if the instruction is a definition that is loop carried
				/// and defines the use on the next iteration.
				bool isLoopCarriedDefOfUse(SwingSchedulerDAG *SSD, MachineInstr *Inst,
				MachineOperand &MO);

				/// Print the schedule information to the given output.
				void print(raw_ostream &os) const;

				/// Dump the schedule information to stderr.
				void dump() const;
				};

				} // end anonymous namespace
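The stage and cycle arithmetic described in the SMSchedule comments (a first cycle that may be negative, a stage number, and a cycle normalized to start at 0) can be tried out in isolation. The following is a minimal sketch and not code from the patch: the struct `ToyCycles` and its member names are hypothetical, and it assumes every queried cycle is at or after the first cycle.

```cpp
#include <cassert>

// Hypothetical stand-in for the cycle bookkeeping in SMSchedule.
struct ToyCycles {
  int FirstCycle;         // smallest scheduled cycle; may be negative
  int InitiationInterval; // II; assumed to be greater than zero

  // Stage of an instruction placed at the absolute cycle `Cycle`.
  int stage(int Cycle) const {
    assert(InitiationInterval > 0 && Cycle >= FirstCycle);
    return (Cycle - FirstCycle) / InitiationInterval;
  }

  // Cycle inside the kernel, normalized so that the first cycle is 0.
  int kernelCycle(int Cycle) const {
    assert(InitiationInterval > 0 && Cycle >= FirstCycle);
    return (Cycle - FirstCycle) % InitiationInterval;
  }
};
```

With a first cycle of -2 and an II of 3, an instruction at absolute cycle 4 lands in stage 2 at kernel cycle 0, which is why the offset by the first cycle has to be applied before dividing.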

				unsigned SwingSchedulerDAG::Circuits::MaxPaths = 5;
				char MachineSMS::ID = 0;
				#ifndef NDEBUG
				int MachineSMS::NumTries = 0;
				#endif
				char &llvm::MachineSMSID = MachineSMS::ID;
				INITIALIZE_PASS_BEGIN(MachineSMS, "softwarepipe", "Modulo Software Pipelining",
				false, false)
				INITIALIZE_PASS_DEPENDENCY(AAResultsWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(MachineLoopInfo)
				INITIALIZE_PASS_DEPENDENCY(MachineDominatorTree)
				INITIALIZE_PASS_DEPENDENCY(LiveIntervals)
				INITIALIZE_PASS_END(MachineSMS, "softwarepipe", "Modulo Software Pipelining",
				false, false)

				// The "main" function for implementing Swing Modulo Scheduling.
				bool MachineSMS::runOnMachineFunction(MachineFunction &mf) {

				if (mf.getFunction()->getAttributes().hasAttribute(
				AttributeSet::FunctionIndex, Attribute::OptimizeForSize) &&
				!EnableSWPOptSize.getPosition())
				return false;

				MF = &mf;
				MatzeB: I believe this pass is not required for correctness and should therefore have a `if(skipFunction(..)) return false;` sequence first.
				MLI = &getAnalysis<MachineLoopInfo>();
				MDT = &getAnalysis<MachineDominatorTree>();
				TII = MF->getSubtarget().getInstrInfo();
				RegClassInfo.runOnMachineFunction(*MF);

				for (MachineLoopInfo::iterator I = MLI->begin(), E = MLI->end(); I != E;
				++I) {
				MachineLoop *L = *I;
				scheduleLoop(L);
				}

				return false;
				}
				MatzeB: Use range based for. Similar in many following loops.
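To illustrate the range-based form suggested in the comment above, here is a minimal, self-contained sketch. It does not use the LLVM classes: `ToyLoop` is a hypothetical stand-in for a loop that exposes `begin()`/`end()` over its sub-loops, and `visitInnermostFirst` only mirrors the inner-loops-first recursion of `scheduleLoop`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a loop with nested sub-loops.
struct ToyLoop {
  std::string Name;
  std::vector<ToyLoop *> SubLoops;

  // Exposing begin()/end() is what makes range-based for work on a loop.
  std::vector<ToyLoop *>::const_iterator begin() const {
    return SubLoops.begin();
  }
  std::vector<ToyLoop *>::const_iterator end() const { return SubLoops.end(); }
};

// Visit every inner loop before the loop that contains it.
void visitInnermostFirst(const ToyLoop *L, std::vector<std::string> &Order) {
  assert(L && "expected a loop");
  for (const ToyLoop *Inner : *L) // replaces the explicit iterator pair
    visitInnermostFirst(Inner, Order);
  Order.push_back(L->Name);
}
```

The loop body is unchanged from the iterator version; only the loop header is shortened, so the conversion is mechanical wherever the container provides `begin()` and `end()`.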

				// Attempt to perform the SMS algorithm on the specified loop. This function
				// is the main entry point for the algorithm. The function identifies candidate
				// loops, calculates the minimum initiation interval, and attempts to
				// schedule the loop.
				bool MachineSMS::scheduleLoop(MachineLoop *L) {
				bool Changed = false;
				for (MachineLoop::iterator I = L->begin(), E = L->end(); I != E; ++I) {
				Changed |= scheduleLoop(*I);
				}

				#ifndef NDEBUG
				// Stop trying after reaching the limit (if any).
				int Limit = SwpLoopLimit;
				if (Limit >= 0) {
				if (NumTries >= SwpLoopLimit)
				return Changed;
				NumTries++;
				}
				#endif

				if (!canPipelineLoop(L))
				return Changed;

				++NumTrytoPipeline;

				Changed = swingModuloScheduler(L);

				return Changed;
				}

				// Return true if the loop can be software pipelined. The algorithm is
				// restricted to loops with a single basic block. Make sure that the
				// branch in the loop can be analyzed.
				bool MachineSMS::canPipelineLoop(MachineLoop *L) {
				if (L->getNumBlocks() != 1)
				return false;

				// Check if the branch can't be understood because we can't do pipelining
				// if that's the case.
				LI.TBB = nullptr;
				LI.FBB = nullptr;
				LI.BrCond.clear();
				if (TII->AnalyzeBranch(*L->getHeader(), LI.TBB, LI.FBB, LI.BrCond))
				return false;

				LI.LoopInductionVar = nullptr;
				LI.LoopCompare = nullptr;
				if (TII->AnalyzeLoop(L, LI.LoopInductionVar, LI.LoopCompare))
				return false;

				if (!L->getLoopPreheader())
				marksl: I've seen CodeGenPrepare delete the preheader. Specifically, if the previous block is a loop and this loop immediately follows it. I don't have a test case, but it's basically two sequential for loops. I wonder if that has any impact here?
				bcahoon (author): Yes, it would have an impact for loops without a preheader. On Hexagon, when we create a hardware loop, the preheader is added if it's not there already. Then, the pipeliner pass only sees loops with a preheader. It would be easy enough to add a preheader if one doesn't exist already.
				return false;

				// If any of the Phis contain subregs, then we can't pipeline
				// because we don't know how to maintain subreg information in the
				// VMap structure.
				MachineBasicBlock *MBB = L->getHeader();
				for (MachineBasicBlock::iterator BBI = MBB->instr_begin(),
				BBE = MBB->getFirstNonPHI();
				BBI != BBE; ++BBI)
				for (unsigned i = 1; i != BBI->getNumOperands(); i += 2)
				if (BBI->getOperand(i).getSubReg() != 0)
				return false;

				return true;
				}

				// The SMS algorithm consists of the following main steps:
				// 1. Computation and analysis of the dependence graph.
				// 2. Ordering of the nodes (instructions).
				// 3. Attempt to Schedule the loop.
				//
				bool MachineSMS::swingModuloScheduler(MachineLoop *L) {
				assert(L->getBlocks().size() == 1 && "SMS works on single blocks only.");

				SwingSchedulerDAG SMS(this, L, &getAnalysis<LiveIntervals>(), RegClassInfo);

				MachineBasicBlock *MBB = L->getHeader();
				// The kernel should not include any terminator instructions. These
				// will be added back later.
				SMS.startBlock(MBB);

				// Compute the number of 'real' instructions in the basic block by
				// ignoring terminators.
				unsigned size = MBB->size();
				for (MachineBasicBlock::iterator I = MBB->getFirstTerminator(),
				E = MBB->instr_end();
				I != E; ++I, --size)
				;

				SMS.enterRegion(MBB, MBB->begin(), MBB->getFirstTerminator(), size);
				SMS.schedule();
				SMS.exitRegion();

				SMS.finishBlock();
				return SMS.hasNewSchedule();
				}

				// We override the schedule function in ScheduleDAGInstrs to implement the
				// scheduling part of the Swing Modulo Scheduling algorithm.
				void SwingSchedulerDAG::schedule() {
				AliasAnalysis *AA = &Pass->getAnalysis<AAResultsWrapperPass>().getAAResults();
				buildSchedGraph(AA);
				addLoopCarriedDependences(AA);
				updatePhiDependences();
				Topo.InitDAGTopologicalSorting();
				changeDependences();
				DEBUG({
				for (unsigned su = 0, e = SUnits.size(); su != e; ++su)
				SUnits[su].dumpAll(this);
				});

				NodeSetType NodeSets;
				findCircuits(NodeSets);

				// Calculate the MII.
				unsigned ResMII = calculateResMII();
				unsigned RecMII = calculateRecMII(NodeSets);

				fuseRecs(NodeSets);

				// This flag is used for testing and can cause correctness problems.
				if (SwpIgnoreRecMII)
				RecMII = 0;

				MII = std::max(ResMII, RecMII);
				DEBUG(dbgs() << "MII = " << MII << " (rec=" << RecMII << ", res=" << ResMII
				<< ")\n");

				// Can't schedule a loop without a valid MII.
				if (MII == 0)
				return;

				// Don't pipeline large loops.
				if (SwpMaxMii != -1 && (int)MII > SwpMaxMii)
				return;

				computeNodeFunctions(NodeSets);

				registerPressureFilter(NodeSets);

				colocateNodeSets(NodeSets);

				checkNodeSets(NodeSets);

				DEBUG({
				for (auto &I : NodeSets) {
				dbgs() << " Rec NodeSet ";
				I.dump();
				}
				});

				std::sort(NodeSets.begin(), NodeSets.end(), std::greater<NodeSet>());

				groupRemainingNodes(NodeSets);

				removeDuplicateNodes(NodeSets);

				DEBUG({
				for (auto &I : NodeSets) {
				dbgs() << " NodeSet ";
				I.dump();
				}
				});

				computeNodeOrder(NodeSets);

				SMSchedule Schedule(Pass->MF);
				Scheduled = schedulePipeline(Schedule);

				if (!Scheduled)
				return;

				unsigned numStages = Schedule.getMaxStageCount();
				// No need to generate pipeline if there are no overlapped iterations.
				if (numStages == 0)
				return;

				// Check that the maximum stage count is less than user-defined limit.
				if (SwpMaxStages > -1 && (int)numStages > SwpMaxStages)
				return;

				generatePipelinedLoop(Schedule);
				++NumPipelined;
				}

				/// Clean up after the software pipeliner runs.
				void SwingSchedulerDAG::finishBlock() {
				for (SmallPtrSet<MachineInstr *, 4>::iterator I = NewMIs.begin(),
				E = NewMIs.end();
				I != E; ++I)
				MF.DeleteMachineInstr(*I);
				NewMIs.clear();

				// Call the superclass.
				ScheduleDAGInstrs::finishBlock();
				}

				/// Return the register values for the operands of a Phi instruction.
				/// This function assumes the instruction is a Phi.
				static void getPhiRegs(MachineInstr *Phi, MachineBasicBlock *Loop,
				unsigned &InitVal, unsigned &LoopVal) {
				assert(Phi->isPHI() && "Expecting a Phi.");

				InitVal = 0;
				LoopVal = 0;
				for (unsigned i = 1, e = Phi->getNumOperands(); i != e; i += 2)
				if (Phi->getOperand(i + 1).getMBB() != Loop)
				InitVal = Phi->getOperand(i).getReg();
				else if (Phi->getOperand(i + 1).getMBB() == Loop)
				LoopVal = Phi->getOperand(i).getReg();

				assert(InitVal != 0 && LoopVal != 0 && "Unexpected Phi structure.");
				}

				/// Return the Phi register value that comes from the incoming block.
				static unsigned getInitPhiReg(MachineInstr *Phi, MachineBasicBlock *LoopBB) {
				for (unsigned i = 1, e = Phi->getNumOperands(); i != e; i += 2)
				if (Phi->getOperand(i + 1).getMBB() != LoopBB)
				return Phi->getOperand(i).getReg();
				return 0;
				}

				/// Return the Phi register value that comes from the loop block.
				static unsigned getLoopPhiReg(MachineInstr *Phi, MachineBasicBlock *LoopBB) {
				for (unsigned i = 1, e = Phi->getNumOperands(); i != e; i += 2)
				if (Phi->getOperand(i + 1).getMBB() == LoopBB)
				return Phi->getOperand(i).getReg();
				return 0;
				}

				/// Return true if SUb can be reached from SUa following the chain edges.
				static bool isSuccOrder(SUnit *SUa, SUnit *SUb) {
				SmallPtrSet<SUnit *, 8> Visited;
				SmallVector<SUnit *, 8> Worklist;
				Worklist.push_back(SUa);
				while (!Worklist.empty()) {
				const SUnit *SU = Worklist.pop_back_val();
				for (auto &SI : SU->Succs) {
				SUnit *SuccSU = SI.getSUnit();
				if (SI.getKind() == SDep::Order) {
				if (Visited.count(SuccSU))
				continue;
				if (SuccSU == SUb)
				return true;
				Worklist.push_back(SuccSU);
				Visited.insert(SuccSU);
				}
				}
				}
				return false;
				}

				/// Return true if the instruction causes a chain between memory
				/// references before and after it.
				static bool isDependenceBarrier(MachineInstr *MI, AliasAnalysis *AA) {
				return MI->isCall() || MI->hasUnmodeledSideEffects() ||
				(MI->hasOrderedMemoryRef() &&
				(!MI->mayLoad() || !MI->isInvariantLoad(AA)));
				}

				/// Return the underlying objects for the memory references of an instruction.
				/// This function calls the code in ValueTracking, but first checks that the
				/// instruction has a memory operand.
				static void getUnderlyingObjects(MachineInstr *MI,
				SmallVectorImpl<Value *> &Objs,
				const DataLayout &DL) {
				if (!MI->hasOneMemOperand())
				return;
				MachineMemOperand *MM = *MI->memoperands_begin();
				if (!MM->getValue())
				return;
				GetUnderlyingObjects(const_cast<Value *>(MM->getValue()), Objs, DL);
				}

				/// Add a chain edge between a load and store if the store can be an
				/// alias of the load on a subsequent iteration, i.e., a loop carried
				/// dependence. This code is very similar to the code in ScheduleDAGInstrs
				/// but that code doesn't create loop carried dependences.
				void SwingSchedulerDAG::addLoopCarriedDependences(AliasAnalysis *AA) {
				MapVector<Value *, SmallVector<SUnit *, 4>> PendingLoads;
				for (auto &SU : SUnits) {
				MachineInstr *MI = SU.getInstr();
				if (isDependenceBarrier(MI, AA))
				PendingLoads.clear();
				else if (MI->mayLoad()) {
				SmallVector<Value *, 4> Objs;
				getUnderlyingObjects(MI, Objs, MF.getDataLayout());
				for (auto V : Objs) {
				SmallVector<SUnit *, 4> &SUs = PendingLoads[V];
				SUs.push_back(&SU);
				}
				} else if (MI->mayStore()) {
				SmallVector<Value *, 4> Objs;
				getUnderlyingObjects(MI, Objs, MF.getDataLayout());
				for (auto V : Objs) {
				MapVector<Value *, SmallVector<SUnit *, 4>>::iterator I =
				PendingLoads.find(V);
				if (I == PendingLoads.end())
				continue;
				for (auto Load : I->second) {
				if (isSuccOrder(Load, &SU))
				continue;
				MachineInstr *LdMI = Load->getInstr();
				// First, perform the cheaper check that compares the base register.
				// If they are the same and the load offset is less than the store
				// offset, then mark the dependence as loop carried potentially.
				unsigned BaseReg1, Offset1, BaseReg2, Offset2;
				if (!TII->getMemOpBaseRegImmOfs(LdMI, BaseReg1, Offset1, TRI) ||
				!TII->getMemOpBaseRegImmOfs(MI, BaseReg2, Offset2, TRI))
				continue;
				if (BaseReg1 == BaseReg2 && (int)Offset1 < (int)Offset2) {
				assert(TII->areMemAccessesTriviallyDisjoint(LdMI, MI, AA) &&
				"What happened to the chain edge?");
				SU.addPred(SDep(Load, SDep::Barrier));
				continue;
				}
				// Second, the more expensive check that uses alias analysis on the
				// base registers. If they alias, and the load offset is less than
				// the store offset, then mark the dependence as loop carried.
				if (!AA)
				continue;
				MachineMemOperand *MMO1 = *LdMI->memoperands_begin();
				MachineMemOperand *MMO2 = *MI->memoperands_begin();
				if (!MMO1->getValue() || !MMO2->getValue())
				continue;
				if (MMO1->getValue() == MMO2->getValue() &&
				MMO1->getOffset() <= MMO2->getOffset()) {
				SU.addPred(SDep(Load, SDep::Barrier));
				continue;
				}
				AliasResult AAResult = AA->alias(
				MemoryLocation(MMO1->getValue(), MemoryLocation::UnknownSize,
				MMO1->getAAInfo()),
				MemoryLocation(MMO2->getValue(), MemoryLocation::UnknownSize,
				MMO2->getAAInfo()));

				if (AAResult != NoAlias)
				SU.addPred(SDep(Load, SDep::Barrier));
				}
				}
				}
				}
				}

				/// Update the phi dependences to the DAG because ScheduleDAGInstrs no longer
				/// processes dependences for PHIs. This function adds true dependences
				/// from a PHI to a use, and a loop carried dependence from the use to the
				/// PHI. The loop carried dependence is represented as an anti dependence
				/// edge. This function also removes chain dependences between unrelated
				/// PHIs.
				void SwingSchedulerDAG::updatePhiDependences() {
				SmallVector<SDep, 4> RemoveDeps;
				const TargetSubtargetInfo &ST = MF.getSubtarget<TargetSubtargetInfo>();

				// Iterate over each DAG node.
				for (std::vector<SUnit>::iterator I = SUnits.begin(), E = SUnits.end();
				I != E; ++I) {
				RemoveDeps.clear();
				// Set to true if the instruction has an operand defined by a Phi.
				unsigned HasPhiUse = 0;
				unsigned HasPhiDef = 0;
				MachineInstr *MI = I->getInstr();
				// Iterate over each operand, and we process the definitions.
				for (MachineInstr::mop_iterator MOI = MI->operands_begin(),
				MOE = MI->operands_end();
				MOI != MOE; ++MOI) {
				if (!MOI->isReg())
				continue;
				unsigned Reg = MOI->getReg();
				if (MOI->isDef()) {
				// If the register is used by a Phi, then create an anti dependence.
				for (MachineRegisterInfo::use_instr_iterator
				UI = MRI.use_instr_begin(Reg),
				UE = MRI.use_instr_end();
				UI != UE; ++UI) {
				MachineInstr *UseMI = &*UI;
				SUnit *SU = getSUnit(UseMI);
				if (SU != 0 && UseMI->isPHI()) {
				if (!MI->isPHI()) {
				SDep Dep(SU, SDep::Anti, Reg);
				I->addPred(Dep);
				} else {
				HasPhiDef = Reg;
				// Add a chain edge to a dependent Phi that isn't an existing
				// predecessor.
				if (SU->NodeNum < I->NodeNum && !I->isPred(SU))
				I->addPred(SDep(SU, SDep::Barrier));
				}
				}
				}
				} else if (MOI->isUse()) {
				// If the register is defined by a Phi, then create a true dependence.
				MachineInstr *DefMI = MRI.getUniqueVRegDef(Reg);
				if (DefMI == 0)
				continue;
				SUnit *SU = getSUnit(DefMI);
				if (SU != 0 && DefMI->isPHI()) {
				if (!MI->isPHI()) {
				SDep Dep(SU, SDep::Data, Reg);
				Dep.setLatency(0);
				ST.adjustSchedDependency(SU, &*I, Dep);
				I->addPred(Dep);
				} else {
				HasPhiUse = Reg;
				// Add a chain edge to a dependent Phi that isn't an existing
				// predecessor.
				if (SU->NodeNum < I->NodeNum && !I->isPred(SU))
				I->addPred(SDep(SU, SDep::Barrier));
				}
				}
				}
				}
				// Remove order dependences from an unrelated Phi.
				if (!SwpPruneDeps)
				continue;
				for (SUnit::pred_iterator PI = I->Preds.begin(), PE = I->Preds.end();
				PI != PE; ++PI) {
				MachineInstr *PMI = PI->getSUnit()->getInstr();
				if (PMI->isPHI() && PI->getKind() == SDep::Order) {
				if (I->getInstr()->isPHI()) {
				if (PMI->getOperand(0).getReg() == HasPhiUse)
				continue;
				if (getLoopPhiReg(PMI, PMI->getParent()) == HasPhiDef)
				continue;
				}
				RemoveDeps.push_back(*PI);
				}
				}
				for (int i = 0, e = RemoveDeps.size(); i != e; ++i)
				I->removePred(RemoveDeps[i]);
				}
				}

				/// Iterate over each DAG node and see if we can change any dependences
				/// in order to reduce the recurrence MII.
				void SwingSchedulerDAG::changeDependences() {
				// See if an instruction can use a value from the previous iteration.
				// If so, we update the base and offset of the instruction and change
				// the dependences.
				for (std::vector<SUnit>::iterator I = SUnits.begin(), E = SUnits.end();
				I != E; ++I) {
				unsigned BasePos = 0, OffsetPos = 0, NewBase = 0;
				int64_t NewOffset = 0;
				if (!canUseLastOffsetValue(I->getInstr(), BasePos, OffsetPos, NewBase,
				NewOffset))
				continue;

				// Get the MI and SUnit for the instruction that defines the original base.
				unsigned OrigBase = I->getInstr()->getOperand(BasePos).getReg();
				MachineInstr *DefMI = MRI.getUniqueVRegDef(OrigBase);
				if (!DefMI)
				continue;
				SUnit *DefSU = getSUnit(DefMI);
				if (!DefSU)
				continue;
				// Get the MI and SUnit for the instruction that defines the new base.
				MachineInstr *LastMI = MRI.getUniqueVRegDef(NewBase);
				if (!LastMI)
				continue;
				SUnit *LastSU = getSUnit(LastMI);
				if (!LastSU)
				continue;

				if (Topo.IsReachable(&*I, LastSU))
				continue;

				// Remove the dependence. The value now depends on a prior iteration.
				SmallVector<SDep, 4> Deps;
				for (SUnit::pred_iterator P = I->Preds.begin(), E = I->Preds.end(); P != E;
				++P)
				if (P->getSUnit() == DefSU)
				Deps.push_back(*P);
				for (int i = 0, e = Deps.size(); i != e; i++) {
				Topo.RemovePred(&*I, Deps[i].getSUnit());
				I->removePred(Deps[i]);
				}
				// Remove the chain dependence between the instructions.
				Deps.clear();
				for (SUnit::pred_iterator P = LastSU->Preds.begin(),
				E = LastSU->Preds.end();
				P != E; ++P)
				if (P->getSUnit() == &*I && P->getKind() == SDep::Order)
				Deps.push_back(*P);
				for (int i = 0, e = Deps.size(); i != e; i++) {
				Topo.RemovePred(LastSU, Deps[i].getSUnit());
				LastSU->removePred(Deps[i]);
				}

				// Add a dependence between the new instruction and the instruction
				// that defines the new base.
				SDep Dep(&*I, SDep::Anti, NewBase);
				LastSU->addPred(Dep);

				// Remember the base and offset information so that we can update the
				// instruction during code generation.
				InstrChanges[&*I] = std::make_pair(NewBase, NewOffset);
				}
				}

				namespace {
				// FuncUnitSorter - Comparison operator used to sort instructions by
				// the number of functional unit choices.
				struct FuncUnitSorter {
				const InstrItineraryData *InstrItins;
				DenseMap<unsigned, unsigned> Resources;

				// Compute the number of functional unit alternatives needed
				// at each stage, and take the minimum value. We prioritize the
				// instructions by the least number of choices first.
				unsigned minFuncUnits(const MachineInstr *Inst, unsigned &F) const {
				unsigned schedClass = Inst->getDesc().getSchedClass();
				unsigned min = UINT_MAX;
				for (const InstrStage *IS = InstrItins->beginStage(schedClass),
				*IE = InstrItins->endStage(schedClass);
				IS != IE; ++IS) {
				unsigned funcUnits = IS->getUnits();
				unsigned numAlternatives = countPopulation(funcUnits);
				if (numAlternatives < min) {
				min = numAlternatives;
				F = funcUnits;
				}
				}
				return min;
				}

				// Compute the critical resources needed by the instruction. This
				// function records the functional units needed by instructions that
				// must use only one functional unit. We use this as a tie breaker
				// for computing the resource MII. The instructions that require
				// the same, highly used, functional unit have high priority.
				void calcCriticalResources(const MachineInstr *MI) {
				unsigned SchedClass = MI->getDesc().getSchedClass();
				for (const InstrStage *IS = InstrItins->beginStage(SchedClass),
				*IE = InstrItins->endStage(SchedClass);
				IS != IE; ++IS) {
				unsigned FuncUnits = IS->getUnits();
				if (countPopulation(FuncUnits) == 1)
				Resources[FuncUnits]++;
				}
				}

				FuncUnitSorter(const InstrItineraryData *IID) : InstrItins(IID) {}
				/// Return true if IS1 has less priority than IS2.
				bool operator()(const MachineInstr *IS1, const MachineInstr *IS2) const {
				unsigned F1 = 0, F2 = 0;
				unsigned MFUs1 = minFuncUnits(IS1, F1);
				unsigned MFUs2 = minFuncUnits(IS2, F2);
				if (MFUs1 == 1 && MFUs2 == 1)
				return Resources.lookup(F1) < Resources.lookup(F2);
				return MFUs1 > MFUs2;
				}
				};
				}

				/// Calculate the resource constrained minimum initiation interval for the
				marksl: If you have a functional unit that issues in stages, such that another instruction needing the same FU can issue the very next cycle, then isn't the sum of the cycles too great? Example: InstrItinData<IIC_MUL_rr, [InstrStage<1, [MUL_DSP_STAGE1]>, InstrStage<1, [MUL_DSP_STAGE2]>, InstrStage<1, [MUL_DSP_STAGE3]>]>. In this case multiplies will issue back-to-back and require 3 cycles to complete. If we have 2 multiplies, then ResMII = (2 * 3) / 1 = 6, when in reality it will be 4 cycles since the unit is broken into stages.
				bcahoon (Author): In the case with the two multiplies, I think the resource MII should be 2. The resource MII is just the cycle count of the most heavily used resource, and each of the resources MUL_DSP_STAGE1, MUL_DSP_STAGE2, and MUL_DSP_STAGE3 is used for 2 cycles by the 2 instructions. But I agree that the code here is not going to return 2. With your type of itinerary, I believe the code computes that there are 6 functional units used. Then, when iterating over the functional units, the code calls canReserveResources() for each instruction 3 times. Each time, the query returns false and a new DFA is created, for a total of 6 cycles. It seems to me that the DFA requires an additional parameter to determine the cycle, or stage, but I don't believe that is possible currently. I'll need to think about how to get this to work correctly for this type of itinerary. Unfortunately, it is hard for me to test a solution to this since Hexagon doesn't have a similar itinerary.
				/// specified loop. We use the DFA to model the resources needed for
				/// each instruction, and we ignore dependences. A different DFA is created
				/// for each cycle that is required. When adding a new instruction, we attempt
				/// to add it to each existing DFA, until a legal space is found. If the
				/// instruction cannot be reserved in an existing DFA, we create a new one.
				unsigned SwingSchedulerDAG::calculateResMII() {
				SmallVector<DFAPacketizer *, 8> Resources;
				MachineBasicBlock *MBB = Loop->getHeader();
				Resources.push_back(TII->CreateTargetScheduleState(MF.getSubtarget()));

				// Sort the instructions by the number of available choices for scheduling,
				// least to most. Use the number of critical resources as the tie breaker.
				FuncUnitSorter FUS =
				FuncUnitSorter(MF.getSubtarget().getInstrItineraryData());
				for (MachineBasicBlock::iterator I = MBB->getFirstNonPHI(),
				E = MBB->getFirstTerminator();
				I != E; ++I)
				FUS.calcCriticalResources(I);
				PriorityQueue<MachineInstr *, std::vector<MachineInstr *>, FuncUnitSorter>
				FuncUnitOrder(FUS);

				for (MachineBasicBlock::iterator I = MBB->getFirstNonPHI(),
				E = MBB->getFirstTerminator();
				I != E; ++I)
				FuncUnitOrder.push(I);

				while (!FuncUnitOrder.empty()) {
				MachineInstr *MI = FuncUnitOrder.top();
				FuncUnitOrder.pop();
				if (TII->isZeroCost(MI->getOpcode()))
				continue;
				// Attempt to reserve the instruction in an existing DFA. At least one
				// DFA is needed for each cycle.
				unsigned NumCycles = getSUnit(MI)->Latency;
				unsigned ReservedCycles = 0;
				SmallVectorImpl<DFAPacketizer *>::iterator RI = Resources.begin();
				SmallVectorImpl<DFAPacketizer *>::iterator RE = Resources.end();
				for (unsigned C = 0; C < NumCycles; ++C)
				while (RI != RE) {
				if ((*RI++)->canReserveResources(MI)) {
				++ReservedCycles;
				break;
				}
				}
				// Start reserving resources using existing DFAs.
				for (unsigned C = 0; C < ReservedCycles; ++C) {
				--RI;
				(*RI)->reserveResources(MI);
				}
				// Add new DFAs, if needed, to reserve resources.
				for (unsigned C = ReservedCycles; C < NumCycles; ++C) {
				DFAPacketizer *NewResource =
				TII->CreateTargetScheduleState(MF.getSubtarget());
				assert(NewResource->canReserveResources(MI) && "Reserve error.");
				NewResource->reserveResources(MI);
				Resources.push_back(NewResource);
				}
				}
				int Resmii = Resources.size();
				// Delete the memory for each of the DFAs that were created earlier.
				for (SmallVectorImpl<DFAPacketizer *>::iterator RI = Resources.begin(),
				RE = Resources.end();
				RI != RE; ++RI) {
				DFAPacketizer *D = *RI;
				delete D;
				}
				Resources.clear();
				return Resmii;
				}
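
The review thread on calculateResMII describes the resource MII as the cycle count of the most heavily used resource. The following standalone sketch illustrates that definition only; it is not the DFA-based code in the patch. It assumes every stage names exactly one resource with no alternatives, and `Stage`, `Instr`, and `resMIIByUsage` are hypothetical names introduced for this example.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of one itinerary stage: a single named resource that is
// occupied for a number of cycles (no functional unit alternatives).
struct Stage {
  std::string Resource;
  unsigned Cycles;
};

// Hypothetical model of an instruction: the stages it passes through.
using Instr = std::vector<Stage>;

// Sum the cycles each resource is busy across all instructions in the loop
// body and return the largest total. Dependences are ignored.
unsigned resMIIByUsage(const std::vector<Instr> &Loop) {
  std::map<std::string, unsigned> Usage;
  unsigned Max = 0;
  for (const Instr &I : Loop)
    for (const Stage &S : I) {
      Usage[S.Resource] += S.Cycles;
      Max = std::max(Max, Usage[S.Resource]);
    }
  return Max;
}
```

For the two-multiply itinerary from the thread (three one-cycle stages per multiply), each stage resource is busy for 2 cycles, so this sketch yields 2 rather than 6.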

				/// Calculate the recurrence-constrained minimum initiation interval.
				/// Iterate over each circuit. Compute the delay(c) and distance(c)
				/// for each circuit. The II needs to satisfy the inequality
				/// delay(c) - II*distance(c) <= 0. For each circuit, choose the smallest
				/// II that satisfies the inequality, and the RecMII is the maximum
				/// of those values.
				unsigned SwingSchedulerDAG::calculateRecMII(NodeSetType &NodeSets) {
				unsigned RecMII = 0;

				for (NodeSetType::iterator NS = NodeSets.begin(), ENS = NodeSets.end();
				NS != ENS; ++NS) {
				NodeSet &Nodes = *NS;
				if (Nodes.size() == 0)
				continue;

				unsigned Delay = Nodes.size() - 1;
				unsigned Distance = 1;

				// ii = ceil(delay / distance)
				unsigned CurMII = (Delay + Distance - 1) / Distance;
				Nodes.setRecMII(CurMII);
				if (CurMII > RecMII)
				RecMII = CurMII;
				}

				return RecMII;
				}
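
As a rough illustration of the inequality in the comment on calculateRecMII, the sketch below computes the smallest II for each circuit as ceil(delay / distance) and takes the maximum. The `Circuit` struct and `recMII` function are hypothetical names for this example; the patch itself approximates the delay as `Nodes.size() - 1` and fixes the distance at 1.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a recurrence circuit: the total latency around
// the cycle (Delay) and the number of iterations it spans (Distance).
struct Circuit {
  unsigned Delay;
  unsigned Distance;
};

// For each circuit, the smallest II with Delay - II * Distance <= 0 is
// ceil(Delay / Distance). The result is the maximum over all circuits.
unsigned recMII(const std::vector<Circuit> &Circuits) {
  unsigned Result = 0;
  for (const Circuit &C : Circuits) {
    assert(C.Distance > 0 && "a circuit must span at least one iteration");
    unsigned CurMII = (C.Delay + C.Distance - 1) / C.Distance;
    Result = std::max(Result, CurMII);
  }
  return Result;
}
```

A circuit with delay 5 and distance 2 needs an II of at least 3, because 5 - 2*2 > 0 while 5 - 3*2 <= 0.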

				/// Swap all the anti dependences in the DAG. That means it is no longer a DAG,
				/// but we do this to find the circuits, and then change them back.
				static void swapAntiDependences(std::vector<SUnit> &SUnits) {
				SmallVector<std::pair<SUnit *, SDep>, 8> DepsAdded;
				for (unsigned i = 0, e = SUnits.size(); i != e; ++i) {
				SUnit *SU = &SUnits[i];
				for (SUnit::pred_iterator IP = SU->Preds.begin(), EP = SU->Preds.end();
				IP != EP; ++IP) {
				if (IP->getKind() != SDep::Anti)
				continue;
				DepsAdded.push_back(std::make_pair(SU, *IP));
				}
				}
				for (SmallVector<std::pair<SUnit *, SDep>, 8>::iterator I = DepsAdded.begin(),
				E = DepsAdded.end();
				I != E; ++I) {
				// Remove this anti dependency and add one in the reverse direction.
				SUnit *SU = I->first;
				SDep &D = I->second;
				SUnit *TargetSU = D.getSUnit();
				unsigned Reg = D.getReg();
				unsigned Lat = D.getLatency();
				SU->removePred(D);
				SDep Dep(SU, SDep::Anti, Reg);
				Dep.setLatency(Lat);
				TargetSU->addPred(Dep);
				}
				}

				/// Create the adjacency structure of the nodes in the graph.
				void SwingSchedulerDAG::Circuits::createAdjacencyStructure(
				SwingSchedulerDAG *DAG) {
				BitVector Added(SUnits.size());
				for (int i = 0, e = SUnits.size(); i != e; ++i) {
				Added.reset();
				// Add any successor to the adjacency matrix and exclude duplicates.
				for (auto &SI : SUnits[i].Succs) {
				// A back-edge is processed only if it goes to a Phi.
				if (SI.getKind() == SDep::Anti && !SI.getSUnit()->getInstr()->isPHI())
				continue;
				int N = SI.getSUnit()->NodeNum;
				if (!Added.test(N)) {
				AdjK[i].push_back(N);
				Added.set(N);
				}
				}
				// A chain edge between a store and a load is treated as a back-edge in the
				// adjacency matrix.
				for (auto &PI : SUnits[i].Preds) {
				if (!SUnits[i].getInstr()->mayStore() ||
				!DAG->isLoopCarriedOrder(&SUnits[i], PI, false))
				continue;
				if (PI.getKind() == SDep::Order && PI.getSUnit()->getInstr()->mayLoad()) {
				int N = PI.getSUnit()->NodeNum;
				if (!Added.test(N)) {
				AdjK[i].push_back(N);
				Added.set(N);
				}
				}
				}
				}
				}

				/// Identify an elementary circuit in the dependence graph.
				bool SwingSchedulerDAG::Circuits::circuit(int V, int S, NodeSetType &NodeSets,
				bool HasBackedge) {
				SUnit *SV = &SUnits[V];
				bool F = false;
				Stack.push_back(SV);
				Blocked.set(V);

				for (auto W : AdjK[V]) {
				if (NumPaths > MaxPaths)
				break;
				if (W < S)
				continue;
				if (W == S) {
				if (!HasBackedge)
				NodeSets.push_back(NodeSet(Stack.begin(), Stack.end()));
				F = true;
				++NumPaths;
				break;
				} else if (!Blocked.test(W)) {
				if (circuit(W, S, NodeSets, W < V ? true : HasBackedge))
				F = true;
				}
				}

				if (F)
				unblock(V);
				else {
				for (auto W : AdjK[V]) {
				if (W < S)
				continue;
				if (B[W].count(SV) == 0)
				B[W].insert(SV);
				}
				}
				Stack.pop_back();
				return F;
				}

				/// Unblock a node in the circuit finding algorithm.
				void SwingSchedulerDAG::Circuits::unblock(int U) {
				Blocked.reset(U);
				SmallPtrSet<SUnit *, 4> &BU = B[U];
				while (!BU.empty()) {
				SmallPtrSet<SUnit *, 4>::iterator SI = BU.begin();
				assert(SI != BU.end() && "Invalid B set.");
				SUnit *W = *SI;
				BU.erase(W);
				if (Blocked.test(W->NodeNum))
				unblock(W->NodeNum);
				}
				}

				/// Identify all the elementary circuits in the dependence graph using
				/// Johnson's circuit algorithm.
				void SwingSchedulerDAG::findCircuits(NodeSetType &NodeSets) {
				// Swap all the anti dependences in the DAG. That means it is no longer a DAG,
				// but we do this to find the circuits, and then change them back.
				swapAntiDependences(SUnits);

				Circuits Cir(SUnits);
				// Create the adjacency structure.
				Cir.createAdjacencyStructure(this);
				for (int i = 0, e = SUnits.size(); i != e; ++i) {
				Cir.reset();
				Cir.circuit(i, i, NodeSets);
				}

				// Change the dependences back so that we've created a DAG again.
				swapAntiDependences(SUnits);
				}
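
The circuit() and unblock() functions follow the shape of Johnson's algorithm. The sketch below is a simplified standalone version over a plain adjacency list, meant only to show the blocking and unblocking scheme. It assumes integer node ids, omits the `HasBackedge` filtering, the `MaxPaths` limit, and the early `break` used in the patch, and, like the patch, skips the strongly connected component step of the original algorithm. `CircuitFinder` is a hypothetical name.

```cpp
#include <set>
#include <vector>

// Simplified elementary-circuit enumeration in the style of Johnson's
// algorithm. Each circuit is reported once, starting at its lowest node id.
class CircuitFinder {
  std::vector<std::vector<int>> Adj;  // Adj[V] lists the successors of V.
  std::vector<bool> Blocked;
  std::vector<std::set<int>> B;       // B[W]: nodes to unblock when W unblocks.
  std::vector<int> Stack;
  std::vector<std::vector<int>> Found;

  void unblock(int U) {
    Blocked[U] = false;
    while (!B[U].empty()) {
      int W = *B[U].begin();
      B[U].erase(B[U].begin());
      if (Blocked[W])
        unblock(W);
    }
  }

  // Extend the current path at V, looking for a way back to start node S.
  bool circuit(int V, int S) {
    bool F = false;
    Stack.push_back(V);
    Blocked[V] = true;
    for (int W : Adj[V]) {
      if (W < S)
        continue; // Circuits through lower ids were found from earlier starts.
      if (W == S) {
        Found.push_back(Stack);
        F = true;
      } else if (!Blocked[W] && circuit(W, S)) {
        F = true;
      }
    }
    if (F) {
      unblock(V);
    } else {
      for (int W : Adj[V])
        if (W >= S)
          B[W].insert(V);
    }
    Stack.pop_back();
    return F;
  }

public:
  explicit CircuitFinder(const std::vector<std::vector<int>> &A) : Adj(A) {}

  std::vector<std::vector<int>> run() {
    Found.clear();
    for (int S = 0, E = Adj.size(); S != E; ++S) {
      Blocked.assign(Adj.size(), false);
      B.assign(Adj.size(), std::set<int>());
      circuit(S, S);
    }
    return Found;
  }
};
```

For the graph with edges 0->1, 1->2, 1->0, and 2->0, the sketch reports the two circuits {0, 1, 2} and {0, 1}.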

				// Return true for DAG nodes that we ignore when computing the cost functions.
				// We ignore the back-edge recurrence in order to avoid unbounded recursion
				// in the calculation of the ASAP, ALAP, etc functions.
				static bool ignoreDependence(const SDep &D, bool isPred) {
				if (D.isArtificial())
				return true;
				return D.getKind() == SDep::Anti && isPred;
				}

				// Compute several functions needed to order the nodes for scheduling.
				// ASAP - Earliest time to schedule a node.
				// ALAP - Latest time to schedule a node.
				// MOV - Mobility function, difference between ALAP and ASAP.
				// D - Depth of each node.
				// H - Height of each node.
				//
				void SwingSchedulerDAG::computeNodeFunctions(NodeSetType &NodeSets) {

				ScheduleInfo.resize(SUnits.size());

				DEBUG({
				for (ScheduleDAGTopologicalSort::const_iterator I = Topo.begin(),
				E = Topo.end();
				I != E; ++I) {
				SUnit *SU = &SUnits[*I];
				SU->dump(this);
				}
				});

				int maxASAP = 0;
				// Compute ASAP.
				for (ScheduleDAGTopologicalSort::const_iterator I = Topo.begin(),
				E = Topo.end();
				I != E; ++I) {
				int asap = 0;
				SUnit *SU = &SUnits[*I];
				for (SUnit::const_pred_iterator IP = SU->Preds.begin(),
				EP = SU->Preds.end();
				IP != EP; ++IP) {
				if (ignoreDependence(*IP, true))
				continue;
				SUnit *pred = IP->getSUnit();
				asap = std::max(asap, (int)(getASAP(pred) + getLatency(SU, *IP) -
				getDistance(pred, SU, *IP) * MII));
				}
				maxASAP = std::max(maxASAP, asap);
				ScheduleInfo[*I].ASAP = asap;
				}

				// Compute ALAP and MOV.
				for (ScheduleDAGTopologicalSort::const_reverse_iterator I = Topo.rbegin(),
				E = Topo.rend();
				I != E; ++I) {
				int alap = maxASAP;
				SUnit *SU = &SUnits[*I];
				for (SUnit::const_succ_iterator IS = SU->Succs.begin(),
				ES = SU->Succs.end();
				IS != ES; ++IS) {
				if (ignoreDependence(*IS, true))
				continue;
				SUnit *succ = IS->getSUnit();
				alap = std::min(alap, (int)(getALAP(succ) - getLatency(SU, *IS) +
				getDistance(SU, succ, *IS) * MII));
				}

				ScheduleInfo[*I].ALAP = alap;
				}

				// After computing the node functions, compute the summary for each node set.
				for (NodeSetType::iterator I = NodeSets.begin(), E = NodeSets.end(); I != E;
				++I)
				I->computeNodeSetInfo(this);

				DEBUG({
				for (unsigned i = 0; i < SUnits.size(); i++) {
				dbgs() << "\tNode " << i << ":\n";
				dbgs() << "\t ASAP = " << getASAP(&SUnits[i]) << "\n";
				dbgs() << "\t ALAP = " << getALAP(&SUnits[i]) << "\n";
				dbgs() << "\t MOV = " << getMOV(&SUnits[i]) << "\n";
				dbgs() << "\t D = " << getDepth(&SUnits[i]) << "\n";
				dbgs() << "\t H = " << getHeight(&SUnits[i]) << "\n";
				}
				});
				}
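
The ASAP, ALAP, and MOV computations in computeNodeFunctions can be illustrated on a small graph. The sketch below is a standalone approximation, not the patch's code: it assumes nodes are numbered in topological order (every edge has `From < To`), that ignored dependences have already been filtered out, and that each edge carries its own latency and iteration distance. `Edge`, `NodeTimes`, and `computeTimes` are hypothetical names.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical dependence edge with a latency and an iteration distance.
struct Edge {
  int From, To, Latency, Distance;
};

struct NodeTimes {
  std::vector<int> ASAP, ALAP, MOV;
};

// Assumes node ids are already in topological order (From < To on every edge).
NodeTimes computeTimes(int NumNodes, const std::vector<Edge> &Edges, int MII) {
  NodeTimes T;
  T.ASAP.assign(NumNodes, 0);
  int MaxASAP = 0;
  // Forward pass: earliest start given all predecessors.
  for (int N = 0; N < NumNodes; ++N) {
    for (const Edge &E : Edges)
      if (E.To == N)
        T.ASAP[N] = std::max(T.ASAP[N],
                             T.ASAP[E.From] + E.Latency - E.Distance * MII);
    MaxASAP = std::max(MaxASAP, T.ASAP[N]);
  }
  // Backward pass: latest start that still lets all successors start on time.
  T.ALAP.assign(NumNodes, MaxASAP);
  for (int N = NumNodes - 1; N >= 0; --N)
    for (const Edge &E : Edges)
      if (E.From == N)
        T.ALAP[N] = std::min(T.ALAP[N],
                             T.ALAP[E.To] - E.Latency + E.Distance * MII);
  // Mobility is the slack between the two.
  T.MOV.resize(NumNodes);
  for (int N = 0; N < NumNodes; ++N)
    T.MOV[N] = T.ALAP[N] - T.ASAP[N];
  return T;
}
```

With edges 0->1 (latency 2), 1->3 (latency 1), 0->2 (latency 1), and 2->3 (latency 1), all at distance 0, nodes 0, 1, and 3 lie on the critical path and have zero mobility, while node 2 has a mobility of 1.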

				/// Compute the Pred_L(O) set, as defined in the paper. The set is defined
				marksl: What paper are you referring to?
				marksl: Sorry, I found it was Tanya Lattner's paper.
				bcahoon (Author): The original paper is "Swing Modulo Scheduling: A Lifetime-Sensitive Approach" from PACT 1996. Tanya's thesis provides a good description as well, though.
				/// as the predecessors of the elements of NodeOrder that are not also in
				/// NodeOrder.
				MatzeB: It would be good to mention relevant papers in the comment at the beginning of the .cpp file.
				static bool pred_L(SetVector<SUnit *> &NodeOrder,
				SmallSetVector<SUnit *, 8> &Preds,
				const SwingSchedulerDAG::NodeSet *S = nullptr) {
				Preds.clear();
				for (SetVector<SUnit *>::iterator I = NodeOrder.begin(), E = NodeOrder.end();
				I != E; ++I) {
				for (SUnit::pred_iterator PI = (*I)->Preds.begin(), PE = (*I)->Preds.end();
				PI != PE; ++PI) {
				if (S && S->count(PI->getSUnit()) == 0)
				continue;
				if (ignoreDependence(*PI, true))
				continue;
				if (NodeOrder.count(PI->getSUnit()) == 0)
				Preds.insert(PI->getSUnit());
				}
				// Back-edges are predecessors with an anti-dependence.
				for (SUnit::const_succ_iterator IS = (*I)->Succs.begin(),
				ES = (*I)->Succs.end();
				IS != ES; ++IS) {
				if (IS->getKind() != SDep::Anti)
				continue;
				if (S && S->count(IS->getSUnit()) == 0)
				continue;
				if (NodeOrder.count(IS->getSUnit()) == 0)
				Preds.insert(IS->getSUnit());
				}
				}
				return Preds.size() > 0;
				}

				/// Compute the Succ_L(O) set, as defined in the paper. The set is defined
				/// as the successors of the elements of NodeOrder that are not also in
				/// NodeOrder.
				static bool succ_L(SetVector<SUnit *> &NodeOrder,
				SmallSetVector<SUnit *, 8> &Succs,
				const SwingSchedulerDAG::NodeSet *S = nullptr) {
				Succs.clear();
				for (SetVector<SUnit *>::iterator I = NodeOrder.begin(), E = NodeOrder.end();
				I != E; ++I) {
				for (SUnit::succ_iterator SI = (*I)->Succs.begin(), SE = (*I)->Succs.end();
				SI != SE; ++SI) {
				if (S && S->count(SI->getSUnit()) == 0)
				continue;
				if (ignoreDependence(*SI, false))
				continue;
				if (NodeOrder.count(SI->getSUnit()) == 0)
				Succs.insert(SI->getSUnit());
				}
				for (SUnit::const_pred_iterator PI = (*I)->Preds.begin(),
				PE = (*I)->Preds.end();
				PI != PE; ++PI) {
				if (PI->getKind() != SDep::Anti)
				continue;
				if (S && S->count(PI->getSUnit()) == 0)
				continue;
				if (NodeOrder.count(PI->getSUnit()) == 0)
				Succs.insert(PI->getSUnit());
				}
				}
				return Succs.size() > 0;
				}

				/// Return true if there is a path from the specified node to any of the nodes
				/// in DestNodes. Keep track and return the nodes in any path.
				static bool computePath(SUnit *Cur, SetVector<SUnit *> &Path,
				SetVector<SUnit *> &DestNodes,
				SetVector<SUnit *> &Exclude,
				SmallPtrSet<SUnit *, 8> &Visited) {
				if (Cur->isBoundaryNode())
				return false;
				if (Exclude.count(Cur) != 0)
				return false;
				if (DestNodes.count(Cur) != 0)
				return true;
				if (!Visited.insert(Cur).second)
				return Path.count(Cur) != 0;
				bool FoundPath = false;
				for (SUnit::succ_iterator SI = Cur->Succs.begin(), SE = Cur->Succs.end();
				SI != SE; ++SI)
				FoundPath |= computePath(SI->getSUnit(), Path, DestNodes, Exclude, Visited);
				for (SUnit::pred_iterator PI = Cur->Preds.begin(), PE = Cur->Preds.end();
				PI != PE; ++PI)
				if (PI->getKind() == SDep::Anti)
				FoundPath |=
				computePath(PI->getSUnit(), Path, DestNodes, Exclude, Visited);
				if (FoundPath)
				Path.insert(Cur);
				return FoundPath;
				}

				/// Return true if Set1 is a subset of Set2.
				template <class S1Ty, class S2Ty> static bool isSubset(S1Ty &Set1, S2Ty &Set2) {
				for (typename S1Ty::iterator I = Set1.begin(), E = Set1.end(); I != E; ++I)
				if (Set2.count(*I) == 0)
				return false;
				return true;
				}

				/// Compute the live-out registers for the instructions in a node-set.
				/// The live-out registers are those that are defined in the node-set,
				/// but not used, except for use operands of Phis.
				static void computeLiveOuts(RegPressureTracker &RPTracker,
				SwingSchedulerDAG::NodeSet &NS) {
				SmallVector<RegisterMaskPair, 8> LiveOutRegs;
				SmallSet<unsigned, 4> Uses;
				for (auto &I : NS) {
				const MachineInstr *MI = I->getInstr();
				if (MI->isPHI())
				continue;
				for (ConstMIOperands MO(MI); MO.isValid(); ++MO)
				if (MO->isReg() && MO->isUse())
				Uses.insert(MO->getReg());
				}
				MatzeB: Should avoid `auto` here and mention the type name. Some more instances following.
				for (auto &I : NS)
				for (ConstMIOperands MO(I->getInstr()); MO.isValid(); ++MO)
				if (MO->isReg() && MO->isDef() && !MO->isDead() &&
				!Uses.count(MO->getReg()))
				LiveOutRegs.push_back(RegisterMaskPair(MO->getReg(), 0));
				RPTracker.addLiveRegs(LiveOutRegs);
				}

				/// A heuristic to filter nodes in recurrent node-sets if the register
				/// pressure of a set is too high.
				void SwingSchedulerDAG::registerPressureFilter(NodeSetType &NodeSets) {
				for (auto &NS : NodeSets) {
				// Skip small node-sets since they won't cause register pressure problems.
				if (NS.size() <= 2)
				continue;
				IntervalPressure RecRegPressure;
				RegPressureTracker RecRPTracker(RecRegPressure);
				RecRPTracker.init(&MF, &RegClassInfo, LIS, BB, BB->end(), false, true);
				computeLiveOuts(RecRPTracker, NS);
				RecRPTracker.closeBottom();

				std::vector<SUnit *> SUnits(NS.begin(), NS.end());
				std::sort(SUnits.begin(), SUnits.end(), [](const SUnit *A, const SUnit *B) {
				return A->NodeNum > B->NodeNum;
				});

				for (auto &SU : SUnits) {
				// Since we're computing the register pressure for a subset of the
				// instructions in a block, we need to set the tracker for each
				// instruction in the node-set. The tracker is set to the instruction
				// just after the one we're interested in.
				MachineBasicBlock::const_iterator CurInstI = SU->getInstr();
				RecRPTracker.setPos(std::next(CurInstI));

				RegPressureDelta RPDelta;
				ArrayRef<PressureChange> CriticalPSets;
				RecRPTracker.getMaxUpwardPressureDelta(SU->getInstr(), nullptr, RPDelta,
				CriticalPSets,
				RecRegPressure.MaxSetPressure);
				if (RPDelta.Excess.isValid()) {
				DEBUG(dbgs() << "Excess register pressure: SU(" << SU->NodeNum << ") "
				<< TRI->getRegPressureSetName(RPDelta.Excess.getPSet())
				<< ":" << RPDelta.Excess.getUnitInc());
				NS.setExceedPressure(SU);
				break;
				}
				RecRPTracker.recede();
				}
				}
				}

				/// A heuristic to colocate node sets that have the same set of
				/// successors.
				void SwingSchedulerDAG::colocateNodeSets(NodeSetType &NodeSets) {
				unsigned Colocate = 0;
				for (int i = 0, e = NodeSets.size(); i < e; ++i) {
				NodeSet &N1 = NodeSets[i];
				SmallSetVector<SUnit *, 8> S1;
				if (N1.empty() || !succ_L(N1, S1))
				continue;
				for (int j = i + 1; j < e; ++j) {
				NodeSet &N2 = NodeSets[j];
				if (N1.compareRecMII(N2) != 0)
				continue;
				SmallSetVector<SUnit *, 8> S2;
				if (N2.empty() || !succ_L(N2, S2))
				continue;
				if (isSubset(S1, S2) && S1.size() == S2.size()) {
				N1.setColocate(++Colocate);
				N2.setColocate(Colocate);
				break;
				}
				}
				}
				}

				/// Check if the existing node-sets are profitable. If not, then ignore the
				/// recurrent node-sets, and attempt to schedule all nodes together. This is
				/// a heuristic. If the MII is large and there is a non-recurrent node with
				/// a large depth compared to the MII, then it's best to try and schedule
				/// all instructions together instead of starting with the recurrent node-sets.
				void SwingSchedulerDAG::checkNodeSets(NodeSetType &NodeSets) {
				// Look for loops with a large MII.
				if (MII <= 20)
				return;
				// Check if the node-set contains only a simple add recurrence.
				for (auto &NS : NodeSets)
				if (NS.size() > 2)
				return;
				// If the depth of any instruction is significantly larger than the MII, then
				// ignore the recurrent node-sets and treat all instructions equally.
				for (auto &SU : SUnits)
				if (SU.getDepth() > MII * 1.5) {
				NodeSets.clear();
				DEBUG(dbgs() << "Clear recurrence node-sets\n");
				return;
				}
				}

				// Add the nodes that do not belong to a recurrence set into groups
				// based upon connected components.
				void SwingSchedulerDAG::groupRemainingNodes(NodeSetType &NodeSets) {
				SetVector<SUnit *> NodesAdded;
				SmallPtrSet<SUnit *, 8> Visited;
				// Add the nodes that are on a path between the previous node sets and
				// the current node set.
				for (NodeSetType::iterator I = NodeSets.begin(), E = NodeSets.end(); I != E;
				++I) {
				SmallSetVector<SUnit *, 8> N;
				// Add the nodes from the current node set to the previous node set.
				if (succ_L(*I, N)) {
				SetVector<SUnit *> Path;
				for (SmallSetVector<SUnit *, 8>::iterator NI = N.begin(), NE = N.end();
				NI != NE; ++NI) {
				Visited.clear();
				computePath(*NI, Path, NodesAdded, *I, Visited);
				}
				if (Path.size() > 0)
				(*I).insert(Path.begin(), Path.end());
				}
				// Add the nodes from the previous node set to the current node set.
				N.clear();
				if (succ_L(NodesAdded, N)) {
				SetVector<SUnit *> Path;
				for (SmallSetVector<SUnit *, 8>::iterator NI = N.begin(), NE = N.end();
				NI != NE; ++NI) {
				Visited.clear();
				computePath(*NI, Path, *I, NodesAdded, Visited);
				}
				if (Path.size() > 0)
				(*I).insert(Path.begin(), Path.end());
				}
				NodesAdded.insert((*I).begin(), (*I).end());
				}

				// Create a new node set with the connected nodes of any successor of a node
				// in a recurrent set.
				NodeSet NewSet;
				SmallSetVector<SUnit *, 8> N;
				if (succ_L(NodesAdded, N))
				for (SmallSetVector<SUnit *, 8>::iterator I = N.begin(), E = N.end();
				I != E; ++I)
				addConnectedNodes(*I, NewSet, NodesAdded);
				if (NewSet.size() > 0)
				NodeSets.push_back(NewSet);

				// Create a new node set with the connected nodes of any predecessor of a node
				// in a recurrent set.
				NewSet.clear();
				if (pred_L(NodesAdded, N))
				for (SmallSetVector<SUnit *, 8>::iterator I = N.begin(), E = N.end();
				I != E; ++I)
				addConnectedNodes(*I, NewSet, NodesAdded);
				if (NewSet.size() > 0)
				NodeSets.push_back(NewSet);

				// Create new node sets with the connected nodes of any remaining node that
				// has no predecessor.
				for (unsigned i = 0; i < SUnits.size(); ++i) {
				SUnit *SU = &SUnits[i];
				if (NodesAdded.count(SU) == 0) {
				NewSet.clear();
				addConnectedNodes(SU, NewSet, NodesAdded);
				if (NewSet.size() > 0)
				NodeSets.push_back(NewSet);
				}
				}
				}

				// Add the node to the set, and add all of its connected nodes to the set.
				void SwingSchedulerDAG::addConnectedNodes(SUnit *SU, NodeSet &NewSet,
				SetVector<SUnit *> &NodesAdded) {
				NewSet.insert(SU);
				NodesAdded.insert(SU);
				for (SUnit::const_succ_iterator SI = SU->Succs.begin(), SE = SU->Succs.end();
				SI != SE; ++SI) {
				SUnit *Successor = SI->getSUnit();
				if (!SI->isArtificial() && NodesAdded.count(Successor) == 0)
				addConnectedNodes(Successor, NewSet, NodesAdded);
				}
				for (SUnit::const_pred_iterator PI = SU->Preds.begin(), PE = SU->Preds.end();
				PI != PE; ++PI) {
				SUnit *Predecessor = PI->getSUnit();
				if (!PI->isArtificial() && NodesAdded.count(Predecessor) == 0)
				addConnectedNodes(Predecessor, NewSet, NodesAdded);
				}
				}

				/// Return true if Set1 contains elements in Set2. The elements in common
				/// are returned in a different container.
				static bool isIntersect(SmallSetVector<SUnit *, 8> &Set1,
				const SwingSchedulerDAG::NodeSet &Set2,
				SmallSetVector<SUnit *, 8> &Result) {
				Result.clear();
				for (unsigned i = 0, e = Set1.size(); i != e; ++i) {
				SUnit *SU = Set1[i];
				if (Set2.count(SU) != 0)
				Result.insert(SU);
				}
				return !Result.empty();
				}

				/// Merge the recurrence node sets that have the same initial node.
				void SwingSchedulerDAG::fuseRecs(NodeSetType &NodeSets) {
				for (NodeSetType::iterator I = NodeSets.begin(), E = NodeSets.end(); I != E;
				++I) {
				NodeSet &NI = *I;
				for (NodeSetType::iterator J = I + 1; J != E;) {
				NodeSet &NJ = *J;
				if (NI.getNode(0)->NodeNum == NJ.getNode(0)->NodeNum) {
				if (NJ.compareRecMII(NI) > 0)
				NI.setRecMII(NJ.getRecMII());
				for (NodeSet::iterator NII = J->begin(), ENI = J->end(); NII != ENI;
				++NII)
				I->insert(*NII);
				NodeSets.erase(J);
				E = NodeSets.end();
				} else {
				++J;
				}
				}
				}
				}

				/// Remove nodes that have been scheduled in previous NodeSets.
				void SwingSchedulerDAG::removeDuplicateNodes(NodeSetType &NodeSets) {
				for (NodeSetType::iterator I = NodeSets.begin(), E = NodeSets.end(); I != E;
				++I)
				for (NodeSetType::iterator J = I + 1; J != E;) {
				J->remove_if([&](SUnit *SUJ) { return I->count(SUJ); });

				if (J->size() == 0) {
				NodeSets.erase(J);
				E = NodeSets.end();
				} else {
				++J;
				}
				}
				}

				/// Return true if Inst1 defines a value that is used in Inst2.
				static bool hasDataDependence(SUnit *Inst1, SUnit *Inst2) {
				for (SUnit::succ_iterator SI = Inst1->Succs.begin(), SE = Inst1->Succs.end();
				SI != SE; ++SI)
				if (SI->getSUnit() == Inst2 && SI->getKind() == SDep::Data)
				return true;
				return false;
				}

				// Compute an ordered list of the dependence graph nodes, which
				// indicates the order that the nodes will be scheduled. This is a
				// two-level algorithm. First, a partial order is created, which
				// consists of a list of sets ordered from highest to lowest priority.
				//
				void SwingSchedulerDAG::computeNodeOrder(NodeSetType &NodeSets) {
				SmallSetVector<SUnit *, 8> R;
				NodeOrder.clear();

				for (auto &Nodes : NodeSets) {
				DEBUG(dbgs() << "NodeSet size " << Nodes.size() << "\n");
				OrderKind Order;
				SmallSetVector<SUnit *, 8> N;
				if (pred_L(NodeOrder, N) && isSubset(N, Nodes)) {
				R.insert(N.begin(), N.end());
				Order = BottomUp;
				DEBUG(dbgs() << " Bottom up (preds) ");
				} else if (succ_L(NodeOrder, N) && isSubset(N, Nodes)) {
				R.insert(N.begin(), N.end());
				Order = TopDown;
				DEBUG(dbgs() << " Top down (succs) ");
				} else if (isIntersect(N, Nodes, R)) {
				// If some of the successors are in the existing node-set, then use the
				// top-down ordering.
				Order = TopDown;
				DEBUG(dbgs() << " Top down (intersect) ");
				} else if (NodeSets.size() == 1) {
				for (auto &N : Nodes)
				if (N->Succs.size() == 0)
				R.insert(N);
				Order = BottomUp;
				DEBUG(dbgs() << " Bottom up (all) ");
				} else {
				// Find the node with the highest ASAP.
				SUnit *maxASAP = nullptr;
				for (NodeSet::iterator NI = Nodes.begin(), ENI = Nodes.end(); NI != ENI;
				++NI) {
				SUnit *SU = *NI;
				if (maxASAP == nullptr || getASAP(SU) >= getASAP(maxASAP))
				maxASAP = SU;
				}
				R.insert(maxASAP);
				Order = BottomUp;
				DEBUG(dbgs() << " Bottom up (default) ");
				}

				while (!R.empty()) {
				if (Order == TopDown) {
				// Choose the node with the maximum height. If more than one, choose
				// the node with the lowest MOV. If still more than one, check if there
				// is a dependence between the instructions.
				while (!R.empty()) {
				SUnit *maxHeight = nullptr;
				for (SmallSetVector<SUnit *, 8>::iterator I = R.begin(), E = R.end();
				I != E; ++I) {
				if (maxHeight == 0 || getHeight(*I) > getHeight(maxHeight))
				maxHeight = *I;
				else if (getHeight(*I) == getHeight(maxHeight) &&
				getMOV(*I) < getMOV(maxHeight) &&
				!hasDataDependence(maxHeight, *I))
				maxHeight = *I;
				else if (hasDataDependence(*I, maxHeight))
				maxHeight = *I;
				}
				NodeOrder.insert(maxHeight);
				DEBUG(dbgs() << maxHeight->NodeNum << " ");
				R.remove(maxHeight);
				for (SUnit::const_succ_iterator I = maxHeight->Succs.begin(),
				E = maxHeight->Succs.end();
				I != E; ++I) {
				if (Nodes.count(I->getSUnit()) == 0)
				continue;
				if (NodeOrder.count(I->getSUnit()) != 0)
				continue;
				if (ignoreDependence(*I, false))
				continue;
				R.insert(I->getSUnit());
				}
				// Back-edges are predecessors with an anti-dependence.
				for (SUnit::const_pred_iterator I = maxHeight->Preds.begin(),
				E = maxHeight->Preds.end();
				I != E; ++I) {
				if (I->getKind() != SDep::Anti)
				continue;
				if (Nodes.count(I->getSUnit()) == 0)
				continue;
				if (NodeOrder.count(I->getSUnit()) != 0)
				continue;
				R.insert(I->getSUnit());
				}
				}
				Order = BottomUp;
				DEBUG(dbgs() << "\n Switching order to bottom up ");
				SmallSetVector<SUnit *, 8> N;
				if (pred_L(NodeOrder, N, &Nodes))
				R.insert(N.begin(), N.end());
				} else {
				// Choose the node with the maximum depth. If more than one, choose
				// the node with the lowest MOV. If there is still more than one, check
				// for a dependence between the instructions.
				while (!R.empty()) {
				SUnit *maxDepth = nullptr;
				for (SmallSetVector<SUnit *, 8>::iterator I = R.begin(), E = R.end();
				I != E; ++I) {
				if (maxDepth == 0 || getDepth(*I) > getDepth(maxDepth))
				maxDepth = *I;
				else if (getDepth(*I) == getDepth(maxDepth) &&
				getMOV(*I) < getMOV(maxDepth) &&
				!hasDataDependence(*I, maxDepth))
				maxDepth = *I;
				else if (hasDataDependence(maxDepth, *I))
				maxDepth = *I;
				}
				NodeOrder.insert(maxDepth);
				DEBUG(dbgs() << maxDepth->NodeNum << " ");
				R.remove(maxDepth);
				if (Nodes.isExceedSU(maxDepth)) {
				Order = TopDown;
				R.clear();
				R.insert(Nodes.getNode(0));
				break;
				}
				for (SUnit::const_pred_iterator I = maxDepth->Preds.begin(),
				E = maxDepth->Preds.end();
				I != E; ++I) {
				if (Nodes.count(I->getSUnit()) == 0)
				continue;
				if (NodeOrder.count(I->getSUnit()) != 0)
				continue;
				if (I->getKind() == SDep::Anti)
				continue;
				R.insert(I->getSUnit());
				}
				// Back-edges are predecessors with an anti-dependence.
				for (SUnit::const_succ_iterator I = maxDepth->Succs.begin(),
				E = maxDepth->Succs.end();
				I != E; ++I) {
				if (I->getKind() != SDep::Anti)
				continue;
				if (Nodes.count(I->getSUnit()) == 0)
				continue;
				if (NodeOrder.count(I->getSUnit()) != 0)
				continue;
				R.insert(I->getSUnit());
				}
				}
				Order = TopDown;
				DEBUG(dbgs() << "\n Switching order to top down ");
				SmallSetVector<SUnit *, 8> N;
				if (succ_L(NodeOrder, N, &Nodes))
				R.insert(N.begin(), N.end());
				}
				}
				DEBUG(dbgs() << "\nDone with Nodeset\n");
				}

				DEBUG({
				dbgs() << "Node order: ";
sebpop: Both generateExistingPhis and generatePhis are passed the same parameters. Can we have this code factored up in a class: that would allow to split these two functions into smaller functions easier to follow.
bcahoon: Hi Sebastian - thanks for the comments. The code for generating Phis in the pipelined schedule really could use some work. It's a non-trivial effort though. I'll try to do some refactoring to improve it though.
				for (SetVector<SUnit *>::iterator I = NodeOrder.begin(),
				E = NodeOrder.end();
				I != E; ++I) {
				dbgs() << " " << (*I)->NodeNum << " ";
				}
				dbgs() << "\n";
				});
				}

				// Process the nodes in the computed order and create the pipelined
				// schedule of the instructions.
				bool SwingSchedulerDAG::schedulePipeline(SMSchedule &Schedule) {

				if (NodeOrder.size() == 0)
				return false;

				bool scheduleFound = false;
				// Keep increasing II until a valid schedule is found.
				for (unsigned II = MII; II < MII + 10 && !scheduleFound; ++II) {
				Schedule.reset();
				Schedule.setInitiationInterval(II);
				DEBUG(dbgs() << "Try to schedule with " << II << "\n");

				SetVector<SUnit *>::iterator NI = NodeOrder.begin();
				SetVector<SUnit *>::iterator NE = NodeOrder.end();
				do {
				SUnit *SU = *NI;

				// Compute the schedule time for the instruction, which is based
				// upon the scheduled time for any predecessors/successors.
				int EarlyStart = INT_MIN;
				int LateStart = INT_MAX;
				// These values are set when the size of the schedule window is limited
				// due to chain dependences.
				int SchedEnd = INT_MAX;
				int SchedStart = INT_MIN;
				Schedule.computeStart(SU, &EarlyStart, &LateStart, &SchedEnd, &SchedStart,
				II, this);
				DEBUG({
				dbgs() << "Inst (" << SU->NodeNum << ") ";
				SU->getInstr()->dump();
				dbgs() << "\n";
				});
				DEBUG({
				dbgs() << "\tes: " << EarlyStart << " ls: " << LateStart
				<< " me: " << SchedEnd << " ms: " << SchedStart << "\n";
				});

				if (EarlyStart > LateStart || SchedEnd < EarlyStart ||
				SchedStart > LateStart)
				scheduleFound = false;
				else if (EarlyStart != INT_MIN && LateStart == INT_MAX) {
				SchedEnd = std::min(SchedEnd, EarlyStart + (int)II - 1);
				scheduleFound = Schedule.insert(SU, EarlyStart, SchedEnd, II);
				} else if (EarlyStart == INT_MIN && LateStart != INT_MAX) {
				SchedStart = std::max(SchedStart, LateStart - (int)II + 1);
				scheduleFound = Schedule.insert(SU, LateStart, SchedStart, II);
				} else if (EarlyStart != INT_MIN && LateStart != INT_MAX) {
				SchedEnd =
				std::min(SchedEnd, std::min(LateStart, EarlyStart + (int)II - 1));
				// When scheduling a Phi it is better to start at the late cycle and go
				// backwards. The default order may insert the Phi too far away from
				// its first dependence.
				if (SU->getInstr()->isPHI())
				scheduleFound = Schedule.insert(SU, SchedEnd, EarlyStart, II);
				else
				scheduleFound = Schedule.insert(SU, EarlyStart, SchedEnd, II);
				} else {
				int FirstCycle = Schedule.getFirstCycle();
				scheduleFound = Schedule.insert(SU, FirstCycle + getASAP(SU),
				FirstCycle + getASAP(SU) + II - 1, II);
				}
				DEBUG({
				if (!scheduleFound)
				dbgs() << "\tCan't schedule\n";
				});
				} while (++NI != NE && scheduleFound);
				}

				DEBUG(dbgs() << "Schedule Found? " << scheduleFound << "\n");

				if (scheduleFound)
				Schedule.finalizeSchedule(this);
				else
				Schedule.reset();

				return scheduleFound && Schedule.getMaxStageCount() > 0;
				}
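
The case analysis in schedulePipeline can be read in isolation as a small window-selection function. The following is a minimal sketch, not code from the patch: `Window` and `pickWindow` are hypothetical names, and the sketch assumes that `Schedule.insert` tries cycles starting at its second argument and moving toward its third, which is what the Phi comment above suggests.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>

// Hypothetical helper type (not in the patch): the range of cycles to try
// for one node, searched from First toward Last.
struct Window {
  int First;
  int Last;
};

// Mirrors the branches in schedulePipeline. INT_MIN / INT_MAX mean "no
// constraint". Returns false when the constraints contradict each other, or
// when neither bound is set (the patch then falls back to the ASAP cycle).
bool pickWindow(int EarlyStart, int LateStart, int SchedStart, int SchedEnd,
                int II, bool IsPhi, Window &W) {
  assert(II > 0 && "initiation interval must be positive");
  if (EarlyStart > LateStart || SchedEnd < EarlyStart ||
      SchedStart > LateStart)
    return false;
  bool HasEarly = EarlyStart != INT_MIN;
  bool HasLate = LateStart != INT_MAX;
  if (HasEarly && !HasLate) {
    // Only predecessors are scheduled: search forward over at most II cycles.
    W = Window{EarlyStart, std::min(SchedEnd, EarlyStart + II - 1)};
    return true;
  }
  if (!HasEarly && HasLate) {
    // Only successors are scheduled: search backward over at most II cycles.
    W = Window{LateStart, std::max(SchedStart, LateStart - II + 1)};
    return true;
  }
  if (HasEarly && HasLate) {
    int End = std::min(SchedEnd, std::min(LateStart, EarlyStart + II - 1));
    // Phis are tried from the late end backward; everything else forward.
    W = IsPhi ? Window{End, EarlyStart} : Window{EarlyStart, End};
    return true;
  }
  return false;
}
```

Separating the window choice from the insertion attempt would make each branch testable without building a DAG, at the cost of one extra small type.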

				// Given a schedule for the loop, generate a new version of the loop,
				// and replace the old version. This function generates a prolog
				// that contains the initial iterations in the pipeline, a kernel
				// loop, and an epilog that contains the code for the final
				// iterations.
				void SwingSchedulerDAG::generatePipelinedLoop(SMSchedule &Schedule) {
				// Create a new basic block for the kernel and add it to the CFG.
				MachineBasicBlock *KernelBB = MF.CreateMachineBasicBlock(BB->getBasicBlock());

				unsigned MaxStageCount = Schedule.getMaxStageCount();

				// Remember the registers that are used in different stages. The index is
				// the iteration, or stage, that the instruction is scheduled in. This is
				// a map between register names in the original block and the names created
				// in each stage of the pipelined loop.
				ValueMapTy *VRMap = new ValueMapTy[(MaxStageCount + 1) * 2];
				InstrMapTy InstrMap;

				SmallVector<MachineBasicBlock *, 4> PrologBBs;
				// Generate the prolog instructions that set up the pipeline.
				generateProlog(Schedule, MaxStageCount, KernelBB, VRMap, PrologBBs);
				MF.insert(BB->getIterator(), KernelBB);

				// Rearrange the instructions to generate the new, pipelined loop,
				// and update register names as needed.
				for (int Cycle = Schedule.getFirstCycle(),
				LastCycle = Schedule.getFinalCycle();
				Cycle <= LastCycle; ++Cycle) {
				std::deque<SUnit *> &CycleInstrs = Schedule.getInstructions(Cycle);
				// This inner loop schedules each instruction in the cycle.
				for (std::deque<SUnit *>::iterator CI = CycleInstrs.begin(),
				ECI = CycleInstrs.end();
				CI != ECI; ++CI) {
				if ((*CI)->getInstr()->isPHI())
				continue;
				unsigned StageNum = Schedule.stageScheduled(getSUnit((*CI)->getInstr()));
				MachineInstr *NewMI =
				cloneInstr((*CI)->getInstr(), MaxStageCount, StageNum);
				updateInstruction(NewMI, false, MaxStageCount, StageNum, Schedule, VRMap);
				KernelBB->push_back(NewMI);
				InstrMap[NewMI] = (*CI)->getInstr();
				}
				}

				// Copy any terminator instructions to the new kernel, and update
				// names as needed.
				for (MachineBasicBlock::iterator I = BB->getFirstTerminator(),
				E = BB->instr_end();
				I != E; ++I) {
				MachineInstr *NewMI = MF.CloneMachineInstr(I);
				updateInstruction(NewMI, false, MaxStageCount, 0, Schedule, VRMap);
				KernelBB->push_back(NewMI);
				InstrMap[NewMI] = I;
				}

				KernelBB->transferSuccessors(BB);
				KernelBB->replaceSuccessor(BB, KernelBB);

				generateExistingPhis(KernelBB, PrologBBs.back(), KernelBB, KernelBB, Schedule,
				VRMap, InstrMap, MaxStageCount, MaxStageCount, false);
				generatePhis(KernelBB, PrologBBs.back(), KernelBB, KernelBB, Schedule, VRMap,
				InstrMap, MaxStageCount, MaxStageCount, false);

				DEBUG(dbgs() << "New block\n"; KernelBB->dump(););

				SmallVector<MachineBasicBlock *, 4> EpilogBBs;
				// Generate the epilog instructions to complete the pipeline.
				generateEpilog(Schedule, MaxStageCount, KernelBB, VRMap, EpilogBBs,
				PrologBBs);

				// We need this step because the register allocation doesn't handle some
				// situations well, so we insert copies to help out.
				splitLifetimes(KernelBB, EpilogBBs, Schedule);

				// Remove dead instructions due to loop induction variables.
				removeDeadInstructions(KernelBB, EpilogBBs);

				// Add branches between prolog and epilog blocks.
				addBranches(PrologBBs, KernelBB, EpilogBBs, Schedule, VRMap);

				// Remove the original loop since it's no longer referenced.
				BB->clear();
				BB->eraseFromParent();

				delete[] VRMap;
				}
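
The review asks for the VRMap indexing to be hidden behind a set/get interface. One possible shape is sketched below under stated assumptions: `StageRegMap` is a hypothetical name, `std::map` stands in for the patch's `ValueMapTy`, and the slot count follows the `(MaxStageCount + 1) * 2` allocation in generatePipelinedLoop (prolog and kernel stages use indices `0..MaxStageCount`, epilog stages continue from `MaxStageCount + 1`).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical wrapper (not in the patch) around the per-stage register maps.
// Owning the storage in a std::vector also removes the manual new[]/delete[].
class StageRegMap {
  std::vector<std::map<unsigned, unsigned>> Maps;

public:
  explicit StageRegMap(unsigned MaxStageCount)
      : Maps((static_cast<std::size_t>(MaxStageCount) + 1) * 2) {}

  std::size_t numStages() const { return Maps.size(); }

  // True if OrigReg has been renamed in the given stage.
  bool has(unsigned Stage, unsigned OrigReg) const {
    assert(Stage < Maps.size() && "stage index out of range");
    return Maps[Stage].count(OrigReg) != 0;
  }

  // Returns the renamed register, or 0 if none is recorded. Unlike
  // operator[] on the raw map, a lookup never inserts an entry.
  unsigned get(unsigned Stage, unsigned OrigReg) const {
    assert(Stage < Maps.size() && "stage index out of range");
    auto It = Maps[Stage].find(OrigReg);
    return It == Maps[Stage].end() ? 0 : It->second;
  }

  void set(unsigned Stage, unsigned OrigReg, unsigned NewReg) {
    assert(Stage < Maps.size() && "stage index out of range");
    Maps[Stage][OrigReg] = NewReg;
  }
};
```

With an interface like this, the bounds assertions would catch an out-of-range stage index such as `PrevStage - np` underflowing, which the raw array silently permits.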

				// Generate the pipeline prolog code.
				void SwingSchedulerDAG::generateProlog(SMSchedule &Schedule, unsigned LastStage,
				MachineBasicBlock *KernelBB,
				ValueMapTy *VRMap,
				MBBVectorTy &PrologBBs) {
				MachineBasicBlock *PreheaderBB = MLI->getLoopFor(BB)->getLoopPreheader();
				assert(PreheaderBB != NULL &&
				"Need to add code to handle loops w/o preheader");
				MachineBasicBlock *PredBB = PreheaderBB;
				InstrMapTy InstrMap;

				// Generate a basic block for each stage, not including the last stage,
				// which will be generated in the kernel. Each basic block may contain
				// instructions from multiple stages/iterations.
				for (unsigned i = 0; i < LastStage; ++i) {
				// Create and insert the prolog basic block prior to the original loop
				// basic block. The original loop is removed later.
				MachineBasicBlock *NewBB = MF.CreateMachineBasicBlock(BB->getBasicBlock());
				PrologBBs.push_back(NewBB);
				MF.insert(BB->getIterator(), NewBB);
				NewBB->transferSuccessors(PredBB);
				PredBB->addSuccessor(NewBB);
				PredBB = NewBB;

				// Generate instructions for each appropriate stage. Process instructions
				// in original program order.
				for (int StageNum = i; StageNum >= 0; --StageNum) {
				for (MachineBasicBlock::iterator BBI = BB->instr_begin(),
				BBE = BB->getFirstTerminator();
				BBI != BBE; ++BBI) {
				if (Schedule.isScheduledAtStage(getSUnit(BBI), (unsigned)StageNum)) {
				if (BBI->isPHI())
				continue;
				MachineInstr *NewMI =
				cloneAndChangeInstr(BBI, i, (unsigned)StageNum, Schedule);
				updateInstruction(NewMI, false, i, (unsigned)StageNum, Schedule,
				VRMap);
				NewBB->push_back(NewMI);
				InstrMap[NewMI] = BBI;
				}
				}
				}
				rewritePhiValues(NewBB, i, Schedule, VRMap, InstrMap);
				DEBUG({
				dbgs() << "prolog:\n";
				NewBB->dump();
				});
				}

				PredBB->replaceSuccessor(BB, KernelBB);

				// Check if we need to remove the branch from the preheader to the original
				// loop, and replace it with a branch to the new loop.
				unsigned numBranches = TII->RemoveBranch(*PreheaderBB);
				if (numBranches) {
				SmallVector<MachineOperand, 0> Cond;
				TII->InsertBranch(*PreheaderBB, PrologBBs[0], 0, Cond, DebugLoc());
				}
				}

				// Generate the pipeline epilog code. The epilog code finishes the iterations
				// that were started in either the prolog or the kernel. We create a basic
				// block for each stage that needs to complete.
				void SwingSchedulerDAG::generateEpilog(SMSchedule &Schedule, unsigned LastStage,
				MachineBasicBlock *KernelBB,
				ValueMapTy *VRMap,
				MBBVectorTy &EpilogBBs,
				MBBVectorTy &PrologBBs) {
				// We need to change the branch from the kernel to the first epilog block, so
				// this call to analyze branch uses the kernel rather than the original BB.
				MachineBasicBlock *TBB = nullptr, *FBB = nullptr;
				SmallVector<MachineOperand, 4> Cond;
				bool checkBranch = TII->AnalyzeBranch(*KernelBB, TBB, FBB, Cond);
				assert(!checkBranch && "generateEpilog must be able to analyze the branch");
				if (checkBranch)
				return;

				MachineBasicBlock::succ_iterator LoopExitI = KernelBB->succ_begin();
				if (*LoopExitI == KernelBB)
				++LoopExitI;
				assert(LoopExitI != KernelBB->succ_end() && "Expecting a successor");
				MachineBasicBlock *LoopExitBB = *LoopExitI;

				MachineBasicBlock *PredBB = KernelBB;
				MachineBasicBlock *EpilogStart = LoopExitBB;
				InstrMapTy InstrMap;

				// Generate a basic block for each stage, not including the last stage,
				// which was generated for the kernel. Each basic block may contain
				// instructions from multiple stages/iterations.
				int EpilogStage = LastStage + 1;
				for (unsigned i = LastStage; i >= 1; --i, ++EpilogStage) {
				MachineBasicBlock *NewBB = MF.CreateMachineBasicBlock();
				EpilogBBs.push_back(NewBB);
				MF.insert(BB->getIterator(), NewBB);

				PredBB->replaceSuccessor(LoopExitBB, NewBB);
				NewBB->addSuccessor(LoopExitBB);

				if (EpilogStart == LoopExitBB)
				EpilogStart = NewBB;

				// Add instructions to the epilog depending on the current block.
				// Process instructions in original program order.
				for (unsigned StageNum = i; StageNum <= LastStage; ++StageNum) {
				for (auto &BBI : *BB) {
				if (BBI.isPHI())
				continue;
				MachineInstr *In = &BBI;
				if (Schedule.isScheduledAtStage(getSUnit(In), StageNum)) {
sebpop: Could we have the code of this loop split up into smaller functions?
				MachineInstr *NewMI = cloneInstr(In, EpilogStage - LastStage, 0);
				updateInstruction(NewMI, i == 1, EpilogStage, 0, Schedule, VRMap);
				NewBB->push_back(NewMI);
				InstrMap[NewMI] = In;
				}
sebpop: I find the indexing in VRMap to be difficult to follow: could we have the arithmetic hidden behind some set/get interface for the prolog/epilog?
				}
				}
				generateExistingPhis(NewBB, PrologBBs[i - 1], PredBB, KernelBB, Schedule,
				VRMap, InstrMap, LastStage, EpilogStage, i == 1);
				generatePhis(NewBB, PrologBBs[i - 1], PredBB, KernelBB, Schedule, VRMap,
				InstrMap, LastStage, EpilogStage, i == 1);
				PredBB = NewBB;

				DEBUG({
				dbgs() << "epilog:\n";
				NewBB->dump();
				});
				}

				// Fix any Phi nodes in the loop exit block.
				for (MachineBasicBlock::instr_iterator MI = LoopExitBB->instr_begin(),
				ME = LoopExitBB->instr_end();
				MI != ME && MI->isPHI(); ++MI)
				for (unsigned i = 2, e = MI->getNumOperands() + 1; i != e; i += 2) {
				MachineOperand &MO = MI->getOperand(i);
				if (MO.getMBB() == BB)
				MO.setMBB(PredBB);
				}

				// Create a branch to the new epilog from the kernel.
				// Remove the original branch and add a new branch to the epilog.
				TII->RemoveBranch(*KernelBB);
				TII->InsertBranch(*KernelBB, KernelBB, EpilogStart, Cond, DebugLoc());
				// Add a branch to the loop exit.
				if (EpilogBBs.size() > 0) {
				MachineBasicBlock *LastEpilogBB = EpilogBBs.back();
				SmallVector<MachineOperand, 4> Cond1;
				TII->InsertBranch(*LastEpilogBB, LoopExitBB, 0, Cond1, DebugLoc());
				}
				}
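
To see which stages end up in which block, the loop bounds of generateProlog and generateEpilog can be replayed without any machine IR. This is only an illustrative sketch: `prologStages` and `epilogStages` are hypothetical helpers that list, for each generated block, the stage numbers whose instructions are copied into it, in the order the loops visit them.

```cpp
#include <cassert>
#include <vector>

// Prolog block i (0 <= i < LastStage) receives stages i, i-1, ..., 0,
// matching the "StageNum = i; StageNum >= 0; --StageNum" loop.
std::vector<std::vector<unsigned>> prologStages(unsigned LastStage) {
  std::vector<std::vector<unsigned>> Blocks;
  for (unsigned i = 0; i < LastStage; ++i) {
    std::vector<unsigned> Stages;
    for (int S = static_cast<int>(i); S >= 0; --S)
      Stages.push_back(static_cast<unsigned>(S));
    Blocks.push_back(Stages);
  }
  return Blocks;
}

// Epilog blocks are created for i = LastStage down to 1, and each receives
// stages i, i+1, ..., LastStage.
std::vector<std::vector<unsigned>> epilogStages(unsigned LastStage) {
  std::vector<std::vector<unsigned>> Blocks;
  for (unsigned i = LastStage; i >= 1; --i) {
    std::vector<unsigned> Stages;
    for (unsigned S = i; S <= LastStage; ++S)
      Stages.push_back(S);
    Blocks.push_back(Stages);
  }
  return Blocks;
}
```

For `LastStage == 2` this gives prolog blocks with stages {0} and {1, 0}, and epilog blocks with stages {2} and {1, 2}: each prolog block starts one more iteration, and each epilog block finishes one.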

				/// Replace all uses of FromReg that appear outside the specified
				/// basic block with ToReg.
				static void replaceRegUsesAfterLoop(unsigned FromReg, unsigned ToReg,
				MachineBasicBlock *MBB,
				MachineRegisterInfo &MRI,
				LiveIntervals *LIS) {
				for (MachineRegisterInfo::use_iterator I = MRI.use_begin(FromReg),
				E = MRI.use_end();
				I != E;) {
				MachineOperand &O = *I;
				++I;
				if (O.getParent()->getParent() != MBB)
				O.setReg(ToReg);
				}
				if (!LIS->hasInterval(ToReg))
				LIS->createEmptyInterval(ToReg);
				}

				/// Return true if the register has a use that occurs outside the
				/// specified loop.
				static bool hasUseAfterLoop(unsigned Reg, MachineBasicBlock *BB,
				MachineRegisterInfo &MRI) {
				for (MachineRegisterInfo::use_iterator I = MRI.use_begin(Reg),
				E = MRI.use_end();
				I != E; ++I)
				if (I->getParent()->getParent() != BB)
				return true;
				return false;
				}

				/// Generate Phis for the specific block in the generated pipelined code.
				/// This function looks at the Phis from the original code to guide the
				/// creation of new Phis.
				void SwingSchedulerDAG::generateExistingPhis(
				MachineBasicBlock *NewBB, MachineBasicBlock *BB1, MachineBasicBlock *BB2,
				MachineBasicBlock *KernelBB, SMSchedule &Schedule, ValueMapTy *VRMap,
				InstrMapTy &InstrMap, unsigned LastStageNum, unsigned CurStageNum,
				bool IsLast) {
				// Compute the stage number for the initial value of the Phi, which
				// comes from the prolog. The prolog to use depends on which kernel or
				// epilog block we're adding the Phi to.
				unsigned PrologStage = 0;
				unsigned PrevStage = 0;
				bool InKernel = (LastStageNum == CurStageNum);
				if (InKernel) {
				PrologStage = LastStageNum - 1;
				PrevStage = CurStageNum;
				} else {
				PrologStage = LastStageNum - (CurStageNum - LastStageNum);
				PrevStage = LastStageNum + (CurStageNum - LastStageNum) - 1;
				}

				for (MachineBasicBlock::iterator BBI = BB->instr_begin(),
				BBE = BB->getFirstNonPHI();
				BBI != BBE; ++BBI) {
				unsigned Def = BBI->getOperand(0).getReg();

				unsigned InitVal = 0;
				unsigned LoopVal = 0;
				getPhiRegs(BBI, BB, InitVal, LoopVal);

				unsigned PhiOp1 = 0;
				// The Phi value from the loop body typically is defined in the loop, but
				// not always. So, we need to check if the value is defined in the loop.
				unsigned PhiOp2 = LoopVal;
				if (VRMap[LastStageNum].count(LoopVal))
				PhiOp2 = VRMap[LastStageNum][LoopVal];

				int StageScheduled = Schedule.stageScheduled(getSUnit(BBI));
				int LoopValStage =
				Schedule.stageScheduled(getSUnit(MRI.getVRegDef(LoopVal)));
				unsigned NumStages = Schedule.getStagesForReg(Def, CurStageNum);
				if (NumStages == 0) {
				// We don't need to generate a Phi anymore, but we need to rename any uses
				// of the Phi value.
				unsigned NewReg = VRMap[PrevStage][LoopVal];
				rewriteScheduledInstr(NewBB, Schedule, InstrMap, CurStageNum, 0, BBI,
				Def, NewReg);
				if (VRMap[CurStageNum].count(LoopVal))
				VRMap[CurStageNum][Def] = VRMap[CurStageNum][LoopVal];
				}
				// The number of Phis can't exceed the number of prolog stages. We add 2
				// because the prolog stage number is zero based, and there is a
				// "stage"/Phi for the original Phi instruction.
				unsigned NumPhis = NumStages;
				if (StageScheduled <= (int)PrologStage && NumStages > PrologStage + 2)
				NumPhis = PrologStage + 2;
				else if (LoopValStage > (int)PrologStage && NumStages > PrologStage + 1)
				NumPhis = PrologStage + 1;
				unsigned NewReg = 0;

				unsigned AccessStage = (LoopValStage != -1) ? LoopValStage : StageScheduled;
				// In the epilog, we may need to look back one stage to get the correct
				// Phi name because the epilog and prolog blocks execute the same stage.
				// The correct name is from the previous block only when the Phi has
				// been completely scheduled prior to the epilog, and Phi value is not
				// needed in multiple stages.
				int StageDiff = 0;
				if (!InKernel && StageScheduled >= LoopValStage && AccessStage == 0 &&
				NumPhis == 1)
				StageDiff = 1;
				// Adjust the computations below when the phi and the loop definition
				// are scheduled in different stages.
				if (InKernel && LoopValStage != -1 && StageScheduled > LoopValStage)
				StageDiff = StageScheduled - LoopValStage;
				for (unsigned np = 0; np < NumPhis; ++np) {
				// If the Phi hasn't been scheduled, then use the initial Phi operand
				// value. Otherwise, use the scheduled version of the instruction. This
				// is a little complicated when a Phi references another Phi.
				if (np > PrologStage || StageScheduled >= (int)LastStageNum)
				PhiOp1 = InitVal;
				// Check if the Phi has already been scheduled in a prolog stage.
				else if (PrologStage >= AccessStage + StageDiff + np &&
				VRMap[PrologStage - StageDiff - np].count(LoopVal) != 0)
				PhiOp1 = VRMap[PrologStage - StageDiff - np][LoopVal];
				// Check if the Phi has already been scheduled, but the loop instruction
				// is either another Phi, or doesn't occur in the loop.
				else if (PrologStage >= AccessStage + StageDiff + np) {
				// If the Phi references another Phi, we need to examine the other
				// Phi to get the correct value.
				PhiOp1 = LoopVal;
				MachineInstr *InstOp1 = MRI.getVRegDef(PhiOp1);
				int Indirects = 1;
				while (InstOp1 && InstOp1->isPHI() && InstOp1->getParent() == BB) {
				int PhiStage = Schedule.stageScheduled(getSUnit(InstOp1));
				if ((int)(PrologStage - StageDiff - np) < PhiStage + Indirects)
				PhiOp1 = getInitPhiReg(InstOp1, BB);
				else
				PhiOp1 = getLoopPhiReg(InstOp1, BB);
				InstOp1 = MRI.getVRegDef(PhiOp1);
				int PhiOpStage = Schedule.stageScheduled(getSUnit(InstOp1));
				if (PhiOpStage != -1 && PrologStage + PhiOpStage >= Indirects + np &&
				VRMap[PrologStage + PhiOpStage - Indirects - np].count(PhiOp1)) {
				PhiOp1 = VRMap[PrologStage + PhiOpStage - Indirects - np][PhiOp1];
				break;
				}
				++Indirects;
				}
				} else
				PhiOp1 = InitVal;
				// If this references a generated Phi in the kernel, get the Phi operand
				// from the incoming block.
				if (MachineInstr *InstOp1 = MRI.getVRegDef(PhiOp1))
				if (InstOp1->isPHI() && InstOp1->getParent() == KernelBB)
				PhiOp1 = getInitPhiReg(InstOp1, KernelBB);

				MachineInstr *PhiInst = MRI.getVRegDef(LoopVal);
				bool LoopDefIsPhi = PhiInst && PhiInst->isPHI();
				// In the epilog, a map lookup is needed to get the value from the kernel,
				// or previous epilog block. How this is done depends on whether the
				// instruction is scheduled in the previous block.
				if (!InKernel) {
				int StageDiffAdj = 0;
				if (LoopValStage != -1 && StageScheduled > LoopValStage)
				StageDiffAdj = StageScheduled - LoopValStage;
				// Use the loop value defined in the kernel, unless the kernel
				// contains the last definition of the Phi.
				if (np == 0 && PrevStage == LastStageNum &&
				(StageScheduled != 0 || LoopValStage != 0) &&
				VRMap[PrevStage - StageDiffAdj].count(LoopVal))
				PhiOp2 = VRMap[PrevStage - StageDiffAdj][LoopVal];
				// Use the value defined by the Phi. We add one because we switch
				// from looking at the loop value to the Phi definition.
				else if (np > 0 && PrevStage == LastStageNum &&
				VRMap[PrevStage - np + 1].count(Def))
				PhiOp2 = VRMap[PrevStage - np + 1][Def];
				// Use the loop value defined in the kernel.
				else if ((unsigned)LoopValStage + StageDiffAdj > PrologStage + 1 &&
				VRMap[PrevStage - StageDiffAdj - np].count(LoopVal))
				PhiOp2 = VRMap[PrevStage - StageDiffAdj - np][LoopVal];
				// Use the value defined by the Phi, unless we're generating the first
				// epilog and the Phi refers to a Phi in a different stage.
				else if (VRMap[PrevStage - np].count(Def) &&
				(!LoopDefIsPhi || PrevStage != LastStageNum))
				PhiOp2 = VRMap[PrevStage - np][Def];
				}

				// Check if we can reuse an existing Phi. This occurs when a Phi
				// references another Phi, and the other Phi is scheduled in an
				// earlier stage. We can try to reuse an existing Phi up until the last
				// stage of the current Phi.
				if (LoopDefIsPhi && VRMap[CurStageNum].count(LoopVal) &&
				LoopValStage >= (int)(CurStageNum - LastStageNum)) {
				int LVNumStages = Schedule.getStagesForPhi(LoopVal);
				int StageDiff = (StageScheduled - LoopValStage);
				LVNumStages -= StageDiff;
				if (LVNumStages > (int)np) {
				NewReg = PhiOp2;
				unsigned ReuseStage = CurStageNum;
				if (Schedule.isLoopCarried(this, PhiInst))
				ReuseStage -= LVNumStages;
				// Check if the Phi to reuse has been generated yet. If not, then
				// there is nothing to reuse.
				if (VRMap[ReuseStage].count(LoopVal)) {
				NewReg = VRMap[ReuseStage][LoopVal];

				rewriteScheduledInstr(NewBB, Schedule, InstrMap, CurStageNum, np,
				BBI, Def, NewReg);
				// Update the map with the new Phi name.
				VRMap[CurStageNum - np][Def] = NewReg;
				PhiOp2 = NewReg;
				if (VRMap[LastStageNum - np - 1].count(LoopVal))
				PhiOp2 = VRMap[LastStageNum - np - 1][LoopVal];

				if (IsLast && np == NumPhis - 1)
				replaceRegUsesAfterLoop(Def, NewReg, BB, MRI, LIS);
				continue;
				}
				} else if (StageDiff > 0 &&
				VRMap[CurStageNum - StageDiff - np].count(LoopVal))
				PhiOp2 = VRMap[CurStageNum - StageDiff - np][LoopVal];
				}

				const TargetRegisterClass *RC = MRI.getRegClass(Def);
				NewReg = MRI.createVirtualRegister(RC);

				MachineInstrBuilder NewPhi =
				BuildMI(*NewBB, NewBB->getFirstNonPHI(), DebugLoc(),
				TII->get(TargetOpcode::PHI), NewReg);
				NewPhi.addReg(PhiOp1).addMBB(BB1);
				NewPhi.addReg(PhiOp2).addMBB(BB2);
				if (np == 0)
				InstrMap[NewPhi] = BBI;

				// We define the Phis after creating the new pipelined code, so
				// we need to rename the Phi values in scheduled instructions.

				unsigned PrevReg = 0;
				if (InKernel && VRMap[PrevStage - np].count(LoopVal))
				PrevReg = VRMap[PrevStage - np][LoopVal];
				rewriteScheduledInstr(NewBB, Schedule, InstrMap, CurStageNum, np, BBI,
				Def, NewReg, PrevReg);
				// If the Phi has been scheduled, use the new name for rewriting.
				if (VRMap[CurStageNum - np].count(Def)) {
				unsigned R = VRMap[CurStageNum - np][Def];
				rewriteScheduledInstr(NewBB, Schedule, InstrMap, CurStageNum, np, BBI,
				R, NewReg);
				}

				// Check if we need to rename any uses that occur after the loop. The
				// register to replace depends on whether the Phi is scheduled in the
				// epilog.
				if (IsLast && np == NumPhis - 1)
				replaceRegUsesAfterLoop(Def, NewReg, BB, MRI, LIS);

				// In the kernel, a dependent Phi uses the value from this Phi.
				if (InKernel)
				PhiOp2 = NewReg;

				// Update the map with the new Phi name.
				VRMap[CurStageNum - np][Def] = NewReg;
				}

				while (NumPhis++ < NumStages) {
				rewriteScheduledInstr(NewBB, Schedule, InstrMap, CurStageNum, NumPhis,
				BBI, Def, NewReg, 0);
				}

				// Check if we need to rename a Phi that has been eliminated due to
				// scheduling.
				if (NumStages == 0 && IsLast && VRMap[CurStageNum].count(LoopVal))
				replaceRegUsesAfterLoop(Def, VRMap[CurStageNum][LoopVal], BB, MRI, LIS);
				}
				}

				/// Generate Phis for the specified block in the generated pipelined code.
				/// These are new Phis needed because the definition is scheduled after the
				/// use in the pipelined sequence.
				void SwingSchedulerDAG::generatePhis(
				MachineBasicBlock *NewBB, MachineBasicBlock *BB1, MachineBasicBlock *BB2,
				MachineBasicBlock *KernelBB, SMSchedule &Schedule, ValueMapTy *VRMap,
				InstrMapTy &InstrMap, unsigned LastStageNum, unsigned CurStageNum,
				bool IsLast) {
				// Compute the stage number that contains the initial Phi value, and
				// the Phi from the previous stage.
				unsigned PrologStage = 0;
				unsigned PrevStage = 0;
				unsigned StageDiff = CurStageNum - LastStageNum;
				bool InKernel = (StageDiff == 0);
				if (InKernel) {
				PrologStage = LastStageNum - 1;
				PrevStage = CurStageNum;
				} else {
				PrologStage = LastStageNum - StageDiff;
				PrevStage = LastStageNum + StageDiff - 1;
				}

				for (MachineBasicBlock::iterator BBI = BB->getFirstNonPHI(),
				BBE = BB->instr_end();
				BBI != BBE; ++BBI) {
				for (unsigned i = 0, e = BBI->getNumOperands(); i != e; ++i) {
				MachineOperand &MO = BBI->getOperand(i);
				if (!MO.isReg() || !MO.isDef() ||
				!TargetRegisterInfo::isVirtualRegister(MO.getReg()))
				continue;

				int StageScheduled = Schedule.stageScheduled(getSUnit(BBI));
				assert(StageScheduled != -1 && "Expecting scheduled instruction.");
				unsigned Def = MO.getReg();
				unsigned NumPhis = Schedule.getStagesForReg(Def, CurStageNum);
				// An instruction that is scheduled in stage 0 and used after the loop
				// requires a phi in the epilog for the last definition from either
				// the kernel or prolog.
				if (!InKernel && NumPhis == 0 && StageScheduled == 0 &&
				hasUseAfterLoop(Def, BB, MRI))
				NumPhis = 1;
				if (!InKernel && (unsigned)StageScheduled > PrologStage)
				continue;

				unsigned PhiOp2 = VRMap[PrevStage][Def];
				if (MachineInstr *InstOp2 = MRI.getVRegDef(PhiOp2))
				if (InstOp2->isPHI() && InstOp2->getParent() == NewBB)
				PhiOp2 = getLoopPhiReg(InstOp2, BB2);
				// The number of Phis can't exceed the number of prolog stages. The
				// prolog stage number is zero based.
				if (NumPhis > PrologStage + 1 - StageScheduled)
				NumPhis = PrologStage + 1 - StageScheduled;
				for (unsigned np = 0; np < NumPhis; ++np) {
				unsigned PhiOp1 = VRMap[PrologStage][Def];
				if (np <= PrologStage)
				PhiOp1 = VRMap[PrologStage - np][Def];
				if (MachineInstr *InstOp1 = MRI.getVRegDef(PhiOp1)) {
				if (InstOp1->isPHI() && InstOp1->getParent() == KernelBB)
				PhiOp1 = getInitPhiReg(InstOp1, KernelBB);
				if (InstOp1->isPHI() && InstOp1->getParent() == NewBB)
				PhiOp1 = getInitPhiReg(InstOp1, NewBB);
				}
				if (!InKernel)
				PhiOp2 = VRMap[PrevStage - np][Def];

				const TargetRegisterClass *RC = MRI.getRegClass(Def);
				unsigned NewReg = MRI.createVirtualRegister(RC);

				MachineInstrBuilder NewPhi =
				BuildMI(*NewBB, NewBB->getFirstNonPHI(), DebugLoc(),
				TII->get(TargetOpcode::PHI), NewReg);
				NewPhi.addReg(PhiOp1).addMBB(BB1);
				NewPhi.addReg(PhiOp2).addMBB(BB2);
				if (np == 0)
				InstrMap[NewPhi] = BBI;

				// Rewrite uses and update the map. The actions depend upon whether
				// we are generating code for the kernel or epilog blocks.
				if (InKernel) {
				rewriteScheduledInstr(NewBB, Schedule, InstrMap, CurStageNum, np,
				BBI, PhiOp1, NewReg);
				rewriteScheduledInstr(NewBB, Schedule, InstrMap, CurStageNum, np,
				BBI, PhiOp2, NewReg);

				PhiOp2 = NewReg;
				VRMap[PrevStage - np - 1][Def] = NewReg;
				} else {
				VRMap[CurStageNum - np][Def] = NewReg;
				if (np == NumPhis - 1)
				rewriteScheduledInstr(NewBB, Schedule, InstrMap, CurStageNum, np,
				BBI, Def, NewReg);
				}
				if (IsLast && np == NumPhis - 1)
				replaceRegUsesAfterLoop(Def, NewReg, BB, MRI, LIS);
				}
				}
				}
				}
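
Both generateExistingPhis and generatePhis begin by deriving the same two indices from `LastStageNum` and `CurStageNum`. As one small step toward the factoring requested in review, that computation could live in a shared helper. The sketch below is an assumption about how that might look, not part of the patch; `PhiStageInfo` and `computePhiStageInfo` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical result type (not in the patch).
struct PhiStageInfo {
  bool InKernel;        // True when generating Phis for the kernel block.
  unsigned PrologStage; // Stage index that supplies the initial Phi value.
  unsigned PrevStage;   // Stage index that supplies the loop-carried value.
};

// Reproduces the index arithmetic at the top of both Phi generators.
// In the kernel, CurStageNum == LastStageNum; in the epilog blocks,
// CurStageNum runs from LastStageNum + 1 up to 2 * LastStageNum.
PhiStageInfo computePhiStageInfo(unsigned LastStageNum, unsigned CurStageNum) {
  assert(LastStageNum >= 1 && "a pipelined loop has at least two stages");
  assert(CurStageNum >= LastStageNum && CurStageNum <= 2 * LastStageNum &&
         "stage index outside the kernel/epilog range");
  unsigned StageDiff = CurStageNum - LastStageNum;
  PhiStageInfo Info;
  Info.InKernel = (StageDiff == 0);
  if (Info.InKernel) {
    Info.PrologStage = LastStageNum - 1;
    Info.PrevStage = CurStageNum;
  } else {
    Info.PrologStage = LastStageNum - StageDiff;
    Info.PrevStage = LastStageNum + StageDiff - 1;
  }
  return Info;
}
```

The assertions make the unsigned subtractions safe to reason about: within the stated range neither `LastStageNum - 1` nor `LastStageNum - StageDiff` can wrap around.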

				// Remove instructions that generate values with no uses.
				// Typically, these are induction variable operations that generate values
				// used in the loop itself. A dead instruction has a definition with
				// no uses, or uses that occur in the original loop only.
				void SwingSchedulerDAG::removeDeadInstructions(MachineBasicBlock *KernelBB,
				MBBVectorTy &EpilogBBs) {
				// For each epilog block, check that the value defined by each instruction
				// is used. If not, delete it.
				for (MBBVectorTy::reverse_iterator MBB = EpilogBBs.rbegin(),
				MBE = EpilogBBs.rend();
				MBB != MBE; ++MBB)
				for (MachineBasicBlock::reverse_instr_iterator MI = (*MBB)->instr_rbegin(),
				ME = (*MBB)->instr_rend();
				MI != ME;) {
				// From DeadMachineInstructionElim. Don't delete inline assembly.
				if (MI->isInlineAsm()) {
				++MI;
				continue;
				}
				bool SawStore = false;
				// Check if it's safe to remove the instruction due to side effects.
				// We can, and want to, remove Phis here.
				if (!MI->isSafeToMove(nullptr, SawStore) && !MI->isPHI()) {
				++MI;
				continue;
				}
				bool used = true;
				for (MachineInstr::mop_iterator MOI = MI->operands_begin(),
				MOE = MI->operands_end();
				MOI != MOE; ++MOI) {
				if (!MOI->isReg() || !MOI->isDef())
				continue;
				unsigned reg = MOI->getReg();
				unsigned realUses = 0;
				for (MachineRegisterInfo::use_iterator UI = MRI.use_begin(reg),
				EI = MRI.use_end();
				UI != EI; ++UI) {
				// Check if there are any uses that occur only in the original
				// loop. If so, that's not a real use.
				if (UI->getParent()->getParent() != BB) {
				realUses++;
				used = true;
				break;
				}
				}
				if (realUses > 0)
				break;
				used = false;
				}
				if (!used) {
				MI->eraseFromParent();
				ME = (*MBB)->instr_rend();
				continue;
				}
				++MI;
				}
				// In the kernel block, check if we can remove a Phi that generates a value
				// used in an instruction removed in the epilog block.
				for (MachineBasicBlock::iterator BBI = KernelBB->instr_begin(),
				BBE = KernelBB->getFirstNonPHI();
				BBI != BBE;) {
				MachineInstr *MI = &*BBI;
				++BBI;
				unsigned reg = MI->getOperand(0).getReg();
				if (MRI.use_begin(reg) == MRI.use_end()) {
				MI->eraseFromParent();
				}
				}
				}

				/// For loop carried definitions, we split the lifetime of a virtual register
				/// that has uses past the definition in the next iteration. A copy with a new
				/// virtual register is inserted before the definition, which helps with
				/// generating a better register assignment.
				///
				///   v1 = phi(a, v2)     v1 = phi(a, v2)
				///   v2 = phi(b, v3)     v2 = phi(b, v3)
				///   v3 = ..             v4 = copy v1
				///   .. = v1             v3 = ..
				///                       .. = v4
				void SwingSchedulerDAG::splitLifetimes(MachineBasicBlock *KernelBB,
				MBBVectorTy &EpilogBBs,
				SMSchedule &Schedule) {
				const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
				for (MachineBasicBlock::iterator BBI = KernelBB->instr_begin(),
				BBF = KernelBB->getFirstNonPHI();
				BBI != BBF; ++BBI) {
				unsigned Def = BBI->getOperand(0).getReg();
				// Check for any Phi definition that is used as an operand of another Phi
				// in the same block.
				for (MachineRegisterInfo::use_instr_iterator I = MRI.use_instr_begin(Def),
				E = MRI.use_instr_end();
				I != E; ++I) {
				if (I->isPHI() && I->getParent() == KernelBB) {
				// Get the loop carried definition.
				unsigned LCDef = getLoopPhiReg(BBI, KernelBB);
				if (!LCDef)
				continue;
				MachineInstr *MI = MRI.getVRegDef(LCDef);
				if (!MI || MI->getParent() != KernelBB || MI->isPHI())
				continue;
				// Search through the rest of the block looking for uses of the Phi
				// definition. If one occurs, then split the lifetime.
				unsigned SplitReg = 0;
				for (auto &BBJ : make_range(MachineBasicBlock::instr_iterator(MI),
				KernelBB->instr_end()))
				if (BBJ.readsRegister(Def)) {
				// We split the lifetime when we find the first use.
				if (SplitReg == 0) {
				SplitReg = MRI.createVirtualRegister(MRI.getRegClass(Def));
				BuildMI(*KernelBB, MI, MI->getDebugLoc(),
				TII->get(TargetOpcode::COPY), SplitReg)
				.addReg(Def);
				}
				BBJ.substituteRegister(Def, SplitReg, 0, *TRI);
				}
				if (!SplitReg)
				continue;
				// Search through each of the epilog blocks for any uses to be renamed.
				for (auto &Epilog : EpilogBBs)
				for (auto &I : *Epilog)
				if (I.readsRegister(Def))
				I.substituteRegister(Def, SplitReg, 0, *TRI);
				break;
				}
				}
				}
				}

				/// Remove the incoming block from the Phis in a basic block.
				static void removePhis(MachineBasicBlock *BB, MachineBasicBlock *Incoming) {
				for (MachineBasicBlock::instr_iterator MII = BB->instr_begin(),
				MIE = BB->instr_end();
				MII != MIE && MII->isPHI(); ++MII)
				for (unsigned i = 1, e = MII->getNumOperands(); i != e; i += 2)
				if (MII->getOperand(i + 1).getMBB() == Incoming) {
				MII->RemoveOperand(i + 1);
				MII->RemoveOperand(i);
				break;
				}
				}

				// Create branches from each prolog basic block to the appropriate epilog
				// block. These edges are needed if the loop ends before reaching the
				// kernel.
				materi: I do not understand how this works when more than one iteration starts to execute in the prolog. For example, if the runtime trip count is 1 and 2 iterations are started in the prolog, don't you miss executing some instructions from the only loop iteration? If this is not a bug, maybe you can add a test case that shows how this works?
				bcahoon (Author): If two iterations are started in the prolog, then we generate two prolog basic blocks and two epilog basic blocks. At the end of each prolog basic block, we add a compare and branch to the corresponding epilog basic block (the fall through is to the next prolog block or the kernel). This means that the first prolog block contains instructions from stage 0, and the second prolog block contains instructions from stage 1 and the 2nd iteration of stage 0. In your example, with a run-time trip count of 1, the first prolog block branches to the last epilog block, and the instructions in the last epilog block are the first iteration of instructions scheduled in stage 1 and stage 2. The swp-max.ll test case shows a pipelined schedule with 2 prolog and epilog blocks.
				materi: Thank you! I think I understand how it works now. The prolog and epilog blocks are not the "bundles" of the SWP prolog and epilog. The jump label for my trip count = 1 case is put in the middle of the first "epilog bundle". But what if there are loop carried 0-latency dependences in the graph? This will force a certain order within the kernel to allow correct bundling in a later step. Can this be handled?
				bcahoon (Author): If I'm understanding your question, then yes, we do handle the case of a loop carried 0-latency instruction. The order of the instructions in the prolog and epilog blocks is different from the order in the pipelined schedule. The prolog/epilog instructions appear in the original instruction order (i.e., prior to pipelining), and they are grouped by the pipelined stage. As an example, let's say there are 3 stages, numbered 0, 1, 2, so there will be two prolog blocks and two epilog blocks. The first prolog contains instructions from stage 0 in the original order. The last epilog contains instructions from stages 1 and 2 in original order. If the loop contains only 1 iteration, then the stage 0 instructions in the first prolog are executed, and control jumps to the last epilog block to execute the first iteration of instructions from stages 1 and 2. In the second prolog, we first generate the instructions from stage 1 in the original order, and then stage 0 in original order. In the second to last epilog, we generate instructions for stage 2 in the original order. If the loop has 2 iterations, then the 2 prolog blocks execute instructions from stage 0 twice and stage 1 once. The 2 epilog blocks execute instructions from stage 2 twice and stage 1 once. I hope this makes sense and answers your question correctly. Let me know.
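The block layout described in the reply above can be checked with a small standalone model. This is only a sketch, not code from the patch: the helper names `prologStages`, `epilogStages`, and `stageExecutions` are hypothetical, and the generalization from the 3-stage example to an arbitrary stage count is an assumption. It models prolog block `i` as holding stages `i, i-1, ..., 0`, epilog block `k` (counted from the kernel outward) as holding stages `N-1-k, ..., N-1`, and a loop that exits after `tripCount` iterations as running the first `tripCount` prolog blocks followed by the last `tripCount` epilog blocks.

```cpp
#include <cassert>
#include <vector>

// Stages emitted into prolog block `i` (0-based), newest stage first.
// Hypothetical helper modeling the layout described in the review thread.
std::vector<int> prologStages(int i) {
  std::vector<int> stages;
  for (int s = i; s >= 0; --s)
    stages.push_back(s);
  return stages;
}

// Stages emitted into epilog block `k` (0-based, k == 0 follows the kernel).
std::vector<int> epilogStages(int numStages, int k) {
  std::vector<int> stages;
  for (int s = numStages - 1 - k; s <= numStages - 1; ++s)
    stages.push_back(s);
  return stages;
}

// Count how many times each stage runs when the loop leaves the prolog
// after `tripCount` iterations, i.e. before the kernel is ever reached.
std::vector<int> stageExecutions(int numStages, int tripCount) {
  assert(numStages >= 2 && tripCount >= 1 && tripCount < numStages);
  std::vector<int> counts(numStages, 0);
  // Prolog blocks 0 .. tripCount-1 execute in order.
  for (int i = 0; i < tripCount; ++i)
    for (int s : prologStages(i))
      ++counts[s];
  // The last executed prolog branches to its matching epilog; control then
  // falls through the remaining epilog blocks.
  for (int k = numStages - 1 - tripCount; k <= numStages - 2; ++k)
    for (int s : epilogStages(numStages, k))
      ++counts[s];
  return counts;
}
```

Under these assumptions, `stageExecutions(3, 1)` and `stageExecutions(3, 2)` report that every stage runs exactly once and exactly twice respectively, which matches the claim that no instruction of a short loop is skipped.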
				void SwingSchedulerDAG::addBranches(MBBVectorTy &PrologBBs,
				MachineBasicBlock *KernelBB,
				MBBVectorTy &EpilogBBs,
				SMSchedule &Schedule, ValueMapTy *VRMap) {
				assert(PrologBBs.size() == EpilogBBs.size() && "Prolog/Epilog mismatch");
				MachineInstr *IndVar = Pass->LI.LoopInductionVar;
				MachineInstr *Cmp = Pass->LI.LoopCompare;
				MachineBasicBlock *LastPro = KernelBB;
				MachineBasicBlock *LastEpi = KernelBB;

				// Start from the blocks connected to the kernel and work "out"
				// to the first prolog and the last epilog blocks.
				SmallVector<MachineInstr *, 4> PrevInsts;
				unsigned MaxIter = PrologBBs.size() - 1;
				unsigned LC = UINT_MAX;
				unsigned LCMin = UINT_MAX;
				for (unsigned i = 0, j = MaxIter; i <= MaxIter; ++i, --j) {
				// Add branches to the prolog that go to the corresponding
				// epilog, and the fall-thru prolog/kernel block.
				MachineBasicBlock *Prolog = PrologBBs[j];
				MachineBasicBlock *Epilog = EpilogBBs[i];
				// We've executed one iteration, so decrement the loop count and check for
				// the loop end.
				SmallVector<MachineOperand, 4> Cond;
				// Check if the LOOP0 has already been removed. If so, then there is no need
				// to reduce the trip count.
				if (LC != 0)
				LC = TII->ReduceLoopCount(*Prolog, IndVar, Cmp, Cond, PrevInsts, j,
				MaxIter);

				// Record the value of the first trip count, which is used to determine if
				// branches and blocks can be removed for constant trip counts.
				if (LCMin == UINT_MAX)
				LCMin = LC;

				unsigned numAdded = 0;
				if (TargetRegisterInfo::isVirtualRegister(LC)) {
				Prolog->addSuccessor(Epilog);
				numAdded = TII->InsertBranch(*Prolog, Epilog, LastPro, Cond, DebugLoc());
				} else if (j >= LCMin) {
				Prolog->addSuccessor(Epilog);
				Prolog->removeSuccessor(LastPro);
				LastEpi->removeSuccessor(Epilog);
				numAdded = TII->InsertBranch(*Prolog, Epilog, 0, Cond, DebugLoc());
				removePhis(Epilog, LastEpi);
				// Remove the blocks that are no longer referenced.
				if (LastPro != LastEpi) {
				LastEpi->clear();
				LastEpi->eraseFromParent();
				}
				LastPro->clear();
				LastPro->eraseFromParent();
				} else {
				numAdded = TII->InsertBranch(*Prolog, LastPro, 0, Cond, DebugLoc());
				removePhis(Epilog, Prolog);
				}
				LastPro = Prolog;
				LastEpi = Epilog;
				for (MachineBasicBlock::reverse_instr_iterator I = Prolog->instr_rbegin(),
				E = Prolog->instr_rend();
				I != E && numAdded > 0; ++I, --numAdded)
				updateInstruction(&*I, false, j, 0, Schedule, VRMap);
				}
				}

				/// Return true if we can compute the amount the instruction changes
				/// during each iteration. Set Delta to the amount of the change.
				bool SwingSchedulerDAG::computeDelta(MachineInstr *MI, unsigned &Delta) {
				const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
				unsigned BaseReg, Offset;
				if (!TII->getMemOpBaseRegImmOfs(MI, BaseReg, Offset, TRI))
				return false;

				MachineRegisterInfo &MRI = MF.getRegInfo();
				// Check if there is a Phi. If so, get the definition in the loop.
				MachineInstr *BaseDef = MRI.getVRegDef(BaseReg);
				if (BaseDef && BaseDef->isPHI()) {
				BaseReg = getLoopPhiReg(BaseDef, MI->getParent());
				BaseDef = MRI.getVRegDef(BaseReg);
				}
				if (!BaseDef)
				return false;

				int D;
				if (!TII->getIncrementValue(BaseDef, D) || D < 0)
				return false;

				Delta = D;
				return true;
				}

				/// Update the memory operand with a new offset when the pipeliner
				/// generates a new copy of the instruction that refers to a
				/// different memory location.
				void SwingSchedulerDAG::updateMemOperands(MachineInstr *NewMI,
				MachineInstr *OldMI, unsigned Num) {
				if (Num == 0)
				return;
				// If the instruction has memory operands, then adjust the offset
				// when the instruction appears in different stages.
				unsigned NumRefs = NewMI->memoperands_end() - NewMI->memoperands_begin();
				if (NumRefs == 0)
				return;
				MachineInstr::mmo_iterator NewMemRefs = MF.allocateMemRefsArray(NumRefs);
				unsigned Refs = 0;
				for (MachineInstr::mmo_iterator I = NewMI->memoperands_begin(),
				E = NewMI->memoperands_end();
				I != E; ++I) {
				if ((*I)->isVolatile() || (*I)->isInvariant() || (!(*I)->getValue())) {
				NewMemRefs[Refs++] = *I;
				continue;
				}
				unsigned Delta;
				if (computeDelta(OldMI, Delta)) {
				int64_t AdjOffset = Delta * Num;
				NewMemRefs[Refs++] =
				MF.getMachineMemOperand(*I, AdjOffset, (*I)->getSize());
				} else
				NewMemRefs[Refs++] = MF.getMachineMemOperand(*I, 0, UINT64_MAX);
				}
				NewMI->setMemRefs(NewMemRefs, NewMemRefs + NumRefs);
				}

				/// Clone the instruction for the new pipelined loop and update the
				/// memory operands, if needed.
				MachineInstr *SwingSchedulerDAG::cloneInstr(MachineInstr *OldMI,
				unsigned CurStageNum,
				unsigned InstStageNum) {
				MachineInstr *NewMI = MF.CloneMachineInstr(OldMI);
				updateMemOperands(NewMI, OldMI, CurStageNum - InstStageNum);
				return NewMI;
				}

				/// Clone the instruction for the new pipelined loop. If needed, this
				/// function updates the instruction using the values saved in the
				/// InstrChanges structure.
				MachineInstr *SwingSchedulerDAG::cloneAndChangeInstr(MachineInstr *OldMI,
				unsigned CurStageNum,
				unsigned InstStageNum,
				SMSchedule &Schedule) {
				MachineInstr *NewMI = MF.CloneMachineInstr(OldMI);
				DenseMap<SUnit *, std::pair<unsigned, int64_t>>::iterator It =
				InstrChanges.find(getSUnit(OldMI));
				if (It != InstrChanges.end()) {
				std::pair<unsigned, int64_t> RegAndOffset = It->second;
				unsigned BasePos, OffsetPos;
				if (!TII->getBaseAndOffsetPosition(OldMI, BasePos, OffsetPos))
				return 0;
				int64_t NewOffset = OldMI->getOperand(OffsetPos).getImm();
				MachineInstr *LoopDef = findDefInLoop(RegAndOffset.first);
				if (Schedule.stageScheduled(getSUnit(LoopDef)) >= (signed)CurStageNum)
				NewOffset += RegAndOffset.second * (CurStageNum - InstStageNum);
				NewMI->getOperand(OffsetPos).setImm(NewOffset);
				}
				updateMemOperands(NewMI, OldMI, CurStageNum - InstStageNum);
				return NewMI;
				}

				// Update the machine instruction with new virtual registers. This
				// function may change the definitions and/or uses.
				void SwingSchedulerDAG::updateInstruction(MachineInstr *NewMI, bool LastDef,
				unsigned CurStageNum,
				unsigned InstrStageNum,
				SMSchedule &Schedule,
				ValueMapTy *VRMap) {
				for (unsigned i = 0, e = NewMI->getNumOperands(); i != e; ++i) {
				MachineOperand &MO = NewMI->getOperand(i);
				if (!MO.isReg() || !TargetRegisterInfo::isVirtualRegister(MO.getReg()))
				continue;
				unsigned reg = MO.getReg();
				if (MO.isDef()) {
				// Create a new virtual register for the definition.
				const TargetRegisterClass *RC = MRI.getRegClass(reg);
				unsigned NewReg = MRI.createVirtualRegister(RC);
				MO.setReg(NewReg);
				VRMap[CurStageNum][reg] = NewReg;
				if (LastDef)
				replaceRegUsesAfterLoop(reg, NewReg, BB, MRI, LIS);
				} else if (MO.isUse()) {
				MachineInstr *Def = MRI.getVRegDef(reg);
				// Compute the stage that contains the last definition for instruction.
				int DefStageNum = Schedule.stageScheduled(getSUnit(Def));
				unsigned StageNum = CurStageNum;
				if (DefStageNum != -1 && (int)InstrStageNum > DefStageNum) {
				// Compute the difference in stages between the definition and the use.
				unsigned StageDiff = (InstrStageNum - DefStageNum);
				// Make an adjustment to get the last definition.
				StageNum -= StageDiff;
				}
				if (VRMap[StageNum].count(reg))
				MO.setReg(VRMap[StageNum][reg]);
				}
				}
				}

				// Return the instruction in the loop that defines the register.
				// If the definition is a Phi, then follow the Phi operand to
				// the instruction in the loop.
				MachineInstr *SwingSchedulerDAG::findDefInLoop(unsigned Reg) {
				SmallPtrSet<MachineInstr *, 8> Visited;
				MachineInstr *Def = MRI.getVRegDef(Reg);
				while (Def->isPHI()) {
				if (!Visited.insert(Def).second)
				break;
				for (unsigned i = 1, e = Def->getNumOperands(); i < e; i += 2)
				if (Def->getOperand(i + 1).getMBB() == BB) {
				Def = MRI.getVRegDef(Def->getOperand(i).getReg());
				break;
				}
				}
				return Def;
				}

				/// Return the new name for the value from the previous stage.
				unsigned SwingSchedulerDAG::getPrevMapVal(unsigned StageNum, unsigned PhiStage,
				unsigned LoopVal, ValueMapTy *VRMap,
				MachineBasicBlock *BB) {
				unsigned PrevVal = 0;
				if (StageNum > PhiStage) {
				MachineInstr *LoopInst = MRI.getVRegDef(LoopVal);
				if (VRMap[StageNum - 1].count(LoopVal))
				// The name is defined in the previous stage.
				PrevVal = VRMap[StageNum - 1][LoopVal];
				else if (VRMap[StageNum].count(LoopVal))
				// The previous name is defined in the current stage when the instruction
				// order is swapped.
				PrevVal = VRMap[StageNum][LoopVal];
				else if (!LoopInst->isPHI())
				// The loop value hasn't yet been scheduled.
				PrevVal = LoopVal;
				else if (StageNum == PhiStage + 1)
				// The loop value is another phi, which has not been scheduled.
				PrevVal = getInitPhiReg(LoopInst, BB);
				else if (StageNum > PhiStage + 1 && LoopInst->getParent() == BB)
				// The loop value is another phi, which has been scheduled.
				PrevVal = getPrevMapVal(StageNum - 1, PhiStage,
				getLoopPhiReg(LoopInst, BB), VRMap, BB);
				}
				return PrevVal;
				}

				/// Rewrite the Phi values in the specified block to use the mappings
				/// from the initial operand. Once the Phi is scheduled, we switch
				/// to using the loop value instead of the Phi value, so those names
				/// do not need to be rewritten.
				void SwingSchedulerDAG::rewritePhiValues(MachineBasicBlock *NewBB,
				unsigned StageNum,
				SMSchedule &Schedule,
				ValueMapTy *VRMap,
				InstrMapTy &InstrMap) {
				for (MachineBasicBlock::iterator BBI = BB->instr_begin(),
				BBE = BB->getFirstNonPHI();
				BBI != BBE; ++BBI) {
				unsigned InitVal = 0;
				unsigned LoopVal = 0;
				getPhiRegs(BBI, BB, InitVal, LoopVal);
				unsigned PhiDef = BBI->getOperand(0).getReg();

				unsigned PhiStage =
				(unsigned)Schedule.stageScheduled(getSUnit(MRI.getVRegDef(PhiDef)));
				unsigned NumPhis = Schedule.getStagesForReg(PhiDef, StageNum);
				if (NumPhis > StageNum + 1)
				NumPhis = StageNum + 1;
				// Always do at least one iteration. In this case, the loop value is
				// scheduled prior to the Phi in the next iteration.
				if (NumPhis == 0)
				NumPhis = 1;
				for (unsigned np = 0; np < NumPhis; ++np) {
				unsigned PrevVal =
				getPrevMapVal(StageNum - np, PhiStage, LoopVal, VRMap, BB);
				rewriteScheduledInstr(NewBB, Schedule, InstrMap, StageNum, np, BBI,
				PhiDef, InitVal, PrevVal);
				}
				}
				}

				/// Rewrite a previously scheduled instruction to use the register value
				/// from the new instruction. Make sure the instruction occurs in the
				/// basic block, and we don't change the uses in the new instruction.
				void SwingSchedulerDAG::rewriteScheduledInstr(
				MachineBasicBlock *BB, SMSchedule &Schedule, InstrMapTy &InstrMap,
				unsigned CurStageNum, unsigned PhiNum, MachineInstr *Phi, unsigned OldReg,
				unsigned NewReg, unsigned PrevReg) {
				bool InProlog = (CurStageNum < Schedule.getMaxStageCount());
				int StagePhi = Schedule.stageScheduled(getSUnit(Phi)) + PhiNum;
				// Rewrite uses that have been scheduled already to use the new
				// Phi register.
				for (MachineRegisterInfo::use_iterator UI = MRI.use_begin(OldReg),
				EI = MRI.use_end();
				UI != EI;) {
				MachineOperand &UseOp = *UI;
				MachineInstr *UseMI = UseOp.getParent();
				++UI;
				if (UseMI->getParent() != BB)
				continue;
				if (UseMI->isPHI()) {
				if (!Phi->isPHI() && UseMI->getOperand(0).getReg() == NewReg)
				continue;
				if (getLoopPhiReg(UseMI, BB) != OldReg)
				continue;
				}
				InstrMapTy::iterator OrigInstr = InstrMap.find(UseMI);
				assert(OrigInstr != InstrMap.end() && "Instruction not scheduled.");
				SUnit *OrigMISU = getSUnit(OrigInstr->second);
				int StageSched = Schedule.stageScheduled(OrigMISU);
				int CycleSched = Schedule.cycleScheduled(OrigMISU);
				unsigned ReplaceReg = 0;
				// This is the stage for the scheduled instruction.
				if (StagePhi == StageSched && Phi->isPHI()) {
				int CyclePhi = Schedule.cycleScheduled(getSUnit(Phi));
				if (PrevReg && InProlog)
				ReplaceReg = PrevReg;
				else if (PrevReg && !Schedule.isLoopCarried(this, Phi) &&
				(CyclePhi <= CycleSched || OrigMISU->getInstr()->isPHI()))
				ReplaceReg = PrevReg;
				else
				ReplaceReg = NewReg;
				}
				// The scheduled instruction occurs before the scheduled Phi, and the
				// Phi is not loop carried.
				if (StagePhi + 1 == StageSched && !Schedule.isLoopCarried(this, Phi))
				ReplaceReg = NewReg;
				if (StagePhi > StageSched && Phi->isPHI())
				ReplaceReg = NewReg;
				if (!InProlog && !Phi->isPHI() && StagePhi < StageSched)
				ReplaceReg = NewReg;
				if (ReplaceReg) {
				MRI.constrainRegClass(ReplaceReg, MRI.getRegClass(OldReg));
				UseOp.setReg(ReplaceReg);
				}
				}
				}

				/// Check if we can change the instruction to use an offset value from the
				/// previous iteration. If so, return true and set the base and offset values
				/// so that we can rewrite the load, if necessary.
				/// v1 = Phi(v0, v3)
				/// v2 = load v1, 0
				/// v3 = post_store v1, 4, x
				/// This function enables the load to be rewritten as v2 = load v3, 4.
				bool SwingSchedulerDAG::canUseLastOffsetValue(MachineInstr *MI,
				unsigned &BasePos,
				unsigned &OffsetPos,
				unsigned &NewBase,
				int64_t &Offset) {
				// Get the load instruction.
				if (TII->isPostIncrement(MI))
				return false;
				unsigned BasePosLd, OffsetPosLd;
				if (!TII->getBaseAndOffsetPosition(MI, BasePosLd, OffsetPosLd))
				return false;
				unsigned BaseReg = MI->getOperand(BasePosLd).getReg();

				// Look for the Phi instruction.
				MachineRegisterInfo &MRI = MI->getParent()->getParent()->getRegInfo();
				MachineInstr *Phi = MRI.getVRegDef(BaseReg);
				if (!Phi || !Phi->isPHI())
				return false;
				// Get the register defined in the loop block.
				unsigned PrevReg = getLoopPhiReg(Phi, MI->getParent());
				if (!PrevReg)
				return false;

				// Check for the post-increment load/store instruction.
				MachineInstr *PrevDef = MRI.getVRegDef(PrevReg);
				if (!PrevDef || PrevDef == MI)
				return false;

				if (!TII->isPostIncrement(PrevDef))
				return false;

				unsigned BasePos1 = 0, OffsetPos1 = 0;
				if (!TII->getBaseAndOffsetPosition(PrevDef, BasePos1, OffsetPos1))
				return false;

				// Make sure offset values are both positive or both negative.
				int64_t LoadOffset = MI->getOperand(OffsetPosLd).getImm();
				int64_t StoreOffset = PrevDef->getOperand(OffsetPos1).getImm();
				if ((LoadOffset >= 0) != (StoreOffset >= 0))
				return false;

				// Set the return value once we determine that we return true.
				BasePos = BasePosLd;
				OffsetPos = OffsetPosLd;
				NewBase = PrevReg;
				Offset = StoreOffset;
				return true;
				}

				/// Apply changes to the instruction if needed. The changes are needed
				/// to improve the scheduling and depend upon the final schedule.
				MachineInstr *SwingSchedulerDAG::applyInstrChange(MachineInstr *MI,
				SMSchedule &Schedule,
				bool UpdateDAG) {
				SUnit *SU = getSUnit(MI);
				DenseMap<SUnit *, std::pair<unsigned, int64_t>>::iterator It =
				InstrChanges.find(SU);
				if (It != InstrChanges.end()) {
				std::pair<unsigned, int64_t> RegAndOffset = It->second;
				unsigned BasePos, OffsetPos;
				if (!TII->getBaseAndOffsetPosition(MI, BasePos, OffsetPos))
				return 0;
				unsigned BaseReg = MI->getOperand(BasePos).getReg();
				MachineInstr *LoopDef = findDefInLoop(BaseReg);
				int DefStageNum = Schedule.stageScheduled(getSUnit(LoopDef));
				int DefCycleNum = Schedule.cycleScheduled(getSUnit(LoopDef));
				int BaseStageNum = Schedule.stageScheduled(SU);
				int BaseCycleNum = Schedule.cycleScheduled(SU);
				if (BaseStageNum < DefStageNum) {
				MachineInstr *NewMI = MF.CloneMachineInstr(MI);
				int OffsetDiff = DefStageNum - BaseStageNum;
				if (DefCycleNum < BaseCycleNum) {
				NewMI->getOperand(BasePos).setReg(RegAndOffset.first);
				if (OffsetDiff > 0)
				--OffsetDiff;
				}
				int64_t NewOffset =
				MI->getOperand(OffsetPos).getImm() + RegAndOffset.second * OffsetDiff;
				NewMI->getOperand(OffsetPos).setImm(NewOffset);
				if (UpdateDAG) {
				SU->setInstr(NewMI);
				MISUnitMap[NewMI] = SU;
				}
				NewMIs.insert(NewMI);
				return NewMI;
				}
				}
				return 0;
				}

				/// Return true for an order dependence that is potentially loop carried.
				/// An order dependence is loop carried if the destination defines a value
				/// that may be used by the source in a subsequent iteration.
				bool SwingSchedulerDAG::isLoopCarriedOrder(SUnit *Source, const SDep &Dep,
				bool isSucc) {
				if (!isOrder(Source, Dep) || Dep.isArtificial())
				MatzeB: nullptr
				return false;

				if (!SwpPruneLoopCarried)
				return true;

				MachineInstr *SI = Source->getInstr();
				MachineInstr *DI = Dep.getSUnit()->getInstr();
				if (!isSucc)
				std::swap(SI, DI);
				assert(SI != nullptr && DI != nullptr && "Expecting SUnit with an MI.");

				// Assume ordered loads and stores may have a loop carried dependence.
				if (SI->hasUnmodeledSideEffects() || DI->hasUnmodeledSideEffects() ||
				SI->hasOrderedMemoryRef() \|\| DI->hasOrderedMemoryRef())
				return true;

				// Only chain dependences between a load and store can be loop carried.
				if (!DI->mayStore() || !SI->mayLoad())
				return false;

				unsigned DeltaS, DeltaD;
				if (!computeDelta(SI, DeltaS) || !computeDelta(DI, DeltaD))
				return true;

				unsigned BaseRegS, OffsetS, BaseRegD, OffsetD;
				const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
				if (!TII->getMemOpBaseRegImmOfs(SI, BaseRegS, OffsetS, TRI) ||
				!TII->getMemOpBaseRegImmOfs(DI, BaseRegD, OffsetD, TRI))
				return true;

				if (BaseRegS != BaseRegD)
				return true;

				uint64_t AccessSizeS = (*SI->memoperands_begin())->getSize();
				uint64_t AccessSizeD = (*DI->memoperands_begin())->getSize();

				// This is the main test, which checks the offset values and the loop
				// increment value to determine if the accesses may be loop carried.
				if (OffsetS >= OffsetD)
				return OffsetS + AccessSizeS > DeltaS;
				else if (OffsetS < OffsetD)
				return OffsetD + AccessSizeD > DeltaD;

				return true;
				}
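The final offset test in `isLoopCarriedOrder` can be tried out in isolation. The following is a minimal sketch that copies only the last comparison; the function name `mayBeLoopCarried` and its parameter list are hypothetical, and it assumes the earlier checks have already passed (both accesses use the same base register and both per-iteration deltas were computed).

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the closing comparison of isLoopCarriedOrder: the access with the
// larger (or equal) offset is compared against its per-iteration increment.
// If the access reaches past the increment, it may overlap with the other
// access in a later iteration, so the dependence is treated as loop carried.
bool mayBeLoopCarried(unsigned offsetS, uint64_t sizeS, unsigned deltaS,
                      unsigned offsetD, uint64_t sizeD, unsigned deltaD) {
  assert(sizeS > 0 && sizeD > 0 && "access sizes are expected to be nonzero");
  if (offsetS >= offsetD)
    return offsetS + sizeS > deltaS;
  return offsetD + sizeD > deltaD;
}
```

For example, a 4-byte load and a 4-byte store at offset 0 from a pointer that advances by 4 each iteration give `0 + 4 > 4`, which is false, so the edge can be pruned. Moving the load to offset 4 gives `4 + 4 > 4`, which is true, so the dependence is kept.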

				// Try to schedule the node at the specified StartCycle and continue
				// until the node is scheduled or the EndCycle is reached. This function
				// returns true if the node is scheduled. This routine may search either
				// forward or backward for a place to insert the instruction based upon
				// the relative values of StartCycle and EndCycle.
				//
				bool SMSchedule::insert(SUnit *SU, int StartCycle, int EndCycle, int II) {
				bool forward = true;
				if (StartCycle > EndCycle)
				forward = false;

				// The terminating condition depends on the direction.
				int termCycle = forward ? EndCycle + 1 : EndCycle - 1;
				for (int curCycle = StartCycle; curCycle != termCycle;
				forward ? ++curCycle : --curCycle) {

				// Add the already scheduled instructions at the specified cycle to the DFA.
				Resources->clearResources();
				for (int checkCycle = FirstCycle + ((curCycle - FirstCycle) % II);
				checkCycle <= LastCycle; checkCycle += II) {
				std::deque<SUnit *> &cycleInstrs = ScheduledInstrs[checkCycle];

				for (std::deque<SUnit *>::iterator I = cycleInstrs.begin(),
				E = cycleInstrs.end();
				I != E; ++I) {
				if (ST.getInstrInfo()->isZeroCost((*I)->getInstr()->getOpcode()))
				continue;
				assert(Resources->canReserveResources((*I)->getInstr()) &&
				"These instructions have already been scheduled.");
				Resources->reserveResources((*I)->getInstr());
				}
				}
				if (ST.getInstrInfo()->isZeroCost(SU->getInstr()->getOpcode()) ||
				Resources->canReserveResources(SU->getInstr())) {
				DEBUG({
				dbgs() << "\tinsert at cycle " << curCycle << " ";
				SU->getInstr()->dump();
				});

				ScheduledInstrs[curCycle].push_back(SU);
				InstrToCycle.insert(std::make_pair(SU, curCycle));
				if (curCycle > LastCycle)
				LastCycle = curCycle;
				if (curCycle < FirstCycle)
				FirstCycle = curCycle;
				return true;
				}
				DEBUG({
				dbgs() << "\tfailed to insert at cycle " << curCycle << " ";
				SU->getInstr()->dump();
				});
				}
				return false;
				}
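In `SMSchedule::insert`, the inner loop visits every already scheduled cycle that maps to the same kernel slot as the candidate cycle, because cycles that are congruent modulo II share resources in the pipelined kernel. The sketch below reproduces only that index arithmetic; `conflictingCycles` is a hypothetical helper name, and the DFA resource check itself is left out.

```cpp
#include <cassert>
#include <vector>

// Enumerate the scheduled cycles whose instructions compete for resources
// with an instruction placed at `curCycle`, using the same start expression
// and step as the loop in SMSchedule::insert.
std::vector<int> conflictingCycles(int firstCycle, int lastCycle,
                                   int curCycle, int II) {
  assert(II > 0 && "initiation interval must be positive");
  std::vector<int> cycles;
  for (int checkCycle = firstCycle + ((curCycle - firstCycle) % II);
       checkCycle <= lastCycle; checkCycle += II)
    cycles.push_back(checkCycle);
  return cycles;
}
```

With `firstCycle = 0`, `lastCycle = 9`, and `II = 4`, a candidate at cycle 6 is checked against the instructions in cycles 2 and 6. When the candidate lies before `firstCycle`, the C++ remainder is negative, so the walk starts at the candidate cycle itself and still steps by II.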

				// Return the cycle of the earliest scheduled instruction in the chain.
				int SMSchedule::earliestCycleInChain(const SDep &Dep) {
				SmallPtrSet<SUnit *, 8> Visited;
				SmallVector<SDep, 8> Worklist;
				Worklist.push_back(Dep);
				int EarlyCycle = INT_MAX;
				while (!Worklist.empty()) {
				const SDep &Cur = Worklist.pop_back_val();
				SUnit *PrevSU = Cur.getSUnit();
				if (Visited.count(PrevSU))
				continue;
				std::map<SUnit *, int>::const_iterator it = InstrToCycle.find(PrevSU);
				if (it == InstrToCycle.end())
				continue;
				EarlyCycle = std::min(EarlyCycle, it->second);
				for (const auto &PI : PrevSU->Preds)
				if (SwingSchedulerDAG::isOrder(PrevSU, PI))
				Worklist.push_back(PI);
				Visited.insert(PrevSU);
				}
				return EarlyCycle;
				}

				// Return the cycle of the latest scheduled instruction in the chain.
				int SMSchedule::latestCycleInChain(const SDep &Dep) {
				SmallPtrSet<SUnit *, 8> Visited;
				SmallVector<SDep, 8> Worklist;
				Worklist.push_back(Dep);
				int LateCycle = INT_MIN;
				while (!Worklist.empty()) {
				const SDep &Cur = Worklist.pop_back_val();
				SUnit *SuccSU = Cur.getSUnit();
				if (Visited.count(SuccSU))
				continue;
				std::map<SUnit *, int>::const_iterator it = InstrToCycle.find(SuccSU);
				if (it == InstrToCycle.end())
				continue;
				LateCycle = std::max(LateCycle, it->second);
				for (const auto &SI : SuccSU->Succs)
				if (SwingSchedulerDAG::isOrder(SuccSU, SI))
				Worklist.push_back(SI);
				Visited.insert(SuccSU);
				}
				return LateCycle;
				}

				/// If an instruction has a use that spans multiple iterations, then
				/// return the Phi's SUnit; otherwise return nullptr. These instructions are
				/// characterized by having a back-edge to a Phi, which contains a reference
				/// to another Phi.
				static SUnit *multipleIterations(SUnit *SU, SwingSchedulerDAG *DAG) {
				for (auto &P : SU->Preds)
				if (DAG->isBackedge(SU, P) && P.getSUnit()->getInstr()->isPHI())
				for (auto &S : P.getSUnit()->Succs)
				if (S.getKind() == SDep::Order && S.getSUnit()->getInstr()->isPHI())
				return P.getSUnit();
				return nullptr;
				}

				// Compute the scheduling start slot for the instruction. The start slot
				// depends on any predecessor or successor nodes scheduled already.
				void SMSchedule::computeStart(SUnit *SU, int *MaxEarlyStart, int *MinLateStart,
				int *MinEnd, int *MaxStart, int II,
				SwingSchedulerDAG *DAG) {
				// Iterate over each instruction that has been scheduled already. The start
				// slot computation depends on whether the previously scheduled instruction
				// is a predecessor or successor of the specified instruction.
				for (int cycle = getFirstCycle(); cycle <= LastCycle; ++cycle) {

				// Iterate over each instruction in the current cycle.
				for (std::deque<SUnit *>::iterator I = getInstructions(cycle).begin(),
				E = getInstructions(cycle).end();
				I != E; ++I) {

				// Because we're processing a DAG for the dependences, we recognize
				// the back-edge in recurrences by anti dependences.
				for (unsigned i = 0, e = (unsigned)SU->Preds.size(); i != e; ++i) {
				const SDep &Dep = SU->Preds[i];
				if (Dep.getSUnit() == *I) {
				if (!DAG->isBackedge(SU, Dep)) {
				int EarlyStart = cycle + DAG->getLatency(SU, Dep) -
				DAG->getDistance(Dep.getSUnit(), SU, Dep) * II;
				*MaxEarlyStart = std::max(*MaxEarlyStart, EarlyStart);
				if (DAG->isLoopCarriedOrder(SU, Dep, false)) {
				int End = earliestCycleInChain(Dep) + (II - 1);
				*MinEnd = std::min(*MinEnd, End);
				}
				} else {
				int LateStart = cycle - DAG->getLatency(SU, Dep) +
				DAG->getDistance(SU, Dep.getSUnit(), Dep) * II;
				*MinLateStart = std::min(*MinLateStart, LateStart);
				}
				}
				// For an instruction that requires multiple iterations, make sure that
				// the dependent instruction is not scheduled past the definition.
				SUnit *BE = multipleIterations(*I, DAG);
				if (BE && Dep.getSUnit() == BE && !SU->getInstr()->isPHI() &&
				!SU->isPred(*I))
				*MinLateStart = std::min(*MinLateStart, cycle);
				}
				for (unsigned i = 0, e = (unsigned)SU->Succs.size(); i != e; ++i)
				if (SU->Succs[i].getSUnit() == *I) {
				const SDep &Dep = SU->Succs[i];
				if (!DAG->isBackedge(SU, Dep)) {
				int LateStart = cycle - DAG->getLatency(SU, Dep) +
				DAG->getDistance(SU, Dep.getSUnit(), Dep) * II;
				*MinLateStart = std::min(*MinLateStart, LateStart);
				if (DAG->isLoopCarriedOrder(SU, Dep)) {
				int Start = latestCycleInChain(Dep) + 1 - II;
				*MaxStart = std::max(*MaxStart, Start);
				}
				} else {
				int EarlyStart = cycle + DAG->getLatency(SU, Dep) -
				DAG->getDistance(Dep.getSUnit(), SU, Dep) * II;
				*MaxEarlyStart = std::max(*MaxEarlyStart, EarlyStart);
				}
				}
				}
				}
				}
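The two bounds accumulated by `computeStart` follow the usual modulo scheduling formulas: a scheduled predecessor pushes the earliest start forward by its latency and back by `distance * II`, and a scheduled successor does the reverse for the latest start. The sketch below shows only the non-back-edge arithmetic; `earlyStart` and `lateStart` are hypothetical free functions, whereas the patch obtains latency and distance from the DAG.

```cpp
#include <cassert>

// Earliest cycle allowed by a predecessor scheduled at `predCycle`.
// `distance` is the number of iterations the dependence spans.
int earlyStart(int predCycle, int latency, int distance, int II) {
  assert(II > 0 && distance >= 0);
  return predCycle + latency - distance * II;
}

// Latest cycle allowed by a successor scheduled at `succCycle`.
int lateStart(int succCycle, int latency, int distance, int II) {
  assert(II > 0 && distance >= 0);
  return succCycle - latency + distance * II;
}
```

With `II = 4`, a predecessor at cycle 3 with latency 2 yields an early start of 5 for a same-iteration dependence and 1 for a dependence that spans one iteration. A successor at cycle 6 with latency 2 yields late starts of 4 and 8 for the same two cases.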

				/// Order the instructions within a cycle so that the definitions occur
				/// before the uses. Returns true if the instruction is added to the start
				/// of the list, or false if added to the end.
				bool SMSchedule::orderDependence(SwingSchedulerDAG *SSD, SUnit *SU,
				std::deque<SUnit *> &Insts) {
				MachineInstr *MI = SU->getInstr();
				bool OrderBeforeUse = false;
				bool OrderAfterDef = false;
				bool OrderBeforeDef = false;
				unsigned MoveDef = 0;
				unsigned MoveUse = 0;
				int StageInst1 = stageScheduled(SU);

				unsigned Pos = 0;
				for (std::deque<SUnit *>::iterator I = Insts.begin(), E = Insts.end(); I != E;
				++I, ++Pos) {
				// Relative order of Phis does not matter.
				if (MI->isPHI() && (*I)->getInstr()->isPHI())
				continue;
				for (unsigned i = 0, e = MI->getNumOperands(); i < e; ++i) {
				MachineOperand &MO = MI->getOperand(i);
				if (!MO.isReg() || !TargetRegisterInfo::isVirtualRegister(MO.getReg()))
				continue;
				unsigned Reg = MO.getReg();
				unsigned BasePos, OffsetPos;
				if (ST.getInstrInfo()->getBaseAndOffsetPosition(MI, BasePos, OffsetPos))
				if (MI->getOperand(BasePos).getReg() == Reg)
				if (unsigned NewReg = SSD->getInstrBaseReg(SU))
				Reg = NewReg;
				bool Reads, Writes;
				std::tie(Reads, Writes) =
				(*I)->getInstr()->readsWritesVirtualRegister(Reg);
				if (MO.isDef() && Reads && stageScheduled(*I) <= StageInst1) {
				OrderBeforeUse = true;
				MoveUse = Pos;
				} else if (MO.isUse() && Writes && stageScheduled(*I) == StageInst1) {
				if (cycleScheduled(*I) == cycleScheduled(SU) && !(*I)->isSucc(SU)) {
				OrderBeforeUse = true;
				MoveUse = Pos;
				} else {
				OrderAfterDef = true;
				MoveDef = Pos;
				}
				} else if (MO.isUse() && Writes && stageScheduled(*I) > StageInst1) {
				OrderBeforeUse = true;
				MoveUse = Pos;
				if (MoveUse != 0) {
				OrderAfterDef = true;
				MoveDef = Pos - 1;
				}
				} else if (MO.isUse() && stageScheduled(*I) == StageInst1 &&
				isLoopCarriedDefOfUse(SSD, (*I)->getInstr(), MO)) {
				OrderBeforeDef = true;
				MoveUse = Pos;
				}
				}
				// Check for order dependences between instructions. Make sure the source
				// is ordered before the destination.
				for (auto &S : SU->Succs)
				if (S.getKind() == SDep::Order && S.getSUnit() == *I) {
				OrderBeforeUse = true;
				MoveUse = Pos;
				}
				for (auto &P : SU->Preds)
				if (P.getKind() == SDep::Order && P.getSUnit() == *I) {
				OrderAfterDef = true;
				MoveDef = Pos;
				}
				}

				// OrderAfterDef takes precedence over OrderBeforeDef. The latter is due
				// to a loop-carried dependence.
				if (OrderBeforeDef)
				OrderBeforeUse = !OrderAfterDef;

				// The uncommon case when the instruction order needs to be updated because
				// there is both a use and def.
				if (OrderBeforeUse && OrderAfterDef) {
				assert(MoveUse != MoveDef && "Need to move two instructions.");
				SUnit *UseSU = Insts.at(MoveUse);
				SUnit *DefSU = Insts.at(MoveDef);
				if (MoveUse > MoveDef) {
				Insts.erase(Insts.begin() + MoveUse);
				Insts.erase(Insts.begin() + MoveDef);
				} else {
				Insts.erase(Insts.begin() + MoveDef);
				Insts.erase(Insts.begin() + MoveUse);
				}
				if (orderDependence(SSD, UseSU, Insts)) {
				Insts.push_front(SU);
				orderDependence(SSD, DefSU, Insts);
				return true;
				}
				Insts.pop_back();
				Insts.push_back(SU);
				Insts.push_back(UseSU);
				orderDependence(SSD, DefSU, Insts);
				return false;
				}
				// Put the new instruction first if there is a use in the list. Otherwise,
				// put it at the end of the list.
				if (OrderBeforeUse)
				Insts.push_front(SU);
				else
				Insts.push_back(SU);
				return OrderBeforeUse;
				}

				/// Return true if the scheduled Phi has a loop carried operand.
				bool SMSchedule::isLoopCarried(SwingSchedulerDAG *SSD, MachineInstr *Phi) {
				if (!Phi->isPHI())
				return false;
				assert(Phi->isPHI() && "Expecting a Phi.");
				SUnit *DefSU = SSD->getSUnit(Phi);
				unsigned DefCycle = cycleScheduled(DefSU);
				int DefStage = stageScheduled(DefSU);

				unsigned InitVal = 0;
				unsigned LoopVal = 0;
				getPhiRegs(Phi, Phi->getParent(), InitVal, LoopVal);
				SUnit *UseSU = SSD->getSUnit(MRI.getVRegDef(LoopVal));
				if (!UseSU)
				return true;
				if (UseSU->getInstr()->isPHI())
				return true;
				unsigned LoopCycle = cycleScheduled(UseSU);
				int LoopStage = stageScheduled(UseSU);
				return LoopCycle > DefCycle ||
				(LoopCycle <= DefCycle && LoopStage <= DefStage);
				}
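
One reading of the return expression above: the Phi's loop operand counts as loop carried when its definition is scheduled in a later cycle than the Phi, or otherwise when it is not in a later stage. Because the second disjunct is only evaluated when `LoopCycle > DefCycle` is false, the inner `LoopCycle <= DefCycle` test is redundant. The sketch below compares the expression as written with a shorter equivalent; `asWritten`, `simplified`, and `checkFormsAgree` are hypothetical helper names, not functions from the patch.

```cpp
#include <cassert>

// The return expression as written in isLoopCarried.
bool asWritten(unsigned DefCycle, int DefStage, unsigned LoopCycle,
               int LoopStage) {
  return LoopCycle > DefCycle ||
         (LoopCycle <= DefCycle && LoopStage <= DefStage);
}

// Equivalent form with the redundant cycle comparison dropped.
bool simplified(unsigned DefCycle, int DefStage, unsigned LoopCycle,
                int LoopStage) {
  return LoopCycle > DefCycle || LoopStage <= DefStage;
}

// Confirms on a small grid of cycles and stages that both forms agree.
void checkFormsAgree() {
  for (unsigned DC = 0; DC < 4; ++DC)
    for (unsigned LC = 0; LC < 4; ++LC)
      for (int DS = 0; DS < 3; ++DS)
        for (int LS = 0; LS < 3; ++LS)
          assert(asWritten(DC, DS, LC, LS) == simplified(DC, DS, LC, LS));
}
```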

				/// Return true if the instruction is a definition that is loop carried
				/// and defines the use on the next iteration.
				/// v1 = phi(v2, v3)
				/// (Def) v3 = op v1
				/// (MO) = v1
				/// If MO appears before Def, then v1 and v3 may get assigned to the same
				/// register.
				bool SMSchedule::isLoopCarriedDefOfUse(SwingSchedulerDAG *SSD,
				MachineInstr *Def, MachineOperand &MO) {
				if (!MO.isReg())
				return false;
				if (Def->isPHI())
				return false;
				MachineInstr *Phi = MRI.getVRegDef(MO.getReg());
				if (!Phi || !Phi->isPHI() || Phi->getParent() != Def->getParent())
				return false;
				if (!isLoopCarried(SSD, Phi))
				return false;
				unsigned LoopReg = getLoopPhiReg(Phi, Phi->getParent());
				for (unsigned i = 0, e = Def->getNumOperands(); i != e; ++i) {
				MachineOperand &DMO = Def->getOperand(i);
				if (!DMO.isReg() || !DMO.isDef())
				continue;
				if (DMO.getReg() == LoopReg)
				return true;
				}
				return false;
				}

				// After the schedule has been formed, call this function to combine
				// the instructions from the different stages/cycles. That is, this
				// function creates a schedule that represents a single iteration.
				void SMSchedule::finalizeSchedule(SwingSchedulerDAG *SSD) {
				// Move all instructions to the first stage from later stages.
				for (int cycle = getFirstCycle(); cycle <= getFinalCycle(); ++cycle) {
				for (int stage = 1, lastStage = getMaxStageCount(); stage <= lastStage;
				++stage) {
				std::deque<SUnit *> &cycleInstrs =
				ScheduledInstrs[cycle + (stage * InitiationInterval)];
				for (std::deque<SUnit *>::reverse_iterator I = cycleInstrs.rbegin(),
				E = cycleInstrs.rend();
				I != E; ++I)
				ScheduledInstrs[cycle].push_front(*I);
				}
				}
				// Iterate over the definitions in each instruction, and compute the
				// stage difference for each use. Keep the maximum value.
				for (std::map<SUnit *, int>::iterator I = InstrToCycle.begin(),
				E = InstrToCycle.end();
				I != E; ++I) {
				int DefStage = stageScheduled(I->first);
				MachineInstr *MI = I->first->getInstr();
				for (unsigned i = 0, e = MI->getNumOperands(); i < e; ++i) {
				MachineOperand &Op = MI->getOperand(i);
				if (!Op.isReg() || !Op.isDef())
				continue;

				unsigned Reg = Op.getReg();
				unsigned MaxDiff = 0;
				bool PhiIsSwapped = false;
				for (MachineRegisterInfo::use_iterator UI = MRI.use_begin(Reg),
				EI = MRI.use_end();
				UI != EI; ++UI) {
				MachineOperand &UseOp = *UI;
				MachineInstr *UseMI = UseOp.getParent();
				SUnit *SUnitUse = SSD->getSUnit(UseMI);
				int UseStage = stageScheduled(SUnitUse);
				unsigned Diff = 0;
				if (UseStage != -1 && UseStage >= DefStage)
				Diff = UseStage - DefStage;
				if (MI->isPHI()) {
				if (isLoopCarried(SSD, MI))
				++Diff;
				else
				PhiIsSwapped = true;
				}
				MaxDiff = std::max(Diff, MaxDiff);
				}
				RegToStageDiff[Reg] = std::make_pair(MaxDiff, PhiIsSwapped);
				}
				}

				// Erase all the elements in the later stages. Only one iteration should
				// remain in the scheduled list, and it contains all the instructions.
				for (int cycle = getFinalCycle() + 1; cycle <= LastCycle; ++cycle)
				ScheduledInstrs.erase(cycle);

				// Change the registers in instruction as specified in the InstrChanges
				// map. We need to use the new registers to create the correct order.
				for (int i = 0, e = SSD->SUnits.size(); i != e; ++i) {
				SUnit *SU = &SSD->SUnits[i];
				SSD->applyInstrChange(SU->getInstr(), *this, true);
				}

				// Reorder the instructions in each cycle to fix and improve the
				// generated code.
				for (int Cycle = getFirstCycle(), E = getFinalCycle(); Cycle <= E; ++Cycle) {
				std::deque<SUnit *> &cycleInstrs = ScheduledInstrs[Cycle];
				std::deque<SUnit *> newOrderZC;
				// Put the zero-cost, pseudo instructions at the start of the cycle.
				for (unsigned i = 0, e = cycleInstrs.size(); i < e; ++i) {
				SUnit *SU = cycleInstrs[i];
				if (ST.getInstrInfo()->isZeroCost(SU->getInstr()->getOpcode()))
				orderDependence(SSD, SU, newOrderZC);
				}
				std::deque<SUnit *> newOrderI;
				// Then, add the regular instructions back.
				for (unsigned i = 0, e = cycleInstrs.size(); i < e; ++i) {
				SUnit *SU = cycleInstrs[i];
				if (!ST.getInstrInfo()->isZeroCost(SU->getInstr()->getOpcode()))
				orderDependence(SSD, SU, newOrderI);
				}
				// Replace the old order with the new order.
				cycleInstrs.swap(newOrderZC);
				cycleInstrs.insert(cycleInstrs.end(), newOrderI.begin(), newOrderI.end());
				}

				DEBUG(dump(););
				}
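
The first loop of finalizeSchedule folds every later stage into the cycles of the first stage: the instructions at `cycle + stage * II` are prepended to the list for `cycle`. The following is a minimal standalone sketch of that step. It uses plain integers as instruction ids, assumes the final cycle of the first stage is `FirstCycle + II - 1` (the definition of getFinalCycle is not shown here), and erases each folded entry immediately rather than in a separate loop. `Schedule` and `foldStages` are hypothetical names.

```cpp
#include <cassert>
#include <deque>
#include <map>

// Maps a cycle number to the instruction ids scheduled in that cycle.
using Schedule = std::map<int, std::deque<int>>;

// Fold the instructions of stages 1..MaxStage into the cycles of stage 0.
void foldStages(Schedule &S, int FirstCycle, int II, int MaxStage) {
  assert(II > 0 && MaxStage >= 0 && "invalid schedule parameters");
  int FinalCycle = FirstCycle + II - 1; // assumption, see the text above
  for (int Cycle = FirstCycle; Cycle <= FinalCycle; ++Cycle) {
    for (int Stage = 1; Stage <= MaxStage; ++Stage) {
      auto It = S.find(Cycle + Stage * II);
      if (It == S.end())
        continue;
      // std::map insertion does not invalidate It.
      std::deque<int> &Dst = S[Cycle];
      // Prepend in reverse so the moved instructions keep their own order.
      for (auto I = It->second.rbegin(), E = It->second.rend(); I != E; ++I)
        Dst.push_front(*I);
      S.erase(It);
    }
  }
}
```

After folding, instructions from later stages sit in front of those from earlier stages within each cycle, and only the cycles of a single iteration remain in the map.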

				// Print the schedule information to the given output.
				void SMSchedule::print(raw_ostream &os) const {
				// Iterate over each cycle.
				for (int cycle = getFirstCycle(); cycle <= getFinalCycle(); ++cycle) {
				// Iterate over each instruction in the cycle.
				const_sched_iterator cycleInstrs = ScheduledInstrs.find(cycle);
				for (std::deque<SUnit *>::const_iterator CI = cycleInstrs->second.begin(),
				ECI = cycleInstrs->second.end();
				CI != ECI; ++CI) {
				os << "cycle " << cycle << " (" << stageScheduled(*CI) << ") ";
				os << "(" << (*CI)->NodeNum << ") ";
				(*CI)->getInstr()->print(os);
				os << "\n";
				}
				}
				}

				// Utility function used for debugging to print the schedule.
				void SMSchedule::dump() const { print(dbgs()); }

lib/CodeGen/Passes.cpp

Show First 20 Lines • Show All 104 Lines • ▼ Show 20 Lines
// Experimental option to run live interval analysis early.		// Experimental option to run live interval analysis early.
static cl::opt<bool> EarlyLiveIntervals("early-live-intervals", cl::Hidden,		static cl::opt<bool> EarlyLiveIntervals("early-live-intervals", cl::Hidden,
cl::desc("Run live interval analysis earlier in the pipeline"));		cl::desc("Run live interval analysis earlier in the pipeline"));

static cl::opt<bool> UseCFLAA("use-cfl-aa-in-codegen",		static cl::opt<bool> UseCFLAA("use-cfl-aa-in-codegen",
cl::init(false), cl::Hidden,		cl::init(false), cl::Hidden,
cl::desc("Enable the new, experimental CFL alias analysis in CodeGen"));		cl::desc("Enable the new, experimental CFL alias analysis in CodeGen"));

		cl::opt<bool> EnableSWP("enable-swp", cl::Hidden, cl::init(false),
		cl::ZeroOrMore,
		cl::desc("Enable Software Pipelining"));

		MatzeB: Even though you see both, I think the more typical style nowadays would be to move this flag into the .cpp file of the pass and make it the first thing that is checked in runOnMachineFunction().
/// Allow standard passes to be disabled by command line options. This supports		/// Allow standard passes to be disabled by command line options. This supports
/// simple binary flags that either suppress the pass or do nothing.		/// simple binary flags that either suppress the pass or do nothing.
/// i.e. -disable-mypass=false has no effect.		/// i.e. -disable-mypass=false has no effect.
/// These should be converted to boolOrDefault in order to use applyOverride.		/// These should be converted to boolOrDefault in order to use applyOverride.
static IdentifyingPassPtr applyDisable(IdentifyingPassPtr PassID,		static IdentifyingPassPtr applyDisable(IdentifyingPassPtr PassID,
bool Override) {		bool Override) {
if (Override)		if (Override)
return IdentifyingPassPtr();		return IdentifyingPassPtr();
▲ Show 20 Lines • Show All 419 Lines • ▼ Show 20 Lines	if (getOptLevel() != CodeGenOpt::None) {
// If the target requests it, assign local variables to stack slots relative		// If the target requests it, assign local variables to stack slots relative
// to one another and simplify frame index references where possible.		// to one another and simplify frame index references where possible.
addPass(&LocalStackSlotAllocationID, false);		addPass(&LocalStackSlotAllocationID, false);
}		}

// Run pre-ra passes.		// Run pre-ra passes.
addPreRegAlloc();		addPreRegAlloc();

		if (EnableSWP && getOptLevel() != CodeGenOpt::None) {
		addPass(&MachineSMSID);
		printAndVerify("After software pipelining");
		}

		MatzeB: Maybe we should leave adding the pass to the backends that support it and let them do it by overriding addPreRegAlloc()?
// Run register allocation and passes that are tightly coupled with it,		// Run register allocation and passes that are tightly coupled with it,
// including phi elimination and scheduling.		// including phi elimination and scheduling.
if (getOptimizeRegAlloc())		if (getOptimizeRegAlloc())
addOptimizedRegAlloc(createRegAllocPass(true));		addOptimizedRegAlloc(createRegAllocPass(true));
else		else
addFastRegAlloc(createRegAllocPass(false));		addFastRegAlloc(createRegAllocPass(false));

// Run post-ra passes.		// Run post-ra passes.
▲ Show 20 Lines • Show All 262 Lines • Show Last 20 Lines

lib/Target/Hexagon/HexagonInstrInfo.h

Show First 20 Lines • Show All 97 Lines • ▼ Show 20 Lines	public:
/// It is also invoked by tail merging to add unconditional branches in		/// It is also invoked by tail merging to add unconditional branches in
/// cases where AnalyzeBranch doesn't apply because there was no original		/// cases where AnalyzeBranch doesn't apply because there was no original
/// branch to analyze. At least this much must be implemented, else tail		/// branch to analyze. At least this much must be implemented, else tail
/// merging needs to be disabled.		/// merging needs to be disabled.
unsigned InsertBranch(MachineBasicBlock &MBB, MachineBasicBlock *TBB,		unsigned InsertBranch(MachineBasicBlock &MBB, MachineBasicBlock *TBB,
MachineBasicBlock *FBB, ArrayRef<MachineOperand> Cond,		MachineBasicBlock *FBB, ArrayRef<MachineOperand> Cond,
DebugLoc DL) const override;		DebugLoc DL) const override;

		/// Analyze the loop code, return true if it cannot be
		/// understood. Upon success, this function returns false and returns
		/// information about the induction variable and compare instruction
		/// used at the end.
		bool AnalyzeLoop(MachineLoop *L, MachineInstr *&IndVarInst,
		MachineInstr *&CmpInst) const override;

		/// Generate code to reduce the loop iteration by one
		/// and check if the loop is finished. Return the value/register of the
		/// new loop count. We need this function when peeling off one
		/// or more iterations of a loop. This function assumes the nth iteration
		/// is peeled first.
		unsigned ReduceLoopCount(MachineBasicBlock &MBB,
		MachineInstr *IndVar, MachineInstr *Cmp,
		SmallVectorImpl<MachineOperand> &Cond,
		SmallVectorImpl<MachineInstr *> &PrevInsts,
		unsigned Iter, unsigned MaxIter) const override;

/// Return true if it's profitable to predicate		/// Return true if it's profitable to predicate
/// instructions with accumulated instruction latency of "NumCycles"		/// instructions with accumulated instruction latency of "NumCycles"
/// of the specified basic block, where the probability of the instructions		/// of the specified basic block, where the probability of the instructions
/// being executed is given by Probability, and Confidence is a measure		/// being executed is given by Probability, and Confidence is a measure
/// of our confidence that it will be properly predicted.		/// of our confidence that it will be properly predicted.
bool isProfitableToIfCvt(MachineBasicBlock &MBB, unsigned NumCycles,		bool isProfitableToIfCvt(MachineBasicBlock &MBB, unsigned NumCycles,
unsigned ExtraPredCycles,		unsigned ExtraPredCycles,
BranchProbability Probability) const override;		BranchProbability Probability) const override;
▲ Show 20 Lines • Show All 54 Lines • ▼ Show 20 Lines	public:
/// This function is called for all pseudo instructions		/// This function is called for all pseudo instructions
/// that remain after register allocation. Many pseudo instructions are		/// that remain after register allocation. Many pseudo instructions are
/// created to help register allocation. This is the place to convert them		/// created to help register allocation. This is the place to convert them
/// into real instructions. The target can edit MI in place, or it can insert		/// into real instructions. The target can edit MI in place, or it can insert
/// new instructions and erase MI. The function should return true if		/// new instructions and erase MI. The function should return true if
/// anything was changed.		/// anything was changed.
bool expandPostRAPseudo(MachineBasicBlock::iterator MI) const override;		bool expandPostRAPseudo(MachineBasicBlock::iterator MI) const override;

		/// \brief Get the base register and byte offset of a load/store instr.
		bool getMemOpBaseRegImmOfs(MachineInstr *LdSt, unsigned &BaseReg,
		unsigned &Offset,
		const TargetRegisterInfo *TRI) const override;

/// Reverses the branch condition of the specified condition list,		/// Reverses the branch condition of the specified condition list,
/// returning false on success and true if it cannot be reversed.		/// returning false on success and true if it cannot be reversed.
bool ReverseBranchCondition(SmallVectorImpl<MachineOperand> &Cond)		bool ReverseBranchCondition(SmallVectorImpl<MachineOperand> &Cond)
const override;		const override;

/// Insert a noop into the instruction stream at the specified point.		/// Insert a noop into the instruction stream at the specified point.
void insertNoop(MachineBasicBlock &MBB,		void insertNoop(MachineBasicBlock &MBB,
MachineBasicBlock::iterator MI) const override;		MachineBasicBlock::iterator MI) const override;
▲ Show 20 Lines • Show All 61 Lines • ▼ Show 20 Lines	public:
// Sometimes, it is possible for the target		// Sometimes, it is possible for the target
// to tell, even without aliasing information, that two MIs access different		// to tell, even without aliasing information, that two MIs access different
// memory addresses. This function returns true if two MIs access different		// memory addresses. This function returns true if two MIs access different
// memory addresses and false otherwise.		// memory addresses and false otherwise.
bool areMemAccessesTriviallyDisjoint(MachineInstr *MIa, MachineInstr *MIb,		bool areMemAccessesTriviallyDisjoint(MachineInstr *MIa, MachineInstr *MIb,
AliasAnalysis *AA = nullptr)		AliasAnalysis *AA = nullptr)
const override;		const override;

		/// For instructions with a base and offset, return the position of the
		/// base register and offset operands.
		bool getBaseAndOffsetPosition(const MachineInstr *MI, unsigned &BasePos,
		unsigned &OffsetPos) const override;

		/// If the instruction is an increment of a constant value, return the amount.
		bool getIncrementValue(const MachineInstr *MI, int &Value) const override;


/// HexagonInstrInfo specifics.		/// HexagonInstrInfo specifics.
///		///

const HexagonRegisterInfo &getRegisterInfo() const { return RI; }		const HexagonRegisterInfo &getRegisterInfo() const { return RI; }

unsigned createVR(MachineFunction* MF, MVT VT) const;		unsigned createVR(MachineFunction* MF, MVT VT) const;

▲ Show 20 Lines • Show All 78 Lines • ▼ Show 20 Lines	public:
bool predCanBeUsedAsDotNew(const MachineInstr *MI, unsigned PredReg) const;		bool predCanBeUsedAsDotNew(const MachineInstr *MI, unsigned PredReg) const;
bool PredOpcodeHasJMP_c(unsigned Opcode) const;		bool PredOpcodeHasJMP_c(unsigned Opcode) const;
bool predOpcodeHasNot(ArrayRef<MachineOperand> Cond) const;		bool predOpcodeHasNot(ArrayRef<MachineOperand> Cond) const;


unsigned getAddrMode(const MachineInstr* MI) const;		unsigned getAddrMode(const MachineInstr* MI) const;
unsigned getBaseAndOffset(const MachineInstr *MI, int &Offset,		unsigned getBaseAndOffset(const MachineInstr *MI, int &Offset,
unsigned &AccessSize) const;		unsigned &AccessSize) const;
bool getBaseAndOffsetPosition(const MachineInstr *MI, unsigned &BasePos,
unsigned &OffsetPos) const;
SmallVector<MachineInstr*,2> getBranchingInstrs(MachineBasicBlock& MBB) const;		SmallVector<MachineInstr*,2> getBranchingInstrs(MachineBasicBlock& MBB) const;
unsigned getCExtOpNum(const MachineInstr *MI) const;		unsigned getCExtOpNum(const MachineInstr *MI) const;
HexagonII::CompoundGroup		HexagonII::CompoundGroup
getCompoundCandidateGroup(const MachineInstr *MI) const;		getCompoundCandidateGroup(const MachineInstr *MI) const;
unsigned getCompoundOpcode(const MachineInstr *GA,		unsigned getCompoundOpcode(const MachineInstr *GA,
const MachineInstr *GB) const;		const MachineInstr *GB) const;
int getCondOpcode(int Opc, bool sense) const;		int getCondOpcode(int Opc, bool sense) const;
int getDotCurOp(const MachineInstr* MI) const;		int getDotCurOp(const MachineInstr* MI) const;
▲ Show 20 Lines • Show All 46 Lines • Show Last 20 Lines

lib/Target/Hexagon/HexagonInstrInfo.cpp

Show First 20 Lines • Show All 570 Lines • ▼ Show 20 Lines	if (isEndLoopN(Cond[0].getImm())) {
unsigned Flags = getUndefRegState(RO.isUndef());		unsigned Flags = getUndefRegState(RO.isUndef());
BuildMI(&MBB, DL, get(BccOpc)).addReg(RO.getReg(), Flags).addMBB(TBB);		BuildMI(&MBB, DL, get(BccOpc)).addReg(RO.getReg(), Flags).addMBB(TBB);
}		}
BuildMI(&MBB, DL, get(BOpc)).addMBB(FBB);		BuildMI(&MBB, DL, get(BOpc)).addMBB(FBB);

return 2;		return 2;
}		}

		/// Analyze the loop code to find the loop induction
		/// variable and compare used to compute the number of iterations.
		/// Currently, we analyze loops that are controlled using hardware
		/// loops. In this case, the induction variable instruction is
		/// null. For all other cases, this function returns true, which
		/// means we're unable to analyze it.
		bool HexagonInstrInfo::AnalyzeLoop(MachineLoop *L,
		MachineInstr *&IndVarInst,
		MachineInstr *&CmpInst) const {

		MachineBasicBlock *LoopEnd = L->getBottomBlock();
		MachineBasicBlock::iterator I = LoopEnd->getFirstTerminator();
		// We really "analyze" only hardware loops right now.
		if (I != LoopEnd->end() && isEndLoopN(I->getOpcode())) {
		IndVarInst = nullptr;
		CmpInst = I;
		return false;
		}
		return true;
		}

		/// ReduceLoopCount - Generate code to reduce the loop iteration by one
		/// and check if the loop is finished. Return the value/register of the
		/// new loop count. This function assumes the nth iteration is peeled first.
		unsigned HexagonInstrInfo::ReduceLoopCount(MachineBasicBlock &MBB,
		MachineInstr *IndVar, MachineInstr *Cmp,
		SmallVectorImpl<MachineOperand> &Cond,
		SmallVectorImpl<MachineInstr *> &PrevInsts,
		unsigned Iter, unsigned MaxIter) const {
		// We expect a hardware loop currently. This means that IndVar is set
		// to null, and the compare is the ENDLOOP instruction.
		assert((!IndVar) && isEndLoopN(Cmp->getOpcode())
		&& "Expecting a hardware loop");
		MachineFunction *MF = MBB.getParent();
		DebugLoc DL = Cmp->getDebugLoc();
		SmallPtrSet<MachineBasicBlock *, 8> VisitedBBs;
		MachineInstr *Loop = findLoopInstr(&MBB, Cmp->getOpcode(), VisitedBBs);
		if (!Loop)
		return 0;
		// If the loop trip count is a compile-time value, then just change the
		// value.
		if (Loop->getOpcode() == Hexagon::J2_loop0i ||
		Loop->getOpcode() == Hexagon::J2_loop1i) {
		int64_t Offset = Loop->getOperand(1).getImm();
		if (Offset <= 1)
		Loop->eraseFromParent();
		else
		Loop->getOperand(1).setImm(Offset - 1);
		return Offset - 1;
		}
		// The loop trip count is a run-time value. We generate code to subtract
		// one from the trip count, and update the loop instruction.
		assert(Loop->getOpcode() == Hexagon::J2_loop0r && "Unexpected instruction");
		unsigned LoopCount = Loop->getOperand(1).getReg();
		// Check if we're done with the loop.
		unsigned LoopEnd = createVR(MF, MVT::i1);
		MachineInstr *NewCmp = BuildMI(&MBB, DL, get(Hexagon::C2_cmpgtui), LoopEnd).
		addReg(LoopCount).addImm(1);
		unsigned NewLoopCount = createVR(MF, MVT::i32);
		MachineInstr *NewAdd = BuildMI(&MBB, DL, get(Hexagon::A2_addi), NewLoopCount).
		addReg(LoopCount).addImm(-1);
		// Update the previously generated instructions with the new loop counter.
		for (SmallVectorImpl<MachineInstr *>::iterator I = PrevInsts.begin(),
		E = PrevInsts.end(); I != E; ++I)
		(*I)->substituteRegister(LoopCount, NewLoopCount, 0, getRegisterInfo());
		PrevInsts.clear();
		PrevInsts.push_back(NewCmp);
		PrevInsts.push_back(NewAdd);
		// Insert the new loop instruction if this is the last time the loop is
		// decremented.
		if (Iter == MaxIter)
		BuildMI(&MBB, DL, get(Hexagon::J2_loop0r)).
		addMBB(Loop->getOperand(0).getMBB()).addReg(NewLoopCount);
		// Delete the old loop instruction.
		if (Iter == 0)
		Loop->eraseFromParent();
		Cond.push_back(MachineOperand::CreateImm(Hexagon::J2_jumpf));
		Cond.push_back(NewCmp->getOperand(0));
		return NewLoopCount;
		}
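
For a compile-time trip count, ReduceLoopCount above only rewrites the immediate of the loop setup instruction, or erases that instruction once the count would drop to zero. The following is a minimal standalone sketch of that branch only. `HwLoop` and `reduceConstTripCount` are hypothetical names, and the struct merely stands in for the J2_loop0i/J2_loop1i instruction and its immediate operand.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a hardware-loop setup instruction with an
// immediate trip count.
struct HwLoop {
  int64_t TripCount; // immediate operand of the loop setup instruction
  bool Erased;       // true once the setup instruction has been removed
};

// Peel one iteration: decrement the immediate, or erase the setup
// instruction when no iterations remain for the loop. Returns the new count.
int64_t reduceConstTripCount(HwLoop &Loop) {
  assert(!Loop.Erased && "loop setup instruction was already removed");
  int64_t Old = Loop.TripCount;
  if (Old <= 1)
    Loop.Erased = true;
  else
    Loop.TripCount = Old - 1;
  return Old - 1;
}
```

Applying this twice to a trip count of 1000 leaves 998, which is the immediate that swp-const-tc.ll checks for. How many times the pass actually calls the hook for that test is not shown on this page.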

bool HexagonInstrInfo::isProfitableToIfCvt(MachineBasicBlock &MBB,		bool HexagonInstrInfo::isProfitableToIfCvt(MachineBasicBlock &MBB,
unsigned NumCycles, unsigned ExtraPredCycles,		unsigned NumCycles, unsigned ExtraPredCycles,
BranchProbability Probability) const {		BranchProbability Probability) const {
return nonDbgBBSize(&MBB) <= 3;		return nonDbgBBSize(&MBB) <= 3;
}		}


▲ Show 20 Lines • Show All 879 Lines • ▼ Show 20 Lines	if (OffsetA > OffsetB) {
uint64_t offDiff = (uint64_t)((int64_t)OffsetB - (int64_t)OffsetA);		uint64_t offDiff = (uint64_t)((int64_t)OffsetB - (int64_t)OffsetA);
return (SizeA <= offDiff);		return (SizeA <= offDiff);
}		}

return false;		return false;
}		}


		/// If the instruction is an increment of a constant value, return the amount.
		bool HexagonInstrInfo::getIncrementValue(const MachineInstr *MI,
		int &Value) const {
		if (isPostIncrement(MI)) {
		unsigned AccessSize;
		return getBaseAndOffset(MI, Value, AccessSize);
		}
		if (MI->getOpcode() == Hexagon::A2_addi) {
		Value = MI->getOperand(2).getImm();
		return true;
		}

		return false;
		}


unsigned HexagonInstrInfo::createVR(MachineFunction* MF, MVT VT) const {		unsigned HexagonInstrInfo::createVR(MachineFunction* MF, MVT VT) const {
MachineRegisterInfo &MRI = MF->getRegInfo();		MachineRegisterInfo &MRI = MF->getRegInfo();
const TargetRegisterClass *TRC;		const TargetRegisterClass *TRC;
if (VT == MVT::i1) {		if (VT == MVT::i1) {
TRC = &Hexagon::PredRegsRegClass;		TRC = &Hexagon::PredRegsRegClass;
} else if (VT == MVT::i32 || VT == MVT::f32) {		} else if (VT == MVT::i32 || VT == MVT::f32) {
TRC = &Hexagon::IntRegsRegClass;		TRC = &Hexagon::IntRegsRegClass;
} else if (VT == MVT::i64 || VT == MVT::f64) {		} else if (VT == MVT::i64 || VT == MVT::f64) {
▲ Show 20 Lines • Show All 1,064 Lines • ▼ Show 20 Lines	bool HexagonInstrInfo::isVecUsableNextPacket(const MachineInstr *ProdMI,

if (mayBeNewStore(ConsMI))		if (mayBeNewStore(ConsMI))
return true;		return true;

return false;		return false;
}		}


		/// \brief Get the base register and byte offset of a load/store instr.
		bool HexagonInstrInfo::getMemOpBaseRegImmOfs(MachineInstr *LdSt,
		unsigned &BaseReg, unsigned &Offset, const TargetRegisterInfo *TRI)
		const {
		unsigned AccessSize = 0;
		int OffsetVal = 0;
		BaseReg = getBaseAndOffset(LdSt, OffsetVal, AccessSize);
		Offset = (unsigned)OffsetVal;
		return BaseReg != 0;
		}


/// \brief Can these instructions execute at the same time in a bundle.		/// \brief Can these instructions execute at the same time in a bundle.
bool HexagonInstrInfo::canExecuteInBundle(const MachineInstr *First,		bool HexagonInstrInfo::canExecuteInBundle(const MachineInstr *First,
const MachineInstr *Second) const {		const MachineInstr *Second) const {
if (DisableNVSchedule)		if (DisableNVSchedule)
return false;		return false;
if (mayBeNewStore(Second)) {		if (mayBeNewStore(Second)) {
// Make sure the definition of the first instruction is the value being		// Make sure the definition of the first instruction is the value being
// stored.		// stored.
▲ Show 20 Lines • Show All 1,311 Lines • Show Last 20 Lines

lib/Target/Hexagon/HexagonTargetMachine.cpp

Show All 20 Lines
#include "llvm/IR/LegacyPassManager.h"		#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"		#include "llvm/IR/Module.h"
#include "llvm/Support/CommandLine.h"		#include "llvm/Support/CommandLine.h"
#include "llvm/Support/TargetRegistry.h"		#include "llvm/Support/TargetRegistry.h"
#include "llvm/Transforms/Scalar.h"		#include "llvm/Transforms/Scalar.h"

using namespace llvm;		using namespace llvm;

		extern cl::opt<bool> EnableSWP;

static cl::opt<bool> EnableRDFOpt("rdf-opt", cl::Hidden, cl::ZeroOrMore,		static cl::opt<bool> EnableRDFOpt("rdf-opt", cl::Hidden, cl::ZeroOrMore,
cl::init(true), cl::desc("Enable RDF-based optimizations"));		cl::init(true), cl::desc("Enable RDF-based optimizations"));

static cl::opt<bool> DisableHardwareLoops("disable-hexagon-hwloops",		static cl::opt<bool> DisableHardwareLoops("disable-hexagon-hwloops",
cl::Hidden, cl::desc("Disable Hardware Loops for Hexagon target"));		cl::Hidden, cl::desc("Disable Hardware Loops for Hexagon target"));

static cl::opt<bool> DisableHexagonCFGOpt("disable-hexagon-cfgopt",		static cl::opt<bool> DisableHexagonCFGOpt("disable-hexagon-cfgopt",
▲ Show 20 Lines • Show All 144 Lines • ▼ Show 20 Lines	HexagonPassConfig(HexagonTargetMachine *TM, PassManagerBase &PM)
: TargetPassConfig(TM, PM) {		: TargetPassConfig(TM, PM) {
bool NoOpt = (TM->getOptLevel() == CodeGenOpt::None);		bool NoOpt = (TM->getOptLevel() == CodeGenOpt::None);
if (!NoOpt) {		if (!NoOpt) {
if (EnableExpandCondsets) {		if (EnableExpandCondsets) {
Pass *Exp = createHexagonExpandCondsets();		Pass *Exp = createHexagonExpandCondsets();
insertPass(&RegisterCoalescerID, IdentifyingPassPtr(Exp));		insertPass(&RegisterCoalescerID, IdentifyingPassPtr(Exp));
}		}
}		}
		// Enable software pipelining at O2 and higher.
		if (TM->getOptLevel() >= CodeGenOpt::Default && !EnableSWP.getPosition())
		EnableSWP = true;
}		}

HexagonTargetMachine &getHexagonTargetMachine() const {		HexagonTargetMachine &getHexagonTargetMachine() const {
return getTM<HexagonTargetMachine>();		return getTM<HexagonTargetMachine>();
}		}

ScheduleDAGInstrs *		ScheduleDAGInstrs *
createMachineScheduler(MachineSchedContext *C) const override {		createMachineScheduler(MachineSchedContext *C) const override {
▲ Show 20 Lines • Show All 111 Lines • Show Last 20 Lines

test/CodeGen/Hexagon/swp-const-tc.ll

This file was added.

				; RUN: llc -march=hexagon -mcpu=hexagonv5 -enable-swp -verify-machineinstrs < %s | FileCheck %s

				; If the trip count is a compile-time constant, then decrement it instead
				; of computing a new LC0 value.

				; CHECK: loop0(.LBB0_1, #998)

				MatzeB: Some comments about all the tests:
				- Tests should start with ; CHECK-LABEL: functionname: to make them more stable against unrelated output that happens to be on stdout triggering check lines (pass debug output, filenames, etc.)
				- The tests appear to contain unnecessary extra data (nocapture, readonly) flags, sometimes strange value or block names, function and alias metadata. I am sure they can and should be further simplified.
				define i32 @test(i32* nocapture readonly %A, i32* nocapture readnone %B, i32 %count) #0 {
				entry:
				br label %for.body

				for.body:
				%sum.02 = phi i32 [ 0, %entry ], [ %add, %for.body ]
				%arrayidx.phi = phi i32* [ %A, %entry ], [ %arrayidx.inc, %for.body ]
				%i.01 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
				%0 = load i32, i32* %arrayidx.phi, align 4
				%add = add nsw i32 %0, %sum.02
				%inc = add nsw i32 %i.01, 1
				%exitcond = icmp eq i32 %inc, 1000
				%arrayidx.inc = getelementptr i32, i32* %arrayidx.phi, i32 1
				br i1 %exitcond, label %for.end, label %for.body

				for.end:
				ret i32 %add
				}

				; The constant trip count is small enough that the kernel is not executed.

				; CHECK-NOT: loop0(

				define i32 @test1(i32* nocapture readonly %A, i32* nocapture readnone %B, i32 %count) #0 {
				entry:
				br label %for.body

				for.body:
				%sum.02 = phi i32 [ 0, %entry ], [ %add, %for.body ]
				%arrayidx.phi = phi i32* [ %A, %entry ], [ %arrayidx.inc, %for.body ]
				%i.01 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
				%0 = load i32, i32* %arrayidx.phi, align 4
				%add = add nsw i32 %0, %sum.02
				%inc = add nsw i32 %i.01, 1
				%exitcond = icmp eq i32 %inc, 1
				%arrayidx.inc = getelementptr i32, i32* %arrayidx.phi, i32 1
				br i1 %exitcond, label %for.end, label %for.body

				for.end:
				ret i32 %add
				}



				attributes #0 = { nounwind readonly "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"="true" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
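The two functions in swp-const-tc.ll differ only in their constant trip count (1000 versus 1). The sketch below shows one way to read the expected #998 and the CHECK-NOT. It assumes that each prolog block takes one iteration away from the kernel and that the hardware loop setup is dropped once no more than one iteration remains. The stage count of 3 is inferred from the #998 in the CHECK line, not stated in the test, and kernelTripCount is a hypothetical helper, not a function from this patch.

```cpp
#include <cstdint>

// Hypothetical helper: trip count left for the kernel after peeling one
// iteration per prolog block. Returns 0 when the kernel would not execute.
int64_t kernelTripCount(int64_t TripCount, unsigned NumStages) {
  int64_t Remaining = TripCount;
  // A schedule with NumStages stages is assumed to have NumStages - 1 prologs.
  for (unsigned Prolog = 1; Prolog < NumStages; ++Prolog) {
    if (Remaining <= 1)
      return 0; // nothing left for the kernel, so no loop0 setup is needed
    --Remaining;
  }
  return Remaining;
}
```

Under these assumptions, kernelTripCount(1000, 3) is 998, which matches loop0(.LBB0_1, #998), and kernelTripCount(1, 3) is 0, which matches the CHECK-NOT in @test1.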

test/CodeGen/Hexagon/swp-dag-phi.ll

This file was added.

				; RUN: llc -march=hexagon -mcpu=hexagonv5 -enable-swp -swp-max-stages=2 < %s
				; REQUIRES: asserts

				; This test checks that a dependence is created between a Phi and its uses.
				; An assert occurs if the Phi dependences are not correct.

				define void @test1(i32* nocapture %f2, i32 %nc) {
				entry:
				%i.011 = add i32 %nc, -1
				%cmp12 = icmp sgt i32 %i.011, 1
				br i1 %cmp12, label %for.body.preheader, label %for.end

				for.body.preheader:
				%0 = add i32 %nc, -2
				%scevgep = getelementptr i32, i32* %f2, i32 %0
				%sri = load i32, i32* %scevgep, align 4
				%scevgep15 = getelementptr i32, i32* %f2, i32 %i.011
				%sri16 = load i32, i32* %scevgep15, align 4
				br label %for.body

				for.body:
				%i.014 = phi i32 [ %i.0, %for.body ], [ %i.011, %for.body.preheader ]
				%i.0.in13 = phi i32 [ %i.014, %for.body ], [ %nc, %for.body.preheader ]
				%sr = phi i32 [ %1, %for.body ], [ %sri, %for.body.preheader ]
				%sr17 = phi i32 [ %sr, %for.body ], [ %sri16, %for.body.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %f2, i32 %i.014
				%sub1 = add nsw i32 %i.0.in13, -3
				%arrayidx2 = getelementptr inbounds i32, i32* %f2, i32 %sub1
				%1 = load i32, i32* %arrayidx2, align 4
				%sub3 = sub nsw i32 %sr17, %1
				store i32 %sub3, i32* %arrayidx, align 4
				%i.0 = add nsw i32 %i.014, -1
				%cmp = icmp sgt i32 %i.0, 1
				br i1 %cmp, label %for.body, label %for.end.loopexit

				for.end.loopexit:
				br label %for.end

				for.end:
				ret void
				}

test/CodeGen/Hexagon/swp-epilog-reuse.ll

This file was added.

				; RUN: llc -fp-contract=fast -O3 -march=hexagon -mcpu=hexagonv5 < %s
				; REQUIRES: asserts

				; Test that the pipeliner doesn't ICE because the PHI generation
				; code in the epilog does not attempt to reuse an existing PHI.

				define void @fcvScaleDownBy2Gaussian5x5f32C(float* noalias %srcImg, i32 %width, float* noalias %dstImg) #0 {
				entry.split:
				%shr = lshr i32 %width, 1
				%incdec.ptr253 = getelementptr inbounds float, float* %dstImg, i32 2
				br i1 undef, label %for.body, label %for.end

				for.body:
				%dst.21518.reg2mem.0 = phi float* [ null, %while.end712 ], [ %incdec.ptr253, %entry.split ]
				%dstEnd.01519 = phi float* [ %add.ptr725, %while.end712 ], [ undef, %entry.split ]
				%add.ptr367 = getelementptr inbounds float, float* %srcImg, i32 undef
				%dst.31487 = getelementptr inbounds float, float* %dst.21518.reg2mem.0, i32 1
				br i1 undef, label %while.body661.preheader, label %while.end712

				while.body661.preheader:
				%scevgep1941 = getelementptr float, float* %add.ptr367, i32 1
				br label %while.body661.ur

				while.body661.ur:
				%lsr.iv1942 = phi float* [ %scevgep1941, %while.body661.preheader ], [ undef, %while.body661.ur ]
				%col1.31508.reg2mem.0.ur = phi float [ %col3.31506.reg2mem.0.ur, %while.body661.ur ], [ undef, %while.body661.preheader ]
				%col4.31507.reg2mem.0.ur = phi float [ %add710.ur, %while.body661.ur ], [ 0.000000e+00, %while.body661.preheader ]
				%col3.31506.reg2mem.0.ur = phi float [ %add689.ur, %while.body661.ur ], [ undef, %while.body661.preheader ]
				%dst.41511.ur = phi float* [ %incdec.ptr674.ur, %while.body661.ur ], [ %dst.31487, %while.body661.preheader ]
				%mul662.ur = fmul float %col1.31508.reg2mem.0.ur, 4.000000e+00
				%add663.ur = fadd float undef, %mul662.ur
				%add665.ur = fadd float %add663.ur, undef
				%add667.ur = fadd float undef, %add665.ur
				%add669.ur = fadd float undef, %add667.ur
				%add670.ur = fadd float %col4.31507.reg2mem.0.ur, %add669.ur
				%conv673.ur = fmul float %add670.ur, 3.906250e-03
				%incdec.ptr674.ur = getelementptr inbounds float, float* %dst.41511.ur, i32 1
				store float %conv673.ur, float* %dst.41511.ur, align 4
				%scevgep1959 = getelementptr float, float* %lsr.iv1942, i32 -1
				%0 = load float, float* %scevgep1959, align 4
				%mul680.ur = fmul float %0, 4.000000e+00
				%add681.ur = fadd float undef, %mul680.ur
				%add684.ur = fadd float undef, %add681.ur
				%add687.ur = fadd float undef, %add684.ur
				%add689.ur = fadd float undef, %add687.ur
				%add699.ur = fadd float undef, undef
				%add703.ur = fadd float undef, %add699.ur
				%add707.ur = fadd float undef, %add703.ur
				%add710.ur = fadd float undef, %add707.ur
				%cmp660.ur = icmp ult float* %incdec.ptr674.ur, %dstEnd.01519
				br i1 %cmp660.ur, label %while.body661.ur, label %while.end712

				while.end712:
				%dst.4.lcssa.reg2mem.0 = phi float* [ %dst.31487, %for.body ], [ undef, %while.body661.ur ]
				%conv721 = fpext float undef to double
				%mul722 = fmul double %conv721, 0x3F7111112119E8FB
				%conv723 = fptrunc double %mul722 to float
				store float %conv723, float* %dst.4.lcssa.reg2mem.0, align 4
				%add.ptr725 = getelementptr inbounds float, float* %dstEnd.01519, i32 %shr
				%cmp259 = icmp ult i32 undef, undef
				br i1 %cmp259, label %for.body, label %for.end

				for.end:
				ret void
				}

				attributes #0 = { nounwind "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-realign-stack" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }

test/CodeGen/Hexagon/swp-matmul-bitext.ll

This file was added.

				; RUN: llc -march=hexagon -mcpu=hexagonv60 -enable-bsb-sched=0 -enable-swp < %s | FileCheck %s
				; RUN: llc -march=hexagon -mcpu=hexagonv5 -enable-swp < %s | FileCheck %s

				; From coremark. Test that we pipeline the matrix multiplication bitextract
				; function. The pipelined code should have two packets.

				; CHECK: loop0(.LBB0_[[LOOP:.]],
				; CHECK: .LBB0_[[LOOP]]:
				; CHECK: = extractu([[REG2:(r[0-9]+)]],
				; CHECK: = extractu([[REG2]],
				; CHECK: [[REG0:(r[0-9]+)]] = memh
				; CHECK: [[REG1:(r[0-9]+)]] = memh
				; CHECK: += mpyi
				; CHECK: [[REG2]] = mpyi([[REG0]], [[REG1]])
				; CHECK: endloop0

				%union_h2_sem_t = type { i32 }

				@sem_i = common global [0 x %union_h2_sem_t] zeroinitializer, align 4

				define void @matrix_mul_matrix_bitextract(i32 %N, i32* nocapture %C, i16* nocapture readonly %A, i16* nocapture readonly %B) #0 {
				entry:
				%cmp53 = icmp eq i32 %N, 0
				br i1 %cmp53, label %for_end27, label %for_body3_lr_ph_us

				for_body3_lr_ph_us:
				%i_054_us = phi i32 [ %inc26_us, %for_cond1_for_inc25_crit_edge_us ], [ 0, %entry ]
				%0 = mul i32 %i_054_us, %N
				%arrayidx9_us_us_gep = getelementptr i16, i16* %A, i32 %0
				br label %for_body3_us_us

				for_cond1_for_inc25_crit_edge_us:
				%inc26_us = add i32 %i_054_us, 1
				%exitcond89 = icmp eq i32 %inc26_us, %N
				br i1 %exitcond89, label %for_end27, label %for_body3_lr_ph_us

				for_body3_us_us:
				%j_052_us_us = phi i32 [ %inc23_us_us, %for_cond4_for_inc22_crit_edge_us_us ], [ 0, %for_body3_lr_ph_us ]
				%add_us_us = add i32 %j_052_us_us, %0
				%arrayidx_us_us = getelementptr inbounds i32, i32* %C, i32 %add_us_us
				store i32 0, i32* %arrayidx_us_us, align 4
				br label %for_body6_us_us

				for_cond4_for_inc22_crit_edge_us_us:
				store i32 %add21_us_us, i32* %arrayidx_us_us, align 4
				%inc23_us_us = add i32 %j_052_us_us, 1
				%exitcond88 = icmp eq i32 %inc23_us_us, %N
				br i1 %exitcond88, label %for_cond1_for_inc25_crit_edge_us, label %for_body3_us_us

				for_body6_us_us:
				%1 = phi i32 [ 0, %for_body3_us_us ], [ %add21_us_us, %for_body6_us_us ]
				%arrayidx9_us_us_phi = phi i16* [ %arrayidx9_us_us_gep, %for_body3_us_us ], [ %arrayidx9_us_us_inc, %for_body6_us_us ]
				%k_050_us_us = phi i32 [ 0, %for_body3_us_us ], [ %inc_us_us, %for_body6_us_us ]
				%2 = load i16, i16* %arrayidx9_us_us_phi, align 2
				%conv_us_us = sext i16 %2 to i32
				%mul10_us_us = mul i32 %k_050_us_us, %N
				%add11_us_us = add i32 %mul10_us_us, %j_052_us_us
				%arrayidx12_us_us = getelementptr inbounds i16, i16* %B, i32 %add11_us_us
				%3 = load i16, i16* %arrayidx12_us_us, align 2
				%conv13_us_us = sext i16 %3 to i32
				%mul14_us_us = mul nsw i32 %conv13_us_us, %conv_us_us
				%shr47_us_us = lshr i32 %mul14_us_us, 2
				%and_us_us = and i32 %shr47_us_us, 15
				%shr1548_us_us = lshr i32 %mul14_us_us, 5
				%and16_us_us = and i32 %shr1548_us_us, 127
				%mul17_us_us = mul i32 %and_us_us, %and16_us_us
				%add21_us_us = add i32 %mul17_us_us, %1
				%inc_us_us = add i32 %k_050_us_us, 1
				%exitcond87 = icmp eq i32 %inc_us_us, %N
				%arrayidx9_us_us_inc = getelementptr i16, i16* %arrayidx9_us_us_phi, i32 1
				br i1 %exitcond87, label %for_cond4_for_inc22_crit_edge_us_us, label %for_body6_us_us

				for_end27:
				ret void
				}

				attributes #0 = { nounwind "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }

test/CodeGen/Hexagon/swp-max.ll

This file was added.

				; RUN: llc -march=hexagon -mcpu=hexagonv5 -enable-swp -swp-max-stages=2 < %s \
				; RUN: | FileCheck %s

				@A = global [8 x i32] [i32 4, i32 -3, i32 5, i32 -2, i32 -1, i32 2, i32 6, i32 -2], align 8

				define i32 @test(i32 %Left, i32 %Right) nounwind {
				entry:
				%add = add nsw i32 %Right, %Left
				%div = sdiv i32 %add, 2
				%cmp9 = icmp slt i32 %div, %Left
				br i1 %cmp9, label %for.end, label %for.body.preheader

				for.body.preheader:
				br label %for.body

				; CHECK: loop0(.LBB0_[[LOOP:.]],
				; CHECK: .LBB0_[[LOOP]]:
				; CHECK: [[REG1:(r[0-9]+)]] = max(r{{[0-9]+}}, [[REG1]])
				; CHECK: [[REG0:(r[0-9]+)]] = add([[REG2:(r[0-9]+)]], [[REG0]])
				; CHECK: [[REG2]] = memw
				; CHECK: endloop0

				for.body:
				%MaxLeftBorderSum.012 = phi i32 [ %MaxLeftBorderSum.1, %for.body ], [ 0, %for.body.preheader ]
				%i.011 = phi i32 [ %dec, %for.body ], [ %div, %for.body.preheader ]
				%LeftBorderSum.010 = phi i32 [ %add1, %for.body ], [ 0, %for.body.preheader ]
				%arrayidx = getelementptr inbounds [8 x i32], [8 x i32]* @A, i32 0, i32 %i.011
				%0 = load i32, i32* %arrayidx, align 4
				%add1 = add nsw i32 %0, %LeftBorderSum.010
				%cmp2 = icmp sgt i32 %add1, %MaxLeftBorderSum.012
				%MaxLeftBorderSum.1 = select i1 %cmp2, i32 %add1, i32 %MaxLeftBorderSum.012
				%dec = add nsw i32 %i.011, -1
				%cmp = icmp slt i32 %dec, %Left
				br i1 %cmp, label %for.end.loopexit, label %for.body

				for.end.loopexit:
				br label %for.end

				for.end:
				%MaxLeftBorderSum.0.lcssa = phi i32 [ 0, %entry ], [ %MaxLeftBorderSum.1, %for.end.loopexit ]
				ret i32 %MaxLeftBorderSum.0.lcssa
				}

test/CodeGen/Hexagon/swp-vect-dotprod.ll

This file was added.

				; RUN: llc -march=hexagon -mcpu=hexagonv5 -enable-swp < %s | FileCheck %s
				; RUN: llc -march=hexagon -mcpu=hexagonv5 -O2 < %s | FileCheck %s
				; RUN: llc -march=hexagon -mcpu=hexagonv5 -O3 < %s | FileCheck %s
				;
				; Check that we pipeline a vectorized dot product in a single packet.
				;
				; CHECK: {
				; CHECK: += mpyi
				; CHECK: += mpyi
				; CHECK: memd
				; CHECK: memd
				; CHECK: } :endloop0

				@a = common global [5000 x i32] zeroinitializer, align 8
				@b = common global [5000 x i32] zeroinitializer, align 8

				define i32 @vecMultGlobal() nounwind readonly {
				entry:
				br label %polly.loop_body

				polly.loop_after:
				%0 = extractelement <2 x i32> %addp_vec, i32 0
				%1 = extractelement <2 x i32> %addp_vec, i32 1
				%add_sum = add i32 %0, %1
				ret i32 %add_sum

				polly.loop_body:
				%polly.loopiv13 = phi i32 [ 0, %entry ], [ %polly.next_loopiv, %polly.loop_body ]
				%reduction.012 = phi <2 x i32> [ zeroinitializer, %entry ], [ %addp_vec, %polly.loop_body ]
				%polly.next_loopiv = add nsw i32 %polly.loopiv13, 2
				%p_arrayidx1 = getelementptr [5000 x i32], [5000 x i32]* @b, i32 0, i32 %polly.loopiv13
				%p_arrayidx = getelementptr [5000 x i32], [5000 x i32]* @a, i32 0, i32 %polly.loopiv13
				%vector_ptr = bitcast i32* %p_arrayidx1 to <2 x i32>*
				%_p_vec_full = load <2 x i32>, <2 x i32>* %vector_ptr, align 8
				%vector_ptr7 = bitcast i32* %p_arrayidx to <2 x i32>*
				%_p_vec_full8 = load <2 x i32>, <2 x i32>* %vector_ptr7, align 8
				%mulp_vec = mul <2 x i32> %_p_vec_full8, %_p_vec_full
				%addp_vec = add <2 x i32> %mulp_vec, %reduction.012
				%2 = icmp slt i32 %polly.next_loopiv, 5000
				br i1 %2, label %polly.loop_body, label %polly.loop_after
				}

test/CodeGen/Hexagon/swp-vmult.ll

This file was added.

				; RUN: llc -march=hexagon -mcpu=hexagonv5 -enable-swp < %s | FileCheck %s
				; RUN: llc -march=hexagon -mcpu=hexagonv5 -O3 < %s | FileCheck %s

				; Multiply and accumulate
				; CHECK: mpyi([[REG0:r([0-9]+)]], [[REG1:r([0-9]+)]])
				; CHECK-NEXT: add(r{{[0-9]+}}, #4)
				; CHECK-NEXT: [[REG0]] = memw(r{{[0-9]+}} + r{{[0-9]+}}<<#0)
				; CHECK-NEXT: [[REG1]] = memw(r{{[0-9]+}} + r{{[0-9]+}}<<#0)
				; CHECK-NEXT: endloop0

				define i32 @foo(i32* nocapture %a, i32* nocapture %b, i32 %n) nounwind readonly {
				entry:
				br label %for.body

				for.body:
				%sum.03 = phi i32 [ 0, %entry ], [ %add, %for.body ]
				%arrayidx.phi = phi i32* [ %a, %entry ], [ %arrayidx.inc, %for.body ]
				%arrayidx1.phi = phi i32* [ %b, %entry ], [ %arrayidx1.inc, %for.body ]
				%i.02 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
				%0 = load i32, i32* %arrayidx.phi, align 4
				%1 = load i32, i32* %arrayidx1.phi, align 4
				%mul = mul nsw i32 %1, %0
				%add = add nsw i32 %mul, %sum.03
				%inc = add nsw i32 %i.02, 1
				%exitcond = icmp eq i32 %inc, 10000
				%arrayidx.inc = getelementptr i32, i32* %arrayidx.phi, i32 1
				%arrayidx1.inc = getelementptr i32, i32* %arrayidx1.phi, i32 1
				br i1 %exitcond, label %for.end, label %for.body

				for.end:
				ret i32 %add
				}

test/CodeGen/Hexagon/swp-vsum.ll

This file was added.

				; RUN: llc -march=hexagon -mcpu=hexagonv5 -enable-swp < %s | FileCheck %s
				; RUN: llc -march=hexagon -mcpu=hexagonv5 -O3 < %s | FileCheck %s

				; Simple vector total.
				; CHECK: loop0(.LBB0_[[LOOP:.]],
				; CHECK: .LBB0_[[LOOP]]:
				; CHECK: add([[REG:r([0-9]+)]], r{{[0-9]+}})
				; CHECK-NEXT: add(r{{[0-9]+}}, #4)
				; CHECK-NEXT: [[REG]] = memw(r{{[0-9]+}} + r{{[0-9]+}}<<#0)
				; CHECK-NEXT: endloop0

				define i32 @foo(i32* nocapture %a, i32 %n) nounwind readonly {
				entry:
				br label %for.body

				for.body:
				%sum.02 = phi i32 [ 0, %entry ], [ %add, %for.body ]
				%arrayidx.phi = phi i32* [ %a, %entry ], [ %arrayidx.inc, %for.body ]
				%i.01 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
				%0 = load i32, i32* %arrayidx.phi, align 4
				%add = add nsw i32 %0, %sum.02
				%inc = add nsw i32 %i.01, 1
				%exitcond = icmp eq i32 %inc, 10000
				%arrayidx.inc = getelementptr i32, i32* %arrayidx.phi, i32 1
				br i1 %exitcond, label %for.end, label %for.body

				for.end:
				ret i32 %add
				}
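The CHECK lines in swp-vsum.ll expect the add to read a register that the memw later in the same kernel redefines. That pattern suggests a two-stage pipeline in which the load for the next iteration overlaps the add for the current one. The sketch below is a hand-written source-level analogue of that shape, with an explicit prolog, kernel, and epilog. It is an illustration only, not output of the pass, and sumPlain and sumPipelined are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Reference loop: one load and one add per iteration.
int sumPlain(const std::vector<int> &A) {
  int Sum = 0;
  for (std::size_t I = 0; I < A.size(); ++I)
    Sum += A[I];
  return Sum;
}

// Hand-pipelined analogue with two stages (load, then add).
int sumPipelined(const std::vector<int> &A) {
  if (A.empty())
    return 0;
  int Sum = 0;
  int Loaded = A[0]; // prolog: issue the first load
  for (std::size_t I = 1; I < A.size(); ++I) {
    Sum += Loaded;   // kernel: add the value loaded one iteration earlier
    Loaded = A[I];   // kernel: load the value for the next add
  }
  Sum += Loaded;     // epilog: consume the final load
  return Sum;
}
```

Both functions return the same total for any input. The pipelined form runs its kernel one iteration fewer than the original loop, with the prolog and epilog making up the difference.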

test/CodeGen/swp-multi-loops.ll

This file was added.

				; RUN: llc -march=hexagon -mcpu=hexagonv5 -enable-swp < %s | FileCheck %s

				; Make sure we attempt to pipeline all innermost loops.

				; Check if the first loop is pipelined.
				; CHECK: loop0(.LBB0_[[LOOP:.]],
				; CHECK: .LBB0_[[LOOP]]:
				; CHECK: add(r{{[0-9]+}}, r{{[0-9]+}})
				; CHECK-NEXT: memw(r{{[0-9]+}}{{.}}++{{.}}#4)
				; CHECK-NEXT: endloop0

				; Check if the second loop is pipelined.
				; CHECK: loop0(.LBB0_[[LOOP:.]],
				; CHECK: .LBB0_[[LOOP]]:
				; CHECK: add(r{{[0-9]+}}, r{{[0-9]+}})
				; CHECK-NEXT: memw(r{{[0-9]+}}{{.}}++{{.}}#4)
				; CHECK-NEXT: endloop0

				define i32 @test(i32* %a, i32 %n, i32 %l) #0 {
				entry:
				%cmp23 = icmp sgt i32 %n, 0
				br i1 %cmp23, label %for.body3.lr.ph.preheader, label %for.end14

				for.body3.lr.ph.preheader:
				br label %for.body3.lr.ph

				for.body3.lr.ph:
				%sum1.026 = phi i32 [ %add8, %for.inc12 ], [ 0, %for.body3.lr.ph.preheader ]
				%sum.025 = phi i32 [ %add, %for.inc12 ], [ 0, %for.body3.lr.ph.preheader ]
				%j.024 = phi i32 [ %inc13, %for.inc12 ], [ 0, %for.body3.lr.ph.preheader ]
				br label %for.body3

				for.body3:
				%sum.118 = phi i32 [ %sum.025, %for.body3.lr.ph ], [ %add, %for.body3 ]
				%arrayidx.phi = phi i32* [ %a, %for.body3.lr.ph ], [ %arrayidx.inc, %for.body3 ]
				%i.017 = phi i32 [ 0, %for.body3.lr.ph ], [ %inc, %for.body3 ]
				%0 = load i32, i32* %arrayidx.phi, align 4, !tbaa !0
				%add = add nsw i32 %0, %sum.118
				%inc = add nsw i32 %i.017, 1
				%exitcond = icmp eq i32 %inc, %n
				%arrayidx.inc = getelementptr i32, i32* %arrayidx.phi, i32 1
				br i1 %exitcond, label %for.end, label %for.body3

				for.end:
				tail call void @bar(i32* %a) #2
				br label %for.body6

				for.body6:
				%sum1.121 = phi i32 [ %sum1.026, %for.end ], [ %add8, %for.body6 ]
				%arrayidx7.phi = phi i32* [ %a, %for.end ], [ %arrayidx7.inc, %for.body6 ]
				%i.120 = phi i32 [ 0, %for.end ], [ %inc10, %for.body6 ]
				%1 = load i32, i32* %arrayidx7.phi, align 4, !tbaa !0
				%add8 = add nsw i32 %1, %sum1.121
				%inc10 = add nsw i32 %i.120, 1
				%exitcond29 = icmp eq i32 %inc10, %n
				%arrayidx7.inc = getelementptr i32, i32* %arrayidx7.phi, i32 1
				br i1 %exitcond29, label %for.inc12, label %for.body6

				for.inc12:
				%inc13 = add nsw i32 %j.024, 1
				%exitcond30 = icmp eq i32 %inc13, %n
				br i1 %exitcond30, label %for.end14.loopexit, label %for.body3.lr.ph

				for.end14.loopexit:
				br label %for.end14

				for.end14:
				%sum1.0.lcssa = phi i32 [ 0, %entry ], [ %add8, %for.end14.loopexit ]
				%sum.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.end14.loopexit ]
				%add15 = add nsw i32 %sum1.0.lcssa, %sum.0.lcssa
				ret i32 %add15
				}

				declare void @bar(i32*) #1

				attributes #0 = { nounwind "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"="true" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
				attributes #1 = { "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"="true" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
				attributes #2 = { nounwind }

				!0 = !{!"int", !1}
				!1 = !{!"omnipotent char", !2}
				!2 = !{!"Simple C/C++ TBAA"}

Revision Contents

Diff 46712

include/llvm/CodeGen/Passes.h

include/llvm/InitializePasses.h

include/llvm/Target/TargetInstrInfo.h

lib/CodeGen/CMakeLists.txt

lib/CodeGen/CodeGen.cpp

lib/CodeGen/MachinePipeliner.cpp

lib/CodeGen/Passes.cpp

lib/Target/Hexagon/HexagonInstrInfo.h

lib/Target/Hexagon/HexagonInstrInfo.cpp

lib/Target/Hexagon/HexagonTargetMachine.cpp

test/CodeGen/Hexagon/swp-const-tc.ll

test/CodeGen/Hexagon/swp-dag-phi.ll

test/CodeGen/Hexagon/swp-epilog-reuse.ll

test/CodeGen/Hexagon/swp-matmul-bitext.ll

test/CodeGen/Hexagon/swp-max.ll

test/CodeGen/Hexagon/swp-vect-dotprod.ll

test/CodeGen/Hexagon/swp-vmult.ll

test/CodeGen/Hexagon/swp-vsum.ll

test/CodeGen/swp-multi-loops.ll
