Page MenuHomePhabricator

An implementation of Swing Modulo Scheduling
ClosedPublic

Authored by bcahoon on Feb 2 2016, 3:37 PM.

Details

Summary

Software pipelining is an optimization for improving ILP by overlapping loop iterations. Swing Modulo Scheduling (SMS) is an implementation of software pipelining that attempts to reduce register pressure and generating efficient pipelines with a low compile-time cost.

This implementation of SMS is a target-independent back-end pass. When enabled, the pass runs just prior to the register allocation pass, while the machine IR is in SSA form. If software pipelining is successful, then the original loop is replaced by the optimized loop. The optimized loop contains one or more prolog blocks, the pipelined kernel, and one or more epilog blocks.

The SMS implementation is an extension of the ScheduleDAGInstrs class. We represent loop carried dependences in the DAG as order edges to the Phi nodes. We also perform several passes over the DAG to eliminate unnecessary edges that inhibit the ability to pipeline. The implementation uses the DFAPacketizer class to compute the minimum initiation interval and the check where an instruction may be inserted in the pipelined schedule.

In order for the SMS pass to work, several target specific hooks need to be implemented to get information about the loop structure and to rewrite instructions. This patch implements the target hooks for Hexagon.

The pipeliner should be easily extendable to work with compare-and-branch loops instead of just hardware loops. I assume that may require some additional support from a target. Also, the implementation assumes the target uses the DFA code for scheduling, but I think it could also be changed to work with the scoreboard structure (or some other mechanism).

The SMS algorithm consists of three main steps after computing the minimal initiation interval (MII).

  1. Analyze the dependence graph and compute information about each instruction in the graph.
  2. Order the nodes (instructions) by priority based upon the heuristics described in the algorithm.
  3. Attempt to schedule the nodes in the specified order using the MII.

If all the instructions can be scheduled in the specified order, then the algorithm is successful. Otherwise, we increase the MII by one and try again. When the algorithm is successful, we need to replace the original loop with one or more prolog blocks, the optimized kernel, and one or more epilog blocks. The number of prolog and epilog blocks depends on the number of stages in the scheduled loop. When creating the new blocks, we need to generate new SSA names and generate Phis for the new kernel and each epilog block. The process of creating the new, pipelined loop is quite complex and hard to understand. This part of the code can certainly use some additional work to simplify and to improve the readability.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
pjcoup added a subscriber: pjcoup.Feb 3 2016, 7:16 AM

We have considered adding bundles much earlier in the back-end, but I think that means changing the register allocator (and other passes) to work on bundles. It's been on our list of things to attempt, but we haven't done it yet. We've worked around the major problems of not having bundles in the pipeliner by carefully ordering the pipelined instructions, though it's not the ideal solution.

lib/CodeGen/MachinePipeliner.cpp
3058–3060

If two iterations are started in the prolog, then we generate two prolog basic blocks, and two epilog basic blocks. At the end of each prolog basic block, we add a compare and branch to the corresponding epilog basic block (the fall through is to the next prolog block or the kernel). This means that the first prolog block contains instructions from stage 0 and the second prolog block contains instructions from stage 1 and the 2nd iteration of stage 0.

In your example, with a run-time trip count of 1, the first prolog block branches to the last epilog block, and the instructions in the last epilog block are the first iteration of instructions scheduled in stage 1 and stage 2.

The swp-max.ll test case shows a pipelined schedule with 2 prolog and epilog blocks.

materi added inline comments.Feb 3 2016, 11:11 AM
lib/CodeGen/MachinePipeliner.cpp
3058–3060

Thank you! I think I understand how it works now. The prolog and epilog blocks are not the "bundles" of the SWP prolog and epilog. The jump label for my trip count = 1 case is put in the middle of the first "epilog bundle".

But what if there are loop carried 0-latency dependences in the graph? This will force a certain order within the kernel to allow correct bundling in a later step. Can this be handled?

bcahoon added inline comments.Feb 3 2016, 2:52 PM
lib/CodeGen/MachinePipeliner.cpp
3058–3060

If I'm understanding your question, then yes - we do handle the case of a loop carried 0-latency instruction. The order of the instructions in the prolog and epilog blocks is different than the order in the pipelined schedule. The prolog/epilog instructions appear in the original instruction order (i.e., prior to pipelining), and they are grouped by the pipelined stage.

As an example, lets say there are 3 stages, numbered 0,1,2, so there will be two prolog blocks and two epilog blocks. The first prolog contains instructions from stage 0 in the original order. The last epilog contains instructions from stages 1 and 2 in original order. If the loop contains only 1 iteration, then the stage 0 instructions in the first prolog are executed, and control jumps to the last epilog block to execute the first iteration of instructions from stages 1 and 2.

In the second prolog, we first generate the instructions from stage 1 in the original order, and then stage 0 in original order. In the second to last epilog, we generate instructions for stage 2 in the original order. If the loop has 2 iterations, then the 2 prolog bocks execute instructions from stage 0 twice, and stage 1 one. The 2 epilog blocks execute instructions from stage 2 twice and stage 1 once.

I hope this makes sense and answers your question correctly. Let me know.

jonpa added a subscriber: jonpa.Feb 3 2016, 10:33 PM
marksl added a subscriber: marksl.Feb 5 2016, 2:24 PM

If you have a functional unit that issues in stages such that another instruction of needing the same FU can ussue the very next cycle, then isn't the sum of the cycles too great? Example:

InstrItinData<IIC_MUL_rr, [InstrStage<1, [MUL_DSP_STAGE1]>,InstrStage<1, [MUL_DSP_STAGE2]>,InstrStage<1, [MUL_DSP_STAGE3]>]>,

In this case multiplies will issue back-to-back and require 3 cycles to complete. If we have 2 multiplies then is ResMII = (2 * 3) / 1 = 6 when in reality it will be 4 cycles since it's broken into stages.

lib/CodeGen/MachinePipeliner.cpp
1383

If you have a functional unit that issues in stages such that another instruction of needing the same FU can ussue the very next cycle, then isn't the sum of the cycles too great? Example:

InstrItinData<IIC_MUL_rr, [InstrStage<1, [MUL_DSP_STAGE1]>,InstrStage<1, [MUL_DSP_STAGE2]>,InstrStage<1, [MUL_DSP_STAGE3]>]>,

In this case multiplies will issue back-to-back and require 3 cycles to complete. If we have 2 multiplies then is ResMII = (2 * 3) / 1 = 6 when in reality it will be 4 cycles since it's broken into stages.

bcahoon added inline comments.Feb 5 2016, 3:53 PM
lib/CodeGen/MachinePipeliner.cpp
1383

In the case with the two multiplies, I think the resource MII should be 2. The resource MII is just the cycle count of the most heavily used resource. Since each of the resources, MUL_DSP_STAGE1, MUL_DSP_STAGE2, and MUL_DSP_STAGE3 are used for 2 cycles for the 2 instructions.

But, I agree that the code here is not going to return 2. With your type of itinerary, I believe the code computes that there are 6 functional units used. Then, when iterating over the functional units, the code calls canReserveResources() for each instruction 3 times. Each time, the query returns false, and a new DFA is created for a total of 6 cycles. It seems to me that the DFA requires an additional parameter to determine the cycle, or stage, but I don't believe that is possible currently. I'll need to think about how to get this to work correctly for this type of itinerary. Unfortunately, it hard for me to test a solution to this since Hexagon doesn't have a similar itinerary.

marksl added inline comments.Feb 11 2016, 9:59 AM
include/llvm/Target/TargetInstrInfo.h
1052

Should this return true for pre & post increment & decrement?

marksl added inline comments.Feb 11 2016, 4:58 PM
lib/CodeGen/MachinePipeliner.cpp
878

I've seen CodeGenPrepare delete the preheader. Specifically, if the previous block is a loop and this loop immediate follows it. I don't have a test case, but it's basically two sequential for loops. I wonder if that has any impact here?

bcahoon added inline comments.Feb 12 2016, 7:09 AM
include/llvm/Target/TargetInstrInfo.h
1052

I've only tested this for post increment. I think returning true for the other cases would cause problems, since the offset value adjustment would probably be incorrect (at least for the post/pre decrement cases). The pre-increment case would end up ok. A more general solution could be useful if your target, or others, have these variants. The pipeliner uses this information to eliminate the dependence on the post incremented value w.r.t other instructions that use the same base value.

lib/CodeGen/MachinePipeliner.cpp
878

Yes, it would have an impact for loops without a preheader. On Hexagon, when we create a hardware loop, the preheader is added if it's not there already. Then, the pipeliner pass only sees loops with a preheader. It would be easy enough to add a preheader if one doesn't exist already.

marksl accepted this revision.Feb 12 2016, 4:39 PM
marksl added a reviewer: marksl.

Very nice work Brandon.

This revision is now accepted and ready to land.Feb 12 2016, 4:39 PM
arvindsm accepted this revision.Feb 29 2016, 7:44 AM
arvindsm added a reviewer: arvindsm.
arvindsm edited edge metadata.

Good work. Brendon.

marksl added inline comments.Mar 1 2016, 7:58 AM
lib/CodeGen/MachinePipeliner.cpp
1702

What paper are you referring to?

marksl added inline comments.Mar 1 2016, 9:08 AM
lib/CodeGen/MachinePipeliner.cpp
1702

Sorry, I found it was Tanya Lattner's paper.

bcahoon added inline comments.Mar 1 2016, 2:09 PM
lib/CodeGen/MachinePipeliner.cpp
1702

The original paper is "Swing Modulo Scheduling: A Lifetime-Sensitive Approach" from PACT 1996. Though, Tanya's thesis provides a good description as well.

marksl added inline comments.Mar 9 2016, 1:43 PM
lib/CodeGen/MachinePipeliner.cpp
4

Are you contributing this code under the terms of the LLVM License?

bcahoon added inline comments.Mar 10 2016, 6:33 AM
lib/CodeGen/MachinePipeliner.cpp
4

Yes. I need to change that comment. I'll submit a patch with the change.

bcahoon updated this revision to Diff 51922.Mar 29 2016, 8:26 AM
bcahoon edited edge metadata.

Updated to the correct license and rebased the patch.

xmj added a subscriber: xmj.Mar 30 2016, 1:43 AM

I'm just wondering how to generate a loop which has only one basic block. For an example,

extern int a, b[10];

void foo(void)
{

for (int i = 0; i < 10; ++i) {
  a += b[i];
}

}

Run clang -emit-llvm -S --target=hexagon foo.c will generate

for.cond: ; preds = %for.inc, %entry

%0 = load i32, i32* %i, align 4
%cmp = icmp slt i32 %0, 10
br i1 %cmp, label %for.body, label %for.end

for.body: ; preds = %for.cond

%1 = load i32, i32* %i, align 4
%arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* @b, i32 0, i32 %1
%2 = load i32, i32* %arrayidx, align 4
%3 = load i32, i32* @a, align 4
%add = add nsw i32 %3, %2
store i32 %add, i32* @a, align 4
br label %for.inc

...

Thus the swp will not be done due to the constraint.

(gdb) p L->getNumBlocks()
$2 = 2

I'm just wondering how to generate a loop which has only one basic block. For an example,

Run clang -emit-llvm -S --target=hexagon foo.c will generate

Try with -O2 -fno-unroll-loops. The O2 flag will generate a canonical loop with a single basic block and the -fno-unroll-loops flag makes sure the loop is not unrolled.

It would be great to get someone else with more familiarity to approve this patch before submitting. Even though the pass is off by default (except for Hexagon), there are other changes to small pieces that should get additional attention.

xmj added a comment.Apr 21 2016, 6:29 PM

I tested the patch on eembc benchmark with hexagon-sim. The speedup of
autcor00data_2_lite can reach 120%. For the testcases that SMS does
analysis and reports "Schedule Found? 0" finally, the insns and cycles
will have a little bit of changes.

Ping? It would be great to see this patch get some attention from an official code owner for CodeGen, since the new pass provides a substantial improvement for VLIW architectures.

Quentin and Tom: It was suggested to me at last night's social to add you both as potential reviewers for this code. Do you think you could help with officially approving this? I think that the VLIW (and even non-VLIW) benefits for having a modulo scheduling pass could be quite interesting for LLVM in general. Thanks.

qcolombet edited edge metadata.Jun 3 2016, 9:37 AM
qcolombet added a subscriber: qcolombet.

Hi Stephen,

I will have a look if no-one did at some point.
I am way behind in reviews and open source work in general and I do not expect to be able to look at it before at least a couple of weeks.

Therefore, be patient, keep pinging people, and eventually the review will happen :).

Thanks for working on this BTW!

Cheers,
-Quentin

PS: I tend to look at reviews that do not have active reviewers. For that thread, a bunch of people commented on it and thus I thought it was already well covered. Sorry for not having looked earlier.

Hi Stephen,

I will have a look if no-one did at some point.
I am way behind in reviews and open source work in general and I do not expect to be able to look at it before at least a couple of weeks.

Therefore, be patient, keep pinging people, and eventually the review will happen :).

No problem. I just want to make sure that this has a reasonable chance of being submitted in a reasonable timeframe.

Thanks for working on this BTW!

I actually am just trying to help shepherd this along, as we have users who really want this feature enabled for Hexagon. I didn't do any of the work here. All of the credit goes to the original author(s).

Cheers,
-Quentin

PS: I tend to look at reviews that do not have active reviewers. For that thread, a bunch of people commented on it and thus I thought it was already well covered. Sorry for not having looked earlier.

Ah, there were indeed some great reviews earlier, but I think that the concern was one of adding a new pass to LLVM without more extensive approval (i.e. a more senior contributor actually saying "Accept"). If you think that the other reviewers have done a sufficient job here (or that no additional reviews are needed, and anything else can be fixed up post-commit), please say so. Thanks again for taking time to respond and for the quick glance.

MatzeB added a subscriber: MatzeB.Jun 9 2016, 6:30 PM

There are probably several targets out there that care about software pipelining so it is good to code and share it between targets. Who will be the the code owner and conduct reviews/maintenance in the future?

I am not experienced with software pipelining algorithm so I mainly concentrated on coding style and llvm infrastructure issues. General comments:

  • There is an odd mismatch between the class being 'MachineSMS' the filename being 'MachinePipeliner' the debug type 'modulo-sched' and the cl::opt switches being called 'swp-XXX'.
  • While there are more than enough comments in the code, the pass could use some highlevel introduction/references to papers used etc.

Detailed comments:

include/llvm/CodeGen/Passes.h
679–681

Avoid repeating the name of the method/field in the doxygen comment as per coding convention.
(There's more instances of this following).

include/llvm/Target/TargetInstrInfo.h
541–542

For parameters that cannot be nullptr I would recommend using a reference to make this fact clear (probably only MachineLoop *L here). Similar for the following functions.

1009–1010

The function returns just a bool but sets BasePos/OffsetPos.

lib/CodeGen/MachinePipeliner.cpp
10–17

Use three slash doxygen.

113–115

Would be good to intregrate this into the OptBisect infrastructure in subsequent patches.

131–132

we can use C++11 member initializers now and write MachineFunction *MF = nullptr; above instead.

170

This field could use a comment.

191–193

swap order of public: and the comment?

194

Given that we are inside a .cpp file specific to modulo scheduling anyway we can probably move all those helper classes to the toplevel instead of nesting them into the class.

211–214

Shouldn't you rather have the iterator implicitely cast to const_iterator so you do not need to have a templated version of the constructor (and insert etc. below)?
Could also use member initializers.

303–315

Why not move all those to the top of the class declaration to the other analysis info?

359

Many function comments are repeated here and at the place where the function is actually implemented. You should only mention the comment in one of the two places or risk that the two will get out of sync in time. (Which of the two you choose is a matter of personal preference and shouldn't matter).

492

Use doxygen comment (same for some following functions).

811

I believe this pass is not required for correctness and should therefore have a if(skipFunction(..)) return false; sequence first.

823–824

Use range based for. Similar in many following loops.

1702

It would be good to mention relevant papers in the comment at the beginning of the .cpp file.

1815

Should avoid auto here and mention the type name. Some more instances following.

3494

nullptr

lib/CodeGen/Passes.cpp
113–116

Even though you see both, I think the more typical style nowadays would be to move this flag into the .cpp file of the pass and make it the first thing that is checked in runOnMachineFunction().

552–556

Maybe we should leave adding the pass to the backends that support it and let them do it by overriding addPreRegAlloc()?

test/CodeGen/Hexagon/swp-const-tc.ll
6

Some comments about all the tests:

  • Tests should start with ; CHECK-LABEL: functionname: to make them more stable against unrelated output that happens to be on stdout triggering check lines (pass debug output, filenames, etc.)
  • The tests appear to contain unnecessary extra data (nocapture, readonly) flags, sometimes strange value or block names, function and alias metadata. I am sure they can and should be further simplified.
test/CodeGen/swp-multi-loops.ll
75–82

Is all this metadata really necessary for the test?

bcahoon updated this revision to Diff 60760.Jun 14 2016, 3:05 PM
bcahoon edited edge metadata.

Hi Matthias - thank you for the review and the comments. I really appreciate it. I think I've addressed all of your comments, unless I've inadvertently missed something. The new patch includes a lot of changes based upon your comments. The code has also been rebased.

Here's a summary of the changes:
I decided to use the name Pipeliner to be more consistent with naming, except for the test names which still start with swp. I don't really have a strong preference of Pipeliner vs. SWP vs. SMS in the naming scheme.

I've moved the code to add the pipeliner pass to HexagonTargetMachine.cpp in the addPreRegAlloc method instead of doing it in TargetPassConfig.cpp.

I've added some more high-level comments at the beginning with references to a couple of useful papers.

I've converted many loops to be range-based for loops.

I've removed the comments from the method declarations so that they are not repeated in the method implementation.

I've removed unnecessary metadata, etc. from the tests.

Thanks,
Brendon

sebpop added a subscriber: sebpop.Jun 15 2016, 7:29 AM
sebpop added inline comments.
lib/CodeGen/MachinePipeliner.cpp
2223

Both generateExistingPhis and generatePhis are passed the same parameters. Can we have this code factored up in a class: that would allow to split these two functions into smaller functions easier to follow.

2509

Could we have the code of this loop split up into smaller functions?

2514

I find the indexing in VRMap to be difficult to follow: could we have the arithmetic hidden behind some set/get interface for the prolog/epilog?

After ISEL our compare instructions, multiply, and MAC instructions have real physical register side effects. I'm getting errors from SWP for loops containing these physical register dependencies. Are you aware of this? Is there a way to model physical register dependencies with loop carried dependencies such that we would generate correct code for them?

bcahoon added inline comments.Jun 16 2016, 2:00 PM
lib/CodeGen/MachinePipeliner.cpp
2223

Hi Sebastian - thanks for the comments. The code for generating Phis in the pipelined schedule really could use some work. It's a non-trivial effort though. I'll try to do some refactoring to improve it though.

After ISEL our compare instructions, multiply, and MAC instructions have real physical register side effects. I'm getting errors from SWP for loops containing these physical register dependencies. Are you aware of this? Is there a way to model physical register dependencies with loop carried dependencies such that we would generate correct code for them?

Hi Mark - yes, I am aware of the of the problem, but don't yet have a good fix for it. One potential fix is to not pipeline loops that end up with a loop carried physical register. That's what's I've added to my local version. I added the following function, which is called from schedulePipeline() if a schedule is found.

bool SMSchedule::isValidSchedule(SwingSchedulerDAG *SSD) {

const TargetRegisterInfo *TRI = ST.getRegisterInfo();
for (int i = 0, e = SSD->SUnits.size(); i < e; ++i) {
  SUnit &SU = SSD->SUnits[i];
  if (!SU.hasPhysRegDefs)
    continue;
  int StageDef = stageScheduled(&SU);
  assert(StageDef != -1 && "Instruction should have been scheduled.");
  for (auto &SI : SU.Succs)
    if (SI.isAssignedRegDep())
      if (TRI->isPhysicalRegister(SI.getReg()))
        if (stageScheduled(SI.getSUnit()) != StageDef)
          return false;
}
return true;

}

kparzysz accepted this revision.Jun 27 2016, 10:45 AM
kparzysz added a reviewer: kparzysz.

Seems like all significant issues have been addressed. The remaining ones can be fixed in subsequent commits.

sebpop accepted this revision.Jun 27 2016, 11:17 AM
sebpop added a reviewer: sebpop.

Agreed, cleanups should not hold SMS to be committed.
LGTM.

bcahoon updated this revision to Diff 65365.Jul 25 2016, 9:59 AM
bcahoon edited edge metadata.

Rebased the patch. There were some API changes to resolve.

I've included several bug fixes since the last update, including:

  • Updates to with deal physical registers. The register pressure code needs to check for physical registers. Also, the pipeliner does not allow a schedule that generates a loop carrried dependence for a physical register.
  • A bug fix to the code that generates Phis for the pipelined loop.
  • A bug fix to the code that generates the final instruction order, when serializing the instructions.

Unless there is an objection, I'd like to be able to commit this code. There are several other folks that use the patch, and it would be great to enable them to make updates and improvements to the software pipeliner. This patch enables the pipeliner for the Hexagon target. I think there is a good opportunity for others to contribute to the pass, to improve the functionality, and to enable it for other targets.

Unless there is an objection, I'd like to be able to commit this code. There are several other folks that use the patch, and it would be great to enable them to make updates and improvements to the software pipeliner. This patch enables the pipeliner for the Hexagon target. I think there is a good opportunity for others to contribute to the pass, to improve the functionality, and to enable it for other targets.

As this is only enabled on Hexagon, I would say go ahead and commit: it is a very useful performance feature for that target.
If you get more feedback in the future, you can address that in follow-up patches.

Thanks,
Sebastian

sebpop closed this revision.Jul 29 2016, 5:39 PM

Committed in https://reviews.llvm.org/rL277169
Thanks Brendon!

when i use SMS.enterRegion(MBB, MBB->begin(), MBB->getFirstTerminator(), size); it occurs
Assertion failed: VNI && "No value to read by operand"
but if use SMS.enterRegion(MBB, MBB->getFirstNonPHI(), MBB->getFirstTerminator(), size2); it has no error.
Am i use the wrong version of LLVM?

when i use SMS.enterRegion(MBB, MBB->begin(), MBB->getFirstTerminator(), size); it occurs
Assertion failed: VNI && "No value to read by operand"
but if use SMS.enterRegion(MBB, MBB->getFirstNonPHI(), MBB->getFirstTerminator(), size2); it has no error.
Am i use the wrong version of LLVM?

You do need pass MBB->begin() to enterRegion for the pipeliner to work correctly. It looks like you're using an older version of LLVM. I believe this issue has been fixed with a commit that was made on Dec. 4 2015 to ScheduleDAGInstrs.cpp. If you're unable to use a newer version of LLVM, I'd suggest the following the following change to addVRegUseDeps in ScheduleDAGInstrs.cpp:

   // VNI will be valid because MachineOperand::readsReg() is checked by caller.
-  assert(VNI && "No value to read by operand");
-  MachineInstr *Def = LIS->getInstructionFromIndex(VNI->def);
+  MachineInstr *Def = (VNI ? LIS->getInstructionFromIndex(VNI->def) : 0);
   // Phis and other noninstructions (after coalescing) have a NULL Def

Thanks,
Brendon

when i use SMS.enterRegion(MBB, MBB->begin(), MBB->getFirstTerminator(), size); it occurs
Assertion failed: VNI && "No value to read by operand"
but if use SMS.enterRegion(MBB, MBB->getFirstNonPHI(), MBB->getFirstTerminator(), size2); it has no error.
Am i use the wrong version of LLVM?

You do need pass MBB->begin() to enterRegion for the pipeliner to work correctly. It looks like you're using an older version of LLVM. I believe this issue has been fixed with a commit that was made on Dec. 4 2015 to ScheduleDAGInstrs.cpp. If you're unable to use a newer version of LLVM, I'd suggest the following the following change to addVRegUseDeps in ScheduleDAGInstrs.cpp:

   // VNI will be valid because MachineOperand::readsReg() is checked by caller.
-  assert(VNI && "No value to read by operand");
-  MachineInstr *Def = LIS->getInstructionFromIndex(VNI->def);
+  MachineInstr *Def = (VNI ? LIS->getInstructionFromIndex(VNI->def) : 0);
   // Phis and other noninstructions (after coalescing) have a NULL Def

Thanks,
Brendon

Thanks for your suggestions!

Is this problem occurs because PhiNode's use operand which is defined in ohter BasicBlocks?

AND
%vreg1<def> = PHI %vreg5, <BB#0>, %vreg4, <BB#1>; CPURegs:%vreg1,%vreg5,%vreg4
......
%vreg4<def> = ADDiu %vreg1, 4; CPURegs:%vreg4,%vreg1
When code deals with ADDiu %vreg1,it can not find the defination of %vreg1 in PHINode.
So the function void SwingSchedulerDAG::updatePhiDependences() fix this problem?

Thanks,
Lee