This is an archive of the discontinued LLVM Phabricator instance.

Fix instruction scheduling live register tracking
ClosedPublic

Authored by chfast on May 25 2015, 8:34 AM.

Details

Summary

This patch fixes PR23405 (https://llvm.org/bugs/show_bug.cgi?id=23405).

During a node unscheduling an entry in LiveRegGens can be replaced with a new value. That corrupts the live reg tracking and LiveReg* structure is not cleared as should be during unscheduling. Problematic condition that enforces Gen replacement is I->getSUnit()->getHeight() < LiveRegGens[I->getReg()]->getHeight(). I don't know why it is needed but regression tests do not covert that case.

I went a bit more aggressive here replacing 2 other defensive checks with asserts.

Diff Detail

Repository
rL LLVM

Event Timeline

chfast updated this revision to Diff 26437.May 25 2015, 8:34 AM
chfast retitled this revision from to Fix instruction scheduling live register tracking.
chfast updated this object.
chfast edited the test plan for this revision. (Show Details)
chfast added reviewers: atrick, hfinkel.
chfast set the repository for this revision to rL LLVM.
chfast added a subscriber: Unknown Object (MLST).
hfinkel edited edge metadata.May 27 2015, 4:29 PM

Have you run the LLVM test suite and self-hosted with this change?

atrick edited edge metadata.May 27 2015, 4:38 PM

Thanks for the fix!

I am generally ok with this even though I still can't prove to myself it's correct, since the existing unit tests (like add-with-overflow-128) still pass and it does fix a nasty bug.

One quick reminder: Make sure you build for all targets before running lit tests.

More importantly, can you provide some explanation for why LiveRegGens cannot have a null entry at the point where you assert. My understanding is that LiveRegGens is cleared when we schedule the def. So when we're about to unschedule, the entry should be null until we set it again.

The way this is supposed to work (I think):

  • While scheduling, assign increasing "heights" to the scheduled nodes.
  • When backtracking, we know the successor that should be added back to LiveRegGens is the one with the lowest "height"

This is almost certainly broken because the height field in the DAG node is serving dual purpose, both as a DAG height and a schedule height. One thing to try would using the ScheduleDAG node's BotReadCycle field instead of setting/checking the Height field:

in ScheduleNodeBottom up set BotReadyCycle = CurCycle instead of
SU->setHeightToAtLeast(CurCycle);

Then in backtracking, check BotReadyCycle instead of Height. There may be other bugs in backtracking that you're hitting, but that definitely looks suspect to me.

Have you run the LLVM test suite and self-hosted with this change?

Test suite - no. What "self-hosted" actually means?

Thanks for the fix!

I am generally ok with this even though I still can't prove to myself it's correct, since the existing unit tests (like add-with-overflow-128) still pass and it does fix a nasty bug.

One quick reminder: Make sure you build for all targets before running lit tests.

I'm building all targets.

More importantly, can you provide some explanation for why LiveRegGens cannot have a null entry at the point where you assert. My understanding is that LiveRegGens is cleared when we schedule the def. So when we're about to unschedule, the entry should be null until we set it again.

If we are at the end of a chain the Def/Gen should be cleared out around line #817. If we are not at the end of chain the other Def is expected. But I might be wrong. I suggested myself to that checking out code coverage.

The way this is supposed to work (I think):

  • While scheduling, assign increasing "heights" to the scheduled nodes.
  • When backtracking, we know the successor that should be added back to LiveRegGens is the one with the lowest "height"

This is almost certainly broken because the height field in the DAG node is serving dual purpose, both as a DAG height and a schedule height. One thing to try would using the ScheduleDAG node's BotReadCycle field instead of setting/checking the Height field:

in ScheduleNodeBottom up set BotReadyCycle = CurCycle instead of
SU->setHeightToAtLeast(CurCycle);

Then in backtracking, check BotReadyCycle instead of Height. There may be other bugs in backtracking that you're hitting, but that definitely looks suspect to me.

I'll check that solution however I don't fully get it why we need to ever rewrite the Gen.

Have you run the LLVM test suite and self-hosted with this change?

Test suite - no. What "self-hosted" actually means?

I mean have you built Clang/LLVM with Clang/LLVM? I find this is an excellent test.

I worked through some examples on paper. I haven't debugged real examples, so could be wrong about something...

In the normal two-address case with only a single physical register, LiveRegDef and LiveRegGen should always be non-null when a node that defines the same register is unscheduled. So I understand why you didn't hit any assert and why we don't have any unit tests (which would need to be realizable from valid IR with a real target).

A comment above the assert should be sufficient, as long as this
passes all the testing that Hal mentioned and doesn't break any
bots. You could say something like:

With interference on only a single physical register (flags), a node
that defines the register will not be unscheduled unless it is a
two-address instruction. In that case, LiveRegDef will currently be
set to the def that reaches this instruciton, and LiveRegGen will
already be set to the last user of this instruction.

I might even convert the assert into report_fatal_error since the condition is theoretically possible and could result in a miscompile if true.

Now, to answer your question, here's a hypothetical case the motivates
the current code:

Say Rx and Ry are a physical registers and we have this DAG:

Ry = I0
Rx = I1(I0:Ry)
I2(I1:Rx)
I3(I1:Rx)
Rx = I4(I0)

I5(I4:Rx)

We first schedule this sequence bottom-up:

Rx = I1(I0:Ry)
I2(I1:Rx)
I3(I1:Rx)

Giving us:
Def[Rx]=null, Def[Ry]=I0
Gen[Rx]=null, Gen[Ry]=I1

Since I4 has not been scheduled, we cannot schedule I0 and need to backtrack.

Unscheduling I1, we see two successors: I2 and I3. I3 has the lower height, so we should end up with:
Def[Rx]=I1, Def[Ry]=null
Gen[Rx]=I3, Gen[Ry]=null

If we mistakenly picked Gen[Rx]=I2, then we would not backtrack far enough and would miscompile this example!

So, if fixing the code to check scheduled order instead of DAG height fixes your problem, that would be the best solution.

Say Rx and Ry are a physical registers and we have this DAG:

Ry = I0
Rx = I1(I0:Ry)
I2(I1:Rx)
I3(I1:Rx)
Rx = I4(I0)

I5(I4:Rx)

We first schedule this sequence bottom-up:

Rx = I1(I0:Ry)
I2(I1:Rx)
I3(I1:Rx)

Giving us:
Def[Rx]=null, Def[Ry]=I0
Gen[Rx]=null, Gen[Ry]=I1

Since I4 has not been scheduled, we cannot schedule I0 and need to backtrack.

Unscheduling I1, we see two successors: I2 and I3. I3 has the lower height, so we should end up with:
Def[Rx]=I1, Def[Ry]=null
Gen[Rx]=I3, Gen[Ry]=null

If we mistakenly picked Gen[Rx]=I2, then we would not backtrack far enough and would miscompile this example!

Agreed. But it can only happen if traversing successors in order I3, I2. Is it possible?

So, if fixing the code to check scheduled order instead of DAG height fixes your problem, that would be the best solution.

chfast updated this revision to Diff 26957.Jun 2 2015, 1:57 AM
chfast edited edge metadata.

Priority in live reg gen for phys reg copies.

I tried Andrew's proposition to separate node's DAG height from schedule cycle without success. I think the problem is different.

In my case there is a PHYS REG COPY node that might have incorrect height. During unscheduling this phys reg copy node is not set as a live reg gen because it looses against some other successor that has lower height. If I prioritise phys reg copies in gen update the test passes. No other regressions.

I think I need more information about these phys reg copy nodes to investigate more. What do they mean and where do they come from? I cannot see this problematic node in the DAG log before scheduling. Maybe we should change the height assinged to that nodes.

The copy could be created during InsertCopiesAndMoveSuccs. This could be a much more complicated scenario than I showed in my examples.

At any rate, as long as the height is set an ascending order as each node is scheduled, it seems like it should work out. It's possible that CurrCycle is not being properly updated during backtracking. That might lead to inconsistent heights. I would add print statements to find out what CurrCycle and the copy's height is at the time of scheduling to determine why the wrong choice is made.

chfast updated this revision to Diff 27834.Jun 17 2015, 7:30 AM
  • Add LiveRegGen changes logs.

I have a filling I'm going in circles.

Now it looks the best "fix" is to remove the case I->getSUnit()->getHeight() < LiveRegGens[I->getReg()]->getHeight() when updating LiveRegGen. Without this case LiveRegGen is set during unscheduling only if it was empty before.

I checked selfhosted regression tests and test-suite. Now using the clang build with this change in my other projects.

Can someone give me an example when this removed case is needed?

atrick accepted this revision.Jun 22 2015, 10:26 AM
atrick edited edge metadata.

I don't have an example other than the hypothetical case shown below.

At this point, I'm ok with you checking in your fix even though we don't really understand the issue.

Andy


Now, to answer your question, here's a hypothetical case the motivates
the current code:

Say Rx and Ry are a physical registers and we have this DAG:

Ry = I0
Rx = I1(I0:Ry)
I2(I1:Rx)
I3(I1:Rx)
Rx = I4(I0)

I5(I4:Rx)

We first schedule this sequence bottom-up:

Rx = I1(I0:Ry)
I2(I1:Rx)
I3(I1:Rx)

Giving us:
Def[Rx]=null, Def[Ry]=I0
Gen[Rx]=null, Gen[Ry]=I1

Since I4 has not been scheduled, we cannot schedule I0 and need to backtrack.

Unscheduling I1, we see two successors: I2 and I3. I3 has the lower height, so we should end up with:
Def[Rx]=I1, Def[Ry]=null
Gen[Rx]=I3, Gen[Ry]=null

If we mistakenly picked Gen[Rx]=I2, then we would not backtrack far enough and would miscompile this example!

So, if fixing the code to check scheduled order instead of DAG height fixes your problem, that would be the best solution.

This revision is now accepted and ready to land.Jun 22 2015, 10:26 AM
chfast updated this revision to Diff 28231.Jun 23 2015, 6:02 AM
chfast edited edge metadata.

Thanks Andy for your another set of comments. They helped me try another solution.

During a node unscheduling the successor with the lowest height is found and set as LiveRegGen.
The difference form previous implementation is that LiveRegGen is not updated again in consecutive unschedulings.

The implementation is quite naive and is done by doing a copy of LiveRegGens to track the initial values.
The alternative would be to nest another traversal of succesors to find the lowest one.

I'm ok with your revised approach. I can't say that I understand how it works in all cases, but I trust that it's better than your previous fix.

Thanks. I will send it to bots soon.

Can you give me any tip about the implementation? Is the copy of LiveRegGens ok? On X86 it has over 200 entries. Alternative to the copy is to double traverse the successor list.

Right, on some target the LiveRegGens vector could be in the thousands. Only 1-2 regs are probably relevant here. Any sparse data type would be better. e.g. Instead you could keep track of only the registers that were changed via SmallSet. You can also reuse the memory be defining it in the class and clearing it before each use.

chfast updated this revision to Diff 28324.Jun 24 2015, 2:04 AM

Implementation changed, no LiveRegGens copy needed.

This revision was automatically updated to reflect the committed changes.

Thanks. That's much better than making the copy.

No true buildbot failures so far so I'm quite happy about the fix. Thanks for all your help.

Will this code stay here for long? I've heard some rumors that it is going to be replaced. If it stays I would like to refactor it a bit.

Everyone wants to replace this code. But that rumor is probably two years old. Even after we have a replacement, the old code will stick around until all the targets migrate. So feel free to fixup the code. It will be in use for *at least* another year.

Andy