This is an archive of the discontinued LLVM Phabricator instance.

StackColoring: smarter check for slot overlap
ClosedPublic

Authored by arielb1 on Apr 2 2017, 9:40 AM.

Details

Summary

The old check for slot overlap treated 2 slots S and T as
overlapping if there existed a CFG node in which both of the slots could
possibly be active. That is overly conservative and caused stack blowups
in Rust programs. Instead, check whether there is a single CFG node in
which both of the slots are possibly active *together*.

Fixes PR32488.

Diff Detail

Event Timeline

arielb1 created this revision.Apr 2 2017, 9:40 AM
thanm edited edge metadata.Apr 4 2017, 1:58 PM

Hi Ariel,

I am thinking that maybe there is something wrong with the patch on phab-- I downloaded it and it would not compile:

StackColoring.cpp:1050:5: error: use of undeclared identifier 'UI'
    UI->getNextValue(Indexes->getZeroIndex(), VNInfoAllocator);

I fixed that up in the obvious way.

Aside from that, I'd say that so far I don't think I completely understand how your strategy works. Part of this may be due simple to terminology and the names you've given to the new things.

For example, I don't get why you've chosen the name "Use" for the new set in BlockInfo. The fact that gets recorded in this set is not really a "use" of the stack slot. Consider these two blocks:

BB1:
  lifetime_end slot 5
  branch

BB2:
  lifetime_end slot 5
  lifetime_start slot 5
  use of slot 5
  branch

The way the code works now, "Use" will be set for BB1 and not for BB2, even though BB2 has an actual use (reference to) the slot and BB2 does not. This is very confusing for people reading the code.

Similar confusion over your use of the term "active" in the comment ("active" is also used at various other places in the file comments, and I don't think your definition of "active" is the same as that used in the other places). Ditto for "IntervalStarts" -- why is it named this way?

Another point of confusion is over the use of the term "interference graph" in the comments-- the data structure being used to capture interferences is not really a graph, there are no explicit edges to speak of. A real interference graph would be an improvement over what's there (could be more precise in a number of cases), but that's not what exists in the code today.

A final question is what sort of testing you've done so far to make sure this works for all the odd corner cases. LNT and bootstrap build are two things that come to mind.

With that said, please keep working on this, I think it's worth pursuing. I am OOO all this week and possibly next, so I won't have a lot of time to look at it, but I'll get there eventually.

Regards,

Than

arielb1 updated this revision to Diff 94237.Apr 5 2017, 8:51 AM

Typofix UI -> SI

Hi Ariel,

I am thinking that maybe there is something wrong with the patch on phab-- I downloaded it and it would not compile:

StackColoring.cpp:1050:5: error: use of undeclared identifier 'UI'
    UI->getNextValue(Indexes->getZeroIndex(), VNInfoAllocator);

I fixed that up in the obvious way.

Fixed that.

Aside from that, I'd say that so far I don't think I completely understand how your strategy works. Part of this may be due simple to terminology and the names you've given to the new things.

For example, I don't get why you've chosen the name "Use" for the new set in BlockInfo. The fact that gets recorded in this set is not really a "use" of the stack slot. Consider these two blocks:

BB1:
  lifetime_end slot 5
  branch

BB2:
  lifetime_end slot 5
  lifetime_start slot 5
  use of slot 5
  branch

The way the code works now, "Use" will be set for BB1 and not for BB2, even though BB2 has an actual use (reference to) the slot and BB2 does not. This is very confusing for people reading the code.

BLI.Use refes, as it says, to a slot that begins and ends within the same block. Maybe I should have called it "BeginAndEnd" or something? Maybe I should have Use also include Begin (actually, let's do that)?

Similar confusion over your use of the term "active" in the comment ("active" is also used at various other places in the file comments, and I don't think your definition of "active" is the same as that used in the other places). Ditto for "IntervalStarts" -- why is it named this way?

I thought "active" was a new fresh term. I informally referred to it as "live", but unfortunately "live" has another meaning. Is there a better term than "active"?

Another point of confusion is over the use of the term "interference graph" in the comments-- the data structure being used to capture interferences is not really a graph, there are no explicit edges to speak of. A real interference graph would be an improvement over what's there (could be more precise in a number of cases), but that's not what exists in the code today.

I was thinking about things theoretically.

A final question is what sort of testing you've done so far to make sure this works for all the odd corner cases. LNT and bootstrap build are two things that come to mind.

I bootstrapped rustc and passed all of its tests. I don't know how to bootstrap LLVM, so I didn't do that.

With that said, please keep working on this, I think it's worth pursuing. I am OOO all this week and possibly next, so I won't have a lot of time to look at it, but I'll get there eventually.

Thanks.

Regards,

Than

dberlin edited edge metadata.Apr 5 2017, 9:42 AM

To try to echo a bit what than is saying:, i think it would be helpful if you could write some paragraphs of what you think the high level design of this new part is.

IE something like:
we create a dataflow problem that tells us x
we do this by computing these attributes on a per-block basis, and then finding a maximal fixpoint.
We then use x to prune y.

or whatever.

I can think of a lot of different approaches than stack coloring currently takes (IE you could do real liveness analysis, as than mentions, you could build interference graphs and simplify them, etc), but they are more fundamental changes than i see here.

arielb1 updated this revision to Diff 94255.Apr 5 2017, 10:22 AM

Change names and comment to not mention interference graph.

To try to echo a bit what than is saying:, i think it would be helpful if you could write some paragraphs of what you think the high level design of this new part is.

IE something like:
we create a dataflow problem that tells us x
we do this by computing these attributes on a per-block basis, and then finding a maximal fixpoint.
We then use x to prune y.

or whatever.

I can think of a lot of different approaches than stack coloring currently takes (IE you could do real liveness analysis, as than mentions, you could build interference graphs and simplify them, etc), but they are more fundamental changes than i see here.

I have a big comment in the source code that is supposed to be that explanation. Is there anything missing?

dberlin added inline comments.Apr 5 2017, 4:32 PM
lib/CodeGen/StackColoring.cpp
1125

Or you could just augment the standard dataflow problem with more info?
You also just kind of assert it doesn't give very good results, but it's pretty widely used with good results :)

1163

Errr, what?
It is very possible to do this in N time with N dataflow facts.

arielb1 added inline comments.Apr 6 2017, 1:53 AM
lib/CodeGen/StackColoring.cpp
1125

What I said is that what the old code was doing - propagating the N S active facts across the CFG and that
saying that 2 nodes interfere if they are both potentially active at a point - is overconservative at merge points.
Merging this PR caused x16 stack usage reductions in real world Rust compiler stack usage.

My algorithm also propagates the N S active facts, but uses them along with S active starts non-propagated facts to compute S active & T active more precisely.

Liveness can make it smarter, but aliasing makes it annoying to compute (and if there are opaque function calls in the merge BB, impossible to compute).

dotdash added inline comments.Apr 16 2017, 8:31 AM
test/CodeGen/X86/StackColoring.ll
539

I think you want to use %bar here.

arielb1 updated this revision to Diff 95432.Apr 17 2017, 5:08 AM

Clean up code to LLVM standards - thanks @doener. Also add support for multiple live ranges.

Since I had a hand in changing the logic for the live interval calculation, I
feel like I should take some time to explain the approach that was taken here.

Originally, the stack coloring pass collected the lifetime markers and assumed
that there are at most two such markers per MBB. Either in the order start ->
end, creating a single segment in the live interval. Or in the order end ->
start, creating two segments in the live interval, the slot being dead in the
middle of the MBB.

Later the code was adjusted to handle more than two markers per block. Because
the collected markers were in a random order, instead of forming multiple
segments only a single segment was created and extended as needed.
Unfortunately this actually broke the logic for handling two markers in the
order end -> start. This is because the LiveIn and LiveOut bits would then also
be set, and the segment gets extended to cover the whole MBB.

By now, we're iterating over the instructions in the MBB in order anyway, so we
can actually handle multiple start/end markers properly and create multiple
segments in the live interval even within a single block.

The semantics are as follows:

A slot for which there are no lifetime markers is always live.

For slots that have lifetime markers:

In the entry MBB, the slot starts out as dead. In other MBBs, the slot starts
out as live if the dataflow analysis determined it to be live coming into this
block, otherwise it starts as dead.

Iterating over the instructions in the MBB in sequential order:

A dead slot becomes live when it encounters a START (which could be a use), and
stays live until it encounters an END. Any START encountered while a slot is
live has no effect. This is necessary, because the START could be a plain use
rather than a lifetime_start and may thus not shorten the live interval.

A live slot becomes dead when it encounters an END. At this point, the slots
live interval gets a new segment that starts at the index where the slot became
live, and ends at the index the END was encountered. Any END encountered while
a slot is dead has no effect.

Building upon that, the improvement this patch was initially motivated by is
described below.

Because a lifetime end on a dead slot has no effect, a frontend may choose to
combine certain code paths to produce fewer BBs. Consider this pseudocode:

A = alloca TYPE
B = alloca TYPE
C = alloca TYPE_WITH_DTOR

main:
  LT_START(C)

  if COND {
    LT_START(A)
    INVOKE func UNWIND cleanup_A
    LT_END(A)
  } else {
    LT_START(B)
    INVOKE func UNWIND cleanup_B
    LT_END(B)
  }

  DTOR(C)
  LT_END(C)

  RETURN

cleanup_A:
  LP
  LT_END(A)
  br cleanup;

cleanup_B:
  LP
  LT_END(B)
  br cleanup;

cleanup:
  DTOR(C)
  LT_END(C)
  RESUME

Here we need two distinct cleanup paths and landing pads just to ensure that
the old stack coloring code can merge slots A and B. But assuming that a
lifetime end on a dead slot is a no-op, we could also write:

A = alloca TYPE
B = alloca TYPE
C = alloca TYPE_WITH_DTOR

main:
  LT_START(C)

  if COND {
    LT_START(A)
    INVOKE func UNWIND cleanup_A
    LT_END(A)
  } else {
    LT_START(B)
    INVOKE func UNWIND cleanup_B
    LT_END(B)
  }

  DTOR(C)
  LT_END(C)

  RETURN

cleanup:
  LP
  LT_END(A)
  LT_END(B)
  DTOR(C)
  LT_END(C)
  RESUME

Now the problem is that both A and B are only "possibly live" coming into
"cleanup". "Possibly live" meaning that a slot is live on one but not
necessarily all incoming edges. Using a plain overlap check on the live
intervals created using this information causes false positives, stopping the
slots from being merged.

To solve this, we can use an alternative approach to check whether two slots
are live at the same time. For two slots to be live at the same time, one of
them needs to become live when the other is live as well. We can check this by
keeping track of the points at which a slot becomes live. As these points tell
us that a slot is "definitely" live, we get more accurate results.

I hope this explains the approach we've taken well enough.

Ready and just waiting for review.

arielb1 added a reviewer: rnk.May 15 2017, 8:01 AM

rnk can you review this? This is stuck in the queue for almost a month.

thanm added a comment.May 15 2017, 8:12 AM

Ready and just waiting for review.

Thanks for the ping/reminder.

I read through Björn's April 17th comment. The explanation makes sense, but this is something that should be in the code as opposed to only in a Phab review. Please add something along these lines to the "Implementation Notes" command section -- that is where other material exists relating to how the data flow analysis works.

My existing comments regarding naming still stand (overloading of the terms "active" and naming of the BB set "Use").

Ditto on my previous comment regarding testing. I recommend running a clang/llvm bootstrap build to make sure you haven't broken anything. You can find instructions on how to do bootstrap builds at http://llvm.org/docs/AdvancedBuilds.html. This is purely an advisory suggestion, you are free to ignore it (I mention it mainly since when I made changes to stack coloring previously it was helpful in flushing out problems that I hadn't anticipated).

Tests run locally.

arielb1 updated this revision to Diff 100755.May 30 2017, 1:15 PM

Updated comment.

arielb1 updated this revision to Diff 100765.May 30 2017, 1:52 PM

Works locally.

Ping. I think the new comment is good enough to go.

thanm requested changes to this revision.Jun 6 2017, 5:28 AM

I pulled your patch and did a bootstrap build with it, no issues. Overall this seems like an improvement over the initial patch. Please see inline comments.

lib/CodeGen/StackColoring.cpp
127

This is kind of a nit, but if I am reading this correctly, your use of "non-conservative" here refers to a slot that is not degenerate, which is not talked about until down on line 342. Maybe it would make sense to put in a forward reference?

646

I see "if (!IsStart && ..." -- seems like the "!IsStart" is redundant at this point, no?

1000

A couple of observations:

First, relative to your first patch I think I can make more more sense out of what's going on. I can also see where you're getting the improvement, since doing the conflict check based on starts is inherently more precise.

Second, there is a concern that in a large function with many packets overlapped together you could get into quadratic compile time behavior (since in the worst case the interval has O(N) items and the LiveStarts vector has O(N) items). If you haven't seen any issues in practice, though, I suppose it's probably not worrying too much about.

This revision now requires changes to proceed.Jun 6 2017, 5:28 AM
arielb1 updated this revision to Diff 101884.Jun 8 2017, 4:11 AM
arielb1 edited edge metadata.

Changed else/if nesting in that part.

I'm not too worried about any O(n^2) part - both the old and the new code do an O(# slots) * O(# slots) overlap check, and I don't remember it being a particularly expensive pass.

arielb1 updated this revision to Diff 101887.Jun 8 2017, 4:23 AM

Use "degenerate" in all comments.

thanm accepted this revision.Jun 8 2017, 5:32 AM

LGTM.

FWIW I tried a self-build of clang with and without the change and looked at the -stats output to see what sort of improvement there is. Looks pretty modest (<1%) but it is positive overall.

This revision is now accepted and ready to land.Jun 8 2017, 5:32 AM
This revision was automatically updated to reflect the committed changes.