This is an archive of the discontinued LLVM Phabricator instance.

Rework/enhance stack coloring data flow analysis.
ClosedPublic

Authored by thanm on Apr 6 2016, 8:18 AM.

Details

Summary

Replace bidirectional flow analysis to compute
liveness with forward analysis pass. Treat lifetimes
as starting when there is a first reference to the
stack slot, as opposed to starting at the point of
the lifetime.start intrinsic, so as to increase the
number of stack variables we can overlap.

Bug: 25776

Diff Detail

Event Timeline

thanm updated this revision to Diff 52804.Apr 6 2016, 8:18 AM
thanm retitled this revision from to Rework/enhance stack coloring data flow analysis..
thanm updated this object.
thanm added reviewers: gbiv, davidxl.
thanm updated this revision to Diff 52912.Apr 7 2016, 7:01 AM

Fix nit: debugging code was improperly guarded.

davidxl edited reviewers, added: wmi, qcolombet; removed: davidxl.Apr 7 2016, 9:55 AM
davidxl added a subscriber: davidxl.
thanm added a comment.Apr 7 2016, 1:04 PM

I would welcome any advice/suggestions on the best way to qualify these changes. I've run "ninja check" of course, and I've also run "lnt runtest nt" as described in http://llvm.org/docs/lnt/quickstart.html.

wmi edited edge metadata.Apr 10 2016, 3:36 PM

Hi Than,

It is a nice improvement and the code looks clean.

I can see the patch should work well for caller func with local vars introduced from callee during inline. The only concern I have is for independent local vars both defined in the same func: Because an interval is from use to @llvm.lifetime.end, even if two local vars have separate live ranges, they may still be regarded as overlapped only because their live intervals are extended from the last use to @llvm.lifetime.end. I understand it is safe to use @llvm.lifetime.end as the end of live interval, especically when there are indirect accesses to those local vars, but it can also limit the improvement we could have got, especially for local vars defined in the same func. I tried the motivational testcase stack-ifthen.cc and found it had the problem described here too - it still allocated space for both itar1 and itar2 (@llvm.lifetime.end of itar1 and itar2 are both sinked to the func exit, not at the end of then and else branches. I apply the patch and the command I use: ~/workarea/llvm-r265225/dbuild/bin/clang++ -O2 -S stack-ifthen.cc).

If we want to recognize more exact live interval of local vars, we need to involve alias analysis to count every alias access also as the uses of a local var in addition to direct use of local vars. Or at least if we know which local vars don't have the address saved or passed, we can only use their direct uses to compute live interval.

Since inline is an important source of increasing stack space, I think it maybe ok to leave the more exact interval requirement described here for further improvement. The cause that stack-ifthen.cc were not improved may be worthy to look at.

Other inline comments are nits.

Thanks,
Wei.

lib/CodeGen/StackColoring.cpp
441

Since we only do dataflow iteration in forward direction now, if only no change in LiveIn, there shouldn't be change in LiveOut. So can we move BlockInfo.LiveOut |= LocalLiveOut into the above if?

815–817

There may be an unused warning when NDEBUG is enabled.

test/CodeGen/X86/StackColoring.ll
445

%0, %1, %2 are prone to changing. we can run "opt -instnamer" to rename %0, %1, %2..., so the test will be stabler.

thanm added a comment.Apr 11 2016, 8:15 AM

Hi Wei,

Thanks for the review. I responded to your comments in line.

You are quite right, even after this patch, clang/llvm will still worse than GCC. The reason has to do with the way that the GCC code computes interferences between variables.

The GCC implementation uses an interference graph (node == var, edge == coloring conflict between vars), and it adds edges in the interference graph only at DEFs: statements where there could conceivably be some sort of memory write. In contrast, the LLVM algorithm works by building live intervals and checking to see whether there is any overlap between the intervals (there is no attention paid to DEF vs non-DEF). This second strategy is simpler but is definitely less precise in some cases.

Let me expand on this a bit. Going back to the if/then example:

int foo(int x) {
  int ar1[128]; int ar2[128];
  if (x & 0x99) {
    bar(&ar1); return x-1;
  } else {
    bar(&ar2); return x+1;
  }
}

Here is what the IR looks like at the point where the analysis is done (it's pretety much the same between both GCC and LLVM at an abstract level):

       BB#1:
I0:    call bar(&ar1)
I1:    %vreg1 = %vregx - 1
I2:    jmp BB3

       BB#2:
I3:    call bar(&ar1)
I4:    %vreg2 = %vregx + 1
       <fall through>
   
       BB#3:
I5:    %vrefr = phi(%vreg1, %vreg2)
I6:    LIFETIME_END <ar1>
I7:    LIFETIME_END <ar2>
I8:    <return %vregr>

As mentioned previously, the GCC algorithm uses an uses an interference graph -- for each variable X it keeps a bit vector of other variables that X interferes with. At each BB, it computes the live-in set for the BB by unioning together the live out sets of the pred BBs. It then walks the BB and looks for any instruction that could write memory (essentially anything other than phi or debug). At such an inst, it adds conflicts between every variable in the work set.

At the point where GCC processes BB#3 above, both "ar1" and "ar2" are in the work set. However as it walks through the BB it never sees any instruction that could write memory, so it never adds conflicts between the vars.

The LLVM implementation doesn't use an interference graph; instead it builds live intervals for each variable and then sees whether there is overlap between the intervals; this approach is less precise. For the graph above, we wind up with the following live intervals:

ar1: [1-2], [5-6]
ar2: [3-7]

The existing stack-coloring code examines these two intervals and sees that they overlap in at inst 5, so it considers the two variables as interfering and gives them separate stack allocations.

[This situation is somewhat related to the so-called "X graph problem" in register allocation, you may have heard of it.]

The patch I wrote sticks with the current method of building the (less precise) interval graph, hence it doesn't catch as many legal variable overlaps. The main reason I took this route was that there is an option in the existing implementation ("protect-from-escaped-allocas") that uses the interval information to identify "stray" uses of a stack location outside of the LIFETIME-START/LIFETIME-END interval the variable -- moving to an interference graph would have made it difficult to retain this functionality. Second, since I am an LLVM newbie I thought it would be best to start with a relatively modest patch, as opposed to completely throwing out the old implementation and putting in my own.

Given that the new implementation is a bit better but not ideal, I think we can revisit it at some point and decide whether we want to throw out intervals and move to interference-graph.

Thanks, Than

lib/CodeGen/StackColoring.cpp
441

I think there would be a problem with that on the first dataflow iteration for blocks containing a GEN (or first-use) of some variable "x". In that case we wind up adding "X" to the live out set, so we need to update the changed flag. On all subsequent iterations what you say is true.

815–817

Hmm. Good point. I will rework the code to handle this.

test/CodeGen/X86/StackColoring.ll
445

Thanks-- I hadn't realized that we had such a tool. I'll update the test.

wmi added a subscriber: wmi.Apr 11 2016, 9:37 AM

Than,

Thank you for the detailed explanation of the implementation difference
between GCC and LLVM. I have a question about GCC's implementation inlined.

Wei wrote: But isn't bar() could potentially change ar1 and ar2 indirectly since ar1 and ar2 are escaped local arrays from current func?

GCC's implementation completely ignores "lifetime start" and instead adds a var X to the work set the first time it sees a reference to it. So on entry to BB1 it has not yet seen any references to "ar2", so when it processes "call bar(&ar1)" no edge is added. Ditto for processing BB2; live-in for the block is empty (since no explicit references in the entry block), so when it sees "call bar(&a2)" no edge is added. Both ar1 and ar1 are part of the work set on entry to BB3, but BB3 has no instructions that could access memory.

BTW in the abstract IR there is a typo, call in BB2 should be "call bar(&ar2)" not "call bar(&ar1)".

thanm updated this revision to Diff 53276.Apr 11 2016, 10:40 AM
thanm edited edge metadata.

Incorporate code review feedback from Wei; rebase.

thanm updated this revision to Diff 53281.Apr 11 2016, 10:50 AM

Incorporate code review feedback from Wei (part 2).

qcolombet edited edge metadata.Apr 11 2016, 11:19 AM

Hi Than,

Have you run clang-format?
Some of the formatting looks strange to me.

I haven’t looked into the algo yet.

Cheers,
-Quentin

lib/CodeGen/StackColoring.cpp
178

Can’t we theoretically have an interesting lifetime start and end on the same instruction.
The API does not allow to model that. What happens in that case?

181

Having look at the implementation, this method does not support nullptr as input argument so please, use references.

242

All the dump methods are refactoring right?
If so, please commit them separately.

244

Add a message in assert.

262
360

SmallVector?

464–465

Period.

thanm added inline comments.Apr 11 2016, 12:48 PM
lib/CodeGen/StackColoring.cpp
178

As far as I know (take this with a grain of salt, since I am an LLVM newbie) the only way to end a lifetime is with a LIFETIME_END marker instruction. If you can think of a scenario where you might have both, then yes, this should definitely be changed. Let me know if you have a real example?

181

OK, I will convert to use references.

242

Yes, all changes to dump methods are refactoring. I will pull them into a separate patch.

244

Will fix.

262

Will fix.

360

Will fix.

464–466

Will fix.

qcolombet added inline comments.Apr 11 2016, 12:57 PM
lib/CodeGen/StackColoring.cpp
178

No, you’re right, for lifetime markers this cannot happen at the same time. I was thinking in term of live-range. Maybe add in the comment that we consider only lifetime end from LIFETIME_END instructions and thus, we cannot have both conditions (start and end) at the same time.

thanm added a comment.Apr 11 2016, 2:36 PM

Hi Quentin,

Haven't run clang-format. Would it make sense to run afterwards, so as to not mix up pure formatting changes with functional changes?

I am ok either way, just not sure what the best practice is.

Thanks, Than

thanm updated this revision to Diff 53320.Apr 11 2016, 2:36 PM
thanm edited edge metadata.

Incorporate code review feedback from qcolombet.

thanm updated this revision to Diff 53574.Apr 13 2016, 9:04 AM

Improve unit test coverage.

wmi added a comment.Apr 13 2016, 4:29 PM

Some Nits. Other than that LGTM. Quentin knows it better than me so I will let him to approve it finally.

Thanks,
Wei.

lib/CodeGen/StackColoring.cpp
366

Missing an assert message.

821

We want dumpIntervals() after intervals being computed.

thanm added a comment.Apr 18 2016, 8:34 AM

Hi all,
I did some additional testing of this patch over the weekend, and I've discovered (unfortunately) a couple of cases where my new algorithm is not doing as well as the old algorithm (in spite of the fact that the new scheme does fix the case that motivated me to file the bug in the first place). I am still working to understand what the issue is, but it looks as though the degradations are in cases where uses of a stack slot appear outside of regions strictly dominated/post-dominated by markers.
I will post another comment when I have more info.
Thanks, Than

thanm added a comment.Apr 19 2016, 6:07 AM

Some more detail on my previous comment.

I should clarify what I am seeing are not correctness problems (e.g. incorrectly overlapping two slots that should not be overlapped) but performance problems (cases where the old implementation overlaps more slots then the new implementation). I had assumed that my new scheme would always produce better results than the old scheme, but it turns out that this is not the case.

The first problem looks like it is due to a bug in my implementation-- since with the new scheme there can now be multiple starts/begins for a lifetime, the code that computes begin/end has to treat begin/end in the same block as "end" (as opposed to a no-op) during the liveness propagation. Example:

 BB#2:
   <use fi#1>
   ...
   fall through to BB#3

BB#3:
   <use fi#1>
   LIFETIME_END <fi#1>

The old implementation assumed matched pairs of begin/end -- in the case above where you have multiple begins for fi#1, there would not be a 'kill' of the fi#1 lifetime in BB#3, so its lifetime would be extended. I have put in a fix for this.

The second and more serious problem is that I am seeing cases where stack slot references are hoisted out of loops into preheader blocks, where the preheader block is not strictly dominated by the original lifetime start. (and more importantly, is not post-dominated by the lifetime end op). It looks as though this is happening due to code motion of some sort (LICM or GVN perhaps).

In such cases the lifetimes computed by the old approach are smaller/shorter than those computed by the new approach. Also means that if the "-protectfromescapedallocas" option were to be used, the IR would be flagged as invalid, for whatever that is worth.

I will need to think about what to do with this information. I also need to run some more experiments to see just exactly how widespread the problem is.

thanm updated this revision to Diff 56315.May 5 2016, 11:07 AM

Revise stack coloring patch to handle degenerate slots.

Update to the enhancement to handle cases where references
to a given stack slot appear outside of the lifetime
start/end markers (this can happen due to optimizations
such as LICM).

thanm added a comment.May 5 2016, 11:10 AM

Hello reviewers:

My apologies for the long delay in updating this revision.

I've worked out a method for handling the specific cases I was seeing where my new technique was less effective than the old one; you can find additional detail in the comment section I added to StackColoring.cpp marked "Implementation Notes".

I think I've addressed all of Wei's and Quentin's specific comments, so things should be ready for another look now.

Thanks, Than

thanm added a comment.May 11 2016, 5:22 AM

Hi folks,

I spent some time working with SPEC yesterday to characterize the amount of improvement (number of additional stack slots eliminated) that this patch provides.

Here are the results. Note that these are not execution times, this is simply looking at the stats to collect the stack space savings reported by the optimization. I did a SPEC2006 build (int and fp, but only the C/C++ programs, no Fortran), vanilla "-O2", x86_64 target.

Base:         3011 slots removed, 87483 space saved
Enhanced:     3151 slots removed, 94137 space saved

So a modest improvement in the number of bytes saved (~ 7%), but an improvement nonetheless.

Thanks, Than

wmi added a comment.May 12 2016, 2:58 PM

Thanks for the detailed implementation notes and spec2006 data. The spec2006 data is a overall number which looks good. I guess for individual file there is no regression anymore, right?

But I havn't understood quite well about the way to handle case with degenerate slot. From my understanding, for case with degenerate slot, its liverange from startmarker to endmarker is smaller than actual, so it is possible for stackcoloring to generate incorrect code for such case (although I understand the case is never worse than without the patch). The question is before this patch, since the case with degenerate slot always existed, why we didn't see error caused by it?

According to http://lists.llvm.org/pipermail/llvm-dev/2012-September/053458.html, letting ProtectFromEscapedAllocas stay off is to detect problematic case and fix it. Is the case with degenerate slot caused by LICM such a problematic case?

Thanks,
Wei.

thanm added a comment.May 12 2016, 4:48 PM
In D18827#429001, @wmi wrote:

Thanks for the detailed implementation notes and spec2006 data. The spec2006 data is a overall number which looks good. I guess for individual file there is no regression anymore, right?

Correct.

But I havn't understood quite well about the way to handle case with degenerate slot. From my understanding, for case with degenerate slot, its liverange from startmarker to endmarker is smaller than actual, so it is possible for stackcoloring to generate incorrect code for such case (although I understand the case is never worse than without the patch). The question is before this patch, since the case with degenerate slot always existed, why we didn't see error caused by it?

Sorry, looking back at the example I put in the comment I think I left things a bit vague. In the case that I have seen so far where uses of the stack slot are hoisted outside of the livetime start/end interval, it's only the address computation that's outside the interval-- the actual memory load/store takes place inside. So at least for all of the cases I have seen so far, the program is still correct.

I will update the comments to make this clearer.

Thanks, Than

wmi added a comment.May 12 2016, 5:06 PM

Thanks for the explanation. It makes sense for me now.

lib/CodeGen/StackColoring.cpp
370

should we add an assertion "!MI.mayLoad() && !MI.mayStore()" so if weird escaped reference case does exist, they will be caught at compile time?

thanm added a comment.May 16 2016, 9:16 AM

Wei, you suggested adding an assert that triggers if an instruction is found otuside the the start/end lifetime that accesses memory.

I experimented a bit with this; turns out there is a roadblock here relating to how I degenerate ranges are identified in my implementation.

At the moment I am not doing flow analysis to determine ranges of instructions that fall in between lifetime start/end for a given slot -- the code is only doing a single pass over the CFG. This makes it fast, but there is some imprecision, e.g. if there is unstructure control flow, a slot may appear degenerate when it is not.

Here is an example, abstracted from the 400.perlbench function store_scalar():

void store_scalar(...) {
   ...
   if (...) goto string;
   ...
   if (cond) {
      int wlen;
      ...
      string:
        wlen = ...;
   }
}

Note the 'goto'. The front end will place start and end lifetime markers within the body of the "if (cond)" true block, but due to the goto there is a path that leads from the entry block to the 'wlen' assignment that doesn't go pass through the lifetime start marker. If the ordering of blocks in "depth_first(MF)" happens to choose that path, then we'll see what looks like a reference to "wlen" before we see its lifetime start marker.

I'm not sure it's worth it to slow down the pass by doing real dataflow analysis to collect lifetime ranges-- that seems like overkill to me.

lib/CodeGen/StackColoring.cpp
370

This is difficult to accomplish with my current implementation (see more detail on comment).

thanm updated this revision to Diff 57360.May 16 2016, 9:19 AM

Improvements to comments.

wmi added inline comments.May 16 2016, 10:48 AM
lib/CodeGen/StackColoring.cpp
166

repeated "can sometimes"

380

assert message missing.

thanm marked 2 inline comments as done.May 16 2016, 11:46 AM

Will post a revised patch shortly.

lib/CodeGen/StackColoring.cpp
166

Fixed.

166

Fixed.

380

Fixed

thanm updated this revision to Diff 57378.May 16 2016, 11:46 AM
thanm marked 2 inline comments as done.

Incorporate fixes suggested by Wei.

thanm added a comment.May 18 2016, 5:50 PM

Friendly ping...

Another friendly ping... I'd like to get this submitted at some point.

wmi accepted this revision.May 23 2016, 12:54 PM
wmi edited edge metadata.
This revision is now accepted and ready to land.May 23 2016, 12:54 PM
thanm closed this revision.May 24 2016, 9:08 AM

Submitted in r270559.

Thanks Wei, George, and Quentin for your reviews.

Than