This is an archive of the discontinued LLVM Phabricator instance.

[X86] Order the local stack symbols to improve code size and locality.
ClosedPublic

Authored by zansari on Dec 9 2015, 1:29 PM.

Details

Summary

These changes introduce a local stack symbol table ordering phase that gives every target a chance to order the stack symbols as it sees fit.

X86 heuristics are added to order the symbols to improve code size and locality. The current default behavior for all other targets is to leave the order untouched; other targets can opt in simply by providing their own heuristics.

As an example, here are some cpu2000 code size improvements that I measured (mileage may vary depending on the options used). Numbers are the percentage reduction in the sum of the text sizes of all objects:

177.mesa 1.81%
179.art 0.50%
183.equake 1.78%
188.ammp 3.75%
164.gzip 0.79%
175.vpr 0.85%
176.gcc 0.70%
181.mcf 0.26%
186.crafty 0.47%
197.parser 0.29%
252.eon 2.47%
253.perl 0.47%
254.gap 0.55%
255.vortex 1.14%
256.bzip2 0.64%
300.twolf 0.88%
total 1.24%

I measured performance on cpu2k, cpu2006, eembc, and a few other benchmarks and it was pretty much flat, although on average, these changes should also improve data locality.

Many lit tests broke due to new offsets being assigned to local symbols. I fixed these by disabling the optimization (rather than updating the expected offsets) so that we avoid additional flakiness due to heuristic tuning.

Diff Detail

Repository
rL LLVM

Event Timeline

zansari updated this revision to Diff 42317.Dec 9 2015, 1:29 PM
zansari retitled this revision from to [X86] Order the local stack symbols to improve code size and locality..
zansari updated this object.
zansari added reviewers: rnk, qcolombet, mkuper.
zansari added a subscriber: llvm-commits.
hfinkel added a subscriber: hfinkel.Dec 9 2015, 1:42 PM
hfinkel added inline comments.
lib/Target/X86/X86FrameLowering.cpp
2714 ↗(On Diff #42317)

Based on your comparison function, it looks like this should be stable_sort.
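To make the point concrete, here is a hedged sketch (hypothetical names, not the patch's code) of why `std::stable_sort` matters: when two objects compare equal, `std::sort` may order them differently across standard-library implementations, while `std::stable_sort` preserves their original frame-index order, keeping the layout deterministic:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical illustration: sort stack objects by a key; on ties,
// std::stable_sort preserves the original frame-index order, so the
// final layout is identical on every host and standard library.
std::vector<std::pair<int, int>>
sortObjects(std::vector<std::pair<int, int>> Objs /* {SortKey, FrameIdx} */) {
  std::stable_sort(Objs.begin(), Objs.end(),
                   [](const std::pair<int, int> &A,
                      const std::pair<int, int> &B) {
                     return A.first < B.first;
                   });
  return Objs;
}
```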

lib/Target/X86/X86FrameLowering.h
240 ↗(On Diff #42317)

I'm somewhat afraid of using floating-point comparisons here because you might run into the same problems we had when computing block frequencies/probabilities with floating-point (that you'd get different answers on different systems because of enhanced intermediate precision and other non-strict-IEEE effects). Especially because we have a floating-point equality comparison below.
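One float-free alternative (a hedged sketch with hypothetical names, not the code under review) is to compare the two densities by cross-multiplication, which stays in integer arithmetic and is bit-identical on every host:

```cpp
#include <cstdint>

// Hypothetical frame-object record: static use count and object size.
struct FrameObj {
  uint64_t Uses;
  uint64_t Size; // must be > 0 for the comparison below
};

// "Density" of A is Uses/Size. Instead of dividing (floating point),
// compare A.Uses/A.Size > B.Uses/B.Size as A.Uses*B.Size > B.Uses*A.Size.
// A production version would also guard against 64-bit overflow
// (e.g. 128-bit intermediates or capped counts).
bool denserThan(const FrameObj &A, const FrameObj &B) {
  return A.Uses * B.Size > B.Uses * A.Size;
}
```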

rnk added a reviewer: hans.Dec 9 2015, 1:54 PM
rnk edited edge metadata.Dec 9 2015, 2:04 PM

Our current default stack layout algorithm optimizes for stack frame size without accounting for code size from frame offsets. I'm worried that this algorithm will reorder all the

lib/Target/X86/X86FrameLowering.cpp
2714 ↗(On Diff #42317)

+1

lib/Target/X86/X86FrameLowering.h
200 ↗(On Diff #42317)

This doesn't need to be in a header, move it to the .cpp file and put it in an anonymous namespace.

200–203 ↗(On Diff #42317)

This can just be a struct. I'd also prefer it if you used default data member initializers instead of a default constructor.

224 ↗(On Diff #42317)

Ditto, sink this to the .cpp file to reduce visibility.

240 ↗(On Diff #42317)

+1

rnk added a comment.Dec 9 2015, 2:16 PM

Oops, I hit submit too early.

Our current default stack layout algorithm optimizes for stack frame size without accounting for code size from frame offsets. I'm worried that your algorithm may reorder all of the potentially large allocations outside of the 256 bytes of stack that we can address with a smaller offset, and unnecessarily increase stack frame sizes.

I wonder if we can rephrase this as a weighted bin packing problem, where we only have one bin and it has size ~128 bytes, or the max one byte displacement from FP/SP. The object weights would be the use counts, and the goal is to put as many uses into the bin as possible. There's probably a good approximate dynamic programming algorithm that we could use for that.
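The single-bin formulation suggested above can be sketched as a classic 0/1 knapsack DP (a hedged illustration with hypothetical names, not the committed algorithm): maximize the total use count of the objects placed within the budget of cheaply addressable stack bytes:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical sketch: Objs holds {SizeInBytes, UseCount} per stack
// object; Budget is the cheaply addressable window (e.g. ~128 bytes of
// disp8-reachable stack). Returns the max total use count that fits.
int maxUsesInBin(const std::vector<std::pair<int, int>> &Objs, int Budget) {
  std::vector<int> Best(Budget + 1, 0); // Best[b]: max uses within b bytes
  for (const auto &O : Objs)
    for (int B = Budget; B >= O.first; --B) // iterate down: each object once
      Best[B] = std::max(Best[B], Best[B - O.first] + O.second);
  return Best[Budget];
}
```

This is O(NumObjects * Budget) time, which stays cheap for a ~128-byte budget.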

hans edited edge metadata.Dec 9 2015, 2:30 PM

This looks very interesting.

Maybe update the commit message and/or add to the comments in the file why this is beneficial for code size, since it wasn't obvious for me at least. Do I understand correctly that the idea is that addressing objects closer to %esp generally requires less code because it can use a tighter addressing mode?

lib/CodeGen/PrologEpilogInserter.cpp
708 ↗(On Diff #42317)

Maybe use SmallVector instead?

710 ↗(On Diff #42317)

Since this block of code no longer does the actual assigning, maybe change the comment to "prepare to assign offsets ..." or "gather objects that need stack offsets ..." or something like that.

lib/Target/X86/X86FrameLowering.cpp
2678 ↗(On Diff #42317)

Maybe SmallVector?

2722 ↗(On Diff #42317)

Declare i here, instead of at the start of the function.

2727 ↗(On Diff #42317)

I think breaking early would be more in line with LLVM style.

if (!Obj.IsValid)
  break;
ObjectsToAllocate[i++] = ...
lib/Target/X86/X86FrameLowering.h
205 ↗(On Diff #42317)

Just "unsigned" is more common in LLVM code

224 ↗(On Diff #42317)

How about calling it X86FrameSortingComparator instead?

In D15393#306456, @rnk wrote:

Oops, I hit submit too early.

Our current default stack layout algorithm optimizes for stack frame size without accounting for code size from frame offsets. I'm worried that your algorithm may reorder all of the potentially large allocations outside of the 256 bytes of stack that we can address with a smaller offset, and unnecessarily increase stack frame sizes.

I wonder if we can rephrase this as a weighted bin packing problem, where we only have one bin and it has size ~128 bytes, or the max one byte displacement from FP/SP. The object weights would be the use counts, and the goal is to put as many uses into the bin as possible. There's probably a good approximate dynamic programming algorithm that we could use for that.

Hi Reid.. Thanks for the review.

I'm not quite understanding what you mean in your first paragraph regarding stack frame size. Perhaps I'm missing something simple, but I don't see what significance large allocations have, and I also don't see the significance of 256 bytes.

With respect to stack frame size, I do see the potential for increasing it in some cases, based on how objects with larger "alignment" are ordered with respect to each other, but I'm missing how "large allocations outside of 256 bytes" comes into play. I did try to include alignment in the heuristics to help this out a little bit (it's not perfect, of course, but I felt that doing better would require a fair jump in complexity). In general, I didn't see significant increases in stack frame size, at least in the benchmarks I was looking at.

Also (and perhaps this applies to your second comment), one of my main goals for this pass was to keep it cheap and simple while getting as much benefit as possible over what we already have. I initially toyed around with a bunch of additional "smarter" heuristics that required more complexity and iterations, but the tiny extra savings they gave me in a few cases weren't worth the extra compile time, in my opinion. I found that this simple, single-pass algorithm caught the bulk of the code-size opportunities (which were fairly significant, in some cases).

Do you feel this is worth additional complexity to squeeze a little more out of?

In D15393#306468, @hans wrote:

This looks very interesting.

Maybe update the commit message and/or add to the comments in the file why this is beneficial for code size, since it wasn't obvious for me at least. Do I understand correctly that the idea is that addressing objects closer to %esp generally requires less code because it can use a tighter addressing mode?

Hi Hans.. Thanks for the review.

Yes, sorry I didn't make that clear in the description. The savings come from allocating the more frequently (statically) accessed symbols closer to sp/fp and saving 3-4 bytes of address encoding per access. For example, if a large array is allocated early on the stack and only accessed once, all other symbols on the stack would cost an extra 3-4 bytes to access, which is a fair bit of bloat if they're accessed frequently.

Hope that helps.
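Concretely (a hedged illustration of the x86 encoding rule with hypothetical helper names, not code from the patch): an `[esp + offset]` operand can use a 1-byte sign-extended displacement only when the offset fits in [-128, 127]; anything farther from the stack pointer needs a full 4-byte disp32, costing 3 extra bytes on every access:

```cpp
#include <cstdint>

// Bytes of displacement needed to encode [esp/ebp + Offset] on x86:
// disp8 (1 byte, sign-extended) covers [-128, 127]; anything else
// falls back to disp32 (4 bytes).
int dispBytes(int32_t Offset) {
  return (Offset >= -128 && Offset <= 127) ? 1 : 4;
}

// Extra code bytes per access when an object lands beyond the disp8 window.
int extraBytesPerAccess(int32_t Offset) { return dispBytes(Offset) - 1; }
```

So every frequently accessed symbol pushed past the disp8 window by a cold large array pays 3 bytes per access, which is exactly the bloat the reordering avoids.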

zansari updated this revision to Diff 42562.Dec 11 2015, 1:21 PM
zansari edited edge metadata.

Thanks for all the reviews. I think I've addressed most of the lower-level concerns with an updated patch.

I responded to the higher level/general comments in the comment dialogues.

Thanks,
Zia.

zansari marked 12 inline comments as done.Dec 16 2015, 10:53 AM

Ping..

Also, another quick comment regarding minimizing stack frame size before my changes:

In addition to my comment/response above regarding stack frame size, I don't think the existing symbol ordering heuristics minimize frame size in all cases. For example:

void bar(int *, int *, int *, int *, int *);
void bar2(int *, int *, int *);

void foo() {
  int a;
  int c;
  int e;
  __attribute__((aligned(32))) int b;
  __attribute__((aligned(32))) int d;
  bar(&a, &b, &c, &d, &e);
  bar2(&a, &c, &e);
}

With -m32, the existing symbol table layout requires 128 bytes for this frame. With my heuristics, we only require 96 bytes.

The current layout varies the frame size depending on the order in which we DECLARE the variables, indicating that we don't do everything we can to minimize the frame size. My heuristics may not always find the smallest frame size, since they tend to favor code size a little over frame size, but they tend to be close and remain consistent no matter what order the variables are declared. I don't think we can get a perfect frame-size vs. code-size tradeoff, since we need to favor one over the other.
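The effect of declaration order on frame size can be sketched with a hypothetical helper (an illustration only; it ignores the final rounding of the whole frame to the maximum alignment, so the absolute numbers differ from the -m32 figures above):

```cpp
#include <utility>
#include <vector>

// Bytes consumed by laying objects out in the given order, rounding each
// offset up to the object's alignment. Objs: {SizeInBytes, Alignment}.
int frameBytes(const std::vector<std::pair<int, int>> &Objs) {
  int Offset = 0;
  for (const auto &O : Objs) {
    Offset = (Offset + O.second - 1) / O.second * O.second; // align up
    Offset += O.first;
  }
  return Offset;
}
```

Interleaving the 32-byte-aligned objects with the plain ints pays the alignment padding twice; grouping them pays it once, which is the consistency the reordering heuristic buys.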

lib/CodeGen/PrologEpilogInserter.cpp
708 ↗(On Diff #42317)

Correct me if I'm wrong, but doesn't SmallVector require a compile-time known size?

lib/Target/X86/X86FrameLowering.cpp
2678 ↗(On Diff #42317)

Correct me if I'm wrong, but doesn't SmallVector require a compile-time known size?

hans added inline comments.Dec 16 2015, 10:57 AM
lib/CodeGen/PrologEpilogInserter.cpp
708 ↗(On Diff #42562)

The compile-time value is for the object's initial capacity. If the size grows beyond that, it will allocate memory on the heap.

See: http://llvm.org/docs/ProgrammersManual.html#llvm-adt-smallvector-h
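To make the small-buffer idea concrete, here is a hedged, self-contained sketch of the pattern behind `llvm::SmallVector` (this is not the real ADT, just the concept): elements live in an inline buffer until the size exceeds N, after which further elements spill to heap storage:

```cpp
#include <cstddef>
#include <vector>

// Toy small-buffer vector: the first N elements sit in an inline array,
// the rest in heap storage. The real llvm::SmallVector keeps a single
// contiguous buffer and moves everything on growth; this split layout is
// only to keep the sketch short.
template <typename T, size_t N> class SmallVec {
  T Inline[N];
  size_t Sz = 0;
  std::vector<T> Spill; // used only once we outgrow the inline buffer
public:
  void push_back(const T &V) {
    if (Sz < N)
      Inline[Sz] = V;
    else
      Spill.push_back(V);
    ++Sz;
  }
  size_t size() const { return Sz; }
  bool spilled() const { return Sz > N; }
  const T &operator[](size_t I) const {
    return I < N ? Inline[I] : Spill[I - N];
  }
};
```

Choosing the inline capacity N is exactly the trade-off being discussed: large enough to cover the common case without heap traffic, small enough not to bloat the stack.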

zansari updated this revision to Diff 43062.Dec 16 2015, 2:05 PM

Thanks, Hans.. I changed one of the vectors to SmallVector, and kept the other as a vector, since it's fixed and only gets malloc'ed once. I ran a few benchmarks on the first one and sampled around 3000 frames to see that size 8 captures ~90% of the cases.

joerg added a subscriber: joerg.Jan 11 2016, 9:19 AM

Have you made sure that the reordering doesn't defeat stack protector?

Hi Joerg,

Unless I'm missing something, this shouldn't affect stack protection in any way. The algorithm only works within the scope of this:

// Then assign frame offsets to stack objects that are not used to spill
// callee saved registers.
for (unsigned i = 0, e = MFI->getObjectIndexEnd(); i != e; ++i) {
  if (MFI->isObjectPreAllocated(i) &&
      MFI->getUseLocalStackAllocationBlock())
    continue;
  if (i >= MinCSFrameIndex && i <= MaxCSFrameIndex)
    continue;
  if (RS && RS->isScavengingFrameIndex((int)i))
    continue;
  if (MFI->isDeadObjectIndex(i))
    continue;
  if (MFI->getStackProtectorIndex() == (int)i)
    continue;
  if (ProtectedObjs.count(i))
    continue;

  AdjustStackOffset(MFI, i, StackGrowsDown, Offset, MaxAlign, Skew);
}

and reorders only these objects per whatever custom algorithm. The stack protection (and any other "special" conditions that need allocation) are handled prior to this allocation walk.. Please correct me, if I'm wrong.

joerg added a comment.Jan 11 2016, 9:52 AM

The stack protector wants to ensure that all (char) arrays are below the canary and all other arguments above. Such arrays can be small.

zansari added a comment.EditedJan 11 2016, 10:22 AM

Right, and I believe that's all taken care of, unless I'm missing something.

The piece of code right before the scope that I showed earlier looks like this:

// Make sure that the stack protector comes before the local variables on the
// stack.
SmallSet<int, 16> ProtectedObjs;
if (MFI->getStackProtectorIndex() >= 0) {
  StackObjSet LargeArrayObjs;
  StackObjSet SmallArrayObjs;
  StackObjSet AddrOfObjs;

  AdjustStackOffset(MFI, MFI->getStackProtectorIndex(), StackGrowsDown,
                    Offset, MaxAlign, Skew);

  // Assign large stack objects first.
  for (unsigned i = 0, e = MFI->getObjectIndexEnd(); i != e; ++i) {
    if (MFI->isObjectPreAllocated(i) &&
        MFI->getUseLocalStackAllocationBlock())
      continue;
    if (i >= MinCSFrameIndex && i <= MaxCSFrameIndex)
      continue;
    if (RS && RS->isScavengingFrameIndex((int)i))
      continue;
    if (MFI->isDeadObjectIndex(i))
      continue;
    if (MFI->getStackProtectorIndex() == (int)i)
      continue;

    switch (SP->getSSPLayout(MFI->getObjectAllocation(i))) {
    case StackProtector::SSPLK_None:
      continue;
    case StackProtector::SSPLK_SmallArray:
      SmallArrayObjs.insert(i);
      continue;
    case StackProtector::SSPLK_AddrOf:
      AddrOfObjs.insert(i);
      continue;
    case StackProtector::SSPLK_LargeArray:
      LargeArrayObjs.insert(i);
      continue;
    }
    llvm_unreachable("Unexpected SSPLayoutKind.");
  }

  AssignProtectedObjSet(LargeArrayObjs, ProtectedObjs, MFI, StackGrowsDown,
                        Offset, MaxAlign, Skew);
  AssignProtectedObjSet(SmallArrayObjs, ProtectedObjs, MFI, StackGrowsDown,
                        Offset, MaxAlign, Skew);
  AssignProtectedObjSet(AddrOfObjs, ProtectedObjs, MFI, StackGrowsDown,
                        Offset, MaxAlign, Skew);
}

This walks all objects requiring special allocation due to stack protection and assigns offsets to them. I don't touch this code. Once offsets have been assigned to these special objects, they are put in the "ProtectedObjs" bucket, which is untouched during my algorithm. My reordering should only operate within the scope of "everything that's left over at the end".

An interesting "TODO" would be to see whether there are additional code-size opportunities in reordering the way the large arrays get allocated within their own scope.

majnemer added inline comments.Feb 8 2016, 11:12 PM
lib/Target/X86/X86FrameLowering.cpp
2660–2662 ↗(On Diff #43062)

Please don't indent in namespaces.

2663–2667 ↗(On Diff #43062)

Please use = for initializers.

2683–2686 ↗(On Diff #43062)

Please clang -format this.

2684–2685 ↗(On Diff #43062)

Pointers and references lean right. Variables should be capitalized.

2707–2710 ↗(On Diff #43062)

Why signed?

2756–2757 ↗(On Diff #43062)

Why? Why not 1?

lib/Target/X86/X86FrameLowering.h
142 ↗(On Diff #43062)

Please capitalize variable names.

spatel added a subscriber: spatel.Feb 9 2016, 7:25 AM

One interesting thing to also note: the 11, 9, and 12 rows have such bloated instruction sizes primarily because of needing to store an immediate "0" as an imm32. (we have some open PR's about this I think).

Excellent data visualization - thanks! If you have a chance, it would be interesting to recompile the app that you're looking at with this patch applied to see if that app benefits.

The related instruction-size bugzillas that I filed are:
https://llvm.org/bugs/show_bug.cgi?id=24447
https://llvm.org/bugs/show_bug.cgi?id=24448
https://llvm.org/bugs/show_bug.cgi?id=24449

We made some progress on the example in the last report, but that may not apply to the cases that you were looking at. It's possible that this helps too:
http://reviews.llvm.org/rL256969

zansari updated this revision to Diff 47420.Feb 9 2016, 8:56 PM

Thanks, David, for the review.. And thanks, Sean, for the additional data and analysis on your workloads. I'm glad it showed positive results for you.

David, I think I addressed all of your concerns. The one I left open was the one regarding what size I assign to variable-size symbols. I couldn't think of any number that made sense, so I just arbitrarily picked 4. I figured that 1 would probably make less sense, since most allocas, for example, would probably be > 1B. If you feel 1 would make more sense, I wouldn't have any issues changing it. I doubt it would make much difference in the grand scheme of things.

Thanks,
Zia.

zansari marked 6 inline comments as done.Feb 9 2016, 9:13 PM
zansari added inline comments.
lib/Target/X86/X86FrameLowering.cpp
2756–2757 ↗(On Diff #43062)

I couldn't think of any number that made sense, so I just arbitrarily picked 4. I figured that 1 would probably make less sense, since most allocas, for example, would probably be > 1B. If you feel 1 would make more sense, I wouldn't have any issues changing it. I doubt it would make much difference in the grand scheme of things.

majnemer accepted this revision.Feb 9 2016, 9:41 PM
majnemer edited edge metadata.

LGTM with nits addressed.

include/llvm/Target/TargetFrameLowering.h
321–322 ↗(On Diff #47420)

Please clang-format this.

323 ↗(On Diff #47420)

I'd just keep the function body empty.

lib/Target/X86/X86FrameLowering.cpp
2764–2766 ↗(On Diff #47420)

Could this be a range-based for loop over MI.operands() ?

This revision is now accepted and ready to land.Feb 9 2016, 9:41 PM
silvas added inline comments.Feb 9 2016, 9:51 PM
lib/Target/X86/X86FrameLowering.cpp
2697 ↗(On Diff #47420)

Tiny nit: This comment reference "double" but the code doesn't seem to be using floating point anymore.

majnemer added inline comments.Feb 9 2016, 11:03 PM
lib/Target/X86/X86FrameLowering.cpp
2697 ↗(On Diff #47420)

This comment addresses this later on with: "This ends up factoring the division and, with it, the need for any floating arithmetic."

zansari marked 3 inline comments as done.Feb 15 2016, 3:42 PM
Closed by commit rL260917 (authored by zansari). · Explain WhyFeb 15 2016, 3:48 PM
This revision was automatically updated to reflect the committed changes.
Gerolf added a subscriber: Gerolf.Feb 17 2016, 9:00 PM

Hi Zia,

do you have HSW performance numbers for this change? An internal bot is logging a >10% regression for MultiSource/Benchmarks/Ptrdist/ks/ks https://smooshbase.apple.com/perf/db_default/v4/nts/graph?highlight_run=114068&plot.746=313.746.3 (O3 flto) pinned to this change. If you can reproduce it perhaps there is a tuning opportunity.

Thanks!
Gerolf

tkn added a subscriber: tkn.Feb 22 2016, 5:57 AM

Hi Zia,

do you have HSW performance numbers for this change? An internal bot is logging a >10% regression for MultiSource/Benchmarks/Ptrdist/ks/ks https://smooshbase.apple.com/perf/db_default/v4/nts/graph?highlight_run=114068&plot.746=313.746.3 (O3 flto) pinned to this change.

Thanks!
Gerolf

Hi Gerolf,

I ran internal performance testing on Atom and HSW (spec2k/2006, eembc, and a few other benchmarks), and I didn't see any negative swings. I'll be happy to look into your regression if you can give me something to look at. The link doesn't open for me (page not available).

If you can reproduce it perhaps there is a tuning opportunity.

Is this a test to which I can get access easily? Let me know how I can get/run it and I'll take a look.

Thanks,
Zia.

hans added a comment.Feb 22 2016, 4:08 PM

In terms of Chromium code size, this reduced the size of one of our main .dlls by 447 kB, which is a significant chunk!

This was just about a test in the LLVM test suite; I assume you have run it and not seen any issue (sorry for the confusing link).

MultiSource/Benchmarks/Ptrdist/ks/ks https://smooshbase.apple.com/perf/db_default/v4/nts/graph?highlight_run=114068&plot.746=313.746.3 (compiled with O3 FLTO).

-Gerolf