This is an archive of the discontinued LLVM Phabricator instance.

Codegen: [X86] Set preferred loop alignment to 32 bytes.

Authored by iteratee on Apr 26 2016, 2:26 PM.



I recently discovered a performance regression in
test-suite/MultiSource/Benchmarks/Ptrdist/ks/ks.test due solely to a
change in alignment. If you add 16 nops at the top of
FindMaxGpAndSwap, performance drops by about 10%. If you then add an
alignment directive of 32 bytes at the top of the first loop of the
function, the performance comes back.

Verified on both Haswell and Sandybridge.
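To make the mechanics concrete, here is a small sketch (illustrative only, not part of the patch) of the padding a 32-byte loop-alignment directive implies. The offsets used below are made-up examples; the point is that inserting 16 bytes ahead of a 32-byte-aligned loop header moves it to the middle of a 32-byte block, and an alignment directive pads it back to a boundary:

```python
def pad_to_align(addr: int, align: int = 32) -> int:
    """Number of padding (nop) bytes needed to bring `addr` up to
    the next `align`-byte boundary (0 if already aligned)."""
    return (-addr) % align

# A loop header at a hypothetical offset 0x47 needs 25 bytes of
# padding to reach the next 32-byte boundary at 0x60.
assert pad_to_align(0x47) == 25

# A header at 0x40 is already aligned; inserting 16 bytes ahead of it
# (0x40 -> 0x50) strands it mid-way through a 32-byte block, which is
# the situation the 16-nop experiment above creates.
assert pad_to_align(0x40) == 0
assert pad_to_align(0x50) == 16
```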

Diff Detail


Event Timeline

iteratee updated this revision to Diff 55091.Apr 26 2016, 2:26 PM
iteratee retitled this revision from to Codegen: [X86] Set preferred loop alignment to 32 bytes..
iteratee updated this object.
iteratee added reviewers: craig.topper, echristo.
iteratee set the repository for this revision to rL LLVM.

As a default I think this is reasonable, and I don't know that we need to make it subtarget-dependent. Sandy Bridge is going back a ways at this point and Haswell is fairly recent; I don't know how performance stands on Broadwell, though.

Adding Quentin here for that perspective.


qcolombet edited edge metadata.Apr 27 2016, 10:39 AM

Hi Kyle,

What performance numbers do you get beyond just that benchmark?

I am trying to ascertain how broadly this will impact performance and thus whether it is worth doing.

My problem is that changing this attribute will have a non-negligible impact on code size (do you have numbers, BTW?), and I am not ready to take that hit without broader numbers. In particular, I think ks is known to be noisy, so I would just accept that regression and be done with it.


PS: Please add llvm-commits as a Subscriber.

Hi Quentin,

To be clear, we're seeing multiple benchmarks speed up by 10% because of this change. That said, size numbers would be good just so we know what we're getting into, though I don't think they should be a blocking factor here. If size is an issue, we can make loop alignment in general dependent upon optimization level, but for CPU optimization we don't want size to be the guiding factor (i.e. the patch is still fine, but we might want to change where we align loops).



mehdi_amini added inline comments.Apr 27 2016, 11:22 AM

The settings just above differ depending on OptSize; if there is an impact on code size, it makes sense to make the PrefLoopAlignment handling conditional in the same way.
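A sketch of what that suggestion amounts to (hypothetical policy, not the actual patch): pad hot loops to 32 bytes when optimizing for speed, but skip the padding entirely when optimizing for size, mirroring how the neighbouring settings already branch on OptSize.

```python
def pref_loop_alignment(opt_for_size: bool) -> int:
    """Hypothetical policy sketch: preferred loop alignment in bytes.

    32-byte alignment for speed builds; effectively no padding (1-byte
    alignment) under size optimization, so alignment nops never count
    against -Os/-Oz code size.
    """
    return 1 if opt_for_size else 32

# Speed builds pad loop headers to a 32-byte boundary.
assert pref_loop_alignment(opt_for_size=False) == 32
# Size builds add no padding at all.
assert pref_loop_alignment(opt_for_size=True) == 1
```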

I'd like to see more data showing the real benefit of this, isolated from other factors (read to the end to see what I mean), because I believe this 10% is a side effect and not a real consequence of the realignment.

First, ks is absolutely not a reliable test; I spent a long week on this test alone for the same issue, trying to align loops in this benchmark. At the time I was doing some performance tuning and noticed a 10% regression after one of my changes. It turned out that my A/B test was expanding the __FILE__ macro to a different size, and that led to performance swings on this test.

Here are my notes from last October:

I measured the time while replacing the __FILE__ macro with a string growing one byte at a time:

0->4: ~750ms
5->19: ~850ms
20->35: ~730ms
36->51: ~850ms
52->67: ~730ms
68->83: ~850ms
84->99: ~750ms
100->115: ~870ms

The pattern continues (I checked up to 1024): 16 bytes fast, then 16 bytes slow.
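The measurements above can be captured by a toy model (an assumption fitted to these notes, not Mehdi's actual analysis): growing the string shifts all subsequent code, and performance alternates between the two 16-byte halves of each 32-byte window. The 20-byte phase below is fitted to the bands starting at 20; the very first band (0->4) is one byte off from this model, which fits with the test being noisy.

```python
def band(extra_bytes: int) -> str:
    """Toy model of the timing bands: code shifted by `extra_bytes`
    lands in alternating 16-byte fast/slow halves of a 32-byte period.
    The 20-byte phase offset is fitted to the measurements above."""
    return "fast" if (extra_bytes - 20) % 32 < 16 else "slow"

# Reproduces the banding in the notes: 20->35 fast, 36->51 slow, ...
assert band(20) == "fast"
assert band(36) == "slow"
assert band(52) == "fast"
assert band(68) == "slow"
```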

My first thought was to increase the loop alignment, but it didn't provide great results all the time. I don't remember the details, but simply aligning the header of the hot loop, independently of the rest of the code, didn't help as much (it can help by side effect on the alignment of the rest of the code).

At some point, Bob heard that I was working on this test (ks) and pointed me to a bug report; I invite you to read Zia's answers there.
You should also check the slides he attached to the bug, which detail the issue in a very nice way.
Here is the relevant llvm-dev post about this:

In general, I'm not a big fan of blindly aligning loops (on any boundary), as this can cause random effects, both positive and negative.
It's very easy to come up with examples where aligning a loop causes regressions on certain architectures, specifically in loops that have control flow.

Having said that, I'd anticipate a change like this causing a fairly broad, essentially random impact on code, along with size bloat.

If the affected benchmark is important, it would be best to try to figure out the exact cause and see if there's a reasonable compiler heuristic that can be applied to target the specific issue (for many alignment-type regressions, there isn't one). If you need help with this part of the analysis because the cause is tricky to determine, I'd be happy to try and help.


After speaking with Kyle, I'm going to walk back my earlier comments. I was under the impression this helped more than the one benchmark; it turns out something else helps those.

So, just ignore me here :)



Is there anything else to discuss here, or should we close this revision?


I meant to abandon this patch.

So just close it in Phabricator (via the drop-down menu above the comment box).

iteratee abandoned this revision.May 18 2016, 11:20 AM