This is an archive of the discontinued LLVM Phabricator instance.

Stanford/Bubble sort code restructure
Abandoned · Public

Authored by evstupac on Sep 14 2016, 4:07 PM.

Details

Summary
  1. Fix a potential issue with the MAX and MIN elements of the array (each should always be one of the array's elements).
  2. Based on https://reviews.llvm.org/D18158, it turned out that bubble sort is currently a flaky test.

The flakiness is caused by unpredictable memory accesses to the array and the code:

    if (a[i] > a[i + 1]) {   /* loads a[i], a[i+1] */
        /* stores to a[i], a[i+1] only when a swap happens */
    }

Making the stores unconditional will simplify the memory accesses and potentially the CFG.
This opens the test up to compiler optimizations (unrolling, scalar replacement, if-conversion, ...).
Generally, the idea is to convert the test from a memory test into a compiler test.
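
For concreteness, here is a minimal C sketch of the restructuring described above (the array name a, the bounds, and the temporaries are illustrative, not the exact patch):

    /* Before: stores happen only when a swap is needed, so the memory
       accesses and the CFG depend on the (random) data. */
    for (i = 1; i < n; i++) {
        if (a[i] > a[i + 1]) {
            int t = a[i];
            a[i] = a[i + 1];
            a[i + 1] = t;
        }
    }

    /* After: select the values branchlessly and store unconditionally,
       so every iteration performs the same loads and stores. */
    for (i = 1; i < n; i++) {
        int lo = a[i], hi = a[i + 1];
        int min = (lo < hi) ? lo : hi;
        int max = (lo < hi) ? hi : lo;
        a[i] = min;
        a[i + 1] = max;
    }

Both loops perform one bubble-sort pass; the second merely trades the conditional swap for two unconditional stores.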

Diff Detail

Event Timeline

evstupac updated this revision to Diff 71458.Sep 14 2016, 4:07 PM
evstupac retitled this revision to "Stanford/Bubble sort code restructure".
evstupac updated this object.
evstupac added reviewers: MatzeB, mehdi_amini, hfinkel.
evstupac added subscribers: llvm-commits, mzolotukhin.
mehdi_amini edited edge metadata.Sep 14 2016, 4:14 PM

I don't understand what is "flaky" about this test.
The fact that the compiler is not smart enough suggests that we should improve the compiler, not simplify the benchmark.

I'm probably missing the whole point. The link you're providing, https://reviews.llvm.org/D18158, contains an extensive discussion; can you be more specific about where to look (or what to look for)?

SingleSource/Benchmarks/Stanford/Bubblesort.c
134

I don't understand why this change is needed (i.e. it should at least be documented).

MatzeB requested changes to this revision.Sep 14 2016, 5:01 PM
MatzeB edited edge metadata.

We should refrain from changing the benchmarks unless there is a good reason. Undefined behavior would be a good reason; the compiler not being smart enough to catch some optimization opportunities is not, IMO. If we change it, it will no longer be the "Stanford" benchmark, and there is also no reason to believe that code out in the wild doesn't look like the existing benchmark, which the compiler should handle.

At first glance I don't see any "flakiness", and the changes just look like manual loop peeling to me.

This revision now requires changes to proceed.Sep 14 2016, 5:01 PM
evstupac added inline comments.Sep 14 2016, 5:11 PM
SingleSource/Benchmarks/Stanford/Bubblesort.c
134

Ok, we can skip this. Yes, this one is just peeling that sets MAX (biggest) and MIN (littlest) to a sortlist element. Otherwise it could happen that MAX (biggest) and/or MIN (littlest) stay 0 after the Init loop.
That is just a potential bug if we change the array length or the rand algorithm; in the current test, sortlist has elements both greater than and less than 0.
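
As an illustration (a hedged sketch; the generator and initialization approximate the benchmark rather than quoting it): if the extremes start at 0 and the random data happens to be all-positive or all-negative, one of them is never updated and ends up not being an element of the array.

    biggest = 0; littlest = 0;     /* potential bug: 0 may not occur in sortlist */
    for (i = 1; i <= sortelements; i++) {
        sortlist[i] = Rand();      /* illustrative generator */
        if (sortlist[i] > biggest)  biggest  = sortlist[i];
        if (sortlist[i] < littlest) littlest = sortlist[i];
    }

    /* Peeled variant: seed both extremes from an actual element. */
    sortlist[1] = Rand();
    biggest = littlest = sortlist[1];
    for (i = 2; i <= sortelements; i++) {
        sortlist[i] = Rand();
        if (sortlist[i] > biggest)  biggest  = sortlist[i];
        if (sortlist[i] < littlest) littlest = sortlist[i];
    }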

160

The flakiness is here, and the change avoids it.

mehdi_amini added inline comments.Sep 14 2016, 5:17 PM
SingleSource/Benchmarks/Stanford/Bubblesort.c
160

Please define "flakiness".

MatzeB added inline comments.Sep 14 2016, 5:18 PM
SingleSource/Benchmarks/Stanford/Bubblesort.c
134

If srtelements is 0, then with this change we will write to sortlist[1], which is bad.

160

I don't understand; the code does the same thing before and after. (You just changed it to always read and write the array elements, even when no swap happens.)

evstupac updated this revision to Diff 71465.Sep 14 2016, 5:19 PM
evstupac edited edge metadata.

Keep only the flakiness fix.

evstupac added inline comments.Sep 14 2016, 5:29 PM
SingleSource/Benchmarks/Stanford/Bubblesort.c
137

No; the array is defined as

int sortlist[sortelements+1]

And that case would fail later in the code as well:

if ( (sortlist[1] != littlest) || (sortlist[srtelements] != biggest) )

But I agree, let's not focus on this. I'll remove these changes.

160

The flakiness is caused by unpredictable memory accesses, over a short distance, to the array and the code:

Loop:

    if (a[i] > a[i + 1]) {  // loads a[i], a[i+1]
        // stores to a[i], a[i+1] only when a swap happens
    }

Making the stores unconditional will simplify the memory accesses and potentially the CFG.
This opens the test up to compiler optimizations (unrolling, scalar replacement, if-conversion, ...).
Generally, the idea is to convert the test from a memory test into a compiler test.
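
Once the stores are unconditional, the scalar replacement mentioned above becomes straightforward. A hedged sketch of what the transformed pass could look like (illustrative, not actual compiler output):

    /* Carry the running maximum in a register across iterations, so each
       iteration performs one load and one store instead of two of each. */
    int cur = a[1];
    for (i = 1; i < n; i++) {
        int nxt = a[i + 1];
        a[i] = (cur < nxt) ? cur : nxt;    /* unconditional store of the min */
        cur  = (cur < nxt) ? nxt : cur;    /* running max stays in a register */
    }
    a[n] = cur;

This is still exactly one bubble-sort pass, but with if-conversion and scalar replacement applied, which is the kind of rewrite the unconditional-store form enables.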

MatzeB added inline comments.Sep 14 2016, 5:34 PM
SingleSource/Benchmarks/Stanford/Bubblesort.c
160

This opens the test up to compiler optimizations (unrolling, scalar replacement, if-conversion, ...).

Generally, the idea is to convert the test from a memory test into a compiler test.

We should not change any of the existing benchmarks. It is the Stanford benchmark as is; if you change it, we will have a sudden change in performance in the LNT database, and we can no longer claim to be compiling the "Stanford" benchmark (it's the same reason that we cannot modify SPEC to be better suited for the compiler).

If anything you should add a new benchmark with the structure you desire.

mehdi_amini added inline comments.Sep 14 2016, 5:36 PM
SingleSource/Benchmarks/Stanford/Bubblesort.c
160

It seems to me that you're creating a "new" benchmark by doing that.
It does not seem legitimate to change a benchmark because you find it hard to analyze from within the compiler.

(That does not fit my definition of flaky, either.)

evstupac added inline comments.Sep 14 2016, 5:46 PM
SingleSource/Benchmarks/Stanford/Bubblesort.c
160

Please define "flakiness".

Currently the code exhibits both branch mispredictions and memory stalls. Sometimes they compensate for each other, sometimes not; that depends on the array and code addresses. In https://reviews.llvm.org/D18158, depending on which unroll technique we use (epilog/prolog), both addresses changed. That causes a performance difference of up to 2x with an identical hot loop - which is outside the compiler's control.

evstupac added inline comments.Sep 14 2016, 6:01 PM
SingleSource/Benchmarks/Stanford/Bubblesort.c
160

we can no longer claim to be compiling the "Stanford" benchmark (it's the same reason that we cannot modify SPEC to be better suited for the compiler)

That is true.
However, SPEC is different: there are regular updates, and the newest versions are checked to avoid any kind of flakiness on the latest architectures.

The Stanford benchmarks are old and there are no updates (correct me if I'm wrong; even the running time of these benchmarks is ~0.01 sec or less). We are not checking SPEC92 performance on the latest architectures, but we are still checking the Stanford benchmarks.
So yes, in that sense we are unable to influence the benchmarks.
But maybe there is a chance to switch to renewed Stanford benchmarks in LLVM testing?

Where can we get newer Stanford benchmarks? Who is the owner to whom I can address my concerns?

The only public source I've found is:
http://classes.engineering.wustl.edu/cse465/docs/BCCExamples/stanford.c
But if that is the source, the benchmarks were already modified by the submitter. Why are we unable to modify them now?

Finally, I see 2 possibilities:

  1. The benchmarks are owned by the LLVM community, and we can modify them.
  2. The benchmarks are owned by someone else; we already modified them 14 years ago and do not want to take updates or make other modifications, because... the initial modifications were approved by the owner and there were no more updates. I can request one more approval from the owner.

But if that is the source, the benchmarks were already modified by the submitter. Why are we unable to modify them now?

Yes, the benchmark is probably a bad benchmark: it has no official reference and it is unlikely to ever be updated. However, given that the very same will be true for a lot of source code in the wild, I don't see this as a bad thing. I still think it is more honest to live with the performance swings (we can explain them after all, so there is no need to blame the compiler or your patches) rather than modifying our benchmarks so the results look nicer.

I still think it is more honest to live with the performance swings (we can explain them after all, so there is no need to blame the compiler or your patches) rather than modifying our benchmarks so the results look nicer.

Honest - yes. More effective - no.
I've already spent a lot of time trying to prove that an epilogue instead of a prologue has no controlled influence on this test... If we leave it as is, someone else will get a performance improvement because of some change, and that will look ok. And after that, someone else will get a performance drop and spend a lot of time doing the same analysis.

We should decide what we are using these tests for:

  1. Regression control? Then history does not matter, and we can apply changes.
  2. To represent some widely used benchmark? Most likely no.
  3. To prove that LLVM is moving forward, release to release, on a flaky bubble sort? Maybe.
  4. To represent some important customer application? Hmm... most likely no.
  5. Something else?

What is the benefit of leaving the benchmark as is?
The benefit of changing it is obvious - we'll save time on future commits.

mehdi_amini added inline comments.Sep 14 2016, 8:07 PM
SingleSource/Benchmarks/Stanford/Bubblesort.c
160

Can you provide more detailed explanations of (or specific pointers to) an analysis that shows how this loop is sensitive to transformations in a way the compiler cannot predict? This seems quite hand-wavy to me right now.

evstupac added inline comments.Sep 15 2016, 3:12 PM
SingleSource/Benchmarks/Stanford/Bubblesort.c
160

I've filed a bug with assembly attachments: https://llvm.org/bugs/show_bug.cgi?id=30404
Please take a look at the loop assemblies, and try to compile and run the 2 variants. If you need more information - such as modified assembly with identical hottest loops, or collected counters - please let me know.

chandlerc requested changes to this revision.Oct 5 2016, 5:36 PM
chandlerc edited edge metadata.
chandlerc added inline comments.
SingleSource/Benchmarks/Stanford/Bubblesort.c
160

What Mehdi said.

This is not flaky. You don't like what a particular transform does to the performance here. You can't fix this by changing the benchmark.

You can add more benchmarks that show that a particular transform is *worth* the regression in this benchmark.

Or you can teach the transform not to regress the code pattern that this benchmark is testing, which seems the more likely solution.

No matter what, *changing the benchmark code is not the answer*.

This revision now requires changes to proceed.Oct 5 2016, 5:36 PM
evstupac edited edge metadata.Oct 6 2016, 12:44 AM
evstupac added a subscriber: zansari.

This is not flaky.

Maybe it is not flaky. But what I'm trying to say is that, as it stands, the test is not ok for measuring compiler performance.
Exactly the same hot loop (same instructions, same addresses), but performance is 2 times lower. How can the compiler deal with this?
The branches here are unpredictable; the data is random. In the prolog case we are lucky to get more predictable branches; in the epilog case, we are not.

We could leave the test code as is but modify the array data filling. That would also help. But leaving the test fully as is will cause unreasonable performance changes in the future.
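
A hedged sketch of the "modify the array data filling" idea (the generator below is illustrative, not a concrete proposal): produce nearly-sorted data so the swap branch in the inner loop becomes highly predictable while the benchmark code itself stays untouched.

    /* Mostly-increasing data with occasional local inversions: the swap
       branch is mostly not taken, so branch-misprediction noise largely
       disappears from the measurement. */
    for (i = 1; i <= sortelements; i++)
        sortlist[i] = i * 10 + (Rand() % 25);

The tradeoff is that the benchmark would then measure mostly predictable branches rather than the current unpredictable ones.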

This is not flaky.

Maybe it is not flaky. But what I'm trying to say is that, as it stands, the test is not ok for measuring compiler performance.
Exactly the same hot loop (same instructions, same addresses), but performance is 2 times lower. How can the compiler deal with this?

But there is *some* difference in the generated code. You were passing two different flags to the compiler and changing the transformations.

The branches here are unpredictable; the data is random. In the prolog case we are lucky to get more predictable branches; in the epilog case, we are not.

Sure. Why shouldn't we have a benchmark with unpredictable branches?

I understand if you *also* want to benchmark predictable branches. But that doesn't mean the current benchmark is invalid.

We could leave the test code as is but modify the array data filling. That would also help. But leaving the test fully as is will cause unreasonable performance changes in the future.

I'm not sure why you call this unreasonable.

You've not really explained:

  • What transform you are changing that causes this benchmark's performance to fluctuate.
  • Why that transform is clearly a good thing despite regressing this benchmark.

Until you do those two things, IMO this benchmark is doing exactly what is intended: it is identifying a performance regression.

Now, if you present clear and clearly explained evidence that there is a transform you would like to enable which happens to regress this benchmark but has a very good impact on other benchmarks, perhaps we would be OK with accepting that tradeoff. I don't see why that makes this benchmark invalid; it just means that the regression observed is tolerable.

I will also point out that your analysis is completely x86-specific. There are other architectures for which this might be a very useful benchmark, even if it loses all utility for benchmarking compiler changes on x86.

I'll add my 2 cents here, since I did spend some time looking into the underlying x86 architectural reasons for this regression.

  1. I personally don't agree with these changes to the benchmark. As is, the benchmark is just fine; all the changes do is move things around to generate slightly different IL that the compiler could itself do or undo in the future. I see no reason for the changes beyond temporarily masking a performance swing.
  2. I'll leave it to Evgeny to describe the overall/general benefits of the original loop unrolling patch.
  3. Having said all that, the x86 regressions observed in this benchmark from the patch are 100% random bad luck due to unpredictable behavior of the branch predictor (on the architecture where the regression was observed). I've verified this through pipe-traces and have made reproducers that demonstrate the regression even without Evgeny's loop unrolling transformation. For this particular benchmark, the loop unrolling changes are not doing anything "bad", and there's nothing the compiler can do to directly (i.e., first-order) affect the source of the regression.

In my opinion, I think the right course of action, at this point is:

  1. Evgeny to explain why the loop unrolling changes are generally a good thing to do (on all architectures).
  2. If everyone is OK with #1, then the x86 regression seen in bubble sort should simply be called "bad luck", due to x86 architectural reasons the compiler can't see or control. We'll just expect this to randomly bounce back up/down in the future, and move on.

If anyone has any strong objections to #2, I'd be happy to discuss it in more detail.

Thanks,
Zia.

This is not flaky.

Maybe it is not flaky. But what I'm trying to say is that, as it stands, the test is not ok for measuring compiler performance.

I was digging through my archive because, in the past, "someone" (I didn't remember who) provided great explanations of why a test case wasn't great for a particular microarchitecture, and I wanted to point to it as an example of the level of information I'd expect from someone trying to convince others about what's going on, beyond simply a claim that "this is bad luck with the predictor".

I just found it: https://llvm.org/bugs/show_bug.cgi?id=5615 (See the description, and the attached PPT).
(I noticed when finding it that the name seems familiar...)

evstupac abandoned this revision.May 2 2018, 12:00 PM