This is an archive of the discontinued LLVM Phabricator instance.

Under limitation of allocated buffer, inplace_merge() does NOT take usage of partial allocated buffer but applies native rotate directly.
Needs ReviewPublic

Authored by WeiChungChang on Jan 21 2018, 6:06 PM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

Under limitation of allocated buffer, inplace_merge does NOT take usage of partial allocated buffer but uses native rotate directly. It makes the performance far behind from corresponding part of GCC. In this patch, it tries to use partial allocated buffer firstly. Experiment shows 28.35% & 25.06% speedup for merging two equal size sorted integers for -O3 & no -O3 cases.

Refer to the experimental results below

Event Timeline

WeiChungChang created this revision.Jan 21 2018, 6:06 PM
WeiChungChang edited the summary of this revision. (Show Details)Jan 21 2018, 6:07 PM
WeiChungChang retitled this revision from Under limitation of allocated buffer, inplace_merge does NOT take usge of partial allocated buffer but uses native rotate directly. It makes the performace far behind from corresponding part of GCC. In this patch, it tries to use partial allocated... to Under limitation of allocated buffer, inplace_merge() does NOT take usage of partial allocated buffer but applies native rotate directly..
WeiChungChang edited the summary of this revision. (Show Details)
WeiChungChang edited the summary of this revision. (Show Details)Jan 21 2018, 6:12 PM
mclow.lists added inline comments.Jan 22 2018, 9:38 AM
include/algorithm
2511

These iterator calculations only work for random access iterators.
Does this actually work with forward iterators?

2515

Probably a good idea to qualify these calls with _VSTD:: to ensure that no inadvertent ADL happens.

4582

It seems a shame to calculate __len1, etc on L#2511 above when you have them right here.

mclow.lists added inline comments.Jan 23 2018, 10:40 AM
include/algorithm
2515

I think that you should run some tests with types that aren't int.

You don't know what the state of the buffer is here; how much of it is actual objects, and how much of it is just raw memory. For raw memory, __move is the wrong call, because it will attempt to "clean up" the objects that are already there.

WeiChungChang added a comment.EditedJan 23 2018, 2:58 PM

Dear mclow:

Thanks a lot for your kindly reply!

I would like to briefly describe the purpose of this fix here.

This patch is in a series of 3 aimed targets.
I could explain them item-by-item in detail with experiment.
However, I mainly focus on algorithm so if there is any syntax issue, I may need your help to correct it.

Please refer to this report report about the final correction of algorithm in detail which may avoid unnecessary split, rotate under limited buffer.

Following list the 3 items I found when compared to GCC libstlc++.

  1. sort is faster but merge is far slower than libstlc++ .
  2. unstable merge speed between backward & forward. (will open other ticket later)
  3. efficient algorithm of pivot selection. (will open other ticket later)

This patch is for 1st item only
Please refer to for my test code, it is simple to sort & merge two vectors.
test source code & make file

I will attached the comparison of merge speed under different cases later (different max allocated buffer constraint).

According to your suggestion, I also would like to know which object is representative instead of integer to analyze the problem.
So I can test it too.

Thanks a lot for your suggestions.

WeiChungChang added a comment.EditedJan 23 2018, 10:00 PM

Here gives the comparison between GCC & LLVM's std c++ library.

It is shown that libc++ soon suffers from rotate speed under limited allocated buffer and slower than GCC's corresponding version.
Both LLVM & GCC run the same test program.

Notice that the meaning of full is not the same here.
For merging [A, B] with sorted A & B, each with 8*1024*1024 integers,
LLVM takes a better way to allocate 8*1024*1024 initially but GCC with 16*1024*1024.

full25%12.5%6.25%3.125%
LLVM0.2386146670.7332957781.1372851111.3486716671.603639556
GCC0.3687784440.3516592220.5329763330.6607857780.832495444

Here gives the comparison after fix.

It is better now but still slower; also, one could notice the standard variance (each round the execution time is unstable even the input is uniform)

The problem will be shown on the other ticket.

full25%12.5%6.25%3.125%
LLVM Opt 10.24019880.4071110.rG8510433990291.22162131.3857887
LLVM0.2386146670.7332957781.1372851111.3486716671.603639556
GCC0.3687784440.3516592220.5329763330.6607857780.832495444