When the allocated buffer is smaller than requested, inplace_merge does NOT use the partially allocated buffer at all and instead calls rotate directly. This leaves its performance far behind the corresponding code in GCC. This patch tries to use the partially allocated buffer first. Experiments show speedups of 28.35% and 25.06% when merging two equal-sized sorted sequences of integers, with and without -O3 respectively.
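To illustrate the idea, here is a minimal sketch of a rotate that uses a limited scratch buffer when one side of the range fits in it, and only falls back to `std::rotate` otherwise. It is not the code from this patch: `rotate_with_buffer`, `buf`, and `buf_size` are hypothetical names, and the buffer is assumed to hold already-constructed elements (for example, a plain `int` array).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper (not the patch's code): rotate [first, last) around
// `middle`, using a scratch buffer of `buf_size` constructed elements
// whenever one of the two sides fits in it.
template <class BidirIt, class T>
BidirIt rotate_with_buffer(BidirIt first, BidirIt middle, BidirIt last,
                           T* buf, std::ptrdiff_t buf_size) {
    assert(buf_size >= 0);
    const std::ptrdiff_t len1 = std::distance(first, middle);
    const std::ptrdiff_t len2 = std::distance(middle, last);
    if (len1 == 0) return last;
    if (len2 == 0) return first;

    if (len2 <= len1 && len2 <= buf_size) {
        // Right side fits: stash it, slide the left side to the end,
        // then put the stashed elements at the front.
        T* buf_end = std::move(middle, last, buf);
        std::move_backward(first, middle, last);
        return std::move(buf, buf_end, first);
    }
    if (len1 <= buf_size) {
        // Left side fits: stash it, slide the right side to the front,
        // then put the stashed elements behind it.
        T* buf_end = std::move(first, middle, buf);
        BidirIt new_middle = std::move(middle, last, first);
        std::move(buf, buf_end, new_middle);
        return new_middle;
    }
    // Neither side fits in the buffer: fall back to the in-place rotate.
    return std::rotate(first, middle, last);
}
```

Like `std::rotate`, the sketch returns the new position of the element that was originally at `first`. The buffered paths move each element a bounded number of times, which is presumably where a partially allocated buffer pays off.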
See the experimental results below.
These iterator calculations only work for random access iterators.
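For illustration, a category-agnostic way to spell this kind of arithmetic is to go through `std::distance` and `std::next` instead of `operator-` and `operator+`. The helper below is a hypothetical example, not taken from the patch:

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Hypothetical helper (not from the patch): midpoint of [first, last)
// computed without operator+ or operator- on the iterators, so it also
// compiles for forward and bidirectional iterators.
template <class Iter>
Iter range_midpoint(Iter first, Iter last) {
    auto len = std::distance(first, last);  // linear time unless random access
    assert(len >= 0);
    return std::next(first, len / 2);
}
```

With random access iterators both calls are constant time, so this spelling should cost nothing extra there.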
Does this actually work with forward iterators?