[SROA] enable splitting for non-whole-alloca loads and stores
Needs ReviewPublic

Authored by inouehrs on May 9 2017, 5:52 AM.



Currently, SROA splits loads and stores only when they are accessing the whole alloca.
This patch relaxes this limitation to allow splitting a load/store if all other loads and stores to the alloca are disjoint to or fully included in the current load/store. If there is no other load or store that crosses the boundary of the current load/store, the current splitting implementation works as is.
The whole-alloca loads and stores meet this new condition and so they are still splittable.

Here is a simplified motivating example.

struct record {
    long long a;
    int b;
    int c;

int func(struct record r) {
    for (int i = 0; i < r.c; i++)
    return r.b;

When updating r.b (or r.c as well), LLVM generates redundant instructions on some platforms (such as x86_64, ppc64); here, r.b and r.c are packed into one 64-bit GPR when the struct is passed as a method argument.

With this patch, the above example is compiled into only few instructions without loop.
Without the patch, unnecessary loop-carried dependency is introduced by SROA and the loop cannot be eliminated by the later optimizers.

The path length of the std::string's assign method with istreambuf_iterator (an example below) is reduced by about 10% on ppc64.


Diff Detail

inouehrs created this revision.May 9 2017, 5:52 AM
rnk added a reviewer: rnk.May 12 2017, 3:35 PM
inouehrs updated this revision to Diff 98944.May 14 2017, 11:06 PM
inouehrs edited the summary of this revision. (Show Details)
inouehrs updated this revision to Diff 99408.May 18 2017, 1:34 AM
inouehrs retitled this revision from [WIP] SROA for method argument to [SROA] enable splitting for non-whole-alloca loads and stores.
inouehrs edited the summary of this revision. (Show Details)
inouehrs added a reviewer: efriedma.

I added a new unit test and modified affected existing unit tests.

rnk added inline comments.May 18 2017, 10:54 AM

Does this have to be n^2 in the number of slices? Can't you leverage the sorting to be more efficient?


These CHECKs have no colon, they aren't firing.

jroelofs added inline comments.

Also, the NEXT must be capitalized.

inouehrs updated this revision to Diff 99527.EditedMay 18 2017, 11:34 PM
  • Implemented a linear-time algorithm (instead of N^2 for number of slices) for searching splittable slices.
  • Fixed errors in the unit test.
inouehrs marked 3 inline comments as done.May 18 2017, 11:37 PM
inouehrs added inline comments.

I changed the algorithm to a linear-time one. I confirmed that this algorithm generates the same results to the previous N^2 algorithm during the bootstrap test.


Thank you so much for pointing this out. Fixed these errors.

inouehrs marked 2 inline comments as done.May 18 2017, 11:38 PM

Gentle ping.

This looks OK to me, but I don't have enough knowledge of SROA to accept this.

I'd like to see a response from Chandler, since he was the last one to touch this code.

Some more general performance numbers would also be nice (LLVM testsuite or SPEC); this could impact a lot of code.

inouehrs added a comment.EditedJun 6 2017, 9:12 AM

The performance changes in SPEC on POWER8 are not significant; within a range of fluctuations.
I am trying to make additional SPEC runs.

Average of three runs with and without this patch.
A positive number means improvement by the patch.

400.perlbench	-0.11%
401.bzip2	-0.01%
403.gcc		0.64%
429.mcf		-1.14%
445.gobmk	-0.05%
456.hmmer	0.03%
458.sjeng	-0.16%
462.libquantum	0.36%
464.h264ref	-0.04%
471.omnetpp	1.39%
473.astar	0.08%
483.xalancbmk	0.18%
433.milc	1.33%
444.namd	-0.04%
447.dealII	-0.27%
450.soplex	0.20%
453.povray	-0.25%
470.lbm		-0.29%
482.sphinx3	0.30%
GEOMEAN		0.11%

I conducted more performance measurements.
Overall, I did not find significant degradetions or improvements except for the iterotor example above.

tcmalloc (on ppc64le)
This patch makes additional splitting but no visible changes in malloc_bench score

gipfeli (on ppc64le)
This patch makes additional splitting in the benchmark harness (for istreambuf_iterator) but not in compressor/decompressor.

snappy (on ppc64le)
This patch makes additional splitting but no visible changes in snappy-unittest score

SPECCPU2006 (on x86_64)
No additional splitting observed in 12 out of 19 C/C++ benchmarks.
Even for benchmarks with additional splitting the changes in the score is within range of fluctuation.

403.gcc	0.00%
462.libquantum	0.00%
471.omnetpp	-0.42%
473.astar	0.00%
483.xalancbmk	0.20%
447.dealII	0.13%
453.povray	-0.06%
SPECINT (5)	-0.04%
SPECFP  (2)	0.04%
TOTAL   (7)	-0.02%

@chandlerc @efriedma I appreciate any suggestions on this. Thanks!

inouehrs updated this revision to Diff 104187.Tue, Jun 27, 9:11 AM
inouehrs added a reviewer: sanjoy.
  • rebased to the latest tree and ran tests
  • did minor touchup in comments