[SROA] enable splitting for non-whole-alloca loads and stores
Needs ReviewPublic

Authored by inouehrs on May 9 2017, 5:52 AM.

Details

Summary

Currently, SROA splits loads and stores only when they are accessing the whole alloca.
This patch relaxes this limitation to allow splitting a load/store if all other loads and stores to the alloca are disjoint to or fully included in the current load/store. If there is no other load or store that crosses the boundary of the current load/store, the current splitting implementation works as is.
The whole-alloca loads and stores meet this new condition and so they are still splittable.

Here is a simplified motivating example.

struct record {
    long long a;
    int b;
    int c;
};

int func(struct record r) {
    for (int i = 0; i < r.c; i++)
        r.b++;
    return r.b;
}

When updating r.b (or r.c as well), LLVM generates redundant instructions on some platforms (such as x86_64, ppc64); here, r.b and r.c are packed into one 64-bit GPR when the struct is passed as a method argument.

With this patch, the above example is compiled into only few instructions without loop.
Without the patch, unnecessary loop-carried dependency is introduced by SROA and the loop cannot be eliminated by the later optimizers.

The path length of the std::string's assign method with istreambuf_iterator (an example below) is reduced by about 10% on ppc64.

assign((std::istreambuf_iterator<char>(ifs)),
       (std::istreambuf_iterator<char>()));

Diff Detail

inouehrs created this revision.May 9 2017, 5:52 AM
rnk added a reviewer: rnk.May 12 2017, 3:35 PM
inouehrs updated this revision to Diff 98944.May 14 2017, 11:06 PM
inouehrs edited the summary of this revision. (Show Details)
inouehrs updated this revision to Diff 99408.May 18 2017, 1:34 AM
inouehrs retitled this revision from [WIP] SROA for method argument to [SROA] enable splitting for non-whole-alloca loads and stores.
inouehrs edited the summary of this revision. (Show Details)
inouehrs added a reviewer: efriedma.

I added a new unit test and modified affected existing unit tests.

rnk added inline comments.May 18 2017, 10:54 AM
lib/Transforms/Scalar/SROA.cpp
4074–4075

Does this have to be n^2 in the number of slices? Can't you leverage the sorting to be more efficient?

test/Transforms/SROA/basictest.ll
1707–1709

These CHECKs have no colon, they aren't firing.

jroelofs added inline comments.
test/Transforms/SROA/basictest.ll
1707–1709

Also, the NEXT must be capitalized.

inouehrs updated this revision to Diff 99527.EditedMay 18 2017, 11:34 PM
  • Implemented a linear-time algorithm (instead of N^2 for number of slices) for searching splittable slices.
  • Fixed errors in the unit test.
inouehrs marked 3 inline comments as done.May 18 2017, 11:37 PM
inouehrs added inline comments.
lib/Transforms/Scalar/SROA.cpp
4074–4075

I changed the algorithm to a linear-time one. I confirmed that this algorithm generates the same results to the previous N^2 algorithm during the bootstrap test.

test/Transforms/SROA/basictest.ll
1707–1709

Thank you so much for pointing this out. Fixed these errors.

inouehrs marked 2 inline comments as done.May 18 2017, 11:38 PM

Gentle ping.

This looks OK to me, but I don't have enough knowledge of SROA to accept this.

I'd like to see a response from Chandler, since he was the last one to touch this code.

Some more general performance numbers would also be nice (LLVM testsuite or SPEC); this could impact a lot of code.

inouehrs added a comment.EditedJun 6 2017, 9:12 AM

The performance changes in SPEC on POWER8 are not significant; within a range of fluctuations.
I am trying to make additional SPEC runs.

Average of three runs with and without this patch.
A positive number means improvement by the patch.

400.perlbench	-0.11%
401.bzip2	-0.01%
403.gcc		0.64%
429.mcf		-1.14%
445.gobmk	-0.05%
456.hmmer	0.03%
458.sjeng	-0.16%
462.libquantum	0.36%
464.h264ref	-0.04%
471.omnetpp	1.39%
473.astar	0.08%
483.xalancbmk	0.18%
433.milc	1.33%
444.namd	-0.04%
447.dealII	-0.27%
450.soplex	0.20%
453.povray	-0.25%
470.lbm		-0.29%
482.sphinx3	0.30%
GEOMEAN		0.11%

I conducted more performance measurements.
Overall, I did not find significant degradetions or improvements except for the iterotor example above.

tcmalloc (on ppc64le)
This patch makes additional splitting but no visible changes in malloc_bench score

gipfeli (on ppc64le)
This patch makes additional splitting in the benchmark harness (for istreambuf_iterator) but not in compressor/decompressor.

snappy (on ppc64le)
This patch makes additional splitting but no visible changes in snappy-unittest score

SPECCPU2006 (on x86_64)
No additional splitting observed in 12 out of 19 C/C++ benchmarks.
Even for benchmarks with additional splitting the changes in the score is within range of fluctuation.

403.gcc	0.00%
462.libquantum	0.00%
471.omnetpp	-0.42%
473.astar	0.00%
483.xalancbmk	0.20%
447.dealII	0.13%
453.povray	-0.06%
SPECINT (5)	-0.04%
SPECFP  (2)	0.04%
TOTAL   (7)	-0.02%

@chandlerc @efriedma I appreciate any suggestions on this. Thanks!

inouehrs updated this revision to Diff 104187.Jun 27 2017, 9:11 AM
inouehrs added a reviewer: sanjoy.
  • rebased to the latest tree and ran tests
  • did minor touchup in comments
inouehrs updated this revision to Diff 119463.Oct 18 2017, 5:07 AM
  • rebase to the latest tree