This is an archive of the discontinued LLVM Phabricator instance.

[SROA] enable splitting for non-whole-alloca loads and stores
ClosedPublic

Authored by inouehrs on May 9 2017, 5:52 AM.

Details

Summary

Currently, SROA splits loads and stores only when they access the whole alloca.
This patch relaxes that limitation to allow splitting a load/store when every other load and store to the alloca is either disjoint from or fully contained within the current load/store. If no other load or store crosses the boundary of the current load/store, the existing splitting implementation works as is.
Whole-alloca loads and stores satisfy this new condition, so they remain splittable.
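As an illustration of this condition (a hypothetical helper, not the code in the diff), the check for a candidate slice [CandBeg, CandEnd) against another slice [OtherBeg, OtherEnd) can be written as:

#include <cstdint>

// Hedged sketch: a candidate load/store slice [CandBeg, CandEnd) remains
// splittable only if every other slice [OtherBeg, OtherEnd) is either
// disjoint from it or fully contained within it.
static bool crossesCandidateBoundary(uint64_t CandBeg, uint64_t CandEnd,
                                     uint64_t OtherBeg, uint64_t OtherEnd) {
  bool Disjoint = OtherEnd <= CandBeg || OtherBeg >= CandEnd;
  bool Contained = OtherBeg >= CandBeg && OtherEnd <= CandEnd;
  return !Disjoint && !Contained; // true: the candidate cannot be split
}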

Here is a simplified motivating example.

struct record {
    long long a;
    int b;
    int c;
};

int func(struct record r) {
    for (int i = 0; i < r.c; i++)
        r.b++;
    return r.b;
}

When updating r.b (and likewise r.c), LLVM generates redundant instructions on some platforms (such as x86_64 and ppc64) because r.b and r.c are packed into one 64-bit GPR when the struct is passed as a function argument.

With this patch, the above example is compiled into only a few instructions, without a loop.
Without the patch, SROA introduces an unnecessary loop-carried dependency and the loop cannot be eliminated by the later optimizers.
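For reference, here is a hand-written, loop-free form that the later optimizers can reduce func to once the loop-carried dependency through the alloca is gone (an illustration of the end result, not actual compiler output):

// Loop-free equivalent of func above, using the same struct record:
// when r.c is positive, the loop runs r.c times and increments r.b each time.
int func_loop_free(struct record r) {
  return r.c > 0 ? r.b + r.c : r.b;
}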

The path length of std::string's assign method with istreambuf_iterator (example below) is reduced by about 10% on ppc64.

assign((std::istreambuf_iterator<char>(ifs)),
       (std::istreambuf_iterator<char>()));
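For context, a complete, self-contained version of that pattern (hypothetical file name and function, shown only to make the measured code path concrete) looks like this:

#include <fstream>
#include <iterator>
#include <string>

// Read an entire file stream into a std::string via istreambuf_iterator;
// this exercises the assign path whose length was measured above.
std::string readAll(const char *Path) {
  std::ifstream ifs(Path, std::ios::binary);
  std::string s;
  s.assign((std::istreambuf_iterator<char>(ifs)),
           (std::istreambuf_iterator<char>()));
  return s;
}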

Diff Detail

Repository
rL LLVM

Event Timeline

inouehrs created this revision. May 9 2017, 5:52 AM
rnk added a reviewer: rnk. May 12 2017, 3:35 PM
inouehrs updated this revision to Diff 98944. May 14 2017, 11:06 PM
inouehrs edited the summary of this revision.
inouehrs updated this revision to Diff 99408. May 18 2017, 1:34 AM
inouehrs retitled this revision from [WIP] SROA for method argument to [SROA] enable splitting for non-whole-alloca loads and stores.
inouehrs edited the summary of this revision.
inouehrs added a reviewer: efriedma.

I added a new unit test and modified affected existing unit tests.

rnk added inline comments. May 18 2017, 10:54 AM
lib/Transforms/Scalar/SROA.cpp
3978 ↗(On Diff #99408)

Does this have to be n^2 in the number of slices? Can't you leverage the sorting to be more efficient?

test/Transforms/SROA/basictest.ll
1707–1709 ↗(On Diff #99408)

These CHECKs have no colon, they aren't firing.

jroelofs added inline comments.
test/Transforms/SROA/basictest.ll
1707–1709 ↗(On Diff #99408)

Also, the NEXT must be capitalized.

inouehrs updated this revision to Diff 99527 (edited). May 18 2017, 11:34 PM
  • Implemented a linear-time algorithm (instead of O(N^2) in the number of slices) for finding splittable slices.
  • Fixed errors in the unit test.
inouehrs marked 3 inline comments as done. May 18 2017, 11:37 PM
inouehrs added inline comments.
lib/Transforms/Scalar/SROA.cpp
3978 ↗(On Diff #99408)

I changed the algorithm to a linear-time one. I confirmed during the bootstrap test that it generates the same results as the previous O(N^2) algorithm.
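(For illustration only, not the exact code in the diff: one way to leverage the existing sort of slices by begin offset is a single forward pass with a stack of end offsets, which flags any pair of slices that partially overlap, i.e., one crossing the other's boundary.)

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hedged sketch of a single-pass crossing check over [begin, end) slices.
// Sort by begin ascending and, on ties, by end descending, so an enclosing
// slice is visited before any slice nested inside it. The stack of "open"
// end offsets is then non-increasing, so only the top needs to be checked.
static bool hasBoundaryCrossing(std::vector<std::pair<uint64_t, uint64_t>> Slices) {
  std::sort(Slices.begin(), Slices.end(),
            [](const std::pair<uint64_t, uint64_t> &A,
               const std::pair<uint64_t, uint64_t> &B) {
              return A.first != B.first ? A.first < B.first : A.second > B.second;
            });
  std::vector<uint64_t> OpenEnds;
  for (const auto &S : Slices) {
    // Slices ending at or before this begin are disjoint from it; close them.
    while (!OpenEnds.empty() && OpenEnds.back() <= S.first)
      OpenEnds.pop_back();
    // An open slice that ends strictly inside the current one is crossed.
    if (!OpenEnds.empty() && OpenEnds.back() < S.second)
      return true;
    OpenEnds.push_back(S.second);
  }
  return false;
}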

test/Transforms/SROA/basictest.ll
1707–1709 ↗(On Diff #99408)

Thank you so much for pointing this out. Fixed these errors.

inouehrs marked 2 inline comments as done. May 18 2017, 11:38 PM

Gentle ping.

kbarton edited edge metadata. Jun 5 2017, 11:17 AM

This looks OK to me, but I don't have enough knowledge of SROA to accept this.

efriedma edited edge metadata. Jun 5 2017, 4:00 PM

I'd like to see a response from Chandler, since he was the last one to touch this code.

Some more general performance numbers would also be nice (LLVM testsuite or SPEC); this could impact a lot of code.

inouehrs added a comment (edited). Jun 6 2017, 9:12 AM

The performance changes in SPEC on POWER8 are not significant; they are within the range of run-to-run fluctuation.
I am working on additional SPEC runs.

Averages of three runs with and without this patch.
A positive number means an improvement from the patch.

400.perlbench	-0.11%
401.bzip2	-0.01%
403.gcc		0.64%
429.mcf		-1.14%
445.gobmk	-0.05%
456.hmmer	0.03%
458.sjeng	-0.16%
462.libquantum	0.36%
464.h264ref	-0.04%
471.omnetpp	1.39%
473.astar	0.08%
483.xalancbmk	0.18%
433.milc	1.33%
444.namd	-0.04%
447.dealII	-0.27%
450.soplex	0.20%
453.povray	-0.25%
470.lbm		-0.29%
482.sphinx3	0.30%
GEOMEAN		0.11%

I conducted more performance measurements.
Overall, I did not find significant degradations or improvements except for the iterator example above.

tcmalloc (on ppc64le)
This patch results in additional splitting but no visible change in the malloc_bench score.

gipfeli (on ppc64le)
This patch results in additional splitting in the benchmark harness (for istreambuf_iterator) but not in the compressor/decompressor.

snappy (on ppc64le)
This patch results in additional splitting but no visible change in the snappy-unittest score.

SPECCPU2006 (on x86_64)
No additional splitting was observed in 12 out of 19 C/C++ benchmarks.
Even for the benchmarks with additional splitting, the changes in score are within the range of fluctuation.

403.gcc	0.00%
462.libquantum	0.00%
471.omnetpp	-0.42%
473.astar	0.00%
483.xalancbmk	0.20%
447.dealII	0.13%
453.povray	-0.06%
SPECINT (5)	-0.04%
SPECFP  (2)	0.04%
TOTAL   (7)	-0.02%

@chandlerc @efriedma I appreciate any suggestions on this. Thanks!

inouehrs updated this revision to Diff 104187. Jun 27 2017, 9:11 AM
inouehrs added a reviewer: sanjoy.
  • Rebased to the latest tree and reran the tests.
  • Made minor touch-ups in comments.
inouehrs updated this revision to Diff 119463. Oct 18 2017, 5:07 AM
  • Rebased to the latest tree.
nemanjai edited edge metadata. Nov 29 2017, 12:14 AM

We should try to decide what we're doing about this patch.
@inouehrs Does this still have enough of an effect on performance with ToT to continue pursuing it, or should it be abandoned?
If we do want to continue with it, there seem to be no real objections (pending any from @chandlerc), so I imagine we can proceed with the patch. Perhaps add maintainers of some other targets to the review so they can try it on their targets. You've already ensured there are no performance regressions for PPC and X86; perhaps a few other targets can confirm this as well.

rnk accepted this revision. Nov 29 2017, 10:28 AM

I think I understand the new code, and it seems like an improvement. I'd land it and see what breaks.

This revision is now accepted and ready to land. Nov 29 2017, 10:28 AM
This revision was automatically updated to reflect the committed changes.