This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Fix PR32384: bump up the number of stores per memset and memcpy
ClosedPublic

Authored by evandro on Mar 30 2018, 9:19 AM.

Details

Summary

As @eli.friedman suggested in https://bugs.llvm.org/show_bug.cgi?id=32384#c1, this makes the inlining of memset and memcpy more aggressive when compiling for speed. The tuning remains the same when optimizing for size.
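As a concrete illustration of what is being tuned (a hypothetical example, not taken from the patch): raising MaxStoresPerMemset lets a call like the one below be expanded into a short run of inline stores instead of a call to the library memset.

```c
#include <string.h>

/* A fixed-size memset of 64 bytes: with a high enough per-target store
   limit, the backend can expand this inline as a handful of wide stores
   rather than emitting a libcall. */
void clear64(char *buf) {
    memset(buf, 0, 64);
}
```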

The following experiment on an A-72 shows several benchmarks benefiting from this change when compiling SPEC CPU2000 at -O3, with a low code-size overhead.

A positive score is an improvement; a positive text delta is an increase in code size:

Benchmark              Score    Text
spec2000/164.gzip      0.01%   -0.01%
spec2000/175.vpr      -0.46%    0.01%
spec2000/176.gcc      -0.28%    0.01%
spec2000/177.mesa      0.75%    0.08%
spec2000/179.art       0.39%    0.00%
spec2000/181.mcf       0.26%    0.00%
spec2000/183.equake   -0.34%   -0.01%
spec2000/186.crafty    0.09%    0.06%
spec2000/188.ammp      2.50%    0.01%
spec2000/197.parser    0.21%    0.00%
spec2000/252.eon       0.50%    1.62%
spec2000/253.perlbmk   1.67%    0.00%
spec2000/254.gap      -0.40%    0.01%
spec2000/255.vortex   -0.24%    0.00%
spec2000/256.bzip2     0.01%    0.00%
spec2000/300.twolf     0.59%    0.00%

Diff Detail

Event Timeline

sebpop created this revision. Mar 30 2018, 9:19 AM

Increasing this makes sense.

Should we check for hasNEON() here? The generic code doesn't know AArch64 has ldp/stp, so we might want to be a little more aggressive to compensate.

Should we check for hasNEON() here? The generic code doesn't know AArch64 has ldp/stp, so we might want to be a little more aggressive to compensate.

Do you mean something like this? or something else?

if (Subtarget->hasNEON()) {
  MaxStoresPerMemset = 32;
  MaxStoresPerMemsetOptSize = 8;
  MaxStoresPerMemcpy = 16;
  MaxStoresPerMemcpyOptSize = 4;
  MaxStoresPerMemmove = 16;
  MaxStoresPerMemmoveOptSize = 4;
} else {
  MaxStoresPerMemset = MaxStoresPerMemsetOptSize = 8;
  MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = 4;
  MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = 4;
}

Should we check for hasNEON() here? The generic code doesn't know AArch64 has ldp/stp, so we might want to be a little more aggressive to compensate.

Makes sense. Though LDP and STP are available for the general-purpose registers even when hasNEON() is false, the issue is whether AArch64LoadStoreOpt can form the pairs if they are too far apart.

Typically, depending on the target, blocks of loads and stores should have the loads grouped together, followed by the stores.

Wait, nevermind, it shouldn't matter whether we have NEON; we probably want to inline roughly the same number of instructions either way, and integer and vector registers have roughly equivalent ldp/stp instructions.

Looking at the generated code a bit, it looks like we do a really terrible job lowering memcpy; we don't form ldp/stp at all, ever. We should probably fix that before we mess with the threshold here; it could substantially change the codesize/performance impact of this change.

Looking at the generated code a bit, it looks like we do a really terrible job lowering memcpy; we don't form ldp/stp at all, ever.

Yes, the inline code for memcpy does not look great: I was seeing a mix of ldr and str.

We should probably fix that before we mess with the threshold here; it could substantially change the codesize/performance impact of this change.

Agreed, let's measure the perf of this patch again after we improve the codegen for memcpy.

sebpop added a subscriber: SirishP. Apr 3 2018, 12:53 PM

I just helped the compiler with restrict, and I see pretty good code generated for this example:

#include <string.h>

void fun(char * restrict in, char * restrict out) {
  memcpy(out, in, 100);
}

llvm produces:

	ldp	q0, q1, [x0, #64]
	stp	q0, q1, [x1, #64]
	ldp	q0, q1, [x0, #32]
	stp	q0, q1, [x1, #32]
	ldp	q0, q1, [x0]
	ldr	w8, [x0, #96]
	str	w8, [x1, #96]
	stp	q0, q1, [x1]
	ret

And here is the testcase I was looking at before producing the mix of ldr/str:

#include <string.h>

void fun(char *in, char *out) {
  memcpy(out, in, 100);
}

The MI scheduler is unable to move the ldr instructions past the str instructions:

	ldr	w8, [x0, #96]
	str	w8, [x1, #96]
	ldr	q0, [x0, #80]
	str	q0, [x1, #80]
	ldr	q0, [x0, #64]
	str	q0, [x1, #64]
	ldr	q0, [x0, #48]
	str	q0, [x1, #48]
	ldr	q0, [x0, #32]
	str	q0, [x1, #32]
	ldr	q0, [x0, #16]
	str	q0, [x1, #16]
	ldr	q0, [x0]
	str	q0, [x1]
	ret
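The scheduler's conservatism is correct in general: without restrict, in and out may alias, and then hoisting a load above an earlier store would change the result. A small hypothetical illustration:

```c
/* in and out are overlapping views of one buffer; without restrict the
   compiler must assume exactly this situation. */
static char store_then_load(char *buf) {
    char *in = buf, *out = buf + 1;
    out[0] = 'X';   /* store to buf[1] */
    /* In program order this load reads the 'X' just stored; if it were
       hoisted above the store, it would read buf[1]'s old value. */
    return in[1];
}
```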

For this to work, the code generator expanding memcpy in getMemcpyLoadsAndStores()
needs to be amended to produce more than one ldr/str at a time.
The target should be able to specify the number of consecutive loads and stores to be produced.
For generic AArch64 that should be 2, so that an ldp; stp sequence can be formed.
For Exynos processors it should be a much higher number, such as 8, as it is better to have all the loads and all the stores scheduled together.

Sirish is working on a patch for that.
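The grouping idea can be sketched abstractly (a hypothetical model, not the actual getMemcpyLoadsAndStores() code): emit the expansion in groups of N chunks, loads first, then the matching stores, so that a later pass can pair adjacent loads into ldp and adjacent stores into stp.

```c
typedef struct { const char *op; int off; } Inst;

/* Schedule an inline memcpy expansion of `size` bytes using `chunk`-byte
   accesses, emitting `group` loads followed by the matching `group`
   stores at a time.  Returns the number of instructions written to
   `out`.  Any tail smaller than `chunk` is left to other logic. */
static int expand_memcpy(int size, int chunk, int group, Inst *out) {
    int n = 0;
    for (int base = 0; base + chunk <= size; base += chunk * group) {
        int g = 0;
        while (g < group && base + (g + 1) * chunk <= size)
            g++;                              /* chunks in this group */
        for (int i = 0; i < g; i++)
            out[n++] = (Inst){"ldr", base + i * chunk};
        for (int i = 0; i < g; i++)
            out[n++] = (Inst){"str", base + i * chunk};
    }
    return n;
}
```

With group = 2, a 64-byte copy yields ldr 0, ldr 16, str 0, str 16, ldr 32, ldr 48, str 32, str 48, which a pairing pass can merge into two ldp/stp pairs; a large group such as 8 keeps all the loads together, as suggested above for Exynos.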

Looking at the generated code a bit, it looks like we do a really terrible job lowering memcpy; we don't form ldp/stp at all, ever. We should probably fix that before we mess with the threshold here; it could substantially change the codesize/performance impact of this change.

Eli, could you please look at the patch that Sirish has posted https://reviews.llvm.org/D46477 to make the memcpy lowering produce ldp/stp?
Are there other things to be fixed to land this change?

It looks like D46477 has review comments which are waiting to be addressed? But the approach of changing the chains seems reasonable.

Increasing the maximum number of inlined instructions for memcpy and memset from 4 to 16 seems right; it's probably a good performance/size tradeoff. Not sure about memmove; you also have to worry about register pressure for that.

sebpop updated this revision to Diff 148048. May 22 2018, 10:33 AM
sebpop retitled this revision from [AArch64] fix PR32384: bump the number of stores per memset/memcpy/memmov to [AArch64] fix PR32384: bump the number of stores per memset/memcpy.
sebpop edited the summary of this revision. (Show Details)

Following Eli's recommendation, the patch does not modify memmove.
I will post the updated numbers on top of the improved code generation
for memcpy: https://reviews.llvm.org/rL332482

The experiment is the best CPU2000 score out of 3 runs on the A-72 of a Firefly device.
A better score is positive.

Benchmark               Score
spec2000-164.gzip       0.03%
spec2000-175.vpr       -0.66%
spec2000-177.mesa       0.28%
spec2000-179.art       -0.96%
spec2000-181.mcf       -0.98%
spec2000-183.equake    -0.86%
spec2000-186.crafty     0.06%
spec2000-188.ammp      -0.62%
spec2000-197.parser    -0.61%
spec2000-252.eon       -0.05%
spec2000-253.perlbmk    0.47%
spec2000-254.gap       -0.68%
spec2000-255.vortex     0.04%
spec2000-256.bzip2     -0.75%
spec2000-300.twolf      2.30%

Those numbers look very different from the ones before. Is r332482 making this less profitable somehow? Or is the change all noise?

Those numbers look very different from the ones before. Is r332482 making this less profitable somehow? Or is the change all noise?

I'll have to take a closer look at these CPU2000 results, but in proprietary benchmarks this change is still beneficial.

evandro set the repository for this revision to rL LLVM. May 24 2018, 2:16 PM
evandro commandeered this revision. May 24 2018, 2:22 PM
evandro edited reviewers, added: sebpop; removed: evandro.

Since @sebpop has just left for a deserved vacation, he asked me to babysit his pending patches.

evandro updated this revision to Diff 148487. May 24 2018, 2:25 PM

Update test case to keep it from failing due to this change.

Those numbers look very different from the ones before. Is r332482 making this less profitable somehow? Or is the change all noise?

I'll have to take a closer look at these CPU2000 results, but in proprietary benchmarks this change is still beneficial.

It's come down to noise. CPU2000 just lacks enough C++ code, where this change has shown itself to be more beneficial. Again, in proprietary benchmarks this change has yielded an aggregate improvement of around 1% overall and just shy of 5% in a few cases.

efriedma added inline comments. May 25 2018, 10:31 AM
llvm/test/CodeGen/AArch64/arm64-memset-to-bzero.ll
4

This change doesn't make any sense to me; what are you trying to do?

evandro added inline comments. May 25 2018, 11:26 AM
llvm/test/CodeGen/AArch64/arm64-memset-to-bzero.ll
4

This test doesn't expect any string function to be inlined, so I added -mattr=+strict-align to prevent that.

efriedma added inline comments. May 25 2018, 11:44 AM
llvm/test/CodeGen/AArch64/arm64-memset-to-bzero.ll
4

Those two aren't related...? I mean, yes, it has the right effect, but that's just a coincidence. Please mark the functions optsize instead.

evandro added inline comments. May 25 2018, 11:58 AM
llvm/test/CodeGen/AArch64/arm64-memset-to-bzero.ll
4

<facepalm>Of course!</facepalm>

evandro updated this revision to Diff 148670. May 25 2018, 2:22 PM
evandro retitled this revision from [AArch64] fix PR32384: bump the number of stores per memset/memcpy to [AArch64] Fix PR32384: bump up the number of stores per memset and memcpy.
evandro edited the summary of this revision. (Show Details)
evandro set the repository for this revision to rL LLVM.

Include the solution found in D47349.

efriedma added inline comments. May 25 2018, 2:30 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
586

Useless comment.

llvm/lib/Target/AArch64/AArch64ISelLowering.h
243 (On Diff #148670)

No-op change?

evandro updated this revision to Diff 148673. May 25 2018, 2:35 PM
evandro marked 2 inline comments as done.
This revision is now accepted and ready to land. May 25 2018, 2:37 PM
This revision was automatically updated to reflect the committed changes.