This is an archive of the discontinued LLVM Phabricator instance.

[x86] Teach the x86 backend about general fast rep+movs and rep+stos features of modern x86 CPUs, and use this feature to drastically reduce the number of places we actually emit memset and memcpy library calls.
Needs Review · Public

Authored by chandlerc on Jul 21 2017, 5:46 PM.

Details

Summary

To understand the motivation of this patch, it is important to consider
that LLVM is remarkably diligent and effective at converting user loops
into memset and memcpy intrinsics. These in fact show up frequently
inside of deeply nested loops, etc. However, when LLVM emits these as
calls to the actual memset and memcpy library functions, the cost of
issuing the call can in many cases far outstrip the cost of the
operation itself and negatively impact surrounding code. Our analysis
of benchmarks that hit this shows the cost comes from a few places:

  1. Calling these library functions requires setting up registers for the calling convention and ends up in practice forcing a surprising number of register reloads if they occur inside of loops.
  2. When using PIC, the call is much more expensive due to the PLT-call pattern, requiring at best a double indirect jump on Linux and BSD systems.
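
For concreteness, here is a minimal example (mine, not from the patch) of
the kind of user loop that LLVM's loop idiom recognition turns into a
memset intrinsic, which today typically becomes one of these library calls:

  // Hypothetical example: loop idiom recognition will typically rewrite this
  // zeroing loop as a memset intrinsic, i.e. memset(rows, 0, n * sizeof(double)),
  // which is then usually lowered to a call to the memset library function.
  void clear_rows(double *rows, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
      rows[i] = 0.0;
  }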

For older x86 processors this was unavoidable. But modern processors
provide very fast instruction pattern support for implementing these
library functions in many (if not quite all) cases. Starting with
Ivybridge, there seems to be no point in using the library functions
with well aligned buffers (alignment of 16-bytes or better), and even
starting with Nehalem, they seem superior to PLT library function calls.

It is also possible to carefully fold size scaling into these sequences
which helps avoid generating extra scaling code when we are in fact
emitting code for user loops that were written at 4-byte or 8-byte
granularity.
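
As a rough illustration of the scaling point (again my example, not from
the patch), a loop written at 8-byte granularity produces a copy whose size
is the element count scaled by 8; folding that scaling means lowering to
rep+movsq with the element count directly rather than materializing the
byte count for rep+movsb:

  #include <cstdint>

  // Hypothetical example: the memcpy formed from this loop has size n * 8.
  // Folding the scaling means using rep+movsq with count n instead of first
  // computing n * 8 and emitting rep+movsb.
  void copy_words(std::uint64_t *dst, const std::uint64_t *src, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
      dst[i] = src[i];
  }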

Naturally, this is a pretty significant change. I'm still running
benchmarks on various architectures to confirm that this direction makes
sense, but more insight from Intel and other x86 hardware experts would
be really welcome here to make sure we're picking reasonable tradeoffs.
I'm starting here with a very aggressive version of the patch so I can
find where it *does* regress, and we can back off until it looks
reasonable.

Given that Sandybridge is now over 4 years old and Ivybridge or newer
processors make up a growing share of the installed base, I think it may
be reasonable to be somewhat aggressive in this lowering even if the
performance on older processors isn't ideal.

One interesting question is whether rep+movs{w,d,q} and rep+stos{w,d,q}
are as well tuned as rep+movsb and rep+stosb, so that the descaled
versions are actually reasonable. Craig has indicated they may not be,
and I'm hoping to confirm one way or the other when benchmarking.
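
One way to get data on that question is a small microbenchmark that times
the byte and quadword forms against each other. A minimal sketch using GNU
inline assembly (x86-64 only; my own code, not part of the patch):

  #include <cstddef>

  // Copy `bytes` bytes with rep movsb.
  static void copy_rep_movsb(void *dst, const void *src, std::size_t bytes) {
    __asm__ __volatile__("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(bytes)
                         :
                         : "memory");
  }

  // Copy `bytes / 8` quadwords with rep movsq (caller guarantees an 8-byte
  // multiple), so the descaled form can be timed against the byte form.
  static void copy_rep_movsq(void *dst, const void *src, std::size_t bytes) {
    std::size_t qwords = bytes / 8;
    __asm__ __volatile__("rep movsq"
                         : "+D"(dst), "+S"(src), "+c"(qwords)
                         :
                         : "memory");
  }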

The test case added also exposes some annoying problems with codegen of
these instructions that should also be addressed.

Last but not least, in many cases the pattern to match the scaling here
will not fire because LLVM currently has a bad bug that causes it to
scale sizes using a more complex pattern of math far more often than
necessary. I'm going to work to address that in a separate patch as it
appears to be a middle-end issue.

Event Timeline

chandlerc created this revision. Jul 21 2017, 5:46 PM
echristo added inline comments. Jul 21 2017, 5:58 PM
lib/Target/X86/X86.td
287

"too high of a cost"

lib/Target/X86/X86SelectionDAGInfo.cpp
73

Comment please on what this is for :)

115

Ditto.

164–171

It seems like this all wants to be subsumed into a couple of different checks?

i.e. I don't think that with the rep str ops the max inline size threshold is as important anymore. I worry that we don't actually care about whether or not we're using PIC here and should probably just use the inline expansion either way.

319–327

Ditto.

davidxl edited edge metadata. Jul 22 2017, 12:04 AM

Do you have more benchmark numbers? For reference, here is what GCC does (for Sandybridge and above) for memcpy when size profile data is available:

  1. when the size is <= 24, use an 8-byte copy loop or straight-line code.
  2. when the size is between 24 and 128, use rep movsq
  3. when the size is above that, use a libcall
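
In rough C++ terms that strategy looks like the sketch below (my
restatement of the thresholds quoted above; small_copy and rep_movsq_copy
are hypothetical placeholders for the straight-line and rep movsq
expansions):

  #include <cstddef>
  #include <cstring>

  void *small_copy(void *dst, const void *src, std::size_t n);      // hypothetical
  void *rep_movsq_copy(void *dst, const void *src, std::size_t n);  // hypothetical

  void *memcpy_dispatch(void *dst, const void *src, std::size_t n) {
    if (n <= 24)
      return small_copy(dst, src, n);      // straight-line or 8-byte copy loop
    if (n <= 128)
      return rep_movsq_copy(dst, src, n);  // inline rep movsq expansion
    return std::memcpy(dst, src, n);       // large sizes: library call
  }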

It is an interesting idea to consider PLT overhead here, but is there a better way to model the cost?

I worry that without profile data, blindly using rep movsb may be problematic. Teresa has a pending patch to make use of value profile information. Without profile data, if size matters, perhaps we can guard the expansion sequence with size checks.

Also if the root cause

craig.topper edited edge metadata. Jul 22 2017, 3:17 PM

I believe that, prior to Ivy Bridge, the Intel optimization manual indicates rep movsb/stosb was only optimized to handle 1-3 bytes. Specifically, it was meant to handle the remainder portion in conjunction with rep+movsd handling the rest.
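
For context, that split pattern looks roughly like this (a hypothetical GNU
inline assembly sketch, not taken from the manual or from the patch):
rep movsd copies the bulk in 4-byte chunks, and rep movsb handles only the
1-3 byte tail it was actually optimized for.

  #include <cstddef>

  static void copy_dwords_then_bytes(void *dst, const void *src, std::size_t n) {
    std::size_t dwords = n / 4;
    std::size_t tail = n % 4;
    __asm__ __volatile__("rep movsl"   // AT&T mnemonic for rep movsd
                         : "+D"(dst), "+S"(src), "+c"(dwords)
                         :
                         : "memory");
    __asm__ __volatile__("rep movsb"   // 0-3 byte remainder
                         : "+D"(dst), "+S"(src), "+c"(tail)
                         :
                         : "memory");
  }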

lib/Target/X86/X86.td
287

Should this mention that the byte version is still terrible here?

301

Was that supposed to be W/D/Q?

547

Have you collected the data to verify this? Should we go ahead and commit this independent of this patch?