This is an archive of the discontinued LLVM Phabricator instance.

[libc] Adding memcpy implementation for x86_64
ClosedPublic

Authored by gchatelet on Feb 11 2020, 4:51 AM.

Details

Summary

It is advised to read the post motivating the creation of __builtin_memcpy_inline first.

The patch focuses on static library but allows creation of several implementations depending on cpu features. The default implementation will be optimized for the host capabilities.
Currently the use of rep movsb is disabled but we plan to unable it via CMake options.

This implementation is mainly tested on clang but should compile with GCC as well. For now it doesn't build on MSVC.

Event Timeline

gchatelet created this revision.Feb 11 2020, 4:51 AM

I've gone ahead and copy and pasted some of this into godbolt to save others some time if they wish to play around https://godbolt.org/z/z4dCmj :)

How do we build? We may want to test in debug but build the libc with -march=native for instance,

This is logical. To my knowledge we don't currently do anything special when CMAKE_BUILD_TYPE=Release but this makes sense to turn on for release and benchmarking.

With gcc we can use __builtin_memcpy but then we'd need a postprocess step to check that the final assembly do not contain call to memcpy (unlikely but allowed),

I think we can do this in cmake with add_custom_command and POST_BUILD specified then just nm $<TARGET_FILE:target> | grep "U memcpy"

libc/src/string/memcpy_x86_64.cpp
12 ↗(On Diff #243817)

It might be worth it as a sanity check because this is an x86 specific file to static_assert LLVM_LIBC_CACHELINE_SIZE == 64

21–30 ↗(On Diff #243817)

Off of intuition I would imagine these are very rare sizes to call memcpy with. Would it be better to do something like if (count < 5) goto smallCount; and move these down?
It's worth linking while we are here @ckennelly's thread from a few weeks ago. "For memcpy, 96% of sizes are <= 128 bytes. 99% are <= 1024 bytes."

44 ↗(On Diff #243817)

I'm not sure I'm understanding this. It's not like we can change this at compile time, line 47 is dead code.

46 ↗(On Diff #243817)

Why was 32 chosen?

libc/test/src/string/memory_utils/memcpy_utils_test.cpp
15

D74091 Was also wanting to use assert/abort we should add them. Also that comment is funny to me because its the only thing in assert.h :)

18

Not used as far as I can tell

How do we customize the implementation? (i.e. how to define kRepMovsBSize),

What kind of customizations?

  • How do we specify custom compilation flags? (We'd need -fno-builtin-memcpy to be passed in),

Do we need to pass this flag for building llvm-libc, or to user code (and llvm-libc tests?) Reading the code, it seems to me that it is the latter, correct?

How do we build? We may want to test in debug but build the libc with -march=native for instance,

Not sure I understand this fully. What are the use cases/goals of what you are describing here?

Clang has a brand new builtin __builtin_memcpy_inline which makes the implementation easy and efficient, but:

  • If we compile with gcc or msvc we can't use it, resorting on less efficient code generation,

Less efficient wrt clang compiled code? Can we rephrase what you are saying as, "we make use of a better optimization when compiled with clang?"

  • With gcc we can use __builtin_memcpy but then we'd need a postprocess step to check that the final assembly do not contain call to memcpy (unlikely but allowed),

The concern is that it will become a recursive call? What action is to be taken if we do find a call to memcpy in the final assembly?

  • For msvc we'd need to resort on the compiler optimization passes.

Does "we" mean llvm-libc developers? A related question: considering we are using __builtin_memcpy and inline assembly, does the code work as is with MSVC?

libc/src/string/memcpy_x86_64.cpp
15 ↗(On Diff #243817)

Is this the only reason this file name has the _86_64 in it? If yes, is this function the only machine specific piece for other archs as well? If yes, we should put it in a header file and use the same scheme we use for LLVM_LIBC_CACHELINE_SIZE.

44 ↗(On Diff #243817)

I guess this relates to the customization @gchatelet mentioned about in the patch description.

libc/test/src/string/memory_utils/memcpy_utils_test.cpp
10

Where is monitor_memcpy defined?

15

Point taken. I will prepare something to address this.

16

Use stdint.h instead of cstdint.

MaskRay added inline comments.Feb 11 2020, 10:34 PM
libc/src/string/memcpy_x86_64.cpp
16 ↗(On Diff #243817)

LIBC_INLINE_ASM does not improve readability.

asm volatile

Any reason using constraint modifier +? It adds two operands, one input and one output. Just use "D"(dst), "S"(src), "c"(count)

abrachet added inline comments.Feb 11 2020, 10:43 PM
libc/src/string/memcpy_x86_64.cpp
16 ↗(On Diff #243817)

+1 I think the reasoning behind the macro is for MSVC, which as far as I can tell doesn't support GCC's syntax anyway https://docs.microsoft.com/en-us/cpp/assembler/inline/asm?view=vs-2019

I've gone ahead and copy and pasted some of this into godbolt to save others some time if they wish to play around https://godbolt.org/z/z4dCmj :)

Thx : )
One should add -fno-builtin-memcpy
Also it's worth playing with -mno-avx or -mavx512f to see the difference in codegen.

How do we build? We may want to test in debug but build the libc with -march=native for instance,

This is logical. To my knowledge we don't currently do anything special when CMAKE_BUILD_TYPE=Release but this makes sense to turn on for release and benchmarking.

Yes I think we want to get the most out of the architecture we target (see the difference in codegen depending on available features avx / avx512f)

With gcc we can use __builtin_memcpy but then we'd need a postprocess step to check that the final assembly do not contain call to memcpy (unlikely but allowed),

I think we can do this in cmake with add_custom_command and POST_BUILD specified then just nm $<TARGET_FILE:target> | grep "U memcpy"

Thx that's useful

How do we customize the implementation? (i.e. how to define kRepMovsBSize),

What kind of customizations?

The rep movsb instruction performance is highly tied to the targeted microarchitecture.
For one there is the ERMS cpuid flag that helps to know if it should be used at all.
Then depending on the microarchitecture the crosspoint between aligned copy and rep movsb varies between 512 to a few kilobytes.
Ideally we need to adapt this threshold somehow to provide the best implementation.

Eventually, when rep movsb becomes excellent we can replace the function entirely with this single instruction.

I understand that llvm-libc is to be a pick and choose what you need libc which implies some sort of customization anyways. Am I right?

  • How do we specify custom compilation flags? (We'd need -fno-builtin-memcpy to be passed in),

Do we need to pass this flag for building llvm-libc, or to user code (and llvm-libc tests?) Reading the code, it seems to me that it is the latter, correct?

It is solely for this specific compilation unit
When using clang we can use [[ https://clang.llvm.org/docs/AttributeReference.html#no-builtin | __attribute__((no_builtin("memcpy"))) ]] but it won't work with gcc.

How do we build? We may want to test in debug but build the libc with -march=native for instance,

Not sure I understand this fully. What are the use cases/goals of what you are describing here?

See above, the generated code is improved (smaller and faster) when targeting specific architecture.
glibc is using IFUNC to that matter and lets the runtime pick the best implementation.
Although feasible in llvm-libc we noticed that the required extra level of indirection is hurting branch prediction and latency for small sizes (which are the most frequent)

Clang has a brand new builtin __builtin_memcpy_inline which makes the implementation easy and efficient, but:

  • If we compile with gcc or msvc we can't use it, resorting on less efficient code generation,

Less efficient wrt clang compiled code? Can we rephrase what you are saying as, "we make use of a better optimization when compiled with clang?"

Ok

  • With gcc we can use __builtin_memcpy but then we'd need a postprocess step to check that the final assembly do not contain call to memcpy (unlikely but allowed),

The concern is that it will become a recursive call?

Yes this is technically possible but I've never seen it in practice,
It is highly unlikely that a compiler would generate a call to memcpy for sizes <=64B.

What action is to be taken if we do find a call to memcpy in the final assembly?

I believe we should just refuse to compile on such a compiler if it happens.

  • For msvc we'd need to resort on the compiler optimization passes.

Does "we" mean llvm-libc developers?

Yes, sorry for not being clear here.

A related question: considering we are using __builtin_memcpy and inline assembly, does the code work as is with MSVC?

No it doesn't, I can add a fallback case if you want but the generated code is not good right now.
This means we'd need to specialize the CopyRepMovsb function so it uses the correct syntax for msvc __asm instead of the provided LIBC_INLINE_ASM

The patch is to get the conversation started and is not a full just implementation yet (although it is functional).

Quick question @sivachandra , how do we currently build llvm-libc?
[[ https://github.com/llvm/llvm-project/blob/master/libc/docs/source_layout.rst#the-utilsbuild_scripts-directory | utils/build_scripts ]] seems to be missing so I could only build the entrypoints via transitive dependency through tests.

gchatelet edited the summary of this revision. (Show Details)Feb 12 2020, 2:41 AM
gchatelet marked 13 inline comments as done.Feb 12 2020, 4:57 AM
gchatelet added inline comments.
libc/src/string/memcpy_x86_64.cpp
12 ↗(On Diff #243817)

Agreed but I'm deferring the addition of the static assert to the point where we know which layout to use for the source code. I'm keeping your comment as Not Done so I don't forget to add it later.

15 ↗(On Diff #243817)

Yes for now. I'd yet to see how this implementation performs on other architectures (ARM, PowerPC, ...). But yes indeed this implementation will at least be common to x86 and x86_64. I'll adapt the code consequently,

16 ↗(On Diff #243817)

@MaskRay

LIBC_INLINE_ASM was available so I used it,
I thought this would be used to mask out the differences between the different compilers but I'm not so sure. The operands/constraints modifiers might be sufficiently different that it is not possible to write it in a compiler agnostic way.

For instance asm is not to be used with msv (from the link @abrachet sent).

As for the + modifier they are not needed in this context indeed.

21–30 ↗(On Diff #243817)

Off of intuition I would imagine these are very rare sizes to call memcpy with

They are not rare actually.
The rationale here is to have a branching cost that is proportional to the copy cost.

Now since the code is C++ the compiler sees it and understands the semantic.
For people using Profile Guided Optimization techniques the compiler will be able to reorder the branches according to branching probabilities.

Thx for linking in the thread.

44 ↗(On Diff #243817)

Yes as I stated in the introduction of the patch this is for early feedback. Ultimately kRepMovsBSize needs to be defined by the environment.

46 ↗(On Diff #243817)

It gave good results : )
CopyAligned<N> is prone to memory reloads so there's a balance between copying big chunks and reloading data from memory.

To be honest I would like a few pieces of the code to be customizable (this is one) so we could be benchmarking the cross-product of the implementations and pick the best performing ones per uarch.

The overall design should be better documented for sure, I'll work on it.

Code size is especially important for icache pressure and as you saw the generated code is quite small.

libc/test/src/string/memory_utils/memcpy_utils_test.cpp
10

Further down after the GetTrace function.

15

I think that would be nice to have the functions in memcpy_utils.h have some precondition checking, an abort function would be useful indeed,

#if not defined(NDEBUG)
// check precondition
#endif
gchatelet updated this revision to Diff 244144.Feb 12 2020, 5:24 AM
gchatelet marked 2 inline comments as done.
  • Address comments
gchatelet marked an inline comment as done.Feb 12 2020, 5:27 AM
gchatelet added inline comments.
libc/src/string/memcpy_arch_specific.h.def
18

@sivachandra I'm not super happy with this pattern that dispatches the logic in many files. Would it be possible to generate the cpp file directly instead of generating an intermediate header file?

At least the common implementation would be in the template.

gchatelet updated this revision to Diff 244162.Feb 12 2020, 6:32 AM
  • Improved documentation / fallback code for msvc
gchatelet updated this revision to Diff 244170.Feb 12 2020, 7:01 AM
gchatelet marked an inline comment as done.
  • Revert rep movsb
libc/src/string/memcpy_x86_64.cpp
16 ↗(On Diff #243817)

Actually I believe we still need to use +
See clang implementation.

This is to inform the compiler that executing rep mov will change the underlying registers, it can't assume that ECX, ESI, EDI won't change.

The rep movsb instruction performance is highly tied to the targeted microarchitecture.
For one there is the ERMS cpuid flag https://en.wikipedia.org/wiki/CPUID#EAX=7,_ECX=0:_Extended_Features that helps to know if it should be used at all.
Then depending on the microarchitecture the crosspoint between aligned copy and rep movsb varies between 512 to a few kilobytes.
Ideally we need to adapt this threshold somehow to provide the best implementation.

Eventually, when rep movsb becomes excellent we can replace the function entirely with this single instruction.

understand that llvm-libc is to be a pick and choose what you need libc which implies some sort of customization anyways. Am I right?

Would configure time sniffing to detect what is available work? It will not work in case of cross-compilation. But, requiring explicit setting of build params for cross-compilation seems OK to me.

A related question: considering we are using __builtin_memcpy and inline assembly, does the code work as is with MSVC?

No it doesn't, I can add a fallback case if you want but the generated code is not good right now.
This means we'd need to specialize the CopyRepMovsb function so it uses the correct syntax for msvc __asm instead of the provided LIBC_INLINE_ASM

We don't need to provide the specialization in this patch, but pointing out which parts needs to be specialized would help.

Quick question @sivachandra , how do we currently build llvm-libc?
utils/build_scripts https://github.com/llvm/llvm-project/blob/master/libc/docs/source_layout.rst#the-utilsbuild_scripts-directory seems to be missing so  I could only build the entrypoints via transitive dependency through tests.

Sorry, that file is out of date. Will fix it soon.
Add the target of your entrypoint to the list of DEPENDS for the llvmlibc target here: https://github.com/llvm/llvm-project/blob/master/libc/lib/CMakeLists.txt
Running ninja llvmlibc after that will produce libllvmlibc.a which includes your entrypoint.

libc/test/src/string/memory_utils/memcpy_utils_test.cpp
10

Ah sorry, I missed it. So, does it mean that you want that name to be overridable/customizable? If yes, then would it make sense to define it to monitor_memcpy only if not defined?

#ifndef LLVM_LIBC_MEMCPY_MONITOR
#define LLVM_LIBC_MEMCPY_MONITOR monitor_memcpy
#endif
15

Okay. There are two kinds of abort here. One to use in the implementation, another to use in the test. For the abort used in the test, we should use the one coming from the system libc. For the one going into the implementation, we should use the one from llvm-libc. The abort function is fairly involved and needs other pieces to be in place for us to build llvm-libc's implementation. Let me do my homework get back to you on that.

See above, the generated code is improved (smaller and faster) when targeting specific architecture.
glibc is using IFUNC to that matter and lets the runtime pick the best implementation.
Although feasible in llvm-libc we noticed that the required extra level of indirection is hurting branch prediction and latency for small sizes (which are the most frequent)

The resolver function is only called the first time that ld.so binds the symbol into the plt so I don't think it would hurt branch prediction any more than any other dynamic symbol. Notably a big goal of llvm libc stated from Siva early on was static linking, so I am sure we will end up having a build time way of determining which specialization to call, as well as runtime with ifunc.

How do we specify custom compilation flags? (We'd need -fno-builtin-memcpy to be passed in),

Were you asking how to express this in CMake?

set_property(
  SOURCE memcpy.cpp
  PROPERTY COMPILE_OPTIONS
  -fno-builtin-memcpy
)

This worked in my (quick) testing. ninja -v will show the commands it's running so you can see if it worked here too.

libc/src/string/memcpy_arch_specific.h.def
22

remove #define

libc/src/string/memory_utils/memcpy_utils.h
23

Presumably should be just __builtin_memcpy

libc/test/src/string/memory_utils/memcpy_utils_test.cpp
15

The abort function is fairly involved and needs other pieces to be in place for us to build llvm-libc's implementation.

I was actually working on writing to the list about getting started on signals, which is how abort should be implemented I think, for(;;) raise(SIGABRT);. I'd be happy to get started on signals :)

I think for now though, it is fine to have abort be __builtin_trap() because we are mainly focused on x86 at the moment. (I remember on ARM32 __builtin_trap() calls abort for example)

sivachandra added inline comments.Feb 12 2020, 2:46 PM
libc/test/src/string/memory_utils/memcpy_utils_test.cpp
15

It would be awesome if you can start work on signals.

gchatelet edited the summary of this revision. (Show Details)Feb 13 2020, 1:12 AM

The rep movsb instruction performance is highly tied to the targeted microarchitecture.
For one there is the ERMS cpuid flag https://en.wikipedia.org/wiki/CPUID#EAX=7,_ECX=0:_Extended_Features that helps to know if it should be used at all.
Then depending on the microarchitecture the crosspoint between aligned copy and rep movsb varies between 512 to a few kilobytes.
Ideally we need to adapt this threshold somehow to provide the best implementation.

Eventually, when rep movsb becomes excellent we can replace the function entirely with this single instruction.

understand that llvm-libc is to be a pick and choose what you need libc which implies some sort of customization anyways. Am I right?

Would configure time sniffing to detect what is available work? It will not work in case of cross-compilation. But, requiring explicit setting of build params for cross-compilation seems OK to me.

Yes that would work, we can map from micro architecture to parameters and be conservative when we don't know.
The possibility to force the configuration manually would be required for cross compilation as you noticed.

A related question: considering we are using __builtin_memcpy and inline assembly, does the code work as is with MSVC?

No it doesn't, I can add a fallback case if you want but the generated code is not good right now.
This means we'd need to specialize the CopyRepMovsb function so it uses the correct syntax for msvc __asm instead of the provided LIBC_INLINE_ASM

We don't need to provide the specialization in this patch, but pointing out which parts needs to be specialized would help.

I'll document the code then.

Quick question @sivachandra , how do we currently build llvm-libc?
utils/build_scripts https://github.com/llvm/llvm-project/blob/master/libc/docs/source_layout.rst#the-utilsbuild_scripts-directory seems to be missing so  I could only build the entrypoints via transitive dependency through tests.

Sorry, that file is out of date. Will fix it soon.
Add the target of your entrypoint to the list of DEPENDS for the llvmlibc target here: https://github.com/llvm/llvm-project/blob/master/libc/lib/CMakeLists.txt
Running ninja llvmlibc after that will produce libllvmlibc.a which includes your entrypoint.

Perfect! Thx!

See above, the generated code is improved (smaller and faster) when targeting specific architecture.
glibc is using IFUNC to that matter and lets the runtime pick the best implementation.
Although feasible in llvm-libc we noticed that the required extra level of indirection is hurting branch prediction and latency for small sizes (which are the most frequent)

The resolver function is only called the first time that ld.so binds the symbol into the plt

Yes

so I don't think it would hurt branch prediction any more than any other dynamic symbol.

Not more than dynamic symbol but still noticeable for small sizes.
That also mean one branch prediction entry is kept for each runtime dispatched function (at least memcpy, memset, memmove, memcmp). These functions are called every now and prevent other code from using them.

Notably a big goal of llvm libc stated from Siva early on was static linking, so I am sure we will end up having a build time way of determining which specialization to call, as well as runtime with ifunc.

Yes, I agree that static linking is the primary goal, we can always provide an ifunc selected function on top of individual implementations.

How do we specify custom compilation flags? (We'd need -fno-builtin-memcpy to be passed in),

Were you asking how to express this in CMake?

set_property(
  SOURCE memcpy.cpp
  PROPERTY COMPILE_OPTIONS
  -fno-builtin-memcpy
)

This worked in my (quick) testing. ninja -v will show the commands it's running so you can see if it worked here too.

Thx !

gchatelet marked 2 inline comments as done.Feb 13 2020, 7:05 AM
gchatelet added inline comments.
libc/src/string/memory_utils/memcpy_utils.h
23

Good catch

gchatelet updated this revision to Diff 244426.Feb 13 2020, 7:12 AM
gchatelet marked 9 inline comments as done.
  • Address comments
  • Improved documentation / fallback code for msvc
  • Revert rep movsb
  • Add memcpy to llvmlibc
  • Add -fnobuiltin and check no undefined symbols
  • Fix LLVM_LIBC_MEMCPY_MONITOR macro
  • Remove define in comment
  • Add restrict and documentation for MSVC
libc/src/string/memory_utils/memcpy_utils.h
23

__has_builtin is actually fairly new and __builtin_memcpy pretty old so I'm just going to assume that gcc has __builtin_memcpy.

libc/test/src/string/memory_utils/memcpy_utils_test.cpp
10

I've defined it in the CMakeLists.txt with a check in the test file.

gchatelet updated this revision to Diff 244428.Feb 13 2020, 7:15 AM
  • Remove unneeded comment
Harbormaster completed remote builds in B46414: Diff 244428.
gchatelet updated this revision to Diff 244429.Feb 13 2020, 7:24 AM
  • Tighten MaxReloads test
gchatelet updated this revision to Diff 244454.Feb 13 2020, 8:55 AM
  • Restrict check undefined symbols to release builds
  • Added thorough test of memcpy function
gchatelet updated this revision to Diff 244626.Feb 14 2020, 4:55 AM
  • Allow multiple code generation
gchatelet marked an inline comment as done.Feb 14 2020, 5:00 AM
gchatelet added inline comments.
libc/src/string/CMakeLists.txt
78

This now creates the following memcpy implementations

  • __llvm_libc::memcpy_x86_64_avx512
  • __llvm_libc::memcpy_x86_64_avx
  • __llvm_libc::memcpy_x86_64_sse2
  • __llvm_libc::memcpy_x86_64_sse
  • __llvm_libc::memcpy_x86_64_unopt

For shared libc, we need an ifunc like trampoline to select the correct version.
For static libc, we need to select an implementation

@sivachandra how do you see this kind if code generation integrate into the more general cmake functions from libc/cmake/modules/LLVMLibCRules.cmake?
I expect other memory functions to follow the same scheme.

sivachandra added inline comments.Feb 16 2020, 12:56 AM
libc/src/string/CMakeLists.txt
78

Instead of building all the possible implementations, could we use the CMake [[ https://cmake.org/cmake/help/v3.14/command/try_compile.html | try_compile ]] and/or [[ https://cmake.org/cmake/help/v3.14/command/try_run.html | try_run ]] command to sniff out the best flags to use? I think try_run is more appropriate as I expect that we need to run the cpuid instruction? Also, compilers have a convenience macro __cpuid to run this instruction on x86/x86_64?

BTW, one can have ifuncs in static libraries as well. But, I do understand we want to avoid the overhead of the added indirection, so sniffing out at configure time is the best. If we can setup something for configure time sniffing, I believe we should be able use it (may be with straightforward extension/modification iff required) to use as the ifunc selector as well.

abrachet added inline comments.Feb 16 2020, 3:35 PM
libc/src/string/CMakeLists.txt
11

llvm-nm (the nm that ships with MacOS) for mach-o files prints just the symbol and not its value or type character so grepping for U wouldn't work here (in this very niche case that I just happened to test). If we're using --undefined-only we could just grep . perhaps?

78

I think this is going to get quickly outside of the scope of this patch. It would probably be better to work on this in a separate patch.

libc/src/string/memcpy_arch_specific.h.def
34–35

Seems like clang-format interfered a little bit here.

libc/src/string/memcpy_entrypoint.cpp
1 ↗(On Diff #244626)

Must this be in its own file? Why not in src/string/memcpy.cpp?

sivachandra added inline comments.Feb 17 2020, 12:34 AM
libc/src/string/CMakeLists.txt
78

If there are any infrastructure pieces required to build a full solution, then we can do them separately as a prerequisite. However, if correct compile options are critical to memcpy implementation, then any code added to deduce them should belong to this patch. I agree that the patch will start to become too big. But, I think the additions to memcpy_utils and their tests can be split out to another prerequisite patch.

gchatelet marked 11 inline comments as done.Feb 19 2020, 5:30 AM
gchatelet added inline comments.
libc/src/string/CMakeLists.txt
11

thx for letting me know. I guess grep . will work.

78

@sivachandra try_compile and try_run will only work when the host is also the target of the build. For instance it won't work when cross compiling.

There are a few dimensions here:

  1. static library
    • autodetect which implementation to use with try_compile / try_run
    • manually choose the implementation for cross compilation
  2. dynamic library
    • select a set of implementations to include in the library and the associated dispatch code
  3. test
    • test all the implementations that can run on the host

I agree with both of you that this is getting too big for this patch.
Let me do my homework and come back with the prerequisite patches.

78

Let me think about it, I'll get back with the required building blocks.

libc/src/string/memcpy_arch_specific.h.def
34–35

Good catch

libc/src/string/memcpy_entrypoint.cpp
1 ↗(On Diff #244626)

On the one hand we need to pick one implementation to be the LLVM_LIBC_ENTRYPOINT(memcpy) and on the other hand we want to generate multiple implementations to be able to test them and embed them into the library.

I don't like having two files for this purpose, this is a dirty hack that is not to be submitted, I'll come back with a better solution.

sivachandra added inline comments.Feb 19 2020, 11:17 AM
libc/src/string/CMakeLists.txt
78

We don't need to worry about #2 for now.

For cross-compilation in case of #1, if we set up the build vars appropriately, I believe it should not be a problem.

#3 is currently a problem because of the way the add_entrypoint_object rule works: it expects the entrypoint name and the target name to be the same. We can add an optional argument to the rule which explicitly provides the entrypoint name. This will allow us to add multiple targets for the same entrypoint. We can setup our targets as follows:

src/string/
    - x86_64 # x86_64 specific directory.
          - CMakeLists.txt # Lists the targets for the various
                           # x86_64 flavors which all use the
                           # single memcpy.cpp source file
    - CMakeLists.txt # Lists the target for the release version
                     # of memcpy built using the best compile
                     # for the target machine.
    - memcpy.cpp # The actual platform independent memcpy
                 # implementation.

The same structure can be followed for tests as well. WDYT?

gchatelet marked 5 inline comments as done.Feb 20 2020, 5:58 AM
gchatelet added inline comments.
libc/src/string/CMakeLists.txt
78

@sivachandra SGTM
Can you try to amend add_entrypoint_object in a separate patch?

Separately, I came up with a CMake function to test cpu flags on linux/mac this will help for 3 and 2.

x86 and x86_64will have the same implementation any suggestion on how to name the directory? x86_32_64 ?

sivachandra added inline comments.Feb 20 2020, 10:27 PM
libc/src/string/CMakeLists.txt
78

Done now: https://reviews.llvm.org/D74948

Conventionally, pieces common to x86_64 and other x86 flavors are all put together in a directory named x86. I do not have an easy to way to test on non-x86_64 machines. So, may be just start with x86_64 for now and leave a comment that it works for other x86 flavors also? If you have a better way, feel free to propose.

gchatelet updated this revision to Diff 247261.Feb 28 2020, 7:13 AM

This version creates and tests many implementations, and select one for publication into llvmlibc.a

gchatelet marked 3 inline comments as done.Feb 28 2020, 7:20 AM

@sivachandra it seems there's a problem with the NAME attribute in add_entrypoint_object as the symbol being exported in the final library is not memcpy anymore but the provided NAME

 % nm /tmp/llvm-project_rel_compiled-with-clang/projects/libc/lib/libllvmlibc.a

__errno_location.o:
0000000000000000 T __errno_location
0000000000000000 T _ZN11__llvm_libc16__errno_locationEv
0000000000000000 b _ZN11__llvm_libcL7__errnoE

strcpy.o:
                 U memcpy
0000000000000000 T strcpy
                 U strlen
0000000000000000 T _ZN11__llvm_libc6strcpyEPcPKc

strcat.o:
0000000000000000 T strcat
                 U strlen
0000000000000000 T _ZN11__llvm_libc6strcatEPcPKc
                 U _ZN11__llvm_libc6strcpyEPcPKc

memcpy.o:
0000000000000000 A memcpy_x86_64_opt_avx512f
0000000000000000 T _ZN11__llvm_libc25memcpy_x86_64_opt_avx512fEPvPKvm
0000000000000000 T _ZN11__llvm_libc6memcpyEPvPKvm

mmap.o:
0000000000000000 T mmap
                 U _ZN11__llvm_libc16__errno_locationEv
0000000000000000 T _ZN11__llvm_libc4mmapEPvmiiil

munmap.o:
0000000000000000 T munmap
                 U _ZN11__llvm_libc16__errno_locationEv
0000000000000000 T _ZN11__llvm_libc6munmapEPvm

raise.o:
0000000000000000 T raise
0000000000000000 T _ZN11__llvm_libc5raiseEi
libc/cmake/modules/LLVMLibCRules.cmake
108 ↗(On Diff #247261)

@sivachandra this should be submitted as a separate patch but it's better to have context for the change.

libc/src/string/CMakeLists.txt
0–1

I'm removing this since all implementations will be tested from now on.

libc/src/string/memcpy.cpp
15

We need to selectively declare the entry point depending on if we generate test or final implementation.

@sivachandra it seems there's a problem with the NAME attribute in add_entrypoint_object as the symbol being exported in the final library is not memcpy anymore but the provided NAME

 % nm /tmp/llvm-project_rel_compiled-with-clang/projects/libc/lib/libllvmlibc.a

__errno_location.o:
0000000000000000 T __errno_location
0000000000000000 T _ZN11__llvm_libc16__errno_locationEv
0000000000000000 b _ZN11__llvm_libcL7__errnoE

strcpy.o:
                 U memcpy
0000000000000000 T strcpy
                 U strlen
0000000000000000 T _ZN11__llvm_libc6strcpyEPcPKc

strcat.o:
0000000000000000 T strcat
                 U strlen
0000000000000000 T _ZN11__llvm_libc6strcatEPcPKc
                 U _ZN11__llvm_libc6strcpyEPcPKc

memcpy.o:
0000000000000000 A memcpy_x86_64_opt_avx512f
0000000000000000 T _ZN11__llvm_libc25memcpy_x86_64_opt_avx512fEPvPKvm
0000000000000000 T _ZN11__llvm_libc6memcpyEPvPKvm

mmap.o:
0000000000000000 T mmap
                 U _ZN11__llvm_libc16__errno_locationEv
0000000000000000 T _ZN11__llvm_libc4mmapEPvmiiil

munmap.o:
0000000000000000 T munmap
                 U _ZN11__llvm_libc16__errno_locationEv
0000000000000000 T _ZN11__llvm_libc6munmapEPvm

raise.o:
0000000000000000 T raise
0000000000000000 T _ZN11__llvm_libc5raiseEi

I thought your use case was the other way around: you want to have multiple targets with the same entrypoint name. Is this not correct? The first argument to add_entrypoint_object is the target name (which is in sync with the CMake convention of using target names as the first arg.) The NAME option specifies the entrypoint name.

libc/cmake/modules/LLVMLibCRules.cmake
108 ↗(On Diff #247261)

This is fine.

libc/src/string/CMakeLists.txt
39

The NAME is the entrypoint name and not the target name. May be you mean the opposite here? As in, instead of memcpy on line 54, you want ${memcpy_name} and memcpy for the NAME argument?

libc/src/string/memcpy.cpp
15

Test or final, all of them should have the same entrypoint name. No?

For example, memcpy_config1_test will depend on memcpy_config1 but will call __llvm_libc::memcpy for the test.

libc/src/string/memcpy_arch_specific.h.def
61

This should all be just memcpy. May be I am missing something still.

libc/test/src/string/CMakeLists.txt
30

This seems to be setup correctly. But, instead of memcpy_name, they should be called memcpy_config_name to make it clear that they are all memcpy implementations in different configs.

36

This as well.

libc/test/src/string/memcpy_test.cpp
12

This should not be required.

46

This should also not be required.

MaskRay added inline comments.Mar 2 2020, 10:00 PM
libc/src/string/CMakeLists.txt
78

Does __attribute__((target("avx"))) meet the needs?

As to ifunc, it needs non-trivial work in the rtld. Even in a -static (but not -static-pie) context, there will be R_X86_64_IRELATIVE relocations and crt should resolve them.

gchatelet updated this revision to Diff 249635.Mar 11 2020, 8:33 AM
gchatelet marked 12 inline comments as done.
  • address comments and rebase

I think we misunderstood each other.

My original goal was to also support shared library with runtime dispatch (postponed but we'll do it eventually)
In that case we need a single library with all the implementations, they need to have different names to prevent ODR, and memcpy would be the dispatching function.
That's why I provided the custom definitions MEMCPY_NAME=${memcpy_name}.

Now I can indeed remove all of this, have different targets generating different object files all with the same memcpy function in it. This means that I need to create X unit tests and X benchmarks all linking with a different targets but all referring to the same memcpy function (just a different implementation). This works, and I've updated the patch accordingly.

However it will not be straightforward to go from this to the "shared library with runtime dispatch" state. We would need extra machinery to copy the memcpy functions of each implementations into the final object and rename them to prevent ODR.

The current implementation is simpler though. Let me know what you think.

libc/src/string/CMakeLists.txt
78

Does __attribute__((target("avx"))) meet the needs?

Kind of, but

  • It's brittle, I've been bitten a few times passing features with typo and the compiler will happily compile without a warning.
  • It won't work with MSVC

For these reasons I'd rather not use it.

Mostly LGTM. Few minor questions inline.

libc/cmake/modules/LLVMLibCRules.cmake
149 ↗(On Diff #249635)

Sorry for not mentioning this earlier. I have copied this part in to another patch of mine.

155 ↗(On Diff #249635)

Do we need this still?

324 ↗(On Diff #249635)

Likewise, seems to me like COMPILE_DEFINITIONS is not used.

libc/src/string/CMakeLists.txt
57

Should we check if ${opt_level} is available for the machine this target is being built?

libc/test/src/string/memcpy_test.cpp
13

Instead of this, you can include src/string/memcpy.h.

gchatelet updated this revision to Diff 249903.Mar 12 2020, 5:05 AM
gchatelet marked 9 inline comments as done.

Address comments

libc/cmake/modules/LLVMLibCRules.cmake
149 ↗(On Diff #249635)

Yes, I think I messed up with the merge, should be fixed.

155 ↗(On Diff #249635)

Right now, support for repmovsb is disabled. This is a customization point that should be exposed through preprocessor definitions (and CMake options).
As of this patch this is not needed but I expect to need it soon-ish.

libc/src/string/CMakeLists.txt
57

Yes, makes sense.
I'm still not super happy about the introspection part, it's cumbersome to use...
I'll work on it as a separate patch if you don't mind.

libc/test/src/string/memcpy_test.cpp
13

Ha sure !

Harbormaster completed remote builds in B48970: Diff 249903.
sivachandra accepted this revision.Mar 12 2020, 8:12 AM
sivachandra marked an inline comment as done.
sivachandra added inline comments.
libc/src/string/CMakeLists.txt
57

SGTM

This revision is now accepted and ready to land.Mar 12 2020, 8:12 AM
gchatelet updated this revision to Diff 250296.Mar 13 2020, 1:54 PM

I revamped the cpu feature part. It's much simpler and easier to understand.
Now tests that can't run on the host will be skipped.

This last modification does not change the memcpy implementation but how we detect cpu features and how we build/test the memcpy implementations.
There are less global variables only ALL_CPU_FEATURES and HOST_CPU_FEATURES.
The flags are built through compute_flags and host compatibility tested with host_supports.
All memcpy implementations are added to a global property to be passed down to tests.
Tests can query the target for their required cpu features and test the host against them, hence testing only the implementations that the host can execute.
This also remove the need for individual check files.

Overall I think the design is much simpler and straightforward.

As this patch gets closer to landing I think it would make sense to update the description :)

libc/cmake/modules/LLVMLibCCheckCpuFeatures.cmake
10 ↗(On Diff #250296)

Is it necessary to print this?

libc/cmake/modules/cpu_features/check_cpu_features.cpp.in
4–10 ↗(On Diff #250296)

Is it necessary to define these? Does DEFINITIONS rely on the definitions of macros? Also I think handling MSVC here is strange when compute_flags does not currently.

19–20 ↗(On Diff #250296)

Why not putchar and puts?

libc/test/src/string/CMakeLists.txt
32

Would you mind explaining this? It seems like ${flags} will just be -march=native, and the work above to find flags gets ignored.

gchatelet edited the summary of this revision. (Show Details)Mar 16 2020, 7:40 AM
gchatelet updated this revision to Diff 250584.Mar 16 2020, 9:24 AM
gchatelet marked 8 inline comments as done.

Address comments and rebase

libc/cmake/modules/LLVMLibCCheckCpuFeatures.cmake
10 ↗(On Diff #250296)

Not really.

libc/cmake/modules/cpu_features/check_cpu_features.cpp.in
4–10 ↗(On Diff #250296)

Right, it's probably better to defer anything related to MSVC at that point.
Also I'll add documentation to this file because it's a bit obscure.

19–20 ↗(On Diff #250296)

I used putchar for a single character but I can't use puts since it automatically appends a line feed.

libc/test/src/string/CMakeLists.txt
32

Actually there's no need for flags here, the implementation has already been compiled with the correct flags. The test itself doesn't need them.

sivachandra marked an inline comment as done.Mar 16 2020, 9:29 AM
sivachandra added inline comments.
libc/src/string/CMakeLists.txt
95

It seems like this will not ensure the best flags. Is that the intention? If so, why?

libc/src/string/x86_64/CMakeLists.txt
1 ↗(On Diff #250296)

These listings are already under x86_64. So, do we need to use ${LIBC_TARGET_MACHINE}?

4 ↗(On Diff #250296)

It would be nice if a descriptive error message is shown if one tries to build a target which isn't supported on their machine. ISTM that it is not the case. If indeed so, we can do it in a later round I think.

libc/test/src/string/CMakeLists.txt
32

Same question for me as well. I left a related comment at a different place above.

sivachandra added inline comments.Mar 16 2020, 10:20 AM
libc/test/src/string/CMakeLists.txt
32

Ah, so remove this line then?

gchatelet marked 6 inline comments as done.Mar 16 2020, 11:16 AM
gchatelet added inline comments.
libc/src/string/CMakeLists.txt
95

It will, -march=native enables all the features available on the host.
Why do you think -march=native won't get the best flags?

That said if you're on a skylake-avx512 machine it will not use avx512 instructions. This is because -mprefer-vector-width is defaulted to 256 bit width operations (see this phoronix article)

Currently, if on a skylake-avx512 machine the implementation will be the same as the avx one. We would have to measure to be sure it's worth forcing -mprefer-vector-width=512 as well.

libc/src/string/x86_64/CMakeLists.txt
1 ↗(On Diff #250296)

This is because we redirect both x86 and x86_64 versions to this folder and they ought to have different names. We should probably rename this folder if it's unclear. Maybe x86_multi ?

libc/test/src/string/CMakeLists.txt
32

This is a two step process:

  • the implementations get compiled with specific flags
  • when testing we retrieve these flags and check whether the current host supports them

If they are compatible, the already compiled .o file can run on the host, the test file itself doesn't need to receive the flags (only the implementation)

This will be the same for benchmarking, the benchmarking code does not need to be compiled with avx support, only the code under test needs to be.

Do you think it deserves a comment?

sivachandra accepted this revision.Mar 16 2020, 11:55 AM
sivachandra marked an inline comment as done.

I think what is outstanding wrt comments is very minor. So, feel free to land and we can iterate if required after landing.

libc/src/string/CMakeLists.txt
95

I was of the opinion that the compilers do not have the complete set of capabilities available to them and that is why we have facilities like HWCAP, cpuid etc. But, you are the expert, and if you say what you have is enough, I will take it :)

libc/src/string/x86_64/CMakeLists.txt
1 ↗(On Diff #250296)

Normal convention is to name it just x86. So, name it just x86 then?

libc/test/src/string/CMakeLists.txt
32

AFAICT, compute_flags does not have any side effects. Also, it doesn't look like ${flags} is used anywhere here. So, why call it at all?

gchatelet updated this revision to Diff 251115.Mar 18 2020, 9:44 AM
gchatelet marked 8 inline comments as done.

Address comments and rebase

libc/src/string/CMakeLists.txt
95

With -march=native the compiler will introspect the CPU with cpuid and detect the available capabilities. We're deferring to the compiler here.

Now for shared libraries and runtime dispatch we'll have to provide such code but we're not there yet.

libc/test/src/string/CMakeLists.txt
32

Yes thank you it was a leftover.

Thank you @sivachandra and @abrachet for the review.
I'll submit it as-is for now and we can iterate from here.

Thank you @sivachandra and @abrachet for the review.
I'll submit it as-is for now and we can iterate from here.

No really, thank you for patiently working this out. The bots are green so it is sticking.

This revision was automatically updated to reflect the committed changes.

This change breaks -DLLVM_INCLUDE_TESTS=OFF:

$ cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS=libc -DLLVM_INCLUDE_TESTS=OFF ../llvm
...
CMake Error at /home/nathan/src/llvm-project/libc/test/src/string/memory_utils/CMakeLists.txt:13 (target_compile_definitions):
  Cannot specify compile definitions for target "utils_test" which is not
  built by this project.
...