This is an archive of the discontinued LLVM Phabricator instance.

Introduce element-wise atomic memcpy and memmove intrinsics
ClosedPublic

Authored by igor-laevsky on Nov 25 2016, 7:59 AM.

Details

Summary

LLVM can do various optimizations around the standard library memcpy and memmove
intrinsics. In particular, we have the LoopIdiomRecognize and MemCpyOptimizer passes,
which do a number of interesting transformations.

Some languages, such as Java, require all of their memory accesses to be unordered
atomics. This means that we can't directly use the above-mentioned passes when
compiling for such languages.

This change is a step towards supporting these optimizations for languages with
atomicity constraints. It adds special versions of the memcpy and memmove intrinsics
which are specified to perform a predictable set of atomic memory accesses. They are
lowered to calls to library functions, and it is the target's responsibility to
implement them.

Follow-up changes will extend LoopIdiomRecognize and other passes to support the
newly added intrinsics.
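
For illustration, a call to the memcpy flavor looks roughly like this (a sketch
only; the exact intrinsic name and signature are defined by the LangRef changes
in this diff and were refined during review, and %dest, %src and %len are
placeholder values):

  ; Copy %len elements of 4 bytes each; every element is read and written
  ; atomically.
  call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* %dest, i8* %src,
                                                  i64 %len, i32 4)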

Diff Detail

Repository
rL LLVM

Event Timeline

igor-laevsky retitled this revision from to Introduce element-wise atomic memcpy and memmove intrinsics.
igor-laevsky updated this object.
igor-laevsky added a subscriber: llvm-commits.
anna added a subscriber: anna. Nov 30 2016, 1:05 PM
anna added inline comments.
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
4932 ↗(On Diff #79305)

Do we need a switch here? The main switch on the intrinsic is only triggered for memcpy and memmove (line 4901). Maybe we can have it as a ternary with just memcpy_element_atomic and memmove_element_atomic?

lib/IR/Verifier.cpp
3963 ↗(On Diff #79305)

Don't we need verification that the source and dest are of the same addrspace?

reames added a subscriber: reames. Nov 30 2016, 6:48 PM
reames added inline comments.
docs/LangRef.rst
12684 ↗(On Diff #79305)

It might be good to specify the alignment of source and destination separately.

12704 ↗(On Diff #79305)

Hm, the wording isn't quite right. As written, we could have six bytes to copy with an element size of 2 and be legally allowed to copy two three-byte chunks.

Memory copy is performed as a sequence of memory accesses where each access is an even multiple of element size and aligned at an element size boundary. (e.g. each element is accessed atomicly in source and destination buffer)

Also, add:
The order of the copy is unspecified. The same value may be read from the source buffer many times, but only one write is issued to the destination buffer per element. It is well defined to have concurrent reads and writes to both source and destination provided those reads and writes are at least unordered atomic.

This intrinsic does not provide ordering with respect to other memory accesses in adjacent code. It can be used to implement a fence-like construct.
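
(In other words -- as an illustrative sketch rather than wording from the
patch, with %src.elt and %dest.elt as hypothetical pointers to one element --
a copy with element size 4 behaves like a sequence of per-element unordered
atomic accesses:

  %v = load atomic i32, i32* %src.elt unordered, align 4
  store atomic i32 %v, i32* %dest.elt unordered, align 4

repeated once per element, in an unspecified order.)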

12711 ↗(On Diff #79305)

If we're just going to ignore the volatile argument, should we have it?

Also mention that the optimizer may inline the copy if profitable.

12714 ↗(On Diff #79305)

Not sure we really need memmove. I might start with just atomic_memcpy and see where that gets us.

include/llvm/IR/Intrinsics.td
766 ↗(On Diff #79305)

The first argument is the return type, right? Shouldn't that be void?

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
4906 ↗(On Diff #79305)

We'll want to inline small copies here eventually, but this is a fine starting place.

4913 ↗(On Diff #79305)

If this can be written as:

  Args.push_back(Entry(Ty, Node))

please do; the stateful code is slightly confusing.

test/CodeGen/X86/element-wise-atomic-memory-intrinsics.ll
6 ↗(On Diff #79305)

Check the argument passing as well.

igor-laevsky marked 5 inline comments as done.

Thanks for the comments! Please take a look at the updated diff.

docs/LangRef.rst
12684 ↗(On Diff #79305)

Why do you think this will be useful? It will definitely add some complexity to the possible transforms with these intrinsics.

12704 ↗(On Diff #79305)

Right. Thanks for catching this!

Memory copy is performed as a sequence of memory accesses where each access is an even multiple of element size

Not sure why it should be an even multiple? I think it's fine to copy any number of elements in a single operation; the only important thing is to never copy them partially.

This intrinsic does not provide ordering with respect to other memory accesses in adjacent code. It can be used to implement a fence-like construct.

Since there is no ordering, why can it be used as a fence-like construct?

12711 ↗(On Diff #79305)

The plan is to use volatile to disable all optimizations involving this intrinsic, much in the same way as the original memcpy does.

12714 ↗(On Diff #79305)

My thought was that since it looks very much like memcpy it would be easy to add them both together. However, you're right, and for the sake of simplifying the review process I removed memmove from this change.

include/llvm/IR/Intrinsics.td
766 ↗(On Diff #79305)

I'm not sure if there is any semantic difference between specifying an empty list as a return value and specifying a list with a single element which is llvm_void_ty. However, I see that all other intrinsic definitions describe a void return type as an empty list, so it's better to be in sync with the rest of the code.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
4913 ↗(On Diff #79305)

Unfortunately TargetLowering::ArgListEntry doesn't have a suitable constructor. It's probably worth adding one, but in a different change.

4932 ↗(On Diff #79305)

Right, we don't need it. I completely removed memmove from the change, so now there is no conditional at all.

lib/IR/Verifier.cpp
3963 ↗(On Diff #79305)

I didn't add this limitation to the intrinsic definition. Since address space semantics are target-defined, we might be able to execute a memory copy from one address space to another. I think it should be the frontend's job to ensure that this is a valid operation.

reames added inline comments. Dec 1 2016, 12:26 PM
docs/LangRef.rst
12684 ↗(On Diff #79305)

I don't have a strong view here. I just vaguely remember someone wanting to do that to the memcpy intrinsic.

12704 ↗(On Diff #79305)

w.r.t. "even multiple" you're right. The even can be dropped. The key word is "multiple".

w.r.t. the fence instruction, that's spell check dropping a word. It should have been "It can NOT be used..."

12711 ↗(On Diff #79305)

Ok.

include/llvm/IR/Intrinsics.td
766 ↗(On Diff #79305)

I think I just misread the code. I thought you had three llvm_anyptr_ty arguments. I now see the third is an llvm_anyint_ty.

Specify alignment for each of the input pointers via parameter attribute instead of an explicit call argument.
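
(Illustratively, with hypothetical alignments of 4, a call would then look
like:

  call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 4 %dest,
                                                  i8* align 4 %src,
                                                  i64 %len, i32 4)

rather than carrying the alignment as an extra integer argument.)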

sanjoy requested changes to this revision. Dec 16 2016, 4:36 PM
sanjoy added a reviewer: sanjoy.
sanjoy added a subscriber: sanjoy.

I have some comments / questions on the specification

docs/LangRef.rst
12644 ↗(On Diff #79895)

Add a couple of "-"s to underline the whole heading.

12646 ↗(On Diff #80740)

I'd change the style a little bit here --

"to the standard library memory intrinsics except that they perform"

12667 ↗(On Diff #80740)

Why not just have the i64 variant?

12679 ↗(On Diff #80740)

s/atomicly/atomically/

Also, perhaps you meant "i.e." and not "e.g."?

12689 ↗(On Diff #80740)

*less than a target-specific atomic access size limit

12697 ↗(On Diff #80740)

Given that you're explicitly not passing this parameter to `__llvm_memcpy_element_atomic`, perhaps this isn't needed?

12710 ↗(On Diff #80740)

"The same value may be read from the source buffer many times" >> why do you care about this?

12710 ↗(On Diff #80740)

"The order of the copy is unspecified" -- I'd be a bit more explicit -- "The order in which the `num_elements` elements are copied is unspecified"

12716 ↗(On Diff #80740)

I'd expect this to at least participate in reordering constraints due to other memory operations. That is, I'd expect reordering

  %val = acquire_load
  atomic_memcpy(...)

to

  atomic_memcpy(...)
  %val = acquire_load

to be illegal, even if the memory locations involved in the memcpy are disjoint from the location in the load.

This revision now requires changes to proceed. Dec 16 2016, 4:36 PM
sanjoy added inline comments. Dec 16 2016, 4:40 PM
docs/LangRef.rst
12697 ↗(On Diff #80740)

I see you've already replied to this before -- it would be used as a switch to disable all optimizations.

However, what is the use case for this? I'd imagine the motivation behind emitting this intrinsic would be to enable memcpy-related optimizations and better codegen. If a user does not want optimizations, perhaps they can just emit a loop with volatile loads and stores? Or are you trying to cater to the case where a user wants good codegen (i.e. they want to call into an optimized runtime function) but does not want memcpy optimizations? In that case, why wouldn't they call `__llvm_memcpy_element_atomic` directly?

reames added inline comments. Dec 21 2016, 5:48 PM
docs/LangRef.rst
12697 ↗(On Diff #80740)

In general, we want to implement two families of optimizations:

  • target function recognition to intrinsic matching
  • lowering of intrinsics to IR for small sizes

We need the volatile flag to know that the latter isn't legal.

This looks like it's basically ready to go in; we're down to minor wordsmithing at this point.

I'd really prefer not to sign off on this myself. Igor and I work together and I arguably have a conflict of interest here. Can I get another reviewer to speak up and LGTM the patch? If no one else has taken a look in a couple of days, I will LGTM to let us move forward, but I'd really prefer not to do that.

docs/LangRef.rst
12667 ↗(On Diff #80740)

Yeah, I'd just drop the i32 version. It's redundant. I think this means that argument can just be an i64 rather than an anyint as well.

12710 ↗(On Diff #80740)

The ability to read multiple times lets you restart after inspecting the value. Not sure why we'd want this, but let's leave the possibility available.

(Maybe we can fast-path zero initialization if all the source values are zero or something?)

12716 ↗(On Diff #80740)

Hm, good point. This clearly needs rewording. Possibly:
This intrinsic does not provide any additional ordering guarantees over those provided by a set of unordered loads from the source location and stores to the destination.

igor-laevsky edited edge metadata.
igor-laevsky marked 5 inline comments as done.
igor-laevsky added inline comments.
docs/LangRef.rst
12679 ↗(On Diff #80740)

"E.g" is correct here. It is meant to describe one of the possible access patterns we can get. For example we can instead copy elements pair by pair if length is even. I replaced "e.g" with "for example" to make it more clear.

12710 ↗(On Diff #80740)

I have a feeling that the new phrase implies that more (or fewer) elements might be copied. Why do you think it's important to rephrase the current statement?

12716 ↗(On Diff #80740)

Right. The idea here is that this intrinsic behaves like an unordered memory operation. I've updated it with Philip's wording.

Have you considered, instead of specifying a single __llvm_memcpy_element_atomic library function, specifying a set (__llvm_memcpy_element_atomic_1, __llvm_memcpy_element_atomic_2, __llvm_memcpy_element_atomic_4, __llvm_memcpy_element_atomic_8, __llvm_memcpy_element_atomic_16)? It should be a bit faster, and if someone screws up, you'll get a link error rather than a runtime error.
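
(For illustration, the runtime would then provide one entry point per element
size; these prototypes are hypothetical and left to whatever runtime implements
them, with the length passed as a number of elements:

  declare void @__llvm_memcpy_element_atomic_4(i8*, i8*, i64) ; dest, src, num_elements
  declare void @__llvm_memcpy_element_atomic_8(i8*, i8*, i64) ; dest, src, num_elements

and the intrinsic's element size argument would select which one gets called.)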

It's kind of hard for me to judge how useful this is because clang doesn't use unordered loads/stores... but I can definitely see this being nice to have.

In general, we want to implement two families of optimizations:

  • target function recognition to intrinsic matching
  • lowering of intrinsics to IR for small sizes

We need the volatile flag to know that the later isn't legal.

I don't follow; what exactly is "volatile" supposed to mean here? memcpy only has a volatile bit to match C semantics for volatile structs; since you're not dealing with that legacy, you should call your bit something different to reflect how you actually expect it to be used (maybe noinline, if that's what you're after?).

igor-laevsky edited edge metadata.

Hi Eli, thanks for the comments!

I updated the change according to your suggestion to use different library functions for different element sizes. I also removed the 'isvolatile' flag. My primary motivation for it was to match the memcpy semantics, but I see now that it's not necessary.

efriedma accepted this revision. Dec 27 2016, 10:35 AM
efriedma added a reviewer: efriedma.

LGTM with one minor tweak.

Please send an email to llvmdev with a short summary of the feature and a link to this review before you merge this; there might be other people interested in this who don't read llvm-commits.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
4929 ↗(On Diff #82509)

This should probably be a report_fatal_error rather than an assertion, given that it can be triggered with valid IR.

This revision was automatically updated to reflect the committed changes.

LGTM with one minor tweak.

Please send an email to llvmdev with a short summary of the feature and a link to this review before you merge this; there might be other people interested in this who don't read llvm-commits.

Thanks! I already sent such an email some time ago: http://lists.llvm.org/pipermail/llvm-dev/2016-November/107018.html