This is an archive of the discontinued LLVM Phabricator instance.

[Atomics] Rename and change prototype for atomic memcpy intrinsic
ClosedPublic

Authored by dneilson on May 16 2017, 7:34 AM.

Details

Summary

Background: http://lists.llvm.org/pipermail/llvm-dev/2017-May/112779.html

This change is to alter the prototype for the atomic memcpy intrinsic. The prototype itself is being changed to more closely resemble the semantics and parameters of the llvm.memcpy intrinsic -- to ease later combination of the llvm.memcpy and atomic memcpy intrinsics. Furthermore, the name of the atomic memcpy intrinsic is being changed to make it clear that it is not a generic atomic memcpy, but specifically a memcpy is unordered atomic.

Diff Detail

Event Timeline

dneilson created this revision.May 16 2017, 7:34 AM

Note that this is the first of a series of patches that are being developed for the unordered atomic memcpy. Minimally, the plan is to push the following changes one at a time to minimize risk and impact on others:
i. Change intrinsic name, prototype (to match memcpy closely), & documentation.
ii. Add code to loop idiom to recognize the element unordered atomic memcpy.
iii. Add code to instcombine & selection dag builder to lower the intrinsic.
iv. Add an isunordered() to the MemIntrinsic introspection class (returning false for all existing intrinsics), and add calls to it to all passes it's relevant.
v. Add intrinsic into the introspection hierarchy & complete support for new intrinsic in passes.

reames requested changes to this revision.May 16 2017, 11:18 AM

Reviewing only the LangRef changes for the moment. Let's iterate on those until we're happy and then I can go looking for code issues.

docs/LangRef.rst
13582

Your revised text is missing key aspects of the old text. You need to preserve the "as a sequence of intrinsic. It differs in that the `dest and src are treated as arrays with elements that are element_size` bytes wide and aligned at an element size boundary. " wording from the original, because this is semantically important.

13592

Er, huh? What do these major values mean? And why do we need anything other than an i1 boolean?

13595

This sentence is important and shouldn't be dropped.

13607

Ah, the answer to my question above. I think it would be cleaner to have two i1 params instead of encoding the bitmask. Does that complicate anything for you?

13614

This is slightly wrong. You don't need the writes to be unordered atomic if the src/dest doesn't need it, but you do still want to allow concurrent reads and writes. I think you want something along the lines of: "It is well defined to have concurrent reads and writes to both source and destination provided those reads and writes are unordered atomic when specified.

This revision now requires changes to proceed.May 16 2017, 11:18 AM
efriedma edited edge metadata.May 16 2017, 11:31 AM

Do we get any practical benefit from separately specifying whether the source and destination require unordered operations?

Why are you adding an alignment parameter? The alignment is already specified with attributes. (There was a plan at one point to change memcpy to specify alignment like this; IIRC it got committed, then reverted? I don't recall what happened after that.)

Why do we want to specify the length in bytes, as opposed to the number of elements to copy? Any implementation is inevitably just going to divide a length in bytes by the element size.

I'm not really sure why you're messing with the signature of the intrinsic in the first place; we went through most of this design space when it was initially proposed.

dneilson added a comment.EditedMay 16 2017, 12:06 PM

Do we get any practical benefit from separately specifying whether the source and destination require unordered operations?

I don't know enough about the possible source languages to know with 100% certainty that it's not possible to mix, say, unordered loads with ordered stores. For example, Java only requires the unordered ops for shared data (i.e. stuff on the heap). It's conceivable that a memcpy is desired to copy, say, from the stack to the heap; only the heap stores would need to be unordered in this case -- playing devil's advocate, it would not be wrong (just unnecessary) to use unordered loads of the stack data here as well. Erring on the side of flexible/generic here.

Why are you adding an alignment parameter? The alignment is already specified with attributes. (There was a plan at one point to change memcpy to specify alignment like this; IIRC it got committed, then reverted? I don't recall what happened after that.)

Compatibility with the existing memcpy intrinsic. I have no problem removing the alignment parameter if that's the long-term goal/vision. My only concern is that I don't know whether that difference will lead me into trouble when I get to the point of adding the unordered atomic intrinsic to the MemTransferInst introspection class hierarchy.

Why do we want to specify the length in bytes, as opposed to the number of elements to copy? Any implementation is inevitably just going to divide a length in bytes by the element size.

Compatibility with memcpy semantics. The 'length' parameter of a MemTransferInst intrinsic is semantically understood by transforms/analysis to be the size of the transfer in bytes. Having an intrinsic in that hierarchy that specifies a non-bytes value for that parameter strikes me as a recipe for bugs.

I'm not really sure why you're messing with the signature of the intrinsic in the first place; we went through most of this design space when it was initially proposed.

It's being revisited with a long-term eye towards merging the unordered atomic semantics into the existing memcpy intrinsic. By making the proposed/new intrinsic's definition much closer to that of the existing llvm.memcpy it should be much easier to do that eventual merging of the two without having to revisit the semantic understanding of the unordered intrinsic in every analysis/transform.

I'll make the suggested changes to the LangRef.

docs/LangRef.rst
13607

Easy enough to do if it's desired. Only reason that I didn't do that originally is that I'm already adding two parameters on top of memcpy for unordered memcpy; seemed like a good way to prevent that from becoming three additional parameters.

dneilson updated this revision to Diff 99202.May 16 2017, 2:32 PM
dneilson edited edge metadata.
  • Addressed suggested changes to the LangRef doc for the intrinsic.
  • Split out the is unordered parameter into two separate parameters -- dest_unordered & src_unordered.
dneilson marked 5 inline comments as done.May 16 2017, 2:39 PM

I'm going to let Daniel and Eli debate the general direction before reviewing further. I want to make sure Eli is on board with the general direction before we invest lots of effort in cleaning up the code.

skatkov added inline comments.
include/llvm/IR/IntrinsicInst.h
199

why not enum?

200

ARG_SOURCE? For consistency with other names.

205

ARG_SOURCE_UNORDERED? For consistency with other names.

343

bool isDestUnordered?

348

bool isSrcUnordered?

385

Don't you want to check (or assert) the constraints for the alignment here?

391

setSourceUnordered? For consistency with other names.

393

Don't you want to check (or assert) the constraints for the element size here?

include/llvm/IR/Intrinsics.td
810

you have two unordered arguments.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
4869–4870

You introduced the special getters, can you use them here?

lib/IR/Verifier.cpp
3997

is zero alignment allowed? At least it is strange that the 0 is a power of 2 :) if it is allowed, please update text "or zero"

I don't know enough about the possible source languages to know with 100% certainty that it's not possible to mix, say, unordered loads with ordered stores.

It's almost certainly possible to mix them... even if the source language doesn't allow mixing them, we could transform unordered operations to non-atomic operations if we can prove the memory isn't accessed from another thread. (I don't think we actually do this transform at the moment, but LICM has a similar sort of check.)

The question is whether there's actually any optimization that would actually check the unordered bit. Maybe there is? (See my comments on __llvm_memcpy_element_unordered_atomic_*.)

Compatibility with the existing memcpy intrinsic. I have no problem removing the alignment parameter if that's the long-term goal/vision. My only concern is that I don't know whether that difference will lead me into trouble when I get to the point of adding the unordered atomic intrinsic to the MemTransferInst introspection class hierarchy.

Yes, this is the direction we want to go in. You should be able to hide the difference in the implementation of MemIntrinsic, I think. If it does cause problems, we can revisit.

Compatibility with memcpy semantics. The 'length' parameter of a MemTransferInst intrinsic is semantically understood by transforms/analysis to be the size of the transfer in bytes. Having an intrinsic in that hierarchy that specifies a non-bytes value for that parameter strikes me as a recipe for bugs.

Okay. It's a little ugly, but I guess it isn't that terrible.

docs/LangRef.rst
13600

"must" is kind of confusing in this context. Probably need to explicitly say "if len is not a multiple of element_size, the behavior is undefined", or something like that, since we can't actually tell until runtime.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
4887–4888

I'm not sure changing the signature of __llvm_memcpy_element_unordered_atomic_* like this makes sense.

I guess changing from the number of elements to the length in bytes is fine.

I'm not sure why you want to pass the alignment to the function; the implementation can easily compute the alignment itself based on the pointers passed in.

Not sure what the implementation is going to do with the DestUnordered and SrcUnordered parameters. Maybe if the source/dest is non-atomic, it could use unaligned load/store operations?

If we do need DestUnordered and SrcUnordered, it probably makes sense to merge them to save an instruction in the caller.

dneilson marked 8 inline comments as done.May 17 2017, 1:24 PM
dneilson added inline comments.
docs/LangRef.rst
13582

I'm not seeing the difference here. "Sequence" doesn't imply any sort of ordering. So, to me the old and new are semantically equivalent -- there is a copy happening, and it's being done with unordered atomic load/stores.

13600

Fair. I'll make that change.

include/llvm/IR/IntrinsicInst.h
199

No particularly good reason. Just playing around.

200

Fair.

385

I figured that's handled by the verifier.

393

Handled by verifier.

include/llvm/IR/Intrinsics.td
810

Good catch! I updated the prototype, but neglected the comment.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
4887–4888

Fair. I have no strong preference with the library function taking size in bytes vs. number of elements. The only benefit that I can see to passing size in bytes instead of number of elements here is the possibility for runtime checks being implemented in a debug version of the library -- one that verifies that the length is a multiple of element size.

Good point about alignment. Not the smartest move on my part.

I *think* that you're right about passing dest_unordered & src_unordered to the library -- it might be unnecessary. The lib function could just be implemented assuming that both source & dest require unordered atomic ops; there shouldn't be any harm in it, since unordered just means that we can't break up an element into partial loads/stores, and we wouldn't want to do that in a high performance library anyways.

I'll change the lib prototype to match memcpy exactly.
_llvm_memcpy_unordered_atomic(i8* noalias dest, i8* noalias src, uint64 length)

lib/IR/Verifier.cpp
3997

I would think not, but this is the same check as exists for memcpy. So, for compatibility I think it should be allowed unless we can definitively say that it's not allowed for memcpy.

dneilson updated this revision to Diff 99472.May 18 2017, 11:45 AM
dneilson marked 3 inline comments as done.

Addressing suggestions.

anna added a subscriber: anna.May 18 2017, 2:39 PM
anna added inline comments.
docs/LangRef.rst
13604–13605

Nit: if and only if stores to the destination buffer *are*

skatkov added inline comments.May 18 2017, 9:08 PM
include/llvm/IR/IntrinsicInst.h
385

ok, but to me it is one of a primary goal for setter to check incoming args.

lib/IR/Verifier.cpp
3991

Use getters?

dneilson updated this revision to Diff 99618.May 19 2017, 1:27 PM
dneilson marked 3 inline comments as done.

Addressing some comments -- use of getters, adding assertions to setters, and some minor wording changes to LangRef.

skatkov added inline comments.May 21 2017, 8:07 PM
lib/IR/Verifier.cpp
3992

Extra semicolon

dneilson updated this revision to Diff 100415.May 26 2017, 8:44 AM
  • Another iteration on the intrinsic prototype. I've removed the align, and dest/src_unordered arguments.
    • Having align both as arg attributes and as an arg could cause challenges if we need to resolve a difference, and it is the desired future direction for intrinsics.
    • Upon further thought, and digging into where passes would have to be made aware of this intrinsic -- I'm no longer convinced about the value of the separate dest_unordered/src_unordered args.
      • It seems sufficient to have the semantics of the intrinsic being that all loads/stores are unordered atomic; we can still "promote" idioms that mix, say, unordered loads with simple stores.
      • Any library implementation will just use unordered atomic loads & stores throughout, anyways.
      • There will be a side-effect of promoting simple ops to unordered-atomic if we recognize a loop idiom, and then later lower it into loads/stores. The tradeoff is that it should be easier to work with the intrinsic in passes.
      • The only value that I can see in having the separate dest_unordered/src_unordered args is that in lowering passes that change the intrinsic into explicit loads/stores we wouldn't "promote" a simple op into an unordered-atomic op.
  • Added in updating the InstCombine lowering of the intrinsic so that the change doesn't lose functionality; even temporarily.
efriedma added inline comments.May 26 2017, 11:41 AM
docs/LangRef.rst
13573

Hmm... I didn't mention isvolatile earlier? See https://reviews.llvm.org/D27133?id=79305#629884 for original discussion of it. At the very least, we need a better description of what it means.

skatkov added inline comments.May 28 2017, 10:49 PM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
4887–4888

MI.getLength() == Length

lib/Transforms/InstCombine/InstCombineCalls.cpp
110

assert LengthInBytes % ElementSizeInBytes == 0 and LengthInBytes > 0?

Did a scan through the code, didn't spot anything major. Once we settle the last few design/specification questions, this looks basically ready to go in.

docs/LangRef.rst
13573

I think we can just remove this. The original motivation was essentially future proofing, and I don't think it's worth keeping the complexity for now. We can change our minds later if it turns out we actually need this.

lib/Transforms/InstCombine/InstCombineCalls.cpp
110

Agreed. Also, length might actually be zero. We should remove such calls.

1900

Hm, I might sink this into the helper function. Optional, and can be submitted separately without further review.

dneilson added inline comments.May 31 2017, 12:29 PM
docs/LangRef.rst
13573

Agreed. I'll remove it.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
4887–4888

Not quite. Doesn't seem to be a straightforward way to go from an SDValue to a Type*, so I don't think this sort of replacement can be made.

lib/Transforms/InstCombine/InstCombineCalls.cpp
110

Re: The assert. I think that it would be better to check this in the verifier. In the LangRef we've said that it's undefined behaviour if length isn't a multiple of element size, so I think it's okay to blindly do this divide here and add a check to the verifier.

Re: Zero length; see the in-line comment below.

1900

This is following the same pattern/code-flow as the normal memcpy/memmove/memset handlers just above this. i.e. Check for a null length -- if there is one, then remove the call, else call the simplify method for the intrinsic. I'm inclined to stick to this pattern to make the later merging of the introspection classes cleaner.

dneilson updated this revision to Diff 100922.May 31 2017, 2:33 PM
  • Remove volatile arg from intrinsic.
  • Add check to verifier to ensure that constant length is a multiple of element size & add corresponding test.
reames accepted this revision.Jun 5 2017, 7:07 PM

LGTM w/one comment addressed before submission.

lib/Transforms/InstCombine/InstCombineCalls.cpp
116

Where did this check come from and why is it needed? It looks like an attempt to handle a length which isn't an even interval of element size, but the verifier should reject that?

This revision is now accepted and ready to land.Jun 5 2017, 7:07 PM
dneilson added inline comments.Jun 6 2017, 6:29 AM
lib/Transforms/InstCombine/InstCombineCalls.cpp
116

Just me being extra cautious. The verifier checks for the case where constant length is not a multiple of element size, and the zero length case is handled elsewhere. However, I'm not sure that the verifier runs after every single pass. So, I figure there's no harm in handling the corner case.

dneilson updated this revision to Diff 101605.Jun 6 2017, 1:03 PM
  • Loop idiom patch was dropped, so update loop idiom recognition as well.
dneilson added inline comments.Jun 7 2017, 7:57 AM
include/llvm/IR/IntrinsicInst.h
197

I'm inclined to change this name to 'EUAMemcpyInst' to cut down on the length of its name. Any objections?

This revision was automatically updated to reflect the committed changes.