This is an archive of the discontinued LLVM Phabricator instance.

[InstCombineCalls] Unfold element atomic memcpy instruction
ClosedPublic

Authored by igor-laevsky on Jan 19 2017, 9:41 AM.

Details

Summary

This change teaches instruction combiner on how to unfold element atomic memcpy intrinsic into sequence of explicit atomic memory operations.

Also I added new IntrinsicInst subclass to provide user friendly interface for the element atomic memcpy intrinsic. In the future we want to integrate this new class into MemIntrinsic hierarchy which will allow us to easily write optimizations covering both atomic and non-atomic versions. However as a first step I decided to frame it as a separate entity and refactor it later after we determine which use cases generalised interface should cover.

Diff Detail

Repository
rL LLVM

Event Timeline

igor-laevsky created this revision.Jan 19 2017, 9:41 AM
reames requested changes to this revision.Jan 23 2017, 8:05 PM
reames added inline comments.
include/llvm/IR/IntrinsicInst.h
155 ↗(On Diff #84985)

minor: instead of "class handles" use "represents"

lib/Transforms/InstCombine/InstCombineCalls.cpp
63 ↗(On Diff #84985)

Units? The name should ideally reflect.

119 ↗(On Diff #84985)

Please use the same convention used for the normal memcpy. Lower the atomic memcopy, set the length of the original to zero, then remove the zero length one. This also requires you to implement a currently missed optimization for zero length copies.

131 ↗(On Diff #84985)

Why 16 elements as opposed to 8 bytes? (i.e. what the non-atomic form uses)

167 ↗(On Diff #84985)

The source alignment may not hold for all elements. i.e. your alignment might be 1024 with your element size being 8.

This revision now requires changes to proceed.Jan 23 2017, 8:05 PM
igor-laevsky edited edge metadata.
igor-laevsky marked 4 inline comments as done.
igor-laevsky added inline comments.
lib/Transforms/InstCombine/InstCombineCalls.cpp
131 ↗(On Diff #84985)

This transformation is not exactly similar to the non-atomic unfolding. Non atomic version only unfolds into single load/store pair. Here we can generate more operations. Number 16 means that no more than 16 load/store pairs will be generated. I picked it as an arbitrary small integer but we can easily tune it afterwards.

167 ↗(On Diff #84985)

Yes, thanks for the catch!

sanjoy accepted this revision.Feb 4 2017, 10:10 PM

lgtm with nits inline

include/llvm/IR/IntrinsicInst.h
171 ↗(On Diff #85597)

No need to assert on isa -- that assert should implicitly happen during cast<>

lib/Transforms/InstCombine/InstCombineCalls.cpp
66 ↗(On Diff #85597)

I'd rephrase this slightly as "Maximum number of elements in atomic memcpies the optimizer is allowed to unfold".

(i.e. no need to get into details -- this is a hidden option that only people who know what they're doing should use)

118 ↗(On Diff #85597)

clang-format?

123 ↗(On Diff #85597)

Minor: use auto * here.

133 ↗(On Diff #85597)

s/ElementSize/ElementSizeInBits/

134 ↗(On Diff #85597)

There is a DataLayout in InstCombiner.

166 ↗(On Diff #85597)

I'd also prefer adding an assert here that checks what you said in the comment.

170 ↗(On Diff #85597)

Note: you can do better here -- e.g. if the source is aligned to 16 bytes, and the element size is 4 bytes, then instead the alignment being 16, 4, 4, 4, ... it can be 16, 4, 8, 4, 16, ... etc.

For now I'll just leave a TODO.

182 ↗(On Diff #85597)

Is it difficult to delete AMI right here? If not, I'd just do that -- no point in waiting for another iteration if it is obvious what will happen.

1543 ↗(On Diff #85597)

This is really a separate transform -- I'd prefer splitting it out and landing it independently.

This revision was automatically updated to reflect the committed changes.
igor-laevsky marked 7 inline comments as done.Feb 8 2017, 6:56 AM
lib/Transforms/InstCombine/InstCombineCalls.cpp
166 ↗(On Diff #85597)

Proper checks of this invariants are quite verbose and will clobber the actual logic. I would prefer to rely on the verifier in this case.

182 ↗(On Diff #85597)

Philip suggested that I should stick to the same process as other memory intrinsic optimizations do. I don't have a strong opinion here, but I would go with Philip's suggestion since I already implemented it.