[Atomics][LoopIdiom] Recognize unordered atomic memcpy
ClosedPublic

Authored by dneilson on May 16 2017, 9:20 AM.

Details

Summary

Expanding the loop idiom test for memcpy to also recognize unordered atomic memcpy. The only difference for recognizing an unordered atomic memcpy and instead of a normal memcpy is that the loads and/or stores involved are unordered atomic operations.

Background: http://lists.llvm.org/pipermail/llvm-dev/2017-May/112779.html

Diff Detail

Repository
rL LLVM
There are a very large number of changes, so older changes are hidden. Show Older Changes
anna added a comment.May 18 2017, 2:38 PM

Looking at the code in LoopIdiomRecognize::collectStores, it might be better to refactor it first. Firstly, there's couple of TODOs regarding adding more patterns such as memcmp, memmove etc.

So, having something like this : isLegalStore(SI, ForMemset, ForMemsetPattern, ForMemcpy) seems messy. Adding something like an enum of patterns we are recognizing against, will be useful.

lib/Transforms/Scalar/LoopIdiomRecognize.cpp
354 ↗(On Diff #99477)

Nit: extra semi colon.

356 ↗(On Diff #99477)

This predicate is really confusing. isSimple = !isAtomic && !isVolatile
I'm not even sure if isSimple and ForUnorderedAtomic might negate each other.

399 ↗(On Diff #99477)

Can you pls add a comment here stating that this is not supported for memset and memset_patternN.

427 ↗(On Diff #99477)

Actually, there are 2 booleans (RecogUnorderedAtomicMemcpy and ForUnorderedAtomic ) and they are being used for different purposes throughout this function. Could you perhaps do an early bail out and state the requirements at the start of the function?

Also, add an assert at the caller of isLegalStore that unordered atomic stores was legal only for memcpy.

995 ↗(On Diff #99477)

Nit: end with period.

1006 ↗(On Diff #99477)

Nit: no need of braces for single line else.

anna requested changes to this revision.May 19 2017, 6:45 AM

Marking as needing changes, based on comments inline.

Daniel, as mentioned in previous comment: it might be better to start off with a refactoring NFC patch which would make the code which uses isLegaslStore cleaner. First, instead of bool, return the pattern for which this is valid. Creating an enum would be good - it states clearly which patterns are currently supported.
The caller of isLegalStore collects these stores based on the pattern, but there is no check that exactly one pattern is selected. Also there is an implicit ordering on which pattern is checked.
Once the NFC is reviewed and checked in, coming back to this patch and cleaning up the booleans and the various checks in isLegalStore might be easier.

This revision now requires changes to proceed.May 19 2017, 6:45 AM
dneilson marked 6 inline comments as done.May 19 2017, 11:03 AM
dneilson added inline comments.
lib/Transforms/Scalar/LoopIdiomRecognize.cpp
356 ↗(On Diff #99477)

It's the double negation... always confusing.

!ForUnordered && !isSimple
=> !(ForUnordered || isSimple)
=> !( (isAtomic && isUnordered) || isSimple)
=> !( (isAtomic && (isSimple || unordered-atomic)) || isSimple )

Which is exactly what's wanted here.

427 ↗(On Diff #99477)

One (RecogUnorderedAtomicMemcpy) was a command-line arg to turn on the idiom recognition for unordered atomic memcpy. I've just removed the option entirely.

84 ↗(On Diff #99205)

I'm removing the option entirely in the next diff.

dneilson updated this revision to Diff 99610.May 19 2017, 12:14 PM
dneilson marked 2 inline comments as done.

Addressing most concerns.

Remaining: Find a way to query a platform-specific maximum store-size for converting into an unordered memcpy.

skatkov added inline comments.May 22 2017, 11:18 PM
lib/Transforms/Scalar/LoopIdiomRecognize.cpp
991 ↗(On Diff #99610)

I would prefer to revert the if statement.
if SI and LI are not atomic then memcpy otherwise your code.

It is simpler to follow if the short case is handled first.

dneilson marked an inline comment as done.May 23 2017, 8:01 AM
dneilson added inline comments.
lib/Transforms/Scalar/LoopIdiomRecognize.cpp
991 ↗(On Diff #99610)

Sure

anna requested changes to this revision.May 23 2017, 11:56 AM

Some comments inline.

lib/Transforms/Scalar/LoopIdiomRecognize.cpp
995 ↗(On Diff #99610)

We can do 2 things here: Add a hidden cl option for the size that is default to 16 for now, instead of the hardcoded value.

You could take a look if TTI has that target specific information you need. Perhaps getRegisterBitWidth?

1004 ↗(On Diff #99610)

If the spec requires atomic loads and stores to have an alignment, shouldn't this be an assert, rather than a check and return?

test/Transforms/LoopIdiom/unordered-atomic-memcpy.ll
79 ↗(On Diff #99610)

please add a test for memset_patternN not being recognized as well. These tests should change once support is added for both memsets.

This revision now requires changes to proceed.May 23 2017, 11:56 AM
dneilson marked 3 inline comments as done.May 24 2017, 7:42 AM
dneilson added inline comments.
lib/Transforms/Scalar/LoopIdiomRecognize.cpp
995 ↗(On Diff #99610)

I've changed this to query the runtime lib for whether an unordered atomic memcpy with the given store size exists. If it does, then we can create the memcpy; else we can't because we won't be able to lower it into a lib call.

1004 ↗(On Diff #99610)

Spec requires atomic loads/stores to have alignment, but has no such requirement for non-atomic loads/stores. It's possible that one of source or dest will be non-atomic. I'll clarify the comment.

dneilson updated this revision to Diff 100088.May 24 2017, 7:45 AM
dneilson marked an inline comment as done.
  • Refining logic regarding which store sizes to allow for the unordered atomic memcpy.
  • Add some additional tests -- simple loads/stores with no alignment, and memset_pattern
anna added a comment.May 25 2017, 6:14 AM

These changes look good to me. Prefer @reames to specifically LGTM this.

dneilson updated this revision to Diff 100259.May 25 2017, 9:39 AM
dneilson edited the summary of this revision. (Show Details)
  • Remove dependence on D33240 by expanding to the current, existing, form of atomic.memcpy.

LGTM as well.

anna added a comment.May 26 2017, 7:05 AM
  • Remove dependence on D33240 by expanding to the current, existing, form of atomic.memcpy.

Do you mean the intrinsic that was already added? If so, could you please run make check all to confirm that all tests pass.

In D33243#765659, @anna wrote:
  • Remove dependence on D33240 by expanding to the current, existing, form of atomic.memcpy.

Do you mean the intrinsic that was already added? If so, could you please run make check all to confirm that all tests pass.

Yes. This is overtop of the intrinsic that's already in-tree. I've run both check & check-all without issue.

reames requested changes to this revision.May 26 2017, 7:54 PM

Very close, but needs changes. In particular, bug around volatile access.

lib/Transforms/Scalar/LoopIdiomRecognize.cpp
359 ↗(On Diff #100259)

This is missing the case where a instruction is both unordered atomic and volatile. Add fix and test case please.

361 ↗(On Diff #100259)

I'd strongly recommend rewriting this as:
if (Volatile) return None;
if (isAtomic() && Ordering != unordered) return None;

404 ↗(On Diff #100259)

add: (yet)

1016 ↗(On Diff #100259)

This is incorrect. We can't allow a misaligned atomic memcpy. If we can't ensure the load and store is sufficiently aligned, we must reject the transform. Fix and add test please.

test/Transforms/LoopIdiom/unordered-atomic-memcpy.ll
128 ↗(On Diff #100259)

Please add at least one test case for each element size in (2, 4, 8)

This revision now requires changes to proceed.May 26 2017, 7:54 PM
dneilson marked an inline comment as done.May 29 2017, 7:23 AM
dneilson added inline comments.
lib/Transforms/Scalar/LoopIdiomRecognize.cpp
359 ↗(On Diff #100259)

isUnordered() precludes volatile. From StoreInst:
393 bool isUnordered() const {
394 return (getOrdering() == AtomicOrdering::NotAtomic ||
395 getOrdering() == AtomicOrdering::Unordered) &&
396 !isVolatile();
397 }

So, the case isn't missing.

361 ↗(On Diff #100259)

True enough... wouldn't hurt to make it more explicit. I'm thinking:
if (Volatile) return None;
if (not Unordered) return None;

1016 ↗(On Diff #100259)

Funny enough, I caught this when updating the version of this patch that's overtop of the changed atomic memcpy intrinsic; just didn't port it to this version. Will do that...

test/Transforms/LoopIdiom/unordered-atomic-memcpy.ll
128 ↗(On Diff #100259)

Will do.

dneilson marked 4 inline comments as done.May 29 2017, 9:20 AM
dneilson updated this revision to Diff 100619.May 29 2017, 9:32 AM
  • Adding tests to check permutations of alignment validity.
  • Adding tests for element sizes beyond 1 byte.
  • Clarify some conditional rejections in isLegalStore()
reames accepted this revision.May 30 2017, 3:21 PM

LGTM

Anna, since Daniel doesn't yet have commit access would you mind landing this? I would do it myself, but can't commit the time to watch for any problems after commit.

lib/Transforms/Scalar/LoopIdiomRecognize.cpp
362 ↗(On Diff #100619)

I don't understand the note? Maybe just remove

anna added a comment.May 31 2017, 7:23 AM

LGTM

Anna, since Daniel doesn't yet have commit access would you mind landing this? I would do it myself, but can't commit the time to watch for any problems after commit.

Sure. @dneilson Could you please update the diff after addressing the comment inline, run clang-format on modified lines, and required tests.. I'll run some tests and check it in once that's done. Thanks.

dneilson marked an inline comment as done.May 31 2017, 8:07 AM
dneilson added inline comments.
lib/Transforms/Scalar/LoopIdiomRecognize.cpp
362 ↗(On Diff #100619)

The test is for 'isUnordered()', and the note aims to clarify that unordered implies (=>) simple.

dneilson updated this revision to Diff 100867.May 31 2017, 8:35 AM
dneilson marked an inline comment as done.
  • Rebase to ToT.
  • clang-format (no-op)

@anna I've verified that this passes 'make check'

This revision was automatically updated to reflect the committed changes.
anna added a comment.May 31 2017, 10:29 AM

Patch was reverted in rL304315 because of undefined reference to RTLIB::getMEMCPY_ELEMENT_ATOMIC in polly (http://lab.llvm.org:8011/builders/polly-arm-linux/builds/5582) and mingw builds. Not sure whether the other builds link RTLIB. clang seemed to pass fine (http://lab.llvm.org:8011/builders/clang-x86_64-debian-fast/builds/4373)

dneilson updated this revision to Diff 100907.May 31 2017, 12:16 PM
  • Remove use of RTLIB in loop idiom. Results in a circular dependence in libs, so we can't use it there. Instead of querying RTLIB for allowable store sizes, we query the scalar register bit width from TTI.
anna reopened this revision.May 31 2017, 12:19 PM
dneilson added inline comments.May 31 2017, 12:23 PM
lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1012 ↗(On Diff #100907)

This is the replacement due to the inability to use RTLIB in ScalarOpt due to a circular dependency. It's not ideal as it's theoretically possible that larger width versions of the intrinsic's lib call will exist (ex: ones that are implemented to use vector regs for the load/stores), but we won't be able to exploit any that are wider than scalar register width.

anna added inline comments.Jun 1 2017, 7:02 AM
lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1012 ↗(On Diff #100907)

Is it possible that even if the StoreSizeBits is within the max registerBitWidth for the target arch, we do not have the corresponding lib call?
Could you please check if TLI has the information you require or can be modified to do so - it seems to have memset, memcpy etc.

dneilson added inline comments.Jun 1 2017, 7:55 AM
lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1012 ↗(On Diff #100907)

Theoretically it's possible, yes. Lib calls for sizes 1, 2, 4, 8, and 16 bytes are currently defined. It a platform doesn't have one of those sizes's definition available and we create the call, then the result will be a link error -- which seems to be intentional per the discussion around the design of the element atomic memcpy. If scalar registers with more than 16 bytes are available on the platform, then we'd end up creating the intrinsic call and then blowing up when lowering it to a libcall in SelectionDAGBuilder -- this is the part I'd like to avoid.

TLI seems to be concerned with system lib calls (libc, libm, etc), and not the __llvm_ lib calls, so I don't think that's a suitable place for this.

dneilson updated this revision to Diff 101452.Jun 5 2017, 1:34 PM
  • Abstract the max element size for atomic memory intrinsics into the TTI.
reames added a comment.Jun 5 2017, 2:53 PM

LGTM again w/minor comment addressed before commit.

include/llvm/Analysis/TargetTransformInfoImpl.h
429 ↗(On Diff #101452)

Why 4? 0 would seem like a safer default.

reames accepted this revision.Jun 5 2017, 2:55 PM
dneilson added inline comments.Jun 6 2017, 6:31 AM
include/llvm/Analysis/TargetTransformInfoImpl.h
429 ↗(On Diff #101452)

Good point, zero would seem safer... I put in 4 because my original intent was to put in getRegisterBitWidth()/8 & the default reg bit width is 32 -- I couldn't do that because the call here would get bound to the getRegisterBitWidth() in this base class, and would never see the platform-specific value.

dneilson updated this revision to Diff 101563.Jun 6 2017, 7:45 AM
  • Default max element size to 0
anna added inline comments.Jun 6 2017, 7:51 AM
include/llvm/Analysis/TargetTransformInfoImpl.h
429 ↗(On Diff #101452)

So, again it maybe the case that getRegisterBitWidth()/8 may not have the corresponding libcall defined for the target. I have a mild preference to leave it to the target to confirm they have valid definitions for these libcall sizes (i.e. having 0 as the default in the base definition).

dneilson marked an inline comment as done.Jun 6 2017, 7:53 AM
anna accepted this revision.Jun 6 2017, 8:43 AM

Daniel, I had written my comment while you updated the diff :). This LGTM now. I'll land this upstream on your behalf.

This revision is now accepted and ready to land.Jun 6 2017, 8:43 AM
This revision was automatically updated to reflect the committed changes.
anna added a comment.Jun 6 2017, 10:57 AM

FYI: The committed patch (rL304806)caused failures in multiple targets (passed on X86, PPC and s390), since the unordered-atomic-memcpy.ll test was in the common LoopIdiom directory.
Discussing with echristo on IRC: Specifying target-triple as X86 is not enough. It would still be run on targets where the X86 backend is not compiled in.

I've checked in the fix (rL304809), which is to move the file to the X86 specific subdirectory, where the lit.cfg chooses that the test should not be run on any other target.