Expanding the loop idiom test for memcpy to also recognize unordered atomic memcpy. The only difference between recognizing an unordered atomic memcpy and a normal memcpy is that the loads and/or stores involved are unordered atomic operations.
reames anna skatkov
- rGb2a212c070d9: [Atomics][LoopIdiom] Recognize unordered atomic memcpy
- rG056c009f1b0f: [Atomics][LoopIdiom] Recognize unordered atomic memcpy
- rL304806: [Atomics][LoopIdiom] Recognize unordered atomic memcpy
- rL304310: [Atomics][LoopIdiom] Recognize unordered atomic memcpy
Looking at the code in LoopIdiomRecognize::collectStores, it might be better to refactor it first. Firstly, there are a couple of TODOs about adding more patterns, such as memcmp, memmove, etc.
So, having something like isLegalStore(SI, ForMemset, ForMemsetPattern, ForMemcpy) seems messy. Adding something like an enum of the patterns we are recognizing against would be useful.
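Roughly what I have in mind, as a sketch (all names here are hypothetical stand-ins, not the eventual LLVM ones, and the classification logic is just illustrative):

```cpp
#include <cassert>

// Hypothetical sketch: classify a store by the single idiom it can
// legally participate in, instead of threading several bool out-params
// through isLegalStore.
enum class LegalStoreKind {
  None,                  // not usable by any idiom
  Memset,
  MemsetPattern,
  Memcpy,
  UnorderedAtomicMemcpy, // only the memcpy idiom supports unordered atomics
};

// Stand-in for the real StoreInst queries; the booleans model
// isSimple() / unordered-atomic as discussed in this review.
LegalStoreKind classifyStore(bool IsSimple, bool IsUnorderedAtomic,
                             bool HasLoadOperand) {
  if (!IsSimple && !IsUnorderedAtomic)
    return LegalStoreKind::None; // volatile or stronger ordering
  if (HasLoadOperand)            // load/store pair -> memcpy shape
    return IsUnorderedAtomic ? LegalStoreKind::UnorderedAtomicMemcpy
                             : LegalStoreKind::Memcpy;
  // memset / memset_patternN require simple stores only.
  return IsSimple ? LegalStoreKind::Memset : LegalStoreKind::None;
}
```

With a return value like this, the caller can switch over exactly one kind, which also removes the implicit ordering between the pattern checks.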
Nit: extra semicolon.
This predicate is really confusing. isSimple = !isAtomic && !isVolatile
Can you please add a comment here stating that this is not supported for memset and memset_patternN.
Actually, there are two booleans (RecogUnorderedAtomicMemcpy and ForUnorderedAtomic) and they are being used for different purposes throughout this function. Could you perhaps do an early bail-out and state the requirements at the start of the function?
Also, add an assert at the caller of isLegalStore that unordered atomic stores are legal only for memcpy.
Nit: end with period.
Nit: no need of braces for single line else.
Marking as needing changes, based on comments inline.
Daniel, as mentioned in a previous comment: it might be better to start off with a refactoring NFC patch which would make the code that uses isLegalStore cleaner. First, instead of a bool, return the pattern for which this is valid. Creating an enum would be good: it states clearly which patterns are currently supported.
The caller of isLegalStore collects these stores based on the pattern, but there is no check that exactly one pattern is selected. Also there is an implicit ordering on which pattern is checked.
Once the NFC is reviewed and checked in, coming back to this patch and cleaning up the booleans and the various checks in isLegalStore might be easier.
I'm removing the option entirely in the next diff.
It's the double negation... always confusing.
!ForUnordered && !isSimple
Which is exactly what's wanted here.
One (RecogUnorderedAtomicMemcpy) was a command-line arg to turn on the idiom recognition for unordered atomic memcpy. I've just removed the option entirely.
I would prefer to invert the if statement.
It is simpler to follow if the short case is handled first.
Some comments inline.
We can do two things here. One: add a hidden cl option for the size, defaulting to 16 for now, instead of the hardcoded value. Two: take a look at whether TTI has the target-specific information you need. Perhaps getRegisterBitWidth?
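The second option could look roughly like this; the function name and the divide-by-8 derivation are assumptions based on the later discussion in this thread (getRegisterBitWidth()/8 with a 32-bit default), not a committed API:

```cpp
#include <cstdint>

// Hypothetical derivation of the largest legal element size (in bytes)
// from the target's scalar register width in bits. RegBitWidth would
// come from something like TTI::getRegisterBitWidth().
uint64_t maxAtomicElementSizeBytes(uint64_t RegBitWidth) {
  return RegBitWidth / 8; // e.g. 32-bit registers -> 4-byte elements
}
```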
If the spec requires atomic loads and stores to have an alignment, shouldn't this be an assert, rather than a check and return?
Please add a test for memset_patternN not being recognized as well. These tests should change once support is added for both memsets.
I've changed this to query the runtime lib for whether an unordered atomic memcpy with the given store size exists. If it does, then we can create the memcpy; else we can't because we won't be able to lower it into a lib call.
The spec requires atomic loads/stores to have an alignment, but has no such requirement for non-atomic loads/stores. It's possible that one of source or dest will be non-atomic. I'll clarify the comment.
Very close, but needs changes. In particular, bug around volatile access.
This is missing the case where an instruction is both unordered atomic and volatile. Please add a fix and a test case.
I'd strongly recommend rewriting this as:
This is incorrect. We can't allow a misaligned atomic memcpy. If we can't ensure the load and store is sufficiently aligned, we must reject the transform. Fix and add test please.
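A minimal sketch of the kind of guard being asked for (names are hypothetical; the exact alignment rule is an assumption here, namely that both sides must be aligned to at least a power-of-two element size):

```cpp
#include <cstdint>

// Hypothetical guard: only allow the transform when both the load and
// the store are provably aligned to the element size, which must be a
// power of two. A misaligned unordered atomic memcpy must be rejected.
bool isLegalAtomicElementAccess(uint64_t ElementSizeBytes,
                                uint64_t LoadAlign, uint64_t StoreAlign) {
  // Element size must be a non-zero power of two.
  if (ElementSizeBytes == 0 ||
      (ElementSizeBytes & (ElementSizeBytes - 1)) != 0)
    return false;
  // Both sides must be at least element-aligned.
  return LoadAlign >= ElementSizeBytes && StoreAlign >= ElementSizeBytes;
}
```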
Please add at least one test case for each element size in (2, 4, 8).
isUnordered() precludes volatile. From StoreInst:
So, the case isn't missing.
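For reference, a self-contained model of the predicate being quoted; the real definitions live on LoadInst/StoreInst in LLVM's Instructions.h, and the types here are simplified stand-ins:

```cpp
// Simplified stand-in for llvm::AtomicOrdering.
enum class AtomicOrdering { NotAtomic, Unordered, Monotonic, Release };

// Simplified stand-in for llvm::StoreInst, modeling the semantics
// discussed here: "simple" means neither atomic nor volatile, while
// "unordered" also admits unordered atomics but still excludes
// volatile accesses -- so isUnordered() precludes volatile.
struct StoreModel {
  AtomicOrdering Ordering = AtomicOrdering::NotAtomic;
  bool Volatile = false;

  bool isSimple() const {
    return Ordering == AtomicOrdering::NotAtomic && !Volatile;
  }
  bool isUnordered() const {
    return (Ordering == AtomicOrdering::NotAtomic ||
            Ordering == AtomicOrdering::Unordered) &&
           !Volatile;
  }
};
```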
True enough... wouldn't hurt to make it more explicit. I'm thinking:
Funny enough, I caught this when updating the version of this patch that sits on top of the changed atomic memcpy intrinsic; I just didn't port it to this version. Will do that...
The patch was reverted in rL304315 because of an undefined reference to RTLIB::getMEMCPY_ELEMENT_ATOMIC in Polly (http://lab.llvm.org:8011/builders/polly-arm-linux/builds/5582) and in MinGW builds. Not sure whether the other builds link RTLIB; clang seemed to pass fine (http://lab.llvm.org:8011/builders/clang-x86_64-debian-fast/builds/4373).
- Remove use of RTLIB in loop idiom. It results in a circular dependence between libs, so we can't use it there. Instead of querying RTLIB for allowable store sizes, we query the scalar register bit width from TTI.
This is the replacement due to the inability to use RTLIB in ScalarOpt because of a circular dependency. It's not ideal, as it's theoretically possible that wider versions of the intrinsic's lib call will exist (e.g. ones implemented to use vector registers for the loads/stores), but we won't be able to exploit any that are wider than the scalar register width.
Is it possible that even if StoreSizeBits is within the max register bit width for the target arch, we do not have the corresponding lib call?
Theoretically it's possible, yes. Lib calls for sizes 1, 2, 4, 8, and 16 bytes are currently defined. If a platform doesn't have a definition available for one of those sizes and we create the call, then the result will be a link error -- which seems to be intentional per the discussion around the design of the element atomic memcpy. If scalar registers wider than 16 bytes are available on the platform, then we'd end up creating the intrinsic call and then blowing up when lowering it to a libcall in SelectionDAGBuilder -- this is the part I'd like to avoid.
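A sketch of the size check being described, with a hypothetical name (this is only modeling the set of defined sizes from the discussion, not any actual RTLIB/TTI API):

```cpp
#include <cstdint>

// Hypothetical check: element-atomic memcpy lib calls are defined for
// element sizes of 1, 2, 4, 8, and 16 bytes. Creating the intrinsic
// for any other size would blow up at lowering time, so reject it
// up front in the recognizer.
bool hasElementAtomicMemcpyLibcall(uint64_t ElementSizeBytes) {
  switch (ElementSizeBytes) {
  case 1: case 2: case 4: case 8: case 16:
    return true;
  default:
    return false;
  }
}
```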
TLI seems to be concerned with system lib calls (libc, libm, etc), and not the __llvm_ lib calls, so I don't think that's a suitable place for this.
Good point, zero would seem safer... I put in 4 because my original intent was to use getRegisterBitWidth()/8, and the default register bit width is 32 -- I couldn't do that because the call here would get bound to getRegisterBitWidth() in this base class and would never see the platform-specific value.
So, again, it may be the case that getRegisterBitWidth()/8 does not have the corresponding libcall defined for the target. I have a mild preference for leaving it to the target to confirm they have valid definitions for these libcall sizes (i.e. having 0 as the default in the base definition).
FYI: the committed patch (rL304806) caused failures on multiple targets (it passed on X86, PPC, and s390), since the unordered-atomic-memcpy.ll test was in the common LoopIdiom directory.
Discussing with echristo on IRC: specifying the target triple as X86 is not enough. The test would still be run on targets where the X86 backend is not compiled in.
I've checked in the fix (rL304809), which moves the file to the X86-specific subdirectory, where the lit.cfg ensures the test is not run for any other target.