Phabricator

AArch64: use ldp/stp for atomic & volatile 128-bit where appropriate.
Needs ReviewPublic

Authored by t.p.northover on Sep 12 2019, 2:38 AM.

Details

Reviewers
labrinea
Summary

From v8.4a onwards, aligned 128-bit ldp and stp instructions are guaranteed to be single-copy atomic, so they are going to be a lot more efficient than the CAS loop we used to implement "load atomic" and "store atomic" before, even if we do need a DMB sometimes. Additionally, some earlier CPUs had this property anyway, so it makes sense to use it there too.

Finally, even before this guarantee there are machine-specific circumstances where a 128-bit ldp/stp makes sense but doesn't really fit an atomic profile. So this change extends the selection to volatile accesses, even unaligned ones (presumably the coder knows what they're doing). The one exception for volatile is when -mstrict-align is in force; that should take precedence.

Diff Detail

Event Timeline

t.p.northover created this revision. Sep 12 2019, 2:38 AM

Hi Tim, thanks for looking into this optimization opportunity. I have a few remarks regarding this change:

  • First, it appears that the current codegen (CAS loop) for 128-bit atomic accesses is broken, based on this comment: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70814#c3. There are two problematic cases as far as I understand: (1) const and (2) volatile atomic objects. Const objects disallow write access to the underlying memory, while volatile objects mandate that each byte of the underlying memory shall be accessed exactly once according to the AAPCS. The CAS loop violates both.
  • Maybe the solution is to follow GCC here, at least for the general case (architectures prior to v8.4), meaning to expand atomic operations into a call to the appropriate __atomic library function. I believe this is what the AtomicExpandPass does in LLVM. In that case we need to provide an implementation of those functions.
  • My concern about using ldp/stp is that the specification promises single-copy atomicity provided that accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory. Is there such a guarantee to the compiler? Maybe that's the default for Operating Systems but not for bare-metal? Would it make sense to enable this optimization for certain target triples, i.e. when sys != none?
  • In certain code sequences where you have two consecutive atomic stores, or an atomic load followed by an atomic store, you'll end up with redundant memory barriers. Is there a way to get rid of them?
  • Enabling the corresponding subtarget feature on cyclone doesn't seem safe to me. If we ever implement -mtune, then a command line like clang -march=armv8a -mtune=cyclone should mean "generate correct code for the v8.0 architecture, but optimize for cyclone". Adding this feature to cyclone as is would probably result in the above command line producing code that isn't architecturally correct for v8.0.

First, it appears that the current codegen (CAS loop) for 128-bit atomic accesses is broken, based on this comment: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70814#c3. There are two problematic cases as far as I understand: (1) const and (2) volatile atomic objects. Const objects disallow write access to the underlying memory, while volatile objects mandate that each byte of the underlying memory shall be accessed exactly once according to the AAPCS. The CAS loop violates both.

Maybe the solution is to follow GCC here, at least for the general case (architectures prior to v8.4), meaning to expand atomic operations into a call to the appropriate __atomic library function. I believe this is what the AtomicExpandPass does in LLVM. In that case we need to provide an implementation of those functions.

I think Clang is involved there too, in horribly non-obvious ways (for example I think that's the only way to get the actual libcalls you want rather than legacy ones). Either way, that's a change that would need pretty careful coordination. Since all of our CPUs are Cyclone or above we could probably just skip the libcalls entirely at Apple without ABI breakage (which, unintentionally, is what this patch does).

My concern about using ldp/stp is that the specification promises single-copy atomicity provided that accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory. Is there such a guarantee to the compiler? Maybe that's the default for Operating Systems but not for bare-metal? Would it make sense to enable this optimization for certain target triples, i.e. when sys != none?

I don't think anyone has written down a guarantee, but we've pretty much always assumed we're accessing reasonably normal memory. dmb instructions are always ish for example. I've never had any comments from our more embedded developers on that front (or seen anyone try to do general atomics in another realm). I suspect they go to assembly for the few spots it might matter.

In certain code sequences where you have two consecutive atomic stores, or an atomic load followed by an atomic store, you'll end up with redundant memory barriers. Is there a way to get rid of them?

ARM has a pass designed to merge adjacent barriers, though I've seen it miss some cases. We might think about porting it to AArch64, or maybe doing some work in AtomicExpansion in a generic way.

Enabling the corresponding subtarget feature on cyclone doesn't seem safe to me. If we ever implement -mtune, then a command line like clang -march=armv8a -mtune=cyclone should mean "generate correct code for the v8.0 architecture, but optimize for cyclone". Adding this feature to cyclone as is would probably result in the above command line producing code that isn't architecturally correct for v8.0.

I don't think Clang really does anything with -mtune yet. The most it could do based on the way CPUs are implemented in LLVM now would be something like applying the scheduling model. Almost all of the features in the list are going to break older CPUs.

Enabling the corresponding subtarget feature on cyclone doesn't seem safe to me. If we ever implement -mtune, then a command line like clang -march=armv8a -mtune=cyclone should mean "generate correct code for the v8.0 architecture, but optimize for cyclone". Adding this feature to cyclone as is would probably result in the above command line producing code that isn't architecturally correct for v8.0.

I don't think Clang really does anything with -mtune yet. The most it could do based on the way CPUs are implemented in LLVM now would be something like applying the scheduling model. Almost all of the features in the list are going to break older CPUs.

I'm afraid I mentioned to Alexandros that I wondered how this would interact with a potential future enabling for an -mtune feature, leading to the above question.
But you're right, if we do end up implementing support for -mtune, we'd need to categorize subtarget features into either architectural or tuning features, and only enable the tuning features for a subtarget when -mtune is used. Clearly, FeatureAtomicLDPSTP is an architectural feature. So this part of the patch is just fine, I think.
Sorry for the noise.

I think Clang is involved there too, in horribly non-obvious ways (for example I think that's the only way to get the actual libcalls you want rather than legacy ones). Either way, that's a change that would need pretty careful coordination. Since all of our CPUs are Cyclone or above we could probably just skip the libcalls entirely at Apple without ABI breakage (which, unintentionally, is what this patch does).

I am not sure I am following here. According to https://llvm.org/docs/Atomics.html the AtomicExpandPass will translate atomic operations on data sizes above MaxAtomicSizeInBitsSupported into calls to atomic libcalls. The docs say that even though the libcalls share the same names with clang builtins, they are not directly related to them. Indeed, I hacked the AArch64 backend to disallow codegen for 128-bit atomics and as a result LLVM emitted calls to __atomic_store_16 and __atomic_load_16. Are those legacy names? I also tried emitting IR for the clang builtins and I saw atomic load/store IR instructions (like those in your tests), no libcalls. Anyhow, my concern here is that if sometime in the future we replace the broken CAS loop with a libcall, the current patch will break ABI compatibility between v8.4 objects with atomic ldp/stp and v8.X objects without the extension. Moreover, this ABI incompatibility already exists between objects built with LLVM and GCC. Any thoughts?

I don't think anyone has written down a guarantee, but we've pretty much always assumed we're accessing reasonably normal memory. dmb instructions are always ish for example. I've never had any comments from our more embedded developers on that front (or seen anyone try to do general atomics in another realm). I suspect they go to assembly for the few spots it might matter.

Fair enough.

ARM has a pass designed to merge adjacent barriers, though I've seen it miss some cases. We might think about porting it to AArch64, or maybe doing some work in AtomicExpansion in a generic way.

Sounds good. On that note, can you add some test cases (like the sequences I mentioned) with a FIXME label as motivating examples for such an optimization in future work?