Page MenuHomePhabricator

[AArch64] Fix for bug 35094 atomicrmw on Armv8.1-A+lse
ClosedPublic

Authored by christof on Feb 18 2019, 5:50 AM.

Details

Summary

Fix for https://bugs.llvm.org/show_bug.cgi?id=35094

The Dead register definition pass should leave alone the atomicrmw
instructions on AArch64 (LTE extension). The reason is the following
statement in the Arm ARM:

The ST<OP> instructions, and LD<OP> instructions where the destination
register is WZR or XZR, are not regarded as doing a read for the purpose
of a DMB LD barrier.

A good example was given in the gcc thread by Will Deacon (linked in the
bugzilla ticket):

P0 (atomic_int* y,atomic_int* x) {
  atomic_store_explicit(x,1,memory_order_relaxed);
  atomic_thread_fence(memory_order_release);
  atomic_store_explicit(y,1,memory_order_relaxed);
}

P1 (atomic_int* y,atomic_int* x) {
  atomic_fetch_add_explicit(y,1,memory_order_relaxed);  // STADD
  atomic_thread_fence(memory_order_acquire);
  int r0 = atomic_load_explicit(x,memory_order_relaxed);
}

P2 (atomic_int* y) {
  int r1 = atomic_load_explicit(y,memory_order_relaxed);
}

My understanding is that it is forbidden for r0 == 0 and r1 == 2 after
this test has executed. However, if the relaxed add in P1 compiles to
STADD and the subsequent acquire fence is compiled as DMB LD, then we
don't have any ordering guarantees in P1 and the forbidden result could
be observed.

Diff Detail

Event Timeline

christof created this revision.Feb 18 2019, 5:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 18 2019, 5:50 AM

Are you sure that Bugzilla link is right?

I agree this is a problem according to the formal model, although it would be kind of weird to run into it (you can't tell whether the happens-before edge formed without reading the value at some point, so the code would have to load the result of the atomic operation at some point).

christof edited the summary of this revision. (Show Details)Feb 21 2019, 3:21 AM
christof marked an inline comment as done.Feb 21 2019, 3:36 AM

Are you sure that Bugzilla link is right?

Fixed. The bug number in the subject was the correct one.

I agree this is a problem according to the formal model, although it would be kind of weird to run into it (you can't tell whether the happens-before edge formed without reading the value at some point, so the code would have to load the result of the atomic operation at some point).

It might be an edge case, I'm not sure, to be honest. I'll go see if I can get some better idea of that. I though that it would be good to adhere to the formal model to prevent surprise, so if there is an example that breaks, we ought to prevent that. Is that up for debate in case this is an edge case that is unlikely to be hit?

lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
71

s/load/lose/

john.brawn added inline comments.Feb 21 2019, 5:07 AM
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
73

It doesn't especially matter, as it'll already get caught by atomicBarrierDroppedOnZero, but the acquire variants of these instructions (LDADDAW etc.) also aren't counted as performing a read when the destination register is zero.

Is that up for debate in case this is an edge case that is unlikely to be hit?

When it comes to the atomics in general, it's important to allow people to write code with the same performance characteristics as hand-written assembly, or else we force people will go around the compiler using inline asm. But the performance impact here should be basically zero for code that doesn't use an acquire fence, I think, so it shouldn't be an issue here.

christof updated this revision to Diff 188212.Feb 25 2019, 10:17 AM
christof marked 2 inline comments as done.
christof added inline comments.
lib/Target/AArch64/AArch64DeadRegisterDefinitionsPass.cpp
73

True. I'll add a remark that this is incomplete and I've left all the acquire variants to atomicBarrierDroppedOnZero(). If people wish to have this function complete, I can copy all the cases from atomicBarrierDroppedOnZero that apply and duplicate them here instead.

I was wondering if I cannot do this type of grouping in tablegen by leaving some label on an instruction. That way I could just test that label, rather than listing the subsets in yet another place. I'm not aware of such mechanism, unfortunately.

Is that up for debate in case this is an edge case that is unlikely to be hit?

When it comes to the atomics in general, it's important to allow people to write code with the same performance characteristics as hand-written assembly, or else we force people will go around the compiler using inline asm. But the performance impact here should be basically zero for code that doesn't use an acquire fence, I think, so it shouldn't be an issue here.

Fair enough. The loads increase the memory bus operations and in case you've got high register pressure you are using one register extra (with zero lifetime, but still). The performance impact is low. However, the programmer has asked for a load and the compiler has to proof that this cannot be observed before that load is removed, which is rather difficult when it can be used to order other memory operations.

This revision is now accepted and ready to land.Feb 26 2019, 7:58 AM