This is an archive of the discontinued LLVM Phabricator instance.

[WoA] Use fences for sequentially consistent stores/writes
ClosedPublic

Authored by mnadeem on Jan 13 2023, 8:48 PM.

Details

Summary

LLVM currently uses LDAR/STLR and variants for acquire/release
as well as seq_cst operations. This is fine as long as all code uses
this convention.

Normally LDAR/STLR act as one-way barriers, but when used in
combination they provide a sequentially consistent model: when an
LDAR appears after an STLR in program order, the STLR acts as a
two-way fence and the store will be observed before the load.

The problem is that a normal load (unlike an LDAR) that appears
after an STLR in program order can be observed before the STLR
(if my understanding is correct), possibly providing weaker than
expected guarantees when such loads are used for ordered atomic
operations.

Unfortunately, the Microsoft Visual Studio STL implements seq_cst
loads/stores using normal loads/stores plus explicit fences:
store: dmb ish + str + dmb ish
load:  ldr + dmb ish

This patch uses fences for the MSVC target whenever we write to
memory in a sequentially consistent way, so that we don't rely on
the assumption that just using LDAR/STLR will give us sequentially
consistent ordering.
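
As a concrete illustration of the mismatch described above (a sketch only; names are made up and exact codegen varies by compiler and version), here is a seq_cst store and load on a std::atomic, with the two lowerings the summary refers to shown as comments:

#include <atomic>

std::atomic<int> flag{0};

void writer() {
  // Current LLVM convention:  stlr
  // MSVC STL convention:      dmb ish + str + dmb ish
  flag.store(1, std::memory_order_seq_cst);
}

int reader() {
  // Current LLVM convention:  ldar
  // MSVC STL convention:      ldr + dmb ish
  return flag.load(std::memory_order_seq_cst);
}

The hazard arises when the two conventions meet within one thread's program order, e.g. a store lowered to STLR followed by a load implemented as a plain LDR + DMB ISH; the second litmus test later in this thread shows that this combination is not sequentially consistent.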

Diff Detail

Event Timeline

mnadeem created this revision. Jan 13 2023, 8:48 PM
Herald added a project: Restricted Project. · View Herald Transcript Jan 13 2023, 8:48 PM
mnadeem requested review of this revision. Jan 13 2023, 8:48 PM
mnadeem retitled this revision from [WoA] Use fences for sequentially consistent stores to [WoA] Use fences for sequentially consistent stores/writes. Jan 13 2023, 8:53 PM

The explanation makes sense.

Can we make this specifically apply to _Interlocked* functions, instead of all atomics on Windows targets? I'd prefer not to impose this performance penalty on other users of atomics if we can avoid it.

It looks like for atomic rmw ops, MSVC generates one barrier, not two; can we do the same?

The explanation makes sense.

Can we make this specifically apply to _Interlocked* functions, instead of all atomics on Windows targets? I'd prefer not to impose this performance penalty on other users of atomics if we can avoid it.

To be clear, this is about the ABI of std::atomic, and we are likely to change that when we get to break it Someday. Interlocked doesn't come with plain load/store ops (because, I assume, it was assumed plain volatiles were sufficient for that).

It looks like for atomic rmw ops, MSVC generates one barrier, not two; can we do the same?

Let me poke our backend folks...

It looks like if MSVC is implementing sequentially consistent atomic operations in the manner described, then we will need to handle stores (but not loads) in the same way, so this patch is doing the right thing. Reasoning below:

Using the memory model simulator at http://diy.inria.fr/www/?record=aarch64, the following input has three threads purely using LDAR/STLR

AArch64 SeqCst
{
0:X1=x; 1:X1=x; 2:X1=x;
0:X3=y; 1:X3=y; 2:X3=y;
}
 P0            | P1            | P2 ;
 MOV W0, #1    | MOV W2, #1    | LDAR W2, [X3] ;
 STLR W0, [X1] | STLR W2, [X3] | LDAR W0, [X1] ;
 LDAR W2, [X3] | LDAR W0, [X1] | ;
exists(2:X2=1 /\ 2:X0=0 /\ 0:X2=0 /\ 1:X0=1)

gives the result "No", which means it's not possible for P2 to see X2=1 and X0=0 if P0 sees X2=0 and P1 sees X0=1; in other words, using LDAR/STLR gives sequentially consistent behaviour.
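
For readers more comfortable with C++ than the litmus syntax, below is an illustrative rendering of the same test (not part of the patch), assuming each litmus location maps to a std::atomic<int> and each destination register to a plain int result variable; the r* names are made up for this sketch.

#include <atomic>

std::atomic<int> x{0}, y{0};
int r0_y, r1_x, r2_y, r2_x;

void p0() {
  x.store(1, std::memory_order_seq_cst);    // STLR W0, [X1]
  r0_y = y.load(std::memory_order_seq_cst); // LDAR W2, [X3]
}

void p1() {
  y.store(1, std::memory_order_seq_cst);    // STLR W2, [X3]
  r1_x = x.load(std::memory_order_seq_cst); // LDAR W0, [X1]
}

void p2() {
  r2_y = y.load(std::memory_order_seq_cst); // LDAR W2, [X3]
  r2_x = x.load(std::memory_order_seq_cst); // LDAR W0, [X1]
}

The "exists" clause asks whether, after running p0/p1/p2 concurrently, the outcome r2_y == 1 && r2_x == 0 && r0_y == 0 && r1_x == 1 is reachable; under seq_cst, and with the pure LDAR/STLR mapping, it is not.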

The following has thread P0 do an STLR followed by an MSVC-style sequentially consistent load

AArch64 SeqCst
{
0:X1=x; 1:X1=x; 2:X1=x;
0:X3=y; 1:X3=y; 2:X3=y;
}
 P0            | P1            | P2 ;
 MOV W0, #1    | MOV W2, #1    | LDAR W2, [X3] ;
 STLR W0, [X1] | STLR W2, [X3] | LDAR W0, [X1] ;
 LDR W2, [X3]  | LDAR W0, [X1] | ;
 DMB ISH       |               | ;
exists(2:X2=1 /\ 2:X0=0 /\ 0:X2=0 /\ 1:X0=1)

this gives the result "OK", meaning it is possible for the threads to observe that outcome, so we don't have sequentially consistent behaviour.

For the MSVC-style sequentially consistent store followed by LDAR

AArch64 SeqCst
{
0:X1=x; 1:X1=x; 2:X1=x;
0:X3=y; 1:X3=y; 2:X3=y;
}
 P0            | P1            | P2 ;
 MOV W0, #1    | MOV W2, #1    | LDAR W2, [X3] ;
 DMB ISH       |               | ;
 STR W0, [X1]  | STLR W2, [X3] | LDAR W0, [X1] ;
 DMB ISH       |               | ;
 LDAR W2, [X3] | LDAR W0, [X1] | ;
exists(2:X2=1 /\ 2:X0=0 /\ 0:X2=0 /\ 1:X0=1)

we get "No", so we still do get sequentially consistent behaviour in this case.

The explanation makes sense.

Can we make this specifically apply to _Interlocked* functions, instead of all atomics on Windows targets? I'd prefer not to impose this performance penalty on other users of atomics if we can avoid it.

IIUC we must add a fence after every sequentially consistent write to memory.

It looks like for atomic rmw ops, MSVC generates one barrier, not two; can we do the same?

They generate acquire/release instructions with implicit fences + one trailing explicit fence, while this patch generates normal monotonic st/ld, so two explicit fences are needed but the fences are outside the loop.
We could do something similar to MSVC if it helps performance. It would only involve adding a trailing fence while keeping the current acquire/release instructions.
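
Written out at the source level, the two expansion strategies being compared look roughly like the sketch below. This is only an illustration of fence placement (the function names are made up and the asm comments are approximate), not an exact C++ equivalence or the actual IR-level expansion.

#include <atomic>

// (a) What the current patch's expansion corresponds to: a monotonic
//     (relaxed) RMW loop bracketed by two explicit fences, both outside
//     the loop.
int fetch_add_two_fences(std::atomic<int> &obj, int v) {
  std::atomic_thread_fence(std::memory_order_seq_cst);    // leading dmb ish
  int old = obj.fetch_add(v, std::memory_order_relaxed);  // plain ldxr/stxr loop
  std::atomic_thread_fence(std::memory_order_seq_cst);    // trailing dmb ish
  return old;
}

// (b) The MSVC-like alternative: keep the acquire/release RMW (implicit
//     fences in ldaxr/stlxr) and add a single trailing explicit fence.
int fetch_add_one_fence(std::atomic<int> &obj, int v) {
  int old = obj.fetch_add(v, std::memory_order_acq_rel);  // ldaxr/stlxr loop
  std::atomic_thread_fence(std::memory_order_seq_cst);    // trailing dmb ish
  return old;
}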

I can think of a few possible ways to do this:

  1. Modify clang to add a seq_cst fence after the _Interlocked* intrinsics. I think we would also need to modify any other builtins that end up generating seq_cst writes to memory, so it might get messy.
  2. Add something like TLI->shouldInsertTrailingFenceForAtomic(I) to use in AtomicExpandPass and generate a trailing fence while keeping the original memory ordering (see the sketch after this list). With the current patch we set the ordering to Monotonic and thus also need a leading fence.
  3. Add a fence during lowering. Not sure how much effort this will take.
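
As a rough source-level sketch of what option 2 amounts to for a seq_cst store (the helper name is hypothetical and the real change would live in the backend, not in user code): keep the ordered store, which still lowers to STLR, and emit one trailing fence instead of bracketing a plain store with two fences.

#include <atomic>

void seq_cst_store_with_trailing_fence(std::atomic<int> &obj, int value) {
  obj.store(value, std::memory_order_seq_cst);           // still lowers to stlr
  std::atomic_thread_fence(std::memory_order_seq_cst);   // trailing dmb ish
}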

The explanation makes sense.

Can we make this specifically apply to _Interlocked* functions, instead of all atomics on Windows targets? I'd prefer not to impose this performance penalty on other users of atomics if we can avoid it.

IIUC we must add a fence after every sequentially consistent write to memory.

The scenario I wanted to avoid is that someone has an existing codebase that isn't using Microsoft's STL, and is working correctly, but we pessimize the performance to fix a bug that never affected it in the first place. But I guess there are two issues with that:

Maybe we can expose the optimal sequence as a command-line option for codebases that don't need ABI compatibility.

It looks like for atomic rmw ops, MSVC generates one barrier, not two; can we do the same?

They generate acquire/release instructions with implicit fences + one trailing explicit fence, while this patch generates normal monotonic st/ld, so two explicit fences are needed but the fences are outside the loop.
We could do something similar to MSVC if it helps performance.

I assume fewer fences help performance, although I haven't tried it. Using the acquire/release instructions also means we do the right thing if there's code using the normal ARMv8 atomic sequences.

  1. Add something like TLI->shouldInsertTrailingFenceForAtomic(I) to use in AtomicExpandPass and generate a Trailing Fence while keeping the original memory ordering. With the current patch we set the ordering to Monotonic and thus also need a leading fence.

This seems fine.

mnadeem updated this revision to Diff 490328. Jan 18 2023, 4:02 PM
  • Use fewer barriers and existing acquire/release instruction sequence.
  • Add a shouldInsertTrailingFenceForAtomicStore() function.
efriedma accepted this revision. Jan 18 2023, 4:12 PM

Code LGTM; maybe wait to see if Microsoft has any more feedback before we merge.

This revision is now accepted and ready to land. Jan 18 2023, 4:12 PM
This revision was landed with ongoing or failed builds. Jan 23 2023, 4:10 PM
This revision was automatically updated to reflect the committed changes.