Page MenuHomePhabricator

[AArch64] Out-of-line atomics (-moutline-atomics) implementation.
ClosedPublic

Authored by ilinpv on Nov 10 2020, 5:56 AM.

Details

Summary

This patch implements out of line atomics for LSE deployment
mechanism. Details how it works can be found in llvm/docs/Atomics.rst
Options -moutline-atomics and -mno-outline-atomics to enable and disable it
were added to clang driver. This is clang and llvm part of out-of-line atomics
interface, library part is already supported by libgcc. Compiler-rt
support is provided in separate patch.

Diff Detail

Event Timeline

ilinpv created this revision.Nov 10 2020, 5:56 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptNov 10 2020, 5:56 AM
ilinpv requested review of this revision.Nov 10 2020, 5:56 AM
t.p.northover added inline comments.
clang/lib/Driver/ToolChains/Clang.cpp
6374

This excludes aarch64_be, which is a bit weird. Might be best to check Triple.isAArch64().

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
1818

Could you do the whitespace changes separately (if at all)?

4062

"its"

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
2179

I think this is a bit of an abuse of the LibcallName mechanism. A separate function in TargetLowering would probably be better.

llvm/lib/Target/AArch64/AArch64Subtarget.h
476

I think something is a bit weird with how your clang-format handles comments. Here and earlier line lengths are about half as long as I'd expect.

llvm/test/CodeGen/AArch64/atomic-ops.ll
534

I'd prefer not to overwrite existing CHECKS that have generic dataflow with ones produced by that tool hardcoding a particular register allocation.

jyknight added inline comments.Nov 10 2020, 2:27 PM
llvm/docs/Atomics.rst
625

I think this section needs to be put on the end of the section on __sync_*. These functions are effectively an aarch64-specific version of the the __sync libcalls -- just with the addition of the memory ordering in the function name, instead of assuming seq_cst. All of the same commentary applies otherwise, and clearly distinguishing from the __atomic_* calls is important.

Maybe something like:

On AArch64, a variant of the __sync_* routines is used which contain the memory order as part of the function name. These routines may determine at runtime whether the single-instruction atomic operations which were introduced as part of AArch64 Large System Extensions "LSE" instruction set are available, or if it needs to fall back to an LL/SC loop.

The following helper functions are implemented in both [.....]
llvm/include/llvm/IR/RuntimeLibcalls.def
548

Maybe just go ahead and define the libcalls up to size 16, even though aarch64 won't define or use the 16-byte functions, other than CAS. Can we come up with a better name for these libfuncs here? "ATOMIC_*" is an unfortunate prefix, since we already use it for the entirely-distinct set of functions above.

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
2179

I don't think that's odd or unusual -- we often condition libcall availability on getLibcallName != nullptr.

What does strike me here is the (pre-existing) code duplication between this function (DAGTypeLegalizer::ExapndAtomic) and SelectionDAGLegalize::ConvertNodeToLibcall. Not sure what's up with that...

llvm/lib/CodeGen/TargetLoweringBase.cpp
453

What's the purpose of the struct?

475

If you moved this switch to the end, you can just have each clause be "return SwpLcalls[ModeN][ModelN];", instead of storing the address of the array.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15839

So, hold on -- AArch64 has umin/umax/smin/smax instructions, but libgcc and compiler-rt don't have helpers for those? That seems to be a remarkably unfortunate state of affairs.

Can you fix that, by implementing those functions in the compiler-rt patch, and submitting the same to libgcc?

llvm/lib/Target/AArch64/AArch64Subtarget.h
476

I think it'd be clearer to have this simply "return OutlineAtomics;". The only usage that needs to change is AArch64ISelLowering.cpp L663, and it'd be _clearer_ to have it explicitly say if (!Subtarget->hasLSE() && Subtarget->outlineAtomics()).

t.p.northover added inline comments.Nov 11 2020, 5:55 AM
llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
2179

Fair enough. Didn't realise it was that common.

ilinpv updated this revision to Diff 304663.Nov 11 2020, 3:11 PM

Work on comments has been done.

ilinpv marked 10 inline comments as done.Nov 11 2020, 3:36 PM
ilinpv added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15839

Yes, agree, initially I was going to add min/max helpers to both compiler-rt and libgcc. But the point is that gcc is not generating LSE min/max at all. So these helpers would be untested and become rotten. There is no reason for gcc community to support unused code. Hopefully after <atomic> maximum/minimum proposal approval I can easily add the helpers to libcc and compiler-rt and enable [U]Min/[U]Max RWM atomics expansion in LLVM.

Have you got any further comments?

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
2179

I noticed this existed duplication too, but find no proper place to put common functionality from DAGTypeLegalizer and SelectionDAGLegalize.

jyknight accepted this revision.Nov 19 2020, 7:44 AM

LG after fixing the minor nits.

clang/lib/Driver/ToolChains/Clang.cpp
6382

We don't usually explicitly use negative features like this, do we? I think this else clause can be omitted.

llvm/lib/CodeGen/TargetLoweringBase.cpp
494–497

As with the .def file, get rid of LCALL4, and just use LCALL5 for everything here. AArch64ISelLowering won't setup libcall names for 128-bit ones, which is fine.

This revision is now accepted and ready to land.Nov 19 2020, 7:44 AM
ilinpv added inline comments.Nov 19 2020, 10:23 AM
clang/lib/Driver/ToolChains/Clang.cpp
6382

We do, -soft-float for example in Clang.cpp:2397

if (Value.startswith("-mhard-float")) {
          CmdArgs.push_back("-target-feature");
          CmdArgs.push_back("-soft-float");
          continue;
        }
ilinpv updated this revision to Diff 306502.Nov 19 2020, 1:13 PM
ilinpv edited reviewers, added: t.p.northover; removed: eli.friedman.

LCALL4 removed in TargetLoweringBase.

ilinpv marked an inline comment as done.Nov 19 2020, 1:16 PM
This revision was landed with ongoing or failed builds.Nov 20 2020, 5:30 AM
This revision was automatically updated to reflect the committed changes.
sebpop added a subscriber: sebpop.Dec 5 2020, 6:03 PM

I tested this change on Graviton2 aarch64-linux by building https://github.com/xianyi/OpenBLAS with clang -O3 -moutline-atomics and make test: all tests pass with and without outline-atomics.
Clang was configured to use libgcc.

I also tested https://github.com/boostorg/boost.git with and without -moutline-atomics, and there are no new fails.
Here is how I built and ran the tests for boost:

git clone --recursive https://github.com/boostorg/boost.git $HOME/boost
cd $HOME/boost
mkdir usr
./bootstrap.sh --prefix=$HOME/boost/usr
# in project-config.jam line 12
# replace `using gcc ;` with `using clang :   : $HOME/llvm-project/usr/bin/clang++ ;`
./b2 --build-type=complete --layout=versioned -a
cd status
../b2  # runs all regression tests

I also looked at the performance of some atomic operations using google-benchmark on Ubuntu 20.04 c6g instance with Graviton2 (Neoverse-N1).
Performance is better when using LSE instructions compared to generic armv8-a code.
The overhead of -moutline-atomics is negligible compared to armv8-a+lse.
clang trunk as of today produces slightly slower code than gcc-9 with and without -moutline-atomics.

$ cat a.cc
#include <benchmark/benchmark.h>
#include <atomic>

std::atomic<int> i;
static void BM_atomic_increment(benchmark::State& state) {
  for (auto _ : state)
    benchmark::DoNotOptimize(i++);
}
BENCHMARK(BM_atomic_increment);

int j;
static void BM_atomic_fetch_add(benchmark::State& state) {
  for (auto _ : state)
    benchmark::DoNotOptimize(__atomic_fetch_add(&j, 1, __ATOMIC_SEQ_CST));
}
BENCHMARK(BM_atomic_fetch_add);

int k;
static void BM_atomic_compare_exchange(benchmark::State& state) {
  for (auto _ : state)
    benchmark::DoNotOptimize(__atomic_compare_exchange
                             (&j, &k, &k, 1, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE));
}
BENCHMARK(BM_atomic_compare_exchange);


template<class T>
struct node {
  T data;
  node* next;
  node(const T& data) : data(data), next(nullptr) {}
};

static void BM_std_atomic_compare_exchange(benchmark::State& state) {
  node<int>* new_node = new node<int>(42);
  std::atomic<node<int>*> head;
  for (auto _ : state)
    benchmark::DoNotOptimize(std::atomic_compare_exchange_weak_explicit
                             (&head, &new_node->next, new_node,
                              std::memory_order_release,
                              std::memory_order_relaxed));
}
BENCHMARK(BM_std_atomic_compare_exchange);

BENCHMARK_MAIN();

---
$ ./go.sh
+ g++ -o generic-v8 a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./generic-v8
2020-12-06 01:06:26
Running ./generic-v8
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 1024 KiB (x64)
  L3 Unified 32768 KiB (x1)
Load Average: 64.36, 59.36, 36.41
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
BM_atomic_increment                  7.21 ns         7.20 ns     97116662
BM_atomic_fetch_add                  7.20 ns         7.20 ns     97152394
BM_atomic_compare_exchange           7.71 ns         7.71 ns     90780423
BM_std_atomic_compare_exchange       7.61 ns         7.61 ns     92037159
+ /home/ubuntu/llvm-project/nin/bin/clang++ -o clang-generic-v8 a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./clang-generic-v8
2020-12-06 01:06:30
Running ./clang-generic-v8
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 1024 KiB (x64)
  L3 Unified 32768 KiB (x1)
Load Average: 64.57, 59.49, 36.57
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
BM_atomic_increment                  9.21 ns         9.21 ns     75989223
BM_atomic_fetch_add                  9.21 ns         9.21 ns     76031211
BM_atomic_compare_exchange           7.61 ns         7.61 ns     92012620
BM_std_atomic_compare_exchange       12.4 ns         12.4 ns     56421424
+ g++ -o lse -march=armv8-a+lse a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./lse
2020-12-06 01:06:34
Running ./lse
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 1024 KiB (x64)
  L3 Unified 32768 KiB (x1)
Load Average: 64.85, 59.63, 36.74
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
BM_atomic_increment                  5.21 ns         5.21 ns    134201945
BM_atomic_fetch_add                  5.21 ns         5.21 ns    134438848
BM_atomic_compare_exchange           6.80 ns         6.80 ns    102872012
BM_std_atomic_compare_exchange       6.80 ns         6.80 ns    102864719
+ clang++ -o clang-lse -march=armv8-a+lse a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./clang-lse
2020-12-06 01:06:38
Running ./clang-lse
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 1024 KiB (x64)
  L3 Unified 32768 KiB (x1)
Load Average: 64.85, 59.63, 36.74
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
BM_atomic_increment                  7.21 ns         7.21 ns     97086511
BM_atomic_fetch_add                  7.21 ns         7.21 ns     97152416
BM_atomic_compare_exchange           7.20 ns         7.20 ns     97186161
BM_std_atomic_compare_exchange       11.6 ns         11.6 ns     60302378
+ g++ -o moutline -moutline-atomics a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./moutline
2020-12-06 01:06:41
Running ./moutline
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 1024 KiB (x64)
  L3 Unified 32768 KiB (x1)
Load Average: 64.94, 59.74, 36.90
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
BM_atomic_increment                  5.60 ns         5.60 ns    124853685
BM_atomic_fetch_add                  5.60 ns         5.60 ns    124907943
BM_atomic_compare_exchange           7.21 ns         7.21 ns     97151664
BM_std_atomic_compare_exchange       7.21 ns         7.21 ns     97148224
+ /home/ubuntu/llvm-project/nin/bin/clang++ -o clang-moutline -moutline-atomics a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./clang-moutline
2020-12-06 01:06:45
Running ./clang-moutline
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 1024 KiB (x64)
  L3 Unified 32768 KiB (x1)
Load Average: 64.95, 59.82, 37.05
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
BM_atomic_increment                  7.21 ns         7.21 ns     97071465
BM_atomic_fetch_add                  7.21 ns         7.20 ns     97150580
BM_atomic_compare_exchange           7.20 ns         7.20 ns     97164566
BM_std_atomic_compare_exchange       11.6 ns         11.6 ns     60301778