This patch implements the out-of-line atomics mechanism for LSE deployment.
Details of how it works can be found in llvm/docs/Atomics.rst.
Options -moutline-atomics and -mno-outline-atomics to enable and disable it
were added to the Clang driver. This is the Clang and LLVM part of the
out-of-line atomics interface; the library part is already supported by
libgcc. Compiler-rt support is provided in a separate patch.
Repository: rG LLVM Github Monorepo
clang/lib/Driver/ToolChains/Clang.cpp:6374
This excludes aarch64_be, which is a bit weird. Might be best to check Triple.isAArch64().

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp:1818
Could you do the whitespace changes separately (if at all)?

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp:4062
"its"

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp:2179
I think this is a bit of an abuse of the LibcallName mechanism. A separate function in TargetLowering would probably be better.

llvm/lib/Target/AArch64/AArch64Subtarget.h:476
I think something is a bit weird with how your clang-format handles comments. Here and earlier, line lengths are about half as long as I'd expect.

llvm/test/CodeGen/AArch64/atomic-ops.ll:534
I'd prefer not to overwrite existing CHECKs that have generic dataflow with ones produced by that tool hardcoding a particular register allocation.
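For reference, a generic-dataflow check in the style being asked for captures registers symbolically with FileCheck pattern variables rather than hardcoding the allocator's choices. The label and operands below are illustrative only, not the actual test content:

```llvm
; CHECK-LABEL: test_atomic_load_add_i32:
; CHECK: ldxr [[OLD:w[0-9]+]], [x{{[0-9]+}}]
; CHECK: add [[NEW:w[0-9]+]], [[OLD]], {{w[0-9]+}}
; CHECK: stxr [[STATUS:w[0-9]+]], [[NEW]], [x{{[0-9]+}}]
```

Such checks keep matching the dataflow even if a later register-allocation change picks different registers.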
llvm/docs/Atomics.rst:625
I think this section needs to be put at the end of the section on __sync_*. These functions are effectively an aarch64-specific version of the __sync libcalls, just with the addition of the memory ordering in the function name instead of assuming seq_cst. All of the same commentary applies otherwise, and clearly distinguishing them from the __atomic_* calls is important. Maybe something like: "On AArch64, a variant of the __sync_* routines is used which contains the memory order as part of the function name. These routines may determine at runtime whether the single-instruction atomic operations which were introduced as part of the AArch64 Large System Extensions 'LSE' instruction set are available, or if it needs to fall back to an LL/SC loop. The following helper functions are implemented in both [.....]"
llvm/include/llvm/IR/RuntimeLibcalls.def:548
Maybe just go ahead and define the libcalls up to size 16, even though aarch64 won't define or use the 16-byte functions, other than CAS. Can we come up with a better name for these libfuncs here? "ATOMIC_*" is an unfortunate prefix, since we already use it for the entirely-distinct set of functions above.
llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp:2179
I don't think that's odd or unusual; we often condition libcall availability on getLibcallName != nullptr. What does strike me here is the (pre-existing) code duplication between this function (DAGTypeLegalizer::ExpandAtomic) and SelectionDAGLegalize::ConvertNodeToLibcall. Not sure what's up with that...

llvm/lib/CodeGen/TargetLoweringBase.cpp:453
What's the purpose of the struct?

llvm/lib/CodeGen/TargetLoweringBase.cpp:475
If you moved this switch to the end, you could just have each clause be "return SwpLcalls[ModeN][ModelN];" instead of storing the address of the array.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:15839
So, hold on: AArch64 has umin/umax/smin/smax instructions, but libgcc and compiler-rt don't have helpers for those? That seems to be a remarkably unfortunate state of affairs. Can you fix that, by implementing those functions in the compiler-rt patch, and submitting the same to libgcc?

llvm/lib/Target/AArch64/AArch64Subtarget.h:476
I think it'd be clearer to have this simply "return OutlineAtomics;". The only usage that needs to change is AArch64ISelLowering.cpp L663, and it'd be _clearer_ to have it explicitly say if (!Subtarget->hasLSE() && Subtarget->outlineAtomics()).
llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp:2179
Fair enough. Didn't realise it was that common.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:15839
Yes, agreed; initially I was going to add min/max helpers to both compiler-rt and libgcc. But the point is that gcc does not generate LSE min/max at all, so these helpers would be untested and would bit-rot. There is no reason for the gcc community to support unused code. Hopefully, once the <atomic> maximum/minimum proposal is approved, I can easily add the helpers to libgcc and compiler-rt and enable [U]Min/[U]Max RMW atomics expansion in LLVM.
Have you got any further comments?
llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp:2179
I noticed this existing duplication too, but found no proper place to put the common functionality shared by DAGTypeLegalizer and SelectionDAGLegalize.
LG after fixing the minor nits.
clang/lib/Driver/ToolChains/Clang.cpp:6382
We don't usually explicitly use negative features like this, do we? I think this else clause can be omitted.

llvm/lib/CodeGen/TargetLoweringBase.cpp:494-497
As with the .def file, get rid of LCALL4 and just use LCALL5 for everything here. AArch64ISelLowering won't set up libcall names for the 128-bit ones, which is fine.
clang/lib/Driver/ToolChains/Clang.cpp:6382
We do; -soft-float, for example, in Clang.cpp:2397:

  if (Value.startswith("-mhard-float")) {
    CmdArgs.push_back("-target-feature");
    CmdArgs.push_back("-soft-float");
    continue;
  }
I tested this change on Graviton2 (aarch64-linux) by building https://github.com/xianyi/OpenBLAS with clang -O3 -moutline-atomics and running make test: all tests pass both with and without outline-atomics.
Clang was configured to use libgcc.
I also tested https://github.com/boostorg/boost.git with and without -moutline-atomics, and there are no new fails.
Here is how I built and ran the tests for boost:
git clone --recursive https://github.com/boostorg/boost.git $HOME/boost
cd $HOME/boost
mkdir usr
./bootstrap.sh --prefix=$HOME/boost/usr
# in project-config.jam line 12
# replace `using gcc ;` with `using clang : : $HOME/llvm-project/usr/bin/clang++ ;`
./b2 --build-type=complete --layout=versioned -a
cd status
../b2   # runs all regression tests
I also looked at the performance of some atomic operations using google-benchmark on an Ubuntu 20.04 c6g instance with Graviton2 (Neoverse-N1).
Performance is better with LSE instructions than with generic armv8-a code.
The overhead of -moutline-atomics is negligible compared to -march=armv8-a+lse.
Clang trunk as of today produces slightly slower code than gcc-9, both with and without -moutline-atomics.
$ cat a.cc

#include <benchmark/benchmark.h>
#include <atomic>

std::atomic<int> i;
static void BM_atomic_increment(benchmark::State& state) {
  for (auto _ : state)
    benchmark::DoNotOptimize(i++);
}
BENCHMARK(BM_atomic_increment);

int j;
static void BM_atomic_fetch_add(benchmark::State& state) {
  for (auto _ : state)
    benchmark::DoNotOptimize(__atomic_fetch_add(&j, 1, __ATOMIC_SEQ_CST));
}
BENCHMARK(BM_atomic_fetch_add);

int k;
static void BM_atomic_compare_exchange(benchmark::State& state) {
  for (auto _ : state)
    benchmark::DoNotOptimize(__atomic_compare_exchange(
        &j, &k, &k, 1, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE));
}
BENCHMARK(BM_atomic_compare_exchange);

template<class T> struct node {
  T data;
  node* next;
  node(const T& data) : data(data), next(nullptr) {}
};

static void BM_std_atomic_compare_exchange(benchmark::State& state) {
  node<int>* new_node = new node<int>(42);
  std::atomic<node<int>*> head;
  for (auto _ : state)
    benchmark::DoNotOptimize(std::atomic_compare_exchange_weak_explicit(
        &head, &new_node->next, new_node,
        std::memory_order_release, std::memory_order_relaxed));
}
BENCHMARK(BM_std_atomic_compare_exchange);

BENCHMARK_MAIN();

$ ./go.sh
+ g++ -o generic-v8 a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./generic-v8
2020-12-06 01:06:26
Running ./generic-v8
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 1024 KiB (x64)
  L3 Unified 32768 KiB (x1)
Load Average: 64.36, 59.36, 36.41
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                           Time             CPU       Iterations
-------------------------------------------------------------------------
BM_atomic_increment              7.21 ns         7.20 ns         97116662
BM_atomic_fetch_add              7.20 ns         7.20 ns         97152394
BM_atomic_compare_exchange       7.71 ns         7.71 ns         90780423
BM_std_atomic_compare_exchange   7.61 ns         7.61 ns         92037159

+ /home/ubuntu/llvm-project/nin/bin/clang++ -o clang-generic-v8 a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./clang-generic-v8
2020-12-06 01:06:30
Running ./clang-generic-v8
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 1024 KiB (x64)
  L3 Unified 32768 KiB (x1)
Load Average: 64.57, 59.49, 36.57
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                           Time             CPU       Iterations
-------------------------------------------------------------------------
BM_atomic_increment              9.21 ns         9.21 ns         75989223
BM_atomic_fetch_add              9.21 ns         9.21 ns         76031211
BM_atomic_compare_exchange       7.61 ns         7.61 ns         92012620
BM_std_atomic_compare_exchange   12.4 ns         12.4 ns         56421424

+ g++ -o lse -march=armv8-a+lse a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./lse
2020-12-06 01:06:34
Running ./lse
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 1024 KiB (x64)
  L3 Unified 32768 KiB (x1)
Load Average: 64.85, 59.63, 36.74
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                           Time             CPU       Iterations
-------------------------------------------------------------------------
BM_atomic_increment              5.21 ns         5.21 ns        134201945
BM_atomic_fetch_add              5.21 ns         5.21 ns        134438848
BM_atomic_compare_exchange       6.80 ns         6.80 ns        102872012
BM_std_atomic_compare_exchange   6.80 ns         6.80 ns        102864719

+ clang++ -o clang-lse -march=armv8-a+lse a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./clang-lse
2020-12-06 01:06:38
Running ./clang-lse
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 1024 KiB (x64)
  L3 Unified 32768 KiB (x1)
Load Average: 64.85, 59.63, 36.74
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                           Time             CPU       Iterations
-------------------------------------------------------------------------
BM_atomic_increment              7.21 ns         7.21 ns         97086511
BM_atomic_fetch_add              7.21 ns         7.21 ns         97152416
BM_atomic_compare_exchange       7.20 ns         7.20 ns         97186161
BM_std_atomic_compare_exchange   11.6 ns         11.6 ns         60302378

+ g++ -o moutline -moutline-atomics a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./moutline
2020-12-06 01:06:41
Running ./moutline
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 1024 KiB (x64)
  L3 Unified 32768 KiB (x1)
Load Average: 64.94, 59.74, 36.90
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                           Time             CPU       Iterations
-------------------------------------------------------------------------
BM_atomic_increment              5.60 ns         5.60 ns        124853685
BM_atomic_fetch_add              5.60 ns         5.60 ns        124907943
BM_atomic_compare_exchange       7.21 ns         7.21 ns         97151664
BM_std_atomic_compare_exchange   7.21 ns         7.21 ns         97148224

+ /home/ubuntu/llvm-project/nin/bin/clang++ -o clang-moutline -moutline-atomics a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./clang-moutline
2020-12-06 01:06:45
Running ./clang-moutline
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 1024 KiB (x64)
  L3 Unified 32768 KiB (x1)
Load Average: 64.95, 59.82, 37.05
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                           Time             CPU       Iterations
-------------------------------------------------------------------------
BM_atomic_increment              7.21 ns         7.21 ns         97071465
BM_atomic_fetch_add              7.21 ns         7.20 ns         97150580
BM_atomic_compare_exchange       7.20 ns         7.20 ns         97164566
BM_std_atomic_compare_exchange   11.6 ns         11.6 ns         60301778
Hi Pavel,
We need to handle one more case for __sync_* builtins, please see testcase and patches applied to GCC:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162
https://github.com/llvm/llvm-project/issues/29472 was never fixed; whatever issues exist with -moutline-atomics also exist with -mno-outline-atomics. (I don't think anyone has run into any practical issues with this, so it hasn't been a priority for anyone.)
I think it looks reasonable to define a 5th memory model and add barriers to the __sync_* builtins and to the outline-atomics calls as well.