The code is slightly larger than LDXR/STXR loops but doesn't suffer from the starvation issues with asymmetric cores it was (at least partly) added to solve so it's probably better to avoid LDXR across the board.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
I was looking at this again and noticed it didn't handle unordered loads & stores properly. It tried to create unordered atomicrmw/cmpxchg, which are illegal and assert. So I updated it to promote unordered to monotonic.
It's a bit hard to know whether to accept this or not, given that you says it's "probably better to avoid LDXR". Do you have any data to say if it is any better or not?
Incidentally the handling of volatile atomic operations here is probably wrong as I don't think we can have any kind of repeated loading, but that's nothing to do with this change (both LDXR/STXR loop and CAS loop have this problem) and doesn't need to be fixed by it.
Incidentally the handling of volatile atomic operations here is probably wrong as I don't think we can have any kind of repeated loading, but that's nothing to do with this change (both LDXR/STXR loop and CAS loop have this problem) and doesn't need to be fixed by it.
volatile atomic is generally a bit of a minefield. If we don't have an instruction that does a straight load then there's not much we can do...
To answer my own question: I quickly threw together this test program:
#include <iostream> #include <atomic> #include <thread> #include <chrono> std::atomic<__int128> var; #define COUNT 100000 int thread_fn(int n) { for (int i = 0; i < COUNT; ++i) { var.store(n*COUNT + i, std::memory_order_relaxed); } return 0; } int main() { std::thread *threads[32]; std::chrono::high_resolution_clock::time_point start, end; start = std::chrono::high_resolution_clock::now(); for (int i = 0; i < 32; ++i) threads[i] = new std::thread(thread_fn, i); for (int i = 0; i < 32; ++i) threads[i]->join(); end = std::chrono::high_resolution_clock::now(); std::cout << (end-start).count() << "\n"; return 0; }
and running it on a thunderx2 (compiled -O2 -mcpu=native, taking average of 20 runs) I get:
without path | 1760231879 |
with patch | 3411562006 |
so it looks like this patch actually makes things worse.
According to http://llvm.org/docs/Atomics.html#unordered "Note that code generation will fail for unsupported atomic operations; if you need such an operation, use explicit locking.", so I guess we should give an error.
The atomics documentation is out of date; the implementation was changed to lower "unsupported" atomic operations to __atomic_*& calls.