
[AArch64] Use CAS loops for all atomic operations when available.
Needs RevisionPublic

Authored by t.p.northover on Feb 6 2019, 7:12 AM.

Details

Reviewers
john.brawn
Summary

The code is slightly larger than LDXR/STXR loops, but it doesn't suffer from the starvation issues on asymmetric cores that those loops were (at least partly) added to solve, so it's probably better to avoid LDXR across the board.

Diff Detail

Repository
rL LLVM

Event Timeline

t.p.northover created this revision.Feb 6 2019, 7:12 AM
Herald added a project: Restricted Project.Feb 6 2019, 7:13 AM

I was looking at this again and noticed it didn't handle unordered loads & stores properly: it tried to create unordered atomicrmw/cmpxchg instructions, which are illegal and trigger an assertion. So I updated it to promote unordered to monotonic.
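At the IR level (a rough sketch, in 2019-era typed-pointer syntax), the promotion looks like this: atomicrmw and cmpxchg require at least monotonic ordering, so an unordered form is rejected by the verifier.

```
; Illegal -- atomicrmw does not accept unordered:
;   %old = atomicrmw add i32* %p, i32 1 unordered
; Legal after promoting the ordering to monotonic:
%old = atomicrmw add i32* %p, i32 1 monotonic
```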

It's a bit hard to know whether to accept this or not, given that you say it's "probably better to avoid LDXR". Do you have any data showing whether it actually is better?

Incidentally the handling of volatile atomic operations here is probably wrong as I don't think we can have any kind of repeated loading, but that's nothing to do with this change (both LDXR/STXR loop and CAS loop have this problem) and doesn't need to be fixed by it.

jfb added a comment.Feb 15 2019, 10:11 AM

Incidentally the handling of volatile atomic operations here is probably wrong as I don't think we can have any kind of repeated loading, but that's nothing to do with this change (both LDXR/STXR loop and CAS loop have this problem) and doesn't need to be fixed by it.

volatile atomic is generally a bit of a minefield. If we don't have an instruction that does a straight load then there's not much we can do...

john.brawn requested changes to this revision.Feb 18 2019, 4:32 AM

It's a bit hard to know whether to accept this or not, given that you say it's "probably better to avoid LDXR". Do you have any data showing whether it actually is better?

To answer my own question: I quickly threw together this test program:

#include <iostream>
#include <atomic>
#include <thread>
#include <chrono>

std::atomic<__int128> var;

#define COUNT 100000

int thread_fn(int n) {
  for (int i = 0; i < COUNT; ++i) {
    var.store(n*COUNT + i, std::memory_order_relaxed);
  }
  return 0;
}

int main() {
  std::thread *threads[32];
  std::chrono::high_resolution_clock::time_point start, end;
  start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < 32; ++i)
    threads[i] = new std::thread(thread_fn, i);
  for (int i = 0; i < 32; ++i) {
    threads[i]->join();
    delete threads[i];
  }
  end = std::chrono::high_resolution_clock::now();
  std::cout << (end-start).count() << "\n";
  return 0;
}

and running it on a thunderx2 (compiled with -O2 -mcpu=native, taking the average of 20 runs) I get:

without patch: 1760231879
with patch:    3411562006

so it looks like this patch actually makes things worse.

In D57820#1399588, @jfb wrote:

Incidentally the handling of volatile atomic operations here is probably wrong as I don't think we can have any kind of repeated loading, but that's nothing to do with this change (both LDXR/STXR loop and CAS loop have this problem) and doesn't need to be fixed by it.

volatile atomic is generally a bit of a minefield. If we don't have an instruction that does a straight load then there's not much we can do...

According to http://llvm.org/docs/Atomics.html#unordered "Note that code generation will fail for unsupported atomic operations; if you need such an operation, use explicit locking.", so I guess we should give an error.

This revision now requires changes to proceed.Feb 18 2019, 4:32 AM

The atomics documentation is out of date; the implementation was changed to lower "unsupported" atomic operations to __atomic_* calls.