Mostly useful for implementing seqlocks in C11/C++11, as explained in the referenced paper.
In particular, it lets seqlock readers avoid cache-line bouncing on the sequence
counter, which brings massive scalability improvements in the paper's micro-benchmarks.
This cannot be done as a target-independent pass, because it is unsound in general
to turn a fetch_add(&x, 0, release) into fence(seq_cst); load(&x, seq_cst),
as shown by the following example (from the paper above):
  atomic<int> x{0}, y{0};

  // Thread 0:
  x.store(1, mo_relaxed);
  r1 = y.fetch_add(0, mo_release);

  // Thread 1:
  y.fetch_add(1, mo_acquire);
  r2 = x.load(mo_relaxed);
The outcome r1 == r2 == 0 is not possible in the above code: if r1 == 0, then
thread 0's RMW precedes thread 1's RMW in the modification order of y, so
thread 1's acquire fetch_add reads from thread 0's release fetch_add and
synchronizes with it, forcing r2 == 1. The outcome becomes possible if the
fetch_add of thread 0 is turned into a fence followed by a load, even if both
are seq_cst, because the load no longer participates in y's modification order
and no synchronization is established.