[X86] replace (atomic fetch_add of 0) by (mfence; mov)
AbandonedPublic

Authored by morisset on Aug 27 2014, 3:01 PM.

Details

Reviewers
jfb
Summary

Mostly useful for implementing seqlocks in C11/C++11, as explained in
http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf
In particular, it can avoid cache-line bouncing, bringing massive scalability
improvements in the micro-benchmarks of the paper.

This cannot be done as a target-independent pass, because it is unsound
to turn a fetch_add(&x, 0, release) into fence(seq_cst); load(&x, seq_cst)
as shown by the following example(from the paper above):
atomic<int> x = y = 0;
Thread 0:

x.store(1, mo_relaxed);
r1 = y.fetch_add(0, mo_release);

Thread 1:

y.fetch_add(1, mo_acquire);
r2 = x.load(mo_relaxed);

r1 == r2 == 0 is not possible in the above code, but becomes possible if it the
fetch_add of thread 0 is turned into a fence followed by a load, even if they
are both seq_cst.

Diff Detail

morisset retitled this revision from to [X86] replace (atomic fetch_add of 0) by (mfence; mov).Aug 27 2014, 3:01 PM
morisset updated this object.
morisset edited the test plan for this revision. (Show Details)
morisset added a reviewer: jfb.
morisset added a subscriber: Unknown Object (MLST).
majnemer added inline comments.
lib/Target/X86/X86ISelDAGToDAG.cpp
1777–1792

This should be formatted per the coding standards; consider running clang-format on your changes.

morisset updated this revision to Diff 13006.Aug 27 2014, 3:37 PM

Fixed formatting.

Thanks David for suggesting the use of clang-format.

chandlerc added inline comments.
lib/Target/X86/X86ISelDAGToDAG.cpp
1748–1751

A general, important comment about implementing these kinds of optimizations: please write out the justification in the comments. Cite the paper for details, but at least lay out the basic ideas. It also would be useful to explain it using the terminology likely to be familiar to another LLVM developer.

morisset updated this revision to Diff 13007.Aug 27 2014, 3:54 PM

Improve comment, giving the rationale for the optimization.

majnemer added inline comments.Aug 27 2014, 4:29 PM
lib/Target/X86/X86ISelDAGToDAG.cpp
1810–1811

This should be on the previous line.

morisset updated this revision to Diff 13014.Aug 27 2014, 4:35 PM

Fix formatting

jfb added a comment.Aug 30 2014, 10:29 AM

The code and tests don't really discuss the atomic ordering. It would be good to provide an intuition for why your code works even with different orderings.

lib/Target/X86/X86ISelDAGToDAG.cpp
1749

s/exemple/example/

1756

s/../,/

1758

I think it might be clearer to add that the cache line goes from shared to modified state with the fetch_add, whereas it remains shared with the load.

1762

I think it would be good to explain that the mfence is required because the *hardware* can reorder load/store. This code is in a compiler, the reader may assume that "hoisting" is done by the compiler.

2191

Would it make sense to also select atomic or zero in the same way? I can't think of a reason for someone to use atomic or zero instead of add zero, but they are strictly equivalent so it seems silly to not select both with the same code.

Thanks for the review, I will modify the comments accordingly.

About the ordering in particular, it is made irrelevant by the mfence (i.e. this transformation is correct for seq_cst so it is also correct for anything weaker).

I do not know if this patch should go forward anymore however because of the issue with Or/Xor (see inline comment below).

lib/Target/X86/X86ISelDAGToDAG.cpp
2191

As we discussed yesterday, when I tried to do that I found a nasty problem: the code was only working for add because of a special case in X86AtomicExpandPass that avoids touching them. By this point in the pipeline, Or/Xor have already been expanded in fact in most cases to a cmpxchg loop. I will try to fix it, but it may take a bit of time as X86AtomicExpandPass is being deleted by another patch I have under review..

jfb added inline comments.Sep 3 2014, 1:55 PM
lib/Target/X86/X86ISelDAGToDAG.cpp
2191

Does your patch that removes X86AtomicExpandPass and make it a generic pass change the generated code? It sounds like moving to a generic pass will require a way for the backend to say "don't touch certain atomics, I'll expand them later" or something pluggable by the backend.

morisset added inline comments.Sep 3 2014, 2:09 PM
lib/Target/X86/X86ISelDAGToDAG.cpp
2191

That other patch uses a "shouldExpandAtomicRMW()" method of TargetLowering as a hook. So yes it already provides a way for the backend to say "don't touch these atomics, I will deal with them later", the only (small) problem is that currently the X86 backend does not always use this possibility for Or/Xor (but it does for Add/Sub).
So it is fixable, but I would like to avoid messing with it before that patch gets accepted (merge conflicts aren't fun).

morisset abandoned this revision.Sep 19 2014, 4:53 PM

Doing this optimization so late was looking increasingly brittle as it must interact with AtomicExpandPass earlier at the IR level.
I reimplemented it at that level in D5422, and it turned out a lot cleaner.