Mostly useful for implementing seqlocks in C11/C++11, as explained in the referenced paper.
In particular, it lets seqlock readers avoid cache-line bouncing on the sequence
counter, which brings massive scalability improvements in the paper's micro-benchmarks.
This cannot be done as a target-independent pass, because it is unsound in general
to turn a fetch_add(&x, 0, release) into fence(seq_cst); load(&x, seq_cst),
as shown by the following example (from the paper above):
  atomic<int> x{0}, y{0};

  // Thread 0:
  x.store(1, mo_relaxed);
  r1 = y.fetch_add(0, mo_release);

  // Thread 1:
  y.fetch_add(1, mo_acquire);
  r2 = x.load(mo_relaxed);
The outcome r1 == r2 == 0 is not possible in the above code: if r1 == 0, then
thread 0's RMW precedes thread 1's RMW in the modification order of y, so
thread 1's acquire fetch_add reads from thread 0's release fetch_add and
synchronizes with it, forcing r2 == 1. The outcome becomes possible if the
fetch_add of thread 0 is turned into a fence followed by a load, even if both
are seq_cst, because the load no longer participates in y's modification order
and no synchronization is established.