This is an archive of the discontinued LLVM Phabricator instance.

Differential D5422

Lower idempotent RMWs to fence+load
ClosedPublic

Authored by morisset on Sep 19 2014, 4:51 PM.

Details

Reviewers

jfb

Commits

rG810739d17420: Lower idempotent RMWs to fence+load
rL218455: Lower idempotent RMWs to fence+load

Summary

I originally tried doing this specifically for X86 in the backend in D5091,
but it was rather brittle and generally running too late to be general.
Furthermore, other targets may want to implement similar optimizations.
So I reimplemented it at the IR-level, fitting it into AtomicExpandPass
as it interacts with that pass (which could not be cleanly done before
at the backend level).

This optimization relies on a new target hook, which is only used by X86
for now, as the correctness of the optimization on other targets remains
an open question. If it is found correct on other targets, it should be
trivial to enable for them.

Details of the optimization are discussed in D5091.
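
As a source-level illustration of the summary (a sketch in standard C++ `std::atomic`, not the IR transformation itself), an idempotent RMW and the fence+load shape it is lowered to might look as follows. The helper names are hypothetical. The `seq_cst` fence stands in for the `mfence` emitted on X86, and the load uses `relaxed` because a load cannot carry release ordering.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: read a value through an idempotent RMW.
// fetch_add(0) never changes the stored value, but it is still a
// read-modify-write as far as the hardware is concerned.
inline int read_with_idempotent_rmw(std::atomic<int> &a) {
  return a.fetch_add(0, std::memory_order_release);
}

// Hypothetical helper: the shape the optimization aims for, a full
// fence followed by a plain atomic load.
inline int read_with_fenced_load(std::atomic<int> &a) {
  std::atomic_thread_fence(std::memory_order_seq_cst);
  return a.load(std::memory_order_relaxed);
}
```

Both helpers return the current value and leave it unchanged; whether the second form is a valid replacement on a given target is exactly the correctness question the summary leaves open for non-X86 targets.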

Diff Detail

Repository: rL LLVM

Event Timeline

morisset updated this revision to Diff 13899. Sep 19 2014, 4:51 PM

morisset retitled this revision to Lower idempotent RMWs to fence+load.

morisset updated this object.

morisset edited the test plan for this revision.

morisset added a reviewer: jfb.

morisset added a subscriber: Unknown Object (MLST).

jfb added inline comments. Sep 22 2014, 2:11 PM

lib/CodeGen/AtomicExpandPass.cpp
96 ↗	(On Diff #13899)	I find that this code isn't quite obvious: if the RMW is idempotent then you try to simplify it in a target-defined way, and if that fails you call `expandAtomicRMW` from `simplifyIdempotentRMW`. There should be only one place where you call `expandAtomicRMW`, and it should be here. Maybe add a `virtual shouldSimplifyIdempotentRMWInIR` function, or change the control flow here to first try to simplify, if that fails check if you should expand, and if that also fails then keep going (I think this is better).
481 ↗	(On Diff #13899)	You could handle `min`/`max`/`umin`/`umax` but those seem somewhat useless. I guess other optimizations won't do partial evaluation into an atomic instruction, so constant propagation will feed up to the value operand of the atomic but not further, so it may be possible that this optimization would fire?
490 ↗	(On Diff #13899)	Merge the two above lines, LLVM usually follows that style.
492 ↗	(On Diff #13899)	Could you instead add the `LoadInst` to the `AtomicInsts` that you're iterating through? Same as before, I don't really like the repeated logic since it makes the code harder to modify and follow.
lib/Target/X86/X86ISelLowering.cpp
17029 ↗	(On Diff #13899)	s/harmfull/harmful/
17030 ↗	(On Diff #13899)	Not all primitive types smaller than native width have an appropriate load. Will this still work in these cases (or is it impossible to get here with types that aren't 8/16/32/64)?
test/CodeGen/X86/atomic_idempotent.ll
6 ↗	(On Diff #13899)	optimization
47 ↗	(On Diff #13899)	Could you also try weird integer sizes? I also think that testing more RMW operations at least for 32-bit would be good, especially `and`.
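
The control flow suggested in the first inline comment (try to simplify, then check whether to expand, otherwise keep going) can be sketched in isolation. This is a minimal model, not the pass's code: `RMWHooks` and `processRMW` are hypothetical names, and the callbacks stand in for `isIdempotentRMW`, `simplifyIdempotentRMW`, `shouldExpandAtomicRMWInIR` and `expandAtomicRMW`.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-ins for the pass's predicates and transformations.
struct RMWHooks {
  std::function<bool()> isIdempotent;
  std::function<bool()> simplifyIdempotent; // true if it rewrote the RMW
  std::function<bool()> shouldExpand;
  std::function<bool()> expand;             // true if it rewrote the RMW
};

// Try the simplification first and fall back to expansion only if the
// simplification did not happen. Returns true if the instruction changed.
inline bool processRMW(const RMWHooks &h) {
  if (h.isIdempotent() && h.simplifyIdempotent())
    return true;
  if (h.shouldExpand())
    return h.expand(); // the single call site for expansion
  return false;        // leave the instruction for the backend
}
```

With this shape, expansion is reached from exactly one place, which is the property the comment asks for.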

Thanks for the review. Answers inline, and patch next.

lib/CodeGen/AtomicExpandPass.cpp
96 ↗	(On Diff #13899)	I've changed the control-flow here, it is simpler (no more redundancy, only one call to expandAtomicRMW), but the control-flow lost a bit of readability. Please tell me if it looks ok to you.
481 ↗	(On Diff #13899)	There are two different optimizations doable for min/max/umin/umax: (1) simplify them when the value operand is the constant INT_MIN, INT_MAX, ...; (2) simplify them when the value operand is necessarily the same value as the value already in the memory cell (for some definition of value, already and memory cell...). It is not exactly clear to me which you want me to do. The first one is rather trivial to add, I would just need to find a way of getting INT_MIN/INT_MAX for any type size (probably not very hard, but I would rather keep it in a separate patch). The second one would probably have to be a different pass, dealing with nightmarish complexity in the form of aliasing issues and subtle memory model issues (just defining what "current value in a memory location" is basically impossible in C11), and generally I don't want to open that can of worms unless it can be shown as crucial in actual benchmarks (I lost months trying to prove the correctness of simple variants of that class of optimizations with nothing to show for it but a slew of counter-examples and traps, so I am quite cautious about anything that looks like it).
490 ↗	(On Diff #13899)	Done.
492 ↗	(On Diff #13899)	I tried doing that originally, but it is not obvious how to do it cleanly: (1) adding the LoadInst to AtomicInsts might invalidate the iterator and break everything; (2) a goto from that case to the beginning of the function would do the trick, but "goto" is not really recommended in LLVM, I think; (3) adding an extra level of nesting with a while loop would also work, but would probably not be very readable (and the nesting of the control-flow in runOnFunction is already uncomfortable), and feels a bit overkill. So I took this approach with a bit of redundancy because the other solutions looked even worse. I agree that it is ugly, I would love suggestions on how to clean it up.
lib/Target/X86/X86ISelLowering.cpp
17029 ↗	(On Diff #13899)	Fixed.
17030 ↗	(On Diff #13899)	Luckily, LLVM only allows power-of-2 sizes (in bytes) for Atomic RMW operations. So we don't have to bother checking for some abominations like an i24 or i13.
test/CodeGen/X86/atomic_idempotent.ll
6 ↗	(On Diff #13899)	Fixed.
47 ↗	(On Diff #13899)	I've added tests for and/sub. However atomic RMWs are not defined/accepted by LLVM for sizes other than power of 2 number of bytes (luckily for my sanity).
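
For the first of the two min/max/umin/umax options discussed above (constant operands such as INT_MIN/INT_MAX), the idempotent operand for a given bit width can be computed directly. The sketch below is a standalone model with hypothetical names (`MinMaxOp`, `idempotentOperand`); inside LLVM, `APInt::getSignedMinValue` and `APInt::getSignedMaxValue` would be the natural way to get the same constants.

```cpp
#include <cassert>
#include <cstdint>

enum class MinMaxOp { Max, Min, UMax, UMin };

// Returns, as a zero-extended bit pattern, the operand that leaves memory
// unchanged for the given operation at the given bit width (1..64).
inline std::uint64_t idempotentOperand(MinMaxOp op, unsigned bits) {
  assert(bits >= 1 && bits <= 64);
  const std::uint64_t allOnes =
      bits == 64 ? ~std::uint64_t{0} : (std::uint64_t{1} << bits) - 1;
  const std::uint64_t signBit = std::uint64_t{1} << (bits - 1);
  switch (op) {
  case MinMaxOp::Max:  return signBit;           // INT_MIN: max(x, INT_MIN) == x
  case MinMaxOp::Min:  return allOnes ^ signBit; // INT_MAX: min(x, INT_MAX) == x
  case MinMaxOp::UMax: return 0;                 // umax(x, 0) == x
  case MinMaxOp::UMin: return allOnes;           // umin(x, UINT_MAX) == x
  }
  return 0; // not reached for valid enum values
}
```

For example, `idempotentOperand(MinMaxOp::Max, 8)` yields `0x80`, the bit pattern of the signed 8-bit minimum.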

Partial cleanup of the control-flow + extra test + fixed formatting/typos.
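
On the iterator-invalidation concern raised in the reply about `AtomicInsts`: one generic way to append to a worklist while walking it is to iterate by index rather than by iterator. The sketch below only illustrates that idea on a `std::vector`; `Item` and `drain` are hypothetical, and this is not what the patch does (it keeps the small redundancy instead).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical work item: an id plus a flag saying whether processing it
// produces one follow-up item (as an RMW lowered to a load would).
struct Item {
  int id;
  bool spawnsFollowUp;
};

// Processes every item, including ones appended during the loop, and returns
// the ids in processing order. Indexing (rather than iterators or range-for)
// stays valid when push_back reallocates the vector.
inline std::vector<int> drain(std::vector<Item> work) {
  std::vector<int> order;
  for (std::size_t i = 0; i < work.size(); ++i) {
    const Item cur = work[i]; // copy: push_back may invalidate references
    order.push_back(cur.id);
    if (cur.spawnsFollowUp)
      work.push_back({cur.id + 100, false});
  }
  return order;
}
```

Because the loop re-reads `work.size()` on every iteration, items added during processing are picked up by the same loop without a `goto` or an extra level of nesting.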

jfb added inline comments. Sep 23 2014, 1:53 PM

include/llvm/Target/TargetLowering.h
1011 ↗	(On Diff #14004)	I still find the interaction between this function and `shouldExpandAtomicLoadInIR` very confusing: it's not clear from this documentation that the implementation of `lowerIdempotentRMWIntoFencedLoad` can return a simple `load atomic` that will then be lowered appropriately if `shouldExpandAtomicLoadInIR` is true. Overall the code is more understandable, though. I think updating this documentation is good enough.
lib/CodeGen/AtomicExpandPass.cpp
488 ↗	(On Diff #14004)	Leave a FIXME that documents the other optimizations that can be done here.

Add requested comments

I think this looks good, but I'd leave it open for a while to see if others have comments.

This revision is now accepted and ready to land. Sep 23 2014, 2:09 PM

Closed by commit rL218455 (authored by @morisset).

Revision Contents

Path

Size

llvm/

trunk/

include/

llvm/

Target/

TargetLowering.h

14 lines

lib/

CodeGen/

AtomicExpandPass.cpp

45 lines

Target/

X86/

X86ISelLowering.h

3 lines

X86ISelLowering.cpp

73 lines

test/

CodeGen/

X86/

atomic_idempotent.ll

56 lines

Diff 14080

llvm/trunk/include/llvm/Target/TargetLowering.h

// ... lines elided; enclosing context: public:
virtual bool shouldExpandAtomicLoadInIR(LoadInst *LI) const { return false; }		virtual bool shouldExpandAtomicLoadInIR(LoadInst *LI) const { return false; }

/// Returns true if the given AtomicRMW should be expanded by the		/// Returns true if the given AtomicRMW should be expanded by the
/// IR-level AtomicExpand pass into a loop using LoadLinked/StoreConditional.		/// IR-level AtomicExpand pass into a loop using LoadLinked/StoreConditional.
virtual bool shouldExpandAtomicRMWInIR(AtomicRMWInst *RMWI) const {		virtual bool shouldExpandAtomicRMWInIR(AtomicRMWInst *RMWI) const {
return false;		return false;
}		}

		/// On some platforms, an AtomicRMW that never actually modifies the value
		/// (such as fetch_add of 0) can be turned into a fence followed by an
		/// atomic load. This may sound useless, but it makes it possible for the
		/// processor to keep the cacheline shared, dramatically improving
		/// performance. And such idempotent RMWs are useful for implementing some
		/// kinds of locks, see for example (justification + benchmarks):
		/// http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf
		/// This method tries doing that transformation, returning the atomic load if
		/// it succeeds, and nullptr otherwise.
		/// If shouldExpandAtomicLoadInIR returns true on that load, it will undergo
		/// another round of expansion.
		virtual LoadInst *lowerIdempotentRMWIntoFencedLoad(AtomicRMWInst *RMWI) const {
		return nullptr;
		}
//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//
// TargetLowering Configuration Methods - These methods should be invoked by		// TargetLowering Configuration Methods - These methods should be invoked by
// the derived class constructor to configure this object for the target.		// the derived class constructor to configure this object for the target.
//		//

/// \brief Reset the operation actions based on target options.		/// \brief Reset the operation actions based on target options.
virtual void resetOperationActions() {}		virtual void resetOperationActions() {}


llvm/trunk/lib/CodeGen/AtomicExpandPass.cpp

// ... lines elided; enclosing context: private:
bool expandAtomicLoad(LoadInst *LI);		bool expandAtomicLoad(LoadInst *LI);
bool expandAtomicLoadToLL(LoadInst *LI);		bool expandAtomicLoadToLL(LoadInst *LI);
bool expandAtomicLoadToCmpXchg(LoadInst *LI);		bool expandAtomicLoadToCmpXchg(LoadInst *LI);
bool expandAtomicStore(StoreInst *SI);		bool expandAtomicStore(StoreInst *SI);
bool expandAtomicRMW(AtomicRMWInst *AI);		bool expandAtomicRMW(AtomicRMWInst *AI);
bool expandAtomicRMWToLLSC(AtomicRMWInst *AI);		bool expandAtomicRMWToLLSC(AtomicRMWInst *AI);
bool expandAtomicRMWToCmpXchg(AtomicRMWInst *AI);		bool expandAtomicRMWToCmpXchg(AtomicRMWInst *AI);
bool expandAtomicCmpXchg(AtomicCmpXchgInst *CI);		bool expandAtomicCmpXchg(AtomicCmpXchgInst *CI);
		bool isIdempotentRMW(AtomicRMWInst *AI);
		bool simplifyIdempotentRMW(AtomicRMWInst *AI);
};		};
}		}

char AtomicExpand::ID = 0;		char AtomicExpand::ID = 0;
char &llvm::AtomicExpandID = AtomicExpand::ID;		char &llvm::AtomicExpandID = AtomicExpand::ID;
INITIALIZE_TM_PASS(AtomicExpand, "atomic-expand",		INITIALIZE_TM_PASS(AtomicExpand, "atomic-expand",
"Expand Atomic calls in terms of either load-linked & store-conditional or cmpxchg",		"Expand Atomic calls in terms of either load-linked & store-conditional or cmpxchg",
false, false)		false, false)
// ... lines elided; enclosing context: if (TargetLowering->getInsertFencesForAtomic()) {
MadeChange |= bracketInstWithFences(I, FenceOrdering, IsStore, IsLoad);		MadeChange |= bracketInstWithFences(I, FenceOrdering, IsStore, IsLoad);
}		}
}		}

if (LI && TargetLowering->shouldExpandAtomicLoadInIR(LI)) {		if (LI && TargetLowering->shouldExpandAtomicLoadInIR(LI)) {
MadeChange |= expandAtomicLoad(LI);		MadeChange |= expandAtomicLoad(LI);
} else if (SI && TargetLowering->shouldExpandAtomicStoreInIR(SI)) {		} else if (SI && TargetLowering->shouldExpandAtomicStoreInIR(SI)) {
MadeChange |= expandAtomicStore(SI);		MadeChange |= expandAtomicStore(SI);
} else if (RMWI && TargetLowering->shouldExpandAtomicRMWInIR(RMWI)) {		} else if (RMWI) {
MadeChange |= expandAtomicRMW(RMWI);		// There are two different ways of expanding RMW instructions:
		// - into a load if it is idempotent
		// - into a Cmpxchg/LL-SC loop otherwise
		// we try them in that order.
		MadeChange |= (isIdempotentRMW(RMWI) &&
		simplifyIdempotentRMW(RMWI)) ||
		(TargetLowering->shouldExpandAtomicRMWInIR(RMWI) &&
		expandAtomicRMW(RMWI));
} else if (CASI && TargetLowering->hasLoadLinkedStoreConditional()) {		} else if (CASI && TargetLowering->hasLoadLinkedStoreConditional()) {
MadeChange |= expandAtomicCmpXchg(CASI);		MadeChange |= expandAtomicCmpXchg(CASI);
}		}
}		}
return MadeChange;		return MadeChange;
}		}

bool AtomicExpand::bracketInstWithFences(Instruction *I, AtomicOrdering Order,		bool AtomicExpand::bracketInstWithFences(Instruction *I, AtomicOrdering Order,
// ... lines elided; enclosing context: if (!CI->use_empty()) {
Res = Builder.CreateInsertValue(Res, Success, 1);		Res = Builder.CreateInsertValue(Res, Success, 1);

CI->replaceAllUsesWith(Res);		CI->replaceAllUsesWith(Res);
}		}

CI->eraseFromParent();		CI->eraseFromParent();
return true;		return true;
}		}

		bool AtomicExpand::isIdempotentRMW(AtomicRMWInst* RMWI) {
		auto C = dyn_cast<ConstantInt>(RMWI->getValOperand());
		if(!C)
		return false;

		AtomicRMWInst::BinOp Op = RMWI->getOperation();
		switch(Op) {
		case AtomicRMWInst::Add:
		case AtomicRMWInst::Sub:
		case AtomicRMWInst::Or:
		case AtomicRMWInst::Xor:
		return C->isZero();
		case AtomicRMWInst::And:
		return C->isMinusOne();
		// FIXME: we could also treat Min/Max/UMin/UMax by the INT_MIN/INT_MAX/...
		default:
		return false;
		}
		}

		bool AtomicExpand::simplifyIdempotentRMW(AtomicRMWInst* RMWI) {
		auto TLI = TM->getSubtargetImpl()->getTargetLowering();

		if (auto ResultingLoad = TLI->lowerIdempotentRMWIntoFencedLoad(RMWI)) {
		if (TLI->shouldExpandAtomicLoadInIR(ResultingLoad))
		expandAtomicLoad(ResultingLoad);
		return true;
		}

		return false;
		}

llvm/trunk/lib/Target/X86/X86ISelLowering.h

// ... lines elided; enclosing context: bool CanLowerReturn(CallingConv::ID CallConv, MachineFunction &MF,
LLVMContext &Context) const override;		LLVMContext &Context) const override;

const MCPhysReg *getScratchRegisters(CallingConv::ID CC) const override;		const MCPhysReg *getScratchRegisters(CallingConv::ID CC) const override;

bool shouldExpandAtomicLoadInIR(LoadInst *SI) const override;		bool shouldExpandAtomicLoadInIR(LoadInst *SI) const override;
bool shouldExpandAtomicStoreInIR(StoreInst *SI) const override;		bool shouldExpandAtomicStoreInIR(StoreInst *SI) const override;
bool shouldExpandAtomicRMWInIR(AtomicRMWInst *AI) const override;		bool shouldExpandAtomicRMWInIR(AtomicRMWInst *AI) const override;

		LoadInst *
		lowerIdempotentRMWIntoFencedLoad(AtomicRMWInst *AI) const override;

bool needsCmpXchgNb(const Type *MemType) const;		bool needsCmpXchgNb(const Type *MemType) const;

/// Utility function to emit atomic-load-arith operations (and, or, xor,		/// Utility function to emit atomic-load-arith operations (and, or, xor,
/// nand, max, min, umax, umin). It takes the corresponding instruction to		/// nand, max, min, umax, umin). It takes the corresponding instruction to
/// expand, the associated machine basic block, and the associated X86		/// expand, the associated machine basic block, and the associated X86
/// opcodes for reg/reg.		/// opcodes for reg/reg.
MachineBasicBlock *EmitAtomicLoadArith(MachineInstr *MI,		MachineBasicBlock *EmitAtomicLoadArith(MachineInstr *MI,
MachineBasicBlock *MBB) const;		MachineBasicBlock *MBB) const;

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

// ... lines elided; enclosing context: bool X86TargetLowering::shouldExpandAtomicRMWInIR(AtomicRMWInst *AI) const {
case AtomicRMWInst::UMax:		case AtomicRMWInst::UMax:
case AtomicRMWInst::UMin:		case AtomicRMWInst::UMin:
// These always require a non-trivial set of data operations on x86. We must		// These always require a non-trivial set of data operations on x86. We must
// use a cmpxchg loop.		// use a cmpxchg loop.
return true;		return true;
}		}
}		}

		static bool hasMFENCE(const X86Subtarget& Subtarget) {
		// Use mfence if we have SSE2 or we're on x86-64 (even if we asked for
		// no-sse2). There isn't any reason to disable it if the target processor
		// supports it.
		return Subtarget.hasSSE2() || Subtarget.is64Bit();
		}

		LoadInst *
		X86TargetLowering::lowerIdempotentRMWIntoFencedLoad(AtomicRMWInst *AI) const {
		const X86Subtarget &Subtarget =
		getTargetMachine().getSubtarget<X86Subtarget>();
		unsigned NativeWidth = Subtarget.is64Bit() ? 64 : 32;
		const Type *MemType = AI->getType();
		// Accesses larger than the native width are turned into cmpxchg/libcalls, so
		// there is no benefit in turning such RMWs into loads, and it is actually
		// harmful as it introduces a mfence.
		if (MemType->getPrimitiveSizeInBits() > NativeWidth)
		return nullptr;

		auto Builder = IRBuilder<>(AI);
		Module *M = Builder.GetInsertBlock()->getParent()->getParent();
		auto SynchScope = AI->getSynchScope();
		// We must restrict the ordering to avoid generating loads with Release or
		// AcquireRelease orderings.
		auto Order = AtomicCmpXchgInst::getStrongestFailureOrdering(AI->getOrdering());
		auto Ptr = AI->getPointerOperand();

		// Before the load we need a fence. Here is an example lifted from
		// http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf showing why a fence
		// is required:
		// Thread 0:
		// x.store(1, relaxed);
		// r1 = y.fetch_add(0, release);
		// Thread 1:
		// y.fetch_add(42, acquire);
		// r2 = x.load(relaxed);
		// r1 = r2 = 0 is impossible, but becomes possible if the idempotent rmw is
		// lowered to just a load without a fence. A mfence flushes the store buffer,
		// making the optimization clearly correct.
		// FIXME: it is required if isAtLeastRelease(Order) but it is not clear
		// otherwise, we might be able to be more aggressive on relaxed idempotent
		// rmw. In practice, they do not look useful, so we don't try to be
		// especially clever.
		if (SynchScope == SingleThread) {
		// FIXME: we could just insert an X86ISD::MEMBARRIER here, except we are at
		// the IR level, so we must wrap it in an intrinsic.
		return nullptr;
		} else if (hasMFENCE(Subtarget)) {
		Function *MFence = llvm::Intrinsic::getDeclaration(M,
		Intrinsic::x86_sse2_mfence);
		Builder.CreateCall(MFence);
		} else {
		// FIXME: it might make sense to use a locked operation here but on a
		// different cache-line to prevent cache-line bouncing. In practice it
		// is probably a small win, and x86 processors without mfence are rare
		// enough that we do not bother.
		return nullptr;
		}

		// Finally we can emit the atomic load.
		LoadInst *Loaded = Builder.CreateAlignedLoad(Ptr,
		AI->getType()->getPrimitiveSizeInBits());
		Loaded->setAtomic(Order, SynchScope);
		AI->replaceAllUsesWith(Loaded);
		AI->eraseFromParent();
		return Loaded;
		}

static SDValue LowerATOMIC_FENCE(SDValue Op, const X86Subtarget *Subtarget,		static SDValue LowerATOMIC_FENCE(SDValue Op, const X86Subtarget *Subtarget,
SelectionDAG &DAG) {		SelectionDAG &DAG) {
SDLoc dl(Op);		SDLoc dl(Op);
AtomicOrdering FenceOrdering = static_cast<AtomicOrdering>(		AtomicOrdering FenceOrdering = static_cast<AtomicOrdering>(
cast<ConstantSDNode>(Op.getOperand(1))->getZExtValue());		cast<ConstantSDNode>(Op.getOperand(1))->getZExtValue());
SynchronizationScope FenceScope = static_cast<SynchronizationScope>(		SynchronizationScope FenceScope = static_cast<SynchronizationScope>(
cast<ConstantSDNode>(Op.getOperand(2))->getZExtValue());		cast<ConstantSDNode>(Op.getOperand(2))->getZExtValue());

// The only fence that needs an instruction is a sequentially-consistent		// The only fence that needs an instruction is a sequentially-consistent
// cross-thread fence.		// cross-thread fence.
if (FenceOrdering == SequentiallyConsistent && FenceScope == CrossThread) {		if (FenceOrdering == SequentiallyConsistent && FenceScope == CrossThread) {
// Use mfence if we have SSE2 or we're on x86-64 (even if we asked for		if (hasMFENCE(*Subtarget))
// no-sse2). There isn't any reason to disable it if the target processor
// supports it.
if (Subtarget->hasSSE2() || Subtarget->is64Bit())
return DAG.getNode(X86ISD::MFENCE, dl, MVT::Other, Op.getOperand(0));		return DAG.getNode(X86ISD::MFENCE, dl, MVT::Other, Op.getOperand(0));

SDValue Chain = Op.getOperand(0);		SDValue Chain = Op.getOperand(0);
SDValue Zero = DAG.getConstant(0, MVT::i32);		SDValue Zero = DAG.getConstant(0, MVT::i32);
SDValue Ops[] = {		SDValue Ops[] = {
DAG.getRegister(X86::ESP, MVT::i32), // Base		DAG.getRegister(X86::ESP, MVT::i32), // Base
DAG.getTargetConstant(1, MVT::i8), // Scale		DAG.getTargetConstant(1, MVT::i8), // Scale
DAG.getRegister(0, MVT::i32), // Index		DAG.getRegister(0, MVT::i32), // Index
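
The ordering restriction in `lowerIdempotentRMWIntoFencedLoad` (done there through `AtomicCmpXchgInst::getStrongestFailureOrdering`) drops the release half of the RMW's ordering so that the result is legal on a load. A rough analogue using C++ `std::memory_order` values is sketched below; `strongestLoadOrder` is a hypothetical name and this is a model of the idea, not LLVM's implementation.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: map an RMW ordering to the strongest ordering that
// is still valid on a load by removing any release component.
inline std::memory_order strongestLoadOrder(std::memory_order rmwOrder) {
  switch (rmwOrder) {
  case std::memory_order_release:
    return std::memory_order_relaxed; // release alone has no load part
  case std::memory_order_acq_rel:
    return std::memory_order_acquire; // keep only the acquire part
  default:
    return rmwOrder; // relaxed, consume, acquire and seq_cst are unchanged
  }
}
```

The release semantics that the load can no longer express are what the preceding `mfence` is there to cover, as the litmus test in the code comment above explains.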

llvm/trunk/test/CodeGen/X86/atomic_idempotent.ll

				; RUN: llc < %s -march=x86-64 -verify-machineinstrs | FileCheck %s --check-prefix=CHECK --check-prefix=X64
				; RUN: llc < %s -march=x86 -mattr=+sse2 -verify-machineinstrs | FileCheck %s --check-prefix=CHECK --check-prefix=X32

				; On x86, an atomic rmw operation that does not modify the value in memory
				; (such as atomic add 0) can be replaced by an mfence followed by a mov.
				; This is explained (with the motivation for such an optimization) in
				; http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf

				define i8 @add8(i8* %p) {
				; CHECK-LABEL: add8
				; CHECK: mfence
				; CHECK: movb
				%1 = atomicrmw add i8* %p, i8 0 monotonic
				ret i8 %1
				}

				define i16 @or16(i16* %p) {
				; CHECK-LABEL: or16
				; CHECK: mfence
				; CHECK: movw
				%1 = atomicrmw or i16* %p, i16 0 acquire
				ret i16 %1
				}

				define i32 @xor32(i32* %p) {
				; CHECK-LABEL: xor32
				; CHECK: mfence
				; CHECK: movl
				%1 = atomicrmw xor i32* %p, i32 0 release
				ret i32 %1
				}

				define i64 @sub64(i64* %p) {
				; CHECK-LABEL: sub64
				; X64: mfence
				; X64: movq
				; X32-NOT: mfence
				%1 = atomicrmw sub i64* %p, i64 0 seq_cst
				ret i64 %1
				}

				define i128 @or128(i128* %p) {
				; CHECK-LABEL: or128
				; CHECK-NOT: mfence
				%1 = atomicrmw or i128* %p, i128 0 monotonic
				ret i128 %1
				}

				; For 'and', the idempotent value is (-1)
				define i32 @and32 (i32* %p) {
				; CHECK-LABEL: and32
				; CHECK: mfence
				; CHECK: movl
				%1 = atomicrmw and i32* %p, i32 -1 acq_rel
				ret i32 %1
				}
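
The operand choices exercised by this test file (0 for add/sub/or/xor, -1 for and) can be checked at the source level with `std::atomic`. The sketch below is a standalone sanity check with a hypothetical helper name, `allIdempotent`; it says nothing about the generated machine code, only that each operation returns the old value and leaves memory unchanged.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Applies each RMW from the test file with its idempotent operand and checks
// that it returns the old value and leaves the stored value untouched.
inline bool allIdempotent(std::uint32_t initial) {
  std::atomic<std::uint32_t> cell{initial};
  const std::uint32_t allOnes = ~std::uint32_t{0}; // bit pattern of i32 -1

  const std::uint32_t results[] = {
      cell.fetch_add(0),       // atomicrmw add, operand 0
      cell.fetch_sub(0),       // atomicrmw sub, operand 0
      cell.fetch_or(0),        // atomicrmw or,  operand 0
      cell.fetch_xor(0),       // atomicrmw xor, operand 0
      cell.fetch_and(allOnes), // atomicrmw and, operand -1
  };
  for (std::uint32_t old : results)
    if (old != initial)
      return false;
  return cell.load() == initial;
}
```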