PR24191 finds that the expected memory-register operations aren't generated when a relaxed { load ; modify ; store } sequence is used. This is similar to PR17281, which was addressed in D4796, but only for memory-immediate operations (and only for memory orderings up to acquire and release). This patch adds the memory-register forms and also handles some floating-point operations.
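For concreteness, the kind of sequence involved looks roughly like the sketch below (an illustrative example, not copied from the patch's tests; the function name and the exact instruction mentioned in the comments are assumptions). The goal is a single memory-register instruction rather than a separate load, register add, and store.

```llvm
; Relaxed ("monotonic") load / add / store on the same address. Since the load
; and the store are separate atomic operations rather than an atomicrmw, the
; read-modify-write as a whole was never required to be atomic, so the backend
; may emit a single memory-register instruction (e.g. "addl %esi, (%rdi)" on
; x86-64), whose load and store remain individually atomic.
define void @relaxed_add_sketch(i32* %p, i32 %v) {
  %old = load atomic i32, i32* %p monotonic, align 4
  %new = add i32 %old, %v
  store atomic i32 %new, i32* %p monotonic, align 4
  ret void
}
```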
Event Timeline
Will this optimization transform:

    int foo() {
      int r = atomic_load_n(&x, __ATOMIC_RELAXED);
      atomic_store_n(&x, r+1, __ATOMIC_RELAXED);
      return r;
    }

? If yes, how?
Good point, I added test add_32r_self to ensure that this doesn't happen, and that the pattern matching figures out dependencies properly.
I am glad that the comment was useful, but I actually asked a different thing :)
My example does not contain self-add. It contains two usages of a load result, and one of the usages can be potentially folded. My concern was that the code can be compiled as:
    MOV [addr], r
    ADD [addr], 1
    MOV r, rax
    RET
or to:
    ADD [addr], 1
    MOV [addr], rax
    RET
Both of which would be incorrect transformations -- two loads instead of one.
I guess this transformation should require that the folded store is the only usage of the load result.
Oh sorry, I totally misunderstood you! I added a test for this. IIUC it can't happen because the entire pattern that's matched is replaced with a pseudo instruction, so an escaping intermediate result wouldn't have a def anymore.
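For reference, the escaping-use case discussed above looks roughly like this in IR (an illustrative sketch with made-up names, not the actual test added to atomic_mi.ll):

```llvm
@x = global i32 0

define i32 @load_add_store_and_return() {
  ; %r has two uses: the add that feeds the store, and the return value.
  ; A memory-register add on x would re-read the value from memory (a second
  ; load), so this sequence must keep the plain load, register add, and store.
  %r = load atomic i32, i32* @x monotonic, align 4
  %s = add i32 %r, 1
  store atomic i32 %s, i32* @x monotonic, align 4
  ret i32 %r
}
```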
- atomic_mi.ll: add a colon after the FileCheck LABEL checks to prevent partial string matches.
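To illustrate the nit (the check prefix and symbol names below are placeholders, not necessarily the ones used in atomic_mi.ll): without a trailing colon, a LABEL pattern can also match a longer symbol that merely shares the prefix, e.g. add_32r versus add_32r_self.

```llvm
; Too permissive: also matches the "add_32r_self:" label line.
; X64-LABEL: add_32r
; Exact: only matches the "add_32r:" label line.
; X64-LABEL: add_32r:
```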
LGTM for the integer part.
I am not really convinced that the floating point change belongs in the same patch: it is conceptually different, and does not seem to share any code with the rest of the patch. It is also not obvious to me why it is only tested on X86_64; a comment would be appreciated.
| test/CodeGen/X86/atomic_mi.ll | | |
|---|---|---|
| 847 ↗ | (On Diff #30387) | Why? |
| 864 ↗ | (On Diff #30387) | Why? |
I improved the comment for x86-32. LLVM's code generation was silly when I was testing it out. The instructions I use are in SSE, and even with -mattr=+sse (and +sse2) it wasn't particularly good code, even without using any atomics. I figure x86-32 optimization of floating-point atomics isn't particularly useful compared to x86-64 at this point in time.
Actually, *sd instructions are SSE2, so I fixed the pattern matcher :)
To be specific, for x86-32 with -mattr=+sse the following code is generated:
    fadd_32r:                       # @fadd_32r
    # ...
        movl    12(%esp), %eax
        movl    (%eax), %ecx
        movl    %ecx, (%esp)
        movss   (%esp), %xmm0       # xmm0 = mem[0],zero,zero,zero
        addss   16(%esp), %xmm0
        movss   %xmm0, 4(%esp)
        movl    4(%esp), %ecx
        movl    %ecx, (%eax)
        addl    $8, %esp
        retl
    # ...
    fadd_64r:                       # @fadd_64r
    # ...
        movl    32(%esp), %esi
        xorl    %eax, %eax
        xorl    %edx, %edx
        xorl    %ebx, %ebx
        xorl    %ecx, %ecx
        lock    cmpxchg8b (%esi)
        movl    %edx, 12(%esp)
        movl    %eax, 8(%esp)
        fldl    8(%esp)
        faddl   36(%esp)
        fstpl   (%esp)
        movl    (%esp), %ebx
        movl    4(%esp), %ecx
        movl    (%esi), %eax
        movl    4(%esi), %edx
    # ...
Part of the problem is the x86-32 calling convention, and part of the problem is that 64-bit floating point goes through x87.
With -mattr=+sse,+sse2 the generated code becomes:
    fadd_32r:                       # @fadd_32r
    # ...
        movss   8(%esp), %xmm0      # xmm0 = mem[0],zero,zero,zero
        movl    4(%esp), %eax
        addss   (%eax), %xmm0
        movss   %xmm0, (%eax)
        retl
    # ...
    fadd_64r:                       # @fadd_64r
    # ...
        movl    32(%esp), %esi
        xorl    %eax, %eax
        xorl    %edx, %edx
        xorl    %ebx, %ebx
        xorl    %ecx, %ecx
        lock    cmpxchg8b (%esi)
        movl    %edx, 12(%esp)
        movl    %eax, 8(%esp)
        movsd   8(%esp), %xmm0      # xmm0 = mem[0],zero
        addsd   36(%esp), %xmm0
        movsd   %xmm0, (%esp)
        movl    (%esp), %ebx
        movl    4(%esp), %ecx
        movl    (%esi), %eax
        movl    4(%esi), %edx
I could do something with this, but I'm a bit wary of how the calling convention works out (or rather, worried that the code generation might change slightly and throw things off).
@echristo suggested I add @chandlerc because this patch says "atomic". Note @dvyukov's LGTM above.
I would have preferred to see the renaming in another patch, but at this point I don't think it's worth the effort to split it out. It's a mechanical change, not a change in behavior, and tablegen would have made it clear if it didn't like the change.
So, I agree with @dvyukov, LGTM.
Cheers,
Pete
| lib/Target/X86/X86InstrCompiler.td | | |
|---|---|---|
| 765 ↗ | (On Diff #30499) | This was there prior to your change, but I wonder if we should have a later patch (by you or me or anyone else) to consider removing this cast. We can do so by taking 'SDNode op' instead of a string. For example: |

    diff --git a/lib/Target/X86/X86InstrCompiler.td b/lib/Target/X86/X86InstrCompiler.td
    +++ b/lib/Target/X86/X86InstrCompiler.td
    -multiclass RELEASE_BINOP_MI<string op> {
    +multiclass RELEASE_BINOP_MI<SDNode op> {
       def NAME#8mi : I<0, Pseudo, (outs), (ins i8mem:$dst, i8imm:$src),
                        "#RELEASE_BINOP PSEUDO!",
    +                   [(atomic_store_8 addr:$dst,
    +                     (op (atomic_load_8 addr:$dst), (i8 imm:$src)))]>;
       // NAME#16 is not generated as 16-bit arithmetic instructions are considered
       // costly and avoided as far as possible by this backend anyway
       def NAME#32mi : I<0, Pseudo, (outs), (ins i32mem:$dst, i32imm:$src),
                         "#RELEASE_BINOP PSEUDO!",
    +                    [(atomic_store_32 addr:$dst,
    +                      (op (atomic_load_32 addr:$dst), (i32 imm:$src)))]>;
       def NAME#64mi32 : I<0, Pseudo, (outs), (ins i64mem:$dst, i64i32imm:$src),
                          "#RELEASE_BINOP PSEUDO!",
    +                     [(atomic_store_64 addr:$dst,
    +                       (op (atomic_load_64 addr:$dst), (i64immSExt32:$src)))]>;
     }
| lib/Target/X86/X86InstrCompiler.td | | |
|---|---|---|
| 765 ↗ | (On Diff #30499) | I can send a follow-up. |