This is an archive of the discontinued LLVM Phabricator instance.

PR 23155 - Improvement to X86 16 bit operation promotion for better performance.
AbandonedPublic

Authored by kbsmith1 on Apr 22 2015, 3:11 PM.

Details

Summary

This change improves the code for X86 16 bit operation promotion by checking more carefully for cases where promotion
shouldn't happen, thus allowing more cases to be promoted to 32 bits. This improves performance in some cases where
16 bit operations create false dependencies on the upper portions of the registers they write.

Diff Detail

Repository
rL LLVM

Event Timeline

kbsmith1 updated this revision to Diff 24261.Apr 22 2015, 3:11 PM
kbsmith1 retitled this revision from to PR 23155 - Improvement to X86 16 bit operation promotion for better performance..
kbsmith1 updated this object.
kbsmith1 edited the test plan for this revision. (Show Details)
kbsmith1 set the repository for this revision to rL LLVM.

Ping.

Sanjay, Simon, or Elena, would any or all of you be willing to review this, please?

Thank you,
Kevin Smith

spatel added a subscriber: spatel.

Hi Kevin -

Roping in some other potentially interested reviewers based on past activity.

I also added some comments to https://llvm.org/bugs/show_bug.cgi?id=23155 and linked some other partial reg update bugs.

We need some clarification on what the expected behavior is wrt partial reg updates and the various micro-architectures. E.g., I'm unable to reproduce all of your Haswell perf results locally... which seems to line up with Agner's advice, but then we definitely see a perf hit on bzip2 in https://llvm.org/bugs/show_bug.cgi?id=22473 ... but maybe there are different factors in play there and we're confusing the issues?

For ease of reference, here is the comment I added to 23155:

As with Agner's comments in 17113, I agree that the newer Intel architectures don't really suffer from partial register stalls in the sense that the Pentium Pro, Pentium 4, and older architectures did. As noted in 17113:

  • There is no penalty on Haswell for partial register access.
  • On Sandy Bridge, the cost is a single uop that gets automatically inserted at the cost of 1 cycle latency.
  • On Ivy Bridge there is no penalty except for the "high" byte subregs (AH, BH, etc.), in which case it behaves like Sandy Bridge.

However, whenever a partial register is the destination of an operation that doesn't otherwise need to read the register (as occurs with movw and movb), the instruction creates a read dependence on the upper portion of the register. If a movzbl or movzwl is used instead, the destination register is fully killed, eliminating this "false" dependence on the upper portion of the register. This issue affects both word and byte operations. It is worth noting, however, that this only really matters in relatively tight loops, where the false dependence arc becomes a loop-carried dependence that effectively keeps the out-of-order processor from overlapping multiple iterations of the loop.
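
To make the loop-carried case concrete, here is a minimal C++ sketch of the kind of tight loop being described. The function and its names are purely illustrative assumptions, not part of this patch; the comments just restate the movw/movzwl behavior described above, and the exact instructions any given compiler emits will depend on the target and optimization level.

    // Hedged sketch (not from this patch): a tight loop where a 16 bit load
    // feeds a loop-carried sum.
    #include <cstddef>
    #include <cstdint>

    uint32_t sum16(const uint16_t *p, size_t n) {
      uint32_t sum = 0;
      for (size_t i = 0; i != n; ++i) {
        // If the load of p[i] is emitted as a movw, only the low 16 bits of
        // the destination register are written, so the instruction also
        // depends on whatever last wrote the upper bits -- a false,
        // loop-carried dependence in a loop this tight.
        // If a movzwl is emitted instead, the full register is written
        // (killed) and that dependence disappears.
        sum += p[i];
      }
      return sum;
    }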

From Chandler's comments in 22473:
We need to add a pass that replaces movb (and movw) with movzbl (and movzwl) when the destination is a register and the high bytes aren't used. Then we need to benchmark bzip2 to ensure that this recovers all of the performance that forcing the use of cmpl did, and probably some other sanity benchmarking. Then we can swap the cmpl formation for the movzbl formation.

I am in agreement that this would be a good solution. If you, Chandler, and Eric all like that direction, I will be willing to work on that. I also have access to the SPEC benchmarks, both 2000 and 2006, so I can benchmark bzip2 specifically, since that is something the community considers important.

Kevin

ab added subscribers: Unknown Object (MLST), ab.May 7 2015, 3:36 PM

+llvm-commits

mkuper added a subscriber: mkuper.May 9 2015, 11:44 PM
chandlerc edited edge metadata.May 12 2015, 10:48 AM

From Chandler's comments in 22473:
We need to add a pass that replaces movb (and movw) with movzbl (and movzwl) when the destination is a register and the high bytes aren't used. Then we need to benchmark bzip2 to ensure that this recovers all of the performance that forcing the use of cmpl did, and probably some other sanity benchmarking. Then we can swap the cmpl formation for the movzbl formation.

I am in agreement that this would be a good solution. If you, Chandler, and Eric all like that direction, I will be willing to work on that. I also have access to the SPEC benchmarks, both 2000 and 2006, so I can benchmark bzip2 specifically, since that is something the community considers important.

I would be *very* interested in this, and would love it if you could work on it. I suspect you're in a much better position to implement, document, and evaluate the results. We really need to kill the 'cmpl' hack that is currently used.

Thanks for the support, Chandler. I am starting to work on this.

My initial thoughts are:

1 - A very late pass through the MachineInstrs that would be inserted as part of X86PassConfig::addPreEmitPass.

2 - Initially look for 8 bit and 16 bit operations that would be better expanded into 32 bit operations.

  • There could be several different reasons to do this:
    a - Specifically for the case in PR23155, where a false dependence potentially slows execution.
    b - In general, for cases where partial registers may cost something (Intel X86 prior to Haswell).
    c - Cases where code size could be saved by using an equivalent 32 bit instruction, such as 16 bit instructions that encode shorter as 32 bit instructions.
  We want to do this very late to allow memory operations to still be folded into the 16 and 8 bit operations, rather than relying on heuristics to predict this. (A rough sketch of what such a pass might look like follows.)
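
As a rough illustration of that direction, here is a hedged C++ skeleton of such a late MachineFunction pass. The pass name is an assumption of mine, and the actual opcode checks and liveness queries are left as placeholder comments; only the hook point, X86PassConfig::addPreEmitPass, is taken from the plan above.

    // Hedged skeleton only: the name X86WidenPartialRegOps is illustrative,
    // and the rewrite logic is deliberately stubbed out.
    #include "llvm/CodeGen/MachineBasicBlock.h"
    #include "llvm/CodeGen/MachineFunction.h"
    #include "llvm/CodeGen/MachineFunctionPass.h"
    #include "llvm/CodeGen/MachineInstr.h"

    using namespace llvm;

    namespace {
    struct X86WidenPartialRegOps : public MachineFunctionPass {
      static char ID;
      X86WidenPartialRegOps() : MachineFunctionPass(ID) {}

      bool runOnMachineFunction(MachineFunction &MF) override {
        bool Changed = false;
        for (MachineBasicBlock &MBB : MF)
          for (MachineInstr &MI : MBB) {
            // Placeholder: recognize an 8 or 16 bit move (movb/movw) whose
            // destination's upper bits are dead, and rewrite it to the
            // zero-extending 32 bit form (movzbl/movzwl).
            (void)MI;
          }
        return Changed;
      }
    };
    char X86WidenPartialRegOps::ID = 0;
    } // end anonymous namespace

    // The pass would be registered from X86PassConfig::addPreEmitPass(),
    // e.g. addPass(new X86WidenPartialRegOps()).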

If you have any comments or disagreements with that direction please let me know.

Kevin B. Smith

kbsmith1 abandoned this revision.Aug 26 2015, 10:31 AM

Abandoning this in favor of a later pass to fix these up.