This is an archive of the discontinued LLVM Phabricator instance.

TLI: Add option to generate dependent stores in scalarization.
Abandoned, Public

Authored by jvesely on Sep 19 2016, 2:16 PM.


Details

Reviewers

tstellarAMD
bogner

Summary

This is used by targets (like R600) that lower truncating stores to read-modify-write (RMW) sequences.
A future optimization should remove most of the redundant stores and loads.
Fixes R600 regressions since r274397.

Diff Detail

Repository: rL LLVM

Event Timeline

jvesely updated this revision to Diff 71881. Sep 19 2016, 2:16 PM

jvesely retitled this revision to "TLI: Add option to generate dependent stores in scalarization."

jvesely updated this object.

jvesely added reviewers: tstellarAMD, bogner.

jvesely set the repository for this revision to rL LLVM.

jvesely added subscribers: llvm-commits, arsenm.

Herald added subscribers: nhaehnle, wdng. Sep 19 2016, 2:16 PM

jvesely added a child revision: D24746: AMDGPU/R600: Don't use REGISTER_{LOAD,STORE} ISD nodes. Sep 19 2016, 2:22 PM

ping

In D24745#552299, @jvesely wrote:

ping

I don't understand why you need this (even if you end up lowering the individual stores using a RMW sequence). Can you please explain?

In D24745#552401, @hfinkel wrote:

I don't understand why you need this (even if you end up lowering the individual stores using a RMW sequence). Can you please explain?

The problem arises when we use an RMW sequence for two different elements of the same word, for example when storing bytes at addresses A and A+1. Let's assume that A is 4-byte aligned. The generated code will look like this:
1: r1 = LOAD A
2: r2 = {r1[8:31], x} (a sequence of AND/OR instructions that masks off the old bits and ORs in the new ones)
3: STORE A, r2
4: r3 = LOAD A
5: r4 = {r3[16:31], y, r3[0:7]} (a sequence of AND/OR instructions that masks off the old bits and ORs in the new ones)
6: STORE A, r4

The original code has no dependency between 3 and 4, so 1 and 4 are loads from the same location and get merged into a single load:
1: r1 = LOAD A
2: r2 = {r1[8:31], x} (a sequence of AND/OR instructions that masks off the old bits and ORs in the new ones)
3: STORE A, r2
5: r4 = {r1[16:31], y, r1[0:7]} (a sequence of AND/OR instructions that masks off the old bits and ORs in the new ones)
6: STORE A, r4
Note that sequences 2,3 and 5,6 are now independent, so the writes can occur in any order. Whichever store lands last overwrites the other store's byte with the stale value from r1, which results in data corruption at A.

This patch adds a chain dependency between 3 and 4, preventing the load elimination. It should still be possible to eliminate both 3 and 4 (which in turn should enable combining the bit ops in 2 and 5).
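The hazard described above can be reproduced outside the compiler. The following is a minimal sketch, not code from the patch: `insertByte`, `storeBytesChained` and `storeBytesMerged` are hypothetical names, memory is modelled as a single 32-bit word, and byte 0 is assumed to be the least significant byte.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: replace byte `idx` (0 = least significant) of a
// 32-bit word, i.e. the AND/OR sequence of steps 2 and 5.
static uint32_t insertByte(uint32_t word, unsigned idx, uint8_t value) {
  const unsigned shift = idx * 8;
  const uint32_t mask = 0xffu << shift;
  return (word & ~mask) | (static_cast<uint32_t>(value) << shift);
}

// Steps 1-6 with the dependency between store 3 and load 4 respected.
static uint32_t storeBytesChained(uint32_t mem, uint8_t x, uint8_t y) {
  uint32_t r1 = mem;                  // 1: r1 = LOAD A
  uint32_t r2 = insertByte(r1, 0, x); // 2: merge x into byte 0
  mem = r2;                           // 3: STORE A, r2
  uint32_t r3 = mem;                  // 4: r3 = LOAD A (sees the new byte 0)
  uint32_t r4 = insertByte(r3, 1, y); // 5: merge y into byte 1
  mem = r4;                           // 6: STORE A, r4
  return mem;
}

// Same sequence after load 4 has been merged into load 1.
static uint32_t storeBytesMerged(uint32_t mem, uint8_t x, uint8_t y) {
  uint32_t r1 = mem;                  // 1: r1 = LOAD A
  uint32_t r2 = insertByte(r1, 0, x); // 2
  mem = r2;                           // 3: STORE A, r2
  uint32_t r4 = insertByte(r1, 1, y); // 5: uses the stale r1, so x is lost
  mem = r4;                           // 6: STORE A, r4
  return mem;
}
```

With an initial word of 0xAABBCCDD, x = 0x11 and y = 0x22, the chained version yields 0xAABB2211, while the merged version yields 0xAABB22DD: the byte written by the first store has been overwritten with its old value.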

In D24745#552833, @jvesely wrote:

The original code does not have dependency between 3 and 4.

Why not? I understand why the truncated stores and extending loads might be independent, but if you have code that widens those without looking at potential aliased peers and adjusting the chain, I think the problem lies there.

In D24745#552857, @hfinkel wrote:

Why not? I understand why the truncated stores and extending loads might be independent, but if you have code that widens those without looking at potential aliased peers and adjusting the chain, I think the problem lies there.

OK. I guess I'll need to look into how to do that then.

jvesely mentioned this in D27964: AMDGPU/R600: Serialize vector trunc stores to private AS. Dec 19 2016, 5:46 PM

Revision Contents

Path                                            Size
include/llvm/Target/TargetLowering.h            3 lines
lib/CodeGen/SelectionDAG/TargetLowering.cpp     11 lines
lib/Target/AMDGPU/AMDGPUISelLowering.cpp        10 lines
lib/Target/AMDGPU/R600ISelLowering.cpp          22 lines
lib/Target/AMDGPU/SIISelLowering.cpp            12 lines
test/CodeGen/AMDGPU/store-private.ll            743 lines

Diff 71881

include/llvm/Target/TargetLowering.h

[3,015 lines hidden; context: public:]
   /// Turn load of vector type into a load of the individual elements.
   /// \param LD load to expand
   /// \returns MERGE_VALUEs of the scalar loads with their chains.
   SDValue scalarizeVectorLoad(LoadSDNode *LD, SelectionDAG &DAG) const;

   // Turn a store of a vector type into stores of the individual elements.
   /// \param ST Store with a vector value type
   /// \returns MERGE_VALUs of the individual store chains.
-  SDValue scalarizeVectorStore(StoreSDNode *ST, SelectionDAG &DAG) const;
+  SDValue scalarizeVectorStore(StoreSDNode *ST, SelectionDAG &DAG,
+                               bool depend = false) const;

   /// Expands an unaligned load to 2 half-size loads for an integer, and
   /// possibly more for vectors.
   std::pair<SDValue, SDValue> expandUnalignedLoad(LoadSDNode *LD,
                                                   SelectionDAG &DAG) const;

   /// Expands an unaligned store to 2 half-size stores for integer values, and
   /// possibly more for vectors.
[56 lines hidden]

lib/CodeGen/SelectionDAG/TargetLowering.cpp

[3,221 lines hidden; context: SDValue TargetLowering::scalarizeVectorLoad(LoadSDNode *LD,]

   return DAG.getMergeValues({ Value, NewChain }, SL);
 }

 // FIXME: This relies on each element having a byte size, otherwise the stride
 // is 0 and just overwrites the same location. ExpandStore currently expects
 // this broken behavior.
 SDValue TargetLowering::scalarizeVectorStore(StoreSDNode *ST,
-                                             SelectionDAG &DAG) const {
+                                             SelectionDAG &DAG,
+                                             bool depend) const {
   SDLoc SL(ST);

   SDValue Chain = ST->getChain();
   SDValue BasePtr = ST->getBasePtr();
   SDValue Value = ST->getValue();
   EVT StVT = ST->getMemoryVT();

   // The type of the data we want to save
[20 lines hidden; context: for (unsigned Idx = 0; Idx < NumElem; ++Idx) {]
     SDValue Ptr = DAG.getNode(ISD::ADD, SL, PtrVT, BasePtr,
                               DAG.getConstant(Idx * Stride, SL, PtrVT));

     // This scalar TruncStore may be illegal, but we legalize it later.
     SDValue Store = DAG.getTruncStore(
         Chain, SL, Elt, Ptr, ST->getPointerInfo().getWithOffset(Idx * Stride),
         MemSclVT, MinAlign(ST->getAlignment(), Idx * Stride),
         ST->getMemOperand()->getFlags(), ST->getAAInfo());
+    if (depend)
+      Chain = Store;
+    else
       Stores.push_back(Store);
   }

-  return DAG.getNode(ISD::TokenFactor, SL, MVT::Other, Stores);
+  return depend ? Chain : DAG.getNode(ISD::TokenFactor, SL, MVT::Other, Stores);
 }

 std::pair<SDValue, SDValue>
 TargetLowering::expandUnalignedLoad(LoadSDNode *LD, SelectionDAG &DAG) const {
   assert(LD->getAddressingMode() == ISD::UNINDEXED &&
          "unaligned indexed loads not implemented!");
   SDValue Chain = LD->getChain();
   SDValue Ptr = LD->getBasePtr();
[334 lines hidden]
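To make the effect of the `depend` flag in scalarizeVectorStore easier to see, here is a toy model of the two chain shapes. It is only a sketch: `ChainNode`, `ToyDAG` and `scalarizeStoreChains` are hypothetical stand-ins and not part of the SelectionDAG API. The model tracks chain edges only and ignores values, pointers and memory operands.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy stand-in for a chain-carrying DAG node: it only records which
// nodes it takes its chain input(s) from.
struct ChainNode {
  enum Kind { Entry, Store, TokenFactor } kind;
  std::vector<std::size_t> chainInputs; // indices into ToyDAG::nodes
};

struct ToyDAG {
  std::vector<ChainNode> nodes;
  std::size_t add(ChainNode::Kind kind, std::vector<std::size_t> inputs) {
    nodes.push_back({kind, std::move(inputs)});
    return nodes.size() - 1;
  }
};

// Mirrors the control flow of the patched loop: with depend == true every
// element store is chained on the previous one; otherwise all stores hang
// off the incoming chain and are joined by a single TokenFactor.
std::size_t scalarizeStoreChains(ToyDAG &dag, std::size_t entry,
                                 std::size_t numElem, bool depend) {
  std::size_t chain = entry;
  std::vector<std::size_t> stores;
  for (std::size_t idx = 0; idx < numElem; ++idx) {
    std::size_t store = dag.add(ChainNode::Store, {chain});
    if (depend)
      chain = store;          // next store must wait for this one
    else
      stores.push_back(store); // stores stay mutually independent
  }
  return depend ? chain : dag.add(ChainNode::TokenFactor, stores);
}
```

In the independent form, later lowering is free to reorder or merge the memory operations of sibling stores, which is what allows the load merging discussed in the comments. The dependent form serializes them at the cost of some scheduling freedom.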

lib/Target/AMDGPU/AMDGPUISelLowering.cpp

[166 lines hidden; context: AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,]
   AddPromotedToType(ISD::STORE, MVT::v2i64, MVT::v4i32);

   setOperationAction(ISD::STORE, MVT::f64, Promote);
   AddPromotedToType(ISD::STORE, MVT::f64, MVT::v2i32);

   setOperationAction(ISD::STORE, MVT::v2f64, Promote);
   AddPromotedToType(ISD::STORE, MVT::v2f64, MVT::v4i32);

-  setTruncStoreAction(MVT::v2i32, MVT::v2i8, Custom);
-  setTruncStoreAction(MVT::v2i32, MVT::v2i16, Custom);
-
-  setTruncStoreAction(MVT::v4i32, MVT::v4i8, Custom);
-  setTruncStoreAction(MVT::v4i32, MVT::v4i16, Expand);
-
-  setTruncStoreAction(MVT::v8i32, MVT::v8i16, Expand);
-  setTruncStoreAction(MVT::v16i32, MVT::v16i8, Expand);
-  setTruncStoreAction(MVT::v16i32, MVT::v16i16, Expand);
-
   setTruncStoreAction(MVT::i64, MVT::i1, Expand);
   setTruncStoreAction(MVT::i64, MVT::i8, Expand);
   setTruncStoreAction(MVT::i64, MVT::i16, Expand);
   setTruncStoreAction(MVT::i64, MVT::i32, Expand);

   setTruncStoreAction(MVT::v2i64, MVT::v2i1, Expand);
   setTruncStoreAction(MVT::v2i64, MVT::v2i8, Expand);
   setTruncStoreAction(MVT::v2i64, MVT::v2i16, Expand);
[2,806 lines hidden]

lib/Target/AMDGPU/R600ISelLowering.cpp

[74 lines hidden; context: R600TargetLowering::R600TargetLowering(const TargetMachine &TM,]

   setOperationAction(ISD::STORE, MVT::i8, Custom);
   setOperationAction(ISD::STORE, MVT::i32, Custom);
   setOperationAction(ISD::STORE, MVT::v2i32, Custom);
   setOperationAction(ISD::STORE, MVT::v4i32, Custom);

   setTruncStoreAction(MVT::i32, MVT::i8, Custom);
   setTruncStoreAction(MVT::i32, MVT::i16, Custom);
+  // We need to include these since trunc STORES to PRIVATE need
+  // special handling to accommodate RMW
+  setTruncStoreAction(MVT::v2i32, MVT::v2i16, Custom);
+  setTruncStoreAction(MVT::v4i32, MVT::v4i16, Custom);
+  setTruncStoreAction(MVT::v8i32, MVT::v8i16, Custom);
+  setTruncStoreAction(MVT::v16i32, MVT::v16i16, Custom);
+  setTruncStoreAction(MVT::v32i32, MVT::v32i16, Custom);
+  setTruncStoreAction(MVT::v2i32, MVT::v2i8, Custom);
+  setTruncStoreAction(MVT::v4i32, MVT::v4i8, Custom);
+  setTruncStoreAction(MVT::v8i32, MVT::v8i8, Custom);
+  setTruncStoreAction(MVT::v16i32, MVT::v16i8, Custom);
+  setTruncStoreAction(MVT::v32i32, MVT::v32i8, Custom);

   // Workaround for LegalizeDAG asserting on expansion of i1 vector stores.
   setTruncStoreAction(MVT::v2i32, MVT::v2i1, Expand);
   setTruncStoreAction(MVT::v4i32, MVT::v4i1, Expand);

   // Set condition code actions
   setCondCodeAction(ISD::SETO, MVT::f32, Expand);
   setCondCodeAction(ISD::SETUO, MVT::f32, Expand);
[1,032 lines hidden]
 SDValue R600TargetLowering::LowerSTORE(SDValue Op, SelectionDAG &DAG) const {
   StoreSDNode *StoreNode = cast<StoreSDNode>(Op);
   unsigned AS = StoreNode->getAddressSpace();
   SDValue Value = StoreNode->getValue();
   EVT ValueVT = Value.getValueType();
   EVT MemVT = StoreNode->getMemoryVT();
   unsigned Align = StoreNode->getAlignment();

+  /* Neither LOCAL nor PRIVATE can do vectors at the moment */
   if ((AS == AMDGPUAS::LOCAL_ADDRESS || AS == AMDGPUAS::PRIVATE_ADDRESS) &&
       ValueVT.isVector()) {
-    return SplitVectorStore(Op, DAG);
+    bool UsesRMW = (AS == AMDGPUAS::PRIVATE_ADDRESS) &&
+                   StoreNode->isTruncatingStore();
+    //TODO: this can be optimized for large vectors
+    // only stores within sizeof(int) need the dependencies
+    return scalarizeVectorStore(StoreNode, DAG, UsesRMW);
   }

-  // Private AS needs special fixes
-  if (Align < MemVT.getStoreSize() && (AS != AMDGPUAS::PRIVATE_ADDRESS) &&
+  if (Align < MemVT.getStoreSize() &&
       !allowsMisalignedMemoryAccesses(MemVT, AS, Align, NULL)) {
     return expandUnalignedStore(StoreNode, DAG);
   }

   SDLoc DL(Op);
   SDValue Chain = StoreNode->getChain();
   SDValue Ptr = StoreNode->getBasePtr();

[1,060 lines hidden]
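The TODO in LowerSTORE notes that only stores within sizeof(int) need the dependencies. One way that refinement could be expressed is sketched below. This is an illustration under assumptions, not something the patch implements: `sharesWordWithPrevious` is a hypothetical helper, the byte offset of the first element within its 32-bit word is assumed to be known at compile time, and element sizes are assumed to be at least one byte.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For each element i of a scalarized vector store, report whether its store
// touches the same 32-bit word as the end of element i-1, in which case the
// two RMW sequences would need a chain dependency.
// baseOffset: byte offset of element 0 (assumed known); elemBytes >= 1.
std::vector<bool> sharesWordWithPrevious(std::size_t baseOffset,
                                         std::size_t elemBytes,
                                         std::size_t numElem) {
  constexpr std::size_t WordBytes = 4;
  std::vector<bool> result(numElem, false); // element 0 has no predecessor
  for (std::size_t i = 1; i < numElem; ++i) {
    const std::size_t prevLastByte = baseOffset + i * elemBytes - 1;
    const std::size_t curFirstByte = baseOffset + i * elemBytes;
    result[i] = (prevLastByte / WordBytes) == (curFirstByte / WordBytes);
  }
  return result;
}
```

For a 4-byte-aligned v8i8 store, elements 1 to 3 and 5 to 7 would be chained on their predecessors, while element 4 starts a new word and could stay independent. When the alignment is below 4 the offset within the word is not known statically, so a real implementation would presumably have to fall back to chaining every element, as the patch does.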

lib/Target/AMDGPU/SIISelLowering.cpp

[87 lines hidden; context: SITargetLowering::SITargetLowering(const TargetMachine &TM,]
   setOperationAction(ISD::LOAD, MVT::i1, Custom);

   setOperationAction(ISD::STORE, MVT::v2i32, Custom);
   setOperationAction(ISD::STORE, MVT::v4i32, Custom);
   setOperationAction(ISD::STORE, MVT::v8i32, Custom);
   setOperationAction(ISD::STORE, MVT::v16i32, Custom);
   setOperationAction(ISD::STORE, MVT::i1, Custom);

+  setTruncStoreAction(MVT::v2i32, MVT::v2i16, Expand);
+  setTruncStoreAction(MVT::v4i32, MVT::v4i16, Expand);
+  setTruncStoreAction(MVT::v8i32, MVT::v8i16, Expand);
+  setTruncStoreAction(MVT::v16i32, MVT::v16i16, Expand);
+  setTruncStoreAction(MVT::v32i32, MVT::v32i16, Expand);
+  setTruncStoreAction(MVT::v2i32, MVT::v2i8, Expand);
+  setTruncStoreAction(MVT::v4i32, MVT::v4i8, Expand);
+  setTruncStoreAction(MVT::v8i32, MVT::v8i8, Expand);
+  setTruncStoreAction(MVT::v16i32, MVT::v16i8, Expand);
+  setTruncStoreAction(MVT::v32i32, MVT::v32i8, Expand);
+
+
   setOperationAction(ISD::GlobalAddress, MVT::i32, Custom);
   setOperationAction(ISD::GlobalAddress, MVT::i64, Custom);
   setOperationAction(ISD::ConstantPool, MVT::v2i64, Expand);

   setOperationAction(ISD::SELECT, MVT::i1, Promote);
   setOperationAction(ISD::SELECT, MVT::i64, Custom);
   setOperationAction(ISD::SELECT, MVT::f64, Promote);
   AddPromotedToType(ISD::SELECT, MVT::f64, MVT::i64);
[3,784 lines hidden]

test/CodeGen/AMDGPU/store-private.ll

This file was added.

				; RUN: llc -march=amdgcn -mcpu=verde -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s
				; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=FUNC %s
				; RUN: llc -march=r600 -mcpu=redwood < %s | FileCheck -check-prefix=EG -check-prefix=FUNC %s
				; RUN: llc -march=r600 -mcpu=cayman < %s | FileCheck -check-prefix=CM -check-prefix=FUNC %s

				; FUNC-LABEL: {{^}}store_i1:
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; SI: buffer_store_byte
				define void @store_i1(i1 addrspace(0)* %out) {
				entry:
				store i1 true, i1 addrspace(0)* %out
				ret void
				}

				; i8 store
				; FUNC-LABEL: {{^}}store_i8:
				; EG: LSHR * [[ADDRESS:T[0-9]\.[XYZW]]], KC0[2].Y, literal.x
				; EG-NEXT: 2
				; EG: MOVA_INT * AR.x (MASKED)
				; EG: MOV [[OLD:T[0-9]\.[XYZW]]], {{.*}}AR.x

				; IG 0: Get the byte index and truncate the value
				; EG: AND_INT * T{{[0-9]}}.[[BI_CHAN:[XYZW]]], KC0[2].Y, literal.x
				; EG: LSHL * T{{[0-9]}}.[[SHIFT_CHAN:[XYZW]]], PV.[[BI_CHAN]], literal.x
				; EG-NEXT: 3(4.203895e-45)
				; EG: AND_INT * T{{[0-9]}}.[[TRUNC_CHAN:[XYZW]]], KC0[2].Z, literal.x
				; EG-NEXT: 255(3.573311e-43)

				; EG: NOT_INT
				; EG: AND_INT {{[\* ]}}[[CLR_CHAN:T[0-9]\.[XYZW]]], {{.}}[[OLD]]
				; EG: OR_INT * [[RES:T[0-9]\.[XYZW]]]
				; TODO: Is the reload necessary?
				; EG: MOVA_INT * AR.x (MASKED), [[ADDRESS]]
				; EG: MOV * T(0 + AR.x).X+, [[RES]]

				; SI: buffer_store_byte

				define void @store_i8(i8 addrspace(0)* %out, i8 %in) {
				entry:
				store i8 %in, i8 addrspace(0)* %out
				ret void
				}

				; i16 store
				; FUNC-LABEL: {{^}}store_i16:
				; EG: LSHR * [[ADDRESS:T[0-9]\.[XYZW]]], KC0[2].Y, literal.x
				; EG-NEXT: 2
				; EG: MOVA_INT * AR.x (MASKED)
				; EG: MOV [[OLD:T[0-9]\.[XYZW]]], {{.*}}AR.x

				; IG 0: Get the byte index and truncate the value
				; EG: AND_INT * T{{[0-9]}}.[[BI_CHAN:[XYZW]]], KC0[2].Y, literal.x
				; EG: LSHL * T{{[0-9]}}.[[SHIFT_CHAN:[XYZW]]], PV.[[BI_CHAN]], literal.x
				; EG-NEXT: 3(4.203895e-45)
				; EG: AND_INT * T{{[0-9]}}.[[TRUNC_CHAN:[XYZW]]], KC0[2].Z, literal.x
				; EG-NEXT: 65535(9.183409e-41)

				; EG: NOT_INT
				; EG: AND_INT {{[\* ]}}[[CLR_CHAN:T[0-9]\.[XYZW]]], {{.}}[[OLD]]
				; EG: OR_INT * [[RES:T[0-9]\.[XYZW]]]
				; TODO: Is the reload necessary?
				; EG: MOVA_INT * AR.x (MASKED), [[ADDRESS]]
				; EG: MOV * T(0 + AR.x).X+, [[RES]]

				; SI: buffer_store_short
				define void @store_i16(i16 addrspace(0)* %out, i16 %in) {
				entry:
				store i16 %in, i16 addrspace(0)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}store_i24:
				; SI: s_lshr_b32 s{{[0-9]+}}, s{{[0-9]+}}, 16
				; SI-DAG: buffer_store_byte
				; SI-DAG: buffer_store_short

				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store can be eliminated
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store can be eliminated
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				define void @store_i24(i24 addrspace(0)* %out, i24 %in) {
				entry:
				store i24 %in, i24 addrspace(0)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}store_i25:
				; SI: s_and_b32 [[AND:s[0-9]+]], s{{[0-9]+}}, 0x1ffffff{{$}}
				; SI: v_mov_b32_e32 [[VAND:v[0-9]+]], [[AND]]
				; SI: buffer_store_dword [[VAND]]

				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG-NOT: MOVA_INT

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM-NOT: MOVA_INT
				define void @store_i25(i25 addrspace(0)* %out, i25 %in) {
				entry:
				store i25 %in, i25 addrspace(0)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}store_v2i8:
				; v2i8 is naturally 2B aligned, treat as i16
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG-NOT: MOVA_INT

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM-NOT: MOVA_INT

				; SI: buffer_store_short
				define void @store_v2i8(<2 x i8> addrspace(0)* %out, <2 x i32> %in) {
				entry:
				%0 = trunc <2 x i32> %in to <2 x i8>
				store <2 x i8> %0, <2 x i8> addrspace(0)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}store_v2i8_unaligned:
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; SI: buffer_store_byte
				define void @store_v2i8_unaligned(<2 x i8> addrspace(0)* %out, <2 x i32> %in) {
				entry:
				%0 = trunc <2 x i32> %in to <2 x i8>
				store <2 x i8> %0, <2 x i8> addrspace(0)* %out, align 1
				ret void
				}


				; FUNC-LABEL: {{^}}store_v2i16:
				; v2i16 is naturally 4B aligned, treat as i32
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG-NOT: MOVA_INT

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM-NOT: MOVA_INT

				; SI: buffer_store_dword
				define void @store_v2i16(<2 x i16> addrspace(0)* %out, <2 x i32> %in) {
				entry:
				%0 = trunc <2 x i32> %in to <2 x i16>
				store <2 x i16> %0, <2 x i16> addrspace(0)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}store_v2i16_unaligned:
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; SI: buffer_store_short
				; SI: buffer_store_short
				define void @store_v2i16_unaligned(<2 x i16> addrspace(0)* %out, <2 x i32> %in) {
				entry:
				%0 = trunc <2 x i32> %in to <2 x i16>
				store <2 x i16> %0, <2 x i16> addrspace(0)* %out, align 2
				ret void
				}

				; FUNC-LABEL: {{^}}store_v4i8:
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG-NOT: MOVA_INT

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM-NOT: MOVA_INT

				; SI: buffer_store_dword
				define void @store_v4i8(<4 x i8> addrspace(0)* %out, <4 x i32> %in) {
				entry:
				%0 = trunc <4 x i32> %in to <4 x i8>
				store <4 x i8> %0, <4 x i8> addrspace(0)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}store_v4i8_unaligned:
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; SI: buffer_store_byte
				; SI: buffer_store_byte
				; SI: buffer_store_byte
				; SI: buffer_store_byte
				; SI-NOT: buffer_store_dword
				define void @store_v4i8_unaligned(<4 x i8> addrspace(0)* %out, <4 x i32> %in) {
				entry:
				%0 = trunc <4 x i32> %in to <4 x i8>
				store <4 x i8> %0, <4 x i8> addrspace(0)* %out, align 1
				ret void
				}

				; FUNC-LABEL: {{^}}store_v8i8_unaligned:
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; SI: buffer_store_byte
				; SI: buffer_store_byte
				; SI: buffer_store_byte
				; SI: buffer_store_byte
				; SI: buffer_store_byte
				; SI: buffer_store_byte
				; SI: buffer_store_byte
				; SI: buffer_store_byte
				; SI-NOT: buffer_store_dword
				define void @store_v8i8_unaligned(<8 x i8> addrspace(0)* %out, <8 x i32> %in) {
				entry:
				%0 = trunc <8 x i32> %in to <8 x i8>
				store <8 x i8> %0, <8 x i8> addrspace(0)* %out, align 1
				ret void
				}

				; FUNC-LABEL: {{^}}store_v4i8_halfaligned:
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; TODO: This load and store cannot be eliminated,
				; they might be different locations
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; SI: buffer_store_short
				; SI: buffer_store_short
				; SI-NOT: buffer_store_dword
				define void @store_v4i8_halfaligned(<4 x i8> addrspace(0)* %out, <4 x i32> %in) {
				entry:
				%0 = trunc <4 x i32> %in to <4 x i8>
				store <4 x i8> %0, <4 x i8> addrspace(0)* %out, align 2
				ret void
				}

				; floating-point store
				; FUNC-LABEL: {{^}}store_f32:
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; SI: buffer_store_dword

				define void @store_f32(float addrspace(0)* %out, float %in) {
				store float %in, float addrspace(0)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}store_v4i16:
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,

				;TODO: why not x2?
				; XSI: buffer_store_dwordx2
				; SI: buffer_store_dword
				; SI: buffer_store_dword
				define void @store_v4i16(<4 x i16> addrspace(0)* %out, <4 x i32> %in) {
				entry:
				%0 = trunc <4 x i32> %in to <4 x i16>
				store <4 x i16> %0, <4 x i16> addrspace(0)* %out
				ret void
				}

				; vec2 floating-point stores
				; FUNC-LABEL: {{^}}store_v2f32:
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,

				;TODO: why not x2?
				; XSI: buffer_store_dwordx2
				; SI: buffer_store_dword
				; SI: buffer_store_dword

				define void @store_v2f32(<2 x float> addrspace(0)* %out, float %a, float %b) {
				entry:
				%0 = insertelement <2 x float> <float 0.0, float 0.0>, float %a, i32 0
				%1 = insertelement <2 x float> %0, float %b, i32 1
				store <2 x float> %1, <2 x float> addrspace(0)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}store_v3i32:
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,

				;TODO: why not x2?
				; XSI-DAG: buffer_store_dwordx2
				; SI: buffer_store_dword
				; SI: buffer_store_dword
				; SI: buffer_store_dword

				define void @store_v3i32(<3 x i32> addrspace(0)* %out, <3 x i32> %a) nounwind {
				store <3 x i32> %a, <3 x i32> addrspace(0)* %out, align 16
				ret void
				}

				; FUNC-LABEL: {{^}}store_v4i32:
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,

				;TODO: why not x4?
				; XSI: buffer_store_dwordx4
				; SI: buffer_store_dword
				; SI: buffer_store_dword
				; SI: buffer_store_dword
				; SI: buffer_store_dword
				define void @store_v4i32(<4 x i32> addrspace(0)* %out, <4 x i32> %in) {
				entry:
				store <4 x i32> %in, <4 x i32> addrspace(0)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}store_v4i32_unaligned:
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,

				;TODO: why not x4?
				; XSI: buffer_store_dwordx4
				; SI: buffer_store_dword
				; SI: buffer_store_dword
				; SI: buffer_store_dword
				; SI: buffer_store_dword
				define void @store_v4i32_unaligned(<4 x i32> addrspace(0)* %out, <4 x i32> %in) {
				entry:
				store <4 x i32> %in, <4 x i32> addrspace(0)* %out, align 4
				ret void
				}

				; v4f32 store
				; FUNC-LABEL: {{^}}store_v4f32:
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,

				;TODO: why not x4?
				; XSI: buffer_store_dwordx4
				; SI: buffer_store_dword
				; SI: buffer_store_dword
				; SI: buffer_store_dword
				; SI: buffer_store_dword
				define void @store_v4f32(<4 x float> addrspace(0)* %out, <4 x float> addrspace(0)* %in) {
				%1 = load <4 x float>, <4 x float> addrspace(0)* %in
				store <4 x float> %1, <4 x float> addrspace(0)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}store_i64_i8:
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; SI: buffer_store_byte
				define void @store_i64_i8(i8 addrspace(0)* %out, i64 %in) {
				entry:
				%0 = trunc i64 %in to i8
				store i8 %0, i8 addrspace(0)* %out
				ret void
				}

				; FUNC-LABEL: {{^}}store_i64_i16:
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}{{T[0-9]+\.[XYZW]}}, T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; SI: buffer_store_short
				define void @store_i64_i16(i16 addrspace(0)* %out, i64 %in) {
				entry:
				%0 = trunc i64 %in to i16
				store i16 %0, i16 addrspace(0)* %out
				ret void
				}

				; The stores in this function are combined by the optimizer to create a
				; 64-bit store with 32-bit alignment. This is legal and the legalizer
				; should not try to split the 64-bit store back into 2 32-bit stores.

				; FUNC-LABEL: {{^}}vecload2:
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,

				;TODO: why not x2?
				; XSI: buffer_store_dwordx2
				; SI: buffer_store_dword
				; SI: buffer_store_dword
				define void @vecload2(i32 addrspace(0)* nocapture %out, i32 addrspace(2)* nocapture %mem) #0 {
				entry:
				%0 = load i32, i32 addrspace(2)* %mem, align 4
				%arrayidx1.i = getelementptr inbounds i32, i32 addrspace(2)* %mem, i64 1
				%1 = load i32, i32 addrspace(2)* %arrayidx1.i, align 4
				store i32 %0, i32 addrspace(0)* %out, align 4
				%arrayidx1 = getelementptr inbounds i32, i32 addrspace(0)* %out, i64 1
				store i32 %1, i32 addrspace(0)* %arrayidx1, align 4
				ret void
				}

				; When i128 was a legal type, this program generated "cannot select" errors:

				; FUNC-LABEL: {{^}}"i128-const-store":
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; EG: MOVA_INT
				; EG: MOV {{[\* ]*}}T(0 + AR.x).X+,

				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,
				; CM: MOVA_INT
				; CM: MOV {{[\* ]*}}T(0 + AR.x).X+,

				;TODO: why not x4?
				; XSI: buffer_store_dwordx4
				; SI: buffer_store_dword
				; SI: buffer_store_dword
				; SI: buffer_store_dword
				; SI: buffer_store_dword
				define void @i128-const-store(i32 addrspace(0)* %out) {
				entry:
				store i32 1, i32 addrspace(0)* %out, align 4
				%arrayidx2 = getelementptr inbounds i32, i32 addrspace(0)* %out, i64 1
				store i32 1, i32 addrspace(0)* %arrayidx2, align 4
				%arrayidx4 = getelementptr inbounds i32, i32 addrspace(0)* %out, i64 2
				store i32 2, i32 addrspace(0)* %arrayidx4, align 4
				%arrayidx6 = getelementptr inbounds i32, i32 addrspace(0)* %out, i64 3
				store i32 2, i32 addrspace(0)* %arrayidx6, align 4
				ret void
				}


				attributes #0 = { nounwind }