This patch improves code generation for the case described in https://llvm.org/bugs/show_bug.cgi?id=28726.
In the int64 store sequence below, %int_tmp and %float_tmp are bundled together into a single i64 value before being stored to memory. If that i64 value has no uses other than the store, it is more efficient to generate separate stores for %int_tmp and %float_tmp.
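One plausible way to end up with such a sequence (a hypothetical reconstruction, not taken from the bug report) is returning a small struct of an int and a float by value; on x86-64, both 32-bit members fit in one eightbyte, so the return value is coerced to i64 and the members are combined with zext/shl/or before being stored into a caller-side temporary:

  /* Hypothetical C source that can lead to the merged-i64 pattern below.
   * The struct and function names are made up for illustration. */
  struct pair { int i; float f; };

  struct pair make_pair(int int_tmp, float float_tmp) {
    struct pair p = { int_tmp, float_tmp };
    return p;   /* coerced to a single i64 return value on x86-64 */
  }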
Instruction sequence of int64 store:

  %ref.tmp = alloca i64, align 8
  %1 = bitcast float %float_tmp to i32
  %sroa.1.ext = zext i32 %1 to i64
  %sroa.1.shift = shl nuw i64 %sroa.1.ext, 32
  %sroa.0.ext = zext i32 %int_tmp to i64
  %sroa.0.insert = or i64 %sroa.1.shift, %sroa.0.ext
  store i64 %sroa.0.insert, i64* %ref.tmp, align 8
Instruction sequence of separate stores:
  %ref.tmp = alloca i64, align 8
  %1 = bitcast i64* %ref.tmp to i32*
  store i32 %int_tmp, i32* %1, align 4
  %2 = getelementptr i32, i32* %1, i64 1
  %3 = bitcast i32* %2 to float*
  store float %float_tmp, float* %3, align 4
The alloca is not necessary for this example (and was misleading to me).
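As a sanity check on why the split is legal, here is a minimal, self-contained C sketch (assuming a little-endian target; all names are made up) showing that the merged i64 store and the two separate 32-bit stores write the same bytes to memory:

  #include <assert.h>
  #include <stdint.h>
  #include <string.h>

  int main(void) {
    int int_tmp = 42;
    float float_tmp = 3.5f;

    /* Merged store: zext/shl/or the two 32-bit values into one i64. */
    uint32_t float_bits;
    memcpy(&float_bits, &float_tmp, sizeof float_bits);
    uint64_t merged = ((uint64_t)float_bits << 32) | (uint32_t)int_tmp;
    unsigned char a[8], b[8];
    memcpy(a, &merged, sizeof a);

    /* Separate stores: write each 32-bit half directly. */
    memcpy(b, &int_tmp, 4);
    memcpy(b + 4, &float_tmp, 4);

    /* On a little-endian target the resulting memory is identical. */
    assert(memcmp(a, b, sizeof a) == 0);
    return 0;
  }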