This is an archive of the discontinued LLVM Phabricator instance.

Split the store of an int value merged from a pair of smaller values into multiple stores.
Closed, Public

Authored by wmi on Jul 26 2016, 6:22 PM.

Details

Reviewers

chandlerc
majnemer

Commits

rGc54d1298f52c: Split the store of a wide value merged from an int-fp pair into multiple stores.
rL280505: Split the store of a wide value merged from an int-fp pair into multiple stores.

Summary

This patch improves code efficiency for the case described in https://llvm.org/bugs/show_bug.cgi?id=28726

In the int64 store sequence below, %int_tmp and %float_tmp are merged into a single int64 value before being stored to memory. If that int64 value has no use other than the store, it is more efficient to generate separate stores for %int_tmp and %float_tmp.

Instruction sequence of int64 Store:

%ref.tmp = alloca i64, align 8
%1 = bitcast float %float_tmp to i32
%sroa.1.ext = zext i32 %1 to i64
%sroa.1.shift = shl nuw i64 %sroa.1.ext, 32
%sroa.0.ext = zext i32 %int_tmp to i64
%sroa.0.insert = or i64 %sroa.1.shift, %sroa.0.ext
store i64 %sroa.0.insert, i64* %ref.tmp, align 8

Instruction sequence of separate stores:

%ref.tmp = alloca i64, align 8
%1 = bitcast i64* %ref.tmp to i32*
store i32 %int_tmp, i32* %1, align 4
%2 = getelementptr i32, i32* %1, i64 1
%3 = bitcast i32* %2 to float*
store float %float_tmp, float* %3, align 4
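
The equivalence of the two sequences can be checked outside LLVM. The following is a minimal C++ sketch, assuming a little-endian host (as on x86-64) and a 4-byte float; the helper names storeMerged, storeSplit and sameBytes are hypothetical and not part of the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");

// Merged form: bitcast the float to i32, zext, shl by 32, or with the
// zero-extended int, then perform one 8-byte store.
inline void storeMerged(unsigned char *Slot, std::uint32_t IntTmp,
                        float FloatTmp) {
  std::uint32_t FloatBits;
  std::memcpy(&FloatBits, &FloatTmp, sizeof(FloatBits)); // bitcast float -> i32
  std::uint64_t Wide = (static_cast<std::uint64_t>(FloatBits) << 32) |
                       static_cast<std::uint64_t>(IntTmp);
  std::memcpy(Slot, &Wide, sizeof(Wide)); // store i64
}

// Split form: store the int at offset 0 and the float at offset 4.
inline void storeSplit(unsigned char *Slot, std::uint32_t IntTmp,
                       float FloatTmp) {
  std::memcpy(Slot, &IntTmp, sizeof(IntTmp));
  std::memcpy(Slot + 4, &FloatTmp, sizeof(FloatTmp));
}

// Returns true when both forms leave identical bytes in memory.
// Assumption: little-endian host. On a big-endian host the low half of the
// i64 lives at offset 4, so the comparison would fail.
inline bool sameBytes(std::uint32_t IntTmp, float FloatTmp) {
  unsigned char A[8], B[8];
  storeMerged(A, IntTmp, FloatTmp);
  storeSplit(B, IntTmp, FloatTmp);
  return std::memcmp(A, B, 8) == 0;
}
```

Calling sameBytes(42u, 1.5f) on a little-endian machine compares the eight bytes written by each form; the split form needs no shift, or, or float-to-int move.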

Diff Detail

Repository: rL LLVM

Event Timeline

wmi updated this revision to Diff 65636. Jul 26 2016, 6:22 PM

wmi retitled this revision to [InstCombine] Split int64 store into separate int32 stores.

wmi updated this object.

wmi added reviewers: majnemer, chandlerc.

wmi set the repository for this revision to rL LLVM.

wmi added subscribers: llvm-commits, davidxl.

This looks target dependent, and also too specific for the 64-bit case, so probably shouldn't go in instcombine. This transform would be generally undesirable on my target

lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1278 ↗	(On Diff #65636)	The alloca is not necessary for the example (and was misleading to me)

majnemer added inline comments. Jul 26 2016, 6:35 PM

lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1287–1292 ↗	(On Diff #65636)	Is this more canonical than insertelement into a vector and performing a single store? I'm not sure we have a canonicalization for this sort of thing, I'm not sure what the best thing to do is here... Chandler, what are your thoughts?
1289 ↗	(On Diff #65636)	This should be align 8 no?

chandlerc added inline comments. Jul 26 2016, 9:30 PM

lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
1287–1292 ↗	(On Diff #65636)	I agree with Matt generally -- the original bug should be handled by splitting loads and stores late in the pipeline with target specific knowledge. SDAG makes a lot of sense here as a target-independent in representation but target-specific in optimality transform.

Regarding the higher level question you pose David, I think that an integer is the most canonical thing here. I'll try to explain some of why.

The first question is "is one store more canonical than two stores". IMO, the answer should almost always be "yes". The reason is that splitting loads and stores is so much easier than merging. Later on in the pipeline we may have a very hard time effectively merging memory accesses due to memory model constraints, while earlier we may have more information that proves these things are safe. We have consistently moved in this direction both in LLVM and even in Clang itself.

The second question is "what type is the most canonical type?" I think there are no really good answers to that question. When merging adjacent values (as happens at the ABI level for pairs and structs, and in SROA, and elsewhere) there are a few options: a first-class-aggregate of the adjacent types, a vector of integers where the adjacent types are the same width, or an integer the size of the pair.

We've been moving LLVM away from first-class aggregate loads and stores for a while. In retrospect, I'm no longer convinced this was the correct decision, but reversing it would be a huge undertaking and not very easy to do effectively without crippling optimizations.

We also try to avoid forming novel vector types historically because we were worried about how they would lower. These days, we actually do a pretty fantastic job of this, and so I kind of re-evaluated this. However, there are still ways in which a vector of integer types falls down: non-uniform merges (an i32 next to 4 i8s for example), the fact that it still requires float -> int, and worst of all when the non-uniformity extends to overlapping things like unions.

Because of all of the last set of complications, a wide integer type seems a pretty reasonable way to express "a bag of bits". And expressing the extract in terms of bit math is nice because when the components are integers, we often end up effectively combining across the extract operations to further simplify things. So I'm pretty happy with large integer types in the middle end for canonicalization. If we have lots of missed middle-end optimizations because of that representation that would be fixed by a different canonical form, we should definitely revisit this, but currently the test cases have largely involved missing lowering logic late in the backend.
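
The "bag of bits" representation and its bit-math extracts can be illustrated with a small C++ sketch. It is only an illustration of the idea in the comment above, not code from LLVM; packPair, extractLo, extractHi and packMixed are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Pack two 32-bit components into one 64-bit "bag of bits":
// Lo occupies bits [0, 32), Hi occupies bits [32, 64).
inline std::uint64_t packPair(std::uint32_t Lo, std::uint32_t Hi) {
  return (static_cast<std::uint64_t>(Hi) << 32) | Lo;
}

// Extract a component with plain bit math: truncate for the low half,
// shift right then truncate for the high half.
inline std::uint32_t extractLo(std::uint64_t Wide) {
  return static_cast<std::uint32_t>(Wide);
}
inline std::uint32_t extractHi(std::uint64_t Wide) {
  return static_cast<std::uint32_t>(Wide >> 32);
}

// A non-uniform merge (one 16-bit value next to two 8-bit values) fits the
// same scheme, where a vector of equal-width lanes would not.
inline std::uint32_t packMixed(std::uint16_t A, std::uint8_t B,
                               std::uint8_t C) {
  return static_cast<std::uint32_t>(A) |
         (static_cast<std::uint32_t>(B) << 16) |
         (static_cast<std::uint32_t>(C) << 24);
}
```

Because extractLo(packPair(Lo, Hi)) is just Lo, an optimizer that sees both the pack and the extract can fold them away, which is the kind of combining across extract operations the comment refers to.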

Matt, David and Chandler, thanks for the review.

I only looked at the code for the x86-64, PowerPC, and AArch64 targets. For those targets, separate stores seemed better. I am not familiar with targets like AMDGPU. I guess a wider store is generally preferred over multiple narrower stores on AMDGPU? Since the transform may be undesirable on some targets, I agree it is more appropriate to implement it in an SDAG pass. I will update the patch.

Addressed comments from Matt, David, and Chandler.
Two major changes:

reimplement the split in DAGCombiner.
remove the restriction of int64.

Some more changes:

add x86_64-unknown-linux-gnu triple for test.
I found that requiring the stored value to have only one use was too limiting for the optimization, so I removed that restriction. The value may have more than one use, with the other uses in cold blocks. After store splitting and MachineSinking, those bitwise instructions will be moved to colder places. However, the existing ISel cannot look beyond basic block boundaries, so it is impossible to check the other uses. Can we do this blindly (splitting an int64 store into two on x86 may not be very bad for performance)? I really wish we had GlobalISel here; or is there a better existing solution? I have done some internal testing. With the extension to stored values with more than one use, I see a 3% improvement on one internal benchmark (with D23210) and no regressions.
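
Once the int64 restriction is removed, the layout of the two narrower stores is plain arithmetic on the width of the stored value. The sketch below mirrors the HalfValBitSize / 8 offset and Alignment / 2 computation used in the DAGCombiner change of this revision; SplitPlan and planSplit are hypothetical names, and the assertions are assumptions of the sketch rather than checks made by the patch.

```cpp
#include <cassert>

struct SplitPlan {
  unsigned HalfBits;      // width of each narrower store, in bits
  unsigned HiOffsetBytes; // byte offset of the high-half store
  unsigned LoAlign;       // alignment kept for the low-half store
  unsigned HiAlign;       // alignment used for the high-half store
};

// Compute the layout of the two half-width stores that replace one store of
// ValBits bits with the given alignment.
inline SplitPlan planSplit(unsigned ValBits, unsigned Align) {
  // Assumption of this sketch: each half is a whole number of bytes.
  assert(ValBits % 16 == 0 && "halves must be byte-sized");
  // Assumption of this sketch: halving the alignment must not yield zero.
  assert(Align >= 2 && "alignment too small to halve");
  SplitPlan P;
  P.HalfBits = ValBits / 2;
  P.HiOffsetBytes = P.HalfBits / 8;
  P.LoAlign = Align;
  P.HiAlign = Align / 2;
  return P;
}
```

For an i64 store with align 8, this gives two i32 stores at offsets 0 and 4 with alignments 8 and 4; for an i32 store with align 4, two i16 stores at offsets 0 and 2.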

Sorry, I had lost track of this, but I can help finish up the review here.

In D22840#507217, @wmi wrote:

Some more changes:

add x86_64-unknown-linux-gnu triple for test.

I found that requiring the stored value to have only one use was too limiting for the optimization, so I removed that restriction. The value may have more than one use, with the other uses in cold blocks. After store splitting and MachineSinking, those bitwise instructions will be moved to colder places. However, the existing ISel cannot look beyond basic block boundaries, so it is impossible to check the other uses. Can we do this blindly (splitting an int64 store into two on x86 may not be very bad for performance)? I really wish we had GlobalISel here; or is there a better existing solution? I have done some internal testing. With the extension to stored values with more than one use, I see a 3% improvement on one internal benchmark (with D23210) and no regressions.

I agree that the one-use test is too restrictive.

I think for x86, doing this blindly when it is a float and an int is a clear and unambiguous win.

I think it is much less clear for pairs of integers... Does restricting this to only cases with a mixture of floating point and integers still capture all of the improvements? If so, I'd start there.

To implement this, I'd actually detect the two elements merged into a larger integer, and pass those two values to the predicate routine so that targets can inspect them as part of determining whether split stores would be better. Then you can have the x86 target check for a mixture of FP and int types to trigger split stores.

Even if we end up wanting more cases to be split for x86, I think this is probably a good way to structure the predicate so that targets can make more detailed decisions about this kind of thing.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12200–12203	No need for braces around the outer if here, although it probably won't matter based on the refactoring I suggested...
12212–12223	You can write DAG examples using a more brief form that is pretty common:
/// (store (or (zext (bitcast F to i32) to i64),
///            (shl (zext I to i64), 32)), addr)
12256–12260	A common slightly shorter idiom than this that we use is to test if Op1 is not SHL, and then swap, test again and bail, and then you know Op1 is the SHL and you can directly extract the Hi value.
12286	FWIW, here is where I would do the target query, passing in the inputs to the zero extends. Then in the x86 implementation of the target query, I would look for a bitcast from a float, or whatever we end up with for the heuristic.
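
The swap idiom suggested for lines 12256–12260 can be sketched in isolation. This is a toy model, not SelectionDAG code: Opcode, Node and shlOperandId are hypothetical stand-ins for ISD opcodes and SDValue.

```cpp
#include <cassert>
#include <utility>

// Toy stand-ins for DAG nodes (hypothetical, for illustration only).
enum class Opcode { Shl, ZExt, Other };
struct Node {
  Opcode Op;
  int Id;
};

// Commutative match for (or X, (shl ...)): test whether Op1 is the SHL,
// otherwise swap and test again, and bail out if neither operand is a SHL.
// After the check, Op1 is known to be the SHL and Op2 the other operand.
// Returns the Id of the SHL operand, or -1 when there is no match.
inline int shlOperandId(Node Op1, Node Op2) {
  if (Op1.Op != Opcode::Shl) {
    std::swap(Op1, Op2);
    if (Op1.Op != Opcode::Shl)
      return -1;
  }
  return Op1.Id;
}
```

The idiom avoids writing the match twice for the two operand orders; the code after the check only has to handle one arrangement.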

Chandler, thanks for the review!

In the case where I saw a performance change, only an int-float pair is involved. I agree the int-float pair is a clearer win: splitting the wide store saves two bitwise instructions and one float-to-int conversion instruction but adds one store instruction, for a net saving of two instructions. For an int-int pair, the net saving is only one instruction, so the benefit is less clear. Good suggestion, I will implement the target query.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12200–12203	Ok.
12212–12223	Ok. Will change it.
12256–12260	That is definitely better. Will change it to swap.
12286	Thanks. I feel the target will end up here as you suggested.

Change the input of the target query to be elements of the value pair before they are merged. The target query only returns true on x86 when the input is a mixture of int and fp values.
Updated the test case, since a wide store of an int-int value pair is no longer split for now.
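
The layering described here, a generic hook that defaults to false and an x86 override that accepts only int/fp mixtures, can be modeled with a small C++ sketch. The hook name isMultiStoresCheaperThanBitsMerge comes from the patch; ValueKind, TargetHooks and ToyX86Hooks are hypothetical simplifications standing in for SDValue, TargetLowering and X86TargetLowering.

```cpp
#include <cassert>

// Hypothetical summary of what each half of the merged value is.
enum class ValueKind { Int, BitcastFromFloat };

// Generic target interface: by default, splitting is not considered cheaper.
struct TargetHooks {
  virtual ~TargetHooks() = default;
  virtual bool isMultiStoresCheaperThanBitsMerge(ValueKind Lo,
                                                 ValueKind Hi) const {
    (void)Lo;
    (void)Hi;
    return false;
  }
};

// Toy x86-like target: split only when one half is a bitcast from a
// floating-point value, i.e. the pair is a mixture of int and fp.
struct ToyX86Hooks : TargetHooks {
  bool isMultiStoresCheaperThanBitsMerge(ValueKind Lo,
                                         ValueKind Hi) const override {
    return Lo == ValueKind::BitcastFromFloat ||
           Hi == ValueKind::BitcastFromFloat;
  }
};
```

Keeping the pattern match in the target-independent combiner and the profitability decision in the hook lets other targets opt in later without touching the matching code.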

The approach here looks awesome.

Mostly really minor nits about comments, and some suggestions to minimize and clean up the test case.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12208	I would just say "Sometimes" here to avoid people defaulting a new target into splitting the stores without measuring.
lib/Target/X86/X86ISelLowering.h
769–771	It's probably also worth noting that beyond the instruction count difference, there is potentially a more significant benefit because it avoids the float->int domain switch for input value. I have a suspicion that is a significant part of the savings here.
778–780	And here it is probably worth mentioning the other upside of only doing a single memory operation in addition to having minimal instruction count overhead.
test/Transforms/InstCombine/split-store.ll
16–31	It would be good to minimize this test case rather than keeping it so close to what happens to come out of Clang for a particular C++ input. I'd just write direct tests in LLVM IR. Something like:
define void @test1(i32 %i, float %f, i64* %ptr) {
entry:
  ... = zext ...
  ... = bitcast ...
  ... = zext ...
  ... = shl ...
  ... = or ...
  store ..., i64* %ptr
  ret void
}

wmi added inline comments. Aug 26 2016, 10:36 PM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
12208	Fixed.
lib/Target/X86/X86ISelLowering.h
769–771	Comment added
778–780	I am not sure here. The added memory operation is a store. A store usually will not stall the pipeline because there is a store buffer, so I feel the cost of the extra store will be low?
test/Transforms/InstCombine/split-store.ll
16–31	That makes the test much shorter!

Address Chandler's comments: adjust comments and simplify testcase.

Hmmm. I'm not seeing the updated test cases? Not sure if phabricator is just not updating for me or what though...

lib/Target/X86/X86ISelLowering.h
778–780	I agree the cost will be low, I just think it makes sense to mention the different aspects (for example, saving an entry in the store buffer).

It is weird. However, I just realized that in your example the alloca was replaced with a function parameter. That makes the test even shorter, so I updated the test again.

lib/Target/X86/X86ISelLowering.h
778–780	Got it. Added the comment.

Update comment and testcase.

LGTM, feel free to submit. Just a suggestion on further simplification of the tests below, but all the important stuff is already there.

Also, this is a really nice change, thanks for working on it and getting the layering figured out so nicely here.

test/Transforms/InstCombine/split-store.ll
10	You can also probably remove all the function attributes here.

This revision is now accepted and ready to land. Sep 1 2016, 2:49 PM

Closed by commit rL280505: Split the store of a wide value merged from an int-fp pair into multiple stores. (authored by wmi). Sep 2 2016, 10:25 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                          Size
include/llvm/Target/TargetLowering.h          6 lines
lib/CodeGen/SelectionDAG/DAGCombiner.cpp      103 lines
lib/Target/X86/X86ISelLowering.h              17 lines
test/Transforms/InstCombine/split-store.ll    103 lines

Diff 69450

include/llvm/Target/TargetLowering.h

[324 lines not shown; context: public:]
   /// Return true if it is safe to transform an integer-domain bitwise operation
   /// into the equivalent floating-point operation. This should be set to true
   /// if the target has IEEE-754-compliant fabs/fneg operations for the input
   /// type.
   virtual bool hasBitPreservingFPLogic(EVT VT) const {
     return false;
   }

+  /// \brief Return true if it is cheaper to split the store of a merged int val
+  /// from a pair of smaller values into multiple stores.
+  virtual bool isMultiStoresCheaperThanBitsMerge(SDValue Lo, SDValue Hi) const {
+    return false;
+  }
+
   /// \brief Return if the target supports combining a
   /// chain like:
   /// \code
   /// %andResult = and %val1, #imm-with-one-bit-set;
   /// %icmpResult = icmp %andResult, 0
   /// br i1 %icmpResult, label %dest1, label %dest2
   /// \endcode
   /// into a single machine instruction of a form like:
[2,731 lines not shown]

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

[368 lines not shown; context: private:]
     SDValue MatchBSwapHWord(SDNode *N, SDValue N0, SDValue N1);
     SDNode *MatchRotatePosNeg(SDValue Shifted, SDValue Pos, SDValue Neg,
                               SDValue InnerPos, SDValue InnerNeg,
                               unsigned PosOpcode, unsigned NegOpcode,
                               const SDLoc &DL);
     SDNode *MatchRotate(SDValue LHS, SDValue RHS, const SDLoc &DL);
     SDValue ReduceLoadWidth(SDNode *N);
     SDValue ReduceLoadOpStoreWidth(SDNode *N);
+    SDValue splitMergedValStore(StoreSDNode *ST);
     SDValue TransformFPLoadStorePair(SDNode *N);
     SDValue reduceBuildVecExtToExtBuildVec(SDNode *N);
     SDValue reduceBuildVecConvertToConvertBuildVec(SDNode *N);

     SDValue GetDemandedBits(SDValue V, const APInt &Mask);

     /// Walk up chain skipping non-aliasing memory nodes,
     /// looking for aliasing nodes and adding them to the Aliases vector.
[11,806 lines not shown; context: #endif]
   // Make sure to do this only after attempting to merge stores in order to
   // avoid changing the types of some subset of stores due to visit order,
   // preventing their merging.
   if (isa<ConstantFPSDNode>(Value)) {
     if (SDValue NewSt = replaceStoreOfFPConstant(ST))
       return NewSt;
   }

+  if (SDValue NewSt = splitMergedValStore(ST))
+    return NewSt;
+
   return ReduceLoadOpStoreWidth(N);
 }

+/// For the instruction sequence of store below, F and I values
+/// are bundled together as an i64 value before being stored into memory.
+/// Quite often it is more efficient to generate separate stores for F and I,
+/// which can remove the bitwise instructions or sink them to colder places.
+///
+/// (store (or (zext (bitcast F to i32) to i64),
+/// (shl (zext I to i64), 32)), addr) -->
+/// (store F, addr) and (store I, addr+4)
+///
+/// Similarly, splitting for other merged store can also be beneficial, like:
+/// For pair of {i32, i32}, i64 store --> two i32 stores.
+/// For pair of {i32, i16}, i64 store --> two i32 stores.
+/// For pair of {i16, i16}, i32 store --> two i16 stores.
+/// For pair of {i16, i8}, i32 store --> two i16 stores.
+/// For pair of {i8, i8}, i16 store --> two i8 stores.
+///
+/// We allow each target to determine specifically which kind of splitting is
+/// supported.
+///
+/// The store patterns are commonly seen from the simple code snippet below
+/// if only std::make_pair(...) is sroa transformed before inlined into hoo.
+/// void goo(const std::pair<int, float> &);
+/// hoo() {
+/// ...
+/// goo(std::make_pair(tmp, ftmp));
+/// ...
+/// }
+///
+SDValue DAGCombiner::splitMergedValStore(StoreSDNode *ST) {
+  if (OptLevel == CodeGenOpt::None)
+    return SDValue();
+
+  SDValue Val = ST->getValue();
+  SDLoc DL(ST);
+
+  // Match OR operand.
+  if (!Val.getValueType().isScalarInteger() || Val.getOpcode() != ISD::OR)
+    return SDValue();
+
+  // Match SHL operand and get Lower and Higher parts of Val.
+  SDValue Op1 = Val.getOperand(0);
+  SDValue Op2 = Val.getOperand(1);
+  SDValue Lo, Hi;
+  if (Op1.getOpcode() != ISD::SHL) {
+    std::swap(Op1, Op2);
+    if (Op1.getOpcode() != ISD::SHL)
+      return SDValue();
+  }
+  Lo = Op2;
+  Hi = Op1.getOperand(0);
+  if (!Op1.hasOneUse())
+    return SDValue();
+
+  // Match shift amount to HalfValBitSize.
+  unsigned HalfValBitSize = Val.getValueType().getSizeInBits() / 2;
+  ConstantSDNode *ShAmt = dyn_cast<ConstantSDNode>(Op1.getOperand(1));
+  if (!ShAmt || ShAmt->getAPIntValue() != HalfValBitSize)
+    return SDValue();
+
+  // Lo and Hi are zero-extended from int with size less than or equal to 32
+  // to i64.
+  if (Lo.getOpcode() != ISD::ZERO_EXTEND || !Lo.hasOneUse() ||
+      !Lo.getOperand(0).getValueType().isScalarInteger() ||
+      Lo.getOperand(0).getValueType().getSizeInBits() > HalfValBitSize ||
+      Hi.getOpcode() != ISD::ZERO_EXTEND || !Hi.hasOneUse() ||
+      !Hi.getOperand(0).getValueType().isScalarInteger() ||
+      Hi.getOperand(0).getValueType().getSizeInBits() > HalfValBitSize)
+    return SDValue();
+
+  if (!TLI.isMultiStoresCheaperThanBitsMerge(Lo.getOperand(0),
+                                             Hi.getOperand(0)))
+    return SDValue();
+
+  // Start to split store.
+  unsigned Alignment = ST->getAlignment();
+  MachineMemOperand::Flags MMOFlags = ST->getMemOperand()->getFlags();
+  AAMDNodes AAInfo = ST->getAAInfo();
+
+  // Change the sizes of Lo and Hi's value types to HalfValBitSize.
+  EVT VT = EVT::getIntegerVT(*DAG.getContext(), HalfValBitSize);
+  Lo = DAG.getNode(ISD::ZERO_EXTEND, DL, VT, Lo.getOperand(0));
+  Hi = DAG.getNode(ISD::ZERO_EXTEND, DL, VT, Hi.getOperand(0));
+
+  SDValue Chain = ST->getChain();
+  SDValue Ptr = ST->getBasePtr();
+  // Lower value store.
+  SDValue St0 = DAG.getStore(Chain, DL, Lo, Ptr, ST->getPointerInfo(),
+                             ST->getAlignment(), MMOFlags, AAInfo);
+  Ptr =
+      DAG.getNode(ISD::ADD, DL, Ptr.getValueType(), Ptr,
+                  DAG.getConstant(HalfValBitSize / 8, DL, Ptr.getValueType()));
+  // Higher value store.
+  SDValue St1 =
+      DAG.getStore(Chain, DL, Hi, Ptr,
+                   ST->getPointerInfo().getWithOffset(HalfValBitSize / 8),
+                   Alignment / 2, MMOFlags, AAInfo);
+  return DAG.getNode(ISD::TokenFactor, DL, MVT::Other, St0, St1);
+}

 SDValue DAGCombiner::visitINSERT_VECTOR_ELT(SDNode *N) {
   SDValue InVec = N->getOperand(0);
   SDValue InVal = N->getOperand(1);
   SDValue EltNo = N->getOperand(2);
   SDLoc dl(N);

   // If the inserted element is an UNDEF, just use the input vector.
   if (InVal.isUndef())
[2,850 lines not shown]

lib/Target/X86/X86ISelLowering.h

[758 lines not shown; context: public:]
     bool isCheapToSpeculateCttz() const override;

     bool isCheapToSpeculateCtlz() const override;

     bool hasBitPreservingFPLogic(EVT VT) const override {
       return VT == MVT::f32 || VT == MVT::f64 || VT.isVector();
     }

+    bool isMultiStoresCheaperThanBitsMerge(SDValue Lo,
+                                           SDValue Hi) const override {
+      // If the pair to store is a mixture of float and int values, we will
+      // save two bitwise instructions and one float-to-int instruction and
+      // increase one store instruction. It is more likely a win.
+      if (Lo.getOpcode() == ISD::BITCAST || Hi.getOpcode() == ISD::BITCAST) {
+        SDValue BitCast = (Lo.getOpcode() == ISD::BITCAST) ? Lo.getOperand(0)
+                                                           : Hi.getOperand(0);
+        if (BitCast.getValueType().isFloatingPoint())
+          return true;
+      }
+      // If the pair only contains int values, we will save two bitwise
+      // instructions and increase one store instruction. We leave the case
+      // out for now and wait until we find a case showing it is beneficial.
+      return false;
+    }
+
     bool hasAndNotCompare(SDValue Y) const override;

     /// Return the value type to use for ISD::SETCC.
     EVT getSetCCResultType(const DataLayout &DL, LLVMContext &Context,
                            EVT VT) const override;

     /// Determine which of the bits specified in Mask are known to be either
     /// zero or one and return them in the KnownZero/KnownOne bitsets.
[470 lines not shown]

test/Transforms/InstCombine/split-store.ll

				; RUN: llc < %s | FileCheck %s

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				declare void @llvm.lifetime.start(i64, i8* nocapture)
				declare void @llvm.lifetime.end(i64, i8* nocapture)

				declare void @goo1(%"pair1"* dereferenceable(8)) local_unnamed_addr
				%"pair1" = type { i32, float }

				; CHECK-LABEL: int32_float_pair
				; CHECK: movss %xmm0, 4(%rsp)
				; CHECK: movl %edi, (%rsp)
				; CHECK: leaq (%rsp), %rdi
				define void @int32_float_pair(i32 %tmp1, float %tmp2) local_unnamed_addr {
				entry:
				%ref.tmp = alloca i64, align 8
				%tmpcast = bitcast i64* %ref.tmp to %"pair1"*
				%t0 = bitcast i64* %ref.tmp to i8*
				call void @llvm.lifetime.start(i64 8, i8* %t0)
				%t1 = bitcast float %tmp2 to i32
				%retval.sroa.2.0.insert.ext.i = zext i32 %t1 to i64
				%retval.sroa.2.0.insert.shift.i = shl nuw i64 %retval.sroa.2.0.insert.ext.i, 32
				%retval.sroa.0.0.insert.ext.i = zext i32 %tmp1 to i64
				%retval.sroa.0.0.insert.insert.i = or i64 %retval.sroa.2.0.insert.shift.i, %retval.sroa.0.0.insert.ext.i
				store i64 %retval.sroa.0.0.insert.insert.i, i64* %ref.tmp, align 8
				call void @goo1(%"pair1"* dereferenceable(8) %tmpcast)
				call void @llvm.lifetime.end(i64 8, i8* %t0)
				ret void
				}

				declare void @goo2(%"pair2"* dereferenceable(8)) local_unnamed_addr
				%"pair2" = type { float, i32 }

				; CHECK-LABEL: float_int32_pair
				; CHECK: movl %edi, 4(%rsp)
				; CHECK: movss %xmm0, (%rsp)
				; CHECK: leaq (%rsp), %rdi
				define void @float_int32_pair(float %tmp1, i32 %tmp2) local_unnamed_addr #0 {
				entry:
				%ref.tmp = alloca i64, align 8
				%tmpcast = bitcast i64* %ref.tmp to %"pair2"*
				%t0 = bitcast i64* %ref.tmp to i8*
				call void @llvm.lifetime.start(i64 8, i8* %t0) #5
				%t1 = bitcast float %tmp1 to i32
				%retval.sroa.2.0.insert.ext.i = zext i32 %tmp2 to i64
				%retval.sroa.2.0.insert.shift.i = shl nuw i64 %retval.sroa.2.0.insert.ext.i, 32
				%retval.sroa.0.0.insert.ext.i = zext i32 %t1 to i64
				%retval.sroa.0.0.insert.insert.i = or i64 %retval.sroa.2.0.insert.shift.i, %retval.sroa.0.0.insert.ext.i
				store i64 %retval.sroa.0.0.insert.insert.i, i64* %ref.tmp, align 8
				call void @goo2(%"pair2"* dereferenceable(8) %tmpcast)
				call void @llvm.lifetime.end(i64 8, i8* %t0) #5
				ret void
				}

				declare void @goo3(%"pair3"* dereferenceable(8)) local_unnamed_addr
				%"pair3" = type { i16, float }

				; CHECK-LABEL: int16_float_pair
				; CHECK: movss %xmm0, 4(%rsp)
				; CHECK: movzwl %di, %eax
				; CHECK: leaq (%rsp), %rdi
				define void @int16_float_pair(i16 signext %tmp1, float %tmp2) local_unnamed_addr {
				entry:
				%ref.tmp = alloca i64, align 8
				%tmpcast = bitcast i64* %ref.tmp to %"pair3"*
				%t0 = bitcast i64* %ref.tmp to i8*
				call void @llvm.lifetime.start(i64 8, i8* %t0)
				%t1 = bitcast float %tmp2 to i32
				%retval.sroa.2.0.insert.ext.i = zext i32 %t1 to i64
				%retval.sroa.2.0.insert.shift.i = shl nuw i64 %retval.sroa.2.0.insert.ext.i, 32
				%retval.sroa.0.0.insert.ext.i = zext i16 %tmp1 to i64
				%retval.sroa.0.0.insert.insert.i = or i64 %retval.sroa.2.0.insert.shift.i, %retval.sroa.0.0.insert.ext.i
				store i64 %retval.sroa.0.0.insert.insert.i, i64* %ref.tmp, align 8
				call void @goo3(%"pair3"* dereferenceable(8) %tmpcast)
				call void @llvm.lifetime.end(i64 8, i8* %t0)
				ret void
				}

				declare void @goo4(%"pair4"* dereferenceable(8)) local_unnamed_addr
				%"pair4" = type { i8, float }

				; CHECK-LABEL: int8_float_pair
				; CHECK: movss %xmm0, 4(%rsp)
				; CHECK: movzbl %dil, %eax
				; CHECK: leaq (%rsp), %rdi
				define void @int8_float_pair(i8 signext %tmp1, float %tmp2) local_unnamed_addr {
				entry:
				%ref.tmp = alloca i64, align 8
				%tmpcast = bitcast i64* %ref.tmp to %"pair4"*
				%t0 = bitcast i64* %ref.tmp to i8*
				call void @llvm.lifetime.start(i64 8, i8* %t0)
				%t1 = bitcast float %tmp2 to i32
				%retval.sroa.2.0.insert.ext.i = zext i32 %t1 to i64
				%retval.sroa.2.0.insert.shift.i = shl nuw nsw i64 %retval.sroa.2.0.insert.ext.i, 32
				%retval.sroa.0.0.insert.ext.i = zext i8 %tmp1 to i64
				%retval.sroa.0.0.insert.insert.i = or i64 %retval.sroa.2.0.insert.shift.i, %retval.sroa.0.0.insert.ext.i
				store i64 %retval.sroa.0.0.insert.insert.i, i64* %ref.tmp, align 8
				call void @goo4(%"pair4"* dereferenceable(8) %tmpcast)
				call void @llvm.lifetime.end(i64 8, i8* %t0)
				ret void
				}