During the reordering transformation we should avoid reordering bundles like fadd/fsub, because doing so may prevent them from being matched into a single vector instruction (addsub) on x86.
We do this by checking whether a TreeEntry matches such a pattern and, if so, adding it to the list of TreeEntries whose orders need to be considered.
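A minimal sketch of the opcode check, assuming an AltShuffle-style TreeEntry that records a main and an alternate opcode (the helper name and its placement are illustrative, not taken from the diff):

```cpp
#include "llvm/IR/Instruction.h"

using namespace llvm;

// Sketch only: a bundle that alternates fadd/fsub can potentially lower to a
// single x86 (v)addsub instruction, so reordering should treat its current
// order as a candidate instead of freely permuting the lanes.
static bool isAddSubLikePattern(unsigned MainOpcode, unsigned AltOpcode) {
  return (MainOpcode == Instruction::FAdd && AltOpcode == Instruction::FSub) ||
         (MainOpcode == Instruction::FSub && AltOpcode == Instruction::FAdd);
}
```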
Event Timeline
Do you have many more examples of these patterns? If the addsub<->subadd pattern is the main problem, it shouldn't take much to fix it later on in the backend.
I am not sure if there is any other pattern, but this one showed up in a regression in the Eigen benchmark.
Hmm, yeah, fixing it in the backend may be an option, but what if we end up not vectorizing the code because of this? The alternative is a blend + add + sub, which should increase the cost quite a bit. So it is probably best to teach reordering not to mess up this pattern in the first place.
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll:121–127
I don't quite understand what the difference is here. Could you explain, please?
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll:121–127
Before this patch the pattern shuffle + fadd + fsub lowers to 3 instructions: blend + vector add + vector sub (the shuffle selects TMP3[0],TMP2[1], which is fadd[0],fsub[1], the inverse of the addsub pattern). With this patch shuffle + fadd + fsub lowers to a single addsub instruction (the shuffle selects TMP2[0],TMP3[2], which is fsub[0],fadd[1]).
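For concreteness, here is what the two lowerings compute on a <2 x double>, written with x86 intrinsics (an illustrative sketch, not code from the patch; the function names are made up):

```cpp
#include <immintrin.h>

// The addsub pattern: fsub in lane 0, fadd in lane 1. A single (v)addsubpd
// computes dst = { a[0] - b[0], a[1] + b[1] }.
__m128d addsub(__m128d a, __m128d b) {
  return _mm_addsub_pd(a, b);
}

// The inverse interleaving (fadd[0], fsub[1]) has no single-instruction
// form, so it takes the 3-instruction add + sub + blend sequence instead.
__m128d subadd(__m128d a, __m128d b) {
  __m128d Add = _mm_add_pd(a, b);
  __m128d Sub = _mm_sub_pd(a, b);
  return _mm_blend_pd(Add, Sub, /*imm8=*/2); // { Add[0], Sub[1] }
}
```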
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll:121–127
Ok, why not fix it in the backend? This new function you added does not affect the cost, but ignoring the shuffle actually increases the cost of the tree.
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll:121–127
I think it is better to do it here because we may end up not vectorizing some code because of the additional cost of the blend + add + sub pattern. I think I can fix the tree cost with a follow-up patch that fixes the cost of the AltShuffle pattern when it corresponds to the addsub instruction.
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll:121–127
Ah, I see, thanks. What about trying to do both: lowering in the backend and the cost adjustment? Is it possible?
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll:121–127
Hmm, I guess in the backend we won't have access to a cost-benefit analysis like the one we are doing during the reordering step (i.e., finding the most popular order). So we would have to do a simple conversion of any sub-add pattern to an add-sub + shuffles, and I am not sure that would always be profitable. I think the addsub pattern should be taken into consideration when looking for the most popular ordering so that it can influence the decision, but I am not so sure we can always justify the cost of the extra shuffles.
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll:121–127
The backend does not need the cost; it just checks for the pattern and lowers the sequence to the instructions. Why does it need the cost?
llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll:121–127
I think there is a difference between doing the transformation here vs in the backend. In the backend we can't easily check whether reordering a sub-add to an add-sub can also remove some of the shuffles that are already in the code. For example, if we have code like this:

```llvm
%vsub = fsub <2 x double> %subop, ...
%vadd = fadd <2 x double> %addop, ...
%shuffle = shufflevector <2 x double> %vsub, <2 x double> %vadd, <2 x i32> <i32 0, i32 3> ; vsub[0], vadd[1]
store <2 x double> %shuffle, ...
```

this will be lowered to: add + sub + blend + store. But if we convert the sub-add pattern (i.e., add + sub + blend) to an add-sub + shuffles in the backend, then this will introduce 3 shuffles: 2 for the operands and 1 for the user, resulting in the pattern: pshuf + pshuf + addsub + pshuf + store. This is clearly worse than the original code. But the transformation could be profitable if we could remove some of the shuffles, for example if the code looked like this:

```llvm
%vsub = fsub <2 x double> %subop, ...
%vadd = fadd <2 x double> %addop, ...
%shuffle = shufflevector <2 x double> %vsub, <2 x double> %vadd, <2 x i32> <i32 0, i32 3>               ; vsub[0], vadd[1]
%shuffle2 = shufflevector <2 x double> %shuffle, <2 x double> zeroinitializer, <2 x i32> <i32 1, i32 0> ; shuffle[1], shuffle[0]
store <2 x double> %shuffle2, ...
```

Converting this to add-sub in the backend would result in pshuf + pshuf + addsub + pshuf + pshuf + store, but the latter two pshuf instructions negate each other, so they could be optimized away. A similar optimization could happen if the inputs were already reordered with a shuffle earlier. What I am trying to say is that simply converting a sub-add pattern to an add-sub pattern does not look profitable unless we can also get rid of some of the shuffles that cancel each other out. I can't think of how we could do this in the backend more effectively than we could do it here, but perhaps I am missing something.
llvm/include/llvm/Analysis/TargetTransformInfo.h:685
If we're going to do this (and I'm still not convinced we should) we're going to need a shuffle mask; otherwise addsub matching is not going to be accurate.
llvm/include/llvm/Analysis/TargetTransformInfo.h:685
I agree, the shuffle mask will make matching a lot more accurate; I will update the patches. Why do you think we should not go ahead with this? To me this is very similar to the shuffle-reducing transformations that we already have, so it seems to fit the scope of the SLP pass pretty well.
I'm concerned it's yet another thing being added to SLP to make up for a weakness somewhere else (usually the cost model mechanism), and it may take years to unravel again.
In this case we don't have a way to determine realistic costs for instruction sequences; at best we can optionally provide operands when getting the cost of a single instruction. We already have this problem with folded loads, and this could end up being a lot more complex :-(
I understand your concerns, but I feel that this pattern is not as tricky as folded loads. We can accurately detect the add-sub pattern within the SLP pass itself, since it matches the AltShuffle TreeEntry, so we can even make the tree cost more accurate (this is in a follow-up patch). The only issue I see is with cost modeling of this pattern from within a pass other than SLP, since that would require special logic to skip the cost of the vector add and vector sub. Given the current infrastructure I don't think we can do anything about that, but it shouldn't stop us from making getShuffleCost() more accurate, even if it will only benefit SLP.
Let's get the shuffle mask arg added; we need that to correctly cost alt patterns. We might be able to reduce the SLP impact to a getAltCost() wrapper of some kind.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:2381
I'd say we don't really need this interface.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:2381
Well, the issue is that the TTI function can't check whether this is an actual add-sub pattern, because it doesn't have all the information needed. For this to work, it would have to know how we are planning to combine the scalar instructions into vectors, which is an SLP-specific thing. So it feels that this logic should not belong to TTI; instead it should all be in the SLP pass. Regarding the mask, I don't really like the way this is done now. Actually, I am not even sure passing the shuffle mask is a good idea, because it doesn't actually help with checking the pattern. It is probably best to just pass the even/odd opcodes and the vector type and ask TTI to verify whether these are compatible with the addsub instructions. Wdyt?
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:2381
My 2 cents wrt the TTI interface.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:2381
Yeah, this makes sense. I will update the patch.
Replaced the shuffle mask argument with a bitvector and moved all the opcode-checking code to TTI->isLegalAltInstr().
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:2381
I updated the isLegalAltInstr() TTI interface function. I think it looks more like what you described in your previous comment.
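For reference, the resulting hook has roughly this shape (paraphrased from the discussion above; the exact documentation and the bit convention of the opcode mask are in the diff):

```cpp
// llvm/include/llvm/Analysis/TargetTransformInfo.h (paraphrased):
// Returns true if a vector instruction sequence on VecTy that alternates
// between Opcode0 and Opcode1 (OpcodeMask holds one bit per lane selecting
// which opcode that lane uses) can be lowered to a legal target
// instruction, e.g. the x86 addsub.
bool isLegalAltInstr(VectorType *VecTy, unsigned Opcode0, unsigned Opcode1,
                     const SmallBitVector &OpcodeMask) const;
```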
Don't we need to account for cost somehow?
The cost calculation changes are in a follow-up patch: https://reviews.llvm.org/D126432