This is an archive of the discontinued LLVM Phabricator instance.

Add support to recognize non SIMD kind of parallelism in SLPVectorizer
ClosedPublic

Authored by karthikthecool on Jun 4 2014, 1:19 AM.

Details

Summary

Hi All,
I'm currently looking into adding support to recognize and vectorize non-SIMD kinds of parallelism, e.g. add/sub patterns.

This kind of parallelism may be important in complex/numerical computations where these patterns are common.
These patterns can later be converted to instructions such as ADDSUBPS on X86.
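
For illustration, a minimal scalar sketch of the shape being targeted (hypothetical function, not from the patch; note that ADDSUBPS itself subtracts in the even lanes and adds in the odd ones):

// Alternating sub/add over adjacent lanes - the kind of non-uniform
// parallelism the patch recognizes; <a0-b0, a1+b1, a2-b2, a3+b3>
// maps to a single ADDSUBPS on x86.
void addsub(float *r, const float *a, const float *b) {
  r[0] = a[0] - b[0];
  r[1] = a[1] + b[1];
  r[2] = a[2] - b[2];
  r[3] = a[3] + b[3];
}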

I would like to get a few inputs on the patch and the design used to support this feature.

This patch adds support to recognize one asymmetric pairing (i.e. the add/sub pattern). This is the rough design I followed:

  1. Recognize add/sub patterns in getSameOpcode as shuffle vector instructions and handle shuffle vector in buildTree_rec.
  2. Calculate appropriate cost of vectorization when shuffle vector is used.
  3. Calculate appropriate mask and create shuffle vector instruction to vectorize these patterns.

The advantage of using shufflevector is that the same shuffle code in this patch can generally pair any alternating sequence such as addsub, subadd, etc.; we just need to handle them in getSameOpcode and classify them as shuffle vectors.
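
As a rough standalone sketch of step 1 (hypothetical names, not the patch itself), the classification boils down to checking that the scalars of a bundle alternate between two opcodes:

#include <cstddef>
#include <vector>

enum Opcode { Add, Sub, ShuffleVector, Other };

// Stand-in for the getSameOpcode check: a bundle whose scalars alternate
// between Add and Sub (starting with either) is tagged as a ShuffleVector
// bundle so the shuffle handling in buildTree_rec can vectorize it.
Opcode classifyBundle(const std::vector<Opcode> &Scalars) {
  if (Scalars.empty() || (Scalars[0] != Add && Scalars[0] != Sub))
    return Other;
  Opcode Even = Scalars[0];
  Opcode Odd = (Even == Add) ? Sub : Add;
  for (std::size_t I = 0; I < Scalars.size(); ++I)
    if (Scalars[I] != ((I % 2 == 0) ? Even : Odd))
      return Other;
  return ShuffleVector;
}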

I tested the patch on a local test case with a large number of add/sub patterns, and it seems to give a nice ~10% improvement.

Awaiting inputs.

Thanks and Regards
Karthik Bhat

Diff Detail

Event Timeline

karthikthecool retitled this revision to: Add support to recognize non SIMD kind of parallelism in SLPVectorizer.
karthikthecool updated this object.
karthikthecool edited the test plan for this revision.

Updated the patch to fix an issue:
when vectorizing a shuffle vector bundle, make sure that it actually refers to an alternating sequence such as an add/sub sequence and is not referring to a set of actual ShuffleVector instructions.

aschwaighofer edited edge metadata. Jun 5 2014, 8:58 AM

Hi Karthik,

thanks for working on this! I have a few questions embedded in the source. Also, did you run the test suite for correctness and performance?

lib/Transforms/Vectorize/SLPVectorizer.cpp
152

Could we name this function "isAddSubInst" and add some documentation that we encode vector bundles that are a combination of opcodes as a shufflevector?

520

I don't understand why you are relaxing this constraint. Why should a scalar instruction be part of two vectors?

791

I don't understand why you are relaxing this constraint. Within an ADDSUB operation you still want the instructions to be unique?

For the bundle [ADD1, SUB1, ADD2, SUB2] we still want all the members to be unique in the bundle?

1341

I think we should use the getShuffleCost interface (and we would need a new ShuffleKind and to implement the cost model for it).

2007

Why is this needed?

karthikthecool edited edge metadata.

Hi Arnold,
Thanks for the review. Yes, I have run the LLVM regression tests in debug and release modes and there are no failures. I am yet to generate performance results for this patch; I will try to collect them over the weekend.

Checking Opcode != Instruction::ShuffleVector was experimental code that I unfortunately forgot to remove while submitting the patch. Updated the patch to remove the wrongly added code. Thanks for highlighting it; I will be more careful next time.

Updated the cost model to use getShuffleCost and added a new ShuffleKind (SK_Alternate).
I have not yet handled SK_Alternate separately; we are returning the default cost of 1, as with SK_Broadcast etc.
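
For context, a sketch of where the new kind sits in the TargetTransformInfo enum (the neighboring enumerators shown here are illustrative; the exact enum has more members):

enum ShuffleKind {
  SK_Broadcast, // broadcast element 0 to all other elements
  SK_Reverse,   // reverse the order of the vector
  SK_Alternate  // new: pick alternate elements from two input vectors
};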
I have added a TODO to handle the cost model for shuffle vectors in a better way, as I can see that with the current cost model, sequences generated from code such as:

float c[4];
float a;
float b;
void foo () {
  c[0]= a+b;
  c[1] = a-b;
  c[2] = a+b;
  c[3] = a-b;
}

are not getting converted into a shuffle vector (it works when we set a lower -slp-threshold). It also works when both a and b are arrays (e.g. the test case attached to this patch, addsub.ll).

Thanks for your inputs and time.
Regards
Karthik Bhat

Hi Arnold,Nadav,
Please find the LNT results with a sample size of 20. I did not observe any significant performance changes in the LNT test cases, but one of our internal tests gave an improvement of ~5-6%.

Thanks for the time.

Regards
Karthik Bhat

aschwaighofer added inline comments. Jun 17 2014, 8:59 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1344

Ultimately, I think this could be replaced by a call to TargetTransformInfo::getAddSubCost, and BasicTargetTransformInfo should override this with the implementation you have here.
Targets can then override the cost as they see fit.

For now, your code should be conservatively correct for all targets, assuming they return an accurate (conservative) cost for getShuffleCost(SK_Alternate) - and I don't see addsubs generated yet anyway, so introducing the getAddSubCost abstraction would be premature.

Can you take a look at the code we generate for arm, arm64 and x86 to make sure that a cost of one instruction is correct? It seems to me that we generate 2 instructions for x86_64, arm, and arm64 with <4 x float>, indicating that those targets should override getShuffleCost(SK_Alternate) and return a cost of 2 for getShuffleCost(SK_Alternate, <4 x float>).

cat > testshufflevector.ll

define void @test1(<4 x float> *%a, <4 x float> *%b, <4 x float> *%c) {
entry:
  %in1 = load <4 x float>* %a
  %in2 = load <4 x float>* %b
  %add = fadd <4 x float> %in1, %in2
  %sub = fsub <4 x float> %in1, %in2
  %Shuff = shufflevector <4 x float> %add,
                         <4 x float> %sub,
                         <4 x i32> <i32 0, i32 5, i32 1, i32 6>
  store <4 x float> %Shuff, <4 x float>* %c
  ret void
}

define void @test2(<2 x double> *%a, <2 x double> *%b, <2 x double> *%c) {
entry:
  %in1 = load <2 x double>* %a
  %in2 = load <2 x double>* %b
  %add = fadd <2 x double> %in1, %in2
  %sub = fsub <2 x double> %in1, %in2
  %Shuff = shufflevector <2 x double> %add,
                         <2 x double> %sub,
                         <2 x i32> <i32 0, i32 3>
  store <2 x double> %Shuff, <2 x double>* %c
  ret void
}


bin/llc -mtriple=arm64-apple-ios7.0 -mcpu=cyclone < testshufflevector.ll

        .section        __TEXT,__text,regular,pure_instructions
        .ios_version_min 7, 0
        .globl  _test1
        .align  2
_test1:                                 ; @test1
        .cfi_startproc
; BB#0:                                 ; %entry
        ldr      q0, [x0]
        ldr      q1, [x1]
        fadd.4s v2, v0, v1
        fsub.4s v0, v0, v1
// TWO INSTRUCTIONS
        ext.16b v0, v0, v2, #4
        zip1.4s v0, v2, v0
///
        str      q0, [x2]
        ret
        .cfi_endproc

bin/llc -mtriple=armv7s-apple-ios7.0 < testshufflevector.ll
bin/llc -mtriple=x86_64-apple-macos < testshufflevector.ll

With the adjustments to the cost model (X86TargetTransformInfo/ARMTargetTransformInfo/AArch64TargetTransformInfo::getShuffleCost) this LGTM.

Thanks

Hi Arnold,
Thanks for the review comments. Updated the patch to use a shuffle cost of 2. I have modified getShuffleCost in BasicTargetTransformInfo and TargetTransformInfo, since X86TargetTransformInfo::getShuffleCost and ARMTargetTransformInfo::getShuffleCost get the cost from this common function.

Yes, the addsub instruction is not yet emitted by the backend. We will need some more modifications to recognize this shuffle as an addsub and to generate the corresponding target instruction. I will look into it once these changes are committed, and will also try to update the cost model to use getAddSubCost in follow-up patches.
Does this look good to commit now? Thanks once again for the review.
Regards
Karthik Bhat

Updated patch rebased to trunk. Thanks

hfinkel edited edge metadata. Jun 18 2014, 12:33 AM

Thanks for working on this!

lib/Transforms/Vectorize/SLPVectorizer.cpp
178

Can we add support for matching (fsub, fadd, fsub, fadd, ...)? I think that generating (fneg ( addsub (x, y))) would be nice.

Also, is there any particular reason that we're restricting this to floating-point add/sub? Granted, I know of no ISA with integer add/sub instructions, but I think that the lowering as (addsub ( x, xor (y, mask))) is likely more efficient than the scalar version.

test/Transforms/SLPVectorizer/X86/addsub.ll
14

Please check the actual shuffle indices

40

We don't need this ident metadata in the test files.

karthikthecool edited edge metadata.

Hi Hal,
Thanks for the review.
Yes, we can handle the fsub,fadd,fsub,... sequence as well.
Updated the patch as per the review comments to handle the fsub,fadd,fsub,fadd,... sequence in addition to the add/sub sequence.
Updated the test case to check for the shuffle mask used.

It would be great if you could review the updated patch.
Thanks
Karthik Bhat

aschwaighofer added inline comments. Jun 18 2014, 9:12 AM
lib/Analysis/TargetTransformInfo.cpp
575 (On Diff #10546)

Please leave a cost of 1 here.

lib/CodeGen/BasicTargetTransformInfo.cpp
355

Karthik, thanks again for working on this.

Now that we generate shuffles we need to make sure costs are approximately right.

BasicTTI is supposed to return a conservative estimate if targets don't override it. We can't always return two here. The cost of the shuffle could be much higher.

We should return a conservative default. One is not right either.

A conservative cost would be something like VecEltNum * (cost(extractelement, Tp) + cost(insertelement, Tp)), that is, the cost of constructing the shuffled vector by extracting the individual elements and then inserting them to build the result vector.

Targets should then override this to provide more accurate estimates.

Costs should depend on the type. So for example:

X86TTI::getShuffleCost(SK_Alternate, <2 x double>) == 1
X86TTI::getShuffleCost(SK_Alternate, <4 x float>) == 2

We try to make vectorized code not slower than the scalar version. It is therefore important to not underestimate costs of vectorized code.

If we are doing integer addsubs, we need to make sure we return sensible costs for integer alternate shuffles as well.

(My sample code before was using the wrong indices on the shuffle mask. The mask should have been 0, 5, 2, 7 of course)

We will have to look at what code we generate for, e.g., <8 x i16> and <16 x i8>:

define void @test3(<8 x i16> *%a, <8 x i16> *%b, <8 x i16> *%c) {
entry:
  %in1 = load <8 x i16>* %a
  %in2 = load <8 x i16>* %b
  %add = add <8 x i16> %in1, %in2
  %sub = sub <8 x i16> %in1, %in2
  %Shuff = shufflevector <8 x i16> %add,
                         <8 x i16> %sub,
                         <8 x i32> <i32 0, i32 9, i32 2, i32 11, i32 4, i32 13, i32 6, i32 15>
  store <8 x i16> %Shuff, <8 x i16>* %c
  ret void
}

Hi Arnold,
Thanks for the pointers. Updated the cost model for SK_Alternate shuffles.
The cost model is as follows:

  1. In BasicTTI created a function getAltShuffleOverhead.

As you mentioned, the conservative cost of the shuffle is, summed over all elements of the result vector, the cost of extracting each element from its source vector plus the cost of inserting it into the result vector.
Since we will be alternately picking elements from the 2 vectors, which are of the same type (i.e. element 0 of the 1st vector, element 1 of the 2nd vector, element 2 of the 1st vector, and so on), the function just runs a loop that, for each index, calculates the cost of extracting the element and adds it to the cost of inserting the element at that index.
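
A minimal standalone sketch of that loop (the per-element costs here are placeholders for what the real code obtains via getVectorInstrCost):

// Placeholder per-element costs; in BasicTTI these come from
// getVectorInstrCost(Instruction::ExtractElement/InsertElement, Ty, Index).
unsigned extractEltCost(unsigned /*Index*/) { return 1; }
unsigned insertEltCost(unsigned /*Index*/) { return 1; }

// Conservative SK_Alternate overhead: build the shuffled result by
// extracting every element and re-inserting it into the result vector.
unsigned getAltShuffleOverhead(unsigned NumElts) {
  unsigned Cost = 0;
  for (unsigned I = 0; I != NumElts; ++I)
    Cost += extractEltCost(I) + insertEltCost(I);
  return Cost;
}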

We have overridden getShuffleCost for X86 and ARM, creating 2 tables, NEONAltShuffleTable and X86AltShuffleTable, with more accurate shuffle costs. The cost here represents the number of instructions required to generate the shuffled vector.
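
A sketch of the table-driven lookup described above (entries and cost values are illustrative stand-ins, not the committed tables):

#include <cstring>

// Each entry maps a vector type to the number of target instructions needed
// to materialize the alternate shuffle; the values below are examples only.
struct AltShuffleCostEntry {
  const char *Ty;
  unsigned Cost;
};

static const AltShuffleCostEntry X86AltShuffleTbl[] = {
    {"v2f64", 1}, // e.g. a single blend
    {"v4f32", 2}, // e.g. two shuffle instructions
};

unsigned lookupAltShuffleCost(const char *Ty, unsigned ConservativeDefault) {
  for (const AltShuffleCostEntry &E : X86AltShuffleTbl)
    if (std::strcmp(E.Ty, Ty) == 0)
      return E.Cost;
  return ConservativeDefault; // fall back to the generic overhead
}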

Please let me know your inputs on this.

For <8 x i16> as well, we should be generating the correct mask, as the logic to create the mask is generic: we take the vector length (8 in this case) and run a loop from 0 to the length, alternately selecting the loop index (i) and vectorLength + i. So in this case we generate the sequence (0, 8+1, 2, 8+3, 4, 8+5, 6, 8+7), i.e. <0, 9, 2, 11, 4, 13, 6, 15>.
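
A sketch of that mask logic (standalone; the patch builds the equivalent constant mask for the shufflevector instruction):

#include <vector>

// Even result lanes take lane i of the first source vector; odd lanes take
// lane i of the second source, whose indices are offset by the vector
// length VL. For VL == 8 this yields <0, 9, 2, 11, 4, 13, 6, 15>.
std::vector<unsigned> buildAlternateMask(unsigned VL) {
  std::vector<unsigned> Mask(VL);
  for (unsigned I = 0; I < VL; ++I)
    Mask[I] = (I % 2 == 0) ? I : VL + I;
  return Mask;
}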

I tried to test it on a sample test, but it exits before entering buildTree in vectorizeStoreChain as VF > ChainLen. I will try to come up with a working test to check <8 x i16> and <16 x i8>, but I feel we should be generating the correct mask in both these cases as per the logic above.

Thanks once again for your time and inputs.
Regards
Karthik Bhat

karthikthecool accepted this revision. Jun 19 2014, 9:50 PM
karthikthecool added a reviewer: karthikthecool.

Thanks Arnold for the review. Committed as r211339. Now that the SLPVectorizer is vectorizing these patterns, I will try to map these vector shuffles to instructions such as addsubpd etc.
Thanks!

This revision is now accepted and ready to land. Jun 19 2014, 9:50 PM
karthikthecool closed this revision. Jul 1 2014, 9:23 PM