This is an archive of the discontinued LLVM Phabricator instance.

PR20234 - [SLP Vectorizer] Canonicalize tree operands of commutative binary operands.
ClosedPublic

Authored by mcrosier on Jul 25 2014, 4:10 PM.

Details

Reviewers

chandlerc
nadav
aschwaighofer
eeckstein
mcrosier

Summary

As a result of recent work on the reassociation pass, I noticed that the ordering of operands affects the behavior of the SLP vectorizer. The ordering can change how the expression tree is built and, in turn, the cost of the tree. In the provided test case, derived from spec2k/mesa, the order of the load instructions in the expression tree is reversed, so the loads are gathered rather than vectorized.

This patch attempts to resolve this issue by canonicalizing the operands of commutative instructions based on source order. The result is an expression tree that more closely mirrors the instruction source order. Currently the canonicalization is applied to both the instruction and the expression tree. However, only the latter is necessary to address this issue; I have no objection to leaving the instruction operands as is.

I'm sure a more robust solution exists, but I decided to begin with the simplest solution because I think it gets a fairly good bang for the buck, is safe, and is maintainable.
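
The source-order canonicalization described in this summary can be sketched outside LLVM as follows. This is a minimal, self-contained model and not the patch itself: `Block`, `numberBlock`, and `canonicalizePair` are hypothetical names, and a plain vector of instruction names stands in for a basic block and its numbering.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a basic block: instruction names in source order.
using Block = std::vector<std::string>;

// Assign each instruction its position in the block (a toy block numbering).
std::unordered_map<std::string, int> numberBlock(const Block &BB) {
  std::unordered_map<std::string, int> Index;
  for (int I = 0, E = static_cast<int>(BB.size()); I != E; ++I)
    Index[BB[I]] = I;
  return Index;
}

// Return the two operands of a commutative operation ordered by source
// position. Both operands are assumed to be defined in the numbered block.
std::pair<std::string, std::string>
canonicalizePair(const std::unordered_map<std::string, int> &Index,
                 std::string A, std::string B) {
  assert(Index.count(A) && Index.count(B) && "operands must be in the block");
  if (Index.at(A) > Index.at(B))
    std::swap(A, B);
  return {A, B};
}
```

With a block in which `%mul11` is defined before `%mul12`, passing the operands in either order yields the pair (`%mul11`, `%mul12`), so the commuted `fadd` in the second test function would seed the same tree as the first.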

Correctness runs look good. Performance runs (AArch64/A53) also look good; no regressions and a few minor improvements.

Please take a look!

Chad

Event Timeline

mcrosier updated this revision to Diff 11900. Jul 25 2014, 4:10 PM

mcrosier retitled this revision to "PR20234 - [SLP Vectorizer] Canonicalize operands of commutative binary operands."

mcrosier updated this object.

mcrosier edited the test plan for this revision.

mcrosier added reviewers: aschwaighofer, nadav, grosbach, chandlerc.

mcrosier added a subscriber: Unknown Object (MLST).

Herald added subscribers: mcrosier, aemerson. Jul 25 2014, 4:10 PM

Chad,

In this patch you are changing the instruction V regardless of whether vectorization succeeds. Would it be possible to change only the VL vector without modifying the binary operator V?

Thanks,
Nadav

Nadav,
That would be perfectly fine. In fact, it's the right thing to do as we don't want to undo the canonicalization performed by the reassociation pass.

Chad
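
The distinction agreed on here, reordering only the operand list that seeds the tree while leaving the instruction as the reassociation pass left it, can be sketched as below. This is an illustration under stated assumptions: `BinOp` and `treeOperands` are hypothetical names rather than LLVM APIs, and each operand is reduced to an integer source position.

```cpp
#include <array>
#include <cassert>
#include <utility>

// Hypothetical model of a binary instruction. Each operand is represented
// only by its position in the source block.
struct BinOp {
  std::array<int, 2> Operands;
  bool Commutative;
};

// Build the operand list used to seed the tree. The list is a copy and the
// instruction is taken by const reference, so the instruction is never
// modified, whether or not vectorization later succeeds.
std::array<int, 2> treeOperands(const BinOp &Op) {
  std::array<int, 2> VL = Op.Operands;
  if (Op.Commutative && VL[0] > VL[1])
    std::swap(VL[0], VL[1]);
  assert((!Op.Commutative || VL[0] <= VL[1]) && "commutative list is ordered");
  return VL;
}
```

For a commutative operation whose operands sit at positions 7 and 3, the returned list is {3, 7} while the instruction still holds {7, 3}; a non-commutative operation is passed through unchanged.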

lib/Transforms/Vectorize/SLPVectorizer.cpp
Line 2424: We don't need to actually modify the instruction for this to work.

Ping.

Update based on Nadav's feedback. Specifically, only commute the tree operands.

mcrosier retitled this revision from "PR20234 - [SLP Vectorizer] Canonicalize operands of commutative binary operands." to "PR20234 - [SLP Vectorizer] Canonicalize tree operands of commutative binary operands." Jul 30 2014, 1:17 PM

mcrosier added a reviewer: eeckstein.

LGTM.

Nadav gave the LGTM. Will commit shortly.

This revision is now accepted and ready to land. Jul 30 2014, 2:05 PM

Committed as r214338.

Hi Chad,

I'm sorry that I didn't see your patch earlier.
I'm currently also working on the SLPVectorizer, and my patch is in the final review phase. It improves the scheduling.
This is a different issue from the one your patch handles, but there is still one "conflict":

My new scheduling algorithm is more general and makes obsolete some heuristics which, like your approach, relied on the instruction numbering. So I could remove the instruction numbering altogether.

I have now looked in detail at the problem your test case addresses. The original problem is that the load instructions are in the wrong order, so they are not recognised as consecutive loads.

I attached a new patch which is a more general solution and does not rely on instruction numbering.
Please take a look at it. If it proves to work well, I'd like to replace your change with this algorithm. This would make my life easier for my other patch :-)

For reference, I also attached my scheduling patch (it's based on an earlier revision).

Thanks and sorry for replying so late,
Erik
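
The observation that loads in the wrong order are not recognised as consecutive can be illustrated with a toy adjacency check. This is not the attached patch, whose contents are not reproduced in this thread; it is one order-independent approach that needs no instruction numbering. `Load`, `isConsecutive`, and `orderLoads` are hypothetical names, and an address is modelled as a base id plus a byte offset.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical model of a load: which base pointer it reads from, at what
// byte offset, and how many bytes it reads.
struct Load {
  int Base;
  std::int64_t Offset;
  std::int64_t Size;
};

// True if B reads the bytes immediately following A from the same base.
// The check is directional: (A, B) may be consecutive while (B, A) is not.
bool isConsecutive(const Load &A, const Load &B) {
  assert(A.Size > 0 && "a load must read at least one byte");
  return A.Base == B.Base && B.Offset == A.Offset + A.Size;
}

// For operands of a commutative operation, try both orders. On success the
// pair is left in consecutive order and true is returned; otherwise the pair
// is left untouched and false is returned.
bool orderLoads(Load &A, Load &B) {
  if (isConsecutive(A, B))
    return true;
  if (isConsecutive(B, A)) {
    std::swap(A, B);
    return true;
  }
  return false;
}
```

For two 4-byte float loads at offsets 0 and 4 of the same base, the pair is consecutive only when the offset-0 load comes first; `orderLoads` accepts either order and normalizes it.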

commute.patch (6 KB)
msg-10624-37.txt (149 B)
slp-vectorizer-scheduling-5.patch (52 KB)

Hi Erik,
I added you to the CC list because I noticed you were working on the SLP vectorizer as well. I suspect your patch is the more robust solution I was referring to in the original email. ;) Does the test case in this patch still work with your patch? If so, I have no problem with you reverting this patch and applying yours. Please just add my test case to your patch. If not, I would like to get the test passing with your patch.

Chad

Note: If the test doesn't pass, please don't consider this a blocker to your patch. We can revert my patch and apply yours as it's obviously a more general solution. All I ask is that we try to get the test case passing in the near future. This specific code sequence is derived from spec2k/mesa and improves performance by ~3% (AArch64/A53).

Yes, the test passes.
Please let me know if it also gives the same improvement for the benchmark.

Thanks,
Erik

If it passes, then we're good. Please go ahead and revert my patch and apply yours (assuming you've got the LGTM). :)

OK, thanks!

Revision Contents

Path                                               Size
lib/Transforms/Vectorize/SLPVectorizer.cpp         44 lines
test/Transforms/SLPVectorizer/AArch64/commute.ll   75 lines

Diff 12045

lib/Transforms/Vectorize/SLPVectorizer.cpp

Context not available.
   /// \brief Perform LICM and CSE on the newly generated gather sequences.
   void optimizeGatherSequence();

+  /// \brief Get the instruction numbering for a given Instruction.
+  int getIndex(Instruction *I) {
+    BlockNumbering &BN = getBlockNumbering(I->getParent());
+    return BN.getIndex(I);
+  }
+
 private:
   struct TreeEntry;

Context not available.
   unsigned collectStores(BasicBlock *BB, BoUpSLP &R);

   /// \brief Try to vectorize a chain that starts at two arithmetic instrs.
-  bool tryToVectorizePair(Value *A, Value *B, BoUpSLP &R);
+  bool tryToVectorizePair(Value *A, Value *B, BoUpSLP &R,
+                          BinaryOperator *V = nullptr);

   /// \brief Try to vectorize a list of operands.
   /// \@param BuildVector A list of users to ignore for the purpose of
Context not available.
   return count;
 }

-bool SLPVectorizer::tryToVectorizePair(Value *A, Value *B, BoUpSLP &R) {
+bool SLPVectorizer::tryToVectorizePair(Value *A, Value *B, BoUpSLP &R,
+                                       BinaryOperator *V) {
   if (!A || !B)
     return false;
   Value *VL[] = { A, B };
+
+  // Canonicalize operands based on source order, so that the ordering in the
+  // expression tree more closely matches the ordering of the source.
+  if (V && V->isCommutative() && isa<Instruction>(A) && isa<Instruction>(B) &&
+      cast<Instruction>(A)->getParent() == cast<Instruction>(B)->getParent()) {
+    assert(V->getOperand(0) == A && V->getOperand(1) == B &&
+           "Expected operands in order.");
+    int IndexA = R.getIndex(cast<Instruction>(A));
+    int IndexB = R.getIndex(cast<Instruction>(B));
+    if (IndexA > IndexB)
+      std::swap(VL[0], VL[1]);
+  }
   return tryToVectorizeList(VL, R);
 }

Context not available.
     return false;

   // Try to vectorize V.
-  if (tryToVectorizePair(V->getOperand(0), V->getOperand(1), R))
+  if (tryToVectorizePair(V->getOperand(0), V->getOperand(1), R, V))
     return true;

   BinaryOperator *A = dyn_cast<BinaryOperator>(V->getOperand(0));
Context not available.
       }

       for (int i = 0; i < 2; ++i) {
         if (BinaryOperator *BI = dyn_cast<BinaryOperator>(CI->getOperand(i))) {
-          if (tryToVectorizePair(BI->getOperand(0), BI->getOperand(1), R)) {
+          if (tryToVectorizePair(BI->getOperand(0), BI->getOperand(1), R, BI)) {
             Changed = true;
             // We would like to start over since some instructions are deleted
             // and the iterator may become invalid value.
             it = BB->begin();
             e = BB->end();
           }
         }
       }
       continue;
     }
Context not available.

test/Transforms/SLPVectorizer/AArch64/commute.ll

This file was added.

; RUN: opt -S -slp-vectorizer %s | FileCheck %s
target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"

%structA = type { [2 x float] }

define void @test1(%structA* nocapture readonly %J, i32 %xmin, i32 %ymin) {
; CHECK-LABEL: test1
; CHECK: %arrayidx4 = getelementptr inbounds %structA* %J, i64 0, i32 0, i64 0
; CHECK: %arrayidx9 = getelementptr inbounds %structA* %J, i64 0, i32 0, i64 1
; CHECK: %3 = bitcast float* %arrayidx4 to <2 x float>*
; CHECK: %4 = load <2 x float>* %3, align 4
; CHECK: %5 = fsub <2 x float> %2, %4
; CHECK: %6 = fmul <2 x float> %5, %5
; CHECK: %7 = extractelement <2 x float> %6, i32 0
; CHECK: %8 = extractelement <2 x float> %6, i32 1
; CHECK: %add = fadd fast float %7, %8
; CHECK: %cmp = fcmp oeq float %add, 0.000000e+00

entry:
  br label %for.body3.lr.ph

for.body3.lr.ph:
  %conv5 = sitofp i32 %ymin to float
  %conv = sitofp i32 %xmin to float
  %arrayidx4 = getelementptr inbounds %structA* %J, i64 0, i32 0, i64 0
  %0 = load float* %arrayidx4, align 4
  %sub = fsub fast float %conv, %0
  %arrayidx9 = getelementptr inbounds %structA* %J, i64 0, i32 0, i64 1
  %1 = load float* %arrayidx9, align 4
  %sub10 = fsub fast float %conv5, %1
  %mul11 = fmul fast float %sub, %sub
  %mul12 = fmul fast float %sub10, %sub10
  %add = fadd fast float %mul11, %mul12
  %cmp = fcmp oeq float %add, 0.000000e+00
  br i1 %cmp, label %for.body3.lr.ph, label %for.end27

for.end27:
  ret void
}

define void @test2(%structA* nocapture readonly %J, i32 %xmin, i32 %ymin) {
; CHECK-LABEL: test2
; CHECK: %arrayidx4 = getelementptr inbounds %structA* %J, i64 0, i32 0, i64 0
; CHECK: %arrayidx9 = getelementptr inbounds %structA* %J, i64 0, i32 0, i64 1
; CHECK: %3 = bitcast float* %arrayidx4 to <2 x float>*
; CHECK: %4 = load <2 x float>* %3, align 4
; CHECK: %5 = fsub <2 x float> %2, %4
; CHECK: %6 = fmul <2 x float> %5, %5
; CHECK: %7 = extractelement <2 x float> %6, i32 0
; CHECK: %8 = extractelement <2 x float> %6, i32 1
; CHECK: %add = fadd fast float %7, %8
; CHECK: %cmp = fcmp oeq float %add, 0.000000e+00

entry:
  br label %for.body3.lr.ph

for.body3.lr.ph:
  %conv5 = sitofp i32 %ymin to float
  %conv = sitofp i32 %xmin to float
  %arrayidx4 = getelementptr inbounds %structA* %J, i64 0, i32 0, i64 0
  %0 = load float* %arrayidx4, align 4
  %sub = fsub fast float %conv, %0
  %arrayidx9 = getelementptr inbounds %structA* %J, i64 0, i32 0, i64 1
  %1 = load float* %arrayidx9, align 4
  %sub10 = fsub fast float %conv5, %1
  %mul11 = fmul fast float %sub, %sub
  %mul12 = fmul fast float %sub10, %sub10
  %add = fadd fast float %mul12, %mul11 ;;;<---- Operands commuted!!
  %cmp = fcmp oeq float %add, 0.000000e+00
  br i1 %cmp, label %for.body3.lr.ph, label %for.end27

for.end27:
  ret void
}
