This is an archive of the discontinued LLVM Phabricator instance.

Differential D79799

[VectorCombine] add loop to enable iterative folding
Abandoned · Public

Authored by spatel on May 12 2020, 12:48 PM.

Details

Reviewers

nikic
lebedev.ri
efriedma
RKSimon

Summary

Given the limited range of potential vector transforms, is this an acceptable way to iterate? If this is or becomes too inefficient, we could use a worklist strategy like InstCombine's, but that would require altering more code. The motivation comes from PR42174:
https://bugs.llvm.org/show_bug.cgi?id=42174
...although we don't have the underlying scalarization-with-constant-operand fold yet. If we add that transform first, it would scalarize only one instruction instead of the entire chain of insert-insert-binops.

Event Timeline

spatel created this revision. May 12 2020, 12:48 PM

Herald added a project: Restricted Project. May 12 2020, 12:48 PM

Herald added subscribers: hiraditya, mcrosier.

My only concern here would be whether this can degenerate quadratically. Without particular familiarity with this pass, and looking at example @ins1_ins1_iterate, it seems like this would work by scalarizing one operation on each iteration. Is that right? If so, that seems potentially problematic, as the pass becomes quadratic for longer instruction chains.

lebedev.ri mentioned this in D79078: [VectorCombine] Leave reduction operation to SLP. May 15 2020, 10:21 AM

In D79799#2038887, @nikic wrote:

My only concern here would be whether this can degenerate quadratically. Without particular familiarity with this pass, and looking at example @ins1_ins1_iterate, it seems like this would work by scalarizing one operation on each iteration. Is that right? If so, that seems potentially problematic, as the pass becomes quadratic for longer instruction chains.

Yes, in the worst case it could become quadratic, and you're seeing the test correctly. I think we could undo the reverse basic-block walk in this patch to make that particular case more efficient.

If I'm seeing it correctly, other iterative passes like SimplifyCFG and CodeGenPrepare allow for quadratic-time possibility too. I'm not sure if there's a theoretical way to draw the line on that, or if we just accept that potential risk (assume that anything this pass will ever do is rare, so it can't be too expensive). As I wrote in the description, I think the alternative is to revise things to be more like InstCombine's user-based worklist. We could also put in a cl::opt flag to bail out and/or assert if we go overboard. Any other suggestions?
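The bail-out idea mentioned above could look roughly like the following. This is a minimal standalone sketch, not LLVM code: `iterateWithLimit`, `FixpointResult`, and the `MaxIterations` parameter are hypothetical names, and the plain function parameter only stands in for whatever a cl::opt flag would provide.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical result type for a capped fixpoint loop (not an LLVM type).
struct FixpointResult {
  bool MadeChange = false;    // true if any sweep reported a change
  std::size_t Iterations = 0; // number of sweeps actually run
  bool HitLimit = false;      // true if we gave up before confirming a fixpoint
};

// Runs RunOneSweep until it reports no change or MaxIterations sweeps have
// been run. MaxIterations stands in for a command-line limit (assumption).
FixpointResult iterateWithLimit(const std::function<bool()> &RunOneSweep,
                                std::size_t MaxIterations) {
  assert(RunOneSweep && "sweep callback must be callable");
  FixpointResult Result;
  while (Result.Iterations < MaxIterations) {
    ++Result.Iterations;
    if (!RunOneSweep())
      return Result; // No change in this sweep: fixpoint reached.
    Result.MadeChange = true;
  }
  // The limit was reached while sweeps were still making changes.
  Result.HitLimit = true;
  return Result;
}
```

A caller could either stop quietly when `HitLimit` is set or assert on it in debug builds, which are the two options floated in the comment.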

In D79799#2039039, @spatel wrote:

In D79799#2038887, @nikic wrote:

My only concern here would be whether this can degenerate quadratically. Without particular familiarity with this pass, and looking at example @ins1_ins1_iterate, it seems like this would work by scalarizing one operation on each iteration. Is that right? If so, that seems potentially problematic, as the pass becomes quadratic for longer instruction chains.

Yes, in the worst case it could become quadratic, and you're seeing the test correctly. I think we could undo the reverse basic-block walk in this patch to make that particular case more efficient.

Speaking of which: If I replace this with a forward walk, then the ins1_ins1_iterate test folds, and there are no other test changes in VectorCombine or PhaseOrdering. If there are cases that benefit from the backwards walk, there doesn't seem to be test coverage for them.

If I'm seeing it correctly, other iterative passes like SimplifyCFG and CodeGenPrepare allow for quadratic-time possibility too. I'm not sure if there's a theoretical way to draw the line on that, or if we just accept that potential risk (assume that anything this pass will ever do is rare, so it can't be too expensive). As I wrote in the description, I think the alternative is to revise things to be more like InstCombine's user-based worklist. We could also put in a cl::opt flag to bail out and/or assert if we go overboard. Any other suggestions?

I agree that this is unlikely to become a problem in practice for this pass, because vector code is uncommon. And you are right that this problem also exists in other passes, though it is probably not always as easy to find actually quadratic cases. Personally I'm fine with your approach here, and we can always pivot to a worklist if necessary. Though per my comment above, it seems like for now just flipping the iteration order might be sufficient.
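The effect of the walk direction discussed above can be seen in a toy model. This is only a sketch under simplifying assumptions, not the pass itself: each op in a chain can be "scalarized" only after the op it depends on has been, which mimics the dependent chain in @ins1_ins1_iterate. The names `SweepStats`, `tryFold`, and `runToFixpoint` are made up for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counters for the toy fixpoint loop (hypothetical, for illustration only).
struct SweepStats {
  std::size_t Sweeps = 0; // full passes over the chain, including the final
                          // pass that finds nothing left to do
  std::size_t Visits = 0; // total op visits across all sweeps
};

// Op Idx folds only if it is not folded yet and its operand (op Idx - 1) is
// already folded; op 0 has no dependency. Returns true if state changed.
static bool tryFold(std::vector<bool> &Folded, std::size_t Idx) {
  if (Folded[Idx])
    return false;
  bool OperandReady = (Idx == 0) || Folded[Idx - 1];
  if (!OperandReady)
    return false;
  Folded[Idx] = true;
  return true;
}

// Sweeps the whole chain repeatedly until a sweep makes no change.
SweepStats runToFixpoint(std::size_t ChainLength, bool Backward) {
  std::vector<bool> Folded(ChainLength, false);
  SweepStats Stats;
  bool Changed;
  do {
    Changed = false;
    ++Stats.Sweeps;
    for (std::size_t N = 0; N < ChainLength; ++N) {
      std::size_t Idx = Backward ? ChainLength - 1 - N : N;
      ++Stats.Visits;
      Changed |= tryFold(Folded, Idx);
    }
  } while (Changed);
  for (bool IsFolded : Folded)
    assert(IsFolded && "every op should be folded at the fixpoint");
  return Stats;
}
```

In this model a backward walk folds one op per sweep, so a chain of N ops costs N + 1 sweeps and N * (N + 1) visits, while a forward walk folds the whole chain in the first sweep and needs only one more sweep to confirm that nothing changed.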

spatel mentioned this in rG81e9ede3a2db: [VectorCombine] forward walk through instructions to improve chaining of…. May 16 2020, 10:33 AM

In D79799#2040145, @nikic wrote:

In D79799#2039039, @spatel wrote:

In D79799#2038887, @nikic wrote:

My only concern here would be whether this can degenerate quadratically. Without particular familiarity with this pass, and looking at example @ins1_ins1_iterate, it seems like this would work by scalarizing one operation on each iteration. Is that right? If so, that seems potentially problematic, as the pass becomes quadratic for longer instruction chains.

Yes, in the worst case it could become quadratic, and you're seeing the test correctly. I think we could undo the reverse basic-block walk in this patch to make that particular case more efficient.

Speaking of which: If I replace this with a forward walk, then the ins1_ins1_iterate test folds, and there are no other test changes in VectorCombine or PhaseOrdering. If there are cases that benefit from the backwards walk, there doesn't seem to be test coverage for them.

If I'm seeing it correctly, other iterative passes like SimplifyCFG and CodeGenPrepare allow for quadratic-time possibility too. I'm not sure if there's a theoretical way to draw the line on that, or if we just accept that potential risk (assume that anything this pass will ever do is rare, so it can't be too expensive). As I wrote in the description, I think the alternative is to revise things to be more like InstCombine's user-based worklist. We could also put in a cl::opt flag to bail out and/or assert if we go overboard. Any other suggestions?

I agree that this is unlikely to become a problem in practice for this pass, because vector code is uncommon. And you are right that this problem also exists in other passes, though it is probably not always as easy to find actually quadratic cases. Personally I'm fine with your approach here, and we can always pivot to a worklist if necessary. Though per my comment above, it seems like for now just flipping the iteration order might be sufficient.

Excellent point. Let's make the small change until we have evidence for the need for a costlier solution:
rG81e9ede3a2db

Over in D79078, we're trying to figure out how to properly deal with the reduction test diffs seen in that commit. I'm not sure what the answer will be.

I'll put this patch on hold for now rather than abandon.

D110171 added a worklist, so this is moot.
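For readers unfamiliar with the worklist approach, here is a toy sketch of the general idea: when an op folds, only its user is queued for another look, instead of sweeping everything again. It is not the implementation from D110171 or InstCombine, whose details are not shown on this page; `foldWithWorklist` and the single-operand chain are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Toy chain: op i depends on op i - 1 and is used by op i + 1. Returns the
// total number of op visits needed to fold the whole chain.
std::size_t foldWithWorklist(std::size_t ChainLength, bool SeedBackward) {
  std::vector<bool> Folded(ChainLength, false);
  std::deque<std::size_t> Worklist;
  // Seed the worklist with every op once, in the requested order.
  for (std::size_t N = 0; N < ChainLength; ++N)
    Worklist.push_back(SeedBackward ? ChainLength - 1 - N : N);

  std::size_t Visits = 0;
  while (!Worklist.empty()) {
    std::size_t Idx = Worklist.front();
    Worklist.pop_front();
    ++Visits;
    if (Folded[Idx])
      continue;
    bool OperandReady = (Idx == 0) || Folded[Idx - 1];
    if (!OperandReady)
      continue;
    Folded[Idx] = true;
    // Folding this op may enable its user, so queue only that user.
    if (Idx + 1 < ChainLength)
      Worklist.push_back(Idx + 1);
  }
  for (bool IsFolded : Folded)
    assert(IsFolded && "every op should be folded once the worklist drains");
  return Visits;
}
```

In this model the visit count is 2N - 1 for a chain of N ops regardless of the seeding order, which is the property that makes the quadratic worst case go away.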

Herald added a project: Restricted Project. Nov 22 2022, 6:40 AM

Herald added a subscriber: pcwang-thead.

Revision Contents

Path                                                      Size
llvm/lib/Transforms/Vectorize/VectorCombine.cpp           47 lines
llvm/test/Transforms/VectorCombine/X86/insert-binop.ll    8 lines

Diff 263468

llvm/lib/Transforms/Vectorize/VectorCombine.cpp

 /// This is the entry point for all transforms. Pass manager differences are
 /// handled in the callers of this function.
 static bool runImpl(Function &F, const TargetTransformInfo &TTI,
                     const DominatorTree &DT) {
   if (DisableVectorCombine)
     return false;

   bool MadeChange = false;

+  // Iterate until there are no more changes. Transforms can build on each
+  // other's improvements.
+  bool IterationChange;
+  do {
+    IterationChange = false;
     for (BasicBlock &BB : F) {
       // Ignore unreachable basic blocks.
       if (!DT.isReachableFromEntry(&BB))
         continue;
-    // Do not delete instructions under here and invalidate the iterator.
-    // Walk the block backwards for efficiency. We're matching a chain of
+      // Walk the block backwards for efficiency. We are matching a chain of
       // use->defs, so we're more likely to succeed by starting from the bottom.
-    // TODO: It could be more efficient to remove dead instructions
-    // iteratively in this loop rather than waiting until the end.
       for (Instruction &I : make_range(BB.rbegin(), BB.rend())) {
         if (isa<DbgInfoIntrinsic>(I))
           continue;
-      MadeChange |= foldExtractExtract(I, TTI);
-      MadeChange |= foldBitcastShuf(I, TTI);
-      MadeChange |= scalarizeBinop(I, TTI);
+        IterationChange |= foldExtractExtract(I, TTI);
+        IterationChange |= foldBitcastShuf(I, TTI);
+        IterationChange |= scalarizeBinop(I, TTI);
       }
     }
-  // We're done with transforms, so remove dead instructions.
-  if (MadeChange)
+    // Remove dead instructions before iterating.
+    if (IterationChange)
       for (BasicBlock &BB : F)
         SimplifyInstructionsInBlock(&BB);

+    // Set overall changed flag.
+    MadeChange |= IterationChange;
+  } while (IterationChange);

   return MadeChange;
 }

 // Pass manager boilerplate below here.

 namespace {
 class VectorCombineLegacyPass : public FunctionPass {
 public:

llvm/test/Transforms/VectorCombine/X86/insert-binop.ll

 ;
   %i1 = insertelement <2 x i64> undef, i64 %y, i32 1
   %r = xor <2 x i64> %i0, %i1
   ret <2 x i64> %r
 }

 define <2 x i64> @ins1_ins1_iterate(i64 %w, i64 %x, i64 %y, i64 %z) {
 ; CHECK-LABEL: @ins1_ins1_iterate(
 ; CHECK-NEXT: [[S0_SCALAR:%.*]] = sub i64 [[W:%.*]], [[X:%.*]]
-; CHECK-NEXT: [[S0:%.*]] = insertelement <2 x i64> undef, i64 [[S0_SCALAR]], i64 1
-; CHECK-NEXT: [[I2:%.*]] = insertelement <2 x i64> undef, i64 [[Y:%.*]], i32 1
-; CHECK-NEXT: [[S1:%.*]] = or <2 x i64> [[S0]], [[I2]]
-; CHECK-NEXT: [[I3:%.*]] = insertelement <2 x i64> undef, i64 [[Z:%.*]], i32 1
-; CHECK-NEXT: [[S2:%.*]] = shl <2 x i64> [[I3]], [[S1]]
+; CHECK-NEXT: [[S1_SCALAR:%.*]] = or i64 [[S0_SCALAR]], [[Y:%.*]]
+; CHECK-NEXT: [[S2_SCALAR:%.*]] = shl i64 [[Z:%.*]], [[S1_SCALAR]]
+; CHECK-NEXT: [[S2:%.*]] = insertelement <2 x i64> undef, i64 [[S2_SCALAR]], i64 1
 ; CHECK-NEXT: ret <2 x i64> [[S2]]
 ;
   %i0 = insertelement <2 x i64> undef, i64 %w, i64 1
   %i1 = insertelement <2 x i64> undef, i64 %x, i32 1
   %s0 = sub <2 x i64> %i0, %i1
   %i2 = insertelement <2 x i64> undef, i64 %y, i32 1
   %s1 = or <2 x i64> %s0, %i2
   %i3 = insertelement <2 x i64> undef, i64 %z, i32 1