This is an archive of the discontinued LLVM Phabricator instance.

[LoadStoreVectorizer] Change VectorSet to Vector to match head and tail positions. Resolves PR29148.
ClosedPublic

Authored by asbirlea on Aug 30 2016, 2:51 PM.

Details

Reviewers

• tstellarAMD
jlebar
arsenm

Commits

rG3f8f7840bf12: [LoadStoreVectorizer] Change VectorSet to Vector to match head and tail…
rL280179: [LoadStoreVectorizer] Change VectorSet to Vector to match head and tail…

Summary

LSV was using two vector sets (heads and tails) to track pairs of adjacent positions to vectorize.
A recent optimization tries to obtain the longest chain to vectorize and assumes the positions
in heads (H) and tails (T) match, which is not the case if there are multiple tails for the same head.

e.g.:
i1: store a[0]
i2: store a[1]
i3: store a[1]
Leads to:
H: i1
T: i2 i3
Instead of:
H: i1 i1
T: i2 i3
So the positions for instructions that follow i3 will have different indexes in H/T.
This patch resolves PR29148.
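
The index mismatch described above can be reproduced outside LLVM. The following is a minimal sketch, not the pass's code: it models the set-like containers with a plain `std::vector` plus a hypothetical `insertUnique` helper, and takes the (head, tail) pairs as given instead of discovering them with `isConsecutiveAccess`. All names in it are illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Parallel arrays: Heads[k] is meant to pair with Tails[k].
struct HeadsTails {
  std::vector<int> Heads, Tails;
};

// Hypothetical stand-in for set-like insertion: append only if absent.
static void insertUnique(std::vector<int> &V, int X) {
  if (std::find(V.begin(), V.end(), X) == V.end())
    V.push_back(X);
}

// Pre-patch model: duplicates are dropped, so the two arrays can drift apart.
HeadsTails recordWithSetSemantics(const std::vector<std::pair<int, int>> &Pairs) {
  HeadsTails R;
  for (const auto &P : Pairs) {
    insertUnique(R.Tails, P.second);
    insertUnique(R.Heads, P.first);
  }
  return R;
}

// Post-patch model: every pair occupies the same slot in both arrays.
HeadsTails recordWithVectors(const std::vector<std::pair<int, int>> &Pairs) {
  HeadsTails R;
  for (const auto &P : Pairs) {
    R.Tails.push_back(P.second);
    R.Heads.push_back(P.first);
  }
  return R;
}

// Head recorded for the tail in slot K; only meaningful when the arrays are aligned.
int headForTailSlot(const HeadsTails &R, std::size_t K) {
  assert(R.Heads.size() == R.Tails.size() && K < R.Tails.size());
  return R.Heads[K];
}
```

With the pairs (0,1), (0,2), (3,4), the set-like version yields Heads = {0, 3} and Tails = {1, 2, 4}, so slot 1 pairs head 3 with tail 2. The vector version yields Heads = {0, 0, 3} and Tails = {1, 2, 4}, and every slot holds a real pair.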

This issue also surfaced the fact that if the chain is too long, and TLI
returns a "not-fast" answer, the whole chain will be abandoned for
vectorization, even though a smaller one would be beneficial.
Added a testcase and FIXME for this.

Diff Detail

Repository: rL LLVM

Event Timeline

asbirlea updated this revision to Diff 69760. Aug 30 2016, 2:51 PM

asbirlea retitled this revision to [LoadStoreVectorizer] Change VectorSet to Vector to match head and tail positions. Resolves PR29148.

asbirlea updated this object.

asbirlea added reviewers: arsenm, jlebar.

asbirlea added a subscriber: llvm-commits.

Herald added a reviewer: tstellarAMD. Aug 30 2016, 2:51 PM

Herald added subscribers: wdng, mzolotukhin.

This is about what I was guessing. What about having a side SmallSet instead of the linear is_contained?

I thought of that... not sure if it's worth it if the vectors we're dealing with here are always very small (I'm thinking they're actually under 16 elements each). I can be convinced either way right now.

LGTM. I guess the worst possible case is 16 x 16 which probably isn't so bad
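
For reference, the two options weighed above look roughly like this. This is a standalone sketch using standard containers, not LLVM's `SmallVector` or `SmallSet`. `containsLinear` mirrors the linear scan that `is_contained` performs, and `IndexedVector` is a hypothetical wrapper showing what a side set would add.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Linear membership test: O(n) per query, cheap when n stays small.
bool containsLinear(const std::vector<int> &V, int X) {
  return std::find(V.begin(), V.end(), X) != V.end();
}

// Hypothetical alternative: the vector keeps order and duplicates,
// while a side set answers membership queries without a scan.
class IndexedVector {
  std::vector<int> Items;
  std::unordered_set<int> Seen;

public:
  void push_back(int X) {
    Items.push_back(X);
    Seen.insert(X);
  }
  bool contains(int X) const { return Seen.count(X) != 0; }
  int operator[](std::size_t I) const {
    assert(I < Items.size());
    return Items[I];
  }
  std::size_t size() const { return Items.size(); }
};
```

The side set costs an extra insertion per push and extra memory. For vectors of a dozen or so elements the linear scan is likely to be competitive, which matches the conclusion reached in the thread.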

This revision is now accepted and ready to land. Aug 30 2016, 3:26 PM

jlebar accepted this revision. Aug 30 2016, 3:29 PM

jlebar edited edge metadata.

jlebar added inline comments.

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
677 ↗	(On Diff #69760)	Please clang-format. :)

Format.

Here's something that bothers me. While it is possible to have a head with multiple tails (and the Heads and Tails vectors reflect this), the "ConsecutiveChain[i] = j;" assignment forces a single-path chain.
Obviously this will miss vectorization opportunities.
e.g. load a[1], load a[1], load a[2], load a[2] will only vectorize one pair, because the second a[1] will point to the same (already vectorized) a[2].
But the alternative of going through all options can end up being prohibitive.
Say you have load a[1] N times, followed by load a[2] N times, etc.; you'd end up with N^2 comparisons, extending to N^K for a[K].
Is it worth extending this?
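
The single-successor behaviour can be checked with a small model. The sketch below copies the search loop from the diff but replaces `isConsecutiveAccess` with a comparison of integer element indices, which is an assumption made only for illustration. `buildConsecutiveChain` and `followChain` are hypothetical names.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Addr[k] is the element index accessed by instruction k.
// Returns Chain, where Chain[i] is the single successor chosen for i, or -1.
std::vector<int> buildConsecutiveChain(const std::vector<int> &Addr) {
  const int E = static_cast<int>(Addr.size());
  std::vector<int> Chain(E, -1);
  for (int i = 0; i < E; ++i) {
    for (int j = E - 1; j >= 0; --j) {
      if (i == j)
        continue;
      // Stand-in for isConsecutiveAccess(Instrs[i], Instrs[j]).
      if (Addr[j] == Addr[i] + 1) {
        if (Chain[i] != -1) {
          // Same distance test as in the diff.
          int CurDistance = std::abs(Chain[i] - i);
          int NewDistance = std::abs(Chain[i] - j);
          if (j < i || NewDistance > CurDistance)
            continue; // Should not insert.
        }
        Chain[i] = j; // One successor per head, later matches overwrite earlier ones.
      }
    }
  }
  return Chain;
}

// Walk successors from Head until the chain ends.
std::vector<int> followChain(const std::vector<int> &Chain, int Head) {
  assert(Head >= 0 && Head < static_cast<int>(Chain.size()));
  std::vector<int> Path;
  for (int I = Head; I != -1; I = Chain[I])
    Path.push_back(I);
  return Path;
}
```

For the accesses a[1], a[1], a[2], a[2] (instructions 0 to 3), both heads end up with successor 2, and instruction 3 is never reached from either head. That is the missed pair described above.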

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
677 ↗	(On Diff #69770)	ACK.

Is it worth extending this?

Yeah, I think I had the same concern in the original review.

I don't know if it's a worthwhile optimization or not. But if you can hack something together, it should be easy to run an experiment -- compile TensorFlow (or Eigen or Thrust) and use the statistics we already have to count how many extra vectorization opportunities we find.

Sounds good. I'll land this as is to fix the PR and test the extension separately. Thanks!

Sadly, the original version of this pass was *very* much not designed for the case of loading the same location multiple times in a single basic block... as you can probably tell.

Well, yes :) ... but that may be fine if the costs outweigh the benefits, we just don't know right now...
As Justin pointed out, I need to get some test data showing one way or the other.

Closed by commit rL280179: [LoadStoreVectorizer] Change VectorSet to Vector to match head and tail… (authored by asbirlea). Aug 30 2016, 5:02 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path (Size)

llvm/trunk/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp (14 lines)
llvm/trunk/test/Transforms/LoadStoreVectorizer/AMDGPU/multiple_tails.ll (64 lines)
llvm/trunk/test/Transforms/LoadStoreVectorizer/X86/subchain-interleaved.ll (30 lines)

Diff 69785

llvm/trunk/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

(622 preceding lines not shown; enclosing context: for (const std::pair<Value *, InstrList> &Chain : Map) {)

     }
   }

   return Changed;
 }

 bool Vectorizer::vectorizeInstructions(ArrayRef<Instruction *> Instrs) {
   DEBUG(dbgs() << "LSV: Vectorizing " << Instrs.size() << " instructions.\n");
-  SmallSetVector<int, 16> Heads, Tails;
+  SmallVector<int, 16> Heads, Tails;
   int ConsecutiveChain[64];

   // Do a quadratic search on all of the given stores and find all of the pairs
   // of stores that follow each other.
   for (int i = 0, e = Instrs.size(); i < e; ++i) {
     ConsecutiveChain[i] = -1;
     for (int j = e - 1; j >= 0; --j) {
       if (i == j)
         continue;

       if (isConsecutiveAccess(Instrs[i], Instrs[j])) {
         if (ConsecutiveChain[i] != -1) {
           int CurDistance = std::abs(ConsecutiveChain[i] - i);
           int NewDistance = std::abs(ConsecutiveChain[i] - j);
           if (j < i || NewDistance > CurDistance)
             continue; // Should not insert.
         }

-        Tails.insert(j);
-        Heads.insert(i);
+        Tails.push_back(j);
+        Heads.push_back(i);
         ConsecutiveChain[i] = j;
       }
     }
   }

   bool Changed = false;
   SmallPtrSet<Instruction *, 16> InstructionsProcessed;

   for (int Head : Heads) {
     if (InstructionsProcessed.count(Instrs[Head]))
       continue;
-    bool longerChainExists = false;
+    bool LongerChainExists = false;
     for (unsigned TIt = 0; TIt < Tails.size(); TIt++)
       if (Head == Tails[TIt] &&
           !InstructionsProcessed.count(Instrs[Heads[TIt]])) {
-        longerChainExists = true;
+        LongerChainExists = true;
         break;
       }
-    if (longerChainExists)
+    if (LongerChainExists)
       continue;

     // We found an instr that starts a chain. Now follow the chain and try to
     // vectorize it.
     SmallVector<Instruction *, 16> Operands;
     int I = Head;
-    while (I != -1 && (Tails.count(I) || Heads.count(I))) {
+    while (I != -1 && (is_contained(Tails, I) || is_contained(Heads, I))) {
       if (InstructionsProcessed.count(Instrs[I]))
         break;

       Operands.push_back(Instrs[I]);
       I = ConsecutiveChain[I];
     }

     bool Vectorized = false;
(352 following lines not shown)

llvm/trunk/test/Transforms/LoadStoreVectorizer/AMDGPU/multiple_tails.ll

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -basicaa -load-store-vectorizer -S -o - %s | FileCheck %s

				target datalayout = "e-p:32:32-p1:64:64-p2:64:64-p3:32:32-p4:64:64-p5:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64"

				; Checks that there is no crash when there are multiple tails
				; for a the same head starting a chain.
				@0 = internal addrspace(3) global [16384 x i32] undef

				; CHECK-LABEL: @no_crash(
				; CHECK: store <2 x i32> zeroinitializer
				; CHECK: store i32 0
				; CHECK: store i32 0

				define void @no_crash(i32 %arg) {
				%tmp2 = add i32 %arg, 14
				%tmp3 = getelementptr [16384 x i32], [16384 x i32] addrspace(3)* @0, i32 0, i32 %tmp2
				%tmp4 = add i32 %arg, 15
				%tmp5 = getelementptr [16384 x i32], [16384 x i32] addrspace(3)* @0, i32 0, i32 %tmp4

				store i32 0, i32 addrspace(3)* %tmp3, align 4
				store i32 0, i32 addrspace(3)* %tmp5, align 4
				store i32 0, i32 addrspace(3)* %tmp5, align 4
				store i32 0, i32 addrspace(3)* %tmp5, align 4

				ret void
				}

				; Check adjiacent memory locations are properly matched and the
				; longest chain vectorized

				; CHECK-LABEL: @interleave_get_longest
				; CHECK: load <2 x i32>
				; CHECK: load i32
				; CHECK: store <2 x i32> zeroinitializer
				; CHECK: load i32
				; CHECK: load <2 x i32>
				; CHECK: load i32
				; CHECK: load i32

				define void @interleave_get_longest(i32 %arg) {
				%a1 = add i32 %arg, 1
				%a2 = add i32 %arg, 2
				%a3 = add i32 %arg, 3
				%a4 = add i32 %arg, 4
				%tmp1 = getelementptr [16384 x i32], [16384 x i32] addrspace(3)* @0, i32 0, i32 %arg
				%tmp2 = getelementptr [16384 x i32], [16384 x i32] addrspace(3)* @0, i32 0, i32 %a1
				%tmp3 = getelementptr [16384 x i32], [16384 x i32] addrspace(3)* @0, i32 0, i32 %a2
				%tmp4 = getelementptr [16384 x i32], [16384 x i32] addrspace(3)* @0, i32 0, i32 %a3
				%tmp5 = getelementptr [16384 x i32], [16384 x i32] addrspace(3)* @0, i32 0, i32 %a4

				%l1 = load i32, i32 addrspace(3)* %tmp2, align 4
				%l2 = load i32, i32 addrspace(3)* %tmp1, align 4
				store i32 0, i32 addrspace(3)* %tmp2, align 4
				store i32 0, i32 addrspace(3)* %tmp1, align 4
				%l3 = load i32, i32 addrspace(3)* %tmp2, align 4
				%l4 = load i32, i32 addrspace(3)* %tmp3, align 4
				%l5 = load i32, i32 addrspace(3)* %tmp4, align 4
				%l6 = load i32, i32 addrspace(3)* %tmp5, align 4
				%l7 = load i32, i32 addrspace(3)* %tmp5, align 4
				%l8 = load i32, i32 addrspace(3)* %tmp5, align 4

				ret void
				}

llvm/trunk/test/Transforms/LoadStoreVectorizer/X86/subchain-interleaved.ll

(79 preceding lines not shown; enclosing context: define void @chain_prefix_suffix(i32* noalias %ptr) {)
		store i32 0, i32* %next.gep2, align 4
		%l3 = load i32, i32* %next.gep1, align 4
		%l4 = load i32, i32* %next.gep2, align 4
		%l5 = load i32, i32* %next.gep3, align 4

		ret void
		}

		; FIXME: If the chain is too long and TLI says misaligned is not fast,
		; then LSV fails to vectorize anything in that chain.
		; To reproduce below, add a tmp5 (ptr+4) and load tmp5 into l6 and l7.

		; CHECK-LABEL: @interleave_get_longest
		; CHECK: load <3 x i32>
		; CHECK: load i32
		; CHECK: store <2 x i32> zeroinitializer
		; CHECK: load i32
		; CHECK: load i32
		; CHECK: load i32

		define void @interleave_get_longest(i32* noalias %ptr) {
		%tmp1 = getelementptr i32, i32* %ptr, i64 0
		%tmp2 = getelementptr i32, i32* %ptr, i64 1
		%tmp3 = getelementptr i32, i32* %ptr, i64 2
		%tmp4 = getelementptr i32, i32* %ptr, i64 3

		%l1 = load i32, i32* %tmp2, align 4
		%l2 = load i32, i32* %tmp1, align 4
		store i32 0, i32* %tmp2, align 4
		store i32 0, i32* %tmp1, align 4
		%l3 = load i32, i32* %tmp2, align 4
		%l4 = load i32, i32* %tmp3, align 4
		%l5 = load i32, i32* %tmp4, align 4
		%l6 = load i32, i32* %tmp4, align 4
		%l7 = load i32, i32* %tmp4, align 4

		ret void
		}