Page MenuHomePhabricator

[SLP] Vectorize transposable binary operand bundles
Needs ReviewPublic

Authored by mssimpso on Apr 26 2018, 8:44 AM.

Details

Summary

Given a bundle of binary operations, we normally try to vectorize their operands by bundling all operand-zero values together and bundling all operand-one values together. Thus, assuming all operands are instructions, each operand bundle must have the same opcode to be vectorizable. If this is not the case, the operand bundles must be gathered. Instead of gathering, we can try to "transpose" the operand bundles such that each resulting bundle will have a single opcode.

This patch adds the ability to transpose binary operand bundles to enable vectorization. When we transpose the operand bundles, we bundle together all operands of one or more instructions, rather than bundling together a single operand (e.g, operand-zero) of several instructions. This bundling is similar to that performed by tryToVectorizePair. If a transpose is performed while building a vectorizable tree, we must re-transpose the vectors at run-time to restore the correct bundling. The transpose is performed with shufflevector instructions that read corresponding even- or odd-numbered vector elements from two n-dimensional source vectors and write each result into consecutive elements of an n-dimensional destination vector. The cost of these added shufflevector instructions is included in the cost model.

Please refer to the modified test cases for examples.

Diff Detail

Event Timeline

mssimpso created this revision.Apr 26 2018, 8:44 AM

Will it work correctly if some of the operations are used several times in the bundles? It would be good to have the tests for this kind of situation.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1444–1460

Maybe it's worth it to merge these 2 checks into one to reduce the number of iterations?

1491

auto &->const BoUpSLP::ValueList & or, probably, you may use ArrayRef<Value* here

1495

Try to use Bundles.emplace_back(Slice.begin(), Slice.end());

1982

auto->Optional<SmallVector<BoUpSLP::ValueList, 4>>

1983

auto &->const auto &

2494

Pre-reserve the space for the Operands here.

2495

Maybe it is worth to store the Transposed bundles in the TreeNode to not perform this kind of analysis several times?

Will it work correctly if some of the operations are used several times in the bundles? It would be good to have the tests for this kind of situation.

Thanks for taking a look at the patch Alexey!. It does work for the reuse case, but the operations still need to be in transposed form. I'll add a test for this. I already have one reuse test case (build_vec_v4i32_reuse), but this shows reuse in the final binop, not in its operands.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1444–1460

Sure, that makes sense.

2495

Yes, that was my thought as well. I'll update the patch.

Ayal added a comment.Apr 30 2018, 8:48 AM

This is reminiscent of LV's interleave group optimization, in the sense that a couple of correlated inefficient vector "gathers" are replaced by a couple of efficiently formed vectors followed by transposing shuffles. The correlated gathers may come from the two operands of a binary operation, as in this patch, or more generally from arbitrary leaves of the SLP tree.

lib/Analysis/VectorUtils.cpp
535

Could createStrideMask be (re)used?

lib/Transforms/Vectorize/SLPVectorizer.cpp
1470

Both operands are mapped to the opcode of operand 0?

1528

May be easier to follow if ElemsPerSrcVec is initially set to BundleVF and doubled inside the loop, ElemsPerDstVec is set to its double, and VF renamed to something like NumVectors.

1980

Comment that transposing operands each having a common opcode is mutually exclusive with swapping commutative operands below, and should precede it?

This is reminiscent of LV's interleave group optimization, in the sense that a couple of correlated inefficient vector "gathers" are replaced by a couple of efficiently formed vectors followed by transposing shuffles. The correlated gathers may come from the two operands of a binary operation, as in this patch, or more generally from arbitrary leaves of the SLP tree.

Thanks for taking a look at the patch, Ayal! Yes, that's right. It's a little like LV's interleave groups. While the "correlated gathers" would be leaves of the tree without this patch, if we can perform a transpose, we can continue recursively building a deeper tree based on the shuffled bundles. That's a good summary that could be incorporated in the comment above transposeBinOpBundle.

And yes, the approach is currently limited to the two operands of binary operations. But it could be generalized to include the operands of other instructions, or as you point out, any correlated gathers in the SLP tree.

lib/Analysis/VectorUtils.cpp
535

Not directly, but we could use createStrideMask to create two stride-2 masks and then interleave them.

stride_mask_0 = <0, 2,  4, 6>
stride_mask_1 = <8, 10, 12, 14>
transpose_mask = <0, 8, 2, 10, 4, 12, 6, 14>

What do you think?

lib/Transforms/Vectorize/SLPVectorizer.cpp
1470

I've already checked that both operands have the same opcode (starting at line 1423). I also added a comment here, but this is probably still a bit confusing. I will rewrite this to make it more clear.

1528

Sounds good.

1980

Yes, that's a good idea.

Ayal added inline comments.May 1 2018, 8:45 AM
lib/Analysis/VectorUtils.cpp
535

I'm a bit confused about the concatenation order which presumably leads to this interleaved strided mask, rather than using the strided <0, 2, 4, ..., 2*(VF-1)> mask directly. See comments below, also examining the VF=4 tests.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1470

Ahh, of course...

1479

Making all bundles have the same (smallest) size certainly simplifies things, but potentially misses performance opportunities; at-least in theory. Perhaps deserves a comment.

Suffice to check that MinSize is a power of 2, and that all bundle sizes are divisible by MinSize, rather than requiring all bundle sizes to be powers of 2. E.g., having one bundle of size 2 and another of size 6 should be fine, iiuc.

1491

This is a bit confusing: bundles are traversed according to their insertion order into Opcode2Operands, i.e., according to original operand order; but are aggregated and placed inside Bundles according to their opcode?

test/Transforms/SLPVectorizer/AArch64/transpose.ll
93

Indeed this calls for shuffling with <0,4,2,6> and <1,5,3,7> masks; but is this pattern more natural to expect than

%tmp2.1 = add i32 %tmp0.2, %tmp0.3
%tmp2.2 = add i32 %tmp1.0, %tmp1.1

Perhaps in some complex numbers context(?)

168

Could this be done equally well with 'normal' strided masks, i.e.:

; CHECK-NEXT:    [[TMP5:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT:    [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT:    [[TMP7:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> [[TMP6]], <4 x i32> <i32 0, i32 2, i32 4, i32 6>
; CHECK-NEXT:    [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> [[TMP6]], <4 x i32> <i32 1, i32 3, i32 5, i32 7>
183

This admittedly refers to the original test: %tmp1.2 == %tmp0.2 and %tmp1.3 == %tmp0.3. Not sure if this was intentional, but it does raise the issue of exercising bundles of originally different sizes, and potential reuse of same instruction multiple times. Are the four xor's first bundled together, and then broken into two adjacent bundles of MinSize=2?

228

ditto.

mssimpso updated this revision to Diff 144743.May 1 2018, 10:00 AM
mssimpso marked 9 inline comments as done.

Addressed first round of comments from Alexey and Ayal. Thanks again for the feedback! I'll respond to Ayal's most recent comments in a separate update.

mssimpso added inline comments.May 1 2018, 10:33 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1491

Using const std::pair<unsigned, BoUpSLP::ValueList> &

2494

I'm not sure we can do this. For the transpose case, since we don't know what the operands are (they will be shuffles), I've left Operands empty. If we instead pre-reserve space, we'll end up sending a vector of null pointers to getArithmeticInstrCost, which will probably break.

mssimpso added inline comments.May 1 2018, 11:12 AM
test/Transforms/SLPVectorizer/AArch64/transpose.ll
93

Ah, I see your point now. I think the current patch will actually produce incorrect code if you make the above change to the test, since we don't actually enforce the "interleaved order". We'll still use the <0,4,2,6> and <1,5,3,7> masks instead of the stride-2 masks. So I think we should probably record what the mask should be when we do the re-bundling to allow for the various possibilities. It makes sense that we would have to do this in hindsight.

We can also probably get rid of the MinSize restriction at the same time.

I'd also want to test the mask against the known shuffle kinds for the cost calculation to ensure we are computing the most appropriate cost for the target. I'm actually surprised we don't already have something like TTI->getShuffleKind(ArrayRef<int> Mask). Perhaps I'll work on that first.

Ayal added inline comments.May 1 2018, 12:31 PM
test/Transforms/SLPVectorizer/AArch64/transpose.ll
93

Perhaps the initial natural order, going bottom-up, is the 2 x n matrix transpose, i.e., the even and odd stride-2 masks. This could be further optimized, by considering the orderings preferred by subsequent leaves up the tree, similar to Alexey's NumOpsWantToKeepOrder; but here one could pass through several transposings before reaching the leaves... (btw, any worthwhile workloads driving this?)
After choosing the bestOrder() = the one wanted by most ops, SK_PermuteSingleSrc shuffle costs are used. LV uses getInterleavedMemoryOpCost*() to compute the cost of its strided shuffles. And isShuffle() may also provide inspiration ;-).