Given a load without a better order, this patch partially sorts the elements to form clusters of adjacent elements in memory. These clusters can potentially be loaded with fewer loads, meaning less overall shuffling (for example, loading the v4i8 clusters of a v16i8 as single f32 loads, as opposed to multiple independent byte loads and inserts).
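As a rough sketch of the idea, assuming we already have each load's base pointer and constant offset (the `LoadInfo` struct and `clusterAdjacentLoads` helper below are made-up names, not the patch's actual code), clustering amounts to stable-sorting the loads by (base, offset) and then cutting a new group wherever the next offset is not exactly one element further on:

```cpp
// Hypothetical, simplified model of the clustering step: each load is reduced
// to (base pointer id, constant byte offset, original index). Loads are
// stable-sorted by (base, offset) and split into clusters wherever the next
// offset is not exactly one element past the previous one.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

struct LoadInfo {
  unsigned Base;   // stand-in for the underlying base pointer
  int64_t Offset;  // constant byte offset from that base
  unsigned Index;  // position of the load in the original (unordered) bundle
};

static std::vector<std::vector<unsigned>>
clusterAdjacentLoads(std::vector<LoadInfo> Loads, int64_t ElemSize) {
  // stable_sort keeps the original relative order of loads at equal addresses.
  std::stable_sort(Loads.begin(), Loads.end(),
                   [](const LoadInfo &A, const LoadInfo &B) {
                     return std::tie(A.Base, A.Offset) <
                            std::tie(B.Base, B.Offset);
                   });

  std::vector<std::vector<unsigned>> Clusters;
  for (std::size_t I = 0; I < Loads.size(); ++I) {
    bool Adjacent = I != 0 && Loads[I].Base == Loads[I - 1].Base &&
                    Loads[I].Offset == Loads[I - 1].Offset + ElemSize;
    if (!Adjacent)
      Clusters.emplace_back(); // start a new cluster of adjacent elements
    Clusters.back().push_back(Loads[I].Index);
  }
  return Clusters;
}
```

For the v16i8 example above, the sixteen byte loads would fall into four clusters of four adjacent bytes, each of which could then be covered by a single 32-bit (f32) load instead of independent byte loads and inserts.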
Oh yeah, that's interesting. I had tried a few of your other patches, but not seen that one. It goes back to before I was looking. It seems to change a lot more - is this one of the things it does? It's hard to tell with so many changes :)
One thing I was thinking of doing here was a kind of ordering priority, and only clustering the loads if there wasn't anything else that looked like a better order. A lot of the tests I tried did just fine with the clustered-loads order compared to any other, though, so I wasn't sure if it was worth adding something like that. When I tried it, it was getting tripped up in the TopToBottom ordering, not being able to detect what counted as a better order.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

- Line 3257: I believe it's better to use `Value *` here instead of `auto *`.
- Line 3259: `const auto &`?
- Lines 3259–3269: Better to outline this loop to a lambda.
- Line 3266: `emplace_back(Ptr, *Diff, Cnt);`. Also, I'm not sure about the post-increment here; we usually avoid it in expressions.
- Line 3277: `emplace_back(Ptr, 0, Cnt);`
- Line 3282: `const auto &`?
- Line 3285: `stable_sort`
- Line 3291: What if some of the loads are loads from the same address but from different instructions? For example `%gep0 = gep %0, 1; %l0 = load %gep0; %gep1 = gep %0, 1; %l1 = load %gep1`: %l0 and %l1 will end up in the same vector but they are not consecutive, though `(End - Start) == int(Vec.size() - 1)` might be true for >= 4 loads. (See the sketch after this list.)
- Line 3305: Message?
- Line 3314: Try to preallocate space in the vector; the number of elements is known.
- Lines 3368–3369: Add a check that we have >= 4 loads.
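To make the line 3291 comment concrete, here is a minimal standalone sketch (the `isConsecutiveRun` helper is a made-up name, not the patch's code) of why the span check alone is not enough once the offsets are sorted: a bundle with a repeated address can still satisfy `(End - Start) == int(Vec.size() - 1)`, so duplicates have to be rejected explicitly.

```cpp
// Hypothetical check over already-sorted constant offsets (in elements):
// the span test by itself accepts duplicated addresses, so reject them too.
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

static bool isConsecutiveRun(const std::vector<int64_t> &SortedOffsets) {
  if (SortedOffsets.empty())
    return false;
  int64_t Start = SortedOffsets.front();
  int64_t End = SortedOffsets.back();
  // Span check from the review: necessary, but not sufficient on its own.
  if (End - Start != int64_t(SortedOffsets.size()) - 1)
    return false;
  // {0, 1, 1, 3} passes the span check (3 == 4 - 1) even though two loads
  // read the same address; an explicit duplicate check catches that.
  return std::adjacent_find(SortedOffsets.begin(), SortedOffsets.end()) ==
         SortedOffsets.end();
}

int main() {
  assert(isConsecutiveRun({0, 1, 2, 3}));  // genuinely consecutive
  assert(!isConsecutiveRun({0, 1, 1, 3})); // repeated address, span still 3
  return 0;
}
```

Whether the patch handles this with a duplicate check, a different data structure, or by bailing out is not shown here; the sketch only illustrates the failure mode the reviewer describes.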
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

- Lines 3302–3303: There are a couple of tests in reduce_blockstrided3 and store_blockstrided3 with blocks of size 3. The first was a few instructions shorter on X86 (didn't change on AArch64). The second was different-but-not-worse on AArch64 (didn't change on X86). Those are the only tests I've seen with non-power-of-2 clusters, though, so it's not very exhaustive testing. (A quick test of a "reduce_blockstrided5" seems to be better too: a lot less shuffling in the version I tried on X86 and more vectorization on AArch64.) It can depend on the cost model; I think both AArch64 and X86 will have a much lower cost for insert-subvectors that are aligned and a power of 2 in size. It also depends on how bad the initial ordering is: if it allows more less-than-full-width vectorization, that might still be a win. I can make it more conservative if you think that's best? I don't have a strong opinion either way.
llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll

- Lines 352–359: Because of the two v4i16 muls? It looks like it does OK overall with the nicer order of the loads: https://godbolt.org/z/9f44fPeTW, and https://godbolt.org/z/eonoM8Ys7 for x86. From what I can see, the SLP vectorizer produces a single v8i16 mul; it is instcombine that then splits that up because it thinks that one shuffle is better than two:

  ```llvm
  *** IR Dump After SLPVectorizerPass on reduce_blockstrided4 ***
  define i16 @reduce_blockstrided4(i16* nocapture noundef readonly %x, i16* nocapture noundef readonly %y, i32 noundef %stride) {
  entry:
    %idxprom = sext i32 %stride to i64
    %arrayidx4 = getelementptr inbounds i16, i16* %x, i64 %idxprom
    %arrayidx20 = getelementptr inbounds i16, i16* %y, i64 %idxprom
    %0 = bitcast i16* %x to <4 x i16>*
    %1 = load <4 x i16>, <4 x i16>* %0, align 2
    %2 = bitcast i16* %arrayidx4 to <4 x i16>*
    %3 = load <4 x i16>, <4 x i16>* %2, align 2
    %4 = bitcast i16* %y to <4 x i16>*
    %5 = load <4 x i16>, <4 x i16>* %4, align 2
    %6 = bitcast i16* %arrayidx20 to <4 x i16>*
    %7 = load <4 x i16>, <4 x i16>* %6, align 2
    %8 = shufflevector <4 x i16> %5, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
    %9 = shufflevector <4 x i16> %7, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
    %10 = shufflevector <8 x i16> %8, <8 x i16> %9, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
    %11 = shufflevector <4 x i16> %1, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
    %12 = shufflevector <4 x i16> %3, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
    %13 = shufflevector <8 x i16> %11, <8 x i16> %12, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
    %14 = mul <8 x i16> %10, %13
    %15 = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %14)
    ret i16 %15
  }
  ```

  I can look into fixing that if you think it's worth doing. I'm not sure how yet (instcombine can't look at the cost model), but I've often worried about the amount of vector shuffles that instcombine transforms. Maybe it could be moved to VectorCombine so it can get better costing.
LG
llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll

- Lines 352–359: I think we need to improve the SLP vectorizer here to reduce the number of emitted shuffles; hope it will be fixed soon.
OK, after a bit of a delay (D123801), I'm going to give this a try. Please let me know if it causes issues.