This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Cluster ordering for loads
ClosedPublic

Authored by dmgreen on Mar 21 2022, 8:17 AM.

Details

Summary

Given loads without a better order, this patch partially sorts the elements to form clusters of adjacent elements in memory. These clusters can potentially be loaded with fewer loads, meaning less overall shuffling (for example, loading the v4i8 clusters of a v16i8 as single f32 loads, as opposed to multiple independent byte loads and inserts).
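
As a rough illustration of the idea (a minimal sketch, not the patch's actual code - the LoadInfo and clusterOrder names are made up for this example), clustering amounts to a stable sort of the gathered loads by base pointer and then by offset from that base, so elements that are adjacent in memory end up adjacent in the order:

#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Each gathered load is described by the base pointer it hangs off, its
// offset from that base, and its lane in the original (unsorted) gather.
struct LoadInfo {
  unsigned BaseId;
  int64_t Offset;
  unsigned OrigLane;
};

// Stable-sort the loads so loads from the same base are grouped together and
// ordered by offset; ties keep their original relative order. The returned
// vector is the new lane order.
inline std::vector<unsigned> clusterOrder(std::vector<LoadInfo> Loads) {
  std::stable_sort(Loads.begin(), Loads.end(),
                   [](const LoadInfo &A, const LoadInfo &B) {
                     return std::tie(A.BaseId, A.Offset) <
                            std::tie(B.BaseId, B.Offset);
                   });
  std::vector<unsigned> Order;
  Order.reserve(Loads.size()); // the number of elements is known up front
  for (const LoadInfo &L : Loads)
    Order.push_back(L.OrigLane);
  return Order;
}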

Diff Detail

Event Timeline

dmgreen created this revision.Mar 21 2022, 8:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 21 2022, 8:17 AM
dmgreen requested review of this revision.Mar 21 2022, 8:17 AM

Just FYI D105986

Oh yeah, that's interesting. I had tried a few of your other patches, but hadn't seen that one - it goes back to before I was looking. It seems to change a lot more - is this one of the things it does? It's hard to tell with so many changes :)

One thing I was thinking of doing here was a kind of ordering priority, only clustering the loads if there wasn't anything else that looked like a better order. A lot of the tests I tried did just fine with the clustered-loads order compared to any other, though, so I wasn't sure if it was worth adding something like that. When I tried it, it was getting tripped up in the TopToBottom ordering, which couldn't detect what counted as a better order.

ABataev added inline comments.Apr 4 2022, 8:03 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3257

I believe it's better to use Value * here instead of auto *.

3259

const auto &?

3259–3269

Better to outline this loop into a lambda.

3266

Use emplace_back(Ptr, *Diff, Cnt); here. Also, I'm not sure about the post-increment; we usually avoid it inside expressions.

3277

emplace_back(Ptr, 0, Cnt);

3282

const auto &?

3285

stable_sort

3291

What if some of the loads are loads from the same address but are different instructions?
Like:

%gep0 = getelementptr i16, i16* %0, i64 1
%l0 = load i16, i16* %gep0
%gep1 = getelementptr i16, i16* %0, i64 1
%l1 = load i16, i16* %gep1

%l0 and %l1 will be in the same vector but they are not consecutive; still, (End - Start) == int(Vec.size() - 1) might be true for >= 4 loads.

3305

Message?

3314

Try to preallocate space in the vector; the number of elements is known.

3368–3369

Add a check that we have >= 4 loads

dmgreen updated this revision to Diff 420435.Apr 5 2022, 3:16 AM
dmgreen marked 10 inline comments as done.

Update as per comments.

dmgreen added inline comments.Apr 5 2022, 3:17 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3259

For this one and the other const auto& below we might modify Base.

3259–3269

Not sure I understood this one exactly, let me know if I got the idea wrong.

3291

Changing it to checking all the indexes sounds good.
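
A minimal sketch of what checking all the indexes could look like, assuming the cluster's offsets have already been computed in element units relative to a common base; the offsetsAreConsecutive name is made up for illustration and is not the code in the patch:

#include <algorithm>
#include <cstdint>
#include <vector>

// Instead of only comparing (End - Start) against the cluster size, verify
// that the offsets, once sorted, form a consecutive run with no duplicates.
// Two loads from the same address then correctly fail the check.
inline bool offsetsAreConsecutive(std::vector<int64_t> Offsets) {
  if (Offsets.empty())
    return false;
  std::sort(Offsets.begin(), Offsets.end());
  for (size_t I = 1, E = Offsets.size(); I != E; ++I)
    if (Offsets[I] != Offsets[I - 1] + 1)
      return false;
  return true;
}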

ABataev added inline comments.Apr 5 2022, 4:03 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3302–3303

What if we have a non-power-of-2 number of elements in each cluster?

llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll
352–359

Looks like a regression here, worth investigating.

dmgreen added inline comments.Apr 5 2022, 7:36 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3302–3303

There are a couple of tests in reduce_blockstrided3 and store_blockstrided3 with blocks of size 3. The first became a few instructions shorter under X86 (and didn't change on AArch64). The second was different-but-not-worse on AArch64 (and didn't change under X86). Those are the only tests I've seen with non-power-of-2 clusters though, so it's not very exhaustive testing.

(A quick test of a "reduce_blockstrided5" seems to be better too - a lot less shuffling in the version I tried under X86, and more vectorization under AArch64.)

It can depend on the cost model - I think both AArch64 and X86 will have a much lower cost for insert-subvectors that are aligned and a power of 2 in size - and on how bad the initial ordering is: if it allows more less-than-full-width vectorization, that might still be a win. (A sketch of the block-of-5 shape is below.)

I can make it more conservative if you think that's best? I don't have a strong opinion either way.
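
For context, a hypothetical C-level shape of a "reduce_blockstrided5" style test (an assumption about what such a test might look like, not the function in loadorder.ll): blocks of five elements separated by a runtime stride, multiplied and reduced.

// Hypothetical sketch only: blocks of 5 i16 elements, separated by a runtime
// stride, multiplied element-wise and summed. Clusters of 5 are not a power
// of 2, which is the case being discussed above.
short reduce_blockstrided5(const short *x, const short *y, int stride) {
  int sum = 0;
  for (int block = 0; block < 4; ++block)
    for (int i = 0; i < 5; ++i)
      sum += x[block * stride + i] * y[block * stride + i];
  return (short)sum;
}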

llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll
352–359

Because of the two v4i16 muls? It looks like it does OK overall with the nicer order of the loads: https://godbolt.org/z/9f44fPeTW, and https://godbolt.org/z/eonoM8Ys7 for x86.

From what I can see, the SLP vectorizer produces a single v8i16 mul. It is instcombine that then splits it up, because it thinks one shuffle is better than two:

*** IR Dump After SLPVectorizerPass on reduce_blockstrided4 ***
define i16 @reduce_blockstrided4(i16* nocapture noundef readonly %x, i16* nocapture noundef readonly %y, i32 noundef %stride) {
entry:
  %idxprom = sext i32 %stride to i64
  %arrayidx4 = getelementptr inbounds i16, i16* %x, i64 %idxprom
  %arrayidx20 = getelementptr inbounds i16, i16* %y, i64 %idxprom
  %0 = bitcast i16* %x to <4 x i16>*
  %1 = load <4 x i16>, <4 x i16>* %0, align 2
  %2 = bitcast i16* %arrayidx4 to <4 x i16>*
  %3 = load <4 x i16>, <4 x i16>* %2, align 2
  %4 = bitcast i16* %y to <4 x i16>*
  %5 = load <4 x i16>, <4 x i16>* %4, align 2
  %6 = bitcast i16* %arrayidx20 to <4 x i16>*
  %7 = load <4 x i16>, <4 x i16>* %6, align 2
  %8 = shufflevector <4 x i16> %5, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  %9 = shufflevector <4 x i16> %7, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  %10 = shufflevector <8 x i16> %8, <8 x i16> %9, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
  %11 = shufflevector <4 x i16> %1, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  %12 = shufflevector <4 x i16> %3, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  %13 = shufflevector <8 x i16> %11, <8 x i16> %12, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
  %14 = mul <8 x i16> %10, %13
  %15 = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %14)
  ret i16 %15
}

I can look into fixing that if you think it's worth doing. I'm not sure how yet (instcombine can't look at the cost model), but I've often worried about the number of vector shuffles that instcombine transforms. Maybe it could be moved to VectorCombine to get better costing.

ABataev accepted this revision.Apr 5 2022, 7:41 AM

LG

llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll
352–359

I think we need to improve the SLP vectorizer here to reduce the number of emitted shuffles; hopefully it will be fixed soon.

This revision is now accepted and ready to land.Apr 5 2022, 7:41 AM

OK, after a bit of a delay (D123801), I'm going to give this a try. Please let me know if it causes issues.

This revision was landed with ongoing or failed builds.May 7 2022, 6:38 AM
This revision was automatically updated to reflect the committed changes.