This is an archive of the discontinued LLVM Phabricator instance.

[LV] Try larger VFs if VF is unprofitable for small types.
Abandoned · Public

Authored by fhahn on Feb 11 2021, 9:26 AM.

Details

Summary

Currently LV can choose sub-optimal vectorization factors for loops with
memory accesses of different widths. At the moment, the largest type
limits the vectorization factor, but this is overly pessimistic on some
targets whose memory instructions require a certain minimum VF for
operations on narrow types to be profitable.

The motivating example is AArch64, which requires larger VFs for
vectorization to be profitable when narrow types are involved.

Currently code like the example below is not vectorized on AArch64,
because the chosen maximum VF of 4 (limited by the largest type, i32) is
not profitable (due to type extensions).

int foo(unsigned char *len, unsigned size) {
  int maxLen = 0;
  int minLen = 0;
  for (unsigned i = 0; i < size; i++) {
    if (len[i] > maxLen) maxLen = len[i];
    if (len[i] < minLen) minLen = len[i];
  }
  return maxLen + minLen;
}

This patch addresses the issue by detecting cases where memory ops on
the narrowest type are more expensive than with larger VFs. For such
cases, it additionally considers larger vectorization factors, limited
by estimated register usage. Loops like the one above can be sped up by
~4x on AArch64.

This change should not introduce regressions: we only explore additional
vectorization factors, and the cost model still picks the most
profitable one.

The impact on SPEC2000 & SPEC2006 is relatively small:

Tests: 31
Same hash: 18 (filtered out)
Remaining: 13
Metric: loop-vectorize.LoopsVectorized

Program                                      Before   After  Change
test-suite...T2000/300.twolf/300.twolf.test   18.00   23.00   27.8%
test-suite...T2000/256.bzip2/256.bzip2.test   12.00   14.00   16.7%
test-suite...T2006/401.bzip2/401.bzip2.test   15.00   17.00   13.3%
test-suite...T2006/445.gobmk/445.gobmk.test   25.00   27.00    8.0%
test-suite...0/253.perlbmk/253.perlbmk.test   32.00   34.00    6.2%
test-suite...000/186.crafty/186.crafty.test   19.00   20.00    5.3%
test-suite...0.perlbench/400.perlbench.test   38.00   40.00    5.3%
test-suite...T2006/456.hmmer/456.hmmer.test   63.00   65.00    3.2%
test-suite...6/482.sphinx3/482.sphinx3.test   64.00   66.00    3.1%
test-suite.../CINT2000/176.gcc/176.gcc.test   43.00   44.00    2.3%
test-suite.../CINT2006/403.gcc/403.gcc.test   97.00   98.00    1.0%
test-suite...3.xalancbmk/483.xalancbmk.test  271.00  273.00    0.7%
test-suite...6/464.h264ref/464.h264ref.test   79.00   79.00    0.0%

There are a few small runtime improvements.

I also verified the changes to the vectorized loops in 300.twolf, 401.bzip2
& 445.gobmk. All changed loops are loops that the patch targets.

Diff Detail

Event Timeline

fhahn created this revision.Feb 11 2021, 9:26 AM
fhahn requested review of this revision.Feb 11 2021, 9:26 AM

I like the idea of this. I think it often makes sense to at least try a larger vector size, even if it's over the vector width.

Are reductions your only target for this? There is some code at the moment in getSmallestAndWidestTypes that limits the widest type to loads, stores and reductions, using RdxDesc.getRecurrenceType(). We already override that for InLoopReductions, as they can sometimes handle larger-than-legal operations natively. Does that use the original reduction size, or the MinBWs size? Does MinBWs help here, or does the cmp block it from working, so that it doesn't realize the whole loop can be calculated using an i8?

In that example, the entire loop seems like it could become load; smin.16b, smax.16b under AArch64. That might be difficult to make happen though.

I like the idea of this. I think it often makes sense to at least try a larger vector size, even if it's over the vector width.

Agreed! Ideally we would be confident enough in the cost model to explore larger VFs more liberally and trust it to pick the best one, even in the presence of larger ones. I think this patch is a small first step in that direction, with potential to extend it to more cases in the future.

Are reductions your only target for this? There is some code at the moment in getSmallestAndWidestTypes that limits the widest type to loads, stores and reductions, using RdxDesc.getRecurrenceType(). We already override that for InLoopReductions, as they can sometimes handle larger-than-legal operations natively. Does that use the original reduction size, or the MinBWs size? Does MinBWs help here, or does the cmp block it from working, so that it doesn't realize the whole loop can be calculated using an i8?

I just realized the example I chose was more complex than it needed to be. The main initial focus is narrow loads & stores, but it also applies to reductions. I don't think MinBWs helps directly with deciding whether to try bigger vector factors (the IR test cases do not have reductions and have either a load or store of i32).

In that example, the entire loop seems like it could become load; smin.16b, smax.16b under AArch64. That might be difficult to make happen though.

Yeah, I'm planning to look into whether we can use MinBWs more aggressively in the reduction case as a follow-up.

I like the idea of this. I think it often makes sense to at least try a larger vector size, even if it's over the vector width.

Agreed! Ideally we would be confident enough in the cost model to explore larger VFs more liberally and trust it to pick the best one, even in the presence of larger ones. I think this patch is a small first step in that direction, with potential to extend it to more cases in the future.

The approach seems sensible to me. Is the reason for not exploring larger VFs more liberally because the cost-model doesn't always accurately represent/consider the legalization costs?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5526

nit: Are you planning to handle more special cases like this in the future? If so, then it may be worth moving this to its own shouldMaximizeBandwidth function.

5535

nit: this condition is currently always true. Can you change this into an assert?

OK, so this is more about the loads being expensive than about the reductions. If this is about 4 x i8 loads being expensive, should we just override shouldMaximizeVectorBandwidth on AArch64? It could be based on SmallestType / WidestType if that makes it more precise, but I don't know that it needs to be.
https://godbolt.org/z/xG55W6

That would avoid putting this into the vectorizer cost model directly, which may not be correct given the made-up alignments/address space and the lack of any indication of whether the loads are extended or not.

fhahn updated this revision to Diff 407245.Feb 9 2022, 12:25 PM
fhahn marked 2 inline comments as done.

A blast from the past :)

Rebased and updated to work with scalable VFs as well.

OK, so this is more about the loads being expensive than about the reductions. If this is about 4 x i8 loads being expensive, should we just override shouldMaximizeVectorBandwidth on AArch64? It could be based on SmallestType / WidestType if that makes it more precise, but I don't know that it needs to be.
https://godbolt.org/z/xG55W6

That would avoid putting this into the vectorizer cost model directly, which may not be correct given the made-up alignments/address space and the lack of any indication of whether the loads are extended or not.

The extended-loads case is interesting. To catch that, we would effectively have to look at all the loads. Where we do that wouldn't matter too much, I think.

fhahn added inline comments.Feb 9 2022, 12:26 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5535

I updated the code to support scalable vectors, but it seems the cost model considers loads of <vscale x 4 x i8> as expensive as <vscale x 16 x i8>, so the higher VF isn't chosen for scalable vectors. And the maximized fixed VF seems to be more profitable than the scalable one, so we end up choosing <16 x i8> (see the scalable tests).

fhahn abandoned this revision.Jul 1 2022, 7:29 AM