This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize] Disable single stride access predicates when gather loads are available.
Accepted · Public

Authored by dmgreen on Dec 27 2019, 12:54 AM.

Details

Reviewers
Ayal
hsaito
fhahn
Summary

The LoopVectorizer/LAA has the ability to add runtime checks for memory accesses that look like they may be single stride accesses, in an attempt to still run vectorized code. This can happen in a boring matrix multiply kernel, for example:

for(int i = 0; i < n; i++) {
  for (int j = 0; j < m; j++)
  {
    int sum = 0;
    for (int k = 0; k < l; k++)
      sum += A[i*l + k] * B[k*m + j];
    C[i*m + j] = sum;
  }
}

However, if we have access to efficient vector gather loads, they should be a much better option than vectorizing with runtime checks for a stride of 1.

This adds a check in LAA, the place that appears to be dictating this, to see whether a MaskedGather or MaskedScatter would be legal.
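
For illustration, the versioning that such a runtime check produces amounts to roughly the following for the inner loop above (a C-level sketch of the effect, not the vectorizer's actual output; the vectorized arm is shown as scalar source for readability):

if (m == 1) {
  // Versioned copy: with m known to be 1, B[k*m + j] is the unit-stride
  // access B[k + j], so the k-loop can be vectorized with ordinary wide loads.
  for (int k = 0; k < l; k++)
    sum += A[i*l + k] * B[k + j];
} else {
  // Fallback: the original scalar loop with the strided access B[k*m + j].
  for (int k = 0; k < l; k++)
    sum += A[i*l + k] * B[k*m + j];
}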

Diff Detail

Event Timeline

dmgreen created this revision. · Dec 27 2019, 12:54 AM
Herald added a project: Restricted Project. · View Herald Transcript · Dec 27 2019, 12:54 AM
Ayal added a comment. · Dec 31 2019, 3:44 AM

The LoopVectorizer/LAA has the ability to add runtime checks for memory accesses that look like they may be single stride accesses, in an attempt to still run vectorized code. This can happen in a boring matrix multiply kernel, for example: [snip]

However, if we have access to efficient vector gather loads, they should be a much better option than vectorizing with runtime checks for a stride of 1.

This adds a check in LAA, the place that appears to be dictating this, to see whether a MaskedGather or MaskedScatter would be legal.

OK.

Longer version:

Agreed, this is the place that gathers all symbolic strides for which runtime checks are later added by replaceSymbolicStrideSCEV().

Agreed, a gather or scatter would probably be a better option than a runtime check, certainly if the stride turns out to be other than 1, in which case the runtime check option will execute the original scalar loop.
Note that if the stride does turn out to be 1, a runtime check may be faster: the cost of a vector load/store is typically less than that of a gather/scatter, disregarding the overhead of the runtime check itself. So having a way to "manually" restore original performance for such cases may be useful (in addition to EnableMemAccessVersioning). Always preferring a gather or scatter as suggested should be a good step forward, given the expected complexity of devising a cost-based preference.

Instead of teaching LAI to make such target/cost-based decisions, it would have been better to let this analysis continue to collect *all* symbolic strides potentially subject to runtime checks, and teach the planning/transform to prune/decide which strides to actually specialize; e.g., have LVP::plan() start by calling "CM.setVersioningForStrideDecisions()", analogous to InterleavedAccessInfo::analyzeInterleaving() which collects all potential interleave groups, and CM::setCostBasedWideningDecision() which decides which of the groups to materialize (per VF). However, this requires a fair amount of refactoring; worth a TODO?
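
As a very rough standalone mock of that layering (names follow the suggestion above and are not existing LLVM API; the real SymbolicStrides map holds Value pointers rather than ints):

#include <functional>
#include <map>

// Stand-ins for llvm::Value* keys and symbolic stride values.
using AccessId = int;
using StrideMap = std::map<AccessId, int>;

// Planning-phase pruning: keep only the strides worth versioning, dropping
// those for which the target reports a legal gather/scatter. The analysis
// itself would keep collecting every symbolic stride, target-independently.
StrideMap setVersioningForStrideDecisions(
    const StrideMap &AllStrides,
    const std::function<bool(AccessId)> &IsLegalGatherOrScatter) {
  StrideMap ToVersion;
  for (const auto &Entry : AllStrides)
    if (!IsLegalGatherOrScatter(Entry.first))
      ToVersion.insert(Entry);
  return ToVersion;
}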

llvm/lib/Analysis/LoopAccessAnalysis.cpp
2291

Separate the existing Value *Ptr = getLoadStorePointerOperand(MemAccess); if (!Ptr) return; part from the new gather/scatter consideration?

Would have been nice to reuse LV's isLegalGatherOrScatter(Value *V), or perhaps refactor it into if (TTI && TTI->isLegalGatherOrScatter(MemAccess)) return;?

Worth informing of filtered strides with LLVM_DEBUG messages.

(Can check if Ptr is already in SymbolicStrides and exit early; unrelated change.)
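
Put together, the early exit being discussed has roughly this shape (an approximation of the diff, not the committed code; TTI->isLegalGatherOrScatter is the refactored hook suggested above, and TTI is the new member this patch threads into LAA):

void LoopAccessInfo::collectStridedAccess(Value *MemAccess) {
  Value *Ptr = getLoadStorePointerOperand(MemAccess);
  if (!Ptr)
    return;

  // (Unrelated cleanup:) nothing to do if a symbolic stride was already
  // recorded for this pointer.
  if (SymbolicStrides.count(Ptr))
    return;

  // New: if the target can issue a legal gather/scatter for this access,
  // prefer that over versioning the loop on a speculative stride of 1.
  if (TTI && TTI->isLegalGatherOrScatter(MemAccess)) {
    LLVM_DEBUG(dbgs() << "LAA: Found a strided access, but the target has "
                         "legal gather/scatter; not versioning for it.\n");
    return;
  }

  // ... existing code: compute the symbolic stride and record it in
  // SymbolicStrides for replaceSymbolicStrideSCEV() to specialize later.
}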

llvm/test/Transforms/LoopVectorize/X86/optsize.ll
151–182

Would indeed be good to have an i16 version retaining the current checks (unvectorized behavior), as skx supports gathers of i32 but not i16; and also the original i32 version with checks for vectorized code.
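
For reference, the two variants being asked for correspond to source along these lines (illustrative C++, not the actual functions in optsize.ll; with the stride check filtered out, skx should vectorize the i32 version with gathers but leave the i16 version as it is today):

// i32 elements: gathers are legal on skx, so this should vectorize.
void strided_i32(int *dst, const int *src, int stride, int n) {
  for (int i = 0; i < n; i++)
    dst[i] = src[i * stride];
}

// i16 elements: no legal gather on skx, so this should retain the current
// (unvectorized) checks.
void strided_i16(short *dst, const short *src, int stride, int n) {
  for (int i = 0; i < n; i++)
    dst[i] = src[i * stride];
}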

dmgreen marked an inline comment as done. · Jan 2 2020, 3:10 AM

OK.

Longer version:

Agreed, this is the place that gathers all symbolic strides for which runtime checks are later added by replaceSymbolicStrideSCEV().

Agreed, a gather or scatter would probably be a better option than a runtime check, certainly if the stride turns out to be other than 1, in which case the runtime check option will execute the original scalar loop.
Note that if the stride does turn out to be 1, a runtime check may be faster: the cost of a vector load/store is typically less than that of a gather/scatter, disregarding the overhead of the runtime check itself. So having a way to "manually" restore original performance for such cases may be useful (in addition to EnableMemAccessVersioning). Always preferring a gather or scatter as suggested should be a good step forward, given the expected complexity of devising a cost-based preference.

Instead of teaching LAI to make such target/cost-based decisions, it would have been better to let this analysis continue to collect *all* symbolic strides potentially subject to runtime checks, and teach the planning/transform to prune/decide which strides to actually specialize; e.g., have LVP::plan() start by calling "CM.setVersioningForStrideDecisions()", analogous to InterleavedAccessInfo::analyzeInterleaving() which collects all potential interleave groups, and CM::setCostBasedWideningDecision() which decides which of the groups to materialize (per VF). However, this requires a fair amount of refactoring; worth a TODO?

Thanks for taking a look. I agree, this all feels like a layering violation to me! (I found an existing TODO in LAA saying the same thing). The refactoring does seem like a lot of work though. What can I say, I'm looking forward to a point where VPlan can make some of these decisions more structurally.

We (ARM MVE) don't have gathers/scatters enabled yet, this was just hitting the second thing I tried (matrix multiply). It may be a little while before we have them turned on by default.

llvm/test/Transforms/LoopVectorize/X86/optsize.ll
151–182

The gather version of the original 32-bit code here seems to be quite large for -Os. Are there large constant pool entries that the gather needs to access?

dmgreen marked an inline comment as done. · Jan 2 2020, 3:17 AM
dmgreen added inline comments.
llvm/lib/Analysis/LoopAccessAnalysis.cpp
2291

Yeah, my original version of this patch had isLegalGatherOrScatter passed as a functor from the LoopVectoriser through to LAA. It felt quite ugly though; it seemed a lot simpler to just pass TTI (plus I was looking at the SVE patch set, and they had also added TTI to LAA for different reasons).

But adding isLegalGatherOrScatter to TTI sounds like a better idea. That will make this cleaner.
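
A sketch of what such a TTI-level helper could look like (free-function form for illustration; the isLegalMaskedGather/Scatter signatures are approximated and may not match the hook that eventually landed):

static bool isLegalGatherOrScatter(const TargetTransformInfo &TTI, Value *V) {
  // A load is a gather candidate for its loaded type; a store is a scatter
  // candidate for its stored type. Anything else has no gather/scatter form.
  if (auto *LI = dyn_cast<LoadInst>(V))
    return TTI.isLegalMaskedGather(LI->getType(), LI->getAlign());
  if (auto *SI = dyn_cast<StoreInst>(V))
    return TTI.isLegalMaskedScatter(SI->getValueOperand()->getType(),
                                    SI->getAlign());
  return false;
}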

dmgreen updated this revision to Diff 235843. · Jan 2 2020, 3:19 AM
Ayal accepted this revision. · Jan 2 2020, 4:23 AM

This looks good to me, adding a couple of minor last nits, thanks!

llvm/include/llvm/Analysis/TargetTransformInfo.h
609

"represent \p V" >> "represent a vectorized version of \p V" ?
(This is the original comment, but the original context was LV.)

llvm/lib/Analysis/LoopAccessAnalysis.cpp
2318

May want to add a cl::opt flag to turn this filtering on/off.
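
For example (a sketch with a hypothetical flag name, assuming the file's usual using namespace llvm; the check added in collectStridedAccess would then additionally test this flag):

static cl::opt<bool> FilterStridesWithGathers(
    "laa-filter-strides-with-gathers", cl::Hidden, cl::init(true),
    cl::desc("Do not collect symbolic strides for accesses that the target "
             "can handle with a legal gather/scatter"));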

2319

Suggest to start the line with "LAA:"

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4596–4599

This is another potential customer of [TTI.]isLegalGatherOrScatter(I). After which CM's isLegalMaskedGather() and isLegalMaskedScatter() become useless and should be removed.

This also suggests a TTI.isLegalMaskedLoadOrStore(Value *), btw.

But these are independent cleanups.

llvm/test/Transforms/LoopVectorize/X86/optsize.ll
151–182

The actual size of vectorized loops under -Os does deserve further investigation, in general.

This revision is now accepted and ready to land. · Jan 2 2020, 4:23 AM
dmgreen updated this revision to Diff 236326. · Jan 6 2020, 4:46 AM
dmgreen marked 4 inline comments as done.

Updated as per the comments. I may hold off committing this until the MVE codegen is further along (it depends on the first patch, at least).

Ayal added inline comments. · Jan 7 2020, 2:34 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4599

Thanks!
nit: may look better to have instead

bool isLegalMaskedLoadOrStore = isa<LoadInst>(I) ? isLegalMaskedLoad(Ty, Ptr, Alignment)
                                                 : isLegalMaskedStore(Ty, Ptr, Alignment);
return !(isLegalMaskedLoadOrStore || TTI.isLegalGatherOrScatter(I));

(in anticipation of TTI.isLegalMaskedLoadOrStore())

dmgreen marked an inline comment as done.
dmgreen added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4599

I've put up the isLegalMaskedLoadStore patch in D72387.

Ayal added a comment. · Jan 9 2020, 8:32 AM

The LoopVectorizer/LAA has the ability to add runtime checks for memory accesses that look like they may be single stride accesses, in an attempt to still run vectorized code. This can happen in a boring matrix multiply kernel, for example:

for(int i = 0; i < n; i++) {
  for (int j = 0; j < m; j++)
  {
    int sum = 0;
    for (int k = 0; k < l; k++)
      sum += A[i*l + k] * B[k*m + j];
    C[i*m + j] = sum;
  }
}

Note that a (more boring?) matrix multiply kernel where B is a square matrix, i.e., where stride m is equal to trip count l, will not be specialized for m=1. But this general case may multiply matrix A by a single column matrix B, whose stride m is 1.

Another possible way to prevent such undesired specialization may be with a __builtin_expect/llvm.expect(m>1, 1).
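
One semantics-preserving way to write that hint at the source level (a sketch; whether the expectation metadata actually reaches LAA's stride collection is a separate question):

if (__builtin_expect(m > 1, 1)) {
  // Expected case: a genuinely strided access, not worth versioning for m == 1.
  for (int k = 0; k < l; k++)
    sum += A[i*l + k] * B[k*m + j];
} else {
  // Unexpected case: the same loop, kept only so behavior is unchanged.
  for (int k = 0; k < l; k++)
    sum += A[i*l + k] * B[k*m + j];
}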

Yeah, thanks. The square case comes up quite a lot.

My intuition is that users are unlikely to call matrix multiply when they are multiplying a matrix and a vector (they would call a mat-vec multiply instead). I may be wrong about how often that happens, though. In other cases that are not matrix multiply, a stride of 1 may be more common, but it's hard to tell without some profiling data.

We can also just disable the -enable-mem-access-versioning option here, which will suffice for our testing in the short term.

We shouldn't make this either/or. The ability to runtime check a unit stride is good, and the ability to use gather/scatter is also good. Depending on the target, I see the following situations:

  1. Don't vectorize the loop: the unit-strided code alone isn't profitable, or the loop is profitable but not by enough to cover the runtime check cost.
  2. Vectorize with a runtime check, falling back to scalar code: when gather/scatter code is deemed not profitable.
  3. Vectorize with a runtime check, falling back to gather/scatter code.
  4. Vectorize with gather/scatter.

If we aren't adding proper cost modeling, I think the default action should stay unchanged, i.e., 2), until further study allows us to collectively say another default is better. Looks like ARM wants to go with 4). If so, we need to make this a TTI check, and allow a LoopVectorize internal flag to override TTI. I don't think we immediately need the ability to do 3), but writing it down as a TODO would be nice.
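
A sketch of that "TTI decides the default, an internal flag can override it" behavior (flag and helper names are hypothetical; isLegalGatherOrScatter is the TTI hook discussed earlier in the thread):

static cl::opt<bool> ForceStrideVersioning(
    "force-stride-versioning", cl::Hidden,
    cl::desc("Override the TTI-based choice between runtime stride checks "
             "and gather/scatter"));

static bool shouldVersionForStride(const TargetTransformInfo *TTI,
                                   Value *MemAccess) {
  // An explicit flag on the command line wins over the TTI-based default.
  if (ForceStrideVersioning.getNumOccurrences() > 0)
    return ForceStrideVersioning;
  // Default: version only when the target has no legal gather/scatter to
  // fall back on for this access (option 2 above); otherwise prefer option 4.
  return !(TTI && TTI->isLegalGatherOrScatter(MemAccess));
}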

Matt added a subscriber: Matt. · Sep 2 2021, 12:06 PM