This is an archive of the discontinued LLVM Phabricator instance.

[LV] Add generic scalarization support for unpredicated scalable vectors
AbandonedPublic

Authored by reames on Aug 3 2022, 2:54 PM.

Details

Summary

This change adds generic support for scalarizing scalable vector operations. Unlike fixed length vectors, we can't simply unroll a scalable vector as we don't know how long it is at compile time. However, there's nothing that prevents us from emitting the loop directly as we do know the dynamic number of elements.

For testing purposes, I have hocked this up to the uniform memory op path. This is not an optimal lowering for uniform mem ops, but it nicely demonstrates the value of having a fallback scalarization strategy available when smarter and more optimal things haven't yet been implemented.

From here, I plan on doing the following:

  • Add the support on the predicated path. This is quite a bit more involved and requires setting up VPBlocks for the CFG.
  • Generalize the definition of uniform memory op to allow internal predication. (This fundamentally requires the fully general predicated scalarization fallback, so it makes a good test to make sure we haven't missed anything.)
  • Write generic cost modeling for scalable scalarization, and start enabling other paths that we current unconditionally bail out from.
  • Implement a dedicated recipe for the uniform memory op case in the current predication due to tail folding only form. The loop form will probably be removed via LICM, but we should really stop relying on pass ordering here.

Diff Detail

Event Timeline

reames created this revision.Aug 3 2022, 2:54 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 3 2022, 2:54 PM
reames requested review of this revision.Aug 3 2022, 2:54 PM
david-arm added inline comments.Aug 11 2022, 1:02 AM
llvm/test/Transforms/LoopVectorize/AArch64/sve-inv-store.ll
28

Hi @reames, we definitely don't want to be doing this for SVE as it will likely hurt performance - scatters are still likely to be better. We made a conscious effort to avoid scalaring this way for SVE because we managed to find other ways of solving the same problems using existing instructions. Also, if we ever encountered a situation where we'd have to scalarise a scalable vector then the performance will likely to be terrible and so we choose to return an Invalid cost from the cost model and skip that VF. We may as well just use NEON or not vectorise, because this is almost certainly better.

What use cases are you trying to solve here? This patch doesn't seem to fix any actual bugs, so I'm assuming this is for performance reasons. It looks like this change only affects one test (@uniform_store_of_loop_varying) and I guess the IR in this test is not a common idiom. Have you tested performance before and after to see if this is worthwhile?

Matt added a subscriber: Matt.Aug 12 2022, 12:36 PM
reames added inline comments.Aug 15 2022, 9:12 AM
llvm/test/Transforms/LoopVectorize/AArch64/sve-inv-store.ll
28

The motivation here is primarily robustness - both correctness wise and performance.

On the correctness front, the existing code pretty fundamentally assumes that scalarization is an available strategy. The invalid bailouts for scalable has been added in, but I've already hit (and fixed) several cases where this was incomplete. From a code complexity, and likelihood of bugs perspective, having fixed and scalable have the same *capabilities* seems highly useful.

On the performance side, there's a two level argument here.

For many cases, you're entirely right that there are other strategies available. You'll note that I have a patch on review right now which implements a more optimal strategy for divides. However, we have not implemented all of them. As a result, the current situation is that these cases become *strictly* non-vectorizeable. This means that even otherwise hugely profitable to vectorize loops are prevented from vectorization. This is basically a "order of work" argument to minimize the pain while that work is happening.

However, there are also cases which we can vectorize, but fundamentally *don't* have a better strategy for. One that we could vectorize - but don't today - is a call to a readonly function in an otherwise vectorizable loop. We can't assume the existence of a vector variant of the call (since it's not a known function), but we could scalarize in general. Now, put that call down a rarely taken predicated block and you start seeing why having the *capability* of scalarization is useful.

Uniform memops are only a testing vehicle here. Nothing more. Please don't fixate on the generality of a uniform store test as that is not what this patch is really about.

For aarch64, if scatters are profitable and legal, then the code should pick that already. This is an example where the cost model believes scalarization is profitable. It didn't make it into the final patch, but I played with some pretty serious cost penalties, and this case didn't change behavior. If you have particular suggestions on how to adjust cost to penalize scalarizing "enough", I'm completely open to them.

david-arm added inline comments.Aug 15 2022, 9:19 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6870

I think you can avoid the scalarisation for SVE here simply by asking for the scalarisation cost of the instruction, similarly to how it's done elsewhere. For SVE this should return Invalid. Alternatively you could add a TTI hook to ask if target should scalarise or not, i.e.

if (!foldTailByMasking())
  return isLegalToScalarize();

We have always considered it 'illegal' to scalarise for SVE.

reames added inline comments.Aug 15 2022, 9:23 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6870

You seem to be missing the point of the code.

The costing here is done at two levels: TTI, and LV. TTI returns invalid costs for a bunch of cases which LV then turns around and decides to scalarize by reasoning about the scalarization cost and scalar op cost directly.

Doing as you suggest here completely prevents the use of the added lowering.

Adding a TTI hook is feasible, but I don't have a good name/semantic for it. Any suggestions?

reames added inline comments.Aug 15 2022, 9:25 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6870

Another possibility here - would you be okay with the scalarization being behind a flag for now, and the cost model discussion (i.e. profitability) being done in a separate review? I'm wanting to avoid too much coupling here, and when I tried various different approaches to costing, a ton of unrelated issues started falling out.

david-arm added inline comments.Aug 15 2022, 10:21 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6870

That might be an option too. I'll think about this more in the morning!

6893

It sounds like getUniformMemOpCost is not returning Invalid here for SVE. I'm not sure why, but by returning true from isLegalToScalarize it relies on the cost being sensible. When I'm back at work tomorrow morning I'll take a better look.

reames added inline comments.Aug 15 2022, 11:54 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6870

A further point for your consideration.

The scatter based lowering here has the following cost:
Found an estimated cost of 81 for VF vscale x 4 For instruction: store i16 %ld, i16* %dst, align 2

Assuming this is a correct cost for the scatter, this is blatantly unprofitable. Essentially, we've switched from one unprofitable vectorization to another. As such, I don't think this test is particularly interesting.

fhahn added inline comments.Aug 16 2022, 1:15 AM
llvm/test/Transforms/LoopVectorize/AArch64/sve-inv-store.ll
28

In general it seems like having this extra lowering strategy would be desirable with respect to guaranteeing that we will always be able to scalarize any construct, rather than crash. This is an assumption made in various places.

It should be up to the cost model to avoid scalarizing in cases where it won't be profitable, but I can't comment on this particular test case or if there are scenarios where this will profitable in general in practice.

Hi @reames, by the way I completely understand the intent of the patch and it seems like a useful addition to be able to fall back on scalarisation this way for scalable vectors.

However, just for some context on why I'm worried about this patch as it stands currently - I tried out this patch on a really simple (and obviously contrived!) loop like this:

void inv_store_i16(short* __restrict__ dst, short* __restrict__ src, long n) {
  for (long i = 0; i < n; i++) {
    *dst = src[i];
  }
}

corresponding to the test inv_store_i16 changed in this patch. On a neoverse-v1 (SVE) machine I created a micro-benchmark that iterated over this a number of times. Here are some rough results:

scalar loop: 4.9s
vector loop with SVE scatters: 4.5s
vector loop with scalarised op (using this patch): 13.7s

That strongly suggests that the cost model is broken now because we're now choosing the worst cost-based widening decision. At the moment the patch regresses performance for SVE. In another comment I've suggested two possible ways forward and I'd be happy with either. Of course, I'm happy to listen to other ideas too!

I'll try to review the technical changes in the patch properly soon ...

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6870

Given the comment

// If not predicated, we can now scalarize generically with a loop

then this looks wrong here. I think we should be doing:

if (!blockNeedsPredicationForAnyReason())

instead of

if (!foldTailByMasking())

so that we also don't scalarise loops with conditional uniform memory operations. Or if the scalarisation code does support conditional ops, then perhaps the comment needs updating?

6893

I'm now convinced that in the current form the patch is wrong because getUniformMemOpCost assumes that for stores we will perform a normal vector store, i.e.:

StoreInst *SI = cast<StoreInst>(I);

bool isLoopInvariantStoreValue = Legal->isUniform(SI->getValueOperand());
return TTI.getAddressComputationCost(ValTy) +
       TTI.getMemoryOpCost(Instruction::Store, ValTy, Alignment, AS,
                           CostKind) +
       (isLoopInvariantStoreValue
            ? 0
            : TTI.getVectorInstrCost(Instruction::ExtractElement, VectorTy,
                                     VF.getKnownMinValue() - 1));

whereas we're actually scalarising in some cases through generation of an inner vector loop. I can think of two ways forward here:

  1. Introduce this feature under control of a flag, which is off by default. Then in a later patch fix up the costs to properly calculate the scalarisation cost, which includes the loop overhead (pre-header, phi nodes, IV cmp+branch, etc.).
  2. Fix the cost as part of this patch.

As you say, provided the scalarisation cost is fairly sensible the vectoriser should make the right decision and choose the best widening decision. But right now we're making decisions based on what look like very incorrect costs.

llvm/lib/Transforms/Vectorize/VPlan.cpp
800

nit: Stray comment.

reames planned changes to this revision.Aug 17 2022, 8:07 AM

Marking as Plan Changes as @david-arm has convinced me that the cost model on this needs significantly reworked. Going to think about this a bit, and probably split this into a couple pieces.

reames abandoned this revision.Sep 7 2022, 12:37 PM

Abandoning this. I still think having the LV be able to scalarize is a worthwhile fallback, but given this has run into significant resistance in review, and my main motivating cases have now all been handled in scalable vectorization without fallback, I'm going to discontinue working on this.