This is an archive of the discontinued LLVM Phabricator instance.

[LV] Allow scalable vectorization with vscale = 1
ClosedPublic

Authored by reames on Jun 24 2022, 11:27 AM.

Details

Summary

This change is a bit subtle. If we have a type like <vscale x 1 x i64>, the vectorizer will currently reject vectorization. The reason is that a type like <1 x i64> is likely to simply be rescalarized, and the vectorizer doesn't want to be in the game of simple unrolling.

(I've given the example in terms of 1 x types which use a single register, but the same issue exists for any N x types which use N registers. e.g. RISCV LMULs.)

This change distinguishes scalable types from fixed types under the reasoning that converting to a scalable type isn't unrolling. Because the actual vscale isn't known until runtime, using a scalable type is potentially very profitable.

This makes an important, but unchecked, assumption. Specifically, the scalable type is assumed to only be legal per the cost model if there's actually a scalable register class which is distinct from the scalar domain. This is, to my knowledge, true for all targets which return non-invalid costs for scalable vector ops today, but in theory, we could have a target decide to lower scalable vectors to fixed-length vector or even scalar registers. If that ever happens, we'd need to revisit this code.

In practice, this patch unblocks scalable vectorization for ELEN types on RISCV.

Let me sketch one alternate implementation I considered. We could have restricted this to when we know a minimum value for vscale. Specifically, for the default +v extension for RISCV, we actually know that vscale >= 2 for ELEN types. However, doing it this way means we can't generate scalable vectors when using the various embedded vector extensions which have a minimum vscale of 1.

If folks don't like the unchecked assumption above, I can go ahead and add the min-vscale check here. That would at least get us the most common +v extension.

Diff Detail

Event Timeline

reames created this revision.Jun 24 2022, 11:27 AM
reames requested review of this revision.Jun 24 2022, 11:27 AM
reames edited the summary of this revision. (Show Details)Jun 24 2022, 11:31 AM

This makes sense to me.

sdesmalen accepted this revision.Jun 27 2022, 3:00 AM

Thanks for fixing this @reames. The approach seems sensible: vscale is likely to be larger than 1, so the LV shouldn't conservatively assume at this point that scalarisation will happen. If vscale could be 1 at runtime and the codegen for <vscale x 1 x eltty> is less efficient than for (scalar) eltty, then the cost-model should probably reflect that. Note that you can distinguish the vscale value to tune for with TargetTransformInfo::getVScaleForTuning(). This is different from the range of vscale values the generated code will run on (albeit possibly inefficiently), which comes from the vscale_range attribute. For Arm we've implemented this TTI function to use the information from -mcpu.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6719–6721

I think this comment can be removed. If this is ever the case, the LV shouldn't be using scalable vectors, but fixed-size vectors instead and leave it to the code-generator to choose the right register class and instructions. This is actually what we do for SVE when we compile for a specific vector-width; we vectorize using fixed-width vectors and map them to SVE registers instead of NEON.

6725

nit: unnecessary curly braces.

This revision is now accepted and ready to land.Jun 27 2022, 3:00 AM
This revision was landed with ongoing or failed builds.Jun 27 2022, 1:39 PM
This revision was automatically updated to reflect the committed changes.