Page MenuHomePhabricator

[LoopVectorizer][SVE] Vectorize a simple loop with with a scalable VF.
ClosedPublic

Authored by sdesmalen on Nov 9 2020, 6:53 AM.

Details

Summary
  • Steps are scaled by vscale, a runtime value.
  • Changes to circumvent the cost-model for now (temporary) so that the cost-model can be implemented separately.

This vectorizes:

void loop(int N, double *a, double *b) {
  #pragma clang loop vectorize_width(4, scalable)
  for (int i = 0; i < N; i++) {
    a[i] = b[i] + 1.0;
  }
}

Diff Detail

Event Timeline

sdesmalen created this revision.Nov 9 2020, 6:53 AM
Herald added a project: Restricted Project. · View Herald Transcript
sdesmalen requested review of this revision.Nov 9 2020, 6:53 AM
sdesmalen updated this revision to Diff 304167.Nov 10 2020, 6:29 AM

Fixed broken assert and made buildScalarSteps generate the right step-by-vscale code if the type is floating point.

cameron.mcinally added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8734

Nit: Yank spelling of scalarize?

8758

Same here.

sdesmalen updated this revision to Diff 304814.Nov 12 2020, 6:09 AM
sdesmalen marked an inline comment as done.

s/scalarise/scalarize

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8734

You're right, I'm mixing up American and British spelling in this patch, thanks for spotting!

dmgreen added inline comments.Nov 16 2020, 12:00 AM
llvm/lib/IR/IRBuilder.cpp
85

This could do with a quick clang-format.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2318–2331

Can you add a comment saying this has to potentially construct a scalable step. And that otherwise it should be folded to a constant?

5660

Perhaps add FIXME:

5987

Is it worth just not automatically interleaving for the moment? It might simplify things until the cost model is doing better, at least.

6037

Is there a better way to print this?

sdesmalen updated this revision to Diff 305509.Nov 16 2020, 7:38 AM
sdesmalen marked 5 inline comments as done.
  • Addressed comments.
  • Rebased after parent revision D88962 was updated.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5987

From a functional perspective, interleaving should be no different for scalable VFs as they are for fixed-width VFs. That means that if we keep interleaving enabled for scalable vectors, we'll be testing more functionality and code-paths in the vectorizer. We're still a way off from cost-modelling this properly, so until then I'd like to suggest we keep this enabled by default mostly for testing purposes. Does that make sense?

6037

Good point. I believe this can just print VF since ElementCount defines operator<< that adds "vscale x ".

dmgreen added inline comments.Nov 23 2020, 2:51 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5987

I was hoping that if we remove the need for this, we could remove the need for the "if scalable return 1" in the cost model. That way we get the testing of the cost model, which seems like a larger issue to test to me.

Can we remove that cost model change already? Or are there too many paths that do not work yet? The simple loop you have in the commit message might work already.

sdesmalen updated this revision to Diff 307097.Nov 23 2020, 9:11 AM

Removed early return from getInstructionCost, as this shortcut wasn't actually necessary for the test.

sdesmalen marked an inline comment as done.Nov 23 2020, 9:11 AM
sdesmalen added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5987

It seems I can actually leave this in and still remove the "if scalable return 1" (at least for the unit test in this patch).

dmgreen accepted this revision.Nov 24 2020, 4:52 AM

Nice. Thanks. That alleviates my objections to the interleaving count.

This LGTM. I think it may depend on some other patches? If not perhaps leave this open for a day or two in case other have comments.

This revision is now accepted and ready to land.Nov 24 2020, 4:52 AM
sdesmalen updated this revision to Diff 310169.Dec 8 2020, 6:17 AM
sdesmalen marked an inline comment as done.

Rebased patch after D90687 and D88962 landed. This required a small change to metadata-width.ll, because the original test case tries to vectorize an induction variable, which this prototype doesn't yet support.

Allen added a subscriber: Allen.Nov 13 2021, 12:21 AM

Please help to check whether to add the following code to support the usage: -march=armv8-a+sve2+use-scalar-inc-vl

index 1e27b0f80dd1..f01e71472f51 100644
--- a/llvm/include/llvm/Support/AArch64TargetParser.def
+++ b/llvm/include/llvm/Support/AArch64TargetParser.def
@@ -128,6 +128,7 @@ AARCH64_ARCH_EXT_NAME("ls64",         AArch64::AEK_LS64,        "+ls64",  "-ls64
 AARCH64_ARCH_EXT_NAME("brbe",         AArch64::AEK_BRBE,        "+brbe",  "-brbe")
 AARCH64_ARCH_EXT_NAME("pauth",        AArch64::AEK_PAUTH,       "+pauth", "-pauth")
 AARCH64_ARCH_EXT_NAME("flagm",        AArch64::AEK_FLAGM,       "+flagm", "-flagm")
+AARCH64_ARCH_EXT_NAME("use-scalar-inc-vl", AArch64::AEK_INCVL,  "+use-scalar-inc-vl", "-use-scalar-inc-vl")
 AARCH64_ARCH_EXT_NAME("sme",          AArch64::AEK_SME,         "+sme",   "-sme")
 AARCH64_ARCH_EXT_NAME("sme-f64",      AArch64::AEK_SMEF64,      "+sme-f64", "-sme-f64")
 AARCH64_ARCH_EXT_NAME("sme-i64",      AArch64::AEK_SMEI64,      "+sme-i64", "-sme-i64")
diff --git a/llvm/include/llvm/Support/AArch64TargetParser.h b/llvm/include/llvm/Support/AArch64TargetParser.h
index 131a58412db6..23a84b965159 100644
--- a/llvm/include/llvm/Support/AArch64TargetParser.h
+++ b/llvm/include/llvm/Support/AArch64TargetParser.h
@@ -69,6 +69,7 @@ enum ArchExtKind : uint64_t {
   AEK_SME =         1ULL << 37,
   AEK_SMEF64 =      1ULL << 38,
   AEK_SMEI64 =      1ULL << 39,
+  AEK_INCVL =       1ULL << 40,
 };

Please help to check whether to add the following code to support the usage: -march=armv8-a+sve2+use-scalar-inc-vl

Hi @Allen, +use-scalar-inc-vl is already implied when specifying +sve2.

For +sve, we don't really intend for this to be a user-facing flag, but rather to have it enabled for a specific uarch if it is known to be beneficial, so that compiling for -mcpu=<cpu> will enable the merging of add/sub+cnt into inc/dec instructions automatically. This is similar to how the other tuning features are defined/used in LLVM. See for example def TuneNeoverseN1 in AArch64.td that sets the tuning features for NeoverseN1, such as FeatureFuseAES to fuse AES crypto operations. These features are not supported by the -march flag, because they're not changing the set of supported architectural features, but are rather tuning features to hint the compiler which instructions are beneficial. FeatureUseScalarIncVL is no different in that respect. The way to enable this for a CPU is to add it as a tuning feature to a CPU in AArch64.td by defining some def TuneSomeCPU : SubtargetFeature<..., [FeatureUseScalarIncVL]>.

If you explicitly want to toggle this from the command-line in combination with -march, you can pass the command directly to the Clang compiler using -Xclang -target-feature -Xclang +use-scalar-inc-vl.