This is an archive of the discontinued LLVM Phabricator instance.

Paths

llvm/include/llvm/Analysis/TargetTransformInfo.h
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
llvm/lib/Analysis/TargetTransformInfo.cpp
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
llvm/test/Transforms/LoopVectorize/AArch64/extend-vectorization-factor-for-unprofitable-memops.ll
llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
llvm/test/Transforms/LoopVectorize/AArch64/reduction-small-size.ll
llvm/test/Transforms/LoopVectorize/AArch64/scalable-vectorization-cost-tuning.ll
llvm/test/Transforms/LoopVectorize/AArch64/scalable-vectorization.ll
llvm/test/Transforms/LoopVectorize/AArch64/sve-illegal-type.ll

Differential D118979

[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth
ClosedPublic

Authored by jaykang10 on Feb 4 2022, 2:20 AM.

Details

Reviewers

dmgreen
fhahn
efriedma
sdesmalen
paulwalker-arm

Commits

rGbb82f746129f: Revert "Revert "[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth""
rG64b6192e8129: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth

Summary

Set the maximum VF on AArch64 to 128 / (size in bits of the smallest type in the loop).
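
As a rough illustration of the rule in the summary, the sketch below computes an upper bound on the fixed-width VF from the register width and the smallest element type in the loop. The names `floorPowerOf2` and `maxBandwidthVF`, and the rounding down to a power of two, are assumptions made for this example; this is not code from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for this sketch: largest power of two <= X (0 for X == 0).
inline uint32_t floorPowerOf2(uint32_t X) {
  if (X == 0)
    return 0;
  uint32_t P = 1;
  while (P <= X / 2)
    P *= 2;
  return P;
}

// Upper bound on the VF when maximizing bandwidth: the register width divided
// by the width of the smallest type in the loop (both in bits).
inline uint32_t maxBandwidthVF(uint32_t RegisterBits, uint32_t SmallestTypeBits) {
  assert(SmallestTypeBits != 0 && "type width must be non-zero");
  return floorPowerOf2(RegisterBits / SmallestTypeBits);
}
```

Under these assumptions, a 128-bit NEON register and an i8 smallest type give a bound of 16, while an i32 smallest type gives 4.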

The performance improvement from benchmarks is as below.

SPEC2017
Benchmark        Improvement (%)
500.perlbench_r  -0.44372
502.gcc_r         0.11339
505.mcf_r        -0.36421
520.omnetpp_r    -0.12037
523.xalancbmk_r  -0.55858
525.x264_r        0.390159
531.deepsjeng_r  -0.02378
541.leela_r      -0.01357
548.exchange2_r  -0.00043
557.xz_r         -0.17387

Overall improvement (%) on an internal benchmark: 0.238949

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jaykang10 created this revision. Feb 4 2022, 2:20 AM

Herald added subscribers: hiraditya, kristof.beyls. Feb 4 2022, 2:20 AM

jaykang10 requested review of this revision. Feb 4 2022, 2:20 AM

Herald added a project: Restricted Project. Feb 4 2022, 2:20 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B147583: Diff 405895. Feb 4 2022, 3:14 AM

Hello. This option makes a lot of sense to me, for the way AArch64 lowers vectors with extensions in them. I have a downstream set of benchmarks which, whilst not amazing, do show changes in vectorization quite well. There are some great improvements, but some things are not looking so healthy. I will send you some details. We may need to work through improving some of them, either by fixing the costs or improving the codegen for larger than legal types.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
139	It's generally best if fixed length vectorization doesn't start behaving differently just because SVE is available (unless it can be better, of course). If we expect MaximizeVectorBandwidth to be better, but it doesn't work well for scalable vectors, can we just disable widening for the scalable VFs?

In D118979#3300227, @dmgreen wrote:

Hello. This option makes a lot of sense to me, for the way AArch64 lowers vectors with extensions in them. I have a downstream set of benchmarks which, whilst not amazing, do show changes in vectorization quite well. There are some great improvements, but some things are not looking so healthy. I will send you some details. We may need to work through improving some of them, either by fixing the costs or improving the codegen for larger than legal types.

Thanks for the comment, @dmgreen!

I agree with you. We need to fix the performance regressions caused by this max VF.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
139	I think it would be good to have some comments from the SVE people... If possible, could you add them as reviewers? I do not know who works on it...

dmgreen added reviewers: sdesmalen, paulwalker-arm. Feb 7 2022, 3:51 AM

paulwalker-arm added inline comments. Feb 8 2022, 3:10 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
139	I agree with Dave, this decision is distinct from whether SVE is available or not. As well as this affecting the NEON side of things there are circumstances where SVE is also used for fixed length vectorisation. Perhaps this function should be changed to take a `TargetTransformInfo::RegisterKind` much like getRegisterBitWidth?

jaykang10 added inline comments. Feb 9 2022, 2:35 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
139	Thanks for the comment, @paulwalker-arm! Let me update the function to take a `TargetTransformInfo::RegisterKind`.

@paulwalker-arm As you can see, there are regression tests which fail to generate scalable vector types with shouldMaximizeVectorBandwidth.
Previously, I saw these regression tests fail because of the cost, so I returned false for SVE in shouldMaximizeVectorBandwidth...
I am not sure whether it is acceptable for LV to select VF 16 instead of a scalable vector type on SVE...
If it is not acceptable, we may need to tune the cost for SVE, and I would suggest using shouldMaximizeVectorBandwidth only for NEON until the cost tuning is finished...

Harbormaster completed remote builds in B148451: Diff 407116. Feb 9 2022, 5:07 AM

The performance improvement from benchmarks is as below.

Are the numbers SPEC scores and negative Improvement values regressions? If that's the case, it seems like for the benchmarks you shared the impact is negative overall?

I'm missing a bit of rationale for this change. There is an interplay between having a wider VF or having a larger interleave factor. For 128bit vectors, an add <4 x i64> %x, %y will be legalized into two adds. Conceptually this is similar to vectorizing with <2 x i64> and having an interleave-factor of 2. I can imagine that interleaving in the loop-vectorizer leads to better code, because it avoids issues around type legalisation and may provide more opportunities for other IR passes to optimize the IR or move things around. If we always choose a wider VF I wonder if that may lead to poorer codegen because of type-legalization.

Is there a specific example where it's clearly an improvement to have a wider VF? And would choosing a larger unroll-factor help those cases?
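
To make the VF versus interleave-factor comparison concrete, here is a small sketch that counts how many legal vector operations one IR-level operation turns into per loop iteration. The helper names `legalParts` and `opsPerIteration` are made up for this illustration, and the model assumes 128-bit registers and simple splitting during type legalization; real legalization and cost modelling are more involved.

```cpp
#include <cassert>
#include <cstdint>

// Number of legal registers a <VF x iN> value is split into, assuming
// RegisterBits-wide vector registers (128 for NEON). Illustrative only.
inline uint32_t legalParts(uint32_t VF, uint32_t ElementBits,
                           uint32_t RegisterBits = 128) {
  uint32_t TotalBits = VF * ElementBits;
  return (TotalBits + RegisterBits - 1) / RegisterBits; // round up
}

// Vector operations emitted per loop iteration for one IR-level operation:
// each of the InterleaveCount copies is split into legalParts() pieces.
inline uint32_t opsPerIteration(uint32_t VF, uint32_t InterleaveCount,
                                uint32_t ElementBits) {
  return InterleaveCount * legalParts(VF, ElementBits);
}
```

In this model, VF 4 with interleave count 1 and VF 2 with interleave count 2 both yield two 128-bit adds per iteration for i64 elements, which is the equivalence the comment describes.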

In D118979#3308289, @sdesmalen wrote:

I'm missing a bit of rationale for this change. There is an interplay between having a wider VF or having a larger interleave factor. For 128bit vectors, an add <4 x i64> %x, %y will be legalized into two adds. Conceptually this is similar to vectorizing with <2 x i64> and having an interleave-factor of 2. I can imagine that interleaving in the loop-vectorizer leads to better code, because it avoids issues around type legalisation and may provide more opportunities for other IR passes to optimize the IR or move things around. If we always choose a wider VF I wonder if that may lead to poorer codegen because of type-legalization.

Is there a specific example where it's clearly an improvement to have a wider VF? And would choosing a larger unroll-factor help those cases?

One case where choosing a wider VF can be beneficial is loops with memory operations on types of different widths, where the memory operations on the narrow type are not legal for the VF based on the widest type. This reminded me of an oldish outstanding patch that focuses on exactly that case: D96522. Unless there are other cases where maximizing the VF is clearly beneficial, iterating on D96522 might be an alternative.

Yeah. I had https://godbolt.org/z/3qWoY769v as an example, where it can make better use of the umull2 instructions, because of the wider vector loads as a single operation. That's what makes it beneficial for AArch64.

It sounds like D96522 might be a more limited way of getting the same result we want from this? That might be a good way forward. There were a number of regressions we would have to work through with this patch in the benchmarks I have access to, cases where either the codegen or the costmodelling need to be tweaked. (The only one I have so far is from some bad smull generation: https://godbolt.org/z/enox4ojhf. I still need to look through the rest).

In D118979#3307560, @fhahn wrote:

The performance improvement from benchmarks is as below.

Are the numbers SPEC scores and negative Improvement values regressions? If that's the case, it seems like for the benchmarks you shared the impact is negative overall?

Thanks for the comments, @fhahn!

The scores include a bit of noise... I am sorry for that...

505.mcf_r, 531.deepsjeng_r, 548.exchange2_r and 557.xz_r are not affected by this patch, but the N1 machine gave slightly different scores... The rest of the SPEC2017 benchmarks have loops which are affected by this patch.

What I wanted to say is that this patch does not have an overall negative impact on SPEC2017.

In D118979#3308289, @sdesmalen wrote:

I'm missing a bit of rationale for this change. There is an interplay between having a wider VF or having a larger interleave factor. For 128bit vectors, an add <4 x i64> %x, %y will be legalized into two adds. Conceptually this is similar to vectorizing with <2 x i64> and having an interleave-factor of 2. I can imagine that interleaving in the loop-vectorizer leads to better code, because it avoids issues around type legalisation and may provide more opportunities for other IR passes to optimize the IR or move things around. If we always choose a wider VF I wonder if that may lead to poorer codegen because of type-legalization.

Is there a specific example where it's clearly an improvement to have a wider VF? And would choosing a larger unroll-factor help those cases?

Thanks for the comment, @sdesmalen!

Let's look at a code snippet.

int test(int start, int size, char *src, char *dst) {
  int res = 0;
  for (int i = start; i < size; ++i) {
    res += *dst ^ *src;
    dst++;
    src++;
  }

  return res;
}

The assembly output of the vectorized loop is as below.

without this patch  --> VF 4 is selected.
.LBB0_5:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
	ldp	s3, s4, [x12, #-4]
	ldp	s5, s6, [x8, #-4]
	add	x8, x8, #8
	add	x12, x12, #8
	subs	x13, x13, #8
	ushll	v3.8h, v3.8b, #0
	ushll	v4.8h, v4.8b, #0
	ushll	v5.8h, v5.8b, #0
	ushll	v6.8h, v6.8b, #0
	eor	v3.8b, v5.8b, v3.8b
	eor	v4.8b, v6.8b, v4.8b
	ushll	v3.4s, v3.4h, #0
	ushll	v4.4s, v4.4h, #0
	and	v3.16b, v3.16b, v1.16b
	and	v4.16b, v4.16b, v1.16b
	add	v0.4s, v0.4s, v3.4s
	add	v2.4s, v2.4s, v4.4s
	b.ne	.LBB0_5

with this patch  --> VF 16 is selected
.LBB0_5:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
	ldp	q16, q18, [x12, #-16]
	add	x12, x12, #32
	subs	x13, x13, #32
	ldp	q17, q19, [x8, #-16]
	add	x8, x8, #32
	eor	v16.16b, v17.16b, v16.16b
	eor	v17.16b, v19.16b, v18.16b
	ushll2	v18.8h, v16.16b, #0
	ushll	v16.8h, v16.8b, #0
	ushll	v19.8h, v17.8b, #0
	ushll2	v17.8h, v17.16b, #0
	uaddw2	v2.4s, v2.4s, v18.8h
	uaddw	v1.4s, v1.4s, v18.4h
	uaddw2	v3.4s, v3.4s, v16.8h
	uaddw	v0.4s, v0.4s, v16.4h
	uaddw2	v6.4s, v6.4s, v17.8h
	uaddw	v5.4s, v5.4s, v17.4h
	uaddw2	v7.4s, v7.4s, v19.8h
	uaddw	v4.4s, v4.4s, v19.4h
	b.ne	.LBB0_5

We can see the uaddw instructions in the output with VF=16. AArch64 has the pattern definition below, which is selected for uaddw.

multiclass SIMDWideThreeVectorBHS<bit U, bits<4> opc, string asm,
                                  SDPatternOperator OpNode> {
...
  def v4i16_v4i32  : BaseSIMDDifferentThreeVector<U, 0b010, opc,
                                                  V128, V128, V64,
                                                  asm, ".4s", ".4s", ".4h",
       [(set (v4i32 V128:$Rd), (OpNode (v4i32 V128:$Rn), (v4i16 V64:$Rm)))]>;
  def v8i16_v4i32  : BaseSIMDDifferentThreeVector<U, 0b011, opc,
                                                  V128, V128, V128, 
                                                  asm#"2", ".4s", ".4s", ".8h",
       [(set (v4i32 V128:$Rd), (OpNode (v4i32 V128:$Rn),
                                       (extract_high_v8i16 V128:$Rm)))]>;
...
defm UADDW   : SIMDWideThreeVectorBHS<1, 0b0001, "uaddw",
                 BinOpFrag<(add node:$LHS, (zanyext node:$RHS))>>;
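
For readers less familiar with these instructions, the following is a scalar model of what the `.4s`/`.4h` and `.4s`/`.8h` forms of uaddw and uaddw2 compute. It is a sketch written for this note, not LLVM code; the type aliases and function names are chosen only for the example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V4i32 = std::array<uint32_t, 4>;
using V8i16 = std::array<uint16_t, 8>;

// uaddw Vd.4s, Vn.4s, Vm.4h: zero-extend the low four halfwords of Vm and
// add them lane by lane to the 32-bit accumulator.
inline V4i32 uaddw(V4i32 Acc, const V8i16 &Src) {
  for (int I = 0; I < 4; ++I)
    Acc[I] += Src[I];
  return Acc;
}

// uaddw2 Vd.4s, Vn.4s, Vm.8h: the same, but using the high four halfwords.
inline V4i32 uaddw2(V4i32 Acc, const V8i16 &Src) {
  for (int I = 0; I < 4; ++I)
    Acc[I] += Src[I + 4];
  return Acc;
}
```

The point of the pattern is that the zero-extension and the add are folded into one instruction, so the separate widening step before the accumulate disappears.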

Given the number of instructions, we could ideally expect the loop to handle almost 4 times more data per iteration.
As @dmgreen mentioned, we are seeing some performance degradations. In Dave's case, it looks like LV generates shuffle vectors, which blocks lowering the MUL to SMULL. In another case, if LV detects an interleaved group, it generates shuffle vectors with a large number of elements, and these shuffle vectors cause lots of mov instructions. There may be more cases of performance degradation, but they could show us more opportunities to get better performance. That's what I want from this patch...
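
The "almost 4 times" figure can be sanity-checked from the two listings above: the VF 4 loop body has 18 instructions and advances by 8 elements per trip (`subs x13, x13, #8`), while the VF 16 loop body has 20 instructions and advances by 32. The sketch below only does that arithmetic; it treats every instruction as equal cost, which is a simplifying assumption and not a performance model.

```cpp
#include <cassert>

// Elements processed per instruction for one trip of a vector loop body.
inline double elementsPerInstruction(int ElementsPerIteration,
                                     int InstructionsPerIteration) {
  return static_cast<double>(ElementsPerIteration) / InstructionsPerIteration;
}

// Counts read off the two listings: 8 elements / 18 instructions for VF 4,
// 32 elements / 20 instructions for VF 16.
inline double densityRatio() {
  return elementsPerInstruction(32, 20) / elementsPerInstruction(8, 18);
}
```

By this crude measure the VF 16 loop does about 3.6 times as much work per instruction.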

So I'm wondering why we don't just control the functionality with a default off command line flag. That way it's available for testing, including unit-tests, until such a point where code generation is at a point where it makes sense to default it to on. Is this a terrible idea?

In D118979#3310489, @paulwalker-arm wrote:

So I'm wondering why we don't just control the functionality with a default off command line flag. That way it's available for testing, including unit-tests, until such a point where code generation is at a point where it makes sense to default it to on. Is this a terrible idea?

There is an option -vectorizer-maximize-bandwidth to control the functionality in LV.

static cl::opt<bool> MaximizeBandwidth(
    "vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
    cl::desc("Maximize bandwidth when selecting vectorization factor which "
             "will be determined by the smallest type in loop."));

In D118979#3310625, @jaykang10 wrote:
In D118979#3310489, @paulwalker-arm wrote:

So I'm wondering why we don't just control the functionality with a default off command line flag. That way it's available for testing, including unit-tests, until such a point where code generation is at a point where it makes sense to default it to on. Is this a terrible idea?

There is an option -vectorizer-maximize-bandwidth to control the functionality in LV.
static cl::opt<bool> MaximizeBandwidth(
    "vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
    cl::desc("Maximize bandwidth when selecting vectorization factor which "
             "will be determined by the smallest type in loop."));

Oh sure, but that affects all targets. Whereas I thought here we're talking about finding a migration path to enable it by default for AArch64 only.

In D118979#3310637, @paulwalker-arm wrote:
In D118979#3310625, @jaykang10 wrote:
In D118979#3310489, @paulwalker-arm wrote:

So I'm wondering why we don't just control the functionality with a default off command line flag. That way it's available for testing, including unit-tests, until such a point where code generation is at a point where it makes sense to default it to on. Is this a terrible idea?

There is an option -vectorizer-maximize-bandwidth to control the functionality in LV.
static cl::opt<bool> MaximizeBandwidth(
    "vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
    cl::desc("Maximize bandwidth when selecting vectorization factor which "
             "will be determined by the smallest type in loop."));
Oh sure, but that affects all targets. Whereas I thought here we're talking about finding a migration path to enable it by default for AArch64 only.

We could add a similar option to AArch64TargetTransformInfo.cpp and use it in AArch64's shouldMaximizeVectorBandwidth.

Following the comment from @paulwalker-arm, an option has been added.

Harbormaster completed remote builds in B148718: Diff 407489. Feb 10 2022, 5:43 AM

dmgreen mentioned this in D119887: [AArch64] Common patterns between UMULL and int_aarch64_neon_umull. Feb 17 2022, 12:15 AM

dmgreen mentioned this in D120018: [AArch64] Alter mull shuffle(ext(..)) combine to work on buildvectors. Feb 17 2022, 1:09 AM

dmgreen mentioned this in D119469: [AArch64] Turn truncating buildvectors into truncates.

Matt added a subscriber: Matt. Mar 17 2022, 5:56 PM

Herald added a project: Restricted Project. Mar 17 2022, 5:56 PM

dmgreen mentioned this in D121788: [AArch64] Increase MaxInterleaveFactor to 4. Mar 22 2022, 4:19 AM

dmgreen added a child revision: D120215: [LV] Invalidate widening decisions after maximizing vector bandwidth. Mar 22 2022, 9:51 AM

I think this option makes a lot of sense - and we have cleaned up a lot of places where it was causing issues. This can change a lot of performance, but it seems to be pretty good in my experiments. More is likely to come up, but I think we are close to ready to go so long as we keep an eye on the performance.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
40	There is a vectorizer-maximize-bandwidth option in the vectorizer that can override the target option for shouldMaximizeVectorBandwidth. I don't think adding an AArch64 option is necessary; can you remove it?
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5219	K is a bit of a short name. Perhaps use RegKind or something like it?
llvm/test/Transforms/LoopVectorize/AArch64/sve-illegal-type.ll
90	This is worrying - should it be vectorizing 64x for an i1 type? (And are there a lot of other extracts now?)

paulwalker-arm added inline comments. Apr 4 2022, 3:44 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
40	I agree. My original ask was because I thought there were concerns about enabling this by default. Given the flag still defaults to on and it seems we're happy to make this change for AArch64 I retract my previous ask.

jaykang10 added inline comments. Apr 4 2022, 6:14 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
40	Yep, let me remove this option.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5219	Yep, let me change it.
llvm/test/Transforms/LoopVectorize/AArch64/sve-illegal-type.ll
90	When I checked it, it looked like the DAG combiner combines the 64 i1 extract_vector_elt and store nodes into one 64-bit store node. Let me check it again.

jaykang10 added inline comments. Apr 4 2022, 6:32 AM

llvm/test/Transforms/LoopVectorize/AArch64/sve-illegal-type.ll
90	um... in this test, `%dst` is passed as a parameter, so it does not change in the loop. Therefore, only the last element of the <64 x i1> vector needs to be stored. It looks like the DAG combiner catches this and optimizes the nodes well. The assembly output of the `vector.body` block from llc is shown below. It looks ok.

.LBB0_3:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
	dup	v2.2d, x12
	add	x12, x12, #512
	subs	x11, x11, #64
	add	v2.2d, v2.2d, v0.2d
	cmeq	v2.2d, v2.2d, v1.2d
	xtn2	v2.4s, v2.2d
	xtn2	v2.8h, v2.4s
	xtn	v2.8b, v2.8h
	umov	w13, v2.b[7]
	and	w13, w13, #0x1
	strb	w13, [x0]
	b.ne	.LBB0_3

Updated the code following the comments from @dmgreen.

Harbormaster completed remote builds in B157724: Diff 420164. Apr 4 2022, 7:13 AM

Thanks for the update. This LGTM.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
40	Yeah - I think the vectorizer-maximize-bandwidth option worked differently back when the comment was suggested too.
llvm/test/Transforms/LoopVectorize/AArch64/sve-illegal-type.ll
90	Ah I see, that makes sense that it would pick the higher factor then. It looks like if the address is varying it does not vectorize.

This revision is now accepted and ready to land. Apr 5 2022, 2:30 AM

fhahn added inline comments. Apr 5 2022, 2:57 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
940–941	Document `K`?
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
56	simpler to just have `return K == TargetTransformInfo::RGK_FixedWidthVector;`? Possibly with an assert that `K` is not `RGK_Scalar`.
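
The suggested simplification could look roughly like the sketch below. To keep it compilable on its own, it uses a local stand-in for `TargetTransformInfo::RegisterKind` and a free function instead of the AArch64 TTI member; both are assumptions for illustration, and this is not necessarily the code that was committed.

```cpp
#include <cassert>

// Stand-in for TargetTransformInfo::RegisterKind so the sketch compiles on
// its own; the real enum lives in llvm/Analysis/TargetTransformInfo.h.
enum RegisterKind { RGK_Scalar, RGK_FixedWidthVector, RGK_ScalableVector };

// Sketch of the simplified hook: maximize bandwidth only for fixed-width
// vectors. Asking about scalar registers is treated as a caller bug.
inline bool shouldMaximizeVectorBandwidth(RegisterKind K) {
  assert(K != RGK_Scalar && "unexpected scalar register kind");
  return K == RGK_FixedWidthVector;
}
```

Written this way, scalable VFs are left alone while fixed-width VFs are widened, regardless of whether SVE is available.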

jaykang10 added inline comments. Apr 5 2022, 3:26 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
940–941	Yep, let me add it.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
56	Yep, let me update it.

Updated the patch following the comments from @fhahn.

This revision was landed with ongoing or failed builds. Apr 5 2022, 5:19 AM

Closed by commit rG64b6192e8129: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth (authored by jaykang10).

This revision was automatically updated to reflect the committed changes.

jaykang10 added a commit: rG64b6192e8129: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.

Harbormaster completed remote builds in B157935: Diff 420448. Apr 5 2022, 5:43 AM

omjavaid added a reverting change: rG42ebfa826947: Revert "[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth". Apr 12 2022, 4:53 PM

This broke LLVM AArch64 buildbot clang-aarch64-sve-vls-2stage
https://lab.llvm.org/buildbot/#/builders/176/builds/1515
llvm-tblgen crashes after applying this patch.

I have reverted it for now. Kindly have a look.

omjavaid reopened this revision. Apr 12 2022, 4:54 PM

This revision is now accepted and ready to land. Apr 12 2022, 4:54 PM

In D118979#3447038, @omjavaid wrote:

This broke LLVM AArch64 buildbot clang-aarch64-sve-vls-2stage
https://lab.llvm.org/buildbot/#/builders/176/builds/1515
llvm-tblgen crashes after applying this patch.

I have reverted it for now. Kindly have a look.

Thanks for reverting the commit. I did not check the stage2 build on SVE...
If possible, could you share the SVE VLS environment so I can reproduce the stage2 build error?

um... I have tried to reproduce it on an a64fx machine. It looks like it fails with a large number of parallel ninja build tasks. When I run the failed command manually, it passes... If I reduce the number of parallel tasks, e.g. with ninja -j16, the build also passes... It could be a memory allocation error caused by larger memory allocations than before...
The stage2 build and check-all pass on a Cortex-A72 machine...
To be safe, for now, I would like to disable shouldMaximizeVectorBandwidth for SVE...
Let me update this patch with it.

In D118979#3448556, @jaykang10 wrote:

um... I have tried to reproduce it on an a64fx machine. It looks like it fails with a large number of parallel ninja build tasks. When I run the failed command manually, it passes... If I reduce the number of parallel tasks, e.g. with ninja -j16, the build also passes... It could be a memory allocation error caused by larger memory allocations than before...
The stage2 build and check-all pass on a Cortex-A72 machine...
To be safe, for now, I would like to disable shouldMaximizeVectorBandwidth for SVE...
Let me update this patch with it.

I think it would be best to figure out if there's actually an issue/miscompile before re-landing. As is, the patch should only enable shouldMaximizeVectorBandwidth already, right?

In D118979#3448560, @fhahn wrote:

In D118979#3448556, @jaykang10 wrote:

um... I have tried to reproduce it on an a64fx machine. It looks like it fails with a large number of parallel ninja build tasks. When I run the failed command manually, it passes... If I reduce the number of parallel tasks, e.g. with ninja -j16, the build also passes... It could be a memory allocation error caused by larger memory allocations than before...
The stage2 build and check-all pass on a Cortex-A72 machine...
To be safe, for now, I would like to disable shouldMaximizeVectorBandwidth for SVE...
Let me update this patch with it.

I think it would be best to figure out if there's actually an issue/miscompile before re-landing. As is, the patch should only enable shouldMaximizeVectorBandwidth already, right?

um... To be honest, I am not an expert on the SVE architecture. Initially, I aimed to enable it only for NEON.
@paulwalker-arm If possible, could you help me investigate the build error from the clang-aarch64-sve-vls-2stage buildbot?

In D118979#3448556, @jaykang10 wrote:

um... I have tried to reproduce it on an a64fx machine. It looks like it fails with a large number of parallel ninja build tasks. When I run the failed command manually, it passes... If I reduce the number of parallel tasks, e.g. with ninja -j16, the build also passes... It could be a memory allocation error caused by larger memory allocations than before...
The stage2 build and check-all pass on a Cortex-A72 machine...
To be safe, for now, I would like to disable shouldMaximizeVectorBandwidth for SVE...
Let me update this patch with it.

I had a completely different experience; here are the steps to reproduce:

Stage 1:

CC=$(pwd)/clang+llvm-13.0.1-aarch64-linux-gnu/bin/clang CXX=$(pwd)/clang+llvm-13.0.1-aarch64-linux-gnu/bin/clang++
cmake -G Ninja ../llvm/llvm \
	-DCMAKE_BUILD_TYPE=Release \
	-DLLVM_ENABLE_ASSERTIONS=True \
	'-DLLVM_LIT_ARGS='"'"'-v'"'"'' \
	-DCMAKE_INSTALL_PREFIX=../stage$1.install \
	-DCMAKE_C_COMPILER=$CC \
	-DCMAKE_CXX_COMPILER=$CXX \
	'-DCMAKE_C_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
	'-DCMAKE_CXX_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
       	-DLLVM_ENABLE_LLD=True \
	'-DLLVM_LIT_ARGS='"'"'-v -j12'"'"'' \
	'-DLLVM_ENABLE_PROJECTS=llvm;mlir;clang-tools-extra;compiler-rt;clang;lld;flang'
ninja

Stage 2:

CC=$(pwd)/stage1/bin/clang CXX=$(pwd)/stage1/bin/clang++
cmake -G Ninja ../llvm/llvm \
	-DCMAKE_BUILD_TYPE=Release \
	-DLLVM_ENABLE_ASSERTIONS=True \
	'-DLLVM_LIT_ARGS='"'"'-v'"'"'' \
	-DCMAKE_INSTALL_PREFIX=../stage$1.install \
	-DCMAKE_C_COMPILER=$CC \
	-DCMAKE_CXX_COMPILER=$CXX \
	'-DCMAKE_C_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
	'-DCMAKE_CXX_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
       	-DLLVM_ENABLE_LLD=True \
	'-DLLVM_LIT_ARGS='"'"'-v -j12'"'"'' \
	'-DLLVM_ENABLE_PROJECTS=llvm;mlir;clang-tools-extra;compiler-rt;clang;lld;flang'

ninja llvm-tblgen

/home/omair.javaid/work/llvm-test/stage2/bin/llvm-tblgen -gen-dag-isel -I /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target/PowerPC -I/home/omair.javaid/work/llvm-test/stage2/include -I/home/omair.javaid/work/llvm-test/llvm/llvm/include -I /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target -omit-comments /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target/PowerPC/PPC.td --write-if-changed -o lib/Target/PowerPC/PPCGenDAGISel.inc -d lib/Target/PowerPC/PPCGenDAGISel.inc.d

In D118979#3450111, @omjavaid wrote:

In D118979#3448556, @jaykang10 wrote:

um... I have tried to reproduce it on an a64fx machine. It looks like it fails with a large number of parallel ninja build tasks. When I run the failed command manually, it passes... If I reduce the number of parallel tasks, e.g. with ninja -j16, the build also passes... It could be a memory allocation error caused by larger memory allocations than before...
The stage2 build and check-all pass on a Cortex-A72 machine...
To be safe, for now, I would like to disable shouldMaximizeVectorBandwidth for SVE...
Let me update this patch with it.

I had a completely different experience; here are the steps to reproduce:

Stage 1:

CC=$(pwd)/clang+llvm-13.0.1-aarch64-linux-gnu/bin/clang CXX=$(pwd)/clang+llvm-13.0.1-aarch64-linux-gnu/bin/clang++
cmake -G Ninja ../llvm/llvm \
	-DCMAKE_BUILD_TYPE=Release \
	-DLLVM_ENABLE_ASSERTIONS=True \
	'-DLLVM_LIT_ARGS='"'"'-v'"'"'' \
	-DCMAKE_INSTALL_PREFIX=../stage$1.install \
	-DCMAKE_C_COMPILER=$CC \
	-DCMAKE_CXX_COMPILER=$CXX \
	'-DCMAKE_C_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
	'-DCMAKE_CXX_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
       	-DLLVM_ENABLE_LLD=True \
	'-DLLVM_LIT_ARGS='"'"'-v -j12'"'"'' \
	'-DLLVM_ENABLE_PROJECTS=llvm;mlir;clang-tools-extra;compiler-rt;clang;lld;flang'
ninja

Stage 2:

CC=$(pwd)/stage1/bin/clang CXX=$(pwd)/stage1/bin/clang++
cmake -G Ninja ../llvm/llvm \
	-DCMAKE_BUILD_TYPE=Release \
	-DLLVM_ENABLE_ASSERTIONS=True \
	'-DLLVM_LIT_ARGS='"'"'-v'"'"'' \
	-DCMAKE_INSTALL_PREFIX=../stage$1.install \
	-DCMAKE_C_COMPILER=$CC \
	-DCMAKE_CXX_COMPILER=$CXX \
	'-DCMAKE_C_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
	'-DCMAKE_CXX_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
       	-DLLVM_ENABLE_LLD=True \
	'-DLLVM_LIT_ARGS='"'"'-v -j12'"'"'' \
	'-DLLVM_ENABLE_PROJECTS=llvm;mlir;clang-tools-extra;compiler-rt;clang;lld;flang'

ninja llvm-tblgen

/home/omair.javaid/work/llvm-test/stage2/bin/llvm-tblgen -gen-dag-isel -I /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target/PowerPC -I/home/omair.javaid/work/llvm-test/stage2/include -I/home/omair.javaid/work/llvm-test/llvm/llvm/include -I /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target -omit-comments /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target/PowerPC/PPC.td --write-if-changed -o lib/Target/PowerPC/PPCGenDAGISel.inc -d lib/Target/PowerPC/PPCGenDAGISel.inc.d

um... I also used the same cmake options as the buildbot's stage 1 and stage 2 on an a64fx machine.

stage1
cmake -G Ninja ../llvm -DCMAKE_C_COMPILER=/home/jinkan01/Projects/llvm-project/build-clang-13.0.1/bin/clang -DCMAKE_CXX_COMPILER=/home/jinkan01/Projects/llvm-project/build-clang-13.0.1/bin/clang++ -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=True '-DLLVM_LIT_ARGS='"'"'-v'"'"'' -DCMAKE_INSTALL_PREFIX=../stage1.install '-DCMAKE_C_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' '-DCMAKE_CXX_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' -DLLVM_ENABLE_LLD=True '-DLLVM_LIT_ARGS='"'"'-v -j12'"'"'' '-DLLVM_ENABLE_PROJECTS=llvm;mlir;clang-tools-extra;compiler-rt;clang;lld;flang'

ninja install

stage 2
cmake -G Ninja ../llvm -DCMAKE_C_COMPILER=/home/jinkan01/Projects/llvm-project/stage1.install/bin/clang -DCMAKE_CXX_COMPILER=/home/jinkan01/Projects/llvm-project/stage1.install/bin/clang++ -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=True '-DLLVM_LIT_ARGS='"'"'-v'"'"'' -DCMAKE_INSTALL_PREFIX=../stage2.install '-DCMAKE_C_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' '-DCMAKE_CXX_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' -DLLVM_ENABLE_LLD=True '-DLLVM_LIT_ARGS='"'"'-v -j12'"'"'' '-DLLVM_ENABLE_PROJECTS=llvm;mlir;clang-tools-extra;compiler-rt;clang;lld;flang'

ninja

Let me check it again with your commands.
Can you let me know which CPU your test machine has, please? I guessed you are using an a64fx machine from the -mcpu=a64fx option.

Um... I can reproduce the build error now... Maybe there was something wrong with my build...
Thanks for the help @omjavaid

It looks like the SVE target's data layout is missing an alignment for 512-bit vectors. Because the data layout does not mention the alignment of a 512-bit vector, its alignment is the same as its size. That is bigger than the stack's alignment, which is 128 bits, and it causes stack re-alignment...
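To make the alignment argument concrete, here is a minimal C++ sketch. It assumes, as the comment above describes, that a vector type with no data layout entry falls back to an alignment equal to its size (rounded up to a power of two), and compares that with a 128-bit stack alignment. The helper names `defaultVectorAlignBits` and `needsStackRealignment` are hypothetical and are not LLVM APIs.

```cpp
#include <cassert>
#include <cstdint>

// Smallest power of two that is >= N (for N >= 1).
constexpr uint64_t powerOf2Ceil(uint64_t N) {
  uint64_t P = 1;
  while (P < N)
    P *= 2;
  return P;
}

// Assumption: with no explicit data layout entry, a vector is aligned to its
// own size rounded up to a power of two.
constexpr uint64_t defaultVectorAlignBits(uint64_t VectorBits) {
  return powerOf2Ceil(VectorBits);
}

// A stack object forces re-alignment when it requires more alignment than the
// stack itself guarantees.
constexpr bool needsStackRealignment(uint64_t ObjectAlignBits,
                                     uint64_t StackAlignBits) {
  return ObjectAlignBits > StackAlignBits;
}

// 512-bit vector with default alignment vs. a 128-bit aligned stack.
static_assert(needsStackRealignment(defaultVectorAlignBits(512), 128),
              "default 512-bit alignment exceeds the stack alignment");
// Same vector with its alignment explicitly capped at 128 bits.
static_assert(!needsStackRealignment(128, 128),
              "128-bit alignment fits the stack alignment");
```

Under these assumptions, capping the 512-bit vector alignment at 128 bits is exactly what removes the need for re-alignment.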

For the stage2 failure with llvm-tblgen, the ContractNodes function of tablegen has a loop which is vectorized with VF 128 because it has an i1 type... I did not expect vectorization with VF 128... It causes lots of spill code... In PrologEpilogInserter, the default stack size is 3584 bytes and the SVE stack size is 4416 bytes... We need to avoid the VF 512... Anyway, ContractNodes is called recursively and the stack re-alignment causes incorrect stack overwriting...

After setting the 512-bit vector type's alignment to 128 bits, the stage2 failure is gone.

It could be OK to disable shouldMaximizeVectorBandwidth for SVE... because it could need a cost model change, not only a vector type alignment change... Let me discuss it with the team more...
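For reference, the large VFs discussed above follow from the way the maximized-bandwidth bound is computed in the LoopVectorize.cpp hunk of this revision: the register width is divided by the smallest element type and rounded down to a power of two. A small sketch of that arithmetic, where `powerOf2Floor` is a local stand-in for LLVM's `PowerOf2Floor` and `maxBandwidthVF` is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>

// Largest power of two that is <= N, or 0 when N is 0.
constexpr uint64_t powerOf2Floor(uint64_t N) {
  uint64_t P = 1;
  if (N == 0)
    return 0;
  while (P <= N / 2)
    P *= 2;
  return P;
}

// Hypothetical helper: upper bound on the VF when maximizing bandwidth,
// i.e. register width divided by the smallest element type in the loop.
constexpr uint64_t maxBandwidthVF(uint64_t RegisterBits,
                                  uint64_t SmallestTypeBits) {
  return powerOf2Floor(RegisterBits / SmallestTypeBits);
}

static_assert(maxBandwidthVF(128, 8) == 16, "i8 in a 128-bit register");
static_assert(maxBandwidthVF(128, 16) == 8, "i16 in a 128-bit register");
static_assert(maxBandwidthVF(128, 1) == 128, "i1 in a 128-bit register");
static_assert(maxBandwidthVF(512, 1) == 512, "i1 in a 512-bit register");
```

This is only the upper bound; the vectorizer can still pick a smaller VF based on cost and register pressure.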

peterwaller-arm mentioned this in D125918: [LV] Improve register pressure estimate at high VFs. May 18 2022, 12:04 PM

peterwaller-arm mentioned this in rGade47bdc317b: [LV] Improve register pressure estimate at high VFs. May 23 2022, 1:02 AM

I have checked that the commit from https://reviews.llvm.org/D125918 has fixed the stage 2 build error with this patch.
Thanks for fixing it, @paulwalker-arm and @peterwaller-arm!
Let me push this patch again.

Closed by commit rGbb82f746129f: Revert "Revert "[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth"" (authored by jaykang10). May 23 2022, 8:18 AM

This revision was automatically updated to reflect the committed changes.

jaykang10 added a commit: rGbb82f746129f: Revert "Revert "[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth"".

Allen mentioned this in D155355: [AArch64] Set maximum vscale VF with shouldMaximizeVectorBandwidth. Jul 14 2023, 9:48 PM

Revision Contents

Path	Size
llvm/include/llvm/Analysis/TargetTransformInfo.h	11 lines
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h	5 lines
llvm/lib/Analysis/TargetTransformInfo.cpp	5 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h	2 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp	6 lines
llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h	7 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	5 lines
llvm/test/Transforms/LoopVectorize/AArch64/extend-vectorization-factor-for-unprofitable-memops.ll	11 lines
llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll	6 lines
llvm/test/Transforms/LoopVectorize/AArch64/reduction-small-size.ll	16 lines
llvm/test/Transforms/LoopVectorize/AArch64/scalable-vectorization-cost-tuning.ll	2 lines
llvm/test/Transforms/LoopVectorize/AArch64/scalable-vectorization.ll	16 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-illegal-type.ll	8 lines

Diff 431384

llvm/include/llvm/Analysis/TargetTransformInfo.h

[931 lines not shown]	public:
Optional<unsigned> getVScaleForTuning() const;		Optional<unsigned> getVScaleForTuning() const;

/// \return True if the vectorization factor should be chosen to		/// \return True if the vectorization factor should be chosen to
/// make the vector of the smallest element type match the size of a		/// make the vector of the smallest element type match the size of a
/// vector register. For wider element types, this could result in		/// vector register. For wider element types, this could result in
/// creating vectors that span multiple vector registers.		/// creating vectors that span multiple vector registers.
/// If false, the vectorization factor will be chosen based on the		/// If false, the vectorization factor will be chosen based on the
/// size of the widest element type.		/// size of the widest element type.
bool shouldMaximizeVectorBandwidth() const;		/// \p K Register Kind for vectorization.
		bool shouldMaximizeVectorBandwidth(TargetTransformInfo::RegisterKind K) const;
		fhahn: Document `K`?
		jaykang10 (author): Yep, let me add it.

/// \return The minimum vectorization factor for types of given element		/// \return The minimum vectorization factor for types of given element
/// bit width, or 0 if there is no minimum VF. The returned value only		/// bit width, or 0 if there is no minimum VF. The returned value only
/// applies when shouldMaximizeVectorBandwidth returns true.		/// applies when shouldMaximizeVectorBandwidth returns true.
/// If IsScalable is true, the returned ElementCount must be a scalable VF.		/// If IsScalable is true, the returned ElementCount must be a scalable VF.
ElementCount getMinimumVF(unsigned ElemWidth, bool IsScalable) const;		ElementCount getMinimumVF(unsigned ElemWidth, bool IsScalable) const;

/// \return The maximum vectorization factor for types of given element		/// \return The maximum vectorization factor for types of given element
[686 lines not shown]	public:
virtual unsigned getNumberOfRegisters(unsigned ClassID) const = 0;		virtual unsigned getNumberOfRegisters(unsigned ClassID) const = 0;
virtual unsigned getRegisterClassForType(bool Vector,		virtual unsigned getRegisterClassForType(bool Vector,
Type *Ty = nullptr) const = 0;		Type *Ty = nullptr) const = 0;
virtual const char *getRegisterClassName(unsigned ClassID) const = 0;		virtual const char *getRegisterClassName(unsigned ClassID) const = 0;
virtual TypeSize getRegisterBitWidth(RegisterKind K) const = 0;		virtual TypeSize getRegisterBitWidth(RegisterKind K) const = 0;
virtual unsigned getMinVectorRegisterBitWidth() const = 0;		virtual unsigned getMinVectorRegisterBitWidth() const = 0;
virtual Optional<unsigned> getMaxVScale() const = 0;		virtual Optional<unsigned> getMaxVScale() const = 0;
virtual Optional<unsigned> getVScaleForTuning() const = 0;		virtual Optional<unsigned> getVScaleForTuning() const = 0;
virtual bool shouldMaximizeVectorBandwidth() const = 0;		virtual bool
		shouldMaximizeVectorBandwidth(TargetTransformInfo::RegisterKind K) const = 0;
virtual ElementCount getMinimumVF(unsigned ElemWidth,		virtual ElementCount getMinimumVF(unsigned ElemWidth,
bool IsScalable) const = 0;		bool IsScalable) const = 0;
virtual unsigned getMaximumVF(unsigned ElemWidth, unsigned Opcode) const = 0;		virtual unsigned getMaximumVF(unsigned ElemWidth, unsigned Opcode) const = 0;
virtual unsigned getStoreMinimumVF(unsigned VF, Type *ScalarMemTy,		virtual unsigned getStoreMinimumVF(unsigned VF, Type *ScalarMemTy,
Type *ScalarValTy) const = 0;		Type *ScalarValTy) const = 0;
virtual bool shouldConsiderAddressTypePromotion(		virtual bool shouldConsiderAddressTypePromotion(
const Instruction &I, bool &AllowPromotionWithoutCommonHeader) = 0;		const Instruction &I, bool &AllowPromotionWithoutCommonHeader) = 0;
virtual unsigned getCacheLineSize() const = 0;		virtual unsigned getCacheLineSize() const = 0;
[482 lines not shown]	unsigned getMinVectorRegisterBitWidth() const override {
return Impl.getMinVectorRegisterBitWidth();		return Impl.getMinVectorRegisterBitWidth();
}		}
Optional<unsigned> getMaxVScale() const override {		Optional<unsigned> getMaxVScale() const override {
return Impl.getMaxVScale();		return Impl.getMaxVScale();
}		}
Optional<unsigned> getVScaleForTuning() const override {		Optional<unsigned> getVScaleForTuning() const override {
return Impl.getVScaleForTuning();		return Impl.getVScaleForTuning();
}		}
bool shouldMaximizeVectorBandwidth() const override {		bool shouldMaximizeVectorBandwidth(
return Impl.shouldMaximizeVectorBandwidth();		TargetTransformInfo::RegisterKind K) const override {
		return Impl.shouldMaximizeVectorBandwidth(K);
}		}
ElementCount getMinimumVF(unsigned ElemWidth,		ElementCount getMinimumVF(unsigned ElemWidth,
bool IsScalable) const override {		bool IsScalable) const override {
return Impl.getMinimumVF(ElemWidth, IsScalable);		return Impl.getMinimumVF(ElemWidth, IsScalable);
}		}
unsigned getMaximumVF(unsigned ElemWidth, unsigned Opcode) const override {		unsigned getMaximumVF(unsigned ElemWidth, unsigned Opcode) const override {
return Impl.getMaximumVF(ElemWidth, Opcode);		return Impl.getMaximumVF(ElemWidth, Opcode);
}		}
[391 lines not shown]
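The forwarding in this hunk follows a concept/model (type erasure) layout: an abstract interface declares the hook, and a templated wrapper forwards the call to whatever implementation it holds. A reduced sketch of that shape with the new `RegisterKind` parameter might look like the following. All type names here (`Concept`, `Model`, `MiniTTI`, `DefaultImpl`, `AlwaysImpl`) are simplified stand-ins for illustration, not the real LLVM classes.

```cpp
#include <cassert>
#include <memory>
#include <utility>

enum class RegisterKind { Scalar, FixedWidthVector, ScalableVector };

// Abstract interface: one pure virtual per hook.
struct Concept {
  virtual ~Concept() = default;
  virtual bool shouldMaximizeVectorBandwidth(RegisterKind K) const = 0;
};

// Wrapper that forwards each hook to the concrete implementation it owns.
template <typename T> struct Model final : Concept {
  T Impl;
  explicit Model(T I) : Impl(std::move(I)) {}
  bool shouldMaximizeVectorBandwidth(RegisterKind K) const override {
    return Impl.shouldMaximizeVectorBandwidth(K);
  }
};

// Stand-in for a default implementation: never maximize.
struct DefaultImpl {
  bool shouldMaximizeVectorBandwidth(RegisterKind) const { return false; }
};

// Stand-in for a target that always maximizes, whatever the register kind.
struct AlwaysImpl {
  bool shouldMaximizeVectorBandwidth(RegisterKind) const { return true; }
};

// Public facade: callers only see this class, not the concrete target type.
class MiniTTI {
  std::unique_ptr<Concept> Impl;

public:
  template <typename T>
  explicit MiniTTI(T I) : Impl(new Model<T>(std::move(I))) {}

  bool shouldMaximizeVectorBandwidth(RegisterKind K) const {
    return Impl->shouldMaximizeVectorBandwidth(K);
  }
};
```

Because every layer repeats the signature, adding a parameter to the hook means touching the facade, the interface, the wrapper, and each implementation, which matches the set of files changed in this revision.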

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

[411 lines not shown]	TypeSize getRegisterBitWidth(TargetTransformInfo::RegisterKind K) const {
return TypeSize::getFixed(32);		return TypeSize::getFixed(32);
}		}

unsigned getMinVectorRegisterBitWidth() const { return 128; }		unsigned getMinVectorRegisterBitWidth() const { return 128; }

Optional<unsigned> getMaxVScale() const { return None; }		Optional<unsigned> getMaxVScale() const { return None; }
Optional<unsigned> getVScaleForTuning() const { return None; }		Optional<unsigned> getVScaleForTuning() const { return None; }

bool shouldMaximizeVectorBandwidth() const { return false; }		bool
		shouldMaximizeVectorBandwidth(TargetTransformInfo::RegisterKind K) const {
		return false;
		}

ElementCount getMinimumVF(unsigned ElemWidth, bool IsScalable) const {		ElementCount getMinimumVF(unsigned ElemWidth, bool IsScalable) const {
return ElementCount::get(0, IsScalable);		return ElementCount::get(0, IsScalable);
}		}

unsigned getMaximumVF(unsigned ElemWidth, unsigned Opcode) const { return 0; }		unsigned getMaximumVF(unsigned ElemWidth, unsigned Opcode) const { return 0; }
unsigned getStoreMinimumVF(unsigned VF, Type *, Type *) const { return VF; }		unsigned getStoreMinimumVF(unsigned VF, Type *, Type *) const { return VF; }

[834 lines not shown]

llvm/lib/Analysis/TargetTransformInfo.cpp

	[620 lines not shown]
	Optional<unsigned> TargetTransformInfo::getMaxVScale() const {			Optional<unsigned> TargetTransformInfo::getMaxVScale() const {
	return TTIImpl->getMaxVScale();			return TTIImpl->getMaxVScale();
	}			}

	Optional<unsigned> TargetTransformInfo::getVScaleForTuning() const {			Optional<unsigned> TargetTransformInfo::getVScaleForTuning() const {
	return TTIImpl->getVScaleForTuning();			return TTIImpl->getVScaleForTuning();
	}			}

	bool TargetTransformInfo::shouldMaximizeVectorBandwidth() const {			bool TargetTransformInfo::shouldMaximizeVectorBandwidth(
	return TTIImpl->shouldMaximizeVectorBandwidth();			TargetTransformInfo::RegisterKind K) const {
				return TTIImpl->shouldMaximizeVectorBandwidth(K);
	}			}

	ElementCount TargetTransformInfo::getMinimumVF(unsigned ElemWidth,			ElementCount TargetTransformInfo::getMinimumVF(unsigned ElemWidth,
	bool IsScalable) const {			bool IsScalable) const {
	return TTIImpl->getMinimumVF(ElemWidth, IsScalable);			return TTIImpl->getMinimumVF(ElemWidth, IsScalable);
	}			}

	unsigned TargetTransformInfo::getMaximumVF(unsigned ElemWidth,			unsigned TargetTransformInfo::getMaximumVF(unsigned ElemWidth,
	[578 lines not shown]

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h

[129 lines not shown]	public:
unsigned getMinVectorRegisterBitWidth() const {		unsigned getMinVectorRegisterBitWidth() const {
return ST->getMinVectorRegisterBitWidth();		return ST->getMinVectorRegisterBitWidth();
}		}

Optional<unsigned> getVScaleForTuning() const {		Optional<unsigned> getVScaleForTuning() const {
return ST->getVScaleForTuning();		return ST->getVScaleForTuning();
}		}

		bool shouldMaximizeVectorBandwidth(TargetTransformInfo::RegisterKind K) const;

		dmgreen: It's generally best if fixed length vectorization doesn't start behaving differently just because SVE is available (unless it can be better, of course). If we expect MaximizeVectorBandwidth to be better, but it doesn't work well for scalable vectors, can we just try to disable the scalable VFs from being widened?
		jaykang10 (author): I think it would be good to have some comments from the SVE people... If possible, can you add them as reviewers please? I do not know who works on it...
		paulwalker-arm: I agree with Dave, this decision is distinct from whether SVE is available or not. As well as this affecting the NEON side of things, there are circumstances where SVE is also used for fixed length vectorisation. Perhaps this function should be changed to take a `TargetTransformInfo::RegisterKind`, much like getRegisterBitWidth?
		jaykang10 (author): Thanks for the comment @paulwalker-arm! Let me update the function with the `TargetTransformInfo::RegisterKind`.
/// Try to return an estimate cost factor that can be used as a multiplier		/// Try to return an estimate cost factor that can be used as a multiplier
/// when scalarizing an operation for a vector with ElementCount \p VF.		/// when scalarizing an operation for a vector with ElementCount \p VF.
/// For scalable vectors this currently takes the most pessimistic view based		/// For scalable vectors this currently takes the most pessimistic view based
/// upon the maximum possible value for vscale.		/// upon the maximum possible value for vscale.
unsigned getMaxNumElements(ElementCount VF) const {		unsigned getMaxNumElements(ElementCount VF) const {
if (!VF.isScalable())		if (!VF.isScalable())
return VF.getFixedValue();		return VF.getFixedValue();

[212 lines not shown]

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

[31 lines not shown]	static cl::opt<bool> EnableFalkorHWPFUnrollFix("enable-falkor-hwpf-unroll-fix",
cl::init(true), cl::Hidden);		cl::init(true), cl::Hidden);

static cl::opt<unsigned> SVEGatherOverhead("sve-gather-overhead", cl::init(10),		static cl::opt<unsigned> SVEGatherOverhead("sve-gather-overhead", cl::init(10),
cl::Hidden);		cl::Hidden);

static cl::opt<unsigned> SVEScatterOverhead("sve-scatter-overhead",		static cl::opt<unsigned> SVEScatterOverhead("sve-scatter-overhead",
cl::init(10), cl::Hidden);		cl::init(10), cl::Hidden);

bool AArch64TTIImpl::areInlineCompatible(const Function *Caller,		bool AArch64TTIImpl::areInlineCompatible(const Function *Caller,
		dmgreen: There is a vectorizer-maximize-bandwidth option in the vectorizer that can override the target option for shouldMaximizeVectorBandwidth. I don't think adding an aarch64 option is necessary, can you remove it?
		paulwalker-arm: I agree. My original ask was because I thought there were concerns about enabling this by default. Given the flag still defaults to on and it seems we're happy to make this change for AArch64 I retract my previous ask.
		dmgreen: Yeah - I think the vectorizer-maximize-bandwidth option worked differently back when the comment was suggested too.
		jaykang10 (author): Yep, let me remove this option.
const Function *Callee) const {		const Function *Callee) const {
const TargetMachine &TM = getTLI()->getTargetMachine();		const TargetMachine &TM = getTLI()->getTargetMachine();

const FeatureBitset &CallerBits =		const FeatureBitset &CallerBits =
TM.getSubtargetImpl(*Caller)->getFeatureBits();		TM.getSubtargetImpl(*Caller)->getFeatureBits();
const FeatureBitset &CalleeBits =		const FeatureBitset &CalleeBits =
TM.getSubtargetImpl(*Callee)->getFeatureBits();		TM.getSubtargetImpl(*Callee)->getFeatureBits();

// Inline a callee if its target-features are a subset of the callers		// Inline a callee if its target-features are a subset of the callers
// target-features.		// target-features.
return (CallerBits & CalleeBits) == CalleeBits;		return (CallerBits & CalleeBits) == CalleeBits;
}		}

		bool AArch64TTIImpl::shouldMaximizeVectorBandwidth(
		TargetTransformInfo::RegisterKind K) const {
		assert(K != TargetTransformInfo::RGK_Scalar);
		fhahn: Simpler to just have `return K == TargetTransformInfo::RGK_FixedWidthVector;`? Possibly with an assert that `K` is not `RGK_Scalar`.
		jaykang10 (author): Yep, let me update it.
		return K == TargetTransformInfo::RGK_FixedWidthVector;
		}

/// Calculate the cost of materializing a 64-bit value. This helper		/// Calculate the cost of materializing a 64-bit value. This helper
/// method might only calculate a fraction of a larger immediate. Therefore it		/// method might only calculate a fraction of a larger immediate. Therefore it
/// is valid to return a cost of ZERO.		/// is valid to return a cost of ZERO.
InstructionCost AArch64TTIImpl::getIntImmCost(int64_t Val) {		InstructionCost AArch64TTIImpl::getIntImmCost(int64_t Val) {
// Check if the immediate can be encoded within an instruction.		// Check if the immediate can be encoded within an instruction.
if (Val == 0 || AArch64_AM::isLogicalImmediate(Val, 64))		if (Val == 0 || AArch64_AM::isLogicalImmediate(Val, 64))
return 0;		return 0;

[2,840 lines not shown]

llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h

[80 lines not shown]	public:
/// @{		/// @{

unsigned getNumberOfRegisters(bool vector) const;		unsigned getNumberOfRegisters(bool vector) const;
unsigned getMaxInterleaveFactor(unsigned VF);		unsigned getMaxInterleaveFactor(unsigned VF);
TypeSize getRegisterBitWidth(TargetTransformInfo::RegisterKind K) const;		TypeSize getRegisterBitWidth(TargetTransformInfo::RegisterKind K) const;
unsigned getMinVectorRegisterBitWidth() const;		unsigned getMinVectorRegisterBitWidth() const;
ElementCount getMinimumVF(unsigned ElemWidth, bool IsScalable) const;		ElementCount getMinimumVF(unsigned ElemWidth, bool IsScalable) const;

bool shouldMaximizeVectorBandwidth() const {		bool
		shouldMaximizeVectorBandwidth(TargetTransformInfo::RegisterKind K) const {
return true;		return true;
}		}
bool supportsEfficientVectorElementLoadStore() {		bool supportsEfficientVectorElementLoadStore() { return false; }
return false;
}
bool hasBranchDivergence() {		bool hasBranchDivergence() {
return false;		return false;
}		}
bool enableAggressiveInterleaving(bool LoopHasReductions) {		bool enableAggressiveInterleaving(bool LoopHasReductions) {
return false;		return false;
}		}
bool prefersVectorizedAddressing() {		bool prefersVectorizedAddressing() {
return false;		return false;
[74 lines not shown]

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

[5,210 lines not shown]	if (ConstTripCount &&
// when the TC is less than or equal to the known number of lanes.		// when the TC is less than or equal to the known number of lanes.
auto ClampedConstTripCount = PowerOf2Floor(ConstTripCount);		auto ClampedConstTripCount = PowerOf2Floor(ConstTripCount);
LLVM_DEBUG(dbgs() << "LV: Clamping the MaxVF to maximum power of two not "		LLVM_DEBUG(dbgs() << "LV: Clamping the MaxVF to maximum power of two not "
"exceeding the constant trip count: "		"exceeding the constant trip count: "
<< ClampedConstTripCount << "\n");		<< ClampedConstTripCount << "\n");
return ElementCount::getFixed(ClampedConstTripCount);		return ElementCount::getFixed(ClampedConstTripCount);
}		}

		TargetTransformInfo::RegisterKind RegKind =
		dmgreen: K is a bit of a short name. Perhaps use RegKind or something like it?
		jaykang10 (author): Yep, let me change it.
		ComputeScalableMaxVF ? TargetTransformInfo::RGK_ScalableVector
		: TargetTransformInfo::RGK_FixedWidthVector;
ElementCount MaxVF = MaxVectorElementCount;		ElementCount MaxVF = MaxVectorElementCount;
if (MaximizeBandwidth || (MaximizeBandwidth.getNumOccurrences() == 0 &&		if (MaximizeBandwidth || (MaximizeBandwidth.getNumOccurrences() == 0 &&
TTI.shouldMaximizeVectorBandwidth())) {		TTI.shouldMaximizeVectorBandwidth(RegKind))) {
auto MaxVectorElementCountMaxBW = ElementCount::get(		auto MaxVectorElementCountMaxBW = ElementCount::get(
PowerOf2Floor(WidestRegister.getKnownMinSize() / SmallestType),		PowerOf2Floor(WidestRegister.getKnownMinSize() / SmallestType),
ComputeScalableMaxVF);		ComputeScalableMaxVF);
MaxVectorElementCountMaxBW = MinVF(MaxVectorElementCountMaxBW, MaxSafeVF);		MaxVectorElementCountMaxBW = MinVF(MaxVectorElementCountMaxBW, MaxSafeVF);

// Collect all viable vectorization factors larger than the default MaxVF		// Collect all viable vectorization factors larger than the default MaxVF
// (i.e. MaxVectorElementCount).		// (i.e. MaxVectorElementCount).
SmallVector<ElementCount, 8> VFs;		SmallVector<ElementCount, 8> VFs;
[5,608 lines not shown]
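The condition in this hunk combines three inputs: whether a scalable or fixed-width max VF is being computed, whether the command-line flag was given explicitly, and what the target hook returns for that register kind. The following is a simplified sketch of that precedence. It assumes the flag defaults to off when it is not given, and the names `FlagState`, `pickRegisterKind`, `fixedOnlyHook`, and `useMaxBandwidth` are hypothetical, introduced only for this illustration.

```cpp
#include <cassert>

enum class RegisterKind { Scalar, FixedWidthVector, ScalableVector };

// Stand-in for the command-line option: Unset means it was not passed at all.
enum class FlagState { Unset, Off, On };

// Mirrors the ternary in the hunk: scalable max VF -> scalable register kind.
constexpr RegisterKind pickRegisterKind(bool ComputeScalableMaxVF) {
  return ComputeScalableMaxVF ? RegisterKind::ScalableVector
                              : RegisterKind::FixedWidthVector;
}

// Stand-in for a target hook that opts in for fixed-width vectors only,
// as the AArch64 implementation in this revision does.
constexpr bool fixedOnlyHook(RegisterKind K) {
  return K == RegisterKind::FixedWidthVector;
}

// An explicit flag wins; the target hook is consulted only when it is unset.
constexpr bool useMaxBandwidth(FlagState Flag, bool TargetWantsMaxBW) {
  return Flag == FlagState::On ||
         (Flag == FlagState::Unset && TargetWantsMaxBW);
}

static_assert(useMaxBandwidth(FlagState::Unset,
                              fixedOnlyHook(pickRegisterKind(false))),
              "fixed-width: the target opts in");
static_assert(!useMaxBandwidth(FlagState::Unset,
                               fixedOnlyHook(pickRegisterKind(true))),
              "scalable: the target opts out");
static_assert(useMaxBandwidth(FlagState::On, false), "flag forces it on");
static_assert(!useMaxBandwidth(FlagState::Off, true), "flag forces it off");
```

With a hook shaped like this, fixed-width and scalable VFs are decided independently, which is the separation the reviewers asked for.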

llvm/test/Transforms/LoopVectorize/AArch64/extend-vectorization-factor-for-unprofitable-memops.ll

; RUN: opt -loop-vectorize -mtriple=arm64-apple-darwin -S %s | FileCheck %s		; RUN: opt -loop-vectorize -mtriple=arm64-apple-darwin -S %s | FileCheck %s

; Test cases for extending the vectorization factor, if small memory operations		; Test cases for extending the vectorization factor, if small memory operations
; are not profitable.		; are not profitable.

; Test with a loop that contains memory accesses of i8 and i32 types. The		; Test with a loop that contains memory accesses of i8 and i32 types. The
; default maximum VF for NEON is 4. And while we don't have an instruction to		; maximum VF for NEON is calculated by 128/size of smallest type in loop.
; load 4 x i8, vectorization might still be profitable.		; And while we don't have an instruction to load 4 x i8, vectorization
		; might still be profitable.
define void @test_load_i8_store_i32(i8* noalias %src, i32* noalias %dst, i32 %off, i64 %N) {		define void @test_load_i8_store_i32(i8* noalias %src, i32* noalias %dst, i32 %off, i64 %N) {
; CHECK-LABEL: @test_load_i8_store_i32(		; CHECK-LABEL: @test_load_i8_store_i32(
; CHECK: <4 x i8>		; CHECK: <16 x i8>
;		;
entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ %iv.next, %loop ], [ 0, %entry ]		%iv = phi i64 [ %iv.next, %loop ], [ 0, %entry ]
%gep.src = getelementptr inbounds i8, i8* %src, i64 %iv		%gep.src = getelementptr inbounds i8, i8* %src, i64 %iv
%lv = load i8, i8* %gep.src, align 1		%lv = load i8, i8* %gep.src, align 1
%lv.ext = zext i8 %lv to i32		%lv.ext = zext i8 %lv to i32
%add = add i32 %lv.ext, %off		%add = add i32 %lv.ext, %off
%gep.dst = getelementptr inbounds i32, i32* %dst, i64 %iv		%gep.dst = getelementptr inbounds i32, i32* %dst, i64 %iv
store i32 %add, i32* %gep.dst		store i32 %add, i32* %gep.dst
%iv.next = add nuw nsw i64 %iv, 1		%iv.next = add nuw nsw i64 %iv, 1
%exitcond.not = icmp eq i64 %iv.next, %N		%exitcond.not = icmp eq i64 %iv.next, %N
br i1 %exitcond.not, label %exit, label %loop		br i1 %exitcond.not, label %exit, label %loop

exit:		exit:
ret void		ret void
}		}

; Same as test_load_i8_store_i32, but with types flipped for load and store.		; Same as test_load_i8_store_i32, but with types flipped for load and store.
define void @test_load_i32_store_i8(i32* noalias %src, i8* noalias %dst, i32 %off, i64 %N) {		define void @test_load_i32_store_i8(i32* noalias %src, i8* noalias %dst, i32 %off, i64 %N) {
; CHECK-LABEL: @test_load_i32_store_i8(		; CHECK-LABEL: @test_load_i32_store_i8(
; CHECK: <4 x i8>		; CHECK: <16 x i8>
;		;
entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ %iv.next, %loop ], [ 0, %entry ]		%iv = phi i64 [ %iv.next, %loop ], [ 0, %entry ]
%gep.src = getelementptr inbounds i32, i32* %src, i64 %iv		%gep.src = getelementptr inbounds i32, i32* %src, i64 %iv
%lv = load i32, i32* %gep.src, align 1		%lv = load i32, i32* %gep.src, align 1
[35 lines not shown]	exit:
ret void		ret void
}		}

; Test with loop body that requires a large number of vector registers if the		; Test with loop body that requires a large number of vector registers if the
; vectorization factor is large. Make sure the register estimates limit the		; vectorization factor is large. Make sure the register estimates limit the
; vectorization factor.		; vectorization factor.
define void @test_load_i8_store_i64_large(i8* noalias %src, i64* noalias %dst, i64* noalias %dst.2, i64* noalias %dst.3, i64* noalias %dst.4, i64* noalias %dst.5, i64%off, i64 %off.2, i64 %N) {		define void @test_load_i8_store_i64_large(i8* noalias %src, i64* noalias %dst, i64* noalias %dst.2, i64* noalias %dst.3, i64* noalias %dst.4, i64* noalias %dst.5, i64%off, i64 %off.2, i64 %N) {
; CHECK-LABEL: @test_load_i8_store_i64_large		; CHECK-LABEL: @test_load_i8_store_i64_large
; CHECK: <2 x i64>		; CHECK: <8 x i64>
;		;
entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ %iv.next, %loop ], [ 0, %entry ]		%iv = phi i64 [ %iv.next, %loop ], [ 0, %entry ]
%gep.src = getelementptr inbounds i8, i8* %src, i64 %iv		%gep.src = getelementptr inbounds i8, i8* %src, i64 %iv
%gep.dst.3 = getelementptr inbounds i64, i64* %dst.3, i64 %iv		%gep.dst.3 = getelementptr inbounds i64, i64* %dst.3, i64 %iv
[28 lines not shown]

llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll

[110 lines not shown]	for.body: ; preds = %entry, %for.body
store i16 %conv1, i16* %arrayidx3		store i16 %conv1, i16* %arrayidx3
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%lftr.wideiv = trunc i64 %indvars.iv.next to i32		%lftr.wideiv = trunc i64 %indvars.iv.next to i32
%exitcond = icmp eq i32 %lftr.wideiv, %len		%exitcond = icmp eq i32 %lftr.wideiv, %len
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body
}		}

; CHECK-LABEL: @add_d(		; CHECK-LABEL: @add_d(
; CHECK: load <4 x i16>		; CHECK: load <8 x i16>
; CHECK: add nsw <4 x i32>		; CHECK: add nsw <8 x i32>
; CHECK: store <4 x i32>		; CHECK: store <8 x i32>
define void @add_d(i16* noalias nocapture readonly %p, i32* noalias nocapture %q, i32 %len) #0 {		define void @add_d(i16* noalias nocapture readonly %p, i32* noalias nocapture %q, i32 %len) #0 {
entry:		entry:
%cmp7 = icmp sgt i32 %len, 0		%cmp7 = icmp sgt i32 %len, 0
br i1 %cmp7, label %for.body, label %for.cond.cleanup		br i1 %cmp7, label %for.body, label %for.cond.cleanup

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
ret void		ret void

[181 lines not shown]

llvm/test/Transforms/LoopVectorize/AArch64/reduction-small-size.ll

	[117 lines not shown]
	; short reduction_i16_2(char *a, char *b, int n) {			; short reduction_i16_2(char *a, char *b, int n) {
	; short sum = 0;			; short sum = 0;
	; for (int i = 0; i < n; ++i)			; for (int i = 0; i < n; ++i)
	; sum += (a[i] + b[i]);			; sum += (a[i] + b[i]);
	; return sum;			; return sum;
	; }			; }
	;			;
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK: phi <8 x i16>			; CHECK: phi <16 x i16>
	; CHECK: [[Ld1:%[a-zA-Z0-9.]+]] = load <8 x i8>			; CHECK: [[Ld1:%[a-zA-Z0-9.]+]] = load <16 x i8>
	; CHECK: zext <8 x i8> [[Ld1]] to <8 x i16>			; CHECK: zext <16 x i8> [[Ld1]] to <16 x i16>
	; CHECK: [[Ld2:%[a-zA-Z0-9.]+]] = load <8 x i8>			; CHECK: [[Ld2:%[a-zA-Z0-9.]+]] = load <16 x i8>
	; CHECK: zext <8 x i8> [[Ld2]] to <8 x i16>			; CHECK: zext <16 x i8> [[Ld2]] to <16 x i16>
	; CHECK: add <8 x i16>			; CHECK: add <16 x i16>
	; CHECK: add <8 x i16>			; CHECK: add <16 x i16>
	;			;
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK: [[Rdx:%[a-zA-Z0-9.]+]] = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16>			; CHECK: [[Rdx:%[a-zA-Z0-9.]+]] = call i16 @llvm.vector.reduce.add.v16i16(<16 x i16>
	; CHECK: zext i16 [[Rdx]] to i32			; CHECK: zext i16 [[Rdx]] to i32
	;			;
	define i16 @reduction_i16_2(i8* nocapture readonly %a, i8* nocapture readonly %b, i32 %n) {			define i16 @reduction_i16_2(i8* nocapture readonly %a, i8* nocapture readonly %b, i32 %n) {
	entry:			entry:
	%cmp.14 = icmp sgt i32 %n, 0			%cmp.14 = icmp sgt i32 %n, 0
	br i1 %cmp.14, label %for.body.preheader, label %for.cond.cleanup			br i1 %cmp.14, label %for.body.preheader, label %for.cond.cleanup

	for.body.preheader:			for.body.preheader:
	[28 lines not shown]

llvm/test/Transforms/LoopVectorize/AArch64/scalable-vectorization-cost-tuning.ll

	[23 lines not shown]

	; NEOVERSE-V1: LV: Vector loop of width vscale x 2 costs: 3 (assuming a minimum vscale of 2).			; NEOVERSE-V1: LV: Vector loop of width vscale x 2 costs: 3 (assuming a minimum vscale of 2).
	; NEOVERSE-V1: LV: Vector loop of width vscale x 4 costs: 1 (assuming a minimum vscale of 2).			; NEOVERSE-V1: LV: Vector loop of width vscale x 4 costs: 1 (assuming a minimum vscale of 2).

	; NEOVERSE-N2: LV: Vector loop of width vscale x 2 costs: 6 (assuming a minimum vscale of 1).			; NEOVERSE-N2: LV: Vector loop of width vscale x 2 costs: 6 (assuming a minimum vscale of 1).
	; NEOVERSE-N2: LV: Vector loop of width vscale x 4 costs: 3 (assuming a minimum vscale of 1).			; NEOVERSE-N2: LV: Vector loop of width vscale x 4 costs: 3 (assuming a minimum vscale of 1).

	; VF-4: <4 x i32>			; VF-4: <4 x i32>
	; VF-VSCALE4: <vscale x 4 x i32>			; VF-VSCALE4: <16 x i32>
	define void @test0(i32* %a, i8* %b, i32* %c) #0 {			define void @test0(i32* %a, i8* %b, i32* %c) #0 {
	entry:			entry:
	br label %loop			br label %loop

	loop:			loop:
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
	%arrayidx = getelementptr inbounds i32, i32* %c, i64 %iv			%arrayidx = getelementptr inbounds i32, i32* %c, i64 %iv
	%0 = load i32, i32* %arrayidx, align 4			%0 = load i32, i32* %arrayidx, align 4
	[14 lines not shown]

llvm/test/Transforms/LoopVectorize/AArch64/scalable-vectorization.ll

; REQUIRES: asserts		; REQUIRES: asserts
; RUN: opt -mtriple=aarch64-none-linux-gnu -mattr=+sve -force-target-instruction-cost=1 -loop-vectorize -S -debug-only=loop-vectorize -scalable-vectorization=off < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK_SCALABLE_DISABLED		; RUN: opt -mtriple=aarch64-none-linux-gnu -mattr=+sve -force-target-instruction-cost=1 -loop-vectorize -S -debug-only=loop-vectorize -scalable-vectorization=off < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK_SCALABLE_DISABLED
; RUN: opt -mtriple=aarch64-none-linux-gnu -mattr=+sve -force-target-instruction-cost=1 -loop-vectorize -S -debug-only=loop-vectorize -scalable-vectorization=on < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK_SCALABLE_ON		; RUN: opt -mtriple=aarch64-none-linux-gnu -mattr=+sve -force-target-instruction-cost=1 -loop-vectorize -S -debug-only=loop-vectorize -scalable-vectorization=on < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK_SCALABLE_ON
; RUN: opt -mtriple=aarch64-none-linux-gnu -mattr=+sve -force-target-instruction-cost=1 -loop-vectorize -S -debug-only=loop-vectorize -vectorizer-maximize-bandwidth -scalable-vectorization=on < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK_SCALABLE_ON_MAXBW		; RUN: opt -mtriple=aarch64-none-linux-gnu -mattr=+sve -force-target-instruction-cost=1 -loop-vectorize -S -debug-only=loop-vectorize -vectorizer-maximize-bandwidth -scalable-vectorization=on < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK_SCALABLE_ON_MAXBW

; Test that the MaxVF for the following loop, that has no dependence distances,		; Test that the MaxVF for the following loop, that has no dependence distances,
; is calculated as vscale x 4 (max legal SVE vector size) or vscale x 16		; is calculated as vscale x 4 (max legal SVE vector size) or vscale x 16
; (maximized bandwidth for i8 in the loop).		; (maximized bandwidth for i8 in the loop).
define void @test0(i32* %a, i8* %b, i32* %c) #0 {		define void @test0(i32* %a, i8* %b, i32* %c) #0 {
; CHECK: LV: Checking a loop in 'test0'		; CHECK: LV: Checking a loop in 'test0'
; CHECK_SCALABLE_ON: LV: Found feasible scalable VF = vscale x 4		; CHECK_SCALABLE_ON: LV: Found feasible scalable VF = vscale x 4
; CHECK_SCALABLE_ON: LV: Selecting VF: vscale x 4		; CHECK_SCALABLE_ON: LV: Selecting VF: 16
; CHECK_SCALABLE_DISABLED-NOT: LV: Found feasible scalable VF		; CHECK_SCALABLE_DISABLED-NOT: LV: Found feasible scalable VF
; CHECK_SCALABLE_DISABLED: LV: Selecting VF: 4		; CHECK_SCALABLE_DISABLED: LV: Selecting VF: 16
; CHECK_SCALABLE_ON_MAXBW: LV: Found feasible scalable VF = vscale x 16		; CHECK_SCALABLE_ON_MAXBW: LV: Found feasible scalable VF = vscale x 16
; CHECK_SCALABLE_ON_MAXBW: LV: Selecting VF: vscale x 16		; CHECK_SCALABLE_ON_MAXBW: LV: Selecting VF: vscale x 16
entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
%arrayidx = getelementptr inbounds i32, i32* %c, i64 %iv		%arrayidx = getelementptr inbounds i32, i32* %c, i64 %iv
exit:		exit:
ret void		ret void
}		}

; Test that the MaxVF for the following loop, with a dependence distance		; Test that the MaxVF for the following loop, with a dependence distance
; of 64 elements, is calculated as (maxvscale = 16) * 4.		; of 64 elements, is calculated as (maxvscale = 16) * 4.
define void @test1(i32* %a, i8* %b) #0 {		define void @test1(i32* %a, i8* %b) #0 {
; CHECK: LV: Checking a loop in 'test1'		; CHECK: LV: Checking a loop in 'test1'
; CHECK_SCALABLE_ON: LV: Found feasible scalable VF = vscale x 4		; CHECK_SCALABLE_ON: LV: Found feasible scalable VF = vscale x 4
; CHECK_SCALABLE_ON: LV: Selecting VF: vscale x 4		; CHECK_SCALABLE_ON: LV: Selecting VF: 16
; CHECK_SCALABLE_DISABLED-NOT: LV: Found feasible scalable VF		; CHECK_SCALABLE_DISABLED-NOT: LV: Found feasible scalable VF
; CHECK_SCALABLE_DISABLED: LV: Selecting VF: 4		; CHECK_SCALABLE_DISABLED: LV: Selecting VF: 16
; CHECK_SCALABLE_ON_MAXBW: LV: Found feasible scalable VF = vscale x 4		; CHECK_SCALABLE_ON_MAXBW: LV: Found feasible scalable VF = vscale x 4
; CHECK_SCALABLE_ON_MAXBW: LV: Selecting VF: 16		; CHECK_SCALABLE_ON_MAXBW: LV: Selecting VF: 16
entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
%arrayidx = getelementptr inbounds i32, i32* %a, i64 %iv		%arrayidx = getelementptr inbounds i32, i32* %a, i64 %iv
exit:		exit:
ret void		ret void
}		}

; Test that the MaxVF for the following loop, with a dependence distance		; Test that the MaxVF for the following loop, with a dependence distance
; of 32 elements, is calculated as (maxvscale = 16) * 2.		; of 32 elements, is calculated as (maxvscale = 16) * 2.
define void @test2(i32* %a, i8* %b) #0 {		define void @test2(i32* %a, i8* %b) #0 {
; CHECK: LV: Checking a loop in 'test2'		; CHECK: LV: Checking a loop in 'test2'
; CHECK_SCALABLE_ON: LV: Found feasible scalable VF = vscale x 2		; CHECK_SCALABLE_ON: LV: Found feasible scalable VF = vscale x 2
; CHECK_SCALABLE_ON: LV: Selecting VF: vscale x 2		; CHECK_SCALABLE_ON: LV: Selecting VF: 16
; CHECK_SCALABLE_DISABLED-NOT: LV: Found feasible scalable VF		; CHECK_SCALABLE_DISABLED-NOT: LV: Found feasible scalable VF
; CHECK_SCALABLE_DISABLED: LV: Selecting VF: 4		; CHECK_SCALABLE_DISABLED: LV: Selecting VF: 16
; CHECK_SCALABLE_ON_MAXBW: LV: Found feasible scalable VF = vscale x 2		; CHECK_SCALABLE_ON_MAXBW: LV: Found feasible scalable VF = vscale x 2
; CHECK_SCALABLE_ON_MAXBW: LV: Selecting VF: 16		; CHECK_SCALABLE_ON_MAXBW: LV: Selecting VF: 16
entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
%arrayidx = getelementptr inbounds i32, i32* %a, i64 %iv		%arrayidx = getelementptr inbounds i32, i32* %a, i64 %iv
exit:		exit:
ret void		ret void
}		}

; Test that the MaxVF for the following loop, with a dependence distance		; Test that the MaxVF for the following loop, with a dependence distance
; of 16 elements, is calculated as (maxvscale = 16) * 1.		; of 16 elements, is calculated as (maxvscale = 16) * 1.
define void @test3(i32* %a, i8* %b) #0 {		define void @test3(i32* %a, i8* %b) #0 {
; CHECK: LV: Checking a loop in 'test3'		; CHECK: LV: Checking a loop in 'test3'
; CHECK_SCALABLE_ON: LV: Found feasible scalable VF = vscale x 1		; CHECK_SCALABLE_ON: LV: Found feasible scalable VF = vscale x 1
; CHECK_SCALABLE_ON: LV: Selecting VF: 4		; CHECK_SCALABLE_ON: LV: Selecting VF: 16
; CHECK_SCALABLE_DISABLED-NOT: LV: Found feasible scalable VF		; CHECK_SCALABLE_DISABLED-NOT: LV: Found feasible scalable VF
; CHECK_SCALABLE_DISABLED: LV: Selecting VF: 4		; CHECK_SCALABLE_DISABLED: LV: Selecting VF: 16
; CHECK_SCALABLE_ON_MAXBW: LV: Found feasible scalable VF = vscale x 1		; CHECK_SCALABLE_ON_MAXBW: LV: Found feasible scalable VF = vscale x 1
; CHECK_SCALABLE_ON_MAXBW: LV: Selecting VF: 16		; CHECK_SCALABLE_ON_MAXBW: LV: Selecting VF: 16
entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
%arrayidx = getelementptr inbounds i32, i32* %a, i64 %iv		%arrayidx = getelementptr inbounds i32, i32* %a, i64 %iv
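The test comments in scalable-vectorization.ll rely on two small calculations: the lane count that fits a vector register for a given element type, and the largest scalable VF that stays within a dependence distance for every possible vscale. The following is a minimal sketch of that arithmetic, not the code in LoopVectorize.cpp. The helper names are hypothetical, and the 128-bit register width is taken from the revision summary ("128 / the size of smallest type in loop").

```cpp
#include <cassert>
#include <cstdint>

// Largest power of two that is <= x. Requires x > 0.
uint64_t floorPow2(uint64_t x) {
  assert(x > 0 && "floorPow2 is undefined for zero");
  uint64_t p = 1;
  while (p * 2 <= x)
    p *= 2;
  return p;
}

// Hypothetical helper: lanes of elemBits-wide elements in a regBits register.
// Using the widest type gives the conservative VF; using the smallest type
// gives the "maximize bandwidth" VF.
uint64_t lanesFor(uint64_t regBits, uint64_t elemBits) {
  assert(elemBits > 0 && "element width must be non-zero");
  return regBits / elemBits;
}

// Hypothetical helper: the N in "vscale x N" such that vscale * N never
// exceeds maxSafeElements for any vscale up to maxVScale. Returns 0 when no
// scalable VF is safe.
uint64_t maxScalableLanes(uint64_t maxSafeElements, uint64_t maxVScale) {
  assert(maxVScale > 0 && "maximum vscale must be non-zero");
  uint64_t quotient = maxSafeElements / maxVScale;
  return quotient == 0 ? 0 : floorPow2(quotient);
}
```

With these definitions, `lanesFor(128, 32)` is 4 and `lanesFor(128, 8)` is 16, matching the "vscale x 4" and "vscale x 16" cases described for test0. For test1 to test3, `maxScalableLanes` with a maximum vscale of 16 gives 4, 2, and 1 for dependence distances of 64, 32, and 16 elements.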

llvm/test/Transforms/LoopVectorize/AArch64/sve-illegal-type.ll

	for.end:			for.end:
	ret void			ret void
	}			}

	; CHECK-REMARKS: Scalable vectorization is not supported for all element types found in this loop			; CHECK-REMARKS: Scalable vectorization is not supported for all element types found in this loop
	define void @uniform_store_i1(i1* noalias %dst, i64* noalias %start, i64 %N) {			define void @uniform_store_i1(i1* noalias %dst, i64* noalias %start, i64 %N) {
	; CHECK-LABEL: @uniform_store_i1			; CHECK-LABEL: @uniform_store_i1
	; CHECK: vector.body			; CHECK: vector.body
	; CHECK: %[[GEP:.*]] = getelementptr inbounds i64, <2 x i64*> {{.*}}, i64 1			; CHECK: %[[GEP:.*]] = getelementptr inbounds i64, <64 x i64*> {{.*}}, i64 1
	; CHECK: %[[ICMP:.*]] = icmp eq <2 x i64*> %[[GEP]], %[[SPLAT:.*]]			; CHECK: %[[ICMP:.*]] = icmp eq <64 x i64*> %[[GEP]], %[[SPLAT:.*]]
	; CHECK: %[[EXTRACT1:.*]] = extractelement <2 x i1> %[[ICMP]], i32 0			; CHECK: %[[EXTRACT1:.*]] = extractelement <64 x i1> %[[ICMP]], i32 0
	; CHECK: store i1 %[[EXTRACT1]], i1* %dst			; CHECK: store i1 %[[EXTRACT1]], i1* %dst
	; CHECK: %[[EXTRACT2:.*]] = extractelement <2 x i1> %[[ICMP]], i32 1			; CHECK: %[[EXTRACT2:.*]] = extractelement <64 x i1> %[[ICMP]], i32 1
				dmgreen: This is worrying. Should it be vectorizing 64x for an i1 type? (And are there a lot of other extracts now?)
				jaykang10 (author): When I checked it, it looked like the DAGCombiner combines the 64 i1 extract_vector_elt and store nodes into one 64-bit store node. Let me check it again.
				jaykang10 (author): In this test, `%dst` is passed as a parameter, so it does not change in the loop. Therefore, only the last element of the <64 x i1> vector needs to be stored. It looks like the DAGCombiner catches this and optimizes the nodes well. The assembly output of the `vector.body` block from llc is below. It looks OK.

				    .LBB0_3:                                // %vector.body
				                                            // =>This Inner Loop Header: Depth=1
				            dup     v2.2d, x12
				            add     x12, x12, #512
				            subs    x11, x11, #64
				            add     v2.2d, v2.2d, v0.2d
				            cmeq    v2.2d, v2.2d, v1.2d
				            xtn2    v2.4s, v2.2d
				            xtn2    v2.8h, v2.4s
				            xtn     v2.8b, v2.8h
				            umov    w13, v2.b[7]
				            and     w13, w13, #0x1
				            strb    w13, [x0]
				            b.ne    .LBB0_3

				dmgreen: Ah I see, that makes sense that it would pick the higher factor then. It looks like it does not vectorize if the address is varying.
	; CHECK: store i1 %[[EXTRACT2]], i1* %dst			; CHECK: store i1 %[[EXTRACT2]], i1* %dst
	; CHECK-NOT: vscale			; CHECK-NOT: vscale
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%first.sroa = phi i64* [ %incdec.ptr, %for.body ], [ %start, %entry ]			%first.sroa = phi i64* [ %incdec.ptr, %for.body ], [ %start, %entry ]
	%iv = phi i64 [ %iv.next, %for.body ], [ 0, %entry ]			%iv = phi i64 [ %iv.next, %for.body ], [ 0, %entry ]
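The inline discussion on `@uniform_store_i1` turns on one observation: when every iteration stores to the same loop-invariant address, only the value written last survives, so a block of VF iterations needs to store just its last lane. The sketch below illustrates that equivalence in plain C++. It is not the vectorizer's output, and the function names and the `src`/`key` comparison are hypothetical stand-ins for the test's pointer comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference: each iteration overwrites the same destination.
bool scalarUniformStore(const std::vector<int64_t> &src, int64_t key,
                        bool initial) {
  bool dst = initial;
  for (int64_t v : src)
    dst = (v == key);
  return dst;
}

// Chunked version: per block of vf iterations, store only the last lane,
// then finish the remaining iterations with a scalar tail loop.
bool chunkedUniformStore(const std::vector<int64_t> &src, int64_t key,
                         bool initial, std::size_t vf) {
  assert(vf > 0 && "vectorization factor must be non-zero");
  bool dst = initial;
  std::size_t i = 0;
  for (; i + vf <= src.size(); i += vf)
    dst = (src[i + vf - 1] == key); // earlier lanes would be overwritten
  for (; i < src.size(); ++i)
    dst = (src[i] == key);
  return dst;
}
```

Both functions return the same value for any input and any non-zero `vf`, which is why a larger VF costs little here: the extra lanes add no extra stores. This does not hold when the store address varies per iteration, which matches the remark that the loop is not vectorized in that case.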

Diff 431384
