Page MenuHomePhabricator

[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth
ClosedPublic

Authored by jaykang10 on Feb 4 2022, 2:20 AM.

Details

Summary

Set the maximum VF of AArch64 with 128 / the size of smallest type in loop.

The performance improvement from benchmarks is as below.

SPEC2017
Benchmark       Improvement(%)
500.perlbench_r -0.44372
502.gcc_r        0.11339
505.mcf_r       -0.36421
520.omnetpp_r   -0.12037
523.xalancbmk_r -0.55858
525.x264_r      0.390159
531.deepsjeng_r -0.02378
541.leela_r     -0.01357
548.exchange2_r -0.00043
557.xz_r        -0.17387
Overall improvement(%) on an internal benchmark 0.238949

Diff Detail

Event Timeline

jaykang10 created this revision.Feb 4 2022, 2:20 AM
jaykang10 requested review of this revision.Feb 4 2022, 2:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 4 2022, 2:20 AM

Hello. This option makes a lot of sense to me, for the way AArch64 lowers vectors with extensions in them. I have a downstream set of benchmarks which, whilst not amazing, do show changes in vectorization quite well. There are some great improvements, but some things are not looking so healthy. I will send you some details. We may need to work through improving some of them, either by fixing the costs or improving the codegen for larger than legal types.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
139

It's generally best if fixed length vectorization doesn't start behaving differently just because SVE is available (unless it can be better, of course). If we expect MaximizeVectorBandwidth to be better, but doesn't work for scalable vectors well, can we just try to disable the scalable VFs from being widened?

Hello. This option makes a lot of sense to me, for the way AArch64 lowers vectors with extensions in them. I have a downstream set of benchmarks which, whilst not amazing, do show changes in vectorization quite well. There are some great improvements, but some things are not looking so healthy. I will send you some details. We may need to work through improving some of them, either by fixing the costs or improving the codegen for larger than legal types.

Thanks for comment! @dmgreen

I agree with you. We need to improve the performance regressions from this Max VF.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
139

I think it could be good to have some comments from SVE people... If possible, can you add them as reviewer please? I do not know well who work on it...

paulwalker-arm added inline comments.Feb 8 2022, 3:10 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
139

I agree with Dave, this decision is distinct from whether SVE is available or not. As well as this affecting the NEON side of things there are circumstances where SVE is also used for fixed length vectorisation. Perhaps this function should be changed to take a TargetTransformInfo::RegisterKind much like getRegisterBitWidth?

jaykang10 added inline comments.Feb 9 2022, 2:35 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
139

Thanks for comment @paulwalker-arm!
Let me update the function with the TargetTransformInfo::RegisterKind.

jaykang10 updated this revision to Diff 407116.Feb 9 2022, 4:13 AM

@paulwalker-arm As you can see, there are regression tests which are failed to generate scalable vector type with the shouldMaximizeVectorBandwidth.
Previously, I saw these regression tests were failed because of the cost so I returned false with SVE in shouldMaximizeVectorBandwidth...
I am not sure it is acceptable that LV selects VF 16 instead of scalable vector type on SVE...
If it is not acceptable, we could need to tune the cost on SVE and I would like to suggest to use shouldMaximizeVectorBandwidth for only neon until finishing the cost tune...

fhahn added a comment.Feb 9 2022, 5:13 AM

The performance improvement from benchmarks is as below.

Are the numbers SPEC scores and negative Improvement values regressions? If that's the case, it seems like for the benchmarks you shared the impact is negative overall?

I'm missing a bit of rationale for this change. There is an interplay between having a wider VF or having a larger interleave factor. For 128bit vectors, an add <4 x i64> %x, %y will be legalized into two adds. Conceptually this is similar to vectorizing with <2 x i64> and having an interleave-factor of 2. I can imagine that interleaving in the loop-vectorizer leads to better code, because it avoids issues around type legalisation and may provide more opportunities for other IR passes to optimize the IR or move things around. If we always choose a wider VF I wonder if that may lead to poorer codegen because of type-legalization.

Is there a specific example where it's clearly an improvement to have a wider VF? And would choosing a larger unroll-factor help those cases?

fhahn added a comment.Feb 9 2022, 12:31 PM

I'm missing a bit of rationale for this change. There is an interplay between having a wider VF or having a larger interleave factor. For 128bit vectors, an add <4 x i64> %x, %y will be legalized into two adds. Conceptually this is similar to vectorizing with <2 x i64> and having an interleave-factor of 2. I can imagine that interleaving in the loop-vectorizer leads to better code, because it avoids issues around type legalisation and may provide more opportunities for other IR passes to optimize the IR or move things around. If we always choose a wider VF I wonder if that may lead to poorer codegen because of type-legalization.

Is there a specific example where it's clearly an improvement to have a wider VF? And would choosing a larger unroll-factor help those cases?

One case where choosing a wider VF can be beneficial are loops with memory operations on types with different width, where the memory operations on the narrow type are not legal for the VF based on the widest type. This reminded me of an oldish outstanding patch that focuses on exactly that case: D96522. Unless there are other cases where maximizing the VF is clearly beneficial, iterating on D96522 might be an alternative.

Yeah. I had https://godbolt.org/z/3qWoY769v as an example, where is can make better use of the umull2 instructions, because of the wider vector loads as a single operation. That's what makes it beneficial for AArch64.

It sounds like D96522 might be a more limited way of getting the same result we want from this? That might be a good way forward. There were a number of regressions we would have to work through with this patch in the benchmarks I have access to, cases where either the codegen or the costmodelling need to be tweaked. (The only one I have so far is from some bad smull generation: https://godbolt.org/z/enox4ojhf. I still need to look through the rest).

The performance improvement from benchmarks is as below.

Are the numbers SPEC scores and negative Improvement values regressions? If that's the case, it seems like for the benchmarks you shared the impact is negative overall?

Thanks for comments @fhahn!

The score includes a bit noise... I am sorry for that...

505.mcf_r, 531.deepsjeng_r, 548.exchange2_r, 557.xz_r are not affected by this patch but the N1 machine showed me slightly different scores... The rest of spec2017's benchmarks have loops which affected by this patch.

I wanted to say this patch does not make overall negative impact for spec2017.

I'm missing a bit of rationale for this change. There is an interplay between having a wider VF or having a larger interleave factor. For 128bit vectors, an add <4 x i64> %x, %y will be legalized into two adds. Conceptually this is similar to vectorizing with <2 x i64> and having an interleave-factor of 2. I can imagine that interleaving in the loop-vectorizer leads to better code, because it avoids issues around type legalisation and may provide more opportunities for other IR passes to optimize the IR or move things around. If we always choose a wider VF I wonder if that may lead to poorer codegen because of type-legalization.

Is there a specific example where it's clearly an improvement to have a wider VF? And would choosing a larger unroll-factor help those cases?

Thanks for comment @sdesmalen!

Let's see a code snippet.

int test(int start, int size, char *src, char *dst) {
  int res = 0;
  for (int i = start; i < size; ++i) {
    res += *dst ^ *src;
    dst++;
    src++;
  }

  return res;
}

The assembly output of the vectorized loop is as below.

without this patch  --> VF 4 is selected.
.LBB0_5:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
	ldp	s3, s4, [x12, #-4]
	ldp	s5, s6, [x8, #-4]
	add	x8, x8, #8
	add	x12, x12, #8
	subs	x13, x13, #8
	ushll	v3.8h, v3.8b, #0
	ushll	v4.8h, v4.8b, #0
	ushll	v5.8h, v5.8b, #0
	ushll	v6.8h, v6.8b, #0
	eor	v3.8b, v5.8b, v3.8b
	eor	v4.8b, v6.8b, v4.8b
	ushll	v3.4s, v3.4h, #0
	ushll	v4.4s, v4.4h, #0
	and	v3.16b, v3.16b, v1.16b
	and	v4.16b, v4.16b, v1.16b
	add	v0.4s, v0.4s, v3.4s
	add	v2.4s, v2.4s, v4.4s
	b.ne	.LBB0_5
with this patch  --> VF 16 is selected
.LBB0_5:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
	ldp	q16, q18, [x12, #-16]
	add	x12, x12, #32
	subs	x13, x13, #32
	ldp	q17, q19, [x8, #-16]
	add	x8, x8, #32
	eor	v16.16b, v17.16b, v16.16b
	eor	v17.16b, v19.16b, v18.16b
	ushll2	v18.8h, v16.16b, #0
	ushll	v16.8h, v16.8b, #0
	ushll	v19.8h, v17.8b, #0
	ushll2	v17.8h, v17.16b, #0
	uaddw2	v2.4s, v2.4s, v18.8h
	uaddw	v1.4s, v1.4s, v18.4h
	uaddw2	v3.4s, v3.4s, v16.8h
	uaddw	v0.4s, v0.4s, v16.4h
	uaddw2	v6.4s, v6.4s, v17.8h
	uaddw	v5.4s, v5.4s, v17.4h
	uaddw2	v7.4s, v7.4s, v19.8h
	uaddw	v4.4s, v4.4s, v19.4h
	b.ne	.LBB0_5

We can see the uaddw instructions on the output with VF=16. AArch64 has below pattern definition and it is selected for the uaddw.

multiclass SIMDWideThreeVectorBHS<bit U, bits<4> opc, string asm,
                                  SDPatternOperator OpNode> {
...
  def v4i16_v4i32  : BaseSIMDDifferentThreeVector<U, 0b010, opc,
                                                  V128, V128, V64,
                                                  asm, ".4s", ".4s", ".4h",
       [(set (v4i32 V128:$Rd), (OpNode (v4i32 V128:$Rn), (v4i16 V64:$Rm)))]>;
  def v8i16_v4i32  : BaseSIMDDifferentThreeVector<U, 0b011, opc,
                                                  V128, V128, V128, 
                                                  asm#"2", ".4s", ".4s", ".8h",
       [(set (v4i32 V128:$Rd), (OpNode (v4i32 V128:$Rn),
                                       (extract_high_v8i16 V128:$Rm)))]>;
...
defm UADDW   : SIMDWideThreeVectorBHS<1, 0b0001, "uaddw",
                 BinOpFrag<(add node:$LHS, (zanyext node:$RHS))>>;

Given the number of instructions, we could expect the loop handles almost 4 times more data per iteration ideally.
As @dmgreen mentioned, we are seeing some performance degradations. In dave's case, it looks the LV generates shuffle vectors and it blocks to lower the MUL to SMULL. As an other case, if LV detects interleaved group, it generates shuffle vectors with big number of elements. The shuffle vectors cause lots of mov instructions. Maybe, there are more cases for the performance degradation but it could show us more opportunities to get better performance score. That's what I want from this patch...

So I'm wondering why we don't just control the functionality with a default off command line flag. That way it's available for testing, including unit-tests, until such a point where code generation is at a point where is makes sense to default it to on. Is this a terribly idea?

So I'm wondering why we don't just control the functionality with a default off command line flag. That way it's available for testing, including unit-tests, until such a point where code generation is at a point where is makes sense to default it to on. Is this a terribly idea?

There is an option -vectorizer-maximize-bandwidth to control the functionality in LV.

static cl::opt<bool> MaximizeBandwidth(
    "vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
    cl::desc("Maximize bandwidth when selecting vectorization factor which "
             "will be determined by the smallest type in loop."));

So I'm wondering why we don't just control the functionality with a default off command line flag. That way it's available for testing, including unit-tests, until such a point where code generation is at a point where is makes sense to default it to on. Is this a terribly idea?

There is an option -vectorizer-maximize-bandwidth to control the functionality in LV.

static cl::opt<bool> MaximizeBandwidth(
    "vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
    cl::desc("Maximize bandwidth when selecting vectorization factor which "
             "will be determined by the smallest type in loop."));

Oh sure, but that affects all targets. Whereas I thought here we're talking about finding a migration path to enable it by default for AArch64 only.

So I'm wondering why we don't just control the functionality with a default off command line flag. That way it's available for testing, including unit-tests, until such a point where code generation is at a point where is makes sense to default it to on. Is this a terribly idea?

There is an option -vectorizer-maximize-bandwidth to control the functionality in LV.

static cl::opt<bool> MaximizeBandwidth(
    "vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
    cl::desc("Maximize bandwidth when selecting vectorization factor which "
             "will be determined by the smallest type in loop."));

Oh sure, but that affects all targets. Whereas I thought here we're talking about finding a migration path to enable it by default for AArch64 only.

We could add similar option to AArch64TargetTransformInfo.cpp and use it in the shouldMaximizeVectorBandwidth of AArch64.

jaykang10 updated this revision to Diff 407489.Feb 10 2022, 5:17 AM

Following the comment of @paulwalker-arm, a option is added.

Matt added a subscriber: Matt.Mar 17 2022, 5:56 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 17 2022, 5:56 PM

I think this option makes a lot of sense - and we have cleaned up a lot of places where it was causing issues. This can change a lot of performance, but it seems to be pretty good in my experiments. More is likely to come up, but I think we are close to ready to go so long as we keep an eye on the performance.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
40

There is a vectorizer-maximize-bandwidth option is the vectorizer that can override the target option for shouldMaximizeVectorBandwidth. I don't think adding an aarch64 option is necessary, can you remove it?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5219

K is a bit of a short name. Perhaps use RegKind or something like it?

llvm/test/Transforms/LoopVectorize/AArch64/sve-illegal-type.ll
90

This is worrying - should it be vectorizing 64x for in i1 type! (and are there a lot of other extracts now)?

paulwalker-arm added inline comments.Apr 4 2022, 3:44 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
40

I agree. My original ask was because I thought there were concerns about enabling this by default. Given the flag still defaults to on and it seems we're happy to make this change for AArch64 I retract my previous ask.

jaykang10 added inline comments.Apr 4 2022, 6:14 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
40

Yep, let me remove this option.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5219

Yep, let me change it.

llvm/test/Transforms/LoopVectorize/AArch64/sve-illegal-type.ll
90

When I checked it, it looked the dagcombiner combines the 64 times i1 extract_vector_elt and store nodes to one 64 bit store node.
Let me check it again.

jaykang10 added inline comments.Apr 4 2022, 6:32 AM
llvm/test/Transforms/LoopVectorize/AArch64/sve-illegal-type.ll
90

um... in this test, the %dst is passed as parameter so it is not changed in the loop. Therefore, the last element of <64 x i1>vector needs to be stored. It looks dagcombiner catches it and optimizes the nodes well. The assembly output of vector.body block from llc is as below. It looks ok.

.LBB0_3:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
	dup	v2.2d, x12
	add	x12, x12, #512
	subs	x11, x11, #64
	add	v2.2d, v2.2d, v0.2d
	cmeq	v2.2d, v2.2d, v1.2d
	xtn2	v2.4s, v2.2d
	xtn2	v2.8h, v2.4s
	xtn	v2.8b, v2.8h
	umov	w13, v2.b[7]
	and	w13, w13, #0x1
	strb	w13, [x0]
	b.ne	.LBB0_3
jaykang10 updated this revision to Diff 420164.Apr 4 2022, 6:37 AM

Following the comment from @dmgreen, updated code.

dmgreen accepted this revision.Apr 5 2022, 2:30 AM

Thanks for the update. This LGTM.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
40

Yeah - I think the vectorizer-maximize-bandwidth option worked differently back when the comment was suggested too.

llvm/test/Transforms/LoopVectorize/AArch64/sve-illegal-type.ll
90

Ah I see, that makes sense that it would pick the higher factor then. It looks like if the address is varying it does not vectorize.

This revision is now accepted and ready to land.Apr 5 2022, 2:30 AM
fhahn added inline comments.Apr 5 2022, 2:57 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
940–941

Document K?

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
56

simpler to just have return K == TargetTransformInfo::RGK_FixedWidthVector;?

Possibly with an assert that K is not RGK_Scalar.

jaykang10 added inline comments.Apr 5 2022, 3:26 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
940–941

Yep, let me add it.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
56

Yep, let me update it.

jaykang10 updated this revision to Diff 420448.Apr 5 2022, 4:05 AM

Following comments from @fhahn, updated patch.

This revision was landed with ongoing or failed builds.Apr 5 2022, 5:19 AM
This revision was automatically updated to reflect the committed changes.

This broke LLVM AArch64 buildbot clang-aarch64-sve-vls-2stage
https://lab.llvm.org/buildbot/#/builders/176/builds/1515
llvm-tblgen crashes after applying this patch.

I have reverted it for now. Kindly have a look.

omjavaid reopened this revision.Apr 12 2022, 4:54 PM
This revision is now accepted and ready to land.Apr 12 2022, 4:54 PM

This broke LLVM AArch64 buildbot clang-aarch64-sve-vls-2stage
https://lab.llvm.org/buildbot/#/builders/176/builds/1515
llvm-tblgen crashes after applying this patch.

I have reverted it for now. Kindly have a look.

Thanks for reverting the commit. I did not check stage2 build on sve...
If possible, can you share the sve vls environment to reproduce the stage2 build error please?

um... I have tried to reproduce it on a64fx machine. It looks it is failed with the big number of parallel tasks of ninja build. When I run the failed command manually, it is passed... If I reduce the number of the parallel tasks with ninja build like ninja -j16, the build is also passed... it could be memory allocation error from the bigger memory allocation than before...
The stage2 build and check-all are passed on Cortex-A72 machine...
To be safe, for now, I would like to disable shouldMaximizeVectorBandwidth for SVE...
Let me update this patch with it.

fhahn added a comment.Apr 13 2022, 8:29 AM

um... I have tried to reproduce it on a64fx machine. It looks it is failed with the big number of parallel tasks of ninja build. When I run the failed command manually, it is passed... If I reduce the number of the parallel tasks with ninja build like ninja -j16, the build is also passed... it could be memory allocation error from the bigger memory allocation than before...
The stage2 build and check-all are passed on Cortex-A72 machine...
To be safe, for now, I would like to disable shouldMaximizeVectorBandwidth for SVE...
Let me update this patch with it.

I think it would be best to figure out if there's actually an issue/miscompile before re-landing. As is, the patch should only enable shouldMaximizeVectorBandwidth already, right?

um... I have tried to reproduce it on a64fx machine. It looks it is failed with the big number of parallel tasks of ninja build. When I run the failed command manually, it is passed... If I reduce the number of the parallel tasks with ninja build like ninja -j16, the build is also passed... it could be memory allocation error from the bigger memory allocation than before...
The stage2 build and check-all are passed on Cortex-A72 machine...
To be safe, for now, I would like to disable shouldMaximizeVectorBandwidth for SVE...
Let me update this patch with it.

I think it would be best to figure out if there's actually an issue/miscompile before re-landing. As is, the patch should only enable shouldMaximizeVectorBandwidth already, right?

um... To be honest, I am not expert of the SVE arch. Initially, I aimed to enable it for only neon arch.
@paulwalker-arm If possible, can you help me to investigate the build error from the buildbot for clang-aarch64-sve-vls-2stage please?

um... I have tried to reproduce it on a64fx machine. It looks it is failed with the big number of parallel tasks of ninja build. When I run the failed command manually, it is passed... If I reduce the number of the parallel tasks with ninja build like ninja -j16, the build is also passed... it could be memory allocation error from the bigger memory allocation than before...
The stage2 build and check-all are passed on Cortex-A72 machine...
To be safe, for now, I would like to disable shouldMaximizeVectorBandwidth for SVE...
Let me update this patch with it.

I did have a completely difference experience, here are steps to reproduce:

Stage 1:

CC=$(pwd)/clang+llvm-13.0.1-aarch64-linux-gnu/bin/clang CXX=$(pwd)/clang+llvm-13.0.1-aarch64-linux-gnu/bin/clang++
cmake -G Ninja ../llvm/llvm \
	-DCMAKE_BUILD_TYPE=Release \
	-DLLVM_ENABLE_ASSERTIONS=True \
	'-DLLVM_LIT_ARGS='"'"'-v'"'"'' \
	-DCMAKE_INSTALL_PREFIX=../stage$1.install \
	-DCMAKE_C_COMPILER=$CC \
	-DCMAKE_CXX_COMPILER=$CXX \
	'-DCMAKE_C_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
	'-DCMAKE_CXX_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
       	-DLLVM_ENABLE_LLD=True \
	'-DLLVM_LIT_ARGS='"'"'-v -j12'"'"'' \
	'-DLLVM_ENABLE_PROJECTS=llvm;mlir;clang-tools-extra;compiler-rt;clang;lld;flang'
ninja

Stage 2:

CC=$(pwd)/stage1/bin/clang CXX=$(pwd)/stage1/bin/clang++
cmake -G Ninja ../llvm/llvm \
	-DCMAKE_BUILD_TYPE=Release \
	-DLLVM_ENABLE_ASSERTIONS=True \
	'-DLLVM_LIT_ARGS='"'"'-v'"'"'' \
	-DCMAKE_INSTALL_PREFIX=../stage$1.install \
	-DCMAKE_C_COMPILER=$CC \
	-DCMAKE_CXX_COMPILER=$CXX \
	'-DCMAKE_C_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
	'-DCMAKE_CXX_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
       	-DLLVM_ENABLE_LLD=True \
	'-DLLVM_LIT_ARGS='"'"'-v -j12'"'"'' \
	'-DLLVM_ENABLE_PROJECTS=llvm;mlir;clang-tools-extra;compiler-rt;clang;lld;flang'

ninja llvm-tblgen

/home/omair.javaid/work/llvm-test/stage2/bin/llvm-tblgen -gen-dag-isel -I /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target/PowerPC -I/home/omair.javaid/work/llvm-test/stage2/include -I/home/omair.javaid/work/llvm-test/llvm/llvm/include -I /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target -omit-comments /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target/PowerPC/PPC.td --write-if-changed -o lib/Target/PowerPC/PPCGenDAGISel.inc -d lib/Target/PowerPC/PPCGenDAGISel.inc.d
jaykang10 added a comment.EditedApr 13 2022, 6:02 PM

um... I have tried to reproduce it on a64fx machine. It looks it is failed with the big number of parallel tasks of ninja build. When I run the failed command manually, it is passed... If I reduce the number of the parallel tasks with ninja build like ninja -j16, the build is also passed... it could be memory allocation error from the bigger memory allocation than before...
The stage2 build and check-all are passed on Cortex-A72 machine...
To be safe, for now, I would like to disable shouldMaximizeVectorBandwidth for SVE...
Let me update this patch with it.

I did have a completely difference experience, here are steps to reproduce:

Stage 1:

CC=$(pwd)/clang+llvm-13.0.1-aarch64-linux-gnu/bin/clang CXX=$(pwd)/clang+llvm-13.0.1-aarch64-linux-gnu/bin/clang++
cmake -G Ninja ../llvm/llvm \
	-DCMAKE_BUILD_TYPE=Release \
	-DLLVM_ENABLE_ASSERTIONS=True \
	'-DLLVM_LIT_ARGS='"'"'-v'"'"'' \
	-DCMAKE_INSTALL_PREFIX=../stage$1.install \
	-DCMAKE_C_COMPILER=$CC \
	-DCMAKE_CXX_COMPILER=$CXX \
	'-DCMAKE_C_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
	'-DCMAKE_CXX_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
       	-DLLVM_ENABLE_LLD=True \
	'-DLLVM_LIT_ARGS='"'"'-v -j12'"'"'' \
	'-DLLVM_ENABLE_PROJECTS=llvm;mlir;clang-tools-extra;compiler-rt;clang;lld;flang'
ninja

Stage 2:

CC=$(pwd)/stage1/bin/clang CXX=$(pwd)/stage1/bin/clang++
cmake -G Ninja ../llvm/llvm \
	-DCMAKE_BUILD_TYPE=Release \
	-DLLVM_ENABLE_ASSERTIONS=True \
	'-DLLVM_LIT_ARGS='"'"'-v'"'"'' \
	-DCMAKE_INSTALL_PREFIX=../stage$1.install \
	-DCMAKE_C_COMPILER=$CC \
	-DCMAKE_CXX_COMPILER=$CXX \
	'-DCMAKE_C_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
	'-DCMAKE_CXX_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
       	-DLLVM_ENABLE_LLD=True \
	'-DLLVM_LIT_ARGS='"'"'-v -j12'"'"'' \
	'-DLLVM_ENABLE_PROJECTS=llvm;mlir;clang-tools-extra;compiler-rt;clang;lld;flang'

ninja llvm-tblgen

/home/omair.javaid/work/llvm-test/stage2/bin/llvm-tblgen -gen-dag-isel -I /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target/PowerPC -I/home/omair.javaid/work/llvm-test/stage2/include -I/home/omair.javaid/work/llvm-test/llvm/llvm/include -I /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target -omit-comments /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target/PowerPC/PPC.td --write-if-changed -o lib/Target/PowerP

um... I have tried to reproduce it on a64fx machine. It looks it is failed with the big number of parallel tasks of ninja build. When I run the failed command manually, it is passed... If I reduce the number of the parallel tasks with ninja build like ninja -j16, the build is also passed... it could be memory allocation error from the bigger memory allocation than before...
The stage2 build and check-all are passed on Cortex-A72 machine...
To be safe, for now, I would like to disable shouldMaximizeVectorBandwidth for SVE...
Let me update this patch with it.

I did have a completely difference experience, here are steps to reproduce:

Stage 1:

CC=$(pwd)/clang+llvm-13.0.1-aarch64-linux-gnu/bin/clang CXX=$(pwd)/clang+llvm-13.0.1-aarch64-linux-gnu/bin/clang++
cmake -G Ninja ../llvm/llvm \
	-DCMAKE_BUILD_TYPE=Release \
	-DLLVM_ENABLE_ASSERTIONS=True \
	'-DLLVM_LIT_ARGS='"'"'-v'"'"'' \
	-DCMAKE_INSTALL_PREFIX=../stage$1.install \
	-DCMAKE_C_COMPILER=$CC \
	-DCMAKE_CXX_COMPILER=$CXX \
	'-DCMAKE_C_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
	'-DCMAKE_CXX_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
       	-DLLVM_ENABLE_LLD=True \
	'-DLLVM_LIT_ARGS='"'"'-v -j12'"'"'' \
	'-DLLVM_ENABLE_PROJECTS=llvm;mlir;clang-tools-extra;compiler-rt;clang;lld;flang'
ninja

Stage 2:

CC=$(pwd)/stage1/bin/clang CXX=$(pwd)/stage1/bin/clang++
cmake -G Ninja ../llvm/llvm \
	-DCMAKE_BUILD_TYPE=Release \
	-DLLVM_ENABLE_ASSERTIONS=True \
	'-DLLVM_LIT_ARGS='"'"'-v'"'"'' \
	-DCMAKE_INSTALL_PREFIX=../stage$1.install \
	-DCMAKE_C_COMPILER=$CC \
	-DCMAKE_CXX_COMPILER=$CXX \
	'-DCMAKE_C_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
	'-DCMAKE_CXX_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
       	-DLLVM_ENABLE_LLD=True \
	'-DLLVM_LIT_ARGS='"'"'-v -j12'"'"'' \
	'-DLLVM_ENABLE_PROJECTS=llvm;mlir;clang-tools-extra;compiler-rt;clang;lld;flang'

ninja llvm-tblgen

/home/omair.javaid/work/llvm-test/stage2/bin/llvm-tblgen -gen-dag-isel -I /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target/PowerPC -I/home/omair.javaid/work/llvm-test/stage2/include -I/home/omair.javaid/work/llvm-test/llvm/llvm/include -I /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target -omit-comments /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target/PowerPC/PPC.td --write-if-changed -o lib/Target/PowerPC/PPCGenDAGISel.inc -d lib/Target/PowerPC/PPCGenDAGISel.inc.d

um... I also followed same thing from build bot's 1,2 stage cmake options on a64fx machine.

stage1
cmake -G Ninja ../llvm -DCMAKE_C_COMPILER=/home/jinkan01/Projects/llvm-project/build-clang-13.0.1/bin/clang -DCMAKE_CXX_COMPILER=/home/jinkan01/Projects/llvm-project/build-clang-13.0.1/bin/clang++ -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=True '-DLLVM_LIT_ARGS='"'"'-v'"'"'' -DCMAKE_INSTALL_PREFIX=../stage1.install '-DCMAKE_C_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' '-DCMAKE_CXX_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' -DLLVM_ENABLE_LLD=True '-DLLVM_LIT_ARGS='"'"'-v -j12'"'"'' '-DLLVM_ENABLE_PROJECTS=llvm;mlir;clang-tools-extra;compiler-rt;clang;lld;flang'

ninja install

stage 2
cmake -G Ninja ../llvm -DCMAKE_C_COMPILER=/home/jinkan01/Projects/llvm-project/stage1.install/bin/clang -DCMAKE_CXX_COMPILER=/home/jinkan01/Projects/llvm-project/stage1.install/bin/clang++ -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=True '-DLLVM_LIT_ARGS='"'"'-v'"'"'' -DCMAKE_INSTALL_PREFIX=../stage2.install '-DCMAKE_C_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' '-DCMAKE_CXX_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' -DLLVM_ENABLE_LLD=True '-DLLVM_LIT_ARGS='"'"'-v -j12'"'"'' '-DLLVM_ENABLE_PROJECTS=llvm;mlir;clang-tools-extra;compiler-rt;clang;lld;flang'

ninja

Let me check it with your one again.
Can you let me know which cpu your testing machine has please? I guessed you are using a64fx machine from the -mcpu=a64fx option.

um... I can reproduce the build error now... Maybe, there was something wrong with my build...
Thanks for help @omjavaid

It looks the SVE target's data layout is missing 512-bit vector's alignment. The data layout does not mention the alignment of 512-bit vector so the alignment is same with its size. It is bigger than stack's alignment which is 128-bits and it causes stack re-alignment...

For the stage2 failure with llvm-tablegen, the ContractNodes function of tablegen has loop which is vectorized with VF 128 because it has i1 type... I did not expect vectorization with VF 128... It causes lots of spill codes... On PrologEpilogInserter, the default stack size is 3584 bytes and the SVE stack size is 4416 bytes... We need to avoid the VF 512... Anyway, the ContractNodes is called recursively and the stack re-alignment causes wrong stack overwriting...

After setting up the 512-bit vector type's alignment as 128-bit, the stage2 failure is gone.

It could be ok to disable the shouldMaximizeVectorBandwidth for SVE... because it could need cost model change not only vector type alignment... Let me discuss it with the team more...

I have checked the commit from https://reviews.llvm.org/D125918 has fixed the stage 2 build error with this patch.
Thanks for fixing it! @paulwalker-arm and @peterwaller-arm
Let me push this patch again.