This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Enable scalable vectorization by default for RVV
AbandonedPublic

Authored by Miss_Grape on May 17 2022, 12:48 AM.

Details

Diff Detail

Event Timeline

Miss_Grape created this revision.May 17 2022, 12:48 AM
Herald added a project: Restricted Project. · View Herald Transcript · May 17 2022, 12:48 AM
Miss_Grape requested review of this revision.May 17 2022, 12:48 AM

Do you have any performance data?

reames added a subscriber: reames.May 17 2022, 7:10 AM

I doubt we want scalable vectorization enabled by default right now. As Craig said, we need performance data to justify, but the impression I've gotten asking around is that this is not yet ready for prime time.

More than that though, we probably never want scalable vectorization for all configurations. Having it enabled for default generic targets may make sense, but if we're targeting a particular CPU with full knowledge of vector lengths, classic fixed length is probably a better default.

pcwang-thead added a comment.EditedMay 17 2022, 8:31 PM

I did some research on this earlier; the quality of the vectorized code hasn't met our expectations yet.
At the moment this is really just SIMD-style vectorization for RVV.
Maybe we should wait for VP-based vectorization?

llvm/test/Transforms/LoopVectorize/RISCV/unroll-in-loop-vectorizer.ll
48

For example, I believe this store is unnecessary.

reames added inline comments.May 17 2022, 9:06 PM
llvm/test/Transforms/LoopVectorize/RISCV/unroll-in-loop-vectorizer.ll
6

As an aside, I took a look at the assembly for this example. The codegen for the vector loop ends up being less than great. For instance:

  • We have an extend trapped in the loop for some reason
  • We rerun vsetvli on every iteration (despite it producing a fixed result)
  • We have a rem in the vector preheader. That's rather expensive. (Well, actually, we end up with a *libcall* because the attributes don't include +m, but I ignored that.)
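In C terms, the preheader computation being criticized looks roughly like the following sketch (illustrative names of my own; `vf` stands for the runtime vector factor and is not an LLVM symbol):

```c
#include <stdint.h>

/* Sketch of the trip-count rounding the vectorizer emits in the vector
 * preheader. With a scalable vector factor `vf` known only at runtime,
 * rounding the trip count `n` down to a multiple of `vf` needs a urem,
 * which is expensive -- and becomes a libcall when the function's
 * attributes lack +m. */
static uint64_t vector_trip_count(uint64_t n, uint64_t vf) {
  uint64_t remainder = n % vf; /* the rem in the vector preheader */
  return n - remainder;        /* n.vec: elements the vector loop handles */
}
```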
reames added inline comments.May 17 2022, 9:10 PM
llvm/test/Transforms/LoopVectorize/RISCV/unroll-in-loop-vectorizer.ll
6

To be fair, the first two issues also exist with fixed length vectorization of the same example.

craig.topper added inline comments.May 17 2022, 10:07 PM
llvm/test/Transforms/LoopVectorize/RISCV/unroll-in-loop-vectorizer.ll
48

Why is it unnecessary? I think the loop is processed using two vector loads/stores per iteration due to RISCVSubtarget::getMaxInterleaveFactor().
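For illustration, an interleave count of 2 gives the vector body roughly the following shape (a scalar C sketch under the assumption of a fixed VF of 4; the names are mine, not the compiler's):

```c
#include <stddef.h>

#define VF 4 /* illustrative fixed stand-in for the vector factor */

/* Sketch: with an interleave (unroll) factor of 2, each vector-body
 * iteration covers 2*VF elements, so it issues two vector stores per
 * iteration -- which is why the second store is expected here. */
static void fill_interleaved(int *a, size_t n, int v) {
  size_t i = 0;
  for (; i + 2 * VF <= n; i += 2 * VF) {
    for (size_t j = 0; j < VF; ++j) a[i + j] = v;      /* first "store" */
    for (size_t j = 0; j < VF; ++j) a[i + VF + j] = v; /* second "store" */
  }
  for (; i < n; ++i) a[i] = v; /* remainder goes to the scalar loop */
}
```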

craig.topper added inline comments.May 17 2022, 10:11 PM
llvm/test/Transforms/LoopVectorize/RISCV/unroll-in-loop-vectorizer.ll
6

What kind of extend?

llvm/test/Transforms/LoopVectorize/RISCV/unroll-in-loop-vectorizer.ll
48

Oops, you're right.
I think the problem is that we are using RVV as SIMD now, so we will:

  • read vlenb to calculate vector length.
  • use vsetvli _, zero, ... to set vl/vtype and ignore returned vl. This is inserted by InsertVSETVLI pass.
  • handle tail in scalar loop (may have been improved in D121595?).
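The contrast between the SIMD-style lowering described in the list above and true VL-based stripmining can be sketched in scalar C (illustrative only; the inner loops stand in for single vector operations, and `vlmax` stands for the maximum vector length):

```c
#include <stddef.h>

/* SIMD style: fixed chunk size, leftover elements go to a scalar tail. */
static void add_simd_style(int *a, const int *b, size_t n, size_t vlmax) {
  size_t i = 0;
  for (; i + vlmax <= n; i += vlmax)   /* vector body, vl fixed at vlmax */
    for (size_t j = 0; j < vlmax; ++j) /* stands in for one vector op */
      a[i + j] += b[i + j];
  for (; i < n; ++i)                   /* scalar tail loop */
    a[i] += b[i];
}

/* VL stripmining: vsetvli returns vl = min(remaining, vlmax), no tail. */
static void add_vl_style(int *a, const int *b, size_t n, size_t vlmax) {
  for (size_t i = 0; i < n;) {
    size_t vl = n - i < vlmax ? n - i : vlmax; /* what vsetvli would return */
    for (size_t j = 0; j < vl; ++j)
      a[i + j] += b[i + j];
    i += vl;
  }
}
```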
reames added inline comments.May 18 2022, 7:42 AM
llvm/test/Transforms/LoopVectorize/RISCV/unroll-in-loop-vectorizer.ll
6

zext - it looks easily removable; I'm surprised to see it after O2. This shouldn't be a hard fix. It appears in both the vector and scalar loops.

Do you have any performance data?

I used the TSVC test suite, then ran it on Spike.

With the option -scalable-vectorization=on, the performance is better. But I'm not sure whether performance data from a Spike run can be used as a standard for measuring performance.

Do you have any performance data?

I used the TSVC test suite, then ran it on Spike.

With the option -scalable-vectorization=on, the performance is better. But I'm not sure whether performance data from a Spike run can be used as a standard for measuring performance.

Better than specifying -riscv-v-vector-bits-min to match the machine width?

craig.topper added inline comments.May 27 2022, 8:22 AM
llvm/test/Transforms/LoopVectorize/RISCV/unroll-in-loop-vectorizer.ll
6

Can you share what you're seeing? I see a zero extend in the preheader but not in the loops.

Do you have any performance data?

I used the TSVC test suite, then ran it on Spike.

With the option -scalable-vectorization=on, the performance is better. But I'm not sure whether performance data from a Spike run can be used as a standard for measuring performance.

Better than specifying -riscv-v-vector-bits-min to match the machine width?

@craig.topper I think this is somewhat the wrong question here. While I agree that fixed length should be our eventual default for known vector lengths, we currently don't enable any vectorization. If we can show either form of vectorization is generally profitable over the no-vectorization configuration, we should enable. We can then evaluate the other configuration against that new baseline.

@Miss_Grape I struggle to make out what that screenshot is conveying. Could you summarize please? Also, a text attachment is greatly preferred over images.

llvm/test/Transforms/LoopVectorize/RISCV/unroll-in-loop-vectorizer.ll
6

When running just loop-vectorize and then llc, I do not see the extend. When replacing -loop-vectorize with -O2 on the same input, I do.

./opt -mtriple=riscv64 -mattr=+v,+m test/Transforms/LoopVectorize/RISCV/unroll-in-loop-vectorizer.ll -loop-vectorize -scalable-vectorization=on -S -riscv-v-vector-bits-max=512 | ./llc -O3 
./opt -mtriple=riscv64 -mattr=+v,+m test/Transforms/LoopVectorize/RISCV/unroll-in-loop-vectorizer.ll -O2 -scalable-vectorization=on -S -riscv-v-vector-bits-max=512 | ./llc -O3

The key bit of IR after -O2 is:

%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%13 = zext i32 %index to i64
// ... 
%index.next = add nuw i32 %index, %12
%23 = icmp eq i32 %index.next, %n.vec
br i1 %23, label %middle.block, label %vector.body, !llvm.loop !0

We should be able to widen the %index IV in indvars without issue; I'm a bit surprised we don't.
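The underlying pattern is easy to reproduce in C: a 32-bit induction variable used to index memory on a 64-bit target needs a per-iteration zero-extend unless the IV is widened, which is what indvars would normally do. A sketch:

```c
#include <stdint.h>

/* With a 32-bit IV, addressing p[i] on a 64-bit target such as riscv64
 * requires a zext i32 -> i64 each iteration unless the IV is widened. */
static long sum_narrow_iv(const long *p, uint32_t n) {
  long s = 0;
  for (uint32_t i = 0; i < n; ++i) /* i gets zero-extended to index p */
    s += p[i];
  return s;
}

/* The widened form indvars should produce: no per-iteration extend. */
static long sum_wide_iv(const long *p, uint32_t n) {
  long s = 0;
  for (uint64_t i = 0; i < n; ++i)
    s += p[i];
  return s;
}
```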

Interestingly, if I add Zba, something in codegen manages to fold away the extend now. We didn't when I'd last looked.

./opt -mtriple=riscv64 -mattr=+v,+m,+zba test/Transforms/LoopVectorize/RISCV/unroll-in-loop-vectorizer.ll -O2 -scalable-vectorization=on -S -riscv-v-vector-bits-max=512 | ./llc -O3

To be clear, this is a minor codegen issue at best, and definitely nothing which should block this patch.

craig.topper added inline comments.May 27 2022, 11:30 AM
llvm/test/Transforms/LoopVectorize/RISCV/unroll-in-loop-vectorizer.ll
6

Does opt not automatically infer the DataLayout from the triple? Adding a proper DataLayout seems to make the zext go away.

Do you have a set of tests with -scalable-vectorization=on and without to compare the code quality?

Do you have any performance data?

I used the TSVC test suite, then ran it on Spike.

With the option -scalable-vectorization=on, the performance is better. But I'm not sure whether performance data from a Spike run can be used as a standard for measuring performance.

Better than specifying -riscv-v-vector-bits-min to match the machine width?

@craig.topper I think this is somewhat the wrong question here. While I agree that fixed length should be our eventual default for known vector lengths, we currently don't enable any vectorization. If we can show either form of vectorization is generally profitable over the no-vectorization configuration, we should enable. We can then evaluate the other configuration against that new baseline.

@Miss_Grape I struggle to make out what that screenshot is conveying. Could you summarize please? Also, a text attachment is greatly preferred over images.

  1. TSVC test suite cases.
  2. Options (by default, -scalable-vectorization is off):

1)clang --target=riscv64-unknown-elf --sysroot=$HOME/task/rvv/riscv64-unknown-elf --gcc-toolchain=$HOME/task/rvv --march=rv64gcv -O3 -mllvm -riscv-v-vector-bits-max=128 -mllvm -riscv-v-vector-bits-min=128 -mllvm -scalable-vectorization=on tsc.c dummy.c -lm -o xxx
2)clang --target=riscv64-unknown-elf --sysroot=$HOME/task/rvv/riscv64-unknown-elf --gcc-toolchain=$HOME/task/rvv -march=rv64gcv -O3 -mllvm -riscv-v-vector-bits-max=128 -mllvm -riscv-v-vector-bits-min=128 tsc.c dummy.c -lm -o xxx

  3. Download RISC-V's pk and run Spike:

/home/wengliqin/task/rvv/bin/spike pk xxx; then you get the results shown in the following figure

Do you have any performance data?

I used the TSVC test suite, then ran it on Spike.

With the option -scalable-vectorization=on, the performance is better. But I'm not sure whether performance data from a Spike run can be used as a standard for measuring performance.

Better than specifying -riscv-v-vector-bits-min to match the machine width?

@craig.topper I think this is somewhat the wrong question here. While I agree that fixed length should be our eventual default for known vector lengths, we currently don't enable any vectorization. If we can show either form of vectorization is generally profitable over the no-vectorization configuration, we should enable. We can then evaluate the other configuration against that new baseline.

@Miss_Grape I struggle to make out what that screenshot is conveying. Could you summarize please? Also, a text attachment is greatly preferred over images.

This is the result of a run on Spike:


Do you have any performance data?

I used the TSVC test suite, then ran it on Spike.

With the option -scalable-vectorization=on, the performance is better. But I'm not sure whether performance data from a Spike run can be used as a standard for measuring performance.

Better than specifying -riscv-v-vector-bits-min to match the machine width?

@craig.topper I think this is somewhat the wrong question here. While I agree that fixed length should be our eventual default for known vector lengths, we currently don't enable any vectorization. If we can show either form of vectorization is generally profitable over the no-vectorization configuration, we should enable. We can then evaluate the other configuration against that new baseline.

@Miss_Grape I struggle to make out what that screenshot is conveying. Could you summarize please? Also, a text attachment is greatly preferred over images.

  1. TSVC test suite cases.
  2. Options (by default, -scalable-vectorization is off):

1)clang --target=riscv64-unknown-elf --sysroot=$HOME/task/rvv/riscv64-unknown-elf --gcc-toolchain=$HOME/task/rvv --march=rv64gcv -O3 -mllvm -riscv-v-vector-bits-max=128 -mllvm -riscv-v-vector-bits-min=128 -mllvm -scalable-vectorization=on tsc.c dummy.c -lm -o xxx
2)clang --target=riscv64-unknown-elf --sysroot=$HOME/task/rvv/riscv64-unknown-elf --gcc-toolchain=$HOME/task/rvv -march=rv64gcv -O3 -mllvm -riscv-v-vector-bits-max=128 -mllvm -riscv-v-vector-bits-min=128 tsc.c dummy.c -lm -o xxx

  3. Download RISC-V's pk and run Spike:

/home/wengliqin/task/rvv/bin/spike pk xxx; then you get the results shown in the following figure

Hi, I tried your test suite, but there are a lot of warnings like:
warning: 'set1ds' accessing 128000 bytes in a region of size 85336 [-Wstringop-overflow=]

653 |                 set1ds(LEN/3,   &d[LEN/3],      zero,unit);

Do they influence the testing?

reames requested changes to this revision.Jun 2 2022, 12:28 PM

FYI, we seem to have some lurking functional issues with scalable vectorization. I tend to use SQLite as an easy test case for spotting assembly issues, and trying it with scalable vectorization (and without specifying a min vector length) promptly crashed.

clang -isystem /.../gnu-toolchain/sysroot/usr/include/ --target=riscv64 -Xclang -target-feature -Xclang +v,+f,+m,+c,+d,+zba sqlite-autoconf-3380500/sqlite3.c -c -O2 -S -mllvm -scalable-vectorization=on

The crash is:
Warning: Invalid size request on a scalable vector; Possible incorrect use of MVT::getVectorNumElements() for scalable vector. Scalable flag may be dropped, use MVT::getVectorElementCount() instead
clang: /home/preames/llvm-dev/llvm-project/llvm/include/llvm/ADT/Optional.h:195: T& llvm::optional_detail::OptionalStorage<T, true>::getValue() & [with T = long int]: Assertion `hasVal' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
With the last interesting frame being: llvm::CodeMetrics::analyzeBasicBlock

This revision now requires changes to proceed.Jun 2 2022, 12:28 PM

Downstream we have had to patch a lot of TTI functions to avoid crashes with scalable vectors, so I also think we're not ready to enable this by default. I've been meaning to find some solid test cases to show this off, but I agree with @reames.

Miss_Grape abandoned this revision.Jun 13 2022, 12:15 AM
reames added a comment.Jul 1 2022, 1:15 PM

FYI, I have a new version of this posted here: https://reviews.llvm.org/D129013

This follows a bunch of work to avoid crashes in cost modeling, or consumers involving Invalid costs, and a fair amount of cost model implementation work.