This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Enhance SLPVectorizer to vectorize vector aggregate
ClosedPublic

Authored by anton-afanasyev on Nov 11 2019, 1:03 AM.

Details

Summary

Vector aggregate is homogeneous aggregate of vectors like { <2 x float>, <2 x float> }.
This patch allows findBuildAggregate() to consider vector aggregates as
well as scalar ones. For instance, { <2 x float>, <2 x float> } maps to <4 x float>.

Fixes vector part of llvm.org/PR42022

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptNov 11 2019, 1:03 AM

Oops, missed regression test, I'm to add it.

Update summary

Harbormaster completed remote builds in B40731: Diff 228646.
RKSimon added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/pr42022.ll
5

Regenerate with update_test_checks.py ?

anton-afanasyev marked 2 inline comments as done.Nov 11 2019, 4:10 AM
anton-afanasyev added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/pr42022.ll
5

Should it really be autogenerated? I don't like test autogeneration for excessiveness and false positives while testing. Here we just checking vector <4 x float> is generated. Is autogeneration really standard for tests now?

dtemirbulatov added inline comments.Nov 11 2019, 5:00 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6766

Maybe here we could try other sizes here, for example, 8?

anton-afanasyev marked 3 inline comments as done.Nov 11 2019, 5:10 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6766

Are you talking about number 4 in SmallVector<Value*, 4>? It is just default size preallocated for vector, it could be potentially resized by push_back() below. But I believe it won't for most cases.

anton-afanasyev marked 2 inline comments as done.Nov 11 2019, 5:19 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6766

*default capacity

anton-afanasyev edited the summary of this revision. (Show Details)Nov 11 2019, 5:20 AM
dtemirbulatov added inline comments.Nov 11 2019, 5:32 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6766

yes, correct

ABataev added inline comments.Nov 11 2019, 7:22 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2879–2880

Why do we have this check?

2879–2880

Also, would be good to have a test for this if we don't have it yet.

6759–6762

Not formatted

6769–6771

Just BuildVectorOpds.append(TmpBuildVectorOpds.rbegin(), TmpBuildVectorOpds.rend());

llvm/test/Transforms/SLPVectorizer/X86/pr42022.ll
5

It is common practice to use auto checks and demonstrate the difference then.

RKSimon added inline comments.Nov 11 2019, 7:51 AM
llvm/test/Transforms/SLPVectorizer/X86/pr42022.ll
5

update_test_checks.py helps us identify any hidden regressions

also you can probably remove the dso_local and local_unnamed_addr #0 tags

anton-afanasyev marked 11 inline comments as done.Nov 11 2019, 8:07 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2879–2880

Hmm, you're right, I'm just to delete this check. I haven't faced any issue concerning scalability, but decided to check it just in case ({ <vscale x 2 x float>, <vscale x 2 x float> } is not isomorphic to <4 x float>?). But I see no similar check across SLPVectorizer.cpp, so I'm to delete this check.

6759–6762

Thanks, fixed with clang-format.

6769–6771

Sure, thanks!

llvm/test/Transforms/SLPVectorizer/X86/pr42022.ll
5

Ok, thank, I'm to fix it.

anton-afanasyev marked 4 inline comments as done.

Update

ABataev added inline comments.Nov 11 2019, 8:18 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2879–2880

My question was different. I just wanted some explanation what's the problem with the scalable vectors? If the check is required we must have it. Also, still, would be good to have a test for this situation.

anton-afanasyev marked 2 inline comments as done.Nov 11 2019, 10:20 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2879–2880

I wasn't familiar with vscale property, but learned it now. That check was actually unnecessary: { <vscale x N x Ty>, <vscale x N x Ty> } maps to <vscale x 2N x Ty>, so we can deal with this scalability set to true.

RKSimon added inline comments.Nov 12 2019, 1:26 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2879–2880

Probably worth adding at least some test coverage for this - in aarch64 maybe?

anton-afanasyev marked 4 inline comments as done.Nov 13 2019, 8:07 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2879–2880

Actually vscale property cannot be set to true for such cases.
The structure like { <vscale x 2 x float> } is forbidden, I've got error: invalid element type for struct trying to opt this.

2879–2880

What do you mean by test coverage here? I can copy the same pr42022.ll to llvm/test/Transforms/SLPVectorizer/AArch64, but the only difference will be -mtriple= option.
If you are talking about original issue (pointed here https://llvm.org/pr42022), for aarch64 the vectorization is done for "Bad" case and is failed for two "Good" case (https://godbolt.org/z/5JEGSm), but the root cause is different ABI of aarch64 compared to x86-64. This patch doesn't change this and doesn't concern this.

RKSimon added inline comments.Nov 13 2019, 8:29 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2879–2880

I was referring to adding vscale tests - but if they can't occur then that's fine.

I think we cannot handle several similar cases with aggregates inside aggregates (e.g., [2 x {float, float}], {{float, float}, {float, float}}). Perhaps we could address some of these cases too with this patch? Please see the tests below:

; RUN: opt -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu < %s | FileCheck %s                                                                                                                                                                                       

%StructTy = type { float, float}

define [2 x %StructTy] @ArrayOfStruct(float *%Ptr) {
  %GEP0 = getelementptr inbounds float, float* %Ptr, i64 0
  %L0 = load float, float * %GEP0
  %GEP1 = getelementptr inbounds float, float* %Ptr, i64 1
  %L1 = load float, float * %GEP1
  %GEP2 = getelementptr inbounds float, float* %Ptr, i64 2
  %L2 = load float, float * %GEP2
  %GEP3 = getelementptr inbounds float, float* %Ptr, i64 3
  %L3 = load float, float * %GEP3

  %Fadd0 = fadd fast float %L0, 1.1e+01
  %Fadd1 = fadd fast float %L1, 1.2e+01
  %Fadd2 = fadd fast float %L2, 1.3e+01
  %Fadd3 = fadd fast float %L3, 1.4e+01

  %StructIn0 = insertvalue %StructTy undef, float %Fadd0, 0
  %StructIn1 = insertvalue %StructTy %StructIn0, float %Fadd1, 1

  %StructIn2 = insertvalue %StructTy undef, float %Fadd2, 0
  %StructIn3 = insertvalue %StructTy %StructIn2, float %Fadd3, 1

  %Ret0 = insertvalue [2 x %StructTy] undef, %StructTy %StructIn1, 0
  %Ret1 = insertvalue [2 x %StructTy] %Ret0, %StructTy %StructIn3, 1
  ret [2 x %StructTy] %Ret1
}

define {%StructTy, %StructTy} @StructOfStruct(float *%Ptr) {
  %GEP0 = getelementptr inbounds float, float* %Ptr, i64 0
  %L0 = load float, float * %GEP0
  %GEP1 = getelementptr inbounds float, float* %Ptr, i64 1
  %L1 = load float, float * %GEP1
  %GEP2 = getelementptr inbounds float, float* %Ptr, i64 2
  %L2 = load float, float * %GEP2
  %GEP3 = getelementptr inbounds float, float* %Ptr, i64 3
  %L3 = load float, float * %GEP3

  %Fadd0 = fadd fast float %L0, 1.1e+01
  %Fadd1 = fadd fast float %L1, 1.2e+01
  %Fadd2 = fadd fast float %L2, 1.3e+01
  %Fadd3 = fadd fast float %L3, 1.4e+01

  %StructIn0 = insertvalue %StructTy undef, float %Fadd0, 0
  %StructIn1 = insertvalue %StructTy %StructIn0, float %Fadd1, 1

  %StructIn2 = insertvalue %StructTy undef, float %Fadd2, 0
  %StructIn3 = insertvalue %StructTy %StructIn2, float %Fadd3, 1

  %Ret0 = insertvalue {%StructTy, %StructTy} undef, %StructTy %StructIn1, 0
  %Ret1 = insertvalue {%StructTy, %StructTy} %Ret0, %StructTy %StructIn3, 1
  ret {%StructTy, %StructTy} %Ret1
}
anton-afanasyev marked an inline comment as done.Nov 15 2019, 12:36 AM

I think we cannot handle several similar cases with aggregates inside aggregates (e.g., [2 x {float, float}], {{float, float}, {float, float}}). Perhaps we could address some of these cases too with this patch? Please see the tests below:
...

Good point! I've checked your case {{float, float}, {float, float}} with modified patch and it has been successfully vectorized. But this has required several hacks.
So may be a separate patch is better? One needs modify canMapToVector() and findBuildAggregate() to be recursive, merge findBuildVector() and findBuildAggregate() -- I can send it to review as a following patch, while fixing next, Matrix22 part of https://llvm.org/pr42022 issue.

Well, ideally we should be able to handle any combination and nesting of scalars, vectors and aggregates in a unified way.
For example:

{float, <2 x float>, float},
{float, float, <2 x float>},
{{float, float}, <4 x float>, <2 x float>},
{{float}, {float}, [2 x float]}, 
{{{float, float}, float}, float}

etc.
This would require some redesign of the code, which is why I think it makes sense to have a single patch for all of them.

But yes, feel free to split them into separate patches.

But could you also add some more tests to show which of these cases this patch is taking care of. For example, I think {<2 x float>, <2 x float>, <4 x float>} or {<2 x float>, float, float} will not work. Here are some examples:

; RUN: opt -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu < %s | FileCheck %s  

define { <2 x float>, float, float } @StructOfVectorAndScalars(float *%Ptr) {
  %GEP0 = getelementptr inbounds float, float* %Ptr, i64 0
  %L0 = load float, float * %GEP0
  %GEP1 = getelementptr inbounds float, float* %Ptr, i64 1
  %L1 = load float, float * %GEP1
  %GEP2 = getelementptr inbounds float, float* %Ptr, i64 2
  %L2 = load float, float * %GEP2
  %GEP3 = getelementptr inbounds float, float* %Ptr, i64 3
  %L3 = load float, float * %GEP3

  %VecIn0 = insertelement <2 x float> undef, float %L0, i64 0
  %VecIn1 = insertelement <2 x float> %VecIn0, float %L1, i64 1

  %Ret0 = insertvalue {<2 x float>, float, float} undef, <2 x float> %VecIn1, 0
  %Ret1 = insertvalue {<2 x float>, float, float} %Ret0, float %L2, 1
  %Ret2 = insertvalue {<2 x float>, float, float} %Ret1, float %L3, 1
  ret {<2 x float>, float, float} %Ret1
}

define { <2 x float>, <2 x float>, <4 x float> } @StructOfVectors(float *%Ptr) {
  %GEP0 = getelementptr inbounds float, float* %Ptr, i64 0
  %L0 = load float, float * %GEP0
  %GEP1 = getelementptr inbounds float, float* %Ptr, i64 1
  %L1 = load float, float * %GEP1
  %GEP2 = getelementptr inbounds float, float* %Ptr, i64 2
  %L2 = load float, float * %GEP2
  %GEP3 = getelementptr inbounds float, float* %Ptr, i64 3
  %L3 = load float, float * %GEP3

  %GEP4 = getelementptr inbounds float, float* %Ptr, i64 4
  %L4 = load float, float * %GEP4
  %GEP5 = getelementptr inbounds float, float* %Ptr, i64 5
  %L5 = load float, float * %GEP5
  %GEP6 = getelementptr inbounds float, float* %Ptr, i64 6
  %L6 = load float, float * %GEP6
  %GEP7 = getelementptr inbounds float, float* %Ptr, i64 7
  %L7 = load float, float * %GEP7

  %VecIn0 = insertelement <2 x float> undef, float %L0, i64 0
  %VecIn1 = insertelement <2 x float> %VecIn0, float %L1, i64 1

  %VecIn2 = insertelement <2 x float> undef, float %L2, i64 0
  %VecIn3 = insertelement <2 x float> %VecIn2, float %L3, i64 1

  %VecIn4 = insertelement <4 x float> undef, float %L4, i64 0
  %VecIn5 = insertelement <4 x float> %VecIn4, float %L5, i64 1
  %VecIn6 = insertelement <4 x float> %VecIn5, float %L6, i64 2
  %VecIn7 = insertelement <4 x float> %VecIn6, float %L7, i64 3

  %Ret0 = insertvalue {<2 x float>, <2 x float>, <4 x float>} undef, <2 x float> %VecIn1, 0
  %Ret1 = insertvalue {<2 x float>, <2 x float>, <4 x float>} %Ret0, <2 x float> %VecIn3, 1
  %Ret2 = insertvalue {<2 x float>, <2 x float>, <4 x float>} %Ret1, <4 x float> %VecIn7, 2
  ret {<2 x float>, <2 x float>, <4 x float>} %Ret2
}
llvm/test/Transforms/SLPVectorizer/X86/pr42022.ll
2

Could you replace this test with the smallest test that exposes the issue?
I think something like this should do the job:

; RUN: opt -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu < %s | FileCheck %s

; Checks that vector insertvalues into the struct become SLP seeds.
define { <2 x float>, <2 x float> } @StructOfVectors(float *%Ptr) {
  %GEP0 = getelementptr inbounds float, float* %Ptr, i64 0
  %L0 = load float, float * %GEP0
  %GEP1 = getelementptr inbounds float, float* %Ptr, i64 1
  %L1 = load float, float * %GEP1
  %GEP2 = getelementptr inbounds float, float* %Ptr, i64 2
  %L2 = load float, float * %GEP2
  %GEP3 = getelementptr inbounds float, float* %Ptr, i64 3
  %L3 = load float, float * %GEP3

  %VecIn0 = insertelement <2 x float> undef, float %L0, i64 0
  %VecIn1 = insertelement <2 x float> %VecIn0, float %L1, i64 1

  %VecIn2 = insertelement <2 x float> undef, float %L2, i64 0
  %VecIn3 = insertelement <2 x float> %VecIn2, float %L3, i64 1

  %Ret0 = insertvalue {<2 x float>, <2 x float>} undef, <2 x float> %VecIn1, 0
  %Ret1 = insertvalue {<2 x float>, <2 x float>} %Ret0, <2 x float> %VecIn3, 1
  ret {<2 x float>, <2 x float>} %Ret1
}
anton-afanasyev marked 2 inline comments as done.Nov 16 2019, 2:24 AM
anton-afanasyev added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/pr42022.ll
2

Thanks, I've replaced this with your test!

anton-afanasyev marked an inline comment as done.

Update test with shorter one

Well, ideally we should be able to handle any combination and nesting of scalars, vectors and aggregates in a unified way.
For example:

{float, <2 x float>, float},
{float, float, <2 x float>},
{{float, float}, <4 x float>, <2 x float>},
{{float}, {float}, [2 x float]}, 
{{{float, float}, float}, float}

etc.
This would require some redesign of the code, which is why I think it makes sense to have a single patch for all of them.

But yes, feel free to split them into separate patches.

Actually I'm not sure vectorizer should support vectorization of _any_ combinations. For instance, does add operation make sense for structure like {float, float, <2 x float>} in real life (dictated by frontend output, mostly from )? I would focus on _homogeneous_ aggregate -- aggregate of equal type elements, including any nesting depth like {{{float, float}, {float, float}},{{float, float}, {float, float}}}.

anton-afanasyev added a comment.EditedNov 16 2019, 2:55 AM

...
But could you also add some more tests to show which of these cases this patch is taking care of. For example, I think {<2 x float>, <2 x float>, <4 x float>} or {<2 x float>, float, float} will not work. Here are some examples:
...

Yes, thanks, I'm to add two tests for homogeneous and non-homogeneous cases.

Added several tests

Added test for {{{i16,i16},{i16,i16}},{{i16,i16},{i16,i16}}}.
I'm to process this and other previously added test cases by the following commit.

Is it ok to lgtm for now?

Please can you pre-commit the tests with current codegen and then rebase to show the diff?

Updated test diff against precommitted tests

Please can you pre-commit the tests with current codegen and then rebase to show the diff?

Ok, done

Here is further enhencement considering different combinations of aggregates: https://reviews.llvm.org/D70587

RKSimon accepted this revision.Nov 22 2019, 3:12 AM

LGTM - thanks

This revision is now accepted and ready to land.Nov 22 2019, 3:12 AM
This revision was automatically updated to reflect the committed changes.