Page MenuHomePhabricator

[SVE][LoopVectorize] Add support for scalable vectorization of loops with vector reverse

Authored by CarolineConcatto on Jan 25 2021, 7:22 AM.



This patch adds support for reverse loop vectorization.
It is possible to vectorize the following loop:

for (int i = n-1; i >= 0; --i)
  a[i] = b[i] + 1.0;

with fixed or scalable vector.
The loop-vectorizer will use 'reverse' on the loads/stores to make
sure the lanes themselves are also handled in the right order.
This patch adds support for scalable vector on IRBuilder interface to
create a reverse vector. The IR function
CreateVectorReverse lowers to experimental.vector.reverse for scalable vector
and keedp the original behavior for fixed vector using shuffle reverse.

Depends on D94883

Depends on D95603

Depends on D95139

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes


CarolineConcatto edited the summary of this revision. (Show Details)Jan 25 2021, 7:31 AM
CarolineConcatto edited the summary of this revision. (Show Details)Jan 25 2021, 7:49 AM
CarolineConcatto edited the summary of this revision. (Show Details)
david-arm added inline comments.Jan 25 2021, 8:45 AM
451 ↗(On Diff #319010)

Hi Carol, is the reason why you bail out here because we are casting between at least one illegal type? If so, perhaps you could add a TODO here, something like:

// TODO : For scalable vectors we avoid calling the BaseT version for now because it doesn't yet work for illegal vector types as it assumes these are fixed width.


Hi Carol. I don't think this is right here. The number of cached lanes in my patch (D95139) simply refers to the number of values we have managed to cache. It's just an optimisation the vectoriser uses to avoid constantly calculating the same lane index expressions. However, what you're after here is a runtime calculation for the last lane index - you can use getRuntimeVF in my patch for this. I think the code below will probably need splitting out into fixed width vs scalable cases, i.e.

Builder.getInt32(1 - NumElts)

would need to be something like:

Builder.CreateSub(Builder.getInt32(1), getRuntimeVF())

for scalable vectors. Hope that makes sense!

sdesmalen added inline comments.Jan 25 2021, 8:46 AM

It would be nice if the IRBuilder interface supports both the scalable and fixed-width case, i.e. for scalable vectors it creates the intrinsic, for fixed-width vectors it creates a shufflevector.
Then in the LoopVectorizer it can just call CreateVectorReverse(Vec) and get the same behaviour.

451–456 ↗(On Diff #319010)

I don't think you should return just some random value here. Is there anything in broken in BasicTTIImpl that needs fixing?

If you run into any asserts caused by the cast from FP -> Int, you can simplify the test so that it doesn't need that cast (and then fix this up in a separate patch)

1190 ↗(On Diff #319010)

Can you instead add cases to the table?


{ TTI::SK_Reverse, MVT::nxv16i8, 1 },
{ TTI::SK_Reverse, MVT::nxv8i16, 1 },

NumElts can't be unsigned here, because it has to be a value that's evaluated at runtime.
This means it needs to multiply VF.getKnownMinValue by the runtime value of VScale, making NumElts a Value *.

For the first GEP you can also do the arithmetic on a pointer of the vector type, and index by -Part, e.g.

getelementptr <vscale x 8 x double>, <vscale x 8 x double> *%ptr, i64 -<part>


%vscale = call i64 @llvm.vscale()
%part = mul i64 %vscale, 8
getelementptr double, double *%ptr, i64 %part

The second GEP indexes a specific element, so that would still need to be a getelementptr on a double*.


It is difficult to verify what this code is generating, and the GEPs here are actually really important to get right, so can you just check for explicit values?

fhahn added inline comments.Jan 25 2021, 8:58 AM
1191 ↗(On Diff #319010)

Also, can this have a separate cost model test? If so, can this be split off as a separate improvement?

  • Move cost function for Shuffle reverse to D93639
  • Use getRunTimeVF to compute the runtime vector size for scalable vector
  • add tests for fixed vector
CarolineConcatto marked 3 inline comments as done.Feb 7 2021, 10:49 AM

Thank you all for the review.
I believe I had addressed your comments.
I've created another patch to compute the cost for vector reverse with scalable vector.
I've removed the cast from int to double in one of the tests, so we don't need cast fixed for this feature.
And the IR builder now has a reverse function for fixed and scalable vectors.


I don't know if I understood what you want.
I had to change the checks.
Do you mind to show what do you like to see being checked in these tests?

david-arm added inline comments.Feb 8 2021, 2:49 AM

Maybe better to use Builder.getInt32(-Part) to be consistent with how it's done for the fixed width case?


If you use a i64 variable here then it makes the tests simpler as the variable doesn't need sign-extending with sext


Hi @CarolineConcatto something doesn't look right here because the result will be zero.


See comment about N - this line can be killed if you have "i64 N", then everywhere that uses %0 can just use %N instead. If you do this then the for.body.preheader block can be killed too and the branch above can jump straight to for.body


Same comment as in previous function.

CarolineConcatto edited the summary of this revision. (Show Details)Feb 9 2021, 9:05 AM
CarolineConcatto edited the summary of this revision. (Show Details)
CarolineConcatto edited the summary of this revision. (Show Details)Feb 9 2021, 9:08 AM
david-arm added inline comments.Feb 10 2021, 6:02 AM

I think I may have made a mistake here and the '0' is there because this is the 0th Part. Looking at the way GEPs are created it doesn't really optimise the case when Part=0. Adding "-instcombine" to the RUN line would clear up any dead code here.

CarolineConcatto edited the summary of this revision. (Show Details)
  • change how to get NumElt inside LoopVectorize
  • change the index to be 64 bits
CarolineConcatto marked an inline comment as done.Feb 10 2021, 12:16 PM
CarolineConcatto added inline comments.

Thank you David, you are correct things should be coherent.


I've replaced the index to be 64 bits, but I could not remove the for.body.preheader.

CarolineConcatto marked an inline comment as done.

-removing redundant lines in the test

CarolineConcatto marked an inline comment as done.
  • change index 'i' in the tests to be 64 bits instead of 32bits

Thank you @david-arm.
I've removed the pre.header.
It is indeed better.

Hi @CarolineConcatto, it's looking much better now and thanks for dealing with all the review comments!


Hi @CarolineConcatto, I just realised there isn't a test for this for scalable vectors. Would you be able to add a test for blocks with masks too? I think it should be something like:

void foo(int * __restrict__ a, int * __restrict__ cond, long long N) {
  for (int i = N - 1; i >= 0; i--)
    if (cond[i])
      a[i] += 1;

I think that since you're adding tests for this it's probably a good idea to expand all the shufflevector CHECK lines in this file to make sure we have a reverse mask here too?

Matt added a subscriber: Matt.Feb 16 2021, 9:05 AM
bin.cheng-ali added inline comments.

Sorry for one nitpicking.
IIUC the difference for scalable/fixed cases is the two operands of GEPs, is it better to factor out common code by doing below?
if (VF.isScalable()) {

// build GEP operands for scalable case

else {

// build GEP operands for fixed case

// common code building VecPtr with above GEP operands.

david-arm added inline comments.Feb 22 2021, 3:13 AM

I guess alternatively we could commonise them into one block where we always calculate

Value *LastLane =
                      getRuntimeVF(Builder, Builder.getInt32Ty(), VF));

for both fixed-width and scalable vectors, and just let instcombine and other passes fold the constants? I'm not sure if that breaks lots of tests though ...

  • Add tests for when vector.reverse needs to reverse a mask
  • improve code in LoopVectorize.cpp
  • C&P function getRunTimeVF from D95139
CarolineConcatto marked an inline comment as done.Mar 5 2021, 1:36 AM

Thank you all for the review.
I C&P the function getRunTime from D95139.
If D95139 is merged before this patch I will remove it.
All the other dependencies are already in main, and getRunTimeVF is the only function that this patch needs from D95139.


Thank you @david-arm for pointing that.
Yes, indeed it was missing test for that.
The cost model for i1 was missing with vector reverse.


So @david-arm 's suggestion works for fixed width and scalable vectors.
At least no test has failed.
I could also have done the way @bin.cheng-ali suggested:

if (scalable){
  NumElts =   Builder.CreateMul(Builder.getInt32(1),
                      getRuntimeVF(Builder, Builder.getInt32Ty(), VF));
  LastLane =   Builder.CreateSub(Builder.getInt32(1),
                      getRuntimeVF(Builder, Builder.getInt32Ty(), VF));
   NumElts = 1 * VF.getKnownMinValue()
  LastLane = 1 - VF.getKnownMinValue()

But as all work with getRunTime I thought it was code duplication and this way the code looks simpler.
I added a comment to explain how it works if it is fixed-width vector for me in the future too


Hi @david-arm I
think I forgot to answer that suggestion.

This file has 2 tests, for 2 variable types, double and integer variables

@vector_reverse_f64(i64 %N, double* %a, double* %b)
 @vector_reverse_i64(i64 %N, i64* %a, i64* %b)

So I believe it is ok to leave this as it is. Unless I am missing something. (Which is completely ok for me)
I can see for @vector_reverse_i64 that sext instruction is gone.

david-arm added inline comments.Mar 5 2021, 3:32 AM

nit: This function looks a bit too complex to live in a header - maybe better defined in the .cpp file?


nit: One of my patches has already landed with this function so I think you'll need to remove it when committing.


nit: Maybe instead of calling getRuntimeVF twice you can call it once and assign it to a variable?


nit: Perhaps better renamed to nxv4i1 since that's what the generated code looks like?


nit: Perhaps better renamed to v4i1 since the CHECK lines have that type?


This seems a bit less well tested than the sve version - perhaps it's worth adding CHECK lines for the store too? I assume the store will reuse the same mask?


nit: %[[REVERSE6]]?


nit: %[[REVERSE6]]

  • Fix nit in the .ll tests and LoopVectorize.cpp
  • Remote copied getRunTimeVF from LoopVectorize.cpp
  • Move CreateVectorReverse implementation from IRBuilder.h to IRBuilder.cpp

Hey @david-arm sorry for the nit in the tests.
Usually happens when I do copy and paste many times.
I've updated the patch and rebased it with the main.



Thank you David.
I hadn't noticed that this could be also implemented in IRBuilder.cpp.


Wonderfull! Good to know!


Good idea David, so I can reduce code duplication


That is what happens when I do C&P. I've creates tests for all masks sizes, but only submitted one size.
I was wondering, I did not add tests for all mask sizes here.
Do you think it is valid?
I guess that one size fits all, if one mask size does not break it means that the rest will not break too.


Ok, I think it is valid to have similar check's for scalable and fixed vectors


Good catch!

CarolineConcatto marked 5 inline comments as done.Mar 8 2021, 5:54 AM
david-arm accepted this revision.Mar 9 2021, 3:10 AM



nit: I think in this case it's worth fixing the clang-tidy comment. It looks like the function above also uses "auto *..."

This revision is now accepted and ready to land.Mar 9 2021, 3:10 AM
This revision was landed with ongoing or failed builds.Mar 16 2021, 2:19 AM
This revision was automatically updated to reflect the committed changes.