VecClone Pass
Needs Review · Public

Authored by mmasten on Jul 25 2016, 5:37 PM.



This work is part of an RFC sent by Xinmin back on 3/2/2016 regarding explicit function vectorization. The VecClone pass translates functions marked with "#pragma omp declare simd" into loops with a vector-length trip count. Please see the RFC for details. It can be found at


mmasten retitled this revision from to VecClone Pass.Jul 25 2016, 5:37 PM
mmasten updated this object.
mmasten added reviewers: mzolotukhin, hfinkel, Ayal.

Thanks for the nicely commented code and the provided tests.
A few inlined comments after a quick overview.

146 ↗(On Diff #65447)

Why do all these need to be public APIs?


Why does this need to be in a public header?

564 ↗(On Diff #65447)

This is all rather inefficient:

  1. getVectorVariantAttributes creates a temporary vector and copies the attributes into it, just to iterate over it a few lines later and convert them to strings (you can't keep a StringRef because the attribute was copied, I guess).
  2. FuncVars is queried twice for no apparent good reason.

It seems that the only use for this function is to populate a list of functions to "vectorize". Populating a vector<Function *> should be enough for that. A nice property would be the ability to figure out easily/efficiently whether a function needs to be "vectorized".
Maybe rework the attribute to be more "structured" instead of separate strings.


Some overview with an example of what the pass accomplishes would be nice. It is already captured in the RFC, but for someone reading the code it'd be helpful.


I think it should be spelled out more clearly that it is required for correctness for all the variants mentioned in the attribute list to be codegen'd, even at O0, because even if the vectorizer does not run in this module, it may run in another module that expects these variants to exist.


So this will walk all the functions in the module, walk all their attributes, and do a string comparison on each of them.
I'm not sure it is fine to pay this cost when the pass has nothing to do (i.e., the early exit should be *fast*).


Coding style: no braces (other places as well).


Spurious empty line (other places as well).


What about using a SmallVector and moving it out of the loop (the call to clear() below already handles the reset between iterations)?

Thanks for the comments, Mehdi. I had some other things come up, but I'm making some corrections now.

mmasten added inline comments.Sep 1 2016, 10:44 AM
146 ↗(On Diff #65447)

They don't need to be. I can move these inside the VecClone class. Some of these could eventually become more generalized utilities, but I'll change it for now.


Do you mean that the class definition should be moved to VecClone.cpp?

564 ↗(On Diff #65447)

I will change #2. We need a little more than vector<Function*> here because each simd function will have multiple vector variants. FuncVars is a one-to-many mapping from the original function to the string encodings corresponding to the variants that will be generated later. The string representation is essentially what is defined in the vector ABI. We can easily tell which functions need to be vectorized by checking whether any of these encodings exist as attributes on the original function. Can you elaborate on what you mean by "more structured"? Are you suggesting an alternative representation to string-based function attributes? Thanks


We could selectively run the VecClone pass from PassManagerBuilder based on whether the OpenMP switch has been used. Otherwise, I don't know of another way to figure out which functions will need to be generated. Do you have a suggestion?

mehdi_amini added inline comments.Sep 1 2016, 12:11 PM
564 ↗(On Diff #65447)

Let me clarify what I meant with a vector<Function *>.
I meant that my take on getFunctionsToVectorize was that its sole purpose was to set a *single* entry in FuncVars, and this entry is a std::vector<std::string> containing the list of variants.

This is used exactly *once* here:

for (auto& pair : FunctionsToVectorize) {
    Function& F = *pair.first;
    DeclaredVariants &DeclaredVariants = pair.second;

If FunctionsToVectorize were a vector of Function *, you could get the list of DeclaredVariants on the fly while iterating over the attributes, without creating any temporary "string list".

mehdi_amini added inline comments.Sep 1 2016, 12:17 PM
564 ↗(On Diff #65447)

To come back to the "more structured" point, the question is "how do we get the list of functions that need variants very quickly/cheaply?"
Having to look at all the attributes and do any kind of string processing *when we need to generate a variant* is not a problem; I'm more interested in doing as little work as possible when there is no variant.
Maybe having an attribute on the function such as "hasVariants" that could be queried cheaply would be a solution. I'd have to look at how attributes work internally again.

(Ideally I'd like key->value storage where the key would be "variants" and the value would be the list of variants to generate, instead of a flat list where variant strings are mixed with the others and recognized by "magic" pattern matching.)

mmasten updated this revision to Diff 73523.Oct 4 2016, 11:53 AM
mmasten marked an inline comment as done.
sodeh added a subscriber: sodeh.Nov 16 2016, 1:46 AM
simoll added a subscriber: simoll.Jan 23 2017, 1:22 AM
hfinkel added inline comments.Jan 27 2017, 4:14 AM

There's a lot of target information here in the target-independent code. Given that we're not going to vectorize without a target code model regardless, I'd like to see this information pushed into TargetTransformInfo. VecClone can then use TTI to convert the particular ISA tags into information about vector lengths, etc. Other architectures adopting this scheme can then extend it in a natural way.

564 ↗(On Diff #65447)

I'd like to come back to this. I agree with Mehdi, having unadorned mangled names as attributes isn't the right design - the "magic" pattern matching is unnecessary. It would seem much better to use something like:

attributes #0 = { "vector-variants"="_ZGVbM4l_foo,_ZGVbN4l_foo,_ZGVcM8l_foo,_ZGVcN8l_foo,_ZGVdM8l_foo,_ZGVdN8l_foo,_ZGVeM16l_foo,_ZGVeN16l_foo" }

instead of:

attributes #0 = { hasvectorvariants "_ZGVbM4l_foo" "_ZGVbN4l_foo" "_ZGVcM8l_foo" "_ZGVcN8l_foo" "_ZGVdM8l_foo" "_ZGVdN8l_foo" "_ZGVeM16l_foo" "_ZGVeN16l_foo" }

exactly like we do for "target-features".

mmasten updated this revision to Diff 87567.Feb 7 2017, 4:58 PM

New changes are:

  1. Update function attributes to "vector-variants"="<variant list>" format.
  2. Move target-specific code in the VectorVariant class to TTI.

Thanks for the feedback, Hal. I made the changes you suggested.

fpetrogalli added inline comments.Sep 1 2017, 3:14 AM

Hello Matt, thank you for working on this.

<nitpicking>Should the comment be updated with the "vector-variants"="_ZGVbM4ul_,_ZGVbN4ul_" syntax?</nitpicking>



Hello again,

I think it would be good to have references to the document that describes the vector ABI. Is it the one at [1]?

If so, I think it would be good to cover with tests the "linear clause" cases of Table 1 in section 2.4, "Element Data Type to Vector Data Type Mapping".



fpetrogalli added inline comments.Sep 1 2017, 3:46 AM

Section 2.6 "Vector Function Name Mangling" of [1] maps the IsaClass to other values, 'x', 'y'. Are you basing this on [2] for gcc compatibility?
Why do [1] and [2] differ?



hfinkel added inline comments.Sep 1 2017, 4:04 PM

This function doesn't return anything, and I think that makes the example hard to understand.


I don't understand what's going on here. Is it not possible to write the transformation such that it's semantically correct regardless of whether we've run mem2reg? This is not always an either/or situation.

Also, does the fact that we now have the VPlan infrastructure in the vectorizer change, in any way, how we'd like to approach this problem?

Hello - gentle ping on this patch. I think it is an important addition to the compiler. Do we need to reactivate the discussion in the mailing list?


Hi Francesco,

I'm working on updating my patch based on the recent comments and other modifications that have been made since the last patch was submitted. I should have it ready in the next couple of days.



mmasten added inline comments.Oct 31 2017, 12:26 PM

Currently, the implementation is based on gcc compatibility. However, it would be nice to extend to support both gcc and icc. The IsaClass values are different because of the calling convention differences. Intel icc uses regcall calling convention and gcc uses standard calling convention. I've added references to both ABI documents.


I agree, this was a bad bug in the example. This has been fixed and updated with an example that demonstrates the behavior of all three types of parameters (uniform, linear, and vector). Please let me know if the bitcasts shown in the example are confusing. The purpose of the bitcasts is so that the loop can appear in scalar form to the loop vectorizer, i.e., we can use a scalar gep and index to reference vectors in the loop.


Thanks Francesco. Fixed.


Perhaps the comments weren't written very clearly. The pass has been written to handle parameters regardless of whether or not mem2reg has run. Hopefully the updated comments are better.

mmasten updated this revision to Diff 121035.Oct 31 2017, 12:29 PM


thank you for updating the patch.

I have added one more comment, and I also have a question. Could you please add (here or in a different patch) the plumbing needed in the LoopVectorizer to make the functions generated via this pass available for vectorization? Apologies if you have already done it and I am just missing it.



We will have to support OpenMP 4.5 linear modifiers, which are rendered as 2-char tokens in the mangled names. Wouldn't it be better to avoid #defines of chars and instead use enums inside the VectorKind class?

fhahn added a subscriber: fhahn.Nov 6 2017, 6:11 AM
mmasten added inline comments.Nov 6 2017, 11:49 AM

Ok, I can change this to an enum inside the VectorKind class. I don't have a patch ready for the LoopVectorize part, but I will prepare one and put it up for review in the next few days. I also have a patch for clang that enables the "vector-variants" attribute support.

fpetrogalli added inline comments.Nov 6 2017, 2:25 PM

Ok, I can change this to an enum inside the VectorKind class.

Thank you.

I don't have a patch ready for the LoopVectorize part, but I will prepare one and put it up for review in the next few days.

That sounds great, thanks!

I also have a patch for clang that enables the "vector-variants" attribute support.

Clang already generates the list of mangled vector names as string attributes. I believe you need to rearrange the code so that the strings get produced in the "vector-variants" attribute.

mmasten added inline comments.Nov 27 2017, 4:38 PM

Hi Francesco,

I have the LoopVectorize part of this done. Are there any objections to making it part of this review?



fpetrogalli added inline comments.Nov 28 2017, 12:41 AM

Hello Matt,

please don't add code to this review. I'd prefer to see the changes related to the Vectorizer in a separate patch.

Kind regards,


mmasten updated this revision to Diff 124626.Nov 28 2017, 1:14 PM

Moved the calcCharacteristicType() function to VectorUtils so that VecClone and LV can share it.
Removed the vector-variant function attributes in LV instead of in VecClone, because LV needs to see them.

mmasten added inline comments.Nov 28 2017, 1:17 PM

Thanks Francesco. I made a couple of updates to this patch and created a new patch for the LV side of things. The LV patch is revision D40575.

In general, I think the VecClone pass is too complicated because it tries to handle the "optimized code" vs. "non-optimized code" cases separately. I don't think we should (or, in a theoretical sense, can) do that. We should have a uniform algorithm that handles all incoming IR. I think we can do something like this:

  1. Split the entry block at the top and move all allocas in the original entry block to the new entry block.
  2. For each constant-sized alloca in the entry block, expand it by a factor of VL. For vectorizable types, you can do an alloca of the vector type. Otherwise, use an alloca to generate an array of VL items (i.e., use VL as the alloca's second parameter). Generate an alloca of <RetTy x VL> to hold the return value.
  3. Generate a new loop around all of the rest of the function (for i = [0, VL-1]) and a new return block (which loads the value from the return-value alloca and returns it).
  4. Replace all uses of each entry-block alloca, a, with &a[i]. Remove the old allocas.
  5. All uses of function parameters are now inside the loop. Vector parameters get a vector alloca and store in the entry block, and uses of the parameter in the loop are replaced by a load of param[i]. Linear parameters are replaced by (p+i). Uniform parameters are unchanged.
  6. Replace the returns with a store to RetVal[i] and a branch to the loop-exit block.

Something like that should work for all functions, optimized or not.


This is X86 specific, and doesn't seem to belong here (also, it's not used anywhere).


This shouldn't be an enum like this. We don't need X86-specific things in the TTI interface. It looks like you've done a good job at making this opaque, so you can move this enum itself into the X86TTI implementation. The only thing we need here is the definition of UnknownISA (maybe define that to be an integer = -1).




This shouldn't start with a capital letter. Either name it isaClassMaxRegisterWidth or getISAClassMaxRegisterWidth. I prefer the latter.


This shouldn't start with a capital letter. Either name it isaClassToString or convertISAClassToString. I prefer the latter.


These preprocessor definitions should just be constexpr/static const ints.


Don't use all caps for these names. IT_Alloca or Alloca could work.


Please use consistent terminology: marked as SIMD -> requires a vector variant.


You also need to add the pass to the new pass manager: lib/Passes/PassBuilder.cpp


I wouldn't bother with this. It should be safe if no functions with the vector-variants attribute are present (and, if they are, then you need to run the pass or things might not link).


What cleans up the extra loads/stores here (if we're optimizing)? Are we assuming there's a run of SROA afterwards, or does InstCombine do a good job here, or something else?


1-many -> one-to-many


Move this near the top of the function and check, before you generate code, that the function doesn't already exist. If it does, bail out (e.g., return nullptr and the caller can move on to the next variant/function).


I don't think you need the iterators; this can just be:

for (auto &Arg : Clone->args())

and you use this pattern a lot (declaring two iterators and then using them in a for loop). In almost all of these cases, you should use a range-based for loop instead.


What happens if there's more than one return in the function? You might want, in that case, to create a new block with the return and convert all other returns to branches to that block.


In addition to branches, you need to handle SwitchInst and IndirectBrInst (and, for the latter, you need to find any place where the address of the return block is taken, and replace it with the address of the LoopExitBlock).


Use CreateAdd here so you can set both nuw and nsw on this increment.


What happens if there is more than one store user?

Thanks for the comments, Hal. Just to clarify your point #2, I think what you're saying is that we should start from a common parameter representation; i.e., parameters should be loaded/stored through memory. Please correct me if I'm wrong. I certainly think this would be a great way to reduce the complexity of the algorithm. The remainder of items in your list should already be covered, but some tweaking may be involved.

Thanks for the comments, Hal.

No problem. Thanks for working on this!

Just to clarify your point #2, I think what you're saying is that we should start from a common parameter representation; i.e., parameters should be loaded/stored through memory. Please correct me if I'm wrong. I certainly think this would be a great way to reduce the complexity of the algorithm. The remainder of items in your list should already be covered, but some tweaking may be involved.

For point #2, I'm saying that we should take all local stack allocations and make them wider by a factor of VL. Thinking about this as having VL simultaneously-running copies of the function, one per vector lane, each of those gets a separate "lane" of the local stack allocations. In point #5, I sketched how I'd handle parameters (I'm not exactly sure what you mean by common representation, as different kinds of parameters do require different handling (i.e., vector, uniform, scalar)). What is true is that, for unoptimized code, where the function arguments are generally stored in local stack allocations, all of those stores are now just inside the loop with everything else, so nothing special needs to happen. Does that make sense?

Ok, after doing some experimentation I believe I understand where you're heading with this. Once I have done some more refactoring I'll post a new version of VecClone for review to make sure we're on the same page.

mmasten added inline comments.Jan 26 2018, 4:09 PM

InstCombine will do it. I also know that EarlyCSE will work.


In all the test cases that I have used up to this point, this type of re-wiring has already been done. Granted, I have mainly been testing some very simple functions with multiple returns, but if you have a test case where this happens it would be helpful. Thanks.

mmasten updated this revision to Diff 131670.Jan 26 2018, 4:27 PM

Extensive update to the VecClone algorithm based on Hal's feedback. VecClone pass is now supported through the new pass manager. Other minor code changes made.

The latest update includes some pretty extensive rework of the VecClone algorithm that Hal suggested. I also added support for VecClone in the new pass manager and since it is my first time doing this, I would welcome any specific comments on whether or not this was done correctly. Please note that I'm still working on fixing existing tests and will be adding new ones, but I wanted to make sure the overall algorithm is headed in the right direction. Thanks all.

Just a gentle reminder to have a look at the latest VecClone algorithm. Thanks, Matt.