This work is part of an RFC sent by Xinmin back on 3/2/2016 regarding explicit function vectorization. The VecClone pass translates functions marked with "#pragma omp declare simd" into vector length trip count loops. Please see the RFC for details. It can be found at http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html.
Details
Diff Detail
Event Timeline
Thanks for the nicely commented code and the provided tests.
A few inlined comments after a quick overview.
include/llvm/Analysis/VectorUtils.h | ||
---|---|---|
146 | Why do all these need to be public APIs? | |
include/llvm/Transforms/Utils/VecClone.h | ||
11 | Why does this need to be in a public header? | |
lib/Analysis/VectorUtils.cpp | ||
564 | This is all not very inefficient:
It seems that the only use for this function is to populate a list of function to "vectorize". Populating a "vector<Function *>" should be enough for that. A nice property would be to be able to figure easily/efficiently if a function needs to be "vectorized". | |
lib/Transforms/Utils/VecClone.cpp | ||
14 | Some overview with an example/overview of what the pass accomplish would be nice. It is already captured in the RFC, but for someone reading the code it'd be helpful. | |
20 | I think it should be spelled-out more clearly that it is required for correctness to have all the variants mentioned in the attribute list to be codegen'd, even at O0, because even if the vectorizer does not run in this module, it may run in another module that would expect these variant to exist. | |
1431 | So this will walk all the functions in the module and walk all the attributes and do a string comparison on all of these. | |
1435 | Coding style: no braces (other places as well). | |
1438 | Spurious empty line (other places as well). | |
1486 | What about a SmallVector and move it out-of-the loop (the call to clear() below is already handling the reset between iterations). |
Thanks for the comments, Mehdi. I had some other things come up, but I'm making some corrections now.
include/llvm/Analysis/VectorUtils.h | ||
---|---|---|
146 | They don't need to be. I can move these inside the VecClone class. Some of these could eventually become more generalized utilities, but I'll change it for now. | |
include/llvm/Transforms/Utils/VecClone.h | ||
11 | Do you mean that the class definition should be moved to VecClone.cpp? | |
lib/Analysis/VectorUtils.cpp | ||
564 | I will change #2. We need a little more than vector<Function*> here because each simd function will have multiple vector variants. FuncVars is a 1-many mapping of the original function to the string encodings corresponding to the variants that will be generated later. The string representation is essentially what is defined in https://www.cilkplus.org/sites/default/files/open_specifications/Intel-ABI-Vector-Function-2012-v0.9.5.pdf. We can easily tell which functions need to be vectorized by checking if any of these encodings exist as attributes on the original function. Can you elaborate on what you mean by "more structured"? Are you suggested an alternative representation to string-based function attributes? Thanks | |
lib/Transforms/Utils/VecClone.cpp | ||
1431 | We could selectively run the VecClone pass from PassManagerBuilder based on whether the OpenMP switch has been used. Otherwise, I don't know of another way to figure out which functions will need to be generated. Do you have a suggestion? |
lib/Analysis/VectorUtils.cpp | ||
---|---|---|
564 | Let me clarify what I meant with a vector<Function *>. This is used exactly *once* here: for (auto& pair : FunctionsToVectorize) { Function& F = *pair.first; DeclaredVariants &DeclaredVariants = pair.second; If FunctionsToVectorize was a vector of Function *, you could get the list of DeclaredVariants on the fly while iterating on the attributes, without creating any temporary "string list". |
lib/Analysis/VectorUtils.cpp | ||
---|---|---|
564 | To come back to the "more structured", the question is "how to get the list of function that need variants very quickly/cheaply". (Ideally I'd like a storage key->value where the key would be "variants" and the value would be a list of variants to generate, instead of the flat list where variants strings are mixed with the others and recognized by "magic" pattern matching) |
include/llvm/Analysis/VectorVariant.h | ||
---|---|---|
139 | There's a lot of target information here in the target-independent code. Given that we're not going to vectorize without a target code model regardless, I'd like to see this information pushing into TargetTransformInfo. VecClone can then use TTI to convert the particular ISA tags into information about vector lengths, etc. Other architectures that are adapting this scheme can then extend this in a natural way. | |
lib/Analysis/VectorUtils.cpp | ||
564 | I'd like to come back to this. I agree with Mehdi, having unadorned mangled names as attributes isn't the right design - the "magic" pattern matching is unnecessary. It would seem much better to use something like: attributes #0 = { "vector-variants"="_ZGVbM4l_foo,_ZGVbN4l_foo,_ZGVcM8l_foo,_ZGVcN8l_foo,_ZGVdM8l_foo,_ZGVdN8l_foo,_ZGVeM16l_foo,_ZGVeN16l_foo" instead of: attributes #0 = { hasvectorvariants "_ZGVbM4l_foo" "_ZGVbN4l_foo" "_ZGVcM8l_foo" "_ZGVcN8l_foo" "_ZGVdM8l_foo" "_ZGVdN8l_foo" "_ZGVeM16l_foo" "_ZGVeN16l_foo" exactly like we do for "target-features". |
New changes are:
- Update function attributes to "vector-variants"="<variant list>" format.
- Move target-specific code in the VectorVariant class to TTI.
lib/Transforms/Utils/VecClone.cpp | ||
---|---|---|
42 | Hello Matt, thank you for working on this. <nitpicking>Should the comment be updated with the "vector-variants"="_ZGVbM4ul_, ZGVbN4ul_" syntax? </nitpicking> Cheers, Francesco |
Hello again,
I think it would be good to have references to the document that describes the vector ABI. Is it the one at [1]?
If so, I think it would be good to cover with tests the "linear clause" cases of Table 1 in section 2.4, "Element Data Type to Vector Data Type Mapping".
Francesco
[1] https://software.intel.com/sites/default/files/managed/b4/c8/Intel-Vector-Function-ABI.pdf
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2289 ↗ | (On Diff #87567) | Section 2.6 "Vector Function Name Mangling" of [1] maps the IsaClass to other values, "x', 'y'. Are you basing this on [2] for gcc compatibility? Francesco [1] https://software.intel.com/sites/default/files/managed/b4/c8/Intel-Vector-Function-ABI.pdf |
lib/Transforms/Utils/VecClone.cpp | ||
---|---|---|
39 | This function doesn't return anything, and I think that makes the example hard to understand. | |
100 | I don't understand what's going on here. Is it not possible to write the transformation such that it's semantically correct regardless of whether we've run mem2reg? This is not always an either/or situation. |
Also, does the fact that we now have the VPlan infrastructure in the vectorizer change, in any way, how we'd like to approach this problem?
Hello - gentle ping on this patch. I think it is an important addition to the compiler. Do we need to reactivate the discussion in the mailing list?
Francesco
Hi Francesco,
I'm working on updating my patch based on the recent comments and other modifications that have been made since the last patch was submitted. I should have it ready in the next couple of days.
Thanks,
Matt
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2289 ↗ | (On Diff #87567) | Currently, the implementation is based on gcc compatibility. However, it would be nice to extend to support both gcc and icc. The IsaClass values are different because of the calling convention differences. Intel icc uses regcall calling convention and gcc uses standard calling convention. I've added references to both ABI documents. |
lib/Transforms/Utils/VecClone.cpp | ||
39 | I agree, this was a bad bug with the example. This has been fixed and updated with an example that demonstrates the behavior of all three types of parameters (uniform, linear, and vector). Please let me know if the bitcasts shown in the example are confusing. The purpose of the bitcasts is so that the loop can appear in scalar form to the loop vectorizer. i.e., we can use a .scalar gep and index to reference vectors in the loop. | |
42 | Thanks Francesco. Fixed. | |
100 | Perhaps the comments were written very clearly. The pass has been written to handle parameters regardless of whether or not mem2reg has run. Hopefully, the updated comments are better. |
Hello,
thank you for updating the patch.
I have added one more comment, and I also have a question. Could you please add (here or in a different patch) the plumbing needed in the LoopVectorizer to make the functions available for vectorizationvia this pass? Apologies if you have already done it and I am just missing it.
Francesco
include/llvm/Analysis/VectorVariant.h | ||
---|---|---|
35–38 | We will have to support OpenMP 4.5 linear modifiers, which are rendered as 2-char tokens in the mangled names. Woudn't it better to avoid #defines of chars and instead use enums inside the VectorKind class? |
include/llvm/Analysis/VectorVariant.h | ||
---|---|---|
35–38 | Ok, I can change this to an enum inside the VectorKind class. I don't have a patch ready for the LoopVectorize part, but I will prepare one and put it up for review in the next few days. I also have a patch for clang that enables the "vector-variants" attribute support. |
include/llvm/Analysis/VectorVariant.h | ||
---|---|---|
35–38 |
Thank you.
That sounds great, thanks!
Clang is already generating the list of mangled vector names as string attributes. I believe that you need to rearrange the code so that strings get produced in the vector-variants attribute. |
include/llvm/Analysis/VectorVariant.h | ||
---|---|---|
35–38 | Hi Francesco, I have the LoopVectorize part of this done. Are there any objections to making it part of this review? Thanks, Matt |
include/llvm/Analysis/VectorVariant.h | ||
---|---|---|
35–38 | Hello Matt, please don't add code to this review. I'd prefer to see the changes related to the Vectorizer in a separate patch. Kind regards, Francesco |
Moved calcCharacteristicType() function to VectorUtils so that VecClone and LV can share.
Removed vector-variant function attributes in LV instead of VecClone because LV needs to see them.
In general, I think the VecClone pass is too complicated because it tries to handle the "optimized code" vs. "non-optimized code" cases separately. I don't think we should (or, in a theoretical sense, can) do that. We should have a uniform algorithm to handle all incoming IR. I think that we can do something like this:
- Split the entry block at the top and move all allocas in the original entry block to the new entry block.
- For each constant-sized alloca in the entry block, expand it by a factor of VL. For vectorizable types, you can do an alloca of the vector type. Otherwise, use an alloca to generate an array of VL items (i.e., use VL as the alloca's second parameter). Generate an alloca of <RetTy x VL> to hold the return value.
- Generate a new loop around all of the rest of the function (for i = [0, VL-1]) and a new return block (which loads the value from the return-value alloca and returns it).
- Replace all uses of each entry-block alloca, a, with &a[i]. Remove the old allocas.
- All uses of function parameters are now inside the loop. vector parameters will get a vector alloca and store in the entry block, and uses in the loop of the parameter will be replaced by load param[i]. linear parameters get replaced by (p+i). uniform parameters are unchanged.
- Replace the returns with a store to RetVal[i] and a branch to the loop-exit block.
Something like that should work for all functions, optimized or not.
include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
633 ↗ | (On Diff #124626) | This is X86 specific, and doesn't seem to belong here (also, it's not used anywhere). |
648 ↗ | (On Diff #124626) | This shouldn't be an enum like this. We don't need X86-specific things in the TTI interface. It looks like you've done a good job at making this opaque, so you can move this enum itself into the X86TTI implementation. The only thing we need here is the definition of UnknownISA (maybe define that to be an integer = -1). |
653 ↗ | (On Diff #124626) | UnknownISA |
917 ↗ | (On Diff #124626) | This shouldn't start with a capital letter. Either name it isaClassMaxRegisterWidth or getISAClassMaxRegisterWidth. I prefer the latter. |
920 ↗ | (On Diff #124626) | This shouldn't start with a capital letter. Either name it isaClassToString or convertISAClassToString. I prefer the latter. |
include/llvm/Analysis/VectorVariant.h | ||
35 | These preprocessor should just be constexpr/static const ints. | |
include/llvm/Transforms/Utils/VecClone.h | ||
27 | Don't use all caps for these names. IT_Alloca or Alloca could work. | |
73 | Please use consistent terminology: marked as SIMD -> requires a vector variant. | |
lib/Transforms/IPO/PassManagerBuilder.cpp | ||
34 | You also need to add the pass to the new pass manager: lib/Passes/PassBuilder.cpp | |
106 | I wouldn't bother with this. It should be safe if no functions with the vector-variants attribute are present (and, if they are, then you need to run the pass of things might not link). | |
lib/Transforms/Utils/VecClone.cpp | ||
87 | What cleans up the extra loads/stores here (if we're optimizing)? Are assuming that there's a run of SROA afterwards, or does InstCombine do a good job here, or something else? | |
163 | 1-many -> one-to-many | |
290 | Move this near the top of the function and check, before you generate code, that he function doesn't already exist. If it does, bail out (e.g., return a nullptr and the caller can move on to the next variant/function). | |
346 | I don't think you need the iterators; this can just be: for (auto &Arg : Clone->args()) and you use this pattern a lot (declaring two iterators and then using them in a for loop). In almost all of these cases, you should use a range-based for loop instead. | |
419 | What happens if there's more than one return in the function? You might want, in that case, to create a new block with the return and convert all other returns to branches to that block. | |
484 | In addition to branches, you need to handle SwitchInst and IndirectBrInst (and, for the latter, you need to find any place where the address of the return block is taken, and replace it with the address of the LoopExitBlock). | |
535 | Use CreateAdd here so you can set both nuw and nsw on this increment. | |
639 | What happens if there is more than one store user? |
Thanks for the comments, Hal. Just to clarify your point #2, I think what you're saying is that we should start from a common parameter representation; i.e., parameters should be loaded/stored through memory. Please correct me if I'm wrong. I certainly think this would be a great way to reduce the complexity of the algorithm. The remainder of items in your list should already be covered, but some tweaking may be involved.
No problem. Thanks for working on this!
Just to clarify your point #2, I think what you're saying is that we should start from a common parameter representation; i.e., parameters should be loaded/stored through memory. Please correct me if I'm wrong. I certainly think this would be a great way to reduce the complexity of the algorithm. The remainder of items in your list should already be covered, but some tweaking may be involved.
For point #2, I'm saying that we should take all local stack allocations and make them wider by a factor of VL. Thinking about this as having VL simultaneously-running copies of the function, one per vector lane, each of those gets a separate "lane" of the local stack allocations. In point #5, I sketched how I'd handle parameters (I'm not exactly sure what you mean by common representation, as different kinds of parameters do require different handling (i.e., vector, uniform, scalar)). What is true is that, for unoptimized code, where the function arguments are generally stored in local stack allocations, all of those stores are now just inside the loop with everything else, so nothing special needs to happen. Does that make sense?
Ok, after doing some experimentation I believe I understand where you're heading with this. Once I have done some more refactoring I'll post a new version of VecClone for review to make sure we're on the same page.
lib/Transforms/Utils/VecClone.cpp | ||
---|---|---|
87 | InstCombine will do it. I also know that EarlyCSE will work. | |
419 | In all the test cases that I have used to this point, this type of re-wiring has already been done. Granted, I have mainly been testing some very simple multiple return functions, but if you have a test case where this happens it will be helpful. Thanks. |
Extensive update to the VecClone algorithm based on Hal's feedback. VecClone pass is now supported through the new pass manager. Other minor code changes made.
The latest update includes some pretty extensive rework of the VecClone algorithm that Hal suggested. I also added support for VecClone in the new pass manager and since it is my first time doing this, I would welcome any specific comments on whether or not this was done correctly. Please note that I'm still working on fixing existing tests and will be adding new ones, but I wanted to make sure the overall algorithm is headed in the right direction. Thanks all.
Just a gentle reminder to have a look at the latest VecClone algorithm. Thanks, Matt.
Hi Matt - overall the patch looks good, I just have a couple of comments.
I think you should remove some of the testing you do on the vectorizer side of things in the vectorizer patch, not here. Here you should be testing only the cloning of the function.
Moreover, I would like to see more testing happening in the following cases:
- when there is only one vector variant listed in the vector-variants attribute
- when the vector variant listed in the attribute is not supported by the target and the cloning does not happen (say you are compiling for SSE but you only have AVX512 vector functions listed in the attribute).
Moreover, I have added one comment about using IR vector types instead of integers to maps ISAs to vector register sizes. That would probably make extending this pass to SVE easier because for vector length agnostic types we will need to do some symbolic computation on vector sizes.
Thank you,
Francesco
include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
898 ↗ | (On Diff #131670) | Could you replace information about register size to use IR vector types? For example, you could use <16 x i8> for 128-bit wide registers, or <64 x i8> for 512-bit registers. You could then base all the computations of the VLEN of a vector function on that. I am asking because that will make easier the handling of Vector Length Agnostic (VLA) vector functions that we will end up using for targets like the AArch64 Scalable Vector Extension (SVE). |
test/Transforms/LoopVectorize/masked_simd_func.ll | ||
1 ↗ | (On Diff #131670) | Any test that checks the caller side functionality should be be under the patch that enables the loop vectorizer to use the VecClone pass. |
test/Transforms/LoopVectorize/simd_func.ll | ||
83 ↗ | (On Diff #131670) | I think we should unit test each variant, not a single test that picks up one vector function from a list of "vector-variants". |
test/Transforms/LoopVectorize/masked_simd_func.ll | ||
---|---|---|
1 ↗ | (On Diff #131670) | Thanks Francesco. Looks like I accidentally included this test in this patch. I looked to make sure it was already included in the LoopVectorize patch. |
include/llvm/Analysis/VectorVariant.h | ||
---|---|---|
39 | Description of the class? | |
42 | Explicitly set the base type to char | |
49 | Use member initializers | |
70 | VectorKind(const VectorKind &Other) = default; | |
76 | No need to use \brief tag, you must remove it | |
78 | const function | |
81 | const function | |
84 | const function | |
87 | const function | |
92 | const function | |
95 | const function | |
98 | const function | |
101 | const function | |
104 | const function | |
112 | Why not SmallString? | |
113 | Why not llvm::raw_svector_ostream? | |
133 | char->ParmKind | |
145 | Use SmallVector or ArrayRef | |
147 | std::string->StringRef. Is this target-specific? | |
153 | const function | |
156 | const function | |
159 | const function | |
162 | const function | |
167 | SmallString | |
167 | const function | |
187 | const function | |
194 | const function | |
207 | Target-specific? | |
213 | Target-specific? | |
include/llvm/Transforms/Utils/VecClone.h | ||
32 | SmallVector | |
35 | DenseMap | |
42 | const function | |
52 | const function |
Why do all these need to be public APIs?