This work is part of an RFC sent by Xinmin back on 3/2/2016 regarding explicit function vectorization. The VecClone pass translates functions marked with "#pragma omp declare simd" into vector length trip count loops. Please see the RFC for details. It can be found at http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html.
I don't understand what's going on here. Is it not possible to write the transformation such that it's semantically correct regardless of whether we've run mem2reg? This is not always an either/or situation.
I'm working on updating my patch based on the recent comments and other modifications that have been made since the last patch was submitted. I should have it ready in the next couple of days.
Currently, the implementation is based on gcc compatibility. However, it would be nice to extend to support both gcc and icc. The IsaClass values are different because of the calling convention differences. Intel icc uses regcall calling convention and gcc uses standard calling convention. I've added references to both ABI documents.
I agree, this was a bad bug with the example. This has been fixed and updated with an example that demonstrates the behavior of all three types of parameters (uniform, linear, and vector). Please let me know if the bitcasts shown in the example are confusing. The purpose of the bitcasts is so that the loop can appear in scalar form to the loop vectorizer. i.e., we can use a .scalar gep and index to reference vectors in the loop.
Thanks Francesco. Fixed.
Perhaps the comments were written very clearly. The pass has been written to handle parameters regardless of whether or not mem2reg has run. Hopefully, the updated comments are better.
thank you for updating the patch.
I have added one more comment, and I also have a question. Could you please add (here or in a different patch) the plumbing needed in the LoopVectorizer to make the functions available for vectorizationvia this pass? Apologies if you have already done it and I am just missing it.
We will have to support OpenMP 4.5 linear modifiers, which are rendered as 2-char tokens in the mangled names. Woudn't it better to avoid #defines of chars and instead use enums inside the VectorKind class?
Ok, I can change this to an enum inside the VectorKind class. I don't have a patch ready for the LoopVectorize part, but I will prepare one and put it up for review in the next few days. I also have a patch for clang that enables the "vector-variants" attribute support.
That sounds great, thanks!
Clang is already generating the list of mangled vector names as string attributes. I believe that you need to rearrange the code so that strings get produced in the vector-variants attribute.
Moved calcCharacteristicType() function to VectorUtils so that VecClone and LV can share.
Removed vector-variant function attributes in LV instead of VecClone because LV needs to see them.
In general, I think the VecClone pass is too complicated because it tries to handle the "optimized code" vs. "non-optimized code" cases separately. I don't think we should (or, in a theoretical sense, can) do that. We should have a uniform algorithm to handle all incoming IR. I think that we can do something like this:
- Split the entry block at the top and move all allocas in the original entry block to the new entry block.
- For each constant-sized alloca in the entry block, expand it by a factor of VL. For vectorizable types, you can do an alloca of the vector type. Otherwise, use an alloca to generate an array of VL items (i.e., use VL as the alloca's second parameter). Generate an alloca of <RetTy x VL> to hold the return value.
- Generate a new loop around all of the rest of the function (for i = [0, VL-1]) and a new return block (which loads the value from the return-value alloca and returns it).
- Replace all uses of each entry-block alloca, a, with &a[i]. Remove the old allocas.
- All uses of function parameters are now inside the loop. vector parameters will get a vector alloca and store in the entry block, and uses in the loop of the parameter will be replaced by load param[i]. linear parameters get replaced by (p+i). uniform parameters are unchanged.
- Replace the returns with a store to RetVal[i] and a branch to the loop-exit block.
Something like that should work for all functions, optimized or not.
This is X86 specific, and doesn't seem to belong here (also, it's not used anywhere).
This shouldn't be an enum like this. We don't need X86-specific things in the TTI interface. It looks like you've done a good job at making this opaque, so you can move this enum itself into the X86TTI implementation. The only thing we need here is the definition of UnknownISA (maybe define that to be an integer = -1).
This shouldn't start with a capital letter. Either name it isaClassMaxRegisterWidth or getISAClassMaxRegisterWidth. I prefer the latter.
This shouldn't start with a capital letter. Either name it isaClassToString or convertISAClassToString. I prefer the latter.
These preprocessor should just be constexpr/static const ints.
Don't use all caps for these names. IT_Alloca or Alloca could work.
Please use consistent terminology: marked as SIMD -> requires a vector variant.
You also need to add the pass to the new pass manager: lib/Passes/PassBuilder.cpp
I wouldn't bother with this. It should be safe if no functions with the vector-variants attribute are present (and, if they are, then you need to run the pass of things might not link).
What cleans up the extra loads/stores here (if we're optimizing)? Are assuming that there's a run of SROA afterwards, or does InstCombine do a good job here, or something else?
1-many -> one-to-many
Move this near the top of the function and check, before you generate code, that he function doesn't already exist. If it does, bail out (e.g., return a nullptr and the caller can move on to the next variant/function).
I don't think you need the iterators; this can just be:
for (auto &Arg : Clone->args())
and you use this pattern a lot (declaring two iterators and then using them in a for loop). In almost all of these cases, you should use a range-based for loop instead.
What happens if there's more than one return in the function? You might want, in that case, to create a new block with the return and convert all other returns to branches to that block.
In addition to branches, you need to handle SwitchInst and IndirectBrInst (and, for the latter, you need to find any place where the address of the return block is taken, and replace it with the address of the LoopExitBlock).
Use CreateAdd here so you can set both nuw and nsw on this increment.
What happens if there is more than one store user?
Thanks for the comments, Hal. Just to clarify your point #2, I think what you're saying is that we should start from a common parameter representation; i.e., parameters should be loaded/stored through memory. Please correct me if I'm wrong. I certainly think this would be a great way to reduce the complexity of the algorithm. The remainder of items in your list should already be covered, but some tweaking may be involved.
No problem. Thanks for working on this!
Just to clarify your point #2, I think what you're saying is that we should start from a common parameter representation; i.e., parameters should be loaded/stored through memory. Please correct me if I'm wrong. I certainly think this would be a great way to reduce the complexity of the algorithm. The remainder of items in your list should already be covered, but some tweaking may be involved.
For point #2, I'm saying that we should take all local stack allocations and make them wider by a factor of VL. Thinking about this as having VL simultaneously-running copies of the function, one per vector lane, each of those gets a separate "lane" of the local stack allocations. In point #5, I sketched how I'd handle parameters (I'm not exactly sure what you mean by common representation, as different kinds of parameters do require different handling (i.e., vector, uniform, scalar)). What is true is that, for unoptimized code, where the function arguments are generally stored in local stack allocations, all of those stores are now just inside the loop with everything else, so nothing special needs to happen. Does that make sense?
Ok, after doing some experimentation I believe I understand where you're heading with this. Once I have done some more refactoring I'll post a new version of VecClone for review to make sure we're on the same page.
InstCombine will do it. I also know that EarlyCSE will work.
In all the test cases that I have used to this point, this type of re-wiring has already been done. Granted, I have mainly been testing some very simple multiple return functions, but if you have a test case where this happens it will be helpful. Thanks.
The latest update includes some pretty extensive rework of the VecClone algorithm that Hal suggested. I also added support for VecClone in the new pass manager and since it is my first time doing this, I would welcome any specific comments on whether or not this was done correctly. Please note that I'm still working on fixing existing tests and will be adding new ones, but I wanted to make sure the overall algorithm is headed in the right direction. Thanks all.
Hi Matt - overall the patch looks good, I just have a couple of comments.
I think you should remove some of the testing you do on the vectorizer side of things in the vectorizer patch, not here. Here you should be testing only the cloning of the function.
Moreover, I would like to see more testing happening in the following cases:
- when there is only one vector variant listed in the vector-variants attribute
- when the vector variant listed in the attribute is not supported by the target and the cloning does not happen (say you are compiling for SSE but you only have AVX512 vector functions listed in the attribute).
Moreover, I have added one comment about using IR vector types instead of integers to maps ISAs to vector register sizes. That would probably make extending this pass to SVE easier because for vector length agnostic types we will need to do some symbolic computation on vector sizes.
Could you replace information about register size to use IR vector types? For example, you could use <16 x i8> for 128-bit wide registers, or <64 x i8> for 512-bit registers. You could then base all the computations of the VLEN of a vector function on that. I am asking because that will make easier the handling of Vector Length Agnostic (VLA) vector functions that we will end up using for targets like the AArch64 Scalable Vector Extension (SVE).
Any test that checks the caller side functionality should be be under the patch that enables the loop vectorizer to use the VecClone pass.
I think we should unit test each variant, not a single test that picks up one vector function from a list of "vector-variants".
Thanks Francesco. Looks like I accidentally included this test in this patch. I looked to make sure it was already included in the LoopVectorize patch.
Description of the class?
Explicitly set the base type to char
Use member initializers
VectorKind(const VectorKind &Other) = default;
No need to use \brief tag, you must remove it
Why not SmallString?
Why not llvm::raw_svector_ostream?
Use SmallVector or ArrayRef
std::string->StringRef. Is this target-specific?