Page MenuHomePhabricator

VecClone Pass
Needs ReviewPublic

Authored by mmasten on Jul 25 2016, 5:37 PM.

Details

Summary

This work is part of an RFC sent by Xinmin back on 3/2/2016 regarding explicit function vectorization. The VecClone pass translates functions marked with "#pragma omp declare simd" into vector length trip count loops. Please see the RFC for details. It can be found at http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
hfinkel added inline comments.Sep 1 2017, 4:04 PM
lib/Transforms/Utils/VecClone.cpp
100

I don't understand what's going on here. Is it not possible to write the transformation such that it's semantically correct regardless of whether we've run mem2reg? This is not always an either/or situation.

hfinkel edited edge metadata.Sep 1 2017, 4:05 PM

Also, does the fact that we now have the VPlan infrastructure in the vectorizer change, in any way, how we'd like to approach this problem?

Hello - gentle ping on this patch. I think it is an important addition to the compiler. Do we need to reactivate the discussion in the mailing list?

Francesco

Hi Francesco,

I'm working on updating my patch based on the recent comments and other modifications that have been made since the last patch was submitted. I should have it ready in the next couple of days.

Thanks,

Matt

mmasten added inline comments.Oct 31 2017, 12:26 PM
lib/Target/X86/X86TargetTransformInfo.cpp
2850

Currently, the implementation is based on gcc compatibility. However, it would be nice to extend to support both gcc and icc. The IsaClass values are different because of the calling convention differences. Intel icc uses regcall calling convention and gcc uses standard calling convention. I've added references to both ABI documents.

lib/Transforms/Utils/VecClone.cpp
39

I agree, this was a bad bug with the example. This has been fixed and updated with an example that demonstrates the behavior of all three types of parameters (uniform, linear, and vector). Please let me know if the bitcasts shown in the example are confusing. The purpose of the bitcasts is so that the loop can appear in scalar form to the loop vectorizer. i.e., we can use a .scalar gep and index to reference vectors in the loop.

42

Thanks Francesco. Fixed.

100

Perhaps the comments were written very clearly. The pass has been written to handle parameters regardless of whether or not mem2reg has run. Hopefully, the updated comments are better.

mmasten updated this revision to Diff 121035.Oct 31 2017, 12:29 PM

Hello,

thank you for updating the patch.

I have added one more comment, and I also have a question. Could you please add (here or in a different patch) the plumbing needed in the LoopVectorizer to make the functions available for vectorizationvia this pass? Apologies if you have already done it and I am just missing it.

Francesco

include/llvm/Analysis/VectorVariant.h
35–38

We will have to support OpenMP 4.5 linear modifiers, which are rendered as 2-char tokens in the mangled names. Woudn't it better to avoid #defines of chars and instead use enums inside the VectorKind class?

fhahn added a subscriber: fhahn.Nov 6 2017, 6:11 AM
mmasten added inline comments.Nov 6 2017, 11:49 AM
include/llvm/Analysis/VectorVariant.h
35–38

Ok, I can change this to an enum inside the VectorKind class. I don't have a patch ready for the LoopVectorize part, but I will prepare one and put it up for review in the next few days. I also have a patch for clang that enables the "vector-variants" attribute support.

fpetrogalli added inline comments.Nov 6 2017, 2:25 PM
include/llvm/Analysis/VectorVariant.h
35–38

Ok, I can change this to an enum inside the VectorKind class.

Thank you.

I don't have a patch ready for the LoopVectorize part, but I will prepare one and put it up for review in the next few days.

That sounds great, thanks!

I also have a patch for clang that enables the "vector-variants" attribute support.

Clang is already generating the list of mangled vector names as string attributes. I believe that you need to rearrange the code so that strings get produced in the vector-variants attribute.

mmasten added inline comments.Nov 27 2017, 4:38 PM
include/llvm/Analysis/VectorVariant.h
35–38

Hi Francesco,

I have the LoopVectorize part of this done. Are there any objections to making it part of this review?

Thanks,

Matt

fpetrogalli added inline comments.Nov 28 2017, 12:41 AM
include/llvm/Analysis/VectorVariant.h
35–38

Hello Matt,

please don't add code to this review. I'd prefer to see the changes related to the Vectorizer in a separate patch.

Kind regards,

Francesco

mmasten updated this revision to Diff 124626.Nov 28 2017, 1:14 PM

Moved calcCharacteristicType() function to VectorUtils so that VecClone and LV can share.
Removed vector-variant function attributes in LV instead of VecClone because LV needs to see them.

mmasten added inline comments.Nov 28 2017, 1:17 PM
include/llvm/Analysis/VectorVariant.h
35–38

Thanks Francesco. I made a couple of updates to this patch and created a new patch for the LV side of things. The LV patch is revision D40575.

In general, I think the VecClone pass is too complicated because it tries to handle the "optimized code" vs. "non-optimized code" cases separately. I don't think we should (or, in a theoretical sense, can) do that. We should have a uniform algorithm to handle all incoming IR. I think that we can do something like this:

  1. Split the entry block at the top and move all allocas in the original entry block to the new entry block.
  2. For each constant-sized alloca in the entry block, expand it by a factor of VL. For vectorizable types, you can do an alloca of the vector type. Otherwise, use an alloca to generate an array of VL items (i.e., use VL as the alloca's second parameter). Generate an alloca of <RetTy x VL> to hold the return value.
  3. Generate a new loop around all of the rest of the function (for i = [0, VL-1]) and a new return block (which loads the value from the return-value alloca and returns it).
  4. Replace all uses of each entry-block alloca, a, with &a[i]. Remove the old allocas.
  5. All uses of function parameters are now inside the loop. vector parameters will get a vector alloca and store in the entry block, and uses in the loop of the parameter will be replaced by load param[i]. linear parameters get replaced by (p+i). uniform parameters are unchanged.
  6. Replace the returns with a store to RetVal[i] and a branch to the loop-exit block.

Something like that should work for all functions, optimized or not.

include/llvm/Analysis/TargetTransformInfo.h
633

This is X86 specific, and doesn't seem to belong here (also, it's not used anywhere).

648

This shouldn't be an enum like this. We don't need X86-specific things in the TTI interface. It looks like you've done a good job at making this opaque, so you can move this enum itself into the X86TTI implementation. The only thing we need here is the definition of UnknownISA (maybe define that to be an integer = -1).

653

UnknownISA

895

This shouldn't start with a capital letter. Either name it isaClassMaxRegisterWidth or getISAClassMaxRegisterWidth. I prefer the latter.

898

This shouldn't start with a capital letter. Either name it isaClassToString or convertISAClassToString. I prefer the latter.

include/llvm/Analysis/VectorVariant.h
35

These preprocessor should just be constexpr/static const ints.

include/llvm/Transforms/Utils/VecClone.h
27

Don't use all caps for these names. IT_Alloca or Alloca could work.

73

Please use consistent terminology: marked as SIMD -> requires a vector variant.

lib/Transforms/IPO/PassManagerBuilder.cpp
35

You also need to add the pass to the new pass manager: lib/Passes/PassBuilder.cpp

98

I wouldn't bother with this. It should be safe if no functions with the vector-variants attribute are present (and, if they are, then you need to run the pass of things might not link).

lib/Transforms/Utils/VecClone.cpp
87

What cleans up the extra loads/stores here (if we're optimizing)? Are assuming that there's a run of SROA afterwards, or does InstCombine do a good job here, or something else?

163

1-many -> one-to-many

290

Move this near the top of the function and check, before you generate code, that he function doesn't already exist. If it does, bail out (e.g., return a nullptr and the caller can move on to the next variant/function).

346

I don't think you need the iterators; this can just be:

for (auto &Arg : Clone->args())

and you use this pattern a lot (declaring two iterators and then using them in a for loop). In almost all of these cases, you should use a range-based for loop instead.

419

What happens if there's more than one return in the function? You might want, in that case, to create a new block with the return and convert all other returns to branches to that block.

484

In addition to branches, you need to handle SwitchInst and IndirectBrInst (and, for the latter, you need to find any place where the address of the return block is taken, and replace it with the address of the LoopExitBlock).

535

Use CreateAdd here so you can set both nuw and nsw on this increment.

639

What happens if there is more than one store user?

Thanks for the comments, Hal. Just to clarify your point #2, I think what you're saying is that we should start from a common parameter representation; i.e., parameters should be loaded/stored through memory. Please correct me if I'm wrong. I certainly think this would be a great way to reduce the complexity of the algorithm. The remainder of items in your list should already be covered, but some tweaking may be involved.

Thanks for the comments, Hal.

No problem. Thanks for working on this!

Just to clarify your point #2, I think what you're saying is that we should start from a common parameter representation; i.e., parameters should be loaded/stored through memory. Please correct me if I'm wrong. I certainly think this would be a great way to reduce the complexity of the algorithm. The remainder of items in your list should already be covered, but some tweaking may be involved.

For point #2, I'm saying that we should take all local stack allocations and make them wider by a factor of VL. Thinking about this as having VL simultaneously-running copies of the function, one per vector lane, each of those gets a separate "lane" of the local stack allocations. In point #5, I sketched how I'd handle parameters (I'm not exactly sure what you mean by common representation, as different kinds of parameters do require different handling (i.e., vector, uniform, scalar)). What is true is that, for unoptimized code, where the function arguments are generally stored in local stack allocations, all of those stores are now just inside the loop with everything else, so nothing special needs to happen. Does that make sense?

Ok, after doing some experimentation I believe I understand where you're heading with this. Once I have done some more refactoring I'll post a new version of VecClone for review to make sure we're on the same page.

mmasten added inline comments.Jan 26 2018, 4:09 PM
lib/Transforms/Utils/VecClone.cpp
87

InstCombine will do it. I also know that EarlyCSE will work.

419

In all the test cases that I have used to this point, this type of re-wiring has already been done. Granted, I have mainly been testing some very simple multiple return functions, but if you have a test case where this happens it will be helpful. Thanks.

mmasten updated this revision to Diff 131670.Jan 26 2018, 4:27 PM

Extensive update to the VecClone algorithm based on Hal's feedback. VecClone pass is now supported through the new pass manager. Other minor code changes made.

The latest update includes some pretty extensive rework of the VecClone algorithm that Hal suggested. I also added support for VecClone in the new pass manager and since it is my first time doing this, I would welcome any specific comments on whether or not this was done correctly. Please note that I'm still working on fixing existing tests and will be adding new ones, but I wanted to make sure the overall algorithm is headed in the right direction. Thanks all.

Just a gentle reminder to have a look at the latest VecClone algorithm. Thanks, Matt.

Hi Matt - overall the patch looks good, I just have a couple of comments.

I think you should remove some of the testing you do on the vectorizer side of things in the vectorizer patch, not here. Here you should be testing only the cloning of the function.

Moreover, I would like to see more testing happening in the following cases:

  1. when there is only one vector variant listed in the vector-variants attribute
  2. when the vector variant listed in the attribute is not supported by the target and the cloning does not happen (say you are compiling for SSE but you only have AVX512 vector functions listed in the attribute).

Moreover, I have added one comment about using IR vector types instead of integers to maps ISAs to vector register sizes. That would probably make extending this pass to SVE easier because for vector length agnostic types we will need to do some symbolic computation on vector sizes.

Thank you,

Francesco

include/llvm/Analysis/TargetTransformInfo.h
898

Could you replace information about register size to use IR vector types? For example, you could use <16 x i8> for 128-bit wide registers, or <64 x i8> for 512-bit registers. You could then base all the computations of the VLEN of a vector function on that. I am asking because that will make easier the handling of Vector Length Agnostic (VLA) vector functions that we will end up using for targets like the AArch64 Scalable Vector Extension (SVE).

test/Transforms/LoopVectorize/masked_simd_func.ll
1

Any test that checks the caller side functionality should be be under the patch that enables the loop vectorizer to use the VecClone pass.

test/Transforms/LoopVectorize/simd_func.ll
83

I think we should unit test each variant, not a single test that picks up one vector function from a list of "vector-variants".

mmasten added inline comments.Apr 26 2018, 11:04 AM
test/Transforms/LoopVectorize/masked_simd_func.ll
1

Thanks Francesco. Looks like I accidentally included this test in this patch. I looked to make sure it was already included in the LoopVectorize patch.

ABataev added inline comments.
include/llvm/Analysis/VectorVariant.h
38

Description of the class?

41

Explicitly set the base type to char

48

Use member initializers

69

VectorKind(const VectorKind &Other) = default;

75

No need to use \brief tag, you must remove it

77

const function

80

const function

83

const function

86

const function

91

const function

94

const function

97

const function

100

const function

103

const function

111

Why not SmallString?

112

Why not llvm::raw_svector_ostream?

132

char->ParmKind

144

Use SmallVector or ArrayRef

146

std::string->StringRef. Is this target-specific?

152

const function

155

const function

158

const function

161

const function

166

SmallString

166

const function

186

const function

193

const function

206

Target-specific?

212

Target-specific?

include/llvm/Transforms/Utils/VecClone.h
31

SmallVector

34

DenseMap

41

const function

51

const function