This is an archive of the discontinued LLVM Phabricator instance.

[VPlan] Add VPInstruction to VPRecipe transformation.
ClosedPublic

Authored by fhahn on May 14 2018, 4:03 AM.

Details

Summary

This patch introduces a VPInstructionToVPRecipe transformation, which
allows us to generate code for a VPInstruction based VPlan re-using the
existing infrastructure.

Diff Detail

Event Timeline

fhahn created this revision.May 14 2018, 4:03 AM
fhahn updated this revision to Diff 146616.May 14 2018, 8:14 AM

Only add VPValues for branch conditions once for edge masks.

Hi Florian. Thanks for the patch!

This patch just moves things around and makes things clearer, but it modifies
the inner loop vectorizer. I think we should try to use as many parts of VPlan
for inner loop vectorization as well, to make sure we get as much
testing early on. But I am not entirely sure if it would be better to have
a completely separate inner loop vectorization path in the VPlan native path, to
eliminate the risk of breaking things in the inner loop vectorizers.
What do you think?

I agree with you but I think it's too soon to enable some of this code in the inner loop vectorizer. I would wait at least until we have codegen support for outer loops so that we can do a more proper testing before we use it for production. In addition, another big concern is that introducing this code in the inner loop vectorizer too soon will limit the current development flexibility that we have in the VPlan native path. For those reasons, I would prefer this patch to just do the VPInstruction2VPRecipe transformation for the VPlan native path and leave the inner loop vectorizer changes for later. Does it make sense to you?

lib/Transforms/Vectorize/LoopVectorizationPlanner.h
363 ↗(On Diff #146616)

I think we should try to find a better place for this. The planner shouldn't be responsible for VPlan-to-VPlan transformations but just for doing the planning (orchestrating them). Otherwise, it will turn into a big brown bag of unrelated stuff.

Since this is a small and temporal thing and it's kind of related to code generation, maybe we could include it in ILV class as a "prepare" CG step?

Another option would be to create a specific class for it. We could rename VPlanHCFGBuilder.h/.cpp -> VPlanHCFGTransforms.h/.cpp and place there all the "small" VPlan-to-VPlan transformations.

fhahn added a comment.May 15 2018, 9:44 AM

I agree with you but I think it's too soon to enable some of this code in the inner loop vectorizer. I would wait at least until we have codegen support for outer loops so that we can do a more proper testing before we use it for production. In addition, another big concern is that introducing this code in the inner loop vectorizer too soon will limit the current development flexibility that we have in the VPlan native path. For those reasons, I would prefer this patch to just do the VPInstruction2VPRecipe transformation for the VPlan native path and leave the inner loop vectorizer changes for later. Does it make sense to you?

Yep, I think it would be better to move it to the VPlan native path, to have more flexibility when it comes to making changes.

The only problem is that the recipe generation depends on the cost model and we probably have to duplicate some code from the inner loop vectorizer in the VPlan native path, until we gradually replace them by VPlan implementations. I think that would give us a stable starting point (the VPlan native path should pass all existing inner loop vectorizer tests) and would help us to work towards VPlan native inner loop vectorization at a steady pace. Does this make sense?

lib/Transforms/Vectorize/LoopVectorizationPlanner.h
363 ↗(On Diff #146616)

Agreed I just put it there because I was not sure where to best put it. I will move it to VPlanHCFGTransforms.

The only problem is that the recipe generation depends on the cost model and we probably have to duplicate some code from the inner loop vectorizer in the VPlan native path, until we gradually replace them by VPlan implementations. I think that would give us a stable starting point (the VPlan native path should pass all existing inner loop vectorizer tests) and would help us to work towards VPlan native inner loop vectorization at a steady pace. Does this make sense?

What is the main dependence with cost model? Maybe @hsaito could have it into account since he is looking at how to refactor the current cost model.
In any case, I don't think we have to replicate too much code if we start with something simple, even if we generate inefficient code at the beginning. For example, I don't think we initially need the DeadInstructions code, anything related to masking or uniformity (since only uniform branches are allowed in the native path for now) or interleave accesses. That should simplify a lot the VPInstruction2VPRecipe step.

The only problem is that the recipe generation depends on the cost model and we probably have to duplicate some code from the inner loop vectorizer in the VPlan native path, until we gradually replace them by VPlan implementations. I think that would give us a stable starting point (the VPlan native path should pass all existing inner loop vectorizer tests) and would help us to work towards VPlan native inner loop vectorization at a steady pace. Does this make sense?

What is the main dependence with cost model? Maybe @hsaito could have it into account since he is looking at how to refactor the current cost model.
In any case, I don't think we have to replicate too much code if we start with something simple, even if we generate inefficient code at the beginning. For example, I don't think we initially need the DeadInstructions code, anything related to masking or uniformity (since only uniform branches are allowed in the native path for now) or interleave accesses. That should simplify a lot the VPInstruction2VPRecipe step.

I'm working on CostModel starting from the last part, i.e., walk HCFG and VPRecipes/VPInstructions instead of BBs and Instructions, to compute the cost (and going backwards). There are bunch of decision processes, e.g., memref should be vectorized or serialized, that also need to be "converted" before the entire CostModel becomes VPlan ready ---- and the last part certainly depends on those decisions. I plan to create IR versions (which is existing code) and VPlan versions (new code), make sure we can compare side by side, become comfortable to switch to VPlan versions, and then finally retire the IR versions.

fhahn planned changes to this revision.May 21 2018, 9:47 AM

I'll look into restructuring this soon.

fhahn added a comment.May 25 2018, 9:41 AM

I've tried restructuring the patch as suggested now and would appreciate your feedback on a few things before I update the patch:

  • Location of transformVPInstructionsToVPRecipies: I agree with Diego that LoopVectorizationPlanner is not the ideal place. But it needs access to the tryTo* functions to create VPRecipes, which unfortunately are defined in LoopVectorizationPlanner. I think there are 3 options: 1) keep transformVPInstructionsToVPRecipies in LoopVectorizationPlanner, 2) move the tryTo* functions to a better place (something like VPRecipeBuilder) or 3) pass a LoopVectorizationPlanner to the VPInstructionToVPRecipe transform. I think we should go for 1), if there is a concrete plan to get rid of VPRecipes in the near future or 2) if we need to keep the VPRecipes around for the time being (at least for the non-VPlan native path)
  • Move VPInstructionToVPRecipe to the VPlan native path: should be fairly straight forward, we already instantiate the legacy cost model in the VPlan native path, all we need to do is to allow inner loops in the VPlan native path, use the cost model to get the MaxVF and hook up VPlan execution, in case we decide to vectorize. We can migrate all those parts to VPlan infrastructure separately. Does that make sense?
  • Simplify buildVPlans() in the non VPlan native path: if we are pushing development of the VPlan based inner loop vectorization to the VPlan native path, we can avoid creating VPlans for each valid vectorization factor, as we throw them away without using them.

Hi Florian!

  • Location of transformVPInstructionsToVPRecipies: I agree with Diego that LoopVectorizationPlanner is not the ideal place. But it needs access to the tryTo* functions to create VPRecipes, which unfortunately are defined in LoopVectorizationPlanner. I think there are 3 options: 1) keep transformVPInstructionsToVPRecipies in LoopVectorizationPlanner, 2) move the tryTo* functions to a better place (something like VPRecipeBuilder) or 3) pass a LoopVectorizationPlanner to the VPInstructionToVPRecipe transform. I think we should go for 1), if there is a concrete plan to get rid of VPRecipes in the near future or 2) if we need to keep the VPRecipes around for the time being (at least for the non-VPlan native path)

Good suggestions! I would prefer #2 if it's feasible and it's not too much work. Moving the creation of the recipes to a different class sounds good to me since we'd prevent other developers to add similar code to the planner. VPRecipeBuilder sounds good to me. I also thought we could add them to VPBuilder since recipes are currently part of the IR representation, but I think it's not a good idea since these interfaces are doing much more than just creating a new recipe.

  • Move VPInstructionToVPRecipe to the VPlan native path: should be fairly straight forward, we already instantiate the legacy cost model in the VPlan native path, all we need to do is to allow inner loops in the VPlan native path, use the cost model to get the MaxVF and hook up VPlan execution, in case we decide to vectorize. We can migrate all those parts to VPlan infrastructure separately. Does that make sense?

My only concern is how that is going to work for outer loops. We need VPInstructionToVPRecipe working for them and, IMO, we shouldn't create another fork inside the VPlan native path to treat outer and inner loops differently. Reducing the number of recipes used in this step would also help us replace them quickly with VPInstructions in the VPlan native path. For those reasons, I suggested that we should start with a very simple VPInstructionToVPRecipe step that only created basic recipes (i.e., VPWiden*) for now, and leave interleaving and other optimizations for later, at least until we can make them work for outer loops. Do you see any problem with this approach? Is there something specific that you need for your follow-up work? We can think about how to accommodate it.

  • Simplify buildVPlans() in the non VPlan native path: if we are pushing development of the VPlan based inner loop vectorization to the VPlan native path, we can avoid creating VPlans for each valid vectorization factor, as we throw them away without using them.

I'm not sure I understand this point. In the non VPlan native path we are evaluating the cost for all VFs and choosing the best one. What do you want to change exactly?

Thanks!
Diego

fhahn added a comment.May 29 2018, 7:10 AM

Thanks Diego, I'll get the patches ready ASAP.

I'm not sure I understand this point. In the non VPlan native path we are evaluating the cost for all VFs and choosing the best one. What do you want to change exactly?

I've created D47477 to illustrate what I meant.

fhahn updated this revision to Diff 149312.May 31 2018, 10:06 AM
fhahn retitled this revision from [LV] Add VPInstruction to VPRecipe transformation. to [VPlan] Add VPInstruction to VPRecipe transformation..
fhahn edited the summary of this revision. (Show Details)

Update the patch to move inner loop vectorization using the VPInstr2VPRecipe transformation in the Vplan native path. I've updated the description of the revision with more details

My only concern is how that is going to work for outer loops. We need VPInstructionToVPRecipe working for them and, IMO, we shouldn't create another fork inside the VPlan native path to treat outer and inner loops differently. Reducing the number of recipes used in this step would also help us replace them quickly with VPInstructions in the VPlan native path. For those reasons, I suggested that we should start with a very simple VPInstructionToVPRecipe step that only created basic recipes (i.e., VPWiden*) for now, and leave interleaving and other optimizations for later, at least until we can make them work for outer loops. Do you see any problem with this approach? Is there something specific that you need for your follow-up work? We can think about how to accommodate it.

I played a around a bit with generating code for outer loops. I think there are a few things we need to do:

  1. collect some required information about phis during the legality checks,
  2. set which instructions to widen/scalarize in the outer/inner loops for the user provided VF, - that's all VPInstructionToVPRecipe needs for most instructions
  3. handle inner loop PHI nodes and control flow between outer & inner loop.

Thanks for this new version, Florian! Pretty excited to see that some inner loop tests are passing!
Satish is working on the support for outer loops in CG and we’ve been discussing your changes. We have to make sure that this patch aligns well with his next patch and also with the constraints that we stablished for this patch series and subsequent ones. Some comments in this regard:

  1. We think that VPInstructionsToVPRecipes is still too smart for the vplan native path and bringing too much code (from CM and Legal, for example) an recipes that we would prefer to keep away from the native path until they are properly ported. Otherwise, we’d end up having the same development limitations as we have in the inner loop path. What we had in main for the Recipe to VPInstruction transition in the native path is something as simple as the code in https://reviews.llvm.org/D44338?vs=on&id=140081&whitespace=ignore-most, where createRecipesForVPBB is naively creating only VPWidenMemoryInstructionRecipe, VPWidenRecipe and VPWidenPHIRecipe. Satish will give you more details in this regard so that this patch aligns with his.
  1. An example of #1 is createBlockInMask and createEdgeMask, which are currently invoked from tryToWidenMemory. In the VPlan native path we are only supporting uniform control flow so no masking is necessary at this point. For divergence control flow we plan to introduce a VPlan based predication algorithm in patch series #2 which will conflict with code.
  1. If you agree with the suggestion in #1, I think it would be better to introduce the inner loop support in the VPlan native path in a separate patch. In that way, we would introduce: a) VPInstructionToVPRecipe, b) CG support (Satish) and, finally, c) Basic inner loop support, i.e., same constraints as for outer loops, same naïve approach as in a) and b) and making sure that support for both inner and outer loops are well aligned.

Of course, we also need to work to bring whatever is necessary for SLP-aware loop vectorization. We can have this discussion offline to better understand what would be missing in the VPlan native path.

Please, let me know what you think!
Satish will follow up with more information regarding the CG patch.

Thanks,
Diego

lib/Transforms/Vectorize/LoopVectorize.cpp
6300 ↗(On Diff #149312)

Maybe it would be useful to keep this debug message for the remaining inner loops that don't hit the previous condition?

fhahn added a comment.Jun 4 2018, 12:08 PM

Thanks for this new version, Florian! Pretty excited to see that some inner loop tests are passing!
Satish is working on the support for outer loops in CG and we’ve been discussing your changes. We have to make sure that this patch aligns well with his next patch and also with the constraints that we stablished for this patch series and subsequent ones. Some comments in this regard:

  1. We think that VPInstructionsToVPRecipes is still too smart for the vplan native path and bringing too much code (from CM and Legal, for example) an recipes that we would prefer to keep away from the native path until they are properly ported. Otherwise, we’d end up having the same development limitations as we have in the inner loop path. What we had in main for the Recipe to VPInstruction transition in the native path is something as simple as the code in https://reviews.llvm.org/D44338?vs=on&id=140081&whitespace=ignore-most, where createRecipesForVPBB is naively creating only VPWidenMemoryInstructionRecipe, VPWidenRecipe and VPWidenPHIRecipe. Satish will give you more details in this regard so that this patch aligns with his.

Sure, we can start with just creating the basic recipes. Should we only support widening for a start and not relying on the cost model at all? I suppose that would make things easier for outer loop code gen.

  1. An example of #1 is createBlockInMask and createEdgeMask, which are currently invoked from tryToWidenMemory. In the VPlan native path we are only supporting uniform control flow so no masking is necessary at this point. For divergence control flow we plan to introduce a VPlan based predication algorithm in patch series #2 which will conflict with code.
  1. If you agree with the suggestion in #1, I think it would be better to introduce the inner loop support in the VPlan native path in a separate patch. In that way, we would introduce: a) VPInstructionToVPRecipe, b) CG support (Satish) and, finally, c) Basic inner loop support, i.e., same constraints as for outer loops, same naïve approach as in a) and b) and making sure that support for both inner and outer loops are well aligned.

Yep that sounds good. I would prefer not to land this change as dead code and if inner loop support is not pending on CG for outer loops, it would be great if we could get it in independently from that. Of course make sure it fits well with CG for outer loops.

Of course, we also need to work to bring whatever is necessary for SLP-aware loop vectorization. We can have this discussion offline to better understand what would be missing in the VPlan native path.

I think to start with, the limited set of recipes is more than enough. Once we have inner loop support I shared an updated set of patches for SLP.

Thanks for your comments! Please let me know what you require for outer loop CG and when you are planning on sharing those patches, so I do not end up doing unnecessary work :)

Sure, we can start with just creating the basic recipes. Should we only support widening for a start and not relying on the cost model at all? I suppose that would make things easier for outer loop code gen.

Hi Florian,

For the initial outer loop vector code generation, we were planning to only support widening and not rely on the cost model at all. Where the current code generation relies on results of cost modeling, we were planning to put changes in place to return a conservative result for the VPlanNativePath. Given the current constraints for the supported outer loops in the VPlanNativePath, this is what we had in mind specifically:

  • generate gather/scatter for loads/stores until we have proper divergence analysis in place. We may be able to infer unit-stride information using current SCEV based analysis. However, for the initial implementation we can generate gathers/scatters by returning CM_GatherScatter in getWideningDecision for outer loops in VPlanNativePath.
  • widen non-induction PHIs. The initial implementation will run some part of legality checks to recognize inductions as you pointed out earlier. PHIs that are not in the outer loop header will be widened. Due to back edges from inner loops we may need to generate place-holder PHIs to begin with and fixup such PHIs at the end of code generation. This is due to the fact that vector values corresponding to all the phi-operands may not be available when widening the PHI instruction. This will be similar to the way reductions/recurrences are currently handled.

We will need to either guarantee that the phi operand order matches the parent BBlock successor order or have a mapping between scalar BBlocks and the corresponding vector BBlock.

  • treat all branch instructions as uniform. During vector CG, the branch operands need to be replaced with the vector equivalents. We will need an approach similar to handling of PHIs i.e. fixup branches and control flow at the end of vector CG.
  • all other instructions will simply be widened using vector values or generating broadcasts(for loop invariants)

Initially, we will need at the least - widenMemory, widen, and widenPhi recipes. Branch instruction can be part of a widenRecipe and CG needs to handle it appropriately for the VPlanNativePath. If that is not acceptable, we can think about having a new recipe to handle branch instructions.

Please let me know if this matches with what you had in mind. We will try to send you a patch by the end of this week so that you can see what we have.

Satish

fhahn added a comment.Jun 4 2018, 2:05 PM

Please let me know if this matches with what you had in mind. We will try to send you a patch by the end of this week so that you can see what we have.

Yep that sounds good. It sounds like I could update this patch to use a simplified recipe creation as Diego mentioned?

Yep that sounds good. It sounds like I could update this patch to use a simplified recipe creation as Diego mentioned?

I think that makes sense.

fhahn updated this revision to Diff 150156.Jun 6 2018, 10:11 AM

Updated to not use VPRecipeBuilder for recipe construction. It now relies relies on the original VPlan to be vectorizable without masking/interleaving. If that makes sense, I'll move the code enabling the CG to a separate patch and add a unit test for the transformation.

Updated to not use VPRecipeBuilder for recipe construction. It now relies relies on the original VPlan to be vectorizable without masking/interleaving. If that makes sense, I'll move the code enabling the CG to a separate patch and add a unit test for the transformation.

Thanks, Florian! Much better now! Some comments below.

If that makes sense, I'll move the code enabling the CG to a separate patch and add a unit test for the transformation.

Please go ahead!

Let's see how this patch aligns with Satish's patch when it's ready.

Thanks,
Diego.

lib/Transforms/Vectorize/LoopVectorize.cpp
6395 ↗(On Diff #150156)

I think this might not be needed at this point?

6398 ↗(On Diff #150156)

unused?

lib/Transforms/Vectorize/VPlanHCFGTransforms.cpp
60

Since this transformation happens just before the execution of VPlan, I think it would be better to just modify the existing VPlan (i.e., remove each VPInstruction and add the corresponding recipes) instead of creating a new one. In that way, this code would be even simpler. Does it make sense to you or did you decide to create a new one for some reason?

74

We should always have a TopRegion. dyn_cast -> cast?

80

Please, remove TopRegion check.

95

Shoudn't we add DbgInfoIntrinsic to DeadInstructions then?

craig.topper added inline comments.
lib/Transforms/Vectorize/VPlanHCFGTransforms.cpp
134

Is the iteration order guaranteed here when the set is small? It's been a while since I've look at how SmallSet manages the vector.

fhahn added inline comments.Jun 8 2018, 9:01 AM
lib/Transforms/Vectorize/VPlanHCFGTransforms.cpp
134

I've just put up a patch adding an iterator for SmallSet: D47942

If the set is small, it just uses a SmallVector, otherwise it uses std::set, which should also give a deterministic iteration order.

craig.topper added inline comments.Jun 8 2018, 9:09 AM
lib/Transforms/Vectorize/VPlanHCFGTransforms.cpp
134

I guess deterministic wasn't the right word. It looks like when the set is small this will print out in the order that the VFs were inserted into the set not a numeric order? Given that this is going into a string is that a good idea?

fhahn added inline comments.Jun 8 2018, 9:11 AM
lib/Transforms/Vectorize/VPlanHCFGTransforms.cpp
134

Yeah, I guess we would need to sort the VFs here. But I'll update the patch to transform the original VPlan in place, so this code should go away.

craig.topper added inline comments.Jun 8 2018, 9:18 AM
lib/Transforms/Vectorize/VPlanHCFGTransforms.cpp
134

What's the range of VFs that you expect to use? Could you use a BitVector or SparseBitVector instead?

fhahn updated this revision to Diff 150953.Jun 12 2018, 8:37 AM
fhahn edited the summary of this revision. (Show Details)

Updated to modify the original plan in place, and removed inner loop vectorization code path from VPlan native path for now.

fhahn added inline comments.Jun 12 2018, 8:39 AM
lib/Transforms/Vectorize/VPlanHCFGTransforms.cpp
80

Done. I think we still need a check to not create recipes for the pre-header and exit blocks?

fhahn updated this revision to Diff 150954.Jun 12 2018, 8:42 AM

Fix typo.

Thanks for the rework and for your patience, Florian!
I don't see any major issues. Please, wait for Satish's approval, just in case he has any comments regarding how this would interact with his next patch.
In any case, we could always address any issues in a separate commit.

Diego

lib/Transforms/Vectorize/VPlanHCFGTransforms.cpp
76

Shoudn't we collect DbgInfoIntrinsic as a DeadInstructions instead of having this check here?

80

Yes, you can do that for now. I plan to create clean/empty PH and Exit during construction so that we can safely vectorize or process whatever we add to them other VPlan-to-VPlan transformations.

83

remove { } for single line ifs and elses?

unittests/Transforms/Vectorize/VPlanHCFGTest.cpp
39

Thanks a lot for these tests!

fhahn added a comment.Jun 14 2018, 2:42 PM

Thanks Diego! I'll address the remaining comments after Satish had a look, so I can address any additional comments in one go

sguggill added inline comments.Jun 14 2018, 6:03 PM
lib/Transforms/Vectorize/VPlanHCFGTransforms.cpp
76

An assert(inst) would be nice here.

94

I think we are diverging from the nonVPlanNativePath case, where a widenRecipe can have > 1 ingredients but this should be fine.

unittests/Transforms/Vectorize/VPlanHCFGTest.cpp
144

VPInstructionsToVPRecipies -> VPInstructionsToVPRecipes

I noticed this while trying to test my outerloop vector code generation changes along with your latest changes.

fhahn updated this revision to Diff 151515.Jun 15 2018, 8:40 AM
fhahn marked 3 inline comments as done.

Thanks for having a look. The comments should be addressed now.

dcaballe added inline comments.Jun 15 2018, 9:11 AM
lib/Transforms/Vectorize/VPlanHCFGTransforms.cpp
94

You mean in line 95? I see. We shouldn't create unnecessary recipes if we can reuse an existing one. Could we do something similar to the code at the end of 'tryToWiden'?:

// Success: widen this instruction. We optimize the common case where
// consecutive instructions can be represented by a single recipe.
if (!VPBB->empty()) {
  VPWidenRecipe *LastWidenRecipe = dyn_cast<VPWidenRecipe>(&VPBB->back());
  if (LastWidenRecipe && LastWidenRecipe->appendInstruction(I))
    return true;
}

VPBB->appendRecipe(new VPWidenRecipe(I));
fhahn updated this revision to Diff 151524.Jun 15 2018, 9:36 AM
fhahn marked an inline comment as done.

Thanks, updated the patch to combine widened instructions in single recipe, if last recipe was a VPWidenRecipe.

lib/Transforms/Vectorize/VPlanHCFGTransforms.cpp
76

Switched to cast<>

94

Yep, this code creates a new WidenRecipe for each widened instruction because it is slightly more work to get the previous recipe, because we remove the current instruction. I can change it though if you think that's better.

unittests/Transforms/Vectorize/VPlanHCFGTest.cpp
144

Ah yes, sorry about that. This instance slipped through when I fixed the typo.

The changes look good to me. I have also verified that my outer loop vector code generation changes work with these changes after some minor modifications.

lib/Transforms/Vectorize/VPlanHCFGTransforms.cpp
76

OK.

94

This looks good. Thanks!

dcaballe accepted this revision.Jun 15 2018, 11:19 AM

Thanks a lot, Florian! LGTM!

This revision is now accepted and ready to land.Jun 15 2018, 11:19 AM

Great thanks for all the comments! I'll commit it early next week, unless there are any more comments

fhahn updated this revision to Diff 151738.Jun 18 2018, 9:59 AM

Rebased. Thanks for all the feedback. I'll commit it in a bit

This revision was automatically updated to reflect the committed changes.