This is an archive of the discontinued LLVM Phabricator instance.

[RFC][VPlan, SLP] Add simple SLP analysis on top of VPlan.
ClosedPublic

Authored by fhahn on Jul 18 2018, 8:50 AM.

Details

Summary

This patch adds an initial implementation of the look-ahead SLP tree
construction described in 'Look-Ahead SLP: Auto-vectorization in the Presence
of Commutative Operations, CGO 2018 by Vasileios Porpodas, Rodrigo C. O. Rocha,
Luís F. W. Góes'.

It returns an SLP tree represented as VPInstructions, with combined
instructions represented as a single, wider VPInstruction.

This initial version does not support instructions with multiple
different users (either inside or outside the SLP tree) or
non-instruction operands; it won't generate any shuffles or
insertelement instructions.

It also just adds the analysis that builds an SLP tree rooted in a set
of stores. It does not include any cost modeling or memory legality
checks. The plan is to integrate it with VPlan based cost modeling, once
available and to only apply it to operations that can be widened.

A follow-up patch will add a support for replacing instructions in a
VPlan with their SLP counter parts

Diff Detail

Event Timeline

fhahn created this revision.Jul 18 2018, 8:50 AM

What is your plan for this vs SLPVectorizer? How do they compare wrt vectorized codegen?

rcorcs added a subscriber: rcorcs.Jul 18 2018, 9:20 AM

What is your plan for this vs SLPVectorizer? How do they compare wrt vectorized codegen?

The initial motivation is to improve vectorization in cases where currently the loop vectorizer only considers interleaving memory accesses, which either results in sub-optimal code or no vectorization in case the interleaved accesses are too expensive on the target. It should also come in handy when vectorizing for architectures with scalable vector registers, where SLP-style vectorization of an unrolled loop is harder, for example.

Overall I think this should fit nicely into the emerging VPlan infrastructure: applying SLP style vectorization is one strategy/plan among others, that are evaluated against each other before choosing the most profitable over all. We can add the SLP analysis and tooling to create a VPlan with the SLP combinations applied, and just need small tweaks to VPlan-based cost modelling and code generation to make it aware of the new combined load/store instructions.

In the long term, I think there is potential to share infrastructure with the SLPVectorizer on different levels: for example, we could potentially re-use VPlan based code generation, if the SLPVectorizer would emit VPInstructions; share the infrastructure to discover consecutive loads/stores; share different combination algorithms between VPlan and SLPVectorizer; potentially re-use part of the VPlan based cost model. One thing I want to try out is how easy/hard it would be to make the analysis work on VPInstruction and Instruction based basic blocks.

That being said, some of the VPlan infrastructure is still emerging: initial VPInstruction based codegeneration and cost modelling is currently worked on for example. However I think considering SLP style vectorization as a VPlan2VPlan transformation (and others) early on would help to make sure the design of the VPlan infrastructure is general enough to cover a wide range of use cases.

I hope that answers your question.

My tuppence...

The initial motivation is to improve vectorization in cases where currently the loop vectorizer only considers interleaving memory accesses, which either results in sub-optimal code or no vectorization in case the interleaved accesses are too expensive on the target. It should also come in handy when vectorizing for architectures with scalable vector registers, where SLP-style vectorization of an unrolled loop is harder, for example.

This was part of the VPlan from the beginning, but we never discussed in detail how the implementation would work, especially considering that SLP will still be there for quite a while as is.

Overall I think this should fit nicely into the emerging VPlan infrastructure: applying SLP style vectorization is one strategy/plan among others, that are evaluated against each other before choosing the most profitable over all. We can add the SLP analysis and tooling to create a VPlan with the SLP combinations applied, and just need small tweaks to VPlan-based cost modelling and code generation to make it aware of the new combined load/store instructions.

This is indeed the biggest benefit of having SLP analysis in VPlan, especially with the VPlan-to-VPlan transformations.

In the long term, I think there is potential to share infrastructure with the SLPVectorizer on different levels: for example, we could potentially re-use VPlan based code generation, if the SLPVectorizer would emit VPInstructions; share the infrastructure to discover consecutive loads/stores; share different combination algorithms between VPlan and SLPVectorizer; potentially re-use part of the VPlan based cost model. One thing I want to try out is how easy/hard it would be to make the analysis work on VPInstruction and Instruction based basic blocks.

This is why we haven't gone too deep in the analysis, yet. Sharing code between SLPVec, which operates in IR, and VPlanSLP, which operates in VInstructions, can be confusing, limiting or even not possible. The analysis part mostly works, because VPInstruction is similar enough to Inst, but implementation and, worse, heuristics, can go wrong very quickly.

I think long term we basically have two options:

  1. We cannibalise SLPVec, hoisting analyses, transformations and make it generic (Inst/VPInst) and make sure we never hurt performance or compile time. This is hard, slow and painful, but it's the most stable ongoing solution.
  1. We implement VPlanSLP in parallel, create a flag to flip between (not have both at the same time), and when VPlanSLP is doing all and more, we flip by default. This is much easier short-term but risk never get truly flipped and divide the usage.

I don't have a good view on which would be better right now. VPlan is still largely monolithic, VP-to-VP is too fresh, and SLP needs to understand loop and outer loop boundaries to operate correctly in VPlan.

That being said, some of the VPlan infrastructure is still emerging: initial VPInstruction based codegeneration and cost modelling is currently worked on for example. However I think considering SLP style vectorization as a VPlan2VPlan transformation (and others) early on would help to make sure the design of the VPlan infrastructure is general enough to cover a wide range of use cases.

I can see the appeal in a proof-of-concept, and I don't oppose having it. But I'm not strongly in favour either.

If more people think that option #2 above is the way to go, then this could turn out fine.

But if more people prefer the option #1, then we would want to see what gets hoisted and how will the local implementations look like before we add VPlanSLP.

cheers,
--renato

That being said, some of the VPlan infrastructure is still emerging: initial VPInstruction based codegeneration and cost modelling is currently worked on for example. However I think considering SLP style vectorization as a VPlan2VPlan transformation (and others) early on would help to make sure the design of the VPlan infrastructure is general enough to cover a wide range of use cases.

I can see the appeal in a proof-of-concept, and I don't oppose having it. But I'm not strongly in favour either.

If more people think that option #2 above is the way to go, then this could turn out fine.

But if more people prefer the option #1, then we would want to see what gets hoisted and how will the local implementations look like before we add VPlanSLP.

The fact of the matter is that the loop vectorization has a need to understand SLP and SLP vectorizer needs to understand Loop. As such, unless we want to build/maintain separate LoopVectorize+SLP and SLPVectorize+Loop, consolidation of LoopVectorization and SLPVectorization will inevitably happen sooner or later. From that perspective, ensuring that VPlan is the right infrastructure for such consolidation is a very important thing for us to do.
We don't necessarily have to choose between #1 and #2. Once we come to conclude that VPlan is promising enough for SLP, we can start VPlanize SLP vectorizer (#1) while #2 moves forward. We just need to make a conscious effort to converge the two, starting from sharing small chunks of code and then steadily increase sharing.

The fact of the matter is that the loop vectorization has a need to understand SLP and SLP vectorizer needs to understand Loop. As such, unless we want to build/maintain separate LoopVectorize+SLP and SLPVectorize+Loop, consolidation of LoopVectorization and SLPVectorization will inevitably happen sooner or later. From that perspective, ensuring that VPlan is the right infrastructure for such consolidation is a very important thing for us to do.

I'm totally on board with you. I want SLP to be in VPlan (have been advocating for it since the beginning).

We don't necessarily have to choose between #1 and #2. Once we come to conclude that VPlan is promising enough for SLP, we can start VPlanize SLP vectorizer (#1) while #2 moves forward. We just need to make a conscious effort to converge the two, starting from sharing small chunks of code and then steadily increase sharing.

A merge between 1 and 2, sounds equally plausible. I just want everyone to agree on the strategy.

I have no strong opinions on the best approach. My concern is my current work revolves around a number of issues:

1 - Generalize alternate vectorisation paths (multiple different vector ops + select/shuffle merges).
2 - Supporting 'copyable' elements (PR30787).
3 - Pull TTI/vectorization costs from scheduling models (PR36550).
4 - Using dereferenceable_or_null metadata to vectorise loads with missing elements (PR21780).
5 - Revectorisation of 128-bit vector code to 256-bit vector code (make most of YMM ops now that Jaguar model is treating them nicely).

All of these are large pieces of work and I don't want to find myself implementing them in SLPVectorizer just for all the work to be lost and we're back to a very basic SLP system again.

How quickly do you expect VPlanSLP to be close to current SLPVectorizer codegen? Ideally I'd like to see slp tests run on both asap.

I'll let Florian/Hideki reply about timeframes and strategies, and will just focus on specific items you list.

1 - Generalize alternate vectorisation paths (multiple different vector ops + select/shuffle merges).
2 - Supporting 'copyable' elements (PR30787).
4 - Using dereferenceable_or_null metadata to vectorise loads with missing elements (PR21780).

These all look transferable to a VPlan-based SLP.

3 - Pull TTI/vectorization costs from scheduling models (PR36550).

This is really important for VPlan, too. There is some work (can't remember which review now) to get better cost analysis, looking at a longer use-chain.

5 - Revectorisation of 128-bit vector code to 256-bit vector code (make most of YMM ops now that Jaguar model is treating them nicely).

This sounds like a large undertake to do in SLP proper. I remember having briefly discussed re-vectorisation as a VP2VP transformation, which I think makes more sense.

But having it in SLP for now is perfectly fine if you have cases in mind where it's obviously beneficial. It's probably going to take a while to get the VP2VP stuff, anyway.

All of these are large pieces of work and I don't want to find myself implementing them in SLPVectorizer just for all the work to be lost and we're back to a very basic SLP system again.

That was my primary concern, too. But I don't think anyone is proposing to just dump the existing SLP unless the new one has *all* of it (and more).

And that also means either merging or commoning-up the features above (and all others).

fhahn added a comment.Jul 19 2018, 6:08 AM

Thanks for all the feedback! My initial plan was to make VPlan based loop vectorization SLP aware, to improve on the cases mentioned earlier, where currently we do not do the best thing in LoopVectorize. That should take a bit of load off SLPVectorize, but would not replace the current SLPVectorizer for now.

I think in the long term, we should work towards re-using as much of the VPlan infrastructure in SLPVectorize as sensible and possible. Merging of #1 and #2 seems like a good approach: we can get some benefits for loop vectorization relatively quickly and in the meantime evolve the whole VPlan infrastructure around it, while making SLPVectorize and VPlanSLP share common infrastructure (e.g. start from the bottom up, initially use VPlan for code generation and move up from there (bundle scheduling, cost modelling, auxiliary analysis like interleaved load/store analysis)) where it makes sense and is beneficial. In the end we might end up with "competing" algorithm/SLP strategies, which get evaluated against each other before creating (and executing) the final VPlan.

I have no strong opinions on the best approach. My concern is my current work revolves around a number of issues:

1 - Generalize alternate vectorisation paths (multiple different vector ops + select/shuffle merges).

Representing alternate vectorization paths/strategies in a way that can be easily evaluated against each other is one major benefit of VPlan, so I assume it could be helpful here

2 - Supporting 'copyable' elements (PR30787).
3 - Pull TTI/vectorization costs from scheduling models (PR36550).

This is great and we should use the same API here in the VPlan cost modelling and SLPVectorizer.

4 - Using dereferenceable_or_null metadata to vectorise loads with missing elements (PR21780).
5 - Revectorisation of 128-bit vector code to 256-bit vector code (make most of YMM ops now that Jaguar model is treating them nicely).

All of these are large pieces of work and I don't want to find myself implementing them in SLPVectorizer just for all the work to be lost and we're back to a very basic SLP system again.

How quickly do you expect VPlanSLP to be close to current SLPVectorizer codegen? Ideally I'd like to see slp tests run on both asap.

This depends on when we get initial VPlan-native codegen and cost modelling. I hope those things get submitted in the next few months. As mentioned earlier, I could put together a VPlanSLP version operating on scalar code outside loops, so we have something more concrete to compare.

That was my primary concern, too. But I don't think anyone is proposing to just dump the existing SLP unless the new one has *all* of it (and more).

+1

And that also means either merging or commoning-up the features above (and all others).

Given the amount of tuning that went into SLPVectorize over the years, it will take a while to reach parity for a potential VPlan based replacement. But it would be great if we could agree on the general direction & approach and I would be more than happy to try to make sure the issues described above fit well into VPlanSLP. I guess with the problems mentioned above, the strategies and concepts are trickier to get right than the implementation details, so collaborating would be very valuable IMO.

Some minor/style comments as I get up to speed on the code.

lib/Transforms/Vectorize/VPlan.cpp
690
for (unsigned i = 0, e = User->getNumOperands(); i < e; i++) {
694

Remove unnecessary braces?

lib/Transforms/Vectorize/VPlan.h
679

No need for assert - cast will catch it.

Could you use cast_or_null<Instruction>(getUnderlyingValue()) directly?

1587

Please can you add description comments for all these member variables and public functions.

lib/Transforms/Vectorize/VPlanSLP.cpp
60

static?

91

Should you use VPSLP for debug messages to avoid confusion with SLPVectorizer?

155

auto?

162

Reuse Instruction::isCommutative(cast<VPInstruction>(Values[0])->getOpcode())?

183

llvm_unreachable?

190

This would be cleaner if you pulled out cast<VPInstruction>(Values[0])->getNumOperands()

unsigned NumOps = cast<VPInstruction>(Values[0])->getNumOperands();
for (unsigned i = 0; i < NumOps; ++i)
239

(style) Score

241

Don't re-evaluate the getNumOperands() calls

for (unsigned i = 0, e1 = cast<VPUser>(V1)->getNumOperands(); j < e1; i++)
  for (unsigned j = 0, e2 = cast<VPUser>(V2)->getNumOperands(); j < e2; j++)
258

(style) auto for casts

337
for (unsigned Op = 0, E = MultiNodeOps.size(); Op < E; ++Op) {
fhahn updated this revision to Diff 156311.Jul 19 2018, 10:26 AM

Address comments, thanks! I will add more comments tomorrow.

fhahn marked 13 inline comments as done.Jul 19 2018, 10:27 AM
fhahn added inline comments.
lib/Transforms/Vectorize/VPlan.h
1587

Will do tomorrow!

RKSimon added inline comments.Jul 20 2018, 3:09 AM
lib/Transforms/Vectorize/VPlanSLP.cpp
176

You can move this above the switch statement and remove the duplicate cast

ABataev added inline comments.Jul 20 2018, 7:20 AM
lib/Transforms/Vectorize/VPlan.cpp
690
  1. i->I, 'e'->E per coding standard.
  2. Use ++I instead of i++ per coding standard
lib/Transforms/Vectorize/VPlan.h
652–655

VPInstruction *clone() const

675

Make the function const

1590

No need to initialize BundleToCombined here, will be initalized by default

1596

const

lib/Transforms/Vectorize/VPlanSLP.cpp
85

const member function

165–166

static

177

i->I, numOps->NumOps, ++I

192

Maybe it is better to use llvm::Optional<unsigned> instead of some magic numbers for non-matching case?

235

static

263

Why 5? Use enum or const instead of magic number.

288–293

std::remove(Candidates.begin(), Candidates.end(), Best);?

313
  1. Lane = 1, E = MultiNodeOps[0].second.size(); Lane < E;
  2. Use preincrement
325

preincrement

331

Please, use the actual type here

343

ArrayRef<VPValue *> Values

399–401

No braces here

fhahn updated this revision to Diff 156529.Jul 20 2018, 10:15 AM

Address next round of review comments, thanks! Also add some comments.

fhahn marked 15 inline comments as done.Jul 20 2018, 10:19 AM
fhahn added inline comments.
lib/Transforms/Vectorize/VPlan.h
675

Agreed that it should be const. Before I can change that here I need to change some of the underlying/connected functions const.

ABataev added inline comments.Jul 20 2018, 10:53 AM
lib/Transforms/Vectorize/VPlan.cpp
702–712

const auto *

lib/Transforms/Vectorize/VPlanSLP.cpp
65

ArrayRef<VPValue *> Operands

150

ArrayRef<VPValue *> Values

160

ArrayRef<VPValue *> Values

166

ArrayRef<VPValue *> Values

187

ArrayRef<VPValue *> Values

228–229

Capitalize variables names + preincrement

236

ArrayRef<VPValue *> Candidates

349

ArrayRef<VPValue *> Values

fhahn updated this revision to Diff 156555.Jul 20 2018, 11:44 AM
fhahn marked 5 inline comments as done.Jul 20 2018, 11:49 AM

Address review comments, thanks!

lib/Transforms/Vectorize/VPlanSLP.cpp
65

Operands is used as key for the BundleToCombined map, which expects a SmallVector<VPValue *, 4> key.

236

We remove elements from Candidates which is not possible with ArrayRef I think

263

Replaced with a variable

349

Values is used as key for the BundleToCombined map, which expects a SmallVector<VPValue *, 4> key.

ABataev added inline comments.Jul 20 2018, 12:58 PM
lib/Transforms/Vectorize/VPlanSLP.cpp
65

llvm has special function llvm::to_vector<4>() that will allow to convert from ArrayRef to SmallVector

179–183

Do we need braces here?

195

Use llvm::None instead

303

Is it possible to replace it with emplace_back(Operands.first, Operands.second[0]);

303–309

You can preallocate the memory for Mode and FinalOrder vectors, because you know the number of elements is MultiNodeOps.size()

319

Also can preallocate the memory for the vector

335

use emplace_back()

fhahn updated this revision to Diff 156591.Jul 20 2018, 1:29 PM
fhahn updated this revision to Diff 156605.Jul 20 2018, 2:09 PM
fhahn marked 7 inline comments as done.
fhahn added inline comments.
lib/Transforms/Vectorize/VPlanSLP.cpp
65

Neat, thanks for pointing that out!

179–183

Not anymore, thanks

303

I don't think so, it fails to deduce the correct type for the second argument (with and without braces)

335

Is there are benefit by using emplace_back here? We are not constructing any objects here I think

fhahn added a comment.Aug 1 2018, 2:30 PM

Ping. Thanks for all the comments so far and the initial code review.

What do you think is the best way going forward overall: 1) evolving loop aware SLP initially in VPlan (independent of SLPVectorizer) and start using VPlan infrastructure for SLPVectorizer as it becomes available or 2) trying to share as much code between both from the start? Personally I think 1) would be better, as it allows us to evolve the VPlan parts quicker and I would like to avoid unnecessary code churn in SLPVectorizer and start using VPlan infrastructure there as it becomes clearly beneficial.

fhahn added a comment.Aug 1 2018, 2:32 PM

I'll also add some cleanup code for destroying the created instructions if the are not used in a bit.

Thanks for working on this, Florian, and sorry for my delayed response. I added some initial comments. I'll come back soon.

Ping. Thanks for all the comments so far and the initial code review.

What do you think is the best way going forward overall: 1) evolving loop aware SLP initially in VPlan (independent of SLPVectorizer) and start using VPlan infrastructure for SLPVectorizer as it becomes available or 2) trying to share as much code between both from the start? Personally I think 1) would be better, as it allows us to evolve the VPlan parts quicker and I would like to avoid unnecessary code churn in SLPVectorizer and start using VPlan infrastructure there as it becomes clearly beneficial.

I agree with you. I think it's better to have an initial end-to-end loop aware SLP PoC, properly define the representation of SLP operations, define how SLP transformation comes into play and interacts with the rest of the VPlan infrastructure, etc., before we move forward on SLPVectorizer. In the VPlan native path we have some room for experimentation. We can take advantage of that and start with SLPVectorizer once we have a solid model for SLP in VPlan.

In the long term, I think there is potential to share infrastructure with the SLPVectorizer on different levels: for example, we could potentially re-use VPlan based code generation, if the SLPVectorizer would emit VPInstructions; share the infrastructure to discover consecutive loads/stores; share different combination algorithms between VPlan and SLPVectorizer; potentially re-use part of the VPlan based cost model. One thing I want to try out is how easy/hard it would be to make the analysis work on VPInstruction and Instruction based basic blocks.

! In D49491#1166978, @rengolin wrote:
This is why we haven't gone too deep in the analysis, yet. Sharing code between SLPVec, which operates in IR, and VPlanSLP, which operates in VInstructions, can be confusing, limiting or even not possible. The analysis part mostly works, because VPInstruction is similar enough to Inst, but implementation and, worse, heuristics, can go wrong very quickly.

In terms of re-use, Value/VPValue templatization could work for some analyses. Templatizing IRBuilder/VPlanBuilder might also work for some parts of the implementation. However, we may want to think carefully. If for templatization to work we needed the whole VPValue hierarchy and internal details to be a carbon copy of Value hierarchy, we would be losing all the flexibility that VPValue/VPInstruction are bringing to VPlan. Of course, we would have to investigate this a bit more. Another alternative that I had in mind is introducing some kind of InstructionTraits. That would give us an unified interface while allowing more flexibility at VPValue/Value implementation level. Unfortunately, we haven't had have time to investigate this approach further.

lib/Transforms/Vectorize/VPlan.h
620

To be more specific, what about SLPLoad/SLPStore instead?

663

I wonder if having this interface is safe. Wouldn't it be problematic if we change the opcode of a VPInstruction object to an opcode of an operation which is represented with a VPInstruction subclass? Maybe it's better to not allow this and just create a VPInstruction using VPlanBuilder?

675

Make it protected per design principle described in getUnderlyingValue in VPlanValue.h? Same for setUnderlyingInstr.

1538

(parts) of -> (parts of)?

1543

SCEV -> VPValue?

I guess we couldn't have a single templatized version for sorted SmallVectors of any pointer, right?

1595

formatting

1606

Something to think about is the long term of this class. I wonder if it would be better to split it multiple ones, matching what we have for loop vectorization. For example, we may want to have a special VPlan (VPlanSLP) to represent the SLP planning information, a VPlanSLPBuilder, a VPlanSLPContext (if necessary)... We don't have to do it as part of this patch but it's something to keep in mind.

lib/Transforms/Vectorize/VPlanSLP.cpp
63

Would it make sense to move this to VPValue?

67

Some doc would be great. Same for some of the methods below.

211

If opcodes are the same at this point, shouldn't we keep only A or B checks but not both?
You could add assert(A->getOpcode() == B->getOpcode()) to be safe.

220

I would be great if you could elaborate a bit what 'score' is exactly.

fhahn updated this revision to Diff 159940.Aug 9 2018, 9:31 AM
fhahn marked 5 inline comments as done.

Addressed comments, thanks!

lib/Transforms/Vectorize/VPlan.h
663

Removed, it was a leftover from an earlier version.

1543

I think it is possible to provide a single implementation for SmallVectors. I've removed the implementation here and will upload a separate patch for it soon.

fhahn added a comment.Aug 22 2018, 9:43 AM

What do you think is the best way going forward overall: 1) evolving loop aware SLP initially in VPlan (independent of SLPVectorizer) and start using VPlan infrastructure for SLPVectorizer as it becomes available or 2) trying to share as much code between both from the start? Personally I think 1) would be better, as it allows us to evolve the VPlan parts quicker and I would like to avoid unnecessary code churn in SLPVectorizer and start using VPlan infrastructure there as it becomes clearly beneficial.

I agree with you. I think it's better to have an initial end-to-end loop aware SLP PoC, properly define the representation of SLP operations, define how SLP transformation comes into play and interacts with the rest of the VPlan infrastructure, etc., before we move forward on SLPVectorizer. In the VPlan native path we have some room for experimentation. We can take advantage of that and start with SLPVectorizer once we have a solid model for SLP in VPlan.

Yep! I think it would be good to get wider agreement of the direction. Is there anything I should look at/elaborate in more detail/summarize? Concerns with this approach that need to be addressed? It would be great to get some more feedback on this :)

This is why we haven't gone too deep in the analysis, yet. Sharing code between SLPVec, which operates in IR, and VPlanSLP, which operates in VInstructions, can be confusing, limiting or even not possible. The analysis part mostly works, because VPInstruction is similar enough to Inst, but implementation and, worse, heuristics, can go wrong very quickly.

In terms of re-use, Value/VPValue templatization could work for some analyses. Templatizing IRBuilder/VPlanBuilder might also work for some parts of the implementation. However, we may want to think carefully. If for templatization to work we needed the whole VPValue hierarchy and internal details to be a carbon copy of Value hierarchy, we would be losing all the flexibility that VPValue/VPInstruction are bringing to VPlan. Of course, we would have to investigate this a bit more. Another alternative that I had in mind is introducing some kind of InstructionTraits. That would give us an unified interface while allowing more flexibility at VPValue/Value implementation level. Unfortunately, we haven't had have time to investigate this approach further.

Right, I do not think we should restrict ourselves too much by coupling things together too early. I think in terms of analysis, we currently have access to def/uses, operands, opcodes in Instruction/VPInstruction and also DominatorTrees and LoopInfo for IR/VPlan, which should be enough for a set of analysis.

fhahn updated this revision to Diff 166309.Sep 20 2018, 9:05 AM

Rebased and added comments.

@RKSimon , @rengolin, @ABataev after the previous discussion, does getting this analysis in for VPlan only initially make sense to you? Or is there anything more I should investigate/write up?

fhahn added inline comments.
lib/Transforms/Vectorize/VPlan.h
1543

I've added a patch implementing DenseMap key info for smallvector: D52312

vporpo added inline comments.Sep 20 2018, 4:18 PM
lib/Transforms/Vectorize/VPlanSLP.cpp
135

Could we have other non-Store VPInstructions that could touch memory (e.g., calls) ?
Maybe it is safer to add a check whether the underlying IR instruction mayWriteToMemory().

141

I think we should also check that the operands are (i) all in the same BB, and (ii) in the same BB as the seed instructions (at least for now). A lit test for this would also be nice.

346

If I am not mistaken, since we are only allowing single user nodes in the graph, the graph is actually a tree. So this should not happen. We should have an assertion here to check this.

351

Use LLVM_DEBUG for dumpBundle()

364

Use LLVM_DEBUG for dumpBundle()

406

Hmm why no buildGraph() recursion here ?

fhahn updated this revision to Diff 166928.Sep 25 2018, 9:00 AM
fhahn marked 3 inline comments as done.

Thanks Vasileios! I hope I addressed all comments. The only follow up I need to do is the check that everything is in the same BB as the seed instructions, which I'll do tomorrow.

lib/Transforms/Vectorize/VPlanSLP.cpp
135

I've added a simple implementation for mayWriteToMemory to VPInstruction. We could just use mayWriteToMemory of the underlying instruction, but I think this would unnecessarily increase coupling of VPInstruction and IR instructions. What do you think?

141

I've added a check for (i) and I'll add a check for (ii) tomorrow. I'll think a bit more what the best way to do it would be.

I've added a unit test for (i), and will add IR tests once we VPlan SLP is hooked up to the vplan native code path.

346

Currently we check for unique users. For example, in the testSlpReuse_1 we have 2 add instruction which add the same value. In that case we should re-use the already created combined instruction I think

Hi Florian, yes I am happy with the changes, thanks!
Just please try to add an assertion that checks that the graph is a tree (see inline comment) to make sure that the graph is the one we expect.

lib/Transforms/Vectorize/VPlanSLP.cpp
135

I agree it looks better this way.

346

Ah yes, you are checking for unique users, so it should work the way you are describing. If I understand correctly, the graph can still be considered a tree since the only case when two nodes are connected by more than one path is when the user has two identical operands.

If this is the case, I still think that there should be some kind of assertion here checking that the graph is a tree and not a DAG, because SLP on a DAG is slightly different and would need a few more components to work correctly. I think that the only case where (I != BundleToCombined.end()) is true is the one you are describing: when the user instructions have identical operands. So maybe add an assertion that checks that the single users of Values have identical operands and a comment like "For now buildGraph should form a tree, not a DAG.".

fhahn updated this revision to Diff 168130.Oct 3 2018, 10:23 AM

Added assertion the ensure we only re-use nodes if the users of all values are equal and limited instructions considered to a single BB for now. I think this should now address all of @vporpo comments, thank you very much!

fhahn marked an inline comment as done.Oct 3 2018, 10:23 AM
ABataev added inline comments.Oct 31 2018, 10:52 AM
lib/Transforms/Vectorize/VPlanSLP.cpp
119

Seems to me the comment does not match the functionality

301

Try FinalOrder.emplace_back(Operands.first, Operands.second[0]);

fhahn updated this revision to Diff 172219.Nov 1 2018, 1:26 PM

Address comments, thanks!

fhahn added inline comments.Nov 1 2018, 1:29 PM
lib/Transforms/Vectorize/VPlanSLP.cpp
119

I've updated the comment and removed the code to check if all instructions are in the same BB. That's check earlier on. Does it make sense now?

301

I tried, but neither that nor emplace_back(Operands.first, {Operands.second[0])}; gets accepted unfortunately.

What about non-vectorizable loads/stores? Atomic, volaltile? Does it aware of those kind of instructions?

lib/Transforms/Vectorize/VPlanSLP.cpp
72

use try_emplace(to_vector<4>(Operands), New); instead

119

are no instructions?

fhahn updated this revision to Diff 172976.Nov 7 2018, 9:54 AM
fhahn marked an inline comment as done.

Address comments and add checks for simple loads/stores in areVectorizable().

Just one more question: how do you estimate that the vectorized code is more effective than the scalar one? What's the criterion?

lib/Transforms/Vectorize/VPlanValue.h
169

Capitalize i, must be I

fhahn updated this revision to Diff 173176.Nov 8 2018, 9:03 AM
fhahn marked an inline comment as done.
fhahn added a subscriber: sguggill.

Just one more question: how do you estimate that the vectorized code is more effective than the scalar one? What's the criterion?

This patch as a first step only adds support for the transform without cost modelling. This transform is intended as a VPlan2VPlan transformation and the general idea is to create multiple VPlans (say one with all the SLP opportunities applied and one with none) and evaluate the cost of the resulting VPlans at the end to choose the best one.

Together with @sguggill , we gave a brief overview of VP2VP transforms at the dev meeting (http://llvm.org/devmtg/2018-10/talk-abstracts.html#talk21) and this is something I want to write up a bit better in the actual VPlan design documentation.

I have no more comments.

fhahn added a comment.Nov 9 2018, 6:58 AM

I have no more comments.

Great, thanks for all the comments!

@RKSimon @rengolin @ABataev are you happy with the direction of getting this in as VPlan-only transform initially?

From the VPlan point of view, LGTM. I don't have any other comments.

Thanks, Florian!
Diego

rengolin accepted this revision.Nov 12 2018, 12:53 PM

Just had a look again and it's looking great, thanks Florian!

The unit tests help a lot in understanding what's the gist of the transformation and I really like where this is going.

I think time-wise, @RKSimon's concerns were addressed (ie. we're not going to switch to this any time soon, and certainly not without making sure all his work is kept/moved in).

Given that most people are happy with it, and there are no objections, I'll go on and approve it.

Thanks everyone!

This revision is now accepted and ready to land.Nov 12 2018, 12:53 PM

Thanks, everyone! Important first step towards the great converged Loop+SLP vectorizer.

I think time-wise, @RKSimon's concerns were addressed (ie. we're not going to switch to this any time soon, and certainly not without making sure all his work is kept/moved in).

Yup, I have no objections - cheers!

fhahn updated this revision to Diff 173830.Nov 13 2018, 3:45 AM

Thanks for all the review!

I've pulled in the DenseMapInfo changes from D52312 into this patch again, as making it generic requires some additional work. I will open an issue to follow up on this.

I'll commit this patch once D49489 gets accepted and committed.

fhahn closed this revision.Nov 14 2018, 8:14 AM

Committed as rL346857, thanks for all the comments! I will create a few tickets for follow up work.