This is an archive of the discontinued LLVM Phabricator instance.

[LV] Generate RT checks up-front and remove them if required.
ClosedPublic

Authored by fhahn on Mar 11 2020, 4:03 AM.

Details

Summary

This patch updates LV to generate the runtime checks just after cost
modeling, to allow a more precise estimate of the actual cost of the
checks. This information will be used in future patches to generate
larger runtime checks in cases where the checks only make up a small
fraction of the expected scalar loop execution time.

The runtime checks are created up-front in a temporary block to allow better
estimating the cost and un-linked from the existing IR. After deciding to
vectorize, the checks are moved backed. If deciding not to vectorize, the
temporary block is completely removed.

This patch is similar in spirit to D71053, but explores a different
direction: instead of delaying the decision on whether to vectorize in
the presence of runtime checks it instead optimistically creates the
runtime checks early and discards them later if decided to not
vectorize. This has the advantage that the cost-modeling decisions
can be kept together and can be done up-front and thus preserving the
general code structure. I think delaying (part) of the decision to
vectorize would also make the VPlan migration a bit harder.

One potential drawback of this patch is that we speculatively
generate IR which we might have to clean up later. However it seems like
the code required to do so is quite manageable.

Diff Detail

Event Timeline

fhahn created this revision.Mar 11 2020, 4:03 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 11 2020, 4:03 AM

Thank you for looking into this!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1519

'moved backed'?

Reverse-ping. Can i somehow help get this going? :)

One potential drawback of this patch is that we speculatively
generate IR which we might have to clean up later. However it seems like
the code required to do so is quite manageable.

There's prior art for that, loop idiom pass has ExpandedValuesCleaner e.g.

Reverse-ping. Can i somehow help get this going? :)

I actually started rebasing/updating over the weekend. There are a few more loose ends to tie up, but I should be able to share an update in the next few days.

One potential drawback of this patch is that we speculatively
generate IR which we might have to clean up later. However it seems like
the code required to do so is quite manageable.

There's prior art for that, loop idiom pass has ExpandedValuesCleaner e.g.

Yes, part of the update is to fix SCEVExpander to actually collect all instructions it generates automatically, so they can be cleaned up later. It already does so for some instructions, but is missing a ton of cases.

Reverse-ping. Can i somehow help get this going? :)

I actually started rebasing/updating over the weekend. There are a few more loose ends to tie up, but I should be able to share an update in the next few days.

One potential drawback of this patch is that we speculatively
generate IR which we might have to clean up later. However it seems like
the code required to do so is quite manageable.

There's prior art for that, loop idiom pass has ExpandedValuesCleaner e.g.

Yes, part of the update is to fix SCEVExpander to actually collect all instructions it generates automatically, so they can be cleaned up later. It already does so for some instructions, but is missing a ton of cases.

Without looking at how it would look, i would imagine that can be cleaned by supplying a callback to the builder, that would automatically put each new instruction into some vector.
Let me know if that is the right direction and if it would be helpful for me to look into that (unless you're halfway there of course)

ebrevnov added a comment.EditedJul 21 2020, 11:32 PM

I'm glad to see this discussion alive again. Thanks Roman!

I think the topic deserves a dedicated discussion on dev list in a wider audience because it is not trivial one.

In fact I have an experience of doing similar thing in SLP and it was mainly negative. Things look simple in the beginning and we went ahead and implemented this approach. The problem was that over the time it became hard to manage pre-generated instructions correctly and we started having nasty bugs. Everything is simple until you don't have sharing. If you have references from multiple places (both from inside IR and internal data structures) things become complex and error prone. I would not like to go this direction again until we 1000% sure things will remain simple.

One more thing is concerning me in this particular implementation. I think the way we are trying to "hide" Checks.TmpBlock from all global analysis and data structures is fragile and can be source of bugs in future. In my opinion it would be much easier and cleaner to make pre-generated code runtime dead (by changing guarding condition to be unconditionally false) and let dedicated cleanup passes make their job.

fhahn added a comment.Jul 23 2020, 7:10 AM

I think the topic deserves a dedicated discussion on dev list in a wider audience because it is not trivial one.

Agreed, it would be good to make sure there are no major objections on the general approach. We already have a few cases of 'prior art' in the code base, but this patch might take things a little further than in some other places.

In fact I have an experience of doing similar thing in SLP and it was mainly negative. Things look simple in the beginning and we went ahead and implemented this approach. The problem was that over the time it became hard to manage pre-generated instructions correctly and we started having nasty bugs. Everything is simple until you don't have sharing. If you have references from multiple places (both from inside IR and internal data structures) things become complex and error prone. I would not like to go this direction again until we 1000% sure things will remain simple.

From what I've seen so far, the major concerns is around SCEV expansion. I started some patches solidifying the way we clean up after SCEV expansion (D84327), which is already done in LoopIdiomRecognize. I think we need to properly address this first, before continuing here.

I am not too concerned about interactions with other LV internal data structures, as the runtime checks are pretty independent of vector code generation/legality, as they are outside of the loop and nothing in the vector loop should be aware of the checks.

One more thing is concerning me in this particular implementation. I think the way we are trying to "hide" Checks.TmpBlock from all global analysis and data structures is fragile and can be source of bugs in future. In my opinion it would be much easier and cleaner to make pre-generated code runtime dead (by changing guarding condition to be unconditionally false) and let dedicated cleanup passes make their job.

That's an interesting suggestion and would certainly simplify things. My main concern with leaving cleanup to another pass is that it would mean that LV would always modify functions with loops that need runtime checks, even if the checks are not used. Not sure how big of an issue that would be in terms of compile-time caused by the additional invalidations.

It would be helpful if you could add prerequisite patches are parents of this patch.
What else needs to be done before this?

I approve of the general direction of this patch series :)
Some high-level notes. I'm not that familiar with the code in question,
so it will be best for someone else to review, too.

llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
2649 ↗(On Diff #282930)

Should this go into a separate patch? Is this a preexisting problem?

2654–2659 ↗(On Diff #282930)

Are these two assertions not identical?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
772

s/they/vectorization/

1583

I'm guessing it can't be i1 true ?

2823–2826
lebedev.ri added inline comments.Aug 7 2020, 8:21 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2873–2875
lebedev.ri requested changes to this revision.Aug 12 2020, 2:33 PM

(marking as reviewed)

This revision now requires changes to proceed.Aug 12 2020, 2:33 PM
fhahn added a comment.Oct 7 2020, 11:00 AM

Just in the process of rebasing this again now that most other parts landed and discovered I accidentally made our job harder by teaching SCEVExpander to preserve LCSSA. Needs a bit of more work now.

fhahn updated this revision to Diff 314335.Jan 4 2021, 2:11 AM

Rebased on top of current trunk. This version now can build MultiSource/SPEC2006/SPEC2000 with -O3 -flto without crashing.

Just in the process of rebasing this again now that most other parts landed and discovered I accidentally made our job harder by teaching SCEVExpander to preserve LCSSA. Needs a bit of more work now.

I recently fixed a related issue (0ea3749b3cde), which makes this point obsolete; the final CFG is constructed before expanding, which solves the issue.

rkruppe removed a subscriber: rkruppe.Jan 4 2021, 2:52 AM
fhahn updated this revision to Diff 315352.Jan 8 2021, 5:00 AM

Split off the addRuntimeCheck changes to D94295

fhahn added a comment.Jan 8 2021, 5:34 AM

Sent mail to llvm-dev (https://lists.llvm.org/pipermail/llvm-dev/2021-January/147678.html), to get thoughts on the 'speculatively create IR and throw away if not needed` pattern in general

fhahn updated this revision to Diff 319575.Jan 27 2021, 7:36 AM

Rebase & ping. The thread on llvm-dev did not surface any objections, so I think it might be time to finally restart the review on this one :)

fhahn retitled this revision from [LV] Generate RT checks up-front and remove them if required. (WIP) to [LV] Generate RT checks up-front and remove them if required..Jan 27 2021, 7:42 AM
lebedev.ri accepted this revision.Jan 27 2021, 8:31 AM

I haven't reviewed this in full detail, but this looks good to me overall.
It would probably be good for someone else to also signoff, too.

This revision is now accepted and ready to land.Jan 27 2021, 8:31 AM

I'm going to take a look next week if nobody outstrip me...

fhahn updated this revision to Diff 322101.Feb 8 2021, 7:16 AM

Rebased

I'm going to take a look next week if nobody outstrip me...

That would be very much appreciated! :)

fhahn updated this revision to Diff 323975.Feb 16 2021, 5:56 AM

Rebased again & ping :)

I am planning on landing this within the next few days or early next week, unless there are any additional concerns. The 'build IR and throw away if not needed' is already used in multiple places and there have been no objections in llvm-dev. I am also more than happy to iterate on this once in tree to iron out any issues.

I started looking into this... Hopefully, will be able to provide my feedback by tomorrow.

ebrevnov added inline comments.Feb 17 2021, 2:55 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
373

Please add empty lines around and add something like "// Forward declarations" or similar

397

Would "RTChecks" be a better name?

629

Update description on return value. Also consider using Optional<BasicBlock *>.

1520

Some general notes. I would like us to:

  1. Consolidate all the work related to runtime checks under this helper abstraction. Currently, it looks like some functionality still on it's old place. Find more comments on this as you go.
  2. Make clear boundaries between this GeneratedRTChecks and its users. Please, convert it to class and define clear public API.
1534

'Preheader' looks redundant here as it can easily be get from Loop

1544

Do you really need as separate hook to kick in generation of runtime checks? In other words can we simply move this functionality to constructor?

1547

The goal is to create "un-linked" block(s). Right? In that case, would it be simpler to not depend on Preheader and use CreateBlock directly. I believe it would be easier to un-link the block then.

1556

I would suggest generating SCEV and MemRuntime checks into separate temporary blocks in the first place. I believe that will simplify further managment.

1590

I'm surprised to see the following piece of code dealing with deletion. I would expected SCEVExpanderCleaner does all necessary cleanup. It should be enough just to properly set "markResultUsed". If this is not the case can we rework in this direction?

2811–2829

I believe this functionality should be moved to GeneratedRTChecks.

2818–2819

I believe this functionality should be moved to GeneratedRTChecks.

2823–2826

Please address.

2881

I believe this functionality should be moved to GeneratedRTChecks.

2893

TmpBlock should not be directly exposed outside of GeneratedRTChecks.

6773–6774

I think we should better get all info about runtime checks from GeneratedruntimeChecks directly.

7698

Why do we need to create GeneratedRTChecks outside of InnerLoopVectorizer? Can/Should it have bigger life time?

7945

Why this was moved from its original place?

llvm/test/Transforms/LoopVectorize/pr47343-expander-lcssa-after-cfg-update.ll
59 ↗(On Diff #323975)

Looks like we stopped vectorizing something, did we?

fhahn updated this revision to Diff 324991.Feb 19 2021, 8:32 AM
fhahn marked 14 inline comments as done.

Addressed comments, thanks! Major changes include turning GeneratedRTChecks into a class, maintaining separate blocks for SCEV checks and mem checks, moving most code to add checks back to CFG to GeneratedRTChecks.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
397

Agreed, changed to RTChecks, thanks!

1520

That's a great suggestion, thanks! I moved most of the code from emitSCEVChecks & emitMemRuntimeChecks here. There's some code (mostly assertions), which remain in the original functions, because they need access to various member of ILV.

The new function names potentially can be improved and they should have doc-comments, but I'd prefer to first settle on the final API before doing that.

1547

I updated the code put SCEV and memory checks in separate blocks. It still uses SplitBlock, because this keeps the blocks in the DT/LI, which may be used during SCEV expansion and it also keeps them linked to the right predecessors during expansion (SCEVExpander tries to find values to re-use in predecessors).

1590

The extra cleanup is needed for instructions create directly by addRuntimeChecks, like compares, and/ors which compare pointer bounds (which are generated through SCEV expansion). I updated the comment and made it clearer. It is also only needed for the block with the mem-checks.

7698

For some users (e.g. epilogue vectorization), GeneratedRTChecks needs to be shared unfortunately.

7945

This is only needed for the follow-up, I moved it to the original place for this patch.

llvm/test/Transforms/LoopVectorize/pr47343-expander-lcssa-after-cfg-update.ll
59 ↗(On Diff #323975)

We still vectorize, but we now do not create an unnecessary phi node (which has no users)

Thanks for doing the changes. Overall this version looks fine to me. Please go ahead with some final cleanups.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
632

Please update this one as well.

1547

Please, put that info into the code as a comment. SCEVExpander also can decide to hoist some instruction(s) into predecessors. Will they be properly scheduled for deletion?

1615–1616

Please revisit

1661

Why do we need to set it to null?

1688

Why storing to temp?

1689

Why needed?

fhahn updated this revision to Diff 326063.Feb 24 2021, 5:43 AM
fhahn marked 2 inline comments as done.

Addressed latest round of comments, thanks! I also added additional comments to some of the fields in GeneratedRTChecks.

fhahn updated this revision to Diff 326064.Feb 24 2021, 5:46 AM
fhahn marked 2 inline comments as done.

re-flow comment not properly formatted

fhahn marked 11 inline comments as done.Feb 24 2021, 5:51 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1544

I forgot to post my response here. In some cases we know that we do not need runtime checks up front (e.g. because we only interleave the loop). Having a helper to only generate the RT checks if needed helps to avoid unnecessary work.

1547

That's a great idea, I added the comment. All instructions that are generated by SCEVExpander should be tracked by it; it uses an IRBuilder callback to collect them, so it should not matter where exactly they are added.

1583

I am not entirely sure, but I think if the compares could be simplified to true/false during generation from SCEV expression, LoopAccessInfo should have been able to figure out that the checks are unneeded. We could also check for i1 true, but I think that would be best in a separate patch, as the original code had the same assert.

1615–1616

Fixed comment, thanks!

1661

This is used to indicate the check was used and should not be cleaned up. I'll add a comment.

1688

It's not needed any longer, earlier versions set MemCheckBlock = nullptr. I removed it.

1689

This is used to indicate the check was used and should not be cleaned up. I'll add a comment.

ebrevnov accepted this revision.Feb 24 2021, 7:34 PM

LGTM.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1656

Can/Should SCEVCheckCond be used instead? Just to make it symmetric with emitMemRuntimeChecks.

fhahn updated this revision to Diff 326363.Feb 25 2021, 5:12 AM
fhahn marked 5 inline comments as done.

Update to use SCEVCheckCond.

This revision was landed with ongoing or failed builds.Mar 1 2021, 2:48 AM
This revision was automatically updated to reflect the committed changes.