
[SLPVectorizer] WIP Implement initial memory versioning (WIP!)
Needs ReviewPublic

Authored by fhahn on May 20 2021, 2:46 AM.

Details

Summary

This patch is just an initial sketch to get a discussion going on how to
best support generating runtime checks for may-aliasing memory accesses.

The key question to start with is where to best collect and generate
runtime checks. Currently codegen for a block is eager; for each block,
if we find a vectorizable tree, we vectorize it and then analyze the
rest of the block. This makes it hard to collect *all* runtime checks
for a block before changing the code.

Perhaps for now we need to limit the checks to the first vectorizable
tree in a block? This is what the patch tries to do.

There are a couple of mechanical/technical issues that need to be
addressed, but I think the questions above are key to answer/address
first.

Other than that, the patch does not yet consider the cost of cloning
the block and the runtime checks. It also does not introduce phis
for values used outside the cloned block, so it will generate invalid IR
in those cases for now.
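To make the discussion concrete, here is a hedged, source-level sketch of the transformation the patch aims at. The names `add4_scalar`, `add4_versioned`, and `ranges_disjoint` are invented for illustration; the patch itself performs this versioning on LLVM IR, not C++ source.

```cpp
#include <cstdint>

// Hypothetical scalar kernel: without extra information the compiler must
// assume the stores through `a` may alias the loads through `b`, which
// blocks SLP-vectorizing the block.
void add4_scalar(int *a, const int *b) {
  a[0] += b[0];
  a[1] += b[1];
  a[2] += b[2];
  a[3] += b[3];
}

// Runtime check the versioned code would be guarded by: the two accessed
// ranges are disjoint. Compare as integers to avoid undefined behavior
// from relational comparison of pointers into unrelated objects.
bool ranges_disjoint(const int *a, const int *b, unsigned n) {
  uintptr_t aBeg = (uintptr_t)a, aEnd = (uintptr_t)(a + n);
  uintptr_t bBeg = (uintptr_t)b, bEnd = (uintptr_t)(b + n);
  return aEnd <= bBeg || bEnd <= aBeg;
}

// Versioned form: a clone that a vectorizer may turn into wide loads and
// stores, plus the original scalar block as the fallback.
void add4_versioned(int *a, const int *b) {
  if (ranges_disjoint(a, b, 4)) {
    // "entry.slpversioned": accesses are provably disjoint here.
    for (int i = 0; i < 4; ++i)
      a[i] += b[i];
  } else {
    // "entry.scalar": original may-alias code.
    add4_scalar(a, b);
  }
}
```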

Diff Detail

Event Timeline

fhahn created this revision. · May 20 2021, 2:46 AM
fhahn requested review of this revision. · May 20 2021, 2:46 AM
Herald added a project: Restricted Project. · View Herald Transcript · May 20 2021, 2:46 AM
fhahn added a comment. · May 20 2021, 2:47 AM

Reviewers, this initial version is mostly intended as a first sketch to discuss how to best decide for which bundles/trees to collect checks and where to generate them. Any high-level thoughts/comments would be greatly appreciated.

How about emitting proper aliasing metadata in the generated basic blocks? It should solve many problems with SLP without extra changes and may help some other passes too. That is, memory accesses in entry.slpversioned: should be marked noalias, and those in entry.scalar: may alias.
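The effect of the suggested metadata can be pictured at source level with `__restrict`: inside the guarded clone the pointers are known disjoint, so tagging the accesses lets later passes (not just SLP) reorder and vectorize freely. This is only an analogy; the patch would attach `!noalias` metadata to IR instructions rather than change source types.

```cpp
#include <cstddef>

// Sketch (not from the patch): this body stands in for the
// "entry.slpversioned" block. The __restrict qualifiers play the role of
// the !noalias metadata: loads from `src` cannot alias stores to `dst`,
// as the runtime check established before branching here.
void saxpy4(float *__restrict dst, const float *__restrict src, float k) {
  for (size_t i = 0; i < 4; ++i)
    dst[i] += k * src[i];
}
```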

fhahn added a comment. · May 20 2021, 6:40 AM

How about emitting proper aliasing metadata in the generated basic blocks? It should solve many problems with SLP without extra changes and may help some other passes too. That is, memory accesses in entry.slpversioned: should be marked noalias, and those in entry.scalar: may alias.

That sounds like a good idea! The question is how to best decide for which accesses to generate RT checks. At the moment, I think that's only possible for the first vectorizable tree per BB we generate code for. Is it easy to get the first and last instruction in a vectorizable tree? Then we could just clone this range.

Yes, interesting. LoopVersioningLICM came to my mind, except that works on loops of course...

How about emitting proper aliasing metadata in the generated basic blocks? It should solve many problems with SLP without extra changes and may help some other passes too. That is, memory accesses in entry.slpversioned: should be marked noalias, and those in entry.scalar: may alias.

That sounds like a good idea! The question is how to best decide for which accesses to generate RT checks. At the moment, I think that's only possible for the first vectorizable tree per BB we generate code for. Is it easy to get the first and last instruction in a vectorizable tree? Then we could just clone this range.

You can easily get the first instruction: VectorizableTree.front(). Stores can only be at the front of the vectorizable tree. As to the last instruction, I think we need to scan all tree entries and find all gather nodes with memory accesses that may alias the stores. The main problem is that the versioning is better performed before we try to build the tree; otherwise we may decide to gather some instructions instead of trying to vectorize them because of possible aliasing, and that may not be profitable.
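The scan described above can be modeled in miniature. The structures below are toy stand-ins, not LLVM's actual `TreeEntry`: the tree's front entry gives the root (the stores), and the rest of the clone range comes from walking all entries for gathered memory accesses that alias analysis could not prove independent of those stores.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Toy stand-in for an SLP tree entry (invented fields, for illustration).
struct ToyTreeEntry {
  bool IsGather;       // gather node rather than a vectorized node
  bool IsMemAccess;    // reads or writes memory
  bool MayAliasStores; // AA could not prove independence from the root stores
  int Position;        // index of the entry's instruction in the basic block
};

// Returns the inclusive [first, last] block positions a versioned clone
// would need to cover, or {-1, -1} if there is no may-aliasing gathered
// access (in which case no versioning is needed).
std::pair<int, int> cloneRange(const std::vector<ToyTreeEntry> &Tree) {
  if (Tree.empty())
    return {-1, -1};
  int Root = Tree.front().Position; // the stores sit at the front of the tree
  int Lo = Root, Hi = Root;
  bool FoundAlias = false;
  for (const ToyTreeEntry &E : Tree)
    if (E.IsGather && E.IsMemAccess && E.MayAliasStores) {
      Lo = std::min(Lo, E.Position);
      Hi = std::max(Hi, E.Position);
      FoundAlias = true;
    }
  return FoundAlias ? std::make_pair(Lo, Hi) : std::make_pair(-1, -1);
}
```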

Matt added a subscriber: Matt. · May 20 2021, 8:08 AM
fhahn added a comment. · May 20 2021, 8:33 AM

...

The main problem is that the versioning is better performed before we try to build the tree; otherwise we may decide to gather some instructions instead of trying to vectorize them because of possible aliasing, and that may not be profitable.

This is the main challenge, I think. We could split the block at the first tree that requires runtime checks, but only generate runtime checks once we have processed all trees for the block. If the runtime checks are not profitable overall, we can remove the cloned block and re-process the block without versioning. It may be a little expensive, though.

SjoerdMeijer added a comment. Edited · Wed, Jun 2, 6:32 AM

Hi Florian, thanks for putting this up for review and starting the discussion. If you don't mind me asking, how high is this on your todo list? The reason I am asking is that this will help x264, where we are behind a lot (and, more generally, it solves this long-outstanding issue that we have). Fixing x264 is high on our list, which is why I put up D102748 to start this discussion. If, for example, you don't have time to work on this, we could consider looking at it.

fhahn updated this revision to Diff 350256. · Mon, Jun 7, 5:36 AM

You can easily get the first instruction: VectorizableTree.front(). Stores can only be at the front of the vectorizable tree. As to the last instruction, I think we need to scan all tree entries and find all gather nodes with memory accesses that may alias the stores. The main problem is that the versioning is better performed before we try to build the tree; otherwise we may decide to gather some instructions instead of trying to vectorize them because of possible aliasing, and that may not be profitable.

I updated the patch to first collect possible bounds for versioning, and to queue basic blocks for which bounds were found for re-processing. When re-processing those blocks, we create a versioned block with the appropriate !noalias metadata and re-run vectorization (for now just seeded by stores). As is, this should be correct, but may not be optimal, because either:
a) there were issues preventing vectorization other than aliasing, which may cause us to not vectorize anything in the versioned block
b) the runtime checks are too expensive and offset the gain from additional vectorization.

We should be able to solve both by comparing the cost of the versioned & non-versioned BB after vectorization. If versioning is not profitable, we can remove the conditional branch again. What do you think of this direction?
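The proposed cost comparison can be sketched in a few lines. This is an abstract model with invented names, not the patch's actual TTI-based cost computation: keep the versioned CFG only when the vectorization win in the cloned block exceeds what the runtime checks cost.

```cpp
// Invented types for illustration; costs are in abstract TTI-style units.
struct VersioningCosts {
  int ScalarBlockCost;    // original block, left scalar
  int VersionedBlockCost; // cloned block after SLP with !noalias applied
  int RuntimeCheckCost;   // pointer-bound comparisons plus the branch
};

// Both failure modes a) and b) above surface here: either the clone did
// not get cheaper (no additional vectorization happened), or the runtime
// checks ate the gain. In either case the conditional branch is undone.
bool isVersioningProfitable(const VersioningCosts &C) {
  int Benefit = C.ScalarBlockCost - C.VersionedBlockCost;
  return Benefit > C.RuntimeCheckCost;
}
```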

Hi Florian, thanks for putting this up for review and starting the discussion. If you don't mind me asking, how high is this on your todo list? The reason I am asking is that this will help x264, where we are behind a lot (and, more generally, it solves this long-outstanding issue that we have). Fixing x264 is high on our list, which is why I put up D102748 to start this discussion. If, for example, you don't have time to work on this, we could consider looking at it.

It's fairly high on my todo list, as we have quite a number of such cases reported by our internal users. But I'm not planning to rush this, and I think we should also start collecting micro-benchmarks for the missed opportunities, so we can thoroughly evaluate the impact and avoid introducing regressions in the future.

You can easily get the first instruction: VectorizableTree.front(). Stores can only be at the front of the vectorizable tree. As to the last instruction, I think we need to scan all tree entries and find all gather nodes with memory accesses that may alias the stores. The main problem is that the versioning is better performed before we try to build the tree; otherwise we may decide to gather some instructions instead of trying to vectorize them because of possible aliasing, and that may not be profitable.

I updated the patch to first collect possible bounds for versioning, and to queue basic blocks for which bounds were found for re-processing. When re-processing those blocks, we create a versioned block with the appropriate !noalias metadata and re-run vectorization (for now just seeded by stores). As is, this should be correct, but may not be optimal, because either:
a) there were issues preventing vectorization other than aliasing, which may cause us to not vectorize anything in the versioned block
b) the runtime checks are too expensive and offset the gain from additional vectorization.

We should be able to solve both by comparing the cost of the versioned & non-versioned BB after vectorization. If versioning is not profitable, we can remove the conditional branch again. What do you think of this direction?

Yeah, looks like this is the only way for now. It should be much easier to do with VPlan. Also, I would look at the instruction scheduling; most probably, we need to schedule the instructions more aggressively to improve locality and avoid regressions.

Hi Florian, thanks for putting this up for review and starting the discussion. If you don't mind me asking, how high is this on your todo list? The reason I am asking is that this will help x264, where we are behind a lot (and, more generally, it solves this long-outstanding issue that we have). Fixing x264 is high on our list, which is why I put up D102748 to start this discussion. If, for example, you don't have time to work on this, we could consider looking at it.

It's fairly high on my todo list, as we have quite a number of such cases reported by our internal users. But I'm not planning to rush this, and I think we should also start collecting micro-benchmarks for the missed opportunities, so we can thoroughly evaluate the impact and avoid introducing regressions in the future.

Nice one, excellent. What are your ideas on the micro-benchmarks? Is that something you'll be using internally, or were you e.g. thinking about adding things to the test-suite? Just asking because I would be happy to contribute to that.

fhahn added a comment. · Mon, Jun 7, 8:43 AM

Nice one, excellent. What are your ideas on the micro-benchmarks? Is that something you'll be using internally, or were you e.g. thinking about adding things to the test-suite? Just asking because I would be happy to contribute to that.

I was thinking about adding something like D101844, just with SLP examples.

fhahn updated this revision to Diff 350319. · Mon, Jun 7, 9:00 AM

Fix a crash caused by not updating the DT's DFS numbering, and a case where we were checking for aliasing between pointers to the same underlying object.

fhahn updated this revision to Diff 350679. · Tue, Jun 8, 11:46 AM

Update to remove deleted instructions early, so we do not pick up instructions that are about to be deleted during SCEV expansion; also flip the branch targets.

fhahn updated this revision to Diff 351502. · Fri, Jun 11, 11:02 AM

Finished off the main functionality. The update includes:

  • computing & comparing the cost of the versioned block to the original block,
  • more tests,
  • preserving LI & DT if changes to the CFG are made.

We now also undo the CFG changes if the versioning is not profitable. Currently, for MultiSource, SPEC2000 & SPEC2006 with -O3 on X86 (AVX2 and AVX512), versioning is tried and deemed beneficial 123 times and deemed unprofitable 224 times. Hopefully we can reduce the failure rate in the future.

Note that the numbers include two improvements to SCEV to catch additional cases, which I'll share soon. There's also some additional refactoring that needs doing.

I also put up D104126 to add a first set of micro-benchmarks that require SLP versioning. Those cover some of my motivating cases and are fairly simple, so any additions on that front would also be greatly appreciated.

I would add an option to control whether to allow it or not.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
751

What about EXPENSIVE_CHECKS and verifyFunction?