This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Implement initial memory versioning.
Accepted · Public

Authored by fhahn on May 20 2021, 2:46 AM.

Details

Summary

This patch is just an initial sketch to get a discussion going on how to
best support generating runtime checks for may-aliasing memory accesses.

The key question to start with is where to best collect and generate
runtime checks. Currently codegen for a block is eager; for each block,
if we find a vectorizable tree, we vectorize it and then analyze the
rest of the block. This makes it hard to collect *all* runtime checks
for a block before changing the code.

Perhaps for now we need to limit the checks to the first vectorizable
tree in a block? This is what the patch tries to do.

There are a couple of mechanical/technical issues that need to be
addressed, but I think the questions above are key to answer/address
first.

Other than that, the patch does not yet consider the cost of cloning
the block and the runtime checks. It also does not introduce phis
for values used outside the cloned block, so it will generate invalid IR
in those cases for now.
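
For illustration, here is a hedged, source-level sketch of the kind of straight-line code this targets and of the shape the versioned code takes; the function and the exact check are made up for this summary, not taken from the patch.

#include <stdint.h>

/* Without knowing that dst and src do not overlap, the stores to dst may
   clobber the later loads from src, so SLP has to keep the block scalar. */
void scale4(int *dst, const int *src, int scale, int offset) {
  dst[0] = src[0] * scale + offset;
  dst[1] = src[1] * scale + offset;
  dst[2] = src[2] * scale + offset;
  dst[3] = src[3] * scale + offset;
}

/* Conceptually, memory versioning emits the equivalent of: */
void scale4_versioned(int *dst, const int *src, int scale, int offset) {
  if ((uintptr_t)dst + 4 * sizeof(int) <= (uintptr_t)src ||
      (uintptr_t)src + 4 * sizeof(int) <= (uintptr_t)dst) {
    /* versioned block: dst and src provably do not overlap, vectorizable */
    dst[0] = src[0] * scale + offset;
    dst[1] = src[1] * scale + offset;
    dst[2] = src[2] * scale + offset;
    dst[3] = src[3] * scale + offset;
  } else {
    /* original scalar block */
    dst[0] = src[0] * scale + offset;
    dst[1] = src[1] * scale + offset;
    dst[2] = src[2] * scale + offset;
    dst[3] = src[3] * scale + offset;
  }
}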

Diff Detail

Event Timeline

fhahn created this revision.May 20 2021, 2:46 AM
fhahn requested review of this revision.May 20 2021, 2:46 AM
Herald added a project: Restricted Project.May 20 2021, 2:46 AM
fhahn added a comment.May 20 2021, 2:47 AM

Reviewers, this initial version is mostly intended as a first sketch to discuss how to best decide for which bundles/trees to collect checks and where to generate them. Any high-level thoughts/comments would be greatly appreciated.

How about emitting proper aliasing metadata in the generated basic blocks? It should solve many problems for SLP without extra changes and may help some other passes too. I mean, memory accesses in entry.slpversioned: should be marked noalias and those in entry.scalar: may alias.

fhahn added a comment.May 20 2021, 6:40 AM

How about emitting proper aliasing metadata in the generated basic blocks? It should solve many problems for SLP without extra changes and may help some other passes too. I mean, memory accesses in entry.slpversioned: should be marked noalias and those in entry.scalar: may alias.

That sounds like a good idea! The question is how to best decide for which accesses to generate RT checks. At the moment, I think that's only possible for the first vectorizable tree per BB we generate code for. Is it easy to get the first and last instruction in a vectorizable tree? Then we could just clone this range.

Yes, interesting. LoopVersioningLICM came to my mind, except that works on loops of course...

How about emitting proper aliasing metadata in the generated basic blocks? It should solve many problems for SLP without extra changes and may help some other passes too. I mean, memory accesses in entry.slpversioned: should be marked noalias and those in entry.scalar: may alias.

That sounds like a good idea! The question is how to best decide for which accesses to generate RT checks. At the moment, I think that's only possible for the first vectorizable tree per BB we generate code for. Is it easy to get the first and last instruction in a vectorizable tree? Then we could just clone this range.

You can easily get the first instruction - VectorizableTree.front(). Stores can only be at the front of the vectorizable tree. As for the last instruction, I think we need to scan all tree entries and find all gather nodes with memory accesses that may alias with the stores. The main problem is that the versioning is better performed before we try to build the tree; otherwise we may decide to gather some instructions instead of trying to vectorize them because of possible aliasing, and that may not be profitable.

Matt added a subscriber: Matt.May 20 2021, 8:08 AM
fhahn added a comment.May 20 2021, 8:33 AM

...

The main problem is that the versioning is better performed before we try to build the tree; otherwise we may decide to gather some instructions instead of trying to vectorize them because of possible aliasing, and that may not be profitable.

This is the main challenge I think. We could split the block at the first tree that requires runtime checks, but only generate runtime checks for all trees once we have processed all trees for the block. If runtime checks are not profitable overall, we can remove the cloned block and re-process the block without versioning. May be a little expensive though.

SjoerdMeijer added a comment (edited).Jun 2 2021, 6:32 AM

Hi Florian, thanks for putting this up for review and starting the discussion. If you don't mind me asking, how high is this on your todo-list? The reason I am asking is that this will help x264, where we are behind a lot (and more generally it solves this long-outstanding issue that we have). Fixing x264 is high on our list, which is why I put up D102748 to start this discussion. If, for example, you don't see time to work on this, we could consider looking at it.

fhahn updated this revision to Diff 350256.Jun 7 2021, 5:36 AM

You can easily get the first instruction - VectorizableTree.front(). Stores can only be at the front of the vectorizable tree. As for the last instruction, I think we need to scan all tree entries and find all gather nodes with memory accesses that may alias with the stores. The main problem is that the versioning is better performed before we try to build the tree; otherwise we may decide to gather some instructions instead of trying to vectorize them because of possible aliasing, and that may not be profitable.

I updated the patch to only collect possible bounds for versioning first and queue basic blocks for which we found bounds for re-processing. When re-processing those blocks, we create a versioned block with the appropriate !noalias metadata and re-run vectorization (for now just seeded by stores). As is, this should be correct, but may not be optimal, because either:
a) there were issues preventing vectorization other than aliasing, which may cause us to not vectorize anything in the versioned block
b) the runtime checks are too expensive and offset the gain from additional vectorization.

We should be able to solve both by comparing the cost of the versioned & non-versioned BB after vectorization. If versioning is not profitable, we can remove the conditional branch again. What do you think of this direction?
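
For reference, a hedged sketch of how the cloned accesses could get their metadata; MDBuilder and the metadata kinds below are existing LLVM APIs, but the helper and its wiring are illustrative only, not the patch's actual code.

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/MDBuilder.h"
using namespace llvm;

// Hypothetical helper: mark the loads and stores in the versioned clone so AA
// can assume the loads do not alias the stores (and vice versa).
static void annotateVersionedBlock(BasicBlock *VersionedBB) {
  LLVMContext &Ctx = VersionedBB->getContext();
  MDBuilder MDB(Ctx);
  MDNode *Domain = MDB.createAnonymousAliasScopeDomain("SLPVer");
  MDNode *LoadScope = MDB.createAnonymousAliasScope(Domain, "SLPVerLoads");
  MDNode *StoreScope = MDB.createAnonymousAliasScope(Domain, "SLPVerStores");
  for (Instruction &I : *VersionedBB) {
    if (isa<LoadInst>(&I)) {
      I.setMetadata(LLVMContext::MD_alias_scope, MDNode::get(Ctx, {LoadScope}));
      I.setMetadata(LLVMContext::MD_noalias, MDNode::get(Ctx, {StoreScope}));
    } else if (isa<StoreInst>(&I)) {
      I.setMetadata(LLVMContext::MD_alias_scope, MDNode::get(Ctx, {StoreScope}));
      I.setMetadata(LLVMContext::MD_noalias, MDNode::get(Ctx, {LoadScope}));
    }
  }
}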

Hi Florian, thanks for putting this up for review and starting the discussion. If you don't mind me asking, how high is this on your todo-list? The reason I am asking is that this well help x264 where we are behind a lot (and it more in general it solves this long outstanding issue that we have). Fixing x264 is high on our list, which is why I put up D102748 to start this discussion. If, for example, you don't see time to work on this, we could consider looking at it.

It's fairly high on my todo list, as we have quite a number of such cases reported by our internal users. But I'm not planning to rush it, and I think we should also start collecting micro-benchmarks for the missed opportunities, so we can thoroughly evaluate the impact and avoid introducing regressions in the future.

You can easily get the first instruction - VectorizableTree.front(). Stores can only be at the front of the vectorizable tree. As for the last instruction, I think we need to scan all tree entries and find all gather nodes with memory accesses that may alias with the stores. The main problem is that the versioning is better performed before we try to build the tree; otherwise we may decide to gather some instructions instead of trying to vectorize them because of possible aliasing, and that may not be profitable.

I updated the patch to only collect possible bounds for versioning first and queue basic blocks for which we found bounds for re-processing. When re-processing those blocks, we create a versioned block with the appropriate !noalias metadata and re-run vectorization (for now just seeded by stores). As is, this should be correct, but may not be optimal, because either:
a) there were issues preventing vectorization other than aliasing, which may cause us to not vectorize anything in the versioned block
b) the runtime checks are too expensive and offset the gain from additional vectorization.

We should be able to solve both by comparing the cost of the versioned & non-versioned BB after vectorization. If versioning is not profitable, we can remove the conditional branch again. What do you think of this direction?

Yeah, looks like this is the only way for now. Should be much easier to do with VPlan. Also, I would look at the instruction scheduling. Most probably, we need to schedule the instructions more aggressively to improve locality and avoid regressions.

Hi Florian, thanks for putting this up for review and starting the discussion. If you don't mind me asking, how high is this on your todo-list? The reason I am asking is that this will help x264, where we are behind a lot (and more generally it solves this long-outstanding issue that we have). Fixing x264 is high on our list, which is why I put up D102748 to start this discussion. If, for example, you don't see time to work on this, we could consider looking at it.

It's fairly high on my todo list, as we have quite a number of such cases reported by our internal users. But I'm not planning to rush it, and I think we should also start collecting micro-benchmarks for the missed opportunities, so we can thoroughly evaluate the impact and avoid introducing regressions in the future.

Nice one, excellent. What are your ideas on the micro-benchmarks? Is that something you'll be using internally, or were you e.g. thinking about adding things to the test suite? Just asking because I would be happy to contribute to that.

fhahn added a comment.Jun 7 2021, 8:43 AM

Hi Florian, thanks for putting this up for review and starting the discussion. If you don't mind me asking, how high is this on your todo-list? The reason I am asking is that this will help x264, where we are behind a lot (and more generally it solves this long-outstanding issue that we have). Fixing x264 is high on our list, which is why I put up D102748 to start this discussion. If, for example, you don't see time to work on this, we could consider looking at it.

It's fairly high on my todo list, as we have quite a number of such cases reported by our internal users. But I'm not planning to rush it, and I think we should also start collecting micro-benchmarks for the missed opportunities, so we can thoroughly evaluate the impact and avoid introducing regressions in the future.

Nice one, excellent. What are your ideas on the micro-benchmarks? Is that something you'll be using internally, or were you e.g. thinking about adding things to the test suite? Just asking because I would be happy to contribute to that.

I was thinking about adding something like D101844, just with SLP examples.

fhahn updated this revision to Diff 350319.Jun 7 2021, 9:00 AM

Fix a crash caused by not updating the DT's DFS numbering, and a case where we were checking for aliasing between pointers of the same underlying object.

fhahn updated this revision to Diff 350679.Jun 8 2021, 11:46 AM

Update to remove deleted instructions early, so we do not pick up instructions that are about to be deleted during SCEV expansion; also flip the branch targets.

fhahn updated this revision to Diff 351502.Jun 11 2021, 11:02 AM

Finished off the main functionality; the update includes:

  • computing & comparing the cost of the versioned block to the original block,
  • more tests,
  • preserving LI & DT if changes to the CFG are made.

We now also undo the CFG changes, if the versioning is not profitable. Currently for MultiSource, SPEC2000 & SPEC2006 with -O3 on X86/AVX2+AVX512, versioning is tried and deemed beneficial 123 times and deemed unprofitable 224 times. Hopefully we can reduce the failure rate in the future.

Note that the numbers include 2 improvements to SCEV to catch additional cases which I'll share soon. There's also some additional refactoring that needs doing.

I also put up D104126 to add a first set of micro-benchmarks that require SLP versioning. Those are some of my motivating cases and fairly simple, so any additions on that front would also be greatly appreciated.

I would add an option to control whether to allow it or not.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1017

What about EXPENSIVE_CHECKS and verifyFunction?
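
(For reference, the usual pattern looks roughly like the sketch below; where exactly it belongs in the patch is up to the author, and BB here is just a placeholder for whatever block was changed. Needs llvm/IR/Verifier.h and llvm/Support/Debug.h.)

#ifdef EXPENSIVE_CHECKS
  // Catch invalid IR produced by the CFG surgery early instead of much later.
  assert(!verifyFunction(*BB->getParent(), &dbgs()) &&
         "SLP memory versioning produced invalid IR");
#endif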

fhahn updated this revision to Diff 353436.Jun 21 2021, 11:34 AM

Adjust preserved passes in the legacy PM, update pipeline tests.

The SCEV improvements to increase the number of memory ops vectorized are available now as D104319 & D104634. I'll soon start with some more cleanup/refactoring and with adding the feature behind a flag, as suggested.

ormris removed a subscriber: ormris.Jun 23 2021, 9:44 AM
fhahn updated this revision to Diff 356712.Jul 6 2021, 7:13 AM

Rebased code on top of D105481, which allows re-using the existing code to collect and generate the runtime check code. Also added a flag to enable memory versioning (off by default for now).

fhahn retitled this revision from [SLPVectorizer] WIP Implement initial memory versioning (WIP!) to [SLPVectorizer] Implement initial memory versioning..Jul 6 2021, 7:13 AM
ABataev added inline comments.Jul 6 2021, 8:08 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
792–798

Make it a static function?

815

Why do we need to insert something similar twice, here and below?

10543

Better to turn 300 into an option value.
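
(A hedged sketch of what that could look like; the option name, description and exact meaning of the limit are made up here:)

static cl::opt<unsigned> MemoryVersioningInstrThreshold(
    "slp-memory-versioning-instr-threshold", cl::init(300), cl::Hidden,
    cl::desc("Maximum number of instructions to scan when collecting bounds "
             "for memory versioning"));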

10547

I think this can be simplified; it may consume a lot of time. Better to implement some kind of simple traversal here: check if there are memory accesses in the operands, check if they may alias, and gather these aliases.
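
(A hedged sketch of such a traversal, using only generic alias-analysis helpers; the function name and parameters are illustrative, not the patch's code. Needs llvm/Analysis/AliasAnalysis.h, llvm/Analysis/MemoryLocation.h and llvm/IR/Instructions.h.)

// Hypothetical helper: walk candidate operands and keep the simple loads that
// may alias one of the seed stores.
static void collectMayAliasLoads(ArrayRef<Value *> Operands,
                                 ArrayRef<StoreInst *> Stores, AAResults &AA,
                                 SmallVectorImpl<LoadInst *> &MayAlias) {
  for (Value *Op : Operands) {
    auto *LI = dyn_cast<LoadInst>(Op);
    if (!LI || !LI->isSimple())
      continue;
    for (StoreInst *SI : Stores)
      if (!AA.isNoAlias(MemoryLocation::get(LI), MemoryLocation::get(SI))) {
        MayAlias.push_back(LI);
        break;
      }
  }
}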

llvm/test/Transforms/SLPVectorizer/AArch64/loadi8.ll
150–152

What's changed here? Just the code growth?

SjoerdMeijer added inline comments.Jul 6 2021, 8:25 AM
llvm/test/Transforms/SLPVectorizer/AArch64/loadi8.ll
90

This comment is out of date now.

150–152

Was wondering the same. To me it looks like we are emitting the runtime checks, but still not vectorising, which we probably want to avoid.

fhahn added inline comments.Jul 8 2021, 7:52 AM
llvm/test/Transforms/SLPVectorizer/AArch64/loadi8.ll
150–152

This is what happens if versioning was attempted, but did not enable SLP vectorisation for the block. Probably a case where some memory operations were excluded for some reason. I'll investigate.

The versioned block has been deleted, but the dead runtime checks still remain; those should be cleaned up by a later pass.

Just a bit of a heads up that I took this patch (and its dependencies) and ran some numbers for x264 from SPEC, where I have seen quite a few missed opportunities caused by the inability to emit runtime alias checks. I don't see any change in performance though, which is slightly unexpected (I was hoping for some already). But it might be that something else is now blocking SLP vectorisation for AArch64, which is what I am looking at, and that might be cost-model issues. This is the case for the 2 examples I am currently looking at, but I will do some more analysis. And we of course need this patch as an enabler.

Just a bit of a heads up that I took this patch (and its dependencies) and ran some numbers for x264 from SPEC, where I have seen quite a few missed opportunities caused by the inability to emit runtime alias checks. I don't see any change in performance though, which is slightly unexpected (I was hoping for some already). But it might be that something else is now blocking SLP vectorisation for AArch64, which is what I am looking at, and that might be cost-model issues. This is the case for the 2 examples I am currently looking at, but I will do some more analysis. And we of course need this patch as an enabler.

I also tried taking this patch and running the numbers for x264 from SPEC. I'm also not seeing an improvement in performance. Which loops are you specifically looking at? I see that the loop you referenced in https://reviews.llvm.org/D102748 is still not vectorized with this patch.

Just a bit of a heads up that I took this patch (and its dependencies) and ran some numbers for x264 from SPEC, where I have seen quite a few missed opportunities caused by the inability to emit runtime alias checks. I don't see any change in performance though, which is slightly unexpected (I was hoping for some already). But it might be that something else is now blocking SLP vectorisation for AArch64, which is what I am looking at, and that might be cost-model issues. This is the case for the 2 examples I am currently looking at, but I will do some more analysis. And we of course need this patch as an enabler.

The case from x264 is in the function @f_alias in ../AArch64/loadi8.ll, right?

With versioning, the SLP vectorizer generates the following vector block:

entry.slpversioned:                               ; preds = %entry.slpmemcheck
  %scale = getelementptr inbounds %struct.weight_t, %struct.weight_t* %w, i64 0, i32 0
  %0 = load i32, i32* %scale, align 16
  %offset = getelementptr inbounds %struct.weight_t, %struct.weight_t* %w, i64 0, i32 1
  %1 = load i32, i32* %offset, align 4
  %arrayidx.1 = getelementptr inbounds i8, i8* %src, i64 1
  %arrayidx2.1 = getelementptr inbounds i8, i8* %dst, i64 1
  %arrayidx.2 = getelementptr inbounds i8, i8* %src, i64 2
  %arrayidx2.2 = getelementptr inbounds i8, i8* %dst, i64 2
  %arrayidx.3 = getelementptr inbounds i8, i8* %src, i64 3
  %2 = bitcast i8* %src to <4 x i8>*
  %3 = load <4 x i8>, <4 x i8>* %2, align 1, !alias.scope !0, !noalias !3
  %4 = zext <4 x i8> %3 to <4 x i32>
  %5 = insertelement <4 x i32> poison, i32 %0, i32 0
  %6 = insertelement <4 x i32> %5, i32 %0, i32 1
  %7 = insertelement <4 x i32> %6, i32 %0, i32 2
  %8 = insertelement <4 x i32> %7, i32 %0, i32 3
  %9 = mul nsw <4 x i32> %8, %4
  %10 = insertelement <4 x i32> poison, i32 %1, i32 0
  %11 = insertelement <4 x i32> %10, i32 %1, i32 1
  %12 = insertelement <4 x i32> %11, i32 %1, i32 2
  %13 = insertelement <4 x i32> %12, i32 %1, i32 3
  %14 = add nsw <4 x i32> %9, %13
  %15 = icmp ult <4 x i32> %14, <i32 256, i32 256, i32 256, i32 256>
  %16 = icmp sgt <4 x i32> %14, zeroinitializer
  %17 = sext <4 x i1> %16 to <4 x i32>
  %18 = select <4 x i1> %15, <4 x i32> %14, <4 x i32> %17
  %19 = trunc <4 x i32> %18 to <4 x i8>
  %arrayidx2.3 = getelementptr inbounds i8, i8* %dst, i64 3
  %20 = bitcast i8* %dst to <4 x i8>*
  store <4 x i8> %19, <4 x i8>* %20, align 1, !alias.scope !3, !noalias !0
  br label %entry.merge

The problem there is that currently the cost of the vector block is compared to the cost of the original scalar block. In the case at hand the vectorized IR is not optimal and the cost gets over-estimated, causing the versioned block to be dropped as unprofitable.

We can tackle that in different ways: a) have the SLP vectorizer generate more optimal vector IR (e.g. a shuffle instead of a chain of inserts) or b) post-process the IR a bit before computing the cost. Not sure how much work a) would be. @ABataev any ideas?
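
(For illustration, option a) amounts to replacing each chain of four insertelement instructions above with the canonical insert + shuffle splat; IRBuilder's existing CreateVectorSplat emits exactly that pattern. The helper and variable names below are illustrative only.)

#include "llvm/IR/IRBuilder.h"
using namespace llvm;

// Sketch only: splat the scalar scale/offset once via shufflevector (what
// CreateVectorSplat produces) instead of a chain of insertelement instructions.
static Value *emitScaleOffset(IRBuilder<> &Builder, Value *Scale, Value *Offset,
                              Value *WideSrc) {
  Value *ScaleSplat = Builder.CreateVectorSplat(4, Scale);
  Value *OffsetSplat = Builder.CreateVectorSplat(4, Offset);
  return Builder.CreateNSWAdd(Builder.CreateNSWMul(ScaleSplat, WideSrc),
                              OffsetSplat);
}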

We can tackle that in different ways: a) have the SLP vectorizer generate more optimal vector IR (e.g. a shuffle instead of a chain of inserts) or b) post-process the IR a bit before computing the cost. Not sure how much work a) would be. @ABataev any ideas?

We can easily generate shuffles for splats/reused scalars, will implement this ASAP.

Just a bit of a heads up that I took this patch (and its dependencies) and ran some numbers for x264 from SPEC, where I have seen quite a few missed opportunities caused by the inability to emit runtime alias checks. I don't see any change in performance though, which is slightly unexpected (I was hoping for some already). But it might be that something else is now blocking SLP vectorisation for AArch64, which is what I am looking at, and that might be cost-model issues. This is the case for the 2 examples I am currently looking at, but I will do some more analysis. And we of course need this patch as an enabler.

The case from x264 is in the function @f_alias in ../AArch64/loadi8.ll, right?

Sorry for the late reply, but just wanted to confirm this: yes, that @f_alias in ../AArch64/loadi8.ll is a reproducer from x264.

fhahn updated this revision to Diff 365819.Aug 11 2021, 12:37 PM

Some major restructuring of the code for the basic block handling, and some restrictions removed. The code now keeps track of the first and last instruction that uses a tracked object and only duplicates the instructions in between. The blocks are split in a way that allows us to roll back the changes to the original CFG. The runtime checks are now only generated if profitable. This has the drawback that we have to estimate the cost a bit differently, but it has the advantage that we only need to change the function if we vectorize.
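
For reference, a hedged sketch of the kind of splitting and cloning described above, using existing utilities from BasicBlockUtils.h and Cloning.h; the helper, its parameters and the exact wiring are illustrative, not the patch's actual code.

#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Dominators.h"
#include "llvm/Transforms/Utils/BasicBlockUtils.h"
#include "llvm/Transforms/Utils/Cloning.h"
using namespace llvm;

// Hypothetical helper: carve the range [FirstTracked, LastTracked] out into
// its own block and clone it. The clone is what later gets the !noalias
// metadata and is guarded by the runtime checks; the split points make it
// straightforward to fold everything back if versioning is not profitable.
static BasicBlock *splitAndCloneTrackedRegion(Instruction *FirstTracked,
                                              Instruction *LastTracked,
                                              DominatorTree &DT, LoopInfo &LI) {
  // FirstTracked/LastTracked are loads or stores, so LastTracked always has a
  // following instruction (at worst the block terminator).
  BasicBlock *Region =
      SplitBlock(FirstTracked->getParent(), FirstTracked, &DT, &LI);
  SplitBlock(Region, LastTracked->getNextNode(), &DT, &LI);

  ValueToValueMapTy VMap;
  BasicBlock *Clone = CloneBasicBlock(Region, VMap, ".slpversioned",
                                      Region->getParent());
  SmallVector<BasicBlock *, 1> ClonedBlocks = {Clone};
  remapInstructionsInBlocks(ClonedBlocks, VMap);
  return Clone; // the caller wires up the memcheck branch and the merge block
}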

The compile-time impact looks reasonable, worst geomean change is +0.09 for NewPM-ReleaseLTO-g http://llvm-compile-time-tracker.com/compare.php?from=7c61447052351b722c4d5aa4254cca9054a0be79&to=831b8bc299ef904a9980e59f59eab936d93bc9e1&stat=instructions .

We probably also can add additional heuristics to avoid attempting versioning in a few more cases.

The patch still needs additional comments, but I think the overall structure should be a good start now.

Sorry for the late reply, but just wanted to confirm this: yes, that @f_alias in ../AArch64/loadi8.ll is a reproducer from x264.

Great, thanks. The code in the test should be transformed now. If you point me to the C code, I can check whether it is transformed as expected now.

Sorry for the late reply, but just wanted to confirm this: yes, that @f_alias in ../AArch64/loadi8.ll is a reproducer from x264.

Great, thanks. The code in the test should be transformed now. If you point me to the C code, I can check whether it is transformed as expected now.

I believe that was function mc_weight_w20().

And like I mentioned earlier, I expect this change to make quite a difference for x264. For example, I hope it will trigger on quant_4x4() too, although additional tricks may be required for successful SLP vectorisation of that example (cost-model and/or other things).

fhahn updated this revision to Diff 367309.Aug 18 2021, 12:50 PM

Rebased and improved runtime check generation: 1) do not generate checks between 2 read-only groups and 2) skip overflow checks because we know that the last element is dereferenced (Alive2 agrees, but I will double-check whether that is intentional).

Sorry for the late reply, but just wanted to confirm this: yes, that @f_alias in ../AArch64/loadi8.ll is a reproducer from x264.

Great, thanks. The code in the test should be transformed now. If you point me to the C code, I can check whether it is transformed as expected now.

I believe that was function mc_weight_w20().

Hm interesting. The latest version should vectorize mc_weight_w20 on AArch64. But there's no measurable speedup from that unfortunately on the hardware I have access to. I also cannot measure any speedup if I add restrict to the mc_weight_w20 arguments, which should cause SLP vectorization without runtime checks. Is it possible I am missing something?

And like I mentioned earlier, I expect this change to make quite a difference for x264. For example, I hope it will trigger on quant_4x4() too, although additional tricks may be required for successful SLP vectorisation of that example (cost-model and/or other things).

I am seeing 5-10% speedups when vectorizing quant_4x4. After runtime check generation, there's still another issue though (conditional load that's not hoisted out).

Anything else to be done? I see some commented-out code lines.

I am seeing 5-10% speedups when vectorizing quant_4x4. After runtime check generation, there's still another issue though (conditional load that's not hoisted out).

Hi @fhahn, I tried cherry-picking this patch and I'm not seeing quant_4x4 getting vectorized. Maybe I'm missing something here. The command-line invocation I'm using is
clang --target=aarch64-linux-gnu -mllvm -slp-memory-versioning -mllvm -debug-only=SLP -O3 -c quant_4x4.c -S -o quant_4x4.S

fhahn updated this revision to Diff 372992.Sep 16 2021, 10:36 AM

Rebased, added back the verifyFunction call, removed commented-out code. Also precommitted a bunch of tests which exposed crashes in earlier versions of the patch.

I am seeing 5-10% speedups when vectorizing quant_4x4. After runtime check generation, there's still another issue though (conditional load that's not hoisted out).

Hi @fhahn, I tried cherry-picking this patch and I'm not seeing quant_4x4 getting vectorized. Maybe I'm missing something here. The command-line invocation I'm using is
clang --target=aarch64-linux-gnu -mllvm -slp-memory-versioning -mllvm -debug-only=SLP -O3 -c quant_4x4.c -S -o quant_4x4.S

I think I measured this by modifying the source (hoisting the loads) and running the loop vectorizer. Even with runtime checks, we need to improve if-conversion for SLP, as mentioned in https://lists.llvm.org/pipermail/llvm-dev/2021-September/152740.html. Also, the current patch does not support vectorizing reduction values that are returned, but that should be a relatively easy extension.

fhahn added a comment.Sep 29 2021, 5:24 AM

@ABataev any suggestions on how to move forward with the patch?

@ABataev any suggestions on how to move forward with the patch?

I am not @ABataev :-) but I see you're defaulting slp-memory-versioning to false and was wondering about this. That looks like a safe strategy. But why is that exactly, as opposed to e.g. just enabling this by default? Is that because it's easier to test things before flipping the switch? How/when do you plan to flip this switch?

Reverse ping. :-)

Any plans to pick this up, or how could I help with this? I have taken this patch, run some testing and numbers, and can confirm previous observations, so that's good. I could e.g. do an in-depth code review this week, which is something I haven't done yet, if someone else doesn't get there first...

fhahn updated this revision to Diff 381334.Oct 21 2021, 11:10 AM

@ABataev any suggestions on how to move forward with the patch?

I am not @ABataev :-) but I see you're defaulting slp-memory-versioning to false and was wondering about this. That looks like a safe strategy. But why is that exactly, as opposed to e.g. just enabling this by default? Is that because it's easier to test things before flipping the switch? How/when do you plan to flip this switch?

My only concern was compile-time impact. Overall the impact does not look too bad, and the cases where it increases compile time seem to correspond to code changes (= additional vectorization):

http://llvm-compile-time-tracker.com/compare.php?from=da7e033ee020a1708b98f3e4f159eef904a42600&to=cf03006253252dbe8a209bcb6d513f85bb5c036a&stat=instructions

Reverse ping. :-)

Any plans to pick this up, or how could I help with this? I have taken this patch, run some testing and numbers, and can confirm previous observations, so that's good. I could e.g. do an in-depth code review this week, which is something I haven't done yet, if someone else doesn't get there first...

There were 2 more crashes I uncovered during my testing which should now be addressed. I also landed the outstanding tests, so it should be easier to apply the patch.

I have now read the code for the first time, and here's a first round of comments.
I definitely need to read this again, which is what I will do soon...

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
190

This is a nitpick, bikeshedding names... but I was wondering about the terminology here (i.e. memory versioning).
We emit runtime memory checks and version a block, so I was wondering if this should be BlockVersioning.

9488

Remove this?

10110–10463

Somewhere a high-level idea and description of the algorithm would be nice, I guess that would be here.

10117

Typo? delete -> deleted.

10137

Unused?

10139

That wasn't obvious to me, why is Start dereferenced?

10140

Do we want to link to alive (not sure how persistent the data is and if it's common), or just expand the example here?

10167

Could std::next return null (or end)?

10205

Don't think I have seen a block name with the .tail suffix in the tests.

10241

typo: versiond -> versioned.

10244

I am not that familiar with this, but do we expect this in one of the tests?

10329

Perhaps create a local helper function to handle this case.

10330

Re: "or possible", I would guess it's possible but not profitable?

10357

And another helper for this case too.

Relaying a message from @dmgreen about the emitted runtime checks: could they tank performance in some cases? For example, in (nested) loops? Or are they trivially hoisted if they are emitted in a loop nest? Do we have any idea or data about this?

Hi @fhahn, I wanted to check if this is still a priority for you? It is for us: this will enable a lot of things in x264, so I am keen to progress this one way or another. We could continue this work, although it looks like we are nearly there...

ABataev added inline comments.Dec 1 2021, 5:08 AM
llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
58–59

Convert this to enum? + add comments

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
797

Comments? Also, put it in a namespace.

798–802

Default initializers and defaulted constructor?

811

isValidElementType?

815

isValidElementType?

10251–10257

Move out of the loop. Also, a similar lambda is used in another place; worth creating a static function instead of the lambdas.

10328

Do you consider the cost of all new check instructions as 1 here?

fhahn added a comment.Dec 6 2021, 1:01 AM

Hi @fhahn, I wanted to check if this is still a priority for you? It is for us: this will enable a lot of things in x264, so I am keen to progress this one way or another. We could continue this work, although it looks like we are nearly there...

Unfortunately I missed that I didn't update the patch after the latest comments! I'll update it today.

fhahn updated this revision to Diff 392057.Dec 6 2021, 7:26 AM
fhahn marked 22 inline comments as done.

Address latest sets of comments, thanks! I hope I didn't miss anything.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
190

My intention with the naming here was to be clear about why/what we version; the versioning is based on the memory accesses, we will create one version with the original memory access and a second version where memory accesses to different underlying objects won't alias.

I'm not sure about tying this directly to the scope we support at the moment (single block). We might want to extend the scope in the future. For example, consider a loop which contains a block to version, where each access also depends on an induction variable. We might want to try to push the checks outside the loop, by widening them to encompass the whole iteration space, so we only need to check once.

797

put into an anonymous namespace and added comments

798–802

remove the AccessInfo() constructor, added default values instead

811

updated, thanks

9488

it's gone now

10110–10463

I added an outline at the top of the function.

10137

I think this is out of date, AS and WrittenObjs should be used.

10139

We get the Start expression from the pointer operand of either a store or a load, so it should always be dereferenced by those instructions.

10140

I'll remove it.

10167

I don't think so; both LastTracked and FirstTracked should point to loads or stores, so there should always be a next instruction (the terminator). If they were null, there shouldn't be any bounds (MemBounds should be empty).

10205

This split is mainly there to make it easier to split off only the region from the first tracked to the last tracked instruction. It is later folded back into the merge block.

10244

The domain? It is in the metadata nodes created, but the tests only check the !noalias attachments, not the definitions of the metadata at the moment.

10251–10257

updated to use getLoadStorePointerOperand from Instructions.h.

10328

Yes, it assumes each instruction costs 1.

10329

I moved it to a function.

10330

I guess saying not beneficial is enough. When we reach here, it is always possible. Originally the possible bit was intended to refer to the case where we version but nothing gets vectorized because SLP fails due to other reasons. That's not really clear though, I'll remove it.

10357

I think moving this to a separate function would require a lot of arguments and the code here is quite compact.

Address latest sets of comments, thanks! I hope I didn't miss anything.

Thanks for that.

Not directly related to code changes, but a more general question: could we see a negative performance impact of these runtime checks? For example, if they are emitted in loops? Or are they trivially hoisted out of loops? Any thoughts on these kinds of things?

fhahn added a comment.Dec 6 2021, 9:41 AM

Address latest sets of comments, thanks! I hope I didn't miss anything.

Thanks for that.

Not directly related to code changes, but a more general question: could we see a negative performance impact of these runtime checks? For example, if they are emitted in loops? Or are they trivially hoisted out of loops? Any thoughts on these kinds of things?

Yes, there could be a negative performance impact, for at least the following reasons: 1) cost-model bugs overestimating the vector savings and 2) if the vector path is never taken because the runtime checks fail. Cases of 1) will be rather easy to fix, while there won't be much we can do about 2) unfortunately, unless we have profile info. Note that we have the same issue with LV.

Address latest sets of comments, thanks! I hope I didn't miss anything.

Thanks for that.

Not directly related to code changes, but a more general question: could we see a negative performance impact of these runtime checks? For example, if they are emitted in loops? Or are they trivially hoisted out of loops? Any thoughts on these kinds of things?

Yes, there could be a negative performance impact, for at least the following reasons: 1) cost-model bugs overestimating the vector savings and 2) if the vector path is never taken because the runtime checks fail. Cases of 1) will be rather easy to fix, while there won't be much we can do about 2) unfortunately, unless we have profile info. Note that we have the same issue with LV.

Yeah okay, sure, fair enough. Just wanted to double check, and you haven't seen any performance disasters in Spec or the test suite?
I am wondering whether we should not just start with this enabled:

static cl::opt<bool> EnableMemoryVersioning(
    "slp-memory-versioning", cl::init(false), cl::Hidden,

If we get perf regression reports, this can be kept in tree but with this flag off, then at least we directly know what to fix.

Reverse ping :-)
Any thoughts on my previous comment?

fhahn retitled this revision from [SLPVectorizer] Implement initial memory versioning. to [SLP] Implement initial memory versioning..Dec 16 2021, 1:50 AM
fhahn added a comment.Dec 16 2021, 1:56 AM

Yeah okay, sure, fair enough. Just wanted to double check, and you haven't seen any performance disasters in Spec or the test suite?
I am wondering whether we should not just start with this enabled:

static cl::opt<bool> EnableMemoryVersioning(
    "slp-memory-versioning", cl::init(false), cl::Hidden,

If we get perf regression reports, this can be kept in tree but with this flag off, then at least we directly know what to fix.

I agree it would be good to start with this enabled by default! My main concern at the moment is compile-time impact. I need to re-measure, but from last time I remember that there were a few CTMark cases that had noticeable increases.

I think the rate at which versioning actually leads to vectorizing the versioned block is ~15% at the moment. I think for it to be enabled by default it would be good to increase this rate first. At the moment the reason this rate is so low is that we consider a block for versioning once we find an aliasing access and effectively stop analyzing further. So even if there are other problems preventing vectorization, we still try to version it.

Yeah okay, sure, fair enough. Just wanted to double check, and you haven't seen any performance disasters in Spec or the test suite?
I am wondering whether we should not just start with this enabled:

static cl::opt<bool> EnableMemoryVersioning(
    "slp-memory-versioning", cl::init(false), cl::Hidden,

If we get perf regression reports, this can be kept in tree but with this flag off, then at least we directly know what to fix.

I agree it would be good to start with this enabled by default! My main concern at the moment is compile-time impact. I need to re-measure, but from last time I remember that there were a few CTMark cases that had noticeable increases.

Ok, that would indeed be good to double-check.

I think the rate at which versioning actually leads to vectorizing the versioned block is ~15% at the moment. I think for it to be enabled by default it would be good to increase this rate first. At the moment the reason this rate is so low is that we consider a block for versioning once we find an aliasing access and effectively stop analyzing further. So even if there are other problems preventing vectorization, we still try to version it.

Ok, I see. I suspect there will be a bit of a long tail of work anyway. But as this is such a crucial enabler for any further work in this area, I would be inclined to start with this on by default and "see what happens". If it manages to be enabled by default, that would be so much more ideal...

I am going to look over this one more time, but think I am going to LGTM this soon (requesting another lgtm too).

SjoerdMeijer accepted this revision.Dec 23 2021, 6:25 AM

This looks generally good to me, but it would be good to get a blessing from @ABataev too.

Ideally I would like to see two things changed before this is committed:

  • Start with this enabled by default, like I also mentioned in previous comments,
  • I added a comment about the style of vectorizeBlockWithVersioning, which is too big for my liking, see also the comment inlined. I think that could do with a refactoring, and it was almost a blocker for me, but not entirely. :-)
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10133

Style: this function is nearly 300 lines, which I think is big for a function, and it is already not the easiest function to read. I would prefer to see this split up into helper functions.

10521

Make this an internal option, or a static constexpr.

This revision is now accepted and ready to land.Dec 23 2021, 6:25 AM

Hey, it's me again. :)
Any plans to pick this up, shall we get this in?

fhahn added a comment.Jan 27 2022, 8:12 AM

Hey, it's me again. :)
Any plans to pick this up, shall we get this in?

Hi, I am still planning to push this in, but I don't think it should happen before the branch for the next release which will happen soon.

I also would like to wrap up D109368 first, which applies one of the fundamental ideas of this patch (optimistically generate RT checks, undo if not profitable) to LV and has in fact been going on for longer. There were a couple of inefficiencies in some of the generated runtime checks, which meant we did not estimate their cost accurately. Those should all be fixed now, and I hope D109368 can go in soon after the branch. Once it has been in tree for a couple of weeks, we should proceed with this patch here, IMO.

Okay, thanks, that sounds like a good plan!

Allen added a subscriber: Allen.May 14 2022, 6:07 AM
Herald added a project: Restricted Project.May 14 2022, 6:07 AM
Herald added a subscriber: vporpo.
fhahn updated this revision to Diff 454933.Aug 23 2022, 12:53 PM

Rebase and intermediate update to switch to using more lightweight checks. This needs a few more constraints in places to account for the more lightweight checks, but should be workable for runtime testing if anybody is interested.

khchen added a subscriber: khchen.Oct 22 2022, 10:04 AM

This needed a rebase. Not sure I missed anything doing that, but clang runs into an assert compiling x264.