Page MenuHomePhabricator

[SLP] Implement initial memory versioning.
AcceptedPublic

Authored by fhahn on May 20 2021, 2:46 AM.

Details

Summary

This patch is just an initial sketch to get a discussion going on how to
best support generating runtime checks for may-aliasing memory accesses.

The key question to start with is where to best collect and generate
runtime checks. Currently codegen for a block is eager; for each block,
if we find a vectorizable tree, we vectorize it and then analyze the
rest of the block. This makes it hard to collect *all* runtime checks
for a block before changing the code.

Perhaps for now we need to limit the checks to the first vectorizable
tree in a block? This is what the patch tries to do.

There are a couple of mechanical/technical issues that need to be
addressed, but I think the questions above are key to answer/address
first.

Other than that, the patch does not yet consider the cost of cloning
the block and the runtime checks. It also does not introduce phis
for values used outside the cloned block, so it will generate invalid IR
in those cases for now.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Just a bit of a heads up that I took this patch (and it's dependencies) and run some numbers for x264 SPEC, where I have seen quite a few missed opportunities caused by inability to emit runtime alias checks. I don't see any change in performance though, which is slightly unexpected (I was hoping for some already). But it might be that the next thing is blocking SLP vectorisation, for AArch64 which is what I am looking at, and that might be cost-model issues. This is the case for the 2 examples I am currently looking at, but will do some more analysis. And we of course need this patch as an enabler.

The case from x264 is in the function @f_alias in ../AArch64/loadi8.ll, right?

Sorry for the late reply, but just wanted to confirm this: yes, that @f_alias in ../AArch64/loadi8.ll is a reproducer from x264.

fhahn updated this revision to Diff 365819.Aug 11 2021, 12:37 PM

Some major restructuring of the code for the basic block handling and removed some restrictions. The code now keeps track of the first and last instruction that uses a tracked object and only duplicates the instructions in between. The blocks are split in a way that allows us to roll back the changes to the original CFG. The runtime checks are now only generated if profitable. This has the drawback that we have to estimate the cost a bit differently. But it has the advantage that we only need to change the function if we vectorize.

The compile-time impact looks reasonable, worst geomean change is +0.09 for NewPM-ReleaseLTO-g http://llvm-compile-time-tracker.com/compare.php?from=7c61447052351b722c4d5aa4254cca9054a0be79&to=831b8bc299ef904a9980e59f59eab936d93bc9e1&stat=instructions .

We probably also can add additional heuristics to avoid attempting versioning in a few more cases.

The patch still needs additional comments, but I think the overall structure should be a good start now.

Sorry for the late reply, but just wanted to confirm this: yes, that @f_alias in ../AArch64/loadi8.ll is a reproducer from x264.

Great thanks. The code in the test should be transformed now. If you point me to the C code, I can check if it is transformed as expected now.

Sorry for the late reply, but just wanted to confirm this: yes, that @f_alias in ../AArch64/loadi8.ll is a reproducer from x264.

Great thanks. The code in the test should be transformed now. If you point me to the C code, I can check if it is transformed as expected now.

I believe that was function mc_weight_w20().

And like I mentioned earlier, I expect this change to make quite some differences for x264. For example, I hope it will trigger on quant_4x4() too, although additional trick may be required for successful SLP vectorisation of that example (cost-model and or other things).

fhahn updated this revision to Diff 367309.Aug 18 2021, 12:50 PM

Rebased and improved runtime check generation: 1) do not generated checks between 2 read-only groups and 2) skip overflow checks because we know that the last element is dereferenced (Alive2 agrees but I will double check if that is intentional).

Sorry for the late reply, but just wanted to confirm this: yes, that @f_alias in ../AArch64/loadi8.ll is a reproducer from x264.

Great thanks. The code in the test should be transformed now. If you point me to the C code, I can check if it is transformed as expected now.

I believe that was function mc_weight_w20().

Hm interesting. The latest version should vectorize mc_weight_w20 on AArch64. But there's no measurable speedup from that unfortunately on the hardware I have access to. I also cannot measure any speedup if I add restrict to the mc_weight_w20 arguments, which should cause SLP vectorization without runtime checks. Is it possible I am missing something?

And like I mentioned earlier, I expect this change to make quite some differences for x264. For example, I hope it will trigger on quant_4x4() too, although additional trick may be required for successful SLP vectorisation of that example (cost-model and or other things).

I am seeing 5-10% speedups when vectorizing quant_4x4. After runtime check generation, there's still another issue though (conditional load that's not hoisted out).

Anything else to be done? I see some commented code lines.

I am seeing 5-10% speedups when vectorizing quant_4x4. After runtime check generation, there's still another issue though (conditional load that's not hoisted out).

Hi @fhahn, I tried cherry-picking this patch and I'm not seeing quant_4x4 getting vectorized. Maybe I'm missing something here. The commandline invocation I'm using is
clang --target=aarch64-linux-gnu -mllvm -slp-memory-versioning -mllvm -debug-only=SLP -O3 -c quant_4x4.c -S -o quant_4x4.S

fhahn updated this revision to Diff 372992.Sep 16 2021, 10:36 AM

Rebased, added back verifyFunction call, remove commented-out code. Also precommitted a bunch of tests which exposed crashes in earlier versions of the patch.

I am seeing 5-10% speedups when vectorizing quant_4x4. After runtime check generation, there's still another issue though (conditional load that's not hoisted out).

Hi @fhahn, I tried cherry-picking this patch and I'm not seeing quant_4x4 getting vectorized. Maybe I'm missing something here. The commandline invocation I'm using is
clang --target=aarch64-linux-gnu -mllvm -slp-memory-versioning -mllvm -debug-only=SLP -O3 -c quant_4x4.c -S -o quant_4x4.S

I think I measured this with modifying the source (hoisting the loads) and running the loop vectorizer. Even with runtime checks, we need to improve if-conversion for SLP as mentioned in https://lists.llvm.org/pipermail/llvm-dev/2021-September/152740.html. Also, the current patch does not support vectorizing reduction values that are returned, but that should be a relatively easy extension.

fhahn added a comment.Sep 29 2021, 5:24 AM

@ABataev any suggestions on how to move forward with the patch?

@ABataev any suggestions on how to move forward with the patch?

I am not @ABataev :-) but I see you're defaulting slp-memory-versioning to false and was wondering about this. That looks like a safe strategy. But why is that exactly, as opposed to e.g. just enabling this by default? Is that because it's easier to test things before flipping the switch? How/when do you plan to flip this switch?

Reverse ping. :-)

Any plans to pick this up, or how I could help with this? I have taken this patch, ran some testing and numbers, and can confirm previous observations so that's good. I could e.g. do a in-depth code review this week, which is something I haven't done yet if someone else doesn't get there first...

fhahn updated this revision to Diff 381334.Oct 21 2021, 11:10 AM

@ABataev any suggestions on how to move forward with the patch?

I am not @ABataev :-) but I see you're defaulting slp-memory-versioning to false and was wondering about this. That looks like a safe strategy. But why is that exactly, as opposed to e.g. just enabling this by default? Is that because it's easier to test things before flipping the switch? How/when do you plan to flip this switch?

My only concern was compile-time impact. Overall the impact does not look too bad and the cases where it increases the compile-time it seems to correspond to code changes (= additional vectorization)

http://llvm-compile-time-tracker.com/compare.php?from=da7e033ee020a1708b98f3e4f159eef904a42600&to=cf03006253252dbe8a209bcb6d513f85bb5c036a&stat=instructions

Reverse ping. :-)

Any plans to pick this up, or how I could help with this? I have taken this patch, ran some testing and numbers, and can confirm previous observations so that's good. I could e.g. do a in-depth code review this week, which is something I haven't done yet if someone else doesn't get there first...

There were 2 more crashes I uncovered during me testing which should now be addressed. I also landed the outstanding tests, so it should be easier to apply the patch.

I have now read the code for the first time, and here's a first round of comments.
I definitely need to read this again, which is what I will do soon...

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
190

This is a nitpick, bikeshedding names... but I was wondering about the terminology here (i.e. memory versioning).
We emit runtime memory checks, and do versioning of a block. So was wondering if this should be BlockVersioning.

9488

Remove this?

10110–10463

Somewhere a high-level idea and description of the algorithm would be nice, I guess that would be here.

10117

Typo? delete -> deleted.

10137

Unused?

10139

That wasn't obvious to me, why is Start dereferenced?

10140

Do we want to link to alive (not sure how persistent the data is and if it's common), or just expand the example here?

10167

Could std::next return null (or end)?

10205

Don't think I have seen a block name with the .tail suffix in the tests.

10241

typo: versiond -> versioned.

10244

I am not that familiar with this, but do we expect this in one of the tests?

10329

Perhaps create a local helper function to handle this case.

10330

Re: "or possible", I would guess it's possible but not profitable?

10357

And another helper for this case too.

Relaying a message from @dmgreen about the emitted runtime checks: could they tank performance in some cases? For example, in (nested) loops? Or are they trivially hoisted if they are emitted in a loop nest? Do we have any idea or data about this?

Hi @fhahn, I wanted to check if this is still a priority for you? It is for us, this will enable a lot of things in x264, so am keen to progress this one way or another. We could continue this work, although it looks like we are nearly there...

ABataev added inline comments.Dec 1 2021, 5:08 AM
llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
58–59

Convert this to enum? + add comments

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
797

Comments? Also, put it to a namespace

798–802

Default initializers and defaulted constructor?

811

isValidElementType?

815

isValidElementType?

10251–10257

Move out of the loop. Also, similar lambda is used in another place, worth creating a static function instead of lambdas.

10328

Do you consider the cost of all new check instructions as 1 here?

fhahn added a comment.Dec 6 2021, 1:01 AM

Hi @fhahn, I wanted to check if this is still a priority for you? It is for us, this will enable a lot of things in x264, so am keen to progress this one way or another. We could continue this work, although it looks like we are nearly there...

Unfortunately I missed that I didn't update the patch after the latest comments! I'll update it today.

fhahn updated this revision to Diff 392057.Dec 6 2021, 7:26 AM
fhahn marked 22 inline comments as done.

Address latest sets of comments, thanks! I hope I didn't miss anything.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
190

My intention with the naming here was to be clear about why/what we version; the versioning is based on the memory accesses, we will create one version with the original memory access and a second version where memory accesses to different underlying objects won't alias.

I'm not sure about tying this directly to the scope we support at the moment (single block). We might want to extend the scope in the future. For example, consider a loop which contains a block to version, where each access also depends on an induction variable. We might want to try to push the checks outside the loop, by widening them to encompass the whole iteration space, so we only need to check once.

797

put into an anonymous namespace and added comments

798–802

remove the AccessInfo() constructor, added default values instead

811

updated, thanks

9488

it's gone now

10110–10463

I added an outline at the top of the function.

10137

I think this is out of date, AS and WrittenObjs should be used.

10139

We get the Start expression from the pointer operand of either a store or load, so it should always be dereference by those instructions.

10140

I'll remove it.

10167

I don't think so; both LastTracked and FirstTracked should point to loads or stores, so there should always be a next instruction (the terminator). If they would be 0, there shouldn't be any bounds (MemBounds should be empty).

10205

This split is mainly there to make it easier to split off only the region containing from first tracked to last tracked. It is later folded back into the merge block.

10244

The domain? It is in the metadata nodes created, but the tests only check the !noalias attachments, not the definitions of the metadata at the moment.

10251–10257

updated to use getLoadStorePointerOperand from Instructions.h.

10328

Yes, it assumes each instruction costs 1.

10329

I moved it to a function.

10330

I guess saying not beneficial is enough. When we reach here, it is always possible. Originally the possible bit was intended to refer to the case where we version but nothing gets vectorized because SLP fails due to other reasons. That's not really clear though, I'll remove it.

10357

I think moving this to a separate function would require a lot of arguments and the code here is quite compact.

Address latest sets of comments, thanks! I hope I didn't miss anything.

Thanks for that.

Not directly related to code changes, but more general question: could we see negative performance impact of these runtime checks? For example, if they are emitted in loops? Or are they trivially hoisted out of loops? Any thoughts on these kind of things?

fhahn added a comment.Dec 6 2021, 9:41 AM

Address latest sets of comments, thanks! I hope I didn't miss anything.

Thanks for that.

Not directly related to code changes, but more general question: could we see negative performance impact of these runtime checks? For example, if they are emitted in loops? Or are they trivially hoisted out of loops? Any thoughts on these kind of things?

Yes, there could be negative performance impact, for at least the following reasons 1) cost-model bugs overestimating the vector savings and 2)if the vector path is never taken because the runtime checks fail. Cases of 1) will be rather easy to fix, while there won't be too much we can do about 2) unfortunately, unless we have profile info. Note that we have the same issue with LV.

Address latest sets of comments, thanks! I hope I didn't miss anything.

Thanks for that.

Not directly related to code changes, but more general question: could we see negative performance impact of these runtime checks? For example, if they are emitted in loops? Or are they trivially hoisted out of loops? Any thoughts on these kind of things?

Yes, there could be negative performance impact, for at least the following reasons 1) cost-model bugs overestimating the vector savings and 2)if the vector path is never taken because the runtime checks fail. Cases of 1) will be rather easy to fix, while there won't be too much we can do about 2) unfortunately, unless we have profile info. Note that we have the same issue with LV.

Yeah okay, sure, fair enough. Just wanted to double check, and you haven't seen any performance disasters in Spec or the test suite?
I am thinking if we should not just start with this enabled:

static cl::opt<bool> EnableMemoryVersioning(
    "slp-memory-versioning", cl::init(false), cl::Hidden,

If we get perf regression reports, this can be kept in tree but with this flag off, then at least we directly know what to fix.

Reverse ping :-)
Any thoughts on my previous comment?

fhahn retitled this revision from [SLPVectorizer] Implement initial memory versioning. to [SLP] Implement initial memory versioning..Dec 16 2021, 1:50 AM
fhahn added a comment.Dec 16 2021, 1:56 AM

Yeah okay, sure, fair enough. Just wanted to double check, and you haven't seen any performance disasters in Spec or the test suite?
I am thinking if we should not just start with this enabled:

static cl::opt<bool> EnableMemoryVersioning(
    "slp-memory-versioning", cl::init(false), cl::Hidden,

If we get perf regression reports, this can be kept in tree but with this flag off, then at least we directly know what to fix.

I agree it would be good to start with this enabled by default! My main concern at the moment is compile-time impact. I need to re-measure, but from last time I remember that there were a few CTMark cases that had noticeable increases.

I think the rate of when versioning leads to actually vectorizing the versioned block at the moment is ~15%. I think for it to be enabled by default it would be good to increase this rate first. At the moment the reason this rate is so low is that we consider a block for versioning once we found an aliasing access and effectively stop analyzing further. So even if there are other problems preventing vectorization, we still try to version it.

Yeah okay, sure, fair enough. Just wanted to double check, and you haven't seen any performance disasters in Spec or the test suite?
I am thinking if we should not just start with this enabled:

static cl::opt<bool> EnableMemoryVersioning(
    "slp-memory-versioning", cl::init(false), cl::Hidden,

If we get perf regression reports, this can be kept in tree but with this flag off, then at least we directly know what to fix.

I agree it would be good to start with this enabled by default! My main concern at the moment is compile-time impact. I need to re-measure, but from last time I remember that there were a few CTMark cases that had noticeable increases.

Ok, that would indeed be good to double-check.

I think the rate of when versioning leads to actually vectorizing the versioned block at the moment is ~15%. I think for it to be enabled by default it would be good to increase this rate first. At the moment the reason this rate is so low is that we consider a block for versioning once we found an aliasing access and effectively stop analyzing further. So even if there are other problems preventing vectorization, we still try to version it.

Ok, I see. I suspect there will be a bit of a long tail of work anyway. But as this is such a crucial enabler for any further work in this area, I would be inclined to start with this on by default and "see what happens". If it manages to be enabled by default, that would be so much more ideal...

I am going to look over this one more time, but think I am going to LGTM this soon (requesting another lgtm too).

SjoerdMeijer accepted this revision.Dec 23 2021, 6:25 AM

This looks generally good to me, but it would be good to get a blessing from @ABataev too.

Ideally I would like to see two things changed before this is committed:

  • Start with this enabled by default, like I also mentioned in previous comments,
  • I added a comment about the style of vectorizeBlockWithVersioning, which is too big for my liking, see also the comment inlined. I think that could do with a refactoring, and it was almost a blocker for me, but not entirely. :-)
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10133

Style: this function is nearly 300 lines, which I think is big for a function, and it is already not the easiest function to read. I would prefer to see this split up and using helper functions.

10521

Make this an internal option, or a static constexpr.

This revision is now accepted and ready to land.Dec 23 2021, 6:25 AM

Hey, it's me again. :)
Any plans to pick this up, shall we get this in?

fhahn added a comment.Jan 27 2022, 8:12 AM

Hey, it's me again. :)
Any plans to pick this up, shall we get this in?

Hi, I am still planning to push this in, but I don't think it should happen before the branch for the next release which will happen soon.

I also would like to wrap up D109368 first, which applies one of the fundamental ideas of this patch (optimistically generated RT checks, undo if not profitable) to LV and has in fact been going on for longer. There were a couple of inefficiencies in some of the generated runtime checks, which meant we did not estimate their cost accurately. Those should all be fixed now, and I hope D109368 can go in soon after the branch. Once it has been in tree for a couple of weeks, we should proceed with this patch here IMO.

Okay, thanks, that sounds like a good plan!

Allen added a subscriber: Allen.May 14 2022, 6:07 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 14 2022, 6:07 AM
Herald added a subscriber: vporpo. · View Herald Transcript
fhahn updated this revision to Diff 454933.Aug 23 2022, 12:53 PM

Rebase and intermediate update to switch to using more lightweight checks. This needs a few more constraints in places to account for the more lightweight checks, but should be workable for runtime testing if anybody is interested.

khchen added a subscriber: khchen.Oct 22 2022, 10:04 AM

This needed a rebase. Not sure I missed anything doing that, but clang runs in an assert compiling x264.