This is an archive of the discontinued LLVM Phabricator instance.

[IVDescriptors] Add pointer InductionDescriptors with non-constant strides (try 2)
ClosedPublic

Authored by reames on Mar 31 2023, 9:25 AM.

Details

Summary

(JFYI - This has been heavily reframed since the original attempt at landing.)

This change updates the InductionDescriptor logic to allow matching a pointer IV with a non-constant stride, but also updates the LoopVectorizer to bail out on such descriptors by default. This preserves the vectorizer's default behavior.
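For illustration, here is a minimal C sketch (a hypothetical example, not taken from the patch or its tests) of the loop shape in question - a pointer induction variable whose step is a runtime value rather than a constant:

```c
/* Hypothetical example: `p` is a pointer IV whose step is the runtime
 * value `stride`. InductionDescriptor previously only matched pointer
 * IVs with constant steps; with this change the recurrence is
 * recognized, and the LoopVectorizer bails out on it unless the new
 * flag is set. */
void store_strided(char *p, long stride, long n) {
  for (long i = 0; i < n; i++) {
    *p = 0;
    p += stride; /* non-constant step */
  }
}
```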

In review, it was pointed out that there are multiple unfortunate performance implications which need to be addressed before this can be enabled. Having a flag allows us to exercise the behavior and write test cases for logic which is otherwise unreachable (or hard to reach).

This will also enable non-constant-stride pointer recurrences for other consumers. I've audited that code and don't see any obvious issues.

Diff Detail

Event Timeline

reames created this revision.Mar 31 2023, 9:25 AM
Herald added a project: Restricted Project.Mar 31 2023, 9:25 AM
reames requested review of this revision.Mar 31 2023, 9:25 AM
reames added a comment.EditedMar 31 2023, 9:32 AM

@dmgreen I'd really appreciate any input you can share on the performance swing you observed.

Thinking through this, I've got a few potential theories, but it would help to know which (if any) is correct.

Option 1 - The codegen for the scalar and vector expansions is fairly poor. With solely constants, the expressions would get constant-folded down so much it doesn't really matter, but for non-constants we're leaving a lot for the backend to clean up. This is an easy fix, but probably the least likely issue.

Option 2 - There's something missing in the cost model for arbitrary gathers, or something in target code is matching strided accesses too broadly. I haven't looked into this at all yet. This is probably the most likely.

Option 3 - This is simply exposing more cases where LAA/LV can speculate on the stride of a non-constant-step IV. (We already do this for integer forms; see the sketch after this list.) From what I've seen in the cases I'm looking at, speculating and versioning on stride == 1 at runtime is commonly unprofitable. I'd planned on tackling that, but given it already kicks in so widely on scalar IVs, I hadn't thought enabling pointer IVs would make it critical.

Option 4 - Because of the lack of aliasing support for non-constant strides, we're generating more runtime checks, resulting in overhead for shorter loops. This falls into the same category as the previous option: if it were an issue, I'd be surprised we don't already see it with integer IVs.
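To make Option 3 concrete, here is a sketch (a hypothetical example, names mine) of the existing integer-IV behavior being referred to - LAA/LV can emit a runtime stride == 1 check and version the loop, vectorizing only the unit-stride copy:

```c
/* Hypothetical example: because `src` is indexed by i * stride, LAA/LV
 * may version the loop on a runtime `stride == 1` check and vectorize
 * only the unit-stride copy, leaving the general-stride path scalar.
 * When the stride is rarely 1 at runtime, the check and the unused
 * vector body are pure overhead. */
void copy_strided(int *dst, const int *src, long stride, long n) {
  for (long i = 0; i < n; i++)
    dst[i] = src[i * stride];
}
```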

reames edited the summary of this revision. (Show Details)Mar 31 2023, 9:32 AM

dmgreen added a comment.

Hello. About the performance changes - I have been looking into them a little. A lot of it looks good, especially where gather/scatter are available, which this can make use of. There are some cases where it wasn't doing quite as well in some places under MVE, but https://reviews.llvm.org/D147331 should hopefully sort out the largest of them. Sam/Nick should be back on Monday to review it. We have some other cases where gather/scatter are not available for whatever reason, and those don't look as good, but the changes are much smaller. Again, it likely comes down to adjusting the cost model; perhaps the cost of the scalarized loads in getMemInstScalarizationCost needs to be a little higher under MVE. I wouldn't consider any of that a blocker considering all the improvements.

The other part under AArch64 is a bit more silly, I'm afraid, and falls into your Option 3. It is extracted from the code in x264, using the 16x16 version: https://github.com/mirror/x264/blob/master/common/pixel.c#L53, which I believe is also part of SPEC. I was thinking of adding a phase-ordering test but will just put it here: https://godbolt.org/z/qo54f7Pbv. It now decides to version the loop based on a stride of 1, like you mentioned. The loop-vectorized code is not as good as the SLP version, though, and is never executed. I wouldn't expect that to cause a lot of slowdown, as it only really adds an extra compare/branch and some dead code. The differences seemed higher than I expected in places, though - around 25% for that function in isolation on some CPUs. That might be quite noisy, though, as some were a lot closer to 0.
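For reference, the kernel has roughly this shape (a simplified sketch of x264's PIXEL_SAD_C macro at the link above, not the verbatim source) - both pixel pointers are pointer IVs stepped by runtime strides, which is exactly the pattern this patch starts recognizing:

```c
#include <stdlib.h>

/* Simplified sketch of the x264 16x16 SAD kernel: `pix1` and `pix2`
 * step by runtime strides each row, so LV now emits a stride == 1
 * versioned copy of the loop, even though the SLP-vectorized body is
 * what actually executes. */
static int sad_16x16(const unsigned char *pix1, long i_stride_pix1,
                     const unsigned char *pix2, long i_stride_pix2) {
  int i_sum = 0;
  for (int y = 0; y < 16; y++) {
    for (int x = 0; x < 16; x++)
      i_sum += abs(pix1[x] - pix2[x]);
    pix1 += i_stride_pix1;
    pix2 += i_stride_pix2;
  }
  return i_sum;
}
```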

I might also be able to provide another example of an assert this is still hitting, with Assertion `Instance.Lane.isFirstLane() && "cannot get lane > 0 for scalar"' failed. It is reducing now.

This is the code producing the assert: https://godbolt.org/z/3a4drTheG. It has been run through creduce and may be a bit over-reduced. It doesn't reproduce in godbolt, but does locally with bin/clang -O3 -mcpu=neoverse-v1 -c reduce.c -target aarch64.

clang: llvm/lib/Transforms/Vectorize/VPlan.cpp:226: llvm::Value* llvm::VPTransformState::get(llvm::VPValue*, const llvm::VPIteration&): Assertion `Instance.Lane.isFirstLane() && "cannot get lane > 0 for scalar"' failed.

Let me know if it doesn't work and I can try to get something better that uses opt.

reames added a comment.

> Hello. About the performance changes - I have been looking into them a little. A lot of it looks good, especially where gather/scatter are available, which this can make use of. There are some cases where it wasn't doing quite as well in some places under MVE, but https://reviews.llvm.org/D147331 should hopefully sort out the largest of them. Sam/Nick should be back on Monday to review it. We have some other cases where gather/scatter are not available for whatever reason, and those don't look as good, but the changes are much smaller. Again, it likely comes down to adjusting the cost model; perhaps the cost of the scalarized loads in getMemInstScalarizationCost needs to be a little higher under MVE. I wouldn't consider any of that a blocker considering all the improvements.

This is all good news, thanks!

> The other part under AArch64 is a bit more silly, I'm afraid, and falls into your Option 3. It is extracted from the code in x264, using the 16x16 version: https://github.com/mirror/x264/blob/master/common/pixel.c#L53, which I believe is also part of SPEC. I was thinking of adding a phase-ordering test but will just put it here: https://godbolt.org/z/qo54f7Pbv. It now decides to version the loop based on a stride of 1, like you mentioned. The loop-vectorized code is not as good as the SLP version, though, and is never executed. I wouldn't expect that to cause a lot of slowdown, as it only really adds an extra compare/branch and some dead code. The differences seemed higher than I expected in places, though - around 25% for that function in isolation on some CPUs. That might be quite noisy, though, as some were a lot closer to 0.

Hm, this is less good news.

I took a bit of a look here, and noticed something interesting. The LLVM IR after optimization for the stride != 1 path is nearly identical to the prior version. However, the runtime check in the assembly appears to have tripped some kind of hoisting optimization in the backend, and as a result, the assembly for the stride != 1 path is a bit different.

I can see a couple ways of tackling this:

Option 1 - Be more restrictive on stride == 1 speculation. I'd meant to do this anyway, but was expecting that to be the piece that had interesting perf swings.

Option 2 - Investigate the hoisting bit. (I don't really have the context to do this.)

> I might also be able to provide another example of an assert this is still hitting, with Assertion `Instance.Lane.isFirstLane() && "cannot get lane > 0 for scalar"' failed. It is reducing now.

Oops, yeah, definitely looking forward to that. I'd assumed all the reports were the same issue, since the assert seemed common.

dmgreen added a comment.

> I took a bit of a look here, and noticed something interesting. The LLVM IR after optimization for the stride != 1 path is nearly identical to the prior version. However, the runtime check in the assembly appears to have tripped some kind of hoisting optimization in the backend, and as a result, the assembly for the stride != 1 path is a bit different.
>
> I can see a couple ways of tackling this:
>
> Option 1 - Be more restrictive on stride == 1 speculation. I'd meant to do this anyway, but was expecting that to be the piece that had interesting perf swings.
>
> Option 2 - Investigate the hoisting bit. (I don't really have the context to do this.)

I was looking at another profile that ran with different architecture features and with LTO, where the differences were more pronounced. The main body was vectorized with scalable vectors and the remainder was no longer unrolled. That might have been why the differences in performance looked higher than I expected, but that function is relatively hot and called many times, so even just the extra compares can slow things down a bit.

Do you think it would be possible to come up with a heuristic to prevent the stride-1 speculation in this case, at least? The overlapping load in https://godbolt.org/z/ToT4W74Yf, with a stride of 1 in the same loop at the point of vectorization, looks like something that is unlikely to be helpful in many cases.

(I have looked into single-stride speculation before in https://reviews.llvm.org/D71919, where I was running into places where it is not profitable compared to gather/scatter. From what I remember, there were a few cases in the llvm-test-suite where it was helping, and I didn't push that patch forward as I had no strong motivating example. Base AArch64 doesn't have gather/scatter, so this case is a little different.)

reames added a comment.

Ok, it's pretty clear the scope of this is much, much broader than I'd realized. I'm going to change tack here and come at this from a different angle. I'm going to re-frame this patch as generalizing IVDescriptors, and add an off-by-default option in LV to allow vectorization of such IVs. This keeps the status quo in LV while getting the IVDescriptor code landed.

I do not intend for this flag to be supported long term. I need it to write a test corpus that covers the various optimization-quality issues reported above. Doing it this way also allows me to separate the second functional issue into its own review - I've dug into that one a bit, and it's looking a bit annoying.

My thinking now is that we need to work through result quality for strided IVs (i.e. using gathers or strided accesses where available) before returning to this. I'm hoping to be able to stage these as individual commits.

reames updated this revision to Diff 510555.Apr 3 2023, 10:54 AM
reames edited the summary of this revision. (Show Details)
dmgreen accepted this revision.Apr 4 2023, 7:33 AM

Sounds good. LGTM

Like I said, many of the changes were improvements. There is likely some decent gain we can get from enabling it fully, if we can work through the problem cases.

This revision is now accepted and ready to land.Apr 4 2023, 7:33 AM
fhahn accepted this revision.Apr 4 2023, 9:37 AM

LGTM, having this behind an option to start with seems like a good idea.

This revision was landed with ongoing or failed builds.Apr 5 2023, 9:32 AM
This revision was automatically updated to reflect the committed changes.