This is an archive of the discontinued LLVM Phabricator instance.

Fix SLP score for out of order contiguous loads
ClosedPublic

Authored by alban.bridonneau on Apr 11 2022, 8:02 AM.

Details

Summary

SLP uses the distance between pointers to optimize
the getShallowScore. However the current code misses
the case where we are trying to vectorize for VF=4, and the distance
between pointers is 2. In that case the returned score
reflects the case of contiguous loads, when it's not actually
contiguous.

The attached unit tests have 5 loads, where the program order
is not the same as the offset order in the GEPs. So, the choice
of which 4 loads to bundle together matters. If we pick the
first 4, then we can vectorize with VF=4. If we pick the
last 4, then we can only vectorize with VF=2.

This patch makes a more conservative choice, to consider
all distances>1 to not be a case of contiguous load, and
give those cases a lower score.

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptApr 11 2022, 8:02 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
alban.bridonneau requested review of this revision.Apr 11 2022, 8:02 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 11 2022, 8:02 AM

Try to simplify the tests.

I have simplified the unit tests.
As far as this patch goes, both tests were checking the same
behaviour, so I have only kept one. The complex data structures
and unnecesary attributes have been removed, and the IR has been reordered
to make the patterns clearer. Let me know if further reduction
is needed.

vporpo added inline comments.Apr 12 2022, 6:51 AM
llvm/test/Transforms/SLPVectorizer/AArch64/tsc-s116.ll
18–24

This test should work fine without the loop and the entry and for.cond.cleanup blocks. Please remove these extra blocks and the loop-related code.

26–29

Do you think you could replace these offset calculations with constants? This would make the test a lot smaller.

ABataev added inline comments.Apr 12 2022, 8:36 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1199 ↗(On Diff #422172)

Actually, this code is part of non-power-of-2 vectorization support in SLP, the problem should disappear once it is fully landed.

@vporpo I was just starting to look at removing the loop-related code. That's a good point, i keep thinking about SLP in the context of loops, but it shouldn't be just about loops. The various costs seem completely different, it might require some extra work to reproduce the exact case without the loop control code.

@ABataev , that's interesting. Is that a patch you are working on? I'd be curious to take a look if that's possible.

@ABataev , that's interesting. Is that a patch you are working on? I'd be curious to take a look if that's possible.

D57059 but I did not update it for a while + D105986, which is part of the first one.

mgabka added a subscriber: mgabka.Apr 13 2022, 1:41 AM

Further reduction of the unit test

@ABataev , I had a look at the various patches you pointed. That looks like a proper piece of work! Given how much change is going to happen, i don't think it is worth pushing the fix i am proposing today.

However, I'd like to make sure my test case will get fixed, if possible. I tried your original patch (rebased last July) on the unit test and that failed. Do you think you could incorporate this unit test in your work?

I have reduced the unit test further, there is no more loop control, and the gep indices have been cleaned up. The CHECK lines are also more targeted to say we are looking for a vectorization of 4 (it currently does 2). We can actually see much better now which is the root cause of my issue. The LHS operands of the fmul are what's deciding which is the first group of loads beeing evaluated, and that group does not have 4 contiguous loads.

Thanks for updating the test, it looks much better now :)
I agree, I don't think that this needs to wait for the non-power-of-2 support, unless @ABataev has some objection.

llvm/test/Transforms/SLPVectorizer/AArch64/tsc-s116.ll
8–11

Please use the utils/update_test_checks.py script for generating the checks. It makes it easier to diff test failures when tests contain automatically generated checks.

@ABataev , I had a look at the various patches you pointed. That looks like a proper piece of work! Given how much change is going to happen, i don't think it is worth pushing the fix i am proposing today.

However, I'd like to make sure my test case will get fixed, if possible. I tried your original patch (rebased last July) on the unit test and that failed. Do you think you could incorporate this unit test in your work?

Sure, +1 here. You can add the test separately with the CHECKs for currently emitted code and a FIXME message how it is expected to be.
The more tests we have, the better. Iy would be much easier for me to tune the patch.

I have reduced the unit test further, there is no more loop control, and the gep indices have been cleaned up. The CHECK lines are also more targeted to say we are looking for a vectorization of 4 (it currently does 2). We can actually see much better now which is the root cause of my issue. The LHS operands of the fmul are what's deciding which is the first group of loads beeing evaluated, and that group does not have 4 contiguous loads.

I'll check it, thanks. Follow suggestions from @vporpo to commit the test.

The CHECK lines are now autogenerated.

I also removed the code change, and instead added
a FIXME with details of what that unit test should
be checking.

This revision is now accepted and ready to land.Apr 14 2022, 6:48 AM
This revision was landed with ongoing or failed builds.Apr 19 2022, 3:59 AM
This revision was automatically updated to reflect the committed changes.