This is an archive of the discontinued LLVM Phabricator instance.

[LV] Choose the wider VF where they have same cost
AbandonedPublic

Authored by Allen on Aug 28 2023, 7:17 AM.

Details

Summary

sometimes, different VF will get same cost, and prefer to
the wider VF to improve the parallelism degree

Fixes https://github.com/llvm/llvm-project/issues/64986

Diff Detail

Unit TestsFailed

Event Timeline

Allen created this revision.Aug 28 2023, 7:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 28 2023, 7:17 AM
Allen requested review of this revision.Aug 28 2023, 7:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 28 2023, 7:17 AM

Hi @Allen,
I recently submitted D157628, which lowers the cost of extends when they can be folded into a urhadd or srhadd instruction.
The tests I added are similar the one in this patch, so I was wondering if D157628 may have fixed the same issue as your changes?

Hi @Allen,
I recently submitted D157628, which lowers the cost of extends when they can be folded into a urhadd or srhadd instruction.
The tests I added are similar the one in this patch, so I was wondering if D157628 may have fixed the same issue as your changes?

Thanks your information, I tried your PR and find it only affected the fixed length VF, so it will still prefer the vscale x 8 with your PR.

david-arm added a subscriber: fhahn.

Hi @Allen,
I recently submitted D157628, which lowers the cost of extends when they can be folded into a urhadd or srhadd instruction.
The tests I added are similar the one in this patch, so I was wondering if D157628 may have fixed the same issue as your changes?

Thanks your information, I tried your PR and find it only affected the fixed length VF, so it will still prefer the vscale x 8 with your PR.

Adding @fhahn as a reviewer.

Which CPU are you targeting and how did you build your example? I believe that @kmclaughlin used D157628 to show that for certain loops in x264 when tail-folding we choose a higher VF for some SVE2-enabled CPUs due to the lower cost of the zext and sext instructions. Regardless of that, I'm still a bit worried by this patch because I believe it is a very significant change that will affect all targets across a wide range of CPUs. I'm not saying this change is wrong, but can you describe in the commit message what benchmarks you have run and for what targets?

Also, it's quite possible that the cost model for the loop you're interested in needs improving further. For example, what if you changed the cost of the trunc in the loop to zero - would that also solve your problem? I think @kmclaughlin's patch only changed the costs of the zext and sext instructions.

Allen abandoned this revision.Aug 31 2023, 2:33 AM

Oh, sorry. I missed to use -mattr=+sve to test this case, it should be -mattr=+sve2, so it only change the fix length type vector.

opt -S -mtriple=aarch64-unknown-linux-gnu -mattr=+sve -passes=loop-vectorize -debug-only=loop-vectorize  pr64986.ll