sometimes, different VF will get same cost, and prefer to
the wider VF to improve the parallelism degree
Details
Diff Detail
Unit Tests
Time | Test | |
---|---|---|
60,060 ms | x64 debian > MLIR.Examples/standalone::test.toy |
Event Timeline
Thanks your information, I tried your PR and find it only affected the fixed length VF, so it will still prefer the vscale x 8 with your PR.
Adding @fhahn as a reviewer.
Which CPU are you targeting and how did you build your example? I believe that @kmclaughlin used D157628 to show that for certain loops in x264 when tail-folding we choose a higher VF for some SVE2-enabled CPUs due to the lower cost of the zext and sext instructions. Regardless of that, I'm still a bit worried by this patch because I believe it is a very significant change that will affect all targets across a wide range of CPUs. I'm not saying this change is wrong, but can you describe in the commit message what benchmarks you have run and for what targets?
Also, it's quite possible that the cost model for the loop you're interested in needs improving further. For example, what if you changed the cost of the trunc in the loop to zero - would that also solve your problem? I think @kmclaughlin's patch only changed the costs of the zext and sext instructions.
Oh, sorry. I missed to use -mattr=+sve to test this case, it should be -mattr=+sve2, so it only change the fix length type vector.
opt -S -mtriple=aarch64-unknown-linux-gnu -mattr=+sve -passes=loop-vectorize -debug-only=loop-vectorize pr64986.ll