Have BasicTTI call the base implementation so that both agree on the default behaviour, with the default being a cost of '1'. This required an X86-specific implementation, as that target seems to rely heavily on these instructions being free. Changes are also made to AMDGPU so that its implementations distinguish between cost kinds, so that unrolling isn't affected. The cost model changes now reflect that ret instructions are not generally free.
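To make the intent concrete, here is a minimal self-contained sketch of the behaviour described above (stand-in names and simplified signatures, not the actual TargetTransformInfo interfaces): the base default charges 1 for control-flow instructions, BasicTTI now forwards to that default instead of returning 0, and X86 keeps them free.

```cpp
// Sketch only; these are simplified stand-ins, not the real LLVM classes.
enum class CostKind { RecipThroughput, CodeSize, Latency, SizeAndLatency };

// Base/default behaviour that BasicTTI now agrees with: a cost of 1.
unsigned baseCFInstrCost(CostKind /*Kind*/) { return 1; }

// BasicTTI: forward to the base default rather than returning 0.
unsigned basicCFInstrCost(CostKind Kind) { return baseCFInstrCost(Kind); }

// X86: existing heuristics rely on branches/phis being free, so keep 0.
unsigned x86CFInstrCost(CostKind /*Kind*/) { return 0; }
```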
Sorry, this was buried in my notifications so it got overlooked. I see you have updated the PowerPC part, so there is no difference here now. Thanks.
Looks like this is more aggressive than most target-specific implementations?
Is this intended?
Any specific reason we would like to force 0 for PowerPC?
CodeSize, Latency and SizeAndLatency would generally go through getUserCost first, which treats phis as free, and the backends don't appear to care about Latency. So hopefully this models the previous logic.
This is because, as this wasn't implemented before, it would have used BasicTTI and returned 0. But I'll check whether I can use the same logic as I've added for X86.
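For illustration, a hedged sketch of what distinguishing cost kinds could look like (reusing the stand-in names from the sketch above; not the actual AMDGPU or X86 code): non-throughput kinds keep control flow free so size-based unrolling heuristics are unaffected, while throughput picks up the new default of 1.

```cpp
// Sketch only, building on the stand-in CostKind/baseCFInstrCost above.
unsigned costKindAwareCFInstrCost(CostKind Kind) {
  // CodeSize, Latency and SizeAndLatency previously went through
  // getUserCost and treated phis/branches as free; keep that behaviour.
  if (Kind != CostKind::RecipThroughput)
    return 0;
  // Throughput now charges the base default of 1.
  return baseCFInstrCost(Kind);
}
```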
Is there an actual need to specify the alignment here? It isn't changed on many other lines.
Do we need these extra lines? Is their number connected to the lines removed before? :-)
The comment for the test explicitly says we need a cost of 4 here.
It is not clear why these costs (and the same changes below) are increased.
Thanks for taking another look.
I'm not sure how these got here; there has been ongoing alignment work, so maybe the default alignment is being added where it's missing. I'll rebase this patch and check it out.
Yeah, I will remove these. I think the test file was already formatted oddly and the script has just added to the weirdness.
I'm not really sure what causes this, but it at least shouldn't stop the vectorizer from performing the scalarization.
The cost increases by two because the phi and the br are no longer classed as free for throughput.
Linaro benchmarking CI flagged this patch as it regresses SPEC2k6's 462.libquantum by 10% at -O2, -O3 and "-O3 -flto" for aarch64-linux-gnu. Impact on other benchmarks is below 2%, which is within noise.
For -O2, the profile changes are:
benchmark       exe/symbol            %time  %size  time1  time2  size1  size2
462.libquantum  [.] quantum_toffoli     113    125   5616   6360    320    400
462.libquantum  [.] quantum_sigma_x     100    100   2498   2499    224    224
462.libquantum  [.] quantum_cnot        130    135   1150   1495    288    388
After this patch quantum_toffoli is 13% slower and 25% bigger, and quantum_cnot is 30% slower and 35% bigger.
Would you please investigate this? Such big performance and code size regressions might expose a corner-case in your patch.
I suspect that getCFInstrCost may be used in an odd way by the vectorizer for these changes to make such a difference. I suspect that the vectorizer is multiplying its cost, where 0 versus 1 would make quite a difference, but it just shouldn't be... hopefully the cost of a branch and a phi doesn't change that much after vectorization / scalarization.
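To illustrate why a 0-to-1 change could swing things this much, here is a purely hypothetical bit of arithmetic (an assumed model, not the actual LoopVectorize cost code): if scalarized control flow is charged once per lane, the delta gets multiplied by the vectorization factor.

```cpp
#include <cstdio>

int main() {
  const unsigned VF = 4;       // hypothetical vectorization factor
  const unsigned OldCost = 0;  // phi/br previously free
  const unsigned NewCost = 1;  // phi/br after this patch
  // Assume one phi and one br charged per scalarized lane.
  std::printf("extra cost before: %u, after: %u\n",
              2 * OldCost * VF, 2 * NewCost * VF);
}
```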