Have BasicTTI call the base implementation so that both agree on the default behaviour, which the default being a cost of '1'. This has required an X86 specific implementation as it seems to be very reliant on those instructions being free. Changes are also made to AMDGPU so that their implementations distinguish between cost kinds, so that the unrolling isn't affected. The cost model changes now reflect that ret instructions are not generally free.
Details
Diff Detail
Event Timeline
Fixed some tests that needed asserts... which meant passing the CostKind from the vectorizer too special handling Phis in the base implementation. The important test is in PowerPC where the vector factor is now the same for PPC8 and PPC9. @jsji would you mind taking a look?
Implemented getCFInstrCost in the PowerPC backend to remove the vectorizer test changes.
Added Br and Ret into getUserCost switch so that getCFInstr should now be used by all cost kinds and API calls.
Sorry, this was buried in my notifications so got ignored. I saw you have updated PowerPC part so there is no difference here now. Thanks.
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h | ||
---|---|---|
445 | Looks like this is more aggressive than most target-specific implementation? Is this intended? | |
llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp | ||
761 | Any specific reason we would like to force 0 for PowerPC? |
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h | ||
---|---|---|
445 | CodeSize, Latency and SizeAndLatency would generally go through getUserCost first, in which it treats Phis as free, and the backends don't appear to care about Latency. So hopefully this is modelling the previous logic. | |
llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp | ||
761 | This is because, as this wasn't implemented before, it would have used BasicTTI and returned 0. But I'll check whether I can use the same logic as I've added for X86. |
Now aligned PPC and X86 implementations with the base implementation, so that they're checking != RecipThroughput.
llvm/test/Analysis/CostModel/ARM/cast.ll | ||
---|---|---|
1721–1729 | Is there actual need to specify align here? It is not changed in many other lines.. | |
llvm/test/Analysis/CostModel/ARM/mve-gather-scatter-cost.ll | ||
158–161 | Do wen need theese extra lines? Is their number connected to lines removed before?-) | |
llvm/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll | ||
10–15 | The comment for the test explicitly says we need cost of 4 here. | |
llvm/test/Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll | ||
11 | It is not clear why these costs (with same changes below) are increased |
Thanks for taking another look.
llvm/test/Analysis/CostModel/ARM/cast.ll | ||
---|---|---|
1721–1729 | I'm not sure how these got here, there has been ongoing alignment work so maybe the default alignment is being added when it's missing. I'll rebase this patch and check it out. | |
llvm/test/Analysis/CostModel/ARM/mve-gather-scatter-cost.ll | ||
158–161 | Yeah I will remove these. I think the test file was already formatted oddly which the script has then just added to the weirdness. | |
llvm/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll | ||
10–15 | I'm not really sure what causes this, but it at least shouldn't stop the vectorizer from performing the scalarization. | |
llvm/test/Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll | ||
11 | The cost increases by two because the phi and the br are no longer classed as free for throughput. |
LGTM, although minor change in test' comment
llvm/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll | ||
---|---|---|
10–15 | Well, so the comment should be patched too? |
Actually it seems all of these dozens of "CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret i32" may be removed from most of tests not related to getCFInstrCost
Hi Sam,
Linaro benchmarking CI flagged this patch as it regresses SPEC2k6's 462.libquantum by 10% at -O2, -O3 and "-O3 -flto" for aarch64-linux-gnu. Impact on other benchmarks is below 2%, which is within noise.
For -O2 profile changes are:
<benchmark, exe/symbol>, %time, %size, time1, time2, size1, size2
462.libquantum,libquantum_base.default ,111,101,9859,10962,33140,33428
462.libquantum,[.] quantum_toffoli ,113,125,5616,6360,320,400
462.libquantum,[.] quantum_sigma_x ,100,100,2498,2499,224,224
462.libquantum,[.] quantum_cnot ,130,135,1150,1495,288,388
After this patch quantum_toffoli is 13% slower and 25% bigger, and quantum_cnot is 30% slower and 35% bigger.
Would you please investigate this? Such big performance and code size regressions might expose a corner-case in your patch.
It seems like this introduces a 9% performance regression for libquantum from SPEC2006 on AArch64 with -O3 LTO & PGO. I think AArch64 uses the implementation from TargetTransformInfoImpl.h, which overestimates the cost of branches on bigger cores.
This should be fixed by 1ccc49924aeb
I suspect that the getCFInstr maybe used in an odd way by the vectorizer for these changes to make such a difference. I'm suspect that the vectorizer is multiplying the cost of it, where 0 or 1 would make quite a difference, but where it just shouldn't be... hopefully the cost of a branch and phi doesn't change that much after vectorization / scalarization.
Looks like this is more aggressive than most target-specific implementation?
CostKind != TTI::TCK_RecipThroughput is more than if (CostKind == TTI::TCK_CodeSize || CostKind == TTI::TCK_SizeAndLatency)
Is this intended?