This is an archive of the discontinued LLVM Phabricator instance.

[CostModel] getCFInstrCost
ClosedPublic

Authored by samparker on Apr 30 2020, 2:57 AM.

Details

Summary

Have BasicTTI call the base implementation so that both agree on the default behaviour, which the default being a cost of '1'. This has required an X86 specific implementation as it seems to be very reliant on those instructions being free. Changes are also made to AMDGPU so that their implementations distinguish between cost kinds, so that the unrolling isn't affected. The cost model changes now reflect that ret instructions are not generally free.

Diff Detail

Event Timeline

samparker created this revision.Apr 30 2020, 2:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 30 2020, 2:57 AM
samparker updated this revision to Diff 262074.May 5 2020, 5:22 AM
samparker added a reviewer: jsji.
samparker added a subscriber: jsji.

Fixed some tests that needed asserts... which meant passing the CostKind from the vectorizer too special handling Phis in the base implementation. The important test is in PowerPC where the vector factor is now the same for PPC8 and PPC9. @jsji would you mind taking a look?

samparker updated this revision to Diff 263641.May 13 2020, 1:41 AM

Implemented getCFInstrCost in the PowerPC backend to remove the vectorizer test changes.

samparker updated this revision to Diff 263648.May 13 2020, 2:02 AM

Added Br and Ret into getUserCost switch so that getCFInstr should now be used by all cost kinds and API calls.

Fixed some tests that needed asserts... which meant passing the CostKind from the vectorizer too special handling Phis in the base implementation. The important test is in PowerPC where the vector factor is now the same for PPC8 and PPC9. @jsji would you mind taking a look?

Sorry, this was buried in my notifications so got ignored. I saw you have updated PowerPC part so there is no difference here now. Thanks.

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
445

Looks like this is more aggressive than most target-specific implementation?
CostKind != TTI::TCK_RecipThroughput is more than if (CostKind == TTI::TCK_CodeSize || CostKind == TTI::TCK_SizeAndLatency)

Is this intended?

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
761

Any specific reason we would like to force 0 for PowerPC?

AMDGPU part LGTM

samparker marked 2 inline comments as done.May 26 2020, 2:02 AM
samparker added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
445

CodeSize, Latency and SizeAndLatency would generally go through getUserCost first, in which it treats Phis as free, and the backends don't appear to care about Latency. So hopefully this is modelling the previous logic.

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
761

This is because, as this wasn't implemented before, it would have used BasicTTI and returned 0. But I'll check whether I can use the same logic as I've added for X86.

samparker updated this revision to Diff 266130.May 26 2020, 2:14 AM

Now aligned PPC and X86 implementations with the base implementation, so that they're checking != RecipThroughput.

dfukalov added inline comments.Jun 9 2020, 5:36 PM
llvm/test/Analysis/CostModel/ARM/cast.ll
3920–3921

Is there actual need to specify align here? It is not changed in many other lines..
The same question for a number of bunches below

llvm/test/Analysis/CostModel/ARM/mve-gather-scatter-cost.ll
158–161

Do wen need theese extra lines? Is their number connected to lines removed before?-)
The same strange things below

llvm/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
10–15

The comment for the test explicitly says we need cost of 4 here.

llvm/test/Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll
11

It is not clear why these costs (with same changes below) are increased

samparker marked 4 inline comments as done.Jun 10 2020, 12:50 AM

Thanks for taking another look.

llvm/test/Analysis/CostModel/ARM/cast.ll
3920–3921

I'm not sure how these got here, there has been ongoing alignment work so maybe the default alignment is being added when it's missing. I'll rebase this patch and check it out.

llvm/test/Analysis/CostModel/ARM/mve-gather-scatter-cost.ll
158–161

Yeah I will remove these. I think the test file was already formatted oddly which the script has then just added to the weirdness.

llvm/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
10–15

I'm not really sure what causes this, but it at least shouldn't stop the vectorizer from performing the scalarization.

llvm/test/Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll
11

The cost increases by two because the phi and the br are no longer classed as free for throughput.

samparker updated this revision to Diff 269800.Jun 10 2020, 5:13 AM

Rebased and cleaned up the test changes.

spatel added inline comments.Jun 10 2020, 5:19 AM
llvm/test/Analysis/CostModel/ARM/cast.ll
3920–3921

That was probably D78403 - re-running the CHECK line script and rebasing should remove that noise from this patch.

dfukalov accepted this revision.Jun 12 2020, 8:10 AM

LGTM, although minor change in test' comment

llvm/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll
10–15

Well, so the comment should be patched too?

This revision is now accepted and ready to land.Jun 12 2020, 8:10 AM

Actually it seems all of these dozens of "CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: ret i32" may be removed from most of tests not related to getCFInstrCost

This revision was automatically updated to reflect the committed changes.

Hi Sam,
Linaro benchmarking CI flagged this patch as it regresses SPEC2k6's 462.libquantum by 10% at -O2, -O3 and "-O3 -flto" for aarch64-linux-gnu. Impact on other benchmarks is below 2%, which is within noise.

For -O2 profile changes are:
<benchmark, exe/symbol>, %time, %size, time1, time2, size1, size2
462.libquantum,libquantum_base.default ,111,101,9859,10962,33140,33428
462.libquantum,[.] quantum_toffoli ,113,125,5616,6360,320,400
462.libquantum,[.] quantum_sigma_x ,100,100,2498,2499,224,224
462.libquantum,[.] quantum_cnot ,130,135,1150,1495,288,388

After this patch quantum_toffoli is 13% slower and 25% bigger, and quantum_cnot is 30% slower and 35% bigger.

Would you please investigate this? Such big performance and code size regressions might expose a corner-case in your patch.

lenary removed a subscriber: lenary.Jul 2 2020, 9:50 AM
fhahn added a subscriber: fhahn.Jul 2 2020, 9:52 AM

It seems like this introduces a 9% performance regression for libquantum from SPEC2006 on AArch64 with -O3 LTO & PGO. I think AArch64 uses the implementation from TargetTransformInfoImpl.h, which overestimates the cost of branches on bigger cores.

Hi Sam,
Linaro benchmarking CI flagged this patch as it regresses SPEC2k6's 462.libquantum by 10% at -O2, -O3 and "-O3 -flto" for aarch64-linux-gnu. Impact on other benchmarks is below 2%, which is within noise.

For -O2 profile changes are:
<benchmark, exe/symbol>, %time, %size, time1, time2, size1, size2
462.libquantum,libquantum_base.default ,111,101,9859,10962,33140,33428
462.libquantum,[.] quantum_toffoli ,113,125,5616,6360,320,400
462.libquantum,[.] quantum_sigma_x ,100,100,2498,2499,224,224
462.libquantum,[.] quantum_cnot ,130,135,1150,1495,288,388

After this patch quantum_toffoli is 13% slower and 25% bigger, and quantum_cnot is 30% slower and 35% bigger.

Would you please investigate this? Such big performance and code size regressions might expose a corner-case in your patch.

This should be fixed by 1ccc49924aeb

I suspect that the getCFInstr maybe used in an odd way by the vectorizer for these changes to make such a difference. I'm suspect that the vectorizer is multiplying the cost of it, where 0 or 1 would make quite a difference, but where it just shouldn't be... hopefully the cost of a branch and phi doesn't change that much after vectorization / scalarization.