This is an archive of the discontinued LLVM Phabricator instance.

[CostModel][X86] Add i64 mul cost for avx512 as 1cy
ClosedPublic

Authored by HaohaiWen on Dec 2 2021, 5:55 PM.

Details

Summary

i64 mul cost is 1cy for all cpu that support avx512. Currently
all X86 cpu uses i64 mul cost in X64 cost table which is not
true for cpu that support avx512 (skx, icx).

Diff Detail

Event Timeline

HaohaiWen created this revision.Dec 2 2021, 5:55 PM
HaohaiWen requested review of this revision.Dec 2 2021, 5:55 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 2 2021, 5:55 PM
HaohaiWen edited the summary of this revision. (Show Details)Dec 2 2021, 6:01 PM
HaohaiWen added reviewers: pengfei, LuoYuanke.

which is not true for most recent cpus.

It seems Zen1's TP is still 2 in some cases. See Agner Fog's table and https://uops.info/table.html?search=mul%2064&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKL=on&cb_ZENp=on&cb_ZEN2=on&cb_measurements=on&cb_doc=on&cb_base=on

Another problem, do we need to add check for 64Bit?

which is not true for most recent cpus.

It seems Zen1's TP is still 2 in some cases. See Agner Fog's table and https://uops.info/table.html?search=mul%2064&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKL=on&cb_ZENp=on&cb_ZEN2=on&cb_measurements=on&cb_doc=on&cb_base=on

Ack, this is the tragedy of these cost tables, they have to represent the worst-case, even if it is bogusly higher than the average.

Another problem, do we need to add check for 64Bit?

i64 mul cost is 1cy for most cpu that support avx. Currently all X86 cpu uses i64 mul cost in X64 cost table represents a worst case which is not true for most recent cpus.

As others have already said, this isn't true for "most cpu" - all AMD cpus double pump for i64 (and Jaguar + Bulldozer families can be even slower depending on the exact values being multiplied), and the cost tables provide worst case costs.

This is a classic example of where having cost tables driven from (accurate) scheduler models would be much more useful (D46276), but there has been very little interest in helping polish those models to make that a viable option.

This is a classic example of where having cost tables driven from (accurate) scheduler models would be much more useful (D46276), but there has been very little interest in helping polish those models to make that a viable option.

Thanks for sharing this. I think using scheduler models based cost tables is more accurate and reasonable. Is there any plan to commit D46276?
Do we have any plan to use double as cost value type since many instruction's cost is less than 1 on modern superscalar CPU?

This is a classic example of where having cost tables driven from (accurate) scheduler models would be much more useful (D46276), but there has been very little interest in helping polish those models to make that a viable option.

Thanks for sharing this. I think using scheduler models based cost tables is more accurate and reasonable. Is there any plan to commit D46276?

I can investigate resurrecting this - but there was a number of issues that needed 'yak shaving' first IIRC (it's been a while I'll need to review it again) - and the models have to be opt-in (I was using the complete flag but that might be suitable).

Do we have any plan to use double as cost value type since many instruction's cost is less than 1 on modern superscalar CPU?

We have the InstructionCost struct now so that might be possible? A more critical feature would be to 'add' costs correctly for throughput cost kinds - depending on pipe usage etc. - instead of just numerically adding the costs together, which only makes sense for latency / size cost kinds.

HaohaiWen updated this revision to Diff 391972.Dec 5 2021, 11:05 PM

Move to avx512 cost table and rebase

Can we first fix imul i64 cost for avx512 before merging D46276? It's 1 cy for all avx512 target.

A more critical feature would be to 'add' costs correctly for throughput cost kinds - depending on pipe usage etc. - instead of just numerically adding the costs together, which only makes sense for latency / size cost kinds.

We may need to fix resource_cycles for many x86 schedmodels (e.g. SKL, SKX) to get a correct throughput. Most instructions set resource_cycles to 1 for each uop which gets wrong throughput.

pengfei accepted this revision.Dec 5 2021, 11:32 PM

LGTM. Please wait one or two days to see if others have comments.

This revision is now accepted and ready to land.Dec 5 2021, 11:32 PM
HaohaiWen added a comment.EditedDec 5 2021, 11:34 PM

Another problem, do we need to add check for 64Bit?

TLI->getTypeLegalizationCost(DL, Ty) for i64 is i32. Therefore mul i64, i64 on 32bit target will fall to BaseT::getArithmeticInstrCost(Opcode, Ty, CostKind, Op1Info, Op2Info)

opt --analyze -cost-model fuzz.ll -mcpu=skylake-avx512  -mtriple=i386--
Cost Model: Found an estimated cost of 2 for instruction:   %result = mul i64 %a0, %a1

Update the title/summary - its AVX512F now, not AVX1

HaohaiWen retitled this revision from [CostModel][X86] Add i64 mul cost for avx as 1cy to [CostModel][X86] Add i64 mul cost for avx512 as 1cy.Dec 6 2021, 6:15 AM
HaohaiWen edited the summary of this revision. (Show Details)
RKSimon accepted this revision.Dec 6 2021, 6:38 AM

LGTM - cheers

This revision was landed with ongoing or failed builds.Dec 7 2021, 7:40 PM
This revision was automatically updated to reflect the committed changes.