i64 mul cost is 1cy for all cpu that support avx512. Currently
all X86 cpu uses i64 mul cost in X64 cost table which is not
true for cpu that support avx512 (skx, icx).
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
which is not true for most recent cpus.
It seems Zen1's TP is still 2 in some cases. See Agner Fog's table and https://uops.info/table.html?search=mul%2064&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKL=on&cb_ZENp=on&cb_ZEN2=on&cb_measurements=on&cb_doc=on&cb_base=on
Another problem, do we need to add check for 64Bit?
Ack, this is the tragedy of these cost tables, they have to represent the worst-case, even if it is bogusly higher than the average.
Another problem, do we need to add check for 64Bit?
i64 mul cost is 1cy for most cpu that support avx. Currently all X86 cpu uses i64 mul cost in X64 cost table represents a worst case which is not true for most recent cpus.
As others have already said, this isn't true for "most cpu" - all AMD cpus double pump for i64 (and Jaguar + Bulldozer families can be even slower depending on the exact values being multiplied), and the cost tables provide worst case costs.
This is a classic example of where having cost tables driven from (accurate) scheduler models would be much more useful (D46276), but there has been very little interest in helping polish those models to make that a viable option.
This is a classic example of where having cost tables driven from (accurate) scheduler models would be much more useful (D46276), but there has been very little interest in helping polish those models to make that a viable option.
Thanks for sharing this. I think using scheduler models based cost tables is more accurate and reasonable. Is there any plan to commit D46276?
Do we have any plan to use double as cost value type since many instruction's cost is less than 1 on modern superscalar CPU?
I can investigate resurrecting this - but there was a number of issues that needed 'yak shaving' first IIRC (it's been a while I'll need to review it again) - and the models have to be opt-in (I was using the complete flag but that might be suitable).
Do we have any plan to use double as cost value type since many instruction's cost is less than 1 on modern superscalar CPU?
We have the InstructionCost struct now so that might be possible? A more critical feature would be to 'add' costs correctly for throughput cost kinds - depending on pipe usage etc. - instead of just numerically adding the costs together, which only makes sense for latency / size cost kinds.
Can we first fix imul i64 cost for avx512 before merging D46276? It's 1 cy for all avx512 target.
A more critical feature would be to 'add' costs correctly for throughput cost kinds - depending on pipe usage etc. - instead of just numerically adding the costs together, which only makes sense for latency / size cost kinds.
We may need to fix resource_cycles for many x86 schedmodels (e.g. SKL, SKX) to get a correct throughput. Most instructions set resource_cycles to 1 for each uop which gets wrong throughput.
Another problem, do we need to add check for 64Bit?
TLI->getTypeLegalizationCost(DL, Ty) for i64 is i32. Therefore mul i64, i64 on 32bit target will fall to BaseT::getArithmeticInstrCost(Opcode, Ty, CostKind, Op1Info, Op2Info)
opt --analyze -cost-model fuzz.ll -mcpu=skylake-avx512 -mtriple=i386-- Cost Model: Found an estimated cost of 2 for instruction: %result = mul i64 %a0, %a1