This is an archive of the discontinued LLVM Phabricator instance.

[X86] Revise Alderlake P-Core schedule model
ClosedPublic

Authored by HaohaiWen on Feb 20 2023, 5:41 AM.

Details

Summary

The previous Alderlake P-Core model preferred data from uops.info over the Intel doc. Some latencies measured by uops.info are larger than the real latency; e.g. addpd latency is 3 on uops.info but 2 in the Intel doc. This patch adjusts the priority of the two data sources so that the Intel doc is preferred.
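
As a quick sanity check (a minimal sketch, not part of this patch, assuming an LLVM build that includes the updated model), the modeled latency can be inspected with llvm-mca; with the Intel doc taking priority it should report a 2-cycle latency for addpd:

echo 'addpd %xmm1, %xmm0' | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=alderlake -timeline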

Diff Detail

Event Timeline

HaohaiWen created this revision. Feb 20 2023, 5:41 AM
Herald added a project: Restricted Project. · View Herald Transcript
HaohaiWen requested review of this revision. Feb 20 2023, 5:41 AM
Herald added a project: Restricted Project. · View Herald Transcript · Feb 20 2023, 5:41 AM

Was this still generated from the D130897 schedtool scripts?

Was this still generated from the D130897 schedtool scripts?

Yes. Only the alderlake-p.td template changed: one line was added for an additional ProcResGroup.

The diff in this patch comes from reordering the generation pipeline from:

llvm-tblgen llvm/lib/Target/X86/X86.td -I llvm/include -I llvm/lib/Target/X86 --gen-x86-inst-sched-info -o inst-sched-info.json
add_xed_info.py --xed [xed_path]/xed --jf inst-sched-info.json |
add_uops_uopsinfo.py --inst-xml instructions.xml --arch-name=ADL-P |
add_adl_p_uopsinfo.py --adl-p-json tpt_lat-glc-client.json |
add_smv_uopsinfo.py --ref-cpu=skylake --target-cpu=alderlake-p -o d.json
smg gen --target-cpu=alderlake-p d.json -o d.td

to:
llvm-tblgen llvm/lib/Target/X86/X86.td -I llvm/include -I llvm/lib/Target/X86 --gen-x86-inst-sched-info -o inst-sched-info.json
add_xed_info.py --xed [xed_path]/xed --jf inst-sched-info.json |
add_adl_p_uopsinfo.py --adl-p-json tpt_lat-glc-client.json |
add_uops_uopsinfo.py --inst-xml instructions.xml --arch-name=ADL-P |
add_smv_uopsinfo.py --ref-cpu=skylake --target-cpu=alderlake-p -o d.json
smg gen --target-cpu=alderlake-p d.json -o d.td
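
To see the effect of the reordering (a rough sketch; d_old.json/d_new.json and the corresponding .td names are hypothetical outputs of the two pipelines above):

smg gen --target-cpu=alderlake-p d_old.json -o d_old.td
smg gen --target-cpu=alderlake-p d_new.json -o d_new.td
diff -u d_old.td d_new.td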

RKSimon accepted this revision. Feb 26 2023, 5:03 AM

LGTM

This revision is now accepted and ready to land. Feb 26 2023, 5:03 AM
This revision was landed with ongoing or failed builds. Feb 28 2023, 3:39 PM
This revision was automatically updated to reflect the committed changes.

Shouldn't such issues be fixed on the uops.info site first? How can you be sure it is not just a typo in the Intel docs, which have a lot of them?

Shouldn't such issues be fixed on the uops.info site first? How can you be sure it is not just a typo in the Intel docs, which have a lot of them?

+1

Shouldn't such issues be fixed on the uops.info site first? How can you be sure it is not just a typo in the Intel docs, which have a lot of them?

+1

The worst part is that uops.info (and https://uica.uops.info/) do not insert numbers manually, okay? They use automated approaches, so if this is real (and not an Intel typo) it means bugs in their instruments. Moreover, the energy used and the number of transistors/analog devices used and flipped may again differ, though that is Intel PCH OS Minix root-exploit-level information. See also this genius article, and the next level, Minix OS hacks: https://blog.can.ac/2021/03/22/speculating-x86-64-isa-with-one-weird-trick/

Shouldn't such issues be fixed on the uops.info site first? How can you be sure it is not just a typo in the Intel docs, which have a lot of them?

AFAIK, vadd latency should be 2 cycles, not the 3 on the uops.info site. I guess the data in the Intel doc was not measured in the same way as on uops.info.
Do you know which data in the Intel doc is wrong? We can provide feedback to fix it.

I know that Intel fixes typos in its docs every couple of months. There is even a comment section for that. So, nope, I don't know of any typos right now, or I would have reported them. uops.info should still be fixed.

According to the optimization manual, back-to-back ADD latency is 2 cycles. What uops.info tested was not back-to-back ADD latency.

Back-to-back ADD/SUB operations that are both executed on the Fast Adder unit perform the operations
in two cycles.

In 128/256-bit, back-to-back ADD/SUB operations executed on the Fast Adder unit perform the
operations in two cycles.

In 512-bit, back-to-back ADD/SUB operations are executed in two cycles if both operations use the
Fast Adder unit on port 5.
The following instructions are executed by the Fast Adder unit:

(V)ADDSUBSS/SD/PS/PD

(V)ADDSS/SD/PS/PD

(V)SUBSS/SD/PS/PD
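
For reference, a hedged way to reproduce the back-to-back measurement on actual hardware: llvm-exegesis in latency mode builds a serial dependency chain of the same opcode, which should correspond to the back-to-back case the manual describes (this is a sketch, not part of the patch).

# assumes core 0 is a P-core on the hybrid machine
taskset -c 0 llvm-exegesis -mode=latency -opcode-name=ADDPDrr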

This comment was removed by goldstein.w.n.

According to the optimization manual, back-to-back ADD latency is 2 cycles. What uops.info tested was not back-to-back ADD latency.

Back-to-back ADD/SUB operations that are both executed on the Fast Adder unit perform the operations
in two cycles.

In 128/256-bit, back-to-back ADD/SUB operations executed on the Fast Adder unit perform the
operations in two cycles.

In 512-bit, back-to-back ADD/SUB operations are executed in two cycles if both operations use the
Fast Adder unit on port 5.
The following instructions are executed by the Fast Adder unit:

(V)ADDSUBSS/SD/PS/PD

(V)ADDSS/SD/PS/PD

(V)SUBSS/SD/PS/PD

2c latency in just a special case isn't really 2c latency. That's like saying HSW has 4c L1D latency because of the weird pointer-chasing case.

I only tested back-to-back so I can't say, but if it's 3c in other cases (as measured by uops.info), maybe we should change it back?

Should we always use the worst-case latency in the schedule model, like TTI does?

Should we always use the worst-case latency in the schedule model, like TTI does?

This might be necessary - I hit something similar in https://github.com/llvm/llvm-project/issues/61002 where <2 x float> fdiv operations ended up trashing perf because of the strange float bits in the other vector elements.

It'd be an interesting future project to add support to the models for more special cases like back-to-back ops though (IIRC many CPUs have similar cases for chains of FMAs etc.).

I don't like the idea of blindly taking the uops.info (or Agner or instlatx64 or Intel/AMD) numbers as ground truth though - we need to be careful about automating scheduler model generation, even though manual reviews/edits are very tedious....

Matt added a subscriber: Matt. Mar 6 2023, 12:32 PM