The previous Alderlake P-Core model preferred data from uops.info over the Intel documentation. Some latencies measured by uops.info are larger than the real latency; e.g., addpd latency is 3 cycles on uops.info but 2 cycles in the Intel doc. This patch adjusts the priority of the two data sources so that the Intel doc is preferred.
Repository: rG LLVM Github Monorepo
Event Timeline
Yes. Only the alderlake-p.td template changed: one line was added to define an additional ProcResGroup.
The diff in this patch comes from reordering the stages of the generating command (the sketch after the two command listings illustrates why the stage order matters), from:
```
llvm-tblgen llvm/lib/Target/X86/X86.td -I llvm/include -I llvm/lib/Target/X86 --gen-x86-inst-sched-info -o inst-sched-info.json
add_xed_info.py --xed [xed_path]/xed --jf inst-sched-info.json |
  add_uops_uopsinfo.py --inst-xml instructions.xml --arch-name=ADL-P |
  add_adl_p_uopsinfo.py --adl-p-json tpt_lat-glc-client.json |
  add_smv_uopsinfo.py --ref-cpu=skylake --target-cpu=alderlake-p -o d.json
smg gen --target-cpu=alderlake-p d.json -o d.td
```
to:
```
llvm-tblgen llvm/lib/Target/X86/X86.td -I llvm/include -I llvm/lib/Target/X86 --gen-x86-inst-sched-info -o inst-sched-info.json
add_xed_info.py --xed [xed_path]/xed --jf inst-sched-info.json |
  add_adl_p_uopsinfo.py --adl-p-json tpt_lat-glc-client.json |
  add_uops_uopsinfo.py --inst-xml instructions.xml --arch-name=ADL-P |
  add_smv_uopsinfo.py --ref-cpu=skylake --target-cpu=alderlake-p -o d.json
smg gen --target-cpu=alderlake-p d.json -o d.td
```
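To show why the stage order matters, here is a minimal Python sketch under one assumption (the real add_*_uopsinfo.py scripts may merge differently): each pipeline stage only fills in fields that earlier stages have not already set, so running the Intel-doc stage (add_adl_p_uopsinfo.py) before the uops.info stage (add_uops_uopsinfo.py) makes the Intel doc win for instructions like addpd.

```python
# Minimal sketch, assuming "first writer wins" per field; the actual
# add_*_uopsinfo.py merge logic is not shown in this patch.
intel_doc = {"ADDPD": {"latency": 2}}  # from tpt_lat-glc-client.json
uops_info = {"ADDPD": {"latency": 3}}  # from instructions.xml

def merge(sources):
    merged = {}
    for source in sources:
        for inst, info in source.items():
            entry = merged.setdefault(inst, {})
            for field, value in info.items():
                entry.setdefault(field, value)  # first writer wins
    return merged

print(merge([uops_info, intel_doc])["ADDPD"]["latency"])  # old order: 3
print(merge([intel_doc, uops_info])["ADDPD"]["latency"])  # new order: 2
```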
Shouldn't such issues first be fixed on the uops.info site? How can you be sure it is not just a typo in the Intel docs, which have a lot of them?
The worst part: uops.info (and https://uica.uops.info/ ) do not insert numbers manually. They use automated approaches, which means (if this is real and not an Intel typo) bugs in their instruments. Moreover, the energy used and the number of transistors/analog devices used and flipped may once again be different, though that is Intel-PCH-Minix-root-exploit-level information. See also this ingenious article for the next level of such hacks: https://blog.can.ac/2021/03/22/speculating-x86-64-isa-with-one-weird-trick/
AFAIK, vadd latency should be 2 cycles, not the 3 cycles on the uops.info site. I guess the data in the Intel doc was not measured the same way as on the uops.info site.
Do you know which data in the Intel doc is wrong? We can provide feedback to fix it.
I know that Intel fixes typos in its docs every couple of months; there is even a comment section for that. So, no, I don't know of any typos right now, or I would have reported them. uops.info should still be fixed.
According to the optimization manual, back-to-back ADD latency is 2 cycles; what uops.info tested was not back-to-back ADD latency (see the toy sketch after the quote):
- Back-to-back ADD/SUB operations that are both executed on the Fast Adder unit perform the operations in two cycles.
- In 128/256-bit, back-to-back ADD/SUB operations executed on the Fast Adder unit perform the operations in two cycles.
- In 512-bit, back-to-back ADD/SUB operations are executed in two cycles if both operations use the Fast Adder unit on port 5.

The following instructions are executed by the Fast Adder unit:
- (V)ADDSUBSS/SD/PS/PD
- (V)ADDSS/SD/PS/PD
- (V)SUBSS/SD/PS/PD
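To make the distinction concrete, here is a toy Python sketch of per-link chain latency. It is an illustration only: the 2-cycle and 3-cycle numbers come from the quote above, the op set is assumed, and this is not how uops.info actually measures. A back-to-back chain of Fast Adder ops infers a 2-cycle latency, while a chain that interleaves other producers infers the generic 3 cycles.

```python
# Toy model (assumed numbers, not a measurement tool): total latency of
# a dependency chain, where back-to-back ADD/SUB pairs on the Fast
# Adder unit take 2 cycles and any other producer->consumer link takes
# a generic 3 cycles.
FAST_ADDER_OPS = {"addpd", "addps", "subpd", "subps", "addsubpd"}

def chain_latency(ops):
    total, prev = 0, None
    for op in ops:
        if op in FAST_ADDER_OPS and prev in FAST_ADDER_OPS:
            total += 2  # back-to-back Fast Adder bypass
        else:
            total += 3  # generic FP latency
        prev = op
    return total

print(chain_latency(["addpd"] * 4))           # 3 + 2 + 2 + 2 = 9
print(chain_latency(["mulpd", "addpd"] * 2))  # every link sees 3 cycles
```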
2c latency in just a special case isn't really 2c latency. That's like saying HSW has 4c L1D latency because of the weird pointer-chasing case.
I only tested back-to-back, so I can't say, but if it's 3c in other cases (as measured by uops.info), maybe we should change it back?
This might be necessary. I hit something similar in https://github.com/llvm/llvm-project/issues/61002, where <2 x float> fdiv operations ended up trashing performance because of the strange float bits in the other vector elements.
It'd be an interesting future project to add support to the models for more special cases like back-to-back ops, though (IIRC many CPUs have similar cases for chains of FMAs, etc.); a rough sketch of the idea follows.
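As a rough Python sketch of what such support could look like (a toy, not the real TableGen/MC API, though it loosely mimics LLVM's existing ReadAdvance mechanism, which can already restrict an operand-read advance to a list of producing write classes; the write-class names below are borrowed from X86Schedule.td for flavor):

```python
# Toy sketch (assumed structure, not the actual LLVM scheduler API) of
# a producer-dependent read advance: the consumer's operand read starts
# one cycle early only when the producer is in a given write class.
BASE_FADD_LATENCY = 3
FAST_ADDER_WRITES = {"WriteFAdd"}  # writes eligible for the bypass

def effective_latency(producer_write: str) -> int:
    """Latency seen by an FP ADD consuming the producer's result."""
    if producer_write in FAST_ADDER_WRITES:
        return BASE_FADD_LATENCY - 1  # back-to-back bypass: 2 cycles
    return BASE_FADD_LATENCY

print(effective_latency("WriteFAdd"))  # 2: ADD feeding ADD
print(effective_latency("WriteFMul"))  # 3: MUL feeding ADD
```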
I don't like the idea of blindly taking the uops.info (or Agner, instlatx64, or Intel/AMD) numbers as ground truth, though; we need to be careful about automating scheduler model generation, even though manual reviews/edits are very tedious...