Previously this instruction could be used only in the assembler. This change
makes it available to the compiler as well. Scheduling information was copied
from the FTST instruction; hopefully this is a satisfactory approximation.
Details
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
FXAM appears to be two uops where FTST is one on modern Intel CPUs based on Agner Fog's data. Agner's data for some AMD CPUs shows ~20 cycles of latency.
What do you intend to use this for?
Could tuning scheduling for this instruction be subsequent work?
> What do you intend to use this for?
This instruction is a convenient basis for implementing isnan and other classification functions for fp80. Currently these functions are implemented using FUCOM* instructions. Those, however, raise an exception if the argument is a signaling NaN, and such behavior is not suitable for classification functions.
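To make the classification idea concrete, here is a minimal sketch of how the result of FXAM could be decoded. It assumes the x87 status word has already been captured after FXAM (for example with FNSTSW AX) and relies on the condition-code encoding documented for FXAM in the Intel SDM. The names `FxamClass`, `classifyFxamStatus`, `isNaNFromFxam` and `signFromFxam` are hypothetical and are not taken from the patch; this is not the lowering the patch implements.

```cpp
#include <cstdint>

// Classes that FXAM can report for the value in ST(0).
enum class FxamClass { Unsupported, NaN, Normal, Infinity, Zero, Empty, Denormal };

// Positions of the condition-code bits inside the 16-bit x87 status word.
constexpr std::uint16_t C0 = 1u << 8;
constexpr std::uint16_t C1 = 1u << 9;  // FXAM stores the sign of ST(0) here
constexpr std::uint16_t C2 = 1u << 10;
constexpr std::uint16_t C3 = 1u << 14;

// Hypothetical helper: decode the C3/C2/C0 triple left behind by FXAM.
FxamClass classifyFxamStatus(std::uint16_t statusWord) {
  switch (statusWord & (C3 | C2 | C0)) {
  case 0:       return FxamClass::Unsupported;
  case C0:      return FxamClass::NaN;
  case C2:      return FxamClass::Normal;
  case C2 | C0: return FxamClass::Infinity;
  case C3:      return FxamClass::Zero;
  case C3 | C0: return FxamClass::Empty;
  case C3 | C2: return FxamClass::Denormal;
  default:      return FxamClass::Unsupported; // C3|C2|C0 is not a documented result
  }
}

// Hypothetical isnan built on top of the decoder. FXAM only examines the
// value, so a signaling NaN is classified without raising an exception.
bool isNaNFromFxam(std::uint16_t statusWord) {
  return classifyFxamStatus(statusWord) == FxamClass::NaN;
}

// Sign of the examined value, as reported in C1.
bool signFromFxam(std::uint16_t statusWord) {
  return (statusWord & C1) != 0;
}
```

The decoder takes the status word as a plain integer so that it stays portable. On an x87 target the value would come from the FXAM and FNSTSW sequence itself.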
Yes, but that 20-cycle latency on some AMD CPUs is a little concerning. X87 tends to get more and more unoptimized in modern CPUs, and I'm sure that which instructions show up in real code factors into those design decisions. So using an X87 instruction that compilers haven't historically used could expose unexpected performance issues.
Here is a table of FXAM properties built from Agner Fog's data:
Core | N uops | Latency | Rec. Throughput |
K7 | 2 | 2 | |
K8 | 2 | 1 | |
K10 | 2 | 1 | |
Bulldozer | 1 | 20 | 0.5 |
Piledriver | 1 | 20 | 0.5 |
Steamroller | 1 | 26 | 0.5 |
Excavator | 1 | 26 | 0.5 |
Zen1 | 1 | 1 | |
Zen2 | 1 | 1 | |
Zen3 | 1 | 0.5 | |
Bobcat | 2 | 2 | |
Jaguar | 2 | 2 | |
Pentium | 17-21 | ||
Pentium 2,3 | 1 | 2 | |
Pentium M | 1 | 1 | |
Merom | 1 | 1 | |
Wolfdale | 1 | 1 | |
Nehalem | 1 | 1 | |
Sandy Bridge | 2 | 2 | |
Ivy Bridge | 2 | 2 | |
Haswell | 2 | 2 | |
Broadwell | 2 | 6 | 2 |
SkyLake | 2 | 6 | 2 |
SkylakeX | 2 | 6 | 2 |
Coffee Lake | 2 | 6 | 2 |
Ice Lake | 2 | 6 | 2 |
Pentium 4 | 1 | 2 | 1 |
Prescott | 1 | 1 | |
Atom | 1 | 1 | 1 |
Silvermont | 1 | 7 | 1 |
Goldmont | 1 | 1 | |
Goldmont+ | 1 | 1 | |
Goldmont | 1 | 1 | |
Knights Landing | 1 | 1 | |
VIA Nano 2000 | 41 | ||
Nano 3000 | 15 | 38 | 38 |
The number of uops is mostly in the range 1-2.
For operations on fp80 this instruction is still better than emulation; this is the main motivation for using it. It can probably also be useful when SSE is unavailable.
AMD's optimization manual for Bulldozer only shows a 2-cycle latency. I'm not sure why Agner reports 20, unless there's some bad case for some particular input that isn't documented. A single uop taking 20 cycles sounds very strange and must be serializing the machine. I would only expect divide/sqrt to be that high from a single uop. Maybe someone can run llvm-exegesis on one of those AMD CPUs.
This patch seems ok since it doesn't change codegen.