This is an archive of the discontinued LLVM Phabricator instance.

[X86] Add description of FXAM instruction
ClosedPublic

Authored by sepavloff on Jun 24 2021, 5:46 AM.

Details

Summary

Previously this instruction could be used only in the assembler. This change
makes it available to the compiler as well. Scheduling information was copied
from the FTST instruction; hopefully this is a satisfactory approximation.

Event Timeline

sepavloff created this revision. Jun 24 2021, 5:46 AM
sepavloff requested review of this revision. Jun 24 2021, 5:46 AM
Herald added a project: Restricted Project. Jun 24 2021, 5:46 AM

FXAM appears to be two uops where FTST is one on modern Intel CPUs based on Agner Fog's data. Agner's data for some AMD CPUs shows ~20 cycles of latency.

What do you intend to use this for?

> FXAM appears to be two uops where FTST is one on modern Intel CPUs based on Agner Fog's data. Agner's data for some AMD CPUs shows ~20 cycles of latency.

Could tuning scheduling for this instruction be subsequent work?

> What do you intend to use this for?

This instruction is a convenient basis for implementing isnan and other classification functions for fp80. Currently these functions are implemented using FUCOM* instructions. Those, however, raise an exception if the argument is a signaling NaN, and such behavior is not suitable for classification functions.
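
As a rough illustration of why FXAM fits classification: the instruction reports the class of ST(0) in the x87 status-word condition codes C3, C2 and C0, puts the sign in C1, and raises no floating-point exception, even for a signaling NaN. The sketch below only decodes a status word as FNSTSW would store it after FXAM. The names `FxamClass`, `FxamResult`, `decodeFxam` and `isNaNFromFxam` are hypothetical helpers for this example and are not part of the patch or of LLVM.

```cpp
#include <cstdint>

// Operand classes that FXAM encodes in the condition codes C3:C2:C0.
enum class FxamClass {
  Unsupported, // 000
  NaN,         // 001
  Normal,      // 010
  Infinity,    // 011
  Zero,        // 100
  Empty,       // 101
  Denormal,    // 110
  Reserved     // 111, not listed as an FXAM result
};

// Hypothetical result type for this sketch.
struct FxamResult {
  FxamClass Class;
  bool Negative; // C1 carries the sign bit of ST(0)
};

// Decode a 16-bit x87 status word (as stored by FNSTSW after FXAM).
// Bit positions in the status word: C0 = 8, C1 = 9, C2 = 10, C3 = 14.
inline FxamResult decodeFxam(std::uint16_t StatusWord) {
  const unsigned C0 = (StatusWord >> 8) & 1u;
  const unsigned C1 = (StatusWord >> 9) & 1u;
  const unsigned C2 = (StatusWord >> 10) & 1u;
  const unsigned C3 = (StatusWord >> 14) & 1u;

  // Pack the three class bits into one small integer: C3 C2 C0.
  const unsigned Code = (C3 << 2) | (C2 << 1) | C0;

  FxamClass Class = FxamClass::Reserved;
  switch (Code) {
  case 0: Class = FxamClass::Unsupported; break;
  case 1: Class = FxamClass::NaN; break;
  case 2: Class = FxamClass::Normal; break;
  case 3: Class = FxamClass::Infinity; break;
  case 4: Class = FxamClass::Zero; break;
  case 5: Class = FxamClass::Empty; break;
  case 6: Class = FxamClass::Denormal; break;
  default: break; // 7 stays Reserved
  }
  return FxamResult{Class, C1 != 0};
}

// An isnan-style query built on the decoded class.
inline bool isNaNFromFxam(std::uint16_t StatusWord) {
  return decodeFxam(StatusWord).Class == FxamClass::NaN;
}
```

For example, a status word of `0x0100` has only C0 set, so `isNaNFromFxam(0x0100)` is true regardless of whether the NaN is quiet or signaling. How the compiler would actually select and lower such a sequence is outside the scope of this sketch.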

>> FXAM appears to be two uops where FTST is one on modern Intel CPUs based on Agner Fog's data. Agner's data for some AMD CPUs shows ~20 cycles of latency.

> Could tuning scheduling for this instruction be subsequent work?

Yes, but that 20-cycle latency on some AMD CPUs is a little concerning. X87 tends to get less and less optimized in modern CPUs, and I'm sure that which instructions show up in code factors into those design decisions. So using an X87 instruction that compilers haven't historically used could expose unexpected performance issues.

>>> FXAM appears to be two uops where FTST is one on modern Intel CPUs based on Agner Fog's data. Agner's data for some AMD CPUs shows ~20 cycles of latency.

>> Could tuning scheduling for this instruction be subsequent work?

> Yes, but that 20-cycle latency on some AMD CPUs is a little concerning. X87 tends to get less and less optimized in modern CPUs, and I'm sure that which instructions show up in code factors into those design decisions. So using an X87 instruction that compilers haven't historically used could expose unexpected performance issues.

The table of FXAM properties built from Agner Fog data:

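Core | N uops | Latency | Rec. Throughput
K7 | 2 | 2
K8 | 2 | 1
K10 | 2 | 1
Bulldozer | 1 | 20 | 0.5
Piledriver | 1 | 20 | 0.5
Steamroller | 1 | 26 | 0.5
Excavator | 1 | 26 | 0.5
Zen1 | 1 | 1
Zen2 | 1 | 1
Zen3 | 1 | 0.5
Bobcat | 2 | 2
Jaguar | 2 | 2
Pentium | 17-21
Pentium 2,3 | 1 | 2
Pentium M | 1 | 1
Merom | 1 | 1
Wolfdale | 1 | 1
Nehalem | 1 | 1
Sandy Bridge | 2 | 2
Ivy Bridge | 2 | 2
Haswell | 2 | 2
Broadwell | 2 | 6 | 2
Skylake | 2 | 6 | 2
SkylakeX | 2 | 6 | 2
Coffee Lake | 2 | 6 | 2
Ice Lake | 2 | 6 | 2
Pentium 4 | 1 | 2 | 1
Prescott | 1 | 1
Atom | 1 | 1 | 1
Silvermont | 1 | 7 | 1
Goldmont | 1 | 1
Goldmont+ | 1 | 1
Goldmont | 1 | 1
Knights Landing | 1 | 1
VIA Nano 2000 | 41
Nano 3000 | 15 | 38 | 38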

The number of uops varies mostly in the range 1-2.

For operations on fp80 this instruction is still better than emulation, which is the main motivation for using it. It can probably also be useful if SSE is unavailable.

craig.topper accepted this revision. Jun 24 2021, 3:37 PM

AMD's optimization manual for Bulldozer only shows a 2-cycle latency. I'm not sure why Agner reports 20 unless there's some bad case for some particular input that isn't documented. A single uop taking 20 cycles sounds very strange and must be serializing the machine. I would only expect divide/sqrt to be that high from a single uop. Maybe someone can run llvm-exegesis on one of those AMD CPUs.

This patch seems ok since it doesn't change codegen.

This revision is now accepted and ready to land. Jun 24 2021, 3:37 PM
This revision was landed with ongoing or failed builds. Jun 24 2021, 11:07 PM
This revision was automatically updated to reflect the committed changes.