This is an archive of the discontinued LLVM Phabricator instance.

[MC][X86] Correctly model additional operand latency caused by transfer delays from the integer to the floating point unit.
ClosedPublic

Authored by andreadb on Jan 22 2019, 7:41 AM.

Details

Summary

This patch adds a new ReadAdvance definition named ReadInt2Fpu.
ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by data transfers from the integer unit to the floating point unit.
ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all x86 models except BtVer2. That means this patch is a functional change for the Jaguar CPU model only.
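
For readers less familiar with ReadAdvance, here is a minimal TableGen sketch of the pattern used here (illustrative only; the exact lines in the patch may differ):

  // X86Schedule.td: a scheduler read for values transferred from the
  // integer unit to the floating point unit.
  def ReadInt2Fpu : SchedRead;

  // Models with no transfer penalty advance the read by 0 cycles, i.e. the
  // GPR input operand adds no extra latency.
  def : ReadAdvance<ReadInt2Fpu, 0>;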

TableGen definitions for the (V)PINSR* instructions have been updated to account for the new ReadInt2Fpu. That read is mapped to the GPR input operand.
On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch, that extra delay was folded into the opcode latency. In practice, the insert opcode only executes for 1cy; most of the latency is contributed by the so-called operand latency. According to the AMD SOG for family 16h, (V)PINSR* latency is defined by the expression f+1, where f is the forwarding delay from the integer unit to the FPU.
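
On BtVer2 the same pattern becomes a negative advance, with the read attached to the GPR operand in the instruction's Sched list. Again a hedged sketch; the operand ordering below is illustrative rather than copied from the patch:

  // X86ScheduleBtVer2.td: a negative advance adds 6 cycles of operand latency
  // to the GPR input, modeling the int-to-fpu forwarding delay.
  def : ReadAdvance<ReadInt2Fpu, -6>;

  // Instruction side (illustrative): writes come first in the Sched list,
  // followed by one read per input operand; ReadInt2Fpu goes on the GPR input.
  // Sched<[WriteVecInsert, ReadDefault, ReadInt2Fpu]>

Expressing the delay as a negative ReadAdvance attributes the extra 6 cycles to the GPR operand's dependency edge instead of the instruction's output latency, which matches the f+1 description in the SOG.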

When printing instruction latency from llvm-mca (see InstructionInfoView.cpp) and llc (only when flag -print-schedule is specified), we now need to account for any extra forwarding delays. We do this by checking whether scheduling classes declare any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td: "A negative advance effectively increases latency, which may be used for cross-domain stalls". When computing the instruction latency for the purpose of our scheduling tests, we now add any extra delay to the formula. This avoids regressing existing codegen and mca schedule tests. It comes with the cost of an extra (but very simple) hook in MCSchedModel.
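
As a worked example with the Jaguar numbers above: the (V)PINSR* scheduling class contributes 1cy of execution latency, and the -6cy ReadAdvance on the GPR operand adds 6 more, so the reported latency becomes 1 + 6 = 7 cycles, i.e. f+1 with f = 6.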

As a side note: this patch would have been a bit simpler if we didn't have to support flag -print-schedule. Now that we have llvm-mca, we can just move all of our latency/throughput test coverage to llvm-mca. If we do that, then we can get rid of flag -print-schedule and all of the latency/throughput reporting framework. It would also help to solve a long-standing layering violation (originally reported as PR37160).

Let me know if okay to commit.

Thanks,
Andrea

Diff Detail

Event Timeline

andreadb created this revision. Jan 22 2019, 7:41 AM

@craig.topper Should any Intel model account for this in a similar way? Agner is a little vague on this.

lib/Target/X86/X86Schedule.td
20

Add a description comment

LGTM. I'd wait a bit to see if others have anything else to add.

include/llvm/MC/MCSchedule.h
375

You can probably constify Entries.

lib/Target/X86/X86Schedule.td
19

I think it would be helpful to have a comment here, describing ReadInt2Fpu. Something similar to what you described above in the Details.

andreadb marked an inline comment as done. Jan 22 2019, 10:54 AM
andreadb added inline comments.
lib/Target/X86/X86Schedule.td
20

I will add a comment to this line.

andreadb updated this revision to Diff 183088. Jan 23 2019, 4:52 AM

Thanks for the feedback.

Patch has been rebased. I have added a small description of ReadInt2Fpu in the code.

andreadb marked 2 inline comments as done. Jan 23 2019, 4:52 AM
RKSimon accepted this revision. Jan 23 2019, 7:18 AM

LGTM - cheers

This revision is now accepted and ready to land. Jan 23 2019, 7:18 AM
This revision was automatically updated to reflect the committed changes.

> @craig.topper Should any Intel model account for this in a similar way? Agner is a little vague on this.

I tried to find out more about this internally. There are still bypass delays between integer and fp in some cases, and they change in every generation. And the definition of what's a float and what's an integer has changed over time. For example, logic ops and shuffles aren't really considered int or fp in the most recent CPUs.