This is an archive of the discontinued LLVM Phabricator instance.

[X86][Btver2] Improved latency/throughput model for scalar int-to-float conversions.
ClosedPublic

Authored by andreadb on Jan 24 2019, 5:07 AM.

Details

Summary

This is a follow-up of D57056.

On Jaguar, we need to account for an additional operand latency of 6cy (caused by bypass delays) in the case of scalar_int-to-float conversions.
The latency of (V)CVTSI2S(S|D) should be f+3; In this context, f is a bypass delay of 6cy (see AMD fam16h SOG).

This patch marks the input gpr operand as ReadIntToFpu, so that we correctly account for that delay. That quantity has then be subtacted to the opcode latency (which should just be 3cy).

I verified that latency/throughput numbers from llvm-mca have improved, and now they better match what is reported by perf on Jaguar. That being said, I still see cases where the IPC as reported by llvm-mca doesn't quite match the IPC from perf.
Example:

vcvtsi2ss %ecx, %xmm0, %xmm0    # Should tend to IPC: 0.33. Perf reports IPC: 0.25 (one cvt every 4cy).

I suspect that local forwarding might be disabled for it; it looks like users have to wait for an extra +1cy. That would explain the 0.25. For now I decided to go with what is in the documents, so we always assume a +3cy latency.

Latency for the RM variants has changed (it has slightly improved). However, we need another patch to fix the number of opcodes (it should be 1, not 2).

Diff Detail

Event Timeline

andreadb created this revision.Jan 24 2019, 5:07 AM
RKSimon added inline comments.Jan 29 2019, 3:27 AM
lib/Target/X86/X86ScheduleBtVer2.td
432–433

Update this FIXME comment and mention local forwarding?

andreadb updated this revision to Diff 184062.Jan 29 2019, 5:28 AM
andreadb marked an inline comment as done.

Patch updated.

I verified that the int-to-fp convert latency is 4cy (instead of 3cy) if we exclude the extra 6cy of bypass delay.
So, I went ahead and set the latency directly to 4cy.

I've also verified by running some microbenchmarks that the RM variants have a latency of ~9cy (5cy for the load opcode + 4cy of convert). That latency seems consistent with what I found when testing the RR variants. The only difference for the RM variants is the number of opcodes (it is 1 COP instead of 2 COPs).
So, I went ahead and fixed all those writes in this patch (rather than splitting the change in two patches).

RKSimon accepted this revision.Jan 29 2019, 7:28 AM

LGTM

This revision is now accepted and ready to land.Jan 29 2019, 7:28 AM
RKSimon added inline comments.Jan 29 2019, 7:31 AM
lib/Target/X86/X86InstrSSE.td
880

One minor comment - we should probably add this to the avx512f equivalents for consistency.

andreadb updated this revision to Diff 184091.Jan 29 2019, 7:52 AM

Patch updated.
Addressed review comment.

RKSimon accepted this revision.Jan 29 2019, 8:01 AM

LGTM with one more minor

lib/Target/X86/X86InstrAVX512.td
7550 ↗(On Diff #184091)

[sched, ReadDefault, ReadInt2Fpu]

This revision was automatically updated to reflect the committed changes.