Page MenuHomePhabricator

[X86][Znver1] Specify Register Files, RCU; FP scheduler capacity.
ClosedPublic

Authored by lebedev.ri on Jun 2 2018, 12:22 PM.

Details

Summary

First off: i do not have any access to that processor,
so this is purely theoretical, no benchmarks.

I have been looking into bdver2 scheduling profile, and while cross-referencing
the existing btver2, znver1 profiles, and the reference docs
(Software Optimization Guide for AMD Family {15,16,17}h Processors),
i have noticed that only btver2 scheduling profile specifies these.

Also, there is no mca test coverage.

So i'm wondering if this info should be specified?

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision.Jun 2 2018, 12:22 PM
GGanesh added inline comments.Jun 3 2018, 10:30 PM
lib/Target/X86/X86ScheduleZnver1.td
96 ↗(On Diff #149616)

I am not sure why we include CCR register.

100 ↗(On Diff #149616)

This number is fine.

111 ↗(On Diff #149616)

The retire unit is shared between integer and FP ops. In SMT mode it is 96 entry per thread. So, I think we shall consider only 96 entry as a conservative value.

113 ↗(On Diff #149616)

I assume these are the load/store queue entries. The FPU has

  1. 44 entry Load Queue
  2. 72 Out of Order Loads
  3. 44 entry Store Queue

So, we are concerned only about the queues, we can have only 44 marked for LD and ST.

lebedev.ri marked 3 inline comments as done.

Addressed @GGanesh's review notes.

lebedev.ri added inline comments.
lib/Target/X86/X86ScheduleZnver1.td
111 ↗(On Diff #149616)

Aha, i was wondering how SMT was considered here.
But then what about MicroOpBufferSize in SchedMachineModel?
Is that supposed to be keep at 192?

113 ↗(On Diff #149616)

I think i would prefer to keep this just as a fixme note for now, since
while llvm-mca does have an options to specify the sizes of these queues,
it does not seem to read those from the sched model otherwise,
i guess because it can't currently be expressed here.

Any further comments?

lebedev.ri edited the summary of this revision. (Show Details)

Split off the baseline into separate differential, rebased.

andreadb added inline comments.Jun 15 2018, 3:34 AM
lib/Target/X86/X86ScheduleZnver1.td
96 ↗(On Diff #149616)

As a rule of thumb:
a register class should only be added if you want to allow those registers to be renamed. In this case, CCR should be added if you want to count EFLAGS renames agains the number of available physical registers in the PRF.
If you don't think EFLAGS renames should be counted against the total, then it is correct to remove it. Otherwise, it is incorrect.

111 ↗(On Diff #149616)

llvm-mca doesn't make assumptions on whether the CPU is in SMT mode or not. Same for the scheduling model, which assumes an optimistic micro-op buffer. For now, it is better to specify the resources seen available by a single thread running on the CPU (i.e. with no other concurrent threads).
Basically, I think you should not go for a conservative value here.

113 ↗(On Diff #149616)

In theory, the processor model could specify load/store queues as a buffered processor resources that are consumed by instructions.
In practice, llvm-mca has the concept of LSUnit. There is an open bugzilla about how to better describe load/store queues in llvm-mca and the scheduling model.

lebedev.ri marked an inline comment as done and an inline comment as not done.

Rebased, added CCR back.

lib/Target/X86/X86ScheduleZnver1.td
96 ↗(On Diff #149616)

I'm not seeing anything in docs, so i'm going to be conservative, and add it back.

111 ↗(On Diff #149616)

Hmm. Is that documented somewhere?
I would have naively expected that what @GGanesh wrote is the approach..

113 ↗(On Diff #149616)

(That matches what i said)

andreadb added inline comments.Jun 15 2018, 7:53 AM
lib/Target/X86/X86ScheduleZnver1.td
111 ↗(On Diff #149616)

This limitation was mentioned in the RFC.

You cannot possibly make any reasonable assumptions with SMT. The problem is not just the reorder buffer but any other resources which may or may not be competitively shared (or statically/dynamically partitioned).
Even if we know that the cpu is multi threaded, there is no way to predict how the other thread will use/consume hardware resources.

lebedev.ri added inline comments.Jun 15 2018, 8:04 AM
lib/Target/X86/X86ScheduleZnver1.td
111 ↗(On Diff #149616)

Hmm.
https://support.amd.com/TechDocs/55723_SOG_Fam_17h_Processors_3.00.pdf
Page 35:
The retire queue can hold up to 192 micro ops or 96 per thread in SMT mode.

@GGanesh any further thoughts?

andreadb added inline comments.Jun 19 2018, 3:56 AM
lib/Target/X86/X86ScheduleZnver1.td
111 ↗(On Diff #149616)

You shouldn't be making assumptions on SMT and use a conservative value for the retire queue. See my previous comment.
Without a proper framework to emulate SMT in llvm-mca, It is better to keep this value to the theoretical maximum. At least, you will get more accurate numbers when there is only a single thread active on the cpu.

lebedev.ri added inline comments.Jun 19 2018, 4:05 AM
lib/Target/X86/X86ScheduleZnver1.td
111 ↗(On Diff #149616)

You shouldn't be making assumptions on SMT

Ah, on the SMT itself too, not only on what happens in SMT mode. I missed that remark initially.
Does the scheduling use RetireControlUnit<> values, or is it only used in mca right now?

andreadb added inline comments.Jun 19 2018, 4:32 AM
lib/Target/X86/X86ScheduleZnver1.td
111 ↗(On Diff #149616)

My point is that in an SMT processor, resources (not just the ROB) are shared by multiple hardware threads.
Some resources may be competitively shared among the threads. Not knowing what the other thread is doing, makes it impossible to correctly predict how buffers are used.

By setting the RCU size to half of its capacity, you are assuming that the processor is conservatively running in SMT mode. However, you are mixing apples with oranges here. You are not saying how other resources should behave in SMT mode. What about the schedulers, register file, dispatch logic in SMT mode?

My point is: there is no framework at the moment to deal with SMT in general. So, there is no advantage in using a conservative value for the RCU here (unless you think it better matches your perf results).

To answer to your last question:
for now, the RetireControlUnit information is only used by mca.

lebedev.ri marked 8 inline comments as done and an inline comment as not done.

Don't be conservative re RetireControlUnit size.

lib/Target/X86/X86ScheduleZnver1.td
111 ↗(On Diff #149616)

Thank you for this detailed explanation!

To answer to your last question:

for now, the RetireControlUnit information is only used by mca.

Ok then!

LGTM!

Thank you for the review!
Could you please also click the accept button? :)

GGanesh accepted this revision.Jun 19 2018, 11:16 PM
This revision is now accepted and ready to land.Jun 19 2018, 11:16 PM
This revision was automatically updated to reflect the committed changes.