This is an archive of the discontinued LLVM Phabricator instance.

[X86] AMD Zen 3 Scheduler Model
ClosedPublic

Authored by lebedev.ri on Jan 11 2021, 2:01 AM.

Details

Summary

Introduce a basic schedule model for the AMD Zen 3 CPU, a.k.a. znver3.

This is fully built from scratch, from llvm-mca measurements
and documented reference materials.
Nothing was copied from znver2/znver1.

I believe this is in a reasonable state of completion for inclusion,
probably better than D52779 bdver2 was :)

Namely:

  • uops are pretty spot-on (at least what llvm-mca can measure)
  • latency is also pretty spot-on (at least what llvm-mca can measure)
  • throughput is within reason

I haven't run many benchmarks with this, but what I did run suggests this is beneficial:


I'll call out the obvious problems there:

  • I didn't really bother with X87 instructions
  • I didn't really bother with obviously-microcoded/system instructions
  • There are large discrepancies in throughput between the mr and rm instruction forms (see the sketch after this list). I'm not really sure whether it's a modelling defect that needs to be fixed, or a defect of the measurements.
  • Pipe distributions are probably bad :) I can't do much here until AMD allows that to be fixed by documenting the appropriate counters and updating libpfm
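
For context on the mr/rm bullet: in the X86 scheduler TableGen files, the register form and the load-folded form of an instruction get separate write resources, roughly as in the sketch below. The latencies, uop counts, and Zn3* resource names here are illustrative assumptions, not values from this patch.

// rr form: a single op on an ALU pipe, 1-cycle latency.
defm : X86WriteRes<WriteALU,   [Zn3ALU], 1, [1], 1>;
// rm form: the folded load adds load latency and an extra micro-op.
defm : X86WriteRes<WriteALULd, [Zn3AGU, Zn3ALU], 5, [1, 1], 2>;

A consistent measured-vs-modelled gap that shows up only in the memory forms would implicate this folding bookkeeping, which is exactly the ambiguity described above.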

That being said, as @RKSimon notes:

I'll mention again that all the znver* models appear to be very inaccurate wrt SIMD/FPU instructions <...>

so how much worse could this possibly be?!

Things that aren't there:

  • Various tunings: zero idioms, etc. Those are follow-ups.

Diff Detail

Event Timeline

GGanesh created this revision. Jan 11 2021, 2:01 AM
GGanesh requested review of this revision. Jan 11 2021, 2:01 AM
RKSimon added subscribers: lebedev.ri, bkramer.

Adding @bkramer as IIRC he added the initial znver3 support, and @lebedev.ri who has added amd models in the past.

This mainly looks like a copy+tweak of the existing znver2 model (which itself was just a copy+tweak of the znver1 model), which I guess makes sense, but I don't have a 5000 series cpu to compare exegesis numbers.

llvm/lib/Target/X86/X86PfmCounters.td
241

Do you have the fpu pipe assignments for znver3 (or znver2, for that matter)?
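
For reference, such a mapping is expressed in X86PfmCounters.td roughly as below. This sketch is modeled on the existing znver1 binding; the znver3 resource names and fpu_pipe_assignment event names are assumptions, since libpfm4 does not expose these events for zen2/zen3 yet (and Zen 3's FPU actually has six pipes, so the real mapping may differ):

def ZnVer3PfmCounters : ProcPfmCounters {
  let CycleCounter = PfmCounter<"cycles_not_in_halt">;
  let UopsCounter = PfmCounter<"retired_uops">;
  // Assumed by analogy with amd64_fam17h_zen1_fpu_pipe_assignment.
  let IssueCounters = [
    PfmIssueCounter<"Zn3FPU0", "fpu_pipe_assignment:total0">,
    PfmIssueCounter<"Zn3FPU1", "fpu_pipe_assignment:total1">,
    PfmIssueCounter<"Zn3FPU2", "fpu_pipe_assignment:total2">,
    PfmIssueCounter<"Zn3FPU3", "fpu_pipe_assignment:total3">
  ];
}
def : PfmCountersBinding<"znver3", ZnVer3PfmCounters>;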

Adding @bkramer as IIRC he added the initial znver3 support, and @lebedev.ri who has added amd models in the past.

This mainly looks like a copy+tweak of the existing znver2 model (which itself was just a copy+tweak of the znver1 model), which I guess makes sense, but I don't have a 5000 series cpu to compare exegesis numbers.

I'd love to help (or even take over if that helps) with this, but presently I still don't have this CPU.

I would like to know if this patch can be approved now and checked against the numbers later, once a CPU is accessible.
Since this is an extension of and tweaks to the znver1->znver2 path, I would request that, if it is doable.

Meanwhile, we are also trying to host a machine for the community.

Hi,
I have the CPU (5900x) and I can help you run some commands if you tell me exactly what to do.

There is a small catch right now: libpfm does not yet support Zen 3 performance counters. As a result, llvm-exegesis throws the error "event not found - cannot create event cycles_not_in_halt" and fails. One has to patch/compile/install libpfm first. My simple, stupid patch for libpfm is below.

diff -uNr libpfm-4.11.0/lib/pfmlib_amd64.c libpfm-4.11.0.p/lib/pfmlib_amd64.c
--- libpfm-4.11.0/lib/pfmlib_amd64.c	2020-09-02 20:48:00.000000000 +0200
+++ libpfm-4.11.0.p/lib/pfmlib_amd64.c	2021-01-27 12:50:10.536351642 +0100
@@ -183,6 +183,8 @@
 		}
 	} else if (cfg->family == 22) { /* family 16h */
 		rev = PFM_PMU_AMD64_FAM16H;
+	} else if (cfg->family == 25) { /* family 19h */
+		rev = PFM_PMU_AMD64_FAM17H_ZEN2;
 	}
 
 	cfg->revision = rev;

I think there are a number of other libpfm4 changes that need to be done for you to be able to test this - the fpu pipe mappings seem to be missing from its zen2 events list (the equivalent of amd64_fam17h_zen1_fpu_pipe_assignment on zen1) - we can then add the missing mappings in X86PfmCounters.td and then create latency/uops inconsistency reports in llvm-exegesis. This is all assuming that fam19h has the same pfm pipes as zen2...

Matt added a subscriber: Matt. Feb 25 2021, 2:17 AM
Matt added inline comments. Feb 25 2021, 8:17 AM
llvm/lib/Target/X86/X86ScheduleZnver3.td
1171

I'm wondering, would you happen to know whether there's a chance that marking VPERM2F128 as "Microcoded Instruction" (with Latency = 100) is a leftover from a previous Zen microarchitecture scheduler?

It seems this is no longer the case for Zen 3; cf. https://www.agner.org/optimize/instruction_tables.pdf reporting macro-operations=1, latency=3.5, and reciprocal throughput=0.5 for the y,y,y/m,i instruction variant.
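
If those numbers hold for Zen 3, the corresponding scheduler entry would stop being the microcoded catch-all and look something like the sketch below; WriteFShuffle256 is the generic 256-bit shuffle write class, while the Zn3FPFMisc12 resource name and the rounded 4-cycle latency are assumptions for illustration:

// One macro-op on either of two FP "misc" pipes gives ~0.5 rTput.
defm : X86WriteRes<WriteFShuffle256, [Zn3FPFMisc12], 4, [1], 1>;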

RKSimon added inline comments. Feb 25 2021, 9:02 AM
llvm/lib/Target/X86/X86ScheduleZnver3.td
1171

Yes, there are a lot of numbers in this patch that look like a direct copy+paste from the znver2 model (which was mainly a copy+paste of the znver1 model), and they diverge from what the Agner, instlatx64 and AMD SoG tables all report. For this patch to get any further, we need llvm-exegesis to be run against the model to determine how accurate it really is. Otherwise we're better off just staying with the existing zen models.

We have started working on the libpfm patch. We are working on getting it posted very soon! Hopefully, once that is in place, we will be able to measure these numbers in znver3 hardware and correct accordingly.

We have started working on the libpfm patch. We are working on getting it posted very soon! Hopefully, once that is in place, we will be able to measure these numbers in znver3 hardware and correct accordingly.

Given that the later znver models are very similar to znver1, you might make faster progress by fixing the znver1 model first and copying those fixes down to znver2/znver3. But getting the zen2/zen3 libpfm support fixed upstream would be very useful as well.

(still very much interested in taking this over once I have this hardware, but that is presently still not the case)

(still very much interested in taking this over once I have this hardware, but that is presently still not the case)

@lebedev.ri Do you have an older znver1 CPU that can be used to analyze/clean up the base znver1 model first?

(still very much interested in taking this over once I have this hardware, but that is presently still not the case)

@lebedev.ri Do you have an older znver1 CPU that can be used to analyze/clean up the base znver1 model first?

Nope, otherwise I would have done so already :)

I can work on the znver1 and znver2 models as well. I have a znver2 machine. Let me check if I can get a znver1 machine and run exegesis to correct these latency/throughput numbers.

(still very much interested in taking this over once I have this hardware, but that is presently still not the case)

I may have an update on this within the next ~7 days, hang on...

We have exchanged mails with Eranian regarding libpfm4 support. We will upload the libpfm4 patch shortly (in 3-4 days' time). The patch will be based on the latest PPR manual (https://www.amd.com/system/files/TechDocs/55898_pub.zip). I believe that will keep things moving for verifying the numbers and the details with exegesis.

tambre added a subscriber: tambre. Mar 18 2021, 10:05 AM

(still very much interested in taking this over once I have this hardware, but that is presently still not the case)

I may have an update on this within the next ~7 days, hang on...

Alrighty :)
This now deeply affects me personally, providing a necessary, and sufficient, interest to be involved with this.

$ lscpu  | grep -i "Model name"
Model name:                      AMD Ryzen 9 5950X 16-Core Processor

So, @RKSimon @craig.topper @GGanesh, what would be the reaction to my proposition to take this over?
This won't be uncharted territory for me; see the BdVer2 sched model.

@lebedev.ri

My colleague has submitted the libpfm4 patch to the libpfm4 community. You can take this patch and enable exegesis for znver3. Thank you for the help!

@lebedev.ri

My colleague has submitted the libpfm4 patch to the libpfm4 community. You can take this patch and enable exegesis for znver3. Thank you for the help!

@lebedev.ri No objections as long as @GGanesh is happy. What might be an issue is that the znver2/znver3 libpfm4 event lists are still missing the FPU pipe events, despite them being documented in the PPRs - this will mean that llvm-exegesis won't be able to verify pipe allocation.

I'll mention again that all the znver* models appear to be very inaccurate wrt SIMD/FPU instructions - so ensuring the existing znver1/znver2 models are correct must surely make sense - if it ends up being that we copy+paste backwards from an accurate znver3 model, then that works as well.

lebedev.ri commandeered this revision. Mar 24 2021, 6:01 AM
lebedev.ri planned changes to this revision.
lebedev.ri edited reviewers, added: GGanesh; removed: lebedev.ri.

@lebedev.ri

My colleague has submitted the libpfm4 patch to the libpfm4 community. You can take this patch and enable exegesis for znver3. Thank you for the help!

@lebedev.ri No objections as long as @GGanesh is happy. What might be an issue is that the znver2/znver3 libpfm4 event lists are still missing the FPU pipe events, despite them being documented in the PPRs - this will mean that llvm-exegesis won't be able to verify pipe allocation.

I'll mention again that all the znver* models appear to be very inaccurate wrt SIMD/FPU instructions - so ensuring the existing znver1/znver2 models are correct must surely make sense - if it ends up being that we copy+paste backwards from an accurate znver3 model, then that works as well.

@GGanesh i agree with @RKSimon.
Having only cycle and uops counters is pretty limiting; having per-pipe counters would be awesome.

The FPU pipe counters aren't in the PPR either. I have raised a ticket to get this updated in the PPR. Unfortunately, without the document getting updated, these events can't be enabled.

So I've finally started working on this (yay!),
and while I haven't gotten to the actual instruction latencies/uops/pipe distribution yet,
basically everything so far is not really correct.
So I'm going to end up basically scrapping this and rewriting/remeasuring from scratch...

And yes, cursory examination suggests that zen2 is also not very correct.

lebedev.ri retitled this revision from [X86] AMD Znver3 Scheduler descriptions and llvm-mca tests to [X86] AMD Zen 3 Scheduler Model.
lebedev.ri edited the summary of this revision. (Show Details)

Alright, here we go. I think this is in a reasonable state for inclusion.
@RKSimon @GGanesh @craig.topper please stamp if satisfied.


This is fully built from scratch, from llvm-mca measurements
and documented reference materials.
Nothing was copied from znver2/znver1.

I believe this is in a reasonable state of completion
for initial inclusion, probably better than D52779 bdver2 was :)

Namely:

  • uops are pretty spot-on (at least what llvm-mca can measure)
  • latency is also pretty spot-on (at least what llvm-mca can measure)
  • throughput is within reason, at least for non-memory instructions

I'll call out the obvious problems there:

  • I didn't really bother with X87 instructions
  • I didn't really bother with obviously-microcoded/system instructions
  • There are large discrepancies in throughput for memory instructions. I'm not really sure whether it's a modelling defect that needs to be fixed, or a defect of the measurements.
  • Pipe distributions are probably bad :) I can't do too much here until AMD allows that to be fixed by documenting the appropriate counters and updating libpfm

Things that aren't there:

  • Various tunings: zero idioms, etc. Those are follow-ups.
Matt added a comment. Apr 21 2021, 12:47 PM
  • Pipe distributions are probably bad :) I can't do too much here until AMD allows that to be fixed by documenting the appropriate counters and updating libpfm

FWIW, AMD Zen 3 events have been posted on linux-perf-users, https://lore.kernel.org/linux-perf-users/YG2z7q6MhhxO+M9p@kernel.org/T/

In particular, cf. tools/perf/pmu-events/arch/x86/amdzen3/floating-point.json, https://lore.kernel.org/lkml/20210406215944.113332-5-Smita.KoralahalliChannabasappa@amd.com/#Z30arch:x86:amdzen3:floating-point.json

CC'ing some more folks.

  • Pipe distributions are probably bad :) I can't do too much here until AMD allows that to be fixed by documenting the appropriate counters and updating libpfm

FWIW, AMD Zen 3 events have been posted on linux-perf-users, https://lore.kernel.org/linux-perf-users/YG2z7q6MhhxO+M9p@kernel.org/T/

In particular, cf. tools/perf/pmu-events/arch/x86/amdzen3/floating-point.json, https://lore.kernel.org/lkml/20210406215944.113332-5-Smita.KoralahalliChannabasappa@amd.com/#Z30arch:x86:amdzen3:floating-point.json

That's great!

Any chance libpfm4 might want to merge that? :)

Roman

lebedev.ri edited the summary of this revision. (Show Details) Apr 21 2021, 2:56 PM

Regarding the pipe events, the patch on kernel.org had just copy-pasted events from zen1 and zen2. Apparently, I learnt that these events are restricted and at present not really ready for public release for znver3. That is what is reflected in the kernel patch as well. If we look closely, PMCx000 is incorrectly mapped and doesn't really count the 6 pipes.
Is there any workaround that you would suggest? I understand that these are pfm counters and are dynamic in nature. However, for the model (at least for the throughput numbers), shall I publish our internal numbers running the internally enabled libpfm4 library? This is restricted in its immediate purpose, which I understand; however, at present it may act as a workaround for the throughput and llvm-mca numbers.
I am going through the modifications/corrections made by @lebedev.ri in the meantime.

lebedev.ri added a comment. Edited Apr 26 2021, 11:00 AM

Regarding the pipe events, the patch on kernel.org had just copy-pasted events from zen1 and zen2. Apparently, I learnt that these events are restricted and at present not really ready for public release for znver3. That is what is reflected in the kernel patch as well. If we look closely, PMCx000 is incorrectly mapped and doesn't really count the 6 pipes.

Is that something that might be fixed via/in the next microcode updates, or is it a hardware bug?

Is there any workaround that you would suggest? I understand that these are pfm counters and are dynamic in nature. However, for the model (at least for the throughput numbers), shall I publish our internal numbers running the internally enabled libpfm4 library? This is restricted in its immediate purpose, which I understand; however, at present it may act as a workaround for the throughput and llvm-mca numbers.

The answer depends on the answer to the previous question.

I am going through the modifications/corrections made by @lebedev.ri in the meantime.

Regarding the pipe events, the patch on kernel.org had just copy-pasted events from zen1 and zen2. Apparently, I learnt that these events are restricted and at present not really ready for public release for znver3. That is what is reflected in the kernel patch as well. If we look closely, PMCx000 is incorrectly mapped and doesn't really count the 6 pipes.

Is that something that might be fixed via/in the next microcode updates, or is it a hardware bug?

I have very limited info. However, the solution is definitely not near-term. That's why I am inclined to check on the workarounds.

Is there any workaround that you would suggest? I understand that these are pfm counters and are dynamic in nature. However, for the model (at least for the throughput numbers), shall I publish our internal numbers running the internally enabled libpfm4 library? This is restricted in its immediate purpose, which I understand; however, at present it may act as a workaround for the throughput and llvm-mca numbers.

The answer depends on the answer to the previous question.

I am going through the modifications/corrections made by @lebedev.ri in the meantime.

Regarding the pipe events, the patch on kernel.org had just copy-pasted events from zen1 and zen2. Apparently, I learnt that these events are restricted and at present not really ready for public release for znver3. That is what is reflected in the kernel patch as well. If we look closely, PMCx000 is incorrectly mapped and doesn't really count the 6 pipes.

Is that something that might be fixed via/in the next microcode updates, or is it a hardware bug?

I have very limited info. However, the solution is definitely not near-term. That's why I am inclined to check on the workarounds.

I see.
*Please* do tell them that it would be good to also have these counters for Int pipes :)

Is there any workaround that you would suggest? I understand that these are pfm counters and are dynamic in nature. However, for the model (at least for the throughput numbers), shall I publish our internal numbers running the internally enabled libpfm4 library? This is restricted in its immediate purpose, which I understand; however, at present it may act as a workaround for the throughput and llvm-mca numbers.

The answer depends on the answer to the previous question.

Then, what do you mean by "internal numbers running the internally enabled libpfm4 library"?
I'm guessing you aren't talking about just patching libpfm4 and running it on a real CPU?

I think we can improve this afterwards; the initial version doesn't have to be perfect,
and it never will be.

I am going through the modifications/corrections made by @lebedev.ri in the meantime.

@GGanesh I plan on landing this next Monday, May 3rd, in 3 days, unless some blocking feedback is provided.

@GGanesh I plan on landing this next Monday, May 3rd, in 3 days, unless some blocking feedback is provided.

I am fine mostly. Will give it another look and check the details.

llvm/lib/Target/X86/X86ScheduleZnver3.td
30

Ain't this based on the retire queue, or reorder buffer? It's mapped to the retire control unit in the MCA hardware units. So I guess we should use the retire queue size here rather than the micro-tag/micro-op cache size. The op cache doesn't go through the regular decode path; it is a kind of replay unit best suited for loops. Or we should map MicroOpBufferSize to the RCU.

33

I think we should define LoopMicroOpBufferSize as 4096, since the op cache is best suited for loops and not for instructions in the legacy decode path. (See the sketch below.)
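
A minimal sketch of the two fields under discussion, assuming a 256-entry retire queue and a 4096-op op cache; these figures and the surrounding values are illustrative assumptions, not necessarily what the patch uses:

def Znver3Model : SchedMachineModel {
  let IssueWidth = 6;                // dispatch width
  // Retire-queue (RCU/ROB) capacity, per the suggestion above:
  let MicroOpBufferSize = 256;
  // Op-cache capacity; used to judge whether a loop body fits:
  let LoopMicroOpBufferSize = 4096;
  let LoadLatency = 4;
}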

47

This high latency mostly reflects the div unit. Can we use the high latency of arithmetic instructions instead? I had this doubt when I wanted to enable it, but went ahead with the highest latency number that I could see. It gets used in the InstrInfo, so I think we can restrict it to only the DIV class of instructions. I am okay if you decide to go with 25 here.

163

Yes, we need to model how these schedulers function. Let us move this forward for this iteration of the changes with a FIXME.
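
One plausible way to approximate per-scheduler queues, for the FIXME's eventual resolution: each ProcResource can carry its own BufferSize, modelling the scheduler entries in front of that unit. The name and the 24-entry figure below are assumptions:

def Zn3ALU0 : ProcResource<1> {
  // Entries in this ALU's scheduler queue (assumed capacity).
  let BufferSize = 24;
}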

372

Is there a way we can model some of these dispatch restrictions using the execution units or with other units? I initially had some plans to do it, but was not really aware/sure how to do it.
We haven't really modeled such restrictions. With the issue width alone, I think we aren't really enforcing these dispatch restrictions. I think modeling the scheduler functionality as mentioned in the earlier comment will do!

lebedev.ri marked 4 inline comments as done.

@GGanesh thank you for taking a look.
1 nit addressed

@GGanesh I plan on landing this next Monday, May 3rd, in 3 days, unless some blocking feedback is provided.

I am fine mostly. Will give it another look and check the details.

Was that it?

llvm/lib/Target/X86/X86ScheduleZnver3.td
30

Err, right, I may have forgotten to clean this up.

33

I'm not sure I follow. Are you asking to spell out 4096 directly?

47

I'm indeed not really sure what value should be there,
and more importantly I'm not sure basing it purely off latency is the right choice.
I think it should ideally also consider all macrocoded (uops>2) instructions.
I think I want to leave this as-is for now.

163

To be noted, this is trivial to do, but llvm-mca bugs need to be fixed first.

372

Note that this is not a FIXME. There is nothing to be done about this;
I have already modelled it correctly in this patch;
it's pretty trivial, actually.
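
For readers following along, the kind of issue-port sharing being referred to is normally expressed with a ProcResGroup; a minimal sketch with illustrative names:

def Zn3ALU0 : ProcResource<1>;
def Zn3ALU1 : ProcResource<1>;
def Zn3ALU2 : ProcResource<1>;
def Zn3ALU3 : ProcResource<1>;
// An op written against the group may issue on whichever ALU is free,
// which is how "any of N pipes" dispatch constraints are encoded.
def Zn3ALU0123 : ProcResGroup<[Zn3ALU0, Zn3ALU1, Zn3ALU2, Zn3ALU3]>;
defm : X86WriteRes<WriteALU, [Zn3ALU0123], 1, [1], 1>;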

lebedev.ri edited the summary of this revision. (Show Details) May 1 2021, 9:49 AM

@GGanesh thank you for taking a look.
1 nit addressed

@GGanesh I plan on landing this next Monday, May 3rd, in 3 days, unless some blocking feedback is provided.

I am fine mostly. Will give it another look and check the details.

Was that it?

I am fine with the changes. I understood the LSU descriptions with the ProcResource quantity as well! Thank you for taking this up!

@GGanesh thank you for taking a look.
1 nit addressed

@GGanesh I plan on landing this next Monday, May 3rd, in 3 days, unless some blocking feedback is provided.

I am fine mostly. Will give it another look and check the details.

Was that it?

I am fine with the changes. I understood the LSU descriptions with the ProcResource quantity as well! Thank you for taking this up!

@GGanesh thank you very much!

Alrighty, here we go!

This revision was not accepted when it landed; it landed in state Needs Review. May 1 2021, 12:11 PM
This revision was automatically updated to reflect the committed changes.

Status update:
I'm done with renameable moves and dependency-removing instructions (see the sketch after the checklist below).
Rough remaining checklist:

  • Finish D60000, and maybe fix a few more latencies
  • Revisit pipe distribution
  • ???
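
As a reference for the dependency-removing work mentioned above, X86 scheduler models declare zero idioms through the dependency-breaking predicate machinery; a sketch modeled on the BdVer2 scheduler, with an illustrative instruction list:

def : IsZeroIdiomFunction<[
  // GPR zero-idioms: reg,reg forms that always produce zero, so the
  // result carries no true dependency on the source register.
  DepBreakingClass<[ SUB32rr, SUB64rr, XOR32rr, XOR64rr ],
                   ZeroIdiomPredicate>
]>;
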
RKSimon added inline comments. Jun 18 2021, 1:41 AM
llvm/lib/Target/X86/X86ScheduleZnver3.td
583

Piledriver?

lebedev.ri marked an inline comment as done. Jun 19 2021, 12:07 PM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86ScheduleZnver3.td
583

Right, thanks, missed this one. Fixed.