This is an archive of the discontinued LLVM Phabricator instance.

[X86] AMD Znver4 (Genoa) Scheduler enablement
ClosedPublic

Authored by GGanesh on Feb 14 2023, 6:40 AM.

Details

Summary

This patch adds the znver4 scheduler model.
Compared with znver3, there are substantial changes to instructions, execution units, latencies and throughput.
The existing znver3 tests for the llvm-mca tool were replicated for znver4.
New tests were added for AVX-512.

Event Timeline

GGanesh created this revision. Feb 14 2023, 6:40 AM
GGanesh requested review of this revision. Feb 14 2023, 6:40 AM
Matt added a subscriber: Matt. Feb 15 2023, 10:10 AM

Please can you adjust X86PfmCounters.td so the znver4 counters point to suitable model resources (it's OK if you have to lose some of the mappings - I'm not sure how complete perfmon2 zen4 support is yet).

RKSimon added inline comments. Feb 15 2023, 10:46 AM
llvm/lib/Target/X86/X86ScheduleZnver4.td
906

double the latency just for vaddps?

910

Don't you need to account for double pumping for ZMM? In the Jaguar model we'd typically double the resource usage to [2] to simulate it - so uops stays at 1.

GGanesh updated this revision to Diff 501776. Mar 2 2023, 12:37 AM

Updated for RKSimon's review comments as below.

double the latency just for vaddps?

Ack. Corrected!

Please can you adjust X86PfmCounters.td so the znver4 counters point to suitable model resources

At the outset, this uses the same counters as znver3.
They are categorised into Int, FP, Load-Store, Cycles and Ops retired. I have additionally added a Zn4AGU mapping.

I'm not sure how complete perfmon2 zen4 support is yet

libpfm support for znver4 was updated as of Dec 2022.

Don't you need to account for double pumping for ZMM? In the Jaguar model we'd typically double the resource usage to [2] to simulate it - so uops stays at 1.

It is not really double pumping like that of earlier AMD archs! The micro-ops are doubled, but they enter the same pipeline one cycle at a time. So, in any given cycle, two different 512-bit insns can be fed into the FP pipes (unlike previous double-pumped archs). So, uops will be 2 for 512-bit insns, but the other units in the FP pipeline remain available to pick another 512-bit insn in the same cycle.

@RKSimon! We would like to have this patch as part of the 16.0 release. We would like to know your thoughts on that! Revisions to the model can follow after the release!

Don't you need to account for double pumping for ZMM? In the Jaguar model we'd typically double the resource usage to [2] to simulate it - so uops stays at 1.

It is not really double pumping like that of earlier AMD archs! The micro-ops are doubled, but they enter the same pipeline one cycle at a time. So, in any given cycle, two different 512-bit insns can be fed into the FP pipes (unlike previous double-pumped archs). So, uops will be 2 for 512-bit insns, but the other units in the FP pipeline remain available to pick another 512-bit insn in the same cycle.

I agree that the resource usage in LLVM scheduler models doesn't fully match what is actually happening on the hardware (we have similar problems with hardware having different concepts of microcoded instructions) - but we do need to indicate that the rthroughput of, say, VADDPD zmm is 1.0 but xmm/ymm is 0.5 - the easiest way to approximately model this is by doubling the resource count; the side effects are minimal in comparison (it doesn't affect latency / uop counts, it's just not very accurate on how it uses a resource group).
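
The arithmetic behind this suggestion can be sketched as follows. This is a minimal illustration only, not llvm-mca's actual implementation: it assumes that the reciprocal throughput contributed by a resource group is simply the resource cycles divided by the number of units in the group, and that Zn4FPFAdd01 groups two FP add pipes (suggested by its name, not confirmed in this thread). The helper name reciprocal_throughput is hypothetical.

```python
from fractions import Fraction


def reciprocal_throughput(resource_cycles: int, num_units: int) -> Fraction:
    """Cycles per instruction if this resource group is the only bottleneck."""
    if resource_cycles <= 0 or num_units <= 0:
        raise ValueError("resource_cycles and num_units must be positive")
    return Fraction(resource_cycles, num_units)


# Assumption: the FP add group (Zn4FPFAdd01) has two units.
FADD_UNITS = 2

# xmm/ymm: one resource cycle per instruction.
xmm_ymm = reciprocal_throughput(1, FADD_UNITS)
# zmm: resource cycles doubled to 2, uop count left unchanged.
zmm = reciprocal_throughput(2, FADD_UNITS)

print(f"xmm/ymm rthroughput: {float(xmm_ymm)}")
print(f"zmm rthroughput:     {float(zmm)}")
```

Under these assumptions the doubled resource cycles give 1.0 for zmm against 0.5 for xmm/ymm, matching the figures quoted above, while latency and uop counts play no part in the calculation.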

@RKSimon! We would like to have this patch as part of the 16.0 release. We would like to know your thoughts on that! Revisions to the model can follow after the release!

Apart from the 512-bit throughput underestimations, I'm happy for the patch to get in for 16.0.

GGanesh updated this revision to Diff 502689. Mar 6 2023, 9:31 AM

Updated for review comments!

RKSimon added inline comments. Mar 6 2023, 10:37 AM
llvm/lib/Target/X86/X86ScheduleZnver4.td
910

I'd expected the zmm pairs to be something like:

defm : Zn4WriteResZMMPair<WriteFAdd64Z, [Zn4FPFAdd01], 3, [2], 1>;
GGanesh updated this revision to Diff 502940. Mar 7 2023, 12:12 AM

Adjusted the resource cycles and retained the micro-ops as they were, as per @RKSimon's comments!

Thanks @GGanesh - I think if we can update the vector integer classes as well this will be good enough as an initial commit and merging for 16.x

llvm/lib/Target/X86/X86ScheduleZnver4.td
1144

It looks like the integer classes still need the ZMM resource cycles adjustment

GGanesh updated this revision to Diff 503247. Mar 7 2023, 11:34 PM

Updated the vector integer classes (ZMM variants) as per @RKSimon's comments!

RKSimon accepted this revision. Mar 8 2023, 1:40 AM

LGTM - thanks @GGanesh !

This revision is now accepted and ready to land. Mar 8 2023, 1:40 AM
llvm/test/tools/llvm-mca/X86/Znver4/resources-x86_32.s