This patch adds the znver4 scheduler model.
Compared with znver3, it contains substantial updates to the instructions covered, the execution units, and the latencies and throughputs.
The existing znver3 llvm-mca tests were replicated for znver4.
New tests have been added for AVX-512.
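For context, the new model follows the same overall shape as the existing znver3 scheduler file. A minimal sketch of that shape is below; the concrete numbers are illustrative placeholders, not the values from the committed patch:

  // Top-level machine model; the specific values here are placeholders,
  // not the numbers from the actual X86ScheduleZnver4.td.
  def Znver4Model : SchedMachineModel {
    let IssueWidth = 6;            // placeholder dispatch width
    let MicroOpBufferSize = 320;   // placeholder buffer size
    let LoadLatency = 4;           // placeholder load-to-use latency
    let MispredictPenalty = 18;    // placeholder
    let PostRAScheduler = 1;
    let CompleteModel = 1;
  }

  let SchedModel = Znver4Model in {
    // The ProcResource, WriteRes and ReadAdvance definitions that model the
    // pipelines, latencies and throughputs mentioned above live in this block.
  }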
Reviewers: RKSimon, craig.topper, andreadb
Repository: rG LLVM Github Monorepo
Event Timeline
Please can you adjust X86PfmCounters.td so the znver4 counters point to suitable model resources (it's OK if you have to lose some of the mappings - I'm not sure how complete perfmon2 zen4 support is yet).
Updated for RKSimon's review comments as below.
double the latency just for vaddps?
Ack. Corrected!
Please can you adjust X86PfmCounters.td so the znver4 counters point to suitable model resources
At the outset, znver4 uses exactly the same counters as znver3, categorised into Int, FP, Load-Store, Cycles, and Ops retired. I have additionally added the Zn4AGU mapping.
I'm not sure how complete perfmon2 zen4 support is yet
libpfm support for znver4 was added as of December 2022.
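For reference, a hedged sketch of what such a binding looks like, modelled on the existing znver3 entry in X86PfmCounters.td; the resource names and libpfm event strings below are assumptions for illustration and may differ from the final patch:

  def ZnVer4PfmCounters : ProcPfmCounters {
    let CycleCounter = PfmCounter<"cycles_not_in_halt">;   // assumed event name
    let UopsCounter  = PfmCounter<"retired_ops">;          // assumed event name
    let IssueCounters = [
      // Per-resource issue counters; the event strings are illustrative only.
      PfmIssueCounter<"Zn4Int",   "ex_ret_instr">,
      PfmIssueCounter<"Zn4FP",    "fp_ret_sse_avx_ops:all">,
      PfmIssueCounter<"Zn4Load",  "ls_dispatch:ld_dispatch">,
      PfmIssueCounter<"Zn4Store", "ls_dispatch:store_dispatch">,
      PfmIssueCounter<"Zn4AGU",   "ls_dispatch:ld_st_dispatch + ls_dispatch:ld_dispatch + ls_dispatch:store_dispatch">
    ];
  }
  def : PfmCountersBinding<"znver4", ZnVer4PfmCounters>;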
Don't you need to account for double pumping for ZMM? In the Jaguar model we'd typically double the resource usage to [2] to simulate it - so the uop count stays at 1.
It is not really double pumping like earlier AMD archs. The micro-ops are doubled, but they enter the same pipeline one cycle at a time. So, at any given cycle, two different 512-bit instructions can be fed into the FP pipes (unlike previously double-pumped archs). Thus uops will be 2 for 512-bit instructions, but the other units in the FP pipeline remain available to pick up another 512-bit instruction in the same cycle.
@RKSimon, we would like to have this patch as part of the 16.0 release and would like to know your thoughts on that. Revisions to the model can follow after the release.
I agree that LLVM scheduler model resource usage doesn't fully match what is actually happening on the hardware (we have similar problems with hardware having different concepts of microcoded instructions) - but we do need to indicate that the rthroughput of, say, VADDPD zmm is 1.0 while xmm/ymm is 0.5. The easiest way to approximately model this is by doubling the resource count; the side effects are minimal in comparison (it doesn't affect latency or uop counts, it's just not very accurate about how the instruction uses a resource group).
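Concretely, the suggested modelling is along the lines of the sketch below, where the only difference between the YMM and ZMM entries is the doubled resource-cycle count; the ZMM line is quoted from the inline comment further down, while the YMM counterpart and its [1] resource cycles are assumptions following the patch's znver3-style conventions:

  // xmm/ymm FADD: 1 uop, the Zn4FPFAdd01 group is busy for 1 cycle
  //   -> with two FAdd pipes in the group, rthroughput is 0.5
  defm : Zn4WriteResYMMPair<WriteFAdd64Y, [Zn4FPFAdd01], 3, [1], 1>;
  // zmm FADD: still 1 uop and the same latency, but the group is held
  //   for 2 cycles -> rthroughput becomes 1.0 (RKSimon's suggested form)
  defm : Zn4WriteResZMMPair<WriteFAdd64Z, [Zn4FPFAdd01], 3, [2], 1>;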
Apart from the 512-bit throughput underestimations, I'm happy for the patch to get in for 16.0.
llvm/lib/Target/X86/X86ScheduleZnver4.td, line 910:
I'd expected the zmm pairs to be something like: defm : Zn4WriteResZMMPair<WriteFAdd64Z, [Zn4FPFAdd01], 3, [2], 1>;
Adjusted the resource cycles and retained the micro-ops as-is, as per @RKSimon's comments.
Committed at rGffdd5a330c05fa2b4339f64402f650df068c5767
I've proposed this to be backported to 16.x here: https://github.com/llvm/llvm-project/issues/61287