Page MenuHomePhabricator

[X86] Add SchedRW for PMULLD

Authored by craig.topper on Mar 27 2018, 10:47 PM.



It seems many CPUs don't implement this instruction as well as the other vector multiplies. Often using a multi uop flow. Silvermont in particular has a 7 uop flow with 11 cycle throughput. Sandy Bridge implements it as a single uop with 5 cycle latency and 1 cycle throughput. But Haswell and later use 2 uops with 10 cycle latency and 2 cycle throughput.

This patch adds a new X86SchedWritePair we can use to tag this instruction separately. I've provided correct information for Silvermont, Btver2, and Sandy Bridge. I've removed the InstRWs for SandyBridge. I've left Haswell/Broadwell/Skylake InstRWs in place because I wasn't sure how to account for the different load latency between 128 and 256 bits. I also left Znver1 InstRWs in place because the existing values don't match Agner's spreadsheet.

I also left a FIXME in the SandyBridge model because it being used for the "generic" model is too optimistic for the 256/512-bit versions since those are multiple uops on all known CPUs.

Diff Detail


Event Timeline

craig.topper created this revision.Mar 27 2018, 10:47 PM
courbet accepted this revision.Mar 28 2018, 12:22 AM
courbet added inline comments.
163 ↗(On Diff #140040)

llvm-exegesis measures 2xP01 here.

This revision is now accepted and ready to land.Mar 28 2018, 12:22 AM
courbet requested changes to this revision.Mar 28 2018, 12:23 AM
This revision now requires changes to proceed.Mar 28 2018, 12:23 AM

You will need to check the llvm-mca tests as well - some of those will have changed (sorry no update script - its manual!)

Fix the ports for SKL. Remove the register form override from most of the schedulers. I left the memory form overrides in place. Update the llvm-mca tests.

courbet added inline comments.Mar 29 2018, 2:01 AM
163 ↗(On Diff #140097)

I don't have a skylake server to test that, but I'm surprised that this is different from SKL. Is this a typo ?

The btver2 change looks good to me. Thanks!

RKSimon added inline comments.Mar 29 2018, 4:30 AM
166 ↗(On Diff #140097)

Remove BWWriteResGroup148 ((V?)PMULLDrm) overload and just leave BWWriteResGroup151 (VPMULLDYrm) ?

Same for others.

craig.topper added inline comments.Mar 29 2018, 7:55 AM
163 ↗(On Diff #140097)

SKX adds an extra FMA unit and vector multiplier in port 5 for AVX512. 512-bit operations combine the 256-bit port0 and 1 units. So an extra unit was added to maintain 2 ports for 512-bit.

I’m not sure the port 5 unit can be used for 128 and 256 bit, but the scheduler model thinks so. The scheduler model definitely doesn’t model 512 bit correctly, but that’s a larger problem than I want to fix here.

Remove 128-bit memory instructions from most of the models. Didn't touch Skylake Server because there are many things about it that I don't understand.

GGanesh added inline comments.Mar 29 2018, 9:20 PM
212 ↗(On Diff #140289)

This needs a fix definitely. I will do it!

RKSimon accepted this revision.Mar 30 2018, 7:35 AM

LGTM and then @GGanesh can fix the Zn model afterward.

This revision was not accepted when it landed; it landed in state Needs Review.Mar 30 2018, 10:00 PM
Closed by commit rL328914: [X86] Add SchedRW for PMULLD (authored by ctopper, committed by ). · Explain Why
This revision was automatically updated to reflect the committed changes.