Differential D44428

[X86][SSE] Treat (V)MOVAPD/(V)MOVUPD + (V)MOVAPS/(V)MOVUPS reg-reg instructions as moves not shuffles
Abandoned · Public
Authored by RKSimon on Mar 13 2018, 7:39 AM.
Details

Summary

Oddly, the (V)MOVAPDrr/(V)MOVUPDrr and (V)MOVAPSrr/(V)MOVUPSrr instructions were classed as WriteFShuffle. This patch changes the class to WriteMove, matching what we already do for (V)MOVDQArr/(V)MOVDQUrr.

Found by llvm-mca.
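For readers without the diff handy, the change amounts to swapping the scheduling class on the reg-reg forms. The snippet below is an illustrative sketch only, not the literal diff; the in-tree X86InstrSSE.td definitions are multiclass-based and carry encodings and patterns this sketch omits:

  // Sketch only: a packed FP reg-reg move tagged with the shuffle class.
  def MOVAPSrr : I<0x28, MRMSrcReg, (outs VR128:$dst), (ins VR128:$src),
                   "movaps\t{$src, $dst|$dst, $src}", []>,
                 Sched<[WriteFShuffle]>;   // before: classed as a shuffle
  // After this patch the same definition would instead use:
  //             Sched<[WriteMove]>;       // matches (V)MOVDQArr/(V)MOVDQUrr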
Diff Detail
Event Timeline

Comment Actions

Hi Simon,

Can you elaborate on how you used llvm-mca to derive this? Using the compute_itineraries tool that I've mentioned in the past, on a haswell machine I see VMOVUPSrr and other variants use only HWPort5, which would make WriteFShuffle the more accurate class.

Also note that most CPU models override the sched class specifically for these instructions, e.g. on haswell:

  def HWWriteResGroup4 : SchedWriteRes<[HWPort5]> {
    let Latency = 1;
    let NumMicroOps = 1;
    let ResourceCycles = [1];
  }
  def: InstRW<[HWWriteResGroup4], (instregex "VMOVUPSrr")>;

So the change here is not going to change anything for those specific CPUs.

Comment Actions

For reference, for VMOVDQArr I see P0156 on haswell (consistent with the "Move" profile).

Comment Actions

I saw this on btver2 - MOVDQA is reported as using JALU0/JALU1 while MOVAPS/MOVAPD reports JFPU0/JFPU1. And it appears to be affecting a couple of other targets with non-exhaustive scheduler model overrides.

Note that this shouldn't affect SB etc. - as you've said, it has already overridden the schedules for reg-reg moves.

What we could do is start splitting vector/scalar moves/loads/stores, but this patch was all that was necessary for the cases I saw.

Comment Actions

I haven't looked at the sched model details recently, but this seems like a step in the right direction... although we really need: i.e., most reg-reg moves on Zen, IvyBridge or later should be special-cased if we want an accurate simulation. For example, Agner has this in section 19.13 of the micro-arch doc:
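(Editorial aside: the mechanism referred to above is the usual split between a generic SchedWrite class, a per-CPU WriteRes binding, and per-instruction InstRW overrides such as HWWriteResGroup4. The sketch below uses assumed latencies and resource choices rather than values copied from the in-tree models:)

  // Generic class, shared by all X86 scheduling models (sketch):
  def WriteFShuffle : SchedWrite;

  // Each CPU model binds the class to its own resources; a Haswell-style
  // model sends it to the shuffle port:
  def : WriteRes<WriteFShuffle, [HWPort5]> { let Latency = 1; }

  // A per-instruction InstRW override (like HWWriteResGroup4 above) takes
  // precedence over whatever class the instruction definition names, which
  // is why CPUs with exhaustive overrides are unaffected by changing the
  // default class.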
Comment Actions

I see - if there's nothing in common between microarchitectures, then ideally shouldn't we avoid putting a default? I guess what happened here is that someone measured it on Intel and put that here, which ended up hurting btver2... I could see the opposite happening with this change if any Intel uarch forgets to override this. Did you check that all Intel uarchs override this?
Comment Actions

As you can see from the test changes in the patch, these are limited to btver2/znver1/slm/glm - Agner's tables show no diff in pipe usage between dq/ps/pd aligned/unaligned rr moves for these CPUs. The tests have pretty much complete coverage for the SSE + AVX1/AVX2 cases. IMO it's much safer to default to WriteMove than WriteFShuffle for these instructions.
Comment Actions

What about introducing WriteVMove then? This can be the same as WriteMove for btver2 and WriteFShuffle for Intel CPUs.

Comment Actions

OK, I'll do that this morning.
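A rough sketch of what the WriteVMove idea discussed above could look like; the class name comes from the comment, but the resource bindings and latencies below are illustrative assumptions, not the actual follow-up patch:

  // Hypothetical sketch of the proposed split, not the real follow-up change.
  def WriteVMove : SchedWrite;   // reg-reg vector FP moves

  // btver2 (Jaguar) could treat these like plain moves on the ALU pipes:
  def : WriteRes<WriteVMove, [JALU01]> { let Latency = 1; }

  // An Intel model that still issues them on the shuffle port could keep
  // the old behaviour:
  def : WriteRes<WriteVMove, [HWPort5]> { let Latency = 1; }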
Revision Contents
Diff 138188
lib/Target/X86/X86InstrSSE.td
test/CodeGen/X86/avx-schedule.ll
test/CodeGen/X86/sha-schedule.ll
test/CodeGen/X86/sse-schedule.ll
test/CodeGen/X86/sse2-schedule.ll
test/CodeGen/X86/sse41-schedule.ll