This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Treat (V)MOVAPD/(V)MOVUPD + (V)MOVAPS/(V)MOVUPS reg-reg instructions as moves not shuffles
AbandonedPublic

Authored by RKSimon on Mar 13 2018, 7:39 AM.

Details

Summary

Oddly the (V)MOVAPDrr/(V)MOVUPDrr and (V)MOVAPSrr/(V)MOVUPSrr instructions were classed as WriteFShuffle.

This patch changes the class to WriteMove, this matches what we already do for (V)MOVDQArr/(V)MOVDQUrr.

Found by llvm-mca

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Mar 13 2018, 7:39 AM

Hi Simon,

Can you elaborate on how you used llvm-mca to derive this ?

Using the compute_itineraries tool that I've mentioned in the past a haswell machine I see VMOVUPSrr and other variants use only HWPort5, which would make the WriteFShuffle more accurate.
For sandybridge, results are consistent (only HWPort5 is used).
(Note that we have an LLVM version of this tool for which we'll send an RFC shortly).

Also note that most CPU models override the sched class specifically for these instructions, e.g. on haswell:

def HWWriteResGroup4 : SchedWriteRes<[HWPort5]> {
  let Latency = 1;
  let NumMicroOps = 1;
  let ResourceCycles = [1];
}
def: InstRW<[HWWriteResGroup4], (instregex "VMOVUPSrr")>;

So the change here is not going to change specific CPUs.

For reference, for VMOVDQArr I see P0156 on haswell (consistent with the "Move" profile).

I saw this on btver2 - MOVDQA is reported as using JALU0/JALU1 while MOVAPS/MOVAPD reports JFPU0/JFPU1. And it appears to be affecting a couple of other targets with non-exhaustive scheduler model overloads.

Note that this shouldn't t affect SB etc. as you've said it already has overloaded the schedules for reg-reg moves.

What we could do is start splitting vector/scalar moves/loads/stores but this patch was all that was necessary for the cases I saw.

I haven't looked at the sched model details recently, but this seems like a step in the right direction...although we really need:
https://bugs.llvm.org/show_bug.cgi?id=36671 ?

Ie, most reg-reg moves on Zen, IvyBridge or later should be special-cased if we want an accurate simulation.

For example, Agner has this in section 19.13 of the micro-arch doc:
"Register-to-register move instructions are resolved at the register rename stage without using any execution units.
These instructions have zero latency. It is possible to do six such register renamings per clock cycle, and it is even
possible to rename the same register several times in one clock cycle."

I saw this on btver2 - MOVDQA is reported as using JALU0/JALU1 while MOVAPS/MOVAPD reports JFPU0/JFPU1. And it appears to be affecting a couple of other targets with non-exhaustive scheduler model overloads.

I see - if there's nothing in common between microarchitectures then ideally shouldn't we avoid putting a default ? I guess what happened here is that someone measured it on Intel and put that here, which ended up hurting btver2... I could see this opposite happening with this change if any intel uarch forgets to override this. Did you check that all intel uarchs override this ?

I saw this on btver2 - MOVDQA is reported as using JALU0/JALU1 while MOVAPS/MOVAPD reports JFPU0/JFPU1. And it appears to be affecting a couple of other targets with non-exhaustive scheduler model overloads.

I see - if there's nothing in common between microarchitectures then ideally shouldn't we avoid putting a default ? I guess what happened here is that someone measured it on Intel and put that here, which ended up hurting btver2... I could see this opposite happening with this change if any intel uarch forgets to override this. Did you check that all intel uarchs override this ?

As you can see from the test changes in the patch, these are limited to btver2/znver1/slm/glm - Agner's tables show no diff for pipe usage between dq/ps/pd aligned/unaligned rr moves for these cpus. The tests have pretty much coverage complete for SSE + AVX1/AVX2 cases.

IMO its much safer to default to WriteMove than WriteFShuffle for these instructions.

I saw this on btver2 - MOVDQA is reported as using JALU0/JALU1 while MOVAPS/MOVAPD reports JFPU0/JFPU1. And it appears to be affecting a couple of other targets with non-exhaustive scheduler model overloads.

I see - if there's nothing in common between microarchitectures then ideally shouldn't we avoid putting a default ? I guess what happened here is that someone measured it on Intel and put that here, which ended up hurting btver2... I could see this opposite happening with this change if any intel uarch forgets to override this. Did you check that all intel uarchs override this ?

As you can see from the test changes in the patch, these are limited to btver2/znver1/slm/glm - Agner's tables show no diff for pipe usage between dq/ps/pd aligned/unaligned rr moves for these cpus. The tests have pretty much coverage complete for SSE + AVX1/AVX2 cases.

IMO its much safer to default to WriteMove than WriteFShuffle for these instructions.

What about introducing WriteVMove then ? THis can be the same as WriteMove for btver2 and WriteFShuffle for intel CPUs.

I saw this on btver2 - MOVDQA is reported as using JALU0/JALU1 while MOVAPS/MOVAPD reports JFPU0/JFPU1. And it appears to be affecting a couple of other targets with non-exhaustive scheduler model overloads.

I see - if there's nothing in common between microarchitectures then ideally shouldn't we avoid putting a default ? I guess what happened here is that someone measured it on Intel and put that here, which ended up hurting btver2... I could see this opposite happening with this change if any intel uarch forgets to override this. Did you check that all intel uarchs override this ?

As you can see from the test changes in the patch, these are limited to btver2/znver1/slm/glm - Agner's tables show no diff for pipe usage between dq/ps/pd aligned/unaligned rr moves for these cpus. The tests have pretty much coverage complete for SSE + AVX1/AVX2 cases.

IMO its much safer to default to WriteMove than WriteFShuffle for these instructions.

What about introducing WriteVMove then ? THis can be the same as WriteMove for btver2 and WriteFShuffle for intel CPUs.

OK, I'll do that this morning.

RKSimon abandoned this revision.Mar 15 2018, 7:52 AM

Abandoning this now that D44471 has landed.