This is an archive of the discontinued LLVM Phabricator instance.

[X86][SandyBridge] SBWriteResPair +5cy and +1uop Memory Folds
ClosedPublic

Authored by RKSimon on Mar 19 2018, 5:22 PM.

Details

Summary

As mentioned on D44647, this patch increases the default memory latency to +5cy as well as costing +1uop, which more closely matches what most custom cases are doing for reg-mem instructions.

I've left ReadAfterLd at 4cy at the moment, which seems to be correct for 'pure' loads - should I can increase this to 5 as well? What about WriteLoad etc?

As Sandy Bridge is currently our default generic model, this affects a lot of scheduling tests...

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Mar 19 2018, 5:22 PM

I think this mostly looks ok. Most of this just stuff SNB doens't support anyway. We should file bugs for the obvious bugs its showing.

test/CodeGen/X86/mmx-schedule.ll
1266 ↗(On Diff #139028)

So a ton of MMX instructions are missing from SNB?

test/CodeGen/X86/schedule-x86_32.ll
213 ↗(On Diff #139028)

So RETQ is in the scheduler model but RETL isn't?

test/CodeGen/X86/sse41-schedule.ll
353 ↗(On Diff #139028)

So DPPSrmi and VDPPSrmi are missing from the SNB scheduler model?

713 ↗(On Diff #139028)

This is an obvious mistake in the SNB model. I'll commit a fix.

I'll fix the ror/rol/shr/sar problems too.

And I'll fix TEST

courbet added inline comments.Mar 20 2018, 2:02 AM
lib/Target/X86/X86SchedSandyBridge.td
83 ↗(On Diff #139028)

Please fix the comment.

RKSimon updated this revision to Diff 139146.Mar 20 2018, 9:39 AM

Fixed comment and rebased patch.

Another issue is WriteLoad is set to 4cy but WriteFLoad/WriteVecLoad is set to 6cy memory latency - there are a lot of inconsistencies in here.

I'm open to any/all recommendations - plus I think we need to bear in mind that this model is used for generic (non-specific) x86_64 targets.

Fixed comment and rebased patch.

Another issue is WriteLoad is set to 4cy but WriteFLoad/WriteVecLoad is set to 6cy memory latency - there are a lot of inconsistencies in here.

When the data is in cache, I see latencies of 2 for MOV64rm, 3 for VMOVUPSrm, 4 for VMOVUPSYrm with llvm-exegesis (btw @craig.topper: why are all memory latencies shifted by 3? Is it to account for data that's not in the cache ?). So the values here seem reasonable if the 6cy is for the ymm version. Essentially this shows that the basic model is too simplistic for sandybridge; we really need the specialization per mnemonic. The specialization seem to be correct: VMOVUPSYrm has latency 7 (4+3), VMOVUPSrm 6 (3+3) and MOV64rm 5 (2+3).

2 cycle latency for MOV64rm seems low to me. There's an address calculation and a TLB lookup before it can even start accessing the cache.

Table 2-20 of https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf shows the load latencies according to Intel.

2 cycle latency for MOV64rm seems low to me. There's an address calculation and a TLB lookup before it can even start accessing the cache.

Table 2-20 of https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf shows the load latencies according to Intel.

The generated code had store-to-load forwarding, so the numbers here are indeed missing the memory access part. When we change the generator to prevent the forwarding, we see latencies of 11/9/7 for ymm0/xmm0/rax on sandybridge, which are consistent with the access-less numbers I mentioned above (4/3/2), plus access times according to the doc you pointed to (7/6/5): 11=7+4 / 9=6+3 / 7=2+5.
It's not obvious to me which number LLVM should be using (should we give it both numbers and teach it to recognize store-load forwarding opportunities and schedule accordingly ?)

We'll try to think of a way to integrate this in a principled way into llvm-exegesis (created PR36905).

I've left ReadAfterLd at 4cy at the moment, which seems to be correct for 'pure' loads

Isn't ReadAfterLd only used by instructions with folded loads? So shouldn't it be 5cyc?

craig.topper accepted this revision.Apr 5 2018, 7:08 PM

LGTM. I'd like to see this go in so I can start removing InstrRWs from Sandy Bridge to start fixing the missing ReadAfterLd.

This revision is now accepted and ready to land.Apr 5 2018, 7:08 PM
This revision was automatically updated to reflect the committed changes.