This is an archive of the discontinued LLVM Phabricator instance.

[X86] Move ReadAfterLd functionality into X86FoldableSchedWrite (PR36957)
ClosedPublic

Authored by RKSimon on Oct 4 2018, 7:05 AM.

Details

Summary

Currently we hardcode instructions with ReadAfterLd if the register operands don't need to be available until the folded load has completed. This doesn't take into account the different load latencies of different memory operands (PR36957).

This patch adds a ReadAfterFold def into X86FoldableSchedWrite to replace ReadAfterLd, allowing us to specify the load latency at a scheduler class level.

I've added ReadAfterVec*Ld classes that match the XMM/Scl, XMM and YMM/ZMM WriteVecLoad classes that we currently use.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Oct 4 2018, 7:05 AM

Thanks Simon!

I am happy to see how this fixes the issue with read-advance being set to 3cy instead of 5cy for vector loads in BtVer2.

I also like the overall design (i.e. how the SchedRead is set).
I let Clement and Craig comment on the changes to the Intel models.
From my point of view, this patch looks good.

-Andrea

craig.topper added inline comments.Oct 4 2018, 10:37 AM
lib/Target/X86/X86SchedBroadwell.td
79

The load latencies on BDW seem weird given that HSW and SKL, the CPUs immediately before and after are the same.

lib/Target/X86/X86SchedSkylakeServer.td
74

Why this comment only changed in this model? The same comment existing in SNB, HSW, BDW, and SKLClient.

RKSimon added inline comments.Oct 4 2018, 10:42 AM
lib/Target/X86/X86SchedBroadwell.td
79

They're the same as BDW's load latencies - can we confirm if they are correct or not? I think this question has come up several times in the past....

lib/Target/X86/X86SchedSkylakeServer.td
74

Because my copy+paste skills are below average.

courbet added inline comments.Oct 5 2018, 4:58 AM
lib/Target/X86/X86SchedBroadwell.td
79

llvm-exegesis does not currently fully automate measurement of latency operations, nevertheless we can do:

echo -e 'mov (%rdi),%rax\n mov %rax, 1(%rdi)' | ./bin/llvm-exegesis -mode=latency -snippets-file=-
echo -e 'vmovups (%rdi),%xmm0\n vmovups %xmm0, 1(%rdi)' | ./bin/llvm-exegesis -mode=latency -snippets-file=-
echo -e 'vmovups (%rdi),%ymm0\n vmovups %ymm0, 1(%rdi)' | ./bin/llvm-exegesis -mode=latency -snippets-file=-

(with store-to-load forwarding)

echo -e 'mov (%rdi),%rax\n mov %rax, 1(%rdi)' | ./bin/llvm-exegesis -mode=latency -snippets-file=-
echo -e 'vmovups (%rdi),%xmm0\n vmovups %xmm0, 1(%rdi)' | ./bin/llvm-exegesis -mode=latency -snippets-file=-
echo -e 'vmovups (%rdi),%ymm0\n vmovups %ymm0, 1(%rdi)' | ./bin/llvm-exegesis -mode=latency -snippets-file=-

(without store-to-load forwarding)

On Haswell, this gives 5/6/7-15/16/17.
On Broadwell, this gives 5/6/7-15/16/17.

So they should be the same (5/6/7).

courbet added inline comments.Oct 5 2018, 5:05 AM
lib/Target/X86/X86SchedBroadwell.td
79

echo -e 'mov (%rdi),%rax\n mov %rax, 1(%rdi)' | ./bin/llvm-exegesis -mode=latency -snippets-file=-

This was supposed to be:
echo -e 'mov (%rdi),%rax\n mov %rax, (%rdi)' | ./bin/llvm-exegesis -mode=latency -snippets-file=-

courbet added inline comments.Oct 5 2018, 5:25 AM
lib/Target/X86/X86SchedSandyBridge.td
78

Looks good, except that store-to-load forwarding seems to incur an extra cycle in the ymm case.

lib/Target/X86/X86SchedSkylakeServer.td
78

I'm seeing 4.5(??)/5/6-15/15/16 here.

I've raised https://bugs.llvm.org/show_bug.cgi?id=39188 to discuss what to do about the broadwell load latencies - changing them would involve a notable update to the model and I'd be a lot more comfortable if we had (llvm-exegesis?) data to confirm them

I've raised https://bugs.llvm.org/show_bug.cgi?id=39188 to discuss what to do about the broadwell load latencies - changing them would involve a notable update to the model and I'd be a lot more comfortable if we had (llvm-exegesis?) data to confirm them

What's wrong with the data I gave above ? :)

I've raised https://bugs.llvm.org/show_bug.cgi?id=39188 to discuss what to do about the broadwell load latencies - changing them would involve a notable update to the model and I'd be a lot more comfortable if we had (llvm-exegesis?) data to confirm them

What's wrong with the data I gave above ? :)

:) I started creating a patch and realised how much of a diff it was going to cause - and that was before I'd altered any of the InstRW overrides..... I've CC'd you on PR39188, we can continue work on it there.

PR36957 is actually a problem for us (both llvm-mca accuracy and codegen scheduling), and I'd like to concentrate fixing this first.

RKSimon updated this revision to Diff 168464.Oct 5 2018, 6:20 AM

Refreshed comments

craig.topper accepted this revision.Oct 5 2018, 10:09 AM

LGTM. I agree we should deal with broadwell separately.

This revision is now accepted and ready to land.Oct 5 2018, 10:09 AM
This revision was automatically updated to reflect the committed changes.