This is an archive of the discontinued LLVM Phabricator instance.

Sched model improving on btver2: JFPU01 resource, vtestp* for xmm.
ClosedPublic

Authored by avt77 on Nov 8 2017, 7:26 AM.

Details

Diff Detail

Event Timeline

avt77 created this revision.Nov 8 2017, 7:26 AM
andreadb added inline comments.Nov 8 2017, 10:24 AM
lib/Target/X86/X86ScheduleBtVer2.td
202–203

My understanding is that ResourceCycles has to be 4 because you want to be able to compute a reciprocal throughput of 2. According to the amd documents, the pipes used by a float variable blend are "FPA|FPM".

I wonder whether we could have [JFPU0, JFPU1] instead of [JFPU01], and then change ResourceCycles to [2, 1].

A variable blend is 3 uOps, and internally, it is likely to be implemented as the sequence {VAND,VANDN,VOR}, where VAND and VANDN can execute in parallel.

It may be worthy to run some experiments to see which approach is better. That being said, your approach is not wrong, and I don't have a strong opinion on this.

202–208

let NumMicroOps = 3;

220–229

Variable blend instructions are 3 macro ops.
You should add let NumMicroOps = 3;.

avt77 added inline comments.Nov 9 2017, 12:20 AM
lib/Target/X86/X86ScheduleBtVer2.td
202–203

I have AMD laptop to make experiments but I don't have any perf test using a variable blend. Could anyone to help me with such a test?

davide removed a reviewer: davide.Nov 9 2017, 12:29 AM
avt77 updated this revision to Diff 122218.Nov 9 2017, 3:03 AM

I made updates required by andreadb except the question about JFPU01: if we need perf experiments please help me with perf test(s). If we agree to use the current implementation then I could commit the patch if you get me LGTM.

andreadb accepted this revision.Nov 9 2017, 4:19 AM

I made updates required by andreadb except the question about JFPU01: if we need perf experiments please help me with perf test(s). If we agree to use the current implementation then I could commit the patch if you get me LGTM.

I don't think it is worthy to run experiments just for the blendv case.
I just wanted to point out that there may be an alternative solution for that particular case. Having a resource cycle of 4 for the blendv is a bit odd (at least to me)..
But, as I wrote said, I really don't have a strong opinion on what numbers should go there (neither solution is ideal).
If you fix the remaining NumMicroOps issue I pointed out, then the patch looks good to me.

lib/Target/X86/X86ScheduleBtVer2.td
656–659

let NumMicroOps = 3;

663–666

Same.

This revision is now accepted and ready to land.Nov 9 2017, 4:19 AM
This revision was automatically updated to reflect the committed changes.