This is an archive of the discontinued LLVM Phabricator instance.

[X86] Model 256-bit AVX instructions in the AMD Jaguar scheduler (PR28573)
AbandonedPublic

Authored by avt77 on May 11 2017, 6:08 AM.

Details

Summary

This patch closes AVX part of Bug https://bugs.llvm.org/show_bug.cgi?id=28573. It seems we have some problem here: the throughput of some instructions could have non integer (but float) value.

Diff Detail

Event Timeline

avt77 created this revision.May 11 2017, 6:08 AM
RKSimon added inline comments.May 11 2017, 6:41 AM
lib/Target/X86/X86Schedule.td
48 ↗(On Diff #98624)

The LEA3 changes should be in their own patch.

lib/Target/X86/X86ScheduleBtVer2.td
137

Isn't this handled by the use of JALU01 grouping JALU0 + JALU1 together? So it has a choice of 2 pipes and it will have a tp of 1cy whichever it goes down.

test/CodeGen/X86/sse2-schedule.ll
6022

Jaguar has a max of 1 load/cycle - so the tp should still be 1.00

avt77 updated this revision to Diff 98747.May 12 2017, 5:02 AM

It seems I fixed all known issues except proper support of vzeroupper and vzeroall: will try to do it in the next patch.

avt77 updated this revision to Diff 99105.May 15 2017, 11:52 PM

I slightly changed the algorithm of throughput calculation: if the instr sched model does not have cycles for the given instruction but it's valid then throughput is equal to lattency.

RKSimon added inline comments.May 16 2017, 2:27 PM
lib/Target/X86/X86ScheduleBtVer2.td
71

I don't think adding these Cluster groups is necessary. TBH most of the ProcResource defs appear to be superfluous - most aren't used at all - we're just using the JFPU0/JFPU0/JFPU01 defs, with a few others for the longer op chain instructions.

349

Better off using JFPU0 as that's what is actually bound to the buffer. Same for the others below.

357

Shouldn't this def be something like the below, to show it will consume the AGU for a cycle? Same for the other loads.

def WriteFAddYMLd: SchedWriteRes<[JLAGU,JFPU0]>  {
  let Latency = 8;
  let ResourceCycles = [1,2];
}
test/CodeGen/X86/slow-unaligned-mem.ll
89

????

avt77 added inline comments.May 17 2017, 8:47 AM
lib/Target/X86/X86ScheduleBtVer2.td
71

Several nstructions below could be executed on FPA or on FPM that's why we need a possibility to say it and that's why I created JFPFltCluster. Is it OK?
And I created JFPIntCluster cluster just in case: should I remove it?

349

Are you sure it's a good idea? FP0 includes VALU0, VIMUL and FPA. I'm using FPA because this instruction uses exactly FPA.
At the same time if we use FPU0 then througput will be 2/3 = 0.666 and that's wrong.
Or you mean that our instruction is FP and it should not deal with VALU0, VIMUL? In this case we should change the algorithm again.

357

I thought about but Software Optimization Guide does not show it (I mean it says about AGU but it does not include the additional cycle in its tables). Should I update the model?

test/CodeGen/X86/slow-unaligned-mem.ll
89

This test was written by hand that's why it's difficult to compare the results but the new version generates:

BB#0:

vxorps  %ymm0, %ymm0, %ymm0
movl    4(%esp), %eax
vmovups %ymm0, 32(%eax)
vmovups %ymm0, (%eax)
retl

As you see we have vxorps between # BB#0: and 'movl'. I decided it's acceptable. Am I wrong?

avt77 updated this revision to Diff 100077.May 24 2017, 6:32 AM

I've fixed all issues raised by Simon. In addition I re-checked all numbers: it seems they are correct now.

RKSimon edited edge metadata.Jun 1 2017, 2:22 AM

I really don't understand why you are having to change the throughput calculation as part of this - split this as another patch?

lib/Target/X86/X86ScheduleBtVer2.td
353

Why WriteFAddYY not WriteFAddY ?

359

WriteFAddYLd ?

365

WriteFDivY?

avt77 added a comment.Jun 1 2017, 5:53 AM

I really don't understand why you are having to change the throughput calculation as part of this - split this as another patch?

In this case I should move the test changes for zerroall and zeroupper in a separate patch as well, right?

avt77 updated this revision to Diff 101712.Jun 7 2017, 5:50 AM

I removed all changes related to throughput calculations. And I made all updates suggested by Simon.

RKSimon added inline comments.Jun 7 2017, 7:04 AM
lib/Target/X86/X86ScheduleBtVer2.td
22

It is still the Retire Control Unit, its just that the FPU can only touch 44 of the entries.

let MicroOpBufferSize = 64; // Retire Control Unit
25

Don't remove whitespace.

97

Undo this whitespace

176

Don't remove whitespace.

373

WriteVMULYPD

For all these defs, please can you include the 'Y' to make it clear that its just the 256-bit case

445

What is AVX11?

Spelling: deafault -> default

453

"VMOVAP(D|S)rm" etc. are memory loads - they should be in the Ld version

test/CodeGen/X86/avx-vzeroupper.ll
163 ↗(On Diff #101712)

What is causing this?

test/CodeGen/X86/recip-fastmath.ll
345

Latency should be 5cy

avt77 added inline comments.Jun 8 2017, 4:41 AM
lib/Target/X86/X86ScheduleBtVer2.td
453

From my point of view rm-version store some register value into memory while mr-version loads the value from memory into the register. Am I right?

test/CodeGen/X86/recip-fastmath.ll
345

Why? In fact we should have tp 0.5 for XMM (see below). I'll fix it.

VMOVAPD xmm1 xmm2 AVX 1 FPA|FPM 1 0,5
VMOVAPD ymm1 ymm2 AVX 2 FPA|FPM 1 1
VMOVAPS xmm1 xmm2 AVX 1 FPA|FPM 1 0,5
VMOVAPS ymm1 ymm2 AVX 2 FPA|FPM 1 1

avt77 updated this revision to Diff 101916.Jun 8 2017, 7:22 AM

All notes from Simon were resolved. In addition I fixed numbers for some XMM versions of VMOVxxxx instructions.

avt77 updated this revision to Diff 105387.Jul 6 2017, 2:59 AM

We have now only 256-bit ops: it makes the patch smaller.

RKSimon added inline comments.Jul 6 2017, 6:01 AM
lib/Target/X86/X86ScheduleBtVer2.td
389

VRVPPSYr -> VRCPPSYr ?

395

VRVPPSYm -> VRCPPSYm ?

397

WriteVDPPSY

403

WriteVDPPSYLd

419

VROUNDYP(S|D)rm ?

avt77 retitled this revision from AMD Jaguar scheduler doesn't correctly model 256-bit AVX instructions to [X86] Model 256-bit AVX instructions in the AMD Jaguar scheduler (PR28573).Jul 6 2017, 7:24 AM
avt77 updated this revision to Diff 105616.Jul 7 2017, 4:00 AM

Simon, thank you for all these catches: I fixed them.

avt77 updated this revision to Diff 106583.Jul 13 2017, 11:57 PM

I merged this patch with trunk. Now it's a part 2 othe initial patch.

RKSimon added inline comments.Aug 4 2017, 8:27 AM
lib/Target/X86/X86ScheduleBtVer2.td
374

JLAGU, JSTC, JFPU01 ?

avt77 added a comment.Sep 25 2017, 8:51 AM

Simon, finaly I'm able to create ClothAvx test executable with clang. And I created it with this patch and without it. And I got the following results on AMD laptop (CPU AMD A10-8700P Radeon R6, 10 Compute Cores 4C+6G 1.80 GHz):

C:\Users\andre\Downloads\working\ClothExe>type avxcloth-patch.log
0.00 57.67 60.91 60.28 27.26 62.62 62.56 67.97
SIMD Width = 1
28.43 23.37 22.71 22.93 22.65 23.00 23.07 23.14 22.71 22.89 22.94 22.81 22.79 23.32 23.14
SIMD Width = 4
36.39 57.45 57.61 56.97 57.05 57.82 57.12 57.38 57.08 57.57 57.28 57.88 57.43 56.96 57.07 57.33
SIMD Width = 8
68.71 71.78 71.31 71.78 71.79 71.67 71.97 71.79 71.25 72.55 71.96 71.52 72.04 70.67 71.78 70.39 70.94
C:\Users\andre\Downloads\working\ClothExe>type avxcloth-trunk.log
0.00 55.19 59.88 58.43 19.96 60.22 58.58 57.37 59.34 60.11
SIMD Width = 1
24.51 21.64 21.29 21.42 21.37 21.38 21.43 21.19 22.37 23.09 24.14 23.77 23.23 23.12 22.43 22.30
SIMD Width = 4
35.74 58.77 56.04 55.87 56.56 55.44 55.24 55.26 55.44 54.91 56.47 57.75 56.36 56.72 55.56 56.11 56.59
SIMD Width = 8
65.76 70.74 70.04 70.26 70.95 72.26 73.50 70.77 69.86 69.93 70.76 70.25 70.41 71.99

As you see the patched version is slightly faster than the trunk one. Are you sure you saw any degradation with this patch?
BTW, the number are Flps per second (they are being calculated every one second).

avt77 updated this revision to Diff 116656.Sep 26 2017, 3:47 AM

I fixed an issue raised by Simon.

avt77 updated this revision to Diff 117336.Oct 2 2017, 5:56 AM

I re-based avx-schedule.ll test.

Thanks, please can you add the f16c-schedule.ll costs as well?

Also, please add DPPS/DPPD (xmm) costs as well.

avt77 updated this revision to Diff 117803.Oct 5 2017, 6:15 AM

All updates required by Simon were done.

avt77 updated this revision to Diff 118550.Oct 11 2017, 12:39 AM

I made changes related to SSE4.1 and F16C instructions in Jaguar.

RKSimon added inline comments.Oct 14 2017, 7:07 AM
lib/Target/X86/X86ScheduleBtVer2.td
346

let NumMicroOps = 10;

352

let NumMicroOps = 11;

357

This is a load so the AGU should be the first pipe

def WriteDPPSLd: SchedWriteRes<[JLAGU, JFPU0, JFPU1]> {
360

Give the MOVNT and ROUND instructions their own entries

368

def WriteDPPDLd: SchedWriteRes<[JLAGU, JFPU0, JFPU1]> {

374

Missing VTEST instructions

383

Latency is 3 according to AMD64_16h_InstrLatency_1.1.xlsx

386

You should probably just use a latency 3 here as its a convert+store.

389

There's no such instruction as VCVTPH2PSmr

398

WriteCVTPS2PHYSt

486

VPTESTD?

avt77 abandoned this revision.Nov 1 2017, 9:13 AM

This patch was splitted on 4 related patches which will be committed instead of this one.