This patch closes AVX part of Bug https://bugs.llvm.org/show_bug.cgi?id=28573. It seems we have some problem here: the throughput of some instructions could have non integer (but float) value.
Details
Diff Detail
Event Timeline
lib/Target/X86/X86Schedule.td | ||
---|---|---|
48 ↗ | (On Diff #98624) | The LEA3 changes should be in their own patch. |
lib/Target/X86/X86ScheduleBtVer2.td | ||
132 | Isn't this handled by the use of JALU01 grouping JALU0 + JALU1 together? So it has a choice of 2 pipes and it will have a tp of 1cy whichever it goes down. | |
test/CodeGen/X86/sse2-schedule.ll | ||
6022 ↗ | (On Diff #98624) | Jaguar has a max of 1 load/cycle - so the tp should still be 1.00 |
It seems I fixed all known issues except proper support of vzeroupper and vzeroall: will try to do it in the next patch.
I slightly changed the algorithm of throughput calculation: if the instr sched model does not have cycles for the given instruction but it's valid then throughput is equal to lattency.
lib/Target/X86/X86ScheduleBtVer2.td | ||
---|---|---|
70 | I don't think adding these Cluster groups is necessary. TBH most of the ProcResource defs appear to be superfluous - most aren't used at all - we're just using the JFPU0/JFPU0/JFPU01 defs, with a few others for the longer op chain instructions. | |
377 | Better off using JFPU0 as that's what is actually bound to the buffer. Same for the others below. | |
385 | Shouldn't this def be something like the below, to show it will consume the AGU for a cycle? Same for the other loads. def WriteFAddYMLd: SchedWriteRes<[JLAGU,JFPU0]> { let Latency = 8; let ResourceCycles = [1,2]; } | |
test/CodeGen/X86/slow-unaligned-mem.ll | ||
89 ↗ | (On Diff #99105) | ???? |
lib/Target/X86/X86ScheduleBtVer2.td | ||
---|---|---|
70 | Several nstructions below could be executed on FPA or on FPM that's why we need a possibility to say it and that's why I created JFPFltCluster. Is it OK? | |
377 | Are you sure it's a good idea? FP0 includes VALU0, VIMUL and FPA. I'm using FPA because this instruction uses exactly FPA. | |
385 | I thought about but Software Optimization Guide does not show it (I mean it says about AGU but it does not include the additional cycle in its tables). Should I update the model? | |
test/CodeGen/X86/slow-unaligned-mem.ll | ||
89 ↗ | (On Diff #99105) | This test was written by hand that's why it's difficult to compare the results but the new version generates: BB#0:vxorps %ymm0, %ymm0, %ymm0 movl 4(%esp), %eax vmovups %ymm0, 32(%eax) vmovups %ymm0, (%eax) retl As you see we have vxorps between # BB#0: and 'movl'. I decided it's acceptable. Am I wrong? |
I've fixed all issues raised by Simon. In addition I re-checked all numbers: it seems they are correct now.
In this case I should move the test changes for zerroall and zeroupper in a separate patch as well, right?
I removed all changes related to throughput calculations. And I made all updates suggested by Simon.
lib/Target/X86/X86ScheduleBtVer2.td | ||
---|---|---|
22 | It is still the Retire Control Unit, its just that the FPU can only touch 44 of the entries. let MicroOpBufferSize = 64; // Retire Control Unit | |
25 | Don't remove whitespace. | |
93 | Undo this whitespace | |
176 | Don't remove whitespace. | |
401 | WriteVMULYPD For all these defs, please can you include the 'Y' to make it clear that its just the 256-bit case | |
473 | What is AVX11? Spelling: deafault -> default | |
481 | "VMOVAP(D|S)rm" etc. are memory loads - they should be in the Ld version | |
test/CodeGen/X86/avx-vzeroupper.ll | ||
163 ↗ | (On Diff #101712) | What is causing this? |
test/CodeGen/X86/recip-fastmath.ll | ||
344 ↗ | (On Diff #101712) | Latency should be 5cy |
lib/Target/X86/X86ScheduleBtVer2.td | ||
---|---|---|
481 | From my point of view rm-version store some register value into memory while mr-version loads the value from memory into the register. Am I right? | |
test/CodeGen/X86/recip-fastmath.ll | ||
344 ↗ | (On Diff #101712) | Why? In fact we should have tp 0.5 for XMM (see below). I'll fix it. VMOVAPD xmm1 xmm2 AVX 1 FPA|FPM 1 0,5 |
All notes from Simon were resolved. In addition I fixed numbers for some XMM versions of VMOVxxxx instructions.
lib/Target/X86/X86ScheduleBtVer2.td | ||
---|---|---|
471 | JLAGU, JSTC, JFPU01 ? |
Simon, finaly I'm able to create ClothAvx test executable with clang. And I created it with this patch and without it. And I got the following results on AMD laptop (CPU AMD A10-8700P Radeon R6, 10 Compute Cores 4C+6G 1.80 GHz):
C:\Users\andre\Downloads\working\ClothExe>type avxcloth-patch.log
0.00 57.67 60.91 60.28 27.26 62.62 62.56 67.97
SIMD Width = 1
28.43 23.37 22.71 22.93 22.65 23.00 23.07 23.14 22.71 22.89 22.94 22.81 22.79 23.32 23.14
SIMD Width = 4
36.39 57.45 57.61 56.97 57.05 57.82 57.12 57.38 57.08 57.57 57.28 57.88 57.43 56.96 57.07 57.33
SIMD Width = 8
68.71 71.78 71.31 71.78 71.79 71.67 71.97 71.79 71.25 72.55 71.96 71.52 72.04 70.67 71.78 70.39 70.94
C:\Users\andre\Downloads\working\ClothExe>type avxcloth-trunk.log
0.00 55.19 59.88 58.43 19.96 60.22 58.58 57.37 59.34 60.11
SIMD Width = 1
24.51 21.64 21.29 21.42 21.37 21.38 21.43 21.19 22.37 23.09 24.14 23.77 23.23 23.12 22.43 22.30
SIMD Width = 4
35.74 58.77 56.04 55.87 56.56 55.44 55.24 55.26 55.44 54.91 56.47 57.75 56.36 56.72 55.56 56.11 56.59
SIMD Width = 8
65.76 70.74 70.04 70.26 70.95 72.26 73.50 70.77 69.86 69.93 70.76 70.25 70.41 71.99
As you see the patched version is slightly faster than the trunk one. Are you sure you saw any degradation with this patch?
BTW, the number are Flps per second (they are being calculated every one second).
lib/Target/X86/X86ScheduleBtVer2.td | ||
---|---|---|
385 | This is a load so the AGU should be the first pipe def WriteDPPSLd: SchedWriteRes<[JLAGU, JFPU0, JFPU1]> { | |
398 | def WriteDPPDLd: SchedWriteRes<[JLAGU, JFPU0, JFPU1]> { | |
404 | Missing VTEST instructions | |
413 | Latency is 3 according to AMD64_16h_InstrLatency_1.1.xlsx | |
416 | You should probably just use a latency 3 here as its a convert+store. | |
419 | There's no such instruction as VCVTPH2PSmr | |
428 | WriteCVTPS2PHYSt | |
443 | let NumMicroOps = 10; | |
449 | let NumMicroOps = 11; | |
457 | Give the MOVNT and ROUND instructions their own entries | |
583 | VPTESTD? |
This patch was splitted on 4 related patches which will be committed instead of this one.
It is still the Retire Control Unit, its just that the FPU can only touch 44 of the entries.