I can combine them into a single file.
The reason for the separation is to make it easier to track coverage.
Note that there is also the PPRO ISA set.
Dec 6 2017
I can combine them into a single file.
Dec 5 2017
Dec 3 2017
Dec 1 2017
- Added AVX tests for 32-bit and 64-bit.
- Updated the AVX2 64-bit tests to include an xmm15 test in addition to the xmm6 tests.
Nov 30 2017
Nov 29 2017
Removed duplicates per Simon's comment
Nov 27 2017
Following Simon's comment, added the retl scheduling information.
Nov 26 2017
From the tables I have, I could not find any scheduling difference between retl and retq; they are both mapped to:
latency = 2 cycles + 5 cycles load latency.
ports: 23, 0156, 6
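As a sanity check, the two entries can be compared mechanically. A minimal Python sketch (the table layout is hypothetical; only the latency and port numbers come from the tables quoted above):

```python
# Hypothetical per-instruction scheduling records, mirroring the numbers
# above: 2 cycles latency + 5 cycles load latency, ports 23, 0156, 6.
sched = {
    "retl": {"latency": 2, "load_latency": 5, "ports": ["23", "0156", "6"]},
    "retq": {"latency": 2, "load_latency": 5, "ports": ["23", "0156", "6"]},
}

def same_scheduling(a, b):
    """True when both instructions map to identical scheduling info."""
    return sched[a] == sched[b]

print(same_scheduling("retl", "retq"))  # -> True
```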
AVX2-32.s: Removed redundant # signs.
AVX2-64.s: Duplicated the tests to use XMM8 and YMM9
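For what it's worth, that kind of duplication can be scripted. A small sketch (the test line and the register map are illustrative, not the actual test content):

```python
# Duplicate an assembly test line for a high register, so both the low
# (xmm0-xmm7) and high (xmm8+) encodings get covered.
def duplicate_for_regs(line, reg_map):
    """Return the original line plus one copy per register substitution."""
    out = [line]
    for old, new in reg_map.items():
        if old in line:
            out.append(line.replace(old, new))
    return out

print(duplicate_for_regs("vpaddd %xmm0, %xmm1, %xmm2",
                         {"%xmm2": "%xmm8"}))
# -> ['vpaddd %xmm0, %xmm1, %xmm2', 'vpaddd %xmm0, %xmm1, %xmm8']
```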
Nov 25 2017
Ah, you mean the XMM registers.
Nov 23 2017
You mean that each instruction should cover all registers?
I did a quick check, and it produces a huge number of tests, exceeding 200k when AVX512 is included.
I therefore chose to use only representatives.
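A back-of-the-envelope estimate in Python (the form count and register count are assumed round numbers for illustration, not the real DB figures):

```python
# Back-of-the-envelope estimate (both counts are assumptions, not the real
# DB figures): crossing every instruction form with every encodable register.
num_forms = 6400        # assumed number of instruction forms, incl. AVX512
regs_per_form = 32      # zmm0..zmm31 under AVX512
exhaustive = num_forms * regs_per_form
print(exhaustive)       # -> 204800, i.e. over 200k tests

# Using representatives instead: e.g. one low and one high register per form.
representative = num_forms * 2
print(representative)   # -> 12800
```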
Nov 22 2017
Nov 21 2017
Nov 20 2017
Sorted the instructions per Zvi's comment.
Nov 16 2017
Removed the old scheduling for the GATHER instructions, which had overridden the new ones.
Fixed the overall load latency attribute for HSW from 4 cycles to 5.
Good catch. The old scheduling had overridden the new one in the .td file; I will update the diff file.
Nov 15 2017
Ah you're right. For SNB this is fine.
My comment was for HSW. Sorry.
Unfortunately, I cannot give you the exact numbers.
Overall, across ~900 benchmarks, the new scheduling yields a performance gain of ~6%.
Simon, splitting the instructions based on the CPUID bit actually corresponds to splitting by "CPU extension".
There are a total of 62 extensions.
I can do it according to extensions; the only problem is that it will complicate the generation of the tests.
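For reference, the grouping itself is trivial; the complication is the multiplied output. A sketch (the rows are illustrative; the real data comes from the internal DB):

```python
from collections import defaultdict

# Minimal sketch of splitting test generation by CPU extension instead of
# ISA set. Rows are illustrative (iclass, extension) pairs.
rows = [
    ("VPADDD", "AVX2"),
    ("VPSLLD", "AVX2"),
    ("ADDPS",  "SSE"),
    ("CLZERO", "AMD"),
]

by_extension = defaultdict(list)
for iclass, extension in rows:
    by_extension[extension].append(iclass)

# One output file per extension; with 62 extensions this multiplies the
# number of generated files the script has to manage.
for ext, iclasses in sorted(by_extension.items()):
    print(ext, iclasses)
```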
I updated the list of ISA Sets.
I think we can consider removing the REAL and the Protected ISA sets, right?
Performance runs are done on three main benchmark suites: SPEC CPU 2017, Geekbench4, and the EEMBC suites (automotive, denbench, coremark-pro, networking, telecom).
Please ignore my previous comment. These instructions are not used by the compiler (Ring 1).
I will remove them from the ISA Set.
The BBX* ISA sets are new AVX512 instructions. Here are some examples (format: iclass, extension, category, iform, isa_set):
VPSLCTLASTD, AVX512EVEX, BBX2, VPSLCTLASTD_XMMu32_MASKmskw_MEMu32_BBX, BBX2_128
VPSLCTLASTD, AVX512EVEX, BBX2, VPSLCTLASTD_XMMu32_MASKmskw_XMMu32_BBX, BBX2_128
VPSLCTLASTD, AVX512EVEX, BBX2, VPSLCTLASTD_YMMu32_MASKmskw_MEMu32_BBX, BBX2_256
VPSLCTLASTD, AVX512EVEX, BBX2, VPSLCTLASTD_YMMu32_MASKmskw_YMMu32_BBX, BBX2_256
VPSLCTLASTD, AVX512EVEX, BBX2, VPSLCTLASTD_ZMMu32_MASKmskw_MEMu32_BBX, BBX2_512
VPSLCTLASTD, AVX512EVEX, BBX2, VPSLCTLASTD_ZMMu32_MASKmskw_ZMMu32_BBX, BBX2_512
VPSLCTLASTQ, AVX512EVEX, BBX2, VPSLCTLASTQ_XMMu64_MASKmskw_MEMu64_BBX, BBX2_128
VPSLCTLASTQ, AVX512EVEX, BBX2, VPSLCTLASTQ_XMMu64_MASKmskw_XMMu64_BBX, BBX2_128
VPSLCTLASTQ, AVX512EVEX, BBX2, VPSLCTLASTQ_YMMu64_MASKmskw_MEMu64_BBX, BBX2_256
VPSLCTLASTQ, AVX512EVEX, BBX2, VPSLCTLASTQ_YMMu64_MASKmskw_YMMu64_BBX, BBX2_256
VPSLCTLASTQ, AVX512EVEX, BBX2, VPSLCTLASTQ_ZMMu64_MASKmskw_MEMu64_BBX, BBX2_512
VPSLCTLASTQ, AVX512EVEX, BBX2, VPSLCTLASTQ_ZMMu64_MASKmskw_ZMMu64_BBX, BBX2_512
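If it helps, rows in this format are easy to split into named fields. A small Python sketch (the field names are taken from the format line above):

```python
# Parse rows of the "iclass, extension, category, iform, isa_set" format
# shown above into named fields.
FIELDS = ("iclass", "extension", "category", "iform", "isa_set")

def parse_row(line):
    """Split one comma-separated row into a dict keyed by field name."""
    return dict(zip(FIELDS, (f.strip() for f in line.split(","))))

row = parse_row("VPSLCTLASTD, AVX512EVEX, BBX2, "
                "VPSLCTLASTD_XMMu32_MASKmskw_MEMu32_BBX, BBX2_128")
print(row["isa_set"])  # -> BBX2_128
```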
Nov 14 2017
OK. The CLZERO instruction is going to be part of the AMD ISA set. It is very recent and is part of the Ryzen CPU.
Do you mean the following LWP instructions, which belong to the XOP ISA set?
Nov 13 2017
Yes. SLM is good enough.
Sorry. Here is a more readable table of I486 + I486REAL:
I486 will include the encoding + asm of the following instructions:
Unfortunately, I am using an internal DB with python scripts.
Updated diff after rebase
Note that WriteALU means 1 cycle of latency using ports 0156.
Nov 12 2017
Updated diff file after adding the following instructions to the SKX scheduling file:
Updated diff following Craig's comment.
Nov 9 2017
Oct 31 2017
Oct 30 2017
Oct 24 2017
Oct 22 2017
Oct 21 2017
Oct 18 2017
Oct 17 2017
Oct 16 2017
This modified version of the SkylakeClient scheduling model does indeed show gains on SKL, without apparent regressions, when compared to the existing SKL scheduling.
We are therefore motivated to continue updating all X86 target scheduling models.
The next step, once this patch is committed, is to submit the Broadwell scheduling model for review.
Oct 15 2017
Yes. The X86 Bit Test is known to run slowly on all X86 CPUs.
Oct 11 2017
The .td file is actually generated by a script, and as you may have noticed, it already contains some (though not many) regular expressions where possible.
In the SKX scheduling, for example, there are many regular expressions that cover the broadcast, mask, and zeroing bits for all relevant AVX512 instructions.
The problem is that there are not many opportunities to group instructions into regular expressions.
For example, MMX_* is spread across groups 1, 2, 3, 8, 9, 12, etc.
The differences between the groups can be in the latency, the number of uops, or the ports used by the uops.
This makes it hard to use regular expressions.
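To make the constraint concrete: a pattern is only usable when every instruction it matches shares one scheduling signature. A small Python sketch (the signatures below are illustrative, not the real values):

```python
import re

# Illustrative (latency, uop count, ports) signatures; not the real values.
sched = {
    "MMX_PADDB":  (1, 1, "015"),
    "MMX_PADDW":  (1, 1, "015"),
    "MMX_PMULHW": (5, 1, "0"),   # different latency/ports -> separate group
}

def can_share_pattern(pattern, table):
    """A regex is valid only if all matching instructions share one signature."""
    sigs = {sig for name, sig in table.items() if re.match(pattern, name)}
    return len(sigs) == 1

print(can_share_pattern(r"MMX_PADD", sched))  # -> True
print(can_share_pattern(r"MMX_", sched))      # -> False
```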
Updated diff following Simon's comment to remove the COMMON check prefix from the bmi2 scheduling test.
Oct 10 2017
Oct 9 2017
Oct 8 2017
Updated diff file after rebase.
Oct 4 2017
Oct 1 2017
Good point. We had multiple rounds of checks, and it seems that some (though not all) of the memory instructions do indeed need additional latency, depending on whether they load an address or data, load a 128-, 256-, or 512-bit vector, both load and store, or only store.
I intend to go back and fix this for Skylake Client and for Haswell.
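The kind of adjustment I have in mind can be sketched as follows (the base latencies and the extra-latency values are illustrative assumptions, not the final numbers):

```python
# Sketch of a memory-dependent latency adjustment. The 5-cycle value and
# the +1 for wide vectors are assumptions for illustration only.
EXTRA_LOAD_LATENCY = 5  # e.g. the overall load latency in cycles

def total_latency(base, kind, vector_bits=0):
    """kind: 'load', 'store', 'load_store', or 'none'."""
    extra = 0
    if kind in ("load", "load_store"):
        extra += EXTRA_LOAD_LATENCY
        if vector_bits >= 256:  # assume wider vector loads cost one more cycle
            extra += 1
    return base + extra

print(total_latency(1, "load", 128))  # -> 6
print(total_latency(1, "load", 512))  # -> 7
print(total_latency(1, "store"))      # -> 1
```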
Sep 30 2017
Sep 28 2017
I had to replace all the test files with new ones, as they were using the AVX512 intrinsics, which are now in the process of being replaced by IR.
As a result I created 2 new tests which were compiled from the following regression tests:
Sep 27 2017
Sep 25 2017
Updated diff after running update_llc_test_checks on the 8 tests and re-applying the changes to get rid of the "End of function" comments.
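That clean-up step amounts to filtering the comment lines out of the regenerated tests. A minimal sketch (the sample lines are illustrative; update_llc_test_checks itself is unchanged):

```python
# Drop the "End of function" comment lines from a regenerated test file.
def strip_end_of_function(lines):
    """Return the lines with any 'End of function' comments removed."""
    return [l for l in lines if "End of function" not in l]

lines = [
    "; CHECK-LABEL: foo:",
    ";   retq",
    "; End of function foo",
]
print(strip_end_of_function(lines))
# -> ['; CHECK-LABEL: foo:', ';   retq']
```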
Updated diff after a rebase.