This patch improves sched models for BT*rr X86 instrs removing unnecessary redefinitions of instr infos. This patch is based on results of D48222.
The current patch supports reg-reg instr versions but we're going to implement other versions as well.
Details
Diff Detail
Event Timeline
In general, looks good to me, but probably wait for someone else to review, too.
I'm assuming that you have run ninja check-llvm-tools-llvm-mca-x86 and it was ok. (the test coverage seems to be ok as-is.)
lib/Target/X86/X86Schedule.td | ||
---|---|---|
122 | Hmm. Nits:
|
Why does this need the 'r' at the end of WriteBTr? By existing convention the memory form would be called WriteBTLd. So WriteBT should be interpreted as non-memory anyway.
Usually I'm using "check-all" but I'll try "check-llvm-tools-llvm-mca-x86" as well. Yes, it works.
lib/Target/X86/X86InstrInfo.td | ||
---|---|---|
1802 | This patch does not deal with mem at all but I'm going to do it asap. |
I renamed WriteBTr with WriteBitTest.
I could not add memory version of the instructions because there were some issues. For example, if we have
BTR BTS BTC.
m,r.
def HWWriteBTRSCmr : SchedWriteRes<[]> {
let NumMicroOps = 11;
}
what about Latency and SchedWriteRes? There is no any relation to the default values (in our case it's WriteBitTest) that's why I don't understand how it works. Or there is such a relation thru instr info? Please, explain me.
And I got a lot of errors in the current sched tests. For example,
.../llvm/test/CodeGen/X86/schedule-x86_64.ll:2765:17: error: HASWELL-NEXT: expected string not found in input
; HASWELL-NEXT: btcw %si, (%rdx) # sched: [1:2.75]
^
<stdin>:391:2: note: scanning from here
btcw %si, (%rdx) # sched: [6:1.00]
^
agner says that the new lattency (6) is right but agner does not have any info about throughput. What should I do here? Should I publish the new version of tests (there could be a lot of changes) or what?
lib/Target/X86/X86Schedule.td | ||
---|---|---|
122 | Note that it *only* covers rr versions, and does not include mr versions. |
lib/Target/X86/X86Schedule.td | ||
---|---|---|
122 | I'm going to implement mr version asap that's why I use Pair here. |
Hi All,
I need your help!
After my changes in sched models I got the following:
..../llvm/test/CodeGen/X86/schedule-x86_64.ll:2765:17: error: HASWELL-NEXT: expected string not found in input
; HASWELL-NEXT: btcw %si, (%rdx) # sched: [1:2.75]
^
<stdin>:391:2: note: scanning from here
btcw %si, (%rdx) # sched: [6:1.00]
^
The new latency (6) looks more realistic than the current version (1) (rr version has 1 and mr version has 6) but throughput is unclear in both versions. What should I do? If I keep my version there could be to many changes in the current tests. Is it OK?
http://www.agner.org/optimize/instruction_tables.pdf, page 202, "Intel Haswell", "List of instruction timings and μop breakdown" appears to list all the BT* as having latency of 1.
lib/Target/X86/X86SchedHaswell.td | ||
---|---|---|
636 | @craig.topper @courbet @gchatelet These look completely wrong (and BTmr above) - and Broadwell appears to be missing them as well - any suggestions for the bit tests memory cases? | |
lib/Target/X86/X86Schedule.td | ||
122 | If the memory cases are causing a problem it'd be acceptable to just do a reg-reg version for now: def WriteBitTest : SchedWrite // Bit Test - TODO add memory folding support And you can come back to the memory cases once we understand whats to be done. I just don't want a X86SchedWritePair def when you're not using the folded case. |
lib/Target/X86/X86SchedHaswell.td | ||
---|---|---|
636 | Skylake doesn't even have an InstRW for them. They're also missing from the copy of the database used by IACA that I have. I believe that's where Gadi got most of the info from. I wonder what IACA does if you feed it those instructions. |
http://www.agner.org/optimize/instruction_tables.pdf, page 202, "Intel Haswell", "List of instruction timings and μop breakdown" appears to list all the BT* as having latency of 1.
I mixed columns for latency and throughput: latency is missed.
Now we use WriteRes instead of *WriteResPair.
Folded versions of the intrs will be implemented in the next patches.
lib/Target/X86/X86SchedHaswell.td | ||
---|---|---|
636 | I can't tell for latencies because we do not support memory operands yet. For uops, I have working support in this patch: On haswell, this gives: --- mode: uops key: instructions: - 'BTC64mr RDI i_0x1x i_0x0x R9' - 'BTC64mr RDI i_0x1x i_0x64x RBX' - 'BTC64mr RDI i_0x1x i_0x128x RSI' - 'BTC64mr RDI i_0x1x i_0x192x RCX' - 'BTC64mr RDI i_0x1x i_0x256x R8' - 'BTC64mr RDI i_0x1x i_0x320x RDX' config: '' cpu_name: haswell llvm_triple: x86_64-unknown-linux-gnu num_repetitions: 10000 measurements: - { key: '3', value: 1.3771, debug_string: HWPort0 } - { key: '4', value: 1.8848, debug_string: HWPort1 } - { key: '5', value: 1.3687, debug_string: HWPort2 } - { key: '6', value: 0.728, debug_string: HWPort3 } - { key: '7', value: 1.0025, debug_string: HWPort4 } - { key: '8', value: 1.6272, debug_string: HWPort5 } - { key: '9', value: 2.1307, debug_string: HWPort6 } - { key: '10', value: 0.0002, debug_string: HWPort7 } error: '' info: instruction is parallel, repeating a random one. assembled_snippet: 5349C7C10100000048C7C30100000048C7C60100000048C7C10100000049C7C00100000048C7C2010000004C0FBB0F480FBB5F40480FBBB780000000480FBB8FC00000004C0FBB8700010000480FBB97400100004C0FBB0F480FBB5F40480FBBB780000000480FBB8FC00000004C0FBB8700010000480FBB97400100004C0FBB0F480FBB5F40480FBBB780000000480FBB8FC00000004C0FBB8700010000480FBB97400100004C0FBB0F480FBB5F40480FBBB780000000480FBB8FC00000005BC3 ... Other instructions are similar. This is a bit noisy unfortunately. This looks like 2*P23 (or maybe P23 + P237, P7 being unused for some reason ?) + 7*P0156 + P4. |
There's a shift uop in the BTR/BTS/BTC memory flow so thats port06 restricted. And the bit test uop should also be port 06 restricted just like the register form.
lib/Target/X86/X86Schedule.td | ||
---|---|---|
122 | Very minor, but please can you put the WriteBT defs next to the CMOV/SETCC defs - they are closer in behaviour. |
We decided that this patch won't include memory versions of instrs that's why I simply fixed tiny requirements like "place of WriteBitTest" and "removing of the TableGen CodeGenSchedule diffs".
LGTM as a reg-reg only patch - reg-mem to be handled once we have better numbers from intel/llvm-exegesis
Why isn't this WriteBTLd ?