This patch improves sched models for BT*rr X86 instrs removing unnecessary redefinitions of instr infos. This patch is based on results of D48222.
The current patch supports reg-reg instr versions but we're going to implement other versions as well.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
In general, looks good to me, but probably wait for someone else to review, too.
I'm assuming that you have run ninja check-llvm-tools-llvm-mca-x86 and it was ok. (the test coverage seems to be ok as-is.)
lib/Target/X86/X86Schedule.td | ||
---|---|---|
122 ↗ | (On Diff #155190) | Hmm. Nits:
|
Why does this need the 'r' at the end of WriteBTr? By existing convention the memory form would be called WriteBTLd. So WriteBT should be interpreted as non-memory anyway.
Usually I'm using "check-all" but I'll try "check-llvm-tools-llvm-mca-x86" as well. Yes, it works.
lib/Target/X86/X86InstrInfo.td | ||
---|---|---|
1799 ↗ | (On Diff #155190) | This patch does not deal with mem at all but I'm going to do it asap. |
I renamed WriteBTr with WriteBitTest.
I could not add memory version of the instructions because there were some issues. For example, if we have
BTR BTS BTC.
m,r.
def HWWriteBTRSCmr : SchedWriteRes<[]> {
let NumMicroOps = 11;
}
what about Latency and SchedWriteRes? There is no any relation to the default values (in our case it's WriteBitTest) that's why I don't understand how it works. Or there is such a relation thru instr info? Please, explain me.
And I got a lot of errors in the current sched tests. For example,
.../llvm/test/CodeGen/X86/schedule-x86_64.ll:2765:17: error: HASWELL-NEXT: expected string not found in input
; HASWELL-NEXT: btcw %si, (%rdx) # sched: [1:2.75]
^
<stdin>:391:2: note: scanning from here
btcw %si, (%rdx) # sched: [6:1.00]
^
agner says that the new lattency (6) is right but agner does not have any info about throughput. What should I do here? Should I publish the new version of tests (there could be a lot of changes) or what?
lib/Target/X86/X86Schedule.td | ||
---|---|---|
122 ↗ | (On Diff #155190) | Note that it *only* covers rr versions, and does not include mr versions. |
lib/Target/X86/X86Schedule.td | ||
---|---|---|
122 ↗ | (On Diff #155190) | I'm going to implement mr version asap that's why I use Pair here. |
Hi All,
I need your help!
After my changes in sched models I got the following:
..../llvm/test/CodeGen/X86/schedule-x86_64.ll:2765:17: error: HASWELL-NEXT: expected string not found in input
; HASWELL-NEXT: btcw %si, (%rdx) # sched: [1:2.75]
^
<stdin>:391:2: note: scanning from here
btcw %si, (%rdx) # sched: [6:1.00]
^
The new latency (6) looks more realistic than the current version (1) (rr version has 1 and mr version has 6) but throughput is unclear in both versions. What should I do? If I keep my version there could be to many changes in the current tests. Is it OK?
http://www.agner.org/optimize/instruction_tables.pdf, page 202, "Intel Haswell", "List of instruction timings and μop breakdown" appears to list all the BT* as having latency of 1.
lib/Target/X86/X86SchedHaswell.td | ||
---|---|---|
632 ↗ | (On Diff #155380) | @craig.topper @courbet @gchatelet These look completely wrong (and BTmr above) - and Broadwell appears to be missing them as well - any suggestions for the bit tests memory cases? |
lib/Target/X86/X86Schedule.td | ||
122 ↗ | (On Diff #155190) | If the memory cases are causing a problem it'd be acceptable to just do a reg-reg version for now: def WriteBitTest : SchedWrite // Bit Test - TODO add memory folding support And you can come back to the memory cases once we understand whats to be done. I just don't want a X86SchedWritePair def when you're not using the folded case. |
lib/Target/X86/X86SchedHaswell.td | ||
---|---|---|
632 ↗ | (On Diff #155380) | Skylake doesn't even have an InstRW for them. They're also missing from the copy of the database used by IACA that I have. I believe that's where Gadi got most of the info from. I wonder what IACA does if you feed it those instructions. |
http://www.agner.org/optimize/instruction_tables.pdf, page 202, "Intel Haswell", "List of instruction timings and μop breakdown" appears to list all the BT* as having latency of 1.
I mixed columns for latency and throughput: latency is missed.
Now we use WriteRes instead of *WriteResPair.
Folded versions of the intrs will be implemented in the next patches.
lib/Target/X86/X86SchedHaswell.td | ||
---|---|---|
632 ↗ | (On Diff #155380) | I can't tell for latencies because we do not support memory operands yet. For uops, I have working support in this patch: On haswell, this gives: --- mode: uops key: instructions: - 'BTC64mr RDI i_0x1x i_0x0x R9' - 'BTC64mr RDI i_0x1x i_0x64x RBX' - 'BTC64mr RDI i_0x1x i_0x128x RSI' - 'BTC64mr RDI i_0x1x i_0x192x RCX' - 'BTC64mr RDI i_0x1x i_0x256x R8' - 'BTC64mr RDI i_0x1x i_0x320x RDX' config: '' cpu_name: haswell llvm_triple: x86_64-unknown-linux-gnu num_repetitions: 10000 measurements: - { key: '3', value: 1.3771, debug_string: HWPort0 } - { key: '4', value: 1.8848, debug_string: HWPort1 } - { key: '5', value: 1.3687, debug_string: HWPort2 } - { key: '6', value: 0.728, debug_string: HWPort3 } - { key: '7', value: 1.0025, debug_string: HWPort4 } - { key: '8', value: 1.6272, debug_string: HWPort5 } - { key: '9', value: 2.1307, debug_string: HWPort6 } - { key: '10', value: 0.0002, debug_string: HWPort7 } error: '' info: instruction is parallel, repeating a random one. assembled_snippet: 5349C7C10100000048C7C30100000048C7C60100000048C7C10100000049C7C00100000048C7C2010000004C0FBB0F480FBB5F40480FBBB780000000480FBB8FC00000004C0FBB8700010000480FBB97400100004C0FBB0F480FBB5F40480FBBB780000000480FBB8FC00000004C0FBB8700010000480FBB97400100004C0FBB0F480FBB5F40480FBBB780000000480FBB8FC00000004C0FBB8700010000480FBB97400100004C0FBB0F480FBB5F40480FBBB780000000480FBB8FC00000005BC3 ... Other instructions are similar. This is a bit noisy unfortunately. This looks like 2*P23 (or maybe P23 + P237, P7 being unused for some reason ?) + 7*P0156 + P4. |
There's a shift uop in the BTR/BTS/BTC memory flow so thats port06 restricted. And the bit test uop should also be port 06 restricted just like the register form.
lib/Target/X86/X86Schedule.td | ||
---|---|---|
122 ↗ | (On Diff #155190) | Very minor, but please can you put the WriteBT defs next to the CMOV/SETCC defs - they are closer in behaviour. |
We decided that this patch won't include memory versions of instrs that's why I simply fixed tiny requirements like "place of WriteBitTest" and "removing of the TableGen CodeGenSchedule diffs".
LGTM as a reg-reg only patch - reg-mem to be handled once we have better numbers from intel/llvm-exegesis