Page MenuHomePhabricator

[X86] Improved sched models for X86 BT*rr instructions
ClosedPublic

Authored by avt77 on Jul 12 2018, 8:30 AM.

Details

Summary

This patch improves sched models for BT*rr X86 instrs removing unnecessary redefinitions of instr infos. This patch is based on results of D48222.
The current patch supports reg-reg instr versions but we're going to implement other versions as well.

Diff Detail

Repository
rL LLVM

Event Timeline

avt77 created this revision.Jul 12 2018, 8:30 AM
lebedev.ri accepted this revision.Jul 12 2018, 9:01 AM

In general, looks good to me, but probably wait for someone else to review, too.

I'm assuming that you have run ninja check-llvm-tools-llvm-mca-x86 and it was ok. (the test coverage seems to be ok as-is.)

lib/Target/X86/X86Schedule.td
122 ↗(On Diff #155190)

Hmm. Nits:

  1. (not a nit) The suffix r notes that only the non-mem versions are covered. I wonder if we can convey that somehow better.
  2. These cover 4 different bit-test instructions - bt,bt[rcs] Naming this WriteBTr may be confizing - is this only about bt instruction? How about calling it WriteBitTest?
This revision is now accepted and ready to land.Jul 12 2018, 9:01 AM

Why does this need the 'r' at the end of WriteBTr? By existing convention the memory form would be called WriteBTLd. So WriteBT should be interpreted as non-memory anyway.

RKSimon requested changes to this revision.Jul 12 2018, 9:41 AM
RKSimon added inline comments.
lib/Target/X86/X86InstrInfo.td
1799 ↗(On Diff #155190)

Why isn't this WriteBTLd ?

lib/Target/X86/X86Schedule.td
122 ↗(On Diff #155190)

I'm confused - this should be probably be called WriteBT. But then you've declared this as a X86SchedWritePair but you're not using the folded half of the pair?

This revision now requires changes to proceed.Jul 12 2018, 9:41 AM
avt77 added a comment.Jul 13 2018, 2:05 AM

I'm assuming that you have run ninja check-llvm-tools-llvm-mca-x86 and it was ok. (the test coverage seems to be ok as-is.)

Usually I'm using "check-all" but I'll try "check-llvm-tools-llvm-mca-x86" as well. Yes, it works.

lib/Target/X86/X86InstrInfo.td
1799 ↗(On Diff #155190)

This patch does not deal with mem at all but I'm going to do it asap.

avt77 updated this revision to Diff 155380.Jul 13 2018, 7:45 AM

I renamed WriteBTr with WriteBitTest.
I could not add memory version of the instructions because there were some issues. For example, if we have

BTR BTS BTC.
m,r.
def HWWriteBTRSCmr : SchedWriteRes<[]> {
let NumMicroOps = 11;
}

what about Latency and SchedWriteRes? There is no any relation to the default values (in our case it's WriteBitTest) that's why I don't understand how it works. Or there is such a relation thru instr info? Please, explain me.

And I got a lot of errors in the current sched tests. For example,

.../llvm/test/CodeGen/X86/schedule-x86_64.ll:2765:17: error: HASWELL-NEXT: expected string not found in input
; HASWELL-NEXT: btcw %si, (%rdx) # sched: [1:2.75]

^

<stdin>:391:2: note: scanning from here
btcw %si, (%rdx) # sched: [6:1.00]
^

agner says that the new lattency (6) is right but agner does not have any info about throughput. What should I do here? Should I publish the new version of tests (there could be a lot of changes) or what?

lebedev.ri added inline comments.Jul 13 2018, 8:03 AM
lib/Target/X86/X86Schedule.td
122 ↗(On Diff #155190)

Note that it *only* covers rr versions, and does not include mr versions.
So yeah, maybe it shouldn't be X86SchedWritePair, but X86WriteRes?

avt77 added inline comments.Jul 16 2018, 12:41 AM
lib/Target/X86/X86Schedule.td
122 ↗(On Diff #155190)

I'm going to implement mr version asap that's why I use Pair here.

avt77 added a comment.Jul 16 2018, 7:40 AM

Hi All,
I need your help!

After my changes in sched models I got the following:

..../llvm/test/CodeGen/X86/schedule-x86_64.ll:2765:17: error: HASWELL-NEXT: expected string not found in input
; HASWELL-NEXT: btcw %si, (%rdx) # sched: [1:2.75]

^

<stdin>:391:2: note: scanning from here
btcw %si, (%rdx) # sched: [6:1.00]
^

The new latency (6) looks more realistic than the current version (1) (rr version has 1 and mr version has 6) but throughput is unclear in both versions. What should I do? If I keep my version there could be to many changes in the current tests. Is it OK?

Hi All,
I need your help!

After my changes in sched models I got the following:

..../llvm/test/CodeGen/X86/schedule-x86_64.ll:2765:17: error: HASWELL-NEXT: expected string not found in input
; HASWELL-NEXT: btcw %si, (%rdx) # sched: [1:2.75]

^

<stdin>:391:2: note: scanning from here
btcw %si, (%rdx) # sched: [6:1.00]
^

The new latency (6) looks more realistic than the current version (1) (rr version has 1 and mr version has 6) but throughput is unclear in both versions. What should I do? If I keep my version there could be to many changes in the current tests. Is it OK?

http://www.agner.org/optimize/instruction_tables.pdf, page 202, "Intel Haswell", "List of instruction timings and μop breakdown" appears to list all the BT* as having latency of 1.

RKSimon added inline comments.
lib/Target/X86/X86SchedHaswell.td
632 ↗(On Diff #155380)

@craig.topper @courbet @gchatelet These look completely wrong (and BTmr above) - and Broadwell appears to be missing them as well - any suggestions for the bit tests memory cases?

lib/Target/X86/X86Schedule.td
122 ↗(On Diff #155190)

If the memory cases are causing a problem it'd be acceptable to just do a reg-reg version for now:

def WriteBitTest : SchedWrite // Bit Test - TODO add memory folding support

And you can come back to the memory cases once we understand whats to be done. I just don't want a X86SchedWritePair def when you're not using the folded case.

craig.topper added inline comments.Jul 16 2018, 9:56 AM
lib/Target/X86/X86SchedHaswell.td
632 ↗(On Diff #155380)

Skylake doesn't even have an InstRW for them.

They're also missing from the copy of the database used by IACA that I have. I believe that's where Gadi got most of the info from. I wonder what IACA does if you feed it those instructions.

avt77 added a comment.Jul 17 2018, 2:46 AM

http://www.agner.org/optimize/instruction_tables.pdf, page 202, "Intel Haswell", "List of instruction timings and μop breakdown" appears to list all the BT* as having latency of 1.

I mixed columns for latency and throughput: latency is missed.

avt77 updated this revision to Diff 155868.Jul 17 2018, 6:21 AM

Now we use WriteRes instead of *WriteResPair.
Folded versions of the intrs will be implemented in the next patches.

courbet added inline comments.Jul 18 2018, 8:22 AM
lib/Target/X86/X86SchedHaswell.td
632 ↗(On Diff #155380)

I can't tell for latencies because we do not support memory operands yet.

For uops, I have working support in this patch:
https://reviews.llvm.org/D48935

On haswell, this gives:

---
mode:            uops
key:             
  instructions:    
    - 'BTC64mr RDI i_0x1x  i_0x0x  R9'
    - 'BTC64mr RDI i_0x1x  i_0x64x  RBX'
    - 'BTC64mr RDI i_0x1x  i_0x128x  RSI'
    - 'BTC64mr RDI i_0x1x  i_0x192x  RCX'
    - 'BTC64mr RDI i_0x1x  i_0x256x  R8'
    - 'BTC64mr RDI i_0x1x  i_0x320x  RDX'
  config:          ''
cpu_name:        haswell
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:    
  - { key: '3', value: 1.3771, debug_string: HWPort0 }
  - { key: '4', value: 1.8848, debug_string: HWPort1 }
  - { key: '5', value: 1.3687, debug_string: HWPort2 }
  - { key: '6', value: 0.728, debug_string: HWPort3 }
  - { key: '7', value: 1.0025, debug_string: HWPort4 }
  - { key: '8', value: 1.6272, debug_string: HWPort5 }
  - { key: '9', value: 2.1307, debug_string: HWPort6 }
  - { key: '10', value: 0.0002, debug_string: HWPort7 }
error:           ''
info:            instruction is parallel, repeating a random one.
assembled_snippet: 5349C7C10100000048C7C30100000048C7C60100000048C7C10100000049C7C00100000048C7C2010000004C0FBB0F480FBB5F40480FBBB780000000480FBB8FC00000004C0FBB8700010000480FBB97400100004C0FBB0F480FBB5F40480FBBB780000000480FBB8FC00000004C0FBB8700010000480FBB97400100004C0FBB0F480FBB5F40480FBBB780000000480FBB8FC00000004C0FBB8700010000480FBB97400100004C0FBB0F480FBB5F40480FBBB780000000480FBB8FC00000005BC3
...

Other instructions are similar.

This is a bit noisy unfortunately. This looks like 2*P23 (or maybe P23 + P237, P7 being unused for some reason ?) + 7*P0156 + P4.

There's a shift uop in the BTR/BTS/BTC memory flow so thats port06 restricted. And the bit test uop should also be port 06 restricted just like the register form.

RKSimon added inline comments.Jul 20 2018, 2:59 AM
lib/Target/X86/X86Schedule.td
122 ↗(On Diff #155190)

Very minor, but please can you put the WriteBT defs next to the CMOV/SETCC defs - they are closer in behaviour.

Please can you remove the TableGen CodeGenSchedule diffs

avt77 updated this revision to Diff 156741.Jul 23 2018, 4:07 AM

We decided that this patch won't include memory versions of instrs that's why I simply fixed tiny requirements like "place of WriteBitTest" and "removing of the TableGen CodeGenSchedule diffs".

RKSimon accepted this revision.Jul 31 2018, 1:54 AM

LGTM as a reg-reg only patch - reg-mem to be handled once we have better numbers from intel/llvm-exegesis

This revision is now accepted and ready to land.Jul 31 2018, 1:54 AM

@avt77 This was committed at rL338365 but reverted at rL338369 due to some extra files included in the patch - please can you recommit a fixed version?

This revision was automatically updated to reflect the committed changes.