This is an archive of the discontinued LLVM Phabricator instance.

[X86][BdVer2] Transfer delays from the integer to the floating point unit.
ClosedPublic

Authored by lebedev.ri on Jan 27 2019, 8:18 AM.

Details

Summary

I'm unable to find this number in the "AMD SOG for family 15h".
llvm-exegesis measures the latencies of these instructions as 2,
which matches the latencies specified in "AMD SOG for family 15h".

However if we look at Agner, Microarchitecture, "AMD Bulldozer, Piledriver,
Steamroller and Excavator pipeline", "Data delay between different execution
domains", the int->ivec transfer is listed as 8..10cy of additional latency.

Also, Agner's "Instruction tables", for Piledriver, lists their latencies as 12,
which is consistent with 2cy from exegesis / AMD SOG + 10cy transfer delay.

Additional data point comes from the fact that Agner's "Instruction tables",
for Jaguar, lists their latencies as 8; and "AMD SOG for family 16h" does
state the +6cy int->ivec delay, which is consistent with instr latency of 1 or 2.
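
The arithmetic behind that consistency argument can be spelled out in a few lines. This is only a restatement of the numbers quoted above, not a new measurement, and the helper name `total_latencies` is made up for illustration:

```python
# All constants are copied from the summary above; nothing is measured here.

def total_latencies(instr_latencies, transfer_delays):
    """Every possible sum of an in-domain latency and a transfer delay."""
    return sorted({i + t for i in instr_latencies for t in transfer_delays})

# bdver2 (Piledriver): 2cy instruction latency, 8..10cy int->ivec delay (Agner).
piledriver = total_latencies([2], [8, 9, 10])
# Jaguar: instruction latency of 1 or 2, +6cy int->ivec delay (AMD SOG 16h).
jaguar = total_latencies([1, 2], [6])

print("Piledriver candidates:", piledriver, "Agner's table: 12")
print("Jaguar candidates:    ", jaguar, "Agner's table: 8")
```

Agner's 12 for Piledriver sits at the top of the 2 + 8..10 range, and his 8 for Jaguar corresponds to an instruction latency of 2 plus the documented 6cy.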

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision. Jan 27 2019, 8:18 AM
lebedev.ri edited the summary of this revision. Jan 27 2019, 8:19 AM

I'm unable to find this number in the "AMD SOG for family 15h".
llvm-exegesis measures the latencies of these instructions as 2,
which matches the latencies specified in "AMD SOG for family 15h".

Can you print out the code snippet used by llvm-exegesis to measure that latency?

Sure, I should have done that.

$ ./bin/llvm-exegesis -mode=latency -opcode-name=VPINSRBrr
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-639c95.o
---
mode:            latency
key:             
  instructions:    
    - 'VPINSRBrr XMM7 XMM7 R15D i_0x1'
  config:          ''
  register_initial_values: 
    - 'XMM7=0x0'
    - 'R15D=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:    
  - { key: latency, value: 2.0296, per_snippet_value: 2.0296 }
error:           ''
info:            Repeating a single explicitly serial instruction
assembled_snippet: 41574883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F3C244883C41041BF00000000C4C34120FF01C4C34120FF01C4C34120FF01C4C34120FF01C4C34120FF01C4C34120FF01C4C34120FF01C4C34120FF01C4C34120FF01C4C34120FF01C4C34120FF01C4C34120FF01C4C34120FF01C4C34120FF01C4C34120FF01C4C34120FF01415FC3

$ /usr/bin/objdump -d /tmp/snippet-639c95.o 

/tmp/snippet-639c95.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <foo>:
       0:       41 57                   push   %r15
       2:       48 83 ec 10             sub    $0x10,%rsp
       6:       c7 04 24 00 00 00 00    movl   $0x0,(%rsp)
       d:       c7 44 24 04 00 00 00    movl   $0x0,0x4(%rsp)
      14:       00 
      15:       c7 44 24 08 00 00 00    movl   $0x0,0x8(%rsp)
      1c:       00 
      1d:       c7 44 24 0c 00 00 00    movl   $0x0,0xc(%rsp)
      24:       00 
      25:       c5 fa 6f 3c 24          vmovdqu (%rsp),%xmm7
      2a:       48 83 c4 10             add    $0x10,%rsp
      2e:       41 bf 00 00 00 00       mov    $0x0,%r15d
      34:       c4 c3 41 20 ff 01       vpinsrb $0x1,%r15d,%xmm7,%xmm7
      3a:       c4 c3 41 20 ff 01       vpinsrb $0x1,%r15d,%xmm7,%xmm7
....
    ea88:       c4 c3 41 20 ff 01       vpinsrb $0x1,%r15d,%xmm7,%xmm7
    ea8e:       c4 c3 41 20 ff 01       vpinsrb $0x1,%r15d,%xmm7,%xmm7
    ea94:       41 5f                   pop    %r15
    ea96:       c3                      retq   

I have additionally tried measuring the actual MCA snippets:

$ cat /tmp/snippet.s
# LLVM-EXEGESIS-LIVEIN EAX
# LLVM-EXEGESIS-LIVEIN XMM0
vpinsrb $0, %eax, %xmm0, %xmm0
vpinsrb $1, %eax, %xmm0, %xmm0
$ ./bin/llvm-exegesis -mode=latency -snippets-file=/tmp/snippet.s
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-67eb60.o
---
mode:            latency
key:             
  instructions:    
    - 'VPINSRBrr XMM0 XMM0 EAX i_0x0'
    - 'VPINSRBrr XMM0 XMM0 EAX i_0x1'
  config:          ''
  register_initial_values: []
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:    
  - { key: latency, value: 2.0317, per_snippet_value: 4.0634 }
error:           ''
info:            ''
assembled_snippet: C4E37920C000C4E37920C001C4E37920C000C4E37920C001C4E37920C000C4E37920C001C4E37920C000C4E37920C001C4E37920C000C4E37920C001C4E37920C000C4E37920C001C4E37920C000C4E37920C001C4E37920C000C4E37920C001C3
...

and

$ cat /tmp/snippet.s
# LLVM-EXEGESIS-LIVEIN EAX
# LLVM-EXEGESIS-LIVEIN XMM0
add %eax, %eax
vpinsrb $0, %eax, %xmm0, %xmm0
vpinsrb $1, %eax, %xmm0, %xmm0
$ ./bin/llvm-exegesis -mode=latency -snippets-file=/tmp/snippet.s
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-562e79.o
---
mode:            latency
key:             
  instructions:    
    - 'ADD32rr EAX EAX EAX'
    - 'VPINSRBrr XMM0 XMM0 EAX i_0x0'
    - 'VPINSRBrr XMM0 XMM0 EAX i_0x1'
  config:          ''
  register_initial_values: []
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:    
  - { key: latency, value: 1.4034, per_snippet_value: 4.2102 }
error:           ''
info:            ''
assembled_snippet: 01C0C4E37920C000C4E37920C00101C0C4E37920C000C4E37920C00101C0C4E37920C000C4E37920C00101C0C4E37920C000C4E37920C00101C0C4E37920C000C4E37920C00101C0C3

Am I holding llvm-exegesis wrong, or does this mean the info in Agner is incorrect here?
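
When reading these outputs, note that the reported `value` in all three runs is consistent with `per_snippet_value` divided by the number of instructions in the snippet. A small sketch to check that, using only the numbers pasted above (the helper name `per_instruction` is hypothetical):

```python
# (description, instructions per snippet, value, per_snippet_value)
runs = [
    ("VPINSRBrr via -opcode-name", 1, 2.0296, 2.0296),
    ("2x vpinsrb", 2, 2.0317, 4.0634),
    ("add + 2x vpinsrb", 3, 1.4034, 4.2102),
]

def per_instruction(per_snippet_value, num_instructions):
    """Average cycles per instruction implied by a per-snippet figure."""
    return per_snippet_value / num_instructions

for name, n, value, per_snippet in runs:
    print(f"{name}: {per_instruction(per_snippet, n):.4f} (reported {value})")
```

Read this way, adding the `add` barely changes the cycles per snippet (4.06 vs 4.21); the drop to 1.4 comes mostly from dividing by three instructions instead of two.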

andreadb added a comment. (Edited) Jan 29 2019, 9:27 AM
      2e:       41 bf 00 00 00 00       mov    $0x0,%r15d
      34:       c4 c3 41 20 ff 01       vpinsrb $0x1,%r15d,%xmm7,%xmm7
      3a:       c4 c3 41 20 ff 01       vpinsrb $0x1,%r15d,%xmm7,%xmm7
....
    ea88:       c4 c3 41 20 ff 01       vpinsrb $0x1,%r15d,%xmm7,%xmm7

If there is really a bypass delay, then that code snippet is not going to expose it.
The real bottleneck in that code snippet is the dependency on %xmm7. R15 is only set once at the beginning by a zero-move, and then never updated again.

In this case, each cycle the scheduler issues a uOp to move R15 to the FPU. However, the vpinsrb can only be issued every other cycle due to the dependency on XMM7. That means that, in the long run, any bypass delay is going to be hidden by the latency caused by the data dependency on XMM7.
Basically, that code snippet is not good to measure those kinds of delays...
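
To illustrate the point, here is a toy dataflow model of that snippet. It is a sketch under simplifying assumptions that do not come from any AMD document: r15 is written once at time 0, its transfers to the FP domain are fully pipelined, and an instruction starts as soon as both inputs are available. The function name and parameters are made up for this example.

```python
def cycles_per_instruction(n, instr_latency, transfer_delay):
    """Toy model of n back-to-back `vpinsrb $1, %r15d, %xmm7, %xmm7`.

    Assumptions: r15 is written once at time 0, so after `transfer_delay`
    cycles its value is available in the FP domain for every instruction;
    each vpinsrb must also wait for the previous one to produce xmm7.
    """
    gpr_ready = transfer_delay   # r15 visible in the FP domain from here on
    xmm_ready = 0                # initial xmm7 value is available immediately
    for _ in range(n):
        start = max(xmm_ready, gpr_ready)
        xmm_ready = start + instr_latency
    return xmm_ready / n

for delay in (0, 8, 10):
    print(f"transfer delay {delay:2d}cy -> "
          f"{cycles_per_instruction(10000, 2, delay):.4f} cycles/instr")
```

In this model the transfer delay is paid once at the start and then amortized over 10000 repetitions, so the result stays at about 2 cycles per instruction whatever delay is plugged in.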

(I edited my previous comment. However, the system didn't send another email.)

Very nice observation.
Let's try something better.

$ cat /tmp/snippet.s ; ./bin/llvm-exegesis -mode=latency -snippets-file=/tmp/snippet.s
# LLVM-EXEGESIS-DEFREG EAX 0
# LLVM-EXEGESIS-DEFREG XMM0 0
# LLVM-EXEGESIS-DEFREG XMM1 0
vpinsrb $0, %eax, %xmm0, %xmm1
vpextrb $0, %xmm1, %eax
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-a71a33.o
---
mode:            latency
key:             
  instructions:    
    - 'VPINSRBrr XMM1 XMM0 EAX i_0x0'
    - 'VPEXTRBrr EAX XMM1 i_0x0'
  config:          ''
  register_initial_values: 
    - 'EAX=0x0'
    - 'XMM0=0x0'
    - 'XMM1=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:    
  - { key: latency, value: 11.0282, per_snippet_value: 22.0564 }
error:           ''
info:            ''
assembled_snippet: B8000000004883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F04244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F0C244883C410C4E37920C800C4E37914C800C4E37920C800C4E37914C800C4E37920C800C4E37914C800C4E37920C800C4E37914C800C4E37920C800C4E37914C800C4E37920C800C4E37914C800C4E37920C800C4E37914C800C4E37920C800C4E37914C800C3
...
$ /usr/bin/objdump -d /tmp/snippet-a71a33.o

/tmp/snippet-a71a33.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <foo>:
       0:       b8 00 00 00 00          mov    $0x0,%eax
       5:       48 83 ec 10             sub    $0x10,%rsp
       9:       c7 04 24 00 00 00 00    movl   $0x0,(%rsp)
      10:       c7 44 24 04 00 00 00    movl   $0x0,0x4(%rsp)
      17:       00 
      18:       c7 44 24 08 00 00 00    movl   $0x0,0x8(%rsp)
      1f:       00 
      20:       c7 44 24 0c 00 00 00    movl   $0x0,0xc(%rsp)
      27:       00 
      28:       c5 fa 6f 04 24          vmovdqu (%rsp),%xmm0
      2d:       48 83 c4 10             add    $0x10,%rsp
      31:       48 83 ec 10             sub    $0x10,%rsp
      35:       c7 04 24 00 00 00 00    movl   $0x0,(%rsp)
      3c:       c7 44 24 04 00 00 00    movl   $0x0,0x4(%rsp)
      43:       00 
      44:       c7 44 24 08 00 00 00    movl   $0x0,0x8(%rsp)
      4b:       00 
      4c:       c7 44 24 0c 00 00 00    movl   $0x0,0xc(%rsp)
      53:       00 
      54:       c5 fa 6f 0c 24          vmovdqu (%rsp),%xmm1
      59:       48 83 c4 10             add    $0x10,%rsp
      5d:       c4 e3 79 20 c8 00       vpinsrb $0x0,%eax,%xmm0,%xmm1
      63:       c4 e3 79 14 c8 00       vpextrb $0x0,%xmm1,%eax
...
    eab1:       c4 e3 79 20 c8 00       vpinsrb $0x0,%eax,%xmm0,%xmm1
    eab7:       c4 e3 79 14 c8 00       vpextrb $0x0,%xmm1,%eax
    eabd:       c3                      retq   

Though I suppose that still has the dependency on xmm1.
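
As a rough sanity check on the round-trip number, one can subtract the in-domain latencies and split the remainder between the two domain crossings. This is only a back-of-the-envelope sketch: the 2cy figure for vpextrb and the even split between the int->ivec and ivec->int directions are assumptions made for illustration, not something measured in this thread.

```python
ROUND_TRIP = 22.0564          # per_snippet_value reported above
VPINSRB_LATENCY = 2           # from the earlier xmm-chain measurement
ASSUMED_VPEXTRB_LATENCY = 2   # ASSUMPTION: not measured in this thread

in_domain = VPINSRB_LATENCY + ASSUMED_VPEXTRB_LATENCY
extra = ROUND_TRIP - in_domain
per_crossing = extra / 2      # ASSUMPTION: both crossings cost the same

print(f"extra per round trip: {extra:.2f}cy, per crossing: {per_crossing:.2f}cy")
```

Under those assumptions the extra cost per crossing lands inside the 8..10cy range Agner lists; a different vpextrb latency or an uneven split would shift the individual numbers but not the roughly 18cy total.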

andreadb accepted this revision. Feb 1 2019, 2:52 AM

Thanks for running that experiment. There is clearly an 8-10cy delay.

Out of curiosity, do you get the same latency if the insertion/extraction is at index $1 (i.e. not at index 0)?

That being said, I think this change is good, and it is consistent with the latency value defined for WriteVecMoveToGpr and WriteVecMoveFromGpr. So, LGTM

This revision is now accepted and ready to land. Feb 1 2019, 2:52 AM

Thanks for running that experiment. There is clearly an 8-10cy delay.

Out of curiosity, do you get the same latency if the insertion/extraction is at index $1 (i.e. not at index 0)?

I did, the results appear to be consistent:

$ cat /tmp/snippet.s ; ./bin/llvm-exegesis -mode=latency -snippets-file=/tmp/snippet.s
# LLVM-EXEGESIS-DEFREG EAX 0
# LLVM-EXEGESIS-DEFREG XMM0 0
# LLVM-EXEGESIS-DEFREG XMM1 0
vpinsrb $1, %eax, %xmm0, %xmm1
vpextrb $1, %xmm1, %eax
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-2b8c21.o
---
mode:            latency
key:             
  instructions:    
    - 'VPINSRBrr XMM1 XMM0 EAX i_0x1'
    - 'VPEXTRBrr EAX XMM1 i_0x1'
  config:          ''
  register_initial_values: 
    - 'EAX=0x0'
    - 'XMM0=0x0'
    - 'XMM1=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:    
  - { key: latency, value: 11.0372, per_snippet_value: 22.0744 }
error:           ''
info:            ''
assembled_snippet: B8000000004883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F04244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F0C244883C410C4E37920C801C4E37914C801C4E37920C801C4E37914C801C4E37920C801C4E37914C801C4E37920C801C4E37914C801C4E37920C801C4E37914C801C4E37920C801C4E37914C801C4E37920C801C4E37914C801C4E37920C801C4E37914C801C3
...
$ cat /tmp/snippet.s ; ./bin/llvm-exegesis -mode=latency -snippets-file=/tmp/snippet.s
# LLVM-EXEGESIS-DEFREG EAX 0
# LLVM-EXEGESIS-DEFREG XMM0 0
# LLVM-EXEGESIS-DEFREG XMM1 0
vpinsrb $0, %eax, %xmm0, %xmm1
vpextrb $1, %xmm1, %eax
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-3b8c6f.o
---
mode:            latency
key:             
  instructions:    
    - 'VPINSRBrr XMM1 XMM0 EAX i_0x0'
    - 'VPEXTRBrr EAX XMM1 i_0x1'
  config:          ''
  register_initial_values: 
    - 'EAX=0x0'
    - 'XMM0=0x0'
    - 'XMM1=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:    
  - { key: latency, value: 11.0304, per_snippet_value: 22.0608 }
error:           ''
info:            ''
assembled_snippet: B8000000004883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F04244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F0C244883C410C4E37920C800C4E37914C801C4E37920C800C4E37914C801C4E37920C800C4E37914C801C4E37920C800C4E37914C801C4E37920C800C4E37914C801C4E37920C800C4E37914C801C4E37920C800C4E37914C801C4E37920C800C4E37914C801C3
...
$ cat /tmp/snippet.s ; ./bin/llvm-exegesis -mode=latency -snippets-file=/tmp/snippet.s
# LLVM-EXEGESIS-DEFREG EAX 0
# LLVM-EXEGESIS-DEFREG XMM0 0
# LLVM-EXEGESIS-DEFREG XMM1 0
vpinsrb $1, %eax, %xmm0, %xmm1
vpextrb $0, %xmm1, %eax
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-5f6929.o
---
mode:            latency
key:             
  instructions:    
    - 'VPINSRBrr XMM1 XMM0 EAX i_0x1'
    - 'VPEXTRBrr EAX XMM1 i_0x0'
  config:          ''
  register_initial_values: 
    - 'EAX=0x0'
    - 'XMM0=0x0'
    - 'XMM1=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:    
  - { key: latency, value: 11.0333, per_snippet_value: 22.0666 }
error:           ''
info:            ''
assembled_snippet: B8000000004883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F04244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F0C244883C410C4E37920C801C4E37914C800C4E37920C801C4E37914C800C4E37920C801C4E37914C800C4E37920C801C4E37914C800C4E37920C801C4E37914C800C4E37920C801C4E37914C800C4E37920C801C4E37914C800C4E37920C801C4E37914C800C3
...
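
A quick way to put a number on "consistent" is to compare the four per-instruction values from the index combinations tried in this thread (the first one is the earlier $0/$0 run). A minimal sketch using only the figures pasted above:

```python
# Per-instruction latency reported by llvm-exegesis for each index combination.
values = {
    ("insert $0", "extract $0"): 11.0282,
    ("insert $1", "extract $1"): 11.0372,
    ("insert $0", "extract $1"): 11.0304,
    ("insert $1", "extract $0"): 11.0333,
}

spread = max(values.values()) - min(values.values())
mean = sum(values.values()) / len(values)
print(f"mean {mean:.4f}cy, spread {spread:.4f}cy "
      f"({100 * spread / mean:.2f}% of mean)")
```

The spread is under a hundredth of a cycle, so the element index does not appear to affect the measured latency.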

That being said, I think this change is good, and it is consistent with the latency value defined for WriteVecMoveToGpr and WriteVecMoveFromGpr.

I suspect ReadFpu2Int will be introduced too?

So, LGTM

Thank you for the review.

This revision was automatically updated to reflect the committed changes.