This is a proof of concept. I'm not sure we want this,
so I didn't spend much time on it. If we do, I can improve it.
It's not very useful without something like D60000,
but even now I'm seeing somewhat confusing results:
---
mode: latency
key:
  instructions:
    - 'PEXTRDrr R9D XMM6 i_0x0'
    - 'MOV64toPQIrr XMM6 R9'
  config: ''
  register_initial_values:
    - 'XMM6=0x0'
    - 'R9=0x0'
cpu_name: bdver2
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
- { key: latency, value: 10.0292, per_snippet_value: 20.0584 }
error: ''
info: Repeating two instructions
assembled_snippet: 4883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F34244883C41049B9000000000000000066410F3A16F10066490F6EF166410F3A16F10066490F6EF166410F3A16F10066490F6EF166410F3A16F10066490F6EF166410F3A16F10066490F6EF166410F3A16F10066490F6EF166410F3A16F10066490F6EF166410F3A16F10066490F6EF1C3
...
---
mode: latency
key:
  instructions:
    - 'PEXTRDrr EBP XMM8 i_0x0'
    - 'VMOV64toPQIrr XMM8 RBP'
  config: ''
  register_initial_values:
    - 'XMM8=0x0'
    - 'RBP=0x0'
cpu_name: bdver2
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
- { key: latency, value: 10.0344, per_snippet_value: 20.0688 }
error: ''
info: Repeating two instructions
assembled_snippet: 554883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F04244883C41048BD000000000000000066440F3A16C500C461F96EC566440F3A16C500C461F96EC566440F3A16C500C461F96EC566440F3A16C500C461F96EC566440F3A16C500C461F96EC566440F3A16C500C461F96EC566440F3A16C500C461F96EC566440F3A16C500C461F96EC55DC3
...
---
mode: latency
key:
  instructions:
    - 'EXTRACTPSrr EDI XMM2 i_0x0'
    - 'VPINSRWrr XMM2 XMM7 EDI i_0x1'
  config: ''
  register_initial_values:
    - 'XMM2=0x0'
    - 'XMM7=0x0'
cpu_name: bdver2
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
- { key: latency, value: 11.0299, per_snippet_value: 22.0598 }
error: ''
info: Repeating two instructions
assembled_snippet: 4883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F14244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F3C244883C410660F3A17D700C5C1C4D701660F3A17D700C5C1C4D701660F3A17D700C5C1C4D701660F3A17D700C5C1C4D701660F3A17D700C5C1C4D701660F3A17D700C5C1C4D701660F3A17D700C5C1C4D701660F3A17D700C5C1C4D701C3
...
---
mode: latency
key:
  instructions:
    - 'EXTRACTPSrr ESI XMM6 i_0x0'
    - 'PINSRDrr XMM6 XMM6 ESI i_0x1'
  config: ''
  register_initial_values:
    - 'XMM6=0x0'
cpu_name: bdver2
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
- { key: latency, value: 11.0328, per_snippet_value: 22.0656 }
error: ''
info: Repeating two instructions
assembled_snippet: 4883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F34244883C410660F3A17F600660F3A22F601660F3A17F600660F3A22F601660F3A17F600660F3A22F601660F3A17F600660F3A22F601660F3A17F600660F3A22F601660F3A17F600660F3A22F601660F3A17F600660F3A22F601660F3A17F600660F3A22F601C3
...
So extraction from the 0th lane isn't actually any faster?
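
To make the comparison easier to read, here is a rough Python sketch that tabulates the four results above. The numbers are copied from the records; the dictionary layout and the `summarize` helper are made up for illustration and are not part of llvm-exegesis. It assumes `value` is `per_snippet_value` divided by the number of instructions in the snippet, which is what the numbers above are consistent with.

```python
# (value, per_snippet_value) copied verbatim from the four records above.
# The labels and this layout are hypothetical, only for illustration.
measurements = {
    "PEXTRDrr + MOV64toPQIrr": (10.0292, 20.0584),
    "PEXTRDrr + VMOV64toPQIrr": (10.0344, 20.0688),
    "EXTRACTPSrr + VPINSRWrr": (11.0299, 22.0598),
    "EXTRACTPSrr + PINSRDrr": (11.0328, 22.0656),
}

INSTRUCTIONS_PER_SNIPPET = 2  # "Repeating two instructions"


def summarize(results):
    """Return the round-trip latency and the per-instruction average per pair."""
    summary = {}
    for pair, (value, per_snippet) in results.items():
        average = per_snippet / INSTRUCTIONS_PER_SNIPPET
        # Assumption check: the reported value should match the average.
        assert abs(average - value) < 1e-6, pair
        summary[pair] = {"round_trip": per_snippet, "average": average}
    return summary


summary = summarize(measurements)
for pair, row in summary.items():
    print(f"{pair:28s} round trip {row['round_trip']:.4f}  average {row['average']:.4f}")
```

Note that only the round-trip figure is actually measured; the per-instruction average says nothing about how the latency splits between the extract and the insert/move in each pair.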
The VPINSR*Z* variants still operate on 128-bit vectors; they just use the EVEX encoding (which supports predicate masks, etc.).