This is a proof of concept. I'm not sure we want this,
so I didn't spend much time on it. If we do, I can improve it.
It's not very useful without something like D60000,
but even now I'm seeing somewhat confusing results:
---
mode: latency
key:
  instructions:
    - 'PEXTRDrr R9D XMM6 i_0x0'
    - 'MOV64toPQIrr XMM6 R9'
  config: ''
  register_initial_values:
    - 'XMM6=0x0'
    - 'R9=0x0'
cpu_name: bdver2
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 10.0292, per_snippet_value: 20.0584 }
error: ''
info: Repeating two instructions
assembled_snippet: 4883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F34244883C41049B9000000000000000066410F3A16F10066490F6EF166410F3A16F10066490F6EF166410F3A16F10066490F6EF166410F3A16F10066490F6EF166410F3A16F10066490F6EF166410F3A16F10066490F6EF166410F3A16F10066490F6EF166410F3A16F10066490F6EF1C3
...
---
mode: latency
key:
  instructions:
    - 'PEXTRDrr EBP XMM8 i_0x0'
    - 'VMOV64toPQIrr XMM8 RBP'
  config: ''
  register_initial_values:
    - 'XMM8=0x0'
    - 'RBP=0x0'
cpu_name: bdver2
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 10.0344, per_snippet_value: 20.0688 }
error: ''
info: Repeating two instructions
assembled_snippet: 554883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F04244883C41048BD000000000000000066440F3A16C500C461F96EC566440F3A16C500C461F96EC566440F3A16C500C461F96EC566440F3A16C500C461F96EC566440F3A16C500C461F96EC566440F3A16C500C461F96EC566440F3A16C500C461F96EC566440F3A16C500C461F96EC55DC3
...
---
mode: latency
key:
  instructions:
    - 'EXTRACTPSrr EDI XMM2 i_0x0'
    - 'VPINSRWrr XMM2 XMM7 EDI i_0x1'
  config: ''
  register_initial_values:
    - 'XMM2=0x0'
    - 'XMM7=0x0'
cpu_name: bdver2
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 11.0299, per_snippet_value: 22.0598 }
error: ''
info: Repeating two instructions
assembled_snippet: 4883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F14244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F3C244883C410660F3A17D700C5C1C4D701660F3A17D700C5C1C4D701660F3A17D700C5C1C4D701660F3A17D700C5C1C4D701660F3A17D700C5C1C4D701660F3A17D700C5C1C4D701660F3A17D700C5C1C4D701660F3A17D700C5C1C4D701C3
...
---
mode: latency
key:
  instructions:
    - 'EXTRACTPSrr ESI XMM6 i_0x0'
    - 'PINSRDrr XMM6 XMM6 ESI i_0x1'
  config: ''
  register_initial_values:
    - 'XMM6=0x0'
cpu_name: bdver2
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 11.0328, per_snippet_value: 22.0656 }
error: ''
info: Repeating two instructions
assembled_snippet: 4883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F34244883C410660F3A17F600660F3A22F601660F3A17F600660F3A22F601660F3A17F600660F3A22F601660F3A17F600660F3A22F601660F3A17F600660F3A22F601660F3A17F600660F3A22F601660F3A17F600660F3A22F601660F3A17F600660F3A22F601C3
...
So extraction from the 0th lane isn't actually any faster?
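To make the comparison easier to read, here is a rough Python sketch that tabulates the four results. The numbers are copied from the output above. The helper name `per_instruction` is just illustrative, and the assumption that `value` is `per_snippet_value` divided by the number of instructions in the snippet is inferred from the figures (they are consistent with it), not taken from the llvm-exegesis sources.

```python
# Latency figures copied from the benchmark output above (cpu_name: bdver2).
# Each entry: (instruction pair, reported value, reported per_snippet_value).
RESULTS = [
    ("PEXTRDrr + MOV64toPQIrr", 10.0292, 20.0584),
    ("PEXTRDrr + VMOV64toPQIrr", 10.0344, 20.0688),
    ("EXTRACTPSrr + VPINSRWrr", 11.0299, 22.0598),
    ("EXTRACTPSrr + PINSRDrr", 11.0328, 22.0656),
]


def per_instruction(per_snippet_value, num_instructions=2):
    """Assumed relation: the reported value is the snippet total split evenly."""
    return per_snippet_value / num_instructions


for pair, value, per_snippet in RESULTS:
    # Only the round-trip total is measured; the even split says nothing
    # about how the cycles are divided between the two instructions.
    print(f"{pair:26s} round trip {per_snippet:7.4f}  "
          f"even split {per_instruction(per_snippet):7.4f}  reported {value:7.4f}")
```

If that assumption holds, only the round-trip total (about 20 vs. about 22 cycles) is directly measured, and the per-instruction figure is an even split of it.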
The VPINSR*Z* variants still work on 128-bit vectors; they just use EVEX encoding (supporting predicate masks etc.).