unpckqdq seems to be treated as a shuffle from a bypass-delay
perspective (which makes sense, as it appears to share shuffle units
on all micro-architectures).
unpckqdq is slightly preferable to shufpd as it saves 1 byte of
code size and can be used to replace the micro-fused rm version. So,
if the target has no bypass delay, we should do unpckpd ->
unpckqdq instead of shufpd.
Update the comments here.