Hi @cameron.mcinally, I'm just sharing what I tried out today based
off your patch D80260. I'm not really planning to land it, but feel
free to use for reference or discard entirely if you've already been
working on something similar.
It passes the dup(0) to the zero-merging pseudos, similar to what D80260
does for any other mask value.
This patch also highlights a bug that currently exists with the expansion
of the pseudo instructions that merge zero's into the false lanes.
The zero-merging pseudos don't have any tied operand constraints to give
the register allocator more freedom to use the reverse instructions
(like FSUBR).
A bug currently exists when the register allocation of one of the pseudos
ends up as:
Dst = FSUB_ZERO_S P0, Z0, Z0
The expand pass cannot zero the false lanes of Z0 using MOVPRFX, because
the MOVPRFX instruction specifies that the destination register must not
be used in any other operand position than the destination register. This
would not be valid:
Z0 = MOVPRFX P0/z, Z0 Z0 = FSUB_S Z0, P0/m, Z0 ^^
At point of expanding the pseudo, there may not be a spare register
available to expand this into a legal sequence. In D71712 we've solved
this by using a 'Conditional Early Clobber' pass that runs during register
allocation and makes sure the destination register is different from
any of the input registers, if the two input registers will otherwise end
up the same. This is a bit fiddly, and it's probably better to build
on the design set out in D80260 where the merge-value value is passed
to the pseudo, so the compiler can decide at point of pseudo expansion
whether to use the DUP(0) value, or to use the zeroing MOVPRFX.
Given that the DUP IMM instructions have isReMaterializable set,
the register allocator hopefully won't try too hard to keep it in a
register.
I'm looking at some rough latency tables we've put together and it looks like the tied-reg MOVPRFX sequence is 1 cycle faster than the SEL sequence:
The vector MOV is faster than the DUP. And we burn the extra z1 register for both cases, so that's a wash.
That said, the MOVPRFX sequence we're generating actually looks like this:
where the DUP #0 is a dead instruction. It's proving pretty hard to get rid of the DUP at the MachineInstruction level though. Still looking...