Page MenuHomePhabricator

[WIP][SVE] Pass through dup(0) to zero-merging pseudos
AbandonedPublic

Authored by sdesmalen on May 21 2020, 2:46 PM.

Details

Summary

Hi @cameron.mcinally, I'm just sharing what I tried out today based
off your patch D80260. I'm not really planning to land it, but feel
free to use for reference or discard entirely if you've already been
working on something similar.

It passes the dup(0) to the zero-merging pseudos, similar to what D80260
does for any other mask value.

This patch also highlights a bug that currently exists with the expansion
of the pseudo instructions that merge zero's into the false lanes.

The zero-merging pseudos don't have any tied operand constraints to give
the register allocator more freedom to use the reverse instructions
(like FSUBR).

A bug currently exists when the register allocation of one of the pseudos
ends up as:

Dst = FSUB_ZERO_S P0, Z0, Z0

The expand pass cannot zero the false lanes of Z0 using MOVPRFX, because
the MOVPRFX instruction specifies that the destination register must not
be used in any other operand position than the destination register. This
would not be valid:

Z0 = MOVPRFX P0/z, Z0
Z0 = FSUB_S Z0, P0/m, Z0
                      ^^

At point of expanding the pseudo, there may not be a spare register
available to expand this into a legal sequence. In D71712 we've solved
this by using a 'Conditional Early Clobber' pass that runs during register
allocation and makes sure the destination register is different from
any of the input registers, if the two input registers will otherwise end
up the same. This is a bit fiddly, and it's probably better to build
on the design set out in D80260 where the merge-value value is passed
to the pseudo, so the compiler can decide at point of pseudo expansion
whether to use the DUP(0) value, or to use the zeroing MOVPRFX.

Given that the DUP IMM instructions have isReMaterializable set,
the register allocator hopefully won't try too hard to keep it in a
register.

Diff Detail

Event Timeline

sdesmalen created this revision.May 21 2020, 2:46 PM
Herald added a project: Restricted Project. · View Herald Transcript
llvm/test/CodeGen/AArch64/sve-intrinsics-fp-arith-merging.ll
301

I'm looking at some rough latency tables we've put together and it looks like the tied-reg MOVPRFX sequence is 1 cycle faster than the SEL sequence:

; CHECK-NEXT:  movprfx z1.s, p0/z, z0.s
; CHECK-NEXT:  fsubr   z1.s, p0/m, z1.s, z0.s
; CHECK-NEXT:  mov     z0.d, z1.d

The vector MOV is faster than the DUP. And we burn the extra z1 register for both cases, so that's a wash.

That said, the MOVPRFX sequence we're generating actually looks like this:

; CHECK-NEXT:  mov z1.s, #0
; CHECK-NEXT:  movprfx z1.s, p0/z, z0.s
; CHECK-NEXT:  fsubr   z1.s, p0/m, z1.s, z0.s
; CHECK-NEXT:  mov     z0.d, z1.d

where the DUP #0 is a dead instruction. It's proving pretty hard to get rid of the DUP at the MachineInstruction level though. Still looking...

sdesmalen marked an inline comment as done.May 27 2020, 2:38 PM
sdesmalen added inline comments.
llvm/test/CodeGen/AArch64/sve-intrinsics-fp-arith-merging.ll
301

I'm looking at some rough latency tables we've put together and it looks like the tied-reg MOVPRFX sequence is 1 cycle faster than the SEL sequence:

Ah that's good to know. Always using the tied-operand constraint for the zeroing forms possibly makes the common cases slower though, because it forces the compiler to honour the constraints and avoids benefiting from the reverse instructions as the register allocator will already have done the work. All cases except this one don't need the dup+select and can use movprfx directly and make use of the commutative/reverse instructions to expand the pseudo.

That said, the MOVPRFX sequence we're generating actually looks like this:

Is that with a different example than the one in this test?

sdesmalen abandoned this revision.Jul 26 2021, 8:11 AM
rkruppe removed a subscriber: rkruppe.Jul 26 2021, 9:41 AM