This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Proposal to use op+select to match scalable predicated operations
AbandonedPublic

Authored by cameron.mcinally on Dec 12 2019, 1:15 PM.

Details

Summary

As promised from the SVE LLVM Sync-up today, here is a proposal to support scalable op+select masks in the backend. A solution like this would allow us to generate predicated instructions directly from IR, without the need for target intrinsics.
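For concreteness, the kind of IR this proposal targets looks roughly like the following (the function name and types here are illustrative, not taken from the patch):

```llvm
; An unpredicated fadd followed by a vselect that keeps the original
; lanes where the predicate is false. The proposed patterns would match
; this op+select pair to a single predicated FADD.
define <vscale x 4 x float> @masked_fadd(<vscale x 4 x i1> %pg,
                                         <vscale x 4 x float> %a,
                                         <vscale x 4 x float> %b) {
  %add = fadd <vscale x 4 x float> %a, %b
  %res = select <vscale x 4 x i1> %pg,
                <vscale x 4 x float> %add,
                <vscale x 4 x float> %a
  ret <vscale x 4 x float> %res
}
```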

Also note that this is a temporary solution. The native vector predication project D57504 will eventually obsolete this work.

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald Transcript

Remove some leftover debugging junk...

Adding patterns for vselect of various operations seems reasonable in general. The patterns are simple enough that it's not a big deal to repeat for a bunch of instructions.

For floating-point ops in particular, I'm sort of wondering how this interacts with STRICT_* operations. I think these patterns should not match in that case? We'd be suppressing exceptions that would otherwise trigger. Not sure how important that is.

llvm/lib/Target/AArch64/SVEInstrFormats.td
1284

If we're going to stick this style of pattern all over the place, we probably want a "class" for it.
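One possible shape for such a class, sketched here with made-up names and operand ordering (the actual patch defines its own classes, which may differ):

```tablegen
// Hypothetical helper: match (vselect Pg, (op Zdn, Zm), Zdn) to a
// predicated destructive instruction. Names are illustrative only.
class SVE_Op_Sel_Pat<ValueType vt, SDPatternOperator op,
                     ValueType pt, Instruction inst>
  : Pat<(vt (vselect pt:$Pg, (vt (op vt:$Zdn, vt:$Zm)), vt:$Zdn)),
        (inst $Pg, $Zdn, $Zm)>;
```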

Adding patterns for vselect of various operations seems reasonable in general. The patterns are simple enough that it's not a big deal to repeat for a bunch of instructions.

I'm sure you know, but for others, this is how AVX512 handles predication. We originally began adding predicated intrinsics for every AVX512 instruction, but they were later replaced with op+select patterns (except for masked loads/stores/gathers/scatters and a few others). I'll admit this isn't an apples-to-apples comparison though, since AVX512 has 5000+ predicated instructions. I don't know the SVE predicated instruction count, but my intuition says it's much lower (please correct me if I'm wrong).

For floating-point ops in particular, I'm sort of wondering how this interacts with STRICT_* operations. I think these patterns should not match in that case? We'd be suppressing exceptions that would otherwise trigger. Not sure how important that is.

It's actually the other way around. Vectorization would need the selects in place to suppress exceptions guarded by conditions. E.g.:

for(...)
  if (b[i] != 0)
    a[i] = a[i]/b[i];

My current understanding (and I could be wrong) is that the native predication intrinsics and the constrained intrinsics will be merged into one set. So once we have native predication + constrained intrinsics, these op+select patterns can go away on all targets. The op+select patterns are just a stop-gap.

My current understanding (and I could be wrong) is that the native predication intrinsics and the constrained intrinsics will be merged into one set

My understanding is that we're going to get masked constrained intrinsics, yes. Not sure if that means the other versions are going away completely; we still probably want some way to represent vector operations on targets without predication, at least in SelectionDAG.

In the meantime, we have to generate all the exceptions implied by a strict vector operation. We can't change the exceptions based on the uses of the instruction. Consider the following:

for(...) {
  double c = a[i]/b[i];
  if (b[i] != 0)
    a[i] = c;
}

Ah, ok. I see your point now.

for(...) {
  double c = a[i]/b[i];
  if (b[i] != 0)
    a[i] = c;
}

For this particular case though, the select should follow the store, not the fdiv. That should be ok: the store should become a masked store and the fdiv would remain unpredicated.
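To illustrate that point, the vectorized IR for the guarded-store example would keep the fdiv unpredicated and push the mask onto the store. A sketch (modern opaque-pointer syntax; `%aptr` and the types are illustrative, `llvm.masked.store` is the existing masked-store intrinsic):

```llvm
%c    = fdiv <vscale x 2 x double> %a, %b                  ; executed unconditionally
%mask = fcmp une <vscale x 2 x double> %b, zeroinitializer ; b[i] != 0
call void @llvm.masked.store.nxv2f64.p0(<vscale x 2 x double> %c,
                                        ptr %aptr, i32 8,
                                        <vscale x 2 x i1> %mask)
```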

Add a class for the op+select patterns.

cameron.mcinally marked an inline comment as done.Dec 13 2019, 12:56 PM
simoll added a subscriber: simoll.Dec 16 2019, 12:30 AM

Hi @cameron.mcinally, thanks for sharing this patch!

For the purpose of merging a select with an unpredicated operation into a predicated operation, this is indeed sufficient. But I wonder if we need something a bit more elaborate if the intended purpose is to more cheaply select a value for the false-lanes (passthru).

While we don't support the generic case in our downstream compiler, we do have special support for the cases where the false lanes are zeroed or undef. Using the predicated MOVPRFX instruction, the false lanes can be zeroed relatively cheaply using e.g.:

movprfx z0.s, p0/z, z1.s
fsub z0.s, p0/m, z0.s, z2.s

This avoids having to emit an explicit sequence of a splat and select / predicated mov to zero the false lanes. We match the operation + select into a Pseudo instruction (e.g. FSUB_ZERO or FSUB_UNDEF), which is expanded after register allocation (in the AArch64ExpandPseudoInsts pass) into the appropriate instructions.

Even if we don't care about selecting a passthru value for the false lanes, there is still value in creating the Pseudo. The lack of a tied-operand constraint for the Pseudo gives the register allocator more freedom to come up with a better allocation. Combined with the commutative property of some instructions or by expanding to their reversed variants (like SUBR vs SUB), we can avoid a number of unnecessary register moves.

We've been thinking about some ideas on how to make this support more generic to allow supporting the general use-case of:

%Res = FSUB_PSEUDO(%Pred, %Op1, %Op2, %Passthru)

Depending on the value for %Passthru, this can be expanded to use a movprfx or in the worst case an explicit select.

Ideally we'd use a Pseudo for most operations so that we can use this as a generic mechanism that natively supports the passthru value and benefits from better register allocation.

A bit of prototyping would be required though, as our downstream compiler only covers a limited use-case. We've also had to deal with some corner-cases, but I'd need to refresh my memory on the details before I can comment on those. I'll try to dig up some more details!

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
163–166

nit: the _pred isn't needed, as this is already implied by the _p.

Matt added a subscriber: Matt.Dec 18 2019, 12:57 PM

While we don't support the generic case in our downstream compiler, we do have special support for the cases where the false lanes are zeroed or undef. Using the predicated MOVPRFX instruction, the false lanes can be zeroed relatively cheaply using e.g.:

movprfx z0.s, p0/z, z1.s
fsub z0.s, p0/m, z0.s, z2.s

This avoids having to emit an explicit sequence of a splat and select / predicated mov to zero the false lanes. We match the operation + select into a Pseudo instruction (e.g. FSUB_ZERO or FSUB_UNDEF), which is expanded after register allocation (in the AArch64ExpandPseudoInsts pass) into the appropriate instructions.

Even if we don't care about selecting a passthru value for the false lanes, there is still value in creating the Pseudo. The lack of a tied-operand constraint for the Pseudo gives the register allocator more freedom to come up with a better allocation. Combined with the commutative property of some instructions or by expanding to their reversed variants (like SUBR vs SUB), we can avoid a number of unnecessary register moves.

We've been thinking about some ideas on how to make this support more generic to allow supporting the general use-case of:

%Res = FSUB_PSEUDO(%Pred, %Op1, %Op2, %Passthru)

Depending on the value for %Passthru, this can be expanded to use a movprfx or in the worst case an explicit select.

Ideally we'd use a Pseudo for most operations so that we can use this as a generic mechanism that natively supports the passthru value and benefits from better register allocation.

A bit of prototyping would be required though, as our downstream compiler only covers a limited use-case. We've also had to deal with some corner-cases, but I'd need to refresh my memory on the details before I can comment on those. I'll try to dig up some more details!

Sorry for the slow reply... holidays.

I just checked out the PseudoOps and they're interesting. If we could make generic XXX_PSEUDO(..., %passthru) PseudoOps, that would be a good path forward. I don't foresee any problems adding an extra passthru operand to the existing patterns, but maybe I'm missing something. If you have any insight, it would be appreciated.

I agree that if these PseudoOps will land upstream, then the op+select solution isn't the right way to go. I'll see if I can build the PseudoOps out a bit.

That said, the current implementation is a little weird. Here's the class that something like an FADD_ZERO would use:

class SVE_3_Op_Pat_SelZero<ValueType vtd, SDPatternOperator op, ValueType vt1,
                   ValueType vt2, ValueType vt3, Instruction inst>
: Pat<(vtd (vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), vt3:$Op3))),
      (inst $Op1, $Op2, $Op3)>;

The select+op DAG reads like we're zeroing the inactive input elements, not the inactive output elements. I see that this directly models the movprfx+op hardware instructions, and that Op2 is an input reg as well as the destination reg, but it still seems counter-intuitive. I don't feel strongly that this needs to change though. So if I'm the only one that thinks it's weird, I'll let it drop.

I just checked out the PseudoOps and they're interesting. If we could make generic XXX_PSEUDO(..., %passthru) PseudoOps, that would be a good path forward. I don't foresee any problems adding an extra passthru operand to the existing patterns, but maybe I'm missing something. If you have any insight, it would be appreciated.

Awesome, thanks for checking this out.

I don't see many issues with adding the extra passthru operand, but it could do with a bit of prototyping; what we've done downstream was very specific to having the false lanes zeroed, and that is actually unnecessarily restrictive (we never really had to care for the generic case). The challenge would be in expanding the pseudo. For a pseudo that has a reverse variant, e.g. Dst = fsub_pseudo(Pg, Op1, Op2, Passthru),

we would for example expand this as follows:

z0 = fsub_pseudo_S(p0, z1, z2, z3)
  <=>
sel z0.s, p0.s, z1.s, z3.s
fsub z0.s, p0/m, z0.s, z2.s

z0 = fsub_pseudo_S(p0, z1, z2, z2)
  <=>
movprfx z0, z2
fsubr z0.s, p0/m, z0.s, z1.s

or a special case for zero'd lanes:

z0 = fsub_pseudo_S(p0, z1, z2, <zero>)
movprfx z0.s, p0/z, z1.s
fsub z0.s, p0/m, z0.s, z2.s

The indexed FMLA instruction has three input operands, and this is a case where we can't relax the register constraints:

%dst = Pseudo_FMLA %a, %n, %m, %index
(to implement: %dst = FMLA %a + %n * %m[%index])

We cannot recover from a register allocation where %index is used as %dst, e.g.

Z0 = Pseudo_FMLA Z1, Z2, Z0, <index>

(there is no other variant we can use to recover from this, so those will need to retain a constraint that $Zd = $Zs1)

I agree that if these PseudoOps will land upstream, then the op+select solution isn't the right way to go. I'll see if I can build the PseudoOps out a bit.

That said, the current implementation is a little weird. Here's the class that something like an FADD_ZERO would use:

class SVE_3_Op_Pat_SelZero<ValueType vtd, SDPatternOperator op, ValueType vt1,
                   ValueType vt2, ValueType vt3, Instruction inst>
: Pat<(vtd (vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), vt3:$Op3))),
      (inst $Op1, $Op2, $Op3)>;

The select+op DAG reads like we're zeroing the inactive input elements, not the inactive output elements. I see that this directly models the movprfx+op hardware instructions, and that Op2 is an input reg as well as the destination reg, but it still seems counter-intuitive. I don't feel strongly that this needs to change though. So if I'm the only one that thinks it's weird, I'll let it drop.

Yes, that's mostly because of the pattern that the front-end generates for zeroing the false lanes, but it is functionally the same. I guess we could have two patterns to map to the pseudo, depending on how it looks in the IR.

cameron.mcinally abandoned this revision.Jan 8 2020, 7:29 AM

I just checked out the PseudoOps and they're interesting. If we could make generic XXX_PSEUDO(..., %passthru) PseudoOps, that would be a good path forward. I don't foresee any problems adding an extra passthru operand to the existing patterns, but maybe I'm missing something. If you have any insight, it would be appreciated.

Awesome, thanks for checking this out.

I don't see many issues with adding the extra passthru operand, but it could do with a bit of prototyping; what we've done downstream was very specific to having the false lanes zeroed, and that is actually unnecessarily restrictive (we never really had to care for the generic case). The challenge would be in expanding the pseudo. For a pseudo that has a reverse variant, e.g. Dst = fsub_pseudo(Pg, Op1, Op2, Passthru),

we would for example expand this as follows:

z0 = fsub_pseudo_S(p0, z1, z2, z3)
  <=>
sel z0.s, p0.s, z1.s, z3.s
fsub z0.s, p0/m, z0.s, z2.s

z0 = fsub_pseudo_S(p0, z1, z2, z2)
  <=>
movprfx z0, z2
fsubr z0.s, p0/m, z0.s, z1.s

or a special case for zero'd lanes:

z0 = fsub_pseudo_S(p0, z1, z2, <zero>)
movprfx z0.s, p0/z, z1.s
fsub z0.s, p0/m, z0.s, z2.s

The indexed FMLA instruction has three input operands, and this is a case where we can't relax the register constraints:

%dst = Pseudo_FMLA %a, %n, %m, %index
(to implement: %dst = FMLA %a + %n * %m[%index])

We cannot recover from a register allocation where %index is used as %dst, e.g.

Z0 = Pseudo_FMLA Z1, Z2, Z0, <index>

(there is no other variant we can use to recover from this, so those will need to retain a constraint that $Zd = $Zs1)

Ok, thanks. I'll abandon this Diff and start building my intuition around the Pseudos. Will post some Diffs if I can make something work...

I agree that if these PseudoOps will land upstream, then the op+select solution isn't the right way to go. I'll see if I can build the PseudoOps out a bit.

That said, the current implementation is a little weird. Here's the class that something like an FADD_ZERO would use:

class SVE_3_Op_Pat_SelZero<ValueType vtd, SDPatternOperator op, ValueType vt1,
                   ValueType vt2, ValueType vt3, Instruction inst>
: Pat<(vtd (vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), vt3:$Op3))),
      (inst $Op1, $Op2, $Op3)>;

The select+op DAG reads like we're zeroing the inactive input elements, not the inactive output elements. I see that this directly models the movprfx+op hardware instructions, and that Op2 is an input reg as well as the destination reg, but it still seems counter-intuitive. I don't feel strongly that this needs to change though. So if I'm the only one that thinks it's weird, I'll let it drop.

Yes, that's mostly because of the pattern that the front-end generates for zeroing the false lanes, but it is functionally the same. I guess we could have two patterns to map to the pseudo, depending on how it looks in the IR.

The existing patterns should be fine. Just a personal preference. No big deal.