
[WIP][SVE] Prototype for general merging MOVPRFX support.
Needs Review · Public

Authored by cameron.mcinally on May 19 2020, 5:11 PM.

Details

Summary

This is a prototype for general merging MOVPRFX support. It ties the pseudo's destination register to the passthru register, which seems to work for simple test cases.***

The intention is to use this general merging mechanism to also handle the zero merging case, but that would take some more cleanup work to get there.

***From an offline discussion with Sander, this implementation has limitations that I have not hit yet. Posting this patch to start a discussion.

Diff Detail

Event Timeline

Herald added a project: Restricted Project.

Hi @cameron.mcinally, thanks for this patch! The approach makes sense and nicely extends the mechanism we have for the zeroing forms. We can use similar pseudos for the zeroing case as well. Today I played around a bit with your patch and shared some changes for your reference in D80410. I'm not really planning to land it, but it at least highlights the bug for the zeroing-pseudos that currently exists in master. Feel free to use it for reference, or ignore/discard it if you've already worked on something similar.

We should probably think about where we want to write down the design for this mechanism, as we use these pseudos to solve multiple problems.

llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp
393–394

This comment suggests that this is only possible for _ZERO and _UNDEF variants, but I'm not sure if that comment is still correct.

413

Can you update the description of expand_DestructiveOp to describe the new style of pseudos like FSUB_ZPZZ_MERGE_B?

419

nit: s/PassIdx/PassthruIdx/ ?

llvm/lib/Target/AArch64/SVEInstrFormats.td
482

I'm happy to keep the constraint for now, but I don't think it is strictly necessary, because if $Zd != $Zpt it is still possible to generate a valid instruction sequence.

For:

define <vscale x 4 x float> @foo(<vscale x 4 x i1> %p, <vscale x 4 x float> %z0, <vscale x 4 x float> %z1, <vscale x 4 x float> %passthru) {
  %z0_in = select <vscale x 4 x i1> %p, <vscale x 4 x float> %z0, <vscale x 4 x float> %passthru
  %sub = call <vscale x 4 x float> @llvm.aarch64.sve.fsub.nxv4f32(<vscale x 4 x i1> %p, <vscale x 4 x float> %z0_in, <vscale x 4 x float> %z0)
  ret <vscale x 4 x float> %sub
}
declare <vscale x 4 x float> @llvm.aarch64.sve.fsub.nxv4f32(<vscale x 4 x i1>, <vscale x 4 x float>, <vscale x 4 x float>)

LLVM will generate:

movprfx z2.s, p0/m, z0.s
fsub    z2.s, p0/m, z2.s, z0.s
mov     z0.d, z2.d
ret

Where the last mov is needed because the register allocator chose to allocate the pseudo as:

z2 = FSUB_PSEUDO p0, z0, z0, z2

with the result operand z2 tied to the merge value. I think we could alternatively fall back on using a select instruction if the register allocator didn't have the tied operand restriction and instead generate:

sel z0.s, p0, z0.s, z2.s
fsub z0.s, p0/m, z0.s, z0.s
ret
llvm/test/CodeGen/AArch64/sve-intrinsics-int-arith-merging.ll
62

nit: the %a_z name implies that the false lanes are zeroed, perhaps %a_m is more appropriate.

76

I think we should add tests where the %passthru value is %b, as that should cause it to use the reverse instruction, e.g. sub -> subr (or swap the operands in the case of add).
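As a hedged sketch of that case (hypothetical register assignments, not output from this patch), a sub whose %passthru is %b should ideally expand to the reverse form with no movprfx at all:

// result = p ? (a - b) : b, with a in z0 and the result tied to b in z1;
// subr computes Zdn = Zm - Zdn, so the inactive lanes already hold b.
subr z1.s, p0/m, z1.s, z0.s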

cameron.mcinally marked 2 inline comments as done.

Thanks, Sander.

I have a nagging feeling that I forced this implementation on you. My intention was to use this patch as an intuition pump, not necessarily as the path forward. I don't have a lot of intuition around MOVPRFX today, so I'm hoping for guidance on the best way to proceed.

I definitely see the desired flexibility of having a lowering pass pre-regalloc. If you think that's the better solution, I'll work on it. I just don't have a strong opinion on where in the pipeline that pass should live.

Also, I plan to look at your zeroing Diff (D80410) next week. Thanks for that.

llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp
393–394

Massaged this a bit. Let me know if anything sounds off.

413

Updated. Please let me know if that's what you had in mind. Advertising copy is not my strong suit...

419

Updated, but I just realized that this isn't needed yet. With DstReg tied to the Passthru reg, I can replace MI.getOperand(PassIdx).getReg() with DestReg below.

I think I'll leave it for now in case the untied/zeroing case needs it. Can tear it out later if it's not needed.

llvm/lib/Target/AArch64/SVEInstrFormats.td
482

Ah, ok, I think I see the problem now. In this case we can clobber z0 since it only has the one use. I'll have to explore this further.

I think I'll look at the ConditionalEarlyClobberPass to see if this case can be guarded against. If you know it can't, or are aware of other cases that may be problematic, please let me know.

llvm/test/CodeGen/AArch64/sve-intrinsics-int-arith-merging.ll
62

Good catch. Fixed.

76

Added.

Also added a guard to prevent MOVPRFX Zx, p0/m, Zx from being generated. We might not want that for the p0/z zero merging case, but I haven't hit that yet.

I have a nagging feeling that I forced this implementation on you.

Not at all, it may be more the other way around :-) We initially went the pseudo-route downstream so we could use the reverse instructions, but we later used this for the cheaper zeroing as well.
For the former (using reverse instructions) there probably isn't much else we can do than go the pseudo route, but for the latter (cheaper merging) there may also be other alternatives to consider.

My intention was to use this patch as an intuition pump, not necessarily as the path forward. I don't have a lot of intuition around MOVPRFX today, so I'm hoping for guidance on the best way to proceed.

Are you happy to bring this topic up in tomorrow's meeting? If there's time, we can talk through the approach in this patch and maybe get some more input.

I definitely see the desired flexibility of having a lowering pass pre-regalloc. If you think that's the better solution, I'll work on it. I just don't have a strong opinion on where in the pipeline that pass should live.

I'm a bit confused by what you mean with 'pre-regalloc lowering pass'? (Do you mean something like the ConditionalEarlyClobber pass mentioned in D80410?)

Sorry for the delay, we've been a bit distracted over here...

Here's the patch that we briefly discussed at last week's Sync-up meeting. The patch attempts to rewrite the MachineInstrs in the general zero-merging case. I agree with @paulwalker-arm that the MachineInstr level isn't a great place to rewrite a sequence of instructions, but this case is fairly limited, so it might not be too bad. I'll elaborate on that a little now.

The general passthru merging case is largely uninteresting. Unless I've made a mistake, register allocation seems optimal for it (in the cases I've come across, at least). The zeroing case is a bit trickier. We start with a pattern like this:

class SVE_3_Op_Pat_SelZero_Passthru<ValueType vtd, SDPatternOperator op, ValueType vt1,
                   ValueType vt2, ValueType vt3, Instruction inst>
: Pat<(vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (vt2 SVEDup0:$Dup)), vt3:$Op3)),
      (inst $Op1, $Op2, $Op3, $Dup)>;

The SVEDup0 is the problem. It's accounted for in the generated Pseudo, but the original DUP also hangs around. In cases where the DUP only has one use, it can be removed. That's the hang-up, i.e. whether we can safely remove that superfluous DUP when expanding the Pseudo.
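To make the hang-up concrete, here is a hedged sketch of the kind of sequence in question (register choices are hypothetical):

mov     z3.s, #0                // the materialised dup(0); dead if this was its only use
movprfx z2.s, p0/z, z0.s        // the zeroing prefix already zeroes the false lanes
fsub    z2.s, p0/m, z2.s, z1.s  // active lanes: z0 - z1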

You'll find the proposed solution under the FalseLanes == AArch64::FalseLanesZero block in this Diff. It's not an ideal solution, but I'm not convinced that it's overly risky either. The sequence we're looking for is fairly constrained, *I think*.

You'll also notice that the test cases I've included are overkill; they're there to save effort on the reviewer's side. I just wanted the register allocation decisions to be easy to see.

I definitely see the desired flexibility of having a lowering pass pre-regalloc. If you think that's the better solution, I'll work on it. I just don't have a strong opinion on where in the pipeline that pass should live.

I'm a bit confused by what you mean with 'pre-regalloc lowering pass'? (Do you mean something like the ConditionalEarlyClobber pass mentioned in D80410?)

I'm probably the one confused. :D

We definitely need an extra register available in some of the MOVPRFX cases. Scavenging that reg probably isn't a good fix. So I was thinking it would be easier to generate the MOVPRFX when we're still at the virtual register phase. It's not clear to me how to make this work though, so I might be way off.

@paulwalker-arm mentioned something at last week's Sync-up meeting about generating the MOVPRFX directly during DAGCombine or similar. That sounds promising, but I'm not sure if I see that path clearly yet.

Hi @cameron.mcinally, sorry for my delayed response, I was OoO part of last week and didn't have much bandwidth to reply.

The general passthru merging case is largely uninteresting. Unless I've made a mistake, register allocation seems optimal for it (in the cases I've come across, at least).

Yes, that seems a fair assessment.

The zeroing case is a bit trickier. We start with a pattern like this:

class SVE_3_Op_Pat_SelZero_Passthru<ValueType vtd, SDPatternOperator op, ValueType vt1,
                   ValueType vt2, ValueType vt3, Instruction inst>
: Pat<(vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (vt2 SVEDup0:$Dup)), vt3:$Op3)),
      (inst $Op1, $Op2, $Op3, $Dup)>;

The SVEDup0 is the problem. It's accounted for in the generated Pseudo, but the original DUP also hangs around. In cases where the DUP only has one use, it can be removed. That's the hang-up, i.e. whether we can safely remove that superfluous DUP when expanding the Pseudo.

I don't think the compiler will always be able to remove the superfluous DUP (it may have been hoisted out of the loop for example), but I'm not sure how much of a problem that is in practice. Maybe this is fine for a first implementation. Did you get a chance to try this out on real code?

You'll find the proposed solution under the FalseLanes == AArch64::FalseLanesZero block in this Diff. It's not an ideal solution, but I'm not convinced that it's overly risky either. The sequence we're looking for is fairly constrained, *I think*.

While it may seem trivial to remove that instruction this way, I'd rather leave this to a dedicated pass like DeadMachineInstructionElim that could be run directly after the pseudo-expand pass. There are details and special cases to take into account, such as defs/uses on subregisters (instead of the full register) or the use being a DEBUG instruction. The former could lead to the DUP being removed when it shouldn't be, and the latter could prevent it from being removed when it should be. The dedicated pass is probably better suited to handle those.

We definitely need an extra register available in some of the MOVPRFX cases. Scavenging that reg probably isn't a good fix. So I was thinking it would be easier to generate the MOVPRFX when we're still at the virtual register phase. It's not clear to me how to make this work though, so I might be way off.

When passing dup(0) to the pseudo, the compiler can always fall back on the explicit merge (predicated mov) and there is no longer a need to scavenge a register. I agree it would be better to do this earlier so we can avoid this altogether though.

@paulwalker-arm mentioned something at last week's Sync-up meeting about generating the MOVPRFX directly during DAGCombine or similar. That sounds promising, but I'm not sure if I see that path clearly yet.

What Paul meant is that we ideally want to add a new constraint to TableGen that allows us to model the constraints for using MOVPRFX more precisely. For the pseudo:

Zd = FSUB_PSEUDO_ZEROING Pg, Zs1, Zs2

Instead of forcing "Zd = Zs1", we want to allow any register allocation as long as one of the registers is unique, because the ExpandPseudo pass can always handle that case with movprfx directly, or additionally with a reverse/commutative instruction. By adding a new TableGen constraint like "one_of_is_unique(Zd, Zs1, Zs2)" and teaching the register allocator to implement it, we can avoid uglier solutions like the conditional early-clobber pass (which only does half the job, because it forces Zd to be unique, and that may not be the best choice) or passing the dup(0) to the instruction and then trying to remove it afterwards (which may fail). We could instead model the constraint exactly as we want it. Other pseudos (e.g. for ternary instructions) may require slightly different constraints, but one_of_is_unique() probably solves 90-95% of the cases that need solving.
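As a hedged sketch of why one unique register is enough (register choices and the exact expansions are assumptions, not part of this patch):

// Zd unique (Zd != Zs1, Zd != Zs2): prefix the destination directly.
movprfx z2.s, p0/z, z0.s        // false lanes zeroed, Zs1 copied into Zd
fsub    z2.s, p0/m, z2.s, z1.s  // active lanes: Zs1 - Zs2

// Merging form with Zd == Zs2 == Zpt: the reverse instruction avoids
// the prefix entirely.
fsubr   z1.s, p0/m, z1.s, z0.s  // active: Zs1 - Zs2; inactive keep the passthru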