This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Reuse VL (if non-zero) when building single element vector for start of reduction chain
ClosedPublic

Authored by reames on Dec 8 2022, 11:32 AM.

Details

Summary

This is an alternative patch on a path to D137530.

The basic problem being tackled here is that we need to place a scalar into lane 0 of a vector register before our reduction instructions. Since we only care about lane 0 of the vector, we can use any VL >= 1 provided that the total amount of work performed matches the work performed for a VL=1.

This change does not contain the logic from D137530 to perform the insert at the original VT, and then extract down to LMUL1. That turns out to be a good choice, as discussion in this review has indicated there are issues around LMUL2 and above with our representation of vmv.s.x. We'd also need to be careful with the splat logic for the same reasons.

The only potentially concerning codegen change I spot here is that we stop using a broadcast load (for VL=1) and instead do a scalar load and insert. I think this is probably reasonable; if reviewers disagree, I can investigate using a broadcast load which writes to the undef lanes. If we want to do that, we should do it for VECTOR_INSERT_ELT as well, so that'll end up as it's own patch series.

Diff Detail

Event Timeline

reames created this revision.Dec 8 2022, 11:32 AM
reames requested review of this revision.Dec 8 2022, 11:32 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 8 2022, 11:32 AM
Herald added subscribers: eopXD, MaskRay. · View Herald Transcript

Some thoughts

What if we had a a second set of VMV_S_X pseudos with LMUL1 register class, but LMUL=1,2,4,8,etc. in TSFlags for the vsetvli insertion pass to see.

We'd need a second ISD node that takes LMUL as an operand since it can't use the type. Then isel could pick the pseudo with the correct LMUL in TSFlags.

We would match the LMUL to the input type of the reduction instead of using LMUL=1.

For non-VP reductions, the VL would be the fixed width or vlmax VL. For VP, we'd still need to deal with zero vs non-zero VL to have correct behavior.

reames added a comment.Dec 8 2022, 2:35 PM

Talked to Craig because I hadn't followed his last comment.

The issue that we have is that vmv.x.s is modeled badly. We model it as if it had an LMUL8 variant, but if you read the actual instruction manual you'll see that it ignores register groups and only writes to a single register. This isn't a correctness concern, but it does mean that this patch over constrains the register allocator (by using a lmul8 reg class), which could result in poor codegen. This is only an issue for this call site as the previous callsite (earlier change in stack), already had this issue and was likely going to write to the full LMUL8 register group in the following instruction anyways. The reduction instruction also only writes to a single vector register.

Talked to Craig because I hadn't followed his last comment.

The issue that we have is that vmv.x.s is modeled badly. We model it as if it had an LMUL8 variant, but if you read the actual instruction manual you'll see that it ignores register groups and only writes to a single register. This isn't a correctness concern, but it does mean that this patch over constrains the register allocator (by using a lmul8 reg class), which could result in poor codegen. This is only an issue for this call site as the previous callsite (earlier change in stack), already had this issue and was likely going to write to the full LMUL8 register group in the following instruction anyways. The reduction instruction also only writes to a single vector register.

If I understand correctly, the concern is, though the vmv.s.x/vmv.s.f ignore LMUL and vector register groups, we will still allocate a register with LMUL>1 register class for them, which will result in bad register allocation. And this is same for the destination register of reduction instructions.
Hmmmm…so we may apply these optimizations only for LMUL<=1, or redesign the pseudos of vmv.s.x/vmv.s.f and reductions (sooner of later).

reames updated this revision to Diff 482558.EditedDec 13 2022, 10:56 AM
reames edited the summary of this revision. (Show Details)

Rebase over landed changes, and request re-review.

@craig.topper Can you take a second look at this? I think we might have been chasing a red herring on the LMUL2 and above concern for this patch. This patch specifically only reuses VL at a callsite where we know that lmul is constrained to LMUL1 or less. As such, I don't think we can get the over-constrained vmv.s.x issue we'd discussed in this patch.

It's definitely still a real issue which can be exercised from the insert element path, but that has not (edit!) changed with this patch (or the preceding patches).

reames retitled this revision from [RISCV] Build single element vector for start of reduction change to [RISCV] Reuse VL (if non-zero) when building single element vector for start of reduction change.Dec 13 2022, 11:00 AM
reames edited the summary of this revision. (Show Details)
reames retitled this revision from [RISCV] Reuse VL (if non-zero) when building single element vector for start of reduction change to [RISCV] Reuse VL (if non-zero) when building single element vector for start of reduction chain.
This revision is now accepted and ready to land.Dec 13 2022, 11:22 AM

Rebase over landed changes, and request re-review.

@craig.topper Can you take a second look at this? I think we might have been chasing a red herring on the LMUL2 and above concern for this patch. This patch specifically only reuses VL at a callsite where we know that lmul is constrained to LMUL1 or less. As such, I don't think we can get the over-constrained vmv.s.x issue we'd discussed in this patch.

It's definitely still a real issue which can be exercised from the insert element path, but that has not (edit!) changed with this patch (or the preceding patches).

I think the overconstrained regalloc came up when we talked about using LMUL>1 vmv.s.x and an extract to LMUL=1. We still have a lot of vsetvlis in the LMUL>1 tests that could be removed.

This revision was landed with ongoing or failed builds.Dec 13 2022, 12:16 PM
This revision was automatically updated to reflect the committed changes.