This is an archive of the discontinued LLVM Phabricator instance.

[riscv] Use X0 for destination of VSETVLI instruction if result unused
ClosedPublic

Authored by reames on May 4 2022, 1:45 PM.

Details

Summary

If the GPR destination register of a VSETVLI instruction is unused, we can replace it with X0. This discards the result, and thus reduces register pressure.

Since after the core insertion/lowering algorithm has run, basically all user written VSETVLIs will have their GPR result unused (as VTYPE/VLEN is now explicitly read instead), this kicks in for most tests which involve a vsetvli intrinsic. When inserting VSETVLIs to lower psuedos, we prefer the X0 form anyways.

Diff Detail

Event Timeline

reames created this revision.May 4 2022, 1:45 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 4 2022, 1:45 PM
reames requested review of this revision.May 4 2022, 1:45 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 4 2022, 1:45 PM

Since after the core insertion/lowering algorithm has run, basically all user written VSETVLIs will have their GPR result unused (as VTYPE/VLEN is now explicitly read instead), this kicks in for most tests which involve a vsetvli intrinsic. When inserting VSETVLIs to lower psuedos, we prefer the X0 form anyways.

Almost any real scenario that processes more data than fits in a register will need the GPR output to update memory pointers and to control a loop. Which I guess means we're lacking realistic tests.

llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp
1229

This is valid for VSETIVLI too.

reames added a comment.May 4 2022, 2:09 PM

Since after the core insertion/lowering algorithm has run, basically all user written VSETVLIs will have their GPR result unused (as VTYPE/VLEN is now explicitly read instead), this kicks in for most tests which involve a vsetvli intrinsic. When inserting VSETVLIs to lower psuedos, we prefer the X0 form anyways.

Almost any real scenario that processes more data than fits in a register will need the GPR output to update memory pointers and to control a loop. Which I guess means we're lacking realistic tests.

How so? An idiomatic tail folded loop doesn't need to explicitly read VL. It's passed via VL reg itself to all the vector ops. It changes across iterations, but that doesn't mean the GPR is used.

llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp
1229

True. Will update.

Since after the core insertion/lowering algorithm has run, basically all user written VSETVLIs will have their GPR result unused (as VTYPE/VLEN is now explicitly read instead), this kicks in for most tests which involve a vsetvli intrinsic. When inserting VSETVLIs to lower psuedos, we prefer the X0 form anyways.

Almost any real scenario that processes more data than fits in a register will need the GPR output to update memory pointers and to control a loop. Which I guess means we're lacking realistic tests.

How so? An idiomatic tail folded loop doesn't need to explicitly read VL. It's passed via VL reg itself to all the vector ops. It changes across iterations, but that doesn't mean the GPR is used.

Don't you need to know how many elements were processed to advance the pointer for the next iteration. That is done is scalar registers. You also need to update a remaining element count to be passed to the vsetvli for the next iteration. Assuming we're talking about a loop like this example from the spec

A.1. Vector-vector add example
    # vector-vector add routine of 32-bit integers
    # void vvaddint32(size_t n, const int*x, const int*y, int*z)
    # { for (size_t i=0; i<n; i++) { z[i]=x[i]+y[i]; } }
    #
    # a0 = n, a1 = x, a2 = y, a3 = z
    # Non-vector instructions are indented
vvaddint32:
    vsetvli t0, a0, e32, ta, ma  # Set vector length based on 32-bit vectors
vle32.v v0, (a1)
  sub a0, a0, t0
  slli t0, t0, 2
  add a1, a1, t0
vle32.v v1, (a2)
  add a2, a2, t0
vadd.vv v2, v0, v1
vse32.v v2, (a3)
  add a3, a3, t0
  bnez a0, vvaddint32
  ret
reames added a comment.May 4 2022, 2:19 PM

Since after the core insertion/lowering algorithm has run, basically all user written VSETVLIs will have their GPR result unused (as VTYPE/VLEN is now explicitly read instead), this kicks in for most tests which involve a vsetvli intrinsic. When inserting VSETVLIs to lower psuedos, we prefer the X0 form anyways.

Almost any real scenario that processes more data than fits in a register will need the GPR output to update memory pointers and to control a loop. Which I guess means we're lacking realistic tests.

How so? An idiomatic tail folded loop doesn't need to explicitly read VL. It's passed via VL reg itself to all the vector ops. It changes across iterations, but that doesn't mean the GPR is used.

Don't you need to know how many elements were processed to advance the pointer for the next iteration. That is done is scalar registers. You also need to update a remaining element count to be passed to the vsetvli for the next iteration. Assuming we're talking about a loop like this example from the spec

You are of course correct.

The actual example I was looking at was fixed length vectorization. Because you know the VL in advance, you don't need to read it back.

So we have realistic test for fixed length loop vectorization at least. :)

reames updated this revision to Diff 427137.May 4 2022, 2:21 PM

Add VSETIVLI

craig.topper added inline comments.May 4 2022, 2:54 PM
llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp
1232

This should maybe be use_nodbg_empty.

reames updated this revision to Diff 427154.May 4 2022, 3:10 PM
This revision is now accepted and ready to land.May 4 2022, 3:45 PM
frasercrmck added inline comments.May 5 2022, 12:55 AM
llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp
1224

I know we have PseudoVSETVLIX0, but is it really an invariant that PseudoVSETVLI never has X0, X0 at this stage? Should we maybe add an assert to be a bit more sure? If for whatever reason we do have X0, X0 this would (subtly) generate some wrong code.

khchen added a comment.May 5 2022, 7:27 AM

I think maybe it's good to have pre-commit test to demonstrate VLS realistic cases could be benefited by this improvement, or at least mention realistic test are coming from VLS vectorization in commit message.

This revision was landed with ongoing or failed builds.May 5 2022, 7:40 AM
This revision was automatically updated to reflect the committed changes.
reames added a comment.May 5 2022, 7:45 AM

I think maybe it's good to have pre-commit test to demonstrate VLS realistic cases could be benefited by this improvement, or at least mention realistic test are coming from VLS vectorization in commit message.

What does VLS stand for?

Assuming you mean fixed length vectorization, does the comment as submitted satisfy you?

llvm/lib/Target/RISCV/RISCVInsertVSETVLI.cpp
1224

At this stage, all the operands are vregs. We have tor rely on the register constraints that are on the instructions to cause the register allocator to do the right thing here.

The more generic version of this transform - not applied to just VSETVLI - would have to use register class constraints to on the def instead of using knowledge of the psuedo to assume x0 is valid as this one does.