This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Make VL choosing for a splat-like VMV based on its users
Changes PlannedPublic

Authored by asi-sc on Aug 1 2022, 5:35 AM.

Details

Summary

In some cases, the VL of VMV/VFMV instructions exceeds the required value, e.g. when we
convert a SPLAT_VECTOR into a vector move. This may cause insertion of extra or
too-wide vsetvli instructions. We can look at the common VL of the users to choose
the optimal VL value.

PR#55615

Diff Detail

Event Timeline

asi-sc created this revision.Aug 1 2022, 5:35 AM
asi-sc requested review of this revision.Aug 1 2022, 5:35 AM

What if the user is the merge operand of a tail-undisturbed instruction? The VL of the using instruction would not apply to it then.

asi-sc added a comment.Aug 2 2022, 5:43 AM

What if the user is the merge operand of a tail-undisturbed instruction? The VL of the using instruction would not apply to it then.

Thanks for catching that! I've looked at how to fix this problem, and while we can easily check the policy, it currently seems impossible to figure out whether an intrinsic has a merge operand. We could extend the intrinsic markup to explicitly state the position of the merge operand, similar to VLOperand and ScalarOperand. What do you think? The fix should not be hard once the position of merge operands is known.

A gentle ping on the question about adding markup for merge operands of intrinsics.

craig.topper added a comment.EditedAug 12 2022, 9:34 AM

A gentle ping on the question about adding markup for merge operands of intrinsics.

Sounds fine to me. But the more important case to fix is RISCVISD::*_VL nodes. I'm not sure how common it is to have splats combined with intrinsics. Intrinsics come from C code, where the splats should have been done with a vmv.v.x intrinsic that would already have the correct VL.

asi-sc updated this revision to Diff 458148.Sep 6 2022, 5:27 AM

Make VL choosing for a splat-like VMV based on its users

In some cases, the VL of VMV/VFMV instructions exceeds the required value, e.g. when we
convert a SPLAT_VECTOR into a vector move. This may cause insertion of extra or
too-wide vsetvli instructions. We can look at the common VL of the users to choose
the optimal VL value.

asi-sc retitled this revision from Make VL choosing for a splat constant based on its users to [RISCV] Make VL choosing for a splat-like VMV based on its users.Sep 6 2022, 5:29 AM
asi-sc edited the summary of this revision. (Show Details)
asi-sc edited the summary of this revision. (Show Details)Sep 6 2022, 5:30 AM

A gentle ping

Sounds fine to me. But the more important case to fix is RISCVISD::*_VL nodes. I'm not sure how common it is to have splats combined with intrinsics. Intrinsics come from C code, where the splats should have been done with a vmv.v.x intrinsic that would already have the correct VL.

The new patch generalizes the implementation to support RISCVISD::*_VL nodes.

Also attaching statistics showing the number of VL shrinks:

                  test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 167.00
                 test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 167.00
     test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 164.00
      test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 164.00
             test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  75.00
                     test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test  70.00
           test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test  47.00
               test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  31.00
                test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  31.00
                      test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  27.00
                          test-suite :: MultiSource/Applications/oggenc/oggenc.test  22.00
            test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test  19.00
      test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test  19.00
                       test-suite :: MultiSource/Applications/JM/lencod/lencod.test  16.00
           test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  15.00
          test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  15.00
             test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  13.00
          test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test  12.00
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/HACCKernels/HACCKernels.test  12.00
                        test-suite :: MultiSource/Applications/ClamAV/clamscan.test  12.00
         test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test  11.00
          test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test  11.00
                        test-suite :: MultiSource/Applications/sqlite3/sqlite3.test  11.00
                       test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  10.00
            test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test  10.00
     test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test   9.00
      test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test   9.00
                 test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test   7.00
   test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test   7.00
                  test-suite :: MultiSource/Benchmarks/FreeBench/neural/neural.test   6.00
        test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test   6.00
                 test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test   6.00
                    test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test   6.00
   ...

I've looked at a few ideas in this area, and there's an interaction I want to point out that I don't have a good answer for.

While reducing the VL of a splat to match the demanded lanes can remove vsetvli toggles, it can also add them. Specifically, if the splat happens to be scheduled in a region which uses the original wider VL, then reducing the VL actually pessimizes the code. This shows up in mixed-VL code, particularly when combined with LICM for splats which can't be (profitably) folded into their users.

We can patch part of this by extending InsertVSETVLI to allow increasing VL on splats if the splat is TA. However, there's still a scheduling interaction here which is a bit hard to account for.

I do believe that having VL reduction for all computation (not just splats) is generally a good idea; I'm just trying to figure out how to slice the pieces so that we can get there incrementally without nasty regressions along the way.

I've looked at a few ideas in this area, and there's an interaction I want to point out that I don't have a good answer for.

While reducing the VL of a splat to match the demanded lanes can remove vsetvli toggles, it can also add them. Specifically, if the splat happens to be scheduled in a region which uses the original wider VL, then reducing the VL actually pessimizes the code. This shows up in mixed-VL code, particularly when combined with LICM for splats which can't be (profitably) folded into their users.

We can patch part of this by extending InsertVSETVLI to allow increasing VL on splats if the splat is TA. However, there's still a scheduling interaction here which is a bit hard to account for.

I do believe that having VL reduction for all computation (not just splats) is generally a good idea; I'm just trying to figure out how to slice the pieces so that we can get there incrementally without nasty regressions along the way.

@reames, thanks for raising such an important question here. This change only works on splats with the TA policy, so we shouldn't get any regression for TU splats. At the same time, combining this patch with the modification of InsertVSETVLI you suggested should, from my point of view, give only a positive effect:

  1. At the DAG-to-DAG stage, the minimal VL for the vector move is chosen;
  2. After the modification of InsertVSETVLI, we can skip unnecessary vsetvli instructions for TA vector moves.

This order of vector move transformations seems pretty reasonable to me. Let me know if you want me to look into InsertVSETVLI.

Spent some time looking at this today. I wrote up my initial investigation as a bug (https://github.com/llvm/llvm-project/issues/57767), mostly because Phabricator doesn't make long-form comments easy.

The good news is that the scheduling interaction I was most worried about doesn't appear to be relevant for this patch. We already have that problem in generic IR; this code is basically only relevant for intrinsics. We have a difference in lowering strategy between fixed-length vectors and intrinsics which sends us down two different code paths.

craig.topper added inline comments.Sep 15 2022, 2:44 PM
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
2546

VL shrinking would not be legal for vs2 of a vrgather.vv or vrgatherei16.vv. The indices can select elements above VL.

I think it's also not legal for vs2 of vslidedown.vi and vslidedown.vx.
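To make the hazard concrete, here is a toy scalar model of vrgather.vv (illustrative only, not LLVM code or spec text): the gather's index operand can select vs2 elements at positions at or above the gather's own VL, so the VL used to produce vs2 cannot safely be shrunk to match it.

```cpp
#include <vector>
#include <cstddef>

// Toy scalar model of vrgather.vv. Even when the gather executes with a
// small VL, its index operand can select source (vs2) elements at
// positions >= that VL, so shrinking the VL used to *produce* vs2 would
// be unsound. Per the V spec, an out-of-range index yields 0.
std::vector<int> vrgatherModel(const std::vector<int> &vs2,
                               const std::vector<int> &idx, size_t vl) {
  std::vector<int> out(vl, 0);
  for (size_t i = 0; i < vl; ++i) {
    // In-range indices read vs2 directly, with no bound tied to the
    // gather's own VL.
    if (idx[i] >= 0 && static_cast<size_t>(idx[i]) < vs2.size())
      out[i] = vs2[idx[i]];
  }
  return out;
}
```

With VL=2 the gather can still read element 7 of vs2, which is exactly why vs2's producer must keep its wide VL.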

craig.topper added inline comments.Sep 15 2022, 2:55 PM
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
2510

I think this would be cleaner as two named variables.

2590

UpdateNodeOperands potentially returns a different pointer than Node if it triggered CSE internally. In that case, you need to replace all uses of Node with the returned value and not run any of the rest of the caller's code. See the UpdateNodeOperands call in X86DAGToDAGISel::tryShiftAmountMod for a similar case.

reames added inline comments.Sep 16 2022, 9:30 AM
llvm/lib/Target/RISCV/MCTargetDesc/RISCVBaseInfo.h
191 ↗(On Diff #458148)

If you rebase, these have landed through other patches which needed them.

llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
2498

JFYI, this looks suspicious. The operands of the SDNode and the MI are not guaranteed by construction to be the same. Or at least, if they are, we haven't adjusted other code under that assumption or asserted appropriately.

2565

I request that you separate this patch into one which is one-use restricted, and then a second which adds the use walk. The former patch would parallel what we do for e.g. insert_vector folds.

2772

This change is intended to be NFC right? If so, please separate it for review. As mentioned above, the correspondence of SDNode operands and MI operands is non-trivial.

asi-sc added inline comments.Sep 16 2022, 10:01 AM
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
2498

JFYI, this looks suspicious. The operands of the SDNode and the MI are not guaranteed by construction to be the same.

Then can we guarantee that

  • the policy operand is the last operand of the node (skipping glue and chain);
  • the VL operand is just before the policy operand if there is one, or the last operand otherwise;
  • the merge operand is the first operand after the defs?

Of course, only if the corresponding RISCVII::has*Op checker is true.

asi-sc added inline comments.Sep 19 2022, 7:19 AM
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
2565

@reames, do you think that the use-walk approach will introduce regressions that the one-use approach won't? I'll split the patch, that's not a problem; I'm just trying to understand what issue we are trying to solve by doing that.

asi-sc updated this revision to Diff 461221.Sep 19 2022, 8:39 AM

Addressed review comments

reames requested changes to this revision.Sep 21 2022, 3:12 PM
reames added inline comments.
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
1816

We will need to extend this for splats of constants eventually.

2512

This loop is redundant under one-use restriction.

2552

Pull this block out into a helper called canReadPastVL.

2573

It's not clear to me why merge ops have anything to do with the legality here.

2576

The policy is implicit for instructions without a policy operand; it is safer to return false here.

Otherwise, look at the tied-operand handling in RISCVInsertVSETVLI.cpp for what we'd need to factor out and reuse.

2579

Macro comment: the legality here makes me nervous. It might be good to start with a whitelist approach. We can come back and loosen it later, but starting with a known-good set would make the change less risky.

2592

This loop is redundant under one-use restriction.

2597

This call should become a single retrieval of VL from the single user under the one-use restriction.

2600

getVLOperand

2602

Extract a helper, e.g. isVLLessThan(VL, VL).

Since this runs after selection, I think you need to special-case RISCV::VLMaxSentinel.
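A minimal sketch of what such a helper might look like (names are illustrative; the real sentinel constant is RISCV::VLMaxSentinel, and extracting the VL operands from the nodes is omitted):

```cpp
#include <cstdint>

// Hypothetical helper comparing two immediate VL values, treating the
// all-ones sentinel (which encodes "use VLMAX") as larger than any
// concrete VL. Stand-in constant; the real one is RISCV::VLMaxSentinel.
constexpr uint64_t VLMaxSentinel = ~0ULL;

bool isVLLessThan(uint64_t LHS, uint64_t RHS) {
  if (LHS == VLMaxSentinel)
    return false; // VLMAX is never less than anything
  if (RHS == VLMaxSentinel)
    return true; // a concrete immediate VL is below VLMAX
  return LHS < RHS;
}
```

Without the explicit check, the sentinel's bit pattern (-1) would silently compare as a huge unsigned number, which happens to work but depends on the encoding.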

2616

getVLOpIdx

This revision now requires changes to proceed.Sep 21 2022, 3:12 PM
craig.topper added inline comments.Sep 21 2022, 3:26 PM
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
2573

The policy applies to how the merge operand is merged. If the instruction doesn't have a merge operand, or the use isn't the merge operand, then the splat is an input to the arithmetic portion of the instruction, which usually obeys VL, except for the special-case instructions above.
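The distinction above can be sketched with a toy scalar model of a masked, merge-capable operation (names and semantics are simplified, not the actual pseudoinstruction definitions): masked-off body lanes read the merge operand, while the arithmetic inputs are read only for lanes below VL.

```cpp
#include <vector>
#include <cstddef>

// Toy model of a masked, merge-capable vector op (think of a masked add).
// For i < vl: result[i] = mask[i] ? vs1[i] + vs2[i] : merge[i].
// The *arithmetic* inputs vs1/vs2 are only read below VL, which is why a
// splat feeding them can usually have its VL shrunk; the merge operand is
// read for masked-off body lanes, so it follows different rules.
std::vector<int> maskedAdd(const std::vector<int> &merge,
                           const std::vector<int> &vs1,
                           const std::vector<int> &vs2,
                           const std::vector<bool> &mask, size_t vl) {
  std::vector<int> out = merge; // tail modeled as undisturbed for simplicity
  for (size_t i = 0; i < vl; ++i)
    out[i] = mask[i] ? vs1[i] + vs2[i] : merge[i];
  return out;
}
```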

asi-sc added inline comments.Sep 22 2022, 1:49 AM
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
1816

Aren't splats already processed in RISCVDAGToDAGISel::PreprocessISelDAG? Honestly, I thought that ISD::SPLAT_VECTOR here is a legacy that we can remove.

2579

One more idea: we can forbid changing the VL of an instruction if its result is used as a merge operand, and add support for merge operands in a later patch as a separate feature.
Which option would you like me to follow?

2592

Isn't the one-use restriction a temporary thing? I've added it in a way that lets us easily drop the restriction and try the use-walk (see the if above). Is it a real issue that the code obtaining the new VL is more generic than the one-use approach requires? If you think we might eventually settle on the one-use approach, then I agree we must completely remove use-walk support from this patch.

craig.topper added inline comments.Sep 22 2022, 11:49 AM
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
1816

I wonder if @reames was noticing that it said V_X_VL and did not include V_I_VL, which doesn't exist. Before isel, all integer scalars use V_X_VL. Isel checks the size of the immediate and uses vmv.v.i or li+vmv.v.x.

asi-sc updated this revision to Diff 462896.Sep 26 2022, 6:53 AM

Addressed review comments, except for a few where I have questions.

asi-sc added inline comments.Sep 26 2022, 7:04 AM
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
2576

Thanks for noticing. I changed it to return false and added a FIXME comment.

2600

We cannot use getVLOperand here (or getVLOpIdx a few lines below) because Node is a RISCVISD node. At the same time, we know that it's a VMV/VFMV, so we can be pretty sure about the positioning of its operands.

2602

Actually, VLMaxSentinel was handled implicitly (since it is -1ULL and we did zero extension). But I agree we must check it explicitly and not rely on its exact value.

craig.topper added inline comments.Sep 26 2022, 5:55 PM
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
2497

Can we use SDValue() for consistency with the vast majority of SelectionDAG code?

2555

I don't think you can assume the using instruction's VL is relative to the same SEW/LMUL as the splat. There could have been a bitcast between them that changed the element size; bitcasts don't generate code. I think insert_subvector/extract_subvector can also change the type without creating any code, mainly for fractional LMULs.

craig.topper added inline comments.Sep 26 2022, 5:57 PM
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
2555

You'll need to extract the SEW and LMUL from both instructions and make sure they are in the same ratio so they would have the same VLMax.
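A hedged sketch of that check (names are illustrative, not LLVM API): two vector configurations have the same VLMAX exactly when their SEW/LMUL ratios match, since VLMAX = VLEN * LMUL / SEW. LMUL is passed pre-scaled by 8 so fractional LMULs mf8..mf2 are representable as integers.

```cpp
// Illustrative helper: compare SEW/LMUL ratios of two vector
// configurations. LMUL is passed multiplied by 8, so mf8..m8 map to
// 1..64. Equal ratios imply equal VLMAX for a given VLEN.
unsigned sewPerLmul(unsigned SEW, unsigned LMULTimes8) {
  return SEW * 8 / LMULTimes8;
}

bool haveSameVLMax(unsigned SEW1, unsigned LMUL1Times8,
                   unsigned SEW2, unsigned LMUL2Times8) {
  return sewPerLmul(SEW1, LMUL1Times8) == sewPerLmul(SEW2, LMUL2Times8);
}
```

For example, a bitcast from LMUL=4/SEW=32 to LMUL=4/SEW=64 doubles the ratio, so the two sides do not share a VLMAX and the splat's VL cannot be taken from the user.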

asi-sc added inline comments.Sep 28 2022, 7:04 AM
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
2555

I tried to write a test that shows the behavior you described, and I have to say it is tricky for me.
Bitcasts preserve the SEW/LMUL ratio, the llvm.vector.extract intrinsic results in an additional COPY at the DAG-to-DAG stage, and shufflevector instructions do not support indices other than zeroinitializer for scalable vectors.
Is it in general correct for the result of a VMV to be used in an instruction with a different SEW/LMUL ratio? That seems illegal to me at the DAG-to-DAG stage.

craig.topper added inline comments.Sep 28 2022, 9:52 AM
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
2555

Bitcasts don't preserve the SEW/LMUL ratio: they have a constant LMUL and change the SEW. So LMUL=4, SEW=32 can become LMUL=4, SEW=64.

asi-sc updated this revision to Diff 464719.Oct 3 2022, 9:45 AM

Apply VL shrinking only when the SEW/LMUL ratio is unmodified. This change depends on D135086.

asi-sc updated this revision to Diff 465411.Oct 5 2022, 8:22 AM

Rebase to fetch merged dependency

A gentle ping.

I addressed all the comments except a few that I'd like to discuss further.

llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
2579

This is still unresolved. A whitelist approach worries me: what should the rule be for distinguishing good and bad candidates for the whitelist? @reames, could you please share your thoughts?

Is the testing comprehensive enough to support my suggestion: split the patch, commit the parts separately, and check the impact on the tests?

2592

@reames, do you insist on simplifying all the generic code to process only one-use nodes? The original issue #55615 requires multiple-uses support, so eventually it must be introduced anyway. Currently, the use-walk is disabled in a way that lets us enable it again by removing two lines of code.

reames added inline comments.Oct 20 2022, 11:15 AM
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
2579

How do you propose to split the patch?

One way would be to introduce a narrow whitelist, land that version, and then relax it with dedicated testing and review. Was that what you had in mind? Or something else?

2592

Yes, I do.

asi-sc planned changes to this revision.Oct 21 2022, 9:04 AM
asi-sc added inline comments.
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
2579

I dumped all the opcodes of the vmv uses that reach this point in the code for llvm-test-suite and SPEC2017. As a reminder: at this point we know that the vmv has a single use, and this use is a merge operand of a TA instruction. The list of opcodes:

  • PseudoVLUXEI64_V_M1_M1_MASK
  • PseudoVLUXEI64_V_M1_MF2_MASK
  • PseudoVLSE16_V_M1_MASK
  • PseudoVLSE32_V_M1_MASK
  • PseudoVLSE32_V_MF2_MASK
  • PseudoVFMADD_VF64_M1

So, I think we should add basic arithmetic and load instructions to the whitelist. However, it won't be as narrow as @reames suggests.

2592

OK, thanks for your reply. It'll be done in the next patch.

asi-sc added inline comments.Oct 21 2022, 9:17 AM
llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp
2579

OK, I'll add only load instructions; otherwise the list is too big.

asi-sc updated this revision to Diff 470096.Oct 24 2022, 2:51 AM

Add a whitelist for merge operands, remove use-walk support

A gentle ping

reames requested changes to this revision.Nov 11 2022, 11:38 AM

I spent some time this morning creating a minimal version of this patch. This change is still too complicated, and I'm unconvinced of its correctness. Please see https://reviews.llvm.org/D137856.

While doing this, I realized this change is definitely incorrect. Consider the following test case:

  vsetvli x1, 5, <fixed lmul, flags>
  vmv.v.i v8, -1
  vsetvli x1, 1, <fixed lmul, flags>
  vle8 v8, (a5) // memory contains a vector of -1s
  vsetvli x1, 5, <fixed lmul, flags>
  vmseq v9, v8, -1

In this case, vle8 stands in for any instruction with a merge operand. The fact it's a load is not meant to be important.

The required semantics of this program are to produce a mask with five ones, regardless of the TA/TU status of the vle8. TA allows either the original value or -1, and in this case both choices produce the same result.

However, this patch would produce the following:

  vsetvli x1, 1, <fixed lmul, flags>
  vmv.v.i v8, -1
  vsetvli x1, 1, <fixed lmul, flags>
  vle8 v8, (a5) // memory contains a vector of -1s
  vsetvli x1, 5, <fixed lmul, flags>
  vmseq v9, v8, -1

In this case, lanes 1 through 4 (inclusive) are undefined, and the result may be a mask without those lanes set. As such, the transform is unsound as currently implemented.
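The failure can be sketched with a small scalar model (illustrative only, not LLVM code), treating tail-agnostic lanes as holding unspecified values: after shrinking the vmv.v.i to VL=1, only lane 0 of v8 is guaranteed to hold -1, so only one bit of the vmseq result at VL=5 is guaranteed set, instead of the required five.

```cpp
#include <array>
#include <optional>
#include <cstddef>

// Scalar model of the transformed sequence. A lane holds std::nullopt
// when its value is not guaranteed (tail-agnostic tail). Returns how
// many mask bits are *guaranteed* set by "vmseq v9, v8, -1" at VL=5;
// the required semantics demand 5.
size_t guaranteedMaskOnes() {
  constexpr size_t VL = 5;
  std::array<std::optional<int>, VL> v8;
  // vsetvli VL=1; vmv.v.i v8, -1 (TA): only lane 0 is defined.
  v8[0] = -1;
  for (size_t i = 1; i < VL; ++i)
    v8[i] = std::nullopt; // TA tail: unspecified values
  // vsetvli VL=1; vle8 v8, (a5): memory holds -1, lane 0 reloaded.
  v8[0] = -1;
  // vsetvli VL=5; vmseq v9, v8, -1: a mask bit is only known-set when
  // the lane is known to hold -1.
  size_t ones = 0;
  for (size_t i = 0; i < VL; ++i)
    if (v8[i].has_value() && *v8[i] == -1)
      ++ones;
  return ones;
}
```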

This revision now requires changes to proceed.Nov 11 2022, 11:38 AM

I spent some time this morning creating a minimal version of this patch. This change is still too complicated, and I'm unconvinced of its correctness. Please see https://reviews.llvm.org/D137856.

While doing this, I realized this change is definitely incorrect. Consider the following test case:

  vsetvli x1, 5, <fixed lmul, flags>
  vmv.v.i v8, -1
  vsetvli x1, 1, <fixed lmul, flags>
  vle8 v8, (a5) // memory contains a vector of -1s
  vsetvli x1, 5, <fixed lmul, flags>
  vmseq v9, v8, -1

In this case, vle8 stands in for any instruction with a merge operand. The fact it's a load is not meant to be important.

The required semantics of this program are to produce a mask with five ones, regardless of the TA/TU status of the vle8. TA allows either the original value or -1, and in this case both choices produce the same result.

However, this patch would produce the following:

  vsetvli x1, 1, <fixed lmul, flags>
  vmv.v.i v8, -1
  vsetvli x1, 1, <fixed lmul, flags>
  vle8 v8, (a5) // memory contains a vector of -1s
  vsetvli x1, 5, <fixed lmul, flags>
  vmseq v9, v8, -1

In this case, lanes 1 through 4 (inclusive) are undefined, and the result may be a mask without those lanes set. As such, the transform is unsound as currently implemented.

@reames, thanks for reviewing it so thoroughly! However, the test case you provided doesn't convince me of the incorrectness of this change.
(1) If the instruction with a merge operand (the vle) has the TU policy, we won't shrink the VL and the code remains untouched.
(2) If the instruction has the TA policy, I don't think we can rely on the values written to the tail. According to the spec:

The value of all 1s instead of all 0s was chosen for the overwrite value to discourage software developers from depending on the value written.

To me, this means "you must never rely on the concrete value written to the tail if the TA policy is used". Am I wrong? Do we really expect concrete values in the tail of a TA instruction when doing compiler transformations? If so, we should simply reject shrinking the VL for any merge operand.

asi-sc added a comment.Dec 2 2022, 5:05 AM

Ping.

@reames, could you please take a look at my previous comment? If you still think we should proceed with the minimized version you made, will you drive the review process? I haven't seen any progress there for three weeks.

reames added a comment.Dec 5 2022, 9:04 AM

I don't think there's much to say here. My prior point still stands. I find your proposed reading of the specification unconvincing. The optimization as proposed is unsound.

I am not particularly interested in this optimization. Unless you have a strong motivation to pursue this - in which case, please explain why - I'd suggest we abandon this patch and move on with our lives.

asi-sc planned changes to this revision.Dec 6 2022, 7:30 AM

Thanks for your feedback, Philip. I'll made some experiments with this change at my free time, so not abandoning it right now. However, I mark it as "plan changes" to remove it from review queues.