This is an archive of the discontinued LLVM Phabricator instance.

[WIP][RISCV][InsertVSETVLI] Allow promotion of TA to TU and MA to MU
Abandoned · Public

Authored by reames on Oct 12 2022, 12:29 PM.

Details

Summary

This is not a final patch - I need to properly plumb through MRI among other things. I'm posting for discussion. What do we think of the idea of eliminating vsetvli transitions by expanding the region which is mu and/or tu? I think this is generally reasonable, but are there any cases we need to be careful about?
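To make the question concrete, here is a minimal sketch of the idea as a toy model, not the actual RISCVInsertVSETVLI logic. It assumes each instruction demands either "ta" or "tu", that a "tu" state also satisfies a "ta" demand (agnostic permits undisturbed behavior), and that a toggle is emitted only when the current state does not satisfy the demand. The function name `count_vsetvli` and the string encoding are made up for illustration.

```python
def count_vsetvli(demands, allow_promotion):
    """Count the vsetvli toggles a naive forward walk would emit.

    demands: list of "ta" / "tu" strings, one per vector instruction.
    allow_promotion: if True, an instruction that only needs "ta" is
    allowed to run while the state is "tu" (hypothetical promotion rule).
    """
    state = None  # tail policy currently in effect; None means unknown
    emitted = 0
    for need in demands:
        satisfied = state == need or (
            allow_promotion and state == "tu" and need == "ta"
        )
        if not satisfied:
            state = need  # emit a vsetvli that switches to the demanded policy
            emitted += 1
    return emitted


demands = ["tu", "ta", "tu", "ta"]
print(count_vsetvli(demands, allow_promotion=False))  # 4
print(count_vsetvli(demands, allow_promotion=True))   # 1
```

The model only counts toggles. It says nothing about what the promoted instructions cost at run time, which is the open question here.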

Event Timeline

reames created this revision. Oct 12 2022, 12:29 PM
reames requested review of this revision. Oct 12 2022, 12:29 PM
reames added inline comments. Oct 12 2022, 12:34 PM
llvm/test/CodeGen/RISCV/fold-vector-cmp.ll, line 16

FYI, D135794 is somewhat of an alternate patch to this test change. If we land that, this becomes less impactful, and vice versa.

craig.topper added inline comments. Oct 12 2022, 12:39 PM
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-bitcast.ll, line 527

So now we can't execute this instruction until the previous writer of the vmv.v.x destination register completes? At least on a microarchitecture with register renaming.

reames added inline comments. Oct 12 2022, 1:05 PM
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-bitcast.ll, line 527

For the instruction "vmv.v.x v8, a1", there's now a false dependence on the prior value of v8. Previously, under TA, the hardware could ignore the old value of v8 and unconditionally set the tail lanes to all ones (-1). After the change to TU, the tail lanes must keep their old contents, so the hardware must wait for that dependency to be resolved.
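As a rough illustration of why the dependence appears, here is a small model of the two tail policies for a splat. It is a sketch only: a register is modeled as a Python list, the function name `vmv_v_x` is borrowed from the instruction but the signature is made up, and the agnostic case is modeled as "overwrite the tail with all ones", which is one of the outcomes the policy permits.

```python
def vmv_v_x(old_dest, scalar, vl, tail_undisturbed):
    """Toy model of `vmv.v.x vd, rs1` on a register of len(old_dest) elements."""
    body = [scalar] * vl
    if tail_undisturbed:
        # TU: tail elements keep their old value, so old_dest is a real input.
        tail = list(old_dest[vl:])
    else:
        # TA (one permitted outcome): tail is all ones; old_dest is never read.
        tail = [-1] * (len(old_dest) - vl)
    return body + tail


print(vmv_v_x([7, 7, 7, 7], 5, 2, tail_undisturbed=True))   # [5, 5, 7, 7]
print(vmv_v_x([7, 7, 7, 7], 5, 2, tail_undisturbed=False))  # [5, 5, -1, -1]
print(vmv_v_x([0, 0, 0, 0], 5, 2, tail_undisturbed=False))  # [5, 5, -1, -1]
```

In the agnostic case the result is the same for any old destination value, which is what lets hardware skip waiting on the previous writer.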

craig.topper added inline comments. Oct 12 2022, 1:23 PM
llvm/test/CodeGen/RISCV/rvv/vmacc.ll, line 1572

If this were in a loop and a load misses the cache, the later iterations couldn't speculatively start loading until the earlier cache miss is resolved. That doesn't seem ideal.

reames added inline comments. Oct 12 2022, 1:41 PM
llvm/test/CodeGen/RISCV/rvv/vmacc.ll, line 1572

Depends on how the hardware handles this, and I don't really know what's realistic. In theory, the load can be issued, and only the merge is bottlenecked by the false dependency. Not sure if that's a reasonable expectation of hardware or not.

Note that the vmacc has the same loop-carried false dependency issue in either case. So we're really just talking about the ability to overlap the loads.

But yes, unless the hardware is pretty uniformly smart about this - as sketched above - this would seem to be a fatal flaw for this patch.

craig.topper added inline comments. Oct 12 2022, 2:05 PM
llvm/test/CodeGen/RISCV/rvv/vmacc.ll, line 1572

I guess on most iterations of the loop you would be using vlmax, so there isn't a tail. So maybe only the last iteration would be affected.
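A quick sketch of that observation, assuming a strip-mined loop where each iteration picks vl = min(remaining, vlmax). Real vsetvli is allowed to choose vl differently when the remaining count is between vlmax and 2*vlmax, so treat this as a simplification. The helper name `strip_mine` is hypothetical.

```python
def strip_mine(n, vlmax):
    """Return the vl used on each iteration of a strip-mined loop over n elements.

    Assumes vl = min(remaining, vlmax) on every iteration (a simplification).
    """
    vls = []
    remaining = n
    while remaining > 0:
        vl = min(remaining, vlmax)
        vls.append(vl)
        remaining -= vl
    return vls


vls = strip_mine(10, 4)
print(vls)                              # [4, 4, 2]
print([vl for vl in vls if vl < 4])     # [2]  only the last iteration has a tail
```

Under this assumption at most one iteration, the final one, runs with vl below vlmax and therefore has tail elements at all.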

reames abandoned this revision. Oct 13 2022, 10:41 AM

Chatted w/ Craig about this offline. As pointed out in review comments, there are some cases where switching from agnostic to undisturbed can have significant runtime cost - mostly by preventing otherwise legal speculative reordering. We could maybe refine this into a patch which only exploits the possible state conversion for cheap instructions, but that's a bunch of infrastructure we don't have right now. At the moment, we don't have a strong motivation to push this. From a quick glance at vector code, we're down to a small handful of tu or mu cases, and some extra toggles probably aren't worth aggressively optimizing. We'll revisit when we have motivating examples.