This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Removing vsetvli by removing agnostic flags on previous instructions
Abandoned · Public

Authored by reames on Jun 3 2022, 7:49 AM.

Details

Summary

This is posted for discussion, not for actual review. I'm pretty sure we're not going to take this patch for the reason discussed below, but I'm curious what other directions folks think are worth considering.

Semantically, we can switch an earlier vsetvli from agnostic to undisturbed, since leaving the elements undisturbed is one of the behaviors that agnostic permits. This patch implements a local transformation to delete a vsetvli configuration instruction if we can prove there is an earlier vsetvli which differs only in the mask and tail agnostic policy bits.
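
To make the legality condition concrete, here is a minimal standalone sketch of the check described above. It is not the code from this patch: the `VSetVLI` struct and the `differsOnlyInPolicy` and `mergedVType` helpers are hypothetical names invented for illustration, and the sketch assumes both instructions request the same AVL, so VL is unchanged. The bit positions follow the RVV 1.0 `vtype` layout (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7).

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// RVV 1.0 vtype layout: vlmul = bits 2:0, vsew = bits 5:3, vta = bit 6, vma = bit 7.
constexpr uint64_t kTailAgnostic = 1u << 6;
constexpr uint64_t kMaskAgnostic = 1u << 7;
constexpr uint64_t kPolicyBits = kTailAgnostic | kMaskAgnostic;

// Hypothetical stand-in for one vsetvli: the AVL it requests and the vtype it writes.
struct VSetVLI {
  uint64_t avl;
  uint64_t vtype;
};

// True when the two configurations agree on everything except (vma, vta).
// Assumption: equal AVL plus equal SEW/LMUL means both produce the same VL.
constexpr bool differsOnlyInPolicy(const VSetVLI &a, const VSetVLI &b) {
  return a.avl == b.avl && ((a.vtype ^ b.vtype) & ~kPolicyBits) == 0;
}

// If `later` can be deleted, return the vtype `earlier` must be rewritten to.
// A policy bit stays agnostic only if both instructions asked for agnostic;
// otherwise it becomes undisturbed, which is a legal way to implement agnostic.
constexpr std::optional<uint64_t> mergedVType(const VSetVLI &earlier,
                                              const VSetVLI &later) {
  if (!differsOnlyInPolicy(earlier, later))
    return std::nullopt;
  return (earlier.vtype & ~kPolicyBits) |
         (earlier.vtype & later.vtype & kPolicyBits);
}

// e32,m1,ta,ma (0xD0) followed by e32,m1,tu,ma (0x90): the later vsetvli can
// go away if the earlier one is rewritten to e32,m1,tu,ma.
static_assert(*mergedVType(VSetVLI{16, 0xD0}, VSetVLI{16, 0x90}) == 0x90u);
// A SEW change (e32 -> e64) is a real state transition, so nothing is merged.
static_assert(!mergedVType(VSetVLI{16, 0xD0}, VSetVLI{16, 0xD8}).has_value());
```

Note that the sketch only answers whether the rewrite is legal. It says nothing about whether it is profitable, which is the open question in the rest of this summary.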

The problem with this approach is that I don't have a good grasp on the performance implications of switching from agnostic to undisturbed. What follows is really just a complete guess.

For high LMUL data parallel operations (e.g. vadd m8), it seems reasonable to believe that at least some hardware will use tail agnosticism combined with short VL (say a constant), to break dependencies on the input registers corresponding to the register group's tail. In theory, this is also possible for LMUL=1, but my guess is this is much less likely in practice.

From here, I see a couple of potential paths, and I'm curious about folks' take.

Option 1 - Do something like this, and figure out the cost model pieces later if it turns out this is a problem for some real piece of hardware.

Option 2 - Try to explore the cost modeling pieces now, and come up with a restricted form of this transform which is "always" profitable. This may be quite hard, mostly due to a lack of information.

Option 3 - Investigate why we're using TU in the later vsetvli at all. In several cases I've skimmed, I suspect that we could canonicalize to TA instead. Doing so indirectly removes the vsetvli by removing the state transition, and may (per above reasoning) have other benefits as well.

I think this basically comes down to a general question on how we want to optimize policy bits. Has anyone given this thought already?

Diff Detail