This is an archive of the discontinued LLVM Phabricator instance.

[RISCV][ISel] Fold extensions when all the users can consume them
ClosedPublic

Authored by qcolombet on Sep 12 2022, 5:27 PM.

Details

Summary

This patch aims at starting a conversation about how people think we should approach forming VW variants (operations that widen their inputs arguments) more aggressively.

Currently we fold sign/zero extensions in instructions that support widening only when the result of the extension is used only once.
The current (WIP) patch lifts this limitation by checking whether all the users of the extension support the folding and by allowing the transformation when that's the case.

The patch is far from being perfect because it doesn't actually check that the folding will happen for all the instructions (and in true SDISel fashion will be defeated by basic block boundaries) but demonstrates what could be achieved, codegen-wise, with the added test:

--- old_codegen.s       2022-09-13 00:12:48.989575265 +0000
+++ new_codegen.s       2022-09-13 00:13:02.134793836 +0000
@@ -16,30 +16,28 @@
 .Lfunc_end0:
        .size   vwmul_v2i16, .Lfunc_end0-vwmul_v2i16
        .cfi_endproc
                                         # -- End function
        .globl  vwmul_v2i16_multiple_users      # -- Begin function vwmul_v2i16_multiple_users
        .p2align        2
        .type   vwmul_v2i16_multiple_users,@function
 vwmul_v2i16_multiple_users:             # @vwmul_v2i16_multiple_users
        .cfi_startproc
 # %bb.0:
-       vsetivli        zero, 2, e16, mf4, ta, mu
+       vsetivli        zero, 2, e8, mf8, ta, mu
        vle8.v  v8, (a0)
        vle8.v  v9, (a1)
        vle8.v  v10, (a2)
-       vsext.vf2       v11, v8
-       vsext.vf2       v8, v9
-       vsext.vf2       v9, v10
-       vmul.vv v8, v11, v8
-       vmul.vv v9, v11, v9
-       vor.vv  v8, v8, v9
+       vwmul.vv        v11, v8, v9
+       vwmul.vv        v9, v8, v10
+       vsetvli zero, zero, e16, mf4, ta, mu
+       vor.vv  v8, v11, v9

@craig.topper How do you think we should approach forming VW instructions?

Diff Detail

Event Timeline

qcolombet created this revision.Sep 12 2022, 5:27 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 12 2022, 5:27 PM
qcolombet requested review of this revision.Sep 12 2022, 5:27 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 12 2022, 5:27 PM

Thanks for moving this forward, @qcolombet!

I wonder if even for cases where not all the users of the extension are folded, the performance gain from using a vw variant would make up for keeping two extension instructions. Any idea?

Thanks for moving this forward, @qcolombet!

I wonder if even for cases where not all the users of the extension are folded, the performance gain from using a vw variant would make up for keeping two extension instructions. Any idea?

One issue is that f it's a 4x or 8x extend, we only fold part of it. If we don't fold all uses, we'll have a 4x/8x extend left behind and a new 2x/4x extend created by the fold.

Other thoughts I had were about register pressure at larger LMULs. Not folding all uses of the extend increases the live range of the input to the extend.

On the other hand, it starts getting expensive to fully check that all uses can fold. If there are N users, we'll run the checks something like (N*(N+1))/2 times.

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
8158

Can this be something like llvm::all_of(Val->uses()?

On the other hand, it starts getting expensive to fully check that all uses can fold. If there are N users, we'll run the checks something like (N*(N+1))/2 times.

Agree, that's also a problem with the current patch. Even if the checks are lightweight we still do them n^2 times.
One thing I had in mind is we could do the transformation in one go for an extension (i.e., replace its users all at once).

The problem with that is it doesn't fit nicely with SDISel combining model, though I still believe it may be possible to do (at least I vaguely remember doing that in the past.)

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
8158

Good catch, yes, we should be able to use that.

I wonder if even for cases where not all the users of the extension are folded, the performance gain from using a vw variant would make up for keeping two extension instructions. Any idea?

I was wondering about that too, but like Craig said the register pressure implication may be problematic.

Have you looked at allowing the fold into the widening version without the one-use check at all? This would allow users of the extend which could be widen instructions to use the input of the extend while leaving the extend around for any non-wideable users.

Under the assumption that the widening variants execute at least as fast as the non-widening variants, this wouldn't seem to be problematic from a latency/throughput perspective.

There is a register pressure concern - as we potentially have to keep both extended and non-extended version alive where previously, the unextended version might have been dead. But in principle we have that problem every time we fold e.g. a splat into a .v.x variant of any instruction, and we don't seem to be burnt there.

Have you looked at allowing the fold into the widening version without the one-use check at all? This would allow users of the extend which could be widen instructions to use the input of the extend while leaving the extend around for any non-wideable users.

Under the assumption that the widening variants execute at least as fast as the non-widening variants, this wouldn't seem to be problematic from a latency/throughput perspective.

There is a register pressure concern - as we potentially have to keep both extended and non-extended version alive where previously, the unextended version might have been dead. But in principle we have that problem every time we fold e.g. a splat into a .v.x variant of any instruction, and we don't seem to be burnt there.

The register pressure is worse for large LMUL. We do have an early clobber on the extend instructions anyway, so the dest already can't reuse the source register.

We can only fold one 2x stage of widening. If the original sext/zext is from i8->i32/i64 or from i16->i64, the fold will create a smaller extend for the remaining part. If the original extend doesn't fold into all uses, this increases the number of instructions.

Have you looked at allowing the fold into the widening version without the one-use check at all? This would allow users of the extend which could be widen instructions to use the input of the extend while leaving the extend around for any non-wideable users.

Under the assumption that the widening variants execute at least as fast as the non-widening variants, this wouldn't seem to be problematic from a latency/throughput perspective.

There is a register pressure concern - as we potentially have to keep both extended and non-extended version alive where previously, the unextended version might have been dead. But in principle we have that problem every time we fold e.g. a splat into a .v.x variant of any instruction, and we don't seem to be burnt there.

The register pressure is worse for large LMUL.

Right, but this is a general problem for large LMUL. i.e. splat and extend are the same with respect to this.

We do have an early clobber on the extend instructions anyway, so the dest already can't reuse the source register.

I was wondering about needing to have two copies *live* over instructions between the extend and the last original use of the extend. So, not at the extend instruction itself, more the live ranges extending past that.

We can only fold one 2x stage of widening. If the original sext/zext is from i8->i32/i64 or from i16->i64, the fold will create a smaller extend for the remaining part. If the original extend doesn't fold into all uses, this increases the number of instructions.

We can maybe leave the one use requirement for this version of the transform? It seems reasonable to have different heuristics for "this folds the extend entirely" and "this allows a narrower extend". As an aside, its not really clear to me why the "narrower extend" version is profitable ever. It would seem neutral at best.

Have you looked at allowing the fold into the widening version without the one-use check at all? This would allow users of the extend which could be widen instructions to use the input of the extend while leaving the extend around for any non-wideable users.

Under the assumption that the widening variants execute at least as fast as the non-widening variants, this wouldn't seem to be problematic from a latency/throughput perspective.

There is a register pressure concern - as we potentially have to keep both extended and non-extended version alive where previously, the unextended version might have been dead. But in principle we have that problem every time we fold e.g. a splat into a .v.x variant of any instruction, and we don't seem to be burnt there.

The register pressure is worse for large LMUL.

Right, but this is a general problem for large LMUL. i.e. splat and extend are the same with respect to this.

We do have an early clobber on the extend instructions anyway, so the dest already can't reuse the source register.

I was wondering about needing to have two copies *live* over instructions between the extend and the last original use of the extend. So, not at the extend instruction itself, more the live ranges extending past that.

We can only fold one 2x stage of widening. If the original sext/zext is from i8->i32/i64 or from i16->i64, the fold will create a smaller extend for the remaining part. If the original extend doesn't fold into all uses, this increases the number of instructions.

We can maybe leave the one use requirement for this version of the transform? It seems reasonable to have different heuristics for "this folds the extend entirely" and "this allows a narrower extend". As an aside, its not really clear to me why the "narrower extend" version is profitable ever. It would seem neutral at best.

Without looking at any particular implementation. Widening from LMUL1 to LMUL4 could be 4 microops to write each physical register. Followed by another 4 microps for the add or mul. For a total of 8 ops. Whereas widening LMUL1 to LMUL2 could be 2 microops, followed by 4 microops for doing a widening add/mul from LMUL2 to LMUL4. For a total of 6 microops.

Have you looked at allowing the fold into the widening version without the one-use check at all? This would allow users of the extend which could be widen instructions to use the input of the extend while leaving the extend around for any non-wideable users.

Under the assumption that the widening variants execute at least as fast as the non-widening variants, this wouldn't seem to be problematic from a latency/throughput perspective.

There is a register pressure concern - as we potentially have to keep both extended and non-extended version alive where previously, the unextended version might have been dead. But in principle we have that problem every time we fold e.g. a splat into a .v.x variant of any instruction, and we don't seem to be burnt there.

The register pressure is worse for large LMUL.

Right, but this is a general problem for large LMUL. i.e. splat and extend are the same with respect to this.

We do have an early clobber on the extend instructions anyway, so the dest already can't reuse the source register.

I was wondering about needing to have two copies *live* over instructions between the extend and the last original use of the extend. So, not at the extend instruction itself, more the live ranges extending past that.

We can only fold one 2x stage of widening. If the original sext/zext is from i8->i32/i64 or from i16->i64, the fold will create a smaller extend for the remaining part. If the original extend doesn't fold into all uses, this increases the number of instructions.

We can maybe leave the one use requirement for this version of the transform? It seems reasonable to have different heuristics for "this folds the extend entirely" and "this allows a narrower extend". As an aside, its not really clear to me why the "narrower extend" version is profitable ever. It would seem neutral at best.

Without looking at any particular implementation. Widening from LMUL1 to LMUL4 could be 4 microops to write each physical register. Followed by another 4 microps for the add or mul. For a total of 8 ops. Whereas widening LMUL1 to LMUL2 could be 2 microops, followed by 4 microops for doing a widening add/mul from LMUL2 to LMUL4. For a total of 6 microops.

So, saying this back to you - reasonable hardware exists which incorporates extends at no additional cost, and the cost of the extend depends on result LMUL. So folding only part of the extend into the widening op reduces total cost. Fair enough.

My point about restricting that transform to one use while not restricting the variant that doesn't require the shift at all still seems reasonable.

My point about restricting that transform to one use while not restricting the variant that doesn't require the shift at all still seems reasonable

Just to react on that, although this could be a relatively easy win (see next sentence), in our motivating example it would not help because we keep the intermediate extends, while having several uses.

Going back to dropping the one-use check for the variant that completely fold, I'm unclear how we could make sure this is always a net win. We still have the register pressure issue if the extension is not completely eliminated.

Diego (@dcaballe) and I talked offline and I am going to prototype doing the folding in one go for a whole web of extensions/users and see how bad the implementation looks.

Technically I would rather have a dedicated pass for the whole thing, but given other combines may rely on these combines to produce better code (e.g., to generate vwmacc) and that doing these kind of combines before ISel may prevent other optimizations (because the extended operation would be represented by an intrinsic, which is opaque to the generic combines), I don't see a whole lot of options at the moment.

TL;DR Stay tuned, tentative patch is on its way.

Quick update here. I have a patch that does an all or nothing approach for the folding (i.e., if at least one operation cannot fold, don't do it.)
I'll post it hopefully next week.
There are two parts to it:

  1. A refactoring of how the folding is done, so that all the decisions are made by the same function for all binop (add/sub, mul, addw/subw) (This part is the meat of the whole exercise.)
  2. A relatively small patch that leverage this function to take the decision for a set of instructions connected by s|zext.
qcolombet retitled this revision from [RISCV][WIP] Form more VW instructions to [RISCV][ISel] Fold extensions when all the users can consume them.

This patch allows the combines that fold extensions in binary operations to have more than one use.
The approach here is pretty conservative: if all the users of an extension can fold the extension, then the folding is done, otherwise we don't fold. This is the first step towards avoiding the one-use limitation.

As a result, we make a decision to fold/don't fold for a web of instructions. An instruction is part of the web of instructions as soon as it consumes an extension that needs to be folded for all its users.

Because of how SDISel works a web of instructions can be visited over and over. More precisely, if the folding happens, it happens for the whole web and that's the end of it, but if the folding fails, the whole web may be revisited when another member of the web is visited.

To avoid a compile time explosion in pathological cases, we bail out earlier for webs that are bigger than a given threshold (arbitrarily set at 18 for now.) This size can be changed using --riscv-lower-ext-max-web-size=<maxWebSize>.

At the current time, I didn't see a better scheme for that. Assuming we want to stick with doing that in SDISel.

The "NFC" refactoring is at https://reviews.llvm.org/D134703.
Then this diff has been updated to take advantage of it.

I'm not super happy about the hardcoded threshold.

Ideally, I think we'd like to do this kind of matching in a machine pass. The problem with that is we would need to duplicate some of the combines in that machine pass (e.g., like how we produce vwmacc). That would be a perfect fit for GISel :D.

Anyhow, if you guys don't like the SDISel patch (this patch and its child), we can avoid spending time on code review and start thinking about a machine pass. (The "NFC" patch, catch potentially interesting cases though.)

qcolombet updated this revision to Diff 464860.Oct 3 2022, 5:32 PM
  • Rebase parent diff
This revision is now accepted and ready to land.Oct 4 2022, 10:52 PM
This revision was landed with ongoing or failed builds.Oct 5 2022, 1:50 PM
This revision was automatically updated to reflect the committed changes.