This is an archive of the discontinued LLVM Phabricator instance.

Differential D133739

[RISCV][ISel] Fold extensions when all the users can consume them
Closed · Public

Authored by qcolombet on Sep 12 2022, 5:27 PM.


Details

Reviewers

craig.topper
dcaballe
reames
rogfer01

Commits

rGc5c2de287e5f: [RISCV][ISel] Fold extensions when all the users can consume them

Summary

This patch aims to start a conversation about how people think we should approach forming VW variants (operations that widen their input arguments) more aggressively.

Currently we fold sign/zero extensions in instructions that support widening only when the result of the extension is used only once.
The current (WIP) patch lifts this limitation by checking whether all the users of the extension support the folding and by allowing the transformation when that's the case.

The patch is far from being perfect because it doesn't actually check that the folding will happen for all the instructions (and in true SDISel fashion will be defeated by basic block boundaries) but demonstrates what could be achieved, codegen-wise, with the added test:

--- old_codegen.s       2022-09-13 00:12:48.989575265 +0000
+++ new_codegen.s       2022-09-13 00:13:02.134793836 +0000
@@ -16,30 +16,28 @@
 .Lfunc_end0:
        .size   vwmul_v2i16, .Lfunc_end0-vwmul_v2i16
        .cfi_endproc
                                         # -- End function
        .globl  vwmul_v2i16_multiple_users      # -- Begin function vwmul_v2i16_multiple_users
        .p2align        2
        .type   vwmul_v2i16_multiple_users,@function
 vwmul_v2i16_multiple_users:             # @vwmul_v2i16_multiple_users
        .cfi_startproc
 # %bb.0:
-       vsetivli        zero, 2, e16, mf4, ta, mu
+       vsetivli        zero, 2, e8, mf8, ta, mu
        vle8.v  v8, (a0)
        vle8.v  v9, (a1)
        vle8.v  v10, (a2)
-       vsext.vf2       v11, v8
-       vsext.vf2       v8, v9
-       vsext.vf2       v9, v10
-       vmul.vv v8, v11, v8
-       vmul.vv v9, v11, v9
-       vor.vv  v8, v8, v9
+       vwmul.vv        v11, v8, v9
+       vwmul.vv        v9, v8, v10
+       vsetvli zero, zero, e16, mf4, ta, mu
+       vor.vv  v8, v11, v9
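For readers less familiar with the widening instructions, here is a minimal scalar sketch of why the two code sequences in the diff above compute the same value. It models the `<2 x i8>` inputs as plain arrays; the helper names (`signExtend`, `mulWide`, `wideningMul`) are made up for illustration and are not LLVM or RVV APIs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V2i8 = std::array<std::int8_t, 2>;
using V2i16 = std::array<std::int16_t, 2>;

// Old shape, step 1: explicit sign extension of every element
// (the role played by vsext.vf2).
V2i16 signExtend(const V2i8 &In) {
  V2i16 Out{};
  for (std::size_t I = 0; I < In.size(); ++I)
    Out[I] = static_cast<std::int16_t>(In[I]);
  return Out;
}

// Old shape, step 2: multiply at the wide element type
// (the role played by vmul.vv).
V2i16 mulWide(const V2i16 &A, const V2i16 &B) {
  V2i16 Out{};
  for (std::size_t I = 0; I < A.size(); ++I)
    Out[I] = static_cast<std::int16_t>(A[I] * B[I]);
  return Out;
}

// New shape: a single operation that reads narrow inputs and produces a wide
// result (the role played by vwmul.vv).
V2i16 wideningMul(const V2i8 &A, const V2i8 &B) {
  V2i16 Out{};
  for (std::size_t I = 0; I < A.size(); ++I)
    Out[I] = static_cast<std::int16_t>(static_cast<std::int16_t>(A[I]) *
                                       static_cast<std::int16_t>(B[I]));
  return Out;
}
```

Because the product of two 8-bit values always fits in 16 bits, `mulWide(signExtend(A), signExtend(B))` and `wideningMul(A, B)` agree for every input, which is what lets the explicit extensions disappear.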

@craig.topper How do you think we should approach forming VW instructions?

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

qcolombet created this revision. Sep 12 2022, 5:27 PM

Herald added a project: Restricted Project. Sep 12 2022, 5:27 PM

Herald added subscribers: sunshaoce, VincentWu, luke957 and 30 others.

qcolombet requested review of this revision. Sep 12 2022, 5:27 PM

Herald added a project: Restricted Project. Sep 12 2022, 5:27 PM

Herald added subscribers: pcwang-thead, eopXD, MaskRay.

craig.topper added a reviewer: reames. Sep 12 2022, 5:34 PM

Harbormaster completed remote builds in B186262: Diff 459595. Sep 12 2022, 6:09 PM

Thanks for moving this forward, @qcolombet!

I wonder if even for cases where not all the users of the extension are folded, the performance gain from using a vw variant would make up for keeping two extension instructions. Any idea?

dcaballe added a reviewer: rogfer01. Sep 14 2022, 12:04 AM

In D133739#3788725, @dcaballe wrote:

> Thanks for moving this forward, @qcolombet!
>
> I wonder if even for cases where not all the users of the extension are folded, the performance gain from using a vw variant would make up for keeping two extension instructions. Any idea?

One issue is that if it's a 4x or 8x extend, we only fold part of it. If we don't fold all uses, we'll have a 4x/8x extend left behind and a new 2x/4x extend created by the fold.

Other thoughts I had were about register pressure at larger LMULs. Not folding all uses of the extend increases the live range of the input to the extend.

On the other hand, it starts getting expensive to fully check that all uses can fold. If there are N users, we'll run the checks something like (N*(N+1))/2 times.
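To make the (N*(N+1))/2 estimate concrete, here is a small sketch under one possible reading of it: each visit re-checks every user that has not been folded yet and then folds one of them. The function name and the model itself are assumptions for illustration, not code from the patch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model: with `Remaining` users still unfolded, one combine visit
// runs `Remaining` foldability checks and then folds a single user.
std::size_t countUseChecks(std::size_t NumUsers) {
  std::size_t Checks = 0;
  for (std::size_t Remaining = NumUsers; Remaining > 0; --Remaining)
    Checks += Remaining;
  return Checks; // Equals NumUsers * (NumUsers + 1) / 2.
}
```

Under this model 3 users cost 6 checks and 18 users cost 171, so the cost grows quadratically with the number of users even when each individual check is cheap.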

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
Line 8235: Can this be something like `llvm::all_of(Val->uses()`?

> On the other hand, it starts getting expensive to fully check that all uses can fold. If there are N users, we'll run the checks something like (N*(N+1))/2 times.

Agreed, that's also a problem with the current patch. Even if the checks are lightweight, we still do them n^2 times.
One thing I had in mind is that we could do the transformation in one go for an extension (i.e., replace its users all at once).

The problem with that is that it doesn't fit nicely with the SDISel combining model, though I still believe it may be possible to do (at least I vaguely remember doing that in the past).

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
Line 8235: Good catch, yes, we should be able to use that.
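As a rough illustration of the suggestion, the sketch below contrasts a hand-written loop with the equivalent predicate-based form. It uses `std::all_of` from the standard library as a stand-in (`llvm::all_of` takes a range directly instead of an iterator pair), and `User` with its `CanFoldExtension` flag is a hypothetical placeholder for the real per-use check.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a user of the extension.
struct User {
  bool CanFoldExtension;
};

// Hand-written loop: bail out on the first user that cannot fold.
bool allUsersCanFoldLoop(const std::vector<User> &Users) {
  for (const User &U : Users)
    if (!U.CanFoldExtension)
      return false;
  return true;
}

// Same check expressed with an all_of-style algorithm.
bool allUsersCanFold(const std::vector<User> &Users) {
  return std::all_of(Users.begin(), Users.end(),
                     [](const User &U) { return U.CanFoldExtension; });
}
```

Both forms stop at the first failing user; the second simply states the intent ("every user satisfies the predicate") more directly.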

> I wonder if even for cases where not all the users of the extension are folded, the performance gain from using a vw variant would make up for keeping two extension instructions. Any idea?

I was wondering about that too, but like Craig said the register pressure implication may be problematic.

Have you looked at allowing the fold into the widening version without the one-use check at all? This would allow users of the extend which could be widening instructions to use the input of the extend while leaving the extend around for any non-widenable users.

Under the assumption that the widening variants execute at least as fast as the non-widening variants, this wouldn't seem to be problematic from a latency/throughput perspective.

There is a register pressure concern - as we potentially have to keep both extended and non-extended version alive where previously, the unextended version might have been dead. But in principle we have that problem every time we fold e.g. a splat into a .v.x variant of any instruction, and we don't seem to be burnt there.

In D133739#3790115, @reames wrote:

> Have you looked at allowing the fold into the widening version without the one-use check at all? This would allow users of the extend which could be widening instructions to use the input of the extend while leaving the extend around for any non-widenable users.
>
> Under the assumption that the widening variants execute at least as fast as the non-widening variants, this wouldn't seem to be problematic from a latency/throughput perspective.
>
> There is a register pressure concern - as we potentially have to keep both extended and non-extended version alive where previously, the unextended version might have been dead. But in principle we have that problem every time we fold e.g. a splat into a .v.x variant of any instruction, and we don't seem to be burnt there.

The register pressure is worse for large LMUL. We do have an early clobber on the extend instructions anyway, so the dest already can't reuse the source register.

We can only fold one 2x stage of widening. If the original sext/zext is from i8->i32/i64 or from i16->i64, the fold will create a smaller extend for the remaining part. If the original extend doesn't fold into all uses, this increases the number of instructions.
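The arithmetic behind "only one 2x stage" can be sketched as follows. The helper name is hypothetical; it only encodes the statement above that a widening operation absorbs a factor of two and any remaining factor must still be materialized as a narrower extend.

```cpp
#include <cassert>

// Returns the extension factor that is left over after folding one 2x stage
// into a widening operation. A result of 1 means the extend disappears.
unsigned remainingExtendFactor(unsigned SrcBits, unsigned DstBits) {
  assert(SrcBits != 0 && DstBits % SrcBits == 0 && DstBits / SrcBits >= 2 &&
         "expected a 2x, 4x or 8x extension");
  return (DstBits / SrcBits) / 2;
}
```

For example, i8->i16 leaves a factor of 1 (no extend remains), while i8->i32 and i16->i64 leave a 2x extend and i8->i64 leaves a 4x extend. If the original extend also has users that cannot fold, both the original and the new narrower extend end up in the code.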

In D133739#3790199, @craig.topper wrote:

> The register pressure is worse for large LMUL.

Right, but this is a general problem for large LMUL, i.e., splat and extend are the same with respect to this.

> We do have an early clobber on the extend instructions anyway, so the dest already can't reuse the source register.

I was wondering about needing to have two copies *live* over instructions between the extend and the last original use of the extend. So, not at the extend instruction itself, more the live ranges extending past that.

> We can only fold one 2x stage of widening. If the original sext/zext is from i8->i32/i64 or from i16->i64, the fold will create a smaller extend for the remaining part. If the original extend doesn't fold into all uses, this increases the number of instructions.

We can maybe leave the one-use requirement for this version of the transform? It seems reasonable to have different heuristics for "this folds the extend entirely" and "this allows a narrower extend". As an aside, it's not really clear to me why the "narrower extend" version is ever profitable. It would seem neutral at best.

In D133739#3790249, @reames wrote:

> As an aside, it's not really clear to me why the "narrower extend" version is ever profitable. It would seem neutral at best.

Without looking at any particular implementation: widening from LMUL1 to LMUL4 could take 4 microops, one to write each physical register, followed by another 4 microops for the add or mul, for a total of 8 microops. Widening from LMUL1 to LMUL2, on the other hand, could take 2 microops, followed by 4 microops for a widening add/mul from LMUL2 to LMUL4, for a total of 6 microops.
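The counts above can be reproduced with a tiny cost model. This is only a sketch of the hypothetical machine described in the comment, where an instruction costs one microop per destination register (i.e., its destination LMUL); the function names are invented for illustration and do not describe any real core.

```cpp
#include <cassert>

// Full extend to the destination LMUL, then a regular add/mul at that LMUL.
unsigned extendThenOpCost(unsigned DstLMUL) {
  return DstLMUL /*extend*/ + DstLMUL /*add or mul*/;
}

// Narrower extend to half the destination LMUL, then a widening add/mul that
// writes the full destination LMUL.
unsigned narrowExtendThenWideningOpCost(unsigned DstLMUL) {
  assert(DstLMUL >= 2 && DstLMUL % 2 == 0 && "need room for one 2x stage");
  return DstLMUL / 2 /*narrower extend*/ + DstLMUL /*widening add or mul*/;
}
```

With a destination of LMUL4 this gives 8 microops for the first sequence and 6 for the second, matching the numbers in the comment.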

In D133739#3790353, @craig.topper wrote:

> Without looking at any particular implementation: widening from LMUL1 to LMUL4 could take 4 microops, one to write each physical register, followed by another 4 microops for the add or mul, for a total of 8 microops. Widening from LMUL1 to LMUL2, on the other hand, could take 2 microops, followed by 4 microops for a widening add/mul from LMUL2 to LMUL4, for a total of 6 microops.

So, saying this back to you - reasonable hardware exists which incorporates extends at no additional cost, and the cost of the extend depends on result LMUL. So folding only part of the extend into the widening op reduces total cost. Fair enough.

My point about restricting that transform to one use while not restricting the variant that doesn't require the shift at all still seems reasonable.

> My point about restricting that transform to one use while not restricting the variant that doesn't require the shift at all still seems reasonable.

Just to react to that: although this could be a relatively easy win (see next sentence), in our motivating example it would not help because we keep the intermediate extends, while having several uses.

Going back to dropping the one-use check for the variant that completely folds, I'm unclear how we could make sure this is always a net win. We still have the register pressure issue if the extension is not completely eliminated.

Diego (@dcaballe) and I talked offline and I am going to prototype doing the folding in one go for a whole web of extensions/users and see how bad the implementation looks.

Technically I would rather have a dedicated pass for the whole thing, but given that other combines may rely on these combines to produce better code (e.g., to generate vwmacc) and that doing this kind of combine before ISel may prevent other optimizations (because the extended operation would be represented by an intrinsic, which is opaque to the generic combines), I don't see a whole lot of options at the moment.

TL;DR Stay tuned, tentative patch is on its way.

Quick update here. I have a patch that takes an all-or-nothing approach to the folding (i.e., if at least one operation cannot fold, don't do it).
I'll post it, hopefully next week.
There are two parts to it:

- A refactoring of how the folding is done, so that all the decisions are made by the same function for all binops (add/sub, mul, addw/subw). This part is the meat of the whole exercise.
- A relatively small patch that leverages this function to make the decision for a set of instructions connected by s|zext.

This patch allows the combines that fold extensions in binary operations to have more than one use.
The approach here is pretty conservative: if all the users of an extension can fold the extension, then the folding is done, otherwise we don't fold. This is the first step towards avoiding the one-use limitation.

As a result, we make a decision to fold/don't fold for a web of instructions. An instruction is part of the web of instructions as soon as it consumes an extension that needs to be folded for all its users.

Because of how SDISel works, a web of instructions can be visited over and over. More precisely, if the folding happens, it happens for the whole web and that's the end of it, but if the folding fails, the whole web may be revisited when another member of the web is visited.

To avoid a compile-time explosion in pathological cases, we bail out early for webs that are bigger than a given threshold (arbitrarily set at 18 for now). This size can be changed using --riscv-lower-ext-max-web-size=<maxWebSize>.

At the moment, I don't see a better scheme for that, assuming we want to stick with doing this in SDISel.
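The all-or-nothing decision with a size cap can be sketched on a toy graph as follows. This is a simplified model, not the patch's code: `Op`, `ExtSiblings`, and `collectWeb` are hypothetical names, and each operation is assumed to already know which other operations share an extension with it.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <set>
#include <vector>

// Hypothetical stand-in for a binary operation whose operands are extensions.
struct Op {
  bool CanFold;                 // Can this op absorb its extensions?
  std::vector<int> ExtSiblings; // Indices of other ops using the same extensions.
};

// Returns the ops to rewrite, starting from Root, or std::nullopt when any
// member of the web cannot fold or the web exceeds MaxWebSize.
std::optional<std::vector<int>> collectWeb(const std::vector<Op> &Ops, int Root,
                                           std::size_t MaxWebSize) {
  assert(Root >= 0 && static_cast<std::size_t>(Root) < Ops.size());
  std::vector<int> Worklist{Root};
  std::set<int> Inserted{Root};
  std::vector<int> Web;
  while (!Worklist.empty()) {
    int Cur = Worklist.back();
    Worklist.pop_back();
    // All or nothing: one unfoldable member rejects the whole web.
    if (!Ops[Cur].CanFold)
      return std::nullopt;
    for (int Sibling : Ops[Cur].ExtSiblings)
      if (Inserted.insert(Sibling).second)
        Worklist.push_back(Sibling);
    // Bound compile time by capping the number of nodes we look at.
    if (Inserted.size() > MaxWebSize)
      return std::nullopt;
    Web.push_back(Cur);
  }
  return Web;
}
```

With three operations that all share one extension (as in the added test, where `%c` has three uses), a cap of 3 accepts the web and a cap of 2 rejects it, which mirrors the FOLDING and NO_FOLDING run lines.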

qcolombet added a child revision: D134703: [RISCV][ISel] Refactor the formation of VW operations. Sep 26 2022, 10:05 PM

Harbormaster completed remote builds in B188849: Diff 463095. Sep 26 2022, 10:05 PM

The "NFC" refactoring is at https://reviews.llvm.org/D134703.
Then this diff has been updated to take advantage of it.

I'm not super happy about the hardcoded threshold.

Ideally, I think we'd like to do this kind of matching in a machine pass. The problem with that is we would need to duplicate some of the combines in that machine pass (e.g., like how we produce vwmacc). That would be a perfect fit for GISel :D.

Anyhow, if you guys don't like the SDISel patch (this patch and its child), we can avoid spending time on code review and start thinking about a machine pass. (The "NFC" patch catches potentially interesting cases, though.)

craig.topper removed a child revision: D134703: [RISCV][ISel] Refactor the formation of VW operations. Sep 30 2022, 1:37 PM

craig.topper added a parent revision: D134703: [RISCV][ISel] Refactor the formation of VW operations.

Rebase parent diff

Harbormaster completed remote builds in B190098: Diff 464860. Oct 3 2022, 5:34 PM

LGTM

This revision is now accepted and ready to land. Oct 4 2022, 10:52 PM

This revision was landed with ongoing or failed builds. Oct 5 2022, 1:50 PM

Closed by commit rGc5c2de287e5f: [RISCV][ISel] Fold extensions when all the users can consume them (authored by qcolombet).

This revision was automatically updated to reflect the committed changes.

qcolombet added a commit: rGc5c2de287e5f: [RISCV][ISel] Fold extensions when all the users can consume them.

GitHub <noreply@github.com> mentioned this in rG5b155aea0e52: [RISCV][ISel] Combine scalable vector add/sub/mul with zero/sign extension…. Mon, Jan 15, 12:58 PM

GitHub <noreply@github.com> mentioned this in rGba81477e9cdb: Recommit "[RISCV][ISel] Combine scalable vector add/sub/mul with zero/sign…. Wed, Jan 17, 6:30 PM

Revision Contents

Path and size:

llvm/lib/Target/RISCV/RISCVISelLowering.cpp (102 lines)
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vw-web-simplification.ll (60 lines)
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwmul.ll (12 lines)

Diff 465546

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

(40 lines not shown)
 #include "llvm/Support/raw_ostream.h"

 using namespace llvm;

 #define DEBUG_TYPE "riscv-lower"

 STATISTIC(NumTailCalls, "Number of tail calls");

+static cl::opt<unsigned> ExtensionMaxWebSize(
+    DEBUG_TYPE "-ext-max-web-size", cl::Hidden,
+    cl::desc("Give the maximum size (in number of nodes) of the web of "
+             "instructions that we will consider for VW expansion"),
+    cl::init(18));
+
 static cl::opt<bool>
     AllowSplatInVW_W(DEBUG_TYPE "-form-vw-w-with-splat", cl::Hidden,
                      cl::desc("Allow the formation of VW_W operations (e.g., "
                               "VWADD_W) with splat constants"),
                      cl::init(false));

 RISCVTargetLowering::RISCVTargetLowering(const TargetMachine &TM,
                                          const RISCVSubtarget &STI)
(8,164 lines not shown)
 /// add_vl -> vwadd(u) | vwadd(u)_w
 /// sub_vl -> vwsub(u) | vwsub(u)_w
 /// mul_vl -> vwmul(u) | vwmul_su
 ///
 /// An object of this class represents an operand of the operation we want to
 /// combine.
 /// E.g., when trying to combine `mul_vl a, b`, we will have one instance of
 /// NodeExtensionHelper for `a` and one for `b`.
 ///
 /// This class abstracts away how the extension is materialized and
 /// how its Mask, VL, number of users affect the combines.
 ///
 /// In particular:
 /// - VWADD_W is conceptually == add(op0, sext(op1))
 /// - VWADDU_W == add(op0, zext(op1))
 /// - VWSUB_W == sub(op0, sext(op1))
 /// - VWSUBU_W == sub(op0, zext(op1))
(304 lines not shown; enclosing context: struct CombineResult {)
   /// \see NodeExtensionHelper::getSource().
   Optional<unsigned> LHSExtOpc;
   /// Extension opcode to be applied to the source of RHS when materializing
   /// TargetOpcode.
   Optional<unsigned> RHSExtOpc;
   /// Root of the combine.
   SDNode *Root;
   /// LHS of the TargetOpcode.
-  const NodeExtensionHelper &LHS;
+  NodeExtensionHelper LHS;
   /// RHS of the TargetOpcode.
-  const NodeExtensionHelper &RHS;
+  NodeExtensionHelper RHS;

   CombineResult(unsigned TargetOpcode, SDNode *Root,
                 const NodeExtensionHelper &LHS, Optional<bool> SExtLHS,
                 const NodeExtensionHelper &RHS, Optional<bool> SExtRHS)
       : TargetOpcode(TargetOpcode), Root(Root), LHS(LHS), RHS(RHS) {
     MVT NarrowVT = NodeExtensionHelper::getNarrowType(Root);
     if (SExtLHS && LHS.getSource().getValueType() != NarrowVT)
       LHSExtOpc = *SExtLHS ? RISCVISD::VSEXT_VL : RISCVISD::VZEXT_VL;
(162 lines not shown)
 /// vwadd_w(u) -> vwadd(u)
 /// vwub_w(u) -> vwadd(u)
 static SDValue
 combineBinOp_VLToVWBinOp_VL(SDNode *N, TargetLowering::DAGCombinerInfo &DCI) {
   SelectionDAG &DAG = DCI.DAG;

   assert(NodeExtensionHelper::isSupportedRoot(N) &&
          "Shouldn't have called this method");
+  SmallVector<SDNode *> Worklist;
+  SmallSet<SDNode *, 8> Inserted;
+  Worklist.push_back(N);
+  Inserted.insert(N);
+  SmallVector<CombineResult> CombinesToApply;
+
+  while (!Worklist.empty()) {
+    SDNode *Root = Worklist.pop_back_val();
+    if (!NodeExtensionHelper::isSupportedRoot(Root))
+      return SDValue();

     NodeExtensionHelper LHS(N, 0, DAG);
     NodeExtensionHelper RHS(N, 1, DAG);
+    auto AppendUsersIfNeeded = [&Worklist,
+                                &Inserted](const NodeExtensionHelper &Op) {
+      if (Op.needToPromoteOtherUsers()) {
+        for (SDNode *TheUse : Op.OrigOperand->uses()) {
+          if (Inserted.insert(TheUse).second)
+            Worklist.push_back(TheUse);
+        }
+      }
+    };
+    AppendUsersIfNeeded(LHS);
+    AppendUsersIfNeeded(RHS);

-    if (LHS.needToPromoteOtherUsers() && !LHS.OrigOperand.hasOneUse())
-      return SDValue();
-
-    if (RHS.needToPromoteOtherUsers() && !RHS.OrigOperand.hasOneUse())
+    // Control the compile time by limiting the number of node we look at in
+    // total.
+    if (Inserted.size() > ExtensionMaxWebSize)
       return SDValue();

     SmallVector<NodeExtensionHelper::CombineToTry> FoldingStrategies =
         NodeExtensionHelper::getSupportedFoldings(N);

     assert(!FoldingStrategies.empty() && "Nothing to be folded");
-    for (int Attempt = 0; Attempt != 1 + NodeExtensionHelper::isCommutative(N);
+    bool Matched = false;
+    for (int Attempt = 0;
+         (Attempt != 1 + NodeExtensionHelper::isCommutative(N)) && !Matched;
          ++Attempt) {

       for (NodeExtensionHelper::CombineToTry FoldingStrategy :
            FoldingStrategies) {
         Optional<CombineResult> Res = FoldingStrategy(N, LHS, RHS);
-        if (Res)
-          return Res->materialize(DAG);
+        if (Res) {
+          Matched = true;
+          CombinesToApply.push_back(*Res);
+          break;
+        }
       }
       std::swap(LHS, RHS);
     }
+    // Right now we do an all or nothing approach.
+    if (!Matched)
       return SDValue();
   }
+  // Store the value for the replacement of the input node separately.
+  SDValue InputRootReplacement;
+  // We do the RAUW after we materialize all the combines, because some replaced
+  // nodes may be feeding some of the yet-to-be-replaced nodes. Put differently,
+  // some of these nodes may appear in the NodeExtensionHelpers of some of the
+  // yet-to-be-visited CombinesToApply roots.
+  SmallVector<std::pair<SDValue, SDValue>> ValuesToReplace;
+  ValuesToReplace.reserve(CombinesToApply.size());
+  for (CombineResult Res : CombinesToApply) {
+    SDValue NewValue = Res.materialize(DAG);
+    if (!InputRootReplacement) {
+      assert(Res.Root == N &&
+             "First element is expected to be the current node");
+      InputRootReplacement = NewValue;
+    } else {
+      ValuesToReplace.emplace_back(SDValue(Res.Root, 0), NewValue);
+    }
+  }
+  for (std::pair<SDValue, SDValue> OldNewValues : ValuesToReplace) {
+    DAG.ReplaceAllUsesOfValueWith(OldNewValues.first, OldNewValues.second);
+    DCI.AddToWorklist(OldNewValues.second.getNode());
+  }
+  return InputRootReplacement;
+}

 // Fold
 // (fp_to_int (froundeven X)) -> fcvt X, rne
 // (fp_to_int (ftrunc X)) -> fcvt X, rtz
 // (fp_to_int (ffloor X)) -> fcvt X, rdn
 // (fp_to_int (fceil X)) -> fcvt X, rup
 // (fp_to_int (fround X)) -> fcvt X, rmm
 static SDValue performFP_TO_INTCombine(SDNode *N,
(4,160 lines not shown)

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vw-web-simplification.ll

This file was added.

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=riscv32 -mattr=+v -riscv-v-vector-bits-min=128 -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=1 | FileCheck %s --check-prefixes=NO_FOLDING
; RUN: llc -mtriple=riscv64 -mattr=+v -riscv-v-vector-bits-min=128 -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=1 | FileCheck %s --check-prefixes=NO_FOLDING
; RUN: llc -mtriple=riscv32 -mattr=+v -riscv-v-vector-bits-min=128 -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=2 | FileCheck %s --check-prefixes=NO_FOLDING
; RUN: llc -mtriple=riscv64 -mattr=+v -riscv-v-vector-bits-min=128 -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=2 | FileCheck %s --check-prefixes=NO_FOLDING
; RUN: llc -mtriple=riscv32 -mattr=+v -riscv-v-vector-bits-min=128 -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=3 | FileCheck %s --check-prefixes=FOLDING
; RUN: llc -mtriple=riscv64 -mattr=+v -riscv-v-vector-bits-min=128 -verify-machineinstrs %s -o - --riscv-lower-ext-max-web-size=3 | FileCheck %s --check-prefixes=FOLDING
; Check that the default value enables the web folding and
; that it is bigger than 3.
; RUN: llc -mtriple=riscv32 -mattr=+v -riscv-v-vector-bits-min=128 -verify-machineinstrs %s -o - | FileCheck %s --check-prefixes=FOLDING
; RUN: llc -mtriple=riscv64 -mattr=+v -riscv-v-vector-bits-min=128 -verify-machineinstrs %s -o - | FileCheck %s --check-prefixes=FOLDING


; Check that the add/sub/mul operations are all promoted into their
; vw counterpart when the folding of the web size is increased to 3.
; We need the web size to be at least 3 for the folding to happen, because
; %c has 3 uses.
define <2 x i16> @vwmul_v2i16_multiple_users(<2 x i8>* %x, <2 x i8>* %y, <2 x i8> *%z) {
; NO_FOLDING-LABEL: vwmul_v2i16_multiple_users:
; NO_FOLDING: # %bb.0:
; NO_FOLDING-NEXT: vsetivli zero, 2, e16, mf4, ta, mu
; NO_FOLDING-NEXT: vle8.v v8, (a0)
; NO_FOLDING-NEXT: vle8.v v9, (a1)
; NO_FOLDING-NEXT: vle8.v v10, (a2)
; NO_FOLDING-NEXT: vsext.vf2 v11, v8
; NO_FOLDING-NEXT: vsext.vf2 v8, v9
; NO_FOLDING-NEXT: vsext.vf2 v9, v10
; NO_FOLDING-NEXT: vmul.vv v8, v11, v8
; NO_FOLDING-NEXT: vadd.vv v10, v11, v9
; NO_FOLDING-NEXT: vsub.vv v9, v11, v9
; NO_FOLDING-NEXT: vor.vv v8, v8, v10
; NO_FOLDING-NEXT: vor.vv v8, v8, v9
; NO_FOLDING-NEXT: ret
;
; FOLDING-LABEL: vwmul_v2i16_multiple_users:
; FOLDING: # %bb.0:
; FOLDING-NEXT: vsetivli zero, 2, e8, mf8, ta, mu
; FOLDING-NEXT: vle8.v v8, (a0)
; FOLDING-NEXT: vle8.v v9, (a1)
; FOLDING-NEXT: vle8.v v10, (a2)
; FOLDING-NEXT: vwmul.vv v11, v8, v9
; FOLDING-NEXT: vwadd.vv v9, v8, v10
; FOLDING-NEXT: vwsub.vv v12, v8, v10
; FOLDING-NEXT: vsetvli zero, zero, e16, mf4, ta, mu
; FOLDING-NEXT: vor.vv v8, v11, v9
; FOLDING-NEXT: vor.vv v8, v8, v12
; FOLDING-NEXT: ret
  %a = load <2 x i8>, <2 x i8>* %x
  %b = load <2 x i8>, <2 x i8>* %y
  %b2 = load <2 x i8>, <2 x i8>* %z
  %c = sext <2 x i8> %a to <2 x i16>
  %d = sext <2 x i8> %b to <2 x i16>
  %d2 = sext <2 x i8> %b2 to <2 x i16>
  %e = mul <2 x i16> %c, %d
  %f = add <2 x i16> %c, %d2
  %g = sub <2 x i16> %c, %d2
  %h = or <2 x i16> %e, %f
  %i = or <2 x i16> %h, %g
  ret <2 x i16> %i
}

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vwmul.ll

(15 lines not shown; enclosing context: ; CHECK-NEXT: ret)
   %d = sext <2 x i8> %b to <2 x i16>
   %e = mul <2 x i16> %c, %d
   ret <2 x i16> %e
 }

 define <2 x i16> @vwmul_v2i16_multiple_users(<2 x i8>* %x, <2 x i8>* %y, <2 x i8> *%z) {
 ; CHECK-LABEL: vwmul_v2i16_multiple_users:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: vsetivli zero, 2, e16, mf4, ta, mu
+; CHECK-NEXT: vsetivli zero, 2, e8, mf8, ta, mu
 ; CHECK-NEXT: vle8.v v8, (a0)
 ; CHECK-NEXT: vle8.v v9, (a1)
 ; CHECK-NEXT: vle8.v v10, (a2)
-; CHECK-NEXT: vsext.vf2 v11, v8
-; CHECK-NEXT: vsext.vf2 v8, v9
-; CHECK-NEXT: vsext.vf2 v9, v10
-; CHECK-NEXT: vmul.vv v8, v11, v8
-; CHECK-NEXT: vmul.vv v9, v11, v9
-; CHECK-NEXT: vor.vv v8, v8, v9
+; CHECK-NEXT: vwmul.vv v11, v8, v9
+; CHECK-NEXT: vwmul.vv v9, v8, v10
+; CHECK-NEXT: vsetvli zero, zero, e16, mf4, ta, mu
+; CHECK-NEXT: vor.vv v8, v11, v9
 ; CHECK-NEXT: ret
   %a = load <2 x i8>, <2 x i8>* %x
   %b = load <2 x i8>, <2 x i8>* %y
   %b2 = load <2 x i8>, <2 x i8>* %z
   %c = sext <2 x i8> %a to <2 x i16>
   %d = sext <2 x i8> %b to <2 x i16>
   %d2 = sext <2 x i8> %b2 to <2 x i16>
   %e = mul <2 x i16> %c, %d
(863 lines not shown)