This is an archive of the discontinued LLVM Phabricator instance.

Differential D116039

[X86] Combine reduce (add (mul x, y)) to VNNI instruction.
ClosedPublic

Authored by LuoYuanke on Dec 20 2021, 6:57 AM.

Details

Reviewers

craig.topper
pengfei
RKSimon

Commits

rG21babe4db326: [X86] Combine reduce(add (mul x, y)) to VNNI instruction.

Summary

For the C code below, we can use VNNI to combine the multiply and add operations.

int usdot_prod_qi(unsigned char *restrict a, char *restrict b, int c, int n) {
  int i;
  for (i = 0; i < 32; i++) {
    c += ((int)a[i] * (int)b[i]);
  }
  return c;
}

This patch does not support the combine across basic blocks.
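
To make the mapping concrete, here is a minimal scalar sketch in C++. It is not code from the patch. It assumes the VPDPBUSD lane semantics quoted later in this review (each 32-bit lane accumulates four unsigned-byte by signed-byte products) and assumes `char` is signed, as it is on x86. The names `dpbusd_lane`, `usdot_prod_ref` and `usdot_prod_lanes` are hypothetical.

```cpp
#include <cstdint>

// Hypothetical scalar model of one 32-bit VPDPBUSD lane: the accumulator
// gains four (unsigned byte * signed byte) products.
static int32_t dpbusd_lane(int32_t acc, const uint8_t *a, const int8_t *b) {
  for (int k = 0; k < 4; ++k)
    acc += static_cast<int32_t>(a[k]) * static_cast<int32_t>(b[k]);
  return acc;
}

// The loop from the summary, element by element.
int usdot_prod_ref(const uint8_t *a, const int8_t *b, int c) {
  for (int i = 0; i < 32; ++i)
    c += static_cast<int>(a[i]) * static_cast<int>(b[i]);
  return c;
}

// The same sum regrouped: 8 lanes of 4 products each, then a horizontal add
// of the lanes (the reduce.add step) plus the incoming accumulator c.
int usdot_prod_lanes(const uint8_t *a, const int8_t *b, int c) {
  int32_t sum = 0;
  for (int lane = 0; lane < 8; ++lane)
    sum += dpbusd_lane(0, a + 4 * lane, b + 4 * lane);
  return c + sum;
}
```

Both functions return the same value for any 32-element inputs, because integer addition is associative and the sums here cannot overflow 32 bits.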

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

LuoYuanke created this revision. · Dec 20 2021, 6:57 AM

Herald added subscribers: pengfei, hiraditya. · Dec 20 2021, 6:57 AM

LuoYuanke requested review of this revision. · Dec 20 2021, 6:57 AM

Herald added a project: Restricted Project. · Dec 20 2021, 6:57 AM

Herald added a subscriber: llvm-commits.

LuoYuanke added reviewers: craig.topper, pengfei, RKSimon. · Dec 20 2021, 6:59 AM

LuoYuanke edited the summary of this revision.

Any explicit checks for extension/truncation and their bitwidth delta instantly make me suspicious nowadays.
Does this deal with commutativity?
I think what you want to check is the number of known sign bits / known leading zero bits.

Update the title.

LuoYuanke retitled this revision from "[X86] Combine reduce(mul x, y) to VNNI instruction." to "[X86] Combine reduce (add (mul x, y)) to VNNI instruction." · Dec 20 2021, 7:03 AM

In D116039#3202764, @lebedev.ri wrote:

> Any explicit checks for extension/truncation and their bitwidth delta instantly make me suspicious nowadays.

Please comment inline on the code so that I can follow more easily.

> Does this deal with commutativity?

At X86ISelLowering.cpp:41763.

> I think what you want to check is the number of known sign bits / known leading zero bits.

Please comment inline on the code so that I can follow more easily.

lebedev.ri added inline comments. · Dec 20 2021, 7:10 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
41756–41774	Any explicit checks for extension/truncation and their bitwidth delta instantly make me suspicious nowadays. Does this deal with commutativity? I think what you want to check is the number of known sign bits / known leading zero bits.

Harbormaster completed remote builds in B140084: Diff 395436. · Dec 20 2021, 7:25 AM

craig.topper added inline comments. · Dec 20 2021, 8:38 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
41756–41774	If the sign extend side can be proven to be positive, the sign extend might be hidden as zero extend. This is why tryMAddReplacement checks for FreeTruncations and calls ComputeNumSignBits.
41801	`Opcode` is only used once. Why not just make it part of the `getNode` call?

Address Craig and Roman's comments.

LuoYuanke marked an inline comment as done. · Dec 21 2021, 6:15 AM

LuoYuanke added inline comments.

llvm/lib/Target/X86/X86ISelLowering.cpp
41756–41774	Thanks Craig and Roman. I enhanced it in the new patch.

Remove debug code.

Please fix the patch description - both extensions there are signed, is that actually the specification for the intrinsic?

llvm/lib/Target/X86/X86ISelLowering.cpp
41790	I'm not sure i follow. Why is this okay with negative numbers?
41790–41791	This still does not handle the commutative variant.
41791	You want `ComputeMinSignedBits() <= 8` to check for sext-like

Harbormaster completed remote builds in B140244: Diff 395662. · Dec 21 2021, 6:43 AM

lebedev.ri added inline comments. · Dec 21 2021, 6:55 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
41790	https://alive2.llvm.org/ce/z/WGxXrz vs https://alive2.llvm.org/ce/z/UbWhNv

RKSimon added inline comments. · Dec 21 2021, 8:21 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
41812	createVPDPBUSD ?
42108	Should this be called combineVPDPBUSDPattern? VNNI is the ISA no?
llvm/test/CodeGen/X86/dpbusd.ll
5	You might be able to use a common prefix for some of these to reduce check duplication
7	Drop dso_local?

LuoYuanke added inline comments. · Dec 21 2021, 11:13 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
41790	> I'm not sure i follow. Why is this okay with negative numbers?
	Here is the description of VPDPBUSD from the Intel programming reference (https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html?wapkw=instruction). The first operand is unsigned, and the second operand is signed. Multiplies the individual unsigned bytes of the first source operand by the corresponding signed bytes of the second source operand, producing intermediate signed word results. The word results are then summed and accumulated in the destination dword element size operand.
41790–41791	> This still does not handle the commutative variant.
	Sorry, I don't understand this very well. I handle it with `std::swap(Op0, Op1)` at line 41764.
41791	> You want `ComputeMinSignedBits() <= 8` to check for sext-like
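
As a rough illustration of what a "minimum signed bits <= 8" check means, here is a scalar C++ sketch. It is an assumption-laden model, not LLVM code: it works on one concrete 32-bit value, whereas `ComputeMinSignedBits()` reasons about what is known of a DAG node, and it assumes the relation "minimum signed bits = bit width - number of sign bits + 1". The helper names are hypothetical. The landed diff calls `ComputeMaxSignificantBits`, which appears to be the same query under a later name.

```cpp
#include <cstdint>

// Number of leading bits equal to the sign bit, counting the sign bit itself
// (scalar analogue of a "number of sign bits" query on a 32-bit value).
unsigned num_sign_bits(int32_t v) {
  uint32_t u = static_cast<uint32_t>(v);
  uint32_t sign = u >> 31;
  unsigned n = 1;
  for (int bit = 30; bit >= 0 && ((u >> bit) & 1u) == sign; --bit)
    ++n;
  return n;
}

// Smallest two's-complement width that still represents v.
unsigned min_signed_bits(int32_t v) { return 32 - num_sign_bits(v) + 1; }

// "sext-like from i8": the value survives truncation to a signed byte.
bool fits_signed_byte(int32_t v) { return min_signed_bits(v) <= 8; }
```

For example, 127 and -128 need 8 bits and pass the check, while 128 and -129 need 9 bits and fail it.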

Address Roman and Simon's comments. Thanks for the review.

LuoYuanke marked 5 inline comments as done. · Dec 21 2021, 11:17 PM

Could you please explain why there are both the knownbits-based checks, and checks for ISD::SIGN/ZERO_EXTEND nodes?

Harbormaster completed remote builds in B140343: Diff 395799. · Dec 21 2021, 11:40 PM

In D116039#3206040, @lebedev.ri wrote:

> Could you please explain why there are both the knownbits-based checks, and checks for ISD::SIGN/ZERO_EXTEND nodes?

VPDPBUSD multiplies the individual unsigned bytes of the first source operand by the corresponding signed bytes of the second source operand, producing intermediate signed word results. The word results are then summed and accumulated in the destination dword element size operand.

src2 is a signed value, so we don't need to check for ISD::SIGN_EXTEND nodes: if the sign bits are 1 the value is negative, and if the sign bits are 0 it is positive.
src1, however, is an unsigned value. If it is non-negative that is fine, but if it may be negative we can't use VPDPBUSD to combine the original nodes. See the test cases mul_sext_i4i4 and mul_zext_i4i4 in dpbusd_i4.ll: for mul_zext_i4i4 we can use VPDPBUSD, but for mul_sext_i4i4 we can't, because src1 may be negative.
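
The difference between the two i4 test cases can be reproduced with a small scalar model. This is a hedged sketch, not code from the patch: `sext_i4`, `zext_i4` and `dpbusd_product` are hypothetical helpers, and `dpbusd_product` assumes the operands are truncated to bytes and then read as unsigned (src1) and signed (src2), as in the instruction description above.

```cpp
#include <cstdint>

// Models `sext i4 -> i32`: the low 4 bits are read as a signed value.
int32_t sext_i4(uint8_t bits) {
  int32_t v = bits & 0xF;
  return (v & 0x8) ? v - 16 : v;
}

// Models `zext i4 -> i32`: the low 4 bits are read as an unsigned value.
int32_t zext_i4(uint8_t bits) { return bits & 0xF; }

// One byte-pair product as the instruction would see it after truncating the
// i32 operands to i8: src1 is read as unsigned, src2 as signed.
int32_t dpbusd_product(int32_t src1, int32_t src2) {
  uint8_t u = static_cast<uint8_t>(src1);
  int8_t s = static_cast<int8_t>(src2);
  return static_cast<int32_t>(u) * static_cast<int32_t>(s);
}
```

With the i4 bit pattern 0xD, the zero-extended src1 is 13 and the byte-level product matches the i32 product (13 * -3 = -39). The sign-extended src1 is -3, which truncates to the byte 0xFD and is read as 253, so the instruction would compute 253 * -3 instead of (-3) * (-3) = 9.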

LuoYuanke added inline comments. · Dec 21 2021, 11:44 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
41780	Maybe we can remove IsFreeTruncation() check as Roman mentions. Roman, do you mean to remove IsFreeTruncation() check?

LuoYuanke added inline comments. · Dec 22 2021, 12:32 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
41780	> Maybe we can remove IsFreeTruncation() check as Roman mentions.
	> Roman, do you mean to remove IsFreeTruncation() check?

If I remove the ISD::SIGN_EXTEND/ZERO_EXTEND check, createVPDPBUSD() crashes on the test case below. I think there is room to improve the patch to cover more patterns, but to be conservative I'd like to do that in a separate patch, so that if there is a regression we can revert less code.
Hi Roman,
What do you think?

declare i32 @llvm.vector.reduce.add.v16i32(<16 x i32>)

define dso_local i32 @mul_i4i2(<16 x i4> %b, i32 %c) {
entry:
  %0 = trunc <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15> to <16 x i16>
  %1 = zext <16 x i16> %0 to <16 x i32>
  %2 = zext <16 x i4> %b to <16 x i32>
  %3 = mul nsw <16 x i32> %2, %1
  %4 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %3)
  %op.extra = add nsw i32 %4, %c
  ret i32 %op.extra
}

Rebase.

Harbormaster completed remote builds in B140477: Diff 395980. · Dec 23 2021, 12:21 AM

Format the code as Lint suggested?

llvm/lib/Target/X86/X86ISelLowering.cpp
41810	Can we return false here if `Op0` is not ZERO_EXTEND?
41813–41814	How about `ANY_EXTEND` ? The same below.
41826	It's better to hoist to line 41810. How about `ANY_EXTEND`?
41870	It is always 2? Better to add a comment to explain.
41887–41888	AVXVNNI implies AVX2. We won't need to split to 128 bits.
42156–42157	Can we generate i32 first then do the truncation?
42162–42163	Can check `DCI.isAfterLegalizeDAG()` before calling the function instead?
42210	Can use `ExtractVT.getSizeInBits()` directly.
llvm/test/CodeGen/X86/dpbusd.ll
14	This is the only diff with the AVX512 code, and a strange one. Is something wrong in one of them?

Are we missing test cases for 32 x i8 and 64 x i8?

craig.topper added inline comments. · Dec 25 2021, 10:13 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
42162–42163	I don't think that works. We should be able to handle 32 x i8 and 64 x i8 which would have zero_extend and sign_extend with illegal result types.

I'll update the patch according to Phoebe and Craig's comments.

llvm/lib/Target/X86/X86ISelLowering.cpp
41813–41814	This checks the opcode, so we need to check for both zero extend and sign extend. I'm not sure whether any extend also works, because the upper bits are undefined. What is the sign bit for any extend?
42156–42157	I'm not sure that, if the result overflows, truncating back to i16 or some other type yields the same value. How about leaving it as an enhancement?
llvm/test/CodeGen/X86/dpbusd.ll
14	This test doesn't generate a vpdpbusd instruction, so AVX512VNNI and AVX512VL generate the same code. For the other test cases, AVX512VNNI can only use zmm registers, but AVX512VNNI + AVX512VL can use xmm registers.

Address Phoebe and Craig's comments.

LuoYuanke added inline comments. · Dec 26 2021, 5:43 AM

llvm/test/CodeGen/X86/dpbusd.ll
14	For vpdpbusd_64xi32, the result is the same.

lebedev.ri removed a subscriber: lebedev.ri. · Dec 26 2021, 5:52 AM

Harbormaster completed remote builds in B140657: Diff 396222. · Dec 26 2021, 6:24 AM

craig.topper added inline comments. · Dec 27 2021, 10:34 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
41813–41814	It is undefined and ComputeMinSignedBits will return BitWidth - 1 for it.
41818	You can use `Op.getOperand(0).getScalarValueSizeInBits()` to simplify this
41828	Can we use `DAG.computeKnownBits(Op0).countMaxActiveBits() <= 8` to make this more readable?
41856	Why are Ext0 and Ext1 passed by const reference? SDValue should be passed by value.
41864	Can we just TRUNCATE the Ext nodes without assuming they are extend nodes. That way it just works when you support constants in the future?
41870	Can we do this without assuming the node is a SIGN/ZERO_EXTEND? Just truncate the original node to Vi8VT.
42164	This code isn't handling vpdpwssd so why mention it here?
42176	Is this code valid for this transform? There's a large comment of justification for why it is ok for SAD. I think I only saw a test for the SIGN_EXTEND case?

craig.topper added inline comments. · Dec 27 2021, 10:44 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
42176	Oops I see the other test. I need to think about the math.

craig.topper added inline comments. · Dec 27 2021, 11:02 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
42176	I don't think we can do this if the multiply result is zero extended. Each of the 4 multiplies done by vpdpbusd compute a signed 16-bit product that will be sign extended before adding into the accumulator. I think we also need to verify that the multiply has at least 2x the number of bits of the input. We shouldn't match (sign_extend (mul (vXi9 (zext (vXi8 X))), (vXi9 (zext (vXi8 Y)))). Does anything prevent that right now?
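
Both concerns in this comment can be checked with concrete numbers. The following C++ sketch is illustrative only and uses hypothetical helper names; it models a multiply performed in a narrow type by masking the product to `bits` bits, and models the instruction's per-pair term as the exact product of an unsigned byte and a signed byte.

```cpp
#include <cstdint>

// Read the low `bits` bits of v as a signed two's-complement number
// (models sign_extend from a `bits`-wide type; requires 0 < bits < 32).
int32_t sign_extend_from(uint32_t v, unsigned bits) {
  uint32_t mask = (1u << bits) - 1;
  uint32_t sign = 1u << (bits - 1);
  v &= mask;
  return static_cast<int32_t>(v ^ sign) - static_cast<int32_t>(sign);
}

// Read the low `bits` bits of v as an unsigned number (models zero_extend).
uint32_t zero_extend_from(uint32_t v, unsigned bits) {
  return v & ((1u << bits) - 1);
}

// Low `bits` bits of (unsigned byte * signed byte), i.e. the result of doing
// the multiply in a `bits`-wide integer type.
uint32_t mul_in_width(uint8_t a, int8_t b, unsigned bits) {
  uint32_t p = static_cast<uint32_t>(static_cast<int32_t>(a) *
                                     static_cast<int32_t>(b));
  return p & ((1u << bits) - 1);
}

// The term the instruction adds for one byte pair, per its description.
int32_t dpbusd_term(uint8_t a, int8_t b) {
  return static_cast<int32_t>(a) * static_cast<int32_t>(b);
}
```

For a = 255 and b = -1, a 16-bit multiply yields 0xFF01: sign-extending gives -255, which matches the instruction, while zero-extending gives 65281, which does not. For a = 255 and b = 127, a 9-bit multiply keeps only 129 of the true product 32385, so a multiply narrower than twice the input width cannot be matched either.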

LuoYuanke added inline comments. · Dec 28 2021, 7:13 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
41864	> Can we just TRUNCATE the Ext nodes without assuming they are extend nodes. That way it just works when you support constants in the future?
	Good idea. :) I'll update my patch.
41870	> Can we do this without assuming the node is a SIGN/ZERO_EXTEND? Just truncate the original node to Vi8VT.
	Yes, that's better.
42164	> This code isn't handling vpdpwssd so why mention it here?
	My original code covers both vpdpbusd and vpdpwssd. I'll clean it up.
42176	> I don't think we can do this if the multiply result is zero extended. Each of the 4 multiplies done by vpdpbusd compute a signed 16-bit product that will be sign extended before adding into the accumulator. I think we also need to verify that the multiply has at least 2x the number of bits of the input. We shouldn't match (sign_extend (mul (vXi9 (zext (vXi8 X))), (vXi9 (zext (vXi8 Y)))). Does anything prevent that right now?
	Really good catch. Thanks.

Address Craig's comments.

Harbormaster completed remote builds in B140833: Diff 396454. · Dec 28 2021, 8:04 PM

Align the check in X86ISelLowering.cpp and X86PartialReduction.cpp.
Add test case for 2 x i32.

Harbormaster completed remote builds in B140862: Diff 396489. · Dec 28 2021, 11:54 PM

LuoYuanke added a child revision: D116363: [X86] Combine to vpdpbusd when operand is constant and small enough. · Dec 29 2021, 12:09 AM

Fix lint issues.

Harbormaster completed remote builds in B140872: Diff 396499. · Dec 29 2021, 1:15 AM

craig.topper added inline comments. · Dec 29 2021, 2:59 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
41860	Just name the parameters ZExt0 and SExt1 and get rid of Ext0 and Ext1?
41870	This is just Vi8VT, right?
41895	Is this lambda getting anything via `&` in the capture list, or can it just be `[]`?
llvm/lib/Target/X86/X86PartialReduction.cpp
100	Why pass `0, nullptr, nullptr` when those have default values?

Address Craig's comments. Thanks, Craig.

Harbormaster completed remote builds in B140947: Diff 396585. · Dec 29 2021, 6:27 PM

ping

LGTM other than that one comment.

llvm/lib/Target/X86/X86ISelLowering.cpp
41862	You can drop these ifs. getZExtOrTrunc/getSExtOrTrunc will do nothing if the type already matches.

This revision is now accepted and ready to land. · Jan 6 2022, 9:48 AM

Address Craig's comments and rebase.

Harbormaster completed remote builds in B142014: Diff 398047. · Jan 6 2022, 9:37 PM

This revision was landed with ongoing or failed builds. · Jan 7 2022, 5:14 AM

Closed by commit rG21babe4db326: [X86] Combine reduce(add (mul x, y)) to VNNI instruction. (authored by LuoYuanke).

This revision was automatically updated to reflect the committed changes.

LuoYuanke added a commit: rG21babe4db326: [X86] Combine reduce(add (mul x, y)) to VNNI instruction.

FreddyYe mentioned this in D135938: [X86] Add AVX-VNNI-INT8 instructions. · Oct 26 2022, 10:34 PM

Revision Contents

Path                                           Size
llvm/lib/Target/X86/X86ISelLowering.cpp        152 lines
llvm/lib/Target/X86/X86PartialReduction.cpp    68 lines
llvm/test/CodeGen/X86/dpbusd.ll                548 lines
llvm/test/CodeGen/X86/dpbusd_i4.ll             131 lines

Diff 398109

llvm/lib/Target/X86/X86ISelLowering.cpp

[41,747 lines not shown] static SDValue combineBitcast(SDNode *N, SelectionDAG &DAG,
   }

   // Try to remove bitcasts from input and output of mask arithmetic to
   // remove GPR<->K-register crossings.
   if (SDValue V = combineCastedMaskArithmetic(N, DAG, DCI, Subtarget))
     return V;

   // Convert a bitcasted integer logic operation that has one bitcasted
   // floating-point operand into a floating-point logic operation. This may
   // create a load of a constant, but that is cheaper than materializing the
   // constant in an integer register and transferring it to an SSE register or
   // transferring the SSE operand to integer register and back.
   unsigned FPOpcode;
   switch (N0.getOpcode()) {
   case ISD::AND: FPOpcode = X86ISD::FAND; break;
   case ISD::OR: FPOpcode = X86ISD::FOR; break;
   case ISD::XOR: FPOpcode = X86ISD::FXOR; break;
   default: return SDValue();
   }

   // Check if we have a bitcast from another integer type as well.
   if (!((Subtarget.hasSSE1() && VT == MVT::f32) ||
         (Subtarget.hasSSE2() && VT == MVT::f64) ||
         (Subtarget.hasFP16() && VT == MVT::f16) ||
         (Subtarget.hasSSE2() && VT.isInteger() && VT.isVector() &&
          TLI.isTypeLegal(VT))))
     return SDValue();

   SDValue LogicOp0 = N0.getOperand(0);
   SDValue LogicOp1 = N0.getOperand(1);
   SDLoc DL0(N0);

   // bitcast(logic(bitcast(X), Y)) --> logic'(X, bitcast(Y))
   if (N0.hasOneUse() && LogicOp0.getOpcode() == ISD::BITCAST &&
       LogicOp0.hasOneUse() && LogicOp0.getOperand(0).hasOneUse() &&
       LogicOp0.getOperand(0).getValueType() == VT &&
       !isa<ConstantSDNode>(LogicOp0.getOperand(0))) {
     SDValue CastedOp1 = DAG.getBitcast(VT, LogicOp1);
     unsigned Opcode = VT.isFloatingPoint() ? FPOpcode : N0.getOpcode();
     return DAG.getNode(Opcode, DL0, VT, LogicOp0.getOperand(0), CastedOp1);
   }
   // bitcast(logic(X, bitcast(Y))) --> logic'(bitcast(X), Y)
   if (N0.hasOneUse() && LogicOp1.getOpcode() == ISD::BITCAST &&
       LogicOp1.hasOneUse() && LogicOp1.getOperand(0).hasOneUse() &&
       LogicOp1.getOperand(0).getValueType() == VT &&
       !isa<ConstantSDNode>(LogicOp1.getOperand(0))) {
     SDValue CastedOp0 = DAG.getBitcast(VT, LogicOp0);
     unsigned Opcode = VT.isFloatingPoint() ? FPOpcode : N0.getOpcode();
     return DAG.getNode(Opcode, DL0, VT, LogicOp1.getOperand(0), CastedOp0);
   }

   return SDValue();
 }

+// (mul (zext a), (sext, b))
+static bool detectExtMul(SelectionDAG &DAG, const SDValue &Mul, SDValue &Op0,
+                         SDValue &Op1) {
+  Op0 = Mul.getOperand(0);
+  Op1 = Mul.getOperand(1);
+
+  // The operand1 should be signed extend
+  if (Op0.getOpcode() == ISD::SIGN_EXTEND)
+    std::swap(Op0, Op1);
+
+  if (Op0.getOpcode() != ISD::ZERO_EXTEND)
+    return false;
+
+  auto IsFreeTruncation = [](SDValue &Op) -> bool {
+    if ((Op.getOpcode() == ISD::ZERO_EXTEND ||
+         Op.getOpcode() == ISD::SIGN_EXTEND) &&
+        Op.getOperand(0).getScalarValueSizeInBits() <= 8)
+      return true;
+
+    // TODO: Support contant value.
+    return false;
+  };
+
+  // (dpbusd (zext a), (sext, b)). Since the first operand should be unsigned
+  // value, we need to check Op0 is zero extended value. Op1 should be signed
+  // value, so we just check the signed bits.
+  if ((IsFreeTruncation(Op0) &&
+       DAG.computeKnownBits(Op0).countMaxActiveBits() <= 8) &&
+      (IsFreeTruncation(Op1) && DAG.ComputeMaxSignificantBits(Op1) <= 8))
+    return true;
+
+  return false;
+}

 // Given a ABS node, detect the following pattern:
 // (ABS (SUB (ZERO_EXTEND a), (ZERO_EXTEND b))).
 // This is useful as it is the input into a SAD pattern.
 static bool detectZextAbsDiff(const SDValue &Abs, SDValue &Op0, SDValue &Op1) {
   SDValue AbsOp1 = Abs->getOperand(0);
   if (AbsOp1.getOpcode() != ISD::SUB)
     return false;

   Op0 = AbsOp1.getOperand(0);
   Op1 = AbsOp1.getOperand(1);

   // Check if the operands of the sub are zero-extended from vectors of i8.
   if (Op0.getOpcode() != ISD::ZERO_EXTEND ||
       Op0.getOperand(0).getValueType().getVectorElementType() != MVT::i8 ||
       Op1.getOpcode() != ISD::ZERO_EXTEND ||
       Op1.getOperand(0).getValueType().getVectorElementType() != MVT::i8)
     return false;

   return true;
 }

		static SDValue createVPDPBUSD(SelectionDAG &DAG, SDValue LHS, SDValue RHS,
		unsigned &LogBias, const SDLoc &DL,
		const X86Subtarget &Subtarget) {
		// Extend or truncate to MVT::i8 first.
		craig.topperUnsubmitted Not Done Reply Inline Actions Just name the paramaters ZExt0 and SExt1 and get rid of Ext0 and Ext1? craig.topper: Just name the paramaters ZExt0 and SExt1 and get rid of Ext0 and Ext1?
		MVT Vi8VT =
		MVT::getVectorVT(MVT::i8, LHS.getValueType().getVectorElementCount());
		craig.topperUnsubmitted Not Done Reply Inline Actions You can drop these ifs. getZExtOrTrunc/getSExtOrTrunc will do nothing if the type already matches. craig.topper: You can drop these ifs. getZExtOrTrunc/getSExtOrTrunc will do nothing if the type already…
		LHS = DAG.getZExtOrTrunc(LHS, DL, Vi8VT);
		RHS = DAG.getSExtOrTrunc(RHS, DL, Vi8VT);
		craig.topperUnsubmitted Not Done Reply Inline Actions Can we just TRUNCATE the Ext nodes without assuming they are extend nodes. That way it just works when you support constants in the future? craig.topper: Can we just TRUNCATE the Ext nodes without assuming they are extend nodes. That way it just…
		LuoYuankeAuthorUnsubmitted Done Reply Inline Actions Can we just TRUNCATE the Ext nodes without assuming they are extend nodes. That way it just works when you support constants in the future? Good idea. :) I'll update my patch. LuoYuanke: > Can we just TRUNCATE the Ext nodes without assuming they are extend nodes. That way it just…

		// VPDPBUSD(<16 x i32>C, <16 x i8>A, <16 x i8>B). For each dst element
		// C[0] = C[0] + A[0]B[0] + A[1]B[1] + A[2]B[2] + A[3]B[3].
		// The src A, B element type is i8, but the dst C element type is i32.
		// When we calculate the reduce stage, we use src vector type vXi8 for it
		// so we need logbias 2 to avoid extra 2 stages.
		pengfeiUnsubmitted Done Reply Inline Actions It is always 2? Better to add a comment to explain. pengfei: It is always 2? Better to add a comment to explain.
		craig.topperUnsubmitted Not Done Reply Inline Actions Can we do this without assuming the node is a SIGN/ZERO_EXTEND? Just truncate the original node to Vi8VT. craig.topper: Can we do this without assuming the node is a SIGN/ZERO_EXTEND? Just truncate the original node…
		LuoYuankeAuthorUnsubmitted Done Reply Inline Actions Can we do this without assuming the node is a SIGN/ZERO_EXTEND? Just truncate the original node to Vi8VT. LuoYuanke: > Can we do this without assuming the node is a SIGN/ZERO_EXTEND? Just truncate the original…
		LuoYuankeAuthorUnsubmitted Done Reply Inline Actions Can we do this without assuming the node is a SIGN/ZERO_EXTEND? Just truncate the original node to Vi8VT. Yes, that's better. LuoYuanke: > Can we do this without assuming the node is a SIGN/ZERO_EXTEND? Just truncate the original…
		craig.topperUnsubmitted Not Done Reply Inline Actions This just Vi8VT right? craig.topper: This just Vi8VT right?
		LogBias = 2;

		unsigned RegSize = std::max(128u, (unsigned)Vi8VT.getSizeInBits());
		if (Subtarget.hasVNNI() && !Subtarget.hasVLX())
		RegSize = std::max(512u, RegSize);

		// "Zero-extend" the i8 vectors. This is not a per-element zext, rather we
		// fill in the missing vector elements with 0.
		unsigned NumConcat = RegSize / Vi8VT.getSizeInBits();
		SmallVector<SDValue, 16> Ops(NumConcat, DAG.getConstant(0, DL, Vi8VT));
		Ops[0] = LHS;
		MVT ExtendedVT = MVT::getVectorVT(MVT::i8, RegSize / 8);
		SDValue DpOp0 = DAG.getNode(ISD::CONCAT_VECTORS, DL, ExtendedVT, Ops);
		Ops[0] = RHS;
		SDValue DpOp1 = DAG.getNode(ISD::CONCAT_VECTORS, DL, ExtendedVT, Ops);

		// Actually build the DotProduct, split as 256/512 bits for
		// AVXVNNI/AVX512VNNI.
		pengfeiUnsubmitted Done Reply Inline Actions AVXVNNI implies AVX2. We won't need to split to 128 bits. pengfei: AVXVNNI implies AVX2. We won't need to split to 128 bits.
		auto DpBuilder = [](SelectionDAG &DAG, const SDLoc &DL,
		ArrayRef<SDValue> Ops) {
		MVT VT = MVT::getVectorVT(MVT::i32, Ops[0].getValueSizeInBits() / 32);
		return DAG.getNode(X86ISD::VPDPBUSD, DL, VT, Ops);
		};
		MVT DpVT = MVT::getVectorVT(MVT::i32, RegSize / 32);
		SDValue Zero = DAG.getConstant(0, DL, DpVT);
		craig.topper (Not Done): Is this lambda getting anything via `&` in the capture list, or can it just be `[]`?

		return SplitOpsAndApply(DAG, Subtarget, DL, DpVT, {Zero, DpOp0, DpOp1},
		DpBuilder, false);
		}
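
For readers following the review, here is a minimal scalar sketch of what one 32-bit lane of the non-saturating `vpdpbusd` computes, which is why `createVPDPBUSD` can pad both inputs with zero bytes and start from a zero accumulator without changing the sum. The function name `dpbusdLane` is hypothetical and not part of the patch; it models the instruction on plain integers rather than on SelectionDAG nodes.

```cpp
#include <array>
#include <cstdint>

// Hypothetical scalar model of one i32 lane of VPDPBUSD (non-saturating):
// four unsigned bytes from A are multiplied by four signed bytes from B,
// and the four products are added to the 32-bit accumulator.
inline int32_t dpbusdLane(int32_t Acc, const std::array<uint8_t, 4> &A,
                          const std::array<int8_t, 4> &B) {
  // Accumulate in unsigned arithmetic so that wraparound is well defined.
  uint32_t Sum = static_cast<uint32_t>(Acc);
  for (int I = 0; I < 4; ++I) {
    // Each product of a u8 and an s8 fits in a signed 16-bit value.
    int32_t Product = static_cast<int32_t>(A[I]) * static_cast<int32_t>(B[I]);
    Sum += static_cast<uint32_t>(Product);
  }
  return static_cast<int32_t>(Sum);
}
```

A lane whose inputs are all zero returns the accumulator unchanged, so the lanes introduced by the zero-filled `CONCAT_VECTORS` operands contribute nothing to the final reduction.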

// Given two zexts of <k x i8> to <k x i32>, create a PSADBW of the inputs		// Given two zexts of <k x i8> to <k x i32>, create a PSADBW of the inputs
// to these zexts.		// to these zexts.
static SDValue createPSADBW(SelectionDAG &DAG, const SDValue &Zext0,		static SDValue createPSADBW(SelectionDAG &DAG, const SDValue &Zext0,
const SDValue &Zext1, const SDLoc &DL,		const SDValue &Zext1, const SDLoc &DL,
const X86Subtarget &Subtarget) {		const X86Subtarget &Subtarget) {
// Find the appropriate width for the PSADBW.		// Find the appropriate width for the PSADBW.
EVT InVT = Zext0.getOperand(0).getValueType();		EVT InVT = Zext0.getOperand(0).getValueType();
unsigned RegSize = std::max(128u, (unsigned)InVT.getSizeInBits());		unsigned RegSize = std::max(128u, (unsigned)InVT.getSizeInBits());
[... 191 lines not shown ...]	if (MatchSizeInBits == 256 && BitWidth < 32 && !Subtarget.hasInt256()) {
std::tie(Lo, Hi) = DAG.SplitVector(Match, DL);		std::tie(Lo, Hi) = DAG.SplitVector(Match, DL);
Match = DAG.getNode(BinOp, DL, Lo.getValueType(), Lo, Hi);		Match = DAG.getNode(BinOp, DL, Lo.getValueType(), Lo, Hi);
MatchSizeInBits = Match.getValueSizeInBits();		MatchSizeInBits = Match.getValueSizeInBits();
}		}

// For 32/64 bit comparisons use MOVMSKPS/MOVMSKPD, else PMOVMSKB.		// For 32/64 bit comparisons use MOVMSKPS/MOVMSKPD, else PMOVMSKB.
MVT MaskSrcVT;		MVT MaskSrcVT;
if (64 == BitWidth || 32 == BitWidth)		if (64 == BitWidth || 32 == BitWidth)
MaskSrcVT = MVT::getVectorVT(MVT::getFloatingPointVT(BitWidth),		MaskSrcVT = MVT::getVectorVT(MVT::getFloatingPointVT(BitWidth),
		RKSimon (Done): Should this be called combineVPDPBUSDPattern? VNNI is the ISA, no?
MatchSizeInBits / BitWidth);		MatchSizeInBits / BitWidth);
else		else
MaskSrcVT = MVT::getVectorVT(MVT::i8, MatchSizeInBits / 8);		MaskSrcVT = MVT::getVectorVT(MVT::i8, MatchSizeInBits / 8);

SDValue BitcastLogicOp = DAG.getBitcast(MaskSrcVT, Match);		SDValue BitcastLogicOp = DAG.getBitcast(MaskSrcVT, Match);
Movmsk = getPMOVMSKB(DL, BitcastLogicOp, DAG, Subtarget);		Movmsk = getPMOVMSKB(DL, BitcastLogicOp, DAG, Subtarget);
NumElts = MaskSrcVT.getVectorNumElements();		NumElts = MaskSrcVT.getVectorNumElements();
}		}
[... 25 lines not shown ...]	static SDValue combinePredicateReduction(SDNode *Extract, SelectionDAG &DAG,
EVT SetccVT =		EVT SetccVT =
TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), CmpVT);		TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), CmpVT);
SDValue Setcc = DAG.getSetCC(DL, SetccVT, Movmsk, CmpC, CondCode);		SDValue Setcc = DAG.getSetCC(DL, SetccVT, Movmsk, CmpC, CondCode);
SDValue Zext = DAG.getZExtOrTrunc(Setcc, DL, ExtractVT);		SDValue Zext = DAG.getZExtOrTrunc(Setcc, DL, ExtractVT);
SDValue Zero = DAG.getConstant(0, DL, ExtractVT);		SDValue Zero = DAG.getConstant(0, DL, ExtractVT);
return DAG.getNode(ISD::SUB, DL, ExtractVT, Zero, Zext);		return DAG.getNode(ISD::SUB, DL, ExtractVT, Zero, Zext);
}		}

		static SDValue combineVPDPBUSDPattern(SDNode *Extract, SelectionDAG &DAG,
		const X86Subtarget &Subtarget) {
		if (!Subtarget.hasVNNI() && !Subtarget.hasAVXVNNI())
		return SDValue();

		EVT ExtractVT = Extract->getValueType(0);
		// Verify the type we're extracting is i32, as the output element type of
		// vpdpbusd is i32.
		pengfei (Not Done): Can we generate i32 first and then do the truncation?
		LuoYuanke (Author, Done): I'm not sure whether, if the result overflows, truncating back to i16 or another type still yields the same value. How about leaving it as an enhancement?
		if (ExtractVT != MVT::i32)
		return SDValue();

		EVT VT = Extract->getOperand(0).getValueType();
		if (!isPowerOf2_32(VT.getVectorNumElements()))
		return SDValue();
		pengfei (Not Done): Can we check `DCI.isAfterLegalizeDAG()` before calling the function instead?
		craig.topper (Not Done): I don't think that works. We should be able to handle 32 x i8 and 64 x i8, which would have zero_extend and sign_extend with illegal result types.

		craig.topper (Not Done): This code isn't handling vpdpwssd, so why mention it here?
		LuoYuanke (Author, Done): > This code isn't handling vpdpwssd so why mention it here?
		My original code covered both vpdpbusd and vpdpwssd. I'll clean it up.
		// Match shuffle + add pyramid.
		ISD::NodeType BinOp;
		SDValue Root = DAG.matchBinOpReduction(Extract, BinOp, {ISD::ADD});

		// We can't combine to vpdpbusd for zext, because each of the 4 multiplies
		// done by vpdpbusd compute a signed 16-bit product that will be sign extended
		// before adding into the accumulator.
		// TODO:
		// We also need to verify that the multiply has at least 2x the number of bits
		// of the input. We shouldn't match
		// (sign_extend (mul (vXi9 (zext (vXi8 X))), (vXi9 (zext (vXi8 Y)))).
		// if (Root && (Root.getOpcode() == ISD::SIGN_EXTEND))
		craig.topper (Not Done): Is this code valid for this transform? There's a large comment of justification for why it is ok for SAD. I think I only saw a test for the SIGN_EXTEND case?
		craig.topper (Not Done): Oops, I see the other test. I need to think about the math.
		craig.topper (Not Done): I don't think we can do this if the multiply result is zero extended. Each of the 4 multiplies done by vpdpbusd computes a signed 16-bit product that will be sign extended before adding into the accumulator. I think we also need to verify that the multiply has at least 2x the number of bits of the input. We shouldn't match (sign_extend (mul (vXi9 (zext (vXi8 X))), (vXi9 (zext (vXi8 Y)))). Does anything prevent that right now?
		LuoYuanke (Author, Done): > I don't think we can do this if the multiply result is zero extended. [...] Does anything prevent that right now?
		Really good catch. Thanks.
		// Root = Root.getOperand(0);

		// If there was a match, we want Root to be a mul.
		if (!Root || Root.getOpcode() != ISD::MUL)
		return SDValue();

		// Check whether we have an extend and mul pattern
		SDValue LHS, RHS;
		if (!detectExtMul(DAG, Root, LHS, RHS))
		return SDValue();

		// Create the dot product instruction.
		SDLoc DL(Extract);
		unsigned StageBias;
		SDValue DP = createVPDPBUSD(DAG, LHS, RHS, StageBias, DL, Subtarget);

		// If the original vector was wider than 4 elements, sum over the results
		// in the DP vector.
		unsigned Stages = Log2_32(VT.getVectorNumElements());
		EVT DpVT = DP.getValueType();

		if (Stages > StageBias) {
		unsigned DpElems = DpVT.getVectorNumElements();

		for (unsigned i = Stages - StageBias; i > 0; --i) {
		SmallVector<int, 16> Mask(DpElems, -1);
		for (unsigned j = 0, MaskEnd = 1 << (i - 1); j < MaskEnd; ++j)
		Mask[j] = MaskEnd + j;

		SDValue Shuffle =
		DAG.getVectorShuffle(DpVT, DL, DP, DAG.getUNDEF(DpVT), Mask);
		DP = DAG.getNode(ISD::ADD, DL, DpVT, DP, Shuffle);
		}
		}
		pengfei (Done): Can use `ExtractVT.getSizeInBits()` directly.

		// Return the lowest ExtractSizeInBits bits.
		EVT ResVT =
		EVT::getVectorVT(*DAG.getContext(), ExtractVT,
		DpVT.getSizeInBits() / ExtractVT.getSizeInBits());
		DP = DAG.getBitcast(ResVT, DP);
		return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, ExtractVT, DP,
		Extract->getOperand(1));
		}
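
The shuffle-and-add loop above can be hard to follow from the mask arithmetic alone. The following is a small sketch of the same loop structure on a plain `std::vector`, assuming undefined shuffle lanes (mask value `-1`) behave as zero; `reduceLanes` is a hypothetical helper written for illustration, not code from the patch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of the shuffle + add pyramid: each stage moves the upper
// half of the live lanes down onto the lower half and adds them, so lane 0
// ends up holding the sum of the first 2^(Stages - StageBias) lanes.
// Precondition: DP.size() >= 2^(Stages - StageBias) and DP is not empty.
inline int32_t reduceLanes(std::vector<int32_t> DP, unsigned Stages,
                           unsigned StageBias) {
  unsigned Remaining = Stages > StageBias ? Stages - StageBias : 0;
  for (unsigned I = Remaining; I > 0; --I) {
    unsigned MaskEnd = 1u << (I - 1);
    // Lanes not named by the mask are undefined in the patch; use 0 here.
    std::vector<int32_t> Shuffle(DP.size(), 0);
    for (unsigned J = 0; J < MaskEnd; ++J)
      Shuffle[J] = DP[MaskEnd + J];
    for (std::size_t K = 0; K < DP.size(); ++K)
      DP[K] += Shuffle[K];
  }
  return DP[0];
}
```

With a 16 x i8 input, `Stages` is 4 and `StageBias` is 2, so two stages run: the first adds lanes 2 and 3 onto lanes 0 and 1, the second adds lane 1 onto lane 0. Because each `vpdpbusd` lane already sums four byte products, the bias of 2 accounts for the two reduction stages the instruction performs itself.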

static SDValue combineBasicSADPattern(SDNode *Extract, SelectionDAG &DAG,		static SDValue combineBasicSADPattern(SDNode *Extract, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {		const X86Subtarget &Subtarget) {
// PSADBW is only supported on SSE2 and up.		// PSADBW is only supported on SSE2 and up.
if (!Subtarget.hasSSE2())		if (!Subtarget.hasSSE2())
return SDValue();		return SDValue();

EVT ExtractVT = Extract->getValueType(0);		EVT ExtractVT = Extract->getValueType(0);
// Verify the type we're extracting is either i32 or i64.		// Verify the type we're extracting is either i32 or i64.
[... 591 lines not shown ...]	static SDValue combineExtractVectorElt(SDNode *N, SelectionDAG &DAG,
}		}

// Check whether this extract is the root of a sum of absolute differences		// Check whether this extract is the root of a sum of absolute differences
// pattern. This has to be done here because we really want it to happen		// pattern. This has to be done here because we really want it to happen
// pre-legalization,		// pre-legalization,
if (SDValue SAD = combineBasicSADPattern(N, DAG, Subtarget))		if (SDValue SAD = combineBasicSADPattern(N, DAG, Subtarget))
return SAD;		return SAD;

		if (SDValue VPDPBUSD = combineVPDPBUSDPattern(N, DAG, Subtarget))
		return VPDPBUSD;

// Attempt to replace an all_of/any_of horizontal reduction with a MOVMSK.		// Attempt to replace an all_of/any_of horizontal reduction with a MOVMSK.
if (SDValue Cmp = combinePredicateReduction(N, DAG, Subtarget))		if (SDValue Cmp = combinePredicateReduction(N, DAG, Subtarget))
return Cmp;		return Cmp;

// Attempt to replace min/max v8i16/v16i8 reductions with PHMINPOSUW.		// Attempt to replace min/max v8i16/v16i8 reductions with PHMINPOSUW.
if (SDValue MinMax = combineMinMaxReduction(N, DAG, Subtarget))		if (SDValue MinMax = combineMinMaxReduction(N, DAG, Subtarget))
return MinMax;		return MinMax;

[... 11,986 lines not shown ...]

llvm/lib/Target/X86/X86PartialReduction.cpp

//===-- X86PartialReduction.cpp -------------------------------------------===//		//===-- X86PartialReduction.cpp -------------------------------------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// This pass looks for add instructions used by a horizontal reduction to see		// This pass looks for add instructions used by a horizontal reduction to see
// if we might be able to use pmaddwd or psadbw. Some cases of this require		// if we might be able to use pmaddwd or psadbw. Some cases of this require
// cross basic block knowledge and can't be done in SelectionDAG.		// cross basic block knowledge and can't be done in SelectionDAG.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "X86.h"		#include "X86.h"
		#include "X86TargetMachine.h"
#include "llvm/Analysis/ValueTracking.h"		#include "llvm/Analysis/ValueTracking.h"
#include "llvm/CodeGen/TargetPassConfig.h"		#include "llvm/CodeGen/TargetPassConfig.h"
#include "llvm/IR/Constants.h"		#include "llvm/IR/Constants.h"
		#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"		#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicsX86.h"		#include "llvm/IR/IntrinsicsX86.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Operator.h"		#include "llvm/IR/Operator.h"
#include "llvm/Pass.h"		#include "llvm/Pass.h"
#include "X86TargetMachine.h"		#include "llvm/Support/KnownBits.h"

using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "x86-partial-reduction"		#define DEBUG_TYPE "x86-partial-reduction"

namespace {		namespace {

class X86PartialReduction : public FunctionPass {		class X86PartialReduction : public FunctionPass {
[... 11 lines not shown ...]	void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.setPreservesCFG();		AU.setPreservesCFG();
}		}

StringRef getPassName() const override {		StringRef getPassName() const override {
return "X86 Partial Reduction";		return "X86 Partial Reduction";
}		}

private:		private:
bool tryMAddReplacement(Instruction *Op);		bool tryMAddReplacement(Instruction *Op, bool ReduceInOneBB);
bool trySADReplacement(Instruction *Op);		bool trySADReplacement(Instruction *Op);
};		};
}		}

FunctionPass *llvm::createX86PartialReductionPass() {		FunctionPass *llvm::createX86PartialReductionPass() {
return new X86PartialReduction();		return new X86PartialReduction();
}		}

char X86PartialReduction::ID = 0;		char X86PartialReduction::ID = 0;

INITIALIZE_PASS(X86PartialReduction, DEBUG_TYPE,		INITIALIZE_PASS(X86PartialReduction, DEBUG_TYPE,
"X86 Partial Reduction", false, false)		"X86 Partial Reduction", false, false)

bool X86PartialReduction::tryMAddReplacement(Instruction *Op) {		// This function should be aligned with detectExtMul() in X86ISelLowering.cpp.
		static bool matchVPDPBUSDPattern(const X86Subtarget *ST, BinaryOperator *Mul,
		const DataLayout *DL) {
		if (!ST->hasVNNI() && !ST->hasAVXVNNI())
		return false;

		Value *LHS = Mul->getOperand(0);
		Value *RHS = Mul->getOperand(1);

		if (isa<SExtInst>(LHS))
		std::swap(LHS, RHS);

		if (!isa<ZExtInst>(LHS))
		return false;

		auto IsFreeTruncation = [&](Value *Op) {
		if (auto *Cast = dyn_cast<CastInst>(Op)) {
		if (Cast->getParent() == Mul->getParent() &&
		(Cast->getOpcode() == Instruction::SExt ||
		Cast->getOpcode() == Instruction::ZExt) &&
		Cast->getOperand(0)->getType()->getScalarSizeInBits() <= 8)
		return true;
		}
		// TODO: Support constant in ISel.
		return false;
		};

		// (dpbusd (zext a), (sext, b)). Since the first operand should be unsigned
		// value, we need to check LHS is zero extended value. RHS should be signed
		// value, so we just check the signed bits.
		if ((IsFreeTruncation(LHS) &&
		computeKnownBits(LHS, *DL).countMaxActiveBits() <= 8) &&
		(IsFreeTruncation(RHS) && ComputeMaxSignificantBits(RHS, *DL) <= 8))
		return true;
		craig.topper (Not Done): Why pass `0, nullptr, nullptr` when those have default values?

		return false;
		}
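
As a rough illustration of the two range conditions checked above, the sketch below evaluates them on concrete integers: the unsigned operand must need at most 8 active bits, and the signed operand must need at most 8 significant bits in two's complement. LLVM's `computeKnownBits` and `ComputeMaxSignificantBits` reason about statically known bits of IR values instead, so this is only an analogy; the helper names `activeBits`, `significantBits`, and `fitsDpbusdOperands` are hypothetical.

```cpp
#include <cstdint>

// Number of bits needed to represent V as an unsigned value (0 for V == 0).
inline unsigned activeBits(uint32_t V) {
  unsigned N = 0;
  while (V != 0) {
    ++N;
    V >>= 1;
  }
  return N;
}

// Minimum number of bits needed to represent V in two's complement,
// including the sign bit.
inline unsigned significantBits(int32_t V) {
  uint32_t U = static_cast<uint32_t>(V);
  if (V < 0)
    U = ~U; // Fold negative values onto the non-negative range.
  return activeBits(U) + 1;
}

// True when the pair could be fed to vpdpbusd as (unsigned byte, signed byte).
inline bool fitsDpbusdOperands(uint32_t UnsignedOp, int32_t SignedOp) {
  return activeBits(UnsignedOp) <= 8 && significantBits(SignedOp) <= 8;
}
```

For example, 255 and -128 both fit, while 256 on the unsigned side or 128 on the signed side do not.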

		bool X86PartialReduction::tryMAddReplacement(Instruction *Op,
		bool ReduceInOneBB) {
if (!ST->hasSSE2())		if (!ST->hasSSE2())
return false;		return false;

// Need at least 8 elements.		// Need at least 8 elements.
if (cast<FixedVectorType>(Op->getType())->getNumElements() < 8)		if (cast<FixedVectorType>(Op->getType())->getNumElements() < 8)
return false;		return false;

// Element type should be i32.		// Element type should be i32.
if (!cast<VectorType>(Op->getType())->getElementType()->isIntegerTy(32))		if (!cast<VectorType>(Op->getType())->getElementType()->isIntegerTy(32))
return false;		return false;

auto *Mul = dyn_cast<BinaryOperator>(Op);		auto *Mul = dyn_cast<BinaryOperator>(Op);
if (!Mul || Mul->getOpcode() != Instruction::Mul)		if (!Mul || Mul->getOpcode() != Instruction::Mul)
return false;		return false;

Value *LHS = Mul->getOperand(0);		Value *LHS = Mul->getOperand(0);
Value *RHS = Mul->getOperand(1);		Value *RHS = Mul->getOperand(1);

		// If the target support VNNI, leave it to ISel to combine reduce operation
		// to VNNI instruction.
		// TODO: we can support transforming reduce to VNNI intrinsic for across block
		// in this pass.
		if (ReduceInOneBB && matchVPDPBUSDPattern(ST, Mul, DL))
		return false;

// LHS and RHS should be only used once or if they are the same then only		// LHS and RHS should be only used once or if they are the same then only
// used twice. Only check this when SSE4.1 is enabled and we have zext/sext		// used twice. Only check this when SSE4.1 is enabled and we have zext/sext
// instructions, otherwise we use punpck to emulate zero extend in stages. The		// instructions, otherwise we use punpck to emulate zero extend in stages. The
// trunc/ we need to do likely won't introduce new instructions in that case.		// trunc/ we need to do likely won't introduce new instructions in that case.
if (ST->hasSSE41()) {		if (ST->hasSSE41()) {
if (LHS == RHS) {		if (LHS == RHS) {
if (!isa<Constant>(LHS) && !LHS->hasNUses(2))		if (!isa<Constant>(LHS) && !LHS->hasNUses(2))
return false;		return false;
[... 202 lines not shown ...]	bool X86PartialReduction::trySADReplacement(Instruction *Op) {
SI->replaceAllUsesWith(Ops[0]);		SI->replaceAllUsesWith(Ops[0]);
SI->eraseFromParent();		SI->eraseFromParent();

return true;		return true;
}		}

// Walk backwards from the ExtractElementInst and determine if it is the end of		// Walk backwards from the ExtractElementInst and determine if it is the end of
// a horizontal reduction. Return the input to the reduction if we find one.		// a horizontal reduction. Return the input to the reduction if we find one.
static Value *matchAddReduction(const ExtractElementInst &EE) {		static Value *matchAddReduction(const ExtractElementInst &EE,
		bool &ReduceInOneBB) {
		ReduceInOneBB = true;
// Make sure we're extracting index 0.		// Make sure we're extracting index 0.
auto *Index = dyn_cast<ConstantInt>(EE.getIndexOperand());		auto *Index = dyn_cast<ConstantInt>(EE.getIndexOperand());
if (!Index || !Index->isNullValue())		if (!Index || !Index->isNullValue())
return nullptr;		return nullptr;

const auto *BO = dyn_cast<BinaryOperator>(EE.getVectorOperand());		const auto *BO = dyn_cast<BinaryOperator>(EE.getVectorOperand());
if (!BO || BO->getOpcode() != Instruction::Add || !BO->hasOneUse())		if (!BO || BO->getOpcode() != Instruction::Add || !BO->hasOneUse())
return nullptr;		return nullptr;
		if (EE.getParent() != BO->getParent())
		ReduceInOneBB = false;

unsigned NumElems = cast<FixedVectorType>(BO->getType())->getNumElements();		unsigned NumElems = cast<FixedVectorType>(BO->getType())->getNumElements();
// Ensure the reduction size is a power of 2.		// Ensure the reduction size is a power of 2.
if (!isPowerOf2_32(NumElems))		if (!isPowerOf2_32(NumElems))
return nullptr;		return nullptr;

const Value *Op = BO;		const Value *Op = BO;
unsigned Stages = Log2_32(NumElems);		unsigned Stages = Log2_32(NumElems);
for (unsigned i = 0; i != Stages; ++i) {		for (unsigned i = 0; i != Stages; ++i) {
const auto *BO = dyn_cast<BinaryOperator>(Op);		const auto *BO = dyn_cast<BinaryOperator>(Op);
if (!BO || BO->getOpcode() != Instruction::Add)		if (!BO || BO->getOpcode() != Instruction::Add)
return nullptr;		return nullptr;
		if (EE.getParent() != BO->getParent())
		ReduceInOneBB = false;

// If this isn't the first add, then it should only have 2 users, the		// If this isn't the first add, then it should only have 2 users, the
// shuffle and another add which we checked in the previous iteration.		// shuffle and another add which we checked in the previous iteration.
if (i != 0 && !BO->hasNUses(2))		if (i != 0 && !BO->hasNUses(2))
return nullptr;		return nullptr;

Value *LHS = BO->getOperand(0);		Value *LHS = BO->getOperand(0);
Value *RHS = BO->getOperand(1);		Value *RHS = BO->getOperand(1);
[... 123 lines not shown ...]	bool X86PartialReduction::runOnFunction(Function &F) {

bool MadeChange = false;		bool MadeChange = false;
for (auto &BB : F) {		for (auto &BB : F) {
for (auto &I : BB) {		for (auto &I : BB) {
auto *EE = dyn_cast<ExtractElementInst>(&I);		auto *EE = dyn_cast<ExtractElementInst>(&I);
if (!EE)		if (!EE)
continue;		continue;

		bool ReduceInOneBB;
// First find a reduction tree.		// First find a reduction tree.
// FIXME: Do we need to handle other opcodes than Add?		// FIXME: Do we need to handle other opcodes than Add?
Value *Root = matchAddReduction(*EE);		Value *Root = matchAddReduction(*EE, ReduceInOneBB);
if (!Root)		if (!Root)
continue;		continue;

SmallVector<Instruction *, 8> Leaves;		SmallVector<Instruction *, 8> Leaves;
collectLeaves(Root, Leaves);		collectLeaves(Root, Leaves);

for (Instruction *I : Leaves) {		for (Instruction *I : Leaves) {
if (tryMAddReplacement(I)) {		if (tryMAddReplacement(I, ReduceInOneBB)) {
MadeChange = true;		MadeChange = true;
continue;		continue;
}		}

// Don't do SAD matching on the root node. SelectionDAG already		// Don't do SAD matching on the root node. SelectionDAG already
// has support for that and currently generates better code.		// has support for that and currently generates better code.
if (I != Root && trySADReplacement(I))		if (I != Root && trySADReplacement(I))
MadeChange = true;		MadeChange = true;
}		}
}		}
}		}

return MadeChange;		return MadeChange;
}		}

llvm/test/CodeGen/X86/dpbusd.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avxvnni | FileCheck %s --check-prefixes=AVXVNNI
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vnni | FileCheck %s --check-prefixes=AVX512,AVX512VNNI
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vnni -mattr=+avx512vl | FileCheck %s --check-prefixes=AVX512,AVX512VLVNNI

				RKSimon (Done): You might be able to use a common prefix for some of these to reduce check duplication.
				define i32 @no_dpbusd(i8* %a, i8* %b, i32 %c, i32 %n) {
				; AVXVNNI-LABEL: no_dpbusd:
				RKSimon (Done): Drop dso_local?
				; AVXVNNI: # %bb.0: # %entry
				; AVXVNNI-NEXT: vpmovzxbw {{.*#+}} ymm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
				; AVXVNNI-NEXT: vpmovzxbw {{.*#+}} ymm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
				; AVXVNNI-NEXT: vpmaddwd %ymm0, %ymm1, %ymm0
				; AVXVNNI-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVXVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVXVNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
				pengfei (Not Done): This is the only difference from the AVX512 code, and it looks strange. Is something wrong in one of them?
				LuoYuanke (Author, Done): This test doesn't generate a vpdpbusd instruction, so AVX512VNNI and AVX512VL generate the same code. For the other test cases, AVX512VNNI can only use zmm registers, but AVX512VNNI + AVX512VL can use xmm registers.
				LuoYuanke (Author, Done): For vpdpbusd_64xi32, the result is the same.
				; AVXVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVXVNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; AVXVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVXVNNI-NEXT: vmovd %xmm0, %eax
				; AVXVNNI-NEXT: addl %edx, %eax
				; AVXVNNI-NEXT: vzeroupper
				; AVXVNNI-NEXT: retq
				;
				; AVX512-LABEL: no_dpbusd:
				; AVX512: # %bb.0: # %entry
				; AVX512-NEXT: vpmovzxbw {{.*#+}} ymm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
				; AVX512-NEXT: vpmovzxbw {{.*#+}} ymm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
				; AVX512-NEXT: vpmaddwd %ymm0, %ymm1, %ymm0
				; AVX512-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX512-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; AVX512-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
				; AVX512-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; AVX512-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512-NEXT: vmovd %xmm0, %eax
				; AVX512-NEXT: addl %edx, %eax
				; AVX512-NEXT: vzeroupper
				; AVX512-NEXT: retq
				entry:
				%0 = bitcast i8* %a to <16 x i8>*
				%1 = load <16 x i8>, <16 x i8>* %0, align 16
				%2 = zext <16 x i8> %1 to <16 x i32>
				%3 = bitcast i8* %b to <16 x i8>*
				%4 = load <16 x i8>, <16 x i8>* %3, align 16
				%5 = zext <16 x i8> %4 to <16 x i32>
				%6 = mul nsw <16 x i32> %5, %2
				%7 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %6)
				%op.extra = add nsw i32 %7, %c
				ret i32 %op.extra
				}

				define i32 @vpdpbusd_mutate(i8* %a, i8* %b, i32 %c, i32 %n) {
				; AVXVNNI-LABEL: vpdpbusd_mutate:
				; AVXVNNI: # %bb.0: # %entry
				; AVXVNNI-NEXT: vmovdqa (%rsi), %xmm0
				; AVXVNNI-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; AVXVNNI-NEXT: {vex} vpdpbusd (%rdi), %xmm0, %xmm1
				; AVXVNNI-NEXT: vpshufd {{.*#+}} xmm0 = xmm1[2,3,2,3]
				; AVXVNNI-NEXT: vpaddd %xmm0, %xmm1, %xmm0
				; AVXVNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; AVXVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVXVNNI-NEXT: vmovd %xmm0, %eax
				; AVXVNNI-NEXT: addl %edx, %eax
				; AVXVNNI-NEXT: retq
				;
				; AVX512VNNI-LABEL: vpdpbusd_mutate:
				; AVX512VNNI: # %bb.0: # %entry
				; AVX512VNNI-NEXT: vmovdqa (%rdi), %xmm0
				; AVX512VNNI-NEXT: vmovdqa (%rsi), %xmm1
				; AVX512VNNI-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVX512VNNI-NEXT: vpdpbusd %zmm0, %zmm1, %zmm2
				; AVX512VNNI-NEXT: vpshufd {{.*#+}} xmm0 = xmm2[2,3,2,3]
				; AVX512VNNI-NEXT: vpaddd %xmm0, %xmm2, %xmm0
				; AVX512VNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; AVX512VNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512VNNI-NEXT: vmovd %xmm0, %eax
				; AVX512VNNI-NEXT: addl %edx, %eax
				; AVX512VNNI-NEXT: vzeroupper
				; AVX512VNNI-NEXT: retq
				;
				; AVX512VLVNNI-LABEL: vpdpbusd_mutate:
				; AVX512VLVNNI: # %bb.0: # %entry
				; AVX512VLVNNI-NEXT: vmovdqa (%rsi), %xmm0
				; AVX512VLVNNI-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; AVX512VLVNNI-NEXT: vpdpbusd (%rdi), %xmm0, %xmm1
				; AVX512VLVNNI-NEXT: vpshufd {{.*#+}} xmm0 = xmm1[2,3,2,3]
				; AVX512VLVNNI-NEXT: vpaddd %xmm0, %xmm1, %xmm0
				; AVX512VLVNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; AVX512VLVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512VLVNNI-NEXT: vmovd %xmm0, %eax
				; AVX512VLVNNI-NEXT: addl %edx, %eax
				; AVX512VLVNNI-NEXT: retq
				entry:
				%0 = bitcast i8* %a to <16 x i8>*
				%1 = load <16 x i8>, <16 x i8>* %0, align 16
				%2 = sext <16 x i8> %1 to <16 x i32>
				%3 = bitcast i8* %b to <16 x i8>*
				%4 = load <16 x i8>, <16 x i8>* %3, align 16
				%5 = zext <16 x i8> %4 to <16 x i32>
				%6 = mul nsw <16 x i32> %5, %2
				%7 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %6)
				%op.extra = add nsw i32 %7, %c
				ret i32 %op.extra
				}

				define i32 @mul_zext(i8* %a, i8* %b, i32 %c, i32 %n) {
				; AVXVNNI-LABEL: mul_zext:
				; AVXVNNI: # %bb.0: # %entry
				; AVXVNNI-NEXT: vpmovzxbw {{.*#+}} ymm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
				; AVXVNNI-NEXT: vpmovsxbw (%rsi), %ymm1
				; AVXVNNI-NEXT: vpmullw %ymm0, %ymm1, %ymm0
				; AVXVNNI-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVXVNNI-NEXT: vpmovzxwd {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
				; AVXVNNI-NEXT: vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
				; AVXVNNI-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; AVXVNNI-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVXVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVXVNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
				; AVXVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVXVNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; AVXVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVXVNNI-NEXT: vmovd %xmm0, %eax
				; AVXVNNI-NEXT: addl %edx, %eax
				; AVXVNNI-NEXT: vzeroupper
				; AVXVNNI-NEXT: retq
				;
				; AVX512-LABEL: mul_zext:
				; AVX512: # %bb.0: # %entry
				; AVX512-NEXT: vpmovzxbw {{.*#+}} ymm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
				; AVX512-NEXT: vpmovsxbw (%rsi), %ymm1
				; AVX512-NEXT: vpmullw %ymm0, %ymm1, %ymm0
				; AVX512-NEXT: vpmovzxwd {{.*#+}} zmm0 = ymm0[0],zero,ymm0[1],zero,ymm0[2],zero,ymm0[3],zero,ymm0[4],zero,ymm0[5],zero,ymm0[6],zero,ymm0[7],zero,ymm0[8],zero,ymm0[9],zero,ymm0[10],zero,ymm0[11],zero,ymm0[12],zero,ymm0[13],zero,ymm0[14],zero,ymm0[15],zero
				; AVX512-NEXT: vextracti64x4 $1, %zmm0, %ymm1
				; AVX512-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX512-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
				; AVX512-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; AVX512-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512-NEXT: vmovd %xmm0, %eax
				; AVX512-NEXT: addl %edx, %eax
				; AVX512-NEXT: vzeroupper
				; AVX512-NEXT: retq
				entry:
				%0 = bitcast i8* %a to <16 x i8>*
				%1 = load <16 x i8>, <16 x i8>* %0, align 16
				%2 = zext <16 x i8> %1 to <16 x i16>
				%3 = bitcast i8* %b to <16 x i8>*
				%4 = load <16 x i8>, <16 x i8>* %3, align 16
				%5 = sext <16 x i8> %4 to <16 x i16>
				%6 = mul nsw <16 x i16> %5, %2
				; We can't combine to vpdpbusd for zext, because each of the 4 multiplies
				; done by vpdpbusd compute a signed 16-bit product that will be sign extended
				; before adding into the accumulator.
				%7 = zext <16 x i16> %6 to <16 x i32>
				%8 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %7)
				%op.extra = add nsw i32 %8, %c
				ret i32 %op.extra
				}

				define i32 @mul_sext(i8 *%a, i8 *%b, i32 %c, i32 %n) {
				; AVXVNNI-LABEL: mul_sext:
				; AVXVNNI: # %bb.0: # %entry
				; AVXVNNI-NEXT: vpmovzxbw {{.*#+}} ymm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
				; AVXVNNI-NEXT: vpmovsxbw (%rsi), %ymm1
				; AVXVNNI-NEXT: vpmullw %ymm0, %ymm1, %ymm0
				; AVXVNNI-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVXVNNI-NEXT: vpmovsxwd %xmm1, %ymm1
				; AVXVNNI-NEXT: vpmovsxwd %xmm0, %ymm0
				; AVXVNNI-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; AVXVNNI-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVXVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVXVNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
				; AVXVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVXVNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; AVXVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVXVNNI-NEXT: vmovd %xmm0, %eax
				; AVXVNNI-NEXT: addl %edx, %eax
				; AVXVNNI-NEXT: vzeroupper
				; AVXVNNI-NEXT: retq
				;
				; AVX512-LABEL: mul_sext:
				; AVX512: # %bb.0: # %entry
				; AVX512-NEXT: vpmovzxbw {{.*#+}} ymm0 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero,mem[8],zero,mem[9],zero,mem[10],zero,mem[11],zero,mem[12],zero,mem[13],zero,mem[14],zero,mem[15],zero
				; AVX512-NEXT: vpmovsxbw (%rsi), %ymm1
				; AVX512-NEXT: vpmullw %ymm0, %ymm1, %ymm0
				; AVX512-NEXT: vpmovsxwd %ymm0, %zmm0
				; AVX512-NEXT: vextracti64x4 $1, %zmm0, %ymm1
				; AVX512-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; AVX512-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX512-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
				; AVX512-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; AVX512-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512-NEXT: vmovd %xmm0, %eax
				; AVX512-NEXT: addl %edx, %eax
				; AVX512-NEXT: vzeroupper
				; AVX512-NEXT: retq
				entry:
				%0 = bitcast i8* %a to <16 x i8>*
				%1 = load <16 x i8>, <16 x i8>* %0, align 16
				%2 = zext <16 x i8> %1 to <16 x i16>
				%3 = bitcast i8* %b to <16 x i8>*
				%4 = load <16 x i8>, <16 x i8>* %3, align 16
				%5 = sext <16 x i8> %4 to <16 x i16>
				%6 = mul nsw <16 x i16> %5, %2
				; TODO:
				; We also need to verify that the multiply has at least 2x the number of bits
				; of the input. We shouldn't match
				; (sign_extend (mul (vXi9 (zext (vXi8 X))), (vXi9 (zext (vXi8 Y))))).
				%7 = sext <16 x i16> %6 to <16 x i32>
				%8 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %7)
				%op.extra = add nsw i32 %8, %c
				ret i32 %op.extra
				}

				define i32 @vpdpbusd_512(i8 *%a, i8 *%b, i32 %c, i32 %n) {
				; AVXVNNI-LABEL: vpdpbusd_512:
				; AVXVNNI: # %bb.0: # %entry
				; AVXVNNI-NEXT: vmovdqa (%rdi), %xmm0
				; AVXVNNI-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; AVXVNNI-NEXT: {vex} vpdpbusd (%rsi), %xmm0, %xmm1
				; AVXVNNI-NEXT: vpshufd {{.*#+}} xmm0 = xmm1[2,3,2,3]
				; AVXVNNI-NEXT: vpaddd %xmm0, %xmm1, %xmm0
				; AVXVNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; AVXVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVXVNNI-NEXT: vmovd %xmm0, %eax
				; AVXVNNI-NEXT: addl %edx, %eax
				; AVXVNNI-NEXT: retq
				;
				; AVX512VNNI-LABEL: vpdpbusd_512:
				; AVX512VNNI: # %bb.0: # %entry
				; AVX512VNNI-NEXT: vmovdqa (%rdi), %xmm0
				; AVX512VNNI-NEXT: vmovdqa (%rsi), %xmm1
				; AVX512VNNI-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVX512VNNI-NEXT: vpdpbusd %zmm1, %zmm0, %zmm2
				; AVX512VNNI-NEXT: vpshufd {{.*#+}} xmm0 = xmm2[2,3,2,3]
				; AVX512VNNI-NEXT: vpaddd %xmm0, %xmm2, %xmm0
				; AVX512VNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; AVX512VNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512VNNI-NEXT: vmovd %xmm0, %eax
				; AVX512VNNI-NEXT: addl %edx, %eax
				; AVX512VNNI-NEXT: vzeroupper
				; AVX512VNNI-NEXT: retq
				;
				; AVX512VLVNNI-LABEL: vpdpbusd_512:
				; AVX512VLVNNI: # %bb.0: # %entry
				; AVX512VLVNNI-NEXT: vmovdqa (%rdi), %xmm0
				; AVX512VLVNNI-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; AVX512VLVNNI-NEXT: vpdpbusd (%rsi), %xmm0, %xmm1
				; AVX512VLVNNI-NEXT: vpshufd {{.*#+}} xmm0 = xmm1[2,3,2,3]
				; AVX512VLVNNI-NEXT: vpaddd %xmm0, %xmm1, %xmm0
				; AVX512VLVNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; AVX512VLVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512VLVNNI-NEXT: vmovd %xmm0, %eax
				; AVX512VLVNNI-NEXT: addl %edx, %eax
				; AVX512VLVNNI-NEXT: retq
				entry:
				%0 = bitcast i8* %a to <16 x i8>*
				%1 = load <16 x i8>, <16 x i8>* %0, align 16
				%2 = zext <16 x i8> %1 to <16 x i32>
				%3 = bitcast i8* %b to <16 x i8>*
				%4 = load <16 x i8>, <16 x i8>* %3, align 16
				%5 = sext <16 x i8> %4 to <16 x i32>
				%6 = mul nsw <16 x i32> %5, %2
				%7 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %6)
				%op.extra = add nsw i32 %7, %c
				ret i32 %op.extra
				}

				declare i32 @llvm.vector.reduce.add.v16i32(<16 x i32>)

				define i32 @vpdpbusd_256(i8 *%a, i8 *%b, i32 %c, i32 %n) {
				; AVXVNNI-LABEL: vpdpbusd_256:
				; AVXVNNI: # %bb.0: # %entry
				; AVXVNNI-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
				; AVXVNNI-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
				; AVXVNNI-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVXVNNI-NEXT: {vex} vpdpbusd %xmm0, %xmm1, %xmm2
				; AVXVNNI-NEXT: vpshufd {{.*#+}} xmm0 = xmm2[1,1,1,1]
				; AVXVNNI-NEXT: vpaddd %xmm0, %xmm2, %xmm0
				; AVXVNNI-NEXT: vmovd %xmm0, %eax
				; AVXVNNI-NEXT: addl %edx, %eax
				; AVXVNNI-NEXT: retq
				;
				; AVX512VNNI-LABEL: vpdpbusd_256:
				; AVX512VNNI: # %bb.0: # %entry
				; AVX512VNNI-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
				; AVX512VNNI-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
				; AVX512VNNI-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVX512VNNI-NEXT: vpdpbusd %zmm0, %zmm1, %zmm2
				; AVX512VNNI-NEXT: vpshufd {{.*#+}} xmm0 = xmm2[1,1,1,1]
				; AVX512VNNI-NEXT: vpaddd %xmm0, %xmm2, %xmm0
				; AVX512VNNI-NEXT: vmovd %xmm0, %eax
				; AVX512VNNI-NEXT: addl %edx, %eax
				; AVX512VNNI-NEXT: vzeroupper
				; AVX512VNNI-NEXT: retq
				;
				; AVX512VLVNNI-LABEL: vpdpbusd_256:
				; AVX512VLVNNI: # %bb.0: # %entry
				; AVX512VLVNNI-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
				; AVX512VLVNNI-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
				; AVX512VLVNNI-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVX512VLVNNI-NEXT: vpdpbusd %xmm0, %xmm1, %xmm2
				; AVX512VLVNNI-NEXT: vpshufd {{.*#+}} xmm0 = xmm2[1,1,1,1]
				; AVX512VLVNNI-NEXT: vpaddd %xmm0, %xmm2, %xmm0
				; AVX512VLVNNI-NEXT: vmovd %xmm0, %eax
				; AVX512VLVNNI-NEXT: addl %edx, %eax
				; AVX512VLVNNI-NEXT: retq
				entry:
				%0 = bitcast i8* %a to <8 x i8>*
				%1 = load <8 x i8>, <8 x i8>* %0, align 8
				%2 = zext <8 x i8> %1 to <8 x i32>
				%3 = bitcast i8* %b to <8 x i8>*
				%4 = load <8 x i8>, <8 x i8>* %3, align 8
				%5 = sext <8 x i8> %4 to <8 x i32>
				%6 = mul nsw <8 x i32> %5, %2
				%7 = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %6)
				%op.extra = add nsw i32 %7, %c
				ret i32 %op.extra
				}

				declare i32 @llvm.vector.reduce.add.v8i32(<8 x i32>)

				define i32 @vpdpbusd_128(i8 *%a, i8 *%b, i32 %c, i32 %n) {
				; AVXVNNI-LABEL: vpdpbusd_128:
				; AVXVNNI: # %bb.0: # %entry
				; AVXVNNI-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
				; AVXVNNI-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
				; AVXVNNI-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVXVNNI-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm2[2,3,4,5,6,7]
				; AVXVNNI-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3,4,5,6,7]
				; AVXVNNI-NEXT: {vex} vpdpbusd %xmm1, %xmm0, %xmm2
				; AVXVNNI-NEXT: vmovd %xmm2, %eax
				; AVXVNNI-NEXT: addl %edx, %eax
				; AVXVNNI-NEXT: retq
				;
				; AVX512VNNI-LABEL: vpdpbusd_128:
				; AVX512VNNI: # %bb.0: # %entry
				; AVX512VNNI-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
				; AVX512VNNI-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; AVX512VNNI-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3,4,5,6,7]
				; AVX512VNNI-NEXT: vmovq {{.*#+}} xmm2 = mem[0],zero
				; AVX512VNNI-NEXT: vpblendw {{.*#+}} xmm1 = xmm2[0,1],xmm1[2,3,4,5,6,7]
				; AVX512VNNI-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVX512VNNI-NEXT: vpdpbusd %zmm0, %zmm1, %zmm2
				; AVX512VNNI-NEXT: vmovd %xmm2, %eax
				; AVX512VNNI-NEXT: addl %edx, %eax
				; AVX512VNNI-NEXT: vzeroupper
				; AVX512VNNI-NEXT: retq
				;
				; AVX512VLVNNI-LABEL: vpdpbusd_128:
				; AVX512VLVNNI: # %bb.0: # %entry
				; AVX512VLVNNI-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
				; AVX512VLVNNI-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
				; AVX512VLVNNI-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVX512VLVNNI-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm2[2,3,4,5,6,7]
				; AVX512VLVNNI-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1],xmm2[2,3,4,5,6,7]
				; AVX512VLVNNI-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVX512VLVNNI-NEXT: vpdpbusd %xmm1, %xmm0, %xmm2
				; AVX512VLVNNI-NEXT: vmovd %xmm2, %eax
				; AVX512VLVNNI-NEXT: addl %edx, %eax
				; AVX512VLVNNI-NEXT: retq
				entry:
				%0 = bitcast i8* %a to <4 x i8>*
				%1 = load <4 x i8>, <4 x i8>* %0, align 8
				%2 = zext <4 x i8> %1 to <4 x i32>
				%3 = bitcast i8* %b to <4 x i8>*
				%4 = load <4 x i8>, <4 x i8>* %3, align 8
				%5 = sext <4 x i8> %4 to <4 x i32>
				%6 = mul nsw <4 x i32> %5, %2
				%7 = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %6)
				%op.extra = add nsw i32 %7, %c
				ret i32 %op.extra
				}

				declare i32 @llvm.vector.reduce.add.v4i32(<4 x i32>)

				define i32 @vpdpbusd_2xi32(i8 *%a, i8 *%b, i32 %c, i32 %n) {
				; AVXVNNI-LABEL: vpdpbusd_2xi32:
				; AVXVNNI: # %bb.0: # %entry
				; AVXVNNI-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
				; AVXVNNI-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
				; AVXVNNI-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVXVNNI-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0],xmm2[1,2,3,4,5,6,7]
				; AVXVNNI-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0],xmm2[1,2,3,4,5,6,7]
				; AVXVNNI-NEXT: {vex} vpdpbusd %xmm1, %xmm0, %xmm2
				; AVXVNNI-NEXT: vmovd %xmm2, %eax
				; AVXVNNI-NEXT: addl %edx, %eax
				; AVXVNNI-NEXT: retq
				;
				; AVX512VNNI-LABEL: vpdpbusd_2xi32:
				; AVX512VNNI: # %bb.0: # %entry
				; AVX512VNNI-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
				; AVX512VNNI-NEXT: vmovdqa {{.*#+}} xmm1 = [65535,0,0,0]
				; AVX512VNNI-NEXT: vpandq %zmm1, %zmm0, %zmm0
				; AVX512VNNI-NEXT: vmovq {{.*#+}} xmm2 = mem[0],zero
				; AVX512VNNI-NEXT: vpandq %zmm1, %zmm2, %zmm1
				; AVX512VNNI-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVX512VNNI-NEXT: vpdpbusd %zmm0, %zmm1, %zmm2
				; AVX512VNNI-NEXT: vmovd %xmm2, %eax
				; AVX512VNNI-NEXT: addl %edx, %eax
				; AVX512VNNI-NEXT: vzeroupper
				; AVX512VNNI-NEXT: retq
				;
				; AVX512VLVNNI-LABEL: vpdpbusd_2xi32:
				; AVX512VLVNNI: # %bb.0: # %entry
				; AVX512VLVNNI-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
				; AVX512VLVNNI-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
				; AVX512VLVNNI-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVX512VLVNNI-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0],xmm2[1,2,3,4,5,6,7]
				; AVX512VLVNNI-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0],xmm2[1,2,3,4,5,6,7]
				; AVX512VLVNNI-NEXT: vpdpbusd %xmm1, %xmm0, %xmm2
				; AVX512VLVNNI-NEXT: vmovd %xmm2, %eax
				; AVX512VLVNNI-NEXT: addl %edx, %eax
				; AVX512VLVNNI-NEXT: retq
				entry:
				%0 = bitcast i8* %a to <2 x i8>*
				%1 = load <2 x i8>, <2 x i8>* %0, align 8
				%2 = zext <2 x i8> %1 to <2 x i32>
				%3 = bitcast i8* %b to <2 x i8>*
				%4 = load <2 x i8>, <2 x i8>* %3, align 8
				%5 = sext <2 x i8> %4 to <2 x i32>
				%6 = mul nsw <2 x i32> %5, %2
				%7 = call i32 @llvm.vector.reduce.add.v2i32(<2 x i32> %6)
				%op.extra = add nsw i32 %7, %c
				ret i32 %op.extra
				}

				declare i32 @llvm.vector.reduce.add.v2i32(<2 x i32>)

				define i32 @vpdpbusd_32xi32(i8 *%a, i8 *%b, i32 %c, i32 %n) {
				; AVXVNNI-LABEL: vpdpbusd_32xi32:
				; AVXVNNI: # %bb.0: # %entry
				; AVXVNNI-NEXT: vmovdqu (%rdi), %ymm0
				; AVXVNNI-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; AVXVNNI-NEXT: {vex} vpdpbusd (%rsi), %ymm0, %ymm1
				; AVXVNNI-NEXT: vextracti128 $1, %ymm1, %xmm0
				; AVXVNNI-NEXT: vpaddd %xmm0, %xmm1, %xmm0
				; AVXVNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
				; AVXVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVXVNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; AVXVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVXVNNI-NEXT: vmovd %xmm0, %eax
				; AVXVNNI-NEXT: addl %edx, %eax
				; AVXVNNI-NEXT: vzeroupper
				; AVXVNNI-NEXT: retq
				;
				; AVX512VNNI-LABEL: vpdpbusd_32xi32:
				; AVX512VNNI: # %bb.0: # %entry
				; AVX512VNNI-NEXT: vmovdqu (%rdi), %ymm0
				; AVX512VNNI-NEXT: vmovdqu (%rsi), %ymm1
				; AVX512VNNI-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVX512VNNI-NEXT: vpdpbusd %zmm1, %zmm0, %zmm2
				; AVX512VNNI-NEXT: vextracti128 $1, %ymm2, %xmm0
				; AVX512VNNI-NEXT: vpaddd %xmm0, %xmm2, %xmm0
				; AVX512VNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
				; AVX512VNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512VNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; AVX512VNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512VNNI-NEXT: vmovd %xmm0, %eax
				; AVX512VNNI-NEXT: addl %edx, %eax
				; AVX512VNNI-NEXT: vzeroupper
				; AVX512VNNI-NEXT: retq
				;
				; AVX512VLVNNI-LABEL: vpdpbusd_32xi32:
				; AVX512VLVNNI: # %bb.0: # %entry
				; AVX512VLVNNI-NEXT: vmovdqu (%rdi), %ymm0
				; AVX512VLVNNI-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; AVX512VLVNNI-NEXT: vpdpbusd (%rsi), %ymm0, %ymm1
				; AVX512VLVNNI-NEXT: vextracti128 $1, %ymm1, %xmm0
				; AVX512VLVNNI-NEXT: vpaddd %xmm0, %xmm1, %xmm0
				; AVX512VLVNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
				; AVX512VLVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512VLVNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; AVX512VLVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512VLVNNI-NEXT: vmovd %xmm0, %eax
				; AVX512VLVNNI-NEXT: addl %edx, %eax
				; AVX512VLVNNI-NEXT: vzeroupper
				; AVX512VLVNNI-NEXT: retq
				entry:
				%0 = bitcast i8* %a to <32 x i8>*
				%1 = load <32 x i8>, <32 x i8>* %0, align 16
				%2 = zext <32 x i8> %1 to <32 x i32>
				%3 = bitcast i8* %b to <32 x i8>*
				%4 = load <32 x i8>, <32 x i8>* %3, align 16
				%5 = sext <32 x i8> %4 to <32 x i32>
				%6 = mul nsw <32 x i32> %5, %2
				%7 = call i32 @llvm.vector.reduce.add.v32i32(<32 x i32> %6)
				%op.extra = add nsw i32 %7, %c
				ret i32 %op.extra
				}

				declare i32 @llvm.vector.reduce.add.v32i32(<32 x i32>)

				define i32 @vpdpbusd_64xi32(i8 *%a, i8 *%b, i32 %c, i32 %n) {
				; AVXVNNI-LABEL: vpdpbusd_64xi32:
				; AVXVNNI: # %bb.0: # %entry
				; AVXVNNI-NEXT: vmovdqu (%rdi), %ymm0
				; AVXVNNI-NEXT: vmovdqu 32(%rdi), %ymm1
				; AVXVNNI-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVXVNNI-NEXT: vpxor %xmm3, %xmm3, %xmm3
				; AVXVNNI-NEXT: {vex} vpdpbusd 32(%rsi), %ymm1, %ymm3
				; AVXVNNI-NEXT: {vex} vpdpbusd (%rsi), %ymm0, %ymm2
				; AVXVNNI-NEXT: vpaddd %ymm3, %ymm2, %ymm0
				; AVXVNNI-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVXVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVXVNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
				; AVXVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVXVNNI-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; AVXVNNI-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVXVNNI-NEXT: vmovd %xmm0, %eax
				; AVXVNNI-NEXT: addl %edx, %eax
				; AVXVNNI-NEXT: vzeroupper
				; AVXVNNI-NEXT: retq
				;
				; AVX512-LABEL: vpdpbusd_64xi32:
				; AVX512: # %bb.0: # %entry
				; AVX512-NEXT: vmovdqu64 (%rdi), %zmm0
				; AVX512-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; AVX512-NEXT: vpdpbusd (%rsi), %zmm0, %zmm1
				; AVX512-NEXT: vextracti64x4 $1, %zmm1, %ymm0
				; AVX512-NEXT: vpaddd %zmm0, %zmm1, %zmm0
				; AVX512-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX512-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
				; AVX512-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; AVX512-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; AVX512-NEXT: vmovd %xmm0, %eax
				; AVX512-NEXT: addl %edx, %eax
				; AVX512-NEXT: vzeroupper
				; AVX512-NEXT: retq
				entry:
				%0 = bitcast i8* %a to <64 x i8>*
				%1 = load <64 x i8>, <64 x i8>* %0, align 16
				%2 = zext <64 x i8> %1 to <64 x i32>
				%3 = bitcast i8* %b to <64 x i8>*
				%4 = load <64 x i8>, <64 x i8>* %3, align 16
				%5 = sext <64 x i8> %4 to <64 x i32>
				%6 = mul nsw <64 x i32> %5, %2
				%7 = call i32 @llvm.vector.reduce.add.v64i32(<64 x i32> %6)
				%op.extra = add nsw i32 %7, %c
				ret i32 %op.extra
				}

				declare i32 @llvm.vector.reduce.add.v64i32(<64 x i32>)
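
The comment in mul_zext says the zero-extended 16-bit product cannot be mapped to vpdpbusd because the instruction sign-extends each product before accumulating. A minimal scalar sketch of that difference follows. The function names dpbusd_lane and mul_zext_lane are hypothetical and not part of the patch, and the model covers a single 32-bit lane only.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit lane of vpdpbusd: four unsigned bytes times four
   signed bytes, each product sign-extended and added to the accumulator.
   Hypothetical helper for illustration, not code from the patch. */
int32_t dpbusd_lane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
    uint32_t sum = (uint32_t)acc; /* unsigned so the add wraps, no saturation */
    for (int i = 0; i < 4; i++) {
        int32_t prod = (int32_t)a[i] * (int32_t)b[i];
        /* A u8 * s8 product always fits in a signed 16-bit value. */
        assert(prod >= INT16_MIN && prod <= INT16_MAX);
        sum += (uint32_t)prod;
    }
    return (int32_t)sum;
}

/* Model of the mul_zext pattern for the same four elements: the product is
   kept in 16 bits and then zero-extended (not sign-extended) to 32 bits. */
int32_t mul_zext_lane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
    uint32_t sum = (uint32_t)acc;
    for (int i = 0; i < 4; i++) {
        uint16_t prod16 = (uint16_t)((int32_t)a[i] * (int32_t)b[i]);
        sum += (uint32_t)prod16; /* zext i16 -> i32 */
    }
    return (int32_t)sum;
}
```

With a = {1, 0, 0, 0} and b = {-1, 0, 0, 0}, the first model yields -1 while the second yields 65535, so the two patterns diverge as soon as any product is negative. When all products are non-negative the two models agree.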

llvm/test/CodeGen/X86/dpbusd_i4.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vnni -mattr=+avx512vl | FileCheck %s

				declare i32 @llvm.vector.reduce.add.v16i32(<16 x i32>)

				define i32 @mul_i8i8(i8 *%a, <16 x i8> %b, i32 %c) {
				; CHECK-LABEL: mul_i8i8:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: vmovdqa (%rdi), %xmm1
				; CHECK-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; CHECK-NEXT: vpdpbusd %xmm0, %xmm1, %xmm2
				; CHECK-NEXT: vpshufd {{.*#+}} xmm0 = xmm2[2,3,2,3]
				; CHECK-NEXT: vpaddd %xmm0, %xmm2, %xmm0
				; CHECK-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; CHECK-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vmovd %xmm0, %eax
				; CHECK-NEXT: addl %esi, %eax
				; CHECK-NEXT: retq
				entry:
				%0 = bitcast i8* %a to <16 x i8>*
				%1 = load <16 x i8>, <16 x i8>* %0, align 16
				%2 = zext <16 x i8> %1 to <16 x i32>
				%3 = sext <16 x i8> %b to <16 x i32>
				%4 = mul nsw <16 x i32> %2, %3
				%5 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %4)
				%op.extra = add nsw i32 %5, %c
				ret i32 %op.extra
				}

				define i32 @mul_i4i8(<16 x i4> %a, <16 x i8> %b, i32 %c) {
				; CHECK-LABEL: mul_i4i8:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
				; CHECK-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; CHECK-NEXT: vpdpbusd %xmm1, %xmm0, %xmm2
				; CHECK-NEXT: vpshufd {{.*#+}} xmm0 = xmm2[2,3,2,3]
				; CHECK-NEXT: vpaddd %xmm0, %xmm2, %xmm0
				; CHECK-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; CHECK-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vmovd %xmm0, %eax
				; CHECK-NEXT: addl %edi, %eax
				; CHECK-NEXT: retq
				entry:
				%0 = zext <16 x i4> %a to <16 x i32>
				%1 = sext <16 x i8> %b to <16 x i32>
				%2 = mul nsw <16 x i32> %0, %1
				%3 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %2)
				%op.extra = add nsw i32 %3, %c
				ret i32 %op.extra
				}

				define i32 @mul_i4i4(<16 x i4> %a, <16 x i4> %b, i32 %c) {
				; CHECK-LABEL: mul_i4i4:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: vpsllw $4, %xmm1, %xmm1
				; CHECK-NEXT: vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
				; CHECK-NEXT: vpsrlw $4, %xmm1, %xmm1
				; CHECK-NEXT: vmovdqa {{.*#+}} xmm2 = [8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8]
				; CHECK-NEXT: vpxor %xmm2, %xmm1, %xmm1
				; CHECK-NEXT: vpsubb %xmm2, %xmm1, %xmm1
				; CHECK-NEXT: vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
				; CHECK-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; CHECK-NEXT: vpdpbusd %xmm1, %xmm0, %xmm2
				; CHECK-NEXT: vpshufd {{.*#+}} xmm0 = xmm2[2,3,2,3]
				; CHECK-NEXT: vpaddd %xmm0, %xmm2, %xmm0
				; CHECK-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; CHECK-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vmovd %xmm0, %eax
				; CHECK-NEXT: addl %edi, %eax
				; CHECK-NEXT: retq
				entry:
				%0 = zext <16 x i4> %a to <16 x i32>
				%1 = sext <16 x i4> %b to <16 x i32>
				%2 = mul nsw <16 x i32> %0, %1
				%3 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %2)
				%op.extra = add nsw i32 %3, %c
				ret i32 %op.extra
				}

				define i32 @mul_sext_i4i4(<16 x i4> %a, <16 x i4> %b, i32 %c) {
				; CHECK-LABEL: mul_sext_i4i4:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: vpmovzxbw {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero,xmm0[8],zero,xmm0[9],zero,xmm0[10],zero,xmm0[11],zero,xmm0[12],zero,xmm0[13],zero,xmm0[14],zero,xmm0[15],zero
				; CHECK-NEXT: vpmovzxbw {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero,xmm1[8],zero,xmm1[9],zero,xmm1[10],zero,xmm1[11],zero,xmm1[12],zero,xmm1[13],zero,xmm1[14],zero,xmm1[15],zero
				; CHECK-NEXT: vpsllw $12, %ymm1, %ymm1
				; CHECK-NEXT: vpsraw $12, %ymm1, %ymm1
				; CHECK-NEXT: vpsllw $12, %ymm0, %ymm0
				; CHECK-NEXT: vpsraw $12, %ymm0, %ymm0
				; CHECK-NEXT: vpmaddwd %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vextracti128 $1, %ymm0, %xmm1
				; CHECK-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
				; CHECK-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; CHECK-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vmovd %xmm0, %eax
				; CHECK-NEXT: addl %edi, %eax
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				entry:
				%0 = sext <16 x i4> %a to <16 x i32>
				%1 = sext <16 x i4> %b to <16 x i32>
				%2 = mul nsw <16 x i32> %0, %1
				%3 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %2)
				%op.extra = add nsw i32 %3, %c
				ret i32 %op.extra
				}

				define i32 @mul_zext_i4i4(<16 x i4> %a, <16 x i4> %b, i32 %c) {
				; CHECK-LABEL: mul_zext_i4i4:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: vmovdqa {{.*#+}} xmm2 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; CHECK-NEXT: vpand %xmm2, %xmm1, %xmm1
				; CHECK-NEXT: vpand %xmm2, %xmm0, %xmm0
				; CHECK-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; CHECK-NEXT: vpdpbusd %xmm1, %xmm0, %xmm2
				; CHECK-NEXT: vpshufd {{.*#+}} xmm0 = xmm2[2,3,2,3]
				; CHECK-NEXT: vpaddd %xmm0, %xmm2, %xmm0
				; CHECK-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,1,1]
				; CHECK-NEXT: vpaddd %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vmovd %xmm0, %eax
				; CHECK-NEXT: addl %edi, %eax
				; CHECK-NEXT: retq
				entry:
				%0 = zext <16 x i4> %a to <16 x i32>
				%1 = zext <16 x i4> %b to <16 x i32>
				%2 = mul nsw <16 x i32> %0, %1
				%3 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %2)
				%op.extra = add nsw i32 %3, %c
				ret i32 %op.extra
				}
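
One way to read the CHECK lines in this file: mul_i4i8, mul_i4i4 and mul_zext_i4i4 all end in vpdpbusd, while mul_sext_i4i4 falls back to vpmaddwd. That is consistent with requiring the first operand to be representable as an unsigned byte and the second as a signed byte. The sketch below checks those value ranges exhaustively for i4 inputs. It is an interpretation of the test output rather than the patch's matching logic, and every helper name in it is hypothetical.

```c
#include <stdint.h>

/* Hypothetical range checks for the two vpdpbusd operand roles. */
int fits_u8(int32_t v) { return v >= 0 && v <= 255; }
int fits_s8(int32_t v) { return v >= -128 && v <= 127; }

/* Widen a 4-bit pattern (0..15) the way zext/sext from i4 would. */
int32_t zext_i4(unsigned bits) { return (int32_t)(bits & 0xF); }
int32_t sext_i4(unsigned bits) {
    int32_t v = (int32_t)(bits & 0xF);
    return v >= 8 ? v - 16 : v;
}

/* Returns 1 if every i4 value, widened by `widen`, passes `fits`. */
int all_i4_fit(int32_t (*widen)(unsigned), int (*fits)(int32_t)) {
    for (unsigned bits = 0; bits < 16; bits++)
        if (!fits(widen(bits)))
            return 0;
    return 1;
}
```

Under this model a zero-extended i4 is always in 0..15, so it can serve as either operand, which matches mul_zext_i4i4 selecting vpdpbusd. A sign-extended i4 can be negative, so it fits only the signed role, which matches mul_sext_i4i4 not selecting it.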

Revision Contents

Diff 398109
