Details

Reviewers

SjoerdMeijer
dmgreen
fhahn

Commits

rGf19259495628: [AArch64] Generate dot for v16i8 sum reduction to i32

Summary

Convert VECREDUCE_ADD( EXTEND(v16i8_type) ) to VECREDUCE_ADD( DOTv16i8(v16i8_type) ) whenever the result type is i32. This gives a gain in one of the SPEC CPU 2017 benchmarks.

This partially solves the bug: https://bugs.llvm.org/show_bug.cgi?id=46888
Meta ticket: https://bugs.llvm.org/show_bug.cgi?id=46929
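
To illustrate why the rewrite in the summary preserves the result, here is a minimal scalar sketch. It is not code from the patch: `udot_model`, `reduce_add_zext` and `reduce_add_via_udot` are hypothetical names, and `udot_model` only mimics the lane-wise behaviour of the NEON UDOT instruction (each 32-bit lane accumulates four byte products). Multiplying by an all-ones vector makes every product equal to the zero-extended input byte, so reducing the four lanes gives the same sum as reducing the sixteen extended elements.

```cpp
#include <array>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;
using Lanes4 = std::array<std::uint32_t, 4>;

// Scalar model of UDOT (assumed semantics): each 32-bit lane accumulates
// the sum of four adjacent unsigned byte products.
Lanes4 udot_model(Lanes4 acc, const Bytes16 &a, const Bytes16 &b) {
  for (int lane = 0; lane < 4; ++lane)
    for (int k = 0; k < 4; ++k)
      acc[lane] += std::uint32_t(a[4 * lane + k]) *
                   std::uint32_t(b[4 * lane + k]);
  return acc;
}

// Original pattern: zero-extend every byte to 32 bits, then add-reduce.
std::uint32_t reduce_add_zext(const Bytes16 &v) {
  std::uint32_t sum = 0;
  for (std::uint8_t x : v)
    sum += x;
  return sum;
}

// Rewritten pattern: UDOT of an all-ones vector and the input with a zero
// accumulator, followed by an add-reduction over the four lanes.
std::uint32_t reduce_add_via_udot(const Bytes16 &v) {
  Bytes16 ones;
  ones.fill(1);
  Lanes4 dot = udot_model(Lanes4{}, ones, v);
  return dot[0] + dot[1] + dot[2] + dot[3];
}
```

Both functions return the same value for any 16-byte input; the second form only has four 32-bit lanes left to reduce.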

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

mivnay created this revision. Sep 30 2020, 8:24 AM

Herald added a project: Restricted Project. Sep 30 2020, 8:24 AM

Herald added subscribers: llvm-commits, danielkiss, kristof.beyls.

mivnay requested review of this revision. Sep 30 2020, 8:24 AM

Harbormaster completed remote builds in B73516: Diff 295295. Sep 30 2020, 8:26 AM

mivnay updated this revision to Diff 295303. Sep 30 2020, 8:42 AM

Herald added a subscriber: hiraditya. Sep 30 2020, 8:42 AM

Harbormaster completed remote builds in B73522: Diff 295303. Sep 30 2020, 8:53 AM

dmgreen added a subscriber: dmgreen. Sep 30 2020, 11:40 PM

SjoerdMeijer added a subscriber: SjoerdMeijer. Oct 1 2020, 2:00 AM

mivnay added reviewers: SjoerdMeijer, dmgreen, fhahn. Oct 1 2020, 2:21 AM

I haven't looked in much detail at this patch, but this looks like some straightforward lowering of llvm.experimental.vector.reduce.add. Absolutely nothing wrong with that, but I am curious who's going to produce this intrinsic? The vectoriser, the matrix pass? In other words, any ideas on the bigger picture?

In D88577#2305592, @SjoerdMeijer wrote:

I haven't looked in much detail at this patch, but this looks like some straightforward lowering of llvm.experimental.vector.reduce.add. Absolutely nothing wrong with that, but I am curious who's going to produce this intrinsic? The vectoriser, the matrix pass? In other words, any ideas on the bigger picture?

Thanks for looking into the patch. The pattern added in the lit test is generated from the C code below:

#include <stdint.h>
#include <stdlib.h>

int func(uint8_t *a) {
  int sum = 0;
  for (int i = 0; i < 16; i++) {
    sum += a[i];
  }

  return sum;
}

EDIT: SLP vectorizer generates the pattern in this case.

Hello

Can you update with full context? -U999999. It makes Phabricator reviews easier to follow.

I had thought about this somewhat in reference to in-loop reductions. I had presumed that it would need some form of partial reduction though, as you would want part of the reduction to then happen outside the loop (I think).

Improving codegen on its own is good, but I'm interested in seeing how this fits with the other patches.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10970	We ideally shouldn't be just producing a machine node here. Can you add an AArch64ISD::UDOT node? We should be doing the same for SDOT too.

mivnay edited the summary of this revision. Oct 1 2020, 3:18 AM

In D88577#2305658, @dmgreen wrote:

Hello

Can you update with full context? -U999999. It makes Phabricator reviews easier to follow.

I had thought about this somewhat in reference to in-loop reductions. I had presumed that it would need some form of partial reduction though, as you would want part of the reduction to then happen outside the loop (I think).

Improving codegen on its own is good, but I'm interested in seeing how this fits with the other patches.

Hi,

I am working on the performance related issues mentioned in the bug and meta ticket. I have three unrelated patches (i.e., patterns) for codegen improvements. This is the first one.

Added support for SDOT

mivnay added inline comments. Oct 1 2020, 4:41 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10970	I have added the support for SDOT.
	> Can you add an AArch64ISD::UDOT node?
	Can this be done in a later patch?

mivnay retitled this revision from [AArch64] Generate udot for v16i8 sum reduction to i32 to [AArch64] Generate dot for v16i8 sum reduction to i32. Oct 1 2020, 4:42 AM

mivnay edited the summary of this revision.

EDIT: SLP vectorizer generates the pattern in this case.

Ah I see.

https://llvm.org/docs/Phabricator.html#phabricator-request-review-web has some details on creating diffs with extra context.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10949–10951	Any single statement blocks do not need { } brackets.
10964–10965	This second isScalable check isn't needed, and maybe move the num elements == 16 check up to the VT check, by checking for v16i32 directly?
10970	It's best not to create machine nodes directly in the DAG... Maybe create an int_aarch64_neon_sdot intrinsic node instead? That will avoid the need for a new node, and go through the existing tablegen patterns. It would be nice to be able to fold add(x, UDOT(0, y)) -> UDOT(x, y), which is where having a node will really become useful. That needn't be done here though.
10972	EVT::getVectorVT(*DAG.getContext(), MVT::i32, 4) -> MVT::v4i32. I wasn't aware that you could use getConstant directly on vector types like this. Neat.
10981	Why is this calling ReplaceAllUsesOfValueWith? Is returning the value not enough?
llvm/test/CodeGen/AArch64/neon-dot-product.ll
261	Can you update the tests to show all the instructions? The 1's and 0's and addv's are important here too. I would just use the update_llc_test_checks script.
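
The fold suggested in the inline comment on line 10970, add(x, UDOT(0, y)) -> UDOT(x, y), can be checked with a small scalar sketch. This is an illustration only, not LLVM code: `udot_acc`, `add_then_dot` and `dot_with_acc` are hypothetical names, and `udot_acc` assumes the usual UDOT behaviour of adding four byte products into each 32-bit accumulator lane. Here `y` stands for the pair of byte operands being multiplied.

```cpp
#include <array>
#include <cstdint>

using U8x16 = std::array<std::uint8_t, 16>;
using U32x4 = std::array<std::uint32_t, 4>;

// Scalar stand-in for UDOT: acc[lane] += sum of four unsigned byte products.
U32x4 udot_acc(U32x4 acc, const U8x16 &a, const U8x16 &b) {
  for (int lane = 0; lane < 4; ++lane)
    for (int k = 0; k < 4; ++k)
      acc[lane] += std::uint32_t(a[4 * lane + k]) *
                   std::uint32_t(b[4 * lane + k]);
  return acc;
}

// Unfolded form: add(x, UDOT(0, a, b)).
U32x4 add_then_dot(const U32x4 &x, const U8x16 &a, const U8x16 &b) {
  U32x4 dot = udot_acc(U32x4{}, a, b);
  for (int lane = 0; lane < 4; ++lane)
    dot[lane] += x[lane];
  return dot;
}

// Folded form: UDOT(x, a, b), with x used directly as the accumulator.
U32x4 dot_with_acc(const U32x4 &x, const U8x16 &a, const U8x16 &b) {
  return udot_acc(x, a, b);
}
```

Because unsigned 32-bit addition is associative and wraps modulo 2^32, both forms agree lane by lane even when the accumulator overflows, which is what would make such a fold safe.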

Fixed context and review comments

mivnay marked 7 inline comments as done. Oct 1 2020, 11:14 PM

Harbormaster completed remote builds in B73747: Diff 295737. Oct 1 2020, 11:26 PM

Thanks. LGTM with one minor modification.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11015	EVT::getVectorVT(*DAG.getContext(), MVT::i32, 4) -> MVT::v4i32 :)

This revision is now accepted and ready to land. Oct 2 2020, 12:29 AM

mivnay updated this revision to Diff 295784. Oct 2 2020, 3:47 AM

mivnay marked an inline comment as done.

Harbormaster completed remote builds in B73766: Diff 295784. Oct 2 2020, 3:59 AM

The build failure is unrelated to this patch and happening in other patches as well (Example: https://reviews.llvm.org/D88471). What should I do? Also, I do not have commit access. Can you please help in committing this patch?

In D88577#2308416, @mivnay wrote:

The build failure is unrelated to this patch and happening in other patches as well (Example: https://reviews.llvm.org/D88471). What should I do? Also, I do not have commit access. Can you please help in committing this patch?

Yeah I tend to ignore those failures. I can check it though.

I can certainly commit this. I just need an author string to attribute it correctly. Is "Vinay Madhusudan <vinay@compilertree.com>" OK for that?

I can certainly commit this. I just need an author string to attribute it correctly. Is "Vinay Madhusudan <vinay@compilertree.com>" OK for that?

Yes, thanks a lot.

This revision was landed with ongoing or failed builds. Oct 2 2020, 9:17 AM

Closed by commit rGf19259495628: [AArch64] Generate dot for v16i8 sum reduction to i32 (authored by mivnay, committed by dmgreen).

This revision was automatically updated to reflect the committed changes.

dmgreen added a commit: rGf19259495628: [AArch64] Generate dot for v16i8 sum reduction to i32.

dmgreen mentioned this in D97279: [AArch64] Extend vecreduce -> udot handling to v8i8. Feb 23 2021, 6:11 AM

dmgreen mentioned this in rGa02f5068767a: [AArch64] Extend vecreduce -> udot handling to v8i8. Mar 10 2021, 1:03 PM

Diff 295847

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

[784 lines hidden] AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,

   setTargetDAGCombine(ISD::SELECT);
   setTargetDAGCombine(ISD::VSELECT);

   setTargetDAGCombine(ISD::INTRINSIC_VOID);
   setTargetDAGCombine(ISD::INTRINSIC_W_CHAIN);
   setTargetDAGCombine(ISD::INSERT_VECTOR_ELT);
   setTargetDAGCombine(ISD::EXTRACT_VECTOR_ELT);
+  setTargetDAGCombine(ISD::VECREDUCE_ADD);

   setTargetDAGCombine(ISD::GlobalAddress);

   // In case of strict alignment, avoid an excessive number of byte wide stores.
   MaxStoresPerMemsetOptSize = 8;
   MaxStoresPerMemset = Subtarget->requiresStrictAlign()
                        ? MaxStoresPerMemsetOptSize : 32;

[10,139 lines hidden]
 /// cmge X, X, #0
 static SDValue foldVectorXorShiftIntoCmp(SDNode *N, SelectionDAG &DAG,
                                          const AArch64Subtarget *Subtarget) {
   EVT VT = N->getValueType(0);
   if (!Subtarget->hasNEON() || !VT.isVector())
     return SDValue();

   // There must be a shift right algebraic before the xor, and the xor must be a
   // 'not' operation.
   SDValue Shift = N->getOperand(0);
   SDValue Ones = N->getOperand(1);
   if (Shift.getOpcode() != AArch64ISD::VASHR || !Shift.hasOneUse() ||
       !ISD::isBuildVectorAllOnes(Ones.getNode()))
     return SDValue();

   // The shift should be smearing the sign bit across each vector element.
   auto *ShiftAmt = dyn_cast<ConstantSDNode>(Shift.getOperand(1));
   EVT ShiftEltTy = Shift.getValueType().getVectorElementType();
   if (!ShiftAmt || ShiftAmt->getZExtValue() != ShiftEltTy.getSizeInBits() - 1)
     return SDValue();

   return DAG.getNode(AArch64ISD::CMGEz, SDLoc(N), VT, Shift.getOperand(0));
 }

 // Generate SUBS and CSEL for integer abs.
 static SDValue performIntegerAbsCombine(SDNode *N, SelectionDAG &DAG) {
   EVT VT = N->getValueType(0);

   SDValue N0 = N->getOperand(0);
   SDValue N1 = N->getOperand(1);
   SDLoc DL(N);

   // Check pattern of XOR(ADD(X,Y), Y) where Y is SRA(X, size(X)-1)
   // and change it to SUB and CSEL.
   if (VT.isInteger() && N->getOpcode() == ISD::XOR &&
       N0.getOpcode() == ISD::ADD && N0.getOperand(1) == N1 &&
       N1.getOpcode() == ISD::SRA && N1.getOperand(0) == N0.getOperand(0))
     if (ConstantSDNode *Y1C = dyn_cast<ConstantSDNode>(N1.getOperand(1)))
       if (Y1C->getAPIntValue() == VT.getSizeInBits() - 1) {
         SDValue Neg = DAG.getNode(ISD::SUB, DL, VT, DAG.getConstant(0, DL, VT),
                                   N0.getOperand(0));
         // Generate SUBS & CSEL.
         SDValue Cmp =
             DAG.getNode(AArch64ISD::SUBS, DL, DAG.getVTList(VT, MVT::i32),
                         N0.getOperand(0), DAG.getConstant(0, DL, VT));
         return DAG.getNode(AArch64ISD::CSEL, DL, VT, N0.getOperand(0), Neg,
                            DAG.getConstant(AArch64CC::PL, DL, MVT::i32),
                            SDValue(Cmp.getNode(), 1));
       }
   return SDValue();
 }

+// VECREDUCE_ADD( EXTEND(v16i8_type) ) to
+// VECREDUCE_ADD( DOTv16i8(v16i8_type) )
+static SDValue performVecReduceAddCombine(SDNode *N, SelectionDAG &DAG,
+                                          const AArch64Subtarget *ST) {
+  SDValue Op0 = N->getOperand(0);
+  if (!ST->hasDotProd() || N->getValueType(0) != MVT::i32)
+    return SDValue();
+
+  if (Op0.getValueType().getVectorElementType() != MVT::i32)
+    return SDValue();
+
+  unsigned ExtOpcode = Op0.getOpcode();
+  if (ExtOpcode != ISD::ZERO_EXTEND && ExtOpcode != ISD::SIGN_EXTEND)
+    return SDValue();
+
+  EVT Op0VT = Op0.getOperand(0).getValueType();
+  if (Op0VT != MVT::v16i8)
+    return SDValue();
+
+  SDLoc DL(Op0);
+  SDValue Ones = DAG.getConstant(1, DL, Op0VT);
+  SDValue Zeros = DAG.getConstant(0, DL, MVT::v4i32);
+  auto DotIntrisic = (ExtOpcode == ISD::ZERO_EXTEND)
+                         ? Intrinsic::aarch64_neon_udot
+                         : Intrinsic::aarch64_neon_sdot;
+  SDValue Dot = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, Zeros.getValueType(),
+                            DAG.getConstant(DotIntrisic, DL, MVT::i32), Zeros,
+                            Ones, Op0.getOperand(0));
+  return DAG.getNode(ISD::VECREDUCE_ADD, DL, N->getValueType(0), Dot);
+}

 static SDValue performXorCombine(SDNode *N, SelectionDAG &DAG,
                                  TargetLowering::DAGCombinerInfo &DCI,
                                  const AArch64Subtarget *Subtarget) {
   if (DCI.isBeforeLegalizeOps())
     return SDValue();

   if (SDValue Cmp = foldVectorXorShiftIntoCmp(N, DAG, Subtarget))
     return Cmp;
[3,666 lines hidden] SDValue AArch64TargetLowering::PerformDAGCombine(SDNode *N,
   case AArch64ISD::NVCAST:
     return performNVCASTCombine(N);
   case AArch64ISD::UZP1:
     return performUzpCombine(N, DAG);
   case ISD::INSERT_VECTOR_ELT:
     return performPostLD1Combine(N, DCI, true);
   case ISD::EXTRACT_VECTOR_ELT:
     return performExtractVectorEltCombine(N, DAG);
+  case ISD::VECREDUCE_ADD:
+    return performVecReduceAddCombine(N, DCI.DAG, Subtarget);
   case ISD::INTRINSIC_VOID:
   case ISD::INTRINSIC_W_CHAIN:
     switch (cast<ConstantSDNode>(N->getOperand(1))->getZExtValue()) {
     case Intrinsic::aarch64_sve_prfb_gather_scalar_offset:
       return combineSVEPrefetchVecBaseImmOff(N, DAG, 1 /*=ScalarSizeInBytes*/);
     case Intrinsic::aarch64_sve_prfh_gather_scalar_offset:
       return combineSVEPrefetchVecBaseImmOff(N, DAG, 2 /*=ScalarSizeInBytes*/);
     case Intrinsic::aarch64_sve_prfw_gather_scalar_offset:
[1,464 lines hidden]
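
The new combine in this file selects udot when the reduced operand is a zero-extend and sdot when it is a sign-extend. The scalar sketch below is a hypothetical model of why the signedness has to match, not code from the patch: `ExtendKind`, `reduce_extended` and `reduce_via_dot` are made-up names, and the dot product is modelled as four 32-bit lanes that each add four byte values multiplied by the constant 1.

```cpp
#include <array>
#include <cstdint>

enum class ExtendKind { Zero, Sign };

using RawBytes = std::array<std::uint8_t, 16>;

// Widen one raw byte to 32 bits, either as unsigned or as signed.
std::int32_t widen(std::uint8_t raw, ExtendKind kind) {
  return kind == ExtendKind::Sign ? std::int32_t(std::int8_t(raw))
                                  : std::int32_t(raw);
}

// Reference: extend all sixteen bytes, then add-reduce.
std::int32_t reduce_extended(const RawBytes &v, ExtendKind kind) {
  std::int32_t sum = 0;
  for (std::uint8_t raw : v)
    sum += widen(raw, kind);
  return sum;
}

// Model of the rewritten form: a dot product against all-ones into four
// zero-initialised lanes (udot for Zero, sdot for Sign), then a reduction
// over the four lanes.
std::int32_t reduce_via_dot(const RawBytes &v, ExtendKind kind) {
  std::array<std::int32_t, 4> lanes{};
  for (int i = 0; i < 16; ++i)
    lanes[i / 4] += 1 * widen(v[i], kind);
  return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

For sixteen bytes of 0xFF, the unsigned variant sums to 4080 while the signed variant sums to -16, so picking the dot instruction from the extend opcode is what keeps the rewritten reduction equal to the original one.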

llvm/test/CodeGen/AArch64/neon-dot-product.ll

[249 lines hidden] ; CHECK: udot {{v[0-9]+}}.4s, {{v[0-9]+}}.16b, {{v[0-9]+}}.16b
   %4 = load <16 x i8>, <16 x i8>* %3
   %5 = zext <16 x i8> %4 to <16 x i32>
   %6 = mul nuw nsw <16 x i32> %5, %2
   %7 = call i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32> %6)
   %op.extra = add i32 %7, %sum
   ret i32 %op.extra
 }

+define i32 @test_udot_v16i8_2(i8* nocapture readonly %a1) {
+; CHECK-LABEL: test_udot_v16i8_2:
+; CHECK: movi {{v[0-9]+}}.16b, #1
+; CHECK: movi {{v[0-9]+}}.2d, #0000000000000000
+; CHECK: udot {{v[0-9]+}}.4s, {{v[0-9]+}}.16b, {{v[0-9]+}}.16b
+; CHECK: addv s0, {{v[0-9]+}}.4s
+entry:
+  %0 = bitcast i8* %a1 to <16 x i8>*
+  %1 = load <16 x i8>, <16 x i8>* %0
+  %2 = zext <16 x i8> %1 to <16 x i32>
+  %3 = call i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32> %2)
+  ret i32 %3
+}
+
 define i32 @test_sdot_v16i8(i8* nocapture readonly %a, i8* nocapture readonly %b, i32 %sum) {
 entry:
 ; CHECK-LABEL: test_sdot_v16i8:
 ; CHECK: sdot {{v[0-9]+}}.4s, {{v[0-9]+}}.16b, {{v[0-9]+}}.16b
   %0 = bitcast i8* %a to <16 x i8>*
   %1 = load <16 x i8>, <16 x i8>* %0
   %2 = sext <16 x i8> %1 to <16 x i32>
   %3 = bitcast i8* %b to <16 x i8>*
   %4 = load <16 x i8>, <16 x i8>* %3
   %5 = sext <16 x i8> %4 to <16 x i32>
   %6 = mul nsw <16 x i32> %5, %2
   %7 = call i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32> %6)
   %op.extra = add nsw i32 %7, %sum
   ret i32 %op.extra
 }

+define i32 @test_sdot_v16i8_2(i8* nocapture readonly %a1) {
+; CHECK-LABEL: test_sdot_v16i8_2:
+; CHECK: movi {{v[0-9]+}}.16b, #1
+; CHECK: movi {{v[0-9]+}}.2d, #0000000000000000
+; CHECK: sdot {{v[0-9]+}}.4s, {{v[0-9]+}}.16b, {{v[0-9]+}}.16b
+; CHECK: addv s0, {{v[0-9]+}}.4s
+entry:
+  %0 = bitcast i8* %a1 to <16 x i8>*
+  %1 = load <16 x i8>, <16 x i8>* %0
+  %2 = sext <16 x i8> %1 to <16 x i32>
+  %3 = call i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32> %2)
+  ret i32 %3
+}

This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Generate dot for v16i8 sum reduction to i32
ClosedPublic
