Differential D67645
[aarch64] add def-pats for dot product
Authored by sebpop on Sep 16 2019, 7:24 PM.

Details

This patch adds the patterns to select the dot product instructions.
Tested on aarch64-linux with make check-all.
Comment Actions
I've got a cheeky request, and I appreciate that it should go in a separate patch, but while you're at it, would you mind repeating this exercise for the ARM backend and AArch32?

Comment Actions
The new patch does not use the first argument of the dot product instruction: we now set it to zero (a rough C model of that accumulator operand follows below).

Comment Actions
Many thanks for addressing this!
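For context on that first argument, here is a rough C model of what SDOT Vd.4S, Vn.16B, Vm.16B computes. The function name is illustrative and not from the patch, but the per-lane arithmetic follows the ARM documentation:

  #include <stdint.h>

  /* Rough C model of SDOT Vd.4S, Vn.16B, Vm.16B: the first operand vd
     is a per-lane accumulator; each 32-bit lane receives the sum of
     four signed byte products. Zeroing vd turns the accumulate into a
     plain dot product. */
  void sdot_model(int32_t vd[4], const int8_t vn[16], const int8_t vm[16]) {
    for (int i = 0; i < 4; i++)
      for (int j = 0; j < 4; j++)
        vd[i] += (int32_t)vn[4 * i + j] * (int32_t)vm[4 * i + j];
  }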
Comment Actions
I chose 30 as it seemed to work: the rest of the file had complexities <= 20.
I'll let you decide if clang-format is the best at tablegen ;-)

// dot_v4i8
class mul_v4i8<SDPatternOperator ldop>
    : PatFrag<(ops node : $Rn, node : $Rm, node : $offset),
              (mul(ldop(add node : $Rn, node : $offset)),
                  (ldop(add node : $Rm, node : $offset)))>;
class mulz_v4i8<SDPatternOperator ldop>
    : PatFrag<(ops node : $Rn, node : $Rm),
              (mul(ldop node : $Rn), (ldop node : $Rm))>;
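For reference, the scalar shape these fragments are meant to match looks roughly like the following C. This is a sketch only: the full dot_v4i8 pattern in the patch presumably sums four such products, with mulz_v4i8 covering the zero-offset case.

  /* Sketch of the v4i8 source idiom: widening i8 loads from matching
     offsets in both arrays, multiplied pairwise and accumulated. */
  int dot_v4i8_c(const char *a, const char *b, int sum) {
    sum += a[0] * b[0];   /* mulz_v4i8: no offset */
    sum += a[1] * b[1];   /* mul_v4i8 with offset 1 */
    sum += a[2] * b[2];   /* mul_v4i8 with offset 2 */
    sum += a[3] * b[3];   /* mul_v4i8 with offset 3 */
    return sum;
  }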
Sure, there may be some patterns that are not yet covered by the def-pats.
Please suggest better names; I'm fine with changing them.

Comment Actions
To catch more dot product cases, we need to fix the passes above instruction selection. I looked at the basic dot product loop:

int dot_product1(char *a, char *b, int sum) {
  for (int i = 0; i < 16 * K; i += 1)
    sum += a[i] * b[i];
  return sum;
}

for different values of K:
Looks like if we want to catch more dot product patterns, we'll need to fix the SLP and loop vectorizers. I am also looking at some code that comes from TVM, a higher-level compiler that generates LLVM IR.
Comment Actions
Yep, this is actually what I was expecting. That is, I don't see a problem with the pattern matching here if it catches a few cases and helps you. But yes, to do the heavy lifting, this is probably a task for the loop vectorizer.

Comment Actions
There are a few things missing from the current work, such as the indexed dot product, or what the ARM document calls s/udot (vector, by element); no need to do it now, but a comment about that would help. There is also the SVE dot product; we need to port this code to SVE. I agree that this work will miss many opportunities, and that the middle end will optimize the code in such a way that the pattern matching does not fire. I think the dot product needs its own pass or subpass in the middle end. I see three places where it can be done:

1) early, before the vectorizer, in the same way we recognize min/max/etc., or
2) within the vectorizer, as the dot product is mainly a vectorization problem, or
3) after the vectorizer, similar to the handling of SIMD interleaved loads/stores.

The third option, while not the best, has the most chance of being accepted given that it is less disruptive. Any thoughts on this, as we may contribute to this effort in the future?

Comment Actions
You are right, it is not the task of instruction selection to vectorize the code:

Comment Actions
I added a comment in the patch.
Same here, I added a FIXME comment.
I think I like your solution 2, and I think pre- and post-vectorizer passes would work as well. In the SnapDragon compiler we used to generate ARM builtins/intrinsics directly from the vectorizer. Similarly, for the dot product we may want to generate a target-independent builtin.

Comment Actions
Yep, I think we need to generate those reduction intrinsics, which we can then lower/instruction-select. I don't think there's anything controversial about that; intrinsics get generated in a lot of different cases. Are you planning on doing that now and turning your attention to the vectorizer? That would make this work obsolete when it is ready.
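As a sketch of what such a target-independent mul-plus-reduce form can look like at the source level (illustrative only, using clang vector extensions; __builtin_reduce_add lowers to the generic vector-reduce add intrinsic and needs a reasonably recent clang):

  #include <stdint.h>

  typedef int8_t  v16i8  __attribute__((vector_size(16)));
  typedef int32_t v16i32 __attribute__((vector_size(64)));

  /* Widen both i8 vectors to i32 lanes, multiply elementwise, then
     reduce; the reduction intrinsic is what a backend could pattern
     match (or directly lower) to sdot. */
  int32_t dot16(v16i8 a, v16i8 b, int32_t sum) {
    v16i32 wa = __builtin_convertvector(a, v16i32);
    v16i32 wb = __builtin_convertvector(b, v16i32);
    return sum + __builtin_reduce_add(wa * wb);
  }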
Comment Actions
I looked at both the SLP and loop vectorizer, and I think this is more work than I can do right now. Not at this time.
Instruction selection and vectorizers are orthogonal.
Comment Actions
Well, okay, sure... but depending on what tricks the vectoriser does, its output can be different, and thus the input to instruction selection can be different, triggering different instruction selection patterns.
So, if the vectoriser for example emits dot product intrinsics, these patterns won't trigger, and then they are, say, dead code, if you see what I mean; but please correct me if I am wrong. At the same time, as I also said before, if this helps a few cases now, I don't see what's wrong with a nice little bit of pattern matching.

Comment Actions
What if the code is written with intrinsics, but using mul and reduce (say, similar to the last test in this patch; see the sketch below)? Then this patch will optimize that into a dot product instruction. So, for legacy code that was written with the old intrinsics, this patch will remain useful even after the dot product is implemented in the vectorizer. Note that if somebody writes new code with intrinsics using mul and reduce instead of dot product, they are probably doing that for a reason and will want it to stay that way (they can then use the right flags to disable dot product).

Comment Actions
Yes, looks reasonable to me.
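For concreteness, the "mul and reduce" intrinsic idiom mentioned above might look like this classic pre-dotprod NEON sequence (a sketch, not a test taken from the patch):

  #include <arm_neon.h>

  /* Classic pre-dotprod idiom: widening multiply, pairwise accumulate,
     then a cross-lane add. A pattern of the kind added in this patch
     could, in principle, rewrite this shape into a single sdot. */
  int32_t dot16_neon(const int8_t *a, const int8_t *b, int32_t sum) {
    int8x16_t va = vld1q_s8(a);
    int8x16_t vb = vld1q_s8(b);
    int16x8_t lo = vmull_s8(vget_low_s8(va), vget_low_s8(vb));
    int16x8_t hi = vmull_s8(vget_high_s8(va), vget_high_s8(vb));
    int32x4_t acc = vdupq_n_s32(0);
    acc = vpadalq_s16(acc, lo);
    acc = vpadalq_s16(acc, hi);
    return sum + vaddvq_s32(acc);
  }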
Comment Actions
I think there is an error in this case: as we duplicate the original value $Vo across the 4 lanes of the dot product accumulator, and then at the end do the ADDV reduction across lanes, we end up with 4 times the original value.
I will prepare an updated patch to fix this.
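To see where the factor of 4 comes from, here is a worked sketch with a made-up value, where s stands for the original scalar $Vo and the per-lane dot products are taken as 0 for simplicity:

  #include <assert.h>
  #include <stdint.h>

  int main(void) {
    int32_t s = 7;                   /* original scalar value, i.e. $Vo */
    int32_t lane[4] = {s, s, s, s};  /* buggy: s duplicated per lane */

    /* The final ADDV reduction sums the four lanes... */
    int32_t addv = lane[0] + lane[1] + lane[2] + lane[3];
    assert(addv == 4 * s);           /* ...yielding 4*s instead of s */

    /* Fix: zero the accumulator lanes and add s once, outside the
       reduction (which is what zeroing the first sdot operand enables). */
    int32_t zero[4] = {0, 0, 0, 0};
    int32_t fixed = s + (zero[0] + zero[1] + zero[2] + zero[3]);
    assert(fixed == s);
    return 0;
  }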