This is an archive of the discontinued LLVM Phabricator instance.

[mlir][sparse] introduce vectorization pass for sparse loops
ClosedPublic

Authored by aartbik on Nov 17 2022, 1:50 PM.

Details

Summary

This brings back previous SIMD functionality, but in a separate pass.
The idea is to improve this new pass incrementally, going beyond for-loops
to while-loops for co-iteration as well (masking), while introducing new
abstractions to make the lowering more progressive. The separation of
sparsification and vectorization is a very good first step on this journey.

Also brings back ArmSVE support

Still to be fine-tuned:

+ use of "index" in SIMD loop (viz. a[i] = i)
+ check that all ops really have SIMD support
+ check all forms of reductions
+ chain reduction SIMD values

Diff Detail

Event Timeline

aartbik created this revision.Nov 17 2022, 1:50 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 17 2022, 1:50 PM
aartbik requested review of this revision.Nov 17 2022, 1:50 PM
Peiming added inline comments.Nov 17 2022, 2:20 PM
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
297

You can simply do a clone(def) here and update the operand to vx in place

336

nit: static?

345

nit: successful

377

So, this assumes that the reduction value is the first value, which is, of course, valid.

But will it be too fragile against changes? How about adding a static function to query the index, and adding an assertion there to make sure the index is always valid?

e.g., in sparsification, we can have something like
assert(codegen.redVal == block.getArgument(getReductionValueArgumentIndex()))
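The suggested pattern could be sketched as follows in plain C++; the names (`getReductionValueArgumentIndex`, the `Block` stand-in) are hypothetical illustrations, not the actual MLIR API:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: centralize the "reduction value comes first"
// convention behind one static accessor, so both the producer
// (sparsification) and the consumer (vectorization) agree on the index.
static constexpr unsigned getReductionValueArgumentIndex() { return 0; }

// Minimal stand-in for an MLIR block's argument list.
struct Block {
  std::vector<int> arguments;
  int getArgument(unsigned i) const { return arguments[i]; }
};

int lookupReductionValue(const Block &block, int redVal) {
  // The assertion documents (and enforces) the convention instead of
  // relying on a bare literal 0 scattered through the code.
  assert(block.getArgument(getReductionValueArgumentIndex()) == redVal &&
         "reduction value is expected to be the first block argument");
  return block.getArgument(getReductionValueArgumentIndex());
}
```

The point of the accessor is that a future change to the argument layout only has to touch one function, and the assertion fails loudly if producer and consumer fall out of sync.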

433–434

Any reason why you reuse the function for analysis/codegen instead of having two functions? Wouldn't that be clearer?

Peiming added a comment.EditedNov 17 2022, 2:45 PM

A general question (from a newbie in vectorization):

Is it possible to always start vectorization from the innermost for-loop? If so, you could chain the for-loop vectorization, as every inner loop will return a vector value once it has been vectorized?

It seems to me that the current implementation only vectorizes loops without dependences between reduction values?

It seems to me that the current implementation only vectorizes loops without dependences between reduction values?

Yes, that is in the TBD (see: chain reduction SIMD values).
There is even a test for this that I plan to bring back in the follow ups:

CHECK: %[[VAL_65:.*]] = vector.insertelement %[[VAL_66:.*]]#2, %[[VAL_3]]{{\[}}%[[VAL_6]] : index] : vector<8xf64>
CHECK: %[[VAL_67:.*]] = scf.for %[[VAL_68:.*]] = %[[VAL_66]]#0 to %[[VAL_22]] step %[[VAL_4]] iter_args(%[[VAL_69:.*]] = %[[VAL_65]]) -> (vector<8xf64>) {
CHECK: %[[VAL_70:.*]] = affine.min #map(%[[VAL_22]], %[[VAL_68]])
CHECK: %[[VAL_71:.*]] = vector.create_mask %[[VAL_70]] : vector<8xi1>
CHECK: %[[VAL_72:.*]] = vector.maskedload %[[VAL_11]]{{\[}}%[[VAL_68]]], %[[VAL_71]], %[[VAL_3]] : memref<?xf64>, vector<8xi1>, vector<8xf64> into vector<8xf64>
CHECK: %[[VAL_73:.*]] = arith.addf %[[VAL_69]], %[[VAL_72]] : vector<8xf64>
CHECK: %[[VAL_74:.*]] = arith.select %[[VAL_71]], %[[VAL_73]], %[[VAL_69]] : vector<8xi1>, vector<8xf64>
CHECK: scf.yield %[[VAL_74]] : vector<8xf64>
CHECK: }
CHECK: %[[VAL_75:.*]] = scf.for %[[VAL_76:.*]] = %[[VAL_66]]#1 to %[[VAL_25]] step %[[VAL_4]] iter_args(%[[VAL_77:.*]] = %[[VAL_78:.*]]) -> (vector<8xf64>) {
CHECK: %[[VAL_79:.*]] = affine.min #map(%[[VAL_25]], %[[VAL_76]])
CHECK: %[[VAL_80:.*]] = vector.create_mask %[[VAL_79]] : vector<8xi1>
CHECK: %[[VAL_81:.*]] = vector.maskedload %[[VAL_14]]{{\[}}%[[VAL_76]]], %[[VAL_80]], %[[VAL_3]] : memref<?xf64>, vector<8xi1>, vector<8xf64> into vector<8xf64>
CHECK: %[[VAL_82:.*]] = arith.addf %[[VAL_77]], %[[VAL_81]] : vector<8xf64>
CHECK: %[[VAL_83:.*]] = arith.select %[[VAL_80]], %[[VAL_82]], %[[VAL_77]] : vector<8xi1>, vector<8xf64>
CHECK: scf.yield %[[VAL_83]] : vector<8xf64>
CHECK: }
CHECK: %[[VAL_84:.*]] = vector.reduction <add>, %[[VAL_85:.*]] : vector<8xf64> into f64
CHECK: scf.yield %[[VAL_84]] : f64
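For readers unfamiliar with the masking idiom in the CHECK lines above (vector.create_mask / vector.maskedload / arith.select), a scalar C++ simulation of the semantics of one masked sum reduction might look like this; it is an illustrative model, not MLIR code:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <numeric>
#include <vector>

// Simulates a vectorized sum reduction over a buffer whose length need
// not be a multiple of the vector width: partial vectors are handled
// with a lane mask instead of a scalar cleanup loop.
constexpr std::size_t kVL = 8; // models vector<8xf64>

double maskedSumReduce(const std::vector<double> &buf) {
  std::array<double, kVL> acc{}; // models the iter_args vector, init 0
  for (std::size_t i = 0; i < buf.size(); i += kVL) {
    // models affine.min + vector.create_mask: active lanes this step
    std::size_t active = std::min(kVL, buf.size() - i);
    for (std::size_t l = 0; l < kVL; ++l) {
      bool mask = l < active;             // vector.create_mask
      double v = mask ? buf[i + l] : 0.0; // vector.maskedload (pass-through 0)
      double sum = acc[l] + v;            // arith.addf
      acc[l] = mask ? sum : acc[l];       // arith.select keeps inactive lanes
    }
  }
  // models vector.reduction <add>: horizontal sum of the lanes
  return std::accumulate(acc.begin(), acc.end(), 0.0);
}
```

The arith.select is what makes the tail correct: inactive lanes keep their previous accumulator value rather than absorbing the pass-through.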

Hi Aart,

Only high-level comments; I didn't have time to dig into the meat of the pass yet (and my review is by no means blocking :P)

Cheers,
-Quentin

mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.h
175

Maybe add a comment saying that the additional arguments are described as options in the .td file.

mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.td
273

What should we set as vector length for something that depends on the actual type?

E.g., if the vector unit is 128-bit wide, we would have 2x for i64, 4x for i32, 8x for i16.

We don't need to address that in this patch and in particular, maybe the goal here is to vectorize as much as possible and let the actual backend split things up later.
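The relationship Quentin describes is simple to state: for a fixed-width SIMD unit, the lane count follows from the element width. A trivial helper makes it concrete (purely illustrative; the pass itself takes vl as a user option and does not compute this):

```cpp
#include <cassert>

// Number of lanes a unitBits-wide vector unit provides for elements of
// elemBits bits. E.g. a 128-bit unit: 2 lanes of i64, 4 of i32, 8 of i16.
constexpr unsigned lanesFor(unsigned unitBits, unsigned elemBits) {
  return unitBits / elemBits;
}

static_assert(lanesFor(128, 64) == 2, "i64 on a 128-bit SIMD unit");
static_assert(lanesFor(128, 32) == 4, "i32 on a 128-bit SIMD unit");
static_assert(lanesFor(128, 16) == 8, "i16 on a 128-bit SIMD unit");
```

A single vl flag therefore cannot be "right" for every element type at once, which is why deferring the split to the backend is a reasonable policy.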

mlir/lib/Dialect/SparseTensor/Transforms/SparseTensorPasses.cpp
73

I see that this moves in the vectorization pass.

Couple of questions:

  • Could this be a problem for users of the SparsificationPass?
  • Should we run this as part of the vectorization pass at all?

Essentially, where I'm going with this is: should we let users decide when and where the canonicalization happens, instead of appending it to the vectorization pass (whereas it ran after the sparsification pass before)?

mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
42

You may be able to use getConstantIntValue instead.

aartbik marked 6 inline comments as done.Nov 17 2022, 6:05 PM
aartbik added inline comments.
mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.h
175

I don't mind adding such a comment, but if you look in this file, all passes are defined in the .td file and have their options defined as parameters here. So if I add something, we would need to document them all. WDYT?

mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.td
273

That is a good question, and very fundamental to the vector dialect. Note that by having the vl=xxx flag, we provide the mechanism (doing it) but not the policy (picking it).

In general, you want the VL to match the SIMD length, or a bit more. E.g., picking vl=8 on a 4-way SIMD unit will result in a vector loop with two iterations "unrolled" into registers.

See this very old thread I started with experimental docs that show the generated AVX512 code

https://discourse.llvm.org/t/case-study-docs-on-vector-dialect-cpu-codegen/1674

mlir/lib/Dialect/SparseTensor/Transforms/SparseTensorPasses.cpp
73

No problem, we never see vector code out of this pass. It was a leftover from when we removed vectorization from the sparsification pass. The V2V should have been removed at that time too.

mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
42

Ha! Let me try. Because this was an index, some of our other utilities did not work, but your suggestion is nice and short.

297

Yeah, I was looking for a more "reflective" way.
But will clone() work? Does that make a shallow or deep copy?

377

Yeah, good suggestion. Made it less brittle.

433–434

Because so much of the analysis logic is repeated in codegen, this way it is a lot less code ;-)

Matt added a subscriber: Matt.Nov 17 2022, 8:27 PM
Peiming added inline comments.Nov 17 2022, 9:49 PM
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
433–434

Okay, it seems that what you really need is a "dataflow visitor", but it seems that MLIR does not provide one.

Peiming added inline comments.Nov 17 2022, 9:51 PM
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
297

I believe that it makes a deep copy.

Peiming added inline comments.Nov 17 2022, 9:56 PM
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
297

Actually it might be a little bit ambitious here to say deep copy.

It clones the operation itself, plus all the regions in it, but the operands will remain the same.

You can then call updateRootInPlace to set the operands to other SSA values.

That was quick! Thanks, Aart! Just some minor comments. Other than that, LGTM!

mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.td
277

Is this mostly for gathers indices?

mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
140

We may have some utilities in the Linalg vectorizer that could be useful here to detect and get the combining operation.

430

Would this bail out if the loop is an outer loop?

Peiming added inline comments.Nov 18 2022, 9:05 AM
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
297

a little bit *ambiguous*

aartbik marked 4 inline comments as done.Nov 18 2022, 9:18 AM
aartbik added inline comments.
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
297

But then it is way too much redundant work, right? I would rather keep this macro logic to just create what we need. Ideally I would like to have some reflection flavor like

vexp = rewriter.createUnOp(op->kind, loc, vx);

Do we have something like that?

Thanks for working on this, Aart!

This makes sense. I've left a few inline comments, but these are just nits, typos or basic questions from an aspiring "vectorization" ninja :) Definitely non-blocking!

One thing that's not clear is how enable-simd-index32=true affects the generated code. I'm comparing output for CHECK-VEC2 and CHECK-VEC3 and they look identical (func @add_dense) 🤔 . Also, would it make sense to test other values of vl (currently it's always vl=16)?

mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
34

[nit] Would it make sense to document what the unit of vectorLength is? (i.e. bits, bytes, number of elements?)

96

What's lo and hi in this context?

143

For those of us who have no access to this book, would you be able to add a few more comments? In particular, could you explain the difference between r and rd? Thank you :)

223
334–335

Suggesting this as it's not immediately obvious what "first pass" means when looking at this method in isolation (it becomes obvious when it's being called).

348
360

[nit] Wouldn't forOpNew be more descriptive? forOp2 might suggest there are 2 "for loops" to begin with.

It would also be good to set up the expectations for this vectorizer somewhere in the documentation to make sure nobody misses that this is a transition step towards Linalg.

qcolombet added inline comments.Nov 18 2022, 10:38 AM
mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.h
175

If the canonical way is to have the documentation in the td file, what works for me :).

aartbik marked 11 inline comments as done.Nov 18 2022, 11:43 AM
aartbik added inline comments.
mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.h
175

Thanks!

mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.td
273

I added a comment in the doc. See if that makes sense.

277

yes, added that specifically in the option description

mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
34

Added more doc to struct

42

it works!

96

Documented

140

Can you point me to any? None jumped out (for something this small).

143

We all should have access to the book! But documented

297

I played around with some cloning, but nothing looks as concise as the current macro magic (which is also the most efficient way to get to the new IR). So I am leaving this as a follow-up refinement, if you are okay with that.

360

Good suggestion. My naming is always too short ;-)

430

It does since it will fail on the inner forOp. But is there a quicker way to test this?

aartbik updated this revision to Diff 476574.Nov 18 2022, 12:52 PM
aartbik marked 8 inline comments as done.

addressed comments, rebased with main

aartbik updated this revision to Diff 476594.Nov 18 2022, 1:47 PM

dotted the i's for ARMSVE

aartbik edited the summary of this revision. (Show Details)Nov 18 2022, 1:52 PM
aartbik updated this revision to Diff 476616.Nov 18 2022, 3:19 PM

clang format


Great to see this! Also, many thanks for all the additional comments :)

mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
285
397
407–408

Shouldn't the pass-through value be equal to whatever is the neutral value for the combining operation? (i.e. 1 for multiplication and 0 for addition)

mlir/test/Dialect/SparseTensor/sparse_vector.mlir
2–9

You could try more descriptive prefixes:

  • CHECK-VEC1 --> CHECK-SCALAR
  • CHECK-VEC2 --> CHECK-VEC16
  • CHECK-VEC3 --> CHECK-VEC16-32IDX
  • CHECK-VEC4 --> CHECK-VEC4-SCALABLE

You could also just drop CHECK to make it a bit concise.

This is clear either way, so feel free to ignore.

aartbik marked 2 inline comments as done.Nov 21 2022, 8:53 AM
aartbik added inline comments.
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
407–408

Ah, that is possible too, but by using the inits shown, you can save one op in the end.
E.g., a sum reduction could start with
E.g. a sum reduction could start with

v = [0,...0]
simd loop
t = sum_reduce v
r = r + t

or you can do

v = [0,...r]
simd loop
r = sum_reduce v
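The saving Aart describes can be checked with a small scalar model (illustrative only, names are mine): seeding one lane of the initial vector with the incoming reduction value r makes the trailing scalar add unnecessary, since the horizontal reduce folds r in.

```cpp
#include <array>
#include <cstddef>
#include <numeric>
#include <vector>

constexpr std::size_t kVL = 8;

// Variant 1: v = [0,...,0]; horizontally reduce, then add r afterwards.
double reduceThenAdd(double r, const std::vector<double> &xs) {
  std::array<double, kVL> v{}; // all zeros
  for (std::size_t i = 0; i < xs.size(); ++i)
    v[i % kVL] += xs[i];
  double t = std::accumulate(v.begin(), v.end(), 0.0); // t = sum_reduce v
  return r + t;                                        // extra scalar op
}

// Variant 2: seed a lane with r; the horizontal reduce already folds it in.
double seededReduce(double r, const std::vector<double> &xs) {
  std::array<double, kVL> v{};
  v[0] = 0.0 + r; // models vector.insertelement of the incoming value
  for (std::size_t i = 0; i < xs.size(); ++i)
    v[i % kVL] += xs[i];
  return std::accumulate(v.begin(), v.end(), 0.0); // no trailing add
}
```

Both variants compute the same result; the second simply emits one fewer scalar op after the loop, which is what the insertelement in the CHECK lines earlier in this review is doing.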

mlir/test/Dialect/SparseTensor/sparse_vector.mlir
2–9

This is a good suggestion, done.

aartbik updated this revision to Diff 476928.Nov 21 2022, 9:23 AM
aartbik marked 2 inline comments as done.

addressed comments, rebased with main

dcaballe accepted this revision.Nov 21 2022, 3:07 PM

Thanks, Aart! LGTM

This revision is now accepted and ready to land.Nov 21 2022, 3:07 PM
This revision was landed with ongoing or failed builds.Nov 21 2022, 4:12 PM
This revision was automatically updated to reflect the committed changes.
qcolombet added inline comments.Nov 21 2022, 4:34 PM
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
76

You could use getConstantIntValue here as well.

91

Nit: it took me some time to parse what the min was doing, because the order of the expressions and the ValueRange don't match.
I.e., the expression has min(symbol, dim0 - dim1) (i.e., symbol, dim0, dim1) while the ValueRange has dim0, dim1, symbol.

Maybe that's just me not being used to dealing with affine builders (where we know that dimensions appear first) :).

Feel free to ignore.

113

I feel I'm missing something obvious, which may indicate a need for a comment :).
How do we know that at most the last index will be a VectorType?

Couldn't we technically end up with a VectorType for several of the indices?

Going by what happens in vectorizeSubscripts, I don't see anything that would prevent generating more than one vload.

aartbik marked 3 inline comments as done.Nov 21 2022, 6:29 PM
aartbik added inline comments.
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
76

Good point.

113

Indeed, if you have to ask, it needs a comment. Coming up!

awarzynski added inline comments.Nov 22 2022, 8:01 AM
mlir/lib/Dialect/SparseTensor/Transforms/SparseVectorization.cpp
407–408

Ah, I see! I got confused because genVectorInvariantValue creates:

%25 = "vector.broadcast"(%arg4) : (f32) -> vector<[4]xf32>

so v = [r, ..., r] rather than v = [0, ..., r].

But then vpass is effectively replaced with %6 = vector.insertelement %3, %cst[%c0 : index] : vector<[4]xf32>, so the code is correct.