This is an archive of the discontinued LLVM Phabricator instance.

[mlir] [sparse] start of sparse tensor compiler support
ClosedPublic

Authored by aartbik on Nov 6 2020, 5:59 PM.

Details

Summary

As discussed in https://llvm.discourse.group/t/mlir-support-for-sparse-tensors/2020
this CL is the start of sparse tensor compiler support in MLIR. Starting with a
"dense" kernel expressed in the Linalg dialect together with per-dimension
sparsity annotations on the tensors, the compiler automatically lowers the
kernel to sparse code using the methods described in Fredrik Kjolstad's thesis.
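
To give a flavor of what a per-dimension annotation means, here is a minimal, purely illustrative C++ sketch (the enum and variable names are placeholders, not the representation used in this CL):

```cpp
#include <vector>

// Each tensor in the kernel carries one storage annotation per dimension.
enum class DimLevel { kDense, kSparse };

// Example: a CSR-like matrix is annotated with a dense row dimension and a
// compressed (sparse) column dimension.
std::vector<DimLevel> csrAnnotation = {DimLevel::kDense, DimLevel::kSparse};
```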

Many details are still TBD. For example, the sparse "bufferization" is purely
done locally since we don't have a global solution for propagating sparsity
yet. Furthermore, code to input and output the sparse tensors is missing.
Nevertheless, with some hand modifications, the generated MLIR can already be
converted into runnable code quite easily.

Diff Detail

Event Timeline

aartbik created this revision.Nov 6 2020, 5:59 PM
aartbik requested review of this revision.Nov 6 2020, 5:59 PM
aartbik updated this revision to Diff 303610.Nov 6 2020, 8:07 PM

fixed lint issue

sri added a subscriber: sri.Nov 8 2020, 7:14 PM
silvas added inline comments.Nov 9 2020, 8:39 AM
mlir/test/Dialect/Linalg/sparse_1d.mlir
19

Haven't done a full review of the patch yet, but VAL_0 here doesn't seem to feed into any of the memrefs in the converted program.

aartbik added inline comments.Nov 9 2020, 9:24 AM
mlir/test/Dialect/Linalg/sparse_1d.mlir
19

The current sparse compiler does a simple "local" bufferization, where the arguments are used to get the shape but are otherwise dropped. The alloc()s that replace them are not even filled with values. You will see this in all the examples.

This is of course not a runnable (nor a final) solution, but it makes hand-modifying the output into something runnable easier.

In the long run, the sparse bufferization should replace the parameters and prototype of the method, just like true bufferization does currently.
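
As a minimal sketch of what this "local" bufferization boils down to (the helper name and the present-day op spellings are assumptions; the actual patch used std.alloc and different code):

```cpp
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinTypes.h"

using namespace mlir;

// Replace a tensor argument by a fresh buffer of the same static shape.
// The argument is only consulted for its shape; its contents are dropped,
// so the resulting IR is not runnable as-is.
static Value genLocalBuffer(OpBuilder &builder, Location loc, Value tensor) {
  auto tensorType = cast<ShapedType>(tensor.getType());
  auto memrefType =
      MemRefType::get(tensorType.getShape(), tensorType.getElementType());
  return builder.create<memref::AllocOp>(loc, memrefType);
}
```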

Thanks for pushing on this Aart!

Some high-level comments before digging deeper: I get that this is essentially reimplementing TACO, whose ideas have been shown to work in the past, but I am wondering if we could significantly simplify things by taking advantage of the existing MLIR infra?
In particular, this is a one-shot, implement-everything-at-once approach that I fear will be hard to break down and interoperate with.
For instance, I have difficulty seeing how tiling, fusion, or vectorization will work from this form.

As discussed offline, I am wondering if we would be better off by immediately starting from:

  1. making topological ordering part of a verification pattern; the transformation part can likely already be applied with linalg interchange, and the optimization part is separate.
  2. given the existing linalg to loops lowering, adapting it to work with the sparse annotation, in particular with a level of indirection in the load/store under appropriate sparsity conditions (a rough sketch follows this list)
  3. inserting extra control flow (additional scf.for loops + scf.if to emulate coiteration)
  4. starting to invest in scf.for + scf.if hoisting and in particular -> scf.while canonicalizations
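
A rough sketch for point 2 (all helper and variable names here are invented, and memref::LoadOp is the present-day spelling of the load): for a sparse dimension the loop runs over storage positions, and both the coordinate and the value are recovered through one level of indirection:

```cpp
#include <utility>

#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/IR/Builders.h"

using namespace mlir;

// Returns {coordinate, value} for the entry at storage position 'pos' of a
// sparse dimension: coordinate = indices[pos], value = values[pos].
// (A dense dimension would instead load values directly at the loop index.)
static std::pair<Value, Value> genSparseEntry(OpBuilder &builder, Location loc,
                                              Value indices, Value values,
                                              Value pos) {
  Value coord = builder.create<memref::LoadOp>(loc, indices, pos);
  Value val = builder.create<memref::LoadOp>(loc, values, pos);
  return {coord, val};
}
```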

The following comment reminds me of the first broadcast lowering which had similar recursions:

/// Recursively generates code while computing iteration lattices in order
/// to manage the complexity of implementing co-iteration over unions
/// and intersections of sparse iterations spaces.

I somewhat view this as a sign of missing abstractions and lowering too fast, which makes sense given the similarity with TACO.
I suspect a lot of the lattice part of the code could start popping out quite naturally as canonicalizations, but I have not tried so I may be completely off here.
It may also be that we don't want to fully expand the while along all paths?

I expect the discussion above to work well with reasonable interfaces that could easily be switched to operations on future sparse types when they become available.

I realize this may sound like a lot of things that have to come together to get to a similar functionality but written in a more progressive and composable fashion.

Is this an opportunity to start breaking down the problem into small composable abstractions and start distributing some of the work?

mlir/test/Dialect/Linalg/sparse_3d.mlir
1193
%[[VAL_25:.*]] = load %[[VAL_9]][%[[VAL_24]]]

should just work.
The cases where I've seen a need for the escaping sequence are when the % was also captured.

Thanks for pushing on this Aart!
I realize this may sound like a lot of things that have to come together to get to a similar functionality but written in a more progressive and composable fashion.

Thanks for your feedback, Nicolas; of course I am already pondering all those questions myself as well. But I guess the approach I am taking with this CL is a bit "learning to walk before you can run", i.e. start with a working solution (so people actually understand fully the direction that will be taken) and then progressively improve on it by adding useful abstractions and refactoring the working parts accordingly, similar to what is still being done for Linalg, bufferization, and even the Vector dialect. I don't see a very clear path in adding such abstractions purely in isolation and then somehow hoping they indeed come together well.

Taking this CL as the starting point, the very first steps would of course be some testing of the generated code, some performance analysis, and the addition of SIMD support, just to convince others this is a feasible approach. Then proper next steps would indeed be to (1) move the annotations to a Linalg interface, so that we can verify early and other phases have to start thinking of how to propagate that information, (2) find a better approach to the bufferization issue, and (3) progressively try to find better primitives to express sparse code, including I/O of the tensors.

mlir/test/Dialect/Linalg/sparse_3d.mlir
1193

Note that all CHECKs are auto-generated (generate-test-checks.py), so the layout and constructs are not my doing. Perhaps we can improve the script to avoid escapes where they are not needed (also, I had to make one hand modification on a parameter match that did not work out of the box).

Thanks Aart!

I did one pass and have a bunch of nits to make this look more MLIResque in style. I agree with Nicolas that we should think about how to connect this better to the infrastructure and avoid a monolithic codegen that will grow huge over time (the same rationale as for affine code generation could apply). My knowledge of the TACO paper is residual at this point, so it's not always easy to track what is happening in the code at a high level. Maybe having a relatively short comment blob that explains the algorithm without referring to the paper could be a good way for us to crystallize our understanding of how this can be fitted into MLIR. That being said, I wouldn't oppose landing a version that is correct + tests, and then iterating on refactoring and generalization while keeping the tests green.

mlir/lib/Dialect/Linalg/Transforms/Sparsification.cpp
56–57

Nit: these could use some documentation. In particular, what do these things contain for different kinds.

79

Nit: /// for function docs as well

97

Nit: emplace_back() ?

238

Nit: we use LogicalResult for such cases
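
Purely as an illustration of the idiom (not code from this patch; the attribute name is a placeholder):

```cpp
#include "mlir/IR/Operation.h"

using namespace mlir;

// Return success()/failure() instead of a bool, so call sites can read:
//   if (failed(verifySparseAnnotations(op))) return failure();
static LogicalResult verifySparseAnnotations(Operation *op) {
  if (!op->hasAttr("sparse")) // placeholder attribute name
    return failure();
  return success();
}
```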

243

Nit: getAttrOfType

272

We have some other implementations of topological sort, iterative IIRC, would it be possible to refactor+template those instead?

331

Nit: I'd rather return Optional<unsigned>

338–340

I was a bit lost here before realizing that kTensor refers to any value that is indexed in the GenericOp and kScalar refers to any value that is not, regardless of their actual types. kTensor may or may not be a tensor, and kScalar may also be a tensor if the ops inside just consume it as a whole... Would be nice to have this documented.

342–352

How do you expect this to scale to more ops?

Generally, MLIR doesn't seem to like data structures parallel to the IR, but it may be justified if there are significant restrictions or analysis costs to process the IR every time.

371

Nit: map.getResult(i).isFunctionOfDim(idx) should work and be more future-proof
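
For example (an illustrative helper, not code from the patch):

```cpp
#include "mlir/IR/AffineMap.h"

using namespace mlir;

// For the map (d0, d1) -> (d1 + d0, d1):
//   map.getResult(0).isFunctionOfDim(0) == true   (d1 + d0 uses d0)
//   map.getResult(1).isFunctionOfDim(0) == false  (d1 alone does not)
static bool resultUsesDim(AffineMap map, unsigned result, unsigned dim) {
  return map.getResult(result).isFunctionOfDim(dim);
}
```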

428–431

Where are these deallocated?

436

Where does this inference take place?

445

Shouldn't this be op.getInitTensor(t - numInputs)?

462

Where is this deallocated?

472

Nit: this seems to be a recurring pattern; can we factor it out into the op class or an interface?

524

Could we have this as a function template with lambdas instead?

529

This actually closes the conditional rather than the loop, which looks a bit strange to me. Correct, but strange.

630–631

This is not compatible with PatternRewriter; use rewriter.createBlock instead.

638

This is not compatible with PatternRewriter either. createBlock takes an arrayref of argument types for this purpose.
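
A hedged sketch of that pattern, using the present-day OpBuilder::createBlock signature (which also takes block-argument locations); the region and the argument types are placeholders:

```cpp
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Let the rewriter create the block so the pattern driver can track (and, if
// necessary, roll back) the creation, and pass the block argument types
// directly instead of adding them afterwards.
static Block *createLoopBlock(PatternRewriter &rewriter, Location loc,
                              Region &region) {
  Type argTypes[1] = {rewriter.getIndexType()};
  Location argLocs[1] = {loc};
  Block *block =
      rewriter.createBlock(&region, region.end(), argTypes, argLocs);
  // createBlock also sets the insertion point to the end of the new block.
  return block;
}
```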

aartbik marked 15 inline comments as done.Nov 12 2020, 8:10 PM
aartbik added inline comments.
mlir/lib/Dialect/Linalg/Transforms/Sparsification.cpp
238

By merging with the new verifier of the annotations, this is no longer needed.

243

By merging with the other CL that makes annotations part of generic op, this is no longer needed.

272

I looked at some, but for the ones I found, I felt expressing this with code reuse would be more obscure than adding these few lines. Is there one simpler one that I overlooked?

331

Yeah, I guess I am old school C sometimes with my extra parameters and magic returns.
Changed to the more elegant Optional, which also has the advantage of stopping the build as soon as it fails.

338–340

Well, anything indexed would be a tensor for the kernel, but I see your point on the scalar issue. Would kInvariant be a better name (besides adding more doc, which I did too)?

342–352

I, too, am reluctant to have parallel data structures, but the lattice point manipulations really need some IR to represent arbitrary subexpressions (viz. a + b + c has the subexpressions a + b + c, a + b, a + c, b + c, a, b, etc.). I don't expect this to ever need to scale to very large IR...
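
For illustration, the kind of tiny kernel-local expression pool being described could look roughly like this (field and enum names are illustrative, not necessarily those of the patch):

```cpp
#include <vector>

// One node per (sub)expression of the kernel, stored in a pool and referenced
// by index, so the lattice construction can freely form subexpressions such
// as a+b, a+c, b+c of an original a+b+c without touching the actual IR.
struct TensorExp {
  enum class Kind { kTensor, kInvariant, kMulF, kAddF } kind;
  unsigned e0, e1; // children (indices into the pool), if any
};

std::vector<TensorExp> exprPool; // the kernel-local "IR"
```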

428–431

They are not. Currently genBuffers() is a "local" bufferization method that just adds some stub allocations that I can simply hand-modify into something that runs. Most of this code is throw-away, since it will be replaced by something more global in bufferization (or at least something changed here by modifying the parameters of the method rather than adding allocs).

436

L444.

445

Any tensor (t < numInput) is an input parameter, except potentially the output tensor (t == numInput), which has a single InitTensor. Again, I realize all code should be clear and solid, but this genBuffers will be rewritten heavily very soon. This just gets some reasonable allocs before the rewritten kernel.

462

See above. TBD ;-)

472

I see a lot of this outside this file too; let's do a cleanup for all of them later.

529

Yeah, this was actually a result of not using { } on a single-statement for-loop (the body is a single if).
I had two {{ and }} but was not sure if the style would apply to the macro as well ;-)

630–631

Ah yes, you are right. Thanks for this very sharp observation, I had completely overlooked that!

aartbik marked 10 inline comments as done.Nov 12 2020, 8:26 PM

Thanks Aart!

I did one pass and have a bunch of nits to make this look more MLIResque in style. I agree with Nicolas that we should think about how to connect this better to the infrastructure and avoid a monolithic codegen that will grow huge over time (the same rationale as for affine code generation could apply). My knowledge of the TACO paper is residual at this point, so it's not always easy to track what is happening in the code at a high level. Maybe having a relatively short comment blob that explains the algorithm without referring to the paper could be a good way for us to crystallize our understanding of how this can be fitted into MLIR. That being said, I wouldn't oppose landing a version that is correct + tests, and then iterating on refactoring and generalization while keeping the tests green.

Thanks for the very detailed comments, Alex, to the point as always! The end goal is definitely not a monolithic codegen, but for now simply to bootstrap the idea and iterate towards nicer abstractions (not unlike the approach we took for many other modules). I really look forward to getting this off the ground!

aartbik updated this revision to Diff 305016.Nov 12 2020, 8:34 PM

addressed Alex' comments

aartbik updated this revision to Diff 305021.Nov 12 2020, 8:50 PM

fixed typo

aartbik updated this revision to Diff 305562.Nov 16 2020, 10:57 AM
aartbik marked 2 inline comments as done.

rebased with new built-in annotations

mlir/lib/Dialect/Linalg/Transforms/Sparsification.cpp
78

If this is meant to subsist in the mid/long-term, it would be good to move it to utils and have a special test for it with simple MLIR custom ops and prints, completely independent of the codegen.

531

I find this function very hard to follow in its current spaghetti form.
It doesn't seem like it would be very hard to restructure it into a more structured form, with helper functions that have semantically meaningful names.

Could we please try that?

555

braces plz, trivial statement is defined as 1-liner these days :)

aartbik marked 2 inline comments as done.Nov 16 2020, 8:42 PM
aartbik added inline comments.
mlir/lib/Dialect/Linalg/Transforms/Sparsification.cpp
78

If we indeed keep this representation and data structure in the long term, I will be happy to pull it out so it can be unit tested in isolation. But let's leave that for a follow-up restructuring.

531

Agreed. When I started this function, it was actually very readable (it more or less follows the recursive variant given in the Taco paper), but I must admit that after filling in the blanks with the actual implementation, it has become a bit massive.

I broke it up into smaller, individual methods. PTAL if this reads better.

aartbik updated this revision to Diff 305651.Nov 16 2020, 8:54 PM
aartbik marked an inline comment as done.

broke up large genStmt() method into many genXXX() methods

nicolasvasilache accepted this revision.Nov 17 2020, 6:30 AM

Thanks @aartbik for refactoring and making this more readable.

This makes a few things clear IMO that we should address before landing.
The remaining issues are mainly concentrated around goto-style programming induced by manual insertion point manipulations across function boundaries.

You should only ever need to build IR by using:

  1. OpBuilder::InsertionGuard g(b), which resets the IP upon RAII release
  2. structured builders that take lambdas and make the nesting that you emit very clear.

Granted, there is no such builder for WhileOp yet, but this revision should still adopt best practices, and we can turn this into a structured WhileOp builder in a subsequent revision.
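
For reference, a minimal sketch of both practices (present-day header path and the scf::ForOp body-builder overload; all names are placeholders):

```cpp
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/IR/Builders.h"

using namespace mlir;

static void buildExample(OpBuilder &builder, Location loc, Value lb, Value ub,
                         Value step, Block *someBlock) {
  // 1. RAII guard: the insertion point is restored when 'guard' goes out of
  //    scope, so no manual save/restore leaks across function boundaries.
  {
    OpBuilder::InsertionGuard guard(builder);
    builder.setInsertionPointToStart(someBlock);
    // ... emit ops into someBlock ...
  } // the insertion point is back where it was

  // 2. Structured builder: the nesting of the emitted IR mirrors the lambdas.
  builder.create<scf::ForOp>(
      loc, lb, ub, step, /*iterArgs=*/ValueRange{},
      [&](OpBuilder &nested, Location nestedLoc, Value iv, ValueRange args) {
        // ... emit the loop body here via 'nested', indexed by 'iv' ...
        nested.create<scf::YieldOp>(nestedLoc);
      });
}
```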

Approving conditioned on that.

mlir/lib/Dialect/Linalg/Transforms/Sparsification.cpp
577

no goto please

620

no goto please

731

no goto please

762

tmp does not escape and is overwritten a bunch of times; please declare it inside the for loop, which makes its transient nature clearer.

768

please factor l767-777 into a meaningfully named helper function

783

Please split in half and refactor further:

// Single index.
if (indices.count() == 1) {
  // The lambda is a scoped body builder passed to the scf::ForOp builder.
  genForLoop(args, [&](OpBuilder &b, Location, Value iv, ValueRange) {
    genLocals(merger, codegen, b, ...);
  });
  return;
}

genWhileLoop(
    args,
    /*regionBuilder1=*/[&](OpBuilder &b, Location, Value iv, ValueRange) {
      genLocals(merger, codegen, b, ...);
      for (unsigned lj : merger.set(lts)) {
        if (li == lj || merger.latGT(li, lj)) {
          LatPoint latj = merger.lat(lj);
          tmp = latj.bits;
          tmp ^= lati.bits;
          if (merger.hasAnyOf(tmp, false))
            continue; // dense exhausted within if/else
          // Recurse into body of each branch.
          genIf(merger, codegen, rewriter, op, idx, latj.bits, ifOp,
                [&]() { /*ifBodyBuilder*/ });
          genStmt(merger, codegen, rewriter, op, topSort, latj.exp, at + 1);
        }
      }
    },
    /*regionBuilder2=*/[&](OpBuilder &b, Location, Value iv, ValueRange) {});

In subsequent revisions, we can continue refactoring to give WhileOp a proper builder with lambdas.
Then all insertion point manipulations can be isolated in that builder.

788

This is confusing due to the goto-style insertion point manipulation across function boundaries.
Please refactor to use the structured scf::IfOp builder:

OpBuilderDAG<(ins "Value":$cond,
  CArg<"function_ref<void(OpBuilder &, Location)>",
       "buildTerminatedBody">:$thenBuilder,
  CArg<"function_ref<void(OpBuilder &, Location)>",
       "nullptr">:$elseBuilder)>

with proper helper functions / lambda to make it read more naturally.
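
In C++, using that builder reads roughly as follows (a hedged sketch; cond and the emitted body are placeholders, and a custom callback is responsible for its own scf.yield terminator):

```cpp
#include "mlir/Dialect/SCF/IR/SCF.h"

using namespace mlir;

static void buildGuardedBody(OpBuilder &builder, Location loc, Value cond) {
  builder.create<scf::IfOp>(
      loc, cond,
      /*thenBuilder=*/
      [&](OpBuilder &b, Location l) {
        // ... emit the conditional body here ...
        b.create<scf::YieldOp>(l);
      },
      /*elseBuilder=*/nullptr); // no else branch needed here
}
```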

Additionally, since this is only used in the while case, I expect it will compose nicely with the above suggestion to show the real structure that you emit in C++:

/// dense iterator
emitForOp(..., [](){
  body
  no conditional
})
return;

/// coiteration
emitWhileOp(..., [](...){}, [](...){
  body 
  emitIfOp(..., [](..){
    body
  });
});

804

Let's please get rid of goto-style insertion point manipulation across function boundaries.

genWhileInduction should be lifted as a lambda passed to genWhileLoop.

This revision is now accepted and ready to land.Nov 17 2020, 6:30 AM
aartbik marked 5 inline comments as done.Nov 17 2020, 11:04 AM

I addressed your remaining comments, except that I am pushing back (for now) on the insertion point vs. builder issue. Most obviously because we don't have such a builder for while yet. But also, I am not convinced it will necessarily make for clearer code. Right now, every method leaves the insertion "cursor" at the right place for the next one, even if that cursor can be somewhere different (inside a for, while, or if). I am not sure builders will improve on that, but I could be convinced otherwise by trying once we actually have those abstractions. But let's not keep stalling this CL by trying to reach a level of perfection that is not met in other parts yet ;-)

aartbik updated this revision to Diff 305858.Nov 17 2020, 11:06 AM

addressed remaining comments (except builders)
also rebased with the new Affine utils that make for shorter code

ftynse added inline comments.Nov 17 2020, 11:41 AM
mlir/lib/Dialect/Linalg/Transforms/Sparsification.cpp
20–23

A bit more detail on what the iteration graph and lattice are would be welcome for future readers :)

428–431

Maybe use an alloca instead? If it is rewritten anyway, we shouldn't care if it's stack or heap, and we wouldn't be creating code that leaks memory.

783

In subsequent revisions, we can continue refactoring to give WhileOp a proper builder with lambdas.

The only reason why I did not add such a builder was the absence of uses, and we don't want untested code.

aartbik marked 3 inline comments as done.Nov 17 2020, 12:06 PM
aartbik added inline comments.
mlir/lib/Dialect/Linalg/Transforms/Sparsification.cpp
20–23

Added a bit more detail.

783

I already am very grateful you added the while loops in the first place, which saved me a lot of time. I am happy to develop more infrastructure as needed.

aartbik updated this revision to Diff 305878.Nov 17 2020, 12:18 PM
aartbik marked 2 inline comments as done.

alloc -> alloca, added a few more comments in top description

ftynse accepted this revision.Nov 17 2020, 12:53 PM
This revision was automatically updated to reflect the committed changes.
jjolly added a subscriber: jjolly.Nov 18 2020, 8:29 AM