This is an archive of the discontinued LLVM Phabricator instance.

[mlir][sparse] refine heuristic for iteration graph topsort
ClosedPublic

Authored by aartbik on Sep 1 2021, 3:31 PM.

Details

Summary

The sparse index order must always be satisfied, but this
may still leave a choice among several valid topological
sorts. We used to break ties in favor of any dense index
order, since this gives good locality. However, breaking
ties in favor of pushing unrelated indices into sparse
iteration spaces gives better asymptotic complexity. This
revision improves the heuristic accordingly.

Note that in the long run, we are really interested in using
ML for ML to find the best loop ordering as a replacement for
such heuristics.
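
The mechanism described above can be sketched as a mask-guided topological sort: mandatory sparse-order constraints are always kept, while optional (e.g. dense-locality) constraints are admitted only when their bit is set in a mask, so the caller can relax the graph if it turns out cyclic. The names, values, and structure below are illustrative assumptions, not the actual Sparsification.cpp code:

```cpp
#include <optional>
#include <queue>
#include <vector>

// Hypothetical mask bits; values are illustrative, not the real SortMask.
enum SortMask : unsigned { kSparseOnly = 0x0, kIncludeDense = 0x1, kIncludeUndef = 0x2 };

// A loop-ordering constraint "from must come before to"; kind == 0 means
// mandatory (sparse order), otherwise a bit identifying an optional kind.
struct Constraint { unsigned from, to, kind; };

// Kahn's algorithm over the constraints admitted by `mask`;
// returns std::nullopt if the resulting graph is cyclic.
std::optional<std::vector<unsigned>>
topSort(unsigned numLoops, const std::vector<Constraint> &cs, unsigned mask) {
  std::vector<std::vector<unsigned>> adj(numLoops);
  std::vector<unsigned> indeg(numLoops, 0);
  for (const auto &c : cs) {
    if (c.kind != 0 && !(c.kind & mask))
      continue; // optional constraint filtered out by the mask
    adj[c.from].push_back(c.to);
    ++indeg[c.to];
  }
  std::queue<unsigned> ready;
  for (unsigned i = 0; i < numLoops; ++i)
    if (indeg[i] == 0)
      ready.push(i);
  std::vector<unsigned> order;
  while (!ready.empty()) {
    unsigned n = ready.front();
    ready.pop();
    order.push_back(n);
    for (unsigned m : adj[n])
      if (--indeg[m] == 0)
        ready.push(m);
  }
  if (order.size() != numLoops)
    return std::nullopt; // cycle: the admitted constraints conflict
  return order;
}
```

Starting from the full mask and relaxing on failure yields the locality-friendly order when it exists, and falls back to a sparse-only order otherwise.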

Diff Detail

Event Timeline

aartbik created this revision.Sep 1 2021, 3:31 PM
aartbik requested review of this revision.Sep 1 2021, 3:31 PM

Not surprisingly, this shows a speedup of, for example, 5.9x on sampled dense dense for a sparse matrix like jpwh_991. But this is just because we managed to pick the wrong topsort before this fix ;-)

aartbik updated this revision to Diff 370371.Sep 2 2021, 12:49 PM

rebased with main

bixia accepted this revision.Sep 2 2021, 2:32 PM
bixia added inline comments.
mlir/lib/Dialect/SparseTensor/Transforms/Sparsification.cpp
157–158

This doesn't match the code exactly now. Maybe change to "the dense constraint bit is not set" or the like?

158–159

Shall we name the constants and use those names to replace the "magic constants"?

const int kDenseConstraintBit = 0x1;
const int kSparseOuterConstraintBit = 0x2;

0x3 => kDenseConstraintBit | kSparseOuterConstraintBit

etc
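
The suggestion above could be made concrete as a couple of named bits; the names and values here follow the reviewer's sketch (the committed code ended up using a `SortMask` enum with different names instead):

```cpp
// Named constraint-mask bits, per the reviewer's proposal above
// (illustrative; not the identifiers used in the committed code).
const int kDenseConstraintBit = 0x1;
const int kSparseOuterConstraintBit = 0x2;

// The magic constant 0x3 then reads as a union of the named bits.
const int kAllConstraintBits = kDenseConstraintBit | kSparseOuterConstraintBit;
```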

mlir/test/Dialect/SparseTensor/sparse_2d.mlir
1053

Shall we add a comment to this test, and maybe also rename it, to make explicit that the test ensures the dense dimension k is the inner loop in the generated code for performance?

This revision is now accepted and ready to land.Sep 2 2021, 2:32 PM
aartbik marked 3 inline comments as done.Sep 2 2021, 5:21 PM
aartbik added inline comments.
mlir/lib/Dialect/SparseTensor/Transforms/Sparsification.cpp
157–158

Ah, it is in a way the same, but yes, rephrased the comment.

mlir/test/Dialect/SparseTensor/sparse_2d.mlir
1053

Well, it is called sampled_dense_dense, right? And these 1d/2d/3d files usually have few comments on what to expect (other than what is expressed with CHECKs).

aartbik updated this revision to Diff 370453.Sep 2 2021, 5:21 PM
aartbik marked 2 inline comments as done.

used named bits in masks

bixia added inline comments.Sep 2 2021, 8:52 PM
mlir/lib/Dialect/SparseTensor/Transforms/Sparsification.cpp
1150–1156

Would it be better to write this as a loop, from SortMask::kIncludeUndef | SortMask::kIncludeDense down to 0 with step -1?

aartbik marked an inline comment as done.Sep 3 2021, 8:34 AM
aartbik added inline comments.
mlir/lib/Dialect/SparseTensor/Transforms/Sparsification.cpp
1150–1156

I agree that in the original form, it was just screaming "for-loop" here! But now with the symbolic bit masks, I think the spelled out version makes more sense.

This revision was automatically updated to reflect the committed changes.
aartbik marked an inline comment as done.
wrengr added inline comments.Sep 3 2021, 2:01 PM
mlir/lib/Dialect/SparseTensor/Transforms/Sparsification.cpp
1150–1156

I still kinda feel that using a for-loop would be clearer, since the logic is "if all the options fail, then return failure()". The only concern with a for-loop is if we're relying on a specific ordering to improve the effectiveness of short-circuiting; but I think that can be resolved with a comment (or if things get really hairy in the future then defining a custom iterator to make sure to enumerate things in the right order).
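
The for-loop shape under discussion might look like the sketch below, which tries masks from all optional constraints down to none and returns the first feasible one. Everything here is illustrative: `tryMask` stands in for `computeIterationGraph`, and the enum values are assumptions, not the actual code:

```cpp
// Hypothetical mask bits (illustrative values only).
enum SortMask : unsigned { kIncludeDense = 0x1, kIncludeUndef = 0x2 };

// Stand-in for computeIterationGraph: pretend only the fully relaxed
// graph (no optional constraints) is acyclic in this toy example.
static bool tryMask(unsigned mask) { return mask == 0; }

// Loop form: try masks from strictest (most constraints) to loosest,
// returning the first mask that yields an acyclic iteration graph,
// or -1 if all options fail.
static int firstFeasibleMask() {
  for (int mask = kIncludeUndef | kIncludeDense; mask >= 0; --mask)
    if (tryMask(static_cast<unsigned>(mask)))
      return mask;
  return -1;
}
```

Note the ordering concern raised above: the descending loop only coincides with the intended "prefer more constraints" priority as long as the bit values encode that priority, which is worth a comment either way.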