
Details

Reviewers

jpienaar
GeorgeARM
nicolasvasilache
rsuderman
stellaraccident

Commits

rG110c1b64a7b9: [mlir][tosa] Improve performance of tosa.transpose constant folding

Summary

Folding of the tosa.transpose operation is both time and memory
intensive because the underlying ElementsAttr is processed as a sequence of
Attributes. This change attempts to operate on the underlying raw data of
the ElementsAttr instead.

In an example resnet50 network, this change reduces the time spent in
folding transpose ops from 35s to 1.5s.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

sabauma created this revision. · Mar 21 2023, 6:57 AM

Herald added a project: Restricted Project. · Mar 21 2023, 6:57 AM

Herald added subscribers: mgehre-amd, Moerafaat, zero9178 and 26 others.

sabauma edited the summary of this revision. · Mar 21 2023, 7:06 AM

Harbormaster completed remote builds in B220714: Diff 506964. · Mar 21 2023, 8:07 AM

Add testcase and remove impossible case

Harbormaster completed remote builds in B220759: Diff 507030. · Mar 21 2023, 11:07 AM

sabauma added reviewers: jpienaar, GeorgeARM. · Mar 21 2023, 11:40 AM

Remove dead comment

Try to publish

Request review

sabauma retitled this revision from Review Requested [mlir][tosa] Improve performance of tosa.transpose constant folding to Request Review [mlir][tosa] Improve performance of tosa.transpose constant folding. · Mar 21 2023, 12:08 PM

sabauma retitled this revision from Request Review [mlir][tosa] Improve performance of tosa.transpose constant folding to [mlir][tosa] Improve performance of tosa.transpose constant folding.

sabauma edited the summary of this revision.

Request Review

sabauma retitled this revision from [Request Review][mlir][tosa] Improve performance of tosa.transpose constant folding to [mlir][tosa] Improve performance of tosa.transpose constant folding. · Mar 21 2023, 12:19 PM

Harbormaster completed remote builds in B220805: Diff 507083. · Mar 21 2023, 2:32 PM

sabauma published this revision for review. · Mar 22 2023, 5:37 AM

Herald added a reviewer: nicolasvasilache. · Mar 22 2023, 5:37 AM

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

sabauma added a reviewer: rsuderman. · Mar 23 2023, 6:10 AM

Avoid converting data to mlir::Attributes in all cases

Harbormaster completed remote builds in B221302: Diff 507724. · Mar 23 2023, 6:59 AM

sabauma added a reviewer: stellaraccident. · Mar 24 2023, 7:27 AM

Thank you for the work!

mlir/lib/Dialect/Tosa/Transforms/TosaFoldConstantTranspose.cpp
26	stellaraccident: I haven't looked at Compiler Explorer, but I'd be willing to bet that mixing signed and unsigned arithmetic in the indexing is not generating the best code. It may be worth going completely signed (unless this is refactored so that it is no longer hot, per the comment below).
67	stellaraccident: I'm not sure that breaking permuteLinearIndex out like this is paying for itself. As written, it will have very bad per-element memory overhead for rank > 6 (reallocating two SmallVector storages for each element). Further, I suspect that those indexing tables can be constructed independently of `it.index()`, simplifying the per-element computation. I suggest inlining the indexing helper and, at a minimum, reusing the vectors for the entire iteration. Better would be factoring more of the per-element computation out of the loop.
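
On the first inline comment (line 26): the sketch below does not measure generated code quality, which is what the reviewer is betting on. It only shows the language rule behind the concern, namely that when a uint64_t index meets an int64_t extent the signed operand is converted to unsigned before the operation. The helper names remMixed and remSigned are hypothetical, and the negative extent is used purely to make the conversion visible; it is not something the patch is expected to encounter.

```cpp
#include <cstdint>
#include <type_traits>

// Mixed arithmetic: an unsigned linear index combined with a signed extent,
// the combination that arises when an enumeration index meets a shape array
// of int64_t.
inline uint64_t remMixed(uint64_t linearIndex, int64_t extent) {
  // The usual arithmetic conversions turn `extent` into uint64_t first, so
  // the whole expression is unsigned.
  static_assert(
      std::is_same<decltype(linearIndex % extent), uint64_t>::value,
      "mixed signed/unsigned operands produce an unsigned result");
  return linearIndex % extent;
}

// Fully signed arithmetic: both operands and the result are int64_t.
inline int64_t remSigned(int64_t linearIndex, int64_t extent) {
  return linearIndex % extent;
}
```

For ordinary positive extents the two helpers agree (7 % 3 is 1 either way). They diverge only when the signed operand is negative, where the mixed version silently reinterprets it as a very large unsigned value.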

This revision now requires changes to proceed. · Mar 24 2023, 8:19 AM

Address @stellaraccident's comments

sabauma added inline comments. · Mar 24 2023, 10:23 AM

mlir/lib/Dialect/Tosa/Transforms/TosaFoldConstantTranspose.cpp
67	I've reworked the indexing calculation logic to avoid the need to temporarily store the indices for the input and output tensors. The only overhead is to pre-compute the strides on the output tensor and invert the permutation map, both of which are loop invariant. Based on my simple resnet50 example, this halves the runtime from my initial solution.
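
The reworked indexing can be illustrated outside MLIR. The following is a standalone sketch, not the committed code: it uses std::vector in place of ElementsAttr, and rowMajorStrides, invertPermutation, and transposeRaw are hypothetical names standing in for the roles that computeStrides, invertPermutationVector, and transposeType play in the diff. It assumes a row-major layout and a valid permutation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Row-major strides: strides[i] is the product of shape[i + 1 .. rank - 1].
inline std::vector<int64_t> rowMajorStrides(const std::vector<int64_t> &shape) {
  std::vector<int64_t> strides(shape.size(), 1);
  for (int64_t i = static_cast<int64_t>(shape.size()) - 2; i >= 0; --i)
    strides[i] = strides[i + 1] * shape[i + 1];
  return strides;
}

// Inverse of a permutation: inverse[perm[i]] == i.
inline std::vector<int64_t> invertPermutation(const std::vector<int64_t> &perm) {
  std::vector<int64_t> inverse(perm.size());
  for (int64_t i = 0; i < static_cast<int64_t>(perm.size()); ++i)
    inverse[perm[i]] = i;
  return inverse;
}

// Transposes a flat row-major buffer. Output dimension i takes its extent
// from input dimension perm[i].
template <typename T>
std::vector<T> transposeRaw(const std::vector<T> &input,
                            const std::vector<int64_t> &inputShape,
                            const std::vector<int64_t> &perm) {
  assert(perm.size() == inputShape.size());
  const int64_t rank = static_cast<int64_t>(inputShape.size());

  std::vector<int64_t> outputShape(inputShape.size());
  for (int64_t i = 0; i < rank; ++i)
    outputShape[i] = inputShape[perm[i]];

  // Both tables are loop invariant, so they are built once up front.
  const std::vector<int64_t> outputStrides = rowMajorStrides(outputShape);
  const std::vector<int64_t> inversePerm = invertPermutation(perm);

  std::vector<T> output(input.size());
  for (int64_t src = 0; src < static_cast<int64_t>(input.size()); ++src) {
    int64_t remaining = src;
    int64_t dst = 0;
    // Peel off the source index one dimension at a time, innermost first,
    // and add that dimension's contribution to the destination index.
    for (int64_t dim = rank - 1; dim >= 0; --dim) {
      const int64_t indexInDim = remaining % inputShape[dim];
      remaining /= inputShape[dim];
      dst += outputStrides[inversePerm[dim]] * indexInDim;
    }
    output[dst] = input[src];
  }
  return output;
}
```

No per-element scratch vectors are needed: each source index is decomposed and immediately folded into the destination index. For a 2x3 input holding 0 through 5 and a permutation of [1, 0], the result is 0, 3, 1, 4, 2, 5, the same layout as the 3x2 constant checked by transpose_fold_2d_float in the test file.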

Thank you for going the extra mile. This looks good to me and is a much needed improvement.

This revision is now accepted and ready to land. · Mar 24 2023, 10:32 AM

Harbormaster completed remote builds in B221630: Diff 508145. · Mar 24 2023, 10:45 AM

That's really nice @sabauma! LGTM!

rsuderman accepted this revision. · Mar 24 2023, 12:19 PM

@stellaraccident @GeorgeARM if this change looks acceptable, would one of you mind submitting it? I do not have commit access.

Closed by commit rG110c1b64a7b9: [mlir][tosa] Improve performance of tosa.transpose constant folding (authored by sabauma, committed by Robert Suderman <suderman@google.com>). · Mar 24 2023, 12:51 PM

This revision was automatically updated to reflect the committed changes.

Robert Suderman <suderman@google.com> added a commit: rG110c1b64a7b9: [mlir][tosa] Improve performance of tosa.transpose constant folding.

Nice speedup! Sorry I missed the review notification

Diff 508200

mlir/lib/Dialect/Tosa/Transforms/TosaFoldConstantTranspose.cpp

	//===- TosaFoldConstantTranspose.cpp --------------------------------------===//			//===- TosaFoldConstantTranspose.cpp --------------------------------------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// Fold TOSA Transpose operation on constant data			// Fold TOSA Transpose operation on constant data
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "mlir/Dialect/Tosa/IR/TosaOps.h"			#include "mlir/Dialect/Tosa/IR/TosaOps.h"
	#include "mlir/Dialect/Tosa/Transforms/Passes.h"			#include "mlir/Dialect/Tosa/Transforms/Passes.h"
				#include "mlir/Dialect/Utils/IndexingUtils.h"
	#include "mlir/IR/Matchers.h"			#include "mlir/IR/Matchers.h"
	#include "mlir/Pass/Pass.h"			#include "mlir/Pass/Pass.h"

	using namespace mlir;			using namespace mlir;
	using namespace mlir::tosa;			using namespace mlir::tosa;

	namespace {			namespace {

				template <typename BaseType>
				DenseElementsAttr transposeType(ElementsAttr attr, ShapedType inputType,
				ShapedType outputType,
				llvm::ArrayRef<int64_t> permValues) {
				if (inputType.getNumElements() == 0)
				return DenseElementsAttr::get(outputType, llvm::ArrayRef<BaseType>{});

				auto attrValues = attr.getValues<BaseType>();
				auto inputShape = inputType.getShape();

				// The inverted permutation map and strides of the output are used to compute
				// the contribution of a given dimension to the destination linear index in
				// an order-independent way.
				auto outputStrides = computeStrides(outputType.getShape());
				auto invertedPermValues = invertPermutationVector(permValues);

				auto initialValue = *std::begin(attrValues);
				SmallVector<BaseType> outputValues(inputType.getNumElements(), initialValue);

				for (const auto &it : llvm::enumerate(attrValues)) {
				auto srcLinearIndex = it.index();

				uint64_t dstLinearIndex = 0;
				for (int64_t dim = inputShape.size() - 1; dim >= 0; --dim) {
				// Compute the index into the current dimension of the source vector.
				auto sourceIndexForDim = srcLinearIndex % inputShape[dim];
				srcLinearIndex /= inputShape[dim];

				// Add the contribution of the current dimension to the output using the
				// permutation map.
				dstLinearIndex +=
				outputStrides[invertedPermValues[dim]] * sourceIndexForDim;
				}

				outputValues[dstLinearIndex] = it.value();
				}

				return DenseElementsAttr::get(outputType,
				llvm::ArrayRef<BaseType>(outputValues));
				}

				// A type specialized transposition of an ElementsAttr.
				// This implementation tries to operate on the underlying data in its raw
				// representation when possible to avoid allocating a large number of Attribute
				// objects.
				DenseElementsAttr transpose(ElementsAttr attr, ShapedType inputType,
				ShapedType outputType,
				llvm::ArrayRef<int64_t> permValues) {
				auto baseType = inputType.getElementType();

				// Handle possible integer types
				if (auto intType = baseType.dyn_cast<IntegerType>()) {
				switch (intType.getWidth()) {
				case 1:
				return transposeType<bool>(attr, inputType, outputType, permValues);
				case 8:
				return transposeType<int8_t>(attr, inputType, outputType, permValues);
				case 16:
				return transposeType<int16_t>(attr, inputType, outputType, permValues);
				case 32:
				return transposeType<int32_t>(attr, inputType, outputType, permValues);
				case 64:
				return transposeType<int64_t>(attr, inputType, outputType, permValues);
				default:
				return transposeType<APInt>(attr, inputType, outputType, permValues);
				}
				}

				// Handle possible float types
				if (baseType.isF32()) {
				return transposeType<float>(attr, inputType, outputType, permValues);
				}

				return transposeType<APFloat>(attr, inputType, outputType, permValues);
				}

	struct TosaFoldConstantTranspose : public OpRewritePattern<tosa::TransposeOp> {			struct TosaFoldConstantTranspose : public OpRewritePattern<tosa::TransposeOp> {
	using OpRewritePattern::OpRewritePattern;			using OpRewritePattern::OpRewritePattern;

	LogicalResult matchAndRewrite(tosa::TransposeOp op,			LogicalResult matchAndRewrite(tosa::TransposeOp op,
	PatternRewriter &rewriter) const override {			PatternRewriter &rewriter) const override {
	auto outputType = op.getType().cast<ShapedType>();			auto outputType = op.getType().cast<ShapedType>();
	// TOSA supports quantized types.			// TOSA supports quantized types.
	if (!outputType.getElementType().isIntOrIndexOrFloat())			if (!outputType.getElementType().isIntOrIndexOrFloat())
	return failure();			return failure();

	ElementsAttr inputValues;			ElementsAttr inputValues;
	if (!matchPattern(op.getInput1(), m_Constant(&inputValues)))			if (!matchPattern(op.getInput1(), m_Constant(&inputValues)))
	return failure();			return failure();
	// Make sure the input is a constant that has a single user.			// Make sure the input is a constant that has a single user.
	if (!llvm::hasSingleElement(op.getInput1().getDefiningOp()->getUsers()))			if (!llvm::hasSingleElement(op.getInput1().getDefiningOp()->getUsers()))
	return failure();			return failure();

	DenseIntElementsAttr permAttr;			DenseIntElementsAttr permAttr;
	if (!matchPattern(op.getPerms(), m_Constant(&permAttr)))			if (!matchPattern(op.getPerms(), m_Constant(&permAttr)))
	return failure();			return failure();
	auto permValues = llvm::to_vector<6>(llvm::map_range(			auto permValues = llvm::to_vector<6>(llvm::map_range(
	// TOSA allows both 32- and 64-bit integer tensors here.			// TOSA allows both 32- and 64-bit integer tensors here.
	permAttr.getValues<APInt>(),			permAttr.getValues<APInt>(),
	[](const APInt &val) { return val.getZExtValue(); }));			[](const APInt &val) { return val.getSExtValue(); }));

	auto inputType = op.getInput1().getType().cast<ShapedType>();			auto inputType = op.getInput1().getType().cast<ShapedType>();
	ArrayRef<int64_t> inputShape = inputType.getShape();
	int64_t numElements = inputType.getNumElements();

	SmallVector<Attribute, 4> outputValues;
	outputValues.resize(numElements);

	// Transpose the input constant. Because we don't know its rank in advance,
	// we need to loop over the range [0, element count) and delinearize the
	// index.
	auto attrValues = inputValues.getValues<Attribute>();
	ArrayRef<int64_t> outputShape = outputType.getShape();
	for (const auto &it : llvm::enumerate(attrValues)) {
	SmallVector<uint64_t, 6> srcIndices(inputType.getRank(), 0);
	int totalCount = it.index();
	for (int dim = inputType.getRank() - 1; dim >= 0; --dim) {
	srcIndices[dim] = totalCount % inputShape[dim];
	totalCount /= inputShape[dim];
	}

	SmallVector<uint64_t, 6> dstIndices(outputType.getRank(), 0);
	for (int dim = outputType.getRank() - 1; dim >= 0; --dim)
	dstIndices[dim] = srcIndices[permValues[dim]];

	uint64_t dstLinearIndex = dstIndices.front();
	for (int dim = 1; dim < outputType.getRank(); ++dim)
	dstLinearIndex = dstLinearIndex * outputShape[dim] + dstIndices[dim];

	outputValues[dstLinearIndex] = it.value();
	}

	rewriter.replaceOpWithNewOp<tosa::ConstOp>(			auto resultAttr = transpose(inputValues, inputType, outputType, permValues);
	op, outputType, DenseElementsAttr::get(outputType, outputValues));			rewriter.replaceOpWithNewOp<tosa::ConstOp>(op, outputType, resultAttr);
	return success();			return success();
	}			}
	};			};

	} // namespace			} // namespace

	void mlir::tosa::populateTosaFoldConstantTransposePatterns(			void mlir::tosa::populateTosaFoldConstantTransposePatterns(
	MLIRContext *ctx, RewritePatternSet &patterns) {			MLIRContext *ctx, RewritePatternSet &patterns) {
	patterns.add<TosaFoldConstantTranspose>(ctx);			patterns.add<TosaFoldConstantTranspose>(ctx);
	}			}
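
One small change in the diff above is easy to miss: the permutation values are now read with getSExtValue rather than getZExtValue. The sketch below does not use APInt; it is a plain C++ illustration, with hypothetical helper names, of how zero extension and sign extension of a 32-bit value differ when widening to 64 bits.

```cpp
#include <cstdint>

// Zero extension: the 32 source bits are treated as unsigned, so the upper
// 32 bits of the result are always zero.
inline int64_t zeroExtend32(int32_t value) {
  return static_cast<int64_t>(static_cast<uint32_t>(value));
}

// Sign extension: the sign bit is replicated into the upper 32 bits, so the
// numeric value is preserved.
inline int64_t signExtend32(int32_t value) {
  return static_cast<int64_t>(value);
}
```

For the non-negative indices that make up a valid permutation the two agree, so the switch should not alter how well-formed inputs are folded. They differ only for negative values, where zero extension yields a large positive number instead.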

mlir/test/Dialect/Tosa/constant-op-fold.mlir

(40 lines collapsed)
func.func @transpose_fold_2d_float() -> tensor<3x2xf32> {
%perms = "tosa.const"() {value = dense<[1, 0]> : tensor<2xi32>} : () -> tensor<2xi32>		%perms = "tosa.const"() {value = dense<[1, 0]> : tensor<2xi32>} : () -> tensor<2xi32>
// CHECK: %[[CST:.+]] = "tosa.const"()		// CHECK: %[[CST:.+]] = "tosa.const"()
// CHECK-SAME{LITERAL}: value = dense<[[0.000000e+00, 3.000000e+00], [1.000000e+00, 4.000000e+00], [2.000000e+00, 5.000000e+00]]> : tensor<3x2xf32>		// CHECK-SAME{LITERAL}: value = dense<[[0.000000e+00, 3.000000e+00], [1.000000e+00, 4.000000e+00], [2.000000e+00, 5.000000e+00]]> : tensor<3x2xf32>
%1 = "tosa.transpose"(%input, %perms) : (tensor<2x3xf32>, tensor<2xi32>) -> tensor<3x2xf32>		%1 = "tosa.transpose"(%input, %perms) : (tensor<2x3xf32>, tensor<2xi32>) -> tensor<3x2xf32>
// CHECK: return %[[CST]]		// CHECK: return %[[CST]]
return %1 : tensor<3x2xf32>		return %1 : tensor<3x2xf32>
}		}

		// CHECK-LABEL: @transpose_fold_2d_bool
		func.func @transpose_fold_2d_bool() -> tensor<3x2xi1> {
		%input = "tosa.const"() {value = dense<[[true, false, false], [false, false, true]]> : tensor<2x3xi1>} : () -> tensor<2x3xi1>
		%perms = "tosa.const"() {value = dense<[1, 0]> : tensor<2xi32>} : () -> tensor<2xi32>
		// CHECK: %[[CST:.+]] = "tosa.const"()
		// CHECK-SAME{LITERAL}: value = dense<[[true, false], [false, false], [false, true]]> : tensor<3x2xi1>
		%1 = "tosa.transpose"(%input, %perms) : (tensor<2x3xi1>, tensor<2xi32>) -> tensor<3x2xi1>
		// CHECK: return %[[CST]]
		return %1 : tensor<3x2xi1>
		}

// CHECK-LABEL: @transpose_fold_4d_int		// CHECK-LABEL: @transpose_fold_4d_int
func.func @transpose_fold_4d_int() -> tensor<3x1x4x2xi32> {		func.func @transpose_fold_4d_int() -> tensor<3x1x4x2xi32> {
%input = "tosa.const"() {value = dense<[[		%input = "tosa.const"() {value = dense<[[
[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]],		[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]],
[[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]		[[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]
]]> : tensor<1x2x3x4xi32>} : () -> tensor<1x2x3x4xi32>		]]> : tensor<1x2x3x4xi32>} : () -> tensor<1x2x3x4xi32>
%perms = "tosa.const"() {value = dense<[2, 0, 3, 1]> : tensor<4xi64>} : () -> tensor<4xi64>		%perms = "tosa.const"() {value = dense<[2, 0, 3, 1]> : tensor<4xi64>} : () -> tensor<4xi64>
// CHECK: %[[CST:.+]] = "tosa.const"()		// CHECK: %[[CST:.+]] = "tosa.const"()
(remaining lines collapsed)

This is an archive of the discontinued LLVM Phabricator instance.

[mlir][tosa] Improve performance of tosa.transpose constant folding
ClosedPublic
