This is an archive of the discontinued LLVM Phabricator instance.

[mlir][gpu] Add DecomposeMemrefsPass
ClosedPublic

Authored by Hardcode84 on Jul 13 2023, 4:10 PM.

Details

Summary

Some GPU backends (SPIR-V) lower memrefs to bare pointers, so lowering dynamically sized/strided memrefs fails.
This pass extracts sizes and strides via memref.extract_strided_metadata outside the gpu.launch body, performs the index/offset calculations explicitly, and then reconstructs the memrefs via memref.reinterpret_cast.

memref.reinterpret_cast is then lowered via https://reviews.llvm.org/D155011.
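Roughly, the rewrite works as sketched below (a hand-written illustration with made-up value names and shapes, not IR from the patch; the offset arithmetic is written with arith ops here for readability, while the pass builds it via affine machinery as discussed in the comments): the strided metadata is extracted before the launch, the linear offset is computed explicitly from the thread index, and a 0-d memref is rebuilt inside the body so only the bare base buffer crosses the launch boundary.

```mlir
// Before (sketch): a dynamically strided memref is captured by the launch body.
gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
           threads(%tx, %ty, %tz) in (%sx = %c32, %sy = %c1, %sz = %c1) {
  %v = memref.load %mem[%tx] : memref<?xf32, strided<[?], offset: ?>>
  gpu.terminator
}

// After (sketch): metadata is extracted outside the body, the offset is
// computed explicitly, and the memref is reconstructed inside the body.
%base, %offset, %size, %stride = memref.extract_strided_metadata %mem
    : memref<?xf32, strided<[?], offset: ?>> -> memref<f32>, index, index, index
gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
           threads(%tx, %ty, %tz) in (%sx = %c32, %sy = %c1, %sz = %c1) {
  %scaled = arith.muli %tx, %stride : index
  %off = arith.addi %offset, %scaled : index
  %view = memref.reinterpret_cast %base to offset: [%off], sizes: [], strides: []
      : memref<f32> to memref<f32, strided<[], offset: ?>>
  %v = memref.load %view[] : memref<f32, strided<[], offset: ?>>
  gpu.terminator
}
```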

Diff Detail

Event Timeline

Hardcode84 created this revision.Jul 13 2023, 4:10 PM
Hardcode84 requested review of this revision.Jul 13 2023, 4:10 PM

Nice!

Can we refactor the implementations and put them in MemRefUtils, to separate out the orthogonal GPU concerns here?
Then we can all reuse these nice features that people have been rewriting over and over.

mlir/lib/Dialect/GPU/Transforms/DecomposeMemrefs.cpp
32

An anonymous namespace plus static is redundant.
The preferred mechanism in LLVM is static; could you drop the namespace?

46

Offhand, this looks like an n-th implementation/extension of features available in IndexingUtils and StaticValueUtils.

I made a similar comment in https://reviews.llvm.org/D155017.

I made some refactorings in the past in MemRef/Transforms/FoldMemRefAliasOps.cpp, but more can probably be done.

Can we try to all converge on one thing and compound interest?

move some code into IndexingUtils

tschuett added inline comments.
mlir/lib/Dialect/GPU/Transforms/DecomposeMemrefs.cpp
165
nicolasvasilache accepted this revision.Aug 9 2023, 4:10 PM

sorry for the delay, thanks!

mlir/lib/Dialect/GPU/Transforms/DecomposeMemrefs.cpp
52

Hmm, I find this somewhat hard to parse: this is not a lambda but the result of applying a lambda.

How about:

memref::ExtractStridedMetadataOp newExtractStridedMetadata;
{
  OpBuilder::InsertionGuard g(rewriter);
  setInsertionPointToStart(rewriter, source);
  newExtractStridedMetadata =
      rewriter.create<memref::ExtractStridedMetadataOp>(loc, source);
}
This revision is now accepted and ready to land.Aug 9 2023, 4:10 PM
Hardcode84 updated this revision to Diff 548822.Aug 9 2023, 4:58 PM

rebase, review comments

Hardcode84 marked an inline comment as done.Aug 9 2023, 4:58 PM
This revision was landed with ongoing or failed builds.Aug 9 2023, 5:28 PM
This revision was automatically updated to reflect the committed changes.

Reverted as it broke some bots.

So, there is a problem: computeLinearIndex introduces a dependency on the affine dialect, but I cannot just add MLIRAffineDialect as a dependency to IndexingUtils, as that creates circular dependencies.

Any ideas on how to handle this?

Hardcode84 reopened this revision.Aug 9 2023, 7:32 PM
This revision is now accepted and ready to land.Aug 9 2023, 7:32 PM
Hardcode84 updated this revision to Diff 548851.Aug 9 2023, 7:34 PM

moved to MemRefUtils

Hardcode84 added a comment.EditedAug 9 2023, 7:37 PM

Moved the code to MemRefUtils as it already has an affine dependency. The computeLinearIndex and getLinearizeMemRefAndOffset functions should be unified, but getLinearizeMemRefAndOffset does too many things at once, and that refactoring would be too big for an unrelated patch.

nicolasvasilache added a comment.EditedAug 10 2023, 3:18 AM

Moved the code to MemRefUtils as it already has an affine dependency. The computeLinearIndex and getLinearizeMemRefAndOffset functions should be unified, but getLinearizeMemRefAndOffset does too many things at once, and that refactoring would be too big for an unrelated patch.

Hmm, not a fan of the landing place... I suggest revisiting how you construct your AffineExpr using/extending IndexingUtils.
For instance, I am unclear why you need to interleave symbols/values for strides and indices.
Seems pretty easy to just use the following in IndexingUtils:

SmallVector<AffineExpr> computeElementwiseMul(ArrayRef<AffineExpr> v1, ArrayRef<AffineExpr> v2);
AffineExpr computeSum(MLIRContext *ctx, ArrayRef<AffineExpr> basis);

And build something resembling

OpFoldResult mlir::computeLinearIndex(OpBuilder &builder, Location loc,
                                      OpFoldResult sourceOffset,
                                      ArrayRef<OpFoldResult> strides,
                                      ArrayRef<OpFoldResult> indices) {
  assert(strides.size() == indices.size());
  auto sourceRank = static_cast<unsigned>(strides.size());

  // Hold the affine symbols and values for the computation of the offset.
  SmallVector<OpFoldResult> values{indices.begin(), indices.end()};
  llvm::append_range(values, strides);

  AffineExpr expr = computeTheAffineExprYouWantWithIndexingUtilsAndExtendThemAsNecessary();

  return affine::makeComposedFoldedAffineApply(builder, loc, expr, values);
}
nicolasvasilache added a comment.EditedAug 10 2023, 3:27 AM
OpFoldResult mlir::computeLinearIndex(OpBuilder &builder, Location loc,
                                      OpFoldResult sourceOffset,
                                      ArrayRef<OpFoldResult> strides,
                                      ArrayRef<OpFoldResult> indices) {

For the specific landing place of this function, how about Dialect/Affine/ViewLikeInterfaceUtils.h which contains related functions?

I wouldn't be opposed to moving this file to a better place as long as we keep all related functionality together as much as possible.
Maybe there is a need for AffineIndexingUtils that would just make the IndexingUtils APIs available with OpFoldResult + AffineApply.

Moved the code to MemRefUtils as it already has an affine dependency. The computeLinearIndex and getLinearizeMemRefAndOffset functions should be unified, but getLinearizeMemRefAndOffset does too many things at once, and that refactoring would be too big for an unrelated patch.

Hmm, not a fan of the landing place... I suggest revisiting how you construct your AffineExpr using/extending IndexingUtils.
For instance, I am unclear why you need to interleave symbols/values for strides and indices.
Seems pretty easy to just use the following in IndexingUtils:

SmallVector<AffineExpr> computeElementwiseMul(ArrayRef<AffineExpr> v1, ArrayRef<AffineExpr> v2);
AffineExpr computeSum(MLIRContext *ctx, ArrayRef<AffineExpr> basis);

And build something resembling

OpFoldResult mlir::computeLinearIndex(OpBuilder &builder, Location loc,
                                      OpFoldResult sourceOffset,
                                      ArrayRef<OpFoldResult> strides,
                                      ArrayRef<OpFoldResult> indices) {
  assert(strides.size() == indices.size());
  auto sourceRank = static_cast<unsigned>(strides.size());

  // Hold the affine symbols and values for the computation of the offset.
  SmallVector<OpFoldResult> values{indices.begin(), indices.end()};
  llvm::append_range(values, strides);

  AffineExpr expr = computeTheAffineExprYouWantWithIndexingUtilsAndExtendThemAsNecessary();

  return affine::makeComposedFoldedAffineApply(builder, loc, expr, values);
}

Not sure I got your comment; the main issue is affine::makeComposedFoldedAffineApply, which requires a dependency on the affine dialect, and adding an affine dialect dep to IndexingUtils causes circular target dependencies in CMake.

nicolasvasilache added a comment.EditedAug 10 2023, 5:21 AM

Not sure I got your comment; the main issue is affine::makeComposedFoldedAffineApply, which requires a dependency on the affine dialect, and adding an affine dialect dep to IndexingUtils causes circular target dependencies in CMake.

Yes, sorry, there are too many parts to my comment; let me untangle:

  1. You should be able to make the impl of computeLinearIndex use pure AffineExpr helpers, which can live in IndexingUtils (AffineExpr is a builtin type; it does not require a dependence on the affine dialect). I wrote one suggestion about using a pointwise mul and an add reduction on AffineExpr, if you can live without the interleaving of indices and strides. If you really want the interleaving, I'd suggest adding more AffineExpr-only helpers.
  2. Then your computeLinearIndex is ~3 lines and can live in Dialect/Affine/ViewLikeInterfaceUtils.h, which depends on the affine dialect and already contains helpers that are related in spirit.
  3. Bonus points for a better reorganization of Dialect/Affine/ViewLikeInterfaceUtils.h into smaller components, such as a Dialect/Affine/IndexingUtils.h that would roughly mirror the existing IndexingUtils.h API but additionally use affine_apply. This is not for this PR.

Does this make sense?

I can change computeLinearIndex to return an AffineExpr plus a list of values, so it will be the user's responsibility to pass them to makeComposedFoldedAffineApply. We won't need anything in Dialect/Affine/ViewLikeInterfaceUtils.h, and the interleaving and value order won't matter, as long as the user just forwards the returned expr and values to makeComposedFoldedAffineApply.

nicolasvasilache accepted this revision.Aug 10 2023, 5:40 AM

I can change computeLinearIndex to return an AffineExpr plus a list of values, so it will be the user's responsibility to pass them to makeComposedFoldedAffineApply. We won't need anything in Dialect/Affine/ViewLikeInterfaceUtils.h, and the interleaving and value order won't matter, as long as the user just forwards the returned expr and values to makeComposedFoldedAffineApply.

SGTM, this follows how we do things in FoldMemRefAliasOps.cpp and is probably the shortest path indeed.

move to IndexingUtils, do not use affine dialect

This revision was automatically updated to reflect the committed changes.