This is an archive of the discontinued LLVM Phabricator instance.

[mlir][nvgpu] add simple pipelining for shared memory copies
ClosedPublic

Authored by ftynse on Jul 13 2023, 11:01 AM.

Details

Summary

Add a simple transform operation to the NVGPU extension that performs
software pipelining of copies to shared memory. The functionality is
extremely minimalistic in this version and only supports copies from
global to shared memory inside an scf.for loop with either
vector.transfer or nvgpu.device_async_copy operations when
pipelining preconditions are already satisfied in the IR. This is the
minimally useful version that uses the more general loop pipeliner in an
NVGPU-specific way. Further extensions and orthogonalizations will be
necessary.

This required a change to the loop pipeliner itself to properly
propagate errors should the predicate generator fail.

This is loosely inspired by the version in IREE, but makes fewer unsafe
assumptions and communicates its decisions in a more principled way.
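
A rough sketch of the hook contract involved (illustrative names only; the authoritative signatures are in mlir/include/mlir/Dialect/SCF/Transforms/Patterns.h):

#include "mlir/Dialect/NVGPU/IR/NVGPUDialect.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Sketch: a predicate generator returns the predicated replacement op,
// or nullptr when it cannot predicate the given op. Previously the
// pipeliner could not surface that failure; with this change it bails
// out and reports failure instead of emitting invalid IR.
static Operation *predicateSketch(RewriterBase &rewriter, Operation *op,
                                  Value predicate) {
  if (auto copyOp = dyn_cast<nvgpu::DeviceAsyncCopyOp>(op)) {
    // ... rewrite copyOp so it executes only when predicate is true ...
    return copyOp;
  }
  return nullptr; // Propagated to the caller as a pipelining failure.
}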

Diff Detail

Event Timeline

ftynse created this revision. Jul 13 2023, 11:01 AM
Herald added a project: Restricted Project. Jul 13 2023, 11:01 AM
ftynse requested review of this revision. Jul 13 2023, 11:01 AM

Great, thanks for starting the rationalizing / tech debt removal effort around this important transformation!

mlir/include/mlir/Dialect/NVGPU/TransformOps/NVGPUTransformOps.td
34

nit: for a load into shared memory?

51

Description seems out of date; I see definite failures in the impl.

56

+1

mlir/include/mlir/Dialect/SCF/Transforms/Patterns.h
36

grammo: to a value that captures whether?

mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
98

Even if trivial, I like the bullet points in getPipelineStages; can we rewrite them as:

Specifically, this collects:
1. all loads from global memory, both sync and async
2. the barriers for async loads
In particular, barriers are omitted if they do not dominate at least one async load for which there is not yet a barrier.

The "in particular" part seems significant to document and is not immediately clear from glancing over the code.

117

As always, C++ is cute and terrifying at the same time :)

144

nit: order of

147

Can we merge this condition and the one above and document it properly?

From reading the code I am inferring:

// If not a waitOp or if numGroups is already set, bail.
auto waitOp = dyn_cast<nvgpu::DeviceAsyncWaitOp>(op);
if (!waitOp || waitOp.getNumGroups())
  return;
160

isn't there an auto-tablegen'd method to do that rather than having to manipulate the attribute name?
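
(For reference, ODS generates typed accessors, so something like the following should work; a sketch — the exact generated setter name depends on the attribute definition, and depth is a placeholder:)

// Hypothetical sketch using the ODS-generated setter instead of
// manipulating the attribute by name via op->setAttr("numGroups", ...).
waitOp.setNumGroupsAttr(rewriter.getI32IntegerAttr(depth));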

168

add a case checking that (see the sketch below):

- operations in the backward slice of any stage0Ops are all at stage 0
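
Roughly, the property to check (a sketch; stage0Ops and stages are illustrative names, getBackwardSlice is the helper from mlir/Analysis/SliceAnalysis.h):

#include "mlir/Analysis/SliceAnalysis.h"

// Sketch: every op in the backward slice of a stage-0 op must itself
// be scheduled at stage 0, otherwise its operands are not yet
// available when the prologue runs it.
for (Operation *stage0Op : stage0Ops) {
  SetVector<Operation *> slice;
  getBackwardSlice(stage0Op, &slice);
  for (Operation *dep : slice)
    assert(stages.lookup(dep) == 0 && "backward slice op not at stage 0");
}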
173

Nit: add something along the lines of "Hook for the loop pipeliner used by the scheduleFn hook".

Even better, put all these under a special namespace nvgpu::pipelining_hooks to make it really clear what is what.

174

nit: opsWithPipelineStage would be more self-documenting

191

nit: iterate directly on dependencies rather than on forOp.getBody()->getOperations()

196

seems off; why would you want to replace an op with a predicated version when no predication is necessary?

198

Same thing re hook here.

216

nit: lines

302

add "try to set the peel_epilogue attribute" ?

mlir/test/Dialect/NVGPU/transform-pipeline-shared.mlir
10

can we add some comments to the test here to explain why we expect this outcome?
It is really not clear to me why we need to "peel the epilogue" here.

89

Can we spell out the select's operands and how they connect here?
Would be nice to see the difference between the two; this is new untested code in MLIR.
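
For context, the generic shape of how a pipeliner predicates a loop-carried value looks roughly like this (a sketch, not the actual code; all names are illustrative):

// Sketch: guard a loop-carried value by selecting the newly computed
// value when the predicate holds, and the previous iteration's value
// otherwise. This is where the selects in the test output come from.
Value predicated = rewriter.create<arith::SelectOp>(
    loc, predicate, newValue, previousValue);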

92

nit: load

92

can we also check the scf.yield here to "close the loop" ?

111

interestingly, I would have expected something to go wrong here without peeling:
we have both a predicated async_copy and a loop trip count not divisible by 4

yet this case does not require peeling but the first one does.

A few comments in the test here would help my past, present and future self.

This revision is now accepted and ready to land. Jul 14 2023, 12:58 AM
mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
58

Ok, now I am lost. I looked deeper into memory spaces and I see:

https://github.com/llvm/llvm-project/blob/5c5a1a2927118eb9cc51c7b98c959a30524cc491/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td#L79

says:

def GPU_AddressSpaceGlobal : I32EnumAttrCase<"Global", 1, "global">;
def GPU_AddressSpaceWorkgroup : I32EnumAttrCase<"Workgroup", 2, "workgroup">;
def GPU_AddressSpacePrivate : I32EnumAttrCase<"Private", 3, "private">;
def GPU_AddressSpaceEnum : GPU_I32Enum<
  "AddressSpace", "GPU address space", [
    GPU_AddressSpaceGlobal,
    GPU_AddressSpaceWorkgroup,
    GPU_AddressSpacePrivate
  ]>;

https://github.com/llvm/llvm-project/blob/5c5a1a2927118eb9cc51c7b98c959a30524cc491/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td#L60

says:

    /// Defines the MemRef memory space attribute numeric value that indicates
    /// a memref is located in global memory. This should correspond to the
    /// value used in NVVM.
    static constexpr unsigned kGlobaldMemoryAddressSpace = 1;

    /// Defines the MemRef memory space attribute numeric value that indicates
    /// a memref is located in shared memory. This should correspond to the
    /// value used in NVVM.
    static constexpr unsigned kSharedMemoryAddressSpace = 3;

whereas https://github.com/llvm/llvm-project/blob/5c5a1a2927118eb9cc51c7b98c959a30524cc491/mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td#L40

says:

/// The address space value that represents global memory.
static constexpr unsigned kGlobalMemoryAddressSpace = 1;
/// The address space value that represents shared memory.
static constexpr unsigned kSharedMemoryAddressSpace = 3;
/// The address space value that represents private memory.
static constexpr unsigned kPrivateMemoryAddressSpace = 5;

I don't know if there are others.

ftynse updated this revision to Diff 540962. Jul 17 2023, 5:10 AM
ftynse marked 23 inline comments as done.

Address review.

ftynse added inline comments. Jul 17 2023, 5:12 AM
mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
58

ROCDL is irrelevant here. The other two are the platform-agnostic and the NVVM-specific memory spaces. Given that conversion to NVVM happens later than this, I expect the platform-agnostic space to be used in the input.
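
For reference, checking the platform-agnostic space can look like this (a sketch; the gpu dialect attribute and enum are real, the helper name is illustrative):

#include "mlir/Dialect/GPU/IR/GPUDialect.h"

// Sketch: true if the memref lives in the platform-agnostic GPU
// workgroup (shared) memory space; conversion to NVVM later maps this
// to the numeric NVVM address space.
static bool isWorkgroupMemory(mlir::MemRefType type) {
  auto space = llvm::dyn_cast_or_null<mlir::gpu::AddressSpaceAttr>(
      type.getMemorySpace());
  return space && space.getValue() == mlir::gpu::AddressSpace::Workgroup;
}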

173

Putting them in a namespace makes them potentially accessible from different translation units, which we'd like to avoid.

191

I think the order in which operations are listed in opsWithPipelineStage matters, so I'd rather have a small overhead here and be sure the ops are in the same order as originally.

196

Because the pipeliner wants that... I suppose we can change it to require a custom three-state result: "replaced with op, keep the same, couldn't replace". Updated the doc to reflect this.
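
Illustratively, such a three-state result could look like this (hypothetical names, not the actual API):

// Hypothetical tri-state for the predication hook: keep the original
// op on KeepUnchanged, substitute the returned op on Replaced, and
// abort the whole transformation on Failure.
enum class PredicationStatus { Replaced, KeepUnchanged, Failure };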

mlir/test/Dialect/NVGPU/transform-pipeline-shared.mlir
10

The predication is only implemented for nvgpu.device_async_copy, not for transfer reads/writes.

111

This test does not require peeling because the code being pipelined already handles the "partial" loop iteration correctly from the start. There's no reason why pipelining should break it.

The NVGPUToNVVM bazel failure doesn't look related to this.

This revision was landed with ongoing or failed builds. Jul 17 2023, 7:29 AM
This revision was automatically updated to reflect the committed changes.