This is an archive of the discontinued LLVM Phabricator instance.


Differential D154666

[MLIR][ROCDL] Add conversion for gpu.lane_id to ROCDL
ClosedPublic

Authored by sjw36 on Jul 6 2023, 4:25 PM.


Details

Reviewers

ftynse
ThomasRaoux
dcaballe
nicolasvasilache
herhut
krzysz00

Commits

rGcdf7ca6db76b: [MLIR][ROCDL] Add conversion for gpu.lane_id to ROCDL

Summary

Creates rocdl.lane_id op with llvm conversion to:

__device__ static unsigned int __lane_id() {
    return  __builtin_amdgcn_mbcnt_hi(
               -1, __builtin_amdgcn_mbcnt_lo(-1, 0));
}
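
To see why the two chained calls produce the lane index, here is a host-side C++ sketch that models the intrinsics. It assumes the usual description of `mbcnt.lo` / `mbcnt.hi`: count the bits of the mask that belong to lanes strictly below the current lane (in the low or high 32 lanes respectively) and add the second operand. The `sim*` functions and the explicit `lane` parameter are hypothetical stand-ins; on hardware the lane is implicit.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Bitmask with one bit set for every lane strictly below `lane` (0..63).
static uint64_t lanesBelow(unsigned lane) {
  assert(lane < 64);
  return lane == 0 ? 0 : (~uint64_t{0} >> (64 - lane));
}

// Hypothetical model of mbcnt.lo: counts mask bits among lanes 0..31 below `lane`.
static uint32_t simMbcntLo(uint32_t mask, uint32_t base, unsigned lane) {
  uint32_t below = static_cast<uint32_t>(lanesBelow(lane));
  return base + static_cast<uint32_t>(std::bitset<32>(mask & below).count());
}

// Hypothetical model of mbcnt.hi: counts mask bits among lanes 32..63 below `lane`.
static uint32_t simMbcntHi(uint32_t mask, uint32_t base, unsigned lane) {
  uint32_t below = static_cast<uint32_t>(lanesBelow(lane) >> 32);
  return base + static_cast<uint32_t>(std::bitset<32>(mask & below).count());
}

// Mirrors the summary: mbcnt_hi(-1, mbcnt_lo(-1, 0)).
static uint32_t simLaneId(unsigned lane) {
  return simMbcntHi(~0u, simMbcntLo(~0u, 0, lane), lane);
}
```

With an all-ones mask, every lane below the current one contributes exactly one bit, so the sum of the two counts is the lane index itself.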

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

sjw36 created this revision. · Jul 6 2023, 4:25 PM

Herald added a reviewer: ftynse. · Jul 6 2023, 4:25 PM

Herald added a reviewer: ThomasRaoux.

Herald added a reviewer: dcaballe.

Herald added a project: Restricted Project.

Herald added subscribers: gysit, Dinistro, bviyer and 25 others.

sjw36 requested review of this revision. · Jul 6 2023, 4:25 PM

Herald added a reviewer: nicolasvasilache. · Jul 6 2023, 4:25 PM

Herald added a reviewer: herhut.

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

Harbormaster completed remote builds in B243614: Diff 537925. · Jul 6 2023, 7:06 PM

As a high-level note, wouldn't it make more sense to define a rocdl wrapper around the mbcnt intrinsic and then rewrite to that, so that we're not hiding a substantial bit of translation in the LLVM IR builder? I've generally seen the rocdl dialect as the place for 1:1 wrappers around LLVM functionality.

Second, as a minor note, once you have the final lane_id value, would it be possible to put range metadata on it? Probably a conservative [0, 63], but even that will allow for optimizations.
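
As a rough illustration of what such a range buys: LLVM's `!range` metadata uses half-open intervals, so the conservative [0, 63] bound would be spelled `[0, 64)`. Once that is known, sign-extending the lane id gives the same result as zero-extending it, and a mask such as `& 63` is a no-op. The C++ sketch below only checks those identities exhaustively; the names are hypothetical and nothing here is LLVM API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bounds mirroring a half-open range of [0, 64).
constexpr int32_t kLaneLo = 0;
constexpr int32_t kLaneHi = 64;

static int64_t signExtend(int32_t v) { return static_cast<int64_t>(v); }

static int64_t zeroExtend(int32_t v) {
  return static_cast<int64_t>(static_cast<uint32_t>(v));
}

// True when, for every value in the assumed range, sext == zext and
// masking with 63 leaves the value unchanged.
static bool rangeFactsHold() {
  for (int32_t v = kLaneLo; v < kLaneHi; ++v) {
    if (signExtend(v) != zeroExtend(v))
      return false;
    if ((v & 63) != v)
      return false;
  }
  return true;
}
```

Outside the range the identities break down (for example, -1 sign-extends and zero-extends to different 64-bit values), which is why the optimizer needs the metadata to apply them.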

In D154666#4480824, @krzysz00 wrote:

I've generally seen the rocdl dialect as the place for 1:1 wrappers around LLVM functionality.

Let's not keep doing that. Just call intrinsics directly. All of those straight intrinsic wrappers introduce more trouble.

In D154666#4480836, @arsenm wrote:

In D154666#4480824, @krzysz00 wrote:

I've generally seen the rocdl dialect as the place for 1:1 wrappers around LLVM functionality.

Let's not keep doing that. Just call intrinsics directly. All of those straight intrinsic wrappers introduce more trouble.

I missed the "dialect" part here. Just don't call the ockl wrapper for this.

The ROCDL dialect is meant to directly represent the AMDGPU-specific intrinsics in LLVM IR within MLIR, and to work with the LLVM dialect (which is LLVM-IR-in-MLIR that you can run through a simple translation layer)

Updated for review feedback.

In the future, the backend should provide a lane_id intrinsic if the HW ever adds one.

This overall design works, but I've got a few minor nitpicks.

mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td
89	Would it be possible to instead explicitly spell out the arguments and the result type, given that they're known from the LLVM?
mlir/test/Target/LLVMIR/rocdl.mlir
60	Could we get a variable capture in this? `%[[loCount:.+]] = call i32 ...`
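
The point of the capture is to tie the second check to the value produced by the first call, rather than accepting any operand. The sketch below mimics that idea with `std::regex` in C++. It is not FileCheck, and `checkLaneIdIR` is a hypothetical helper written only for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true when `hiLine` calls mbcnt.hi on exactly the SSA value that
// `loLine` defines with mbcnt.lo (the role a FileCheck capture would play).
static bool checkLaneIdIR(const std::string &loLine,
                          const std::string &hiLine) {
  static const std::regex loRe(
      R"re(%(\S+) = call i32 @llvm\.amdgcn\.mbcnt\.lo\(i32 -1, i32 0\))re");
  std::smatch m;
  if (!std::regex_search(loLine, m, loRe))
    return false;
  // Reuse the captured name, as a [[loCount]] reference would.
  const std::string expected =
      "call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %" + m[1].str() + ")";
  return hiLine.find(expected) != std::string::npos;
}
```

A wildcard such as `%{{.*}}` would accept any operand on the second line; the capture rejects output where the two calls are not actually chained.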

krzysz00 added a reviewer: krzysz00. · Jul 24 2023, 7:11 AM

sjw36 added inline comments. · Jul 24 2023, 1:01 PM

mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td
89	Hoping this is short lived and very limited usage. So I am inclined to leave it.

krzysz00 added inline comments. · Jul 24 2023, 2:30 PM

mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td
89	I'm thinking this might have some other use in the future we don't know about yet so let's do it right

Updated per review.

Overall, looks good.

mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td
91	Nit: We don't need to list the argument types here since they're statically known, but this is a weird rare intrinsic so it's fine.

This revision is now accepted and ready to land. · Jul 25 2023, 10:57 AM

Harbormaster completed remote builds in B248015: Diff 544005. · Jul 25 2023, 7:02 PM

This revision was landed with ongoing or failed builds. · Jul 26 2023, 8:13 AM

Closed by commit rGcdf7ca6db76b: [MLIR][ROCDL] Add conversion for gpu.lane_id to ROCDL (authored by SJW <swaters@amd.com>, committed by krzysz00).

This revision was automatically updated to reflect the committed changes.

krzysz00 added a commit: rGcdf7ca6db76b: [MLIR][ROCDL] Add conversion for gpu.lane_id to ROCDL.

krzysz00 mentioned this in D157228: Add Lowerings for GPU WMMA F16/F32 ops to ROCDL dialect. · Aug 6 2023, 7:52 AM

Revision Contents

Path                                                        Size
mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td                17 lines
mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp    34 lines
mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl.mlir           14 lines
mlir/test/Target/LLVMIR/rocdl.mlir                          10 lines

Diff 543104

mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td

[48 lines not shown]
 //===----------------------------------------------------------------------===//
 // ROCDL op definitions
 //===----------------------------------------------------------------------===//

 class ROCDL_Op<string mnemonic, list<Trait> traits = []> :
   LLVM_OpBase<ROCDL_Dialect, mnemonic, traits> {
 }

+class ROCDL_IntrPure1Op<string mnemonic> :
+  LLVM_IntrOpBase<ROCDL_Dialect, mnemonic,
+  "amdgcn_" # !subst(".", "_", mnemonic), [], [], [Pure], 1>;
+
 //===----------------------------------------------------------------------===//
 // ROCDL special register op definitions
 //===----------------------------------------------------------------------===//

 class ROCDL_SpecialRegisterOp<string mnemonic,
     list<Trait> traits = []> :
   ROCDL_Op<mnemonic, !listconcat(traits, [Pure])>,
   Results<(outs LLVM_Type:$res)>, Arguments<(ins)> {
   string llvmBuilder = "$res = createIntrinsicCallWithRange(builder,"
     # "llvm::Intrinsic::amdgcn_" # !subst(".","_", mnemonic)
     # ", op->getAttrOfType<::mlir::DenseI32ArrayAttr>(\"range\"));";
   let assemblyFormat = "attr-dict `:` type($res)";
 }

 class ROCDL_DeviceFunctionOp<string mnemonic, string device_function,
     int parameter, list<Trait> traits = []> :
   ROCDL_Op<mnemonic, !listconcat(traits, [Pure])>,
   Results<(outs LLVM_Type:$res)>, Arguments<(ins)> {
   string llvmBuilder = "$res = createDeviceFunctionCall(builder, \""
     # device_function # "\", " # parameter # ");";
   let assemblyFormat = "attr-dict `:` type($res)";
 }

 //===----------------------------------------------------------------------===//
+// Wave-level primitives
+
+class ROCDL_MbcntOp<string mnemonic> :
+    ROCDL_IntrPure1Op<"mbcnt." # mnemonic>,
+  Arguments<(ins Variadic<LLVM_Type>:$args)> {
+  let assemblyFormat =
+    "$args attr-dict `:` functional-type($args, $res)";
+}
+
+def ROCDL_MbcntLoOp : ROCDL_MbcntOp<"lo">;
+def ROCDL_MbcntHiOp : ROCDL_MbcntOp<"hi">;
+
+//===----------------------------------------------------------------------===//
 // Thread index and Block index

 def ROCDL_ThreadIdXOp : ROCDL_SpecialRegisterOp<"workitem.id.x">;
 def ROCDL_ThreadIdYOp : ROCDL_SpecialRegisterOp<"workitem.id.y">;
 def ROCDL_ThreadIdZOp : ROCDL_SpecialRegisterOp<"workitem.id.z">;

 def ROCDL_BlockIdXOp : ROCDL_SpecialRegisterOp<"workgroup.id.x">;
 def ROCDL_BlockIdYOp : ROCDL_SpecialRegisterOp<"workgroup.id.y">;
[283 lines not shown]

mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp

[54 lines not shown] static bool canBeCalledWithBarePointers(gpu::GPUFuncOp func) {
   bool canBeBare = true;
   for (Type type : func.getArgumentTypes())
     if (auto memrefTy = dyn_cast<BaseMemRefType>(type))
       canBeBare &= LLVMTypeConverter::canConvertToBarePtr(memrefTy);
   return canBeBare;
 }

 namespace {
+struct GPULaneIdOpToROCDL : ConvertOpToLLVMPattern<gpu::LaneIdOp> {
+  using ConvertOpToLLVMPattern<gpu::LaneIdOp>::ConvertOpToLLVMPattern;
+
+  LogicalResult
+  matchAndRewrite(gpu::LaneIdOp op, gpu::LaneIdOp::Adaptor adaptor,
+                  ConversionPatternRewriter &rewriter) const override {
+    auto loc = op->getLoc();
+    MLIRContext *context = rewriter.getContext();
+    // convert to: %mlo = call @llvm.amdgcn.mbcnt.lo(-1, 0)
+    // followed by: %lid = call @llvm.amdgcn.mbcnt.hi(-1, %mlo)
+
+    Type intTy = IntegerType::get(context, 32);
+    Value zero = rewriter.createOrFold<arith::ConstantIntOp>(loc, 0, 32);
+    Value minus1 = rewriter.createOrFold<arith::ConstantIntOp>(loc, -1, 32);
+    Value mbcntLo =
+        rewriter.create<ROCDL::MbcntLoOp>(loc, intTy, ValueRange{minus1, zero});
+    Value laneId = rewriter.create<ROCDL::MbcntHiOp>(
+        loc, intTy, ValueRange{minus1, mbcntLo});
+    // Truncate or extend the result depending on the index bitwidth specified
+    // by the LLVMTypeConverter options.
+    const unsigned indexBitwidth = getTypeConverter()->getIndexTypeBitwidth();
+    if (indexBitwidth > 32) {
+      laneId = rewriter.create<LLVM::SExtOp>(
+          loc, IntegerType::get(context, indexBitwidth), laneId);
+    } else if (indexBitwidth < 32) {
+      laneId = rewriter.create<LLVM::TruncOp>(
+          loc, IntegerType::get(context, indexBitwidth), laneId);
+    }
+    rewriter.replaceOp(op, {laneId});
+    return success();
+  }
+};
+
 /// Import the GPU Ops to ROCDL Patterns.
 #include "GPUToROCDL.cpp.inc"

 // A pass that replaces all occurrences of GPU device operations with their
 // corresponding ROCDL equivalent.
 //
 // This pass only handles device code and is not meant to be run on GPU host
[164 lines not shown] patterns.add<GPUFuncOpLowering>(
           ROCDL::ROCDLDialect::getKernelFuncAttrName()));
   if (Runtime::HIP == runtime) {
     patterns.add<GPUPrintfOpToHIPLowering>(converter);
   } else if (Runtime::OpenCL == runtime) {
     // Use address space = 4 to match the OpenCL definition of printf()
     patterns.add<GPUPrintfOpToLLVMCallLowering>(converter, /*addressSpace=*/4);
   }

+  patterns.add<GPULaneIdOpToROCDL>(converter);
+
   populateOpPatterns<math::AbsFOp>(converter, patterns, "__ocml_fabs_f32",
                                    "__ocml_fabs_f64");
   populateOpPatterns<math::AtanOp>(converter, patterns, "__ocml_atan_f32",
                                    "__ocml_atan_f64");
   populateOpPatterns<math::Atan2Op>(converter, patterns, "__ocml_atan2_f32",
                                     "__ocml_atan2_f64");
   populateOpPatterns<math::CbrtOp>(converter, patterns, "__ocml_cbrt_f32",
                                    "__ocml_cbrt_f64");
[44 lines not shown]
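
The tail of `GPULaneIdOpToROCDL` adapts the 32-bit intrinsic result to the configured index width. A plain C++ analogue of that decision is sketched below, assuming only the three cases visible in the pattern (wider: sign-extend, narrower: truncate, equal: unchanged). `adjustToIndexWidth` is a hypothetical name, and the result is returned in a 64-bit container with the bits above the index width cleared.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: models sext/trunc of a 32-bit lane id to an
// index type of `indexBitwidth` bits (1..64).
static uint64_t adjustToIndexWidth(int32_t laneId, unsigned indexBitwidth) {
  assert(indexBitwidth >= 1 && indexBitwidth <= 64);
  if (indexBitwidth > 32) {
    // Sign-extend to 64 bits, then keep only the low `indexBitwidth` bits.
    uint64_t wide = static_cast<uint64_t>(static_cast<int64_t>(laneId));
    return indexBitwidth == 64
               ? wide
               : wide & ((uint64_t{1} << indexBitwidth) - 1);
  }
  if (indexBitwidth < 32) {
    // Truncate: drop everything above the low `indexBitwidth` bits.
    return static_cast<uint32_t>(laneId) &
           ((uint32_t{1} << indexBitwidth) - 1);
  }
  // Exactly 32 bits: the value is used as-is.
  return static_cast<uint32_t>(laneId);
}
```

For real lane ids (0 to 63) all three branches return the same number; the branches only differ for values outside that range, which is one reason the range-metadata suggestion in the review matters.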

mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl.mlir

 // RUN: mlir-opt %s -convert-gpu-to-rocdl='use-opaque-pointers=1' -split-input-file | FileCheck %s
 // RUN: mlir-opt %s -convert-gpu-to-rocdl='index-bitwidth=32 use-opaque-pointers=1' -split-input-file | FileCheck --check-prefix=CHECK32 %s

 gpu.module @test_module {
   // CHECK-LABEL: func @gpu_index_ops()
   // CHECK32-LABEL: func @gpu_index_ops()
   func.func @gpu_index_ops()
       -> (index, index, index, index, index, index,
-          index, index, index, index, index, index) {
+          index, index, index, index, index, index,
+          index) {
     // CHECK32-NOT: = llvm.sext %{{.*}} : i32 to i64

     // CHECK: rocdl.workitem.id.x : i32
     // CHECK: = llvm.sext %{{.*}} : i32 to i64
     %tIdX = gpu.thread_id x
     // CHECK: rocdl.workitem.id.y : i32
     // CHECK: = llvm.sext %{{.*}} : i32 to i64
     %tIdY = gpu.thread_id y
[26 lines not shown] func.func @gpu_index_ops()
     %gDimX = gpu.grid_dim x
     // CHECK: rocdl.grid.dim.y : i32
     // CHECK: = llvm.sext %{{.*}} : i32 to i64
     %gDimY = gpu.grid_dim y
     // CHECK: rocdl.grid.dim.z : i32
     // CHECK: = llvm.sext %{{.*}} : i32 to i64
     %gDimZ = gpu.grid_dim z

+    // CHECK: = rocdl.mbcnt.lo %{{.*}}, %{{.*}} : (i32, i32) -> i32
+    // CHECK: = rocdl.mbcnt.hi %{{.*}}, %{{.*}} : (i32, i32) -> i32
+    // CHECK: = llvm.sext %{{.*}} : i32 to i64
+    %laneId = gpu.lane_id
+
     func.return %tIdX, %tIdY, %tIdZ, %bDimX, %bDimY, %bDimZ,
-               %bIdX, %bIdY, %bIdZ, %gDimX, %gDimY, %gDimZ
+               %bIdX, %bIdY, %bIdZ, %gDimX, %gDimY, %gDimZ,
+               %laneId
         : index, index, index, index, index, index,
-          index, index, index, index, index, index
+          index, index, index, index, index, index,
+          index
   }
 }

 // -----

 gpu.module @test_module {
   // CHECK-LABEL: func @gpu_index_ops_range()
   // CHECK-SAME: rocdl.flat_work_group_size = "1536,1536"
[432 lines not shown]

mlir/test/Target/LLVMIR/rocdl.mlir

[50 lines not shown] attributes {rocdl.kernel,
     rocdl.flat_work_group_size = "128,128",
     rocdl.reqd_work_group_size = array<i32: 16, 4, 2>} {
   // CHECK-LABEL: amdgpu_kernel void @known_block_sizes()
   // CHECK: #[[$KNOWN_BLOCK_SIZE_ATTRS:[0-9]+]]
   // CHECK: !reqd_work_group_size ![[$REQD_WORK_GROUP_SIZE:[0-9]+]]
   llvm.return
 }

+llvm.func @rocdl.lane_id() -> i32 {
+  // CHECK: call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
+  // CHECK-NEXT: call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %{{.*}})
+  %0 = llvm.mlir.constant(-1 : i32) : i32
+  %1 = llvm.mlir.constant(0 : i32) : i32
+  %2 = rocdl.mbcnt.lo %0, %1 : (i32, i32) -> i32
+  %3 = rocdl.mbcnt.hi %0, %2 : (i32, i32) -> i32
+  llvm.return %3 : i32
+}
+
 llvm.func @rocdl.barrier() {
   // CHECK: fence syncscope("workgroup") release
   // CHECK-NEXT: call void @llvm.amdgcn.s.barrier()
   // CHECK-NEXT: fence syncscope("workgroup") acquire
   rocdl.barrier
   llvm.return
 }

[307 lines not shown]