This is an archive of the discontinued LLVM Phabricator instance.

Differential D154094

[mlir][nvgpu] Add `mbarrier.arrive.expect_tx` and `mbarrier.try_wait.parity`
Closed, Public

Authored by guraypp on Jun 29 2023, 8:35 AM.

Details

Reviewers

qcolombet
nicolasvasilache
herhut
ftynse
bondhugula
dcaballe

Commits

rG836dbb8522ab: [mlir][nvgpu] Add `mbarrier.arrive.expect_tx` and `mbarrier.try_wait.parity`

Summary

This work adds two Ops:
`mbarrier.arrive.expect_tx` performs an expect_tx operation on an `mbarrier.barrier` and returns an `mbarrier.barrier.token`.
`mbarrier.try_wait.parity` waits on an `mbarrier.barrier` and an `mbarrier.barrier.token`.

`mbarrier.arrive.expect_tx` is one of the requirements for enabling H100 TMA support.

Depends on D154074 D154076 D154059 D154060

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

guraypp created this revision. Jun 29 2023, 8:35 AM

Herald added a project: Restricted Project. Jun 29 2023, 8:35 AM

Herald added subscribers: bviyer, Moerafaat, zero9178 and 24 others.

guraypp requested review of this revision. Jun 29 2023, 8:35 AM

Herald added a reviewer: nicolasvasilache. Jun 29 2023, 8:35 AM

Herald added a reviewer: herhut.

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache, jholewinski.

Harbormaster completed remote builds in B242108: Diff 535822. Jun 29 2023, 8:36 AM

manishucsd added a subscriber: manishucsd. Jun 29 2023, 1:10 PM

manishucsd added inline comments.

mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
476	Is `txcount` the number of elements or the number of bytes?

guraypp added inline comments. Jun 30 2023, 12:00 AM

mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td

476

It is in bytes; see the CUDA example here:
https://gist.github.com/grypp/5594a5036ffed6689ff3d7527b0a8370#file-tma_00_simple-cu-L131

This is from the PTX definition:

The optional qualifier .expect_tx specifies that an expect-tx operation is performed prior to the arrive-on operation. The 32-bit unsigned integer operand txCount specifies the expectCount argument to the expect-tx operation. When both qualifiers .arrive and .expect_tx are specified, then the count argument of the arrive-on operation is assumed to be 1.
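
The PTX wording quoted above can be sketched as a small host-side model. This is only an illustration and not part of this revision: `MBarrierModel` and its members are hypothetical names, and the completion rule (a phase completes once both the pending arrival count and the tx-count reach zero) is an assumption taken from the PTX mbarrier description rather than from this diff.

```cpp
#include <cassert>
#include <cstdint>

// Toy host-side model of a single mbarrier object. It does not reflect the
// hardware layout; all names here are hypothetical.
struct MBarrierModel {
  uint32_t expectedArrivals; // arrival count given at initialization
  uint32_t pendingArrivals;  // arrivals still missing in the current phase
  uint32_t txCount = 0;      // bytes of async transactions still outstanding
  uint32_t phase = 0;        // number of phases completed so far

  explicit MBarrierModel(uint32_t arrivals)
      : expectedArrivals(arrivals), pendingArrivals(arrivals) {}

  // expect-tx: the current phase must additionally track `bytes` bytes.
  void expectTx(uint32_t bytes) { txCount += bytes; }

  // complete-tx: an asynchronous copy reports that `bytes` bytes have landed.
  void completeTx(uint32_t bytes) {
    assert(bytes <= txCount && "more bytes completed than expected");
    txCount -= bytes;
    maybeCompletePhase();
  }

  // arrive-on with a count of 1.
  void arrive() {
    assert(pendingArrivals > 0 && "too many arrivals in this phase");
    --pendingArrivals;
    maybeCompletePhase();
  }

  // mbarrier.arrive.expect_tx: expect-tx first, then arrive with count 1.
  void arriveExpectTx(uint32_t bytes) {
    expectTx(bytes);
    arrive();
  }

private:
  // Assumed rule: the phase completes when both counters reach zero.
  void maybeCompletePhase() {
    if (pendingArrivals == 0 && txCount == 0) {
      ++phase;
      pendingArrivals = expectedArrivals;
    }
  }
};
```

In this model, a barrier initialized for two arrivals that receives `arriveExpectTx(256)` followed by `arrive()` stays in phase 0 until `completeTx(256)` is called.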

LGTM with nits.

mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
474	Could you add some description here to better explain what this is doing? Essentially what it means to "perform arrive with expect_tx". From what I understand, when the thread arrives at this point, we set the expect count of the barrier to `txcount` and produce a token. Later we'll wait on the produced token until the barrier is released. Feel free to reword/fix, but we need some kind of explanation :).
484	Ditto on the description.

This revision is now accepted and ready to land. Jun 30 2023, 1:21 AM

update the op information

Harbormaster completed remote builds in B242802: Diff 536770. Jul 3 2023, 7:43 AM

guraypp marked 2 inline comments as done and an inline comment as not done. Jul 3 2023, 7:50 AM

guraypp added inline comments.

mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td

476

I understand what you mean: the nvgpu dialect uses the number of elements, not bytes, and `nvgpu.device_async_copy` is an example of that.

I cannot make `txcount` a number of elements, because the Op does not take a memref from which I could calculate the bytes. See the Op usage below:

func @test(%barrier : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>) {
  %txcountBytes = arith.constant 256 : index
  %token = nvgpu.mbarrier.arrive.expect_tx %barrier, %txcountBytes : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>> -> !nvgpu.mbarrier.token
}

I could make the Op take a memref and get rid of `txcount`, but that would not match the PTX instruction.

func @test(%barrier : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>, %buffer: memref<128x128xf16, 3>) {
  %token = nvgpu.mbarrier.arrive.expect_tx %barrier, %buffer : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>, memref<128x128xf16, 3>
}
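
For reference, the byte count that a memref-based variant would have to derive can be sketched as follows. This is a hedged illustration in plain C++, not code from this revision: `bufferSizeInBytes` is a hypothetical helper, and it assumes a static shape, a dense layout, and a byte-aligned element type.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Bytes occupied by a statically shaped, densely packed buffer.
// Hypothetical helper for illustration only.
uint64_t bufferSizeInBytes(const std::vector<int64_t> &shape,
                           unsigned elementBitWidth) {
  assert(elementBitWidth % 8 == 0 && "sub-byte element types are not handled");
  uint64_t numElements = 1;
  for (int64_t dim : shape) {
    assert(dim >= 0 && "dynamic dimensions are not handled");
    numElements *= static_cast<uint64_t>(dim);
  }
  // Element count times bytes per element.
  return numElements * (elementBitWidth / 8);
}
```

Under these assumptions a `memref<128x128xf16, 3>` covers 128 * 128 * 2 = 32768 bytes, so `bufferSizeInBytes({128, 128}, 16)` gives the value that would otherwise be passed as `txcount`.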

Thanks for the added descriptions. Still LGTM :).

Couple of nits

mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
499	nit: is is
501	Nit: A suspended thread... or Suspended threads resume
mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
607	Add a newline at the end of the file.

update ptx and nvgpu ops

Herald added a reviewer: ftynse. Jul 20 2023, 3:26 AM

Herald added a reviewer: bondhugula.

Herald added a reviewer: dcaballe.

Herald added subscribers: gysit, Dinistro, awarzynski.

This revision was landed with ongoing or failed builds. Jul 20 2023, 4:48 AM

Closed by commit rG836dbb8522ab: [mlir][nvgpu] Add `mbarrier.arrive.expect_tx` and `mbarrier.try_wait.parity` (authored by guraypp).

This revision was automatically updated to reflect the committed changes.

guraypp added a commit: rG836dbb8522ab: [mlir][nvgpu] Add `mbarrier.arrive.expect_tx` and `mbarrier.try_wait.parity`.

Harbormaster completed remote builds in B246844: Diff 542413. Jul 20 2023, 5:28 AM

Revision Contents

Path	Size
mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td	58 lines
mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td	40 lines
mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp	64 lines
mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir	46 lines
mlir/test/Conversion/NVVMToLLVM/nvvm-to-llvm.mlir	32 lines

Diff 542426

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td

...
def NVVM_MBarrierArriveNocompleteSharedOp : NVVM_Op<"mbarrier.arrive.nocomplete.shared">,
Arguments<(ins LLVM_i64ptr_shared:$addr, I32:$count)> {		Arguments<(ins LLVM_i64ptr_shared:$addr, I32:$count)> {
string llvmBuilder = [{		string llvmBuilder = [{
$res = createIntrinsicCall(builder, llvm::Intrinsic::nvvm_mbarrier_arrive_noComplete_shared, {$addr, $count});		$res = createIntrinsicCall(builder, llvm::Intrinsic::nvvm_mbarrier_arrive_noComplete_shared, {$addr, $count});
}];		}];
let assemblyFormat = "$addr `,` $count attr-dict `:` type(operands) `->` type($res)";		let assemblyFormat = "$addr `,` $count attr-dict `:` type(operands) `->` type($res)";
}		}

def NVVM_MBarrierArriveExpectTxOp : NVVM_Op<"mbarrier.arrive.expect_tx",		def NVVM_MBarrierArriveExpectTxOp : NVVM_Op<"mbarrier.arrive.expect_tx",
[DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>,		[DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>,
Results<(outs LLVM_Type:$res)>,
Arguments<(ins LLVM_i64ptr_any:$addr, I32:$txcount)> {		Arguments<(ins LLVM_i64ptr_any:$addr, I32:$txcount)> {
let assemblyFormat = "$addr `,` $txcount attr-dict `:` type(operands) `->` type($res)";		let assemblyFormat = "$addr `,` $txcount attr-dict `:` type(operands)";
let extraClassDefinition = [{		let extraClassDefinition = [{
std::string $cppClass::getPtx() { return std::string("mbarrier.arrive.expect_tx.b64 %0, [%1], %2;"); }		std::string $cppClass::getPtx() { return std::string("mbarrier.arrive.expect_tx.b64 _, [%0], %1;"); }
}];		}];
}		}

def NVVM_MBarrierArriveExpectTxSharedOp : NVVM_Op<"mbarrier.arrive.expect_tx.shared",		def NVVM_MBarrierArriveExpectTxSharedOp : NVVM_Op<"mbarrier.arrive.expect_tx.shared",
[DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>,		[DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>,
Results<(outs LLVM_Type:$res)>,
Arguments<(ins LLVM_i64ptr_shared:$addr, I32:$txcount)> {		Arguments<(ins LLVM_i64ptr_shared:$addr, I32:$txcount)> {
let assemblyFormat = "$addr `,` $txcount attr-dict `:` type(operands) `->` type($res)";		let assemblyFormat = "$addr `,` $txcount attr-dict `:` type(operands)";
let extraClassDefinition = [{		let extraClassDefinition = [{
std::string $cppClass::getPtx() { return std::string("mbarrier.arrive.expect_tx.shared.b64 %0, [%1], %2;"); }		std::string $cppClass::getPtx() { return std::string("mbarrier.arrive.expect_tx.shared.b64 _, [%0], %1;"); }
}];		}];
}		}

def NVVM_MBarrierTryWaitParityOp : NVVM_Op<"mbarrier.try_wait.parity",		def NVVM_MBarrierTryWaitParityOp : NVVM_Op<"mbarrier.try_wait.parity",
[DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>,		[DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>,
Results<(outs LLVM_Type:$res)>,		Arguments<(ins LLVM_i64ptr_any:$addr, I32:$phase, I32:$ticks)> {
Arguments<(ins LLVM_i64ptr_any:$addr, LLVM_Type:$token)> {		let assemblyFormat = "$addr `,` $phase `,` $ticks attr-dict `:` type(operands)";
let assemblyFormat = "$addr `,` $token attr-dict `:` type(operands) `->` type($res)";
let extraClassDefinition = [{		let extraClassDefinition = [{
std::string $cppClass::getPtx() {		std::string $cppClass::getPtx() {
return std::string("{\n\t"		return std::string(
		"{\n\t"
".reg .pred P1; \n\t"		".reg .pred P1; \n\t"
"mbarrier.try_wait.parity.b64 P1, [%1], %2; \n\t"		"LAB_WAIT: \n\t"
"selp.b32 %0, 1, 0, P1; \n\t"		"mbarrier.try_wait.parity.b64 P1, [%0], %1, %2; \n\t"
"}");		"@P1 bra.uni DONE; \n\t"
		"bra.uni LAB_WAIT; \n\t"
		"DONE: \n\t"
		"}"
		);
}		}
}];		}];
}		}

def NVVM_MBarrierTryWaitParitySharedOp : NVVM_Op<"mbarrier.try_wait.parity.shared",		def NVVM_MBarrierTryWaitParitySharedOp : NVVM_Op<"mbarrier.try_wait.parity.shared",
[DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>,		[DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>,
Results<(outs LLVM_Type:$res)>,		Arguments<(ins LLVM_i64ptr_shared:$addr, I32:$phase, I32:$ticks)> {
Arguments<(ins LLVM_i64ptr_shared:$addr, LLVM_Type:$token)> {		let assemblyFormat = "$addr `,` $phase `,` $ticks attr-dict `:` type(operands)";
let assemblyFormat = "$addr `,` $token attr-dict `:` type(operands) `->` type($res)";
let extraClassDefinition = [{		let extraClassDefinition = [{
std::string $cppClass::getPtx() {		std::string $cppClass::getPtx() {
return std::string("{\n\t"		return std::string(
		"{\n\t"
".reg .pred P1; \n\t"		".reg .pred P1; \n\t"
"mbarrier.try_wait.parity.shared.b64 P1, [%1], %2; \n\t"		"LAB_WAIT: \n\t"
"selp.b32 %0, 1, 0, P1; \n\t"		"mbarrier.try_wait.parity.shared.b64 P1, [%0], %1, %2; \n\t"
"}");		"@P1 bra.uni DONE; \n\t"
		"bra.uni LAB_WAIT; \n\t"
		"DONE: \n\t"
		"}"
		);
}		}
}];		}];
}		}

def NVVM_MBarrierTestWaitOp : NVVM_Op<"mbarrier.test.wait">,		def NVVM_MBarrierTestWaitOp : NVVM_Op<"mbarrier.test.wait">,
Results<(outs LLVM_Type:$res)>,		Results<(outs LLVM_Type:$res)>,
Arguments<(ins LLVM_i64ptr_any:$addr, LLVM_Type:$state)> {		Arguments<(ins LLVM_i64ptr_any:$addr, LLVM_Type:$state)> {
string llvmBuilder = [{		string llvmBuilder = [{
...
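
The updated `getPtx()` strings wrap `mbarrier.try_wait.parity` in a retry loop (`LAB_WAIT` / `DONE`), so the op only falls through once the predicate is set. The control flow can be sketched in plain C++ as below; `waitUntilPhaseCompletes` and the `tryWait` callback are hypothetical stand-ins for illustration, not an NVVM or CUDA API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Mirrors the control flow of the inline PTX: retry a single try-wait attempt
// until it reports completion. `tryWait` stands in for one
// mbarrier.try_wait.parity attempt and returns the predicate P1.
// Returns the number of attempts that were needed.
uint32_t waitUntilPhaseCompletes(
    const std::function<bool(uint32_t, uint32_t)> &tryWait,
    uint32_t phaseParity, uint32_t ticks) {
  uint32_t attempts = 0;
  for (;;) {                                  // LAB_WAIT:
    ++attempts;
    bool done = tryWait(phaseParity, ticks);  // mbarrier.try_wait.parity P1, ...
    if (done)                                 // @P1 bra.uni DONE;
      break;
  }                                           // bra.uni LAB_WAIT;
  return attempts;                            // DONE:
}
```

A single attempt may return before the phase completes once its time limit expires, which is why the loop retries instead of reporting the predicate back to the caller as the earlier version of the op did.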

mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td

...
let description = [{
```		```
}];		}];
let arguments = (ins NVGPU_MBarrier:$barrier,		let arguments = (ins NVGPU_MBarrier:$barrier,
Index:$count);		Index:$count);
let results = (outs NVGPU_MBarrierToken:$token);		let results = (outs NVGPU_MBarrierToken:$token);
let assemblyFormat = "$barrier `,` $count attr-dict `:` type($barrier) `->` type($token)";		let assemblyFormat = "$barrier `,` $count attr-dict `:` type($barrier) `->` type($token)";
}		}

		def NVGPU_MBarrierArriveExpectTxOp : NVGPU_Op<"mbarrier.arrive.expect_tx", []> {
		let summary = "Performs expect_tx operation on the `nvgpu.mbarrier.arrive`";
		let description = [{
		A thread executing the Op performs an expect-tx operation on the mbarrier
		object at the location specified by the address operand $barrier. The
		expect-tx operation, with an $txcount argument, increases the tx-count of
		an mbarrier object by the value specified by $txcount. This makes the
		current phase of the mbarrier object to expect and track the completion of
		additional asynchronous transactions.

		The `$txCount` specifies the number of element to the expect-tx operation.

		Example:
		```mlir
		nvgpu.mbarrier.arrive.expect_tx %barrier, %ic0 : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
		```
		}];
		let arguments = (ins NVGPU_MBarrier:$barrier,
		Index:$txcount);
		let assemblyFormat = "$barrier `,` $txcount attr-dict `:` type($barrier)";
		}

		def NVGPU_MBarrierTryWaitParityOp : NVGPU_Op<"mbarrier.try_wait.parity", []> {
		let summary = "Waits for the `nvgpu.mbarrier` to complete its current phase.";
		let description = [{
		Checks whether the mbarrier object has completed the phase. It is is a
		potentially blocking instruction which tests for the completion of the
		phase. Suspended thread resumes execution when the specified phase completes
		OR before the phase completes following a system-dependent time limit.

		Example:
		```mlir
		nvgpu.mbarrier.try_wait.parity %barrier, %phase, %ticks : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
		```

		}];
		let arguments = (ins NVGPU_MBarrier:$barrier, Index:$phase, Index:$ticks);
		let assemblyFormat = "$barrier `,` $phase `,` $ticks attr-dict `:` type($barrier)";
		}

#endif // NVGPU		#endif // NVGPU

mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp

...

namespace mlir {		namespace mlir {
#define GEN_PASS_DEF_CONVERTNVGPUTONVVMPASS		#define GEN_PASS_DEF_CONVERTNVGPUTONVVMPASS
#include "mlir/Conversion/Passes.h.inc"		#include "mlir/Conversion/Passes.h.inc"
} // namespace mlir		} // namespace mlir

using namespace mlir;		using namespace mlir;

		/// GPU has 32 bit registers, this function truncates values when larger width
		/// is not needed.
		static Value truncToI32(ConversionPatternRewriter &rewriter, Location loc,
		Value value) {
		Type type = value.getType();
		assert(llvm::isa<IntegerType>(type) && "expected an integer Value");
		if (type.getIntOrFloatBitWidth() <= 32)
		return value;
		return rewriter.create<LLVM::TruncOp>(loc, rewriter.getI32Type(), value);
		}

/// Returns the type for the intrinsic given the vectorResultType of the		/// Returns the type for the intrinsic given the vectorResultType of the
/// `gpu.mma.sync` operation.		/// `gpu.mma.sync` operation.
static Type inferIntrinsicResultType(Type vectorResultType) {		static Type inferIntrinsicResultType(Type vectorResultType) {
MLIRContext *ctx = vectorResultType.getContext();		MLIRContext *ctx = vectorResultType.getContext();
auto a = cast<LLVM::LLVMArrayType>(vectorResultType);		auto a = cast<LLVM::LLVMArrayType>(vectorResultType);
auto f16x2Ty = LLVM::getFixedVectorType(Float16Type::get(ctx), 2);		auto f16x2Ty = LLVM::getFixedVectorType(Float16Type::get(ctx), 2);
auto i32Ty = IntegerType::get(ctx, 32);		auto i32Ty = IntegerType::get(ctx, 32);
auto i32x2Ty = LLVM::getFixedVectorType(i32Ty, 2);		auto i32x2Ty = LLVM::getFixedVectorType(i32Ty, 2);
...
matchAndRewrite(nvgpu::MBarrierTestWaitOp op, OpAdaptor adaptor,
} else {		} else {
rewriter.replaceOpWithNewOp<NVVM::MBarrierTestWaitOp>(		rewriter.replaceOpWithNewOp<NVVM::MBarrierTestWaitOp>(
op, retType, barrier, adaptor.getToken());		op, retType, barrier, adaptor.getToken());
}		}
return success();		return success();
}		}
};		};

		struct NVGPUMBarrierArriveExpectTxLowering
		: public ConvertOpToLLVMPattern<nvgpu::MBarrierArriveExpectTxOp> {
		using ConvertOpToLLVMPattern<
		nvgpu::MBarrierArriveExpectTxOp>::ConvertOpToLLVMPattern;

		LogicalResult
		matchAndRewrite(nvgpu::MBarrierArriveExpectTxOp op, OpAdaptor adaptor,
		ConversionPatternRewriter &rewriter) const override {
		Value barrier = getMbarrierPtr(rewriter, *getTypeConverter(),
		op.getBarrier(), adaptor.getBarrier());
		Value txcount = truncToI32(rewriter, op->getLoc(), adaptor.getTxcount());

		if (isMbarrierShared(op.getBarrier().getType())) {
		rewriter.replaceOpWithNewOp<NVVM::MBarrierArriveExpectTxSharedOp>(
		op, barrier, txcount);
		return success();
		}

		rewriter.replaceOpWithNewOp<NVVM::MBarrierArriveExpectTxOp>(op, barrier,
		txcount);
		return success();
		}
		};

		struct NVGPUMBarrierTryWaitParityLowering
		: public ConvertOpToLLVMPattern<nvgpu::MBarrierTryWaitParityOp> {
		using ConvertOpToLLVMPattern<
		nvgpu::MBarrierTryWaitParityOp>::ConvertOpToLLVMPattern;

		LogicalResult
		matchAndRewrite(nvgpu::MBarrierTryWaitParityOp op, OpAdaptor adaptor,
		ConversionPatternRewriter &rewriter) const override {
		Value barrier = getMbarrierPtr(rewriter, *getTypeConverter(),
		op.getBarrier(), adaptor.getBarrier());
		Value ticks = truncToI32(rewriter, op->getLoc(), adaptor.getTicks());
		Value phase = truncToI32(rewriter, op->getLoc(), adaptor.getPhase());

		if (isMbarrierShared(op.getBarrier().getType())) {
		rewriter.replaceOpWithNewOp<NVVM::MBarrierTryWaitParitySharedOp>(
		op, barrier, phase, ticks);
		return success();
		}

		rewriter.replaceOpWithNewOp<NVVM::MBarrierTryWaitParityOp>(op, barrier,
		phase, ticks);
		return success();
		}
		};

} // namespace		} // namespace

void mlir::populateNVGPUToNVVMConversionPatterns(LLVMTypeConverter &converter,		void mlir::populateNVGPUToNVVMConversionPatterns(LLVMTypeConverter &converter,
RewritePatternSet &patterns) {		RewritePatternSet &patterns) {
patterns.add<		patterns.add<
NVGPUMBarrierCreateLowering, // nvgpu.mbarrier.create		NVGPUMBarrierCreateLowering, // nvgpu.mbarrier.create
NVGPUMBarrierInitLowering, // nvgpu.mbarrier.init		NVGPUMBarrierInitLowering, // nvgpu.mbarrier.init
NVGPUMBarrierArriveLowering, // nvgpu.mbarrier.arrive		NVGPUMBarrierArriveLowering, // nvgpu.mbarrier.arrive
NVGPUMBarrierArriveNoCompleteLowering, // nvgpu.mbarrier.arrive.no_complete		NVGPUMBarrierArriveNoCompleteLowering, // nvgpu.mbarrier.arrive.no_complete
NVGPUMBarrierTestWaitLowering, // nvgpu.try_wait_parity		NVGPUMBarrierTestWaitLowering, // nvgpu.mbarrier.test_wait_parity
		NVGPUMBarrierTryWaitParityLowering, // nvgpu.mbarrier.try_wait_parity
		NVGPUMBarrierArriveExpectTxLowering, // nvgpu.mbarrier.arrive.expect_tx
MmaSyncOptoNVVM, MmaLdMatrixOpToNVVM, NVGPUAsyncCopyLowering,		MmaSyncOptoNVVM, MmaLdMatrixOpToNVVM, NVGPUAsyncCopyLowering,
NVGPUAsyncCreateGroupLowering, NVGPUAsyncWaitLowering,		NVGPUAsyncCreateGroupLowering, NVGPUAsyncWaitLowering,
NVGPUMmaSparseSyncLowering>(converter);		NVGPUMmaSparseSyncLowering>(converter);
}		}
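
`truncToI32` narrows integer operands such as `txcount`, `phase`, and `ticks` to 32 bits before they reach the NVVM ops. A plain C++ analogue, with hypothetical helper names and no MLIR dependency, shows what that narrowing preserves and what it drops.

```cpp
#include <cassert>
#include <cstdint>

// Keep only the low 32 bits, as an integer truncation does.
// Hypothetical helper for illustration only.
uint32_t truncateToU32(uint64_t value) { return static_cast<uint32_t>(value); }

// True when truncation to 32 bits leaves the value unchanged.
bool fitsInU32(uint64_t value) { return value <= UINT32_MAX; }
```

The constants used in the tests of this revision (256 bytes, 10000000 ticks) fit comfortably, while a value of 2^32 or more would lose its upper bits.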

mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir

...
func.func @mbarrier_nocomplete() {
%token = nvgpu.mbarrier.arrive.nocomplete %barrier, %count : !barrierType -> !tokenType		%token = nvgpu.mbarrier.arrive.nocomplete %barrier, %count : !barrierType -> !tokenType

// CHECK: %[[barPtr3:.+]] = llvm.extractvalue %[[barStr]][1] : !llvm.struct<(ptr<3>, ptr<3>, i64, array<1 x i64>, array<1 x i64>)>		// CHECK: %[[barPtr3:.+]] = llvm.extractvalue %[[barStr]][1] : !llvm.struct<(ptr<3>, ptr<3>, i64, array<1 x i64>, array<1 x i64>)>
// CHECK: nvvm.mbarrier.test.wait.shared %[[barPtr3]], %[[token]]		// CHECK: nvvm.mbarrier.test.wait.shared %[[barPtr3]], %[[token]]
%isDone = nvgpu.mbarrier.test.wait %barrier, %token : !barrierType, !tokenType		%isDone = nvgpu.mbarrier.test.wait %barrier, %token : !barrierType, !tokenType

func.return		func.return
}		}


		// -----
		!barrierType = !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
		!tokenType = !nvgpu.mbarrier.token

		// CHECK-LABEL: func @mbarrier_txcount
		func.func @mbarrier_txcount() {
		%num_threads = arith.constant 128 : index

		// CHECK: %[[barMemref:.+]] = memref.get_global @__mbarrier : memref<1xi64, 3>
		%barrier = nvgpu.mbarrier.create -> !barrierType

		// CHECK: %[[barStr:.+]] = builtin.unrealized_conversion_cast %[[barMemref]] : memref<1xi64, 3> to !llvm.struct<(ptr<3>, ptr<3>, i64, array<1 x i64>, array<1 x i64>)>
		// CHECK: %[[barPtr:.+]] = llvm.extractvalue %[[barStr]][1] : !llvm.struct<(ptr<3>, ptr<3>, i64, array<1 x i64>, array<1 x i64>)>
		// CHECK: nvvm.mbarrier.init.shared %[[barPtr]]
		nvgpu.mbarrier.init %barrier, %num_threads : !barrierType

		%c0 = arith.constant 0 : index
		%tidxreg = nvvm.read.ptx.sreg.tid.x : i32
		%tidx = arith.index_cast %tidxreg : i32 to index
		%cnd = arith.cmpi eq, %tidx, %c0 : index

		scf.if %cnd {
		%txcount = arith.constant 256 : index
		// CHECK: %[[barPtr2:.+]] = llvm.extractvalue %[[barStr]][1] : !llvm.struct<(ptr<3>, ptr<3>, i64, array<1 x i64>, array<1 x i64>)>
		// CHECK: nvvm.mbarrier.arrive.expect_tx.shared %[[barPtr2]]
		nvgpu.mbarrier.arrive.expect_tx %barrier, %txcount : !barrierType
		scf.yield
		} else {
		%txcount = arith.constant 0 : index
		// CHECK: %[[barPtr2:.+]] = llvm.extractvalue %[[barStr]][1] : !llvm.struct<(ptr<3>, ptr<3>, i64, array<1 x i64>, array<1 x i64>)>
		// CHECK: nvvm.mbarrier.arrive.expect_tx.shared %[[barPtr2]]
		nvgpu.mbarrier.arrive.expect_tx %barrier, %txcount : !barrierType
		scf.yield
		}


		%phase = arith.constant 0 : index
		%ticks = arith.constant 10000000 : index
		// CHECK: %[[barPtr3:.+]] = llvm.extractvalue %[[barStr]][1] : !llvm.struct<(ptr<3>, ptr<3>, i64, array<1 x i64>, array<1 x i64>)>
		// CHECK: nvvm.mbarrier.try_wait.parity.shared %[[barPtr3]]
		nvgpu.mbarrier.try_wait.parity %barrier, %phase, %ticks : !barrierType

		func.return
		}
		No newline at end of file

mlir/test/Conversion/NVVMToLLVM/nvvm-to-llvm.mlir

	// RUN: mlir-opt --convert-nvvm-to-llvm --split-input-file %s | FileCheck %s			// RUN: mlir-opt --convert-nvvm-to-llvm --split-input-file %s | FileCheck %s

	// CHECK-LABEL : @init_mbarrier_arrive_expect_tx			// CHECK-LABEL : @init_mbarrier_arrive_expect_tx
	llvm.func @init_mbarrier_arrive_expect_tx(%barrier : !llvm.ptr<3>, %txcount : i32) -> i64 {			llvm.func @init_mbarrier_arrive_expect_tx(%barrier : !llvm.ptr<3>, %txcount : i32) {
	//CHECK : llvm.inline_asm has_side_effects asm_dialect = att "mbarrier.arrive.expect_tx.shared.b64 $0, [$1], $2;", "=l,r,r" %{{.}}, %{{.}} : (!llvm.ptr<3>, i32) -> i64			//CHECK : llvm.inline_asm has_side_effects asm_dialect = att "mbarrier.arrive.expect_tx.shared.b64 _, [$0], $1;", "r,r"
	%res = nvvm.mbarrier.arrive.expect_tx.shared %barrier, %txcount : !llvm.ptr<3>, i32 -> i64			nvvm.mbarrier.arrive.expect_tx.shared %barrier, %txcount : !llvm.ptr<3>, i32
	llvm.return %res : i64			llvm.return
	}			}

	// CHECK-LABEL : @init_mbarrier_arrive_expect_tx_generic			// CHECK-LABEL : @init_mbarrier_arrive_expect_tx_generic
	llvm.func @init_mbarrier_arrive_expect_tx_generic(%barrier : !llvm.ptr, %txcount : i32)-> i64 {			llvm.func @init_mbarrier_arrive_expect_tx_generic(%barrier : !llvm.ptr, %txcount : i32) {
	// CHECK: llvm.inline_asm has_side_effects asm_dialect = att "mbarrier.arrive.expect_tx.b64 $0, [$1], $2;", "=l,l,r" %{{.}}, %{{.}} : (!llvm.ptr, i32) -> i64			// CHECK: llvm.inline_asm has_side_effects asm_dialect = att "mbarrier.arrive.expect_tx.b64 _, [$0], $1;", "l,r"
	%res = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount : !llvm.ptr, i32 -> i64			nvvm.mbarrier.arrive.expect_tx %barrier, %txcount : !llvm.ptr, i32
	llvm.return %res : i64			llvm.return
	}			}

	// CHECK-LABEL : @init_mbarrier_try_wait.parity.shared			// CHECK-LABEL : @init_mbarrier_try_wait.parity.shared
	llvm.func @init_mbarrier_try_wait_shared(%barrier : !llvm.ptr<3>, %token : i32) -> i32 {			llvm.func @init_mbarrier_try_wait_shared(%barrier : !llvm.ptr<3>, %ticks : i32, %phase : i32) {
	// CHECK : llvm.inline_asm has_side_effects asm_dialect = att "{\0A\09.reg .pred P1; \0A\09mbarrier.try_wait.parity.shared.b64 P1, [$1], $2; \0A\09selp.b32 $0, 1, 0, P1; \0A\09}", "=r,r,r" %{{.}}, %{{.}} : (!llvm.ptr<3>, i32) -> i32			// CHECK : llvm.inline_asm has_side_effects asm_dialect = att "{\0A\09.reg .pred P1; \0A\09LAB_WAIT: \0A\09mbarrier.try_wait.parity.shared.b64 P1, [$0], $1, $2; \0A\09@P1 bra.uni DONE; \0A\09bra.uni LAB_WAIT; \0A\09DONE: \0A\09}", "r,r,r"
	%res = nvvm.mbarrier.try_wait.parity.shared %barrier, %token : !llvm.ptr<3>, i32 -> i32			nvvm.mbarrier.try_wait.parity.shared %barrier, %phase, %ticks : !llvm.ptr<3>, i32, i32
	llvm.return %res : i32			llvm.return
	}			}

	// CHECK-LABEL : @init_mbarrier_try_wait.parity			// CHECK-LABEL : @init_mbarrier_try_wait.parity
	llvm.func @init_mbarrier_try_wait(%barrier : !llvm.ptr, %token : i32) -> i32{			llvm.func @init_mbarrier_try_wait(%barrier : !llvm.ptr, %ticks : i32, %phase : i32){
	// CHECK: llvm.inline_asm has_side_effects asm_dialect = att "{\0A\09.reg .pred P1; \0A\09mbarrier.try_wait.parity.b64 P1, [$1], $2; \0A\09selp.b32 $0, 1, 0, P1; \0A\09}", "=r,l,r" %{{.}}, %{{.}} : (!llvm.ptr, i32) -> i32			// CHECK : llvm.inline_asm has_side_effects asm_dialect = att "{\0A\09.reg .pred P1; \0A\09LAB_WAIT: \0A\09mbarrier.try_wait.parity.b64 P1, [$0], $1, $2; \0A\09@P1 bra.uni DONE; \0A\09bra.uni LAB_WAIT; \0A\09DONE: \0A\09}", "r,r,r"
	%res = nvvm.mbarrier.try_wait.parity %barrier, %token : !llvm.ptr, i32 -> i32			nvvm.mbarrier.try_wait.parity %barrier, %phase, %ticks : !llvm.ptr, i32, i32
	llvm.return %res : i32			llvm.return
	}			}

	// CHECK-LABEL : @async_cp			// CHECK-LABEL : @async_cp
	func.func @async_cp(%dst: !llvm.ptr<3>, %src: !llvm.ptr<1>) {			func.func @async_cp(%dst: !llvm.ptr<3>, %src: !llvm.ptr<1>) {
	// CHECK : nvvm.cp.async.shared.global %{{.}}, %{{.}}, 16, cache = ca : !llvm.ptr<3>, !llvm.ptr<1>			// CHECK : nvvm.cp.async.shared.global %{{.}}, %{{.}}, 16, cache = ca : !llvm.ptr<3>, !llvm.ptr<1>
	nvvm.cp.async.shared.global %dst, %src, 16, cache = ca : !llvm.ptr<3>, !llvm.ptr<1>			nvvm.cp.async.shared.global %dst, %src, 16, cache = ca : !llvm.ptr<3>, !llvm.ptr<1>
	// CHECK : nvvm.cp.async.shared.global %{{.}}, %{{.}}, 16, cache = cg : !llvm.ptr<3>, !llvm.ptr<1>			// CHECK : nvvm.cp.async.shared.global %{{.}}, %{{.}}, 16, cache = cg : !llvm.ptr<3>, !llvm.ptr<1>
	nvvm.cp.async.shared.global %dst, %src, 16, cache = cg : !llvm.ptr<3>, !llvm.ptr<1>			nvvm.cp.async.shared.global %dst, %src, 16, cache = cg : !llvm.ptr<3>, !llvm.ptr<1>
	...
