This is an archive of the discontinued LLVM Phabricator instance.

[mlir][AMDGPU] Add lds_barrier op
ClosedPublic

Authored by krzysz00 on Jul 11 2022, 4:12 PM.

Details

Summary

The lds_barrier op allows workgroups to wait at a barrier for
operations to/from their local data store (LDS) to complete without
incurring the performance penalties of a full memory fence.
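
For illustration, a hypothetical usage sketch (not taken from this revision; the function and buffer names are invented): the op sits between accesses to workgroup-local (LDS, memory space 3) memory that must not be reordered across the barrier.

```mlir
// Hypothetical example: workitems exchange values through an LDS
// buffer. The lds_barrier ensures every workitem's store completes
// before any workitem reads its neighbor's slot.
func.func @exchange(%buf: memref<256xf32, 3>, %me: index,
                    %nbr: index, %v: f32) -> f32 {
  memref.store %v, %buf[%me] : memref<256xf32, 3>
  amdgpu.lds_barrier
  %r = memref.load %buf[%nbr] : memref<256xf32, 3>
  return %r : f32
}
```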

Diff Detail

Event Timeline

krzysz00 created this revision.Jul 11 2022, 4:12 PM
Herald added a project: Restricted Project.Jul 11 2022, 4:12 PM
krzysz00 requested review of this revision.Jul 11 2022, 4:12 PM

Is this really necessary if we have https://reviews.llvm.org/D120544? A barrier has no requirement to wait for LDS and VMEM; it only does so currently because of a bug. Using inline asm like this seems like it will eventually cause problems, although I'm not too familiar with MLIR.

This is a single op that expands to both a waitcnt on LDS and a barrier. I can go digging tomorrow for when we used to lower this to some sort of fence and a barrier, but I recall (@whchung may have more detail) that using this bit of inline assembly to work around what I think was the lack of an LDS-only fence gave a noticeable performance increase.

It's not really the lack of an LDS-only fence that was the problem. You could use the intrinsics for a barrier and a waitcnt for example. The problem with that idea is the backend would always wait for both LDS and VM at barriers when it saw them. So you would need to use inline asm to make the barrier invisible which could cause other problems in the backend with combining waitcnt for example. There is a lot of desire to reland https://reviews.llvm.org/D120544 in the near future. After that, the inline asm will not be needed for gfx90a+ and gfx10+. So then you should probably consider not using the inline asm for those architectures.

Re the above, you mentioned gfx90a+ ... what about gfx908?

And even if there'll be a better way to do it in the future, this type of barrier+fence construct, however we implement it, is useful for kernel developers now. We can always swap out the implementation later once the compiler fix lands.
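
For context, the construct under discussion amounts to roughly the following GCN assembly pair (an assumption based on this thread; the exact waitcnt encoding may differ across architectures):

```asm
s_waitcnt lgkmcnt(0)  ; wait for all outstanding LDS (lgkm) operations
s_barrier             ; workgroup-wide execution barrier
```

Emitting the pair as opaque inline asm is what hides the barrier from the backend's waitcnt insertion, which is the trade-off being debated above.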

nirvedhmeshram added inline comments.Jul 12 2022, 8:15 AM
mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
257

Should the op have side effects?

krzysz00 added inline comments.Jul 12 2022, 11:51 AM
mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
257

I don't think so.

nirvedhmeshram added inline comments.Jul 13 2022, 2:22 PM
mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
257

So can we change this to /*has_side_effects=*/false then?

@kerbowa I'd like to still land this because the lds_barrier is a useful abstraction over whatever combination of (fence + barrier)/inline assembly/... you end up needing to use to implement it.

mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
257

... right, that field. has_side_effects will get the compiler to not throw this out on the grounds that it doesn't write any registers. So yeah, in some sense, it does have side effects (because it needs to stay put).

I'm not objecting to the change, just pointing out that you may miss out on some optimizations since this is lowered to inline asm, and that you may want to lower it to intrinsics in the future.

mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
257

I think it should definitely be marked as having side effects. It may be scheduled around memops if it isn't. The backend cannot see what instructions are inside inline asm.

nirvedhmeshram accepted this revision.Jul 14 2022, 9:21 AM
This revision is now accepted and ready to land.Jul 14 2022, 9:21 AM
This revision was automatically updated to reflect the committed changes.