This is an archive of the discontinued LLVM Phabricator instance.

Differential D126444

[mlir]Implement SoftwareBF16 to handle the bf16 type
Needs Review · Public

Authored by yiqian1 on May 25 2022, 9:00 PM.


Details

Reviewers

ftynse
ThomasRaoux
herhut
nicolasvasilache
dcaballe

Summary

Some LLVM targets such as AMDGPU and X86 do not support bfloat types.
Add a SoftwareBF16 pass to support the bf16 type on such targets. This
pass replaces all bf16 by i16 and then replaces operations on bf16 by
f32 operations with extended operands and/or truncated results.
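
The extend/compute/truncate scheme described above can be sketched in standalone C++. This is only an illustration of the general idea, not the code in this patch: the names `bf16_bits`, `bf16_to_f32`, `f32_to_bf16` and `bf16_add` are hypothetical, the truncation uses the conventional round-to-nearest-even bias, and NaN inputs are deliberately not special-cased.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical alias: bf16 values are carried as raw 16-bit integers,
// mirroring the "replace bf16 by i16" idea from the summary.
using bf16_bits = std::uint16_t;

// Extend: bf16 is the upper half of an IEEE-754 binary32 value, so widening
// is a zero-extend, a shift left by 16, and a bit reinterpretation.
inline float bf16_to_f32(bf16_bits h) {
  std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// Truncate with round-to-nearest, ties-to-even. Assumption: NaN inputs are
// not handled specially in this sketch.
inline bf16_bits f32_to_bf16(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  std::uint32_t lsb = (bits >> 16) & 1u; // lowest bit that survives
  bits += 0x7FFFu + lsb;                 // bias so that ties go to even
  return static_cast<bf16_bits>(bits >> 16);
}

// An emulated bf16 add: extend both operands, add in f32, truncate.
inline bf16_bits bf16_add(bf16_bits a, bf16_bits b) {
  return f32_to_bf16(bf16_to_f32(a) + bf16_to_f32(b));
}
```

For example, `bf16_add(0x3F80, 0x4000)` adds the bf16 encodings of 1.0 and 2.0. Every other arithmetic operation follows the same three-step shape, which is why a single templated pattern can cover most of them.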

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

yiqian1 created this revision. May 25 2022, 9:00 PM

Herald added a reviewer: ftynse. May 25 2022, 9:00 PM

Herald added a reviewer: ThomasRaoux.

Herald added a project: Restricted Project.

Herald added subscribers: bzcheeseman, kosarev, awarzynski and 26 others.

yiqian1 requested review of this revision. May 25 2022, 9:00 PM

Herald added a reviewer: herhut. May 25 2022, 9:00 PM

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

Harbormaster completed remote builds in B166409: Diff 432186. May 25 2022, 9:15 PM

FreddyYe added a subscriber: FreddyYe. May 26 2022, 1:35 AM

krzysz00 added a subscriber: krzysz00. May 26 2022, 1:56 PM

A part of the motivation that is missing here (but was mentioned in the RFC) is why do this at this level rather than in LLVM? E.g., type conversions for backends not supporting specific types are already in LLVM, with all kinds of helpers showing how one can generally lower by combining overflow behavior of compare odd with add with carry op (as an example). It feels like one would need to reimplement some of those expansions here if done at this level. Given it was mentioned, was this discussed and evaluated?

Herald added a subscriber: jsji. Jun 1 2022, 2:50 PM

In D126444#3551698, @jpienaar wrote:

A part of the motivation that is missing here (but was mentioned in the RFC) is why do this at this level rather than in LLVM? E.g., type conversions for backends not supporting specific types are already in LLVM, with all kinds of helpers showing how one can generally lower by combining overflow behavior of compare odd with add with carry op (as an example). It feels like one would need to reimplement some of those expansions here if done at this level. Given it was mentioned, was this discussed and evaluated?

+1. This pass also seems to be making some opinionated stances on how the conversions should take place, what types to change to, etc. It isn't clear to me why this shouldn't be a part of backend legalization/isel.

I'll admit to not having looked into the possibility of doing this in LLVM in any particular detail, especially due to my rather limited knowledge of what infrastructure is available for backend legalization.

(I looked around and found some old discussions with @rampitec that I incorrectly remembered as implying doing things in the backend would be hard, which is probably why I didn't look into it)

Doing this as an LLVM backend thing may well be the right call, and it's surprising it hasn't already been done, given that the following LLVM IR

define void @test(bfloat* %0, bfloat* %1) {
  %3 = load bfloat, bfloat* %0, align 2
  store bfloat %3, bfloat* %1, align 2
  ret void
}

gives the following results when run through llc -march=x86-64

LLVM ERROR: Cannot select: t8: ch = store<(store (s16) into %ir.1)> t7:1, t7, t4, undef:i64
  t7: bf16,ch = load<(load (s16) from %ir.0)> t0, t2, undef:i64
    t2: i64,ch = CopyFromReg t0, Register:i64 %0
      t1: i64 = Register %0
    t6: i64 = undef
  t4: i64,ch = CopyFromReg t0, Register:i64 %1
    t3: i64 = Register %1
  t6: i64 = undef
In function: test

In D126444#3551836, @krzysz00 wrote:

I'll admit to not having looked into the possibility of doing this in LLVM in any particular detail, especially due to my rather limited knowledge of what infrastructure is available for backend legalization.

(I looked around and found some old discussions with @rampitec that I incorrectly remembered as implying doing things in the backend would be hard, which is probably why I didn't look into it)

Doing this as an LLVM backend thing may well be the right call, and it's surprising it hasn't already been done, given that the following LLVM IR

I.e. it is a full SW emulation. That is why it is hard to do (although it seems to be done via conversions). I'm not sure there are users for that, though: the reason to use bfloat16 is to have a faster, less precise float, and this is a slower, less precise float.

I'll quickly note that we are users for this. Fundamentally, we want to generate either GPU kernels or CPU functions that can be used to validate generated code that uses the bfloat-based MFMA instructions.

Since hardware doesn't know about bfloat except for those instructions, other operations that can be meaningfully defined on bfloat need to be emulated. We can do (and currently do) this at the MLIR level, but comments above have made the point that how to do such emulation is probably a decision hardware backends should be making.

And while we *could* avoid all this by pulling all this emulation into the matrix multiply lowering path far before LLVM gets involved ... that's ugly and confusing, IMO. I strongly prefer a world where higher-level code can say bfloat to mean bfloat and then have the details of how those ops are actually implemented live somewhere in (or, as in this patch, near) LLVM.

As to the speed question, would folks be unhappy with something like -femulate-bfloat as a backend option?

And at that point should the discussion be taken to the LLVM Discourse?

To re-raise things, what's the right venue for discussing bf16 emulation in LLVM?

In D126444#3560451, @krzysz00 wrote:

To re-raise things, what's the right venue for discussing bf16 emulation in LLVM?

I landed fb34d531af953119593be74753b89baf99fbc194 today which does it for x86, it should
be portable to other targets fairly easily. It works by promoting all bf16 arithmetic to f32.
Converting from bf16 to f32 is expanded inline, but I opted for a libcall for the other way
because proper handling of rounding, NaNs and denormals is quite tricky.

So all a user has to do now is provide a version of __truncsfbf2 in the runtime environment
and it should just work. Targets can also directly lower the BF16_TO_FP node if there's a
better way than a libcall.
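
To illustrate why the f32-to-bf16 direction is the tricky one, here is a minimal sketch of a truncation routine that special-cases NaNs before rounding. It is an assumption-laden illustration: `trunc_f32_to_bf16` and `f32_from_bits` are hypothetical names, and the signature is not claimed to match the real `__truncsfbf2` ABI.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: build a float from its raw bit pattern.
inline float f32_from_bits(std::uint32_t bits) {
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// Hypothetical truncation routine; name and signature are illustrative only
// and are not the ABI of __truncsfbf2.
inline std::uint16_t trunc_f32_to_bf16(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);

  // NaN: exponent all ones and a non-zero mantissa. Rounding could carry out
  // of the mantissa and turn the NaN into an infinity or a zero, so keep the
  // upper 16 bits and force the quiet bit instead.
  if ((bits & 0x7FFFFFFFu) > 0x7F800000u)
    return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);

  // Everything else, including denormals and infinities: round to nearest,
  // ties to even, by adding a bias that depends on the surviving low bit.
  std::uint32_t lsb = (bits >> 16) & 1u;
  return static_cast<std::uint16_t>((bits + 0x7FFFu + lsb) >> 16);
}
```

Without the NaN check, an input such as the bit pattern `0x7F800001` (a NaN whose payload sits only in the low mantissa bits) would be rounded to `0x7F80`, which is the bf16 encoding of +infinity.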

Since we got rid of this pass downstream, we should abandon this

Herald added a reviewer: nicolasvasilache. Aug 10 2023, 7:43 AM

Herald added a reviewer: dcaballe.

Herald added subscribers: gysit, Dinistro, bviyer and 2 others.

Just as an FYI, a while back I hit a related issue in the context of AArch64: https://discourse.llvm.org/t/should-llvm-provide-more-documentation-for-frontend-developers/66134.

Revision Contents

Path                                                        Size
mlir/include/mlir/Dialect/LLVMIR/Transforms/Passes.h        3 lines
mlir/include/mlir/Dialect/LLVMIR/Transforms/Passes.td       14 lines
mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp    7 lines
mlir/lib/Dialect/LLVMIR/Transforms/CMakeLists.txt           2 lines
mlir/lib/Dialect/LLVMIR/Transforms/SoftwareBf16.cpp         384 lines
mlir/test/Conversion/GPUCommon/memory-attrbution.mlir       52 lines
mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl-hip.mlir       28 lines
mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl-opencl.mlir    2 lines
mlir/test/Conversion/SoftwareBF16/softwareBF16.mlir         59 lines

Diff 432186

mlir/include/mlir/Dialect/LLVMIR/Transforms/Passes.h

[10 lines collapsed]

 #include "mlir/Dialect/LLVMIR/Transforms/LegalizeForExport.h"
 #include "mlir/Pass/Pass.h"

 namespace mlir {

 namespace LLVM {

+/// Create a pass to remove BF16 types from LLVM IR.
+std::unique_ptr<Pass> createSoftwareBF16Pass();
+
 /// Generate the code for registering conversion passes.
 #define GEN_PASS_REGISTRATION
 #include "mlir/Dialect/LLVMIR/Transforms/Passes.h.inc"

 } // namespace LLVM
 } // namespace mlir

 #endif // MLIR_DIALECT_LLVMIR_TRANSFORMS_PASSES_H

mlir/include/mlir/Dialect/LLVMIR/Transforms/Passes.td

[10 lines collapsed]

 include "mlir/Pass/PassBase.td"

 def LLVMLegalizeForExport : Pass<"llvm-legalize-for-export"> {
   let summary = "Legalize LLVM dialect to be convertible to LLVM IR";
   let constructor = "mlir::LLVM::createLegalizeForExportPass()";
 }

+def SoftwareBF16 : Pass<"llvm-software-bf16"> {
+  let summary = "Convert BF16 to I16 in LLVM IR";
+  let description = [{
+    This pass erases the BF16 type from LLVM IR.
+
+    Some LLVM targets do not support LLVM's `bfloat` type, or only support it
+    incompletely. To allow using the `bf16` type on such targets, this pass
+    replaces all of its uses by `i16` and then replaces operations on `bf16` by
+    extending the 16-bit values into `f32`, then computes the floating-point
+    operation on the extended value, and then truncates the results.
+  }];
+  let constructor = "mlir::LLVM::createSoftwareBF16Pass()";
+}
+
 #endif // MLIR_DIALECT_LLVMIR_TRANSFORMS_PASSES

mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp

[25 lines collapsed]
 #include "mlir/Conversion/VectorToROCDL/VectorToROCDL.h"
 #include "mlir/Dialect/Func/IR/FuncOps.h"
 #include "mlir/Dialect/GPU/GPUDialect.h"
 #include "mlir/Dialect/GPU/Passes.h"
 #include "mlir/Dialect/LLVMIR/ROCDLDialect.h"
 #include "mlir/Dialect/Math/IR/Math.h"
 #include "mlir/Dialect/Vector/IR/VectorOps.h"
 #include "mlir/Pass/Pass.h"
+#include "mlir/Pass/PassManager.h"
 #include "mlir/Transforms/DialectConversion.h"
 #include "mlir/Transforms/GreedyPatternRewriteDriver.h"
 #include "llvm/Support/FormatVariadic.h"

 #include "../GPUCommon/GPUOpsLowering.h"
 #include "../GPUCommon/IndexIntrinsicsOpLowering.h"
 #include "../GPUCommon/OpToFuncCallLowering.h"
 #include "../PassDetail.h"
+#include "mlir/Dialect/LLVMIR/Transforms/Passes.h"

 using namespace mlir;

 namespace {

 /// Import the GPU Ops to ROCDL Patterns.
 #include "GPUToROCDL.cpp.inc"

[36 lines collapsed] void runOnOperation() override {
     cf::populateControlFlowToLLVMConversionPatterns(converter, llvmPatterns);
     populateFuncToLLVMConversionPatterns(converter, llvmPatterns);
     populateMemRefToLLVMConversionPatterns(converter, llvmPatterns);
     populateGpuToROCDLConversionPatterns(converter, llvmPatterns, runtime);
     LLVMConversionTarget target(getContext());
     configureGpuToROCDLConversionLegality(target);
     if (failed(applyPartialConversion(m, target, std::move(llvmPatterns))))
       signalPassFailure();
+
+    OpPassManager pm("gpu.module");
+    pm.addPass(LLVM::createSoftwareBF16Pass());
+    if (failed(runPipeline(pm, getOperation())))
+      signalPassFailure();
   }
 };

 } // namespace

 void mlir::configureGpuToROCDLConversionLegality(ConversionTarget &target) {
   target.addIllegalOp<func::FuncOp>();
   target.addLegalDialect<::mlir::LLVM::LLVMDialect>();
[80 lines collapsed]

mlir/lib/Dialect/LLVMIR/Transforms/CMakeLists.txt

 add_mlir_dialect_library(MLIRLLVMIRTransforms
   LegalizeForExport.cpp
+  SoftwareBf16.cpp

   DEPENDS
   MLIRLLVMPassIncGen

   LINK_LIBS PUBLIC
   MLIRIR
   MLIRLLVMIR
   MLIRPass
+  MLIRLLVMCommonConversion
   )

mlir/lib/Dialect/LLVMIR/Transforms/SoftwareBf16.cpp

This file was added.

//===- SoftwareBf16.cpp - Prepare for translation to LLVM IR ---------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#include "PassDetail.h"
#include "mlir/Conversion/LLVMCommon/Pattern.h"
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
#include "mlir/Dialect/LLVMIR/Transforms/Passes.h"
#include "mlir/IR/PatternMatch.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

using namespace mlir;

static APInt castBF16toInt(APFloat value) {
  assert(&value.getSemantics() == &APFloat::BFloat() && "Must cast bf16 only");
  APInt ret = value.bitcastToAPInt();
  assert(ret.getBitWidth() == 16 && "bf16 conversion should make i16");
  return ret;
}

static Value getLlvmI32Const(Location loc, PatternRewriter &rewriter, Type type,
                             int32_t value) {
  Attribute ret = rewriter.getI32IntegerAttr(value);
  if (LLVM::isCompatibleVectorType(type))
    ret = SplatElementsAttr::get(type.cast<ShapedType>(), ret);
  return rewriter.createOrFold<LLVM::ConstantOp>(loc, type, ret);
}

namespace {
/// Rewrites bf16 constants to their i16 equivalents
/// This is relying on the fact that the vector, i16, and bf16 types used in the
/// LLVM dialect are the standard ones and not weird custom wrappers
struct BF16ConstCasting : OpRewritePattern<LLVM::ConstantOp> {
  explicit BF16ConstCasting(MLIRContext *context) : OpRewritePattern(context) {}

  LogicalResult matchAndRewrite(LLVM::ConstantOp op,
                                PatternRewriter &rewriter) const override {
    Attribute val = op.getValueAttr();
    Operation *rawOp = op.getOperation();
    Type bf16 = rewriter.getBF16Type();
    Type retType = op.getRes().getType();
    Type retElemType = retType;

    if (auto retTypeShaped = retType.dyn_cast<ShapedType>())
      retElemType = retTypeShaped.getElementType();

    if (auto valFloat = val.dyn_cast<FloatAttr>()) {
      if (valFloat.getType() != bf16)
        return failure();
      APInt newVal = castBF16toInt(valFloat.getValue());
      rewriter.replaceOpWithNewOp<LLVM::ConstantOp>(
          rawOp, retType, rewriter.getIntegerAttr(retType, newVal));
      return success();
    }

    if (auto valDense = val.dyn_cast<DenseElementsAttr>()) {
      if (valDense.getElementType() != bf16)
        return failure();
      DenseElementsAttr newVal = valDense.bitcast(retElemType);
      rewriter.replaceOpWithNewOp<LLVM::ConstantOp>(rawOp, retType, newVal);
      return success();
    }

    if (auto valSparse = val.dyn_cast<SparseElementsAttr>()) {
      if (valSparse.getElementType() != bf16)
        return failure();
      DenseElementsAttr values = valSparse.getValues();
      DenseElementsAttr newValues = values.bitcast(retElemType);
      auto newVal = SparseElementsAttr::get(retType.cast<ShapedType>(),
                                            valSparse.getIndices(), newValues);
      rewriter.replaceOpWithNewOp<LLVM::ConstantOp>(rawOp, retType, newVal);
      return success();
    }
    // No match otherwise
    return failure();
  }
};

template <typename Op>
struct BF16AsF32 : OpRewritePattern<Op> {
  explicit BF16AsF32(MLIRContext *context) : OpRewritePattern<Op>(context) {}

  LogicalResult matchAndRewrite(Op op,
                                PatternRewriter &rewriter) const override {

    Location loc = op.getLoc();
    Type opdType = op->getOperand(0).getType();
    Type opdElementType = opdType;
    Type i16 = rewriter.getIntegerType(16);

    if (auto opdShaped = opdType.dyn_cast<ShapedType>()) {
      opdElementType = opdShaped.getElementType();
    }

    Type resType = op.getResult().getType();
    Type extType = rewriter.getF32Type();
    Type resElementType = resType;

    if (auto resShaped = resType.dyn_cast<ShapedType>()) {
      extType = resShaped.clone(extType);
      resElementType = resShaped.getElementType();
    }

    if (resElementType != i16 && opdElementType != i16)
      return failure();

    llvm::SmallVector<Value, 2> extended;
    if (isa<LLVM::SIToFPOp>(op) || isa<LLVM::UIToFPOp>(op)) {
      extended.push_back(op->getOperand(0));
    } else {
      for (Value v : op->getOperands()) {
        extended.push_back(
            rewriter.create<LLVM::FPExtOp>(loc, extType, v)); // i16->f32
      }
    }

    if (resElementType == i16) {
      Op operation =
          rewriter.create<Op>(loc, extType, extended, op->getAttrs());
      rewriter.replaceOpWithNewOp<LLVM::FPTruncOp>(op, resType,
                                                   operation.getResult());
    } else { // FCmp
      rewriter.replaceOpWithNewOp<Op>(op, resType, extended, op->getAttrs());
    }

    return success();
  }
};

struct SoftwareBF16Ext : OpRewritePattern<LLVM::FPExtOp> {
  explicit SoftwareBF16Ext(MLIRContext *context) : OpRewritePattern(context) {}

  LogicalResult matchAndRewrite(LLVM::FPExtOp op,
                                PatternRewriter &rewriter) const override {
    Location loc = op.getLoc();

    Type srcType = op.getArg().getType();
    Type destType = op.getResult().getType();
    Type srcElemType = srcType;
    if (auto shaped = srcType.dyn_cast<ShapedType>())
      srcElemType = shaped.getElementType();

    Type i16 = rewriter.getIntegerType(16);
    if (srcElemType != i16)
      return failure();

    Type extType = rewriter.getI32Type();
    if (auto srcShaped = srcType.dyn_cast<ShapedType>())
      extType = srcShaped.clone(extType);

    Type f32 = rewriter.getF32Type();
    if (auto destShaped = destType.dyn_cast<ShapedType>()) {
      if (destShaped.getElementType() != f32)
        return failure();
    } else if (destType != f32)
      return failure();

    Value extended = rewriter.create<LLVM::ZExtOp>(loc, extType, op.getArg());
    Value shifted = rewriter.create<LLVM::ShlOp>(
        loc, extended, getLlvmI32Const(loc, rewriter, extType, 16));
    rewriter.replaceOpWithNewOp<LLVM::BitcastOp>(op, destType, shifted);

    return success();
  }
};

/// Rewrites truncation to bfloat as a series of integer operations.
struct SoftwareBF16Trunc : OpRewritePattern<LLVM::FPTruncOp> {
  explicit SoftwareBF16Trunc(MLIRContext *context)
      : OpRewritePattern(context) {}

  LogicalResult matchAndRewrite(LLVM::FPTruncOp op,
                                PatternRewriter &rewriter) const override {

    Location loc = op.getLoc();

    Type srcType = op.getArg().getType();
    Type destType = op.getRes().getType();
    Type srcElemType = srcType;
    if (auto shaped = srcType.dyn_cast<ShapedType>())
      srcElemType = shaped.getElementType();

    Type f32 = rewriter.getF32Type();
    if (srcElemType != f32)
      return failure();

    Type bitcastType = rewriter.getI32Type();
    if (auto srcShaped = srcType.dyn_cast<ShapedType>())
      bitcastType = srcShaped.clone(bitcastType);

    Type i16 = rewriter.getIntegerType(16);
    if (auto destShaped = destType.dyn_cast<ShapedType>()) {
      if (destShaped.getElementType() != i16)
        return failure();
    } else if (destType != i16)
      return failure();

    // a = bitcast f32 value to i32
    // b = (a + 32767) >> 16
    // c = ((a >> 16) & 1)
    // d = b + c
    // truncate (d << 16) to i16 and return this i16
    Value bitcastop =
        rewriter.create<LLVM::BitcastOp>(loc, bitcastType, op.getArg());

    Value constantSixteen = getLlvmI32Const(loc, rewriter, bitcastType, 16);
    Value shiftValue = rewriter.create<LLVM::LShrOp>(
        loc, bitcastType, bitcastop, constantSixteen);

    Value constantOne = getLlvmI32Const(loc, rewriter, bitcastType, 1);
    Value andValue = rewriter.create<LLVM::AndOp>(loc, shiftValue, constantOne);

    Value constantBig = getLlvmI32Const(loc, rewriter, bitcastType, 32767);
    Value addBigValue =
        rewriter.create<LLVM::AddOp>(loc, bitcastop, constantBig);
    Value shiftBigValue = rewriter.create<LLVM::LShrOp>(
        loc, bitcastType, addBigValue, constantSixteen);

    Value addValue = rewriter.create<LLVM::AddOp>(loc, andValue, shiftBigValue);

    Value shiftBeforeTruncValue = rewriter.create<LLVM::LShrOp>(
        loc, bitcastType, addValue, constantSixteen);
    Value truncValue =
        rewriter.create<LLVM::TruncOp>(loc, destType, shiftBeforeTruncValue);
    rewriter.replaceOp(op.getOperation(), {truncValue});

    return success();
  }
};

} // namespace

static void replaceBF16WithI16(Operation *op, TypeConverter &converter) {
  if (auto func = dyn_cast<LLVM::LLVMFuncOp>(op)) {
    auto funcType = func.getFunctionType();
    func.setType(converter.convertType(funcType));
    for (Value arg : func.getArguments())
      arg.setType(converter.convertType(arg.getType()));
  } else if (auto globalOp = dyn_cast<LLVM::GlobalOp>(op)) {
    Type globalType = globalOp.getType();
    globalOp.setGlobalTypeAttr(
        TypeAttr::get(converter.convertType(globalType)));
  } else {
    for (unsigned idx = 0; idx < op->getNumOperands(); idx++) {
      auto type = converter.convertType(op->getOperand(idx).getType());
      op->getOperand(idx).setType(type);
    }
    for (unsigned idx = 0; idx < op->getNumResults(); idx++) {
      auto type = converter.convertType(op->getResult(idx).getType());
      op->getResult(idx).setType(type);
    }
  }
  return;
}

static void populateSoftwareBF16Patterns(MLIRContext *ctx,
                                         TypeConverter &converter,
                                         RewritePatternSet &patterns) {
  Type llvmI16 = IntegerType::get(ctx, 16);

  converter.addConversion([](Type type) { return type; });

  converter.addConversion(
      [llvmI16](BFloat16Type type) -> Type { return llvmI16; });

  converter.addConversion([&](VectorType type) -> Optional<Type> {
    if (auto element = converter.convertType(type.getElementType()))
      return type.clone(element);
    return llvm::None;
  });

  converter.addConversion(
      [&](LLVM::LLVMPointerType type) -> llvm::Optional<Type> {
        if (auto pointee = converter.convertType(type.getElementType()))
          return LLVM::LLVMPointerType::get(pointee, type.getAddressSpace());
        return llvm::None;
      });

  converter.addConversion(
      [&](LLVM::LLVMStructType type, SmallVectorImpl<Type> &results,
          ArrayRef<Type> callStack) -> Optional<LogicalResult> {
        bool converted = false;
        SmallVector<Type> convertedElemTypes;
        convertedElemTypes.reserve(type.getBody().size());
        for (auto t : type.getBody()) {
          SmallVector<Type, 1> element;
          if (failed(converter.convertType(t, element)))
            return llvm::None;
          assert(element.size() == 1);
          convertedElemTypes.push_back(element[0]);
          if (t != element[0])
            converted = true;
        }

        if (!converted) {
          results.push_back(type);
          return success();
        }

        // Identified StructType
        if (type.isIdentified()) {
          auto convertedType = LLVM::LLVMStructType::getIdentified(
              type.getContext(), ("_Converted_" + type.getName()).str());
          unsigned counter = 1;
          while (convertedType.isInitialized()) {
            convertedType = LLVM::LLVMStructType::getIdentified(
                type.getContext(),
                ("_Converted_" + Twine(counter++) + type.getName()).str());
          }
          if (llvm::count(callStack, type) > 1) {
            results.push_back(convertedType);
            return success();
          }
          if (failed(
                  convertedType.setBody(convertedElemTypes, type.isPacked())))
            return llvm::None;
          results.push_back(convertedType);
          return success();
        }

        // Literal StructType
        results.push_back(LLVM::LLVMStructType::getLiteral(
            type.getContext(), convertedElemTypes, type.isPacked()));
        return success();
      });

  converter.addConversion(
      [&](LLVM::LLVMArrayType type) -> llvm::Optional<Type> {
        if (auto element = converter.convertType(type.getElementType()))
          return LLVM::LLVMArrayType::get(element, type.getNumElements());
        return llvm::None;
      });

  converter.addConversion(
      [&](LLVM::LLVMFunctionType type) -> llvm::Optional<Type> {
        Type convertedResType = converter.convertType(type.getReturnType());
        if (!convertedResType)
          return llvm::None;

        SmallVector<Type> convertedArgTypes;
        convertedArgTypes.reserve(type.getNumParams());
        if (failed(converter.convertTypes(type.getParams(), convertedArgTypes)))
          return llvm::None;
        return LLVM::LLVMFunctionType::get(convertedResType, convertedArgTypes,
                                           type.isVarArg());
      });

  patterns.add<BF16ConstCasting, SoftwareBF16Trunc, SoftwareBF16Ext>(ctx);

  patterns.add<BF16AsF32<LLVM::FAddOp>, BF16AsF32<LLVM::FCmpOp>,
               BF16AsF32<LLVM::FDivOp>, BF16AsF32<LLVM::FMulOp>,
               BF16AsF32<LLVM::FNegOp>, BF16AsF32<LLVM::FRemOp>,
               BF16AsF32<LLVM::FSubOp>, BF16AsF32<LLVM::FPToSIOp>,
               BF16AsF32<LLVM::FPToUIOp>, BF16AsF32<LLVM::SIToFPOp>,
               BF16AsF32<LLVM::UIToFPOp>, BF16AsF32<LLVM::FAbsOp>,
               BF16AsF32<LLVM::FCeilOp>, BF16AsF32<LLVM::FFloorOp>,
               BF16AsF32<LLVM::FMAOp>, BF16AsF32<LLVM::FMulAddOp>>(ctx);
}

namespace {
struct SoftwareBF16Pass : public SoftwareBF16Base<SoftwareBF16Pass> {
  void runOnOperation() override {
    auto m = getOperation();
    MLIRContext *ctx = m->getContext();
    TypeConverter converter;
    RewritePatternSet bf16fixupPatterns(ctx);

    populateSoftwareBF16Patterns(ctx, converter, bf16fixupPatterns);
    // Replace BF16 types in an operation with I16 types
    m->walk([&converter](Operation *op) { replaceBF16WithI16(op, converter); });

    if (failed(applyPatternsAndFoldGreedily(m, std::move(bf16fixupPatterns))))
      signalPassFailure();
  }
};
} // namespace

std::unique_ptr<Pass> LLVM::createSoftwareBF16Pass() {
  return std::make_unique<SoftwareBF16Pass>();
}

mlir/test/Conversion/GPUCommon/memory-attrbution.mlir

[16 lines collapsed] gpu.func @private(%arg0: f32) private(%arg1: memref<4xf32, 5>) {
 // NVVM: %[[descr3:.*]] = llvm.insertvalue %[[raw]], %[[descr2]][1]
 // NVVM: %[[c0:.*]] = llvm.mlir.constant(0 : index) : i64
 // NVVM: %[[descr4:.*]] = llvm.insertvalue %[[c0]], %[[descr3]][2]
 // NVVM: %[[c4:.*]] = llvm.mlir.constant(4 : index) : i64
 // NVVM: %[[descr5:.*]] = llvm.insertvalue %[[c4]], %[[descr4]][3, 0]
 // NVVM: %[[c1:.*]] = llvm.mlir.constant(1 : index) : i64
 // NVVM: %[[descr6:.*]] = llvm.insertvalue %[[c1]], %[[descr5]][4, 0]

-// ROCDL: %[[descr1:.*]] = llvm.mlir.undef : !llvm.struct<(ptr<f32, 5>, ptr<f32, 5>, i64, array<1 x i64>, array<1 x i64>)>
-// ROCDL: %[[descr2:.*]] = llvm.insertvalue %[[raw]], %[[descr1]][0]
-// ROCDL: %[[descr3:.*]] = llvm.insertvalue %[[raw]], %[[descr2]][1]
-// ROCDL: %[[c0:.*]] = llvm.mlir.constant(0 : index) : i64
-// ROCDL: %[[descr4:.*]] = llvm.insertvalue %[[c0]], %[[descr3]][2]
-// ROCDL: %[[c4:.*]] = llvm.mlir.constant(4 : index) : i64
-// ROCDL: %[[descr5:.*]] = llvm.insertvalue %[[c4]], %[[descr4]][3, 0]
-// ROCDL: %[[c1:.*]] = llvm.mlir.constant(1 : index) : i64
-// ROCDL: %[[descr6:.*]] = llvm.insertvalue %[[c1]], %[[descr5]][4, 0]

 // "Store" lowering should work just as any other memref, only check that
 // we emit some core instructions.
 // NVVM: llvm.extractvalue %[[descr6:.*]]
 // NVVM: llvm.getelementptr
 // NVVM: llvm.store

-// ROCDL: llvm.extractvalue %[[descr6:.*]]
-// ROCDL: llvm.getelementptr
-// ROCDL: llvm.store
+// ROCDL: llvm.store {{.*}}, %[[raw]]
 %c0 = arith.constant 0 : index
 memref.store %arg0, %arg1[%c0] : memref<4xf32, 5>

 "terminator"() : () -> ()
 }
 }

 // -----
[31 lines collapsed] gpu.func @workgroup(%arg0: f32) workgroup(%arg1: memref<4xf32, 3>) {
 // NVVM: %[[descr3:.*]] = llvm.insertvalue %[[raw]], %[[descr2]][1]
 // NVVM: %[[c0:.*]] = llvm.mlir.constant(0 : index) : i64
 // NVVM: %[[descr4:.*]] = llvm.insertvalue %[[c0]], %[[descr3]][2]
 // NVVM: %[[c4:.*]] = llvm.mlir.constant(4 : index) : i64
 // NVVM: %[[descr5:.*]] = llvm.insertvalue %[[c4]], %[[descr4]][3, 0]
 // NVVM: %[[c1:.*]] = llvm.mlir.constant(1 : index) : i64
 // NVVM: %[[descr6:.*]] = llvm.insertvalue %[[c1]], %[[descr5]][4, 0]

-// ROCDL: %[[descr1:.*]] = llvm.mlir.undef : !llvm.struct<(ptr<f32, 3>, ptr<f32, 3>, i64, array<1 x i64>, array<1 x i64>)>
-// ROCDL: %[[descr2:.*]] = llvm.insertvalue %[[raw]], %[[descr1]][0]
-// ROCDL: %[[descr3:.*]] = llvm.insertvalue %[[raw]], %[[descr2]][1]
-// ROCDL: %[[c0:.*]] = llvm.mlir.constant(0 : index) : i64
-// ROCDL: %[[descr4:.*]] = llvm.insertvalue %[[c0]], %[[descr3]][2]
-// ROCDL: %[[c4:.*]] = llvm.mlir.constant(4 : index) : i64
-// ROCDL: %[[descr5:.*]] = llvm.insertvalue %[[c4]], %[[descr4]][3, 0]
-// ROCDL: %[[c1:.*]] = llvm.mlir.constant(1 : index) : i64
-// ROCDL: %[[descr6:.*]] = llvm.insertvalue %[[c1]], %[[descr5]][4, 0]

 // "Store" lowering should work just as any other memref, only check that
 // we emit some core instructions.
 // NVVM: llvm.extractvalue %[[descr6:.*]]
 // NVVM: llvm.getelementptr
 // NVVM: llvm.store

-// ROCDL: llvm.extractvalue %[[descr6:.*]]
-// ROCDL: llvm.getelementptr
-// ROCDL: llvm.store
+// ROCDL: llvm.store {{.*}}, %[[raw]]
 %c0 = arith.constant 0 : index
 memref.store %arg0, %arg1[%c0] : memref<4xf32, 3>

 "terminator"() : () -> ()
 }
 }

 // -----
gpu.func @workgroup3d(%arg0: f32) workgroup(%arg1: memref<4x2x6xf32, 3>) {
// NVVM: %[[descr7:.*]] = llvm.insertvalue %[[c2]], %[[descr6]][3, 1]		// NVVM: %[[descr7:.*]] = llvm.insertvalue %[[c2]], %[[descr6]][3, 1]
// NVVM: %[[c6:.*]] = llvm.mlir.constant(6 : index) : i64		// NVVM: %[[c6:.*]] = llvm.mlir.constant(6 : index) : i64
// NVVM: %[[descr8:.*]] = llvm.insertvalue %[[c6]], %[[descr7]][4, 1]		// NVVM: %[[descr8:.*]] = llvm.insertvalue %[[c6]], %[[descr7]][4, 1]
// NVVM: %[[c6:.*]] = llvm.mlir.constant(6 : index) : i64		// NVVM: %[[c6:.*]] = llvm.mlir.constant(6 : index) : i64
// NVVM: %[[descr9:.*]] = llvm.insertvalue %[[c6]], %[[descr8]][3, 2]		// NVVM: %[[descr9:.*]] = llvm.insertvalue %[[c6]], %[[descr8]][3, 2]
// NVVM: %[[c1:.*]] = llvm.mlir.constant(1 : index) : i64		// NVVM: %[[c1:.*]] = llvm.mlir.constant(1 : index) : i64
// NVVM: %[[descr10:.*]] = llvm.insertvalue %[[c1]], %[[descr9]][4, 2]		// NVVM: %[[descr10:.*]] = llvm.insertvalue %[[c1]], %[[descr9]][4, 2]

// ROCDL: %[[descr1:.*]] = llvm.mlir.undef : !llvm.struct<(ptr<f32, 3>, ptr<f32, 3>, i64, array<3 x i64>, array<3 x i64>)>		// ROCDL: %[[offset:.*]] = llvm.getelementptr %[[raw]]
// ROCDL: %[[descr2:.*]] = llvm.insertvalue %[[raw]], %[[descr1]][0]		// ROCDL: llvm.store {{.*}} %[[offset]]
// ROCDL: %[[descr3:.*]] = llvm.insertvalue %[[raw]], %[[descr2]][1]
// ROCDL: %[[c0:.*]] = llvm.mlir.constant(0 : index) : i64
// ROCDL: %[[descr4:.*]] = llvm.insertvalue %[[c0]], %[[descr3]][2]
// ROCDL: %[[c4:.*]] = llvm.mlir.constant(4 : index) : i64
// ROCDL: %[[descr5:.*]] = llvm.insertvalue %[[c4]], %[[descr4]][3, 0]
// ROCDL: %[[c12:.*]] = llvm.mlir.constant(12 : index) : i64
// ROCDL: %[[descr6:.*]] = llvm.insertvalue %[[c12]], %[[descr5]][4, 0]
// ROCDL: %[[c2:.*]] = llvm.mlir.constant(2 : index) : i64
// ROCDL: %[[descr7:.*]] = llvm.insertvalue %[[c2]], %[[descr6]][3, 1]
// ROCDL: %[[c6:.*]] = llvm.mlir.constant(6 : index) : i64
// ROCDL: %[[descr8:.*]] = llvm.insertvalue %[[c6]], %[[descr7]][4, 1]
// ROCDL: %[[c6:.*]] = llvm.mlir.constant(6 : index) : i64
// ROCDL: %[[descr9:.*]] = llvm.insertvalue %[[c6]], %[[descr8]][3, 2]
// ROCDL: %[[c1:.*]] = llvm.mlir.constant(1 : index) : i64
// ROCDL: %[[descr10:.*]] = llvm.insertvalue %[[c1]], %[[descr9]][4, 2]

%c0 = arith.constant 0 : index		%c0 = arith.constant 0 : index
memref.store %arg0, %arg1[%c0,%c0,%c0] : memref<4x2x6xf32, 3>		memref.store %arg0, %arg1[%c0,%c0,%c0] : memref<4x2x6xf32, 3>
"terminator"() : () -> ()		"terminator"() : () -> ()
}		}
}		}

// -----		// -----
gpu.module @kernel {
// ROCDL-SAME: !llvm.array<2 x f32>		// ROCDL-SAME: !llvm.array<2 x f32>

// NVVM-LABEL: llvm.func @multiple		// NVVM-LABEL: llvm.func @multiple
// ROCDL-LABEL: llvm.func @multiple		// ROCDL-LABEL: llvm.func @multiple
gpu.func @multiple(%arg0: f32)		gpu.func @multiple(%arg0: f32)
workgroup(%arg1: memref<1xf32, 3>, %arg2: memref<2xf32, 3>)		workgroup(%arg1: memref<1xf32, 3>, %arg2: memref<2xf32, 3>)
private(%arg3: memref<3xf32, 5>, %arg4: memref<4xf32, 5>) {		private(%arg3: memref<3xf32, 5>, %arg4: memref<4xf32, 5>) {

		// ROCDL: %[[c4:.*]] = llvm.mlir.constant(4 : i64)
		// ROCDL: %[[c3:.*]] = llvm.mlir.constant(3 : i64)

// Workgroup buffers.		// Workgroup buffers.
// NVVM: llvm.mlir.addressof @[[$buffer1]]		// NVVM: llvm.mlir.addressof @[[$buffer1]]
// NVVM: llvm.mlir.addressof @[[$buffer2]]		// NVVM: llvm.mlir.addressof @[[$buffer2]]

// ROCDL: llvm.mlir.addressof @[[$buffer1]]		// ROCDL: llvm.mlir.addressof @[[$buffer1]]
// ROCDL: llvm.mlir.addressof @[[$buffer2]]		// ROCDL: llvm.mlir.addressof @[[$buffer2]]

// Private buffers.		// Private buffers.
// NVVM: %[[c3:.*]] = llvm.mlir.constant(3 : i64)		// NVVM: %[[c3:.*]] = llvm.mlir.constant(3 : i64)
// NVVM: llvm.alloca %[[c3]] x f32 : (i64) -> !llvm.ptr<f32>		// NVVM: llvm.alloca %[[c3]] x f32 : (i64) -> !llvm.ptr<f32>
// NVVM: %[[c4:.*]] = llvm.mlir.constant(4 : i64)		// NVVM: %[[c4:.*]] = llvm.mlir.constant(4 : i64)
// NVVM: llvm.alloca %[[c4]] x f32 : (i64) -> !llvm.ptr<f32>		// NVVM: llvm.alloca %[[c4]] x f32 : (i64) -> !llvm.ptr<f32>

// ROCDL: %[[c3:.*]] = llvm.mlir.constant(3 : i64)
// ROCDL: llvm.alloca %[[c3]] x f32 : (i64) -> !llvm.ptr<f32, 5>		// ROCDL: llvm.alloca %[[c3]] x f32 : (i64) -> !llvm.ptr<f32, 5>
// ROCDL: %[[c4:.*]] = llvm.mlir.constant(4 : i64)
// ROCDL: llvm.alloca %[[c4]] x f32 : (i64) -> !llvm.ptr<f32, 5>		// ROCDL: llvm.alloca %[[c4]] x f32 : (i64) -> !llvm.ptr<f32, 5>

%c0 = arith.constant 0 : index		%c0 = arith.constant 0 : index
memref.store %arg0, %arg1[%c0] : memref<1xf32, 3>		memref.store %arg0, %arg1[%c0] : memref<1xf32, 3>
memref.store %arg0, %arg2[%c0] : memref<2xf32, 3>		memref.store %arg0, %arg2[%c0] : memref<2xf32, 3>
memref.store %arg0, %arg3[%c0] : memref<3xf32, 5>		memref.store %arg0, %arg3[%c0] : memref<3xf32, 5>
memref.store %arg0, %arg4[%c0] : memref<4xf32, 5>		memref.store %arg0, %arg4[%c0] : memref<4xf32, 5>
"terminator"() : () -> ()		"terminator"() : () -> ()
}		}
}		}

mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl-hip.mlir

	// RUN: mlir-opt %s -convert-gpu-to-rocdl=runtime=HIP -split-input-file | FileCheck %s			// RUN: mlir-opt %s -convert-gpu-to-rocdl=runtime=HIP -split-input-file | FileCheck %s

	gpu.module @test_module {			gpu.module @test_module {
	// CHECK-DAG: llvm.mlir.global internal constant @[[$PRINT_GLOBAL0:[A-Za-z0-9_]+]]("Hello, world\0A\00")			// CHECK-DAG: llvm.mlir.global internal constant @[[$PRINT_GLOBAL0:[A-Za-z0-9_]+]]("Hello, world\0A\00")
	// CHECK-DAG: llvm.mlir.global internal constant @[[$PRINT_GLOBAL1:[A-Za-z0-9_]+]]("Hello: %d\0A\00")			// CHECK-DAG: llvm.mlir.global internal constant @[[$PRINT_GLOBAL1:[A-Za-z0-9_]+]]("Hello: %d\0A\00")
	// CHECK-DAG: llvm.func @__ockl_printf_append_args(i64, i32, i64, i64, i64, i64, i64, i64, i64, i32) -> i64			// CHECK-DAG: llvm.func @__ockl_printf_append_args(i64, i32, i64, i64, i64, i64, i64, i64, i64, i32) -> i64
	// CHECK-DAG: llvm.func @__ockl_printf_append_string_n(i64, !llvm.ptr<i8>, i64, i32) -> i64			// CHECK-DAG: llvm.func @__ockl_printf_append_string_n(i64, !llvm.ptr<i8>, i64, i32) -> i64
	// CHECK-DAG: llvm.func @__ockl_printf_begin(i64) -> i64			// CHECK-DAG: llvm.func @__ockl_printf_begin(i64) -> i64

	// CHECK-LABEL: func @test_const_printf			// CHECK-LABEL: func @test_const_printf
	gpu.func @test_const_printf() {			gpu.func @test_const_printf() {
				// CHECK: %[[ISLAST:.*]] = llvm.mlir.constant(1 : i32) : i32
				// CHECK: %[[FORMATLEN:.*]] = llvm.mlir.constant(14 : i64) : i64
	// CHECK: %[[CST0:.*]] = llvm.mlir.constant(0 : i64) : i64			// CHECK: %[[CST0:.*]] = llvm.mlir.constant(0 : i64) : i64
	// CHECK-NEXT: %[[DESC0:.*]] = llvm.call @__ockl_printf_begin(%0) : (i64) -> i64			// CHECK: %[[DESC0:.*]] = llvm.call @__ockl_printf_begin({{.*}}) : (i64) -> i64
	// CHECK-NEXT: %[[FORMATSTR:.*]] = llvm.mlir.addressof @[[$PRINT_GLOBAL0]] : !llvm.ptr<array<14 x i8>>			// CHECK: %[[FORMATSTR:.*]] = llvm.mlir.addressof @[[$PRINT_GLOBAL0]] : !llvm.ptr<array<14 x i8>>
	// CHECK-NEXT: %[[CST1:.*]] = llvm.mlir.constant(0 : i64) : i64			// CHECK: %[[FORMATSTART:.*]] = llvm.getelementptr %[[FORMATSTR]][%[[CST0]], %[[CST0]]] : (!llvm.ptr<array<14 x i8>>, i64, i64) -> !llvm.ptr<i8>
	// CHECK-NEXT: %[[FORMATSTART:.*]] = llvm.getelementptr %[[FORMATSTR]][%[[CST1]], %[[CST1]]] : (!llvm.ptr<array<14 x i8>>, i64, i64) -> !llvm.ptr<i8>
	// CHECK-NEXT: %[[FORMATLEN:.*]] = llvm.mlir.constant(14 : i64) : i64
	// CHECK-NEXT: %[[ISLAST:.*]] = llvm.mlir.constant(1 : i32) : i32
	// CHECK-NEXT: %[[ISNTLAST:.*]] = llvm.mlir.constant(0 : i32) : i32
	// CHECK-NEXT: %{{.*}} = llvm.call @__ockl_printf_append_string_n(%[[DESC0]], %[[FORMATSTART]], %[[FORMATLEN]], %[[ISLAST]]) : (i64, !llvm.ptr<i8>, i64, i32) -> i64			// CHECK-NEXT: %{{.*}} = llvm.call @__ockl_printf_append_string_n(%[[DESC0]], %[[FORMATSTART]], %[[FORMATLEN]], %[[ISLAST]]) : (i64, !llvm.ptr<i8>, i64, i32) -> i64
	gpu.printf "Hello, world\n"			gpu.printf "Hello, world\n"
	gpu.return			gpu.return
	}			}


	// CHECK-LABEL: func @test_printf			// CHECK-LABEL: func @test_printf
	// CHECK: (%[[ARG0:.*]]: i32)			// CHECK: (%[[ARG0:.*]]: i32)
	gpu.func @test_printf(%arg0: i32) {			gpu.func @test_printf(%arg0: i32) {
				// CHECK: %[[ISNTLAST:.*]] = llvm.mlir.constant(0 : i32) : i32
				// CHECK: %[[NARGS1:.*]] = llvm.mlir.constant(1 : i32) : i32
				// CHECK: %[[FORMATLEN:.*]] = llvm.mlir.constant(11 : i64) : i64
	// CHECK: %[[CST0:.*]] = llvm.mlir.constant(0 : i64) : i64			// CHECK: %[[CST0:.*]] = llvm.mlir.constant(0 : i64) : i64
	// CHECK-NEXT: %[[DESC0:.*]] = llvm.call @__ockl_printf_begin(%0) : (i64) -> i64			// CHECK: %[[DESC0:.*]] = llvm.call @__ockl_printf_begin({{.*}}) : (i64) -> i64
	// CHECK-NEXT: %[[FORMATSTR:.*]] = llvm.mlir.addressof @[[$PRINT_GLOBAL1]] : !llvm.ptr<array<11 x i8>>			// CHECK: %[[FORMATSTR:.*]] = llvm.mlir.addressof @[[$PRINT_GLOBAL1]] : !llvm.ptr<array<11 x i8>>
	// CHECK-NEXT: %[[CST1:.*]] = llvm.mlir.constant(0 : i64) : i64			// CHECK: %[[FORMATSTART:.*]] = llvm.getelementptr %[[FORMATSTR]][%[[CST0]], %[[CST0]]] : (!llvm.ptr<array<11 x i8>>, i64, i64) -> !llvm.ptr<i8>
	// CHECK-NEXT: %[[FORMATSTART:.*]] = llvm.getelementptr %[[FORMATSTR]][%[[CST1]], %[[CST1]]] : (!llvm.ptr<array<11 x i8>>, i64, i64) -> !llvm.ptr<i8>
	// CHECK-NEXT: %[[FORMATLEN:.*]] = llvm.mlir.constant(11 : i64) : i64
	// CHECK-NEXT: %[[ISLAST:.*]] = llvm.mlir.constant(1 : i32) : i32
	// CHECK-NEXT: %[[ISNTLAST:.*]] = llvm.mlir.constant(0 : i32) : i32
	// CHECK-NEXT: %[[DESC1:.*]] = llvm.call @__ockl_printf_append_string_n(%[[DESC0]], %[[FORMATSTART]], %[[FORMATLEN]], %[[ISNTLAST]]) : (i64, !llvm.ptr<i8>, i64, i32) -> i64			// CHECK-NEXT: %[[DESC1:.*]] = llvm.call @__ockl_printf_append_string_n(%[[DESC0]], %[[FORMATSTART]], %[[FORMATLEN]], %[[ISNTLAST]]) : (i64, !llvm.ptr<i8>, i64, i32) -> i64
	// CHECK-NEXT: %[[NARGS1:.*]] = llvm.mlir.constant(1 : i32) : i32
	// CHECK-NEXT: %[[ARG0_64:.*]] = llvm.zext %[[ARG0]] : i32 to i64			// CHECK-NEXT: %[[ARG0_64:.*]] = llvm.zext %[[ARG0]] : i32 to i64
	// CHECK-NEXT: %{{.*}} = llvm.call @__ockl_printf_append_args(%[[DESC1]], %[[NARGS1]], %[[ARG0_64]], %[[CST0]], %[[CST0]], %[[CST0]], %[[CST0]], %[[CST0]], %[[CST0]], %[[ISLAST]]) : (i64, i32, i64, i64, i64, i64, i64, i64, i64, i32) -> i64			// CHECK-NEXT: %{{.*}} = llvm.call @__ockl_printf_append_args(%[[DESC1]], %[[NARGS1]], %[[ARG0_64]], %[[CST0]], %[[CST0]], %[[CST0]], %[[CST0]], %[[CST0]], %[[CST0]], %[[NARGS1]]) : (i64, i32, i64, i64, i64, i64, i64, i64, i64, i32) -> i64
	gpu.printf "Hello: %d\n" %arg0 : i32			gpu.printf "Hello: %d\n" %arg0 : i32
	gpu.return			gpu.return
	}			}
	}			}

mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl-opencl.mlir

	// RUN: mlir-opt %s -convert-gpu-to-rocdl=runtime=OpenCL | FileCheck %s			// RUN: mlir-opt %s -convert-gpu-to-rocdl=runtime=OpenCL | FileCheck %s

	gpu.module @test_module {			gpu.module @test_module {
	// CHECK: llvm.mlir.global internal constant @[[$PRINT_GLOBAL:[A-Za-z0-9_]+]]("Hello: %d\0A\00") {addr_space = 4 : i32}			// CHECK: llvm.mlir.global internal constant @[[$PRINT_GLOBAL:[A-Za-z0-9_]+]]("Hello: %d\0A\00") {addr_space = 4 : i32}
	// CHECK: llvm.func @printf(!llvm.ptr<i8, 4>, ...) -> i32			// CHECK: llvm.func @printf(!llvm.ptr<i8, 4>, ...) -> i32
	// CHECK-LABEL: func @test_printf			// CHECK-LABEL: func @test_printf
	// CHECK: (%[[ARG0:.*]]: i32)			// CHECK: (%[[ARG0:.*]]: i32)
	gpu.func @test_printf(%arg0: i32) {			gpu.func @test_printf(%arg0: i32) {
				// CHECK: %[[IMM1:.*]] = llvm.mlir.constant(0 : i64) : i64
	// CHECK: %[[IMM0:.*]] = llvm.mlir.addressof @[[$PRINT_GLOBAL]] : !llvm.ptr<array<11 x i8>, 4>			// CHECK: %[[IMM0:.*]] = llvm.mlir.addressof @[[$PRINT_GLOBAL]] : !llvm.ptr<array<11 x i8>, 4>
	// CHECK-NEXT: %[[IMM1:.*]] = llvm.mlir.constant(0 : i64) : i64
	// CHECK-NEXT: %[[IMM2:.*]] = llvm.getelementptr %[[IMM0]][%[[IMM1]], %[[IMM1]]] : (!llvm.ptr<array<11 x i8>, 4>, i64, i64) -> !llvm.ptr<i8, 4>			// CHECK-NEXT: %[[IMM2:.*]] = llvm.getelementptr %[[IMM0]][%[[IMM1]], %[[IMM1]]] : (!llvm.ptr<array<11 x i8>, 4>, i64, i64) -> !llvm.ptr<i8, 4>
	// CHECK-NEXT: %{{.*}} = llvm.call @printf(%[[IMM2]], %[[ARG0]]) : (!llvm.ptr<i8, 4>, i32) -> i32			// CHECK-NEXT: %{{.*}} = llvm.call @printf(%[[IMM2]], %[[ARG0]]) : (!llvm.ptr<i8, 4>, i32) -> i32
	gpu.printf "Hello: %d\n" %arg0 : i32			gpu.printf "Hello: %d\n" %arg0 : i32
	gpu.return			gpu.return
	}			}
	}			}

mlir/test/Conversion/SoftwareBF16/softwareBF16.mlir

This file was added.

				// RUN: mlir-opt --llvm-software-bf16 %s | FileCheck %s

				module attributes {llvm.data_layout = ""} {
				llvm.func @verify_bf16_f32(%arg0: bf16, %arg1: f32) -> i32 {
				//CHECK: llvm.func @verify_bf16_f32(%arg0: i16, %arg1: f32) -> i32 {

				%0 = llvm.mlir.constant(0 : i32) : i32
				%1 = llvm.mlir.constant(1 : i32) : i32
				//CHECK-DAG: %[[V0:.*]] = llvm.mlir.constant(32767 : i32) : i32
				//CHECK-DAG: %[[V1:.*]] = llvm.mlir.constant(16 : i32) : i32
				//CHECK-DAG: %[[V2:.*]] = llvm.mlir.constant(1 : i32) : i32
				//CHECK-DAG: %[[V3:.*]] = llvm.mlir.constant(0 : i32) : i32

				%2 = llvm.mlir.constant(1.503910e-01 : bf16) : bf16
				//CHECK-DAG: %[[V4:.*]] = llvm.mlir.constant(15898 : i16) : i16

				%3 = llvm.fptrunc %arg1 : f32 to bf16
				//CHECK: %[[V5:.*]] = llvm.bitcast %arg1 : f32 to i32
				//CHECK-NEXT: %[[V6:.*]] = llvm.lshr %[[V5]], %[[V1]] : i32
				//CHECK-NEXT: %[[V7:.*]] = llvm.and %[[V6]], %[[V2]] : i32
				//CHECK-NEXT: %[[V8:.*]] = llvm.add %[[V5]], %[[V0]] : i32
				//CHECK-NEXT: %[[V9:.*]] = llvm.lshr %[[V8]], %[[V1]] : i32
				//CHECK-NEXT: %[[V10:.*]] = llvm.add %[[V7]], %[[V9]] : i32
				//CHECK-NEXT: %[[V11:.*]] = llvm.lshr %[[V10]], %[[V1]] : i32
				//CHECK-NEXT: %[[V12:.*]] = llvm.trunc %[[V11]] : i32 to i16

				%4 = llvm.fsub %arg0, %3 : bf16
				//CHECK: %[[V13:.*]] = llvm.zext %arg0 : i16 to i32
				//CHECK-NEXT: %[[V14:.*]] = llvm.shl %[[V13]], %[[V1]] : i32
				//CHECK-NEXT: %[[V15:.*]] = llvm.bitcast %[[V14]] : i32 to f32
				//CHECK: %[[V16:.*]] = llvm.zext %[[V12]] : i16 to i32
				//CHECK-NEXT: %[[V17:.*]] = llvm.shl %[[V16]], %1 : i32
				//CHECK-NEXT: %[[V18:.*]] = llvm.bitcast %[[V17]] : i32 to f32
				//CHECK: %[[V19:.*]] = llvm.fsub %[[V15]], %[[V18]] : f32
				//CHECK-NEXT: %[[V20:.*]] = llvm.bitcast %[[V19]] : f32 to i32
				//CHECK-NEXT: %[[V21:.*]] = llvm.lshr %[[V20]], %[[V1]] : i32
				//CHECK-NEXT: %[[V22:.*]] = llvm.and %[[V21]], %[[V2]] : i32
				//CHECK-NEXT: %[[V23:.*]] = llvm.add %[[V20]], %[[V0]] : i32
				//CHECK-NEXT: %[[V24:.*]] = llvm.lshr %[[V23]], %[[V1]] : i32
				//CHECK-NEXT: %[[V25:.*]] = llvm.add %[[V22]], %[[V24]] : i32
				//CHECK-NEXT: %[[V26:.*]] = llvm.lshr %[[V25]], %[[V1]] : i32
				//CHECK-NEXT: %[[V27:.*]] = llvm.trunc %[[V26]] : i32 to i16

				%5 = llvm.fcmp "ugt" %4, %2 : bf16
				//CHECK: %[[V28:.*]] = llvm.zext %[[V27]] : i16 to i32
				//CHECK-NEXT: %[[V29:.*]] = llvm.shl %[[V28]], %[[V1]] : i32
				//CHECK-NEXT: %[[V30:.*]] = llvm.bitcast %[[V29]] : i32 to f32
				//CHECK: %[[V31:.*]] = llvm.zext %[[V4]] : i16 to i32
				//CHECK-NEXT: %[[V32:.*]] = llvm.shl %[[V31]], %[[V1]] : i32
				//CHECK-NEXT: %[[V33:.*]] = llvm.bitcast %[[V32]] : i32 to f32
				//CHECK: %{{.*}} = llvm.fcmp "ugt" %[[V30]], %[[V33]] : f32

				llvm.cond_br %5, ^bb1, ^bb2
				^bb1: // pred: ^bb0
				llvm.return %1 : i32
				^bb2: // pred: ^bb0
				llvm.return %0 : i32
				}
				}
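
The test above checks that bf16 values become i16 bit patterns, that operands are extended to f32 with a zero-extend, a left shift by 16 and a bitcast, and that f32 results are narrowed back using the constants 32767, 16 and 1. A minimal Python sketch of that arithmetic follows. The function names are hypothetical, and the truncation uses the conventional round-to-nearest-even formulation, which is an assumption: it may not match the exact operation sequence the pass emits, and NaN inputs are not special-cased.

```python
import struct


def bf16_bits_to_f32(bits: int) -> float:
    """Extend a 16-bit bf16 pattern to f32: zero-extend, shift left by 16, bitcast."""
    widened = (bits & 0xFFFF) << 16
    return struct.unpack("<f", struct.pack("<I", widened))[0]


def f32_to_bf16_bits(value: float) -> int:
    """Narrow an f32 to a bf16 bit pattern with round-to-nearest-even.

    Assumption: this is the textbook rounding trick (add 0x7FFF plus the
    lowest kept bit, then drop the low 16 bits). NaN is not handled specially.
    """
    x = struct.unpack("<I", struct.pack("<f", value))[0]  # bitcast f32 -> i32
    lsb = (x >> 16) & 1                                   # lowest bit that survives
    rounded = (x + 0x7FFF + lsb) & 0xFFFFFFFF             # bias, wrapping like i32
    return rounded >> 16                                  # keep the high 16 bits


if __name__ == "__main__":
    # The bf16 constant 1.503910e-01 in the test is stored as the i16 value 15898.
    print(f32_to_bf16_bits(0.150390625))
    print(bf16_bits_to_f32(15898))
```

The constant check in the test can be reproduced this way: 15898 is 0x3E1A, which decodes to 2^-3 * (1 + 26/128) = 0.150390625.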