This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
mlir/
-
include/mlir/Conversion/
-
mlir/
-
Conversion/
-
NVVMToLLVM/
-
NVVMToLLVM.h
-
Passes.h
-
Passes.td
-
lib/Conversion/
-
Conversion/
-
CMakeLists.txt
-
NVVMToLLVM/
-
CMakeLists.txt
2
NVVMToLLVM.cpp
-
test/Conversion/NVVMToLLVM/
-
Conversion/
-
NVVMToLLVM/
1
nvvm-to-llvm.mlir

Differential D154060

[mlir][nvvm] Add NVVMToLLVM Pass
ClosedPublic

Authored by guraypp on Jun 29 2023, 3:28 AM.

Download Raw Diff

Details

Reviewers

ftynse
bondhugula
ThomasRaoux
dcaballe
nicolasvasilache
qcolombet
manishucsd

Commits

rGdd080c7579e0: [mlir][nvvm] Add NVVMToLLVM Pass

Summary

It introduces an NVVMToLLVM Pass and a BasicPtxBuilderOpInterface interface. The Pass performs pattern matching on all the NVVM Ops that implement the BasicPtxBuilderOpInterface interface to generate LLVM Inline Assembly Ops.

The BasicPtxBuilderOpInterface interface is utilized in the convert-nvvm-to-llvm pass, which lowers Ops that support this interface to inline assembly Ops. The interface provides several methods that are used for this lowering.

The getPtx method returns PTX code. The hasSideEffect method is used to determine whether the op has any side effects on the memory. The hasIntrinsic method indicates whether the operation has intrinsic support in LLVM. This is particularly useful for Ops that don't have intrinsic support for each case. The getAsmValues method returns the arguments to be passed to the PTX code. The order of arguments starts with the results and they are used for write operations, followed by the operands and attributes.

Example:

If we have the following Op definition that returns PTX code through getPtx:

tablegen
def NVVM_MBarrierArriveExpectTxOp : NVVM_Op<\"mbarrier.arrive.expect_tx\",
                    [DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>,
  Results<(outs LLVM_Type:$res)>, Arguments<(ins LLVM_i64ptr_any:$addr, I32:$txcount)> {
  ...
  let extraClassDefinition = [{
    const char* $cppClass::getPtx() { return \"mbarrier.arrive.expect_tx.b64 %0, [%1], %2;\"; }
  }\];
}

The NVVM Op will look like below:

mlir
  %0 = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount : !llvm.ptr, i32 -> i32

The convert-nvvm-to-llvm Pass generates the following PTX code, while keeping the order of arguments the same. The read/write modifiers are set based on the input and result types.

mlir
  %0 = llvm.inline_asm has_side_effects asm_dialect = att "mbarrier.arrive.expect_tx.b64 %0, [%1], %2;", "=r,l,r" %arg0, %arg1 : (!llvm.ptr, i32) -> i32

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

guraypp created this revision.Jun 29 2023, 3:28 AM

Herald added a reviewer: ftynse. · View Herald TranscriptJun 29 2023, 3:28 AM

Herald added a reviewer: bondhugula. · View Herald Transcript

Herald added a reviewer: ThomasRaoux. · View Herald Transcript

Herald added a reviewer: dcaballe. · View Herald Transcript

Herald added a project: Restricted Project. · View Herald Transcript

Herald added subscribers: gysit, Dinistro, bviyer and 26 others. · View Herald Transcript

Harbormaster completed remote builds in B242023: Diff 535703.Jun 29 2023, 3:43 AM

guraypp added a child revision: D154094: [mlir][nvgpu] Add `mbarrier.arrive.expect_tx` and `mbarrier.try_wait.parity`.Jun 29 2023, 8:35 AM

add constant support to inline assembly generator

rebase

fix typo

guraypp added reviewers: nicolasvasilache, qcolombet, manishucsd.Jul 3 2023, 5:52 AM

Harbormaster completed remote builds in B242777: Diff 536733.Jul 3 2023, 6:08 AM

guraypp added a child revision: D154345: [mlir][nvgpu] Implement `nvgpu.device_async_copy` by NVVMToLLVM Pass.Jul 3 2023, 6:53 AM

Introduced BasicPtxBuilderOpInterface that automatically builds PTX based on Op.

guraypp edited the summary of this revision. (Show Details)Jul 5 2023, 1:30 AM

Harbormaster completed remote builds in B243151: Diff 537255.Jul 5 2023, 1:48 AM

Improve the interface

Implement it with PtxBuilder Interface in
https://reviews.llvm.org/D154060

Herald added a subscriber: csigg. · View Herald TranscriptJul 5 2023, 8:30 AM

Harbormaster completed remote builds in B243225: Diff 537365.Jul 5 2023, 8:30 AM

rebase

guraypp published this revision for review.Jul 5 2023, 8:33 AM

Herald added a project: Restricted Project. · View Herald TranscriptJul 5 2023, 8:33 AM

Herald added subscribers: stephenneuendorffer, jholewinski. · View Herald Transcript

Harbormaster completed remote builds in B243227: Diff 537367.Jul 5 2023, 9:04 AM

nvvm is supposed to match 1 to 1 to nvvm llvm intrinsics so having them lower to inline assembly doesn't make sense to me. Either an intrinsic exist and the nvvm op should lower to it or there wouldn't be an nvvm op.
It seems like this would diverge from the original design of nvvm dialect? Were there some discussions about this?

In D154060#4474069, @ThomasRaoux wrote:

nvvm is supposed to match 1 to 1 to nvvm llvm intrinsics so having them lower to inline assembly doesn't make sense to me. Either an intrinsic exist and the nvvm op should lower to it or there wouldn't be an nvvm op.
It seems like this would diverge from the original design of nvvm dialect? Were there some discussions about this?

Thanks for the review! I am definitely open to hearing your recommendations and suggestions here.

The new nvvm Ops are designed to perfectly match the potential llvm intrinsic, maintaining a 1:1 correspondence. Both this pass and LLVM intrinsics operate at the same level. It's worth noting that even in LLVM, intrinsics are implemented using inline assembly, which aligns with my thought process.

In my understanding, nvvm serves as the platform for generating device assembly or intrinsic, while the nvgpu dialect acts as a bridge from vector to nvvm. So nvgpu is high level dialect to generate a single line of PTX, and nvvm seems the right place to me. So I put this pass in there. This is of course my thinking, your input here is very valuable.

How do you think we can proceed here? What's your recommendations?

In D154060#4474916, @guraypp wrote:

In D154060#4474069, @ThomasRaoux wrote:

nvvm is supposed to match 1 to 1 to nvvm llvm intrinsics so having them lower to inline assembly doesn't make sense to me. Either an intrinsic exist and the nvvm op should lower to it or there wouldn't be an nvvm op.
It seems like this would diverge from the original design of nvvm dialect? Were there some discussions about this?

Thanks for the review! I am definitely open to hearing your recommendations and suggestions here.

The new nvvm Ops are designed to perfectly match the potential llvm intrinsic, maintaining a 1:1 correspondence. Both this pass and LLVM intrinsics operate at the same level.

The part that looks odd to me is that until those intrinsics exist it is hard to be sure that what we are adding to nvvm dialect matches well.

It's worth noting that even in LLVM, intrinsics are implemented using inline assembly, which aligns with my thought process.

Do you have examples? I haven't see it in the codebase so far.

In my understanding, nvvm serves as the platform for generating device assembly or intrinsic, while the nvgpu dialect acts as a bridge from vector to nvvm. So nvgpu is high level dialect to generate a single line of PTX, and nvvm seems the right place to me. So I put this pass in there. This is of course my thinking, your input here is very valuable.

nvvm as llvm or rocdl dialect is meant to match as closely as possible llvm IR in it's design. Moving away from that sounds like a non negligible change to me. What I'm wondering is whether we even need this intermediate step.

How do you think we can proceed here? What's your recommendations?

My recommendation would be to directly generate inline assembly when converting from nvgpu/gpu to nvvm as it is done for existing cases using inline assembly.

The part that looks odd to me is that until those intrinsics exist it is hard to be sure that what we are adding to nvvm dialect matches well.

I thought as long as nvvm op matches with PTX, it also matches with the potential intrinsic.

Do you have examples? I haven't see it in the codebase so far.

See an example below. I mean calling an llvm intrinsic vs blob of inline assembly are same for the llvm or its optimizer. Please correct me if I am wrong.
https://github.com/llvm/llvm-project/blob/5e807c38bf9b5c1528012ec9003953b352f99c79/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td#L76-L81

My recommendation would be to directly generate inline assembly when converting from nvgpu/gpu to nvvm as it is done for existing cases using inline assembly.

In this case, nvgpu will be as low as PTX level. Would it be desirable?

Thanks for the review again.

Maybe I should create a discussion in discourse :)

In D154060#4475381, @guraypp wrote:

The part that looks odd to me is that until those intrinsics exist it is hard to be sure that what we are adding to nvvm dialect matches well.

I thought as long as nvvm op matches with PTX, it also matches with the potential intrinsic.

Do you have examples? I haven't see it in the codebase so far.

See an example below. I mean calling an llvm intrinsic vs blob of inline assembly are same for the llvm or its optimizer. Please correct me if I am wrong.
https://github.com/llvm/llvm-project/blob/5e807c38bf9b5c1528012ec9003953b352f99c79/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td#L76-L81

This is the assembler code for MIR instructions, I don't think it is really comparable. This is basically code that convert MIR to the equivalent assembly, this is much lower than intrinsics and obviously the only way to emit PTX assembly. I don't think you can compare that to inline assembly.

My recommendation would be to directly generate inline assembly when converting from nvgpu/gpu to nvvm as it is done for existing cases using inline assembly.

In this case, nvgpu will be as low as PTX level. Would it be desirable?

That shouldn't change the level of nvgpu and shouldn't affect the design or level of abstraction of those ops. It just means when lowering from nvgpu to nvvm instead of emitting nvvm op that gets then converted to inline assembly you directly emit inline assembly. That also doesn't mean you cannot build abstraction in your lowering code.

Thanks for the review again.

Maybe I should create a discussion in discourse :)

It's up to you, in my opinion if we decide to diverge from the original idea that nvvm is a 1 to 1 mapping to intrinsics this is probably something that should be discussed as this is not the original intent of the design. I'm not saying it is a good or a bad idea but we should have a clear direction on what the dialect should and shouldn't represent.

To summarize my main point is that having an intermediate step before LLVM inline assembly op seems unnecessary and as far as I know is not how nvvm was meant to be used. If you think there is a need for this it would be interesting to mention it, otherwise my guess is that it is actually simpler to skip the intermediate representation.

In D154060#4475942, @ThomasRaoux wrote:

To summarize my main point is that having an intermediate step before LLVM inline assembly op seems unnecessary and as far as I know is not how nvvm was meant to be used. If you think there is a need for this it would be interesting to mention it, otherwise my guess is that it is actually simpler to skip the intermediate representation.

I think nvvm is not unnecessary. Allow me to make a final attempt to convince you. Having BasicPtxBuilderOpInterface in nvvm automates inline assembly generation without writing C++ while nvgpu remains important as it serves as a bridge from memref->llvm lowering.

It is particularly beneficial for ops like cp.async with half nvptx support. See the example below, I can generate intrinsic and PTX without touching C++ (D154345). Previosuly, cp.async calls emitCpAsyncOpZfillAsm that's boilerplate code. Don't you think BasicPtxBuilderOpInterface simplifies a lot?

I understand the design concern of nvvm. If you think this work is beneficial, I start a discussion to relax the design (or I can split nvvm.td file into nvvm.td and nvvm_inlineasm.td). But if you think there is no value, I can work on retargeting the interface on nvgpu dialect.

def NVVM_CpAsyncOp : NVVM_Op<"cp.async.shared.global", [DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>,
  ...
  string llvmBuilder = [{
      ...
      createIntrinsicCall(builder, id, {$dst, $src});
  }];
  let extraClassDeclaration = [{
    bool canBuildPtx() { if(getCpSize()) return true; return false; }

    void getAsmValues(RewriterBase &rewriter, 
        llvm::SmallVectorImpl<std::pair<mlir::Value, mlir::NVVM::PTXRegisterMod>> &asmValues) {
      asmValues.push_back({getDst(), PTXRegisterMod::Read});
      asmValues.push_back({getSrc(), PTXRegisterMod::Read});
      asmValues.push_back({makeConstant(rewriter, getSize()), PTXRegisterMod::Read});
      asmValues.push_back({getCpSize(), PTXRegisterMod::Read});
    }        
  }];
  let extraClassDefinition = [{        
    const char* $cppClass::getPtx() { 
      if(getModifier() == NVVM::LoadCacheModifierKind::CG)
        return "cp.async.cg.shared.global [%0], [%1], %2, %3;\n"; 
      if(getModifier() == NVVM::LoadCacheModifierKind::CA)
        return "cp.async.ca.shared.global [%0], [%1], %2, %3;\n";        
      llvm_unreachable("unsupported cache modifier");      
    }
  }];
}

If op has no intrinsic, generating PTX is even simpler and defining getPtx is enough.

def NVVM_MBarrierTryWaitParitySharedOp : NVVM_Op<"mbarrier.try_wait.parity.shared", 
                    [DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>,
  Results<(outs LLVM_Type:$res)>,
  Arguments<(ins LLVM_i64ptr_shared:$addr, LLVM_Type:$token)> {  
  let assemblyFormat = "$addr `,` $token attr-dict `:` type(operands) `->` type($res)";
  let extraClassDefinition = [{
    const char* $cppClass::getPtx() {
      return "{\n\t"
              ".reg .pred P1; \n\t"
              "mbarrier.try_wait.parity.shared.b64 P1, [%1], %2; \n\t"
              "selp.b32 %0, 1, 0, P1; \n\t"
              "}"; 
    }
  }];
}

I put a draft code here D154624. I put the BasicPtxBuilder as a class and used nvgpu. In this case, building ptx happens in C++. I am not sure it is any better.

In D154060#4477294, @guraypp wrote:
In D154060#4475942, @ThomasRaoux wrote:

To summarize my main point is that having an intermediate step before LLVM inline assembly op seems unnecessary and as far as I know is not how nvvm was meant to be used. If you think there is a need for this it would be interesting to mention it, otherwise my guess is that it is actually simpler to skip the intermediate representation.

I think nvvm is not unnecessary. Allow me to make a final attempt to convince you. Having BasicPtxBuilderOpInterface in nvvm automates inline assembly generation without writing C++ while nvgpu remains important as it serves as a bridge from memref->llvm lowering.

It is particularly beneficial for ops like cp.async with half nvptx support. See the example below, I can generate intrinsic and PTX without touching C++ (D154345). Previosuly, cp.async calls emitCpAsyncOpZfillAsm that's boilerplate code. Don't you think BasicPtxBuilderOpInterface simplifies a lot?

I understand the design concern of nvvm. If you think this work is beneficial, I start a discussion to relax the design (or I can split nvvm.td file into nvvm.td and nvvm_inlineasm.td). But if you think there is no value, I can work on retargeting the interface on nvgpu dialect.
def NVVM_CpAsyncOp : NVVM_Op<"cp.async.shared.global", [DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>,
  ...
  string llvmBuilder = [{
      ...
      createIntrinsicCall(builder, id, {$dst, $src});
  }];
  let extraClassDeclaration = [{
    bool canBuildPtx() { if(getCpSize()) return true; return false; }

    void getAsmValues(RewriterBase &rewriter, 
        llvm::SmallVectorImpl<std::pair<mlir::Value, mlir::NVVM::PTXRegisterMod>> &asmValues) {
      asmValues.push_back({getDst(), PTXRegisterMod::Read});
      asmValues.push_back({getSrc(), PTXRegisterMod::Read});
      asmValues.push_back({makeConstant(rewriter, getSize()), PTXRegisterMod::Read});
      asmValues.push_back({getCpSize(), PTXRegisterMod::Read});
    }        
  }];
  let extraClassDefinition = [{        
    const char* $cppClass::getPtx() { 
      if(getModifier() == NVVM::LoadCacheModifierKind::CG)
        return "cp.async.cg.shared.global [%0], [%1], %2, %3;\n"; 
      if(getModifier() == NVVM::LoadCacheModifierKind::CA)
        return "cp.async.ca.shared.global [%0], [%1], %2, %3;\n";        
      llvm_unreachable("unsupported cache modifier");      
    }
  }];
}
If op has no intrinsic, generating PTX is even simpler and defining getPtx is enough.
def NVVM_MBarrierTryWaitParitySharedOp : NVVM_Op<"mbarrier.try_wait.parity.shared", 
                    [DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>,
  Results<(outs LLVM_Type:$res)>,
  Arguments<(ins LLVM_i64ptr_shared:$addr, LLVM_Type:$token)> {  
  let assemblyFormat = "$addr `,` $token attr-dict `:` type(operands) `->` type($res)";
  let extraClassDefinition = [{
    const char* $cppClass::getPtx() {
      return "{\n\t"
              ".reg .pred P1; \n\t"
              "mbarrier.try_wait.parity.shared.b64 P1, [%1], %2; \n\t"
              "selp.b32 %0, 1, 0, P1; \n\t"
              "}"; 
    }
  }];
}

I understand the motivation and I agree it has some advantages and it allows having an extra level where both intrinsics and inline assembly share a representation. I don't know if there is a use for this extra level at the time which is why I was saying it is probably not necessary. My intuition would be that it is less code and less maintenance to skip the extra level of representation for inline assembly. I'm not sure I follow how this makes the inline ptx emission simpler. Whether the code is in tablegen or not it is still c++ code that needs to emit either the intrinsic or inline PTX? I would assume that the code would be the same emitting directly in conversion to nvvm.

If you think this is better I'm not going to block your work, I just wanted to point out that this doesn't match my mental model of nvvm. I'm not extremely active in this code recently so it probably make sense for someone else to comment more in the area if needed (Alex has a better background than me in LLVM Dialect). My advice in term of simplicity would be to skip the extra op representation but feel free to ignore it. The PTX builder looks neat.

In D154060#4479680, @ThomasRaoux wrote:

In D154060#4477294, @guraypp wrote:

In D154060#4475942, @ThomasRaoux wrote:

It is particularly beneficial for ops like cp.async with half nvptx support. See the example below, I can generate intrinsic and PTX without touching C++ (D154345). Previosuly, cp.async calls emitCpAsyncOpZfillAsm that's boilerplate code. Don't you think BasicPtxBuilderOpInterface simplifies a lot?

I see your point about levering tablgen to simplify ptx declaration. I assume something similar could be done in C++ but this is definitely a cool representation. I agree this is an advantage of this approach.

I have not followed the full discussion here but consider the following case:

some op exists at the nvgpu level on higher-level MLIR types
it lowers to a bunch of ops in the NVVM dialect
there is not yet an intrinsic representation in LLVM, there should be one, but we have no ETA on this

Isn't the NVVM dialect the right place to put the op and a lowering to PTX ?

Otherwise do we need to start distinguishing in NVGPU ops that should become LLVM intrinsics in the fullness of time and quarantine them for a potential future move to NVVM once the LLVM intrinsic is available ?

Maybe I missed the point of the discussion, please lmk if I need to dig deeper to better understand what the fundamental issues are here.

In D154060#4481135, @nicolasvasilache wrote:

I have not followed the full discussion here but consider the following case:

some op exists at the nvgpu level on higher-level MLIR types

it lowers to a bunch of ops in the NVVM dialect

there is not yet an intrinsic representation in LLVM, there should be one, but we have no ETA on this

Isn't the NVVM dialect the right place to put the op and a lowering to PTX ?

Or you can directly LLVM inline assembly ops where you would emit the NVVM that doesn't exist because there is no intrinsic (as it is currently done for cases that don't have intrinsics). It sill happens in NVGPU to NVVM but you don't create an intermediate NVVM op that then needs to be lowered to LLVM dialect.

Otherwise do we need to start distinguishing in NVGPU ops that should become LLVM intrinsics in the fullness of time and quarantine them for a potential future move to NVVM once the LLVM intrinsic is available ?

We definitely don't want to do that. My comments were unrelated to NVGPU dialect and shouldn't affects what ops are added there.

Maybe I missed the point of the discussion, please lmk if I need to dig deeper to better understand what the fundamental issues are here.

I don't think we need much more discussions on that. I just meant to give my opinion and it is totally fine to continue with what you have.

In D154060#4481172, @ThomasRaoux wrote:

In D154060#4481135, @nicolasvasilache wrote:

I have not followed the full discussion here but consider the following case:

some op exists at the nvgpu level on higher-level MLIR types

it lowers to a bunch of ops in the NVVM dialect

there is not yet an intrinsic representation in LLVM, there should be one, but we have no ETA on this

Isn't the NVVM dialect the right place to put the op and a lowering to PTX ?

Or you can directly LLVM inline assembly ops where you would emit the NVVM that doesn't exist because there is no intrinsic (as it is currently done for cases that don't have intrinsics). It sill happens in NVGPU to NVVM but you don't create an intermediate NVVM op that then needs to be lowered to LLVM dialect.

Ah yes ok, that is where NGPU kicks in, I had glanced over the discussion and mistakenly thought we talking about the NVGPU dialect.

Otherwise do we need to start distinguishing in NVGPU ops that should become LLVM intrinsics in the fullness of time and quarantine them for a potential future move to NVVM once the LLVM intrinsic is available ?

We definitely don't want to do that. My comments were unrelated to NVGPU dialect and shouldn't affects what ops are added there.

+1, sorry for the confusion!

Maybe I missed the point of the discussion, please lmk if I need to dig deeper to better understand what the fundamental issues are here.

I don't think we need much more discussions on that. I just meant to give my opinion and it is totally fine to continue with what you have.

I'd certainly want the better PTX builder and nice inference of the magic r m etc.

I have no strong opinion on the NVVM ops themselves.
It seems a bit closer to the final landing place in NVVM and then the switch from InlineAsm to LLVM intrinsic can happen locally within 1 file when ready but it is not a big blocker to also update the Conversion at the same time.

nicolasvasilache added inline comments.Jul 11 2023, 1:43 AM

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
164 ↗	(On Diff #537367)	makeConstantI32 please ? since this is ahrdcoded ?
174 ↗	(On Diff #537367)	Can you expand the description here to say that the default implementation emits results, inputs and then attributes?

Upon deeper inspection, I se great value in the way this is built, we can iterate to further improvements if needed in the future, let's get this rolling for now!

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
140 ↗	(On Diff #537367)	this is off it seems
149 ↗	(On Diff #537367)	`Return whether the operation has memory side effects.`
408 ↗	(On Diff #537367)	it is really great that we can actually just see the ptx with a `"".format`-like behavior. This greatly simplifies the understanding vs building C++ by hand and reduces surprise. Nice work!
mlir/lib/Conversion/NVVMToLLVM/NVVMToLLVM.cpp
73	Add `TODO: f8,f16,bf16` etc ?
257	nit: nl
mlir/test/Conversion/NVVMToLLVM/nvvm-to-llvm.mlir
26	nice!

This revision is now accepted and ready to land.Jul 11 2023, 1:49 AM

address comments

fix debug printer (errs -> dbgs)

guraypp edited the summary of this revision. (Show Details)Jul 11 2023, 2:57 AM

This revision was landed with ongoing or failed builds.Jul 11 2023, 3:14 AM

Closed by commit rGdd080c7579e0: [mlir][nvvm] Add NVVMToLLVM Pass (authored by guraypp). · Explain Why

This revision was automatically updated to reflect the committed changes.

guraypp added a commit: rGdd080c7579e0: [mlir][nvvm] Add NVVMToLLVM Pass.

Harbormaster completed remote builds in B244390: Diff 538986.Jul 11 2023, 4:33 AM

khasanovaa mentioned this in rGbe29fe2f987b: Fix bazel build file for D154060..Jul 11 2023, 8:34 AM

Revision Contents

Path

Size

mlir/

include/

mlir/

Conversion/

NVVMToLLVM/

NVVMToLLVM.h

26 lines

Passes.h

1 line

Passes.td

15 lines

lib/

Conversion/

CMakeLists.txt

1 line

NVVMToLLVM/

CMakeLists.txt

21 lines

NVVMToLLVM.cpp

256 lines

test/

Conversion/

NVVMToLLVM/

nvvm-to-llvm.mlir

25 lines

Diff 536732

mlir/include/mlir/Conversion/NVVMToLLVM/NVVMToLLVM.h

This file was added.

				//===- NVVMTOLLVMPass.h - Convert NVVM to LLVM dialect --------- C++ --===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				#ifndef MLIR_CONVERSION_NVVMTOLLVM_NVVMTOLLVMPASS_H_
				#define MLIR_CONVERSION_NVVMTOLLVM_NVVMTOLLVMPASS_H_

				#include <memory>

				namespace mlir {

				class LLVMTypeConverter;
				class RewritePatternSet;
				class Pass;

				#define GEN_PASS_DECL_CONVERTNVVMTOLLVMPASS
				#include "mlir/Conversion/Passes.h.inc"

				void populateNVVMToLLVMConversionPatterns(LLVMTypeConverter &converter,
				RewritePatternSet &patterns);
				} // namespace mlir

				#endif // MLIR_CONVERSION_NVVMTOLLVM_NVVMTOLLVMPASS_H_

mlir/include/mlir/Conversion/Passes.h

	Show All 34 Lines
	#include "mlir/Conversion/LinalgToStandard/LinalgToStandard.h"			#include "mlir/Conversion/LinalgToStandard/LinalgToStandard.h"
	#include "mlir/Conversion/MathToFuncs/MathToFuncs.h"			#include "mlir/Conversion/MathToFuncs/MathToFuncs.h"
	#include "mlir/Conversion/MathToLLVM/MathToLLVM.h"			#include "mlir/Conversion/MathToLLVM/MathToLLVM.h"
	#include "mlir/Conversion/MathToLibm/MathToLibm.h"			#include "mlir/Conversion/MathToLibm/MathToLibm.h"
	#include "mlir/Conversion/MathToSPIRV/MathToSPIRVPass.h"			#include "mlir/Conversion/MathToSPIRV/MathToSPIRVPass.h"
	#include "mlir/Conversion/MemRefToLLVM/MemRefToLLVM.h"			#include "mlir/Conversion/MemRefToLLVM/MemRefToLLVM.h"
	#include "mlir/Conversion/MemRefToSPIRV/MemRefToSPIRVPass.h"			#include "mlir/Conversion/MemRefToSPIRV/MemRefToSPIRVPass.h"
	#include "mlir/Conversion/NVGPUToNVVM/NVGPUToNVVM.h"			#include "mlir/Conversion/NVGPUToNVVM/NVGPUToNVVM.h"
				#include "mlir/Conversion/NVVMToLLVM/NVVMToLLVM.h"
	#include "mlir/Conversion/OpenACCToSCF/ConvertOpenACCToSCF.h"			#include "mlir/Conversion/OpenACCToSCF/ConvertOpenACCToSCF.h"
	#include "mlir/Conversion/OpenMPToLLVM/ConvertOpenMPToLLVM.h"			#include "mlir/Conversion/OpenMPToLLVM/ConvertOpenMPToLLVM.h"
	#include "mlir/Conversion/PDLToPDLInterp/PDLToPDLInterp.h"			#include "mlir/Conversion/PDLToPDLInterp/PDLToPDLInterp.h"
	#include "mlir/Conversion/ReconcileUnrealizedCasts/ReconcileUnrealizedCasts.h"			#include "mlir/Conversion/ReconcileUnrealizedCasts/ReconcileUnrealizedCasts.h"
	#include "mlir/Conversion/SCFToControlFlow/SCFToControlFlow.h"			#include "mlir/Conversion/SCFToControlFlow/SCFToControlFlow.h"
	#include "mlir/Conversion/SCFToGPU/SCFToGPUPass.h"			#include "mlir/Conversion/SCFToGPU/SCFToGPUPass.h"
	#include "mlir/Conversion/SCFToOpenMP/SCFToOpenMP.h"			#include "mlir/Conversion/SCFToOpenMP/SCFToOpenMP.h"
	#include "mlir/Conversion/SCFToSPIRV/SCFToSPIRVPass.h"			#include "mlir/Conversion/SCFToSPIRV/SCFToSPIRVPass.h"
	Show All 22 Lines

mlir/include/mlir/Conversion/Passes.td

Show First 20 Lines • Show All 704 Lines • ▼ Show 20 Lines	def ConvertMemRefToSPIRV : Pass<"convert-memref-to-spirv"> {
let options = [		let options = [
Option<"boolNumBits", "bool-num-bits",		Option<"boolNumBits", "bool-num-bits",
"int", /default=/"8",		"int", /default=/"8",
"The number of bits to store a boolean value">		"The number of bits to store a boolean value">
];		];
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
		// NVVMToLLVM
		//===----------------------------------------------------------------------===//

		def ConvertNVVMToLLVMPass : Pass<"convert-nvvm-to-llvm"> {
		let summary = "Convert NVVM dialect to LLVM dialect";
		let description = [{
		This pass generates inline assembly for the NVVM ops which is not
		implemented in LLVM core.
		}];
		let dependentDialects = [
		"NVVM::NVVMDialect",
		];
		}

		//===----------------------------------------------------------------------===//
// NVGPUToNVVM		// NVGPUToNVVM
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

def ConvertNVGPUToNVVMPass : Pass<"convert-nvgpu-to-nvvm"> {		def ConvertNVGPUToNVVMPass : Pass<"convert-nvgpu-to-nvvm"> {
let summary = "Convert NVGPU dialect to NVVM dialect";		let summary = "Convert NVGPU dialect to NVVM dialect";
let description = [{		let description = [{
This pass converts supported NVGPU ops to NVVM dialect intrinsics.		This pass converts supported NVGPU ops to NVVM dialect intrinsics.
}];		}];
▲ Show 20 Lines • Show All 407 Lines • Show Last 20 Lines

mlir/lib/Conversion/CMakeLists.txt

	Show All 24 Lines
	add_subdirectory(LLVMCommon)			add_subdirectory(LLVMCommon)
	add_subdirectory(MathToFuncs)			add_subdirectory(MathToFuncs)
	add_subdirectory(MathToLibm)			add_subdirectory(MathToLibm)
	add_subdirectory(MathToLLVM)			add_subdirectory(MathToLLVM)
	add_subdirectory(MathToSPIRV)			add_subdirectory(MathToSPIRV)
	add_subdirectory(MemRefToLLVM)			add_subdirectory(MemRefToLLVM)
	add_subdirectory(MemRefToSPIRV)			add_subdirectory(MemRefToSPIRV)
	add_subdirectory(NVGPUToNVVM)			add_subdirectory(NVGPUToNVVM)
				add_subdirectory(NVVMToLLVM)
	add_subdirectory(OpenACCToSCF)			add_subdirectory(OpenACCToSCF)
	add_subdirectory(OpenMPToLLVM)			add_subdirectory(OpenMPToLLVM)
	add_subdirectory(PDLToPDLInterp)			add_subdirectory(PDLToPDLInterp)
	add_subdirectory(ReconcileUnrealizedCasts)			add_subdirectory(ReconcileUnrealizedCasts)
	add_subdirectory(SCFToControlFlow)			add_subdirectory(SCFToControlFlow)
	add_subdirectory(SCFToGPU)			add_subdirectory(SCFToGPU)
	add_subdirectory(SCFToOpenMP)			add_subdirectory(SCFToOpenMP)
	add_subdirectory(SCFToSPIRV)			add_subdirectory(SCFToSPIRV)
	Show All 12 Lines

mlir/lib/Conversion/NVVMToLLVM/CMakeLists.txt

This file was added.

				add_mlir_conversion_library(MLIRNVVMToLLVM
				NVVMToLLVM.cpp

				ADDITIONAL_HEADER_DIRS
				${MLIR_MAIN_INCLUDE_DIR}/mlir/Conversion/NVVMToLLVM

				DEPENDS
				MLIRConversionPassIncGen

				LINK_COMPONENTS
				Core

				LINK_LIBS PUBLIC
				MLIRGPUDialect
				MLIRLLVMCommonConversion
				MLIRLLVMDialect
				MLIRNVVMDialect
				MLIRNVGPUDialect
				MLIRPass
				MLIRTransforms
				)

mlir/lib/Conversion/NVVMToLLVM/NVVMToLLVM.cpp

This file was added.

				//===- NVVMToLLVM.cpp - NVVM to LLVM dialect conversion -----------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This file implements a translation NVVM ops which is not supported in LLVM
				// core.
				//
				//===----------------------------------------------------------------------===//

				#include "mlir/Conversion/NVVMToLLVM/NVVMToLLVM.h"

				#include "mlir/Conversion/LLVMCommon/ConversionTarget.h"
				#include "mlir/Conversion/LLVMCommon/Pattern.h"
				#include "mlir/Dialect/Func/IR/FuncOps.h"
				#include "mlir/Dialect/GPU/IR/GPUDialect.h"
				#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
				#include "mlir/Dialect/LLVMIR/LLVMTypes.h"
				#include "mlir/Dialect/LLVMIR/NVVMDialect.h"
				#include "mlir/Dialect/MemRef/IR/MemRef.h"
				#include "mlir/IR/Attributes.h"
				#include "mlir/IR/BuiltinTypes.h"
				#include "mlir/IR/MLIRContext.h"
				#include "mlir/IR/PatternMatch.h"
				#include "mlir/IR/TypeUtilities.h"
				#include "mlir/IR/Value.h"
				#include "mlir/Pass/Pass.h"
				#include "mlir/Support/LLVM.h"
				#include "mlir/Support/LogicalResult.h"
				#include "llvm/ADT/SmallPtrSet.h"
				#include "llvm/ADT/SmallSet.h"
				#include "llvm/ADT/StringRef.h"
				#include "llvm/Support/raw_ostream.h"
				#include <string>

				#define DEBUG_TYPE "nvvm-to-llvm"
				#define DBGS() (llvm::dbgs() << "[" DEBUG_TYPE "]: ")
				#define DBGSNL() (llvm::dbgs() << "\n")

				namespace mlir {
				#define GEN_PASS_DEF_CONVERTNVVMTOLLVMPASS
				#include "mlir/Conversion/Passes.h.inc"
				} // namespace mlir

				using namespace mlir;

				namespace {
				struct ConvertNVVMToLLVMPass
				: public impl::ConvertNVVMToLLVMPassBase<ConvertNVVMToLLVMPass> {
				using Base::Base;

				void getDependentDialects(DialectRegistry &registry) const override {
				registry.insert<LLVM::LLVMDialect, NVVM::NVVMDialect>();
				}

				void runOnOperation() override {
				LowerToLLVMOptions options(&getContext());
				RewritePatternSet patterns(&getContext());
				LLVMTypeConverter converter(&getContext(), options);
				IRRewriter rewriter(&getContext());
				populateNVVMToLLVMConversionPatterns(converter, patterns);
				LLVMConversionTarget target(getContext());
				target.addLegalDialect<::mlir::LLVM::LLVMDialect>();
				if (failed(applyPartialConversion(getOperation(), target,
				std::move(patterns))))
				signalPassFailure();
				}
				};
				} // namespace

				nicolasvasilacheUnsubmitted Not Done Reply Inline Actions Add `TODO: f8,f16,bf16` etc ? nicolasvasilache: Add `TODO: f8,f16,bf16` etc ?
				enum InputModifier { read, readwrite, write };
				class GenerateInlineAsm {
				Operation *op;
				ConversionPatternRewriter &rewriter;
				const char *asmStr;
				SmallVector<Value> asmVals;
				std::string asmConstraints;
				bool sideEffects;
				bool hasResult = false;

				// https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html#constraints
				char getRegisterType(Value v) {
				if (v.getDefiningOp<LLVM::ConstantOp>())
				return 'n';
				if (v.getType().isInteger(16))
				return 'h';
				if (v.getType().isInteger(32))
				return 'r';
				if (v.getType().isInteger(64))
				return 'l';
				if (v.getType().isF32())
				return 'f';
				if (v.getType().isF64())
				return 'd';
				if (auto ptr = v.getType().dyn_cast<LLVM::LLVMPointerType>()) {
				// Shared address spaces is addressed with 32-bit pointers.
				if (ptr.getAddressSpace() == NVVM::kSharedMemorySpace) {
				return 'r';
				}
				return 'l';
				}
				assert(false && "Register type is not handled yet");
				return ' ';
				}

				Value makeConstantValue(unsigned v) {
				return rewriter.create<LLVM::ConstantOp>(
				op->getLoc(),
				IntegerAttr::get(IntegerType::get(op->getContext(), 32), v));
				}

				public:
				GenerateInlineAsm(Operation *op, ConversionPatternRewriter &rewriter,
				const char *ptxBlock, bool sideEffects = false)
				: op(op), rewriter(rewriter), asmStr(ptxBlock), sideEffects(sideEffects) {
				}

				void insertValue(Value v, InputModifier itype = InputModifier::read) {
				llvm::raw_string_ostream ss(asmConstraints);
				if (itype == InputModifier::read) {
				asmVals.push_back(v);
				} else if (itype == InputModifier::readwrite) {
				asmVals.push_back(v);
				ss << "+";
				hasResult = true;
				} else if (itype == InputModifier::write) {
				ss << "=";
				hasResult = true;
				}
				ss << getRegisterType(v) << ",";
				ss.flush();
				}

				void insertValue(unsigned v, InputModifier itype = InputModifier::read) {
				insertValue(makeConstantValue(v), itype);
				}

				LLVM::InlineAsmOp generate(ConversionPatternRewriter &rewriter,
				Operation *op) {
				auto asmDialectAttr =
				LLVM::AsmDialectAttr::get(op->getContext(), LLVM::AsmDialect::AD_ATT);
				Type resultType = hasResult ? op->getResult(0).getType()
				: LLVM::LLVMVoidType::get(op->getContext());

				// Remove the last comma from the constraints string.
				if (asmConstraints[asmConstraints.size() - 1] == ',')
				asmConstraints.pop_back();

				return rewriter.create<LLVM::InlineAsmOp>(
				op->getLoc(), resultType,
				/operands=/asmVals,
				/asm_string=/asmStr,
				/constraints=/asmConstraints.data(),
				/has_side_effects=/sideEffects,
				/is_align_stack=/false,
				/asm_dialect=/asmDialectAttr,
				/operand_attrs=/ArrayAttr());
				}

				void generateAsmAndReplaceOp(ConversionPatternRewriter &rewriter,
				Operation *op) {
				LLVM::InlineAsmOp inlineAsmOp = generate(rewriter, op);
				LLVM_DEBUG(DBGS() << inlineAsmOp);
				if (inlineAsmOp->getNumResults() == op->getNumResults())
				rewriter.replaceOp(op, inlineAsmOp);
				else
				rewriter.eraseOp(op);
				}
				};

				struct MBarrierArriveExpectTxSharedOpLowering
				: public ConvertOpToLLVMPattern<NVVM::MBarrierArriveExpectTxSharedOp> {
				using ConvertOpToLLVMPattern<
				NVVM::MBarrierArriveExpectTxSharedOp>::ConvertOpToLLVMPattern;
				LogicalResult
				matchAndRewrite(NVVM::MBarrierArriveExpectTxSharedOp op, OpAdaptor adaptor,
				ConversionPatternRewriter &rewriter) const override {
				const char *ptx = "mbarrier.arrive.expect_tx.shared.b64 %0, [%1], %2;";
				GenerateInlineAsm generator(op, rewriter, ptx, true);
				generator.insertValue(op.getRes(), write);
				generator.insertValue(op.getAddr());
				generator.insertValue(op.getTxcount());
				generator.generateAsmAndReplaceOp(rewriter, op);
				return success();
				}
				};

				struct MBarrierArriveExpectTxOpLowering
				: public ConvertOpToLLVMPattern<NVVM::MBarrierArriveExpectTxOp> {
				using ConvertOpToLLVMPattern<
				NVVM::MBarrierArriveExpectTxOp>::ConvertOpToLLVMPattern;
				LogicalResult
				matchAndRewrite(NVVM::MBarrierArriveExpectTxOp op, OpAdaptor adaptor,
				ConversionPatternRewriter &rewriter) const override {
				const char *ptx = "mbarrier.arrive.expect_tx.b64 %0, [%1], %2;";
				GenerateInlineAsm generator(op, rewriter, ptx, true);
				generator.insertValue(op.getRes(), write);
				generator.insertValue(op.getAddr());
				generator.insertValue(op.getTxcount());
				generator.generateAsmAndReplaceOp(rewriter, op);
				return success();
				}
				};

				struct MBarrierTryWaitParitySharedOpLowering
				: public ConvertOpToLLVMPattern<NVVM::MBarrierTryWaitParitySharedOp> {
				using ConvertOpToLLVMPattern<
				NVVM::MBarrierTryWaitParitySharedOp>::ConvertOpToLLVMPattern;
				LogicalResult
				matchAndRewrite(NVVM::MBarrierTryWaitParitySharedOp op, OpAdaptor adaptor,
				ConversionPatternRewriter &rewriter) const override {
				const char *ptx = "{\n\t"
				".reg .pred P1; \n\t"
				"mbarrier.try_wait.parity.shared.b64 P1, [%1], %2; \n\t"
				"selp.b32 %0, 1, 0, P1; \n\t"
				"}";
				GenerateInlineAsm generator(op, rewriter, ptx, true);
				generator.insertValue(op.getRes(), write);
				generator.insertValue(op.getAddr());
				generator.insertValue(op.getToken());
				generator.generateAsmAndReplaceOp(rewriter, op);
				return success();
				}
				};

				struct MBarrierTryWaitParityOpLowering
				: public ConvertOpToLLVMPattern<NVVM::MBarrierTryWaitParityOp> {
				using ConvertOpToLLVMPattern<
				NVVM::MBarrierTryWaitParityOp>::ConvertOpToLLVMPattern;
				LogicalResult
				matchAndRewrite(NVVM::MBarrierTryWaitParityOp op, OpAdaptor adaptor,
				ConversionPatternRewriter &rewriter) const override {
				const char *ptx = "{\n\t"
				".reg .pred P1; \n\t"
				"mbarrier.try_wait.parity.b64 P1, [%1], %2; \n\t"
				"selp.b32 %0, 1, 0, P1; \n\t"
				"}";
				GenerateInlineAsm generator(op, rewriter, ptx, true);
				generator.insertValue(op.getRes(), write);
				generator.insertValue(op.getAddr());
				generator.insertValue(op.getToken());
				generator.generateAsmAndReplaceOp(rewriter, op);
				return success();
				}
				};

				void mlir::populateNVVMToLLVMConversionPatterns(LLVMTypeConverter &converter,
				RewritePatternSet &patterns) {
				patterns.add<
				MBarrierArriveExpectTxSharedOpLowering, MBarrierArriveExpectTxOpLowering,
				MBarrierTryWaitParitySharedOpLowering, MBarrierTryWaitParityOpLowering>(
				converter);
				}
				nicolasvasilacheUnsubmitted Not Done Reply Inline Actions nit: nl nicolasvasilache: nit: nl

mlir/test/Conversion/NVVMToLLVM/nvvm-to-llvm.mlir

This file was added.

				// RUN: mlir-opt --convert-nvvm-to-llvm --split-input-file %s \| FileCheck %s

				llvm.func @init_mbarrier_arrive_expect_tx(%barrier : !llvm.ptr<3>, %txcount : i32) -> i32{
				//CHECK : llvm.inline_asm has_side_effects asm_dialect = att "mbarrier.arrive.expect_tx.shared.b64 %0, [%1], %2;", "=r,r,r" %arg0, %arg1 : (!llvm.ptr<3>, i32) -> i32
				%res = nvvm.mbarrier.arrive.expect_tx.shared %barrier, %txcount : !llvm.ptr<3>, i32 -> i32
				llvm.return %res : i32
				}

				llvm.func @init_mbarrier_arrive_expect_tx_generic(%barrier : !llvm.ptr, %txcount : i32)-> i32 {
				// CHECK: llvm.inline_asm has_side_effects asm_dialect = att "mbarrier.arrive.expect_tx.b64 %0, [%1], %2;", "=r,l,r" %arg0, %arg1 : (!llvm.ptr, i32) -> i32
				%res = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount : !llvm.ptr, i32 -> i32
				llvm.return %res : i32
				}

				llvm.func @init_mbarrier_try_wait.parity.shared(%barrier : !llvm.ptr<3>, %token : i32) -> i32 {
				// CHECK : llvm.inline_asm has_side_effects asm_dialect = att "{\0A\09.reg .pred P1; \0A\09mbarrier.try_wait.parity.shared.b64 P1, [%1], %2; \0A\09selp.b32 %0, 1, 0, P1; \0A\09}", "=r,r,r" %arg0, %arg1 : (!llvm.ptr<3>, i32) -> i32
				%res = nvvm.mbarrier.try_wait.parity.shared %barrier, %token : !llvm.ptr<3>, i32 -> i32
				llvm.return %res : i32
				}

				llvm.func @init_mbarrier_try_wait.parity(%barrier : !llvm.ptr, %token : i32) -> i32{
				// CHECK: llvm.inline_asm has_side_effects asm_dialect = att "{\0A\09.reg .pred P1; \0A\09mbarrier.try_wait.parity.b64 P1, [%1], %2; \0A\09selp.b32 %0, 1, 0, P1; \0A\09}", "=r,l,r" %arg0, %arg1 : (!llvm.ptr, i32) -> i32
				%res = nvvm.mbarrier.try_wait.parity %barrier, %token : !llvm.ptr, i32 -> i32
				llvm.return %res : i32
				}
				nicolasvasilacheUnsubmitted Not Done Reply Inline Actions nice! nicolasvasilache: nice!

This is an archive of the discontinued LLVM Phabricator instance.

[mlir][nvvm] Add NVVMToLLVM PassClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 536732

mlir/include/mlir/Conversion/NVVMToLLVM/NVVMToLLVM.h

mlir/include/mlir/Conversion/Passes.h

mlir/include/mlir/Conversion/Passes.td

mlir/lib/Conversion/CMakeLists.txt

mlir/lib/Conversion/NVVMToLLVM/CMakeLists.txt

mlir/lib/Conversion/NVVMToLLVM/NVVMToLLVM.cpp

mlir/test/Conversion/NVVMToLLVM/nvvm-to-llvm.mlir

[mlir][nvvm] Add NVVMToLLVM Pass
ClosedPublic