This is an archive of the discontinued LLVM Phabricator instance.

[mlir][nvvm] Add NVVMToLLVM Pass
ClosedPublic

Authored by guraypp on Jun 29 2023, 3:28 AM.

Details

Summary

It introduces an NVVMToLLVM Pass and a BasicPtxBuilderOpInterface interface. The Pass pattern-matches all the NVVM Ops that implement the BasicPtxBuilderOpInterface interface and lowers them to LLVM inline assembly Ops.

The BasicPtxBuilderOpInterface interface is used by the convert-nvvm-to-llvm pass, which lowers Ops that implement this interface to inline assembly Ops. The interface provides several methods that drive this lowering.

The getPtx method returns the PTX code template. The hasSideEffect method determines whether the Op has any side effects on memory. The hasIntrinsic method indicates whether the operation has intrinsic support in LLVM; this is particularly useful for Ops that have intrinsic support for only some of their cases. The getAsmValues method returns the arguments to be passed to the PTX code: the results come first and are used as write registers, followed by the operands and attributes.
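
To make the shape of the interface concrete, here is a rough sketch of the hooks as they might appear on an implementing Op. The getAsmValues signature is taken from the cp.async example later in this review; the other return types and the placeholder class name are assumptions for illustration, not the actual interface definition.

cpp
// Sketch only: the hooks convert-nvvm-to-llvm relies on. Return types
// other than getAsmValues' are assumed; SomeNvvmOp is a placeholder.
#include "mlir/Dialect/LLVMIR/NVVMDialect.h" // mlir::NVVM::PTXRegisterMod
#include "mlir/IR/PatternMatch.h"            // mlir::RewriterBase
#include "llvm/ADT/SmallVector.h"            // llvm::SmallVectorImpl
#include <utility>                           // std::pair

class SomeNvvmOp /* hypothetical Op implementing the interface */ {
public:
  // PTX template; %0, %1, ... index the values from getAsmValues.
  const char *getPtx();

  // Whether the op has side effects on memory; controls the
  // has_side_effects flag on the generated llvm.inline_asm.
  bool hasSideEffect();

  // Whether an LLVM intrinsic covers this op (or this particular case);
  // if it does, the intrinsic is emitted instead of inline assembly.
  bool hasIntrinsic();

  // Values substituted into the PTX template: results first (registered
  // as Write registers), then operands and attributes (Read registers).
  void getAsmValues(
      mlir::RewriterBase &rewriter,
      llvm::SmallVectorImpl<
          std::pair<mlir::Value, mlir::NVVM::PTXRegisterMod>> &asmValues);
};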

Example:

If we have the following Op definition that returns PTX code through getPtx:

tablegen
def NVVM_MBarrierArriveExpectTxOp : NVVM_Op<"mbarrier.arrive.expect_tx",
                    [DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>,
  Results<(outs LLVM_Type:$res)>, Arguments<(ins LLVM_i64ptr_any:$addr, I32:$txcount)> {
  ...
  let extraClassDefinition = [{
    const char* $cppClass::getPtx() { return "mbarrier.arrive.expect_tx.b64 %0, [%1], %2;"; }
  }];
}

The NVVM Op then looks like this:

mlir
  %0 = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount : !llvm.ptr, i32 -> i32

The convert-nvvm-to-llvm Pass generates the following inline assembly Op carrying the PTX, keeping the order of arguments the same. The read/write modifiers are set based on the operand and result types.

mlir
  %0 = llvm.inline_asm has_side_effects asm_dialect = att "mbarrier.arrive.expect_tx.b64 %0, [%1], %2;", "=r,l,r" %arg0, %arg1 : (!llvm.ptr, i32) -> i32
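
For illustration, the constraint string above ("=r,l,r") follows from the types: the i32 result becomes a write register "=r", the !llvm.ptr operand a 64-bit register "l", and the i32 operand a read register "r". Below is a minimal sketch of such a mapping; it is illustrative only and not the pass's actual code (the real logic lives in NVVMToLLVM.cpp and may differ).

cpp
// Illustrative sketch: choose an inline-asm constraint per value,
// mirroring the "=r,l,r" string above.
#include "mlir/Dialect/LLVMIR/LLVMTypes.h" // mlir::LLVM::LLVMPointerType
#include "mlir/IR/Types.h"                 // mlir::Type
#include <string>

std::string getConstraint(mlir::Type type, bool isResult) {
  // Results are write registers and get the '=' modifier.
  std::string prefix = isResult ? "=" : "";
  if (type.isa<mlir::LLVM::LLVMPointerType>())
    return prefix + "l"; // pointers live in 64-bit registers
  if (type.isInteger(32))
    return prefix + "r"; // 32-bit register
  // Other cases (i64 -> "l", f32 -> "f", ...) would follow similarly.
  return prefix + "r";
}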

Diff Detail

Event Timeline

guraypp created this revision.Jun 29 2023, 3:28 AM
Herald added a reviewer: dcaballe. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
guraypp updated this revision to Diff 536728.Jul 3 2023, 5:24 AM

add constant support to inline assembly generator

guraypp updated this revision to Diff 536733.Jul 3 2023, 5:52 AM

fix typo

guraypp updated this revision to Diff 537255.Jul 5 2023, 1:27 AM

Introduced BasicPtxBuilderOpInterface that automatically builds PTX based on Op.

guraypp edited the summary of this revision. (Show Details)Jul 5 2023, 1:30 AM
guraypp updated this revision to Diff 537362.Jul 5 2023, 8:26 AM

Improve the interface

guraypp updated this revision to Diff 537365.Jul 5 2023, 8:30 AM

Implement it with PtxBuilder Interface in
https://reviews.llvm.org/D154060

guraypp published this revision for review.Jul 5 2023, 8:33 AM

nvvm is supposed to match 1:1 with the nvvm llvm intrinsics, so having them lower to inline assembly doesn't make sense to me. Either an intrinsic exists and the nvvm op should lower to it, or there shouldn't be an nvvm op.
It seems like this would diverge from the original design of the nvvm dialect? Was there any discussion about this?

Thanks for the review! I am definitely open to hearing your recommendations and suggestions here.

The new nvvm Ops are designed to perfectly match the potential llvm intrinsic, maintaining a 1:1 correspondence. Both this pass and LLVM intrinsics operate at the same level. It's worth noting that even in LLVM, intrinsics are implemented using inline assembly, which aligns with my thought process.

In my understanding, nvvm serves as the platform for generating device assembly or intrinsics, while the nvgpu dialect acts as a bridge from vector to nvvm. So nvgpu is too high-level a dialect to generate a single line of PTX, and nvvm seems the right place to me, which is why I put this pass there. This is of course my thinking; your input here is very valuable.

How do you think we should proceed here? What are your recommendations?

The new nvvm Ops are designed to perfectly match the potential llvm intrinsic, maintaining a 1:1 correspondence. Both this pass and LLVM intrinsics operate at the same level.

The part that looks odd to me is that until those intrinsics exist, it is hard to be sure that what we are adding to the nvvm dialect matches them well.

It's worth noting that even in LLVM, intrinsics are implemented using inline assembly, which aligns with my thought process.

Do you have examples? I haven't seen it in the codebase so far.

In my understanding, nvvm serves as the platform for generating device assembly or intrinsics, while the nvgpu dialect acts as a bridge from vector to nvvm. So nvgpu is too high-level a dialect to generate a single line of PTX, and nvvm seems the right place to me, which is why I put this pass there. This is of course my thinking; your input here is very valuable.

nvvm, like the llvm or rocdl dialects, is meant to match llvm IR as closely as possible in its design. Moving away from that sounds like a non-negligible change to me. What I'm wondering is whether we even need this intermediate step.

How do you think we should proceed here? What are your recommendations?

My recommendation would be to directly generate inline assembly when converting from nvgpu/gpu to nvvm, as is done for the existing cases that use inline assembly.

The part that looks odd to me is that until those intrinsics exist, it is hard to be sure that what we are adding to the nvvm dialect matches them well.

I thought that as long as the nvvm op matches the PTX, it also matches the potential intrinsic.

Do you have examples? I haven't seen it in the codebase so far.

See the example below. I mean that calling an llvm intrinsic vs. a blob of inline assembly is the same for llvm and its optimizer. Please correct me if I am wrong.
https://github.com/llvm/llvm-project/blob/5e807c38bf9b5c1528012ec9003953b352f99c79/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td#L76-L81

My recommendation would be to directly generate inline assembly when converting from nvgpu/gpu to nvvm, as is done for the existing cases that use inline assembly.

In this case, nvgpu would be as low as the PTX level. Would that be desirable?

Thanks for the review again.

Maybe I should create a discussion in discourse :)

Do you have examples? I haven't seen it in the codebase so far.

See the example below. I mean that calling an llvm intrinsic vs. a blob of inline assembly is the same for llvm and its optimizer. Please correct me if I am wrong.
https://github.com/llvm/llvm-project/blob/5e807c38bf9b5c1528012ec9003953b352f99c79/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td#L76-L81

This is the assembler code for MIR instructions; I don't think it is really comparable. This is basically the code that converts MIR to the equivalent assembly, which is much lower-level than intrinsics and obviously the only way to emit PTX assembly. I don't think you can compare that to inline assembly.

My recommendation would be to directly generate inline assembly when converting from nvgpu/gpu to nvvm, as is done for the existing cases that use inline assembly.

In this case, nvgpu would be as low as the PTX level. Would that be desirable?

That shouldn't change the level of nvgpu and shouldn't affect the design or level of abstraction of those ops. It just means that when lowering from nvgpu to nvvm, instead of emitting an nvvm op that then gets converted to inline assembly, you directly emit the inline assembly. That also doesn't mean you cannot build abstractions in your lowering code.
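
A minimal sketch of that suggestion, assuming a hypothetical nvgpu op named SomeNvgpuMBarrierArriveOp: lower it straight to llvm.inline_asm in the nvgpu-to-nvvm conversion, with no intermediate NVVM op. The PTX and the "=r,l,r" constraints reuse the mbarrier example from the summary; the InlineAsmOp builder arguments follow its documented operands but may differ across MLIR versions.

cpp
// Sketch of the suggested direct lowering; SomeNvgpuMBarrierArriveOp is
// a hypothetical placeholder, not a real op.
#include "mlir/Conversion/LLVMCommon/Pattern.h" // ConvertOpToLLVMPattern
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"    // LLVM::InlineAsmOp

struct SomeNvgpuMBarrierArriveLowering
    : public mlir::ConvertOpToLLVMPattern<SomeNvgpuMBarrierArriveOp> {
  using ConvertOpToLLVMPattern::ConvertOpToLLVMPattern;

  mlir::LogicalResult
  matchAndRewrite(SomeNvgpuMBarrierArriveOp op, OpAdaptor adaptor,
                  mlir::ConversionPatternRewriter &rewriter) const override {
    mlir::Type resTy = rewriter.getI32Type();
    auto asmDialect = mlir::LLVM::AsmDialectAttr::get(
        rewriter.getContext(), mlir::LLVM::AsmDialect::AD_ATT);
    // Emit the inline assembly directly, reusing the PTX and the
    // "=r,l,r" constraints from the mbarrier example in the summary.
    rewriter.replaceOpWithNewOp<mlir::LLVM::InlineAsmOp>(
        op, mlir::TypeRange{resTy}, adaptor.getOperands(),
        /*asm_string=*/"mbarrier.arrive.expect_tx.b64 %0, [%1], %2;",
        /*constraints=*/"=r,l,r",
        /*has_side_effects=*/true, /*is_align_stack=*/false,
        asmDialect, /*operand_attrs=*/mlir::ArrayAttr());
    return mlir::success();
  }
};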

Maybe I should create a discussion in discourse :)

It's up to you. In my opinion, if we decide to diverge from the original idea that nvvm is a 1:1 mapping to intrinsics, that is probably something that should be discussed, as it is not the original intent of the design. I'm not saying it is a good or a bad idea, but we should have a clear direction on what the dialect should and shouldn't represent.

To summarize, my main point is that having an intermediate step before the LLVM inline assembly op seems unnecessary and, as far as I know, is not how nvvm was meant to be used. If you think there is a need for this, it would be interesting to mention it; otherwise my guess is that it is actually simpler to skip the intermediate representation.

I don't think nvvm is unnecessary here. Allow me to make a final attempt to convince you. Having BasicPtxBuilderOpInterface in nvvm automates inline assembly generation without writing C++, while nvgpu remains important as a bridge for the memref-to-llvm lowering.

It is particularly beneficial for ops like cp.async that have only partial nvptx support. See the example below: I can generate the intrinsic and the PTX without touching C++ (D154345). Previously, cp.async called emitCpAsyncOpZfillAsm, which is boilerplate code. Don't you think BasicPtxBuilderOpInterface simplifies this a lot?

I understand the design concern about nvvm. If you think this work is beneficial, I can start a discussion to relax the design (or I can split the nvvm.td file into nvvm.td and nvvm_inlineasm.td). But if you think there is no value, I can work on retargeting the interface to the nvgpu dialect.

def NVVM_CpAsyncOp : NVVM_Op<"cp.async.shared.global", [DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>,
  ...
  string llvmBuilder = [{
      ...
      createIntrinsicCall(builder, id, {$dst, $src});
  }];
  let extraClassDeclaration = [{
    bool canBuildPtx() { if(getCpSize()) return true; return false; }

    void getAsmValues(RewriterBase &rewriter, 
        llvm::SmallVectorImpl<std::pair<mlir::Value, mlir::NVVM::PTXRegisterMod>> &asmValues) {
      asmValues.push_back({getDst(), PTXRegisterMod::Read});
      asmValues.push_back({getSrc(), PTXRegisterMod::Read});
      asmValues.push_back({makeConstant(rewriter, getSize()), PTXRegisterMod::Read});
      asmValues.push_back({getCpSize(), PTXRegisterMod::Read});
    }        
  }];
  let extraClassDefinition = [{        
    const char* $cppClass::getPtx() { 
      if(getModifier() == NVVM::LoadCacheModifierKind::CG)
        return "cp.async.cg.shared.global [%0], [%1], %2, %3;\n"; 
      if(getModifier() == NVVM::LoadCacheModifierKind::CA)
        return "cp.async.ca.shared.global [%0], [%1], %2, %3;\n";        
      llvm_unreachable("unsupported cache modifier");      
    }
  }];
}

If an op has no intrinsic, generating PTX is even simpler: defining getPtx is enough.

def NVVM_MBarrierTryWaitParitySharedOp : NVVM_Op<"mbarrier.try_wait.parity.shared", 
                    [DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>,
  Results<(outs LLVM_Type:$res)>,
  Arguments<(ins LLVM_i64ptr_shared:$addr, LLVM_Type:$token)> {  
  let assemblyFormat = "$addr `,` $token attr-dict `:` type(operands) `->` type($res)";
  let extraClassDefinition = [{
    const char* $cppClass::getPtx() {
      return "{\n\t"
              ".reg .pred P1; \n\t"
              "mbarrier.try_wait.parity.shared.b64 P1, [%1], %2; \n\t"
              "selp.b32 %0, 1, 0, P1; \n\t"
              "}"; 
    }
  }];
}

I put draft code here: D154624. It implements BasicPtxBuilder as a class and uses nvgpu; in that case, building the PTX happens in C++. I am not sure it is any better.

I understand the motivation, and I agree it has some advantages: it allows having an extra level where both intrinsics and inline assembly share a representation. I don't know if there is a use for this extra level at this time, which is why I was saying it is probably not necessary. My intuition would be that it is less code and less maintenance to skip the extra level of representation for inline assembly. I'm not sure I follow how this makes the inline PTX emission simpler: whether the code is in tablegen or not, it is still C++ code that needs to emit either the intrinsic or the inline PTX. I would assume that the code would be the same when emitting directly in the conversion to nvvm.

If you think this is better, I'm not going to block your work; I just wanted to point out that this doesn't match my mental model of nvvm. I'm not extremely active in this code recently, so it probably makes sense for someone else to comment more in this area if needed (Alex has a better background than me in the LLVM dialect). My advice in terms of simplicity would be to skip the extra op representation, but feel free to ignore it. The PTX builder looks neat.

It is particularly beneficial for ops like cp.async that have only partial nvptx support. See the example below: I can generate the intrinsic and the PTX without touching C++ (D154345). Previously, cp.async called emitCpAsyncOpZfillAsm, which is boilerplate code. Don't you think BasicPtxBuilderOpInterface simplifies this a lot?

I see your point about leveraging tablegen to simplify the PTX declaration. I assume something similar could be done in C++, but this is definitely a cool representation. I agree this is an advantage of this approach.

I have not followed the full discussion here but consider the following case:

  1. some op exists at the nvgpu level on higher-level MLIR types
  2. it lowers to a bunch of ops in the NVVM dialect
  3. there is not yet an intrinsic representation in LLVM, there should be one, but we have no ETA on this

Isn't the NVVM dialect the right place to put the op and a lowering to PTX?

Otherwise, do we need to start distinguishing NVGPU ops that should become LLVM intrinsics in the fullness of time, and quarantine them for a potential future move to NVVM once the LLVM intrinsic is available?

Maybe I missed the point of the discussion, please lmk if I need to dig deeper to better understand what the fundamental issues are here.

I have not followed the full discussion here but consider the following case:

  1. some op exists at the nvgpu level on higher-level MLIR types
  2. it lowers to a bunch of ops in the NVVM dialect
  3. there is not yet an intrinsic representation in LLVM, there should be one, but we have no ETA on this

Isn't the NVVM dialect the right place to put the op and a lowering to PTX?

Or you can directly emit LLVM inline assembly ops where you would otherwise emit the NVVM op that doesn't exist because there is no intrinsic (as is currently done for the cases that don't have intrinsics). It still happens in the NVGPU-to-NVVM conversion, but you don't create an intermediate NVVM op that then needs to be lowered to the LLVM dialect.

Otherwise, do we need to start distinguishing NVGPU ops that should become LLVM intrinsics in the fullness of time, and quarantine them for a potential future move to NVVM once the LLVM intrinsic is available?

We definitely don't want to do that. My comments were unrelated to the NVGPU dialect and shouldn't affect what ops are added there.

Maybe I missed the point of the discussion, please lmk if I need to dig deeper to better understand what the fundamental issues are here.

I don't think we need much more discussion on that. I just meant to give my opinion, and it is totally fine to continue with what you have.

Or you can directly emit LLVM inline assembly ops where you would otherwise emit the NVVM op that doesn't exist because there is no intrinsic (as is currently done for the cases that don't have intrinsics). It still happens in the NVGPU-to-NVVM conversion, but you don't create an intermediate NVVM op that then needs to be lowered to the LLVM dialect.

Ah yes, OK, that is where NVGPU kicks in; I had glanced over the discussion and mistakenly thought we were talking about the NVGPU dialect.

We definitely don't want to do that. My comments were unrelated to the NVGPU dialect and shouldn't affect what ops are added there.

+1, sorry for the confusion!

I'd certainly want the better PTX builder and the nice inference of the magic r, m, etc. register constraints.

I have no strong opinion on the NVVM ops themselves.
It seems a bit closer to the final landing place in NVVM, and then the switch from InlineAsm to an LLVM intrinsic can happen locally within one file when ready; it is not a big blocker to also update the conversion at the same time.

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
164

makeConstantI32 please, since this is hardcoded?
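
For reference, the suggested helper might look something like the sketch below. The name and intent follow the review comment; the signature and body are hypothetical, not the actual change.

cpp
// Hypothetical sketch of the requested helper: materialize an i32
// constant so it can be passed as an inline-asm operand.
#include "mlir/Dialect/LLVMIR/LLVMDialect.h" // mlir::LLVM::ConstantOp
#include "mlir/IR/PatternMatch.h"            // mlir::RewriterBase

static mlir::Value makeConstantI32(mlir::RewriterBase &rewriter,
                                   mlir::Location loc, int val) {
  return rewriter.create<mlir::LLVM::ConstantOp>(
      loc, rewriter.getIntegerType(32), rewriter.getI32IntegerAttr(val));
}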

174

Can you expand the description here to say that the default implementation emits results, inputs and then attributes?

nicolasvasilache accepted this revision.Jul 11 2023, 1:49 AM

Upon deeper inspection, I see great value in the way this is built; we can iterate on further improvements if needed in the future. Let's get this rolling for now!

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
140

this is off it seems

149

Return whether the operation has memory side effects.

408

It is really great that we can actually just see the PTX with "".format-like behavior.
This greatly simplifies understanding, compared to building the C++ by hand, and reduces surprise.

Nice work!

mlir/lib/Conversion/NVVMToLLVM/NVVMToLLVM.cpp
72

Add TODO: f8, f16, bf16, etc.?

188

nit: nl

mlir/test/Conversion/NVVMToLLVM/nvvm-to-llvm.mlir
25

nice!

This revision is now accepted and ready to land.Jul 11 2023, 1:49 AM
guraypp updated this revision to Diff 538982.Jul 11 2023, 2:30 AM

address comments

guraypp updated this revision to Diff 538986.Jul 11 2023, 2:42 AM

fix debug printer (errs -> dbgs)

guraypp edited the summary of this revision. (Show Details)Jul 11 2023, 2:57 AM
This revision was landed with ongoing or failed builds.Jul 11 2023, 3:14 AM
This revision was automatically updated to reflect the committed changes.