This is an archive of the discontinued LLVM Phabricator instance.

[mlir][vector] Improve lowering to LLVM for `minf`, `maxf` reductions
ClosedPublic

Authored by unterumarmung on Jul 20 2023, 12:17 PM.

Details

Summary

This patch improves the lowering by changing the target LLVM intrinsics from
reduce.fmax and reduce.fmin,
which have different semantics for handling NaN,
to reduce.fmaximum and reduce.fminimum.
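As a rough illustration of the semantic difference (plain Python; the helper names are hypothetical and not part of the patch): reduce.fmax ignores NaN elements, like libm's fmax, while reduce.fmaximum propagates any NaN to the result:

```python
import math

def reduce_fmax(v):
    # Mimics llvm.vector.reduce.fmax: NaN elements are ignored
    # (libm fmax behavior), so a NaN never wins the reduction.
    acc = v[0]
    for x in v[1:]:
        if math.isnan(acc):
            acc = x
        elif not math.isnan(x) and x > acc:
            acc = x
    return acc

def reduce_fmaximum(v):
    # Mimics llvm.vector.reduce.fmaximum: any NaN element poisons
    # the whole reduction result.
    acc = v[0]
    for x in v[1:]:
        if math.isnan(acc) or math.isnan(x):
            acc = math.nan
        elif x > acc:
            acc = x
    return acc

print(reduce_fmax([1.0, math.nan, 3.0]))      # 3.0
print(reduce_fmaximum([1.0, math.nan, 3.0]))  # nan
```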

Fixes #63969

Depends on D155869

Diff Detail

Event Timeline

unterumarmung created this revision. Jul 20 2023, 12:17 PM
Herald added a reviewer: ftynse. · View Herald Transcript
Herald added a reviewer: dcaballe. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
unterumarmung requested review of this revision. Jul 20 2023, 12:17 PM

Thanks for looking into this!

mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVM.cpp
587–591

IIRC, I added this comment quite some time ago.

// Create lowering of minf/maxf op. We cannot use llvm.maximum/llvm.minimum
// with vector types.

Have you tried replacing these two ops with an llvm.maximum/llvm.minimum? Maybe they are more widely supported now.
Otherwise, we would have to propagate the NaNs ourselves here.

unterumarmung added inline comments. Jul 20 2023, 1:34 PM
mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVM.cpp
587–591

Oh, sorry, I'm new to this.
I thought that using the new intrinsics would allow deleting this boilerplate for NaNs.

From what I can see in tests, llvm.minimum and llvm.maximum can be called with vector arguments too.
But I do not understand one thing: why do we need this ability here in the first place? Both FCmpOp and SelectOp accept scalar values and return scalars, because they take the result of the reduction intrinsic and the accumulator value, which are floating-point scalars. So, replacing them would result in an llvm.minimum/llvm.maximum intrinsic call with scalar arguments.

Before:

func.func @reduce_fmax_f32(%arg0: vector<16xf32>, %arg1: f32) -> f32 {
  %0 = vector.reduction <maxf>, %arg0, %arg1 : vector<16xf32> into f32
  return %0 : f32
}
// CHECK-LABEL: @reduce_fmax_f32(
// CHECK-SAME: %[[A:.*]]: vector<16xf32>, %[[B:.*]]: f32)
//      CHECK: %[[V:.*]] = llvm.intr.vector.reduce.fmaximum(%[[A]]) : (vector<16xf32>) -> f32
//      CHECK: %[[C0:.*]] = llvm.fcmp "ogt" %[[V]], %[[B]] : f32
//      CHECK: %[[S0:.*]] = llvm.select %[[C0]], %[[V]], %[[B]] : i1, f32
//      CHECK: return %[[S0]] : f32

After:

func.func @reduce_fmax_f32(%arg0: vector<16xf32>, %arg1: f32) -> f32 {
  %0 = vector.reduction <maxf>, %arg0, %arg1 : vector<16xf32> into f32
  return %0 : f32
}
// CHECK-LABEL: @reduce_fmax_f32(
// CHECK-SAME: %[[A:.*]]: vector<16xf32>, %[[B:.*]]: f32)
//      CHECK: %[[V:.*]] = llvm.intr.vector.reduce.fmaximum(%[[A]]) : (vector<16xf32>) -> f32
//      CHECK: %[[C0:.*]] = llvm.intr.maximum(%[[V]], %[[B]]) : (f32, f32) -> f32
//      CHECK: return %[[C0]] : f32

Is this the intended behavior?

dcaballe added inline comments. Jul 20 2023, 2:54 PM
mlir/lib/Conversion/VectorToLLVM/ConvertVectorToLLVM.cpp
587–591

Yes, that would be the intended behavior, thanks! The existing semantics of fp max/min ops in MLIR match the semantics of maximum/minimum in LLVM. However, when we first implemented this, LLVM's maximum/minimum instructions were not implemented in some of the backends so we couldn't use them directly. Instead, we generated the fcmp + select with the ad-hoc NaN handling, which is kind of matching the semantics of the maximum/minimum instructions. Now that they seem more generally supported by the backends, we can use the maximum/minimum ones. This shouldn't have a large impact, at least on x64, where these instructions are lowered to fcmp + select in the backend.
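The fcmp + select combining step described above can be sketched in plain Python (a rough model, not the actual C++ lowering code) to show where it diverges from llvm.maximum's NaN propagation:

```python
import math

def acc_max(reduced, acc):
    # Models the old lowering: fcmp "ogt" + select.
    # "ogt" is an *ordered* comparison: it is False whenever either
    # operand is NaN, so the select then falls through to `acc`.
    # A NaN in `reduced` is therefore silently dropped instead of
    # being propagated as llvm.maximum would do.
    cmp = (not math.isnan(reduced)) and (not math.isnan(acc)) and reduced > acc
    return reduced if cmp else acc

print(acc_max(3.0, 1.0))       # 3.0
print(acc_max(math.nan, 1.0))  # 1.0 -- NaN is swallowed, not propagated
```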

Eventually, we may want to rename arith::fmax/fmin to arith::fmaximum/fminimum and redefine the semantics of the former ones so that they all match LLVM spec. That would also require changing the fmax/fmin reductions in the Vector dialect... Quite some work...

Happy to answer any other question you may have! Thanks for helping with this!

Address review comments

dcaballe accepted this revision. Jul 21 2023, 1:53 PM

LGTM! Thanks a lot!

This revision is now accepted and ready to land. Jul 21 2023, 1:53 PM

Thank you for providing such a clear explanation! Now I understand that the issue goes beyond just lowering to LLVM IR itself.

I have a few questions about the points you mentioned:

  1. When I searched the repository, I found mentions of llvm.vector.reduce.fmaximum only in tests for the AArch64 and X86 backends. Is that an acceptable level of backend support for merging this kind of change? Perhaps I missed something, or are there fallbacks for other backends?
  2. I agree that fixing the semantics of operations to match those in LLVM IR is extremely important. However, I'm concerned that there might be many downstream users of these MLIR dialects, such as IREE or possibly even Flang (not exactly a downstream user, but not an in-tree MLIR user either). As far as I know, MLIR has never declared IR stability, but breaking someone's code wouldn't be ideal. How does the community handle such situations? Also, I would like to participate in finding a solution, but I'm unsure if I'll have enough free time. So, please ping me when this comes up :)
dcaballe added a comment (edited). Jul 21 2023, 2:49 PM

> Thank you for providing such a clear explanation! Now I understand that the issue goes beyond just lowering to LLVM IR itself.
>
> I have a few questions about the points you mentioned:
>
>   1. When I searched the repository, I found mentions of llvm.vector.reduce.fmaximum only in tests for the AArch64 and X86 backends. Is that an acceptable level of backend support for merging this kind of change? Perhaps I missed something, or are there fallbacks for other backends?

I had quickly tested x86 and AArch64 but let's run a more extensive testing also including RISC-V: https://github.com/openxla/iree/pull/14472

>   2. I agree that fixing the semantics of operations to match those in LLVM IR is extremely important. However, I'm concerned that there might be many downstream users of these MLIR dialects, such as IREE or possibly even Flang (not exactly a downstream user, but not an in-tree MLIR user either). As far as I know, MLIR has never declared IR stability, but breaking someone's code wouldn't be ideal. How does the community handle such situations? Also, I would like to participate in finding a solution, but I'm unsure if I'll have enough free time. So, please ping me when this comes up :)

MLIR allows breaking changes. In this case they would be well justified. We usually send a PSA to Discourse in advance, letting the community know...
I personally don't think I have cycles right now so you are more than welcome to champion this as time permits! :). Happy to help with questions and reviews if you feel motivated to move this forward!

I think this should be good to go. Tests are passing.

This change causes downstream breakages. On CUDA backends it generates an instruction that is valid only on sm_80 and above (and therefore causes a crash on anything below sm_80). Posting this here as an FYI. Looking at the patch, this is probably just exposing an issue in the NVPTX backend rather than anything here being the root cause.

It could be that the new intrinsics are not supported/implemented in the NVPTX backend for some cases? I think we can revert it if that is the case and something is filed against the NVPTX backend. Otherwise, this will become a blocker for a much larger effort: https://discourse.llvm.org/t/rfc-fix-floating-point-max-and-min-operations-in-mlir/72671
What do you think @unterumarmung?

I believe that the issue lies within the NVPTX backend, and it should be addressed as a bug. If this change is causing significant issues and there isn't a straightforward solution to the bug, we can definitely consider reverting it. Additionally, I'd like to request @mravishankar to review the RFC mentioned, particularly from the NVPTX backend standpoint. Your insights would be greatly appreciated.

> I believe that the issue lies within the NVPTX backend, and it should be addressed as a bug. If this change is causing significant issues and there isn't a straightforward solution to the bug, we can definitely consider reverting it. Additionally, I'd like to request @mravishankar to review the RFC mentioned, particularly from the NVPTX backend standpoint. Your insights would be greatly appreciated.

I filed this issue on the NVPTX backend: https://github.com/llvm/llvm-project/issues/64606 . I'll take a look at the RFC, but my knowledge of NVPTX is pretty dated at this point. This comes down to whether these instructions are supported or not. It looks like they are only supported on sm_80 and above. If I don't hear back, then instead of reverting the change, maybe refactor it so that the use of llvm.intr.minimum can be controlled by downstream users.

@unterumarmung and @dcaballe: based on the discussion here (https://github.com/llvm/llvm-project/issues/64606), this patch either needs to be reverted or needs a way to opt in/out so CUDA backends don't hit the issue... Maybe create a separate entry point that populates these patterns, or add a flag that enforces NaN-propagation semantics, true by default but settable to false on CUDA backends. This is breaking downstream tests, so a fix sooner rather than later would be appreciated.

It looks like the fix is moving forward now, right?

> It looks like the fix is moving forward now, right?

Actually it seems like we have to make this opt-in somehow. It is failing on CUDA on architectures lower than sm_80, and that doesn't seem easy to fix. I was going to update the bug to say so. Either we need to revert or make this transformation opt-in.

> It looks like the fix is moving forward now, right?
>
> Actually it seems like we have to make this opt-in somehow. It is failing on CUDA on architectures lower than sm_80, and that doesn't seem easy to fix. I was going to update the bug to say so. Either we need to revert or make this transformation opt-in.

If a workaround is required at the MLIR level, we should add expansion patterns for these ops to Arith/Transforms/ExpandOps.cpp, where we can introduce a compare + select for the NaN part and another compare + select for the +-0.0 part. However, these ops are first-class instructions/intrinsics in LLVM, so they should be supported one way or another by the backends.
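A minimal Python model of such an expansion (illustrative only; an actual pattern would emit arith.cmpf/arith.select ops rather than Python branches):

```python
import math

def fmaximum_expanded(a, b):
    # Sketch of what an expansion pattern could emit as compare + select:
    # 1) plain greater-than compare + select,
    m = a if a > b else b
    # 2) NaN part: if either input is NaN, propagate NaN,
    if math.isnan(a) or math.isnan(b):
        m = math.nan
    # 3) +-0.0 part: maximum(+0.0, -0.0) must be +0.0, which step 1
    #    misses because +0.0 == -0.0 compares equal.
    if m == 0.0:
        positive = math.copysign(1.0, a) > 0 or math.copysign(1.0, b) > 0
        m = 0.0 if positive else -0.0
    return m

print(fmaximum_expanded(2.0, 3.0))             # 3.0
print(fmaximum_expanded(math.nan, 1.0))        # nan
print(fmaximum_expanded(-0.0, 0.0))            # 0.0 (positive zero)
```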