This is an archive of the discontinued LLVM Phabricator instance.

[Flang][OpenMP][MLIR] An mlir transformation pass for marking FuncOp's implicitly called from TargetOp's and declare target marked FuncOp's as implicitly declare target
ClosedPublic

Authored by agozillon on Jun 30 2023, 11:50 AM.

Details

Summary

This pass will mark functions called from TargetOp's
and declare target functions as implicitly declare
target by adding the MLIR declare target attribute
directly to the function.

This pass executes after the initial lowering of Fortran's PFT
to MLIR (FIR/OMP+Arith etc.) and is one of a series of passes
that aim to clean up the MLIR for offloading (seperate passes
in different patches, one for early outlining, another for declare
target function filtering).

The pass is an upgraded version of the prior MLIR Opt
pass that I created that did the same thing. In this case
better handling of recursive cases, updated to work with
the interface and the ability to work with TargetOp's.

The semantic pass equivalent of this can be found in:
https://reviews.llvm.org/D150323

So we can choose to go either direction, however, this
pass is simpler to maintain and understand and having
it as an transformation pass keeps it consistent with the
other transformation passes that deal with OpenMP
function and target op processing.

Diff Detail

Event Timeline

agozillon created this revision.Jun 30 2023, 11:50 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald Transcript
agozillon requested review of this revision.Jun 30 2023, 11:50 AM

I like this approach, and I think it is a more manageable approach than semantics, but it looks to me like this pass could be moved outside of the Flang project and into the OpenMP MLIR dialect as a transformation pass. I may be wrong, so please correct me if that's the case, but I understand that implicit declare target propagation should be common to Fortran and C/C++ OpenMP. So it makes sense to me to have the pass defined where it can be reused.

So I would propose, rather than defining the pass in Flang and propagating declare target information through fir.call ops, building on top of D154194 to create the pass as part of the OpenMP dialect (D147641 shows an example of what I mean) and to propagate declare target through func.call ops. The implementation of the pass itself I think would remain mostly unchanged except from swapping fir.call with func.call, and then it would just be a matter of moving some files around. I think this way the offloading-related passes (implicit declare target, target region outlining and host/device function filtering for now) would be implemented in a consistent and reusable way.

I may be missing some other considerations, so I'd be interested to hear any other opinions you or others may have.

flang/lib/Optimizer/Transforms/OMPCaptureImplicitDeclTar.cpp
79 ↗(On Diff #536346)

In the case of a regular target region (omp.target inside of a host function) this is correct. However, a reverse-offloading target region (omp.target inside of a device function and with the 'ancestor' clause set) should instead mark functions called inside as host functions.

Currently there is no support for reverse offloading, so maybe it's not important to already mark functions properly here, but at least it would be good to acknowledge this known limitation in a TODO comment.

I like this approach, and I think it is a more manageable approach than semantics, but it looks to me like this pass could be moved outside of the Flang project and into the OpenMP MLIR dialect as a transformation pass. I may be wrong, so please correct me if that's the case, but I understand that implicit declare target propagation should be common to Fortran and C/C++ OpenMP. So it makes sense to me to have the pass defined where it can be reused.

So I would propose, rather than defining the pass in Flang and propagating declare target information through fir.call ops, building on top of D154194 to create the pass as part of the OpenMP dialect (D147641 shows an example of what I mean) and to propagate declare target through func.call ops. The implementation of the pass itself I think would remain mostly unchanged except from swapping fir.call with func.call, and then it would just be a matter of moving some files around. I think this way the offloading-related passes (implicit declare target, target region outlining and host/device function filtering for now) would be implemented in a consistent and reusable way.

I may be missing some other considerations, so I'd be interested to hear any other opinions you or others may have.

I have not gone through the details. But the function calls in FIR always exist as a fir.call till conversion to llvm.call.

I agree that this is simpler to maintain than the Semantics pass. My preference (not a strong opinion) is to keep this pass in Flang, and be called early on (one of the first passes after lowering) and be looked at as part of the lowering process. Practically, this also gives you a lot of leeway to experiment and make changes. Once we are all clear with the flow then we can think of moving these to the OpenMP Dialect.

kiranchandramohan added a comment.EditedJul 4 2023, 4:03 AM

I have not gone through the details. But the function calls in FIR always exist as a fir.call till conversion to llvm.call.

There is the CallOpInterface that can be used instead if you do not want to rely on a FIR operation.

I like this approach, and I think it is a more manageable approach than semantics, but it looks to me like this pass could be moved outside of the Flang project and into the OpenMP MLIR dialect as a transformation pass. I may be wrong, so please correct me if that's the case, but I understand that implicit declare target propagation should be common to Fortran and C/C++ OpenMP. So it makes sense to me to have the pass defined where it can be reused.

So I would propose, rather than defining the pass in Flang and propagating declare target information through fir.call ops, building on top of D154194 to create the pass as part of the OpenMP dialect (D147641 shows an example of what I mean) and to propagate declare target through func.call ops. The implementation of the pass itself I think would remain mostly unchanged except from swapping fir.call with func.call, and then it would just be a matter of moving some files around. I think this way the offloading-related passes (implicit declare target, target region outlining and host/device function filtering for now) would be implemented in a consistent and reusable way.

I may be missing some other considerations, so I'd be interested to hear any other opinions you or others may have.

I would have also preferred to have kept them together as an OpenMP pass, but the pass unfortunately relies on fir.call ops as like @kiranchandramohan says the func.call's aren't a thing in FIR as far as I was aware and I don't believe there is inheritance to dynamically cast between them or infer one is the same as the other, which means it has to sit inside of Flang as FIR isn't part of the MLIR project, we'd need to utilize an interface as @kiranchandramohan suggested (also discussed with @jsjodin but decided it wasn't worth the effort to implement an interface at the moment for this reason alone), however, as @kiranchandramohan mentioned there is a CallOpInterface so perhaps this would work. Although, perhaps there is something I am missing (very likely) and func.call or some other replacement for fir.call other than the interface can be used!

I like this approach, and I think it is a more manageable approach than semantics, but it looks to me like this pass could be moved outside of the Flang project and into the OpenMP MLIR dialect as a transformation pass. I may be wrong, so please correct me if that's the case, but I understand that implicit declare target propagation should be common to Fortran and C/C++ OpenMP. So it makes sense to me to have the pass defined where it can be reused.

So I would propose, rather than defining the pass in Flang and propagating declare target information through fir.call ops, building on top of D154194 to create the pass as part of the OpenMP dialect (D147641 shows an example of what I mean) and to propagate declare target through func.call ops. The implementation of the pass itself I think would remain mostly unchanged except from swapping fir.call with func.call, and then it would just be a matter of moving some files around. I think this way the offloading-related passes (implicit declare target, target region outlining and host/device function filtering for now) would be implemented in a consistent and reusable way.

I may be missing some other considerations, so I'd be interested to hear any other opinions you or others may have.

I have not gone through the details. But the function calls in FIR always exist as a fir.call till conversion to llvm.call.

I agree that this is simpler to maintain than the Semantics pass. My preference (not a strong opinion) is to keep this pass in Flang, and be called early on (one of the first passes after lowering) and be looked at as part of the lowering process. Practically, this also gives you a lot of leeway to experiment and make changes. Once we are all clear with the flow then we can think of moving these to the OpenMP Dialect.

I'm not overly bothered where the opt pass rests at the moment personally (provided I can make it a little more operation agnostic), so I'm happy to change this opt pass to reside in the OpenMP dialect or the FIR dialects depending on the outcome of both (@kiranchandramohan @skatrak) your discussions and if the CallOpInterface yields any success.

I do not think it being part of the OpenMP dialect will prevent it being called early on (it currently happens just before verification, in the same opt pipeline, which I believe is directly after lowering, it is only optional added to the pipeline based on the OpenMP flag being set currently. As far as I am aware this is where the rest will be called as well), but being cautious with testing the flow before making it part of the OpenMP Dialect seems reasonable, although I don't think anything is going to use it in the meantime, at least until/if Clang swaps over to generating MLIR.

flang/lib/Optimizer/Transforms/OMPCaptureImplicitDeclTar.cpp
79 ↗(On Diff #536346)

I'll see how difficult this is to add, but it may end up being a TODO!

I have not gone through the details. But the function calls in FIR always exist as a fir.call till conversion to llvm.call.

You are right, I wasn't aware of this. Then, as you say, the alternative to having this pass work as part of the OpenMP dialect would be to use the CallOpInterface rather than searching for func.call.

Practically, this also gives you a lot of leeway to experiment and make changes. Once we are all clear with the flow then we can think of moving these to the OpenMP Dialect.

That is a good point. By the same logic, then it would be probably good to also define the passes for target region outlining and host/device function filtering inside of Flang at first. Once all this stabilizes, then we can think of moving these (and potentially other) transformation passes to the dialect. Even then, it would be worthwhile to try implementing these passes without dependencies to the FIR dialect as much as possible, so that the eventual move won't be too involved.

I'm not overly bothered where the opt pass rests at the moment personally (provided I can make it a little more operation agnostic), so I'm happy to change this opt pass to reside in the OpenMP dialect or the FIR dialects depending on the outcome of both (@kiranchandramohan @skatrak) your discussions and if the CallOpInterface yields any success.

I think it's fine to keep it as part of Flang for now, but trying to use the CallOpInterface if possible, to remove the dependency on the FIR dialect.

I do not think it being part of the OpenMP dialect will prevent it being called early on (it currently happens just before verification, in the same opt pipeline, which I believe is directly after lowering, it is only optional added to the pipeline based on the OpenMP flag being set currently. As far as I am aware this is where the rest will be called as well), but being cautious with testing the flow before making it part of the OpenMP Dialect seems reasonable, although I don't think anything is going to use it in the meantime, at least until/if Clang swaps over to generating MLIR.

I agree, I think at the moment it's not really that important where the pass is defined because it will be created in the same spots inside of Flang regardless.

agozillon updated this revision to Diff 537113.Jul 4 2023, 8:33 AM
  • [Flang][MLIR][OpenMP] Make the pass more generic and remove the use of Fir.CallOps via the CallOpInterface
  • [Flang][MLIR][OpenMP] Add comment for future extension for reverse-offloading
agozillon updated this revision to Diff 537114.Jul 4 2023, 8:35 AM
  • Add missing newlines at eof

Added utilisation of the CallOpInterface in the most recent set of commits so it's now no longer dependent on FIR in it's current incarnation. Also added a TODO for the reverse-offloading support, as unfortunately the lowering from the PFT to the Op doesn't handle the device clause for the moment. Otherwise fixed the standard forgetting newlines at the end of new files that I always do (sorry)!

flang/lib/Optimizer/Transforms/OMPCaptureImplicitDeclTar.cpp
79 ↗(On Diff #536346)

I've added a TODO comment for the time being, I don't think implementing it will be particularly difficult, the current issue is that there is no lowering of the device clause for the TargetOp so no way to test it unfortunately.

If it would be possible to get some further review on this it would be greatly appreciated! Thank you very much ahead of time.

Nit: Better to refer to this as a transformation pass and not an optimization pass in the title.
Nit: Drop the omp prefix from the test files.
Nit: I think it is better to have tests with 2 spacing and lowercase.

In general (even for c/c++), how does this work for called functions/subroutines that are in a different file? e.g below where module a, module b are in different files.

module a
  use b
contains
  function double() result(x)
    integer :: x
    !$omp declare target
    x = add(10,20)
  end function
end module
module b
contains
  function add(x,y)
    integer :: x, y
    add = x + y
  end function
end module

What is the expected behaviour for operations that might lower to function calls later in the lowering pipeline? A simple example is the math sin function. But there could be others as well.

function mysin(x)
  real :: x
  !$omp declare target
  print *, sin(x)
end function
flang/include/flang/Optimizer/Transforms/Passes.td
301

Prefix the name with OpenMP or OMP.
Would OMPMarkDeclareTargetPass be a better name?

303
agozillon retitled this revision from [Flang][OpenMP][MLIR] An mlir optimisation pass for marking FuncOp's implicitly called from TargetOp's and declare target marked FuncOp's as implicitly declare target to [Flang][OpenMP][MLIR] An mlir transformation pass for marking FuncOp's implicitly called from TargetOp's and declare target marked FuncOp's as implicitly declare target.Jul 12 2023, 6:58 AM
agozillon edited the summary of this revision. (Show Details)
agozillon marked 2 inline comments as done.EditedJul 12 2023, 9:47 AM

Nit: Better to refer to this as a transformation pass and not an optimization pass in the title.
Nit: Drop the omp prefix from the test files.
Nit: I think it is better to have tests with 2 spacing and lowercase.

In general (even for c/c++), how does this work for called functions/subroutines that are in a different file? e.g below where module a, module b are in different files.

module a
  use b
contains
  function double() result(x)
    integer :: x
    !$omp declare target
    x = add(10,20)
  end function
end module
module b
contains
  function add(x,y)
    integer :: x, y
    add = x + y
  end function
end module

In the above case the pass simply does nothing, it will not mark add as declare target as it's in another compilation unit, and the pass won't fail either (from running a test at least). This is specification defined though from my understanding/reading, so it isn't something uncommon:

"If a procedure appears in an enter clause in the same compilation unit in which the definition of the procedure occurs then a device-specific version of the procedure is created for all devices to which the directive of the clause applies.", Section 5.8.4 of specification 5.2

The to clause got renamed to enter after 5.1, so it's essentially an alias. But in this case to get functioning code, the user would I think have to decorate add with its own declare target directive I believe to make it "legal" OpenMP code. Someone with more experience in OpenMP may be able to shed more light or refute my understanding of it though.

The above sort of also hinges on if two seperate modules in two seperate files are considered seperate compilation units, e.g. it's not like it's some C++ header only library (from doing a test, it doesn't appear to be like that though, but I'm fairly new to Fortran).

What is the expected behaviour for operations that might lower to function calls later in the lowering pipeline? A simple example is the math sin function. But there could be others as well.

function mysin(x)
  real :: x
  !$omp declare target
  print *, sin(x)
end function

In this case, it's not going to apply declare target to it unfortunately, as it's not a function or a call, it's I suppose effectively a builtin at this point, until it's lowered. In theory, you could extend the pass to support marking of these I imagine, or look at it as falling under something defined in another compilation unit e.g. some kind of library.

However, in either case, from my current understanding, I believe it's a non-issue. The function filtering pass will execute immediately after this and provided the function it exists in, in this example. mysin survives the culling, it will just be lowered as normal with no other affects at the moment, so the math.sin function would just go through it's regular lowering, and if it happens to spawn a new function, then that's fine, it will not be harmed or go through any other alterations other than regular lowering.

agozillon updated this revision to Diff 539616.Jul 12 2023, 9:55 AM
  • Pass name change
  • remove omp from test
  • rename other tests I've added with similar omp prefix
  • tidy up test as suggested
agozillon added a comment.EditedJul 12 2023, 9:56 AM

Applied reviewer nits in the last series of commits and also removed omp- prefix from previous tests I wrote and added the prefix to while I was there, just to keep consistency.

However, in either case, from my current understanding, I believe it's a non-issue. The function filtering pass will execute immediately after this and provided the function it exists in, in this example. mysin survives the culling, it will just be lowered as normal with no other affects at the moment, so the math.sin function would just go through it's regular lowering, and if it happens to spawn a new function, then that's fine, it will not be harmed or go through any other alterations other than regular lowering.

When it is written as different passes, there is no guarantee that these passes will be invoked one after the other. In Flang we can probably control this. Would combining the marking and erasing passes be better? Are there any other uses for having these declare target attributes? From your existing reply, i am assuming that is the only use.

Also, the best example for the situation I mentioned previously is the following.

subroutine sb1(x, y, i)
  real :: x, y
  integer(kind=8) :: i
  !$omp declare target
  y = x ** i
end subroutine
func.func @sb1_(%arg0: !fir.ref<f32> {fir.bindc_name = "x"}, %arg1: !fir.ref<f32> {fir.bindc_name = "y"}, %arg2: !fir.ref<i64> {fir.bindc_name = "i"}) attributes {omp.declare_target = #omp.declaretarget<device_type = (any), capture_clause = (to)>} {
  %0 = fir.load %arg0 : !fir.ref<f32>
  %1 = fir.load %arg2 : !fir.ref<i64>
  %2 = call @__mlir_math_fpowi_f32_i64(%0, %1) : (f32, i64) -> f32
  fir.store %2 to %arg1 : !fir.ref<f32>
  return
}

func.func private @__mlir_math_fpowi_f32_i64(%arg0: f32, %arg1: i64) -> f32 attributes {llvm.linkage = #llvm.linkage<linkonce_odr>} {
  %cst = arith.constant 1.000000e+00 : f32
  ...
}

Some comments inline. Looks Good. Please wait for one more acceptance.

flang/lib/Optimizer/Transforms/OMPMarkDeclareTarget.cpp
26

Cant the moduleOp be part of the class so that you do not have to pass it around?

27–33

Would a set or map be better here?

However, in either case, from my current understanding, I believe it's a non-issue. The function filtering pass will execute immediately after this and provided the function it exists in, in this example. mysin survives the culling, it will just be lowered as normal with no other affects at the moment, so the math.sin function would just go through it's regular lowering, and if it happens to spawn a new function, then that's fine, it will not be harmed or go through any other alterations other than regular lowering.

When it is written as different passes, there is no guarantee that these passes will be invoked one after the other. In Flang we can probably control this. Would combining the marking and erasing passes be better? Are there any other uses for having these declare target attributes? From your existing reply, i am assuming that is the only use.

if they're both passes built to execute on ModuleOp, are they not guaranteed to execute in-order? I have a very minimal understanding of how the pipeline manager runs, but wouldn't one of the passes have to be either added as a nested pass explicitly or implicitly (only seems to happen if a pass executes on a different Op type than the PassManager is given) for them to not run in the provided sequence? I am not against combining them though if that is something we wish to do @skatrak @jsjodin @kiranchandramohan, I'm just not sure it's necessary in the context of them being run out of order, perhaps someone with more MLIR knowledge than me can feed into the discussion. In either case, it would require a bit of a merger of this patch and https://reviews.llvm.org/D147641 or waiting until one is landed and then having the other rebase and do the merging of components on top of it.

For the moment, for functions it's the only use. For global data the attribute has a different use-case, which another patch is up for: https://reviews.llvm.org/D149368

Also, the best example for the situation I mentioned previously is the following.

subroutine sb1(x, y, i)
  real :: x, y
  integer(kind=8) :: i
  !$omp declare target
  y = x ** i
end subroutine
func.func @sb1_(%arg0: !fir.ref<f32> {fir.bindc_name = "x"}, %arg1: !fir.ref<f32> {fir.bindc_name = "y"}, %arg2: !fir.ref<i64> {fir.bindc_name = "i"}) attributes {omp.declare_target = #omp.declaretarget<device_type = (any), capture_clause = (to)>} {
  %0 = fir.load %arg0 : !fir.ref<f32>
  %1 = fir.load %arg2 : !fir.ref<i64>
  %2 = call @__mlir_math_fpowi_f32_i64(%0, %1) : (f32, i64) -> f32
  fir.store %2 to %arg1 : !fir.ref<f32>
  return
}

func.func private @__mlir_math_fpowi_f32_i64(%arg0: f32, %arg1: i64) -> f32 attributes {llvm.linkage = #llvm.linkage<linkonce_odr>} {
  %cst = arith.constant 1.000000e+00 : f32
  ...
}

Thank you for the example! I believe if we did encounter it causing issues in this situation, we could try to move the passes after the math lowering, which would I think be in the FIRToLLVMLowering pass (perhaps I'm wrong on the location though). The ordering may be more difficult to restrict there though as there's a number of math related passes going on. But for the moment, filtering as close to the original lowering as possible is likely the best option I think.

Some comments inline. Looks Good. Please wait for one more acceptance.

Thank you, I'll try to address these and ask if someone else can have a look.

However, in either case, from my current understanding, I believe it's a non-issue. The function filtering pass will execute immediately after this and provided the function it exists in, in this example. mysin survives the culling, it will just be lowered as normal with no other affects at the moment, so the math.sin function would just go through it's regular lowering, and if it happens to spawn a new function, then that's fine, it will not be harmed or go through any other alterations other than regular lowering.

When it is written as different passes, there is no guarantee that these passes will be invoked one after the other. In Flang we can probably control this. Would combining the marking and erasing passes be better? Are there any other uses for having these declare target attributes? From your existing reply, i am assuming that is the only use.

We can create another patch that implements a pass that invokes the three different passes in the order we want. It seems cleaner to keep the code separate so that it is easier to reason about each transform individually.

agozillon updated this revision to Diff 540056.Jul 13 2023, 8:39 AM
  • Pass name change
  • remove omp from test
  • rename other tests I've added with similar omp prefix
  • tidy up test as suggested
  • Remove ModuleOp Arg
  • Replace vector with set
  • Move addPass into same segment as other OpenMP addPass
agozillon marked 2 inline comments as done.Jul 13 2023, 8:40 AM

Rebased and updated patch based on reviewer comments (use set and remove ModuleOp argument)

jsjodin added inline comments.Jul 13 2023, 10:17 AM
flang/lib/Optimizer/Transforms/OMPMarkDeclareTarget.cpp
85

Can this set be a SmallPtrSet (for Operation *) instead of a StringRef? I am thinking doing mem compare might be expensive.

  • Pass name change
  • remove omp from test
  • rename other tests I've added with similar omp prefix
  • tidy up test as suggested
  • Remove ModuleOp Arg
  • Replace vector with set
  • Move addPass into same segment as other OpenMP addPass
  • Use a smallptrset rather than a regualr set
agozillon marked an inline comment as done.Jul 13 2023, 10:56 AM

Update to utilise SmallPtrSet<Operation*> instead of std::set<StringRef>

flang/lib/Optimizer/Transforms/OMPMarkDeclareTarget.cpp
85

Thank you for pointing it out, the last commit should address this and use SmallPtrSet!

jsjodin accepted this revision.Jul 13 2023, 11:01 AM

LGTM! Please wait for another acceptance.

This revision is now accepted and ready to land.Jul 13 2023, 11:01 AM
agozillon marked an inline comment as done.Jul 13 2023, 2:57 PM

@kiranchandramohan would it be possible to get a 2nd acceptance from you if possible please, if you're happy with the patch in its current state of course! or @skatrak if you wish to give the patch a final look over too see if it's in a state for acceptance.

Thank you Andrew, LGTM.

However, in either case, from my current understanding, I believe it's a non-issue. The function filtering pass will execute immediately after this and provided the function it exists in, in this example. mysin survives the culling, it will just be lowered as normal with no other affects at the moment, so the math.sin function would just go through it's regular lowering, and if it happens to spawn a new function, then that's fine, it will not be harmed or go through any other alterations other than regular lowering.

When it is written as different passes, there is no guarantee that these passes will be invoked one after the other. In Flang we can probably control this. Would combining the marking and erasing passes be better? Are there any other uses for having these declare target attributes? From your existing reply, i am assuming that is the only use.

We can create another patch that implements a pass that invokes the three different passes in the order we want. It seems cleaner to keep the code separate so that it is easier to reason about each transform individually.

I agree with this approach. We can define each of these as an independent patch, potentially relying on having one of the others run prior to it, and then make sure that either they all run in their expected order or none of them do by grouping them into a kind of "OpenMP-offloading pipeline" pass (which we can implement as a following patch, if we decide to follow that approach). This pass should run as soon as possible to avoid marking compiler-generated calls like @__mlir_math_fpowi_f32_i64 in the example above.

flang/lib/Optimizer/Transforms/CMakeLists.txt
21

Nit: Seems like there are some leftover spaces there

skatrak accepted this revision.Jul 14 2023, 6:22 AM
agozillon added a comment.EditedJul 14 2023, 6:28 AM

Thank you Andrew, LGTM.

Thank you for the review!

However, in either case, from my current understanding, I believe it's a non-issue. The function filtering pass will execute immediately after this and provided the function it exists in, in this example. mysin survives the culling, it will just be lowered as normal with no other affects at the moment, so the math.sin function would just go through it's regular lowering, and if it happens to spawn a new function, then that's fine, it will not be harmed or go through any other alterations other than regular lowering.

When it is written as different passes, there is no guarantee that these passes will be invoked one after the other. In Flang we can probably control this. Would combining the marking and erasing passes be better? Are there any other uses for having these declare target attributes? From your existing reply, i am assuming that is the only use.

We can create another patch that implements a pass that invokes the three different passes in the order we want. It seems cleaner to keep the code separate so that it is easier to reason about each transform individually.

I agree with this approach. We can define each of these as an independent patch, potentially relying on having one of the others run prior to it, and then make sure that either they all run in their expected order or none of them do by grouping them into a kind of "OpenMP-offloading pipeline" pass (which we can implement as a following patch, if we decide to follow that approach). This pass should run as soon as possible to avoid marking compiler-generated calls like @__mlir_math_fpowi_f32_i64 in the example above.

This approach sounds good to me.

I'll push this patch on Monday afternoon and fix the nit before I push, provided nobody has any further comments!

I am going to land this in the next couple of hours now, provided I run into no issues applying it! Thank you all for your reviews and help.

The gfortran testsuite is failing after this change. Could you fix immediately or revert until it is fixed?
https://lab.llvm.org/buildbot/#/builders/179/builds/6693
https://github.com/llvm/llvm-test-suite/blob/main/Fortran/gfortran/regression/gomp/defaultmap-7.f90

Hi Kiran, looking into it just now and have an idea of what the problem is, discussing a fix and it should be committed by the end of today.

The gfortran testsuite is failing after this change. Could you fix immediately or revert until it is fixed?
https://lab.llvm.org/buildbot/#/builders/179/builds/6693
https://github.com/llvm/llvm-test-suite/blob/main/Fortran/gfortran/regression/gomp/defaultmap-7.f90

It's an interaction between my pass and @skatrak 's pass added in: https://reviews.llvm.org/D147641

I've made a commit/fix for the time being that deactivates the pass for the host, but leaves it enabled for device: https://reviews.llvm.org/rG64f5a7642a05

@skatrak has an idea of how to do the filtering for the host in a way that won't adversely affect things hopefully, so I believe the current plan is for @skatrak to investigate his idea a little further and that should allow us to re-enable filtering for the host. The host filtering is primarily an optimisation/something to allow closer following of the OpenMP specification, it's not a necessity for offloading but it would be nice to have, especially if we aim to implement reverse offloading in the future.

The gfortran testsuite is failing after this change. Could you fix immediately or revert until it is fixed?
https://lab.llvm.org/buildbot/#/builders/179/builds/6693
https://github.com/llvm/llvm-test-suite/blob/main/Fortran/gfortran/regression/gomp/defaultmap-7.f90

It's an interaction between my pass and @skatrak 's pass added in: https://reviews.llvm.org/D147641

I've made a commit/fix for the time being that deactivates the pass for the host, but leaves it enabled for device: https://reviews.llvm.org/rG64f5a7642a05

@skatrak has an idea of how to do the filtering for the host in a way that won't adversely affect things hopefully, so I believe the current plan is for @skatrak to investigate his idea a little further and that should allow us to re-enable filtering for the host. The host filtering is primarily an optimisation/something to allow closer following of the OpenMP specification, it's not a necessity for offloading but it would be nice to have, especially if we aim to implement reverse offloading in the future.

Thanks @agozillon for the quick response and the plan for fixing this.

I have spent some time working on this problem and I just want to ask for your opinions on what the expected behavior is, because I'm not sure. Let's say we have the following application:

subroutine g()
  !$omp declare target to(g) device_type(nohost)
end subroutine device

subroutine f()
  !$omp target
  call g()
  !$omp end target
end subroutine host

I was thinking that g, since it's a device subroutine, should be filtered out when compiling for the host. But code generation for the contents of target regions is also triggered during host compilation, to provide a host fallback implementation. So it seems like it actually would be necessary to keep "nohost" functions also for the host, so that these fallbacks can actually work. If that's the case, then the current solution of only triggering function filtering during device compilation would be correct.

I think this wouldn't impact our ability of implementing reverse offloading later because that just requires function filtering to make sure device functions containing a target region are kept. The only problem could be, during device compilation, the potential calls to host-only functions inside of these reverse offload target regions (if they are removed, the function calls turn into calls to undefined symbols). The patch I am preparing changes the way in which functions are filtered so that undefined references are avoided in that situation.

So would the solution to this be to only filter host-only functions while compiling for the device and tweaking it slightly to avoid calls to undefined references or am I going the wrong way about this?

I have spent some time working on this problem and I just want to ask for your opinions on what the expected behavior is, because I'm not sure. Let's say we have the following application:

subroutine g()
  !$omp declare target to(g) device_type(nohost)
end subroutine device

subroutine f()
  !$omp target
  call g()
  !$omp end target
end subroutine host

I was thinking that g, since it's a device subroutine, should be filtered out when compiling for the host. But code generation for the contents of target regions is also triggered during host compilation, to provide a host fallback implementation. So it seems like it actually would be necessary to keep "nohost" functions also for the host, so that these fallbacks can actually work. If that's the case, then the current solution of only triggering function filtering during device compilation would be correct.

I think this wouldn't impact our ability of implementing reverse offloading later because that just requires function filtering to make sure device functions containing a target region are kept. The only problem could be, during device compilation, the potential calls to host-only functions inside of these reverse offload target regions (if they are removed, the function calls turn into calls to undefined symbols). The patch I am preparing changes the way in which functions are filtered so that undefined references are avoided in that situation.

So would the solution to this be to only filter host-only functions while compiling for the device and tweaking it slightly to avoid calls to undefined references or am I going the wrong way about this?

I think that would be correct behavior, as we do need the functions for the fallback.

I suppose you could in theory mark functions needed for fallback as device_type::Any and that'd allow discarding those marked as nohost only, but that's a bit of an excessive and unnecessary optimisation I think, I imagine the cases where a nohost function is not needed in a fallback host implementation is few and far between and something else is likely to clean it up.

I think that would be correct behavior, as we do need the functions for the fallback.

I suppose you could in theory mark functions needed for fallback as device_type::Any and that'd allow discarding those marked as nohost only, but that's a bit of an excessive and unnecessary optimisation I think, I imagine the cases where a nohost function is not needed in a fallback host implementation is few and far between and something else is likely to clean it up.

Thank you Andrew for your comments. I just created D155827 to follow this approach and clean up a bit some remaining TODO comments.

flang/test/Lower/OpenMP/declare-target-implicit-tarop-cap.f90