This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][libomptarget] Bringing up to spec with respect to OMP_TARGET_OFFLOAD env var
ClosedPublic

Authored by AlexEichenberger on Aug 9 2018, 10:55 AM.

Diff Detail

Repository
rOMP OpenMP

Event Timeline

Hahnfeld requested changes to this revision.Aug 9 2018, 11:17 AM
Hahnfeld added a subscriber: Hahnfeld.

Does this patch supersede D44522?

As discussed in there I don't see that DEFAULT means MANDATORY only iff there was a successful offload. From TR7, page 610, lines 17/18:

The DEFAULT value specifies that when one or more target devices are available, the runtime
behaves as if this environment variable is set to MANDATORY [...]

Another remark from D44522: I think we need to handle API methods as well.

libomptarget/src/rtl.cpp
49–64

There was an agreement with Intel to have a query function __kmpc_get_target_offload. Please use that one.

This revision now requires changes to proceed.Aug 9 2018, 11:17 AM

Does this patch supersede D44522?

yes

As discussed in there I don't see that DEFAULT means MANDATORY only iff there was a successful offload. From TR7, page 610, lines 17/18:

The DEFAULT value specifies that when one or more target devices are available, the runtime
behaves as if this environment variable is set to MANDATORY [...]

The question is how we define "available." IMO, it means success of first command. Happy to reconsider, I think most of our users are fine with success of first command

Another remark from D44522: I think we need to handle API methods as well.

I am all for using the suggested __kmpc_get_target_offload function, which is implemented. If we do so, we will be depending on a definition in kmp.h. Do we want a redundant definition of the

enum kmp_target_offload_kind {

tgt_disabled = 0,
tgt_default = 1,
tgt_mandatory = 2

};

I assume we want to do that, as we don't want to include kmp.h in libomptarget.

As discussed in there I don't see that DEFAULT means MANDATORY only iff there was a successful offload. From TR7, page 610, lines 17/18:

The DEFAULT value specifies that when one or more target devices are available, the runtime
behaves as if this environment variable is set to MANDATORY [...]

The question is how we define "available." IMO, it means success of first command. Happy to reconsider, I think most of our users are fine with success of first command

You are right, "available" is not defined in the standard. I've always though of "plugged into the system", ie all devices that are visible to the CUDA runtime. That would match the current implementation of omp_get_num_devices which is defined to return "the number of available target devices".
Actually this behaviour would be important for us as we have our GPUs configured exclusively. So when there is already a process running all other users get a runtime error. In that case it would be very helpful to have libomptarget abort the program.

Another remark from D44522: I think we need to handle API methods as well.

I am all for using the suggested __kmpc_get_target_offload function, which is implemented. If we do so, we will be depending on a definition in kmp.h. Do we want a redundant definition of the

enum kmp_target_offload_kind {
  tgt_disabled = 0,
  tgt_default = 1,
  tgt_mandatory = 2
};

I assume we want to do that, as we don't want to include kmp.h in libomptarget.

Yes, D44522 also had that.

RaviNarayanaswamy added inline comments.
libomptarget/src/interface.cpp
52

Why not move the check of handle_target_outcome into CheckDeviceAndCtors so you dont have to do every time you invoke this function.

You are right, "available" is not defined in the standard. I've always though of "plugged into the system", ie all devices that are visible to the CUDA runtime. That would match the current implementation of omp_get_num_devices which is defined to return "the number of available target devices".
Actually this behaviour would be important for us as we have our GPUs configured exclusively. So when there is already a process running all other users get a runtime error. In that case it would be very helpful to have libomptarget abort the program.

So you like deciding available on first use? This is what your comment seems to imply, but I am not 100% sure.

I think that our current interpretation of “available” for devices is
reasonable. There may be many reasons that a device is not available, even
if it is plugged in. Deciding that it is available because we were able to
use it seems the most dynamic method of determining this.

Kevin O’Brien

You are right, "available" is not defined in the standard. I've always though of "plugged into the system", ie all devices that are visible to the CUDA runtime. That would match the current implementation of omp_get_num_devices which is defined to return "the number of available target devices".
Actually this behaviour would be important for us as we have our GPUs configured exclusively. So when there is already a process running all other users get a runtime error. In that case it would be very helpful to have libomptarget abort the program.

So you like deciding available on first use? This is what your comment seems to imply, but I am not 100% sure.

No, I favor the implication "visible" -> "available" which is the same interpretation that omp_get_num_devices is using (in its current form).
If we implemented the behaviour of this patch ("successful offload to ONE device" -> "ALL devices available", "error on ONE device" -> "NO devices available") we'd need to change the API methods. That would probably imply probing all devices at runtime startup - because after all we don't know which device the user is going to use. IIRC that was to be avoided, libomptarget uses lazy initialization at the moment.

I think that our current interpretation of “available” for devices is
reasonable. There may be many reasons that a device is not available, even
if it is plugged in. Deciding that it is available because we were able to
use it seems the most dynamic method of determining this.

Suppose we have 2 devices plugged into the system, and the first one cannot be used (for whatever reason: hardware failure, exclusive configuration and somebody else is running, etc.).
Now a clever application sees the two (because omp_get_num_devices() returns 2) and does:

#pragma omp parallel num_threads(omp_get_num_devices())
{
  #pragma omp target device(omp_get_thread_num())
  { }
}

I think the runtime behaviour with this patch depends on the execution order (and exposes a race condition in handle_target_outcome on TargetOffloadPolicy; let's ignore that for now):

  • If target device(0) executes first, libomptarget will notice the error and silently disable offloading. All target regions will execute on the host.
  • If however target device(1) executes first and returns successfully, libomptarget will raise OMP_TARGET_OFFLOAD to MANDATORY and will abort execution when catching the error of target device(0).

I don't think that makes much sense. IMO the runtime should detect two "visible" -> "available" devices and abort execution in all cases.

So you like deciding available on first use? This is what your comment seems to imply, but I am not 100% sure.

As I understand Jonas, he would prefer semantically something like:

devices-available =  omp_get_num_devices()>0

This does not depend on a successful offload, just compares whether a device is there:

if devices-available && DEFAULT:
  continue as if MANDATORY
else:
  continue as if DISABLED

So you like deciding available on first use? This is what your comment seems to imply, but I am not 100% sure.

As I understand Jonas, he would prefer semantically something like:

devices-available =  omp_get_num_devices()>0

Yes, because that's how the standard defines omp_get_num_devices(): to return the number of available devices.
And because it results in a sane behaviour for my example.

This does not depend on a successful offload, just compares whether a device is there:

if devices-available && DEFAULT:
  continue as if MANDATORY
else:
  continue as if DISABLED

Almost, !devices-available && MANDATORY should not result in DISABLED ;-)

if DEFAULT:
  if devices-available:
    continue as if MANDATORY
  else:
    continue as if DISABLED
endif

I disagree. As it is currently written, omp_get_num_devices() can also grow over time, so you can also have it return zero, only to increase later to a larger number. What does that do to DEFAULT?
I believe the current policy is ok; if any device fails for any reason on first use, it becomes disabled. Anyone that want to rely on devices being there should use the MANDATORY policy.
I am happy to add a lock around the change of DEFAULT to MANDATORY or DISABLED.

Hahnfeld added a comment.EditedAug 10 2018, 9:11 AM

I disagree. As it is currently written, omp_get_num_devices() can also grow over time, so you can also have it return zero, only to increase later to a larger number. What does that do to DEFAULT?

Ok, fair point. I think we need to decide on the first entry from user code: If one construct fell back to the host all following constructs should, shouldn't they?

I believe the current policy is ok; if any device fails for any reason on first use, it becomes disabled. Anyone that want to rely on devices being there should use the MANDATORY policy.

Huh, but there is only one global TargetOffloadPolicy. So how can we disable it = a single device?

Suppose we have 2 devices plugged into the system, and the first one cannot be used (for whatever reason: hardware failure, exclusive configuration and somebody else is running, etc.).
Now a clever application sees the two (because omp_get_num_devices() returns 2) and does:

#pragma omp parallel num_threads(omp_get_num_devices())
{
  #pragma omp target device(omp_get_thread_num())
  { }
}

I think the runtime behaviour with this patch depends on the execution order (and exposes a race condition in handle_target_outcome on TargetOffloadPolicy; let's ignore that for now):

  • If target device(0) executes first, libomptarget will notice the error and silently disable offloading. All target regions will execute on the host.
  • If however target device(1) executes first and returns successfully, libomptarget will raise OMP_TARGET_OFFLOAD to MANDATORY and will abort execution when catching the error of target device(0).

I don't think that makes much sense. IMO the runtime should detect two "visible" -> "available" devices and abort execution in all cases.

Did you consider this example?

Suppose we have 2 devices plugged into the system, and the first one cannot be used (for whatever reason: hardware failure, exclusive configuration and somebody else is running, etc.).
Now a clever application sees the two (because omp_get_num_devices() returns 2) and does:

#pragma omp parallel num_threads(omp_get_num_devices())
{
  #pragma omp target device(omp_get_thread_num())
  { }
}

I think the runtime behaviour with this patch depends on the execution order (and exposes a race condition in handle_target_outcome on TargetOffloadPolicy; let's ignore that for now):

  • If target device(0) executes first, libomptarget will notice the error and silently disable offloading. All target regions will execute on the host.
  • If however target device(1) executes first and returns successfully, libomptarget will raise OMP_TARGET_OFFLOAD to MANDATORY and will abort execution when catching the error of target device(0).

I don't think that makes much sense. IMO the runtime should detect two "visible" -> "available" devices and abort execution in all cases.

Did you consider this example?

Your example can show a different issue as well:

A program first do a register_lib that has code for a device which does not exist on that machine. Thus omp_get_num_devices is zero. DEFAULT then becomes DISABLED. A program then does a register_lib that has code for a device that exist on this machine. But now it is disabled.

A way out of this is to see the bigger picture. If the user want to guarantee execution of all targets on a device, the user must us MANDATORY. If the user does not want devices, then DISABLED is called for. DEFAULT is a best effort, it will not be perfect, nor it has to be perfect.

I suggest that deciding on the first attempt to actually offload is as valid a policy as any, and is simple to understand. Regardless of how we do it, what we want to really avoid is that we execute some kernels on the device, and some on the host, (ignoring here explicit orders from the program via the "if(0)" clause). Both policies "num_devices>0" and "decide on first invocation" satisfy this.

Added a mutex around changing DEFAULT to MANDATORY or DISABLED
Added proper action when discovering failure (which is immediate abort of function being performed)
Failure to locate data during an update is now tolerated (a warning could be issued)

AlexEichenberger marked an inline comment as done.Aug 10 2018, 3:18 PM
AlexEichenberger added inline comments.
libomptarget/src/interface.cpp
52

Ravi, I did your change but reverted it for the following reason. I liked to have all of the handling of target_offload variable in the same file (interface.cpp). When I moved in into the CheckDeviceAndCtors, I needed to insert it in 2 places within the function, and handle_target_outcome was now in two different files.

If you feel strongly about your request, I will be happy to move the code into CheckDeviceAndCtors.

Alex: If you feel strongly about your request, I will be happy to move the code into CheckDeviceAndCtors.
It is just a suggestion. Either way is fine.

protze.joachim added a comment.EditedAug 13 2018, 4:03 AM

@Hahnfeld and I discussed the behavior and found a very inconsistent behavior. Let's consider something like the following code, under the constraint, that device(1) is busy (will fail to offload) and device(0) is free:

#pragma omp parallel num_threads(omp_get_num_devices())
{
  #pragma omp target device(omp_get_thread_num())
  { }
}

We found the following cases:

  1. first offload to device(0) -> succeed -> set MANDATORY -> abort on offloading to device(1)
  2. first offload to device(1) -> fail -> set DISABLED -> continue execution on the host
  3. start offloading to device(0) and device(1) -> fail on 1 -> set DISABLED, but execution of offloaded code will complete. -> although we are in DISABLED, we executed code on a device

The locking only prevents the race on the variable, but not the behavior in 3).

From our point of view, the decision between DISABLED and MANDATORY must be before starting any offloading.

Another reason to implement such behavior:
If the offloaded code changes the memory state (and the target region code is not idempotent), this prevents inconsistent states of execution, if the initial offloading fails after partial execution on the device and a second execution the host.

Default policy is now selected when registering the libraries, and will default to mandatory/disabled depending of whether there are one or more devices/none.

Good arguments, I see why the consistency of why linking the number of devices returned and OMP_TARGET_OFFLOAD environment is good.
Implemented the requested changes

Hahnfeld requested changes to this revision.Aug 13 2018, 12:10 PM
Hahnfeld added inline comments.
libomptarget/include/omptarget.h
188–194 ↗(On Diff #160413)

Please remove, there is a CMake flag to do this.

libomptarget/src/device.cpp
368–370

I think this implementation can live directly in interface.cpp as no other file should use it. Please make the function static and I think internal function names follow a CamelCase naming convention; device_is_ready seems to be the exception...

385–386

Asserts are no-ops when building with -DNDEBUG which is the default for Release builds. Call exit(1) directly?

I think we should provide the user with an error message when aborting the execution, DP is subject to LIBOMPTARGET_DEBUG and only available in Debug builds.

(PGI's OpenACC implementation prints a detailed error code from CUDA which is probably not possible in the target agnostic part. I think that's ok for now, the user can use cuda-gdb or other tools...)

libomptarget/src/device.h
24–31

Is there a particular reason the code is put into device.h / device.cpp? IMO it's not related to device management.

For later reuse in API methods (if deemed necessary after the discussion on omp-lang) I think this should go into private.h.

libomptarget/src/rtl.cpp
230–231

I think this code path is not triggered when there is no matching RTL at all. At the moment this causes an assert (in Debug builds) because TargetOffloadPolicy is still tgt_default.

However I don't think we can decide on a single call to __tgt_register_lib for the reason you mentioned last week: The number of available devices can change when more device images are registered. As such I think we should move the decision handling tgt_default to the beginning of all interface methods that can come from a user's target construct. What do you think?

This revision now requires changes to proceed.Aug 13 2018, 12:10 PM

Addressed all remaining issues

AlexEichenberger marked 3 inline comments as done.Aug 14 2018, 9:31 AM

responded to comments.

libomptarget/src/interface.cpp
52

Made the handle_target_outcome static to the interface.cpp as suggested by Jonas

libomptarget/src/rtl.cpp
230–231

Agreed, you are absolutely right

Please also clang-format your changes.

libomptarget/src/interface.cpp
28–61

This seems overly complex where we now only need a single fprintf (which is thread-safe) inside an if. Do you have upcoming patches that will use these functions?

65

Please rename to CamelCase, something like HandleTargetOutcome? (device_is_ready seems to be the exception)

67–77

After some thinking I guess this actually works in all cases:

  1. If there is at least one device, the code path in RTLsTy::LoadRTLs will be executed.
  2. Otherwise all functions check whether the device ID is less than Devices.size() (mostly in device_is_ready which is called from CheckDeviceAndCtors). This can never be true when there is no RTL and the execution ends up here.

However I don't find this very intuitive and it will be very easy to mess up with future changes. I suggest to do this upfront, for example with a helper function:

static bool IsOffloadDisabled() {
  if (TargetOffloadPolicy == tgt_default) {
    // ...
  }
  return TargetOffloadPolicy == tgt_disabled;
}

The first line of each interface function would then become if (IsOffloadDisabled()) return;. In addition the duplicate code could be dropped from RTLsTy::LoadRTLs. Do you have concerns about this straight-forward solution?

AlexEichenberger marked 2 inline comments as done.

implemented suggest changes

AlexEichenberger marked 2 inline comments as done.Aug 14 2018, 1:00 PM

responded to comments

libomptarget/src/interface.cpp
28–61

Yes, I intend this to be used for all output feedback to users. Ideally, I would find a way to share the infrastructure in kmpc, but for the moment this should do it. Note that the lock is only acquired prior to aborting, so overheads are not an issue.

67–77

good suggestions, re-inserted the lock as now there could be a race condition. Again, in practice, this should not happen often. In practice, the lock should be acquired only once, so this is ok

Looks mostly good.

Please also clang-format your changes.

I'm not sure whether you did this; especially the new functions in interface.cpp don't seem to follow the style I'm used to...

libomptarget/src/interface.cpp
28–61

In that case I think the functions should be moved to omptarget.cpp (or a completely new file?) because they'll be used by more than just the interface?

Inverting cond seems weird, could you explain that choice? If that's to model assert I think the function should be renamed to AssertOrFatalMassage. To me the current function name implies if (cond) FatalMessage(...), but that may be just me.

AlexEichenberger marked an inline comment as done.Aug 14 2018, 8:48 PM
AlexEichenberger added inline comments.
libomptarget/src/interface.cpp
28–61

I agree.. maybe private.cpp?

I named them AssertWithFatalMessage, but then you mentioned that asserts are turned off in the release mode, so I though that would be confusing too.

But if you prefer Assert, that was my first choice too

Hahnfeld added inline comments.Aug 15 2018, 12:11 AM
libomptarget/src/interface.cpp
28–61

Another option would be to remove FatalMessageWithCond and put in the if statement directly. That would a) be clear, b) avoid choosing a name and whether to invert cond or not, and c) deduplicate code and make FatalMessage used (which is not at the moment).

For one single function I don't think we need to start a new file right now. IMO that could go to omptarget.cpp.

fixed clang-format and moved fatal message function

AlexEichenberger marked 8 inline comments as done.Aug 15 2018, 8:30 AM

updated comments status

Hahnfeld accepted this revision.Aug 16 2018, 12:59 AM

Looks good, thanks for the changes.

This revision is now accepted and ready to land.Aug 16 2018, 12:59 AM

Thanks to all for your valuable comments, much appreciated, really contributed to the quality of the patch

Looking thourhg the code one more time, I realized that there was no default init for this key variable

kmp_target_offload_kind_t TargetOffloadPolicy = tgt_default;

It's better for it to be init to zero value, to be absolutely safe.

Alex, will you land this anytime soon? (I'd like to backport for our local installation of Clang 7.0...)

This revision was automatically updated to reflect the committed changes.
mikerice added inline comments.
libomptarget/src/omptarget.cpp
36

Anyone else having trouble building after this change?

.../llvm/projects/openmp/libomptarget/src/omptarget.cpp:37:21: error: 'va_start' was not declared in this scope

va_start(args, fmt);
                  ^

.../llvm/projects/openmp/libomptarget/src/omptarget.cpp:43:14: error: 'va_end' was not declared in this scope

va_end(args);
           ^

It seems my environment would really like a stdarg.h in this file.

Yes, see D51226 which will replace the varargs function by a macro.