This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][FIX] Enforce a function boundary for a new data environment
ClosedPublic

Authored by jdoerfert on Jan 8 2021, 8:32 AM.

Details

Summary

Whenever we enter a new OpenMP data environment we want to enter a
function to simplify reasoning. Later we probably want to remove the
entire specialization wrt. the if clause and pass the result to the
runtime, for now this should fix PR48686.

Diff Detail

Event Timeline

jdoerfert created this revision.Jan 8 2021, 8:32 AM
jdoerfert requested review of this revision.Jan 8 2021, 8:32 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 8 2021, 8:32 AM

I'm guessing we're using the function boundary as a compiler barrier. That seems fragile in the face of improving cross-function optimisation.

What is the invariant we want around entry to a data environment, which happens to be met by a function boundary?

I'm guessing we're using the function boundary as a compiler barrier. That seems fragile in the face of improving cross-function optimisation.

Looks like applying inaccessiblemem_or_argmemonly attribute to the OpenMP functions is way too optimistic since we still can access this (inaccessible) memory using other OpenMP functions. Not sure about the semantics of this attribute, though.

What is the invariant we want around entry to a data environment, which happens to be met by a function boundary?

I'm guessing we're using the function boundary as a compiler barrier. That seems fragile in the face of improving cross-function optimisation.

Looks like applying inaccessiblemem_or_argmemonly attribute to the OpenMP functions is way too optimistic since we still can access this (inaccessible) memory using other OpenMP functions. Not sure about the semantics of this attribute, though.

I don't understand where inaccessiblemem_or_argmemonly is coming from. This prevents us from inlining the outlined parallel region.

What is the invariant we want around entry to a data environment, which happens to be met by a function boundary?

The data environment in one function is the same. So, changes that take affect only in a new data environment will not manifest in the function.
Take:

a = get_some_ICV_fixed_per_data_env()
unknown();
b = get_some_ICV_fixed_per_data_env();

We want to ensure a == b.


The problem with PR48686 could actually be solved differently as well. However, the property I describe above is generally useful.
It holds, as far as I can tell, except when we do the if(0) "optimization". IMHO, the if condition should be passed to the runtime
and handled there. There is little to no point in exposing the conditional in user land and exposing even more runtime calls.

I'm guessing we're using the function boundary as a compiler barrier. That seems fragile in the face of improving cross-function optimisation.

Looks like applying inaccessiblemem_or_argmemonly attribute to the OpenMP functions is way too optimistic since we still can access this (inaccessible) memory using other OpenMP functions. Not sure about the semantics of this attribute, though.

I don't understand where inaccessiblemem_or_argmemonly is coming from. This prevents us from inlining the outlined parallel region.

kmpc_serialized_parallel and kmpc_end_serialized_parallel functions are marked with this attribute and I think OpenMPOpt optimizes read_only functions because of this. I.e. it does not consider kmpc_serialized_parallel and kmpc_end_serialized_paralle function calls as opimization barriers.

What is the invariant we want around entry to a data environment, which happens to be met by a function boundary?

The data environment in one function is the same. So, changes that take affect only in a new data environment will not manifest in the function.
Take:

a = get_some_ICV_fixed_per_data_env()
unknown();
b = get_some_ICV_fixed_per_data_env();

We want to ensure a == b.


The problem with PR48686 could actually be solved differently as well. However, the property I describe above is generally useful.
It holds, as far as I can tell, except when we do the if(0) "optimization". IMHO, the if condition should be passed to the runtime
and handled there. There is little to no point in exposing the conditional in user land and exposing even more runtime calls.

I don't understand where inaccessiblemem_or_argmemonly is coming from. This prevents us from inlining the outlined parallel region.

kmpc_serialized_parallel and kmpc_end_serialized_parallel functions are marked with this attribute and I think OpenMPOpt optimizes read_only functions because of this. I.e. it does not consider kmpc_serialized_parallel and kmpc_end_serialized_paralle function calls as opimization barriers.

Can you provide an example why this is a problem?

I don't understand where inaccessiblemem_or_argmemonly is coming from. This prevents us from inlining the outlined parallel region.

kmpc_serialized_parallel and kmpc_end_serialized_parallel functions are marked with this attribute and I think OpenMPOpt optimizes read_only functions because of this. I.e. it does not consider kmpc_serialized_parallel and kmpc_end_serialized_paralle function calls as opimization barriers.

Can you provide an example why this is a problem?

Looks like PR48686, you mentioned, caused by this. The second function call should not be optimized, because it is inside the region between kmpc_serialized_parallel() and kmpc_end_serialized_parallel() function calls.

I don't understand where inaccessiblemem_or_argmemonly is coming from. This prevents us from inlining the outlined parallel region.

kmpc_serialized_parallel and kmpc_end_serialized_parallel functions are marked with this attribute and I think OpenMPOpt optimizes read_only functions because of this. I.e. it does not consider kmpc_serialized_parallel and kmpc_end_serialized_paralle function calls as opimization barriers.

Can you provide an example why this is a problem?

Looks like PR48686, you mentioned, caused by this. The second function call should not be optimized, because it is inside the region between kmpc_serialized_parallel() and kmpc_end_serialized_parallel() function calls.

OpenMPOpt assumes some ICVs are fixed per data environment and they are not changing within a single function. Removing inaccessiblemem_or_argmemonly should not make a difference here.
I'm going to propose an alternative to __kmpc_fork_call now, that would fix this issue as well and make a lot of things simpler and cleaner.

What we should do, as we move to the OpenMPIRBuilder, is to use runtime interfaces that match OpenMP directives.
Here is the omp parallel one for the host: D94332
The device one should look the same, potentially after extending it, as we want to apply the same logic regardless of the device.
This will also make things like parallel region merging (@ggeorgakoudis) much easier.

I would prefer to fix PR48686 with this patch though until we switch to a new interface. The part where I remove the fn argument
meddling is on it's own valuable. We should not (need to) overwrite the function attributes for performance reasons given that we
have not inserted them. The new noinline will establish the invariant D94332 is going to have as well and which makes OpenMPOpt
much simpler.

ABataev accepted this revision.Jan 8 2021, 1:48 PM

What we should do, as we move to the OpenMPIRBuilder, is to use runtime interfaces that match OpenMP directives.
Here is the omp parallel one for the host: D94332
The device one should look the same, potentially after extending it, as we want to apply the same logic regardless of the device.
This will also make things like parallel region merging (@ggeorgakoudis) much easier.

I would prefer to fix PR48686 with this patch though until we switch to a new interface. The part where I remove the fn argument
meddling is on it's own valuable. We should not (need to) overwrite the function attributes for performance reasons given that we
have not inserted them. The new noinline will establish the invariant D94332 is going to have as well and which makes OpenMPOpt
much simpler.

It would be better to teach OpenMPOpt about current limitations, because these changes may prevent some optimizations, but if this is hard to implement/something else, better to have a bugfix rather than doing something the user does not expect.

This revision is now accepted and ready to land.Jan 8 2021, 1:48 PM

What we should do, as we move to the OpenMPIRBuilder, is to use runtime interfaces that match OpenMP directives.
Here is the omp parallel one for the host: D94332
The device one should look the same, potentially after extending it, as we want to apply the same logic regardless of the device.
This will also make things like parallel region merging (@ggeorgakoudis) much easier.

I would prefer to fix PR48686 with this patch though until we switch to a new interface. The part where I remove the fn argument
meddling is on it's own valuable. We should not (need to) overwrite the function attributes for performance reasons given that we
have not inserted them. The new noinline will establish the invariant D94332 is going to have as well and which makes OpenMPOpt
much simpler.

It would be better to teach OpenMPOpt about current limitations, because these changes may prevent some optimizations, but if this is hard to implement/something else, better to have a bugfix rather than doing something the user does not expect.

True, "some optimizations" are for now prevented on the if(0) code path. The idea is to apply these optimizations on the if(1) as well. It seems hardly useful to optimize the former through some special handling anyway, another reason for D94332.