Page MenuHomePhabricator

[OpenMP 5.0] declare mapper runtime implementation
Needs ReviewPublic

Authored by lildmh on Sep 26 2019, 12:32 PM.

Details

Summary

This patch implements the runtime functionality to support the OpenMP 5.0 declare mapper. It introduces a set of new interfaces so user-defined mapper functions can be passed to the runtime. The runtime will call mapper functions to fill up an internal data structure. Later it will map every component in the internal data structure.
The design slides can be found at https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx

Diff Detail

Event Timeline

lildmh created this revision.Sep 26 2019, 12:32 PM
lildmh updated this revision to Diff 222173.Sep 27 2019, 7:43 AM

Please review when you have time, thanks

Please review when you have time, thanks

Maybe, it is better to add a single function to pass the list of mappers to the runtime and use the old functions rather than just duplicate them? Something like __tgt_mappers(...) and store them in the runtime. Plus, modify the original functions to use mappers if they were provided. Thoughts?

Please review when you have time, thanks

Maybe, it is better to add a single function to pass the list of mappers to the runtime and use the old functions rather than just duplicate them? Something like __tgt_mappers(...) and store them in the runtime. Plus, modify the original functions to use mappers if they were provided. Thoughts?

I'm in favor of this solution, as it is less intrusive than the one posted in this patch. I think a revised patch will be quite smaller and the changes it introduces will be clearer.

Please review when you have time, thanks

Maybe, it is better to add a single function to pass the list of mappers to the runtime and use the old functions rather than just duplicate them? Something like __tgt_mappers(...) and store them in the runtime. Plus, modify the original functions to use mappers if they were provided. Thoughts?

I'm in favor of this solution, as it is less intrusive than the one posted in this patch. I think a revised patch will be quite smaller and the changes it introduces will be clearer.

I think Alexey and George's concern is that this patch introduces many new runtime apis. My arguments are:

  1. This patch will deprecate the old interfaces as we discussed before in the meeting. E.g., __tgt_target_teams will no longer be used. They are kept there just for legacy code support. So I didn't actually introduce any new runtime apis.
  1. If we have something like __tgt_mapper instead, and integrate all mapper functionalities in it, we will need to pass extra arguments to it to distinguish whether it is for __tgt_targt, __tgt_target_data_begin, etc. On the other hand, the current runtime has an interface for each case. For instance, we have __tgt_targt, __tgt_target_data_begin, __tgt_target_data_update_nowait, etc., instead of a single __tgt_target function which does it all. Therefore, I think the current design fits the overall design of OpenMP runtime better: have a function for each case.

Please review when you have time, thanks

Maybe, it is better to add a single function to pass the list of mappers to the runtime and use the old functions rather than just duplicate them? Something like __tgt_mappers(...) and store them in the runtime. Plus, modify the original functions to use mappers if they were provided. Thoughts?

I'm in favor of this solution, as it is less intrusive than the one posted in this patch. I think a revised patch will be quite smaller and the changes it introduces will be clearer.

I think Alexey and George's concern is that this patch introduces many new runtime apis. My arguments are:

  1. This patch will deprecate the old interfaces as we discussed before in the meeting. E.g., __tgt_target_teams will no longer be used. They are kept there just for legacy code support. So I didn't actually introduce any new runtime apis.
  2. If we have something like __tgt_mapper instead, and integrate all mapper functionalities in it, we will need to pass extra arguments to it to distinguish whether it is for __tgt_targt, __tgt_target_data_begin, etc. On the other hand, the current runtime has an interface for each case. For instance, we have __tgt_targt, __tgt_target_data_begin, __tgt_target_data_update_nowait, etc., instead of a single __tgt_target function which does it all. Therefore, I think the current design fits the overall design of OpenMP runtime better: have a function for each case.

Why we will need this extra argument? It just provides a list of mapper functions and stores them in the runtime before we call any __tgt_... function. Each particular __tgt_... runtime function will know what do with all these mappers if they were stored.

Please review when you have time, thanks

Maybe, it is better to add a single function to pass the list of mappers to the runtime and use the old functions rather than just duplicate them? Something like __tgt_mappers(...) and store them in the runtime. Plus, modify the original functions to use mappers if they were provided. Thoughts?

I'm in favor of this solution, as it is less intrusive than the one posted in this patch. I think a revised patch will be quite smaller and the changes it introduces will be clearer.

I think Alexey and George's concern is that this patch introduces many new runtime apis. My arguments are:

  1. This patch will deprecate the old interfaces as we discussed before in the meeting. E.g., __tgt_target_teams will no longer be used. They are kept there just for legacy code support. So I didn't actually introduce any new runtime apis.
  2. If we have something like __tgt_mapper instead, and integrate all mapper functionalities in it, we will need to pass extra arguments to it to distinguish whether it is for __tgt_targt, __tgt_target_data_begin, etc. On the other hand, the current runtime has an interface for each case. For instance, we have __tgt_targt, __tgt_target_data_begin, __tgt_target_data_update_nowait, etc., instead of a single __tgt_target function which does it all. Therefore, I think the current design fits the overall design of OpenMP runtime better: have a function for each case.

Why we will need this extra argument? It just provides a list of mapper functions and stores them in the runtime before we call any __tgt_... function. Each particular __tgt_... runtime function will know what do with all these mappers if they were stored.

Okay, I think I understand your idea now. Then in this case, we will have a call to __tgt_mapper before every call to __tgt_target*, because we need to overwrite the mappers written for previous calls. I don't particularly like this idea, since this will introduce implicit dependencies between different runtime calls and a program will have twice runtime calls. But if most people like it, I'm okay.

Please review when you have time, thanks

Maybe, it is better to add a single function to pass the list of mappers to the runtime and use the old functions rather than just duplicate them? Something like __tgt_mappers(...) and store them in the runtime. Plus, modify the original functions to use mappers if they were provided. Thoughts?

I'm in favor of this solution, as it is less intrusive than the one posted in this patch. I think a revised patch will be quite smaller and the changes it introduces will be clearer.

I think Alexey and George's concern is that this patch introduces many new runtime apis. My arguments are:

  1. This patch will deprecate the old interfaces as we discussed before in the meeting. E.g., __tgt_target_teams will no longer be used. They are kept there just for legacy code support. So I didn't actually introduce any new runtime apis.
  2. If we have something like __tgt_mapper instead, and integrate all mapper functionalities in it, we will need to pass extra arguments to it to distinguish whether it is for __tgt_targt, __tgt_target_data_begin, etc. On the other hand, the current runtime has an interface for each case. For instance, we have __tgt_targt, __tgt_target_data_begin, __tgt_target_data_update_nowait, etc., instead of a single __tgt_target function which does it all. Therefore, I think the current design fits the overall design of OpenMP runtime better: have a function for each case.

Why we will need this extra argument? It just provides a list of mapper functions and stores them in the runtime before we call any __tgt_... function. Each particular __tgt_... runtime function will know what do with all these mappers if they were stored.

Okay, I think I understand your idea now. Then in this case, we will have a call to __tgt_mapper before every call to __tgt_target*, because we need to overwrite the mappers written for previous calls. I don't particularly like this idea, since this will introduce implicit dependencies between different runtime calls and a program will have twice runtime calls. But if most people like it, I'm okay.

I think another problem is this may not work with legacy code, since they don't have calls to __tgt_mapper. This may be a bigger problem when legacy code and code compiled with new Clang are mixed together: a legacy call to __tgt_target may get a mapper which is intended for some new code.

Please review when you have time, thanks

Maybe, it is better to add a single function to pass the list of mappers to the runtime and use the old functions rather than just duplicate them? Something like __tgt_mappers(...) and store them in the runtime. Plus, modify the original functions to use mappers if they were provided. Thoughts?

I'm in favor of this solution, as it is less intrusive than the one posted in this patch. I think a revised patch will be quite smaller and the changes it introduces will be clearer.

I think Alexey and George's concern is that this patch introduces many new runtime apis. My arguments are:

  1. This patch will deprecate the old interfaces as we discussed before in the meeting. E.g., __tgt_target_teams will no longer be used. They are kept there just for legacy code support. So I didn't actually introduce any new runtime apis.
  2. If we have something like __tgt_mapper instead, and integrate all mapper functionalities in it, we will need to pass extra arguments to it to distinguish whether it is for __tgt_targt, __tgt_target_data_begin, etc. On the other hand, the current runtime has an interface for each case. For instance, we have __tgt_targt, __tgt_target_data_begin, __tgt_target_data_update_nowait, etc., instead of a single __tgt_target function which does it all. Therefore, I think the current design fits the overall design of OpenMP runtime better: have a function for each case.

Why we will need this extra argument? It just provides a list of mapper functions and stores them in the runtime before we call any __tgt_... function. Each particular __tgt_... runtime function will know what do with all these mappers if they were stored.

Okay, I think I understand your idea now. Then in this case, we will have a call to __tgt_mapper before every call to __tgt_target*, because we need to overwrite the mappers written for previous calls. I don't particularly like this idea, since this will introduce implicit dependencies between different runtime calls and a program will have twice runtime calls. But if most people like it, I'm okay.

I think another problem is this may not work with legacy code, since they don't have calls to __tgt_mapper. This may be a bigger problem when legacy code and code compiled with new Clang are mixed together: a legacy call to __tgt_target may get a mapper which is intended for some new code.

No, there should not be problem with the legacy code. If the array of mappers is empty - use default mapping through the bitcopying.

Sorry, you are right. I didn't think about the case to always clean up the mappers after finishing using it.

Another possible problem: what if a task is scheduled out after __tgt_mapper and before __tgt_target, for example. I don't think we can keep a per task/thread mapper storage in the current implementation.

Sorry, you are right. I didn't think about the case to always clean up the mappers after finishing using it.

Another possible problem: what if a task is scheduled out after __tgt_mapper and before __tgt_target, for example. I don't think we can keep a per task/thread mapper storage in the current implementation.

tgt_mapper must be called immediately before tgt_target in the same task context.

Sorry, you are right. I didn't think about the case to always clean up the mappers after finishing using it.

Another possible problem: what if a task is scheduled out after __tgt_mapper and before __tgt_target, for example. I don't think we can keep a per task/thread mapper storage in the current implementation.

tgt_mapper must be called immediately before tgt_target in the same task context.

Yes, but I think it cannot solve this problem. For example, after a task executes __tgt_mapper, it is scheduled out and a new task is scheduled to execute. After the previous task resumes execution, the mapper information it stored has lost, and the execution of the __tgt_target will not get the mapper. I don't think there is a per-task context in libomptarget.

Sorry, you are right. I didn't think about the case to always clean up the mappers after finishing using it.

Another possible problem: what if a task is scheduled out after __tgt_mapper and before __tgt_target, for example. I don't think we can keep a per task/thread mapper storage in the current implementation.

tgt_mapper must be called immediately before tgt_target in the same task context.

Yes, but I think it cannot solve this problem. For example, after a task executes __tgt_mapper, it is scheduled out and a new task is scheduled to execute. After the previous task resumes execution, the mapper information it stored has lost, and the execution of the __tgt_target will not get the mapper. I don't think there is a per-task context in libomptarget.

libomptarget already uses some functions from libomp, you can use it to check for the task context.

Lingda is right, we had faced the same issue in the loop trip count implementation. The loop trip count should be set per task but libomptarget has no notion of tasks, so we ended up engaging the host runtime (libomp) to store per-task information. Although it involves more work, I still believe that will be a more elegant solution.

Lingda is right, we had faced the same issue in the loop trip count implementation. The loop trip count should be set per task but libomptarget has no notion of tasks, so we ended up engaging the host runtime (libomp) to store per-task information. Although it involves more work, I still believe that will be a more elegant solution.

I don't see a big problems here. You can store the mapper data per task using thread id from libomp, as we do it for tripcount. We can use the same solution for mappers.

Alexey and George: This is a big decision to make. We need to have most people's consents. I'll send it to the mailing list later.

Lingda is right, we had faced the same issue in the loop trip count implementation. The loop trip count should be set per task but libomptarget has no notion of tasks, so we ended up engaging the host runtime (libomp) to store per-task information. Although it involves more work, I still believe that will be a more elegant solution.

Btw, I never understand why we have a separate function to push loop trip count. Why is that?

Lingda is right, we had faced the same issue in the loop trip count implementation. The loop trip count should be set per task but libomptarget has no notion of tasks, so we ended up engaging the host runtime (libomp) to store per-task information. Although it involves more work, I still believe that will be a more elegant solution.

Btw, I never understand why we have a separate function to push loop trip count. Why is that?

The same problem, to not bloat interface of the runtime library

ABataev added inline comments.Nov 11 2019, 7:08 AM
libomptarget/src/exports
16–25

On the last telecon, we decided to support this solution so adding new functions is accepted.

libomptarget/src/interface.cpp
103

__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

171

__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

240

__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

293

__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

358

__kmpc_omp_taskwait(NULL, __kmpc_global_thread_num(NULL));

libomptarget/src/omptarget.cpp
379–382

Why we have this limitation?

I think the direction is good here. Some duplication is inevitable in the interface. Between the non functional changes and the code duplication in the implementation it's difficult to work out exactly what the code is doing though.

libomptarget/src/omptarget.cpp
373

typedef/using instead of copying the type list for the cast?

435

It's probably worth doing the arg_types[i] => Type refactor first. Combined with the whitespace change it's quite a lot of this diff.

458

Likewise data_size => Size. Separating the NFC from the FC makes it easier to parse the latter.

539

I think this is the same type list copy & paste that suggested a typedef above

542

The rest of this looks quite familiar too. Perhaps factor the copy & paste into helper functions that are called by both locations?

699

And again

lildmh updated this revision to Diff 229199.Nov 13 2019, 4:11 PM
lildmh marked 6 inline comments as done.

Thanks Alexey and Jon for your review. Fixed the issues and rebased

libomptarget/src/omptarget.cpp
373

Sounds good, thanks

379–382

It's because that's the length of the parent bit. If we have more components, the parent bits will break.

435

I extracted the mapping of each component into a separate function for code reuse purpose, like target_data_end_component here. It uses Type as the input argument, so there is no longer arg_types[i]. It's the same for Size.

So I don't think it will make it more clear to change arg_types[i] to Type first. What do you think?

542

The duplication is not too much though. Do you think it will worth it to have a helper function?

lildmh marked 6 inline comments as done.Nov 13 2019, 4:12 PM
ABataev added inline comments.Nov 14 2019, 8:23 AM
libomptarget/src/omptarget.cpp
542

+1 for refactoring.

lildmh marked an inline comment as done.Nov 15 2019, 8:46 AM
lildmh added inline comments.
libomptarget/src/omptarget.cpp
542

Hi Alexey and Jon,

I didn't find an elegant way to merge the code below. It's mainly because they have different way to access other components:
E.g., for mapper, Components.get(parent_idx) is used to get its parent, on the other hand, args[parent_idx] is used for arguments. One is array of struct, the other is struct of array.

ABataev added inline comments.Mon, Nov 25, 10:28 AM
libomptarget/src/omptarget.cpp
542

Still, do not understand what is a problem with the refactoring. You can use lambdas, if need some differences in data, or something similar. Anyway, it would better rather than just copy-paste.

544

Usually, we use something like (for i = 0, e = end(); i < e; ++i) pattern.

My thoughts are the same as before. This change mixes a refactor with a functional change plus duplicates a bunch of code. The overall change might work but I can't tell from the diff.

libomptarget/src/omptarget.cpp
542

Some options:

  • wrap object in a class that adapts the interface
  • pass in a function that does the access
  • refactor one data type to the same layout as the other
  • extract small functions which are called by both
lildmh updated this revision to Diff 230971.Mon, Nov 25, 2:40 PM

Thanks for your reviews. Hope this looks better.

@JonChesterfield: If you insist, I can break this patch into 2 smaller ones. Since I don't have much time now, it will happen later.

Thanks for your reviews. Hope this looks better.

Thanks

@JonChesterfield: If you insist, I can break this patch into 2 smaller ones. Since I don't have much time now, it will happen later.

Splitting large patches into NFC and functional change doesn't seem contentious but is not required.

The advantage is seen when the build breaks. It's less annoying for the author to have the functional change part temporarily reverted than to lose the whole lot, especially when the functional change is the smaller diff as I think it would be here.