Page MenuHomePhabricator

[OpenMP 5.0] Codegen support for user-defined mappers
Needs ReviewPublic

Authored by lildmh on Mar 17 2019, 11:14 AM.

Details

Summary

This patch implements the code generation for OpenMP 5.0 declare mapper (user-defined mapper) constructs. For each declare mapper, a mapper function is generated. These mapper functions will be called by the runtime and/or other mapper functions to achieve user defined mapping.
The design slides can be found at https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
lildmh updated this revision to Diff 202790.Jun 3 2019, 1:55 PM
lildmh edited the summary of this revision. (Show Details)

Implement the new mapper codegen scheme

lildmh edited the summary of this revision. (Show Details)Jun 4 2019, 5:34 AM
lildmh updated this revision to Diff 202926.Jun 4 2019, 6:38 AM

Address Alexey's comment about mapping between function and user-defined mappers

lildmh updated this revision to Diff 203438.Jun 6 2019, 1:43 PM

Use tgt instead of kmpc for mapper runtime function names

ABataev added inline comments.Jun 10 2019, 8:48 AM
lib/CodeGen/CGOpenMPRuntime.cpp
8805

This function looks like the universal one, regardless of the type <type_name> specifics. Do we really need to generate it for each particular type and mapper? Or we could use the same function for all types/mappers?

lildmh marked an inline comment as done.Jun 10 2019, 9:18 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8805

I think we need a particular mapper function for each type and mapper, because the code generated within the mapper function depends on what type and what mapper it is.

ABataev added inline comments.Jun 10 2019, 9:39 AM
lib/CodeGen/CGOpenMPRuntime.cpp
8805

Hmm, maybe I'm wrong but I don't see significant mapper or type-specific dependencies in this mapper function. It uses the pointer to type and size of the type, but this information can be generalized, I think. Could you point the lines of code that are type and mapper specific?

lildmh marked an inline comment as done.Jun 10 2019, 9:51 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8805

Code between line 8857-8965 is type and mapper specific. For instance, generateAllInforForMapper depends on the map clauses associated with the mapper and the internal structure of struct/class type, and generates difference code as a result. BasePointers.size() also depends on the above things.

ABataev added inline comments.Jun 10 2019, 10:01 AM
lib/CodeGen/CGOpenMPRuntime.cpp
8805

Most of these data can be passed as parameters to the function. It would be good, if we could move this function to the libomptaret library and reduce the number of changes (and, thus, complexity) of the compiler itself. It is always easier to review and to maintain the source code written in C/C++ rather than the changes in the compiler codegen. Plus, it may reduce the size of the final code significantly, I assume. I would appreciate it if you would try to move this function to libomptarget.

8933

These data can be passed as parameters to the function, no?

lildmh marked an inline comment as done.Jun 10 2019, 10:14 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8805

Hi Alexey, I don't think libomptarget can do this efficiently. For example,

class C {
  int a;
  double *b
}

#pragma omp declare mapper(C c) map(c.a, c.b[0:1])

The codegen can directly know there are 2 components (c.a, c.b[0]) in this mapper function (3 actually when we count the pointer), and it can also know the size, starting address, map type, etc. about these components. Passing all these information to libomptarget seems to be a bad idea. Or did I get your idea wrong?

ABataev added inline comments.Jun 10 2019, 10:18 AM
lib/CodeGen/CGOpenMPRuntime.cpp
8805

Yes, I understand this. But can we pass some additional parameters to this function so we don't need to generate a unique copy of almost the same function for all types/mappers?

lildmh marked an inline comment as done.Jun 10 2019, 10:35 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8805

For different types/mappers, the skeleton of mapper functions are similar (i.e., the things outlined in the comment here). I would say most other code is unique, for instance, the code to prepare parameters of call to __tgt_push_mapper_component. These code should be much more compared with the skeleton shown here. I cannot think of a way to reduce the code by passing more parameters to this function. Please let me know if you have some suggestions.

ABataev added inline comments.Jun 19 2019, 10:33 AM
lib/CodeGen/CGOpenMPRuntime.cpp
8815

Currently currentComponent is generated by the compiler. But can we instead pass this data as an extra parameter to this omp_mapper function.

8816

I don't see this part of logic in the code. Could you show me where is it exactly?

8861

Bad idea to do this. Better to use something like this:

SmallString<256> TyStr;
llvm::raw_svector_ostream Out(TyStr);
CGM.getCXXABI().getMangleContext().mangleTypeName(Ty, Out);
lildmh updated this revision to Diff 205647.Jun 19 2019, 11:22 AM
lildmh marked 3 inline comments as done.

Fix mapper function name mangling

lib/CodeGen/CGOpenMPRuntime.cpp
8815

Emm, I think this scheme will be very difficult and inefficient. If we pass components as an argument of omp_mapper function, it means that the runtime needs to generate all components related to a map clause. I don't think the runtime is able to do that efficiently. On the other hand, in the current scheme, these components are naturally generated by the compiler, and the runtime only needs to know the base pointer, pointer, type, size. etc.

8816

This part doesn't exist in this patch. Currently we don't really look up the mapper for any mapped variable/array/etc. The next patch will add the code to look up the specified mapper for every map clause, and get the mapper function for them correspondingly.

8861

Sounds good. Thanks!

ABataev added inline comments.Jun 19 2019, 11:56 AM
lib/CodeGen/CGOpenMPRuntime.cpp
8815

With the current scheme, we may end with the code blowout. We need to generate very similar code for different types and variables. The worst thing here is that we will be unable to optimize this huge amount of code because the codegen relies on the runtime functions and the code cannot be inlined. That's why I would like to move as much as possible code to the runtime rather than to emit it in the compiler.

8816

Then we need at least some kind of TODO note that this part is not implemented in this patch.

8849

Always better to use constructor with the location to generate correct debug info for all the parameters.

8942–8943

I don't like this code very much! It hides the logiс ща the MEMBER_OF flag deep inside and it is going to be very hard to update it in future if there are some changes in the flags.

9010

I don't see this logic in the comment for the function. Could you add more details for all this logic implemented here?

9071
  1. Use /// style of comment here
  2. Add the description of the logic implemented here
9079

Use StringRef or SmallString

9096–9102

Enclose all substatements into braces or none of them.

lildmh updated this revision to Diff 205785.Jun 20 2019, 5:28 AM
lildmh marked 9 inline comments as done.

Address Alexey's comments

lib/CodeGen/CGOpenMPRuntime.cpp
8815

I understand your concerns. I think this is the best we can do right now.

The most worrisome case will be when we have nested mappers within each other. In this case, a mapper function will call another mapper function. We can inline the inner mapper functions in this scenario, so that these mapper function can be properly optimized. As a result, I think the performance should be fine.

8942–8943

Add a function to calculate this offset. Also modify another existing place using the hard coded number 48.

Sorry for the delay, Lingda. I tried to find some better solution for this.

lib/CodeGen/CGOpenMPRuntime.cpp
755

Do we really need to use int64_t for number of elements? size_t must be enough.

8815

Instead, we can use indirect function calls passed in the array to the runtime. Do you think it is going to be slower? In your current scheme, we generate many runtime calls instead. Could you try to estimate the number of calls in cases if we'll call the mappers through the indirect function calls and in your cuurent scheme, where we need to call the runtime functions many times in each particular mapper?

8974

You can use nuw attribute here, I think

9100

You can use C.toBits(CGM.getSizeSize())

9122

Just CGM.getSizeSize()

9123

Add nuw attribute

lildmh marked an inline comment as done.Tue, Jun 25, 12:37 PM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8815

Hi Alexey,

Sorry I don't understand your idea. What indirect function calls do you propose to be passed to the runtime? What are these functions supposed to do?

The number of function calls will be exactly equal to the number of components mapped, no matter whether there are nested mappers or not. The number of components depend on the program. E.g., if we map a large array section, then there will be many more function calls.

ABataev added inline comments.Tue, Jun 25, 12:49 PM
lib/CodeGen/CGOpenMPRuntime.cpp
8815

I mean the pointers to the mapper function, generated by the compiler. In your comment, it is c.Mapper()

lildmh marked 4 inline comments as done.Tue, Jun 25, 1:02 PM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8815

If we pass nested mapper functions to the runtime, I think it will slow down execution because of the extra level of indirect function calls. E.g., the runtime will call omp_mapper1, which calls the runtime back, which calls omp_mapper2, .... This can result in a deep call stack.

I think the current implementation will be more efficient, which doesn't pass nested mappers to the runtime. One call to the outer most mapper function will have all data mapping done. The call stack will be 2 level deep (the first level is the mapper function, and the second level is __tgt_push_mapper_component) in this case from the runtime. There are also more compiler optimization space when we inline all nested mapper functions.

lildmh marked 2 inline comments as done.Tue, Jun 25, 1:23 PM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
755

Because we use the return value to shift the memberof filed of map type, which is int64_t, so I think int64_t makes sense here.

lildmh updated this revision to Diff 206527.Tue, Jun 25, 1:52 PM
ABataev added inline comments.Tue, Jun 25, 2:53 PM
lib/CodeGen/CGOpenMPRuntime.cpp
755

Ok, there is a discrepancy between runtime part and compiler part: __tgt_push_mapper_component uses size_t for size, while the runtime function uses int64_t. It won't work for 32bit platform.

8815

Yes, if we leave it as is. But if instead of the bunch unique functions we'll have the common one, that accept list if indirect pointers to functions additionally, and move it to the runtime library, we won't need those 2 functions we have currently. We'll have full access to the mapping data vector in the runtime library and won't need to use those 2 accessors we have currently. Instead, we'll need just one runtime functions, which implements the whole mapping logic. We still need to call it recursively, but I assume the number of calls will remain the same as in the current scheme. Did you understand the idea? If yes, it would good if you coild try to estimate the number of function calls in current scheme and in this new scheme to estimate possible pros and cons.

lildmh marked an inline comment as done.Wed, Jun 26, 9:46 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8815

Hi Alexey,

Could you give an example for this scheme? 1) I don't understand how the mapper function can have full access to the mapping data vector without providing these 2 accessors. 2) I don't think it is possible to have a common function instead of bunch of unique functions for each mapper declared.

lildmh marked an inline comment as done.Wed, Jun 26, 9:52 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
755

This should work on 32bit machines, since 32bit machines also use int64_t for the map type?

ABataev added inline comments.Wed, Jun 26, 10:14 AM
lib/CodeGen/CGOpenMPRuntime.cpp
758

Check the declaration of the runtime function in the runtime patch and here. size parameter here has type size_t, while in runtime part it is int64_t

8815

Hi Lingda, something like this.

void __tgt_mapper(void *base, void *begin, size_t size, int64_t type, auto components[]) {
  // Allocate space for an array section first.
  if (size > 1 && !maptype.IsDelete)
     <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
   
  // Map members.
  for (unsigned i = 0; i < size; i++) {
     // For each component specified by this mapper:
     for (auto c : components) {
       if (c.hasMapper())
         (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
       else
         <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
     }
  }
  // Delete the array section.
  if (size > 1 && maptype.IsDelete)
    <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
}

void <type>.mapper(void *base, void *begin, size_t size, int64_t type) {
 auto sub_components[] = {...};
 __tgt_mapper(base, begin, size, type, sub_components);
}
lildmh marked 2 inline comments as done.Wed, Jun 26, 10:50 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
758

The runtime part uses int64_t because I see every other runtime function use int64_t instead of size_t for size, e.g., __tgt_target, __tgt_target_teams, etc., despite that they are declared to be size_t in Clang codegen. So I guess it is done on purpose? Otherwise we need to modify all these runtime function interface in the future.

8815

Hi Alexey,

I don't think this scheme is more efficient than the current scheme. My reasons are:

  1. Most code here is essentially to generate components, i.e., we need to generate c.arg_base, c.arg_begin, c.arg_size, c.arg_type for each c in components, so there will still be a lot of code in <type>.mapper. It will not reduce the mapper function code, i.e., we will still have a bunch of unique mapper functions.
  1. This scheme will prevent a lot of compiler optimization from happening. In reality, a lot of computation should be redundant. E.g., for two components c1 and c2, c1's base may be the same as c2's begin, so the compiler will be able to eliminate these reduction computation, especially when we inline all nested mapper functions together. If we move these computation into the runtime, the compiler will not be able to do such optimization.
  1. In terms of the number of push function calls, this scheme has the exact same number of calls as the current scheme, so I don't think this scheme can bring performance benefits. The scheme should perform worse than the current scheme, because it reduces the opportunities of compiler optimization as mentioned above.
ABataev added inline comments.Wed, Jun 26, 12:11 PM
lib/CodeGen/CGOpenMPRuntime.cpp
758

I recently committed the patch that fixes this problem in clang. If you're using int64_t in the runtime, the same type must be used in clang codegen.

8815

Hi Lingda, I'm trying to simplify the code generated by clang and avoid some unnecessary code duplications. If the complexity of this scheme is the same as proposed by you, I would prefer to use this scheme unless there are some other opinions.

  1. It is not a problem. This code is unique and is not duplicated in the different mappers.
  2. Inlining is no solution here. We still generate to much code, which is almost the same in many cases and it will lead to very ineffective codegen because we still end up with a lot of almost the same code. This also might lead to poor performance.
  3. Yes, the number of pushes is always the same, in all possible schemes. It would be good to compare somehow the performance of both schemes, at least preliminary.

Also, this solution reduces the number of required runtime functions, instead of 2 we need just 1 and, thus, we need to make fewer runtime functions calls.

I think it would better to propose this scheme as an alternate design and discuss it in the OpenMP telecon. What do you think? Or we can try to discuss it in the offline mode via the e-mail with other members.
I'm not trying to convince you to implement this scheme right now, but it would be good to discuss it. Maybe it will lead to some better ideas from others?

lildmh marked 2 inline comments as done.Wed, Jun 26, 12:48 PM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
758

That's nice to know. Then I will change the type of size to int64_t as well.

8815

Hi Alexey,

I still prefer the current scheme, because:

  1. I don't like recursive mapper calls, which goes back to my original scheme a little bit. I really think inlining can make a big difference when we have nested mappers. These compiler optimizations are the keys to have better performance for mappers.
  2. I don't think the codegen here is inefficient. Yes there is duplicated code across different mapper functions, but why that will lead to poor performance?
  3. Although we have 2 runtime functions now, the __tgt_mapper_num_components is called only once per mapper. It should have very negligible performance impact.

But if you have a different option, we can discuss it next time in the meeting. I do have a time constraint to work on the mapper implementation. I'll no longer work in this project starting this September, and I have about 30% of my time working on it until then.

ABataev added inline comments.Wed, Jun 26, 12:56 PM
lib/CodeGen/CGOpenMPRuntime.cpp
8815

Lingda,

  1. We have recursive (actually, not recursive, because you cannot use types recursively) mappers calls anyway, it is nature of struсtures/classes.
  2. We have a lot of similar code. And I'm not sure that it can be optimized out.
  3. Yes, but it means that we have n extra runtime calls, where n is the number of branches in the structure/class tree.

I see :(. I understand your concern. In this case, we could try to discuss it offline, in the mailing list, to make it a little bit faster. We just need to hear other opinions on this matter, maybe there are some other pros and cons for these schemes.

lildmh marked an inline comment as done.Wed, Jun 26, 1:11 PM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8815

Hi Alexey,

Sure, let's discuss this in the mailing list. I'll summarize it and send it to the mailing list later.

We have recursive (actually, not recursive, because you cannot use types recursively) mappers calls anyway, it is nature of struсtures/classes.

We won't have recursive calls with inlining.

We have a lot of similar code. And I'm not sure that it can be optimized out.

I think it's even harder to optimized these code out when we move them into the runtime.

Yes, but it means that we have n extra runtime calls, where n is the number of branches in the structure/class tree.

I don't quite understand. It's still equal to the number of mappers in any case.

ABataev added inline comments.Wed, Jun 26, 1:20 PM
lib/CodeGen/CGOpenMPRuntime.cpp
8815

Sure, let's discuss this in the mailing list. I'll summarize it and send it to the mailing list later.

Good, thanks!

We won't have recursive calls with inlining.

We won't have recursive calls anyway (recursive types are not allowed). Plus, I'm not sure that inlining is the best option here. We have a lot of code for each mapper and I'm not sure that the optimizer will be able to squash it effectively.

I think it's even harder to optimized these code out when we move them into the runtime.

Definitely not, unless we use LTO or inlined runtime.

lildmh marked an inline comment as done.Wed, Jun 26, 1:26 PM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8815

We won't have recursive calls anyway (recursive types are not allowed). Plus, I'm not sure that inlining is the best option here. We have a lot of code for each mapper and I'm not sure that the optimizer will be able to squash it effectively.

Sorry I should not say recursive calls. Here it needs to "recursively" call other mapper functions in case of nested mappers, but we don't need it in case of inlining.

Definitely not, unless we use LTO or inlined runtime.

But you are proposing to move many code to the runtime here, right? That doesn't make sense to me.

ABataev added inline comments.Wed, Jun 26, 1:34 PM
lib/CodeGen/CGOpenMPRuntime.cpp
8815

But you are proposing to move much code to the runtime here, right? That doesn't make sense to me.

I'm just not sure that there going be significant problems with the performance because of that. And it significantly simplifies codegen in the compiler and moves the common part into a single function.

Plus, if in future we'll need to modify this functionality for some reason, 2 different versions of the compiler will produce incompatible code. With my scheme, you still can use old runtime and have the same functionality as the old compiler and the new one.

lildmh marked an inline comment as done.Thu, Jun 27, 9:48 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8815

Hi Alexey,

I think more carefully about your scheme, and I don't think we can solve the 2 problems below with this scheme:

  1. In the example you gave before, the compiler needs to generate all map types and pass them to __tgt_mapper through sub_components. But in this case, the compiler won't be able to generate the correct MEMBER_OF field in map type. As a result, the runtime has to fix it using the mechanism we already have here: __tgt_mapper_num_components. This not only increases complexity, but also, it means the runtime needs further manipulation of the map type, which creates locality issues. While in the current scheme, the map type is generated by compiler once, so the data locality will be very good in this case.
  1. sub_components includes all components that should be mapped. If we are mapping an array, this means we need to map many components, which will need to allocate memory for sub_components in the heap. This creates further memory management burden and is not an efficient way to use memory.

Based on these reasons, I think the current scheme is still more preferable.

ABataev added inline comments.Thu, Jun 27, 9:54 AM
lib/CodeGen/CGOpenMPRuntime.cpp
8815

Hi Lingda,

  1. Actually, I thought that the runtime function __tgt_mapper will do this, not the compiler.
  2. Why do we need to allocate it on the heap? We can allocate it on the stack.
lildmh marked an inline comment as done.Thu, Jun 27, 10:04 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8815
  1. In your scheme, both compiler and __tgt_mapper need to do this: the compiler will generate other parts in the type, e.g., TO FROM bits and basic MEMBER_OF bits. Then __tgt_mapper needs to modify the MEMBER_OF bits later. Since there are a lot of other memory accesses between the compiler and __tgt_mapper operations to the same map type, it's very likely the map type will not stay in the cache, which causes locality problem.
  1. Assume we are mapping an array with 1000000 elements, and each elements have 5 components. For each component, we need base, begin_ptr, size, type, mapper, which are 40 bytes. Together, we will need 1000000 * 5 * 40 = 200MB of space for this array, which stack cannnot handle.
ABataev added inline comments.Thu, Jun 27, 10:09 AM
lib/CodeGen/CGOpenMPRuntime.cpp
8815
  1. I don't think it is a big problem, this part of the code is executed on the CPU and I don't think it will lead to significant overhead.
  2. When we map an array, we do not map it element-by-element, so we don't need 10000 records. Moreover, we try to merge contiguous parts into single one, reducing the total number of elements.
lildmh marked an inline comment as done.Thu, Jun 27, 10:20 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8815
  1. I think it is a problem. Beside, doing so is not elegant: why having a single thing (setting the map type) done in 2 places while we can do it in one place?
  1. We need to map them element by element because it is not always possible to merge contiguous parts together (there may be no contiguous parts based on the mapper). And merging parts together will be a complex procedure: I don't think it can be done in the runtime because many code is moved into the runtime now. In contrast, the compiler will have better opportunities to merge things.

Besides, I don't think there is a valid reason that the current scheme is not good. You mentioned it's complex codegen. But it only has less 200 loc here, I don't see why it is complex.

ABataev added inline comments.Thu, Jun 27, 10:26 AM
lib/CodeGen/CGOpenMPRuntime.cpp
8815
  1. I rather doubt that we will need to map a record with 100000 fields element-by-element.

I think it would be better to share it with others and listen to their opinions. It is better to spend some extra time to provide good design.
You can include your doubts in the description of the new scheme, of course.

lildmh marked an inline comment as done.Thu, Jun 27, 10:33 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8815
  1. The implementation should work for any case anyway. Besides, I think mapping a large array should be actually a common case. The users don't want to map 1000000 elements by themselves, so they will want to use mapper to let the system do it automatically.

Sure, I can release the discussion to the mailing list. I don't see a reason to use the new scheme now.

ABataev added inline comments.Thu, Jun 27, 10:58 AM
lib/CodeGen/CGOpenMPRuntime.cpp
8815

Lingda, I meant to send the message to the OpenMP telecon list :) Could you forward the same e-mail to Ravi and others, please?

lildmh marked an inline comment as done.Thu, Jun 27, 11:03 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8815

Sure, probably no one in the public mailing list will care about it. I'll send it to the bi-weekly meeting list.

lildmh updated this revision to Diff 206895.Thu, Jun 27, 11:28 AM

Change the type of size from size_t to int64_t, and rebase

Change the type of size from size_t to int64_t, and rebase

Lingda, could you rebase it again? Thanks.

Change the type of size from size_t to int64_t, and rebase

Lingda, could you rebase it again? Thanks.

Sure, I'll do it next week, since I'm on vacation and don't have access to my desktop.