Page MenuHomePhabricator

[OpenMP 5.0] Codegen support for user-defined mappers
ClosedPublic

Authored by lildmh on Mar 17 2019, 11:14 AM.

Details

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
lildmh marked 2 inline comments as done.Jun 26 2019, 10:50 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
725

The runtime part uses int64_t because I see every other runtime function use int64_t instead of size_t for size, e.g., __tgt_target, __tgt_target_teams, etc., despite that they are declared to be size_t in Clang codegen. So I guess it is done on purpose? Otherwise we need to modify all these runtime function interface in the future.

8801

Hi Alexey,

I don't think this scheme is more efficient than the current scheme. My reasons are:

  1. Most code here is essentially to generate components, i.e., we need to generate c.arg_base, c.arg_begin, c.arg_size, c.arg_type for each c in components, so there will still be a lot of code in <type>.mapper. It will not reduce the mapper function code, i.e., we will still have a bunch of unique mapper functions.
  1. This scheme will prevent a lot of compiler optimization from happening. In reality, a lot of computation should be redundant. E.g., for two components c1 and c2, c1's base may be the same as c2's begin, so the compiler will be able to eliminate these reduction computation, especially when we inline all nested mapper functions together. If we move these computation into the runtime, the compiler will not be able to do such optimization.
  1. In terms of the number of push function calls, this scheme has the exact same number of calls as the current scheme, so I don't think this scheme can bring performance benefits. The scheme should perform worse than the current scheme, because it reduces the opportunities of compiler optimization as mentioned above.
ABataev added inline comments.Jun 26 2019, 12:11 PM
lib/CodeGen/CGOpenMPRuntime.cpp
725

I recently committed the patch that fixes this problem in clang. If you're using int64_t in the runtime, the same type must be used in clang codegen.

8801

Hi Lingda, I'm trying to simplify the code generated by clang and avoid some unnecessary code duplications. If the complexity of this scheme is the same as proposed by you, I would prefer to use this scheme unless there are some other opinions.

  1. It is not a problem. This code is unique and is not duplicated in the different mappers.
  2. Inlining is no solution here. We still generate to much code, which is almost the same in many cases and it will lead to very ineffective codegen because we still end up with a lot of almost the same code. This also might lead to poor performance.
  3. Yes, the number of pushes is always the same, in all possible schemes. It would be good to compare somehow the performance of both schemes, at least preliminary.

Also, this solution reduces the number of required runtime functions, instead of 2 we need just 1 and, thus, we need to make fewer runtime functions calls.

I think it would better to propose this scheme as an alternate design and discuss it in the OpenMP telecon. What do you think? Or we can try to discuss it in the offline mode via the e-mail with other members.
I'm not trying to convince you to implement this scheme right now, but it would be good to discuss it. Maybe it will lead to some better ideas from others?

lildmh marked 2 inline comments as done.Jun 26 2019, 12:48 PM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
725

That's nice to know. Then I will change the type of size to int64_t as well.

8801

Hi Alexey,

I still prefer the current scheme, because:

  1. I don't like recursive mapper calls, which goes back to my original scheme a little bit. I really think inlining can make a big difference when we have nested mappers. These compiler optimizations are the keys to have better performance for mappers.
  2. I don't think the codegen here is inefficient. Yes there is duplicated code across different mapper functions, but why that will lead to poor performance?
  3. Although we have 2 runtime functions now, the __tgt_mapper_num_components is called only once per mapper. It should have very negligible performance impact.

But if you have a different option, we can discuss it next time in the meeting. I do have a time constraint to work on the mapper implementation. I'll no longer work in this project starting this September, and I have about 30% of my time working on it until then.

ABataev added inline comments.Jun 26 2019, 12:56 PM
lib/CodeGen/CGOpenMPRuntime.cpp
8801

Lingda,

  1. We have recursive (actually, not recursive, because you cannot use types recursively) mappers calls anyway, it is nature of struсtures/classes.
  2. We have a lot of similar code. And I'm not sure that it can be optimized out.
  3. Yes, but it means that we have n extra runtime calls, where n is the number of branches in the structure/class tree.

I see :(. I understand your concern. In this case, we could try to discuss it offline, in the mailing list, to make it a little bit faster. We just need to hear other opinions on this matter, maybe there are some other pros and cons for these schemes.

lildmh marked an inline comment as done.Jun 26 2019, 1:11 PM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8801

Hi Alexey,

Sure, let's discuss this in the mailing list. I'll summarize it and send it to the mailing list later.

We have recursive (actually, not recursive, because you cannot use types recursively) mappers calls anyway, it is nature of struсtures/classes.

We won't have recursive calls with inlining.

We have a lot of similar code. And I'm not sure that it can be optimized out.

I think it's even harder to optimized these code out when we move them into the runtime.

Yes, but it means that we have n extra runtime calls, where n is the number of branches in the structure/class tree.

I don't quite understand. It's still equal to the number of mappers in any case.

ABataev added inline comments.Jun 26 2019, 1:20 PM
lib/CodeGen/CGOpenMPRuntime.cpp
8801

Sure, let's discuss this in the mailing list. I'll summarize it and send it to the mailing list later.

Good, thanks!

We won't have recursive calls with inlining.

We won't have recursive calls anyway (recursive types are not allowed). Plus, I'm not sure that inlining is the best option here. We have a lot of code for each mapper and I'm not sure that the optimizer will be able to squash it effectively.

I think it's even harder to optimized these code out when we move them into the runtime.

Definitely not, unless we use LTO or inlined runtime.

lildmh marked an inline comment as done.Jun 26 2019, 1:26 PM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8801

We won't have recursive calls anyway (recursive types are not allowed). Plus, I'm not sure that inlining is the best option here. We have a lot of code for each mapper and I'm not sure that the optimizer will be able to squash it effectively.

Sorry I should not say recursive calls. Here it needs to "recursively" call other mapper functions in case of nested mappers, but we don't need it in case of inlining.

Definitely not, unless we use LTO or inlined runtime.

But you are proposing to move many code to the runtime here, right? That doesn't make sense to me.

ABataev added inline comments.Jun 26 2019, 1:34 PM
lib/CodeGen/CGOpenMPRuntime.cpp
8801

But you are proposing to move much code to the runtime here, right? That doesn't make sense to me.

I'm just not sure that there going be significant problems with the performance because of that. And it significantly simplifies codegen in the compiler and moves the common part into a single function.

Plus, if in future we'll need to modify this functionality for some reason, 2 different versions of the compiler will produce incompatible code. With my scheme, you still can use old runtime and have the same functionality as the old compiler and the new one.

lildmh marked an inline comment as done.Jun 27 2019, 9:48 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8801

Hi Alexey,

I think more carefully about your scheme, and I don't think we can solve the 2 problems below with this scheme:

  1. In the example you gave before, the compiler needs to generate all map types and pass them to __tgt_mapper through sub_components. But in this case, the compiler won't be able to generate the correct MEMBER_OF field in map type. As a result, the runtime has to fix it using the mechanism we already have here: __tgt_mapper_num_components. This not only increases complexity, but also, it means the runtime needs further manipulation of the map type, which creates locality issues. While in the current scheme, the map type is generated by compiler once, so the data locality will be very good in this case.
  1. sub_components includes all components that should be mapped. If we are mapping an array, this means we need to map many components, which will need to allocate memory for sub_components in the heap. This creates further memory management burden and is not an efficient way to use memory.

Based on these reasons, I think the current scheme is still more preferable.

ABataev added inline comments.Jun 27 2019, 9:54 AM
lib/CodeGen/CGOpenMPRuntime.cpp
8801

Hi Lingda,

  1. Actually, I thought that the runtime function __tgt_mapper will do this, not the compiler.
  2. Why do we need to allocate it on the heap? We can allocate it on the stack.
lildmh marked an inline comment as done.Jun 27 2019, 10:04 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8801
  1. In your scheme, both compiler and __tgt_mapper need to do this: the compiler will generate other parts in the type, e.g., TO FROM bits and basic MEMBER_OF bits. Then __tgt_mapper needs to modify the MEMBER_OF bits later. Since there are a lot of other memory accesses between the compiler and __tgt_mapper operations to the same map type, it's very likely the map type will not stay in the cache, which causes locality problem.
  1. Assume we are mapping an array with 1000000 elements, and each elements have 5 components. For each component, we need base, begin_ptr, size, type, mapper, which are 40 bytes. Together, we will need 1000000 * 5 * 40 = 200MB of space for this array, which stack cannnot handle.
ABataev added inline comments.Jun 27 2019, 10:09 AM
lib/CodeGen/CGOpenMPRuntime.cpp
8801
  1. I don't think it is a big problem, this part of the code is executed on the CPU and I don't think it will lead to significant overhead.
  2. When we map an array, we do not map it element-by-element, so we don't need 10000 records. Moreover, we try to merge contiguous parts into single one, reducing the total number of elements.
lildmh marked an inline comment as done.Jun 27 2019, 10:20 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8801
  1. I think it is a problem. Beside, doing so is not elegant: why having a single thing (setting the map type) done in 2 places while we can do it in one place?
  1. We need to map them element by element because it is not always possible to merge contiguous parts together (there may be no contiguous parts based on the mapper). And merging parts together will be a complex procedure: I don't think it can be done in the runtime because many code is moved into the runtime now. In contrast, the compiler will have better opportunities to merge things.

Besides, I don't think there is a valid reason that the current scheme is not good. You mentioned it's complex codegen. But it only has less 200 loc here, I don't see why it is complex.

ABataev added inline comments.Jun 27 2019, 10:26 AM
lib/CodeGen/CGOpenMPRuntime.cpp
8801
  1. I rather doubt that we will need to map a record with 100000 fields element-by-element.

I think it would be better to share it with others and listen to their opinions. It is better to spend some extra time to provide good design.
You can include your doubts in the description of the new scheme, of course.

lildmh marked an inline comment as done.Jun 27 2019, 10:33 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8801
  1. The implementation should work for any case anyway. Besides, I think mapping a large array should be actually a common case. The users don't want to map 1000000 elements by themselves, so they will want to use mapper to let the system do it automatically.

Sure, I can release the discussion to the mailing list. I don't see a reason to use the new scheme now.

ABataev added inline comments.Jun 27 2019, 10:58 AM
lib/CodeGen/CGOpenMPRuntime.cpp
8801

Lingda, I meant to send the message to the OpenMP telecon list :) Could you forward the same e-mail to Ravi and others, please?

lildmh marked an inline comment as done.Jun 27 2019, 11:03 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
8801

Sure, probably no one in the public mailing list will care about it. I'll send it to the bi-weekly meeting list.

lildmh updated this revision to Diff 206895.Jun 27 2019, 11:28 AM

Change the type of size from size_t to int64_t, and rebase

Change the type of size from size_t to int64_t, and rebase

Lingda, could you rebase it again? Thanks.

Change the type of size from size_t to int64_t, and rebase

Lingda, could you rebase it again? Thanks.

Sure, I'll do it next week, since I'm on vacation and don't have access to my desktop.

ABataev added inline comments.Jul 23 2019, 9:35 AM
lib/CodeGen/CGDecl.cpp
2491

Why do we need to emit it for simd only mode?

lib/CodeGen/CGOpenMPRuntime.cpp
1663

You're looking for CGF.CurFn twice here, used find member function instead and work with iterator.

7046–7054

Maybe it is better to define a constant constexpr uint64_t OMP_MEMBER_OF_RANK = 48 and then deduce OMP_MAP_MEMBER_OF as ~((1<<OMP_MEMBER_OF_RANK) - 1)?

8025

AFAIK, LLVM has dropped support for msvc 2013, do we still need this?

lib/CodeGen/CGOpenMPRuntime.h
352

Use using instead of typedef.

lildmh updated this revision to Diff 211333.Jul 23 2019, 10:43 AM
lildmh marked 4 inline comments as done.
lildmh added inline comments.
lib/CodeGen/CGDecl.cpp
2491

This code is not emitting mapper for simd only mode.

lib/CodeGen/CGOpenMPRuntime.cpp
7046–7054

In libomptarget, the same way is used to define OMP_TGT_MAPTYPE_MEMBER_OF: OMP_TGT_MAPTYPE_MEMBER_OF = 0xffff000000000000. So I think they should stay the same. Btw, the number 48 is directly used in libomptarget now, which may need to change in the future.

In your code, it assumes bits higher than 48 are all OMP_MAP_MEMBER_OF, which may not be true in the future. My code here is more universal, although it does not look great. What do you think?

ABataev added inline comments.Jul 23 2019, 11:11 AM
lib/CodeGen/CGDecl.cpp
2491

Ah, yes, it is the condition for the early exit.

lib/CodeGen/CGOpenMPRuntime.cpp
7046–7054

You can apply a mask to drop some of the most significant bits if required. My code looks much cleaner it would be good to have the same code in libomptarget too.
But it is up to you what to do here.

8025

Ping.

lib/CodeGen/CGOpenMPRuntime.h
797–802

I don't think we need virtual functions here, non-virtual are good enough.

2095

I think you can drop this function here if the original function is not virtual

lildmh marked 3 inline comments as done.Jul 23 2019, 11:37 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.cpp
7046–7054

Maybe let's keep it this way in this patch, then we modify both places in a further patch?

8025

I'm okay either way. I guess it doesn't hurt to keep it?

lib/CodeGen/CGOpenMPRuntime.h
2095

The function for simd only mode includes a llvm_unreachable, so I think it's still needed as a virtual function above

ABataev added inline comments.Jul 23 2019, 11:42 AM
lib/CodeGen/CGOpenMPRuntime.h
2095

It is better to reduce number of virtual functions, if possible.

lildmh updated this revision to Diff 211589.Jul 24 2019, 12:50 PM

Get rid of MSVC requirement of this, and a virtual function

ABataev added inline comments.Jul 24 2019, 12:57 PM
lib/CodeGen/CGOpenMPRuntime.h
2095

What about virtual function?

lildmh marked an inline comment as done.Jul 25 2019, 3:45 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.h
2095

Since emitUserDefinedMapper here overrides the same function in CGOpenMPRuntime, I think emitUserDefinedMapper of CGOpenMPRuntime needs to be defined as virtual?

lildmh marked an inline comment as done.Jul 25 2019, 3:47 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.h
2095

If you are asking about what virtual function I removed, it is emitUDMapperArrayInitOrDel of CGOpenMPRuntime which is never overrided.

ABataev added inline comments.Jul 25 2019, 6:59 AM
lib/CodeGen/CGOpenMPRuntime.h
2095

I would suggest to not make this virtual too and remove overriden version. It does not help.

lildmh marked an inline comment as done.Jul 25 2019, 8:48 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.h
2095

If we remove virtual from emitUserDefinedMapper, we will have to define it in both CGOpenMPRuntime and CGOpenMPSIMDRuntime, and its definition is identical in both places. I don't think that's good software engineering practice?

ABataev added inline comments.Jul 25 2019, 8:52 AM
lib/CodeGen/CGOpenMPRuntime.h
2095

WE just don't need an overridden version in CGOpenMPSIMDRuntime. Just remove virtual and remove the overridden function from CGOpenMPSIMDRuntime.
You have the check for OpenMPSimd mode that prevents the emission of mappers in simd-only mode.

lildmh updated this revision to Diff 211794.Jul 25 2019, 11:14 AM

Remove virtual from function declaration

ABataev added inline comments.Jul 26 2019, 10:33 AM
lib/CodeGen/CGOpenMPRuntime.h
802

Seems to me, this function is used only in emitUserDefinedMapper. I think you can make it static local in the CGOpenMPRuntime.cpp and do not expose it in the interface.

lildmh marked an inline comment as done.Jul 26 2019, 11:01 AM
lildmh added inline comments.
lib/CodeGen/CGOpenMPRuntime.h
802

emitUserDefinedMapper needs to call createRuntimeFunction of CGOpenMPRuntime, which is private.

Which one do you think is better, make createRuntimeFunction public, or have emitUserDefinedMapper not defined in CGOpenMPRuntime? It seems to me that they are similar

ABataev added inline comments.Jul 26 2019, 11:37 AM
lib/CodeGen/CGOpenMPRuntime.h
802

Then make emitUDMapperArrayInitOrDel private instead.

lildmh updated this revision to Diff 211991.Jul 26 2019, 12:52 PM

Make emitUDMapperArrayInitOrDel private

ABataev accepted this revision.Jul 26 2019, 1:45 PM

Looks good in general, but commit the runtime part at first.

test/OpenMP/declare_mapper_codegen.cpp
197–201

I would not rely on the predetermined indices here, better to use some kind of patterns here just like in other places.

482–486

Same here

This revision is now accepted and ready to land.Jul 26 2019, 1:45 PM
lildmh marked an inline comment as done.Jul 29 2019, 12:09 PM

Thanks Alexey! Could you look into the runtime patch D60972 then?

test/OpenMP/declare_mapper_codegen.cpp
197–201

Could you give an example about what you suggest? For instance, some other tests I should look into.

ABataev added inline comments.Jul 29 2019, 12:57 PM
test/OpenMP/declare_mapper_codegen.cpp
197–201

Just like in this test when you're using vars.

lildmh marked an inline comment as done.Jul 29 2019, 1:19 PM
lildmh added inline comments.
test/OpenMP/declare_mapper_codegen.cpp
197–201

Sorry I was not clear before. What do you mean by "predetermined indices" here? If you are referring to, for example, %0 in store i8* %0, i8** [[HANDLEADDR:%[^,]+]], I guess there is no way to get rid of %0 because it means the first argument of the function?

ABataev added inline comments.Jul 29 2019, 1:21 PM
test/OpenMP/declare_mapper_codegen.cpp
197–201

Yes, I meant those %0 like registers. Better to mark them as variables in function declaration and use those names in the checks.

lildmh marked an inline comment as done.Jul 29 2019, 1:44 PM
lildmh added inline comments.
test/OpenMP/declare_mapper_codegen.cpp
197–201

Now it's like define {{.*}}void @.omp_mapper.{{.*}}C.id{{.*}}(i8*, i8*, i8*, i64, i64), I think you are suggesting something like define {{.*}}void @.omp_mapper.{{.*}}C.id{{.*}}(i8* [[HANDLE:%[^,]+]], i8* [[BPTR:%[^,]+]], ...), and later I can use store i8* [[HANDLE]], i8** [[HANDLEADDR:%[^,]+]]

I'm not sure how to add names for function arguments. They seems to be always nameless like (i8*, i8*, i8*, i64, i64). Is there a way to do that?

ABataev added inline comments.Jul 29 2019, 2:00 PM
test/OpenMP/declare_mapper_codegen.cpp
197–201

If the clang parameters have names, the llvm params also will get the names. But it is not worth it to add the names to the function. Could just use regexp here to avoid using LLVM register names? Just %{{·+}}. And rely on the order, i.e. remove -DAG checks?

lildmh updated this revision to Diff 212399.Jul 30 2019, 11:44 AM
lildmh marked an inline comment as done.

Change mapper function argument checking

Thanks, looks good.

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptAug 3 2019, 9:18 PM
lildmh updated this revision to Diff 213339.Aug 5 2019, 5:57 AM

Fix declare mapper codegen test when the function argument has name attached.