This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Pack first-private arguments to improve efficiency of data transfer
ClosedPublic

Authored by tianshilei1992 on Aug 20 2020, 10:16 AM.

Details

Summary

In this patch, we pack all small first-private arguments, allocate and transfer them all at once to reduce the number of data transfer which is very expensive.

Let's take the test case as example.

int main() {
  int data1[3] = {1}, data2[3] = {2}, data3[3] = {3};
  int sum[16] = {0};
#pragma omp target teams distribute parallel for map(tofrom: sum) firstprivate(data1, data2, data3)
  for (int i = 0; i < 16; ++i) {
    for (int j = 0; j < 3; ++j) {
      sum[i] += data1[j];
      sum[i] += data2[j];
      sum[i] += data3[j];
    }
  }
}

Here data1, data2, and data3 are three first-private arguments of the target region. In the previous libomptarget, it called data allocation and data transfer three times, each of which allocated and transferred 12 bytes. With this patch, it only calls allocation and transfer once. The size is (12+4)*3=48 where 12 is the size of each array and 4 is the padding to keep the address aligned with 8. It is implemented in this way:

  1. First collect all information for those *first*-private arguments. _private_ arguments are not the case because private arguments don't need to be mapped to target device. It just needs a data allocation. With the patch for memory manager, the data allocation could be very cheap, especially for the small size. For each qualified argument, push a place holder pointer nullptr to the vector for kernel arguments, and we will update them later.
  2. After we have all information, create a buffer that can accommodate all arguments plus their paddings. Copy the arguments to the buffer at the right place, i.e. aligned address.
  3. Allocate a target memory with the same size as the host buffer, transfer the host buffer to target device, and finally update all place holder pointers in the arguments vector.

The reason we only consider small arguments is, the data transfer is asynchronous. Therefore, for the large argument, we could continue to do things on the host side meanwhile, hopefully, the data is also being transferred. The "small" is defined by that the argument size is less than a predefined value. Currently it is 1024. I'm not sure whether it is a good one, and that is an open question. Another question is, do we need to make it configurable via an environment variable?

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptAug 20 2020, 10:16 AM
tianshilei1992 requested review of this revision.Aug 20 2020, 10:16 AM

This patch contains some changes by clang-format. I'll revert those unrelated changes.

Removed unrelated changes caused by clang-format

Can you add some C pseudo code to the commit message to illustrate what is happening?

Why just "small" ones? why not all of them?

tianshilei1992 edited the summary of this revision. (Show Details)Aug 20 2020, 12:01 PM
tianshilei1992 edited the summary of this revision. (Show Details)

Why just "small" ones? why not all of them?

In addition to the last paragraph of the new commit message, we also have to copy the data on the host in the right place. That is not free as the size grows.

Why just "small" ones? why not all of them?

In addition to the last paragraph of the new commit message, we also have to copy the data on the host in the right place. That is not free as the size grows.

Copy cost on the host is a convincing argument but the asynchronous benefit doesn't seem to me true.

Only minor nits.

openmp/libomptarget/src/omptarget.cpp
841

When I read FP I think floating point. Maybe rename all FPs into FirstPriv or similar, characters are free, confusion is costly.

1019

Why? Still multiple.

1072

when you talk about "the threshold" in the comments above it is not clear what you mean. Add a (see ...) or additional words, e.g., firstprivate bundling threshold.

1103

Make HstPtr a char* in the struct. Then you don't need to unpack hstptr nor size here.

ye-luo added inline comments.Aug 20 2020, 12:47 PM
openmp/libomptarget/src/omptarget.cpp
1064

When I read here with a condition on the transfer.
I'm wondering is this just first-private? or private is also affected?

JonChesterfield added a comment.EditedAug 20 2020, 1:19 PM

This is a good direction. Packing copies together is likely to be faster.

Things like the size limit for profitability will be target specific, suggest adding a hook to the plugin to query such things. Better than yet another environment variable imo.

E.g. over pcie vs a shared memory system are likely to have different thresholds. Nvptx and amdgcn will be different too.

Update based on comments

tianshilei1992 marked 5 inline comments as done.Aug 20 2020, 1:33 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/omptarget.cpp
1064

Just first-private because we don't need data transfer for private arguments.

tianshilei1992 marked an inline comment as done.Aug 20 2020, 1:33 PM

Updated some comments

ye-luo added inline comments.Aug 20 2020, 1:51 PM
openmp/libomptarget/src/omptarget.cpp
1064

I'm wondering is this just first-private? or private is also affected?

I were referring to the whole patch instead of just the transfer here. I guess you patch affect both private and firstprivate.
if I have private(data1, data2, data3), will your change make a single allocation?

tianshilei1992 marked an inline comment as done.Aug 20 2020, 2:58 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/omptarget.cpp
1064

No. In that case, MapTo will be false.

tianshilei1992 marked an inline comment as done.Aug 20 2020, 2:58 PM
ye-luo added inline comments.Aug 20 2020, 3:18 PM
openmp/libomptarget/src/omptarget.cpp
1064
  1. Then rename it IsFirstPrivate. MapTo seems quite misleading.
  2. In the future, I think it can be nice optimization to allocate all the private in a single shot. Device malloc is still very expensive.

Rename MapTo to IsFirstPrivate

tianshilei1992 marked an inline comment as done.Aug 20 2020, 3:32 PM
tianshilei1992 added inline comments.
openmp/libomptarget/src/omptarget.cpp
1064
  1. Done.
  2. I think it depends. Just like I said in the code comments somewhere, we have memory manager now, then the allocation might not go to device directly, especially those small size of memory. If we take all private arguments into account, it will increase more host overhead (more host memory, and extra Boolean variable to tell whether we need to copy from original place to the buffer), and also transfer unnecessary data to target device. In fact, in my first implementation, I did have all private arguments, but later I changed it.
tianshilei1992 marked an inline comment as done.Aug 20 2020, 3:32 PM

Only minor things.

openmp/libomptarget/src/omptarget.cpp
830

const?

836

const?

1019

Why? Still multiple.

I think it is a vector of first private arrays. So the name better to be plural.

FirstPrivateArgs seems to be a better name.

Updated based on comments

tianshilei1992 marked 3 inline comments as done.Aug 20 2020, 4:07 PM

Renamed the vector for first private arguments

tianshilei1992 updated this revision to Diff 286929.EditedAug 20 2020, 6:03 PM

Use synchronous data transfer for the buffer, but I'm not sure whether this can offset the benefit of packing memory copy, even degrade the performance.

openmp/libomptarget/src/omptarget.cpp
1112

This seems target specific. Multiple async copies can easily be cheaper than one larger sync copy.

How is the lifetime of the host buffer usually managed for async copies? It seems that keeping it alive until the async copy is done would be reasonable, or we could move ownership of the host buffer into the target plugin which is then responsible for the cleanup.

tianshilei1992 added inline comments.Aug 20 2020, 8:01 PM
openmp/libomptarget/src/omptarget.cpp
1112

I’ll change this part back to async copy by taking the buffer out of the function such that we can make sure that it will be deallocated safely.

Swtiched to PrivateArgumentManagerTy to handle private arguments

tianshilei1992 marked an inline comment as done.Aug 23 2020, 5:43 PM

Down the road, we may need a way to allocate host pinned memory via the plugin for the host buffer to maximize transfer performance.

openmp/libomptarget/src/omptarget.cpp
872

I think only an index instead of TgtArgs is needed.

924

Document the use of TgtArgs please.

970

In my first thought, I feel better to mark this function private and called by the destructor only. Requiring free() to be called explicitly is error-prone. In a second thought, this probably needed to propagate the return error.

1065

Directly use ArgSizes[I] in the next line.

Updated based on comments

tianshilei1992 marked 4 inline comments as done.Aug 24 2020, 9:27 AM
tianshilei1992 added inline comments.
openmp/libomptarget/src/omptarget.cpp
872

It would be not very straightforward why we need an index here, and there is no difference in terms of performance between an index and a reference.

970

Yes, that's exactly the reason that we have a function here.

tianshilei1992 marked 2 inline comments as done.Aug 24 2020, 9:27 AM
ye-luo added inline comments.Aug 24 2020, 10:44 AM
openmp/libomptarget/src/omptarget.cpp
872

Without documentation, nothing is straightforward. IndexInTgtArgs is better than TgtArgs. Passing TgtArgs directly open doors to unintended uses.

Now pass the index of TgtArgs instead of a reference to TgtArgs

tianshilei1992 marked an inline comment as done.Aug 24 2020, 11:09 AM
ye-luo accepted this revision.Aug 24 2020, 2:26 PM

I prefer to PrivateArgumentManagerTy moved into its own files.
The rest looks good to me.

This revision is now accepted and ready to land.Aug 24 2020, 2:26 PM
ABataev added inline comments.
openmp/libomptarget/test/mapping/private_mapping.c
3

Why the test is marked as expected to fail? Is this intentional?

tianshilei1992 marked an inline comment as done.Aug 27 2020, 7:47 PM
tianshilei1992 added inline comments.
openmp/libomptarget/test/mapping/private_mapping.c
3

That is a mistake. Nice catch, thanks! I'll fix it in another patch. I copied this part from another file and didn't notice it is expected to fail. I did the test manually w/o lit so didn't observe this mistake. I'm wondering why the Buildkite didn't report any issue?

grokos added a subscriber: grokos.Sep 1 2020, 7:40 AM

Minor comments (about typos) to be taken into account in case of a future patch.

openmp/libomptarget/src/omptarget.cpp
831

Here you are referring to TgtArgs which is an argument of packAndTransfer(). Can you add a comment in parenthesis to make it clear? The first time I read the code I thought you were mistakenly using the name TgtArgs instead of TgtPtrs (which is a member of PrivateArgumentManagerTy class).

907

needs to mapped --> needs to be copied to target device
Technically speaking, first-private variables are not mapped, they are only copied to the device.

928

is empty --> is not empty

957

by mistaken --> by mistake