In this patch, we pack all small first-private arguments, allocate and transfer them all at once to reduce the number of data transfer which is very expensive.
Let's take the test case as example.
int main() { int data1[3] = {1}, data2[3] = {2}, data3[3] = {3}; int sum[16] = {0}; #pragma omp target teams distribute parallel for map(tofrom: sum) firstprivate(data1, data2, data3) for (int i = 0; i < 16; ++i) { for (int j = 0; j < 3; ++j) { sum[i] += data1[j]; sum[i] += data2[j]; sum[i] += data3[j]; } } }
Here data1, data2, and data3 are three first-private arguments of the target region. In the previous libomptarget, it called data allocation and data transfer three times, each of which allocated and transferred 12 bytes. With this patch, it only calls allocation and transfer once. The size is (12+4)*3=48 where 12 is the size of each array and 4 is the padding to keep the address aligned with 8. It is implemented in this way:
- First collect all information for those *first*-private arguments. _private_ arguments are not the case because private arguments don't need to be mapped to target device. It just needs a data allocation. With the patch for memory manager, the data allocation could be very cheap, especially for the small size. For each qualified argument, push a place holder pointer nullptr to the vector for kernel arguments, and we will update them later.
- After we have all information, create a buffer that can accommodate all arguments plus their paddings. Copy the arguments to the buffer at the right place, i.e. aligned address.
- Allocate a target memory with the same size as the host buffer, transfer the host buffer to target device, and finally update all place holder pointers in the arguments vector.
The reason we only consider small arguments is, the data transfer is asynchronous. Therefore, for the large argument, we could continue to do things on the host side meanwhile, hopefully, the data is also being transferred. The "small" is defined by that the argument size is less than a predefined value. Currently it is 1024. I'm not sure whether it is a good one, and that is an open question. Another question is, do we need to make it configurable via an environment variable?
const?