Page MenuHomePhabricator

[OPENMP][NVPTX] Add support for lastprivates/reductions handling in SPMD constructs with lightweight runtime.
ClosedPublic

Authored by ABataev on Sep 10 2018, 12:15 PM.

Details

Summary

We need the support for per-team shared variables to support codegen for
lastprivates/reductions. Patch adds this support by using shared memory
if the total size of the reductions/lastprivates is <= 128 bytes,
then pre-allocated buffer in global memory if size is <= 4K bytes,or
uses malloc/free, otherwise.

Diff Detail

Repository
rL LLVM

Event Timeline

ABataev created this revision.Sep 10 2018, 12:15 PM
Hahnfeld added a subscriber: Hahnfeld.

I really, really dislike adding even more global buffers. 4096 * 32 * 56 are another 7MiB that are not usable for applications. What's wrong with using the existing ones?

Can you upload the CodeGen patch for reductions somewhere? I thought we need a global scratchpad buffer that is adressable for all teams?

libomptarget/deviceRTLs/nvptx/src/option.h
37 ↗(On Diff #164715)

This doesn't exist unless you have information that are not public yet. Volta is 720 at most.

I really, really dislike adding even more global buffers. 4096 * 32 * 56 are another 7MiB that are not usable for applications. What's wrong with using the existing ones?

Can you upload the CodeGen patch for reductions somewhere? I thought we need a global scratchpad buffer that is adressable for all teams?

I really, really dislike an implementation in ibm-devel, the scratchpad solution will never be added to the trunk. The existing ones cannot be reused, as they are allocated only if the full runtime is used.

libomptarget/deviceRTLs/nvptx/src/option.h
37 ↗(On Diff #164715)

I really, really dislike adding even more global buffers. 4096 * 32 * 56 are another 7MiB that are not usable for applications. What's wrong with using the existing ones?

Can you upload the CodeGen patch for reductions somewhere? I thought we need a global scratchpad buffer that is adressable for all teams?

I really, really dislike an implementation in ibm-devel, the scratchpad solution will never be added to the trunk. The existing ones cannot be reused, as they are allocated only if the full runtime is used.

What's the overhead of initializing it? The whole libomptarget-nvptx is already a pretty much mess, see my thread on openmp-dev.

libomptarget/deviceRTLs/nvptx/src/option.h
37 ↗(On Diff #164715)

I'm not commenting on MAX_SM, rather on the value of __CUDA_ARCH__. As such these defines are never active.

I really, really dislike adding even more global buffers. 4096 * 32 * 56 are another 7MiB that are not usable for applications. What's wrong with using the existing ones?

Can you upload the CodeGen patch for reductions somewhere? I thought we need a global scratchpad buffer that is adressable for all teams?

I really, really dislike an implementation in ibm-devel, the scratchpad solution will never be added to the trunk. The existing ones cannot be reused, as they are allocated only if the full runtime is used.

What's the overhead of initializing it? The whole libomptarget-nvptx is already a pretty much mess, see my thread on openmp-dev.

It is not the runtime issue, it is the problem with the compiler itself. It breaks compatibility with the other outlined regions and, thus, it cannot be committed to trunk. I'd like to have this at least as a temporirily solution to support lastprivates/reductions in SPMD mode with lightweight runtime. We can reduce the size of the preallocated buffers, if you wish.

I really, really dislike adding even more global buffers. 4096 * 32 * 56 are another 7MiB that are not usable for applications. What's wrong with using the existing ones?

Can you upload the CodeGen patch for reductions somewhere? I thought we need a global scratchpad buffer that is adressable for all teams?

I really, really dislike an implementation in ibm-devel, the scratchpad solution will never be added to the trunk. The existing ones cannot be reused, as they are allocated only if the full runtime is used.

What's the overhead of initializing it? The whole libomptarget-nvptx is already a pretty much mess, see my thread on openmp-dev.

It is not the runtime issue, it is the problem with the compiler itself. It breaks compatibility with the other outlined regions and, thus, it cannot be committed to trunk.

Can you please describe the problems? Again, maybe posting the patch may help.

I'd like to have this at least as a temporirily solution to support lastprivates/reductions in SPMD mode with lightweight runtime. We can reduce the size of the preallocated buffers, if you wish.

Is that a commitment to actively work on that area?

I really, really dislike adding even more global buffers. 4096 * 32 * 56 are another 7MiB that are not usable for applications. What's wrong with using the existing ones?

Can you upload the CodeGen patch for reductions somewhere? I thought we need a global scratchpad buffer that is adressable for all teams?

I really, really dislike an implementation in ibm-devel, the scratchpad solution will never be added to the trunk. The existing ones cannot be reused, as they are allocated only if the full runtime is used.

What's the overhead of initializing it? The whole libomptarget-nvptx is already a pretty much mess, see my thread on openmp-dev.

It is not the runtime issue, it is the problem with the compiler itself. It breaks compatibility with the other outlined regions and, thus, it cannot be committed to trunk.

Can you please describe the problems? Again, maybe posting the patch may help.

I already described it - it breaks the compatibility with other outlined regions and breaks the whole design of the OpenMP implementation.

I'd like to have this at least as a temporirily solution to support lastprivates/reductions in SPMD mode with lightweight runtime. We can reduce the size of the preallocated buffers, if you wish.

Is that a commitment to actively work on that area?

Yes, Alex Eichenberger tries to invent something, that will allow us to use something similar to ibm-devel but without breaking the design of OpenMP in the compiler. But it requires some time. But I'd like to have something working, at least.

ABataev updated this revision to Diff 164742.Sep 10 2018, 1:32 PM

Fixed SM_MAX definition guard

I really, really dislike adding even more global buffers. 4096 * 32 * 56 are another 7MiB that are not usable for applications. What's wrong with using the existing ones?

Can you upload the CodeGen patch for reductions somewhere? I thought we need a global scratchpad buffer that is adressable for all teams?

I really, really dislike an implementation in ibm-devel, the scratchpad solution will never be added to the trunk. The existing ones cannot be reused, as they are allocated only if the full runtime is used.

What's the overhead of initializing it? The whole libomptarget-nvptx is already a pretty much mess, see my thread on openmp-dev.

It is not the runtime issue, it is the problem with the compiler itself. It breaks compatibility with the other outlined regions and, thus, it cannot be committed to trunk.

Can you please describe the problems? Again, maybe posting the patch may help.

I already described it - it breaks the compatibility with other outlined regions and breaks the whole design of the OpenMP implementation.

First that's a general statement without any explanation. Second I'm not asking about the scratchpad pointer solution in ibm-devel but rather why we can't pass RequiresDataSharing = true to __kmpc_spmd_kernel_init. Which will give us the data sharing in existing buffers.

I'd like to have this at least as a temporirily solution to support lastprivates/reductions in SPMD mode with lightweight runtime. We can reduce the size of the preallocated buffers, if you wish.

Is that a commitment to actively work on that area?

Yes, Alex Eichenberger tries to invent something, that will allow us to use something similar to ibm-devel but without breaking the design of OpenMP in the compiler. But it requires some time. But I'd like to have something working, at least.

I'm referring to the process of cleaning up libomptarget-nvptx.

Hahnfeld added inline comments.Sep 10 2018, 1:39 PM
libomptarget/deviceRTLs/nvptx/src/option.h
37 ↗(On Diff #164715)

That's now 1 GiB of global memory that can't be used by the user application.

I really, really dislike adding even more global buffers. 4096 * 32 * 56 are another 7MiB that are not usable for applications. What's wrong with using the existing ones?

Can you upload the CodeGen patch for reductions somewhere? I thought we need a global scratchpad buffer that is adressable for all teams?

I really, really dislike an implementation in ibm-devel, the scratchpad solution will never be added to the trunk. The existing ones cannot be reused, as they are allocated only if the full runtime is used.

What's the overhead of initializing it? The whole libomptarget-nvptx is already a pretty much mess, see my thread on openmp-dev.

It is not the runtime issue, it is the problem with the compiler itself. It breaks compatibility with the other outlined regions and, thus, it cannot be committed to trunk.

Can you please describe the problems? Again, maybe posting the patch may help.

I already described it - it breaks the compatibility with other outlined regions and breaks the whole design of the OpenMP implementation.

First that's a general statement without any explanation. Second I'm not asking about the scratchpad pointer solution in ibm-devel but rather why we can't pass RequiresDataSharing = true to __kmpc_spmd_kernel_init. Which will give us the data sharing in existing buffers.

First, stop talking like this. I don't owe you anything.
Second, RequiresDataSharing is not required, because I tend to use the preallocated buffer instead of dynamically allocated.

I'd like to have this at least as a temporirily solution to support lastprivates/reductions in SPMD mode with lightweight runtime. We can reduce the size of the preallocated buffers, if you wish.

Is that a commitment to actively work on that area?

Yes, Alex Eichenberger tries to invent something, that will allow us to use something similar to ibm-devel but without breaking the design of OpenMP in the compiler. But it requires some time. But I'd like to have something working, at least.

I'm referring to the process of cleaning up libomptarget-nvptx.

No, if you're interested in this, you can do it.

No, it is about 10-12 Mb.

libomptarget/deviceRTLs/nvptx/src/option.h
37 ↗(On Diff #164715)

Just like I said, I can reduce the size of the preallocated buffers.

I already described it - it breaks the compatibility with other outlined regions and breaks the whole design of the OpenMP implementation.

First that's a general statement without any explanation. Second I'm not asking about the scratchpad pointer solution in ibm-devel but rather why we can't pass RequiresDataSharing = true to __kmpc_spmd_kernel_init. Which will give us the data sharing in existing buffers.

First, stop talking like this. I don't owe you anything.

Sorry, my last comment sounds rude even though I didn't mean it.
My point is that it's impossible to review patches without a big picture: what are the other parts, which alternatives did you evaluate, why don't they work?
And to be honest: Disregarding technical review and simply ignoring my comments doesn't feel nice either.

Second, RequiresDataSharing is not required, because I tend to use the preallocated buffer instead of dynamically allocated.

The data sharing infrastructure also has preallocated buffers.

I'd like to have this at least as a temporirily solution to support lastprivates/reductions in SPMD mode with lightweight runtime. We can reduce the size of the preallocated buffers, if you wish.

Is that a commitment to actively work on that area?

Yes, Alex Eichenberger tries to invent something, that will allow us to use something similar to ibm-devel but without breaking the design of OpenMP in the compiler. But it requires some time. But I'd like to have something working, at least.

I'm referring to the process of cleaning up libomptarget-nvptx.

No, if you're interested in this, you can do it.

That's what I feared. Yes, I think this is needed, but to be honest, I'm already facing enough resistence with moderately conservative proposals.

No, it is about 10-12 Mb.

The additional buffer, maybe. But raising MAX_SM from 56 to 84 scales the array of queues proportionally.

I already described it - it breaks the compatibility with other outlined regions and breaks the whole design of the OpenMP implementation.

First that's a general statement without any explanation. Second I'm not asking about the scratchpad pointer solution in ibm-devel but rather why we can't pass RequiresDataSharing = true to __kmpc_spmd_kernel_init. Which will give us the data sharing in existing buffers.

First, stop talking like this. I don't owe you anything.

Sorry, my last comment sounds rude even though I didn't mean it.
My point is that it's impossible to review patches without a big picture: what are the other parts, which alternatives did you evaluate, why don't they work?
And to be honest: Disregarding technical review and simply ignoring my comments doesn't feel nice either.

It is going to use the same globalization support we use for the generic data-sharing scheme. But in SPMD mode we need to share only lastprivates/reduction variables in the teams, so we can use simplified data allocation algorithm as we don't need to use it in other constructs.

Second, RequiresDataSharing is not required, because I tend to use the preallocated buffer instead of dynamically allocated.

The data sharing infrastructure also has preallocated buffers.

I'd like to have this at least as a temporirily solution to support lastprivates/reductions in SPMD mode with lightweight runtime. We can reduce the size of the preallocated buffers, if you wish.

Is that a commitment to actively work on that area?

Yes, Alex Eichenberger tries to invent something, that will allow us to use something similar to ibm-devel but without breaking the design of OpenMP in the compiler. But it requires some time. But I'd like to have something working, at least.

I'm referring to the process of cleaning up libomptarget-nvptx.

No, if you're interested in this, you can do it.

That's what I feared. Yes, I think this is needed, but to be honest, I'm already facing enough resistence with moderately conservative proposals.

No, it is about 10-12 Mb.

The additional buffer, maybe. But raising MAX_SM from 56 to 84 scales the array of queues proportionally.

I will try to reuse the existing buffers in the global memory.

ABataev updated this revision to Diff 164879.Sep 11 2018, 7:44 AM

Reused preallocated memory for the full runtime as the global memory buffer for the lightweight runtime.

grokos added inline comments.Sep 11 2018, 7:53 AM
libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.cu
44 ↗(On Diff #164879)

Expected <space> number....

ABataev updated this revision to Diff 164882.Sep 11 2018, 7:54 AM

Fixed message.

This revision is now accepted and ready to land.Sep 21 2018, 6:34 AM
This revision was automatically updated to reflect the committed changes.

@Hahnfeld: Are the latest changes in line with your requirements/plans to reduce the memory footprint of the nvptx runtime?

I already described it - it breaks the compatibility with other outlined regions and breaks the whole design of the OpenMP implementation.

[...]

Yes, Alex Eichenberger tries to invent something, that will allow us to use something similar to ibm-devel but without breaking the design of OpenMP in the compiler. But it requires some time. But I'd like to have something working, at least.

Just to make sure I came to the right conclusions after trying to understand the code generated since rC342738 and for documentation purposes if the following explanation is correct: The compiler generated code asks the runtime for two loop schedules, one for distribute and the other to implement for. The latter iterates in the chunk returned from the distribute schedule.

For lastprivates on teams distribute parallel for this means that the global value needs to be updated in the last iteration of the last distribute chunk. However, the outlined parallel region only knows whether the current thread is executing the last iteration of the for worksharing construct. This means the lastprivate value of the parallel for is passed back to the distribute loop which decides if it has just executed the last chunk and needs to write to the global value.
In SPMD constructs all CUDA threads are executing the distribute loop, but only the thread executing the last iteration of the for loop has seen the lastprivate value. However the information of which thread this is has been lost at the end of the parallel region. So data sharing is used to communicate the lastprivate value to all threads in the team that is executing the last distribute chunk.

Assume a simple case like this:

int last;
#pragma omp target teams distribute parallel for map(from: last) lastprivate(last)
for (int i = 0; i < 10000; i++) {
  last = i;
}

Clang conceptually generates the following:

void outlined_target_fn(int *last) {
  int *last_ds = /* get data sharing frame from runtime */
  for (/* distribute loop from 0 to 9999 */) {
    outlined_parallel_fn(lb, ub, last_ds);
  }
  if (/* received last chunk */) {
    *last = *last_ds;
  }
}

void outlined_parallel_fn(int lb, int ub, int *last) {
  int last_privatized;
  for (/* for loop from lb to ub */) {
    last_privatized = i;
  }
  if (/* executed last iteration of for loop */) {
    *last = last_privatized;
  }
}

I tried to solve this problem without support from the runtime and this appears to work:

void outlined_target_fn(int *last) {
  int last_dummy;
  for (/* distribute loop from 0 to 9999 */) {
    int *last_p = &last_dummy;
    if (/* is last chunk */) {
      last_p = last;
    }
    outlined_parallel_fn(lb, ub, last_p);
  }
}

void outlined_parallel_fn(int lb, int ub, int *last) {
  int last_privatized;
  for (/* for loop from lb to ub */) {
    last_privatized = i;
  }
  if (/* executed last iteration of for loop */) {
    *last = last_privatized;
  }
}

(Alternatively it should also be possible to set last_p before entering the distribute loop. This will write to last multiple times but the final value should stay in memory after the kernel.)

As you can see the outlined parallel function is unchanged (which is probably what you mean with "breaks the compatibility", @ABataev?). This should work because all invocations of outlined_parallel_fn write their value of last into a dummy location, except the one executing the last distribute chunk.
What do you think?

@Hahnfeld: Are the latest changes in line with your requirements/plans to reduce the memory footprint of the nvptx runtime?

I still think it's a waste of resources to statically allocate around 1 GB on sm_70 / 660 MB on sm_60. And I think it's worrying that we are adding more and more data structures because it seems convenient to quickly solve a problem. The truth seems to be that it's incredibly hard to get rid of them later on...

I already described it - it breaks the compatibility with other outlined regions and breaks the whole design of the OpenMP implementation.

[...]

Yes, Alex Eichenberger tries to invent something, that will allow us to use something similar to ibm-devel but without breaking the design of OpenMP in the compiler. But it requires some time. But I'd like to have something working, at least.

Just to make sure I came to the right conclusions after trying to understand the code generated since rC342738 and for documentation purposes if the following explanation is correct: The compiler generated code asks the runtime for two loop schedules, one for distribute and the other to implement for. The latter iterates in the chunk returned from the distribute schedule.

For lastprivates on teams distribute parallel for this means that the global value needs to be updated in the last iteration of the last distribute chunk. However, the outlined parallel region only knows whether the current thread is executing the last iteration of the for worksharing construct. This means the lastprivate value of the parallel for is passed back to the distribute loop which decides if it has just executed the last chunk and needs to write to the global value.
In SPMD constructs all CUDA threads are executing the distribute loop, but only the thread executing the last iteration of the for loop has seen the lastprivate value. However the information of which thread this is has been lost at the end of the parallel region. So data sharing is used to communicate the lastprivate value to all threads in the team that is executing the last distribute chunk.

Assume a simple case like this:

int last;
#pragma omp target teams distribute parallel for map(from: last) lastprivate(last)
for (int i = 0; i < 10000; i++) {
  last = i;
}

Clang conceptually generates the following:

void outlined_target_fn(int *last) {
  int *last_ds = /* get data sharing frame from runtime */
  for (/* distribute loop from 0 to 9999 */) {
    outlined_parallel_fn(lb, ub, last_ds);
  }
  if (/* received last chunk */) {
    *last = *last_ds;
  }
}

void outlined_parallel_fn(int lb, int ub, int *last) {
  int last_privatized;
  for (/* for loop from lb to ub */) {
    last_privatized = i;
  }
  if (/* executed last iteration of for loop */) {
    *last = last_privatized;
  }
}

I tried to solve this problem without support from the runtime and this appears to work:

void outlined_target_fn(int *last) {
  int last_dummy;
  for (/* distribute loop from 0 to 9999 */) {
    int *last_p = &last_dummy;
    if (/* is last chunk */) {
      last_p = last;
    }
    outlined_parallel_fn(lb, ub, last_p);
  }
}

void outlined_parallel_fn(int lb, int ub, int *last) {
  int last_privatized;
  for (/* for loop from lb to ub */) {
    last_privatized = i;
  }
  if (/* executed last iteration of for loop */) {
    *last = last_privatized;
  }
}

(Alternatively it should also be possible to set last_p before entering the distribute loop. This will write to last multiple times but the final value should stay in memory after the kernel.)

As you can see the outlined parallel function is unchanged (which is probably what you mean with "breaks the compatibility", @ABataev?). This should work because all invocations of outlined_parallel_fn write their value of last into a dummy location, except the one executing the last distribute chunk.
What do you think?

@Hahnfeld: Are the latest changes in line with your requirements/plans to reduce the memory footprint of the nvptx runtime?

I still think it's a waste of resources to statically allocate around 1 GB on sm_70 / 660 MB on sm_60. And I think it's worrying that we are adding more and more data structures because it seems convenient to quickly solve a problem. The truth seems to be that it's incredibly hard to get rid of them later on...

No, you're not correct here.

void outlined_target_fn(int *last) {
  int *last_ds = /* get data sharing frame from runtime */
  for (/* distribute loop from 0 to 9999 */) {
    outlined_parallel_fn(lb, ub, last_ds);
  }
  if (/* received last chunk */) {
    *last = *last_ds;
  }
}

This code is for the distribute loop. And here you have conflict without the datasharing scheme. The problem here is that this check /* received last chunk */ is true for all inner loop iterations for inner for directive and *last_ds may come not from the last iteration of for loop, but from some other iterations. To solve this problem, we need to share the same last_ds between all the threads in the team.

No, you're not correct here.

void outlined_target_fn(int *last) {
  int *last_ds = /* get data sharing frame from runtime */
  for (/* distribute loop from 0 to 9999 */) {
    outlined_parallel_fn(lb, ub, last_ds);
  }
  if (/* received last chunk */) {
    *last = *last_ds;
  }
}

This code is for the distribute loop. And here you have conflict without the datasharing scheme. The problem here is that this check /* received last chunk */ is true for all inner loop iterations for inner for directive and *last_ds may come not from the last iteration of for loop, but from some other iterations. To solve this problem, we need to share the same last_ds between all the threads in the team.

Yes, that's the current solution in Clang and actually what I described above:

In SPMD constructs all CUDA threads are executing the distribute loop, but only the thread executing the last iteration of the for loop has seen the lastprivate value. However the information of which thread this is has been lost at the end of the parallel region. So data sharing is used to communicate the lastprivate value to all threads in the team that is executing the last distribute chunk.

I'm assuming that the pointer returned by /* get data sharing frame from runtime */ is shared between all threads in a team.

Yes, that's the current solution in Clang and actually what I described above:

In SPMD constructs all CUDA threads are executing the distribute loop, but only the thread executing the last iteration of the for loop has seen the lastprivate value. However the information of which thread this is has been lost at the end of the parallel region. So data sharing is used to communicate the lastprivate value to all threads in the team that is executing the last distribute chunk.

I'm assuming that the pointer returned by /* get data sharing frame from runtime */ is shared between all threads in a team.

  1. It is not how clang works, it is how standard requires.
  2. Yes, it is shared between all the threads in the team and this is how it is intended to be according to the standard

Yes, that's the current solution in Clang and actually what I described above:

In SPMD constructs all CUDA threads are executing the distribute loop, but only the thread executing the last iteration of the for loop has seen the lastprivate value. However the information of which thread this is has been lost at the end of the parallel region. So data sharing is used to communicate the lastprivate value to all threads in the team that is executing the last distribute chunk.

I'm assuming that the pointer returned by /* get data sharing frame from runtime */ is shared between all threads in a team.

  1. It is not how clang works, it is how standard requires.
  2. Yes, it is shared between all the threads in the team and this is how it is intended to be according to the standard

The main problem with your solution is that distribute loop does not have information which thread actually executed the last
chunk of the loop. All the threads in the last team must execute the same check and only one shall write its private value to the original variable. But, just like I said, runtime does not provide this information to the compiler

  1. It is not how clang works, it is how standard requires.

I've tried to describe how the current implementation works, based on the IR that is generated.

Clang conceptually generates the following:

void outlined_target_fn(int *last) {
  int *last_ds = /* get data sharing frame from runtime */
  for (/* distribute loop from 0 to 9999 */) {
    outlined_parallel_fn(lb, ub, last_ds);
  }
  if (/* received last chunk */) {
    *last = *last_ds;
  }
}

void outlined_parallel_fn(int lb, int ub, int *last) {
  int last_privatized;
  for (/* for loop from lb to ub */) {
    last_privatized = i;
  }
  if (/* executed last iteration of for loop */) {
    *last = last_privatized;
  }
}

Please let me know if this pseudo code conceptually doesn't match the current IR.

  1. Yes, it is shared between all the threads in the team and this is how it is intended to be according to the standard

The main problem with your solution is that distribute loop does not have information which thread actually executed the last
chunk of the loop. All the threads in the last team must execute the same check and only one shall write its private value to the original variable. But, just like I said, runtime does not provide this information to the compiler

Now you are talking about the second pseudo-code:

I tried to solve this problem without support from the runtime and this appears to work:

void outlined_target_fn(int *last) {
  int last_dummy;
  for (/* distribute loop from 0 to 9999 */) {
    int *last_p = &last_dummy;
    if (/* is last chunk */) {
      last_p = last;
    }
    outlined_parallel_fn(lb, ub, last_p);
  }
}

void outlined_parallel_fn(int lb, int ub, int *last) {
  int last_privatized;
  for (/* for loop from lb to ub */) {
    last_privatized = i;
  }
  if (/* executed last iteration of for loop */) {
    *last = last_privatized;
  }
}

I don't see why the distribute loop cares which thread actually executes the last iteration of the for loop, that's only relevant in the outlined parallel region.

I don't see why the distribute loop cares which thread actually executes the last iteration of the for loop, that's only relevant in the outlined parallel region.

Because it marks as lastprivate not the last loop chunk executed by the last thread, but the set of loop chunks executed by the last team. It means that when you try to write the lastprivate value after the distribute loop you will have multiple writes from the different threads with the different values of lastprivates.

I don't see why the distribute loop cares which thread actually executes the last iteration of the for loop, that's only relevant in the outlined parallel region.

Because it marks as lastprivate not the last loop chunk executed by the last thread, but the set of loop chunks executed by the last team. It means that when you try to write the lastprivate value after the distribute loop you will have multiple writes from the different threads with the different values of lastprivates.

Say, last distribute chunk is [L, U]. In the inner for directive it is split into [L,U1], [U1+1, U2], ..., [Un-1 + 1, U]. Distribute marks all these chunks as last, not the last [Un-1 + 1, U].

I don't see why the distribute loop cares which thread actually executes the last iteration of the for loop, that's only relevant in the outlined parallel region.

Because it marks as lastprivate not the last loop chunk executed by the last thread, but the set of loop chunks executed by the last team. It means that when you try to write the lastprivate value after the distribute loop you will have multiple writes from the different threads with the different values of lastprivates.

Say, last distribute chunk is [L, U]. In the inner for directive it is split into [L,U1], [U1+1, U2], ..., [Un-1 + 1, U]. Distribute marks all these chunks as last, not the last [Un-1 + 1, U].

I got that. This is why the outer distribute only passes the global address for its last chunk. Then the inner for decides which thread executes [Un-1 + 1, U] and writes the lastprivate value.

Plus, I need to add that I tried the solution you proposed here maybe a month or two ago. If it would work, I would definitely use this one rather than the one implemented now. Because it is much easier to implement and works much faster. But it just does not work!

I don't see why the distribute loop cares which thread actually executes the last iteration of the for loop, that's only relevant in the outlined parallel region.

Because it marks as lastprivate not the last loop chunk executed by the last thread, but the set of loop chunks executed by the last team. It means that when you try to write the lastprivate value after the distribute loop you will have multiple writes from the different threads with the different values of lastprivates.

Say, last distribute chunk is [L, U]. In the inner for directive it is split into [L,U1], [U1+1, U2], ..., [Un-1 + 1, U]. Distribute marks all these chunks as last, not the last [Un-1 + 1, U].

I got that. This is why the outer distribute only passes the global address for its last chunk. Then the inner for decides which thread executes [Un-1 + 1, U] and writes the lastprivate value.

Yes, that's right! You got it.

Say, last distribute chunk is [L, U]. In the inner for directive it is split into [L,U1], [U1+1, U2], ..., [Un-1 + 1, U]. Distribute marks all these chunks as last, not the last [Un-1 + 1, U].

I got that. This is why the outer distribute only passes the global address for its last chunk. Then the inner for decides which thread executes [Un-1 + 1, U] and writes the lastprivate value.

Yes, that's right! You got it.

So now you are agreeing to "my" solution which is different than what Clang currently does - I'm confused.

Say, last distribute chunk is [L, U]. In the inner for directive it is split into [L,U1], [U1+1, U2], ..., [Un-1 + 1, U]. Distribute marks all these chunks as last, not the last [Un-1 + 1, U].

I got that. This is why the outer distribute only passes the global address for its last chunk. Then the inner for decides which thread executes [Un-1 + 1, U] and writes the lastprivate value.

Yes, that's right! You got it.

So now you are agreeing to "my" solution which is different than what Clang currently does - I'm confused.

No, I do not agree with your solution, I thought you agreed with the implemented one. You said that you understood that actually inner for loop decides which chunk is actually the last one. And because of that, we need to share the distribute private copy of the lastprivate variable, so all the threads in the inner parallel region could modify it.