This is an archive of the discontinued LLVM Phabricator instance.

Paths

openmp/trunk/libomptarget/deviceRTLs/nvptx/src/
- data_sharing.cu
- omp_data.cu
- omptarget-nvptx.h
- omptarget-nvptx.cu
- omptarget-nvptxi.h
- option.h

Differential D51875

[OPENMP][NVPTX] Add support for lastprivates/reductions handling in SPMD constructs with lightweight runtime.
ClosedPublic

Authored by ABataev on Sep 10 2018, 12:15 PM.

Details

Reviewers

gtbercea
kkwli0
grokos
Hahnfeld

Commits

rG022bf16b417f: [OPENMP][NVPTX] Add support for lastprivates/reductions handling in SPMD…
rOMP342737: [OPENMP][NVPTX] Add support for lastprivates/reductions handling in SPMD…
rL342737: [OPENMP][NVPTX] Add support for lastprivates/reductions handling in SPMD…

Summary

We need support for per-team shared variables to enable codegen for
lastprivates/reductions. The patch adds this support by using shared memory
if the total size of the reductions/lastprivates is <= 128 bytes, a
pre-allocated buffer in global memory if the size is <= 4K bytes, and
malloc/free otherwise.
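
The size thresholds in the summary can be read as a three-way choice of backing storage. The following is a minimal host-side sketch of that decision, not the code from the patch: the names `Storage` and `pickStorage` are hypothetical, and only the 128-byte and 4K-byte limits are taken from the summary.

```cpp
#include <cstddef>

// Hypothetical names for illustration; only the two limits come from the summary.
enum class Storage { SharedMemory, PreallocatedGlobal, Malloc };

const std::size_t kSharedLimit = 128;   // bytes, per the summary
const std::size_t kGlobalLimit = 4096;  // "4K bytes", per the summary

// Decide where the per-team copy of the lastprivate/reduction variables lives.
inline Storage pickStorage(std::size_t totalBytes) {
  if (totalBytes <= kSharedLimit) return Storage::SharedMemory;
  if (totalBytes <= kGlobalLimit) return Storage::PreallocatedGlobal;
  return Storage::Malloc;  // this tier would need a matching free afterwards
}
```

Under these assumptions a single `int` lastprivate (4 bytes) lands in shared memory, while a 1 KiB set of reduction variables would use the pre-allocated global buffer.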

Diff Detail

Repository: rL LLVM

Event Timeline

ABataev created this revision. Sep 10 2018, 12:15 PM

Herald added a subscriber: guansong. Sep 10 2018, 12:15 PM

Harbormaster completed remote builds in B22434: Diff 164715. Sep 10 2018, 12:15 PM

I really, really dislike adding even more global buffers. 4096 * 32 * 56 are another 7MiB that are not usable for applications. What's wrong with using the existing ones?

Can you upload the CodeGen patch for reductions somewhere? I thought we need a global scratchpad buffer that is addressable for all teams?

libomptarget/deviceRTLs/nvptx/src/option.h
37 ↗	(On Diff #164715)	This doesn't exist unless you have information that is not public yet. Volta is `720` at most.

In D51875#1229491, @Hahnfeld wrote:

I really, really dislike adding even more global buffers. 4096 * 32 * 56 are another 7MiB that are not usable for applications. What's wrong with using the existing ones?

Can you upload the CodeGen patch for reductions somewhere? I thought we need a global scratchpad buffer that is addressable for all teams?

I really, really dislike an implementation in ibm-devel, the scratchpad solution will never be added to the trunk. The existing ones cannot be reused, as they are allocated only if the full runtime is used.

libomptarget/deviceRTLs/nvptx/src/option.h
37 ↗	(On Diff #164715)	According to this https://docs.nvidia.com/cuda/volta-tuning-guide/index.html, it is 84

In D51875#1229496, @ABataev wrote:

I really, really dislike an implementation in ibm-devel, the scratchpad solution will never be added to the trunk. The existing ones cannot be reused, as they are allocated only if the full runtime is used.

What's the overhead of initializing it? The whole libomptarget-nvptx is already pretty much a mess, see my thread on openmp-dev.

libomptarget/deviceRTLs/nvptx/src/option.h
37 ↗	(On Diff #164715)	I'm not commenting on `MAX_SM`, rather on the value of `__CUDA_ARCH__`. As such these defines are never active.

In D51875#1229502, @Hahnfeld wrote:

What's the overhead of initializing it? The whole libomptarget-nvptx is already pretty much a mess, see my thread on openmp-dev.

It is not the runtime issue, it is the problem with the compiler itself. It breaks compatibility with the other outlined regions and, thus, it cannot be committed to trunk. I'd like to have this at least as a temporary solution to support lastprivates/reductions in SPMD mode with lightweight runtime. We can reduce the size of the preallocated buffers, if you wish.

In D51875#1229506, @ABataev wrote:

It is not the runtime issue, it is the problem with the compiler itself. It breaks compatibility with the other outlined regions and, thus, it cannot be committed to trunk.

Can you please describe the problems? Again, maybe posting the patch may help.

I'd like to have this at least as a temporary solution to support lastprivates/reductions in SPMD mode with lightweight runtime. We can reduce the size of the preallocated buffers, if you wish.

Is that a commitment to actively work on that area?

In D51875#1229517, @Hahnfeld wrote:

In D51875#1229506, @ABataev wrote:

It is not the runtime issue, it is the problem with the compiler itself. It breaks compatibility with the other outlined regions and, thus, it cannot be committed to trunk.

Can you please describe the problems? Again, maybe posting the patch may help.

I already described it - it breaks the compatibility with other outlined regions and breaks the whole design of the OpenMP implementation.

I'd like to have this at least as a temporary solution to support lastprivates/reductions in SPMD mode with lightweight runtime. We can reduce the size of the preallocated buffers, if you wish.

Is that a commitment to actively work on that area?

Yes, Alex Eichenberger tries to invent something, that will allow us to use something similar to ibm-devel but without breaking the design of OpenMP in the compiler. But it requires some time. But I'd like to have something working, at least.

Fixed SM_MAX definition guard

Harbormaster completed remote builds in B22439: Diff 164742. Sep 10 2018, 1:32 PM

In D51875#1229536, @ABataev wrote:

In D51875#1229517, @Hahnfeld wrote:

In D51875#1229506, @ABataev wrote:

It is not the runtime issue, it is the problem with the compiler itself. It breaks compatibility with the other outlined regions and, thus, it cannot be committed to trunk.

Can you please describe the problems? Again, maybe posting the patch may help.

I already described it - it breaks the compatibility with other outlined regions and breaks the whole design of the OpenMP implementation.

First that's a general statement without any explanation. Second I'm not asking about the scratchpad pointer solution in ibm-devel but rather why we can't pass RequiresDataSharing = true to __kmpc_spmd_kernel_init. Which will give us the data sharing in existing buffers.

I'd like to have this at least as a temporary solution to support lastprivates/reductions in SPMD mode with lightweight runtime. We can reduce the size of the preallocated buffers, if you wish.

Is that a commitment to actively work on that area?

Yes, Alex Eichenberger tries to invent something, that will allow us to use something similar to ibm-devel but without breaking the design of OpenMP in the compiler. But it requires some time. But I'd like to have something working, at least.

I'm referring to the process of cleaning up libomptarget-nvptx.

Hahnfeld added inline comments. Sep 10 2018, 1:39 PM

libomptarget/deviceRTLs/nvptx/src/option.h
37 ↗	(On Diff #164715)	That's now 1 GiB of global memory that can't be used by the user application.

In D51875#1229545, @Hahnfeld wrote:

In D51875#1229536, @ABataev wrote:

I already described it - it breaks the compatibility with other outlined regions and breaks the whole design of the OpenMP implementation.

First that's a general statement without any explanation. Second I'm not asking about the scratchpad pointer solution in ibm-devel but rather why we can't pass RequiresDataSharing = true to __kmpc_spmd_kernel_init. Which will give us the data sharing in existing buffers.

First, stop talking like this. I don't owe you anything.
Second, RequiresDataSharing is not required, because I tend to use the preallocated buffer instead of dynamically allocated.

I'd like to have this at least as a temporary solution to support lastprivates/reductions in SPMD mode with lightweight runtime. We can reduce the size of the preallocated buffers, if you wish.

Is that a commitment to actively work on that area?

Yes, Alex Eichenberger tries to invent something, that will allow us to use something similar to ibm-devel but without breaking the design of OpenMP in the compiler. But it requires some time. But I'd like to have something working, at least.

I'm referring to the process of cleaning up libomptarget-nvptx.

No, if you're interested in this, you can do it.

No, it is about 10-12 MB.

libomptarget/deviceRTLs/nvptx/src/option.h
37 ↗	(On Diff #164715)	Just like I said, I can reduce the size of the preallocated buffers.

In D51875#1229565, @ABataev wrote:

In D51875#1229545, @Hahnfeld wrote:

In D51875#1229536, @ABataev wrote:

I already described it - it breaks the compatibility with other outlined regions and breaks the whole design of the OpenMP implementation.

First that's a general statement without any explanation. Second I'm not asking about the scratchpad pointer solution in ibm-devel but rather why we can't pass RequiresDataSharing = true to __kmpc_spmd_kernel_init. Which will give us the data sharing in existing buffers.

First, stop talking like this. I don't owe you anything.

Sorry, my last comment sounds rude even though I didn't mean it.
My point is that it's impossible to review patches without a big picture: what are the other parts, which alternatives did you evaluate, why don't they work?
And to be honest: Disregarding technical review and simply ignoring my comments doesn't feel nice either.

Second, RequiresDataSharing is not required, because I tend to use the preallocated buffer instead of dynamically allocated.

The data sharing infrastructure also has preallocated buffers.

I'd like to have this at least as a temporary solution to support lastprivates/reductions in SPMD mode with lightweight runtime. We can reduce the size of the preallocated buffers, if you wish.

Is that a commitment to actively work on that area?

Yes, Alex Eichenberger tries to invent something, that will allow us to use something similar to ibm-devel but without breaking the design of OpenMP in the compiler. But it requires some time. But I'd like to have something working, at least.

I'm referring to the process of cleaning up libomptarget-nvptx.

No, if you're interested in this, you can do it.

That's what I feared. Yes, I think this is needed, but to be honest, I'm already facing enough resistance with moderately conservative proposals.

In D51875#1229566, @ABataev wrote:

No, it is about 10-12 MB.

The additional buffer, maybe. But raising MAX_SM from 56 to 84 scales the array of queues proportionally.
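
The sizes quoted in this exchange can be checked with simple arithmetic. The sketch below assumes that the product 4096 * 32 * 56 from the first comment describes the additional buffer and that only the last factor is MAX_SM. The thread does not say what the other two factors stand for, so the parameter names `bytesPerSlot` and `slotsPerSM` are hypothetical.

```cpp
#include <cstdint>

// Product as quoted in the review: 4096 * 32 * MAX_SM.
// bytesPerSlot and slotsPerSM are illustrative guesses at the factor meanings.
constexpr std::uint64_t bufferBytes(std::uint64_t maxSM,
                                    std::uint64_t bytesPerSlot = 4096,
                                    std::uint64_t slotsPerSM = 32) {
  return bytesPerSlot * slotsPerSM * maxSM;
}

// Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes).
constexpr double toMiB(std::uint64_t bytes) {
  return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

With MAX_SM = 56 this gives 7 MiB, matching the first comment, and with MAX_SM = 84 it gives 10.5 MiB, in line with the 10-12 megabyte estimate. The larger figure mentioned in the inline comment concerns the existing array of queues, which this sketch does not model.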

In D51875#1229582, @Hahnfeld wrote:

In D51875#1229565, @ABataev wrote:

First, stop talking like this. I don't owe you anything.

Sorry, my last comment sounds rude even though I didn't mean it.
My point is that it's impossible to review patches without a big picture: what are the other parts, which alternatives did you evaluate, why don't they work?
And to be honest: Disregarding technical review and simply ignoring my comments doesn't feel nice either.

It is going to use the same globalization support we use for the generic data-sharing scheme. But in SPMD mode we need to share only lastprivates/reduction variables in the teams, so we can use simplified data allocation algorithm as we don't need to use it in other constructs.

In D51875#1229566, @ABataev wrote:

No, it is about 10-12 MB.

The additional buffer, maybe. But raising MAX_SM from 56 to 84 scales the array of queues proportionally.

I will try to reuse the existing buffers in the global memory.

Reused preallocated memory for the full runtime as the global memory buffer for the lightweight runtime.

Harbormaster completed remote builds in B22482: Diff 164879. Sep 11 2018, 7:46 AM

grokos added inline comments. Sep 11 2018, 7:53 AM

libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.cu
44 ↗	(On Diff #164879)	Expected <space> number....

Fixed message.

Ping!

LGTM

This revision is now accepted and ready to land. Sep 21 2018, 6:34 AM

Closed by commit rL342737: [OPENMP][NVPTX] Add support for lastprivates/reductions handling in SPMD… (authored by ABataev). Sep 21 2018, 7:13 AM

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: llvm-commits. Sep 21 2018, 7:13 AM

@Hahnfeld: Are the latest changes in line with your requirements/plans to reduce the memory footprint of the nvptx runtime?

In D51875#1229536, @ABataev wrote:

I already described it - it breaks the compatibility with other outlined regions and breaks the whole design of the OpenMP implementation.

[...]

Yes, Alex Eichenberger tries to invent something, that will allow us to use something similar to ibm-devel but without breaking the design of OpenMP in the compiler. But it requires some time. But I'd like to have something working, at least.

Just to make sure I came to the right conclusions after trying to understand the code generated since rC342738, and for documentation purposes, please confirm whether the following explanation is correct: The compiler-generated code asks the runtime for two loop schedules, one for distribute and the other to implement for. The latter iterates over the chunk returned from the distribute schedule.

For lastprivates on teams distribute parallel for this means that the global value needs to be updated in the last iteration of the last distribute chunk. However, the outlined parallel region only knows whether the current thread is executing the last iteration of the for worksharing construct. This means the lastprivate value of the parallel for is passed back to the distribute loop which decides if it has just executed the last chunk and needs to write to the global value.
In SPMD constructs all CUDA threads are executing the distribute loop, but only the thread executing the last iteration of the for loop has seen the lastprivate value. However the information of which thread this is has been lost at the end of the parallel region. So data sharing is used to communicate the lastprivate value to all threads in the team that is executing the last distribute chunk.

Assume a simple case like this:

int last;
#pragma omp target teams distribute parallel for map(from: last) lastprivate(last)
for (int i = 0; i < 10000; i++) {
  last = i;
}

Clang conceptually generates the following:

void outlined_target_fn(int *last) {
  int *last_ds = /* get data sharing frame from runtime */;
  for (/* distribute loop from 0 to 9999 */) {
    outlined_parallel_fn(lb, ub, last_ds);
  }
  if (/* received last chunk */) {
    *last = *last_ds;
  }
}

void outlined_parallel_fn(int lb, int ub, int *last) {
  int last_privatized;
  for (/* for loop from lb to ub */) {
    last_privatized = i;
  }
  if (/* executed last iteration of for loop */) {
    *last = last_privatized;
  }
}
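
To make the data flow of the conceptual code above concrete, here is a sequential host-side model of it. It is only a sketch under stated assumptions: both loops use static, contiguous chunking, teams and threads are simulated one after another, and the names `simulateSharedSlot` and `teamSlot` are hypothetical. It does not reproduce the IR that Clang emits.

```cpp
#include <algorithm>
#include <vector>

// Sequential model of the first pseudocode; all names are illustrative.
// Assumes n, numTeams and threadsPerTeam are all positive.
int simulateSharedSlot(int n, int numTeams, int threadsPerTeam) {
  int last = -1;                            // the mapped original variable
  std::vector<int> teamSlot(numTeams, -1);  // one data sharing slot per team
  int teamChunk = (n + numTeams - 1) / numTeams;
  for (int team = 0; team < numTeams; ++team) {
    int lb = team * teamChunk;
    int ub = std::min(n, lb + teamChunk) - 1;  // inclusive upper bound
    if (lb > ub) continue;                     // this team got no iterations
    int threadChunk = (ub - lb + threadsPerTeam) / threadsPerTeam;
    // outlined_parallel_fn, run by every thread of the team
    for (int tid = 0; tid < threadsPerTeam; ++tid) {
      int tlb = lb + tid * threadChunk;
      int tub = std::min(ub, tlb + threadChunk - 1);
      int lastPrivatized = -1;
      for (int i = tlb; i <= tub; ++i) lastPrivatized = i;
      bool ranLastIteration = (tlb <= tub) && (tub == ub);
      if (ranLastIteration) teamSlot[team] = lastPrivatized;
    }
    // Back in the distribute loop: every thread evaluates the same check
    // and, if it holds, stores the same team-shared value.
    bool receivedLastChunk = (ub == n - 1);
    for (int tid = 0; tid < threadsPerTeam; ++tid)
      if (receivedLastChunk) last = teamSlot[team];
  }
  return last;
}
```

For the example loop, `simulateSharedSlot(10000, 4, 32)` yields 9999: the thread that runs iteration 9999 publishes it to its team's slot, and every thread of that team then copies the same value to the original variable.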

I tried to solve this problem without support from the runtime and this appears to work:

void outlined_target_fn(int *last) {
  int last_dummy;
  for (/* distribute loop from 0 to 9999 */) {
    int *last_p = &last_dummy;
    if (/* is last chunk */) {
      last_p = last;
    }
    outlined_parallel_fn(lb, ub, last_p);
  }
}

void outlined_parallel_fn(int lb, int ub, int *last) {
  int last_privatized;
  for (/* for loop from lb to ub */) {
    last_privatized = i;
  }
  if (/* executed last iteration of for loop */) {
    *last = last_privatized;
  }
}

(Alternatively it should also be possible to set last_p before entering the distribute loop. This will write to last multiple times but the final value should stay in memory after the kernel.)

As you can see the outlined parallel function is unchanged (which is probably what you mean with "breaks the compatibility", @ABataev?). This should work because all invocations of outlined_parallel_fn write their value of last into a dummy location, except the one executing the last distribute chunk.
What do you think?
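
The proposal above can be modelled in the same sequential, host-side way. This is a sketch under assumptions, not a statement about Clang's codegen: chunking is static and contiguous, `last_dummy` is modelled as private to each thread, and `Result` and `simulateDummyPointer` are hypothetical names. It makes no claim about other schedules or about whether the scheme fits the compiler's design.

```cpp
#include <algorithm>

struct Result {
  int last;          // final value of the mapped variable
  int writesToLast;  // number of stores that reached it
};

// Sequential model of the second pseudocode; all names are illustrative.
// Assumes n, numTeams and threadsPerTeam are all positive.
Result simulateDummyPointer(int n, int numTeams, int threadsPerTeam) {
  Result r{-1, 0};
  int teamChunk = (n + numTeams - 1) / numTeams;
  for (int team = 0; team < numTeams; ++team) {
    int lb = team * teamChunk;
    int ub = std::min(n, lb + teamChunk) - 1;  // inclusive upper bound
    if (lb > ub) continue;                     // this team got no iterations
    bool isLastChunk = (ub == n - 1);
    int threadChunk = (ub - lb + threadsPerTeam) / threadsPerTeam;
    for (int tid = 0; tid < threadsPerTeam; ++tid) {
      int lastDummy = -1;  // modelled as private to each thread
      int *lastP = isLastChunk ? &r.last : &lastDummy;
      // outlined_parallel_fn(lb, ub, lastP)
      int tlb = lb + tid * threadChunk;
      int tub = std::min(ub, tlb + threadChunk - 1);
      int lastPrivatized = -1;
      for (int i = tlb; i <= tub; ++i) lastPrivatized = i;
      if (tlb <= tub && tub == ub) {
        *lastP = lastPrivatized;
        if (lastP == &r.last) ++r.writesToLast;
      }
    }
  }
  return r;
}
```

In this model `simulateDummyPointer(10000, 4, 32)` ends with `last == 9999` after exactly one store to the original variable; stores from all other teams go to their dummy locations.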

In D51875#1241913, @grokos wrote:

@Hahnfeld: Are the latest changes in line with your requirements/plans to reduce the memory footprint of the nvptx runtime?

I still think it's a waste of resources to statically allocate around 1 GB on sm_70 / 660 MB on sm_60. And I think it's worrying that we are adding more and more data structures because it seems convenient to quickly solve a problem. The truth seems to be that it's incredibly hard to get rid of them later on...

No, you're not correct here.

void outlined_target_fn(int *last) {
  int *last_ds = /* get data sharing frame from runtime */;
  for (/* distribute loop from 0 to 9999 */) {
    outlined_parallel_fn(lb, ub, last_ds);
  }
  if (/* received last chunk */) {
    *last = *last_ds;
  }
}

This code is for the distribute loop. And here you have conflict without the datasharing scheme. The problem here is that this check /* received last chunk */ is true for all inner loop iterations for inner for directive and *last_ds may come not from the last iteration of for loop, but from some other iterations. To solve this problem, we need to share the same last_ds between all the threads in the team.

Yes, that's the current solution in Clang and actually what I described above:

In D51875#1248997, @Hahnfeld wrote:

In SPMD constructs all CUDA threads are executing the distribute loop, but only the thread executing the last iteration of the for loop has seen the lastprivate value. However the information of which thread this is has been lost at the end of the parallel region. So data sharing is used to communicate the lastprivate value to all threads in the team that is executing the last distribute chunk.

I'm assuming that the pointer returned by /* get data sharing frame from runtime */ is shared between all threads in a team.

In D51875#1249034, @Hahnfeld wrote:

Yes, that's the current solution in Clang and actually what I described above:

I'm assuming that the pointer returned by /* get data sharing frame from runtime */ is shared between all threads in a team.

It is not how clang works, it is how standard requires.
Yes, it is shared between all the threads in the team and this is how it is intended to be according to the standard

In D51875#1249088, @ABataev wrote:

In D51875#1249034, @Hahnfeld wrote:

In D51875#1249019, @ABataev wrote:

Yes, that's the current solution in Clang and actually what I described above:

In D51875#1248997, @Hahnfeld wrote:

In SPMD constructs all CUDA threads are executing the distribute loop, but only the thread executing the last iteration of the for loop has seen the lastprivate value. However the information of which thread this is has been lost at the end of the parallel region. So data sharing is used to communicate the lastprivate value to all threads in the team that is executing the last distribute chunk.

I'm assuming that the pointer returned by /* get data sharing frame from runtime */ is shared between all threads in a team.

It is not how clang works, it is how standard requires.

Yes, it is shared between all the threads in the team and this is how it is intended to be according to the standard

The main problem with your solution is that the distribute loop has no information about which thread actually executed the last chunk of the loop. All the threads in the last team must execute the same check and only one shall write its private value to the original variable. But, just like I said, the runtime does not provide this information to the compiler.

In D51875#1249092, @ABataev wrote:

In D51875#1249088, @ABataev wrote:

It is not how clang works, it is how standard requires.

I've tried to describe how the current implementation works, based on the IR that is generated.

In D51875#1248997, @Hahnfeld wrote:

Clang conceptually generates the following:

void outlined_target_fn(int *last) {
  int *last_ds = /* get data sharing frame from runtime */;
  for (/* distribute loop from 0 to 9999 */) {
    outlined_parallel_fn(lb, ub, last_ds);
  }
  if (/* received last chunk */) {
    *last = *last_ds;
  }
}

void outlined_parallel_fn(int lb, int ub, int *last) {
  int last_privatized;
  for (/* for loop from lb to ub */) {
    last_privatized = i;
  }
  if (/* executed last iteration of for loop */) {
    *last = last_privatized;
  }
}

Please let me know if this pseudo code conceptually doesn't match the current IR.

Yes, it is shared between all the threads in the team and this is how it is intended to be according to the standard

The main problem with your solution is that the distribute loop has no information about which thread actually executed the last chunk of the loop. All the threads in the last team must execute the same check and only one shall write its private value to the original variable. But, just like I said, the runtime does not provide this information to the compiler.

Now you are talking about the second pseudo-code:

In D51875#1248997, @Hahnfeld wrote:

I tried to solve this problem without support from the runtime and this appears to work:

void outlined_target_fn(int *last) {
  int last_dummy;
  for (/* distribute loop from 0 to 9999 */) {
    int *last_p = &last_dummy;
    if (/* is last chunk */) {
      last_p = last;
    }
    outlined_parallel_fn(lb, ub, last_p);
  }
}

void outlined_parallel_fn(int lb, int ub, int *last) {
  int last_privatized;
  for (/* for loop from lb to ub */) {
    last_privatized = i;
  }
  if (/* executed last iteration of for loop */) {
    *last = last_privatized;
  }
}

I don't see why the distribute loop cares which thread actually executes the last iteration of the for loop, that's only relevant in the outlined parallel region.
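For readers who want to experiment with the second pseudo-code, here is a minimal host-side sketch of it. It is a sequential simulation under stated assumptions, not the IR that Clang generates: the extra `n`, `chunk` and `num_threads` parameters are hypothetical, threads are simulated one after another, and an even static split of each chunk is assumed. Because nothing runs concurrently here, the sketch cannot settle whether the scheme is valid on the device.

```cpp
#include <algorithm>

// Inner region: the team's chunk [lb, ub] is divided among num_threads
// simulated threads (num_threads > 0 assumed). Only the thread that runs
// iteration `ub` stores its private value through `last`.
void outlined_parallel_fn(int lb, int ub, int num_threads, int *last) {
  // Ceiling of (ub - lb + 1) / num_threads.
  int per_thread = (ub - lb + num_threads) / num_threads;
  for (int t = 0; t < num_threads; ++t) {
    int t_lb = lb + t * per_thread;
    int t_ub = std::min(t_lb + per_thread - 1, ub);
    int last_privatized = -1;
    for (int i = t_lb; i <= t_ub; ++i)
      last_privatized = i;
    // "executed last iteration of for loop"
    if (t_lb <= ub && t_ub == ub)
      *last = last_privatized;
  }
}

// Outer distribute loop over [0, n): only the last chunk receives the real
// address; every other chunk writes into a dummy location.
void outlined_target_fn(int n, int chunk, int num_threads, int *last) {
  int last_dummy = 0;
  for (int lb = 0; lb < n; lb += chunk) {
    int ub = std::min(lb + chunk, n) - 1;
    int *last_p = &last_dummy;
    if (ub == n - 1) // "is last chunk"
      last_p = last;
    outlined_parallel_fn(lb, ub, num_threads, last_p);
  }
}
```

Calling `outlined_target_fn(10000, 1000, 4, &last)` corresponds to the "distribute loop from 0 to 9999" in the pseudo-code, with 1000-iteration chunks and four simulated threads per team.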

In D51875#1249122, @Hahnfeld wrote:

In D51875#1249092, @ABataev wrote:

In D51875#1249088, @ABataev wrote:

I don't see why the distribute loop cares which thread actually executes the last iteration of the for loop, that's only relevant in the outlined parallel region.

Because it marks as lastprivate not the last loop chunk executed by the last thread, but the set of loop chunks executed by the last team. It means that when you try to write the lastprivate value after the distribute loop you will have multiple writes from the different threads with the different values of lastprivates.

In D51875#1249136, @ABataev wrote:

In D51875#1249122, @Hahnfeld wrote:

In D51875#1249092, @ABataev wrote:

In D51875#1249088, @ABataev wrote:

I don't see why the distribute loop cares which thread actually executes the last iteration of the for loop, that's only relevant in the outlined parallel region.

Because it marks as lastprivate not the last loop chunk executed by the last thread, but the set of loop chunks executed by the last team. It means that when you try to write the lastprivate value after the distribute loop you will have multiple writes from the different threads with the different values of lastprivates.

Say, last distribute chunk is [L, U]. In the inner for directive it is split into [L,U1], [U1+1, U2], ..., [Un-1 + 1, U]. Distribute marks all these chunks as last, not the last [Un-1 + 1, U].
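The chunk arithmetic above can be sketched on the host. This is only an illustration: `Chunk`, `split_chunk` and `count_chunks_with_last_iteration` are hypothetical names, and an even static split of [L, U] among the threads is assumed, which need not match the schedule the runtime actually uses.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper type: an inclusive iteration range.
struct Chunk {
  int lb;
  int ub;
};

// Split the inclusive range [lb, ub] into at most num_threads contiguous
// sub-chunks (num_threads > 0 assumed).
std::vector<Chunk> split_chunk(int lb, int ub, int num_threads) {
  std::vector<Chunk> chunks;
  int total = ub - lb + 1;
  int per_thread = (total + num_threads - 1) / num_threads; // ceiling
  for (int start = lb; start <= ub; start += per_thread)
    chunks.push_back({start, std::min(start + per_thread - 1, ub)});
  return chunks;
}

// Count the sub-chunks that actually contain the final iteration `ub`.
int count_chunks_with_last_iteration(const std::vector<Chunk> &chunks,
                                     int ub) {
  int n = 0;
  for (const Chunk &c : chunks)
    if (c.lb <= ub && ub <= c.ub)
      ++n;
  return n;
}
```

With L = 9000, U = 9999 and four threads, the split yields four sub-chunks. All four belong to the last distribute chunk, so a distribute-level "is last" flag is true for every one of them, while only one of them contains iteration U.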

In D51875#1249153, @ABataev wrote:

In D51875#1249136, @ABataev wrote:

In D51875#1249122, @Hahnfeld wrote:

In D51875#1249092, @ABataev wrote:

In D51875#1249088, @ABataev wrote:

I don't see why the distribute loop cares which thread actually executes the last iteration of the for loop, that's only relevant in the outlined parallel region.

Because it marks as lastprivate not the last loop chunk executed by the last thread, but the set of loop chunks executed by the last team. It means that when you try to write the lastprivate value after the distribute loop you will have multiple writes from the different threads with the different values of lastprivates.

Say, last distribute chunk is [L, U]. In the inner for directive it is split into [L,U1], [U1+1, U2], ..., [Un-1 + 1, U]. Distribute marks all these chunks as last, not the last [Un-1 + 1, U].

I got that. This is why the outer distribute only passes the global address for its last chunk. Then the inner for decides which thread executes [Un-1 + 1, U] and writes the lastprivate value.

Plus, I need to add that I tried the solution you proposed here maybe a month or two ago. If it worked, I would definitely use it rather than the one implemented now, because it is much easier to implement and works much faster. But it just does not work!

In D51875#1249159, @Hahnfeld wrote:

In D51875#1249153, @ABataev wrote:

In D51875#1249136, @ABataev wrote:

In D51875#1249122, @Hahnfeld wrote:

In D51875#1249092, @ABataev wrote:

In D51875#1249088, @ABataev wrote:

I don't see why the distribute loop cares which thread actually executes the last iteration of the for loop, that's only relevant in the outlined parallel region.

Because it marks as lastprivate not the last loop chunk executed by the last thread, but the set of loop chunks executed by the last team. It means that when you try to write the lastprivate value after the distribute loop you will have multiple writes from the different threads with the different values of lastprivates.

Say, last distribute chunk is [L, U]. In the inner for directive it is split into [L,U1], [U1+1, U2], ..., [Un-1 + 1, U]. Distribute marks all these chunks as last, not the last [Un-1 + 1, U].

I got that. This is why the outer distribute only passes the global address for its last chunk. Then the inner for decides which thread executes [Un-1 + 1, U] and writes the lastprivate value.

Yes, that's right! You got it.

In D51875#1249162, @ABataev wrote:

In D51875#1249159, @Hahnfeld wrote:

In D51875#1249153, @ABataev wrote:

Say, last distribute chunk is [L, U]. In the inner for directive it is split into [L,U1], [U1+1, U2], ..., [Un-1 + 1, U]. Distribute marks all these chunks as last, not the last [Un-1 + 1, U].

I got that. This is why the outer distribute only passes the global address for its last chunk. Then the inner for decides which thread executes [Un-1 + 1, U] and writes the lastprivate value.

Yes, that's right! You got it.

So now you are agreeing to "my" solution which is different than what Clang currently does - I'm confused.

In D51875#1249164, @Hahnfeld wrote:

In D51875#1249162, @ABataev wrote:

In D51875#1249159, @Hahnfeld wrote:

In D51875#1249153, @ABataev wrote:

Say, last distribute chunk is [L, U]. In the inner for directive it is split into [L,U1], [U1+1, U2], ..., [Un-1 + 1, U]. Distribute marks all these chunks as last, not the last [Un-1 + 1, U].

I got that. This is why the outer distribute only passes the global address for its last chunk. Then the inner for decides which thread executes [Un-1 + 1, U] and writes the lastprivate value.

Yes, that's right! You got it.

So now you are agreeing to "my" solution which is different than what Clang currently does - I'm confused.

No, I do not agree with your solution; I thought you agreed with the implemented one. You said you understood that it is the inner for loop that decides which chunk is actually the last one. Because of that, we need to share the distribute private copy of the lastprivate variable, so that all the threads in the inner parallel region can modify it.

Revision Contents

Path: openmp/trunk/libomptarget/deviceRTLs/nvptx/src/

File                  Size
data_sharing.cu       12 lines
omp_data.cu           2 lines
omptarget-nvptx.h     5 lines
omptarget-nvptx.cu    18 lines
omptarget-nvptxi.h    33 lines
option.h              5 lines

Diff 166479

openmp/trunk/libomptarget/deviceRTLs/nvptx/src/data_sharing.cu

[... 372 lines not shown ...]
 // the list of references to shared variables and to pre-allocate global storage
 // for holding the globalized variables.
 //
 // By default the globalized variables are stored in global memory. If the
 // UseSharedMemory is set to true, the runtime will attempt to use shared memory
 // as long as the size requested fits the pre-allocated size.
 EXTERN void* __kmpc_data_sharing_push_stack(size_t DataSize,
     int16_t UseSharedMemory) {
+  if (isRuntimeUninitialized()) {
+    ASSERT0(LT_FUSSY, isSPMDMode(),
+            "Expected SPMD mode with uninitialized runtime.");
+    return omptarget_nvptx_SimpleThreadPrivateContext::Allocate(DataSize);
+  }
+
   // Frame pointer must be visible to all workers in the same warp.
   unsigned WID = getWarpId();
   void *&FrameP = DataSharingState.FramePtr[WID];

   // Only warp active master threads manage the stack.
   if (IsWarpMasterActiveThread()) {
     // SlotP will point to either the shared memory slot or an existing
     // global memory slot.
[... 62 lines not shown ...]
 }

 // Pop the stack and free any memory which can be reclaimed.
 //
 // When the pop operation removes the last global memory slot,
 // reclaim all outstanding global memory slots since it is
 // likely we have reached the end of the kernel.
 EXTERN void __kmpc_data_sharing_pop_stack(void *FrameStart) {
+  if (isRuntimeUninitialized()) {
+    ASSERT0(LT_FUSSY, isSPMDMode(),
+            "Expected SPMD mode with uninitialized runtime.");
+    return omptarget_nvptx_SimpleThreadPrivateContext::Deallocate(FrameStart);
+  }
+
   if (IsWarpMasterActiveThread()) {
     unsigned WID = getWarpId();

     // Current slot
     __kmpc_data_sharing_slot *&SlotP = DataSharingState.SlotPtr[WID];

     // Pointer to next available stack.
     void *&StackP = DataSharingState.StackPtr[WID];
[... 50 lines not shown ...]

openmp/trunk/libomptarget/deviceRTLs/nvptx/src/omp_data.cu

[... 32 lines not shown ...]

 // Pointer to this team's OpenMP state object
 __device__ __shared__
     omptarget_nvptx_ThreadPrivateContext *omptarget_nvptx_threadPrivateContext;

 __device__ __shared__ omptarget_nvptx_SimpleThreadPrivateContext
     *omptarget_nvptx_simpleThreadPrivateContext;

+__device__ __shared__ void *omptarget_nvptx_simpleGlobalData;
+
 ////////////////////////////////////////////////////////////////////////////////
 // The team master sets the outlined parallel function in this variable to
 // communicate with the workers. Since it is in shared memory, there is one
 // copy of these variables for each kernel, instance, and team.
 ////////////////////////////////////////////////////////////////////////////////
 volatile __device__ __shared__ omptarget_nvptx_WorkFn omptarget_nvptx_workFn;

 ////////////////////////////////////////////////////////////////////////////////
[... 18 lines not shown ...]

openmp/trunk/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h

[... 107 lines not shown ...] enum DATA_SHARING_SIZES {
   // The maximum number of workers in a kernel.
   DS_Max_Worker_Threads = 992,
   // The size reserved for data in a shared memory slot.
   DS_Slot_Size = 256,
   // The slot size that should be reserved for a working warp.
   DS_Worker_Warp_Slot_Size = WARPSIZE * DS_Slot_Size,
   // The maximum number of warps in use
   DS_Max_Warp_Number = 32,
+  // The size of the preallocated shared memory buffer per team
+  DS_Shared_Memory_Size = 128,
 };

 // Data structure to keep in shared memory that traces the current slot, stack,
 // and frame pointer as well as the active threads that didn't exit the current
 // environment.
 struct DataSharingStateTy {
   __kmpc_data_sharing_slot *SlotPtr[DS_Max_Warp_Number];
   void *StackPtr[DS_Max_Warp_Number];
[... 257 lines not shown ...]

 /// Device envrionment data
 struct omptarget_device_environmentTy {
   int32_t debug_level;
 };

 class omptarget_nvptx_SimpleThreadPrivateContext {
   uint16_t par_level[MAX_THREADS_PER_TEAM];

 public:
   INLINE void Init() {
     ASSERT0(LT_FUSSY, isSPMDMode() && isRuntimeUninitialized(),
             "Expected SPMD + uninitialized runtime modes.");
     par_level[GetThreadIdInBlock()] = 0;
   }
+  static INLINE void *Allocate(size_t DataSize);
+  static INLINE void Deallocate(void *Ptr);
   INLINE void IncParLevel() {
     ASSERT0(LT_FUSSY, isSPMDMode() && isRuntimeUninitialized(),
             "Expected SPMD + uninitialized runtime modes.");
     ++par_level[GetThreadIdInBlock()];
   }
   INLINE void DecParLevel() {
     ASSERT0(LT_FUSSY, isSPMDMode() && isRuntimeUninitialized(),
             "Expected SPMD + uninitialized runtime modes.");
[... 62 lines not shown ...]

openmp/trunk/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.cu

[... 19 lines not shown ...]
 extern __device__
     omptarget_nvptx_Queue<omptarget_nvptx_ThreadPrivateContext, OMP_STATE_COUNT>
         omptarget_nvptx_device_State[MAX_SM];

 extern __device__ omptarget_nvptx_Queue<
     omptarget_nvptx_SimpleThreadPrivateContext, OMP_STATE_COUNT>
     omptarget_nvptx_device_simpleState[MAX_SM];

+extern __device__ __shared__ void *omptarget_nvptx_simpleGlobalData;
+
 ////////////////////////////////////////////////////////////////////////////////
 // init entry points
 ////////////////////////////////////////////////////////////////////////////////

+INLINE unsigned nsmid() {
+  unsigned n;
+  asm("mov.u32 %0, %%nsmid;" : "=r"(n));
+  return n;
+}
+
 INLINE unsigned smid() {
   unsigned id;
   asm("mov.u32 %0, %%smid;" : "=r"(id));
+  ASSERT0(LT_FUSSY, nsmid() <= MAX_SM,
+          "Expected number of SMs is less than reported.");
   return id;
 }

 EXTERN void __kmpc_kernel_init_params(void *Ptr) {
   PRINT(LD_IO, "call to __kmpc_kernel_init_params with version %f\n",
         OMPTARGET_NVPTX_VERSION);

   SetTeamsReductionScratchpadPtr(Ptr);
[... 60 lines not shown ...] EXTERN void __kmpc_spmd_kernel_init(int ThreadLimit, int16_t RequiresOMPRuntime,

   if (!RequiresOMPRuntime) {
     // If OMP runtime is not required don't initialize OMP state.
     setExecutionParameters(Spmd, RuntimeUninitialized);
     if (GetThreadIdInBlock() == 0) {
       int slot = smid() % MAX_SM;
       omptarget_nvptx_simpleThreadPrivateContext =
           omptarget_nvptx_device_simpleState[slot].Dequeue();
+      // Reuse the memory allocated for the full runtime as the preallocated
+      // global memory buffer for the lightweight runtime.
+      omptarget_nvptx_simpleGlobalData =
+          omptarget_nvptx_device_State[slot].Dequeue();
     }
     __syncthreads();
     omptarget_nvptx_simpleThreadPrivateContext->Init();
     return;
   }
   setExecutionParameters(Spmd, RuntimeInitialized);

   //
[... 53 lines not shown ...] EXTERN void __kmpc_spmd_kernel_deinit() {
   __syncthreads();
   int threadId = GetThreadIdInBlock();
   if (isRuntimeUninitialized()) {
     if (threadId == 0) {
       // Enqueue omp state object for use by another team.
       int slot = smid() % MAX_SM;
       omptarget_nvptx_device_simpleState[slot].Enqueue(
           omptarget_nvptx_simpleThreadPrivateContext);
+      // Enqueue global memory back.
+      omptarget_nvptx_device_State[slot].Enqueue(
+          reinterpret_cast<omptarget_nvptx_ThreadPrivateContext *>(
+              omptarget_nvptx_simpleGlobalData));
     }
     return;
   }
   if (threadId == 0) {
     // Enqueue omp state object for use by another team.
     int slot = smid() % MAX_SM;
     omptarget_nvptx_device_State[slot].Enqueue(
         omptarget_nvptx_threadPrivateContext);
   }
 }

 // Return true if the current target region is executed in SPMD mode.
 EXTERN int8_t __kmpc_is_spmd_exec_mode() {
   return isSPMDMode();
 }

openmp/trunk/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptxi.h

[... 196 lines not shown ...]

 INLINE omptarget_nvptx_TaskDescr *getMyTopTaskDescriptor(int threadId) {
   return omptarget_nvptx_threadPrivateContext->GetTopLevelTaskDescr(threadId);
 }

 INLINE omptarget_nvptx_TaskDescr *getMyTopTaskDescriptor() {
   return getMyTopTaskDescriptor(GetLogicalThreadIdInBlock());
 }

+////////////////////////////////////////////////////////////////////////////////
+// Lightweight runtime functions.
+////////////////////////////////////////////////////////////////////////////////
+
+// Shared memory buffer for globalization support.
+static __align__(16) __device__ __shared__ char
+    omptarget_static_buffer[DS_Shared_Memory_Size];
+static __device__ __shared__ void *omptarget_spmd_allocated;
+
+extern __device__ __shared__ void *omptarget_nvptx_simpleGlobalData;
+
+INLINE void *
+omptarget_nvptx_SimpleThreadPrivateContext::Allocate(size_t DataSize) {
+  if (DataSize <= DS_Shared_Memory_Size)
+    return ::omptarget_static_buffer;
+  if (DataSize <= sizeof(omptarget_nvptx_ThreadPrivateContext))
+    return ::omptarget_nvptx_simpleGlobalData;
+  if (threadIdx.x == 0)
+    omptarget_spmd_allocated = SafeMalloc(DataSize, "SPMD teams alloc");
+  __syncthreads();
+  return omptarget_spmd_allocated;
+}
+
+INLINE void
+omptarget_nvptx_SimpleThreadPrivateContext::Deallocate(void *Ptr) {
+  if (Ptr != ::omptarget_static_buffer &&
+      Ptr != ::omptarget_nvptx_simpleGlobalData) {
+    __syncthreads();
+    if (threadIdx.x == 0)
+      SafeFree(Ptr, "SPMD teams dealloc");
+  }
+}
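The three-way choice made by `Allocate` and `Deallocate` can be summarized with a small host-side sketch. Everything here is illustrative: `Tier`, `pick_tier` and `needs_free` are hypothetical names, 128 is taken from `DS_Shared_Memory_Size`, and 4096 is an assumed stand-in (borrowed from the "4K bytes" in the patch summary) for `sizeof(omptarget_nvptx_ThreadPrivateContext)`, whose real value is not shown in this diff.

```cpp
#include <cstddef>

// Hypothetical labels for the three storage choices.
enum class Tier { SharedStatic, PreallocatedGlobal, Malloc };

// 128 matches DS_Shared_Memory_Size; 4096 is an assumption, see above.
const std::size_t kSharedBytes = 128;
const std::size_t kGlobalBytes = 4096;

// Same ordering of checks as Allocate(): smallest and fastest storage first.
inline Tier pick_tier(std::size_t data_size) {
  if (data_size <= kSharedBytes)
    return Tier::SharedStatic;
  if (data_size <= kGlobalBytes)
    return Tier::PreallocatedGlobal;
  return Tier::Malloc;
}

// Deallocate() only frees pointers that came from the malloc path.
inline bool needs_free(Tier tier) { return tier == Tier::Malloc; }
```

Under these assumptions a 64-byte request stays in shared memory, a 1 KiB request lands in the preallocated global buffer, and anything above the second threshold falls back to `SafeMalloc`/`SafeFree`.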

openmp/trunk/libomptarget/deviceRTLs/nvptx/src/option.h

[... 28 lines not shown ...]
 #define L1_BARRIER (1)

 // Maximum number of preallocated arguments to an outlined parallel/simd function.
 // Anything more requires dynamic memory allocation.
 #define MAX_SHARED_ARGS 20

 // Maximum number of omp state objects per SM allocated statically in global
 // memory.
-#if __CUDA_ARCH__ >= 600
+#if __CUDA_ARCH__ >= 700
+#define OMP_STATE_COUNT 32
+#define MAX_SM 84
+#elif __CUDA_ARCH__ >= 600
 #define OMP_STATE_COUNT 32
 #define MAX_SM 56
 #else
 #define OMP_STATE_COUNT 16
 #define MAX_SM 16
 #endif

 ////////////////////////////////////////////////////////////////////////////////
[... 18 lines not shown ...]
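To make the effect of the new `__CUDA_ARCH__ >= 700` branch concrete, the preprocessor ladder can be mirrored as an ordinary function. This is a sketch only: `StateLimits`, `limits_for_arch` and `total_state_objects` are hypothetical names, and the function takes the architecture as a runtime argument rather than reading `__CUDA_ARCH__`.

```cpp
// Hypothetical pair holding the two macros from option.h.
struct StateLimits {
  int omp_state_count; // OMP_STATE_COUNT: state objects per SM
  int max_sm;          // MAX_SM: number of per-SM queues
};

// Mirrors the #if / #elif / #else ladder for a given __CUDA_ARCH__ value.
inline StateLimits limits_for_arch(int cuda_arch) {
  if (cuda_arch >= 700)
    return {32, 84};
  if (cuda_arch >= 600)
    return {32, 56};
  return {16, 16};
}

// Total number of statically allocated state objects for that architecture.
inline int total_state_objects(int cuda_arch) {
  StateLimits l = limits_for_arch(cuda_arch);
  return l.omp_state_count * l.max_sm;
}
```

For architecture 700 this gives 32 * 84 = 2688 state objects, compared with 32 * 56 = 1792 for 600 and 16 * 16 = 256 for older architectures.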