This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][Offloading] Replaced default stream with an actual per-device unblocking stream in NVPTX implementation
AbandonedPublic

Authored by tianshilei1992 on Nov 8 2019, 7:53 AM.

Details

Summary

In this patch, the default stream is replaced with an actual per-device, non-blocking stream for better performance, since the default stream comes with a number of constraints (see https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf). Later, I will enable multiple streams to improve concurrency.
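The gist of the change can be sketched with the CUDA driver API. This is a minimal, illustrative sketch, not the actual patch code; the function name `createDeviceStream` and the surrounding structure are assumptions:

```cuda
#include <cuda.h>

// Create one stream per device at initialization time, instead of
// implicitly using the default (NULL) stream for every operation.
// CU_STREAM_NON_BLOCKING means work on this stream does not
// synchronize with the legacy default stream.
CUstream createDeviceStream(CUcontext Context) {
  CUstream Stream = nullptr;
  cuCtxSetCurrent(Context);
  cuStreamCreate(&Stream, CU_STREAM_NON_BLOCKING);
  return Stream;
}

// All data transfers and kernel launches for the device are then
// issued on that stream instead of the default one, e.g.:
//   cuMemcpyHtoDAsync(DevPtr, HostPtr, Size, Stream);
//   cuLaunchKernel(Func, GridX, 1, 1, BlockX, 1, 1, 0, Stream,
//                  Args, nullptr);
//   cuStreamSynchronize(Stream);
```

With a single stream per device this does not change observable behavior; it only removes the default stream's implicit device-wide synchronization, which is the prerequisite for the multi-stream follow-up.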

Diff Detail

Repository
rL LLVM

Event Timeline

tianshilei1992 created this revision. Nov 8 2019, 7:53 AM
Herald added a project: Restricted Project. Nov 8 2019, 7:53 AM
tianshilei1992 retitled this revision from [OpenMP][Offlloading] Replaced default stream with an actual per-device unblocking stream in NVPTX implementation to [OpenMP][Offloading] Replaced default stream with an actual per-device unblocking stream in NVPTX implementation. Nov 8 2019, 7:53 AM
ABataev added a subscriber: ABataev. Nov 8 2019, 8:02 AM
ABataev added inline comments.
openmp/libomptarget/plugins/cuda/src/rtl.cpp
260
  1. Use the real type instead of auto.
  2. Variables must start with an upper-case letter.
tianshilei1992 marked an inline comment as done. Nov 8 2019, 8:17 AM
tianshilei1992 added inline comments.
openmp/libomptarget/plugins/cuda/src/rtl.cpp
260

Sure, but that was just to keep aligned with the existing code. Do I need to update the existing code as well?

Also, the main question: how does it affect the existing execution model? What if we have a target region in a parallel region, will they be executed asynchronously? We need some tests for this if we don't have them.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
260

In a separate patch, please

tianshilei1992 marked an inline comment as not done. Nov 8 2019, 8:37 AM

> Also, the main question: how does it affect the existing execution model? What if we have a target region in a parallel region, will they be executed asynchronously? We need some tests for this if we don't have them.

According to https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf, a non-default stream can improve performance. This is actually the first step toward using multiple streams, which I'm going to implement later.

>> Also, the main question: how does it affect the existing execution model? What if we have a target region in a parallel region, will they be executed asynchronously? We need some tests for this if we don't have them.

> According to https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf, a non-default stream can improve performance. This is actually the first step toward using multiple streams, which I'm going to implement later.

My question is different. Does it affect execution of the existing code anyhow?

tianshilei1992 added a comment (edited). Nov 8 2019, 10:21 AM

>>> Also, the main question: how does it affect the existing execution model? What if we have a target region in a parallel region, will they be executed asynchronously? We need some tests for this if we don't have them.

>> According to https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf, a non-default stream can improve performance. This is actually the first step toward using multiple streams, which I'm going to implement later.

> My question is different. Does it affect execution of the existing code anyhow?

AFAIK, no. Currently we still have only one stream per device; it's just not the default stream. Kernels in a stream are executed in order, and asynchronous execution requires multiple streams. I'll check whether the existing test cases cover this, and will write one if not.
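The in-order guarantee described above can be illustrated as follows. This is a hedged sketch; `submitInOrder` and all parameter names are illustrative, not code from the plugin:

```cuda
#include <cuda.h>

// Work enqueued on one CUDA stream executes in submission order, so a
// single per-device stream preserves the existing (serial) execution
// model relative to the default stream.
void submitInOrder(CUstream Stream, CUfunction Kernel, CUdeviceptr Dev,
                   void *Host, size_t Size, void **Args) {
  cuMemcpyHtoDAsync(Dev, Host, Size, Stream);   // (1) copy data in
  cuLaunchKernel(Kernel, 1, 1, 1, 32, 1, 1,     // (2) waits for (1)
                 /*SharedMemBytes=*/0, Stream, Args, nullptr);
  cuMemcpyDtoHAsync(Host, Dev, Size, Stream);   // (3) waits for (2)
  cuStreamSynchronize(Stream);                  // block until all done
}
```

Concurrent execution of independent target regions would require one stream per region (or a stream pool), which is the follow-up work mentioned in the summary.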

jdoerfert added subscribers: grokos, JonChesterfield.

Can you please write a commit message explaining the change and the plan?

openmp/libomptarget/plugins/cuda/src/rtl.cpp
260

You can go either way, but I would just keep it as the surrounding code for now.
Making the code blend in is arguably a good thing, and we have not decided how we will handle the plugins in the short term (w.r.t. the coding standard).

@JonChesterfield @grokos What do you think? Should we run once over the plugin and adjust the coding style now or keep it consistent for the time being?

grokos added inline comments. Nov 8 2019, 1:04 PM
openmp/libomptarget/plugins/cuda/src/rtl.cpp
260

I would say let's keep it consistent. Later on, we can adjust the code style for the whole library but until then I prefer consistency over mixed styles.

> Can you please write a commit message explaining the change and the plan?

No problem. Will do.

tianshilei1992 edited the summary of this revision. Nov 8 2019, 1:46 PM

This one can be abandoned now...

grokos added a comment. Apr 6 2020, 7:42 PM

Superseded by D74145. You can abandon this one.

> Superseded by D74145. You can abandon this one.

I didn't see the option?

tianshilei1992 abandoned this revision. Apr 7 2020, 7:17 AM