The parallelLevel counter must be on per-thread basis to fully support
L2+ parallelism, otherwise we may end up with undefined behavior.
Introduce the parallelLevel on per-warp basis using shared memory. It
allows to avoid the problems with the synchronization and allows fully
support L2+ parallelism in SPMD mode with no runtime.
Details
Diff Detail
- Repository
- rOMP OpenMP
- Build Status
Buildable 30782 Build 30781: arc lint + arc unit
Event Timeline
Why is it enough to have one counter per warp, what happens if threads within a warp diverge? Before D55773 we had a counter per thread...
Counter per thread significantly affects the performance. It requires the class allocated in the global memory with the dynamically controlled queue to handle this object in the global memory. Doru removed this at cost of some functionality. At the moment, this solution does not work effectively when we have even warp divergence, to handle correctly L2 parallelism you must compile your code with full runtime support. This patch fixes this problem. It supports the divergence between the warps. Plus, it fixes the problem with the threads divergence within the warp. As far as the divergent threads are serialized, it is ok to have a parallelism level counter on per-warp basis. The test demonstrates the threads divergence and produces correct result in SPMD without runtime mode.
We can use per-thread counter, but it requires 1K of shared memory. This solution allows to save the shared memory and use only 32 bytes per block.
libomptarget/deviceRTLs/nvptx/test/parallel/spmd_parallel_regions.cpp | ||
---|---|---|
9–13 | I'm confused by the ternary operator (also the one below). If the first target thread executes this critical region, isHost is -1 and it will be set to omp_is_initial_device(). The second thread comes along and will set it to omp_is_initial_device because it was, at any point in time, either -1 or omp_is_initial_device. And so on. So why the ternary operator? | |
17–26 | Why do we need this ternary operator, shouldn't ParallelLevel2 always be set to L2? | |
21 | I don't get the calculation in the comment. Could you please elaborate? In general, I'm confused because you rewrite the test again. I have more and more the feeling changes are added and modified faster than anyone actually understands them, leading to "tests" that simply verify the current behavior. | |
31 | As far as I can tell, a value of -1 for isHost will not cause the test to fail. (bool)-1 is true and you never verify the "Runtime error" is not printed. |
libomptarget/deviceRTLs/nvptx/test/parallel/spmd_parallel_regions.cpp | ||
---|---|---|
9–13 | Because it is a test. If omp_is_initial_device() at least once return unexpected value, isHost will be set to 1 instead of the expected 0. | |
17–26 | Same, it is test, it checks that ParallelLevel2 inall required threads set correctly to 2, not 3, 1, 0 or something else. | |
21 | The first version of the test did not test the required functionality - SPMD mode without runtime. I fixed the test and committed to test the construct that supports SPMD+no runtime. This version of the test goes further and adds the testing of the parallel level with thread divergence. | |
31 | It will, since expected result is Target region executed on the device. With isHost equal to -1 the result will be Target region executed on the host |
libomptarget/deviceRTLs/nvptx/test/parallel/spmd_parallel_regions.cpp | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9–13 | I don't think it makes sense to complicate this test case by checking for the behavior of omp_is_initial_device. Also, your method of determining "at least once return unexpected value" for omp_is_initial_device is flawed. I sketched a path below but I don't think it is the only way to get the "correct" result even though the omp_is_initial_device implementation is, for some reason, broken. Finally, Setting isHost to 1 is not distinguishable from running on the initial device. So in a test case for omp_is_initial_device we should probably be able to distinguish these cases. Let's say we are on the device, so omp_is_initial_device should be false (=0), and we run with 4 threads. Now, an alternating omp_is_initial_device would still result in isHost = 0:
_below_: If there are two values the first is the result of the call in the condition, the second in the consequence. | ||||||||||||||||
17–26 | Same as above, this doesn't test much. | ||||||||||||||||
21 |
That helps a lot, thanks. |
I'm confused by the ternary operator (also the one below).
If the first target thread executes this critical region, isHost is -1 and it will be set to omp_is_initial_device(). The second thread comes along and will set it to omp_is_initial_device because it was, at any point in time, either -1 or omp_is_initial_device. And so on. So why the ternary operator?