This is an archive of the discontinued LLVM Phabricator instance.

[OPENMP][NVPTX]Correctly handle L2 parallelism in SPMD mode.

Authored by ABataev on Apr 19 2019, 1:46 PM.



The parallelLevel counter must be kept on a per-thread basis to fully
support L2+ parallelism; otherwise we may end up with undefined behavior.
Introduce a per-warp parallelLevel counter in shared memory. This avoids
the synchronization problems and fully supports L2+ parallelism in SPMD
mode with no runtime.
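The idea in the summary can be sketched in plain C. This is a hypothetical illustration, not the actual deviceRTL code: it assumes one byte per counter and a 1024-thread block, and models the shared-memory array (which would be `__shared__` in the real CUDA runtime) as a static array indexed by warp id rather than thread id.

```c
#include <assert.h>

#define WARP_SIZE 32
#define MAX_THREADS_PER_BLOCK 1024

/* One counter byte per warp (32 slots for a 1024-thread block),
 * standing in for the __shared__ array in the real runtime. */
static unsigned char parallelLevel[MAX_THREADS_PER_BLOCK / WARP_SIZE];

static int warp_id(int tid) { return tid / WARP_SIZE; }

/* All threads of one warp read and update the same slot. */
static void enter_parallel(int tid) { ++parallelLevel[warp_id(tid)]; }
static void exit_parallel(int tid)  { --parallelLevel[warp_id(tid)]; }
```

Because divergent threads within a warp are serialized on the hardware, a single slot per warp is sufficient, which is the core claim of the patch.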

Diff Detail


Event Timeline

ABataev created this revision.Apr 19 2019, 1:46 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 19 2019, 1:46 PM

Why is it enough to have one counter per warp, what happens if threads within a warp diverge? Before D55773 we had a counter per thread...

ABataev updated this revision to Diff 196112.Apr 22 2019, 12:04 PM

Updated the test after the fix + outlined warp/lane id operations into functions.

Why is it enough to have one counter per warp, what happens if threads within a warp diverge? Before D55773 we had a counter per thread...

A per-thread counter significantly affects performance: it requires a class allocated in global memory, with a dynamically controlled queue to manage that object. Doru removed this at the cost of some functionality. At the moment, that solution does not work correctly even with simple warp divergence; to handle L2 parallelism correctly you must compile your code with full runtime support. This patch fixes that problem. It supports divergence between warps, and it also fixes thread divergence within a warp: since divergent threads are serialized, it is fine to keep the parallelism-level counter on a per-warp basis. The test demonstrates thread divergence and produces the correct result in SPMD no-runtime mode.
We could use a per-thread counter, but that would require 1K of shared memory. This solution saves shared memory by using only 32 bytes per block.
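The shared-memory cost comparison above works out as follows, assuming one byte per counter and the CUDA maximum of 1024 threads per block (the element type is not shown in this archive, so the byte counts are illustrative):

```c
#include <assert.h>

#define WARP_SIZE 32
#define MAX_THREADS_PER_BLOCK 1024

/* Per-thread design: one counter for each of the 1024 threads -> 1K. */
static const unsigned perThreadBytes = MAX_THREADS_PER_BLOCK;
/* Per-warp design: one counter for each of the 32 warps -> 32 bytes. */
static const unsigned perWarpBytes   = MAX_THREADS_PER_BLOCK / WARP_SIZE;
```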

jdoerfert added inline comments.Apr 22 2019, 12:51 PM
18 ↗(On Diff #196112)

I'm confused by the ternary operator (also the one below).

If the first target thread executes this critical region, isHost is -1 and it will be set to omp_is_initial_device(). The second thread comes along and will set it to omp_is_initial_device because it was, at any point in time, either -1 or omp_is_initial_device. And so on. So why the ternary operator?
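The test source itself is not preserved in this archive, so the exact expression jdoerfert is questioning is unknown. A plausible reading of the description, with a stub standing in for `omp_is_initial_device()` (here hard-wired to 0, i.e. "on the device"), is:

```c
#include <assert.h>

/* Stub for omp_is_initial_device(); pretend we run on the device. */
static int stub_is_initial_device(void) { return 0; }

/* Assumed shape of the ternary under discussion: the first thread
 * (isHost == -1) takes the call's value, later threads AND it in. */
static int update_isHost(int isHost) {
  return isHost < 0 ? stub_is_initial_device()
                    : (isHost && stub_is_initial_device());
}
```

Under this reading, every thread re-evaluates the call, which is exactly the redundancy the comment asks about.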

27 ↗(On Diff #196112)

I don't get the calculation in the comment. Could you please elaborate?

In general, I'm confused because you rewrite the test again.
Was the old test broken? If not, why don't we keep both? If it was, can you explain why?

I have more and more the feeling changes are added and modified faster than anyone actually understands them, leading to "tests" that simply verify the current behavior.

30 ↗(On Diff #196112)

Why do we need this ternary operator, shouldn't ParallelLevel2 always be set to L2?

38 ↗(On Diff #196112)

As far as I can tell, a value of -1 for isHost will not cause the test to fail. (bool)-1 is true and you never verify the "Runtime error" is not printed.

ABataev marked 4 inline comments as done.Apr 22 2019, 1:06 PM
ABataev added inline comments.
18 ↗(On Diff #196112)

Because it is a test. If omp_is_initial_device() returns an unexpected value at least once, isHost will be set to 1 instead of the expected 0.

27 ↗(On Diff #196112)

The first version of the test did not test the required functionality: SPMD mode without runtime. I fixed the test and committed it to test the construct that supports SPMD + no runtime. This version of the test goes further and adds testing of the parallel level with thread divergence.
The calculations are simple (see the comment below). We have a loop that will be executed 4 times: since the outer loop iterates from 0 to 10 with static,1 scheduling and we have the condition omp_get_thread_num() > 5, only 4 threads execute the inner parallel for construct (10 is the number of iterations in the outer loop, 6 is the first thread that executes the loop; I need to fix the expression from 9-5 to 10-6 in the comment).
Each inner loop iterates 10 times, and at the second level of parallelism omp_get_level() must return 2. Thus, the expected result is 4*10*2 = 80.
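The count described above can be checked with a plain-C simulation (no OpenMP runtime needed): 10 outer iterations mapped one per thread by static,1 scheduling, only threads with `omp_get_thread_num() > 5` running the inner 10-iteration loop, and `omp_get_level()` contributing 2 per inner iteration.

```c
#include <assert.h>

/* Serial simulation of the reduction in the test described above. */
static int expected_result(void) {
  int result = 0;
  for (int tid = 0; tid < 10; ++tid)  /* one outer iteration per thread */
    if (tid > 5)                      /* threads 6..9: 4 threads */
      for (int i = 0; i < 10; ++i)    /* inner loop: 10 iterations */
        result += 2;                  /* omp_get_level() == 2 here */
  return result;
}
```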

30 ↗(On Diff #196112)

Same, it is test, it checks that ParallelLevel2 inall required threads set correctly to 2, not 3, 1, 0 or something else.

38 ↗(On Diff #196112)

It will, since expected result is Target region executed on the device. With isHost equal to -1 the result will be Target region executed on the host

ABataev updated this revision to Diff 196125.Apr 22 2019, 1:08 PM

Fixed the comments for the expected results.

jdoerfert added inline comments.Apr 22 2019, 1:48 PM
18 ↗(On Diff #196112)

I don't think it makes sense to complicate this test case by checking for the behavior of omp_is_initial_device.

Also, your method of determining "at least once returns an unexpected value" for omp_is_initial_device is flawed. I sketched a path below, but I don't think it is the only way to get the "correct" result even though the omp_is_initial_device implementation is, for some reason, broken. Finally, setting isHost to 1 is not distinguishable from running on the initial device. So in a test case for omp_is_initial_device we should probably be able to distinguish these cases.

Let's say we are on the device, so omp_is_initial_device should be false (=0), and we run with 4 threads. Now, an alternating omp_is_initial_device would still result in isHost = 0:

[table: thread # · isHost · omp_is_initial_device() (see _below_) — the row values were lost in this archive]

_below_: If there are two values the first is the result of the call in the condition, the second in the consequence.
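Since the table rows did not survive archiving, the scenario can be reconstructed as a simulation: four serialized device threads and a broken `omp_is_initial_device()` that alternates 0/1 instead of always returning 0. The update expression is an assumption (the actual test code is not shown here), matching the shape discussed earlier in the thread.

```c
#include <assert.h>

static int call_count;
/* Broken stub: alternates 0,1,0,1,... instead of always 0 on the device. */
static int broken_is_initial_device(void) { return call_count++ % 2; }

/* Runs the assumed update for four serialized device threads. */
static int final_isHost(void) {
  int isHost = -1;
  for (int t = 0; t < 4; ++t)
    isHost = isHost < 0 ? broken_is_initial_device()
                        : (isHost && broken_is_initial_device());
  return isHost;
}
```

The first thread stores 0, and `&&` short-circuits for every later thread, so isHost still ends up 0 even though the stub is broken — which is jdoerfert's point that the check cannot detect such a failure.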

27 ↗(On Diff #196112)

need to fix the expression from 9-5 to 10-6 in the comment

That helps a lot, thanks.

30 ↗(On Diff #196112)

Same as above, this doesn't test much.

ABataev updated this revision to Diff 196251.Apr 23 2019, 7:56 AM

Fixed test checks.

ABataev updated this revision to Diff 196511.Apr 24 2019, 1:22 PM

Update to top.

Other comments?

grokos accepted this revision.Apr 26 2019, 12:18 PM

I think it looks good now.

This revision is now accepted and ready to land.Apr 26 2019, 12:18 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptApr 26 2019, 12:28 PM