
ABataev (Alexey Bataev)
User

Projects

User does not belong to any projects.

User Details

User Since
Jan 31 2013, 4:40 AM (333 w, 6 d)

Recent Activity

Today

ABataev added inline comments to D59474: [OpenMP 5.0] Codegen support for user-defined mappers.
Wed, Jun 26, 1:34 PM · Restricted Project, Restricted Project
ABataev added inline comments to D59474: [OpenMP 5.0] Codegen support for user-defined mappers.
Wed, Jun 26, 1:20 PM · Restricted Project, Restricted Project
ABataev added inline comments to D59474: [OpenMP 5.0] Codegen support for user-defined mappers.
Wed, Jun 26, 12:59 PM · Restricted Project, Restricted Project
ABataev added inline comments to D59474: [OpenMP 5.0] Codegen support for user-defined mappers.
Wed, Jun 26, 12:12 PM · Restricted Project, Restricted Project
ABataev added inline comments to D59474: [OpenMP 5.0] Codegen support for user-defined mappers.
Wed, Jun 26, 10:14 AM · Restricted Project, Restricted Project
ABataev updated the diff for D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

Simplified constraint.

Wed, Jun 26, 9:01 AM · Restricted Project
ABataev added inline comments to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..
Wed, Jun 26, 8:56 AM · Restricted Project
ABataev updated the diff for D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

Use inline assembler to implement volatile load/stores for parallelLevel.

Wed, Jun 26, 7:55 AM · Restricted Project
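For context on the change described above, here is a minimal sketch of what volatile loads/stores via inline PTX assembly can look like in CUDA device code. The helper names and the use of generic addressing are assumptions for illustration; they are not taken from the actual patch.

// Hypothetical helpers issuing ld.volatile/st.volatile through inline asm.
static __device__ unsigned loadVolatileU32(const unsigned *Addr) {
  unsigned Val;
  asm volatile("ld.volatile.u32 %0, [%1];" : "=r"(Val) : "l"(Addr) : "memory");
  return Val;
}

static __device__ void storeVolatileU32(unsigned *Addr, unsigned Val) {
  asm volatile("st.volatile.u32 [%0], %1;" : : "l"(Addr), "r"(Val) : "memory");
}

This keeps the accesses visible to other threads and prevents ptxas from forwarding or reordering them, without depending on how the compiler lowers the C++ volatile qualifier.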

Yesterday

ABataev added inline comments to D59474: [OpenMP 5.0] Codegen support for user-defined mappers.
Tue, Jun 25, 2:53 PM · Restricted Project, Restricted Project
ABataev added inline comments to D59474: [OpenMP 5.0] Codegen support for user-defined mappers.
Tue, Jun 25, 12:50 PM · Restricted Project, Restricted Project
ABataev added a comment to D59474: [OpenMP 5.0] Codegen support for user-defined mappers.

Sorry for the delay, Lingda. I tried to find a better solution for this.

Tue, Jun 25, 10:54 AM · Restricted Project, Restricted Project
ABataev committed rGa90fc6617fb5: [OPENMP]Fix PR41966: type mismatch in runtime functions. (authored by ABataev).
[OPENMP]Fix PR41966: type mismatch in runtime functions.
Tue, Jun 25, 9:05 AM
ABataev committed rL364327: [OPENMP]Fix PR41966: type mismatch in runtime functions..
[OPENMP]Fix PR41966: type mismatch in runtime functions.
Tue, Jun 25, 9:04 AM

Mon, Jun 24

ABataev committed rGdb26bcda8cbb: [OPENMP]Relax the test checks to pacify 32bit buildbots, NFC. (authored by ABataev).
[OPENMP]Relax the test checks to pacify 32bit buildbots, NFC.
Mon, Jun 24, 8:33 AM
ABataev committed rL364189: [OPENMP]Relax the test checks to pacify 32bit buildbots, NFC..
[OPENMP]Relax the test checks to pacify 32bit buildbots, NFC.
Mon, Jun 24, 8:33 AM
ABataev added inline comments to D62397: [OPENMP][NVPTX]Relax flush directive..
Mon, Jun 24, 8:07 AM · Restricted Project

Fri, Jun 21

ABataev added a comment to D63664: [SLPVectorizer] Precommit of supernode.ll test for D63661.

You can request commit access.

Fri, Jun 21, 12:36 PM · Restricted Project
ABataev committed rG0f21507b4478: [OPENMP]Fix PR42068: Vla type is not captured. (authored by ABataev).
[OPENMP]Fix PR42068: Vla type is not captured.
Fri, Jun 21, 10:26 AM
ABataev committed rL364080: [OPENMP]Fix PR42068: Vla type is not captured..
[OPENMP]Fix PR42068: Vla type is not captured.
Fri, Jun 21, 10:25 AM
ABataev updated the diff for D62397: [OPENMP][NVPTX]Relax flush directive..

Reworked test.

Fri, Jun 21, 9:52 AM · Restricted Project
ABataev added inline comments to D62397: [OPENMP][NVPTX]Relax flush directive..
Fri, Jun 21, 9:04 AM · Restricted Project
ABataev committed rL364061: [OPENMP]Fix PR42159: do not capture threadprivate variables..
[OPENMP]Fix PR42159: do not capture threadprivate variables.
Fri, Jun 21, 8:07 AM
ABataev committed rGe0eb66bbff5e: [OPENMP]Fix PR42159: do not capture threadprivate variables. (authored by ABataev).
[OPENMP]Fix PR42159: do not capture threadprivate variables.
Fri, Jun 21, 8:07 AM

Thu, Jun 20

ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.

This is not a problem inside LLVM. The problem appears after optimizations performed by the ptxas tool (when it compiles PTX to SASS) at O3 with the inlined runtime.

I extracted the test case (see below) but I was not seeing the ERROR. How did you run the test case to see a different value for Count?

You need to compile it with the inlined runtime at O2 or O3.

When I run
./bin/clang -fopenmp-targets=nvptx64-nvidia-cuda -O3 -fopenmp --cuda-path=/soft/compilers/cuda/cuda-9.1.85 -Xopenmp-target -march=sm_70 -fopenmp=libomp test.c -o test.ll -emit-llvm -S
I get

https://gist.github.com/jdoerfert/4376a251d98171326d625f2fb67b5259

which shows the inlined and optimized libomptarget.

And you need the latest version of the libomptarget

My version is from today Jun 13 15:24:11 2019, git: 3bc6e2a7aa3853b06045c42e81af094647c48676

We have problems in Cuda 8, at least, for arch sm_35

I couldn't get that version to run properly so I asked someone who had a system set up.
Unfortunately, the test.c [1] did not trigger the problem. In test.c we run the new test part in spmd_parallel_regions.cpp 1000 times and check the result each time.
It was run with Cuda 8.0 for sm_35, sm_37, and sm_70.

Could you share more information on how the system has to look to trigger the problem?
Could you take a look at the test case we run and make sure it triggers the problem on your end?

[1] https://gist.github.com/jdoerfert/d2b18ca8bb5c3443cc1d26b23236866f

You need to apply the patch D62318 to reproduce the problem for sure.

This means the problem, as of right now, does not exist, correct?

No, it still might appear, but it is much harder to run into trouble with the current version of the runtime.

If so, what part of the D62318 patch is causing the problem?

I significantly reduced the size of the runtime class, which triggers some of the optimizations more often. The access to the parallelLevel variable, when we check the current parallel level to get the correct thread ID, is what triggers those optimizations.

Does the test.c that I floated earlier expose the problem then or do I need a different test case?
What configuration are you running? Is it reproducible with Cuda 9/10 and sm_70?

Yes, it exposes the problem, but only with D62318 applied. I am not sure about Cuda 9; I will try to check this later today.

Thu, Jun 20, 11:26 AM · Restricted Project
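For readers without access to the gists referenced above, the following is a rough, hypothetical sketch of the kind of reproducer described in this thread (nested parallel regions inside a target region, executed many times with the result checked each iteration). It is not the actual test.c or spmd_parallel_regions.cpp.

#include <omp.h>
#include <stdio.h>

int main(void) {
  int Fails = 0;
  for (int Iter = 0; Iter < 1000; ++Iter) {
    int Count = 0, Threads = 0;
    #pragma omp target map(tofrom : Count, Threads)
    #pragma omp parallel num_threads(64)
    {
      if (omp_get_thread_num() == 0)
        Threads = omp_get_num_threads();
      // The inner region is inactive (one thread) but still updates the
      // device runtime's parallel-level bookkeeping, which is what races.
      #pragma omp parallel num_threads(1)
      if (omp_get_level() == 2) {
        #pragma omp atomic
        ++Count;
      }
    }
    if (Count != Threads) {
      printf("ERROR: iteration %d: Count = %d, expected %d\n", Iter, Count,
             Threads);
      ++Fails;
    }
  }
  return Fails;
}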
ABataev accepted D63108: [OpenMP] Add support for handling declare target to clause when unified memory is required.

LG

Thu, Jun 20, 9:30 AM · Restricted Project, Restricted Project
ABataev added inline comments to D63108: [OpenMP] Add support for handling declare target to clause when unified memory is required.
Thu, Jun 20, 9:15 AM · Restricted Project, Restricted Project
ABataev added inline comments to D63108: [OpenMP] Add support for handling declare target to clause when unified memory is required.
Thu, Jun 20, 7:41 AM · Restricted Project, Restricted Project

Wed, Jun 19

ABataev resigned from D63009: [OpenMP] Add target task alloc function with device ID.
Wed, Jun 19, 12:27 PM · Restricted Project, Restricted Project, Restricted Project
ABataev added inline comments to D59474: [OpenMP 5.0] Codegen support for user-defined mappers.
Wed, Jun 19, 11:56 AM · Restricted Project, Restricted Project
ABataev added inline comments to D63108: [OpenMP] Add support for handling declare target to clause when unified memory is required.
Wed, Jun 19, 11:28 AM · Restricted Project, Restricted Project
ABataev closed D60583: [AArch64] Implement Vector Funtion ABI name mangling..
Wed, Jun 19, 11:02 AM · Restricted Project, Restricted Project
ABataev added inline comments to D59474: [OpenMP 5.0] Codegen support for user-defined mappers.
Wed, Jun 19, 10:34 AM · Restricted Project, Restricted Project
ABataev updated the diff for D62397: [OPENMP][NVPTX]Relax flush directive..

Address comments.

Wed, Jun 19, 8:17 AM · Restricted Project
ABataev updated the diff for D62398: [OPENMP][NVPTX]Perform memory flush if number of threads to sync is 1 or less..

Addressed comments.

Wed, Jun 19, 8:09 AM · Restricted Project
ABataev added inline comments to D62397: [OPENMP][NVPTX]Relax flush directive..
Wed, Jun 19, 8:08 AM · Restricted Project
ABataev committed rG8a2bd361eb62: [OPENMP][CUDA]Use __syncthreads when compiled by nvcc and clang >= 9.0. (authored by ABataev).
[OPENMP][CUDA]Use __syncthreads when compiled by nvcc and clang >= 9.0.
Wed, Jun 19, 7:19 AM
ABataev committed rL363807: [OPENMP][CUDA]Use __syncthreads when compiled by nvcc and clang >= 9.0..
[OPENMP][CUDA]Use __syncthreads when compiled by nvcc and clang >= 9.0.
Wed, Jun 19, 7:18 AM
ABataev closed D63515: [OPENMP][CUDA]Use __syncthreads when compiled by nvcc and clang >= 9.0..
Wed, Jun 19, 7:18 AM · Restricted Project, Restricted Project
ABataev accepted D63454: [OpenMP] Strengthen regression tests for task allocation under nowait depend clauses NFC.

LG

Wed, Jun 19, 6:43 AM · Restricted Project, Restricted Project
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.

This is not a problem inside LLVM. The problem appears after optimizations performed by the ptxas tool (when it compiles PTX to SASS) at O3 with the inlined runtime.

I extracted the test case (see below) but I was not seeing the ERROR. How did you run the test case to see a different value for Count?

You need to compile it with the inlined runtime at O2 or O3.

When I run
./bin/clang -fopenmp-targets=nvptx64-nvidia-cuda -O3 -fopenmp --cuda-path=/soft/compilers/cuda/cuda-9.1.85 -Xopenmp-target -march=sm_70 -fopenmp=libomp test.c -o test.ll -emit-llvm -S
I get

https://gist.github.com/jdoerfert/4376a251d98171326d625f2fb67b5259

which shows the inlined and optimized libomptarget.

And you need the latest version of the libomptarget

My version is from today Jun 13 15:24:11 2019, git: 3bc6e2a7aa3853b06045c42e81af094647c48676

We have problems in Cuda 8, at least, for arch sm_35

I couldn't get that version to run properly so I asked someone who had a system set up.
Unfortunately, the test.c [1] did not trigger the problem. In test.c we run the new test part in spmd_parallel_regions.cpp 1000 times and check the result each time.
It was run with Cuda 8.0 for sm_35, sm_37, and sm_70.

Could you share more information on how the system has to look to trigger the problem?
Could you take a look at the test case we run and make sure it triggers the problem on your end?

[1] https://gist.github.com/jdoerfert/d2b18ca8bb5c3443cc1d26b23236866f

You need to apply the patch D62318 to reproduce the problem for sure.

This means the problem, as of right now, does not exist, correct?

Wed, Jun 19, 1:56 AM · Restricted Project

Tue, Jun 18

ABataev created D63515: [OPENMP][CUDA]Use __syncthreads when compiled by nvcc and clang >= 9.0..
Tue, Jun 18, 1:45 PM · Restricted Project, Restricted Project
ABataev committed rG9f3a805ee96b: [OPENMP]Use host's mangling for 128 bit float types on the device. (authored by ABataev).
[OPENMP]Use host's mangling for 128 bit float types on the device.
Tue, Jun 18, 1:26 PM
ABataev committed rL363734: [OPENMP]Use host's mangling for 128 bit float types on the device..
[OPENMP]Use host's mangling for 128 bit float types on the device.
Tue, Jun 18, 1:26 PM
ABataev committed rG7ae267dc0f34: [OPENMP][NVPTX]Correct codegen for 128 bit long double. (authored by ABataev).
[OPENMP][NVPTX]Correct codegen for 128 bit long double.
Tue, Jun 18, 12:03 PM
ABataev committed rL363720: [OPENMP][NVPTX]Correct codegen for 128 bit long double..
[OPENMP][NVPTX]Correct codegen for 128 bit long double.
Tue, Jun 18, 12:01 PM
ABataev committed rG8557d1ac9884: [OPENMP]Use host's long double when compiling the code for device. (authored by ABataev).
[OPENMP]Use host's long double when compiling the code for device.
Tue, Jun 18, 11:39 AM
ABataev committed rL363717: [OPENMP]Use host's long double when compiling the code for device..
[OPENMP]Use host's long double when compiling the code for device.
Tue, Jun 18, 11:38 AM
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.

This is not a problem inside LLVM. The problem appears after optimizations performed by the ptxas tool (when it compiles PTX to SASS) at O3 with the inlined runtime.

I extracted the test case (see below) but I was not seeing the ERROR. How did you run the test case to see a different value for Count?

You need to compile it with the inlined runtime at O2 or O3.

When I run
./bin/clang -fopenmp-targets=nvptx64-nvidia-cuda -O3 -fopenmp --cuda-path=/soft/compilers/cuda/cuda-9.1.85 -Xopenmp-target -march=sm_70 -fopenmp=libomp test.c -o test.ll -emit-llvm -S
I get

https://gist.github.com/jdoerfert/4376a251d98171326d625f2fb67b5259

which shows the inlined and optimized libomptarget.

And you need the latest version of the libomptarget

My version is from today Jun 13 15:24:11 2019, git: 3bc6e2a7aa3853b06045c42e81af094647c48676

We have problems in Cuda 8, at least, for arch sm_35

I couldn't get that version to run properly so I asked someone who had a system set up.
Unfortunately, the test.c [1] did not trigger the problem. In test.c we run the new test part in spmd_parallel_regions.cpp 1000 times and check the result each time.
It was run with Cuda 8.0 for sm_35, sm_37, and sm_70.

Could you share more information on how the system has to look to trigger the problem?
Could you take a look at the test case we run and make sure it triggers the problem on your end?

[1] https://gist.github.com/jdoerfert/d2b18ca8bb5c3443cc1d26b23236866f

Tue, Jun 18, 7:19 AM · Restricted Project

Sun, Jun 16

ABataev added a comment to D63009: [OpenMP] Add target task alloc function with device ID.

Am I correct that the second to last revision ("- Fix tests.") removed all checks for the actual device_id argument from the tests? From my point of view that's not fixing but weakening the tests! Can you explain why they needed "fixing"?

When I was just passing the default value the LLVM-IR was: i64 -1 i.e. constant, easy to check.

With the latest change the emitted code is: i64 %123, where %123 is a local value derived from the expression of the device ID.

If the value is constant, check for the constant.

All the tests here use expressions so the value is never constant. If you'd like me to add a test with constant I can.

Sun, Jun 16, 10:29 AM · Restricted Project, Restricted Project, Restricted Project
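To illustrate the constant vs. non-constant device ID cases discussed above, here is a hedged sketch. The runtime entry-point name and the position of the device ID argument are assumptions drawn from this thread, not verified against the patch.

int X;

// Constant device ID: the IR argument is an immediate and is easy to check.
void foo(void) {
  #pragma omp target device(3) nowait depend(out : X)
  X = 1;
}
// e.g. CHECK: call {{.*}}@__kmpc_omp_target_task_alloc({{.*}}, i64 3)

// Device ID computed from an expression: the argument is an SSA value (the
// "i64 %123" case), so the test has to match a pattern instead of a literal.
void bar(int Dev) {
  #pragma omp target device(Dev + 1) nowait depend(out : X)
  X = 2;
}
// e.g. CHECK: call {{.*}}@__kmpc_omp_target_task_alloc({{.*}}, i64 %{{.+}})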

Sat, Jun 15

ABataev added a comment to D63009: [OpenMP] Add target task alloc function with device ID.

Am I correct that the second to last revision ("- Fix tests.") removed all checks for the actual device_id argument from the tests? From my point of view that's not fixing but weakening the tests! Can you explain why they needed "fixing"?

When I was just passing the default value the LLVM-IR was: i64 -1 i.e. constant, easy to check.

With the latest change the emitted code is: i64 %123, where %123 is a local value derived from the expression of the device ID.

Sat, Jun 15, 1:58 PM · Restricted Project, Restricted Project, Restricted Project
ABataev requested changes to D63009: [OpenMP] Add target task alloc function with device ID.

The tests must check the device ID for target-based calls of the task alloc function.

Sat, Jun 15, 1:25 AM · Restricted Project, Restricted Project, Restricted Project
ABataev reopened D63009: [OpenMP] Add target task alloc function with device ID.
Sat, Jun 15, 1:25 AM · Restricted Project, Restricted Project, Restricted Project
ABataev added a comment to D63009: [OpenMP] Add target task alloc function with device ID.

Am I correct that the second to last revision ("- Fix tests.") removed all checks for the actual device_id argument from the tests? From my point of view that's not fixing but weakening the tests! Can you explain why they needed "fixing"?

If I had to guess this is because some directives have the device clause, so the tests should check that the generated code passes the correct argument in that case.

Sat, Jun 15, 1:25 AM · Restricted Project, Restricted Project, Restricted Project

Fri, Jun 14

ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.

This is not a problem inside LLVM. The problem appears after optimizations performed by the ptxas tool (when it compiles PTX to SASS) at O3 with the inlined runtime.

I extracted the test case (see below) but I was not seeing the ERROR. How did you run the test case to see a different value for Count?

You need to compile it with the inlined runtime at O2 or O3.

When I run
./bin/clang -fopenmp-targets=nvptx64-nvidia-cuda -O3 -fopenmp --cuda-path=/soft/compilers/cuda/cuda-9.1.85 -Xopenmp-target -march=sm_70 -fopenmp=libomp test.c -o test.ll -emit-llvm -S
I get

https://gist.github.com/jdoerfert/4376a251d98171326d625f2fb67b5259

which shows the inlined and optimized libomptarget.

And you need the latest version of the libomptarget

My version is from today Jun 13 15:24:11 2019, git: 3bc6e2a7aa3853b06045c42e81af094647c48676

We have problems in Cuda 8, at least, for arch sm_35

I couldn't get that version to run properly so I asked someone who had a system set up.
Unfortunately, the test.c [1] did not trigger the problem. In test.c we run the new test part in spmd_parallel_regions.cpp 1000 times and check the result each time.
It was run with Cuda 8.0 for sm_35, sm_37, and sm_70.

Could you share more information on how the system has to look to trigger the problem?
Could you take a look at the test case we run and make sure it triggers the problem on your end?

[1] https://gist.github.com/jdoerfert/d2b18ca8bb5c3443cc1d26b23236866f

Fri, Jun 14, 1:05 PM · Restricted Project
ABataev accepted D63009: [OpenMP] Add target task alloc function with device ID.
Fri, Jun 14, 12:51 PM · Restricted Project, Restricted Project, Restricted Project
ABataev accepted D60883: [OpenMP] Avoid emitting maps for target link variables when unified memory is used.
Fri, Jun 14, 10:47 AM · Restricted Project, Restricted Project
ABataev added a comment to D63108: [OpenMP] Add support for handling declare target to clause when unified memory is required.

Will review it next week, when I'm back to work, need to think about it a little bit.

Fri, Jun 14, 9:34 AM · Restricted Project, Restricted Project
ABataev added inline comments to D60883: [OpenMP] Avoid emitting maps for target link variables when unified memory is used.
Fri, Jun 14, 9:34 AM · Restricted Project, Restricted Project
ABataev added inline comments to D63009: [OpenMP] Add target task alloc function with device ID.
Fri, Jun 14, 8:09 AM · Restricted Project, Restricted Project, Restricted Project
ABataev added inline comments to D63009: [OpenMP] Add target task alloc function with device ID.
Fri, Jun 14, 7:45 AM · Restricted Project, Restricted Project, Restricted Project
ABataev added inline comments to D63009: [OpenMP] Add target task alloc function with device ID.
Fri, Jun 14, 7:29 AM · Restricted Project, Restricted Project, Restricted Project

Thu, Jun 13

ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

...

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me that he does not think that I or my colleagues are acting in good faith.

Okay, I understand why you say this, but no. Frankly, Johannes was implying that the reviews were too lax (e.g., allowing combined patches with insufficient tests/documentation/etc.).

Only one patch was combined.

I'm not sure that this is true. D55379 was also. There might be others. I don't think that it's useful to conduct an audit in this regard. The important point is that there were a number of issues over time, some have been addressed, and others we're addressing right now.

OK, you found another one committed half a year ago. Do you really think you can call it systematic and suspicious?

If there is a systematic problem, then this would be one part among several. Regardless, we both agree that this is something to be avoided, so as far as I'm concerned, this problem has been addressed.

Tests have been added to all the functional patches since I discovered that the testing infrastructure for libomptarget is already configured. And I myself insist on adding test cases for functional patches; you can ask Doru about this.

I'm aware of a lot of the history (and I recall being involved in the discussion in D60578 regarding the tests). Nevertheless, we've somehow ended up with a library without sufficient documentation, and robust code reviews should have prevented that. It's not useful to assign blame, but it is something of which we should all be aware so that we can actively improve the situation going forward.

I already said that we need documentation. But I can't do everything myself. Plus, I'm not the original developer of the library; I'm just doing my best to improve it. Maybe you or Johannes can try to help with this and prepare documentation? That would be a great help!

Johannes and I can help with the documentation.

Thu, Jun 13, 6:20 PM · Restricted Project
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

...

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me that he does not think that I or my colleagues are acting in good faith.

Okay, I understand why you say this, but no. Frankly, Johannes was implying that the reviews were too lax (e.g., allowing combined patches with insufficient tests/documentation/etc.).

Only one patch was combined.

I'm not sure that this is true. D55379 was also. There might be others. I don't think that it's useful to conduct an audit in this regard. The important point is that there were a number of issues over time, some have been addressed, and others we're addressing right now.

Thu, Jun 13, 5:38 PM · Restricted Project
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.

This is not a problem inside LLVM. The problem appears after optimizations performed by the ptxas tool (when it compiles PTX to SASS) at O3 with the inlined runtime.

I extracted the test case (see below) but I was not seeing the ERROR. How did you run the test case to see a different value for Count?

You need to compile it with the inlined runtime at O2 or O3.

When I run
./bin/clang -fopenmp-targets=nvptx64-nvidia-cuda -O3 -fopenmp --cuda-path=/soft/compilers/cuda/cuda-9.1.85 -Xopenmp-target -march=sm_70 -fopenmp=libomp test.c -o test.ll -emit-llvm -S
I get

https://gist.github.com/jdoerfert/4376a251d98171326d625f2fb67b5259

which shows the inlined and optimized libomptarget.

And you need the latest version of the libomptarget

My version is from today Jun 13 15:24:11 2019, git: 3bc6e2a7aa3853b06045c42e81af094647c48676

Thu, Jun 13, 4:15 PM · Restricted Project
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

It makes me suspicious of that too. But no one here believes that anyone was trying to subvert the system and produce an inferior result - it is very likely that everyone had, and continues to have, the best of intentions.

Maybe, just maybe, before starting to treat someone's activity as "suspicious", it would be better to try to understand something first? To read something, to ask questions in the proper manner, etc.?

I did ask questions [0,1] and read *a lot* [2,3].

Here, and in D62199, my comments resulted in:

Thu, Jun 13, 2:28 PM · Restricted Project
ABataev added inline comments to D63009: [OpenMP] Add target task alloc function with device ID.
Thu, Jun 13, 2:09 PM · Restricted Project, Restricted Project, Restricted Project
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.

Thu, Jun 13, 1:46 PM · Restricted Project
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

...

Again, I see no reason to believe that everyone here isn't acting in good faith and working to create the software of the highest possible quality. Thanks!

Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me that he does not think that I or my colleagues are acting in good faith.

Okay, I understand why you say this, but no. Frankly, Johannes was implying that the reviews were too lax (e.g., allowing combined patches with insufficient tests/documentation/etc.).

Thu, Jun 13, 1:38 PM · Restricted Project
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

This is a clarification of some older comments.

First of all, you started this when you expressed the thought that the patches are accepted not because they are correct, but because we all work at the same company. That was rude! Besides, we don't work at the same company anymore.

I do not remember saying this, especially since I know you don't work at the same company.

Second, you need to try to understand the main idea yourself. I can't explain the whole protocol between the compiler and the runtime; it would take a lot of time. Try to understand it first and then ask questions about particular details, but not about the whole scheme. Or do you want me to ask you to explain the principles of the Attributor in full? Or the loop vectorizer? Or something else? Your question is not productive.

That is exactly what happened (wrt. the Attributor): people looked at the code, asked about principles, requested documentation, etc. If you look at the code now, it is all the better for it. So far you have just ignored my requests for clarifications and justifications, which is what this whole code review actually started with.

Third, the single array is used to handle up to 128 (not 256) inner parallel regions. This is fine from the point of view of the standard. The number of supported inner parallel levels (both active and inactive) is implementation-defined.

I asked where we document that a single array encodes both the number of active and the number of inactive parallel regions at the same time. The code is not sufficient documentation for such a low-level implementation detail. Similarly, though previously not on my radar, there is an apparently undocumented implementation detail wrt the number of levels we can handle.

Hal, I agree that it was a very dangerous situation, but I think you need to prove something before trying to escalate the situation and blame somebody for doing incorrect things. Johannes did this without any proof; he directly blamed me and others for doing improper things, though he has no idea at all how things work in Cuda.

I did already provide plenty of arguments in https://reviews.llvm.org/D62199#1515182 and https://reviews.llvm.org/D62199#1517073.

For the record, this code review was (from my side) never about accusations or improper behavior but about:

  1. The right solution to a problem I don't think we specified(*) properly, and
  2. The general lack of documentation that led to various questions on how to interpret the changes.

    (*) My question about which accesses are racy has been ignored, as have my inquiries about an alternative explicit synchronization to communicate the value through all threads in the warp.

    I don't think I have "no idea at all how things work in Cuda" and I don't appreciate the accusation.

Yes, but you need to have at least some basic knowledge about OpenMP itself, Cuda, and the OpenMP runtime. I have no time to write lectures trying to explain basic things to Johannes.

Me, and probably other people, might argue I have some OpenMP/Clang experience.

Yes, you're right. The design decisions must be described somewhere. But I don't think that this should prevent work on improving the runtime.

This seems to me like a core issue. Improvements without documentation will inevitably cause problems down the road, and rushing into a solution is also not a sustainable practice.

Arguably, atomic accesses provide atomicity. You also never responded to the question of which other accesses actually race with the ones to parallelLevel, or why you do not synchronize the accesses to parallelLevel across the warp explicitly.

I answered all the questions. We can use explicit synchronization, yes, but it will slow down the whole program execution. Instead, it is more effective to synchronize the accesses to the single variable. The volatile modifier allows us to do this without explicit atomic operations. In Cuda, volatile is slightly different from volatile in C/C++ and can be used to implement relaxed atomic operations.
In this array, the memory is written by one thread but can be read by other threads in the same warp. Without the volatile modifier or atomic ops, the ptxas tool reorders operations, which leads to undefined behaviour when the runtime is inlined.

According to D50391, explicit relaxed accesses are lowered into volatile PTX accesses. Did you actually see a slowdown, or did you simply expect one?

If somebody tries to review the code, the review must be productive. If the review turns into a set of lectures about language details, it is not productive; those are school lectures. Should I explain the Cuda language in a review? No. To review the code you need to have at least some minimal level of expertise.

I never asked for a school lecture, nor is it appropriate that you think you have to give me one instead of actually answering my questions and commenting on my concerns. Simply questioning my expertise and therefore ignoring my concerns is not helping anyone.

Hal, @hfinkel, do you really want me to continue with all this stuff?

Yes, there are outstanding technical questions; from @jdoerfert :

I'm also still not following what race condition you try to prevent with volatile in the first place, could you elaborate which accesses are racing without it?
Thu, Jun 13, 10:58 AM · Restricted Project
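For reference, the explicit warp-level synchronization alternative raised in this exchange could look roughly like the sketch below (CUDA 9+ for __syncwarp; the per-warp array layout and names are assumptions). Whether this is actually slower than the volatile approach is exactly the point under debate.

__shared__ unsigned parallelLevel[32]; // one slot per warp (assumed layout)

__device__ unsigned bumpAndReadLevel(bool IsWarpMaster) {
  const unsigned WarpId = threadIdx.x / warpSize;
  if (IsWarpMaster)
    ++parallelLevel[WarpId]; // single writer per warp
  __syncwarp();              // publish the update to the rest of the warp
  return parallelLevel[WarpId];
}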
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

No, I'm not relying on the non-optimization of atomics. I need both volatile semantics and atomicity, so that the compiler cannot optimize out the memory accesses and the accesses are atomic.

I'm going to try to improve the precision on that statement, if you'll allow me.

Since the address at &parallelLevel is necessarily backed by real memory (I mean RAM, and not PCI registers) because it is an instance introduced by C++ code, then many optimizations are not observable regardless of whatever decorations are applied to the declaration. For example x = *ptr; y = *ptr; can always be simplified to x = *ptr; y = x; when you know that ptr points to memory (and some other conditions, like the set of all y = ... simplified is finite). There is a long list of such simplifications based on "run in finite isolation" substitutions (*ptr = x; y = *ptr; => *ptr = x; y = x;... etc...).

To be clear: there is nothing you can do to prevent ptxas from performing optimizations such as these, even on accesses marked volatile, so the value of the predicate "could not optimize memory accesses" is just hardcoded to false, full stop.

The specific optimizations you don't want, are the optimizations that aren't valid on std::atomic<T> with memory_order_relaxed, for example performing substitutions across fences, barriers, and in potentially infinite loops. And again, the final conclusion is completely valid, you disable those optimizations with *.relaxed.sys or *.volatile in PTX. You just don't disable all optimizations.

Thu, Jun 13, 10:29 AM · Restricted Project
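A small C++ illustration of the "run in finite isolation" substitutions described above, and of the different class of optimizations that relaxed atomics actually rule out:

// Two loads from the same memory-backed address with no intervening store:
// a compiler may forward the first value to the second use.
unsigned readTwice(const unsigned *Ptr) {
  unsigned X = *Ptr;
  unsigned Y = *Ptr; // may be simplified to: unsigned Y = X;
  return X + Y;
}

// What relaxed-atomic semantics forbid is different in kind: forwarding
// across fences/barriers or out of potentially infinite loops.
void spinUntilSet(const unsigned *Flag) {
  while (*Flag == 0) {
    // With a plain load the compiler may hoist *Flag out of the loop,
    // turning this into an infinite loop; a relaxed atomic (or volatile)
    // load must be performed on every iteration.
  }
}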
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..
In D62393#1542042, @tra wrote:
In D62393#1541969, @tra wrote:

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Do I understand it correctly that the argument is about using C++ volatile with the assumption that those will map to ld.volatile/st.volatile, which happen to provide sufficiently strong guarantees to be equivalent of LLVM's atomic monotonic operations?

If that's the case, then I would agree that it's an implementation detail one should not be relying on. If atomicity is needed, we should figure out a way to express that in C++.

In practice, the standard C++ library support in cuda is somewhat limited. I don't think atomic<> works, so volatile might be a 'happens to work' short-term workaround. If that's the case, then there should a comment describing what's going on here and TODO to fix it when better support for atomics on GPU is available.

Another option would be to use atomic*() functions provided by CUDA, but those do have limited functionality on older GPUs.

Yet another alternative is to explicitly use ld.volatile/st.volatile from inline assembly. I don't think we have it plumbed as clang builtin.

Artem, thanks for your comment. But we need not only atomicity; we also need to tell the ptxas compiler not to optimize accesses to parallelLevel.

So you need to have ld.volatile/st.volatile to make sure compiler does not mess with accesses to shared memory.
C++ volatile will give you that. You also rely on atomicity. C++ volatile does not guarantee that, even if NVPTX does happen to. It's a mere coincidence. What if next PTX revision provides a better way to implement C++ volatile without providing atomicity? Then your code will all of a sudden break in new and exciting ways.

I guess the conservative approach here would be to declare the variable volatile, and then use inline asm ld.volatile/st.volatile to increment it. This will guarantee that the code does exactly what you want without assuming that LLVM back-end always lowers volatile to ld/st.volatile.

Thu, Jun 13, 10:23 AM · Restricted Project
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..
In D62393#1541969, @tra wrote:

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Do I understand it correctly that the argument is about using C++ volatile with the assumption that those will map to ld.volatile/st.volatile, which happen to provide sufficiently strong guarantees to be equivalent of LLVM's atomic monotonic operations?

If that's the case, then I would agree that it's an implementation detail one should not be relying on. If atomicity is needed, we should figure out a way to express that in C++.

In practice, the standard C++ library support in cuda is somewhat limited. I don't think atomic<> works, so volatile might be a 'happens to work' short-term workaround. If that's the case, then there should a comment describing what's going on here and TODO to fix it when better support for atomics on GPU is available.

Another option would be to use atomic*() functions provided by CUDA, but those do have limited functionality on older GPUs.

Yet another alternative is to explicitly use ld.volatile/st.volatile from inline assembly. I don't think we have it plumbed as clang builtin.

Artem, thanks for your comment. But we need not only atomicity; we also need to tell the ptxas compiler not to optimize accesses to parallelLevel. According to several comments (like this one: https://stackoverflow.com/a/1533115), it is recommended to use the volatile modifier in Cuda in such cases. If clang does not provide the required level of support for Cuda volatile, I think this is an incompatibility with nvcc.
Also, I already thought about using PTX instructions directly. Probably, you're right that it would be better to use them if you're sure that there is a difference between nvcc and clang.

The quoted stack overflow comment doesn't seem like the one you intended. Seems like a mispaste.

When you say "we need to command the ptxas compiler to not optimize accesses to parallelLevel", this is not the best way to describe what you actually need here. I think you're channeling the belief that compilers don't optimize accesses to std::atomic<T>, and explaining what you need in terms of that belief. It seems that any optimization that could be performed (and some are indeed performed!) on std::atomic<T> with memory_order_relaxed is valid to perform on the accesses to parallelLevel as well, because that is really what it should be.

(If that is not the case, then now would be a good time to say it.)

The next conclusion is completely correct: you get that very semantic from the ptxas compiler by using either *.relaxed.sys or *.volatile accesses. Both have the same meaning as accesses std::atomic<T> with memory_order_relaxed in modern PTX, whilst nothing in modern PTX carries a true "no optimizations" meaning.

Lastly, I can confirm that Clang definitely supports inline PTX asm, I've used it, with these instructions specifically.

Thu, Jun 13, 10:01 AM · Restricted Project
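As a concrete reference for the *.relaxed.sys vs. *.volatile accesses mentioned above, a hedged CUDA sketch; ld.relaxed.sys requires PTX ISA 6.0 / sm_70 or newer, so the older ld.volatile form is used as the fallback here.

__device__ unsigned loadRelaxedU32(const unsigned *Addr) {
  unsigned Val;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700
  asm volatile("ld.relaxed.sys.u32 %0, [%1];" : "=r"(Val) : "l"(Addr) : "memory");
#else
  asm volatile("ld.volatile.u32 %0, [%1];" : "=r"(Val) : "l"(Addr) : "memory");
#endif
  return Val;
}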
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..
In D62393#1541969, @tra wrote:

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Do I understand it correctly that the argument is about using C++ volatile with the assumption that those will map to ld.volatile/st.volatile, which happen to provide sufficiently strong guarantees to be equivalent of LLVM's atomic monotonic operations?

If that's the case, then I would agree that it's an implementation detail one should not be relying on. If atomicity is needed, we should figure out a way to express that in C++.

In practice, the standard C++ library support in cuda is somewhat limited. I don't think atomic<> works, so volatile might be a 'happens to work' short-term workaround. If that's the case, then there should a comment describing what's going on here and TODO to fix it when better support for atomics on GPU is available.

Another option would be to use atomic*() functions provided by CUDA, but those do have limited functionality on older GPUs.

Yet another alternative is to explicitly use ld.volatile/st.volatile from inline assembly. I don't think we have it plumbed as clang builtin.

Artem, thanks for your comment. But we need not only atomicity; we also need to tell the ptxas compiler not to optimize accesses to parallelLevel. According to several comments (like this one: https://stackoverflow.com/a/1533115), it is recommended to use the volatile modifier in Cuda in such cases. If clang does not provide the required level of support for Cuda volatile, I think this is an incompatibility with nvcc.
Also, I already thought about using PTX instructions directly. Probably, you're right that it would be better to use them if you're sure that there is a difference between nvcc and clang.

The quoted stack overflow comment doesn't seem like the one you intended. Seems like a mispaste.

When you say "we need to command the ptxas compiler to not optimize accesses to parallelLevel", this is not the best way to describe what you actually need here. I think you're channeling the belief that compilers don't optimize accesses to std::atomic<T>, and explaining what you need in terms of that belief. It seems that any optimization that could be performed (and some are indeed performed!) on std::atomic<T> with memory_order_relaxed is valid to perform on the accesses to parallelLevel as well, because that is really what it should be.

(If that is not the case, then now would be a good time to say it.)

The next conclusion is completely correct: you get that very semantic from the ptxas compiler by using either *.relaxed.sys or *.volatile accesses. Both have the same meaning as accesses std::atomic<T> with memory_order_relaxed in modern PTX, whilst nothing in modern PTX carries a true "no optimizations" meaning.

Lastly, I can confirm that Clang definitely supports inline PTX asm, I've used it, with these instructions specifically.

Thu, Jun 13, 9:55 AM · Restricted Project
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..
In D62393#1541969, @tra wrote:

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Do I understand it correctly that the argument is about using C++ volatile with the assumption that those will map to ld.volatile/st.volatile, which happen to provide sufficiently strong guarantees to be equivalent of LLVM's atomic monotonic operations?

If that's the case, then I would agree that it's an implementation detail one should not be relying on. If atomicity is needed, we should figure out a way to express that in C++.

In practice, the standard C++ library support in cuda is somewhat limited. I don't think atomic<> works, so volatile might be a 'happens to work' short-term workaround. If that's the case, then there should a comment describing what's going on here and TODO to fix it when better support for atomics on GPU is available.

Another option would be to use atomic*() functions provided by CUDA, but those do have limited functionality on older GPUs.

Yet another alternative is to explicitly use ld.volatile/st.volatile from inline assembly. I don't think we have it plumbed as clang builtin.

Thu, Jun 13, 9:30 AM · Restricted Project

Wed, Jun 12

ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

Trying to catch up on all the comments, replies to some:

I would have assumed omp_in_parallel to return true if the parallelLevel is not the default (0 or 1). Why isn't that so?

According to the standard, it should return 1 only if we are in an active parallel region. An active parallel region is a region with a number of threads greater than 1.

I don't agree with this statement and its implication that we need to track both levels and active-levels as I already argued in D61380: In my understanding of the standard, omp_in_parallel should return true iff it's nested somewhere in an active parallel region.

Wed, Jun 12, 8:56 AM · Restricted Project
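A small OpenMP example of the distinction being debated here: an inactive region has a team of one thread, and the disagreement is over whether omp_in_parallel() should report it. The expected values in the comments assume the second region is granted more than one thread.

#include <omp.h>
#include <stdio.h>

int main(void) {
  #pragma omp parallel num_threads(1) // inactive: team of one thread
  printf("level=%d active=%d in_parallel=%d\n", omp_get_level(),
         omp_get_active_level(), omp_in_parallel()); // 1 0 0

  #pragma omp parallel num_threads(4) // active: more than one thread
  #pragma omp master
  printf("level=%d active=%d in_parallel=%d\n", omp_get_level(),
         omp_get_active_level(), omp_in_parallel()); // 1 1 1
  return 0;
}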
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

This is a clarification of some older comments.

First of all, you started this when you expressed the thought that the patches are accepted not because they are correct, but because we all work at the same company. That was rude! Besides, we don't work at the same company anymore.

I do not remember saying this, especially since I know you don't work at the same company.

Second, you need to try to understand the main idea yourself. I can't explain the whole protocol between the compiler and the runtime; it would take a lot of time. Try to understand it first and then ask questions about particular details, but not about the whole scheme. Or do you want me to ask you to explain the principles of the Attributor in full? Or the loop vectorizer? Or something else? Your question is not productive.

That is exactly what happened (wrt. the Attributor): people looked at the code, asked about principles, requested documentation, etc. If you look at the code now, it is all the better for it. So far you have just ignored my requests for clarifications and justifications, which is what this whole code review actually started with.

Third, the single array is used to handle up to 128 (not 256) inner parallel regions. This is fine from the point of view of the standard. The number of supported inner parallel levels (both active and inactive) is implementation-defined.

I asked where we document that a single array encodes both the number of active and the number of inactive parallel regions at the same time. The code is not sufficient documentation for such a low-level implementation detail. Similarly, though previously not on my radar, there is an apparently undocumented implementation detail wrt the number of levels we can handle.

Hal, I agree that it was a very dangerous situation, but I think you need to prove something before trying to escalate the situation and blame somebody for doing incorrect things. Johannes did this without any proof; he directly blamed me and others for doing improper things, though he has no idea at all how things work in Cuda.

I did already provide plenty of arguments in https://reviews.llvm.org/D62199#1515182 and https://reviews.llvm.org/D62199#1517073.

For the record, this code review was (from my side) never about accusations or improper behavior but about:

  1. The right solution to a problem I don't think we specified(*) properly, and
  2. The general lack of documentation that led to various questions on how to interpret the changes.

    (*) My question about which accesses are racy has been ignored, as have my inquiries about an alternative explicit synchronization to communicate the value through all threads in the warp.

    I don't think I have "no idea at all how things work in Cuda" and I don't appreciate the accusation.

Yes, but you need to have at least some basic knowledge about OpenMP itself, Cuda, and the OpenMP runtime. I have no time to write lectures trying to explain basic things to Johannes.

Me, and probably other people, might argue I have some OpenMP/Clang experience.

Yes, you're right. The design decisions must be described somewhere. But I don't think that this should prevent work on improving the runtime.

This seems to me like a core issue. Improvements without documentation will inevitably cause problems down the road, and rushing into a solution is also not a sustainable practice.

Arguably, atomic accesses provide atomicity. You also never responded to the question of which other accesses actually race with the ones to parallelLevel, or why you do not synchronize the accesses to parallelLevel across the warp explicitly.

I answered all the questions. We can use explicit synchronization, yes, but it will slow down the whole program execution. Instead, it is more effective to synchronize the accesses to the single variable. The volatile modifier allows us to do this without explicit atomic operations. In Cuda, volatile is slightly different from volatile in C/C++ and can be used to implement relaxed atomic operations.
In this array, the memory is written by one thread but can be read by other threads in the same warp. Without the volatile modifier or atomic ops, the ptxas tool reorders operations, which leads to undefined behaviour when the runtime is inlined.

According to D50391, explicit relaxed accesses are lowered into volatile PTX accesses. Did you actually see a slowdown, or did you simply expect one?

If somebody tries to review the code, the review must be productive. If the review turns into a set of lectures about language details, it is not productive; those are school lectures. Should I explain the Cuda language in a review? No. To review the code you need to have at least some minimal level of expertise.

I never asked for a school lecture, nor is it appropriate that you think you have to give me one instead of actually answering my questions and commenting on my concerns. Simply questioning my expertise and therefore ignoring my concerns is not helping anyone.

Wed, Jun 12, 4:06 AM · Restricted Project

Tue, Jun 11

ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..
In D62393#1539057, @jfb wrote:
In D62393#1539012, @jfb wrote:

FWIW we already support -fms-volatile, so there's precedent if you wanted -fnv-volatile (however terrible that is).

Most probably, we don't need this option, clang should emit correct code for volatile vars in Cuda mode automatically.

+1

That having been said, maybe this is a very simple change because it is the same as (or very similar to) what MSVolatile does?

My point exactly: MS volatile promotes volatile to seq_cst, it's trivial to copy and promote to relaxed and isn't too objectionable because we've got precedent.

Further consideration: you say you want to match the nvcc semantics... but it sounds like it doesn't actually have great / documented semantics, and they might change. Should this volatile->relaxed be opt-in or opt-out? Is that the only thing we'd want to guarantee?

Tue, Jun 11, 4:10 PM · Restricted Project
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..
In D62393#1539012, @jfb wrote:

FWIW we already support -fms-volatile, so there's precedent if you wanted -fnv-volatile (however terrible that is).

Most probably, we don't need this option, clang should emit correct code for volatile vars in Cuda mode automatically.

+1

That having been said, maybe this is a very simple change because it is the same as (or very similar to) what MSVolatile does?

/// An LValue is a candidate for having its loads and stores be made atomic if
/// we are operating under /volatile:ms *and* the LValue itself is volatile and
/// performing such an operation can be performed without a libcall.
bool CodeGenFunction::LValueIsSuitableForInlineAtomic(LValue LV) {
  if (!CGM.getCodeGenOpts().MSVolatile) return false;
  AtomicInfo AI(*this, LV);
  bool IsVolatile = LV.isVolatile() || hasVolatileMember(LV.getType());
  // An atomic is inline if we don't need to use a libcall.
  bool AtomicIsInline = !AI.shouldUseLibcall();
  // MSVC doesn't seem to do this for types wider than a pointer.
  if (getContext().getTypeSize(LV.getType()) >
      getContext().getTypeSize(getContext().getIntPtrType()))
    return false;
  return IsVolatile && AtomicIsInline;
}

Making this return true when LangOpts.CUDAIsDevice is set might essentially do what we need?

Tue, Jun 11, 4:02 PM · Restricted Project
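A rough sketch of the tweak being floated here, purely for illustration: whether a CUDA device check belongs in this helper, and how the resulting atomic ordering would then be made relaxed rather than seq_cst, are open questions that the snippet below does not decide.

bool CodeGenFunction::LValueIsSuitableForInlineAtomic(LValue LV) {
  // Hypothetically treat volatile as atomic under /volatile:ms or when
  // compiling CUDA device code, so volatile accesses get atomic lowering.
  if (!CGM.getCodeGenOpts().MSVolatile && !getLangOpts().CUDAIsDevice)
    return false;
  AtomicInfo AI(*this, LV);
  bool IsVolatile = LV.isVolatile() || hasVolatileMember(LV.getType());
  bool AtomicIsInline = !AI.shouldUseLibcall();
  if (getContext().getTypeSize(LV.getType()) >
      getContext().getTypeSize(getContext().getIntPtrType()))
    return false;
  return IsVolatile && AtomicIsInline;
}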
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..
In D62393#1539012, @jfb wrote:

FWIW we already support -fms-volatile, so there's precedent if you wanted -fnv-volatile (however terrible that is).

Tue, Jun 11, 3:37 PM · Restricted Project
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

I know some things about CUDA, volatile and C++. Let's see if I can understand the part of the proposed change that involves these. I don't understand the part about the test but I don't need to, I'll ignore that.

The gist of the issue is that parallelLevel table should really be atomic<> because of data-races on it (that was a bug prior to this, maybe there are more such bugs lying around), except there's a problem with the obvious fix: atomic<> is not available in CUDA (yet) so it's not an option to fix this issue. The next best thing we have instead is volatile.

Now, volatile in PTX (e.g. asm("ld.volatile...[x];")) and volatile (e.g. "volatile ...x;") in C++ source like this, are not exactly the same thing. When CUDA says that volatile is equivalent memory_order_relaxed, it's saying something clear (I think) about the PTX level code but it's still being pretty vague about CUDA C++ level code. OTOH it's entirely possible for Clang to do something with either atomic<> or volatile that isn't valid for the other -- and that's a different can of worms than, say, NVCC which does {whatever NVCC does, it's not the compiler you're using}.

However, since people do use CUDA C++ volatile this way a lot already, you can't really have a good CUDA toolchain unless it is the case (including, by accident) that it works for this purpose. In other words, it's probably reasonable to assume this in your code because every other code would be on fire otherwise, and it's not on fire, so far as we can tell.

It might be worth it to prepare your code for the eventual arrival of atomic<> on CUDA. That is, maybe create a template alias on T with some helper functions, just enough for your use. It might make this code more self-documenting and make it easy to make it 100% legit later on.

Hi Olivier, thanks for your comments.

+1

You're absolutely right. Actually, we're using both compilers, nvcc and clang (under different conditions, though). Marking the variable volatile does not break it at the LLVM level. Maybe it is by accident, but I rather doubt it.

I'm pretty sure that it's not accidental, but I'm very concerned about relying on it working without documenting the required semantics. Two things:

  1. First, we currently provide volatile accesses with some synchronization semantics. For example, for LoadInst, we have this:
bool isUnordered() const {
  return (getOrdering() == AtomicOrdering::NotAtomic ||
          getOrdering() == AtomicOrdering::Unordered) &&
         !isVolatile();
}

while, at the same time, documenting that volatile has no such semantics at the IR level. The LangRef currently says, " The optimizers may change the order of volatile operations relative to non-volatile operations. This is not Java’s “volatile” and has no cross-thread synchronization behavior."

And, thus, my concern. It works, and there is some explicit code to support this functionality, but the semantics are not documented (and, perhaps, are documented *not* to work). This, I think, we should correct to ensure correct functioning going forward. Alternatively, we might change how Clang lowers volatile accesses in CUDA mode to make them relaxed atomics? Any thoughts on this?

Tue, Jun 11, 2:57 PM · Restricted Project
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

I know some things about CUDA, volatile and C++. Let's see if I can understand the part of the proposed change that involves these. I don't understand the part about the test but I don't need to, I'll ignore that.

The gist of the issue is that parallelLevel table should really be atomic<> because of data-races on it (that was a bug prior to this, maybe there are more such bugs lying around), except there's a problem with the obvious fix: atomic<> is not available in CUDA (yet) so it's not an option to fix this issue. The next best thing we have instead is volatile.

Now, volatile in PTX (e.g. asm("ld.volatile...[x];")) and volatile (e.g. "volatile ...x;") in C++ source like this, are not exactly the same thing. When CUDA says that volatile is equivalent memory_order_relaxed, it's saying something clear (I think) about the PTX level code but it's still being pretty vague about CUDA C++ level code. OTOH it's entirely possible for Clang to do something with either atomic<> or volatile that isn't valid for the other -- and that's a different can of worms than, say, NVCC which does {whatever NVCC does, it's not the compiler you're using}.

However, since people already use CUDA C++ volatile this way a lot, you can't really have a good CUDA toolchain unless it happens to work (even if only by accident) for this purpose. In other words, it's probably reasonable to rely on it in your code, because plenty of other code would be on fire otherwise, and it isn't, so far as we can tell.

It might be worth it to prepare your code for the eventual arrival of atomic<> on CUDA. That is, maybe create a template alias on T with some helper functions, just enough for your use. It might make this code more self-documenting and make it easy to make it 100% legit later on.

Hi Olivier, thanks for your comments. You're absolutely right. Actually, we're using both compilers, nvcc and clang (under different conditions, though). Marking the variable volatile does not break it at the LLVM level. Maybe it is by accident, but I rather doubt that.
Do you suggest creating a template function that will provide access to the parallelLevel variable? And when atomic<> is supported by CUDA, change the type of this variable to atomic<> so the compiler could automatically instantiate this template function with the proper type, right? Or do you have something different in mind? If so, could you provide a small example of your idea?

That's better than what I had in mind, so it sounds like a good step to me. (I don't know what the norm is here, but maybe with a comment and an issue to fix it later.) Cheers.
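
A minimal sketch of the helper idea discussed above (all names are hypothetical, nothing is taken from the runtime sources): funnel every access through small wrappers so the storage can later change from volatile T to an atomic type without touching the call sites:

template <typename T> using SyncCell = volatile T; // later: an atomic<T>-like type

template <typename T>
__device__ static T syncLoad(SyncCell<T> &Cell) {
  return Cell; // later: Cell.load(memory_order_relaxed)
}

template <typename T>
__device__ static void syncStore(SyncCell<T> &Cell, T Val) {
  Cell = Val; // later: Cell.store(Val, memory_order_relaxed)
}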

Tue, Jun 11, 2:41 PM · Restricted Project
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

I know some things about CUDA, volatile and C++. Let's see if I can understand the part of the proposed change that involves these. I don't understand the part about the test but I don't need to, I'll ignore that.

The gist of the issue is that the parallelLevel table should really be atomic<> because of data races on it (that was a bug prior to this, maybe there are more such bugs lying around), except there's a problem with the obvious fix: atomic<> is not available in CUDA (yet), so it's not an option for fixing this issue. The next best thing we have instead is volatile.

Now, volatile in PTX (e.g. asm("ld.volatile...[x];")) and volatile (e.g. "volatile ...x;") in C++ source like this are not exactly the same thing. When CUDA says that volatile is equivalent to memory_order_relaxed, it's saying something clear (I think) about the PTX-level code, but it's still being pretty vague about CUDA C++ level code. OTOH it's entirely possible for Clang to do something with either atomic<> or volatile that isn't valid for the other -- and that's a different can of worms than, say, NVCC which does {whatever NVCC does, it's not the compiler you're using}.

However, since people already use CUDA C++ volatile this way a lot, you can't really have a good CUDA toolchain unless it happens to work (even if only by accident) for this purpose. In other words, it's probably reasonable to rely on it in your code, because plenty of other code would be on fire otherwise, and it isn't, so far as we can tell.

It might be worth it to prepare your code for the eventual arrival of atomic<> on CUDA. That is, maybe create a template alias on T with some helper functions, just enough for your use. It might make this code more self-documenting and make it easy to make it 100% legit later on.

Tue, Jun 11, 2:13 PM · Restricted Project
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..
In D62393#1538350, @jfb wrote:

I think it is important to note that, to the best of my knowledge, no one was implying any bad faith on the part of anyone. To name one specific factor: we have had problems in the past where groups of collaborators/coworkers have done off-list reviews, followed only by a perfunctory/superficial on-list review, resulting in a lack of public discussion around the relevant design and implementation considerations. The review history highlighted by Johannes can give that impression, and we all need the community to watch itself in this regard, given the many potentially-relevant factors, to ensure the quality of our public code reviews stays high. I see no reason to doubt that everyone here wants to create the best possible code base. Sometimes that means one member of the community pointing out that, in his or her opinion, past reviews might have been insufficiently thorough, and we should welcome that and all work together to address these kinds of concerns going forward.

Hal, I agree that it was a very dangerous situation, but I think you need to prove something before trying to escalate the situation and blame somebody for doing incorrect things. Johannes did this without any proof; he directly blamed me and others for doing improper things, though he has no idea at all how things work in CUDA.

Alexey: Hal is trying to turn this review into a productive one. Please internalize his guidance, and come back ready to answer feedback which you'd rather not answer. I don't see this review going anywhere without you changing your approach. You're frustrated by having to explain what you consider to be basics, and that's fine. Say so politely, and point at relevant resources. Reviews don't *need* to be a school lecture, but an open-source project is healthy when everyone learns. First it forces the committer to actually understand what they're doing (explaining basics is a great learning tool), and second it makes the code easier to maintain if the original committer stops participating.

Tue, Jun 11, 10:25 AM · Restricted Project
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

> The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?
>

They are all tracked in this array.

Where is that described?

Sorry, read the code. I'm not a debugger.

First of all, this response is simply rude.
Second, you wrote the code so my question is well justified.
Third, does that mean you encode multiple values in a single int array but did not document this at all? The code is not documentation.
Finally, this will break once we have a parallel region in a recursive loop with more than 254 instantiations, correct?

First of all, you started this when you expressed the thought that the patches are accepted not because they are correct, but because we all work at the same company. That was rude! Besides, we don't work at the same company anymore.

I think it is important to note that, to the best of my knowledge, no one was implying any bad faith on the part of anyone. To name one specific factor: we have had problems in the past where groups of collaborators/coworkers have done off-list reviews, followed only by a perfunctory/superficial on-list review, resulting in a lack of public discussion around the relevant design and implementation considerations. The review history highlighted by Johannes can give that impression, and we all need the community to watch itself in this regard, given the many potentially-relevant factors, to ensure the quality of our public code reviews stays high. I see no reason to doubt that everyone here wants to create the best possible code base. Sometimes that means one member of the community pointing out that, in his or her opinion, past reviews might have been insufficiently thorough, and we should welcome that and all work together to address these kinds of concerns going forward.

Tue, Jun 11, 9:40 AM · Restricted Project

Mon, Jun 10

ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

> The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?
>

They are all tracked in this array.

Where is that described?

Sorry, read the code. I'm not a debugger.

First of all, this response is simply rude.
Second, you wrote the code so my question is well justified.
Third, does that mean you encode multiple values in a single int array but did not document this at all? The code is not documentation.
Finally, this will break once we have a parallel region in a recursive loop with more than 254 instantiations, correct?
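
One plausible reading of the single-array encoding being asked about, purely as an illustration (the flag value and the helper names below are guesses, not taken from the runtime sources): the per-warp counter keeps the nesting depth in its low bits and reserves a high bit to record that some enclosing region is active, which is also why a deep enough nest would overflow the counter:

constexpr unsigned char ACTIVE_LEVEL_FLAG = 0x80; // hypothetical flag bit

inline unsigned char packLevel(unsigned char Depth, bool AnyActive) {
  return (unsigned char)(Depth | (AnyActive ? ACTIVE_LEVEL_FLAG : 0));
}
inline unsigned char nestingDepth(unsigned char Packed) {
  return (unsigned char)(Packed & ~ACTIVE_LEVEL_FLAG);
}
inline bool anyActiveLevel(unsigned char Packed) {
  return (Packed & ACTIVE_LEVEL_FLAG) != 0;
}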

Mon, Jun 10, 6:28 PM · Restricted Project
ABataev added inline comments to D59474: [OpenMP 5.0] Codegen support for user-defined mappers.
Mon, Jun 10, 10:21 AM · Restricted Project, Restricted Project
ABataev added inline comments to D59474: [OpenMP 5.0] Codegen support for user-defined mappers.
Mon, Jun 10, 10:01 AM · Restricted Project, Restricted Project
ABataev added inline comments to D59474: [OpenMP 5.0] Codegen support for user-defined mappers.
Mon, Jun 10, 9:41 AM · Restricted Project, Restricted Project
ABataev added a comment to D63009: [OpenMP] Add target task alloc function with device ID.

This patch does not add any functionality. I thought it was published by accident. The patch should be abandoned or reworked to add the functional part of the code.

Mon, Jun 10, 8:52 AM · Restricted Project, Restricted Project, Restricted Project
ABataev added inline comments to D59474: [OpenMP 5.0] Codegen support for user-defined mappers.
Mon, Jun 10, 8:49 AM · Restricted Project, Restricted Project

Thu, Jun 6

ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

> The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?
>

They are all tracked in this array.

Where is that described?

Thu, Jun 6, 6:45 PM · Restricted Project
ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

I thought allowing volatile for synchronization was a mistake, and CUDA deprecates doing so, at least since Volta

Thu, Jun 6, 4:36 PM · Restricted Project

Wed, Jun 5

ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

It marks whether the parallel region has more than 1 thread. Only the very first parallel region may have more than 1 thread. This is required to correctly implement the omp_in_parallel function, at least.

I would have assumed omp_in_parallel to return true if the parallelLevel is not the default (0 or 1). Why isn't that so?
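
A simplified illustration of the distinction behind this exchange (the names and variables are made up, not the actual runtime code): a serialized parallel region still increases the nesting depth, but omp_in_parallel is about being enclosed by an active region, i.e. one that actually runs with more than one thread, so a bare depth counter cannot answer it by itself:

static int nestingDepth = 0; // incremented for every parallel region, active or not
static int activeDepth = 0;  // incremented only for regions running with >1 thread

int inParallelSketch() {
  // Not "nestingDepth > 0": a nest of serialized regions would then
  // wrongly report true.
  return activeDepth > 0;
}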

Wed, Jun 5, 4:34 AM · Restricted Project

Tue, Jun 4

ABataev added a comment to D60583: [AArch64] Implement Vector Funtion ABI name mangling..

Why/Where did we decide to clobber the attribute list with "non-existent function names"?

This seems to me like an ad-hoc implementation of the RFC that is currently discussed but committed before the discussion is finished.

It has nothing to do with the RFC for a variant. It is a standard interface to communicate with the backend to generate vectorized versions of the functions. It relies on the Vector ABI provided by Intel and ARM, and it follows the way it is implemented in GCC. There was an RFC for this a long time ago, which was accepted by the community and later implemented.

The RFC states, in a nutshell, let us add one attribute to identify all vector variants. This patch adds all vector variants as attributes. Clearly, these things are related.
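
A hedged illustration of the Vector Function ABI name mangling the comments refer to (the scheme follows the public AArch64/Intel vector ABI documents; how the patch records these names as attributes is not reproduced here):

#pragma omp declare simd simdlen(4) notinbranch
float foo(float x);

// Under the AArch64 Vector Function ABI this declaration advertises a vector
// variant named
//   _ZGVnN4v_foo   // _ZGV + 'n' (AdvSIMD) + 'N' (unmasked) + 4 lanes + 'v' (vector arg)
// which the vectorizer can call when it vectorizes a loop by a factor of 4.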

Tue, Jun 4, 1:52 PM · Restricted Project, Restricted Project
ABataev added a comment to D60583: [AArch64] Implement Vector Funtion ABI name mangling..

Why/Where did we decide to clobber the attribute list with "non-existent function names"?

This seems to me like an ad-hoc implementation of the RFC that is currently discussed but committed before the discussion is finished.

Tue, Jun 4, 1:48 PM · Restricted Project, Restricted Project

Mon, Jun 3

ABataev added a comment to D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile..

I still do not see why volatile is the right solution here:
In D61395 the way the level was tracked was changed but I failed to see how exactly. We have to start there.

  • What is the constant OMP_ACTIVE_PARALLEL_LEVEL doing exactly?
Mon, Jun 3, 6:01 PM · Restricted Project

Sun, Jun 2

ABataev added inline comments to D57779: [SLP] Add support for throttling..
Sun, Jun 2, 10:31 AM

Fri, May 31

ABataev added inline comments to D62397: [OPENMP][NVPTX]Relax flush directive..
Fri, May 31, 4:51 PM · Restricted Project

Thu, May 30

ABataev added inline comments to D62427: [SLP] Fix regression in broadcasts caused by operand reordering patch D59973..
Thu, May 30, 12:42 PM · Restricted Project
ABataev added inline comments to D62427: [SLP] Fix regression in broadcasts caused by operand reordering patch D59973..
Thu, May 30, 7:55 AM · Restricted Project