- User Since
- Feb 16 2016, 1:22 PM (191 w, 1 d)
Thu, Oct 10
Tue, Oct 8
Fri, Oct 4
I also feel the patch must be marked as NFC, there is no functional change here.
I'm fine with the changes too. Maybe we should change the title because eventually this patch does not use -1 but rather the full representation (0xffff....).
Wed, Oct 2
I'm wondering whether it would be better to define __kmpc_impl_all_lanes conditionally, i.e. as 0xFFFF for 32-wide and 0xFFFFFFFF for 64-wide architectures. I'm not sure I like this implicit signed-to-unsigned conversion here... Also, __kmpc_impl_lanemask_t is currently typedef'ed as uint32_t, if we want to support 64-wide lanes, then __kmpc_impl_lanemask_t must also be conditionally defined accordingly.
Could alternatively add a constant to target_impl if preferred.
Fri, Sep 27
Lingda is right, we had faced the same issue in the loop trip count implementation. The loop trip count should be set per task but libomptarget has no notion of tasks, so we ended up engaging the host runtime (libomp) to store per-task information. Although it involves more work, I still believe that will be a more elegant solution.
Aug 23 2019
I guess there is a clang-side patch as well which makes use of the new exposed kmpc function? Can you add it to the description? Also, can you add a basic test with a critical region that doesn't work correctly without this patch?
As there has been no update from the original author for 1.5 years and what is posted here doesn't work anyway, I am abandoning this patch. If for any reason we need to enable sm_30 devices to be used for offloading, we can do it later on as part of redesigning the runtime by factoring out common logic.
Aug 8 2019
I don't have any comments about this patch other than the ones @Hahnfeld mentioned.
Aug 6 2019
Aug 2 2019
This patch makes debug output more consistent, thanks!
Jul 31 2019
OK, it's good to go now.
Jul 30 2019
Jul 29 2019
Now that the clang-side patch is finalized, this is almost good to go. What is missing from this patch is a test. Can you add something under libomptarget/test/offloading/?
Jul 22 2019
This looks consistent with the changes in D62397, I think it's good to go.
OK, looks good.
Does this behavior of CUDA >= 9.0 affect only the parallel level counter? Do we need to propagate these changes to other functions?
Jul 17 2019
Jul 15 2019
This change makes sense, looks good to me.
Jul 9 2019
Right, this patch is now obsolete.
Jul 2 2019
A long time ago we had identified the problem with the loop trip count and if I recall correctly the proposed solution was to store the trip count per OpenMP task. The fix would be implemented in libomp because libomptarget has no notion of tasks.
Jun 19 2019
Jun 17 2019
Jun 14 2019
Jun 4 2019
May I propose something extra? I guess the purpose of this patch is to disable libffi, for instance on systems where the library is not installed or for whatever reason the user doesn't want to use it. In that case, we should accommodate for the standalone build as well. We could introduce a new Cmake variable LIBOMPTARGET_ENABLE_FFI and
- set it to LLVM_ENABLE_FFI if building in-tree or
- have a default value of true if building standalone and let the user set it to false if they want to
Jun 3 2019
I formatted the description a bit. The patch looks good and the reasoning behind it is now obvious, in line with Nvidia's documentation.
I think this patch is wrong. Where are the previous additions of target_data and __tgt_target_data_mapper? Possibly you uploaded the diff between the previous revision and the current one, whereas you need to upload the diff between origin/master and your current HEAD!
OK, there don't seem to be any more comments, looks good.
May 31 2019
May 29 2019
Also, can you include a reference in the commit message pointing to D56274 as this was the case demonstrating the need for this fix? Thanks!
May 25 2019
May 24 2019
I agree with some of your complaints, I disagree with others:
May 23 2019
Let me address what has been written so far from my perspective and request some changes since there has been demand for them:
- No doubt there are 2 changes here, unrelated to one another. Yes, they could be split into 2 patches to strictly comply with community guidelines. And I would prefer it if the author had submitted two distinct patches. On the other hand, I don't like missing the forest for the trees. This is a 4-line patch, I figured out it is OK to let both changes in at once. Sure, the small size of a patch is not a free pass to doing whatever we want, but in this case committing both changes in one go doesn't do any harm, at least none that I can think of so that I would make a big deal out of it. There's the letter of the law and the spirit of the law, in cases where these two collide I tend to go with the latter.
- My bad that I didn't request a test for cache flushing, I was predisposed due to the NFC tag. I agree there is no way to test whether the cache has been flushed, but @ABataev you could provide a test case which fails without cache flushing so that you can demonstrate that this patch does indeed work. There must be such a test case, after all how did you figure out there is a problem there?
- The NFC bit does not apply to the barrier fix. If we keep both changes in one patch, that should be reflected in the title. I agree that not being able to test a change does not make it NFC. You can call it NTC (non-testable change), but not NFC.
- In case of volatile, it seems there's a gray line here between what constitutes a NFC patch. The way I see it, "volatile" prevents compiler optimizations, it doesn't change anything in the way this code works, the algorithms/data structures used etc., the only change is how the code is compiled. On the other hand, we could argue that since the compiled library behaves differently there is indeed a functional change. Once again, to me this is a gray line and there is no absolute answer. However, since here there are more objections than approvals, @ABataev can you change the patch's title and remove the NFC part altogether?
- Regarding justification for the need for volatile, I agree that the explanation of the problem is very short, but the solution makes sense given the nature of that problem - at least to me. I could have blocked the patch until the author finds a way to describe the issue in detail - in the meantime there would be a correctness problem in the library which could cause inexplicable failures/wrong results to the users. Even if the "fix" is more of a hack - which I don't believe - I would rather let this hack in than leave the correctness problem lying around indefinitely. From the description I understand that a thread may cache values from the parallelLevel array but at the same time those values change in RAM without the thread knowing about it. This leads to corrupt states where some threads have a stale view of the parallel level. The only way to demonstrate the problem is to post ptx code, which I think is too much in the scope of a review, that's why I don't see the point in providing more details. The author could point to two places in the code, one where a thread caches a value from that array and another one where another thread modifies the global value, but it is not always possible to track where those changes occur. This is in case I understood the problem correctly. If not, then @ABataev I am also requesting a detailed explanation of the problem because the current description leads to wrong conclusions.
- 1, 4 and 5 represent my point of view, which is of course subjective, anyone could have a different mentality. But calling my reviews "reviews" and my activity here "suspicious" doesn't contribute at all. In pretty much the same way the "this is your problem, I've got the owner's approval" reply is inappropriate (and in no way do I endorse this "argument" as a reply to reviewers' requests).
- Since February 2018 I have been the owner of the libomptarget project. The issue of libomptarget having no owner came up in one of our biweekly multi-company telecons, I was the only one who volunteered and at a subsequent F2F there was consensus for my appointment since I had written most of the code for the base library and the CUDA and generic-elf64 plugins.
- Regarding the nvptx device runtime, certainly it is a different project from the rest of libomptarget. I think the RTL is still considered "coupled" with libomptarget for historical reasons only (both the base library, the plugins and the RTL were all written by the same people at the same company in the scope of the same project). But in terms of further development and expertise needed, it is a completely different kettle of fish. From my current position I can only contribute to the RTL by reviewing patches, it is true that I no longer do any development in this RTL. If anyone feels that the nvptx RTL repository should be managed/owned separately from the rest of libomptarget, please speak up and I will raise the issue at the next community telecon.
May 21 2019
May 20 2019
I don't have any other comments or suggestions, the patch looks good.
May 17 2019
May 10 2019
May 3 2019
May 2 2019
Apr 26 2019
I think it looks good now.
Apr 24 2019
I see that target_data_mapper pretty much copies the entire code from target_data_begin and target_data_end. That's 200+ duplicate lines of code. Couldn't we just add this check:
Mar 25 2019
OK, now it makes sense. I had forgotten what was in the description since the patch was first submitted. :)
Mar 22 2019
This revision looks much better. Please have a look at the inline comment.
Jan 4 2019
Nice approach to optimizing the degree of collisions of atomic adds. LGTM.
Looks good. Thanks!
LGTM. Thanks for all this work!
Jan 3 2019
I'll accept the patch for the sake of consistency and correctness of execution. Just one question:
which cost is too high
So should we expect a performance penalty until function copy is fixed in LLVM and we can revert back to __syncthreads()?
Dec 31 2018
I'm not sure what we are trying to do here. OMPTARGET_DEBUG is not an environment variable, it's a preprocessor definition. It is used by libomptarget/include/omptarget.h to include all necessary header files and define the DEBUGP macro appropriately, which in turn is used by libomptarget/src/private.h to define the DP macro used throughout the library.
Dec 27 2018
Dec 18 2018
Since a similar solution was adopted for clang, I think we should let this one land. After all, it's a matter of consistency between the two projects.
Nov 5 2018
Oct 22 2018
Are we still interested in this patch or should we abandon it?
Oct 11 2018
Oct 1 2018
Looks good. Thanks for taking care of this issue!
Looks good. I have no idea why this option was left there...
Sep 29 2018
Sep 28 2018
Interesting. Are there other cases where alignment is needed?
Then this revision is good to go!
Sep 21 2018
@Hahnfeld: Are the latest changes in line with your requirements/plans to reduce the memory footprint of the nvptx runtime?
Sep 12 2018
I completely missed this one. The proposed change makes sense.
Sep 11 2018
Sep 10 2018
Great job, adding testing infrastructure has been on our list for a long time!
Sep 6 2018
Sep 4 2018
Good catch, thanks for the fix!