Mon, Sep 14
If I remember correctly, you may yield the thread inside a target region after enqueuing kernels and transfers. So even with one thread, there is a chance to run other tasks without finishing this target. Isn't that possible?
However, an OpenMP task has the problem that it must be inside a parallel region; otherwise the task is executed immediately (undeferred). As a result, if we directly wrap the target in a regular task, a nowait target outside of a parallel region is still the synchronous version.
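A minimal illustration of the point (the function name and mapping here are mine, not from the patch): outside a parallel region, the task wrapping a `nowait` target is undeferred, so the initial thread executes the region immediately and the `nowait` is effectively synchronous. Without offloading support, the region simply falls back to the host.

```cpp
// Hypothetical example: a "target nowait" encountered outside any parallel
// region. The implicit task wrapping it is undeferred, so the encountering
// thread runs the region immediately -- the nowait has no effect.
int runTargetNowait() {
  int X = 0;
#pragma omp target map(tofrom : X) nowait
  { X = 42; }
  // With no surrounding parallel region, the target has already completed
  // by this point; the taskwait is a no-op.
#pragma omp taskwait
  return X;
}
```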
Thu, Sep 10
Tue, Sep 8
The changes I requested have been added, so I am removing my block. The other reviewers' comments still need to be addressed.
Fri, Sep 4
Tue, Sep 1
Mon, Aug 31
It seems the functions are marked static, so they should be OK. However, including the whole Debug.h in a plugin .cpp makes it feel acceptable to use any function/macro from that header, when actually only some of the macros are meant for the plugins; others are only for libomptarget.
It doesn't feel right to have Debug.h shared by libomptarget and the plugins, especially since Debug.h contains not just macros but also functions.
Wed, Aug 26
Please document the flags in the patch summary.
Mon, Aug 24
I would prefer PrivateArgumentManagerTy to be moved into its own file.
The rest looks good to me.
Sun, Aug 23
Down the road, we may need a way to allocate host-pinned memory via the plugin for the host buffer, to maximize transfer performance.
Aug 20 2020
Only minor things.
Why just the "small" ones? Why not all of them?
Aug 19 2020
Aug 18 2020
- The DeviceTy copy constructor and assignment operator were already imperfect before this patch. I don't think we can fix them in this patch; we should just document the imperfection here.
- Because the memory limit is per allocation, the MemoryManager can still hold an unbounded amount of memory, and we have no way to free it. I'm concerned about having this feature on by default.
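To make the concern concrete, here is a minimal sketch (all names hypothetical, not the actual libomptarget MemoryManager): a manager whose limit applies only per allocation can cache an unbounded number of small freed blocks and never return any of them to the system.

```cpp
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Hypothetical sketch, not the real implementation: blocks at or below
// Threshold are cached on free instead of being released, so although each
// individual allocation is bounded, the total cached footprint can grow
// without limit.
class SimpleMemoryManager {
  size_t Threshold;
  std::unordered_map<size_t, std::vector<void *>> FreeLists; // size -> blocks
public:
  explicit SimpleMemoryManager(size_t T) : Threshold(T) {}
  void *allocate(size_t Size) {
    auto It = FreeLists.find(Size);
    if (It != FreeLists.end() && !It->second.empty()) {
      void *P = It->second.back();
      It->second.pop_back();
      return P; // reuse a cached block
    }
    return std::malloc(Size);
  }
  void deallocate(void *Ptr, size_t Size) {
    if (Size <= Threshold)
      FreeLists[Size].push_back(Ptr); // cached forever: never freed
    else
      std::free(Ptr); // only "large" blocks go back to the system
  }
  size_t cachedBlocks() const {
    size_t N = 0;
    for (const auto &KV : FreeLists)
      N += KV.second.size();
    return N;
  }
};
```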
Aug 13 2020
Aug 12 2020
Blocking the patch temporarily for my earlier questions.
- Please mention LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD, its default value, and its unit in the patch summary.
- Is it possible to have a unit test testing the manager class behaviors?
- Can we offload to host and run address sanitizer or valgrind?
I'm not sure if I'm asking for too much here.
Jul 31 2020
Jul 30 2020
Thanks for fixing the bug. It should be good for the moment.
Thinking about recursive mappers, we may still have more synchronization than needed. Recursing over the whole targetDataBegin/targetDataEnd is a convenient but sub-optimal choice.
Recursion should only be done in the map/mapper analysis. Just leaving my thoughts here; this needs a discussion beyond this patch.
LGTM. Please mention renaming variables in the summary.
Jul 29 2020
LGTM. My applications run as expected now. PR46824, PR46012, PR46868 all work fine.
Only minor documentation issues.
Jul 28 2020
GPU activities: 96.99% 350.05ms 10 35.005ms 1.5680us 350.00ms [CUDA memcpy HtoD]
Before the July 21 change:
GPU activities: 95.33% 20.317ms 4 5.0793ms 1.6000us 20.305ms [CUDA memcpy HtoD]
Still more transfers than there should be.
OK. Leave the unrelated renaming to the future.
Only one minor issue. Your initial, more sophisticated patch made me think you had replaced all of the lock/unlock pairs; after splitting, the change is very clean.
It should be easy to address my comments so we can get this merged ASAP.
Please check the reproducer in https://bugs.llvm.org/show_bug.cgi?id=46868 with LIBOMPTARGET_DEBUG=1.
The reference counting on the base pointer variable has side effects: it is not cleaned up when the variable leaves its scope.
This patch needs to be split into three:
- the function renaming (in addition, should we update target_data_update as well?);
- the std::lock_guard change;
- the "target" change.
The order of the first two is flexible.
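As a sketch of the std::lock_guard part (the identifiers below are illustrative stand-ins for libomptarget's mutex-protected mapping tables, not the actual names), RAII locking replaces each paired lock()/unlock() so that early returns cannot leave the mutex held:

```cpp
#include <map>
#include <mutex>

// Illustrative only: DataMapMtx and HostToTgtMap are stand-ins for the
// mutex and mapping table guarded in libomptarget.
static std::mutex DataMapMtx;
static std::map<void *, void *> HostToTgtMap;

void *lookupTgtPtr(void *HstPtr) {
  // The guard releases DataMapMtx on every return path, replacing an
  // explicit DataMapMtx.lock()/DataMapMtx.unlock() pair.
  std::lock_guard<std::mutex> LG(DataMapMtx);
  auto It = HostToTgtMap.find(HstPtr);
  if (It == HostToTgtMap.end())
    return nullptr; // early return no longer risks leaking the lock
  return It->second;
}
```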
Jul 24 2020
Jul 23 2020
Jul 22 2020
OK. It is less broken now.
target_data_end still does Device.deallocTgtPtr and needs a sync before it.
To fully fix this issue, target_data_end must be split.
Indeed, target_data_begin should be split as well. cudaMalloc blocks the whole device, and alternating cudaMalloc with transfers only slows the whole process further. It is better to do all the allocations first and then start queuing the transfers.
Does it mean that D2H transfers will always run synchronously after this change?
Does it also mean that target_data_end should be split into a data-transfer part and a data-free part?
Jul 21 2020
I verified that PR46012 is fixed by this patch.
Jul 19 2020
Could you describe what "SplitMemTransfer" is? The current patch summary doesn't provide a sufficient explanation. What kinds of offloaded arrays are considered by this optimization?
Jul 16 2020
Are you sure the fprintf output gets flushed when abort is called?
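For context: abort raises SIGABRT without flushing stdio buffers, so a diagnostic written with fprintf to a fully buffered stream (e.g. stdout redirected to a file) can be lost. A common pattern, sketched here with a hypothetical helper name, is to flush explicitly first:

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: make sure the last diagnostic is visible before
// aborting. std::abort does not flush stdio streams; fflush(nullptr)
// flushes every open output stream.
[[noreturn]] void fatalError(const char *Msg) {
  std::fprintf(stderr, "fatal: %s\n", Msg);
  std::fflush(nullptr); // flush stderr, stdout, and any other open streams
  std::abort();
}
```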
Jul 14 2020
Jul 13 2020
Assertion failure in PR46687 goes away with this patch.
Jul 7 2020
Jul 6 2020
Jul 2 2020
@protze.joachim Is the current way of handling CU_CTX_SCHED_BLOCKING_SYNC OK?
I mean we still set the CU_CTX_SCHED_BLOCKING_SYNC flag if the primary context has not been activated prior to libomptarget.
Allowing users to choose a scheme will be a separate patch.
I'd like to get this patch in as soon as possible, because the 11 release branch will be created soon and the current issue is a blocker for applications using libomptarget.
Jul 1 2020
Jun 30 2020
Updated description regarding setting CU_CTX_SCHED_BLOCKING_SYNC.
I have updated the patch, removing the old behavior. @Hahnfeld, would you approve?
Jun 29 2020
It seems no one objects to removing the old behavior. I will update the patch.