In the case of coalesced global records, we need to push the exact data size passed in. This patch fixes this by outlining the common functionality of the previous push function and by adding a separate entry point for coalesced pushes. The pop function remains unchanged.
Details
- Reviewers
ABataev grokos caomhin - Commits
- rGb10bacf12254: [OpenMP][libomptarget] Add runtime function for pushing coalesced global records
rL345867: [OpenMP][libomptarget] Add runtime function for pushing coalesced global records
rOMP345867: [OpenMP][libomptarget] Add runtime function for pushing coalesced global records
Diff Detail
- Repository
- rOMP OpenMP
- Build Status
Buildable 23703 Build 23702: arc lint + arc unit
Event Timeline
I guess this will break the case when the DataSize passed to __kmpc_data_sharing_push_stack() needs additional alignment: With this change it is handled in data_sharing_push_stack_common() but __kmpc_data_sharing_push_stack() will determine PushSize without the adjustment and do the final pointer arithmetic.
Why DataSize might require an additional alignment? The DataSize must be already aligned.
It's required in __kmpc_data_sharing_push_stack() for some cases, please see D52655.
Another problem with this patch is that __kmpc_data_sharing_push_stack() is adding the lane offset to the return value of omptarget_nvptx_SimpleThreadPrivateContext::Allocate(). AFAICS this will break lastprivate which requires the same buffer for all threads.
Anyway, you must do what you think to be correct.
It is another problem. The problem is that the StackP pointer is unaligned after all the pointer maths. The DataSize itself should not be aligned.
Another problem with this patch is that __kmpc_data_sharing_push_stack() is adding the lane offset to the return value of omptarget_nvptx_SimpleThreadPrivateContext::Allocate(). AFAICS this will break lastprivate which requires the same buffer for all threads.
Anyway, you must do what you think to be correct.
Yes, there must be an additional check size_t PushSize = (isRuntimeUninitialized() || IsMasterThread()) ? DataSize : WARPSIZE * DataSize;
libomptarget/deviceRTLs/nvptx/src/data_sharing.cu | ||
---|---|---|
387 | This must be removed | |
456 | Also, must be removed | |
462 | It is a very bad idea to have something like this without atomic instructions. |
libomptarget/deviceRTLs/nvptx/src/data_sharing.cu | ||
---|---|---|
461 | Do you realy need the loop here? Seems to me it is better to have it in else branch of the if (IsWarpMaster) statement |
libomptarget/deviceRTLs/nvptx/src/supporti.h | ||
---|---|---|
191 ↗ | (On Diff #171993) | Yes, leaving this in will lead to the whole malloc to be optimized out after inlining. |
This must be removed