The size of worker_rootS should have been DS_Max_Warp_Number.
This reduces memory usage by deviceRTL on AMDGPU from around 2.3GB
to around 770MB.
Repository: rG LLVM Github Monorepo
Event Timeline
This can't break nvptx as WARPSIZE == DS_Max_Warp_Number.
Can you share your reasoning for amdgcn? The control flow is difficult to follow in this area.
Other places need to be updated too: data_sharing_init_stack_common for one.
Do we actually use this array? What happens if you make it a single element and update the function above?
IIRC this was only used in the old data sharing scheme for nested parallelism on the GPU, see also http://lists.llvm.org/pipermail/openmp-dev/2018-September/002160.html. This seems to be gone since D83349, AFAICT the arrays are only initialized (via the various code paths) but never used.
It might be dead. Difficult to tell. The control flow being spread across codegen and the runtime obfuscates things.
I've convinced myself that the change proposed here is right, in that the accesses to the data structure are indexed by warp id, which is bounded by DS_Max_Warp_Number. On the basis that it's definitely NFC for nvptx and reduces an object from 2.4GB to 600MB on amdgcn without loss of functionality, I say we accept this as-is. It's a step in a good direction. If we can determine that all this stuff is dynamically dead and delete it, great. This patch won't obstruct that.
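To illustrate the bounding argument above, here is a minimal sketch of a per-warp array indexed by warp id. It is not the deviceRTL code: `WarpSlot`, `TeamState`, `warpIdOf`, `slotFor` and the `k`-prefixed constants are hypothetical names, and the values are the amdgcn numbers quoted in this review (wave size 64, 1024 threads).

```cpp
#include <cassert>
#include <cstdint>

// Illustrative constants only (assumed from the figures quoted in this review).
constexpr uint32_t kWarpSize = 64;                        // amdgcn wave size
constexpr uint32_t kMaxThreads = 1024;                    // max threads per team
constexpr uint32_t kMaxWarps = kMaxThreads / kWarpSize;   // 16 on amdgcn

// Hypothetical stand-in for whatever is stored per warp.
struct WarpSlot {
  void *stack_ptr = nullptr;
};

// One slot per warp: the array is sized by the warp count, not the warp size.
struct TeamState {
  WarpSlot slots[kMaxWarps];
};

// Warp id of a thread within its team.
constexpr uint32_t warpIdOf(uint32_t thread_id) {
  return thread_id / kWarpSize;
}

// Every access goes through the warp id, so kMaxWarps entries suffice.
inline WarpSlot &slotFor(TeamState &team, uint32_t thread_id) {
  const uint32_t warp = warpIdOf(thread_id);
  assert(warp < kMaxWarps && "warp id must stay below the warp count");
  return team.slots[warp];
}
```

Under these assumptions the highest thread id, 1023, maps to warp 15, so any entries beyond index 15 would never be touched.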
Somewhat related - __kmpc_spmd_kernel_init is always passed '0' for RequiresDataSharing, so the code in that function which writes to DataSharingState is dead. So we should drop the always-zero argument and remove the dead code. It's probably worth doing another pass over codegen looking for functions that are only ever called with the same constant and working through the dead code elimination that falls out.
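The cleanup described above can be sketched roughly as follows. This is a simplified model, not the real __kmpc_spmd_kernel_init signature: `State`, `initBefore`, `initAfter` and `checkEquivalent` are hypothetical names used only to show why a branch guarded by an always-zero argument can be removed together with the argument.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical runtime state touched by the init function.
struct State {
  int threads = 0;
  int data_sharing_initialized = 0;
};

// Before: every caller passes 0 for requires_data_sharing (assumed here),
// so the guarded write is never executed.
inline void initBefore(State &s, int thread_limit,
                       int16_t requires_data_sharing) {
  s.threads = thread_limit;
  if (requires_data_sharing) {
    s.data_sharing_initialized = 1; // dead when the argument is always 0
  }
}

// After: the argument and the dead branch are dropped.
inline void initAfter(State &s, int thread_limit) {
  s.threads = thread_limit;
}

// For the only value callers ever pass, both versions leave the same state.
inline void checkEquivalent() {
  State before, after;
  initBefore(before, 128, /*requires_data_sharing=*/0);
  initAfter(after, 128);
  assert(before.threads == after.threads);
  assert(before.data_sharing_initialized == after.data_sharing_initialized);
}
```

The same pattern applies to any function that codegen only ever calls with one constant value for a parameter.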
The only places where it was accessed are here and here. Jon's observation is correct. The maximum number of threads on both amdgcn and nvptx is 1024. However, on amdgcn the wave size is 64, so the maximum number of waves is 16, while on nvptx the warp size is 32 and the maximum number of warps is 32.
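The arithmetic behind those numbers, as a small sketch. `maxWarpsPerTeam` and the `k`-prefixed constants are hypothetical names for illustration, not deviceRTL identifiers. The values are the ones quoted in this comment.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: warps (or waves) needed to cover max_threads threads,
// rounding up in case the thread count is not a multiple of the warp size.
inline uint32_t maxWarpsPerTeam(uint32_t max_threads, uint32_t warp_size) {
  assert(warp_size > 0 && "warp size must be non-zero");
  return (max_threads + warp_size - 1) / warp_size;
}

// Figures quoted in this review.
constexpr uint32_t kMaxThreadsPerTeam = 1024;
constexpr uint32_t kAmdgcnWaveSize = 64;
constexpr uint32_t kNvptxWarpSize = 32;
```

With these inputs, `maxWarpsPerTeam(kMaxThreadsPerTeam, kAmdgcnWaveSize)` is 16 and `maxWarpsPerTeam(kMaxThreadsPerTeam, kNvptxWarpSize)` is 32. On nvptx the warp count equals the warp size, which is why the change has no effect there.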
Also, this array is not completely dead yet: generic mode uses it for locals allocation whenever the compiler is unable to properly detect whether the method is being executed in generic mode or SPMD mode (at least on amdgcn).
@jdoerfert thanks for noticing. Our local fork was already using DS_Max_Warp_Number in data_sharing_init_stack_common, so I will change it here as well. This will not impact nvptx.