This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
openmp/libomptarget/deviceRTLs/common/
-
libomptarget/
-
deviceRTLs/
-
common/
-
omptarget.h
-
src/
-
data_sharing.cu

Differential D87084

[OpenMP][AMDGPU] Use DS_Max_Warp_Number instead of WARPSIZE
ClosedPublic

Authored by pdhaliwal on Sep 3 2020, 5:10 AM.

Download Raw Diff

Details

Reviewers

JonChesterfield
ABataev
jdoerfert
grokos

Commits

rG7634c64b6121: [OpenMP][AMDGPU] Use DS_Max_Warp_Number instead of WARPSIZE

Summary

The size of worker_rootS should have been DS_Max_Warp_Number.
This reduces memory usage by deviceRTL on AMDGPU from around 2.3GB
to around 770MB.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

pdhaliwal created this revision.Sep 3 2020, 5:10 AM

Herald added a project: Restricted Project. · View Herald TranscriptSep 3 2020, 5:10 AM

Herald added subscribers: openmp-commits, guansong, t-tye and 4 others. · View Herald Transcript

pdhaliwal requested review of this revision.Sep 3 2020, 5:10 AM

Herald added subscribers: sstefan1, wdng. · View Herald TranscriptSep 3 2020, 5:10 AM

Harbormaster completed remote builds in B70533: Diff 289691.Sep 3 2020, 5:35 AM

This can't break nvptx as WARPSIZE == DS_Max_Warp_Number.

Can you share your reasoning for amdgcn? The control flow is difficult to follow in this area

Other places need to be updated too: data_sharing_init_stack_common for one.

Do we actually use this array? What happens if you make it a single element and update the function above?

This revision now requires changes to proceed.Sep 3 2020, 8:25 AM

In D87084#2254700, @jdoerfert wrote:

Other places need to be updated too: data_sharing_init_stack_common for one.

Do we actually use this array? What happens if you make it a single element and update the function above?

IIRC this was only used in the old data sharing scheme for nested parallelism on the GPU, see also http://lists.llvm.org/pipermail/openmp-dev/2018-September/002160.html. This seems to be gone since D83349, AFAICT the arrays are only initialized (via the various code paths) but never used.

In D87084#2254937, @Hahnfeld wrote:

In D87084#2254700, @jdoerfert wrote:

Other places need to be updated too: data_sharing_init_stack_common for one.

Do we actually use this array? What happens if you make it a single element and update the function above?

IIRC this was only used in the old data sharing scheme for nested parallelism on the GPU, see also http://lists.llvm.org/pipermail/openmp-dev/2018-September/002160.html. This seems to be gone since D83349, AFAICT the arrays are only initialized (via the various code paths) but never used.

In that case, we should remove it.

It might be dead. Difficult to tell. The control flow being spread across codegen and the runtime obfuscates things.

I've convinced myself that the change proposed here is right, in that the access to the data structure are based on warp id, which is bounded by DS_Max_Warp_Number. On the basis that it's definitely a NFC for nvptx and reduces an object from 2.4gb to 600mb on amdgcn without loss of functionality, I say we accept this as-is. It's a step in a good direction. If we can determine that all this stuff is dynamically dead and delete it, great. This patch won't obstruct that.

Somewhat related - __kmpc_spmd_kernel_init is always passed '0' for RequiresDataSharing, so the code in that function which writes to DataSharingState is dead. So we should drop the always-zero argument and remove the dead code. It's probably worth doing another pass over codegen looking for functions that are only ever called with the same constant and working through the dead code elimination that falls out.

Only places where it was accessed are here and here. Jon's observation is correct. The maximum number of threads on both amdgcn and nvptx is 1024. However, on amdgcn, wave size is 64 and so maximum number of waves can be 16 and on nvptx, the warp size is 32 and maximum number of warps is 32.

And, this array is not completely dead yet as generic mode uses this for locals allocation whenever compiler is unable to properly detect if the method is being executed in generic mode or spmd mode (atleast on amdgcn).

@jdoerfert thanks for noticing, our local fork was already using DS_Max_Warp_Number in data_sharing_init_stack_common, I will change this here as well. This will not impact nvptx.

Updated data_sharing_stack_init_common

Harbormaster completed remote builds in B70621: Diff 289869.Sep 3 2020, 11:15 PM

LGTM

This revision is now accepted and ready to land.Sep 4 2020, 3:43 PM

This revision was landed with ongoing or failed builds.Sep 7 2020, 2:15 AM

Closed by commit rG7634c64b6121: [OpenMP][AMDGPU] Use DS_Max_Warp_Number instead of WARPSIZE (authored by pdhaliwal). · Explain Why

This revision was automatically updated to reflect the committed changes.

pdhaliwal added a commit: rG7634c64b6121: [OpenMP][AMDGPU] Use DS_Max_Warp_Number instead of WARPSIZE.

Revision Contents

Path

Size

openmp/

libomptarget/

deviceRTLs/

common/

omptarget.h

2 lines

src/

data_sharing.cu