This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add more utility functions for the GPU
ClosedPublic

Authored by jhuber6 on Apr 20 2023, 9:21 AM.

Details

Summary

This patch adds extra intrinsics for the GPU. Some of these are unused
for now but will be used later. We use these currently to update the
RPC handling. Currently, every thread can update the RPC client, which
isn't correct. This patch adds code neccesary to allow a single thread
to perfrom the write while the others wait.

Feedback is welcome for the naming of these functions. I'm copying the
OpenMP nomenclature where we call an AMD wavefront or NVIDIA warp a
lane.

Diff Detail

Event Timeline

jhuber6 created this revision.Apr 20 2023, 9:21 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptApr 20 2023, 9:21 AM
jhuber6 requested review of this revision.Apr 20 2023, 9:21 AM
jhuber6 updated this revision to Diff 515846.Apr 21 2023, 11:14 AM

Add 'LANE_SIZE' variable that can be used directly for instantiating arrays.

jhuber6 updated this revision to Diff 516155.Apr 23 2023, 4:34 AM

Fix activemask not using volatile causing a control-flow sensitive operation to be hoisted by the optimizer. Also add support for Volta warp syncs.

Using volatile to model convergence/divergence sounds wrong. Perhaps we're missing an intrinsic for activemask.

Using volatile to model convergence/divergence sounds wrong. Perhaps we're missing an intrinsic for activemask.

There isn't as far as I know, this is how CUDA in clang defines __activemaks()

inline __device__ unsigned int __activemask() {                                     
#if CUDA_VERSION < 9020                                                             
  return __nvvm_vote_ballot(1);                                                     
#else                                                                               
  unsigned int mask;                                                                
  asm volatile("activemask.b32 %0;" : "=r"(mask));                                  
  return mask;                                                                      
#endif                                                                              
}

The volatile is necessary because the activemask is sensitive to control flow so we can't change its location.

jhuber6 updated this revision to Diff 516458.Apr 24 2023, 10:27 AM

Update and ping

tra added inline comments.Apr 24 2023, 10:42 AM
libc/src/__support/GPU/nvptx/utils.h
77

32 bits are not sufficient to represent all threads, as block dimensions alone can take up to 63 bits. (31 bits for x, 16 for y,z).

jhuber6 added inline comments.Apr 24 2023, 10:45 AM
libc/src/__support/GPU/nvptx/utils.h
77

So, the individual thread_id_x() functions can be kept 32-bit but this one global one should be 64-bit?

jhuber6 updated this revision to Diff 516470.Apr 24 2023, 10:57 AM

Fix thread id return size.

tra accepted this revision.Apr 24 2023, 11:06 AM
tra added inline comments.
libc/src/__support/GPU/amdgpu/utils.h
73

wrong dimension label. Here and below.

This revision is now accepted and ready to land.Apr 24 2023, 11:06 AM
This revision was automatically updated to reflect the committed changes.