This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Add amdgpu sub archs
ClosedPublic

Authored by yaxunl on Apr 4 2018, 11:41 AM.

Diff Detail

Repository
rC Clang

Event Timeline

yaxunl created this revision.Apr 4 2018, 11:41 AM
rjmccall accepted this revision.Apr 4 2018, 11:49 AM

Thanks for splitting the patch up. LGTM.

This revision is now accepted and ready to land.Apr 4 2018, 11:49 AM
This revision was automatically updated to reflect the committed changes.
tra added a comment.Apr 5 2018, 10:10 AM

I didn't get a chance to review the patch before it got committed.

lib/Basic/Targets.cpp
161

Unless you're planning to guarantee 1:1 match to functionality provided by nvidia's sm_32, it would be prudent to use some other value for the macro so the source code has a way to tell these GPUs apart.

Another issue with this approach is that typical use pattern for CUDA_ARCH is
#if __CUDA_ARCH__ >= XXX. I don't expect that we'll always be able to maintain order across GPU architectures among NVIDIA and AMD GPUs. Perhaps for HIP compilation it would make more sense to define CUDA_ARCH as 1 (this should serve as a legacy indication of device-side compilation) and define HIP_ARCH to indicate which AMD GPU we're compiling for without accidentally enabling something that was intended for NVIDIA's GPUs only.

yaxunl added inline comments.Apr 5 2018, 11:19 AM
lib/Basic/Targets.cpp
161

I think let __CUDA_ARCH__==1 for amdgcn is reasonable and I can make that change.

On the other hand, I think it may be difficult to define __HIP_ARCH__ which can sort mixed nvptx/amdgcn GPU's by capability. I do think a well defined __HIP_ARCH__ would be useful for users. Just need some further discussion how to define it.

For now, if there are specific codes for nvptx, it can continue use __CUDA_ARCH__. If there are specific codes for amdgcn, it can check predefined amdgpu canonical names, e.g. __gfx803__, etc.

tra added inline comments.Apr 5 2018, 11:28 AM
lib/Basic/Targets.cpp
161

OK.

yaxunl added inline comments.Apr 6 2018, 7:41 AM
lib/Basic/Targets.cpp
161

I asked Ben Sander about whether we can define HIP_ARCH, here is his answer:

HIP targets a broader set of hardware than just a single vendor so additional flexibility in defining feature capability is required. The HIP_ARCH_ macro provide per-feature-granularity mechanism to query features. Also the code tends to be more clear as opposed to an "if __CUDA_ARCH>3 ..assume some feature".

For example

// 32-bit Atomics:
#define __HIP_ARCH_HAS_GLOBAL_INT32_ATOMICS__ (1)
#define __HIP_ARCH_HAS_GLOBAL_FLOAT_ATOMIC_EXCH__ (1)
#define __HIP_ARCH_HAS_SHARED_INT32_ATOMICS__ (1)
#define __HIP_ARCH_HAS_SHARED_FLOAT_ATOMIC_EXCH__ (1)
#define __HIP_ARCH_HAS_FLOAT_ATOMIC_ADD__ (1)

// 64-bit Atomics:
#define __HIP_ARCH_HAS_GLOBAL_INT64_ATOMICS__ (1)
#define __HIP_ARCH_HAS_SHARED_INT64_ATOMICS__ (0)

// Doubles
#define __HIP_ARCH_HAS_DOUBLES__ (1)

// warp cross-lane operations:
#define __HIP_ARCH_HAS_WARP_VOTE__ (1)
#define __HIP_ARCH_HAS_WARP_BALLOT__ (1)
#define __HIP_ARCH_HAS_WARP_SHUFFLE__ (1)
#define __HIP_ARCH_HAS_WARP_FUNNEL_SHIFT__ (0)

// sync
#define __HIP_ARCH_HAS_THREAD_FENCE_SYSTEM__ (1)
#define __HIP_ARCH_HAS_SYNC_THREAD_EXT__ (0)

// misc
#define __HIP_ARCH_HAS_SURFACE_FUNCS__ (0)
#define __HIP_ARCH_HAS_3DGRID__ (1)
#define __HIP_ARCH_HAS_DYNAMIC_PARALLEL__ (0)
tra added inline comments.Apr 6 2018, 9:58 AM
lib/Basic/Targets.cpp
161

I assume that will be handled somewhere else -- different patch, different place.
Looks like setting __CUDA_ARCH__ to 1 is all that should be done here.

While we're looking a this, is CUDA compatibility one of your goals? I.e. do you expect existing CUDA code to be compilable and functional on AMD hardware? If not, then, perhaps you don't need __CUDA_*__ defines at all.

yaxunl added inline comments.Apr 6 2018, 10:47 AM
lib/Basic/Targets.cpp
161

CUDA code needs to be translated to HIP code since the host API is different. In most cases the translation can be done by script automatically. __CUDA_ARCH__ cannot be automatically translated because it is not portable to non-nvptx devices, however it is often used to indicate device compilation. Therefore we still need to define it in HIP to indicate device compilation. In this way, CUDA programs using __CUDA_ARCH__ just for checking device compilation can be automatically translated. If users want to use features associated with specific __CUDA_ARCH__ they can manually modify the translated code to use __HIP_ARCH_HAS_* macros.

tra added inline comments.Apr 6 2018, 11:03 AM
lib/Basic/Targets.cpp
161

It sounds like this translation is a one-time offline process and HIP-mode compiler is not going to see any non-HIP code. If that's the case, I'm not quite sure I see the need for defining __CUDA_ARCH__ in HIP mode -- translation process should've converted the CUDA-specific macro in the original code to it's HIP equivalent or get user to part it to something HIP can deal with.

HIP programming guide also says that __CUDA_ARCH__ is undefined by hcc.

yaxunl added inline comments.Apr 6 2018, 11:17 AM
lib/Basic/Targets.cpp
161

Sorry I missed that. I will revert the change about macro __CUDA_ARCH__ and define __HIP_DEVICE_COMPILE__ instead. Thanks.