This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Match initial thread pattern on AMDGPU
Abandoned · Public

Authored by jhuber6 on Jun 25 2021, 6:35 AM.

Details

Summary

The AAExecutionDomain check used to push globalized memory calls to
shared memory doesn't match the pattern AMDGPU generates, so the
optimization only works on NVPTX targets. This patch adds AMDGPU's
pattern to the check.

Depends on D102423

Diff Detail

Event Timeline

jhuber6 created this revision. · Jun 25 2021, 6:35 AM
jhuber6 requested review of this revision. · Jun 25 2021, 6:35 AM
Herald added a project: Restricted Project. · View Herald Transcript · Jun 25 2021, 6:35 AM

Ah, nice catch. I have not been paying enough attention to OpenMPOpt, this pattern will indeed miss on amdgcn.

__kmpc_amdgcn_gpu_num_threads is a library function because there is no corresponding intrinsic. I think each of the nvptx intrinsics has either a corresponding amdgcn intrinsic or a corresponding function call, but there might also be some things that are a scalar constant on one arch and a function returning a constant on the other.

For this patch, I'm wondering if we can use a single pattern, preceded by:
auto &&m_BlockSize = nvidia ? m_Intrinsic<Intrinsic::nvvm_read_ptx_sreg_ntid_x>() : m_Intrinsic<Intrinsic::some-amd-name>();

In the general case, I'd like the Opt layer to be more architecture agnostic than this. Could we insert functions at codegen like 'amdgpu_get_block_size', pattern match those in the IR opt, and lower them to the nvptx or amdgcn intrinsics later on?

> Ah, nice catch. I have not been paying enough attention to OpenMPOpt, this pattern will indeed miss on amdgcn.

> __kmpc_amdgcn_gpu_num_threads is a library function because there is no corresponding intrinsic. I think each of the nvptx intrinsics has either a corresponding amdgcn intrinsic or a corresponding function call, but there might also be some things that are a scalar constant on one arch and a function returning a constant on the other.

I was debating whether to add this as an intrinsic, or at least as an RTL function in OMPKinds.def, and just settled on this ugly string comparison.

> For this patch, I'm wondering if we can use a single pattern, preceded by:
> auto &&m_BlockSize = nvidia ? m_Intrinsic<Intrinsic::nvvm_read_ptx_sreg_ntid_x>() : m_Intrinsic<Intrinsic::some-amd-name>();

The patterns are slightly different beyond how the block size is found: AMD uses a constant bit mask, while Nvidia derives it from the warp size.

> In the general case, I'd like the Opt layer to be more architecture agnostic than this. Could we insert functions at codegen like 'amdgpu_get_block_size', pattern match those in the IR opt, and lower them to the nvptx or amdgcn intrinsics later on?

This is planned for when we switch over to the new device runtime library, where it will be a simple comparison of a TID function against zero. Right now it needs to make some convoluted calls to determine whether a thread is inside the "master warp" for the runtime library.

As part of D101976 I will change the matching to be target independent. This is not needed.

ormris removed a subscriber: ormris. · Jun 29 2021, 10:20 AM
jhuber6 abandoned this revision. · Jul 10 2021, 7:08 PM